If you work in news publishing, there’s a good chance you already use some type of tags in your published content.
After all, Parse.ly found that 70 percent of digital media publishers use tags and that was back in 2015. The number is surely even higher in 2021. But as our research has shown, different publishers adopt wildly different approaches to tagging content, with very different results.
So whether or not you already use tags, an overview of what tags are and what they can do might be useful. In this post, I’ll try to summarize not only that but also why using NLP (natural language processing) for tagging might be a good idea.
What are tags?
In the context of news content, tags are a type of metadata. A news article is usually accompanied by metadata, such as the post’s author, timestamp, and info about which of the site’s sections it belongs to (such as news, business, culture, etc.). Tags—sometimes also referred to as topics or keywords—can supplement them.
Before talking about tags themselves, it’s a good idea to briefly discuss the sections that make up a (news) site. Sections form the structure of a news site. The sections can be designed in-house or it can be based on an existing standard. There are several comprehensive classification systems, such as IPTC or IAB, which publishers can use or customize to suit their own needs.
These systems consist of a controlled vocabulary of categories organized into a hierarchy of topics. They comprise hundreds of categories, so it makes sense to choose the depth that suits your coverage. A newspaper that covers general news might only use top-level categories (e.g., politics, sport, or science), while a more specialized magazine will delve deeper into their particular field of interest. Sections are designed so that any article will fit into one or more of the pre-selected categories, which means that unless, I don’t know, a global pandemic hits, the number of sections used on a site is usually limited.
This makes perfect sense for the structure of a news site but it also means that articles about a variety of topics will end up in the same section. It’s great to know that an article belongs to, let’s say, “World politics,” but it’s not exactly enough to tell anyone what the article is about. Is it about the relationship between the US and Russia? Or elections in Germany? That’s where “keyword tags” come in.
“Keyword”, “topic”, or “subject matter” tags are the most commonly used tags. They allow for a fine-tuned classification of articles. Yet, other types of tags can be useful too—for instance, tags based on content type, sentiment, tone, length, etc. Even the so-called “keyword tags” can go into more detail than simple keywords. They can include more detailed information about the particular subject, such as whether it refers to a person or an organization and how it relates to other entities in the real world.
A great thing about tags is that they are metadata accompanying the article, which means they can be used without having to access the article itself. But what exactly can they be used for?
What are tags good for?
In general terms, tags allow for better analysis. Articles without tags are hard to analyze based solely on main categories like sport, news, or business. Specific tags allow for granular data analysis of readership engagement, improved archive organization, and more. If publishers are interested in readership analytics, they can use tags to measure which topics lead to the most page views, which have the best engagement rates and are getting shared the most, or, conversely, which are losing readers the fastest. Additionally, if your site has a paywall for certain content, tags can be useful for determining what should be free and what should only be accessible to paid subscribers.
Tags can also be useful for advertising. They can help match ads to article topics and avoid embarrassing situations when inappropriate ads are matched to sensitive content. Similarly, tagging sponsored content allows publishers to track performance metrics, and, if relevant, they can then share this data with advertisers. On top of that, tags help with search engine optimization (SEO). Tags create links between articles and give your site a structure that will allow search engines to have a better chance of making the same semantic connections.
Thinking about tags differently in terms of how they benefit readers and publishers means we can consider them as two different types of tags. Simpler, external tags for readers can be selected from more complex, internal tags. As I’ve already mentioned above, tags can contain additional information about the subject which can be used to create taxonomies and link related tags together. For instance, if your tags for Theresa May and Boris Johnson contain the information that they have both served as PMs of the UK, you can then use this information to include both their tags in the same category and in this way create more complex tag hierarchies and a better structure on your site and in your archive.
Choosing how to tag
Choosing the “right” tags for a news article is not as easy as it may seem. When a publisher doesn’t have a well-thought-out tagging strategy in place, this can lead to problems, as haphazardly applied tags make meaningful analysis much more difficult. Often, articles either miss important tags or have redundant or irrelevant tags. Also, the level of detail can vary greatly. There are some good guidelines that explain how to apply tags properly (you can check out one of these on this site), but even when you have a strategy in place, there may still be trouble ahead. Designing a strategy is one thing; sticking to it in practice is another.
Even with thorough checks, the fact remains that when journalists write tags manually, they are prone to making mistakes. Too often, manually applied tags suffer from duplicities and typos. When the Süddeutsche Zeitung took a look at their tags, they realized they had a lot of unnecessary duplicated tags. To take just one example, ‘“Chancellor Merkel,” “Chancellor Angela Merkel,” “German Chancellor Angela Merkel,” and “A Merkel” were all tags being used on SZ. Similarly, it’s typical that the same tag is often used in both the singular and plural form (“iPhone” vs. “iPhones”).
Additionally, as no two people are the same, journalists will often choose different tags for similar articles. We at Geneea have conducted our own research into this issue, which showed that the consistency between human journalists tagging the same articles is less than 20 percent. Different journalists use synonyms for the same concept or have different ideas about how specific tags should be. As a result, one journalist may use the tag “human rights,” while another uses “LGBT rights”; one will put “Middle East,” while another uses “Palestine” and “Israel,” and so on. All of these inconsistencies will obviously lead to issues in any analysis or anywhere else where tags are used.
It’s also worth mentioning that tagging is a task that many journalists loathe. So is there another way to identify tags that are consistent, relevant, unique, and contain important details? Fortunately, there is.
Tagging with the help of NLP
The best way to identify tags successfully is to use natural language processing (NLP). NLP software is an AI system trained to read texts like a human. Thanks to functions like part-of-speech tagging and parsing, the AI understands syntactic structures not only at the level of one sentence but also between sentences. This means it can decode the entity (whether it’s a person, an organization or anything else) that represents the main subject of an article—even when it’s only mentioned by name once. Similarly, it can give more weight to entities that serve as subjects rather than objects in the text. That way, it goes beyond simply counting how many times someone or something is mentioned in a text.
What’s more, thanks to named-entity recognition (NER), NLP software can distinguish between named and unnamed (general) entities and, at the same time, it assigns the named entities several types (such as the person, place, etc.). This classification can then be used to prioritize certain entities over others in tags, depending on the desired strategy. The strategy can vary for different types of articles, so, for instance, the software can favor people and organizations in political articles, products such as cars or mobile phones in technical articles, and general concepts in scientific articles. In a way, NLP AI works with the text like a human, but its advantage is that once it’s given a strategy, it will stick to it, which means it will always choose the same tags for the same article.
Another big advantage of NLP software is that it can be linked to a knowledge base. Thanks to the knowledge base, the AI recognizes one entity even when it’s referred to by different names and can always provide the exact same tag for it (to avoid the abovementioned issue with duplicates). The knowledge base can also be used to provide additional information about the tagged entities and their relations to others which makes tags more capable of providing better functions than simple keywords. At the same time, the AI can use the information it has to disambiguate different entities with the same name so that articles about Adam Scott, the American actor, won’t get confused with articles about Adam Scott, the Australian golfer.
These are but a few of the many advantages of NLP-assisted tagging, so it’s no surprise that more and more publishers are turning to some form of automated or semi-automated tagging. While some bigger media houses can afford to develop their own solution (like The New York Times), for many publishers this isn’t feasible. Fortunately, there are several companies today who can provide this service to publishers. Such a service should be easy to integrate into your CMS and flexible enough to fit your particular needs. Then it can help you get consistent, highly relevant tags and, at the same time, save your journalists time doing work they often consider tedious—time they can better spend doing things that only people can do (for now).
Linguistic Data Analyst, Geneea
Headquartered in Prague, Czech Republic, Geneea is an AI startup developing products for automatic text processing. They focus on advanced NLP and NLG solutions for media companies and for analysis of news in general. Their Media Assistant is an AI-based tool created specifically for journalists to support them in their work. The system automatically analyses the text of articles, and based on their content, it offers ranked suggestions of keywords, images, and related articles, taking into account internal rules and preferences of the media house.