Field guide to tag hunting: The hidden world of Internet data collection

What’s going on and why it’s getting worse

The above picture is among my most prized possessions, a snapshot of how wild and woolly the world of Internet data collection has become.

Beginning in 2013, I’ve spent thousands of hours scanning millions of websites to better understand how data is collected and visitors are tracked. This journey started when I launched my data start-up Mezzobit, but it soon became an obsession, consuming many nights and weekends researching the companies behind these activities, resulting in a database of nearly 4,000 firms.

Facebook and its missteps with Cambridge Analytica may represent an extreme case of collecting and tracking, but that’s only the tip of the iceberg. About $250B in global digital advertising spend and nearly $3 trillion in e-commerce revenue are powered by these activities. But implementation of GDPR in Europe at month’s end — combined with the Sturm und Drang surrounding Facebook — has started exposing what’s really happening behind the scenes.

The worst scan that I’ve ever seen is shown in that graphic, with each dot representing one or more JavaScript tags, tracking pixels, ads and other third-party calls in a single page. The unlucky publisher, whose identity I conceal out of pity, is a well-known U.S. news site with a fairly promiscuous attitude — 30x worse than average — towards bringing external code into its pages.

Lest you start reaching for pearls to clutch, data collection and tracking aren’t going away, nor should they. The Internet that we know and love, plus trillions of dollars of created value, wouldn’t be possible without them.

But the industry, consumers and regulators are rapidly realizing that a line has been crossed, as this data feeding frenzy degrades user experience, causes compliance violations, and sucks untold value from unsuspecting online trading partners.

It’s time to rethink and reboot, to find a happy medium where profits can continue in parallel with enhanced transparency, accountability and respect for consumer privacy.

But first, what’s really going on

For simplicity’s sake, we’ll refer to site operators as publishers, but the same thing happens to a degree for sites run by e-commerce companies, brands, governments, non-profits and anyone else with a digital presence. Similarly, we’ll use the word tag to refer to any object called by a webpage, even though there are hundreds of different types of code that could be involved.

A web page may appear monolithic to the visitor, with everything you see seeming to come from the servers of the site operator, but that’s rarely how it works.

Think about the typical functions of the average webpage and nearly all of them have some third-party involvement. Harnessing the high-tech industry has been liberating for website operators, allowing them to quickly tap into rich capabilities at low or no cost, but it also transforms every web page into an Internet flash mob.
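To make the idea of third-party tags concrete, here is a minimal sketch in Python that lists the outside hosts a page's HTML refers to. It is an illustration only, not the scanning method used for the research described here: the class and function names, the sample markup and the example hostnames are all made up, and it looks only at the static HTML.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin, urlparse


class TagCollector(HTMLParser):
    """Collect URLs referenced by script, img, iframe and link elements."""

    # Element name -> attribute that holds the URL (illustrative subset).
    WATCHED = {"script": "src", "img": "src", "iframe": "src", "link": "href"}

    def __init__(self, page_url):
        super().__init__()
        self.page_url = page_url
        self.urls = []

    def handle_starttag(self, tag, attrs):
        attr_name = self.WATCHED.get(tag)
        if attr_name is None:
            return
        value = dict(attrs).get(attr_name)
        if value:
            # Resolve relative and protocol-relative URLs against the page URL.
            self.urls.append(urljoin(self.page_url, value))


def third_party_hosts(page_url, html):
    """Return the sorted hosts that differ from the page's own host."""
    collector = TagCollector(page_url)
    collector.feed(html)
    own_host = urlparse(page_url).hostname
    hosts = {urlparse(u).hostname for u in collector.urls}
    return sorted(h for h in hosts if h and h != own_host)


# Hypothetical page: one first-party script plus three outside calls.
SAMPLE_HTML = """
<html><head>
<script src="/js/site.js"></script>
<script src="https://analytics.example.net/collect.js"></script>
</head><body>
<img src="//pixel.example.org/p.gif?id=1">
<iframe src="https://ads.example.com/slot"></iframe>
</body></html>
"""

print(third_party_hosts("https://news.example.test/story", SAMPLE_HTML))
# → ['ads.example.com', 'analytics.example.net', 'pixel.example.org']
```

This simple version compares exact hostnames, so a publisher's own subdomains would be counted as outsiders, and it cannot see tags that other scripts inject once the page is running in a browser. A real scan would need to load the page and watch the network requests it makes.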

