How We Determined Which Disinformation Publishers Profit From Google’s Ad Systems

Our story “How Google’s Ad Business Funds Disinformation Around the World” found that, despite Google’s public commitments to fight disinformation, it continues to allow websites to use Google’s ad systems to profit from false and misleading content. Our reporting identified websites that were allowed to continue to collect revenue from Google ads, even on stories that appeared to be in violation of the company’s policies against unreliable and harmful claims related to COVID-19, health, elections and climate change. We also found that websites publishing misinformation in languages other than English, or in smaller markets, were more likely to be allowed to continue to profit from Google ads than similar English-language websites.

We analyzed datasets of articles and websites containing false claims to determine what proportion of them made money using Google’s ad platforms. We obtained these datasets from organizations that track online disinformation around the world and wrote software to determine whether a web address was currently earning money from Google ads. Between Aug. 23 and Sept. 13, 2022, we ran the datasets through this software system to calculate the proportion of web addresses monetizing with Google ads for each dataset. We include our detailed findings in Appendix A.

Data Sources

We analyzed 17 article and website datasets, totaling more than 13,000 active articles and over 8,000 domains, obtained from nine fact-checking and news quality monitoring organizations. Some of the datasets cover articles and websites from a particular country or region, while others cover subject matter, such as COVID-19 misinformation or climate change misinformation. In Appendix B we include a description of each dataset and the organizations that provided them.

Data Cleaning

The datasets varied in size, types of content and level of curation. We filtered all URL datasets to include only articles published after 2019 to keep the datasets recent and roughly within the same time frame. If the dataset provided information on the type of fact-checked content, we limited it to the most serious forms of disinformation or disinformation purveyors. For example, Brazil’s Netlab provided a column distinguishing between suspected and confirmed purveyors of disinformation, allowing us to select confirmed purveyors.

Some datasets included links to social media platforms, such as Facebook or Twitter. We excluded these links from our analysis. Some datasets also had links to images or PDFs, which we similarly excluded. See Appendix C for a full list of exclusions.

The datasets from the International Fact-Checking Network and Raskrinkavanje included articles that had been archived using a webpage archiving service such as archive.today. In these cases, we wrote programs to extract the original web addresses of the false or misleading articles. For the IFCN dataset, we extracted by hand any addresses that we could not extract by code. For Raskrinkavanje, we excluded from our final analysis any remaining links that could not be extracted. Links that could not be extracted accounted for less than 1% of the total webpages from the datasets. We do not have reason to believe these excluded links biased our results. See Appendix C for more detailed information.

Analyzing a Web Address

Our system to determine whether a web address was currently earning money with Google’s ad systems consists of two components: a web scraper and a data analysis script.

Web Scraper

A web scraper is software that can systematically extract and save data from a visited web page. ProPublica’s scraper uses a library called Playwright, which can mimic human behavior when visiting a site and is often used for automated website testing.

When our web scraper visits any web address, URL or base domain, it collects and saves the following information:

All network requests initiated by the webpage. Network requests are used to retrieve web content such as images, text and ads or to provide information such as user actions or profile information back to the web servers.
The response to each network request that went out to a Google server (one of a handful of servers we identified as serving, or related to, Google’s ad content). When successful, these responses contain ad content that the website loads onto the page.
The webpage content. Once the webpage loads, the scraper captures its HTML, the code that defines what a visitor to that page would see.
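
The three collection steps above can be sketched roughly as follows. This is a minimal illustration using Playwright’s Python API, not ProPublica’s actual scraper. The host list in GOOGLE_AD_HOSTS and the names is_google_ad_url and scrape are assumptions made for the example; the servers that were actually matched are not listed here.

```python
from urllib.parse import urlsplit

# Assumption: illustrative Google ad-serving hosts only. The article does not
# list the servers ProPublica actually matched against.
GOOGLE_AD_HOSTS = ("doubleclick.net", "googlesyndication.com")


def is_google_ad_url(url: str) -> bool:
    """Return True if the URL's host is, or is a subdomain of, a listed host."""
    host = (urlsplit(url).hostname or "").lower()
    return any(host == h or host.endswith("." + h) for h in GOOGLE_AD_HOSTS)


def scrape(url: str) -> dict:
    """Visit a page and save its requests, Google ad responses and HTML."""
    # Imported here so the helper above can be used without Playwright installed.
    from playwright.sync_api import sync_playwright

    request_urls, responses = [], []
    with sync_playwright() as p:
        browser = p.chromium.launch()
        page = browser.new_page()
        # Record every network request and response the page triggers.
        page.on("request", lambda req: request_urls.append(req.url))
        page.on("response", lambda resp: responses.append(resp))
        page.goto(url, wait_until="networkidle")
        html = page.content()  # the rendered page's HTML
        ad_responses = {}
        for resp in responses:
            if is_google_ad_url(resp.url):
                try:
                    ad_responses[resp.url] = resp.text()
                except Exception:
                    ad_responses[resp.url] = None  # body not available
        browser.close()
    return {"url": url, "requests": request_urls,
            "ad_responses": ad_responses, "html": html}
```

Matching on the host suffix rather than on a substring avoids counting unrelated URLs that merely mention an ad domain in their path.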

When our web scraper visits a base domain, the location at which an entire site resides, it also saves the following information:

The ads.txt file: The ads.txt file lists all of a website’s advertising partners. Not all websites make this file available to visitors, but it is highly recommended by Google and the IAB Tech Lab as a web advertising transparency best practice.
A random subpage: When visiting a website, the scraper will select an arbitrary subpage link found on the base domain (e.g. for test.com, test.com/morecontent) and also scrape the same information for that page. This is done to capture cases where the homepage for a website does not run ads, but sections of the website do.
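
As a rough sketch of these two base-domain steps, the code below picks an arbitrary same-site subpage from a homepage’s HTML and reads Google entries out of an ads.txt file. It uses only the Python standard library. The names LinkCollector, pick_subpage and google_ads_txt_entries are hypothetical, and the comma-separated layout assumed for ads.txt (domain, account ID, relationship) follows the IAB Tech Lab format; none of this is taken from ProPublica’s code.

```python
import random
from html.parser import HTMLParser
from urllib.parse import urljoin, urlsplit


class LinkCollector(HTMLParser):
    """Collect the href value of every <a> tag in a page."""

    def __init__(self):
        super().__init__()
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            href = dict(attrs).get("href")
            if href:
                self.hrefs.append(href)


def pick_subpage(base_url, html, rng=random):
    """Return an arbitrary same-host link that is not the homepage, or None."""
    parser = LinkCollector()
    parser.feed(html)
    base_host = urlsplit(base_url).hostname
    candidates = set()
    for href in parser.hrefs:
        absolute = urljoin(base_url, href)
        parts = urlsplit(absolute)
        if (parts.scheme in ("http", "https")
                and parts.hostname == base_host
                and parts.path not in ("", "/")):
            candidates.add(absolute)
    return rng.choice(sorted(candidates)) if candidates else None


def google_ads_txt_entries(ads_txt):
    """Return (account_id, relationship) for each google.com line in ads.txt."""
    entries = []
    for line in ads_txt.splitlines():
        line = line.split("#", 1)[0].strip()  # drop comments
        fields = [f.strip() for f in line.split(",")]
        if len(fields) >= 3 and fields[0].lower() == "google.com":
            entries.append((fields[1], fields[2].upper()))
    return entries
```

A Google entry in ads.txt only shows that a seller account is declared. Whether ads are actually being served still has to be established from the network traffic.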

Analysis Script

Our analysis tool processes the above data from each URL to determine whether the address is valid, and if so whether it is monetizing with Google’s ad systems.

We manually identified 10 separate network request and response pairs that indicate a webpage is making a request to a Google server for one or multiple ads. If the response did not contain advertising content, then we did not count the website as monetizing with Google. (This may occur, for example, if the webpage makes an ad request, but Google has demonetized the specific page or website.) We then wrote software that would look for these request-response pairs in the data collected by our web scraper.

We also identified scenarios where a scraper visit did not result in valid webpage content. These invalid visits can mean the scraper was redirected to a different page from the original page, the content at the web address is no longer available, or the server is no longer reachable.

Thus, for a single web address, there are three possible outcomes of the analysis:

The web address is valid, and it is monetizing with Google’s ad systems.
The web address is valid, but it is not monetizing with Google’s ad systems.
The web address is not valid or the content has been removed.
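
The three outcomes can be expressed as a small classification function. The sketch below is illustrative only. The shape of the visit record (status, redirected_elsewhere, ad_responses) is hypothetical, and treating any non-empty ad response body as “advertising content” is a simplifying assumption that stands in for the 10 request-response pairs described above.

```python
from enum import Enum


class Outcome(Enum):
    MONETIZING = "valid, monetizing with Google's ad systems"
    NOT_MONETIZING = "valid, not monetizing with Google's ad systems"
    INVALID = "not valid or content removed"


def classify(visit: dict) -> Outcome:
    """Map one saved visit to one of the three outcomes.

    Hypothetical fields:
      status               HTTP status of the page, or None if unreachable
      redirected_elsewhere True if the visit ended on a different page
      ad_responses         dict of Google ad request URL -> response body
    """
    if visit.get("status") != 200 or visit.get("redirected_elsewhere", False):
        return Outcome.INVALID
    # Assumption: a non-empty body counts as ad content. An ad request that
    # comes back empty (e.g. a demonetized page) does not count.
    has_ad_content = any(
        body and body.strip()
        for body in visit.get("ad_responses", {}).values()
    )
    return Outcome.MONETIZING if has_ad_content else Outcome.NOT_MONETIZING
```

Checking validity first keeps removed or unreachable pages out of the denominator when proportions are computed.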

We scraped and analyzed each web address in our 17 datasets to determine which of the three categories it fell under. We then compiled the results in a spreadsheet. Appendix A provides the detailed results of this analysis.

Verifying the Results

We hand-checked the results of all of the smaller domain datasets by visiting each page and determining the validity of its web address and whether the webpage was monetizing via Google’s ad systems. For the larger datasets containing individual webpages, we extracted and checked a random sample of web addresses by hand, using a 90% confidence level and 10% margin of error.
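
To give a sense of how large such a sample needs to be, the sketch below computes a sample size for a proportion at a given confidence level and margin of error. The formula used for the audit is not stated above, so this assumes the common Cochran formula with a finite population correction and the most conservative proportion, p = 0.5. The function name sample_size is hypothetical.

```python
import math
from statistics import NormalDist


def sample_size(population, confidence=0.90, margin=0.10, p=0.5):
    """Sample size for estimating a proportion in a finite population.

    Assumption: Cochran's formula with a finite population correction.
    """
    # Two-sided z-score for the confidence level (about 1.645 for 90%).
    z = NormalDist().inv_cdf(1 - (1 - confidence) / 2)
    n0 = z * z * p * (1 - p) / margin ** 2   # infinite-population size
    n = n0 / (1 + (n0 - 1) / population)     # finite population correction
    return math.ceil(n)
```

Under these assumptions, sample_size(9973) returns 68 and sample_size(1035) returns 64, so the required sample grows very slowly with the size of the dataset.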

The scraper and analysis tools were designed to make false positives (where we falsely flag a web address as monetizing with Google) very rare. In fact, we never identified a false positive during our audit. There were some instances where ads were displayed at the time of the scrape but not when we manually visited the page later on (or vice versa). In these cases, we manually examined the scraped data to confirm ad content was served at the time of the scrape. There were a few rare instances where content returned from the ad server was never loaded on the page, possibly because of coding errors on the webpage. We still counted these cases as positives, since they are indications of an active monetization relationship with Google.

False negatives (where the scraper did not find ads on the page but ads were present) were more common and arose in several scenarios. For example, the scraper was sometimes blocked from accessing a page or failed to bypass page pop-ups such as consent forms. In our audits we saw false negative rates of between 0% and 13%.

Because we found false negatives more often than false positives, the true proportion of these web addresses monetizing with Google’s ad systems is likely slightly higher than what we reported.

Appendix A: Detailed Findings

Dataset name	Data source	Languages covered	Regions covered	Domains or Web Pages	Number of valid web addresses analyzed	Number of valid web addresses monetizing Google ads	% of valid web addresses monetizing Google ads
Africa Check Misinformation Web Pages	Africa Check	English	Nigeria, South Africa, and Kenya	Web pages	66	38	57.6
Africa Check Misinformation Web Pages Senegal	Africa Check	French	Senegal, Guinea, Mali, Côte d'Ivoire, and Cameroon	Web pages	44	29	65.9
Balkans Misinformation Web Pages	Raskrinkavanje	Bosnian-Croatian-Serbian	Serbia, Croatia, Bosnia and Herzegovina	Web pages	9,973	6,216	62.3
Balkans Publishers	Raskrinkavanje	Bosnian-Croatian-Serbian	Serbia, Croatia, Bosnia and Herzegovina	Domains	30	26	86.7
Brazil Publishers	Netlab	Portuguese	Brazil	Domains	30	24	80
Latin American Publishers	Chequeado	Spanish, Portuguese	Argentina, Bolivia, Brazil, Colombia, Costa Rica, Cuba, Ecuador, Venezuela, Peru and Mexico	Domains	49	19	38.8
Covid Disinformation Pages	International Fact-Checking Network	Various	Global	Web pages	814	338	41.5
NewsGuard Publisher list	NewsGuard	Various	Global	Domains	7,739	4,186	54.1
Turkey Disinformation Pages	Teyit	Turkish	Turkey	Web pages	1,035	756	73
Turkey Publishers	Teyit	Turkish	Turkey	Domains	50	45	90
Spanish Language Publishers	EU DisinfoLab	Spanish	Spain	Domains	32	14	43.8
German Language Publishers	EU DisinfoLab	German	Germany, Austria and Switzerland	Domains	30	10	33.3
EU Disinformation Pages	EU DisinfoLab	Various	EU	Web pages	235	57	24.3
Climate Disinformation Pages	Science Feedback	Various	Global	Web pages	427	86	20.1
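
The final column of the table is the number of monetizing addresses divided by the number of valid addresses, expressed as a percentage and rounded to one decimal place. A minimal sketch of that arithmetic, with the hypothetical helper name pct_monetizing:

```python
def pct_monetizing(valid: int, monetizing: int) -> float:
    """Share of valid web addresses monetizing with Google ads, in percent."""
    if valid <= 0:
        raise ValueError("a dataset needs at least one valid web address")
    return round(100 * monetizing / valid, 1)


# Rows taken from the table above: (valid, monetizing).
rows = {
    "Africa Check Misinformation Web Pages": (66, 38),
    "Balkans Misinformation Web Pages": (9973, 6216),
    "Latin American Publishers": (49, 19),
}
for name, (valid, monetizing) in rows.items():
    print(name, pct_monetizing(valid, monetizing))
```

Invalid addresses are left out of the denominator, so the percentage describes only pages that were still reachable at the time of the visit.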

Appendix B: Organization and Dataset details

All datasets were filtered to remove duplicates, archived URLs that could not be successfully unarchived, data before 2019 and URLs from social media sites such as Facebook, Twitter, Weibo, Pinterest, Telegram and WhatsApp (see full list in Appendix C).

Africa Check

Website: https://africacheck.org/

Description: Africa Check is an African nonprofit fact-checking organization founded in South Africa in 2012.

Datasets analyzed:

Articles in French from Senegal, Guinea, Mali, Côte d’Ivoire and Cameroon between 2019 and 2022 fact-checked and determined to be misinformation.
Articles in English from Nigeria, South Africa and Kenya between 2019 and 2022 fact-checked and determined to be misinformation.

Raskrinkavanje

Website: https://raskrinkavanje.ba/

Description: Raskrinkavanje is a fact-checking program for media organizations in the Balkans. It was founded in 2017 by Zašto ne, a civil society organization based in Bosnia and Herzegovina.

Datasets analyzed:

Articles from the region between 2019 and July 2022 that were fact-checked by Raskrinkavanje and determined to be misinformation.
Thirty websites that were most frequently identified as publishing misinformation by Raskrinkavanje in the region from 2019 to July 2022.

Netlab

Website: https://www.netlab.eco.ufrj.br/

Description: Netlab is a research laboratory of the School of Communication of the Federal University of Rio de Janeiro (UFRJ) that uses network analysis to study online misinformation.

Datasets analyzed:

A list of websites shared within Brazilian right-wing and left-wing WhatsApp and Telegram groups and channels in August 2022 and flagged by researchers as a source of disinformation in Portuguese.

Chequeado

Website: https://chequeado.com/

Description: Chequeado is a nonpartisan, nonprofit news monitoring and fact-checking organization founded in Argentina in 2010.

Datasets analyzed:

Websites determined by LatamChequea, Chequeado’s fact-checking partners in Latin America, to be spreading false information.

International Fact-Checking Network

Website: https://www.poynter.org/ifcn/

Description: The International Fact-Checking Network is a network of 100 fact-checking organizations around the world. It was launched in 2015 by the Poynter Institute, a nonprofit journalism institute based in St. Petersburg, Florida.

Datasets analyzed:

COVID: links to social media and news content spreading misinformation about the COVID-19 pandemic.

NewsGuard

Website: https://www.newsguardtech.com/

Description: NewsGuard is a company that provides trust ratings for the most visited websites in the U.S., U.K., Canada, Germany, France and Italy.

Datasets analyzed:

Domains for news websites around the world rated by NewsGuard. Reliability ratings range from 0 to 100 (0 being completely untrustworthy).

Teyit

Website: https://teyit.org/

Description: Teyit is a Turkish nonprofit fact-checking and media literacy social enterprise founded in 2016.

Datasets analyzed:

Articles that were published in 2019 or later that contained claims categorized as “incorrect association,” “manipulation,” or “distortion” and which the fact-checkers had not seen subsequently corrected. (Fact-checkers provided access to a database containing a wide range of thousands of fact-checks which ProPublica filtered based on the previous criteria.)

EU DisinfoLab

Website: https://www.disinfo.eu/

Description: EU DisinfoLab is a Brussels-based nonprofit organization that studies misinformation in the EU.

Datasets analyzed:

Articles from the region between 2019 and present that were fact-checked by EU DisinfoLab and determined to be misinformation.
Websites from Spain and German-speaking countries that were identified as sources of false and misleading claims in the regions.

Science Feedback

Website: https://sciencefeedback.co/

Description: Science Feedback is a nonprofit based in France that produces scientist-expert fact-checks for health and climate news articles.

Datasets analyzed:

Articles related to climate and climate change published in 2019 or later to which Science Feedback gave its lowest rating, “False.”

Appendix C: Dataset Cleaning Criteria

All datasets were cleaned with the intention of removing invalid links, social media traffic, archived content and images/PDFs.

Any links originating from the below social media or content hosting sites were removed from the final analysis.

Google Drive
Facebook
Instagram
Pinterest
Telegram
TikTok
Twitter
Vimeo
Weibo
WhatsApp
YouTube

Any links ending in any of the below were automatically excluded from the final analysis:

.png
.jpg
.jpeg
.pdf
?type=image
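
The two exclusion rules above could be applied with a filter along these lines. This is a sketch, not the cleaning code that was actually run. The hostnames in EXCLUDED_HOSTS are assumptions that map each listed site to its usual domain, and should_exclude is a hypothetical name. The suffixes come from the list above.

```python
from urllib.parse import urlsplit

# Assumption: the usual domains for the social media and hosting sites listed.
EXCLUDED_HOSTS = (
    "drive.google.com", "facebook.com", "instagram.com", "pinterest.com",
    "t.me", "tiktok.com", "twitter.com", "vimeo.com", "weibo.com",
    "whatsapp.com", "youtube.com", "youtu.be",
)

# Link endings excluded from the analysis, as listed above.
EXCLUDED_SUFFIXES = (".png", ".jpg", ".jpeg", ".pdf", "?type=image")


def should_exclude(url: str) -> bool:
    """Return True for social media or hosting links and image or PDF links."""
    host = (urlsplit(url).hostname or "").lower()
    if any(host == h or host.endswith("." + h) for h in EXCLUDED_HOSTS):
        return True
    # "Ending in" is checked on the full link, ignoring letter case.
    return url.lower().endswith(EXCLUDED_SUFFIXES)
```

For example, should_exclude("https://www.facebook.com/some/post") is true, while an ordinary article link on a news site is kept.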

Links to any of the archiving or redirect services below were visited and an attempt was made to extract the original URL. If the extraction failed or the extracted link was of a type that should be excluded from the final analysis anyway, the URL was discarded.

Web.archive.org
Webcache.googleusercontent.com
Archive.today
google.com/url?
perma.cc
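
For two of these services the original address is embedded in the link itself, so it can be recovered without visiting the page. The sketch below handles only those two cases: web.archive.org links, which carry the original URL after a timestamp, and google.com/url? redirects, which carry it in a query parameter. The function name unwrap is hypothetical. Links from the other services require loading the archived page and return None here.

```python
import re
from urllib.parse import parse_qs, urlsplit

# web.archive.org links look like /web/<timestamp>/<original URL>.
WAYBACK = re.compile(
    r"^https?://web\.archive\.org/web/[^/]+/(https?://.+)$", re.IGNORECASE
)


def unwrap(url: str):
    """Return the original URL embedded in an archive or redirect link, or None."""
    match = WAYBACK.match(url)
    if match:
        return match.group(1)
    parts = urlsplit(url)
    host = (parts.hostname or "").lower()
    if host in ("google.com", "www.google.com") and parts.path == "/url":
        query = parse_qs(parts.query)  # also decodes percent-encoding
        for key in ("q", "url"):
            if query.get(key):
                return query[key][0]
    return None  # extraction failed; such links were discarded
```

An extracted URL would still need to pass the exclusion rules above before being analyzed.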
