Our story “How Google’s Ad Business Funds Disinformation Around the World” found that, despite Google’s public commitments to fight disinformation, it continues to allow websites to use Google’s ad systems to profit from false and misleading content. Our reporting identified websites that were allowed to continue to collect revenue from Google ads, even on stories that appeared to be in violation of the company’s policies against unreliable and harmful claims related to COVID-19, health, elections and climate change. We also found that websites containing misinformation in languages other than English and in smaller markets were more likely to be allowed to continue to profit from Google ads than similar English-language websites.
We analyzed datasets of articles and websites containing false claims to determine what proportion of them made money using Google’s ad platforms. We obtained these datasets from organizations that track online disinformation around the world and wrote software to determine whether a web address was currently earning money from Google ads. Between Aug. 23 and Sept. 13, 2022, we ran the datasets through this software system to calculate the proportion of web addresses monetizing with Google ads for each dataset. We include our detailed findings in Appendix A.
Data Sources
We analyzed 17 article and website datasets, totaling more than 13,000 active articles and over 8,000 domains, obtained from nine fact-checking and news quality monitoring organizations. Some of the datasets cover articles and websites from a particular country or region, while others cover subject matter, such as COVID-19 misinformation or climate change misinformation. In Appendix B we include a description of each dataset and the organizations that provided them.
Data Cleaning
The datasets varied in size, types of content and level of curation. We filtered all URL datasets to include only articles published in 2019 or later to keep the datasets recent and roughly within the same time frame. If the dataset provided information on the type of fact-checked content, we limited it to the most serious forms of disinformation or disinformation purveyors. For example, Brazil’s Netlab provided a column distinguishing between suspected and confirmed purveyors of disinformation, allowing us to select confirmed purveyors.
Some datasets included links to social media platforms, such as Facebook or Twitter. We excluded these links from our analysis. Some datasets also had links to images or PDFs, which we similarly excluded. See Appendix C for a full list of exclusions.
The datasets from the International Fact-Checking Network and Raskrinkavanje included articles that had been archived using a webpage archiving service such as archive.today. In these cases, we wrote programs to extract the original web addresses of the false or misleading articles. For the IFCN dataset, we extracted by hand any addresses that we could not extract by code. For Raskrinkavanje, we excluded from our final analysis any remaining links that could not be extracted. Links that could not be extracted accounted for less than 1% of the total webpages from the datasets. We do not have reason to believe these excluded links biased our results. See Appendix C for more detailed information.
Analyzing a Web Address
Our system to determine whether a web address was currently earning money with Google’s ad systems consists of two components: a web scraper and a data analysis script.
Web Scraper
A web scraper is software that can systematically extract and save data from a visited web page. ProPublica’s scraper uses a library called Playwright, which can mimic human behavior when visiting a site and is often used for automated website testing.
When our web scraper visits any web address, URL or base domain, it collects and saves the following information:
- All network requests initiated by the webpage. Network requests are used to retrieve web content such as images, text and ads or to provide information such as user actions or profile information back to the web servers.
- The response for each network request, if those requests went out to Google servers (a handful of servers we identified as serving or related to Google’s ad content). When successful, these responses contain ad content that the website loads onto the page.
- The webpage content. Once the webpage loads, the scraper captures its HTML, the code that defines what a visitor to that page would see.
- The ads.txt file: The ads.txt file lists all of a website’s advertising partners. Not all websites make this file available to visitors, but it is highly recommended by Google and the IAB Tech Lab as a web advertising transparency best practice.
- A random subpage: When visiting a website, the scraper will select an arbitrary subpage link found on the base domain (e.g. for test.com, test.com/morecontent) and also scrape the same information for that page. This is done to capture cases where the homepage for a website does not run ads, but sections of the website do.
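The collection steps listed above can be sketched roughly as follows. This is an illustrative outline using Playwright's Python API, not ProPublica's actual code: the host list in `AD_HOST_SUFFIXES`, the function names and the shape of the returned dictionary are all assumptions, and fetching ads.txt and a random subpage is left out for brevity.

```python
from urllib.parse import urlparse

# Assumption: illustrative ad-related hosts only. The article does not name
# the servers that were actually monitored.
AD_HOST_SUFFIXES = ("doubleclick.net", "googlesyndication.com", "googleadservices.com")


def is_ad_host(url: str) -> bool:
    """Return True if the URL's host is, or is a subdomain of, a listed host."""
    host = (urlparse(url).hostname or "").lower()
    return any(host == s or host.endswith("." + s) for s in AD_HOST_SUFFIXES)


def capture_page(url: str, timeout_ms: int = 30_000) -> dict:
    """Visit a URL and record request URLs, ad-host responses and final HTML."""
    # Imported lazily so is_ad_host() can be used without Playwright installed.
    from playwright.sync_api import sync_playwright

    request_urls, responses = [], []

    with sync_playwright() as p:
        browser = p.chromium.launch(headless=True)
        page = browser.new_page()
        # Record every network request the page initiates.
        page.on("request", lambda request: request_urls.append(request.url))
        # Keep response objects; bodies are read after the page has loaded.
        page.on("response", lambda response: responses.append(response))
        page.goto(url, wait_until="networkidle", timeout=timeout_ms)
        html = page.content()
        final_url = page.url

        ad_responses = []
        for response in responses:
            if not is_ad_host(response.url):
                continue
            try:
                body = response.text()
            except Exception:
                body = ""  # some responses (e.g. redirects) have no readable body
            ad_responses.append(
                {"url": response.url, "status": response.status, "body": body}
            )
        browser.close()

    return {
        "requested_url": url,
        "final_url": final_url,
        "request_urls": request_urls,
        "ad_responses": ad_responses,
        "html": html,
    }
```

Calling `capture_page("https://example.com")` requires Playwright and a Chromium build to be installed. Comparing `requested_url` with `final_url` is one way a later step could detect redirects.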
Analysis Script
Our analysis tool processes the above data from each URL to determine whether the address is valid, and if so whether it is monetizing with Google’s ad systems.
We manually identified 10 separate network request and response pairs that indicate a webpage is making a request to a Google server for one or multiple ads. If the response did not contain advertising content, then we did not count the website as monetizing with Google. (This may occur, for example, if the webpage makes an ad request, but Google has demonetized the specific page or website.) We then wrote software that would look for these request-response pairs in the data collected by our web scraper.
We also identified scenarios where a scraper visit did not result in valid webpage content. These invalid visits can mean the scraper was redirected to a different page from the original page, the content at the web address is no longer available, or the server is no longer reachable.
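As a rough illustration of the request-response matching described above, the sketch below flags a page only when an ad request is paired with a response that contains content. The two patterns in `SIGNATURES` are hypothetical placeholders, not the 10 pairs ProPublica identified, and the `Exchange` type is an assumed data shape.

```python
import re
from dataclasses import dataclass


@dataclass
class Exchange:
    """One network request and the body of its response (assumed shape)."""
    request_url: str
    response_body: str


# Hypothetical signatures: (pattern for the request URL, pattern that the
# response body must match to count as ad content).
SIGNATURES = [
    (re.compile(r"doubleclick\.net/gampad/ads"), re.compile(r"<(iframe|img|a)\b", re.I)),
    (re.compile(r"googlesyndication\.com/pagead/"), re.compile(r"<(iframe|img|a)\b", re.I)),
]


def served_ad(exchange: Exchange) -> bool:
    """True if the request matches a signature AND the response has ad markup."""
    return any(
        req.search(exchange.request_url) and resp.search(exchange.response_body)
        for req, resp in SIGNATURES
    )


def is_monetizing(exchanges: list) -> bool:
    """A page counts as monetizing if any exchange delivered ad content."""
    return any(served_ad(e) for e in exchanges)
```

Requiring both halves of the pair is what keeps an unanswered ad request, such as one from a demonetized page, from being counted as a positive.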
Thus, for a single web address, there are three possible outcomes of the analysis:
- The web address is valid, and it is monetizing with Google’s ad systems.
- The web address is valid, but it is not monetizing with Google’s ad systems.
- The web address is not valid or the content has been removed.
We scraped and analyzed each web address in our 17 datasets to determine which of the three categories it fell under. We then compiled the results in a spreadsheet. Appendix A provides the detailed results of this analysis.
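The three outcomes and the resulting percentage can be expressed in a few lines. This is a minimal sketch with assumed names (`Outcome`, `classify`, `percent_monetizing`). It assumes the percentage is taken over valid addresses only, which matches the column headings of the results table.

```python
from collections import Counter
from enum import Enum


class Outcome(Enum):
    MONETIZING = "valid, monetizing with Google ads"
    NOT_MONETIZING = "valid, not monetizing with Google ads"
    INVALID = "not valid or content removed"


def classify(is_valid: bool, has_google_ad: bool) -> Outcome:
    """Map the two checks onto one of the three possible outcomes."""
    if not is_valid:
        return Outcome.INVALID
    return Outcome.MONETIZING if has_google_ad else Outcome.NOT_MONETIZING


def percent_monetizing(outcomes) -> float:
    """Percentage of valid addresses that monetize, rounded to one decimal."""
    counts = Counter(outcomes)
    valid = counts[Outcome.MONETIZING] + counts[Outcome.NOT_MONETIZING]
    if valid == 0:
        raise ValueError("no valid web addresses to summarize")
    return round(100 * counts[Outcome.MONETIZING] / valid, 1)


# 38 monetizing out of 66 valid addresses, as reported for the English-language
# Africa Check dataset. The 5 invalid addresses are made up for illustration.
example = (
    [Outcome.MONETIZING] * 38
    + [Outcome.NOT_MONETIZING] * 28
    + [Outcome.INVALID] * 5
)
print(percent_monetizing(example))  # 57.6
```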
Verifying the Results
We hand-checked the results of all of the smaller domain datasets by visiting each page and determining the validity of its web address and whether the webpage was monetizing via Google’s ad systems. For the larger datasets containing individual webpages, we extracted and checked a random sample of web addresses by hand, using a 90% confidence level and 10% margin of error.
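The article does not state which formula was used to size the random samples. One common choice, assumed here, is Cochran's formula with a finite population correction and the most conservative proportion (p = 0.5). The function names and the fixed random seed are illustrative.

```python
import math
import random
from statistics import NormalDist


def sample_size(population: int, confidence: float = 0.90,
                margin: float = 0.10, p: float = 0.5) -> int:
    """Cochran's sample size with finite population correction (assumed method)."""
    # Two-sided z-score, e.g. about 1.645 for 90% confidence.
    z = NormalDist().inv_cdf(1 - (1 - confidence) / 2)
    n0 = (z ** 2) * p * (1 - p) / margin ** 2
    n = n0 / (1 + (n0 - 1) / population)
    return math.ceil(n)


def draw_sample(urls: list, seed: int = 0) -> list:
    """Draw a reproducible random sample of the required size."""
    return random.Random(seed).sample(urls, sample_size(len(urls)))


print(sample_size(9973))  # 68
print(sample_size(814))   # 63
```

Under these assumptions, a dataset of 9,973 pages would need a hand-checked sample of 68 and a dataset of 814 pages a sample of 63.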
The scraper and analysis tools were designed to make false positives (where we falsely flag a web address as monetizing with Google) very rare. In fact, we never identified a false positive during our audit. There were some instances where ads were displayed at the time of the scrape but not when we manually visited the page later on (or vice versa). In these cases, we manually examined the scraped data to confirm ad content was served at the time of the scrape. There were a few rare instances where content returned from the ad server was never loaded on the page, possibly because of coding errors on the webpage. We still counted these cases as positives, since they are indications of an active monetization relationship with Google.
False negatives (where the scraper did not find ads on the page but ads were present) were more common and had several causes. For example, the scraper was sometimes blocked from accessing a page or failed to bypass page pop-ups such as consent forms. In our audits we saw false negative rates of between 0% and 13%.
Because we found false negatives more often than false positives, the true proportion of these web addresses monetizing with Google’s ad systems is likely slightly higher than what we reported.
Appendix A: Detailed Results
Dataset name | Data source | Languages covered | Regions covered | Domains or Web Pages | Number of valid web addresses analyzed | Number of valid web addresses monetizing Google ads | % of valid web addresses monetizing Google ads |
Africa Check Misinformation Web Pages | Africa Check | English | Nigeria, South Africa, and Kenya | Web pages | 66 | 38 | 57.6 |
Africa Check Misinformation Web Pages Senegal | Africa Check | French | Senegal, Guinea, Mali, Côte d'Ivoire, and Cameroon | Web pages | 44 | 29 | 65.9 |
Balkans Misinformation Web Pages | Raskrinkavanje | Bosnian-Croatian-Serbian | Serbia, Croatia, Bosnia and Herzegovina | Web pages | 9,973 | 6,216 | 62.3 |
Balkans Publishers | Raskrinkavanje | Bosnian-Croatian-Serbian | Serbia, Croatia, Bosnia and Herzegovina | Domains | 30 | 26 | 86.7 |
Brazil Publishers | Netlab | Portuguese | Brazil | Domains | 30 | 24 | 80 |
Latin American Publishers | Chequeado | Spanish, Portuguese | Argentina, Bolivia, Brazil, Colombia, Costa Rica, Cuba, Ecuador, Venezuela, Peru and Mexico | Domains | 49 | 19 | 38.8 |
Covid Disinformation Pages | International Fact-Checking Network | Various | Global | Web pages | 814 | 338 | 41.5 |
NewsGuard Publisher list | NewsGuard | Various | Global | Domains | 7,739 | 4,186 | 54.1 |
Turkey Disinformation Pages | Teyit | Turkish | Turkey | Web pages | 1,035 | 756 | 73 |
Turkey Publishers | Teyit | Turkish | Turkey | Domains | 50 | 45 | 90 |
Spanish Language Publishers | EU DisinfoLab | Spanish | Spain | Domains | 32 | 14 | 43.8 |
German Language Publishers | EU DisinfoLab | German | Germany, Austria and Switzerland | Domains | 30 | 10 | 33.3 |
EU Disinformation Pages | EU DisinfoLab | Various | EU | Web pages | 235 | 57 | 24.3 |
Climate Disinformation Pages | Science Feedback | Various | Global | Web pages | 427 | 86 | 20.1 |
Appendix B: Organization and Dataset details
All datasets were filtered to remove duplicates, archived URLs that could not be successfully unarchived, content published before 2019 and URLs from social media sites such as Facebook, Twitter, Weibo, Pinterest, Telegram and WhatsApp (see full list in Appendix C).
Africa Check
Website: https://africacheck.org/
Description: Africa Check is an African nonprofit fact-checking organization founded in South Africa in 2012.
Datasets analyzed:
- Articles in French from Senegal, Guinea, Mali, Côte d’Ivoire and Cameroon between 2019 and 2022 fact-checked and determined to be misinformation.
- Articles in English from Nigeria, South Africa and Kenya between 2019 and 2022 fact-checked and determined to be misinformation.
Raskrinkavanje
Website: https://raskrinkavanje.ba/
Description: Raskrinkavanje is a fact-checking program for media organizations in the Balkans. It was founded in 2017 by Zašto ne, a civil society organization based in Bosnia and Herzegovina.
Datasets analyzed:
- Articles from the region between 2019 and July 2022 that were fact-checked by Raskrinkavanje and determined to be misinformation.
- Thirty websites that were most frequently identified as publishing misinformation by Raskrinkavanje in the region from 2019 to July 2022.
Netlab
Website: https://www.netlab.eco.ufrj.br/
Description: Netlab is a research laboratory of the School of Communication of the Federal University of Rio de Janeiro (UFRJ) that uses network analysis to study online misinformation.
Datasets analyzed:
- A list of websites shared within Brazilian right wing and left wing WhatsApp and Telegram groups and channels in August 2022 and flagged by researchers as a source of disinformation in Portuguese.
Chequeado
Website: https://chequeado.com/
Description: Chequeado is a nonpartisan, nonprofit news monitoring and fact-checking organization founded in Argentina in 2010.
Datasets analyzed:
- Websites determined by LatamChequea, Chequeado’s fact-checking partners in Latin America, to be spreading false information.
International Fact-Checking Network
Website: https://www.poynter.org/ifcn/
Description: The International Fact-Checking Network is a network of 100 fact-checking organizations around the world. It was launched in 2015 by the Poynter Institute, a nonprofit journalism institute based in St. Petersburg, Florida.
Datasets analyzed:
- COVID: links to social media and news content spreading misinformation about the COVID-19 pandemic.
NewsGuard
Website: https://www.newsguardtech.com/
Description: NewsGuard is a company that provides trust ratings for the most visited websites in the U.S., U.K., Canada, Germany, France and Italy.
Datasets analyzed:
- Domains for news websites around the world rated by NewsGuard. Reliability ratings range from 0 to 100 (0 being completely untrustworthy).
Teyit
Website: https://teyit.org/
Description: Teyit is a Turkish nonprofit fact-checking and media literacy social enterprise founded in 2016.
Datasets analyzed:
- Articles that were published in 2019 or later that contained claims categorized as “incorrect association,” “manipulation,” or “distortion” and which the fact-checkers had not seen subsequently corrected. (Fact-checkers provided access to a database containing a wide range of thousands of fact-checks which ProPublica filtered based on the previous criteria.)
EU DisinfoLab
Website: https://www.disinfo.eu/
Description: EU DisinfoLab is a Brussels-based nonprofit organization that studies misinformation in the EU.
Datasets analyzed:
- Articles from the region between 2019 and the present that were fact-checked by EU DisinfoLab and determined to be misinformation.
- Websites from Spain and German-speaking countries that were identified as sources of false and misleading claims in the regions.
Science Feedback
Website: https://sciencefeedback.co/
Description: Science Feedback is a nonprofit based in France that produces scientist-expert fact-checks for health and climate news articles.
Datasets analyzed:
- Articles related to climate and climate change published in 2019 or later to which Science Feedback gave its lowest rating, “False.”
Appendix C: Dataset Cleaning Criteria
All datasets were cleaned with the intention of removing invalid links, social media traffic, archived content and images/PDFs.
Any links originating from the below social media or content hosting sites were removed from the final analysis.
- Google Drive
- Telegram
- TikTok
- Vimeo
- YouTube
Any links ending in any of the below were automatically excluded from the final analysis:
- .png
- .jpg
- .jpeg
- ?type=image
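The two exclusion rules above (by hosting site and by link ending) amount to a simple filter. The sketch below is illustrative only: the mapping from site names to hostnames in `EXCLUDED_HOSTS` is an assumption, and the function name is hypothetical.

```python
from urllib.parse import urlparse

# Assumption: hostnames chosen to represent the sites listed above.
EXCLUDED_HOSTS = ("drive.google.com", "t.me", "tiktok.com", "vimeo.com", "youtube.com")

# Link endings listed above.
EXCLUDED_ENDINGS = (".png", ".jpg", ".jpeg", "?type=image")


def is_excluded(url: str) -> bool:
    """True if the link points to an excluded site or ends in an excluded suffix."""
    host = (urlparse(url).hostname or "").lower()
    if any(host == h or host.endswith("." + h) for h in EXCLUDED_HOSTS):
        return True
    # str.endswith accepts a tuple and checks each candidate ending.
    return url.lower().endswith(EXCLUDED_ENDINGS)


links = [
    "https://www.youtube.com/watch?v=abc",
    "https://example.com/photo.JPG",
    "https://example.com/news/story.html",
]
kept = [link for link in links if not is_excluded(link)]
print(kept)  # ['https://example.com/news/story.html']
```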
Links to any of the archiving or redirect services below were visited, and we attempted to extract the original URL. If the extraction failed, or if the extracted link was of a type that would be excluded from the final analysis anyway, the URL was discarded.
- web.archive.org
- webcache.googleusercontent.com
- archive.today
- google.com/url?
- perma.cc
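For two of the services above, the original address is embedded in the link itself and can be recovered without visiting the page. The sketch below handles only those two cases (Wayback Machine links and google.com/url redirects) and returns `None` for everything else. The function name is hypothetical. Services such as archive.today and perma.cc would require fetching and parsing the archived page, which this sketch does not attempt.

```python
import re
from typing import Optional
from urllib.parse import parse_qs, urlparse

# Wayback Machine links have the form
# https://web.archive.org/web/<timestamp>[flags]/<original URL>
WAYBACK = re.compile(r"^https?://web\.archive\.org/web/\d+[a-z_]*/(https?://.+)$", re.I)


def extract_original_url(url: str) -> Optional[str]:
    """Return the embedded original URL, or None if it cannot be read from the link."""
    match = WAYBACK.match(url)
    if match:
        return match.group(1)

    parsed = urlparse(url)
    host = (parsed.hostname or "").lower()
    if host in ("google.com", "www.google.com") and parsed.path == "/url":
        params = parse_qs(parsed.query)  # also decodes percent-encoding
        for key in ("q", "url"):
            if params.get(key):
                return params[key][0]

    return None  # not recognized, or the original URL is not in the link
```

A link for which this returns `None` would then be handled by a page-fetching step or discarded, in line with the rule described above.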