The destruction and distortion of U.S. government data

As the new U.S. administration takes down and alters government data, initiatives to preserve data spring up.


As the Trump administration issued a whirlwind of Executive Orders during the first month of the presidency, the most pressing issue for concerned librarians was the takedown of government websites with vital information. Under attack was data concerning gender (the word was excised and replaced with sex), race, disabilities, climate change (mentions of global warming and climate change were to be removed) and DEI (diversity, equity, inclusion). The takedowns affected data collections at almost every U.S. government agency.

Data collected by U.S. government agencies stretches well beyond this one country. International data has long been an important component of these data sources. In the scientific, technical and medical realm, data transcends national borders. And because the U.S. market is so large, data about potential consumer demographics and industry statistics interests companies wanting to sell into the U.S.A.

Data preservation efforts

Although the new administration's draconian approach to information and data is extreme, every administration brings change to government websites, even if it's as obvious (and benign) as changing the names of the people running the agencies. In 2008, the Internet Archive began capturing and saving U.S. government websites at the end of each presidential administration in an End of Term (EOT) Web Archive. For historical and research purposes, the Internet Archive believes that older data should be retained. This year's crawl collected over 500 terabytes of material, including more than 100 million webpages, most from top-level domains such as .gov and .mil, as well as government websites hosted on .org, .edu and others. The EOT Web Archive resides on the Filecoin network, part of the Internet Archive's Democracy's Library project (https://archive.org/details/democracys-library).

The Harvard Law School Library Innovation Lab has been at the forefront of dataset preservation for years. It created a "data vault" to download data, authenticate it and make copies available. It scoured data.gov, GitHub repositories and PubMed to grab portions of the datasets tracked there. The data vault differs from the Internet Archive in that it collects and preserves datasets, not webpages. The two have complementary rather than overlapping missions.

On 6 February 2025, Harvard Law announced the release of the Data.gov Archive on Source Cooperative. This 16-terabyte collection includes over 311,000 datasets harvested during 2024 and 2025, constituting a complete archive of the federal public datasets linked by data.gov. It will be updated daily as new datasets are added to data.gov. The archive preserves not just the datasets but also detailed metadata, and the lab has released open source software and documentation to allow researchers to replicate its work and create similar repositories. This builds on its work with the Perma.cc web archiving tool, the Caselaw Access Project and Century-Scale Storage.

A coalition of librarians and library associations banded together to create the Data Rescue Project. Designed to reduce duplication of rescue efforts, its Data Rescue Tracker provides a consolidated overview of who is downloading which dataset from which government websites and who is maintaining the data. Suggestions about datasets to preserve can be entered into a submission form on the website. You can follow the Data Rescue Project on BlueSky.

Journalists, as well as librarians, have a commitment to data so that they can fulfill their mission of holding public officials accountable and speaking truth to power. The Journalist's Resource, a project of Harvard Kennedy School's Shorenstein Center, has likewise set up a list of non-government websites that have health data, noting that some of them use government data in their report creation. In addition to sources of health data, it lists data archiving efforts, including the Data Rescue Project and the Harvard Library Innovation Lab.

Jessica Hilburn, in an impassioned NewsBreaks article, provides some alternative sources for data from universities and non-governmental organizations (NGOs). She makes an excellent case for why libraries should not be neutral, stating, "When compassion and inclusion are labeled the enemy and the diversity created by our great American experiment is lambasted as a social ill, claiming that libraries are neutral or apolitical is not only incorrect, it’s complicit."

Coherent Digital is taking the approach of collecting data from NGOs that is in danger of disappearing due to the cessation of U.S. government funding. It sees this as in line with its mission statement: "Coherent is committed to preserving and making all information accessible—both for today and for the future. That’s why we’re accelerating efforts to capture and safeguard at-risk content, particularly from government agencies, NGOs, and think tanks." The team has already started to identify, locate and save critical documents before they disappear and is striving to ingest as many pieces of information as possible. NGOs in the Global South are particularly at risk.

Data in bibliographic databases

The U.S. government funds two of the very first bibliographic databases searchable online, both of which are used daily by librarians around the globe. The National Institutes of Health's National Library of Medicine, part of Health and Human Services (HHS), launched MEDLARS in 1964; MEDLINE came online in 1971 and PubMed in 1996. ERIC, from the Education Department, began in 1966. With HHS under fire to drastically reduce staff and the Education Department likely to be eliminated, the future of these critical databases is in doubt. They could be taken down completely; cease updating; delete articles currently in the databases on topics disliked by this administration and, going forward, include only those in line with government policy; or change indexing to conform with political views. We already saw the swift approval by the Library of Congress to change its subject heading for Gulf of Mexico to Gulf of America. Would vaccination disappear as an index term in PubMed, or diversity from ERIC? Commercial databases also contain information sourced from U.S. government agencies. Could government overreach affect those databases as well?

If funding for government research evaporates, as is an all too likely future scenario, preservation is of limited value. Datasets come to a screeching halt if no additional information is added. What happens next with government funding for research, information distortion and data preservation is an open question. U.S. courts have challenged some of the more outrageous takedowns but have not addressed the issue of material changes to the information contained in government data. Information professionals face a new information and data literacy problem—can U.S. government statistics be trusted? This is truly new territory for us.

This is a rapidly evolving situation, so it's imperative that the worldwide library community follow developments and take action when needed.