The Library of Congress, Internet Archive, University of North Texas Libraries, George Washington University Libraries, Stanford University Libraries, EDGI, and the U.S. Government Publishing Office have joined in a collaborative project to preserve public United States Government web sites at the close of the current presidential administration, which ends on January 20, 2021. This harvest is intended to document federal agencies' presence on the World Wide Web during the transition of Presidential administrations and to enhance the existing collections of the partner institutions.
In this collaboration, the partners will structure and execute a comprehensive harvest of the Federal Government .gov domain. The Internet Archive will crawl broadly across the entire .gov domain. The University of North Texas and others will supplement and extend the broad comprehensive crawl with focused, in-depth crawls based on prioritized lists of URLs, including social media. This two-pronged approach seeks to capture a comprehensive snapshot of the Federal Government on the Web at the close of the current administration.
Harvested content from previous End of Term Presidential Harvests is available at http://eotarchive.cdlib.org/.
The project has two phases: a broad, comprehensive baseline crawl of .gov sites, and more selective, focused crawls based on priorities established by the partners. The focused selection seeks to capture sites in greater depth and to identify those at greater risk of rapid change or disappearance.
Comprehensive Crawl - The Internet Archive will undertake a comprehensive crawl of the .gov domain (all of the URLs identified for this project) beginning in mid-September 2020, and again in early 2021, after the inauguration.
Prioritized Crawl - The project team will assemble a list of related URLs and social media feeds. To that end, the project team is calling upon government information specialists, including librarians, political and social science researchers, and academics, to assist in selecting and prioritizing the web sites to be included in the collection, and in identifying how often and how deeply they should be collected. The schedule for crawling the prioritized URLs is still to be determined; it will be announced as the project gets underway on the project’s listserv and in other communications to the public.
Participants will be asked to refine the existing URL list by browsing or searching .gov URLs in the Nomination Tool. Specialists will review the URLs to determine whether they are in scope or out of scope for the end-of-term project. Participants may also add URLs while crawling is under way.
In Scope = Federal Government websites (.gov) in the Legislative, Executive, or Judicial branches of government, and related social media accounts. Also in scope are Federal Government websites on other domains, such as .mil, .edu, and .com. Of particular interest for prioritization are websites that are likely to change dramatically or disappear during the transition of government in 2020-2021.
Out of Scope = Local or state government websites, and any non-government sites, including those documenting the U.S. elections.
Each URL will be assigned a weighted score based on the in scope/out of scope recommendations. By identifying something as in scope, specialists are assigning it a higher priority for crawling (+1 is added to the score). If a specialist identifies a URL as not in scope, it lowers the prioritization factor (subtracts 1 from the score) but doesn’t remove it from the collection. The scores will help the project team identify which URLs require higher prioritization for crawling. Further examples of in scope URLs will be provided as the project continues.
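The scoring rule described above can be illustrated with a minimal Python sketch. It assumes that a URL's score is simply the sum of its +1 (in scope) and -1 (out of scope) recommendations and that ties are broken alphabetically; the Nomination Tool's actual implementation is not described here and may differ. The function names and example URLs are hypothetical.

```python
from collections import defaultdict


def tally_scores(recommendations):
    """Sum specialist recommendations into one score per URL.

    `recommendations` is an iterable of (url, in_scope) pairs, where
    in_scope is True for "in scope" (+1) and False for "out of scope" (-1).
    Assumption: the score is a plain sum of these votes.
    """
    scores = defaultdict(int)
    for url, in_scope in recommendations:
        scores[url] += 1 if in_scope else -1
    return dict(scores)


def rank_for_crawling(scores):
    """Order URLs from highest to lowest score.

    URLs with low or negative scores stay in the list; they are only
    ranked lower. Ties are broken alphabetically (an assumption).
    """
    return sorted(scores, key=lambda url: (-scores[url], url))


# Hypothetical recommendations from several specialists.
recommendations = [
    ("https://www.example.gov/", True),
    ("https://www.example.gov/", True),
    ("https://data.example.gov/", True),
    ("https://data.example.gov/", False),
    ("https://city.example.org/", False),
]

scores = tally_scores(recommendations)
ranked = rank_for_crawling(scores)

for url in ranked:
    print(f"{scores[url]:+d}  {url}")
```

Note that the URL marked out of scope still appears in the ranked list, only at the bottom, which mirrors the statement that a negative recommendation lowers priority without removing the URL from the collection.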
The tool also allows for optional metadata about the URLs to be added by specialists participating in the project. The metadata elements are not required and not the focus of the project at this point, but will provide options to more specifically identify resources for future reference.
The URL Nomination Tool has been designed by the project team and developed by the University of North Texas to provide the community of subject specialists with a convenient means to contribute information on specific sites for the focused crawl. The tool was initially populated with .gov URLs (“seeds”) from previous harvests. Using the tool, participants may suggest additional URLs, add recommendations for priority, and add brief additional information as appropriate.
We understand that specialists may not have time to review all the URLs, so we recommend concentrating on a specific area of interest: review and prioritize the URLs already added in that area, and add new URLs.