The Library of Congress, the California Digital Library, the University of North Texas Libraries, the Internet Archive, and the U.S. Government Publishing Office have joined together for a collaborative project to preserve public United States Government web sites at the end of the current presidential administration on January 20, 2013. This harvest is intended to document federal agencies' presence on the World Wide Web during the transition of presidential administrations and to enhance the existing collections of the five partner institutions.
In this collaboration, the partners will structure and execute a comprehensive harvest of the Federal Government .gov domain. The Internet Archive will crawl broadly across the entire .gov domain. The University of North Texas and the California Digital Library will supplement and extend the broad crawl with focused, in-depth crawls based on prioritized lists of URLs. This two-pronged approach seeks to capture a comprehensive snapshot of the Federal Government on the Web at the close of the current administration.
The project has two phases: a broad, comprehensive baseline crawl of .gov sites, followed by more selective, focused crawls based on priorities established by the partners. The focused selection seeks to capture sites in greater depth and to identify those at greater risk of rapid change or disappearance.
Comprehensive Crawl - The Internet Archive will undertake a comprehensive crawl of the .gov domain (all of the URLs identified for this project) beginning in late August 2012, and again in early 2013, after the inauguration.
Prioritized Crawl - The project team is calling upon government information specialists, including librarians, political and social science researchers, and academics, to assist in selecting and prioritizing the web sites to be included in the collection, and in identifying how often and how deeply each should be crawled. The schedule for crawling the prioritized URLs is still to be determined and will be announced on the project’s listserv as the project gets underway.
Participants will be asked to refine the existing URL list by browsing or searching .gov URLs in the Nomination Tool. Specialists will review the URLs to determine whether they are in scope or out of scope for the end-of-term project. Participants may also add URLs.
In Scope = Federal Government websites (.gov) in the Legislative, Executive, or Judicial branches of government. Of particular interest for prioritization are websites that are likely to change dramatically or disappear during the transition of government in 2012-2013.
Out of Scope = Local or state government websites, and any other sites not in the .gov domain.
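As a rough illustration of the domain part of this rule, the following Python sketch checks whether a URL's host falls under .gov. The function name is_gov_host is hypothetical and is not part of the Nomination Tool. Note that a domain check alone cannot settle scope: state and local sites can also use .gov addresses, so a specialist's judgment is still needed to decide whether a site is federal.

```python
from urllib.parse import urlsplit


def is_gov_host(url):
    """Return True if the URL's host is in the .gov domain.

    Hypothetical helper for illustration only. It is a first-pass
    filter: it cannot tell federal sites from state or local ones.
    """
    host = urlsplit(url).hostname  # lowercased by urlsplit; None if no scheme/host
    if host is None:
        return False
    host = host.rstrip(".")  # tolerate a trailing dot in the host name
    return host == "gov" or host.endswith(".gov")
```

A URL must include its scheme (for example http://) for the host to be recognized; a bare string such as www.loc.gov is treated as having no host.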
Each URL will be assigned a weighted score based on the in scope/out of scope recommendations. By identifying a URL as in scope, specialists assign it a higher priority for crawling (1 is added to the score). If a specialist identifies a URL as out of scope, its priority is lowered (1 is subtracted from the score), but the URL is not removed from the collection. The scores will help the project team identify which URLs require higher prioritization for crawling.
The tool also allows participating specialists to add optional metadata about the URLs. The metadata elements are not required and are not the focus of the project at this point, but they will make it possible to identify resources more specifically for future reference.
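The scoring rule described above can be sketched in a few lines of Python. This is only an illustration under assumptions: the function names score_urls and prioritize, and the representation of a recommendation as a (url, in_scope) pair, are hypothetical and do not describe how the Nomination Tool is actually implemented.

```python
from collections import defaultdict


def score_urls(recommendations):
    """Tally a score per URL from (url, in_scope) recommendations.

    Each in-scope recommendation adds 1; each out-of-scope
    recommendation subtracts 1. No URL is ever dropped.
    """
    scores = defaultdict(int)
    for url, in_scope in recommendations:
        scores[url] += 1 if in_scope else -1
    return dict(scores)


def prioritize(scores):
    """Return URLs ordered from highest to lowest score (ties by URL)."""
    return sorted(scores, key=lambda url: (-scores[url], url))
```

For example, two in-scope recommendations for one URL and one out-of-scope recommendation for another yield scores of 2 and -1, and both URLs remain in the list, with the higher-scoring one first.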
The URL Nomination Tool has been designed by the project team and developed by the University of North Texas to provide the community of subject specialists with a convenient means to contribute information on specific sites for the focused crawl. The tool was initially populated with .gov URLs (“seeds”) from previous harvests. Using the tool, participants may suggest additional URLs, add recommendations for priority, and add brief additional information as appropriate.
We understand that specialists may not have time to review all the URLs, so we recommend that individuals concentrate on a specific area of interest: review and prioritize the URLs already added, and add new URLs.