Preserving internet content seems to be a Sisyphean task, especially content from social media. A recent Forbes.com article by Kalev Leetaru, Why We Need To Archive The Web In Order To Preserve Twitter, made a great point about the challenges of preserving online content:
Perhaps the most important lesson is the reminder that in a networked information world, preserving a single object in isolation may not actually preserve it if it consists of links to other resources which are lost.
The content of the web changes every second, and a website can be taken down at any time. If a Tweet links to a website that’s no longer available, how useful is an archived version of that Tweet? And that’s just the tip of the iceberg: the Library of Congress (LOC) has been trying to figure out how to create a usable archive of Tweets since 2010.
Obstacles in Archiving Twitter
An article last year from The Atlantic, Can Twitter Fit Inside the Library of Congress? by Andrew McGill, provides an overview of the agreement the Library of Congress made with Twitter in April 2010:
Twitter promised to hand over all the tweets posted since the company’s launch in 2006, as well as a regular feed of new submissions. In return, the library agreed to embargo the data for six months and ensure that private and deleted tweets were not exposed.
The Library of Congress has the raw data, but it struggles with the ever-growing size and complexity of the Tweets archive. With 500 million Tweets added per day as of 2012, plus the added metadata of embedded images, videos, and conversation threads, the archive of Tweets has become nearly unsearchable with the technology currently available to the LOC. The Atlantic article quotes an LOC blog post from 2013 that describes how “executing a single search of just the fixed 2006-2010 archive on the Library’s systems could take 24 hours.” Researchers desperately want free access to the Twitter archives, but the sheer volume, variety, and velocity of this data make it extremely difficult to create an easily searchable portal. Even if the LOC does create a searchable portal for the Twitter archives, how useful will those preserved Tweets really be without the context of working links?
Preserving the Internet: The Internet Archive
Twitter is just a single social media platform. How can we possibly preserve all versions of all websites ever available on the web? Many well-written articles have already pondered this question:
- The New York Times, How an Archive of the Internet Could Change History by Jenna Wortham
- The New Yorker, The Cobweb: Can the Internet be Archived? by Jill Lepore
- The Atlantic, Raiders of the Lost Web by Adrienne LaFrance
One thread uniting these articles is their mention of the Internet Archive, which describes itself as “a non-profit library of millions of free books, movies, software, music, websites, and more.” You can search everything from copyright records to TV clips of President Trump on the Internet Archive, but the crowning achievement of the site is the Wayback Machine, which allows users to explore more than 299 billion web pages saved over the past two decades.
For example, if I want to explore all archived versions of the MedlinePlus homepage, I can just search by the URL and view 3,551 versions of the page, saved between April 7, 2000 and July 21, 2017. Some links on the archived pages will take you to similar archived versions of the linked webpages (although the captures of the linked pages may have a different time stamp). Many of the images and drop-down menus are also preserved, so you get a relatively accurate feel for what the webpage looked like during that time. The Wayback Machine is a fascinating tool for cultural and historical research, and it’s even used for more creative purposes like patent searching and improving search engine optimization (SEO).
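For readers who would rather script this kind of lookup than click through the calendar view, here is a minimal Python sketch. It assumes the Wayback Machine's public CDX search endpoint (web.archive.org/cdx/search/cdx) with its url, output, fl, from, and to parameters. The function names are my own, for illustration only, and the capture count the API returns may not match the number shown on the website.

```python
import json
from datetime import datetime
from urllib.parse import urlencode
from urllib.request import urlopen

# Public CDX search endpoint of the Wayback Machine.
CDX_ENDPOINT = "https://web.archive.org/cdx/search/cdx"


def build_cdx_query(url, start=None, end=None):
    """Build a CDX query URL listing captures of `url` (hypothetical helper)."""
    params = {"url": url, "output": "json", "fl": "timestamp,original"}
    if start:
        params["from"] = start  # e.g. "2000" or "20000407"
    if end:
        params["to"] = end
    return CDX_ENDPOINT + "?" + urlencode(params)


def summarize_captures(rows):
    """Count captures and find the earliest and latest capture times.

    `rows` is the parsed JSON response: a list of lists whose first
    row holds the field names.
    """
    if len(rows) < 2:
        return {"count": 0, "first": None, "last": None}
    header, records = rows[0], rows[1:]
    ts = header.index("timestamp")
    # Wayback timestamps are 14 digits: YYYYMMDDhhmmss.
    stamps = sorted(datetime.strptime(r[ts], "%Y%m%d%H%M%S") for r in records)
    return {"count": len(stamps), "first": stamps[0], "last": stamps[-1]}


def fetch_captures(url, start=None, end=None):
    """Download and parse the capture list (requires network access)."""
    with urlopen(build_cdx_query(url, start, end), timeout=30) as response:
        return json.load(response)
```

Calling summarize_captures(fetch_captures("medlineplus.gov")) should return the number of captures along with the first and last capture dates. For a heavily archived page the response can be large, so narrowing the range with start and end is a reasonable precaution.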
Exhibiting the Internet: The Library of Congress
Although the Library of Congress has yet to release a usable Twitter archive, the LOC still offers plenty of smaller online content archives which provide valuable insights into web culture. The LOC recently announced the release of the Webcomics Web Archive and the Web Cultures Web Archive. The Webcomics archive focuses on “award-winning comics as well as webcomics that are significant for their longevity, reputation or subject matter”, while the Web Cultures archive includes “a representative sampling of websites documenting the creation and sharing of emergent cultural traditions on the web such as GIFs, memes and emoji.”
Each archived website includes a metadata page with a representative screenshot and bibliographic data about the website (including a summary and description of the site). The archived website page also links to a timeline of all captured versions of the site. For example, the Cute Overload! 😉 archived website page links to 122 captures of the Cute Overload homepage between October 3, 2006 and June 1, 2016.
While the Internet Archive aims for quantity, preserving as many webpage captures as possible, the Library of Congress online collections aim for a representative sample of high-quality sites. The Library of Congress collections also include helpful metadata for each archived website, so they are easily discoverable. The LOC collection can be used as an internet history museum, while the Internet Archive Wayback Machine is the closest thing we currently have to an actual archive of the internet. Hopefully we’ll eventually also have access to a full Twitter archive from the LOC, but that may be years down the road.