Preserving the Internet: Library of Congress and the Internet Archive

Preserving internet content seems to be a Sisyphean task, especially content from social media.  A recent Forbes.com article by Kalev Leetaru, “Why We Need To Archive The Web In Order To Preserve Twitter,” made a great point about the challenges of preserving online content:

Perhaps the most important lesson is the reminder that in a networked information world, preserving a single object in isolation may not actually preserve it if it consists of links to other resources which are lost.

The content of the web changes every second, and a website can be taken down at any time.  If a Tweet links to a website that’s no longer available, how useful is an archived version of that Tweet?  And that’s just the tip of the iceberg…the Library of Congress (LOC) has been trying to figure out how to create a usable archive of Tweets since 2010.

Obstacles in Archiving Twitter

An article last year from The Atlantic, “Can Twitter Fit Inside the Library of Congress?” by Andrew McGill, provides an overview of the agreement the Library of Congress made with Twitter in April 2010:

Twitter promised to hand over all the tweets posted since the company’s launch in 2006, as well as a regular feed of new submissions. In return, the library agreed to embargo the data for six months and ensure that private and deleted tweets were not exposed.

The Library of Congress has the raw data, but it struggles with the ever-growing size and complexity of the Tweets archive.  With 500 million Tweets added a day (in 2012) and the added metadata of embedded images, videos, and conversation threads, the archive of Tweets has become nearly unsearchable with the technology currently available to the LOC.  The Atlantic article quotes an LOC blog post from 2013 that describes how “executing a single search of just the fixed 2006-2010 archive on the Library’s systems could take 24 hours.” Researchers desperately want free access to the Twitter archives, but the sheer volume, variety, and velocity of this big data make it extremely difficult to create an easily searchable portal.  Even if the LOC does create a searchable portal for the Twitter archives, how useful will those preserved Tweets really be without the context of working links?

Preserving the Internet: The Internet Archive

Twitter is just a single social media platform…how can we possibly preserve all versions of all websites ever available on the web?  Many well-written articles have already pondered this question:

One thread uniting these articles is their mention of the Internet Archive, which describes itself as “a non-profit library of millions of free books, movies, software, music, websites, and more.”  You can search everything from copyright records to TV clips of President Trump on the Internet Archive, but the crowning achievement of the site is the Wayback Machine, which allows users to explore more than 299 billion web pages saved over the past two decades.

For example, if I want to explore all archived versions of the MedlinePlus homepage, I can just search by the URL and view 3,551 versions of the page, saved between April 7, 2000 and July 21, 2017.  Some links on the archived pages will take you to similar archived versions of the linked webpages (although the captures of the linked pages may have a different time stamp). Many of the images and drop-down menus are also preserved, so you get a relatively accurate feel for what the webpage looked like during that time. The Wayback Machine is a fascinating tool for cultural and historical research, and it’s even used for more creative purposes like patent searching and improving search engine optimization (SEO).

Capture of the MedlinePlus homepage from April 7, 2000 on the Wayback Machine.
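
For readers who would rather script a lookup than click through the calendar view, here is a minimal Python sketch of the URL search described above. It only builds the addresses and does not contact the Internet Archive. The helper names (`wayback_timestamp`, `snapshot_url`, `availability_query`) are my own for illustration, and `medlineplus.gov` is assumed to be the address you would search for the MedlinePlus homepage. The two URL patterns are the Wayback Machine's public snapshot address format and its availability lookup, as I understand them.

```python
from datetime import date
from urllib.parse import urlencode

# Public Wayback Machine address patterns (helper names below are illustrative).
WAYBACK_PREFIX = "https://web.archive.org/web"
AVAILABILITY_ENDPOINT = "https://archive.org/wayback/available"


def wayback_timestamp(day: date) -> str:
    """Format a date the way Wayback Machine URLs expect it: YYYYMMDD."""
    return day.strftime("%Y%m%d")


def snapshot_url(page_url: str, day: date) -> str:
    """Build the address of the capture closest to the given day."""
    return f"{WAYBACK_PREFIX}/{wayback_timestamp(day)}/{page_url}"


def availability_query(page_url: str, day: date) -> str:
    """Build a query asking whether a capture exists near the given day."""
    params = urlencode({"url": page_url, "timestamp": wayback_timestamp(day)})
    return f"{AVAILABILITY_ENDPOINT}?{params}"


if __name__ == "__main__":
    # April 7, 2000 is the earliest MedlinePlus capture mentioned in this post.
    first_capture = date(2000, 4, 7)
    print(snapshot_url("medlineplus.gov", first_capture))
    print(availability_query("medlineplus.gov", first_capture))
```

Pasting the first printed address into a browser should land on the capture nearest that date, which may not be the exact day requested.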

Exhibiting the Internet: The Library of Congress

Although the Library of Congress has yet to release a usable Twitter archive, the LOC still offers plenty of smaller online content archives which provide valuable insights into web culture.  The LOC recently announced the release of the Webcomics Web Archive and the Web Cultures Web Archive. The Webcomics archive focuses on “award-winning comics as well as webcomics that are significant for their longevity, reputation or subject matter”, while the Web Cultures archive includes “a representative sampling of websites documenting the creation and sharing of emergent cultural traditions on the web such as GIFs, memes and emoji.”

Each archived website includes a metadata page with a representative screenshot and bibliographic data about the website (including a summary and description of the site).  The archived website page also links to a timeline of all captured versions of the site.  For example, the Cute Overload! 😉 archived website page links to 122 captures of the Cute Overload homepage between October 3, 2006 and June 1, 2016.

Archived Website page for Cute Overload on the Library of Congress website.

While the Internet Archive aims for quantity, preserving as many webpage captures as possible, the Library of Congress online collections aim for a representative sample of high-quality sites.  The Library of Congress collections also include helpful metadata for each archived website, so they are easily discoverable.  The LOC collection can be used as an internet history museum, while the Internet Archive Wayback Machine is the closest thing we currently have to an actual archive of the internet.  Hopefully we’ll eventually also have access to a full Twitter archive from LOC, but that may be years down the road.

Librarians on Twitter: Hashtags, Twitter Chats, and Beyond


I’ll admit it: I’m a few years behind the game in starting a personal-professional Twitter account.  I’ve used Twitter plenty for work over the years, but Tweeting to promote a brand or a website is different from Tweeting to promote yourself and your own ideas. There is already a thriving Twitosphere of librarians out there, and it can be daunting to try to jump in and join the conversation.  Who should I follow?  What hashtags should I use?  Where can I find other medical librarians on Twitter?

Here are a few lessons learned while starting my new Twitter account, @jamornini:

So I guess I need to get Tweeting.  Maybe in the future (if I’m brave enough) I’ll follow the librarian community onto Instagram or Snapchat.  Social media is a brave new world, and librarians are constantly adapting to sharing information through the latest digital channels.