Preserving the Internet: Library of Congress and the Internet Archive

Preserving internet content seems to be a Sisyphean task, especially content from social media.  A recent Forbes.com article by Kalev Leetaru, Why We Need To Archive The Web In Order To Preserve Twitter, makes a great point about the challenges of preserving online content:

Perhaps the most important lesson is the reminder that in a networked information world, preserving a single object in isolation may not actually preserve it if it consists of links to other resources which are lost.

The content of the web changes every second, and a website can be taken down at any time.  If a Tweet links to a website that’s no longer available, how useful is an archived version of that Tweet?  And that’s just the tip of the iceberg…the Library of Congress (LOC) has been trying to figure out how to create a usable archive of Tweets since 2010.

Obstacles in Archiving Twitter

An article last year from The Atlantic, Can Twitter Fit Inside the Library of Congress? by Andrew McGill, provides an overview of the agreement the Library of Congress made with Twitter in April 2010:

Twitter promised to hand over all the tweets posted since the company’s launch in 2006, as well as a regular feed of new submissions. In return, the library agreed to embargo the data for six months and ensure that private and deleted tweets were not exposed.

The Library of Congress has the raw data, but it struggles with the ever-growing size and complexity of the Tweets archive.  With 500 million Tweets added a day (in 2012) and the added metadata of embedded images, videos, and conversation threads, the archive of Tweets has become nearly unsearchable with the technology currently available to the LOC.  The Atlantic article quotes an LOC blog post from 2013 that describes how “executing a single search of just the fixed 2006-2010 archive on the Library’s systems could take 24 hours.” Researchers desperately want free access to the Twitter archives, but the sheer volume, variety, and velocity of this big data make it extremely difficult to create an easily searchable portal.  Even if the LOC does create a searchable portal for the Twitter archives, how useful will those preserved Tweets really be without the context of working links?

Preserving the Internet: The Internet Archive

Twitter is just a single social media platform…how can we possibly preserve all versions of all websites ever available on the web?  Many well-written articles have already pondered this question.

One thread uniting these articles is their mention of the Internet Archive, which describes itself as “a non-profit library of millions of free books, movies, software, music, websites, and more.”  You can search everything from copyright records to TV clips of President Trump on the Internet Archive, but the crowning achievement of the site is the Wayback Machine, which allows users to explore more than 299 billion web pages saved over the past two decades.

For example, if I want to explore all archived versions of the MedlinePlus homepage, I can simply search by the URL and view 3,551 versions of the page, saved between April 7, 2000 and July 21, 2017.  Some links on the archived pages will take you to archived versions of the linked webpages (although those captures may have a different time stamp). Many of the images and drop-down menus are also preserved, so you get a relatively accurate feel for what the webpage looked like at the time. The Wayback Machine is a fascinating tool for cultural and historical research, and it’s even used for more creative purposes like patent searching and improving search engine optimization (SEO).

Capture of the MedlinePlus homepage from April 7, 2000 on the Wayback Machine.
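If you’d rather pull capture information programmatically than browse, the Internet Archive offers a public Availability API that returns the saved snapshot closest to a given date.  Here is a minimal Python sketch using only the standard library; the medlineplus.gov URL and the April 7, 2000 timestamp simply mirror the example above, and error handling is omitted for brevity.

```python
import json
import urllib.parse
import urllib.request

def closest_snapshot(url: str, timestamp: str) -> dict:
    """Return the Wayback Machine capture closest to timestamp (YYYYMMDD)."""
    api = (
        "https://archive.org/wayback/available?url="
        + urllib.parse.quote(url)
        + "&timestamp="
        + timestamp
    )
    with urllib.request.urlopen(api) as resp:
        data = json.load(resp)
    # "closest" is only present when at least one capture of the URL exists.
    return data.get("archived_snapshots", {}).get("closest", {})

if __name__ == "__main__":
    snap = closest_snapshot("medlineplus.gov", "20000407")
    print(snap.get("timestamp"), snap.get("url"))
```

Running the script prints the timestamp and web.archive.org URL of the nearest capture, which you can then open in a browser just like any other Wayback Machine page.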

Exhibiting the Internet: The Library of Congress

Although the Library of Congress has yet to release a usable Twitter archive, it still offers plenty of smaller online content archives that provide valuable insights into web culture.  The LOC recently announced the release of the Webcomics Web Archive and the Web Cultures Web Archive. The Webcomics archive focuses on “award-winning comics as well as webcomics that are significant for their longevity, reputation or subject matter”, while the Web Cultures archive includes “a representative sampling of websites documenting the creation and sharing of emergent cultural traditions on the web such as GIFs, memes and emoji.”

Each archived website includes a metadata page with a representative screenshot and bibliographic data about the website (including a summary and description of the site).  The archived website page also links to a timeline of all captured versions of the site.  For example, the Cute Overload! 😉 archived website page links to 122 captures of the Cute Overload homepage between October 3, 2006 and June 1, 2016.

Archived Website page for Cute Overload on the Library of Congress website.
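That bibliographic metadata can also be retrieved in machine-readable form: most loc.gov pages, including collection listings, return JSON when the fo=json parameter is appended.  The sketch below assumes the Web Cultures Web Archive lives at the collection slug web-cultures-web-archive (an assumption based on the collection’s display name, so verify the URL in a browser first) and only prints a couple of top-level fields.

```python
import json
import urllib.request

# The "?fo=json" parameter is the loc.gov JSON API; the collection slug is an
# assumption based on the collection's display name and may need adjusting.
COLLECTION_URL = "https://www.loc.gov/collections/web-cultures-web-archive/?fo=json"

with urllib.request.urlopen(COLLECTION_URL) as resp:
    data = json.load(resp)

# Each result carries the descriptive metadata shown on the archived
# website pages (title, description, dates, and so on).
for item in data.get("results", []):
    print(item.get("title"), "-", item.get("url"))
```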

While the Internet Archive aims for quantity, preserving as many webpage captures as possible, the Library of Congress online collections aim for a representative sample of high-quality sites.  The Library of Congress collections also include helpful metadata for each archived website, so they are easily discoverable.  The LOC collection can be used as an internet history museum, while the Internet Archive Wayback Machine is the closest thing we currently have to an actual archive of the internet.  Hopefully we’ll eventually also have access to a full Twitter archive from the LOC, but that may be years down the road.

How can health science librarians get involved in big data?

The following reflection was written for the online class Big Data in Healthcare: Exploring Emerging Roles, a fantastic free course provided by the National Network of Libraries of Medicine (NNLM).

Enormous data sets containing a broad variety of information produced at high velocity are transforming the healthcare field.  This “big data” is being used for clinical research, patient diagnosis and treatment, analysis of public health trends, and in many other innovative ways to move healthcare into a new era of highly personalized medicine.  Patients provide the health data, programmers and data scientists create new tools to manipulate the data, and clinicians and other healthcare professionals consult and analyze the data.  Health science librarians may wonder what roles they can play in this daunting but incredibly important new domain.  Librarians can use their specialized skills to fill three key roles in the big data field: they can act as liaisons between healthcare professionals and programmers, they can act as advocates for patients, and they can act as educators for patients and healthcare professionals.

Librarians regularly perform reference interviews and user needs assessments to determine the information and programming needs of their patrons, and these skills can help librarians become effective liaisons between healthcare professionals and programmers who create tools to manipulate big data.  In the presentation The Triple Aim at the Front Lines: Lessons from a VA Experience in using data to drive change, Dr. Nick Meo describes how, in order to create more effective data tools for physicians, programmers need to know how frontline physicians are using these tools in their everyday practice.  Librarians can be the intermediaries in this situation.  After performing reference interviews, focus groups, and other forms of needs assessment with healthcare professionals, the librarian can then work with programmers to create data tools that fit the information needs and diagnostic/treatment processes of the healthcare team.

Librarians can also act as advocates for patients by learning about patient concerns related to the use of their personal health data and communicating these concerns to both programmers and healthcare professionals.  In the article A ‘green button’ for using aggregate patient data at the point of care, Christopher Longhurst, Robert Harrington, and Nigam Shah suggest a change to HIPAA so that it would be “acceptable for front-line clinicians to use aggregate patient data, even if identified, for the purpose of treating a similar patient under their care” (1233).  This idea may make aggregated patient data more easily accessible to clinicians, but how would patients feel about their personal health data being used in this manner?  Librarians can work with patients to gain their viewpoints on possible new uses for health data like the suggested “green button”, and patients may reveal ethical, privacy, or security concerns that programmers and healthcare professionals had not previously considered.

Finally, librarians can act as educators for both healthcare professionals and patients to demonstrate the value of utilizing big data in healthcare. In the article Big data and new knowledge in medicine: the thinking, training, and tools needed for a learning health system, Harlan Krumholz describes how healthcare professionals will need to change their viewpoints about best practices for research in order to fully embrace big data.  Librarians can begin changing viewpoints by presenting healthcare professionals with concrete examples of how big data has been used to improve patient care, as well as training resources for learning more about data science.  Librarians can also promote patient participation in big data initiatives by explaining how the projects will benefit public health.  For instance, librarians can explain to patients and the general public how participation in the All of Us Research Program may improve personalized medicine for current and future generations.

Health science librarians don’t need advanced programming skills or a medical degree as a prerequisite to work with big data.  Librarians already possess valuable communication and training skills which will make them effective liaisons between patients, healthcare professionals, and programmers who contribute to generating, analyzing, and creating tools for big data.

Where to Start with Big Data?

Librarians have to sink or swim in the constantly shifting waters of the information field, and the latest wave sweeping over information sciences is Big Data. I started learning about the importance of data analysis and visualization while working with patents, where analysis of large patent portfolios could be used for competitive intelligence, planning acquisitions, spotting trends in a technology sector, and much more.

Now working in the health field, I’m beginning to see why everyone calls it “Big Data.”  The amount of data generated through general healthcare services and biomedical research is truly staggering, ranging from data in electronic health records to genomic data generated through human genome sequencing.  How do we make this data searchable and reusable, so researchers can discover new innovations from existing data sets?  How do we also protect personal information, especially with data generated from electronic health records?  Can researchers retain intellectual property rights to their data while still making their data searchable and reusable?  There are so many thorny issues to consider and new concepts to learn surrounding Big Data and data science in general, and it can be a daunting task trying to find a place to start.

Here are a few resources which are helping me wrap my mind around basic data science concepts and the current state of Big Data:

  1. To get an overview of how data science is impacting the healthcare field, I’m taking the National Network of Libraries of Medicine (NNLM) online course Big Data in Healthcare: Exploring Emerging Roles.  (I highly recommend checking the NNLM Upcoming Classes list for other free courses and webinars you can sign up for.)
  2. Check out this recording of a webinar called Data Science 101: An Introduction for Librarians (also from NNLM), which provides a quick overview of data science concepts like the data science pipeline, machine learning, supervised learning, unsupervised learning, natural language processing, etc.
  3. IBM produced a great infographic called The Four V’s of Big Data, which describes how big data can be broken down into four dimensions: volume, velocity, variety, and veracity of the data.
  4. Learn about the FAIR Data Principles, which suggest that all data sets should be findable, accessible, interoperable, and reusable.  A recent article in Nature gives a detailed overview of the FAIR Data Principles.
  5. I found the blog post Is Big Data Still a Thing? (The 2016 Big Data Landscape) by Matt Turck to be a useful overview of the current state of Big Data, especially the infographic included in the post which illustrates many of the major players in the field.