Text as a form of data has roots in the beginning of computer science. Anandarajan, Hill, and Nolan provide a very good overview of its emergence (Anandarajan et al., 2019). They describe the development of topic modelling and sentiment analysis, which identify broader patterns across large collections of text.
Increases in computing power allow software to identify patterns that a human might take years to uncover. The popularity of text mining has grown as tools have become readily available. The University of Chicago Library has faced increasing demand for sources that researchers could use to perform this type of large-scale analysis. Meeting this demand has been challenging. Publishers have been reluctant to make content available at scale and it took time for database vendors to develop solutions. This article describes how the library approached the problem and worked to provide an evolving set of options for researchers over time.
Early Days
A 2005 study of bias in media was the first text-mining project librarians became aware of that utilized library content. The researchers selected phrases that they believed showed bias in some way and used the library’s subscriptions to ProQuest Historical Newspapers to collect their data. They used an automated script to count the occurrences of these phrases, a practice that would now trigger safeguards that automatically block access on most database platforms. Web scraping at scale did not seem to be something database providers had considered when this research took place. The research was first released as a working paper in 2006 (Gentzkow & Shapiro, 2006) and was later published in Econometrica (Gentzkow & Shapiro, 2010). Other researchers frequently asked librarians for help building similar datasets, which was not possible under the library’s existing licenses. This did not reduce the demand, and there were frequent problems with researchers writing scripts to do similar scraping. Most researchers understood the issues involved once these were explained to them, but the library was frequently in the position of reacting to a license breach. This led the library to search for solutions that could meet the increasing demand.
Scraping By
An early effort by the library to provide text as data was the purchase of structured XML files from Gale and ProQuest that could be distributed to researchers. These were hosted locally, so any authorized user could download them and use them on their own computer. Purchased data included some major newspapers, including The New York Times and The Wall Street Journal. However, these files have several disadvantages. Most of the content available from ProQuest is from the early twentieth century and there is little to no recent content. ProQuest provides the data as a single, very large ZIP file that can be difficult to work with. Gale titles such as The Economist have more recent content, but these are provided issue by issue, so researchers have had to write scripts to download them in batches. No usage data is available for these files because of the way library technical staff implemented the hosting, so it was never clear whether they were meeting the needs of researchers.
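One way to cope with a very large ZIP file of XML documents is to read its members one at a time rather than unpacking the whole archive. The following is a minimal Python sketch under stated assumptions: the function name count_phrases_in_zip is hypothetical, the vendor XML schemas are not described in this article, and so the sketch simply joins every text node in each file instead of targeting a specific full-text element. It counts phrase occurrences, the kind of task described in the previous section, and is not the method used by any of the researchers mentioned.

```python
import zipfile
import xml.etree.ElementTree as ET


def count_phrases_in_zip(zip_source, phrases):
    """Count case-insensitive phrase occurrences across XML files in a ZIP.

    zip_source may be a file path or a file-like object. Members are read
    one at a time, so the archive never has to be extracted to disk.
    """
    totals = {phrase: 0 for phrase in phrases}
    with zipfile.ZipFile(zip_source) as archive:
        for name in archive.namelist():
            if not name.lower().endswith(".xml"):
                continue  # ignore non-XML members such as readme files
            with archive.open(name) as member:
                try:
                    root = ET.parse(member).getroot()
                except ET.ParseError:
                    continue  # skip malformed files rather than abort the run
            # Assumption: all text nodes are relevant. With a known schema,
            # select only the element that holds the article body.
            text = " ".join(root.itertext()).lower()
            for phrase in phrases:
                totals[phrase] += text.count(phrase.lower())
    return totals
```

Calling count_phrases_in_zip("newspapers.zip", ["first phrase", "second phrase"]), where the file name and phrases are placeholders, returns a dictionary mapping each phrase to its total count. Only one document is held in memory at a time, which keeps memory use modest even when the archive itself is large.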
Database providers responded to the demand for text mining by developing online solutions. The first tool the library offered was the Gale Digital Scholar Lab. This offered access to most primary source collections that the library licensed through Gale and some common tools for doing analysis. Datasets were limited to 10,000 documents, which researchers found too small for serious work. It did not get much use, and the subscription was cancelled after two years.
ProQuest approached the library as a potential development partner for a new text mining product in 2019. This was the precursor to their current TDM Studio product. The library was an eager participant, and we were able to provide access to two research teams that had repeatedly asked for a way to analyze newspaper content. All data and analysis in TDM Studio reside in a hosted online platform, which allows ProQuest to adhere to their licenses with their publishing partners. The researchers indicated they would prefer to be able to download data to their own workspace, but they were otherwise very happy to be able to perform this kind of analysis. This strong feedback led to the library subscribing to the full product when it was released.
Ithaka, publisher of JSTOR, invited the library to participate in the beta version of their text analysis product in 2021. This included access to workshops led by Ithaka staff on various tools, including using Python for text analysis. The beta led to a full subscription to Constellate when it launched. It is a similar product to TDM Studio, but primarily for academic journals. Constellate is free for anyone to use; a full subscription allows researchers to build larger data sets.
The library also evaluated a solution from LexisNexis but decided it would require too much work by staff. It had shared accounts, which librarians would need to manage. A researcher could be assigned access for a specified amount of time. The library did not want to be in the position of prioritizing projects or cutting access before a project had been completed.
(Almost) a Service
The library relies on an extensive LibGuide (University of Chicago Library, 2024) as the entry point for current support for text mining. We highlight TDM Studio and Constellate as our primary solutions. TDM Studio now allows users to self-register, which has eased the work of the librarians supporting it. Previously, each new user’s request for access required back-and-forth with ProQuest. Constellate uses single sign-on, so all authorized users can get access immediately. Librarians have promoted Constellate as a key source for journals from the American Economic Association. This is primarily due to the frequency with which the AEA has notified us of suspended access due to the unauthorized use of scripts.
The guide also highlights major journal publishers that allow text mining. Some offer API access to our subscribed content, notably Elsevier and Wiley. Others allow it through other means, and we point researchers to our list as questions arise. Elsevier’s API has been popular with business researchers, due to the number of important titles in accounting, economics, and finance.
The library is also adding text mining language to licenses when we add new resources or review old licenses. This has been successful with many societies and smaller publishers and there are now almost one hundred publishers with an agreement. Model language is included as an appendix.
Conclusion
Publishers want to protect their property, so there may never be a text mining solution that satisfies the needs of both publishers and researchers. The resources the library offers have given University of Chicago users a base with which to work, but the library continues to evaluate possibilities. The increasing popularity of large language models is likely to be the next challenge, as researchers will want even larger text corpora to train their models.
References
Anandarajan, M., Hill, C., & Nolan, T. (2019). Practical Text Analytics: Maximizing the Value of Text Data (Vol. 2). Springer International Publishing. https://doi.org/10.1007/978-3-319-95663-3
Big Ten Academic Alliance. (2020). Library initiatives standardized agreement language. https://btaa.org/library/programs-and-services/consortial-licensing/standardized-agreement-language
Gentzkow, M., & Shapiro, J. M. (2006). What Drives Media Slant? Evidence from U.S. Daily Newspapers. National Bureau of Economic Research Working Paper Series, No. 12707. https://doi.org/10.3386/w12707
Gentzkow, M. A., & Shapiro, J. M. (2010). What Drives Media Slant? Evidence from U.S. Daily Newspapers. Econometrica, 78(1), 35-71. https://doi.org/10.3982/ECTA7195
University of Chicago Library. (2024, April 29). Text and data mining. https://guides.lib.uchicago.edu/textmining
Appendix
Model license language for text mining rights, adapted from the Big Ten Academic Alliance Library Initiatives Standardized Agreement Language (2020): https://btaa.org/docs/default-source/default-document-library/standardized-agreement-language-june-2020.pdf
“Authorized Users may use the licensed material to perform and engage in text mining/data mining activities for academic research, scholarship, and other educational purposes, and to utilize and share the outputs of text and data mining in their scholarly work. [Publisher] will cooperate with Licensee and Authorized Users in making the licensed materials available in a manner and form most useful to the Authorized User. Any [Publisher] fees for provision of copies will be on a time and materials basis only.”