Starting to Text Mine the Digitized Library with HathiTrust Features.
February 7, 2020 @ 10:00 am - 12:00 pm EST | Free
Millions of books have been digitized in the past two decades. Thanks to a 2014 court ruling, about 15 million books are available for computational analysis through HathiTrust, which provides data including word counts for each individual page. In the next year or two, similar data will become available for JSTOR and Portico books. This session will address the following questions, which anyone working with this dataset needs to answer:
1. What books have been scanned, and which ones end up in Hathi?
2. How do you build up a list of Hathi volumes to address a feature set?
3. How do you acquire and work with Hathi’s “Feature Count” data programmatically?
4. What sort of questions can you answer with these word counts, anyway?
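As a rough illustration of the kind of question per-page word counts can answer, the sketch below totals counts across a volume and tracks one word's relative frequency page by page. It uses only the Python standard library. The `pages` data and the helper names `volume_counts` and `relative_frequency_by_page` are made up for illustration and are not HathiTrust's actual data format or API.

```python
from collections import Counter

# Hypothetical sample: per-page word counts for one volume (invented data,
# not the real HathiTrust file layout).
pages = [
    {"page": 1, "counts": {"whale": 3, "sea": 2, "ship": 1}},
    {"page": 2, "counts": {"whale": 1, "captain": 4}},
    {"page": 3, "counts": {"sea": 5, "ship": 2, "captain": 1}},
]


def volume_counts(pages):
    """Sum the per-page counts into one volume-level Counter."""
    total = Counter()
    for p in pages:
        total.update(p["counts"])
    return total


def relative_frequency_by_page(pages, word):
    """Return {page number: share of that page's tokens that are `word`}."""
    result = {}
    for p in pages:
        n = sum(p["counts"].values())
        result[p["page"]] = p["counts"].get(word, 0) / n if n else 0.0
    return result


totals = volume_counts(pages)
print(totals.most_common(3))
print(relative_frequency_by_page(pages, "whale"))
```

Because only counts are involved, the same approach works without access to the full text of a book, which is what makes this kind of data usable for in-copyright volumes.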
Equipment Requirements: Laptop or high-powered tablet.
Prerequisites: None. This session will generally be at a high enough level to be useful for those who wish to supervise research programmers rather than write code themselves. Those with basic programming experience who wish to follow along in code during the workshop should consider installing the ‘htrc-feature-reader’ module (for Python) or the ‘hathidy’ package (for R).