Build Your Own Text-as-Data Corpus: A Print-to-Bytes Primer
February 11 @ 6:00 pm - 8:00 pm EST
This hands-on workshop will teach participants how to construct their own digital text corpus for humanities data analysis. We’ll cover simple tools for turning printed texts in a variety of languages into computer-readable files, introduce Optical Character Recognition (OCR) software, and consider helpful tools for post-processing and correcting digitized texts. We’ll also look at open-access text-as-data sources that can be reached through simple browser-based API calls. The workshop is geared toward digital humanists who need to assemble text data that are not yet compiled or available in computer-readable form, and who are looking for an introduction to the workflows and software suited to building their research materials. We’ll learn how to use Tesseract, an open-source OCR engine, examine the anatomy of an hOCR file (the structured output that OCR produces), and practice techniques for extracting structured information from a page.
Participants will need a computer with a text editor installed (such as BBEdit, TextWrangler, Atom, Notepad++, or the like) and administrator access to install open-source software (Tesseract).
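As a small preview of the hOCR anatomy mentioned in the description, here is a minimal Python sketch that pulls each recognized word, its bounding box, and its confidence score out of an hOCR fragment. The sample markup is hand-written for illustration and is not output from a real scan; the names `SAMPLE_HOCR`, `HocrWordParser`, and `extract_words`, as well as the confidence cutoff of 80, are assumptions made for this example and not part of the workshop materials.

```python
import re
from html.parser import HTMLParser

# Hypothetical hOCR fragment, hand-written for illustration only.
# In hOCR, each element's "title" attribute carries layout data such as
# the bounding box (bbox) and, for words, a confidence value (x_wconf).
SAMPLE_HOCR = """
<div class='ocr_page' title='bbox 0 0 600 200'>
  <span class='ocr_line' title='bbox 36 92 300 116'>
    <span class='ocrx_word' title='bbox 36 92 96 116; x_wconf 95'>Hello</span>
    <span class='ocrx_word' title='bbox 109 92 200 116; x_wconf 61'>wor1d</span>
  </span>
</div>
"""

BBOX_RE = re.compile(r"bbox (\d+) (\d+) (\d+) (\d+)")
CONF_RE = re.compile(r"x_wconf (\d+)")


class HocrWordParser(HTMLParser):
    """Collect text, bounding box, and confidence for each ocrx_word span."""

    def __init__(self):
        super().__init__()
        self.words = []
        self._current = None  # the word being read, if any

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if "ocrx_word" in (attrs.get("class") or "").split():
            title = attrs.get("title") or ""
            bbox = BBOX_RE.search(title)
            conf = CONF_RE.search(title)
            self._current = {
                "text": "",
                "bbox": tuple(int(n) for n in bbox.groups()) if bbox else None,
                "conf": int(conf.group(1)) if conf else None,
            }

    def handle_data(self, data):
        if self._current is not None:
            self._current["text"] += data

    def handle_endtag(self, tag):
        if tag == "span" and self._current is not None:
            self.words.append(self._current)
            self._current = None


def extract_words(hocr):
    """Return a list of word records found in an hOCR string."""
    parser = HocrWordParser()
    parser.feed(hocr)
    parser.close()
    return parser.words


words = extract_words(SAMPLE_HOCR)
# Flag words below an arbitrary confidence cutoff for manual correction.
low_confidence = [w["text"] for w in words if w["conf"] is not None and w["conf"] < 80]
for w in words:
    print(w["text"], w["bbox"], w["conf"])
print("Needs review:", low_confidence)
```

Tesseract can write hOCR from the command line with something like `tesseract page.png page hocr` (the file names here are placeholders), and a file produced that way could be read and passed to `extract_words` in place of the sample string. Filtering on confidence is just one possible starting point for post-OCR correction.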