The Historian's Craft and AI (Part III)
The archive lies at the heart of historical research. To highlight its importance, the archive box is colored light green in the workflow diagram below.
Let’s jump into the archive process to see where AI might be helpful. Broadly speaking, archive documents fall into four categories: Images, Mixed Image / Text, Handwritten, and Printed. These are rough categories; I am not trying to define a final or ultimate archive classification here. The critical point to keep in mind right now is that AI models train best with simple text or image files. They can train on annotated files, but those files should be free of XML tags. I realize this contradicts accepted digital humanities data preparation practices, specifically those promoted by the Text Encoding Initiative (TEI). In a TEI project, the scholar’s first task is to mark up a document set using the organization’s standardized XML tags. During training, though, the model treats those repeated tags as part of the text, which can gum up the process and create data distortions where they are not wanted or needed.
When TEI was launched in 1987, it was a great idea, allowing humanities scholars to share annotated documents with each other seamlessly. It still is. TEI delivers on its promise of document interoperability, and it should not be abandoned. So, how should one move forward? Here's what we did on a recent project. We first made a copy of the dataset and then wrote and ran a function to strip all the XML tags from the documents in that copy. This left us with two datasets, one with tags for digital humanities work and another for AI model training.
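For readers who want to try the same split, here is a minimal sketch in Python of what such a tag-stripping step might look like. It is an illustration, not the project's actual code: the function names `strip_xml_tags` and `make_plain_text_copy`, the one-folder-of-.xml-files layout, and the choice to collapse whitespace are all assumptions. It relies only on Python's standard `xml.etree.ElementTree` module.

```python
import re
import xml.etree.ElementTree as ET
from pathlib import Path


def _collapse(root):
    """Join all text inside an element tree and collapse runs of whitespace."""
    text = "".join(root.itertext())
    return re.sub(r"\s+", " ", text).strip()


def strip_xml_tags(xml_text):
    """Return only the text content of a well-formed XML string (hypothetical helper)."""
    return _collapse(ET.fromstring(xml_text))


def make_plain_text_copy(tagged_dir, plain_dir):
    """Write a tag-free .txt file for every .xml file in tagged_dir.

    The tagged originals are only read, never modified, so both
    datasets exist side by side afterwards. Folder layout is an assumption.
    """
    src, dst = Path(tagged_dir), Path(plain_dir)
    dst.mkdir(parents=True, exist_ok=True)
    for xml_path in sorted(src.glob("*.xml")):
        root = ET.parse(str(xml_path)).getroot()
        out_path = dst / (xml_path.stem + ".txt")
        out_path.write_text(_collapse(root), encoding="utf-8")


sample = "<p>The <persName>Duke</persName> arrived in <placeName>Paris</placeName>.</p>"
print(strip_xml_tags(sample))
```

Note that this sketch keeps every piece of text in the file, including anything inside a TEI header. Whether header metadata belongs in the training copy is a project decision, and a real function would likely filter it.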
Let’s now turn our attention to specific document processing workflows. The first is the mixed-document workflow, for documents that contain both images and text. Here we see a book that was created by scribes and illuminators prior to or shortly after Gutenberg’s invention of the printing press circa 1436. Books printed shortly after the introduction of the printing press (by convention, before 1501) are now called incunabula, from the Latin for "swaddling clothes" or "cradle." During this period, handwritten texts continued to be produced even as printed works entered the market. Here’s the initial workflow for artifacts written by hand. It all begins with a scan of the document, letter, or text.
I will not discuss scanning technology or the various kinds of scanners here, as that would take too much time. This is where a friendly archivist enters the picture, a professional who can help you design a document-scanning workflow that fits your budget and project. Once a clean scan is available, we then use AI object detection to split the document into two parts, iconography (images) and handwritten text. A format split is necessary because the underlying AI models are format-specific. That is, model architecture differs depending on the type of data being processed. From there, the content is passed to either the iconography or the handwriting workflow. We will examine the details of those two processes and variations of them in a future post.
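To make the format split a little more concrete, here is a rough sketch in Python of the routing step that follows detection. Everything in it is hypothetical: the `Region` record, the two label names, the 0.5 confidence cutoff, and the extra "review" queue are assumptions for illustration, and the object detection model itself is not shown. The sketch only assumes that a detector returns labeled boxes with confidence scores.

```python
from dataclasses import dataclass
from typing import Dict, List, Tuple


@dataclass
class Region:
    """One detected area of a scanned page (hypothetical detector output)."""
    label: str                       # assumed labels: "iconography" or "handwriting"
    box: Tuple[int, int, int, int]   # left, top, right, bottom in pixels
    score: float                     # detector confidence between 0 and 1


def route_regions(regions: List[Region], min_score: float = 0.5) -> Dict[str, List[Region]]:
    """Sort detected regions into one queue per downstream workflow.

    Low-confidence or unrecognized regions go to a "review" queue so a
    person can check them instead of the pipeline guessing.
    """
    queues: Dict[str, List[Region]] = {"iconography": [], "handwriting": [], "review": []}
    for region in regions:
        if region.score >= min_score and region.label in ("iconography", "handwriting"):
            queues[region.label].append(region)
        else:
            queues["review"].append(region)
    return queues


page = [
    Region("iconography", (40, 30, 400, 380), 0.93),
    Region("handwriting", (40, 400, 760, 1100), 0.88),
    Region("handwriting", (700, 20, 760, 60), 0.21),  # faint marginal note
]
queues = route_regions(page)
print({name: len(items) for name, items in queues.items()})
```

The review queue reflects a design choice rather than a requirement: with fragile or unusual manuscripts, it may be safer to have a person look at uncertain regions than to send them to the wrong model.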