Do you know what a site survey form looks like? No? See the one below for an example. This site survey form is from Box 15 of Series 108042.
We decided to try using Adobe to OCR the scans and export them as an Excel spreadsheet. A colleague and I both experimented with this method to see if it yielded acceptable results. Unfortunately, it did not, for two primary reasons: the number of extreme errors in the OCR text, and the incorrect orientation of the data in Excel that Adobe defaulted to when exporting text. See a snapshot below for an example.
By November, we were exploring other options. While at a professional conference in North Carolina, I asked other digital archivists about better OCR technology that could help us extract text from our site forms and use it in a database. It was at this time that ABBYY FineReader was suggested. Upon returning to South Carolina, I researched ABBYY FineReader projects and found a recent presentation by a group from the University of Maryland at MARAC (Mid-Atlantic Regional Archives Conference) 2016. This group presented on using ABBYY FineReader in conjunction with a Python script to extract data from urban renewal documents. View the presentation here. Relevant pages are 19-24.
In December 2016, I got in touch with the project group members (Greg Jansen, Mary Kendig, and Myeong Lee) to ask for further information and guidance. At this time, I also got in touch with a colleague at UNC-Chapel Hill. The digitization lab there has ABBYY FineReader and ran ten sample cards through the system in order to help us gauge the accuracy and efficiency of ABBYY for our documents.
In January 2017, we were able to access the University of Maryland’s virtual environment (courtesy of Mary Kendig) in order to use ABBYY FineReader. This trial period allowed us to assess the software's ease of use, user interface, and efficiency. At the end of February or beginning of March, we obtained a 30-day free trial of ABBYY FineReader 14 from the vendor. We used this to begin experimenting on a box of site survey cards.
On Friday, March 17, 2017, we purchased ABBYY FineReader 14 Corporate. The software is a one-time purchase without an annual fee.
Overall, we've been able to utilize our interns to the fullest by having them focus on a process that we cannot automate (scanning) while we work on automating the data entry process. We did not ultimately pursue the idea of using a Python script. We did experiment with this process, but it proved cumbersome with our site survey forms for several reasons, including the lack of standard language and the lack of a standard way of denoting continuations on the versos of the forms. However, we are able to use ABBYY to create large Excel spreadsheets, each equivalent to a box of data, and then upload this metadata to SCHPR in batch.
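To illustrate why the lack of standard language makes a scripted approach cumbersome, here is a minimal sketch of label-based field extraction from OCR text. It is not the script we experimented with; the field labels, the function name `extract_fields`, and the sample card text are all hypothetical and chosen only for illustration.

```python
import re

# Hypothetical field labels; real site survey forms vary in wording.
FIELD_LABELS = ["Site Name", "County", "Date"]


def extract_fields(ocr_text, labels=FIELD_LABELS):
    """Return {label: value} for each 'Label: value' line found in the OCR text.

    A label that does not appear verbatim at the start of a line maps to None.
    """
    fields = {}
    for label in labels:
        pattern = re.compile(
            rf"^\s*{re.escape(label)}\s*:\s*(.*)$",
            re.IGNORECASE | re.MULTILINE,
        )
        match = pattern.search(ocr_text)
        fields[label] = match.group(1).strip() if match else None
    return fields


# Invented sample text: one card uses the expected labels, the other does not.
standard_card = "Site Name: Example Mound\nCounty: Example County\nDate: 1975"
variant_card = "Name of Site: Example Mound\nCo.: Example County\nRecorded 1975"

print(extract_fields(standard_card))
print(extract_fields(variant_card))
```

The first card is parsed completely, while every field on the second card comes back empty even though it carries the same information. Each new wording needs its own rule, which is roughly the kind of maintenance burden that made this route impractical for forms without consistent labels.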
I would definitely recommend ABBYY FineReader to other archivists, and I'm glad someone recommended it to me.