155 — TexT - Text Extractor Tool for Handwritten Document Transcription and Annotation

Hast et al. (arXiv:1801.05367)

Read on 22 January 2018
#OCR  #text-recognition  #HTR  #handwriting  #handwritten-text-recognition  #transcription  #cultural-heritage 

TexT is a tool that combines a novel handwriting-recognition system with a user-friendly front end to simplify transcription of historical or otherwise handwritten documents. Handwritten text recognition (HTR) is hard because writing styles vary enormously, and so do the type and quality of the scans. Most handwritten documents are still annotated by hand because fully automatic recognition is not yet reliable enough to replace a human (even though OCR on printed text is nearly trivial now).

TexT aims to fill that gap by progressively refining the annotation of the target text: while a human annotates in real time, a machine teammate contributes labels for words that look similar to, or duplicate, words that have already been annotated.
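
To make the human-plus-machine loop concrete, here is a minimal Python sketch of label propagation. It is not the word-spotting method from the paper: it assumes each segmented word image has already been reduced to a numeric feature vector, and it uses plain cosine similarity with an arbitrary threshold. The names `propagate_label`, `features`, `labels`, and `threshold` are all hypothetical.

```python
from math import sqrt


def cosine_similarity(a, b):
    """Cosine similarity between two equal-length feature vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = sqrt(sum(x * x for x in a))
    norm_b = sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def propagate_label(features, labels, annotated_id, text, threshold=0.95):
    """Record a human label, then copy it to similar unlabelled words.

    features: dict of word id -> feature vector (assumed precomputed)
    labels:   dict of word id -> transcription, updated in place
    Returns the ids that received a machine-suggested label.
    """
    labels[annotated_id] = text
    query = features[annotated_id]
    suggested = []
    for word_id, vector in features.items():
        if word_id in labels:
            continue  # never overwrite an existing annotation
        if cosine_similarity(query, vector) >= threshold:
            labels[word_id] = text
            suggested.append(word_id)
    return suggested


# Toy data: w2 is nearly identical to w1, w3 is unrelated.
features = {
    "w1": [1.0, 0.0, 0.2],
    "w2": [0.9, 0.1, 0.2],
    "w3": [0.0, 1.0, 0.0],
}
labels = {}
suggested = propagate_label(features, labels, "w1", "anno")
# suggested == ["w2"]; w3 stays unlabelled for the human to handle
```

In a real tool the machine-suggested labels would presumably be flagged for the annotator to confirm or reject, which is what makes the annotation "progressively refined" rather than fully automatic.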

The system is currently implemented in MATLAB, which may make it tricky to scale to a larger corpus. However, I think the idea could be ported fairly easily to more scalable technologies: computer vision has a large number of libraries and modules to build on. I also imagine that many fields, such as medicine, law, and sociology, would benefit greatly from a similar system.