Handwritten Text Recognition Workflow
Transkribus has to be trained. Handwritten Text Recognition (HTR) is NOT like OCR, where you press a button and your document is recognized automatically. We hope to provide a general model over the course of the READ project, but for now the HTR engine needs to be trained to understand the specific writing style of your documents.
- You have a collection of several hundred or several thousand pages of digitised handwritten or printed (early modern printing) material
- You want to transcribe these pages anyway, or you are interested in being able to conduct a full-text search (without prior transcription)
- HTR technology is trained by being shown images of documents and their accurate transcriptions - we recommend starting with around 20,000 words (100 pages) of training data
- If you have existing transcriptions, we can use these as training data for HTR thanks to our new Text2Image matching tool. Contact us for more information.
- Alternatively, you can create training data in Transkribus. Upload your images to the platform, segment each image into lines using our automatic tools and then transcribe the contents of each page. See our How-to Guides for more info!
- Once you have segmented and transcribed around 20,000 words (100 pages), you have a set of training data that can be used to train the HTR engine.
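To keep track of how close a set of transcriptions is to the recommended 20,000 words, a small script can count the words transcribed so far. The following is a minimal sketch, assuming the page transcriptions are available as plain-text strings; the function names are hypothetical and are not part of Transkribus.

```python
from __future__ import annotations

# Amount of training data recommended in this guide.
TARGET_WORDS = 20_000


def count_words(text: str) -> int:
    """Count whitespace-separated tokens in one page transcription."""
    return len(text.split())


def training_progress(pages: list[str], target: int = TARGET_WORDS) -> tuple[int, float]:
    """Return the total word count and the fraction of the target reached."""
    total = sum(count_words(page) for page in pages)
    return total, total / target


if __name__ == "__main__":
    # Placeholder transcriptions; replace with your own page texts.
    pages = ["first page of transcribed text", "second page"]
    total, fraction = training_progress(pages)
    print(f"{total} words transcribed ({fraction:.1%} of {TARGET_WORDS})")
```

Counting whitespace-separated tokens is only an approximation of a word count; abbreviations, hyphenated line breaks and punctuation may be counted differently elsewhere.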
Training and Recognition
- Once you have a set of training data, you can contact the Transkribus team who will activate the training button in Transkribus for you.
- Once the training process is complete, you will be able to select your HTR model within the Tools Tab in Transkribus and apply it to other pages in your document collection. Note: pages that are to be recognized automatically by the HTR model must already have text regions and lines/baselines defined.
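The requirement that pages already have text regions and lines/baselines can be checked roughly before recognition is started. The sketch below assumes that a page's layout is available as a PAGE XML string containing `TextRegion`, `TextLine` and `Baseline` elements; this is an assumption about the export format, not something stated in this guide, and the function names are hypothetical.

```python
import xml.etree.ElementTree as ET


def local_name(tag: str) -> str:
    """Strip an XML namespace such as '{http://...}TextLine' down to 'TextLine'."""
    return tag.rsplit("}", 1)[-1]


def layout_counts(page_xml: str) -> dict:
    """Count the layout elements that recognition depends on (assumed element names)."""
    counts = {"TextRegion": 0, "TextLine": 0, "Baseline": 0}
    for element in ET.fromstring(page_xml).iter():
        name = local_name(element.tag)
        if name in counts:
            counts[name] += 1
    return counts


def ready_for_recognition(page_xml: str) -> bool:
    """Rough check: at least one region and one line, and one baseline per line."""
    counts = layout_counts(page_xml)
    return (
        counts["TextRegion"] > 0
        and counts["TextLine"] > 0
        and counts["Baseline"] == counts["TextLine"]
    )
```

This only counts elements; it does not verify that the regions and baselines are placed correctly on the image.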
Correction and Search
- You can use the Text Editor in Transkribus to correct the automatically transcribed text.
- You can conduct a full-text search of your documents via the Search button in the Main Menu.
Improve your HTR results
- Using high-quality images is the most important prerequisite for HTR and OCR.
- If you have text which is similar to your collection (e.g. from other transcriptions, from the Internet, etc.), provide us with this and we can use it to further train the HTR model.
How to measure the performance of HTR and OCR
- The Tools Tab offers a tool to measure the performance of HTR (and OCR) using the Word Error Rate and the Character Error Rate. It compares a reference page (the page with the result you expect) with the HTR page.
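To illustrate what these two measures express, the sketch below computes them in the commonly used way: the edit (Levenshtein) distance between the reference and the HTR output, divided by the length of the reference, counted in words for the Word Error Rate and in characters for the Character Error Rate. This is a minimal illustration of the general definition; the tool in Transkribus may tokenize or normalize the text differently, and the function names are hypothetical.

```python
def edit_distance(reference, hypothesis) -> int:
    """Levenshtein distance: minimum insertions, deletions and substitutions."""
    previous = list(range(len(hypothesis) + 1))
    for i, ref_item in enumerate(reference, start=1):
        current = [i]
        for j, hyp_item in enumerate(hypothesis, start=1):
            substitution = previous[j - 1] + (0 if ref_item == hyp_item else 1)
            deletion = previous[j] + 1
            insertion = current[j - 1] + 1
            current.append(min(substitution, deletion, insertion))
        previous = current
    return previous[-1]


def character_error_rate(reference: str, hypothesis: str) -> float:
    """Character-level edit distance divided by the reference length."""
    if not reference:
        raise ValueError("reference text must not be empty")
    return edit_distance(reference, hypothesis) / len(reference)


def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level edit distance divided by the number of reference words."""
    reference_words = reference.split()
    if not reference_words:
        raise ValueError("reference text must not be empty")
    return edit_distance(reference_words, hypothesis.split()) / len(reference_words)


if __name__ == "__main__":
    reference = "the quick brown fox"   # the result you expect
    hypothesis = "the quick brown dog"  # placeholder HTR output
    print(f"WER: {word_error_rate(reference, hypothesis):.2%}")
    print(f"CER: {character_error_rate(reference, hypothesis):.2%}")
```

Because both rates are divided by the reference length, they can exceed 100% when the recognized text contains many extra words or characters.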