Handwritten Text Recognition Workflow

From Transkribus Wiki
Revision as of 10:09, 11 December 2017 by Louise Seaward (Talk | contribs)

Transkribus has to be trained. Handwritten Text Recognition (HTR) is not like OCR, where you press a button and the document is recognized automatically. We hope to provide a general model over the course of the READ project, but for now the HTR engine must be trained on the specific writing style of your documents.

Prerequisites

  • You have a collection of several hundred or several thousand pages of digitised handwritten or early modern printed material
  • You want to transcribe these pages anyway, or you are interested in being able to conduct a full-text search (without prior transcription)

Basic Workflow

  • Upload your images to Transkribus. We recommend working with a sample of at least 20,000 words (around 100 pages). All documents uploaded to Transkribus are private by default.
  • Segment your images into text regions, lines and baselines. Transkribus can do this with its automatic detection tools. Best practice is to draw the text regions on your documents manually and then detect lines and baselines automatically. See our How to Guides for more information.
  • Transcribe the text line by line.
  • Once you have segmented and transcribed around 20,000 words (around 100 pages), you have a set of training data that can be used to train the HTR engine.
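
To check whether a set of transcriptions has reached the recommended size, the words can simply be counted. The Python sketch below is illustrative only: it assumes the transcribed lines are already available as plain strings and that words are separated by whitespace, which may differ from how Transkribus itself counts. The function names and the sample lines are hypothetical.

```python
def count_words(transcribed_lines):
    """Count whitespace-separated words across all transcribed lines."""
    return sum(len(line.split()) for line in transcribed_lines)


def has_enough_training_data(transcribed_lines, minimum_words=20_000):
    """Return True once the transcription reaches the recommended size.

    The default of 20,000 words is the figure recommended in this workflow.
    """
    return count_words(transcribed_lines) >= minimum_words


# Hypothetical sample: two transcribed lines from one page.
sample = ["In the year", "of our Lord 1817"]
print(count_words(sample))
print(has_enough_training_data(sample))
```

With real data, the list would hold every transcribed line in the collection rather than a two-line sample.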

Training and Recognition

  • Once you have a set of training data, contact the Transkribus team (email@transkribus.eu), who will train the HTR engine on your collection.
  • Once training is complete, you can select your HTR model in the Tools Tab in Transkribus and apply it to other pages in your document collection. Note: pages that are to be recognized by the HTR model must already have text regions and lines/baselines defined.

Correction and Search

  • You can use the Text Editor in Transkribus to correct the automatically transcribed text.
  • You can run a full-text search of your documents via the Search button in the Main Menu.

Improve your HTR results

  • High-quality images are the most important prerequisite for HTR and OCR.
  • If you have text that is similar to your collection (e.g. from other transcriptions or from the Internet), send it to us and we can use it to further train the HTR model.

How to measure the performance of HTR and OCR

  • The Tools Tab offers a tool that measures the performance of HTR (and OCR) using the Word Error Rate and the Character Error Rate. Compare a reference page (the page with the result you expect) with the page produced by HTR.
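
For orientation, both rates are commonly defined as the edit (Levenshtein) distance between the reference and the recognized text, divided by the length of the reference, counted in characters for the Character Error Rate and in words for the Word Error Rate. The Python sketch below illustrates that common definition only; it is not the Transkribus implementation, the function names are hypothetical, and the tool may normalise or tokenise text differently.

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two sequences (strings or lists)."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def character_error_rate(reference, hypothesis):
    """Character-level edit distance divided by the reference length."""
    if not reference:
        raise ValueError("reference must not be empty")
    return edit_distance(reference, hypothesis) / len(reference)


def word_error_rate(reference, hypothesis):
    """Word-level edit distance divided by the number of reference words."""
    ref_words = reference.split()
    if not ref_words:
        raise ValueError("reference must contain at least one word")
    return edit_distance(ref_words, hypothesis.split()) / len(ref_words)


reference = "the quick brown fox"   # the result you expect
recognized = "the quick brwn fox"   # hypothetical HTR output
print(word_error_rate(reference, recognized))       # 0.25
print(character_error_rate(reference, recognized))  # 1/19, about 0.053
```

A single wrong character makes the whole word count as an error, so the Word Error Rate is usually noticeably higher than the Character Error Rate for the same page.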
