Handwritten Text Recognition Workflow

From Transkribus Wiki

Revision as of 10:16, 11 December 2017

Transkribus has to be trained. Handwritten Text Recognition (HTR) is not like OCR, where you press a button and your document is recognized automatically. We hope to provide a general model over the course of the READ project, but for now the HTR engine must be trained on the specific writing style of your documents.

Prerequisites

  • You have a collection of several hundred or several thousand pages of digitised handwritten or printed (early modern printing) material
  • You want to transcribe these pages anyway, or you want to be able to run a full-text search on them (without prior transcription)

Basic Workflow

  • HTR technology is trained by being shown images of documents and their accurate transcriptions - we recommend starting with around 20,000 words (100 pages) of training data
  • If you have existing transcriptions, we can use these as training data for HTR thanks to our new Text2Image matching tool. Contact us for more information.
  • Alternatively, you can create training data in Transkribus. Upload your images to the platform, segment each image into lines using our automatic tools and then transcribe the contents of each page. See our How to Guides for more info!
  • Once you have segmented and transcribed around 20,000 words (100 pages), you have a set of training data that can be used to train the HTR engine.
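
To check how close a set of transcriptions is to the recommended 20,000 words, a simple word count is enough. The following is a minimal sketch, not part of Transkribus: it assumes you have the transcription of each page available as a plain-text string, and the names `count_words`, `training_progress` and `TARGET_WORDS` are hypothetical.

```python
# Rough progress check against the recommended amount of training data.
# Assumption: each page's transcription is available as a plain-text string.

TARGET_WORDS = 20_000  # recommended starting size (around 100 pages)


def count_words(pages):
    """Count whitespace-separated words across all page transcriptions."""
    return sum(len(text.split()) for text in pages)


def training_progress(pages, target=TARGET_WORDS):
    """Return (total_words, fraction_of_target), with the fraction capped at 1.0."""
    total = count_words(pages)
    return total, min(total / target, 1.0)


if __name__ == "__main__":
    sample_pages = [
        "In the name of God Amen",
        "I give and bequeath to my son",
    ]
    total, fraction = training_progress(sample_pages)
    print(f"{total} words transcribed ({fraction:.1%} of target)")
```

Whitespace splitting is only an approximation of a word count; abbreviations, hyphenation across lines and punctuation are not treated specially.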

Training and Recognition

  • Once you have a set of training data, contact the Transkribus team (email@transkribus.eu), who will train the HTR engine on your collection
  • Once this process is complete, you will be able to select your HTR model within the Tools Tab in Transkribus and apply it to other pages in your document collection. Note: pages that are to be recognized by the HTR model must already have text regions and lines/baselines defined

Correction and Search

  • You can use the Text Editor in Transkribus to correct the automatically transcribed text.
  • You can conduct a full-text search of your documents via the Search button in the Main Menu

Improve your HTR results

  • High-quality images are the most important prerequisite for HTR and OCR
  • If you have text which is similar to your collection (e.g. from other transcriptions, from the Internet, etc.), send it to us and we can use it to further train the HTR model.

How to measure the performance of HTR and OCR

  • The Tools Tab offers a tool that measures the performance of the HTR (and OCR) as Word Error Rate and Character Error Rate. It compares a reference page (the page with the result you expect) with the HTR page.
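
For orientation, both rates are conventionally defined as the edit distance between the reference and the recognized text, divided by the length of the reference (in words for the Word Error Rate, in characters for the Character Error Rate). The sketch below illustrates that textbook definition only. It is not the Transkribus implementation, which may normalize text differently, and the function names are hypothetical.

```python
# Textbook definitions of Character Error Rate (CER) and Word Error Rate (WER).
# Assumption: this is the standard Levenshtein-based formulation, not the
# exact calculation performed by the Transkribus comparison tool.


def edit_distance(ref, hyp):
    """Levenshtein distance between two sequences (strings or lists of words)."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def character_error_rate(reference, hypothesis):
    """Character-level edit distance divided by the reference length."""
    if not reference:
        raise ValueError("reference must not be empty")
    return edit_distance(reference, hypothesis) / len(reference)


def word_error_rate(reference, hypothesis):
    """Word-level edit distance divided by the number of reference words."""
    ref_words = reference.split()
    if not ref_words:
        raise ValueError("reference must contain at least one word")
    return edit_distance(ref_words, hypothesis.split()) / len(ref_words)


if __name__ == "__main__":
    reference = "the quick brown fox"   # corrected transcription
    hypothesis = "the quick brown fax"  # HTR output
    print(f"CER: {character_error_rate(reference, hypothesis):.3f}")
    print(f"WER: {word_error_rate(reference, hypothesis):.3f}")
```

A single wrong character counts as one full word error, which is why the Word Error Rate is usually noticeably higher than the Character Error Rate for the same page.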
