Difference between revisions of "HTR"

From Transkribus Wiki

Revision as of 14:09, 19 December 2016

Training

To train a new HTR model using the new API, first create a configuration XML. Besides the training parameters (the example below shows their default values), the mandatory fields are:

  • a model name
  • a description
  • the language
  • the collection ID where the input documents can be found and where the resulting model will be linked
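
As a rough illustration, a descriptor could be checked for these mandatory fields before it is submitted. The following is only a sketch: the helper name missing_mandatory_fields is hypothetical, and the element names (modelName, description, language, colId) are taken from the example descriptor on this page, not from a published schema.

```python
import xml.etree.ElementTree as ET

# Element names as they appear in the example descriptor on this page.
MANDATORY_FIELDS = ("modelName", "description", "language", "colId")


def missing_mandatory_fields(descriptor_xml):
    """Return the mandatory elements that are absent or empty.

    Hypothetical helper, not part of the Transkribus API.
    """
    root = ET.fromstring(descriptor_xml)
    missing = []
    for name in MANDATORY_FIELDS:
        text = root.findtext(name)  # None if the element does not exist
        if text is None or not text.strip():
            missing.append(name)
    return missing
```

An empty result means that all four fields are present and non-empty. Whether the server applies further checks is not covered here.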

The input for training is described in the TrainList section of the XML, which is made up of train elements, each of which includes:

  • the document ID
  • a list of pages, where each page includes:
    • the page-ID
    • the ID of the transcript version that should be used for training

Optionally, a test set can be specified analogously in the TestList element.

The training descriptor should then look like this:

<?xml version="1.0" encoding="UTF-8" standalone="yes"?>
<uroHtrTrainConfig>
    <modelName>Test Model</modelName>
    <description>A description</description>
    <language>German</language>
    <colId>2</colId>
    <numEpochs>200</numEpochs>
    <learningRate>2e-3</learningRate>
    <noise>both</noise>
    <trainSizePerEpoch>1000</trainSizePerEpoch>
    <TrainList>
        <train>
            <docId>1</docId>
            <pageList>
                <pages>
                    <pageId>1</pageId>
                    <tsId>1</tsId>
                </pages>
                <pages>
                    <pageId>2</pageId>
                    <tsId>2</tsId>
                </pages>
            </pageList>
        </train>
        <train>
            <docId>2</docId>
            <pageList>
                <pages>
                    <pageId>3</pageId>
                    <tsId>3</tsId>
                </pages>
                <pages>
                    <pageId>4</pageId>
                    <tsId>4</tsId>
                </pages>
            </pageList>
        </train>
    </TrainList>
    <TestList/>
</uroHtrTrainConfig>
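
A descriptor like the one above could also be generated programmatically instead of being written by hand. The following Python sketch uses only the standard library. The function build_train_config and its argument layout are assumptions made for illustration; the element names and default values are copied from the example above. The generated XML declaration has no standalone attribute, and it is not known here whether the server requires one.

```python
import xml.etree.ElementTree as ET


def build_train_config(model_name, description, language, col_id, train_docs,
                       num_epochs=200, learning_rate="2e-3", noise="both",
                       train_size_per_epoch=1000):
    """Build a training descriptor as UTF-8 encoded bytes.

    train_docs maps a document ID to a list of (pageId, tsId) pairs.
    Hypothetical helper; the defaults mirror the example descriptor.
    """
    root = ET.Element("uroHtrTrainConfig")
    fields = (
        ("modelName", model_name),
        ("description", description),
        ("language", language),
        ("colId", col_id),
        ("numEpochs", num_epochs),
        ("learningRate", learning_rate),
        ("noise", noise),
        ("trainSizePerEpoch", train_size_per_epoch),
    )
    for tag, value in fields:
        ET.SubElement(root, tag).text = str(value)

    train_list = ET.SubElement(root, "TrainList")
    for doc_id, pages in train_docs.items():
        train = ET.SubElement(train_list, "train")
        ET.SubElement(train, "docId").text = str(doc_id)
        page_list = ET.SubElement(train, "pageList")
        for page_id, ts_id in pages:
            page = ET.SubElement(page_list, "pages")
            ET.SubElement(page, "pageId").text = str(page_id)
            ET.SubElement(page, "tsId").text = str(ts_id)

    ET.SubElement(root, "TestList")  # no test set specified
    # xml_declaration requires Python 3.8 or later
    return ET.tostring(root, encoding="UTF-8", xml_declaration=True)
```

For the example above, the call would be build_train_config("Test Model", "A description", "German", 2, {1: [(1, 1), (2, 2)], 2: [(3, 3), (4, 4)]}).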

This XML is then sent via POST to

https://transkribus.eu/TrpServer/rest/recognition/htrTrainingCITlab

and the call returns the job-ID of the training.
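
The POST call might be sketched in Python as follows, using only the standard library. Several details are assumptions that this page does not state: the Content-Type header, authentication through a JSESSIONID session cookie, and the job-ID being returned as plain text in the response body. The function names are hypothetical.

```python
import urllib.request

TRAIN_URL = "https://transkribus.eu/TrpServer/rest/recognition/htrTrainingCITlab"


def build_training_request(descriptor_xml, session_id=None):
    """Prepare the POST request carrying the training descriptor (bytes)."""
    request = urllib.request.Request(
        TRAIN_URL,
        data=descriptor_xml,
        method="POST",
        # Assumption: the server expects the descriptor as application/xml.
        headers={"Content-Type": "application/xml"},
    )
    if session_id is not None:
        # Assumption: a logged-in session is identified by a JSESSIONID cookie.
        request.add_header("Cookie", "JSESSIONID=" + session_id)
    return request


def submit_training(descriptor_xml, session_id=None):
    """Send the descriptor and return the response body as text.

    Assumption: the body contains the job-ID of the training.
    """
    request = build_training_request(descriptor_xml, session_id)
    with urllib.request.urlopen(request) as response:
        return response.read().decode("utf-8").strip()
```

Building the request separately from sending it makes it possible to inspect the URL, headers and body without contacting the server.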

Note that models are now linked to the collection in which their training was started (cf. the colId element in the training descriptor XML).