Difference between revisions of "Main Page"

From Transkribus Wiki

Revision as of 16:41, 26 November 2015

Transkribus Website
Transkribus Expert Interface

This page provides an overview of Transkribus.

Transkribus is a comprehensive Transcription and Recognition Platform (TRP) consisting of

The main objective of Transkribus is to support users who are engaged in the transcription of printed or handwritten documents, namely humanities scholars, archives, volunteers, and computer scientists. Transkribus offers a number of tools for the automated processing of documents, such as Handwritten Text Recognition (HTR), Layout Analysis, Document Understanding, and Writer Identification. As a special service we have also included the ABBYY FineReader Engine 11 for processing Gothic letter (Fraktur) fonts. All included services are free of charge.

Transkribus is still under development; a first beta release is expected at the end of 2015. Though it is not an Open Source project, we are happy to share our work with institutions interested in its further development.

The platform is hosted by the University of Innsbruck, Digitisation and Digital Preservation group (DEA) and supported by the European Commission. The main software for recognizing handwritten text was developed within the FP7 Project tranScriptorium, coordinated by the Pattern Recognition and Human Language Technology Research Centre (PRHLT) of the Technical University Valencia.

Getting started

Transkribus is free: Register, download, upload your documents, and use them in the way you like.

Transkribus provides a sheltered place: You do not need to share the documents you upload to the platform. Only users authorized by you will be able to view your documents, but you can also use Transkribus to build up your team.

Transkribus needs segmented images before you can start to transcribe: HTR needs to know where the baseline of a text can be found, otherwise it will fail. The segmentation step is therefore crucial and has to be done, either automatically or manually.

Transkribus has to be trained: HTR (Handwritten Text Recognition) is NOT like OCR where you press the button and your handwritten document will be recognized automatically. We hope that in some years this will be the case, but for now the HTR needs to be trained for your specific writing style. As a rule of thumb: You have to transcribe 50 pages beforehand for the training to make sense.

Transkribus also offers OCR for printed documents: We have included an OCR engine capable of recognizing printed text, including Gothic fonts and the long "s".

Transkribus is connected to a Cloud infrastructure: The documents are stored at the Central Computer Service of the University of Innsbruck, and the tools also run on this central infrastructure. The reasons are twofold: Firstly, HTR processing requires considerable computing power, which your local computer would not be able to provide. Secondly, the more transcribed documents are processed in Transkribus, the higher the chance to improve the tools, since no user actions get lost and everything contributes to a continuous improvement cycle.

Sample Documents

Try out a local document

Download the following ZIP file which contains three documents:

  • HTR_Reichsgericht: A page from court decisions in Germany. The text comes directly from the HTR. PDF File
  • George Forrest Herbarium Specimens: A page from a collection of specimens gathered by George Forrest. Thanks to the Natural History Museum Edinburgh, which organized a small test project with this kind of material. PDF File
  • Briefwechsel_Goethe_Schiller_1794-1795: These two pages were produced manually and will serve as training data for the HTR - so that it will be able to read the hands of Friedrich Schiller and Johann Wolfgang von Goethe. PDF File

Open the directories locally in Transkribus. You will understand the main concepts, how image and text are connected, how you can render segmentation and text and which export formats we support.

To get a preview of how the documents may look after the automated or manual transcription process, have a look at the PDF files, which were produced automatically with Transkribus.

Once you have experienced the general "look and feel", log in to the Transkribus Platform with your user account, go to the Transkribus Cloud Collection and try out the tools (which are only available via the Cloud).

Try out a sample document in the Transkribus Cloud

  • In the Transkribus Cloud collection you will find several sample documents with which you can play around and try out some of the features. This is the easiest way to explore the platform.

Document ID1927: HTR Reichsgericht

  • The document was written at the beginning of the 20th century at the High Court (Reichsgericht) in Germany in Kurrent script. Professional writers edited the document, so the lines are straight and there are no additions or deletions, which makes the task much easier for the Layout Analysis tool.
  • The HTR model was trained on 88 pages; no extra language data were used to improve the recognition rate.
  • Because many users play around with this document, it is very likely that you will get a "destroyed" version. Go to the "Versions Tab" and open the second oldest version, which was produced by the HTR. Then you will see the result of a real-world HTR run.

Note: Do not run the ABBYY FineReader on the HTR document! It is made for printed text, not for handwritten text.

  • You may also want to carry out computer-assisted transcription. In this case, use the "HTR" version and enable the HTR suggestions or CATTI in the Text Editor to get assistance from the HTR engine.
  • The correct way to try out HTR is:

(1) Run "Region detection"

(2) Run "Line and Baseline detection"

(3) Select "HTR_Training"

(4) Run HTR Processing (it will take several minutes!)

Afterwards you will see the expected results.

Note: There is a known issue with the CATTI server sometimes "overruling" the user, which makes current use a bit cumbersome. This will be resolved in one of the coming versions.

Document ID558: OCR Sample Document - Gothic letter

  • This document has 12 pages and consists of several images from various documents from the 17th to the 20th century. The main purpose is to show the general performance of the ABBYY FineReader SDK 11. It also helps you get familiar with the concepts of text regions, line regions, baselines and word regions.
  • The sample also shows what a well-scanned image should look like: straight lines, cropped print space (no black borders), 300-400 ppi and, if possible, 24-bit colour. Good image quality is one of the main prerequisites for every kind of image processing and pattern recognition.
  • Important: Use the "Word based" button in the Text Editor to display the text!

How to train the HTR engine to read handwritten documents?

Prerequisites

  • You have a collection of several hundreds or thousands of pages, handwritten or printed (early modern printing)
  • You want to transcribe these pages anyway, or you are interested in searching the full text (without prior transcription)
  • You have done a test with e.g. 5-10 pages and the performance of the automated line/baseline segmentation is satisfactory
  • Contact us to examine your test pages. We are happy to provide you expertise, advice and support!
  • Note: The HTR_sample document in the Transkribus Cloud collection was processed exactly in the way it is described above.

Basic Workflow

  • Upload your images with the FTP Upload facility (works really fast) provided by Transkribus
  • Segment a page into Text Regions (rectangles are sufficient), either manually, or automatically (e.g. 1 minute)
  • Segment all text regions into line/baselines automatically (e.g. 30 seconds)
  • Correct baseline segmentation (no need to correct line regions!) (e.g. 1-5 minutes)
  • Transcribe text line by line (e.g. 30-60 minutes)
  • Do this for at least 50 pages, or 2000 lines (40 lines per page). Note: The more transcribed text is available for learning, the better.
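
To get a feel for the effort involved, the example timings above can simply be added up. The following Python sketch is only a rough estimate: it assumes that every "e.g." timing applies per page and that text regions are detected automatically. The names STEP_MINUTES and estimate_hours are made up for this illustration.

```python
# Rough effort estimate for preparing HTR training data.
# Assumption: the example timings from the workflow apply per page.

# (minimum, maximum) minutes per page for each workflow step
STEP_MINUTES = {
    "text regions": (1.0, 1.0),
    "lines/baselines": (0.5, 0.5),
    "baseline correction": (1.0, 5.0),
    "transcription": (30.0, 60.0),
}


def estimate_hours(pages):
    """Return (min_hours, max_hours) needed to prepare `pages` pages."""
    min_minutes = sum(low for low, _ in STEP_MINUTES.values()) * pages
    max_minutes = sum(high for _, high in STEP_MINUTES.values()) * pages
    return min_minutes / 60, max_minutes / 60


if __name__ == "__main__":
    low, high = estimate_hours(50)
    print(f"50 pages: about {low:.0f} to {high:.0f} hours")
```

Under these assumptions, 50 pages take roughly 27 to 55 hours, almost all of which is spent on the transcription itself.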

Training and Recognition

  • Once you have 50 pages available, contact us so that we train the HTR engine on your reference data. This will take several hours, depending on the size of the documents.
  • Afterwards you can select your HTR model within the Tools Tab and apply it to a given page image. Note: The pages to be recognized must also have text regions and lines/baselines defined.
  • If you have more than one page already segmented into blocks and lines/baselines, we can run the HTR engine on all pages, so that you do not need to start it separately for each page.

Correction and Search

  • To correct the automatically transcribed text you can use the Text Editor in Transkribus. Optionally, you can work with Computer Assisted Text Transcription (CATTI) and HTR suggestions enabled. Both interfaces should reduce your correction work.
  • If you want to search in the full-text you need to download the document as PDF or RTF. Note: The University of Valencia is currently working on an indexing and search interface which will offer a more convenient way to search in automatically processed documents. A demonstrator is available at the Transcriptorium Website.

Improve your HTR results

  • Using "good" images is the most important prerequisite for HTR and OCR. More about this in the section below.
  • If you have text similar to the one you are transcribing (e.g. from other transcriptions, from the Internet, etc.), provide us with this text so that we can include it in the HTR model.

How to measure the performance of HTR and OCR

  • The Tools Tab offers you a tool to measure the performance of the HTR (and OCR) with the Word Error Rate and the Character Error Rate. Compare a reference page (the page with the result you expect) and the HTR page.
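
For readers who want to see what these two measures mean, here is a minimal Python sketch. It is not the implementation used in the Tools Tab; it assumes the common definition in which the error rate is the edit (Levenshtein) distance between reference and HTR output, divided by the length of the reference, counted in words for the Word Error Rate and in characters for the Character Error Rate. All function names are chosen for this example.

```python
def edit_distance(reference, hypothesis):
    """Levenshtein distance between two sequences (strings or word lists)."""
    previous = list(range(len(hypothesis) + 1))
    for i, ref_item in enumerate(reference, start=1):
        current = [i]
        for j, hyp_item in enumerate(hypothesis, start=1):
            substitution = previous[j - 1] + (ref_item != hyp_item)
            deletion = previous[j] + 1
            insertion = current[j - 1] + 1
            current.append(min(substitution, deletion, insertion))
        previous = current
    return previous[-1]


def character_error_rate(reference, hypothesis):
    """Character edits needed, relative to the reference length."""
    if not reference:
        raise ValueError("reference text must not be empty")
    return edit_distance(reference, hypothesis) / len(reference)


def word_error_rate(reference, hypothesis):
    """Word edits needed, relative to the number of reference words."""
    reference_words = reference.split()
    if not reference_words:
        raise ValueError("reference text must not be empty")
    return edit_distance(reference_words, hypothesis.split()) / len(reference_words)


if __name__ == "__main__":
    reference = "the court decided"   # the result you expect
    hypothesis = "the cuort decided"  # hypothetical HTR output
    print("WER:", word_error_rate(reference, hypothesis))
    print("CER:", character_error_rate(reference, hypothesis))
```

In this made-up example one of three words is wrong (WER 1/3), while only two of seventeen characters need to be changed (CER 2/17), which shows why the Word Error Rate is usually the higher of the two figures.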

Installation

Supported Operating Systems

  • Transkribus is written in Java and therefore runs on Windows, Mac (Apple) as well as Linux.
  • Important: Java 7 or above needs to be installed on your computer, which should be the case for most computers
  • If you need to check your Java version: https://www.java.com/de/download/help/version_manual.xml
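
The version check can also be scripted. The following Python sketch is not part of Transkribus; it assumes that a `java` command is available on the command line and that its `-version` output contains a quoted version string such as "1.7.0_25" (old scheme) or "11.0.2" (newer scheme). The function names are made up for this example.

```python
import re
import subprocess


def parse_java_major(version_output):
    """Extract the major Java version from `java -version` output.

    Handles the old "1.7.0_25" scheme (major version 7) and the newer
    "11.0.2" scheme (major version 11). Returns None if nothing matches.
    """
    match = re.search(r'version "(\d+)(?:\.(\d+))?', version_output)
    if match is None:
        return None
    first, second = match.group(1), match.group(2)
    if first == "1" and second is not None:
        return int(second)
    return int(first)


def installed_java_major():
    """Run `java -version` and return the major version, or None."""
    try:
        result = subprocess.run(
            ["java", "-version"],
            stdout=subprocess.PIPE,
            stderr=subprocess.PIPE,
            universal_newlines=True,
        )
    except OSError:  # java is not on the PATH
        return None
    # `java -version` usually writes to stderr, so both streams are checked.
    return parse_java_major(result.stderr + result.stdout)


if __name__ == "__main__":
    major = installed_java_major()
    if major is None:
        print("No Java installation found on the command line.")
    elif major < 7:
        print(f"Java {major} found, but Java 7 or above is required.")
    else:
        print(f"Java {major} found.")
```

Note that this reports the Java version that is the default on the command line, which may differ from the newest version installed on the machine.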

Unzip ZIP File

  • After download you will see a ZIP File in the download directory of your computer.
  • Unzip the file before you try to start an executable file.

Run Transkribus via an executable file: .exe, .command, .sh

  • Open the Transkribus directory. You will find there the executable files for your operating system.
  • Start Transkribus from your user interface with a double-click:
    • Windows: Transkribus.bat or Transkribus.exe
    • Mac OS - Apple: Transkribus.command
    • Linux: Transkribus.sh

Notes for first launch on Windows

  • If you do not have "Administrator" rights, Windows will show a warning message such as "Your Computer is Protected by Windows".
  • Do not confirm, but go to "More Information". There you can confirm that this is not malware and that you want to run Transkribus on your computer.

Notes for first launch on Mac

  • If you run the program for the first time, it may not start because it is a non-signed application ("... can't be opened because it is from an unidentified developer" message).
  • In this case, right-click (or control-click) the application and choose "Open". In the dialog that appears, click "Open" again!

Run Transkribus via command line

  • Transkribus is contained in the main jar file Transkribus-<version>.jar
  • To run the program from command line type: java -jar Transkribus-<version>.jar
  • Note: Java 7 is needed to run the program. Make sure Java 7 is installed system-wide, or copy a JRE into the program directory!
  • Note: For any version before 0.6.8, you may have to make the scripts executable from the command line on Mac (or Linux):
    • Mac console basics
    • change into the program folder using 'cd' commands
    • chmod +x Transkribus.command (or chmod +x Transkribus.sh for Linux!)
  • Furthermore, you will find several files in the Transkribus package copied to your computer:
    • config.properties can be modified to adjust simple appearance properties
    • virtualKeyboards.xml can be used to specify a set of virtual keyboards
    • logback.xml can be modified to adjust logging properties (for expert users only)
  • The 'libs' subfolder contains the necessary libraries for all platforms. Currently supported are:
    • Windows 32/64 bit
    • Linux 32/64 bit
    • OSX 64 bit

Known Issues

Logging in to the Server is not possible via Transkribus, but on the website it works.

  • Solution: There is a known issue with specific versions of Java 7 (e.g. Java 7u25). You can check your installed version by opening a terminal/command line and entering "java -version". If you encounter this problem, try updating Java on your machine.

Logging in is prevented by the Firewall of your Internet Provider

  • Some IT departments are blocking the SSL port 443 and/or unknown applications via a firewall. Check with your IT department if that might be the case.
  • In some cases it may be necessary to start the application from the command line with proxy settings. Enter the following as a single line, e.g.
   java -Dhttps.proxyHost=<proxyserver>
        -Dhttps.proxyPort=<proxyPort>
        -Dhttps.proxyUser=<user name for proxy>
        -Dhttps.proxyPassword=<password for proxy>
        -jar Transkribus-0.7.0.jar
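
Because the command is long and easy to mistype, it can help to assemble it in a small script. The Python sketch below only builds the argument list from the options shown above; the function name and the proxy host and port in the example are hypothetical and must be replaced by the values from your IT department.

```python
def build_proxy_command(jar, host, port, user=None, password=None):
    """Build the argument list for starting Transkribus behind a proxy."""
    command = [
        "java",
        f"-Dhttps.proxyHost={host}",
        f"-Dhttps.proxyPort={port}",
    ]
    # User name and password are only added if the proxy requires them.
    if user is not None:
        command.append(f"-Dhttps.proxyUser={user}")
    if password is not None:
        command.append(f"-Dhttps.proxyPassword={password}")
    command += ["-jar", jar]
    return command


if __name__ == "__main__":
    # Hypothetical proxy values, for illustration only.
    command = build_proxy_command("Transkribus-0.7.0.jar", "proxy.example.org", 8080)
    print(" ".join(command))
    # To actually start the program, pass the list to subprocess.run(command).
```

Building a list instead of one long string avoids quoting problems when a user name or password contains spaces or special characters.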


Norton Antivirus detects a threat and is blocking the zip file from being unpacked.

  • Solution: This is a false alarm which Norton gives when encountering software it is not familiar with (WS.Reputation.1). You should be able to restore the file from quarantine by following the instructions from the following resource [1].

Versions 0.6.5 and older cannot update (very long error message):

  • Please click on the "Home" button (upper left corner), then "Install a specific version", select the newest version from "Releases" and tick the box beside "Download complete package".
  • Afterwards click on "Update" or "Replace". This way, the complete package is downloaded and the update should work.

Wrong JAVA Version on Mac

  • After opening the command file on the Mac, Transkribus reports that the wrong Java version is installed (1.6.0.65 instead of 1.7), although the most current version of the Java RE (1.8.0.66) is installed.
    • The problem is that Java 1.6.0.65 is the default Java on the command line, which Transkribus.command uses. You can check the default version by opening the terminal and typing 'java -version'.
    • To solve the problem you can either download the latest JDK as a .tar.gz package from here:
   http://www.oracle.com/technetwork/java/javase/downloads/jdk8-downloads-2133151.html
   and unpack it into the Transkribus folder. The Transkribus.command file will automatically check for Java installations in its subdirectories!
    • Or you could make your Java 8 installation the default one on the command line, following e.g. the instructions here:
   http://myshittycode.com/2014/03/17/mac-os-x-setting-default-java-version/

Transkribus User Guide

Here you can find a detailed user's guide for the Transkribus expert tool which can be downloaded and installed on your local computer. Please feel free to contribute to the user's guide or to improve it!

Transkribus Webinterface

The webinterface of Transkribus is currently used

  • to provide some general information about Transkribus
  • to register for the Transkribus Platform
  • to modify your account settings, namely to reset your password and
  • to download the Transkribus Expert Tool

Registration at the Transkribus website

  • You need to register to be able to download and work with Transkribus.
  • Registration requires your name, your e-mail address, and that you accept our user agreement. We will also track your IP address.
  • According to Austrian data protection law we will respect your privacy and use the data only to improve our services and support research in humanities and computer science!

Download Transkribus Expert Tool

  • Once you have registered you can download the Transkribus tool from our website. The tool is platform independent and runs on Windows, Mac (Apple) as well as Linux.
  • Unzip the file before you start the programme.
  • See Installation section to get ready for starting the programme.
  • In the future it is planned to include a digital library application within the webinterface which enables you to view, edit and share your Transkribus Cloud documents with other users or the public.

Transkribus Cloud Services

The Transkribus Cloud is based on the IT infrastructure of the Central Computer Service of the University of Innsbruck. Main components are virtual servers for hosting the core Transkribus application and the integrated tools, a database server, a storage and backup facility, as well as a High Performance Computing unit.

Image Files

  • Though it is possible to work with local files, the full power of Transkribus can only be enjoyed if the documents reside in the Transkribus Cloud.
  • Once the images are uploaded they are stored and processed in the FileImageStore developed by UIBK.
  • Specifications of the FileImageStore
    • The following image formats can be processed: JPG, PNG, TIFF, JP2. Note: GIF and RAW formats from digital cameras are not supported by Transkribus
    • The resolution is always kept, but for working with the images a thumbnail and a compressed file are produced automatically.
    • Original file names are kept as metadata
    • Files get a unique address
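
Before uploading a larger batch it can be useful to sort out files in unsupported formats. The following Python sketch is not part of Transkribus; it checks file names against the formats listed above and assumes that the usual alternative extensions (.jpeg, .tif) are accepted as well. It looks only at the extension, not at the file content.

```python
import os

# Formats named in the FileImageStore specification: JPG, PNG, TIFF, JP2.
# Assumption: the common alternative spellings .jpeg and .tif also count.
SUPPORTED_EXTENSIONS = {".jpg", ".jpeg", ".png", ".tif", ".tiff", ".jp2"}


def is_supported_image(filename):
    """Return True if the file extension belongs to a supported format."""
    extension = os.path.splitext(filename)[1].lower()
    return extension in SUPPORTED_EXTENSIONS


def split_by_support(filenames):
    """Split file names into (supported, unsupported) lists."""
    supported = [name for name in filenames if is_supported_image(name)]
    unsupported = [name for name in filenames if not is_supported_image(name)]
    return supported, unsupported


if __name__ == "__main__":
    # Hypothetical file names, for illustration only.
    ok, rejected = split_by_support(["page_001.TIF", "page_002.jp2", "scan.gif"])
    print("Ready for upload:", ok)
    print("Convert first:", rejected)
```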

PAGE Files

We are using the PAGE XML file format as internal master format. It was created by the University of Salford.
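
To illustrate how image and text are connected in such a file, the Python sketch below reads the text lines of a small PAGE document. The element names (TextRegion, TextLine, Baseline, TextEquiv, Unicode) and the points attribute follow the PAGE schema; the sample document itself is hand-written for this example, and the schema version in its namespace is an assumption, so files exported from Transkribus may differ in detail.

```python
import xml.etree.ElementTree as ET

# A tiny, hand-written PAGE document for illustration only.
SAMPLE_PAGE = """
<PcGts xmlns="http://schema.primaresearch.org/PAGE/gts/pagecontent/2013-07-15">
  <Page imageFilename="page_001.jpg" imageWidth="2000" imageHeight="3000">
    <TextRegion id="r1">
      <Coords points="100,100 1900,100 1900,400 100,400"/>
      <TextLine id="r1l1">
        <Coords points="100,100 1900,100 1900,250 100,250"/>
        <Baseline points="100,220 1900,225"/>
        <TextEquiv><Unicode>In Sachen des Beklagten</Unicode></TextEquiv>
      </TextLine>
    </TextRegion>
  </Page>
</PcGts>
""".strip()


def parse_points(points):
    """Turn a PAGE points string "x1,y1 x2,y2 ..." into (x, y) tuples."""
    return [tuple(int(value) for value in pair.split(",")) for pair in points.split()]


def read_text_lines(page_xml):
    """Return a list of (line id, baseline points, text) tuples."""
    root = ET.fromstring(page_xml)
    # Reuse the namespace of the root element so other PAGE versions work too.
    namespace = root.tag[: root.tag.index("}") + 1] if root.tag.startswith("{") else ""
    lines = []
    for line in root.iter(namespace + "TextLine"):
        baseline = line.find(namespace + "Baseline")
        text = line.find(f"{namespace}TextEquiv/{namespace}Unicode")
        lines.append((
            line.get("id"),
            parse_points(baseline.get("points")) if baseline is not None else [],
            (text.text or "") if text is not None else "",
        ))
    return lines


if __name__ == "__main__":
    for line_id, baseline, text in read_text_lines(SAMPLE_PAGE):
        print(line_id, baseline, text)
```

Each text line carries both its position on the image (the baseline) and its transcription, which is what the HTR needs for training and recognition.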

REST Interfaces

Most services within Transkribus are also exposed via a RESTful interface. Developers are free to use the complete REST interface.

Questions and Answers

In the Questions and Answers section we try to respond to some of the issues which occurred when communicating with users. The Q&A gives you a good impression of the variety of issues which come up, and which may also be solved, when transcribing a text with Transkribus.

How to contribute to the Transkribus Wiki

Feel free to contribute to the Transkribus Wiki. Register directly here in the Wiki (your Transkribus user account does not work!) and start editing. There is also a German page, where we would be thankful for your input!

Consult the User's Guide on using the wiki software.

MediaWiki

Credits

This work is co-funded by the European Commission within the FP7 Project tranScriptorium (2013-2015) and the H2020 Project READ (2015-2019).

Special thanks to the many, many users who provide their feedback by email and bug report. Though we are often not able to follow your suggestions directly, we try our best to include them in the long run!