Difference between revisions of "Main Page"

From Transkribus Wiki

Revision as of 12:35, 7 September 2016

Transkribus Website
Transkribus Expert Interface

This page provides an overview of Transkribus.

Transkribus is a comprehensive Transcription and Recognition Platform (TRP) consisting of

The main objective of Transkribus is to support users who are engaged in the transcription of printed or handwritten documents, namely humanities scholars, archives, volunteers - and computer scientists.

Transkribus offers a number of tools for the automated processing of documents, such as Handwritten Text Recognition (HTR), Layout Analysis, Document Understanding, and Writer Identification. As a special service we also offer the ABBYY FineReader Engine 11 OCR software for processing Gothic letter (Fraktur) fonts. All included services are available for free.

Transkribus is still under development. Most parts of Transkribus are Open Source and we are happy to share our work with institutions interested in its further development. Visit our GitHub repository for further information.

The platform is hosted by the University of Innsbruck, Digitisation and Digital Preservation group (DEA) and supported by the European Commission. Until the end of 2015 the software for recognizing handwritten text was developed within the FP7 Project tranScriptorium, coordinated by the Pattern Recognition and Human Language Technology Research Centre (PRHLT) of the Technical University of Valencia. Since 2016 Transkribus has received support via the H2020 Project READ (Recognition and Enrichment of Archival Documents), coordinated by the DEA group.

Basic concepts

Transkribus is Open Source and services are free: Register, download, upload your documents, and use them in the way you like.

Transkribus provides a sheltered space: You do not need to share the documents you upload to the platform. Only users whom you authorize will be able to view them, and you can also use Transkribus to build up a team.

Transkribus needs segmented images before you can start to transcribe: The Handwritten Text Recognition (HTR) engine needs to know where the baseline of a text can be found, otherwise it cannot work. The segmentation step is therefore crucial and has to be done, either automatically or manually.

Transkribus has to be trained: HTR is NOT like OCR - where you press the button and your handwritten document will be recognized automatically. We hope that over the course of the READ project we will be able to provide a general model, but for now the HTR needs to be trained to understand the specific writing style of your documents. As a rule of thumb: you have to transcribe at least 50 pages from your document collection for the training to make sense to the computer.

Transkribus also offers OCR for printed documents: We have included an OCR engine capable of recognizing printed text, including Gothic fonts and the long "s".

Transkribus is connected to a Cloud infrastructure: The documents are stored at the Central Computer Service of the University of Innsbruck, and the tools also run on this central infrastructure. The reasons are twofold: Firstly, HTR processing requires far more computing power than a local computer can provide. Secondly, the more transcribed documents are processed in Transkribus, the higher the chance of improving the tools. Since no user activity gets lost, every action contributes to a continuous cycle of improvement!

Transkribus Example Package

How to use Transkribus

How to use Transkribus - in 10 steps (or less)

This paper gives newcomers a basic overview of how to work with the Transkribus platform.

Download Example Package

  • Download the Example Package. It is a ZIP file and consists of six pages where we explain and show some of the most important rules for transcribing text. You can upload these pages to Transkribus and play around with them!
  • Also download the export files of the Example Package. Transkribus can produce such files at any time. You will find:
    • a PDF file with image in the foreground, text in the background, extra text pages and highlighted tags
    • a TEI (Text Encoding Initiative) file (for experts)
    • a Word File with line breaks according to the original document and highlighted tags

How To Papers

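These three papers provide a more detailed introduction to Transkribus and should answer some of your initial questions.

  • How To Use Transkribus - in 10 Steps
  • How To Prepare Test Projects with Transkribus - for Archives and Libraries
  • How To Transcribe Documents with Transkribus - Simple Mode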

Transkribus User's Guide

Here you can find a detailed user's guide for the Transkribus tool which can be downloaded and installed on your local computer. Please feel free to contribute to the user's guide or to improve it!

Questions and Answers

In the Questions and Answers section we try to respond to some of the common questions of new users. The Q&A gives you a good impression of the variety of issues that can come up, and be solved, when transcribing a text with Transkribus.

Transkribus Webinterface

At the Transkribus webinterface you can:

  • read some general information about Transkribus
  • register for the Transkribus Platform
  • modify your account settings, for example to reset your password
  • download the Transkribus Expert Tool

Registration at the Transkribus website

  • You need to register to be able to download and work with Transkribus.
  • Registration requires your name, your e-mail address, and that you accept our user agreement. We will also track your IP address.
  • In accordance with Austrian data protection law we will respect your privacy and use the data only to improve our services and support research in humanities and computer science!

Download Transkribus Expert Tool

  • Once you have registered, you can download the Transkribus tool from our website. The tool is platform independent and runs on Windows, Mac (Apple) and Linux.
  • Unzip the file before you start the programme.
  • See Installation section for help installing and starting the programme.
  • A digital library application is planned for the webinterface. It will let you view, edit and share your Transkribus Cloud documents with other users or the public.

How to train the HTR engine to read handwritten documents?

Prerequisites

  • You have a collection of several hundred or several thousand pages, handwritten or printed (early modern printing)
  • You want to transcribe these pages anyway, or you are interested in being able to conduct a full-text search (without prior transcription)
  • You have uploaded around 5-10 pages to Transkribus as a test and are happy with the performance of the automated line/baseline segmentation
  • Contact us to examine your test pages. We are happy to provide you with expertise, advice and support!
  • Note: The HTR_sample document in the Transkribus Cloud collection was processed exactly in the way described above.

Basic Workflow

  • Upload your images to Transkribus
  • Segment a page into text regions, either manually, or automatically (e.g. 1 minute)
  • Segment all text regions into line/baselines automatically (e.g. 30 seconds)
  • Correct baseline segmentation (no need to correct line regions!) (e.g. 1-5 minutes)
  • Transcribe text line by line (e.g. 30-60 minutes)
  • Do this for at least 50 pages, or 2000 lines (40 lines per page). Note: The more transcribed text is available for learning, the better.
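
The per-page figures in the list above can be turned into a rough estimate of the manual effort for a 50-page training set. The following is only a sketch: it takes the "e.g." times in the list at face value, and the names STEP_MINUTES and estimate_hours are hypothetical, not part of Transkribus.

```python
# Per-page time estimates in minutes, copied from the workflow list above.
# Each entry is (low, high); single-value estimates use the same number twice.
STEP_MINUTES = {
    "region segmentation": (1.0, 1.0),
    "line/baseline segmentation": (0.5, 0.5),
    "baseline correction": (1.0, 5.0),
    "transcription": (30.0, 60.0),
}


def estimate_hours(pages):
    """Return (low, high) total effort in hours for the given number of pages."""
    low = sum(lo for lo, _ in STEP_MINUTES.values()) * pages
    high = sum(hi for _, hi in STEP_MINUTES.values()) * pages
    return low / 60.0, high / 60.0


if __name__ == "__main__":
    low, high = estimate_hours(50)
    print(f"50 pages: roughly {low:.0f} to {high:.0f} hours of manual work")
```

With these figures, 50 pages come to roughly 27 to 55 hours, almost all of it transcription.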

Training and Recognition

  • Once you have 50 pages available, contact us so that we can train the HTR engine on your reference data. This will take several hours, depending on the size of the documents.
  • Afterwards you will be able to select your HTR model within the Tools Tab in Transkribus and apply it to a given page image. Note: the pages which will be automatically recognized by the HTR model will need to have text regions and lines/baselines already defined.
  • If you have more than one page already segmented into text regions and lines/baselines, we can run the HTR engine on all pages so that you do not need to start it for each page separately.

Correction and Search

  • To correct the automatically transcribed text you can use the Text Editor in Transkribus. You have the option to work with Computer Assisted Text Transcription (CATTI) and HTR suggestions enabled. Both interfaces should reduce your correction work.
  • If you want to do a full-text search, you need to download the document as PDF or RTF. Note: The University of Valencia is currently working on an indexing and search interface which will offer a more convenient way to search in automatically processed documents. A demonstrator is available at the Transcriptorium Website.

Improve your HTR results

  • Using high quality images is the most important prerequisite for HTR and OCR.
  • If you have text similar to the material you are transcribing (e.g. from other transcriptions or from the Internet), send it to us so that we can include it in the HTR model.

How to measure the performance of HTR and OCR

  • The Tools Tab offers you a tool to measure the performance of the HTR (and OCR) with the Word Error Rate and the Character Error Rate. Compare a reference page (the page with the result you expect) and the HTR page.
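
For readers who want to see what the two measures mean, here is a minimal sketch of the usual definitions: the edit distance between reference and recognized text, divided by the length of the reference, counted in words (WER) or characters (CER). How Transkribus itself tokenizes and normalizes the text is not described here, so the function names and details below are assumptions for illustration only.

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two sequences (strings or lists of words)."""
    previous = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        current = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            current.append(min(previous[j] + 1,          # deletion
                               current[j - 1] + 1,       # insertion
                               previous[j - 1] + cost))  # substitution or match
        previous = current
    return previous[-1]


def character_error_rate(reference, hypothesis):
    """Character edits needed to turn the hypothesis into the reference, per reference character."""
    if not reference:
        raise ValueError("reference text must not be empty")
    return edit_distance(reference, hypothesis) / len(reference)


def word_error_rate(reference, hypothesis):
    """Word edits needed to turn the hypothesis into the reference, per reference word."""
    ref_words = reference.split()
    if not ref_words:
        raise ValueError("reference text must not be empty")
    return edit_distance(ref_words, hypothesis.split()) / len(ref_words)
```

For the reference "the quick brown fox" and the recognized text "the quick brovn fox", one of four words is wrong (WER 0.25) and one of 19 characters is wrong (CER about 0.05).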

Sample Documents

Try out a local document

Download the following ZIP file which contains three documents:

  • HTR_Reichsgericht: A page from court decisions in Germany. The text comes directly from the HTR. PDF File
  • George Forrest Herbarium Specimens: A page from a collection of specimens gathered by George Forrest. Thanks to the Natural History Museum Edinburgh who organized a little test project with this kind of material. PDF File
  • Briefwechsel_Goethe_Schiller_1794-1795: These two pages were produced manually and will serve as training data for the HTR - so that it will be able to read the hands of Friedrich Schiller and Johann Wolfgang von Goethe. PDF File

Open the directories locally in Transkribus. You will understand the main concepts, how image and text are connected, how you can render segmentation and text and which export formats we support.

For a preview of what the documents may look like after the automated or manual transcription process, have a look at the PDF files, which were produced automatically with Transkribus.

Once you have experienced the general "look and feel", log in to the Transkribus Platform with your user account, go to the Transkribus Cloud Collection and try out the tools (which are only available via the Cloud).

Try out a sample document in the Transkribus Cloud

  • In the Transkribus Cloud collection you will find several sample documents with which you can play around and try out some of the features. This is the easiest way to explore the platform.

Document ID1927: HTR Reichsgericht

  • The document was written at the beginning of the 20th century as part of the High Court (Reichsgericht) in Germany in Kurrent script. Professional writers edited the document, so the lines are straight and no additions or deletions can be found - which makes it much easier for the Layout Analysis tool.
  • The HTR model was trained on 88 pages, no extra language data were used to improve the recognition rate.
  • Because many users experiment with this document, you will very likely get a "destroyed" version. Go to the "Versions Tab" and open the second oldest version, which was produced by the HTR. Then you will have the result of a real-world HTR recognition.

Note: Do not run the ABBYY FineReader on the HTR Document! It is made for printed texts, not for handwritten text.

  • You may also want to carry out the computer assisted transcription. In that case, use the "HTR" version and enable the HTR suggestions or CATTI in the Text Editor to get assistance from the HTR engine.
  • The correct way to try out HTR is:

(1) Run "Region detection"

(2) Run "Line and Baseline detection"

(3) Select "HTR_Training"

(4) Run HTR Processing (it will take several minutes!)

Afterwards you will see the expected results.

Note: There is a known issue with the CATTI server sometimes "overruling" the user, which makes it a bit cumbersome to use at present. This will be resolved in one of the coming versions.

Document ID558: OCR Sample Document - Gothic letter

  • This document has 12 pages and consists of several images from various documents from the 17th to the 20th century. The main purpose is to show the general performance of the ABBYY FineReader SDK 11. It also lets you become familiar with the concept of text regions, line regions, baselines and word regions.
  • The sample also shows at a glance what a well-scanned image should look like: straight lines, cropped print space (no black borders), 300-400 ppi and - if possible - in 24bit colour. Good image quality is one of the main prerequisites for every kind of image processing and pattern recognition.
  • Important: Use the "Word based" button in the Text Editor to display the text!

Installation

Supported Operating Systems

  • Transkribus is written in JAVA and therefore runs on Windows, Mac (Apple) as well as Linux.
  • Important: JAVA 7 or above needs to be installed on your computer. This should be the case for most computers.
  • If you need to check your JAVA version: https://java.com/en/download/help/version_manual.xml

Unzip ZIP File

  • After download you will see a ZIP File in the download directory of your computer.
  • Unzip the file before you try to start an executable file.

Run Transkribus via an executable file: .exe, .command, .sh

  • Open the Transkribus directory. You will find there the executable files for your operating system.
  • Start Transkribus from your user interface with a double-click:
    • Windows: Transkribus.bat or Transkribus.exe
    • Mac OS - Apple: Transkribus.command
    • Linux: Transkribus.sh

Notes for first launch on Windows

  • If you do not have "Administrator" rights, Windows will produce a warning message, such as "Your Computer is Protected by Windows".
  • Do not confirm, but go to "More Information". There you can agree that this is not malware and that you want to run Transkribus on your computer.

Notes for first launch on Mac

  • When you run the program for the first time, it may not start because it is a non-signed application ("... can't be opened because it is from an unidentified developer" message).
  • In this case, right-click (or control-click) the application and choose "Open". In the dialog that appears, click "Open" again.

Run Transkribus via command line

  • Transkribus is contained in the main jar file Transkribus-<version>.jar
  • To run the program from command line type: java -jar Transkribus-<version>.jar
  • Note: Java 7 is needed to run the program. Make sure Java 7 is either installed system wide or copy a JRE into the program directory!
  • Note: In versions before 0.6.8 you may have to make the scripts executable from the command line before you can run them on Mac (or Linux):
    • Mac console basics
    • change into the program folder using 'cd' commands
    • chmod +x Transkribus.command (or chmod +x Transkribus.sh for Linux!)
  • Furthermore you will find several files in the Transkribus package copied to your computer:
    • config.properties can be modified to adjust simple appearance properties
    • virtualKeyboards.xml can be used to specify a set of virtual keyboards
    • logback.xml can be modified to adjust logging properties (for expert users only)
  • The 'libs' subfolder contains the necessary libraries for all platforms. Currently supported are:
    • Windows 32/64 bit
    • Linux 32/64 bit
    • OSX 64 bit

Known Issues

Logging in to the Server is not possible via Transkribus, but on the website it works.

  • Solution: There is a known issue with specific versions of Java 7 (e.g. Java 7u25). You can check your installed version by opening a terminal/command line and entering "java -version". If you encounter this problem, try updating Java on your machine.
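
If you prefer to check the version from a script, the text printed by "java -version" can be parsed as in the following sketch. It assumes the common output formats 'java version "1.7.0_25"' (Java 8 and older report themselves as 1.x) and 'openjdk version "11.0.2"'. The function names are hypothetical and not part of Transkribus.

```python
import re
import subprocess


def java_major_version(version_output):
    """Extract the major Java version from the text printed by `java -version`."""
    match = re.search(r'version "(\d+)(?:\.(\d+))?', version_output)
    if match is None:
        raise ValueError("no version string found")
    first = int(match.group(1))
    # Releases up to Java 8 use the scheme 1.x, e.g. "1.7.0_25" means Java 7.
    if first == 1 and match.group(2) is not None:
        return int(match.group(2))
    return first


def installed_java_major():
    """Run `java -version` and return the major version of the default Java."""
    # `java -version` writes its report to standard error, not standard output.
    result = subprocess.run(["java", "-version"], capture_output=True, text=True)
    return java_major_version(result.stderr)
```

A result below 7 means that the default Java on the command line is too old for Transkribus, even if a newer Java is installed elsewhere on the machine.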

Logging in is prevented by the Firewall of your Internet Provider

  • Some IT departments are blocking the SSL port 443 and/or unknown applications via a firewall. Check with your IT department if that might be the case.
  • In some cases it may be necessary to use a special command line to start the application with proxy, e.g.
   java -Dhttps.proxyHost=<proxyserver>
        -Dhttps.proxyPort=<proxyPort>
        -Dhttps.proxyUser=<user name for proxy>   
        -Dhttps.proxyPassword=<password for proxy>
        -jar Transkribus-0.7.0.jar
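
The options above are printed on separate lines for readability but belong to one single command. One way to avoid quoting and line-break problems is to assemble the argument list in a small script, as sketched below. The property names are the ones shown above. The function name, the example host and the port are placeholders, not real values.

```python
import subprocess


def build_launch_command(jar, proxy_host=None, proxy_port=None,
                         proxy_user=None, proxy_password=None):
    """Return the argument list for starting Transkribus, optionally via a proxy."""
    options = [
        ("https.proxyHost", proxy_host),
        ("https.proxyPort", proxy_port),
        ("https.proxyUser", proxy_user),
        ("https.proxyPassword", proxy_password),
    ]
    command = ["java"]
    # Only settings that were actually supplied become -D options.
    command += [f"-D{name}={value}" for name, value in options if value is not None]
    command += ["-jar", jar]
    return command


def launch(command):
    """Start the program and wait until it exits."""
    return subprocess.run(command).returncode
```

For example, launch(build_launch_command("Transkribus-0.7.0.jar", "proxy.example.org", 8080)) starts the program with a proxy host and port but without proxy credentials.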


Norton Antivirus detects a threat and is blocking the zip file from being unpacked.

  • Solution: This is a false alarm which Norton gives when encountering software it is not familiar with (WS.Reputation.1). You should be able to restore the file from quarantine by following the instructions from the following resource [1].

Versions 0.6.5 and older cannot update (very long error message):

  • Please click on the "Home" button (upper left corner), then "Install a specific version", select the newest version from "Releases" and tick the box beside "Download complete package".
  • Afterwards click on "Update" or "Replace". This way, the complete package is downloaded and the update should work.

Wrong JAVA Version on Mac

  • After opening the command file on the Mac, Transkribus says that there is a wrong Java version installed (1.6.0.65) instead of 1.7. However, there is the most current version of Java RE (1.8.0.66) installed.
    • The problem is that Java 1.6.0.65 is the default Java on the command line which the Transkribus.command uses. You can check the default version by opening the terminal and typing 'java -version'.
    • To solve the problem you can either download the latest JDK as a .tar.gz package from here:
   http://www.oracle.com/technetwork/java/javase/downloads/jdk8-downloads-2133151.html

and unpack it into the Transkribus folder - the Transkribus.command file will automatically check for java installations in its sub directories!

    • Or you could make your java 8 installation the default one on command line following e.g. the instructions here:
   http://myshittycode.com/2014/03/17/mac-os-x-setting-default-java-version/

Transkribus does not start on (Fedora) Linux - 'MOZILLA_FIVE_HOME not set' error message

  • The package "libwebkitgtk" may not be installed. On Fedora you can install the package using dnf on the command line (use "yum" instead of "dnf" in older versions of Fedora):
   sudo dnf install webkitgtk

Transkribus Cloud Services

The Transkribus Cloud is based on the IT infrastructure of the Central Computer Service of the University of Innsbruck. Main components are virtual servers for hosting the core Transkribus application and the integrated tools, a database server, a storage and backup facility, as well as a High Performance Computing unit.

Image Files

  • Though it is possible to work with local files, the full power of Transkribus can only be enjoyed if the documents reside in the Transkribus Cloud.
  • Once the images are uploaded they are stored and processed in the FileImageStore developed by UIBK.
  • Specifications of the FileImageStore
    • The following image formats can be processed: JPG, PNG, TIFF, JP2. Note: GIF and RAW formats from digital cameras are not supported by Transkribus.
    • Resolution is always kept, but for working with the images a thumbnail and a compressed file are produced automatically.
    • Original filenaming is kept as metadata
    • Files get a unique address

PAGE Files

We are using the PAGE XML file format as internal master format. It was created by the University of Salford.
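
To give an impression of the format, the following sketch reads the line IDs, baselines and line texts from a PAGE file. The sample document is invented for illustration and heavily shortened, and the function names are hypothetical. The sketch matches element names without their namespace, so it does not depend on a particular version of the PAGE schema.

```python
import xml.etree.ElementTree as ET

# Invented, shortened sample. Real PAGE files carry more metadata and coordinates.
SAMPLE = """
<PcGts xmlns="http://schema.primaresearch.org/PAGE/gts/pagecontent/2013-07-15">
  <Page imageFilename="page1.jpg" imageWidth="2000" imageHeight="3000">
    <TextRegion id="r1">
      <TextLine id="r1l1">
        <Baseline points="100,200 900,205"/>
        <TextEquiv><Unicode>First line</Unicode></TextEquiv>
      </TextLine>
      <TextLine id="r1l2">
        <Baseline points="100,300 880,298"/>
      </TextLine>
    </TextRegion>
  </Page>
</PcGts>
""".strip()


def local_name(tag):
    """Strip the '{namespace}' prefix that ElementTree puts in front of tag names."""
    return tag.rsplit("}", 1)[-1]


def read_lines(page_xml):
    """Return a list of (line id, baseline points, text) for every TextLine."""
    lines = []
    for element in ET.fromstring(page_xml).iter():
        if local_name(element.tag) != "TextLine":
            continue
        baseline, text = None, ""
        for child in element:
            name = local_name(child.tag)
            if name == "Baseline":
                # The points attribute is a space-separated list of "x,y" pairs.
                baseline = [tuple(int(v) for v in point.split(","))
                            for point in child.get("points", "").split()]
            elif name == "TextEquiv":
                for node in child:
                    if local_name(node.tag) == "Unicode":
                        text = node.text or ""
        lines.append((element.get("id"), baseline, text))
    return lines
```

In the sample, the second line has a baseline but no text yet, which corresponds to a page that has been segmented but not transcribed.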

REST Interfaces

Most services within Transkribus are also exposed via a RESTful interface. Developers are free to use the complete REST Interface.

How to contribute to the Transkribus Wiki

Feel free to contribute to the Transkribus Wiki. Register directly here in the Wiki (your Transkribus user account does not work!) and start editing. There is also a German page, where we would be thankful for your input!

Consult the User's Guide on using the wiki software.

MediaWiki

Credits

This work is co-funded by the European Commission within the FP7 Project tranScriptorium (2013-2015) and the H2020 Project READ (2015-2019).

Special thanks to the many, many users who provide their feedback per email and bug report - though we are often not able to directly follow your suggestions, we try our best to include them in the long run!