Main Page/DataSets

From Transkribus Wiki

This is a preliminary list of data sets for READ.

Note: Our suggestion is that we collect images from all institutions centrally and host them via an FTP server and a database. All datasets will need to meet some basic requirements (metadata, file naming) and will be uploaded to the central READ repository by the collaborating institutions. Technical partners will have access to the image collections via the FTP server and will be able to download them for their own purposes.


READ Large Scale Demonstrators

State Archive Zurich

The project "Transliteration and Digitisation of the Handwritten Parliament and Government Council Protocols of the Canton of Zurich 1803–1887/1897" started in 2009 and is at an advanced stage: 150'000 of 200'000 pages of text in so-called German Kurrent handwriting (running text, but also tables with numbers) have been transcribed and can be used to evaluate HTR technology. Data management and data storage for all projects of the State Archives of Zurich are based on the archive's database, the archival information system scopeArchiv™. Public access to metadata, full-text transcriptions and images of the original documents is provided by the archive's Online Catalogue (Query) (http://suche.staatsarchiv.djiktzh.ch/suchinfo.aspx) and also by the meta search engine Archives Online (http://www.archivesonline.org/search.aspx), which was established in 2010 by the State Archives of Zurich and four other archives and now includes the Queries of 17 archives.

National Archive Finland

The National Archives of Finland has digitized about 25.000.000 documents, including 200.000 maps. Most of the digitized documents are handwritten. The handwritten documents date from the 1530s to the 1960s. About 60% of the documents were digitized from microfilm, but since 2014 the number of originals digitized has been increasing rapidly. Person registers older than 100 years and court documents older than 60 years are freely available online and can also be used in this project.

Archive Bistum Passau

The archive has ca. 750.000 images of handwritten registers online, together with a database providing access to the most important sets of data found in the registers. A project on handwriting recognition therefore has a large basis for practical experiments and for checking the results of the HWR tool against the database.

http://www.data.matricula.info/php/main.php

The data are very similar to the dataset provided by Quidenus, a scanning company from Austria (see below).

Venice Time Machine (State Archive Venice)

The Venice Time Machine is an international scientific programme launched by the EPFL and the University Ca' Foscari of Venice with the generous support of the Fondation Lombard Odier. It aims to build a multidimensional model of Venice and its evolution covering a period of more than 1000 years. The project's ambition is to build a large open-access database that can be used for research and education. Thanks to a partnership with the Archivio di Stato in Venice, kilometres of archives will be digitized, transcribed and indexed, laying the foundation for the largest database ever created on Venetian documents. To complement these primary sources, the content of thousands of monographs will be indexed and made searchable. The information extracted from these sources will be organized in a semantic graph of linked data and unfolded in space and time in a historical geographical information system.

MOU Partners

Australian National Library (allocated to UCL)

The National Library of Australia's objective is to ensure that all Australians can access, enjoy and learn from a national collection that documents Australian life and society. We are committed to providing open access to the national collection and our online services. The Library's Manuscript collection ranges from single items to large collections, encompassing a wide variety of unpublished and handwritten materials including letters, diaries, sketches and artworks, notebooks, maps, photographs, literary works, and organizational records. The collections predominantly relate to Australia, but there are also important holdings relating to Papua New Guinea, New Zealand and the Pacific. Some of our most important manuscript collections have already been digitized and can be accessed through Trove (http://trove.nla.gov.au/). Trove is a national discovery service built and managed by the National Library of Australia that brings together content from libraries, museums, archives and other research organizations, including the National Library's own collections, and allows users to search text from digitized print materials (but not currently from digitized handwritten text). The Library's collaboration with the READ project would consist of providing test and sample data (i.e. digitized images from our collections) for further experimentation and development work, and of using the platform in our own environment to assess its potential to build further on the work we have done with digitized newspapers through Trove.

Gottfried Wilhelm Leibniz Bibliothek - Niedersächsische Landesbibliothek (allocated to URO)

The collection of manuscript papers of Gottfried Wilhelm Leibniz at the Gottfried Wilhelm Leibniz Bibliothek - Niedersächsische Landesbibliothek encompasses about 50,000 items, comprising 150,000 to 200,000 sheets. These include about 15,000 letters (approximately 30,000 sheets) from and to about 1,100 correspondents. About forty percent of these letters were written by Leibniz and about sixty percent were written to him. This correspondence, which extended to all parts of Europe and beyond, even as far as China, reveals the wide range of topics Leibniz worked on. It also provides an invaluable insight into the extent to which Leibniz influenced the thought of the scientific world of his time. Leibniz was a central figure in the Republic of Letters: he established a global network of correspondents and exchanged letters with the most eminent scientists and scholars of his day. His correspondence marks a turning point in the development of thought and of technology, as it represents the transition from humanist-baroque thought to the Age of Enlightenment. Since 2007 the Leibniz correspondence has been part of the UNESCO Memory of the World programme. From 2015 on, the Gottfried Wilhelm Leibniz Bibliothek is going to digitize large amounts of Leibniz manuscripts. Leibniz's handwriting is not easy to read, even for experts. It comprises texts in Latin, French and German, with a different script for each language. Many manuscripts contain corrections, deletions, insertions, and even drawings or mathematical formulas. The Leibniz papers are being edited by the Berlin-Brandenburgische Akademie der Wissenschaften and the Akademie der Wissenschaften zu Göttingen, with editing offices in Berlin, Hannover, Münster and Potsdam. Because of the complexity and the large amount of material, the Leibniz edition is a project that will take more than 100 years.
By taking part in the READ project we therefore hope to be able to present desperately needed texts to researchers all over the world. As both edited and unedited papers exist, it would be easy to test the quality of the recognition process by comparing its results with the transcriptions produced by the editors. The community of researchers on Leibniz is vast and includes historians, philosophers, mathematicians, and historians of technology and science. The Gottfried Wilhelm Leibniz Bibliothek is therefore convinced that the READ service and its platform would be of great benefit to this worldwide interdisciplinary community.
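One common way to make such a comparison concrete is the character error rate (CER): the edit distance between the recognised text and the editors' transcription, divided by the length of the editors' text. The following is only a minimal sketch in Python, not part of READ or Transkribus; the function names are hypothetical and the sample strings are invented for illustration.

```python
def levenshtein(a: str, b: str) -> int:
    """Edit distance between a and b (insertions, deletions, substitutions)."""
    # previous[j] holds the distance between the processed prefix of a and b[:j].
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            cost = 0 if char_a == char_b else 1
            current.append(min(
                previous[j] + 1,         # delete char_a
                current[j - 1] + 1,      # insert char_b
                previous[j - 1] + cost,  # substitute or keep
            ))
        previous = current
    return previous[-1]


def character_error_rate(reference: str, hypothesis: str) -> float:
    """CER = edit distance / length of the reference transcription."""
    if not reference:
        raise ValueError("reference transcription must not be empty")
    return levenshtein(reference, hypothesis) / len(reference)


if __name__ == "__main__":
    # Invented example: an editor's line versus a recognised line.
    edited = "Monsieur, j'ai receu votre lettre"
    recognised = "Monsieur, j'ai recu votre letre"
    print(f"CER: {character_error_rate(edited, recognised):.3f}")
```

A lower value means the recognised text is closer to the edited text. In practice both texts would probably need the same normalisation (whitespace, abbreviations, line breaks) before being compared, which this sketch does not attempt.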

National Library of Spain (allocated to UPVLC)

The National Library of Spain (BNE) holds one of the richest manuscript collections in Spain (http://www.bne.es/en/Colecciones/Manuscritos/index.html), with more than 83.000 items. It comprises medieval codices, many with miniatures, dramatic manuscripts, countless autographs, and a great many historical and genealogical documents. The collection is only partially digitized, and only 12,53% of it can be accessed online (corresponding to 10.400 titles). Digitization improves citizens' access to their heritage, but it is still not enough to ease access to the intellectual content of the manuscripts. In many cases these works can only be fully understood by highly qualified researchers (palaeographers), and in no case is full-text search possible (unless a transcription has been specifically undertaken by specialists). These two issues, together with the high cost of transcription, are the main obstacles to genuinely opening handwritten collections to the public. For all these reasons we consider this project very promising, as it could improve access to the content of the BNE's manuscripts. We would be able to provide images of our manuscripts for research purposes. Depending on the resources available and allocated to the library, we might also be able to collaborate on the validation of tools and on dissemination activities. This latter commitment would have to be carefully defined once the project is granted.

Centre virtuel de la connaissance sur l'Europe Digital Humanities Lab (CVCE) – Luxembourg (allocated to ULCC)

CVCE currently provides free online access to ca. 16.000 digitized documents, including facsimile scans, newspaper articles, interview transcripts, photos and audio-visual materials. A collection of 10.000 documents is kept in the Pierre Werner estate. These documents are typewritten, handwritten, or a combination of both. CVCE is about to complete a pilot project on 600 documents associated with the Western European Union. These documents were OCRed and manually annotated according to XML/TEI standards. In future we plan to process a larger part of our collections in this way, with the ambition of creating an interlinked repository of primary sources. CVCE gathered experience in crowdsourcing in the context of an FP7-funded project on multimedia search (CUbRIK), in which generic and expert crowds were used to detect and identify faces in historical photographs. CVCE is actively seeking opportunities to use crowds to help annotate parts of its collection.

The Linnean Society of London – United Kingdom (allocated to UCL)

The Linnean Society of London holds a number of important and extensive manuscript and correspondence collections, as well as a large number of annotated books, the transcription of which remains beyond our reach (in terms of time and resources) if attempted in a traditional set-up. However, transcription is the crucial first step to unlock the information contained in these unique, but as yet little researched, collections. Specialist skill sets are required to work with these collections, including palaeography, good knowledge of Latin, a background in the history of natural history, etc. Being able to get a large number of pages transcribed without the extensive resources traditionally required would enable us to dramatically enhance access to and utilisation of these collections for research. A number of these collections have been digitized to a high standard and are freely available through the Online Collections platform provided by ULCC (University of London Computer Centre). They span the 18th and 19th centuries, and languages include Latin, but also all major European languages. Particularly relevant collections would be:

The Manuscripts of Carl Linnaeus (1707-1778): Carl Linnaeus is one of the founding fathers of modern biology. He introduced the binominal system of naming living organisms which we still use today. His works are the starting point of the official naming of plants and animals in use today. His manuscripts are an invaluable resource, scientifically and historically. As a consequence of overseas discoveries, early modern scientists were faced with what has been termed the "first bio-information crisis". The sheer amount of exotic, hitherto unknown species that reached the shores of Europe forced scientists to reconsider the ways in which they wrote and thought about the natural world. The manuscript collection contains approximately 20,190 sheets, for the majority of which recto and verso are used. 19 manuscripts are currently online; the rest of the collection is being catalogued, conserved and digitized as part of a two-year Mellon-funded project. http://linnean-online.org/linnaean mss.html

The Correspondence of Carl Linnaeus: Nobody was more aware than Linnaeus of the scientific value of his correspondence. His correspondents were, he said, the most learned and distinguished in Europe, and they sent him the latest publications and kept him abreast of new discoveries. In an autobiographical text from the 1760s Linnaeus listed seventy-one correspondents, from Russia and Turkey in the East to America in the West. In subsequent years the number of letters and correspondents continued to grow, and when Linnaeus died in 1778 more than 200 persons in Sweden and around 400 in other countries had been in contact with him. Over three thousand letters had been sent to him by scientists in Europe, America, Asia and Africa and by admirers such as Jean-Jacques Rousseau. A significant part of the correspondence came from Linnaeus's own students, who reported to their professor from their travels around the world. http://linnean-online.org/correspondence.html

The Annotated Books of Carl Linnaeus: It would be interesting to see how a new handwriting-recognition technology could work alongside an advanced version of OCR for our annotated Linnaean books. This collection is due to go online by the end of this year and contains approximately 60,378 images.

The Correspondence of Sir James Edward Smith (1759-1828): This collection comprises the scientific and personal correspondence of Smith, presented to the Linnean Society between 1857 and 1872 by Pleasance Smith (1773-1877), and since complemented by additional series. It provides an invaluable insight into the networks of information sharing in the natural sciences across Europe in the 19th century. It also provides a fascinating picture of culture, politics and personal lives. http://linnean-online.org/smith correspondence.html

The Hessian State Archive Marburg – Germany (allocated to StAZH)

The Hessian State Archive in Marburg is one of the historically most important archives in Germany. It holds more than 120.000 charters, 300.000 maps, plans and architectural drawings, and more than 70 kilometres of files. Its holdings range from the 8th century up to modern times. Because of this large amount of archival records and their varied and complex handwriting, indexing and transcription, the crucial first steps towards making the information accessible, are extremely difficult even for specialists. The opportunity to get a large number of pages transcribed without the extensive resources traditionally required would enable us to dramatically enhance access to and utilisation of our holdings for everybody. Since the Hessian State Archive Marburg holds almost all kinds of European written documents since the early Middle Ages, it can contribute to the project by making available a large variety of handwriting through the centuries, in digitized and original form. We are also interested in collaborating in the field of dissemination and would like to offer the project the historical and representative location of our archives for meetings, press conferences and similar events. Further information can be found at www.staatsarchiv-marburg.hessen.de

There are two especially interesting collections:

Grimm Collection: mainly letters and personal papers of the Brothers Grimm. About 35.000 digitised pages, all of which are available for the project. The Grimm collection receives a lot of attention in Germany.

Online: http://orka.bibliothek.uni-kassel.de/viewer/search/nachlaesse.nachlassgrimm.jacobundwilhelmgrimm/-/1/-/-/

Register Books: a large number of register books, very similar to the Passau, Quidenus and Ancestry.com collections.

A small test set is available in the Transkribus Test Collection "Marburg".

Suggestion: Since the letters come from many different writers and metadata are available, the collection would be a good showcase for letters as well as for writer identification.

The Munch Museum. The digital archive of Edvard Munch's writings – Norway (allocated to NAF)

The Munch Museum has a collection of approximately 30,000 manuscript pages. Among these manuscripts is Munch's correspondence with family, friends, acquaintances, galleries, museums, printers, collectors, admirers etc. from a period of almost 70 years (1876-1944). We estimate that we own in total around 10,000 letters and letter drafts. Alongside the correspondence, the manuscripts also comprise Munch's private writings: literary sketches, prose poems, diaries and diverse notes, dating from when he learnt to write as a young boy until only days before his death. The museum has been working on transcribing and digitizing the manuscripts for years and is building a free online archive with facsimiles and transcriptions of all of them. Currently we are working on the received correspondence in a crowdsourcing solution built on Bentham's MediaWiki application. Apart from Munch's writings we also own related, but smaller, collections of writings that are not yet digitized. These are also handwritten manuscripts. We are happy to collaborate with READ to help develop tools for the automated recognition and indexing of handwritten material and will gladly submit our material to help in this process. We hope that our experience can be of help to the project. We also see this as an opportunity to learn from others.

Musikinstrumentenmuseum der Universität Leipzig – Germany (allocated to ASV)

The Musikinstrumentenmuseum der Universität Leipzig (MIMUL, the Museum of Musical Instruments of Leipzig University) holds about 6.000 musical instruments and similar artefacts which, in the course of their manufacture, use and ownership history, received signatures and inscriptions from suppliers, makers, dealers, owners or users. Some of these are located in places that are hard to inspect, or survive only in fragments as a result of later alterations. About one thousand signatures and inscriptions are documented photographically; about two thousand are available as transcriptions. Further image and full-text digitisations are in preparation. The aim is to build a cross-collection repository of signatures and inscriptions on musical instruments.

The Civic Archives of Bozen-Bolzano – Italy (allocated to BAP)

The old town registers of the Alpine city of Bozen-Bolzano, located in the northernmost part of Italy near the Austrian border, provide invaluable documentation of European pre-modern history. The so-called 'Ratsprotokolle' or 'Ratschlagbücher' (minutes of the council) are fully available and accessible under a Creative Commons Attribution-NonCommercial 3.0 Unported License on the BOhisto - Bozen-Bolzano's History Online website <http://stadtarchiv-archiviostorico.gemeinde.bozen.it/bohisto/en>. The manuscripts digitized up to now span from 1470 to the 18th century and shed light on the administration, the economy and the townspeople's life of one of the main urban centres of the Tyrolean region, situated on the most important transalpine route between Germany and Italy. The rich database offered by the archival data is a unique playground for further exploring the urban history of a central European area. A collaboration with READ would therefore be of great interest to the BOhisto project, which aims at a still wider use of the whole material by the historical sciences, especially for teaching and research purposes, and in the long run at better readability and in-depth knowledge of the rich Bolzano corpus.

About 30.000 images are already in Innsbruck. Ground truth (GT) for some 500 pages is available and was already part of the tS Project.

Examples available in the GT Collection of Transkribus: Bozen

University and Research Library Erfurt/Gotha Research Library Gotha – Germany (allocated to URO)

The Gotha Research Library is one of the most prominent German libraries with historic holdings from the 16th to the 18th century. It also holds the Perthes Collection, encompassing maps, archival material and a geographic-cartographic library from the 18th to the late 20th century. The main part of the Early Modern holdings is an extensive collection on the history of the Reformation. The 261 volumes of the so-called Reformation manuscripts contain approximately 17,000 individual pieces and more than 7,000 informal speeches. They include significant pieces from the early reformers of the first and second generation (correspondence, transcripts of lectures, etc.) and the unique stock of the partial or main legacies of Georg Spalatin, Paul Eber, Jean Calvin and Theodore de Beze, etc. The Research Library prepares digital editions of the reformers' correspondence (model project with 1,500 letters: Paul Eber). In this context, the library is highly interested in collaborating with READ by preparing and delivering test material (digitized page images and metadata) in order to carry out experiments, to evaluate the tools and services developed in the project, and to improve the library's digital edition projects by using the platform.

Institutions willing to share their collections

National History Museums

A group of NHM institutions from all over Europe and the US, interested in recognizing specimen index cards.

A test run with them is under way; examples are available in Transkribus.

National Archives London

Interested in Transkribus, willing to share documents for testing purposes. Very rich and heterogeneous holdings.

Close contact with Mark Bell. Some sample documents already in Transkribus.

http://www.nationalarchives.gov.uk/

Adam Matthew Digital

A company strongly cooperating with the National Archives. Has digitised millions of files and is willing to share them with the project.

MarineLives

A private transcription project with a good base of volunteers and collaborators. Transcription of 17th- and 18th-century admiralty reports.

http://www.marinelives.org/wiki/MarineLives

A test project is currently being carried out; 200 example pages are in the MarineLives collection in Transkribus.

Quidenus

An Austrian company which has digitised about 25 million register (church) books. Willing to share hundreds of thousands of files. Rather homogeneous material from the 17th to the 20th century.

Independently of READ, several test runs by several universities and companies are currently taking place.

http://quidenus.com/

Ancestry.com

Very similar to Quidenus, Ancestry has digitised millions of register books. Willing to share thousands of files.

http://ancestry.com/

Sigmund Freud Digital Edition

An Austrian group of researchers who plan to release a digital edition of Freud's works.

About 50.000 images are available and are being processed by CVL.

A first test has been performed; some sample pages can be found in the Transkribus "Freud" collection.

Wiener Library - London

Holds a collection of personal papers from the Holocaust. Mixed papers with postcards, letters, forms, photographs, etc.

A test collection is available in Transkribus: Wiener Library.