CITlab Keyword Spotting

From Transkribus Wiki
Revision as of 08:31, 6 April 2020 by Philip Kahle (Talk | contribs)

CITlab Keyword Spotting can be used on all documents that have been processed with CITlab HTR(+). Searching is currently restricted to a single document per search request.

  • Sending a search request:


  • List KWS search processes. As those might take some time to finish, watch the "status" field of a process until its value is "Completed":


  • Retrieve the hits of a completed search process:


The GET requests support paging, e.g. use the query parameters "?index=0&nValues=2" to get the first two elements.
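
The paging parameters and the status check described above can be sketched in Python as follows. This is only an illustration: the endpoint URLs are not given on this page, so the base URL is passed in by the caller, and fetch_process is a hypothetical callable that is assumed to return one search process as a dict with a "status" key.

```python
import time
from urllib.parse import urlencode


def paged_url(base_url, index=0, n_values=2):
    """Append the paging parameters "index" and "nValues" to a GET URL."""
    return base_url + "?" + urlencode({"index": index, "nValues": n_values})


def wait_until_completed(fetch_process, poll_seconds=5.0, max_polls=100):
    """Poll a KWS process until its "status" field reads "Completed".

    fetch_process is a hypothetical callable supplied by the caller; it is
    assumed to return the process as a dict with a "status" key.
    """
    for _ in range(max_polls):
        process = fetch_process()
        if process.get("status") == "Completed":
            return process
        time.sleep(poll_seconds)
    raise TimeoutError("KWS process did not complete in time")
```

The polling interval and the upper bound on polls are arbitrary choices for the sketch and should be adapted to the size of the document being searched.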

Expert syntax

Extended query syntax can be enabled by sending a parameter map with the POST request that starts the search:

   "type" : "kwsParameters",
   "entry" : [ {
      "key" : "caseSensitive",
      "value" : "true"
   }, {
      "key" : "expert",
      "value" : "true"
   } ]
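
A minimal Python sketch for building this parameter map is shown here. The helper name kws_parameters is made up for illustration, and it is an assumption that the string "false" is accepted for a disabled flag. How the map is embedded in the body of the POST request is not shown on this page, so the sketch only prints the map itself.

```python
import json


def kws_parameters(case_sensitive=True, expert=True):
    """Build the "kwsParameters" map; flag values are the strings "true"/"false"."""
    def flag(value):
        return "true" if value else "false"

    return {
        "type": "kwsParameters",
        "entry": [
            {"key": "caseSensitive", "value": flag(case_sensitive)},
            {"key": "expert", "value": flag(expert)},
        ],
    }


# Serialize the map as JSON for inspection.
print(json.dumps(kws_parameters(), indent=3))
```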

Instead of words, the search patterns are now defined by regular expressions.

To define the part of interest, one must define a group named "KW". The part of the line matched by this group is returned as the result, e.g.

  • date: .*(?<KW>[0-3][0-9]\.[0-1][0-9]\.[0-9]{4}).* matches any line containing a date of the form DD.MM.YYYY
  • abbreviations: .*(?<KW>Dr\.|Doctor).* matches any line containing Doctor or its abbreviation Dr.
  • uncertainties: .*(?<KW>(k|c|che|chh)rist?).* matches any line containing Old High German spellings for Christ: e.g. kris, krist, crist, cherist, chhrist
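
The example patterns can be tried out locally before sending them in a search request. The sketch below uses Python's re module as a stand-in, which is an assumption: the KWS engine may differ in details. Python writes named groups as (?P<KW>...), so the helper kw_hit (a name chosen for this example) translates the group syntax and uses re.fullmatch to require a match of the whole line. The sample lines are invented.

```python
import re


def kw_hit(pattern, line):
    """Return the text captured by group KW if the pattern matches the whole line."""
    # Python's re module spells named groups (?P<name>...), so translate.
    python_pattern = pattern.replace("(?<KW>", "(?P<KW>")
    match = re.fullmatch(python_pattern, line)
    return match.group("KW") if match else None


date = r".*(?<KW>[0-3][0-9]\.[0-1][0-9]\.[0-9]{4}).*"
doctor = r".*(?<KW>Dr\.|Doctor).*"
christ = r".*(?<KW>(k|c|che|chh)rist?).*"

print(kw_hit(date, "geboren am 06.04.1820 in Rostock"))  # 06.04.1820
print(kw_hit(doctor, "Herr Dr. Meyer"))                   # Dr.
print(kw_hit(christ, "der heilige krist"))                # krist
print(kw_hit(date, "kein Datum in dieser Zeile"))         # None
```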

In contrast to the standard usage of regular expressions, the search patterns have to match the whole line, e.g. .*[0-9]{4,6} will match only lines which end with a number of at least 4 digits. To allow arbitrary characters after the digits, one has to add .* at the end: .*[0-9]{4,6}.* Analogously, [0-9]{4,6}.* matches only lines which begin with at least 4 digits.
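
The effect of whole-line matching can be reproduced with re.fullmatch in Python, used here as an approximation of the KWS behaviour rather than the engine itself. The helper whole_line_hits and the sample lines are made up for this illustration.

```python
import re


def whole_line_hits(pattern, lines):
    """Return the lines that the pattern matches from first to last character."""
    return [line for line in lines if re.fullmatch(pattern, line)]


lines = ["Anno 1820", "1820 Anno", "im Jahr 1820 geboren"]

print(whole_line_hits(r".*[0-9]{4,6}", lines))    # ['Anno 1820']
print(whole_line_hits(r"[0-9]{4,6}.*", lines))    # ['1820 Anno']
print(whole_line_hits(r".*[0-9]{4,6}.*", lines))  # all three lines
```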

Standard regular expression features which are supported:

. any character
+ one or more repetitions of the previous literal
* zero or more repetitions of the previous literal
[] class of characters, e.g. [0-9] any digit between 0 and 9; [aeiou] any vowel; [A-Z] any capital letter
? the previous literal is optional
{X} repeat previous literal X times
{X,Y} repeat previous literal between X and Y times
| alternative: a|b means either a or b
() grouping: (a|b)c matches ac or bc while a|bc matches a or bc
\ escape operator: to match e.g. a + or . one needs to escape it by \+ or \.
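
Alternation, grouping, optional parts, bounded repetition and escaping behave the same way in Python's re module, so a quick local check is possible. The helper matches_line is a name invented for this sketch, and re.fullmatch is again assumed to approximate the whole-line matching of the KWS engine.

```python
import re


def matches_line(pattern, line):
    """True if the pattern matches the whole line."""
    return re.fullmatch(pattern, line) is not None


print(matches_line(r"(a|b)c", "bc"))          # True: group, then c
print(matches_line(r"a|bc", "ac"))            # False: only "a" or "bc" match
print(matches_line(r"1\+1", "1+1"))           # True: escaped + is a literal
print(matches_line(r"kri?st{1,2}", "krstt"))  # True: optional i, one or two t
```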

Standard regular expression features which are not supported:

^ begin of line is not supported
[^....] negation in character classes is not supported
{,Y} {X,} open repetitions are not supported (instead of {,Y} write {0,Y})
$ end of line is not supported
[:alpha:] predefined character classes like this alphabetical class are not supported
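
A pattern can be checked for these unsupported constructs before it is sent. The following Python function is a rough heuristic written for this page, not the parser used by the KWS engine: it ignores escaped characters, treats ^ and $ inside a character class as literals, and may misjudge unusual patterns such as a class that contains a literal ].

```python
import re


def unsupported_features(pattern):
    """Return descriptions of unsupported constructs found in a pattern."""
    found = []
    # Drop escaped characters such as \^ or \$ so they are not flagged.
    rest = re.sub(r"\\.", "", pattern)

    if re.search(r"\[:\w+:\]", rest):
        found.append("predefined character class such as [:alpha:]")
        rest = re.sub(r"\[:\w+:\]", "", rest)
    if "[^" in rest:
        found.append("negated character class [^...]")
    # Remove remaining character classes; ^ and $ inside them are literals.
    rest = re.sub(r"\[[^\]]*\]", "", rest)
    if "^" in rest:
        found.append("begin of line ^")
    if "$" in rest:
        found.append("end of line $")
    if re.search(r"\{,\d+\}|\{\d+,\}", rest):
        found.append("open repetition {,Y} or {X,}")
    return found


print(unsupported_features(r".*(?<KW>[0-9]{4}).*"))  # []
print(unsupported_features(r"a{,3}"))  # ['open repetition {,Y} or {X,}']
```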