Difference between revisions of "CITlab Keyword Spotting"

From Transkribus Wiki
Latest revision as of 08:31, 6 April 2020

CITlab Keyword Spotting can be used on all documents that have been processed with CITlab HTR(+). Searching is currently restricted to a single document per search request.

  • Sending a search request:

POST https://transkribus.eu/TrpServerTesting/rest/kws/queries?collId={myCollectionId}&id={myDocId}&query={searchTerm1}&query={searchTerm2}&query=...

  • List KWS search processes. As those might take some time to finish, watch the "status" field of a process until its value is "Completed":

GET https://transkribus.eu/TrpServerTesting/rest/kws/queries

  • Retrieve the hits of a completed search process:

GET https://transkribus.eu/TrpServerTesting/rest/kws/queries/{myKwsJobId}/hits


The GET requests support paging: for example, append the query parameters "?index=0&nValues=2" to get the first two elements.
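The three requests above can be sketched as URL builders. This is a minimal illustration, not an official client: the function names are hypothetical, the example IDs are made up, authentication is not covered because the page does not describe it, and treating a listed process as a dictionary with a "status" key is an assumption about the response format.

```python
from urllib.parse import urlencode

# Base URL copied from the endpoints listed above.
BASE = "https://transkribus.eu/TrpServerTesting/rest/kws/queries"


def search_url(coll_id, doc_id, terms):
    """URL for the POST request that starts a search (one 'query' per term)."""
    params = [("collId", coll_id), ("id", doc_id)]
    params += [("query", term) for term in terms]
    return BASE + "?" + urlencode(params)


def hits_url(job_id, index=0, n_values=2):
    """URL for one page of hits of a completed search process."""
    return f"{BASE}/{job_id}/hits?" + urlencode({"index": index, "nValues": n_values})


def is_completed(process):
    """True once the 'status' field of a listed process reads 'Completed'.

    Assumption: each listed process is available as a dict after JSON decoding.
    """
    return process.get("status") == "Completed"


# Hypothetical collection, document and job IDs, for illustration only.
print(search_url(2, 42, ["Doctor", "Christ"]))
print(hits_url(1234))
```

Repeating the "query" key once per search term mirrors the POST URL shown above; urlencode also percent-encodes terms that contain spaces or special characters.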


Expert syntax

Extended query syntax can be enabled by sending a parameter map in the POST request that starts the search, with the "expert" entry set to "true":

{
   "type" : "kwsParameters",
   "entry" : [ {
      "key" : "caseSensitive",
      "value" : "true"
   }, {
      "key" : "expert",
      "value" : "true"
   } ]
}
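A parameter map like the one above could be assembled in Python as follows. This is only a sketch: the helper name kws_parameters is hypothetical, and sending "false" to switch an option off is an assumption, since the page only shows "true" values.

```python
import json


def kws_parameters(expert=True, case_sensitive=True):
    """Build the parameter map shown above; flags are sent as strings."""
    entries = {"caseSensitive": case_sensitive, "expert": expert}
    return {
        "type": "kwsParameters",
        "entry": [
            # Assumption: a disabled option is written as the string "false".
            {"key": key, "value": "true" if flag else "false"}
            for key, flag in entries.items()
        ],
    }


# Serialise the map so it can be sent with the POST request.
body = json.dumps(kws_parameters(), indent=2)
print(body)
```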

Instead of words, the search patterns are now defined by regular expressions.

To mark the part of interest, one must define a named group "KW". The part of the line that matches this group is returned as the result, e.g.

  • date: .*(?<KW>[0-3][0-9]\.[0-1][0-9]\.[0-9]{4}).* matches any line containing a date of the form DD.MM.YYYY
  • abbreviations: .*(?<KW>Dr\.|Doctor).* matches any line containing Doctor or its abbreviation Dr.
  • uncertainties: .*(?<KW>(k|c|che|chh)rist?).* matches any line containing an Old High German spelling of Christ, e.g. kris, krist, crist, cherist, chhrist
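The example patterns can be tried out locally on plain transcribed text. The sketch below only emulates the idea with Python's re module and does not reproduce how the CITlab engine searches HTR output. Python writes named groups as (?P<KW>...) rather than (?<KW>...); the helper name keyword and the sample lines are made up for illustration.

```python
import re

# Python spells named groups (?P<KW>...); the wiki patterns use (?<KW>...).
PATTERNS = {
    "date": r".*(?P<KW>[0-3][0-9]\.[0-1][0-9]\.[0-9]{4}).*",
    "abbreviation": r".*(?P<KW>Dr\.|Doctor).*",
    "uncertainty": r".*(?P<KW>(k|c|che|chh)rist?).*",
}


def keyword(pattern, line):
    """Return the text captured by the KW group, or None if the line does not match."""
    match = re.fullmatch(pattern, line)  # the pattern must cover the whole line
    return match.group("KW") if match else None


print(keyword(PATTERNS["date"], "geboren am 12.03.1850 in Rostock"))
print(keyword(PATTERNS["abbreviation"], "Herr Dr. Faust"))
print(keyword(PATTERNS["uncertainty"], "der heilige krist"))
```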


In contrast to the standard usage of regular expressions, a search pattern has to match the whole line. For example, .*[0-9]{4,6} matches only lines that end with a number of at least 4 digits. To allow arbitrary characters after the digits, add .* at the end: .*[0-9]{4,6}.* Analogously, [0-9]{4,6}.* matches only lines that begin with at least 4 digits.
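The whole-line rule corresponds to what Python calls a full match. As a rough local illustration (again using Python's re module as a stand-in, not the CITlab engine; the helper name and sample lines are hypothetical):

```python
import re


def matches_line(pattern, line):
    """Whole-line semantics: the pattern has to cover the entire line."""
    return re.fullmatch(pattern, line) is not None


CASES = [
    (r".*[0-9]{4,6}", "anno 1523"),             # line ends with digits
    (r".*[0-9]{4,6}", "anno 1523 begonnen"),    # text follows the digits
    (r".*[0-9]{4,6}.*", "anno 1523 begonnen"),  # trailing .* allows that text
    (r"[0-9]{4,6}.*", "1523 anno"),             # line begins with digits
    (r"[0-9]{4,6}.*", "anno 1523"),             # text precedes the digits
]

for pattern, line in CASES:
    print(f"{pattern!r} on {line!r}: {matches_line(pattern, line)}")
```

Only the second and the last case fail to match, because the pattern leaves part of the line uncovered.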

Standard regular expression features which are supported:

. any character
+ one or more repetitions of the previous literal
* zero or more repetitions of the previous literal
[] class of characters, e.g. [0-9] any digit between 0 and 9; [aeiou] any vowel; [A-Z] any capital letter
? the previous literal is optional
{X} repeat previous literal X times
{X,Y} repeat previous literal between X and Y times
| or operation, e.g. a|b means either a or b
() parentheses are used to group the regular expression: (a|b)c matches ac or bc while a|bc matches a or bc
\ escape operator: to match e.g. a + or . one needs to escape it by \+ or \.

Standard regular expression features which are not supported:

^ beginning of line is not supported
[^....] negation in character classes is not supported
{,Y} {X,} open repetitions are not supported (instead of {,Y} write {0,Y})
$ end of line is not supported
[:alpha:] predefined character classes such as this alphabetic class are not supported
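Before submitting a pattern, it may help to check it for the unsupported constructs listed above. The following is a heuristic sketch, not part of Transkribus: both function names are hypothetical, escaped characters are simply dropped before scanning, and a literal ^ or $ inside a character class would be reported as an anchor.

```python
import re

# (regex that detects the construct, description of the construct)
UNSUPPORTED = [
    (r"\[:[a-z]+:\]", "predefined character class"),
    (r"\[\^", "negated character class"),
    (r"\{,[0-9]+\}|\{[0-9]+,\}", "open repetition"),
    (r"(?<!\[)\^", "beginning-of-line anchor"),
    (r"\$", "end-of-line anchor"),
]


def unsupported_features(pattern):
    """Return descriptions of unsupported constructs found in the pattern."""
    bare = re.sub(r"\\.", "", pattern)  # drop escaped characters such as \. or \$
    return [name for regex, name in UNSUPPORTED if re.search(regex, bare)]


def fix_open_lower_bound(pattern):
    """Rewrite {,Y} as {0,Y}, the workaround named in the list above."""
    return re.sub(r"\{,([0-9]+)\}", r"{0,\1}", pattern)


print(unsupported_features(r"^[^a]{,3}$"))
print(unsupported_features(r".*(?<KW>Dr\.|Doctor).*"))
print(fix_open_lower_bound("a{,3}"))
```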