Difference between revisions of "CITlab Keyword Spotting"
Philip Kahle (Talk | contribs)
Latest revision as of 08:31, 6 April 2020
CITlab Keyword Spotting can be used on all documents that have been processed with CITlab HTR(+). Searching is currently restricted to a single document per search request.
- Sending a search request:
- List KWS search processes. As those might take some time to finish, watch the "status" field of a process until its value is "Completed":
GET https://transkribus.eu/TrpServerTesting/rest/kws/queries
- Retrieve the hits of a completed search process:
GET https://transkribus.eu/TrpServerTesting/rest/kws/queries/{myKwsJobId}/hits
The GET requests support paging, e.g. use the query parameters "?index=0&nValues=2" to get the first two elements.
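A minimal Python sketch of how the paged hits URL could be assembled. The base URL and the "index" and "nValues" parameters come from the requests above; the function name `hits_url`, its default page size, and the job id are hypothetical, and authentication is not covered.

```python
from urllib.parse import urlencode

# Base URL taken from the GET requests listed above.
BASE_URL = "https://transkribus.eu/TrpServerTesting/rest/kws/queries"


def hits_url(job_id, index=0, n_values=10):
    """Return the URL for one page of hits of a completed KWS job.

    hits_url is a hypothetical helper; the default page size of 10 is
    an assumption, not a documented server default.
    """
    query = urlencode({"index": index, "nValues": n_values})
    return f"{BASE_URL}/{job_id}/hits?{query}"


if __name__ == "__main__":
    # "12345" is a placeholder job id used only for illustration.
    print(hits_url("12345", index=0, n_values=2))
```

Increasing `index` by `n_values` on each call would step through consecutive pages, assuming `index` is an element offset as the example parameters suggest.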
Expert syntax
Extended query syntax can be enabled by sending the following parameter map in the POST request that starts the search:
{
  "type" : "kwsParameters",
  "entry" : [ {
    "key" : "caseSensitive",
    "value" : "true"
  }, {
    "key" : "expert",
    "value" : "true"
  } ]
}
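The parameter map can also be generated programmatically. The following Python sketch builds the same structure; the helper name `kws_parameters` is hypothetical, and the value "false" for a disabled flag is an assumption, since only "true" appears in the example above.

```python
import json


def kws_parameters(case_sensitive=True, expert=True):
    """Build the kwsParameters map with flags encoded as strings.

    Hypothetical helper: the source only shows the value "true";
    using "false" for a disabled flag is an assumption.
    """
    def flag(value):
        return "true" if value else "false"

    return {
        "type": "kwsParameters",
        "entry": [
            {"key": "caseSensitive", "value": flag(case_sensitive)},
            {"key": "expert", "value": flag(expert)},
        ],
    }


if __name__ == "__main__":
    # Serialize the map as it would appear in the request.
    print(json.dumps(kws_parameters(), indent=2))
```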
Instead of words, the search patterns are now defined by regular expressions.
To define the part of interest, one must define a group "KW". As a result, the part that contains this group will be returned, e.g.
- date: .*(?<KW>[0-3][0-9]\.[0-1][0-9]\.[0-9]{4}).* matches any line containing a date of the form DD.MM.YYYY
- abbreviations: .*(?<KW>Dr\.|Doctor).* matches any line containing Doctor or its abbreviation Dr.
- uncertainties: .*(?<KW>(k|c|che|chh)rist?).* matches any line containing Old High German spellings for Christ, e.g. kris, krist, crist, cherist, chhrist
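The example patterns can be tried out locally before sending a search. The sketch below uses Python's `re` module as a stand-in, which is an assumption: the server's regular expression engine may behave differently. Python writes named groups as `(?P<KW>...)`, so the hypothetical helper `to_python` rewrites the group syntax first, and `find_kw` assumes that the returned part is the text captured by "KW".

```python
import re


def to_python(pattern):
    """Rewrite the (?<KW>...) group syntax into Python's (?P<KW>...) form."""
    return pattern.replace("(?<KW>", "(?P<KW>")


def find_kw(pattern, line):
    """Return the text captured by the KW group, or None if the line does not match.

    re.fullmatch is used because the pattern has to cover the whole line.
    """
    match = re.fullmatch(to_python(pattern), line)
    return match.group("KW") if match else None


if __name__ == "__main__":
    date = r".*(?<KW>[0-3][0-9]\.[0-1][0-9]\.[0-9]{4}).*"
    title = r".*(?<KW>Dr\.|Doctor).*"
    christ = r".*(?<KW>(k|c|che|chh)rist?).*"
    # The sample lines are invented for illustration.
    print(find_kw(date, "Signed on 06.04.2020 in Rostock"))
    print(find_kw(title, "Dr. Faust"))
    print(find_kw(christ, "der heilige krist"))
```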
In contrast to standard usage of regular expressions, the search patterns have to match the whole line, e.g. .*[0-9]{4,6} will match only lines which end with a number of at least 4 digits. To allow arbitrary characters after the digits, one has to add .* at the end: .*[0-9]{4,6}.*
Analogously, [0-9]{4,6}.* matches only lines which begin with at least 4 digits.
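The whole-line requirement corresponds to what Python calls a full match. The following sketch, which again assumes Python's `re` module as a local stand-in for the server's engine, filters a few invented sample lines with the three patterns discussed above; `matching_lines` is a hypothetical helper.

```python
import re

# Invented sample lines for illustration.
LINES = ["invoice 12345", "12345 items", "in 1848 it began", "no digits"]


def matching_lines(pattern, lines):
    """Keep only the lines that the pattern matches from start to end."""
    return [line for line in lines if re.fullmatch(pattern, line)]


if __name__ == "__main__":
    for pattern in (r".*[0-9]{4,6}", r"[0-9]{4,6}.*", r".*[0-9]{4,6}.*"):
        print(pattern, matching_lines(pattern, LINES))
```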
Standard regular expression features which are supported:
. | any character |
+ | one or more repetitions of the previous literal |
* | zero or more repetitions of the previous literal |
[] | class of characters, e.g. [0-9] any digit between 0 and 9; [aeiou] any vowel; [A-Z] any capital letter |
? | the previous literal is optional |
{X} | repeat previous literal X times |
{X,Y} | repeat previous literal between X and Y times |
| | or operation, e.g. a|b means either a or b |
() | parentheses are used to group the regular expression: (a|b)c matches ac or bc while a|bc matches a or bc |
\ | escape operator: to match e.g. a + or . one needs to escape it by \+ or \. |
Standard regular expression features which are not supported:
^ | beginning of line is not supported |
[^....] | negation in character classes is not supported |
{,Y} {X,} | open repetitions are not supported (instead of {,Y} write {0,Y}) |
$ | end of line is not supported |
[:alpha:] | predefined character classes like this alphabetical class are not supported |
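A pattern can be screened for the unsupported features listed above before it is sent. The Python sketch below is a heuristic under stated assumptions: the names `UNSUPPORTED`, `unsupported_features` and `close_open_lower_bound` are hypothetical, the detection only skips constructs directly preceded by a single backslash, and it is not a full regular expression parser.

```python
import re

# Each entry pairs a description with a detector for the construct.
# (?<!\\) skips occurrences directly preceded by a backslash; an escaped
# backslash followed by the construct (e.g. \\^) is not handled correctly.
UNSUPPORTED = [
    ("negated character class [^...]", re.compile(r"(?<!\\)\[\^")),
    ("predefined character class such as [:alpha:]", re.compile(r"\[:[a-z]+:\]")),
    ("open repetition {,Y}", re.compile(r"(?<!\\)\{,[0-9]+\}")),
    ("open repetition {X,}", re.compile(r"(?<!\\)\{[0-9]+,\}")),
    ("line anchor ^", re.compile(r"(?<![\\\[])\^")),
    ("line anchor $", re.compile(r"(?<!\\)\$")),
]


def unsupported_features(pattern):
    """Return descriptions of the unsupported constructs found in the pattern."""
    return [name for name, detector in UNSUPPORTED if detector.search(pattern)]


def close_open_lower_bound(pattern):
    """Rewrite {,Y} as {0,Y}, the replacement suggested in the table above."""
    return re.sub(r"(?<!\\)\{,([0-9]+)\}", r"{0,\1}", pattern)


if __name__ == "__main__":
    print(unsupported_features("^[^a-z]{,3}$"))
    print(close_open_lower_bound("a{,3}"))
```

The anchors ^ and $ can simply be dropped from a pattern, since the search patterns already have to match the whole line.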