Pack ape -- prolog/parser/tokenizer.pl
author
- Kaarel Kaljurand
- Tobias Kuhn
version
- 2010-03-28

Comments:

  • Strings (between double quotes) are tokenized into [", content, "]. This can be misleading, e.g. in the sentence `"1" represents 1.' the verb `represents' is reported as the 4th token. One could instead produce the term string(content); this would probably also fix the buggy handling of `Every dot is ".".'.
  • Saxon genitives are tokenized as [noun, ', s] and [nouns, '].
  • Digits cannot start a token that contains symbols other than digits and dots. Consider the following input string and its tokenization.
    123man man123 123man123 man123man
    
    [123, man, man123, 123, man123, man123man]

    BUG: future work: add character counting so that a character offset can be reported together with every token.

    Example:

    ?- tokenizer:codes_to_tokens("2men;can't like\"#@\"and:everything@.", T), writeq(T).
    [2, men, can, not, like, '"#@"', and, :, every, '-thing', '.']
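
    The string and Saxon-genitive rules above can be exercised with queries in the same style as the example. This is a hedged sketch: it assumes the module is loadable under the name `tokenizer' as in the example above, and the comments restate the expected shape of the result from the bullet points rather than verified output.

    ```prolog
    % Sketch: queries exercising the rules described above.

    % Saxon genitive: per the rule, the token list should contain
    % the subsequence ['John', '\'', s].
    ?- tokenizer:codes_to_tokens("John's dog sleeps.", T), writeq(T).

    % String between double quotes: per the rule, T begins with
    % ['"', Content, '"'], so `represents' is reported as the 4th token.
    ?- tokenizer:codes_to_tokens("\"1\" represents 1.", T), writeq(T).
    ```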

 tokenize(+ACEText:term, -Tokens:list) is det
Breaks the ACEText into a list of tokens (atoms or numbers). The input ACEText is either an atom like 'this is an example' or a list of character codes like [116,104,105,115,32,105,...] (possibly written as "this is an example").

Tokenization will never fail. In case something goes wrong (e.g. a string or comment is not closed) then an error message is asserted.

Arguments:
ACEText - the input text, either an atom or a list of character codes
Tokens - the list of tokens, i.e. the tokenization of the input text
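
A minimal usage sketch of tokenize/2. The use_module goal is an assumption based on the file path in the header above; adjust it to however the pack is installed. Outputs are not shown, since they depend on the installed version; tokenize/2 never fails, so both queries are expected to succeed.

```prolog
% Assumed load path, derived from prolog/parser/tokenizer.pl in the header.
:- use_module(ape(parser/tokenizer)).

% Atom input: tokens come back as atoms (numbers for numeric tokens).
?- tokenize('this is an example', Tokens).

% Code-list input, written as a double-quoted string when the
% double_quotes flag is set to codes.
?- tokenize("this is an example", Tokens).
```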