Pack ape -- prolog/parser/tokenizer.pl
author
- Kaarel Kaljurand
- Tobias Kuhn
version
- 2010-03-28

Comments:

  • Strings (between double quotes) are tokenized into [", content, "]. This can be misleading, e.g. in the sentence `"1" represents 1.' the verb `represents' is reported as the 4th token. One could instead produce the term string(content); this would probably also fix the buggy handling of `Every dot is ".".'.
  • Saxon genitives are tokenized as [noun, ', s] and [nouns, '].
  • Digits cannot start a token that contains symbols other than digits and dots. Consider the following input string and its tokenization.
    123man man123 123man123 man123man
    
    [123, man, man123, 123, man123, man123man]

    BUG: future work: add character counting so that a character offset can be reported together with every token.

    Example:

    ?- tokenizer:codes_to_tokens("2men;can't like\"#@\"and:everything@.", T), writeq(T).
    [2, men, can, not, like, '"#@"', and, :, every, '-thing', '.']
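
    The string and Saxon-genitive rules above can be exercised with queries in the same style as the example. This is a hedged sketch: it assumes the module is loadable under the name `tokenizer' as in the example above, and the comments restate the expected shape of the result from the bullet points rather than verified output.

    ```prolog
    % Sketch: queries exercising the rules described above.

    % Saxon genitive: per the rule, the token list should contain
    % the subsequence ['John', '\'', s].
    ?- tokenizer:codes_to_tokens("John's dog sleeps.", T), writeq(T).

    % String between double quotes: per the rule, T begins with
    % ['"', Content, '"'], so `represents' is reported as the 4th token.
    ?- tokenizer:codes_to_tokens("\"1\" represents 1.", T), writeq(T).
    ```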

 tokenize(+ACEText:term, -Tokens:list) is det
Breaks the ACEText into a list of tokens (atoms or numbers). The input ACEText is either an atom like 'this is an example' or a list of character codes like [116,104,105,115,32,105,...] (possibly written as "this is an example").

Tokenization will never fail. In case something goes wrong (e.g. a string or comment is not closed) then an error message is asserted.

Arguments:
ACEText - the input text, either an atom or a list of character codes
Tokens - the list of tokens, i.e. the tokenization of the input text
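
A minimal usage sketch of tokenize/2. The use_module goal is an assumption based on the file path in the header above; adjust it to however the pack is installed. Outputs are not shown, since they depend on the installed version; tokenize/2 never fails, so both queries are expected to succeed.

```prolog
% Assumed load path, derived from prolog/parser/tokenizer.pl in the header.
:- use_module(ape(parser/tokenizer)).

% Atom input: tokens come back as atoms (numbers for numeric tokens).
?- tokenize('this is an example', Tokens).

% Code-list input, written as a double-quoted string when the
% double_quotes flag is set to codes.
?- tokenize("this is an example", Tokens).
```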