Pack tokenize -- prolog/tokenize.pl

This module offers a simple tokenizer with flexible options.

author

- Shon Feder

license

Rational:

tokenize_atom/2, in library(porter_stem), is inflexible, in that it doesn't allow for the preservation of white space or control characters, and it only tokenizes into a list of atoms.

The tokenize library is meant to be easy to use while allowing for relatively flexible input and output. Features include

options for tokenization of spaces, numbers, strings, control characters and punctuation
options to output packed tokens
options to represent tokens in any of the common SWI-Prolog text formats
option to preserve or ignore case
a predicate to emit text given a list of tokens

E.g.,

?- tokenize('Tokenizes: words,"strings", 1234.5\n', Tokens, [cased(true), spaces(false)]),
|    untokenize(Tokens, Codes).
Tokens = [word('Tokenizes'), punct(:), word(words), punct(','), string(strings), punct(','), number(1234.5), cntrl('\n')],
Codes = "Tokenizes:words,"strings"...34.5
".

tokenize is much more limited and much less performant than a lexer generator, but it is dead simple to use and flexible enough for many common use cases.

tokenize(+Text:text, -Tokens:list(term)) is semidet

See also: - tokenize/3 when called with an empty list of options: thus, with defaults.

tokenize(+Text:text, -Tokens:list(term), +Options:list(term)) is semidet

True when Tokens is unified with a list of tokens representing the text from Text, according to the options specified in Options.

Each token in Tokens will be one of:

word(W): Where W is comprised of contiguous alpha-numeric chars.
punct(P): Where char_type(P, punct).
cntrl(C): Where char_type(C, cntrl).
space(S): Where S == ' '.
number(N): Where number(N).
string(S): Where S was a sequence of bytes enclosed by double quotation marks.

Note that the above describes the default behavior, in which the token is represented as an atom. This representation can be changed by using the to option described below.

Valid Options are:

cased(+boolean): Determines whether tokens perserve cases of the source text. Defaults to cased(false).
spaces(+boolean): Determines whether spaces are represted as tokens or discarded. Defaults to spaces(true).
cntrl(+boolean): Determines whether control characters are represented as tokens or discarded. Defaults to cntrl(true).
punct(+boolean): Determines whether punctuation characters are represented as tokens or discarded. Defaults to punct(true).
numbers(+boolean): Determines whether the tokenizer represents and tags numbers. Defaults to numbers(true).
strings(+boolean): Determines whether the tokenizer represents and tags strings. Defaults to strings(true).
pack(+boolean): Determines whether tokens are packed or repeated. Defaults to pack(false).
to(+one_of([strings, atoms, chars, codes])): Determines the representation format used for the tokens. Defaults to to(atoms).

tokenize_file(+File:atom, -Tokens:list(term)) is semidet

See also: - tokenize_file/3 when called with an empty list of options: thus, with defaults.

tokenize_file(+File:atom, -Tokens:list(term), +Options:list(term)) is semidet

True when Tokens is unified with a list of tokens represening the text of File.

See also: - tokenize/3 which has the same available options and behavior.

untokenize(+Tokens:list(term), -Untokens:list(codes)) is semidet

True when Untokens is unified with a code list representation of each token in Tokens.