This module offers a simple tokenizer with flexible options.
- author
- - Shon Feder
- license
- - http://unlicense.org/
Rational:
tokenize_atom/2, in library(porter_stem), is inflexible, in that it doesn't
allow for the preservation of white space or control characters, and it only
tokenizes into a list of atoms.
The tokenize library is meant to be easy to use while allowing for relatively
flexible input and output. Features include
- options for tokenization of spaces, numbers, strings, control characters and punctuation
- options to output packed tokens
- options to represent tokens in any of the common SWI-Prolog text formats
- option to preserve or ignore case
- a predicate to emit text given a list of tokens
E.g.,
?- tokenize('Tokenizes: words,"strings", 1234.5\n', Tokens, [cased(true), spaces(false)]),
| untokenize(Tokens, Codes).
Tokens = [word('Tokenizes'), punct(:), word(words), punct(','), string(strings), punct(','), number(1234.5), cntrl('\n')],
Codes = "Tokenizes:words,"strings"...34.5
".
tokenize is much more limited and much less performant than a lexer generator,
but it is dead simple to use and flexible enough for many common use cases.
- tokenize(+Text:text, -Tokens:list(term)) is semidet
-
- See also
- - tokenize/3 when called with an empty list of options: thus, with defaults.
- tokenize(+Text:text, -Tokens:list(term), +Options:list(term)) is semidet
- True when Tokens is unified with a list of tokens representing the text from
Text, according to the options specified in Options.
Each token in Tokens will be one of:
- word(W)
- Where W is comprised of contiguous alpha-numeric chars.
- punct(P)
- Where
char_type(P, punct).
- cntrl(C)
- Where
char_type(C, cntrl).
- space(S)
- Where
S == ' '.
- number(N)
- Where
number(N).
- string(S)
- Where S was a sequence of bytes enclosed by double quotation marks.
Note that the above describes the default behavior, in which the token is
represented as an atom. This representation can be changed by using the
to option described below.
Valid Options are:
- cased(+boolean)
- Determines whether tokens perserve cases of the source text. Defaults to
cased(false).
- spaces(+boolean)
- Determines whether spaces are represted as tokens or discarded. Defaults to
spaces(true).
- cntrl(+boolean)
- Determines whether control characters are represented as tokens or discarded. Defaults to
cntrl(true).
- punct(+boolean)
- Determines whether punctuation characters are represented as tokens or discarded. Defaults to
punct(true).
- numbers(+boolean)
- Determines whether the tokenizer represents and tags numbers. Defaults to
numbers(true).
- strings(+boolean)
- Determines whether the tokenizer represents and tags strings. Defaults to
strings(true).
- pack(+boolean)
- Determines whether tokens are packed or repeated. Defaults to
pack(false).
- to(+one_of([strings, atoms, chars, codes]))
- Determines the representation format used for the tokens. Defaults to
to(atoms).
- tokenize_file(+File:atom, -Tokens:list(term)) is semidet
-
- See also
- - tokenize_file/3 when called with an empty list of options: thus, with defaults.
- tokenize_file(+File:atom, -Tokens:list(term), +Options:list(term)) is semidet
- True when Tokens is unified with a list of tokens represening
the text of File.
- See also
- - tokenize/3 which has the same available options and behavior.
- untokenize(+Tokens:list(term), -Untokens:list(codes)) is semidet
- True when Untokens is unified with a code list representation of each
token in Tokens.