SWI-Prolog Natural Language Processing Primitives
The library(double_metaphone) library implements the Double
Metaphone algorithm developed by Lawrence Philips and described in “The
Double-Metaphone Search Algorithm” by L. Philips, C/C++ Users
Journal, 2000. Double Metaphone creates a key from a word that
represents its phonetic properties. Two words with the same Double
Metaphone key are expected to sound similar. The Double Metaphone algorithm
is an improved version of the Soundex algorithm.
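The key comparison described above can be sketched as follows. This is a minimal illustration, assuming the library exports double_metaphone/2 (word in, primary key out); sounds_alike/2 is a hypothetical helper name and not part of the library.

```prolog
:- use_module(library(double_metaphone)).

%!  sounds_alike(+Word1, +Word2) is semidet.
%
%   Hypothetical helper: true if both words map to the same
%   primary Double Metaphone key.
sounds_alike(Word1, Word2) :-
    double_metaphone(Word1, Key1),
    double_metaphone(Word2, Key2),
    Key1 == Key2.
```

A query such as `?- sounds_alike(smith, smyth).` then succeeds or fails depending on whether the two keys coincide. Comparing only the primary key is a simplification; an application may also want to consult the alternative key.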
The Double Metaphone implementation is copied from the Perl library, which carries the following copyright notice. To the best of our knowledge the Perl license is compatible with the SWI-Prolog license scheme, and therefore including this module imposes no additional license conditions.
Copyright 2000, Maurice Aubrey <maurice@hevanet.com>. All rights reserved. This code is based heavily on the C++ implementation by Lawrence Philips and incorporates several bug fixes courtesy of Kevin Atkinson <kevina@users.sourceforge.net>.
This module is free software; you may redistribute it and/or modify it under the same terms as Perl itself.
The library(porter_stem) library implements the stemming
algorithm described by Porter in Porter, 1980, “An algorithm for
suffix stripping”, Program, Vol. 14, no. 3, pp. 130-137. The
library comes with some additional predicates that are commonly used in
the context of stemming.
[-+][0-9]+(\.[0-9]+)?([eE][-+][0-9]+)? | number |
[:alpha:][:alnum:]+ | word |
[:space:]+ | skipped |
anything else | single-character |
Character classification is based on the C library functions iswalnum() etc. Recognised numbers are passed to Prolog read/1, supporting unbounded integers.
It is likely that future versions of this library will provide tokenize_atom/3 with additional options to modify space handling as well as the definition of words.
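The tokenization rules above combine naturally with stemming. The sketch below assumes the library exports tokenize_atom/2 and porter_stem/2; text_stems/2 is a hypothetical helper name introduced only for illustration. Numbers produced by the tokenizer are dropped because only word-like atoms are stemmed here.

```prolog
:- use_module(library(porter_stem)).
:- use_module(library(apply)).

%!  text_stems(+Text, -Stems) is det.
%
%   Hypothetical helper: tokenize Text, keep the atom tokens
%   (words and single characters) and stem each of them.
text_stems(Text, Stems) :-
    tokenize_atom(Text, Tokens),
    include(atom, Tokens, Words),      % discard numeric tokens
    maplist(porter_stem, Words, Stems).
```

For example, `?- text_stems('walking dogs', Stems).` tokenizes the text into two words and stems each one. Punctuation tokens pass through include/3 as atoms, so a real application may want an additional filter.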
The code is based on the original Public Domain implementation by Martin Porter as can be found at http://www.tartarus.org/martin/PorterStemmer/. The code has been modified by Jan Wielemaker. He removed all global variables to make the code thread-safe, added the unaccent and tokenize code and created the SWI-Prolog binding.
This module encapsulates "The C version of the libstemmer library" from the Snowball project. This library provides stemmers in a variety of languages. The interface to this library is very simple:
Here is an example:
?- snowball(english, walking, S).
S = walk.
The implementation maintains a cache of stemmers for each thread that accesses snowball/3, providing high performance and thread-safety without locking.
Algorithm | is the (English) name of the desired algorithm or a 2- or 3-letter ISO 639 language code. |
Input | is the word to be stemmed. It is either an atom, string or list of chars/codes. The library accepts Unicode characters. Input must be lowercase. See downcase_atom/2. |
domain_error(snowball_algorithm, Algorithm)
type_error(atom, Algorithm)
type_error(text, Input)
semidet
if Algorithm is given.
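Since snowball/3 requires lowercase input, a small wrapper can normalize case before stemming. This is a sketch under the assumption that library(snowball) is installed; stem_any_case/3 is a hypothetical helper name and not part of the library.

```prolog
:- use_module(library(snowball)).

%!  stem_any_case(+Algorithm, +Word, -Stem) is det.
%
%   Hypothetical helper: lowercase Word with downcase_atom/2
%   and then stem it using the given Snowball algorithm.
stem_any_case(Algorithm, Word, Stem) :-
    downcase_atom(Word, Lower),
    snowball(Algorithm, Lower, Stem).
```

With this wrapper, `?- stem_any_case(english, 'Walking', S).` stems the same lowercase word as the example query shown earlier. Errors raised by snowball/3 for an unknown algorithm or non-text input propagate unchanged.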
The library(isub)
implements a similarity measure
between strings, i.e., something similar to the Levenshtein distance.
This method is based on the length of common substrings.
?- isub('E56.Language', 'languange', D, [normalize(true)]).
D = 0.4226950354609929.                       % [-1,1] range

?- isub('E56.Language', 'languange', D, [normalize(true),zero_to_one(true)]).
D = 0.7113475177304964.                       % [0,1] range

?- isub('E56.Language', 'languange', D, []).  % without normalization
D = 0.19047619047619047.                      % [-1,1] range

?- isub(aa, aa, D, []).  % does not work for short substrings
D = -0.8.

?- isub(aa, aa, D, [substring_threshold(0)]). % works with short substrings
D = 1.0.                                      % but may give unwanted values
                                              % between e.g. 'store' and 'spore'.

?- isub(joe, hoe, D, [substring_threshold(0)]).
D = 0.5315315315315314.

?- isub(joe, hoe, D, []).
D = -1.0.
This is a new version of isub/4 which replaces the old version while providing backwards compatibility. This new version allows several options to tweak the algorithm.
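A typical use of the similarity score is a thresholded match. The sketch below is only illustrative: similar_names/2 is a hypothetical helper name and the cut-off of 0.7 is an arbitrary assumption, not a value recommended by the library. The option list is written literally in the source because the options are handled by goal expansion at compile time.

```prolog
:- use_module(library(isub)).

%!  similar_names(+Text1, +Text2) is semidet.
%
%   Hypothetical helper: true if the normalized similarity,
%   mapped to the [0,1] range, is at least 0.7 (arbitrary).
similar_names(Text1, Text2) :-
    isub(Text1, Text2, Similarity,
         [normalize(true), zero_to_one(true)]),
    Similarity >= 0.7.
```

Given the score of about 0.71 reported in the examples for 'E56.Language' and 'languange' with these options, `?- similar_names('E56.Language', languange).` is expected to succeed.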
Text1 | and Text2 are either an atom, string or a list of characters or character codes. |
Similarity | is a float in the range [-1.0,1.0], where 1.0 means most similar. The range can be set to [0,1] with the zero_to_one option described below. |
Options | is a list with elements described
below. Please note that the options are processed at compile time using
goal_expansion to provide much better speed. Supported options are:
|