Pack logicmoo_nlu -- ext/regulus/PrologLib/CorpusTools/ngram_tools_doc.txt
Basic scheme:
tokenize_sents_in_file(+InFile, +OutFile)
Tokenizes the sentences in InFile and writes the results to OutFile. InFile may have been created by the utilities for cleaning and sentence-tokenizing French data.
Both files are UTF-8.
extract_ngrams(+InFile, +TmpFile, +SortedTmpFile, +OutFile)
InFile is a tokenized file produced by tokenize_sents_in_file/2.
TmpFile and SortedTmpFile are temporary working files.
OutFile is a sorted file associating ngrams with counts.
All files are UTF-8.
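The two temporary files suggest a count-by-sorting approach: write out every ngram, sort so that identical ngrams become adjacent, then collapse each run into a single count. The following is a minimal in-memory sketch of that idea in SWI-Prolog, not the library's actual implementation. The predicate names ngrams_of_length/3, count_sorted/2, take_run/5 and count_ngrams/3 are hypothetical, and the ngram length N is an assumed parameter that does not appear in the documented signature of extract_ngrams/4.

```prolog
:- use_module(library(lists)).

% ngrams_of_length(+N, +Tokens, -Ngrams)  (hypothetical helper)
% Ngrams is the list of all contiguous sublists of Tokens of length N.
ngrams_of_length(N, Tokens, Ngrams) :-
    findall(Ngram,
            ( append(_, Suffix, Tokens),   % every suffix of Tokens
              length(Ngram, N),            % a list of exactly N elements
              append(Ngram, _, Suffix) ),  % that is a prefix of the suffix
            Ngrams).

% count_sorted(+SortedItems, -Pairs)  (hypothetical helper)
% Collapses runs of identical items in a sorted list into Item-Count pairs.
count_sorted([], []).
count_sorted([X|Xs], [X-Count|Rest]) :-
    take_run(X, Xs, 1, Count, Remaining),
    count_sorted(Remaining, Rest).

% take_run(+X, +Items, +Acc, -Count, -Remaining)
% Consumes leading copies of X from Items, adding them to Acc.
take_run(X, [Y|Ys], Acc, Count, Remaining) :-
    X == Y, !,
    Acc1 is Acc + 1,
    take_run(X, Ys, Acc1, Count, Remaining).
take_run(_, Remaining, Count, Count, Remaining).

% count_ngrams(+N, +TokenizedSents, -Pairs)  (hypothetical)
% TokenizedSents is a list of token lists, one per sentence.
count_ngrams(N, TokenizedSents, Pairs) :-
    findall(Ngram,
            ( member(Sent, TokenizedSents),
              ngrams_of_length(N, Sent, Ngrams),
              member(Ngram, Ngrams) ),
            All),
    msort(All, Sorted),                    % sort without removing duplicates
    count_sorted(Sorted, Pairs).
```

For example, count_ngrams(2, [[le,chat,dort],[le,chat,mange]], Pairs) gives Pairs = [[chat,dort]-1, [chat,mange]-1, [le,chat]-2]. Working through files rather than in-memory lists, as extract_ngrams/4 does, would presumably allow corpora too large to hold as a single Prolog term.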
normalise_ngram_file(+InFile, +OutFile)
InFile is a sorted ngram/count file produced by extract_ngrams/4.
OutFile is an ngram/frequency file.
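The text does not say how counts are turned into frequencies. One plausible reading, shown here purely as an assumption, is relative frequency: each count divided by the sum of all counts. The predicate normalise_counts/2 is a hypothetical in-memory illustration and says nothing about the file format used by normalise_ngram_file/2.

```prolog
:- use_module(library(lists)).

% normalise_counts(+CountPairs, -FreqPairs)  (hypothetical)
% CountPairs is a list of Ngram-Count pairs. FreqPairs has the same
% ngrams in the same order, each paired with Count divided by the
% total of all counts (an assumed definition of "frequency").
% Fails if the total count is zero, e.g. for an empty list.
normalise_counts(CountPairs, FreqPairs) :-
    findall(Count, member(_-Count, CountPairs), Counts),
    sum_list(Counts, Total),
    Total > 0,
    findall(Ngram-Freq,
            ( member(Ngram-Count, CountPairs),
              Freq is Count / float(Total) ),  % force a float result
            FreqPairs).
```

Relative frequencies make two corpora of different sizes comparable, which is presumably why normalisation precedes the combination step.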
combine_normalised_ngram_files(+InFile1, +InFile2, +OutFile)
InFile1 and InFile2 are ngram/frequency files.
OutFile is an ngram/frequency file which combines them.
order_combined_ngrams(+InFile, +FilterPred, +OutFile)
InFile is an ngram/frequency file.
FilterPred is a predicate defined in this file which may hold of an n-gram.
OutFile is a sorted ngram/frequency file containing the elements for which FilterPred holds.
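The predicates above can be chained into a single driver. The sketch below assumes the ngram tools are already loaded and that the intended use is to process two corpora separately and then combine them; process_corpus/3 and compare_corpora/4 are hypothetical names, and all intermediate file names are made up for illustration. FilterPred is passed through unchanged, since the names of the available filter predicates are not listed here.

```prolog
% process_corpus(+SentsFile, +Prefix, -FreqFile)  (hypothetical)
% Runs tokenization, ngram extraction and normalisation on one corpus.
% Intermediate file names are built from Prefix and are assumptions.
process_corpus(SentsFile, Prefix, FreqFile) :-
    atom_concat(Prefix, '_tokenized.txt', TokFile),
    atom_concat(Prefix, '_ngrams_tmp.txt', TmpFile),
    atom_concat(Prefix, '_ngrams_sorted_tmp.txt', SortedTmpFile),
    atom_concat(Prefix, '_ngram_counts.txt', CountFile),
    atom_concat(Prefix, '_ngram_freqs.txt', FreqFile),
    tokenize_sents_in_file(SentsFile, TokFile),
    extract_ngrams(TokFile, TmpFile, SortedTmpFile, CountFile),
    normalise_ngram_file(CountFile, FreqFile).

% compare_corpora(+SentsFile1, +SentsFile2, +FilterPred, +OutFile)  (hypothetical)
% Processes both corpora, combines their ngram/frequency files and
% writes the filtered, sorted result to OutFile.
compare_corpora(SentsFile1, SentsFile2, FilterPred, OutFile) :-
    CombinedFile = 'combined_ngram_freqs.txt',
    process_corpus(SentsFile1, corpus1, FreqFile1),
    process_corpus(SentsFile2, corpus2, FreqFile2),
    combine_normalised_ngram_files(FreqFile1, FreqFile2, CombinedFile),
    order_combined_ngrams(CombinedFile, FilterPred, OutFile).
```

Keeping every intermediate file on disk makes it possible to inspect or rerun a single stage without repeating the earlier ones.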