/*

train_classifier(+AnnotatedWavfiles, +RecParamsAlist, +FeatureExtractionSpecAlist,
                 +DiscriminantSpec, +Discriminants, +RecOrNoRec)

Example call:

train_classifier(
    corpora(annotated_wavfiles_training_in_domain),
    [1-[package=ebllm_gemini(recogniser_sem), grammar='.MAIN', 'rec.Pruning'=1200],
     2-[package=slm(package), grammar='.TOP', 'rec.Pruning'=1200]],
    [1-[hand_coded_patterns, sem_triples, lf_postproc_pred=riacs_postproc_lf],
     2-[class_unigrams, class_bigrams, class_trigrams]],
    [],
    alterf_generated_files('discriminants_slm.pl'),
    rec([1-alterf_generated_files('batchrec_results_train1.pl'),
         2-alterf_generated_files('batchrec_results_train2.pl')])
    ).

Top-level call to build discriminants file. The arguments are as follows:

1. AnnotatedWavfiles. File of training data. A line is of the form

   <Wavfile> | <Annotations> | <Transcription>

   - <Wavfile> is either the absolute pathname of a wavfile (necessary if we want to do
     training using real speech data), or else an arbitrary ID (permissible if we are
     going to train in text mode).

   - <Annotations> is a space-separated list of semantic atoms. These can be either simple
     unquoted Prolog atoms, or else one of the following special atoms:

     a) Positive integers, e.g. 1, 5, 42

     b) Numbers with a decimal point separator, e.g. 1.1, 2.5, 3.10. <N>.<M> is
        interpreted NOT as a standard decimal number, but as a pair of the form
        < <N>, <M> >.

     c) Times, expressed in the notation N:NN or NN:NN, e.g. 3:20, 05:15, 10:24, 8:03

     d) Ranges of numbers, expressed in the form <Number>-<Number>, e.g. 3-15, 3.1-10.2,
        5.1-20

   - <Transcription>. A space-separated list of atoms representing the text form of the
     training utterance.

   Examples of possible lines from training file:

   C:\home\speech\Corpora\wavfiles\checklist\se_091102\utt07.wav | load water | load water procedure
   C:\home\speech\Corpora\wavfiles\checklist3\2003617_1254\utt64.wav | go_to caution step 5 | go to the caution for step five
   C:\home\speech\Corpora\wavfiles\checklist3\2003617_1254\utt43.wav | stop_alarm 01:21 | cancel the alarm for oh one twenty one
   C:\home\speech\Corpora\wavfiles\checklist3\2003617_1254\utt13.wav | read line 4 | read substep four
   dummy_file174 | correction 6.5 | no i meant six point five

2. RecParamsAlist. List containing one or more elements of the form <Id>-<Params>, where
   <Id> is an arbitrary Prolog atom and <Params> is a list of elements of the form
   <Key>=<Value>. The RecParamsAlist specifies how recognition is to be carried out for
   each recogniser package used. Each package needs to specify at least the following
   two parameters:

   a) package. E.g. package=ebllm_gemini(recogniser_sem) says to use the compiled Nuance
      package ebllm_gemini(recogniser_sem)

   b) grammar. Top-level grammar to use in the specified package.

   Other keys are interpreted as extra parameters to pass to the Nuance batchrec process,
   if training is carried out in speech mode. They are ignored if training is carried out
   in text mode.

3. FeatureExtractionSpecAlist. List containing one or more elements of the form
   <Id>-<FeatureSpec>, where the <Id>s are the same as the ones used for the
   RecParamsAlist, and the <FeatureSpec>s are lists specifying the features that are to
   be extracted for each recogniser. The following types of features are currently
   supported:

   - unigrams. Single words from surface string.

   - bigrams. Pairs of words from surface string.

   - trigrams. Triples of words from surface string.

   - class_unigrams. Single words from surface string, backed off using a tagging grammar
     defined by the predicate user:tagging_grammar/3. An example of a tagging grammar
     appears in home/speech/rialist/checklist/Alterf/Prolog/checklist_tagging_grammar.pl

   - class_bigrams. Like class_unigrams, but pairs.

   - class_trigrams.
     Like class_unigrams, but triples.

   - sem_triples. Semantic triples. These are closely modelled on the semantic triples
     used in the SRI CLE project, and are extracted from logical forms.

   - hand_coded_patterns. Matches of subforms of the logical form to patterns defined by
     the predicate user:alterf_pattern/3. Examples of patterns appear in
     home/speech/rialist/checklist/Alterf/Prolog/checklist_alterf_patterns.pl

   If you are using sem_triples or hand_coded_patterns, you may need to define a
   post-processing predicate, which is applied to the LF before further processing is
   carried out. The syntax is lf_postproc_pred=<Pred>. For the current Checklist system,
   we need lf_postproc_pred=riacs_postproc_lf.

4. DiscriminantSpec. List containing zero or more elements of the form <Key>=<Value>,
   specifying how the discriminants are to be calculated. The following keys and values
   are supported. Many of these are probably now only of historical interest, and
   represent unsuccessful attempts to tune the algorithm. The definitions are in
   make_discriminants.pl.

   - bayes_version. Possible values: [naive, normalised].
       'naive'      -> discriminant score = log2(P(Atom | Feat))
       'normalised' -> discriminant score = log2(P(Atom | Feat)) - log2(P(Atom))
     'normalised' seemed to give better results for small amounts of training data, but
     didn't hold up.

   - use_proportion_of_data. Real number between 0 and 1. What proportion of the data to
     use; the rest is discarded.

   - rec_ok_weight. Positive integer N. If greater than 1, count each correctly
     recognised utterance as though it had occurred N times rather than just once.
     Didn't turn out useful in the end.

   - confidence_threshold_ignore. Positive number between 0 and 100. Discard utterances
     tagged "ignore" if they are under threshold. Current default is 45, but right now we
     don't train on the ignored utterances so this is irrelevant.

   - confidence_threshold_non_ignore. Positive number between 0 and 100. Discard
     utterances not tagged "ignore" if they are under threshold. Current default is 45.
     Training on low-confidence utterances seems to be a bad idea, so this is useful.

   - discriminant_score_threshold. Real. Discard discriminants with values less than the
     given threshold. This makes the discriminant table much smaller, so it's useful.
     Default threshold is 0.5.

   - discriminant_n_good_examples_threshold. Integer. Discard discriminants based on
     insufficiently many positive examples. One positive example by default.

   - mle_formula. Possible values: [standard, carter]. Use the normal MLE formula or the
     complicated formula from the appendix to the SLT book. 'carter' may be better, but
     so far not significantly so. The normal formula ('standard') is the default.

   - hand_coded_rule_bonus. Real. Can add a bonus to hand-coded rule scores if we want to
     prioritise them. Zero by default; higher values haven't given better results.

   - assumed_minimum_n_good_examples_for_rule. We can assume we have more positive
     examples for a rule-based feature if we want. Minimum of 10 by default.

5. Discriminants. Write out discriminants to this file. The format is

   d(<Feature>, <Atom>, <GoodExamples>, <BadExamples>, <Score>).

   So for example

   d(class_trigram([the,alarm,for]),stop_alarm,2,0,1.0).

   means that there were two good examples and zero bad examples associating the class
   trigram feature for the trigram [the,alarm,for] and the semantic atom stop_alarm,
   giving a score of 1.0. Similarly,

   d(sem_triple([bag,adj,small]),show,18,7,0.75).

   means that there were 18 good examples and 7 bad examples associating the sem triple
   feature for the triple [bag,adj,small] and the semantic atom show, giving a score of
   0.75.

   You can find an example of a discriminants file in
   home/speech/rialist/checklist/Alterf/GeneratedFiles/discriminants.pl.

6. RecOrNoRec. This can have three possible types of value:

   - text. Carry out training in text mode, using only the transcriptions.

   - rec(<RecResultsAlist>). <RecResultsAlist> is a list containing one or more elements
     of the form <Id>-<File>, where the <Id>s are the same as the ones used for the
     RecParamsAlist, and the recognition results get written to the <File>s for each
     recogniser.

   - saved_rec(<RecResultsAlist>). <RecResultsAlist> is a list in the same format as for
     rec(). Training doesn't carry out recognition, but just uses the results saved in
     the designated files.

*/

%--------------------------------------------------------------------------------------------

:- module(classifier_trainer,
          [train_classifier/6]
         ).

%--------------------------------------------------------------------------------------------

:- use_module('$REGULUS/Alterf/Prolog/parse_annotated_wavfiles').
:- use_module('$REGULUS/Alterf/Prolog/batchrec').
:- use_module('$REGULUS/Alterf/Prolog/extract_feats').
:- use_module('$REGULUS/Alterf/Prolog/make_discriminants').
:- use_module('$REGULUS/Alterf/Prolog/classifier_utilities').

:- use_module('$REGULUS/PrologLib/utilities').

:- use_module(library(system)).
:- use_module(library(lists)).
%:- use_module(library(ordsets)).

%--------------------------------------------------------------------------------------------

/*

Top-level is very simple:

- Do any front-end processing required on the training data (real batchrec, text batchrec,
  or retrieve saved results).

- Extract features from batchrec/text-batchrec results using routines in extract_feats.pl

- Turn feat vectors into discriminants using routines in make_discriminants.pl

*/

train_classifier(AnnotatedWavfiles0, RecParamsAlist, FeatureExtractionSpecAlist, DiscriminantSpec, DiscriminantScores, RecOrNoRec) :-
    check_alists_are_compatible([RecParamsAlist, FeatureExtractionSpecAlist]),
    make_classifier_tmp_files(TmpFeatVectors, TmpWavfiles, TmpTranscriptions, TmpSents, TmpAnnotatedWavfiles),
    parse_annotated_wavfiles(AnnotatedWavfiles0, TmpWavfiles, TmpTranscriptions, TmpSents, TmpAnnotatedWavfiles),
    (   RecOrNoRec = rec(RecResultsAlist) ->
        check_alists_are_compatible([RecParamsAlist, RecResultsAlist]),
        do_batchrec_multiple(TmpWavfiles, TmpTranscriptions, RecParamsAlist, RecResultsAlist)
    ;
        RecOrNoRec = text ->
        do_text_batchrec_multiple(TmpTranscriptions, RecParamsAlist, RecResultsAlist)
    ;
        RecOrNoRec = saved_rec(RecResultsAlist) ->
        check_alists_are_compatible([RecParamsAlist, RecResultsAlist]),
        check_saved_rec_results_alist(RecResultsAlist)
    ;
        format('~N*** Error: bad value "~w" for last arg to train_classifier/6. Must be "rec()", "text" or "saved_rec()".~n', [RecOrNoRec]),
        fail
    ),
    rec_results_to_feature_vectors(RecResultsAlist, FeatureExtractionSpecAlist, TmpFeatVectors),
    feature_vectors_to_discriminant_scores(TmpAnnotatedWavfiles, TmpFeatVectors, DiscriminantSpec, DiscriminantScores).

%--------------------------------------------------------------------------------------------

make_classifier_tmp_files(TmpFeatVectors, TmpWavfiles, TmpTranscriptions, TmpSents, TmpAnnotatedWavfiles) :-
    make_tmp_file(feat_vectors, TmpFeatVectors),
    make_tmp_file(wavfiles, TmpWavfiles),
    make_tmp_file(transcriptions, TmpTranscriptions),
    make_tmp_file('sents.pl', TmpSents),
    make_tmp_file('annotated_wavfiles.pl', TmpAnnotatedWavfiles).

%--------------------------------------------------------------------------------------------
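
/*

Illustrative sketch only, not part of the original module: what a text-mode training
call might look like, following the argument descriptions in the header comment.

Assumptions (all hypothetical, chosen for illustration):
- the predicate name example_train_text_mode/0;
- the file names 'my_training_data.txt' and 'my_discriminants.pl', and the assumption
  that plain atom pathnames are accepted as well as alias terms such as corpora(...);
- a single recogniser with ID 1, reusing the slm(package) and '.TOP' values from the
  example call in the header comment.

In text mode the first field of each training line may be an arbitrary ID, extra
recogniser parameters such as 'rec.Pruning' are ignored, and no recognition results
files are needed, so the last argument is just the atom text.

*/

example_train_text_mode :-
    train_classifier('my_training_data.txt',                              % hypothetical training file
                     [1-[package=slm(package), grammar='.TOP']],          % package and grammar are mandatory
                     [1-[class_unigrams, class_bigrams, class_trigrams]], % same ID as in the line above
                     [discriminant_score_threshold=0.5],                  % documented default, stated explicitly
                     'my_discriminants.pl',                               % hypothetical output file
                     text).

%--------------------------------------------------------------------------------------------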