
slmbuild - generate language model from idngram file

Author

       Originally written by Phill.Zhang <phill.zhang@sun.com>. Currently maintained by Kov.Chai
       <tchaikov@gmail.com>.

Description

slmbuild generates a back-off smoothing language model from a given idngram file. Generally, the
       idngram_file is created by ids2ngram.

Example

       The following example reads 'all.id3gram' and writes the trigram model 'all.slm'.

       At the 1-gram level, use Good-Turing discount with cut-off 0, R=8, dis=0.9995. At the 2-gram level, use
       absolute discount with cut-off 3 and dis auto-calculated. At the 3-gram level, use absolute discount
       with cut-off 2 and dis auto-calculated. Word IDs 10, 11 and 12 are breakers (sentence/paragraph/paper
       breakers, etc.). The exclude ID is 9. The lexicon contains 200000 words. The resulting language model
       uses -log(pr).

       slmbuild -l -n 3 -o all.slm -w 200000 -c 0,3,2 -d GT,8,0.9995 -d ABS -d ABS -b 10,11,12 -e 9 all.id3gram

Name

       slmbuild - generate language model from idngram file

Note

-n must be given before -c and -b. -c must give the right number of cut-offs, and -d must appear exactly N
       times, specifying the discounts for 1-gram, 2-gram, ..., respectively.
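
       The ordering and counting rules above can be sketched as a small check. This is an illustration only,
       not slmbuild's actual option parser: the function name check_args and its list-of-pairs input are
       hypothetical, and it assumes that "the right number of cut-offs" means exactly N, as in the example
       (-n 3 with -c 0,3,2).

```python
def check_args(opts):
    """Check the constraints described in the note.

    `opts` is a list of (flag, value) pairs in command-line order,
    e.g. [("-n", "3"), ("-c", "0,3,2"), ("-d", "ABS"), ...].
    Returns a list of human-readable problems (empty if none).
    """
    problems = []
    flags = [flag for flag, _ in opts]

    if "-n" not in flags:
        return ["-n is missing"]
    n_pos = flags.index("-n")
    n = int(opts[n_pos][1])

    # -n must be given before -c and -b.
    for flag in ("-c", "-b"):
        if flag in flags and flags.index(flag) < n_pos:
            problems.append(f"{flag} appears before -n")

    # Assumption: -c must list exactly N cut-offs, one per n-gram level.
    for flag, value in opts:
        if flag == "-c":
            given = len(value.split(","))
            if given != n:
                problems.append(f"-c needs {n} cut-offs, got {given}")

    # -d must appear exactly N times (1-gram, 2-gram, ...).
    d_count = flags.count("-d")
    if d_count != n:
        problems.append(f"-d must appear {n} times, got {d_count}")

    return problems
```

       Called with the options of the example command in order, check_args returns an empty list; moving -c
       in front of -n or dropping one -d adds a message for each violated rule.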

       BREAKER-IDs could be SentenceTokens or ParagraphTokens. Conceptually, these IDs have no meaning when they
       appear in the middle of an n-gram.

       EXCLUDE-IDs could be ambiguous IDs. Conceptually, n-grams which contain those IDs are meaningless.

       We cannot erase n-grams according to BREAKER-IDs and EXCLUDE-IDs directly from the IDNGRAM file, because
       some low-level information in it is still useful.

Options

All the following options are mandatory.

       -n, --NMax N
           1 for unigram, 2 for bigram, 3 for trigram. Any number not in the range of 1..3 is not valid.

       -o, --out output-file
           Specify the output file name.

       -l, --log
           Use -log(pr). By default, pr is used directly.

       -w, --wordcount N
           Lexicon size, i.e. the number of different words.

       -b, --brk id...
           Set the IDs which should be treated as breakers.

       -e, --e id...
           Set the IDs which should not be put into the LM.

       -c, --cut c...
           k-grams whose frequency is <= c[k] are dropped.

       -d, --discount method,param...
           The k-th -d option specifies the discount method for k-grams.

           Possible values for method,param are:

                 GT,R,dis  : GT discount for r <= R, where r is the frequency of an n-gram.
                             Linear discount for r > R, i.e. r'=r*dis,
                             0 << dis < 1.0, for example 0.999
                 ABS,[dis] : Absolute discount r'=r-dis. dis is optional;
                             0 << dis < cut[k]+1.0, normally dis < 1.0.
                 LIN,[dis] : Linear discount r'=r*dis. dis is optional;
                             0 < dis < 1.0
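
       The cut-off rule of -c and the ABS and LIN formulas above can be sketched in a few lines of Python.
       This is a minimal illustration under stated assumptions, not slmbuild's implementation: the function
       names, the toy bigram counts and the value dis=0.5 are all hypothetical, the automatic calculation
       of dis is not modelled, and the GT method is left out.

```python
def apply_cutoff(counts, cut):
    """Drop n-grams whose frequency is <= cut (the -c rule for one level)."""
    return {ngram: r for ngram, r in counts.items() if r > cut}


def absolute_discount(r, dis):
    """ABS: r' = r - dis."""
    return r - dis


def linear_discount(r, dis):
    """LIN: r' = r * dis, with 0 < dis < 1.0."""
    return r * dis


# Hypothetical bigram counts, keyed by tuples of word IDs.
bigrams = {(10, 42): 5, (42, 7): 3, (7, 99): 1}

# With cut-off 3, only bigrams seen more than 3 times survive.
kept = apply_cutoff(bigrams, cut=3)

# Absolute discount with an illustrative dis of 0.5 (hypothetical value).
discounted = {ngram: absolute_discount(r, 0.5) for ngram, r in kept.items()}
```

       Here only the bigram (10, 42) survives the cut-off, and its count of 5 is discounted to 4.5. The
       count mass removed by discounting is what a back-off model redistributes to lower-order n-grams.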

See Also

ids2ngram(1), slmprune(1).

perl v5.36.0                                       2023-08-08                                        SLMBUILD(1)

Synopsis

       slmbuild [option]... idngram_file...
