
hfst-tokenize - perform matching/lookup on text streams

Description

       perform matching/lookup on text streams

   Common options:

       -h, --help
              Print help message

       -V, --version
              Print version info

       -v, --verbose
              Print verbosely while processing

       -q, --quiet
              Only print fatal errors and requested output

       -s, --silent
              Alias of --quiet

       -n, --newline
              Newline as input separator (default is blank line)

       -a, --print-all
              Print nonmatching text

       -w, --print-weight
              Print weights (overrides earlier -W option)

       -W, --no-weights
              Don't print weights (default; overrides earlier -w, or -w implied by -g, options)

       -m, --tokenize-multichar
              Tokenize multicharacter symbols (by default only one UTF-8 character is tokenized at a time,
              regardless of what is present in the alphabet)

       -b, --beam=B
              Output only analyses whose weight is within B of the best result

       -tS, --time-cutoff=S
              Limit search after having used S seconds per input

       -lN, --weight-classes=N
              Output no more than N best weight classes (where analyses with equal weight constitute a class)

       -u, --unique
              Remove duplicate analyses

       -z, --segment
              Segmenting / tokenization mode (default)

       -i, --space-separated
              Tokenization with one sentence per line, space-separated tokens

       -x, --xerox
              Xerox output

       -c, --cg
              Constraint Grammar output

       -S, --superblanks
              Ignore contents of unescaped [] (cf. apertium-destxt); flush on NUL

       -g, --giella-cg
              CG format used in Giella infrastructure (implies -w and -l2, treats @PMATCH_INPUT_MARK@ as
              subreading separator, expects tags to be Multichar_symbols, flush on NUL)

       -C, --conllu
              CoNLL-U format

       -f, --finnpos
              FinnPos output

       -L, --visl
              VISL input and output (implies -W, handles <s> as blocks and <STYLE> inline)

       Use standard streams for input and output (for now).
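
       The weight options above are described as overriding earlier ones: -W cancels an earlier -w
       (including the -w implied by -g), and -w cancels an earlier -W. The shell sketch below models
       only that documented "last one wins" rule. The function name weights_printed is hypothetical;
       it is not part of hfst-tokenize, it does not reproduce the program's actual option parser, and
       it ignores bundled short options such as -gW.

```shell
# Model of the documented "later option overrides earlier" rule for weights.
# Illustration only: this mirrors the option descriptions, not the real parser.
weights_printed() {
    state="off"                                 # -W (no weights) is the documented default
    for opt in "$@"; do
        case "$opt" in
            -w|--print-weight) state="on"  ;;   # explicit request for weights
            -W|--no-weights)   state="off" ;;   # explicit request for no weights
            -g|--giella-cg)    state="on"  ;;   # documented: -g implies -w
        esac
    done
    echo "$state"
}

weights_printed -g        # on
weights_printed -g -W     # off
weights_printed -W -w     # on
```

       Under this reading, placing -W after -g gives Giella CG output without weights.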

Name

       hfst-tokenize - perform matching/lookup on text streams

Reporting Bugs

       Report bugs to <hfst-bugs@helsinki.fi> or directly to our bug tracker at:
       <https://github.com/hfst/hfst/issues>

       hfst-tokenize home page: <https://github.com/hfst/hfst/wiki/HfstTokenize>
       General help using HFST software: <https://github.com/hfst/hfst/wiki>

Synopsis

hfst-tokenize [--segment|--xerox|--cg|--giella-cg] [OPTIONS...] RULESET
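
       A minimal invocation following the synopsis might look like the sketch below. The file name
       tokeniser.pmhfst is a placeholder for your own compiled RULESET, and the helper tokenize_cmd
       is hypothetical (it only assembles and prints the command line). Input is taken from standard
       input and results go to standard output, as noted in the description.

```shell
# Placeholder path: substitute your own compiled tokenizer RULESET.
RULESET="tokeniser.pmhfst"

# Assemble a command line in the shape given by the synopsis:
# one output-mode flag followed by the RULESET. Hypothetical helper.
tokenize_cmd() {
    mode="$1"
    case "$mode" in
        --segment|--xerox|--cg|--giella-cg) ;;            # modes listed in the synopsis
        *) echo "unsupported mode: $mode" >&2; return 1 ;;
    esac
    printf 'hfst-tokenize %s %s\n' "$mode" "$RULESET"
}

# Run the tool only when it is installed and the ruleset file exists;
# otherwise just show the command that would be run.
if command -v hfst-tokenize >/dev/null 2>&1 && [ -f "$RULESET" ]; then
    echo "This is a sentence." | hfst-tokenize --cg "$RULESET"
else
    tokenize_cmd --cg   # prints: hfst-tokenize --cg tokeniser.pmhfst
fi
```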

See Also