
KinoSearch1::Analysis::Tokenizer - customizable tokenizing

Description

       Generically, "tokenizing" is the process of breaking a string into an array of "tokens".

           # before:
           my $string = "three blind mice";

           # after:
           @tokens = qw( three blind mice );

       KinoSearch1::Analysis::Tokenizer decides where to break up the text based on the value of
       "token_re", a regular expression that matches one token.

           # before:
           my $string = "Eats, Shoots and Leaves.";

           # tokenized by $whitespace_tokenizer
           @tokens = qw( Eats, Shoots and Leaves. );

           # tokenized by $word_char_tokenizer
           @tokens = qw( Eats Shoots and Leaves   );
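
       The difference between the two patterns can be reproduced in plain Perl, without
       KinoSearch1 installed. The following is only a sketch of the idea, not the library's
       implementation: it assumes that each match of the pattern becomes one token, and the
       helper name tokenize_with is hypothetical.

```perl
use strict;
use warnings;

# Hypothetical helper (not part of KinoSearch1): returns every
# non-overlapping match of $re in $text, one list element per match.
sub tokenize_with {
    my ( $re, $text ) = @_;
    my @matches = $text =~ /$re/g;
    return @matches;
}

my $string = "Eats, Shoots and Leaves.";

# \S+ keeps punctuation attached to the neighbouring word.
my @whitespace_tokens = tokenize_with( qr/\S+/, $string );

# \w+ matches word characters only, so punctuation is dropped.
my @word_char_tokens = tokenize_with( qr/\w+/, $string );

print join( "|", @whitespace_tokens ), "\n";    # Eats,|Shoots|and|Leaves.
print join( "|", @word_char_tokens ),  "\n";    # Eats|Shoots|and|Leaves
```

       The global match in list context does the work here: whatever the pattern matches
       becomes a token, and whatever it skips is discarded.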

License, Disclaimer, Bugs, Etc.

       See KinoSearch1 version 1.01.

perl v5.40.0                                       2024-10-20              KinoSearch1::Analysis::Tokenizer(3pm)

Methods

new
           # match "O'Henry" as well as "Henry" and "it's" as well as "it"
           my $token_re = qr/
                   \b        # start with a word boundary
                   \w+       # Match word chars.
                   (?:       # Group, but don't capture...
                      '\w+   # ... an apostrophe plus word chars.
                   )?        # Matching the apostrophe group is optional.
                   \b        # end with a word boundary
               /xsm;
           my $tokenizer = KinoSearch1::Analysis::Tokenizer->new(
               token_re => $token_re, # default: what you see above
           );

       Constructor.  Takes one hash-style parameter.

       •   token_re - must be a pre-compiled regular expression matching one token.
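
       To see what the pattern shown above matches, it can be tried against a sample string
       in plain Perl. This sketch does not call KinoSearch1; it assumes one token per match,
       and the sample sentence is made up for illustration. It also shows one way to check
       that a value is a pre-compiled regular expression: ref() returns "Regexp" for anything
       built with qr//.

```perl
use strict;
use warnings;

# Same pattern as in the constructor example above.
my $token_re = qr/
        \b
        \w+
        (?:
           '\w+
        )?
        \b
    /xsm;

# A qr// value is a compiled regex object, which is what token_re expects.
my $kind = ref($token_re);
print "$kind\n";    # Regexp

# Illustrative input: internal apostrophes stay inside the token, while a
# trailing apostrophe (as in "dogs'") is left out of it.
my $text   = "O'Henry says it's the dogs' bone";
my @tokens = $text =~ /$token_re/g;

print join( "|", @tokens ), "\n";    # O'Henry|says|it's|the|dogs|bone
```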

Name

       KinoSearch1::Analysis::Tokenizer - customizable tokenizing

Synopsis

           my $whitespace_tokenizer
               = KinoSearch1::Analysis::Tokenizer->new( token_re => qr/\S+/, );

           # or...
           my $word_char_tokenizer
               = KinoSearch1::Analysis::Tokenizer->new( token_re => qr/\w+/, );

           # or...
           my $apostrophising_tokenizer = KinoSearch1::Analysis::Tokenizer->new;

           # then... once you have a tokenizer, put it into a PolyAnalyzer
           my $polyanalyzer = KinoSearch1::Analysis::PolyAnalyzer->new(
               analyzers => [ $lc_normalizer, $word_char_tokenizer, $stemmer ], );
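
       The order of the analyzers matters: text is lowercased, then tokenized, then stemmed.
       The following plain-Perl sketch imitates that chain with hypothetical stand-ins. None
       of these closures are KinoSearch1 classes, run_chain is an invented helper, and the
       "stemmer" is a toy that only strips a trailing "s", so treat it as an illustration of
       the data flow rather than of the library's behaviour.

```perl
use strict;
use warnings;

# Hypothetical stand-ins for the three analyzers. Each stage takes an
# array reference of strings and returns a new array reference.
my $lc_stage    = sub { [ map { lc($_) } @{ $_[0] } ] };
my $token_stage = sub { [ map { $_ =~ /\w+/g } @{ $_[0] } ] };

# Toy suffix stripper, NOT a real stemming algorithm.
my $stem_stage = sub {
    [ map { my $t = $_; $t =~ s/s\z//; $t } @{ $_[0] } ];
};

# Invented helper: feeds the output of each stage into the next one.
sub run_chain {
    my ( $text, @stages ) = @_;
    my $batch = [$text];
    $batch = $_->($batch) for @stages;
    return @$batch;
}

my @terms = run_chain( "Eats, Shoots and Leaves.",
    $lc_stage, $token_stage, $stem_stage );

print join( "|", @terms ), "\n";    # eat|shoot|and|leave
```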

See Also