
Lingua::Stem::Snowball - Perl interface to Snowball stemmers.

Authors

       Lingua::Stem::Snowball was originally developed to provide access to stemming algorithms for the OpenFTS
       (full text search engine) project (<http://openfts.sourceforge.net>), by Oleg Bartunov, <oleg at sai dot
       msu dot su> and Teodor Sigaev, <teodor at stack dot net>.

       Currently maintained by Marvin Humphrey <marvin at rectangular dot com>.  Previously maintained by Fabien
       Potencier <fabpot at cpan dot org>.

Description

       Stemming reduces related words to a common root form -- for instance, "horse", "horses", and "horsing"
       all become "hors".  Most commonly, stemming is deployed as part of a search application, allowing
       searches for a given term to match documents which contain other forms of that term.

       This module is very similar to Lingua::Stem.  However, Lingua::Stem is pure Perl, while
       Lingua::Stem::Snowball is an XS module which provides a Perl interface to the C version of the Snowball
       stemmers (<http://snowball.tartarus.org>).
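
       As a minimal sketch of the behavior described above, the following stems the three example words with
       the English stemmer.  It assumes the module is installed and uses only the documented new() and stem()
       calls; the variable names are illustrative.

```perl
use strict;
use warnings;
use Lingua::Stem::Snowball;

# English stemmer with the default encoding.
my $stemmer = Lingua::Stem::Snowball->new( lang => 'en' );

# In list context, stem() returns one stem per input word.
my @stems = $stemmer->stem( [qw( horse horses horsing )] );

# Per the description above, all three words share the root "hors".
print join( ' ', @stems ), "\n";
```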

   Supported Languages
       The following stemmers are available (as of Lingua::Stem::Snowball 0.95):

           |-----------------------------------------------------------|
           | Language   | ISO code | default encoding | also available |
           |-----------------------------------------------------------|
           | Danish     | da       | ISO-8859-1       | UTF-8          |
           | Dutch      | nl       | ISO-8859-1       | UTF-8          |
           | English    | en       | ISO-8859-1       | UTF-8          |
           | Finnish    | fi       | ISO-8859-1       | UTF-8          |
           | French     | fr       | ISO-8859-1       | UTF-8          |
           | German     | de       | ISO-8859-1       | UTF-8          |
           | Hungarian  | hu       | ISO-8859-1       | UTF-8          |
           | Italian    | it       | ISO-8859-1       | UTF-8          |
           | Norwegian  | no       | ISO-8859-1       | UTF-8          |
           | Portuguese | pt       | ISO-8859-1       | UTF-8          |
           | Romanian   | ro       | ISO-8859-2       | UTF-8          |
           | Russian    | ru       | KOI8-R           | UTF-8          |
           | Spanish    | es       | ISO-8859-1       | UTF-8          |
           | Swedish    | sv       | ISO-8859-1       | UTF-8          |
           | Turkish    | tr       | UTF-8            |                |
           |-----------------------------------------------------------|

   Benchmarks
       Here is a comparison of Lingua::Stem::Snowball and Lingua::Stem, using The Works of Edgar Allan Poe,
       volumes 1-5 (via Project Gutenberg) as source material.  It was produced on a 3.2GHz Pentium 4 running
       FreeBSD 5.3 and Perl 5.8.7.  (The benchmarking script is included in this distribution:
       devel/benchmark_stemmers.plx.)

           |--------------------------------------------------------------------|
           | total words: 454285 | unique words: 22748                          |
           |--------------------------------------------------------------------|
           | module                        | config        | avg secs | words/s |
           |--------------------------------------------------------------------|
           | Lingua::Stem 0.81             | no cache      | 2.029    | 223881  |
           | Lingua::Stem 0.81             | cache level 2 | 1.280    | 355025  |
           | Lingua::Stem::Snowball 0.94   | stem          | 1.426    | 318636  |
           | Lingua::Stem::Snowball 0.94   | stem_in_place | 0.641    | 708495  |
           |--------------------------------------------------------------------|
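
       To run a rough comparison of stem() and stem_in_place() on your own machine, something like the
       following may serve as a starting point.  This is not the devel/benchmark_stemmers.plx script from the
       distribution; it is a small sketch using the core Benchmark module, and the tiny repeated word list is
       an assumption standing in for a real corpus.

```perl
use strict;
use warnings;
use Benchmark qw( cmpthese );
use Lingua::Stem::Snowball;

# Stand-in corpus (assumption): a few lowercase words repeated many times.
my @corpus = ( qw( horse horses horsing hooves ) ) x 1000;

my $stemmer = Lingua::Stem::Snowball->new( lang => 'en' );

# Run each variant for at least one CPU second and print a comparison table.
cmpthese(
    -1,
    {
        stem => sub {
            my @stems = $stemmer->stem( \@corpus );
        },
        stem_in_place => sub {
            my @copy = @corpus;    # stem_in_place() modifies its argument
            $stemmer->stem_in_place( \@copy );
        },
    }
);
```

       The stem_in_place variant copies the corpus on every iteration so that each run starts from unstemmed
       words, which slightly understates its advantage.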

Methods / Functions

   new
           my $stemmer = Lingua::Stem::Snowball->new(
               lang     => 'es',
               encoding => 'UTF-8',
           );
           die $@ if $@;

       Create a Lingua::Stem::Snowball object.  new() accepts the following hash style parameters:

       •   lang: An ISO code taken from the table of supported languages, above.

       •   encoding: A supported character encoding.

       Be careful with the values you supply to new(). If "lang" is invalid, Lingua::Stem::Snowball does not
       throw an exception, but instead sets $@.  Also, if you supply an invalid combination of values for "lang"
       and "encoding", Lingua::Stem::Snowball will not warn you, but the behavior will change: stem() will
       always return undef, and stem_in_place() will be a no-op.
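
       Because new() reports a bad "lang" through $@ rather than by dying, a caller has to check $@ explicitly.
       The sketch below shows one way to do that; 'xx' is used as a code that does not appear in the table of
       supported languages, and clearing $@ beforehand is a precaution against a stale value left by earlier
       code.

```perl
use strict;
use warnings;
use Lingua::Stem::Snowball;

# Clear any stale error so that a non-empty $@ can only come from new().
$@ = '';

# 'xx' is not in the table of supported languages.
my $stemmer = Lingua::Stem::Snowball->new( lang => 'xx' );

# Copy $@ right away; later evals or calls may overwrite it.
my $error = $@;

if ($error) {
    print "could not set up stemmer: $error\n";
}
```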

   stem
           @stemmed = $stemmer->stem( WORDS, [IS_STEMMED] );
           @stemmed = stem( ISO_CODE, WORDS, [LOCALE], [IS_STEMMED] );

       Return lowercased and stemmed output.  WORDS may be either an array of words or a single scalar word.

       In a scalar context, stem() returns the first item in the array of stems:

           $stem       = $stemmer->stem($word);
           $first_stem = $stemmer->stem(\@words); # probably wrong

       LOCALE has no effect; it is only there as a placeholder for backwards compatibility (see Changes).
       IS_STEMMED must be a reference to a scalar; if it is supplied, it will be set to 1 if the output differs
       from the input in some way, 0 otherwise.
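
       The difference between list and scalar context, and the IS_STEMMED flag, can be sketched as follows.
       Only the documented OO form of stem() is used; the variable names are illustrative.

```perl
use strict;
use warnings;
use Lingua::Stem::Snowball;

my $stemmer = Lingua::Stem::Snowball->new( lang => 'en' );
my @words   = qw( horse hooves );

# List context: one stem per input word.
my @all_stems = $stemmer->stem( \@words );

# Scalar context: only the first stem comes back, the rest are discarded.
my $first_only = $stemmer->stem( \@words );

# IS_STEMMED: pass a scalar reference to learn whether the output differs
# from the input.
my $changed;
my $stem = $stemmer->stem( 'horses', \$changed );

print "all: @all_stems\n";
print "first: $first_only\n";
print "stem: $stem, changed: $changed\n";
```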

   stem_in_place
           $stemmer->stem_in_place(\@words);

       This is a high-performance, streamlined version of stem() (in fact, stem() calls stem_in_place()
       internally). It has no return value, instead modifying each item in an existing array of words.  The
       words must already be in lower case.
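
       Since stem_in_place() does not lowercase its input, mixed-case words need to be lowercased by the caller
       first.  A minimal sketch, using the same words as the Synopsis:

```perl
use strict;
use warnings;
use Lingua::Stem::Snowball;

my $stemmer = Lingua::Stem::Snowball->new( lang => 'en' );

# Lowercase first: stem_in_place() expects lowercase input.
my @words = map { lc } qw( Horse HOOVES );

# Modifies @words directly; there is no return value to capture.
$stemmer->stem_in_place( \@words );

print "@words\n";
```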

   lang
           my $lang = $stemmer->lang;
           $stemmer->lang($iso_language_code);

       Accessor/mutator for the lang parameter. If there is no stemmer for the supplied ISO code, the language
       is not changed (but $@ is set).
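
       A short sketch of switching languages on an existing object while checking for failure.  As with new(),
       $@ is cleared first as a precaution, and 'xx' stands in for an unsupported code.

```perl
use strict;
use warnings;
use Lingua::Stem::Snowball;

my $stemmer = Lingua::Stem::Snowball->new( lang => 'en' );

# Try an unsupported code: the language stays as it was and $@ is set.
$@ = '';
$stemmer->lang('xx');
my $error       = $@;
my $after_error = $stemmer->lang;

# Switch to a supported code from the table above.
$stemmer->lang('fr');
my $after_switch = $stemmer->lang;

print "after failed change: $after_error\n";
print "after valid change:  $after_switch\n";
```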

   encoding
           my $encoding = $stemmer->encoding;
           $stemmer->encoding($encoding);

       Accessor/mutator for the encoding parameter.

   stemmers
           my @iso_codes = stemmers();
           my @iso_codes = $stemmer->stemmers();

       Returns a list of all valid language codes.
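
       One use for stemmers() is to validate a language code before relying on it, which avoids having to
       inspect $@ afterwards.  The sketch below uses the documented method form; the lookup hash and the
       candidate codes are illustrative.

```perl
use strict;
use warnings;
use Lingua::Stem::Snowball;

my $stemmer = Lingua::Stem::Snowball->new( lang => 'en' );

# Build a lookup table of every ISO code the module can stem.
my %supported = map { $_ => 1 } $stemmer->stemmers();

for my $code (qw( en xx )) {
    my $status = $supported{$code} ? 'supported' : 'not supported';
    print "$code: $status\n";
}
```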

Name

       Lingua::Stem::Snowball - Perl interface to Snowball stemmers.

Requests & Bugs

       Please report any requests, suggestions or bugs via the RT bug-tracking system at http://rt.cpan.org/ or
       email to bug-Lingua-Stem-Snowball@rt.cpan.org.

       http://rt.cpan.org/NoAuth/Bugs.html?Dist=Lingua-Stem-Snowball is the RT queue for Lingua::Stem::Snowball.
       Please check to see if your bug has already been reported.

See Also

       <http://snowball.tartarus.org>, Lingua::Stem.

perl v5.40.0                                       2024-10-20                        Lingua::Stem::Snowball(3pm)

Synopsis

           use Lingua::Stem::Snowball qw( stem );    # import stem() for the functional interface

           my @words = qw( horse hooves );

           # OO interface:
           my $stemmer = Lingua::Stem::Snowball->new( lang => 'en' );
           $stemmer->stem_in_place( \@words ); # qw( hors hoov )

           # Functional interface:
           my @stems = stem( 'en', \@words );
