Text::Names - Perl extension for proper name parsing, normalization, recognition, and classification

Author

       David Bourget, http://www.dbourget.com, with contributions by Zbigniew Lukasiak

Description

       This module provides a number of name normalization routines, plus high-level parsing and name
       comparison utilities such as those illustrated in the synopsis.

       While it tries to accommodate non-Western names, this module definitely works better with Western names,
       especially English-style names.

       No subroutine is exported by default.

       This module normalizes names to this format:

       Lastname(s) [Jr], Given name(s)

       Some examples:

       1) Bourget, David Joseph Richard

       2) Bourget Jr, David

       3) Bourget, D. J. R.

       These are all normalized names. Throughout this documentation, "normalized representation" refers to
       this format.

Export

       None by default.

Known Issues

       This module currently overwrites @Text::Capitalize::exceptions globally, which can have unintended
       side effects.

See Also

       The xPapers application framework from which this has been extracted, http://www.xpapers.org

       The related Biblio::Citation::Compare module.

Subroutines

   abbreviationOf(string name1, string name2): boolean
       Returns true iff name1 is a common abbreviation of name2 in English. For example, 'Dave' is a common
       abbreviation of 'David'. The list of abbreviations used includes a number of old abbreviations such as
       'Davy' for 'David'.

   cleanName(string name): string
       Like parseName, but a) returns the normalized form of the name instead of an array, and b) does
       additional cleaning-up. Prefer it to parseName in most cases, and by default when processing variable
       or dubious data.

   composeName(string given, string last): string
       Returns the name in the "last, given" format.

   isCommonFirstname(string name, [float threshold]): boolean
       Returns true if the name is among the 1000 most popular firstnames (male or female) according to the 1990
       US Census. If a threshold percentage is passed, the name must have at least this frequency for the
       subroutine to return true. See http://www.census.gov/genealogy/www/data/1990surnames/names_files.html.

   isCommonSurname(string name, [float threshold]): boolean
       Returns true if the name is among the 1000 most popular surnames according to the 1990 US Census. If a
       threshold percentage is passed, the name must have at least this frequency for the subroutine to return
       true. See http://www.census.gov/genealogy/www/data/1990surnames/names_files.html.

   firstnamePrevalence(string name): float [0-100]
       Returns a float between 0 and 100 indicating how common the firstname is according to the 1990 US Census.
       Names that are not in the top 1000 return 0.

   surnamePrevalence(string name): float [0-100]
       Returns a float between 0 and 100 indicating how common the surname is according to the 1990 US Census.
       Names that are not in the top 1000 return 0.
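
       As a hedged illustration of how the census-based helpers fit together, the sketch below reads the
       prevalence of a surname and a first name, then applies a threshold. The sample names and the 0.5
       threshold are arbitrary choices made for this example, not values taken from the module.

```perl
use strict;
use warnings;
use Text::Names qw/isCommonSurname surnamePrevalence isCommonFirstname firstnamePrevalence/;

# Sample inputs chosen for illustration only.
my $surname   = 'Smith';
my $firstname = 'Mary';

# Prevalence is documented as a percentage in [0, 100];
# 0 means the name is not in the top 1000.
my $surname_pct   = surnamePrevalence($surname);
my $firstname_pct = firstnamePrevalence($firstname);
printf "%s: %.3f%% of surnames\n",    $surname,   $surname_pct;
printf "%s: %.3f%% of first names\n", $firstname, $firstname_pct;

# With a threshold, the name must reach that frequency (in percent)
# for the check to return true. 0.5 is a hypothetical cut-off.
my $threshold = 0.5;
print "$surname passes the $threshold% threshold\n"
    if isCommonSurname($surname, $threshold);
print "$firstname passes the $threshold% threshold\n"
    if isCommonFirstname($firstname, $threshold);
```

       The functions must be imported explicitly because nothing is exported by default.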

   normalizeNameWhitespace(string name): string
       Normalizes the whitespace within a name. This is mainly for internal usage.

   parseName(string name): array
       Takes a name written in any of several common formats and returns it as an array representing the
       post-comma and pre-comma parts of its normalized form (in that order). For example,
       parseName("David Bourget") returns ('David','Bourget').
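
       A minimal sketch of how parseName relates to composeName and cleanName, based only on the
       descriptions in this section. The outputs of composeName and cleanName are printed rather than
       asserted, since their exact formatting is not spelled out here.

```perl
use strict;
use warnings;
use Text::Names qw/parseName composeName cleanName/;

# parseName returns the given-name part first, then the last-name part.
my ($given, $last) = parseName("David Bourget");
print "given=$given last=$last\n";    # given=David last=Bourget, per the documented example

# composeName takes (given, last) and builds the "last, given" string.
my $normalized = composeName($given, $last);
print "$normalized\n";

# cleanName goes from raw input to the normalized string in one call.
print cleanName("David Bourget"), "\n";
```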

   parseName2(string name): array
       Use on already-normalized names to split them into four parts: full given names, initials, last names,
       and suffix. The only suffix recognized is 'Jr'.

   parseNameList(array names): array
       Takes an array of names (as strings) and returns an array of normalized representations of the names in
       the array.

   parseNames(string names): array
       Takes a string of names and returns an array of normalized representations of the names in the string.
       This routine understands a wide variety of formats for names and lists of names typically found in the
       author lists of bibliographic citations. See the test 03-parseNames.t for multiple examples.

   reverseName(string name): string
       Given a normalized name of the form "last, given", returns "given last".

   samePerson(string name1, string name2): string
       Returns a true value iff name1 and name2 could reasonably be two writings of the same name. For example,
       'D J Bourget' could reasonably be a writing of 'David Joseph Bourget'. So could 'D Bourget'. But 'D F
       Bourget' is not a reasonable writing of 'David Joseph Bourget'. The value returned is a (potentially new)
       name string which combines the most complete tokens of the two submitted name strings.

       Contrary to what one might expect, this subroutine does not use weakenings() behind the scenes. Another
       way to check for name compatibility would be to check that two names have a weakening in common (probably
       too permissive for most purposes) or that one name is a weakening of the other.
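
       The three cases described above can be tried directly. This is a small sketch that only relies on
       the documented behavior: a true value (the merged name string) for compatible names, and a false
       value otherwise.

```perl
use strict;
use warnings;
use Text::Names qw/samePerson/;

my $full = 'David Joseph Bourget';

for my $candidate ('D J Bourget', 'D Bourget', 'D F Bourget') {
    # On a match, the return value is a name string combining the
    # most complete tokens of both inputs; otherwise it is false.
    my $merged = samePerson($candidate, $full);
    if ($merged) {
        print "'$candidate' matches '$full' -> $merged\n";
    }
    else {
        print "'$candidate' does not match '$full'\n";
    }
}
```

       Because the return value is a string rather than a plain boolean, it can be kept as the preferred
       spelling when merging duplicate records.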

   setNameAbbreviations(array): undef
       Sets the abbreviation mapping used to determine whether, say, 'David' and 'Dave' are compatible name
       parts. The mapping is also used by abbreviationOf(). The format of the array should be: 'Dave', 'David',
       'Davy', 'David', etc, otherwise representable in Perl as 'Dave' => 'David', 'Davy' => 'David', etc.
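
       To make the expected argument format concrete, the sketch below builds the same mapping as a flat
       list and as a hash, then passes it in. The two-entry mapping is a made-up sample, not the module's
       built-in list.

```perl
use strict;
use warnings;
use Text::Names qw/setNameAbbreviations abbreviationOf/;

# A flat list of (abbreviation, full name) pairs ...
my @pairs = ('Dave', 'David', 'Davy', 'David');

# ... holds the same data as a hash written with the fat comma.
my %map = ('Dave' => 'David', 'Davy' => 'David');

# A hash flattens to a list of key/value pairs when passed to a sub.
# Pair order may differ from @pairs, which does not matter for a mapping.
setNameAbbreviations(%map);

print "Dave is treated as an abbreviation of David\n"
    if abbreviationOf('Dave', 'David');
```

       Since this call sets the mapping, a short list like this one presumably replaces the built-in
       abbreviations rather than adding to them; that reading is an assumption based on the wording above.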

   getNameAbbreviations
       Returns the abbreviation mapping.

   weakenings(string first_name, string last_name): array
       Returns an array of normalized names which are weakenings of the first and last name passed as arguments.
       Replacing a given name with an initial, or removing an initial, are examples of operations that generate
       weakenings of a name. Such operations are applied to arbitrary depth, until the name has been reduced to
       a single initial followed by the last name, and all intermediate steps are returned.

       You can use weakenings(parseName("Lastname, Firstname")) to weaken a first and last name as a single
       string.
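
       A short sketch of both calling styles. The sample name is arbitrary, and the list of weakenings is
       printed rather than asserted because its exact contents and order are not specified here.

```perl
use strict;
use warnings;
use Text::Names qw/weakenings parseName/;

# Weaken a name given as separate first and last names.
my @weaker = weakenings('David Joseph', 'Bourget');
print "$_\n" for @weaker;

# Weaken a name given as a single string by parsing it first.
# parseName returns (given, last), the argument order weakenings expects.
my @from_string = weakenings(parseName('Bourget, David Joseph'));
print scalar(@from_string), " weakenings from the single-string form\n";
```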

   guessGender(string firstname, [float threshold]): string
       Returns 'F' if someone with the provided firstname is likely female, 'M' if likely male, and undef
       otherwise. A frequency threshold (default = 0) can be specified so that a gender is returned only if the
       name is found with at least this frequency among people with this gender (according to the US census). A
       threshold of 0.1 (which means 0.1%) ensures very reliable results (precision above 99%) with a recall of
       about 60%. When the threshold is lower, this function has a tendency to overestimate the number of
       females.
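
       The threshold and the undef case can be handled as in the following sketch. The names are sample
       inputs chosen for illustration; 0.1 is the threshold (in percent) discussed above.

```perl
use strict;
use warnings;
use Text::Names qw/guessGender/;

for my $name ('David', 'Mary', 'Xyzzy') {
    my $gender = guessGender($name, 0.1);

    # undef means no gender could be assigned at this threshold,
    # so check definedness before printing.
    printf "%-6s %s\n", $name, defined $gender ? $gender : 'unknown';
}
```

       Checking with defined avoids warnings for names that are absent from the census data or fall below
       the threshold.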

Synopsis

           use Text::Names qw/parseNames samePerson guessGender/;

           my @authors = parseNames("D Bourget, Zbigniew Z Lukasiak and John Doe");

           # @authors = ('Bourget, D.','Lukasiak, Zbigniew Z.','Doe, John')

           print "same!" if samePerson("Dave Bourget","David F. Bourget");

           # same!

           print guessGender("David");

           # "M"
