mlpack_lmnn - large margin nearest neighbors (lmnn)

Additional Information

       For further information, including relevant papers, citations,  and  theory,  consult  the  documentation
       found at http://www.mlpack.org or included with your distribution of mlpack.

mlpack-4.5.1                                     29 January 2025                                  mlpack_lmnn(1)

Description

       This program implements Large Margin Nearest Neighbors (LMNN), a distance learning technique. The method
       seeks to improve k-nearest-neighbor classification on a dataset. It employs the strategy of reducing the
       distance between similarly labeled data points (a.k.a. target neighbors) and increasing the distance
       between differently labeled points (a.k.a. impostors) using standard optimization techniques over the
       gradient of the distance between data points.
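
       The pull and push terms described above can be illustrated with a small sketch. This is not mlpack's
       implementation: the function names, the unit margin, and the (1 - reg) / reg weighting are assumptions
       taken from the common textbook formulation of LMNN, and mlpack may apply its regularization parameter
       differently.

```python
# Illustrative sketch only; names and weighting are assumptions, not mlpack code.

def sq_dist(L, a, b):
    """Squared distance between a and b after applying the linear map L."""
    diff = [x - y for x, y in zip(a, b)]
    proj = [sum(row[j] * diff[j] for j in range(len(diff))) for row in L]
    return sum(p * p for p in proj)


def lmnn_loss(L, points, labels, targets, reg=0.5):
    """Pull term plus hinge-based push term for fixed target neighbors.

    targets[i] lists the indices of the target neighbors of point i.
    """
    pull = 0.0
    push = 0.0
    for i, neighbors in enumerate(targets):
        for j in neighbors:
            d_ij = sq_dist(L, points[i], points[j])
            pull += d_ij  # similarly labeled neighbors are pulled closer
            for l, label in enumerate(labels):
                if label != labels[i]:
                    # A differently labeled point is penalized when it lies
                    # within one unit of margin of the target neighbor.
                    d_il = sq_dist(L, points[i], points[l])
                    push += max(0.0, 1.0 + d_ij - d_il)
    return (1.0 - reg) * pull + reg * push


points = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0)]
labels = [0, 0, 1]
targets = [[1], [0], []]          # one target neighbor for the class-0 points
identity = [[1.0, 0.0], [0.0, 1.0]]
stretched = [[1.0, 0.0], [0.0, 2.0]]  # stretches the axis separating the classes

print(lmnn_loss(identity, points, labels, targets))   # 1.5
print(lmnn_loss(stretched, points, labels, targets))  # 1.0
```

       In this toy dataset, stretching the second axis moves the differently labeled point outside the margin,
       so the push term vanishes and the loss drops.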

       To work, this algorithm needs labeled data. The labels can be given as the last row of the input dataset
       (specified with '--input_file (-i)'), or alternatively as a separate matrix (specified with
       '--labels_file (-l)'). Additionally, a starting point for optimization (specified with '--distance_file
       (-d)') can be given, having (r x d) dimensionality. Here r should satisfy 1 <= r <= d; consequently, a
       low-rank matrix will be optimized. Alternatively, a low-rank distance can be learned by specifying the
       '--rank (-A)' parameter (a low-rank matrix with uniformly distributed values will be used as the initial
       learning point).
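
       As a rough illustration of the (r x d) constraint, the sketch below builds a uniformly random starting
       matrix and applies it to a d-dimensional point. The function names and the sampling range [0, 1) are
       assumptions made for this example; the page does not state the range mlpack uses.

```python
import random

# Illustrative sketch only; names and the [0, 1) range are assumptions.

def random_low_rank_start(r, d, seed=0):
    """Return an r x d matrix with uniformly distributed entries."""
    if not 1 <= r <= d:
        raise ValueError("rank r must satisfy 1 <= r <= d")
    rng = random.Random(seed)
    return [[rng.random() for _ in range(d)] for _ in range(r)]


def transform(L, x):
    """Map a d-dimensional point x to r dimensions with the matrix L."""
    return [sum(row[j] * x[j] for j in range(len(x))) for row in L]


L = random_low_rank_start(r=2, d=4)
point = [5.1, 3.5, 1.4, 0.2]
print(len(L), len(L[0]))         # 2 4
print(len(transform(L, point)))  # 2
```

       A matrix with r < d therefore reduces the dimensionality of every transformed point from d to r.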

       The program also needs the number of target neighbors to work with (specified with '--k (-k)'). A
       regularization parameter can also be passed; it acts as a trade-off between the pulling and pushing
       terms (specified with '--regularization (-r)'). In addition, this implementation of LMNN includes a
       parameter to decide the interval after which impostors must be recalculated (specified with
       '--update_interval (-R)').

       Output can be either the learned distance matrix (specified with '--output_file (-o)'), or the
       transformed dataset (specified with '--transformed_data_file (-D)'), or both. Additionally, the
       mean-centered dataset (specified with '--centered_data_file (-c)') can be saved, provided mean-centering
       (specified with '--center (-C)') is performed on the dataset. Accuracy on the initial dataset and the
       final transformed dataset can be printed by specifying the '--print_accuracy (-P)' parameter.
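
       Mean-centering itself is a simple operation; a minimal sketch, assuming each point is stored as a list
       of coordinates (the function name is hypothetical and this is not mlpack code):

```python
# Illustrative sketch only; mean_center is a hypothetical helper.

def mean_center(points):
    """Subtract the per-dimension mean from every point."""
    n = len(points)
    dim = len(points[0])
    mean = [sum(p[j] for p in points) / n for j in range(dim)]
    centered = [[p[j] - mean[j] for j in range(dim)] for p in points]
    return centered, mean


data = [[100.0, 200.0], [102.0, 204.0], [104.0, 202.0]]
centered, mean = mean_center(data)
print(mean)      # [102.0, 202.0]
print(centered)  # [[-2.0, -2.0], [0.0, 2.0], [2.0, 0.0]]
```

       After centering, the centroid of the data sits at the origin, which is the situation the '--center (-C)'
       option is meant to produce.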

       This implementation of LMNN uses AMSGrad, BigBatch_SGD, stochastic gradient descent (including its
       mini-batch form), or the L-BFGS optimizer.

       AMSGrad, specified by the value 'amsgrad' for the parameter '--optimizer (-O)', uses the maximum of past
       squared gradients. It depends primarily on three parameters: the step size (specified with '--step_size
       (-a)'), the batch size (specified with '--batch_size (-b)'), and the maximum number of passes (specified
       with '--passes (-p)'). In addition, a normalized starting point can be used by specifying the
       '--normalize (-N)' parameter.
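
       The phrase "maximum of past squared gradients" corresponds to the AMSGrad update rule, which can be
       sketched as follows. The function names and the values of beta1, beta2, and eps are conventional
       choices assumed for illustration, and the sketch omits any bias correction; it is not taken from
       mlpack's optimizer code.

```python
import math

# Illustrative sketch only; hyperparameter defaults other than step_size
# are assumptions, not values documented on this page.

def new_state(n):
    """Zero-initialized moment estimates for n parameters."""
    return {"m": [0.0] * n, "v": [0.0] * n, "v_hat": [0.0] * n}


def amsgrad_step(theta, grad, state, step_size=0.01,
                 beta1=0.9, beta2=0.999, eps=1e-8):
    """Apply one AMSGrad update and return (new_theta, new_state)."""
    m = [beta1 * mi + (1 - beta1) * g for mi, g in zip(state["m"], grad)]
    v = [beta2 * vi + (1 - beta2) * g * g for vi, g in zip(state["v"], grad)]
    # Key difference from Adam: keep the running maximum of v, so the
    # effective per-parameter step size never increases.
    v_hat = [max(a, b) for a, b in zip(state["v_hat"], v)]
    new_theta = [t - step_size * mi / (math.sqrt(vh) + eps)
                 for t, mi, vh in zip(theta, m, v_hat)]
    return new_theta, {"m": m, "v": v, "v_hat": v_hat}


# One step on f(x) = x^2 starting from x = 1 (gradient 2x = 2).
theta, state = amsgrad_step([1.0], [2.0], new_state(1))
print(theta)
```

       Here step_size plays the role of '--step_size (-a)'; the gradient would normally be computed on a batch
       of '--batch_size (-b)' points.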

       BigBatch_SGD, specified by the value 'bbsgd' for the parameter '--optimizer (-O)', depends primarily on
       three parameters: the step size (specified with '--step_size (-a)'), the batch size (specified with
       '--batch_size (-b)'), and the maximum number of passes (specified with '--passes (-p)'). In addition, a
       normalized starting point can be used by specifying the '--normalize (-N)' parameter.

       Stochastic gradient descent, specified by the value 'sgd' for the parameter '--optimizer (-O)', depends
       primarily on three parameters: the step size (specified with '--step_size (-a)'), the batch size
       (specified with '--batch_size (-b)'), and the maximum number of passes (specified with '--passes (-p)').
       In addition, a normalized starting point can be used by specifying the '--normalize (-N)' parameter.
       Furthermore, mean-centering can be performed on the dataset by specifying the '--center (-C)' parameter.

       The L-BFGS optimizer, specified by the value 'lbfgs' for the parameter '--optimizer (-O)', uses a
       backtracking line search algorithm to minimize a function. The following parameters are used by L-BFGS:
       '--max_iterations (-n)' and '--tolerance (-t)' (the optimization is terminated when the gradient norm is
       below this value). For more details on the L-BFGS optimizer, consult either the mlpack L-BFGS
       documentation (in lbfgs.hpp) or the vast set of published literature on L-BFGS. In addition, a normalized
       starting point can be used by specifying the '--normalize (-N)' parameter.
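
       A backtracking line search repeatedly shrinks a trial step until the objective decreases enough. The
       sketch below uses the plain Armijo sufficient-decrease condition; the function name and the constants
       (initial step, shrink factor, c) are assumptions, and mlpack's L-BFGS may check additional conditions.

```python
# Illustrative sketch only; not mlpack's line search.

def backtracking_line_search(f, grad, x, direction, step=1.0,
                             shrink=0.5, c=1e-4, max_halvings=50):
    """Shrink the step until f decreases sufficiently along direction."""
    fx = f(x)
    # Directional derivative; negative for a descent direction.
    slope = sum(g * d for g, d in zip(grad(x), direction))
    for _ in range(max_halvings):
        candidate = [xi + step * di for xi, di in zip(x, direction)]
        if f(candidate) <= fx + c * step * slope:
            return candidate, step
        step *= shrink
    return list(x), 0.0  # no acceptable step found


def f(x):
    return x[0] ** 2


def grad(x):
    return [2.0 * x[0]]


x0 = [1.0]
new_x, used_step = backtracking_line_search(f, grad, x0, [-g for g in grad(x0)])
print(new_x, used_step)  # [0.0] 0.5
```

       In this example the full step overshoots the minimum without reducing the objective, so it is halved
       once before being accepted.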

       By default, the AMSGrad optimizer is used.

       Example: suppose we want to learn a distance on the iris dataset with 3 target neighbors using the
       BigBatch_SGD optimizer. A simple call looks like:

       $ mlpack_lmnn --input_file iris.csv --labels_file iris_labels.csv --k 3 --optimizer bbsgd --output_file output.csv

       Another call, using the update interval and regularization parameters on a dataset whose labels are
       stored as the last column, can be made as:

       $ mlpack_lmnn --input_file letter_recognition.csv --k 5 --update_interval 10 --regularization 0.4 --output_file output.csv

Name

mlpack_lmnn - large margin nearest neighbors (lmnn)

Optional Input Options

       --batch_size (-b) [int]
              Batch size for mini-batch SGD. Default value 50.

       --center (-C) [bool]
              Perform mean-centering on the dataset. It is useful when the centroid of the data is far from the
              origin.

       --distance_file (-d) [unknown]
              Initial distance matrix to be used as starting point.

       --help (-h) [bool]
              Default help info.

       --info [string]
              Print help on a specific option. Default value ''.

       --k (-k) [int]
              Number of target neighbors to use for each datapoint. Default value 1.

       --labels_file (-l) [unknown]
              Labels for input dataset.

       --linear_scan (-L) [bool]
              Don't shuffle the order in which data points are visited for SGD or mini-batch SGD.

       --max_iterations (-n) [int]
              Maximum number of iterations for L-BFGS (0 indicates no limit). Default value 100000.

       --normalize (-N) [bool]
              Use a normalized starting point for optimization. It is useful when points are far apart, or
              when SGD is returning NaN.

       --optimizer (-O) [string]
              Optimizer to use; 'amsgrad', 'bbsgd', 'sgd', or 'lbfgs'. Default value 'amsgrad'.

       --passes (-p) [int]
              Maximum number of full passes over dataset for AMSGrad, BB_SGD and SGD. Default value 50.

       --print_accuracy (-P) [bool]
              Print accuracies on initial and transformed dataset.

       --rank (-A) [int]
              Rank of distance matrix to be optimized. Default value 0.

       --regularization (-r) [double]
              Regularization for LMNN objective function. Default value 0.5.

       --seed (-s) [int]
              Random seed. If 0, 'std::time(NULL)' is used. Default value 0.

       --step_size (-a) [double]
              Step size for AMSGrad, BB_SGD and SGD (alpha). Default value 0.01.

       --tolerance (-t) [double]
              Maximum tolerance for termination of AMSGrad, BB_SGD, SGD or L-BFGS. Default value 1e-07.

       --update_interval (-R) [int]
              Number of iterations after which impostors need to be recalculated. Default value 1.

       --verbose (-v) [bool]
              Display informational messages and the full list of parameters and timers at the end of execution.

       --version (-V) [bool]
              Display the version of mlpack.

Optional Output Options

       --centered_data_file (-c) [unknown]
              Output matrix for mean-centered dataset.

       --output_file (-o) [unknown]
              Output matrix for learned distance matrix.

       --transformed_data_file (-D) [unknown]
              Output matrix for transformed dataset.

Required Input Options

       --input_file (-i) [unknown]
              Input dataset to run LMNN on.

Synopsis

mlpack_lmnn -i unknown [-b int] [-C bool] [-d unknown] [-k int] [-l unknown] [-L bool] [-n int] [-N bool] [-O string] [-p int] [-P bool] [-A int] [-r double] [-s int] [-a double] [-t double] [-R int] [-V bool] [-c unknown] [-o unknown] [-D unknown] [-h -v]