mlpack_random_forest - random forests

Additional Information

       For further information, including relevant papers, citations, and theory, consult the documentation
       found at http://www.mlpack.org or included with your distribution of mlpack.

mlpack-4.5.1                                     29 January 2025                         mlpack_random_forest(1)

Description

This program is an implementation of the standard random forest classification algorithm by Leo Breiman.
A random forest can be trained and saved for later use, or a random forest may be loaded and predictions
or class probabilities for points may be generated.

The training set and associated labels are specified with the '--training_file (-t)' and '--labels_file
(-l)' parameters, respectively. The labels should be in the range `[0, num_classes - 1]`. If
'--labels_file (-l)' is not specified, the labels are assumed to be the last dimension of the training
dataset.
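
The label conventions above can be checked before training. The following is a minimal sketch, not part of mlpack: 'check_labels' and 'merge_labels' are hypothetical helper names, the labels file is assumed to hold one integer label per line, and the "last dimension" is assumed to correspond to the last column of a CSV file with one point per row.

```shell
# Print the smallest number of classes consistent with a labels file
# (largest label + 1), or fail if a label is not a non-negative integer.
# Assumption: one label per line.
check_labels() {
    awk '
        # Reject anything that is not a plain non-negative integer.
        $0 !~ /^[0-9]+$/ {
            printf "bad label on line %d: %s\n", NR, $0 > "/dev/stderr"
            bad = 1
            exit
        }
        # Track the largest label seen so far.
        $0 + 0 > max { max = $0 + 0 }
        END {
            if (bad || NR == 0) exit 1
            print max + 1
        }
    ' "$1"
}

# Append the labels as the last column of the training data, so that
# --labels_file can be omitted.
# Assumption: both files are CSV with one point per row, in the same order.
merge_labels() {
    paste -d, "$1" "$2"
}
```

Under these assumptions, "check_labels labels.csv" prints the implied class count, and "merge_labels data.csv labels.csv > data_with_labels.csv" produces a single file that could be passed to '--training_file (-t)' on its own.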

When a model is trained, the '--output_model_file (-M)' output parameter may be used to save the trained
model. A model may be loaded for predictions with the '--input_model_file (-m)' parameter. The
'--input_model_file (-m)' parameter may not be specified when the '--training_file (-t)' parameter is
specified, unless '--warm_start (-w)' is also given, in which case more trees are trained on top of the
loaded model. The '--minimum_leaf_size (-n)' parameter specifies the minimum number of training points
that each leaf node must contain. The '--num_trees (-N)' parameter controls the number of trees in the
random forest. The '--minimum_gain_split (-g)' parameter controls the minimum required gain for a
decision tree node to split. Larger values will force higher-confidence splits. The '--maximum_depth
(-D)' parameter specifies the maximum depth of each tree (0 means no limit). The '--subspace_dim (-d)'
parameter is used to control the number of random dimensions chosen for an individual node's split. If
'--print_training_accuracy (-a)' is specified, the calculated accuracy on the training set will be
printed; '--verbose (-v)' must also be specified for this.
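
To show how these parameters combine on one command line, here is a minimal sketch that only assembles and prints a training command. 'rf_train_cmd' is a hypothetical helper, and the parameter values (50 trees, leaf size 5, and so on) are illustrative assumptions, not recommendations.

```shell
# Print a training command that sets the tree-growth parameters.
# $1 = training file, $2 = labels file, $3 = output model file.
# Assumption: the file names contain no spaces or shell metacharacters.
rf_train_cmd() {
    echo mlpack_random_forest \
        --training_file "$1" --labels_file "$2" \
        --num_trees 50 --minimum_leaf_size 5 \
        --minimum_gain_split 0.0001 --maximum_depth 12 \
        --subspace_dim 3 --print_training_accuracy --verbose \
        --output_model_file "$3"
}
```

Calling "rf_train_cmd data.csv labels.csv rf_model.bin" prints the command for inspection; piping that output to sh would run it on a system where mlpack is installed.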

Test data may be specified with the '--test_file (-T)' parameter, and if performance measures are desired
for that test set, labels for the test points may be specified with the '--test_labels_file (-L)'
parameter. Predictions for each test point may be saved via the '--predictions_file (-p)' output
parameter. Class probabilities for each prediction may be saved with the '--probabilities_file (-P)'
output parameter.
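
As a rough illustration of how the two outputs relate, the sketch below recovers a predicted class for each point from a saved probabilities file by taking the most probable class. It assumes the file is a CSV with one row per test point and one column per class, with classes numbered from 0; that layout is an assumption about the saved file, not something this page states. 'argmax_rows' is a hypothetical helper.

```shell
# For each row of a CSV of class probabilities, print the 0-based index
# of the largest value. Ties go to the lowest class index.
# Assumption: one row per test point, one column per class.
argmax_rows() {
    awk -F, '{
        best = 1
        for (i = 2; i <= NF; i++)
            if ($i + 0 > $best + 0) best = i
        print best - 1
    }' "$1"
}
```

If the layout assumption holds, "argmax_rows probabilities.csv" should yield one class index per test point.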

For example, to train a random forest with a minimum leaf size of 20 using 10 trees on the dataset
contained in 'data.csv' with labels 'labels.csv', saving the output random forest to 'rf_model.bin' and
printing the training accuracy, one could call

$ mlpack_random_forest --training_file data.csv --labels_file labels.csv --minimum_leaf_size 20 \
    --num_trees 10 --output_model_file rf_model.bin --print_training_accuracy --verbose

Then, to use that model to classify points in 'test_set.csv' and print the test accuracy given the labels
'test_labels.csv', while saving the predictions for each point to 'predictions.csv', one could call

$ mlpack_random_forest --input_model_file rf_model.bin --test_file test_set.csv --test_labels_file \
    test_labels.csv --predictions_file predictions.csv
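
The saved predictions can also be compared against the test labels outside of mlpack. The following is a minimal sketch under the assumption that both 'predictions.csv' and 'test_labels.csv' contain one integer class per line, in the same order; 'accuracy' is a hypothetical helper, not an mlpack command.

```shell
# Print the fraction of lines on which predictions and labels agree.
# $1 = predictions file, $2 = labels file.
# Assumption: one class per line in each file, in the same point order.
accuracy() {
    paste -d, "$1" "$2" | awk -F, '
        # Compare numerically so that "1" and "1.0" count as equal.
        $1 + 0 == $2 + 0 { correct++ }
        END {
            if (NR == 0) exit 1
            printf "%.4f\n", correct / NR
        }
    '
}
```

For instance, "accuracy predictions.csv test_labels.csv" prints a value between 0 and 1 that can be checked against the figure the program reports.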

Name

mlpack_random_forest - random forests

Optional Input Options

       --help (-h) [bool]
              Default help info.

       --info [string]
              Print help on a specific option. Default value ''.

       --input_model_file (-m) [unknown]
              Pre-trained random forest to use for classification.

       --labels_file (-l) [unknown]
              Labels for training dataset.

       --maximum_depth (-D) [int]
              Maximum depth of the tree (0 means no limit). Default value 0.

       --minimum_gain_split (-g) [double]
              Minimum gain needed to make a split when building a tree. Default value 0.

       --minimum_leaf_size (-n) [int]
              Minimum number of points in each leaf node. Default value 1.

       --num_trees (-N) [int]
              Number of trees in the random forest. Default value 10.

       --print_training_accuracy (-a) [bool]
              If set, then the accuracy of the model on the training set will be printed (verbose must also be
              specified).

       --seed (-s) [int]
              Random seed. If 0, 'std::time(NULL)' is used. Default value 0.

       --subspace_dim (-d) [int]
              Dimensionality of random subspace to use for each split. '0' will autoselect the square root of
              data dimensionality. Default value 0.

       --test_file (-T) [unknown]
              Test dataset to produce predictions for.

       --test_labels_file (-L) [unknown]
              Test dataset labels, if accuracy calculation is desired.

       --training_file (-t) [unknown]
              Training dataset.

       --verbose (-v) [bool]
              Display informational messages and the full list of parameters and timers at the end of execution.

       --version (-V) [bool]
              Display the version of mlpack.

       --warm_start (-w) [bool]
              If set and passed along with '--training_file (-t)' and '--input_model_file (-m)', trains more
              trees on top of the existing model.

Optional Output Options

       --output_model_file (-M) [unknown]
              Model to save trained random forest to.

       --predictions_file (-p) [unknown]
              Predicted classes for each point in the test set.

       --probabilities_file (-P) [unknown]
              Predicted class probabilities for each point in the test set.

Synopsis

mlpack_random_forest [-m unknown] [-l unknown] [-D int] [-g double] [-n int] [-N int] [-a bool] [-s int] [-d int] [-T unknown] [-L unknown] [-t unknown] [-V bool] [-w bool] [-M unknown] [-p unknown] [-P unknown] [-h -v]