
mlpack_preprocess_split - split data

Additional Information

       For further information, including relevant papers, citations,  and  theory,  consult  the  documentation
       found at http://www.mlpack.org or included with your distribution of mlpack.

mlpack-4.5.1                                     29 January 2025                      mlpack_preprocess_split(1)

Description

       This  utility  takes  a dataset and optionally labels and splits them into a training set and a test set.
       Before the split, the points in the dataset are randomly reordered. The percentage of the dataset  to  be
       used as the test set can be specified with the '--test_ratio (-r)' parameter; the default is 0.2 (20%).
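
       The shuffle-then-split behaviour described above can be sketched in a few lines of Python. This is only
       an illustration, not mlpack's implementation: the function name 'split', the use of Python's 'random'
       module, and the assumption that the test set size is the truncated value of (number of points *
       test ratio) are all hypothetical choices made for the sketch.

```python
import random


def split(points, test_ratio=0.2, shuffle=True, seed=None):
    """Return (train, test) lists built from `points`.

    Assumption for this sketch: the test set holds int(n * test_ratio)
    points and the training set holds the rest.
    """
    order = list(range(len(points)))
    if shuffle:
        # Randomly reorder the points before splitting.
        random.Random(seed).shuffle(order)
    test_size = int(len(points) * test_ratio)
    train_size = len(points) - test_size
    train = [points[i] for i in order[:train_size]]
    test = [points[i] for i in order[train_size:]]
    return train, test


# Ten points with a test ratio of 0.4: six for training, four for testing.
train, test = split(list(range(10)), test_ratio=0.4, seed=42)
print(len(train), len(test))  # 6 4
```

       With 'shuffle=False' the first points go to the training set and the last ones to the test set, which
       is the order-preserving behaviour one would expect from '--no_shuffle (-S)'.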

       The output training and test matrices may be saved with the '--training_file (-t)' and '--test_file (-T)'
       output parameters.

       Optionally, labels can also be split along with the data by specifying the '--input_labels_file (-I)'
       parameter. Splitting labels works the same way as splitting the data. The output training and test labels
       may be saved with the '--training_labels_file (-l)' and '--test_labels_file (-L)' output parameters,
       respectively.

       As a simple example, to split the dataset 'X.csv' into 'X_train.csv' and 'X_test.csv' with 60% of the
       data in the training set and 40% of the data in the test set, we could run

       $ mlpack_preprocess_split --input_file X.csv --training_file X_train.csv --test_file X_test.csv
       --test_ratio 0.4

       By default the dataset is shuffled before it is split; you can provide the '--no_shuffle (-S)' option
       to avoid shuffling the data. For example:

       $ mlpack_preprocess_split --input_file X.csv --training_file X_train.csv --test_file X_test.csv
       --test_ratio 0.4 --no_shuffle

       If  we  had  a  dataset  'X.csv'  and  associated  labels  'y.csv',  and  we  wanted  to split these into
       'X_train.csv', 'y_train.csv', 'X_test.csv', and 'y_test.csv', with 30% of the data in the  test  set,  we
       could run

       $ mlpack_preprocess_split --input_file X.csv --input_labels_file y.csv --test_ratio 0.3 --training_file
       X_train.csv --training_labels_file y_train.csv --test_file X_test.csv --test_labels_file y_test.csv
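
       Splitting labels "the same way" as the data means that each point must stay paired with its label. A
       minimal Python sketch of that idea follows, applying one shared ordering to both sequences. The name
       'split_with_labels' and the truncating size rule are assumptions made for illustration, not mlpack's
       own code.

```python
import random


def split_with_labels(points, labels, test_ratio=0.2, seed=None):
    """Split points and labels with one shared random ordering.

    Returns (train_points, train_labels, test_points, test_labels).
    Assumption for this sketch: test size is int(n * test_ratio).
    """
    if len(points) != len(labels):
        raise ValueError("points and labels must have the same length")
    order = list(range(len(points)))
    random.Random(seed).shuffle(order)
    cut = len(points) - int(len(points) * test_ratio)
    train_idx, test_idx = order[:cut], order[cut:]
    return (
        [points[i] for i in train_idx],
        [labels[i] for i in train_idx],
        [points[i] for i in test_idx],
        [labels[i] for i in test_idx],
    )


# Ten one-dimensional points; each label is the parity of the point's index.
X = [[float(i)] for i in range(10)]
y = [i % 2 for i in range(10)]
X_train, y_train, X_test, y_test = split_with_labels(X, y, test_ratio=0.3, seed=7)
print(len(X_train), len(X_test))  # 7 3
```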

       To maintain the ratio of each class in the training and test sets, the '--stratify_data (-z)' option
       can be used:

       $ mlpack_preprocess_split --input_file X.csv --training_file X_train.csv --test_file X_test.csv
       --test_ratio 0.4 --stratify_data
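
       To make the idea of stratification concrete, the Python sketch below splits each class separately so
       that every class contributes roughly the same fraction of its points to the test set. It is a generic
       per-class split written for illustration; the name 'stratified_split' is hypothetical and mlpack's
       actual stratification procedure may differ in its details.

```python
import random
from collections import defaultdict


def stratified_split(labels, test_ratio=0.2, seed=None):
    """Return (train_idx, test_idx) with each class split at `test_ratio`.

    Assumption for this sketch: each class sends int(count * test_ratio)
    of its points to the test set.
    """
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for i, label in enumerate(labels):
        by_class[label].append(i)
    train_idx, test_idx = [], []
    for members in by_class.values():
        rng.shuffle(members)  # shuffle within the class only
        test_size = int(len(members) * test_ratio)
        test_idx.extend(members[:test_size])
        train_idx.extend(members[test_size:])
    return train_idx, test_idx


# Ten points of class "a" and five of class "b", 40% held out per class.
labels = ["a"] * 10 + ["b"] * 5
train_idx, test_idx = stratified_split(labels, test_ratio=0.4, seed=0)
print(sum(labels[i] == "a" for i in test_idx))  # 4
print(sum(labels[i] == "b" for i in test_idx))  # 2
```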

Name

mlpack_preprocess_split - split data

Optional Input Options

       --help (-h) [bool]
              Default help info.

       --info [string]
              Print help on a specific option. Default value ''.

       --input_labels_file (-I) [unknown]
              Matrix containing labels.

       --no_shuffle (-S) [bool]
              Avoid shuffling the data before splitting.

       --seed (-s) [int]
              Random seed (0 for std::time(NULL)). Default value 0.

       --stratify_data (-z) [bool]
              Stratify the data according to labels.

       --test_ratio (-r) [double]
              Ratio of test set. Default value 0.2.

       --verbose (-v) [bool]
              Display informational messages and the full list of parameters and timers at the end of execution.

       --version (-V) [bool]
              Display the version of mlpack.

Optional Output Options

       --test_file (-T) [unknown]
              Matrix to save test data to.

       --test_labels_file (-L) [unknown]
              Matrix to save test labels to.

       --training_file (-t) [unknown]
              Matrix to save training data to.

       --training_labels_file (-l) [unknown]
              Matrix to save train labels to.

Required Input Options

       --input_file (-i) [unknown]
              Matrix containing data.

Synopsis

       mlpack_preprocess_split -i unknown [-I unknown] [-S bool] [-s int] [-z bool] [-r double] [-V bool] [-T unknown] [-L unknown] [-t unknown] [-l unknown] [-h -v]

See Also