Preparations and Settings

Silas machine learning requires three configuration files to run correctly: a metadata settings file, a metadata file, and a machine learning settings file. Each is described in its own section below.

We shall use command templates of the following form:

silas command [OPTIONS] [para1] [para2] ...

where OPTIONS include optional parameters, and para1, para2, etc. are mandatory parameters.

You can skip the sections below and generate all of these files automatically (see Generating All Configuration Files Automatically).

Metadata Settings

Metadata settings include two fields:

  • feature_type_settings: lists the data type for each feature,

  • missing_value_place_holders: lists string place holders for missing values.

The settings are stored in JSON format. The purpose of the metadata settings is to enable automatic generation of the metadata.

An example metadata settings file would look like the following:

"feature_type_settings":
[
    {
        "data_type": "collection",
        "feature_name": "Month"
    },
    {
        "data_type": "collection",
        "feature_name": "DayofMonth"
    }
],
"missing_value_place_holders":
[
    "",
    "NA",
    "?"
]
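
The same settings can also be produced programmatically. The sketch below (a hypothetical helper, not a Silas tool) builds the example structure in Python and writes it to disk as JSON:

```python
import json

# The example metadata settings from above, as a Python structure.
settings = {
    "feature_type_settings": [
        {"data_type": "collection", "feature_name": "Month"},
        {"data_type": "collection", "feature_name": "DayofMonth"},
    ],
    "missing_value_place_holders": ["", "NA", "?"],
}

# Write the settings using the default file name mentioned below.
with open("metadata-settings.json", "w") as f:
    json.dump(settings, f, indent=4)
```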

Data Type

There are three data types supported in Silas:

  • number: Used when a feature’s values are real/continuous numbers.

  • enumeration: Used when a feature’s values can be enumerated and sorted in a certain order. Example: low, medium, high.

  • collection: Used when a feature’s values can be enumerated but there is no obvious way to order the values. Example: east, west, north, south.
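
To make the distinction concrete, here is a hypothetical heuristic (not part of Silas) that suggests one of the three data types for a column of raw string values. Note that Silas's own generator chose collection for Month in the example above, and that an ordering such as low < medium < high cannot be inferred automatically, so treat this purely as an illustration of the type distinctions:

```python
def guess_data_type(values, enum_threshold=20):
    """Suggest "number", "enumeration", or "collection" for a column."""
    def is_number(v):
        try:
            float(v)
        except ValueError:
            return False
        return True

    if all(is_number(v) for v in values):
        # A small set of distinct numeric values may really be ordered
        # categories, so suggest "enumeration"; otherwise "number".
        return "enumeration" if len(set(values)) <= enum_threshold else "number"
    # Non-numeric values: an ordering cannot be inferred automatically,
    # so default to "collection" and let the user upgrade manually.
    return "collection"

print(guess_data_type(["east", "west", "north", "south"]))  # collection
```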

Generating Metadata Settings Automatically

To generate the metadata settings file, use the following command:

silas gen-metadata-settings [OPTIONS] [data_files...]

where data_files is a list of file paths for data sets, and OPTIONS include:

  • -h: print help message and exit.

  • -o file_path: output the metadata settings in the given file. If this option is not supplied, the metadata settings will be stored in metadata-settings.json in the directory where the command is issued.

  • --nh: a flag that indicates that the data set files do not have headers. In this case, Silas will generate new data set files that contain the same data and have headers. The new data set files will be saved in the same directory as the original data set files, and their file names will end with “-w-headers”.

If you have multiple data set files, you can use any one of them in the command. For instance, to generate a metadata settings file from data/dataset1.csv and output the settings in data/metadata-settings.json, run the following command:

silas gen-metadata-settings -o data/metadata-settings.json data/dataset1.csv

The user is encouraged to inspect the metadata settings file and select the data types for features manually.

Metadata

A metadata file defines the types of features and the features themselves in a given data set. These definitions are stored in JSON format.

You can skip the details below and jump to the Generating Metadata Automatically section.

Feature Type

N.B. In some literature, a feature is called an attribute.

The metadata file includes a list of feature types.

A feature type defines the data type of the feature and the (range of) values of the feature.

If a feature’s data type (called super_type in the definition) is number, the definition of the feature type includes the name of the feature type, the min value, and the max value. For example, we define the feature type “ratio” as follows:

"super_type": "number",
"name": "ratio",
"number_definition":
{
    "min": 0.0,
    "max": 1.0
}

If a feature’s data type is enumeration, the definition of the feature type includes the name of the feature type, the min value, the max value, and the value mappings. For example, we define the feature type “size” as follows:

"super_type": "enumeration",
"name": "size",
"enumeration_definition":
{
    "min": 1.0,
    "max": 3.0,
    "value_map":
    {
        "Short": 1.0,
        "Tall": 2.0,
        "Grande": 3.0
    }
}

If a feature’s data type is collection, the definition of the feature type includes the name of the feature type and the list of values. For example, we define the feature type “colour” as follows:

"super_type": "collection",
"name": "colour",
"collection_definition":
[
    "Red",
    "Blue",
    "Green"
]
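
Collected into one Python structure, the three example feature-type definitions above can be sanity-checked. This is an illustrative sketch, not a Silas validator:

```python
# The "ratio", "size", and "colour" feature types from the examples above.
feature_types = [
    {
        "super_type": "number",
        "name": "ratio",
        "number_definition": {"min": 0.0, "max": 1.0},
    },
    {
        "super_type": "enumeration",
        "name": "size",
        "enumeration_definition": {
            "min": 1.0,
            "max": 3.0,
            "value_map": {"Short": 1.0, "Tall": 2.0, "Grande": 3.0},
        },
    },
    {
        "super_type": "collection",
        "name": "colour",
        "collection_definition": ["Red", "Blue", "Green"],
    },
]

# Sanity checks: ranges are well-formed and mapped values stay in range.
for ft in feature_types:
    if ft["super_type"] == "number":
        d = ft["number_definition"]
        assert d["min"] <= d["max"]
    elif ft["super_type"] == "enumeration":
        d = ft["enumeration_definition"]
        assert all(d["min"] <= v <= d["max"] for v in d["value_map"].values())
    else:  # collection
        assert len(ft["collection_definition"]) > 0
```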

Feature

Each feature is defined by its name and feature type. For instance, below are some features in a flight data set:

{
    "name": "Dest",
    "type": "Location"
},
{
    "name": "Origin",
    "type": "Location"
},
{
    "name": "Distance",
    "type": "Distance"
},
{
    "name": "Month",
    "type": "Month"
},

Arithmetic Compound Feature

The user can define compound features that are derived from the basic features using arithmetic operators. Below we define the formal syntax of arithmetic terms supported in Silas.

An atomic arithmetic term can be of two forms:

  • ArithmeticVariable: When the operand is a feature.

  • ArithmeticConstant: When the operand is a number.

The user can build up compound arithmetic terms using the following operators:

  • Unary operators: Negative, Square.

  • N-ary operators: \(+\), \(-\), \(*\), \(/\).

The definition of a unary expression must include the type and the internal expression. For instance, the definition of \(-f\) where \(f\) is the name of a feature is given below:

"type": "Negative",
"internal":
{
    "type": "ArithmeticVariable",
    "name": "f"
}

The definition of an n-ary expression includes the type and the operands. For instance, the definition of \(f * 1.5\), where \(f\) is the name of a feature, is given below:

"type": "Multiply",
"operands":  [
    {
        "type": "ArithmeticVariable",
        "name": "f"
    },
    {
        "type": "ArithmeticConstant",
        "value": 1.5
    }
]

The definition of an arithmetic compound feature includes the name of the feature and the surrogate arithmetic term in the above format. For instance, given two features \(a\) and \(b\), we can define an arithmetic compound feature called c using the term \((1 / a) - (1 / b)\) as follows:

"name": "c",
"surrogate_term": {
    "type": "Minus",
    "operands": [
        {
            "type": "Divide",
            "operands": [
                {
                    "type": "ArithmeticConstant",
                    "value": 1.0
                },
                {
                    "type": "ArithmeticVariable",
                    "name": "a"
                }
            ]
        },
        {
            "type": "Divide",
            "operands": [
                {
                    "type": "ArithmeticConstant",
                    "value": 1.0
                },
                {
                    "type": "ArithmeticVariable",
                    "name": "b"
                }
            ]
        }
    ]
}
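
A surrogate term in this JSON format lends itself to recursive evaluation. The sketch below is illustrative only, not Silas's implementation; in particular, the operator name "Plus" is an assumption following the naming pattern of "Minus", "Multiply", and "Divide" from the examples:

```python
def evaluate(term, features):
    """Recursively evaluate an arithmetic term against feature values."""
    t = term["type"]
    if t == "ArithmeticConstant":
        return term["value"]
    if t == "ArithmeticVariable":
        return features[term["name"]]
    if t == "Negative":
        return -evaluate(term["internal"], features)
    if t == "Square":
        return evaluate(term["internal"], features) ** 2
    # N-ary operators: evaluate every operand first.
    operands = [evaluate(o, features) for o in term["operands"]]
    if t == "Plus":  # assumed name; not shown in the examples above
        return sum(operands)
    if t == "Minus":
        result = operands[0]
        for x in operands[1:]:
            result -= x
        return result
    if t == "Multiply":
        result = 1.0
        for x in operands:
            result *= x
        return result
    if t == "Divide":
        result = operands[0]
        for x in operands[1:]:
            result /= x
        return result
    raise ValueError("unknown term type: " + t)

# The (1 / a) - (1 / b) surrogate term from the example above.
term = {
    "type": "Minus",
    "operands": [
        {"type": "Divide", "operands": [
            {"type": "ArithmeticConstant", "value": 1.0},
            {"type": "ArithmeticVariable", "name": "a"}]},
        {"type": "Divide", "operands": [
            {"type": "ArithmeticConstant", "value": 1.0},
            {"type": "ArithmeticVariable", "name": "b"}]},
    ],
}

print(evaluate(term, {"a": 2.0, "b": 4.0}))  # (1/2) - (1/4) = 0.25
```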

Generating Metadata Automatically

Silas comes with a tool that can generate metadata automatically from the data set. To do so, use the following command:

silas gen-metadata [OPTIONS] [metadata_settings] [data_files...]

where metadata_settings is the file path of the metadata settings, data_files is a list of file paths of data sets, and OPTIONS include:

  • -h: Print help message and exit.

  • -o file_path: output the metadata in the given file. If this option is not supplied, the metadata will be stored in metadata.json in the directory where the command is issued.

For instance, to output the metadata in metadata1.json using metadata settings in data/metadata-settings1.json and data set source files data/dataset1.csv and data/dataset2.csv, use the following command:

silas gen-metadata -o metadata1.json data/metadata-settings1.json data/dataset1.csv data/dataset2.csv

Note that the gen-metadata command will also output the statistics of features in feature_type_stats.json in the directory where the output file is.

Plot Graphs of Feature Stats

Silas provides a simple visualisation of the data set in the terminal with the following command:

silas draw [OPTIONS] data_stats_file

where data_stats_file is the feature statistics file generated by silas gen-metadata, and OPTIONS only include a -h flag to show the help message.

Machine Learning Settings

Parameters in Settings

The machine learning settings file defines the parameters for Silas machine learning. The settings are stored in JSON format. The parameters include:

  • outcome_feature: The feature to be predicted or classified. Sometimes called class in the literature.

  • metadata_file: The path of the metadata file.

  • validation_method: The method used for testing/validation. Permitted values are:
    • “TT”: Train and test. This means that the user must supply a data set for training and a separate data set for testing.

    • “CV”: Cross-validation. In this case, the user should specify a number for the “number_of_cross_validation_partitions” field.

  • cv_settings: This field is only for cross-validation. It has the following fields:
    • dataset_file: The path and name of the data set. This field is only used when validation_method is “CV”.

    • number_of_cross_validation_partitions: The number of partitions in cross-validation. For 10-fold cross-validation, use 10. This field is only used when validation_method is “CV”.

    • number_of_runs: The number of times to run the machine learning and validation.

  • tt_settings: This field is only for train and test. It has the following fields:
    • training_dataset_file: The path and name of the training data set. This field is only used when validation_method is “TT”.

    • testing_dataset_file: The path and name of the testing data set. This field is only used when validation_method is “TT”.

  • number_of_trees: The number of decision trees in the ensemble model.

  • max_depth: The max depth of the decision trees. When the max depth is n, the tree has at most \(2^{n+1} - 1\) nodes, of which \(2^n\) are leaf nodes. This parameter is used as a stopping condition to control the growth of trees. If max_depth is 32, the tree building algorithm stops expanding a branch when the depth of the branch reaches 32.

  • desired_leaf_size: The desired number of data instances contained in a leaf node. This parameter is used as a stopping condition to control the growth of trees. If desired_leaf_size is 32, the tree building algorithm stops expanding a branch when the leaf node contains fewer than 32 data instances.

  • number_of_outcome_subintervals: The number of bins in the histogram for computing entropy with regard to the outcome feature. If the outcome feature only has n possible values, then this number doesn’t need to be more than n.

  • selected_features: A list of features that are used in the machine learning process. Note that this list should not include the outcome feature.

  • feature_proportion: The proportion of the number of features used when building decision trees. Permitted values are:
    • “sqrt”: The square root of the total number of features (default).

    • “log”: The natural log of the total number of features.

    • “log2”: log base 2 of the total number of features.

    • “golden”: 0.618 times the total number of features.

    • Any float number (without double quotes) between 0.0 and 1.0.

  • sampling_method: The method used to sample the data set. Permitted values are:
    • “balancing”: Balance the proportions of each class. If the smallest class has n instances, then randomly sample n instances from the other classes.

    • “uniform”: Uniformly sample the same proportion for each class.

  • sampling_proportion: The proportion of sampled data which is applied on top of the sampling method. For instance, when the sampling method is uniform, sampling_proportion = 0.8 means that the sampled data set contains 80% instances of the original data set for each class. When the sampling method is balancing, sampling_proportion = 0.8 means that Silas first balances each class by sampling majority classes to have the same number of instances as the least common class, then 80% instances are sampled from each class.

  • oob_proportion: The proportion of data instances in the out-of-bag (OOB) sample. If oob_proportion is 0.1, then 10% of the sampled data are used as OOB instances, which are not used when building a decision tree. The OOB splitting occurs after the sampling of the data set. Example: for a data set with 1 million instances, if sampling_method = “uniform”, sampling_proportion = 0.7, and oob_proportion = 0.1, then only 630K instances are used for training a decision tree, and 70K instances are in the OOB set, which is used to evaluate the decision tree.
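
The arithmetic behind the sampling parameters and the max_depth bound can be checked directly. The helper below is hypothetical (not part of Silas) and mirrors the worked example in the text:

```python
def split_counts(total, sampling_proportion, oob_proportion):
    """Return (training, oob) instance counts for building one tree."""
    sampled = round(total * sampling_proportion)  # instances drawn per tree
    oob = round(sampled * oob_proportion)         # held out for evaluation
    return sampled - oob, oob

# The worked example: 1M instances, sampling 0.7, OOB 0.1.
train, oob = split_counts(1_000_000, 0.7, 0.1)
print(train, oob)  # 630000 70000

# max_depth: a tree of depth n has at most 2**(n+1) - 1 nodes,
# of which 2**n are leaves; e.g. depth 3 gives at most 15 nodes, 8 leaves.
max_depth = 3
print(2 ** (max_depth + 1) - 1, 2 ** max_depth)  # 15 8
```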

Generating Machine Learning Settings Automatically

To generate a settings file automatically, use the following command:

silas gen-settings [OPTIONS] [metadata_file] [data_files...]

where metadata_file specifies the metadata file path, data_files is a list of data set file paths, and OPTIONS include:

  • -h: Print help message and exit.

  • -v validation_mode: Specify the validation mode. If this option is not supplied, the validation mode will be deduced from the number of data set files: cross-validation if only one data set file is supplied, and training and testing if more than one data set file is supplied, in which case the first file will be used for training and the second file will be used for testing. There are two options:
    • “cv”: Cross-validation. This means that the user has to specify at least 1 data set file. If multiple files are supplied, the first file will be used for training and validation.

    • “tt”: Training and testing. This means that the user has to specify at least 2 data set files. If multiple files are supplied, the first data set file will be used for training and the second one will be used for testing.

  • -o file_path: output the settings in the given file. If this option is not supplied, the settings will be stored in settings.json in the directory where the command is issued.

For example, to generate a settings file called settings1.json from the metadata file data/metadata1.json and the data set file data/dataset1.csv, use the following command:

silas gen-settings -o settings1.json data/metadata1.json data/dataset1.csv

Generating All Configuration Files Automatically

To generate all the files required in Silas machine learning automatically, use the following command:

silas gen-all [OPTIONS] [data_files...]

where data_files is a list of file paths of data sets and OPTIONS include:

  • -h: Print help message and exit.

  • -v validation_mode: Specify the validation mode. If this option is not supplied, the validation mode will be deduced from the number of data set files: cross-validation if only one data set file is supplied, and training and testing if more than one data set file is supplied, in which case the first file will be used for training and the second file will be used for testing. There are two options:
    • “cv”: Cross-validation. This means that the user has to specify at least 1 data set file. If multiple files are supplied, the first data set file will be used for training and validation. The remaining data set files will be used only for computing the statistics of the data sets.

    • “tt”: Training and testing. This means that the user has to specify at least 2 data set files. If multiple files are supplied, the first data set file will be used for training and the second one will be used for testing. The remaining data set files will be used only for computing the statistics of the data sets.

  • -o directory: output the configuration files in the specified directory. If this option is not supplied, the configuration files will be stored in the directory where the command is issued.

  • --nh: a flag that indicates that the data set files do not have headers. In this case, Silas will generate new data set files that contain the same data and have headers. The new data set files will be saved in the same directory as the original data set files, and their file names will end with “-w-headers”.

For instance, to generate all the configuration files from data/train.csv and data/test.csv for training and testing and storing the configuration files in config/, run the following command:

silas gen-all -v tt -o config data/train.csv data/test.csv

Sanitise the Data

If the data set contains missing values or incorrectly formatted data, you can sanitise it using the following command:

silas sanitise [OPTIONS] [metadata_settings] [feature_type_stats_file] [metadata] [data_files...]

where metadata_settings is the file path of the metadata settings; feature_type_stats_file is the file path of the feature statistics file, which is generated by the gen-metadata command; metadata is the file path of the metadata; data_files is a list of file paths of data sets; and OPTIONS include:

  • -c with the following options:
    • new: replace missing categorical values with new category.

    • most-common: replace missing categorical values with the most common category.

    • least-common: replace missing categorical values with the least common category.

    • remove: remove missing categorical values.

  • -n with the following options:
    • mean: replace missing numerical values with the mean value.

    • median: replace missing numerical values with the median value.

    • new-above-max: replace missing numerical values with max + 1.

    • new-under-min: replace missing numerical values with min - 1.

    • remove: remove missing numerical values.

By default, the strategy for categorical values is to create a new category (“-c new”) and the strategy for numerical values is to use the mean (“-n mean”).
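
The default strategies can be illustrated with a minimal sketch (not Silas's implementation), assuming the missing-value placeholders from the example metadata settings:

```python
# Placeholders taken from the example metadata settings earlier in this guide.
MISSING = {"", "NA", "?"}

def fill_categorical_new(values, new_category="MISSING"):
    """The "-c new" strategy: replace missing values with a new category."""
    return [new_category if v in MISSING else v for v in values]

def fill_numerical_mean(values):
    """The "-n mean" strategy: replace missing values with the column mean."""
    present = [float(v) for v in values if v not in MISSING]
    mean = sum(present) / len(present)
    return [mean if v in MISSING else float(v) for v in values]

print(fill_categorical_new(["Red", "NA", "Blue"]))  # ['Red', 'MISSING', 'Blue']
print(fill_numerical_mean(["1.0", "?", "3.0"]))     # [1.0, 2.0, 3.0]
```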

For instance, to sanitise example/data.csv using example/metadata-settings.json, example/feature-type-stats.json, and example/metadata.json, with the strategy that replaces missing categorical values with the most common category and missing numerical values with the median, run the following command:

silas sanitise -c most-common -n median example/metadata-settings.json example/feature-type-stats.json example/metadata.json example/data.csv