Advanced Tutorial

In this tutorial, we will use the “diabetes” training and testing datasets. The task for this dataset is to predict whether a patient is likely to have diabetes.

Navigate to the directory where the Silas executable is located, let’s call it bin. Create a new directory called tutorial2 in bin.

Move the dataset to bin/tutorial2/data/.

Machine Learning

At bin, run the following command to generate settings:

silas gen-all -o tutorial2 tutorial2/data/diabetes_train.csv tutorial2/data/diabetes_test.csv

This command will output all the settings files in the folder tutorial2. We can run the machine learning algorithm and see the result.

silas learn tutorial2/settings.json

You should see around 0.80 accuracy and 0.88 AUC using the default settings.

Open tutorial2/metadata-settings.json. We can inspect the default settings for each feature. The rules of thumb are:

  • If the attribute has numeric values and there are many unique values that can be ordered, then its type is numerical. Examples: age, year, height, weight, price, etc.

  • If the attribute only has a small number of unique values that cannot be ordered or are textual, then its type is nominal. Examples: sex, direction, colour, etc.
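The rules of thumb above can be sketched as a small Python helper. This is purely illustrative and not part of Silas; the threshold of 10 unique values is an arbitrary choice for this sketch.

```python
def guess_feature_type(values, max_nominal_unique=10):
    """Guess a feature's type using the rules of thumb above.

    Numeric columns with many orderable unique values -> "numerical";
    columns with few unique values, or textual values -> "nominal".
    """
    unique = set(values)
    all_numeric = all(isinstance(v, (int, float)) for v in unique)
    if all_numeric and len(unique) > max_nominal_unique:
        return "numerical"
    return "nominal"

# Ages span many orderable numeric values -> numerical.
ages = [23, 45, 31, 67, 52, 29, 41, 38, 60, 24, 55, 33]
print(guess_feature_type(ages))                        # numerical

# Sex has only a few textual values -> nominal.
print(guess_feature_type(["male", "female", "male"]))  # nominal
```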

In this example, it seems that the automatically chosen feature types are correct, so we leave them as they are.

If you want to modify metadata-settings.json, you need to regenerate the metadata and settings using the commands below.

silas gen-metadata -o tutorial2/metadata.json tutorial2/metadata-settings.json tutorial2/data/diabetes_train.csv tutorial2/data/diabetes_test.csv
silas gen-settings -o tutorial2/settings.json -v cv tutorial2/metadata.json tutorial2/data/diabetes_train.csv tutorial2/data/diabetes_test.csv

Open tutorial2/settings.json. Usually, the most important parameters in this file are the following:

  • number-of-trees: 100 is often good enough for small datasets. If the dataset contains a million or more instances, then 200 ~ 500 may give better results. If the dataset contains 10 million or more instances, you can consider building 1000 or more trees.

  • max-depth: For boosting-related algorithms you may want to set the depth to a small number such as 2 ~ 4. For random-forest-based algorithms you should leave it at 64, which is essentially as deep as a tree can grow.

  • desired-leaf-size: this number largely depends on the dataset, and it may change the performance quite a bit. You are encouraged to try any number between 1 ~ 128.

  • feature-proportion: using more features when choosing the decision nodes may give better results, but it may hurt the generalisation of the model. Other popular options include golden, or any decimal number between 0 and 1.

  • The type of forest: ClassicForest is the fastest; it is optimised for binary classification. Other popular algorithms include SimpleForest and CascadeForest.

  • The type of trees: GreedyNarrow1D is the fastest; it is optimised for binary classification. Other popular algorithms include RdGreedy1D.

For more info regarding the last two points, refer to the documentation on the parameters.

For instance, with the below settings, you should get around 0.87 accuracy and 0.95 AUC.

{
   "metadata-file": "metadata.json",
   "output-feature": "class",
   "ignored-features": [],
   "learner-settings": {
      "mode": "classification",
      "reduction-strategy": "none",
      "grower-settings": {
            "forest-settings": {
               "type": "SimpleForest",
               "number-of-trees": 100,
               "sampling-proportion": 1.0,
               "oob-proportion": 0.05
            },
            "tree-settings": {
               "type": "RdGreedy1D",
               "feature-proportion": "sqrt",
               "max-depth": 64,
               "desired-leaf-size": 8
            }
      }
   },
   "training-dataset": {
      "type": "CSV",
      "path": "data/diabetes_train.csv"
   },
   "validation-settings": {
      "type": "TT",
      "testing-dataset": {
            "type": "CSV",
            "path": "data/diabetes_test.csv"
      }
   }
}

You can play around with the settings and see if you can improve the result further. Once you have settled on a set of settings, you can generate a model using the -o option.

silas learn -o tutorial2/model tutorial2/settings.json

The above command saves the model in the folder tutorial2/model.

XAI: Model Explanation

Download the Python code of OptExplain for Silas and put it in the bin folder.

This Python program requires a file that contains predictions made by Silas. Since we only have one test dataset in this example, we can generate a dummy prediction file from the test dataset.

silas predict -o tutorial2/pred.csv tutorial2/model tutorial2/data/diabetes_test.csv

Then we can go to the OptExplain folder. Before we use the OptExplain algorithm, we need to install the required dependencies.

pip install -r requirements.txt

Run the OptExplain algorithm using the following command:

python3 OptExplain.py -m ../tutorial2/model -t ../tutorial2/data/diabetes_test.csv -p ../tutorial2/pred.csv

It may take a while to compute the explanation, but eventually you should see something like the following text as a part of the output:

---------------- Explanation -----------------
max_rule 1.000000  max_node 0.918296
max_rule 0.000000  max_node 0.001355
EX Running time: 0.09924626350402832 seconds
Original #rules: 15488
Original scale: 183526
#rules after rule-filter: 26
No MAX-SAT
SAT running time: 7.772445678710938e-05 seconds

Classes: ['tested_negative', 'tested_positive']
Group  0: |  17 samples |  2 rules | (3.0, 0.0)
            9 samples        (feature_1 <= 152.17149353027344)
            8 samples        (feature_6 <= 0.6463924646377563)

Group  1: |  16 samples |  2 rules | (0.0, 3.0)
            9 samples        (feature_1 > 116.90687561035156)
            7 samples        (feature_5 > 27.237857818603516)

conjuncts num: 4

---------------- Performance -----------------
ET Running time: 0.009005069732666016 seconds
Sample size:     231
RF accuracy:     0.8787878787878788
RF AUC:          0.9571354794284732
EX accuracy:     0.7532467532467533
EX AUC:          0.7406180065415734
Coverage:        1.0
Overlap:         0.7705627705627706
*Performance:    0.8658008658008658

As you can see, the original forest has over 15K decision rules, which are certainly not explainable. The OptExplain algorithm simplifies the forest down to only 4 decision rules, which retain 0.75 accuracy. The explanation is read as follows: given a sample, test it against each of the 4 rules. For instance, a sample may satisfy the rules “feature_1 <= 152.17149353027344”, “feature_6 <= 0.6463924646377563” and “feature_5 > 27.237857818603516”. The expected outcome is then (3.0, 0.0) * 9 + (3.0, 0.0) * 8 + (0.0, 3.0) * 7 = (51, 21), so the prediction is tested_negative. In this way, we identify the most important decisions and their weights when making a prediction.
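The vote arithmetic above can be reproduced with a short Python sketch; the per-rule sample counts and vote vectors are taken directly from the OptExplain output shown earlier.

```python
# Each matched rule contributes its vote vector weighted by its sample count.
# Values below come from the example output above: the two Group 0 rules vote
# (3.0, 0.0) and the matched Group 1 rule votes (0.0, 3.0).
matched_rules = [
    (9, (3.0, 0.0)),  # feature_1 <= 152.17... (9 samples)
    (8, (3.0, 0.0)),  # feature_6 <= 0.646...  (8 samples)
    (7, (0.0, 3.0)),  # feature_5 > 27.237...  (7 samples)
]

totals = [0.0, 0.0]
for samples, vote in matched_rules:
    totals[0] += samples * vote[0]
    totals[1] += samples * vote[1]

classes = ["tested_negative", "tested_positive"]
print(totals)                              # [51.0, 21.0]
print(classes[totals.index(max(totals))])  # tested_negative
```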

XAI: Feature Importance

Silas can compute feature importance for each prediction instance using the -e option of the predict command. For instance, we can run the prediction with the following example:

silas predict -o tutorial2/pred2.csv -e tutorial2/model tutorial2/data/diabetes_test.csv

Open pred2.csv; you will see that in addition to the prediction results, each instance has a number of extra columns. In particular, each feature has two columns: w_feature gives the weight (feature importance score) for that feature, and r_feature gives the range of values for that feature that is important in the prediction.
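As a sketch of how these extra columns might be used, the snippet below ranks features by their importance score for one instance. The column names follow the w_/r_ pattern described above, but the feature names and weight values in the excerpt are made up for illustration.

```python
import csv
import io

# A made-up one-row excerpt in the shape described above: for each feature
# there is a w_<feature> (importance weight) and an r_<feature> (value range).
sample_csv = """prediction,w_glucose,r_glucose,w_bmi,r_bmi,w_age,r_age
tested_positive,0.42,"(117, 199]",0.31,"(27.2, 42.9]",0.05,"[21, 30)"
"""

row = next(csv.DictReader(io.StringIO(sample_csv)))
weights = {k[2:]: float(v) for k, v in row.items() if k.startswith("w_")}
for name, w in sorted(weights.items(), key=lambda kv: -kv[1]):
    print(f"{name}: {w}")  # features ranked by importance, highest first
```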

XAI: Adversarial Samples

Run main.py as in the following example to compute adversarial samples using MUC:

python3 main.py -m ../tutorial2/model -t ../tutorial2/data/diabetes_test.csv -A

The computation may take a while, but eventually you should see output like the following.

========== Adversarial Sample ==========
WARNING: nominal features should be one-hot encoded.
Generating for x:   [3, 199.0, 76.0, 43.0, 0.0, 42.9, 1.394, 22.0]
Original class:     1
Good tau found:     [0.000 47.754 6.780 7.754 6.794 12.553 0.095 0.000]. (12276.207s)
Best theta found:   [0.000 -42.348 -4.849 -4.504 -1.358 -12.423 0.081 0.000]. (4.498s)
Before opt:         [3.000 156.652 71.151 38.496 -1.358 30.477 1.475 22.000]
Distance:           8.195414922619799
Optimized sample found. (2.253s)
Opt sample:         [3.000 156.756 71.163 38.507 -1.355 30.508 1.475 22.000]
Distance:           8.17525936318561
Adv sample class:   0

The program automatically chooses the first instance ([3, 199.0, 76.0, 43.0, 0.0, 42.9, 1.394, 22.0]) in the testing dataset and generates an adversarial example. You can modify the testing dataset or the python program to generate based on other instances. The original instance’s class is 1, and the optimal adversarial instance ([3.000 156.756 71.163 38.507 -1.355 30.508 1.475 22.000])’s class is 0. The distance between the original instnace and the adversarial instnace is 8.17.