Trie based classifiers evaluation

Introduction

In this document we show how to evaluate TriesWithFrequencies, [AA5, AAp7], based classifiers created over well known Machine Learning (ML) datasets. The computations are done with packages from Raku’s ecosystem.

The classifiers based on TriesWithFrequencies can be seen as some sort of Naive Bayesian Classifiers (NBCs).

We use the workflow summarized in this flowchart:

For more details on classification workflows see the article “A monad for classification workflows”. [AA1].

Document execution

This is a “computable Markdown document” – the Raku cells are (context-consecutively) evaluated with the “literate programming” package “Text::CodeProcessing”, [AA2, AAp5].

Remark: This document can be also made using the Mathematica-and-Raku connector, [AA3], but by utilizing the package “Text::Plot”, [AAp6, AA8], to produce (informative enough) graphs, that is “less needed.”


Data

Here we get Titanic data using the package “Data::Reshapers”, [AA3, AAp2]:

use Data::Reshapers;
my @dsTitanic=get-titanic-dataset(headers=>'auto');
dimensions(@dsTitanic)
# (1309 5)

Here is data sample:

to-pretty-table( @dsTitanic.pick(5), field-names => <passengerAge passengerClass passengerSex passengerSurvival>)
# +--------------+----------------+--------------+-------------------+
# | passengerAge | passengerClass | passengerSex | passengerSurvival |
# +--------------+----------------+--------------+-------------------+
# |      40      |      1st       |    female    |      survived     |
# |      20      |      3rd       |     male     |        died       |
# |      30      |      2nd       |     male     |        died       |
# |      30      |      3rd       |     male     |        died       |
# |      -1      |      3rd       |    female    |      survived     |
# +--------------+----------------+--------------+-------------------+

Here is a summary:

use Data::Summarizers;
records-summary(@dsTitanic)
# +---------------+----------------+-----------------+-------------------+----------------+
# | passengerSex  | passengerClass | id              | passengerSurvival | passengerAge   |
# +---------------+----------------+-----------------+-------------------+----------------+
# | male   => 843 | 3rd => 709     | 503     => 1    | died     => 809   | 20      => 334 |
# | female => 466 | 1st => 323     | 421     => 1    | survived => 500   | -1      => 263 |
# |               | 2nd => 277     | 726     => 1    |                   | 30      => 258 |
# |               |                | 936     => 1    |                   | 40      => 190 |
# |               |                | 659     => 1    |                   | 50      => 88  |
# |               |                | 446     => 1    |                   | 60      => 57  |
# |               |                | 260     => 1    |                   | 0       => 56  |
# |               |                | (Other) => 1302 |                   | (Other) => 63  |
# +---------------+----------------+-----------------+-------------------+----------------+

Trie creation

For demonstration purposes let us create a shorter trie and display it in tree form:

use ML::TriesWithFrequencies;
my $trTitanicShort = 
  @dsTitanic.map({ $_<passengerClass passengerSex passengerSurvival> }).&trie-create
  .shrink;
say $trTitanicShort.form;  
# TRIEROOT => 1309
# ├─1st => 323
# │ ├─female => 144
# │ │ ├─died => 5
# │ │ └─survived => 139
# │ └─male => 179
# │   ├─died => 118
# │   └─survived => 61
# ├─2nd => 277
# │ ├─female => 106
# │ │ ├─died => 12
# │ │ └─survived => 94
# │ └─male => 171
# │   ├─died => 146
# │   └─survived => 25
# └─3rd => 709
#   ├─female => 216
#   │ ├─died => 110
#   │ └─survived => 106
#   └─male => 493
#     ├─died => 418
#     └─survived => 75

Here is a mosaic plot that corresponds to the trie above:

Titanic-mosaic-plot

(The plot is made with Mathematica.)


Trie classifier

In order to make certain reproducibility statements for the kind of experiments shown here, we use random seeding (with srand) before any computations that use pseudo-random numbers. Meaning, one would expect Raku code that starts with an srand statement (e.g. srand(889)) to produce the same pseudo random numbers if it is executed multiple times (without changing it.)

Remark: Per this comment it seems that a setting of srand guarantees the production of reproducible between runs random sequences on the particular combination of hardware-OS-software Raku is executed on.

srand(889)
# 889

Here we split the data into training and testing data:

my ($dsTraining, $dsTesting) = take-drop( @dsTitanic.pick(*), floor(0.8 * @dsTitanic.elems));
say $dsTraining.elems;
say $dsTesting.elems;
# 1047
# 262

(The function take-drop is from “Data::Reshapers”. It follows Mathematica’s TakeDrop, [WRI1].)

Alternatively, we can say that:

  1. We get indices of dataset rows to make the training data
  2. We obtain the testing data indices as the complement of the training indices

Remark: It is better to do stratified sampling, i.e. apply take-drop per each label.

Here we make a trie with the training data:

my $trTitanic = $dsTraining.map({ $_.<passengerClass passengerSex passengerAge passengerSurvival> }).Array.&trie-create;
$trTitanic.node-counts
# {Internal => 63, Leaves => 85, Total => 148}

Here is an example decision-classification:

$trTitanic.classify(<1st female>)
# survived

Here is an example probabilities-classification:

$trTitanic.classify(<2nd male>, prop=>'Probs')
# {died => 0.851063829787234, survived => 0.14893617021276595}

We want to classify across all testing data, but not all testing data records might be present in the trie. Let us check that such testing records are few (or none):

$dsTesting.grep({ !$trTitanic.is-key($_<passengerClass passengerSex passengerAge>) }).elems
# 0

Let us remove the records that cannot be classified:

$dsTesting = $dsTesting.grep({ $trTitanic.is-key($_<passengerClass passengerSex passengerAge>) });
$dsTesting.elems
# 262

Here we classify all testing records (and show a few of the results):

my @testingRecords = $dsTesting.map({ $_.<passengerClass passengerSex passengerAge> }).Array;
my @clRes = $trTitanic.classify(@testingRecords).Array;
@clRes.head(5)
# (died died died survived died)

Here is a tally of the classification results:

tally(@clRes)
# {died => 186, survived => 76}

(The function tally is from “Data::Summarizers”. It follows Mathematica’s Tally, [WRI2].)

Here we make a Receiver Operating Characteristic (ROC) record, [AA5, AAp4]:

use ML::ROCFunctions;
my %roc = to-roc-hash('survived', 'died', select-columns( $dsTesting, 'passengerSurvival')>>.values.flat, @clRes)
# {FalseNegative => 45, FalsePositive => 15, TrueNegative => 141, TruePositive => 61}

Trie classification with ROC plots

In the next code cell we classify all testing data records. For each record:

  • Get probabilities hash
  • Add to that hash the actual label
  • Make sure the hash has both survival labels
use Hash::Merge;
my @clRes = 
do for [|$dsTesting] -> $r {
    my $res = [|$trTitanic.classify( $r<passengerClass passengerSex passengerAge>,  prop => 'Probs' ), Actual => $r<passengerSurvival>].Hash;
    merge-hash( { died => 0, survived => 0}, $res)
}

Here we obtain the range of the label “survived”:

my @vals = flatten(select-columns(@clRes, 'survived')>>.values);
(min(@vals), max(@vals))
# (0 1)

Here we make list of decision thresholds:

my @thRange = min(@vals), min(@vals) + (max(@vals)-min(@vals))/30 ... max(@vals);
records-summary(@thRange)
# +-------------------------------+
# | numerical                     |
# +-------------------------------+
# | Max    => 0.9999999999999999  |
# | Min    => 0                   |
# | Mean   => 0.5000000000000001  |
# | 3rd-Qu => 0.7666666666666666  |
# | 1st-Qu => 0.2333333333333333  |
# | Median => 0.49999999999999994 |
# +-------------------------------+

In the following code cell for each threshold:

  • For each classification hash decide on “survived” if the
    corresponding value is greater or equal to the threshold
  • Make threshold’s ROC-hash
my @rocs = @thRange.map(-> $th { to-roc-hash('survived', 'died', 
                                                select-columns(@clRes, 'Actual')>>.values.flat, 
                                                select-columns(@clRes, 'survived')>>.values.flat.map({ $_ >= $th ?? 'survived' !! 'died' })) });
# [{FalseNegative => 0, FalsePositive => 156, TrueNegative => 0, TruePositive => 106} {FalseNegative => 0, FalsePositive => 148, TrueNegative => 8, TruePositive => 106} .]

Here is the obtained ROC-hash table:

to-pretty-table(@rocs)
# +---------------+---------------+--------------+--------------+
# | FalsePositive | FalseNegative | TrueNegative | TruePositive |
# +---------------+---------------+--------------+--------------+
# |      156      |       0       |      0       |     106      |
# |      148      |       0       |      8       |     106      |
# |      137      |       2       |      19      |     104      |
# |      104      |       9       |      52      |      97      |
# |       97      |       10      |      59      |      96      |
# |       72      |       13      |      84      |      93      |
# |       72      |       13      |      84      |      93      |
# |       55      |       15      |     101      |      91      |
# |       46      |       19      |     110      |      87      |
# |       42      |       23      |     114      |      83      |
# |       33      |       28      |     123      |      78      |
# |       25      |       36      |     131      |      70      |
# |       22      |       39      |     134      |      67      |
# |       22      |       39      |     134      |      67      |
# |       18      |       40      |     138      |      66      |
# |       18      |       40      |     138      |      66      |
# |       10      |       51      |     146      |      55      |
# |       10      |       51      |     146      |      55      |
# |       4       |       54      |     152      |      52      |
# |       3       |       57      |     153      |      49      |
# |       3       |       57      |     153      |      49      |
# |       3       |       57      |     153      |      49      |
# |       3       |       57      |     153      |      49      |
# |       3       |       57      |     153      |      49      |
# |       3       |       57      |     153      |      49      |
# |       3       |       57      |     153      |      49      |
# |       3       |       60      |     153      |      46      |
# |       2       |       72      |     154      |      34      |
# |       2       |       72      |     154      |      34      |
# |       2       |       89      |     154      |      17      |
# |       2       |       89      |     154      |      17      |
# +---------------+---------------+--------------+--------------+

Here is the corresponding ROC plot:

use Text::Plot; 
text-list-plot(roc-functions('FPR')(@rocs), roc-functions('TPR')(@rocs),
                width => 70, height => 25, 
                x-label => 'FPR', y-label => 'TPR' )
# +--+------------+-----------+-----------+-----------+------------+---+        
# |                                                                    |        
# +                                                        *    *  *   +  1.00  
# |                                                                    |        
# |                                         * *                        |        
# |                        *      *                                    |        
# |                    *                                               |        
# +                   *                                                +  0.80  
# |               *                                                    |        
# |                                                                    |        
# |            *                                                       |        
# |         * *                                                        |       T
# +                                                                    +  0.60 P
# |      *                                                             |       R
# |    *                                                               |        
# |   *                                                                |        
# +   *                                                                +  0.40  
# |                                                                    |        
# |   *                                                                |        
# |                                                                    |        
# |                                                                    |        
# +                                                                    +  0.20  
# |   *                                                                |        
# |                                                                    |        
# +--+------------+-----------+-----------+-----------+------------+---+        
#    0.00         0.20        0.40        0.60        0.80         1.00       
#                                  FPR

We can see the Trie classifier has reasonable prediction abilities – we get ≈ 75% True Positive Rate (TPR) with relatively small False Positive Rate (FPR), ≈ 20%.

Here is a ROC plot made with Mathematica (using a different Trie over Titanic data):

Titanic-Trie-classifier-ROC-plot


Improvements

For simplicity the workflow above was kept “naive.” A better workflow would include:

  • Stratified partitioning of training and testing data
  • K-fold cross-validation
  • Variable significance finding
  • Specifically for Tries with frequencies: using different order of variables while constructing the trie

Remark: K-fold cross-validation can be “simply”achieved by running this document multiple times using different random seeds.


References

Articles

[AA1] Anton Antonov, “A monad for classification workflows”, (2018), MathematicaForPrediction at WordPress.

[AA2] Anton Antonov, “Raku Text::CodeProcessing”, (2021), RakuForPrediction at WordPress.

[AA3] Anton Antonov, “Connecting Mathematica and Raku”, (2021), RakuForPrediction at WordPress.

[AA4] Anton Antonov, “Introduction to data wrangling with Raku”, (2021), RakuForPrediction at WordPress.

[AA5] Anton Antonov, “ML::TriesWithFrequencies”, (2022), RakuForPrediction at WordPress.

[AA6] Anton Antonov, “Data::Generators”, (2022), RakuForPrediction at WordPress.

[AA7] Anton Antonov, “ML::ROCFunctions”, (2022), RakuForPrediction at WordPress.

[AA8] Anton Antonov, “Text::Plot”, (2022), RakuForPrediction at WordPress.

[Wk1] Wikipedia entry, “Receiver operating characteristic”.

Packages

[AAp1] Anton Antonov, Data::Generators Raku package, (2021), GitHub/antononcube.

[AAp2] Anton Antonov, Data::Reshapers Raku package, (2021), GitHub/antononcube.

[AAp3] Anton Antonov, Data::Summarizers Raku package, (2021), GitHub/antononcube.

[AAp4] Anton Antonov, ML::ROCFunctions Raku package, (2022), GitHub/antononcube.

[AAp5] Anton Antonov, Text::CodeProcessing Raku package, (2021), GitHub/antononcube.

[AAp6] Anton Antonov, Text::Plot Raku package, (2022), GitHub/antononcube.

[AAp7] Anton Antonov, ML::TriesWithFrequencies Raku package, (2021), GitHub/antononcube.

Functions

[WRI1] Wolfram Research (2015), TakeDrop, Wolfram Language function, (updated 2015).

[WRI2] Wolfram Research (2007), Tally, Wolfram Language function.

5 thoughts on “Trie based classifiers evaluation

Leave a Reply

Fill in your details below or click an icon to log in:

WordPress.com Logo

You are commenting using your WordPress.com account. Log Out /  Change )

Facebook photo

You are commenting using your Facebook account. Log Out /  Change )

Connecting to %s