## Introduction

In this document we show how to evaluate classifiers based on TriesWithFrequencies, [AA5, AAp7], created over well-known Machine Learning (ML) datasets. The computations are done with packages from Raku’s ecosystem.

The classifiers based on TriesWithFrequencies can be seen as a kind of Naive Bayesian Classifier (NBC).

We use the workflow summarized in this flowchart:

For more details on classification workflows see the article “A monad for classification workflows”, [AA1].

### Document execution

This is a “computable Markdown document” – the Raku cells are (context-consecutively) evaluated with the “literate programming” package “Text::CodeProcessing”, [AA2, AAp5].

**Remark:** This document *can* also be made using the Mathematica-and-Raku connector, [AA3], but since the package “Text::Plot”, [AAp6, AA8], produces (informative enough) graphs, that is “less needed.”

## Data

Here we get Titanic data using the package “Data::Reshapers”, [AA4, AAp2]:

```
use Data::Reshapers;
my @dsTitanic=get-titanic-dataset(headers=>'auto');
dimensions(@dsTitanic)
```

`# (1309 5)`

Here is a data sample:

`to-pretty-table( @dsTitanic.pick(5), field-names => <passengerAge passengerClass passengerSex passengerSurvival>)`

```
# +--------------+----------------+--------------+-------------------+
# | passengerAge | passengerClass | passengerSex | passengerSurvival |
# +--------------+----------------+--------------+-------------------+
# | 40           | 1st            | female       | survived          |
# | 20           | 3rd            | male         | died              |
# | 30           | 2nd            | male         | died              |
# | 30           | 3rd            | male         | died              |
# | -1           | 3rd            | female       | survived          |
# +--------------+----------------+--------------+-------------------+
```

Here is a summary:

```
use Data::Summarizers;
records-summary(@dsTitanic)
```

```
# +---------------+----------------+-----------------+-------------------+----------------+
# | passengerSex  | passengerClass | id              | passengerSurvival | passengerAge   |
# +---------------+----------------+-----------------+-------------------+----------------+
# | male => 843   | 3rd => 709     | 503 => 1        | died => 809       | 20 => 334      |
# | female => 466 | 1st => 323     | 421 => 1        | survived => 500   | -1 => 263      |
# |               | 2nd => 277     | 726 => 1        |                   | 30 => 258      |
# |               |                | 936 => 1        |                   | 40 => 190      |
# |               |                | 659 => 1        |                   | 50 => 88       |
# |               |                | 446 => 1        |                   | 60 => 57       |
# |               |                | 260 => 1        |                   | 0 => 56        |
# |               |                | (Other) => 1302 |                   | (Other) => 63  |
# +---------------+----------------+-----------------+-------------------+----------------+
```

## Trie creation

For demonstration purposes let us create a *shorter* trie and display it in tree form:

```
use ML::TriesWithFrequencies;
my $trTitanicShort =
    @dsTitanic.map({ $_<passengerClass passengerSex passengerSurvival> }).&trie-create
    .shrink;
say $trTitanicShort.form;
```

```
# TRIEROOT => 1309
# ├─1st => 323
# │ ├─female => 144
# │ │ ├─died => 5
# │ │ └─survived => 139
# │ └─male => 179
# │   ├─died => 118
# │   └─survived => 61
# ├─2nd => 277
# │ ├─female => 106
# │ │ ├─died => 12
# │ │ └─survived => 94
# │ └─male => 171
# │   ├─died => 146
# │   └─survived => 25
# └─3rd => 709
#   ├─female => 216
#   │ ├─died => 110
#   │ └─survived => 106
#   └─male => 493
#     ├─died => 418
#     └─survived => 75

Here is a mosaic plot that corresponds to the trie above:

(The plot is made with Mathematica.)

## Trie classifier

In order to make reproducibility statements for the kind of experiments shown here, we use random seeding (with `srand`) before any computations that use pseudo-random numbers. That is, one would expect Raku code that starts with an `srand` statement (e.g. `srand(889)`) to produce the same pseudo-random numbers each time it is executed (as long as the code is not changed).

**Remark:** Per this comment it seems that setting `srand` guarantees random sequences that are reproducible between runs on the particular combination of hardware, OS, and software Raku is executed on.

`srand(889)`

`# 889`
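Here is a small sketch of that expectation, using only core Raku: seed, draw a sample, seed again with the same value, and draw again. The range `1..1000` and the sample size are arbitrary choices for this illustration.

```raku
# Seed, draw, re-seed with the same value, draw again.
srand(889);
my @first = (1..1000).pick(5);

srand(889);
my @second = (1..1000).pick(5);

# On the same machine and Raku build the two samples are expected to coincide.
say @first eqv @second;
```

If the two samples differ, then the seeding is not doing what the statements above assume for that particular setup.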

Here we split the data into training and testing data:

```
my ($dsTraining, $dsTesting) = take-drop( @dsTitanic.pick(*), floor(0.8 * @dsTitanic.elems));
say $dsTraining.elems;
say $dsTesting.elems;
```

```
# 1047
# 262
```

(The function `take-drop` is from “Data::Reshapers”. It follows Mathematica’s `TakeDrop`, [WRI1].)

Alternatively, we can say that:

- We get indices of dataset rows to make the training data
- We obtain the testing data indices as the complement of the training indices

**Remark:** It is better to do stratified sampling, i.e. to apply `take-drop` separately for each label.
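A stratified split along those lines can be sketched with core Raku only. The toy records below are hypothetical (only the field name `passengerSurvival` mirrors the dataset), and the 80/20 ratio is the one used above; this is one possible way to do it, not the method of any of the packages.

```raku
# Hypothetical toy records; only the label field matters for this sketch.
my @records = (1..20).map(-> $i {
    %( id => $i, passengerSurvival => ($i %% 4 ?? 'survived' !! 'died') )
});

srand(889);

my (@training, @testing);
# Group the records by label, then split each group 80/20 separately.
for @records.classify(-> $r { $r<passengerSurvival> }).values -> @group {
    my @shuffled = @group.pick(*);
    my $n = floor(0.8 * @shuffled.elems);
    @training.append: @shuffled.head($n);
    @testing.append:  @shuffled.skip($n);
}

say @training.elems;   # 16
say @testing.elems;    # 4
```

Since each label group is split on its own, the label proportions in the training and testing parts stay close to those of the full data.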

Here we make a trie with the training data:

```
my $trTitanic = $dsTraining.map({ $_.<passengerClass passengerSex passengerAge passengerSurvival> }).Array.&trie-create;
$trTitanic.node-counts
```

`# {Internal => 63, Leaves => 85, Total => 148}`

Here is an example *decision*-classification:

`$trTitanic.classify(<1st female>)`

`# survived`

Here is an example *probabilities*-classification:

`$trTitanic.classify(<2nd male>, prop=>'Probs')`

`# {died => 0.851063829787234, survived => 0.14893617021276595}`

We want to classify across all testing data, but not all testing data records might be present in the trie. Let us check that such testing records are few (or none):

`$dsTesting.grep({ !$trTitanic.is-key($_<passengerClass passengerSex passengerAge>) }).elems`

`# 0`

Let us remove the records that cannot be classified:

```
$dsTesting = $dsTesting.grep({ $trTitanic.is-key($_<passengerClass passengerSex passengerAge>) });
$dsTesting.elems
```

`# 262`

Here we classify all testing records (and show a few of the results):

```
my @testingRecords = $dsTesting.map({ $_.<passengerClass passengerSex passengerAge> }).Array;
my @clRes = $trTitanic.classify(@testingRecords).Array;
@clRes.head(5)
```

`# (died died died survived died)`

Here is a tally of the classification results:

`tally(@clRes)`

`# {died => 186, survived => 76}`

(The function `tally` is from “Data::Summarizers”. It follows Mathematica’s `Tally`, [WRI2].)

Here we make a Receiver Operating Characteristic (ROC) record, [AA7, AAp4]:

```
use ML::ROCFunctions;
my %roc = to-roc-hash('survived', 'died', select-columns( $dsTesting, 'passengerSurvival')>>.values.flat, @clRes)
```

`# {FalseNegative => 45, FalsePositive => 15, TrueNegative => 141, TruePositive => 61}`
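To make the ROC record easier to read, here is a short sketch that derives a few standard rates from it using core Raku arithmetic. The counts are copied from the output above; the variable names are chosen for this illustration only.

```raku
# Confusion counts copied from the ROC hash above.
my %roc = FalseNegative => 45, FalsePositive => 15, TrueNegative => 141, TruePositive => 61;

# True Positive Rate (recall): share of actual "survived" predicted as "survived".
my $tpr = %roc<TruePositive> / (%roc<TruePositive> + %roc<FalseNegative>);
# False Positive Rate: share of actual "died" predicted as "survived".
my $fpr = %roc<FalsePositive> / (%roc<FalsePositive> + %roc<TrueNegative>);
# Accuracy: share of all records that are classified correctly.
my $acc = (%roc<TruePositive> + %roc<TrueNegative>) / %roc.values.sum;

say "TPR: {$tpr.round(0.001)}, FPR: {$fpr.round(0.001)}, accuracy: {$acc.round(0.001)}";
# TPR: 0.575, FPR: 0.096, accuracy: 0.771
```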

## Trie classification with ROC plots

In the next code cell we classify all testing data records. For each record:

- Get probabilities hash
- Add to that hash the actual label
- Make sure the hash has both survival labels

```
use Hash::Merge;
my @clRes =
do for [|$dsTesting] -> $r {
my $res = [|$trTitanic.classify( $r<passengerClass passengerSex passengerAge>, prop => 'Probs' ), Actual => $r<passengerSurvival>].Hash;
merge-hash( { died => 0, survived => 0}, $res)
}
```
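The last step, making sure both labels are present, can also be illustrated with plain hash assignment in core Raku. The probabilities hash below is hypothetical; it stands for a case in which only one label was returned for a record.

```raku
# A hypothetical probabilities hash that lacks the label "died".
my %probs    = survived => 1;
my %defaults = died => 0, survived => 0;

# Defaults first, actual values second: later pairs overwrite earlier ones.
my %full = %defaults, %probs, Actual => 'survived';

say %full<died>;       # 0
say %full<survived>;   # 1
```

With both labels always present, the later column selections over the classification results do not have to deal with missing keys.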

Here we obtain the range of the probabilities for the label “survived”:

```
my @vals = flatten(select-columns(@clRes, 'survived')>>.values);
(min(@vals), max(@vals))
```

`# (0 1)`

Here we make a list of decision thresholds:

```
my @thRange = min(@vals), min(@vals) + (max(@vals)-min(@vals))/30 ... max(@vals);
records-summary(@thRange)
```

```
# +-------------------------------+
# | numerical |
# +-------------------------------+
# | Max => 0.9999999999999999 |
# | Min => 0 |
# | Mean => 0.5000000000000001 |
# | 3rd-Qu => 0.7666666666666666 |
# | 1st-Qu => 0.2333333333333333 |
# | Median => 0.49999999999999994 |
# +-------------------------------+
```
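The summary shows a maximum slightly below 1, which is consistent with floating-point rounding when the step is added repeatedly. A possible alternative, sketched here with assumed values for the range and the number of steps, is to compute each threshold from its index:

```raku
my ($min, $max, $n) = 0, 1, 30;   # assumed range and number of steps

# Compute each threshold from its index instead of by repeated addition.
my @thresholds = (0..$n).map(-> $i { $min + $i * ($max - $min) / $n });

say @thresholds.elems;   # 31
say @thresholds.tail;    # 1
```

This way the end points of the threshold list coincide with the minimum and maximum exactly.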

In the following code cell, for each threshold:

- For each classification hash, decide on “survived” if the corresponding value is greater than or equal to the threshold
- Make the threshold’s ROC-hash

```
my @rocs = @thRange.map(-> $th { to-roc-hash('survived', 'died',
select-columns(@clRes, 'Actual')>>.values.flat,
select-columns(@clRes, 'survived')>>.values.flat.map({ $_ >= $th ?? 'survived' !! 'died' })) });
```

`# [{FalseNegative => 0, FalsePositive => 156, TrueNegative => 0, TruePositive => 106} {FalseNegative => 0, FalsePositive => 148, TrueNegative => 8, TruePositive => 106} .]`
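The thresholding step can be spelled out with a small self-contained sketch in core Raku. The scored records and the helper `confusion-counts` are hypothetical and made up for this illustration; in the document itself the counting is done by `to-roc-hash`.

```raku
# Hypothetical (actual label, estimated probability of "survived") pairs.
my @scored = ('survived', 0.9), ('survived', 0.4), ('died', 0.6), ('died', 0.1);

# Hypothetical helper: count the four outcomes for one threshold.
sub confusion-counts(@scored, $th) {
    my %counts = TruePositive => 0, FalsePositive => 0, TrueNegative => 0, FalseNegative => 0;
    for @scored -> ($actual, $prob) {
        # Decide on "survived" if the probability reaches the threshold.
        my $predicted = $prob >= $th ?? 'survived' !! 'died';
        my $key = ($predicted eq $actual ?? 'True' !! 'False')
                ~ ($predicted eq 'survived' ?? 'Positive' !! 'Negative');
        %counts{$key}++;
    }
    %counts
}

for 0, 0.5, 1 -> $th {
    my %c = confusion-counts(@scored, $th);
    say "$th: TP=%c<TruePositive> FP=%c<FalsePositive> TN=%c<TrueNegative> FN=%c<FalseNegative>";
}
# 0: TP=2 FP=2 TN=0 FN=0
# 0.5: TP=1 FP=1 TN=1 FN=1
# 1: TP=0 FP=0 TN=2 FN=2
```

As in the output above, the lowest threshold labels everything as “survived”, so there are no negatives at all, and raising the threshold moves counts from the positive to the negative columns.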

Here is the obtained ROC-hash table:

`to-pretty-table(@rocs)`

```
# +---------------+---------------+--------------+--------------+
# | FalsePositive | FalseNegative | TrueNegative | TruePositive |
# +---------------+---------------+--------------+--------------+
# | 156 | 0 | 0 | 106 |
# | 148 | 0 | 8 | 106 |
# | 137 | 2 | 19 | 104 |
# | 104 | 9 | 52 | 97 |
# | 97 | 10 | 59 | 96 |
# | 72 | 13 | 84 | 93 |
# | 72 | 13 | 84 | 93 |
# | 55 | 15 | 101 | 91 |
# | 46 | 19 | 110 | 87 |
# | 42 | 23 | 114 | 83 |
# | 33 | 28 | 123 | 78 |
# | 25 | 36 | 131 | 70 |
# | 22 | 39 | 134 | 67 |
# | 22 | 39 | 134 | 67 |
# | 18 | 40 | 138 | 66 |
# | 18 | 40 | 138 | 66 |
# | 10 | 51 | 146 | 55 |
# | 10 | 51 | 146 | 55 |
# | 4 | 54 | 152 | 52 |
# | 3 | 57 | 153 | 49 |
# | 3 | 57 | 153 | 49 |
# | 3 | 57 | 153 | 49 |
# | 3 | 57 | 153 | 49 |
# | 3 | 57 | 153 | 49 |
# | 3 | 57 | 153 | 49 |
# | 3 | 57 | 153 | 49 |
# | 3 | 60 | 153 | 46 |
# | 2 | 72 | 154 | 34 |
# | 2 | 72 | 154 | 34 |
# | 2 | 89 | 154 | 17 |
# | 2 | 89 | 154 | 17 |
# +---------------+---------------+--------------+--------------+
```

Here is the corresponding ROC plot:

```
use Text::Plot;
text-list-plot(roc-functions('FPR')(@rocs), roc-functions('TPR')(@rocs),
width => 70, height => 25,
x-label => 'FPR', y-label => 'TPR' )
```

```
# +--+------------+-----------+-----------+-----------+------------+---+
# | |
# + * * * + 1.00
# | |
# | * * |
# | * * |
# | * |
# + * + 0.80
# | * |
# | |
# | * |
# | * * | T
# + + 0.60 P
# | * | R
# | * |
# | * |
# + * + 0.40
# | |
# | * |
# | |
# | |
# + + 0.20
# | * |
# | |
# +--+------------+-----------+-----------+-----------+------------+---+
# 0.00 0.20 0.40 0.60 0.80 1.00
# FPR
```

We can see that the trie classifier has reasonable predictive ability: we get ≈ 75% True Positive Rate (TPR) at a relatively small False Positive Rate (FPR), ≈ 20%.
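Those rates can be checked against the ROC-hash table with a few lines of core Raku. The three rows below are copied from the table above, in its column order; the variable names are chosen for this illustration.

```raku
# (FalsePositive, FalseNegative, TrueNegative, TruePositive) rows copied from the table above.
my @rows = (46, 19, 110, 87), (33, 28, 123, 78), (25, 36, 131, 70);

for @rows -> ($fp, $fn, $tn, $tp) {
    my $fpr = $fp / ($fp + $tn);
    my $tpr = $tp / ($tp + $fn);
    say "FPR: {$fpr.round(0.01)}, TPR: {$tpr.round(0.01)}";
}
# FPR: 0.29, TPR: 0.82
# FPR: 0.21, TPR: 0.74
# FPR: 0.16, TPR: 0.66
```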

Here is a ROC plot made with Mathematica (using a different Trie over Titanic data):

## Improvements

For simplicity the workflow above was kept “naive.” A better workflow would include:

- Stratified partitioning of training and testing data
- K-fold cross-validation
- Variable significance finding
- Specifically for Tries with frequencies: using different order of variables while constructing the trie

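**Remark:** K-fold cross-validation can be “simply” approximated by running this document multiple times using different random seeds. (Strictly speaking, that is repeated random sub-sampling validation rather than a K-fold partitioning.)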
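For comparison, here is a minimal sketch of a K-fold partition in core Raku. The record ids, the number of folds, and the variable names are hypothetical; a real workflow would train and evaluate a trie inside the loop.

```raku
srand(889);

my $k   = 5;
my @ids = (1..23).pick(*);   # hypothetical record ids, shuffled

# Deal the shuffled ids round-robin so that fold sizes differ by at most one.
my @folds = (^$k).map(-> $i {
    @ids.pairs.grep(-> $p { $p.key % $k == $i }).map(-> $p { $p.value }).Array
});

# Each fold serves once as testing data; the remaining folds form the training data.
for ^$k -> $i {
    my @test  = @folds[$i].list;
    my @train = (^$k).grep(-> $j { $j != $i }).map(-> $j { @folds[$j].Slip });
    say "fold $i: training {@train.elems}, testing {@test.elems}";
}
```

Every record is used for testing exactly once, which is what distinguishes this scheme from repeated random splits.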

## References

### Articles

[AA1] Anton Antonov, “A monad for classification workflows”, (2018), MathematicaForPrediction at WordPress.

[AA2] Anton Antonov, “Raku Text::CodeProcessing”, (2021), RakuForPrediction at WordPress.

[AA3] Anton Antonov, “Connecting Mathematica and Raku”, (2021), RakuForPrediction at WordPress.

[AA4] Anton Antonov, “Introduction to data wrangling with Raku”, (2021), RakuForPrediction at WordPress.

[AA5] Anton Antonov, “ML::TriesWithFrequencies”, (2022), RakuForPrediction at WordPress.

[AA6] Anton Antonov, “Data::Generators”, (2022), RakuForPrediction at WordPress.

[AA7] Anton Antonov, “ML::ROCFunctions”, (2022), RakuForPrediction at WordPress.

[AA8] Anton Antonov, “Text::Plot”, (2022), RakuForPrediction at WordPress.

[Wk1] Wikipedia entry, “Receiver operating characteristic”.

### Packages

[AAp1] Anton Antonov, Data::Generators Raku package, (2021), GitHub/antononcube.

[AAp2] Anton Antonov, Data::Reshapers Raku package, (2021), GitHub/antononcube.

[AAp3] Anton Antonov, Data::Summarizers Raku package, (2021), GitHub/antononcube.

[AAp4] Anton Antonov, ML::ROCFunctions Raku package, (2022), GitHub/antononcube.

[AAp5] Anton Antonov, Text::CodeProcessing Raku package, (2021), GitHub/antononcube.

[AAp6] Anton Antonov, Text::Plot Raku package, (2022), GitHub/antononcube.

[AAp7] Anton Antonov, ML::TriesWithFrequencies Raku package, (2021), GitHub/antononcube.

### Functions

[WRI1] Wolfram Research (2015), TakeDrop, Wolfram Language function, (updated 2015).

[WRI2] Wolfram Research (2007), Tally, Wolfram Language function.

Here is a notebook making more or less the same computations in Mathematica: https://community.wolfram.com/groups/-/m/t/2605320
