ML::TriesWithFrequencies

This blog post proclaims and describes the Raku package ML::TriesWithFrequencies, [AAp0], that has functions for creation and manipulation of Tries (Prefix trees) with frequencies.

The package provides Machine Learning (ML) functionalities, not “just” a Trie data structure.

This Raku implementation closely follows the Java implementation [AAp3].

The subset of functions with the prefix “trie-” follows the one used in the Mathematica package [AAp2]. That is the “top-level” sub-system of function names; the sub-system is follows the typical Object-Oriented Programming (OOP) Raku style.

Remark: Below Mathematica and Wolfram Language (WL) are used as synonyms.

Remark: There is a Raku package with an alternativeimplementation, [AAp6], made mostly for comparison studies. (See the implementation notes below.) The package in the repository, ML::TriesWithFrequencies, [AAp0], is my primary Tries-with-frequencies package.

Installation

Via zef-ecosystem:

zef install ML::TriesWithFrequencies

From GitHub:

zef install https://github.com/antononcube/Raku-ML-TriesWithFrequencies

Simple examples

Consider a trie (prefix tree) created over a list of words:

use ML::TriesWithFrequencies;
my $tr = trie-create-by-split( <bar bark bars balm cert cell> );
trie-say($tr);

# TRIEROOT => 6
# ├─b => 4
# │ └─a => 4
# │   ├─l => 1
# │   │ └─m => 1
# │   └─r => 3
# │     ├─k => 1
# │     └─s => 1
# └─c => 2
#   └─e => 2
#     ├─l => 1
#     │ └─l => 1
#     └─r => 1
#       └─t => 1

Here we convert the trie with frequencies above into a trie with probabilities:

my $ptr = trie-node-probabilities( $tr );
trie-say($ptr);

# TRIEROOT => 1
# ├─b => 0.6666666666666666
# │ └─a => 1
# │   ├─l => 0.25
# │   │ └─m => 1
# │   └─r => 0.75
# │     ├─k => 0.3333333333333333
# │     └─s => 0.3333333333333333
# └─c => 0.3333333333333333
#   └─e => 1
#     ├─l => 0.5
#     │ └─l => 1
#     └─r => 0.5
#       └─t => 1

Here we shrink the trie with probabilities above:

trie-say(trie-shrink($ptr));

# TRIEROOT => 1
# ├─ba => 0.6666666666666666
# │ ├─lm => 0.25
# │ └─r => 0.75
# │   ├─k => 0.3333333333333333
# │   └─s => 0.3333333333333333
# └─ce => 0.3333333333333333
#   ├─ll => 0.5
#   └─rt => 0.5

Here we retrieve a sub-trie with a key:

trie-say(trie-retrieve($ptr, 'bar'.comb))

# r => 0.75
# ├─k => 0.3333333333333333
# └─s => 0.3333333333333333

Here is a “dot-pipeline” that combines the steps above:

<bar bark bars balm cert cell>.&trie-create-by-split
.node-probabilities
.shrink
.retrieve(<ba r>)        
.form

# r => 0.75
# ├─k => 0.3333333333333333
# └─s => 0.3333333333333333

Remark: In the pipeline above we retrieve with <ba r>, not with <b a r>, because the trie is already shrunk.

The package provides a fair amount of functions in order to facilitate ML applications. In support of that statement, here are the methods of ML::TriesWithFrequencies::Trie:

ML::TriesWithFrequencies::Trie.^method_names

# (clone make merge insert create create-by-split node-probabilities leaf-probabilities leafQ position retrieve has-complete-match contains is-key shrink node-counts remove-by-threshold remove-by-pareto-fraction remove-by-regex select-by-threshold select-by-pareto-fraction select-by-regex root-to-leaf-paths words words-with-probabilities classify echo echo-function form trieRootLabel trieValueLabel getKey getValue getChildren setKey setValue setChildren toMapFormat hash WL toWLFormatRec XML toXMLFormatRec JSON toJSONFormatRec Str gist new key value children BUILDALL)

Neat example

Here is an example that shows how created and transformed versions of a trie are placed in a hash, and then that hash-of-tries is visualized in a table:

use ML::TriesWithFrequencies;
use Data::Reshapers;

my %tries;
<bar bark bars balm cert cell>.&trie-create-by-split
        .echo-function({%tries<created> = $_ })
        .node-probabilities
        .echo-function({ %tries<node-probabilities> = $_ })
        .shrink
        .echo-function({ %tries<shrunk> = $_ });


say to-pretty-table([%tries>>.form,], align => 'l');

# +----------------+---------------------------------+-------------------------------+
# | created        | node-probabilities              | shrunk                        |
# +----------------+---------------------------------+-------------------------------+
# | TRIEROOT => 6  | TRIEROOT => 1                   | TRIEROOT => 1                 |
# | ├─b => 4       | ├─b => 0.6666666666666666       | ├─ba => 0.6666666666666666    |
# | │ └─a => 4     | │ └─a => 1                      | │ ├─lm => 0.25                |
# | │   ├─l => 1   | │   ├─l => 0.25                 | │ └─r => 0.75                 |
# | │   │ └─m => 1 | │   │ └─m => 1                  | │   ├─k => 0.3333333333333333 |
# | │   └─r => 3   | │   └─r => 0.75                 | │   └─s => 0.3333333333333333 |
# | │     ├─k => 1 | │     ├─k => 0.3333333333333333 | └─ce => 0.3333333333333333    |
# | │     └─s => 1 | │     └─s => 0.3333333333333333 |   ├─ll => 0.5                 |
# | └─c => 2       | └─c => 0.3333333333333333       |   └─rt => 0.5                 |
# |   └─e => 2     |   └─e => 1                      |                               |
# |     ├─l => 1   |     ├─l => 0.5                  |                               |
# |     │ └─l => 1 |     │ └─l => 1                  |                               |
# |     └─r => 1   |     └─r => 0.5                  |                               |
# |       └─t => 1 |       └─t => 1                  |                               |
# +----------------+---------------------------------+-------------------------------+

Representation

Each trie is a tree of objects of the class ML::TriesWithFrequencies::Trie. Such trees can be nicely represented as hash-maps. For example:

my $tr = trie-shrink(trie-create-by-split(<core cort>));
say $tr.gist;

# {TRIEROOT => {TRIEVALUE => 2, cor => {TRIEVALUE => 2, e => {TRIEVALUE => 1}, t => {TRIEVALUE => 1}}}}

The function trie-say uses that Hash-representation:

trie-say($tr)

# TRIEROOT => 2
# └─cor => 2
#   ├─e => 1
#   └─t => 1

JSON

The JSON-representation follows the inherent object-tree representation with ML::TriesWithFrequencies::Trie:

say $tr.JSON;

# {"key":"TRIEROOT", "value":2, "children":[{"key":"cor", "value":2, "children":[{"key":"e", "value":1, "children":[]}, {"key":"t", "value":1, "children":[]}]}]}

XML

The XML-representation follows (resembles) the Hash-representation (and output from trie-say):

say $tr.XML;

# <TRIEROOT>
#  <TRIEVALUE>2</TRIEVALUE>
#  <cor>
#   <TRIEVALUE>2</TRIEVALUE>
#   <e>
#    <TRIEVALUE>1</TRIEVALUE>
#   </e>
#   <t>
#    <TRIEVALUE>1</TRIEVALUE>
#   </t>
#  </cor>
# </TRIEROOT>

Using the XML representation allows for XPath searches, say, using the package XML::XPath.
Here is an example:

use XML::XPath;
my $tr0 = trie-create-by-split(<bell best>);
trie-say($tr0);

# TRIEROOT => 2
# └─b => 2
#   └─e => 2
#     ├─l => 1
#     │ └─l => 1
#     └─s => 1
#       └─t => 1

Convert to XML:

say $tr0.XML;

# <TRIEROOT>
#  <TRIEVALUE>2</TRIEVALUE>
#  <b>
#   <TRIEVALUE>2</TRIEVALUE>
#   <e>
#    <TRIEVALUE>2</TRIEVALUE>
#    <l>
#     <TRIEVALUE>1</TRIEVALUE>
#     <l>
#      <TRIEVALUE>1</TRIEVALUE>
#     </l>
#    </l>
#    <s>
#     <TRIEVALUE>1</TRIEVALUE>
#     <t>
#      <TRIEVALUE>1</TRIEVALUE>
#     </t>
#    </s>
#   </e>
#  </b>
# </TRIEROOT>

Search for <b e l>:

say XML::XPath.new(xml=>$tr0.XML).find('//b/e/l');

# <l>
#     <TRIEVALUE>1</TRIEVALUE> 
#     <l>
#      <TRIEVALUE>1</TRIEVALUE> 
#     </l> 
#    </l>

WL

The Hash-representation is used in the Mathematica package [AAp2].
Hence, such WL format is provided by the Raku package:

say $tr.WL;

# <|$TrieRoot -> <|$TrieValue -> 2, "cor" -> <|$TrieValue -> 2, "e" -> <|$TrieValue -> 1|>, "t" -> <|$TrieValue -> 1|>|>|>|>

Cloning

All trie-* functions and ML::TriesWithFrequencies::Trie methods that manipulate tries produce trie clones.

For performance reasons I considered having in-place trie manipulations, but that, of course, confuses reasoning in development, testing, and usage. Hence, ubiquitous cloning.

Two styles of pipelining

As it was mentioned above the package was initially developed to have the functional programming design of the Mathematica package [AAp2]. With that design and using the feed operator ==> we can construct pipelines like this one:

my @words2 = <bar barman bask bell belly>;
my @words3 = <call car cast>;

trie-create-by-split(@words2)==>
trie-merge(trie-create-by-split(@words3))==>
trie-node-probabilities==>
trie-shrink==>
trie-say

# TRIEROOT => 1
# ├─b => 0.625
# │ ├─a => 0.6
# │ │ ├─r => 0.6666666666666666
# │ │ │ └─man => 0.5
# │ │ └─sk => 0.3333333333333333
# │ └─ell => 0.4
# │   └─y => 0.5
# └─ca => 0.375
#   ├─ll => 0.3333333333333333
#   ├─r => 0.3333333333333333
#   └─st => 0.3333333333333333

The package also supports “dot pipelining” through chaining of methods:

@words2.&trie-create-by-split
        .merge(@words3.&trie-create-by-split)
        .node-probabilities
        .shrink
        .form

# TRIEROOT => 1
# ├─b => 0.625
# │ ├─a => 0.6
# │ │ ├─r => 0.6666666666666666
# │ │ │ └─man => 0.5
# │ │ └─sk => 0.3333333333333333
# │ └─ell => 0.4
# │   └─y => 0.5
# └─ca => 0.375
#   ├─ll => 0.3333333333333333
#   ├─r => 0.3333333333333333
#   └─st => 0.3333333333333333

Remark: The trie-* functions are implemented through the methods of ML::TriesWithFrequencies::Trie. Given the method the corresponding function is derived by adding the prefix trie-. (For example, $tr.shrink vs trie-shrink($tr).)

Here is the previous pipeline re-written to use only methods of ML::TriesWithFrequencies::Trie:

{perl6, eval=FALSE} ML::TriesWithFrequencies::Trie.create-by-split(@words2) .merge(ML::TriesWithFrequencies::Trie.create-by-split(@words3)) .node-probabilities .shrink .form

Implementation notes

UML diagram

Here is a UML diagram that shows package’s structure:

The PlantUML spec and diagram were obtained with the CLI script to-uml-spec of the package “UML::Translators”, [AAp7].

Here we get the PlantUML spec:

to-uml-spec ML::TriesWithFrequencies > ./resources/class-diagram.puml

Here get the diagram:

to-uml-spec ML::TriesWithFrequencies | java -jar ~/PlantUML/plantuml-1.2022.5.jar -pipe > ./resources/class-diagram.png

Performance

This package is a Raku re-implementation of the Java Trie package [AAp3].

The initial implementation was:

≈ 5-6 times slower than the Mathematica implementation [AAp2]
≈ 100 times slower than the Java implementation [AAp3]

The initial implementation used:

General types for Trie nodes, i.e. Str for the key and Numeric for the value
Argument type verification with where statements in the signatures of the trie-* functions

After reading [RAC1] I refactored the code to use native types (num, str) and moved the where verifications inside the functions.

I also refactored the function trie-merge to use less copying of data and to take into account which of the two tries has smaller number of children.

After those changes the current Raku implementation is:

≈ 2.5 times slower than the Mathematica implementation [AAp2]
≈ 40 times slower than the Java implementation [AAp3]

After the (monumental) work on the new MoarVM dispatch mechanism, [JW1], was incorporated in standard Rakudo releases (September/October 2021) additional 20% speed-up was obtained. Currently this package is:

≈ 2.0 times slower than the Mathematica implementation [AAp2]
≈ 30 times slower than the Java implementation [AAp3]

These speed improvements are definitely not satisfactory. I strongly consider:

Re-implementing in Raku the Mathematica package [AAp2], i.e. to move into Tries that are hashes.
- (It turned out option 1 does not produce better results; see [AAp6].)
Re-implementing in C or C++ the Java package [AAp3] and hooking it up to Raku.

Moving from FP design and OOP design

The initial versions of the package – up to version 0.5.0 – had exported functions only in the namespace ML::TriesWithFrequencies with the prefix trie-. Those functions came from a purely Functional Programming (FP) design.

In order to get chains of OOP methods application that are typical in Raku programming the package versions with version 0.6.0 and later have trie manipulation transformation methods in the class ML::TriesWithFrequencies::Trie.

In order to get trie-class methods a fairly fundamental code refactoring was required. Here are the steps:

The old class ML::TriesWithFrequencies::Trie was made into the role ML::TriesWithFrequencies::Trieish.
The traversal and remover classes were made to use ML::TriesWithFrequencies::Trieish type instead of ML::TriesWithFrequencies::Trie.
The trie functions implementations – with the prefix “trie-” – of ML::TriesWithFrequencies were moved as methods implementations in ML::TriesWithFrequencies::Trie.
The trie functions in ML::TriesWithFrequencies were reimplemented using the methods of ML::TriesWithFrequencies::Trie.