AKA “Fairly fast and accurate trie-based classifier of DSL commands”
Introduction
In this (computational Markdown) document we show how to derive with Raku a fast, accurate, and compact Machine Learning (ML) classifier that classifies natural language commands made within the Domain Specific Languages (DSLs) of a certain set of computational ML workflows.
For example, such a classifier should classify the command “calculate document term matrix” as a Latent Semantic Analysis (LSA) workflow command. (And, say, give it the label “LatentSemanticAnalysis”.)
The primary motivation for making a DSL-classifier is to speed up the parsing of specifications that belong to a (somewhat) large collection of computational workflow DSLs. (For example, the Raku package [AAp5] has twelve workflows.)
Remark: Such a classifier is used in the Mathematica package provided by the “NLP Template Engine” project, [AAr2, AAv1].
Remark: This article can be seen as an extension of the article “Trie-based classifiers evaluation”, [AA2].
General classifier making workflow
Here is a mind-map that summarizes the methodology of ML classifier making, [AA1]:
Big picture flow chart
Here is a “big picture” flow-chart that encompasses the procedures outlined and implemented in this document:
Here is a narration of the flow chart:
- Get a set of computational workflows as an input
- If the textual data is sufficiently large:
  - Make a classifier
  - Evaluate classifier’s measurements
  - If the classifier is good enough then export it
    - Finish
  - Else:
    - Go to Step 2
- Else:
  - If specifications can be automatically generated:
    - Generate specifications and store them in a database
  - Else:
    - Manually write specifications and store them in a database
  - Go to Step 2
DSL specifications
Here are examples of computational DSL specifications for the workflows Classification, Latent Semantic Analysis, and Quantile Regression:
ToDSLCode WL "
DSL MODULE Classification;
use the dataset dfGoods;
split data with ratio 0.8;
make a logistic regression classifier;
show accuracy, precision;
"
ToDSLCode R "
DSL MODULE LatentSemanticAnalysis;
use aDocs;
create document-term matrix;
apply LSI functions IDF, Frequency, and Cosine;
extract 36 topics with the method NNMF and max steps 12;
show topics table
"
ToDSLCode R "
DSL MODULE QuantileRegression;
use dfStocksVolume;
summarize data;
compute quantile regression with 30 knots and order 2;
show date list plot
"
Problem formulation
Definition: We refer to the number of characters a parser could not parse as parsing residual.
Definition: If the parsing residual is 0 then we say that the parser “exhausted the specification” or “parsed the specification completely.”
Assumptions: It is assumed that:
- We have two or more DSL parsers.
- For each parser we can obtain a parsing residual.
Problem: For a given DSL specification, order the available DSL parsers according to how likely each of them is to parse the given DSL specification completely.
Procedure outlines
In this section we outline:
- The brute force DSL parsing procedure
- The modification of the brute force procedure by using a DSL-classifier
- The derivation of a DSL-classifier
- Possible applications of Association Rule Learning algorithms
Inputs
- A computational DSL specification
- A list of available DSL parsers
Brute force DSL parsing
1. Randomly shuffle the available DSL parsers.
2. Attempt parsing with each of the available DSL parsers.
3. If any parser gives a zero residual then stop the loop and use that parser as “the work parser.”
4. The parser that gives the smallest residual is chosen as “the work parser.”
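The brute-force procedure can be sketched as follows. The article's implementation is in Raku; this is a minimal Python illustration in which the parser functions and their vocabularies are hypothetical stand-ins that return a parsing residual.

```python
# Python sketch of the brute-force parser selection; the parsers here
# are toy functions returning a parsing residual (characters not parsed).
import random

def pick_work_parser(spec, parsers):
    """Shuffle the parsers, try each, stop early on a zero residual;
    otherwise return the parser with the smallest residual."""
    order = list(parsers.items())
    random.shuffle(order)
    best_name, best_residual = None, None
    for name, parse in order:
        residual = parse(spec)
        if residual == 0:
            return name  # the spec was parsed completely
        if best_residual is None or residual < best_residual:
            best_name, best_residual = name, residual
    return best_name

# Toy "parser": residual = total length of words outside its vocabulary.
def make_parser(vocabulary):
    def parse(spec):
        words = spec.lower().replace(',', ' ').split()
        return sum(len(w) for w in words if w not in vocabulary)
    return parse

parsers = {
    'QuantileRegression': make_parser({'compute', 'quantile', 'regression'}),
    'LatentSemanticAnalysis': make_parser({'create', 'document-term', 'matrix'}),
}
print(pick_work_parser('compute quantile regression', parsers))
# QuantileRegression
```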
Parsing with the help of a DSL-classifier
- Apply the DSL classifier to the given spec and order the DSL parsers according to the obtained classification probabilities.
- Do the “Brute force DSL parsing” steps 2, 3, and 4.
Derivation of a DSL-classifier
- For each of the DSLs generate at least a few hundred random commands using their grammars.
- Label each command with the DSL it was generated with.
- Export to a JSON file and / or CSV file.
- Ingest the DSL commands data into a hash (dictionary or association).
- Do basic data analysis:
  - Summarize the textual data
  - Split the commands into words
  - Remove stop words, random words, words with (too many) special symbols
  - Find, summarize, and display word frequencies
- Split the data into training and testing parts:
  - Do stratified splitting, per label.
- Turn each command into a trie phrase:
  - Split the command into words
  - Keep frequent enough words (as found in the data-analysis step)
  - Sort the words and append the DSL label
- Make a trie with the trie phrases of the training data part.
- Evaluate the trie classifier over the trie phrases of the testing data part.
- Show classification success rates and the confusion matrix.
Remark: The trie classifiers are made with the Raku package “ML::TriesWithFrequencies”, [AAp9].
Association rules study
- Create Association Rule Learning (ARL) baskets of words:
  - Put all words to lower case
  - Filter words using different criteria:
    - Remove stop words
    - Keep dictionary words
    - Remove words that have special symbols or are random strings
- Find frequent sets that include the DSL labels.
- Examine frequent sets.
- Using the frequent sets, create and evaluate a trie classifier.
Remark: The ARL algorithm Apriori can be implemented via tries. See, for example, the Apriori implementation in the Raku package “ML::AssociationRuleLearning”, [AAp7].
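To make the frequent-sets idea concrete, here is a simplified Python sketch of mining frequent word sets from baskets. It enumerates all sub-sets directly rather than doing Apriori's candidate-generation pruning, and the baskets below are illustrative, not from the article's data.

```python
# Simplified frequent-set mining over word baskets (no Apriori pruning):
# count every itemset of size 2..max_size and keep those above min_support.
from itertools import combinations
from collections import Counter

def frequent_sets(baskets, min_support, max_size=3):
    n = len(baskets)
    counts = Counter()
    for basket in baskets:
        items = sorted(set(basket))
        for k in range(2, max_size + 1):
            for combo in combinations(items, k):
                counts[frozenset(combo)] += 1
    return {s: c for s, c in counts.items() if c / n >= min_support}

baskets = [
    ['create', 'matrix', 'term', 'LatentSemanticAnalysis'],
    ['matrix', 'term', 'LatentSemanticAnalysis'],
    ['compute', 'quantile', 'QuantileRegression'],
]
freq = frequent_sets(baskets, min_support=0.5)
print(freq[frozenset({'matrix', 'term', 'LatentSemanticAnalysis'})])
# 2
```

Note that sorting each basket's items before counting is what makes a trie-based implementation natural: sorted itemsets share prefixes, so their counts can be stored along trie paths.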
Load packages
Here we load the Raku packages used below:
use ML::AssociationRuleLearning;
use ML::TriesWithFrequencies;
use Lingua::StopwordsISO;
use Data::Reshapers;
use Data::Summarizers;
use Data::Generators;
use Data::ExampleDatasets;
# (Any)
Remark: All packages are available at raku.land.
Obtain textual data
In this section we show how we obtain the textual data and do rudimentary pre-processing.
Read the text data – the labeled DSL commands – from a CSV file (using example-dataset from “Data::ExampleDatasets”, [AAp2]):
my @tbl = example-dataset('https://raw.githubusercontent.com/antononcube/NLP-Template-Engine/main/Data/RandomWorkflowCommands.csv');
@tbl.elems
# 5220
Show summary of the data (using records-summary from “Data::Summarizers”, [AAp4]):
records-summary(@tbl)
# +-------------------------------+---------------------------------------+
# | Workflow | Command |
# +-------------------------------+---------------------------------------+
# | RandomTabularDataset => 870 | summarize data => 27 |
# | QuantileRegression => 870 | summarize the data => 25 |
# | Classification => 870 | train => 16 |
# | Recommendations => 870 | graph => 14 |
# | LatentSemanticAnalysis => 870 | extract statistical thesaurus => 13 |
# | NeuralNetworkCreation => 870 | do quantile regression => 13 |
# | | drill => 13 |
# | | (Other) => 5099 |
# +-------------------------------+---------------------------------------+
Make a list of pairs:
my @wCommands = @tbl.map({ $_<Command> => $_<Workflow>}).List;
say @wCommands.elems
# 5220
Show a sample of the pairs:
srand(33);
.say for @wCommands.pick(12).sort
# compute profile for r98w0 , rkzbaou1g7 together with rkzbaou1g7 => Recommendations
# generate the recommender over the 0rl => Recommendations
# how many networks => NeuralNetworkCreation
# make a random-driven tabular data set for => RandomTabularDataset
# modify boolean variables into symbolic => Classification
# recommend over history y5g8v => Recommendations
# set decoder tokens => NeuralNetworkCreation
# set encoder Characters => NeuralNetworkCreation
# show classifier measurements test results classification threshold 742.444 of dahm7ip26g => Classification
# split the into 271.426 % of testing => Classification
# verify that FalseNegativeRate of tgvh is equal to 63.9506 => Classification
# what is the number of neural models => NeuralNetworkCreation
Remark: The labeled DSL commands ingested above were generated using the grammars of the project ConversationalAgents at GitHub, [AAr1], and the function GrammarRandomSentences of the Mathematica package “FunctionalParsers.m”, [AAp11].
Remark: Currently it is very hard to generate random sentences using grammars in Raku. That is also true for other grammar systems. (If that kind of functionality exists, it is usually added much later in the development phase.) I am hopeful that the Raku AST project is going to greatly facilitate grammar-based random sentence generation.
Word tallies
In this section we analyze the presence of words in the DSL commands. Here we get more than 80,000 English dictionary words (using the function random-word from “Data::Generators”, [AAp1]):
my %dictionaryWords = Set(random-word(Inf)>>.lc);
%dictionaryWords.elems
# 83599
Remark: The set %dictionaryWords is most likely a subset of the generally “known English words.” (And in this document we are fine with that.)
Here we:
- Split into words the key (i.e. command) of each of the data pairs
- Flatten into one list of words
- Trim each word and turn into lower case
- Find word tallies (using the function tally from “Data::Summarizers”, [AAp4])
my %wordTallies = @wCommands>>.key.map({ $_.split(/ \s | ',' /) }).&flatten>>.trim>>.lc.&tally;
%wordTallies.elems
# 4090
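The tokenize-and-tally step above (split on whitespace and commas, trim, lowercase, count) can be sketched in Python with a Counter; the Raku code in the article does the same with `split`, `trim`, `lc`, and `tally`. The example commands are illustrative.

```python
# Python sketch of the word-tally step: split each command on whitespace
# and commas, trim and lowercase each token, and tally the tokens.
import re
from collections import Counter

def word_tallies(commands):
    tallies = Counter()
    for command in commands:
        for word in re.split(r'[\s,]+', command):
            word = word.strip().lower()
            if word:
                tallies[word] += 1
    return tallies

tallies = word_tallies(['summarize data', 'summarize the data'])
print(tallies['summarize'], tallies['data'])
# 2 2
```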
Show summary of the word tallies:
records-summary(%wordTallies.values.List)
# +---------------------+
# | numerical |
# +---------------------+
# | 3rd-Qu => 2 |
# | Mean => 11.189731 |
# | Max => 2828 |
# | 1st-Qu => 1 |
# | Median => 1 |
# | Min => 1 |
# +---------------------+
Here we filter the word tallies, keeping only words that:
- Have frequency ten or higher
- Have at least two characters
- Are dictionary words
- Are not English stop words (using the function stopwords-iso from “Lingua::StopwordsISO”, [AAp6])
my %wordTallies2 = %wordTallies.grep({ $_.value ≥ 10 && $_.key.chars > 1 && $_.key (elem) %dictionaryWords && $_.key !(elem) stopwords-iso('English')});
%wordTallies2.elems
# 173
Instead of checking for dictionary words – or in conjunction – we can filter the word tallies to be only with words that are made of letters and dashes:
my %wordTallies3 = %wordTallies2.grep({ $_.key ~~ / ^ [<:L> | '-']+ $ /});
%wordTallies3.elems
# 172
Here we tabulate the most frequent words (in descending order):
my @tbls = do for %wordTallies3.pairs.sort(-*.value).rotor(40) { to-pretty-table(transpose([$_>>.key, $_>>.value])
.map({ %(<word count>.Array Z=> $_.Array) }), align => 'l', field-names => <word count>).Str }
to-pretty-table([%( ^@tbls.elems Z=> @tbls),], field-names => (0 ..^ @tbls.elems)>>.Str, align => 'l', :!header, vertical-char => ' ', horizontal-char => ' ');
# + + + + +
# +-------------+-------+ +----------------+-------+ +----------------+-------+ +-------------+-------+
# | word | count | | word | count | | word | count | | word | count |
# +-------------+-------+ +----------------+-------+ +----------------+-------+ +-------------+-------+
# | data | 1432 | | workflow | 121 | | latent | 54 | | dependent | 26 |
# | tabular | 513 | | recommendation | 120 | | extend | 52 | | cosine | 26 |
# | set | 485 | | normal | 120 | | generator | 52 | | curve | 26 |
# | create | 444 | | form | 118 | | verify | 51 | | class | 25 |
# | generate | 416 | | term | 112 | | fit | 48 | | input | 24 |
# | pipeline | 341 | | variable | 110 | | summary | 45 | | count | 24 |
# | frame | 331 | | history | 109 | | step | 44 | | method | 23 |
# | values | 330 | | partition | 107 | | plot | 43 | | basis | 23 |
# | context | 329 | | semantic | 106 | | add | 43 | | minutes | 22 |
# | display | 256 | | thesaurus | 105 | | repository | 39 | | inverse | 22 |
# | names | 254 | | explain | 100 | | assert | 38 | | false | 21 |
# | matrix | 243 | | wide | 99 | | rescale | 38 | | maximum | 21 |
# | layer | 242 | | filter | 95 | | reduction | 37 | | minute | 20 |
# | neural | 226 | | network | 94 | | modify | 36 | | temporal | 19 |
# | random | 225 | | recommend | 93 | | sum | 36 | | total | 19 |
# | max | 224 | | load | 87 | | image | 35 | | sequence | 19 |
# | profile | 215 | | model | 87 | | remove | 35 | | steps | 19 |
# | compute | 191 | | format | 86 | | drill | 35 | | hours | 19 |
# | train | 183 | | extract | 85 | | terms | 34 | | categorical | 18 |
# | standard | 177 | | transform | 84 | | symbolic | 34 | | element | 18 |
# | arbitrary | 176 | | chain | 83 | | divide | 34 | | absolute | 18 |
# | batch | 174 | | function | 82 | | tabulate | 33 | | map | 17 |
# | size | 173 | | frequency | 82 | | binary | 33 | | audio | 17 |
# | calculate | 172 | | current | 81 | | chart | 33 | | hour | 17 |
# | randomized | 169 | | decoder | 80 | | reduce | 33 | | collection | 16 |
# | regression | 167 | | list | 78 | | classification | 32 | | validating | 16 |
# | loss | 166 | | ensemble | 77 | | histogram | 31 | | fraction | 16 |
# | classifier | 163 | | entropy | 74 | | ingest | 30 | | squared | 16 |
# | assign | 160 | | initialize | 74 | | idf | 30 | | true | 15 |
# | column | 158 | | object | 73 | | boolean | 30 | | validation | 15 |
# | driven | 158 | | dimension | 70 | | interpolation | 30 | | ramp | 14 |
# | chance | 158 | | analysis | 69 | | roc | 30 | | probability | 14 |
# | min | 157 | | summarize | 65 | | equal | 30 | | testing | 14 |
# | time | 147 | | retrieve | 64 | | moving | 30 | | scalar | 14 |
# | consumption | 146 | | percent | 61 | | characteristic | 30 | | ctc | 14 |
# | echo | 138 | | apply | 61 | | receiver | 30 | | synonym | 13 |
# | document | 133 | | cross | 59 | | operating | 30 | | measurement | 13 |
# | item | 130 | | normalization | 59 | | axis | 28 | | day | 13 |
# | word | 128 | | graph | 56 | | degree | 27 | | label | 12 |
# | series | 124 | | statistical | 55 | | split | 27 | | naive | 12 |
# +-------------+-------+ +----------------+-------+ +----------------+-------+ +-------------+-------+
# + + + + +
Data split
In this section we split the data into training and testing parts. The split is stratified per DSL.
Here we:
- Categorize the DSL commands according to their DSL label
- Tabulate the corresponding number of commands per label
srand(83);
my %splitGroups = @wCommands.categorize({ $_.value });
to-pretty-table([%splitGroups>>.elems,])
# +--------------------+-----------------------+----------------------+-----------------+----------------+------------------------+
# | QuantileRegression | NeuralNetworkCreation | RandomTabularDataset | Recommendations | Classification | LatentSemanticAnalysis |
# +--------------------+-----------------------+----------------------+-----------------+----------------+------------------------+
# | 870 | 870 | 870 | 870 | 870 | 870 |
# +--------------------+-----------------------+----------------------+-----------------+----------------+------------------------+
Here each category is randomly shuffled and split into training and testing parts with the ratio 0.75 (using the function take-drop from “Data::Reshapers”, [AAp3]); the corresponding numbers of elements are tabulated:
my %split = %splitGroups.map( -> $g { $g.key => %( ['training', 'testing'] Z=> take-drop($g.value.pick(*), 0.75)) });
to-pretty-table(%split.map({ $_.key => $_.value>>.elems }))
# +------------------------+----------+---------+
# | | training | testing |
# +------------------------+----------+---------+
# | Classification | 653 | 217 |
# | LatentSemanticAnalysis | 653 | 217 |
# | NeuralNetworkCreation | 653 | 217 |
# | QuantileRegression | 653 | 217 |
# | RandomTabularDataset | 653 | 217 |
# | Recommendations | 653 | 217 |
# +------------------------+----------+---------+
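The stratified split can be sketched in Python as follows (the article uses Raku's take-drop per label group). The seed and the tiny two-label dataset are hypothetical, chosen only to make the per-label 75/25 split visible.

```python
# Python sketch of a stratified split: group (command, label) pairs by
# label, shuffle each group, and take 75% of each group for training.
import random
from collections import defaultdict

def stratified_split(pairs, ratio=0.75, seed=83):
    rng = random.Random(seed)
    groups = defaultdict(list)
    for command, label in pairs:
        groups[label].append((command, label))
    split = {'training': [], 'testing': []}
    for label, group in groups.items():
        rng.shuffle(group)
        k = round(len(group) * ratio)
        split['training'] += group[:k]
        split['testing'] += group[k:]
    return split

# 8 commands per label -> 6 training and 2 testing per label.
pairs = [(f'cmd {i}', lbl) for lbl in ('QR', 'LSA') for i in range(8)]
split = stratified_split(pairs)
print(len(split['training']), len(split['testing']))
# 12 4
```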
Here we aggregate the training and testing parts for each category and show the corresponding sizes:
my %split2;
for %split.kv -> $k, $v {
%split2<training> = %split2<training>.append(|$v<training>);
%split2<testing> = %split2<testing>.append(|$v<testing>);
};
%split2>>.elems
# {testing => 1302, training => 3918}
Here we show a sample of commands from the training part:
.raku.say for %split2<training>.pick(6)
# "set encoder Image by zvh zvh zvh zvh" => "NeuralNetworkCreation"
# "set decoder tokens" => "NeuralNetworkCreation"
# "GatedRecurrentLayer [ Total ]" => "NeuralNetworkCreation"
# "long short term memory layer for 593.498" => "NeuralNetworkCreation"
# "create a standard pipeline" => "QuantileRegression"
# "extend the recommended items using 21i9fg" => "Recommendations"
Trie creation
Here we obtain the unique DSL commands labels:
my @labels = unique(@wCommands>>.value)
# [Classification LatentSemanticAnalysis NeuralNetworkCreation QuantileRegression RandomTabularDataset Recommendations]
Here we make a “known words” set using the “frequent enough” words of the training data:
%wordTallies = %split2<training>>>.key.map({ $_.split(/ \s | ',' /) }).&flatten>>.trim>>.lc.&tally;
%wordTallies2 = %wordTallies.grep({ $_.value ≥ 12 && $_.key.chars > 1 && $_.key !(elem) stopwords-iso('English')});
%wordTallies3 = %wordTallies2.grep({ $_.key ~~ / ^ [<:L> | '-']+ $ /});
%wordTallies3.elems
# 216
my %knownWords = Set(%wordTallies3);
%knownWords.elems
# 216
Here we define a sub that converts commands into trie-phrases:
multi make-trie-basket(Str $command, %knownWords) {
$command.split(/\s | ','/)>>.trim>>.lc.grep({ $_ (elem) %knownWords }).unique.sort.Array
}
multi make-trie-basket(Pair $p, %knownWords) {
make-trie-basket($p.key, %knownWords).append($p.value)
}
# &make-trie-basket
Here is an example invocation of make-trie-basket:
my $rb = %split2<training>.pick;
say $rb.raku;
say make-trie-basket($rb, %knownWords).raku;
# "moving map gfb for the 154.119 , and 122.428 together with 122.428 and 122.428 weights" => "QuantileRegression"
# ["map", "moving", "QuantileRegression"]
Here we convert all training data commands into trie-phrases:
my $tStart = now;
my @training = %split2<training>.map({ make-trie-basket($_, %knownWords) }).Array;
say "Time to process training commands: {now - $tStart}."
# Time to process training commands: 0.331982502.
Here we make the trie:
$tStart = now;
my $trDSL = @training.&trie-create.node-probabilities;
say "Time to make the DSL trie: {now - $tStart}."
# Time to make the DSL trie: 0.475193574.
Here are the trie node counts:
$trDSL.node-counts
# {Internal => 5318, Leaves => 1804, Total => 7122}
Here is an example classification of a command:
$trDSL.classify(make-trie-basket('show the outliers', %knownWords), prop => 'Probabilities'):!verify-key-existence
# {Classification => 0.6285714285714286, QuantileRegression => 0.3714285714285715}
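To illustrate the mechanics behind this classification, here is a strongly simplified Python sketch of a frequency trie classifier: trie phrases (sorted known words with the DSL label appended) are inserted with counts, and a query walks down the trie along its sorted words, then reads label probabilities from the counts below the reached node. This mirrors the idea of ML::TriesWithFrequencies, not its actual API or algorithmic details.

```python
# Minimal frequency-trie classifier sketch: nodes hold counts; labels
# sit at the ends of inserted phrases.
from collections import defaultdict

def make_trie():
    return {'count': 0, 'children': {}}

def trie_insert(trie, phrase):
    node = trie
    node['count'] += 1
    for item in phrase:
        node = node['children'].setdefault(item, make_trie())
        node['count'] += 1

def label_counts(node, labels, counts):
    # Collect counts of label nodes anywhere below `node`.
    for key, child in node['children'].items():
        if key in labels:
            counts[key] += child['count']
        else:
            label_counts(child, labels, counts)

def trie_classify(trie, words, labels):
    node = trie
    for w in sorted(words):
        if w in node['children']:  # skip words absent at this level
            node = node['children'][w]
    counts = defaultdict(int)
    label_counts(node, labels, counts)
    total = sum(counts.values()) or 1
    return {lbl: c / total for lbl, c in counts.items()}

labels = {'QuantileRegression', 'Classification'}
trie = make_trie()
trie_insert(trie, ['outliers', 'show', 'QuantileRegression'])
trie_insert(trie, ['accuracy', 'show', 'Classification'])
print(trie_classify(trie, ['show', 'outliers'], labels))
# {'QuantileRegression': 1.0}
```

Because phrases are sorted before insertion, commands that share frequent words share trie prefixes, which is what keeps the trie compact.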
Confusion matrix
In this section we put together the confusion matrix of the derived trie classifier over the testing data.
First we define a sub that gives the actual and predicted DSL-labels for a given training rule:
sub make-cf-couple(Pair $p) {
my $query = make-trie-basket($p.key, %knownWords);
my $lbl = $trDSL.classify($query, :!verify-key-existence);
%(actual => $p.value, predicted => ($lbl ~~ Str) ?? $lbl !! 'NA', command => $p.key)
}
# &make-cf-couple
Here we classify all commands in the testing data part:
my $tStart = now;
my @actualPredicted = %split2<testing>.map({ make-cf-couple($_) }).Array;
my $tEnd = now;
say "Total time to classify {%split2<testing>.elems} tests with the DSL trie: {$tEnd - $tStart}.";
say "Time per classification: {($tEnd - $tStart)/@actualPredicted.elems}."
# Total time to classify 1302 tests with the DSL trie: 1.023266605.
# Time per classification: 0.0007859190514592935.
Here is the confusion matrix (using cross-tabulate of “Data::Reshapers”, [AAp3]):
my $ct = cross-tabulate(@actualPredicted, "actual", "predicted");
to-pretty-table($ct, field-names=>@labels.sort.Array.append('NA'))
# +------------------------+----------------+------------------------+-----------------------+--------------------+----------------------+-----------------+----+
# | | Classification | LatentSemanticAnalysis | NeuralNetworkCreation | QuantileRegression | RandomTabularDataset | Recommendations | NA |
# +------------------------+----------------+------------------------+-----------------------+--------------------+----------------------+-----------------+----+
# | Classification | 162 | 14 | | 10 | 15 | 9 | 7 |
# | LatentSemanticAnalysis | 1 | 182 | | 6 | 10 | 15 | 3 |
# | NeuralNetworkCreation | 1 | | 195 | 5 | 1 | 5 | 10 |
# | QuantileRegression | 16 | 10 | 5 | 164 | 9 | 12 | 1 |
# | RandomTabularDataset | | | | 1 | 216 | | |
# | Recommendations | | 9 | 1 | 8 | 2 | 186 | 11 |
# +------------------------+----------------+------------------------+-----------------------+--------------------+----------------------+-----------------+----+
Here are the corresponding fractions:
my $ct2 = $ct.map({ $_.key => $_.value <</>> $_.value.values.sum });
to-pretty-table($ct2, field-names=>@labels.sort.Array.append('NA'))
# +------------------------+----------------+------------------------+-----------------------+--------------------+----------------------+-----------------+----------+
# | | Classification | LatentSemanticAnalysis | NeuralNetworkCreation | QuantileRegression | RandomTabularDataset | Recommendations | NA |
# +------------------------+----------------+------------------------+-----------------------+--------------------+----------------------+-----------------+----------+
# | Classification | 0.746544 | 0.064516 | | 0.046083 | 0.069124 | 0.041475 | 0.032258 |
# | LatentSemanticAnalysis | 0.004608 | 0.838710 | | 0.027650 | 0.046083 | 0.069124 | 0.013825 |
# | NeuralNetworkCreation | 0.004608 | | 0.898618 | 0.023041 | 0.004608 | 0.023041 | 0.046083 |
# | QuantileRegression | 0.073733 | 0.046083 | 0.023041 | 0.755760 | 0.041475 | 0.055300 | 0.004608 |
# | RandomTabularDataset | | | | 0.004608 | 0.995392 | | |
# | Recommendations | | 0.041475 | 0.004608 | 0.036866 | 0.009217 | 0.857143 | 0.050691 |
# +------------------------+----------------+------------------------+-----------------------+--------------------+----------------------+-----------------+----------+
Here is the diagonal of the confusion matrix:
to-pretty-table( @labels.map({ $_ => $ct2.Hash{$_;$_} }) )
# +------------------------+----------+
# | | 0 |
# +------------------------+----------+
# | Classification | 0.746544 |
# | LatentSemanticAnalysis | 0.838710 |
# | NeuralNetworkCreation | 0.898618 |
# | QuantileRegression | 0.755760 |
# | RandomTabularDataset | 0.995392 |
# | Recommendations | 0.857143 |
# +------------------------+----------+
By examining the confusion matrices we can conclude that the classifier is accurate enough. (We examine the diagonals of the matrices and the most frequent confusions.)
By examining the computational timings we conclude that the classifier is both accurate and fast enough.
Remark: In addition to the confusion matrix we can compute Top-K query statistics – not done here. (The Top-2 query statistic answers the question: “Is the expected label one of the top 2 most probable labels?”)
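The Top-K statistic is straightforward to compute from (actual label, predicted probabilities) records; here is a Python sketch with illustrative records, not the article's actual classification results.

```python
# Top-K accuracy: fraction of records whose actual label is among the
# K most probable predicted labels.
def top_k_accuracy(records, k=2):
    hits = 0
    for actual, probs in records:
        top = sorted(probs, key=probs.get, reverse=True)[:k]
        hits += actual in top
    return hits / len(records)

records = [
    ('QuantileRegression', {'QuantileRegression': 0.3, 'Classification': 0.6}),
    ('Classification', {'Classification': 0.9, 'Recommendations': 0.1}),
]
print(top_k_accuracy(records, k=1), top_k_accuracy(records, k=2))
# 0.5 1.0
```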
Here we show a sample of confused (misclassified) commands:
srand(883);
to-pretty-table(@actualPredicted.grep({ $_<actual> ne $_<predicted> }).pick(12).sort({ $_<command> }), field-names=><actual predicted command>, align=>'l')
# +------------------------+------------------------+--------------------------------------------------------------+
# | actual | predicted | command |
# +------------------------+------------------------+--------------------------------------------------------------+
# | Classification | LatentSemanticAnalysis | add into context as xrne |
# | LatentSemanticAnalysis | QuantileRegression | compute 774.213 topics using SVD , and 386.111 maximum steps |
# | LatentSemanticAnalysis | NA | consider the text m2qj6vu0as from fsce |
# | QuantileRegression | LatentSemanticAnalysis | display value for the context variable 1jd |
# | QuantileRegression | Recommendations | echo dates list plot |
# | Classification | QuantileRegression | generate workflow |
# | Classification | RandomTabularDataset | get data the 9u16pdk8h data for 5cb37hutm |
# | LatentSemanticAnalysis | RandomTabularDataset | get the data the 90azp3ge4 |
# | QuantileRegression | LatentSemanticAnalysis | load data with id bk8avr3xn |
# | Recommendations | NA | make |
# | Classification | Recommendations | put in context as yirvd3p |
# | QuantileRegression | Classification | summarize data |
# +------------------------+------------------------+--------------------------------------------------------------+
Remark: We observe that a certain proportion of the misclassified commands are ambiguous – they do not belong to only one DSL.
Association rules
In this section we go through the association rules finding outlined above.
Remark: The found frequent sets can be used for ML “feature engineering.” They can also be seen as a supplement or alternative to the ML classification “importance of variables” investigations.
Remark: We do not present trie classifier creation and accuracy results using frequent sets, but I can (bravely) declare that experiments with trie classifiers made from the words of the found frequent sets produce very similar results to the trie classifiers made with (simple) word tallies.
Here we process the “word baskets” made from the DSL commands and append corresponding DSL workflow labels:
my $tStart = now;
my @baskets = @wCommands.map({ ($_.key.split(/\s | ','/)>>.trim.grep({ $_.chars > 0 && $_ ~~ /<:L>+/ && $_ ∈ %dictionaryWords && $_ ∉ stopwords-iso('English')})).Array.append($_.value) }).Array;
say "Number of baskets: {@baskets.elems}";
say "Time to process baskets {now - $tStart}."
# Number of baskets: 5220
# Time to process baskets 17.622710602.
Here is a sample of the baskets:
.say for @baskets.pick(6)
# [transform symbolic numeric Classification]
# [echo data time series data graph QuantileRegression]
# [verify Classification]
# [compute moving average QuantileRegression]
# [calculate rescale add context time series data default step QuantileRegression]
# [partition LatentSemanticAnalysis]
Here is a summary of the basket sizes:
records-summary(@baskets>>.elems)
# +--------------------+
# | numerical |
# +--------------------+
# | Min => 1 |
# | 3rd-Qu => 5 |
# | Max => 37 |
# | 1st-Qu => 2 |
# | Mean => 4.081418 |
# | Median => 3 |
# +--------------------+
Here we find frequent sets of words (using the function frequent-sets from “ML::AssociationRuleLearning”, [AAp7]):
my $tStart = now;
my @freqSets = frequent-sets(@baskets.grep({ 3 < $_.elems }).Array, min-support => 0.005, min-number-of-items => 2, max-number-of-items => 6):counts;
say "\t\tNumber of frequent sets: {@freqSets.elems}.";
my $tEnd = now;
say "Timing: {$tEnd - $tStart}."
# Number of frequent sets: 5110.
# Timing: 138.428897789.
Here is a sample of the found frequent sets:
.say for @freqSets.pick(12)
# (form frame max tabular values) => 17
# (LatentSemanticAnalysis item matrix word) => 41
# (RandomTabularDataset chance data frame tabular values) => 12
# (create data random values) => 12
# (data max randomized set tabular) => 12
# (generate max names) => 18
# (matrix term) => 70
# (RandomTabularDataset data generate max min names) => 13
# (RandomTabularDataset arbitrary form frame) => 12
# (arbitrary data randomized) => 13
# (RandomTabularDataset chance data driven min set) => 18
# (create set tabular) => 47
Conclusion
In this section we discuss assumptions, alternatives, and “final” classifier deployment.
Hopes
It is hoped that the classifier created with the procedures above is going to be adequate in the “real world.” This is largely dependent on the quality of the training data.
The data presented and used above use grammar-rules generated commands, and those commands are generalized by:
- Removing the sequential order of the words
- Using only frequent enough, dictionary words
Using a recommender instead
We also experimented with a Recommender-based Classifier (RC) – the accuracy results with RC were slightly better (4±2%) than the trie-based classifier, but RC is ≈10 times slower. We plan to discuss RC training and results in a subsequent article.
Final result
Since we find the performance of the trie-based classifier satisfactory – both accuracy-wise and speed-wise – we make a classifier with all of the DSL commands data. See the resource file “dsl-trie-classifier.json”, of [AAp5].
my $trie-to-export = [|%split2<training>, |%split2<testing>].map({ make-trie-basket($_, %knownWords) }).Array.&trie-create;
$trie-to-export.node-counts;
# {Internal => 6671, Leaves => 2204, Total => 8875}
spurt 'dsl-trie-classifier.json', $trie-to-export.JSON;
# True
References
Articles
[AA1] Anton Antonov, “A monad for classification workflows”, (2018), MathematicaForPrediction at WordPress.
[AA2] Anton Antonov, “Trie-based classifiers evaluation”, (2022), RakuForPrediction at WordPress.
Packages
[AAp1] Anton Antonov, Data::Generators Raku package, (2021), GitHub/antononcube.
[AAp2] Anton Antonov, Data::ExampleDatasets Raku package, (2022), GitHub/antononcube.
[AAp3] Anton Antonov, Data::Reshapers Raku package, (2021), GitHub/antononcube.
[AAp4] Anton Antonov, Data::Summarizers Raku package, (2021), GitHub/antononcube.
[AAp5] Anton Antonov, DSL::Shared::Utilities::ComprehensiveTranslation Raku package, (2020-2022), GitHub/antononcube.
[AAp6] Anton Antonov, Lingua::StopwordsISO Raku package, (2022), GitHub/antononcube.
[AAp7] Anton Antonov, ML::AssociationRuleLearning Raku package, (2022), GitHub/antononcube.
[AAp8] Anton Antonov, ML::ROCFunctions Raku package, (2022), GitHub/antononcube.
[AAp9] Anton Antonov, ML::TriesWithFrequencies Raku package, (2021), GitHub/antononcube.
[AAp10] Anton Antonov, Text::Plot Raku package, (2022), GitHub/antononcube.
[AAp11] Anton Antonov, Functional parsers Mathematica package, (2014), MathematicaForPrediction at GitHub.
Repositories
[AAr1] Anton Antonov, ConversationalAgents project, (2017-2022), GitHub/antononcube.
[AAr2] Anton Antonov, NLP Template Engine, (2021-2022), GitHub/antononcube.
Videos
[AAv1] Anton Antonov, “NLP Template Engine, Part 1”, (2021), Simplified Machine Learning Workflows at YouTube.
[AAv2] Anton Antonov “Raku for Prediction”, (2021), TRC-2021.