Omni-slurping with LLMing

Introduction

In this blog post we demonstrate the use of the Raku package “Data::Importers”, which offers a convenient way to import data from URLs and files. The package supports a variety of data formats, such as CSV, HTML, PDF, plain text, and images, making it a versatile tool for data acquisition and manipulation.

One particularly interesting application of “Data::Importers” is its inclusion in workflows based on Large Language Models (LLMs). Generally speaking, having an easy way to ingest a diverse range of data formats — which is what “Data::Importers” aims to provide — makes a wide range of data processing and analysis workflows easier to create.

In this blog post, we will demonstrate how “Data::Importers” can work together with LLMs, providing real-world examples of their combined usage in various situations. Essentially, we will illustrate the power of merging omni-slurping with LLM-ing to improve data-related activities.

The main function of “Data::Importers” is data-import. Its functionality is also incorporated into suitable overloads of the built-in slurp subroutine.
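For example, the following two calls are equivalent ways to fetch the plain text of a Web page (the URL here is illustrative):

my $text1 = data-import('https://raku.org', format => 'plaintext');
my $text2 = slurp('https://raku.org', format => 'plaintext');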

Post’s structure:

  1. Setup
  2. HTML summary and cross tabulation
  3. PDF summary
  4. CSV filtering
  5. Image vision
  6. Image vision with re-imaging

Setup

Here we load the packages used below:

use Data::Importers;
use Data::Reshapers;
use Data::Summarizers;
use JSON::Fast;
use JavaScript::D3;

Here we configure the Jupyter notebook to display JavaScript graphics, [AAp3, AAv1]:

#% javascript

require.config({
    paths: {
        d3: 'https://d3js.org/d3.v7.min'
    }
});

require(['d3'], function(d3) {
     console.log(d3);
});

HTML summary and cross tabulation

A key motivation behind creating the “Data::Importers” package was to efficiently retrieve HTML pages, extract plain text, and import it into a Jupyter notebook for subsequent LLM transformations and content processing.

Here is a pipeline that gets an LLM summary of a certain recent Raku blog post:

my $htmlURL = 'https://rakudoweekly.blog/2024/03/25/2024-13-veyoring-again/';

$htmlURL
==> slurp(format => 'plaintext')
==> { say text-stats($_); $_ }()
==> llm-prompt('Summarize')()
==> llm-synthesize()
(chars => 2814 words => 399 lines => 125)

Paul Cochrane returns to the Raku community with a guide on enabling Continuous Integration on Raku projects using AppVeyor. Core developments include improvements by Elizabeth Mattijsen on Metamodel classes for faster compilation and performance. New and updated Raku modules are also featured in this week's news.

Here is another LLM pipeline that ingests the HTML page and produces an HTML table derived from the page’s content:

#% html

$htmlURL
==> slurp(format => 'plaintext')
==> { say "Contributors table:"; $_ }() 
==> {["Cross tabulate into a HTML table the contributors", 
        "and type of content with the content itself", 
        "for the following text:\n\n", 
        $_, 
        llm-prompt('NothingElse')('HTML')]}()
==> llm-synthesize(e => llm-configuration('Gemini', max-tokens => 4096, temperature => 0.65))
Contributors table:
| Contributor | Content Type | Content |
|---|---|---|
| Paul Cochrane | Tutorial | Building and testing Raku in AppVeyor |
| Dr. Raku | Tutorial | How To Delete Directories |
| Dr. Raku | Tutorial | Fun File Beginners Project |
| Dr. Raku | Tutorial | Hash Examples |
| Elizabeth Mattijsen | Development | Metamodel classes for faster compilation and performance and better stability |
| Stefan Seifert | Development | Fixed several BEGIN time lookup issues |
| Elizabeth Mattijsen | Development | Fixed an issue with =finish if there was no code |
| Samuel Chase | Shoutout | Nice shoutout! |
| Fernando Santagata | Self-awareness test | Self-awareness test |
| Paul Cochrane | Deep rabbit hole | A deep rabbit hole |
| anavarro | Question | How to obtain the Raku language documentation (Reference) offline |
| Moritz Lenz | Comment | On ^ and $ |
| LanX | Comment | The latest name |
| ilyash | Comment | Automatic parsing of args |
| emporas | Comment | Certainly looks nice |
| faiface | Comment | Went quite bad |
| Ralph Mellor | Comment | On Raku’s design decisions regarding operators |
| option | Example | An example Packer file |
| Anton Antonov | Module | Data::Importers |
| Ingy døt Net | Module | YAMLScript |
| Alexey Melezhik | Module | Sparrow6, Sparky |
| Patrick Böker | Module | Devel::ExecRunnerGenerator |
| Steve Roe | Module | PDF::Extract |

PDF summary

Another frequent use of LLMs is processing PDF files found (intentionally or not) while browsing the Web, such as arXiv.org articles, UN resolutions, or court slip opinions.

Here is a pipeline that gets an LLM summary of an oral argument heard recently (2024-03-18) by the US Supreme Court, (22-842, “NRA v. Vullo”):

'https://www.supremecourt.gov/oral_arguments/argument_transcripts/2023/22-842_c1o2.pdf'
==> slurp(format=>'text')
==> llm-prompt('Summarize')()
==> llm-synthesize(e=>llm-configuration('ChatGPT', model => 'gpt-4-turbo-preview'))
The Supreme Court of the United States dealt with a case involving the National Rifle Association (NRA) and Maria T. Vullo, challenging actions taken by New York officials against the NRA's insurance programs. The NRA argued that their First Amendment rights were violated when New York officials, under the guidance of Maria Vullo and Governor Andrew Cuomo, used coercive tactics to persuade insurance companies and banks to sever ties with the NRA, citing the promotion of guns as the reason. These actions included a direct threat of regulatory repercussions to insurance underwriter Lloyd's and the issuance of guidance letters to financial institutions, suggesting reputational risks associated with doing business with the NRA. The court discussed the plausibility of coercion and the First Amendment claim, reflecting on precedents like Bantam Books, and the extent to which government officials can use their regulatory power to influence the actions of third parties against an organization due to its advocacy work.

CSV filtering

Here we ingest from GitHub a CSV file with datasets metadata:

my $csvURL = 'https://raw.githubusercontent.com/antononcube/Raku-Data-ExampleDatasets/main/resources/dfRdatasets.csv';
my $dsDatasets = data-import($csvURL, headers => 'auto');

say "Dimensions   : {$dsDatasets.&dimensions}";
say "Column names : {$dsDatasets.head.keys}";
say "Type         : {deduce-type($dsDatasets)}";
Dimensions   : 1745 12
Column names : n_logical n_character n_numeric Doc Rows Cols Package Title Item CSV n_binary n_factor
Type         : Vector(Assoc(Atom((Str)), Atom((Str)), 12), 1745)
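Since “Data::Summarizers” is loaded above, we can also get a quick statistical overview of the ingested data; here is a minimal sketch using its records-summary function:

records-summary($dsDatasets);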

Here is a table with a row sample:

#% html
my $field-names = <Package Item Title Rows Cols>;
my $dsDatasets2 = $dsDatasets>>.Hash.Array;
$dsDatasets2 = select-columns($dsDatasets2, $field-names);
$dsDatasets2.pick(12) ==> data-translation(:$field-names)
| Package | Item | Title | Rows | Cols |
|---|---|---|---|---|
| robustbase | wagnerGrowth | Wagner’s Hannover Employment Growth Data | 63 | 7 |
| openintro | age_at_mar | Age at first marriage of 5,534 US women. | 5534 | 1 |
| AER | MurderRates | Determinants of Murder Rates in the United States | 44 | 8 |
| Stat2Data | RadioactiveTwins | Comparing Twins Ability to Clear Radioactive Particles | 30 | 3 |
| rpart | kyphosis | Data on Children who have had Corrective Spinal Surgery | 81 | 4 |
| boot | gravity | Acceleration Due to Gravity | 81 | 2 |
| survival | diabetic | Ddiabetic retinopathy | 394 | 8 |
| gap | mfblong | Internal functions for gap | 3000 | 10 |
| Ecdat | Mofa | International Expansion of U.S. MOFAs (majority-owned Foreign Affiliates in Fire (finance, Insurance and Real Estate) | 50 | 5 |
| drc | chickweed | Germination of common chickweed (_Stellaria media_) | 35 | 3 |
| MASS | Pima.tr | Diabetes in Pima Indian Women | 200 | 8 |
| MASS | shrimp | Percentage of Shrimp in Shrimp Cocktail | 18 | 1 |

Here we use an LLM to pick rows that are related to a certain subject:

my $res = llm-synthesize([
    'From the following JSON table pick the rows that are related to air pollution.', 
    to-json($dsDatasets2), 
    llm-prompt('NothingElse')('JSON')
], 
e => llm-configuration('ChatGPT', model => 'gpt-4-turbo-preview', max-tokens => 4096, temperature => 0.65),
form => sub-parser('JSON'):drop)
[{Cols => 6, Item => airquality, Package => datasets, Rows => 153, Title => Air Quality Data} {Cols => 5, Item => iris, Package => datasets, Rows => 150, Title => Edgar Anderson's Iris Data} {Cols => 11, Item => mtcars, Package => datasets, Rows => 32, Title => Motor Trend Car Road Tests} {Cols => 5, Item => USPersonalExpenditure, Package => datasets, Rows => 5, Title => US Personal Expenditure Data (1940-1950)} {Cols => 4, Item => USArrests, Package => datasets, Rows => 50, Title => US Arrests for Assault (1960)}]

Here is the tabulated result:

#% html
$res ==> data-translation(:$field-names)
| Package | Item | Title | Rows | Cols |
|---|---|---|---|---|
| AER | CigarettesB | Cigarette Consumption Data | 46 | 3 |
| AER | CigarettesSW | Cigarette Consumption Panel Data | 96 | 9 |
| plm | Cigar | Cigarette Consumption | 1380 | 9 |

Image vision

One of the cooler recent LLM-service enhancements is access to AI-vision models. For example, AI-vision models are currently available through the interfaces of OpenAI, Gemini, or LLaMA.

Here we use data-import instead of (the overloaded) slurp:

#% markdown
my $imgURL2 = 'https://www.wolframcloud.com/files/04e7c6f6-d230-454d-ac18-898ee9ea603d/htmlcaches/images/2f8c8b9ee8fa646349e00c23a61f99b8748559ed04da61716e0c4cacf6e80979';
my $img2 = data-import($imgURL2, format => 'md-image');

Here is AI-vision invocation:

llm-vision-synthesize('Describe the image', $img2, e => 'Gemini')
 The image shows a blue-white sphere with bright spots on its surface. The sphere is the Sun, and the bright spots are solar flares. Solar flares are bursts of energy that are released from the Sun's surface. They are caused by the sudden release of magnetic energy that has built up in the Sun's atmosphere. Solar flares can range in size from small, localized events to large, global eruptions. The largest solar flares can release as much energy as a billion hydrogen bombs. Solar flares can have a number of effects on Earth, including disrupting radio communications, causing power outages, and triggering geomagnetic storms.

Remark: The image is taken from the Wolfram Community post “Sympathetic solar flare and geoeffective coronal mass ejection”, [JB1].

Remark: The AI vision above is done with Google’s “gemini-pro-vision”. Alternatively, OpenAI’s “gpt-4-vision-preview” can be used.
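Here is a sketch of the corresponding OpenAI invocation (the max-tokens value is an assumption):

llm-vision-synthesize('Describe the image', $img2,
    e => llm-configuration('ChatGPT', model => 'gpt-4-vision-preview', max-tokens => 900))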


Image vision with re-imaging

In this section we show how to import a certain statistical image, get data from the image, and make another, similar statistical graph. Similar workflows are discussed in “Heatmap plots over LLM scraped data”, [AA1]. The plots are made with “JavaScript::D3”, [AAp3].

Here we ingest an image with statistics of fuel exports:

#% markdown
my $imgURL = 'https://pbs.twimg.com/media/GG44adyX0AAPqVa?format=png&name=medium';
my $img = data-import($imgURL, format => 'md-image')

Here is a fairly non-trivial request for data extraction from the image:

my $resFuel = llm-vision-synthesize([
    'Give JSON dictionary of the Date-Country-Values data in the image', 
    llm-prompt('NothingElse')('JSON')
    ], 
    $img, form => sub-parser('JSON'):drop)
[Date-Country-Values => {Jan-22 => {Bulgaria => 56, China => 704, Croatia => 22, Denmark => 118, Finland => 94, France => 140, Germany => 94, Greece => 47, India => 24, Italy => 186, Lithuania => 142, Netherlands => 525, Poland => 165, Romania => 122, South Korea => 327, Spain => 47, Sweden => 47, Total => 3192, Turkey => 170, UK => 68, USA => 24}, Jan-24 => {Brunei => 31, China => 1151, Egypt => 70, Ghana => 35, Greece => 103, India => 1419, Korea => 116, Myanmar => 65, Netherlands => 33, Oman => 23, Total => 3381, Turkey => 305, Unknown => 30}}]

Here we modify the prompt above in order to get a dataset (an array of hashes):

my $resFuel2 = llm-vision-synthesize([
    'For data in the image give the corresponding JSON table that is an array of dictionaries each with the keys "Date", "Country", "Value".', 
    llm-prompt('NothingElse')('JSON')
    ],
    $img, 
    max-tokens => 4096,
    form => sub-parser('JSON'):drop)
[{Country => USA, Date => Jan-22, Value => 24} {Country => Turkey, Date => Jan-22, Value => 170} {Country => Croatia, Date => Jan-22, Value => 22} {Country => Sweden, Date => Jan-22, Value => 47} {Country => Spain, Date => Jan-22, Value => 47} {Country => Greece, Date => Jan-22, Value => 47} {Country => Bulgaria, Date => Jan-22, Value => 56} {Country => UK, Date => Jan-22, Value => 68} {Country => Germany, Date => Jan-22, Value => 94} {Country => Finland, Date => Jan-22, Value => 94} {Country => Denmark, Date => Jan-22, Value => 118} {Country => Romania, Date => Jan-22, Value => 122} {Country => France, Date => Jan-22, Value => 140} {Country => Lithuania, Date => Jan-22, Value => 142} {Country => Poland, Date => Jan-22, Value => 165} {Country => Italy, Date => Jan-22, Value => 186} {Country => Netherlands, Date => Jan-22, Value => 525} {Country => India, Date => Jan-22, Value => 24} {Country => Japan, Date => Jan-22, Value => 70} {Country => South Korea, Date => Jan-22, Value => 327} {Country => China, Date => Jan-22, Value => 704} {Country => Unknown, Date => Jan-24, Value => 30} {Country => Ghana, Date => Jan-24, Value => 35} {Country => Egypt, Date => Jan-24, Value => 70} {Country => Oman, Date => Jan-24, Value => 23} {Country => Turkey, Date => Jan-24, Value => 305} {Country => Netherlands, Date => Jan-24, Value => 33} {Country => Greece, Date => Jan-24, Value => 103} {Country => Brunei, Date => Jan-24, Value => 31} {Country => Myanmar, Date => Jan-24, Value => 65} {Country => Korea, Date => Jan-24, Value => 116} {Country => China, Date => Jan-24, Value => 1151} {Country => India, Date => Jan-24, Value => 1419}]

Here is how the obtained dataset looks:

#% html
$resFuel2>>.Hash ==> data-translation()
| Value | Date | Country |
|---|---|---|
| 24 | Jan-22 | USA |
| 170 | Jan-22 | Turkey |
| 22 | Jan-22 | Croatia |
| 47 | Jan-22 | Sweden |
| 47 | Jan-22 | Spain |
| 47 | Jan-22 | Greece |
| 56 | Jan-22 | Bulgaria |
| 68 | Jan-22 | UK |
| 94 | Jan-22 | Germany |
| 94 | Jan-22 | Finland |
| 118 | Jan-22 | Denmark |
| 122 | Jan-22 | Romania |
| 140 | Jan-22 | France |
| 142 | Jan-22 | Lithuania |
| 165 | Jan-22 | Poland |
| 186 | Jan-22 | Italy |
| 525 | Jan-22 | Netherlands |
| 24 | Jan-22 | India |
| 70 | Jan-22 | Japan |
| 327 | Jan-22 | South Korea |
| 704 | Jan-22 | China |
| 30 | Jan-24 | Unknown |
| 35 | Jan-24 | Ghana |
| 70 | Jan-24 | Egypt |
| 23 | Jan-24 | Oman |
| 305 | Jan-24 | Turkey |
| 33 | Jan-24 | Netherlands |
| 103 | Jan-24 | Greece |
| 31 | Jan-24 | Brunei |
| 65 | Jan-24 | Myanmar |
| 116 | Jan-24 | Korea |
| 1151 | Jan-24 | China |
| 1419 | Jan-24 | India |

Here we rename or otherwise transform the columns of the dataset above in order to prepare it for creating a heatmap plot (we also show the deduced type):

my @fuelDataset = $resFuel2.map({
    my %h = $_.clone;
    %h<z> = log10(%h<Value>);   # color value on a log scale
    %h<y> = %h<Country>;        # heatmap rows
    %h<x> = %h<Date>;           # heatmap columns
    %h<label> = %h<Value>;      # plot labels
    %h.grep({ $_.key ∈ <x y z label> }).Hash }).Array;

deduce-type(@fuelDataset);
Vector(Struct([label, x, y, z], [Int, Str, Str, Num]), 33)

Here is the heatmap plot:

#%js
js-d3-heatmap-plot(@fuelDataset,
                width => 700,
                height => 500,
                color-palette => 'Reds',
                plot-labels-color => 'White',
                plot-labels-font-size => 18,
                tick-labels-color => 'steelblue',
                tick-labels-font-size => 12,
                low-value => 0,
                high-value => 3.5,
                margins => {left => 100, right => 0},
                mesh => 0.01,
                title => 'Russia redirecting seaborne crude amid sanctions, 1000 b/d')

Here are the corresponding totals:

group-by($resFuel2, 'Date').map({ $_.key => $_.value.map(*<Value>).sum })
(Jan-24 => 3381 Jan-22 => 3192)
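Alternatively, since “Data::Reshapers” is loaded above, the same totals can be obtained via cross-tabulation; here is a sketch:

cross-tabulate($resFuel2, 'Date', 'Country', 'Value').map({ $_.key => $_.value.values.sum })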

References

Articles

[AA1] Anton Antonov, “Heatmap plots over LLM scraped data”, (2024), RakuForPrediction at WordPress.

[JB1] Jeffrey Bryant, “Sympathetic solar flare and geoeffective coronal mass ejection”, (2024), Wolfram Community.

Packages

[AAp1] Anton Antonov, Data::Importers Raku package, (2024), GitHub/antononcube.

[AAp2] Anton Antonov, LLM::Functions Raku package, (2023-2024), GitHub/antononcube.

[AAp3] Anton Antonov, JavaScript::D3 Raku package, (2022-2024), GitHub/antononcube.

Videos

[AAv1] Anton Antonov, “Random mandalas generation (with D3.js via Raku)”, (2022), Anton Antonov’s YouTube channel.
