Automap — Slide Deck v3

AUTOMAP

Multi-Agent Automated Mapping Pipeline

Universidad Politécnica de Madrid · PIONERA Project

Transforms CSV + ontology into validated knowledge graphs — with auto-generated Competency Questions, SPARQL execution validation, and optional SHACL conformance checking.

11 pipeline nodes · 4 feedback loops · 100% local-LLM capable · 3 YARRRML sub-agents

2

What if we just gave everything to one LLM?

Dump the CSV, the full ontology, examples, and instructions into a single prompt — and ask for YARRRML mappings.

Everything at once

📄 CSV schema · 🔷 Full ontology · 📜 YARRRML examples · 🗒 Prefix table · 🔗 Relationship hints · ✅ Validation rules

→ 🧠 One LLM call (~20 000+ tokens, mixed instructions)

→ Typical failures

Hallucinated predicates · Wrong prefix bindings · Missing entity links · Flat mapping · Ignored columns · Invalid YARRRML syntax · No error recovery

3

11-Node LangGraph Architecture

A single typed state machine with deterministic edges, four conditional feedback loops, and explicit hard-stop conditions at every gate.

flowchart TD
    A([CSV + Ontology]) --> N1[analyze_schema]
    N1 --> N2[scout_ontology]
    N2 --> N3[map_semantics]
    N3 --> N4[align_schema]
    N4 --> N5[generate_cqs]
    N5 --> N6[generate_yarrrml]
    N6 --> N7{validate_yarrrml}
    N7 -->|Syntax error| N6
    N7 -->|Passed| N8{refine_logic}
    N8 -->|Logic error| N6
    N8 -->|Approved| N9[generate_kg]
    N9 --> N10{shacl_validate}
    N10 -->|SHACL error| N6
    N10 -->|Pass/Skip| N11{sparql_validate_cqs}
    N11 -->|CQ fail| N6
    N11 -->|CQ deep fail| N4
    N11 -->|Pass/Skip| O([Knowledge Graph])


Nodes 1–3 · Understand

analyze_schema → scout_ontology → map_semantics: CSV structure, ontology vocabulary, semantic column mapping.

Nodes 4–5 · Plan

align_schema builds the functional entity plan. generate_cqs auto-generates Competency Questions (or uses user-provided ones).

Nodes 6–8 · Generate + Syntax + Logic

YARRRML coordinator → Yatter syntax check → Refiner logic check. Both loops feed back into generation.

Nodes 9–11 · Materialize + Validate

Morph-KGC produces N-Triples. SHACL shapes (via Astrea) check conformance. SPARQL ASK queries execute each CQ on the live KG.
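
The gate routing in the flowchart can be sketched as plain Python functions. This is a minimal sketch, not the project's actual LangGraph code: the state keys (`syntax_ok`, `logic_ok`, `shacl_ok`, `cq_ok`, `cq_deep_fail`), the function names, and the `"done"` label are hypothetical. Only the node names come from the diagram.

```python
from typing import Optional, TypedDict


class PipelineState(TypedDict, total=False):
    # Hypothetical state keys, chosen for illustration only.
    syntax_ok: bool
    logic_ok: bool
    shacl_ok: Optional[bool]  # None means SHACL validation was skipped
    cq_ok: Optional[bool]     # None means CQ validation was skipped
    cq_deep_fail: bool


def route_after_validate_yarrrml(state: PipelineState) -> str:
    # Syntax errors loop back to generation.
    return "refine_logic" if state.get("syntax_ok") else "generate_yarrrml"


def route_after_refine_logic(state: PipelineState) -> str:
    # Logic errors loop back to generation.
    return "generate_kg" if state.get("logic_ok") else "generate_yarrrml"


def route_after_shacl_validate(state: PipelineState) -> str:
    # Pass or skip moves on; only an explicit failure loops back.
    if state.get("shacl_ok") is False:
        return "generate_yarrrml"
    return "sparql_validate_cqs"


def route_after_sparql_validate_cqs(state: PipelineState) -> str:
    # Pass or skip ends the run; a deep failure returns to planning.
    if state.get("cq_ok") is not False:
        return "done"
    return "align_schema" if state.get("cq_deep_fail") else "generate_yarrrml"
```

In LangGraph, routing functions of this shape are typically registered with `StateGraph.add_conditional_edges`; the hard-stop limits would sit in front of each loop-back.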

4

Agent Breakdown


Pipeline Nodes

1 analyze_schema
2 scout_ontology
3 map_semantics
4 align_schema
5 generate_cqs
6 generate_yarrrml
7 ⬦ validate_yarrrml
8 ⬦ refine_logic
9 generate_kg
10 ⬦ shacl_validate
11 ⬦ sparql_validate_cqs


5

Competency Question Pipeline

Nodes 5 and 11 are the CQ nodes.

Pipeline Nodes (★ = CQ node)

1 analyze_schema
2 scout_ontology
3 map_semantics
4 align_schema
5 generate_cqs ★
6 generate_yarrrml
7 ⬦ validate_yarrrml
8 ⬦ refine_logic
9 generate_kg
10 ⬦ shacl_validate
11 ⬦ sparql_validate_cqs ★


6

YARRRML Coordinator

Node 6 orchestrates three sub-agents plus deterministic post-processing.

Pipeline Nodes (★ = coordinator node)

1 analyze_schema
2 scout_ontology
3 map_semantics
4 align_schema
5 generate_cqs
6 generate_yarrrml ★
    P PrefixAgent
    E EntityAgent
    R RelationshipAgent
    ✦ Post-processing
7 ⬦ validate_yarrrml
8 ⬦ refine_logic
9 generate_kg
10 ⬦ shacl_validate
11 ⬦ sparql_validate_cqs
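
The coordinator pattern (three sub-agents followed by a deterministic pass) might look roughly like the sketch below. Everything here is a hypothetical illustration: the three stub functions stand in for LLM-backed agents, their return shapes are assumed, and the post-processing rule (dropping predicates whose prefix was never declared) is one plausible deterministic cleanup, not necessarily the one Automap applies.

```python
def prefix_agent():
    # Stub for PrefixAgent: a real agent would derive prefixes from the ontology.
    return {"ex": "http://example.org/", "schema": "http://schema.org/"}


def entity_agent():
    # Stub for EntityAgent: one mapping with a subject template and literal properties.
    return {
        "transaction": {
            "s": "ex:tx/$(trans_num)",
            "po": [["a", "ex:CreditCardTransaction"], ["ex:amt", "$(amt)"]],
        }
    }


def relationship_agent():
    # Stub for RelationshipAgent: links between entities, keyed by mapping name.
    return {
        "transaction": [
            ["ex:atMerchant", "ex:merchant/$(merchant)~iri"],
            ["foaf:knows", "ex:person/$(cc_num)~iri"],  # prefix deliberately undeclared
        ]
    }


def post_process(doc):
    """Deterministic cleanup (assumed rule): drop pairs with an undeclared prefix."""
    declared = set(doc["prefixes"])
    for mapping in doc["mappings"].values():
        mapping["po"] = [
            [p, o] for p, o in mapping["po"]
            if p == "a" or p.split(":", 1)[0] in declared
        ]
    return doc


def coordinate():
    prefixes = prefix_agent()
    mappings = entity_agent()
    for name, links in relationship_agent().items():
        mappings[name]["po"].extend(links)
    return post_process({"prefixes": prefixes, "mappings": mappings})
```

Calling `coordinate()` returns a dict with the YARRRML top-level keys `prefixes` and `mappings`; the `foaf:knows` pair is removed because `foaf` was never declared.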


7

Evaluation Framework

Four metric levels (pipeline health, KG fidelity, structural completeness, and CQ satisfaction), measured independently and combined.

L1 · Pipeline Health

Always available · zero cost

Boolean flags: yarrrml_produced, yarrrml_syntactic_valid, yarrrml_translatable, rml_materializable, pipeline_success

Counters: retry_count, total_triples, total_latency_sec

Key: L1_* prefix in output dict

L2 · Gold KG Comparison

URI-tolerant normalisation · P / R / F1

Triple level: norm_triple_precision, norm_triple_recall, norm_triple_f1, true_positives, false_positives, false_negatives

Schema level: predicate_precision, predicate_recall, predicate_f1, class_precision, class_recall, class_f1

Diagnostics: predicates_missing, predicates_extra, object_type_mismatches, total_generated, total_gold

Norm: subject → row ID · object URI → local name
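
A URI-tolerant comparison of this kind could be sketched as follows. This is an assumption-laden illustration, not the project's evaluator: the helper names are hypothetical, the trailing path segment of the subject stands in for the row ID, predicates are compared unchanged, and literals containing `/` or `#` would need separate handling.

```python
import re


def local_name(term: str) -> str:
    """Return the part after the last '#' or '/'; plain literals pass through."""
    return re.split(r"[#/]", term.rstrip("/"))[-1]


def normalise(triple):
    # Assumed normalisation: subject -> trailing ID, object URI -> local name.
    s, p, o = triple
    return (local_name(s), p, local_name(o))


def triple_prf(generated, gold):
    gen = {normalise(t) for t in generated}
    ref = {normalise(t) for t in gold}
    tp = len(gen & ref)
    precision = tp / len(gen) if gen else 0.0
    recall = tp / len(ref) if ref else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return {
        "norm_triple_precision": precision,
        "norm_triple_recall": recall,
        "norm_triple_f1": f1,
        "true_positives": tp,
        "false_positives": len(gen - ref),
        "false_negatives": len(ref - gen),
    }


gold = [
    ("http://gold.org/movie/1", "http://schema.org/name", "Alien"),
    ("http://gold.org/movie/1", "http://schema.org/genre", "http://gold.org/genre/Horror"),
]
generated = [
    ("http://ex.org/film/1", "http://schema.org/name", "Alien"),
    ("http://ex.org/film/1", "http://schema.org/genre", "http://ex.org/g/Horror"),
    ("http://ex.org/film/1", "http://schema.org/year", "1979"),
]
scores = triple_prf(generated, gold)
```

Here the two pipelines use different base IRIs, yet two of the three generated triples still match the gold KG after normalisation.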

L3 · Column Coverage

Structural completeness via YARRRML template analysis

Primary (YARRRML refs): columns_total, columns_mapped_yarrrml, column_coverage_by_yarrrml ★, columns_missing_yarrrml

Parses all $(col) refs directly: 100% accurate ground truth

Secondary (literal value match): columns_mapped_value, column_coverage_by_value, columns_missing_value

Checks if first-row cell values appear as RDF literals in the KG
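
The primary L3 metric can be illustrated with a regular expression over the mapping text. This is a minimal sketch under assumptions: the function name is hypothetical, and whether `columns_mapped_yarrrml` is stored as a count or a list in the real output dict is not stated in the deck (a count is used here).

```python
import re

# Matches YARRRML column references such as $(trans_num).
COLUMN_REF = re.compile(r"\$\(([^)]+)\)")


def yarrrml_column_coverage(yarrrml_text, csv_columns):
    referenced = set(COLUMN_REF.findall(yarrrml_text))
    missing = [c for c in csv_columns if c not in referenced]
    mapped = len(csv_columns) - len(missing)
    return {
        "columns_total": len(csv_columns),
        "columns_mapped_yarrrml": mapped,
        "column_coverage_by_yarrrml": mapped / len(csv_columns) if csv_columns else 0.0,
        "columns_missing_yarrrml": missing,
    }


mapping = """
mappings:
  transaction:
    s: ex:tx/$(trans_num)
    po:
      - [ex:amt, $(amt)]
"""
coverage = yarrrml_column_coverage(mapping, ["trans_num", "amt", "is_fraud"])
```

In this toy mapping `is_fraud` is never referenced, so it is reported as missing and coverage is 2 of 3 columns.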

L4 · CQ / SPARQL Validation

Always computed when SPARQL results are present

Counts: cq_total, cq_passed, cq_failed, cq_error

Coverage score: cq_coverage = passed / total

Null results (SPARQL generation errors) count as failures, which keeps the score conservative.

CQ → SPARQL ASK → pyoxigraph execution on live KG
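
The coverage rule above (null results count against the score) can be sketched like this. The function name and the input shape are hypothetical: each CQ is assumed to map to `True` (ASK returned true), `False` (ASK returned false), or `None` (no usable SPARQL query). Whether the real `cq_failed` counter includes errors is not stated; here they are counted separately.

```python
def cq_metrics(results):
    """Summarise CQ outcomes; None results lower coverage just like failures."""
    total = len(results)
    passed = sum(1 for r in results.values() if r is True)
    errors = sum(1 for r in results.values() if r is None)
    return {
        "cq_total": total,
        "cq_passed": passed,
        "cq_failed": total - passed - errors,
        "cq_error": errors,
        "cq_coverage": passed / total if total else 0.0,
    }


example = cq_metrics({
    "Which transactions exist in the dataset?": True,
    "Is a given transaction fraudulent?": True,
    "What is the merchant's location?": False,
    "What is the cardholder's location?": None,  # SPARQL generation error
})
```

Because the denominator is always the full CQ count, a query that could not be generated lowers `cq_coverage` in the same way as a query that returned false.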

How to run

python main.py --eval 1 2 3 --gold data/gold/bikeshare_gold.nt
python main.py --eval 1 2 3 4 --sparql --cqs "Which stations have docks?"
python main.py --eval 1 --shacl --sparql

8

Fully Configurable Per-Agent Pipeline

Every agent can be independently tuned: choose the right model for each role, control temperature for precision, and set retry limits per feedback loop.

🧠 Per-Agent LLM Model

Right model for the right job

Assign a reasoning model (e.g. DeepSeek-R1) to complex planning agents and a fast instruct model (e.g. Qwen Coder) to generation agents. One env var per role.

Role	Model	Type
SCHEMA	deepseek-r1-distill-qwen-14b	reasoning
MAPPER	deepseek-r1-distill-qwen-14b	reasoning
YARRRML	qwen2.5-coder-14b-instruct	instruct
REFINER	qwen2.5-coder-14b-instruct	instruct
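
Per-role model selection through environment variables might be wired up as in the sketch below. The `<ROLE>_MODEL` variable pattern and the `model_for` helper are assumptions made for illustration; the deck only states that there is one env var per role. The defaults are the models listed above.

```python
import os

# Defaults taken from the role table; the env var naming is hypothetical.
DEFAULT_MODELS = {
    "SCHEMA": "deepseek-r1-distill-qwen-14b",
    "MAPPER": "deepseek-r1-distill-qwen-14b",
    "YARRRML": "qwen2.5-coder-14b-instruct",
    "REFINER": "qwen2.5-coder-14b-instruct",
}


def model_for(role, env=None):
    """Return the model for a role, letting an env var override the default."""
    env = os.environ if env is None else env
    return env.get(f"{role}_MODEL", DEFAULT_MODELS[role])


# Passing an explicit mapping instead of os.environ keeps the example testable.
override = model_for("REFINER", env={"REFINER_MODEL": "mistral-nemo-2407"})
default = model_for("YARRRML", env={})
```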

🌡️ Per-Agent Temperature

Determinism where it counts

Temperatures stay low across the board so output remains deterministic and parseable: schema understanding, mapping, YARRRML generation, and CQ generation run at 0.3, while alignment and the refiner run at 0.2.

Agent	Temperature
schema / mapper	0.3
alignment	0.2
yarrrml / cq	0.3
refiner	0.2

🔁

Retry Configuration

Hard-stop guards per feedback loop

Each feedback loop has an independent retry counter with a configurable max. When a loop hits its limit, the pipeline hard-stops rather than cycling endlessly.

⟳ Syntax loop (validate_yarrrml) max 10

⟳ Logic loop (refine_logic) max 6

⟳ SHACL loop (shacl_validate) max 5

⟳ CQ loop (sparql_validate_cqs) max 3

All limits are configurable via MAX_RETRIES_* env vars — no code changes needed.
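
Independent per-loop counters with a hard stop could be sketched as follows. The class name, the loop keys, and the exact `MAX_RETRIES_<LOOP>` suffixes are hypothetical (the deck only gives the `MAX_RETRIES_*` pattern); the default limits are the ones listed above.

```python
import os

# Default limits from the slide; the suffixes after MAX_RETRIES_ are assumed.
DEFAULT_MAX_RETRIES = {"SYNTAX": 10, "LOGIC": 6, "SHACL": 5, "CQ": 3}


def max_retries(loop, env=None):
    env = os.environ if env is None else env
    return int(env.get(f"MAX_RETRIES_{loop}", DEFAULT_MAX_RETRIES[loop]))


class RetryGuard:
    """Tracks one retry counter per feedback loop."""

    def __init__(self, env=None):
        self.env = env
        self.counts = {loop: 0 for loop in DEFAULT_MAX_RETRIES}

    def allow_retry(self, loop):
        """Count one retry for `loop`; return False once its limit is reached."""
        if self.counts[loop] >= max_retries(loop, self.env):
            return False  # hard stop: the caller should end the run
        self.counts[loop] += 1
        return True


guard = RetryGuard(env={"MAX_RETRIES_CQ": "2"})
cq_attempts = [guard.allow_retry("CQ") for _ in range(3)]
syntax_attempt = guard.allow_retry("SYNTAX")
```

With the CQ limit overridden to 2, the third CQ retry is refused while the syntax loop still has its full budget, which is the independence the slide describes.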

9

KG Quality Evaluation — IMDB Dataset

Unified F1 comparison of all five pipelines against the same gold-standard KG (49 triples, Hofer et al.).

Pipeline	Model	Triples
Ref A	GPT-4 0125-preview	56
Ref B	Claude 3 Opus	40
Our α	Qwen 2.5 Coder 14B	67
Our β	Mistral Nemo 2407	77
Our γ ✦	DeepSeek-R1-Distill-Qwen-14B	67

Ref A and Ref B are the Hofer et al. reference pipelines. Gold KG: 49 triples.

F1 Score Breakdown: all models vs gold KG (49 triples)

📚 Ref A and Ref B: Hofer et al. reference pipelines. 🤖 Our α, β, γ: our Automap pipeline with local LLMs.

Metric	🟡 Ref A · GPT-4 0125-preview	🟢 Ref B · Claude 3 Opus	🔵 Our α · Qwen 2.5 Coder 14B	🟣 Our β · Mistral Nemo 2407	🩵 Our γ · DeepSeek-R1-Distill-Qwen-14B
Entity & Subject Coverage
Subject IDs (fuzzy, IRI-agnostic)	1.000	1.000	0.957	1.000	1.000
Class Assignments
Class assignments (rdf:type)	0.846	1.000	0.815	0.385	0.846
Predicate Coverage
Predicate usage	0.667	0.899	0.672	0.603	0.655
Value Fidelity
Literal values	0.833	1.000	0.667	0.469	0.648
⌀ Average F1	0.592	0.726	0.528	0.436	0.550

Our γ average F1: −0.042 vs Ref A.

🩵 DeepSeek-R1-Distill highlights

Best avg F1 of our three — 0.550
Only −0.042 below GPT-4 reference
Perfect entity recall

🔵 Qwen 14B highlights

Avg F1 0.528
Best literal values among our three (0.667)

🟣 Mistral Nemo highlights

Avg F1 0.436
Actors correctly typed dbo:Actor ✓ (unique)

10

Ablation Study — Credit Card Fraud Dataset

Kaggle Credit Card Fraud · Qwen 2.5-Coder-14B · 3 CQ-retry rounds · SHACL enabled. Two conditions: Auto-CQ vs Crafted-CQ.

Input CSV — 22 columns

trans_date_trans_time  cc_num              merchant          category       amt
2020-06-21T12:14:25    2291163933867244    fraud_Kirlin…     personal_care  2.86

first   last     gender  street               city      state  zip    lat       long
Jeff    Elliott  M       351 Darlene Green    Columbia  SC     29209  33.9659  -80.9355

city_pop  job                    dob         trans_num       unix_time   is_fraud
333497    Mechanical engineer    1968-03-19  2da90c7d74…     1371816865  0

🏆 Gold KG: 5 Entity Nodes

Tx · ex:CreditCardTransaction: trans_num, amt, isFraud, dateTime → madeBy, atMerchant, usesCard
Ch · schema:Person (cardholder): givenName, familyName, gender, birthDate, jobTitle → address
Ad · schema:PostalAddress: street, city, state, zip, geo:lat, geo:long
Mc · ex:Merchant: name, merchantCategory, merch_lat, merch_long
Cc · ex:CreditCard: cardNumber → heldBy (cardholder)

🤖

Auto-CQ (generic)

Which transactions exist in the dataset?
What is the amount of a given transaction?
What is the merchant of a transaction?
What is the category of a transaction?
Is a given transaction fraudulent?
What is the cardholder's name?
What is the transaction date and time?
What is the card number used?
What is the merchant's location?
What is the cardholder's location?

⚠ Schema-level only

These questions require no joins, no aggregations, no geo reasoning. They don't encode the intended entity structure.

Auto-CQ generates → 3 nodes

CreditCardTransaction: hasPart, merchant as string
Merchant: lat/long only
Metadata: flat blob of all other columns

❌ No Person node · No CreditCard node · No Address node

✍️

Crafted-CQ (domain-specific)

Which transactions are flagged as fraudulent?
Which cardholders made >3 fraudulent transactions?
Total amount per merchant category?
Job and date of birth of a cardholder?
Cardholders in cities with pop > 100 000?
Merchants with most fraud flags?
Which categories are most fraud-associated?
Transactions >50 km from cardholder location?
Fraudulent transactions in a specific US state?
Transactions within the same hour?
Cardholder with multiple transactions in 10 min?

✅ Multi-hop · aggregation · geo-spatial

These questions encode the cardholder–card–merchant–address entity structure present in the gold KG.

Crafted-CQ generates → 4 nodes

CreditCardTransaction: isFraudulent, cardholder link
Merchant: has_part transaction link
Person: gender, jobTitle, birthDate ✓
Metadata: transDateTime, state, zip ✓

✓ Person node added · ✓ fraud flag exposed · ✗ still missing CreditCard + Address nodes

11

Auto-CQ vs Crafted-CQ — Results

CQ quality is the primary driver of KG structural fidelity. Domain-specific CQs compel the pipeline to model the precise entity structure of the gold KG.

+40.7 pp F1 gain

Triple F1: 0.139 → 0.546

Domain-specific CQs about fraud flags, cardholder–card ownership, merchant categories, and geo-location compel the pipeline to model the precise entity structure of the gold KG.

Metric	🤖 Auto-CQ	✍️ Crafted-CQ
Run statistics
Triples materialised	168	293
Retries	3	3
Latency (s)	1144	1354
Level 2 — Gold KG comparison
Triple Precision	0.122	0.544
Triple Recall	0.161	0.548
Triple F1 ★	0.139	0.546
Predicate F1	0.480	0.440
Class F1	0.400	0.364
Level 3 — Column coverage
YARRRML column coverage	1.000 ✓	1.000 ✓
Level 4 — CQ satisfaction
CQs passed / total	4 / 10	2 / 11
CQ coverage	0.400	0.182
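
The Triple F1 row and the size of the gain can be rechecked from the precision and recall rows of the table. The short script below is only an arithmetic check using the table's values; the helper name `f1` is chosen here for illustration.

```python
def f1(precision, recall):
    """Harmonic mean of precision and recall."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


auto_f1 = f1(0.122, 0.161)      # Auto-CQ precision and recall
crafted_f1 = f1(0.544, 0.548)   # Crafted-CQ precision and recall
gain_pp = (crafted_f1 - auto_f1) * 100  # absolute gain in percentage points

print(round(auto_f1, 3), round(crafted_f1, 3), round(gain_pp, 1))
# prints: 0.139 0.546 40.7
```

Both F1 values match the table, and the absolute difference between them is about 40.7 percentage points.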

Entity structure: what the pipeline generates

🤖 Auto-CQ · 3 nodes: CreditCardTransaction, Merchant, Metadata (flat)
❌ No Person · No Card · No Address

→ ✍️ Crafted-CQ · 4 nodes: CreditCardTransaction, Merchant, Person ✓ NEW, Metadata (enriched)
✓ Person added · ✗ Card · ✗ Address

→ 🏆 Gold · 5 nodes: CreditCardTransaction, Merchant, Person (cardholder), PostalAddress, CreditCard

✅ Full column coverage

Both conditions achieve L3 = 1.0 — all 22 CSV columns referenced. The pipeline is structurally reliable regardless of CQ input.

⚠ The Satisfaction Paradox

Crafted CQs pass fewer SPARQL checks (18% vs 40%) but produce a better KG. Simpler questions are easier to satisfy but carry less structural signal.

🎯 Takeaway

CQ quality is the primary driver of KG structural fidelity. Investing in well-scoped, domain-specific questions substantially outweighs a higher satisfaction count on simpler auto-generated ones. The +40.7 pp Triple F1 gain (0.139 → 0.546) comes from CQs encoding the intent of the gold KG structure.

A

Thank you

Thank you for your attention. I would very much appreciate your feedback and any questions you might have.

Supervisor

Dr. Raúl García Castro

garcia-castro.com

Ontology Engineering Group

Dept. de Inteligencia Artificial

ETSI Informáticos · UPM

Naveen Varma Kalidindi

Universidad Politécnica de Madrid
PIONERA Project

✉ naveen.kalidindi@upm.es

Supervisor

Pablo Calleja

Profesor Permanente Laboral

Dept. de Inteligencia Artificial

ETSI Informáticos · UPM