Automap

Agentic knowledge graph generation from CSV + ontology
LangGraph · Morph-KGC · YARRRML → RML · LM Studio
AUTO
Multi-Agent
Automated Mapping
Pipeline
Universidad Politécnica de Madrid · PIONERA Project

Transforms CSV + ontology into validated knowledge graphs — with auto-generated Competency Questions, SPARQL execution validation, and optional SHACL conformance checking.

11 Pipeline nodes
4 Feedback loops
100% Local-LLM capable
3 YARRRML sub-agents
2
What if we just gave everything to one LLM?

Dump the CSV, the full ontology, examples, and instructions into a single prompt — and ask for YARRRML mappings.

Everything at once
📄 CSV schema 🔷 Full ontology 📜 YARRRML examples 🗒 Prefix table 🔗 Relationship hints ✅ Validation rules
🧠

One LLM Call

~20 000+ tokens
mixed instructions

Typical failures
Hallucinated predicates · Wrong prefix bindings · Missing entity links · Flat mapping · Ignored columns · Invalid YARRRML syntax · No error recovery
3
11-Node LangGraph Architecture

A single typed state machine with deterministic edges, four conditional feedback loops, and explicit hard-stop conditions at every gate.

flowchart TD
    A([CSV + Ontology]) --> N1[analyze_schema]
    N1 --> N2[scout_ontology]
    N2 --> N3[map_semantics]
    N3 --> N4[align_schema]
    N4 --> N5[generate_cqs]
    N5 --> N6[generate_yarrrml]
    N6 --> N7{validate_yarrrml}
    N7 -->|Syntax error| N6
    N7 -->|Passed| N8{refine_logic}
    N8 -->|Logic error| N6
    N8 -->|Approved| N9[generate_kg]
    N9 --> N10{shacl_validate}
    N10 -->|SHACL error| N6
    N10 -->|Pass/Skip| N11{sparql_validate_cqs}
    N11 -->|CQ fail| N6
    N11 -->|CQ deep fail| N4
    N11 -->|Pass/Skip| O([Knowledge Graph])
1–3
Understand

analyze_schema, scout_ontology, map_semantics: CSV structure, ontology vocabulary, semantic column mapping.

4–5
Plan

align_schema builds the functional entity plan. generate_cqs auto-generates Competency Questions (or uses user-provided ones).

6–8
Generate + Syntax + Logic

YARRRML coordinator → Yatter syntax check → Refiner logic check. Both loops feed back into generation.

9–11
Materialize + Validate

Morph-KGC produces N-Triples. SHACL shapes (via Astrea) check conformance. SPARQL ASK queries execute each CQ on the live KG.
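
The conditional edges in the flowchart can be read as one small routing function. The sketch below is plain Python rather than LangGraph code; the `EDGES` table is read off the diagram, while `next_node`, the `deep_fail` flag, and the `"HARD_STOP"` / `"END"` labels are illustrative assumptions (the slides do not say how a "deep" CQ failure is detected).

```python
from typing import Dict, Tuple

# (next node on pass, next node on failure), read off the flowchart edges.
EDGES: Dict[str, Tuple[str, str]] = {
    "validate_yarrrml": ("refine_logic", "generate_yarrrml"),
    "refine_logic": ("generate_kg", "generate_yarrrml"),
    "shacl_validate": ("sparql_validate_cqs", "generate_yarrrml"),
    "sparql_validate_cqs": ("END", "generate_yarrrml"),
}


def next_node(gate: str, passed: bool, retries: int, max_retries: int,
              deep_fail: bool = False) -> str:
    """Pick the node that follows a validation gate.

    `deep_fail` is a hypothetical flag for the "CQ deep fail" edge, which
    sends the pipeline back to align_schema instead of generate_yarrrml.
    """
    on_pass, on_fail = EDGES[gate]
    if passed:
        return on_pass
    if retries >= max_retries:
        # Hard stop instead of cycling endlessly.
        return "HARD_STOP"
    if gate == "sparql_validate_cqs" and deep_fail:
        return "align_schema"
    return on_fail
```

In a LangGraph graph, a function of this shape would typically be attached to each gate as a conditional edge, with the retry counter kept in the shared state.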

4
Agent Breakdown

Pipeline Nodes
1 analyze_schema
2 scout_ontology
3 map_semantics
4 align_schema
5 generate_cqs
6 generate_yarrrml
7 ⬦ validate_yarrrml
8 ⬦ refine_logic
9 generate_kg
10 ⬦ shacl_validate
11 ⬦ sparql_validate_cqs
⬦ = conditional gate

5
Competency Question Pipeline

Nodes 5 (generate_cqs) and 11 (sparql_validate_cqs) are the CQ nodes.


6
YARRRML Coordinator

Node 6 orchestrates three sub-agents plus deterministic post-processing.

generate_yarrrml steps
P PrefixAgent
E EntityAgent
R RelationshipAgent
Post-processing
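
One way to picture the coordinator is as three functions whose outputs are merged into a single YARRRML document, followed by a deterministic clean-up pass. This is a minimal sketch under assumptions: the sub-agents are stubbed as plain functions (in the pipeline they would be LLM calls), and every function name, the plan format, and the subject templates are hypothetical. Only the agent roles come from the slide.

```python
from typing import Dict, List, Tuple


def prefix_agent(prefixes: Dict[str, str]) -> Dict[str, str]:
    """Stand-in for PrefixAgent: returns the prefix table unchanged."""
    return dict(prefixes)


def entity_agent(entities: Dict[str, dict], source: str) -> Dict[str, dict]:
    """Stand-in for EntityAgent: one YARRRML mapping per planned entity."""
    return {
        name: {
            "sources": [[f"{source}~csv"]],
            "s": plan["subject"],
            "po": [["a", plan["class"]]],
        }
        for name, plan in entities.items()
    }


def relationship_agent(mappings: Dict[str, dict],
                       links: List[Tuple[str, str, str]]) -> Dict[str, dict]:
    """Stand-in for RelationshipAgent: links mappings via IRI templates."""
    for subj, pred, obj in links:
        mappings[subj]["po"].append([pred, mappings[obj]["s"] + "~iri"])
    return mappings


def post_process(doc: dict) -> dict:
    """Deterministic clean-up: drop duplicate predicate-object pairs."""
    for mapping in doc["mappings"].values():
        seen, unique = set(), []
        for po in mapping["po"]:
            if tuple(po) not in seen:
                seen.add(tuple(po))
                unique.append(po)
        mapping["po"] = unique
    return doc


def generate_yarrrml(prefixes, entities, links, source) -> dict:
    mappings = relationship_agent(entity_agent(entities, source), links)
    return post_process({"prefixes": prefix_agent(prefixes), "mappings": mappings})


doc = generate_yarrrml(
    {"ex": "http://example.org/"},
    {
        "transaction": {"subject": "ex:tx/$(trans_num)", "class": "ex:CreditCardTransaction"},
        "merchant": {"subject": "ex:merchant/$(merchant)", "class": "ex:Merchant"},
    },
    [("transaction", "ex:atMerchant", "merchant")],
    "data.csv",
)
```

The resulting dictionary has the `prefixes` / `mappings` shape of a YARRRML file and could be serialised to YAML before the syntax check.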

7
Evaluation Framework

Four metric levels (pipeline health, KG fidelity, structural completeness, and CQ satisfaction), measured independently and combined.

L1 Pipeline Health
Always available · zero cost
Boolean flags: yarrrml_produced, yarrrml_syntactic_valid, yarrrml_translatable, rml_materializable, pipeline_success
Counters: retry_count, total_triples, total_latency_sec
Key: L1_* prefix in output dict
L2 Gold KG Comparison
URI-tolerant normalisation · P / R / F1
Triple level: norm_triple_precision, norm_triple_recall, norm_triple_f1, true_positives, false_positives, false_negatives
Schema level: predicate_precision, predicate_recall, predicate_f1, class_precision, class_recall, class_f1
Diagnostics: predicates_missing, predicates_extra, object_type_mismatches, total_generated, total_gold
Norm: subject → row ID · object URI → local name
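
The normalised triple comparison can be sketched as a set intersection after rewriting subjects and object URIs. The code below is an illustration, not the pipeline's implementation: it assumes the row ID is the last segment of the subject URI, keeps predicates untouched, and uses hypothetical helper names (`local_name`, `normalise`, `triple_prf`).

```python
from typing import Dict, Iterable, Set, Tuple

Triple = Tuple[str, str, str]


def local_name(term: str) -> str:
    """Return the part of a URI after the last '#' or '/'; literals pass through."""
    if term.startswith(("http://", "https://")):
        return term.rstrip("/#").replace("#", "/").rsplit("/", 1)[-1]
    return term


def normalise(triples: Iterable[Triple]) -> Set[Triple]:
    # subject -> row ID (assumed: last URI segment), object URI -> local name
    return {(local_name(s), p, local_name(o)) for s, p, o in triples}


def triple_prf(generated: Iterable[Triple], gold: Iterable[Triple]) -> Dict[str, float]:
    gen, ref = normalise(generated), normalise(gold)
    tp = len(gen & ref)
    fp = len(gen - ref)
    fn = len(ref - gen)
    precision = tp / (tp + fp) if gen else 0.0
    recall = tp / (tp + fn) if ref else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return {
        "norm_triple_precision": precision,
        "norm_triple_recall": recall,
        "norm_triple_f1": f1,
        "true_positives": tp,
        "false_positives": fp,
        "false_negatives": fn,
    }
```

Because both sides are reduced to local names, two graphs that mint subjects under different base IRIs can still match on the same row.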
L3 Column Coverage
Structural completeness via YARRRML template analysis
Primary (YARRRML refs): columns_total, columns_mapped_yarrrml, column_coverage_by_yarrrml ★, columns_missing_yarrrml
Parses all $(col) refs directly: 100% accurate ground truth
Secondary (literal value match): columns_mapped_value, column_coverage_by_value, columns_missing_value
Checks if first-row cell values appear as RDF literals in the KG
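
The primary coverage metric can be approximated with a regular expression over the mapping text. This is a sketch under assumptions: the function name is hypothetical, and whether the real `columns_mapped_yarrrml` field holds a count or a list is not stated on the slide (a count is used here).

```python
import re
from typing import Dict, List

# Matches YARRRML column references of the form $(column_name).
REF_PATTERN = re.compile(r"\$\(([^)]+)\)")


def column_coverage(yarrrml_text: str, csv_columns: List[str]) -> Dict[str, object]:
    referenced = set(REF_PATTERN.findall(yarrrml_text))
    mapped = [c for c in csv_columns if c in referenced]
    missing = [c for c in csv_columns if c not in referenced]
    coverage = len(mapped) / len(csv_columns) if csv_columns else 0.0
    return {
        "columns_total": len(csv_columns),
        "columns_mapped_yarrrml": len(mapped),
        "column_coverage_by_yarrrml": coverage,
        "columns_missing_yarrrml": missing,
    }


sample = """
mappings:
  transaction:
    s: ex:tx/$(trans_num)
    po:
      - [ex:amount, $(amt)]
"""
report = column_coverage(sample, ["trans_num", "amt", "is_fraud"])
```

Here `report` flags `is_fraud` as the only unmapped column, which is the kind of gap the secondary value-match check would also be expected to surface.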
L4 CQ / SPARQL Validation
Always computed when SPARQL results present
Counts: cq_total, cq_passed, cq_failed, cq_error
Coverage score: cq_coverage = passed / total
Null results (SPARQL gen errors) count as failures: conservative and honest.
CQ → SPARQL ASK → pyoxigraph execution on live KG
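
The counting rule can be made concrete with a few lines of Python. In this sketch each CQ result is `True` (ASK returned true), `False` (ASK returned false), or `None` (no SPARQL query could be generated). Tracking `None` under `cq_error` separately from `cq_failed` is an assumption; either way a `None` never counts as passed, so it lowers the coverage score.

```python
from typing import Dict, List, Optional


def cq_metrics(results: List[Optional[bool]]) -> Dict[str, float]:
    total = len(results)
    passed = sum(1 for r in results if r is True)
    failed = sum(1 for r in results if r is False)
    errors = sum(1 for r in results if r is None)
    return {
        "cq_total": total,
        "cq_passed": passed,
        "cq_failed": failed,
        "cq_error": errors,
        # Errors stay in the denominator, so they count against coverage.
        "cq_coverage": passed / total if total else 0.0,
    }


metrics = cq_metrics([True, False, None, True])
```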
How to run
python main.py --eval 1 2 3 --gold data/gold/bikeshare_gold.nt
python main.py --eval 1 2 3 4 --sparql --cqs "Which stations have docks?"
python main.py --eval 1 --shacl --sparql
8
Fully Configurable Per-Agent Pipeline

Every agent can be independently tuned: choose the right model for each role, control temperature for precision, and set retry limits per feedback loop.

🧠
Per-Agent LLM Model
Right model for the right job

Assign a reasoning model (e.g. DeepSeek-R1) to complex planning agents and a fast instruct model (e.g. Qwen Coder) to generation agents. One env var per role.

SCHEMA deepseek-r1-distill-qwen-14b reasoning
MAPPER deepseek-r1-distill-qwen-14b reasoning
YARRRML qwen2.5-coder-14b-instruct instruct
REFINER qwen2.5-coder-14b-instruct instruct
🌡️
Per-Agent Temperature
Determinism where it counts

Schema understanding, semantic mapping, and YARRRML/CQ generation run at 0.3; alignment and the refiner run at 0.2 to keep their output deterministic and parseable.

schema / mapper: 0.3
alignment: 0.2
yarrrml / cq: 0.3
refiner: 0.2
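
A per-agent configuration of this kind could be loaded as follows. The defaults are the models and temperatures shown on the slide; the environment variable names (`<ROLE>_MODEL`, `<ROLE>_TEMPERATURE`) and the `AgentConfig` class are hypothetical, since the slides only state that there is one variable per role.

```python
import os
from dataclasses import dataclass
from typing import Dict, Mapping


@dataclass(frozen=True)
class AgentConfig:
    model: str
    temperature: float


# Defaults taken from the slide; roles without a listed model are omitted.
DEFAULTS: Dict[str, AgentConfig] = {
    "SCHEMA": AgentConfig("deepseek-r1-distill-qwen-14b", 0.3),
    "MAPPER": AgentConfig("deepseek-r1-distill-qwen-14b", 0.3),
    "YARRRML": AgentConfig("qwen2.5-coder-14b-instruct", 0.3),
    "REFINER": AgentConfig("qwen2.5-coder-14b-instruct", 0.2),
}


def load_agent_configs(env: Mapping[str, str] = os.environ) -> Dict[str, AgentConfig]:
    """Overlay hypothetical <ROLE>_MODEL / <ROLE>_TEMPERATURE variables on the defaults."""
    return {
        role: AgentConfig(
            model=env.get(f"{role}_MODEL", default.model),
            temperature=float(env.get(f"{role}_TEMPERATURE", default.temperature)),
        )
        for role, default in DEFAULTS.items()
    }
```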
🔁
Retry Configuration
Hard-stop guards per feedback loop

Each feedback loop has an independent retry counter with a configurable max. When a loop hits its limit, the pipeline hard-stops rather than cycling endlessly.

Syntax loop (validate_yarrrml) max 10
Logic loop (refine_logic) max 6
SHACL loop (shacl_validate) max 5
CQ loop (sparql_validate_cqs) max 3
All limits are configurable via MAX_RETRIES_* env vars — no code changes needed.
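
As a sketch of how such limits might be read, the snippet below falls back to the values listed above when a variable is unset. The slide only gives the `MAX_RETRIES_*` pattern, so the concrete suffixes (`SYNTAX`, `LOGIC`, `SHACL`, `CQ`) and the function name are assumptions.

```python
import os
from typing import Dict, Mapping

# Defaults from the slide, keyed by an assumed variable suffix.
DEFAULT_MAX_RETRIES: Dict[str, int] = {"SYNTAX": 10, "LOGIC": 6, "SHACL": 5, "CQ": 3}


def load_retry_limits(env: Mapping[str, str] = os.environ) -> Dict[str, int]:
    """Read MAX_RETRIES_<LOOP> for each loop, using the default when it is unset."""
    return {
        loop: int(env.get(f"MAX_RETRIES_{loop}", default))
        for loop, default in DEFAULT_MAX_RETRIES.items()
    }
```

For example, exporting `MAX_RETRIES_CQ=5` before a run would raise only the CQ loop's limit under this naming assumption.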
9
KG Quality Evaluation — IMDB Dataset

Unified F1 comparison of all five pipelines against the same gold-standard KG (49 triples, Hofer et al.).

Ref A: GPT-4 0125-preview · Ref B: Claude 3 Opus (both from Hofer et al.) · Our α: Qwen 2.5 Coder 14B · Our β: Mistral Nemo 2407 · Our γ: DeepSeek-R1-Distill-Qwen-14B ✦ · Gold: 49 triples

All runs validated against Hofer et al.'s gold KG (49 triples) using their published test-case evaluation methodology. Reference pipelines reproduced from the original study.

Ref A triples: 56 (GPT-4 0125-preview)
Ref B triples: 40 (Claude 3 Opus)
Our α triples: 67 (Qwen 2.5 Coder 14B)
Our β triples: 77 (Mistral Nemo 2407)
Our γ triples ✦: 67 (DeepSeek-R1-Distill-Qwen-14B)
F1 Score Breakdown: all models vs gold KG (49 triples)
📚 Ref A, Ref B: Hofer et al. reference pipelines · 🤖 Our α, β, γ: Automap pipeline on local LLMs

Metric                                                        Ref A    Ref B    Our α     Our β    Our γ
                                                              GPT-4    Claude 3 Qwen 14B  Mistral  DeepSeek-R1
Entity & Subject Coverage: Subject IDs (fuzzy, IRI-agnostic)  1.000    1.000    0.957     1.000    1.000
Class Assignments: Class assignments (rdf:type)               0.846    1.000    0.815     0.385    0.846
Predicate Coverage: Predicate usage                           0.667    0.899    0.672     0.603    0.655
Value Fidelity: Literal values                                0.833    1.000    0.667     0.469    0.648
⌀ Average F1                                                  0.592    0.726    0.528     0.436    0.550

Our γ: −0.042 vs Ref A
🩵 DeepSeek-R1-Distill highlights
  • Best avg F1 of our three — 0.550
  • Only −0.042 below GPT-4 reference
  • Perfect entity recall
🔵 Qwen 14B highlights
  • Avg F1 0.528
  • Best literal values among our three (0.667)
🟣 Mistral Nemo highlights
  • Avg F1 0.436
  • Actors correctly typed dbo:Actor ✓ (unique)
10
Ablation Study — Credit Card Fraud Dataset

Kaggle Credit Card Fraud · Qwen 2.5-Coder-14B · 3 CQ-retry rounds · SHACL enabled. Two conditions: Auto-CQ vs Crafted-CQ.

Input CSV — 22 columns
trans_date_trans_time  cc_num              merchant          category       amt
2020-06-21T12:14:25    2291163933867244    fraud_Kirlin…     personal_care  2.86

first   last     gender  street               city      state  zip    lat       long
Jeff    Elliott  M       351 Darlene Green    Columbia  SC     29209  33.9659  -80.9355

city_pop  job                    dob         trans_num       unix_time   is_fraud
333497    Mechanical engineer    1968-03-19  2da90c7d74…     1371816865  0
🏆 Gold KG — 5 Entity Nodes
Tx
ex:CreditCardTransaction
trans_num, amt, isFraud, dateTime → madeBy, atMerchant, usesCard
Ch
schema:Person (cardholder)
givenName, familyName, gender, birthDate, jobTitle → address
Ad
schema:PostalAddress
street, city, state, zip, geo:lat, geo:long
Mc
ex:Merchant
name, merchantCategory, merch_lat, merch_long
Cc
ex:CreditCard
cardNumber → heldBy (cardholder)
🤖
Auto-CQ (generic)
  • Which transactions exist in the dataset?
  • What is the amount of a given transaction?
  • What is the merchant of a transaction?
  • What is the category of a transaction?
  • Is a given transaction fraudulent?
  • What is the cardholder's name?
  • What is the transaction date and time?
  • What is the card number used?
  • What is the merchant's location?
  • What is the cardholder's location?
⚠ Schema-level only

These questions require no joins, no aggregations, no geo reasoning. They don't encode the intended entity structure.

Auto-CQ generates → 3 nodes
CreditCardTransaction: hasPart, merchant as string
Merchant: lat/long only
Metadata: flat blob of all other columns
❌ No Person node · No CreditCard node · No Address node
✍️
Crafted-CQ (domain-specific)
  • Which transactions are flagged as fraudulent?
  • Which cardholders made >3 fraudulent transactions?
  • Total amount per merchant category?
  • Job and date of birth of a cardholder?
  • Cardholders in cities with pop > 100 000?
  • Merchants with most fraud flags?
  • Which categories are most fraud-associated?
  • Transactions >50 km from cardholder location?
  • Fraudulent transactions in a specific US state?
  • Transactions within the same hour?
  • Cardholder with multiple transactions in 10 min?
✅ Multi-hop · aggregation · geo-spatial

These questions encode the cardholder–card–merchant–address entity structure present in the gold KG.

Crafted-CQ generates → 4 nodes
CreditCardTransaction: isFraudulent, cardholder link
Merchant: has_part transaction link
Person: gender, jobTitle, birthDate ✓
Metadata: transDateTime, state, zip ✓
✓ Person node added · ✓ fraud flag exposed · ✗ still missing CreditCard + Address nodes
11
Auto-CQ vs Crafted-CQ — Results

CQ quality is the primary driver of KG structural fidelity.

+40.7
pp F1 gain
Triple F1: 0.139 → 0.546

Domain-specific CQs about fraud flags, cardholder–card ownership, merchant categories, and geo-location compel the pipeline to model the precise entity structure of the gold KG.

Metric                        🤖 Auto-CQ    ✍️ Crafted-CQ
Run statistics
Triples materialised          168           293
Retries                       3             3
Latency (s)                   1144          1354
Level 2: Gold KG comparison
Triple Precision              0.122         0.544
Triple Recall                 0.161         0.548
Triple F1 ★                   0.139         0.546
Predicate F1                  0.480         0.440
Class F1                      0.400         0.364
Level 3: Column coverage
YARRRML column coverage       1.000 ✓       1.000 ✓
Level 4: CQ satisfaction
CQs passed / total            4 / 10        2 / 11
CQ coverage                   0.400         0.182
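
As a quick consistency check, Triple F1 can be recomputed from the precision and recall reported in the table, along with the absolute difference between the two conditions. The helper name `f1` is illustrative; the inputs are the table values.

```python
def f1(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


auto_f1 = round(f1(0.122, 0.161), 3)      # Auto-CQ
crafted_f1 = round(f1(0.544, 0.548), 3)   # Crafted-CQ
gain_pp = round((crafted_f1 - auto_f1) * 100, 1)

print(auto_f1, crafted_f1, gain_pp)
```

Both recomputed F1 values agree with the table to three decimals (0.139 and 0.546), and their difference is 40.7 percentage points.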
Entity structure: what the pipeline generates
🤖 Auto-CQ (3 nodes): CreditCardTransaction, Merchant, Metadata (flat). ❌ No Person · No Card · No Address
✍️ Crafted-CQ (4 nodes): CreditCardTransaction, Merchant, Person ✓ NEW, Metadata (enriched). ✓ Person added · ✗ Card · ✗ Address
🏆 Gold (5 nodes): CreditCardTransaction, Merchant, Person (cardholder), PostalAddress, CreditCard
✅ Full column coverage

Both conditions achieve L3 = 1.0 — all 22 CSV columns referenced. The pipeline is structurally reliable regardless of CQ input.

⚠ The Satisfaction Paradox

Crafted CQs pass fewer SPARQL checks (18% vs 40%) but produce a better KG. Simpler questions are easier to satisfy but carry less structural signal.

🎯 Takeaway

CQ quality is the primary driver of KG structural fidelity. Investing in well-scoped, domain-specific questions substantially outweighs a higher satisfaction count on simpler auto-generated ones. The +40.7 pp Triple F1 gain (0.139 → 0.546) comes from CQs encoding the intent of the gold KG structure.

A
Thank you

Thanks for the attention — I would very much appreciate your feedback and any questions you might have.

Supervisor
Dr. Raúl García Castro
garcia-castro.com
Ontology Engineering Group
Dept. de Inteligencia Artificial
ETSI Informáticos · UPM
Naveen Varma Kalidindi
Universidad Politécnica de Madrid
PIONERA Project
✉ naveen.kalidindi@upm.es
Supervisor
Pablo Calleja
Profesor Permanente Laboral
Dept. de Inteligencia Artificial
ETSI Informáticos · UPM