
Building a Knowledge Graph for the Mahabharata: A Technical Deep Dive

How we turned 1.8 million words of Sanskrit epic into a navigable causal graph — and what the pipeline taught us about LLM evaluation, cost estimation, and knowing when "free" is the most expensive option.

For background on the project and its editorial approach, see the project overview.


The Pipeline: Five Stages from Text to Graph

We built a five-stage extraction pipeline that transforms the Mahabharata's narrative structure into a structured knowledge graph. Each stage runs offline, as a one-time batch job — no LLM runs at request time in production.

Stage 1: Text Ingestion (Python, no LLM)
  → raw chapter JSON (225 chapters for Adi Parva; 72 for Sabha)

Stage 2: Entity Extraction (Claude Sonnet Batch API for Adi; deepseek-official for Sabha+)
  → characters, places, weapons, relationships, narrative frame

Stage 3: Substory Segmentation (deepseek-official)
  → trigger / agency / resolution per substory

Stage 4: Entity Resolution (Python, no LLM)
  → canonical entity registry with merged aliases; cross-parva merge for Sabha+

Stage 5: Graph Assembly (deepseek-official + Neo4j)
  → TRIGGERS edges, Arc nodes, significance scores, static JSON export

Let me tell you what worked, what didn't, and what cost us money we didn't expect to spend.


Stage 2: Entity Extraction — From Claude Sonnet to DeepSeek

Adi Parva: Establishing the Baseline

For Adi Parva, entity extraction was the first place we needed to know what "good" looked like. The task requires reliably extracting structured JSON: characters with aliases, relationships typed to a controlled 9-type vocabulary, narrative frames identifying who is telling what story to whom. Schema adherence matters enormously.

We evaluated four models:

Model | Entity Overlap | Alias Recall | Reliability
Claude Sonnet | 100% (baseline) | 100% | 100%
Kimi K2 | 73% | — | 100%
qwen-32b | 100% | — | 17%
DeepSeek (NIM) | 82% | — | 17%

Kimi was reliable but missed 27% of entities. qwen-32b had surprisingly good entity recall but failed nearly 5 out of 6 times due to output truncation. DeepSeek had similar reliability problems. The decision was clear: Claude Sonnet via Batch API. The Batch API gives a 50% discount — the catch is a 24-hour turnaround, which is perfectly acceptable for an offline pipeline.

Adi Stage 2 cost: $4.93 for 223 chapters. That was 5× higher than our initial estimate — a lesson we'll come back to.

(Side note: we also accidentally resubmitted all 224 chapters with a --force flag while retrying some failures, wasting $6.85 before we could cancel. That flag is now banned. Use --retry-errors.)

Sabha Parva: Evaluating Cheaper Alternatives

Adi's $4.93 was reasonable for a one-off. But facing 16 more parvas, it was worth asking whether Claude Sonnet was truly necessary. For Sabha, we built a proper 10-chapter gold standard — hand-annotated against the source text — and ran a quantitative evaluation.

Model | Entity F1 | Alias Recall | Schema Valid | Cost/Parva | Verdict
deepseek-chat (official) | 0.93 | 0.83 | 100% | ~$0.02 | Winner
qwen-32b (NIM free) | 0.82 | 0.65 | 100% | $0 | Fallback
deepseek-reasoner | 0.84 | 0.80 | 80% | ~$0.05 | Too unreliable

DeepSeek official at F1=0.93 and 100% schema validity was close enough to Sonnet's ceiling that paying Sonnet prices was no longer defensible. At ~$0.02 per parva versus $4.93, it runs the entire remaining pipeline for the cost of a rounding error.

DeepSeek official is the default for all parvas from Sabha onward.

The alias recall gap (0.83 vs 1.0) is real but cosmetic: missed aliases mean some character names don't hyperlink in retellings. It doesn't affect graph structure.

One important finding from the Sabha eval: "Krishna" disambiguation is a required test case for any future model evaluation. Sabha Parva has three distinct people called Krishna — Krishna Vasudeva (the hero), Draupadi (dark-complexioned; kṛṣṇā is one of her names), and Krishna Dvaipayana (Vyasa). DeepSeek resolved correctly in 70/72 chapters. The 2 failures were caught by a downstream guard. Any model that can't disambiguate "Krishna" from context is disqualified.

Why Output Tokens Cost More Than You Think

Our original Stage 2 cost estimate was $18 for entity extraction across all 18 parvas. Reality for Adi alone: $4.93. The estimate was wrong by a factor of roughly 5.

The culprit is the asymmetry in API pricing that's easy to forget:

  • Input tokens: $3/M (or $1.50/M with Batch discount)
  • Output tokens: $15/M (or $7.50/M with Batch discount)

Output is 5× more expensive per token — and extracted JSON is verbose. Our extracted JSON averaged 2,071 output tokens per chapter versus our assumed 1,000. Add a ~3,500-token fixed overhead from the prompt template with two few-shot examples, and each chapter costs 2.2¢ rather than the estimated ~0.5¢.
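The arithmetic is worth writing down before committing a full parva. Batch rates and token counts below are the ones above; the average chapter-text length is an assumed figure for illustration.

# Back-of-envelope per-chapter cost at Batch API rates.
INPUT_PER_M, OUTPUT_PER_M = 1.50, 7.50   # $ per million tokens
PROMPT_OVERHEAD = 3_500                  # prompt template + two few-shot examples
CHAPTER_TEXT = 1_500                     # assumed average chapter length in tokens
OUTPUT_TOKENS = 2_071                    # measured average extracted JSON per chapter

cost = ((PROMPT_OVERHEAD + CHAPTER_TEXT) * INPUT_PER_M
        + OUTPUT_TOKENS * OUTPUT_PER_M) / 1_000_000
print(f"${cost:.3f} per chapter")        # about $0.023, and output tokens are two-thirds of it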

The lesson: always measure your actual output verbosity before estimating costs at scale.


The Actual Costliest Mistake: No Golden Set

The $6.85 wasted on the duplicate submission makes a funny story about a flag, but it didn't change anything structurally.

The actual costliest mistake — one that cost us an entire rebuild of Adi Parva — was running Stages 3 and 5 without a golden evaluation set: a small corpus of hand-verified chapters against which we could measure model output quality.

Without a golden set, model evaluation defaults to relative comparison. You run several models, compare their outputs to each other, note which one produces more plausible results, and call the most plausible one the winner. The problem is that relative comparison can't catch systematic errors that all models agree on — and it definitely can't catch cases where a cheap model is consistently wrong in a direction that only becomes visible at the parva level.

That's exactly what happened with Stage 3. Qwen-32B looked good in relative comparison. Its per-chapter outputs were schema-valid and seemed internally coherent. We only discovered the over-segmentation problem after running all 225 chapters and looking at the total: 724 substories, when the correct number was 479. A golden set of 10 hand-annotated chapters would have caught this before a single production chapter ran.

What a Golden Set Is

A golden set is the right answer for a representative sample of your data. For substory segmentation, that means: for these 10 chapters, here are the correct substory boundaries, titles, triggers, agency, and resolutions — verified against the source text, not inferred from model consensus.

The critical property is independence: the golden set can't be the same chapters you use as few-shot examples in your prompt. You're measuring generalization, not memorization.

Chapter selection matters more than chapter count. We needed diversity across the failure modes we cared about:

  • Dense narrative chapters: multiple events, complex causation, risk of false edges. Chapter 109 (Pandu's curse) spans four chapters in its causal chain — a model that only reads sequence will miss it.
  • Genealogical chapters: long lists of kings with no narrative arc. These are not substories. A model that creates substories from them is wrong.
  • Frame narration chapters: Janamejaya asking questions, Vaishampayana responding. The question is not caused by the answer. These are the most reliably mishandled chapter type.
  • Short chapters: one episode, one substory. The model should produce exactly one.
  • Retrospective chapters: events recounted out of order. The Mahabharata does this constantly. A model that reads text position as causal order fails here.

Ten well-chosen chapters catch systematic errors. Fifty would be more rigorous, but it would take 5× longer to annotate and deliver diminishing returns. For a pipeline that runs once per parva, 10 is the right tradeoff.
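A minimal harness makes the comparison concrete. The sketch below checks only grain, the substory count per chapter against the golden annotation, because that is exactly the error class relative comparison can't see; boundary, trigger, and resolution checks layer onto the same loop. The file shapes are assumptions, not the project's actual format.

import json

def grain_report(golden_path, model_path):
    # Both files assumed to map chapter_id -> list of substory dicts.
    with open(golden_path) as f:
        golden = json.load(f)
    with open(model_path) as f:
        model = json.load(f)
    total_gold = total_model = 0
    for chapter, gold_subs in sorted(golden.items()):
        got = model.get(chapter, [])
        total_gold += len(gold_subs)
        total_model += len(got)
        if len(got) != len(gold_subs):
            print(f"{chapter}: expected {len(gold_subs)} substories, got {len(got)}")
    print(f"grain ratio: {total_model / total_gold:.2f} (1.0 = correct grain)")

Run against 10 golden chapters, the over-segmentation described in the next section would have shown up as a grain ratio well above 1.0 before a single production chapter ran.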


Stage 3: The Free Model Gamble and What It Cost Us

The Stage 3 plan was to find a free model good enough for substory segmentation. We ran a systematic evaluation across 12 models on NVIDIA's NIM free tier.

Model | Parse Rate | Schema Valid | Entity Overlap | Speed / Notes
Minimax M2.1 | 100% | 83% | 78% | Fast (5–10s)
Qwen2.5-Coder 32B | 100% | 75% | 78% | Fast (5–10s)
qwen3.5-397B | — | — | — | Timeout (67% failure)
DeepSeek (NIM) | 100% | 50% | 82% | Unreliable
Llama-3.3-70B | 50% | — | — | Invents relations
Claude Sonnet | 100% | 100% | 100% | Batch only

The most interesting finding: model size doesn't predict quality. Qwen-32B (specifically the "Coder" variant) outperformed Qwen-397B, which was too creative and frequently timed out. The Coder variant's training on structured JSON paid off directly for schema adherence. "Thinking mode" on larger models caused 10+ minute hangs before producing anything useful. Disabled for all models after the first call.

Qwen2.5-Coder-32B looked like the clear winner: 100% parse rate, reasonable schema validity, free, and fast. We ran it on all 225 chapters of Adi Parva. Cost: $0.

What $0 Actually Bought Us

The Qwen output looked correct at the chapter level. Each chapter's substories were schema-valid, had plausible titles, and the T/A/R fields were grammatically coherent. The problem was grain — and grain is invisible when you evaluate one chapter at a time without a reference for what the right count should be.

Qwen split scenes. A two-part episode — a character receives a curse, then immediately departs as a result — became two substories instead of one. Beat reports got elevated to substory status. Frame narration boundaries were sometimes missed. None of these errors were catastrophic in isolation. Cumulatively, they produced 724 substories across Adi Parva when the correct number was closer to 479.

The rebuild using DeepSeek's official API — with a sharper grain heuristic and a correct distinction between episodes and beats — took two days and cost ~$0.13. The 724-substory pipeline ran for weeks before we caught the problem. The free model cost us more time than a paid model would have cost money.

For Sabha and every parva after: deepseek-official is the default for Stage 3. Not because it's cheap (though it is), but because it's predictably good. Zero timeouts. Consistent output quality. The cost difference between "free" and "almost free" is negligible compared to the cost of a rebuild.


Stage 5a: The Hardest Model Evaluation — Causal Reasoning

Generating TRIGGERS edges — the causal links between substories — required the most careful model evaluation. This is where the product lives or dies: if the causal chains are just narrative sequence dressed up as causation, the exploration engine has no real value.

We tested five models on two chapters: chapter 57 (a summary birth chapter with obvious causal chains) and chapter 109 (Pandu's curse — a delayed consequence spanning multiple chapters).

The Sequential vs. Causal Distinction

The core failure mode: LLMs trained on text have a strong prior to connect adjacent events. They read "A appeared before B in the text" as "A caused B." This is wrong almost all of the time in the Mahabharata.

The most egregious example was Llama-3.3-70B on chapter 57. It produced a chain connecting every birth in sequence:

Indra blesses Uparichara → Uparichara establishes sons → river Shuktimati → Uparichara conceives Matsya → Parashara and Satyavati → birth of Dvaipayana → birth of Vidura → birth of Karna → birth of Vishnu → birth of other warriors → birth of Pandavas

This is not causation — it's narration sequence. Vidura's birth doesn't cause Karna's birth. Llama produced zero cross-chapter edges from chapter 57. For a graph whose primary purpose is navigation across chapters, this is a critical failure.

Why DeepSeek Won

DeepSeek produced 19 edges with an average confidence of 0.95. Critically, it found cross-chapter causal links that other models missed entirely.

The most impressive: it correctly identified that chapter 57's brief birth mentions are summaries of events elaborated in later chapters. birth_of_dvaipayana (ch57) → ambika_conception (ch100), because Vyasa's birth in chapter 57 is the necessary precondition for the niyoga in chapter 100. The mechanism it produced:

"Vyasa's birth establishes him as the available half-brother of Vichitravirya's widows; this identity is the necessary condition for Satyavati to summon him to father Dhritarashtra, Pandu, and Vidura."

Compare to Llama's mechanism for the same sequence:

"The story of the river Shuktimati and Kolahala led to Uparichara's conception of Matsya."

The difference is mechanism specificity. Vague mechanisms — "led to," "resulted in," "set the stage for" — are useless for the causal chain panel that users actually read. The mechanism is the explanation. It has to be specific enough to teach you something.
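One cheap guard this suggests (illustrative, not the pipeline's actual check) is to lint each proposed edge's mechanism for stock causal phrases and for being too short to explain anything:

VAGUE_PHRASES = ("led to", "resulted in", "set the stage for", "paved the way")

def mechanism_is_vague(mechanism: str, min_words: int = 12) -> bool:
    # Flag stock connectives and one-liners; both thresholds are illustrative.
    text = mechanism.lower()
    return any(p in text for p in VAGUE_PHRASES) or len(text.split()) < min_words

It won't catch a confidently wrong mechanism, but it reliably catches the "A led to B" filler that teaches the reader nothing.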

Model | Edges Found | Avg Confidence | Cross-Chapter Edges | Verdict
DeepSeek V3 | 19 | 0.95 | ✅ Many | Winner
Kimi K2 | 15 | 0.87 | ✅ Yes | Strong second
Llama-3.3-70B | 15 | 0.84 | ❌ Zero (ch57) | Fails causation test
qwen3.5-397B | 2 | 0.97 | ✅ 1 | Too slow
qwen2.5-32B | — | — | — | Context overflow

We initially ran DeepSeek via NIM's free tier. Under sustained Stage 5 workloads, it hit ~40% timeouts. We switched to DeepSeek's official API: ~$0.05/parva, zero timeouts, same quality. 944 TRIGGERS edges for Adi Parva; 266 for Sabha.


Stage 5b: Arc Grouping and the "Dramatic Question" Rule

With 479 substories in Adi Parva, we needed to group them into story arcs. This surfaced a design question that turned out to be foundational: what makes a good arc?

An arc is NOT "substories about character X." It's NOT "substories in chapters 60-70." A good arc is a dramatic question with uncertain outcome:

  • "Will the Pandavas survive Duryodhana's plot to burn them alive?" (lac house conspiracy)
  • "Can Garuda obtain the amrita to free his mother from slavery?" (Garuda quest arc, 23 substories)
  • "Can Shakuntala reclaim her rightful place after Duhshanta forgets their marriage?"

The thesis must be a question. Topic labels like "events in the forest" create overlapping, useless arcs with no narrative logic.

One specific failure mode we had to guard against: the Mahabharata has a deeply nested frame structure. Sauti tells sages what Vaishampayana told Janamejaya about what happened to the Pandavas. This creates "frame narration" substories — chapters where Janamejaya asks Vaishampayana to tell him about Pandu's death. These frame substories must NOT be assigned to the arc they frame. One model we evaluated kept putting Janamejaya's questions into the Pandu arc. That model was disqualified.

The result: 39 arcs for Adi Parva, 11 arcs for Sabha Parva. Arc coverage across Adi: ~80% of narrative substories assigned (the remainder are transitional frames or genealogical interludes with no dramatic question).


Stage 6: Story Retellings

The final stage, generating narrative retellings, sits downstream of the five extraction stages and had the most interesting model evaluation.

The goal was retellings that feel like a skilled storyteller recounting the episode to an engaged listener. Not a textbook entry. Not a florid translation. The closest analogy is a great museum audio guide: knowledgeable, evocative, and respectful of the material without being dry.

We had a clear rubric for what we didn't want:

Too clinical: "Karna removed his armor and earrings and gave them to Indra. This made Karna vulnerable in battle."

Too florid: "With trembling hands, the great hero Karna, tears streaming down his sun-kissed face, tore the divine kavach from his own bleeding flesh as the treacherous Indra watched with barely concealed glee."

Right: "Karna knew exactly what he was giving away. The kavach and kundala — the divine armor and earrings he had been born with — were the only things that made him immortal in battle. Surya himself had warned him: Indra would come disguised as a Brahmin, and he would ask for the one thing Karna could not afford to give. Karna gave it anyway. He cut the armor from his own body and handed it over, because refusing a petitioner was something Karna simply could not do."

Note what the "right" version doesn't do: it doesn't tell you Karna's generosity represents a particular caste virtue, or that Indra's deception is morally condemnable, or that this moment prefigures some grand statement about fate. It describes what happened, gives you the emotional weight the text gives it, and trusts the reader to feel what needs to be felt.

We evaluated DeepSeek V3 against Claude Sonnet on five diverse substories:

Model | Score | Clean (no voice flags) | Avg Word Count
Claude Sonnet | 100/100 (reference) | 5/5 | ~620 words
DeepSeek V3 | 97/100 | 4/5 | ~650 words

A 3-point gap on a 100-point rubric. DeepSeek occasionally drifts into mild meta-commentary ("the story tells us") that violates the voice guide. But for bulk generation at $0.70 total for 479 Adi Parva retellings versus ~$7 for Sonnet, that's a clear win. The meta-commentary flags are fixable with a light edit pass.

The temperature question was worth investigating. At 0.1 (very low), the output was mechanically correct but flat. At 0.6, it started inventing emotional details not in the source text. We settled at 0.4: vivid enough to carry the reader, controlled enough to stay faithful.
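The call itself is nothing exotic. A minimal sketch, assuming DeepSeek's OpenAI-compatible endpoint; the real prompts carry the voice guide, the substory's structured fields, and the relevant source excerpts.

from openai import OpenAI

client = OpenAI(api_key="...", base_url="https://api.deepseek.com")

def retell(voice_guide: str, substory_context: str) -> str:
    response = client.chat.completions.create(
        model="deepseek-chat",
        temperature=0.4,   # vivid enough to carry the reader, controlled enough to stay faithful
        messages=[
            {"role": "system", "content": voice_guide},
            {"role": "user", "content": substory_context},
        ],
    )
    return response.choices[0].message.content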

Total: 479 Adi Parva retellings at ~$0.70; 154 Sabha Parva at ~$0.15. 633 retellings across two parvas for under a dollar.


The Graph Architecture: Why Neo4j, and Why Not in Production

The Case for a Graph Database

The Mahabharata's structure is inherently non-relational. The causal connection between Kunti's invocation of Surya (book 1) and Karna's death at Kurukshetra (book 8) is a path through a directed graph, not a join across tables.

We chose Neo4j for the build-time graph workbench because Cypher — its query language — reduces complex graph operations to a few readable lines. Computing the causal reach of every substory is this:

MATCH (s:Substory)-[:TRIGGERS*1..7]->(downstream:Substory)
WITH s, count(DISTINCT downstream) AS reach
SET s.causal_reach = reach

Writing that as hand-rolled BFS/DFS in Python would take three times longer and be three times harder to debug.
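For a sense of the tradeoff, here is roughly what the hand-rolled version of that same bounded reach looks like (adjacency as a dict of substory id to triggered ids is an assumed shape):

from collections import deque

def causal_reach(adjacency, source, max_hops=7):
    # Count distinct substories reachable from `source` within `max_hops` TRIGGERS hops.
    seen = {source: 0}
    queue = deque([source])
    while queue:
        node = queue.popleft()
        if seen[node] == max_hops:
            continue
        for neighbor in adjacency.get(node, ()):
            if neighbor not in seen:
                seen[neighbor] = seen[node] + 1
                queue.append(neighbor)
    return len(seen) - 1   # exclude the source itself

It works, but the hop bound, the deduplication, and the loop over every substory are now your code to maintain instead of the query planner's.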

The Graph Schema

The knowledge graph has 7 node types (Substory, Character, Arc, Place, Object, Theme, and Parva) and 9 edge types, each enabling a different navigation pattern. The five that matter most for navigation:

Edge | Enables
TRIGGERS (Substory→Substory) | Causal chain navigation — the primary pull mechanism
APPEARS_IN (Character→Substory) | Character arc timelines
RELATED_TO (Character→Character) | Relationship webs
PART_OF (Substory→Arc) | Arc browsing
HAS_BACKSTORY (Substory→Substory) | "Go deeper" progressive disclosure

Two parvas currently: 633 substory nodes, 582 character pages. The full 18-parva graph will likely have 15,000–25,000 TRIGGERS edges alone.

The Critical Constraint: TRIGGERS Must Form a DAG

Causal chains can't have cycles. "A caused B caused A" is logically incoherent and would create infinite loops in any traversal algorithm.

In practice, LLMs generate cycles at a rate of about 2.4% of edges. The Mahabharata's own structure causes this: the text frequently narrates events retrospectively. Chapter 93 contains Ganga's explanation of something that happened in chapter 91. The LLM correctly identifies a causal relationship, then assigns the source and target to the wrong substories.

We resolve cycles using a greedy Minimum Feedback Arc Set (MFAS) algorithm: find a cycle, remove the lowest-confidence edge, repeat. Not globally optimal (true MFAS is NP-hard), but sufficient for a 2.4% cycle rate. 36 removals resolved all cycles in Adi Parva; 8 in Sabha Parva.
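The greedy loop is small enough to sketch. This version leans on networkx for cycle detection; the pipeline's actual implementation may differ.

import networkx as nx

def enforce_dag(edges):
    # edges: iterable of (source_id, target_id, confidence) tuples
    g = nx.DiGraph()
    g.add_edges_from((s, t, {"confidence": c}) for s, t, c in edges)
    removed = []
    while True:
        try:
            cycle = nx.find_cycle(g)        # list of (u, v) edges forming one cycle
        except nx.NetworkXNoCycle:
            break
        weakest = min(cycle, key=lambda e: g.edges[e]["confidence"])
        g.remove_edge(*weakest)             # drop the least-confident edge on the cycle
        removed.append(weakest)
    return g, removed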

One specific failure mode worth naming: narrative frame contamination. LLMs generate edges like Parikshit's birth → Janamejaya's questions about Pandu — genealogically true, but the frame story asking is not caused by the events narrated within it. The prompts need explicit REJECT examples for this pattern.

Why Neo4j Is NOT in Production

Neo4j is a build tool, not a production dependency.

AuraDB Free auto-pauses after 72 hours of inactivity. Your site goes down. AuraDB Professional is $65/month — unacceptable for a $0/month hobby project. And at 600 substories, every graph traversal we need can be pre-computed in seconds.

The production site serves static JSON from a CDN. There is no database to go down and no API to time out.

Build time:  Neo4j → Cypher queries → Static JSON export
Serve time:  Vercel CDN → Static JSON → Next.js SSG pages

Total exported data: ~22 MB on disk. The entire knowledge graph fits in a single edge cache entry — no database, no API, no cold starts.

If we ever need dynamic graph queries (community detection, betweenness centrality, ad-hoc traversals), we can promote Neo4j to a production dependency. The data model, Cypher queries, and graph are already built, so the migration cost is near zero.


The Significance Scoring Problem

Not all substories are equal. We needed a way to compute which ones were pivotal versus minor — and we needed the scoring to be structural, not editorial opinion.

The formula:

Score = (0.35 × Causal Reach) + (0.25 × Character Centrality)
      + (0.25 × State Change Magnitude) + (0.15 × Narrative Recurrence)

Causal Reach is the most important factor: how many downstream substories are transitively connected via TRIGGERS edges (bounded to 7 hops). A substory that sets in motion a 10-event causal chain is objectively more significant than one with no downstream consequences.
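The combination itself is a one-liner once the four signals are computed; the sketch below assumes each has already been normalized to [0, 1], which is a detail the formula above doesn't pin down.

WEIGHTS = {"causal_reach": 0.35, "character_centrality": 0.25,
           "state_change": 0.25, "recurrence": 0.15}

def significance(features):
    # `features`: the four structural signals for one substory, each in [0, 1].
    return sum(weight * features[name] for name, weight in WEIGHTS.items())

The hard part lives upstream: causal reach is only meaningful if the TRIGGERS edges are causal rather than sequential, which is why Stage 5a got the most careful evaluation.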

The result for Adi Parva: 49 pivotal / 66 major / 161 supporting / 203 minor. Sabha Parva: 15 pivotal / 30 major / 59 supporting / 50 minor. The pivotal substories across both parvas — 64 total — form the essential reading spine.


The Entity Resolution Problem: Sanskrit Is Hard

Entity resolution — merging alias clusters across chapters — turned out to be uniquely difficult in Sanskrit texts.

Arjuna alone has more than 10 names: Arjuna, Partha, Dhananjaya, Savyasachi, Vijaya, Gudakesha, Phalguna, Kiriti, and several epithets. Across the Mahabharata's 1,000+ characters, there are an estimated 800–1,000 alias mappings.

The naive approach — fuzzy string matching — failed catastrophically. The problem is transitivity. If "Partha" = Arjuna and "Partha" = Bhima (it's a matronymic for all Pandava sons of Pritha), then your algorithm merges Arjuna and Bhima into one entity. Via a chain of shared epithets, you can merge all five Pandavas, then Krishna (who shares an epithet with Arjuna), then Karna (who shares a different epithet), until you have one megacharacter representing half the epic.

We implemented fuzzy matching, tested it, watched it produce catastrophic false merges, and deleted it.

The safe approach: 60+ manually curated merge rules, exact canonical name matching, and an explicit blocklist for genuinely ambiguous names — "Krishna" (refers to Krishna the god, Draupadi, and Vyasa), "Rama" (the god Rama AND Parashurama), "Bharadvaja" (the sage AND a patronymic for his son Drona).
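In code, the safe approach is almost embarrassingly simple; the value is in the curation, not the algorithm. The rules and names below are illustrative examples drawn from the cases above, not the actual 60+ rule registry.

MERGE_RULES = {                  # unambiguous alias -> canonical id, hand-curated
    "Dhananjaya": "arjuna",
    "Savyasachi": "arjuna",
    "Gudakesha": "arjuna",
}
AMBIGUOUS = {"Krishna", "Rama", "Bharadvaja"}    # never auto-merged; shared epithets like "Partha" need the same care

def resolve_alias(name, canonical_ids):
    # Exact, rule-based resolution only. Ambiguous names return None and are
    # left for context-aware handling upstream (the Stage 2 prompt).
    if name in AMBIGUOUS:
        return None
    if name in MERGE_RULES:
        return MERGE_RULES[name]
    candidate = name.lower().replace(" ", "_")
    return candidate if candidate in canonical_ids else None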

For Adi Parva: 1,049 source entity IDs → 1,006 canonical entities. For Sabha: an additional 245 source IDs merged into the registry, with cross-parva deduplication ensuring that Arjuna, Krishna, Bhishma, and every other recurring character carries a single canonical ID across both parvas.

The "Krishna problem" deserves its own note. In Sabha Parva, "Krishna" refers to at least three distinct people — Krishna Vasudeva (the hero), Draupadi (dark-complexioned; kṛṣṇā is one of her names), and Krishna Dvaipayana (Vyasa). The Stage 2 prompt must explicitly resolve "Krishna" to the correct canonical from context before generating entity IDs. Default-to-Vasudeva is wrong roughly 15% of the time in Sabha Parva.


What We've Learned

Evaluate against a golden set, not against other models. Relative comparison between models tells you which model produces more plausible output. It cannot tell you whether any of them are actually correct. Systematic errors — grain too coarse, grain too fine, wrong causal direction — will look identical across models if the failure mode is shared. The only way to catch them is to have a reference you built yourself from the source text, independent of any model output.

The right abstraction matters more than the right model. For causal edge generation, the gap between models wasn't raw capability — it was whether the model understood the distinction between sequential narration and actual causation. A prompt that teaches this distinction explicitly outperformed a larger model that hadn't internalized it.

"Free" is a cost that doesn't show up in your invoice. The NIM free tier saved us API costs on Stage 3 and Stage 5 of the initial Adi pipeline. The invisible cost was a full rebuild when the over-segmentation problem surfaced, and ongoing timeout debugging on Stage 5. DeepSeek's official API costs ~$0.20 for an entire parva end-to-end. The $0.20 is cheaper than the rebuild. Run cheap, not free.

The pipeline is more expensive than estimated, but by a predictable factor. Output tokens dominate LLM costs, not input tokens. Any cost estimate that doesn't account for output verbosity will be wrong by 2–5×. Measure actual output token counts on a sample before projecting costs at scale.

Sanskrit entity resolution requires human judgment. No algorithm can reliably merge "Partha" to Arjuna without also accidentally merging it to Bhima. The curated override list is not a workaround — it's the only correct approach for texts with dense epithet systems.

Static-first production is not a compromise. Every graph traversal pre-computed at build time means faster load times than any database can produce, zero operational cost, and zero downtime risk. At 600–5,000 substories, the pre-computation window is measured in seconds.

Staying close to the source is a design constraint, not just an editorial preference. Every time we were tempted to add an interpretive gloss — to say "this prefigures the war" or "Ekalavya's story shows the violence of the guru-shishya hierarchy" — we had to ask: is this in the text, or is this us? Almost always, it was us. The text is richer for the restraint.


The Mahabharata ends with a note from Vyasa himself: "What is found here may be found elsewhere. What is not found here will not be found elsewhere." That claim — made 2,000 years ago — has aged remarkably well. The challenge has always been finding it.