Methodology

"Seek to make every semantic decision traceable to evidence, and make every proposal testable with fast, repeatable queries." — guiding principle from this project's design document

This page explains how the Biblical Lexicon was built: how the senses were decided, how the glosses for lemmas and senses were composed, and how the semantic domains were assigned. The entire pipeline was designed to be reproducible, evidence-backed, and free of dependency on copyrighted lexical resources.

1. Source Data: Multilingual Translation Signatures

The foundation of this lexicon is a massive body of word-level glossed translations: 448,269 original-language word tokens (Hebrew, Aramaic, and Greek) across the entire Protestant canon, each with a 1:1 gloss in up to 43 target languages.

These glosses were produced by Bible.Systems, spurred by the work begun at GlobalBibleTools.com. Bible.Systems is a platform for computational tools related to the Bible, including a translation environment in which every Hebrew/Greek token has a corresponding translation unit in each target language. This 1:1 alignment is rare, and for this lexicon it is extremely valuable: it removes the alignment noise that plagues most multilingual NLP pipelines.

The key linguistic insight: translation divergence across language families is one of the strongest signals of polysemy. When the same Hebrew lemma maps to different glosses across unrelated languages (say Hindi, Mandarin, and Swahili), that almost certainly indicates distinct senses. A word that is always translated the same way across all 43 languages is very likely monosemous.

Languages used: South Asian (Hindi, Bengali, Urdu, Marathi, Telugu, Tamil, Gujarati, Punjabi), East Asian (Mandarin, Cantonese, Japanese, Korean, Vietnamese, Thai, Burmese), Southeast Asian (Indonesian, Javanese, Tagalog), European (Spanish, Portuguese, French, German, Italian, Dutch, Polish, Ukrainian, Russian, Romanian, Czech, Hungarian, Greek, Swedish, Norwegian, Danish), Middle Eastern (Arabic, Hebrew, Turkish, Persian), African (Swahili, Hausa, Yoruba, Amharic), and Caribbean (Haitian Creole). Of these, 17 languages had full Bible coverage and were used as primary clustering features.

2. Feature Materialization

The raw glosses were decomposed into structured features for computational use. Each gloss string was normalized (Unicode NFKC, dash normalization) and then split three ways:

  • Phrase features — the complete gloss string, e.g. make-known (weight: 3.0)
  • Head features — the last component word, e.g. known (weight: 2.0)
  • Part features — each component individually, e.g. make and known (weight: 1.0)

All tokens were dictionary-encoded to integers for fast computation. This materialization produced 7.5 million phrase features, 11 million part features, and 7.5 million head features—the raw material for sense induction.
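The sketch below illustrates the three-way split in Rust. It is a simplified stand-in, not the materialize-chirho source: the real binary applies full NFKC and dash normalization and dictionary-encodes every token to an integer, whereas this version keeps strings for readability.

    /// Decompose one gloss into weighted features.
    /// Minimal sketch; `trim` + `to_lowercase` stand in for full normalization.
    fn decompose(gloss: &str) -> Vec<(String, f32)> {
        let norm = gloss.trim().to_lowercase();
        let parts: Vec<&str> = norm.split('-').collect();
        let mut feats = Vec::new();
        feats.push((format!("phrase:{norm}"), 3.0)); // complete gloss string
        if let Some(head) = parts.last() {
            feats.push((format!("head:{head}"), 2.0)); // last component word
        }
        for p in &parts {
            feats.push((format!("part:{p}"), 1.0)); // each component individually
        }
        feats
    }

    fn main() {
        // "make-known" -> phrase: make-known (3.0), head: known (2.0),
        //                 parts: make (1.0), known (1.0)
        for (feat, w) in decompose("make-known") {
            println!("{feat} (weight {w})");
        }
    }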

The materialization was performed by a custom Rust binary (materialize-chirho) that processes the full 7.6M gloss rows in about 8 minutes on an Apple M4.

3. Sense Induction: Agglomerative Clustering

For each of the 14,639 lemmas, all occurrences were clustered based on their multilingual translation signatures. The algorithm:

  1. For each lemma, collect all token occurrences and their feature vectors across all available languages.
  2. Compute pairwise weighted Jaccard similarity between occurrences. Occurrences that are translated similarly across many languages land close together; occurrences that diverge in translation are pushed apart.
  3. Run agglomerative (hierarchical) clustering with a similarity threshold, cutting the dendrogram where translation signatures diverge.
  4. Each resulting cluster becomes a candidate sense.

This initial pass produced 56,976 candidate senses from 14,639 lemmas. Many lemmas were correctly identified as monosemous (one cluster). But the clustering was deliberately conservative—it was better to over-split than to lump genuinely distinct senses together, because merging is easier (and less destructive) than splitting.
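Step 2's statistic can be sketched as follows, assuming each occurrence is represented as a map from integer-encoded features (the weighted phrase/head/part tokens of section 2) to weights. The function name and map layout are illustrative:

    use std::collections::HashMap;

    /// Weighted Jaccard between two occurrences' feature maps:
    ///   J(a, b) = sum_f min(a[f], b[f]) / sum_f max(a[f], b[f])
    /// A sketch of the statistic, not the pipeline's exact code.
    fn weighted_jaccard(a: &HashMap<u32, f32>, b: &HashMap<u32, f32>) -> f32 {
        let (mut num, mut den) = (0.0, 0.0);
        for (f, &wa) in a {
            let wb = b.get(f).copied().unwrap_or(0.0);
            num += wa.min(wb);
            den += wa.max(wb);
        }
        // Features present only in b still count toward the denominator.
        for (f, &wb) in b {
            if !a.contains_key(f) {
                den += wb;
            }
        }
        if den == 0.0 { 0.0 } else { num / den }
    }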

4. LLM Merge Review (Pass 1: 10+ Senses)

The first merge pass focused on the most over-split lemmas: 1,919 lemmas that had been assigned 10 or more candidate senses. Each lemma was reviewed by Opus 4.6, Anthropic's most capable model at the time of this writing. For each lemma, the model was given:

  • All candidate sense clusters for the lemma
  • Representative occurrences in each cluster with glosses in 5–6 diverse languages
  • The BDB/LSJ lexicon entry for reference
  • Instructions to merge clusters that represent the same lexical sense while preserving genuinely distinct meanings

The model produced structured merge decisions (which clusters to combine, which to keep separate) with linguistic reasoning. This reduced the sense inventory from 56,976 down to 30,897 senses.
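The shape of such a decision can be pictured with a hypothetical Rust struct; the field names below are illustrative, not the actual response schema used by the pipeline:

    /// Hypothetical shape of one structured merge decision.
    struct MergeDecision {
        lemma: String,               // e.g. "H8130"
        merge_groups: Vec<Vec<u32>>, // candidate-sense ids to combine, one group per surviving sense
        keep_separate: Vec<u32>,     // candidate senses judged genuinely distinct
        reasoning: String,           // the model's linguistic justification
    }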

Quality gate: We tested both Opus 4.6 and Sonnet 4.5 on the same lemmas. Sonnet tended to collapse senses too aggressively (e.g., merging H8130 down to 2 senses where Opus correctly identified 6). Opus 4.6 was used exclusively for all merge decisions because building a real lexicon requires distinguishing subtle semantic differences.

5. Automated Cross-Lingual Merge (2–9 Senses)

For lemmas with 2–9 candidate senses (too numerous as a group for manual review, yet below the 10-sense cutoff of the LLM batch pass), we ran an automated agglomerative merge using cross-lingual Jaccard similarity between sense clusters:

  1. For each pair of senses within a lemma, compute the overlap of their multilingual gloss distributions.
  2. If two senses share more than a threshold proportion of their translations across multiple languages, merge them.
  3. Repeat until no more merges are possible.

This brought the count down to 26,523 senses.
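A minimal sketch of this merge loop, assuming each sense carries counts of its (language, gloss) pairs; the overlap metric shown is a plain Jaccard over those pairs, and the production statistic may differ:

    use std::collections::HashMap;

    /// A candidate sense: counts of (language, gloss) pairs observed in its cluster.
    struct SenseCluster {
        gloss_counts: HashMap<(String, String), u32>,
    }

    impl SenseCluster {
        fn absorb(&mut self, other: SenseCluster) {
            for (key, n) in other.gloss_counts {
                *self.gloss_counts.entry(key).or_insert(0) += n;
            }
        }
    }

    /// Overlap of two senses' multilingual gloss distributions.
    fn overlap(a: &SenseCluster, b: &SenseCluster) -> f32 {
        let inter = a.gloss_counts.keys().filter(|k| b.gloss_counts.contains_key(*k)).count();
        let union = a.gloss_counts.len() + b.gloss_counts.len() - inter;
        if union == 0 { 0.0 } else { inter as f32 / union as f32 }
    }

    /// Greedily merge the most-similar pair above `threshold`, repeating
    /// until no pair qualifies (the fixed point of steps 1–3).
    fn merge_until_stable(mut senses: Vec<SenseCluster>, threshold: f32) -> Vec<SenseCluster> {
        loop {
            let mut best: Option<(usize, usize, f32)> = None;
            for i in 0..senses.len() {
                for j in (i + 1)..senses.len() {
                    let sim = overlap(&senses[i], &senses[j]);
                    if sim > threshold && best.map_or(true, |(_, _, s)| sim > s) {
                        best = Some((i, j, sim));
                    }
                }
            }
            match best {
                Some((i, j, _)) => {
                    let absorbed = senses.swap_remove(j); // j > i, so index i stays valid
                    senses[i].absorb(absorbed);
                }
                None => return senses,
            }
        }
    }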

6. Opus Re-Merge (Full Re-Review)

To catch remaining over-splits and ensure consistency, all 1,499 still-polysemous lemmas were sent through a second full Opus review pass. This generated 4,443 merge response files, each containing structured decisions with reasoning.

After applying all merges, the final sense inventory settled at 20,143 senses—a 64.6% reduction from the initial 56,976. The distribution:

  • 12,261 monosemous lemmas (one sense)
  • 2,377 polysemous lemmas (average 3.3 senses each, maximum 20)

7. Sense Refinement: Split Proposals

After the merge passes, we ran a detection pass to find senses that may have been over-lumped—cases where a single sense cluster actually contained two or more distinct meanings (e.g., a literal and metaphorical use). The refinement pipeline:

  1. Detection: Scan all senses for high internal translation entropy or bimodal gloss distributions, which suggest that distinct sub-senses were merged together.
  2. Proposal generation: For each flagged sense, generate a split proposal with the evidence for separation.
  3. LLM reassignment: Opus 4.6 reviews each proposal and, where warranted, assigns occurrences to new sub-senses with appropriate labels.

This produced 168 sub-senses from 82 split proposals, covering 38,032 occurrences that were reassigned to more precise sense distinctions.
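The entropy signal in step 1 can be computed as Shannon entropy over a sense's gloss counts within one language. A sense translated one way almost everywhere scores near 0 bits; a 50/50 split between two glosses scores 1.0. A sketch (the production flagging threshold is not reproduced here):

    /// Shannon entropy (in bits) of one language's gloss distribution for a sense.
    /// High entropy across several languages suggests the cluster mixes meanings.
    fn gloss_entropy(counts: &[u32]) -> f64 {
        let total: u32 = counts.iter().sum();
        if total == 0 {
            return 0.0;
        }
        counts
            .iter()
            .filter(|&&c| c > 0)
            .map(|&c| {
                let p = f64::from(c) / f64::from(total);
                -p * p.log2()
            })
            .sum()
    }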

8. Semantic Domain Assignment

Each of the 20,143 senses was classified into a three-level semantic domain hierarchy:

8a. Macro Domains (32)

Broad conceptual categories like Communication, Emotion, Movement, Kinship, Worship, etc. These were seeded from a combination of Louw & Nida–style categories and SIL's open semantic domain hierarchy (CC BY-SA 4.0), then refined based on the actual distribution of senses in the data.

8b. Supersense Categories (93)

Mid-level groupings under each macro domain, following the structure of Louw & Nida's semantic classification but derived independently from our data.

8c. Fine Clusters (1,236)

These were produced by community detection (the Louvain algorithm) on a sense-to-sense similarity graph, built as sketched below. Each community of semantically related senses became a fine-grained domain cluster.
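A sketch of the graph construction, reusing the weighted Jaccard from section 3. The SenseSignature layout and the edge threshold are assumptions; the Louvain optimization itself (typically a graph-library routine) then runs over these weighted edges:

    use std::collections::HashMap;

    /// One sense's aggregate multilingual translation signature,
    /// keyed by integer-encoded feature (illustrative layout).
    struct SenseSignature {
        features: HashMap<u32, f32>,
    }

    /// Build the weighted sense-to-sense graph handed to community detection:
    /// an edge joins any two senses whose signatures overlap by at least `min_sim`.
    fn build_edges(senses: &[SenseSignature], min_sim: f32) -> Vec<(usize, usize, f32)> {
        let mut edges = Vec::new();
        for i in 0..senses.len() {
            for j in (i + 1)..senses.len() {
                let w = weighted_jaccard(&senses[i].features, &senses[j].features);
                if w >= min_sim {
                    edges.push((i, j, w));
                }
            }
        }
        edges
    }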

Assignment Method

Each sense was assigned to its nearest domain via centroid proximity: the translation signature of each sense was compared against the centroid (average signature) of each domain, and the best-fitting domain was selected. Multi-label assignment was allowed where a sense genuinely spans domain boundaries.
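A sketch of the assignment, with cosine similarity standing in for the proximity metric (the method is centroid proximity; the exact measure is not specified here, so this choice is an assumption):

    use std::collections::HashMap;

    /// Assign a sense to the best-fitting domain by centroid proximity.
    /// `domains` pairs a domain name with its centroid (the average
    /// signature of its member senses).
    fn assign_domain<'a>(
        sense: &HashMap<u32, f32>,
        domains: &'a [(String, HashMap<u32, f32>)],
    ) -> Option<&'a str> {
        domains
            .iter()
            .map(|(name, centroid)| (name.as_str(), cosine(sense, centroid)))
            .max_by(|x, y| x.1.total_cmp(&y.1))
            .map(|(name, _)| name)
    }

    fn cosine(a: &HashMap<u32, f32>, b: &HashMap<u32, f32>) -> f32 {
        let dot: f32 = a.iter().map(|(k, va)| va * b.get(k).copied().unwrap_or(0.0)).sum();
        let na = a.values().map(|v| v * v).sum::<f32>().sqrt();
        let nb = b.values().map(|v| v * v).sum::<f32>().sqrt();
        if na == 0.0 || nb == 0.0 { 0.0 } else { dot / (na * nb) }
    }

For multi-label assignment, one would keep every domain whose similarity clears a threshold rather than only the maximum.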

The result: 39,658 domain assignments connecting senses to the 2,135-node domain hierarchy.

9. Gloss Composition

For each lemma and sense, lexicon article text was composed in three tiers:

Tier 1: Terse Glosses

A concise English gloss for each lemma, summarizing its range of meaning in a sentence or two. These were generated by Opus 4.6 from the full evidence set (BDB/LSJ entries, multilingual gloss distributions, sense inventory). Coverage: 14,883 of 14,884 lemmas (99.99%).

Tier 2: Readable Glosses

A paragraph-length explanation of the lemma's semantic range, suitable for a dictionary article. These draw on the same evidence as the terse glosses but provide more context and usage notes.

Tier 3: Enriched Sense Descriptions

For each individual sense, a detailed description explaining what distinguishes this sense from the lemma's other senses. These include information about typical contexts, translation patterns, and semantic nuances. Coverage: 20,063 of 20,143 senses enriched.

Cross-gloss refinement: After initial composition, a refinement pass compared each lemma's sense descriptions against each other to ensure they were mutually distinguishable and that no two senses had been described in effectively identical terms.

10. Scripture References

Each sense is linked to its scripture references (the verses where that sense of the word occurs), providing complete traceability from the lexicon article back to the biblical text. The current database contains 174,434 sense–reference links.
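In SQLite terms this traceability is a simple link table. The sketch below uses hypothetical table and column names, not the production schema:

    use rusqlite::{Connection, Result};

    /// Fetch the verses where one sense occurs, from a hypothetical
    /// `sense_references` link table (sense_id, verse_ref).
    fn verses_for_sense(db: &Connection, sense_id: i64) -> Result<Vec<String>> {
        let mut stmt = db.prepare(
            "SELECT verse_ref FROM sense_references WHERE sense_id = ?1 ORDER BY verse_ref",
        )?;
        let rows = stmt.query_map([sense_id], |row| row.get::<_, String>(0))?;
        rows.collect()
    }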

11. Multilingual Sense Glosses

Beyond the English descriptions, each sense carries gloss labels in multiple display languages. The current database holds 348,858 sense–gloss records across 17 display languages (English, Spanish, French, German, and others). These are derived from the most frequent translations observed for each sense cluster in the corresponding language.
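Deriving these labels reduces to a frequency count per (sense, language) pair, as in this sketch (tie-breaking and normalization details are assumptions):

    use std::collections::HashMap;

    /// Derive one display gloss per language for a sense: the most frequent
    /// translation observed in that sense's cluster.
    fn top_gloss_per_language(occurrences: &[(String, String)]) -> HashMap<String, String> {
        // occurrences: (language, gloss) pairs for every occurrence of one sense
        let mut counts: HashMap<(String, String), u32> = HashMap::new();
        for (lang, gloss) in occurrences {
            *counts.entry((lang.clone(), gloss.clone())).or_insert(0) += 1;
        }
        let mut best: HashMap<String, (String, u32)> = HashMap::new();
        for ((lang, gloss), n) in counts {
            let entry = best.entry(lang).or_insert_with(|| (String::new(), 0));
            if n > entry.1 {
                *entry = (gloss, n);
            }
        }
        best.into_iter().map(|(lang, (gloss, _))| (lang, gloss)).collect()
    }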

12. Tools and Infrastructure

The pipeline was built with:

  • Rust — hot-path compute (feature materialization, clustering, similarity matrices, entropy computation) via a workspace of five crates
  • Bun/TypeScript — orchestration, LLM API calls, data import/export, CLI tools
  • SQLite (local, WAL mode) — primary working database for all intermediate and final results
  • Opus 4.6 (Anthropic) — LLM merge review, sense labeling, gloss composition
  • SvelteKit 2 + Cloudflare Workers — the web application you are reading now
  • Cloudflare D1 — production database serving the lexicon
  • Typst — PDF lexicon generation with Ezra SIL Hebrew and Gentium Plus Greek fonts

13. What This Is Not

This lexicon is computationally derived, not hand-curated by a team of lexicographers over decades. It should be understood as a preliminary draft—an evidence-backed starting point for further scholarly refinement, not a finished reference work. The sense boundaries, while grounded in cross-lingual translation evidence, have not been exhaustively verified by human experts.

The domain hierarchy was seeded from open-licensed resources (SIL's CC BY-SA 4.0 semantic domains) and refined computationally. It does not depend on Louw & Nida's copyrighted classification, though it follows a similar structural philosophy.

14. Pipeline Summary

Stage              Input                   Output                        Method
Import             800K words, 43 langs    7.6M glosses                  Postgres → SQLite
Materialize        7.6M glosses            7.5M phrases, 11M parts       Rust normalization + encoding
Cluster            14,639 lemmas           56,976 senses                 Agglomerative clustering
LLM merge (10+)    1,919 lemmas            30,897 senses                 Opus 4.6 review
Auto-merge (2–9)   Remaining polysemous    26,523 senses                 Cross-lingual Jaccard
Opus re-merge      1,499 lemmas            20,143 senses                 Opus 4.6 (4,443 files)
Refinement         Over-lumped senses      +168 sub-senses               Split detection + reassignment
Domains            20,143 senses           2,135 domain nodes            Community detection + centroid
Glosses            Full evidence set       14,883 lemma + 20,063 sense   Opus 4.6 composition

15. Open Source

The code, pipeline, and data are available on GitHub. The semantic domain ontology is released under CC BY-SA 4.0. To God be the glory.