Methodology
"Seek to make every semantic decision traceable to evidence, and make every proposal testable with fast, repeatable queries." — guiding principle from this project's design document
This page explains how the Biblical Lexicon was built: how the senses were decided, how the glosses for lemmas and senses were composed, and how the semantic domains were assigned. The entire pipeline was designed to be reproducible, evidence-backed, and free of dependency on copyrighted lexical resources.
1. Source Data: Multilingual Translation Signatures
The foundation of this lexicon is a massive body of word-level glossed translations: 448,269 original-language word tokens (Hebrew, Aramaic, and Greek) across the entire Protestant canon, each with a 1:1 gloss in up to 43 target languages.
These glosses were produced by Bible.Systems, spurred by the work begun at GlobalBibleTools.com. Bible.Systems is a platform for computational Bible tools, including a translation workspace in which every Hebrew/Greek token has a corresponding translation unit in each target language. For the purposes of this lexicon, this 1:1 alignment is unusual and extremely valuable: it removes the alignment noise that plagues most multilingual NLP pipelines.
The key linguistic insight: translation divergence across language families is one of the strongest signals of polysemy. When the same Hebrew lemma maps to different glosses across unrelated languages (say Hindi, Mandarin, and Swahili), that almost certainly indicates distinct senses. A word that is always translated the same way across all 43 languages is very likely monosemous.
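To make the idea concrete, a translation signature can be viewed as a small language-to-gloss map for each occurrence of a lemma. The sketch below uses invented placeholder glosses, not corpus data, purely to show how divergence between two occurrences would be detected:

```rust
use std::collections::BTreeMap;

// A translation signature: one gloss per target language for a single
// occurrence of a lemma. The glosses here are invented placeholders.
type Signature = BTreeMap<&'static str, &'static str>;

fn main() {
    let occurrence_a = Signature::from([("hin", "gloss-a"), ("cmn", "gloss-b"), ("swh", "gloss-c")]);
    let occurrence_b = Signature::from([("hin", "gloss-x"), ("cmn", "gloss-y"), ("swh", "gloss-c")]);

    // Count languages whose glosses differ between the two occurrences; heavy
    // divergence across unrelated languages suggests two distinct senses.
    let diverging = occurrence_a
        .iter()
        .filter(|&(lang, gloss)| occurrence_b.get(lang) != Some(gloss))
        .count();
    println!("{diverging} of {} languages diverge", occurrence_a.len());
}
```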
2. Feature Materialization
The raw glosses were decomposed into structured features for computational use. Each gloss string was normalized (Unicode NFKC, dash normalization) and then split three ways:
- Phrase features — the complete gloss string, e.g. `make-known` (weight: 3.0)
- Head features — the last component word, e.g. `known` (weight: 2.0)
- Part features — each component individually, e.g. `make` and `known` (weight: 1.0)
All tokens were dictionary-encoded to integers for fast computation. This materialization produced 7.5 million phrase features, 11 million part features, and 7.5 million head features—the raw material for sense induction.
The materialization was performed by a custom Rust binary (materialize-chirho)
that processes the full 7.6M gloss rows in about 8 minutes on an Apple M4.
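As a rough, self-contained sketch of this decomposition (not the materialize-chirho source), one gloss can be turned into weighted features along the following lines. The `unicode-normalization` crate is assumed for NFKC, and the set of dash variants handled here is illustrative:

```rust
// Minimal sketch of gloss-to-feature decomposition, following the weights
// described above. Not the materialize-chirho implementation.
use unicode_normalization::UnicodeNormalization; // crates.io: unicode-normalization

#[derive(Debug)]
struct Feature {
    kind: &'static str, // "phrase" | "head" | "part"
    text: String,
    weight: f32,
}

fn featurize(raw_gloss: &str) -> Vec<Feature> {
    // Unicode NFKC, then fold a few dash variants to ASCII '-' (illustrative set).
    let normalized: String = raw_gloss
        .nfkc()
        .collect::<String>()
        .replace('\u{2010}', "-") // hyphen
        .replace('\u{2013}', "-") // en dash
        .replace('\u{2014}', "-"); // em dash

    // Phrase feature: the whole normalized gloss.
    let mut features = vec![Feature { kind: "phrase", text: normalized.clone(), weight: 3.0 }];

    let parts: Vec<&str> = normalized.split('-').filter(|p| !p.is_empty()).collect();
    // Head feature: the last component word.
    if let Some(head) = parts.last() {
        features.push(Feature { kind: "head", text: head.to_string(), weight: 2.0 });
    }
    // Part features: every component word.
    for part in &parts {
        features.push(Feature { kind: "part", text: part.to_string(), weight: 1.0 });
    }
    features
}

fn main() {
    // "make-known" -> phrase "make-known" (3.0), head "known" (2.0),
    // parts "make" and "known" (1.0 each).
    for f in featurize("make-known") {
        println!("{f:?}");
    }
}
```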
3. Sense Induction: Agglomerative Clustering
For each of the 14,639 lemmas, all occurrences were clustered based on their multilingual translation signatures. The algorithm:
- For each lemma, collect all token occurrences and their feature vectors across all available languages.
- Compute pairwise weighted Jaccard similarity between occurrences. Occurrences that are translated similarly across many languages land close together; occurrences that diverge in translation are pushed apart.
- Run agglomerative (hierarchical) clustering with a similarity threshold, cutting the dendrogram where translation signatures diverge.
- Each resulting cluster becomes a candidate sense.
This initial pass produced 56,976 candidate senses from 14,639 lemmas. Many lemmas were correctly identified as monosemous (one cluster). But the clustering was deliberately conservative—it was better to over-split than to lump genuinely distinct senses together, because merging is easier (and less destructive) than splitting.
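To make step 2 concrete, a weighted Jaccard comparison between two occurrences' feature vectors can be sketched as follows. The sparse-vector type is a simplification; the production code lives in the project's Rust crates:

```rust
use std::collections::HashMap;

/// Feature id (dictionary-encoded integer) -> weight.
type FeatureVec = HashMap<u32, f32>;

/// Weighted Jaccard: sum of per-feature minima over sum of per-feature maxima.
fn weighted_jaccard(a: &FeatureVec, b: &FeatureVec) -> f32 {
    let mut min_sum = 0.0;
    let mut max_sum = 0.0;
    for (id, &wa) in a {
        let wb = b.get(id).copied().unwrap_or(0.0);
        min_sum += wa.min(wb);
        max_sum += wa.max(wb);
    }
    for (id, &wb) in b {
        if !a.contains_key(id) {
            max_sum += wb; // features present only in b still enlarge the union
        }
    }
    if max_sum == 0.0 { 0.0 } else { min_sum / max_sum }
}
```

Occurrences whose glosses agree across many languages score close to 1.0; occurrences whose translations diverge score near 0.0 and end up in separate clusters when the dendrogram is cut.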
4. LLM Merge Review (Pass 1: 10+ Senses)
The first merge pass focused on the most over-split lemmas: 1,919 lemmas that had been assigned 10 or more candidate senses. Each lemma was reviewed by Opus 4.6, Anthropic's most capable model at the time of this writing, which was given:
- All candidate sense clusters for the lemma
- Representative occurrences in each cluster with glosses in 5–6 diverse languages
- The BDB/LSJ lexicon entry for reference
- Instructions to merge clusters that represent the same lexical sense while preserving genuinely distinct meanings
The model produced structured merge decisions (which clusters to combine, which to keep separate) with linguistic reasoning. This reduced the sense inventory from 56,976 down to 30,897 senses.
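The response schema itself is not reproduced on this page; purely as an illustration, a structured merge decision could be modeled along these lines (field names are hypothetical, not the project's actual schema):

```rust
/// Hypothetical shape of one structured merge decision for a single lemma.
struct MergeDecision {
    /// The lemma under review.
    lemma_id: u32,
    /// Groups of candidate-sense ids to be combined into single senses.
    merge_groups: Vec<Vec<u32>>,
    /// Candidate-sense ids to keep as genuinely distinct senses.
    keep_separate: Vec<u32>,
    /// Short linguistic justification recorded alongside the decision.
    reasoning: String,
}
```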
5. Automated Cross-Lingual Merge (2–9 Senses)
For lemmas with 2–9 candidate senses (too many for manual review, too few for the LLM batch pass), we ran an automated agglomerative merge using cross-lingual Jaccard similarity between sense clusters:
- For each pair of senses within a lemma, compute the overlap of their multilingual gloss distributions.
- If two senses share more than a threshold proportion of their translations across multiple languages, merge them.
- Repeat until no more merges are possible.
This brought the count down to 26,523 senses.
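Sketched in code, this pass amounts to repeatedly merging the most similar pair of sense clusters until no pair clears the overlap threshold. The types, overlap measure, and threshold handling below are illustrative simplifications:

```rust
use std::collections::HashMap;

struct SenseCluster {
    /// (language, gloss) -> occurrence count for this candidate sense.
    gloss_counts: HashMap<(String, String), u32>,
}

impl SenseCluster {
    /// Pool another cluster's occurrences into this one.
    fn absorb(&mut self, other: SenseCluster) {
        for (k, v) in other.gloss_counts {
            *self.gloss_counts.entry(k).or_insert(0) += v;
        }
    }
}

/// Proportion of shared (language, gloss) mass between two sense clusters.
fn cross_lingual_overlap(a: &SenseCluster, b: &SenseCluster) -> f32 {
    let mut shared = 0u32;
    let mut total = 0u32;
    for (k, &ca) in &a.gloss_counts {
        let cb = b.gloss_counts.get(k).copied().unwrap_or(0);
        shared += ca.min(cb);
        total += ca.max(cb);
    }
    for (k, &cb) in &b.gloss_counts {
        if !a.gloss_counts.contains_key(k) {
            total += cb;
        }
    }
    if total == 0 { 0.0 } else { shared as f32 / total as f32 }
}

/// Merge the most similar pair above the threshold, repeat until stable.
fn auto_merge(mut senses: Vec<SenseCluster>, threshold: f32) -> Vec<SenseCluster> {
    loop {
        let mut best: Option<(usize, usize, f32)> = None;
        for i in 0..senses.len() {
            for j in (i + 1)..senses.len() {
                let sim = cross_lingual_overlap(&senses[i], &senses[j]);
                if sim >= threshold && best.map_or(true, |(_, _, s)| sim > s) {
                    best = Some((i, j, sim));
                }
            }
        }
        match best {
            Some((i, j, _)) => {
                let merged = senses.swap_remove(j); // j > i, so index i stays valid
                senses[i].absorb(merged);
            }
            None => return senses, // no pair exceeds the threshold: done
        }
    }
}
```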
6. Opus Re-Merge (Full Re-Review)
To catch remaining over-splits and ensure consistency, all 1,499 still-polysemous lemmas were sent through a second full Opus review pass. This generated 4,443 merge response files, each containing structured decisions with reasoning.
After applying all merges, the final sense inventory settled at 20,143 senses—a 64.6% reduction from the initial 56,976. The distribution:
- 12,261 monosemous lemmas (one sense)
- 2,377 polysemous lemmas (average 3.3 senses each, maximum 20)
7. Sense Refinement: Split Proposals
After the merge passes, we ran a detection pass to find senses that may have been over-lumped—cases where a single sense cluster actually contained two or more distinct meanings (e.g., a literal and metaphorical use). The refinement pipeline:
- Detection: Scan all senses for high internal translation entropy or bimodal gloss distributions, which suggest that distinct sub-senses were merged together.
- Proposal generation: For each flagged sense, generate a split proposal with the evidence for separation.
- LLM reassignment: Opus 4.6 reviews each proposal and, where warranted, assigns occurrences to new sub-senses with appropriate labels.
This produced 168 sub-senses from 82 split proposals, covering 38,032 occurrences that were reassigned to more precise sense distinctions.
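The detection heuristic in the first step can be approximated by a per-language Shannon entropy over each sense's observed glosses. This is an illustrative sketch with hypothetical thresholds; the real detector also considers bimodal distributions:

```rust
use std::collections::HashMap;

/// language -> (gloss -> occurrence count) for one sense.
type GlossDistribution = HashMap<String, HashMap<String, u32>>;

/// Shannon entropy (bits) of one language's gloss distribution for a sense.
fn gloss_entropy(counts: &HashMap<String, u32>) -> f64 {
    let total: u32 = counts.values().sum();
    if total == 0 {
        return 0.0;
    }
    counts
        .values()
        .map(|&c| {
            let p = c as f64 / total as f64;
            -p * p.log2()
        })
        .sum()
}

/// Flag a sense whose translations are highly mixed in several languages,
/// which suggests that distinct sub-senses were lumped together.
fn looks_over_lumped(dist: &GlossDistribution, entropy_threshold: f64, min_languages: usize) -> bool {
    dist.values()
        .filter(|per_lang| gloss_entropy(per_lang) > entropy_threshold)
        .count()
        >= min_languages
}
```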
8. Semantic Domain Assignment
Each of the 20,143 senses was classified into a three-level semantic domain hierarchy:
8a. Macro Domains (32)
Broad conceptual categories like Communication, Emotion, Movement, Kinship, Worship, etc. These were seeded from a combination of Louw & Nida–style categories and SIL's open semantic domain hierarchy (CC BY-SA 4.0), then refined based on the actual distribution of senses in the data.
8b. Supersense Categories (93)
Mid-level groupings under each macro domain, following the structure of Louw & Nida's semantic classification but derived independently from our data.
8c. Fine Clusters (1,236)
These were produced by community detection (Louvain algorithm) on a sense-to-sense similarity graph. Each community of semantically related senses becomes a fine-grained domain cluster.
Assignment Method
Each sense was assigned to its nearest domain via centroid proximity: the translation signature of each sense was compared against the centroid (average signature) of each domain, and the best-fitting domain was selected. Multi-label assignment was allowed where a sense genuinely spans domain boundaries.
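A minimal version of this assignment step might look like the following; the similarity measure (cosine here) and the multi-label rule are assumptions made for illustration, not necessarily the pipeline's exact choices:

```rust
use std::collections::HashMap;

/// Sparse translation signature: feature id -> weight.
type Signature = HashMap<u32, f32>;

/// Cosine similarity between two sparse signatures.
fn cosine(a: &Signature, b: &Signature) -> f32 {
    let dot: f32 = a.iter().map(|(k, va)| va * b.get(k).copied().unwrap_or(0.0)).sum();
    let na: f32 = a.values().map(|v| v * v).sum::<f32>().sqrt();
    let nb: f32 = b.values().map(|v| v * v).sum::<f32>().sqrt();
    if na == 0.0 || nb == 0.0 { 0.0 } else { dot / (na * nb) }
}

/// Assign a sense to its best-fitting domain; keep extra labels when the sense
/// is nearly as close to another domain centroid as to the best one.
fn assign_domains(sense: &Signature, centroids: &[(u32, Signature)], multi_label_ratio: f32) -> Vec<u32> {
    let mut scored: Vec<(u32, f32)> = centroids
        .iter()
        .map(|(id, c)| (*id, cosine(sense, c)))
        .collect();
    scored.sort_by(|a, b| b.1.total_cmp(&a.1));
    let best = scored.first().map(|(_, s)| *s).unwrap_or(0.0);
    scored
        .into_iter()
        .filter(|(_, s)| *s > 0.0 && *s >= best * multi_label_ratio)
        .map(|(id, _)| id)
        .collect()
}
```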
The result: 39,658 domain assignments connecting senses to the 2,135-node domain hierarchy.
9. Gloss Composition
For each lemma and sense, lexicon article text was composed in three tiers:
Tier 1: Terse Glosses
A concise English gloss for each lemma, summarizing its range of meaning in a sentence or two. These were generated by Opus 4.6 from the full evidence set (BDB/LSJ entries, multilingual gloss distributions, sense inventory). Coverage: 14,883 of 14,884 lemmas (99.99%).
Tier 2: Readable Glosses
A paragraph-length explanation of the lemma's semantic range, suitable for a dictionary article. These draw on the same evidence as the terse glosses but provide more context and usage notes.
Tier 3: Enriched Sense Descriptions
For each individual sense, a detailed description explaining what distinguishes this sense from the lemma's other senses. These include information about typical contexts, translation patterns, and semantic nuances. Coverage: 20,063 of 20,143 senses enriched.
10. Scripture References
Each sense is linked to its scripture references (the verses where that sense of the word occurs), providing complete traceability from the lexicon article back to the biblical text. The current database contains 174,434 sense–reference links.
11. Multilingual Sense Glosses
Beyond the English descriptions, each sense carries gloss labels in multiple display languages. The current database holds 348,858 sense–gloss records across 17 display languages (English, Spanish, French, German, and others). These are derived from the most frequent translations observed for each sense cluster in the corresponding language.
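As a rough sketch of how such labels can be derived, picking the most frequent observed gloss per display language for a sense is a simple aggregation (the data layout here is hypothetical):

```rust
use std::collections::HashMap;

/// Given (language, gloss) -> count observed over a sense's occurrences,
/// return the most frequent gloss per display language.
fn sense_gloss_labels(counts: &HashMap<(String, String), u32>) -> HashMap<String, String> {
    let mut best: HashMap<String, (String, u32)> = HashMap::new();
    for ((lang, gloss), &n) in counts {
        let entry = best.entry(lang.clone()).or_insert_with(|| (gloss.clone(), 0));
        if n > entry.1 {
            *entry = (gloss.clone(), n);
        }
    }
    best.into_iter().map(|(lang, (gloss, _))| (lang, gloss)).collect()
}
```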
12. Tools and Infrastructure
The pipeline was built with:
- Rust — hot-path compute (feature materialization, clustering, similarity matrices, entropy computation) via a workspace of five crates
- Bun/TypeScript — orchestration, LLM API calls, data import/export, CLI tools
- SQLite (local, WAL mode) — primary working database for all intermediate and final results
- Opus 4.6 (Anthropic) — LLM merge review, sense labeling, gloss composition
- SvelteKit 2 + Cloudflare Workers — the web application you are reading now
- Cloudflare D1 — production database serving the lexicon
- Typst — PDF lexicon generation with EzraSIL Hebrew and Gentium Plus Greek fonts
13. What This Is Not
This lexicon is computationally derived, not hand-curated by a team of lexicographers over decades. It should be understood as a preliminary draft—an evidence-backed starting point for further scholarly refinement, not a finished reference work. The sense boundaries, while grounded in cross-lingual translation evidence, have not been exhaustively verified by human experts.
The domain hierarchy was seeded from open-licensed resources (SIL's CC BY-SA 4.0 semantic domains) and refined computationally. It does not depend on Louw & Nida's copyrighted classification, though it follows a similar structural philosophy.
14. Pipeline Summary
| Stage | Input | Output | Method |
|---|---|---|---|
| Import | 800K words, 43 langs | 7.6M glosses | Postgres → SQLite |
| Materialize | 7.6M glosses | 7.5M phrases, 11M parts | Rust normalization + encoding |
| Cluster | 14,639 lemmas | 56,976 senses | Agglomerative clustering |
| LLM merge (10+) | 1,919 lemmas | 30,897 senses | Opus 4.6 review |
| Auto-merge (2–9) | Remaining polysemous | 26,523 senses | Cross-lingual Jaccard |
| Opus re-merge | 1,499 lemmas | 20,143 senses | Opus 4.6 (4,443 files) |
| Refinement | Over-lumped senses | +168 sub-senses | Split detection + reassignment |
| Domains | 20,143 senses | 2,135 domain nodes | Community detection + centroid |
| Glosses | Full evidence set | 14,883 lemma + 20,063 sense | Opus 4.6 composition |
15. Open Source
The code, pipeline, and data are available on GitHub. The semantic domain ontology is released under CC BY-SA 4.0. To God be the glory.