
How Diwan works

A brief engineering dossier on Diwan's search architecture, with quality measured against a published academic benchmark.

01 The corpus at a glance

Diwan indexes 6.57M Arabic verses: 6.44M classical verses from the open ashaar dataset, and 125K Nabati verses from the tarab dataset. All verses pass through cleaning and deduplication before entering the index, and they span twelve literary eras from pre-Islamic to modern.
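The cleaning and deduplication step can be pictured with a minimal sketch. The normalization rules below (dropping diacritics and tatweel, collapsing whitespace) are assumptions chosen for illustration, and `normalize_verse` and `deduplicate` are hypothetical names; the actual pipeline's rules are not described here.

```python
import re
import unicodedata

TATWEEL = "\u0640"  # Arabic elongation character, purely typographic


def normalize_verse(text):
    """Build a comparison key: no diacritics, no tatweel, single spaces.

    Assumption: two verses that differ only in these features count as
    duplicates. The real cleaning rules may differ.
    """
    # Arabic short-vowel marks are non-spacing marks (Unicode category Mn).
    no_marks = "".join(ch for ch in text if unicodedata.category(ch) != "Mn")
    no_tatweel = no_marks.replace(TATWEEL, "")
    return re.sub(r"\s+", " ", no_tatweel).strip()


def deduplicate(verses):
    """Keep the first occurrence of each verse, compared by normalized key."""
    seen = set()
    unique = []
    for verse in verses:
        key = normalize_verse(verse)
        if key and key not in seen:
            seen.add(key)
            unique.append(verse)
    return unique
```

Keeping the original text and comparing only on the normalized key means the indexed verse retains its diacritics when the source provides them.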

02 Search by meaning, not by letters

Traditional search matches letters and words. Diwan understands meaning. When you search for "longing for home," it surfaces verses that express the feeling — even when the words "longing" and "home" never appear in the text.

03 From verse to vector

Every verse is encoded into a 384-dimensional vector that summarizes its meaning, using the Arabic-Triplet-Matryoshka-V2 model (triplet-trained on a broad Arabic corpus). Verses close in meaning end up close in this vector space. The model's native output is 768 dims; we truncate to the first 384 as a memory/latency trade-off, and validate the resulting quality empirically on Fann-or-Flop rather than assuming it from the published Matryoshka tiers.
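The truncation step described above can be sketched in a few lines of plain Python. This assumes the truncated vector is re-normalized to unit length so that a dot product equals cosine similarity; that re-normalization, and the function names, are illustrative assumptions rather than a description of Diwan's code.

```python
import math


def truncate_and_normalize(vec, dims=384):
    """Keep the first `dims` components and rescale to unit length.

    Assumption: vectors are compared by cosine similarity, so the truncated
    vector is re-normalized after the tail is dropped.
    """
    head = list(vec[:dims])
    norm = math.sqrt(sum(x * x for x in head))
    if norm == 0.0:
        return head  # degenerate all-zero vector; nothing to rescale
    return [x / norm for x in head]


def cosine(a, b):
    """Cosine similarity of two unit-length vectors (a plain dot product)."""
    return sum(x * y for x, y in zip(a, b))


# A stand-in for a 768-dimensional model output.
full_vector = [0.5] * 768
short_vector = truncate_and_normalize(full_vector)
```

Halving the dimension halves the memory per verse, which is the trade-off the paragraph above refers to; whether quality survives the cut is an empirical question, which is why it is checked on the benchmark.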

04 Searching millions in a few seconds

To search 6.57M verses in a few seconds, Diwan uses an HNSW (Hierarchical Navigable Small World) approximate-nearest-neighbor index. Instead of scanning every verse, the index reaches the most meaning-adjacent candidates quickly, on a single server.
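To give a feel for why a graph index avoids a full scan, here is a toy sketch of greedy routing over a single-layer neighbor graph, the basic step that HNSW repeats across a hierarchy of layers. It is not HNSW itself and not Diwan's index: the brute-force graph construction, the function names, and the parameter `m` are all illustrative assumptions.

```python
import math


def build_knn_graph(points, m=2):
    """Link every point to its m nearest neighbors (brute force, toy sizes only)."""
    graph = {}
    for i, p in enumerate(points):
        others = sorted(
            (j for j in range(len(points)) if j != i),
            key=lambda j: math.dist(p, points[j]),
        )
        graph[i] = others[:m]
    return graph


def greedy_search(points, graph, query, entry=0):
    """Walk the graph, always moving to the neighbor closest to the query.

    Stops when no neighbor improves on the current node. The result is
    approximate: the walk can stop at a local minimum.
    """
    current = entry
    current_d = math.dist(points[current], query)
    while True:
        best, best_d = current, current_d
        for nb in graph[current]:
            d = math.dist(points[nb], query)
            if d < best_d:
                best, best_d = nb, d
        if best == current:
            return current
        current, current_d = best, best_d


points = [(0.0, 0.0), (1.0, 0.0), (2.0, 0.0), (3.0, 0.0), (4.0, 0.0), (5.0, 0.0)]
graph = build_knn_graph(points, m=2)
nearest = greedy_search(points, graph, query=(4.2, 0.0))
```

The walk only measures distances to the neighbors of the nodes it visits, not to every point. A real HNSW index adds sparse upper layers for long jumps and keeps a candidate list instead of a single current node, which makes local minima much less likely.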

05 Two parallel corpora: Classical and Nabati

Classical (fusha) and Nabati poetry live in two separate indexes. Their vocabulary, themes, and meters differ; keeping them apart produces cleaner results than merging them, and a search in one does not pollute the other.

06 How we know it works: the Fann-or-Flop benchmark

We evaluated Diwan against Fann-or-Flop (EMNLP 2025), a peer-reviewed benchmark of 53,047 scholar-written explanations paired with the verses they describe. The task: given an explanation, retrieve the correct poet out of thousands. Traditional keyword search fails this task completely: you cannot match a scholar's prose description to the right poet by word overlap, because the explanation shares almost no words with the verse itself. That is exactly why we chose it. Diwan places the correct poet in the top 10 for 3.13% of queries, about 22× the random baseline of ~0.14%. To our knowledge, this is the first time an Arabic poetry search engine has been evaluated against a published academic benchmark.
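The headline numbers can be reproduced with simple arithmetic. The sketch below assumes each query yields a ranked list of poet names and counts a hit when the gold poet appears among the first k; this is one plausible reading of "poet in top-10", and the function names and toy data are hypothetical.

```python
def poet_hit_rate_at_k(ranked_poets, gold_poets, k=10):
    """Fraction of queries whose gold poet appears in the first k retrieved poets."""
    hits = sum(
        1 for ranked, gold in zip(ranked_poets, gold_poets) if gold in ranked[:k]
    )
    return hits / len(gold_poets)


def lift_over_random(hit_rate, random_rate):
    """How many times better than the random baseline a hit rate is."""
    return hit_rate / random_rate


# Toy data: four queries, each with a ranked list of retrieved poets.
toy_ranked = [["a", "b", "c"], ["d", "e", "f"], ["g", "h", "i"], ["x", "y", "z"]]
toy_gold = ["b", "z", "q", "x"]
toy_rate = poet_hit_rate_at_k(toy_ranked, toy_gold, k=2)

# The figures quoted in this section: 3.13% against a ~0.14% random baseline.
reported_lift = lift_over_random(0.0313, 0.0014)
```

With the quoted figures the ratio comes out at roughly 22.4, consistent with the "about 22×" stated above.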

07 Engineering notes: what we tested and rejected

We empirically tested SILMA (silma-ai/silma-embedding-matryoshka-v0.1, a Saudi-trained Arabic embedding model) against the same benchmark. Quality regressed across every era (3.13% → 1.03% poet-in-top-10), so we rolled back to the original model. Model selection in Diwan is evidence-driven, not narrative-driven.

08 Era-aware re-ranking and a diversity cap

Modern poets often name the query's theme literally, crowding out older canonical poets who embody the theme without naming it. Diwan applies a small re-rank to give classical eras (Jahili, Umayyad, Abbasid, Andalusian…) a fair share of the top results, plus a soft cap on how many results from one era can appear together. The adjustment never promotes irrelevant verses — it only re-orders near-tied candidates toward older, more diverse output.
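One way to picture the re-rank and the soft cap is the sketch below. The bonus size, the cap, the tuple layout, and the function name are all assumptions made for illustration; the era set contains only the four eras named above and is not the full list Diwan uses.

```python
from collections import Counter

# Only the eras named in the text; the real list is longer.
CLASSICAL_ERAS = {"Jahili", "Umayyad", "Abbasid", "Andalusian"}


def rerank(candidates, era_bonus=0.02, era_cap=3):
    """Re-order (verse_id, era, score) candidates; higher score is better.

    Step 1: add a small bonus to classical eras, so only near-tied
    candidates can swap places.
    Step 2: soft cap. Once an era has era_cap results, further results
    from that era are moved to the end rather than dropped.
    """
    adjusted = sorted(
        candidates,
        key=lambda c: c[2] + (era_bonus if c[1] in CLASSICAL_ERAS else 0.0),
        reverse=True,
    )
    kept, deferred = [], []
    per_era = Counter()
    for cand in adjusted:
        if per_era[cand[1]] < era_cap:
            kept.append(cand)
            per_era[cand[1]] += 1
        else:
            deferred.append(cand)
    return kept + deferred


candidates = [
    ("m1", "Modern", 0.80),
    ("a1", "Abbasid", 0.79),
    ("m2", "Modern", 0.78),
    ("m3", "Modern", 0.77),
    ("j1", "Jahili", 0.50),
]
order = [verse_id for verse_id, _, _ in rerank(candidates, era_bonus=0.02, era_cap=2)]
```

In this toy run the Abbasid verse overtakes the first Modern verse because the two were within the bonus of each other, and the third Modern verse is pushed behind the cap. Note that a cap of this kind can move a weaker verse from another era ahead of the deferred ones, so in practice it would be applied only to a candidate pool that has already passed a relevance cutoff.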

09 Open data, self-hosted

All verse data comes from publicly available, open datasets (linked in the site footer). The entire infrastructure runs on a single dedicated server with no dependency on external APIs. Texts, indexes, model, and search logic are all local and auditable. The project spans about 10,600 lines of code: 3,922 lines for the frontend, 879 for the backend, and 5,827 for the cleaning, training, and indexing pipelines.

10 API coming soon

We are preparing a public API so researchers, universities, and cultural institutions can access Diwan's semantic search programmatically — for academic work, educational tools, and downstream products. Early-access requests are welcome.