Deewan · Benchmark

Academic benchmark

Diwan's results on Fann-or-Flop (EMNLP 2025)

Why benchmark?

Numbers are what separate a real search engine from a claim. We evaluated Diwan against a published, peer-reviewed academic benchmark rather than against examples we picked ourselves.

The task

Fann-or-Flop contains 53,047 scholar-written prose explanations of Arabic verses, each paired with the verse's author. The task: given only an explanation, retrieve the correct poet out of thousands. Traditional keyword search fails here because the explanation shares almost no words with the verse it describes.
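As a minimal sketch of how a "correct poet in the top-10" rate could be computed, the snippet below scores ranked results against gold poets. It assumes that "unique poet" means duplicate poets are collapsed before taking the top k; that reading, along with the function names and the toy data, is an assumption for illustration and not Diwan's actual evaluation code.

```python
from typing import Sequence


def unique_poets_top_k(ranked_poets: Sequence[str], k: int = 10) -> list[str]:
    """Return the first k distinct poets, preserving ranked order."""
    seen: list[str] = []
    for poet in ranked_poets:
        if poet not in seen:
            seen.append(poet)
            if len(seen) == k:
                break
    return seen


def poet_hit_rate(
    results: Sequence[Sequence[str]], gold: Sequence[str], k: int = 10
) -> float:
    """Fraction of queries whose gold poet is among the top-k distinct poets.

    results[i] lists the poet of each retrieved verse for query i, best first.
    gold[i] is the poet who wrote the verse that explanation i describes.
    """
    if not gold or len(results) != len(gold):
        raise ValueError("results and gold must be non-empty and equal in length")
    hits = sum(
        1 for ranked, poet in zip(results, gold)
        if poet in unique_poets_top_k(ranked, k)
    )
    return hits / len(gold)


# Toy data (hypothetical): two queries, scored at k=2.
toy_results = [["poet_a", "poet_a", "poet_b"], ["poet_c", "poet_d"]]
toy_gold = ["poet_b", "poet_x"]
print(poet_hit_rate(toy_results, toy_gold, k=2))
```

Collapsing duplicates matters when one prolific poet fills several of the top slots: without it, the first toy query would miss poet_b at k=2.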

The result

Diwan places the correct poet in the top-10 results for 3.13% of queries, against a random baseline of ~0.14%. That is roughly a 22× improvement over random. To our knowledge, this is the first time an Arabic poetry search engine has been measured against a published academic benchmark.

[Bar chart: top-10 unique poet match rate. Random 0.14%, Diwan 3.13%, SILMA 1.03%; axis from 0% to 4%.]

Higher is better. 22× random, measured on the full 6.57M-verse index.

Correct-poet-in-top-10 rate

System                                  Rate
Random baseline                         ~0.14%
Diwan (Arabic-Triplet-Matryoshka-V2)    3.13%
SILMA (rejected after benchmarking)     1.03%
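The 22× figure is simply the ratio of a system's rate to the random baseline. A short check using the rates reported above (the variable names are illustrative, and because the baseline is itself approximate, the ratios are approximate too):

```python
# Rates in percent, as reported on this page; the random baseline is approximate.
random_rate = 0.14
system_rates = {"Diwan": 3.13, "SILMA": 1.03}

# Lift over random = system rate / random rate.
lift = {name: rate / random_rate for name, rate in system_rates.items()}

for name, value in lift.items():
    print(f"{name}: {value:.1f}x random")
```

By the same arithmetic, SILMA's 1.03% works out to roughly 7× random.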

Planned comparisons (honest note)

A fair, same-scale comparison against other retrievers requires re-embedding Diwan's full 6.57M-verse corpus with each baseline model, a compute-intensive task planned for a follow-up technical report. Our initial runs on Fann-or-Flop's native pool alone were not informative: the pool is heavily skewed toward a few well-represented poets, so even a random baseline scored ~40% there, leaving little room to discriminate between systems. The 3.13% headline above is measured on the full Diwan corpus, where random scores ~0.14% and the 22× gap is meaningful. The planned baselines are:

  • BM25 over the full 6.57M-verse index — classical keyword baseline
  • AraBERT embeddings encoded across the full corpus — Arabic-specific baseline
  • BGE-M3 and multilingual-E5 encoded across the full corpus — multilingual SOTA baselines
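To illustrate why a skewed pool inflates the random baseline, as noted for Fann-or-Flop's native pool, here is a small sketch on synthetic poet distributions. Everything in it is assumed for illustration: the poet counts and shares are made up (they are not the benchmark's real distribution), queries are assumed to be drawn in proportion to each poet's share, and the random retriever is modeled as k independent draws with replacement.

```python
def expected_random_hit_rate(shares: list[float], k: int = 10) -> float:
    """Expected top-k hit rate of a retriever that returns k random verses.

    shares[i] is poet i's fraction of the pool. A query about poet i arrives
    with probability shares[i], and at least one of k independent draws lands
    on that poet with probability 1 - (1 - shares[i]) ** k.
    """
    if abs(sum(shares) - 1.0) > 1e-6:
        raise ValueError("shares must sum to 1")
    return sum(s * (1.0 - (1.0 - s) ** k) for s in shares)


# Synthetic pool A: 1,000 poets with equal shares.
uniform = [1.0 / 1000] * 1000

# Synthetic pool B: 3 poets hold 60% of the verses, 997 poets share the rest.
skewed = [0.2] * 3 + [0.4 / 997] * 997

uniform_rate = expected_random_hit_rate(uniform)
skewed_rate = expected_random_hit_rate(skewed)

print(f"uniform pool: {uniform_rate:.2%}")
print(f"skewed pool:  {skewed_rate:.2%}")
```

With the same number of poets, the uniform pool gives a random hit rate of about 1%, while the skewed pool exceeds 50%. A baseline that high leaves little headroom to separate one retriever from another.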

Researchers with dedicated compute budget who are interested in collaborating on the full-scale comparison are welcome to get in touch.

Citation

You can cite this page and the original Fann-or-Flop paper. A detailed technical report is in preparation.