Academic benchmark
Diwan's results on Fann-or-Flop (EMNLP 2025)
Why benchmark?
Numbers are what separate a real search engine from a claim. We evaluated Diwan against a peer-reviewed, published academic benchmark rather than against examples we picked ourselves.
The task
Fann-or-Flop contains 53,047 scholar-written prose explanations of Arabic verses, each paired with the verse's author. The task: given only the explanation, retrieve the correct poet out of thousands. Traditional keyword search fails here because the explanation shares almost no words with the verse it describes.
The result
The correct poet appears in the top-10 results for 3.13% of queries. Random baseline: ~0.14%. That's a 22× improvement over random. To our knowledge, this is the first time an Arabic poetry search engine has been measured against a published academic benchmark.
Higher is better. 22× random, measured on the full 6.57M-verse index.
| System | Top-10 hit rate |
|---|---|
| Random baseline | ~0.14% |
| Diwan (Arabic-Triplet-Matryoshka-V2) | 3.13% |
| SILMA (rejected after benchmarking) | 1.03% |
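For readers who want to reproduce this kind of figure, here is a minimal sketch of how a top-10 hit rate and its multiple over random could be computed. It is not Diwan's evaluation code: the function names `top_k_hit_rate` and `lift_over_random` are hypothetical, the toy rankings are invented for illustration, and it assumes each query yields a ranked list of the poets of the retrieved verses.

```python
from typing import Sequence


def top_k_hit_rate(
    ranked_poets: Sequence[Sequence[str]],
    gold_poets: Sequence[str],
    k: int = 10,
) -> float:
    """Fraction of queries whose gold poet appears among the first k results.

    ranked_poets[i] is the ranked list of poets returned for query i
    (assumed format, one poet label per retrieved verse).
    """
    if len(ranked_poets) != len(gold_poets):
        raise ValueError("need exactly one gold poet per query")
    if not gold_poets:
        raise ValueError("need at least one query")
    hits = sum(gold in ranked[:k] for ranked, gold in zip(ranked_poets, gold_poets))
    return hits / len(gold_poets)


def lift_over_random(system_rate: float, random_rate: float) -> float:
    """How many times better than the random baseline a system scores."""
    return system_rate / random_rate


# Toy example (invented data): four queries, three candidate poets, k = 2.
toy_ranked = [["a", "b", "c"], ["b", "c", "a"], ["c", "a", "b"], ["a", "c", "b"]]
toy_gold = ["a", "a", "b", "b"]
toy_rate = top_k_hit_rate(toy_ranked, toy_gold, k=2)  # only the first query hits

# The page's headline numbers, both in percent; the baseline is approximate.
reported_lift = lift_over_random(3.13, 0.14)
print(f"toy hit rate: {toy_rate:.2f}, reported lift: {reported_lift:.1f}x")
```

With the published figures, 3.13 / 0.14 comes to a little over 22, which is where the 22× in the text comes from.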
Planned comparisons (honest note)
A fair, same-scale comparison against other retrievers requires re-embedding Diwan's full 6.57M-verse corpus with each baseline model, a compute-intensive task planned for a follow-up technical report. Our initial runs on Fann-or-Flop's native pool alone were not informative: the pool is heavily skewed toward a few well-represented poets, so even a random baseline scored ~40% there, leaving little room to discriminate between systems. The 3.13% headline above is measured on the full Diwan corpus, where random scores ~0.14% and the 22× gap is meaningful. The planned baselines are:
- BM25 over the full 6.57M-verse index — classical keyword baseline
- AraBERT embeddings encoded across the full corpus — Arabic-specific baseline
- BGE-M3 and multilingual-E5 encoded across the full corpus — multilingual SOTA baselines
Researchers with dedicated compute budget who are interested in collaborating on the full-scale comparison are welcome to get in touch.
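Why a skewed pool inflates the random baseline can be seen with a small calculation. The sketch below is illustrative only and rests on stated assumptions: a "random" retriever draws k verses uniformly from the pool with replacement, queries are distributed across poets in proportion to their verse counts, and the verse counts are invented toy numbers, not Fann-or-Flop statistics. The function name `random_top_k_hit_rate` is hypothetical.

```python
from typing import Mapping


def random_top_k_hit_rate(verse_counts: Mapping[str, int], k: int = 10) -> float:
    """Expected top-k poet hit rate of a uniformly random retriever.

    Assumptions (not taken from the benchmark):
      * each of the k results is a verse drawn uniformly, with replacement;
      * a query's gold poet is poet p with probability equal to p's share
        of the pool.
    For a poet with share s, the chance that at least one of k random
    verses belongs to that poet is 1 - (1 - s) ** k.
    """
    total = sum(verse_counts.values())
    if total <= 0:
        raise ValueError("pool must contain at least one verse")
    rate = 0.0
    for count in verse_counts.values():
        share = count / total
        rate += share * (1.0 - (1.0 - share) ** k)
    return rate


# Toy pools (invented): one dominated by a single poet, one evenly spread.
skewed_pool = {"poet_a": 900, "poet_b": 50, "poet_c": 50}
uniform_pool = {f"poet_{i}": 10 for i in range(100)}

skewed_rate = random_top_k_hit_rate(skewed_pool)
uniform_rate = random_top_k_hit_rate(uniform_pool)
print(f"skewed pool: {skewed_rate:.1%}, uniform pool: {uniform_rate:.1%}")
```

In the skewed toy pool, random guessing almost always lands on the dominant poet, so every system looks strong. In a large, evenly spread pool, the random rate falls sharply and differences between retrievers become visible.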
Citation
You can cite this page and the original Fann-or-Flop paper. A detailed technical report is in preparation.