The corpus
What's actually indexed in Diwan — and what isn't
Sources
Classical verses come from the open arbml/ashaar dataset. Nabati verses come from the drelhaj/Tarab dataset. Both are publicly released and open-access.
Eras covered
Era categories are taken as given in the source datasets.
Counts combine the Classical and Nabati corpora. Percentages are shares of the era-tagged subset, because not every verse in the source carries an era label.
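- العصر الحديث (Modern era): 2,058,301 (48%)
- العصر العباسي (Abbasid era): 637,095 (15%)
- العصر المملوكي (Mamluk era): 451,468 (10%)
- العصر العثماني (Ottoman era): 360,203 (8%)
- العصر الأيوبي (Ayyubid era): 213,769 (5%)
- العصر الأندلسي (Andalusian era): 162,258 (4%)
- المغرب والأندلس (Maghreb and al-Andalus): 129,094 (3%)
- العصر الأموي (Umayyad era): 97,981 (2%)
- العصر الفاطمي (Fatimid era): 89,345 (2%)
- المخضرمين (Mukhadramun): 37,372 (0.9%)
- العصر الجاهلي (Jahili era): 36,318 (0.8%)
- عصر بين الدولتين (Between the two dynasties): 22,757 (0.5%)
- قبل الإسلام (Before Islam): 18,108 (0.4%)
- العصر الإسلامي (Islamic era): 4,483 (0.1%)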
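As a check on how the percentages relate to the counts, here is a minimal Python sketch that recomputes each era's share of the era-tagged subset. The counts are copied from the list above; the names `ERA_COUNTS`, `era_shares`, and `format_share` are illustrative, and the rounding rule (whole percent at 1% or more, one decimal below that) is an assumption inferred from the displayed figures, not taken from Diwan's code.

```python
# Era counts copied from the list above (Classical + Nabati combined).
ERA_COUNTS = {
    "العصر الحديث": 2_058_301,
    "العصر العباسي": 637_095,
    "العصر المملوكي": 451_468,
    "العصر العثماني": 360_203,
    "العصر الأيوبي": 213_769,
    "العصر الأندلسي": 162_258,
    "المغرب والأندلس": 129_094,
    "العصر الأموي": 97_981,
    "العصر الفاطمي": 89_345,
    "المخضرمين": 37_372,
    "العصر الجاهلي": 36_318,
    "عصر بين الدولتين": 22_757,
    "قبل الإسلام": 18_108,
    "العصر الإسلامي": 4_483,
}


def era_shares(counts):
    """Return each era's percentage of the era-tagged total."""
    total = sum(counts.values())
    return {era: 100 * n / total for era, n in counts.items()}


def format_share(pct):
    """Assumed display rule: whole percent at >= 1%, one decimal below."""
    return f"{pct:.0f}%" if pct >= 1 else f"{pct:.1f}%"


if __name__ == "__main__":
    print(f"era-tagged verses: {sum(ERA_COUNTS.values()):,}")
    for era, pct in era_shares(ERA_COUNTS).items():
        print(f"{era}: {ERA_COUNTS[era]:,} ({format_share(pct)})")
```

The denominator is the sum of the fourteen era counts (4,318,552), not the full size of the corpus, which is why untagged verses do not affect the shares.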
Most-represented poets
Top 10 from each corpus, ranked by indexed verse count.
Classical
- 01 خالد مصباح مظلوم: 77,217
- 02 ابن الرومي: 71,165
- 03 مهيار الديلمي: 61,091
- 04 البحتري: 52,711
- 05 خليل مطران: 45,393
- 06 أحمد شوقي: 43,479
- 07 أحمد محرم: 38,470
- 08 نزار قباني: 37,424
- 09 عبد الله بن علي الخليلي: 37,253
- 10 ريم سليمان الخش: 36,827
Nabati
- 01 بدر بن عبد المحسن: 2,264
- 02 خالد الفيصل: 1,141
- 03 عبد اللطيف البناي: 985
- 04 الناصر: 929
- 05 مخاوي الليل: 877
- 06 ساهر: 835
- 07 خالد عبد الرحمن: 753
- 08 حسين المحضار: 732
- 09 ساري: 712
- 10 محمد العبد الله الفيصل: 702
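A ranking like the ones above can be produced by counting indexed verses per poet and taking the top entries. The sketch below assumes each indexed verse is available as a `(poet, verse_text)` pair; that record shape and the names `top_poets` and `format_ranking` are hypothetical, and Diwan's own ranking may well be computed in SQL instead.

```python
from collections import Counter


def top_poets(records, n=10):
    """Rank poets by number of indexed verses.

    `records` is an iterable of (poet, verse_text) pairs (assumed shape).
    Returns a list of (poet, verse_count), largest count first.
    """
    counts = Counter(poet for poet, _verse in records)
    return counts.most_common(n)


def format_ranking(ranking):
    """Render the ranking as zero-padded list lines."""
    return [
        f"- {rank:02d} {poet}: {count:,}"
        for rank, (poet, count) in enumerate(ranking, start=1)
    ]


if __name__ == "__main__":
    # Toy data only; not taken from either dataset.
    sample = [
        ("poet_a", "v1"), ("poet_b", "v2"), ("poet_a", "v3"),
        ("poet_c", "v4"), ("poet_b", "v5"), ("poet_a", "v6"),
    ]
    print("\n".join(format_ranking(top_poets(sample, n=2))))
```

Because Classical and Nabati are indexed separately, a function like this would be run once per corpus rather than over a merged list.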
Known gaps
The corpus draws only from two open datasets (arbml/ashaar and drelhaj/Tarab). Some groups of poets are partially or entirely absent because they are not in those sources — not because of any editorial decision by Diwan:
- Most living contemporary poets: their work is rights-protected, so open datasets don't include it
- Women poets, who are underrepresented in the open sources
If you search for a poet and find nothing, it almost certainly means they aren't in the source datasets. Expanding coverage is a longer-term data-partnership question, not a code fix.
Cleaning policy
These are the transformations applied before indexing, documented so results are reproducible:
- Deduplication: verbatim-identical verses are merged before indexing.
- Diacritics (التشكيل): preserved where present in the source; semantic search does not depend on them.
- Normalization: alef variants (أ/إ/آ → ا), alef maqsura (ى → ي), and similar variants are unified so searches match across spelling variations.
- Punctuation: trimmed at shatr (hemistich) boundaries; meter and era fields are kept as given in the source.
- Separation: Classical and Nabati verses are kept in separate indexes, with no mixing between the two traditions.
Every step is documented in the pipeline code; the datasets, model, index, and SQL are all inspectable on the server.
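For readers who want a concrete picture of the cleaning steps, here is a minimal Python sketch of three of them: verbatim deduplication, letter normalization, and punctuation trimming at shatr edges. It is not Diwan's pipeline code. The function names, the exact punctuation set, and the choice to keep the first occurrence of a duplicate are assumptions; only the four letter mappings named in the policy are implemented, and the "similar variants" are left out because the policy does not list them.

```python
# Only the mappings named in the policy; code points are written as escapes
# so each key is guaranteed to be a single character.
LETTER_MAP = str.maketrans({
    "\u0623": "\u0627",  # alef with hamza above -> bare alef
    "\u0625": "\u0627",  # alef with hamza below -> bare alef
    "\u0622": "\u0627",  # alef with madda       -> bare alef
    "\u0649": "\u064A",  # alef maqsura          -> yeh
})

# Assumed punctuation set: whitespace, common Latin marks, and the
# Arabic comma, semicolon, and question mark.
EDGE_CHARS = " \t\r\n.,;:!?()[]\"'-\u060C\u061B\u061F"


def normalize_spelling(text):
    """Unify the listed letter variants; diacritics are left untouched."""
    return text.translate(LETTER_MAP)


def trim_shatr(shatr):
    """Strip punctuation and whitespace from both ends of one shatr."""
    return shatr.strip(EDGE_CHARS)


def dedupe(verses):
    """Merge verbatim-identical verses, keeping first-occurrence order."""
    return list(dict.fromkeys(verses))


if __name__ == "__main__":
    raw = ["\u0625\u0644\u0649 ", "\u0625\u0644\u0649 ", "\u0623\u062D\u0645\u062F\u060C"]
    cleaned = [normalize_spelling(trim_shatr(v)) for v in dedupe(raw)]
    print(cleaned)
```

Deduplication here compares the raw strings, matching the policy's "verbatim-identical" wording; verses that differ only in spelling variants or diacritics would remain distinct under this sketch.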