The corpus

What's actually indexed in Diwan — and what isn't

Total verses: 6,568,850
Classical: 6,444,111
Nabati: 124,739

Sources

Classical verses come from the open arbml/ashaar dataset. Nabati verses come from the drelhaj/Tarab dataset. Both are publicly released and open-access.

Eras covered

Era labels are taken as-is from the source datasets.

Counts combine Classical and Nabati. Percentages are shares of the era-tagged subset, because not every verse in the source carries an era label.

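العصر الحديث (Modern era): 2,058,301 verses, 48%
العصر العباسي (Abbasid era): 637,095 verses, 15%
العصر المملوكي (Mamluk era): 451,468 verses, 10%
العصر العثماني (Ottoman era): 360,203 verses, 8%
العصر الأيوبي (Ayyubid era): 213,769 verses, 5%
العصر الأندلسي (Andalusian era): 162,258 verses, 4%
المغرب والأندلس (Maghreb and al-Andalus): 129,094 verses, 3%
العصر الأموي (Umayyad era): 97,981 verses, 2%
العصر الفاطمي (Fatimid era): 89,345 verses, 2%
المخضرمين (Mukhadramun, poets spanning the pre-Islamic and Islamic periods): 37,372 verses, 0.9%
العصر الجاهلي (Jahili era): 36,318 verses, 0.8%
عصر بين الدولتين (Between the two dynasties): 22,757 verses, 0.5%
قبل الإسلام (Before Islam): 18,108 verses, 0.4%
العصر الإسلامي (Islamic era): 4,483 verses, 0.1%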
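
The percentages can be reproduced from the counts alone. The sketch below assumes the list shows every era label, so the era-tagged denominator is simply the sum of the listed counts (4,318,552, roughly two thirds of all indexed verses). The display rule (whole percent at or above 1%, one decimal below) is inferred from the figures shown, not taken from the pipeline.

```python
# Verse counts per era, copied from the list above in the same order.
ERA_COUNTS = [
    2_058_301, 637_095, 451_468, 360_203, 213_769, 162_258, 129_094,
    97_981, 89_345, 37_372, 36_318, 22_757, 18_108, 4_483,
]


def format_share(count: int, total: int) -> str:
    """Format count/total as a percentage string.

    Assumed display rule: no decimals at or above 1%, one decimal below.
    """
    pct = 100 * count / total
    return f"{pct:.0f}%" if pct >= 1 else f"{pct:.1f}%"


# Denominator: only verses that carry an era label, not the full corpus.
era_tagged_total = sum(ERA_COUNTS)
shares = [format_share(count, era_tagged_total) for count in ERA_COUNTS]
```

Dividing by the full corpus size (6,568,850) instead would give lower figures for every era, which is why the denominator matters when comparing these shares with other statistics.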

Most-represented poets

Top 10 from each corpus, ranked by indexed verse count.

Classical

01. خالد مصباح مظلوم: 77,217
02. ابن الرومي: 71,165
03. مهيار الديلمي: 61,091
04. البحتري: 52,711
05. خليل مطران: 45,393
06. أحمد شوقي: 43,479
07. أحمد محرم: 38,470
08. نزار قباني: 37,424
09. عبد الله بن علي الخليلي: 37,253
10. ريم سليمان الخش: 36,827

Nabati

01. بدر بن عبد المحسن: 2,264
02. خالد الفيصل: 1,141
03. عبد اللطيف البناي: 985
04. الناصر: 929
05. مخاوي الليل: 877
06. ساهر: 835
07. خالد عبد الرحمن: 753
08. حسين المحضار: 732
09. ساري: 712
10. محمد العبد الله الفيصل: 702
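
A ranking like the ones above can be derived by counting indexed verses per poet within each corpus. The following is a minimal sketch, assuming each indexed verse is available as a poet name. The function name and input shape are hypothetical and do not reflect the actual Diwan schema.

```python
from collections import Counter
from typing import Iterable


def top_poets(verse_poets: Iterable[str], n: int = 10) -> list[tuple[str, int]]:
    """Return the n poets with the most verses as (poet, verse_count) pairs.

    verse_poets yields one poet name per indexed verse (hypothetical input shape).
    Ties keep the order in which poets were first seen.
    """
    return Counter(verse_poets).most_common(n)
```

Running it once per corpus, rather than on the combined data, keeps the Classical and Nabati rankings separate, as in the two lists above.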

Known gaps

The corpus draws only from two open datasets (arbml/ashaar and drelhaj/Tarab). Some groups of poets are partially or entirely absent because they are not in those sources — not because of any editorial decision by Diwan:

Most living contemporary poets: their work is rights-protected, so open datasets don't include it.
Women poets: underrepresented in the open sources.

If you search for a poet and find nothing, it almost certainly means they aren't in the source datasets. Expanding coverage is a longer-term data-partnership question, not a code fix.

Cleaning policy

These are the transformations applied before indexing, documented so results are reproducible:

Deduplication: verbatim-identical verses are merged before indexing.
Diacritics (التشكيل, tashkeel): preserved where present in the source; semantic search does not depend on them.
Normalization: alef variants carrying a hamza or madda (أ/إ/آ → ا), alef maqsura (ى → ي), and similar variants are unified so searches match across spelling variations.
Punctuation: trimmed at shatr (hemistich) boundaries; meter and era fields kept as given in the source.
Classical and Nabati are kept in separate indexes — no cross-contamination between the two traditions.
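
As an illustration only, the deduplication, normalization, and punctuation steps in the list above could be sketched as follows. The function names and the punctuation set are assumptions, not the pipeline's actual code. Only the character mappings named in the policy are included; the "similar variants" are not enumerated there, so none are guessed here.

```python
import string

# Character mappings named in the policy above.
NORMALIZATION_MAP = str.maketrans({
    "\u0623": "\u0627",  # alef with hamza above -> bare alef
    "\u0625": "\u0627",  # alef with hamza below -> bare alef
    "\u0622": "\u0627",  # alef with madda       -> bare alef
    "\u0649": "\u064A",  # alef maqsura          -> ya
})

# Assumed trim set: ASCII punctuation, whitespace, and the Arabic
# comma (U+060C), semicolon (U+061B), and question mark (U+061F).
TRIM_CHARS = string.punctuation + string.whitespace + "\u060C\u061B\u061F"


def normalize_for_search(text: str) -> str:
    """Unify spelling variants; diacritics are left untouched."""
    return text.translate(NORMALIZATION_MAP)


def trim_shatr(shatr: str) -> str:
    """Strip punctuation and whitespace from both ends of one hemistich."""
    return shatr.strip(TRIM_CHARS)


def dedupe_verbatim(verses: list[str]) -> list[str]:
    """Merge verbatim-identical verses, keeping first occurrences in order."""
    return list(dict.fromkeys(verses))
```

For example, `normalize_for_search` maps both "إلى" and "الى" to "الي", so a query typed with either spelling matches the same normalized form. Because deduplication here compares raw strings, two verses that differ only in diacritics are treated as distinct, in line with the "verbatim-identical" wording above.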

Every step is documented in the pipeline code; the datasets, model, index, and SQL are all inspectable on the server.