
The corpus

What's actually indexed in Diwan — and what isn't

Total verses: 6,568,850
Classical: 6,444,111
Nabati: 124,739

Sources

Classical verses come from the open arbml/ashaar dataset. Nabati verses come from the drelhaj/Tarab dataset. Both are publicly released and open-access.

Eras covered

Era labels are taken as given in the original source.

Counts combine Classical and Nabati. Percentages are shares of the era-tagged subset, since not every verse in the source carries an era label.

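  • العصر الحديث (Modern era): 2,058,301 (48%)
  • العصر العباسي (Abbasid era): 637,095 (15%)
  • العصر المملوكي (Mamluk era): 451,468 (10%)
  • العصر العثماني (Ottoman era): 360,203 (8%)
  • العصر الأيوبي (Ayyubid era): 213,769 (5%)
  • العصر الأندلسي (Andalusian era): 162,258 (4%)
  • المغرب والأندلس (Maghreb and al-Andalus): 129,094 (3%)
  • العصر الأموي (Umayyad era): 97,981 (2%)
  • العصر الفاطمي (Fatimid era): 89,345 (2%)
  • المخضرمين (Mukhadramun, poets who lived in both the pre-Islamic and Islamic periods): 37,372 (0.9%)
  • العصر الجاهلي (Jahili era): 36,318 (0.8%)
  • عصر بين الدولتين (Between the two dynasties): 22,757 (0.5%)
  • قبل الإسلام (Before Islam): 18,108 (0.4%)
  • العصر الإسلامي (Islamic era): 4,483 (0.1%)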
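
The percentages can be reproduced from the counts. This is a minimal sketch, assuming the list above shows every era label, so that the era-tagged subset is simply the sum of the listed counts. The names `era_counts`, `tagged_total`, and `share` are illustrative and not taken from the pipeline.

```python
# Era counts copied from the list above (era-tagged verses only).
era_counts = {
    "العصر الحديث": 2_058_301,
    "العصر العباسي": 637_095,
    "العصر المملوكي": 451_468,
    "العصر العثماني": 360_203,
    "العصر الأيوبي": 213_769,
    "العصر الأندلسي": 162_258,
    "المغرب والأندلس": 129_094,
    "العصر الأموي": 97_981,
    "العصر الفاطمي": 89_345,
    "المخضرمين": 37_372,
    "العصر الجاهلي": 36_318,
    "عصر بين الدولتين": 22_757,
    "قبل الإسلام": 18_108,
    "العصر الإسلامي": 4_483,
}

# Assumption: the denominator is the sum of the listed counts,
# not the full corpus size.
tagged_total = sum(era_counts.values())


def share(count: int, total: int) -> float:
    """Return count as a percentage of total."""
    return 100 * count / total


for era, count in era_counts.items():
    print(f"{era}: {count:,} ({share(count, tagged_total):.1f}%)")
```

Under that assumption the era-tagged subset comes to 4,318,552 verses, which is smaller than the 6,568,850 total because untagged verses are left out of the denominator.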

Most-represented poets

Top 10 from each corpus, ranked by indexed verse count.

Classical

  1. خالد مصباح مظلوم: 77,217
  2. ابن الرومي: 71,165
  3. مهيار الديلمي: 61,091
  4. البحتري: 52,711
  5. خليل مطران: 45,393
  6. أحمد شوقي: 43,479
  7. أحمد محرم: 38,470
  8. نزار قباني: 37,424
  9. عبد الله بن علي الخليلي: 37,253
  10. ريم سليمان الخش: 36,827

Nabati

  1. بدر بن عبد المحسن: 2,264
  2. خالد الفيصل: 1,141
  3. عبد اللطيف البناي: 985
  4. الناصر: 929
  5. مخاوي الليل: 877
  6. ساهر: 835
  7. خالد عبد الرحمن: 753
  8. حسين المحضار: 732
  9. ساري: 712
  10. محمد العبد الله الفيصل: 702

Known gaps

The corpus draws only from two open datasets (arbml/ashaar and drelhaj/Tarab). Some groups of poets are partially or entirely absent because they are not in those sources — not because of any editorial decision by Diwan:

  • Most living contemporary poets: their work is rights-protected, so open datasets don't include it
  • Women poets: underrepresented in the open sources

If you search for a poet and find nothing, it almost certainly means they aren't in the source datasets. Expanding coverage is a longer-term data-partnership question, not a code fix.

Cleaning policy

These are the transformations applied before indexing, documented so results are reproducible:

  • Deduplication: verbatim-identical verses are merged before indexing.
  • Diacritics (التشكيل): preserved where present in the source; semantic search does not depend on them.
  • Normalization: alef variants (أ/إ/آ → ا), alef maqsura (ى → ي), and similar variants are unified so that searches match across spelling variations.
  • Punctuation: trimmed at shatr (hemistich) boundaries; meter and era fields kept as given in the source.
  • Classical and Nabati are kept in separate indexes — no cross-contamination between the two traditions.
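
To make the deduplication and normalization steps concrete, here is a minimal sketch. It is not the pipeline's code: the function names are hypothetical, the character map covers only the variants named above (the "similar variants" are not listed on this page, so they are left out), and "verbatim-identical" is taken to mean exact string equality.

```python
# Assumed mapping: only the variants named in the cleaning policy.
_SPELLING_MAP = str.maketrans({
    "\u0623": "\u0627",  # أ (alef with hamza above) -> ا
    "\u0625": "\u0627",  # إ (alef with hamza below) -> ا
    "\u0622": "\u0627",  # آ (alef with madda)       -> ا
    "\u0649": "\u064A",  # ى (alef maqsura)          -> ي
})


def normalize_spelling(text: str) -> str:
    """Unify the spelling variants above; every other character,
    including diacritics, is left unchanged."""
    return text.translate(_SPELLING_MAP)


def dedupe_verbatim(verses: list[str]) -> list[str]:
    """Merge verbatim-identical verses, keeping first-seen order.

    dict.fromkeys preserves insertion order, so the first
    occurrence of each distinct string is the one kept.
    """
    return list(dict.fromkeys(verses))


if __name__ == "__main__":
    sample = ["إلى", "آمن", "إلى"]
    unique = dedupe_verbatim(sample)          # two distinct verses remain
    print([normalize_spelling(v) for v in unique])
```

Diacritics pass through `normalize_spelling` untouched, which is in line with the policy of preserving them where the source has them.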

Every step is documented in the pipeline code; the datasets, model, index, and SQL are all inspectable on the server.