ArabicNLP-UK/KALIMAT

Kalimat - a multipurpose Arabic Corpus

This repository provides a cleaned and consolidated version of Kalimat, a multipurpose Arabic corpus, containing 18,256 Arabic news articles collected from a range of domains. The original material consisted of thousands of individual .txt files organised across multiple category folders. These have been reconstructed, normalised, and compiled into machine-learning-friendly formats.


📚 Corpus Overview

The articles cover a wide selection of news categories:

  • Politics
  • Economy
  • Culture
  • Religion
  • Sport
  • Social / Society-related topics
  • Other sub-domains depending on the original folder structure

Each article was originally stored as one word per line. In this cleaned edition, all documents have been reconstructed into natural text format with proper spacing, UTF-8 encoding, and consistent metadata extraction.
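The reconstruction step described above can be sketched as follows. This is a minimal illustration, not the script used to build this release: the function name `reconstruct_text` is hypothetical, and it assumes that each non-empty line of an original file holds exactly one token.

```python
def reconstruct_text(raw: str) -> str:
    """Join one-word-per-line content into a single space-separated string."""
    # splitlines() copes with both \n and \r\n line endings.
    words = [line.strip() for line in raw.splitlines()]
    # Drop empty lines so they do not turn into double spaces.
    return " ".join(word for word in words if word)


# A tiny made-up sample in the original one-word-per-line layout.
sample = "ذهب\nالولد\nإلى\nالمدرسة\n"
print(reconstruct_text(sample))
```

The actual cleaning may involve further normalisation that is not documented here, so treat this only as a picture of the basic idea.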

Although the original filenames varied widely, each document is now associated with the following fields:

  • id – numeric identifier extracted from the filename (or -1 where none existed)
  • filename – original filename exactly as it appeared
  • category – derived from directory structure or filename
  • year_month – extracted from filename where possible, otherwise "unknown"
  • text – reconstructed, cleaned article text
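Because missing metadata is encoded with the sentinel values `-1` and `"unknown"`, it can be useful to separate fully labelled records from the rest. The sketch below assumes records are available as dictionaries keyed by the field names listed above; the helper name `has_full_metadata` is hypothetical.

```python
from typing import List, Mapping


def has_full_metadata(record: Mapping[str, object]) -> bool:
    """Return True when neither sentinel value is present in the record."""
    # int() also accepts the string "-1", which is what a plain CSV reader yields.
    has_id = int(record["id"]) != -1
    has_date = record["year_month"] != "unknown"
    return has_id and has_date


# Made-up records for illustration; real values come from the dataset files.
records: List[dict] = [
    {"id": 12, "filename": "a.txt", "category": "sport", "year_month": "2011-05", "text": "..."},
    {"id": -1, "filename": "b.txt", "category": "culture", "year_month": "unknown", "text": "..."},
]
complete = [r for r in records if has_full_metadata(r)]
print(len(complete))
```

The `year_month` value `"2011-05"` in the sample is a placeholder; the actual date format used in the dataset is not specified here.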

📦 Provided Formats

The cleaned dataset is released in the following forms:

1. CSV File

kalimat.csv

A single UTF-8 CSV containing all metadata and article texts.

2. JSONL File

kalimat.jsonl

One JSON object per line, suitable for training modern NLP models (e.g. Hugging Face Transformers).
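A JSONL file can be streamed one record at a time with the standard library. The following sketch assumes that each JSON object uses the same field names as the CSV columns (in particular `category`); the function names are illustrative only.

```python
import json
from collections import Counter
from typing import Iterator


def iter_jsonl(path: str) -> Iterator[dict]:
    """Yield one parsed record per non-empty line of a JSONL file."""
    with open(path, encoding="utf-8") as handle:
        for line in handle:
            line = line.strip()
            if line:  # tolerate a trailing blank line
                yield json.loads(line)


def category_counts(path: str) -> Counter:
    """Count articles per category without loading the whole file into memory."""
    return Counter(record["category"] for record in iter_jsonl(path))
```

For example, `category_counts("kalimat.jsonl").most_common()` would list the categories from most to least frequent.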

3. TXT Version (Zipped)

kalimat_txt.zip

All reconstructed .txt documents are included in a single compressed archive to avoid storing thousands of individual files in the repository. Each .txt file uses the original filename for easy reference.
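The archive can be read in place, without extracting it to disk. This is a minimal sketch that assumes the members are UTF-8 text files with a `.txt` extension; the internal folder layout of the archive is not documented here, so the code makes no assumption about it. The function name is hypothetical.

```python
import zipfile
from typing import Iterator, Tuple


def iter_txt_documents(zip_path: str) -> Iterator[Tuple[str, str]]:
    """Yield (member name, decoded text) for every .txt file in the archive."""
    with zipfile.ZipFile(zip_path) as archive:
        for name in archive.namelist():
            # Skip directories and any non-text members.
            if name.lower().endswith(".txt"):
                yield name, archive.read(name).decode("utf-8")
```

A typical use would be `for name, text in iter_txt_documents("kalimat_txt.zip"): ...`, which keeps only one document in memory at a time.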


🔀 Train / Validation / Test Splits

The dataset has been randomly split (using a fixed seed for reproducibility) into:

  • Training set – 80%
  • Validation set – 10%
  • Test set – 10%

These splits are provided as:

  • kalimat_train.csv
  • kalimat_val.csv
  • kalimat_test.csv

All CSVs preserve the same column structure as the main file.
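One way to confirm that the split files share the main file's columns is to compare their header rows. The sketch below reads only the first row of each file, so it stays fast on large CSVs; the helper names are hypothetical.

```python
import csv
from typing import Dict, Iterable, List


def read_header(path: str) -> List[str]:
    """Return the column names from the first row of a CSV file."""
    with open(path, encoding="utf-8", newline="") as handle:
        return next(csv.reader(handle))


def headers_match(reference: str, others: Iterable[str]) -> Dict[str, bool]:
    """Map each path in `others` to whether its header equals the reference header."""
    expected = read_header(reference)
    return {path: read_header(path) == expected for path in others}
```

For instance, `headers_match("kalimat.csv", ["kalimat_train.csv", "kalimat_val.csv", "kalimat_test.csv"])` should map every split file to `True` if the column structure is preserved.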

Code used for splitting (for reference)

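```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Load the full corpus (one row per article).
df = pd.read_csv("kalimat.csv", encoding="utf-8")

# Hold out 20% of the rows; the remaining 80% is the training set.
train_df, temp_df = train_test_split(
    df, test_size=0.20, random_state=42, shuffle=True
)

# Split the held-out 20% in half: 10% validation, 10% test.
val_df, test_df = train_test_split(
    temp_df, test_size=0.50, random_state=42, shuffle=True
)

# Write each split without the pandas index so the columns match kalimat.csv.
train_df.to_csv("kalimat_train.csv", index=False, encoding="utf-8")
val_df.to_csv("kalimat_val.csv", index=False, encoding="utf-8")
test_df.to_csv("kalimat_test.csv", index=False, encoding="utf-8")
```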
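As a sanity check on the 80/10/10 proportions, the expected row counts can be worked out by hand. The sketch below assumes that `kalimat.csv` contains 18,256 rows (the article count stated in the introduction) and that `train_test_split` rounds a fractional `test_size` up to the next whole row, which is my understanding of scikit-learn's behaviour rather than something stated in this repository. The function name is hypothetical.

```python
import math
from typing import Tuple


def split_sizes(n_rows: int, held_out: float = 0.20, test_share: float = 0.50) -> Tuple[int, int, int]:
    """Return (train, val, test) sizes for the two-stage split shown above."""
    # Stage 1: the held-out portion is rounded up, the rest is training data.
    n_held_out = math.ceil(held_out * n_rows)
    n_train = n_rows - n_held_out
    # Stage 2: the test portion is rounded up, the rest is validation data.
    n_test = math.ceil(test_share * n_held_out)
    n_val = n_held_out - n_test
    return n_train, n_val, n_test


print(split_sizes(18256))  # → (14604, 1826, 1826)
```

Comparing these figures against the row counts of the three split files is a quick way to confirm that the splits were generated from the full corpus.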

📁 Repository Structure

kalimat.csv
kalimat.jsonl
kalimat_train.csv
kalimat_val.csv
kalimat_test.csv
kalimat_txt.zip
README.md

📄 Citation

If you use this dataset in your work, please cite the original Kalimat paper:

El-Haj, M., & Koulali, R. (2013). Kalimat: a multipurpose Arabic corpus. In Second Workshop on Arabic Corpus Linguistics (WACL-2), pp. 22–25.

