📦 LLaMAFactory Dataset Generator

A CLI utility for producing high-quality, domain-aware training datasets for LLM fine-tuning. Configure your domains in YAML, generate datasets in one command, and keep quality high with validation, deduplication, and statistics files.

Features

Multi-domain configuration

Select any domain using CLI arguments without editing code:

--domain haiintel_core
--domain expense
--domain <your-custom-domain>

All domain details (company, agents, regions, currencies, and more) live in config.yaml.

YAML-driven pipeline

No Python edits required to add or adjust a domain. Update config.yaml and regenerate datasets.

Dataset quality safeguards

  • Validation for every generated example
  • Automatic deduplication
  • Companion *_stats.json files with totals, token estimates, and section breakdowns
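The safeguards above can be sketched roughly as follows. This is a minimal illustration, not the project's actual implementation: the function name, the stats fields, and the ~4-characters-per-token estimate are all assumptions.

```python
import hashlib
import json


def dedup_and_stats(examples):
    """Drop exact-duplicate examples and compute summary statistics.

    Illustrative sketch only: the real generator's stats schema and
    token estimator may differ.
    """
    seen = set()
    unique = []
    for ex in examples:
        # Hash the canonical JSON form so reordered keys still match.
        key = hashlib.sha256(
            json.dumps(ex, sort_keys=True).encode("utf-8")
        ).hexdigest()
        if key not in seen:
            seen.add(key)
            unique.append(ex)

    stats = {
        "total_examples": len(unique),
        "duplicates_removed": len(examples) - len(unique),
        # Crude token estimate: roughly 4 characters per token.
        "estimated_tokens": sum(len(json.dumps(ex)) // 4 for ex in unique),
        "sections": {},
    }
    for ex in unique:
        section = ex.get("metadata", {}).get("section", "unknown")
        stats["sections"][section] = stats["sections"].get(section, 0) + 1
    return unique, stats
```

A companion `*_stats.json` file would then simply be `json.dumps(stats)` written next to the dataset.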

Entity classification

Rule-based keyword classifier to produce meaningful entity labels:

{
  "system": "You are HAIIndexer classification module. Classify the given string into one or more entity types.",
  "instruction": "What type of entity is Global Invoice for CFO 001?",
  "input": "Global Invoice for CFO 001",
  "output": "This entity belongs to the following types: Person, Invoice",
  "metadata": {
    "section": "entity_classification",
    "classified_as": ["Person", "Invoice"],
    "possible_labels": ["Person", "CostCenter", "ExpensePolicy", "Vendor", ...]
  }
}
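A rule-based classifier of this kind can be approximated with simple keyword matching. The sketch below is illustrative only; the keyword table is invented, and the real patterns live in src/utils.py:classify_entity_name().

```python
# Hypothetical keyword patterns per entity type (illustrative values).
ENTITY_KEYWORDS = {
    "Person": ["cfo", "controller", "manager"],
    "Invoice": ["invoice", "bill"],
    "Vendor": ["vendor", "supplier"],
    "ExpensePolicy": ["policy", "limit"],
}


def classify_entity_name(name):
    """Return every entity type whose keywords appear in the name."""
    lowered = name.lower()
    labels = [
        entity_type
        for entity_type, keywords in ENTITY_KEYWORDS.items()
        if any(kw in lowered for kw in keywords)
    ]
    return labels or ["Unknown"]
```

With these patterns, "Global Invoice for CFO 001" matches both "cfo" and "invoice", yielding ["Person", "Invoice"] as in the example above.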

Prerequisites

  • Python 3.8 or higher
  • pip
  • Optional: jq for JSON formatting

Installation

Install dependencies directly:

pip install -r requirements.txt

Or use the Makefile shortcut:

make install

Usage

  1. Prepare config.yaml — define one or more domains. Example:

    domains:
      - id: expense
        company_name: "<Company Name>"
        agent_name: "<Agent Name>"
        chat_agent_name: "HAI Expense Agent"
        domain_name: "Expense Management"
        kb_label: "HaiIntel Expense Knowledge Base"
        primary_products: ["HAIExpenseLens", "HAIIndexer"]
        primary_roles: ["CFO", "Finance Controller"]
        primary_regions: ["Global", "UAE", "India"]
        entity_types: ["Invoice", "Receipt", "ExpensePolicy", "Vendor"]
        expense_doc_types: ["Invoice", "Bill", "Receipt"]
        currencies: ["INR", "USD", "AED"]
  2. Generate datasets

    • Direct Python command:
      python -m src.cli --config config.yaml --domain expense --out-dir ./training-jsons
    • Using the Makefile:
      make generate DOMAIN=expense  # Single domain
      make generate-all             # All domains in config.yaml
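Under the hood, domain selection amounts to loading config.yaml and picking the entry whose id matches --domain. A minimal sketch with assumed field names (the actual data class is in src/domain_config.py):

```python
from dataclasses import dataclass, field


@dataclass
class DomainConfig:
    """Subset of the fields shown in the example above (names assumed)."""
    id: str
    company_name: str = ""
    domain_name: str = ""
    entity_types: list = field(default_factory=list)


def select_domain(config, domain_id):
    """Pick the domain entry matching the --domain argument."""
    for entry in config["domains"]:
        if entry["id"] == domain_id:
            # Ignore keys this sketch's data class does not model.
            known = set(DomainConfig.__dataclass_fields__)
            return DomainConfig(**{k: v for k, v in entry.items() if k in known})
    raise SystemExit(f"Unknown domain: {domain_id!r}")


# The dict below stands in for yaml.safe_load(open("config.yaml")).
config = {"domains": [{"id": "expense", "company_name": "HaiIntel",
                       "entity_types": ["Invoice", "Receipt"]}]}
domain = select_domain(config, "expense")
```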

Output structure

The generator writes JSON datasets plus per-section statistics:

training-jsons/
├─ intro-training.json                        # Greetings and introductions
├─ operator-training.json                     # Operator logic examples
├─ rag_context_training.json                  # RAG context handling
├─ entity-classification-training.json        # Entity type classification
├─ safety_guardrails_training.json            # Safety and guardrails
├─ hard_negatives_hallucinations.json         # Hard negative examples
├─ company_kb_training.json                   # Company knowledge base Q&A
├─ company_kb_no_hallucinations_training.json # Anti-hallucination KB
├─ business_integration_training.json         # Business integration scenarios
├─ expense_documents_training.json            # Domain-specific: Expense docs (if configured)
└─ *_stats.json                               # Stats for each dataset above
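The *_stats.json companions make it easy to audit a run, e.g. by aggregating totals across all sections. The sketch below assumes a total_examples field, which may differ from the real stats schema; it demos against stand-in files rather than a real run:

```python
import json
import tempfile
from pathlib import Path


def total_examples(out_dir):
    """Sum total_examples across every *_stats.json in out_dir."""
    return sum(
        json.loads(p.read_text()).get("total_examples", 0)
        for p in Path(out_dir).glob("*_stats.json")
    )


# Demo with stand-in files (a real run would point at ./training-jsons).
out_dir = Path(tempfile.mkdtemp())
(out_dir / "intro-training_stats.json").write_text(
    json.dumps({"total_examples": 120}))
(out_dir / "operator-training_stats.json").write_text(
    json.dumps({"total_examples": 80}))
print(total_examples(out_dir))  # 200
```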

Project Structure

dataset-generator/
├── config.yaml              # Multi-domain configuration
├── requirements.txt         # Python dependencies
├── Makefile                 # Build automation
├── README.md                # Project overview and usage
└── src/
    ├── cli.py               # Command-line interface
    ├── domain_config.py     # Domain configuration data class
    ├── factory.py           # Section builder factory
    ├── generator.py         # Main dataset generator
    ├── utils.py             # Shared utilities and entity classifier
    └── sections/            # Section builders (one per training type)
        ├── base.py
        ├── intro.py
        ├── operator.py
        ├── entity_classification.py
        ├── rag_context.py
        ├── safety.py
        └── ...

Adding a New Domain

  1. Edit config.yaml to add a domain entry:
    domains:
      - id: my_new_domain
        company_name: "MyCompany"
        agent_name: "MyAgent"
        domain_name: "My Domain"
        entity_types: ["TypeA", "TypeB"]
        # ... other configuration
  2. Extend the entity classifier (optional) — add keyword patterns in src/utils.py:classify_entity_name().
  3. Generate datasets with make generate DOMAIN=my_new_domain.

Development

Code quality tools

The project supports common Python tooling:

pip install -r requirements.txt  # Dev dependencies included
black src/
mypy src/
ruff check src/
pytest

Design principles

  • Dependency inversion — the generator depends on factories rather than concrete builders.
  • DRY utilities — shared helpers live in utils.py.
  • Extensibility — add new JSON schemas or builders with minimal changes.
  • Testability — builders are pure functions returning examples; easy to validate and unit test.
  • Quality-first — deduplication, validation, and statistics are built into the generation pipeline.
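These principles can be illustrated with a toy builder and factory. This is a sketch only; the real builders live in src/sections/ and their signatures and registration mechanism are assumptions here.

```python
# Registry mapping section names to builder functions. The generator
# depends only on this mapping, not on concrete builders
# (dependency inversion).
BUILDERS = {}


def register(section):
    def wrap(fn):
        BUILDERS[section] = fn
        return fn
    return wrap


@register("intro")
def build_intro(domain):
    """Pure function: the same domain config always yields the same examples."""
    return [{
        "instruction": "Who are you?",
        "output": (f"I am {domain['agent_name']}, "
                   f"the {domain['domain_name']} assistant."),
        "metadata": {"section": "intro"},
    }]


def generate(domain, sections):
    examples = []
    for section in sections:
        examples.extend(BUILDERS[section](domain))
    return examples


examples = generate({"agent_name": "HAI Expense Agent",
                     "domain_name": "Expense Management"}, ["intro"])
```

Because builders are pure, each one can be unit tested in isolation by asserting on its returned examples.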
