A CLI utility for producing high-quality, domain-aware training datasets for LLM fine-tuning. Configure your domains in YAML, generate datasets in one command, and keep quality high with validation, deduplication, and statistics files.
- Features
- Prerequisites
- Installation
- Usage
- Output Structure
- Project Structure
- Adding a New Domain
- Development
Select any domain using CLI arguments without editing code:
--domain haiintel_core
--domain expense
--domain <your-custom-domain>
All domain details (company, agents, regions, currencies, and more) live in config.yaml.
No Python edits required to add or adjust a domain. Update config.yaml and regenerate datasets.
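As a rough illustration of how a domain entry from config.yaml could be carried around in code, here is a minimal sketch. The `DomainConfig` fields mirror keys shown in this README, but the class body, `from_dict`, and `select_domain` are assumptions made for this example, not the contents of src/domain_config.py. The `parsed` dict stands in for whatever a YAML parser would return.

```python
from dataclasses import dataclass, field
from typing import Any, Dict, List


@dataclass
class DomainConfig:
    """Hypothetical container for one entry under `domains:` in config.yaml."""

    id: str
    company_name: str = ""
    agent_name: str = ""
    domain_name: str = ""
    entity_types: List[str] = field(default_factory=list)
    # Keys this sketch does not model explicitly (regions, currencies, ...) land here.
    extra: Dict[str, Any] = field(default_factory=dict)

    @classmethod
    def from_dict(cls, raw: Dict[str, Any]) -> "DomainConfig":
        known = {"id", "company_name", "agent_name", "domain_name", "entity_types"}
        return cls(
            id=raw["id"],
            company_name=raw.get("company_name", ""),
            agent_name=raw.get("agent_name", ""),
            domain_name=raw.get("domain_name", ""),
            entity_types=list(raw.get("entity_types", [])),
            extra={k: v for k, v in raw.items() if k not in known},
        )


def select_domain(config: Dict[str, Any], domain_id: str) -> DomainConfig:
    """Return the domain whose `id` matches, or raise a readable error."""
    domains = config.get("domains", [])
    for raw in domains:
        if raw.get("id") == domain_id:
            return DomainConfig.from_dict(raw)
    available = ", ".join(str(d.get("id", "?")) for d in domains)
    raise KeyError(f"Unknown domain {domain_id!r}; available: {available}")


# `parsed` is a stand-in for the parsed contents of config.yaml.
parsed = {
    "domains": [
        {
            "id": "expense",
            "domain_name": "Expense Management",
            "entity_types": ["Invoice", "Receipt", "ExpensePolicy", "Vendor"],
            "currencies": ["INR", "USD", "AED"],
        }
    ]
}
expense = select_domain(parsed, "expense")
print(expense.domain_name, expense.entity_types)
```

Keeping unknown keys in `extra` is one way to let new config fields flow through without a code change, which matches the "no Python edits" goal stated above.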
- Validation for every generated example
- Automatic deduplication
- Companion *_stats.json files with totals, token estimates, and section breakdowns
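The three quality steps listed above could be sketched roughly as follows. This is not the project's implementation: the required keys are inferred from the example record in this README, the function names are hypothetical, and the four-characters-per-token estimate is a common rough heuristic chosen for the sketch.

```python
import hashlib
import json
from collections import Counter
from typing import Any, Dict, List

# Assumed from the example record shown in this README.
REQUIRED_KEYS = ("system", "instruction", "input", "output")

Example = Dict[str, Any]


def is_valid(example: Example) -> bool:
    """Require every key to be a string, with non-empty instruction and output."""
    if not all(isinstance(example.get(key), str) for key in REQUIRED_KEYS):
        return False
    return bool(example["instruction"].strip()) and bool(example["output"].strip())


def fingerprint(example: Example) -> str:
    """Hash only the text fields so metadata differences do not hide duplicates."""
    payload = json.dumps(
        {key: example[key] for key in REQUIRED_KEYS},
        sort_keys=True,
        ensure_ascii=False,
    )
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()


def clean(examples: List[Example]) -> List[Example]:
    """Drop invalid records, then drop exact duplicates, keeping first occurrences."""
    seen = set()
    kept = []
    for example in examples:
        if not is_valid(example):
            continue
        digest = fingerprint(example)
        if digest in seen:
            continue
        seen.add(digest)
        kept.append(example)
    return kept


def build_stats(examples: List[Example]) -> Dict[str, Any]:
    """Summarize a cleaned dataset; the token count is only an approximation."""
    chars = sum(len(example[key]) for example in examples for key in REQUIRED_KEYS)
    sections = Counter(
        example.get("metadata", {}).get("section", "unknown") for example in examples
    )
    return {
        "total_examples": len(examples),
        "estimated_tokens": chars // 4,  # heuristic: about 4 characters per token
        "sections": dict(sections),
    }
```

Hashing a canonical JSON form of the text fields keeps deduplication independent of key order and of metadata differences.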
Rule-based keyword classifier to produce meaningful entity labels:
{
  "system": "You are HAIIndexer classification module. Classify the given string into one or more entity types.",
  "instruction": "What type of entity is Global Invoice for CFO 001?",
  "input": "Global Invoice for CFO 001",
  "output": "This entity belongs to the following types: Person, Invoice",
  "metadata": {
    "section": "entity_classification",
    "classified_as": ["Person", "Invoice"],
    "possible_labels": ["Person", "CostCenter", "ExpensePolicy", "Vendor", ...]
  }
}
- Python 3.8 or higher
- pip
- Optional: jq for JSON formatting
Install dependencies directly:
pip install -r requirements.txt
Or use the Makefile shortcut:
make install
- Prepare config.yaml: define one or more domains. Example:
    domains:
      - id: expense
        company_name: "<Company Name>"
        agent_name: "<Agent Name>"
        chat_agent_name: "HAI Expense Agent"
        domain_name: "Expense Management"
        kb_label: "HaiIntel Expense Knowledge Base"
        primary_products: ["HAIExpenseLens", "HAIIndexer"]
        primary_roles: ["CFO", "Finance Controller"]
        primary_regions: ["Global", "UAE", "India"]
        entity_types: ["Invoice", "Receipt", "ExpensePolicy", "Vendor"]
        expense_doc_types: ["Invoice", "Bill", "Receipt"]
        currencies: ["INR", "USD", "AED"]
- Generate datasets
- Direct Python command:
python -m src.cli --config config.yaml --domain expense --out-dir ./training-jsons
- Using the Makefile:
make generate DOMAIN=expense   # Single domain
make generate-all              # All domains in config.yaml
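For readers curious how the three flags used above might be parsed, here is a minimal argparse sketch. Only the flag names `--config`, `--domain`, and `--out-dir` come from this README. The defaults, the choice to make `--domain` required, and the function names are assumptions for this example and may differ from src/cli.py.

```python
import argparse
from pathlib import Path
from typing import List, Optional


def build_parser() -> argparse.ArgumentParser:
    """Build a parser exposing the flags shown in the usage example."""
    parser = argparse.ArgumentParser(
        prog="python -m src.cli",
        description="Generate training datasets for one configured domain.",
    )
    # Defaults below are assumptions for this sketch.
    parser.add_argument("--config", type=Path, default=Path("config.yaml"),
                        help="Path to the multi-domain YAML configuration.")
    parser.add_argument("--domain", required=True,
                        help="Domain id as listed under `domains:` in the config.")
    parser.add_argument("--out-dir", type=Path, default=Path("./training-jsons"),
                        help="Directory that receives the generated JSON files.")
    return parser


def parse_args(argv: Optional[List[str]] = None) -> argparse.Namespace:
    """Parse `argv`, or sys.argv[1:] when `argv` is None."""
    return build_parser().parse_args(argv)


args = parse_args(["--config", "config.yaml", "--domain", "expense",
                   "--out-dir", "./training-jsons"])
print(args.config, args.domain, args.out_dir)
```

Passing `argv` explicitly keeps the parser easy to exercise from tests without touching `sys.argv`.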
The generator writes JSON datasets plus per-section statistics:
training-jsons/
├─ intro-training.json # Greetings and introductions
├─ operator-training.json # Operator logic examples
├─ rag_context_training.json # RAG context handling
├─ entity-classification-training.json # Entity type classification
├─ safety_guardrails_training.json # Safety and guardrails
├─ hard_negatives_hallucinations.json # Hard negative examples
├─ company_kb_training.json # Company knowledge base Q&A
├─ company_kb_no_hallucinations_training.json # Anti-hallucination KB
├─ business_integration_training.json # Business integration scenarios
├─ expense_documents_training.json # Domain-specific: Expense docs (if configured)
└─ *_stats.json # Stats for each dataset above
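A quick way to sanity-check the generated files is to count records per section. The sketch below assumes each dataset file holds a JSON array of records shaped like the example record in this README (with a `metadata.section` field). That layout is an assumption, and both function names are hypothetical.

```python
import json
from collections import Counter
from pathlib import Path
from typing import Any, Dict, List, Union

PathLike = Union[str, Path]


def summarize_dataset(path: PathLike) -> Dict[str, Any]:
    """Count records and sections in one dataset file (assumed to be a JSON array)."""
    path = Path(path)
    records = json.loads(path.read_text(encoding="utf-8"))
    sections = Counter(
        record.get("metadata", {}).get("section", "unknown") for record in records
    )
    return {"file": path.name, "examples": len(records), "sections": dict(sections)}


def summarize_directory(out_dir: PathLike) -> List[Dict[str, Any]]:
    """Summarize every dataset in `out_dir`, skipping the companion *_stats.json files."""
    paths = sorted(
        p for p in Path(out_dir).glob("*.json") if not p.name.endswith("_stats.json")
    )
    return [summarize_dataset(p) for p in paths]
```

Calling `summarize_directory("./training-jsons")` after a generation run would return one summary dict per dataset file.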
dataset-generator/
├── config.yaml # Multi-domain configuration
├── requirements.txt # Python dependencies
├── Makefile # Build automation
├── README.md # Project overview and usage
└── src/
├── cli.py # Command-line interface
├── domain_config.py # Domain configuration data class
├── factory.py # Section builder factory
├── generator.py # Main dataset generator
├── utils.py # Shared utilities and entity classifier
└── sections/ # Section builders (one per training type)
├── base.py
├── intro.py
├── operator.py
├── entity_classification.py
├── rag_context.py
├── safety.py
└── ...
- Edit config.yaml to add a domain entry:
    domains:
      - id: my_new_domain
        company_name: "MyCompany"
        agent_name: "MyAgent"
        domain_name: "My Domain"
        entity_types: ["TypeA", "TypeB"]
        # ... other configuration
- Extend the entity classifier (optional): add keyword patterns in src/utils.py:classify_entity_name().
- Generate datasets with make generate DOMAIN=my_new_domain.
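To make the optional classifier step more concrete, here is one way a rule-based keyword classifier could work. The real `classify_entity_name()` in src/utils.py is not shown in this README, so the function name `classify_entity_name_sketch`, the `KEYWORD_PATTERNS` table, and the `Unknown` fallback are all hypothetical.

```python
import re
from typing import Dict, List, Optional, Sequence

# Hypothetical keyword table: label -> lowercase tokens that trigger it.
# Adding a domain-specific label means adding one entry here.
KEYWORD_PATTERNS: Dict[str, Sequence[str]] = {
    "Person": ("cfo", "controller", "manager"),
    "Invoice": ("invoice",),
    "Receipt": ("receipt",),
    "Vendor": ("vendor", "supplier"),
    "ExpensePolicy": ("policy",),
}


def classify_entity_name_sketch(
    name: str,
    allowed_labels: Optional[Sequence[str]] = None,
    fallback: str = "Unknown",
) -> List[str]:
    """Return every label whose keywords appear as whole tokens in `name`.

    Labels come back in the order they are declared in KEYWORD_PATTERNS.
    `allowed_labels` can restrict results to a domain's entity_types.
    """
    tokens = set(re.findall(r"[a-z0-9]+", name.lower()))
    labels = [
        label
        for label, keywords in KEYWORD_PATTERNS.items()
        if (allowed_labels is None or label in allowed_labels)
        and any(keyword in tokens for keyword in keywords)
    ]
    return labels or [fallback]


print(classify_entity_name_sketch("Global Invoice for CFO 001"))
# → ['Person', 'Invoice']
```

Matching on whole tokens rather than substrings avoids accidental hits inside longer words, at the cost of not handling multi-word keywords.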
The project supports common Python tooling:
pip install -r requirements.txt # Dev dependencies included
black src/
mypy src/
ruff check src/
pytest
- Dependency inversion: the generator depends on factories rather than concrete builders.
- DRY utilities: shared helpers live in utils.py.
- Extensibility: add new JSON schemas or builders with minimal changes.
- Testability: builders are pure functions returning examples, so they are easy to validate and unit test.
- Quality-first: deduplication, validation, and statistics are built into the generation pipeline.
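The dependency-inversion and testability points can be illustrated with a small registry-based factory. This is only a sketch of the general pattern: `register`, `create_builder`, `generate`, and the sample `build_intro` builder are hypothetical names, and the actual src/factory.py and src/sections/ modules may be organized differently.

```python
from typing import Callable, Dict, List

Example = Dict[str, object]
# A builder is a pure function: domain settings in, examples out.
SectionBuilder = Callable[[Dict[str, str]], List[Example]]

_REGISTRY: Dict[str, SectionBuilder] = {}


def register(section: str) -> Callable[[SectionBuilder], SectionBuilder]:
    """Decorator that makes a builder discoverable by section name."""
    def decorator(builder: SectionBuilder) -> SectionBuilder:
        _REGISTRY[section] = builder
        return builder
    return decorator


@register("intro")
def build_intro(domain: Dict[str, str]) -> List[Example]:
    """Hypothetical builder producing one greeting example."""
    agent = domain["agent_name"]
    return [{
        "system": f"You are {agent}.",
        "instruction": "Who are you?",
        "input": "",
        "output": f"I am {agent}, your {domain['domain_name']} assistant.",
        "metadata": {"section": "intro"},
    }]


def create_builder(section: str) -> SectionBuilder:
    """Factory lookup: the generator never imports concrete builders directly."""
    try:
        return _REGISTRY[section]
    except KeyError:
        raise ValueError(f"No builder registered for section {section!r}") from None


def generate(domain: Dict[str, str], sections: List[str]) -> List[Example]:
    """Run each requested builder and concatenate the results."""
    examples: List[Example] = []
    for section in sections:
        examples.extend(create_builder(section)(domain))
    return examples


demo = generate({"agent_name": "MyAgent", "domain_name": "My Domain"}, ["intro"])
print(demo[0]["output"])
```

Because builders take plain data and return plain data, a unit test can call one directly and assert on the returned list without touching the file system.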