Diabolocom-Research/data-generation

# Agentic Data Generation for Few-Shot Text Classification

This repository contains experiments comparing three approaches to few-shot text classification:

  1. Direct training on seed examples (SetFit, Encoder baselines)
  2. Static LLM prompting to generate synthetic training data
  3. Agentic LLM generation with iterative refinement

## Installation

```bash
# Install dependencies with uv
uv sync

# Or with pip
pip install -e .
```

## Quick Start

### 1. Direct Training Baseline (`orch_setfit.py`)

Train the SetFit or Encoder baseline directly on the seed examples, without any data generation.

```bash
# Single experiment
python src/orch_setfit.py --dataset-name CR --examples-per-label 8 --baseline setfit

# Parallel experiments with multiple seeds
python src/orch_setfit.py --dataset-name CR --baseline setfit --parallel \
    --seeds 10 20 30 --examples-counts 2 4 8 --max-workers 3
```

Programmatic usage:

```python
from src.orch_setfit import run_parallel_experiments

run_parallel_experiments(
    seeds=[10, 20, 30],
    examples_counts=[2, 4, 8],
    dataset_name="CR",
    baseline="setfit",  # or "encoder"
    max_workers=3,
    output_dir="results",
)
```
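The few-shot setup above draws a fixed number of seed examples per label, reproducibly via the seed. The selection can be sketched with the standard library only (`sample_seed_examples` is a hypothetical helper for illustration, not the repository's actual dataset reader):

```python
import random
from collections import defaultdict

def sample_seed_examples(dataset, examples_per_label, seed):
    """Pick `examples_per_label` items per label, reproducibly via `seed`.

    `dataset` is a list of (text, label) pairs. This mirrors the idea behind
    --examples-per-label / --seed; the repository's loader may differ.
    """
    by_label = defaultdict(list)
    for text, label in dataset:
        by_label[label].append((text, label))
    rng = random.Random(seed)
    picked = []
    for label in sorted(by_label):
        picked.extend(rng.sample(by_label[label], examples_per_label))
    return picked

data = [(f"review {i}", i % 2) for i in range(20)]  # toy two-class dataset
seeds = sample_seed_examples(data, examples_per_label=4, seed=10)
```

Seeding a fresh `random.Random` per call is what makes repeated runs with the same `--seed` comparable across baselines.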

### 2. Static LLM Prompting (`simple_example.py`)

Generate synthetic examples with simple LLM prompting, then train the SetFit baseline on the result.

```python
from src.simple_example import run_parallel_experiments

run_parallel_experiments(
    seeds=[10, 20, 30],
    examples_counts=[0, 2, 4, 8],  # 0 = zero-shot
    dataset_name="sst5",
    llm_model="openrouter/openai/gpt-4o-mini",
    max_workers=3,
    output_dir="results",
)
```
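"Static" here means the prompt is assembled once from the seed examples and sent to the LLM with no feedback loop. A minimal sketch of such a prompt (the template wording is hypothetical; it illustrates the idea, not the exact prompt used by `ExampleGenerator`):

```python
def build_generation_prompt(label, seed_texts, n_new=8):
    """Assemble a static prompt asking an LLM for synthetic examples.

    Hypothetical template for illustration only.
    """
    shots = "\n".join(f"- {t}" for t in seed_texts)
    return (
        "You are generating training data for text classification.\n"
        f"Label: {label}\n"
        f"Existing examples:\n{shots}\n"
        f"Write {n_new} new, diverse examples with the same label, one per line."
    )

prompt = build_generation_prompt("positive", ["Great battery life!", "Loved it."])
```

The LLM's line-separated completions would then be paired with the target label and appended to the training set.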

### 3. Agentic Generation (`agentic_simple_example.py`)

Use an agentic LLM approach with iterative refinement driven by SetFitTool feedback.

```python
from src.agentic_simple_example import run_parallel_agentic_experiments

run_parallel_agentic_experiments(
    seeds=[10, 20, 30],
    examples_counts=[0, 2, 4, 8],
    dataset_name="sst5",
    baseline_type="setfit",  # or "encoder"
    llm_model="openrouter/openai/gpt-5",
    max_workers=2,
    output_dir="results",
)
```
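The agentic loop differs from static prompting in that each generation round is scored by a proxy classifier (the role SetFitTool plays) and the score is fed back into the next generation request. A conceptual sketch with stand-in callables (the real agent uses `ToolCallingAgent` and LLM calls; everything below is illustrative):

```python
def agentic_generation_loop(generate, train_and_score, rounds=3, target=0.9):
    """Generate data, score it with a proxy classifier, feed the score back.

    `generate(feedback)` returns a synthetic dataset; `train_and_score(data)`
    returns a held-out accuracy. Both stand in for LLM/tool calls.
    Returns the best (accuracy, dataset) pair seen.
    """
    feedback, best = None, (0.0, None)
    for _ in range(rounds):
        data = generate(feedback)
        acc = train_and_score(data)
        if acc > best[0]:
            best = (acc, data)
        if acc >= target:
            break  # good enough, stop refining
        feedback = f"Held-out accuracy was {acc:.2f}; make examples more varied."
    return best

# Deterministic stubs standing in for the LLM agent and the SetFit proxy:
history = []
def fake_generate(feedback):
    history.append(feedback)
    return [f"example {len(history)}"]
def fake_score(data):
    return 0.5 + 0.2 * len(history)  # pretend quality improves each round

acc, data = agentic_generation_loop(fake_generate, fake_score)
```

Keeping the best-so-far dataset rather than the last one guards against a refinement round that regresses.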

## Main Scripts

| Script | Approach | Data Generation | Description |
| --- | --- | --- | --- |
| `orch_setfit.py` | Direct training | None | Baseline: train on seed examples only |
| `simple_example.py` | Static prompting | `ExampleGenerator` | LLM generates synthetic data |
| `agentic_simple_example.py` | Agentic | `ToolCallingAgent` | Agent iteratively refines data |

## Baselines

### SetFit (`src/baselines/setfit.py`)

Contrastive learning with sentence transformers. Fast and efficient in few-shot settings.

### Encoder (`src/baselines/encoder.py`)

Standard transformer fine-tuning with `AutoModelForSequenceClassification`. Default model: EuroBERT-210m.
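SetFit's contrastive stage works on pairs of examples rather than single labeled texts: same-label pairs are pushed together in embedding space, cross-label pairs apart. The pair construction can be sketched with the standard library (a conceptual sketch of the idea, not the `setfit` library's implementation):

```python
from itertools import combinations

def build_contrastive_pairs(examples):
    """Turn labeled texts into (text_a, text_b, similarity) training pairs:
    same-label pairs get target 1.0, cross-label pairs 0.0.

    Conceptual sketch of SetFit's first stage; the setfit library samples
    pairs differently in practice.
    """
    pairs = []
    for (ta, la), (tb, lb) in combinations(examples, 2):
        pairs.append((ta, tb, 1.0 if la == lb else 0.0))
    return pairs

pairs = build_contrastive_pairs(
    [("great phone", 1), ("loved it", 1), ("terrible", 0)]
)
```

Because N seed examples yield on the order of N² pairs, this is why SetFit squeezes more signal out of tiny seed sets than direct fine-tuning.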

## CLI Arguments

Common arguments for `orch_setfit.py`:

- `--dataset-name`: Dataset name (`CR`, `sst5`, `emotion`, `ag_news`, etc.)
- `--examples-per-label`: Number of seed examples per label
- `--seed`: Random seed for reproducibility
- `--baseline`: Baseline type (`setfit` or `encoder`)
- `--model-name`: Override the default model

Parallel experiment arguments (`orch_setfit.py --parallel`):

- `--seeds`: List of seeds (e.g., `10 20 30`)
- `--examples-counts`: List of example counts (e.g., `2 4 8`)
- `--max-workers`: Number of parallel workers
- `--output-dir`: Output directory for results
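The flags above can be wired up with `argparse`; the sketch below is a hypothetical parser mirroring this README (the real CLI lives in `src/orch_setfit.py` and may differ in defaults and help text):

```python
import argparse

def build_parser():
    """Hypothetical parser mirroring the documented flags."""
    p = argparse.ArgumentParser(description="Few-shot baseline experiments")
    p.add_argument("--dataset-name", required=True)
    p.add_argument("--examples-per-label", type=int, default=8)
    p.add_argument("--seed", type=int, default=42)
    p.add_argument("--baseline", choices=["setfit", "encoder"], default="setfit")
    p.add_argument("--model-name")  # optional model override
    p.add_argument("--parallel", action="store_true")
    p.add_argument("--seeds", type=int, nargs="+", default=[10, 20, 30])
    p.add_argument("--examples-counts", type=int, nargs="+", default=[2, 4, 8])
    p.add_argument("--max-workers", type=int, default=3)
    p.add_argument("--output-dir", default="results")
    return p

args = build_parser().parse_args(
    ["--dataset-name", "CR", "--baseline", "setfit", "--parallel",
     "--seeds", "10", "20", "30", "--examples-counts", "2", "4", "8"]
)
```

Note that `nargs="+"` with `type=int` is what lets `--seeds 10 20 30` arrive as a list of integers.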

## Reproducing Results

Run all approaches on a dataset:

```python
# 1. Direct SetFit baseline
from src.orch_setfit import run_parallel_experiments
run_parallel_experiments(
    seeds=[10, 20, 30],
    examples_counts=[2, 4, 8],
    dataset_name="sst5",
    baseline="setfit",
    max_workers=3,
    output_dir="results",
)

# 2. Static prompting + SetFit
from src.simple_example import run_parallel_experiments
run_parallel_experiments(
    seeds=[10, 20, 30],
    examples_counts=[0, 2, 4, 8],
    dataset_name="sst5",
    max_workers=3,
    output_dir="results",
)

# 3. Agentic + SetFit
from src.agentic_simple_example import run_parallel_agentic_experiments
run_parallel_agentic_experiments(
    seeds=[10, 20, 30],
    examples_counts=[0, 2, 4, 8],
    dataset_name="sst5",
    baseline_type="setfit",
    max_workers=2,
    output_dir="results",
)
```

Results are saved to the `results/` directory as JSON files with accuracy statistics.
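Aggregating those per-run files can be sketched as below. The `{"accuracy": ...}` schema is an assumption for illustration; the repository's actual result files may use different keys:

```python
import json
import statistics
from pathlib import Path

def summarize_results(results_dir):
    """Collect per-run accuracies from result JSON files in `results_dir`
    and report count, mean, and population stdev.

    Assumes each file holds {"accuracy": float, ...} (hypothetical schema).
    """
    accs = [
        json.loads(path.read_text())["accuracy"]
        for path in sorted(Path(results_dir).glob("*.json"))
    ]
    return {
        "runs": len(accs),
        "mean_accuracy": statistics.mean(accs),
        "stdev": statistics.pstdev(accs),
    }

# summarize_results("results")  # after experiments have been run
```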

## Project Structure

```
src/
├── baselines/
│   ├── setfit.py          # SetFit baseline
│   └── encoder.py         # Encoder baseline (EuroBERT, ModernBERT)
├── dataset_folder/
│   └── huggingface_dataset_reader.py  # Dataset loading
├── agentic/
│   └── setfit_wrapper_tool.py  # SetFitTool for agents
├── orch_setfit.py         # Direct training (parallel experiments)
├── simple_example.py      # Static LLM prompting (parallel experiments)
├── agentic_simple_example.py  # Agentic generation (parallel experiments)
└── llm_example_generator.py  # LLM-based data generator
```

## Environment Variables

Set these in a `.env` file:

```
OPENROUTER_API_KEY=your_key_here
```
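If you prefer not to depend on a dotenv package, a minimal stdlib-only loader looks like this (a stand-in for `python-dotenv`'s `load_dotenv`, not what the repository necessarily uses):

```python
import os
from pathlib import Path

def load_dotenv(path=".env"):
    """Minimal .env loader: KEY=VALUE lines, '#' comments, blank lines.

    Existing environment variables are not overwritten.
    """
    env_file = Path(path)
    if not env_file.exists():
        return
    for line in env_file.read_text().splitlines():
        line = line.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, value = line.partition("=")
        os.environ.setdefault(key.strip(), value.strip())
```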
