
Evaluating LLMs on Structured Output

A simple research framework to evaluate various prompt engineering strategies for structured output.

Presentation

This work accompanies a presentation at PyData Global 2025.

See the Jupyter notebook at presentation.ipynb.

Research results

All results are stored as CSV files in eval_results/*.csv. We use two frameworks to evaluate the outputs.
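
For a quick look at the collected numbers, the CSVs can be loaded into a single table. A minimal sketch, assuming pandas is installed (column names vary per file, so this only concatenates and previews them):

# Load every evaluation CSV under eval_results/ into one DataFrame.
# Column names depend on the individual files; this is for quick inspection only.
import glob

import pandas as pd

frames = []
for path in glob.glob("eval_results/*.csv"):
    df = pd.read_csv(path)
    df["source_file"] = path  # remember which file each row came from
    frames.append(df)

results = pd.concat(frames, ignore_index=True)
print(results.head())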

lm-evaluation-harness

We use https://github.com/EleutherAI/lm-evaluation-harness/tree/main/lm_eval/tasks to pull the Shuffled Objects (5/7 objects) and GSM8K benchmarks, and wrap the framework to run the evaluations. See pyproject.toml for the available eval tasks.

For example:

$ ./structo_eval.py bbh_zeroshot_tracking_shuffled_objects_five_objects --all --base-model claude-haiku-3

======================================================================
                      SUMMARY - All Integrations
======================================================================

Dataset:     bbh_zeroshot_tracking_shuffled_objects_five_objects
Base Model:  claude-haiku-3
Total Duration: 4249.8s (70.8 minutes)
----------------------------------------------------------------------
Integration                           Accuracy          Status
----------------------------------------------------------------------
natlang                                 17.39%         Success
natlang_cot                             54.40%         Success
structo                                 19.60%         Success
structo_small_steps                     65.60%         Success
----------------------------------------------------------------------

======================================================================
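
Programmatically, the wrapper amounts to pointing the harness's simple_evaluate at a registered model and one of the task names above. A minimal sketch of that idea, using the harness's built-in openai-chat-completions model for illustration (the project's actual runs go through the LiteLLM-backed model described under Architecture):

# Drive lm-evaluation-harness from Python instead of the CLI.
# Requires OPENAI_API_KEY in the environment; the model choice here is
# illustrative, as structo_eval.py registers its own model class.
import lm_eval

results = lm_eval.simple_evaluate(
    model="openai-chat-completions",
    model_args="model=gpt-4o",
    tasks=["bbh_zeroshot_tracking_shuffled_objects_five_objects"],
)
print(results["results"])  # per-task metric dictionary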

DSPy evaluation

These evaluations are run by the following scripts:

  • dspy_nl_example.py
  • dspy_so_example.py

Results are logged to MLflow (run via task mlflow, which starts the experiment server at http://localhost:5000).

We then collect the results manually from the experiments and collate them into CSV files for the various optimizations.
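
In shape, each script configures a DSPy language model, defines a signature for the task, and logs runs to the MLflow server above. A hedged sketch of that pattern (the signature fields, experiment name, and model identifier below are illustrative, not copied from the scripts):

# Illustrative DSPy + MLflow skeleton; the real programs live in
# dspy_nl_example.py and dspy_so_example.py.
import dspy
import mlflow

mlflow.set_tracking_uri("http://localhost:5000")
mlflow.set_experiment("structo-dspy")  # hypothetical experiment name

dspy.configure(lm=dspy.LM("openai/gpt-4o"))  # any LiteLLM-style identifier

class TrackObjects(dspy.Signature):
    """Answer a shuffled-objects tracking question."""
    question: str = dspy.InputField()
    answer: str = dspy.OutputField()

predict = dspy.Predict(TrackObjects)

with mlflow.start_run():
    prediction = predict(question="Alice, Bob and Claire swap positions ...")
    mlflow.log_param("model", "openai/gpt-4o")
    print(prediction.answer)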

Quick Start

Run an evaluation:

python structo_eval.py tracking_shuffled_objects_five_objects gpt-4o

This will:

  • Load the dataset from Hugging Face
  • Register a custom model using LiteLLM
  • Run evaluation via lm-evaluation-harness
  • Display results in a readable format
  • Save results to results/ directory

Usage

python structo_eval.py <dataset_name> <model_identifier>

Arguments:

  • dataset_name: Dataset name from Hugging Face (e.g., tracking_shuffled_objects_five_objects)
  • model_identifier: Model identifier for LiteLLM (e.g., gpt-4o, openai/gpt-4o)

Options:

  • --dataset-source: Source dataset repository (default: openeval/BIG-Bench-Hard)
  • --help: Show help message

Examples

# Evaluate GPT-4o on tracking shuffled objects dataset
python structo_eval.py tracking_shuffled_objects_five_objects gpt-4o

# Use explicit provider format
python structo_eval.py tracking_shuffled_objects_five_objects openai/gpt-4o

Setup

Dependencies

Create a virtual environment via uv venv structo:

uv venv structo
source structo/bin/activate  # or: uv venv

Dependencies are managed via uv and defined in pyproject.toml:

  • litellm>=1.80.7 - Unified interface for model providers
  • lm-eval>=0.4.9.2 - Evaluation harness framework
  • datasets>=4.4.1 - Hugging Face dataset loading

API Credentials

Set up API credentials for your model provider (e.g., OpenAI):

export OPENAI_API_KEY="your-api-key-here"

Architecture

The framework uses:

  • LiteLLM: Unified interface for multiple model providers
  • lm-evaluation-harness: Evaluation framework with custom model support
  • Hugging Face datasets: Dataset loading and management

A custom model class (src/structo/model.py) bridges LiteLLM and lm-evaluation-harness.
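
As an illustration of that bridge (the real implementation is in src/structo/model.py; the class name, registration string, and omitted loglikelihood support below are assumptions):

# Sketch of a LiteLLM-backed model for lm-evaluation-harness.
# Generation-only: loglikelihood-style tasks are not covered by this sketch.
import litellm
from lm_eval.api.model import LM
from lm_eval.api.registry import register_model

@register_model("litellm-chat")  # hypothetical registration name
class LiteLLMChat(LM):
    def __init__(self, model="gpt-4o", **kwargs):
        super().__init__()
        self.model = model

    def generate_until(self, requests):
        outputs = []
        for request in requests:
            context, _gen_kwargs = request.args  # prompt text and generation settings
            response = litellm.completion(
                model=self.model,
                messages=[{"role": "user", "content": context}],
            )
            outputs.append(response.choices[0].message.content)
        return outputs

    def loglikelihood(self, requests):
        raise NotImplementedError("generation-only sketch")

    def loglikelihood_rolling(self, requests):
        raise NotImplementedError("generation-only sketch")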

Results

Results are saved to: results/{dataset_name}/{model_name}/{timestamp}.json (a loading sketch follows the list below)

Each result file includes:

  • Evaluation metrics
  • Configuration used
  • Tool versions (for reproducibility)
  • Timestamp and metadata
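
To inspect a saved run afterwards, the JSON file can simply be loaded. A small sketch, assuming the directory layout above (the key layout inside the file varies, so this only previews the payload):

# Preview the most recent result file for a dataset/model pair.
import json
from pathlib import Path

run_dir = Path("results/tracking_shuffled_objects_five_objects/gpt-4o")
latest = sorted(run_dir.glob("*.json"))[-1]  # assumes sortable timestamp filenames

with latest.open() as f:
    payload = json.load(f)

print(latest.name)
print(json.dumps(payload, indent=2)[:2000])  # metrics, config, versions, metadata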

Project Management

We use speckit for project management:

specify init structo-eval --ai cursor-agent --here

See specs/001-structo-eval-script/ for detailed specifications and implementation plan.

