A simple research framework to evaluate various prompt engineering strategies for structured output.
It was built for a presentation at PyData Global 2025; see the Jupyter notebook at presentation.ipynb.
All results are stored as CSV files under eval_results/. There are two frameworks
we use to evaluate the outputs.
The first uses the task definitions from https://github.com/EleutherAI/lm-evaluation-harness/tree/main/lm_eval/tasks to pull
the Shuffled Objects (5/7 objects) and GSM8K benchmarks; we wrap the harness to run
the evaluation. To run it, see pyproject.toml for the available eval tasks.
For example:
$ ./structo_eval.py bbh_zeroshot_tracking_shuffled_objects_five_objects --all --base-model claude-haiku-3
======================================================================
SUMMARY - All Integrations
======================================================================
Dataset: bbh_zeroshot_tracking_shuffled_objects_five_objects
Base Model: claude-haiku-3
Total Duration: 4249.8s (70.8 minutes)
----------------------------------------------------------------------
Integration                    Accuracy    Status
----------------------------------------------------------------------
natlang                          17.39%    Success
natlang_cot                      54.40%    Success
structo                          19.60%    Success
structo_small_steps              65.60%    Success
----------------------------------------------------------------------
======================================================================
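Under the hood, the wrapper boils down to calling lm-evaluation-harness programmatically. The following is a minimal sketch, assuming lm_eval >= 0.4 and a LiteLLM-backed model object; the class name and import path are placeholders for whatever src/structo/model.py actually exposes:

```python
# Minimal sketch: run one lm-evaluation-harness task against a LiteLLM-backed model.
# LiteLLMModel is a placeholder for the custom class in src/structo/model.py.
import json

import lm_eval

from structo.model import LiteLLMModel  # hypothetical import path

model = LiteLLMModel(model="claude-haiku-3")  # any LiteLLM model identifier

results = lm_eval.simple_evaluate(
    model=model,
    tasks=["bbh_zeroshot_tracking_shuffled_objects_five_objects"],
)

# The harness returns a nested dict; the per-task metrics live under "results".
print(json.dumps(results["results"], indent=2, default=str))
```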
The second set of evaluations is driven by the scripts:
- dspy_nl_example.py
- dspy_so_example.py
Results are logged to MLflow (run via task mlflow, which starts the
experiment server at http://localhost:5000).
We then collect the results manually from the experiments and collate them into the CSVs for the various optimizations.
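The shape of those scripts is roughly as follows. This is only an illustrative sketch: the model identifier, signature, and experiment name are placeholders rather than the actual contents of dspy_nl_example.py or dspy_so_example.py, and it assumes a recent MLflow build that ships the DSPy autolog flavor:

```python
# Illustrative sketch of the DSPy + MLflow setup; names are placeholders.
import dspy
import mlflow

mlflow.set_tracking_uri("http://localhost:5000")  # the server started by `task mlflow`
mlflow.set_experiment("structo-prompting")        # placeholder experiment name
mlflow.dspy.autolog()                             # log DSPy calls as MLflow traces

dspy.configure(lm=dspy.LM("anthropic/claude-3-haiku-20240307"))  # placeholder model id

class ShuffleQA(dspy.Signature):
    """Answer a tracking-shuffled-objects question."""
    question: str = dspy.InputField()
    answer: str = dspy.OutputField(desc="the final answer only")

predict = dspy.ChainOfThought(ShuffleQA)
print(predict(question="Alice, Bob and Claire swap balls ...").answer)
```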
Run an evaluation:
python structo_eval.py tracking_shuffled_objects_five_objects gpt-4o

This will:
- Load the dataset from Hugging Face
- Register a custom model using LiteLLM
- Run evaluation via lm-evaluation-harness
- Display results in a readable format
- Save results to the results/ directory
python structo_eval.py <dataset_name> <model_identifier>

Arguments:
- dataset_name: Dataset name from Hugging Face (e.g., tracking_shuffled_objects_five_objects)
- model_identifier: Model identifier for LiteLLM (e.g., gpt-4o, openai/gpt-4o)
Options:
- --dataset-source: Source dataset repository (default: openeval/BIG-Bench-Hard)
- --help: Show help message
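The --dataset-source repository is loaded with plain Hugging Face datasets. A rough sketch, under the assumption that openeval/BIG-Bench-Hard exposes each task as a config name:

```python
# Rough sketch of the dataset load; assumes openeval/BIG-Bench-Hard exposes
# each task (e.g. tracking_shuffled_objects_five_objects) as a config name.
from datasets import load_dataset

ds = load_dataset("openeval/BIG-Bench-Hard", "tracking_shuffled_objects_five_objects")
print(ds)                     # splits and row counts
print(ds[next(iter(ds))][0])  # first example of the first split
```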
# Evaluate GPT-4o on tracking shuffled objects dataset
python structo_eval.py tracking_shuffled_objects_five_objects gpt-4o
# Use explicit provider format
python structo_eval.py tracking_shuffled_objects_five_objects openai/gpt-4o

The virtual environment was created via uv venv structo:
uv venv structo
source structo/bin/activate  # or: uv venv

Dependencies are managed via uv and defined in pyproject.toml:
- litellm>=1.80.7 - Unified interface for model providers
- lm-eval>=0.4.9.2 - Evaluation harness framework
- datasets>=4.4.1 - Hugging Face dataset loading
Set up API credentials for your model provider (e.g., OpenAI):
export OPENAI_API_KEY="your-api-key-here"

The framework uses:
- LiteLLM: Unified interface for multiple model providers
- lm-evaluation-harness: Evaluation framework with custom model support
- Hugging Face datasets: Dataset loading and management
A custom model class (src/structo/model.py) bridges LiteLLM and lm-evaluation-harness.
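A stripped-down sketch of what such a bridge can look like for generation-only tasks is below; the real class in src/structo/model.py will differ, and only generate_until is fleshed out:

```python
# Sketch of a LiteLLM <-> lm-evaluation-harness bridge (generation-only tasks).
# The real class in src/structo/model.py will differ; this only shows the shape.
import litellm
from lm_eval.api.model import LM


class LiteLLMBridge(LM):
    def __init__(self, model: str = "gpt-4o"):
        super().__init__()
        self.model = model

    def generate_until(self, requests):
        outputs = []
        for request in requests:
            context, gen_kwargs = request.args  # (prompt, generation kwargs)
            response = litellm.completion(
                model=self.model,
                messages=[{"role": "user", "content": context}],
            )
            text = response.choices[0].message.content
            # Truncate at any stop sequences the task specifies (assumed to be a list).
            for stop in gen_kwargs.get("until", []):
                text = text.split(stop)[0]
            outputs.append(text)
        return outputs

    # Loglikelihood-based tasks are not supported over chat-style APIs.
    def loglikelihood(self, requests):
        raise NotImplementedError

    def loglikelihood_rolling(self, requests):
        raise NotImplementedError
```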
Results are saved to: results/{dataset_name}/{model_name}/{timestamp}.json
Each result file includes:
- Evaluation metrics
- Configuration used
- Tool versions (for reproducibility)
- Timestamp and metadata
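To inspect the latest run without going through MLflow or the CSVs, something like the following works; the JSON field names depend on the harness version and our wrapper, and the example assumes timestamp filenames sort lexicographically:

```python
# Quick sketch: load the most recent result file for one dataset/model pair.
import json
from pathlib import Path

result_dir = Path("results/tracking_shuffled_objects_five_objects/gpt-4o")
latest = max(result_dir.glob("*.json"))  # assumes timestamps sort lexicographically
with latest.open() as f:
    result = json.load(f)
print(json.dumps(result, indent=2)[:2000])
```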
We use speckit for project management:
specify init structo-eval --ai cursor-agent --here

See specs/001-structo-eval-script/ for detailed specifications and the implementation plan.