A simple research framework to evaluate various prompt engineering strategies for structured output.
It was built for a presentation at PyData Global 2025; see the Jupyter notebook at presentation.ipynb.
All results are stored as CSV files under eval_results/. There are two frameworks
we use to evaluate the outputs.
The first uses the task definitions from https://github.com/EleutherAI/lm-evaluation-harness/tree/main/lm_eval/tasks to pull
the Shuffled Objects (5/7 objects) and GSM8K benchmarks; we wrap the harness to run
the evaluation. To run it, see pyproject.toml for the available eval tasks.
For example:
$ ./structo_eval.py bbh_zeroshot_tracking_shuffled_objects_five_objects --all --base-model claude-haiku-3
======================================================================
SUMMARY - All Integrations
======================================================================
Dataset: bbh_zeroshot_tracking_shuffled_objects_five_objects
Base Model: claude-haiku-3
Total Duration: 4249.8s (70.8 minutes)
----------------------------------------------------------------------
Integration                    Accuracy    Status
----------------------------------------------------------------------
natlang                          17.39%    Success
natlang_cot                      54.40%    Success
structo                          19.60%    Success
structo_small_steps              65.60%    Success
----------------------------------------------------------------------
======================================================================
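Under the hood, the wrapper boils down to calling lm-evaluation-harness programmatically. The following is a minimal sketch, assuming lm_eval >= 0.4 and a LiteLLM-backed model object; the class name and import path are placeholders for whatever src/structo/model.py actually exposes:

```python
# Minimal sketch: run one lm-evaluation-harness task against a LiteLLM-backed model.
# LiteLLMModel is a placeholder for the custom class in src/structo/model.py.
import json

import lm_eval

from structo.model import LiteLLMModel  # hypothetical import path

model = LiteLLMModel(model="claude-haiku-3")  # any LiteLLM model identifier

results = lm_eval.simple_evaluate(
    model=model,
    tasks=["bbh_zeroshot_tracking_shuffled_objects_five_objects"],
)

# The harness returns a nested dict; the per-task metrics live under "results".
print(json.dumps(results["results"], indent=2, default=str))
```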
The second set of evaluations is driven by the scripts:
- dspy_nl_example.py
- dspy_so_example.py
Results are logged to MLflow (run via task mlflow, which starts the
experiment server at http://localhost:5000).
We then collect the results manually from the experiments and collate them into the CSVs for the various optimizations.
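The shape of those scripts is roughly as follows. This is only an illustrative sketch: the model identifier, signature, and experiment name are placeholders rather than the actual contents of dspy_nl_example.py or dspy_so_example.py, and it assumes a recent MLflow build that ships the DSPy autolog flavor:

```python
# Illustrative sketch of the DSPy + MLflow setup; names are placeholders.
import dspy
import mlflow

mlflow.set_tracking_uri("http://localhost:5000")  # the server started by `task mlflow`
mlflow.set_experiment("structo-prompting")        # placeholder experiment name
mlflow.dspy.autolog()                             # log DSPy calls as MLflow traces

dspy.configure(lm=dspy.LM("anthropic/claude-3-haiku-20240307"))  # placeholder model id

class ShuffleQA(dspy.Signature):
    """Answer a tracking-shuffled-objects question."""
    question: str = dspy.InputField()
    answer: str = dspy.OutputField(desc="the final answer only")

predict = dspy.ChainOfThought(ShuffleQA)
print(predict(question="Alice, Bob and Claire swap balls ...").answer)
```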
Run an evaluation:
python structo_eval.py tracking_shuffled_objects_five_objects gpt-4o

This will:
- Load the dataset from Hugging Face
- Register a custom model using LiteLLM
- Run evaluation via lm-evaluation-harness
- Display results in a readable format
- Save results to the results/ directory
python structo_eval.py <dataset_name> <model_identifier>

Arguments:
- dataset_name: Dataset name from Hugging Face (e.g., tracking_shuffled_objects_five_objects)
- model_identifier: Model identifier for LiteLLM (e.g., gpt-4o, openai/gpt-4o)
Options:
- --dataset-source: Source dataset repository (default: openeval/BIG-Bench-Hard)
- --help: Show help message
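The --dataset-source repository is loaded with plain Hugging Face datasets. A rough sketch, under the assumption that openeval/BIG-Bench-Hard exposes each task as a config name:

```python
# Rough sketch of the dataset load; assumes openeval/BIG-Bench-Hard exposes
# each task (e.g. tracking_shuffled_objects_five_objects) as a config name.
from datasets import load_dataset

ds = load_dataset("openeval/BIG-Bench-Hard", "tracking_shuffled_objects_five_objects")
print(ds)                     # splits and row counts
print(ds[next(iter(ds))][0])  # first example of the first split
```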
# Evaluate GPT-4o on tracking shuffled objects dataset
python structo_eval.py tracking_shuffled_objects_five_objects gpt-4o
# Use explicit provider format
python structo_eval.py tracking_shuffled_objects_five_objects openai/gpt-4o

The virtual environment was created via uv venv structo:
uv venv structo
source structo/bin/activate  # or: uv venv

Dependencies are managed via uv and defined in pyproject.toml:
- litellm>=1.80.7 - Unified interface for model providers
- lm-eval>=0.4.9.2 - Evaluation harness framework
- datasets>=4.4.1 - Hugging Face dataset loading
Set up API credentials for your model provider (e.g., OpenAI):
export OPENAI_API_KEY="your-api-key-here"

The framework uses:
- LiteLLM: Unified interface for multiple model providers
- lm-evaluation-harness: Evaluation framework with custom model support
- Hugging Face datasets: Dataset loading and management
A custom model class (src/structo/model.py) bridges LiteLLM and lm-evaluation-harness.
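A stripped-down sketch of what such a bridge can look like for generation-only tasks is below; the real class in src/structo/model.py will differ, and only generate_until is fleshed out:

```python
# Sketch of a LiteLLM <-> lm-evaluation-harness bridge (generation-only tasks).
# The real class in src/structo/model.py will differ; this only shows the shape.
import litellm
from lm_eval.api.model import LM


class LiteLLMBridge(LM):
    def __init__(self, model: str = "gpt-4o"):
        super().__init__()
        self.model = model

    def generate_until(self, requests):
        outputs = []
        for request in requests:
            context, gen_kwargs = request.args  # (prompt, generation kwargs)
            response = litellm.completion(
                model=self.model,
                messages=[{"role": "user", "content": context}],
            )
            text = response.choices[0].message.content
            # Truncate at any stop sequences the task specifies (assumed to be a list).
            for stop in gen_kwargs.get("until", []):
                text = text.split(stop)[0]
            outputs.append(text)
        return outputs

    # Loglikelihood-based tasks are not supported over chat-style APIs.
    def loglikelihood(self, requests):
        raise NotImplementedError

    def loglikelihood_rolling(self, requests):
        raise NotImplementedError
```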
Results are saved to: results/{dataset_name}/{model_name}/{timestamp}.json
Each result file includes:
- Evaluation metrics
- Configuration used
- Tool versions (for reproducibility)
- Timestamp and metadata
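To inspect the latest run without going through MLflow or the CSVs, something like the following works; the JSON field names depend on the harness version and our wrapper, and the example assumes timestamp filenames sort lexicographically:

```python
# Quick sketch: load the most recent result file for one dataset/model pair.
import json
from pathlib import Path

result_dir = Path("results/tracking_shuffled_objects_five_objects/gpt-4o")
latest = max(result_dir.glob("*.json"))  # assumes timestamps sort lexicographically
with latest.open() as f:
    result = json.load(f)
print(json.dumps(result, indent=2)[:2000])
```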
We use speckit for project management:
specify init structo-eval --ai cursor-agent --here

See specs/001-structo-eval-script/ for detailed specifications and the implementation plan.