Exp result rebase #71

PaulHax · 2026-01-05T02:17:30Z

Make each probe run directly editable and remove “prompt” sidebar
Runtime editable ADM hydra configs
Export runs as zip in same format as running experiments via align-system and hydra
Exploring the parameter space is hard, so added a table to browse runs/probes

TODO

Fix pipeline_random LLM option for loaded experiments
Table view should show all loaded experiment runs
Table view: filter values in columns
Fix spacing between run columns
Support collection of files rather than zip for experiment loading? A directory?
Consistent approach to generating edits to scenarios and ADMs from text entry.
Hash the edits of the decider ADM config and don't create a new edit if the string is the same
Encourage folks to look at the table after Loading Experiments somehow?
Fix memory cleanup of old ADMs

related to #60

Update selectors in page object for new layout and fix timing issues with dropdown selections by waiting for listbox to close.

- Add add_probes_from_experiments method to ProbeRegistry
- Wire up in core.py to populate probes from experiment items
- Skip default probes when --experiments or --scenarios is provided

Extract unique decider configs from experiment results and add them to the decider dropdown. Configs are deduplicated by hashing the normalized adm section (with dataset paths stripped to filenames). Priority order: CLI --deciders > experiments > built-in deciders.

Extract the experiment's configured LLM and put it first in the available backbones list, followed by the default LLM options. Consolidate experiment config loading into experiment_config_loader.py.
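For reviewers, the hash-based deduplication of decider configs described above could look roughly like the sketch below. It is not the code in this PR: `config_hash`, `unique_configs`, and the `DATASET_SUFFIXES` heuristic for spotting dataset paths are hypothetical names and assumptions made for illustration.

```python
import hashlib
import json
from pathlib import PurePosixPath
from typing import Any

# Assumption: a string is treated as a dataset path if it contains "/"
# and ends with one of these suffixes. The real rule may differ.
DATASET_SUFFIXES = (".json", ".jsonl", ".csv")


def _strip_dataset_paths(value: Any) -> Any:
    """Recursively replace dataset-like paths with just their file name."""
    if isinstance(value, dict):
        return {key: _strip_dataset_paths(item) for key, item in value.items()}
    if isinstance(value, list):
        return [_strip_dataset_paths(item) for item in value]
    if isinstance(value, str) and "/" in value and value.endswith(DATASET_SUFFIXES):
        return PurePosixPath(value).name
    return value


def config_hash(config: dict) -> str:
    """Hash only the normalized "adm" section of a config."""
    normalized = _strip_dataset_paths(config.get("adm", {}))
    # sort_keys makes the hash independent of key order.
    payload = json.dumps(normalized, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()


def unique_configs(configs: list[dict]) -> list[dict]:
    """Keep the first config seen for each hash, preserving input order."""
    seen: dict[str, dict] = {}
    for config in configs:
        seen.setdefault(config_hash(config), config)
    return list(seen.values())
```

Under these assumptions, two experiment configs that differ only in where the dataset lives on disk collapse to a single dropdown entry.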

Move experiment-to-registry conversion logic into experiment_converters.py. Core orchestrates calling these functions and populating registries.

- Move run_models.py from app/ to adm/ for better layering
- Add run conversion functions to experiment_converters.py
- Add lru_cache to load_experiment_adm_config for performance
- Wire up populate_cache_bulk in core.py on startup

Keep resolved_config pure - only what the decider module needs to instantiate an ADM. GUI metadata (llm_backbones, max_alignment_attributes) now stays in decider entries and is accessed via get_decider_options(). Removed model_path_keys entirely (was redundant - injection path is hardcoded).

Editing the config and pressing Choose creates a new run with a new "edited" decider (named "{original} - edit {n}"). The edited config bypasses Hydra loading and uses the stored resolved config directly.
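As a rough illustration of the "{original} - edit {n}" naming, one way to pick the next free name is shown below. `next_edit_name` is a hypothetical helper, and choosing the smallest unused `n` is an assumption rather than something this PR states.

```python
def next_edit_name(original: str, existing_names: set[str]) -> str:
    """Return "{original} - edit {n}" for the smallest n not already taken."""
    n = 1
    while f"{original} - edit {n}" in existing_names:
        n += 1
    return f"{original} - edit {n}"
```

For example, with `{"baseline", "baseline - edit 1"}` already registered, the next edited decider would be named `baseline - edit 2`.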

- Extract shared caching logic into _execute_with_cache() helper
- Remove unused create_and_execute_run() method
- Fix SearchController/RunsStateAdapter circular dependency with callback
- Delete orphaned prompt_logic.py, move functions to consumers
- Remove unused ui.py functions (serialize_prompt, prep_for_state, etc)
- Convert RunsRegistry namedtuple to Protocol for better type hints
- Rename find_probe_by_base_and_scene to find_probe_by_scenario_and_scene

When loading experiments with ADMs that don't use an LLM (e.g., pipeline_random), set llm_backbones to ["no_llm"] instead of the full LLM_BACKBONES list. This matches the value stored in runs from ADMConfig.llm_backbone property.

- Use cache_key as unified ID for table rows (runs and experiment items)
- Store experiment items in RunsRegistry instead of separate registry
- Materialize experiment items to full Runs only when selected
- CLI --experiments and ZIP upload now use same import_experiments()
- Remove ExperimentResultsRegistry (no longer needed)
- Add log messages when loading experiments

- Add RunsTableFilter class to manage filter state and logic
- Add VSelect dropdowns for each column (Scenario, Scene, Decider, LLM, Alignment, Decision)
- Support multi-select filtering with AND combination across columns
- Extract pure filter functions for testability
- Add unit tests for filter logic
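The filter semantics above can be sketched as pure functions. This is a sketch, not the RunsTableFilter implementation: `row_matches` and `filter_rows` are hypothetical names, and it assumes that values selected within one column are OR-ed together and that an empty selection means "no filter on this column".

```python
from typing import Iterable, Mapping


def row_matches(row: Mapping[str, str], filters: Mapping[str, Iterable[str]]) -> bool:
    """A row passes only if it satisfies every column that has a selection (AND)."""
    for column, selected in filters.items():
        allowed = set(selected)
        # Within a column, any selected value is acceptable (OR).
        if allowed and row.get(column) not in allowed:
            return False
    return True


def filter_rows(
    rows: Iterable[Mapping[str, str]], filters: Mapping[str, Iterable[str]]
) -> list[Mapping[str, str]]:
    """Return the rows that match all active column filters, in input order."""
    return [row for row in rows if row_matches(row, filters)]
```

Keeping these functions free of UI state is what makes them straightforward to unit test.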

- Add sortable column headers with sort indicator icons
- Fix filter dropdown click not triggering column sort
- Use natural sorting for filter options (Probe 1, 2, 10 not 1, 10, 2)
- Set equal column widths with fixed table layout
- Align header content to top of cells
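Natural sorting of filter options can be done with a small sort key, along the lines of the sketch below. `natural_key` is a hypothetical name and this is one common approach, not necessarily the one used in this PR.

```python
import re


def natural_key(text: str) -> list:
    """Split text into alternating text and number chunks for sorting.

    re.split with a capturing group always yields text chunks at even
    indexes and digit chunks at odd indexes, so the types line up when
    two keys are compared.
    """
    parts = re.split(r"(\d+)", text)
    return [int(part) if index % 2 else part.lower() for index, part in enumerate(parts)]


options = sorted(["Probe 10", "Probe 2", "Probe 1"], key=natural_key)
# options is ["Probe 1", "Probe 2", "Probe 10"]
```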

- Add blur handler to config YAML textarea (matches scenario textareas)
- Add blur handler to situation textarea (was missing)
- Add check_config_edited controller for blur-based config edits
- Remove config_dirty flag (no longer needed with blur approach)
- Add deduplication: find existing probes/deciders with matching content
- Check against root decider config when reverting edits
- Add e2e tests for config and scenario editing with revert cases

- Add Situation column showing probe display_state
- Add searchable_text field with unstructured + choice text for search
- Add native title tooltips on all table cells
- Refactor with cell_with_tooltip and filterable_column helpers

- Clear old model from cache before loading new one in worker.py
- Add cleanup function that deletes model from PipelineADM steps
- Call gc.collect() and torch.cuda.empty_cache() after cleanup
- Add tests verifying cleanup is called before loading new ADM
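The cleanup ordering above (release the old model first, then load the new one) can be sketched as follows. `ModelCache`, `loader`, and `cleanup` are hypothetical names and do not mirror worker.py; the sketch also assumes that CUDA memory only needs releasing if torch has already been imported by the process.

```python
import gc
import sys
from typing import Any, Callable, Optional


class ModelCache:
    """Holds at most one loaded model and frees it before loading another."""

    def __init__(self, loader: Callable[[str], Any], cleanup: Callable[[Any], None]):
        self._loader = loader
        self._cleanup = cleanup  # e.g. drops model references held by pipeline steps
        self._key: Optional[str] = None
        self._model: Any = None

    def clear(self) -> None:
        if self._model is not None:
            self._cleanup(self._model)
        self._model = None
        self._key = None
        gc.collect()
        # Only touch CUDA if torch is already loaded in this process.
        torch = sys.modules.get("torch")
        if torch is not None and torch.cuda.is_available():
            torch.cuda.empty_cache()

    def get(self, key: str) -> Any:
        if key == self._key and self._model is not None:
            return self._model
        self.clear()  # free the old model before allocating the new one
        self._model = self._loader(key)
        self._key = key
        return self._model
```

Clearing before loading matters because otherwise both models are briefly resident at once, which is the case most likely to run out of GPU memory.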

Users can now load experiments from either zip files or directories via a dropdown menu. Directory files are zipped server-side before import. Both toolbar and runs table modal have the same behavior.
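Zipping uploaded directory files server-side lets both inputs share one import path. A minimal sketch, assuming the upload arrives as a mapping from relative path to file bytes (`zip_uploaded_files` is a hypothetical name, not the function in this PR):

```python
import io
import zipfile


def zip_uploaded_files(files: dict[str, bytes]) -> bytes:
    """Pack uploaded files into an in-memory zip, keeping their relative paths."""
    buffer = io.BytesIO()
    with zipfile.ZipFile(buffer, "w", zipfile.ZIP_DEFLATED) as archive:
        # Sorting gives a stable entry order regardless of upload order.
        for relative_path, content in sorted(files.items()):
            archive.writestr(relative_path, content)
    return buffer.getvalue()
```

The resulting bytes can then be handed to the same code that handles a zip upload.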

Users can now drag zip files or directories onto the main content area to import experiments. Visual feedback (blue dashed outline) shows when dragging over the drop zone.

When picking an ADM from the browser modal, it now automatically sets that ADM as the decider for the run that opened the modal.

Enables discovery of DecisionFlow ADM by scanning subdirectories that don't have a matching top-level YAML. Blacklists non-pipeline ADMs that have incompatible interfaces. Fixes OmegaConf resolver conflict when switching ADMs.

- Flatten nested Hydra config structure for subdirectory ADMs so interpolations like ${adm.attribute_definitions} resolve correctly
- Move max_alignment_attributes to top level for runtime deciders
- Add system_prompt_template overrides for decision_flow ADMs
- Register ref resolver when extracting attribute_definitions

- Move search field to beginning of toolbar
- Rename save buttons to download, change icon to mdi-download
- Download button shows "Download All" or "Download Selected" based on selection
- Download all runs when none selected instead of empty file
- Add confirmation dialogs for clear buttons
- Reorder toolbar: Load Experiments before Download
- Add padding to select-all checkbox header

- Improve runs grouping to deduplicate by cache_key, preferring runs with decisions
- Remove run removal logic that conflicted with in-place updates
- Use exact=True for Load Experiments button locator to avoid ambiguity
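Deduplicating by cache_key while preferring runs that have a decision can be sketched as below. The `Run` dataclass here is a simplified stand-in with hypothetical fields, not the project's run model, and `dedupe_runs` is an illustrative name.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Run:
    # Simplified stand-in: the real run model has more fields.
    cache_key: str
    decision: Optional[str] = None


def dedupe_runs(runs: list[Run]) -> list[Run]:
    """Keep one run per cache_key, replacing an undecided run with a decided one."""
    best: dict[str, Run] = {}
    for run in runs:
        current = best.get(run.cache_key)
        if current is None or (current.decision is None and run.decision is not None):
            best[run.cache_key] = run
    return list(best.values())
```

Output order follows the first appearance of each cache_key, so exported runs keep a predictable order.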

Remove confusing blur-triggered saves for config YAML and probe text editing. Add Save buttons that appear only when content has been modified, giving users explicit control over when edits are applied.

- Add comparison_label to run state with scenario, scene, alignment, decider, LLM
- Update Run dropdown to show descriptive labels prefixed with run number
- Rename 'Run Number' row to 'Run'
- Add unit tests for comparison_label generation

- Implemented 'Swap & Clean' strategy in RunsRegistry to remove draft runs on edit
- Updated RunsStateAdapter to always replace run IDs in the UI comparison view
- Refined apply_cached_decision to protect existing run decisions
- Added E2E tests for run replacement behavior

- Fix ruff lint/format issues across multiple files
- Fix mypy error: add Optional type for current_choices parameter
- Fix E2E alignment panel locator to use regex name match
- Update test expectation for renamed pipeline_baseline_multi decider

PaulHax added 30 commits January 4, 2026 21:16

Refactor e2e tests for updated DOM structure

c6e9f04

Update selectors in page object for new layout and fix timing issues with dropdown selections by waiting for listbox to close.

Add experiment_results_registry and CLI arg

2a8ed4d

Add probes from experiments to probe registry

f9bc94d

- Add add_probes_from_experiments method to ProbeRegistry
- Wire up in core.py to populate probes from experiment items
- Skip default probes when --experiments or --scenarios is provided

Refactor experiment conversions to pure functions

4e50a4c

Move experiment-to-registry conversion logic into experiment_converters.py. Core orchestrates calling these functions and populating registries.

Prepopulate decision cache from experiment results

02a8864

- Move run_models.py from app/ to adm/ for better layering
- Add run conversion functions to experiment_converters.py
- Add lru_cache to load_experiment_adm_config for performance
- Wire up populate_cache_bulk in core.py on startup

Display resolved decider config as YAML in expandable panel

8aa5775

Make decider config YAML editable with new run creation

0a49a1b

Editing the config and pressing Choose creates a new run with a new "edited" decider (named "{original} - edit {n}"). The edited config bypasses Hydra loading and uses the stored resolved config directly.

Deduplicate config lookup and spinner condition

894c563

Include resolved_config in run cache key

9601627

Export runs as ZIP with Pydantic Experiment structure

69bbb53

Fix zip export

2a03e66

Decider worker.py cache bust on config change

48babed

Add table view of runs

2ab82be

Add import experiments from ZIP file

a2adba3

Use empty list for no-LLM imports to display N/A consistently

92529a0

Display N/A when llm_backbone is None

e59e4a1

Add Save Selected button to runs table for exporting checked runs

de14cd4

Add Load Experiments and Clear All buttons to runs table modal

f38160d

Add directory upload support to Load Experiments

89cc88f

Users can now load experiments from either zip files or directories via a dropdown menu. Directory files are zipped server-side before import. Both toolbar and runs table modal have the same behavior.

Add drag-and-drop support for experiment import

34b0ee2

Users can now drag zip files or directories onto the main content area to import experiments. Visual feedback (blue dashed outline) shows when dragging over the drop zone.

PaulHax added 14 commits January 4, 2026 21:16

Refactor drop handler JS to multi-line constant

323e6a0

Add eye indicator for runs currently in comparison view

22ec39d

fix: use align-utils 1.5.0 and fix mypy error

9b91c7c

chore: require Python 3.10+, update lock file

c57e4b4

Add ADM browser modal with auto-select decider on pick

96e0def

When picking an ADM from the browser modal, it now automatically sets that ADM as the decider for the run that opened the modal.

Add ADM blacklist and discover configs in subdirectories

ae17f3e

Enables discovery of DecisionFlow ADM by scanning subdirectories that don't have a matching top-level YAML. Blacklists non-pipeline ADMs that have incompatible interfaces. Fixes OmegaConf resolver conflict when switching ADMs.

Fix CI failures: mypy error, e2e test fixture, remove outdated tests

e27ddd8

Fix export deduplication and test button locator

1413a58

- Improve runs grouping to deduplicate by cache_key, preferring runs with decisions
- Remove run removal logic that conflicted with in-place updates
- Use exact=True for Load Experiments button locator to avoid ambiguity

Replace blur event handlers with explicit Save buttons

c6c31ab

Remove confusing blur-triggered saves for config YAML and probe text editing. Add Save buttons that appear only when content has been modified, giving users explicit control over when edits are applied.

Deduplicate runs table rows by cache key

a55e793

PaulHax force-pushed the exp-result-rebase branch 5 times, most recently from ea18a7a to 1ddc990, January 12, 2026 18:26

PaulHax force-pushed the exp-result-rebase branch from 1ddc990 to 17e3c66, January 12, 2026 19:03

PaulHax merged commit 1b46ca9 into main Jan 12, 2026
6 checks passed

This was referenced Jan 12, 2026

GUI to pick custom ADM/expriment configs #60

Closed

Experiment run log browser. AKA precomputed results #10

Closed

Pre computed choice results #4

Closed


2 participants
