@PaulHax PaulHax commented Jan 5, 2026

  • Make each probe run directly editable and remove “prompt” sidebar
  • Runtime-editable ADM Hydra configs
  • Export runs as a zip in the same format produced by running experiments via align-system and Hydra
  • Exploring the parameter space is hard: added a table to browse runs/probes

TODO

  • Fix pipeline_random LLM option for loaded experiments
  • Table view should show all loaded experiment runs
  • Table view: filter values in columns
  • Fix spacing between run columns
  • Support a collection of files rather than a zip for experiment loading? A directory?
  • Consistent approach to generating edits to scenarios and ADMs from text entry.
  • Hash the edits of the decider ADM config and don't create a new edit if the string is unchanged
  • Encourage folks to look at the table after loading experiments somehow?
  • Fix memory cleanup of old ADMs

related to #60

PaulHax added 30 commits January 4, 2026 21:16
Update selectors in page object for new layout and fix timing issues
with dropdown selections by waiting for listbox to close.
- Add add_probes_from_experiments method to ProbeRegistry
- Wire up in core.py to populate probes from experiment items
- Skip default probes when --experiments or --scenarios is provided
Extract unique decider configs from experiment results and add them
to the decider dropdown. Configs are deduplicated by hashing the
normalized adm section (with dataset paths stripped to filenames).
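
A minimal sketch of the deduplication described above, not the code in this PR. It assumes configs are plain dicts, that dataset locations sit under keys containing "dataset", and that the helper names (`adm_config_hash`, `unique_decider_configs`) are hypothetical:

```python
import hashlib
import json
import posixpath
from typing import Any


def _normalize(node: Any, key: str = "") -> Any:
    """Recursively copy a config, reducing dataset paths to bare filenames."""
    if isinstance(node, dict):
        return {k: _normalize(v, k) for k, v in node.items()}
    if isinstance(node, list):
        return [_normalize(v, key) for v in node]
    if isinstance(node, str) and "dataset" in key.lower():
        # Assumption: dataset locations live under keys containing "dataset".
        return posixpath.basename(node.replace("\\", "/"))
    return node


def adm_config_hash(config: dict) -> str:
    """Hash only the normalized `adm` section, independent of key order."""
    normalized = _normalize(config.get("adm", {}))
    canonical = json.dumps(normalized, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


def unique_decider_configs(configs: list) -> list:
    """Keep the first config seen for each distinct hash."""
    seen = {}
    for config in configs:
        seen.setdefault(adm_config_hash(config), config)
    return list(seen.values())
```

Serializing with `sort_keys=True` means two configs that differ only in key order or in the directory holding a dataset collapse to one dropdown entry.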

Priority order: CLI --deciders > experiments > built-in deciders

Extract experiment's configured LLM and put it first in the available
backbones list, followed by default LLM options.

Consolidate experiment config loading into experiment_config_loader.py.
Move experiment-to-registry conversion logic into experiment_converters.py.
Core orchestrates calling these functions and populating registries.
- Move run_models.py from app/ to adm/ for better layering
- Add run conversion functions to experiment_converters.py
- Add lru_cache to load_experiment_adm_config for performance
- Wire up populate_cache_bulk in core.py on startup
Keep resolved_config pure - only what the decider module needs to
instantiate an ADM. GUI metadata (llm_backbones, max_alignment_attributes)
now stays in decider entries and is accessed via get_decider_options().

Removed model_path_keys entirely (was redundant - injection path is hardcoded).
Editing the config and pressing Choose creates a new run with a new
"edited" decider (named "{original} - edit {n}"). The edited config
bypasses Hydra loading and uses the stored resolved config directly.
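
One way the "{original} - edit {n}" naming could work, as a rough sketch. The registry shape (a dict keyed by decider name) and the function names are assumptions, as is the choice to number edits against the root decider rather than against the edit being modified:

```python
import re

# Matches names of the form "<root> - edit <n>".
_EDIT_SUFFIX = re.compile(r"^(?P<root>.*?) - edit (?P<n>\d+)$")


def root_name(name: str) -> str:
    """Strip a trailing ' - edit N' suffix, if present."""
    match = _EDIT_SUFFIX.match(name)
    return match.group("root") if match else name


def next_edit_name(name: str, existing) -> str:
    """Return the next unused '<root> - edit N' name among `existing`."""
    root = root_name(name)
    used = [
        int(m.group("n"))
        for m in map(_EDIT_SUFFIX.match, existing)
        if m and m.group("root") == root
    ]
    return f"{root} - edit {max(used, default=0) + 1}"


def register_edited_decider(deciders: dict, original: str, resolved_config: dict) -> str:
    """Store the edited config directly so it can be used without Hydra loading."""
    name = next_edit_name(original, deciders)
    deciders[name] = {"resolved_config": resolved_config, "edited": True}
    return name
```

Storing the resolved config on the entry itself is what lets an edited decider skip the Hydra composition step.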
- Extract shared caching logic into _execute_with_cache() helper
- Remove unused create_and_execute_run() method
- Fix SearchController/RunsStateAdapter circular dependency with callback
- Delete orphaned prompt_logic.py, move functions to consumers
- Remove unused ui.py functions (serialize_prompt, prep_for_state, etc)
- Convert RunsRegistry namedtuple to Protocol for better type hints
- Rename find_probe_by_base_and_scene to find_probe_by_scenario_and_scene
When loading experiments with ADMs that don't use an LLM (e.g., pipeline_random),
set llm_backbones to ["no_llm"] instead of the full LLM_BACKBONES list. This
matches the value stored in runs from ADMConfig.llm_backbone property.
- Use cache_key as unified ID for table rows (runs and experiment items)
- Store experiment items in RunsRegistry instead of separate registry
- Materialize experiment items to full Runs only when selected
- CLI --experiments and ZIP upload now use same import_experiments()
- Remove ExperimentResultsRegistry (no longer needed)
- Add log messages when loading experiments
- Add RunsTableFilter class to manage filter state and logic
- Add VSelect dropdowns for each column (Scenario, Scene, Decider, LLM, Alignment, Decision)
- Support multi-select filtering with AND combination across columns
- Extract pure filter functions for testability
- Add unit tests for filter logic
- Add sortable column headers with sort indicator icons
- Fix filter dropdown click not triggering column sort
- Use natural sorting for filter options (Probe 1, 2, 10 not 1, 10, 2)
- Set equal column widths with fixed table layout
- Align header content to top of cells
- Add blur handler to config YAML textarea (matches scenario textareas)
- Add blur handler to situation textarea (was missing)
- Add check_config_edited controller for blur-based config edits
- Remove config_dirty flag (no longer needed with blur approach)
- Add deduplication: find existing probes/deciders with matching content
- Check against root decider config when reverting edits
- Add e2e tests for config and scenario editing with revert cases
- Add Situation column showing probe display_state
- Add searchable_text field with unstructured + choice text for search
- Add native title tooltips on all table cells
- Refactor with cell_with_tooltip and filterable_column helpers
- Clear old model from cache before loading new one in worker.py
- Add cleanup function that deletes model from PipelineADM steps
- Call gc.collect() and torch.cuda.empty_cache() after cleanup
- Add tests verifying cleanup is called before loading new ADM
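
The cleanup order in the last four bullets might look roughly like the sketch below. It is not the code in worker.py: `ModelCache`, `release_model`, and the assumption that each pipeline step exposes its model as a `model` attribute are all hypothetical, and the sketch clears the reference rather than deleting the attribute.

```python
import gc

try:
    import torch  # optional: only needed to release cached GPU memory
except ImportError:
    torch = None


def release_model(adm) -> None:
    """Drop model references held by an ADM's steps, then free memory."""
    for step in getattr(adm, "steps", []):
        # Assumption: steps that own a model expose it as `step.model`.
        if hasattr(step, "model"):
            step.model = None
    gc.collect()
    if torch is not None and torch.cuda.is_available():
        torch.cuda.empty_cache()


class ModelCache:
    """Hold at most one loaded ADM; clean up the old one before loading a new one."""

    def __init__(self, loader, cleanup=release_model):
        self._loader = loader
        self._cleanup = cleanup
        self._key = None
        self._adm = None

    def get(self, key):
        if key != self._key or self._adm is None:
            if self._adm is not None:
                self._cleanup(self._adm)  # runs before the new load
                self._adm = None
            self._adm = self._loader(key)
            self._key = key
        return self._adm
```

Running cleanup before the load, rather than after, avoids holding two models in memory at once.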
Users can now load experiments from either zip files or directories
via a dropdown menu. Directory files are zipped server-side before
import. Both toolbar and runs table modal have the same behavior.
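
A hedged sketch of the server-side zipping step, assuming the browser sends the directory as a mapping from relative path to file bytes. The function name `zip_uploaded_files` and the example paths are illustrative, not taken from this PR:

```python
import io
import zipfile


def zip_uploaded_files(files: dict) -> bytes:
    """Pack {relative_path: bytes} into an in-memory zip archive."""
    buffer = io.BytesIO()
    with zipfile.ZipFile(buffer, "w", zipfile.ZIP_DEFLATED) as archive:
        for relative_path in sorted(files):
            # Zip entries use forward slashes regardless of client OS.
            arcname = relative_path.replace("\\", "/")
            archive.writestr(arcname, files[relative_path])
    return buffer.getvalue()
```

Converting a directory upload into zip bytes lets both entry points feed the same import function that the zip upload already uses.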
Users can now drag zip files or directories onto the main content
area to import experiments. Visual feedback (blue dashed outline)
shows when dragging over the drop zone.
PaulHax added 14 commits January 4, 2026 21:16
When picking an ADM from the browser modal, it now automatically
sets that ADM as the decider for the run that opened the modal.
Enables discovery of DecisionFlow ADM by scanning subdirectories
that don't have a matching top-level YAML. Blacklists non-pipeline
ADMs that have incompatible interfaces. Fixes OmegaConf resolver
conflict when switching ADMs.
- Flatten nested Hydra config structure for subdirectory ADMs so
  interpolations like ${adm.attribute_definitions} resolve correctly
- Move max_alignment_attributes to top level for runtime deciders
- Add system_prompt_template overrides for decision_flow ADMs
- Register ref resolver when extracting attribute_definitions
- Move search field to beginning of toolbar
- Rename save buttons to download, change icon to mdi-download
- Download button shows "Download All" or "Download Selected" based on selection
- Download all runs when none selected instead of empty file
- Add confirmation dialogs for clear buttons
- Reorder toolbar: Load Experiments before Download
- Add padding to select-all checkbox header
- Improve runs grouping to deduplicate by cache_key, preferring runs with decisions
- Remove run removal logic that conflicted with in-place updates
- Use exact=True for Load Experiments button locator to avoid ambiguity
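
The cache_key deduplication in the runs-grouping bullet above could be sketched as follows, assuming each run is a dict with a `cache_key` and an optional `decision`. The function name and the run shape are hypothetical:

```python
def dedupe_runs(runs: list) -> list:
    """Keep one run per cache_key, preferring a run that already has a decision."""
    best = {}
    for run in runs:
        key = run["cache_key"]
        current = best.get(key)
        has_decision = run.get("decision") is not None
        if current is None or (current.get("decision") is None and has_decision):
            # Replacing the value keeps the key's original position in the dict.
            best[key] = run
    return list(best.values())
```

Rows keep the order in which each cache_key was first seen, so upgrading a draft to its decided counterpart does not reshuffle the table.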
Remove confusing blur-triggered saves for config YAML and probe text editing.
Add Save buttons that appear only when content has been modified, giving users
explicit control over when edits are applied.
- Add comparison_label to run state with scenario, scene, alignment, decider, LLM
- Update Run dropdown to show descriptive labels prefixed with run number
- Rename 'Run Number' row to 'Run'
- Add unit tests for comparison_label generation
- Implemented 'Swap & Clean' strategy in RunsRegistry to remove draft runs on edit
- Updated RunsStateAdapter to always replace run IDs in the UI comparison view
- Refined apply_cached_decision to protect existing run decisions
- Added E2E tests for run replacement behavior
@PaulHax PaulHax force-pushed the exp-result-rebase branch 5 times, most recently from ea18a7a to 1ddc990 Compare January 12, 2026 18:26
- Fix ruff lint/format issues across multiple files
- Fix mypy error: add Optional type for current_choices parameter
- Fix E2E alignment panel locator to use regex name match
- Update test expectation for renamed pipeline_baseline_multi decider
@PaulHax PaulHax merged commit 1b46ca9 into main Jan 12, 2026
6 checks passed