Exp result rebase #71
Merged
Update selectors in page object for new layout and fix timing issues with dropdown selections by waiting for listbox to close.
- Add add_probes_from_experiments method to ProbeRegistry
- Wire up in core.py to populate probes from experiment items
- Skip default probes when --experiments or --scenarios is provided
Extract unique decider configs from experiment results and add them to the decider dropdown. Configs are deduplicated by hashing the normalized adm section (with dataset paths stripped to filenames).
Priority order: CLI --deciders > experiments > built-in deciders.
Extract experiment's configured LLM and put it first in the available backbones list, followed by default LLM options.
Consolidate experiment config loading into experiment_config_loader.py.
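The deduplication described above could look roughly like the following. This is a minimal sketch, not the code from this PR: the function names are hypothetical, and it assumes that dataset paths live under keys whose names end in `path`, which the commit message does not state.

```python
import hashlib
import json
import posixpath
from typing import Any


def normalize(value: Any, key: str = "") -> Any:
    """Return a copy of a config with path-like values reduced to filenames."""
    if isinstance(value, dict):
        return {k: normalize(v, k) for k, v in value.items()}
    if isinstance(value, list):
        return [normalize(v, key) for v in value]
    # Assumption: a string stored under a key ending in "path" is a dataset path.
    if isinstance(value, str) and key.endswith("path"):
        return posixpath.basename(value)
    return value


def config_hash(adm_section: dict) -> str:
    """Hash the normalized adm section; key order does not affect the result."""
    canonical = json.dumps(normalize(adm_section), sort_keys=True)
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


def unique_configs(adm_sections: list) -> list:
    """Keep the first config seen for each distinct hash, in input order."""
    seen = {}
    for section in adm_sections:
        seen.setdefault(config_hash(section), section)
    return list(seen.values())
```

Two configs that differ only in the directory part of a dataset path hash identically here, so the same decider loaded from two machines would show up once in the dropdown.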
Move experiment-to-registry conversion logic into experiment_converters.py. Core orchestrates calling these functions and populating registries.
- Move run_models.py from app/ to adm/ for better layering
- Add run conversion functions to experiment_converters.py
- Add lru_cache to load_experiment_adm_config for performance
- Wire up populate_cache_bulk in core.py on startup
Keep resolved_config pure - only what the decider module needs to instantiate an ADM. GUI metadata (llm_backbones, max_alignment_attributes) now stays in decider entries and is accessed via get_decider_options(). Removed model_path_keys entirely (was redundant - injection path is hardcoded).
Editing the config and pressing Choose creates a new run with a new
"edited" decider (named "{original} - edit {n}"). The edited config
bypasses Hydra loading and uses the stored resolved config directly.
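As an illustration of the "{original} - edit {n}" naming, here is a small sketch. The helper name is hypothetical, and choosing n as one more than the highest existing edit number is an assumption; the commit message does not say how n is picked.

```python
import re
from typing import Iterable


def next_edit_name(original: str, existing_names: Iterable[str]) -> str:
    """Build the next '{original} - edit {n}' name for an edited decider."""
    pattern = re.compile(rf"^{re.escape(original)} - edit (\d+)$")
    numbers = []
    for name in existing_names:
        match = pattern.match(name)
        if match:
            numbers.append(int(match.group(1)))
    # Assumption: n continues from the highest edit number already in use.
    return f"{original} - edit {max(numbers, default=0) + 1}"
```

For example, with no prior edits of `pipeline_baseline` the helper yields `pipeline_baseline - edit 1`.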
- Extract shared caching logic into _execute_with_cache() helper
- Remove unused create_and_execute_run() method
- Fix SearchController/RunsStateAdapter circular dependency with callback
- Delete orphaned prompt_logic.py, move functions to consumers
- Remove unused ui.py functions (serialize_prompt, prep_for_state, etc.)
- Convert RunsRegistry namedtuple to Protocol for better type hints
- Rename find_probe_by_base_and_scene to find_probe_by_scenario_and_scene
When loading experiments with ADMs that don't use an LLM (e.g., pipeline_random), set llm_backbones to ["no_llm"] instead of the full LLM_BACKBONES list. This matches the value stored in runs from ADMConfig.llm_backbone property.
- Use cache_key as unified ID for table rows (runs and experiment items)
- Store experiment items in RunsRegistry instead of separate registry
- Materialize experiment items to full Runs only when selected
- CLI --experiments and ZIP upload now use same import_experiments()
- Remove ExperimentResultsRegistry (no longer needed)
- Add log messages when loading experiments
- Add RunsTableFilter class to manage filter state and logic
- Add VSelect dropdowns for each column (Scenario, Scene, Decider, LLM, Alignment, Decision)
- Support multi-select filtering with AND combination across columns
- Extract pure filter functions for testability
- Add unit tests for filter logic
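A pure filter function of the kind mentioned above might be sketched as follows. The function names and the row shape (a mapping from column name to value) are hypothetical. The sketch also assumes that an empty selection leaves a column unfiltered and that several values selected in one column are alternatives, while columns combine with AND.

```python
from typing import Dict, List, Mapping, Sequence


def row_matches(row: Mapping[str, str], filters: Mapping[str, Sequence[str]]) -> bool:
    """True when the row satisfies every column that has a selection."""
    return all(
        not selected or row.get(column) in selected
        for column, selected in filters.items()
    )


def filter_rows(
    rows: Sequence[Mapping[str, str]], filters: Mapping[str, Sequence[str]]
) -> List[Mapping[str, str]]:
    """Return the rows that pass all column filters, preserving order."""
    return [row for row in rows if row_matches(row, filters)]


if __name__ == "__main__":
    rows: List[Dict[str, str]] = [
        {"scenario": "A", "llm": "no_llm"},
        {"scenario": "B", "llm": "no_llm"},
    ]
    print(filter_rows(rows, {"scenario": ["A"], "llm": ["no_llm"]}))
```

Keeping the predicate free of UI state is what makes it straightforward to unit test without a running app.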
- Add sortable column headers with sort indicator icons
- Fix filter dropdown click not triggering column sort
- Use natural sorting for filter options (Probe 1, 2, 10 not 1, 10, 2)
- Set equal column widths with fixed table layout
- Align header content to top of cells
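Natural sorting of the filter options can be illustrated with a short key function. This is one common way to do it and is not necessarily how the PR implements it; `natural_key` is a hypothetical name.

```python
import re
from typing import List, Union


def natural_key(text: str) -> List[Union[str, int]]:
    """Split text into alternating string and digit runs for sorting."""
    # re.split with a capturing group keeps the digit runs; they always
    # land at odd indices, so string and int parts never get compared.
    parts = re.split(r"(\d+)", text)
    return [int(part) if index % 2 else part.lower() for index, part in enumerate(parts)]


if __name__ == "__main__":
    print(sorted(["Probe 10", "Probe 2", "Probe 1"], key=natural_key))
```

A plain string sort would place "Probe 10" before "Probe 2" because it compares character by character.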
- Add blur handler to config YAML textarea (matches scenario textareas)
- Add blur handler to situation textarea (was missing)
- Add check_config_edited controller for blur-based config edits
- Remove config_dirty flag (no longer needed with blur approach)
- Add deduplication: find existing probes/deciders with matching content
- Check against root decider config when reverting edits
- Add e2e tests for config and scenario editing with revert cases
- Add Situation column showing probe display_state
- Add searchable_text field with unstructured + choice text for search
- Add native title tooltips on all table cells
- Refactor with cell_with_tooltip and filterable_column helpers
- Clear old model from cache before loading new one in worker.py
- Add cleanup function that deletes model from PipelineADM steps
- Call gc.collect() and torch.cuda.empty_cache() after cleanup
- Add tests verifying cleanup is called before loading new ADM
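The cleanup sequence above might be sketched like this. The attribute names `steps` and `model` are assumptions made for illustration and may not match the real PipelineADM, and the torch import is made optional only so the sketch also runs where PyTorch is not installed.

```python
import gc

try:
    import torch  # optional here so the sketch also runs without PyTorch
except ImportError:
    torch = None


def cleanup_adm(adm) -> None:
    """Drop model references held by an ADM's steps, then free memory."""
    # Assumption: the ADM exposes a `steps` iterable and each step may keep
    # its model in a `model` attribute. Both names are hypothetical.
    for step in getattr(adm, "steps", []):
        if hasattr(step, "model"):
            del step.model
    # Collect unreachable objects first so their GPU tensors are released,
    # then return cached CUDA blocks to the driver.
    gc.collect()
    if torch is not None and torch.cuda.is_available():
        torch.cuda.empty_cache()
```

The order matters: `torch.cuda.empty_cache()` can only release memory that no live tensor still occupies, so the references have to go before the cache is emptied.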
Users can now load experiments from either zip files or directories via a dropdown menu. Directory files are zipped server-side before import. Both toolbar and runs table modal have the same behavior.
Users can now drag zip files or directories onto the main content area to import experiments. Visual feedback (blue dashed outline) shows when dragging over the drop zone.
When picking an ADM from the browser modal, it now automatically sets that ADM as the decider for the run that opened the modal.
Enables discovery of DecisionFlow ADM by scanning subdirectories that don't have a matching top-level YAML. Blacklists non-pipeline ADMs that have incompatible interfaces. Fixes OmegaConf resolver conflict when switching ADMs.
- Flatten nested Hydra config structure for subdirectory ADMs so
interpolations like ${adm.attribute_definitions} resolve correctly
- Move max_alignment_attributes to top level for runtime deciders
- Add system_prompt_template overrides for decision_flow ADMs
- Register ref resolver when extracting attribute_definitions
- Move search field to beginning of toolbar
- Rename save buttons to download, change icon to mdi-download
- Download button shows "Download All" or "Download Selected" based on selection
- Download all runs when none selected instead of empty file
- Add confirmation dialogs for clear buttons
- Reorder toolbar: Load Experiments before Download
- Add padding to select-all checkbox header
- Improve runs grouping to deduplicate by cache_key, preferring runs with decisions
- Remove run removal logic that conflicted with in-place updates
- Use exact=True for Load Experiments button locator to avoid ambiguity
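The cache_key deduplication in the first item can be sketched as below. The `Run` dataclass is a hypothetical stand-in for the project's run model, with only the fields the example needs.

```python
from dataclasses import dataclass
from typing import Dict, Iterable, List, Optional


@dataclass
class Run:
    # Hypothetical minimal run record, not the project's actual model.
    id: str
    cache_key: str
    decision: Optional[str] = None


def dedupe_by_cache_key(runs: Iterable[Run]) -> List[Run]:
    """Keep one run per cache_key, preferring a run that has a decision."""
    best: Dict[str, Run] = {}
    for run in runs:
        current = best.get(run.cache_key)
        # Replace the kept run only when it lacks a decision and this one has it.
        if current is None or (current.decision is None and run.decision is not None):
            best[run.cache_key] = run
    return list(best.values())
```

Because dicts preserve insertion order, each group stays at the position where its cache_key first appeared, even when a later run with a decision replaces the earlier one.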
Remove confusing blur-triggered saves for config YAML and probe text editing. Add Save buttons that appear only when content has been modified, giving users explicit control over when edits are applied.
- Add comparison_label to run state with scenario, scene, alignment, decider, LLM
- Update Run dropdown to show descriptive labels prefixed with run number
- Rename 'Run Number' row to 'Run'
- Add unit tests for comparison_label generation
- Implemented 'Swap & Clean' strategy in RunsRegistry to remove draft runs on edit
- Updated RunsStateAdapter to always replace run IDs in the UI comparison view
- Refined apply_cached_decision to protect existing run decisions
- Added E2E tests for run replacement behavior
ea18a7a to 1ddc990
- Fix ruff lint/format issues across multiple files
- Fix mypy error: add Optional type for current_choices parameter
- Fix E2E alignment panel locator to use regex name match
- Update test expectation for renamed pipeline_baseline_multi decider
1ddc990 to 17e3c66
This was referenced Jan 12, 2026
TODO
related to #60