Add GPU support and enhance monitoring features for jobs #2
base: main
Conversation
Pull request overview
This pull request adds GPU support and monitoring capabilities to a Kubernetes job monitoring tool. It introduces GPU resource tracking at both the cluster and job level, and adds an interactive log viewer with auto-refresh functionality.
Changes:
- Added GPU information collection from cluster nodes and pods, including GPU type identification and allocation tracking
- Enhanced job display to show GPU requests and types, with color-coded resource utilization indicators
- Introduced interactive log viewer with keyboard navigation, syntax highlighting, and auto-refresh capability
Reviewed changes
Copilot reviewed 2 out of 2 changed files in this pull request and generated 18 comments.
| File | Description |
|---|---|
| monitor.py | Added GPU tracking functions (get_gpu_info, _shorten_gpu_name), enhanced job/pod data collection with GPU information, added interactive log viewer with navigation, and updated UI to display GPU resources and selection indicators |
| mock_data.py | Added GPU field to all job definitions, implemented GPU resource requests in job specs, and added GPU-aware node assignment logic for mock pods |
```python
from rich.console import Console, Group
from rich.text import Text
```
Copilot AI · Jan 21, 2026
The imports `Group` and `Text` from the rich library are added but never used in the code. These unused imports should be removed to keep the code clean and avoid confusion.
Suggested change:

```diff
- from rich.console import Console, Group
- from rich.text import Text
+ from rich.console import Console
```
```python
node_gpu_map = {
    'gpu-node-01': 'A100-40GB',
    'gpu-node-02': 'A100-40GB',
    'gpu-node-03': 'V100-32GB',
```
Copilot AI · Jan 21, 2026
There is an inconsistency in the mock GPU node types. The `get_gpu_info` mock data returns nodes with H100-80GB and A100-80GB/A100-40GB types, but the `node_gpu_map` fallback in `get_jobs_pods` specifies A100-40GB for gpu-node-01 and gpu-node-02, and V100-32GB for gpu-node-03. This mismatch could lead to confusing or incorrect GPU type displays in mock mode. The `node_gpu_map` should align with the GPU types defined in `get_gpu_info`'s mock data.
Suggested change:

```diff
- node_gpu_map = {
-     'gpu-node-01': 'A100-40GB',
-     'gpu-node-02': 'A100-40GB',
-     'gpu-node-03': 'V100-32GB',
+ # Fallback mock mapping should match get_gpu_info's mock GPU types
+ node_gpu_map = {
+     'gpu-node-01': 'H100-80GB',
+     'gpu-node-02': 'A100-80GB',
+     'gpu-node-03': 'A100-40GB',
```
```python
# Add indicator for pods that can show logs
if row['type'] == 'pod':
    name_display = f"[reverse]▶ {row['display_name'].strip()}[/reverse]"
```
Copilot AI · Jan 21, 2026
When a pod is selected, the indicator "▶ " is added and the display_name is stripped. However, display_name already contains formatting with tree characters (├── or └── ) and padding. Stripping it may remove necessary spacing, and the indicator placement could break the tree structure visualization. Consider preserving the tree structure by inserting the indicator after the padding but before the pod name, or apply the indicator in a way that doesn't disrupt the visual tree hierarchy.
Suggested change:

```diff
- # Add indicator for pods that can show logs
- if row['type'] == 'pod':
-     name_display = f"[reverse]▶ {row['display_name'].strip()}[/reverse]"
+ # Add indicator for pods that can show logs without breaking tree indentation
+ if row['type'] == 'pod':
+     name_display = f"[reverse]{row['display_name']} ▶[/reverse]"
```
```python
total_lines = len(lines)
scroll_info = f" ({scroll_offset + 1}-{min(scroll_offset + (max_lines or total_lines), total_lines)}/{total_lines})"
```
Copilot AI · Jan 21, 2026
The scroll indicator calculation on line 685 may produce incorrect values when scroll_offset is 0 and there are no logs. If lines is empty or contains only ["No logs available"], then scroll_offset + 1 would show "1-1/1" which is technically correct but could be clearer. Consider handling the edge case where there are no actual logs to display more gracefully, or adjusting the calculation to show "0-0/0" when there are no logs.
Suggested change:

```diff
- total_lines = len(lines)
- scroll_info = f" ({scroll_offset + 1}-{min(scroll_offset + (max_lines or total_lines), total_lines)}/{total_lines})"
+ has_real_logs = bool(logs and logs.strip())
+ if not has_real_logs:
+     total_lines = 0
+     scroll_info = " (0-0/0)"
+ else:
+     total_lines = len(lines)
+     start_line = min(scroll_offset + 1, total_lines)
+     end_line = min(scroll_offset + (max_lines or total_lines), total_lines)
+     scroll_info = f" ({start_line}-{end_line}/{total_lines})"
```
```python
elif key_input == b'H':  # Up arrow on Windows (alternate)
    key = 'up'
elif key_input == b'P':  # Down arrow on Windows (alternate)
    key = 'down'
```
Copilot AI · Jan 21, 2026
The Windows arrow key handling has redundant code paths. Lines 864-867 check for standalone b'H' and b'P' which would only occur if the extended key prefix b'\xe0' wasn't received first. However, Windows arrow keys always send the extended key prefix, so these fallback checks for b'H' and b'P' will never be triggered and are redundant. Consider removing lines 864-867 to simplify the code.
```python
if 'gpu' in key.lower():
    try:
        gpu_count = int(capacity[key])
    except:
```
Copilot AI · Jan 21, 2026
'except' clause does nothing but pass and there is no explanatory comment.
Suggested change:

```diff
- except:
+ except (ValueError, TypeError):
+     # Some nodes may report a non-integer or unexpected GPU capacity; treat as zero and continue.
```
```python
gpu_info['total_gpus'] += gpu_count
gpu_info['gpu_types'].add(gpu_type)
gpu_info['node_gpu_map'][node_name] = gpu_type
except:
```
Copilot AI · Jan 21, 2026
'except' clause does nothing but pass and there is no explanatory comment.
Suggested change:

```diff
- except:
+ except:
+     # Best-effort GPU/node discovery: ignore errors from kubectl/JSON parsing
+     # so that failure to gather GPU info does not break the monitor.
```
```python
for node in gpu_info['nodes']:
    if node['name'] == node_name:
        node['allocated'] += gpu_req
except (ValueError, TypeError):
```
Copilot AI · Jan 21, 2026
'except' clause does nothing but pass and there is no explanatory comment.
Suggested change:

```diff
- except (ValueError, TypeError):
+ except (ValueError, TypeError):
+     # If the GPU request value is not a valid integer, ignore it
+     # and treat it as requesting 0 GPUs so monitoring continues.
```
```python
node['allocated'] += gpu_req
except (ValueError, TypeError):
    pass
except:
```
Copilot AI · Jan 21, 2026
'except' clause does nothing but pass and there is no explanatory comment.
Suggested change:

```diff
- except:
+ except:
+     # Intentionally ignore errors when parsing pod GPU allocations;
+     # if this fails, we proceed with zero/partial allocation data.
```
```python
if 'gpu' in key.lower():
    try:
        gpu_request += int(value)
    except (ValueError, TypeError):
        pass
```
Copilot AI · Jan 21, 2026
'except' clause does nothing but pass and there is no explanatory comment.
Suggested change:

```diff
- except (ValueError, TypeError):
+ except (ValueError, TypeError):
+     # Intentionally ignore non-numeric or malformed GPU request values.
```
```python
# Get GPU allocation from pods
pods_json = run_cmd("kubectl get pods --all-namespaces -o json")
```
The `--all-namespaces` option will fail when running inside an EIDF VM; please use the `args.namespace` argument instead.
Also, we are already fetching pod information elsewhere in the code. Could we take advantage of that and call kubectl only once, getting everything we need from pods out of that single result?
```python
if use_mock:
    return {
        'nodes': [
            {'name': 'gpu-node-01', 'gpu_type': 'H100-80GB', 'gpu_count': 8, 'allocated': 4},
```
Could we move all hardcoded mock data to mock_data.py to keep this main file cleaner?

Also, the per-type GPU values currently shown for mocked data:

```
H100-80GB 4/16
A100-80GB 8/16
A100-40GB 3/4
```

don't match the overall total displayed (`GPU 2/4`).
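One way to address both points at once (an illustrative sketch: `MOCK_GPU_NODES`, `mock_gpu_info`, and the counts are assumptions, not the PR's actual data) is to keep the node list in mock_data.py and derive the totals from it, so the summary can never disagree with the per-node numbers.

```python
# mock_data.py (hypothetical layout): single source of truth for GPU mocks.
MOCK_GPU_NODES = [
    {'name': 'gpu-node-01', 'gpu_type': 'H100-80GB', 'gpu_count': 8, 'allocated': 4},
    {'name': 'gpu-node-02', 'gpu_type': 'A100-80GB', 'gpu_count': 8, 'allocated': 5},
    {'name': 'gpu-node-03', 'gpu_type': 'A100-40GB', 'gpu_count': 4, 'allocated': 3},
]

def mock_gpu_info():
    """Build the GPU summary from the node list so totals always agree."""
    return {
        'nodes': MOCK_GPU_NODES,
        'total_gpus': sum(n['gpu_count'] for n in MOCK_GPU_NODES),
        'allocated_gpus': sum(n['allocated'] for n in MOCK_GPU_NODES),
        'gpu_types': {n['gpu_type'] for n in MOCK_GPU_NODES},
        'node_gpu_map': {n['name']: n['gpu_type'] for n in MOCK_GPU_NODES},
    }
```

monitor.py would then just call `mock_gpu_info()` in mock mode instead of holding its own literal dict.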
```python
name_display = f"[reverse]{row['display_name']}[/reverse]"
# Add indicator for pods that can show logs
if row['type'] == 'pod':
    name_display = f"[reverse]▶ {row['display_name'].strip()}[/reverse]"
```
When testing the mocked data on a small screen, where we need to scroll down to see the rest of the list, the highlighting disappears for some reason. It only returns after we scroll up and then down again.
```python
console.print("[bold yellow]Keyboard Shortcuts:[/bold yellow]")
console.print("  [cyan]↑/↓[/cyan]   Navigate up and down")
console.print("  [cyan]Enter[/cyan] View logs for selected pod")
```
Could you please update the README to reflect the changes?
Introduce GPU support in job data generation and monitoring, improve GPU resource management, and refine error handling. Enhance the log viewer with an auto-refresh feature for better usability.