
Conversation

@stogiannidis

Introduce GPU support in job data generation and monitoring, improve GPU resource management, and refine error handling. Enhance the log viewer with an auto-refresh feature for better usability.

Copilot AI review requested due to automatic review settings January 21, 2026 17:25

Copilot AI left a comment


Pull request overview

This pull request adds GPU support and monitoring capabilities to a Kubernetes job monitoring tool. It introduces GPU resource tracking at both the cluster and job level, and adds an interactive log viewer with auto-refresh functionality.

Changes:

  • Added GPU information collection from cluster nodes and pods, including GPU type identification and allocation tracking
  • Enhanced job display to show GPU requests and types, with color-coded resource utilization indicators
  • Introduced an interactive log viewer with keyboard navigation, syntax highlighting, and auto-refresh capability (see the sketch after this list)
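For orientation, here is a minimal sketch of the kind of auto-refresh loop such a log viewer could use, built on rich.live.Live; the helper names (fetch_logs, render_log_panel) and the refresh interval are illustrative assumptions, not the PR's actual code.

import time
from rich.console import Console
from rich.live import Live
from rich.panel import Panel

def fetch_logs(pod: str) -> str:
    # Stand-in for the real `kubectl logs <pod>` call.
    return f"logs for {pod}"

def render_log_panel(pod: str) -> Panel:
    return Panel(fetch_logs(pod), title=f"Logs: {pod}")

def view_logs(pod: str, refresh_seconds: float = 2.0) -> None:
    console = Console()
    with Live(render_log_panel(pod), console=console, screen=True) as live:
        try:
            while True:
                time.sleep(refresh_seconds)
                live.update(render_log_panel(pod))  # re-render on each refresh tick
        except KeyboardInterrupt:
            pass  # leave the viewer on Ctrl+C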

Reviewed changes

Copilot reviewed 2 out of 2 changed files in this pull request and generated 18 comments.

File Description
monitor.py Added GPU tracking functions (get_gpu_info, _shorten_gpu_name), enhanced job/pod data collection with GPU information, added interactive log viewer with navigation, and updated UI to display GPU resources and selection indicators
mock_data.py Added GPU field to all job definitions, implemented GPU resource requests in job specs, and added GPU-aware node assignment logic for mock pods


Comment on lines +20 to +21
from rich.console import Console, Group
from rich.text import Text

Copilot AI Jan 21, 2026


The imports Group and Text from the rich library are added but never used in the code. These unused imports should be removed to keep the code clean and avoid confusion.

Suggested change
- from rich.console import Console, Group
- from rich.text import Text
+ from rich.console import Console

Comment on lines +278 to +281
node_gpu_map = {
    'gpu-node-01': 'A100-40GB',
    'gpu-node-02': 'A100-40GB',
    'gpu-node-03': 'V100-32GB',

Copilot AI Jan 21, 2026


There is an inconsistency in the mock GPU node types. The get_gpu_info mock data returns nodes with H100-80GB and A100-80GB/A100-40GB types, but the node_gpu_map fallback in get_jobs_pods specifies A100-40GB for gpu-node-01 and gpu-node-02, and V100-32GB for gpu-node-03. This mismatch could lead to confusing or incorrect GPU type displays in mock mode. The node_gpu_map should align with the GPU types defined in get_gpu_info's mock data.

Suggested change
- node_gpu_map = {
-     'gpu-node-01': 'A100-40GB',
-     'gpu-node-02': 'A100-40GB',
-     'gpu-node-03': 'V100-32GB',
+ # Fallback mock mapping should match get_gpu_info's mock GPU types
+ node_gpu_map = {
+     'gpu-node-01': 'H100-80GB',
+     'gpu-node-02': 'A100-80GB',
+     'gpu-node-03': 'A100-40GB',

Comment on lines +594 to +596
# Add indicator for pods that can show logs
if row['type'] == 'pod':
    name_display = f"[reverse]▶ {row['display_name'].strip()}[/reverse]"

Copilot AI Jan 21, 2026


When a pod is selected, the indicator "▶ " is added and the display_name is stripped. However, display_name already contains formatting with tree characters (├── or └── ) and padding. Stripping it may remove necessary spacing, and the indicator placement could break the tree structure visualization. Consider preserving the tree structure by inserting the indicator after the padding but before the pod name, or apply the indicator in a way that doesn't disrupt the visual tree hierarchy.

Suggested change
- # Add indicator for pods that can show logs
- if row['type'] == 'pod':
-     name_display = f"[reverse]▶ {row['display_name'].strip()}[/reverse]"
+ # Add indicator for pods that can show logs without breaking tree indentation
+ if row['type'] == 'pod':
+     name_display = f"[reverse]{row['display_name']}[/reverse]"
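For reference, here is a minimal sketch of one way to keep the tree prefix intact while still marking the selected pod: split the leading tree characters from the pod name before applying the highlight. This is an illustration based on the comment above, not the PR's code, and the "├── "/"└── " prefixes are assumptions about how display_name is built.

import re

def mark_selected(display_name: str) -> str:
    # Separate any leading indentation/tree prefix (e.g. "  ├── ") from the pod name.
    match = re.match(r"^(\s*(?:├── |└── )?)(.*)$", display_name)
    prefix, name = match.group(1), match.group(2)
    # Keep the prefix untouched and highlight only the indicator plus the name.
    return f"{prefix}[reverse]▶ {name}[/reverse]"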

Comment on lines +684 to +685
total_lines = len(lines)
scroll_info = f" ({scroll_offset + 1}-{min(scroll_offset + (max_lines or total_lines), total_lines)}/{total_lines})"

Copilot AI Jan 21, 2026


The scroll indicator calculation on line 685 may produce incorrect values when scroll_offset is 0 and there are no logs. If lines is empty or contains only ["No logs available"], then scroll_offset + 1 would show "1-1/1" which is technically correct but could be clearer. Consider handling the edge case where there are no actual logs to display more gracefully, or adjusting the calculation to show "0-0/0" when there are no logs.

Suggested change
- total_lines = len(lines)
- scroll_info = f" ({scroll_offset + 1}-{min(scroll_offset + (max_lines or total_lines), total_lines)}/{total_lines})"
+ has_real_logs = bool(logs and logs.strip())
+ if not has_real_logs:
+     total_lines = 0
+     scroll_info = " (0-0/0)"
+ else:
+     total_lines = len(lines)
+     start_line = min(scroll_offset + 1, total_lines)
+     end_line = min(scroll_offset + (max_lines or total_lines), total_lines)
+     scroll_info = f" ({start_line}-{end_line}/{total_lines})"

Comment on lines +864 to 867
elif key_input == b'H':  # Up arrow on Windows (alternate)
    key = 'up'
elif key_input == b'P':  # Down arrow on Windows (alternate)
    key = 'down'

Copilot AI Jan 21, 2026


The Windows arrow key handling has redundant code paths. Lines 864-867 check for standalone b'H' and b'P' which would only occur if the extended key prefix b'\xe0' wasn't received first. However, Windows arrow keys always send the extended key prefix, so these fallback checks for b'H' and b'P' will never be triggered and are redundant. Consider removing lines 864-867 to simplify the code.
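For context, a minimal sketch of how extended keys typically arrive from msvcrt on Windows: the prefix byte (b'\xe0' or b'\x00') is read first, and only the following byte identifies the arrow key. The function and return values here are illustrative and may not match monitor.py.

import msvcrt

def read_key():
    ch = msvcrt.getch()
    if ch in (b'\xe0', b'\x00'):   # extended-key prefix always arrives first
        code = msvcrt.getch()      # the second byte identifies the arrow key
        return {b'H': 'up', b'P': 'down'}.get(code)
    if ch == b'\r':
        return 'enter'
    return None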

if 'gpu' in key.lower():
    try:
        gpu_count = int(capacity[key])
    except:

Copilot AI Jan 21, 2026


'except' clause does nothing but pass and there is no explanatory comment.

Suggested change
- except:
+ except (ValueError, TypeError):
+     # Some nodes may report a non-integer or unexpected GPU capacity; treat as zero and continue.

gpu_info['total_gpus'] += gpu_count
gpu_info['gpu_types'].add(gpu_type)
gpu_info['node_gpu_map'][node_name] = gpu_type
except:

Copilot AI Jan 21, 2026


'except' clause does nothing but pass and there is no explanatory comment.

Suggested change
- except:
+ except:
+     # Best-effort GPU/node discovery: ignore errors from kubectl/JSON parsing
+     # so that failure to gather GPU info does not break the monitor.

for node in gpu_info['nodes']:
    if node['name'] == node_name:
        node['allocated'] += gpu_req
except (ValueError, TypeError):

Copilot AI Jan 21, 2026


'except' clause does nothing but pass and there is no explanatory comment.

Suggested change
- except (ValueError, TypeError):
+ except (ValueError, TypeError):
+     # If the GPU request value is not a valid integer, ignore it
+     # and treat it as requesting 0 GPUs so monitoring continues.

node['allocated'] += gpu_req
except (ValueError, TypeError):
pass
except:

Copilot AI Jan 21, 2026


'except' clause does nothing but pass and there is no explanatory comment.

Suggested change
- except:
+ except:
+     # Intentionally ignore errors when parsing pod GPU allocations;
+     # if this fails, we proceed with zero/partial allocation data.

if 'gpu' in key.lower():
    try:
        gpu_request += int(value)
    except (ValueError, TypeError):

Copilot AI Jan 21, 2026


'except' clause does nothing but pass and there is no explanatory comment.

Suggested change
- except (ValueError, TypeError):
+ except (ValueError, TypeError):
+     # Intentionally ignore non-numeric or malformed GPU request values.

pass

# Get GPU allocation from pods
pods_json = run_cmd(f"kubectl get pods --all-namespaces -o json")
Contributor


The --all-namespaces option will fail when running inside an EIDF VM, so please use the args.namespace argument instead.

Also, we already fetch pod information elsewhere in the code. Could we take advantage of that, call kubectl only once, and extract all the pod info we need from that single result?
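For illustration, a minimal sketch of the suggested change, assuming the script's argparse result exposes a namespace attribute and reusing the existing run_cmd helper; the surrounding code in monitor.py may differ.

# Scope the query to the namespace passed on the command line instead of
# using --all-namespaces, which is not permitted from an EIDF VM.
pods_json = run_cmd(f"kubectl get pods -n {args.namespace} -o json")

# Ideally this call happens once and its parsed result is shared with the
# other pod-processing code paths instead of querying kubectl again here.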

if use_mock:
    return {
        'nodes': [
            {'name': 'gpu-node-01', 'gpu_type': 'H100-80GB', 'gpu_count': 8, 'allocated': 4},
Contributor


Could we move all the hardcoded mock data to mock_data.py so this main file stays cleaner?

Also, the per-type GPU totals in the mocked data:

H100-80GB 4/16
A100-80GB 8/16
A100-40GB 3/4

don't match the overall GPU total:
GPU 2/4
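As a rough sketch of the refactor being asked for: move the mock dictionary into mock_data.py and import it from monitor.py. MOCK_GPU_INFO is a hypothetical name, and only the node already shown above is reproduced here.

# mock_data.py
MOCK_GPU_INFO = {
    'nodes': [
        {'name': 'gpu-node-01', 'gpu_type': 'H100-80GB', 'gpu_count': 8, 'allocated': 4},
        # ... remaining mock nodes, kept consistent with the per-type totals above
    ],
}

# monitor.py
from mock_data import MOCK_GPU_INFO

def get_gpu_info(use_mock=False):
    if use_mock:
        return MOCK_GPU_INFO
    # ... fall through to the real kubectl-based discovery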

name_display = f"[reverse]{row['display_name']}[/reverse]"
# Add indicator for pods that can show logs
if row['type'] == 'pod':
    name_display = f"[reverse]▶ {row['display_name'].strip()}[/reverse]"
Contributor


When testing with mocked data on a small screen, where we need to scroll down to see the rest of the list, the highlighting disappears for some reason. It only comes back after scrolling up and then down again.


console.print("[bold yellow]Keyboard Shortcuts:[/bold yellow]")
console.print(" [cyan]↑/↓[/cyan] Navigate up and down")
console.print(" [cyan]Enter[/cyan] View logs for selected pod")
Contributor


Could you please update the README to reflect the changes?
