Add GPU support and enhance monitoring features for jobs #2
base: main
Conversation
Pull request overview
This pull request adds GPU support and monitoring capabilities to a Kubernetes job monitoring tool. It introduces GPU resource tracking at both the cluster and job level, and adds an interactive log viewer with auto-refresh functionality.
Changes:
- Added GPU information collection from cluster nodes and pods, including GPU type identification and allocation tracking
- Enhanced job display to show GPU requests and types, with color-coded resource utilization indicators
- Introduced interactive log viewer with keyboard navigation, syntax highlighting, and auto-refresh capability
Reviewed changes
Copilot reviewed 2 out of 2 changed files in this pull request and generated 18 comments.
| File | Description |
|---|---|
| monitor.py | Added GPU tracking functions (get_gpu_info, _shorten_gpu_name), enhanced job/pod data collection with GPU information, added interactive log viewer with navigation, and updated UI to display GPU resources and selection indicators |
| mock_data.py | Added GPU field to all job definitions, implemented GPU resource requests in job specs, and added GPU-aware node assignment logic for mock pods |
```python
from rich.console import Console, Group
from rich.text import Text
```
Copilot AI · Jan 21, 2026
The imports `Group` and `Text` from the rich library are added but never used in the code. These unused imports should be removed to keep the code clean and avoid confusion.
Suggested change:

```diff
- from rich.console import Console, Group
- from rich.text import Text
+ from rich.console import Console
```
```python
node_gpu_map = {
    'gpu-node-01': 'A100-40GB',
    'gpu-node-02': 'A100-40GB',
    'gpu-node-03': 'V100-32GB',
```
Copilot AI · Jan 21, 2026
There is an inconsistency in the mock GPU node types. The `get_gpu_info` mock data returns nodes with H100-80GB and A100-80GB/A100-40GB types, but the `node_gpu_map` fallback in `get_jobs_pods` specifies A100-40GB for gpu-node-01 and gpu-node-02, and V100-32GB for gpu-node-03. This mismatch could lead to confusing or incorrect GPU type displays in mock mode. The `node_gpu_map` should align with the GPU types defined in `get_gpu_info`'s mock data.
Suggested change:

```diff
- node_gpu_map = {
-     'gpu-node-01': 'A100-40GB',
-     'gpu-node-02': 'A100-40GB',
-     'gpu-node-03': 'V100-32GB',
+ # Fallback mock mapping should match get_gpu_info's mock GPU types
+ node_gpu_map = {
+     'gpu-node-01': 'H100-80GB',
+     'gpu-node-02': 'A100-80GB',
+     'gpu-node-03': 'A100-40GB',
```
```python
# Add indicator for pods that can show logs
if row['type'] == 'pod':
    name_display = f"[reverse]▶ {row['display_name'].strip()}[/reverse]"
```
Copilot AI · Jan 21, 2026
When a pod is selected, the indicator "▶ " is added and the display_name is stripped. However, display_name already contains formatting with tree characters (├── or └── ) and padding. Stripping it may remove necessary spacing, and the indicator placement could break the tree structure visualization. Consider preserving the tree structure by inserting the indicator after the padding but before the pod name, or apply the indicator in a way that doesn't disrupt the visual tree hierarchy.
Suggested change:

```diff
- # Add indicator for pods that can show logs
- if row['type'] == 'pod':
-     name_display = f"[reverse]▶ {row['display_name'].strip()}[/reverse]"
+ # Add indicator for pods that can show logs without breaking tree indentation
+ if row['type'] == 'pod':
+     name_display = f"[reverse]{row['display_name']} ▶[/reverse]"
```
```python
total_lines = len(lines)
scroll_info = f" ({scroll_offset + 1}-{min(scroll_offset + (max_lines or total_lines), total_lines)}/{total_lines})"
```
Copilot AI · Jan 21, 2026
The scroll indicator calculation on line 685 may produce incorrect values when scroll_offset is 0 and there are no logs. If lines is empty or contains only ["No logs available"], then scroll_offset + 1 would show "1-1/1" which is technically correct but could be clearer. Consider handling the edge case where there are no actual logs to display more gracefully, or adjusting the calculation to show "0-0/0" when there are no logs.
Suggested change:

```diff
- total_lines = len(lines)
- scroll_info = f" ({scroll_offset + 1}-{min(scroll_offset + (max_lines or total_lines), total_lines)}/{total_lines})"
+ has_real_logs = bool(logs and logs.strip())
+ if not has_real_logs:
+     total_lines = 0
+     scroll_info = " (0-0/0)"
+ else:
+     total_lines = len(lines)
+     start_line = min(scroll_offset + 1, total_lines)
+     end_line = min(scroll_offset + (max_lines or total_lines), total_lines)
+     scroll_info = f" ({start_line}-{end_line}/{total_lines})"
```
```python
elif key_input == b'H':  # Up arrow on Windows (alternate)
    key = 'up'
elif key_input == b'P':  # Down arrow on Windows (alternate)
    key = 'down'
```
Copilot AI · Jan 21, 2026
The Windows arrow key handling has redundant code paths. Lines 864-867 check for standalone b'H' and b'P' which would only occur if the extended key prefix b'\xe0' wasn't received first. However, Windows arrow keys always send the extended key prefix, so these fallback checks for b'H' and b'P' will never be triggered and are redundant. Consider removing lines 864-867 to simplify the code.
```python
if 'gpu' in key.lower():
    try:
        gpu_count = int(capacity[key])
    except:
```
Copilot AI · Jan 21, 2026
'except' clause does nothing but pass and there is no explanatory comment.
Suggested change:

```diff
- except:
+ except (ValueError, TypeError):
+     # Some nodes may report a non-integer or unexpected GPU capacity; treat as zero and continue.
```
```python
gpu_info['total_gpus'] += gpu_count
gpu_info['gpu_types'].add(gpu_type)
gpu_info['node_gpu_map'][node_name] = gpu_type
except:
```
Copilot AI · Jan 21, 2026
'except' clause does nothing but pass and there is no explanatory comment.
Suggested change:

```diff
- except:
+ except:
+     # Best-effort GPU/node discovery: ignore errors from kubectl/JSON parsing
+     # so that failure to gather GPU info does not break the monitor.
```
```python
for node in gpu_info['nodes']:
    if node['name'] == node_name:
        node['allocated'] += gpu_req
except (ValueError, TypeError):
```
Copilot AI · Jan 21, 2026
'except' clause does nothing but pass and there is no explanatory comment.
Suggested change:

```diff
- except (ValueError, TypeError):
+ except (ValueError, TypeError):
+     # If the GPU request value is not a valid integer, ignore it
+     # and treat it as requesting 0 GPUs so monitoring continues.
```
```python
node['allocated'] += gpu_req
except (ValueError, TypeError):
    pass
except:
```
Copilot AI · Jan 21, 2026
'except' clause does nothing but pass and there is no explanatory comment.
Suggested change:

```diff
- except:
+ except:
+     # Intentionally ignore errors when parsing pod GPU allocations;
+     # if this fails, we proceed with zero/partial allocation data.
```
```python
if 'gpu' in key.lower():
    try:
        gpu_request += int(value)
    except (ValueError, TypeError):
        pass
```
Copilot AI · Jan 21, 2026
'except' clause does nothing but pass and there is no explanatory comment.
Suggested change:

```diff
- except (ValueError, TypeError):
+ except (ValueError, TypeError):
+     # Intentionally ignore non-numeric or malformed GPU request values.
```
```python
# Get GPU allocation from pods
pods_json = run_cmd("kubectl get pods --all-namespaces -o json")
```
The `--all-namespaces` option will fail when running inside an EIDF VM; please use the `args.namespace` argument instead.
Also, we are already fetching pod information elsewhere in the code. Could we take advantage of that and call kubectl only once, getting everything we need from pods out of that single result?
```python
if use_mock:
    return {
        'nodes': [
            {'name': 'gpu-node-01', 'gpu_type': 'H100-80GB', 'gpu_count': 8, 'allocated': 4},
```
Could we move all hardcoded mock data to mock_data.py to keep this main file cleaner?

Also, the per-type GPU values currently shown for mocked data:

```
H100-80GB 4/16
A100-80GB 8/16
A100-40GB 3/4
```

don't match the overall total displayed (`GPU 2/4`).
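One way to address both points at once (an illustrative sketch: `MOCK_GPU_NODES`, `mock_gpu_info`, and the counts are assumptions, not the PR's actual data) is to keep the node list in mock_data.py and derive the totals from it, so the summary can never disagree with the per-node numbers.

```python
# mock_data.py (hypothetical layout): single source of truth for GPU mocks.
MOCK_GPU_NODES = [
    {'name': 'gpu-node-01', 'gpu_type': 'H100-80GB', 'gpu_count': 8, 'allocated': 4},
    {'name': 'gpu-node-02', 'gpu_type': 'A100-80GB', 'gpu_count': 8, 'allocated': 5},
    {'name': 'gpu-node-03', 'gpu_type': 'A100-40GB', 'gpu_count': 4, 'allocated': 3},
]

def mock_gpu_info():
    """Build the GPU summary from the node list so totals always agree."""
    return {
        'nodes': MOCK_GPU_NODES,
        'total_gpus': sum(n['gpu_count'] for n in MOCK_GPU_NODES),
        'allocated_gpus': sum(n['allocated'] for n in MOCK_GPU_NODES),
        'gpu_types': {n['gpu_type'] for n in MOCK_GPU_NODES},
        'node_gpu_map': {n['name']: n['gpu_type'] for n in MOCK_GPU_NODES},
    }
```

monitor.py would then just call `mock_gpu_info()` in mock mode instead of holding its own literal dict.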
```python
name_display = f"[reverse]{row['display_name']}[/reverse]"
# Add indicator for pods that can show logs
if row['type'] == 'pod':
    name_display = f"[reverse]▶ {row['display_name'].strip()}[/reverse]"
```
When testing the mocked data on a small screen, where we need to scroll down to see the rest of the list, the highlighting disappears for some reason. It only returns after we scroll up and then down again.
```python
console.print("[bold yellow]Keyboard Shortcuts:[/bold yellow]")
console.print("  [cyan]↑/↓[/cyan]   Navigate up and down")
console.print("  [cyan]Enter[/cyan] View logs for selected pod")
```
Could you please update the README to reflect the changes?
Introduce GPU support in job data generation and monitoring, improve GPU resource management, and refine error handling. Enhance the log viewer with an auto-refresh feature for better usability.