@manfuin commented May 10, 2023

Use case

To improve workload density on a node when workloads have a bursty pattern of resource usage (in our case, JupyterHub interactive sessions), overcommitting is the way to go.

Still, it is good to have an upper limit at which the workload is actually killed, as a preventive measure against memory leaks, misconfiguration, typos, etc.

This is essentially how the regular memory resource works: requests are used for scheduling, while limits are used for enforcement (OOM kill).

So the motivation is to implement the same behavior for metagpu allocation and memory enforcement.

Overcommit implementation details

Metagpu now distinguishes between requests and limits throughout the code (including mgdp, mgctl, and the Prometheus exporter). All enforcement and relative load calculations come from limits; allocation uses requests. In general, it is good to have such a distinction (perhaps useful for future ResourceClaims flexibility).

Due to a Kubernetes limitation (requests must equal limits in the container spec for extended resources), overcommit is achieved by specifying the gpu-mem-limit.cnvrg.io/metagpu annotation, which essentially redefines the limits value from the regular pod spec.
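For illustration, a minimal pod spec sketch of how this could look. The extended resource name (cnvrg.io/metagpu) and the annotation value format are assumptions for the sake of the example, not taken from this PR; check the metagpu docs for the exact names.

```yaml
# Sketch only: a pod scheduled for 4 metagpu units that is allowed to
# burst up to an effective limit of 16 before enforcement kicks in.
# Resource name and annotation value format are assumptions.
apiVersion: v1
kind: Pod
metadata:
  name: jupyter-session
  annotations:
    # Overrides the effective limit, since k8s forces requests == limits
    # for extended resources in the container spec.
    gpu-mem-limit.cnvrg.io/metagpu: "16"
spec:
  containers:
    - name: notebook
      image: jupyter/base-notebook
      resources:
        requests:
          cnvrg.io/metagpu: 4   # used for scheduling / allocation
        limits:
          cnvrg.io/metagpu: 4   # must equal requests; overridden by the annotation
```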

JSON output for mgctl

This entire overcommit work was done as part of integrating metagpu into our JupyterHub setup. On the user side, we need data from mgctl to integrate with the JupyterLab environment, not just a colored table.

Along the way, @ErmakovDmitriy implemented more generic mgctl output formatting, including JSON, which we use further in our system. Judging from the previously unimplemented command-line arguments, you had already thought about JSON output. This work interferes with the limits-awareness changes across the code, so it ended up as part of this PR.
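As a rough illustration of the user-side consumption, here is a hedged Python sketch that shells out to mgctl and parses JSON output. The subcommand, the JSON flag, and the output schema used below are all assumptions for illustration, not the confirmed mgctl interface from this PR.

```python
import json
import subprocess
from typing import Optional

# Hypothetical invocation: subcommand and JSON flag are assumptions.
MGCTL_CMD = ["mgctl", "get", "processes", "--json"]

def gpu_usage_for_pod(pod_name: str) -> Optional[dict]:
    """Return the metagpu usage entry for a given pod, or None if absent."""
    raw = subprocess.run(MGCTL_CMD, capture_output=True, text=True, check=True).stdout
    data = json.loads(raw)
    # Illustrative schema: a list of per-process entries with "pod",
    # "request", and "limit" fields.
    for entry in data:
        if entry.get("pod") == pod_name:
            return entry
    return None

if __name__ == "__main__":
    usage = gpu_usage_for_pod("jupyter-session")
    if usage is not None:
        print(f"requested: {usage.get('request')}, limit: {usage.get('limit')}")
```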
