Overcommit support for metagpus and mgctl JSON output #5

manfuin · 2023-05-10T12:36:16Z

Use case

To improve workload density on a node when workloads have a bursty resource usage pattern (in our case, JupyterHub interactive sessions), overcommitting is the way to go.

Still, it is good to have an upper limit at which the workload is actually killed, as a preventive measure against memory leaks, misconfiguration, typos, etc.

This is essentially how the regular memory resource works: requests are used for scheduling, while limits are used for enforcement (OOM kill).

So the motivation is to implement the same behavior for metagpu allocation and memory enforcement.
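The request/limit split described above can be illustrated with a small sketch. This is not metagpu code: the `Workload` type, the `fits` and `over_limit` helpers, and the numbers are hypothetical, and the units are simply "metagpu units" for illustration.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Workload:
    name: str
    request: int  # units reserved at scheduling time (hypothetical field)
    limit: int    # units at which enforcement applies (hypothetical field)


def fits(node_capacity: int, running: List[Workload], new: Workload) -> bool:
    """Scheduling decision: only the sum of requests counts against capacity."""
    reserved = sum(w.request for w in running)
    return reserved + new.request <= node_capacity


def over_limit(workload: Workload, usage: int) -> bool:
    """Enforcement decision: usage above the limit makes the workload a kill candidate."""
    return usage > workload.limit


running = [Workload("notebook-a", request=20, limit=60)]
candidate = Workload("notebook-b", request=20, limit=60)

print(fits(100, running, candidate))  # requests sum to 40, which fits in 100
print(over_limit(candidate, 50))      # bursting above the request is tolerated
print(over_limit(candidate, 70))      # above the limit, enforcement applies
```

Note that the limits in this example sum to 120 on a node of 100 units, which is exactly the overcommit this PR is after: the node is packed by requests, and the limit only acts as a safety ceiling per workload.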

Overcommit implementation details

Metagpu now distinguishes between requests and limits throughout the code (including mgdp, mgctl and the Prometheus exporter). All enforcement and relative load calculations are based on limits; allocation uses requests. In general it is good to have such a distinction (maybe for future ResourceClaims flexibility).

Due to k8s limitations (requests must equal limits in the container spec for custom resources), the way to achieve overcommit is to specify the gpu-mem-limit.cnvrg.io/metagpu annotation, which essentially redefines the limits value from the regular pod spec.
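A minimal sketch of how such an annotation override could be resolved, for readers who want the rule in code form. Only the annotation key comes from this PR; the resource name `cnvrg.io/metagpu`, the `effective_limit` helper, and the assumption that the annotation value is a plain integer in the same units as the spec limit are illustrative guesses, not the actual implementation.

```python
from typing import Mapping

LIMIT_ANNOTATION = "gpu-mem-limit.cnvrg.io/metagpu"  # annotation key named in this PR
RESOURCE_NAME = "cnvrg.io/metagpu"  # assumed resource name, for illustration only


def effective_limit(annotations: Mapping[str, str],
                    spec_limits: Mapping[str, int]) -> int:
    """Return the limit used for enforcement.

    The container spec limit (equal to the request for custom resources)
    is the default; the annotation, when present, replaces it.
    Assumes the annotation value is an integer string.
    """
    spec_value = spec_limits[RESOURCE_NAME]
    raw = annotations.get(LIMIT_ANNOTATION)
    if raw is None:
        return spec_value
    return int(raw)


# Without the annotation, enforcement falls back to the spec value.
print(effective_limit({}, {RESOURCE_NAME: 2}))
# With the annotation, the enforcement limit is raised above the request.
print(effective_limit({LIMIT_ANNOTATION: "5"}, {RESOURCE_NAME: 2}))
```

The sketch leaves out validation, such as rejecting malformed values or an annotation lower than the request; how those cases are handled is not stated in this description.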

JSON output for `mgctl`

This entire overcommit work was done as part of integrating metagpu into our JupyterHub setup. On the user side we need data from mgctl to integrate with the JupyterLab environment, not just a colored table.

So along the way @ErmakovDmitriy implemented more generic mgctl output formatting, including JSON, which we use further in the system. Judging from the command line arguments that were declared but not yet implemented, you had already thought about JSON output. It interferes with the limits-awareness changes across the code, so it ended up as part of this PR.
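To illustrate why a generic formatter is useful for integrations, here is a rough sketch of rendering the same records either as a table for humans or as JSON for programs. It does not reflect mgctl's real output: the `render` function and the `pod`, `requests` and `limits` field names are hypothetical.

```python
import json
from typing import Dict, List


def render(rows: List[Dict[str, object]], fmt: str = "table") -> str:
    """Render records as tab-separated text or as JSON (hypothetical formatter)."""
    if fmt == "json":
        # Machine-readable form, suitable for consumption by other tools.
        return json.dumps(rows, indent=2, sort_keys=True)
    if fmt == "table":
        if not rows:
            return ""
        headers = list(rows[0].keys())
        lines = ["\t".join(headers)]
        lines += ["\t".join(str(row[h]) for h in headers) for row in rows]
        return "\n".join(lines)
    raise ValueError(f"unknown format: {fmt}")


rows = [{"pod": "notebook-a", "requests": 2, "limits": 5}]  # made-up record
print(render(rows, "table"))
print(render(rows, "json"))
```

A consumer such as a JupyterLab extension can then parse the JSON form with `json.loads` instead of scraping a colored table.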

manfuin and others added 13 commits May 4, 2023 12:17

use resource.requests to schedule and resource.limits to enforce

bd1e7ac

try enforcement limit via annotation

a276999

parse containerid from (docker|crio)-id.scope

3edaa37

Implement JSON, raw output for get

cfd33b2

Merge branch 'docker-and-crio-cgroup-name-format' into overcommit

96f14ae

propagate requests/limits logic to mgctl

ac0eb9f

Merge branch 'overcommit' into overcommit

a1d8630

Merge pull request #1 from ErmakovDmitriy/overcommit

f7123c8

Implement JSON, raw output for get

fix typo

f1f7353

adding reqests/limits logic to exported metrics

146fb70

relativeGpuUsage in mgctl from limits

1fdd374

Revert "Merge branch 'docker-and-crio-cgroup-name-format' into overco…

be001ae

…mmit for cleaner PR This reverts commit 96f14ae, reversing changes made to a276999.

document overcommit

fd89524
