Overcommit support for metagpus and mgctl JSON output #5
Use case
To improve workload density on the node when workloads have a bursty pattern of resource usage (in our case, jupyterhub interactive sessions), overcommitting is the way to go.
Still, it is good to have an upper limit at which the workload is actually killed, as a preventive measure against memory leaks, misconfiguration, typos, etc.
This is essentially how the normal memory resource works: requests are used for scheduling, while limits are used for enforcement (OOM kill).
So the motivation is to implement the same behavior for metagpu allocation and memory enforcement.
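The requests/limits split described above can be illustrated with a toy model. This is only a sketch of the general idea, not metagpu or Kubernetes code: the `Pod` class, the `fits` and `over_limit` helpers, and all numbers are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Pod:
    name: str
    request_mib: int  # what the scheduler reserves when placing the pod
    limit_mib: int    # the ceiling checked at runtime


def fits(node_capacity_mib: int, placed: list[Pod], candidate: Pod) -> bool:
    """Scheduling looks only at requests, never at limits."""
    reserved = sum(p.request_mib for p in placed)
    return reserved + candidate.request_mib <= node_capacity_mib


def over_limit(pod: Pod, used_mib: int) -> bool:
    """Enforcement looks only at the pod's own limit."""
    return used_mib > pod.limit_mib


# Hypothetical numbers: three bursty sessions on a node with 8192 MiB.
NODE_CAPACITY_MIB = 8192
pods = [Pod(f"session-{i}", request_mib=2048, limit_mib=4096) for i in range(3)]

placed: list[Pod] = []
for pod in pods:
    if fits(NODE_CAPACITY_MIB, placed, pod):
        placed.append(pod)

# All three sessions are placed because their requests sum to 6144 MiB,
# even though their limits sum to 12288 MiB (overcommit).
total_limits = sum(p.limit_mib for p in placed)
print(len(placed), total_limits)

# A burst above the request but below the limit is tolerated;
# usage above the limit makes the workload a candidate for being killed.
print(over_limit(placed[0], used_mib=3000))
print(over_limit(placed[0], used_mib=5000))
```

The density gain comes from the gap between the two values: placement reserves only the request, so the sum of limits on a node may exceed its capacity as long as the sessions do not all burst at once.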
Overcommit implementation details
Metagpu now distinguishes requests from limits throughout the code (including mgdp, mgctl and the Prometheus exporter). All enforcement and relative load calculations come from limits. Allocation uses requests. In general it is good to have such a distinction (maybe for future ResourceClaims flexibility).
Due to k8s limitations (requests must be equal to limits in the container spec for custom resources), the way to achieve overcommit is to specify the gpu-mem-limit.cnvrg.io/metagpu annotation, which essentially redefines the value of limits from the regular pod spec.
JSON output for mgctl
This entire overcommit work was done as part of integrating metagpu into our jupyterhub setup. On the user side we need data from mgctl to integrate with the JupyterLab environment, not just a colored table.
So along the way @ErmakovDmitriy implemented more generic mgctl output formatting, including JSON, which we use further in the system. Judging from the command line arguments that were declared but not previously implemented, you had already thought about JSON output. It interferes with the limits-awareness changes across the code, so it ended up as part of this PR.
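To show why machine-readable output matters for the JupyterLab integration, here is a minimal sketch of a consumer. It assumes a JSON document has already been obtained from mgctl; the payload and its field names (`pod`, `memory_used_mib`, `memory_request_mib`, `memory_limit_mib`) are placeholders invented for illustration and are not the actual mgctl schema.

```python
import json

# Hypothetical payload: the structure and field names are assumptions,
# not the real mgctl JSON output.
SAMPLE = """
[
  {"pod": "jupyter-alice", "memory_used_mib": 3000,
   "memory_request_mib": 2048, "memory_limit_mib": 4096},
  {"pod": "jupyter-bob", "memory_used_mib": 512,
   "memory_request_mib": 2048, "memory_limit_mib": 4096}
]
"""


def summarize(payload: str) -> list[dict]:
    """Turn raw JSON into per-pod figures a UI widget could display."""
    rows = []
    for entry in json.loads(payload):
        used = entry["memory_used_mib"]
        limit = entry["memory_limit_mib"]
        rows.append({
            "pod": entry["pod"],
            # Relative load is computed against the limit, not the request.
            "percent_of_limit": round(100 * used / limit, 1),
            # True when the session is bursting beyond what was scheduled.
            "above_request": used > entry["memory_request_mib"],
        })
    return rows


rows = summarize(SAMPLE)
for row in rows:
    print(row)
```

With a colored table, the same figures would have to be scraped from terminal output, which is the problem the JSON format avoids.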