Description
What happened:
When deploying a task with the engine thufeifeibear/xdit-dev and the model black-forest-labs/FLUX.1-dev, HAMi-core reports SET_TASK_PID FAILED and an out-of-memory error while intercepting the CUDA driver API.
(pid=409) [HAMI-core ERROR (pid:409 thread=139817570217024 utils.c:160)]: host pid is error!
(pid=409) [HAMI-core Warn(409:139817570217024:libvgpu.c:857)]: SET_TASK_PID FAILED.
// Some content has been omitted......
(ImageGenerator pid=414) [HAMI-core ERROR (pid:414 thread=140470261484608 allocator.c:54)]: Device 0 OOM 48309993472 / 48305799168
(ImageGenerator pid=414) [HAMI-core ERROR (pid:414 thread=140470261484608 allocator.c:54)]: Device 0 OOM 48309993472 / 48305799168
(ImageGenerator pid=414)
(ImageGenerator pid=414) [HAMI-core Info(414:140470261484608:memory.c:509)]: orig free=21910454272 total=47622258688 limit=48305799168 usage=48234496000
(ImageGenerator pid=414) [HAMI-core Info(414:140470261484608:memory.c:515)]: after free=0 total=47622258688 limit=48305799168 usage=48234496000
(ImageGenerator pid=414) [HAMI-core Debug(414:140470261484608:hook.c:403)]: found symbol cuDeviceGetAttribute
(ImageGenerator pid=414) [HAMI-core Debug(414:140470261484608:hook.c:403)]: found symbol cuMemAddressReserve
(ImageGenerator pid=414) [HAMI-core Debug(414:140470261484608:hook.c:403)]: found symbol cuMemRelease
(ImageGenerator pid=414) [HAMI-core Debug(414:140470261484608:hook.c:403)]: found symbol cuMemMap
(ImageGenerator pid=414) [HAMI-core Debug(414:140470261484608:hook.c:403)]: found symbol cuMemCreate
(ImageGenerator pid=414) [HAMI-core Debug(414:140470261484608:hook.c:403)]: found symbol cuMemImportFromShareableHandle
(ImageGenerator pid=414) [HAMI-core Debug(414:140470261484608:hook.c:403)]: found symbol cuMemsetD32Async
(ImageGenerator pid=414) [HAMI-core Debug(414:140470261484608:libvgpu.c:74)]: into dlsym nvmlInit_v2
(ImageGenerator pid=414) [HAMI-core Debug(414:140470261484608:libvgpu.c:74)]: into dlsym nvmlDeviceGetHandleByPciBusId_v2
(ImageGenerator pid=414) [HAMI-core Debug(414:140470261484608:libvgpu.c:74)]: into dlsym nvmlDeviceGetNvLinkRemoteDeviceType
(ImageGenerator pid=414) [HAMI-core Debug(414:140470261484608:libvgpu.c:74)]: into dlsym nvmlDeviceGetNvLinkRemotePciInfo_v2
(ImageGenerator pid=414) [HAMI-core Debug(414:140470261484608:libvgpu.c:74)]: into dlsym nvmlDeviceGetComputeRunningProcesses
(ImageGenerator pid=414) [HAMI-core Debug(414:140470261484608:libvgpu.c:74)]: into dlsym nvmlSystemGetCudaDriverVersion_v2
(ImageGenerator pid=414) [HAMI-core Debug(414:140470261484608:hook.c:436)]: nvmlInitWithFlags
(ImageGenerator pid=414) [HAMI-core Debug(414:140470261484608:hook.c:438)]: Hijacking nvmlInitWithFlags
(ImageGenerator pid=414) [HAMI-core Debug(414:140470261484608:device.c:10)]: Hijacking cuDeviceGetAttribute
(ImageGenerator pid=414) [HAMI-core Debug(414:140470261484608:device.c:10)]: Hijacking cuDeviceGetAttribute
(ImageGenerator pid=414) [HAMI-core Debug(414:140470261484608:device.c:10)]: Hijacking cuDeviceGetAttribute
(ImageGenerator pid=414) [HAMI-core Debug(414:140470261484608:device.c:10)]: Hijacking cuDeviceGetAttribute
(ImageGenerator pid=414) [HAMI-core Debug(414:140470261484608:device.c:10)]: Hijacking cuDeviceGetAttribute
(ImageGenerator pid=414) [HAMI-core Info(414:140470261484608:hook.c:405)]: NVML DeviceGetHandleByPciBusID_v2 00000000:10:00.0
(ImageGenerator pid=414) Exception raised in creation task: The actor died because of an error raised in its creation task, ray::ImageGenerator.__init__() (pid=414, ip=10.224.4.19, actor_id=4d89d46aec24713e09b28ddd01000000, repr=<server_real.ImageGenerator object at 0x7fc1c4460610>)
(ImageGenerator pid=414) ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(ImageGenerator pid=414) ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(ImageGenerator pid=414) File "/app/server_real.py", line 71, in __init__
(ImageGenerator pid=414) self.initialize_model(xfuser_args)
(ImageGenerator pid=414) ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(ImageGenerator pid=414) File "/app/server_real.py", line 90, in initialize_model
(ImageGenerator pid=414) ).to("cuda")
(ImageGenerator pid=414) ^^^^^^^^^^
(ImageGenerator pid=414) File "/usr/local/lib/python3.11/site-packages/xfuser/model_executor/pipelines/base_pipeline.py", line 201, in to
(ImageGenerator pid=414) self.module = self.module.to(*args, **kwargs)
(ImageGenerator pid=414) ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(ImageGenerator pid=414) File "/usr/local/lib/python3.11/site-packages/diffusers/pipelines/pipeline_utils.py", line 541, in to
(ImageGenerator pid=414) module.to(device, dtype)
(ImageGenerator pid=414) File "/usr/local/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1369, in to
(ImageGenerator pid=414) return self._apply(convert)
(ImageGenerator pid=414) ^^^^^^^^^^^^^^^^^^^^
(ImageGenerator pid=414) File "/usr/local/lib/python3.11/site-packages/torch/nn/modules/module.py", line 928, in _apply
(ImageGenerator pid=414) module._apply(fn)
(ImageGenerator pid=414) File "/usr/local/lib/python3.11/site-packages/torch/nn/modules/module.py", line 928, in _apply
(ImageGenerator pid=414) module._apply(fn)
(ImageGenerator pid=414) File "/usr/local/lib/python3.11/site-packages/torch/nn/modules/module.py", line 928, in _apply
(ImageGenerator pid=414) module._apply(fn)
(ImageGenerator pid=414) [Previous line repeated 1 more time]
(ImageGenerator pid=414) File "/usr/local/lib/python3.11/site-packages/torch/nn/modules/module.py", line 955, in _apply
(ImageGenerator pid=414) param_applied = fn(param)
(ImageGenerator pid=414) ^^^^^^^^^
(ImageGenerator pid=414) File "/usr/local/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1355, in convert
(ImageGenerator pid=414) return t.to(
(ImageGenerator pid=414) ^^^^^
(ImageGenerator pid=414) torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 72.00 MiB. GPU 0 has a total capacity of 44.35 GiB of which 0 bytes is free. Process 2342430 has 23.93 GiB memory in use. Of the allocated memory 23.67 GiB is allocated by PyTorch, and 6.32 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables)
What you expected to happen: HAMi-core to work normally, without the spurious OOM error.
How to reproduce it (as minimally and precisely as possible): deploy a task with the engine thufeifeibear/xdit-dev and the model black-forest-labs/FLUX.1-dev.
Anything else we need to know?:
Container start command:
python server_real.py --world_size 2 --ulysses_parallel_degree 2
If the Pod environment variable CUDA_DISABLE_CONTROL is set to disable HAMi-core interception, the service runs normally and no OOM error is reported.
The Pod environment variable LIBCUDA_LOG_LEVEL was set to 4 to print HAMi-core DEBUG logs, which were saved as the following files (a sketch of the relevant environment settings follows the file list):
Original HAMi-core debug log: flux.txt
To simplify review, the logs were also cleaned up. The results are as follows:
- With the log lines of the container's process 1 manually removed: flux without process 1.txt
- With the log lines of all processes other than the failing process 409 manually removed: flux409.txt
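
For reference, a minimal sketch of the settings involved, assuming HAMi-core (libvgpu.so) reads both variables from the process environment; in the reported deployment they were set as Pod environment variables rather than shell exports, and the launch command is the one shown above:

```bash
# Sketch only: in the actual Pod these are env entries in the container spec.
export LIBCUDA_LOG_LEVEL=4          # verbose HAMi-core DEBUG logging (used to capture flux.txt)
# export CUDA_DISABLE_CONTROL=true  # workaround: disables HAMi-core interception; no OOM occurs
python server_real.py --world_size 2 --ulysses_parallel_degree 2
```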
Environment:
When inspecting the container processes via nvidia-smi at the time of the error exit (please see the video below), the two worker processes each occupied approximately 22–23 GB, on GPU cards 2 and 3 respectively. At the same time, HAMi-core's OOM message reported roughly 45 GiB of usage against the limit on a single device (usage=48234496000 against limit=48305799168 bytes), which is close to the sum of the two workers' usage rather than either worker alone. It is therefore suspected that HAMi-core aggregated the memory of both cards onto a single card, triggering the OOM. It is unclear whether this is related to the initial SET_TASK_PID FAILED error.
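
The per-process numbers quoted above can be cross-checked on the host with a query along these lines (the exact command used for the video is not shown here; plain nvidia-smi shows the same data in its Processes table):

```bash
# Per-process GPU memory as seen by the driver; the two xDiT workers showed
# roughly 22-23 GB each, on cards 2 and 3 respectively.
nvidia-smi --query-compute-apps=pid,process_name,used_memory --format=csv
```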
