-
Notifications
You must be signed in to change notification settings - Fork 519
Description
[2025-03-12 13:48:38,645] [INFO] [real_accelerator.py:219:get_accelerator] Setting ds_accelerator to cuda (auto detect)
W0312 13:48:39.857000 66730 /home/shawn/diskb/shawn/anaconda3/envs/openmanus-rl/lib/python3.10/site-packages/torch/distributed/run.py:793]
W0312 13:48:39.857000 66730 /home/shawn/diskb/shawn/anaconda3/envs/openmanus-rl/lib/python3.10/site-packages/torch/distributed/run.py:793] *****************************************
W0312 13:48:39.857000 66730 /home/shawn/diskb/shawn/anaconda3/envs/openmanus-rl/lib/python3.10/site-packages/torch/distributed/run.py:793] Setting OMP_NUM_THREADS environment variable for each process to be 1 in default, to avoid your system being overloaded, please further tune the variable for optimal performance in your application as needed.
W0312 13:48:39.857000 66730 /home/shawn/diskb/shawn/anaconda3/envs/openmanus-rl/lib/python3.10/site-packages/torch/distributed/run.py:793] *****************************************
[2025-03-12 13:48:44,775] [INFO] [real_accelerator.py:219:get_accelerator] Setting ds_accelerator to cuda (auto detect)
[2025-03-12 13:48:44,786] [INFO] [real_accelerator.py:219:get_accelerator] Setting ds_accelerator to cuda (auto detect)
[2025-03-12 13:48:45,659] [INFO] [comm.py:652:init_distributed] cdb=None
[2025-03-12 13:48:45,669] [INFO] [comm.py:652:init_distributed] cdb=None
[2025-03-12 13:48:45,669] [INFO] [comm.py:683:init_distributed] Initializing TorchBackend in DeepSpeed with backend nccl
Generating train split: 8891 examples [00:00, 40534.39 examples/s]
[rank0]: Traceback (most recent call last):
[rank0]: File "/home/shawn/anaconda3/envs/openmanus-rl/lib/python3.10/site-packages/datasets/builder.py", line 1870, in _prepare_split_single
[rank0]: writer.write_table(table)
[rank0]: File "/home/shawn/anaconda3/envs/openmanus-rl/lib/python3.10/site-packages/datasets/arrow_writer.py", line 622, in write_table
[rank0]: pa_table = table_cast(pa_table, self._schema)
[rank0]: File "/home/shawn/anaconda3/envs/openmanus-rl/lib/python3.10/site-packages/datasets/table.py", line 2292, in table_cast
[rank0]: return cast_table_to_schema(table, schema)
[rank0]: File "/home/shawn/anaconda3/envs/openmanus-rl/lib/python3.10/site-packages/datasets/table.py", line 2245, in cast_table_to_schema
[rank0]: arrays = [
[rank0]: File "/home/shawn/anaconda3/envs/openmanus-rl/lib/python3.10/site-packages/datasets/table.py", line 2246, in
[rank0]: cast_array_to_feature(
[rank0]: File "/home/shawn/anaconda3/envs/openmanus-rl/lib/python3.10/site-packages/datasets/table.py", line 1795, in wrapper
[rank0]: return pa.chunked_array([func(chunk, *args, **kwargs) for chunk in array.chunks])
[rank0]: File "/home/shawn/anaconda3/envs/openmanus-rl/lib/python3.10/site-packages/datasets/table.py", line 1795, in
[rank0]: return pa.chunked_array([func(chunk, *args, **kwargs) for chunk in array.chunks])
[rank0]: File "/home/shawn/anaconda3/envs/openmanus-rl/lib/python3.10/site-packages/datasets/table.py", line 2013, in cast_array_to_feature
[rank0]: casted_array_values = _c(array.values, feature[0])
[rank0]: File "/home/shawn/anaconda3/envs/openmanus-rl/lib/python3.10/site-packages/datasets/table.py", line 1797, in wrapper
[rank0]: return func(array, *args, **kwargs)
[rank0]: File "/home/shawn/anaconda3/envs/openmanus-rl/lib/python3.10/site-packages/datasets/table.py", line 2108, in cast_array_to_feature
[rank0]: raise TypeError(f"Couldn't cast array of type\n{_short_str(array.type)}\nto\n{_short_str(feature)}")
[rank0]: TypeError: Couldn't cast array of type
[rank0]: struct<role: string, content: string, type: string>
[rank0]: to
[rank0]: {'role': Value(dtype='string', id=None), 'content': Value(dtype='string', id=None), 'loss': Value(dtype='bool', id=None)}
[rank0]: The above exception was the direct cause of the following exception:
[rank0]: Traceback (most recent call last):
[rank0]: File "/home/shawn/diska/samba/Train/BIgmode/RAG/MANUS/OpenManus-RL-main/worksapce/../openmanus-rl/sft.py", line 178, in
[rank0]: main(script_args, training_args, model_args)
[rank0]: File "/home/shawn/diska/samba/Train/BIgmode/RAG/MANUS/OpenManus-RL-main/worksapce/../openmanus-rl/sft.py", line 108, in main
[rank0]: dataset = load_dataset(script_args.dataset_name, name=script_args.dataset_config,cache_dir='../../Datasets/CharlieDreemur_OpenManus-RL/')
[rank0]: File "/home/shawn/anaconda3/envs/openmanus-rl/lib/python3.10/site-packages/datasets/load.py", line 2151, in load_dataset
[rank0]: builder_instance.download_and_prepare(
[rank0]: File "/home/shawn/anaconda3/envs/openmanus-rl/lib/python3.10/site-packages/datasets/builder.py", line 924, in download_and_prepare
[rank0]: self._download_and_prepare(
[rank0]: File "/home/shawn/anaconda3/envs/openmanus-rl/lib/python3.10/site-packages/datasets/builder.py", line 1000, in _download_and_prepare
[rank0]: self._prepare_split(split_generator, **prepare_split_kwargs)
[rank0]: File "/home/shawn/anaconda3/envs/openmanus-rl/lib/python3.10/site-packages/datasets/builder.py", line 1741, in _prepare_split
[rank0]: for job_id, done, content in self._prepare_split_single(
[rank0]: File "/home/shawn/anaconda3/envs/openmanus-rl/lib/python3.10/site-packages/datasets/builder.py", line 1897, in _prepare_split_single
[rank0]: raise DatasetGenerationError("An error occurred while generating the dataset") from e
[rank0]: datasets.exceptions.DatasetGenerationError: An error occurred while generating the dataset
Generating train split: 4627 examples [00:00, 33344.35 examples/s][rank0]:[W312 13:48:49.043917769 ProcessGroupNCCL.cpp:1250] Warning: WARNING: process group has NOT been destroyed before we destruct ProcessGroupNCCL. On normal program exit, the application should call destroy_process_group to ensure that any pending NCCL operations have finished in this process. In rare cases this process can exit before this point and block the progress of another member of the process group. This constraint has always been present, but this warning has only been added since PyTorch 2.4 (function operator())
Generating train split: 8891 examples [00:00, 36580.33 examples/s]
[rank1]: Traceback (most recent call last):
[rank1]: File "/home/shawn/anaconda3/envs/openmanus-rl/lib/python3.10/site-packages/datasets/builder.py", line 1870, in _prepare_split_single
[rank1]: writer.write_table(table)
[rank1]: File "/home/shawn/anaconda3/envs/openmanus-rl/lib/python3.10/site-packages/datasets/arrow_writer.py", line 622, in write_table
[rank1]: pa_table = table_cast(pa_table, self._schema)
[rank1]: File "/home/shawn/anaconda3/envs/openmanus-rl/lib/python3.10/site-packages/datasets/table.py", line 2292, in table_cast
[rank1]: return cast_table_to_schema(table, schema)
[rank1]: File "/home/shawn/anaconda3/envs/openmanus-rl/lib/python3.10/site-packages/datasets/table.py", line 2245, in cast_table_to_schema
[rank1]: arrays = [
[rank1]: File "/home/shawn/anaconda3/envs/openmanus-rl/lib/python3.10/site-packages/datasets/table.py", line 2246, in
[rank1]: cast_array_to_feature(
[rank1]: File "/home/shawn/anaconda3/envs/openmanus-rl/lib/python3.10/site-packages/datasets/table.py", line 1795, in wrapper
[rank1]: return pa.chunked_array([func(chunk, *args, **kwargs) for chunk in array.chunks])
[rank1]: File "/home/shawn/anaconda3/envs/openmanus-rl/lib/python3.10/site-packages/datasets/table.py", line 1795, in
[rank1]: return pa.chunked_array([func(chunk, *args, **kwargs) for chunk in array.chunks])
[rank1]: File "/home/shawn/anaconda3/envs/openmanus-rl/lib/python3.10/site-packages/datasets/table.py", line 2013, in cast_array_to_feature
[rank1]: casted_array_values = _c(array.values, feature[0])
[rank1]: File "/home/shawn/anaconda3/envs/openmanus-rl/lib/python3.10/site-packages/datasets/table.py", line 1797, in wrapper
[rank1]: return func(array, *args, **kwargs)
[rank1]: File "/home/shawn/anaconda3/envs/openmanus-rl/lib/python3.10/site-packages/datasets/table.py", line 2108, in cast_array_to_feature
[rank1]: raise TypeError(f"Couldn't cast array of type\n{_short_str(array.type)}\nto\n{_short_str(feature)}")
[rank1]: TypeError: Couldn't cast array of type
[rank1]: struct<role: string, content: string, type: string>
[rank1]: to
[rank1]: {'role': Value(dtype='string', id=None), 'content': Value(dtype='string', id=None), 'loss': Value(dtype='bool', id=None)}
[rank1]: The above exception was the direct cause of the following exception:
[rank1]: Traceback (most recent call last):
[rank1]: File "/home/shawn/diska/samba/Train/BIgmode/RAG/MANUS/OpenManus-RL-main/worksapce/../openmanus-rl/sft.py", line 178, in
[rank1]: main(script_args, training_args, model_args)
[rank1]: File "/home/shawn/diska/samba/Train/BIgmode/RAG/MANUS/OpenManus-RL-main/worksapce/../openmanus-rl/sft.py", line 108, in main
[rank1]: dataset = load_dataset(script_args.dataset_name, name=script_args.dataset_config,cache_dir='../../Datasets/CharlieDreemur_OpenManus-RL/')
[rank1]: File "/home/shawn/anaconda3/envs/openmanus-rl/lib/python3.10/site-packages/datasets/load.py", line 2151, in load_dataset
[rank1]: builder_instance.download_and_prepare(
[rank1]: File "/home/shawn/anaconda3/envs/openmanus-rl/lib/python3.10/site-packages/datasets/builder.py", line 924, in download_and_prepare
[rank1]: self._download_and_prepare(
[rank1]: File "/home/shawn/anaconda3/envs/openmanus-rl/lib/python3.10/site-packages/datasets/builder.py", line 1000, in _download_and_prepare
[rank1]: self._prepare_split(split_generator, **prepare_split_kwargs)
[rank1]: File "/home/shawn/anaconda3/envs/openmanus-rl/lib/python3.10/site-packages/datasets/builder.py", line 1741, in _prepare_split
[rank1]: for job_id, done, content in self._prepare_split_single(
[rank1]: File "/home/shawn/anaconda3/envs/openmanus-rl/lib/python3.10/site-packages/datasets/builder.py", line 1897, in _prepare_split_single
[rank1]: raise DatasetGenerationError("An error occurred while generating the dataset") from e
[rank1]: datasets.exceptions.DatasetGenerationError: An error occurred while generating the dataset
W0312 13:48:50.585000 66730 /home/shawn/diskb/shawn/anaconda3/envs/openmanus-rl/lib/python3.10/site-packages/torch/distributed/elastic/multiprocessing/api.py:897] Sending process 66865 closing signal SIGTERM
E0312 13:48:50.699000 66730 /home/shawn/diskb/shawn/anaconda3/envs/openmanus-rl/lib/python3.10/site-packages/torch/distributed/elastic/multiprocessing/api.py:869] failed (exitcode: 1) local_rank: 0 (pid: 66864) of binary: /home/shawn/anaconda3/envs/openmanus-rl/bin/python
Traceback (most recent call last):
File "/home/shawn/anaconda3/envs/openmanus-rl/bin/accelerate", line 8, in
sys.exit(main())
File "/home/shawn/anaconda3/envs/openmanus-rl/lib/python3.10/site-packages/accelerate/commands/accelerate_cli.py", line 48, in main
args.func(args)
File "/home/shawn/anaconda3/envs/openmanus-rl/lib/python3.10/site-packages/accelerate/commands/launch.py", line 1182, in launch_command
deepspeed_launcher(args)
File "/home/shawn/anaconda3/envs/openmanus-rl/lib/python3.10/site-packages/accelerate/commands/launch.py", line 861, in deepspeed_launcher
distrib_run.run(args)
File "/home/shawn/anaconda3/envs/openmanus-rl/lib/python3.10/site-packages/torch/distributed/run.py", line 910, in run
elastic_launch(
File "/home/shawn/anaconda3/envs/openmanus-rl/lib/python3.10/site-packages/torch/distributed/launcher/api.py", line 138, in call
return launch_agent(self._config, self._entrypoint, list(args))
File "/home/shawn/anaconda3/envs/openmanus-rl/lib/python3.10/site-packages/torch/distributed/launcher/api.py", line 269, in launch_agent
raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError:
../openmanus-rl/sft.py FAILED
Failures:
<NO_OTHER_FAILURES>
Root Cause (first observed failure):
[0]:
time : 2025-03-12_13:48:50
host : yd-virtual-machine
rank : 0 (local_rank: 0)
exitcode : 1 (pid: 66864)
error_file: <N/A>
traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html