-
Notifications
You must be signed in to change notification settings - Fork 208
feat: Omni dataloader for HF models #1639
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: main
Are you sure you want to change the base?
Conversation
…ttings in the config file
📝 WalkthroughWalkthroughAdds support for multimodal (audio, video, image) conversation datasets through a new Changes
Sequence DiagramsequenceDiagram
participant App as Training Script
participant DS as GeneralConversationsJsonlDataset
participant Preproc as Preprocessor<br/>(conversation_base)
participant MediaUtil as Multimodal Utils
participant Tokenizer as Tokenizer
App->>DS: Load dataset with media dirs
DS->>DS: Load JSONL conversations
DS->>Preproc: Process example
Preproc->>Preproc: Normalize metadata
Preproc->>Preproc: Split message by media tags
Preproc->>MediaUtil: Resolve media file paths
MediaUtil->>MediaUtil: Check/extend file extensions
MediaUtil-->>Preproc: Resolved media paths
Preproc-->>DS: Formatted messages with placeholders
DS-->>App: Preprocessed example
App->>Tokenizer: Tokenize + Load media
Tokenizer->>MediaUtil: load_media_from_message
MediaUtil->>MediaUtil: Load images (PIL)<br/>Load audio (transformers/decord)<br/>Load video (transformers/decord)
MediaUtil-->>Tokenizer: Media tensors
Tokenizer->>Tokenizer: Pack tensors<br/>by computed dimension
Tokenizer-->>App: Token IDs + multimodal embeddings
Estimated code review effort🎯 3 (Moderate) | ⏱️ ~25 minutes
Possibly related PRs
Suggested labels
Suggested reviewers
Pre-merge checks and finishing touches❌ Failed checks (2 warnings)
✅ Passed checks (2 passed)
✨ Finishing touches
🧪 Generate unit tests (beta)
Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out. Comment |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Actionable comments posted: 15
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (1)
examples/run_sft.py (1)
144-147: Inconsistent config access pattern between train and validation.Training dataset (lines 131-132) accesses
data_config["add_bos"]anddata_config["add_eos"]directly, while validation (lines 145-146) uses.get("add_bos", True)with defaults. This inconsistency could mask configuration errors.Either both should use direct access (if these are required config keys) or both should use
.get()with defaults. As per coding guidelines, YAML should be the single source of truth for defaults, and required attributes should be accessed directly.
🧹 Nitpick comments (14)
nemo_rl/data/llm_message_utils.py (2)
614-616: Remove stray blank line inside code block.Line 615 contains an empty line with extra whitespace that breaks the code block flow.
if "video" in media_cur_message: media_kwargs["videos"] = media_cur_message["video"] - processed_chunk = tokenizer(
542-564: Remove debug print statements before merge.This debug block uses a function attribute (
_debug_printed) to gate one-time printing, which is an unusual pattern that adds mutable state to the function object. This should be removed before merging, or converted to proper logging with configurable verbosity.examples/run_sft.py (1)
121-123: Consider defensive key access for datum_preprocessor.The code assumes
data.datum_preprocessorcontains both"train"and"val"keys when present. If a dataset provides only one key, this would raise aKeyError.Consider using
.get()with a fallback:elif hasattr(data, "datum_preprocessor"): - datum_preprocessor_train = data.datum_preprocessor["train"] - datum_preprocessor_val = data.datum_preprocessor["val"] + datum_preprocessor_train = data.datum_preprocessor.get("train") + datum_preprocessor_val = data.datum_preprocessor.get("val")This would gracefully handle cases where only one split has a preprocessor.
nemo_rl/algorithms/utils.py (2)
327-330: Addstacklevel=2to warnings for proper caller attribution.When
warnings.warn()is called from a utility function, settingstacklevel=2ensures the warning points to the calling code rather than this function, making debugging easier.warnings.warn( - f"Overriding audio sampling rate from {processor.feature_extractor.sampling_rate} to {new_sampling_rate}" + f"Overriding audio sampling rate from {processor.feature_extractor.sampling_rate} to {new_sampling_rate}", + stacklevel=2, )Apply the same fix to warnings at lines 336-338 and 344-346.
340-347: Consider explicit validation for mutually exclusive fps/num_frames.The comment acknowledges that
fpsandnum_framescannot co-exist, but the code defers the error to downstream components. Providing an early, clear error message would improve the user experience.+ if "fps" in tokenizer_config["video"] and "num_frames" in tokenizer_config["video"]: + raise ValueError( + "Cannot specify both 'fps' and 'num_frames' in video configuration; they are mutually exclusive." + ) # fps and num_frames cannot co-exist, but let it crash later if "num_frames" in tokenizer_config["video"] and \ tokenizer_config["video"]["num_frames"] != processor.video_processor.num_frames:nemo_rl/data/datasets/response_datasets/__init__.py (1)
146-157: Consider addingGeneralConversationsJsonlDatasetto__all__.The new class is imported but not exported in
__all__. If it should be part of the public API (importable directly from this module), add it to the list:__all__ = [ "CLEVRCoGenTDataset", "DeepScalerDataset", "DAPOMath17KDataset", + "GeneralConversationsJsonlDataset", "Geometry3KDataset", "OpenAIFormatDataset", "OasstDataset", "OpenMathInstruct2Dataset", "RefCOCODataset", "ResponseDataset", "SquadDataset", ]If the class is only intended to be used through
load_response_dataset(), this can be safely ignored.nemo_rl/data/datasets/response_datasets/conversation_base.py (3)
15-24: Remove unused imports.
io,copy,dataclasses, andPathare imported but never used in this file.import os import re -import io -import copy import warnings -import dataclasses from PIL import Image -from pathlib import Path from collections import defaultdict from typing import Any, Dict, Callable, Optional
37-67: Addstackleveltowarnings.warnand fix minor style issue.The warning should include
stacklevel=2so the warning points to the caller rather than this function. Also, there's an extra space beforedelon line 48.if tag_mapped not in data: data[tag_mapped] = data[tag] - del data[tag] + del data[tag] else: warnings.warn( - f"Trying to map {tag} to {tag_mapped}, but {tag_mapped} already exists in the raw data. Mapping is not carried out." + f"Trying to map {tag} to {tag_mapped}, but {tag_mapped} already exists in the raw data. Mapping is not carried out.", + stacklevel=2, )
92-92: Unused loop variable.The loop variable
iis not used. Rename to_to indicate it's intentionally unused.- for i, part in enumerate(parts): + for _, part in enumerate(parts):Or simply iterate without enumerate:
- for i, part in enumerate(parts): + for part in parts:nemo_rl/data/multimodal_utils.py (2)
285-285: Avoid usingassertfor runtime validation.Using
assertfor input validation can be disabled with-Oflag. Use an explicitifcheck withValueErrorinstead.- assert "audio" in multimodal_load_kwargs and "sampling_rate" in multimodal_load_kwargs["audio"] + if "audio" not in multimodal_load_kwargs or "sampling_rate" not in multimodal_load_kwargs["audio"]: + raise ValueError("Audio loading requires 'sampling_rate' in multimodal_load_kwargs['audio']")
297-297: Variable name shadows module-level constant.
load_video_kwargsshadows the module-level constant defined on line 28. Use a different variable name to avoid confusion.- load_video_kwargs = multimodal_load_kwargs["video"] if "video" in multimodal_load_kwargs else {} + video_kwargs = multimodal_load_kwargs.get("video", {}) # seems decord backend loads video faster with multithread ffmpeg and it is easier to install - loaded_media["video"].append(load_video(vid, backend="decord", **load_video_kwargs)[0]) + loaded_media["video"].append(load_video(vid, backend="decord", **video_kwargs)[0])nemo_rl/data/datasets/response_datasets/general_conversations_dataset.py (3)
26-26: Remove unused debug flag.
_DEBUG=Trueis defined but never used in this file. Remove it or implement the intended debugging behavior.-_DEBUG=True - - class GeneralConversationsJsonlDataset:
17-17: Remove unused importLiteral.-from typing import Any, Optional, Literal +from typing import Any, Optional
30-34: Fix typos in docstring.- """Loads general conversation datasets that have the json (manifest) files and media files in separate files (jsonl datasets). - Each sample can be single/multi-turn converstaions with multiple modalities. + """Loads general conversation datasets that have the JSON (manifest) files and media files in separate files (JSONL datasets). + Each sample can be single/multi-turn conversations with multiple modalities. Each modality can have one or more number of media objects. - There is no requiement of where the media tag (e.g. '<sound>') should appear in the conversations. + There is no requirement of where the media tag (e.g. '<sound>') should appear in the conversations.
📜 Review details
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Pro
⛔ Files ignored due to path filters (1)
uv.lockis excluded by!**/*.lock
📒 Files selected for processing (9)
examples/configs/sft_avlm.yaml(1 hunks)examples/run_sft.py(3 hunks)nemo_rl/algorithms/utils.py(1 hunks)nemo_rl/data/datasets/response_datasets/__init__.py(2 hunks)nemo_rl/data/datasets/response_datasets/conversation_base.py(1 hunks)nemo_rl/data/datasets/response_datasets/general_conversations_dataset.py(1 hunks)nemo_rl/data/llm_message_utils.py(4 hunks)nemo_rl/data/multimodal_utils.py(3 hunks)pyproject.toml(2 hunks)
🧰 Additional context used
📓 Path-based instructions (4)
**/*.py
📄 CodeRabbit inference engine (CODING_GUIDELINES.md)
**/*.py: Conform code to Python 3.12+
Indent code with 4 spaces. Do not use tabs
Use snake_case for file names
Use PascalCase for class names
Use snake_case for function and method names
Use snake_case for local variables
Prefix variable names that start with a number with 'k' (e.g., k_99th_percentile)
Use upper snake_case with 'G' prefix for global variables (e.g., G_MY_GLOBAL)
Use upper snake_case for constants
Avoid shadowing variables declared in an outer scope
Initialize all externally visible members of a class in the constructor
Prefer docstrings over comments for interfaces that may be used outside a file
Reserve comments for code within a function or interfaces that are local to a file
If a piece of code is commented out, include a comment describing its usage and why it's commented out. Remove debug comments before merging
Use Google style docstrings for classes and functions in Python, which can be parsed by Sphinx
Avoid using reflection when functionality can be easily achieved without reflection
When using try-except blocks, limit the except clause to the smallest set of specific errors possible
When using try-except blocks for duck-typing, keep the body of the try as small as possible and use the else block for logic
YAML is the single source of truth for configuration defaults. Do not set non-None defaults in code for configuration values
For required configuration attributes, access config directly and expect presence (e.g., policy_cfg['precision']) without hidden defaults
Use typing.NotRequired to mark optional attributes in TypedDict for configuration
When adding a new config key to a TypedDict subclass, document the key's purpose, valid values/types, and recommended default, and reflect the default in exemplar YAMLs under examples/configs/*.yaml
Follow the Google Python Style Guide for Python code
Files:
nemo_rl/algorithms/utils.pynemo_rl/data/llm_message_utils.pynemo_rl/data/datasets/response_datasets/conversation_base.pyexamples/run_sft.pynemo_rl/data/datasets/response_datasets/__init__.pynemo_rl/data/datasets/response_datasets/general_conversations_dataset.pynemo_rl/data/multimodal_utils.py
nemo_rl/**/*.py
📄 CodeRabbit inference engine (CODING_GUIDELINES.md)
For any source file under nemo_rl/*.py that defines a class or function decorated with @ray.remote, add a coverage pragma (# pragma: no cover) because these run in separate Ray processes
Files:
nemo_rl/algorithms/utils.pynemo_rl/data/llm_message_utils.pynemo_rl/data/datasets/response_datasets/conversation_base.pynemo_rl/data/datasets/response_datasets/__init__.pynemo_rl/data/datasets/response_datasets/general_conversations_dataset.pynemo_rl/data/multimodal_utils.py
!(**/tests/**|**/test_*.py|**/test_*.sh)
📄 CodeRabbit inference engine (CODING_GUIDELINES.md)
Add the NVIDIA copyright header to all Python files and shell scripts (excluding tests). The header should include the current year
Files:
nemo_rl/algorithms/utils.pyexamples/configs/sft_avlm.yamlnemo_rl/data/llm_message_utils.pynemo_rl/data/datasets/response_datasets/conversation_base.pyexamples/run_sft.pypyproject.tomlnemo_rl/data/datasets/response_datasets/__init__.pynemo_rl/data/datasets/response_datasets/general_conversations_dataset.pynemo_rl/data/multimodal_utils.py
**/*.{py,sh}
📄 CodeRabbit inference engine (CODING_GUIDELINES.md)
The NVIDIA copyright header should appear at the top of all Python files and shell scripts (excluding tests)
Files:
nemo_rl/algorithms/utils.pynemo_rl/data/llm_message_utils.pynemo_rl/data/datasets/response_datasets/conversation_base.pyexamples/run_sft.pynemo_rl/data/datasets/response_datasets/__init__.pynemo_rl/data/datasets/response_datasets/general_conversations_dataset.pynemo_rl/data/multimodal_utils.py
🧬 Code graph analysis (4)
nemo_rl/data/datasets/response_datasets/conversation_base.py (1)
nemo_rl/data/datasets/response_datasets/general_conversations_dataset.py (1)
process_message_fragment(107-118)
examples/run_sft.py (1)
nemo_rl/data/datasets/response_datasets/clevr.py (1)
format_clevr_cogent_dataset(33-62)
nemo_rl/data/datasets/response_datasets/__init__.py (2)
nemo_rl/data/datasets/response_datasets/general_conversations_dataset.py (1)
GeneralConversationsJsonlDataset(29-154)nemo_rl/data/datasets/utils.py (1)
get_extra_kwargs(86-102)
nemo_rl/data/datasets/response_datasets/general_conversations_dataset.py (3)
nemo_rl/data/datasets/utils.py (1)
load_dataset_from_path(64-83)nemo_rl/data/interfaces.py (1)
TaskDataSpec(53-86)nemo_rl/data/datasets/response_datasets/conversation_base.py (2)
convert_metadata(38-67)conversation_process_message(70-131)
🪛 Ruff (0.14.8)
nemo_rl/algorithms/utils.py
327-327: No explicit stacklevel keyword argument found
Set stacklevel=2
(B028)
336-336: No explicit stacklevel keyword argument found
Set stacklevel=2
(B028)
344-344: No explicit stacklevel keyword argument found
Set stacklevel=2
(B028)
nemo_rl/data/datasets/response_datasets/conversation_base.py
50-50: No explicit stacklevel keyword argument found
Set stacklevel=2
(B028)
74-74: Do not use mutable data structures for argument defaults
Replace with None; initialize within function
(B006)
77-77: Do not use mutable data structures for argument defaults
Replace with None; initialize within function
(B006)
78-78: Unused function argument: tags_mapping_sample_to_allowed
(ARG001)
92-92: Loop control variable i not used within loop body
Rename unused i to _i
(B007)
117-117: Avoid specifying long messages outside the exception class
(TRY003)
nemo_rl/data/datasets/response_datasets/__init__.py
119-121: Avoid specifying long messages outside the exception class
(TRY003)
nemo_rl/data/multimodal_utils.py
271-271: Do not use mutable data structures for argument defaults
Replace with None; initialize within function
(B006)
288-288: Do not use bare except
(E722)
291-291: Undefined name tokenizer
(F821)
291-291: Undefined name key
(F821)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (2)
- GitHub Check: Post submodule check comment / Comment on PR
- GitHub Check: Post automodel integration comment / Comment on PR
🔇 Additional comments (5)
nemo_rl/algorithms/utils.py (1)
323-347: Multimodal processor configuration overrides look good.The runtime override approach for audio sampling rate and video fps/num_frames is a reasonable way to synchronize processor settings with configuration without modifying the underlying model configs. The warnings appropriately inform users of the overrides.
nemo_rl/data/datasets/response_datasets/__init__.py (1)
117-135: Implementation follows existing patterns correctly.The new
GeneralConversationsJsonlDatasetbranch properly validates required config, extracts optional kwargs, and instantiates the dataset class. This mirrors the existingResponseDatasetpattern.nemo_rl/data/multimodal_utils.py (2)
27-28: LGTM!Dynamic extraction of function parameter names using
inspect.signatureis a clean approach for forward compatibility.
33-65: LGTM!Well-organized media tag definitions with appropriate reverse mappings and regex pattern compilation. The warning comment about avoiding cyclic mappings is helpful.
nemo_rl/data/datasets/response_datasets/general_conversations_dataset.py (1)
106-118: LGTM!The
process_message_fragmentmethod correctly handles media path resolution with fallback to media directory, and properly splits compound tags like"video-audio"into separate fragment entries.
| train_data_path: /lustre/fsw/portfolios/llmservice/users/yuanhangs/codes/megatron-lm-omcat/megatron-lm-vlm2/examples/multimodal/avlm/test/datasets/miradata_bat1_filtered_vision_5min_10000.jsonl | ||
| val_data_path: /lustre/fsw/portfolios/llmservice/users/yuanhangs/codes/megatron-lm-omcat/megatron-lm-vlm2/examples/multimodal/avlm/test/datasets/miradata_bat1_filtered_vision_5min_100.jsonl | ||
| train_media_data_dir: /lustre/fsw/portfolios/edgeai/projects/edgeai_riva_rivamlops/data/videomme/MiraData/video/batch1/5min | ||
| val_media_data_dir: /lustre/fsw/portfolios/edgeai/projects/edgeai_riva_rivamlops/data/videomme/MiraData/video/batch1/5min |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Replace hardcoded user-specific paths with placeholders.
The example config contains absolute paths to user-specific directories on an internal filesystem (/lustre/fsw/portfolios/llmservice/users/yuanhangs/...). These paths won't work for other users and expose internal infrastructure details.
Consider using placeholder paths that users can customize:
data:
dataset_name: GeneralConversationsJsonlDataset
- train_data_path: /lustre/fsw/portfolios/llmservice/users/yuanhangs/codes/megatron-lm-omcat/megatron-lm-vlm2/examples/multimodal/avlm/test/datasets/miradata_bat1_filtered_vision_5min_10000.jsonl
- val_data_path: /lustre/fsw/portfolios/llmservice/users/yuanhangs/codes/megatron-lm-omcat/megatron-lm-vlm2/examples/multimodal/avlm/test/datasets/miradata_bat1_filtered_vision_5min_100.jsonl
- train_media_data_dir: /lustre/fsw/portfolios/edgeai/projects/edgeai_riva_rivamlops/data/videomme/MiraData/video/batch1/5min
- val_media_data_dir: /lustre/fsw/portfolios/edgeai/projects/edgeai_riva_rivamlops/data/videomme/MiraData/video/batch1/5min
+ train_data_path: /path/to/train_data.jsonl # Replace with your training data path
+ val_data_path: /path/to/val_data.jsonl # Replace with your validation data path
+ train_media_data_dir: /path/to/train_media/ # Directory containing training media files
+ val_media_data_dir: /path/to/val_media/ # Directory containing validation media files📝 Committable suggestion
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
| train_data_path: /lustre/fsw/portfolios/llmservice/users/yuanhangs/codes/megatron-lm-omcat/megatron-lm-vlm2/examples/multimodal/avlm/test/datasets/miradata_bat1_filtered_vision_5min_10000.jsonl | |
| val_data_path: /lustre/fsw/portfolios/llmservice/users/yuanhangs/codes/megatron-lm-omcat/megatron-lm-vlm2/examples/multimodal/avlm/test/datasets/miradata_bat1_filtered_vision_5min_100.jsonl | |
| train_media_data_dir: /lustre/fsw/portfolios/edgeai/projects/edgeai_riva_rivamlops/data/videomme/MiraData/video/batch1/5min | |
| val_media_data_dir: /lustre/fsw/portfolios/edgeai/projects/edgeai_riva_rivamlops/data/videomme/MiraData/video/batch1/5min | |
| data: | |
| dataset_name: GeneralConversationsJsonlDataset | |
| train_data_path: /path/to/train_data.jsonl # Replace with your training data path | |
| val_data_path: /path/to/val_data.jsonl # Replace with your validation data path | |
| train_media_data_dir: /path/to/train_media/ # Directory containing training media files | |
| val_media_data_dir: /path/to/val_media/ # Directory containing validation media files |
🤖 Prompt for AI Agents
In examples/configs/sft_avlm.yaml around lines 11 to 14, the file contains
hardcoded, user-specific absolute paths that expose internal infrastructure and
will fail for other users; replace those literal paths with configurable
placeholders (e.g., ${DATA_DIR}/train.jsonl, ${DATA_DIR}/val.jsonl,
${MEDIA_DIR}/train_media, ${MEDIA_DIR}/val_media) or relative paths and update
any docs/README to instruct users to set the environment variables or edit the
placeholders to point to their local dataset and media directories before
running.
| def conversation_process_message( | ||
| metadata: Dict[str, Any], | ||
| message: Dict[str, str], | ||
| media_index: dict, | ||
| raw: Dict[str, Any] = {}, | ||
| allow_empty_text: bool = False, | ||
| check_if_media_file_exist: bool = True, | ||
| tried_default_extensions: set = set(), | ||
| tags_mapping_sample_to_allowed: Dict[str, str] = multimodal_utils.media_tags_to_allowed, | ||
| process_message_fragment: Callable = lambda tag, fragment: [{tag: fragment}], | ||
| ) -> list[Dict[str, Any]]: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Critical: Mutable default arguments will cause bugs.
raw={} and tried_default_extensions=set() are mutable defaults. The tried_default_extensions set is modified on line 106, meaning the same set instance will be shared across all calls that don't explicitly pass this argument, causing incorrect behavior.
def conversation_process_message(
metadata: Dict[str, Any],
message: Dict[str, str],
media_index: dict,
- raw: Dict[str, Any] = {},
+ raw: Optional[Dict[str, Any]] = None,
allow_empty_text: bool = False,
check_if_media_file_exist: bool = True,
- tried_default_extensions: set = set(),
+ tried_default_extensions: Optional[set] = None,
tags_mapping_sample_to_allowed: Dict[str, str] = multimodal_utils.media_tags_to_allowed,
process_message_fragment: Callable = lambda tag, fragment: [{tag: fragment}],
) -> list[Dict[str, Any]]:
"""
Args:
raw: dictionary with all webdataset compliant keys of a sample.
Emtpy for jsonl dataset, non-empty otherwise
metadata:
"""
+ if raw is None:
+ raw = {}
+ if tried_default_extensions is None:
+ tried_default_extensions = set()
fragments = [] 🧰 Tools
🪛 Ruff (0.14.8)
74-74: Do not use mutable data structures for argument defaults
Replace with None; initialize within function
(B006)
77-77: Do not use mutable data structures for argument defaults
Replace with None; initialize within function
(B006)
78-78: Unused function argument: tags_mapping_sample_to_allowed
(ARG001)
🤖 Prompt for AI Agents
In nemo_rl/data/datasets/response_datasets/conversation_base.py around lines 70
to 80, the function uses mutable default arguments raw={} and
tried_default_extensions=set(), which leads to shared state across calls
(tried_default_extensions is mutated on line 106). Change both defaults to None
in the signature, then inside the function initialize raw = {} if raw is None
and tried_default_extensions = set() if tried_default_extensions is None so each
call gets a fresh object; ensure any downstream logic continues to use these
local variables.
| allow_empty_text: bool = False, | ||
| check_if_media_file_exist: bool = True, | ||
| tried_default_extensions: set = set(), | ||
| tags_mapping_sample_to_allowed: Dict[str, str] = multimodal_utils.media_tags_to_allowed, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
🛠️ Refactor suggestion | 🟠 Major
Unused parameter tags_mapping_sample_to_allowed.
This parameter is declared but never used in the function body. Either remove it or implement the intended functionality.
🧰 Tools
🪛 Ruff (0.14.8)
78-78: Unused function argument: tags_mapping_sample_to_allowed
(ARG001)
🤖 Prompt for AI Agents
In nemo_rl/data/datasets/response_datasets/conversation_base.py around line 78,
the parameter tags_mapping_sample_to_allowed is declared but never used; either
remove it from the function signature and update any callers, or implement its
intended behavior by applying the mapping to the sample's tags (e.g., map each
sample tag via tags_mapping_sample_to_allowed to an allowed tag and use that
result wherever allowed tags are computed/validated), and then adjust type hints
and add/modify tests to cover the mapping behavior.
| if not isinstance(metadata[tag], list): | ||
| metadata[tag] = [metadata[tag]] |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
🧩 Analysis chain
🏁 Script executed:
cat -n nemo_rl/data/datasets/response_datasets/conversation_base.py | sed -n '80,110p'Repository: NVIDIA-NeMo/RL
Length of output: 1585
🏁 Script executed:
cat -n nemo_rl/data/datasets/response_datasets/conversation_base.py | sed -n '65,95p'Repository: NVIDIA-NeMo/RL
Length of output: 1333
🏁 Script executed:
rg -n "conversation_process_message" nemo_rl/data/datasets/response_datasets/Repository: NVIDIA-NeMo/RL
Length of output: 306
🏁 Script executed:
cat -n nemo_rl/data/datasets/response_datasets/general_conversations_dataset.py | sed -n '130,160p'Repository: NVIDIA-NeMo/RL
Length of output: 1302
🏁 Script executed:
cat -n nemo_rl/data/datasets/response_datasets/conversation_base.py | sed -n '1,70p'Repository: NVIDIA-NeMo/RL
Length of output: 2819
🏁 Script executed:
rg -n "media_tags" nemo_rl/data/multimodal_utils.py | head -20Repository: NVIDIA-NeMo/RL
Length of output: 277
🏁 Script executed:
sed -n '33,70p' nemo_rl/data/multimodal_utils.pyRepository: NVIDIA-NeMo/RL
Length of output: 1062
Add validation to prevent KeyError when media tag is in message but missing from metadata.
At line 96, metadata[tag] is accessed without checking if tag exists in the metadata dictionary. If a message contains a media tag like <image> but the metadata dict lacks the corresponding key, a KeyError will be raised. Add a check before accessing metadata[tag] to handle missing tags gracefully (either skip the tag, raise a clear error, or provide a default value).
🤖 Prompt for AI Agents
In nemo_rl/data/datasets/response_datasets/conversation_base.py around lines
96-97, the code accesses metadata[tag] without verifying tag exists which can
raise KeyError; add a presence check before accessing metadata[tag] (e.g., if
tag not in metadata: either continue to skip processing that tag, or set a
sensible default like metadata[tag]=[] or raise a clear ValueError with a
descriptive message), then proceed with the existing list-wrapping logic.
| self.formatted_ds = { | ||
| "train": train_ds, | ||
| "validation": val_ds, | ||
| } | ||
|
|
||
| self.datum_preprocessor = { | ||
| "train": partial(self._datum_preprocessor, media_directory=train_media_data_dir), | ||
| "val": partial(self._datum_preprocessor, media_directory=val_media_data_dir) | ||
| } |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Inconsistent dictionary keys: "validation" vs "val".
formatted_ds uses "validation" as the key, but datum_preprocessor uses "val". This inconsistency could cause lookup failures when accessing preprocessors by split name.
self.datum_preprocessor = {
"train": partial(self._datum_preprocessor, media_directory=train_media_data_dir),
- "val": partial(self._datum_preprocessor, media_directory=val_media_data_dir)
+ "validation": partial(self._datum_preprocessor, media_directory=val_media_data_dir)
}🤖 Prompt for AI Agents
In nemo_rl/data/datasets/response_datasets/general_conversations_dataset.py
around lines 94 to 102, the split keys are inconsistent: formatted_ds uses
"validation" while datum_preprocessor uses "val", which will cause lookups to
fail; fix by making the keys identical (e.g., change the "val" key in
datum_preprocessor to "validation" or rename "validation" to "val" so both dicts
use the same split name), or preferably derive both from a single split-name
constant; update only the dictionary keys to match and ensure any downstream
access uses the chosen split name.
| def load_media_from_message( | ||
| message: dict[str, Any], | ||
| processor = None, | ||
| multimodal_load_kwargs: dict[str, dict[str, Any]] = {}, | ||
| ) -> dict[list[Any]]: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Mutable default argument and malformed type hint.
The multimodal_load_kwargs parameter has a mutable default {}. Also, the return type hint dict[list[Any]] is malformed; it should be dict[str, list[Any]].
def load_media_from_message(
message: dict[str, Any],
processor = None,
- multimodal_load_kwargs: dict[str, dict[str, Any]] = {},
-) -> dict[list[Any]]:
+ multimodal_load_kwargs: Optional[dict[str, dict[str, Any]]] = None,
+) -> dict[str, list[Any]]:
loaded_media = defaultdict(list)
media_in_message = get_media_from_message(message)
- if not multimodal_load_kwargs and processor is not None:
+ if multimodal_load_kwargs is None:
+ multimodal_load_kwargs = {}
+ if not multimodal_load_kwargs and processor is not None:
multimodal_load_kwargs = get_multimodal_default_settings_from_processor(processor)🧰 Tools
🪛 Ruff (0.14.8)
271-271: Do not use mutable data structures for argument defaults
Replace with None; initialize within function
(B006)
🤖 Prompt for AI Agents
In nemo_rl/data/multimodal_utils.py around lines 268 to 272, change the mutable
default and malformed return hint: replace multimodal_load_kwargs: dict[str,
dict[str, Any]] = {} with multimodal_load_kwargs: Optional[dict[str, dict[str,
Any]]] = None (import Optional if needed) and at function start set
multimodal_load_kwargs = multimodal_load_kwargs or {} to avoid a shared mutable
default; also correct the return type hint from dict[list[Any]] to dict[str,
list[Any]].
| if "audio" in media_in_message: | ||
| for aud in media_in_message["audio"]: | ||
| if isinstance(aud, str): | ||
| print(multimodal_load_kwargs) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Remove debug print statement.
This print(multimodal_load_kwargs) appears to be a debug artifact and should be removed before merging.
- print(multimodal_load_kwargs)📝 Committable suggestion
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
| print(multimodal_load_kwargs) |
🤖 Prompt for AI Agents
In nemo_rl/data/multimodal_utils.py around line 284, remove the stray debug
print statement "print(multimodal_load_kwargs)"; simply delete that line so the
function no longer emits debug output, and run tests/lint to ensure no leftover
unused variables or formatting issues.
| except: | ||
| # use decord | ||
| loaded_audio = decord.AudioReader(aud, sample_rate=multimodal_load_kwargs["audio"]["sampling_rate"], mono=True) | ||
| loaded_media["audio"].append(loaded_audio[:].asnumpy()[get_dim_to_pack_along(tokenizer, key)]) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Critical: Bare except clause and undefined variables.
- The bare
except:catches all exceptions includingKeyboardInterruptandSystemExit, masking errors. Specify the expected exception type. - Line 291 references
tokenizerandkeywhich are undefined in this scope—this will cause aNameErrorat runtime.
- try:
- loaded_media["audio"].append(load_audio(aud, **multimodal_load_kwargs["audio"]))
- except:
- # use decord
- loaded_audio = decord.AudioReader(aud, sample_rate=multimodal_load_kwargs["audio"]["sampling_rate"], mono=True)
- loaded_media["audio"].append(loaded_audio[:].asnumpy()[get_dim_to_pack_along(tokenizer, key)])
+ try:
+ loaded_media["audio"].append(load_audio(aud, **multimodal_load_kwargs["audio"]))
+ except Exception as e:
+ # Fallback to decord for unsupported formats
+ loaded_audio = decord.AudioReader(aud, sample_rate=multimodal_load_kwargs["audio"]["sampling_rate"], mono=True)
+ loaded_media["audio"].append(loaded_audio[:].asnumpy()[0]) # Fix: use appropriate indexThe processor variable should likely be used instead of tokenizer, and key should be "audio" or similar. Please verify the intended logic.
Committable suggestion skipped: line range outside the PR's diff.
🧰 Tools
🪛 Ruff (0.14.8)
288-288: Do not use bare except
(E722)
291-291: Undefined name tokenizer
(F821)
291-291: Undefined name key
(F821)
🤖 Prompt for AI Agents
In nemo_rl/data/multimodal_utils.py around lines 288–291, replace the bare
except with a specific exception handler (e.g., except (RuntimeError,
ValueError, OSError):) so we don't swallow KeyboardInterrupt/SystemExit, and
inside the handler use the correct variables — replace the undefined tokenizer
with processor and use the explicit key "audio" (or the variable that holds the
modality string) when indexing, making sure processor is available in scope;
re-raise unexpected exceptions if needed or let them propagate.
| transformer-engine-torch = [{ requirement = "torch", match-runtime = true }] | ||
| mamba-ssm = [{ requirement = "torch", match-runtime = true }] | ||
| causal-conv1d = [{ requirement = "torch", match-runtime = true }] | ||
| causal-conv1d = [{ requirement = "torch", match-runtime = true }, "setuptools.build_meta"] |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
🧩 Analysis chain
🏁 Script executed:
# First, let's locate and examine the pyproject.toml file
find . -name "pyproject.toml" -type fRepository: NVIDIA-NeMo/RL
Length of output: 257
🏁 Script executed:
# Once found, let's look at the relevant lines around 232 and context
head -250 pyproject.toml | tail -50Repository: NVIDIA-NeMo/RL
Length of output: 2448
🏁 Script executed:
# Also check around line 112 for the build dependency section
head -120 pyproject.toml | tail -30Repository: NVIDIA-NeMo/RL
Length of output: 1285
🏁 Script executed:
# Let's see the entire dependency-groups section and extra-build-dependencies
grep -n "extra-build-dependencies\|dependency-groups\|causal-conv1d\|setuptools" pyproject.toml | head -20Repository: NVIDIA-NeMo/RL
Length of output: 793
🌐 Web query:
uv extra-build-dependencies format package name validation
💡 Result:
Short answer — format and validation
-
Keys in tool.uv.extra-build-dependencies are package/project names (the package you want to augment). Values are either:
- a list of requirement strings (PEP 508 requirement specifiers), e.g. ["setuptools", "torch==2.6.0"], or
- a list of objects with fields like requirement (a PEP‑508 requirement string) and match-runtime (bool), e.g. [{ requirement = "torch", match-runtime = true }]. [1][2]
-
The requirement strings follow standard pip/PEP‑508 requirement syntax (project name + optional version specifiers, extras, markers, direct URLs, etc.). uv parses/validates these using a PEP‑508-compatible requirement parser, so invalid PEP‑508 requirement strings or invalid package identifiers will be rejected. [2][3]
-
Practical examples and behavior (matching runtime, rebuild triggers, experimental flag) are documented in the uv config docs and the project’s examples/issues. If you get a validation error, ensure your package names/requirement strings conform to PEP‑508 (or use the exploded table style shown in the docs). [1][2][4]
Sources:
- uv docs — extra-build-dependencies and examples. [1]
- uv docs — configuring projects / augmenting build dependencies (match-runtime examples). [2]
- pip / PEP‑508 requirement specifier reference (requirement string format). [3]
- uv GitHub issues / implementation notes showing PEP‑508 parsing/validation behavior. [4]
🌐 Web query:
causal-conv1d build requirements setuptools
💡 Result:
Short answer — to build causal-conv1d from source you need:
- Python >= 3.9. [1]
- setuptools (and wheel) available for building a Python wheel. [2]
- A compatible PyTorch installation (headers) — the extension builds against PyTorch/C++/CUDA. [1]
- CUDA toolkit (nvcc) and a matching CUDA driver for the prebuilt CUDA kernels (or ROCm toolchain for AMD/ROCm builds; the project documents a ROCm 6.0 patch). [1][3]
- A C/C++ compiler toolchain (gcc/clang on Linux; MSVC on Windows) to compile the extension. [1]
References:
- PyPI project page (requirements, Python version, ROCm note). [1]
- Repository setup/build context and reports of setuptools being required during install. [2][3]
🏁 Script executed:
# Let's verify if setuptools.build_meta would be accepted by PEP 508 parsers
# by checking the causal-conv1d repository's own build configuration
curl -s https://raw.githubusercontent.com/Dao-AILab/causal-conv1d/v1.5.0.post8/pyproject.toml 2>/dev/null | head -50Repository: NVIDIA-NeMo/RL
Length of output: 72
🏁 Script executed:
# Also check if uv has any special handling or if this causes issues
# by examining uv's requirement parsing behavior
python3 << 'EOF'
# Simulate PEP 508 parsing to verify if "setuptools.build_meta" is valid
try:
# Try using packaging library for PEP 508 parsing
import subprocess
result = subprocess.run(['pip', 'list'], capture_output=True, text=True)
if 'packaging' in result.stdout:
from packaging.requirements import Requirement
try:
req = Requirement("setuptools.build_meta")
print(f"✓ 'setuptools.build_meta' parsed as valid: {req}")
except Exception as e:
print(f"✗ 'setuptools.build_meta' failed PEP 508 parsing: {e}")
else:
print("packaging not installed, installing...")
subprocess.run(['pip', 'install', 'packaging'], capture_output=True)
from packaging.requirements import Requirement
try:
req = Requirement("setuptools.build_meta")
print(f"✓ 'setuptools.build_meta' parsed as valid: {req}")
except Exception as e:
print(f"✗ 'setuptools.build_meta' failed PEP 508 parsing: {e}")
except Exception as e:
print(f"Error: {e}")
EOFRepository: NVIDIA-NeMo/RL
Length of output: 122
🏁 Script executed:
# Check if "setuptools.build_meta" is actually a package on PyPI
python3 << 'EOF'
import urllib.request
import json
try:
url = "https://pypi.org/pypi/setuptools.build_meta/json"
with urllib.request.urlopen(url, timeout=5) as response:
data = json.loads(response.read())
print(f"✓ Package 'setuptools.build_meta' found on PyPI")
print(f" Latest version: {data['info']['version']}")
except urllib.error.HTTPError as e:
if e.code == 404:
print(f"✗ Package 'setuptools.build_meta' NOT found on PyPI (404)")
else:
print(f"✗ HTTP Error {e.code}: {e}")
except Exception as e:
print(f"✗ Error checking PyPI: {e}")
EOFRepository: NVIDIA-NeMo/RL
Length of output: 208
🏁 Script executed:
# Also verify what setuptools actually provides
python3 << 'EOF'
try:
import setuptools
# Check if setuptools.build_meta exists as a module
try:
import setuptools.build_meta
print("✓ setuptools.build_meta module exists within setuptools package")
print(f" Location: {setuptools.build_meta.__file__}")
except ImportError as e:
print(f"✗ setuptools.build_meta module NOT found: {e}")
except ImportError:
print("setuptools not installed in test environment")
EOFRepository: NVIDIA-NeMo/RL
Length of output: 189
🏁 Script executed:
# Check the causal-conv1d's actual build configuration by looking at its setup
git ls-files | grep -E "causal.*conv.*1d" || echo "No causal-conv1d files in repo"Repository: NVIDIA-NeMo/RL
Length of output: 88
🏁 Script executed:
# Test if uv would actually fail or succeed with setuptools.build_meta
# by checking if uv tries to resolve it as a package
python3 << 'EOF'
# PEP 508 parsing accepts it, but PyPI package resolution won't find it
# Let's check if setuptools.build_meta is a real PyPI package by normalizing the name
from packaging.utils import canonicalize_name
pkg_name = "setuptools.build_meta"
canonical = canonicalize_name(pkg_name)
print(f"Original name: {pkg_name}")
print(f"Canonical name: {canonical}")
print(f"Would uv try to find a package named '{canonical}' on PyPI: Yes")
print(f"\nAnalysis:")
print(f"- 'setuptools' is a real package on PyPI")
print(f"- 'setuptools.build_meta' is NOT a package on PyPI")
print(f"- It's a module within the setuptools package")
print(f"- Using it as a requirement string is misleading/incorrect")
EOFRepository: NVIDIA-NeMo/RL
Length of output: 414
🏁 Script executed:
# Let's see if there are any comments or documentation in pyproject.toml explaining this
grep -B5 -A2 "setuptools.build_meta" pyproject.tomlRepository: NVIDIA-NeMo/RL
Length of output: 632
Remove or correct "setuptools.build_meta" in extra-build-dependencies.
The string "setuptools.build_meta" is not a valid PyPI package name. While PEP 508 parsing accepts it syntactically, uv will normalize it to "setuptools-build-meta" and fail to resolve it on PyPI (the package doesn't exist). This will cause causal-conv1d builds to fail.
setuptools.build_meta is a build backend module within the setuptools package, not a package itself. Since setuptools is already listed in the build dependency group (line 112) and is the correct build dependency for causal-conv1d, either remove this entry or replace it with "setuptools":
-causal-conv1d = [{ requirement = "torch", match-runtime = true }, "setuptools.build_meta"]
+causal-conv1d = [{ requirement = "torch", match-runtime = true }]📝 Committable suggestion
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
| causal-conv1d = [{ requirement = "torch", match-runtime = true }, "setuptools.build_meta"] | |
| causal-conv1d = [{ requirement = "torch", match-runtime = true }] |
🤖 Prompt for AI Agents
In pyproject.toml around line 232, the extras build-dependency entry includes
the invalid PyPI name "setuptools.build_meta"; remove that token or replace it
with the actual package name "setuptools" so the extras list resolves on PyPI
(e.g. change the causal-conv1d extras array to use "setuptools" or simply drop
the "setuptools.build_meta" item).
terrykong
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@yfw to review
What does this PR do ?
Add video and audio dataloadling support for HF models
Usage
Additional Information
Major changes:
Summary by CodeRabbit
New Features
Enhancements
Dependencies
✏️ Tip: You can customize this high-level summary in your review settings.