feat: Omni dataloader for HF models #1639

yuanhangsu1986 · 2025-12-15T22:34:52Z

What does this PR do ?

Add video and audio dataloadling support for HF models

Usage

export NRL_FORCE_REBUILD_VENVS=true
uv venv
uv run python ./examples/run_vlm_sft.py cluster.gpus_per_node=8 --config ./examples/configs/sft_avlm.yaml

Additional Information

Major changes:

nemo_rl/data/datasets/response_datasets/general_conversations_dataset.py: base class GeneralConversationsJsonlDataset for loading conversation-based data structure which is described in the comments.
nemo_rl/data/datasets/response_datasets/conversation_base.py: base class and functions for conversation-based data structure which is described in general_conversations_dataset.py
nemo_rl/data/multimodal_utils.py: functions for loading omni data
nemo_rl/data/llm_message_utils.py: loading omni data for HF-based llm model
examples/configs/sft_avlm.yaml: example for loading video conversation-based dataset for Qwen3-VL with 16 frames per video.

Summary by CodeRabbit

New Features
- Added support for general conversation datasets with multimodal content (images, audio, video).
- Introduced runtime processor configuration for video and audio settings (fps, frames, sampling rate).
- Added new configuration template for video-aware fine-tuning.
Enhancements
- Separated train and validation data preprocessing for improved flexibility.
Dependencies
- Added decord library for video processing support.

_{✏️ Tip: You can customize this high-level summary in your review settings.}

…-Edge-Omni

…ttings in the config file

coderabbitai · 2025-12-15T22:40:57Z

📝 Walkthrough

Walkthrough

Adds support for multimodal (audio, video, image) conversation datasets through a new GeneralConversationsJsonlDataset loader, multimodal utilities for media extraction and loading, split-specific data preprocessors, dynamic processor configuration overrides, and an example AVLM configuration file.

Changes

Cohort / File(s)	Summary
Configuration `examples/configs/sft_avlm.yaml`	New YAML configuration for audio-visual language model training with video tokenizer (16 frames) and GeneralConversationsJsonlDataset.
Training Script `examples/run_sft.py`	Introduces separate train/val datum preprocessors; supports per-split preprocessor selection from dataset attributes or shared defaults for specific datasets (e.g., clevr_cogent).
Processor Configuration `nemo_rl/algorithms/utils.py`	Adds runtime overrides in `get_tokenizer` for multimodal processor settings (audio sampling_rate, video fps/num_frames) when values differ from configuration; includes validation checks and warnings.
Dataset System – Base `nemo_rl/data/datasets/response_datasets/conversation_base.py`	Introduces conversation metadata normalization and message fragment processing with multimodal tag support; utilities handle media file resolution, extension fallback, and validation.
Dataset System – Loader `nemo_rl/data/datasets/response_datasets/general_conversations_dataset.py`	New `GeneralConversationsJsonlDataset` class loads JSONL conversation data with optional train/val splits and media directories; converts conversations to OpenAI-style messages with media reference resolution.
Dataset System – Registration `nemo_rl/data/datasets/response_datasets/__init__.py`	Registers `GeneralConversationsJsonlDataset` in `load_response_dataset`; validates required paths and constructs dataset with optional media directories and split specifications.
Multimodal Utilities `nemo_rl/data/multimodal_utils.py`	Adds multimodal constants (media tags, extensions, patterns), processor introspection utilities, message-level media extraction, and cross-format media loaders (images via PIL, audio via transformers/decord fallback, video via transformers/decord fallback).
Message Processing `nemo_rl/data/llm_message_utils.py`	Refactors `get_formatted_message_log` to use generalized multimodal media loading via `load_media_from_message`; dynamically constructs tokenizer media kwargs instead of fixed image parameters; updates tensor packing dimension selection for multimodal data.
Dependencies `pyproject.toml`	Adds `decord` dependency for video processing; updates `causal-conv1d` build metadata.

Sequence Diagram

sequenceDiagram
    participant App as Training Script
    participant DS as GeneralConversationsJsonlDataset
    participant Preproc as Preprocessor<br/>(conversation_base)
    participant MediaUtil as Multimodal Utils
    participant Tokenizer as Tokenizer
    
    App->>DS: Load dataset with media dirs
    DS->>DS: Load JSONL conversations
    DS->>Preproc: Process example
    Preproc->>Preproc: Normalize metadata
    Preproc->>Preproc: Split message by media tags
    Preproc->>MediaUtil: Resolve media file paths
    MediaUtil->>MediaUtil: Check/extend file extensions
    MediaUtil-->>Preproc: Resolved media paths
    Preproc-->>DS: Formatted messages with placeholders
    DS-->>App: Preprocessed example
    App->>Tokenizer: Tokenize + Load media
    Tokenizer->>MediaUtil: load_media_from_message
    MediaUtil->>MediaUtil: Load images (PIL)<br/>Load audio (transformers/decord)<br/>Load video (transformers/decord)
    MediaUtil-->>Tokenizer: Media tensors
    Tokenizer->>Tokenizer: Pack tensors<br/>by computed dimension
    Tokenizer-->>App: Token IDs + multimodal embeddings

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~25 minutes

Multimodal utilities (multimodal_utils.py): High logic density with multiple media loaders, processor introspection, and fallback handling (audio/video decord fallbacks, file extension resolution).
Dataset processor logic (general_conversations_dataset.py, conversation_base.py): New multimodal-aware message processing with metadata normalization and media path validation.
Message processing integration (llm_message_utils.py): Refactored media loading pipeline and dynamic tensor packing dimension selection; requires careful verification of multimodal kwargs construction and tokenizer integration.
Split-specific preprocessor routing (run_sft.py): Conditional logic for selecting train/val preprocessors; validate correctness of data flow for datasets with/without datum_preprocessor attribute.

Possibly related PRs

refactor: refactor dataset module #977: Introduces the foundational response datasets system that this PR extends with multimodal support, conversation processing, and dataset loader registration.

Suggested labels

CI, Run CICD

Suggested reviewers

odelalleau
ashors1
terrykong

Pre-merge checks and finishing touches

❌ Failed checks (2 warnings)

Check name	Status	Explanation	Resolution
Docstring Coverage	⚠️ Warning	Docstring coverage is 62.50% which is insufficient. The required threshold is 80.00%.	You can run `@coderabbitai generate docstrings` to improve docstring coverage.
Test Results For Major Changes	⚠️ Warning	PR introduces major multimodal dataloader features without documenting any test results, performance metrics, convergence data, or CI test status.	Include unit tests for new dataset/utility classes, integration tests for end-to-end data loading, performance benchmarks, and convergence/loss curves demonstrating no regression.

✅ Passed checks (2 passed)

Check name	Status	Explanation
Description Check	✅ Passed	Check skipped - CodeRabbit’s high-level summary is enabled.
Title check	✅ Passed	The title 'Omni dataloader for HF models' accurately summarizes the main change: adding multimodal (omni) dataloading support for Hugging Face models across multiple data files.

✨ Finishing touches

📝 Generate docstrings

🧪 Generate unit tests (beta)

Create PR with unit tests
Post copyable unit tests in a comment

Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.

❤️ Share

_{Comment @coderabbitai help to get the list of available commands and usage tips.}

coderabbitai

Actionable comments posted: 15

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)

examples/run_sft.py (1)

144-147: Inconsistent config access pattern between train and validation.

Training dataset (lines 131-132) accesses data_config["add_bos"] and data_config["add_eos"] directly, while validation (lines 145-146) uses .get("add_bos", True) with defaults. This inconsistency could mask configuration errors.

Either both should use direct access (if these are required config keys) or both should use .get() with defaults. As per coding guidelines, YAML should be the single source of truth for defaults, and required attributes should be accessed directly.

🧹 Nitpick comments (14)

nemo_rl/data/llm_message_utils.py (2)
614-616: Remove stray blank line inside code block.

Line 615 contains an empty line with extra whitespace that breaks the code block flow.
             if "video" in media_cur_message:
                 media_kwargs["videos"] = media_cur_message["video"]
-    
             processed_chunk = tokenizer(
542-564: Remove debug print statements before merge.

This debug block uses a function attribute (_debug_printed) to gate one-time printing, which is an unusual pattern that adds mutable state to the function object. This should be removed before merging, or converted to proper logging with configurable verbosity.
examples/run_sft.py (1)
121-123: Consider defensive key access for datum_preprocessor.

The code assumes data.datum_preprocessor contains both "train" and "val" keys when present. If a dataset provides only one key, this would raise a KeyError.

Consider using .get() with a fallback:
     elif hasattr(data, "datum_preprocessor"):
-        datum_preprocessor_train = data.datum_preprocessor["train"]
-        datum_preprocessor_val = data.datum_preprocessor["val"]
+        datum_preprocessor_train = data.datum_preprocessor.get("train")
+        datum_preprocessor_val = data.datum_preprocessor.get("val")
This would gracefully handle cases where only one split has a preprocessor.
nemo_rl/algorithms/utils.py (2)
327-330: Add stacklevel=2 to warnings for proper caller attribution.

When warnings.warn() is called from a utility function, setting stacklevel=2 ensures the warning points to the calling code rather than this function, making debugging easier.
                 warnings.warn(
-                    f"Overriding audio sampling rate from {processor.feature_extractor.sampling_rate} to {new_sampling_rate}"
+                    f"Overriding audio sampling rate from {processor.feature_extractor.sampling_rate} to {new_sampling_rate}",
+                    stacklevel=2,
                 )
Apply the same fix to warnings at lines 336-338 and 344-346.

340-347: Consider explicit validation for mutually exclusive fps/num_frames.

The comment acknowledges that fps and num_frames cannot co-exist, but the code defers the error to downstream components. Providing an early, clear error message would improve the user experience.
+            if "fps" in tokenizer_config["video"] and "num_frames" in tokenizer_config["video"]:
+                raise ValueError(
+                    "Cannot specify both 'fps' and 'num_frames' in video configuration; they are mutually exclusive."
+                )
             # fps and num_frames cannot co-exist, but let it crash later
             if "num_frames" in tokenizer_config["video"] and \
                 tokenizer_config["video"]["num_frames"] != processor.video_processor.num_frames:
nemo_rl/data/datasets/response_datasets/__init__.py (1)
146-157: Consider adding GeneralConversationsJsonlDataset to __all__.

The new class is imported but not exported in __all__. If it should be part of the public API (importable directly from this module), add it to the list:
 __all__ = [
     "CLEVRCoGenTDataset",
     "DeepScalerDataset",
     "DAPOMath17KDataset",
+    "GeneralConversationsJsonlDataset",
     "Geometry3KDataset",
     "OpenAIFormatDataset",
     "OasstDataset",
     "OpenMathInstruct2Dataset",
     "RefCOCODataset",
     "ResponseDataset",
     "SquadDataset",
 ]
If the class is only intended to be used through load_response_dataset(), this can be safely ignored.
nemo_rl/data/datasets/response_datasets/conversation_base.py (3)
15-24: Remove unused imports.

io, copy, dataclasses, and Path are imported but never used in this file.
 import os
 import re
-import io
-import copy
 import warnings
-import dataclasses
 from PIL import Image
-from pathlib import Path
 from collections import defaultdict
 from typing import Any, Dict, Callable, Optional
37-67: Add stacklevel to warnings.warn and fix minor style issue.

The warning should include stacklevel=2 so the warning points to the caller rather than this function. Also, there's an extra space before del on line 48.
             if tag_mapped not in data:
                 data[tag_mapped] = data[tag]
-                del  data[tag]
+                del data[tag]
             else:
                 warnings.warn(
-                    f"Trying to map {tag} to {tag_mapped}, but {tag_mapped} already exists in the raw data. Mapping is not carried out."
+                    f"Trying to map {tag} to {tag_mapped}, but {tag_mapped} already exists in the raw data. Mapping is not carried out.",
+                    stacklevel=2,
                 )
92-92: Unused loop variable.

The loop variable i is not used. Rename to _ to indicate it's intentionally unused.
-    for i, part in enumerate(parts):
+    for _, part in enumerate(parts):
Or simply iterate without enumerate:
-    for i, part in enumerate(parts):
+    for part in parts:
nemo_rl/data/multimodal_utils.py (2)
285-285: Avoid using assert for runtime validation.

Using assert for input validation can be disabled with -O flag. Use an explicit if check with ValueError instead.
-                assert "audio" in multimodal_load_kwargs and "sampling_rate"  in multimodal_load_kwargs["audio"]
+                if "audio" not in multimodal_load_kwargs or "sampling_rate" not in multimodal_load_kwargs["audio"]:
+                    raise ValueError("Audio loading requires 'sampling_rate' in multimodal_load_kwargs['audio']")
297-297: Variable name shadows module-level constant.

load_video_kwargs shadows the module-level constant defined on line 28. Use a different variable name to avoid confusion.
-                load_video_kwargs = multimodal_load_kwargs["video"] if "video" in multimodal_load_kwargs else {}
+                video_kwargs = multimodal_load_kwargs.get("video", {})
                 # seems decord backend loads video faster with multithread ffmpeg and it is easier to install
-                loaded_media["video"].append(load_video(vid, backend="decord", **load_video_kwargs)[0])
+                loaded_media["video"].append(load_video(vid, backend="decord", **video_kwargs)[0])
nemo_rl/data/datasets/response_datasets/general_conversations_dataset.py (3)
26-26: Remove unused debug flag.

_DEBUG=True is defined but never used in this file. Remove it or implement the intended debugging behavior.
-_DEBUG=True
-
-
 class GeneralConversationsJsonlDataset:
17-17: Remove unused import Literal.
-from typing import Any, Optional, Literal
+from typing import Any, Optional
30-34: Fix typos in docstring.
-    """Loads general conversation datasets that have the json (manifest) files and media files in separate files (jsonl datasets).
-    Each sample can be single/multi-turn converstaions with multiple modalities.
+    """Loads general conversation datasets that have the JSON (manifest) files and media files in separate files (JSONL datasets).
+    Each sample can be single/multi-turn conversations with multiple modalities.
     Each modality can have one or more number of media objects.
-    There is no requiement of where the media tag (e.g. '<sound>') should appear in the conversations.
+    There is no requirement of where the media tag (e.g. '<sound>') should appear in the conversations.

📜 Review details

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between a010564 and 1c049b4.

⛔ Files ignored due to path filters (1)

uv.lock is excluded by !**/*.lock

📒 Files selected for processing (9)

examples/configs/sft_avlm.yaml (1 hunks)
examples/run_sft.py (3 hunks)
nemo_rl/algorithms/utils.py (1 hunks)
nemo_rl/data/datasets/response_datasets/__init__.py (2 hunks)
nemo_rl/data/datasets/response_datasets/conversation_base.py (1 hunks)
nemo_rl/data/datasets/response_datasets/general_conversations_dataset.py (1 hunks)
nemo_rl/data/llm_message_utils.py (4 hunks)
nemo_rl/data/multimodal_utils.py (3 hunks)
pyproject.toml (2 hunks)

🧰 Additional context used

📓 Path-based instructions (4)

**/*.py

📄 CodeRabbit inference engine (CODING_GUIDELINES.md)

**/*.py: Conform code to Python 3.12+
Indent code with 4 spaces. Do not use tabs
Use snake_case for file names
Use PascalCase for class names
Use snake_case for function and method names
Use snake_case for local variables
Prefix variable names that start with a number with 'k' (e.g., k_99th_percentile)
Use upper snake_case with 'G' prefix for global variables (e.g., G_MY_GLOBAL)
Use upper snake_case for constants
Avoid shadowing variables declared in an outer scope
Initialize all externally visible members of a class in the constructor
Prefer docstrings over comments for interfaces that may be used outside a file
Reserve comments for code within a function or interfaces that are local to a file
If a piece of code is commented out, include a comment describing its usage and why it's commented out. Remove debug comments before merging
Use Google style docstrings for classes and functions in Python, which can be parsed by Sphinx
Avoid using reflection when functionality can be easily achieved without reflection
When using try-except blocks, limit the except clause to the smallest set of specific errors possible
When using try-except blocks for duck-typing, keep the body of the try as small as possible and use the else block for logic
YAML is the single source of truth for configuration defaults. Do not set non-None defaults in code for configuration values
For required configuration attributes, access config directly and expect presence (e.g., policy_cfg['precision']) without hidden defaults
Use typing.NotRequired to mark optional attributes in TypedDict for configuration
When adding a new config key to a TypedDict subclass, document the key's purpose, valid values/types, and recommended default, and reflect the default in exemplar YAMLs under examples/configs/*.yaml
Follow the Google Python Style Guide for Python code

Files:

nemo_rl/algorithms/utils.py
nemo_rl/data/llm_message_utils.py
nemo_rl/data/datasets/response_datasets/conversation_base.py
examples/run_sft.py
nemo_rl/data/datasets/response_datasets/__init__.py
nemo_rl/data/datasets/response_datasets/general_conversations_dataset.py
nemo_rl/data/multimodal_utils.py

nemo_rl/**/*.py

📄 CodeRabbit inference engine (CODING_GUIDELINES.md)

For any source file under nemo_rl/*.py that defines a class or function decorated with @ray.remote, add a coverage pragma (# pragma: no cover) because these run in separate Ray processes

Files:

nemo_rl/algorithms/utils.py
nemo_rl/data/llm_message_utils.py
nemo_rl/data/datasets/response_datasets/conversation_base.py
nemo_rl/data/datasets/response_datasets/__init__.py
nemo_rl/data/datasets/response_datasets/general_conversations_dataset.py
nemo_rl/data/multimodal_utils.py

!(**/tests/**|**/test_*.py|**/test_*.sh)

📄 CodeRabbit inference engine (CODING_GUIDELINES.md)

Add the NVIDIA copyright header to all Python files and shell scripts (excluding tests). The header should include the current year

Files:

nemo_rl/algorithms/utils.py
examples/configs/sft_avlm.yaml
nemo_rl/data/llm_message_utils.py
nemo_rl/data/datasets/response_datasets/conversation_base.py
examples/run_sft.py
pyproject.toml
nemo_rl/data/datasets/response_datasets/__init__.py
nemo_rl/data/datasets/response_datasets/general_conversations_dataset.py
nemo_rl/data/multimodal_utils.py

**/*.{py,sh}

📄 CodeRabbit inference engine (CODING_GUIDELINES.md)

The NVIDIA copyright header should appear at the top of all Python files and shell scripts (excluding tests)

Files:

nemo_rl/algorithms/utils.py
nemo_rl/data/llm_message_utils.py
nemo_rl/data/datasets/response_datasets/conversation_base.py
examples/run_sft.py
nemo_rl/data/datasets/response_datasets/__init__.py
nemo_rl/data/datasets/response_datasets/general_conversations_dataset.py
nemo_rl/data/multimodal_utils.py

🧬 Code graph analysis (4)

nemo_rl/data/datasets/response_datasets/conversation_base.py (1)

nemo_rl/data/datasets/response_datasets/general_conversations_dataset.py (1)

process_message_fragment (107-118)

examples/run_sft.py (1)

nemo_rl/data/datasets/response_datasets/clevr.py (1)

format_clevr_cogent_dataset (33-62)

nemo_rl/data/datasets/response_datasets/__init__.py (2)

nemo_rl/data/datasets/response_datasets/general_conversations_dataset.py (1)

GeneralConversationsJsonlDataset (29-154)

nemo_rl/data/datasets/utils.py (1)

get_extra_kwargs (86-102)

nemo_rl/data/datasets/response_datasets/general_conversations_dataset.py (3)

nemo_rl/data/datasets/utils.py (1)

load_dataset_from_path (64-83)

nemo_rl/data/interfaces.py (1)

TaskDataSpec (53-86)

nemo_rl/data/datasets/response_datasets/conversation_base.py (2)

convert_metadata (38-67)

conversation_process_message (70-131)

🪛 Ruff (0.14.8)

nemo_rl/algorithms/utils.py

327-327: No explicit stacklevel keyword argument found

Set stacklevel=2

(B028)

336-336: No explicit stacklevel keyword argument found

Set stacklevel=2

(B028)

344-344: No explicit stacklevel keyword argument found

Set stacklevel=2

(B028)

nemo_rl/data/datasets/response_datasets/conversation_base.py

50-50: No explicit stacklevel keyword argument found

Set stacklevel=2

(B028)

74-74: Do not use mutable data structures for argument defaults

Replace with None; initialize within function

(B006)

77-77: Do not use mutable data structures for argument defaults

Replace with None; initialize within function

(B006)

78-78: Unused function argument: tags_mapping_sample_to_allowed

(ARG001)

92-92: Loop control variable i not used within loop body

Rename unused i to _i

(B007)

117-117: Avoid specifying long messages outside the exception class

(TRY003)

nemo_rl/data/datasets/response_datasets/__init__.py

119-121: Avoid specifying long messages outside the exception class

(TRY003)

nemo_rl/data/multimodal_utils.py

271-271: Do not use mutable data structures for argument defaults

Replace with None; initialize within function

(B006)

288-288: Do not use bare except

(E722)

291-291: Undefined name tokenizer

(F821)

291-291: Undefined name key

(F821)

⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (2)

GitHub Check: Post submodule check comment / Comment on PR
GitHub Check: Post automodel integration comment / Comment on PR

🔇 Additional comments (5)

nemo_rl/algorithms/utils.py (1)

323-347: Multimodal processor configuration overrides look good.

The runtime override approach for audio sampling rate and video fps/num_frames is a reasonable way to synchronize processor settings with configuration without modifying the underlying model configs. The warnings appropriately inform users of the overrides.

nemo_rl/data/datasets/response_datasets/__init__.py (1)

117-135: Implementation follows existing patterns correctly.

The new GeneralConversationsJsonlDataset branch properly validates required config, extracts optional kwargs, and instantiates the dataset class. This mirrors the existing ResponseDataset pattern.

nemo_rl/data/multimodal_utils.py (2)

27-28: LGTM!

Dynamic extraction of function parameter names using inspect.signature is a clean approach for forward compatibility.

33-65: LGTM!

Well-organized media tag definitions with appropriate reverse mappings and regex pattern compilation. The warning comment about avoiding cyclic mappings is helpful.

nemo_rl/data/datasets/response_datasets/general_conversations_dataset.py (1)

106-118: LGTM!

The process_message_fragment method correctly handles media path resolution with fallback to media directory, and properly splits compound tags like "video-audio" into separate fragment entries.

coderabbitai · 2025-12-15T22:41:00Z

examples/configs/sft_avlm.yaml

+  train_data_path: /lustre/fsw/portfolios/llmservice/users/yuanhangs/codes/megatron-lm-omcat/megatron-lm-vlm2/examples/multimodal/avlm/test/datasets/miradata_bat1_filtered_vision_5min_10000.jsonl
+  val_data_path: /lustre/fsw/portfolios/llmservice/users/yuanhangs/codes/megatron-lm-omcat/megatron-lm-vlm2/examples/multimodal/avlm/test/datasets/miradata_bat1_filtered_vision_5min_100.jsonl
+  train_media_data_dir: /lustre/fsw/portfolios/edgeai/projects/edgeai_riva_rivamlops/data/videomme/MiraData/video/batch1/5min
+  val_media_data_dir: /lustre/fsw/portfolios/edgeai/projects/edgeai_riva_rivamlops/data/videomme/MiraData/video/batch1/5min


⚠️ Potential issue | 🟠 Major

Replace hardcoded user-specific paths with placeholders.

The example config contains absolute paths to user-specific directories on an internal filesystem (/lustre/fsw/portfolios/llmservice/users/yuanhangs/...). These paths won't work for other users and expose internal infrastructure details.

Consider using placeholder paths that users can customize:

data: dataset_name: GeneralConversationsJsonlDataset - train_data_path: /lustre/fsw/portfolios/llmservice/users/yuanhangs/codes/megatron-lm-omcat/megatron-lm-vlm2/examples/multimodal/avlm/test/datasets/miradata_bat1_filtered_vision_5min_10000.jsonl - val_data_path: /lustre/fsw/portfolios/llmservice/users/yuanhangs/codes/megatron-lm-omcat/megatron-lm-vlm2/examples/multimodal/avlm/test/datasets/miradata_bat1_filtered_vision_5min_100.jsonl - train_media_data_dir: /lustre/fsw/portfolios/edgeai/projects/edgeai_riva_rivamlops/data/videomme/MiraData/video/batch1/5min - val_media_data_dir: /lustre/fsw/portfolios/edgeai/projects/edgeai_riva_rivamlops/data/videomme/MiraData/video/batch1/5min + train_data_path: /path/to/train_data.jsonl # Replace with your training data path + val_data_path: /path/to/val_data.jsonl # Replace with your validation data path + train_media_data_dir: /path/to/train_media/ # Directory containing training media files + val_media_data_dir: /path/to/val_media/ # Directory containing validation media files

📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change

train_data_path: /lustre/fsw/portfolios/llmservice/users/yuanhangs/codes/megatron-lm-omcat/megatron-lm-vlm2/examples/multimodal/avlm/test/datasets/miradata_bat1_filtered_vision_5min_10000.jsonl

val_data_path: /lustre/fsw/portfolios/llmservice/users/yuanhangs/codes/megatron-lm-omcat/megatron-lm-vlm2/examples/multimodal/avlm/test/datasets/miradata_bat1_filtered_vision_5min_100.jsonl

train_media_data_dir: /lustre/fsw/portfolios/edgeai/projects/edgeai_riva_rivamlops/data/videomme/MiraData/video/batch1/5min

val_media_data_dir: /lustre/fsw/portfolios/edgeai/projects/edgeai_riva_rivamlops/data/videomme/MiraData/video/batch1/5min

data:

dataset_name: GeneralConversationsJsonlDataset

train_data_path: /path/to/train_data.jsonl # Replace with your training data path

val_data_path: /path/to/val_data.jsonl # Replace with your validation data path

train_media_data_dir: /path/to/train_media/ # Directory containing training media files

val_media_data_dir: /path/to/val_media/ # Directory containing validation media files

🤖 Prompt for AI Agents

In examples/configs/sft_avlm.yaml around lines 11 to 14, the file contains hardcoded, user-specific absolute paths that expose internal infrastructure and will fail for other users; replace those literal paths with configurable placeholders (e.g., ${DATA_DIR}/train.jsonl, ${DATA_DIR}/val.jsonl, ${MEDIA_DIR}/train_media, ${MEDIA_DIR}/val_media) or relative paths and update any docs/README to instruct users to set the environment variables or edit the placeholders to point to their local dataset and media directories before running.

coderabbitai · 2025-12-15T22:41:00Z

nemo_rl/data/datasets/response_datasets/conversation_base.py

+def conversation_process_message(
+    metadata: Dict[str, Any],
+    message: Dict[str, str],
+    media_index: dict,
+    raw: Dict[str, Any] = {},
+    allow_empty_text: bool = False,
+    check_if_media_file_exist: bool = True,
+    tried_default_extensions: set = set(),
+    tags_mapping_sample_to_allowed: Dict[str, str] = multimodal_utils.media_tags_to_allowed,
+    process_message_fragment: Callable = lambda tag, fragment: [{tag: fragment}],
+) -> list[Dict[str, Any]]:


⚠️ Potential issue | 🔴 Critical

Critical: Mutable default arguments will cause bugs.

raw={} and tried_default_extensions=set() are mutable defaults. The tried_default_extensions set is modified on line 106, meaning the same set instance will be shared across all calls that don't explicitly pass this argument, causing incorrect behavior.

def conversation_process_message( metadata: Dict[str, Any], message: Dict[str, str], media_index: dict, - raw: Dict[str, Any] = {}, + raw: Optional[Dict[str, Any]] = None, allow_empty_text: bool = False, check_if_media_file_exist: bool = True, - tried_default_extensions: set = set(), + tried_default_extensions: Optional[set] = None, tags_mapping_sample_to_allowed: Dict[str, str] = multimodal_utils.media_tags_to_allowed, process_message_fragment: Callable = lambda tag, fragment: [{tag: fragment}], ) -> list[Dict[str, Any]]: """ Args: raw: dictionary with all webdataset compliant keys of a sample. Emtpy for jsonl dataset, non-empty otherwise metadata: """ + if raw is None: + raw = {} + if tried_default_extensions is None: + tried_default_extensions = set() fragments = []

🧰 Tools

🪛 Ruff (0.14.8)

74-74: Do not use mutable data structures for argument defaults

Replace with None; initialize within function

(B006)

77-77: Do not use mutable data structures for argument defaults

Replace with None; initialize within function

(B006)

78-78: Unused function argument: tags_mapping_sample_to_allowed

(ARG001)

🤖 Prompt for AI Agents

In nemo_rl/data/datasets/response_datasets/conversation_base.py around lines 70 to 80, the function uses mutable default arguments raw={} and tried_default_extensions=set(), which leads to shared state across calls (tried_default_extensions is mutated on line 106). Change both defaults to None in the signature, then inside the function initialize raw = {} if raw is None and tried_default_extensions = set() if tried_default_extensions is None so each call gets a fresh object; ensure any downstream logic continues to use these local variables.

coderabbitai · 2025-12-15T22:41:00Z

nemo_rl/data/datasets/response_datasets/conversation_base.py

+    allow_empty_text: bool = False,
+    check_if_media_file_exist: bool = True,
+    tried_default_extensions: set = set(),
+    tags_mapping_sample_to_allowed: Dict[str, str] = multimodal_utils.media_tags_to_allowed,


🛠️ Refactor suggestion | 🟠 Major

Unused parameter tags_mapping_sample_to_allowed.

This parameter is declared but never used in the function body. Either remove it or implement the intended functionality.

🧰 Tools

🪛 Ruff (0.14.8)

78-78: Unused function argument: tags_mapping_sample_to_allowed

(ARG001)

🤖 Prompt for AI Agents

In nemo_rl/data/datasets/response_datasets/conversation_base.py around line 78, the parameter tags_mapping_sample_to_allowed is declared but never used; either remove it from the function signature and update any callers, or implement its intended behavior by applying the mapping to the sample's tags (e.g., map each sample tag via tags_mapping_sample_to_allowed to an allowed tag and use that result wherever allowed tags are computed/validated), and then adjust type hints and add/modify tests to cover the mapping behavior.

coderabbitai · 2025-12-15T22:41:00Z

nemo_rl/data/datasets/response_datasets/conversation_base.py

+            if not isinstance(metadata[tag], list):
+                metadata[tag] = [metadata[tag]]


⚠️ Potential issue | 🟡 Minor

🧩 Analysis chain

🏁 Script executed:

cat -n nemo_rl/data/datasets/response_datasets/conversation_base.py | sed -n '80,110p'

Repository: NVIDIA-NeMo/RL

Length of output: 1585

🏁 Script executed:

cat -n nemo_rl/data/datasets/response_datasets/conversation_base.py | sed -n '65,95p'

Repository: NVIDIA-NeMo/RL

Length of output: 1333

🏁 Script executed:

rg -n "conversation_process_message" nemo_rl/data/datasets/response_datasets/

Repository: NVIDIA-NeMo/RL

Length of output: 306

🏁 Script executed:

cat -n nemo_rl/data/datasets/response_datasets/general_conversations_dataset.py | sed -n '130,160p'

Repository: NVIDIA-NeMo/RL

Length of output: 1302

🏁 Script executed:

cat -n nemo_rl/data/datasets/response_datasets/conversation_base.py | sed -n '1,70p'

Repository: NVIDIA-NeMo/RL

Length of output: 2819

🏁 Script executed:

rg -n "media_tags" nemo_rl/data/multimodal_utils.py | head -20

Repository: NVIDIA-NeMo/RL

Length of output: 277

🏁 Script executed:

sed -n '33,70p' nemo_rl/data/multimodal_utils.py

Repository: NVIDIA-NeMo/RL

Length of output: 1062

Add validation to prevent KeyError when media tag is in message but missing from metadata.

At line 96, metadata[tag] is accessed without checking if tag exists in the metadata dictionary. If a message contains a media tag like <image> but the metadata dict lacks the corresponding key, a KeyError will be raised. Add a check before accessing metadata[tag] to handle missing tags gracefully (either skip the tag, raise a clear error, or provide a default value).

🤖 Prompt for AI Agents

In nemo_rl/data/datasets/response_datasets/conversation_base.py around lines 96-97, the code accesses metadata[tag] without verifying tag exists which can raise KeyError; add a presence check before accessing metadata[tag] (e.g., if tag not in metadata: either continue to skip processing that tag, or set a sensible default like metadata[tag]=[] or raise a clear ValueError with a descriptive message), then proceed with the existing list-wrapping logic.

coderabbitai · 2025-12-15T22:41:00Z

nemo_rl/data/datasets/response_datasets/general_conversations_dataset.py

+        self.formatted_ds = {
+            "train": train_ds,
+            "validation": val_ds,
+        }
+
+        self.datum_preprocessor = {
+            "train": partial(self._datum_preprocessor, media_directory=train_media_data_dir),
+            "val": partial(self._datum_preprocessor, media_directory=val_media_data_dir)
+        }


⚠️ Potential issue | 🟠 Major

Inconsistent dictionary keys: "validation" vs "val".

formatted_ds uses "validation" as the key, but datum_preprocessor uses "val". This inconsistency could cause lookup failures when accessing preprocessors by split name.

self.datum_preprocessor = { "train": partial(self._datum_preprocessor, media_directory=train_media_data_dir), - "val": partial(self._datum_preprocessor, media_directory=val_media_data_dir) + "validation": partial(self._datum_preprocessor, media_directory=val_media_data_dir) }

🤖 Prompt for AI Agents

In nemo_rl/data/datasets/response_datasets/general_conversations_dataset.py around lines 94 to 102, the split keys are inconsistent: formatted_ds uses "validation" while datum_preprocessor uses "val", which will cause lookups to fail; fix by making the keys identical (e.g., change the "val" key in datum_preprocessor to "validation" or rename "validation" to "val" so both dicts use the same split name), or preferably derive both from a single split-name constant; update only the dictionary keys to match and ensure any downstream access uses the chosen split name.

coderabbitai · 2025-12-15T22:41:01Z

nemo_rl/data/multimodal_utils.py

+def load_media_from_message(
+    message: dict[str, Any], 
+    processor = None, 
+    multimodal_load_kwargs: dict[str, dict[str, Any]] = {},
+) -> dict[list[Any]]:


⚠️ Potential issue | 🟠 Major

Mutable default argument and malformed type hint.

The multimodal_load_kwargs parameter has a mutable default {}. Also, the return type hint dict[list[Any]] is malformed; it should be dict[str, list[Any]].

def load_media_from_message( message: dict[str, Any], processor = None, - multimodal_load_kwargs: dict[str, dict[str, Any]] = {}, -) -> dict[list[Any]]: + multimodal_load_kwargs: Optional[dict[str, dict[str, Any]]] = None, +) -> dict[str, list[Any]]: loaded_media = defaultdict(list) media_in_message = get_media_from_message(message) - if not multimodal_load_kwargs and processor is not None: + if multimodal_load_kwargs is None: + multimodal_load_kwargs = {} + if not multimodal_load_kwargs and processor is not None: multimodal_load_kwargs = get_multimodal_default_settings_from_processor(processor)

🧰 Tools

🪛 Ruff (0.14.8)

271-271: Do not use mutable data structures for argument defaults

Replace with None; initialize within function

(B006)

🤖 Prompt for AI Agents

In nemo_rl/data/multimodal_utils.py around lines 268 to 272, change the mutable default and malformed return hint: replace multimodal_load_kwargs: dict[str, dict[str, Any]] = {} with multimodal_load_kwargs: Optional[dict[str, dict[str, Any]]] = None (import Optional if needed) and at function start set multimodal_load_kwargs = multimodal_load_kwargs or {} to avoid a shared mutable default; also correct the return type hint from dict[list[Any]] to dict[str, list[Any]].

coderabbitai · 2025-12-15T22:41:01Z

nemo_rl/data/multimodal_utils.py

+    if "audio" in media_in_message:
+        for aud in media_in_message["audio"]:
+            if isinstance(aud, str):
+                print(multimodal_load_kwargs)


⚠️ Potential issue | 🟡 Minor

Remove debug print statement.

This print(multimodal_load_kwargs) appears to be a debug artifact and should be removed before merging.

- print(multimodal_load_kwargs)

📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change

print(multimodal_load_kwargs)

🤖 Prompt for AI Agents

In nemo_rl/data/multimodal_utils.py around line 284, remove the stray debug print statement "print(multimodal_load_kwargs)"; simply delete that line so the function no longer emits debug output, and run tests/lint to ensure no leftover unused variables or formatting issues.

coderabbitai · 2025-12-15T22:41:01Z

nemo_rl/data/multimodal_utils.py

+                except:
+                    # use decord
+                    loaded_audio = decord.AudioReader(aud, sample_rate=multimodal_load_kwargs["audio"]["sampling_rate"], mono=True)
+                    loaded_media["audio"].append(loaded_audio[:].asnumpy()[get_dim_to_pack_along(tokenizer, key)])


⚠️ Potential issue | 🔴 Critical

Critical: Bare except clause and undefined variables.

The bare except: catches all exceptions including KeyboardInterrupt and SystemExit, masking errors. Specify the expected exception type.

Line 291 references tokenizer and key which are undefined in this scope—this will cause a NameError at runtime.

- try: - loaded_media["audio"].append(load_audio(aud, **multimodal_load_kwargs["audio"])) - except: - # use decord - loaded_audio = decord.AudioReader(aud, sample_rate=multimodal_load_kwargs["audio"]["sampling_rate"], mono=True) - loaded_media["audio"].append(loaded_audio[:].asnumpy()[get_dim_to_pack_along(tokenizer, key)]) + try: + loaded_media["audio"].append(load_audio(aud, **multimodal_load_kwargs["audio"])) + except Exception as e: + # Fallback to decord for unsupported formats + loaded_audio = decord.AudioReader(aud, sample_rate=multimodal_load_kwargs["audio"]["sampling_rate"], mono=True) + loaded_media["audio"].append(loaded_audio[:].asnumpy()[0]) # Fix: use appropriate index

The processor variable should likely be used instead of tokenizer, and key should be "audio" or similar. Please verify the intended logic.

Committable suggestion skipped: line range outside the PR's diff.

🧰 Tools

🪛 Ruff (0.14.8)

288-288: Do not use bare except

(E722)

291-291: Undefined name tokenizer

(F821)

291-291: Undefined name key

(F821)

🤖 Prompt for AI Agents

In nemo_rl/data/multimodal_utils.py around lines 288–291, replace the bare except with a specific exception handler (e.g., except (RuntimeError, ValueError, OSError):) so we don't swallow KeyboardInterrupt/SystemExit, and inside the handler use the correct variables — replace the undefined tokenizer with processor and use the explicit key "audio" (or the variable that holds the modality string) when indexing, making sure processor is available in scope; re-raise unexpected exceptions if needed or let them propagate.

pyproject.toml

coderabbitai · 2025-12-15T22:41:01Z

pyproject.toml

 transformer-engine-torch = [{ requirement = "torch", match-runtime = true }]
 mamba-ssm = [{ requirement = "torch", match-runtime = true }]
-causal-conv1d = [{ requirement = "torch", match-runtime = true }]
+causal-conv1d = [{ requirement = "torch", match-runtime = true }, "setuptools.build_meta"]


⚠️ Potential issue | 🟠 Major

🧩 Analysis chain

🏁 Script executed:

# First, let's locate and examine the pyproject.toml file find . -name "pyproject.toml" -type f

Repository: NVIDIA-NeMo/RL

Length of output: 257

🏁 Script executed:

# Once found, let's look at the relevant lines around 232 and context head -250 pyproject.toml | tail -50

Repository: NVIDIA-NeMo/RL

Length of output: 2448

🏁 Script executed:

# Also check around line 112 for the build dependency section head -120 pyproject.toml | tail -30

Repository: NVIDIA-NeMo/RL

Length of output: 1285

🏁 Script executed:

# Let's see the entire dependency-groups section and extra-build-dependencies grep -n "extra-build-dependencies\|dependency-groups\|causal-conv1d\|setuptools" pyproject.toml | head -20

Repository: NVIDIA-NeMo/RL

Length of output: 793

🌐 Web query:

uv extra-build-dependencies format package name validation

💡 Result:

Short answer — format and validation

Keys in tool.uv.extra-build-dependencies are package/project names (the package you want to augment). Values are either:

a list of requirement strings (PEP 508 requirement specifiers), e.g. ["setuptools", "torch==2.6.0"], or

a list of objects with fields like requirement (a PEP‑508 requirement string) and match-runtime (bool), e.g. [{ requirement = "torch", match-runtime = true }]. [1][2]

The requirement strings follow standard pip/PEP‑508 requirement syntax (project name + optional version specifiers, extras, markers, direct URLs, etc.). uv parses/validates these using a PEP‑508-compatible requirement parser, so invalid PEP‑508 requirement strings or invalid package identifiers will be rejected. [2][3]

Practical examples and behavior (matching runtime, rebuild triggers, experimental flag) are documented in the uv config docs and the project’s examples/issues. If you get a validation error, ensure your package names/requirement strings conform to PEP‑508 (or use the exploded table style shown in the docs). [1][2][4]

Sources:

uv docs — extra-build-dependencies and examples. [1]

uv docs — configuring projects / augmenting build dependencies (match-runtime examples). [2]

pip / PEP‑508 requirement specifier reference (requirement string format). [3]

uv GitHub issues / implementation notes showing PEP‑508 parsing/validation behavior. [4]

🌐 Web query:

causal-conv1d build requirements setuptools

💡 Result:

Short answer — to build causal-conv1d from source you need:

Python >= 3.9. [1]

setuptools (and wheel) available for building a Python wheel. [2]

A compatible PyTorch installation (headers) — the extension builds against PyTorch/C++/CUDA. [1]

CUDA toolkit (nvcc) and a matching CUDA driver for the prebuilt CUDA kernels (or ROCm toolchain for AMD/ROCm builds; the project documents a ROCm 6.0 patch). [1][3]

A C/C++ compiler toolchain (gcc/clang on Linux; MSVC on Windows) to compile the extension. [1]

References:

PyPI project page (requirements, Python version, ROCm note). [1]

Repository setup/build context and reports of setuptools being required during install. [2][3]

🏁 Script executed:

# Let's verify if setuptools.build_meta would be accepted by PEP 508 parsers # by checking the causal-conv1d repository's own build configuration curl -s https://raw.githubusercontent.com/Dao-AILab/causal-conv1d/v1.5.0.post8/pyproject.toml 2>/dev/null | head -50

Repository: NVIDIA-NeMo/RL

Length of output: 72

🏁 Script executed:

# Also check if uv has any special handling or if this causes issues # by examining uv's requirement parsing behavior python3 << 'EOF' # Simulate PEP 508 parsing to verify if "setuptools.build_meta" is valid try: # Try using packaging library for PEP 508 parsing import subprocess result = subprocess.run(['pip', 'list'], capture_output=True, text=True) if 'packaging' in result.stdout: from packaging.requirements import Requirement try: req = Requirement("setuptools.build_meta") print(f"✓ 'setuptools.build_meta' parsed as valid: {req}") except Exception as e: print(f"✗ 'setuptools.build_meta' failed PEP 508 parsing: {e}") else: print("packaging not installed, installing...") subprocess.run(['pip', 'install', 'packaging'], capture_output=True) from packaging.requirements import Requirement try: req = Requirement("setuptools.build_meta") print(f"✓ 'setuptools.build_meta' parsed as valid: {req}") except Exception as e: print(f"✗ 'setuptools.build_meta' failed PEP 508 parsing: {e}") except Exception as e: print(f"Error: {e}") EOF

Repository: NVIDIA-NeMo/RL

Length of output: 122

🏁 Script executed:

# Check if "setuptools.build_meta" is actually a package on PyPI python3 << 'EOF' import urllib.request import json try: url = "https://pypi.org/pypi/setuptools.build_meta/json" with urllib.request.urlopen(url, timeout=5) as response: data = json.loads(response.read()) print(f"✓ Package 'setuptools.build_meta' found on PyPI") print(f" Latest version: {data['info']['version']}") except urllib.error.HTTPError as e: if e.code == 404: print(f"✗ Package 'setuptools.build_meta' NOT found on PyPI (404)") else: print(f"✗ HTTP Error {e.code}: {e}") except Exception as e: print(f"✗ Error checking PyPI: {e}") EOF

Repository: NVIDIA-NeMo/RL

Length of output: 208

🏁 Script executed:

# Also verify what setuptools actually provides python3 << 'EOF' try: import setuptools # Check if setuptools.build_meta exists as a module try: import setuptools.build_meta print("✓ setuptools.build_meta module exists within setuptools package") print(f" Location: {setuptools.build_meta.__file__}") except ImportError as e: print(f"✗ setuptools.build_meta module NOT found: {e}") except ImportError: print("setuptools not installed in test environment") EOF

Repository: NVIDIA-NeMo/RL

Length of output: 189

🏁 Script executed:

# Check the causal-conv1d's actual build configuration by looking at its setup git ls-files | grep -E "causal.*conv.*1d" || echo "No causal-conv1d files in repo"

Repository: NVIDIA-NeMo/RL

Length of output: 88

🏁 Script executed:

# Test if uv would actually fail or succeed with setuptools.build_meta # by checking if uv tries to resolve it as a package python3 << 'EOF' # PEP 508 parsing accepts it, but PyPI package resolution won't find it # Let's check if setuptools.build_meta is a real PyPI package by normalizing the name from packaging.utils import canonicalize_name pkg_name = "setuptools.build_meta" canonical = canonicalize_name(pkg_name) print(f"Original name: {pkg_name}") print(f"Canonical name: {canonical}") print(f"Would uv try to find a package named '{canonical}' on PyPI: Yes") print(f"\nAnalysis:") print(f"- 'setuptools' is a real package on PyPI") print(f"- 'setuptools.build_meta' is NOT a package on PyPI") print(f"- It's a module within the setuptools package") print(f"- Using it as a requirement string is misleading/incorrect") EOF

Repository: NVIDIA-NeMo/RL

Length of output: 414

🏁 Script executed:

# Let's see if there are any comments or documentation in pyproject.toml explaining this grep -B5 -A2 "setuptools.build_meta" pyproject.toml

Repository: NVIDIA-NeMo/RL

Length of output: 632

Remove or correct "setuptools.build_meta" in extra-build-dependencies.

The string "setuptools.build_meta" is not a valid PyPI package name. While PEP 508 parsing accepts it syntactically, uv will normalize it to "setuptools-build-meta" and fail to resolve it on PyPI (the package doesn't exist). This will cause causal-conv1d builds to fail.

setuptools.build_meta is a build backend module within the setuptools package, not a package itself. Since setuptools is already listed in the build dependency group (line 112) and is the correct build dependency for causal-conv1d, either remove this entry or replace it with "setuptools":

-causal-conv1d = [{ requirement = "torch", match-runtime = true }, "setuptools.build_meta"] +causal-conv1d = [{ requirement = "torch", match-runtime = true }]

📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change

causal-conv1d = [{ requirement = "torch", match-runtime = true }, "setuptools.build_meta"]

causal-conv1d = [{ requirement = "torch", match-runtime = true }]

🤖 Prompt for AI Agents

In pyproject.toml around line 232, the extras build-dependency entry includes the invalid PyPI name "setuptools.build_meta"; remove that token or replace it with the actual package name "setuptools" so the extras list resolves on PyPI (e.g. change the causal-conv1d extras array to use "setuptools" or simply drop the "setuptools.build_meta" item).

terrykong

@yfw to review

Yuanhang Su and others added 14 commits December 8, 2025 13:44

add dep for causal-conv1d

dd26eae

add conversation-based dataset

82e3767

add avlm config yaml

9edd54c

import bugfix

f9ca6a7

indentation fix

f1bf681

add GeneralConversationsJsonlDataset initializer

499198c

bugfix

5f05cd1

process multimodal data

14ee7d7

use decord for video and audio loading

8cdb458

video output bugfix

6b44329

move the sample processing to sft_processor

7b0a10d

Merge branch 'main' of https://github.com/yuanhangsu1986/RL-Nemontron…

de7c995

…-Edge-Omni

move multimodal functions to multimodal_utils.py; add video, audio se…

8cd6777

…ttings in the config file

bugfix

1c049b4

yuanhangsu1986 requested review from a team as code owners December 15, 2025 22:34

coderabbitai bot reviewed Dec 15, 2025

View reviewed changes

github-actions bot added the community-request label Dec 15, 2025

terrykong reviewed Dec 15, 2025

View reviewed changes

terrykong requested a review from yfw December 15, 2025 22:43

terrykong changed the title ~~Omni dataloader for HF models~~ feat: Omni dataloader for HF models Dec 15, 2025

-  train_data_path: /lustre/fsw/portfolios/llmservice/users/yuanhangs/codes/megatron-lm-omcat/megatron-lm-vlm2/examples/multimodal/avlm/test/datasets/miradata_bat1_filtered_vision_5min_10000.jsonl
-  val_data_path: /lustre/fsw/portfolios/llmservice/users/yuanhangs/codes/megatron-lm-omcat/megatron-lm-vlm2/examples/multimodal/avlm/test/datasets/miradata_bat1_filtered_vision_5min_100.jsonl
-  train_media_data_dir: /lustre/fsw/portfolios/edgeai/projects/edgeai_riva_rivamlops/data/videomme/MiraData/video/batch1/5min
-  val_media_data_dir: /lustre/fsw/portfolios/edgeai/projects/edgeai_riva_rivamlops/data/videomme/MiraData/video/batch1/5min
+data:
+  dataset_name: GeneralConversationsJsonlDataset
+  train_data_path: /path/to/train_data.jsonl  # Replace with your training data path
+  val_data_path: /path/to/val_data.jsonl  # Replace with your validation data path
+  train_media_data_dir: /path/to/train_media/  # Directory containing training media files
+  val_media_data_dir: /path/to/val_media/  # Directory containing validation media files

		if not isinstance(metadata[tag], list):
		metadata[tag] = [metadata[tag]]

	causal-conv1d = [{ requirement = "torch", match-runtime = true }, "setuptools.build_meta"]
	causal-conv1d = [{ requirement = "torch", match-runtime = true }]

feat: Omni dataloader for HF models #1639

Are you sure you want to change the base?

feat: Omni dataloader for HF models #1639

Uh oh!

Conversation

yuanhangsu1986 commented Dec 15, 2025 • edited by coderabbitai bot Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

What does this PR do ?

Usage

Additional Information

Summary by CodeRabbit

Uh oh!

coderabbitai bot commented Dec 15, 2025 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Walkthrough

Changes

Sequence Diagram

Estimated code review effort

Possibly related PRs

Suggested labels

Suggested reviewers

Pre-merge checks and finishing touches

Uh oh!

coderabbitai bot left a comment

Choose a reason for hiding this comment

Uh oh!

coderabbitai bot Dec 15, 2025

Choose a reason for hiding this comment

Uh oh!

coderabbitai bot Dec 15, 2025

Choose a reason for hiding this comment

Uh oh!

coderabbitai bot Dec 15, 2025

Choose a reason for hiding this comment

Uh oh!

coderabbitai bot Dec 15, 2025

Choose a reason for hiding this comment

Uh oh!

coderabbitai bot Dec 15, 2025

Choose a reason for hiding this comment

Uh oh!

coderabbitai bot Dec 15, 2025

Choose a reason for hiding this comment

Uh oh!

coderabbitai bot Dec 15, 2025

Choose a reason for hiding this comment

Uh oh!

coderabbitai bot Dec 15, 2025

Choose a reason for hiding this comment

Uh oh!

Uh oh!

coderabbitai bot Dec 15, 2025

Choose a reason for hiding this comment

Uh oh!

terrykong left a comment

Choose a reason for hiding this comment

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

yuanhangsu1986 commented Dec 15, 2025 •

edited by coderabbitai bot

Loading

coderabbitai bot commented Dec 15, 2025 •

edited

Loading