From 6af38b2926fcb6279e1e16a61a844b624689d76b Mon Sep 17 00:00:00 2001 From: Bobby Chen Date: Fri, 31 Oct 2025 17:54:17 +0000 Subject: [PATCH 1/3] Update MM docs --- docs/mm/nemo_2/in-framework.md | 136 ++++++++++++++++++++++- docs/mm/nemo_2/optimized/tensorrt-llm.md | 45 +++----- 2 files changed, 150 insertions(+), 31 deletions(-) diff --git a/docs/mm/nemo_2/in-framework.md b/docs/mm/nemo_2/in-framework.md index 7421471adb..51cf87c9fb 100644 --- a/docs/mm/nemo_2/in-framework.md +++ b/docs/mm/nemo_2/in-framework.md @@ -1,5 +1,135 @@ -# Deploy NeMo 2.0 Multimodal Models +# Deploy NeMo 2.0 Multimodal Models with Triton Inference Server -## Optimized Inference for Multimodal Models using TensorRT +This section explains how to deploy [NeMo 2.0](https://github.com/NVIDIA-NeMo/NeMo) multimodal models with the NVIDIA Triton Inference Server. -Will be updated soon. \ No newline at end of file +## Quick Example + +1. Follow the steps on the [Generate A NeMo 2.0 Checkpoint page](gen_nemo2_ckpt.md) to generate a NeMo 2.0 multimodal checkpoint. + +2. In a terminal, go to the folder where the ``qwen2_vl_3b`` is located. Pull and run the Docker container image using the command shown below. Change the ``:vr`` tag to the version of the container you want to use: + + ```shell + docker pull nvcr.io/nvidia/nemo:vr + + docker run --gpus all -it --rm --shm-size=4g -p 8000:8000 \ + -v ${PWD}/:/opt/checkpoints/ \ + -w /opt/Export-Deploy \ + --name nemo-fw \ + nvcr.io/nvidia/nemo:vr + ``` + +3. Using a NeMo 2.0 multimodal model, run the following deployment script to verify that everything is working correctly. The script directly serves the NeMo 2.0 model on the Triton server: + + ```shell + python /opt/Export-Deploy/scripts/deploy/multimodal/deploy_inframework_triton.py --nemo_checkpoint /opt/checkpoints/qwen2_vl_3b --triton_model_name qwen + ``` + +4. If the test yields a shared memory-related error, increase the shared memory size using ``--shm-size`` (for example, gradually by 50%). + +5. In a separate terminal, access the running container as follows: + + ```shell + docker exec -it nemo-fw bash + ``` + +6. To send a query to the Triton server, run the following script with an image: + + ```shell + python /opt/Export-Deploy/scripts/deploy/multimodal/query_inframework.py \ + --model_name qwen \ + --prompt "Describe this image" \ + --image /path/to/image.jpg \ + --max_output_len 100 + ``` + +## Use a Script to Deploy NeMo 2.0 Multimodal Models on a Triton Inference Server + +You can deploy a multimodal model from a NeMo checkpoint on Triton using the provided script. + +### Deploy a NeMo Multimodal Model + +Executing the script will directly deploy the NeMo 2.0 multimodal model and start the service on Triton. + +1. Start the container using the steps described in the **Quick Example** section. + +2. To begin serving the downloaded model, run the following script: + + ```shell + python /opt/Export-Deploy/scripts/deploy/multimodal/deploy_inframework_triton.py --nemo_checkpoint /opt/checkpoints/qwen2_vl_3b --triton_model_name qwen + ``` + + The following parameters are defined in the ``deploy_inframework_triton.py`` script: + + - ``-nc``, ``--nemo_checkpoint``: Path to the NeMo 2.0 checkpoint file to deploy. (Required) + - ``-tmn``, ``--triton_model_name``: Name to register the model under in Triton. (Required) + - ``-tmv``, ``--triton_model_version``: Version number for the model in Triton. Default: 1 + - ``-sp``, ``--server_port``: Port for the REST server to listen for requests. 
Default: 8080 + - ``-sa``, ``--server_address``: HTTP address for the REST server. Default: 0.0.0.0 + - ``-trp``, ``--triton_port``: Port for the Triton server to listen for requests. Default: 8000 + - ``-tha``, ``--triton_http_address``: HTTP address for the Triton server. Default: 0.0.0.0 + - ``-tps``, ``--tensor_parallel_size``: Tensor parallelism size. Default: 1 + - ``-pps``, ``--pipeline_parallel_size``: Pipeline parallelism size. Default: 1 + - ``-mbs``, ``--max_batch_size``: Max batch size of the model. Default: 4 + - ``-dm``, ``--debug_mode``: Enable debug mode. (Flag; set to enable) + - ``-pd``, ``--params_dtype``: Data type for model parameters. Choices: float16, bfloat16, float32. Default: bfloat16 + - ``-ibts``, ``--inference_batch_times_seqlen_threshold``: Inference batch times sequence length threshold. Default: 1000 + + *Note: Some parameters may be ignored or have no effect depending on the model and deployment environment. Refer to the script's help message for the most up-to-date list.* + +3. To deploy a different model, just change the ``--nemo_checkpoint`` argument in the script. + + +## How To Send a Query + +You can send queries to the Triton Inference Server using either the provided script or the available APIs. + +### Send a Query using the Script +This script allows you to interact with the multimodal model via HTTP requests, sending prompts and images and receiving generated responses directly from the Triton server. + +The example below demonstrates how to use the query script to send a prompt and image to your deployed model. You can customize the request with various parameters to control generation behavior, such as output length, sampling strategy, and more. For a full list of supported parameters, see below. + + +```shell +python /opt/Export-Deploy/scripts/deploy/multimodal/query_inframework.py \ + --model_name qwen \ + --processor_name Qwen/Qwen2.5-VL-3B-Instruct \ + --prompt "What is in this image?" \ + --image /path/to/image.jpg \ + --max_output_len 100 +``` + +**All Parameters:** +- `-u`, `--url`: URL for the Triton server (default: 0.0.0.0) +- `-mn`, `--model_name`: Name of the Triton model (required) +- `-pn`, `--processor_name`: Processor name for qwen-vl models (default: Qwen/Qwen2.5-VL-7B-Instruct) +- `-p`, `--prompt`: Prompt text (mutually exclusive with --prompt_file; required if --prompt_file not given) +- `-pf`, `--prompt_file`: File to read the prompt from (mutually exclusive with --prompt; required if --prompt not given) +- `-i`, `--image`: Path or URL to input image file (required) +- `-mol`, `--max_output_len`: Max output token length (default: 50) +- `-mbs`, `--max_batch_size`: Max batch size for inference (default: 4) +- `-tk`, `--top_k`: Top-k sampling (default: 1) +- `-tpp`, `--top_p`: Top-p sampling (default: 0.0) +- `-t`, `--temperature`: Sampling temperature (default: 1.0) +- `-rs`, `--random_seed`: Random seed for generation (optional) +- `-it`, `--init_timeout`: Init timeout for the Triton server in seconds (default: 60.0) + + +### Send a Query using the NeMo APIs + +Please see the below if you would like to use APIs to send a query. 
+ +```python +from nemo_deploy.multimodal import NemoQueryMultimodalPytorch +from PIL import Image + +nq = NemoQueryMultimodalPytorch(url="localhost:8000", model_name="qwen") +output = nq.query_multimodal( + prompts=["What is in this image?"], + images=[Image.open("/path/to/image.jpg")], + max_length=100, + top_k=1, + top_p=0.0, + temperature=1.0, +) +print(output) +``` \ No newline at end of file diff --git a/docs/mm/nemo_2/optimized/tensorrt-llm.md b/docs/mm/nemo_2/optimized/tensorrt-llm.md index c1a421b790..fdcca2422a 100644 --- a/docs/mm/nemo_2/optimized/tensorrt-llm.md +++ b/docs/mm/nemo_2/optimized/tensorrt-llm.md @@ -7,13 +7,9 @@ This section shows how to use scripts and APIs to export a NeMo 2.0 MM to Tensor The following table shows the supported models. -| Model Name | NeMo Precision | TensorRT Precision | -| :---------- | -------------- |--------------------| -| Neva | bfloat16 | bfloat16 | -| Video Neva | bfloat16 | bfloat16 | -| LITA/VITA | bfloat16 | bfloat16 | -| VILA | bfloat16 | bfloat16 | -| SALM | bfloat16 | bfloat16 | +| Model Name | NeMo Precision | TensorRT Precision | +| :--------------- | -------------- |--------------------| +| Llama 3.2-Vision | bfloat16 | bfloat16 | ### Access the Models with a Hugging Face Token @@ -34,7 +30,7 @@ If you want to run inference using the LLama3 model, you'll need to generate a H ### Export and Deploy a NeMo Multimodal Checkpoint to TensorRT-LLM -This section provides an example of how to quickly and easily deploy a NeMo checkpoint to TensorRT. Neva will be used as an example model. Please consult the table above for a complete list of supported models. +This section provides an example of how to quickly and easily deploy a NeMo checkpoint to TensorRT. Llama 3.2-Vision will be used as an example model. Please consult the table above for a complete list of supported models. 1. Follow the steps on the [Generate A NeMo 2.0 Checkpoint page](../gen_nemo2_ckpt.md) to generate a NeMo 2.0 Llama Vision Instruct checkpoint. @@ -71,8 +67,6 @@ This section provides an example of how to quickly and easily deploy a NeMo chec ```shell python /opt/Export-Deploy/scripts/deploy/multimodal/query.py -mn mllama -mt=mllama -int="What is in this image?" -im=/path/to/image.jpg ``` - -6. To export and deploy a different model, such as Video Neva, change the *model_type* and *modality* in the *scripts/deploy/multimodal/deploy_triton.py* script. ### Use a Script to Run Inference on a Triton Server @@ -89,7 +83,7 @@ After executing the script, it will export the model to TensorRT and then initia 2. To begin serving the model, run the following script: ```shell - python /opt/Export-Deploy/scripts/deploy/multimodal/deploy_triton.py --visual_checkpoint /opt/checkpoints/nemo_neva.nemo --model_type neva --llm_model_type llama --triton_model_name neva + python /opt/Export-Deploy/scripts/deploy/multimodal/deploy_triton.py --visual_checkpoint /opt/checkpoints/nemo_mllama.nemo --model_type mllama --triton_model_name mllama ``` The following parameters are defined in the ``deploy_triton.py`` script: @@ -118,14 +112,9 @@ After executing the script, it will export the model to TensorRT and then initia 3. To export and deploy a different model, such as Video Neva, change the *model_type* and *modality* in the *scripts/deploy/multimodal/deploy_triton.py* script. Please see the table below to learn more about which *model_type* and *modality* is used for a multimodal model. 
- | Model Name | model_type | modality | - | :---------- | ------------ |------------| - | Neva | neva | vision | - | Video Neva | video-neva | vision | - | LITA | lita | vision | - | VILA | vila | vision | - | VITA | vita | vision | - | SALM | salm | audio | + | Model Name | model_type | modality | + | :---------------- | ------------ |------------| + | Llama 3.2-Vision | mllama | vision | 4. Stop the running container and then run the following command to specify an empty directory: @@ -135,7 +124,7 @@ After executing the script, it will export the model to TensorRT and then initia docker run --gpus all -it --rm --shm-size=4g -p 8000:8000 -v ${PWD}:/opt/checkpoints/ -w /opt/NeMo nvcr.io/nvidia/nemo:vr - python /opt/Export-Deploy/scripts/deploy/multimodal/deploy_triton.py --visual_checkpoint /opt/checkpoints/nemo_neva.nemo --model_type neva --llm_model_type llama --triton_model_name neva --triton_model_repository /opt/checkpoints/tmp_triton_model_repository --modality vision + python /opt/Export-Deploy/scripts/deploy/multimodal/deploy_triton.py --visual_checkpoint /opt/checkpoints/nemo_mllama.nemo --model_type mllama --triton_model_name mllama --triton_model_repository /opt/checkpoints/tmp_triton_model_repository --modality vision ``` The checkpoint will be exported to the specified folder after executing the script mentioned above. @@ -143,7 +132,7 @@ After executing the script, it will export the model to TensorRT and then initia 5. To load the exported model directly, run the following script within the container: ```shell - python /opt/Export-Deploy/scripts/deploy/multimodal/deploy_triton.py --triton_model_name neva --triton_model_repository /opt/checkpoints/tmp_triton_model_repository --model_type neva --llm_model_type llama --modality vision + python /opt/Export-Deploy/scripts/deploy/multimodal/deploy_triton.py --triton_model_name mllama --triton_model_repository /opt/checkpoints/tmp_triton_model_repository --model_type mllama --modality vision ``` #### Send a Query @@ -160,7 +149,7 @@ The following example shows how to execute the query script within the currently 1. To use a query script, run the following command. For VILA/LITA/VITA models, the input_text should add ``\n`` before the actual text, such as ``\n What is in this image?``: ```shell - python /opt/Export-Deploy/scripts/deploy/multimodal/query.py --url "http://localhost:8000" --model_name neva --model_type neva --input_text "What is in this image?" --input_media /path/to/image.jpg + python /opt/Export-Deploy/scripts/deploy/multimodal/query.py --url "http://localhost:8000" --model_name mllama --model_type mllama --input_text "What is in this image?" --input_media /path/to/image.jpg ``` 2. Change the url and the ``model_name`` based on your server and the model name of your service. The code in the script can be used as a basis for your client code as well. ``input_media`` is the path to the image or audio file you want to use as input. @@ -173,7 +162,7 @@ Up until now, we have used scripts for exporting and deploying Multimodal models #### Export a Multimodal Model to TensorRT -You can use the APIs in the export module to export a NeMo checkpoint to TensorRT-LLM. The following code example assumes the ``nemo_neva.nemo`` checkpoint has already mounted to the ``/opt/checkpoints/`` path. Additionally, the ``/opt/data/image.jpg`` is also assumed to exist. +You can use the APIs in the export module to export a NeMo checkpoint to TensorRT-LLM. 
The following code example assumes the ``nemo_mllama.nemo`` checkpoint has already been mounted to the ``/opt/checkpoints/`` path. The ``/opt/data/image.jpg`` file is also assumed to exist.

1. Run the following command:

    ```python
    from nemo_export.tensorrt_mm_exporter import TensorRTMMExporter

    exporter = TensorRTMMExporter(model_dir="/opt/checkpoints/tmp_triton_model_repository/", modality="vision")
-    exporter.export(visual_checkpoint_path="/opt/checkpoints/nemo_neva.nemo", model_type="neva", llm_model_type="llama", tensor_parallel_size=1)
+    exporter.export(visual_checkpoint_path="/opt/checkpoints/nemo_mllama.nemo", model_type="mllama", tensor_parallel_size=1)
    output = exporter.forward("What is in this image?", "/opt/data/image.jpg", max_output_token=30, top_k=1, top_p=0.0, temperature=1.0)
    print("output: ", output)
    ```

#### Deploy a Multimodal Model to TensorRT

You can use the APIs in the deploy module to deploy a TensorRT-LLM model to Triton. The following code example assumes the ``nemo_mllama.nemo`` checkpoint has already been mounted to the ``/opt/checkpoints/`` path.

1. Run the following command:

    ```python
    from nemo_export.tensorrt_mm_exporter import TensorRTMMExporter
    from nemo_deploy import DeployPyTriton

    exporter = TensorRTMMExporter(model_dir="/opt/checkpoints/tmp_triton_model_repository/", modality="vision")
-    exporter.export(visual_checkpoint_path="/opt/checkpoints/nemo_neva.nemo", model_type="neva", llm_model_type="llama", tensor_parallel_size=1)
+    exporter.export(visual_checkpoint_path="/opt/checkpoints/nemo_mllama.nemo", model_type="mllama", tensor_parallel_size=1)

-    nm = DeployPyTriton(model=exporter, triton_model_name="neva", port=8000)
+    nm = DeployPyTriton(model=exporter, triton_model_name="mllama", port=8000)
    nm.deploy()
    nm.serve()
    ```

#### Send a Query

The NeMo Framework provides NemoQueryMultimodal APIs to send a query to the Triton server for convenience.

1. Run the following command:

    ```python
    from nemo_deploy.multimodal import NemoQueryMultimodal

-    nq = NemoQueryMultimodal(url="localhost:8000", model_name="neva", model_type="neva")
+    nq = NemoQueryMultimodal(url="localhost:8000", model_name="mllama", model_type="mllama")
    output = nq.query(input_text="What is in this image?", input_media="/opt/data/image.jpg", max_output_len=30, top_k=1, top_p=0.0, temperature=1.0)
    print(output)
    ```
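If you need to run the same prompt over several images, a small client-side loop around the query API shown above is usually enough. The snippet below is a minimal sketch that only uses the ``NemoQueryMultimodal`` call documented in this section; it assumes the Triton server from the deployment example above is still running, and the second image path is a placeholder you should replace with a real file.

```python
from nemo_deploy.multimodal import NemoQueryMultimodal

# Placeholder image paths; replace them with files that exist on your system.
image_paths = ["/opt/data/image.jpg", "/opt/data/image2.jpg"]

# Reuse a single client object for all requests to the running Triton server.
nq = NemoQueryMultimodal(url="localhost:8000", model_name="mllama", model_type="mllama")

for path in image_paths:
    # Each call sends one prompt/image pair with the same sampling
    # parameters as the single-query example above.
    output = nq.query(
        input_text="What is in this image?",
        input_media=path,
        max_output_len=30,
        top_k=1,
        top_p=0.0,
        temperature=1.0,
    )
    print(path, output)
```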
From 6f39939c58c11504c8a815058cc82d93a220720e Mon Sep 17 00:00:00 2001
From: Bobby Chen
Date: Tue, 4 Nov 2025 01:23:54 +0000
Subject: [PATCH 2/3] Modify nemo2 ckpt example

---
 docs/mm/nemo_2/gen_nemo2_ckpt.md | 31 ++++++++++++++++++++++++++++++-
 1 file changed, 30 insertions(+), 1 deletion(-)

diff --git a/docs/mm/nemo_2/gen_nemo2_ckpt.md b/docs/mm/nemo_2/gen_nemo2_ckpt.md
index e426034e6e..58f80e0fc3 100644
--- a/docs/mm/nemo_2/gen_nemo2_ckpt.md
+++ b/docs/mm/nemo_2/gen_nemo2_ckpt.md
@@ -2,6 +2,8 @@

To run the code examples, you will need a NeMo 2.0 checkpoint. Follow the steps below to generate a NeMo 2.0 checkpoint, which you can then use to test the export and deployment workflows for NeMo 2.0 models.

## Setup

1. Pull down and run the [NeMo Framework](https://catalog.ngc.nvidia.com/orgs/nvidia/containers/nemo) Docker container image using the command shown below. Change the ``:vr`` tag to the version of the container you want to use:

    ```shell

2. Log in to Hugging Face:

    ```shell
    huggingface-cli login
    ```

## Generate Qwen VL Checkpoint (for In-Framework Deployment)

This checkpoint is used for in-framework deployment examples.

3. Run the following Python code to generate the NeMo 2.0 checkpoint:

    ```python
    from nemo.collections.llm import import_ckpt
    from nemo.collections import vlm
    from pathlib import Path

    if __name__ == '__main__':
        # Specify the Hugging Face model ID
        hf_model_id = "Qwen/Qwen2.5-VL-3B-Instruct"

        # Import the model and convert to NeMo 2.0 format
        import_ckpt(
            model=vlm.Qwen2VLModel(vlm.Qwen2VLConfig3BInstruct()),
            source=f"hf://{hf_model_id}",  # Hugging Face model source
            output_path=Path('/opt/checkpoints/qwen2_vl_3b')
        )
    ```

## Generate Llama 3.2-Vision Checkpoint (for TensorRT-LLM Deployment)

This checkpoint is used for optimized TensorRT-LLM deployment examples.

3. Run the following Python code to generate the NeMo 2.0 checkpoint:

    ```python

From f0799fed28100fcf6a07c85f95bbcdc75abdc98a Mon Sep 17 00:00:00 2001
From: Bobby Chen
Date: Tue, 4 Nov 2025 01:27:31 +0000
Subject: [PATCH 3/3] Change qwen2 to qwen25

---
 docs/mm/nemo_2/gen_nemo2_ckpt.md | 4 ++--
 docs/mm/nemo_2/in-framework.md   | 6 +++---
 2 files changed, 5 insertions(+), 5 deletions(-)

diff --git a/docs/mm/nemo_2/gen_nemo2_ckpt.md b/docs/mm/nemo_2/gen_nemo2_ckpt.md
index 58f80e0fc3..11d71f262d 100644
--- a/docs/mm/nemo_2/gen_nemo2_ckpt.md
+++ b/docs/mm/nemo_2/gen_nemo2_ckpt.md
@@ -35,9 +35,9 @@ This checkpoint is used for in-framework deployment examples.

        # Import the model and convert to NeMo 2.0 format
        import_ckpt(
-            model=vlm.Qwen2VLModel(vlm.Qwen2VLConfig3BInstruct()),
+            model=vlm.Qwen2VLModel(vlm.Qwen25VLConfig3B(), model_version='qwen25-vl'),
             source=f"hf://{hf_model_id}",  # Hugging Face model source
-            output_path=Path('/opt/checkpoints/qwen2_vl_3b')
+            output_path=Path('/opt/checkpoints/qwen25_vl_3b')
        )
    ```
diff --git a/docs/mm/nemo_2/in-framework.md b/docs/mm/nemo_2/in-framework.md
index 51cf87c9fb..186f85de60 100644
--- a/docs/mm/nemo_2/in-framework.md
+++ b/docs/mm/nemo_2/in-framework.md
@@ -6,7 +6,7 @@ This section explains how to deploy [NeMo 2.0](https://github.com/NVIDIA-NeMo/Ne

1. Follow the steps on the [Generate A NeMo 2.0 Checkpoint page](gen_nemo2_ckpt.md) to generate a NeMo 2.0 multimodal checkpoint.

-2. In a terminal, go to the folder where the ``qwen2_vl_3b`` is located. Pull and run the Docker container image using the command shown below. Change the ``:vr`` tag to the version of the container you want to use:
+2. In a terminal, go to the folder where the ``qwen25_vl_3b`` is located. Pull and run the Docker container image using the command shown below. Change the ``:vr`` tag to the version of the container you want to use:

    ```shell
    docker pull nvcr.io/nvidia/nemo:vr

@@ -21,7 +21,7 @@ This section explains how to deploy [NeMo 2.0](https://github.com/NVIDIA-NeMo/Ne

3. Using a NeMo 2.0 multimodal model, run the following deployment script to verify that everything is working correctly.
The script directly serves the NeMo 2.0 model on the Triton server: ```shell - python /opt/Export-Deploy/scripts/deploy/multimodal/deploy_inframework_triton.py --nemo_checkpoint /opt/checkpoints/qwen2_vl_3b --triton_model_name qwen + python /opt/Export-Deploy/scripts/deploy/multimodal/deploy_inframework_triton.py --nemo_checkpoint /opt/checkpoints/qwen25_vl_3b --triton_model_name qwen ``` 4. If the test yields a shared memory-related error, increase the shared memory size using ``--shm-size`` (for example, gradually by 50%). @@ -55,7 +55,7 @@ Executing the script will directly deploy the NeMo 2.0 multimodal model and star 2. To begin serving the downloaded model, run the following script: ```shell - python /opt/Export-Deploy/scripts/deploy/multimodal/deploy_inframework_triton.py --nemo_checkpoint /opt/checkpoints/qwen2_vl_3b --triton_model_name qwen + python /opt/Export-Deploy/scripts/deploy/multimodal/deploy_inframework_triton.py --nemo_checkpoint /opt/checkpoints/qwen25_vl_3b --triton_model_name qwen ``` The following parameters are defined in the ``deploy_inframework_triton.py`` script: