diff --git a/examples/nlp_and_llms/cpu-small-nlp/README.md b/examples/nlp_and_llms/cpu-small-nlp/README.md new file mode 100644 index 00000000..91421224 --- /dev/null +++ b/examples/nlp_and_llms/cpu-small-nlp/README.md @@ -0,0 +1,65 @@ +# Small Transformer Inference (CPU Baseline) + +This template implements a **high-efficiency CPU inference** workflow for Natural Language Processing (NLP). It uses **DistilBERT**, a smaller, faster version of BERT, and demonstrates how to further optimize it using **Dynamic Quantization** to achieve production-grade performance without GPUs. + +**Infrastructure:** [Saturn Cloud](https://saturncloud.io/) +**Resource:** Jupyter Notebook +**Hardware:** CPU +**Tech Stack:** PyTorch, Hugging Face Transformers, Scikit-Learn + +--- + +## ๐ Overview + +Deploying massive Large Language Models (LLMs) often requires expensive GPUs. However, for specific enterprise tasks like **Sentiment Analysis** or **Named Entity Recognition (NER)**, smaller "distilled" transformers running on standard CPUs are often sufficient, faster, and significantly cheaper. + +This template provides a **CPU-optimized baseline**: +1. **Sentiment Analysis:** Using `distilbert-base-uncased`. +2. **Named Entity Recognition (NER):** Using `distilbert-base-cased`. +3. **Optimization:** Applies PyTorch **Dynamic Quantization** to boost inference speed by ~2x and reduce memory usage by ~40%. + +--- + +## ๐ Quick Start + +### 1. Workflow + +1. Open **`small_transformer_cpu.ipynb`** in the Jupyter interface. +2. **Run All Cells**: +* **Install:** Sets up `transformers` and `torch` in the current environment. +* **Download:** Fetches the public DistilBERT model (no login required). +* **Benchmark (FP32):** Measures the baseline latency of the standard 32-bit floating point model. +* **Quantize (INT8):** Converts the model weights to 8-bit integers on the fly. +* **Compare:** Validates the speedup (typically **1.5x - 2.0x faster**). + +--- + +## ๐ง Architecture: "Distill & Quantize" + +We use a two-step optimization strategy to ensure the model runs efficiently on commodity hardware. + +### 1. Distillation + +We use **DistilBERT**, which acts as a student model trained to mimic the behavior of the larger BERT model. + +* **40% fewer parameters** than BERT. +* **60% faster** inference. +* **97% retained accuracy** on standard benchmarks. + +### 2. Dynamic Quantization + +Standard models store weights as 32-bit floating point numbers (FP32). This template uses **Dynamic Quantization** to convert the linear layer weights to **8-bit integers (INT8)**. + +* **Size Reduction:** The model file shrinks by ~40% (e.g., 255MB โ 130MB). +* **Speedup:** CPUs can process 8-bit integer math significantly faster than 32-bit float math, resulting in lower latency per request. + +--- + +## ๐ Conclusion + +This template proves that you don't always need a GPU for NLP. For targeted tasks, a quantized DistilBERT on a modern CPU can handle hundreds of requests per second with minimal cost. + +To scale this solutionโfor example, processing millions of documents or deploying this as a serverless APIโconsider moving this workload to a [Saturn Cloud](https://saturncloud.io/) CPU cluster. 
+ +``` + diff --git a/examples/nlp_and_llms/cpu-small-nlp/setup.sh b/examples/nlp_and_llms/cpu-small-nlp/setup.sh new file mode 100755 index 00000000..be732959 --- /dev/null +++ b/examples/nlp_and_llms/cpu-small-nlp/setup.sh @@ -0,0 +1,34 @@ +#!/bin/bash +set -e + +GREEN='\033[0;32m' +NC='\033[0m' + +echo -e "${GREEN}๐ Starting Small Transformer Setup...${NC}" + +# 1. Robust Python Detection +if command -v python3 &> /dev/null; then + PY_CMD="python3" +elif command -v python &> /dev/null; then + PY_CMD="python" +else + echo "โ Error: Could not find 'python3' or 'python' in your PATH." + exit 1 +fi + +# 2. Create Virtual Environment +echo "๐ฆ Creating Virtual Environment 'venv'..." +$PY_CMD -m venv venv + +# 3. Install Dependencies +echo "โฌ๏ธ Installing libraries..." +. venv/bin/activate +pip install --upgrade pip +# Core stack: PyTorch (CPU), Transformers (Hugging Face), Scikit-Learn (Metrics) +pip install torch transformers scikit-learn numpy pandas + +echo -e "${GREEN}โ Environment Ready!${NC}" +echo "-------------------------------------------------------" +echo "To generate the notebook:" +echo " $PY_CMD generate_notebook.py" +echo "-------------------------------------------------------" \ No newline at end of file diff --git a/examples/nlp_and_llms/cpu-small-nlp/small_transformer_cpu.ipynb b/examples/nlp_and_llms/cpu-small-nlp/small_transformer_cpu.ipynb new file mode 100644 index 00000000..2bdaa87e --- /dev/null +++ b/examples/nlp_and_llms/cpu-small-nlp/small_transformer_cpu.ipynb @@ -0,0 +1,449 @@ +{ + "cells": [ + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "# โก Small Transformer Inference (CPU Baseline)\n", + "\n", + "This notebook demonstrates how to achieve **high-performance inference** on a CPU using DistilBERT and **Dynamic Quantization**.\n", + "\n", + "**Tasks:**\n", + "1. **Sentiment Analysis**: Classifying text as Positive/Negative.\n", + "2. **NER**: Extracting entities (Names, Locations) from text.\n", + "3. **Optimization**: Quantizing the model to `int8`." + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "# 1. Install Dependencies\n", + "%pip install torch transformers numpy pandas\n", + "\n", + "import torch\n", + "import time\n", + "import os\n", + "import pandas as pd\n", + "from transformers import AutoTokenizer, AutoModelForSequenceClassification, pipeline\n", + "\n", + "# ๐ง CPU Optimization: Control Threads\n", + "# Setting this to the number of physical cores is usually best for latency.\n", + "torch.set_num_threads(os.cpu_count())\n", + "print(f\"โ Threads set to: {torch.get_num_threads()}\")" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "## 1. Sentiment Analysis Baseline (FP32)\n", + "We load `distilbert-base-uncased-finetuned-sst-2-english`. It is a standard baseline for sentiment." 
+ ] + }, + { + "cell_type": "code", + "execution_count": 4, + "metadata": {}, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "โฌ๏ธ Downloading distilbert-base-uncased-finetuned-sst-2-english...\n" + ] + }, + { + "data": { + "application/vnd.jupyter.widget-view+json": { + "model_id": "e7d674d4cb61417784587c5d248c1735", + "version_major": 2, + "version_minor": 0 + }, + "text/plain": [ + "Loading weights: 0%| | 0/104 [00:00, ?it/s]" + ] + }, + "metadata": {}, + "output_type": "display_data" + }, + { + "data": { + "application/vnd.jupyter.widget-view+json": { + "model_id": "3246f296053a49e6bdece9d6292408ab", + "version_major": 2, + "version_minor": 0 + }, + "text/plain": [ + "Writing model shards: 0%| | 0/1 [00:00, ?it/s]" + ] + }, + "metadata": {}, + "output_type": "display_data" + }, + { + "name": "stdout", + "output_type": "stream", + "text": [ + "โ Loaded & Saved to './model_fp32' (Size: 255.4 MB)\n" + ] + } + ], + "source": [ + "MODEL_NAME = \"distilbert-base-uncased-finetuned-sst-2-english\"\n", + "\n", + "# Load Model & Tokenizer\n", + "print(f\"โฌ๏ธ Downloading {MODEL_NAME}...\")\n", + "tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)\n", + "model_fp32 = AutoModelForSequenceClassification.from_pretrained(MODEL_NAME)\n", + "\n", + "# Save model locally so we can accurately measure its size\n", + "model_fp32.save_pretrained(\"./model_fp32\")\n", + "tokenizer.save_pretrained(\"./model_fp32\")\n", + "\n", + "# FIX: Check for either .bin (standard) or .safetensors (newer default)\n", + "if os.path.exists(\"./model_fp32/pytorch_model.bin\"):\n", + " weights_path = \"./model_fp32/pytorch_model.bin\"\n", + "elif os.path.exists(\"./model_fp32/model.safetensors\"):\n", + " weights_path = \"./model_fp32/model.safetensors\"\n", + "else:\n", + " raise FileNotFoundError(\"Could not find model weights file (.bin or .safetensors)\")\n", + "\n", + "file_size = os.path.getsize(weights_path) / 1024**2\n", + "print(f\"โ Loaded & Saved to './model_fp32' (Size: {file_size:.1f} MB)\")" + ] + }, + { + "cell_type": "code", + "execution_count": 5, + "metadata": {}, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "โฑ๏ธ Standard (FP32) Latency: 48.62 ms\n" + ] + } + ], + "source": [ + "# Benchmark Function\n", + "def benchmark_model(model, text, steps=50):\n", + " inputs = tokenizer(text, return_tensors=\"pt\")\n", + " \n", + " # Warmup\n", + " for _ in range(5):\n", + " _ = model(**inputs)\n", + " \n", + " # Timing\n", + " start = time.time()\n", + " for _ in range(steps):\n", + " with torch.no_grad():\n", + " _ = model(**inputs)\n", + " end = time.time()\n", + " \n", + " avg_time = (end - start) / steps * 1000\n", + " return avg_time\n", + "\n", + "sample_text = \"Saturn Cloud makes scaling machine learning workloads incredibly easy and efficient.\"\n", + "time_fp32 = benchmark_model(model_fp32, sample_text)\n", + "print(f\"โฑ๏ธ Standard (FP32) Latency: {time_fp32:.2f} ms\")" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "## 2. Dynamic Quantization (INT8)\n", + "We use `torch.quantization.quantize_dynamic` to convert the Linear layers to 8-bit integers. This requires **no retraining**." + ] + }, + { + "cell_type": "code", + "execution_count": 6, + "metadata": {}, + "outputs": [ + { + "name": "stderr", + "output_type": "stream", + "text": [ + "/tmp/ipykernel_548476/1943256708.py:1: DeprecationWarning: torch.ao.quantization is deprecated and will be removed in 2.10. \n", + "For migrations of users: \n", + "1. 
Eager mode quantization (torch.ao.quantization.quantize, torch.ao.quantization.quantize_dynamic), please migrate to use torchao eager mode quantize_ API instead \n", + "2. FX graph mode quantization (torch.ao.quantization.quantize_fx.prepare_fx,torch.ao.quantization.quantize_fx.convert_fx, please migrate to use torchao pt2e quantization API instead (prepare_pt2e, convert_pt2e) \n", + "3. pt2e quantization has been migrated to torchao (https://github.com/pytorch/ao/tree/main/torchao/quantization/pt2e) \n", + "see https://github.com/pytorch/ao/issues/2259 for more details\n", + " model_int8 = torch.quantization.quantize_dynamic(\n" + ] + }, + { + "name": "stdout", + "output_type": "stream", + "text": [ + "๐ Quantized Model Size: 132.3 MB\n" + ] + } + ], + "source": [ + "model_int8 = torch.quantization.quantize_dynamic(\n", + " model_fp32, \n", + " {torch.nn.Linear}, # We only quantize the heavy Linear layers\n", + " dtype=torch.qint8\n", + ")\n", + "\n", + "# Verify size reduction\n", + "torch.save(model_int8.state_dict(), \"quantized_model.pt\")\n", + "size_int8 = os.path.getsize(\"quantized_model.pt\") / 1024**2\n", + "print(f\"๐ Quantized Model Size: {size_int8:.1f} MB\")" + ] + }, + { + "cell_type": "code", + "execution_count": 7, + "metadata": {}, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "โฑ๏ธ Quantized (INT8) Latency: 33.94 ms\n", + "๐ Speedup: 1.43x faster\n" + ] + } + ], + "source": [ + "# Benchmark Quantized Model\n", + "time_int8 = benchmark_model(model_int8, sample_text)\n", + "print(f\"โฑ๏ธ Quantized (INT8) Latency: {time_int8:.2f} ms\")\n", + "\n", + "speedup = time_fp32 / time_int8\n", + "print(f\"๐ Speedup: {speedup:.2f}x faster\")" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "## 3. NER Task (Token Classification)\n", + "Switching tasks is as easy as changing the pipeline model. We use `dslim/bert-base-NER` (or a smaller DistilBERT variant if available) for Named Entity Recognition." + ] + }, + { + "cell_type": "code", + "execution_count": 8, + "metadata": {}, + "outputs": [ + { + "data": { + "application/vnd.jupyter.widget-view+json": { + "model_id": "e5663f382557412097f12a4f57f7987e", + "version_major": 2, + "version_minor": 0 + }, + "text/plain": [ + "config.json: 0%| | 0.00/829 [00:00, ?B/s]" + ] + }, + "metadata": {}, + "output_type": "display_data" + }, + { + "name": "stderr", + "output_type": "stream", + "text": [ + "Warning: You are sending unauthenticated requests to the HF Hub. 
Please set a HF_TOKEN to enable higher rate limits and faster downloads.\n" + ] + }, + { + "data": { + "application/vnd.jupyter.widget-view+json": { + "model_id": "88a1ac3c9cf1417d87519716f92c7f8f", + "version_major": 2, + "version_minor": 0 + }, + "text/plain": [ + "model.safetensors: 0%| | 0.00/433M [00:00, ?B/s]" + ] + }, + "metadata": {}, + "output_type": "display_data" + }, + { + "data": { + "application/vnd.jupyter.widget-view+json": { + "model_id": "c69cb041d69e40fa9939dca257c715ef", + "version_major": 2, + "version_minor": 0 + }, + "text/plain": [ + "Loading weights: 0%| | 0/199 [00:00, ?it/s]" + ] + }, + "metadata": {}, + "output_type": "display_data" + }, + { + "name": "stderr", + "output_type": "stream", + "text": [ + "BertForTokenClassification LOAD REPORT from: dslim/bert-base-NER\n", + "Key | Status | | \n", + "-------------------------+------------+--+-\n", + "bert.pooler.dense.weight | UNEXPECTED | | \n", + "bert.pooler.dense.bias | UNEXPECTED | | \n", + "\n", + "Notes:\n", + "- UNEXPECTED\t:can be ignored when loading from different task/architecture; not ok if you expect identical arch.\n" + ] + }, + { + "data": { + "application/vnd.jupyter.widget-view+json": { + "model_id": "bcee4695a5d74155bf790a06e0436e62", + "version_major": 2, + "version_minor": 0 + }, + "text/plain": [ + "tokenizer_config.json: 0%| | 0.00/59.0 [00:00, ?B/s]" + ] + }, + "metadata": {}, + "output_type": "display_data" + }, + { + "data": { + "application/vnd.jupyter.widget-view+json": { + "model_id": "ca736329308444f988d4fae6c19d7639", + "version_major": 2, + "version_minor": 0 + }, + "text/plain": [ + "vocab.txt: 0.00B [00:00, ?B/s]" + ] + }, + "metadata": {}, + "output_type": "display_data" + }, + { + "data": { + "application/vnd.jupyter.widget-view+json": { + "model_id": "84025cc2ff414f9bb75179e246bfa13f", + "version_major": 2, + "version_minor": 0 + }, + "text/plain": [ + "added_tokens.json: 0%| | 0.00/2.00 [00:00, ?B/s]" + ] + }, + "metadata": {}, + "output_type": "display_data" + }, + { + "data": { + "application/vnd.jupyter.widget-view+json": { + "model_id": "6bbe850ba1344b728fdc2dd67fbd8a91", + "version_major": 2, + "version_minor": 0 + }, + "text/plain": [ + "special_tokens_map.json: 0%| | 0.00/112 [00:00, ?B/s]" + ] + }, + "metadata": {}, + "output_type": "display_data" + }, + { + "data": { + "text/html": [ + "
| \n", + " | word | \n", + "entity_group | \n", + "score | \n", + "
|---|---|---|---|
| 0 | \n", + "Apple Inc | \n", + "ORG | \n", + "0.999508 | \n", + "
| 1 | \n", + "Abuja | \n", + "LOC | \n", + "0.998583 | \n", + "
| 2 | \n", + "Nigeria | \n", + "LOC | \n", + "0.999648 | \n", + "
| Step | \n","Training Loss | \n","
|---|---|
| 25 | \n","0.000000 | \n","
| 50 | \n","0.000000 | \n","
| 75 | \n","0.000000 | \n","
| 100 | \n","0.000000 | \n","
| 125 | \n","0.000000 | \n","
| 150 | \n","0.000000 | \n","
| 175 | \n","0.000000 | \n","
| 200 | \n","0.000000 | \n","
| 225 | \n","0.000000 | \n","
| 250 | \n","0.000000 | \n","
| 275 | \n","0.000000 | \n","
| 300 | \n","0.000000 | \n","
| 325 | \n","0.000000 | \n","
| 350 | \n","0.000000 | \n","
| 375 | \n","0.000000 | \n","
| 400 | \n","0.000000 | \n","
| 425 | \n","0.000000 | \n","
| 450 | \n","0.000000 | \n","
| 475 | \n","0.000000 | \n","
| 500 | \n","0.000000 | \n","
"]},"metadata":{}},{"output_type":"stream","name":"stdout","text":["โ
Training complete!\n"]}],"source":["import torch\n","from transformers import DataCollatorForSeq2Seq, Seq2SeqTrainingArguments, Seq2SeqTrainer\n","\n","# Prepare data collator\n","data_collator = DataCollatorForSeq2Seq(tokenizer, model=model)\n","\n","# Define training arguments\n","args = Seq2SeqTrainingArguments(\n"," output_dir=\"outputs-lora\",\n"," per_device_train_batch_size=2,\n"," per_device_eval_batch_size=2,\n"," learning_rate=2e-4,\n"," num_train_epochs=1,\n"," save_strategy=\"epoch\",\n"," logging_steps=25,\n"," predict_with_generate=True,\n"," fp16=torch.cuda.is_available(), # Use mixed precision if GPU supports it\n"," report_to=[], # disables online tracking (no API needed)\n",")\n","\n","# Initialise trainer\n","trainer = Seq2SeqTrainer(\n"," model=model,\n"," args=args,\n"," train_dataset=train_tok,\n"," eval_dataset=eval_tok,\n"," tokenizer=tokenizer,\n"," data_collator=data_collator,\n",")\n","\n","print(\"๐ Starting fine-tuningโฆ\")\n","trainer.train()\n","print(\"โ
Training complete!\")"]},{"cell_type":"markdown","id":"cb3261ba-fd89-42f8-8cbc-b9391b859ee6","metadata":{"id":"cb3261ba-fd89-42f8-8cbc-b9391b859ee6"},"source":["Let's test the fine-tuned model to verify that it can generate meaningful summaries. It performs a full inference pass using the model and tokenizer."]},{"cell_type":"code","execution_count":16,"id":"f86f32e1-49c1-426e-b013-3156cb6d6e4f","metadata":{"jp-MarkdownHeadingCollapsed":true,"colab":{"base_uri":"https://localhost:8080/"},"id":"f86f32e1-49c1-426e-b013-3156cb6d6e4f","executionInfo":{"status":"ok","timestamp":1761300634308,"user_tz":-60,"elapsed":233,"user":{"displayName":"Durojaye Olusegun","userId":"09188621512197003284"}},"outputId":"057ca513-731d-438a-a6d3-c41225bfa966"},"outputs":[{"output_type":"stream","name":"stdout","text":["\n","๐ง Fine-tuned Model Output:\n","\n","Bob and Alice discuss the museum's history.\n"]}],"source":["test_input = \"Write a brief summary: Alice and Bob discussed weekend plans. Bob suggested hiking, but Alice preferred visiting the museum.\"\n","\n","# Tokenise and move to model device\n","inputs = tokenizer(test_input, return_tensors=\"pt\", truncation=True, padding=True).to(model.device)\n","\n","# Generate output\n","outputs = model.generate(**inputs, max_new_tokens=80)\n","\n","# Decode and display\n","print(\"\\n๐ง Fine-tuned Model Output:\\n\")\n","print(tokenizer.decode(outputs[0], skip_special_tokens=True))\n"]},{"cell_type":"markdown","id":"3ee4b6cb-1684-49ca-9cc2-74609bf610bd","metadata":{"id":"3ee4b6cb-1684-49ca-9cc2-74609bf610bd"},"source":["This allows interactively test the fine-tuned model with your own custom input."]},{"cell_type":"code","execution_count":17,"id":"3bad36a0-89b4-484d-953c-7371d83cfff6","metadata":{"colab":{"base_uri":"https://localhost:8080/"},"id":"3bad36a0-89b4-484d-953c-7371d83cfff6","executionInfo":{"status":"ok","timestamp":1761300740710,"user_tz":-60,"elapsed":106374,"user":{"displayName":"Durojaye Olusegun","userId":"09188621512197003284"}},"outputId":"cee233ae-58d2-42ac-90a9-3e49430bc355"},"outputs":[{"output_type":"stream","name":"stdout","text":["๐ฌ Try your own prompt!\n","\n","Enter a text or paragraph you'd like the model to summarise: what is it doing \n","\n","๐งฉ Model Output:\n","\n","It is doing it doing it doing it\n"]}],"source":["print(\"๐ฌ Try your own prompt!\")\n","\n","user_prompt = input(\"\\nEnter a text or paragraph you'd like the model to summarise: \")\n","\n","# Tokenise user prompt\n","inputs = tokenizer(user_prompt, return_tensors=\"pt\", truncation=True, padding=True).to(model.device)\n","\n","# Generate output\n","outputs = model.generate(**inputs, max_new_tokens=80)\n","\n","# Decode and print\n","print(\"\\n๐งฉ Model Output:\\n\")\n","print(tokenizer.decode(outputs[0], skip_special_tokens=True))\n"]},{"cell_type":"markdown","id":"a0a3c84e-2d27-46ad-9356-95e2ef9a598b","metadata":{"id":"a0a3c84e-2d27-46ad-9356-95e2ef9a598b"},"source":["In this template, you fine-tuned **Googleโs FLAN-T5-Small** model using **LoRA (Low-Rank Adaptation)** with the **PEFT** library โ a modern, lightweight approach to large language model adaptation.\n","\n","Running this workflow on **Saturn Cloud** makes it both **scalable and cost-effective**. 
Saturn Cloudโs managed infrastructure allows you to:\n","\n","* Start with a **single NVIDIA GPU** for experimentation and scale up to multi-GPU clusters for larger models.\n","* Collaborate across teams easily through shared Jupyter environments.\n","* Integrate this fine-tuning workflow into production pipelines for enterprise-ready deployment.\n","\n","By using this template, you now have a complete, ready-to-run foundation for **adapter-based fine-tuning** in Saturn Cloud โ ideal for tasks like summarisation, translation, or instruction-following with minimal resource use.\n","\n","To continue exploring, check out:\n","\n","* [Saturn Cloud Documentation](https://saturncloud.io/docs/) โ for advanced configuration and GPU scaling.\n","* [Saturn Cloud Templates](https://saturncloud.io/resources/templates/) โ for more examples of ML, LLM, and data science workflows."]}],"metadata":{"kernelspec":{"display_name":"Python 3","name":"python3"},"language_info":{"codemirror_mode":{"name":"ipython","version":3},"file_extension":".py","mimetype":"text/x-python","name":"python","nbconvert_exporter":"python","pygments_lexer":"ipython3","version":"3.13.7"},"colab":{"provenance":[],"gpuType":"T4"},"accelerator":"GPU","widgets":{"application/vnd.jupyter.widget-state+json":{"9db3a5ac0dd84249a2b236b96c58aad8":{"model_module":"@jupyter-widgets/controls","model_name":"HBoxModel","model_module_version":"1.5.0","state":{"_dom_classes":[],"_model_module":"@jupyter-widgets/controls","_model_module_version":"1.5.0","_model_name":"HBoxModel","_view_count":null,"_view_module":"@jupyter-widgets/controls","_view_module_version":"1.5.0","_view_name":"HBoxView","box_style":"","children":["IPY_MODEL_2c821f95cbf94e6f972651544b51bacf","IPY_MODEL_69e89bf8eace41aa850498fd3fd61f99","IPY_MODEL_3aaca7366ecb47d8b4ac27b6301aa91b"],"layout":"IPY_MODEL_48ba285de8364e65a380add6e08e4d69"}},"2c821f95cbf94e6f972651544b51bacf":{"model_module":"@jupyter-widgets/controls","model_name":"HTMLModel","model_module_version":"1.5.0","state":{"_dom_classes":[],"_model_module":"@jupyter-widgets/controls","_model_module_version":"1.5.0","_model_name":"HTMLModel","_view_count":null,"_view_module":"@jupyter-widgets/controls","_view_module_version":"1.5.0","_view_name":"HTMLView","description":"","description_tooltip":null,"layout":"IPY_MODEL_29dfb08a2a1d43b3878cb8a98b285b09","placeholder":"โ","style":"IPY_MODEL_4edaefbb46844f8ba1583f63c20f9ccf","value":"Map:โ100%"}},"69e89bf8eace41aa850498fd3fd61f99":{"model_module":"@jupyter-widgets/controls","model_name":"FloatProgressModel","model_module_version":"1.5.0","state":{"_dom_classes":[],"_model_module":"@jupyter-widgets/controls","_model_module_version":"1.5.0","_model_name":"FloatProgressModel","_view_count":null,"_view_module":"@jupyter-widgets/controls","_view_module_version":"1.5.0","_view_name":"ProgressView","bar_style":"success","description":"","description_tooltip":null,"layout":"IPY_MODEL_168534f6a2f3457b8dfa29da5aa15d6a","max":200,"min":0,"orientation":"horizontal","style":"IPY_MODEL_3e54568d0ae94350a1a461a6b1cc3423","value":200}},"3aaca7366ecb47d8b4ac27b6301aa91b":{"model_module":"@jupyter-widgets/controls","model_name":"HTMLModel","model_module_version":"1.5.0","state":{"_dom_classes":[],"_model_module":"@jupyter-widgets/controls","_model_module_version":"1.5.0","_model_name":"HTMLModel","_view_count":null,"_view_module":"@jupyter-widgets/controls","_view_module_version":"1.5.0","_view_name":"HTMLView","description":"","description_tooltip":null,"layout":"IPY_MODEL_bd8316fe2cc24289bf8d39a
b6f065e43","placeholder":"โ","style":"IPY_MODEL_d802453c7a484c89897a30b8ddde157b","value":"โ200/200โ[00:13<00:00,โ15.11โexamples/s]"}},"48ba285de8364e65a380add6e08e4d69":{"model_module":"@jupyter-widgets/base","model_name":"LayoutModel","model_module_version":"1.2.0","state":{"_model_module":"@jupyter-widgets/base","_model_module_version":"1.2.0","_model_name":"LayoutModel","_view_count":null,"_view_module":"@jupyter-widgets/base","_view_module_version":"1.2.0","_view_name":"LayoutView","align_content":null,"align_items":null,"align_self":null,"border":null,"bottom":null,"display":null,"flex":null,"flex_flow":null,"grid_area":null,"grid_auto_columns":null,"grid_auto_flow":null,"grid_auto_rows":null,"grid_column":null,"grid_gap":null,"grid_row":null,"grid_template_areas":null,"grid_template_columns":null,"grid_template_rows":null,"height":null,"justify_content":null,"justify_items":null,"left":null,"margin":null,"max_height":null,"max_width":null,"min_height":null,"min_width":null,"object_fit":null,"object_position":null,"order":null,"overflow":null,"overflow_x":null,"overflow_y":null,"padding":null,"right":null,"top":null,"visibility":null,"width":null}},"29dfb08a2a1d43b3878cb8a98b285b09":{"model_module":"@jupyter-widgets/base","model_name":"LayoutModel","model_module_version":"1.2.0","state":{"_model_module":"@jupyter-widgets/base","_model_module_version":"1.2.0","_model_name":"LayoutModel","_view_count":null,"_view_module":"@jupyter-widgets/base","_view_module_version":"1.2.0","_view_name":"LayoutView","align_content":null,"align_items":null,"align_self":null,"border":null,"bottom":null,"display":null,"flex":null,"flex_flow":null,"grid_area":null,"grid_auto_columns":null,"grid_auto_flow":null,"grid_auto_rows":null,"grid_column":null,"grid_gap":null,"grid_row":null,"grid_template_areas":null,"grid_template_columns":null,"grid_template_rows":null,"height":null,"justify_content":null,"justify_items":null,"left":null,"margin":null,"max_height":null,"max_width":null,"min_height":null,"min_width":null,"object_fit":null,"object_position":null,"order":null,"overflow":null,"overflow_x":null,"overflow_y":null,"padding":null,"right":null,"top":null,"visibility":null,"width":null}},"4edaefbb46844f8ba1583f63c20f9ccf":{"model_module":"@jupyter-widgets/controls","model_name":"DescriptionStyleModel","model_module_version":"1.5.0","state":{"_model_module":"@jupyter-widgets/controls","_model_module_version":"1.5.0","_model_name":"DescriptionStyleModel","_view_count":null,"_view_module":"@jupyter-widgets/base","_view_module_version":"1.2.0","_view_name":"StyleView","description_width":""}},"168534f6a2f3457b8dfa29da5aa15d6a":{"model_module":"@jupyter-widgets/base","model_name":"LayoutModel","model_module_version":"1.2.0","state":{"_model_module":"@jupyter-widgets/base","_model_module_version":"1.2.0","_model_name":"LayoutModel","_view_count":null,"_view_module":"@jupyter-widgets/base","_view_module_version":"1.2.0","_view_name":"LayoutView","align_content":null,"align_items":null,"align_self":null,"border":null,"bottom":null,"display":null,"flex":null,"flex_flow":null,"grid_area":null,"grid_auto_columns":null,"grid_auto_flow":null,"grid_auto_rows":null,"grid_column":null,"grid_gap":null,"grid_row":null,"grid_template_areas":null,"grid_template_columns":null,"grid_template_rows":null,"height":null,"justify_content":null,"justify_items":null,"left":null,"margin":null,"max_height":null,"max_width":null,"min_height":null,"min_width":null,"object_fit":null,"object_position":null,"order":null,"overflow":null,"over
flow_x":null,"overflow_y":null,"padding":null,"right":null,"top":null,"visibility":null,"width":null}},"3e54568d0ae94350a1a461a6b1cc3423":{"model_module":"@jupyter-widgets/controls","model_name":"ProgressStyleModel","model_module_version":"1.5.0","state":{"_model_module":"@jupyter-widgets/controls","_model_module_version":"1.5.0","_model_name":"ProgressStyleModel","_view_count":null,"_view_module":"@jupyter-widgets/base","_view_module_version":"1.2.0","_view_name":"StyleView","bar_color":null,"description_width":""}},"bd8316fe2cc24289bf8d39ab6f065e43":{"model_module":"@jupyter-widgets/base","model_name":"LayoutModel","model_module_version":"1.2.0","state":{"_model_module":"@jupyter-widgets/base","_model_module_version":"1.2.0","_model_name":"LayoutModel","_view_count":null,"_view_module":"@jupyter-widgets/base","_view_module_version":"1.2.0","_view_name":"LayoutView","align_content":null,"align_items":null,"align_self":null,"border":null,"bottom":null,"display":null,"flex":null,"flex_flow":null,"grid_area":null,"grid_auto_columns":null,"grid_auto_flow":null,"grid_auto_rows":null,"grid_column":null,"grid_gap":null,"grid_row":null,"grid_template_areas":null,"grid_template_columns":null,"grid_template_rows":null,"height":null,"justify_content":null,"justify_items":null,"left":null,"margin":null,"max_height":null,"max_width":null,"min_height":null,"min_width":null,"object_fit":null,"object_position":null,"order":null,"overflow":null,"overflow_x":null,"overflow_y":null,"padding":null,"right":null,"top":null,"visibility":null,"width":null}},"d802453c7a484c89897a30b8ddde157b":{"model_module":"@jupyter-widgets/controls","model_name":"DescriptionStyleModel","model_module_version":"1.5.0","state":{"_model_module":"@jupyter-widgets/controls","_model_module_version":"1.5.0","_model_name":"DescriptionStyleModel","_view_count":null,"_view_module":"@jupyter-widgets/base","_view_module_version":"1.2.0","_view_name":"StyleView","description_width":""}}}}},"nbformat":4,"nbformat_minor":5}
\ No newline at end of file
diff --git a/examples/nlp_and_llms/nvidia-nim-tgi/README.md b/examples/nlp_and_llms/nvidia-nim-tgi/README.md
new file mode 100644
index 00000000..1da88fb1
--- /dev/null
+++ b/examples/nlp_and_llms/nvidia-nim-tgi/README.md
@@ -0,0 +1,222 @@
+# ๐ NIM / TGI Server โ Drop-In API
+
+**Tech Stack:** NVIDIA NIM + TGI (Text Generation Inference)
+**Built for:** Saturn Cloud Custom Templates
+โก๏ธ [https://saturncloud.io/](https://saturncloud.io/)
+
+---
+
+## ๐ง Overview
+
+This template provides a **plug-and-play inference server** that supports **two interchangeable LLM backends**:
+
+| Backend | Description | Use Case |
+| -------------------- | ------------------------------------------------------------ | --------------------------------------------------------------- |
+| **NVIDIA NIM Cloud** | Fully hosted LLMs on NVIDIA's high-performance GPU cloud | High-accuracy, large models (Qwen 80B, Mistral, Nemotron, etc.) |
+| **Local TGI Server** | Lightweight local model running via HuggingFace Transformers | Fast prototyping, offline usage |
+
+The API exposes **the same unified interface** for both backends, so users can switch engines without changing frontend code.
+
+This is ideal for **Saturn Cloud Data Science workflows**, allowing teams to quickly integrate LLM inference inside their notebooks, pipelines, or applications.
+
+---
+
+# ๐ Project Structure
+
+```
+NIM-TGI-Server/
+โ
+โโโ server.py # Main FastAPI server (unified interface)
+โโโ backend_tgi.py # Local TGI backend (SmolLM)
+โโโ backend_nim.py # NVIDIA cloud backend
+โโโ cli.py # CLI tool (select backend from terminal)
+โโโ requirements.txt
+โโโ README.md # (this file)
+```
+
+---
+
+# โ๏ธ 1. Environment Setup
+
+## **Create and activate a virtual environment**
+
+### Linux / MacOS
+
+```bash
+python -m venv venv
+source venv/bin/activate
+```
+
+### Windows (PowerShell)
+
+```powershell
+python -m venv venv
+venv\Scripts\activate
+```
+
+---
+
+## **Install dependencies**
+
+```bash
+pip install -r requirements.txt
+```
+
+---
+
+# ๐ 2. Getting an NVIDIA NIM API Key
+
+To use the **NIM Cloud backend**, you need an **NVIDIA AI Foundation API Key**.
+
+### Steps:
+
+1. Visit:
+ ๐ [https://build.nvidia.com/explore/discover](https://build.nvidia.com/explore/discover)
+2. Sign in with NVIDIA account
+3. Open your "API Keys" panel
+4. Click **Create New API Key**
+5. Copy the key
+6. **Paste it into `backend_nim.py`** (or export it as `NVIDIA_API_KEY`), replacing the placeholder default in:
+
+```python
+API_KEY = os.getenv("NVIDIA_API_KEY", "nvapi-xxxxxxxxxxxxxxxxxxxx")
+```
+
+โ ๏ธ **Note:**
+This template currently embeds the key directly for simplicity, but in production you should store it in environment variables or a secret manager.
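+
+As an alternative, a minimal sketch of reading the key from an environment variable (`NVIDIA_API_KEY` is the variable name `backend_nim.py` checks) instead of hard-coding it:
+
+```python
+import os
+
+# Keep the key out of source control by exporting NVIDIA_API_KEY in your shell
+API_KEY = os.getenv("NVIDIA_API_KEY")
+if not API_KEY:
+    raise ValueError("NVIDIA_API_KEY is not set. Export it first!")
+```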
+
+---
+
+# ๐ง 3. Backend Models
+
+## **A. NVIDIA NIM Backend (Cloud)**
+
+* Model used: `qwen/qwen3-next-80b-a3b-instruct`
+* Endpoint: `https://integrate.api.nvidia.com/v1`
+* Requires API Key
+* Supports streaming + large prompts
+
+## **B. Local TGI Backend (Lightweight CPU/GPU)**
+
+* Model: `HuggingFaceTB/SmolLM-1.7B-Instruct`
+* Runs entirely inside Python (no Docker needed)
+* Great for local experimentation
+
+---
+
+# ๐ 4. Running the Server
+
+Start FastAPI server:
+
+```bash
+uvicorn server:app --reload
+```
+
+You'll see:
+
+```
+INFO: Uvicorn running on http://127.0.0.1:8000
+```
+
+---
+
+# ๐งช 5. Testing the Server
+
+## A. Test Local TGI Model
+
+**POST /chat/local**
+
+### Curl:
+
+```bash
+curl -X POST -F "prompt=Explain machine learning" http://localhost:8000/chat/local
+```
+
+### Expected Response:
+
+```json
+{
+ "backend": "tgi-local",
+ "response": "Machine learning is..."
+}
+```
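+
+If you prefer Python, a minimal sketch using `requests` (the form field name matches the `Form(...)` parameter in `server.py`):
+
+```python
+import requests
+
+# Send the prompt as form data, exactly like the curl example above
+r = requests.post("http://localhost:8000/chat/local", data={"prompt": "Explain machine learning"})
+print(r.json()["response"])
+```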
+
+---
+
+## B. Test NVIDIA NIM Model
+
+**POST /chat/nim**
+
+### Curl:
+
+```bash
+curl -X POST -F "prompt=Write a short poem" http://localhost:8000/chat/nim
+```
+
+### Streaming:
+
+```bash
+curl -N -X POST -F "prompt=Tell me a story" -F "stream=true" http://localhost:8000/chat/nim
+```
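+
+To consume the stream from Python, one possible sketch (again with `requests`, reading the response incrementally):
+
+```python
+import requests
+
+# stream=True keeps the connection open so chunks print as they arrive
+with requests.post(
+    "http://localhost:8000/chat/nim",
+    data={"prompt": "Tell me a story", "stream": "true"},
+    stream=True,
+) as r:
+    for chunk in r.iter_content(chunk_size=None, decode_unicode=True):
+        print(chunk, end="", flush=True)
+```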
+
+---
+
+# ๐ฅ๏ธ 6. Command-Line Interface (CLI)
+
+The template includes a **CLI wrapper**:
+
+### Local TGI:
+
+```bash
+python cli.py --backend local "Explain photosynthesis"
+```
+
+### NVIDIA NIM:
+
+```bash
+python cli.py --backend nim "Write 5 facts about Jupiter"
+```
+
+Streaming output works automatically.
+
+---
+
+# ๐ก 7. Using with Saturn Cloud
+
+This template is designed as a **plug-and-play server component** inside Saturn Cloud:
+
+* Run the server inside a Jupyter workspace
+* Use the API from notebooks or external apps
+* Swap between local inference (TGI) and cloud inference (NIM)
+* Ideal for ML research, RAG systems, agent development, and batch inference jobs
+
+Saturn Cloud provides scalable Jupyter environments with GPUs:
+๐ [https://saturncloud.io/](https://saturncloud.io/)
+
+---
+
+# โ๏ธ 8. Summary
+
+This template provides:
+
+### **โ A drop-in inference server**
+
+Supports both NVIDIA Cloud NIM and local TGI backends.
+
+### **โ Ready to use in Saturn Cloud**
+
+Works inside a GPU instance or CPU instance.
+
+### **โ Unified API**
+
+Same route structure for both engines.
+
+### **โ Full CLI + server support**
+
+### **โ Ideal foundation for:**
+
+* Chatbots
+* RAG pipelines
+* Model comparison apps
+* AI feature development
+* ML/DS experimentation
\ No newline at end of file
diff --git a/examples/nlp_and_llms/nvidia-nim-tgi/backend_nim.py b/examples/nlp_and_llms/nvidia-nim-tgi/backend_nim.py
new file mode 100644
index 00000000..8638bfc1
--- /dev/null
+++ b/examples/nlp_and_llms/nvidia-nim-tgi/backend_nim.py
@@ -0,0 +1,32 @@
+from openai import OpenAI
+import os
+
+# Paste your NVIDIA API key here (see README section 2) or, preferably, export it as NVIDIA_API_KEY
+API_KEY = os.getenv("NVIDIA_API_KEY", "nvapi-xxxxxxxxxxxxxxxxxxxx")
+
+if not API_KEY or "xxxx" in API_KEY:
+    raise ValueError("โ NVIDIA API key is not set. Export NVIDIA_API_KEY or paste your key into backend_nim.py.")
+
+client = OpenAI(
+ base_url="https://integrate.api.nvidia.com/v1",
+ api_key=API_KEY,
+)
+
+def nim_chat(prompt, model="qwen/qwen3-next-80b-a3b-instruct", stream=False):
+    completion = client.chat.completions.create(
+        model=model,
+        messages=[{"role": "user", "content": prompt}],
+        temperature=0.6,
+        top_p=0.7,
+        max_tokens=1024,
+        stream=stream
+    )
+
+    if stream:
+        # Return a generator of text chunks for StreamingResponse / CLI streaming
+        def token_stream():
+            for chunk in completion:
+                delta = chunk.choices[0].delta
+                if delta and delta.content:
+                    yield delta.content
+        return token_stream()
+
+    # Non-streaming call: return the full message text
+    return completion.choices[0].message.content
diff --git a/examples/nlp_and_llms/nvidia-nim-tgi/backend_tgi.py b/examples/nlp_and_llms/nvidia-nim-tgi/backend_tgi.py
new file mode 100644
index 00000000..8c8b8c0a
--- /dev/null
+++ b/examples/nlp_and_llms/nvidia-nim-tgi/backend_tgi.py
@@ -0,0 +1,24 @@
+import torch
+from transformers import AutoTokenizer, AutoModelForCausalLM
+
+MODEL_ID = "HuggingFaceTB/SmolLM-1.7B-Instruct"
+
+tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)
+model = AutoModelForCausalLM.from_pretrained(
+ MODEL_ID,
+ device_map="auto",
+ torch_dtype=torch.float16 if torch.cuda.is_available() else torch.float32
+)
+
+def tgi_chat(prompt, max_tokens=256, temperature=0.7):
+ formatted_prompt = f"User: {prompt}\nAssistant:"
+ inputs = tokenizer(formatted_prompt, return_tensors="pt").to(model.device)
+
+    outputs = model.generate(
+        **inputs,
+        max_new_tokens=max_tokens,
+        do_sample=True,  # sampling so the temperature setting actually takes effect
+        temperature=temperature,
+        pad_token_id=tokenizer.eos_token_id,
+    )
+
+ text = tokenizer.decode(outputs[0], skip_special_tokens=True)
+ return text.split("Assistant:")[-1].strip()
diff --git a/examples/nlp_and_llms/nvidia-nim-tgi/cli.py b/examples/nlp_and_llms/nvidia-nim-tgi/cli.py
new file mode 100644
index 00000000..a54ab10c
--- /dev/null
+++ b/examples/nlp_and_llms/nvidia-nim-tgi/cli.py
@@ -0,0 +1,19 @@
+import argparse
+from backend_tgi import tgi_chat
+from backend_nim import nim_chat
+
+parser = argparse.ArgumentParser(description="NIM/TGI CLI")
+parser.add_argument("--backend", choices=["local", "nim"], required=True)
+parser.add_argument("prompt", type=str)
+
+args = parser.parse_args()
+
+if args.backend == "local":
+ print("\n๐ข Local TGI Response:")
+ print(tgi_chat(args.prompt))
+
+else:
+ print("\n๐ข NVIDIA NIM Response:")
+ for chunk in nim_chat(args.prompt, stream=True):
+ print(chunk, end="", flush=True)
+ print("\n")
diff --git a/examples/nlp_and_llms/nvidia-nim-tgi/requirements.txt b/examples/nlp_and_llms/nvidia-nim-tgi/requirements.txt
new file mode 100644
index 00000000..e4211a4c
--- /dev/null
+++ b/examples/nlp_and_llms/nvidia-nim-tgi/requirements.txt
@@ -0,0 +1,5 @@
+fastapi
+uvicorn
+transformers
+torch
+openai
\ No newline at end of file
diff --git a/examples/nlp_and_llms/nvidia-nim-tgi/server.py b/examples/nlp_and_llms/nvidia-nim-tgi/server.py
new file mode 100644
index 00000000..abb6fdb5
--- /dev/null
+++ b/examples/nlp_and_llms/nvidia-nim-tgi/server.py
@@ -0,0 +1,26 @@
+from fastapi import FastAPI, Form
+from fastapi.responses import StreamingResponse, JSONResponse
+from backend_tgi import tgi_chat
+from backend_nim import nim_chat
+
+app = FastAPI(title="NIM / TGI Drop-in API Server")
+
+@app.post("/chat/local")
+def chat_local(prompt: str = Form(...)):
+ response = tgi_chat(prompt)
+ return {"backend": "tgi-local", "response": response}
+
+
+@app.post("/chat/nim")
+def chat_nim(prompt: str = Form(...), stream: bool = Form(False)):
+ if stream:
+ generator = nim_chat(prompt, stream=True)
+ return StreamingResponse(generator, media_type="text/event-stream")
+
+ response = nim_chat(prompt, stream=False)
+ return {"backend": "nvidia-nim", "response": response}
+
+
+@app.get("/")
+def root():
+ return {"message": "NIM/TGI Server Running", "endpoints": ["/chat/local", "/chat/nim"]}
diff --git a/examples/nlp_and_llms/nvidia-rag-mini/README.md b/examples/nlp_and_llms/nvidia-rag-mini/README.md
new file mode 100644
index 00000000..aa22e6d3
--- /dev/null
+++ b/examples/nlp_and_llms/nvidia-rag-mini/README.md
@@ -0,0 +1,222 @@
+# ๐ง RAG Mini Docs Q&A
+
+A lightweight **Retrieval-Augmented Generation (RAG)** system that lets you drop `.txt` files into a folder and ask natural-language questions about them.
+
+This template combines:
+
+* **SentenceTransformers** for document embeddings
+* **ChromaDB** for vector storage & retrieval
+* **๐ค Transformers (FLAN-T5)** for answer generation
+* **FastAPI** for serving an interactive Q&A API
+
+Designed for fast prototyping and educational use on **[Saturn Cloud](https://saturncloud.io/)**.
+
+---
+
+## ๐ 1. Get Started โ Understand the Folder Layout
+
+Before you start coding, review the project structure below.
+Each file serves a clear role; ensure you're working from the correct one.
+
+```
+nvidia-rag-mini/
+โโ data/ # Folder for your .txt documents
+โ โโ saturndoc.txt # Sample document included for testing
+โโ rag_machine.py # Core logic: embeddings, Chroma, QA engine
+โโ rag-api.py # REST API built with FastAPI
+โโ requirements.txt
+```
+
+๐ **Action:** Create or upload `.txt` files into the `data/` folder before running the template.
+A sample file named **`saturndoc.txt`** is already included, so you can use it immediately to test indexing and query responses.
+
+---
+
+## ๐งฉ 2. Set Up the Environment
+
+To run this project, you'll need Python 3.10 or newer.
+If you're using **Saturn Cloud**, create a new environment and install dependencies from `requirements.txt`.
+
+### โ๏ธ Step-by-step
+
+```bash
+# (optional) create a fresh virtual environment
+python -m venv rag-env
+source rag-env/bin/activate # or .\rag-env\Scripts\activate on Windows
+
+# install dependencies
+pip install -r requirements.txt
+```
+
+### ๐ฆ requirements.txt
+
+```text
+torch>=2.2.0
+transformers>=4.44.0
+sentence-transformers>=3.0.0
+chromadb>=0.5.0
+fastapi>=0.115.0
+uvicorn[standard]>=0.30.0
+pydantic>=2.7.0
+tqdm>=4.66.0
+```
+
+๐ **Action:** Run the install command inside your active environment before executing any Python file.
+
+---
+
+## โ๏ธ 3. Configure Models and Paths
+
+All configuration happens inside **`rag_machine.py`**.
+Defaults are already suitable for most cases:
+
+```python
+CHROMA_DIR = "rag_chroma_store" # Persistent database for embeddings
+DATA_DIR = Path("data") # Directory containing your .txt files
+EMBED_MODEL = "sentence-transformers/all-MiniLM-L6-v2"
+LLM_MODEL = "google/flan-t5-base"
+```
+
+๐ **Action:**
+If you want faster inference, you can change `LLM_MODEL` to `google/flan-t5-small`.
+If you have a GPU, keep `flan-t5-base` or try `flan-t5-large`.
+
+---
+
+## ๐ป 4. Run in CLI Mode โ Test the RAG Machine
+
+Use this mode for quick experimentation.
+The script loads models, indexes your `.txt` files, and opens an interactive prompt.
+
+```bash
+python rag_machine.py
+```
+
+You'll see output similar to:
+
+```
+๐ง Starting RAG Machine (Transformers + Chroma)...
+โป๏ธ Reindexing documents...
+๐ Indexing 5 documents...
+โ Indexed 5 documents successfully.
+๐ Current collection size: 5 documents
+โ Enter your question (or 'exit'):
+```
+
+๐ **Action:**
+Type a question like
+`What is this project about?`
+and the model will respond based on your documents.
+
+> You can use the included **`saturndoc.txt`** file for your first run; it's already in the `data/` folder and serves as a ready-made example for testing indexing and queries.
+
+---
+
+## ๐ 5. Run as an API โ Serve Questions via HTTP
+
+Now, let's turn your RAG engine into a service.
+Start the FastAPI server with Uvicorn:
+
+```bash
+uvicorn rag-api:app --reload
+```
+
+Once running, open your browser at [http://127.0.0.1:8000/docs](http://127.0.0.1:8000/docs)
+to explore the built-in Swagger interface.
+
+### ๐งญ Endpoints
+
+| Endpoint | Method | Description |
+| ---------------------- | ------ | -------------------------------------------------- |
+| `/query` | POST | Submit a question and get an answer |
+| `/reload` *(optional)* | POST | Reindex `.txt` files without restarting the server |
+
+### Example Query
+
+```bash
+curl -X POST "http://127.0.0.1:8000/query" \
+ -H "Content-Type: application/json" \
+ -d "{\"query\": \"What does the onboarding doc say?\"}"
+```
+
+Response:
+
+```json
+{
+ "result": "The onboarding doc explains the project setup and data structure."
+}
+```
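+
+The same query from Python, as a small sketch with `requests`:
+
+```python
+import requests
+
+# POST a JSON body matching the QueryRequest model in rag-api.py
+resp = requests.post(
+    "http://127.0.0.1:8000/query",
+    json={"query": "What does the onboarding doc say?"},
+)
+print(resp.json()["result"])
+```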
+
+๐ **Action:** Use `/query` to test, and `/reload` whenever you add new `.txt` files.
+
+---
+
+## ๐ 6. How It Works (Conceptually)
+
+1. **Document Loading** โ Reads all `.txt` files from `data/`.
+2. **Embedding Generation** โ Converts text into dense vectors using SentenceTransformers.
+3. **Vector Storage** โ Saves these embeddings persistently in **ChromaDB** (`rag_chroma_store/`).
+4. **Retrieval** โ Finds the most relevant text chunks for your query.
+5. **LLM Answering** โ Passes retrieved context + query into **FLAN-T5** to generate the final answer.
+
+๐ **Action:** Skim through `rag_machine.py` to see how each step is implemented; you can easily swap models or add chunking later.
+
+---
+
+## ๐ 7. Reindex vs Reuse
+
+* **`reindex=True`** โ Clears and rebuilds embeddings from scratch
+* **`reindex=False`** โ Loads existing persistent store (faster)
+
+```python
+index_documents(reindex=True) # rebuild everything
+index_documents(reindex=False) # reuse old vectors
+```
+
+๐ **Action:**
+Use reindexing only after you add or update text files in `data/`.
+The included **`saturndoc.txt`** is already indexed by default when you run the script for the first time โ so you can test immediately without adding new documents.
+
+---
+
+## ๐งฉ 8. Best Practices
+
+* Keep each text file focused on one topic for cleaner retrieval.
+* For long documents, consider manually splitting them into sections (see the chunking sketch after this list).
+* If using CPU only, choose smaller models for faster inference.
+* Delete the `rag_chroma_store/` folder to fully reset the database.
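+
+A simple, illustrative chunking helper (a sketch, not part of `rag_machine.py`), splitting on blank lines with a rough character budget:
+
+```python
+def split_into_chunks(text: str, max_chars: int = 1000):
+    """Split a document into roughly max_chars-sized chunks on paragraph boundaries."""
+    chunks, current = [], ""
+    for paragraph in text.split("\n\n"):
+        if len(current) + len(paragraph) > max_chars and current:
+            chunks.append(current.strip())
+            current = ""
+        current += paragraph + "\n\n"
+    if current.strip():
+        chunks.append(current.strip())
+    return chunks
+```
+
+Each chunk can then be passed to `collection.add(...)` with its own ID, just like the whole-file documents in `index_documents`.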
+
+---
+
+## ๐ฐ๏ธ 9. Deploying on Saturn Cloud
+
+You can easily host this on **Saturn Cloud**:
+
+1. Create a new Jupyter or VS Code resource.
+2. Upload this project folder.
+3. Install requirements:
+
+ ```bash
+ pip install -r requirements.txt
+ ```
+4. Run `python rag_machine.py` to test indexing.
+5. Launch the API:
+
+ ```bash
+ uvicorn rag-api:app --host 0.0.0.0 --port 8000
+ ```
+6. Expose port **8000** in your Saturn environment to access it externally.
+
+๐ Learn more about Saturn Cloud and GPU-accelerated workflows at **[https://saturncloud.io](https://saturncloud.io)**
+
+---
+
+## ๐ Credits
+
+Built with โค๏ธ using:
+
+* ๐ค **Transformers**
+* ๐ง **SentenceTransformers**
+* ๐พ **ChromaDB**
+* โก **FastAPI**
+* and hosted proudly on **[Saturn Cloud](https://saturncloud.io/)**
\ No newline at end of file
diff --git a/examples/nlp_and_llms/nvidia-rag-mini/data/saturndoc.txt b/examples/nlp_and_llms/nvidia-rag-mini/data/saturndoc.txt
new file mode 100644
index 00000000..f9375715
--- /dev/null
+++ b/examples/nlp_and_llms/nvidia-rag-mini/data/saturndoc.txt
@@ -0,0 +1,5 @@
+Saturn Cloud provides a scalable cloud platform for data science and machine learning.
+It supports Jupyter environments, Dask clusters, and GPU-powered instances.
+Users can collaborate on notebooks, deploy APIs, and run scheduled jobs.
+You can also fine-tune large language models and deploy them with minimal effort.
+Saturn Cloud offers integrations with Hugging Face, AWS, and GitHub.
\ No newline at end of file
diff --git a/examples/nlp_and_llms/nvidia-rag-mini/rag-api.py b/examples/nlp_and_llms/nvidia-rag-mini/rag-api.py
new file mode 100644
index 00000000..aa072e79
--- /dev/null
+++ b/examples/nlp_and_llms/nvidia-rag-mini/rag-api.py
@@ -0,0 +1,17 @@
+from fastapi import FastAPI
+from pydantic import BaseModel
+from rag_machine import query_docs, index_documents
+
+app = FastAPI(title="RAG Mini Docs Q&A")
+
+class QueryRequest(BaseModel):
+ query: str
+
+@app.on_event("startup")
+def startup_event():
+ index_documents(reindex=False)
+
+@app.post("/query")
+def query(req: QueryRequest):
+ answer = query_docs(req.query)
+ return {"result": answer}
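+
+
+# Optional /reload route mentioned in the README: a minimal sketch that
+# rebuilds the index without restarting the server.
+@app.post("/reload")
+def reload_docs():
+    index_documents(reindex=True)
+    return {"status": "reindexed"}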
diff --git a/examples/nlp_and_llms/nvidia-rag-mini/rag_machine.py b/examples/nlp_and_llms/nvidia-rag-mini/rag_machine.py
new file mode 100644
index 00000000..5721867a
--- /dev/null
+++ b/examples/nlp_and_llms/nvidia-rag-mini/rag_machine.py
@@ -0,0 +1,114 @@
+# rag_machine.py
+from pathlib import Path
+import os
+import torch
+from sentence_transformers import SentenceTransformer
+from transformers import AutoTokenizer, AutoModelForSeq2SeqLM
+import chromadb
+
+# --------------------------
+# ๐ง Configuration
+# --------------------------
+CHROMA_DIR = "rag_chroma_store"
+DATA_DIR = Path("data")
+EMBED_MODEL = "sentence-transformers/all-MiniLM-L6-v2"
+LLM_MODEL = "google/flan-t5-base"
+
+os.environ["TOKENIZERS_PARALLELISM"] = "false"
+
+DATA_DIR.mkdir(exist_ok=True)
+Path(CHROMA_DIR).mkdir(exist_ok=True)
+
+# --------------------------
+# โ๏ธ Initialize Components
+# --------------------------
+print("๐ Loading models...")
+embedder = SentenceTransformer(EMBED_MODEL)
+tokenizer = AutoTokenizer.from_pretrained(LLM_MODEL)
+llm = AutoModelForSeq2SeqLM.from_pretrained(LLM_MODEL)
+
+client = chromadb.PersistentClient(path=CHROMA_DIR)
+collection = client.get_or_create_collection("rag_docs")
+
+# --------------------------
+# ๐ Document Loader
+# --------------------------
+def load_all_documents(data_dir: Path):
+ docs = []
+ for file in data_dir.glob("*.txt"):
+ with open(file, "r", encoding="utf-8") as f:
+ text = f.read().strip()
+ if text:
+ docs.append({"file": file.name, "text": text})
+ print(f"๐ Loaded: {file.name}")
+ return docs
+
+# --------------------------
+# ๐ข Index Documents
+# --------------------------
+def index_documents(reindex: bool = False):
+ """Rebuild or load existing document embeddings."""
+ if reindex:
+ print("โป๏ธ Reindexing documents...")
+ try:
+ collection.reset()
+ print("๐งน Cleared existing collection.")
+ except AttributeError:
+ ids = collection.get()["ids"]
+ if ids:
+ collection.delete(ids=ids)
+ print("๐งน Deleted existing documents manually.")
+
+ docs = load_all_documents(DATA_DIR)
+ for i, d in enumerate(docs):
+ emb = embedder.encode(d["text"])
+ collection.add(
+ ids=[str(i)],
+ documents=[d["text"]],
+ embeddings=[emb.tolist()],
+ metadatas=[{"source": d["file"]}],
+ )
+        print("โ Documents reindexed and stored in Chroma.")
+ else:
+ print("๐ฆ Using existing Chroma store.")
+
+
+# --------------------------
+# ๐ Query System
+# --------------------------
+def query_docs(question: str, top_k: int = 3):
+ """Retrieve top-k relevant docs and generate an answer."""
+ print(f"\n๐ Question: {question}")
+
+ # Embed the query and search
+ q_emb = embedder.encode(question).tolist()
+ results = collection.query(query_embeddings=[q_emb], n_results=top_k)
+
+    if not results["documents"] or not results["documents"][0]:
+ return "No relevant documents found."
+
+ context = "\n".join(results["documents"][0])
+ prompt = f"Answer based on the following context:\n{context}\n\nQuestion: {question}"
+
+ inputs = tokenizer(prompt, return_tensors="pt", truncation=True)
+ outputs = llm.generate(**inputs, max_length=512)
+ answer = tokenizer.decode(outputs[0], skip_special_tokens=True)
+
+ return answer
+
+# --------------------------
+# ๐งช CLI Test Mode
+# --------------------------
+if __name__ == "__main__":
+ print("๐ง Starting RAG Machine (Transformers + Chroma)...")
+ index_documents(reindex=True)
+
+ while True:
+ q = input("\nโ Enter your question (or 'exit'): ").strip()
+ if q.lower() == "exit":
+ break
+ try:
+ ans = query_docs(q)
+ print(f"\n๐ฌ {ans}\n")
+ except Exception as e:
+ print(f"โ ๏ธ Error: {e}")
diff --git a/examples/nlp_and_llms/nvidia-rag-mini/requirements.txt b/examples/nlp_and_llms/nvidia-rag-mini/requirements.txt
new file mode 100644
index 00000000..624f58dd
--- /dev/null
+++ b/examples/nlp_and_llms/nvidia-rag-mini/requirements.txt
@@ -0,0 +1,7 @@
+torch>=2.2.0
+transformers>=4.44.0
+sentence-transformers>=3.0.0
+chromadb>=0.5.0
+fastapi>=0.115.0
+uvicorn[standard]>=0.30.0
+pydantic>=2.7.0
diff --git a/examples/nlp_and_llms/nvidia-rag-serve-api/README.md b/examples/nlp_and_llms/nvidia-rag-serve-api/README.md
new file mode 100644
index 00000000..911fabd4
--- /dev/null
+++ b/examples/nlp_and_llms/nvidia-rag-serve-api/README.md
@@ -0,0 +1,105 @@
+# ๐ Ray Serve LLM API โ Qwen 1.5B (vLLM)
+
+This template shows how to deploy a **Qwen2.5-1.5B-Instruct LLM** using:
+
+* **Ray Serve**
+* **vLLM**
+* **OpenAI-compatible API format**
+
+You get a local inference server running at:
+
+```
+http://127.0.0.1:8000/v1/chat/completions
+```
+
+This template is designed for **Saturn Cloud custom templates** so users can plug-and-play LLM inference environments with GPU acceleration.
+
+๐ **Back to Saturn Cloud โ [https://saturncloud.io](https://saturncloud.io)**
+
+---
+
+## ๐ Features
+
+* Fully OpenAI-compatible API endpoint
+* Deploys Qwen 1.5B using vLLM (fast inference)
+* Simple Ray Serve deployment
+* Example client request included
+* Clean and minimal code structure
+* Works inside Jupyter or full terminal environment
+
+---
+
+## ๐ฆ Requirements
+
+The notebook installs everything automatically:
+
+```
+torch
+transformers
+ray[serve, llm]
+fastapi
+uvicorn
+requests
+huggingface_hub
+```
+
+GPU recommended for optimal performance.
+
+---
+
+## ๐ Project Structure
+
+```
+ray-serve-llm/
+โ
+โโโ serve_llm.py # Ray Serve deployment definition
+โโโ start_server.py # Ray launcher (if using outside notebook)
+โโโ test_client.py # Example API client test
+โโโ ray_serve_llm.ipynb     # Full Jupyter notebook template
+```
+
+---
+
+## โถ๏ธ How It Works
+
+### 1. Write your Ray Serve deployment file
+
+Defines:
+
+* Model ID (`Qwen2.5-1.5B-Instruct`)
+* Engine config
+* Autoscaling
+* OpenAI-compatible app
+
+### 2. Start Ray and deploy the model
+
+Ray Serve loads the model via vLLM and exposes the API.
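+
+A minimal sketch of what `start_server.py` could look like (the filename comes from the project structure above; it mirrors the notebook cells):
+
+```python
+# start_server.py
+import time
+
+import ray
+from ray import serve
+
+from serve_llm import app  # the OpenAI-compatible app built in serve_llm.py
+
+ray.init(ignore_reinit_error=True)
+serve.run(app)  # deploys at http://127.0.0.1:8000/v1/chat/completions
+
+# Keep the process alive so Serve keeps handling requests when run as a script
+while True:
+    time.sleep(60)
+```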
+
+### 3. Send a test request
+
+JSON API format identical to OpenAI:
+
+```python
+payload = {
+ "model": "qwen-1.5b",
+ "messages": [{"role": "user", "content": "Explain API design."}]
+}
+```
+
+### 4. Extract the assistant text
+
+```python
+res = out.json()["choices"][0]["message"]["content"]
+```
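+
+Put together, a minimal sketch of `test_client.py` (the filename comes from the project structure above):
+
+```python
+# test_client.py
+import requests
+
+payload = {
+    "model": "qwen-1.5b",
+    "messages": [{"role": "user", "content": "Explain API design."}],
+}
+
+out = requests.post("http://127.0.0.1:8000/v1/chat/completions", json=payload)
+print(out.json()["choices"][0]["message"]["content"])
+```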
+
+---
+
+## ๐ Conclusion
+
+This template provides a clean, reproducible Ray Serve LLM deployment that works both in Jupyter and full terminal mode.
+You can adapt it to larger models, scale it across nodes, or wrap it inside FastAPI.
+
+๐ **Back to Saturn Cloud โ [https://saturncloud.io](https://saturncloud.io)**
+
+---
+
diff --git a/examples/nlp_and_llms/nvidia-rag-serve-api/ray_serve_llm.ipynb b/examples/nlp_and_llms/nvidia-rag-serve-api/ray_serve_llm.ipynb
new file mode 100644
index 00000000..fe2de35a
--- /dev/null
+++ b/examples/nlp_and_llms/nvidia-rag-serve-api/ray_serve_llm.ipynb
@@ -0,0 +1,188 @@
+{
+ "cells": [
+ {
+ "cell_type": "markdown",
+ "id": "d1a1950f",
+ "metadata": {},
+ "source": [
+ "# ๐ Ray Serve LLM API\n",
+ "\n",
+    "This template demonstrates how to deploy a model using **Ray Serve + vLLM** and expose it through an **OpenAI-compatible API**.\n",
+    "\n",
+    "This is a custom template for **Saturn Cloud**, so users can spin up plug-and-play LLM inference environments with GPU acceleration.\n"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "id": "d06caaab",
+ "metadata": {},
+ "source": [
+ "## ๐ฆ Install required libraries\n",
+    "Install all the required libraries for the template."
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "id": "78b5ec11",
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "# Install required libraries\n",
+ "!pip install torch transformers fastapi uvicorn \"ray[serve, llm]\" requests huggingface_hub\n"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "id": "ec4c2ac1",
+ "metadata": {},
+ "source": [
+ "## ๐งฉ Create Ray Serve Deployment File\n",
+ "\n",
+    "This writes a file called **`serve_llm.py`** which:\n",
+ "\n",
+ "* Configures the model (Qwen2.5-1.5B-Instruct)\n",
+ "* Creates a Ray Serve LLMConfig\n",
+ "* Builds an OpenAI-compatible API using Ray's `build_openai_app`\n"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "id": "bc3b43ec",
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "%%writefile serve_llm.py\n",
+ "from ray.serve.llm import LLMConfig, build_openai_app\n",
+ "\n",
+ "MODEL_ID = \"Qwen/Qwen2.5-1.5B-Instruct\"\n",
+ "MODEL_ALIAS = \"qwen-1.5b\"\n",
+ "\n",
+ "engine_kwargs = dict(\n",
+ " tensor_parallel_size=1,\n",
+ " max_model_len=4096,\n",
+ ")\n",
+ "\n",
+ "deployment_config = dict(\n",
+ " autoscaling_config=dict(\n",
+ " min_replicas=1,\n",
+ " max_replicas=1,\n",
+ " )\n",
+ ")\n",
+ "\n",
+ "llm_config = LLMConfig(\n",
+ " model_loading_config=dict(\n",
+ " model_id=MODEL_ALIAS,\n",
+ " model_source=MODEL_ID,\n",
+ " ),\n",
+ " engine_kwargs=engine_kwargs,\n",
+ " deployment_config=deployment_config,\n",
+ ")\n",
+ "\n",
+ "app = build_openai_app({\"llm_configs\": [llm_config]})"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "id": "8f3464f5",
+ "metadata": {},
+ "source": [
+ "## โถ๏ธ Start Ray Serve and Deploy the Model\n",
+ "\n",
+ "This will:\n",
+ "\n",
+ "* Initialize Ray\n",
+ "* Start Ray Serve\n",
+ "* Deploy the Qwen model as an API at:\n",
+ " **[http://127.0.0.1:8000/v1/chat/completions](http://127.0.0.1:8000/v1/chat/completions)**"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "id": "1e011e24",
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "import ray\n",
+ "from serve_llm import app\n",
+ "from ray import serve\n",
+ "\n",
+ "ray.init(ignore_reinit_error=True)\n",
+ "\n",
+ "serve.start(detached=False)\n",
+ "serve.run(app)"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "id": "3700cc7d",
+ "metadata": {},
+ "source": [
+ "## ๐ฌ Test the API\n",
+ "\n",
+ "Sends a real chat request to your Ray Serve LLM deployment."
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "id": "eb912c3a",
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "import requests\n",
+ "\n",
+ "payload = {\n",
+ " \"model\": \"qwen-1.5b\",\n",
+ " \"messages\": [{\"role\": \"user\", \"content\": \"Explain API design.\"}]\n",
+ "}\n",
+ "\n",
+ "out = requests.post(\"http://127.0.0.1:8000/v1/chat/completions\", json=payload)\n",
+ "print(out.json())"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "id": "78c72539",
+ "metadata": {},
+ "source": [
+    "## โจ Extract Only the Model Output\n",
+ "\n",
+ "This grabs the generated text only (no metadata)."
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "id": "e440e110",
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "res = out.json()[\"choices\"][0][\"message\"][\"content\"]\n",
+ "print(res)"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "id": "17f4ac64",
+ "metadata": {},
+ "source": [
+ "## ๐ **Conclusion**\n",
+ "\n",
+ "You now have a fully running **Ray Serve LLM API** using Qwen2.5-1.5B-Instruct, powered by **vLLM** and exposed through an **OpenAI-compatible endpoint**.\n",
+ "This template can be extended to larger models, added to pipelines, or used inside production-grade ML workloads within Saturn Cloud.\n",
+ "\n",
+ "๐ **Back to Saturn Cloud โ [https://saturncloud.io](https://saturncloud.io)**"
+ ]
+ }
+ ],
+ "metadata": {
+ "language_info": {
+ "name": "python"
+ }
+ },
+ "nbformat": 4,
+ "nbformat_minor": 5
+}
diff --git a/examples/nlp_and_llms/nvidia-vector-db/.env b/examples/nlp_and_llms/nvidia-vector-db/.env
new file mode 100644
index 00000000..1622d39c
--- /dev/null
+++ b/examples/nlp_and_llms/nvidia-vector-db/.env
@@ -0,0 +1,3 @@
+ZILLIZ_URI="https://in03-e969f44404493f8.serverless.aws-eu-central-1.cloud.zilliz.com"
+ZILLIZ_TOKEN="a71de8fc4a75f5cb758d0fcf2b92fb2ebc1f851d7e776247d440e887cd355d7b575649f63a514fb7b78fdeac6f3b416e2ef11150"
+PG_CONNECTION="postgresql://neondb_owner:npg_ymHkZNUVr2I7@ep-lingering-silence-ah4wmlqw-pooler.c-3.us-east-1.aws.neon.tech/neondb?sslmode=require&channel_binding=require"
\ No newline at end of file
diff --git a/examples/nlp_and_llms/nvidia-vector-db/README.md b/examples/nlp_and_llms/nvidia-vector-db/README.md
new file mode 100644
index 00000000..0d97129f
--- /dev/null
+++ b/examples/nlp_and_llms/nvidia-vector-db/README.md
@@ -0,0 +1,233 @@
+
+# ๐ **Vector DB Menu (FAISS โข Zilliz Milvus โข Neon PGVector)**
+
+> A unified FastAPI search service that lets you test and compare **FAISS (local)**, **Milvus (Zilliz Cloud free tier)**, and **PostgreSQL with PGVector (Neon free tier)** using a common API.
+
+๐ **Built for the Saturn Cloud AI Community**
+๐ [https://saturncloud.io/](https://saturncloud.io/)
+
+---
+
+## ๐ง Overview
+
+This project loads a public dataset (State of the Union speeches), embeds it with `sentence-transformers/all-MiniLM-L6-v2`, stores vectors in **three different databases**, and exposes a **FastAPI endpoint** to query them interchangeably.
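+
+The embedding step itself is small; a sketch of what each backend ends up storing (model name taken from the description above):
+
+```python
+from sentence_transformers import SentenceTransformer
+
+embedder = SentenceTransformer("sentence-transformers/all-MiniLM-L6-v2")
+
+# Each speech (or chunk of it) becomes a 384-dimensional vector
+vectors = embedder.encode(["The state of our union is strong..."])
+print(vectors.shape)  # (1, 384)
+```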
+
+### โ What's included:
+
+* FAISS (local in-memory vector search)
+* Milvus (via **Zilliz Cloud free tier**)
+* PostgreSQL + PGVector (via **Neon free tier**)
+* FastAPI for querying all 3 backends
+* CLI & Browser UI testing
+* Modular, deploy-ready architecture
+
+---
+
+## โ ๏ธ Free-Tier Credentials Notice
+
+This repo includes **working test credentials** for quick validation.
+However, because they are **free-tier**, they may:
+
+โ ๏ธ expire at any time
+โ ๏ธ be rate-limited
+โ ๏ธ be deleted automatically
+
+โ You are **strongly encouraged to create your own accounts** using the setup guide below.
+
+---
+
+---
+
+# ๐ ๏ธ **1. Project Setup**
+
+### Clone Repository
+
+```sh
+git clone https://github.com/your-repo/nvidia-vector-db.git
+cd nvidia-vector-db
+```
+
+---
+
+### Create and Activate Virtual Environment
+
+#### Windows (PowerShell)
+
+```sh
+python -m venv vectordb-env
+vectordb-env\Scripts\activate
+```
+
+#### macOS / Linux
+
+```sh
+python3 -m venv vectordb-env
+source vectordb-env/bin/activate
+```
+
+---
+
+### Install Dependencies
+
+```sh
+pip install -r requirements.txt
+```
+
+---
+
+# โ๏ธ **2. Create Neon (PostgreSQL + PGVector) Free Account**
+
+1. Visit: [https://neon.tech/](https://neon.tech/)
+2. Click **Sign Up** (free tier)
+3. Create a new project
+4. Go to **Dashboard โ Connection Details**
+5. Copy the connection string:
+
+ ```
+ postgresql://