📖 [Paper Title] Colon-X: Advancing Intelligent Colonoscopy from Multimodal Understanding to Clinical Reasoning (🔗arXiv, 🤗Huggingface, 📄CoolPapers, 🎧YouTube Podcast)
🤝 [Project members] Ge-Peng Ji (🇦🇺 Australian National University), Jingyi Liu (🇨🇳 VCIP Lab, Nankai University), Deng-Ping Fan* (🇨🇳 VCIP Lab, Nankai University), and Nick Barnes (🇦🇺 Australian National University)
🏥 [Multimodal datasets] ColonVQA/ColonEval/ColonPert/ColonReason (🔗Google Drive & 🤗Huggingface)
👉 Fill out the 🈸google form to unlock full access to our data. 👈
🤖 [Reasoning model] The first R1-styled thinking model, ColonR1, tailored for colonoscopy tasks (🔗Google Drive & 🤗Huggingface)
🔍 [Keywords] Multimodal Colonoscopy Analysis, Multimodal Understanding, Clinical Reasoning, Reinforcement Learning, Multimodal Benchmark, AI Healthcare, and Abdomen
💬 [Contact] Our Colon-X project is ongoing. We would love to hear your questions and suggestions: 📧gepengai.ji@gmail.com. To track the latest updates, please follow our 👍research gallery page.
Research roadmap of our Colon-X project. Building upon the most comprehensive multimodal colonoscopy dataset (ColonVQA), we propel a pivotal transition in intelligent colonoscopy, evolving from multimodal understanding (ColonEval and ColonPert) to clinical reasoning (ColonReason and ColonR1). These efforts collectively illuminate the path to next-generation advances in clinical colonoscopy and broader medical applications.
- [Dec/19/2025] New feature! We now support a Gradio web demo for our ColonR1 model! Please refer to the 📝demo guide for the interactive web interface.
- [Dec/09/2025] 🔥 Project release, including markdown guides, data access links, and model assets for our colonoscopy-specific, R1-styled ColonR1 model.
Important
📌 TL;DR -- Colonoscopy saves lives, but AI for colonoscopy is still far from intelligent. We are excited to launch the Colon-X project, an open initiative aimed at advancing multimodal intelligence in colonoscopy and beyond. Besides serving as a community-wide data foundation, we focus on a critical yet underexplored transition: evolving from multimodal understanding to clinical reasoning.
- Motivation: Multimodal Data Scarcity in Colonoscopy (“even the cleverest cook cannot make a meal without rice”)
- The field still struggles with a persistent benchmarking crisis, which stems not only from the scarcity of biomedical data but also from the convention of training task-specific models on isolated benchmarks.
- Contribution: Building a million-scale data foundation for multimodal colonoscopy analysis
- To address this, we construct the largest multimodal colonoscopy dataset, ColonVQA, by consolidating public data sources, enabling the task-modality synergies essential to multimodal intelligence.
- 💡 ColonVQA is the most extensive database ever built for multimodal colonoscopy analysis, featuring 1,100,786 visual question-answering (VQA) queries, equivalent to over 49.9 million textual tokens. It is distinguished by its category-rich composition, containing 212,742 images across 76 clinically meaningful findings, and task-diverse design, covering 18 multimodal tasks organized within a five-level taxonomy.
- Data access: Refer to 📝markdown guide to download and prepare our entire dataset (an illustrative sketch of a single VQA record is given after this list).
- Motivation: The multimodal understanding abilities of MLLMs in colonoscopy remain largely unknown
- Contribution: Benchmarking the generalizability and reliability of MLLMs in colonoscopy
- 💡 Generalizability: We introduce a clinically reviewed set, ColonEval, that assesses the generalizability of 22 multimodal large language models (MLLMs) across diverse colonoscopy tasks. Refer to 📝markdown guide to quickly start the generalizability evaluation.
- 💡 Reliability: We introduce ColonPert to quantify robustness against human-induced perturbations. We identify a critical “text-dominance bias”, where models are easily misled by implicit on-image text or explicit textual prompts. Refer to 📝markdown guide to quickly start the reliability evaluation.
- Motivation: Although large reasoning models (e.g., o-series, DeepSeek-R1) have demonstrated impressive chain-of-thought capabilities on complex tasks, their potential in colonoscopy remains largely unexplored. This inspires us to advance this frontier beyond understanding toward clinical reasoning, through both data and model innovations.
- Contribution: Evolving multimodal understanding into clinical reasoning in intelligent colonoscopy
- 💡 ColonReason: A clinically grounded reasoning dataset annotated through a multi-expert debating pipeline. It simulates a clinical peer-discussion loop (interpretation, debating, and self-reflection) to generate structured reasoning traces; a conceptual sketch of this loop is given after this list. Refer to 📝markdown guide to access the curated reasoning dataset.
- 💡 ColonR1: “Not just the decision, but the reasoning behind it.” The first R1-styled model tailored for colonoscopy, incorporating task-adaptive rewarding to accommodate diverse tasks (a conceptual sketch of this idea also follows the list). It employs self-evolving prompting to learn from past errors, achieving SOTA performance with only ~7.5K training samples. We provide a quick demo below to help you get started. More details can be found in the 📝markdown guide.
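To make the composition of ColonVQA more concrete, here is a minimal sketch of what a single record could look like. Every field name and value is illustrative only and is not the official schema; please follow the 📝markdown guide for the actual data format.

```python
# Hypothetical ColonVQA-style record, for illustration only.
# The real field names and task labels are defined in the official markdown guide.
example_record = {
    "image": "path/to/colonoscopy_frame.jpg",  # source image from one of the public datasets
    "task": "polyp_presence",                  # one of the 18 multimodal tasks (name is illustrative)
    "finding": "polyp",                        # one of the 76 clinical findings (name is illustrative)
    "question": "Does the image contain a polyp? Answer me with Yes or No.",
    "answer": "Yes",
}

print(example_record["question"], "->", example_record["answer"])
```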
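To help picture the multi-expert debating pipeline behind ColonReason, the sketch below mocks up the interpretation, debating, and self-reflection loop. It is not the project's annotation code: the function `query_expert` and all role names are hypothetical placeholders for whatever MLLM backend plays each expert.

```python
# Conceptual sketch of a multi-expert debating loop (hypothetical, not the official pipeline).

def query_expert(role: str, prompt: str) -> str:
    """Placeholder for a call to an MLLM acting as one expert role."""
    return f"[{role}] response to: {prompt[:40]}..."

def debate(case_description: str, num_experts: int = 3, rounds: int = 2) -> dict:
    # 1) Interpretation: each expert gives an independent reading of the case.
    opinions = [query_expert(f"expert_{i}", case_description) for i in range(num_experts)]

    # 2) Debating: experts see each other's opinions and revise their views.
    for _ in range(rounds):
        shared = "\n".join(opinions)
        opinions = [
            query_expert(f"expert_{i}", f"{case_description}\nPeer opinions:\n{shared}")
            for i in range(num_experts)
        ]

    # 3) Self-reflection: a final pass consolidates the discussion into a reasoning trace.
    reasoning_trace = query_expert("moderator", "Summarize the discussion:\n" + "\n".join(opinions))
    return {"opinions": opinions, "reasoning_trace": reasoning_trace}

if __name__ == "__main__":
    print(debate("Colonoscopy frame with a suspected polyp in the sigmoid colon."))
```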
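Similarly, the idea of task-adaptive rewarding in ColonR1 can be sketched as a reward that switches with the task type, combined with a format check on the <think>/<answer> tags. The functions below are only a conceptual illustration under that assumption; the actual reward design is specified in the paper and the official training code.

```python
import re

def format_reward(response: str) -> float:
    """Reward the R1-style output format: <think>...</think> followed by <answer>...</answer>."""
    pattern = r"^\s*<think>.*?</think>\s*<answer>.*?</answer>\s*$"
    return 1.0 if re.match(pattern, response, re.DOTALL) else 0.0

def answer_reward(task_type: str, predicted: str, ground_truth: str) -> float:
    """Hypothetical task-adaptive accuracy reward; the real design may differ."""
    if task_type == "classification":  # e.g., yes/no or category questions
        return 1.0 if predicted.strip().lower() == ground_truth.strip().lower() else 0.0
    if task_type == "open_ended":      # e.g., free-form description
        pred_tokens, gt_tokens = set(predicted.lower().split()), set(ground_truth.lower().split())
        return len(pred_tokens & gt_tokens) / max(len(gt_tokens), 1)  # crude token overlap
    return 0.0

def total_reward(task_type: str, response: str, ground_truth: str) -> float:
    match = re.search(r"<answer>(.*?)</answer>", response, re.DOTALL)
    predicted = match.group(1) if match else ""
    return format_reward(response) + answer_reward(task_type, predicted, ground_truth)

print(total_reward("classification", "<think>Polyp visible.</think><answer>Yes</answer>", "Yes"))
```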
Below is a code snippet to help you quickly try out our ColonR1 model using 🤗Huggingface Transformers. For convenience, we have manually combined some configuration and code files. Please note that this is quick inference code; we recommend using our full codebase to explore more.
Before running the snippet, install the following minimum dependencies:

```bash
conda create -n quickstart python=3.10
conda activate quickstart
pip install torch transformers accelerate pillow
```
Then you can run it with `python ColonR1/quickstart.py`, as shown in the following code.

```python
import torch
from transformers import AutoProcessor, Qwen2_5_VLForConditionalGeneration
from PIL import Image
import warnings
import os

warnings.filterwarnings('ignore')

device = "cuda" if torch.cuda.is_available() else "cpu"
MODEL_PATH = "ai4colonoscopy/ColonR1"

# You can replace it with your own image path and question.
IMAGE_PATH = "ColonR1/serve/test_examples/02/102.jpg"
Question = "Does the image contain a polyp? Answer me with Yes or No."

print(f"[Info] Loading model from {MODEL_PATH}...")
# Note: flash_attention_2 requires the flash-attn package; remove this argument
# to fall back to the default attention implementation.
model = Qwen2_5_VLForConditionalGeneration.from_pretrained(
    MODEL_PATH,
    torch_dtype=torch.bfloat16,
    attn_implementation="flash_attention_2",
    device_map="auto"
)
model.eval()
processor = AutoProcessor.from_pretrained(MODEL_PATH)

if not os.path.exists(IMAGE_PATH):
    raise FileNotFoundError(f"Image not found at {IMAGE_PATH}. Please provide a valid image path.")
image = Image.open(IMAGE_PATH).convert("RGB")

# R1-style instruction: reasoning goes inside <think> tags, the final answer inside <answer> tags.
TASK_SUFFIX = (
    "Your task: 1. First, Think through the question step by step, enclose your reasoning process "
    "in <think>...</think> tags. 2. Then provide the correct answer inside <answer>...</answer> tags. "
    "3. No extra information or text outside of these tags."
)
final_question = f"{Question}\n{TASK_SUFFIX}"

messages = [
    {
        "role": "user",
        "content": [
            {"type": "image", "image": IMAGE_PATH},
            {"type": "text", "text": final_question},
        ],
    }
]

print("[Info] Processing inputs...")
text_prompt = processor.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
inputs = processor(
    text=[text_prompt],
    images=[image],
    padding=True,
    return_tensors="pt",
).to(device)

print("[Info] Generating response...")
with torch.no_grad():
    generated_ids = model.generate(
        **inputs,
        max_new_tokens=1024,
        do_sample=False
    )

# Strip the prompt tokens so only the newly generated text is decoded.
generated_ids_trimmed = generated_ids[:, inputs.input_ids.shape[1]:]
output_text = processor.batch_decode(generated_ids_trimmed, skip_special_tokens=True)[0]
print(output_text)
```
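Because the prompt asks ColonR1 to wrap its reasoning in <think>...</think> tags and its final answer in <answer>...</answer> tags, you may want to separate the two after generation. The helper below is a minimal sketch (not part of the official codebase) that does this with regular expressions on the `output_text` produced by the snippet above.

```python
import re

def split_think_answer(output_text: str):
    """Split an R1-style response into (reasoning, answer) strings.

    Assumes the model followed the <think>...</think> / <answer>...</answer>
    format requested in TASK_SUFFIX; falls back to the raw text otherwise.
    """
    think = re.search(r"<think>(.*?)</think>", output_text, re.DOTALL)
    answer = re.search(r"<answer>(.*?)</answer>", output_text, re.DOTALL)
    reasoning = think.group(1).strip() if think else ""
    final_answer = answer.group(1).strip() if answer else output_text.strip()
    return reasoning, final_answer

# Example usage with the quickstart output:
# reasoning, final_answer = split_think_answer(output_text)
# print("Reasoning:", reasoning)
# print("Answer:", final_answer)
```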
ColonR1 is designed to assist in medical colonoscopy by leveraging multimodal reasoning capabilities, but it comes with no guarantees regarding its predictive accuracy or reliability in clinical practice. Users should be aware that the datasets and pre-trained models used in ColonR1 may contain inherent biases, including socioeconomic factors, which can lead to misclassification or other undesirable behaviors, such as the generation of offensive or inappropriate content.
We urge users and developers to carefully review and validate the performance of pre-trained models, particularly those integrated through the ColonR1 framework, before considering practical applications in a clinical setting. It is crucial that any AI-driven tool used in healthcare undergoes rigorous testing to ensure patient safety and avoid unintended consequences. Our commitment to ethical AI use extends to ongoing efforts to investigate, address, and mitigate the risks of bias and inappropriate behavior in ColonR1. Continuous improvement of this codebase is a priority to ensure that the system aligns with responsible and equitable healthcare standards.
Feel free to cite if you find the Colon-X Project useful for your work:
@article{ji2025colonx,
title={Colon-X: Advancing Intelligent Colonoscopy from Multimodal Understanding to Clinical Reasoning},
author={Ji, Ge-Peng and Liu, Jingyi and Fan, Deng-Ping and Barnes, Nick},
journal={arXiv preprint arXiv:2512.03667},
year={2025}
}
We are actively looking for potential collaborators to help push this community forward — especially hospitals or medical institutions that can provide diverse, real-world clinical colonoscopy data (e.g., data across different devices, modalities, patient populations, and clinical workflows). If you’re interested in contributing or partnering with us, we’d be very happy to connect.
We’re still on the journey toward building truly intelligent colonoscopy systems, and this project is very much under active development. We warmly welcome any feedback, ideas, or suggestions that can help shape its future.
For any inquiries or thoughts you’d like to share, feel free to reach out to us at 📧 gepengai.ji@gmail.com & 📧 jingyi.liu2657@gmail.com.
We gratefully acknowledge the contributions of the following projects, which served as the foundation and inspiration for our work:
- 📦 Qwen2.5-VL: The most powerful vision-language model in the Qwen series to date.
- 📦 R1-V: Reinforcing Super Generalization Ability in Vision Language Models with Less Than $3.
- 📦 open-r1: A fully open reproduction of DeepSeek-R1.
Moreover, special thanks go to the public datasets below, whose contributions made it possible to build a benchmark at this scale. These datasets include: 🗂️CAD-CAP, 🗂️CVC-ClinicDB, 🗂️CVC-ColonDB, 🗂️EDD2020, 🗂️ETIS-Larib, 🗂️PICCOLO, 🗂️PolypGen, 🗂️PS-NBI2K, 🗂️Kvasir, 🗂️Hyper-Kvasir, 🗂️ASEI, 🗂️Kvasir-Capsule, 🗂️GastroVision, 🗂️SUN-SEG, 🗂️WCEBleedGen, 🗂️Capsule Vision 2024, 🗂️KID1, 🗂️KID2, 🗂️in vivo, 🗂️KUMC, 🗂️CP-CHILD, 🗂️LIMUC, 🗂️SSL-CPCD, 🗂️MedFMC, 🗂️WCE Colon Disease, 🗂️CPC-Paired, 🗂️ColonoscopicDS, 🗂️PolypDB, 🗂️Kvasir-Instrument, 🗂️LDPolyVideo, 🗂️Endo4IE, and 🗂️Nerthus.