A step-by-step tutorial on building a Retrieval-Augmented Generation (RAG) pipeline with LangChain, ChromaDB, and Hugging Face models, using cardiology research PDFs as the knowledge base.

RegoDefies/rag-cardio

Cardiology RAG – Retrieval-Augmented Generation with PDFs

This project demonstrates how to build a Retrieval-Augmented Generation (RAG) pipeline for question answering over a collection of cardiology research articles in PDF format. The full write-up is available in the LinkedIn article "Building a Retrieval-Augmented Generation (RAG) System for Cardiology Research: A Step-by-Step Guide".

The system uses LangChain, ChromaDB, and a local Hugging Face model (Qwen/Qwen2.5-7B-Instruct by default) to retrieve relevant text chunks and generate answers grounded in them.


✨ Features

  • Parse multiple PDF documents (e.g., cardiology articles)
  • Split text into chunks for efficient retrieval
  • Store embeddings in a persistent ChromaDB vector database
  • Run a local LLM (no API key required)
  • Answer questions with sources included for transparency
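
To make the chunking step concrete, here is a minimal sketch in plain Python. It is an illustration only, not the project's splitter: main.py is not reproduced here, and LangChain ships its own text splitters. The function name `split_into_chunks` and the default sizes are assumptions chosen for the example.

```python
def split_into_chunks(text: str, chunk_size: int = 500, overlap: int = 50) -> list[str]:
    """Split text into fixed-size character windows that overlap.

    Hypothetical helper for illustration; sizes are in characters, not tokens.
    """
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must satisfy 0 <= overlap < chunk_size")

    step = chunk_size - overlap  # how far the window advances each time
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        # Stop once the current window already reaches the end of the text.
        if start + chunk_size >= len(text):
            break
    return chunks


if __name__ == "__main__":
    for chunk in split_into_chunks("abcdefghij", chunk_size=4, overlap=1):
        print(chunk)
```

The overlap keeps a sentence that straddles a chunk boundary retrievable from at least one chunk, at the cost of storing some text twice.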

📂 Project Structure

cardiology_rag/
│── articles/                # Folder with cardiology PDFs
│    ├── atrial_fibrillation.pdf
│    ├── heart_failure.pdf
│    └── coronary_disease.pdf
│── chroma_db/               # Vector DB (auto-created)
│── main.py                  # Main RAG pipeline script
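
Given this layout, a script might first collect the PDFs in articles/ before loading them. The sketch below uses only the standard library. The helper name `list_pdfs` is hypothetical, and it is not confirmed that main.py discovers files this way.

```python
from pathlib import Path


def list_pdfs(articles_dir: str = "articles") -> list[Path]:
    """Return the PDF files directly inside articles_dir, sorted by name.

    Hypothetical helper; assumes the folder layout shown above.
    """
    folder = Path(articles_dir)
    if not folder.is_dir():
        raise FileNotFoundError(f"Folder not found: {folder}")
    # Match the extension case-insensitively so "paper.PDF" is also picked up.
    return sorted(
        p for p in folder.iterdir()
        if p.is_file() and p.suffix.lower() == ".pdf"
    )


if __name__ == "__main__":
    for pdf in list_pdfs():
        print(pdf.name)
```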

βš™οΈ Installation

Create a clean environment and install dependencies:

# Clone repository
git clone https://github.com/yourusername/cardiology_rag.git
cd cardiology_rag

# Create virtual environment and activate it (run only the line for your OS)
python -m venv rag_env
source rag_env/bin/activate   # Linux / Mac
rag_env\Scripts\activate      # Windows

# Install dependencies
pip install -r requirements.txt

If you don't have a requirements.txt yet, install manually:

pip install langchain langchain-community langchain-huggingface langchain-chroma chromadb pypdf sentence-transformers transformers torch accelerate
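
After installing, a quick check can confirm that the packages are importable from the active environment. This is a small sketch using the standard library. The module names are the import names that usually correspond to the pip packages above (for example, `sentence-transformers` imports as `sentence_transformers`), and the helper `missing_modules` is hypothetical.

```python
from importlib.util import find_spec

# Import names assumed to correspond to the pip packages listed above.
REQUIRED_MODULES = [
    "langchain",
    "langchain_community",
    "langchain_huggingface",
    "langchain_chroma",
    "chromadb",
    "pypdf",
    "sentence_transformers",
    "transformers",
    "torch",
    "accelerate",
]


def missing_modules(names=REQUIRED_MODULES) -> list[str]:
    """Return the module names that cannot be found in this environment."""
    return [name for name in names if find_spec(name) is None]


if __name__ == "__main__":
    missing = missing_modules()
    if missing:
        print("Missing modules:", ", ".join(missing))
    else:
        print("All required modules are importable.")
```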

🚀 Usage

  1. Place your cardiology PDFs inside the articles/ folder.
  2. Run the main script:
python main.py

Example output:

Loaded 90 pages from 10 PDF files.
Total chunks: 634

=== Answer ===
Atrial fibrillation can cause stroke, heart failure, and increased mortality.

=== Sources ===
[1] atrial_fibrillation.pdf | p.12
[2] heart_failure.pdf       | p.4
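
One way to produce a sources listing in this shape is sketched below. It assumes each retrieved chunk carries a metadata dictionary with `source` and `page` keys, as LangChain PDF loaders typically provide. The function name `format_sources` is hypothetical, and main.py may format its output differently.

```python
from pathlib import Path


def format_sources(metadatas: list[dict]) -> list[str]:
    """Format chunk metadata as numbered '[n] file.pdf | p.N' lines.

    Assumes 'source' (file path) and 'page' keys; missing values get placeholders.
    """
    names = [Path(str(m.get("source", "unknown"))).name for m in metadatas]
    # Pad file names to a common width so the '|' separators line up.
    width = max((len(name) for name in names), default=0)
    lines = []
    for i, (name, meta) in enumerate(zip(names, metadatas), start=1):
        page = meta.get("page", "?")
        lines.append(f"[{i}] {name:<{width}} | p.{page}")
    return lines


if __name__ == "__main__":
    example = [
        {"source": "articles/atrial_fibrillation.pdf", "page": 12},
        {"source": "articles/heart_failure.pdf", "page": 4},
    ]
    print("\n".join(format_sources(example)))
```

If the loader stores zero-based page indices, add 1 to the page value before display so the numbers match what a PDF viewer shows.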

🔧 Customization

  • Change model:
    Default is Qwen/Qwen2.5-7B-Instruct. You can try smaller models like google/flan-t5-base or other models of similar size like mistralai/Mistral-7B-Instruct-v0.2.

  • Adjust retrieval:
    Modify k in db.as_retriever(search_kwargs={"k": 4}) to return more or fewer chunks.

  • Expand knowledge base:
    Add more PDFs (guidelines, textbooks, papers) into articles/.
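
To show what the `k` setting controls, the sketch below ranks chunk embeddings against a query embedding and keeps the `k` best matches. It is a plain-Python illustration with hypothetical names (`cosine_similarity`, `top_k`) and toy two-dimensional vectors. The actual vector store performs this search internally and may use a different distance metric.

```python
import math


def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine of the angle between two vectors; 0.0 if either has zero length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def top_k(query_vec: list[float], chunk_vecs: list[list[float]], k: int = 4) -> list[int]:
    """Return the indices of the k chunks most similar to the query, best first."""
    scored = [(cosine_similarity(query_vec, vec), i) for i, vec in enumerate(chunk_vecs)]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [i for _, i in scored[:k]]


if __name__ == "__main__":
    chunks = [[1.0, 0.0], [0.0, 1.0], [0.9, 0.1], [-1.0, 0.0]]
    print(top_k([1.0, 0.0], chunks, k=2))
```

A larger `k` gives the model more context to draw on but also lengthens the prompt and can pull in less relevant chunks.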


📌 Applications

  • Medical students preparing for exams
  • Clinicians reviewing updated guidelines
  • Researchers summarizing across papers
  • Startups building intelligent assistants

⚠️ Disclaimer: This project is for educational purposes only. Always verify medical information with trusted clinical sources before applying it in practice.


📜 License

MIT License – see LICENSE for details.
Copyright (c) 2025 pvrego
