AI-powered PDF chat system using LangChain, Pinecone & Google Gemini. Ask questions about any PDF document and get intelligent answers.

aashutosh585/RAG-PDF

PDF-Based AI Chat System

A Node.js application that creates an intelligent chat system capable of answering questions about PDF documents using AI embeddings and vector search.

📺 Tutorial

This project was built following this YouTube tutorial: PDF-Based AI Chat System Tutorial

🚀 Features

  • PDF Processing: Load and chunk PDF documents for better processing
  • Vector Embeddings: Convert text chunks into vector embeddings using Google's Generative AI
  • Vector Database: Store embeddings in Pinecone for efficient similarity search
  • Intelligent Chat: Interactive chat system that answers questions based on PDF content
  • Context-Aware: Maintains conversation history for better context understanding
  • Query Transformation: Rewrites follow-up questions for better search results

📋 Prerequisites

  • Node.js (v16 or higher)
  • npm or yarn
  • Google Gemini API key
  • Pinecone account and API key

🛠️ Installation

  1. Clone this repository:
       git clone <your-repo-url>
       cd Day08
  2. Install dependencies:
       npm install
  3. Set up environment variables: Create a .env file in the root directory with:
       GEMINI_API_KEY=your_gemini_api_key_here
       PINECONE_API_KEY=your_pinecone_api_key_here
       PINECONE_ENVIRONMENT=your_pinecone_environment
       PINECONE_INDEX_NAME=your_pinecone_index_name
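
A missing or blank variable usually only surfaces later as an authentication error, so it can help to check the four names up front. The following is a minimal sketch, not code from this repository: findMissingEnvVars is a hypothetical helper, and it assumes the .env file has already been loaded into process.env (for example by dotenv).

```javascript
// Variable names taken from the .env example above.
const REQUIRED_ENV_VARS = [
  "GEMINI_API_KEY",
  "PINECONE_API_KEY",
  "PINECONE_ENVIRONMENT",
  "PINECONE_INDEX_NAME",
];

// Hypothetical helper: returns the names that are unset or blank.
function findMissingEnvVars(env = process.env) {
  return REQUIRED_ENV_VARS.filter((name) => {
    const value = env[name];
    return typeof value !== "string" || value.trim() === "";
  });
}

// Report problems before any API client is created.
const missing = findMissingEnvVars();
if (missing.length > 0) {
  console.error(`Missing environment variables: ${missing.join(", ")}`);
}
```

Running a check like this at the top of both index.js and query.js would make a misconfigured .env fail with a readable message.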

📂 Project Structure

├── index.js          # PDF processing and vector storage script
├── query.js          # Interactive chat interface
├── dsa.pdf           # Sample PDF document (Data Structures & Algorithms)
├── package.json      # Project dependencies
├── .env              # Environment variables (not tracked in git)
└── README.md         # Project documentation

🎯 Usage

1. Process PDF and Store Embeddings

First, run the indexing script to process your PDF and store it in Pinecone:

node index.js

This will:

  • Load the PDF document (dsa.pdf)
  • Split it into manageable chunks
  • Generate vector embeddings
  • Store the chunks and their embeddings in the Pinecone index
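
The chunking step can be illustrated with a simplified fixed-size splitter. This is only a sketch: the project itself uses LangChain's RecursiveCharacterTextSplitter, which additionally prefers to break on separators such as paragraphs and sentences. The function name chunkText is hypothetical; its defaults mirror the 1000/200 settings listed under Key Features.

```javascript
// Simplified sliding-window splitter (a stand-in, not the LangChain splitter).
function chunkText(text, chunkSize = 1000, chunkOverlap = 200) {
  if (chunkOverlap >= chunkSize) {
    throw new RangeError("chunkOverlap must be smaller than chunkSize");
  }
  const step = chunkSize - chunkOverlap; // how far the window advances each time
  const chunks = [];
  for (let start = 0; start < text.length; start += step) {
    chunks.push(text.slice(start, start + chunkSize));
    // Stop once the window has reached the end of the text.
    if (start + chunkSize >= text.length) break;
  }
  return chunks;
}
```

With these defaults each chunk repeats the last 200 characters of the previous one, so a sentence that straddles a boundary still appears whole in at least one chunk.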

2. Start Interactive Chat

Once indexing is complete, start the chat interface:

node query.js

The system will prompt you to ask questions about the PDF content.

🔧 Dependencies

Core Dependencies

  • @google/genai: Google Generative AI client
  • @langchain/community: LangChain community modules
  • @langchain/google-genai: Google AI integration for LangChain
  • @langchain/pinecone: Pinecone integration for LangChain
  • @langchain/textsplitters: Text splitting utilities
  • @pinecone-database/pinecone: Pinecone database client
  • dotenv: Environment variable management
  • readline-sync: Synchronous user input

🏗️ How It Works

  1. Document Processing (index.js):

    • Loads PDF using PDFLoader
    • Splits text into chunks with overlap for better context
    • Converts chunks to vector embeddings using Google's text-embedding-004 model
    • Stores embeddings in Pinecone with metadata
  2. Query Processing (query.js):

    • Takes user input and transforms follow-up questions for clarity
    • Converts questions to vector embeddings
    • Performs similarity search in Pinecone to find relevant content
    • Uses retrieved context to generate answers via Gemini AI
    • Maintains conversation history for context-aware responses
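
The query transformation step can be pictured as building a prompt that asks the model to turn a follow-up into a standalone question. The sketch below is an assumption about the general shape only: buildRewritePrompt, the { role, text } history format, and the prompt wording are all hypothetical, since this README does not show the actual prompt in query.js.

```javascript
// Hypothetical helper: builds the text sent to the model for query rewriting.
// `history` is assumed to be an array of { role, text } turns.
function buildRewritePrompt(history, followUp) {
  const transcript = history
    .map((turn) => `${turn.role}: ${turn.text}`)
    .join("\n");
  return [
    "Rewrite the follow-up question as a standalone question.",
    "Use the conversation only to resolve references such as \"it\" or \"that\".",
    "Return only the rewritten question.",
    "",
    "Conversation:",
    transcript || "(no previous turns)",
    "",
    `Follow-up question: ${followUp}`,
  ].join("\n");
}

// Example: the model should resolve "it" to "a stack".
const prompt = buildRewritePrompt(
  [
    { role: "user", text: "What is a stack?" },
    { role: "model", text: "A stack is a last-in, first-out data structure." },
  ],
  "How is it implemented?"
);
console.log(prompt);
```

The rewritten question, rather than the raw follow-up, is what would then be embedded and sent to the similarity search, because a bare "How is it implemented?" carries too little meaning to retrieve relevant chunks.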

🔑 Key Features

  • Chunking Strategy: Uses RecursiveCharacterTextSplitter with 1000 character chunks and 200 character overlap
  • Embedding Model: Google's text-embedding-004 for high-quality vector representations
  • Vector Search: Top-K similarity search (K=10) for relevant context retrieval
  • AI Model: Gemini-2.0-flash for generating human-like responses
  • Error Handling: Graceful handling when answers aren't found in the document
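
To make the Top-K search concrete, here is a small in-memory sketch of ranking stored vectors against a query vector. In the project this ranking is performed by Pinecone, not by application code; the function names, the { id, vector } record shape, and the choice of cosine similarity as the metric are assumptions for illustration (the actual metric depends on how the index was created).

```javascript
// Cosine similarity between two equal-length numeric vectors.
function cosineSimilarity(a, b) {
  if (a.length !== b.length) {
    throw new Error("vectors must have the same length");
  }
  let dot = 0;
  let normA = 0;
  let normB = 0;
  for (let i = 0; i < a.length; i += 1) {
    dot += a[i] * b[i];
    normA += a[i] * a[i];
    normB += b[i] * b[i];
  }
  if (normA === 0 || normB === 0) return 0; // avoid dividing by zero
  return dot / Math.sqrt(normA * normB);
}

// Returns the k records most similar to the query, highest score first.
function topK(queryVector, records, k = 10) {
  return records
    .map((record) => ({
      ...record,
      score: cosineSimilarity(queryVector, record.vector),
    }))
    .sort((x, y) => y.score - x.score)
    .slice(0, k);
}
```

A larger K gives the model more context to draw on at the cost of a longer prompt; K=10 with 1000-character chunks means up to roughly 10,000 characters of retrieved text per question.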

🚨 Important Notes

  • Ensure your PDF file is placed in the root directory as dsa.pdf or update the path in index.js
  • Run index.js before query.js to populate the vector database
  • The system is configured for Data Structures and Algorithms (DSA) content but can be adapted for other domains
  • Keep your API keys secure and never commit them to version control

🤝 Contributing

  1. Fork the repository
  2. Create a feature branch
  3. Make your changes
  4. Test thoroughly
  5. Submit a pull request

🐛 Troubleshooting

Common Issues

  1. Module Type Error: Ensure "type": "module" is set in package.json
  2. API Key Issues: Verify your environment variables are correctly set
  3. Dependency Conflicts: Use npm install --legacy-peer-deps if needed
  4. Pinecone Connection: Ensure your Pinecone index exists and is accessible

Support

If you encounter any issues, please check:

  • Node.js version compatibility
  • API key validity
  • Network connectivity
  • Pinecone index configuration
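
The first item on this list can be checked from code. The sketch below compares the running Node.js version against the v16 minimum stated under Prerequisites; meetsMinimumNode is a hypothetical helper and is not part of this repository.

```javascript
// Hypothetical helper: true if `version` (e.g. "v18.19.0" or "18.19.0")
// has a major version of at least `minimumMajor`.
function meetsMinimumNode(version, minimumMajor = 16) {
  const major = Number.parseInt(version.replace(/^v/, "").split(".")[0], 10);
  return Number.isInteger(major) && major >= minimumMajor;
}

// process.versions.node holds the running version without the leading "v".
if (!meetsMinimumNode(process.versions.node)) {
  console.error(
    `Node.js ${process.versions.node} detected; v16 or higher is required.`
  );
}
```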
