A Node.js application that creates an intelligent chat system capable of answering questions about PDF documents using AI embeddings and vector search.
This project was built following this YouTube tutorial: PDF-Based AI Chat System Tutorial
- PDF Processing: Load and chunk PDF documents for better processing
- Vector Embeddings: Convert text chunks into vector embeddings using Google's Generative AI
- Vector Database: Store embeddings in Pinecone for efficient similarity search
- Intelligent Chat: Interactive chat system that answers questions based on PDF content
- Context-Aware: Maintains conversation history for better context understanding
- Query Transformation: Rewrites follow-up questions for better search results
- Node.js (v16 or higher)
- npm or yarn
- Google Gemini API key
- Pinecone account and API key
- Clone this repository:
git clone <your-repo-url>
cd Day08
- Install dependencies:
npm install
- Set up environment variables:
Create a .env file in the root directory with:
GEMINI_API_KEY=your_gemini_api_key_here
PINECONE_API_KEY=your_pinecone_api_key_here
PINECONE_ENVIRONMENT=your_pinecone_environment
PINECONE_INDEX_NAME=your_pinecone_index_name

├── index.js # PDF processing and vector storage script
├── query.js # Interactive chat interface
├── dsa.pdf # Sample PDF document (Data Structures & Algorithms)
├── package.json # Project dependencies
├── .env # Environment variables (not tracked in git)
└── README.md # Project documentation
First, run the indexing script to process your PDF and store it in Pinecone:
node index.js
This will:
- Load the PDF document (dsa.pdf)
- Split it into manageable chunks
- Generate vector embeddings
- Store the chunks and their embeddings in the Pinecone database
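To make the "split into manageable chunks" step concrete, here is a minimal sketch of splitting with overlap. It is a simplified stand-in, not the splitter the project uses: it cuts fixed-size character windows, whereas the project relies on RecursiveCharacterTextSplitter, which also tries to break on natural separators. The function name `splitWithOverlap` is hypothetical; the default sizes are the 1000-character chunks and 200-character overlap mentioned elsewhere in this README.

```javascript
// Simplified stand-in for a text splitter: fixed-size windows with overlap.
// Hypothetical helper, not part of the project or of LangChain.
function splitWithOverlap(text, chunkSize = 1000, chunkOverlap = 200) {
  if (chunkOverlap >= chunkSize) {
    throw new Error("chunkOverlap must be smaller than chunkSize");
  }
  // Each new window starts this many characters after the previous one.
  const step = chunkSize - chunkOverlap;
  const chunks = [];
  for (let start = 0; start < text.length; start += step) {
    chunks.push(text.slice(start, start + chunkSize));
    // Stop once the current window already reaches the end of the text.
    if (start + chunkSize >= text.length) break;
  }
  return chunks;
}

const sample = "a".repeat(2500);
const chunks = splitWithOverlap(sample);
console.log(chunks.map((c) => c.length)); // [ 1000, 1000, 900 ]
```

Because consecutive windows share 200 characters, a sentence that straddles a chunk boundary still appears whole in at least one chunk, which is the reason for using overlap at all.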
Once indexing is complete, start the chat interface:
node query.js
The system will prompt you to ask questions about the PDF content.
- @google/genai: Google Generative AI client
- @langchain/community: LangChain community modules
- @langchain/google-genai: Google AI integration for LangChain
- @langchain/pinecone: Pinecone integration for LangChain
- @langchain/textsplitters: Text splitting utilities
- @pinecone-database/pinecone: Pinecone database client
- dotenv: Environment variable management
- readline-sync: Synchronous user input
- Document Processing (index.js):
  - Loads PDF using PDFLoader
  - Splits text into chunks with overlap for better context
  - Converts chunks to vector embeddings using Google's text-embedding-004 model
  - Stores embeddings in Pinecone with metadata
- Query Processing (query.js):
  - Takes user input and transforms follow-up questions for clarity
  - Converts questions to vector embeddings
  - Performs similarity search in Pinecone to find relevant content
  - Uses retrieved context to generate answers via Gemini AI
  - Maintains conversation history for context-aware responses
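The query transformation step can be illustrated with a small sketch of how a follow-up question might be turned into a rewrite prompt using the stored conversation history. Everything here is an assumption for illustration: the helper names `addTurn` and `buildRewritePrompt`, the prompt wording, and the `maxTurns` limit are hypothetical, and the actual prompt in query.js may differ. The model call itself is left out.

```javascript
// Hypothetical helpers: keep the chat history and build the prompt that asks
// the model to rewrite a follow-up question as a standalone question.
const history = [];

function addTurn(role, text) {
  history.push({ role, text });
}

function buildRewritePrompt(followUp, maxTurns = 6) {
  // Only the most recent turns are included, to keep the prompt short.
  const recent = history
    .slice(-maxTurns)
    .map((turn) => `${turn.role}: ${turn.text}`)
    .join("\n");
  return [
    "Rewrite the follow-up question as a standalone question.",
    "Use the conversation only to resolve references; do not answer it.",
    "",
    "Conversation:",
    recent || "(no previous turns)",
    "",
    `Follow-up question: ${followUp}`,
    "Standalone question:",
  ].join("\n");
}

addTurn("user", "What is a binary search tree?");
addTurn(
  "model",
  "A binary tree in which each node's left subtree holds smaller keys and its right subtree holds larger keys."
);
console.log(buildRewritePrompt("How do I delete a node from it?"));
```

The rewritten question, rather than the raw follow-up, is what would then be embedded and sent to the vector search, so that a pronoun like "it" does not weaken retrieval.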
- Chunking Strategy: Uses RecursiveCharacterTextSplitter with 1000 character chunks and 200 character overlap
- Embedding Model: Google's text-embedding-004 for high-quality vector representations
- Vector Search: Top-K similarity search (K=10) for relevant context retrieval
- AI Model: Gemini-2.0-flash for generating human-like responses
- Error Handling: Graceful handling when answers aren't found in the document
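As a rough picture of what a top-K similarity search does, the sketch below ranks a few in-memory records by cosine similarity to a query vector. This is only an illustration under stated assumptions: in the project the search runs inside Pinecone, not in application code, and the similarity metric depends on how the index was created. The functions `cosineSimilarity` and `topK` and the three-dimensional toy vectors are hypothetical.

```javascript
// Cosine similarity between two equal-length numeric vectors.
function cosineSimilarity(a, b) {
  let dot = 0;
  let normA = 0;
  let normB = 0;
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i];
    normA += a[i] * a[i];
    normB += b[i] * b[i];
  }
  // A zero vector has no direction; treat it as unrelated to everything.
  if (normA === 0 || normB === 0) return 0;
  return dot / (Math.sqrt(normA) * Math.sqrt(normB));
}

// Score every record against the query and keep the k best (K=10 by default).
function topK(queryVector, records, k = 10) {
  return records
    .map((r) => ({ ...r, score: cosineSimilarity(queryVector, r.values) }))
    .sort((x, y) => y.score - x.score)
    .slice(0, k);
}

// Toy data: real embeddings have hundreds of dimensions.
const records = [
  { id: "chunk-1", values: [1, 0, 0], text: "Arrays store elements contiguously." },
  { id: "chunk-2", values: [0, 1, 0], text: "A stack is last-in, first-out." },
  { id: "chunk-3", values: [0.9, 0.1, 0], text: "Array indexing takes constant time." },
];
const matches = topK([1, 0, 0], records, 2);
console.log(matches.map((m) => m.id)); // [ 'chunk-1', 'chunk-3' ]
```

The text of the returned matches is what gets passed to the model as context for the answer.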
- Ensure your PDF file is placed in the root directory as dsa.pdf or update the path in index.js
- Run index.js before query.js to populate the vector database
- The system is configured for Data Structure and Algorithm content but can be adapted for other domains
- Keep your API keys secure and never commit them to version control
- Fork the repository
- Create a feature branch
- Make your changes
- Test thoroughly
- Submit a pull request
- Module Type Error: Ensure "type": "module" is set in package.json
- API Key Issues: Verify your environment variables are correctly set
- Dependency Conflicts: Use npm install --legacy-peer-deps if needed
- Pinecone Connection: Ensure your Pinecone index exists and is accessible
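One way to check the environment variables is a short startup check like the sketch below. The variable names come from the .env example in this README; the helper `findMissingEnvVars` is hypothetical and not part of the project. The sketch assumes the variables have already been loaded into `process.env` (for example by dotenv) before it runs.

```javascript
// Names taken from the .env example in this README.
const REQUIRED_ENV_VARS = [
  "GEMINI_API_KEY",
  "PINECONE_API_KEY",
  "PINECONE_ENVIRONMENT",
  "PINECONE_INDEX_NAME",
];

// Hypothetical helper: returns the names of variables that are unset or blank.
function findMissingEnvVars(env = process.env) {
  return REQUIRED_ENV_VARS.filter(
    (name) => !env[name] || env[name].trim() === ""
  );
}

const missing = findMissingEnvVars();
if (missing.length > 0) {
  console.error(`Missing environment variables: ${missing.join(", ")}`);
} else {
  console.log("All required environment variables are set.");
}
```

Running a check like this before index.js or query.js does any work turns a confusing authentication failure into a message that names the missing variable.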
If you encounter any issues, please check:
- Node.js version compatibility
- API key validity
- Network connectivity
- Pinecone index configuration
