๐Ÿ” Intelligent semantic search for email archives using RAG, multilingual embeddings, and FAISS. Ask natural questions, get instant answers from your Gmail inbox.

nsourlos/RAG_gmail

📧 RAG Email Search System

๐Ÿ” Intelligent semantic search for your email inbox - Find exactly what you're looking for using state-of-the-art RAG (Retrieval-Augmented Generation) technology

Transform your Gmail archive into a searchable knowledge base using advanced NLP embeddings and vector similarity search. Ask natural language questions like "What was my latest eBay order?" or "When was my last flight booking?" and get instant, relevant results.


✨ Features

  • 🎯 Semantic Search: Find emails based on meaning, not just keywords
  • 🌍 Multilingual Support: Works with multiple languages using multilingual-e5-large-instruct
  • ⚡ Fast Retrieval: Efficient similarity search using FAISS vectorstore
  • 🔄 Flexible Architecture: Support for both chunked and full-document embeddings
  • 📊 Multiple Approaches: Compare direct cosine similarity vs. vectorstore implementations
  • 🎨 Metadata Enrichment: Preserve sender, subject, date, and labels for enhanced context
  • 🧹 Robust Preprocessing: Handle multipart emails, HTML content, and various encodings
  • 🔍 Cross-Encoder Reranking: Optional reranking for improved result quality
  • 💾 Caching System: Save embeddings for fast subsequent queries

🚀 Quick Start

Prerequisites

Installation

  1. Clone the repository
git clone https://github.com/nsourlos/rag_gmail.git
cd rag_gmail
  2. Create virtual environment using UV (recommended)
# For macOS/Linux
uv venv RAG_email --python 3.10
source RAG_email/bin/activate

# For Windows
uv venv RAG_email --python 3.10
.\RAG_email\Scripts\activate
  3. Install dependencies
uv pip install sentence-transformers==5.1.1 ipykernel langchain==0.3.27 faiss-cpu==1.12.0

# Optional: For GPU support on Windows
uv pip install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cu121
  4. Set up Jupyter kernel
# macOS/Linux
RAG_email/bin/python -m ipykernel install --user --name=RAG_email --display-name "RAG_email"

# Windows
RAG_email\Scripts\python -m ipykernel install --user --name=RAG_email --display-name "RAG_email"
  5. Place your Gmail MBOX file
# Export your Gmail and place the file as:
# ./gmail.mbox
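
To check that the export is readable before running the notebook, a minimal sketch using only Python's standard `mailbox` module can list the headers of each message. The function names `read_mbox_metadata` and `_decode` are hypothetical and not taken from the notebook; `X-Gmail-Labels` is the header Gmail exports use for labels.

```python
import mailbox
from email.header import decode_header, make_header


def read_mbox_metadata(path):
    """Return one dict of decoded headers per message in an MBOX file."""
    records = []
    # create=False makes a missing file raise instead of silently creating one.
    box = mailbox.mbox(path, create=False)
    try:
        for message in box:
            records.append({
                "sender": _decode(message.get("From")),
                "subject": _decode(message.get("Subject")),
                "date": _decode(message.get("Date")),
                "labels": _decode(message.get("X-Gmail-Labels")),
            })
    finally:
        box.close()
    return records


def _decode(value):
    """Decode RFC 2047 encoded-words; return an empty string if the header is absent."""
    if value is None:
        return ""
    return str(make_header(decode_header(str(value))))
```

Calling `read_mbox_metadata("gmail.mbox")` and printing the first few records is a quick way to confirm that subjects in non-ASCII scripts decode correctly.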

📖 Usage

Basic Example

# Load your emails
mbox = load_mbox('gmail.mbox')

# Process and create embeddings
format_messages_for_embeddings(mbox)

# Load the model and embeddings
doc_embeddings, model = load_embeddings()

# Ask natural language questions
queries = [
    'What was my latest eBay order?',
    'When was my last flight booking?',
    'What was my most recent Amazon purchase?'
]

# Get results
# (documents_cleaned is assumed to hold the cleaned email texts produced by the notebook's preprocessing step)
task = 'Given an email, retrieve relevant passages that answer the query'
similarity_scores = get_similarity_scores(queries, task, model, doc_embeddings, top_k=200)
print_results(similarity_scores, queries, documents_cleaned, top_k=200)
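
The `task` string matters because instruction-tuned E5 models expect each query to be prefixed with a one-line task description, while documents are embedded unchanged. The sketch below shows that prompt format as documented on the multilingual-e5-large-instruct model card; the helper name `build_instructed_query` is hypothetical, and whether the notebook formats queries exactly this way is an assumption.

```python
def build_instructed_query(task, query):
    """Prefix a query with its task description, E5-instruct style."""
    return f"Instruct: {task}\nQuery: {query}"


task = "Given an email, retrieve relevant passages that answer the query"
queries = [
    "What was my latest eBay order?",
    "When was my last flight booking?",
]

# Only the queries carry the instruction; email texts are embedded as-is.
instructed_queries = [build_instructed_query(task, q) for q in queries]
```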

Advanced: Using Chunking

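For better retrieval on long emails, split them into overlapping chunks (the embedding model reads at most 512 tokens per input, so longer text is truncated):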

# Enable chunking
chunked = True
chunk_size = 800
chunk_overlap = 160

# Process with chunks
format_messages_for_embeddings(mbox, chunk_size=chunk_size, chunk_overlap=chunk_overlap)
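
To make the two parameters concrete, here is a minimal sketch of fixed-size chunking with overlap, measured in characters. It is an illustration only: `chunk_text` is a hypothetical helper, and the notebook may use a different splitter (for example one from LangChain) with different boundary rules.

```python
def chunk_text(text, chunk_size=800, chunk_overlap=160):
    """Split text into windows of chunk_size characters that share chunk_overlap characters."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= chunk_overlap < chunk_size:
        raise ValueError("chunk_overlap must be in [0, chunk_size)")

    step = chunk_size - chunk_overlap  # how far the window advances each time
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # the last window already reaches the end of the text
    return chunks
```

With the values above, each window advances by 640 characters, so a sentence cut at one chunk boundary still appears whole in the neighbouring chunk as long as it is shorter than the 160-character overlap.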

Advanced: Using FAISS Vectorstore

For scalability with large email archives:

# Build vectorstore index from the document embeddings loaded earlier
index = build_vectorstore_index(doc_embeddings)

# Search using vectorstore
similarity_scores, retrieved_indices = get_similarity_scores(
    queries, task, model, doc_embeddings, index=index, top_k=200
)
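
Conceptually, a flat inner-product index such as FAISS `IndexFlatIP` compares the query against every stored vector and returns the highest inner products; when all vectors are L2-normalized, the inner product equals cosine similarity. The pure-Python sketch below reproduces that computation on tiny vectors for illustration. The function names are hypothetical, and a real index does the same work far faster on arrays.

```python
import math


def l2_normalize(vector):
    """Scale a vector to unit length (zero vectors are returned unchanged)."""
    norm = math.sqrt(sum(x * x for x in vector))
    return [x / norm for x in vector] if norm else list(vector)


def flat_inner_product_search(query, doc_vectors, top_k):
    """Exhaustively score every document and return (scores, indices) of the top_k."""
    q = l2_normalize(query)
    scored = []
    for idx, doc in enumerate(doc_vectors):
        d = l2_normalize(doc)
        scored.append((sum(a * b for a, b in zip(q, d)), idx))
    scored.sort(key=lambda pair: pair[0], reverse=True)
    top = scored[:top_k]
    return [score for score, _ in top], [idx for _, idx in top]
```

For example, `flat_inner_product_search([2, 0], [[1, 0], [0, 1], [1, 1]], top_k=2)` ranks document 0 first and document 2 second.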

๐Ÿ—๏ธ Architecture

โ”Œโ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”
โ”‚  Gmail MBOX     โ”‚
โ”‚  Export File    โ”‚
โ””โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”ฌโ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”˜
         โ”‚
         โ–ผ
โ”Œโ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”
โ”‚  Preprocessing  โ”‚
โ”‚  โ€ข Metadata     โ”‚
โ”‚  โ€ข Body Extract โ”‚
โ”‚  โ€ข Cleaning     โ”‚
โ””โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”ฌโ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”˜
         โ”‚
         โ–ผ
โ”Œโ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”
โ”‚  Embedding      โ”‚
โ”‚  multilingual-  โ”‚
โ”‚  e5-large       โ”‚
โ””โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”ฌโ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”˜
         โ”‚
         โ–ผ
โ”Œโ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”
โ”‚  Vector Store   โ”‚
โ”‚  (FAISS/Direct) โ”‚
โ””โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”ฌโ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”˜
         โ”‚
         โ–ผ
โ”Œโ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”
โ”‚  Query & Search โ”‚
โ”‚  Top-K Results  โ”‚
โ””โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”˜

🔬 Technical Details

Embedding Model

  • Model: intfloat/multilingual-e5-large-instruct
  • Dimensions: 1024
  • Context Length: 512 tokens
  • Languages: 100+ languages supported

Search Methods

  1. Direct Cosine Similarity - Fast GPU-accelerated search for smaller datasets
  2. FAISS Vectorstore - Scalable CPU-based search with IndexFlatIP
  3. Cross-Encoder Reranking - Optional refinement using ms-marco-MiniLM
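
The third method is a two-stage pattern: a fast embedding search proposes candidates, then a slower scorer that reads the query and each candidate together reorders them. The sketch below shows only the reordering step, with the scorer passed in as a function. `rerank` and `token_overlap` are hypothetical names, and `token_overlap` is a toy stand-in so the example runs without a model; in practice the scorer would wrap a cross-encoder, for instance via `CrossEncoder.predict` from sentence-transformers.

```python
def rerank(query, candidates, score_pair, top_n):
    """Reorder candidate texts by score_pair(query, text), highest first."""
    scored = [(score_pair(query, text), text) for text in candidates]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [text for _, text in scored[:top_n]]


def token_overlap(query, text):
    """Toy scorer: fraction of query tokens that also appear in the text."""
    query_tokens = set(query.lower().split())
    text_tokens = set(text.lower().split())
    return len(query_tokens & text_tokens) / len(query_tokens) if query_tokens else 0.0


candidates = [
    "Weekly newsletter",
    "Your flight booking to Paris is confirmed",
    "eBay order shipped",
]
best = rerank("flight booking", candidates, token_overlap, top_n=2)
```

Because reranking scores every candidate against the query individually, it is usually applied to a short list (the top results of the first stage) rather than the whole archive.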

Supported Email Features

  • ✅ Multipart emails (text/plain, text/html)
  • ✅ Multiple encodings (UTF-8, Latin-1, etc.)
  • ✅ Gmail labels and threads
  • ✅ HTML content cleaning
  • ✅ Email metadata (sender, date, subject)
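
As an illustration of what handling multipart emails and mixed encodings involves, the following sketch extracts readable text from a message using only the standard library. It prefers text/plain parts, falls back to tag-stripped HTML, and tries the declared charset before UTF-8 and Latin-1. All three function names are hypothetical, and the notebook's own handlers may differ in order and detail.

```python
import html
import re


def extract_body(message):
    """Return the text of an email.message.Message, preferring plain-text parts."""
    plain_parts, html_parts = [], []
    for part in message.walk():
        if part.is_multipart() or part.get_content_disposition() == "attachment":
            continue
        content_type = part.get_content_type()
        if content_type not in ("text/plain", "text/html"):
            continue
        payload = part.get_payload(decode=True)  # undoes base64 / quoted-printable
        if payload is None:
            continue
        text = decode_bytes(payload, part.get_content_charset())
        (plain_parts if content_type == "text/plain" else html_parts).append(text)
    if plain_parts:
        return "\n".join(plain_parts).strip()
    return strip_html("\n".join(html_parts))


def decode_bytes(payload, declared_charset):
    """Try the declared charset, then UTF-8; Latin-1 is the last resort and never fails."""
    for charset in (declared_charset, "utf-8"):
        if not charset:
            continue
        try:
            return payload.decode(charset)
        except (LookupError, UnicodeDecodeError):
            continue
    return payload.decode("latin-1")


def strip_html(markup):
    """Drop script/style blocks and tags, unescape entities, collapse whitespace."""
    markup = re.sub(r"(?is)<(script|style)\b.*?</\1>", " ", markup)
    text = re.sub(r"(?s)<[^>]+>", " ", markup)
    return re.sub(r"\s+", " ", html.unescape(text)).strip()
```

A regex-based tag stripper is enough for a sketch; heavily formatted marketing emails may call for a real HTML parser.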

🎓 Use Cases

  • 📦 Order Tracking: "Show me all my Amazon orders from last year"
  • ✈️ Travel Planning: "When was my flight to Paris?"
  • 💼 Work History: "Find emails about the Q3 project proposal"
  • 🔐 Account Recovery: "What's my Netflix subscription email?"
  • 📅 Event Recall: "When did I receive the wedding invitation?"

๐Ÿ› ๏ธ Customization

Adjust Search Parameters

# More results for better recall
top_k = 500

# Enable chunking for long emails
chunk_size = 800
chunk_overlap = 160

# Use MPS (Apple Silicon GPU) acceleration when available, otherwise fall back to CPU
import torch
device = "mps" if torch.backends.mps.is_available() else "cpu"

Use Different Embedding Models

The notebook includes experiments with:

  • google/embeddinggemma-300m
  • Qwen/Qwen3-Embedding-0.6B
  • Cross-encoders for reranking

๐Ÿ“ Project Structure

RAG_email/
โ”œโ”€โ”€ RAG_email.ipynb               # Main notebook
โ”œโ”€โ”€ README.md                     # This file
โ”œโ”€โ”€ gmail.mbox                    # Your email export (not included)

๐Ÿ› Troubleshooting

Issue: Out of Memory

Solution: Enable chunking or use CPU device instead of GPU

Issue: Encoding Errors

Solution: The notebook includes robust encoding handlers for Greek, UTF-8, and other character sets

Issue: Slow Search

Solution: Use FAISS vectorstore or reduce top_k parameter

Issue: MBOX File Not Found

Solution: Export your Gmail as an MBOX file (for example with Google Takeout) and place it in the project root as gmail.mbox


🔮 Future Enhancements

  • Web interface for easier querying
  • Integration with LLMs for answer generation
  • Support for attachments and images
  • Real-time email monitoring
  • Advanced filtering by date ranges and senders
  • Export search results to CSV
  • Multi-account support

📚 References & Acknowledgments


๐Ÿค Contributing

Contributions are welcome! Feel free to:

  • ๐Ÿ› Report bugs
  • ๐Ÿ’ก Suggest features
  • ๐Ÿ”ง Submit pull requests

โญ Star History

If you find this project useful, please consider giving it a star! โญ


📧 Contact

For questions or feedback, please open an issue on GitHub.
