Intelligent semantic search for your email inbox - Find exactly what you're looking for using state-of-the-art RAG (Retrieval-Augmented Generation) technology
Transform your Gmail archive into a searchable knowledge base using advanced NLP embeddings and vector similarity search. Ask natural language questions like "What was my latest eBay order?" or "When was my last flight booking?" and get instant, relevant results.
- Semantic Search: Find emails based on meaning, not just keywords
- Multilingual Support: Works with multiple languages using multilingual-e5-large-instruct
- Fast Retrieval: Efficient similarity search using FAISS vectorstore
- Flexible Architecture: Support for both chunked and full-document embeddings
- Multiple Approaches: Compare direct cosine similarity vs. vectorstore implementations
- Metadata Enrichment: Preserve sender, subject, date, and labels for enhanced context
- Robust Preprocessing: Handle multipart emails, HTML content, and various encodings
- Cross-Encoder Reranking: Optional reranking for improved result quality
- Caching System: Save embeddings for fast subsequent queries
- Python 3.10+
- Gmail MBOX export file (How to export Gmail)
- 8GB+ RAM recommended
- Clone the repository
git clone https://github.com/nsourlos/rag_gmail.git
cd rag_gmail
- Create virtual environment using UV (recommended)
# For macOS/Linux
uv venv RAG_email --python 3.10
source RAG_email/bin/activate
# For Windows
uv venv RAG_email --python 3.10
.\RAG_email\Scripts\activate
- Install dependencies
uv pip install sentence-transformers==5.1.1 ipykernel langchain==0.3.27 faiss-cpu==1.12.0
# Optional: For GPU support on Windows
uv pip install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cu121
- Set up Jupyter kernel
# macOS/Linux
RAG_email/bin/python -m ipykernel install --user --name=RAG_email --display-name "RAG_email"
# Windows
RAG_email\Scripts\python -m ipykernel install --user --name=RAG_email --display-name "RAG_email"
- Place your Gmail MBOX file
# Export your Gmail and place the file as:
# ./gmail.mbox
# Load your emails
mbox = load_mbox('gmail.mbox')
# Process and create embeddings
format_messages_for_embeddings(mbox)
# Load the model and embeddings
doc_embeddings, model = load_embeddings()
# Ask natural language questions
queries = [
'What was my latest eBay order?',
'When was my last flight booking?',
'What was my most recent Amazon purchase?'
]
# Get results
task = 'Given an email, retrieve relevant passages that answer the query'
similarity_scores = get_similarity_scores(queries, task, model, doc_embeddings, top_k=200)
print_results(similarity_scores, queries, documents_cleaned, top_k=200)
For better performance with long emails:
# Enable chunking
chunked = True
chunk_size = 800
chunk_overlap = 160
# Process with chunks
format_messages_for_embeddings(mbox, chunk_size=chunk_size, chunk_overlap=chunk_overlap)
For scalability with large email archives:
# Build vectorstore index
index = build_vectorstore_index(doc_embeddings)
# Search using vectorstore
similarity_scores, retrieved_indices = get_similarity_scores(
queries, task, model, doc_embeddings, index=index, top_k=200
)
┌─────────────────┐
│   Gmail MBOX    │
│   Export File   │
└────────┬────────┘
         │
         ▼
┌─────────────────┐
│  Preprocessing  │
│  • Metadata     │
│  • Body Extract │
│  • Cleaning     │
└────────┬────────┘
         │
         ▼
┌─────────────────┐
│    Embedding    │
│  multilingual-  │
│    e5-large     │
└────────┬────────┘
         │
         ▼
┌─────────────────┐
│  Vector Store   │
│ (FAISS/Direct)  │
└────────┬────────┘
         │
         ▼
┌─────────────────┐
│ Query & Search  │
│  Top-K Results  │
└─────────────────┘
- Model: intfloat/multilingual-e5-large-instruct
- Dimensions: 1024
- Context Length: 512 tokens
- Languages: 100+ languages supported
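The instruct variant of E5 expects each query to carry a one-line task description, while documents are embedded without any prefix. The sketch below shows the query format documented on the model card; `build_instruct_query` is a hypothetical helper name used for illustration, not a function taken from the notebook.

```python
def build_instruct_query(task: str, query: str) -> str:
    """Prefix a query with its task description, E5-instruct style."""
    return f"Instruct: {task}\nQuery: {query}"


task = "Given an email, retrieve relevant passages that answer the query"
formatted = build_instruct_query(task, "What was my latest eBay order?")
print(formatted)
```

Only queries are wrapped this way; the email texts themselves would be passed to the model unchanged.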
- Direct Cosine Similarity - Fast GPU-accelerated search for smaller datasets
- FAISS Vectorstore - Scalable CPU-based search with IndexFlatIP
- Cross-Encoder Reranking - Optional refinement using ms-marco-MiniLM
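The first two methods compute the same ranking when embeddings are L2-normalized, because the inner product used by IndexFlatIP then equals cosine similarity. The following is a minimal pure-Python sketch of that idea, written for illustration only: `l2_normalize` and `top_k_cosine` are hypothetical names, and a real run would use tensor operations or FAISS rather than Python lists.

```python
import heapq
import math


def l2_normalize(vec):
    """Scale a vector to unit length (zero vectors are returned unchanged)."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec] if norm else list(vec)


def top_k_cosine(query_vec, doc_vecs, k=3):
    """Return the k (index, score) pairs with the highest cosine similarity."""
    q = l2_normalize(query_vec)
    scored = []
    for idx, doc in enumerate(doc_vecs):
        d = l2_normalize(doc)
        # Inner product of unit vectors is their cosine similarity.
        scored.append((sum(a * b for a, b in zip(q, d)), idx))
    return [(idx, score) for score, idx in heapq.nlargest(k, scored)]


docs = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
results = top_k_cosine([2.0, 0.1], docs, k=2)
print(results)
```

The query points almost along the first axis, so document 0 ranks first and the diagonal document 2 ranks second.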
- Multipart emails (text/plain, text/html)
- Multiple encodings (UTF-8, Latin-1, etc.)
- Gmail labels and threads
- HTML content cleaning
- Email metadata (sender, date, subject)
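To make the preprocessing steps listed above concrete, here is a rough sketch that uses only the Python standard library. It prefers the text/plain part, falls back to text/html with crude tag stripping, and keeps sender, subject, and date. The function name `extract_text` and the sample message are assumptions made for illustration; the notebook's own handlers may work differently.

```python
import html
import re
from email import message_from_string, policy


def extract_text(msg):
    """Return (metadata, body), preferring text/plain over text/html."""
    part = msg.get_body(preferencelist=("plain", "html"))
    body = part.get_content() if part is not None else ""
    if part is not None and part.get_content_type() == "text/html":
        body = re.sub(r"<[^>]+>", " ", body)  # crude tag stripping
        body = html.unescape(body)            # decode entities such as &amp;
    body = re.sub(r"\s+", " ", body).strip()  # collapse whitespace
    meta = {key: str(msg.get(key, "")) for key in ("From", "Subject", "Date")}
    return meta, body


raw = (
    "From: shop@example.com\n"
    "Subject: Your order\n"
    "Date: Mon, 01 Jan 2024 10:00:00 +0000\n"
    "MIME-Version: 1.0\n"
    "Content-Type: text/html; charset=utf-8\n"
    "\n"
    "<html><body><p>Order &amp; invoice <b>shipped</b></p></body></html>\n"
)
msg = message_from_string(raw, policy=policy.default)
meta, body = extract_text(msg)
print(meta["Subject"], "|", body)
```

Parsing with `policy.default` is what provides `get_body()` and charset-aware `get_content()`. Messages read through the standard `mailbox.mbox` class use the legacy message type by default, so they would need a custom factory or re-parsing before this helper applies.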
- Order Tracking: "Show me all my Amazon orders from last year"
- Travel Planning: "When was my flight to Paris?"
- Work History: "Find emails about the Q3 project proposal"
- Account Recovery: "What's my Netflix subscription email?"
- Event Recall: "When did I receive the wedding invitation?"
# More results for better recall
top_k = 500
# Enable chunking for long emails
chunk_size = 800
chunk_overlap = 160
# Use GPU/MPS acceleration when available, otherwise fall back to CPU
import torch
device = "cuda" if torch.cuda.is_available() else "mps" if torch.backends.mps.is_available() else "cpu"
The notebook includes experiments with:
- google/embeddinggemma-300m
- Qwen/Qwen3-Embedding-0.6B
- Cross-encoders for reranking
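For reference, the chunk_size = 800 and chunk_overlap = 160 settings used in this README mean that each new chunk advances by 800 - 160 = 640 units. The sketch below illustrates that arithmetic with a plain sliding window over characters. It assumes the sizes are counted in characters, and `chunk_text` is a hypothetical helper; the notebook may use a different splitter or count tokens instead.

```python
def chunk_text(text: str, chunk_size: int = 800, chunk_overlap: int = 160) -> list[str]:
    """Split text into fixed-size character windows that overlap."""
    if chunk_overlap >= chunk_size:
        raise ValueError("chunk_overlap must be smaller than chunk_size")
    step = chunk_size - chunk_overlap  # 800 - 160 = 640 new characters per chunk
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # this window already reaches the end of the text
    return chunks


email_body = "x" * 2000
chunks = chunk_text(email_body)
print([len(c) for c in chunks])
```

A 2000-character body yields windows starting at offsets 0, 640, and 1280, so the last chunk is shorter than the others.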
RAG_email/
├── RAG_email.ipynb # Main notebook
├── README.md # This file
└── gmail.mbox # Your email export (not included)
Solution: Enable chunking or use CPU device instead of GPU
Solution: The notebook includes robust encoding handlers for Greek, UTF-8, and other character sets
Solution: Use FAISS vectorstore or reduce top_k parameter
Solution: Export your Gmail following this guide and place it in the project root
- Web interface for easier querying
- Integration with LLMs for answer generation
- Support for attachments and images
- Real-time email monitoring
- Advanced filtering by date ranges and senders
- Export search results to CSV
- Multi-account support
- Sentence Transformers - Embedding framework
- FAISS - Efficient similarity search
- LangChain - RAG framework
- multilingual-e5-large-instruct - Embedding model
Contributions are welcome! Feel free to:
- Report bugs
- Suggest features
- Submit pull requests
If you find this project useful, please consider giving it a star!
For questions or feedback, please open an issue on GitHub.