This project develops an Information Retrieval System for Digikala’s FAQ section using TF-IDF and LaBSE-based embeddings, with ranking via cosine similarity, Jaccard similarity, and a custom metric. It supports Persian text processing with Hazm and Parsivar, focusing on data collection, augmentation, and evaluation.

🔍 Information Retrieval System for Digikala's FAQ

MIT License · Python 3.8+ · Flask · Persian Text Support

An advanced, real-time Information Retrieval System specifically designed for Digikala's FAQ section. This system combines state-of-the-art natural language processing techniques with Persian text processing capabilities to deliver accurate and contextually relevant answers to user queries.

🚀 Key Highlights:

  • Persian Language Expertise: Native support for Persian text processing and understanding
  • Real-time Processing: Instant query processing with WebSocket integration
  • AI-Powered Responses: LLM integration for enhanced, contextual answers
  • Multiple Ranking Algorithms: TF-IDF, LaBSE embeddings with cosine, Jaccard, and custom similarity metrics
  • Interactive Web Interface: Modern UI with Persian language support (DigiSage)
  • Modular Architecture: Extensible pipeline design for easy customization and scaling

🎯 Overview

This Information Retrieval System is engineered to handle Persian language queries for Digikala's FAQ section with high accuracy and speed. The system leverages modern NLP techniques, including transformer-based embeddings and traditional IR methods, to provide contextually relevant answers.

Why This System?

  • Language-Specific Optimization: Tailored for Persian text with proper tokenization, normalization, and linguistic processing
  • Hybrid Approach: Combines traditional IR methods (TF-IDF) with modern embeddings (LaBSE) for optimal performance
  • Real-time Performance: Sub-second response times with efficient indexing and retrieval algorithms
  • Scalable Architecture: Modular design allows easy integration and scaling for enterprise environments
  • User-Friendly Interface: Intuitive web interface designed for Persian speakers
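The hybrid approach above can be sketched as a weighted blend of the lexical (TF-IDF) and semantic (LaBSE) similarity signals for the same query/document pair. The `hybrid_score` function and its `alpha` weight are illustrative assumptions, not documented parameters of this project:

```python
# Hypothetical sketch of a hybrid ranking score: a weighted blend of a
# TF-IDF similarity and an embedding similarity. `alpha` is an assumed
# tuning knob, not a documented project parameter.
def hybrid_score(tfidf_sim: float, embedding_sim: float, alpha: float = 0.5) -> float:
    return alpha * tfidf_sim + (1 - alpha) * embedding_sim

# A document scoring 0.8 on exact-term overlap and 0.6 on semantic
# similarity, weighted 70/30 toward the lexical signal:
print(hybrid_score(0.8, 0.6, alpha=0.7))  # roughly 0.74
```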

✨ Features

🔤 Persian Text Processing

  • Advanced Tokenization: Handles Persian text complexities including compound words and abbreviations
  • Text Normalization: Comprehensive normalization including diacritic removal and character standardization
  • Stemming & Lemmatization: Persian-specific morphological analysis for better matching
  • Stopword Filtering: Customizable Persian stopword lists for improved relevance
  • Informal Text Handling: Processes colloquial and informal Persian text
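To make the normalization and stopword steps concrete, here is a minimal pure-Python sketch. The character map and stopword list are tiny illustrative samples; the real pipeline uses Hazm/Parsivar and the `data/stops.txt` list:

```python
# Illustrative sketch of character standardization and stopword filtering
# for Persian text. The mappings and stopword list below are tiny samples;
# the actual system relies on Hazm/Parsivar and data/stops.txt.
ARABIC_TO_PERSIAN = {"ي": "ی", "ك": "ک"}   # standardize Arabic character variants
STOPWORDS = {"از", "به", "را", "و", "در"}    # sample Persian stopwords

def normalize(text: str) -> str:
    for arabic, persian in ARABIC_TO_PERSIAN.items():
        text = text.replace(arabic, persian)
    return text

def tokenize_and_filter(text: str) -> list[str]:
    # whitespace tokenization stands in for Hazm's word_tokenize here
    return [t for t in normalize(text).split() if t not in STOPWORDS]

print(tokenize_and_filter("كتاب را از فروشگاه خريدم"))
```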

🧠 Information Retrieval

  • Multiple Embedding Models:
    • TF-IDF with customizable parameters
    • LaBSE (Language-agnostic BERT Sentence Embeddings)
    • Support for custom embedding models
  • Advanced Similarity Metrics:
    • Cosine Similarity for semantic matching
    • Jaccard Similarity for term overlap
    • Custom hybrid metrics combining multiple signals
  • Intelligent Ranking: Multi-factor ranking considering relevance, category, and user context
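The two named metrics can be sketched directly. This is an illustrative implementation over plain vectors and token sets, not the project's similarity module:

```python
# Cosine similarity over dense vectors and Jaccard similarity over token
# sets, as described above (illustrative only).
import math

def cosine_similarity(a: list[float], b: list[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0

def jaccard_similarity(a: set[str], b: set[str]) -> float:
    union = a | b
    return len(a & b) / len(union) if union else 0.0

print(cosine_similarity([1.0, 0.0, 1.0], [1.0, 1.0, 0.0]))  # 0.5
print(jaccard_similarity({"سفارش", "لغو"}, {"سفارش", "تغییر"}))  # 1/3
```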

🚀 Real-time System

  • WebSocket Integration: Real-time query processing and response streaming
  • Asynchronous Processing: Non-blocking query handling for improved user experience
  • Live Results: Progressive result loading as processing completes
  • Session Management: Maintains context across multiple queries

🤖 AI Integration

  • LLM Enhancement: Integration with Together AI for contextual response generation
  • Prompt Engineering: Optimized prompts for Persian customer service scenarios
  • Fallback Mechanisms: Graceful degradation when AI services are unavailable
  • Response Personalization: Contextual responses based on retrieval results
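The fallback behaviour described above can be sketched as follows. All function names here are hypothetical stand-ins for the project's actual components:

```python
# Hypothetical sketch of graceful degradation: if the LLM call fails,
# fall back to the top retrieved FAQ answer. Names are illustrative.
def answer_query(query, retrieve, generate_with_llm):
    results = retrieve(query)                      # ranked FAQ entries
    try:
        return generate_with_llm(query, results)   # enhanced, contextual answer
    except Exception:
        # Fallback: best raw FAQ answer when the LLM service is unavailable
        return results[0]["answer"] if results else "No answer found."

# Usage with stubs standing in for the real retrieval and LLM components:
retrieve = lambda q: [{"question": "q1", "answer": "a1"}]
def failing_llm(q, results):
    raise RuntimeError("LLM service unavailable")

print(answer_query("test", retrieve, failing_llm))  # falls back to "a1"
```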

📊 Analytics & Monitoring

  • MLflow Integration: Comprehensive experiment tracking and model versioning
  • Performance Metrics: Real-time monitoring of query response times and accuracy
  • Usage Analytics: Detailed logging of user interactions and system performance
  • A/B Testing Support: Framework for testing different retrieval strategies

🎨 User Interface

  • DigiSage Interface: Modern, responsive web interface
  • Persian Language Support: Right-to-left text support and Persian fonts
  • Mobile Responsive: Optimized for desktop and mobile devices
  • Accessibility: Screen reader compatible and keyboard navigation support

🏗️ Architecture

┌─────────────────┐    ┌──────────────────┐    ┌─────────────────┐
│   Web Browser   │◄──►│   Flask Server   │◄──►│  Pipeline Core  │
│   (DigiSage)    │    │   (app.py)       │    │                 │
└─────────────────┘    └──────────────────┘    └─────────────────┘
                               │                         │
                               │                         ▼
                               │                ┌─────────────────┐
                               │                │   Preprocessor  │
                               │                │   - Tokenizer   │
                               │                │   - Normalizer  │
                               │                │   - Stemmer     │
                               │                └─────────────────┘
                               │                         │
                               │                         ▼
                               │                ┌─────────────────┐
                               │                │   Vectorizer    │
                               │                │   - TF-IDF      │
                               │                │   - LaBSE       │
                               │                └─────────────────┘
                               │                         │
                               │                         ▼
                               │                ┌─────────────────┐
                               │                │   Similarity    │
                               │                │   - Cosine      │
                               │                │   - Jaccard     │
                               │                │   - Custom      │
                               │                └─────────────────┘
                               │                         │
                               ▼                         ▼
                      ┌─────────────────┐    ┌─────────────────┐
                      │   Together AI   │    │   Evaluator     │
                      │   (LLM)         │    │   & Logger      │
                      └─────────────────┘    └─────────────────┘

Key Components:

  1. Pipeline Core: Modular processing pipeline with training and inference modes
  2. Web Server: Flask application with SocketIO for real-time communication
  3. Text Processing: Persian-specific NLP pipeline
  4. Retrieval Engine: Multi-algorithm similarity search
  5. AI Enhancement: LLM integration for response generation
  6. Monitoring: MLflow-based experiment tracking and logging

📋 Prerequisites

Before installing the system, ensure you have:

  • Python 3.8 or higher
  • pip (Python package installer)
  • Virtual environment (recommended: venv or conda)
  • Git for cloning the repository
  • 4GB+ RAM for optimal performance with embeddings
  • Internet connection for downloading language models and dependencies

System Requirements

Component   Minimum   Recommended
Python      3.8+      3.9+
RAM         4GB       8GB+
Storage     2GB       5GB+
CPU         2 cores   4+ cores

🚀 Installation

1. Clone the Repository

git clone https://github.com/jonathansina/information-retrieval-system.git
cd information-retrieval-system

2. Create Virtual Environment

# Using venv (recommended)
python -m venv venv

# Activate on Linux/Mac
source venv/bin/activate

# Activate on Windows
venv\Scripts\activate

3. Install Dependencies

# Install all required packages
pip install -r requirements.txt

# Verify installation
python -c "import hazm, sklearn, flask; print('Dependencies installed successfully')"

4. Download Language Models

# Download Persian language models (if needed)
python -c "import hazm; print('Hazm ready for Persian processing')"

5. Prepare Data

# Ensure data directory exists and contains required files
ls data/
# Should show: augmented_dataset.csv, test_digikala_faq.csv, stops.txt, etc.

⚡ Quick Start

Start the System

# Run the application
python app.py

Access the Interface

  1. Open your web browser
  2. Navigate to http://localhost:5000
  3. You'll see the DigiSage interface
  4. Enter a Persian query and press send

Example Query

Try entering: چگونه می‌توانم سفارشم را تغییر دهم؟ (How can I change my order?)

📖 Usage

Development Mode

Training a New Model

from src.pipelines.pipeline.builder import PipelineBuilder
from src.pipelines.type_hint import ControllerType
from src.pipelines.config.default import PIPELINE_DEFAULT_CONFIG

# Create training pipeline
pipeline = (
    PipelineBuilder(controller_type=ControllerType.TRAINING)
    .with_logger("information-retrieval", "IRS", "development")
    .with_config(PIPELINE_DEFAULT_CONFIG)
    .with_preprocessor(True)
    .with_vectorizer(True)
    .with_vocabulary()
    .with_similarity()
    .with_evaluator(True)
    .build(True)
)

# Train the model
pipeline.run()

Running Inference

# Create inference pipeline
inference_pipeline = (
    PipelineBuilder(controller_type=ControllerType.INFERENCE)
    .with_logger("information-retrieval", "IRS", "development")
    .with_config(PIPELINE_DEFAULT_CONFIG)
    .with_preprocessor()
    .with_vectorizer()
    .with_vocabulary()
    .with_similarity()
    .with_evaluator()
    .build()
)

# Process a query
result = inference_pipeline.run("سوال شما به فارسی")
print(result)

Development with Jupyter

# Start Jupyter for experimentation
cd dev/
jupyter notebook data_analysis.ipynb

Production Mode

Running the Web Application

# Production mode with optimized settings
python app.py

Using the REST API

# Submit a query via API
curl -X POST http://localhost:5000/submit \
  -H "Content-Type: application/json" \
  -d '{"query": "چگونه محصولی را برگردانم؟"}'
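The same request can be made from Python's standard library. The call itself is commented out because it assumes the server is running on localhost:5000:

```python
# Standard-library equivalent of the curl call above.
import json
from urllib import request

payload = json.dumps({"query": "چگونه محصولی را برگردانم؟"}).encode("utf-8")
req = request.Request(
    "http://localhost:5000/submit",
    data=payload,
    headers={"Content-Type": "application/json"},
)
# with request.urlopen(req) as resp:      # requires the server to be running
#     print(json.loads(resp.read()))
```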

WebSocket Integration

// Connect to WebSocket
const socket = io('http://localhost:5000');

// Listen for retrieval results
socket.on('ir_results', (data) => {
    console.log('Search results:', data.ir_results);
});

// Listen for AI responses
socket.on('llm_response', (data) => {
    console.log('AI response:', data.llm_response);
});

API Documentation

Endpoints

POST /submit

Submit a query for processing.

Request:

{
  "query": "سوال شما به زبان فارسی"
}

Response:

{
  "status": "Processing started"
}

WebSocket Events:

  • ir_results: Returns retrieval results
  • llm_response: Returns AI-generated response

GET /

Serves the main interface (DigiSage).

WebSocket Events

Client to Server

  • connect: Establish connection

Server to Client

  • ir_results: Retrieval results

    {
      "ir_results": [
        {
          "question": "سوال",
          "answer": "پاسخ", 
          "category": "دسته‌بندی"
        }
      ]
    }
  • llm_response: AI-generated response

    {
      "llm_response": "پاسخ تولید شده توسط هوش مصنوعی"
    }

💡 Examples

Basic Persian Query Processing

# Example 1: Simple product return query
query = "چگونه می‌توانم کالای خریداری شده را برگردانم؟"
result = inference_pipeline.run(query)

# Result structure:
{
    "retrieved_question": ["سوالات مشابه"],
    "retrieved_answer": ["پاسخ‌های مربوطه"], 
    "retrieved_category": ["دسته‌بندی‌ها"]
}

Custom Configuration

# Example 2: Custom preprocessing configuration
from src.pipelines.config.config import PreprocessorConfig

custom_config = PreprocessorConfig(
    normalizer=True,
    stemmer=True,
    stopwords=True,
    lemmatizer=False
)

# Use in pipeline
pipeline = PipelineBuilder().with_config(custom_config).build()

Batch Processing

# Example 3: Process multiple queries
queries = [
    "هزینه ارسال چقدر است؟",
    "چه زمانی سفارشم ارسال می‌شود؟",
    "چگونه می‌توانم سفارش را لغو کنم؟"
]

results = []
for query in queries:
    result = inference_pipeline.run(query)
    results.append(result)

Integration with Custom Models

# Example 4: Using custom similarity metrics
from src.pipelines.similarity.base import BaseSimilaritySearch

class CustomSimilarity(BaseSimilaritySearch):
    def search(self, query_vectors, corpus_vectors):
        # Your custom similarity implementation
        pass

# Use in pipeline
pipeline = (
    PipelineBuilder()
    .with_similarity(CustomSimilarity())
    .build()
)

📊 Performance

Benchmark Results

Metric                  Value     Description
Average Response Time   < 200ms   End-to-end query processing
Retrieval Accuracy      89.5%     Top-5 relevant results
Persian Text Coverage   95%+      Successful processing rate
Concurrent Users        100+      Simultaneous query handling
Memory Usage            ~2GB      With full embeddings loaded

Evaluation Metrics

The system is evaluated using:

  • Precision@K: Relevance of top-K results
  • Recall@K: Coverage of relevant documents
  • MRR: Mean Reciprocal Rank
  • NDCG: Normalized Discounted Cumulative Gain
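For reference, Precision@K and MRR can be computed as follows, assuming binary relevance labels. This is illustrative code, not the project's evaluator module:

```python
# Illustrative implementations of Precision@K and Mean Reciprocal Rank,
# assuming binary relevance (a result is either relevant or not).
def precision_at_k(retrieved: list[str], relevant: set[str], k: int) -> float:
    # fraction of the top-k retrieved documents that are relevant
    return sum(1 for doc in retrieved[:k] if doc in relevant) / k

def mean_reciprocal_rank(rankings: list[list[str]], relevant: list[set[str]]) -> float:
    # average over queries of 1/rank of the first relevant result
    total = 0.0
    for retrieved, rel in zip(rankings, relevant):
        for rank, doc in enumerate(retrieved, start=1):
            if doc in rel:
                total += 1.0 / rank
                break
    return total / len(rankings)

print(precision_at_k(["d1", "d2", "d3"], {"d1", "d3"}, k=3))                 # 2/3
print(mean_reciprocal_rank([["d1", "d2"], ["d2", "d1"]], [{"d1"}, {"d1"}]))  # 0.75
```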

Performance Optimization

# Enable MLflow tracking for performance monitoring
import mlflow

mlflow.start_run()
mlflow.log_metric("response_time", response_time)
mlflow.log_metric("accuracy", accuracy_score)
mlflow.end_run()

📁 Project Structure

information-retrieval-system/
├── 📁 data/                          # Dataset storage
│   ├── augmented_dataset.csv         # Training data with augmentations
│   ├── test_digikala_faq.csv        # Evaluation dataset
│   ├── digikala_faq.csv             # Original FAQ data
│   ├── stops.txt                     # Persian stopwords
│   ├── synonyms.json                 # Synonym mappings
│   └── vocabulary.txt                # Processed vocabulary
├── 📁 dev/                           # Development notebooks
│   ├── data_analysis.ipynb          # Data exploration and analysis
│   ├── test.ipynb                   # Testing and experimentation
│   └── time_experiment.png          # Performance visualization
├── 📁 src/                           # Core source code
│   ├── 📁 pipelines/                # Processing pipelines
│   │   ├── 📁 config/               # Configuration management
│   │   ├── 📁 evaluator/            # Performance evaluation
│   │   ├── 📁 logger/               # Logging utilities
│   │   ├── 📁 pipeline/             # Pipeline orchestration
│   │   ├── 📁 preprocessor/         # Text preprocessing
│   │   ├── 📁 similarity/           # Similarity algorithms
│   │   ├── 📁 vectorizer/           # Text vectorization
│   │   ├── 📁 vocabulary/           # Vocabulary management
│   │   └── type_hint.py             # Type definitions
│   └── 📁 generate/                 # Data generation utilities
├── 📁 ui/                            # Web interface (DigiSage)
│   ├── index.html                   # Main interface
│   ├── 📁 js/                       # JavaScript functionality
│   ├── 📁 styles/                   # CSS styling
│   └── 📁 images/                   # UI assets
├── app.py                           # Flask web application
├── requirements.txt                 # Python dependencies
├── README.md                        # Project documentation
└── LICENSE                          # MIT license

Key Files Description

File/Directory   Purpose
app.py           Main Flask application with WebSocket support
src/pipelines/   Modular NLP processing pipeline
ui/              DigiSage web interface
data/            Training and evaluation datasets
dev/             Jupyter notebooks for experimentation

⚙️ Configuration

Environment Variables

Create a .env file in the root directory:

# API Configuration
TOGETHER_API_KEY=your_together_api_key_here
FLASK_ENV=development
FLASK_DEBUG=True

# Server Configuration
HOST=0.0.0.0
PORT=5000

# MLflow Configuration
MLFLOW_TRACKING_URI=http://localhost:9437

Pipeline Configuration

Customize processing parameters in src/pipelines/config/default.py:

# Text Processing
PREPROCESSOR_DEFAULT_CONFIG = PreprocessorConfig(
    normalizer=True,
    stemmer=False,
    lemmatizer=False,
    stopwords=True,
    # ... other parameters
)

# Vectorization
VECTORIZER_DEFAULT_CONFIG = VectorizerConfig(
    vectorizer="tf-idf",
    vectorizer_param={
        "max_features": 10000,
        "ngram_range": (1, 2),
        # ... other parameters
    }
)

# Similarity Search
SIMILARITY_SEARCH_DEFAULT_CONFIG = SimilaritySearchConfig(
    metrics="cosine",
    metrics_param={
        "n_neighbors": 5,
    }
)

🚀 Deployment

Docker Deployment (Recommended)

# Build Docker image
docker build -t digikala-ir-system .

# Run container
docker run -p 5000:5000 \
  -e TOGETHER_API_KEY=your_key \
  digikala-ir-system

Production Deployment

# Using Gunicorn for production
pip install gunicorn

# Start with an eventlet worker (Flask-SocketIO's eventlet mode requires a single worker)
gunicorn --worker-class eventlet -w 1 --bind 0.0.0.0:5000 app:app

Cloud Deployment

AWS EC2

# Install dependencies on EC2
sudo yum update -y
sudo yum install python3 python3-pip -y

# Clone and setup
git clone https://github.com/jonathansina/information-retrieval-system.git
cd information-retrieval-system
pip3 install -r requirements.txt

# Run with systemd service
sudo systemctl enable digikala-ir
sudo systemctl start digikala-ir

Docker Compose

version: '3.8'
services:
  web:
    build: .
    ports:
      - "5000:5000"
    environment:
      - TOGETHER_API_KEY=your_key
    volumes:
      - ./data:/app/data
  
  mlflow:
    image: python:3.9
    command: sh -c "pip install mlflow && mlflow server --host 0.0.0.0 --port 9437"
    ports:
      - "9437:9437"

🔧 Troubleshooting

Common Issues

1. Dependencies Installation Fails

Problem: pip install -r requirements.txt fails

Solution:

# Upgrade pip first
pip install --upgrade pip

# Install with verbose output
pip install -r requirements.txt -v

# If still fails, install individually
pip install flask flask-socketio flask-cors
pip install hazm scikit-learn pandas numpy

2. Persian Text Not Displaying Correctly

Problem: Persian text appears as question marks or broken characters

Solution:

  • Ensure your terminal/browser supports UTF-8 encoding
  • Check that Persian fonts are installed on your system
  • Verify the UI CSS includes proper Persian font specifications

3. MLflow Connection Issues

Problem: Cannot connect to MLflow tracking server

Solution:

# Start MLflow server manually
mlflow server --host 0.0.0.0 --port 9437

# Or disable MLflow in development
export MLFLOW_TRACKING_URI=""

4. Out of Memory Errors

Problem: System runs out of memory during processing

Solution:

  • Reduce the size of your dataset for testing
  • Decrease max_features in TF-IDF configuration
  • Use smaller batch sizes for processing
  • Consider using lighter embedding models

5. WebSocket Connection Fails

Problem: Real-time features not working

Solution:

  • Check the browser console for errors
  • Ensure the SocketIO client library is properly loaded
  • Verify the CORS settings in app.py

Debug Mode

Enable debug logging:

import logging
logging.basicConfig(level=logging.DEBUG)

Performance Issues

Monitor system performance:

# Check memory usage
htop

# Monitor Python processes
ps aux | grep python

# Check disk space
df -h

Getting Help

  1. Check Issues: Look for similar problems in GitHub Issues
  2. Create Issue: If not found, create a new issue with:
    • System information (OS, Python version)
    • Complete error message
    • Steps to reproduce
    • Expected vs actual behavior

🤝 Contributing

We welcome contributions! Here's how you can help:

Development Setup

# Fork the repository
git clone https://github.com/your-username/information-retrieval-system.git
cd information-retrieval-system

# Create development branch
git checkout -b feature/your-feature-name

# Install development dependencies
pip install -r requirements.txt
pip install -r requirements-dev.txt  # If available

Contribution Guidelines

  1. Code Style: Follow PEP 8 conventions
  2. Documentation: Update relevant documentation
  3. Testing: Add tests for new functionality
  4. Commits: Use clear, descriptive commit messages
  5. Pull Requests: Include description of changes and testing

Areas for Contribution

  • 🌐 Internationalization: Support for other languages
  • 🚀 Performance: Optimization and caching
  • 🎨 UI/UX: Interface improvements
  • 🧪 Testing: Comprehensive test coverage
  • 📚 Documentation: Examples and tutorials
  • 🔧 DevOps: CI/CD and deployment automation

Code Review Process

  1. Fork the repository
  2. Create a feature branch
  3. Make your changes
  4. Test thoroughly
  5. Submit a pull request
  6. Address review feedback
  7. Merge after approval

Community

  • 💬 Discussions: Use GitHub Discussions for questions
  • 🐛 Bug Reports: Use GitHub Issues
  • 📧 Email: Contact maintainers for sensitive issues

📄 License

This project is licensed under the MIT License - see the LICENSE file for details.

MIT License Summary

  • Commercial use allowed
  • Modification allowed
  • Distribution allowed
  • Private use allowed
  • Liability not included
  • Warranty not included

🙏 Acknowledgments

  • Hazm Team: For excellent Persian NLP tools
  • Digikala: For the problem domain and inspiration
  • Open Source Community: For the amazing libraries used
  • Contributors: Everyone who has contributed to this project

Made with ❤️ for the Persian NLP community

⭐ Star this repository | 🐛 Report Bug | 💡 Request Feature
