This project develops an Information Retrieval System for Digikala’s FAQ section using TF-IDF and LaBSE-based embeddings, with ranking via cosine similarity, Jaccard similarity, and a custom metric. It supports Persian text processing with Hazm and Parsivar, focusing on data collection, augmentation, and evaluation.

🔍 Information Retrieval System for Digikala's FAQ

MIT License · Python 3.8+ · Flask · Persian Text Support

An advanced, real-time Information Retrieval System specifically designed for Digikala's FAQ section. This system combines state-of-the-art natural language processing techniques with Persian text processing capabilities to deliver accurate and contextually relevant answers to user queries.

🚀 Key Highlights:

  • Persian Language Expertise: Native support for Persian text processing and understanding
  • Real-time Processing: Instant query processing with WebSocket integration
  • AI-Powered Responses: LLM integration for enhanced, contextual answers
  • Multiple Ranking Algorithms: TF-IDF, LaBSE embeddings with cosine, Jaccard, and custom similarity metrics
  • Interactive Web Interface: Modern UI with Persian language support (DigiSage)
  • Modular Architecture: Extensible pipeline design for easy customization and scaling

🎯 Overview

This Information Retrieval System is engineered to handle Persian language queries for Digikala's FAQ section with high accuracy and speed. The system leverages modern NLP techniques, including transformer-based embeddings and traditional IR methods, to provide contextually relevant answers.

Why This System?

  • Language-Specific Optimization: Tailored for Persian text with proper tokenization, normalization, and linguistic processing
  • Hybrid Approach: Combines traditional IR methods (TF-IDF) with modern embeddings (LaBSE) for optimal performance
  • Real-time Performance: Sub-second response times with efficient indexing and retrieval algorithms
  • Scalable Architecture: Modular design allows easy integration and scaling for enterprise environments
  • User-Friendly Interface: Intuitive web interface designed for Persian speakers
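The hybrid approach above can be sketched as a weighted blend of the lexical (TF-IDF) and semantic (LaBSE) similarity signals for the same query/document pair. The `hybrid_score` function and its `alpha` weight are illustrative assumptions, not documented parameters of this project:

```python
# Hypothetical sketch of a hybrid ranking score: a weighted blend of a
# TF-IDF similarity and an embedding similarity. `alpha` is an assumed
# tuning knob, not a documented project parameter.
def hybrid_score(tfidf_sim: float, embedding_sim: float, alpha: float = 0.5) -> float:
    return alpha * tfidf_sim + (1 - alpha) * embedding_sim

# A document scoring 0.8 on exact-term overlap and 0.6 on semantic
# similarity, weighted 70/30 toward the lexical signal:
print(hybrid_score(0.8, 0.6, alpha=0.7))  # roughly 0.74
```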

✨ Features

🔤 Persian Text Processing

  • Advanced Tokenization: Handles Persian text complexities including compound words and abbreviations
  • Text Normalization: Comprehensive normalization including diacritic removal and character standardization
  • Stemming & Lemmatization: Persian-specific morphological analysis for better matching
  • Stopword Filtering: Customizable Persian stopword lists for improved relevance
  • Informal Text Handling: Processes colloquial and informal Persian text
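To make the normalization and stopword steps concrete, here is a minimal pure-Python sketch. The character map and stopword list are tiny illustrative samples; the real pipeline uses Hazm/Parsivar and the `data/stops.txt` list:

```python
# Illustrative sketch of character standardization and stopword filtering
# for Persian text. The mappings and stopword list below are tiny samples;
# the actual system relies on Hazm/Parsivar and data/stops.txt.
ARABIC_TO_PERSIAN = {"ي": "ی", "ك": "ک"}   # standardize Arabic character variants
STOPWORDS = {"از", "به", "را", "و", "در"}    # sample Persian stopwords

def normalize(text: str) -> str:
    for arabic, persian in ARABIC_TO_PERSIAN.items():
        text = text.replace(arabic, persian)
    return text

def tokenize_and_filter(text: str) -> list[str]:
    # whitespace tokenization stands in for Hazm's word_tokenize here
    return [t for t in normalize(text).split() if t not in STOPWORDS]

print(tokenize_and_filter("كتاب را از فروشگاه خريدم"))
```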

🧠 Information Retrieval

  • Multiple Embedding Models:
    • TF-IDF with customizable parameters
    • LaBSE (Language-agnostic BERT Sentence Embeddings)
    • Support for custom embedding models
  • Advanced Similarity Metrics:
    • Cosine Similarity for semantic matching
    • Jaccard Similarity for term overlap
    • Custom hybrid metrics combining multiple signals
  • Intelligent Ranking: Multi-factor ranking considering relevance, category, and user context
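The two named metrics can be sketched directly. This is an illustrative implementation over plain vectors and token sets, not the project's similarity module:

```python
# Cosine similarity over dense vectors and Jaccard similarity over token
# sets, as described above (illustrative only).
import math

def cosine_similarity(a: list[float], b: list[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0

def jaccard_similarity(a: set[str], b: set[str]) -> float:
    union = a | b
    return len(a & b) / len(union) if union else 0.0

print(cosine_similarity([1.0, 0.0, 1.0], [1.0, 1.0, 0.0]))  # 0.5
print(jaccard_similarity({"سفارش", "لغو"}, {"سفارش", "تغییر"}))  # 1/3
```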

🚀 Real-time System

  • WebSocket Integration: Real-time query processing and response streaming
  • Asynchronous Processing: Non-blocking query handling for improved user experience
  • Live Results: Progressive result loading as processing completes
  • Session Management: Maintains context across multiple queries

🤖 AI Integration

  • LLM Enhancement: Integration with Together AI for contextual response generation
  • Prompt Engineering: Optimized prompts for Persian customer service scenarios
  • Fallback Mechanisms: Graceful degradation when AI services are unavailable
  • Response Personalization: Contextual responses based on retrieval results
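The fallback behaviour described above can be sketched as follows. All function names here are hypothetical stand-ins for the project's actual components:

```python
# Hypothetical sketch of graceful degradation: if the LLM call fails,
# fall back to the top retrieved FAQ answer. Names are illustrative.
def answer_query(query, retrieve, generate_with_llm):
    results = retrieve(query)                      # ranked FAQ entries
    try:
        return generate_with_llm(query, results)   # enhanced, contextual answer
    except Exception:
        # Fallback: best raw FAQ answer when the LLM service is unavailable
        return results[0]["answer"] if results else "No answer found."

# Usage with stubs standing in for the real retrieval and LLM components:
retrieve = lambda q: [{"question": "q1", "answer": "a1"}]
def failing_llm(q, results):
    raise RuntimeError("LLM service unavailable")

print(answer_query("test", retrieve, failing_llm))  # falls back to "a1"
```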

📊 Analytics & Monitoring

  • MLflow Integration: Comprehensive experiment tracking and model versioning
  • Performance Metrics: Real-time monitoring of query response times and accuracy
  • Usage Analytics: Detailed logging of user interactions and system performance
  • A/B Testing Support: Framework for testing different retrieval strategies

🎨 User Interface

  • DigiSage Interface: Modern, responsive web interface
  • Persian Language Support: Right-to-left text support and Persian fonts
  • Mobile Responsive: Optimized for desktop and mobile devices
  • Accessibility: Screen reader compatible and keyboard navigation support

🏗️ Architecture

┌─────────────────┐    ┌──────────────────┐    ┌─────────────────┐
│   Web Browser   │◄──►│   Flask Server   │◄──►│  Pipeline Core  │
│   (DigiSage)    │    │   (app.py)       │    │                 │
└─────────────────┘    └──────────────────┘    └─────────────────┘
                               │                         │
                               │                         ▼
                               │                ┌─────────────────┐
                               │                │   Preprocessor  │
                               │                │   - Tokenizer   │
                               │                │   - Normalizer  │
                               │                │   - Stemmer     │
                               │                └─────────────────┘
                               │                         │
                               │                         ▼
                               │                ┌─────────────────┐
                               │                │   Vectorizer    │
                               │                │   - TF-IDF      │
                               │                │   - LaBSE       │
                               │                └─────────────────┘
                               │                         │
                               │                         ▼
                               │                ┌─────────────────┐
                               │                │   Similarity    │
                               │                │   - Cosine      │
                               │                │   - Jaccard     │
                               │                │   - Custom      │
                               │                └─────────────────┘
                               │                         │
                               ▼                         ▼
                      ┌─────────────────┐    ┌─────────────────┐
                      │   Together AI   │    │   Evaluator     │
                      │   (LLM)         │    │   & Logger      │
                      └─────────────────┘    └─────────────────┘

Key Components:

  1. Pipeline Core: Modular processing pipeline with training and inference modes
  2. Web Server: Flask application with SocketIO for real-time communication
  3. Text Processing: Persian-specific NLP pipeline
  4. Retrieval Engine: Multi-algorithm similarity search
  5. AI Enhancement: LLM integration for response generation
  6. Monitoring: MLflow-based experiment tracking and logging

📋 Prerequisites

Before installing the system, ensure you have:

  • Python 3.8 or higher
  • pip (Python package installer)
  • Virtual environment (recommended: venv or conda)
  • Git for cloning the repository
  • 4GB+ RAM for optimal performance with embeddings
  • Internet connection for downloading language models and dependencies

System Requirements

Component   Minimum   Recommended
Python      3.8+      3.9+
RAM         4GB       8GB+
Storage     2GB       5GB+
CPU         2 cores   4+ cores

🚀 Installation

1. Clone the Repository

git clone https://github.com/jonathansina/information-retrieval-system.git
cd information-retrieval-system

2. Create Virtual Environment

# Using venv (recommended)
python -m venv venv

# Activate on Linux/Mac
source venv/bin/activate

# Activate on Windows
venv\Scripts\activate

3. Install Dependencies

# Install all required packages
pip install -r requirements.txt

# Verify installation
python -c "import hazm, sklearn, flask; print('Dependencies installed successfully')"

4. Download Language Models

# Download Persian language models (if needed)
python -c "import hazm; print('Hazm ready for Persian processing')"

5. Prepare Data

# Ensure data directory exists and contains required files
ls data/
# Should show: augmented_dataset.csv, test_digikala_faq.csv, stops.txt, etc.

⚡ Quick Start

Start the System

# Run the application
python app.py

Access the Interface

  1. Open your web browser
  2. Navigate to http://localhost:5000
  3. You'll see the DigiSage interface
  4. Enter a Persian query and press send

Example Query

Try entering: چگونه می‌توانم سفارشم را تغییر دهم؟ (How can I change my order?)

📖 Usage

Development Mode

Training a New Model

from src.pipelines.pipeline.builder import PipelineBuilder
from src.pipelines.type_hint import ControllerType
from src.pipelines.config.default import PIPELINE_DEFAULT_CONFIG

# Create training pipeline
pipeline = (
    PipelineBuilder(controller_type=ControllerType.TRAINING)
    .with_logger("information-retrieval", "IRS", "development")
    .with_config(PIPELINE_DEFAULT_CONFIG)
    .with_preprocessor(True)
    .with_vectorizer(True)
    .with_vocabulary()
    .with_similarity()
    .with_evaluator(True)
    .build(True)
)

# Train the model
pipeline.run()

Running Inference

# Create inference pipeline
inference_pipeline = (
    PipelineBuilder(controller_type=ControllerType.INFERENCE)
    .with_logger("information-retrieval", "IRS", "development")
    .with_config(PIPELINE_DEFAULT_CONFIG)
    .with_preprocessor()
    .with_vectorizer()
    .with_vocabulary()
    .with_similarity()
    .with_evaluator()
    .build()
)

# Process a query
result = inference_pipeline.run("سوال شما به فارسی")
print(result)

Development with Jupyter

# Start Jupyter for experimentation
cd dev/
jupyter notebook data_analysis.ipynb

Production Mode

Running the Web Application

# Production mode with optimized settings
python app.py

Using the REST API

# Submit a query via API
curl -X POST http://localhost:5000/submit \
  -H "Content-Type: application/json" \
  -d '{"query": "چگونه محصولی را برگردانم؟"}'
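The same request can be made from Python's standard library. The call itself is commented out because it assumes the server is running on localhost:5000:

```python
# Standard-library equivalent of the curl call above.
import json
from urllib import request

payload = json.dumps({"query": "چگونه محصولی را برگردانم؟"}).encode("utf-8")
req = request.Request(
    "http://localhost:5000/submit",
    data=payload,
    headers={"Content-Type": "application/json"},
)
# with request.urlopen(req) as resp:      # requires the server to be running
#     print(json.loads(resp.read()))
```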

WebSocket Integration

// Connect to WebSocket
const socket = io('http://localhost:5000');

// Listen for retrieval results
socket.on('ir_results', (data) => {
    console.log('Search results:', data.ir_results);
});

// Listen for AI responses
socket.on('llm_response', (data) => {
    console.log('AI response:', data.llm_response);
});

API Documentation

Endpoints

POST /submit

Submit a query for processing.

Request:

{
  "query": "سوال شما به زبان فارسی"
}

Response:

{
  "status": "Processing started"
}

WebSocket Events:

  • ir_results: Returns retrieval results
  • llm_response: Returns AI-generated response

GET /

Serves the main interface (DigiSage).

WebSocket Events

Client to Server

  • connect: Establish connection

Server to Client

  • ir_results: Retrieval results

    {
      "ir_results": [
        {
          "question": "سوال",
          "answer": "پاسخ", 
          "category": "دسته‌بندی"
        }
      ]
    }
  • llm_response: AI-generated response

    {
      "llm_response": "پاسخ تولید شده توسط هوش مصنوعی"
    }

💡 Examples

Basic Persian Query Processing

# Example 1: Simple product return query
query = "چگونه می‌توانم کالای خریداری شده را برگردانم؟"
result = inference_pipeline.run(query)

# Result structure:
{
    "retrieved_question": ["سوالات مشابه"],
    "retrieved_answer": ["پاسخ‌های مربوطه"], 
    "retrieved_category": ["دسته‌بندی‌ها"]
}

Custom Configuration

# Example 2: Custom preprocessing configuration
from src.pipelines.config.config import PreprocessorConfig

custom_config = PreprocessorConfig(
    normalizer=True,
    stemmer=True,
    stopwords=True,
    lemmatizer=False
)

# Use in pipeline
pipeline = PipelineBuilder().with_config(custom_config).build()

Batch Processing

# Example 3: Process multiple queries
queries = [
    "هزینه ارسال چقدر است؟",
    "چه زمانی سفارشم ارسال می‌شود؟",
    "چگونه می‌توانم سفارش را لغو کنم؟"
]

results = []
for query in queries:
    result = inference_pipeline.run(query)
    results.append(result)

Integration with Custom Models

# Example 4: Using custom similarity metrics
from src.pipelines.similarity.base import BaseSimilaritySearch

class CustomSimilarity(BaseSimilaritySearch):
    def search(self, query_vectors, corpus_vectors):
        # Your custom similarity implementation
        pass

# Use in pipeline
pipeline = (
    PipelineBuilder()
    .with_similarity(CustomSimilarity())
    .build()
)

📊 Performance

Benchmark Results

Metric                  Value     Description
Average Response Time   < 200ms   End-to-end query processing
Retrieval Accuracy      89.5%     Top-5 relevant results
Persian Text Coverage   95%+      Successful processing rate
Concurrent Users        100+      Simultaneous query handling
Memory Usage            ~2GB      With full embeddings loaded

Evaluation Metrics

The system is evaluated using:

  • Precision@K: Relevance of top-K results
  • Recall@K: Coverage of relevant documents
  • MRR: Mean Reciprocal Rank
  • NDCG: Normalized Discounted Cumulative Gain
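For reference, Precision@K and MRR can be computed as follows, assuming binary relevance labels. This is illustrative code, not the project's evaluator module:

```python
# Illustrative implementations of Precision@K and Mean Reciprocal Rank,
# assuming binary relevance (a result is either relevant or not).
def precision_at_k(retrieved: list[str], relevant: set[str], k: int) -> float:
    # fraction of the top-k retrieved documents that are relevant
    return sum(1 for doc in retrieved[:k] if doc in relevant) / k

def mean_reciprocal_rank(rankings: list[list[str]], relevant: list[set[str]]) -> float:
    # average over queries of 1/rank of the first relevant result
    total = 0.0
    for retrieved, rel in zip(rankings, relevant):
        for rank, doc in enumerate(retrieved, start=1):
            if doc in rel:
                total += 1.0 / rank
                break
    return total / len(rankings)

print(precision_at_k(["d1", "d2", "d3"], {"d1", "d3"}, k=3))                 # 2/3
print(mean_reciprocal_rank([["d1", "d2"], ["d2", "d1"]], [{"d1"}, {"d1"}]))  # 0.75
```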

Performance Optimization

# Enable MLflow tracking for performance monitoring
import mlflow

mlflow.start_run()
mlflow.log_metric("response_time", response_time)
mlflow.log_metric("accuracy", accuracy_score)
mlflow.end_run()

📁 Project Structure

information-retrieval-system/
├── 📁 data/                          # Dataset storage
│   ├── augmented_dataset.csv         # Training data with augmentations
│   ├── test_digikala_faq.csv        # Evaluation dataset
│   ├── digikala_faq.csv             # Original FAQ data
│   ├── stops.txt                     # Persian stopwords
│   ├── synonyms.json                 # Synonym mappings
│   └── vocabulary.txt                # Processed vocabulary
├── 📁 dev/                           # Development notebooks
│   ├── data_analysis.ipynb          # Data exploration and analysis
│   ├── test.ipynb                   # Testing and experimentation
│   └── time_experiment.png          # Performance visualization
├── 📁 src/                           # Core source code
│   ├── 📁 pipelines/                # Processing pipelines
│   │   ├── 📁 config/               # Configuration management
│   │   ├── 📁 evaluator/            # Performance evaluation
│   │   ├── 📁 logger/               # Logging utilities
│   │   ├── 📁 pipeline/             # Pipeline orchestration
│   │   ├── 📁 preprocessor/         # Text preprocessing
│   │   ├── 📁 similarity/           # Similarity algorithms
│   │   ├── 📁 vectorizer/           # Text vectorization
│   │   ├── 📁 vocabulary/           # Vocabulary management
│   │   └── type_hint.py             # Type definitions
│   └── 📁 generate/                 # Data generation utilities
├── 📁 ui/                            # Web interface (DigiSage)
│   ├── index.html                   # Main interface
│   ├── 📁 js/                       # JavaScript functionality
│   ├── 📁 styles/                   # CSS styling
│   └── 📁 images/                   # UI assets
├── app.py                           # Flask web application
├── requirements.txt                 # Python dependencies
├── README.md                        # Project documentation
└── LICENSE                          # MIT license

Key Files Description

File/Directory   Purpose
app.py           Main Flask application with WebSocket support
src/pipelines/   Modular NLP processing pipeline
ui/              DigiSage web interface
data/            Training and evaluation datasets
dev/             Jupyter notebooks for experimentation

⚙️ Configuration

Environment Variables

Create a .env file in the root directory:

# API Configuration
TOGETHER_API_KEY=your_together_api_key_here
FLASK_ENV=development
FLASK_DEBUG=True

# Server Configuration
HOST=0.0.0.0
PORT=5000

# MLflow Configuration
MLFLOW_TRACKING_URI=http://localhost:9437

Pipeline Configuration

Customize processing parameters in src/pipelines/config/default.py:

# Text Processing
PREPROCESSOR_DEFAULT_CONFIG = PreprocessorConfig(
    normalizer=True,
    stemmer=False,
    lemmatizer=False,
    stopwords=True,
    # ... other parameters
)

# Vectorization
VECTORIZER_DEFAULT_CONFIG = VectorizerConfig(
    vectorizer="tf-idf",
    vectorizer_param={
        "max_features": 10000,
        "ngram_range": (1, 2),
        # ... other parameters
    }
)

# Similarity Search
SIMILARITY_SEARCH_DEFAULT_CONFIG = SimilaritySearchConfig(
    metrics="cosine",
    metrics_param={
        "n_neighbors": 5,
    }
)

🚀 Deployment

Docker Deployment (Recommended)

# Build Docker image
docker build -t digikala-ir-system .

# Run container
docker run -p 5000:5000 \
  -e TOGETHER_API_KEY=your_key \
  digikala-ir-system

Production Deployment

# Using Gunicorn for production
pip install gunicorn

# Start with an eventlet worker (Flask-SocketIO's eventlet mode requires a single worker)
gunicorn --worker-class eventlet -w 1 --bind 0.0.0.0:5000 app:app

Cloud Deployment

AWS EC2

# Install dependencies on EC2
sudo yum update -y
sudo yum install python3 python3-pip -y

# Clone and setup
git clone https://github.com/jonathansina/information-retrieval-system.git
cd information-retrieval-system
pip3 install -r requirements.txt

# Run with systemd service
sudo systemctl enable digikala-ir
sudo systemctl start digikala-ir

Docker Compose

version: '3.8'
services:
  web:
    build: .
    ports:
      - "5000:5000"
    environment:
      - TOGETHER_API_KEY=your_key
    volumes:
      - ./data:/app/data
  
  mlflow:
    image: python:3.9
    command: sh -c "pip install mlflow && mlflow server --host 0.0.0.0 --port 9437"
    ports:
      - "9437:9437"

🔧 Troubleshooting

Common Issues

1. Dependencies Installation Fails

Problem: pip install -r requirements.txt fails

Solution:

# Upgrade pip first
pip install --upgrade pip

# Install with verbose output
pip install -r requirements.txt -v

# If still fails, install individually
pip install flask flask-socketio flask-cors
pip install hazm scikit-learn pandas numpy

2. Persian Text Not Displaying Correctly

Problem: Persian text appears as question marks or broken characters

Solution:

  • Ensure your terminal/browser supports UTF-8 encoding
  • Check that Persian fonts are installed on your system
  • Verify the UI CSS includes proper Persian font specifications

3. MLflow Connection Issues

Problem: Cannot connect to MLflow tracking server

Solution:

# Start MLflow server manually
mlflow server --host 0.0.0.0 --port 9437

# Or disable MLflow in development
export MLFLOW_TRACKING_URI=""

4. Out of Memory Errors

Problem: System runs out of memory during processing

Solution:

  • Reduce the size of your dataset for testing
  • Decrease max_features in TF-IDF configuration
  • Use smaller batch sizes for processing
  • Consider using lighter embedding models

5. WebSocket Connection Fails

Problem: Real-time features not working

Solution:

  • Check the browser console for errors
  • Ensure the SocketIO client library is properly loaded
  • Verify the CORS settings in app.py

Debug Mode

Enable debug logging:

import logging
logging.basicConfig(level=logging.DEBUG)

Performance Issues

Monitor system performance:

# Check memory usage
htop

# Monitor Python processes
ps aux | grep python

# Check disk space
df -h

Getting Help

  1. Check Issues: Look for similar problems in GitHub Issues
  2. Create Issue: If not found, create a new issue with:
    • System information (OS, Python version)
    • Complete error message
    • Steps to reproduce
    • Expected vs actual behavior

🤝 Contributing

We welcome contributions! Here's how you can help:

Development Setup

# Fork the repository
git clone https://github.com/your-username/information-retrieval-system.git
cd information-retrieval-system

# Create development branch
git checkout -b feature/your-feature-name

# Install development dependencies
pip install -r requirements.txt
pip install -r requirements-dev.txt  # If available

Contribution Guidelines

  1. Code Style: Follow PEP 8 conventions
  2. Documentation: Update relevant documentation
  3. Testing: Add tests for new functionality
  4. Commits: Use clear, descriptive commit messages
  5. Pull Requests: Include description of changes and testing

Areas for Contribution

  • 🌐 Internationalization: Support for other languages
  • 🚀 Performance: Optimization and caching
  • 🎨 UI/UX: Interface improvements
  • 🧪 Testing: Comprehensive test coverage
  • 📚 Documentation: Examples and tutorials
  • 🔧 DevOps: CI/CD and deployment automation

Code Review Process

  1. Fork the repository
  2. Create a feature branch
  3. Make your changes
  4. Test thoroughly
  5. Submit a pull request
  6. Address review feedback
  7. Merge after approval

Community

  • 💬 Discussions: Use GitHub Discussions for questions
  • 🐛 Bug Reports: Use GitHub Issues
  • 📧 Email: Contact maintainers for sensitive issues

📄 License

This project is licensed under the MIT License - see the LICENSE file for details.

MIT License Summary

  • Commercial use allowed
  • Modification allowed
  • Distribution allowed
  • Private use allowed
  • Liability not included
  • Warranty not included

🙏 Acknowledgments

  • Hazm Team: For excellent Persian NLP tools
  • Digikala: For the problem domain and inspiration
  • Open Source Community: For the amazing libraries used
  • Contributors: Everyone who has contributed to this project

Made with ❤️ for the Persian NLP community

⭐ Star this repository | 🐛 Report Bug | 💡 Request Feature
