RetClean

Retrieval-Based Data Cleaning Using Foundation Models and Data Lakes

Welcome to the official repository of RetClean, a tool developed by QCRI for data repair tasks that require world knowledge. RetClean combines large language models (LLMs) with indexed CSV data lakes to clean and repair datasets.

Features

CSV Upload: Upload CSV files for cleaning.
LLM-Driven Cleaning: Perform standalone data repair using LLMs.
Data Lake Support: Upload and index CSV-based data lakes for contextual cleaning.
Custom Configurations: Choose indices, reasoner models, and re-ranker settings for tailored cleaning workflows.

Tech Stack

RetClean is built using modern tools and frameworks:

Frontend: React
Backend: FastAPI
LLM Serving: Ollama
Indexing and Search: Elasticsearch and Qdrant
Containerization: Docker

Getting Started

Installation

Set up RetClean effortlessly using Docker Compose.

Clone this repository and navigate to the project root.
Create an environment file (.env) in the root folder with the following format:

# === General Settings ===
LOG_LEVEL=INFO
SEARCH_ENABLED=true
USE_RERANKER=false
VECTOR_DB=qdrant
EMBEDDING_MODEL=sentence-transformers/all-MiniLM-L6-v2

# === Elasticsearch ===
ES_HOST=http://elasticsearch:9200
ES_USER=
ES_PASSWORD=

# === Qdrant ===
QDRANT_HOST=http://qdrant
QDRANT_PORT=6333

# === Azure OpenAI (use these names) ===
AZURE_OPENAI_API_KEY=<put your info here>
AZURE_OPENAI_ENDPOINT=<put your info here>
AZURE_OPENAI_API_VERSION=2024-12-01-preview
AZURE_OPENAI_DEPLOYMENT=<put your info here>


# === OpenAI (use these names) ===
OPENAI_API_KEY=<put your info here>
OPENAI_ORG_ID=<put your info here>
OPENAI_PROJECT_ID=<put your info here>
OPENAI_MODEL=gpt-4o

# === Ollama (Optional for Local Models) ===
OLLAMA_URL=http://ollama:11434

# === Backend ===
BACKEND_PORT=8000

Build the application using:

docker-compose build

This will install the necessary services as well as the dependencies for the client and server.
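A leftover placeholder or a missing key in .env is easy to overlook before building. The following is a minimal sketch of a pre-flight check, not part of RetClean itself. The list of expected keys is copied from the template above; which of them the backend strictly requires is an assumption, and the helper names `parse_env` and `check_env` are hypothetical.

```python
from pathlib import Path

# Keys copied from the .env template above. Which ones the backend
# strictly requires is an assumption of this sketch.
EXPECTED_KEYS = [
    "LOG_LEVEL", "VECTOR_DB", "ES_HOST",
    "QDRANT_HOST", "QDRANT_PORT", "OLLAMA_URL", "BACKEND_PORT",
]
PLACEHOLDER = "<put your info here>"


def parse_env(text: str) -> dict[str, str]:
    """Parse KEY=VALUE lines, skipping blank lines and # comments."""
    values = {}
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, value = line.partition("=")
        values[key.strip()] = value.strip()
    return values


def check_env(values: dict[str, str]) -> list[str]:
    """Return a list of human-readable problems (empty if none found)."""
    problems = [f"missing key: {k}" for k in EXPECTED_KEYS if k not in values]
    problems += [
        f"placeholder left in: {k}"
        for k, v in values.items()
        if PLACEHOLDER in v
    ]
    return problems


if __name__ == "__main__":
    env_path = Path(".env")
    if env_path.exists():
        for problem in check_env(parse_env(env_path.read_text())):
            print(problem)
    else:
        print(".env not found in the current directory")
```

Run it from the project root. A placeholder warning for a provider you do not intend to use (for example the Azure OpenAI keys when you only use Ollama) can be ignored.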

Note

For the local model example available in the application, we use a quantized Llama 3.1 model pulled from Ollama. The model download is large, so you may need to raise Docker's maximum allocated disk space above its default.

Start

To get the application running, just run the following command:

docker-compose up

Once the containers are up, the application is available locally on port 3000: http://localhost:3000/
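The containers can take a while to become ready, especially on first start. The sketch below polls a TCP port until it accepts connections. It is an illustrative helper, not something shipped with RetClean, and the function name `wait_for_port` is hypothetical. Port 3000 is the one stated above; checking port 8000 assumes that docker-compose publishes BACKEND_PORT on the host.

```python
import socket
import time


def wait_for_port(host: str, port: int, timeout: float = 60.0,
                  interval: float = 1.0) -> bool:
    """Return True once a TCP connection succeeds, False after `timeout` seconds."""
    deadline = time.monotonic() + timeout
    while True:
        try:
            # A successful connect means something is listening on the port.
            with socket.create_connection((host, port), timeout=interval):
                return True
        except OSError:
            if time.monotonic() >= deadline:
                return False
            time.sleep(interval)


# Example usage (uncomment after `docker-compose up`):
# print("frontend ready:", wait_for_port("localhost", 3000))
# print("backend ready:", wait_for_port("localhost", 8000))  # assumed host port
```

An open port only shows that a process is listening. It does not guarantee that the models or indices behind it have finished loading.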

How to Use

Upload Your Data
- Upload your CSV file.
- Select the target column to repair.
- Choose optional pivot/context columns.
Use a Data Lake for Cleaning
- Upload a folder of CSVs to create an indexed data lake.
- Indices are managed using Qdrant and Elasticsearch.
Configure and Start Repair
- Choose the reasoner model, index, and re-ranker settings.
- Start a repair job.
Review and Confirm Changes
- View the suggested repairs for the target column.
- Confirm or adjust changes as needed.
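To make the roles of the target column, the pivot columns, and the data lake concrete, here is a toy sketch of the workflow above. It is not RetClean's implementation: the real tool retrieves candidates from its indices and passes them to a reasoner model, whereas this example uses exact matching on the pivot columns. All data, column names, and function names are invented for illustration.

```python
import csv
import io


def rows_needing_repair(rows, target):
    """Indices of rows whose target cell is empty."""
    return [i for i, row in enumerate(rows)
            if not (row.get(target) or "").strip()]


def lookup_candidates(row, pivots, lake_rows, target):
    """Toy retrieval: lake values for `target` where every pivot column matches exactly."""
    return [
        lake[target]
        for lake in lake_rows
        if lake.get(target) and all(lake.get(p) == row.get(p) for p in pivots)
    ]


# Hypothetical data, invented for this example.
dirty_csv = "name,country,capital\nAlice,France,\nBob,Japan,Tokyo\n"
lake_csv = "country,capital\nFrance,Paris\nJapan,Tokyo\n"

rows = list(csv.DictReader(io.StringIO(dirty_csv)))
lake_rows = list(csv.DictReader(io.StringIO(lake_csv)))

# The target column is the one to repair; pivot columns give the context.
target, pivots = "capital", ["country"]
suggestions = {
    i: lookup_candidates(rows[i], pivots, lake_rows, target)
    for i in rows_needing_repair(rows, target)
}
print(suggestions)  # {0: ['Paris']}
```

The suggestions are only candidates. As in the last step above, a person still reviews and confirms each proposed repair.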

Contribution

We welcome contributions! Feel free to submit issues or pull requests. For significant changes, please open a discussion to ensure alignment with project goals.

License

This project is licensed under the MIT License. See the LICENSE file for details.
