A local Retrieval-Augmented Generation (RAG) application built with Kotlin and Ktor that allows you to upload PDF documents and query them using local LLMs via Ollama. Document embeddings are stored in PostgreSQL with pgVector for fast similarity search.
- PDF document ingestion and processing
- Vector similarity search using PostgreSQL pgVector
- Local LLM inference with Ollama
- Fast embedding generation with mxbai-embed-large
- Query expansion for better retrieval
- Fully local deployment - no external API calls
- Java 17 or higher
- Docker and Docker Compose
- Gradle (or use included wrapper)
This option runs Ollama in Docker alongside the database.
```bash
# 1. Start all services including Ollama
docker compose --profile ollama up -d

# 2. Wait for Ollama to download models (first time only)
docker logs -f local-rag-ollama-setup

# 3. Build and run the application
./gradlew run
```

If you have Ollama running elsewhere (e.g., locally installed or on a remote server):
```bash
# 1. Start only the database
docker compose up -d

# 2. Update src/main/resources/application.yaml
#    Change ollamaBaseUrl to your Ollama instance URL

# 3. Build and run the application
./gradlew run
```

The application will be available at http://localhost:8080.
Upload a PDF file to be processed and stored in the vector database.
Endpoint: POST /embed
Example using curl:
```bash
curl -X POST http://localhost:8080/embed \
  -F "file=@/path/to/your/document.pdf"
```

Success Response (200 OK):
```json
{
  "message": "File embedded successfully"
}
```

Error Response (400 Bad Request):

```json
{
  "error": "No valid PDF file provided"
}
```

What happens behind the scenes:
- PDF text is extracted using Apache PDFBox
- Text is split into chunks (7,500 characters with a 100-character overlap; see the sketch after this list)
- Each chunk is embedded using Ollama's mxbai-embed-large model (1024 dimensions)
- Document chunks and embeddings are stored in PostgreSQL with metadata
- IVFFlat index enables fast similarity search
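For illustration, here is a minimal Kotlin sketch of the chunking and embedding steps, assuming Ollama's `/api/embeddings` endpoint and the chunk sizes listed above. The helper names (`chunkText`, `embedRaw`, `jsonString`) are hypothetical, not the application's actual code:

```kotlin
import java.net.URI
import java.net.http.HttpClient
import java.net.http.HttpRequest
import java.net.http.HttpResponse

// Split extracted PDF text into overlapping chunks (7,500 chars, 100-char overlap),
// mirroring the chunking strategy described above.
fun chunkText(text: String, chunkSize: Int = 7_500, overlap: Int = 100): List<String> {
    val chunks = mutableListOf<String>()
    var start = 0
    while (start < text.length) {
        val end = minOf(start + chunkSize, text.length)
        chunks += text.substring(start, end)
        if (end == text.length) break
        start = end - overlap // step back so consecutive chunks share `overlap` characters
    }
    return chunks
}

// Request an embedding for one chunk from Ollama's /api/embeddings endpoint using
// the mxbai-embed-large model. Returns the raw JSON response as a string; a real
// client would parse out the "embedding" array with a JSON library.
fun embedRaw(chunk: String, baseUrl: String = "http://localhost:11434"): String {
    val body = """{"model": "mxbai-embed-large", "prompt": ${jsonString(chunk)}}"""
    val request = HttpRequest.newBuilder()
        .uri(URI.create("$baseUrl/api/embeddings"))
        .header("Content-Type", "application/json")
        .POST(HttpRequest.BodyPublishers.ofString(body))
        .build()
    return HttpClient.newHttpClient()
        .send(request, HttpResponse.BodyHandlers.ofString())
        .body() // contains {"embedding": [ ... 1024 floats ... ]}
}

// Minimal JSON string escaping, sufficient for this sketch only.
fun jsonString(s: String): String =
    "\"" + s.replace("\\", "\\\\").replace("\"", "\\\"").replace("\n", "\\n") + "\""
```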
Ask questions about your uploaded documents.
Endpoint: POST /query
Example using curl:
```bash
curl -X POST http://localhost:8080/query \
  -H "Content-Type: application/json" \
  -d '{
    "query": "What are the main findings in the document?"
  }'
```

Success Response (200 OK):
```json
{
  "message": "Based on the document, the main findings include... [AI-generated answer based on retrieved context]"
}
```

Error Response (400 Bad Request):

```json
{
  "error": "Query cannot be empty"
}
```

If you receive a timeout error, retry the request; the first request can take longer because the model must first be loaded into memory.
What happens behind the scenes:
- Query is expanded into 5 variations using the LLM for better retrieval
- Each variation is embedded using mxbai-embed-large
- Top 3 similar document chunks are retrieved for each variation using cosine similarity (see the sketch after this list)
- Retrieved chunks are deduplicated and combined into context
- Context is sent to Ollama's qwen3 LLM to generate an answer
- Answer is returned to the client
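Below is a minimal Kotlin sketch of the retrieval step, using pgVector's cosine-distance operator (`<=>`) over JDBC. The table and column names (`documents`, `embedding`, `content`) and the `embedAsVectorLiteral` helper are illustrative assumptions, not necessarily the application's actual schema:

```kotlin
import java.sql.DriverManager

// Retrieve the top-3 most similar chunks for each query variation using
// pgVector's cosine-distance operator (<=>), then deduplicate the results.
fun retrieveContext(
    variations: List<String>,
    jdbcUrl: String,
    user: String,
    password: String
): List<String> {
    val sql = "SELECT content FROM documents ORDER BY embedding <=> ?::vector LIMIT 3"
    val results = linkedSetOf<String>() // preserves order, drops duplicate chunks
    DriverManager.getConnection(jdbcUrl, user, password).use { conn ->
        conn.prepareStatement(sql).use { stmt ->
            for (variation in variations) {
                stmt.setString(1, embedAsVectorLiteral(variation))
                stmt.executeQuery().use { rs ->
                    while (rs.next()) results += rs.getString("content")
                }
            }
        }
    }
    return results.toList()
}

// Placeholder: embed the text with mxbai-embed-large (see the ingestion sketch above)
// and format the floats as a pgvector literal such as "[0.12, -0.03, ...]".
fun embedAsVectorLiteral(text: String): String =
    TODO("call Ollama and format the embedding as a pgvector literal")
```

The deduplicated chunks would then be concatenated into a context string and sent to the qwen3 model to generate the final answer.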
Interactive API documentation is available via Swagger UI:
http://localhost:8080/swagger
OpenAPI specification:
http://localhost:8080/openapi
Edit src/main/resources/application.yaml to customize.
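For example, to point the application at a different Ollama instance, adjust the `ollamaBaseUrl` value (shown here with Ollama's default local URL; the other keys and defaults in the file may differ):

```yaml
# Illustrative snippet - only ollamaBaseUrl is referenced in this README
ollamaBaseUrl: "http://localhost:11434"
```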
```bash
# Start only the database
docker compose up db -d
```
### Ollama (Optional)
```bash
# Start Ollama with the ollama profile
docker compose --profile ollama up ollama -d
# View available models
docker exec local-rag-ollama ollama list
# Pull additional models
docker exec local-rag-ollama ollama pull llama3.2
# Test Ollama directly
curl http://localhost:11434/api/tags
```

```bash
# Build the project
./gradlew build
# Run tests
./gradlew test
# Run the application
./gradlew run
# Build executable JAR
./gradlew shadowJar
```

Error: Connection refused to http://localhost:11434
Solution:
- If using Docker Ollama: Ensure the service is running with `docker ps | grep ollama`
- If using external Ollama: Update `ollamaBaseUrl` in `application.yaml`
- Check Ollama is responding: `curl http://localhost:11434/api/tags`
Error: Connection to localhost:5432 refused
Solution:
```bash
# Start PostgreSQL
docker compose up db -d

# Check it's running
docker ps | grep postgres

# Check logs
docker logs postgres-pgvector
```

Solution: The application is configured with 5-minute timeouts. If you still experience timeouts:
- First query takes longer as the model loads into memory
- Consider using a smaller/faster model (e.g., `llama3.2:1b` instead of `qwen3`)
- Check Ollama logs: `docker logs local-rag-ollama`
Solution:
- Ensure you've uploaded at least one PDF document via `/embed` first
- Check that embeddings were generated successfully in the logs
- Verify documents are in the database:

```bash
docker exec -it postgres-pgvector psql -U dev_user -d embedding_db \
  -c "SELECT COUNT(*) FROM documents;"
```
- Vector Index: IVFFlat index trades some accuracy for speed (~100 lists; see the sketch below)
- Connection Pool: 2-10 concurrent database connections
- Chunk Size: 7500 characters balances context vs. precision
- Timeout: 5-minute timeout accommodates slower LLM inference
- Embedding Model: mxbai-embed-large provides good accuracy at 1024 dimensions
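For reference, a hedged Kotlin/JDBC sketch of how such an IVFFlat index can be created with pgVector (100 lists, cosine distance). The index normally already exists in the database, and the table/column names here are the same illustrative ones used in the retrieval sketch:

```kotlin
import java.sql.DriverManager

// Sketch: create an IVFFlat index for cosine similarity with 100 lists.
// Table/column names are illustrative; the application normally sets this up itself.
fun createIvfFlatIndex(jdbcUrl: String, user: String, password: String) {
    DriverManager.getConnection(jdbcUrl, user, password).use { conn ->
        conn.createStatement().use { stmt ->
            stmt.execute(
                "CREATE INDEX IF NOT EXISTS documents_embedding_idx " +
                "ON documents USING ivfflat (embedding vector_cosine_ops) WITH (lists = 100)"
            )
        }
    }
}
```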
MIT