A complete solution for importing, storing, and searching BMECat XML product catalogs. Features a REST API powered by FastAPI and OpenSearch, plus a modern web interface for browsing and exporting products with faceted search, filtering, and batch export capabilities.

HendrikReh/BMECatExplorer

BMECat Catalog‑Explorer

BMECatExplorer is an end‑to‑end, memory‑light pipeline to ingest and explore large BMECat 1.2 product catalogs. It stream‑converts XML to JSONL, imports into PostgreSQL, indexes to OpenSearch (BM25 and optional vector embeddings), and exposes a FastAPI backend plus a small HTMX/Tailwind UI.

Pipeline

  1. Stream‑convert large BMECat XML → JSONL (main.py)
  2. Import JSONL into PostgreSQL (src/db)
  3. Index products into OpenSearch with optional OpenAI embeddings (src/search)
  4. Serve search + hybrid RAG‑friendly endpoints via FastAPI (src/api)
  5. Browse/export results in the frontend (frontend/)
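Step 1 can be pictured with the following minimal sketch of a streaming XML → JSONL conversion. It is not the code in main.py: the function name stream_articles, the two extracted fields, and the assumption that the catalog uses un-namespaced BMECat 1.2 ARTICLE elements are all illustrative.

```python
import io
import json
import xml.etree.ElementTree as ET


def stream_articles(xml_source, jsonl_sink, tag="ARTICLE"):
    """Write one JSON line per <ARTICLE> element and return the count.

    Hypothetical helper: the element and field names are assumptions
    about a typical BMECat 1.2 file, not the project's actual schema.
    """
    count = 0
    # iterparse yields each element once its closing tag has been read,
    # so the whole document never has to be parsed up front.
    for _event, elem in ET.iterparse(xml_source, events=("end",)):
        if elem.tag != tag:
            continue
        record = {
            "supplier_aid": elem.findtext("SUPPLIER_AID"),
            "description_short": elem.findtext(
                "ARTICLE_DETAILS/DESCRIPTION_SHORT"
            ),
        }
        jsonl_sink.write(json.dumps(record, ensure_ascii=False) + "\n")
        count += 1
        # Drop the article's children and text so they can be freed.
        elem.clear()
    return count


if __name__ == "__main__":
    sample = (
        b"<BMECAT><T_NEW_CATALOG>"
        b"<ARTICLE><SUPPLIER_AID>A1</SUPPLIER_AID>"
        b"<ARTICLE_DETAILS><DESCRIPTION_SHORT>Kabel</DESCRIPTION_SHORT>"
        b"</ARTICLE_DETAILS></ARTICLE>"
        b"</T_NEW_CATALOG></BMECAT>"
    )
    out = io.StringIO()
    stream_articles(io.BytesIO(sample), out)
    print(out.getvalue(), end="")
```

In this sketch the cleared, empty ARTICLE elements stay attached to their parent, so memory still grows slightly with the number of articles; the real converter may handle that differently.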

Highlights

  • Streaming XML converter (iterparse + clear()) stays O(1) memory
  • Faceted BM25 search + autocomplete
  • Hybrid BM25 + vector search with RRF fusion (POST /api/v1/search/hybrid)
  • Multi‑catalog namespaces via catalog_id (composite uniqueness in DB + index IDs)
  • Normalized unit price (price_unit_amount) for correct price filters
  • Optional embeddings (OPENAI_API_KEY) and provenance fields for RAG
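For readers unfamiliar with RRF (reciprocal rank fusion), the sketch below shows the general technique of merging a BM25 ranking with a vector ranking. It is a generic illustration, not the project's implementation: the function name rrf_fuse and the constant k=60 (a commonly used default) are assumptions.

```python
def rrf_fuse(rankings, k=60):
    """Fuse several ranked lists of document IDs with reciprocal rank fusion.

    Each document scores sum(1 / (k + rank)) over the lists it appears in,
    with ranks starting at 1. k=60 is an assumed, conventional default.
    """
    scores = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    # Highest fused score first; ties broken by ID for a stable order.
    return sorted(scores, key=lambda doc_id: (-scores[doc_id], doc_id))


if __name__ == "__main__":
    bm25_hits = ["a", "b", "c"]    # hypothetical BM25 ranking
    vector_hits = ["b", "c", "d"]  # hypothetical vector ranking
    print(rrf_fuse([bm25_hits, vector_hits]))  # ['b', 'c', 'a', 'd']
```

Documents that appear in both lists ("b" and "c") outrank documents found by only one retriever, because RRF works on ranks and never compares raw BM25 and cosine scores directly.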

Prerequisites

  • Python 3.12+
  • Docker & Docker Compose
  • uv and just

Quick start (single catalog)

uv sync
just up

# Convert → import → index (safe to rerun; replaces the "default" catalog)
just pipeline data/BME-cat_eClass_8.xml

just serve
just serve-frontend

You can also run the steps manually:

just convert data/BME-cat_eClass_8.xml data/products.jsonl
just import data/products.jsonl --replace-catalog
just index

Pricing model (BMECat bundles)

BMECat prices can refer to bundles. PRICE_AMOUNT applies to PRICE_QUANTITY units (often 100). The backend computes a normalized unit price:

price_unit_amount = price_amount / price_quantity
  • UI and API show both unit price and raw amount.
  • price_min, price_max, and price_band filters operate on unit price.
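The normalization above can be written out as a small worked example. The helper name unit_price is hypothetical, and treating a missing PRICE_QUANTITY as 1 is an assumption about how the backend behaves.

```python
from decimal import Decimal


def unit_price(price_amount, price_quantity=1):
    """Return price_amount / price_quantity as a Decimal.

    Hypothetical helper; the default quantity of 1 is an assumption.
    """
    amount = Decimal(str(price_amount))
    quantity = Decimal(str(price_quantity))
    if quantity <= 0:
        raise ValueError("price_quantity must be positive")
    return amount / quantity


if __name__ == "__main__":
    # A bundle of 100 pieces priced at 250.00 costs 2.5 per piece.
    print(unit_price("250.00", 100))  # 2.5
```

Filtering on the raw amount would place this article in a 200+ price range even though a single piece costs 2.5, which is why the filters operate on the unit price.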

Multi‑catalog import/index

To keep multiple XML sources in one DB/index without ID collisions:

just up
just pipeline-catalog data/catalog_a.xml catalog_a
just pipeline-catalog data/catalog_b.xml catalog_b

Search can be scoped with catalog_id=catalog_a (repeatable).

Upgrade note: OpenSearch document IDs are catalog_id:supplier_aid. If you have an existing index from an older version that used supplier_aid as _id, recreate and reindex (e.g., just index) to avoid duplicates.
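A short illustration of why the composite ID avoids collisions, assuming the ID is a plain string concatenation as described above. The helper name opensearch_doc_id is hypothetical.

```python
def opensearch_doc_id(catalog_id, supplier_aid):
    """Build a document ID in the catalog_id:supplier_aid form.

    Hypothetical helper mirroring the ID format described in this README.
    """
    return f"{catalog_id}:{supplier_aid}"


if __name__ == "__main__":
    # The same supplier_aid in two catalogs maps to two distinct documents.
    index = {}
    index[opensearch_doc_id("catalog_a", "4711")] = {"title": "Cable A"}
    index[opensearch_doc_id("catalog_b", "4711")] = {"title": "Cable B"}
    print(sorted(index))  # ['catalog_a:4711', 'catalog_b:4711']
```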

API

Endpoint                             Description
GET /api/v1/search                   BM25 search with filters and facets
GET /api/v1/search/autocomplete?q=   Prefix suggestions
GET /api/v1/products/{supplier_aid}  Fetch a single product (use ?catalog_id= if needed)
GET /api/v1/facets                   Facet counts for UI
POST /api/v1/search/hybrid           BM25 / vector / hybrid RRF search
POST /api/v1/search/batch            Batch hybrid queries
GET /api/v1/catalogs                 List available catalogs

Common query params (/api/v1/search)

  • q – Full‑text query (descriptions, manufacturer, IDs)
  • manufacturer – Manufacturer name filter (repeatable)
  • eclass_id – Exact ECLASS ID filter (repeatable)
  • eclass_segment – ECLASS segment/2‑digit prefix filter (repeatable)
  • order_unit – Order unit filter (repeatable)
  • price_min / price_max – Unit price range filter
  • price_band – Predefined unit price bands (0‑10, 10‑50, 50‑200, 200‑1000, 1000+)
  • catalog_id – Catalog namespace filter (repeatable)
  • exact_match – Exact matches for EAN/IDs
  • page / size – Pagination
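To make the price_band values concrete, here is a sketch that maps a unit price to one of the predefined bands. The boundary handling (lower bound inclusive, upper bound exclusive) and the ASCII spelling of the labels are assumptions; the API may treat edge values differently.

```python
from bisect import bisect_right

# Upper edges of the first four bands; anything at or above 1000 is "1000+".
BAND_EDGES = [10, 50, 200, 1000]
BAND_LABELS = ["0-10", "10-50", "50-200", "200-1000", "1000+"]


def price_band(unit_price):
    """Return the band label for a unit price.

    Hypothetical helper: a price equal to an edge falls into the higher
    band, which is an assumption rather than documented behavior.
    """
    if unit_price < 0:
        raise ValueError("unit_price must not be negative")
    return BAND_LABELS[bisect_right(BAND_EDGES, unit_price)]


if __name__ == "__main__":
    for price in (2.5, 10, 199.99, 1000):
        print(price, price_band(price))
```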

Example:

curl "http://localhost:9019/api/v1/search?q=Kabel&manufacturer=Walraven%20GmbH&catalog_id=default&size=10"
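The same request URL can be assembled from Python with the standard library. This sketch assumes that "repeatable" parameters are sent as repeated keys (for example manufacturer=A&manufacturer=B); the helper name build_search_url is hypothetical.

```python
from urllib.parse import quote, urlencode


def build_search_url(base_url, **params):
    """Build a /api/v1/search URL from keyword arguments.

    List values become repeated query keys (an assumption about how the
    API expects repeatable filters). None values are skipped.
    """
    cleaned = {key: value for key, value in params.items() if value is not None}
    # quote_via=quote encodes spaces as %20, matching the curl example.
    query = urlencode(cleaned, doseq=True, quote_via=quote)
    return f"{base_url}/api/v1/search?{query}"


if __name__ == "__main__":
    print(
        build_search_url(
            "http://localhost:9019",
            q="Kabel",
            manufacturer="Walraven GmbH",
            catalog_id="default",
            size=10,
        )
    )
```

Passing a list, such as catalog_id=["catalog_a", "catalog_b"], yields catalog_id=catalog_a&catalog_id=catalog_b in the query string.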

Commands

Run just --list for all tasks. Common ones:

Command                                                                                 Description
just up / just down                                                                     Start/stop PostgreSQL and OpenSearch
just convert <in.xml> <out.jsonl>                                                       XML → JSONL
just convert-with-header <in.xml> <out.jsonl> <header.json>                             Convert and save header
just import <file.jsonl> [--catalog-id <id>] [--source-file <xml>] [--replace-catalog]  Load JSONL into PostgreSQL
just index / just index-embed                                                           Index DB rows to OpenSearch (embeddings optional)
just index-catalog <catalog_id> <source.xml>                                            Append a catalog to existing index
just pipeline <xml>                                                                     Convert → import → index (replaces default catalog)
just pipeline-catalog <xml> <catalog_id>                                                Pipeline under a catalog namespace
just serve / just serve-frontend                                                        Run backend / frontend
just test-unit / just test-integration                                                  Run unit / integration tests
just lint / just format                                                                 Ruff / Black

Configuration

Backend env vars (via .env or shell) follow src/config.py. Key ones:

Variable                     Default                  Notes
POSTGRES_*                   from docker-compose.yml  DB connection
OPENSEARCH_*                 from docker-compose.yml  OpenSearch connection
OPENAI_API_KEY               unset                    Required for index-embed and server‑side vector fallback
OPENAI_EMBEDDING_MODEL       text-embedding-3-small
OPENAI_EMBEDDING_DIMENSIONS  1536                     Must match index mapping

Frontend uses FRONTEND_API_BASE_URL and related settings (see frontend/config.py).

Migrations (Alembic)

For production or long‑lived databases, prefer Alembic migrations over runtime create_all:

uv run alembic upgrade head

Project structure

├── main.py                 # XML → JSONL converter
├── alembic/                # DB migrations
├── justfile                # Task runner commands
├── docker-compose.yml      # PostgreSQL + OpenSearch
├── src/
│   ├── config.py           # Settings
│   ├── db/                 # SQLAlchemy models + importer
│   ├── search/             # OpenSearch mapping/client/indexer
│   └── api/                # FastAPI app + routes
├── frontend/               # HTMX/Tailwind web UI
└── tests/                  # unit/, integration/, smoke/
