Extensible document classification toolkit with modern ML capabilities and plugin architecture. Choose from 11+ pre-built plugins or create your own.
Version: 0.3.0 | License: MIT
- 11 Built-in Plugins: 4 tokenizers, 4 feature extractors, 3 classifiers
- Flexible Architecture: Protocol-based plugin system for easy customization
- Production Ready: Comprehensive error handling, logging, and configuration
- Memory Efficient: Streaming support for large datasets with token caching
- Multi-Label Support: Handle both single-label and multi-label classification
- GPU Acceleration: Optional GPU support for transformer models
- CLI & Library: Use as command-line tool or Python library
Tokenizers:
- ICU: Fast, locale-aware tokenization (baseline)
- spaCy: Linguistic features with stop word filtering
- SentencePiece: Trainable subword tokenization for any language
- Hugging Face: Access to 1000+ pre-trained tokenizers
Feature Extractors:
- Bag-of-Words (BoW): Simple word counts (baseline)
- TF-IDF: Term frequency-inverse document frequency weighting
- FastText: Subword embeddings with OOV handling
- Sentence-BERT: State-of-the-art transformer embeddings
Classifiers:
- SVM: Fast linear classifier (baseline)
- XGBoost: Gradient boosting with GPU support
- Transformer: Fine-tune BERT/RoBERTa for classification
| Pipeline | Time (10K docs) | Accuracy | Use Case |
|---|---|---|---|
| Fast | ~21s | 88% | Prototyping |
| Production | ~81s | 92% | Balanced |
| State-of-the-Art | ~20min* | 95% | Research |
*With GPU. See benchmarks for details.
Basic installation:
```bash
# Clone or download the repository
git clone https://github.com/yourusername/LexiClass.git
cd LexiClass

# Create virtual environment
python3 -m venv .venv
source .venv/bin/activate  # or .venv/bin/activate.fish for Fish shell

# Install LexiClass
pip install -e .
```

With optional plugins:
```bash
# For spaCy tokenizer
pip install spacy
python -m spacy download en_core_web_sm

# For XGBoost classifier
pip install xgboost

# For Sentence-BERT and Transformer plugins
pip install sentence-transformers transformers torch

# For SentencePiece tokenizer
pip install sentencepiece

# For the ICU tokenizer (locale-aware)
pip install PyICU
```

Using the CLI:
```bash
# Discover available plugins
lexiclass plugins list

# Build an index from documents
lexiclass build-index ./my_docs ./my_index --tokenizer icu --features tfidf

# Train a classifier
lexiclass train ./my_index ./labels.tsv ./model.pkl

# Make predictions
lexiclass predict ./model.pkl ./my_index ./test_docs --output predictions.tsv

# Evaluate results
lexiclass evaluate predictions.tsv ground_truth.tsv --confusion-matrix
```

Using the library:
```python
from lexiclass.plugins import registry
from lexiclass.classifier import SVMDocumentClassifier
from lexiclass.io import DocumentLoader, load_labels

# Create custom pipeline components
tokenizer = registry.create('spacy', model_name='en_core_web_sm')
features = registry.create('tfidf', normalize=True)
# Classifier plugins can also be driven directly; see the XGBoost example below
classifier = registry.create('xgboost', n_estimators=200)

# Build index and train
clf = SVMDocumentClassifier(
    tokenizer=tokenizer,
    feature_extractor=features,
)

def doc_stream():
    return DocumentLoader.iter_documents_from_directory('./my_docs')

clf.build_index(
    index_path='./my_index',
    document_stream_factory=doc_stream,
)

labels = load_labels('./labels.tsv')
clf.train(labels)
clf.save_model('./model.pkl', index_path='./my_index')

# Predict
test_docs = DocumentLoader.load_documents_from_directory('./test_docs')
predictions = clf.predict(test_docs)
```

- User Guide - Complete guide to using LexiClass (CLI, library, integrations)
- Plugin Development Guide - Create custom plugins in 30 minutes
LexiClass reads configuration from environment variables (optionally from a .env file):
```bash
# Logging
LEXICLASS_LOG_LEVEL=INFO      # DEBUG, INFO, WARNING, ERROR
LEXICLASS_LOG_FORMAT=text     # text or json
LEXICLASS_LOG_FILE=./app.log  # Optional log file path

# Library-specific logging
LEXICLASS_GENSIM_LOG_LEVEL=WARNING
LEXICLASS_SKLEARN_LOG_LEVEL=WARNING

# Behavior
LEXICLASS_RANDOM_SEED=42      # For reproducibility
LEXICLASS_LOCALE=en           # Default locale for ICU tokenizer
```

Use the -v flag with CLI commands to enable verbose DEBUG logging.
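Environment variables can also be set inline for a one-off run; a minimal example using values from the list above:

```bash
# Override logging for a single invocation (no .env file needed)
LEXICLASS_LOG_LEVEL=DEBUG LEXICLASS_LOG_FORMAT=json lexiclass plugins list
```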
```bash
# List all available plugins
lexiclass plugins list

# List by type
lexiclass plugins list --type tokenizer
lexiclass plugins list --type feature_extractor
lexiclass plugins list --type classifier

# Show only available plugins (dependencies installed)
lexiclass plugins list --available-only

# Get detailed information about a plugin
lexiclass plugins describe tfidf
lexiclass plugins describe xgboost
```

```bash
# Basic: build an index from a directory of .txt files
lexiclass build-index ./my_docs ./my_index

# With custom plugins
lexiclass build-index ./my_docs ./my_index \
  --tokenizer spacy \
  --features tfidf

# With token caching (avoids re-tokenization)
lexiclass build-index ./my_docs ./my_index \
  --tokenizer spacy \
  --features tfidf \
  --token-cache-path ./cache.jsonl.gz
```

```bash
# Train a classifier (labels.tsv format: doc_id<TAB>label)
lexiclass train ./my_index ./labels.tsv ./model.pkl

# Make predictions
lexiclass predict ./model.pkl ./my_index ./test_docs --output predictions.tsv

# Evaluate predictions
lexiclass evaluate predictions.tsv ground_truth.tsv
lexiclass evaluate predictions.tsv ground_truth.tsv --confusion-matrix
lexiclass evaluate predictions.tsv ground_truth.tsv --format json --output metrics.json
```

```bash
# Find similar documents
lexiclass similar ./my_index DOC_ID --top-k 5 --threshold 0.0
```

For more CLI examples, see the User Guide.
```python
from lexiclass.classifier import SVMDocumentClassifier
from lexiclass.io import DocumentLoader, load_labels

# Create classifier
clf = SVMDocumentClassifier()

# Build index from documents
def doc_stream():
    return DocumentLoader.iter_documents_from_directory("./my_docs")

clf.build_index(
    index_path="./my_index",
    document_stream_factory=doc_stream,
)

# Train
labels = load_labels("./labels.tsv")
clf.train(labels)
clf.save_model("./model.pkl", index_path="./my_index")

# Predict
test_docs = DocumentLoader.load_documents_from_directory("./test_docs")
predictions = clf.predict(test_docs)  # dict: doc_id -> (label, score)
```

```python
from lexiclass.plugins import registry
from lexiclass.classifier import SVMDocumentClassifier

# Create plugins
tokenizer = registry.create('spacy', model_name='en_core_web_sm')
features = registry.create('tfidf', normalize=True)

# Use with classifier
clf = SVMDocumentClassifier(
    tokenizer=tokenizer,
    feature_extractor=features,
)
# ... rest of workflow
```

```python
from lexiclass.plugins import registry

# Create XGBoost classifier directly
xgb = registry.create('xgboost', n_estimators=200, use_gpu=True)

# Train with a feature matrix (prepared by a feature extractor)
xgb.train(feature_matrix, labels)

# Predict
predictions = xgb.predict(feature_matrix)
```

LexiClass can be integrated into web services, background job processors, and streaming pipelines (a minimal REST sketch follows the list below):
- REST API: Flask, FastAPI integration examples
- Background Jobs: Celery task examples
- Streaming: Kafka consumer examples
- Batch Processing: Large dataset handling
See the User Guide for complete integration examples.
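A minimal FastAPI sketch of the REST pattern, assuming the library API shown above; FastAPI itself, the endpoint shape, and the `load_model` call are illustrative assumptions, not part of LexiClass:

```python
from fastapi import FastAPI
from pydantic import BaseModel

from lexiclass.classifier import SVMDocumentClassifier

app = FastAPI()

# Assumption: a load counterpart to save_model() exists; check the actual API.
clf = SVMDocumentClassifier()
clf.load_model("./model.pkl")

class Doc(BaseModel):
    doc_id: str
    text: str

@app.post("/classify")
def classify(doc: Doc):
    # Assumes predict() accepts a doc_id -> text mapping and returns
    # doc_id -> (label, score), as in the library examples above.
    label, score = clf.predict({doc.doc_id: doc.text})[doc.doc_id]
    return {"doc_id": doc.doc_id, "label": label, "score": score}
```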
LexiClass uses a protocol-based architecture that makes it easy to create custom plugins.
```python
from lexiclass.plugins import registry, PluginType, PluginMetadata

# Define your plugin
class MyTokenizer:
    def tokenize(self, text: str) -> list[str]:
        return text.lower().split()

# Register it
metadata = PluginMetadata(
    name="mytok",
    display_name="My Custom Tokenizer",
    description="Simple whitespace tokenizer",
    plugin_type=PluginType.TOKENIZER,
    dependencies=[],
    performance_tier="fast",
    quality_tier="basic",
)
registry.register(metadata, lambda: MyTokenizer())

# Use it
tokenizer = registry.create('mytok')
```

Core interfaces are defined as Protocols in `lexiclass.interfaces`:
- `TokenizerProtocol`: `.tokenize(text) -> list[str]`
- `FeatureExtractorProtocol`: `.fit()`, `.fit_streaming()`, `.transform()`, `.tokens_to_bow()`, `.num_features()`
- `ClassifierProtocol`: `.train()`, `.predict()`, `.save()`, `.load()`
- `DocumentIndexProtocol`: `.build_index()`, `.query_by_id()`, `.query_by_vector()`
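Because these interfaces are structural Protocols, any class with matching methods conforms without inheriting from a LexiClass base class. A minimal sketch of the mechanism (the names here are illustrative; the real definitions live in `lexiclass.interfaces`):

```python
from typing import Protocol, runtime_checkable

@runtime_checkable
class TokenizerLike(Protocol):
    def tokenize(self, text: str) -> list[str]: ...

class WhitespaceTokenizer:
    # No inheritance needed: matching the method shape is enough.
    def tokenize(self, text: str) -> list[str]:
        return text.split()

assert isinstance(WhitespaceTokenizer(), TokenizerLike)  # structural check
```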
For a complete guide to creating custom plugins, see the Plugin Development Guide.
```text
src/lexiclass/
├── plugins/
│   ├── tokenizers/    # ICU, spaCy, SentencePiece, HuggingFace
│   ├── features/      # BoW, TF-IDF, FastText, Sentence-BERT
│   └── classifiers/   # SVM, XGBoost, Transformer
├── classifier.py      # SVMDocumentClassifier orchestration
├── index.py           # DocumentIndex (similarity search)
├── features.py        # Base feature extraction
├── tokenization.py    # Base tokenization
├── interfaces.py      # Protocol definitions
├── io.py              # Document loading and label parsing
├── config.py          # Environment-based configuration
├── exceptions.py      # Custom exceptions
└── cli/               # Command-line interface
```
- Protocol-based plugins: Easy to extend without modifying core
- Streaming support: Memory-efficient for large datasets (see the factory sketch after this list)
- Two-pass indexing: Build indices without loading all data
- Token caching: Avoid re-tokenization between passes
- Dependency checking: Clear error messages for missing dependencies
- Lazy loading: Models loaded only when needed
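The streaming and two-pass designs are also why `build_index` takes a `document_stream_factory` rather than a bare iterator: each pass presumably calls the factory to obtain a fresh stream, so the corpus never has to fit in memory. A minimal sketch of such a factory; the `(doc_id, text)` pair format is an assumption for illustration (in practice `DocumentLoader.iter_documents_from_directory` plays this role):

```python
from pathlib import Path
from typing import Iterator, Tuple

def make_doc_stream(root: str):
    """Return a zero-argument factory that yields a fresh document stream per call."""
    def factory() -> Iterator[Tuple[str, str]]:
        for path in sorted(Path(root).glob("*.txt")):
            # Read one document at a time; nothing accumulates in memory.
            yield path.stem, path.read_text(encoding="utf-8")
    return factory

doc_stream = make_doc_stream("./my_docs")
pass_one = doc_stream()  # fresh iterator for the statistics pass
pass_two = doc_stream()  # fresh iterator for the indexing pass
```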
| Size | Recommended Pipeline | Time | Accuracy |
|---|---|---|---|
| Small (<10K) | Any pipeline | Fast | 85-95% |
| Medium (10K-100K) | spaCy + TF-IDF + XGBoost | ~1-2min | 88-92% |
| Large (100K-1M) | ICU + TF-IDF + SVM | ~5-10min | 88-91% |
| Very Large (>1M) | ICU + BoW + SVM | ~20-30min | 85-89% |
- Quick prototyping (85-88%): ICU + BoW + SVM
- Production balanced (88-92%): spaCy + TF-IDF + XGBoost
- High quality (90-93%): spaCy + FastText + XGBoost
- State-of-the-art (93-95%): HuggingFace + Sentence-BERT + Transformer (GPU required)
Contributions are welcome! Areas for contribution:
- New plugins: Add tokenizers, feature extractors, or classifiers
- Documentation: Improve guides and examples
- Testing: Add unit tests and integration tests
- Benchmarks: Run benchmarks on new datasets
- Bug fixes: Fix issues and improve error handling
Installation issues:
```bash
# Upgrade build tools if installation fails
python -m pip install --upgrade pip setuptools wheel
```

Missing dependencies:
```bash
# Check which plugins need dependencies
lexiclass plugins list

# Get installation instructions
lexiclass plugins describe <plugin_name>
```

macOS OpenMP issue with XGBoost:
```bash
brew install libomp
```

Memory issues with large datasets:
- Use streaming mode with `document_stream_factory`
- Enable token caching with `--token-cache-path`
- Use simpler plugins (ICU + BoW instead of spaCy + SBERT)
For more troubleshooting help, see the User Guide.
MIT License - see LICENSE file for details.
- Documentation: docs/USER_GUIDE.md
- Issues: Please report bugs and feature requests via GitHub issues
- Contributing: See contributing section above
LexiClass v0.3.0 - Production-ready document classification toolkit