@gandersteele
Contributor

Summary

  • Adds Python SDK support for model-based custom entities, enabling programmatic management of LLM-in-the-loop NER model training
  • Introduces classes for entity management, version/guideline refinement, and model training workflows
  • Wraps existing backend endpoints with no server-side changes required

Changes

New files:

  • tonic_textual/classes/model_entity.py - Core classes (ModelEntity, ModelEntityVersion, TrainedModel) with full workflow support
  • tonic_textual/services/model_entity.py - Service layer for CRUD operations

Modified:

  • tonic_textual/redact_api.py - Added convenience methods on TonicTextual client

Features

Entity Management:

  • create_model_entity(), get_model_entity(), list_model_entities(), delete_model_entity()

Test Data & Guidelines Refinement:

  • upload_test_data() - Upload files with ground truth spans
  • version.wait_for_completion() - Wait for LLM annotation
  • version.get_metrics() - Get F1/precision/recall scores
  • version.get_suggested_guidelines() - Get LLM-suggested improvements
  • entity.create_version() - Create new version with refined guidelines

Training:

  • upload_training_data() / upload_training_file()
  • create_trained_model(version_id) - Create model with specific guidelines
  • model.start_training() / model.wait_for_training()
  • model.activate() - Set as active model for entity

Example Usage

from tonic_textual.redact_api import TonicTextual

textual = TonicTextual()

# Create entity and upload test data
entity = textual.create_model_entity(
    name='PRODUCT_CODE',
    guidelines='Identify product codes like SKU-12345.'
)
entity.upload_test_data([
    {'text': 'Order SKU-123 shipped.', 'spans': [{'start': 6, 'end': 13}]}
])

# Refine guidelines based on metrics
version = entity.get_latest_version()
version.wait_for_completion()
print(f'F1: {version.get_metrics().f1_score}')

# Train model
entity.upload_training_data([{'text': '...', 'fileName': 'train.txt'}])
model = entity.create_trained_model(version.id)
model.start_training()
model.wait_for_training()
model.activate()
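
Hand-counting character offsets for ground truth spans is error-prone, so a small helper can derive them from the text. The sketch below is not part of the SDK: `find_spans` is a hypothetical name, and it assumes `end` is exclusive, which matches the `{'start': 6, 'end': 13}` example above.

```python
from typing import Dict, List


def find_spans(text: str, entity_text: str) -> List[Dict[str, int]]:
    """Return start/end offsets (end exclusive) of each non-overlapping match."""
    if not entity_text:
        raise ValueError("entity_text must be non-empty")
    spans: List[Dict[str, int]] = []
    start = text.find(entity_text)
    while start != -1:
        end = start + len(entity_text)
        spans.append({"start": start, "end": end})
        # Resume searching after the current match to avoid overlaps.
        start = text.find(entity_text, end)
    return spans


text = "Order SKU-123 shipped."
sample = {"text": text, "spans": find_spans(text, "SKU-123")}
```

A list of such `sample` dicts has the same shape as the argument passed to `entity.upload_test_data()` in the example above.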

Test plan

  • Tested entity CRUD operations against production API
  • Verified test data upload with ground truth saves correctly (files show "Reviewed" status)
  • Confirmed guidelines refinement loop works (F1 improved from 0.6 to 0.73 with refined guidelines)
  • Validated full training workflow: upload → annotate → train → activate

gandersteele and others added 2 commits January 15, 2026 14:58
Introduces Python SDK classes and methods for managing model-based custom
entities, which allow users to define NER models by refining annotation
guidelines with LLM-in-the-loop, upload test/training data, and train
encoder models.

New files:
- tonic_textual/classes/model_entity.py: Core classes (ModelEntity,
  ModelEntityVersion, TrainedModel) with methods for test data upload,
  ground truth annotation, training, and model activation
- tonic_textual/services/model_entity.py: Service layer for CRUD operations

Modified:
- tonic_textual/redact_api.py: Added convenience methods for model entity
  management (create, get, list, delete)

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
The training files endpoint returns a paginated dict with a 'records' key,
unlike the test files endpoint, which returns a list directly.

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
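
The response-shape difference described in this commit can be handled with one normalizing helper. This is a minimal sketch, not the SDK's actual code: `extract_records` is a hypothetical name, and the only detail taken from the commit message is that one endpoint returns a dict with a 'records' key while the other returns a list.

```python
from typing import Any, Dict, List


def extract_records(payload: Any) -> List[Dict]:
    """Normalize a file-listing response to a plain list of records."""
    if isinstance(payload, list):
        # Test files: the endpoint returns the list directly.
        return payload
    if isinstance(payload, dict) and isinstance(payload.get("records"), list):
        # Training files: the list sits under the 'records' key.
        return payload["records"]
    raise TypeError(f"Unexpected response shape: {type(payload).__name__}")
```

Raising on an unknown shape, rather than returning an empty list, keeps a future API change from being silently read as "no files".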
@gandersteele gandersteele requested a review from Copilot January 15, 2026 23:20
Comment on lines +406 to +412
if files_with_spans and wait_for_processing:
# Wait for files to be processed before saving ground truth
self._wait_for_files_ready(file_ids, timeout_seconds=processing_timeout)

for file_id, spans in files_with_spans:
self._save_ground_truth(file_id, spans)

This comment was marked as outdated.

Copilot AI left a comment

Pull request overview

This PR introduces SDK support for managing model-based custom entities in Tonic Textual, enabling programmatic workflows for training custom NER models with LLM-assisted annotation and guidelines refinement.

Changes:

  • Added core classes for model entities, versions, and trained models with full lifecycle management
  • Implemented service layer for CRUD operations on model-based entities
  • Extended the TonicTextual client with convenience methods for entity management

Reviewed changes

Copilot reviewed 3 out of 3 changed files in this pull request and generated 4 comments.

| File | Description |
| --- | --- |
| tonic_textual/classes/model_entity.py | Core domain classes supporting entity creation, test/training data upload, version management, metrics retrieval, and model training workflows |
| tonic_textual/services/model_entity.py | Service layer providing CRUD operations for model entities with API endpoint integration |
| tonic_textual/redact_api.py | Client-level convenience methods delegating to the model entity service |

Comment on lines +59 to +63
with requests.Session() as session:
    data = self.client.http_get(
        f"/api/model-based-entities/{entity_id}",
        session=session,
    )
Copilot AI Jan 15, 2026

The get() method creates a new requests.Session for a single HTTP call, which provides no benefit over using the session already managed by the client. This pattern is repeated in list() and creates unnecessary overhead. Consider removing the session context manager unless the client requires it, or reuse a client-managed session.

Copilot uses AI. Check for mistakes.
Comment on lines +82 to +85
if item.get("entityType") == "ModelBased":
    # Fetch full entity data
    entity = self.get(item["id"])
    model_entities.append(entity)
Copilot AI Jan 15, 2026

The list() method makes N+1 HTTP requests: one to fetch all entities, then one per model-based entity to fetch full details. For large numbers of entities, this will be slow. Consider whether the /api/custom-entities endpoint can return full entity data, or if a dedicated endpoint exists for listing model-based entities with complete information.
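
If no richer list endpoint is available, one possible mitigation is to issue the per-entity requests concurrently. This is a different technique from the one suggested above: it still makes N+1 requests and only reduces wall-clock time. In the sketch below, `fetch_model_entities` and `fetch_entity` are hypothetical names (the latter stands in for `self.get`), the `entityType` and `id` keys come from the snippet under review, and the underlying HTTP client is assumed to be safe to call from multiple threads.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, Dict, List, TypeVar

T = TypeVar("T")


def fetch_model_entities(
    items: List[Dict],
    fetch_entity: Callable[[str], T],
    max_workers: int = 8,
) -> List[T]:
    """Fetch full details for each model-based item, preserving input order."""
    ids = [item["id"] for item in items if item.get("entityType") == "ModelBased"]
    if not ids:
        return []
    with ThreadPoolExecutor(max_workers=min(max_workers, len(ids))) as pool:
        # Executor.map yields results in the same order as the input IDs.
        return list(pool.map(fetch_entity, ids))
```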

Comment on lines +505 to +508
import json as json_module
endpoint = "training/files" if is_training else "test/files"
# API expects multipart with 'document' (JSON metadata) and 'file' (content)
document = json_module.dumps({"fileName": file_name})
Copilot AI Jan 15, 2026

The json module is already imported at the module level (line 8), so this local import is redundant and creates naming confusion. Remove this line and use the module-level json import instead.

Suggested change
- import json as json_module
- endpoint = "training/files" if is_training else "test/files"
- # API expects multipart with 'document' (JSON metadata) and 'file' (content)
- document = json_module.dumps({"fileName": file_name})
+ endpoint = "training/files" if is_training else "test/files"
+ # API expects multipart with 'document' (JSON metadata) and 'file' (content)
+ document = json.dumps({"fileName": file_name})

Comment on lines +505 to +508
import json as json_module
endpoint = "training/files" if is_training else "test/files"
# API expects multipart with 'document' (JSON metadata) and 'file' (content)
document = json_module.dumps({"fileName": file_name})
Copilot AI Jan 15, 2026

Use the module-level json.dumps() instead of json_module.dumps() after removing the redundant import on line 505.

Suggested change
- import json as json_module
- endpoint = "training/files" if is_training else "test/files"
- # API expects multipart with 'document' (JSON metadata) and 'file' (content)
- document = json_module.dumps({"fileName": file_name})
+ endpoint = "training/files" if is_training else "test/files"
+ # API expects multipart with 'document' (JSON metadata) and 'file' (content)
+ document = json.dumps({"fileName": file_name})


def _save_ground_truth(self, file_id: str, spans: List[Dict]) -> None:
    """Save ground truth annotations for a file."""
    annotations = [{"start": s["start"], "end": s["end"]} for s in spans]

This comment was marked as outdated.

Tests cover:
- Entity CRUD operations (create, get, list, delete)
- Version management (get latest, list versions)
- Test data upload with ground truth spans
- Version metrics and wait_for_completion
- Guidelines refinement (create new version, get suggestions)
- Training data upload
- Trained model creation and listing
- Full workflow integration test

Tests are skipped by default. Set ENABLE_MODEL_ENTITY_TESTS=1 to run.

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
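
Opt-in gating like this can be sketched with the standard library alone. The sketch below uses `unittest.skipUnless`; the PR's test suite may use a different framework or mechanism. `flag_enabled` and the test class are hypothetical, and only the flag name `ENABLE_MODEL_ENTITY_TESTS` comes from the commit message.

```python
import os
import unittest
from typing import Mapping


def flag_enabled(name: str, environ: Mapping[str, str] = os.environ) -> bool:
    """Return True only when the variable is set to exactly '1'."""
    return environ.get(name, "").strip() == "1"


@unittest.skipUnless(
    flag_enabled("ENABLE_MODEL_ENTITY_TESTS"),
    "Set ENABLE_MODEL_ENTITY_TESTS=1 to run model entity tests",
)
class ModelEntityApiTests(unittest.TestCase):
    def test_placeholder(self) -> None:
        # Hypothetical stand-in for a test that would call the live API.
        self.assertTrue(True)
```

Passing the environment as a parameter keeps the flag check itself testable without mutating `os.environ`.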
Comment on lines +433 to +436
if all_ready:
    return

sleep(poll_interval)

This comment was marked as outdated.
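
For reference, a polling loop of this kind is usually bounded by a deadline so that it cannot spin forever. The following is a generic sketch rather than the PR's `_wait_for_files_ready` implementation: `wait_until` and `is_ready` are hypothetical names, while the `timeout_seconds` and `poll_interval` parameter names mirror those visible in the snippets above.

```python
import time
from typing import Callable


def wait_until(
    is_ready: Callable[[], bool],
    timeout_seconds: float = 300.0,
    poll_interval: float = 2.0,
    sleep: Callable[[float], None] = time.sleep,
    clock: Callable[[], float] = time.monotonic,
) -> None:
    """Poll is_ready() until it returns True or the timeout elapses."""
    deadline = clock() + timeout_seconds
    while True:
        if is_ready():
            return
        if clock() >= deadline:
            raise TimeoutError(f"Not ready after {timeout_seconds} seconds")
        sleep(poll_interval)
```

Using `time.monotonic` for the deadline avoids surprises if the system clock is adjusted while waiting, and the injectable `sleep` and `clock` make the loop testable without real delays.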

- Add 9 quick API tests that run without LLM (CRUD, versions, data upload)
- Add 7 LLM tests behind ENABLE_MODEL_ENTITY_LLM_TESTS flag
- Track created entity IDs for safe cleanup (only deletes test entities)

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>