@WSL0809
Contributor

@WSL0809 WSL0809 commented Jan 2, 2026

Summary

Enforce stricter, database-aware validation for collection names in pyseekdb, so that invalid names are rejected at the client layer before any SQL is executed (issue #74).

Solution Description

  • Added a centralized collection name validator in BaseClient:
    • Only allows [a-zA-Z0-9_].
    • Computes an effective max length based on:
      • The underlying DB table name limit (_MAX_TABLE_NAME_LENGTH = 64) minus the configurable collection table prefix from CollectionNames.table_name("").
    • For the default prefix c$v1$ (5 characters), this yields a max length of 59 characters (64 - 5), so that prefix + name stays within the DB limit.
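
The calculation in the list above can be sketched roughly as follows. This is an illustration only, not the code from this PR: `MAX_TABLE_NAME_LENGTH`, `TABLE_PREFIX`, `effective_max_length`, and `validate_collection_name` are hypothetical stand-ins for `_MAX_TABLE_NAME_LENGTH`, the result of `CollectionNames.table_name("")`, and the validator in BaseClient, and the exact error messages are assumptions.

```python
import re

# Hypothetical stand-ins for the names mentioned in the PR description.
MAX_TABLE_NAME_LENGTH = 64   # underlying DB table name limit
TABLE_PREFIX = "c$v1$"       # default collection table prefix
NAME_PATTERN = re.compile(r"[a-zA-Z0-9_]+")


def effective_max_length(prefix: str = TABLE_PREFIX) -> int:
    """Characters left for the collection name once the prefix is counted."""
    return max(0, MAX_TABLE_NAME_LENGTH - len(prefix))


def validate_collection_name(name: str, prefix: str = TABLE_PREFIX) -> None:
    """Raise TypeError or ValueError if the name breaks the rules above."""
    if not isinstance(name, str):
        raise TypeError(f"Collection name must be a string, got {type(name).__name__}")
    if not name:
        raise ValueError("Collection name must not be empty")
    limit = effective_max_length(prefix)
    if len(name) > limit:
        raise ValueError(f"Collection name too long: {len(name)} > {limit}")
    # fullmatch rejects any character outside the allowed set, including newlines
    if NAME_PATTERN.fullmatch(name) is None:
        raise ValueError("Only [a-zA-Z0-9_] is allowed in collection names")
```

With the default prefix, a 59-character name passes and a 60-character name is rejected, because prefix + name would exceed 64 characters.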

Summary by CodeRabbit

  • Documentation

    • Clarified collection naming rules and configuration guidance, including character restrictions, maximum length, and configuration wrapper usage.
  • New Features

    • Enforced collection naming conventions to prevent invalid collection creation.
  • Tests

    • Added unit tests covering collection name validation and related edge cases.
  • Other

    • Minor clarification to a warning message for clearer diagnostics.


@coderabbitai

coderabbitai bot commented Jan 2, 2026

📝 Walkthrough


Adds collection name validation to the client (type, non-empty, allowed chars [A-Za-z0-9_], max length), invokes validation in the collection creation and get-or-create flows, adds unit tests for validation, and updates the README to document name constraints and add notes on the Configuration wrapper.

Changes

| Cohort / File(s) | Summary |
| --- | --- |
| Documentation: README.md | Updated name parameter docs to require non-empty strings, only letters/digits/underscore ([A-Za-z0-9_]), max length 512 in docs (effective DB table max enforced in code). Expanded configuration section to show Configuration wrapper supporting HNSWConfiguration and FulltextParserConfig, and backward-compatibility notes for dimension. |
| Client implementation: src/pyseekdb/client/client_base.py | Added _COLLECTION_NAME_PATTERN and _MAX_COLLECTION_NAME_LENGTH, implemented _validate_collection_name(name: str) (type, non-empty, length, character checks), integrated validation into create_collection and get_or_create_collection, and standardized a GET_SQL warning message. |
| Tests: tests/unit_tests/test_collection_name_validation.py | New tests covering effective max length calculation, acceptance of valid names (including edge case at max length), TypeError for non-string inputs, and ValueError for empty, too-long, or invalid-character names. |

Sequence Diagram(s)

(omitted — changes do not introduce multi-component sequential interactions requiring a diagram)

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~20 minutes

Poem

🐰 I nibble names with careful care,
Letters, digits, underscores fair.
No empty hops, no stray characters roam,
Validated fields find their home.
A tiny hop, a tidy schema poem.

Pre-merge checks and finishing touches

✅ Passed checks (3 passed)
| Check name | Status | Explanation |
| --- | --- | --- |
| Description Check | ✅ Passed | Check skipped - CodeRabbit’s high-level summary is enabled. |
| Title check | ✅ Passed | The title 'fix: Validate collection names' directly and clearly describes the main change: adding validation for collection names throughout the codebase. |
| Docstring Coverage | ✅ Passed | Docstring coverage is 91.67%, which is sufficient. The required threshold is 80.00%. |



@coderabbitai coderabbitai bot left a comment

Actionable comments posted: 0

🧹 Nitpick comments (1)
src/pyseekdb/client/client_base.py (1)

89-127: Solid validation logic with comprehensive error handling.

The validation function correctly enforces all requirements: type checking, non-empty constraint, length bounds, and character restrictions. The guard against prefix misconfiguration (lines 112-116) is a thoughtful addition.

Optional: Remove redundant variable assignment

Line 111 assigns effective_max = available_length, but available_length is already the value you need. Consider simplifying:

     # Calculate effective maximum based on table prefix and database limit
     table_prefix = CollectionNames.table_name("")
     # Guard against misconfiguration where prefix itself is too long
     available_length = max(0, _MAX_TABLE_NAME_LENGTH - len(table_prefix))
-    effective_max = available_length
-    if effective_max <= 0:
+    if available_length <= 0:
         raise ValueError(
             "Invalid collection table prefix configuration: no space left for collection name. "
             f"Prefix={table_prefix!r}, table name limit={_MAX_TABLE_NAME_LENGTH}."
         )
-    if len(name) > effective_max:
+    if len(name) > available_length:
         raise ValueError(
             f"Collection name too long: {len(name)} characters; "
-            f"maximum allowed is {effective_max} for the current prefix."
+            f"maximum allowed is {available_length} for the current prefix."
         )
📜 Review details

Configuration used: defaults

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 9c24053 and fb0a768.

📒 Files selected for processing (3)
  • README.md
  • src/pyseekdb/client/client_base.py
  • tests/unit_tests/test_collection_name_validation.py
🧰 Additional context used
🧬 Code graph analysis (2)
src/pyseekdb/client/client_base.py (1)
src/pyseekdb/client/meta_info.py (2)
  • CollectionNames (12-15)
  • table_name (14-15)
tests/unit_tests/test_collection_name_validation.py (2)
src/pyseekdb/client/client_base.py (1)
  • _validate_collection_name (89-126)
src/pyseekdb/client/meta_info.py (2)
  • CollectionNames (12-15)
  • table_name (14-15)
🪛 Ruff (0.14.10)
src/pyseekdb/client/client_base.py

104-104: Avoid specifying long messages outside the exception class

(TRY003)


106-106: Avoid specifying long messages outside the exception class

(TRY003)


113-116: Avoid specifying long messages outside the exception class

(TRY003)


118-121: Avoid specifying long messages outside the exception class

(TRY003)


123-126: Avoid specifying long messages outside the exception class

(TRY003)

🔇 Additional comments (5)
src/pyseekdb/client/client_base.py (3)

43-49: LGTM! Well-documented validation constants.

The regex pattern and table name length constant are correctly defined. The comments clearly explain the validation strategy of computing an effective maximum by subtracting the collection prefix length from the database limit.


435-435: Excellent placement of validation.

Calling _validate_collection_name(name) at the beginning of create_collection ensures early failure before any table name construction or SQL execution, providing clear error messages to users.


794-795: Good practice: validate early before any database interaction.

Validating the collection name before checking existence ensures immediate feedback for invalid names. While this validation also occurs in create_collection (line 435) if a new collection is created, the early check here is appropriate for fast failure.
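
The fail-fast ordering described in the two comments above can be illustrated with a small in-memory sketch. `InMemoryClient`, its `lookups` counter, and the simplified `_validate` helper are hypothetical and exist only for this example; they are not part of pyseekdb.

```python
import re

_NAME_PATTERN = re.compile(r"[A-Za-z0-9_]+")


class InMemoryClient:
    """Hypothetical stand-in for the real client; keeps collections in a dict."""

    def __init__(self) -> None:
        self._collections = {}
        self.lookups = 0  # counts simulated database round trips

    @staticmethod
    def _validate(name) -> None:
        if not isinstance(name, str):
            raise TypeError("Collection name must be a string")
        if not name or _NAME_PATTERN.fullmatch(name) is None:
            raise ValueError(f"Invalid collection name: {name!r}")

    def has_collection(self, name: str) -> bool:
        self.lookups += 1  # stands in for a query against the database
        return name in self._collections

    def create_collection(self, name: str) -> dict:
        self._validate(name)  # validated again here, as the review notes
        collection = {"name": name}
        self._collections[name] = collection
        return collection

    def get_or_create_collection(self, name: str) -> dict:
        self._validate(name)  # fail fast, before the existence check
        if self.has_collection(name):
            return self._collections[name]
        return self.create_collection(name)
```

In this sketch an invalid name such as "bad-name" raises ValueError while `lookups` is still 0, which is the property the review comments point out: no database interaction happens for a name that can never be valid.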

tests/unit_tests/test_collection_name_validation.py (1)

1-81: Excellent test coverage for collection name validation.

The test suite comprehensively covers all validation rules:

  • Valid names including the edge case exactly at maximum length
  • Type checking for non-string inputs
  • Empty name rejection
  • Length limit enforcement
  • Invalid character detection with multiple examples

The _effective_max_length property correctly mirrors the production calculation, ensuring tests remain synchronized with the implementation.

README.md (1)

261-261: Documentation accurately reflects validation constraints.

The parameter documentation correctly describes all collection name requirements enforced by _validate_collection_name: non-empty, alphanumeric plus underscore characters, and the computed maximum length (59 characters with the default c$v1$ prefix).

@zhanghuidinah

@hnwyllmm PTAL

@hnwyllmm
Member

hnwyllmm commented Jan 5, 2026

Thanks very much for your contribution.
But there's a mistake: the max length of a collection name is 512, not (64-len('c$v1$')).
64 is a limitation of seekdb/oceanbase, and we should fix it in the future. For now, we should only limit the collection name to 512 characters, even though we can't test that scenario.


@coderabbitai coderabbitai bot left a comment

Actionable comments posted: 0

🧹 Nitpick comments (2)
src/pyseekdb/client/client_base.py (2)

82-109: Validation logic is comprehensive and correct.

The function properly enforces all required constraints (type, non-empty, length, character set) with clear error messages.

Optional: Address Ruff TRY003 warnings

Static analysis suggests extracting long error messages into custom exception classes for better code organization. Since the error messages are clear and actionable as-is, this is purely a style improvement:

class CollectionNameError(ValueError):
    """Base exception for collection name validation errors."""
    pass

class EmptyCollectionNameError(CollectionNameError):
    """Collection name cannot be empty."""
    def __init__(self):
        super().__init__("Collection name must not be empty")

class CollectionNameTooLongError(CollectionNameError):
    """Collection name exceeds maximum length."""
    def __init__(self, length: int, max_length: int):
        super().__init__(
            f"Collection name too long: {length} characters; "
            f"maximum allowed is {max_length}."
        )

class InvalidCollectionNameCharsError(CollectionNameError):
    """Collection name contains invalid characters."""
    def __init__(self):
        super().__init__(
            "Collection name contains invalid characters. "
            "Only letters, digits, and underscore are allowed: [a-zA-Z0-9_]"
        )

Then use: raise EmptyCollectionNameError() instead of constructing messages inline.
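
As a rough sketch of how such exception classes might be wired into the validator, assuming the 512-character limit requested in the maintainer's review: the constant names follow `_MAX_COLLECTION_NAME_LENGTH` and `_COLLECTION_NAME_PATTERN` from the walkthrough, but the regex value, the function body, and the `validate_collection_name` name are assumptions rather than the merged code.

```python
import re

_MAX_COLLECTION_NAME_LENGTH = 512
_COLLECTION_NAME_PATTERN = re.compile(r"[a-zA-Z0-9_]+")  # assumed pattern


class CollectionNameError(ValueError):
    """Base class; subclassing ValueError keeps `except ValueError` working."""


class EmptyCollectionNameError(CollectionNameError):
    def __init__(self) -> None:
        super().__init__("Collection name must not be empty")


class CollectionNameTooLongError(CollectionNameError):
    def __init__(self, length: int, max_length: int) -> None:
        super().__init__(
            f"Collection name too long: {length} characters; "
            f"maximum allowed is {max_length}."
        )


class InvalidCollectionNameCharsError(CollectionNameError):
    def __init__(self) -> None:
        super().__init__(
            "Collection name contains invalid characters. "
            "Only letters, digits, and underscore are allowed: [a-zA-Z0-9_]"
        )


def validate_collection_name(name: str) -> None:
    """Hypothetical validator that raises the dedicated exception types."""
    if not isinstance(name, str):
        raise TypeError(f"Collection name must be a string, got {type(name).__name__}")
    if not name:
        raise EmptyCollectionNameError()
    if len(name) > _MAX_COLLECTION_NAME_LENGTH:
        raise CollectionNameTooLongError(len(name), _MAX_COLLECTION_NAME_LENGTH)
    if _COLLECTION_NAME_PATTERN.fullmatch(name) is None:
        raise InvalidCollectionNameCharsError()
```

Because every class derives from ValueError, callers and tests that already catch ValueError would not need to change in this arrangement.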


536-632: Consider adding validation to other collection methods for consistency.

While create_collection and get_or_create_collection correctly validate names, get_collection, delete_collection, and has_collection do not. Adding validation would provide clearer error messages when users attempt operations with invalid names (e.g., "Invalid collection name" instead of "Collection not found").

Proposed additions

 def get_collection(
     self,
     name: str,
     embedding_function: EmbeddingFunctionParam = _NOT_PROVIDED
 ) -> "Collection":
+    _validate_collection_name(name)
     # Construct table name: c$v1${name}
     table_name = CollectionNames.table_name(name)

 def delete_collection(self, name: str) -> None:
+    _validate_collection_name(name)
     # Construct table name: c$v1${name}
     table_name = CollectionNames.table_name(name)

 def has_collection(self, name: str) -> bool:
+    _validate_collection_name(name)
     # Construct table name: c$v1${name}
     table_name = CollectionNames.table_name(name)

Also applies to: 634-652, 721-741

📜 Review details

Configuration used: defaults

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between fb0a768 and 6f58e45.

📒 Files selected for processing (3)
  • README.md
  • src/pyseekdb/client/client_base.py
  • tests/unit_tests/test_collection_name_validation.py
🚧 Files skipped from review as they are similar to previous changes (1)
  • tests/unit_tests/test_collection_name_validation.py
🧰 Additional context used
🧬 Code graph analysis (1)
src/pyseekdb/client/client_base.py (3)
src/pyseekdb/client/admin_client.py (1)
  • AdminAPI (34-94)
src/pyseekdb/client/collection.py (3)
  • Collection (19-570)
  • name (63-65)
  • embedding_function (88-90)
src/pyseekdb/client/embedding_function.py (1)
  • EmbeddingFunction (38-65)
🪛 Ruff (0.14.10)
src/pyseekdb/client/client_base.py

96-96: Avoid specifying long messages outside the exception class

(TRY003)


98-98: Avoid specifying long messages outside the exception class

(TRY003)


100-103: Avoid specifying long messages outside the exception class

(TRY003)


105-108: Avoid specifying long messages outside the exception class

(TRY003)

🔇 Additional comments (5)
src/pyseekdb/client/client_base.py (3)

39-42: LGTM! Constants correctly define validation constraints.

The pattern restricts collection names to [A-Za-z0-9_] and the 512-character limit aligns with the reviewer's feedback addressing issue #74.


417-417: LGTM! Validation correctly placed at method entry.

The validation occurs before any database interaction, ensuring invalid names are rejected at the client as intended.


776-777: LGTM! Validation with clear intent comment.

The validation is correctly placed before the has_collection check, ensuring invalid names are rejected early.

README.md (2)

261-261: LGTM! Documentation accurately describes validation rules.

The collection name constraints are clearly documented and match the implementation in _validate_collection_name.


262-272: LGTM! Configuration documentation is clear and comprehensive.

The updated documentation effectively explains the Configuration wrapper pattern and provides helpful guidance on when to use it versus HNSWConfiguration directly.

@WSL0809
Contributor Author

WSL0809 commented Jan 6, 2026

> Thanks very much for your contribution. But there's a mistake: the max length of a collection name is 512, not (64-len('c$v1$')). 64 is a limitation of seekdb/oceanbase, and we should fix it in the future. For now, we should only limit the collection name to 512 characters, even though we can't test that scenario.

already fixed

@hnwyllmm hnwyllmm merged commit 3a29a29 into oceanbase:develop Jan 6, 2026
6 checks passed
@WSL0809 WSL0809 deleted the issue74 branch January 7, 2026 02:10