-
Notifications
You must be signed in to change notification settings - Fork 10
feat(sdk): add input_type parameter for file type detection #50
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: main
Are you sure you want to change the base?
Conversation
…nitial SDK configuration
…ackage configuration
Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
|
Related Documentation 1 document(s) may need updating based on files changed in this PR: PowerRAG SDKThe PR introduces a new 'input_type' parameter for file type detection in document parsing methods, allowing explicit control over file type detection and bypassing auto-detection. The current documentation does not mention this parameter or its usage in the document parsing section or examples. To remain accurate and complete, the documentation should be updated to describe the 'input_type' parameter, its options, and provide usage examples reflecting the new capability. View Suggested Changes@@ -9,6 +9,7 @@
- Create, update, list, and delete datasets.
- Upload documents to datasets, supporting batch operations.
- Parse documents to Markdown format, including direct binary parsing for PDF, Office documents, images, and HTML ([source](https://github.com/oceanbase/powerrag/pull/40)).
+- Explicit file type detection control via `input_type` parameter for parsing methods (see below).
- Asynchronous and synchronous document parsing, with status polling and cancellation.
- Manage document metadata, download content, and handle document chunks.
@@ -75,6 +76,40 @@
- `enable_formula`: Recognize formulas (bool, default False)
- `enable_table`: Recognize tables (bool, default True)
- `from_page`, `to_page`: Page range for PDFs (int, default 0 and 100000)
+- `input_type`: File type detection mode (default: 'auto'). Controls how the SDK determines the file type for parsing. Options:
+ - `'auto'`: Try filename extension first, then auto-detect from binary if no extension or unsupported extension (default)
+ - `'pdf'`, `'office'`, `'html'`, `'image'`: Explicitly specify the file type to bypass detection
+
+#### Example Usage of `input_type`
+
+```python
+# Default: auto-detect using extension or binary content
+result = client.document_manager.parse_to_md_binary(
+ file_binary=open("document.pdf", "rb").read(),
+ filename="document.pdf"
+)
+
+# Auto-detect from binary when file has no extension
+result = client.document_manager.parse_to_md_binary(
+ file_binary=open("document_no_ext", "rb").read(),
+ filename="document_no_ext"
+)
+
+# Explicitly specify file type (bypass detection)
+result = client.document_manager.parse_to_md_binary(
+ file_binary=open("document", "rb").read(),
+ filename="document",
+ input_type="pdf"
+)
+
+# Using config and input_type together
+result = client.document_manager.parse_to_md_binary(
+ file_binary=open("document.html", "rb").read(),
+ filename="document.html",
+ config={"layout_recognize": "mineru"},
+ input_type="html"
+)
+```
### Chat Creation
@@ -115,13 +150,21 @@
### Parse Documents to Markdown (Binary)
```python
+# Default: auto-detect file type
result = client.document_manager.parse_to_md_binary(
file_binary=open("document.pdf", "rb").read(),
- filename="document.pdf",
- config={"layout_recognize": "mineru", "enable_ocr": True}
+ filename="document.pdf"
)
print(result['markdown'])
print(f"Parsed {result['total_images']} images")
+
+# Explicitly specify file type
+result = client.document_manager.parse_to_md_binary(
+ file_binary=open("document", "rb").read(),
+ filename="document",
+ input_type="pdf"
+)
+print(result['markdown'])(source) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Pull request overview
This pull request adds support for automatic file type detection and URL-based file parsing to the PowerRAG SDK. The changes enable the system to handle files without extensions and download files directly from URLs for parsing.
Changes:
- Added
input_typeparameter ('auto', 'pdf', 'office', 'html', 'image', 'markdown') to control file type detection mode across parsing methods - Implemented
detect_file_typeutility function to identify file formats from binary content using magic numbers - Enhanced
parse_to_md_uploadendpoint to acceptfile_urlparameter for downloading and parsing remote files - Updated SDK methods to propagate and handle
input_typeconsistently - Added comprehensive test coverage for auto-detection and file_url functionality
- Updated test configuration to require API key via environment variables with
.envfile support
Reviewed changes
Copilot reviewed 9 out of 9 changed files in this pull request and generated 20 comments.
Show a summary per file
| File | Description |
|---|---|
| powerrag/utils/file_utils.py | Adds detect_file_type function to identify file formats from binary data using magic numbers |
| powerrag/server/services/parse_service.py | Adds input_type parameter to parse_file_binary and implements auto-detection logic |
| powerrag/server/routes/powerrag_routes.py | Enhances /parse_to_md/upload endpoint to support file_url and input_type parameters |
| powerrag/sdk/modules/document_manager.py | Updates SDK methods to accept and propagate input_type parameter through API calls |
| powerrag/sdk/tests/conftest.py | Requires POWERRAG_API_KEY from environment, changes default port to 9390 |
| powerrag/sdk/tests/.env.example | Provides template for test environment configuration |
| powerrag/sdk/tests/pytest.ini | Adds test timeout and configuration settings |
| powerrag/sdk/tests/test_document.py | Adds comprehensive tests for input_type auto-detection and file_url functionality |
| powerrag/sdk/README.md | Documents new input_type and file_url features with usage examples |
Comments suppressed due to low confidence (1)
powerrag/server/routes/powerrag_routes.py:1160
- The route is calling the private method
_parse_to_markdowndirectly instead of using the publicparse_file_binarymethod. This bypasses the logic and validation inparse_file_binaryand creates duplicate code for input_type handling. It would be better to callservice.parse_file_binary(binary, filename, config, input_type)which already handles all the format detection logic, and then extract markdown and images from its return value. This would reduce code duplication and ensure consistent behavior across the codebase.
# Create service and parse
gotenberg_url = config.get("gotenberg_url", GOTENBERG_URL)
service = PowerRAGParseService(gotenberg_url=gotenberg_url)
# Parse to markdown
md_content, images = service._parse_to_markdown(
filename=filename,
binary=binary,
format_type=format_type,
config=config
)
| def test_explicit_input_type_pdf(self, client: PowerRAGClient, tmp_path): | ||
| """测试显式指定 input_type='pdf'""" | ||
| # 创建一个简单的 PDF 内容 | ||
| pdf_header = b"%PDF-1.4\n%\xE2\xE3\xCF\xD3\n" | ||
| pdf_content = pdf_header + b"1 0 obj\n<<\n/Type /Catalog\n>>\nendobj\n" | ||
|
|
||
| # 显式指定为 PDF 类型,即使文件名没有扩展名 | ||
| result = client.document.parse_to_md_binary( | ||
| file_binary=pdf_content, | ||
| filename="document", | ||
| input_type="pdf" # 显式指定类型 | ||
| ) | ||
|
|
||
| assert "filename" in result |
Copilot
AI
Jan 23, 2026
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The test at line 653-666 similarly uses a minimal, potentially invalid PDF and only asserts that "filename" is in the result. This is a very weak assertion that doesn't verify whether the parsing actually succeeded or if the PDF content was correctly processed. Consider strengthening the assertion to check for "markdown" in result and verify markdown_length > 0, or use a valid PDF fixture.
| elif input_type in ['pdf', 'office', 'html', 'image']: | ||
| # Use explicitly specified input_type | ||
| format_type = input_type | ||
| logger.info(f"Using explicit input_type: {format_type} for file: {filename}") | ||
| else: | ||
| return jsonify({ | ||
| "code": 400, | ||
| "message": f"Unsupported file format: {file_ext}. Supported formats: pdf, doc, docx, ppt, pptx, jpg, png, html" | ||
| "message": f"Invalid input_type: {input_type}. Must be 'auto' (default), 'pdf', 'office', 'html', or 'image'." | ||
| }), 400 |
Copilot
AI
Jan 23, 2026
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The valid input_type values shown in the error message at line 1147 don't include 'markdown', but the parse_file_binary method at line 190 in parse_service.py accepts 'markdown' as a valid input_type. This inconsistency means users will get an error if they try to use 'markdown' as input_type in this endpoint, even though the underlying service method supports it. Either add 'markdown' to the validation here, or remove it from parse_file_binary's validation.
| # Download from URL | ||
| import requests | ||
| from urllib.parse import urlparse | ||
| from pathlib import Path | ||
|
|
||
| logger.info(f"Downloading file from URL: {file_url}") | ||
|
|
||
| try: | ||
| response = requests.get(file_url, timeout=60) | ||
| response.raise_for_status() | ||
| binary = response.content |
Copilot
AI
Jan 23, 2026
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The file download functionality doesn't validate the file_url scheme or implement any URL allowlist. This could potentially allow Server-Side Request Forgery (SSRF) attacks where an attacker could make the server request internal network resources. Consider adding URL validation to only allow specific schemes (http/https) and optionally implement an allowlist or blocklist for domains/IPs, especially to prevent requests to internal/private network addresses.
Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
Summary
Solution Description