-
Notifications
You must be signed in to change notification settings - Fork 5
Add .eml support with nested attachments parsing #41
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: master
Are you sure you want to change the base?
Conversation
| def _html_to_text(self, html: str) -> str: | ||
| """ | ||
| Convert HTML to plain text using simple regex. | ||
| Args: | ||
| html: HTML content. | ||
| Returns: | ||
| str: Plain text version of the HTML. | ||
| """ | ||
| # Remove HTML tags | ||
| text = re.sub(r"<[^>]+>", " ", html) | ||
| # Normalize whitespace | ||
| return re.sub(r"\s+", " ", text).strip() |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
this looks suspicious, gotta check what it is doing in bigger context
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Pull Request Overview
Add support for processing .eml email files—including header/body extraction and nested attachment handling—into the existing attachment pipeline.
- Introduce new methods for parsing
.emlfiles, extracting/saving attachments, and converting HTML bodies to text - Integrate
.emlsupport in_process_documentand extendforwardto queue nested attachments - Update summary generation to report email files and nested attachments
Reviewed Changes
Copilot reviewed 2 out of 2 changed files in this pull request and generated 2 comments.
| File | Description |
|---|---|
| mxtoai/tools/attachment_processing_tool.py | Added .eml parsing methods, integrated .eml handling into document processing and forwarding, and enhanced summary reporting for nested attachments |
| mxtoai/agents/email_agent.py | Updated imports to include Tool alongside ToolCallingAgent |
Comments suppressed due to low confidence (2)
mxtoai/tools/attachment_processing_tool.py:639
- The
imagescounter in_create_attachment_summaryis never incremented when handling image attachments, so the final status report will never include images. Consider addingimages += 1in the block wherecontent.get("type") == "image".
summary_parts.append(f"Image {att['filename']}: Requires visual processing")
mxtoai/tools/attachment_processing_tool.py:132
- The new
.emlparsing functionality (_process_eml_file, attachment extraction, body extraction) lacks unit tests. Consider adding tests to verify email parsing, nested attachment saving, and HTML-to-text conversion.
def _process_eml_file(self, file_path: Path, extract_attachments: bool = True) -> tuple[str, list[dict[str, Any]]]:
| def _html_to_text(self, html: str) -> str: | ||
| """ | ||
| Convert HTML to plain text using simple regex. | ||
| Args: | ||
| html: HTML content. | ||
| Returns: | ||
| str: Plain text version of the HTML. | ||
| """ | ||
| # Remove HTML tags | ||
| text = re.sub(r"<[^>]+>", " ", html) | ||
| # Normalize whitespace | ||
| return re.sub(r"\s+", " ", text).strip() |
Copilot
AI
Jun 4, 2025
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
[nitpick] Using a regex to strip HTML tags is brittle and may fail on complex markup. Consider using a dedicated HTML-to-text parser (e.g., html.parser or a third-party library) for more reliable results.
| def _html_to_text(self, html: str) -> str: | |
| """ | |
| Convert HTML to plain text using simple regex. | |
| Args: | |
| html: HTML content. | |
| Returns: | |
| str: Plain text version of the HTML. | |
| """ | |
| # Remove HTML tags | |
| text = re.sub(r"<[^>]+>", " ", html) | |
| # Normalize whitespace | |
| return re.sub(r"\s+", " ", text).strip() | |
| class _HTMLToTextParser(HTMLParser): | |
| """ | |
| Custom HTML parser to extract plain text from HTML content. | |
| """ | |
| def __init__(self): | |
| super().__init__() | |
| self.text_parts = [] | |
| def handle_data(self, data): | |
| self.text_parts.append(data) | |
| def get_text(self) -> str: | |
| return "".join(self.text_parts) | |
| def _html_to_text(self, html: str) -> str: | |
| """ | |
| Convert HTML to plain text using a dedicated HTML parser. | |
| Args: | |
| html: HTML content. | |
| Returns: | |
| str: Plain text version of the HTML. | |
| """ | |
| parser = self._HTMLToTextParser() | |
| parser.feed(html) | |
| parser.close() | |
| # Normalize whitespace | |
| return " ".join(parser.get_text().split()) |
|
|
||
| # Update imports to use proper classes from smolagents | ||
| from smolagents import ToolCallingAgent | ||
| from smolagents import Tool, ToolCallingAgent |
Copilot
AI
Jun 4, 2025
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The Tool import is not used in this file and can be removed to clean up unused imports.
| from smolagents import Tool, ToolCallingAgent | |
| from smolagents import ToolCallingAgent |
|
|
||
| # Create a subdirectory for extracted attachments | ||
| eml_name = original_eml_path.stem | ||
| extraction_dir = self.attachments_dir / f"{eml_name}_extracted_{uuid.uuid4().hex[:8]}" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
this looks weird, what is / doing here
|
|
||
| # Write the attachment to disk | ||
| with open(attachment_path, "wb") as f: | ||
| f.write(payload) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
did we verify that the cleanup is happening for these, post processing of the task
| summary = None | ||
| if mode == "full" and self.model and len(content) > 4000: | ||
| messages = [ | ||
| { | ||
| "role": MessageRole.SYSTEM, | ||
| "content": [ | ||
| { | ||
| "type": "text", | ||
| "text": f"Here is a file:\n### {attachment['filename']}\n\n{content[: self.text_limit]}", | ||
| } | ||
| ], | ||
| }, | ||
| { | ||
| "role": MessageRole.USER, | ||
| "content": [ | ||
| { | ||
| "type": "text", | ||
| "text": "Please provide a comprehensive summary of this document in 5-7 sentences.", | ||
| } | ||
| ], | ||
| }, | ||
| ] | ||
| summary = self.model(messages).content |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
are we passing attachments through llm and generating summaries? How will this work for excel files and others?
No description provided.