Add .eml support with nested attachments parsing #41

gautamp8 · 2025-06-04T10:54:39Z

No description provided.

satwikkansal · 2025-06-04T11:21:22Z

mxtoai/tools/attachment_processing_tool.py

+    def _html_to_text(self, html: str) -> str:
+        """
+        Convert HTML to plain text using simple regex.
+
+        Args:
+            html: HTML content.
+
+        Returns:
+            str: Plain text version of the HTML.
+
+        """
+        # Remove HTML tags
+        text = re.sub(r"<[^>]+>", " ", html)
+        # Normalize whitespace
+        return re.sub(r"\s+", " ", text).strip()


This looks suspicious; need to check what it is doing in the bigger context.

Copilot

Pull Request Overview

Add support for processing .eml email files, including header/body extraction and nested attachment handling, to the existing attachment pipeline.

Introduce new methods for parsing .eml files, extracting/saving attachments, and converting HTML bodies to text
Integrate .eml support in _process_document and extend forward to queue nested attachments
Update summary generation to report email files and nested attachments

Reviewed Changes

Copilot reviewed 2 out of 2 changed files in this pull request and generated 2 comments.

File	Description
mxtoai/tools/attachment_processing_tool.py	Added `.eml` parsing methods, integrated `.eml` handling into document processing and forwarding, and enhanced summary reporting for nested attachments
mxtoai/agents/email_agent.py	Updated imports to include `Tool` alongside `ToolCallingAgent`

Comments suppressed due to low confidence (2)

mxtoai/tools/attachment_processing_tool.py:639

The images counter in _create_attachment_summary is never incremented when handling image attachments, so the final status report will never include images. Consider adding images += 1 in the block where content.get("type") == "image".

summary_parts.append(f"Image {att['filename']}: Requires visual processing")

mxtoai/tools/attachment_processing_tool.py:132

The new .eml parsing functionality (_process_eml_file, attachment extraction, body extraction) lacks unit tests. Consider adding tests to verify email parsing, nested attachment saving, and HTML-to-text conversion.

def _process_eml_file(self, file_path: Path, extract_attachments: bool = True) -> tuple[str, list[dict[str, Any]]]:

Copilot · 2025-06-04T11:21:59Z

mxtoai/tools/attachment_processing_tool.py

+    def _html_to_text(self, html: str) -> str:
+        """
+        Convert HTML to plain text using simple regex.
+
+        Args:
+            html: HTML content.
+
+        Returns:
+            str: Plain text version of the HTML.
+
+        """
+        # Remove HTML tags
+        text = re.sub(r"<[^>]+>", " ", html)
+        # Normalize whitespace
+        return re.sub(r"\s+", " ", text).strip()


[nitpick] Using a regex to strip HTML tags is brittle and may fail on complex markup. Consider using a dedicated HTML-to-text parser (e.g., html.parser or a third-party library) for more reliable results.

Suggested change

-    def _html_to_text(self, html: str) -> str:
-        """
-        Convert HTML to plain text using simple regex.
-
-        Args:
-            html: HTML content.
-
-        Returns:
-            str: Plain text version of the HTML.
-
-        """
-        # Remove HTML tags
-        text = re.sub(r"<[^>]+>", " ", html)
-        # Normalize whitespace
-        return re.sub(r"\s+", " ", text).strip()
+    class _HTMLToTextParser(HTMLParser):
+        """
+        Custom HTML parser to extract plain text from HTML content.
+        """
+
+        def __init__(self):
+            super().__init__()
+            self.text_parts = []
+
+        def handle_data(self, data):
+            self.text_parts.append(data)
+
+        def get_text(self) -> str:
+            return "".join(self.text_parts)
+
+    def _html_to_text(self, html: str) -> str:
+        """
+        Convert HTML to plain text using a dedicated HTML parser.
+
+        Args:
+            html: HTML content.
+
+        Returns:
+            str: Plain text version of the HTML.
+
+        """
+        parser = self._HTMLToTextParser()
+        parser.feed(html)
+        parser.close()
+        # Normalize whitespace
+        return " ".join(parser.get_text().split())
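
For comparison, here is a minimal standalone sketch of the parser-based approach using only the standard library's `html.parser`. It is not the code in this PR: the names `TextExtractor` and `html_to_text` are hypothetical, and skipping `<script>`/`<style>` contents is an assumption about what the email pipeline would want. It joins text nodes with a space, so adjacent block elements do not run together the way they would with an empty-string join.

```python
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Collect text nodes, ignoring the contents of <script> and <style>."""

    SKIPPED_TAGS = {"script", "style"}

    def __init__(self) -> None:
        # convert_charrefs defaults to True, so entities such as &amp; are decoded.
        super().__init__()
        self._parts: list[str] = []
        self._skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIPPED_TAGS:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIPPED_TAGS and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        # Only keep text that is not inside a skipped element.
        if self._skip_depth == 0:
            self._parts.append(data)

    def get_text(self) -> str:
        # Join with a space, then collapse every run of whitespace to one space.
        return " ".join(" ".join(self._parts).split())


def html_to_text(html: str) -> str:
    """Hypothetical module-level equivalent of _html_to_text."""
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    return parser.get_text()
```

One trade-off of the space join: inline markup that splits a word (for example `<b>Hel</b>lo`) comes out with a space in the middle, which is the same behaviour the regex version has.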

Copilot · 2025-06-04T11:22:00Z

mxtoai/agents/email_agent.py


 # Update imports to use proper classes from smolagents
-from smolagents import ToolCallingAgent
+from smolagents import Tool, ToolCallingAgent


The Tool import is not used in this file and can be removed to clean up unused imports.

Suggested change

-from smolagents import Tool, ToolCallingAgent
+from smolagents import ToolCallingAgent

satwikkansal · 2025-06-04T11:22:34Z

mxtoai/tools/attachment_processing_tool.py

+
+        # Create a subdirectory for extracted attachments
+        eml_name = original_eml_path.stem
+        extraction_dir = self.attachments_dir / f"{eml_name}_extracted_{uuid.uuid4().hex[:8]}"


This looks weird, what is `/` doing here?
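
On the `/` question: assuming `self.attachments_dir` is a `pathlib.Path` (the diff excerpt does not show its definition), `/` is the path-joining operator that `pathlib` overloads, not division. A small sketch with hypothetical directory and file names:

```python
import uuid
from pathlib import Path

# Hypothetical stand-ins for self.attachments_dir and original_eml_path.
attachments_dir = Path("attachments")
original_eml_path = Path("inbox/report.eml")

# .stem is the file name without its final suffix: "report".
eml_name = original_eml_path.stem

# Path / str builds a child path; the 8 hex characters keep the name unique.
extraction_dir = attachments_dir / f"{eml_name}_extracted_{uuid.uuid4().hex[:8]}"

# The operator form is equivalent to calling joinpath() explicitly.
same_result = attachments_dir / "x" == attachments_dir.joinpath("x")
```

Note that building the path does not create anything on disk; the directory still has to be created separately, for example with `extraction_dir.mkdir(parents=True, exist_ok=True)`.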

satwikkansal · 2025-06-04T11:23:25Z

mxtoai/tools/attachment_processing_tool.py

+
+                        # Write the attachment to disk
+                        with open(attachment_path, "wb") as f:
+                            f.write(payload)


Did we verify that these files are cleaned up once the task has finished processing?
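
One possible way to make the cleanup explicit, sketched under assumptions: the helper name `extraction_dir` and the directory layout are hypothetical, and the PR may already handle cleanup elsewhere. The idea is to scope the extracted files to a context manager so they are removed even if processing raises.

```python
import shutil
import tempfile
from contextlib import contextmanager
from pathlib import Path


@contextmanager
def extraction_dir(base: Path, eml_name: str):
    """Yield a fresh directory for extracted attachments and remove it on exit."""
    base.mkdir(parents=True, exist_ok=True)
    # mkdtemp creates a uniquely named directory under `base`.
    path = Path(tempfile.mkdtemp(prefix=f"{eml_name}_extracted_", dir=base))
    try:
        yield path
    finally:
        # Runs on normal exit and on exceptions alike.
        shutil.rmtree(path, ignore_errors=True)


# Usage: write a nested attachment, process it, and let the directory vanish.
demo_base = Path(tempfile.mkdtemp())  # hypothetical attachments root
with extraction_dir(demo_base, "report") as directory:
    attachment_path = directory / "nested.txt"
    with open(attachment_path, "wb") as f:
        f.write(b"payload")
    existed_during_processing = attachment_path.exists()
removed_after_processing = not directory.exists()
shutil.rmtree(demo_base, ignore_errors=True)
```

This only fits if nested attachments are fully processed inside the `with` block; if they are queued for later processing, cleanup would need to happen after the queue drains instead.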

satwikkansal · 2025-06-04T11:26:24Z

mxtoai/tools/attachment_processing_tool.py

+                    summary = None
+                    if mode == "full" and self.model and len(content) > 4000:
+                        messages = [
+                            {
+                                "role": MessageRole.SYSTEM,
+                                "content": [
+                                    {
+                                        "type": "text",
+                                        "text": f"Here is a file:\n### {attachment['filename']}\n\n{content[: self.text_limit]}",
+                                    }
+                                ],
+                            },
+                            {
+                                "role": MessageRole.USER,
+                                "content": [
+                                    {
+                                        "type": "text",
+                                        "text": "Please provide a comprehensive summary of this document in 5-7 sentences.",
+                                    }
+                                ],
+                            },
+                        ]
+                        summary = self.model(messages).content


Are we passing attachments through the LLM and generating summaries? How will this work for Excel files and other formats?

Add .eml support with nested attachments parsing

f26bb84

satwikkansal requested a review from Copilot June 4, 2025 11:19


3 participants
