Skip to content

Conversation

@gautamp8
Copy link
Collaborator

@gautamp8 gautamp8 commented Jun 4, 2025

No description provided.

@satwikkansal satwikkansal requested a review from Copilot June 4, 2025 11:19
Comment on lines +387 to +401
def _html_to_text(self, html: str) -> str:
"""
Convert HTML to plain text using simple regex.
Args:
html: HTML content.
Returns:
str: Plain text version of the HTML.
"""
# Remove HTML tags
text = re.sub(r"<[^>]+>", " ", html)
# Normalize whitespace
return re.sub(r"\s+", " ", text).strip()
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this looks suspicious, gotta check what it is doing in bigger context

Copy link
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull Request Overview

Add support for processing .eml email files—including header/body extraction and nested attachment handling—into the existing attachment pipeline.

  • Introduce new methods for parsing .eml files, extracting/saving attachments, and converting HTML bodies to text
  • Integrate .eml support in _process_document and extend forward to queue nested attachments
  • Update summary generation to report email files and nested attachments

Reviewed Changes

Copilot reviewed 2 out of 2 changed files in this pull request and generated 2 comments.

File Description
mxtoai/tools/attachment_processing_tool.py Added .eml parsing methods, integrated .eml handling into document processing and forwarding, and enhanced summary reporting for nested attachments
mxtoai/agents/email_agent.py Updated imports to include Tool alongside ToolCallingAgent
Comments suppressed due to low confidence (2)

mxtoai/tools/attachment_processing_tool.py:639

  • The images counter in _create_attachment_summary is never incremented when handling image attachments, so the final status report will never include images. Consider adding images += 1 in the block where content.get("type") == "image".
summary_parts.append(f"Image {att['filename']}: Requires visual processing")

mxtoai/tools/attachment_processing_tool.py:132

  • The new .eml parsing functionality (_process_eml_file, attachment extraction, body extraction) lacks unit tests. Consider adding tests to verify email parsing, nested attachment saving, and HTML-to-text conversion.
def _process_eml_file(self, file_path: Path, extract_attachments: bool = True) -> tuple[str, list[dict[str, Any]]]:

Comment on lines +387 to +401
def _html_to_text(self, html: str) -> str:
"""
Convert HTML to plain text using simple regex.
Args:
html: HTML content.
Returns:
str: Plain text version of the HTML.
"""
# Remove HTML tags
text = re.sub(r"<[^>]+>", " ", html)
# Normalize whitespace
return re.sub(r"\s+", " ", text).strip()
Copy link

Copilot AI Jun 4, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

[nitpick] Using a regex to strip HTML tags is brittle and may fail on complex markup. Consider using a dedicated HTML-to-text parser (e.g., html.parser or a third-party library) for more reliable results.

Suggested change
def _html_to_text(self, html: str) -> str:
"""
Convert HTML to plain text using simple regex.
Args:
html: HTML content.
Returns:
str: Plain text version of the HTML.
"""
# Remove HTML tags
text = re.sub(r"<[^>]+>", " ", html)
# Normalize whitespace
return re.sub(r"\s+", " ", text).strip()
class _HTMLToTextParser(HTMLParser):
"""
Custom HTML parser to extract plain text from HTML content.
"""
def __init__(self):
super().__init__()
self.text_parts = []
def handle_data(self, data):
self.text_parts.append(data)
def get_text(self) -> str:
return "".join(self.text_parts)
def _html_to_text(self, html: str) -> str:
"""
Convert HTML to plain text using a dedicated HTML parser.
Args:
html: HTML content.
Returns:
str: Plain text version of the HTML.
"""
parser = self._HTMLToTextParser()
parser.feed(html)
parser.close()
# Normalize whitespace
return " ".join(parser.get_text().split())

Copilot uses AI. Check for mistakes.

# Update imports to use proper classes from smolagents
from smolagents import ToolCallingAgent
from smolagents import Tool, ToolCallingAgent
Copy link

Copilot AI Jun 4, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The Tool import is not used in this file and can be removed to clean up unused imports.

Suggested change
from smolagents import Tool, ToolCallingAgent
from smolagents import ToolCallingAgent

Copilot uses AI. Check for mistakes.

# Create a subdirectory for extracted attachments
eml_name = original_eml_path.stem
extraction_dir = self.attachments_dir / f"{eml_name}_extracted_{uuid.uuid4().hex[:8]}"
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this looks weird, what is / doing here


# Write the attachment to disk
with open(attachment_path, "wb") as f:
f.write(payload)
Copy link
Collaborator

@satwikkansal satwikkansal Jun 4, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

did we verify that the cleanup is happening for these, post processing of the task

Comment on lines +551 to +573
summary = None
if mode == "full" and self.model and len(content) > 4000:
messages = [
{
"role": MessageRole.SYSTEM,
"content": [
{
"type": "text",
"text": f"Here is a file:\n### {attachment['filename']}\n\n{content[: self.text_limit]}",
}
],
},
{
"role": MessageRole.USER,
"content": [
{
"type": "text",
"text": "Please provide a comprehensive summary of this document in 5-7 sentences.",
}
],
},
]
summary = self.model(messages).content
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

are we passing attachments through llm and generating summaries? How will this work for excel files and others?

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants