-
Notifications
You must be signed in to change notification settings - Fork 5
UPDATED: Add ListSource and DataFrameSource #31
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: dev
Are you sure you want to change the base?
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Pull Request Overview
This PR adds two new source types—DataFrameSource and ListSource—to enable streaming data from pandas DataFrames and Python lists with customizable tagging.
- Introduces DataFrameSource for iterating over DataFrame rows as tagged packets
- Introduces ListSource for iterating over list elements as tagged packets
- Updates
src/orcapod/core/sources.pywith new imports and class definitions
Comments suppressed due to low confidence (3)
src/orcapod/core/sources.py:336
- The docstring refers to 'expected_keys' but the parameter is named 'expected_tag_keys'. Update the doc to match the code.
If expected_keys are provided, they will be used instead of the default keys.
src/orcapod/core/sources.py:494
- The docstring refers to 'expected_keys' but the parameter is named 'expected_tag_keys'. Update the doc to match the code.
If expected_keys are provided, they will be used instead of the default keys.
src/orcapod/core/sources.py:207
- New classes DataFrameSource and ListSource have been introduced without accompanying unit tests. Please add tests to cover the core functionality and edge cases.
class DataFrameSource(Source):
| from typing import Any, Literal | ||
|
|
||
| import pandas as pd | ||
| import polars as pl |
Copilot
AI
Jul 6, 2025
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The import 'polars as pl' is not used in this file. Consider removing it to keep imports clean.
| for idx, row in self.dataframe.iterrows(): | ||
| tag = self.tag_function(row, idx) | ||
| packet = {col: row[col] for col in self.columns} |
Copilot
AI
Jul 6, 2025
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Using DataFrame.iterrows() can be slow for large data sets. Consider iterating with itertuples or another vectorized approach for better performance.
| for idx, row in self.dataframe.iterrows(): | |
| tag = self.tag_function(row, idx) | |
| packet = {col: row[col] for col in self.columns} | |
| for row in self.dataframe.itertuples(index=True, name=None): | |
| idx, *values = row | |
| row_series = pd.Series(values, index=self.dataframe.columns) | |
| tag = self.tag_function(row_series, idx) | |
| packet = {col: row_series[col] for col in self.columns} |
| if missing_columns: | ||
| raise ValueError(f"Columns not found in DataFrame: {missing_columns}") | ||
|
|
||
| if tag_function is None: |
Copilot
AI
Jul 6, 2025
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
[nitpick] Similar tag_function initialization logic appears in both DataFrameSource and ListSource—consider abstracting this into a shared helper or the base class to reduce duplication.
I manually rebased the brian branch from the dev branch and added this code back in,
I addressed your previous comments regarding specifying expected_tag_keys as row_index for the dataframesource and element_index for the ListSource. There was one comment you made that I didn't know how to interpret, made on the line specifying the tag_function in the init section of the dataframeSource class:
"Would you mind adding a feature where if a list of strings are passed, they are interpreted as columns whose values should be used as the tags?"
I'm assuming you mean the tags should have the same keys as the column named passed in? Just double checking.