@kevinschaper
Collaborator

Summary

  • Implement LinkML-to-Pandera schema generator using Jinja templates
  • Generate schemas for MatrixNode, MatrixEdge, UnionedNode, UnionedEdge classes
  • Integrate with Makefile build system via make gen-pandera target

Key Features

  • Auto-generates from LinkML schema: No more manual Pandera schema maintenance
  • PySpark compatibility: Proper ArrayType with nullable=False for list items
  • Enum validation: Preserves existing validation for predicates, categories, etc.
  • Consistent formatting: Clean, properly indented output matching project style
  • Always regenerates: .PHONY target ensures fresh generation on every run
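As a rough illustration of the generation step, the sketch below renders Pandera-style column definitions from a small slot table. It uses only the standard library rather than Jinja, and the slot dictionary layout, `render_column`, `render_schema`, and the function name `get_matrix_node_schema` are hypothetical, not the actual `PanderaGenerator` API. The emitted `T.ArrayType(T.StringType(), False)` relies on PySpark's second positional argument, `containsNull`, which is presumably what "nullable=False for list items" refers to.

```python
import ast

# Hypothetical, simplified slot descriptions; a real LinkML schema carries far more detail.
SLOTS = {
    "id": {"required": True, "multivalued": False},
    "category": {"required": True, "multivalued": True},
    "description": {"required": False, "multivalued": False},
}


def render_column(name, slot):
    """Render one Pandera Column definition as a line of Python source."""
    if slot.get("multivalued"):
        # Second argument is containsNull=False: individual list items may not be null.
        dtype = "T.ArrayType(T.StringType(), False)"
    else:
        dtype = "T.StringType()"
    # Assumption: a slot without required: true becomes a nullable column.
    nullable = not slot.get("required", False)
    return f'            "{name}": Column({dtype}, nullable={nullable}),'


def render_schema(func_name, slots):
    """Render a function returning a DataFrameSchema; columns are sorted for stable diffs."""
    lines = [
        f"def {func_name}():",
        "    return DataFrameSchema(",
        "        columns={",
    ]
    lines += [render_column(name, slots[name]) for name in sorted(slots)]
    lines += ["        },", "    )"]
    return "\n".join(lines) + "\n"


source = render_schema("get_matrix_node_schema", SLOTS)
ast.parse(source)  # raises SyntaxError if the generated text is not valid Python
print(source)
```

Sorting the columns by name keeps the generated file stable between runs, so a regenerated schema only shows real changes in the diff.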

Test plan

  • Verify make gen-pandera generates all four schema functions
  • Confirm generated schemas compile without syntax errors
  • Check ArrayType fields use nullable=False for list items
  • Validate enum checks are properly applied
  • Ensure unique constraints match original patterns
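The first two items of the test plan could be automated along these lines. This is a minimal sketch, assuming the generated module defines one top-level function per class; the four function names below are hypothetical and may not match what the generator actually emits.

```python
import ast

# Hypothetical function names; the real generated module may name them differently.
EXPECTED = {
    "get_matrix_node_schema",
    "get_matrix_edge_schema",
    "get_unioned_node_schema",
    "get_unioned_edge_schema",
}


def missing_schema_functions(source, expected=EXPECTED):
    """Parse generated source and return the expected function names that are absent."""
    tree = ast.parse(source)  # a SyntaxError here means the output does not compile
    defined = {node.name for node in tree.body if isinstance(node, ast.FunctionDef)}
    return sorted(expected - defined)


# Stand-in for the generated file contents, with two of the four functions present.
sample = (
    "def get_matrix_node_schema():\n    pass\n\n"
    "def get_matrix_edge_schema():\n    pass\n"
)
print(missing_schema_functions(sample))
```

In practice `source` would be read from the file that `make gen-pandera` writes, and an empty result would mean all four functions are present.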

🤖 Generated with Claude Code

- Create PanderaGenerator class in matrix_schema/generators/panderagen.py
- Generate schemas for MatrixNode, MatrixEdge, UnionedNode, UnionedEdge
- Integrate with Makefile via gen-pandera target (always runs)
- Maintain PySpark compatibility with proper ArrayType nullable=False for list items
- Preserve existing validation patterns (enum checks, unique constraints)
- Auto-generate from LinkML schema with proper formatting

🤖 Generated with [Claude Code](https://claude.ai/code)

Co-Authored-By: Claude <noreply@anthropic.com>
Collaborator

@matentzn matentzn left a comment

AWeeeeeesommmmmmmeeee

THANKS!

Collaborator

Should the main Makefile be manually edited? It seems the cookiecutter template should update it?

…rated schema so that we get a better diff on the PR
Collaborator

@matentzn matentzn left a comment

Thank you, I love it. I assume you won't merge before the QC failures are dealt with :P

 return DataFrameSchema(
     columns={
-        "id": Column(T.StringType(), nullable=False),
+        "id": Column(T.StringType(), nullable=True),
Collaborator

Since you ordered the output now, why are there so many changes to the schema? For example, nullable=True seems like a big change.

Collaborator Author

I'll figure out whether the LinkML is wrong or the schema generator is wrong. One way or another, I think id should clearly be nullable=False.

Collaborator Author

I think we need to set required: true on a whole bunch of slots
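A quick way to find the affected slots might look like the sketch below. It assumes the slots have already been loaded into a plain dictionary (for example from the schema YAML), and it assumes the generator maps a slot without `required: true` to `nullable=True`; `slots_missing_required` and the sample data are hypothetical.

```python
def slots_missing_required(slots):
    """Return the names of slots that do not explicitly set required: true."""
    return sorted(
        name
        for name, slot in slots.items()
        # A slot declared with an empty body loads as None, so fall back to {}.
        if not (slot or {}).get("required", False)
    )


# Hypothetical excerpt of a slot table, not the project's actual schema.
example_slots = {
    "id": {"range": "string"},
    "name": None,
    "category": {"required": True, "multivalued": True},
}
print(slots_missing_required(example_slots))
```

Every name this returns would end up as a nullable column under the assumption above, so the list doubles as a checklist of slots to review.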

…id, for multivalued fields, and generate the enum checks in a more generic way in panderagen.py
3 participants