Skip to content

Conversation

@ZiyueXu77
Copy link
Collaborator

Fixes # .

Description

Convert KM example from JobAPI to Recipe, also add production instructions with provisioned HE context

Types of changes

  • Non-breaking change (fix or new feature that would not break existing functionality).
  • Breaking change (fix or new feature that would cause existing functionality to change).
  • New tests added to cover the changes.
  • Quick tests passed locally by running ./runtest.sh.
  • In-line docstrings updated.
  • Documentation updated.

Copilot AI review requested due to automatic review settings December 15, 2025 18:10
Copy link
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

This PR converts the Kaplan-Meier homomorphic encryption example from the JobAPI to the Recipe API, enabling both simulation and production deployment modes. The key changes include:

  • Replaced km_job.py with a new Recipe-based job.py that supports both simulation and production environments
  • Added production deployment infrastructure including provisioning configuration (project.yml) and convenience scripts (start_all.sh)
  • Updated controller initialization to explicitly pass empty persistor_id parameter
  • Enhanced documentation with comprehensive instructions for both deployment modes

Reviewed changes

Copilot reviewed 7 out of 9 changed files in this pull request and generated 3 comments.

Show a summary per file
File Description
examples/advanced/kaplan-meier-he/job.py New Recipe-based job configuration replacing JobAPI implementation
examples/advanced/kaplan-meier-he/km_job.py Removed old JobAPI implementation
examples/advanced/kaplan-meier-he/server.py Updated controller initialization with explicit persistor_id
examples/advanced/kaplan-meier-he/server_he.py Updated controller initialization with explicit persistor_id
examples/advanced/kaplan-meier-he/project.yml New provisioning configuration for production deployment with CKKS HE scheme
examples/advanced/kaplan-meier-he/start_all.sh Convenience script for starting all production components locally
examples/advanced/kaplan-meier-he/README.md Comprehensive documentation update covering both simulation and production modes

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

@greptile-apps
Copy link
Contributor

greptile-apps bot commented Dec 15, 2025

Greptile Summary

This PR successfully converts the Kaplan-Meier example from the old JobAPI (km_job.py) to the new Recipe API (job.py), enabling both simulation and production deployment modes. Major changes include:

  • Replaced km_job.py with a new Recipe-based job.py that wraps the KMRecipe class for unified job configuration
  • Restructured file layout: moved scripts from src/ directory to root level (client.py, client_he.py, server.py, server_he.py)
  • Updated HE scheme from BFV to CKKS throughout for consistency between simulation and production modes
  • Added production mode support with project.yml provisioning config, start_all.sh convenience script
  • Enhanced README with comprehensive instructions for both simulation and production deployment

Critical Issue: The HE context path handling has a logic bug that will break production mode. The code uses a hardcoded simulation path (/tmp/nvflare/he_context/he_context_client.txt) for both modes, but production mode requires paths like <startup_kit>/client_context.tenseal. This contradicts the README documentation which claims paths are "automatically managed."

Minor Issue: The ScriptRunner initialization omits params_exchange_format="raw" that was present in the old implementation.

Confidence Score: 2/5

  • This PR has a critical bug that will cause production mode with HE to fail
  • The simulation mode implementation looks correct, but production mode with HE encryption will fail due to incorrect HE context path handling. The code uses simulation-mode file paths that won't exist in production environments where HE context is provisioned via startup kits. This is a blocking issue for the stated goal of supporting production deployments with HE.
  • Pay close attention to job.py - the HE context path logic needs to differentiate between simulation and production modes

Important Files Changed

Filename Overview
examples/advanced/kaplan-meier-he/job.py New Recipe API wrapper replacing old JobAPI; has critical issue with HE context path in production mode not being automatically set
examples/advanced/kaplan-meier-he/README.md Comprehensive documentation update for Recipe API with simulation and production mode instructions; well-written overall
examples/advanced/kaplan-meier-he/client_he.py Renamed from src/kaplan_meier_train_he.py; proper CKKS encryption implementation
examples/advanced/kaplan-meier-he/server_he.py Renamed from src/kaplan_meier_wf_he.py; properly aggregates encrypted histograms with CKKS

Sequence Diagram

sequenceDiagram
    participant User
    participant JobPy as job.py (Recipe)
    participant Server as Server (KM_HE)
    participant Client1 as Client 1
    participant Client2 as Client 2
    
    Note over User,Client2: Production Mode Setup
    User->>User: Run provisioning with project.yml
    User->>Server: Start server with HE context
    User->>Client1: Start client-1 with HE context
    User->>Client2: Start client-2 with HE context
    
    Note over User,Client2: Job Execution
    User->>JobPy: python job.py --encryption --startup_kit_location ...
    JobPy->>JobPy: Create KMRecipe with KM_HE controller
    JobPy->>Server: Submit job via ProdEnv
    
    Note over Server,Client2: Round 1 - Collect Max Index
    Server->>Client1: Send empty message (round=1)
    Server->>Client2: Send empty message (round=1)
    Client1->>Client1: Calculate local max histogram index
    Client2->>Client2: Calculate local max histogram index
    Client1->>Server: Send max_idx
    Client2->>Server: Send max_idx
    Server->>Server: Aggregate max_idx (select global max)
    
    Note over Server,Client2: Round 2 - Collect Encrypted Histograms
    Server->>Client1: Send max_idx_global (round=2)
    Server->>Client2: Send max_idx_global (round=2)
    Client1->>Client1: Create histograms with global length
    Client1->>Client1: Encrypt histograms with CKKS
    Client2->>Client2: Create histograms with global length
    Client2->>Client2: Encrypt histograms with CKKS
    Client1->>Server: Send encrypted hist_obs, hist_cen
    Client2->>Server: Send encrypted hist_obs, hist_cen
    Server->>Server: Aggregate encrypted histograms (HE addition)
    
    Note over Server,Client2: Round 3 - Distribute Global Results
    Server->>Client1: Send encrypted global histograms (round=3)
    Server->>Client2: Send encrypted global histograms (round=3)
    Client1->>Client1: Decrypt global histograms
    Client1->>Client1: Perform Kaplan-Meier analysis
    Client1->>Client1: Save km_curve_fl_he.png
    Client2->>Client2: Decrypt global histograms
    Client2->>Client2: Perform Kaplan-Meier analysis
    Client2->>Client2: Save km_curve_fl_he.png
    Client1->>Server: Send completion acknowledgment
    Client2->>Server: Send completion acknowledgment
    
    Server->>JobPy: Job complete
    JobPy->>User: Display job status
Loading

Copy link
Contributor

@greptile-apps greptile-apps bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Additional Comments (1)

  1. examples/advanced/kaplan-meier-he/start_all.sh, line 71 (link)

    style: stop_all.sh script doesn't exist

    Update to mention using the admin console instead:

9 files reviewed, 1 comment

Edit Code Review Agent Settings | Greptile

Copy link
Contributor

@greptile-apps greptile-apps bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Additional Comments (4)

  1. examples/advanced/kaplan-meier-he/client_he.py, line 140-141 (link)

    logic: HE scheme mismatch - code uses BFV (ts.bfv_vector) but project.yml provisions CKKS scheme for production. Production mode will fail when it tries to use BFV functions with CKKS context. Either change project.yml scheme to BFV or update client/server code to use scheme-agnostic operations.

  2. examples/advanced/kaplan-meier-he/server_he.py, line 95-97 (link)

    logic: HE scheme mismatch - code uses BFV (ts.bfv_vector_from) but project.yml provisions CKKS scheme for production. Production mode will fail at deserialization. Either change project.yml scheme to BFV or update code to be scheme-agnostic.

  3. examples/advanced/kaplan-meier-he/project.yml, line 50 (link)

    logic: CKKS scheme incompatible with code - client_he.py and server_he.py use BFV-specific functions (ts.bfv_vector, ts.bfv_vector_from) for integer histogram operations. Change to BFV scheme or update code.

  4. examples/advanced/kaplan-meier-he/job.py, line 143 (link)

    logic: string replacement will fail if --workspace_dir doesn't contain "/km/". use pathlib or explicit path construction instead.

9 files reviewed, 4 comments

Edit Code Review Agent Settings | Greptile

Copy link
Contributor

@greptile-apps greptile-apps bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Additional Comments (4)

  1. examples/advanced/kaplan-meier-he/job.py, line 143 (link)

    logic: String replacement fails when workspace_dir ends with /km (no trailing slash). The default value /tmp/nvflare/workspaces/km won't match the pattern /km/.

  2. examples/advanced/kaplan-meier-he/project.yml, line 3 (link)

    syntax: Comment is incorrect - both simulation and production modes now use CKKS scheme. The code was converted from BFV to CKKS in client_he.py and server_he.py.

  3. examples/advanced/kaplan-meier-he/README.md, line 76 (link)

    syntax: Documentation is outdated - the implementation now uses CKKS scheme, not BFV. Update to reflect the actual scheme change in client_he.py and server_he.py.

  4. examples/advanced/kaplan-meier-he/README.md, line 85 (link)

    syntax: Documentation is outdated - the implementation now uses CKKS scheme, not BFV. Update to reflect the actual scheme change.

10 files reviewed, 4 comments

Edit Code Review Agent Settings | Greptile

Copy link
Contributor

@greptile-apps greptile-apps bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

10 files reviewed, no comments

Edit Code Review Agent Settings | Greptile

print("Exporting job to", args.job_dir)
recipe.job.export_job(args.job_dir)

# Run recipe
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

too many prints cover the main flow of the code

# In production mode, job runs asynchronously on the FL system
# Check status via admin console instead of waiting for result here
if args.startup_kit_location:
print("\nJob submitted successfully to the FL system!")
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

the code suppose to be simple, this looks complicated now.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

move these no import statements to a separate utils functions


# Run recipe
if args.startup_kit_location:
print("Running job in production mode...")
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

why do you need to know this information

env = ProdEnv(startup_kit_location=args.startup_kit_location, username=args.username)
else:
print("Running job in simulation mode...")
print("workspace_dir=", workspace_dir)
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

why do we print out all these ?

@chesterxgchen
Copy link
Collaborator

/build

Copy link
Contributor

@greptile-apps greptile-apps bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Additional Comments (2)

  1. examples/advanced/kaplan-meier-he/job.py, line 39-51 (link)

    logic: HE context path won't work in production mode. The README (line 268) states the he_context_path is "automatically managed by startup kits" in production, but this implementation uses a hardcoded simulation path /tmp/nvflare/he_context/he_context_client.txt for both modes. In production, the HE context files are at <startup_kit>/client_context.tenseal (for clients) and <startup_kit>/server_context.tenseal (for server), not at this simulation path. This will cause production mode to fail when trying to load HE context. How should the HE context paths be automatically determined in production mode?

  2. examples/advanced/kaplan-meier-he/job.py, line 66-68 (link)

    style: Missing params_exchange_format parameter. The old km_job.py included params_exchange_format="raw" in ScriptRunner configuration. While this may not cause runtime errors, omitting it could lead to unexpected behavior if the default format differs from what the client scripts expect.

10 files reviewed, 2 comments

Edit Code Review Agent Settings | Greptile

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants