Conversation

@acwhite211 (Member) commented Dec 2, 2025

Fixes #7577

Edits the workbench upload code to batch rows together during workbench uploads and validation, with the goal of speeding both up. The batch size that seems optimal for uploads still needs tuning, and the progress-bar updates for the batched upload path are still being adjusted.

A good test case that has been exhibiting slow uploads is here: https://drive.google.com/file/d/1Mpr_KWMkCY74_yZv_knXiNGeG6TSKCYk/view?usp=drive_link
This file has many fields; here is a data mapping I made for testing purposes: https://drive.google.com/file/d/1eo56GKwGbMXV7luGD_SJ24b-ADxFb53X/view?usp=drive_link
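
A minimal sketch of the batching idea, with hypothetical names (`batched`, `process_batch`, and `BATCH_SIZE` are illustrative, not the actual workbench functions or settings):

```python
from itertools import islice
from typing import Callable, Iterable, Iterator, List

BATCH_SIZE = 500  # illustrative starting point; the optimal size is still being tuned


def batched(rows: Iterable[dict], size: int) -> Iterator[List[dict]]:
    """Yield successive lists of at most `size` rows."""
    it = iter(rows)
    while batch := list(islice(it, size)):
        yield batch


def upload_dataset(rows: Iterable[dict], process_batch: Callable[[List[dict]], None]) -> None:
    """Feed rows to the uploader in batches instead of one at a time.

    `process_batch` stands in for whatever validates/uploads a group of
    rows together; progress-bar updates would move here, firing once per
    batch rather than once per row.
    """
    for batch in batched(rows, BATCH_SIZE):
        process_batch(batch)
```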

Checklist

  • Self-review the PR after opening it to make sure the changes look good and
    self-explanatory (or properly documented)
  • Add relevant issue to release milestone
  • Add PR to documentation list

Testing instructions

Initial Testing:

  • Run the validation process on workbench data that you know takes a decent amount of time to validate. Run the validation on this branch and on the main branch for comparison.
  • Check whether validation was faster than on the main branch.
  • Check that the validation results look the same as on the main branch.

Further Testing:

  • Run workbench validation on a large workbench dataset and confirm it finishes within a few minutes.
  • Run workbench upload on a large workbench dataset and confirm it finishes within a few minutes.

@acwhite211 (Member Author) commented:
Did some profiling to determine which parts of the upload/validate pipeline are taking the most time per row.

Here are the timing results for the first 1,000 rows of the cash upload dataset:

  • Total: 124.39 s
  • apply_scoping: 59.18 s (~47.6%)
  • process_row: 58.70 s (~47.2%)
  • bind result: 5.13 s (~4.1%)
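
For reference, per-section numbers like these can be collected with a small accumulating decorator around the hot calls. This is a generic sketch of one way to do it, not the actual profiling code used in this PR; `apply_scoping` and `process_row` are the real function names from the pipeline, everything else is illustrative:

```python
import time
from collections import defaultdict

timings: defaultdict[str, float] = defaultdict(float)


def timed(section: str):
    """Accumulate wall-clock time spent in a named pipeline section."""
    def decorator(fn):
        def wrapper(*args, **kwargs):
            start = time.perf_counter()
            try:
                return fn(*args, **kwargs)
            finally:
                timings[section] += time.perf_counter() - start
        return wrapper
    return decorator

# Wrap the hot functions, run the first 1,000 rows, then inspect `timings`:
#   apply_scoping = timed("apply_scoping")(apply_scoping)
#   process_row   = timed("process_row")(process_row)
```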

After adding caching for apply_scoping, the new timing results were:

  • Total: 64.59 s
  • process_row: 58.57 s (~90.7%)
  • bind result: 4.84 s (~7.5%)
  • apply_scoping: 0.14 s (~0.2%)

So that gives roughly a 2x improvement, from about 500 rows per minute to about 1,000. Aiming for a 5x to 10x improvement if possible.
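
A minimal sketch of the caching idea, assuming apply_scoping is deterministic and row-independent, i.e. its result depends only on the upload plan and the collection. The cache key and signature here are illustrative, not the real ones:

```python
from functools import lru_cache


@lru_cache(maxsize=None)
def scoped(plan_key: str, collection_id: int) -> dict:
    """Memoized stand-in for apply_scoping.

    Assumption: scoping never depends on the individual row, so the
    expensive result can be computed once per (plan, collection) pair
    and reused for every subsequent row.
    """
    # ... call the real apply_scoping here ...
    return {"plan": plan_key, "collection": collection_id}


# The first call computes; the other 999 rows hit the cache:
for _ in range(1000):
    scoped("cash-plan", 4)
```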

Now working on speeding up the sections inside the process_row function. It doesn't lend itself well to batching, so I'm exploring multiple approaches and adding more fine-grained profiling.

@acwhite211 (Member Author) commented Dec 3, 2025

Added code that can use bulk_insert on applicable rows. Full validation of the cash workbench dataset of 321,216 records now takes 50 minutes, so in rows per minute we've gone from 500 to 1,000, and now to about 6,400. That's roughly a 10x speed increase on the cash example. I still need to work out which types of rows can safely go through bulk_insert and which should not, to avoid possible issues. Also looking into implementing bulk_update for other situations, and there are some possible speedups that might work for the binding and matching sections of the code.
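
A hedged sketch of the bulk-insert pattern using Django's bulk_create; the model and row dicts are placeholders, and the caveat in the comment is exactly the open question above about which row types are safe:

```python
from django.db import transaction


def insert_rows(model, rows, batch_size=500):
    """Insert pre-validated rows in bulk instead of one save() per row.

    `model` is any Django model class and `rows` an iterable of field
    dicts (placeholders for the actual workbench tables). Caveat:
    bulk_create skips save() overrides and pre/post-save signals, so it
    should only be used for row types where that is known to be safe.
    """
    objs = [model(**row) for row in rows]
    with transaction.atomic():
        model.objects.bulk_create(objs, batch_size=batch_size)
```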

@acwhite211 (Member Author) commented Jan 7, 2026

If you've been doing conversions or working with large workbench validations that take a while to run, please try running the validation on this branch and check whether the results are the same and whether there's a speedup. If you can, put the workbench data on Google Drive so any issues can be reproduced and debugged. Thanks.

We don't plan to merge this branch for 7.12; it doesn't seem to work with batch edit. So we'll likely just use this branch internally when needed.
