Add benchmarking suite #252
base: main
Conversation
Codecov Report
Attention: Patch coverage is …

Additional details and impacted files:

@@           Coverage Diff            @@
##             main     #252     +/-  ##
==========================================
- Coverage   81.79%   80.01%   -1.79%
==========================================
  Files           9       10       +1
  Lines        1494     1621     +127
  Branches      257      282      +25
==========================================
+ Hits         1222     1297      +75
- Misses        226      277      +51
- Partials       46       47       +1
Wow, thank you for the effort on this @ugognw! It looks well thought out. I'm traveling this week and part of next week but will dedicate some time to reviewing in detail as soon as I can.
Thanks! I'd like to think so! Sounds good. I look forward to hearing your feedback!
Replace ReferenceData with _BenchmarkEntry. This named tuple separates solution and solute properties, enabling each type of property to be retrieved on its own. An analogous _get_solution_property method has been implemented. _get_reference is renamed to _get_dataset, since the data structure returned by the method defines a general solution/solute property data set (reference data plus a Solution object that can be used for benchmarking). The benchmark_engine functionality has moved to report_results for simplicity; if finer-grained control over data set partitioning or result reporting is desired, it can be restored readily by undoing this change.
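For orientation, a minimal sketch of what such an entry might look like (the field names follow the commit messages; the exact definition in the PR may differ):

```python
from typing import Any, NamedTuple


class _BenchmarkEntry(NamedTuple):
    """One benchmark record: a Solution plus its reference data, with
    solute and solution properties kept in separate fields so that each
    kind can be retrieved on its own."""

    solution: Any       # the pyEQL Solution the data describes
    solute_data: Any    # per-solute reference properties
    solution_data: Any  # bulk-solution reference properties
```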
Solution creation is now handled when reference files are read. Benchmarking is parametrized by engine.
Although EOS subclasses define get_activity_coefficient to return the activity coefficients of individual solutes, only the mean activity coefficient can be measured experimentally. As such, to benchmark against experimental data, we must use the definition of the mean activity coefficient in terms of the activity coefficients of the individual ions. Alternatively, this function could be implemented as a property (Solution.mean_activity) or a method (Solution.get_mean_activity).
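As a reminder of the definition being used, here is a hedged sketch; it assumes a Salt-like helper exposing the ion formulas and stoichiometric coefficients (as pyEQL's Salt class does), and the names are illustrative rather than the PR's actual implementation:

```python
def mean_activity_coefficient(solution, salt):
    """Stoichiometrically weighted geometric mean of the individual ion
    activity coefficients:
    gamma_pm = (gamma_+**nu_+ * gamma_-**nu_-) ** (1 / (nu_+ + nu_-)).
    """
    gamma_cation = solution.get_activity_coefficient(salt.cation)
    gamma_anion = solution.get_activity_coefficient(salt.anion)
    nu_c, nu_a = salt.nu_cation, salt.nu_anion
    return (gamma_cation**nu_c * gamma_anion**nu_a) ** (1 / (nu_c + nu_a))
```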
Publicize functions/classes for processing input/output data models. Return values from the main function (renamed to benchmark_engine for clarity) and pass engine and source as arguments. Store results instead of printing to the console. Parse the Solution dictionary into a Solution object when reading a dataset from file.
Passing a string corresponding to an internal source name retrieves the Path to the data file (distributed with the package). All files are loaded in the same way.
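One plausible way to resolve a named internal source, assuming the datasets ship inside a pyEQL.benchmark package and are stored as JSON (the package path and filename convention here are assumptions for illustration):

```python
from importlib import resources
from pathlib import Path


def _resolve_source(source):
    """Map a bundled dataset name to its packaged file; otherwise fall
    back to treating `source` as an ordinary filesystem path."""
    candidate = resources.files("pyEQL.benchmark") / f"{source}.json"
    if candidate.is_file():
        return candidate
    return Path(source)
```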
...to list[tuple[str, Quantity]]
The new RMSE is more robust in the case where property values are Python primitives. Property getters are now called to retrieve properties; this fixes a bug where the getter itself was returned instead of the value. Variables are renamed (from property) to avoid shadowing a Python built-in. Enumerating solution data (in get_dataset) now yields the right number of items to unpack.
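A sketch of an RMSE written so it works for plain floats as well as pint Quantity values; the metric actually used in the PR may differ:

```python
def rmse(pairs):
    """Root-mean-square error over (calculated, reference) pairs.

    Avoids numpy and sum()'s implicit 0 start value so the same code
    works whether the values are Python floats or pint Quantity objects.
    """
    if not pairs:
        raise ValueError("rmse requires at least one (calculated, reference) pair")
    total = None
    for calculated, reference in pairs:
        squared = (calculated - reference) ** 2
        total = squared if total is None else total + squared
    return (total / len(pairs)) ** 0.5
```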
Improve docstrings.
Salt takes no extra kwargs. reduce takes no kwargs.
Users should be able to select which properties, as well as which solutions (compositions, thermodynamic state), to benchmark against. This will also enable speed-ups in testing.
Implementing BenchmarkEntry.solute_properties/solution_properties in this way will enable speed-ups in testing when calculating error statistics between calculated and reference datasets (list[BenchmarkEntry]): looking data up in a dictionary by key has O(1) time complexity (one will have the name of the property when calculating error), whereas finding a specific item in a list has O(N) time complexity. Accordingly, load_dataset now expects solute/solution properties within data files to be structured as dictionaries instead of property-value 2-tuples, and it accepts solutions, solute_properties, and solution_properties as keyword-only arguments. Whether data for a solution is included is determined by its components, molar concentration, temperature, and pressure.
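To make the dict-based layout concrete, here is a hypothetical shape for the property data; formulas, property names, and the 1.0 values are placeholders, not reference data:

```python
# Hypothetical dictionary-structured property data; in the real datasets
# the values would be unit-bearing Quantity objects, not bare floats.
solute_data = {
    "Na[+1]": {"activity_coefficient": 1.0},
    "Cl[-1]": {"activity_coefficient": 1.0},
}
solution_data = {"osmotic_coefficient": 1.0}

# O(1) lookup by property name, versus scanning a list of (name, value) tuples:
gamma = solute_data["Na[+1]"]["activity_coefficient"]
```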
Encapsulating the logic for creating dictionary keys in a function, and creating a type alias for its type, should minimize the amount of refactoring required if the format changes, and it appeases mypy. This also makes comparison of Solutions for the purpose of benchmarking considerably simpler. load_dataset now returns a dictionary whose keys indicate the Solution to which the data applies, as generated by _create_solution_key. Internal data sources are now opened more reliably using resources.files. Generating the list of SolutionKeys corresponding to the `solutions` parameter in advance simplifies the logic required to check for a Solution in `solutions`.
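A minimal sketch of what such a key builder could look like; the tuple-of-strings format and the helper's signature are assumptions, the point being only that the key is hashable and engine-independent:

```python
SolutionKey = tuple[str, ...]  # illustrative alias; the PR may use something richer


def _create_solution_key(components, temperature, pressure):
    """Build a hashable, engine-independent identifier for a Solution from
    its solute amounts and thermodynamic state (moles of solvent are
    deliberately excluded; see the later commit note)."""
    solutes = tuple(
        f"{formula}:{amount}" for formula, amount in sorted(components.items())
    )
    return (*solutes, f"T={temperature}", f"P={pressure}")
```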
Using the same data structures for engine and reference data is desirable because it separates the logic for retrieving properties from the logic for calculating statistics and simplifies the retrieval of corresponding items in calculate_stats. This implementation ensures that only one Solution object is created per unique solution across all benchmark datasets and that each BenchmarkEntry object is associated with a unique Solution. In this way, a single dataset can be generated for the engine, which can then be benchmarked against each reference dataset.
Accepting two identical data structures generalizes calculate_stats to all combinations of data sources (e.g., reference-reference, engine-engine, engine-reference). An additional benefit is that the signature now makes it clearer to users how to organize the two sets of data to be compared. Previously, engine data was implicitly generated using the Solution objects in each BenchmarkEntry; as mentioned previously, this change removes that logic for retrieving solute and solution properties.
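A sketch of the symmetric signature this describes; the container shapes (a dict keyed by SolutionKey, entries with dict-valued property fields) follow the earlier commit messages, but the code below is illustrative rather than the PR's implementation:

```python
def calculate_stats(reference, calculated, metric):
    """Compare two datasets of identical shape with an arbitrary metric.

    Because both arguments share the same structure, any pairing works:
    reference vs. engine, engine vs. engine, or reference vs. reference.
    """
    pairs = []
    for key, ref_entry in reference.items():
        calc_entry = calculated.get(key)
        if calc_entry is None:
            continue  # no matching Solution in the other dataset
        for solute, ref_props in ref_entry.solute_data.items():
            calc_props = calc_entry.solute_data.get(solute, {})
            for name, ref_value in ref_props.items():
                calc_value = calc_props.get(name)
                if calc_value is not None:
                    pairs.append((calc_value, ref_value))
    return metric(pairs)
```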
FormulaDict was initially used to automatically standardize how solutes appear in solution compositions and solute property lists. However, in addition to standardizing ion formulas, FormulaDict also sorts its internal copy of the dictionary by its values. This led to an error in the previous implementation because the values are dictionaries, and dictionaries don't implement the comparison methods required for '<' and '>'. Instead, we now call standardize_formula manually in the public functions where malformed formulas are likely: when loading datasets and when creating a BenchmarkEntry from a Solution. Bug fixes: initialize property data as dictionaries in create_entry; fix indices and data type in _create_engine_dataset; initialize data pair lists in calculate_stats.
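The replacement boundary-normalization step might look roughly like this, assuming standardize_formula is importable from pyEQL.utils as the commit message implies:

```python
from pyEQL.utils import standardize_formula  # assumed import location


def _standardize_keys(property_data):
    """Return a plain dict whose solute keys have been normalized,
    e.g. 'Na+' -> 'Na[+1]', without relying on FormulaDict."""
    return {
        standardize_formula(formula): values
        for formula, values in property_data.items()
    }
```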
The concentration for Solution objects created in _create_engine_dataset is determined from the SolutionKey. Without units, the amounts are not interpreted as concentrations and the Solution does not have the correct composition.
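For example (standard pyEQL usage; only the variable name is mine), the amounts must carry units when the Solution is rebuilt:

```python
from pyEQL import Solution

# Unit-bearing amounts give the intended 1 mol/L NaCl composition.
solution = Solution({"Na+": "1.0 mol/L", "Cl-": "1.0 mol/L"})

# By contrast, passing bare numbers (e.g. {"Na+": 1.0}) is not interpreted
# as a concentration and yields the wrong composition, which was the bug.
```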
The three load_dataset unit tests are just outlines and are expected to fail. The explicit test cases for test_should_benchmark_all_engines are intended to eventually overlap with, and ultimately replace, those in test_conductivity.
The number of moles of solvent depends on the engine, but for the purpose SolutionKey serves, two Solutions with the same solute concentrations under different engines (and thus different moles of solvent) should be considered identical if they are at the same state (temperature and pressure).
Updated the module docstring variable for consistency. Require both _create_engine_dataset arguments. For now, only require that the metric argument to calculate_stats can handle Quantity arguments, so that we can ensure units are handled correctly. Note that the property values in BenchmarkEntry.solute_data and BenchmarkEntry.solution_data are expected to be Quantity objects. The additional conditional statements in calculate_stats handle the edge cases where the return values of _get_solute_property and _get_solution_property are None and where no data exists in `calculated` corresponding to `reference`. Because this may be of interest to the user, the event is logged at the INFO level.
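A rough illustration of the edge-case handling described above; the logger name and helper are mine, not the PR's:

```python
import logging

logger = logging.getLogger("pyEQL.benchmark")  # assumed logger name


def _pair_or_skip(reference_value, calculated_value, label):
    """Return a (calculated, reference) pair, or None when either side is
    missing; the skip is recorded at INFO level since users may care."""
    if reference_value is None or calculated_value is None:
        logger.info("No comparable value for %s; skipping this data point.", label)
        return None
    return (calculated_value, reference_value)
```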
Update: I will do my best to look at this next week. I have some proposals due at the end of this week that will keep me busy until then. Thanks for your patience!
Summary
Major changes:
pyEQL.benchmark

Todos
validate data
scrutinize dataset format
finalize benchmarking statistics format
remove redundant tests
add more granular unit tests
optimize number of test cases/time
fix PhreeqcEOS benchmarking
add docs under Electrolyte Modeling Engines
remove logic related to _get_activity_coefficient_pitzer?
convert % concentration in datasets to mol/L in load_dataset
Update changelog
Google format docstrings added.
Code linted with ruff. (For guidance in fixing rule violations, see the rule list.)
Type annotations included. Check with mypy.
Tests added for new features/fixes.
I have run the tests locally and they passed.
Happy to hear any feedback on the current direction!