Description
With #299 we get even better support for running on HPC. However, the current way of storing results does not scale to a very large number of experiments or to high-dimensional data. At present, the results are stored as a collection of CSV files wrapped in a tarball. The main advantage of this is that the results are easy to extract and open with any text editor or even Excel. It is also a very convenient way of storing results in a cross-platform, cross-language way. However, it breaks down with large outputs because you run into memory errors.
A short-term solution is to change save_results. It currently builds up the entire tarball in memory before flushing it to disk. A slightly more memory-efficient approach is to create a directory on disk, write each CSV file to it, and then turn the entire directory into a tarball. Some memory profiling is likely needed to see how much of a difference this makes.
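To make the directory-then-tarball idea concrete, here is a minimal sketch using only the standard library. It is not the current save_results implementation: the function name save_results_via_directory, the helper _write_csv, and the assumption that experiments and each outcome can be treated as an iterable of rows are all hypothetical stand-ins for the real data structures.

```python
import csv
import os
import tarfile
import tempfile


def _write_csv(path, rows):
    """Write an iterable of rows to a CSV file, one row at a time."""
    with open(path, "w", newline="") as fh:
        csv.writer(fh).writerows(rows)


def save_results_via_directory(experiments, outcomes, file_name):
    """Hypothetical variant of save_results (a sketch, not the real API).

    Assumes `experiments` is an iterable of rows and `outcomes` maps an
    outcome name to an iterable of rows.
    """
    with tempfile.TemporaryDirectory() as tmp_dir:
        # Each CSV goes straight to disk, so only one file's rows are
        # being written at any moment.
        _write_csv(os.path.join(tmp_dir, "experiments.csv"), experiments)
        for name, rows in outcomes.items():
            _write_csv(os.path.join(tmp_dir, f"{name}.csv"), rows)

        # tarfile reads each file from disk while adding it, instead of
        # requiring the whole archive to be assembled in memory first.
        with tarfile.open(file_name, "w:gz") as archive:
            for entry in sorted(os.listdir(tmp_dir)):
                archive.add(os.path.join(tmp_dir, entry), arcname=entry)
```

Because the rows can come from a generator, the serialized form of an outcome never has to exist in memory as a whole. Whether this helps in practice depends on how large the in-memory results already are, which is what the profiling would have to show.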
A longer-term solution is to add other storage solutions where results are flushed to disk as they come in. This avoids having to build up the very large results dataset in memory. The basic machinery for this is in place because of the callback keyword argument that is passed to perform_experiments. It will, however, probably require a minor rethink of how to capture the serialization of all classes of outcomes (i.e., to_disk and from_disk). Depending on the chosen storage solution, a slightly different serialization will be required.
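As a rough illustration of flushing results while they come in, the sketch below appends one row per experiment to a CSV file per outcome. The class StreamingCsvCallback is hypothetical, and so is its interface: it assumes the callback is invoked once per experiment with a dict of parameter values and a dict mapping outcome names to sequences of values. The actual callback interface expected by perform_experiments may differ and would need to be checked.

```python
import csv
import os


class StreamingCsvCallback:
    """Hypothetical callback that appends each result to disk on arrival.

    Assumed call signature (not confirmed against the workbench):
    callback(experiment, outcomes), where `experiment` is a dict of
    parameter values and `outcomes` maps each outcome name to a sequence.
    An outcome may not be named "experiments" in this sketch.
    """

    def __init__(self, directory, parameter_names, outcome_names):
        os.makedirs(directory, exist_ok=True)
        self.parameter_names = list(parameter_names)
        self._files = {}
        self._writers = {}
        # One open CSV file for the experiments plus one per outcome.
        for name in ["experiments", *outcome_names]:
            path = os.path.join(directory, f"{name}.csv")
            fh = open(path, "w", newline="")
            self._files[name] = fh
            self._writers[name] = csv.writer(fh)
        self._writers["experiments"].writerow(self.parameter_names)

    def __call__(self, experiment, outcomes):
        # Append one row per file; nothing is accumulated in memory.
        row = [experiment[p] for p in self.parameter_names]
        self._writers["experiments"].writerow(row)
        for name, values in outcomes.items():
            self._writers[name].writerow(values)
        # Flush so a crash loses at most the experiment being written.
        for fh in self._files.values():
            fh.flush()

    def close(self):
        for fh in self._files.values():
            fh.close()


# Usage with made-up parameter and outcome names:
# cb = StreamingCsvCallback("./results", ["a", "b"], ["y"])
# cb({"a": 1, "b": 2}, {"y": [0.5, 0.6]})
# cb.close()
```

CSV is used here only to stay close to the existing format. A different storage solution would replace the writerow calls with its own serialization, which is where the rethink of to_disk and from_disk comes in.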