Use real data for the overlaps at the start / end of every iblock iteration #4
Hey Cees,
Decided to port this one back while I was doing the cuFFT one -- it reuses data to pad the FFT buffers with real samples where zeros were used previously.
The setup means fewer samples are processed on the first iteration so that the end of the first buffer can be filled; after that, it simply recycles the data already in the buffer from the current iteration to prepare cp1p/cp2p for the next one.
So the overall data structure looks like this:
t=N: overlap | processed data | overlap
t=0: <overlap_0 = 0> | <noverlap = reflected data><data_0> | <overlap_1 = data>
t=1: <data_0 overlap> | <overlap_1><data_1> | <overlap_2>
t=2: <data_1 overlap> | <overlap_2><data_2> | <overlap_3>
... etc.
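The recycling scheme above can be sketched on the CPU. This is an illustrative NumPy version, not the actual GPU code: the buffer layout `[front pad | payload | back pad]`, the function name, and the sizes are all hypothetical, but it shows how the first iteration zero-fills the front pad while every later iteration reuses the previous buffer's tail as real-data padding.

```python
import numpy as np

def run_blocks(signal, nblock, noverlap):
    """Hypothetical CPU sketch of the buffer layout above.

    Each buffer holds [front pad | nblock payload | back pad], with each
    pad noverlap samples wide. On the first iteration the front pad is
    zero-filled and nblock + noverlap fresh samples are consumed so the
    end of the buffer can be filled; on every later iteration the last
    2 * noverlap samples of the previous buffer are recycled into the
    start of the next one, so both pads contain real data.
    """
    buflen = nblock + 2 * noverlap
    buf = np.zeros(buflen, dtype=signal.dtype)
    pos = 0
    buffers = []
    first = True
    while True:
        if first:
            # Zero front pad; consume nblock + noverlap fresh samples.
            need = nblock + noverlap
            if pos + need > len(signal):
                break
            buf[noverlap:] = signal[pos:pos + need]
            first = False
        else:
            # Recycle the previous buffer's tail as the new head
            # (this plays the role padd_next_iteration has on the GPU).
            buf[:2 * noverlap] = buf[-2 * noverlap:]
            need = nblock  # only nblock fresh samples from here on
            if pos + need > len(signal):
                break
            buf[2 * noverlap:] = signal[pos:pos + need]
        pos += need
        buffers.append(buf.copy())
    return buffers
```

With this layout, consecutive buffers always share their `2 * noverlap` boundary samples, so the overlap regions hold real signal instead of zeros everywhere except the very first front pad.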
I'm making a note of it here as it took me a couple of tries to get the indexing right: on the first iteration, I discard / offset the output by 2 * noverlap samples, as we effectively lose noverlap samples on each end of the data.
At the start, we lose noverlap samples because the starting point in the array is offset due to there being insufficient data; at the end, performing the overlap costs another noverlap samples.
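Restating that arithmetic as a tiny (hypothetical) helper, since the indexing was the tricky part:

```python
def first_iteration_output_offset(noverlap):
    """Hypothetical helper restating the note above: noverlap samples
    are lost at the start (the read pointer is offset past the missing
    history) and another noverlap at the end (consumed by the overlap),
    so the first iteration's output is discarded / offset by
    2 * noverlap samples in total."""
    lost_at_start = noverlap  # offset starting point: no earlier data
    lost_at_end = noverlap    # overlap at the buffer's tail
    return lost_at_start + lost_at_end
```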
Overall the implementation is stable judging by my outputs, but I suspect the process could be made more efficient by tweaking the block/grid sizes for padd_next_iteration (since it only needs to iterate over the first 2 * noverlap samples) and for the new unpack_and_padd (as it can skip the first 2 * noverlap samples). With my layout, though, it's hard to judge what kind of performance effect that will have on your setup (I reduce nforward from 100 to 8 and increase nsub to 488).
Cheers