AVX and SSE transpose-based float resizers with ks<=4 #440
base: master
Conversation
transpose-based SIMD H-resize function resize_h_planar_float_sse_transpose()
H-resize
Transpose-based H-resizers. Ready for testing.
resizers for float32 up to kernel_size 4.
based resizers for kernel_size up to 4.
I like this completely different approach. I have read your questions (mod4, over-read, etc.) in the comments and will look into whether they are safe, and when. The 'real' kernel size is never mod4 or mod8; the kernel size is aligned and padded on the coefficient side. There is a safety x limit, however, precomputed and stored in the resampling_program struct, which has to be considered. This is the "danger zone" from where the current line position, indexed with "begin" offsets, is not safe to read into SIMD: e.g. if we read 32 bytes from there, it should still remain within the aligned_stride. If this works, it may be quicker for integer samples as well: convert to float, do the work, convert back. Like I did in Overlay masked blend, where the integer arithmetic tricks, conversions, and shifts were slower and much more complex than doing everything in float internally. Not to mention that the latter resulted in cleaner, easier-to-understand, and more maintainable code.
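As a rough illustration of that precomputed safety limit (hypothetical names; the real resampling_program fields differ), the boundary can be derived like this:

```cpp
#include <cstddef>

// Hypothetical sketch: find the last output x from which a SIMD load of
// simd_floats starting at src + pixel_offset[x] still stays inside the
// aligned source stride. Names are illustrative, not the real
// resampling_program layout.
struct ResamplingProgramSketch {
  const int* pixel_offset;   // begin offset of input samples per output x
  int target_size;           // number of output samples per row
};

static int compute_safe_limit(const ResamplingProgramSketch& p,
                              size_t aligned_stride_floats,
                              size_t simd_floats) {
  int x = p.target_size;
  // Walk back until a simd_floats-wide load from pixel_offset[x-1] fits.
  while (x > 0 &&
         p.pixel_offset[x - 1] + (int)simd_floats > (int)aligned_stride_floats)
    --x;
  return x; // outputs [0, x) are safe for full-width SIMD loads
}
```

Outputs past the returned limit would then fall back to a scalar or masked tail loop.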
(I added an .editorconfig to the project; I was surprised there wasn't one, as I use it in my other repos. Your commits contain mixed tabs and spaces for indentation with varying indent levels. From now on these settings are governed by this file, and the VS editor will pick them up automatically.)
V-resize and H-resize for kernel_size <=4
Tried to add 2 versions of AVX512 functions for float V-resize and H-resize for kernel_size<=4; 2 more commits, up to DTL2020@5bd7a28. But I got very strange Prefetch-related performance: adding any Prefetch(N), even N=1, makes performance of the AVX512 version start to drop, and with Prefetch(8) it becomes about 1/2 of the AVX2 resizers. Very strange. Even adding a Prefetch(1) line at the end of the script already lowers AVX512 performance. Maybe something very memory-unfriendly happens with the current Prefetch implementation in the AVS+ core, and it somehow significantly damages performance of the AVX512 parts of the program. Or maybe the VS2019 compiler also creates a not-very-good executable; I cannot test other compilers yet. Tested with AVSMeter64.
Or is there too much overhead from state changes or register save/restore? Or too many vzeroupper instructions, or the opposite: missing ones. Meanwhile I have cleaned up and made memory-access safe your H SSE and AVX resizers for small kernel sizes; I'll show you this weekend when I can spend more time on it. Anyway, we have to check AVX512 capability and support in CMakeLists.txt, covering all used compilers: GCC, Clang/LLVM, and MSVC. AVX512 got new options in VS2022 for preferring 256- or 512-bit registers (and also for AVX10). Any reason not to move to VS2022? Soon we'll have VS2025.
float H-resize for AVX2 ks4, ks8. AVX512 ks4, ks8, ks16 (selectors here; some AVX512 functions are not fully debugged)
float H-resizers: AVX2 ks4 and ks8; AVX512 ks4, ks8, ks16 (8 and 16 are for performance testing only, not fully debugged)
I will try some day to download and set up VS2022 on my development host at work, but it typically requires lots of downloads and a long HDD setup, and then checking whether everything still works. Update: Tested the AVX512 resizers and they look to be working well, at least with a single test script. On an AVX512 host (Xeon Gold chip) in 1-thread mode (no Prefetch in the script) they run about 2.0..2.4 times faster compared with the 3.7.5 release; the AVX2 version is also somewhat faster. Also I found an issue with VirtualDub testing: a 'reloaded' script (without restarting the VirtualDub process) cannot switch MaxCPU limits; only a full restart of the VirtualDub process applies a new SetMaxCPU setting. If this is a feature, it would be good to document it somewhere. Info() only updates after the VirtualDub process is restarted.
For tests (and for the future clean integration) I'm directly cloning your repo for myself; it's easier to copy and apply a diff of your actual code against my files. Well, my figures of course contain only your transposed ks>=4 versions, not the just-committed ones. Benchmarks:
Yes, once you set (limit) the CPU flags, it stays there until the DLL restarts (it's a singleton, filled only once on creation).
I tried to make a release for testing: https://github.com/DTL2020/AviSynthPlus/releases/tag/post-3.7.5-r.4312 But I understand it was built only with the not-best MSVC2019 compiler. Can you make a build with LLVM, or whatever is best in 2025, to compare? I tested it vs avsresize and fmtconv on an AVX512 CPU: it is now only slightly slower than avsresize (which uses the very best optimizations on Xeon; setting the generic cpu_opt="avx512f" makes its performance significantly worse, so its auto-selection picks something better) and somewhat better than fmtconv.
First I have to rework them a bit for the edge-case boundary conditions. I've done it for the transpose-based ones, but it took time to figure out how, since this time the filter-size limit checks and the x-loop offset loading conditions had to be separated. And I still haven't done the safe and generic AVX512 support and compiler parameterization in the CMake environment. Now I'm doing the AVX2 additions.
I'm still checking. 1.) Just a note: the asserts work just the opposite way. Still I don't understand yet how it should work; I'll continue testing it... EDIT: maybe it's because there is a simple 8-pixel load
Some more notes and ideas: it looks like we need a more complex selection of the processing function in resample.cpp GetResampler(), based on both the kernel size and the maximum required load offset of source samples across a sequence of output samples in the resampling program. For now I tried to add a debug assert to the permutex-based functions to flag input programs with unsupported, too-large offsets (also not sure if it is designed correctly).
I really don't want 10 different versions. It's still OK to specialize one for kernel sizes < 4, one for 4-8, and one for the larger ones, but adding extra versions for scaling factors less than x or larger than y is, I think, a no-go. After evaluating them we'll choose the ones with 'good enough' generic performance.
Yes - this function version may be mostly limited to certain upsampling ratios (and some downsampling). Hand-made memory gathering of the kernel-sized sequence of input samples required for each output sample, like https://github.com/DTL2020/AviSynthPlus/blob/489c19aefee83cb426ed6100755775a7a2f2a2cb/avs_core/filters/intel/resample_avx2.cpp#L1141, followed by any transposing method, looks more (or completely) universal, so I'll at least keep both versions of the functions in the sources. But in AVX2, after we load many H-positioned sample sequences (with more or less equal sequences depending on the resize ratio), the only usable transpose method is shuffle-based with immediates, because permute in AVX2 cannot gather data from many input sources in a single instruction, and in AVX512 it is limited to 2 sources only.
The most universal way, gathering each required input sequence, is resize_h_planar_float_avx_transpose_vstripe_ks4. For upsampling resize programs, most of this data gathering is redundant and greatly limits performance; it can be replaced with a single SIMD word load into the register file, collecting the required data for the V-fma with a single permute instruction per row. So we have at least 2 versions for H-resize: for H-downsampling, the more universal but slower version with a separately addressed load per sequence followed by a transpose; and for H-upsampling (for some ratios? the larger the ratio the better?), the faster version with a small source load and a permutation within the register file. For example, for subsample-precision processing we need a sequence of upsampling -> process -> downsampling, so the H-upsample and H-downsample steps can use 2 different resizers for better total performance. We also have the no-resize filtering or shifting use case; we need to test which version is best for it (or design a special 3rd version for convolution-only, no-resize processing).
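For reference, a scalar sketch of the ks=4 H-resize inner loop that both loading strategies vectorize; the SIMD versions differ only in how the 4 input samples per output reach the registers (per-sequence loads plus transpose vs. one wide load plus permute). Field names are illustrative, not the actual resampling_program layout.

```cpp
#include <vector>

// Illustrative stand-in for the resampling program fields used here.
struct HProgramSketch {
  std::vector<int>   pixel_offset;  // first input sample per output x
  std::vector<float> coeff;         // 4 coefficients per output x
};

// Scalar reference of the float H-resize with kernel_size = 4.
static void resize_h_scalar_ks4(const float* src, float* dst,
                                int dst_width, const HProgramSketch& p) {
  for (int x = 0; x < dst_width; ++x) {
    const int    begin = p.pixel_offset[x];
    const float* c     = &p.coeff[x * 4];
    float acc = 0.0f;
    for (int k = 0; k < 4; ++k)   // the V-fma the SIMD code performs
      acc += src[begin + k] * c[k];
    dst[x] = acc;
  }
}
```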
One possible idea for auto-selection of the H-resize processing function: make a class (?) with a member CheckProgram(pointer to resampling program) returning 'supported or not', and in GetResamplerH sequentially ask the available H-resizers, from best to worst performance, whether the current resampling program is supported (the distance between the first and the last input sample to read in the implemented H-to-V transposition is not more than the supported one, and the kernel size is supported).
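A minimal sketch of that best-to-worst selection idea, with hypothetical types (the real GetResamplerH signature and program struct differ):

```cpp
#include <string>
#include <vector>
#include <functional>

// Illustrative summary of what CheckProgram would inspect.
struct ProgramInfo { int kernel_size; int max_input_span; };

struct ResamplerCandidate {
  const char* name;
  std::function<bool(const ProgramInfo&)> check_program; // 'supported?'
};

// Ask candidates ordered fastest-first; fall back to an always-safe one.
static std::string select_resampler(
    const std::vector<ResamplerCandidate>& fastest_first,
    const ProgramInfo& info) {
  for (const auto& c : fastest_first)
    if (c.check_program(info))
      return c.name;
  return "generic";
}
```

The fastest specialized routines simply get the first chance to claim a program they can handle safely.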
Probably you know something similar; I found it a quite good document (see the PDF). Another note:
Dual-source permute can be emulated with blends (and shifts/rotates/shuffles etc.) on AVX (and SSE), but it is not a universal solution: it only helps when the offset of the source samples for the last output sample in the currently processed set is not very large. For (some/very) large downsample ratios, the offset between input samples may be too large for any number of SIMD registers to cover with sequential reads from memory. I think the only universal solution (for downsampling) is direct addressing of each subset of input samples, using either direct load instructions like load(u)_ps or offset-indexed gather instructions. But gathering of 32-bit floats still works slower than loading 128-bit words and filling a 256- or 512-bit register with inserts; using a gather instruction is even slower than 128-bit loads plus transposition. This is despite the fact that gathering of 32-bit single floats can prepare the transposed register set for the V-fma in a single operation, with no subsequent H-to-V transposition; in my tests it is still slower. The gather instruction is very 'high-level' and causes lots of uops in the CPU's memory subsystem, so its performance may greatly depend on the hardware implementation of the CPU and its memory controller, and it may become better on future hardware. A gather for 32-bit data using 32-bit offsets looks limited to a 4 GB (x4 or x8 scale possible) maximum offset from the base address, which must be enough for any downsample ratio (image rows below 4 GB x4 in size). Next week I will be at work and will try to make and test more universal versions for AVX512 ks8 and ks16 with separate loading of each source sequence from memory. It may be even somewhat faster to load
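A scalar model of why a gather can, in principle, skip the H-to-V transposition step (illustrative names only): gathering src[pixel_offset[j] + k] across 8 outputs j at a fixed tap k directly produces the k-th "vertical" register the V-fma needs, so the data arrives already transposed.

```cpp
// Scalar model of one gather per kernel tap: regs[k] holds the k-th
// input sample of all 8 outputs, i.e. the data is already in the
// transposed layout the V-fma consumes. Names are illustrative.
static void gather_transposed_ks4(const float* src, const int offsets[8],
                                  float regs[4][8]) {
  for (int k = 0; k < 4; ++k)       // one modeled gather per tap
    for (int j = 0; j < 8; ++j)     // 8 lanes of a 256-bit register
      regs[k][j] = src[offsets[j] + k];
}
```

On real hardware each k-loop iteration would be one `_mm256_i32gather_ps`; the comment in the text explains why the uop cost of that instruction currently outweighs the saved transpose.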
…singleton member. Mentioned in #440: env->GetCPUFlags works per environment, not per loaded DLL's flags; make SetMaxCPU set a distinct flag per ScriptEnvironment
Boundary-safe AVX512 is also ready from the good plain ks<=4 series of horizontal + transpose. This time Intel was much better with AVX2 and AVX512 than MSVC. Results updated:
It would be good to test, with the best compilers like LLVM, the performance difference (for upsampling) between small load + permutex and full gathering from memory (versions _transpose_vstripe_ks4 vs _permutex_vstripe_ks4). For my tests with the significant AVX512 performance boost, I use permutex-based versions with a small source load of only 1 (2) load instructions. Also got new ideas to check:
Also I think some 'smart' compilers like LLVM can even understand the design ideas of simple SIMD loops and do a sort of loop unrolling in the H or V direction automatically, so we could see a big difference between the old MSVC compiler and the newer LLVM.
Running the AVX2 intrinsics code with the compiler settings below: 3.7.6 MSVC SSE2: 3677 (// 3.7.6: DTL2020 idea, preloaded up to 4 coeffs with transposing). Running the AVX512 intrinsics code with the compiler settings below (the module itself was built with AVX512, of course). Testing on the i7-11700, a CPU with AVX512, may not give a completely clean AVX2 vs AVX512 comparison. A CPU with AVX512 SIMD may not have separate AVX2 units and may process AVX2 instructions on universal SIMD dispatch ports up to AVX512 wide. This may allow dual-rate execution of AVX2 instructions: AVX512 CPUs (Intel full-width AVX512, and AMD Zen5 with 512-bit AVX512) have a 512-bit datapath. So the Intel compiler may give hints to AVX512-capable CPUs about the possibility of dual-rate execution of AVX2 instructions when no data dependency is present, and current CPU instruction decoders are very smart about processing as much data as possible on the available dispatch units. It would be better to test AVX2 performance on an AVX2-only CPU of a close or next generation, like 12th-gen Intel with AVX2 only, at the same frequency and with the same memory.
resize_h_planar_float_avx512_permutex_vstripe_ks4: fastest for big frame sizes and many threads, with 64 output samples per row per loop spin (smallest number of SDRAM read/write streams?).
I tried to add that commit for the SetMaxCPU patch, 3d7c1a8. It looks like it partially works: it allows setting lower CPU features in the script text, and Info() shows it on script reload (F2), but it cannot restore to the highest level when SetMaxCPU is commented out, until VirtualDub restarts. Strange. I made several test versions of resize_h_planar_float_avx512_permutex_vstripe_ks4() with different processing patterns and workunit sizes and tested with different frame sizes and thread counts; see commit 5a85b22. Though in 1 thread with an input frame size of 320x320 it is not the best performer (compared with others, like 32 samples per row and dual rows). This partially confirms the idea that using too many RAM read/write streams when processing many rows per loop spin can ultimately hurt SDRAM performance with many threads and non-cacheable frame sizes.
The second special case is no-resize, convolution-only processing. It can be covered with the same or a different design: shifting a very long word across several SIMD registers and performing the usual V-fma to get the output samples, because the begin offset of the input samples for each next output sample is always 1.
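A scalar sketch of this shifting idea, assuming ks=4 and an edge-padded source (names are illustrative): because the begin offset advances by exactly 1 per output, the window can be shifted instead of re-gathered.

```cpp
// Scalar model of the no-resize (convolution-only) case: a window of
// ks samples is kept live and shifted by one sample per output, which
// is what the SIMD version does across several registers.
static void convolve_shift_ks4(const float* src, float* dst, int n,
                               const float c[4]) {
  // caller guarantees src has n + 3 valid samples (padded edge)
  float w[4] = { src[0], src[1], src[2], src[3] };
  for (int x = 0; x < n; ++x) {
    dst[x] = w[0]*c[0] + w[1]*c[1] + w[2]*c[2] + w[3]*c[3];
    if (x + 1 < n) {             // shift the window left by one sample
      w[0] = w[1]; w[1] = w[2]; w[2] = w[3];
      w[3] = src[x + 4];
    }
  }
}
```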
Yep, my Rocket Lake architecture is just an appetizer for AVX-512. It's great for validating code, but by far not ideal for benchmarking due to limited throughput. The comparison below was generated by ChatGPT, I asked for those performance aspects. I haven’t verified every detail, but it gives a good overview of when and why it might be worth investing time in hand-crafted, fine-tuned SIMD development.
The most end-user-friendly AVX-512 right now is AMD, with Zen4 CPUs having partial-speed AVX-512 and Zen5 with better speed. Zen5's L1D cache performance looks doubled relative to Zen4 and appears to be full 512-bit now. We could ask users with Zen4 and Zen5 AMD CPUs at the doom9 forum to run performance tests too.
added a new universal function for AVX2 float ks4 processing, using auto-selection between gathering by all address offsets or a small load plus permuting
new universal function for AVX2 float ks4 processing with auto-selection between 2 source-loading methods
universal processing ks4 H-resize (called from resample.cpp)
resize_h_planar_float_avx512_gather_permutex_vstripe_ks4(): universal function with auto-selection, loading up to 32 sequential source floats for 16 output float samples. Not yet well debugged. Also, the workunit size for the permutex transpose looks too small for AVX512 and needs adjusting to 2x or 4x size (in the H, or H and V, directions; needs more performance tuning).
The new ones are gather+permutex? Checking... Bravo :) Significant AVX2 and very significant AVX512 improvement; algorithms do count.
The new gather+permutex functions are examples of universal H-scaling functions supporting any upscaling and any downscaling ratio, with internal selection of the best-performing processing method:
The selection function measures the max width (length) of input samples (per loop spin) to transpose in the resampling program: does it fit in a single AVX2 256-bit register (8 floats), or in dual AVX512 512-bit registers (32 floats)? If yes, the small sequential load + permutex method is selected, transposing inside the register file. AVX512 supports a dual-register transpose instruction, so it can cover all upscale and no-resize ratios, and some downscale ratios (for ks<=16, maybe), with the better-performing method. AVX2 can only transpose with a single-source permute instruction, so it only supports upscale ratios from about 2 up to infinity with the better-performing method. The AVX512 path is still very little tested; I do not have good access to develop and debug on an AVX512 host, only testing of built executables and occasional remote debugging. The current AVX512 version is only a performance-test demo: it can over-read past the end of the buffer (via the dual 512-bit load), so it needs fixes, such as processing internal frame areas with the current design and processing the end of the last row (or the end of each row?) with a read-safe method. Also, for better performance, it is better to separate the general upsampling programs fitting a single 512-bit load from the no-resize and some downsampling programs requiring a dual 512-bit load. That may improve performance somewhat, but it needs 2 more processing-function selections (one with a single 512-bit word load and permute, and one with a dual 512-bit load and permutex from 2x512-bit words).
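A scalar sketch of the span measurement that could drive this selection (hypothetical names; the real resampling_program layout differs): for each group of consecutive outputs processed per loop spin, the span runs from the first input of the first output to the last input of the last output.

```cpp
#include <vector>
#include <algorithm>

// Worst-case input span over all groups of `group` consecutive outputs.
// If this fits the register budget, the small-load + permutex path is
// usable for the whole program. Names are illustrative.
static int max_input_span(const std::vector<int>& pixel_offset,
                          int kernel_size, int group) {
  int worst = 0;
  const int n = (int)pixel_offset.size();
  for (int x = 0; x + group <= n; x += group) {
    const int span =
        pixel_offset[x + group - 1] + kernel_size - pixel_offset[x];
    worst = std::max(worst, span);
  }
  return worst;
}
```

For instance, the permutex path would be usable when the result fits 8 floats on AVX2, or 32 floats with the dual-register AVX512 transpose.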
For 32-bit floats it is covered in AVX512F, fortunately. Once you say it's final, I can add the safe-end measures where you put the comment.
The AVX512 (small load + permute) version is still not tuned for workunit size: it uses the smallest possible 16-float processing per loop spin. With an AVX512 register file of at least 32 architectural (and roughly 100..300 physical, in real hardware) 512-bit entries, about 32 or 64 floats per loop spin should give somewhat better performance (at least longer streaming transfers, fewer bus direction switches, and rarer core-to-bus accesses). So the next stage is to check performance with 32- and 64-float processing with different 1D/2D patterns. For the most SDRAM-friendly streaming under massive multithreaded access, the 2x2 or 4x1 patterns are expected to be somewhat better: they create the lowest possible number of memory read/write streams with a large stride offset. But 1 row of 64 floats may require the most effort to handle the end-of-buffer over-read. The same partially applies to the AVX2 part: it needs to be tested with 16-float (2x8) per-loop-spin processing.
" Once you say it's final, "
It is currently smallest and fastest known (in checked designs) elementary
building block for H-scale (supporting some limited downscale and upscale
up to infinity ratio with AVX512). It is unlikely possible to make it
smaller.
For medium and higher upsample ratios we can use load of single 512bit
register and transpose from it but the throughput of single and dual
sources is about equal and good AVX512 capable chip can do single and dual
load wit no great performance difference (though second load will be
completely redundant and some clever AI-powered firmware of CPU may
understand it from the next following permute control word and skip load).
I do not have good small ideas how to better handle end of row over-read
with both single or dual registers load. The only current ideas is to limit
the x-processing loop to some value where dual loads do not cross the end
of source row boundary and process the final samples in simple way (may be
duplicate current safe direct-addressing method as we have in the first
part of function (for downsample).
Next time for performance tuning tests with different workunit size (based
on this elementary building block) I may have at the next week only. So if
you can provide some solutions till next week it may be good help. Also I
do not test how it correctly processes many scale ratios around 1.0 (may be
from 0.5 to 2.0 range). The permute control word for single and dual-source
transposition looks completely equal in setup (
https://github.com/DTL2020/AviSynthPlus/blob/3303ba400a71a891af251b450538d39dac04870d/avs_core/filters/intel/resample_avx512.cpp#L395
and next lines) but as I undersland the first or second register (a or b)
selection is controlled by the i+4 bit of:
https://www.laruence.com/sse/#text=_mm512_permutex2var_ps&expand=4262,4250,4238,4226,4286,4286
FOR j := 0 to 15 i := j*32 off := idx[i+3:i]*32 dst[i+31:i] := idx[i+4] ?
b[off+31:off] : a[off+31:off] ENDFOR
and it is auto-set if difference offset >15 (program->pixel_offset[x + 15]
- iStart) - where bit i+4 set to 1 selects a second register. Where 1111b
is 15d and any value above 15d sets bit i+4 to 1. Hope it is correct. But
good to check.
Also I make copy from your version the variable offset for second and other
transposition control offsets (
https://github.com/DTL2020/AviSynthPlus/blob/3303ba400a71a891af251b450538d39dac04870d/avs_core/filters/intel/resample_avx512.cpp#L415)
but initially I think offsets are always 1 (as we have in fixed TRANSPOSE
macros ?). So I not sure if this variable offsets are required (may be I
wrong because do not see the use case for this in debugger).
So we can expect the core HtoV transpose engine elementary building block
is final enough in design. But its control values setup may need to be
checked and edge cases (source buffer overreads at end of row) need to be
fixed.
Also a new resampling-program analysis (helper) function: https://github.com/DTL2020/AviSynthPlus/blob/3303ba400a71a891af251b450538d39dac04870d/avs_core/filters/intel/resample_avx512.cpp#L270
example of using a single temp buffer for 3 or 4 planes in 2-pass H+V resizing, for lower memory usage and better cache reuse (for not-very-large frame/plane sizes). Still not very nice, but easy to test control with force=3.
V-resizers selection
FilteredResize_2p::GetFrame() using memory from the general AVS+ video frame cache. Only an example, because currently no analysis of the downstream frame-buffer request is implemented that would request a buffer of the same size/type, to maximize the probability of presenting the same buffer for writing to the downstream filter.
FilteredResize_2p::GetFrame() used a temp buffer from the main VFB memory cache. It really does, at least sometimes, return a released buffer as the NewVideoFrame destination for the downstream filter, as expected. But its probability is subject to investigation and improvement (best request size? directly asking for the request size via a filtergraph-node scan to the data-sink filter?). Performance test with the script

BlankClip(1000000, 320, 320, pixel_type="YUV444PS")
mul=2
LanczosResize(width*mul, height*mul, taps=2, force=3)
ConverttoRGB24()
Prefetch(6)

on an i5-9600K is about 738 fps with env->Allocate/Free and 804 fps with env->NewVideoFrame().
V float AVX2 and AVX512 resamplers, and also a dual-width (32 samples per loop spin) AVX512 H-resampler.
stream (uncached) in new AVX2 and AVX512 float resizers.
8bit format.
resampler.
For Bilinear, Bicubic, sinc-based up to taps=2, and maybe other resizers.
pre-AVX performance about +100% at i5-9600K, and AVX performance about +30%, relative to commit 4302.
Test script