AVX and SSE transpose-based float resizers with ks<=4 #440
base: master
Conversation
transpose-based SIMD H-resize function resize_h_planar_float_sse_transpose()
H-resize
Transpose-based H-resizers. Ready for testing.
resizers for float32 up to kernel_size 4.
based resizers for kernel_size up to 4.
I like this completely different approach. I have read your questions (mod4, over-read, etc.) in the comments and will look into whether they are safe, and when. The 'real' kernel size is never mod4 or mod8; the kernel size is aligned and padded on the coefficient side. There is a safety x limit, however, precomputed and stored in the resampling_program struct, which has to be considered. This is the "danger zone" from where the current line position, indexed with "begin" offsets, is not safe to read into SIMD: e.g. if we read 32 bytes from there, it should still remain within the aligned_stride. If this works, it may be quicker for integer samples as well: convert to float, do the work, convert back. Like I did in Overlay masked blend, where the integer arithmetic tricks, conversions, and shifts were slower and much more complex than doing everything in float internally. Not to mention that the latter resulted in cleaner, easier-to-understand, and more maintainable code.
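As a rough illustration of that precomputed safety limit (hypothetical names; the real resampling_program fields differ), the boundary can be derived like this:

```cpp
#include <cstddef>

// Hypothetical sketch: find the last output x from which a SIMD load of
// simd_floats starting at src + pixel_offset[x] still stays inside the
// aligned source stride. Names are illustrative, not the real
// resampling_program layout.
struct ResamplingProgramSketch {
  const int* pixel_offset;   // begin offset of input samples per output x
  int target_size;           // number of output samples per row
};

static int compute_safe_limit(const ResamplingProgramSketch& p,
                              size_t aligned_stride_floats,
                              size_t simd_floats) {
  int x = p.target_size;
  // Walk back until a simd_floats-wide load from pixel_offset[x-1] fits.
  while (x > 0 &&
         p.pixel_offset[x - 1] + (int)simd_floats > (int)aligned_stride_floats)
    --x;
  return x; // outputs [0, x) are safe for full-width SIMD loads
}
```

Outputs past the returned limit would then fall back to a scalar or masked tail loop.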
(I added an .editorconfig to the project; I was surprised there wasn't one, as I use it in my other repos. Your commits contain mixed tabs and spaces for indentation with varying indent levels. From now on these settings are governed by this file, and the VS editor will pick them up automatically.)
V-resize and H-resize for kernel_size <=4
Tried to add 2 versions of AVX512 functions for float V-resize and H-resize for kernel_size<=4; 2 more commits, up to DTL2020@5bd7a28. But I got very strange Prefetch-related performance: adding any Prefetch(N), even N=1, makes performance of the AVX512 version start to drop, and with Prefetch(8) it becomes about 1/2 of the AVX2 resizers. Very strange. Even adding a Prefetch(1) line at the end of the script already lowers AVX512 performance. Maybe something very memory-unfriendly happens with the current Prefetch implementation in the AVS+ core, and it somehow significantly damages performance of the AVX512 parts of the program. Or maybe the VS2019 compiler also creates a not-very-good executable; I cannot test other compilers yet. Tested with AVSMeter64.
Or is there too much overhead from state changes or register save/restore? Or too many vzeroupper instructions, or the opposite: missing ones. Meanwhile I have cleaned up and made memory-access safe your H SSE and AVX resizers for small kernel sizes; I'll show you this weekend when I can spend more time on it. Anyway, we have to check AVX512 capability and support in CMakeLists.txt, covering all used compilers: GCC, Clang/LLVM, and MSVC. AVX512 got new options in VS2022 for preferring 256- or 512-bit registers (and also for AVX10). Any reason not to move to VS2022? Soon we'll have VS2025.
float H-resize for AVX2 ks4, ks8. AVX512 ks4, ks8, ks16 (selectors here; some AVX512 functions are not fully debugged)
float H-resizers: AVX2 ks4 and ks8; AVX512 ks4, ks8, ks16 (8 and 16 are for performance testing only, not fully debugged)
I will try some day to download and set up VS2022 on my development host at work, but it typically requires lots of downloads and a long HDD setup, and then checking whether everything still works. Update: Tested the AVX512 resizers and they look to be working well, at least with a single test script. On an AVX512 host (Xeon Gold chip) in 1-thread mode (no Prefetch in the script) they run about 2.0..2.4 times faster compared with the 3.7.5 release; the AVX2 version is also somewhat faster. Also I found an issue with VirtualDub testing: a 'reloaded' script (without restarting the VirtualDub process) cannot switch MaxCPU limits; only a full restart of the VirtualDub process applies a new SetMaxCPU setting. If this is a feature, it would be good to document it somewhere. Info() only updates after the VirtualDub process is restarted.
For tests (and for the future clean integration) I'm directly cloning your repo for myself; it's easier to copy and apply a diff of your actual code against my files. Well, my figures of course contain only your transposed ks>=4 versions, not the just-committed ones. Benchmarks:
Yes, once you set (limit) the CPU flags, it stays there until the DLL restarts (it's a singleton, filled only once on creation).
I tried to make a release for testing: https://github.com/DTL2020/AviSynthPlus/releases/tag/post-3.7.5-r.4312 But I understand it was built only with the not-best MSVC2019 compiler. Can you make a build with LLVM, or whatever is best in 2025, to compare? I tested it vs avsresize and fmtconv on an AVX512 CPU: it is now only slightly slower than avsresize (which uses the very best optimizations on Xeon; setting the generic cpu_opt="avx512f" makes its performance significantly worse, so its auto-selection picks something better) and somewhat better than fmtconv.
First I have to rework them a bit for the edge-case boundary conditions. I've done it for the transpose-based ones, but it took time to figure out how, since this time the filter-size limit checks and the x-loop offset loading conditions had to be separated. And I still haven't done the safe and generic AVX512 support and compiler parameterization in the CMake environment. Now I'm doing the AVX2 additions.
I'm still checking. 1.) Just a note: the asserts work just the opposite way. Still I don't understand yet how it should work; I'll continue testing it... EDIT: maybe it's because there is a simple 8-pixel load
Some more notes and ideas: it looks like we need a more complex selection of the processing function in resample.cpp GetResampler(), based on both the kernel size and the maximum required load offset of source samples across a sequence of output samples in the resampling program. For now I tried to add a debug assert to the permutex-based functions to flag input programs with unsupported, too-large offsets (also not sure if it is designed correctly).
I really don't want 10 different versions. It's still OK to specialize one for kernel sizes < 4, one for 4-8, and one for the larger ones, but adding extra versions for scaling factors less than x or larger than y is, I think, a no-go. After evaluating them we'll choose the ones with 'good enough' generic performance.
Yes - this function version may be mostly limited to certain upsampling ratios (and some downsampling). Hand-made memory gathering of the kernel-sized sequence of input samples required for each output sample, like https://github.com/DTL2020/AviSynthPlus/blob/489c19aefee83cb426ed6100755775a7a2f2a2cb/avs_core/filters/intel/resample_avx2.cpp#L1141, followed by any transposing method, looks more (or completely) universal, so I'll at least keep both versions of the functions in the sources. But in AVX2, after we load many H-positioned sample sequences (with more or less equal sequences depending on the resize ratio), the only usable transpose method is shuffle-based with immediates, because permute in AVX2 cannot gather data from many input sources in a single instruction, and in AVX512 it is limited to 2 sources only.
The most universal way, gathering each required input sequence, is resize_h_planar_float_avx_transpose_vstripe_ks4. For upsampling resize programs, most of this data gathering is redundant and greatly limits performance; it can be replaced with a single SIMD word load into the register file, collecting the required data for the V-fma with a single permute instruction per row. So we have at least 2 versions for H-resize: for H-downsampling, the more universal but slower version with a separately addressed load per sequence followed by a transpose; and for H-upsampling (for some ratios? the larger the ratio the better?), the faster version with a small source load and a permutation within the register file. For example, for subsample-precision processing we need a sequence of upsampling -> process -> downsampling, so the H-upsample and H-downsample steps can use 2 different resizers for better total performance. We also have the no-resize filtering or shifting use case; we need to test which version is best for it (or design a special 3rd version for convolution-only, no-resize processing).
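For reference, a scalar sketch of the ks=4 H-resize inner loop that both loading strategies vectorize; the SIMD versions differ only in how the 4 input samples per output reach the registers (per-sequence loads plus transpose vs. one wide load plus permute). Field names are illustrative, not the actual resampling_program layout.

```cpp
#include <vector>

// Illustrative stand-in for the resampling program fields used here.
struct HProgramSketch {
  std::vector<int>   pixel_offset;  // first input sample per output x
  std::vector<float> coeff;         // 4 coefficients per output x
};

// Scalar reference of the float H-resize with kernel_size = 4.
static void resize_h_scalar_ks4(const float* src, float* dst,
                                int dst_width, const HProgramSketch& p) {
  for (int x = 0; x < dst_width; ++x) {
    const int    begin = p.pixel_offset[x];
    const float* c     = &p.coeff[x * 4];
    float acc = 0.0f;
    for (int k = 0; k < 4; ++k)   // the V-fma the SIMD code performs
      acc += src[begin + k] * c[k];
    dst[x] = acc;
  }
}
```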
One possible idea for auto-selection of the H-resize processing function: make a class (?) with a member CheckProgram(pointer to resampling program) returning 'supported or not', and in GetResamplerH sequentially ask the available H-resizers, from best to worst performance, whether the current resampling program is supported (the distance between the first and the last input sample to read in the implemented H-to-V transposition is not more than the supported one, and the kernel size is supported).
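A minimal sketch of that best-to-worst selection idea, with hypothetical types (the real GetResamplerH signature and program struct differ):

```cpp
#include <string>
#include <vector>
#include <functional>

// Illustrative summary of what CheckProgram would inspect.
struct ProgramInfo { int kernel_size; int max_input_span; };

struct ResamplerCandidate {
  const char* name;
  std::function<bool(const ProgramInfo&)> check_program; // 'supported?'
};

// Ask candidates ordered fastest-first; fall back to an always-safe one.
static std::string select_resampler(
    const std::vector<ResamplerCandidate>& fastest_first,
    const ProgramInfo& info) {
  for (const auto& c : fastest_first)
    if (c.check_program(info))
      return c.name;
  return "generic";
}
```

The fastest specialized routines simply get the first chance to claim a program they can handle safely.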
Probably you know something similar; I found it a quite good document (see the PDF). Another note:
Dual-source permute can be emulated with blends (and shifts/rotates/shuffles etc.) on AVX (and SSE), but it is not a universal solution: it only helps when the offset of the source samples for the last output sample in the currently processed set is not very large. For (some/very) large downsample ratios, the offset between input samples may be too large for any number of SIMD registers to cover with sequential reads from memory. I think the only universal solution (for downsampling) is direct addressing of each subset of input samples, using either direct load instructions like load(u)_ps or offset-indexed gather instructions. But gathering of 32-bit floats still works slower than loading 128-bit words and filling a 256- or 512-bit register with inserts; using a gather instruction is even slower than 128-bit loads plus transposition. This is despite the fact that gathering of 32-bit single floats can prepare the transposed register set for the V-fma in a single operation, with no subsequent H-to-V transposition; in my tests it is still slower. The gather instruction is very 'high-level' and causes lots of uops in the CPU's memory subsystem, so its performance may greatly depend on the hardware implementation of the CPU and its memory controller, and it may become better on future hardware. A gather for 32-bit data using 32-bit offsets looks limited to a 4 GB (x4 or x8 scale possible) maximum offset from the base address, which must be enough for any downsample ratio (image rows below 4 GB x4 in size). Next week I will be at work and will try to make and test more universal versions for AVX512 ks8 and ks16 with separate loading of each source sequence from memory. It may be even somewhat faster to load
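A scalar model of why a gather can, in principle, skip the H-to-V transposition step (illustrative names only): gathering src[pixel_offset[j] + k] across 8 outputs j at a fixed tap k directly produces the k-th "vertical" register the V-fma needs, so the data arrives already transposed.

```cpp
// Scalar model of one gather per kernel tap: regs[k] holds the k-th
// input sample of all 8 outputs, i.e. the data is already in the
// transposed layout the V-fma consumes. Names are illustrative.
static void gather_transposed_ks4(const float* src, const int offsets[8],
                                  float regs[4][8]) {
  for (int k = 0; k < 4; ++k)       // one modeled gather per tap
    for (int j = 0; j < 8; ++j)     // 8 lanes of a 256-bit register
      regs[k][j] = src[offsets[j] + k];
}
```

On real hardware each k-loop iteration would be one `_mm256_i32gather_ps`; the comment in the text explains why the uop cost of that instruction currently outweighs the saved transpose.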
…singleton member. Mentioned in #440: env->GetCPUFlags works per environment, not per loaded DLL's flags; make SetMaxCPU set a distinct flag per ScriptEnvironment
Boundary-safe AVX512 is also ready from the good plain ks<=4 series of horizontal + transpose. This time Intel was much better with AVX2 and AVX512 than MSVC. Results updated:
It would be good to test, with the best compilers like LLVM, the performance difference (for upsampling) between small load + permutex and full gathering from memory (versions _transpose_vstripe_ks4 vs _permutex_vstripe_ks4). For my tests with the significant AVX512 performance boost, I use permutex-based versions with a small source load of only 1 (2) load instructions. Also got new ideas to check:
Also I think some 'smart' compilers like LLVM can even understand the design ideas of simple SIMD loops and do a sort of loop unrolling in the H or V direction automatically, so we could see a big difference between the old MSVC compiler and the newer LLVM.
Running the AVX2 intrinsics code with the compiler settings below: 3.7.6 MSVC SSE2: 3677 (// 3.7.6: DTL2020 idea, preloaded up to 4 coeffs with transposing). Running the AVX512 intrinsics code with the compiler settings below (the module itself was built with AVX512, of course). Testing on the i7-11700, a CPU with AVX512, may not give a completely clean AVX2 vs AVX512 comparison. A CPU with AVX512 SIMD may not have separate AVX2 units and may process AVX2 instructions on universal SIMD dispatch ports up to AVX512 wide. This may allow dual-rate execution of AVX2 instructions: AVX512 CPUs (Intel full-width AVX512, and AMD Zen5 with 512-bit AVX512) have a 512-bit datapath. So the Intel compiler may give hints to AVX512-capable CPUs about the possibility of dual-rate execution of AVX2 instructions when no data dependency is present, and current CPU instruction decoders are very smart about processing as much data as possible on the available dispatch units. It would be better to test AVX2 performance on an AVX2-only CPU of a close or next generation, like 12th-gen Intel with AVX2 only, at the same frequency and with the same memory.
resize_h_planar_float_avx512_permutex_vstripe_ks4: fastest for big frame sizes and many threads, with 64 output samples per row per loop spin (smallest number of SDRAM read/write streams?).
I tried to add that commit for the SetMaxCPU patch, 3d7c1a8. It looks like it partially works: it allows setting lower CPU features in the script text, and Info() shows it on script reload (F2), but it cannot restore to the highest level when SetMaxCPU is commented out, until VirtualDub restarts. Strange. I made several test versions of resize_h_planar_float_avx512_permutex_vstripe_ks4() with different processing patterns and workunit sizes and tested with different frame sizes and thread counts; see commit 5a85b22. Though in 1 thread with an input frame size of 320x320 it is not the best performer (compared with others, like 32 samples per row and dual rows). This partially confirms the idea that using too many RAM read/write streams when processing many rows per loop spin can ultimately hurt SDRAM performance with many threads and non-cacheable frame sizes.
The second special case is no-resize, convolution-only processing. It can be covered with the same or a different design: shifting a very long word across several SIMD registers and performing the usual V-fma to get the output samples, because the begin offset of the input samples for each next output sample is always 1.
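A scalar sketch of this shifting idea, assuming ks=4 and an edge-padded source (names are illustrative): because the begin offset advances by exactly 1 per output, the window can be shifted instead of re-gathered.

```cpp
// Scalar model of the no-resize (convolution-only) case: a window of
// ks samples is kept live and shifted by one sample per output, which
// is what the SIMD version does across several registers.
static void convolve_shift_ks4(const float* src, float* dst, int n,
                               const float c[4]) {
  // caller guarantees src has n + 3 valid samples (padded edge)
  float w[4] = { src[0], src[1], src[2], src[3] };
  for (int x = 0; x < n; ++x) {
    dst[x] = w[0]*c[0] + w[1]*c[1] + w[2]*c[2] + w[3]*c[3];
    if (x + 1 < n) {             // shift the window left by one sample
      w[0] = w[1]; w[1] = w[2]; w[2] = w[3];
      w[3] = src[x + 4];
    }
  }
}
```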
Yep, my Rocket Lake architecture is just an appetizer for AVX-512. It's great for validating code, but by far not ideal for benchmarking due to limited throughput. The comparison below was generated by ChatGPT, I asked for those performance aspects. I haven’t verified every detail, but it gives a good overview of when and why it might be worth investing time in hand-crafted, fine-tuned SIMD development.
The most end-user-friendly AVX-512 right now is AMD, with Zen4 CPUs having partial-speed AVX-512 and Zen5 with better speed. Zen5's L1D cache performance looks doubled relative to Zen4 and appears to be full 512-bit now. We could ask users with Zen4 and Zen5 AMD CPUs at the doom9 forum to run performance tests too.
added a new universal function for AVX2 float ks4 processing, using auto-selection between gathering by all address offsets or a small load plus permuting
new universal function for AVX2 float ks4 processing with auto-selection between 2 source-loading methods
universal processing ks4 H-resize (called from resample.cpp)
resize_h_planar_float_avx512_gather_permutex_vstripe_ks4(): universal function with auto-selection, loading up to 32 sequential source floats for 16 output float samples. Not yet well debugged. Also, the workunit size for the permutex transpose looks too small for AVX512 and needs adjusting to 2x or 4x size (in the H, or H and V, directions; needs more performance tuning).
The new ones are gather+permutex? Checking... Bravo :) Significant AVX2 and very significant AVX512 improvement; algorithms do count.
The new gather+permutex functions are examples of universal H-scaling functions supporting any upscaling and any downscaling ratio, with internal selection of the best-performing processing method:
The selection function measures the max width (length) of input samples (per loop spin) to transpose in the resampling program: does it fit in a single AVX2 256-bit register (8 floats), or in dual AVX512 512-bit registers (32 floats)? If yes, the small sequential load + permutex method is selected, transposing inside the register file. AVX512 supports a dual-register transpose instruction, so it can cover all upscale and no-resize ratios, and some downscale ratios (for ks<=16, maybe), with the better-performing method. AVX2 can only transpose with a single-source permute instruction, so it only supports upscale ratios from about 2 up to infinity with the better-performing method. The AVX512 path is still very little tested; I do not have good access to develop and debug on an AVX512 host, only testing of built executables and occasional remote debugging. The current AVX512 version is only a performance-test demo: it can over-read past the end of the buffer (via the dual 512-bit load), so it needs fixes, such as processing internal frame areas with the current design and processing the end of the last row (or the end of each row?) with a read-safe method. Also, for better performance, it is better to separate the general upsampling programs fitting a single 512-bit load from the no-resize and some downsampling programs requiring a dual 512-bit load. That may improve performance somewhat, but it needs 2 more processing-function selections (one with a single 512-bit word load and permute, and one with a dual 512-bit load and permutex from 2x512-bit words).
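A scalar sketch of the span measurement that could drive this selection (hypothetical names; the real resampling_program layout differs): for each group of consecutive outputs processed per loop spin, the span runs from the first input of the first output to the last input of the last output.

```cpp
#include <vector>
#include <algorithm>

// Worst-case input span over all groups of `group` consecutive outputs.
// If this fits the register budget, the small-load + permutex path is
// usable for the whole program. Names are illustrative.
static int max_input_span(const std::vector<int>& pixel_offset,
                          int kernel_size, int group) {
  int worst = 0;
  const int n = (int)pixel_offset.size();
  for (int x = 0; x + group <= n; x += group) {
    const int span =
        pixel_offset[x + group - 1] + kernel_size - pixel_offset[x];
    worst = std::max(worst, span);
  }
  return worst;
}
```

For instance, the permutex path would be usable when the result fits 8 floats on AVX2, or 32 floats with the dual-register AVX512 transpose.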
For 32-bit floats it is covered in AVX512F, fortunately. Once you say it's final, I can add the safe-end measures where you put the comment.
The AVX512 (small load + permute) version is still not tuned for workunit size: it uses the smallest possible 16-float processing per loop spin. With an AVX512 register file of at least 32 architectural (and roughly 100..300 physical, in real hardware) 512-bit entries, about 32 or 64 floats per loop spin should give somewhat better performance (at least longer streaming transfers, fewer bus direction switches, and rarer core-to-bus accesses). So the next stage is to check performance with 32- and 64-float processing with different 1D/2D patterns. For the most SDRAM-friendly streaming under massive multithreaded access, the 2x2 or 4x1 patterns are expected to be somewhat better: they create the lowest possible number of memory read/write streams with a large stride offset. But 1 row of 64 floats may require the most effort to handle the end-of-buffer over-read. The same partially applies to the AVX2 part: it needs to be tested with 16-float (2x8) per-loop-spin processing.
" Once you say it's final, "
It is currently smallest and fastest known (in checked designs) elementary
building block for H-scale (supporting some limited downscale and upscale
up to infinity ratio with AVX512). It is unlikely possible to make it
smaller.
For medium and higher upsample ratios we can use load of single 512bit
register and transpose from it but the throughput of single and dual
sources is about equal and good AVX512 capable chip can do single and dual
load wit no great performance difference (though second load will be
completely redundant and some clever AI-powered firmware of CPU may
understand it from the next following permute control word and skip load).
I do not have good small ideas how to better handle end of row over-read
with both single or dual registers load. The only current ideas is to limit
the x-processing loop to some value where dual loads do not cross the end
of source row boundary and process the final samples in simple way (may be
duplicate current safe direct-addressing method as we have in the first
part of function (for downsample).
Next time for performance tuning tests with different workunit size (based
on this elementary building block) I may have at the next week only. So if
you can provide some solutions till next week it may be good help. Also I
do not test how it correctly processes many scale ratios around 1.0 (may be
from 0.5 to 2.0 range). The permute control word for single and dual-source
transposition looks completely equal in setup (
https://github.com/DTL2020/AviSynthPlus/blob/3303ba400a71a891af251b450538d39dac04870d/avs_core/filters/intel/resample_avx512.cpp#L395
and next lines) but as I undersland the first or second register (a or b)
selection is controlled by the i+4 bit of:
https://www.laruence.com/sse/#text=_mm512_permutex2var_ps&expand=4262,4250,4238,4226,4286,4286
FOR j := 0 to 15 i := j*32 off := idx[i+3:i]*32 dst[i+31:i] := idx[i+4] ?
b[off+31:off] : a[off+31:off] ENDFOR
and it is auto-set if difference offset >15 (program->pixel_offset[x + 15]
- iStart) - where bit i+4 set to 1 selects a second register. Where 1111b
is 15d and any value above 15d sets bit i+4 to 1. Hope it is correct. But
good to check.
Also I make copy from your version the variable offset for second and other
transposition control offsets (
https://github.com/DTL2020/AviSynthPlus/blob/3303ba400a71a891af251b450538d39dac04870d/avs_core/filters/intel/resample_avx512.cpp#L415)
but initially I think offsets are always 1 (as we have in fixed TRANSPOSE
macros ?). So I not sure if this variable offsets are required (may be I
wrong because do not see the use case for this in debugger).
So we can expect the core HtoV transpose engine elementary building block
is final enough in design. But its control values setup may need to be
checked and edge cases (source buffer overreads at end of row) need to be
fixed.
Also a new resampling-program analysis (helper) function: https://github.com/DTL2020/AviSynthPlus/blob/3303ba400a71a891af251b450538d39dac04870d/avs_core/filters/intel/resample_avx512.cpp#L270
example of using a single temp buffer for 3 or 4 planes in 2-pass H+V resizing, for lower memory usage and better cache reuse (for not-very-large frame/plane sizes). Still not very nice, but easy to test control with force=3.
V-resizers selection
FilteredResize_2p::GetFrame() using memory from the general AVS+ video frame cache. Only an example, because currently no analysis of the downstream frame-buffer request is implemented that would request a buffer of the same size/type, to maximize the probability of presenting the same buffer for writing to the downstream filter.
FilteredResize_2p::GetFrame() used a temp buffer from the main VFB memory cache. It really does, at least sometimes, return a released buffer as the NewVideoFrame destination for the downstream filter, as expected. But its probability is subject to investigation and improvement (best request size? directly asking for the request size via a filtergraph-node scan to the data-sink filter?). Performance test with the script

BlankClip(1000000, 320, 320, pixel_type="YUV444PS")
mul=2
LanczosResize(width*mul, height*mul, taps=2, force=3)
ConverttoRGB24()
Prefetch(6)

on an i5-9600K is about 738 fps with env->Allocate/Free and 804 fps with env->NewVideoFrame().
V float AVX2 and AVX512 resamplers, and also a dual-width (32 samples per loop spin) AVX512 H-resampler.
stream (uncached) in new AVX2 and AVX512 float resizers.
8bit format.
resampler.
For Bilinear, Bicubic, sinc-based up to taps=2, and maybe other resizers.
pre-AVX performance about +100% at i5-9600K, and AVX performance about +30%, relative to commit 4302.
Test script