Accelerating copy_if using SIMD

Introduction

I have a Zen 4 CPU with a bunch of AVX512 feature flags. So I thought: let’s try and use it to implement something, even if it’s in the realm of wheel-reinvention. I started with the following goals.

1. Implement an algorithm that cannot be vectorized by my optimizing compiler, even with a polyhedral loop model.
2. Systematically analyze its performance and answer the questions: Is it as fast as it can be? If not, why? And how can we fix it?
3. Start simple, make it work.

This means that dead-simple algorithms like map/transform, reduce, adjacent_difference etc. are out, as they are very autovectorizable. Even 2D stencils are out because look at this. So, I settled on std::copy_if. ...
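
For reference, here is a minimal scalar sketch of what std::copy_if does. The name copy_if_scalar and its pointer-plus-length signature are hypothetical, chosen only for illustration, and are not taken from the post. The thing to notice is that the output index depends on every earlier predicate result, which is one plausible reason this loop is harder to autovectorize than a plain map or reduce.

```cpp
#include <cstddef>
#include <vector>

// Hypothetical scalar baseline (not the post's implementation): copy every
// element of `in` that satisfies `pred` into `out`, preserving order, and
// return the number of elements written.
template <typename T, typename Pred>
std::size_t copy_if_scalar(const T* in, std::size_t n, T* out, Pred pred) {
    std::size_t written = 0;  // next free slot in `out`
    for (std::size_t i = 0; i < n; ++i) {
        if (pred(in[i])) {
            // The write position is only known after all earlier
            // predicate results are known: a loop-carried dependency.
            out[written++] = in[i];
        }
    }
    return written;
}
```

The caller is assumed to provide an output buffer with room for at least n elements, since in the worst case every element passes the predicate.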

May 25, 2026