<?xml version="1.0" encoding="utf-8" standalone="yes"?><rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:content="http://purl.org/rss/1.0/modules/content/"><channel><title>Chaitanya Kumar's Blog</title><link>https://loonatick-src.github.io/</link><description>Recent content on Chaitanya Kumar's Blog</description><generator>Hugo</generator><language>en-us</language><lastBuildDate>Mon, 25 May 2026 10:03:34 +0200</lastBuildDate><atom:link href="https://loonatick-src.github.io/index.xml" rel="self" type="application/rss+xml"/><item><title>Accelerating copy_if using SIMD</title><link>https://loonatick-src.github.io/posts/vectorized-copy-if-analysis/</link><pubDate>Mon, 25 May 2026 10:03:34 +0200</pubDate><guid>https://loonatick-src.github.io/posts/vectorized-copy-if-analysis/</guid><description>&lt;h2 id="introduction"&gt;Introduction&lt;/h2&gt;
&lt;p&gt;I have a Zen 4 CPU with a bunch of AVX512 feature flags. So I thought: let&amp;rsquo;s
try and use them to implement something, even if it&amp;rsquo;s in the realm of
wheel-reinvention. I started with the following goals:&lt;/p&gt;
&lt;ol&gt;
&lt;li&gt;Implement an algorithm that cannot be vectorized by my optimizing compiler,
even with a polyhedral loop model.&lt;/li&gt;
&lt;li&gt;Systematically analyze its performance and answer the questions:
&lt;ol&gt;
&lt;li&gt;Is it as fast as it can be?&lt;/li&gt;
&lt;li&gt;If not, why? And how can we fix it?&lt;/li&gt;
&lt;/ol&gt;
&lt;/li&gt;
&lt;li&gt;Start simple, make it work.&lt;/li&gt;
&lt;/ol&gt;
&lt;p&gt;This means that dead simple algorithms like map/transform, reduce,
adjacent_difference etc are out, as &lt;a href="https://godbolt.org/#g:!((g:!((g:!((h:codeEditor,i:(filename:'1',fontScale:14,fontUsePx:'0',j:1,lang:c%2B%2B,selection:(endColumn:2,endLineNumber:11,positionColumn:2,positionLineNumber:11,selectionStartColumn:2,selectionStartLineNumber:11,startColumn:2,startLineNumber:11),source:'%23include+%3Calgorithm%3E%0A%23include+%3Cnumeric%3E%0A%23include+%3Cvector%3E%0A%0Aint+reduce(std::vector%3Cint%3E+const%26+input)+%7B%0A++++return+std::reduce(input.begin(),+input.end(),+0)%3B%0A%7D%0A%0Avoid+adjacent_difference(std::vector%3Cint%3E+const%26+input,+std::vector%3Cint%3E%26+output)+%7B%0A++++std::adjacent_difference(input.cbegin(),+input.cend(),+output.begin())%3B%0A%7D'),l:'5',n:'0',o:'C%2B%2B+source+%231',t:'0')),k:43.29460179133382,l:'4',n:'0',o:'',s:0,t:'0'),(g:!((h:compiler,i:(compiler:clang2210,filters:(b:'0',binary:'1',binaryObject:'1',commentOnly:'0',debugCalls:'1',demangle:'0',directives:'0',execute:'1',intel:'0',libraryCode:'0',trim:'1',verboseDemangling:'0'),flagsViewOpen:'1',fontScale:14,fontUsePx:'0',j:3,lang:c%2B%2B,libs:!(),options:'-std%3Dc%2B%2B23+-march%3Dznver4+-O3+-Wall+-Wextra',overrides:!(),selection:(endColumn:1,endLineNumber:1,positionColumn:1,positionLineNumber:1,selectionStartColumn:1,selectionStartLineNumber:1,startColumn:1,startLineNumber:1),source:1),l:'5',n:'0',o:'+x86-64+clang+22.1.0+(Editor+%231)',t:'0'),(h:cfg,i:(centerparents:'1',compilerName:'x86-64+clang+22.1.0',editorid:1,j:3,narrowtreelayout:'0',selectedFunction:'foo(std::vector%3Cint,+std::allocator%3Cint%3E%3E+const%26,+std::vector%3Cint,+std::allocator%3Cint%3E%3E%26):',treeid:0),l:'5',n:'0',o:'CFG+x86-64+clang+22.1.0+(Editor+%231,+Compiler+%233)',t:'0'),(h:output,i:(compilerName:'x86-64+clang+22.1.0',editorid:1,fontScale:14,fontUsePx:'0',j:3,wrap:'1'),l:'5',n:'0',o:'Output+of+x86-64+clang+22.1.0+(Compiler+%233)',t:'0')),header:(),k:56.70539820866619,l:'4',n:'0',o:'',s:0,t:'0')),l:'2',n:'0',o:'',t:'0')),version:4"&gt;they are very 
autovectorizable&lt;/a&gt;. Even 2D stencils are out because &lt;a href="https://godbolt.org/#z:OYLghAFBqd5QCxAYwPYBMCmBRdBLAF1QCcAaPECAMzwBtMA7AQwFtMQByARg9KtQYEAysib0QXACx8BBAKoBnTAAUAHpwAMvAFYTStJg1DIApACYAQuYukl9ZATwDKjdAGFUtAK4sGIM6SuADJ4DJgAcj4ARpjE/lykAA6oCoRODB7evv5JKWkCIWGRLDFxZgl2mA7pQgRMxASZPn4BldUCtfUEhRHRsfG2dQ1N2a1D3aG9Jf3lAJS2qF7EyOwc5gDMocjeWADUJutuAG5VRMQH2CYaAIIbWzuY%2B4fICgToWFQXV7c3APS/uzMABFdgBWAC0yVCBF2r0YyDouwA1rEwrRdhAgkxEgYEYZdgCEJgmDDMABHLwk9KwggAT3os2%2B/12iwIJlBFjw7KB7Is2m5TxBoV5XNBPI52nBXAF1l2wo5ovFfOs0rF%2B0scoYIql3N5/LVsvlnJVuolAvBu0kACojYq9bqbkdUHh0DT4XRQRA0AxXrsqLRUCSrbsAPoh4iYV7EPAOMOa0jfXZJ5Mp1Op/2BgjBsMRqMxghx1kJm5p0ul17oEAgVIAL0wIZh4WL1zLreTFartfrMIAsoybiYAOxWEvJ/jEDEd6t4OsNuWC3bS9YWecHNy7cK7C1LlfWax4Wb7YeJ0vjydvTsz7u7bQLnc3p7rntbxcHXeWazaQ9Dkcttssrw2QVXZg2fWV9RBA4eVHf9kyNCA8BfLhD1A9UVwg5NmQYEgCAQE9YMNLUOQQtDFxQ3YwI1DCk2ZBRWTwmD/0IkUQIo0iIFvbdGTVGiAQAd0jNlGLbZjgNQ2UONI5CBV43ZiVefD/wtSQADoNCoVjbVYyj0N1ZcCQBFZBFiRSUyHaC/yTczvnMjh5loThQV4PwOC0UhUE4Nw9w1OilhWdV1h4UgCE0Oz5iREB1lUgBOLgzAADlBSQNHWDRQQ0LgNGi/ROEkZzQvczheAUEANGC0L5jgWAkDQFhEjoWJyEoWr6voOJtkMYAzHKMqaFoAhYhKiAogKqJQnqWlOCCsbmGIWkAHkom0U4pt4Wq2EEeaGFoSbXN4LAoi8YA3DEWgSu4fbMBYTrxD20h8AjaoTnOtzMFUKpANWILoUwBy7toPAomICaPCwAqCGjFhVtIE5iCiFJMCBK6btCUA9vmf0mGABQADU8EwPj5sSRhof4QQRDEdgpBkQRFBUdQ7t0dZ9E6lBvJsAGohKyB5lQRJHAEc7wQrKDTA/SwzHWLdruWPD1iBGsGFhyQt3m9ZeFQWHoywbmIHmNoBb8CBXBGPwEmCSZilKPRklSQ3TZtvJDZ6K3%2BgqX7ThqcYHfd%2BxDc6BoXb6OIKm9zxmj0V4uiD6YQ/1xZlip%2BzHPyu6PI4XZVHigA2cFs5VjqjEBMwVK4NSMVwQgSAC5DeBC9HSAgGqqGAJqvUYAbiESeoO/OoKWoa4hwlYVYs9z/PdkL4Bi9LtTeEwfAzhdPQmdIWbiFQPiIcwX6mFpVloYNheADEvAYdoXLhc%2BPWoAMSUnwWs1IP076f3Yz9SYAwldAMjGfj%2B8BfwXrsX%2BwBZicGChGTAy8NDJw4E5NeBV05uGPgAcUzjnPOBcDBF26rPDQFdF7Vw2AkXYHg6qDwCusWYdcKrzCJEwLAcQ9akAitnUEal4qDikKCaK2doqSGivFaKAQ/p5UQWnIqtg9D1y0LMOBZhU5uXTrQ9G8xYapGcJIIAA%3D%3D%3D" title="2D five-point stencil on Compiler Explorer"&gt;look at this&lt;/a&gt;. So, I
settled on &lt;code&gt;std::copy_if&lt;/code&gt;.&lt;/p&gt;</description></item></channel></rss>