### sslotin's blog

By sslotin, 2 weeks ago, translation,

UPD (2021-10-13): I've just discovered that I published this post for Russian audience only :(

Hi everyone!

I'm writing a book about performance engineering: the art of optimizing algorithms beyond just asymptotic complexity.

The book walks through the main CPU optimization techniques such as caching, SIMD and pipelining while exploring many large case studies where we speed up some classic algorithm or a data structure, closely matching or even improving on the state-of-the-art.

You will probably not learn a single asymptotically faster algorithm there, but it will help you not to get TL when you are supposed to get AC, and sometimes get AC when you are supposed to get TL.

Among the stuff that you are probably most interested in:

I've only recently finished the research stage and I'm now writing it all up. I've already completed ~120 pages, which is ~30% of what I intend to. I plan to finish a new article every 2-3 days and post any relevant updates here on CodeForces and on my revived twitter.

I'm not sure what I'm going to do with the book when I'm done, but it is definitely going to be available online for free.

Happy to hear any comments and suggestions.

• +373

 » 12 days ago, # | ← Rev. 3 →   +21 Good test for matrix multiplication is the problem "I — Matrix God" from contest here. AsGreyWolf have achieved 405ms in his submission. It is matrix multiplication by modulo in $O(n^3)$ time for $n=1000$ with low constant given by vectorization.
 » 12 days ago, # | ← Rev. 2 →   +19 I want to share interesting Russian problem about elections. Among all 85 pages of submissions only 3 persons solved this problem. Author of problem ch_egor solved it faster than $O(n \cdot m)$, remain two solvers solved it in $O(n\cdot m)$ with vectorization.
 » 11 days ago, # | ← Rev. 2 →   +21 as a constant factor enthusiast this pleases me in particular i enjoy simd unhealthily: some practical uses of simd are montgomery + avx2 to optimize a n^2 dp, n=1e5 3s avx512 + cache optimization to optimize a n^2 brute force, n=5e5 16s avx512 + cache optimization + register optimization + instruction level bundling to optimize a n^2 brute force, this problem but n=5e5 5s avx512 + register optimization + transposition to optimize a n^2/w bitset brute force the latter two are mainly from simonlindholm, the god of simd — he has written a book on applications of simd in CP in github here, #4 is the "bitops" chapter.