<?xml version="1.0" encoding="utf-8" standalone="yes"?><rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom"><channel><title>Optimization | Lance Xu</title><link>https://lancexwq.netlify.app/tag/optimization/</link><atom:link href="https://lancexwq.netlify.app/tag/optimization/index.xml" rel="self" type="application/rss+xml"/><description>Optimization</description><generator>Hugo Blox Builder (https://hugoblox.com)</generator><language>en-us</language><lastBuildDate>Tue, 08 Aug 2023 00:00:00 +0000</lastBuildDate><image><url>https://lancexwq.netlify.app/media/icon_hu0b7a4cb9992c9ac0e91bd28ffd38dd00_9727_512x512_fill_lanczos_center_3.png</url><title>Optimization</title><link>https://lancexwq.netlify.app/tag/optimization/</link></image><item><title>Optimization Techniques in Scientific Computing (Part III)</title><link>https://lancexwq.netlify.app/post/optimization-iii/</link><pubDate>Tue, 08 Aug 2023 00:00:00 +0000</pubDate><guid>https://lancexwq.netlify.app/post/optimization-iii/</guid><description>&lt;ul>
&lt;li>&lt;a href="#introduction-and-recap">Introduction and recap&lt;/a>&lt;/li>
&lt;li>&lt;a href="#run-code-on-a-gpu">Run code on a GPU&lt;/a>
&lt;ul>
&lt;li>&lt;a href="#first-attempt">First attempt&lt;/a>&lt;/li>
&lt;li>&lt;a href="#further-vectorization">Further vectorization&lt;/a>&lt;/li>
&lt;li>&lt;a href="#batch-matrix-multiplication">Batch matrix multiplication&lt;/a>&lt;/li>
&lt;li>&lt;a href="#final-boost">Final boost&lt;/a>&lt;/li>
&lt;/ul>
&lt;/li>
&lt;li>&lt;a href="#conclusion">Conclusion&lt;/a>&lt;/li>
&lt;/ul>
&lt;h2 id="introduction-and-recap">Introduction and recap&lt;/h2>
&lt;p>In my &lt;a href="https://lancexwq.github.io/tag/optimization/?q=Optimization%20Techniques%20in%20Scientific%20Computing" target="_blank" rel="noopener">previous two blogs&lt;/a> on optimization techniques in scientific computing, I discussed concepts such as vectorization and parallelism in the context of my single-molecule video simulation&lt;sup id="fnref:1">&lt;a href="#fn:1" class="footnote-ref" role="doc-noteref">1&lt;/a>&lt;/sup>, which can be mathematically formulated as calculating the 3D array \(V\) with &lt;/p>
\[V_{fij}=\sum_n \exp[-(x^p_{fi}-x_{fn})^2-(y^p_{fj}-y_{fn})^2].\]
&lt;p> We started with &lt;code>video_sim_v1&lt;/code>&lt;/p>
&lt;div class="highlight">&lt;pre tabindex="0" class="chroma">&lt;code class="language-julia" data-lang="julia">&lt;span class="line">&lt;span class="cl">&lt;span class="k">function&lt;/span> &lt;span class="n">video_sim_v1&lt;/span>&lt;span class="p">(&lt;/span>&lt;span class="n">xᵖ&lt;/span>&lt;span class="p">,&lt;/span> &lt;span class="n">yᵖ&lt;/span>&lt;span class="p">,&lt;/span> &lt;span class="n">x&lt;/span>&lt;span class="p">,&lt;/span> &lt;span class="n">y&lt;/span>&lt;span class="p">)&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl"> &lt;span class="n">F&lt;/span> &lt;span class="o">=&lt;/span> &lt;span class="n">size&lt;/span>&lt;span class="p">(&lt;/span>&lt;span class="n">x&lt;/span>&lt;span class="p">,&lt;/span> &lt;span class="mi">2&lt;/span>&lt;span class="p">)&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl"> &lt;span class="n">v&lt;/span> &lt;span class="o">=&lt;/span> &lt;span class="kt">Array&lt;/span>&lt;span class="p">{&lt;/span>&lt;span class="kt">eltype&lt;/span>&lt;span class="p">(&lt;/span>&lt;span class="kt">x&lt;/span>&lt;span class="p">),&lt;/span>&lt;span class="mi">3&lt;/span>&lt;span class="p">}(&lt;/span>&lt;span class="nb">undef&lt;/span>&lt;span class="p">,&lt;/span> &lt;span class="n">length&lt;/span>&lt;span class="p">(&lt;/span>&lt;span class="n">xᵖ&lt;/span>&lt;span class="p">),&lt;/span> &lt;span class="n">length&lt;/span>&lt;span class="p">(&lt;/span>&lt;span class="n">yᵖ&lt;/span>&lt;span class="p">),&lt;/span> &lt;span class="n">F&lt;/span>&lt;span class="p">)&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl"> &lt;span class="k">for&lt;/span> &lt;span class="n">f&lt;/span> &lt;span class="k">in&lt;/span> &lt;span class="mi">1&lt;/span>&lt;span class="o">:&lt;/span>&lt;span class="n">F&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl"> &lt;span class="n">PSFˣ&lt;/span> &lt;span class="o">=&lt;/span> &lt;span class="n">exp&lt;/span>&lt;span class="o">.&lt;/span>&lt;span class="p">(&lt;/span>&lt;span class="o">-&lt;/span>&lt;span class="p">(&lt;/span>&lt;span class="n">xᵖ&lt;/span> &lt;span class="o">.-&lt;/span> &lt;span class="n">Transpose&lt;/span>&lt;span class="p">(&lt;/span>&lt;span class="n">view&lt;/span>&lt;span class="p">(&lt;/span>&lt;span class="n">x&lt;/span>&lt;span class="p">,&lt;/span> &lt;span class="o">:&lt;/span>&lt;span class="p">,&lt;/span> &lt;span class="n">f&lt;/span>&lt;span class="p">)))&lt;/span> &lt;span class="o">.^&lt;/span> &lt;span class="mi">2&lt;/span>&lt;span class="p">)&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl"> &lt;span class="n">PSFʸ&lt;/span> &lt;span class="o">=&lt;/span> &lt;span class="n">exp&lt;/span>&lt;span class="o">.&lt;/span>&lt;span class="p">(&lt;/span>&lt;span class="o">-&lt;/span>&lt;span class="p">(&lt;/span>&lt;span class="n">view&lt;/span>&lt;span class="p">(&lt;/span>&lt;span class="n">y&lt;/span>&lt;span class="p">,&lt;/span> &lt;span class="o">:&lt;/span>&lt;span class="p">,&lt;/span> &lt;span class="n">f&lt;/span>&lt;span class="p">)&lt;/span> &lt;span class="o">.-&lt;/span> &lt;span class="n">Transpose&lt;/span>&lt;span class="p">(&lt;/span>&lt;span class="n">yᵖ&lt;/span>&lt;span class="p">))&lt;/span> &lt;span class="o">.^&lt;/span> &lt;span class="mi">2&lt;/span>&lt;span class="p">)&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl"> &lt;span class="n">v&lt;/span>&lt;span class="p">[&lt;/span>&lt;span class="o">:&lt;/span>&lt;span class="p">,&lt;/span> &lt;span class="o">:&lt;/span>&lt;span class="p">,&lt;/span> &lt;span class="n">f&lt;/span>&lt;span class="p">]&lt;/span> &lt;span class="o">=&lt;/span> &lt;span class="n">PSFˣ&lt;/span> &lt;span class="o">*&lt;/span> &lt;span class="n">PSFʸ&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl"> &lt;span class="k">end&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl"> &lt;span class="k">return&lt;/span> &lt;span class="n">v&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&lt;span class="k">end&lt;/span>
&lt;/span>&lt;/span>&lt;/code>&lt;/pre>&lt;/div>&lt;p>then found that introducing multithreading as follows significantly improves the performance.&lt;/p>
&lt;div class="highlight">&lt;pre tabindex="0" class="chroma">&lt;code class="language-julia" data-lang="julia">&lt;span class="line">&lt;span class="cl">&lt;span class="k">function&lt;/span> &lt;span class="n">video_sim_v3&lt;/span>&lt;span class="p">(&lt;/span>&lt;span class="n">xᵖ&lt;/span>&lt;span class="p">,&lt;/span> &lt;span class="n">yᵖ&lt;/span>&lt;span class="p">,&lt;/span> &lt;span class="n">x&lt;/span>&lt;span class="p">,&lt;/span> &lt;span class="n">y&lt;/span>&lt;span class="p">)&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl"> &lt;span class="n">F&lt;/span> &lt;span class="o">=&lt;/span> &lt;span class="n">size&lt;/span>&lt;span class="p">(&lt;/span>&lt;span class="n">x&lt;/span>&lt;span class="p">,&lt;/span> &lt;span class="mi">2&lt;/span>&lt;span class="p">)&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl"> &lt;span class="n">v&lt;/span> &lt;span class="o">=&lt;/span> &lt;span class="kt">Array&lt;/span>&lt;span class="p">{&lt;/span>&lt;span class="kt">eltype&lt;/span>&lt;span class="p">(&lt;/span>&lt;span class="kt">x&lt;/span>&lt;span class="p">),&lt;/span>&lt;span class="mi">3&lt;/span>&lt;span class="p">}(&lt;/span>&lt;span class="nb">undef&lt;/span>&lt;span class="p">,&lt;/span> &lt;span class="n">length&lt;/span>&lt;span class="p">(&lt;/span>&lt;span class="n">xᵖ&lt;/span>&lt;span class="p">),&lt;/span> &lt;span class="n">length&lt;/span>&lt;span class="p">(&lt;/span>&lt;span class="n">yᵖ&lt;/span>&lt;span class="p">),&lt;/span> &lt;span class="n">F&lt;/span>&lt;span class="p">)&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl"> &lt;span class="n">Threads&lt;/span>&lt;span class="o">.&lt;/span>&lt;span class="nd">@threads&lt;/span> &lt;span class="k">for&lt;/span> &lt;span class="n">f&lt;/span> &lt;span class="k">in&lt;/span> &lt;span class="mi">1&lt;/span>&lt;span class="o">:&lt;/span>&lt;span class="n">F&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl"> &lt;span class="n">PSFˣ&lt;/span> &lt;span class="o">=&lt;/span> &lt;span class="n">exp&lt;/span>&lt;span class="o">.&lt;/span>&lt;span class="p">(&lt;/span>&lt;span class="o">-&lt;/span>&lt;span class="p">(&lt;/span>&lt;span class="n">xᵖ&lt;/span> &lt;span class="o">.-&lt;/span> &lt;span class="n">Transpose&lt;/span>&lt;span class="p">(&lt;/span>&lt;span class="n">view&lt;/span>&lt;span class="p">(&lt;/span>&lt;span class="n">x&lt;/span>&lt;span class="p">,&lt;/span> &lt;span class="o">:&lt;/span>&lt;span class="p">,&lt;/span> &lt;span class="n">f&lt;/span>&lt;span class="p">)))&lt;/span> &lt;span class="o">.^&lt;/span> &lt;span class="mi">2&lt;/span>&lt;span class="p">)&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl"> &lt;span class="n">PSFʸ&lt;/span> &lt;span class="o">=&lt;/span> &lt;span class="n">exp&lt;/span>&lt;span class="o">.&lt;/span>&lt;span class="p">(&lt;/span>&lt;span class="o">-&lt;/span>&lt;span class="p">(&lt;/span>&lt;span class="n">view&lt;/span>&lt;span class="p">(&lt;/span>&lt;span class="n">y&lt;/span>&lt;span class="p">,&lt;/span> &lt;span class="o">:&lt;/span>&lt;span class="p">,&lt;/span> &lt;span class="n">f&lt;/span>&lt;span class="p">)&lt;/span> &lt;span class="o">.-&lt;/span> &lt;span class="n">Transpose&lt;/span>&lt;span class="p">(&lt;/span>&lt;span class="n">yᵖ&lt;/span>&lt;span class="p">))&lt;/span> &lt;span class="o">.^&lt;/span> &lt;span class="mi">2&lt;/span>&lt;span class="p">)&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl"> &lt;span class="n">v&lt;/span>&lt;span class="p">[&lt;/span>&lt;span class="o">:&lt;/span>&lt;span class="p">,&lt;/span> &lt;span class="o">:&lt;/span>&lt;span class="p">,&lt;/span> &lt;span class="n">f&lt;/span>&lt;span class="p">]&lt;/span> &lt;span class="o">=&lt;/span> &lt;span class="n">PSFˣ&lt;/span> &lt;span class="o">*&lt;/span> &lt;span class="n">PSFʸ&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl"> &lt;span class="k">end&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl"> &lt;span class="k">return&lt;/span> &lt;span class="n">v&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&lt;span class="k">end&lt;/span>
&lt;/span>&lt;/span>&lt;/code>&lt;/pre>&lt;/div>&lt;p>Eventually, &lt;code>video_sim_v3&lt;/code> yields a benchmark of &lt;code>12.925 ms (1450 allocations: 123.47 MiB)&lt;/code> on my eight-thread Intel i7-7700K.&lt;/p>
&lt;p>In Part II of this blog series, I also loosely alluded to the dilemma we face in further optimization:&lt;/p>
&lt;ul>
&lt;li>The number of independent frames can be much larger than the number of threads on a CPU&lt;sup id="fnref:2">&lt;a href="#fn:2" class="footnote-ref" role="doc-noteref">2&lt;/a>&lt;/sup>.&lt;/li>
&lt;li>Multiprocessing on a cluster incurs significant communication overhead and development challenges, which ultimately outweigh the potential performance gains.&lt;/li>
&lt;/ul>
&lt;p>Basically, we want a solution that can efficiently execute numerous relatively lightweight computational tasks in parallel, while maintaining minimal communication overhead. Interestingly, such a solution already exists, and it takes the form of a GPU. According to the experts from &lt;a href="https://www.intel.com/content/www/us/en/products/docs/processors/cpu-vs-gpu.html" target="_blank" rel="noopener">Intel&lt;/a>,&lt;/p>
&lt;blockquote>
&lt;p>The GPU is a processor that is made up of many smaller and more specialized cores. By working together, the cores deliver massive performance when a processing task can be divided up and processed across many cores.&lt;/p>
&lt;/blockquote>
&lt;h2 id="run-code-on-a-gpu">Run code on a GPU&lt;/h2>
&lt;p>Originally popularized in the deep learning community, accelerating scientific computations with GPUs is rapidly gaining attention from researchers across various domains. Thanks to the continuous efforts of scientists and software developers, writing GPU code has become much easier than it used to be. In some circumstances, once properly set up, code originally written for CPUs can run on a GPU after changing merely a few lines.&lt;/p>
&lt;p>At the moment, the three leading chip companies, Nvidia, AMD, and Intel, all offer their own platforms for GPU computation&lt;sup id="fnref:3">&lt;a href="#fn:3" class="footnote-ref" role="doc-noteref">3&lt;/a>&lt;/sup>. Due to its relatively higher popularity, I will use &lt;a href="https://en.wikipedia.org/wiki/CUDA" target="_blank" rel="noopener">CUDA&lt;/a> from Nvidia in this blog. For detailed guidance on installation and integration with Julia, please refer to &lt;a href="https://github.com/JuliaGPU/CUDA.jl" target="_blank" rel="noopener">CUDA.jl&lt;/a> and &lt;a href="https://cuda.juliagpu.org/stable/" target="_blank" rel="noopener">its documentation&lt;/a>.&lt;/p>
&lt;h3 id="first-attempt">First attempt&lt;/h3>
&lt;p>Once &lt;code>CUDA.jl&lt;/code> is installed, verified, and loaded, running &lt;code>video_sim_v1&lt;/code> on an Nvidia GPU simply requires passing the arguments as CUDA arrays, as in &lt;code>video_sim_v1(CuArray(xᵖ), CuArray(yᵖ), CuArray(x), CuArray(y))&lt;/code>.&lt;/p>
&lt;p>You may expect magic to happen, but instead a warning (or sometimes an error) pops up regarding &lt;code>performing scalar indexing on task&lt;/code>. What&amp;rsquo;s more, the warning message also says &lt;code>such implementations *do not* execute on the GPU, but very slowly on the CPU&lt;/code>, indicating that our first attempt has failed. The cause of this failure is clear from the warning message: CUDA does not accept scalar indexing of a GPU array, such as &lt;code>v[:, :, f]&lt;/code>. Consequently, the solution entails completely vectorizing the code, eliminating the for-loop over \(f\).&lt;/p>
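&lt;p>As a sketch of this behavior (assuming a working &lt;code>CUDA.jl&lt;/code> setup on an Nvidia GPU), the scalar-indexing fallback can be turned into a hard error, and re-enabled locally for debugging:&lt;/p>

```julia
using CUDA  # requires an Nvidia GPU and a working CUDA.jl installation

# Turn the scalar-indexing fallback into an error instead of a warning,
# so accidental element-by-element execution on the CPU cannot slip through.
CUDA.allowscalar(false)

v = CUDA.zeros(Float64, 256, 256, 100)

# v[1, 1, 1]               # would now throw a scalar-indexing error

# For debugging only, opt back in for a single small expression:
CUDA.@allowscalar v[1, 1, 1]
```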
&lt;h3 id="further-vectorization">Further vectorization&lt;/h3>
&lt;p>As stated multiple times thus far, our problem does not align directly with any basic vector operation. However, we can be clever and slightly restructure our data to make vectorization possible. One approach is illustrated in the following figure.&lt;/p>
&lt;p>
&lt;figure id="figure-block-diagonal-matrix-multiplication">
&lt;div class="d-flex justify-content-center">
&lt;div class="w-100" >&lt;img alt="Block diagonal matrix multiplication to calculate PSF." srcset="
/post/optimization-iii/fig1_hufbdff24dc123ad47906c4222b1604286_18213_04666b017c909e48e2e60460767337f0.webp 400w,
/post/optimization-iii/fig1_hufbdff24dc123ad47906c4222b1604286_18213_94cf65fdceef65aa92e944897afac24f.webp 760w,
/post/optimization-iii/fig1_hufbdff24dc123ad47906c4222b1604286_18213_1200x1200_fit_q75_h2_lanczos_3.webp 1200w"
src="https://lancexwq.netlify.app/post/optimization-iii/fig1_hufbdff24dc123ad47906c4222b1604286_18213_04666b017c909e48e2e60460767337f0.webp"
width="760"
height="398"
loading="lazy" data-zoomable />&lt;/div>
&lt;/div>&lt;figcaption>
Block diagonal matrix multiplication
&lt;/figcaption>&lt;/figure>
&lt;/p>
&lt;p>Here, \(PSF^x\), \(PSF^y\), and \(V\) are restructured as block-diagonal matrices. Blocks sharing the same color correspond to the same frame, while any remaining elements within these matrices are set to zero, visually represented as white-colored sections. As a result, all the frames can be simulated through one matrix multiplication.&lt;/p>
&lt;p>While this approach is indeed valid, I would not recommend implementing it yourself, because of the potentially vast dimensions of these block matrices: a naive implementation without careful memory-allocation handling could greatly worsen overall performance.&lt;/p>
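&lt;p>For illustration only, the block-diagonal idea can be sketched with Julia&amp;rsquo;s standard-library &lt;code>SparseArrays&lt;/code>, whose &lt;code>blockdiag&lt;/code> keeps the zero blocks implicit; the toy frame sizes below are made up, and this demonstrates the structure rather than a recommended implementation:&lt;/p>

```julia
using SparseArrays

# Two toy "frames": A-blocks play the role of PSFˣ (pixels × molecules),
# B-blocks the role of transpose(PSFʸ) (molecules × pixels).
A1, A2 = rand(4, 2), rand(4, 2)
B1, B2 = rand(2, 4), rand(2, 4)

# One multiplication of two block-diagonal matrices computes all frames;
# sparse storage avoids materializing the off-diagonal zeros.
V = blockdiag(sparse(A1), sparse(A2)) * blockdiag(sparse(B1), sparse(B2))

# The diagonal blocks of V are exactly the per-frame products:
V[1:4, 1:4] ≈ A1 * B1  # frame 1
V[5:8, 5:8] ≈ A2 * B2  # frame 2
```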
&lt;p>Are there better solutions? The answer is yes. This problem we are facing, namely numerous independent (and typically small) matrix multiplications of identical sizes, is not unique to us. In fact, it is common enough that people have named it &amp;ldquo;batch matrix multiplication&amp;rdquo;.&lt;/p>
&lt;h3 id="batch-matrix-multiplication">Batch matrix multiplication&lt;/h3>
&lt;p>
&lt;figure id="figure-batched-matrix-multiplication">
&lt;div class="d-flex justify-content-center">
&lt;div class="w-100" >&lt;img alt="Batched matrix multiplication to calculate PSF." srcset="
/post/optimization-iii/fig2_hue4fe31342a3d6a87683d56c97d6a780a_10142_f3d3ba72c75220f005441259b1add0d9.webp 400w,
/post/optimization-iii/fig2_hue4fe31342a3d6a87683d56c97d6a780a_10142_ee686c2dfb8920e0b6e127c982785240.webp 760w,
/post/optimization-iii/fig2_hue4fe31342a3d6a87683d56c97d6a780a_10142_1200x1200_fit_q75_h2_lanczos_3.webp 1200w"
src="https://lancexwq.netlify.app/post/optimization-iii/fig2_hue4fe31342a3d6a87683d56c97d6a780a_10142_f3d3ba72c75220f005441259b1add0d9.webp"
width="590"
height="243"
loading="lazy" data-zoomable />&lt;/div>
&lt;/div>&lt;figcaption>
Batched matrix multiplication
&lt;/figcaption>&lt;/figure>
&lt;/p>
&lt;p>Although batch matrix multiplication is widely recognized and efficiently implemented, it may not always be easy to find the correct function within your programming language. Occasionally, batch matrix multiplication goes by different names. For instance, in MATLAB, it is referred to as &amp;ldquo;&lt;a href="https://www.mathworks.com/help/matlab/ref/pagemtimes.html" target="_blank" rel="noopener">page-wise matrix multiplication&lt;/a>&amp;rdquo;. In certain cases, additional packages are required, and quite often, these packages belong to deep-learning libraries! In Python, you can call &lt;code>torch.bmm&lt;/code> from &lt;a href="https://pytorch.org/docs/stable/generated/torch.bmm.html" target="_blank" rel="noopener">PyTorch&lt;/a>, while Julia offers &lt;code>batched_mul&lt;/code> through &lt;a href="https://fluxml.ai/Flux.jl/stable/models/nnlib/#NNlib.batched_mul" target="_blank" rel="noopener">Flux.jl&lt;/a>. Using &lt;code>batched_mul&lt;/code>, we can write a new version as follows:&lt;/p>
&lt;div class="highlight">&lt;pre tabindex="0" class="chroma">&lt;code class="language-julia" data-lang="julia">&lt;span class="line">&lt;span class="cl">&lt;span class="k">function&lt;/span> &lt;span class="n">video_sim_GPU_v2&lt;/span>&lt;span class="p">(&lt;/span>&lt;span class="n">xᵖ&lt;/span>&lt;span class="p">,&lt;/span> &lt;span class="n">yᵖ&lt;/span>&lt;span class="p">,&lt;/span> &lt;span class="n">x&lt;/span>&lt;span class="p">,&lt;/span> &lt;span class="n">y&lt;/span>&lt;span class="p">)&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl"> &lt;span class="n">PSFˣ&lt;/span> &lt;span class="o">=&lt;/span> &lt;span class="n">exp&lt;/span>&lt;span class="o">.&lt;/span>&lt;span class="p">(&lt;/span>&lt;span class="o">-&lt;/span>&lt;span class="p">(&lt;/span>&lt;span class="n">reshape&lt;/span>&lt;span class="p">(&lt;/span>&lt;span class="n">x&lt;/span>&lt;span class="p">,&lt;/span> &lt;span class="mi">1&lt;/span>&lt;span class="p">,&lt;/span> &lt;span class="n">size&lt;/span>&lt;span class="p">(&lt;/span>&lt;span class="n">x&lt;/span>&lt;span class="p">)&lt;/span>&lt;span class="o">...&lt;/span>&lt;span class="p">)&lt;/span> &lt;span class="o">.-&lt;/span> &lt;span class="n">xᵖ&lt;/span>&lt;span class="p">)&lt;/span> &lt;span class="o">.^&lt;/span> &lt;span class="mi">2&lt;/span>&lt;span class="p">)&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl"> &lt;span class="n">PSFʸ&lt;/span> &lt;span class="o">=&lt;/span> &lt;span class="n">exp&lt;/span>&lt;span class="o">.&lt;/span>&lt;span class="p">(&lt;/span>&lt;span class="o">-&lt;/span>&lt;span class="p">(&lt;/span>&lt;span class="n">reshape&lt;/span>&lt;span class="p">(&lt;/span>&lt;span class="n">y&lt;/span>&lt;span class="p">,&lt;/span> &lt;span class="mi">1&lt;/span>&lt;span class="p">,&lt;/span> &lt;span class="n">size&lt;/span>&lt;span class="p">(&lt;/span>&lt;span class="n">y&lt;/span>&lt;span class="p">)&lt;/span>&lt;span class="o">...&lt;/span>&lt;span class="p">)&lt;/span> &lt;span class="o">.-&lt;/span> &lt;span class="n">yᵖ&lt;/span>&lt;span class="p">)&lt;/span> &lt;span class="o">.^&lt;/span> &lt;span class="mi">2&lt;/span>&lt;span class="p">)&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl"> &lt;span class="k">return&lt;/span> &lt;span class="n">batched_mul&lt;/span>&lt;span class="p">(&lt;/span>&lt;span class="n">PSFˣ&lt;/span>&lt;span class="p">,&lt;/span> &lt;span class="n">batched_adjoint&lt;/span>&lt;span class="p">(&lt;/span>&lt;span class="n">PSFʸ&lt;/span>&lt;span class="p">))&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&lt;span class="k">end&lt;/span>
&lt;/span>&lt;/span>&lt;/code>&lt;/pre>&lt;/div>&lt;p>Here, &lt;code>reshape&lt;/code> is called to construct &lt;code>PSFˣ&lt;/code> and &lt;code>PSFʸ&lt;/code> as 3D arrays, and &lt;code>batched_adjoint&lt;/code> is just the &amp;ldquo;batched&amp;rdquo; version of transpose.&lt;/p>
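&lt;p>To make the semantics concrete, here is a plain-Julia reference sketch (the helper names &lt;code>batched_mul_ref&lt;/code> and &lt;code>batched_adjoint_ref&lt;/code> are hypothetical, written only to spell out what &lt;code>batched_mul&lt;/code> and &lt;code>batched_adjoint&lt;/code> compute slice by slice):&lt;/p>

```julia
# Hypothetical reference versions, spelling out the per-frame semantics that
# batched_mul/batched_adjoint provide in one optimized call (dispatching to
# a batched GEMM on GPU arrays).
function batched_mul_ref(A::AbstractArray{T,3}, B::AbstractArray{T,3}) where {T}
    F = size(A, 3)
    C = Array{T,3}(undef, size(A, 1), size(B, 2), F)
    for f in 1:F
        C[:, :, f] = view(A, :, :, f) * view(B, :, :, f)
    end
    return C
end

batched_adjoint_ref(A) = permutedims(conj.(A), (2, 1, 3))

# Tiny example: 8 pixels per axis, 3 molecules, 5 frames.
xᵖ, yᵖ = rand(8), rand(8)
x, y = rand(3, 5), rand(3, 5)
PSFˣ = exp.(-(reshape(x, 1, size(x)...) .- xᵖ) .^ 2)  # 8×3×5
PSFʸ = exp.(-(reshape(y, 1, size(y)...) .- yᵖ) .^ 2)  # 8×3×5
V = batched_mul_ref(PSFˣ, batched_adjoint_ref(PSFʸ))  # 8×8×5
```

&lt;p>Each slice &lt;code>V[:, :, f]&lt;/code> then matches the per-frame product computed inside &lt;code>video_sim_v1&lt;/code>.&lt;/p>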
&lt;p>Benchmarking &lt;code>video_sim_GPU_v2&lt;/code> on my CPU (i7-7700K) and my GPU (GeForce GTX 1060) yields &lt;code>9.550 ms (75 allocations: 73.44 MiB)&lt;/code> and &lt;code>3.127 ms (9 GPU allocations: 73.468 MiB)&lt;/code>&lt;sup id="fnref:4">&lt;a href="#fn:4" class="footnote-ref" role="doc-noteref">4&lt;/a>&lt;/sup>, respectively. Both beat the multithreaded &lt;code>video_sim_v3&lt;/code>!&lt;/p>
&lt;h3 id="final-boost">Final boost&lt;/h3>
&lt;p>The benchmarks I&amp;rsquo;ve showcased so far are based on double-precision floating-point (Float64) numbers. However, GPUs are frequently optimized for single-precision floating-point (Float32) numbers. For instance, after switching to Float32, &lt;code>video_sim_GPU_v2&lt;/code>&amp;rsquo;s benchmark becomes &lt;code>660.627 μs (11 GPU allocations: 36.736 MiB)&lt;/code>, another fivefold acceleration!&lt;/p>
&lt;p>Therefore, it is frequently advantageous to craft your GPU code to support both Float64 and Float32, and then assess whether altering the datatype affects your results. If there&amp;rsquo;s no impact, simply proceed with Float32!&lt;/p>
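&lt;p>In Julia this usually costs nothing extra: if the code is written generically, the element type simply follows the inputs, so it suffices to convert the arrays once at the boundary. A minimal sketch, with a hypothetical one-dimensional helper &lt;code>psf_1d&lt;/code>:&lt;/p>

```julia
# Type-generic helper: the output element type follows the inputs,
# so the same code serves both Float64 and Float32.
psf_1d(xᵖ, x) = exp.(-(xᵖ .- transpose(x)) .^ 2)

xᵖ64, x64 = rand(64), rand(20)
xᵖ32, x32 = Float32.(xᵖ64), Float32.(x64)  # convert once at the boundary

eltype(psf_1d(xᵖ64, x64))  # Float64
eltype(psf_1d(xᵖ32, x32))  # Float32
```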
&lt;h2 id="conclusion">Conclusion&lt;/h2>
&lt;p>Finally, we have arrived at the conclusion of my blog series concerning optimization techniques for scientific computation. I hope you have enjoyed this journey and learned something useful. Please feel free to get in touch with me should you wish to connect or share your thoughts!&lt;/p>
&lt;div class="footnotes" role="doc-endnotes">
&lt;hr>
&lt;ol>
&lt;li id="fn:1">
&lt;p>In case you haven&amp;rsquo;t read the preceding blogs, I strongly encourage you to take a moment to review their problem description sections. This will provide you with a better picture of the issue I&amp;rsquo;m trying to address.&amp;#160;&lt;a href="#fnref:1" class="footnote-backref" role="doc-backlink">&amp;#x21a9;&amp;#xfe0e;&lt;/a>&lt;/p>
&lt;/li>
&lt;li id="fn:2">
&lt;p>As of the date of this blog, even the most advanced desktop CPU, the AMD Ryzen™ Threadripper™ PRO 5995WX (~$6,000), only has 128 threads, while the number of frames can easily exceed 1,000.&amp;#160;&lt;a href="#fnref:2" class="footnote-backref" role="doc-backlink">&amp;#x21a9;&amp;#xfe0e;&lt;/a>&lt;/p>
&lt;/li>
&lt;li id="fn:3">
&lt;p>&lt;a href="https://en.wikipedia.org/wiki/CUDA" target="_blank" rel="noopener">CUDA&lt;/a> from Nvidia, &lt;a href="https://en.wikipedia.org/wiki/ROCm" target="_blank" rel="noopener">ROCm&lt;/a> from AMD, and &lt;a href="https://en.wikipedia.org/wiki/OneAPI_%28compute_acceleration%29" target="_blank" rel="noopener">OneAPI&lt;/a> from Intel.&amp;#160;&lt;a href="#fnref:3" class="footnote-backref" role="doc-backlink">&amp;#x21a9;&amp;#xfe0e;&lt;/a>&lt;/p>
&lt;/li>
&lt;li id="fn:4">
&lt;p>GPU memory allocation is measured by &lt;code>CUDA.@time&lt;/code>, see this &lt;a href="https://cuda.juliagpu.org/stable/development/profiling/" target="_blank" rel="noopener">page&lt;/a>.&amp;#160;&lt;a href="#fnref:4" class="footnote-backref" role="doc-backlink">&amp;#x21a9;&amp;#xfe0e;&lt;/a>&lt;/p>
&lt;/li>
&lt;/ol>
&lt;/div></description></item><item><title>Optimization Techniques in Scientific Computing (Part II)</title><link>https://lancexwq.netlify.app/post/optimizationp-ii/</link><pubDate>Mon, 07 Aug 2023 00:00:00 +0000</pubDate><guid>https://lancexwq.netlify.app/post/optimizationp-ii/</guid><description>&lt;ul>
&lt;li>&lt;a href="#introduction-and-recap">Introduction and recap&lt;/a>&lt;/li>
&lt;li>&lt;a href="#the-first-implementation">The first implementation&lt;/a>&lt;/li>
&lt;li>&lt;a href="#optimization-ideas">Optimization ideas&lt;/a>&lt;/li>
&lt;li>&lt;a href="#parallelism">Parallelism&lt;/a>
&lt;ul>
&lt;li>&lt;a href="#core-level">Core-level&lt;/a>&lt;/li>
&lt;li>&lt;a href="#node-level">Node-level&lt;/a>&lt;/li>
&lt;li>&lt;a href="#cluster-level">Cluster-level&lt;/a>&lt;/li>
&lt;li>&lt;a href="#key-consideration">Key consideration&lt;/a>&lt;/li>
&lt;/ul>
&lt;/li>
&lt;/ul>
&lt;h2 id="introduction-and-recap">Introduction and recap&lt;/h2>
&lt;p>In &lt;a href="https://lancexwq.github.io/post/optimization-i/" target="_blank" rel="noopener">my previous blog&lt;/a>, I introduced some straightforward yet valuable optimization techniques. While these techniques are generally suitable for relatively simple problems, such as the example presented there, they may prove inadequate when dealing with more complex and realistic issues. Specifically, in the previous example, my objective was to simulate a single-molecule microscope image. However, people frequently need to process multiple independent images (e.g., frames in a video). In this blog, I will discuss additional techniques within the context of this video simulation problem.&lt;/p>
&lt;p>To quickly recap, the previous example involves calculating the total contribution, labeled as \(I\), from all molecules. These molecules are indexed by \(n\), and they relate to each pixel, which is indexed by \(i\) and \(j\). Referring to the assumptions discussed earlier, we can express the algorithm&amp;rsquo;s mathematical form as follows: &lt;/p>
\[I_{ij}=\sum_n \exp[-(x^p_i-x_n)^2-(y^p_j-y_n)^2].\]
&lt;p> Now, shifting our focus to the present issue that involves multiple independent images (or frames), we extend the same calculation to each individual image, denoted as \(f\). As a result, the mathematical representation for this new problem takes the following shape (where \(V\) stands for video): &lt;/p>
\[V_{fij}=\sum_n \exp[-(x^p_{fi}-x_{fn})^2-(y^p_{fj}-y_{fn})^2].\]
&lt;h2 id="the-first-implementation">The first implementation&lt;/h2>
&lt;p>
&lt;figure id="figure-calculate-psfs-in-a-loop">
&lt;div class="d-flex justify-content-center">
&lt;div class="w-100" >&lt;img alt="Calculate PSFs in a loop." srcset="
/post/optimizationp-ii/fig_loop_hubbbb7445c0883065e0be80260df2da06_21327_f007f7123984859dfb5d7e1f34a5eeab.webp 400w,
/post/optimizationp-ii/fig_loop_hubbbb7445c0883065e0be80260df2da06_21327_7d741b6ab2077fd82bf6c349c9a93ecf.webp 760w,
/post/optimizationp-ii/fig_loop_hubbbb7445c0883065e0be80260df2da06_21327_1200x1200_fit_q75_h2_lanczos_3.webp 1200w"
src="https://lancexwq.netlify.app/post/optimizationp-ii/fig_loop_hubbbb7445c0883065e0be80260df2da06_21327_f007f7123984859dfb5d7e1f34a5eeab.webp"
width="535"
height="602"
loading="lazy" data-zoomable />&lt;/div>
&lt;/div>&lt;figcaption>
Calculate PSFs in a loop
&lt;/figcaption>&lt;/figure>
&lt;/p>
&lt;p>Based on the description so far, we can readily enclose a for-loop iterating over \(f\) around the previously optimized code to create the initial version of our single-molecule video simulation code&lt;sup id="fnref:1">&lt;a href="#fn:1" class="footnote-ref" role="doc-noteref">1&lt;/a>&lt;/sup>:&lt;/p>
&lt;div class="highlight">&lt;pre tabindex="0" class="chroma">&lt;code class="language-julia" data-lang="julia">&lt;span class="line">&lt;span class="cl">&lt;span class="k">function&lt;/span> &lt;span class="n">video_sim_v1&lt;/span>&lt;span class="p">(&lt;/span>&lt;span class="n">xᵖ&lt;/span>&lt;span class="p">,&lt;/span> &lt;span class="n">yᵖ&lt;/span>&lt;span class="p">,&lt;/span> &lt;span class="n">x&lt;/span>&lt;span class="p">,&lt;/span> &lt;span class="n">y&lt;/span>&lt;span class="p">)&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl"> &lt;span class="n">F&lt;/span> &lt;span class="o">=&lt;/span> &lt;span class="n">size&lt;/span>&lt;span class="p">(&lt;/span>&lt;span class="n">x&lt;/span>&lt;span class="p">,&lt;/span> &lt;span class="mi">2&lt;/span>&lt;span class="p">)&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl"> &lt;span class="n">V&lt;/span> &lt;span class="o">=&lt;/span> &lt;span class="kt">Array&lt;/span>&lt;span class="p">{&lt;/span>&lt;span class="kt">eltype&lt;/span>&lt;span class="p">(&lt;/span>&lt;span class="kt">x&lt;/span>&lt;span class="p">),&lt;/span>&lt;span class="mi">3&lt;/span>&lt;span class="p">}(&lt;/span>&lt;span class="nb">undef&lt;/span>&lt;span class="p">,&lt;/span> &lt;span class="n">length&lt;/span>&lt;span class="p">(&lt;/span>&lt;span class="n">xᵖ&lt;/span>&lt;span class="p">),&lt;/span> &lt;span class="n">length&lt;/span>&lt;span class="p">(&lt;/span>&lt;span class="n">yᵖ&lt;/span>&lt;span class="p">),&lt;/span> &lt;span class="n">F&lt;/span>&lt;span class="p">)&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl"> &lt;span class="k">for&lt;/span> &lt;span class="n">f&lt;/span> &lt;span class="k">in&lt;/span> &lt;span class="mi">1&lt;/span>&lt;span class="o">:&lt;/span>&lt;span class="n">F&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl"> &lt;span class="n">PSFˣ&lt;/span> &lt;span class="o">=&lt;/span> &lt;span class="n">exp&lt;/span>&lt;span class="o">.&lt;/span>&lt;span class="p">(&lt;/span>&lt;span class="o">-&lt;/span>&lt;span class="p">(&lt;/span>&lt;span class="n">xᵖ&lt;/span> &lt;span class="o">.-&lt;/span> &lt;span class="n">Transpose&lt;/span>&lt;span class="p">(&lt;/span>&lt;span class="n">view&lt;/span>&lt;span class="p">(&lt;/span>&lt;span class="n">x&lt;/span>&lt;span class="p">,&lt;/span> &lt;span class="o">:&lt;/span>&lt;span class="p">,&lt;/span> &lt;span class="n">f&lt;/span>&lt;span class="p">)))&lt;/span> &lt;span class="o">.^&lt;/span> &lt;span class="mi">2&lt;/span>&lt;span class="p">)&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl"> &lt;span class="n">PSFʸ&lt;/span> &lt;span class="o">=&lt;/span> &lt;span class="n">exp&lt;/span>&lt;span class="o">.&lt;/span>&lt;span class="p">(&lt;/span>&lt;span class="o">-&lt;/span>&lt;span class="p">(&lt;/span>&lt;span class="n">view&lt;/span>&lt;span class="p">(&lt;/span>&lt;span class="n">y&lt;/span>&lt;span class="p">,&lt;/span> &lt;span class="o">:&lt;/span>&lt;span class="p">,&lt;/span> &lt;span class="n">f&lt;/span>&lt;span class="p">)&lt;/span> &lt;span class="o">.-&lt;/span> &lt;span class="n">Transpose&lt;/span>&lt;span class="p">(&lt;/span>&lt;span class="n">yᵖ&lt;/span>&lt;span class="p">))&lt;/span> &lt;span class="o">.^&lt;/span> &lt;span class="mi">2&lt;/span>&lt;span class="p">)&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl"> &lt;span class="n">V&lt;/span>&lt;span class="p">[&lt;/span>&lt;span class="o">:&lt;/span>&lt;span class="p">,&lt;/span> &lt;span class="o">:&lt;/span>&lt;span class="p">,&lt;/span> &lt;span class="n">f&lt;/span>&lt;span class="p">]&lt;/span> &lt;span class="o">=&lt;/span> &lt;span class="n">PSFˣ&lt;/span> &lt;span class="o">*&lt;/span> &lt;span class="n">PSFʸ&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl"> &lt;span class="k">end&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl"> &lt;span class="k">return&lt;/span> &lt;span class="n">V&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&lt;span class="k">end&lt;/span>
&lt;/span>&lt;/span>&lt;/code>&lt;/pre>&lt;/div>&lt;p>Two points to note in the code above:&lt;/p>
&lt;ul>
&lt;li>&lt;code>x&lt;/code> and &lt;code>y&lt;/code> are both arrays of dimensions \(N\times F\), where \(N\) and \(F\) represent the number of molecules and the number of frames, respectively.&lt;/li>
&lt;li>It appears that we have made a bold assumption that all frames contain an equal number of molecules. However, this assumption is acceptable since molecules that should not appear in a frame can be positioned far away from the field-of-view, thereby making no contribution.&lt;/li>
&lt;/ul>
&lt;p>Benchmarking &lt;code>video_sim_v1&lt;/code> using a dataset comprising 20 molecules and 100 frames (each with 256\(\times\)256 pixels) yields &lt;code>50.927 ms (1402 allocations: 123.47 MiB)&lt;/code>. Our goal is to improve upon this benchmark.&lt;/p>
&lt;h2 id="optimization-ideas">Optimization ideas&lt;/h2>
&lt;p>Before exploring new techniques, let&amp;rsquo;s take a moment to consider whether we can apply anything from &lt;a href="https://lancexwq.github.io/post/optimization-i/" target="_blank" rel="noopener">my previous blog&lt;/a>. Since we have only added one extra loop, there isn&amp;rsquo;t much opportunity to reduce memory allocation. What&amp;rsquo;s more, this extra loop cannot be easily eliminated through vectorization, as the formula specified here doesn&amp;rsquo;t align with basic matrix (or tensor) operations. Consequently, we must use other techniques to tackle this challenge.&lt;/p>
&lt;p>In this video simulation problem, it is important to note that all frames are independent of each other. As a result, there is potential to simulate frames simultaneously, or in other words, in parallel.&lt;/p>
&lt;h2 id="parallelism">Parallelism&lt;/h2>
&lt;p>Parallelizing an algorithm is much easier said than done. Given the layered nature of contemporary computational infrastructure, parallelism today involves three major tiers: core-level parallelism, node-level parallelism, and cluster-level parallelism&lt;sup id="fnref:2">&lt;a href="#fn:2" class="footnote-ref" role="doc-noteref">2&lt;/a>&lt;/sup>. In the upcoming sections, I will examine one common scheme within each tier and assess its relevance to our specific problem.&lt;/p>
&lt;h3 id="core-level">Core-level&lt;/h3>
&lt;p>The first question that may arise is: how is it possible to achieve parallelism on a single core? To illustrate, consider a program that operates on 64-bit integers, running on a processor core that can fetch 256 bits of data in a single operation. In that case, four integers can be loaded as a vector and processed in a single vectorized iteration of the original operation, yielding a theoretical speedup of up to fourfold&lt;sup id="fnref:3">&lt;a href="#fn:3" class="footnote-ref" role="doc-noteref">3&lt;/a>&lt;/sup>. This approach to parallelization is commonly known as &amp;ldquo;&lt;a href="https://en.wikipedia.org/wiki/Single_instruction,_multiple_data" target="_blank" rel="noopener">single instruction, multiple data&lt;/a>&amp;rdquo; (SIMD).&lt;/p>
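&lt;p>To make the four-integers-per-instruction picture concrete, here is a toy Python sketch (the blog&amp;rsquo;s code is Julia; this model is purely conceptual, since real SIMD happens in hardware and gains no speed in interpreted Python):&lt;/p>

```python
# Toy model of 4-wide SIMD: one "vector instruction" combines four
# 64-bit integers at a time instead of looping element by element.
def scalar_add(a, b):
    # one addition per loop iteration
    return [ai + bi for ai, bi in zip(a, b)]

def simd_add(a, b, width=4):
    out = []
    for i in range(0, len(a), width):
        # conceptually a single instruction operating on a whole
        # `width`-element lane of both inputs
        lane_a = a[i:i + width]
        lane_b = b[i:i + width]
        out.extend(ai + bi for ai, bi in zip(lane_a, lane_b))
    return out
```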
&lt;p>The simplicity of SIMD cuts both ways. On one hand, it allows many modern programming languages to identify points within an algorithm where SIMD can be employed and apply it automatically. On the other hand, SIMD is typically limited to basic operations such as addition or multiplication, so whether &lt;code>video_sim_v1&lt;/code> can be improved this way is uncertain. Nevertheless, it is worth a try.&lt;/p>
&lt;p>In &lt;a href="https://julialang.org/" target="_blank" rel="noopener">Julia&lt;/a>, it is possible to enforce vectorization by employing the &lt;code>@simd&lt;/code> macro, placed before a for-loop involving independent iterations. This technique results in the creation of &lt;code>video_sim_v2&lt;/code>:&lt;/p>
&lt;div class="highlight">&lt;pre tabindex="0" class="chroma">&lt;code class="language-julia" data-lang="julia">&lt;span class="line">&lt;span class="cl">&lt;span class="k">function&lt;/span> &lt;span class="n">video_sim_v2&lt;/span>&lt;span class="p">(&lt;/span>&lt;span class="n">xᵖ&lt;/span>&lt;span class="p">,&lt;/span> &lt;span class="n">yᵖ&lt;/span>&lt;span class="p">,&lt;/span> &lt;span class="n">x&lt;/span>&lt;span class="p">,&lt;/span> &lt;span class="n">y&lt;/span>&lt;span class="p">)&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl"> &lt;span class="n">F&lt;/span> &lt;span class="o">=&lt;/span> &lt;span class="n">size&lt;/span>&lt;span class="p">(&lt;/span>&lt;span class="n">x&lt;/span>&lt;span class="p">,&lt;/span> &lt;span class="mi">2&lt;/span>&lt;span class="p">)&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl"> &lt;span class="n">V&lt;/span> &lt;span class="o">=&lt;/span> &lt;span class="kt">Array&lt;/span>&lt;span class="p">{&lt;/span>&lt;span class="kt">eltype&lt;/span>&lt;span class="p">(&lt;/span>&lt;span class="kt">x&lt;/span>&lt;span class="p">),&lt;/span>&lt;span class="mi">3&lt;/span>&lt;span class="p">}(&lt;/span>&lt;span class="nb">undef&lt;/span>&lt;span class="p">,&lt;/span> &lt;span class="n">length&lt;/span>&lt;span class="p">(&lt;/span>&lt;span class="n">xᵖ&lt;/span>&lt;span class="p">),&lt;/span> &lt;span class="n">length&lt;/span>&lt;span class="p">(&lt;/span>&lt;span class="n">yᵖ&lt;/span>&lt;span class="p">),&lt;/span> &lt;span class="n">F&lt;/span>&lt;span class="p">)&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl"> &lt;span class="nd">@simd&lt;/span> &lt;span class="k">for&lt;/span> &lt;span class="n">f&lt;/span> &lt;span class="k">in&lt;/span> &lt;span class="mi">1&lt;/span>&lt;span class="o">:&lt;/span>&lt;span class="n">F&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl"> &lt;span class="n">PSFˣ&lt;/span> &lt;span class="o">=&lt;/span> &lt;span class="n">exp&lt;/span>&lt;span class="o">.&lt;/span>&lt;span class="p">(&lt;/span>&lt;span class="o">-&lt;/span>&lt;span class="p">(&lt;/span>&lt;span class="n">xᵖ&lt;/span> &lt;span class="o">.-&lt;/span> &lt;span class="n">Transpose&lt;/span>&lt;span class="p">(&lt;/span>&lt;span class="n">view&lt;/span>&lt;span class="p">(&lt;/span>&lt;span class="n">x&lt;/span>&lt;span class="p">,&lt;/span> &lt;span class="o">:&lt;/span>&lt;span class="p">,&lt;/span> &lt;span class="n">f&lt;/span>&lt;span class="p">)))&lt;/span> &lt;span class="o">.^&lt;/span> &lt;span class="mi">2&lt;/span>&lt;span class="p">)&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl"> &lt;span class="n">PSFʸ&lt;/span> &lt;span class="o">=&lt;/span> &lt;span class="n">exp&lt;/span>&lt;span class="o">.&lt;/span>&lt;span class="p">(&lt;/span>&lt;span class="o">-&lt;/span>&lt;span class="p">(&lt;/span>&lt;span class="n">view&lt;/span>&lt;span class="p">(&lt;/span>&lt;span class="n">y&lt;/span>&lt;span class="p">,&lt;/span> &lt;span class="o">:&lt;/span>&lt;span class="p">,&lt;/span> &lt;span class="n">f&lt;/span>&lt;span class="p">)&lt;/span> &lt;span class="o">.-&lt;/span> &lt;span class="n">Transpose&lt;/span>&lt;span class="p">(&lt;/span>&lt;span class="n">yᵖ&lt;/span>&lt;span class="p">))&lt;/span> &lt;span class="o">.^&lt;/span> &lt;span class="mi">2&lt;/span>&lt;span class="p">)&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl"> &lt;span class="n">V&lt;/span>&lt;span class="p">[&lt;/span>&lt;span class="o">:&lt;/span>&lt;span class="p">,&lt;/span> &lt;span class="o">:&lt;/span>&lt;span class="p">,&lt;/span> &lt;span class="n">f&lt;/span>&lt;span class="p">]&lt;/span> &lt;span class="o">=&lt;/span> &lt;span class="n">PSFˣ&lt;/span> &lt;span class="o">*&lt;/span> &lt;span class="n">PSFʸ&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl"> &lt;span class="k">end&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl"> &lt;span class="k">return&lt;/span> &lt;span class="n">V&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&lt;span class="k">end&lt;/span>
&lt;/span>&lt;/span>&lt;/code>&lt;/pre>&lt;/div>&lt;p>A benchmark analysis yields a result of &lt;code>51.017 ms (1402 allocations: 123.47 MiB)&lt;/code>, indicating a lack of performance improvement. It seems that Julia has indeed automatically vectorized the code in this case.&lt;/p>
&lt;h3 id="node-level">Node-level&lt;/h3>
&lt;p>Moving up a level, there is parallelism within a node (often a single computer), which is typically achieved through &lt;a href="https://en.wikipedia.org/wiki/Multithreading_%28computer_architecture%29" target="_blank" rel="noopener">multithreading&lt;/a>. Multithreading requires multiple processor cores (either physical or virtual&lt;sup id="fnref:4">&lt;a href="#fn:4" class="footnote-ref" role="doc-noteref">4&lt;/a>&lt;/sup>), with each core running a separate thread. These threads execute simultaneously while sharing the same memory pool. Note that implementing multithreading demands care to avoid conflicts between threads; fortunately, developers have often shouldered much of this responsibility, relieving users of the burden.&lt;/p>
&lt;p>In Julia, multithreading a for-loop can be as easy as follows&lt;sup id="fnref:5">&lt;a href="#fn:5" class="footnote-ref" role="doc-noteref">5&lt;/a>&lt;/sup>:&lt;/p>
&lt;div class="highlight">&lt;pre tabindex="0" class="chroma">&lt;code class="language-julia" data-lang="julia">&lt;span class="line">&lt;span class="cl">&lt;span class="k">function&lt;/span> &lt;span class="n">video_sim_v3&lt;/span>&lt;span class="p">(&lt;/span>&lt;span class="n">xᵖ&lt;/span>&lt;span class="p">,&lt;/span> &lt;span class="n">yᵖ&lt;/span>&lt;span class="p">,&lt;/span> &lt;span class="n">x&lt;/span>&lt;span class="p">,&lt;/span> &lt;span class="n">y&lt;/span>&lt;span class="p">)&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl"> &lt;span class="n">F&lt;/span> &lt;span class="o">=&lt;/span> &lt;span class="n">size&lt;/span>&lt;span class="p">(&lt;/span>&lt;span class="n">x&lt;/span>&lt;span class="p">,&lt;/span> &lt;span class="mi">2&lt;/span>&lt;span class="p">)&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl"> &lt;span class="n">V&lt;/span> &lt;span class="o">=&lt;/span> &lt;span class="kt">Array&lt;/span>&lt;span class="p">{&lt;/span>&lt;span class="kt">eltype&lt;/span>&lt;span class="p">(&lt;/span>&lt;span class="kt">x&lt;/span>&lt;span class="p">),&lt;/span>&lt;span class="mi">3&lt;/span>&lt;span class="p">}(&lt;/span>&lt;span class="nb">undef&lt;/span>&lt;span class="p">,&lt;/span> &lt;span class="n">length&lt;/span>&lt;span class="p">(&lt;/span>&lt;span class="n">xᵖ&lt;/span>&lt;span class="p">),&lt;/span> &lt;span class="n">length&lt;/span>&lt;span class="p">(&lt;/span>&lt;span class="n">yᵖ&lt;/span>&lt;span class="p">),&lt;/span> &lt;span class="n">F&lt;/span>&lt;span class="p">)&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl"> &lt;span class="n">Threads&lt;/span>&lt;span class="o">.&lt;/span>&lt;span class="nd">@threads&lt;/span> &lt;span class="k">for&lt;/span> &lt;span class="n">f&lt;/span> &lt;span class="k">in&lt;/span> &lt;span class="mi">1&lt;/span>&lt;span class="o">:&lt;/span>&lt;span class="n">F&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl"> &lt;span class="n">PSFˣ&lt;/span> &lt;span class="o">=&lt;/span> &lt;span class="n">exp&lt;/span>&lt;span class="o">.&lt;/span>&lt;span class="p">(&lt;/span>&lt;span class="o">-&lt;/span>&lt;span class="p">(&lt;/span>&lt;span class="n">xᵖ&lt;/span> &lt;span class="o">.-&lt;/span> &lt;span class="n">Transpose&lt;/span>&lt;span class="p">(&lt;/span>&lt;span class="n">view&lt;/span>&lt;span class="p">(&lt;/span>&lt;span class="n">x&lt;/span>&lt;span class="p">,&lt;/span> &lt;span class="o">:&lt;/span>&lt;span class="p">,&lt;/span> &lt;span class="n">f&lt;/span>&lt;span class="p">)))&lt;/span> &lt;span class="o">.^&lt;/span> &lt;span class="mi">2&lt;/span>&lt;span class="p">)&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl"> &lt;span class="n">PSFʸ&lt;/span> &lt;span class="o">=&lt;/span> &lt;span class="n">exp&lt;/span>&lt;span class="o">.&lt;/span>&lt;span class="p">(&lt;/span>&lt;span class="o">-&lt;/span>&lt;span class="p">(&lt;/span>&lt;span class="n">view&lt;/span>&lt;span class="p">(&lt;/span>&lt;span class="n">y&lt;/span>&lt;span class="p">,&lt;/span> &lt;span class="o">:&lt;/span>&lt;span class="p">,&lt;/span> &lt;span class="n">f&lt;/span>&lt;span class="p">)&lt;/span> &lt;span class="o">.-&lt;/span> &lt;span class="n">Transpose&lt;/span>&lt;span class="p">(&lt;/span>&lt;span class="n">yᵖ&lt;/span>&lt;span class="p">))&lt;/span> &lt;span class="o">.^&lt;/span> &lt;span class="mi">2&lt;/span>&lt;span class="p">)&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl"> &lt;span class="n">V&lt;/span>&lt;span class="p">[&lt;/span>&lt;span class="o">:&lt;/span>&lt;span class="p">,&lt;/span> &lt;span class="o">:&lt;/span>&lt;span class="p">,&lt;/span> &lt;span class="n">f&lt;/span>&lt;span class="p">]&lt;/span> &lt;span class="o">=&lt;/span> &lt;span class="n">PSFˣ&lt;/span> &lt;span class="o">*&lt;/span> &lt;span class="n">PSFʸ&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl"> &lt;span class="k">end&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl"> &lt;span class="k">return&lt;/span> &lt;span class="n">V&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&lt;span class="k">end&lt;/span>
&lt;/span>&lt;/span>&lt;/code>&lt;/pre>&lt;/div>&lt;p>My desktop computer is equipped with four physical CPU cores, which translate into eight threads. Benchmarking &lt;code>video_sim_v3&lt;/code> with all eight threads demonstrates a remarkable speedup of almost four times compared to &lt;code>video_sim_v1&lt;/code>, clocking in at &lt;code>12.925 ms (1450 allocations: 123.47 MiB)&lt;/code>.&lt;/p>
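&lt;p>For reference, the same frame-parallel idea can be sketched in Python (the function names and array shapes below mirror the Julia code but are my own; NumPy releases the GIL inside many of its kernels, which is what allows threads to help here):&lt;/p>

```python
import numpy as np
from concurrent.futures import ThreadPoolExecutor

def sim_frame(xp, yp, x, y, f):
    # per-axis Gaussian PSF factors for frame f, combined by a matrix
    # product, mirroring PSFx * PSFy in the Julia code
    psf_x = np.exp(-(xp[:, None] - x[None, :, f]) ** 2)      # (pixels_x, molecules)
    psf_y = np.exp(-(y[:, f][:, None] - yp[None, :]) ** 2)   # (molecules, pixels_y)
    return psf_x @ psf_y                                     # (pixels_x, pixels_y)

def video_sim_threaded(xp, yp, x, y):
    F = x.shape[1]
    V = np.empty((len(xp), len(yp), F))
    # frames are independent, so each one can be simulated on its own thread
    with ThreadPoolExecutor() as pool:
        frames = pool.map(lambda f: sim_frame(xp, yp, x, y, f), range(F))
        for f, frame in enumerate(frames):
            V[:, :, f] = frame
    return V
```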
&lt;h3 id="cluster-level">Cluster-level&lt;/h3>
&lt;p>Now, if you have access to a cluster, which is not uncommon at universities and institutes nowadays, you could even consider modifying the algorithm to execute across multiple processors spanning numerous computers. A frequently employed strategy is &lt;a href="https://en.wikipedia.org/wiki/Multiprocessing" target="_blank" rel="noopener">multiprocessing&lt;/a>.&lt;/p>
&lt;p>With the concept of multithreading in mind, multiprocessing is easy to grasp: multiple processors operate simultaneously, but each one has access only to its own designated memory space. This fundamental distinction from multithreading requires some &amp;ldquo;coding maneuvers&amp;rdquo;, as users must now decide how data are allocated to individual processes. In the context of our example problem, implementing multiprocessing requires rather major changes to the code, contradicting the very impetus driving my blog posts. Therefore, I only provide a preliminary example in &lt;a href="">this GitHub repository of mine&lt;/a>.&lt;/p>
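&lt;p>To illustrate the kind of &amp;ldquo;coding maneuver&amp;rdquo; involved, a minimal Python sketch of the bookkeeping step, deciding which frames each process owns, might look like the following (the function name is my own, and this is only the partitioning, not a full multiprocessing implementation):&lt;/p>

```python
def partition_frames(num_frames, num_procs):
    # Split frame indices 0..num_frames-1 into num_procs contiguous chunks,
    # as evenly as possible, so each process holds in its own memory space
    # only the data for the frames it owns.
    base, extra = divmod(num_frames, num_procs)
    sizes = [base + 1] * extra + [base] * (num_procs - extra)
    chunks, start = [], 0
    for size in sizes:
        chunks.append(list(range(start, start + size)))
        start += size
    return chunks
```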
&lt;h3 id="key-consideration">Key consideration&lt;/h3>
&lt;p>
&lt;figure id="figure-communication-overhead-vs-computational-cost">
&lt;div class="d-flex justify-content-center">
&lt;div class="w-100" >&lt;img alt="Communication overhead vs. computational cost." srcset="
/post/optimizationp-ii/fig1_hu8afa2284ccdba917dbb6eb0b766fe8fd_25065_d1f81b9e56e3bfa579736f62e0348575.webp 400w,
/post/optimizationp-ii/fig1_hu8afa2284ccdba917dbb6eb0b766fe8fd_25065_31b7c465da11fe051b0f382fb637da15.webp 760w,
/post/optimizationp-ii/fig1_hu8afa2284ccdba917dbb6eb0b766fe8fd_25065_1200x1200_fit_q75_h2_lanczos_3.webp 1200w"
src="https://lancexwq.netlify.app/post/optimizationp-ii/fig1_hu8afa2284ccdba917dbb6eb0b766fe8fd_25065_d1f81b9e56e3bfa579736f62e0348575.webp"
width="760"
height="302"
loading="lazy" data-zoomable />&lt;/div>
&lt;/div>&lt;figcaption>
Communication overhead vs. computational cost
&lt;/figcaption>&lt;/figure>
&lt;/p>
&lt;p>While I have aimed to keep the discussion in this blog at a surface level, it is entirely reasonable to feel unsure when deciding upon a parallelization scheme. &amp;#x1f604; The crucial factor to bear in mind is that engaging more processors also increases communication overhead, and this overhead can outweigh the performance gained from distributing tasks.&lt;/p>
&lt;p>As of the post date of this blog, it is generally advisable to experiment with SIMD and multithreading in your code, as they are relatively easy to test. Multiprocessing, on the other hand, is worth considering only when each discrete task takes several seconds to execute and inter-process communication remains minimal.&lt;/p>
&lt;p>Although it has been a long journey, our quest remains incomplete. There is one more concept, one that has been gaining popularity in recent years, that we can test. In the third part of my blog, I will discuss GPU computation.&lt;/p>
&lt;div class="footnotes" role="doc-endnotes">
&lt;hr>
&lt;ol>
&lt;li id="fn:1">
&lt;p>&lt;code>view(x, :, f)&lt;/code> serves the same purpose as &lt;code>x[:, f]&lt;/code> but avoids allocating a copy of the data.&amp;#160;&lt;a href="#fnref:1" class="footnote-backref" role="doc-backlink">&amp;#x21a9;&amp;#xfe0e;&lt;/a>&lt;/p>
&lt;/li>
&lt;li id="fn:2">
&lt;p>Please note that these concepts are not mutually exclusive.&amp;#160;&lt;a href="#fnref:2" class="footnote-backref" role="doc-backlink">&amp;#x21a9;&amp;#xfe0e;&lt;/a>&lt;/p>
&lt;/li>
&lt;li id="fn:3">
&lt;p>You should now recognize that SIMD is closely related to vectorization (introduced in &lt;a href="https://lancexwq.github.io/post/optimization-i/" target="_blank" rel="noopener">my previous blog&lt;/a>). In fact, vectorization constitutes a specific implementation of SIMD principles.&amp;#160;&lt;a href="#fnref:3" class="footnote-backref" role="doc-backlink">&amp;#x21a9;&amp;#xfe0e;&lt;/a>&lt;/p>
&lt;/li>
&lt;li id="fn:4">
&lt;p>For example, see &lt;a href="https://en.wikipedia.org/wiki/Hyper-threading" target="_blank" rel="noopener">hyper-threading&lt;/a>&amp;#160;&lt;a href="#fnref:4" class="footnote-backref" role="doc-backlink">&amp;#x21a9;&amp;#xfe0e;&lt;/a>&lt;/p>
&lt;/li>
&lt;li id="fn:5">
&lt;p>In order to enable multithreading, certain programming languages may require additional parameters during startup; &lt;a href="https://docs.julialang.org/en/v1/manual/multi-threading/" target="_blank" rel="noopener">this page&lt;/a> shows how to do it in Julia.&amp;#160;&lt;a href="#fnref:5" class="footnote-backref" role="doc-backlink">&amp;#x21a9;&amp;#xfe0e;&lt;/a>&lt;/p>
&lt;/li>
&lt;/ol>
&lt;/div></description></item><item><title>Optimization Techniques in Scientific Computing (Part I)</title><link>https://lancexwq.netlify.app/post/optimization-i/</link><pubDate>Fri, 28 Jul 2023 00:00:00 +0000</pubDate><guid>https://lancexwq.netlify.app/post/optimization-i/</guid><description>&lt;ul>
&lt;li>&lt;a href="#introduction">Introduction&lt;/a>&lt;/li>
&lt;li>&lt;a href="#description-of-the-problem">Description of the problem&lt;/a>&lt;/li>
&lt;li>&lt;a href="#a-naive-implementation">A naive implementation&lt;/a>&lt;/li>
&lt;li>&lt;a href="#optimization-ideas">Optimization ideas&lt;/a>
&lt;ul>
&lt;li>&lt;a href="#follow-memory-layout">Follow memory layout&lt;/a>&lt;/li>
&lt;li>&lt;a href="#reduce-memory-allocation">Reduce memory allocation&lt;/a>&lt;/li>
&lt;li>&lt;a href="#vectorization">Vectorization&lt;/a>&lt;/li>
&lt;/ul>
&lt;/li>
&lt;/ul>
&lt;h2 id="introduction">Introduction&lt;/h2>
&lt;p>Scientific research is inherently linked to the collection and analysis of data. In today&amp;rsquo;s world, the volume of data involved in most scientific research projects far exceeds what can be managed manually. As a result, scientific computing has become an essential requirement for conducting research.&lt;/p>
&lt;p>Despite the remarkable progress in modern programming languages and numerical tools, continually enhanced by scientists and software developers, non-experts still face challenges when it comes to conducting efficient computations on computers for specific research purposes. This can be attributed to the following factors:&lt;/p>
&lt;ul>
&lt;li>Achieving optimal code performance necessitates a comprehensive evaluation of various factors, including hardware specifications, software components, algorithmic efficiency, and the scale of computations.&lt;/li>
&lt;li>Scientists need to find a balance between computational efficiency and development efficiency. They cannot afford to spend excessive time conducting meticulous benchmarks and analyzing their statistics.&lt;/li>
&lt;li>While programming languages and numerical tools frequently offer extensive performance tips, they are typically presented in technical jargon, making them less accessible to non-experts.&lt;/li>
&lt;li>Additionally, the examples provided in these resources often lack interconnectedness, making it challenging to grasp their practical application collectively.&lt;/li>
&lt;/ul>
&lt;p>To tackle these concerns, this blog post aims to offer a concise overview of various general and highly effective optimization techniques that are relatively straightforward to implement. The focus will be on a problem I encountered during my research on single-molecule imaging. I will begin with a naive version of my code and gradually enhance its performance.&lt;/p>
&lt;p>&amp;#x2757; All the code below is provided as an interactive notebook on &lt;a href="https://github.com/lanceXwq/lancexwq.github.io/tree/main/content/post/optimization-I/scripts" target="_blank" rel="noopener">my GitHub&lt;/a>.&lt;/p>
&lt;h2 id="description-of-the-problem">Description of the problem&lt;/h2>
&lt;p>In my research, which involves applications such as &lt;a href="https://en.wikipedia.org/wiki/Super-resolution_microscopy" target="_blank" rel="noopener">super-resolution imaging&lt;/a>, I frequently need to simulate microscope images of individual molecules in the visible spectrum using photon-sensing devices. In this scenario, individual molecules can be accurately represented as point emitters, meaning they are so small that their structures and shapes become insignificant. However, from a physics standpoint, we cannot simply observe sharp, bright dots in microscope images due to two reasons:&lt;/p>
&lt;ul>
&lt;li>The diffraction of light causes a point object to appear as an expanded blur, often referred to as the &lt;a href="https://en.wikipedia.org/wiki/Point_spread_function" target="_blank" rel="noopener">point spread function (PSF)&lt;/a>.&lt;/li>
&lt;li>In images, PSFs are pixelated because the pixel sizes of the detectors are often comparable to the width of a PSF.&lt;/li>
&lt;/ul>
&lt;p>
&lt;figure id="figure-point-emitter-to-a-pixelated-image">
&lt;div class="d-flex justify-content-center">
&lt;div class="w-100" >&lt;img alt="A comparison of a point emitter, its Gaussian PSF, and the actual pixelated image." srcset="
/post/optimization-i/fig1_hu9b0e52963e2938886e98fe264543e483_16989_67d74e546d28d9422d922e069dcc5c3f.webp 400w,
/post/optimization-i/fig1_hu9b0e52963e2938886e98fe264543e483_16989_7a120171fe7dae3d84c3aa4d1c096184.webp 760w,
/post/optimization-i/fig1_hu9b0e52963e2938886e98fe264543e483_16989_1200x1200_fit_q75_h2_lanczos_3.webp 1200w"
src="https://lancexwq.netlify.app/post/optimization-i/fig1_hu9b0e52963e2938886e98fe264543e483_16989_67d74e546d28d9422d922e069dcc5c3f.webp"
width="760"
height="228"
loading="lazy" data-zoomable />&lt;/div>
&lt;/div>&lt;figcaption>
Point emitter to a pixelated image
&lt;/figcaption>&lt;/figure>
&lt;/p>
&lt;p>Simulating a single-molecule image involves converting point emitters into their corresponding PSFs and then pixelating the entire image. In these simulations (as well as in the corresponding experimental setups), it is often reasonable to assume that the point emitters are sufficiently far apart from each other, allowing for independent photon emissions. This means that we can calculate the PSF for each molecule individually and combine them.&lt;/p>
&lt;p>Since providing an accurate and detailed simulation process is beyond the scope of this blog, we will make the following approximations:&lt;/p>
&lt;ul>
&lt;li>The pixel size is so small that the impact of pixelization can be considered negligible.&lt;/li>
&lt;li>The PSF is a 2D Gaussian. In other words, for a molecule located at coordinates \((x_n, y_n)\), its influence on the pixel at \((x^p_i, y^p_j)\) is determined by \[PSF_{ijn}=\exp[-(x^p_i-x_n)^2-(y^p_j-y_n)^2]\] where \(n\) represents the molecule index, and \(i\) and \(j\) denote the pixel indices.&lt;/li>
&lt;/ul>
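&lt;p>Under these two approximations, the formula translates directly into a few lines of array code. As a hedged sketch in NumPy (the blog itself uses Julia, and the function name here is mine), the summed-PSF image is:&lt;/p>

```python
import numpy as np

def psf_image(xp, yp, x, y):
    # PSF_ijn = exp(-(xp_i - x_n)**2 - (yp_j - y_n)**2), then sum over
    # the molecule index n to combine independent emitters
    dx2 = (xp[:, None] - x[None, :]) ** 2  # (pixels_x, molecules)
    dy2 = (yp[:, None] - y[None, :]) ** 2  # (pixels_y, molecules)
    return np.exp(-(dx2[:, None, :] + dy2[None, :, :])).sum(axis=2)
```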
&lt;h2 id="a-naive-implementation">A naive implementation&lt;/h2>
&lt;p>Now, let&amp;rsquo;s move forward with writing a straightforward simulation code. To accomplish this, I will utilize &lt;a href="https://julialang.org/" target="_blank" rel="noopener">Julia&lt;/a> as the preferred programming language. As mentioned previously, we need to compute the PSF value for every \(n\), \(i\), and \(j\). Translating this sentence into code results in the following:&lt;/p>
&lt;div class="highlight">&lt;pre tabindex="0" class="chroma">&lt;code class="language-julia" data-lang="julia">&lt;span class="line">&lt;span class="cl">&lt;span class="k">function&lt;/span> &lt;span class="n">image_sim_v1&lt;/span>&lt;span class="p">(&lt;/span>&lt;span class="n">xᵖ&lt;/span>&lt;span class="p">,&lt;/span> &lt;span class="n">yᵖ&lt;/span>&lt;span class="p">,&lt;/span> &lt;span class="n">x&lt;/span>&lt;span class="p">,&lt;/span> &lt;span class="n">y&lt;/span>&lt;span class="p">)&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl"> &lt;span class="n">PSF&lt;/span> &lt;span class="o">=&lt;/span> &lt;span class="n">zeros&lt;/span>&lt;span class="p">(&lt;/span>&lt;span class="n">length&lt;/span>&lt;span class="p">(&lt;/span>&lt;span class="n">x&lt;/span>&lt;span class="p">),&lt;/span> &lt;span class="n">length&lt;/span>&lt;span class="p">(&lt;/span>&lt;span class="n">xᵖ&lt;/span>&lt;span class="p">),&lt;/span> &lt;span class="n">length&lt;/span>&lt;span class="p">(&lt;/span>&lt;span class="n">yᵖ&lt;/span>&lt;span class="p">))&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl"> &lt;span class="k">for&lt;/span> &lt;span class="n">n&lt;/span> &lt;span class="k">in&lt;/span> &lt;span class="n">eachindex&lt;/span>&lt;span class="p">(&lt;/span>&lt;span class="n">x&lt;/span>&lt;span class="p">),&lt;/span> &lt;span class="n">i&lt;/span> &lt;span class="k">in&lt;/span> &lt;span class="n">eachindex&lt;/span>&lt;span class="p">(&lt;/span>&lt;span class="n">xᵖ&lt;/span>&lt;span class="p">),&lt;/span> &lt;span class="n">j&lt;/span> &lt;span class="k">in&lt;/span> &lt;span class="n">eachindex&lt;/span>&lt;span class="p">(&lt;/span>&lt;span class="n">yᵖ&lt;/span>&lt;span class="p">)&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl"> &lt;span class="n">PSF&lt;/span>&lt;span class="p">[&lt;/span>&lt;span class="n">n&lt;/span>&lt;span class="p">,&lt;/span> &lt;span class="n">i&lt;/span>&lt;span class="p">,&lt;/span> &lt;span class="n">j&lt;/span>&lt;span class="p">]&lt;/span> &lt;span class="o">=&lt;/span> &lt;span class="n">exp&lt;/span>&lt;span class="p">(&lt;/span>&lt;span class="o">-&lt;/span>&lt;span class="p">(&lt;/span>&lt;span class="n">xᵖ&lt;/span>&lt;span class="p">[&lt;/span>&lt;span class="n">i&lt;/span>&lt;span class="p">]&lt;/span> &lt;span class="o">-&lt;/span> &lt;span class="n">x&lt;/span>&lt;span class="p">[&lt;/span>&lt;span class="n">n&lt;/span>&lt;span class="p">])&lt;/span>&lt;span class="o">^&lt;/span>&lt;span class="mi">2&lt;/span> &lt;span class="o">-&lt;/span> &lt;span class="p">(&lt;/span>&lt;span class="n">yᵖ&lt;/span>&lt;span class="p">[&lt;/span>&lt;span class="n">j&lt;/span>&lt;span class="p">]&lt;/span> &lt;span class="o">-&lt;/span> &lt;span class="n">y&lt;/span>&lt;span class="p">[&lt;/span>&lt;span class="n">n&lt;/span>&lt;span class="p">])&lt;/span>&lt;span class="o">^&lt;/span>&lt;span class="mi">2&lt;/span>&lt;span class="p">)&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl"> &lt;span class="k">end&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl"> &lt;span class="k">return&lt;/span> &lt;span class="n">dropdims&lt;/span>&lt;span class="p">(&lt;/span>&lt;span class="n">sum&lt;/span>&lt;span class="p">(&lt;/span>&lt;span class="n">PSF&lt;/span>&lt;span class="p">,&lt;/span> &lt;span class="n">dims&lt;/span> &lt;span class="o">=&lt;/span> &lt;span class="mi">1&lt;/span>&lt;span class="p">),&lt;/span> &lt;span class="n">dims&lt;/span> &lt;span class="o">=&lt;/span> &lt;span class="mi">1&lt;/span>&lt;span class="p">)&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&lt;span class="k">end&lt;/span>
&lt;/span>&lt;/span>&lt;/code>&lt;/pre>&lt;/div>&lt;p>For a test case with 20 emitting molecules in a 256\(\times\)256 image, this short function does get the job done (see the image below). A brief benchmark of this function yields &lt;code>13.827 ms (7 allocations: 10.50 MiB)&lt;/code>.&lt;/p>
&lt;p>
&lt;figure id="figure-many-point-emitters-to-the-final-image">
&lt;div class="d-flex justify-content-center">
&lt;div class="w-100" >&lt;img alt="Many point emitters to the final image." srcset="
/post/optimization-i/fig2_huab7a496d2b00d624a24ee68abca9161c_29393_db43ef5c1ddfba68df58f6d0fc7123dc.webp 400w,
/post/optimization-i/fig2_huab7a496d2b00d624a24ee68abca9161c_29393_f6e9aa77ea8983c3843e5c515160975e.webp 760w,
/post/optimization-i/fig2_huab7a496d2b00d624a24ee68abca9161c_29393_1200x1200_fit_q75_h2_lanczos_3.webp 1200w"
src="https://lancexwq.netlify.app/post/optimization-i/fig2_huab7a496d2b00d624a24ee68abca9161c_29393_db43ef5c1ddfba68df58f6d0fc7123dc.webp"
width="750"
height="300"
loading="lazy" data-zoomable />&lt;/div>
&lt;/div>&lt;figcaption>
Many point emitters to the final image
&lt;/figcaption>&lt;/figure>
&lt;/p>
&lt;h2 id="optimization-ideas">Optimization ideas&lt;/h2>
&lt;p>Now we can begin the process of optimizing &lt;code>image_sim_v1&lt;/code>. Let&amp;rsquo;s start with some simple modifications, then move on to more involved techniques.&lt;/p>
&lt;h3 id="follow-memory-layout">Follow memory layout&lt;/h3>
&lt;p>Before explaining anything in words, let&amp;rsquo;s take a look at the following code:&lt;/p>
&lt;div class="highlight">&lt;pre tabindex="0" class="chroma">&lt;code class="language-julia" data-lang="julia">&lt;span class="line">&lt;span class="cl">&lt;span class="k">function&lt;/span> &lt;span class="n">image_sim_v2&lt;/span>&lt;span class="p">(&lt;/span>&lt;span class="n">xᵖ&lt;/span>&lt;span class="p">,&lt;/span> &lt;span class="n">yᵖ&lt;/span>&lt;span class="p">,&lt;/span> &lt;span class="n">x&lt;/span>&lt;span class="p">,&lt;/span> &lt;span class="n">y&lt;/span>&lt;span class="p">)&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl"> &lt;span class="n">PSF&lt;/span> &lt;span class="o">=&lt;/span> &lt;span class="n">zeros&lt;/span>&lt;span class="p">(&lt;/span>&lt;span class="n">length&lt;/span>&lt;span class="p">(&lt;/span>&lt;span class="n">x&lt;/span>&lt;span class="p">),&lt;/span> &lt;span class="n">length&lt;/span>&lt;span class="p">(&lt;/span>&lt;span class="n">xᵖ&lt;/span>&lt;span class="p">),&lt;/span> &lt;span class="n">length&lt;/span>&lt;span class="p">(&lt;/span>&lt;span class="n">yᵖ&lt;/span>&lt;span class="p">))&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl"> &lt;span class="k">for&lt;/span> &lt;span class="n">j&lt;/span> &lt;span class="k">in&lt;/span> &lt;span class="n">eachindex&lt;/span>&lt;span class="p">(&lt;/span>&lt;span class="n">yᵖ&lt;/span>&lt;span class="p">),&lt;/span> &lt;span class="n">i&lt;/span> &lt;span class="k">in&lt;/span> &lt;span class="n">eachindex&lt;/span>&lt;span class="p">(&lt;/span>&lt;span class="n">xᵖ&lt;/span>&lt;span class="p">),&lt;/span> &lt;span class="n">n&lt;/span> &lt;span class="k">in&lt;/span> &lt;span class="n">eachindex&lt;/span>&lt;span class="p">(&lt;/span>&lt;span class="n">x&lt;/span>&lt;span class="p">)&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl"> &lt;span class="n">PSF&lt;/span>&lt;span class="p">[&lt;/span>&lt;span class="n">n&lt;/span>&lt;span class="p">,&lt;/span> &lt;span class="n">i&lt;/span>&lt;span class="p">,&lt;/span> &lt;span class="n">j&lt;/span>&lt;span class="p">]&lt;/span> &lt;span class="o">=&lt;/span> &lt;span class="n">exp&lt;/span>&lt;span class="p">(&lt;/span>&lt;span class="o">-&lt;/span>&lt;span class="p">(&lt;/span>&lt;span class="n">xᵖ&lt;/span>&lt;span class="p">[&lt;/span>&lt;span class="n">i&lt;/span>&lt;span class="p">]&lt;/span> &lt;span class="o">-&lt;/span> &lt;span class="n">x&lt;/span>&lt;span class="p">[&lt;/span>&lt;span class="n">n&lt;/span>&lt;span class="p">])&lt;/span>&lt;span class="o">^&lt;/span>&lt;span class="mi">2&lt;/span> &lt;span class="o">-&lt;/span> &lt;span class="p">(&lt;/span>&lt;span class="n">yᵖ&lt;/span>&lt;span class="p">[&lt;/span>&lt;span class="n">j&lt;/span>&lt;span class="p">]&lt;/span> &lt;span class="o">-&lt;/span> &lt;span class="n">y&lt;/span>&lt;span class="p">[&lt;/span>&lt;span class="n">n&lt;/span>&lt;span class="p">])&lt;/span>&lt;span class="o">^&lt;/span>&lt;span class="mi">2&lt;/span>&lt;span class="p">)&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl"> &lt;span class="k">end&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl"> &lt;span class="k">return&lt;/span> &lt;span class="n">dropdims&lt;/span>&lt;span class="p">(&lt;/span>&lt;span class="n">sum&lt;/span>&lt;span class="p">(&lt;/span>&lt;span class="n">PSF&lt;/span>&lt;span class="p">,&lt;/span> &lt;span class="n">dims&lt;/span> &lt;span class="o">=&lt;/span> &lt;span class="mi">1&lt;/span>&lt;span class="p">),&lt;/span> &lt;span class="n">dims&lt;/span> &lt;span class="o">=&lt;/span> &lt;span class="mi">1&lt;/span>&lt;span class="p">)&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&lt;span class="k">end&lt;/span>
&lt;/span>&lt;/span>&lt;/code>&lt;/pre>&lt;/div>&lt;p>Comparing &lt;code>image_sim_v2&lt;/code> with &lt;code>image_sim_v1&lt;/code>, the only change is the order of the nested for-loops. Yet benchmarking &lt;code>image_sim_v2&lt;/code> records &lt;code>8.362 ms (7 allocations: 10.50 MiB)&lt;/code>, a performance improvement of over 30% from this seemingly insignificant modification!&lt;/p>
&lt;p>The explanation is straightforward: variables are stored in a computer&amp;rsquo;s memory, and accessing this memory takes time. Objects such as arrays are usually stored in a contiguous block of memory, and accessing their elements in the order they are stored is naturally faster. Julia stores arrays in column-major order, meaning the first index varies fastest in memory. By interchanging the order of the &lt;code>j&lt;/code>-loop and the &lt;code>n&lt;/code>-loop, the innermost loop of &lt;code>image_sim_v2&lt;/code> always operates on a contiguous memory block. Note that different programming languages may use different memory layout conventions (for instance, C and NumPy default to row-major order), so it&amp;rsquo;s advisable to consult the documentation for specific details.&lt;/p>
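&lt;p>To make the effect concrete, here is a minimal sketch of the same loop-order experiment in isolation (the function names &lt;code>fill_colmajor!&lt;/code> and &lt;code>fill_rowmajor!&lt;/code> are illustrative, not from the simulation code):&lt;/p>
&lt;div class="highlight">&lt;pre tabindex="0" class="chroma">&lt;code class="language-julia" data-lang="julia">using BenchmarkTools

A = zeros(1000, 1000)

# Innermost loop runs over the first index: contiguous access in Julia
function fill_colmajor!(A)
    for j in axes(A, 2), i in axes(A, 1)
        A[i, j] = i + j
    end
end

# Innermost loop runs over the second index: strided access
function fill_rowmajor!(A)
    for i in axes(A, 1), j in axes(A, 2)
        A[i, j] = i + j
    end
end

@btime fill_colmajor!($A)  # typically several times faster than the strided version
@btime fill_rowmajor!($A)
&lt;/code>&lt;/pre>&lt;/div>
&lt;p>The two functions compute identical results; only the memory-access pattern differs.&lt;/p>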
&lt;h3 id="reduce-memory-allocation">Reduce memory allocation&lt;/h3>
&lt;p>Similar to accessing memory, memory allocation can also be a time-consuming process. In general, implementing the same algorithm with reduced memory allocation results in improved performance. Furthermore, this improvement tends to be more pronounced when dealing with larger datasets.&lt;/p>
&lt;p>Is there unnecessary memory allocation in &lt;code>image_sim_v2&lt;/code>? The answer is yes. It should be noted that there is no need to store the PSF of each molecule, as we are solely concerned with the final image. Consequently, we can allocate memory for just one image and update the value of each pixel:&lt;/p>
&lt;div class="highlight">&lt;pre tabindex="0" class="chroma">&lt;code class="language-julia" data-lang="julia">&lt;span class="line">&lt;span class="cl">&lt;span class="k">function&lt;/span> &lt;span class="n">image_sim_v3&lt;/span>&lt;span class="p">(&lt;/span>&lt;span class="n">xᵖ&lt;/span>&lt;span class="p">,&lt;/span> &lt;span class="n">yᵖ&lt;/span>&lt;span class="p">,&lt;/span> &lt;span class="n">x&lt;/span>&lt;span class="p">,&lt;/span> &lt;span class="n">y&lt;/span>&lt;span class="p">)&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl"> &lt;span class="n">PSF&lt;/span> &lt;span class="o">=&lt;/span> &lt;span class="n">zeros&lt;/span>&lt;span class="p">(&lt;/span>&lt;span class="n">length&lt;/span>&lt;span class="p">(&lt;/span>&lt;span class="n">xᵖ&lt;/span>&lt;span class="p">),&lt;/span> &lt;span class="n">length&lt;/span>&lt;span class="p">(&lt;/span>&lt;span class="n">yᵖ&lt;/span>&lt;span class="p">))&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl"> &lt;span class="k">for&lt;/span> &lt;span class="n">j&lt;/span> &lt;span class="k">in&lt;/span> &lt;span class="n">eachindex&lt;/span>&lt;span class="p">(&lt;/span>&lt;span class="n">yᵖ&lt;/span>&lt;span class="p">),&lt;/span> &lt;span class="n">i&lt;/span> &lt;span class="k">in&lt;/span> &lt;span class="n">eachindex&lt;/span>&lt;span class="p">(&lt;/span>&lt;span class="n">xᵖ&lt;/span>&lt;span class="p">),&lt;/span> &lt;span class="n">n&lt;/span> &lt;span class="k">in&lt;/span> &lt;span class="n">eachindex&lt;/span>&lt;span class="p">(&lt;/span>&lt;span class="n">x&lt;/span>&lt;span class="p">)&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl"> &lt;span class="n">PSF&lt;/span>&lt;span class="p">[&lt;/span>&lt;span class="n">i&lt;/span>&lt;span class="p">,&lt;/span> &lt;span class="n">j&lt;/span>&lt;span class="p">]&lt;/span> &lt;span class="o">+=&lt;/span> &lt;span class="n">exp&lt;/span>&lt;span class="p">(&lt;/span>&lt;span class="o">-&lt;/span>&lt;span class="p">(&lt;/span>&lt;span class="n">xᵖ&lt;/span>&lt;span class="p">[&lt;/span>&lt;span class="n">i&lt;/span>&lt;span class="p">]&lt;/span> &lt;span class="o">-&lt;/span> &lt;span class="n">x&lt;/span>&lt;span class="p">[&lt;/span>&lt;span class="n">n&lt;/span>&lt;span class="p">])&lt;/span>&lt;span class="o">^&lt;/span>&lt;span class="mi">2&lt;/span> &lt;span class="o">-&lt;/span> &lt;span class="p">(&lt;/span>&lt;span class="n">yᵖ&lt;/span>&lt;span class="p">[&lt;/span>&lt;span class="n">j&lt;/span>&lt;span class="p">]&lt;/span> &lt;span class="o">-&lt;/span> &lt;span class="n">y&lt;/span>&lt;span class="p">[&lt;/span>&lt;span class="n">n&lt;/span>&lt;span class="p">])&lt;/span>&lt;span class="o">^&lt;/span>&lt;span class="mi">2&lt;/span>&lt;span class="p">)&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl"> &lt;span class="k">end&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl"> &lt;span class="k">return&lt;/span> &lt;span class="n">PSF&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&lt;span class="k">end&lt;/span>
&lt;/span>&lt;/span>&lt;/code>&lt;/pre>&lt;/div>&lt;p>Benchmarking &lt;code>image_sim_v3&lt;/code> recorded &lt;code>8.290 ms (2 allocations: 512.05 KiB)&lt;/code>. While there is room to further optimize memory usage, such as exploring &amp;ldquo;mutating functions&amp;rdquo;, pursuing this path yields diminishing returns: although &lt;code>image_sim_v3&lt;/code> reduced memory allocation by a factor of 20, computation time decreased by less than 0.3 ms. This outcome was expected since the test case was intentionally designed to be small. Therefore, it is now time to focus on algorithmic optimizations.&lt;/p>
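&lt;p>For completeness, a mutating version might look like the following sketch (the trailing &lt;code>!&lt;/code> is the Julia convention for functions that modify their arguments; the name &lt;code>image_sim!&lt;/code> is my own, not from the benchmarks above):&lt;/p>
&lt;div class="highlight">&lt;pre tabindex="0" class="chroma">&lt;code class="language-julia" data-lang="julia"># The caller allocates the image buffer once and reuses it across calls
function image_sim!(I, xᵖ, yᵖ, x, y)
    fill!(I, 0)
    for j in eachindex(yᵖ), i in eachindex(xᵖ), n in eachindex(x)
        I[i, j] += exp(-(xᵖ[i] - x[n])^2 - (yᵖ[j] - y[n])^2)
    end
    return I
end

I = zeros(length(xᵖ), length(yᵖ))  # allocate once
image_sim!(I, xᵖ, yᵖ, x, y)        # no allocations inside the call
&lt;/code>&lt;/pre>&lt;/div>
&lt;p>This matters most when the same function is called many times, e.g. once per video frame, since the buffer is allocated only once.&lt;/p>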
&lt;h3 id="vectorization">Vectorization&lt;/h3>
&lt;p>Vectorization is arguably the most crucial technique discussed in this blog. Its concept is straightforward: execute computations in a manner that aligns with array (matrix or vector) operations (such as matrix multiplication and element-wise operations). By adopting this approach, we can achieve two significant advantages:&lt;/p>
&lt;ul>
&lt;li>Eliminate the need for slow for-loops, which tend to hinder performance in languages like Python or MATLAB.&lt;/li>
&lt;li>Leverage optimized (or even parallelized) routines that greatly enhance efficiency.&lt;/li>
&lt;/ul>
&lt;p>Vectorizing an algorithm is conceptually simple; the harder part is often recognizing the opportunity to apply it. In my experience, vectorization should at least be attempted whenever for-loops are involved.&lt;/p>
&lt;p>As a specific example, I will describe the thought process for my problem, which currently has three nested for-loops. First, from &lt;a href="#description-of-the-problem">this section&lt;/a>, we know the final image, denoted as \(I\), is obtained through &lt;/p>
\[I_{ij}=\sum_n PSF_{ijn},\]
&lt;p> but we can also write &lt;/p>
\[PSF_{ijn}=PSF^x_{in}PSF^y_{nj}\]
&lt;p> where &lt;/p>
\[PSF^x_{in}=\exp[-(x^p_i-x_n)^2]~\text{and}~PSF^y_{nj}=\exp[-(y^p_j-y_n)^2].\]
&lt;p> Therefore, we have &lt;/p>
\[I_{ij}=\sum_n PSF^x_{in}PSF^y_{nj}.\]
&lt;p>After this brief reorganization of the math, we have arrived at an expression that is exactly a matrix multiplication! Now it is quite clear how to proceed:&lt;/p>
&lt;ol>
&lt;li>Construct two matrices, \(PSF^x\) and \(PSF^y\), with array subtraction&lt;sup id="fnref:1">&lt;a href="#fn:1" class="footnote-ref" role="doc-noteref">1&lt;/a>&lt;/sup>, element-wise square, and element-wise exponential.&lt;/li>
&lt;li>Perform a matrix multiplication between \(PSF^x\) and \(PSF^y\).&lt;/li>
&lt;/ol>
&lt;p>
&lt;figure id="figure-vectorized-psf-calculation">
&lt;div class="d-flex justify-content-center">
&lt;div class="w-100" >&lt;img alt="A simple vectorization scheme." srcset="
/post/optimization-i/fig_vec_huf89d4ba5ab0989bcacb7b454c0316b33_22540_a88c5f0091cf27d6de126925d8156731.webp 400w,
/post/optimization-i/fig_vec_huf89d4ba5ab0989bcacb7b454c0316b33_22540_c44facfd7ff0afe9a2b98ab1ee8d8301.webp 760w,
/post/optimization-i/fig_vec_huf89d4ba5ab0989bcacb7b454c0316b33_22540_1200x1200_fit_q75_h2_lanczos_3.webp 1200w"
src="https://lancexwq.netlify.app/post/optimization-i/fig_vec_huf89d4ba5ab0989bcacb7b454c0316b33_22540_a88c5f0091cf27d6de126925d8156731.webp"
width="760"
height="345"
loading="lazy" data-zoomable />&lt;/div>
&lt;/div>&lt;figcaption>
Vectorized PSF calculation
&lt;/figcaption>&lt;/figure>
&lt;/p>
&lt;p>In Julia code, we have&lt;/p>
&lt;div class="highlight">&lt;pre tabindex="0" class="chroma">&lt;code class="language-julia" data-lang="julia">&lt;span class="line">&lt;span class="cl">&lt;span class="k">function&lt;/span> &lt;span class="n">image_sim_v4&lt;/span>&lt;span class="p">(&lt;/span>&lt;span class="n">xᵖ&lt;/span>&lt;span class="p">,&lt;/span> &lt;span class="n">yᵖ&lt;/span>&lt;span class="p">,&lt;/span> &lt;span class="n">x&lt;/span>&lt;span class="p">,&lt;/span> &lt;span class="n">y&lt;/span>&lt;span class="p">)&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl"> &lt;span class="n">PSFˣ&lt;/span> &lt;span class="o">=&lt;/span> &lt;span class="n">exp&lt;/span>&lt;span class="o">.&lt;/span>&lt;span class="p">(&lt;/span>&lt;span class="o">-&lt;/span>&lt;span class="p">(&lt;/span>&lt;span class="n">xᵖ&lt;/span> &lt;span class="o">.-&lt;/span> &lt;span class="n">Transpose&lt;/span>&lt;span class="p">(&lt;/span>&lt;span class="n">x&lt;/span>&lt;span class="p">))&lt;/span> &lt;span class="o">.^&lt;/span> &lt;span class="mi">2&lt;/span>&lt;span class="p">)&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl"> &lt;span class="n">PSFʸ&lt;/span> &lt;span class="o">=&lt;/span> &lt;span class="n">exp&lt;/span>&lt;span class="o">.&lt;/span>&lt;span class="p">(&lt;/span>&lt;span class="o">-&lt;/span>&lt;span class="p">(&lt;/span>&lt;span class="n">y&lt;/span> &lt;span class="o">.-&lt;/span> &lt;span class="n">Transpose&lt;/span>&lt;span class="p">(&lt;/span>&lt;span class="n">yᵖ&lt;/span>&lt;span class="p">))&lt;/span> &lt;span class="o">.^&lt;/span> &lt;span class="mi">2&lt;/span>&lt;span class="p">)&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl"> &lt;span class="k">return&lt;/span> &lt;span class="n">PSFˣ&lt;/span> &lt;span class="o">*&lt;/span> &lt;span class="n">PSFʸ&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&lt;span class="k">end&lt;/span>
&lt;/span>&lt;/span>&lt;/code>&lt;/pre>&lt;/div>&lt;p>&lt;code>image_sim_v4&lt;/code>&amp;rsquo;s benchmark recorded &lt;code>101.157 μs (14 allocations: 752.33 KiB)&lt;/code>, 80x faster than &lt;code>image_sim_v3&lt;/code>!&lt;/p>
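&lt;p>When refactoring like this, it is worth checking that the vectorized version agrees with the loop version. A sanity check along these lines (my own addition; the grids and emitter positions below are illustrative) could be:&lt;/p>
&lt;div class="highlight">&lt;pre tabindex="0" class="chroma">&lt;code class="language-julia" data-lang="julia">xᵖ, yᵖ = range(-5, 5, 100), range(-5, 5, 100)  # example pixel grids
x, y = randn(50), randn(50)                     # example emitter positions

# Floating-point results differ slightly between the loop and BLAS versions,
# so compare with isapprox (≈) rather than exact equality
@assert image_sim_v3(xᵖ, yᵖ, x, y) ≈ image_sim_v4(xᵖ, yᵖ, x, y)
&lt;/code>&lt;/pre>&lt;/div>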
&lt;p>At this stage, we have essentially reached the limit of potential improvements for this simple example. Additional optimizations could involve the utilization of hardware-specific math libraries&lt;sup id="fnref:2">&lt;a href="#fn:2" class="footnote-ref" role="doc-noteref">2&lt;/a>&lt;/sup> and datatype-specific operations, but these aspects are beyond the scope of this blog. However, this does not signal the end of our discussion, as we can introduce a slightly more complex (and realistic) example that allows us to explore more advanced optimization techniques. I will continue this discussion in my next blog post.&lt;/p>
&lt;div class="footnotes" role="doc-endnotes">
&lt;hr>
&lt;ol>
&lt;li id="fn:1">
&lt;p>Refer to &lt;a href="https://www.mathworks.com/help/matlab/matlab_prog/compatible-array-sizes-for-basic-operations.html" target="_blank" rel="noopener">this webpage&lt;/a> for compatible array sizes regarding array subtraction and more.&amp;#160;&lt;a href="#fnref:1" class="footnote-backref" role="doc-backlink">&amp;#x21a9;&amp;#xfe0e;&lt;/a>&lt;/p>
&lt;/li>
&lt;li id="fn:2">
&lt;p>Such as Intel Math Kernel Library (MKL) and AMD Optimizing CPU Libraries (AOCL).&amp;#160;&lt;a href="#fnref:2" class="footnote-backref" role="doc-backlink">&amp;#x21a9;&amp;#xfe0e;&lt;/a>&lt;/p>
&lt;/li>
&lt;/ol>
&lt;/div></description></item></channel></rss>