<?xml version="1.0" encoding="utf-8" standalone="yes"?><rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom"><channel><title>Julia | Lance Xu</title><link>https://lancexwq.netlify.app/category/julia/</link><atom:link href="https://lancexwq.netlify.app/category/julia/index.xml" rel="self" type="application/rss+xml"/><description>Julia</description><generator>Hugo Blox Builder (https://hugoblox.com)</generator><language>en-us</language><lastBuildDate>Wed, 30 Aug 2023 00:00:00 +0000</lastBuildDate><image><url>https://lancexwq.netlify.app/media/icon_hu0b7a4cb9992c9ac0e91bd28ffd38dd00_9727_512x512_fill_lanczos_center_3.png</url><title>Julia</title><link>https://lancexwq.netlify.app/category/julia/</link></image><item><title>Categorical distribution and Gumbel distribution</title><link>https://lancexwq.netlify.app/post/categorical-gumbel/</link><pubDate>Wed, 30 Aug 2023 00:00:00 +0000</pubDate><guid>https://lancexwq.netlify.app/post/categorical-gumbel/</guid><description>&lt;ul>
&lt;li>&lt;a href="#categorical-distribution">Categorical distribution&lt;/a>&lt;/li>
&lt;li>&lt;a href="#sample-from-a-categorical-distribution">Sample from a categorical distribution&lt;/a>&lt;/li>
&lt;li>&lt;a href="#unnormalized-probabilities">Unnormalized probabilities&lt;/a>&lt;/li>
&lt;li>&lt;a href="#log-probabilities">Log probabilities&lt;/a>
&lt;ul>
&lt;li>&lt;a href="#logsumexp">LogSumExp&lt;/a>&lt;/li>
&lt;li>&lt;a href="#softmax">Softmax&lt;/a>&lt;/li>
&lt;/ul>
&lt;/li>
&lt;li>&lt;a href="#the-standard-gumbel-distribution">The standard Gumbel distribution&lt;/a>
&lt;ul>
&lt;li>&lt;a href="#sampling-from-the-standard-gumbel-distribution">Sampling from the standard Gumbel distribution&lt;/a>&lt;/li>
&lt;li>&lt;a href="#the-gumbel-max-trick">The Gumbel-Max trick&lt;/a>&lt;/li>
&lt;/ul>
&lt;/li>
&lt;li>&lt;a href="#additional-notes">Additional notes&lt;/a>&lt;/li>
&lt;/ul>
&lt;h2 id="categorical-distribution">Categorical distribution&lt;/h2>
&lt;p>The categorical distribution is probably the most common probability distribution, appearing across a broad spectrum of scenarios: from classic dice-based games to image classification in computer vision. At its core, the concept is straightforward: a categorical distribution describes the probabilities of occurrence of the possible outcomes (or &amp;ldquo;categories&amp;rdquo;) of an event. In fact, nearly every probability distribution with discrete outcomes can be viewed as a special case of the categorical distribution, with the event probabilities specified by a particular function.&lt;/p>
&lt;h2 id="sample-from-a-categorical-distribution">Sample from a categorical distribution&lt;/h2>
&lt;p>When working with categorical distributions, we often need to simulate their outcomes, or more precisely, to sample from these distributions. The most fundamental and straightforward sampling scheme for a categorical distribution is known as &amp;ldquo;stick-breaking&amp;rdquo;, as illustrated in the figure below.&lt;/p>
&lt;p>
&lt;figure id="figure-stick-breaking-algorithm">
&lt;div class="d-flex justify-content-center">
&lt;div class="w-100" >&lt;img alt="Stick breaking algorithm." srcset="
/post/categorical-gumbel/stick-breaking_hu77ddd6e37f00d3114bb7e3e575051f5f_2559_f63a46f143ee8aecf2588a165041d684.webp 400w,
/post/categorical-gumbel/stick-breaking_hu77ddd6e37f00d3114bb7e3e575051f5f_2559_76d763d0437b76d9f99d3f1c8a345c59.webp 760w,
/post/categorical-gumbel/stick-breaking_hu77ddd6e37f00d3114bb7e3e575051f5f_2559_1200x1200_fit_q75_h2_lanczos_3.webp 1200w"
src="https://lancexwq.netlify.app/post/categorical-gumbel/stick-breaking_hu77ddd6e37f00d3114bb7e3e575051f5f_2559_f63a46f143ee8aecf2588a165041d684.webp"
width="700"
height="200"
loading="lazy" data-zoomable />&lt;/div>
&lt;/div>&lt;figcaption>
Stick breaking algorithm
&lt;/figcaption>&lt;/figure>
&lt;/p>
&lt;p>To understand stick-breaking with this figure, consider the event of breaking a one-unit-length stick. The stick is partitioned into discrete regions, each uniquely colored. Our objective is to select a specific location at which to break the stick, essentially determining the outcome of the event. Assuming an unbiased selection process, the probability of breaking the stick within a particular region corresponds precisely to the size of that region, regardless of how the regions are arranged.&lt;/p>
&lt;p>Therefore, to sample from any categorical distribution:&lt;/p>
&lt;ol>
&lt;li>Provide all event probabilities as entries of a vector&lt;sup id="fnref:1">&lt;a href="#fn:1" class="footnote-ref" role="doc-noteref">1&lt;/a>&lt;/sup> (create a one-unit-length stick and its partitions).&lt;/li>
&lt;li>Draw a random number uniformly distributed between zero and one (choose a break location without bias).&lt;/li>
&lt;li>Find the region in which this location lies and return that region&amp;rsquo;s label as the sample.&lt;/li>
&lt;/ol>
&lt;p>Implementing the stick-breaking algorithm is simple; with &lt;a href="https://julialang.org/" target="_blank" rel="noopener">Julia&lt;/a> as the example programming language, we can write&lt;/p>
&lt;div class="highlight">&lt;pre tabindex="0" class="chroma">&lt;code class="language-julia" data-lang="julia">&lt;span class="line">&lt;span class="cl">&lt;span class="k">function&lt;/span> &lt;span class="n">categorical_sampler1&lt;/span>&lt;span class="p">(&lt;/span>&lt;span class="n">p&lt;/span>&lt;span class="p">)&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl"> &lt;span class="n">i&lt;/span> &lt;span class="o">=&lt;/span> &lt;span class="mi">1&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl"> &lt;span class="n">c&lt;/span> &lt;span class="o">=&lt;/span> &lt;span class="n">p&lt;/span>&lt;span class="p">[&lt;/span>&lt;span class="mi">1&lt;/span>&lt;span class="p">]&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl"> &lt;span class="n">u&lt;/span> &lt;span class="o">=&lt;/span> &lt;span class="n">rand&lt;/span>&lt;span class="p">()&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl"> &lt;span class="k">while&lt;/span> &lt;span class="n">c&lt;/span> &lt;span class="o">&amp;lt;&lt;/span> &lt;span class="n">u&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl"> &lt;span class="n">c&lt;/span> &lt;span class="o">+=&lt;/span> &lt;span class="n">p&lt;/span>&lt;span class="p">[&lt;/span>&lt;span class="n">i&lt;/span>&lt;span class="o">+=&lt;/span>&lt;span class="mi">1&lt;/span>&lt;span class="p">]&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl"> &lt;span class="k">end&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl"> &lt;span class="k">return&lt;/span> &lt;span class="n">i&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&lt;span class="k">end&lt;/span>
&lt;/span>&lt;/span>&lt;/code>&lt;/pre>&lt;/div>&lt;p>While the stick-breaking algorithm is simple and straightforward, it is important to address a couple of practical concerns that can arise. Two common concerns involve unnormalized probabilities and log probabilities.&lt;/p>
&lt;h2 id="unnormalized-probabilities">Unnormalized probabilities&lt;/h2>
&lt;p>By definition, a probability distribution should be normalized: its probabilities or probability density function (PDF) ought to sum or integrate to one. Nevertheless, normalizing every encountered distribution is typically undesirable for two primary reasons:&lt;/p>
&lt;ol>
&lt;li>Normalization factors are typically constant multiplicative coefficients that have no impact on the actual algorithm&amp;rsquo;s outcomes.&lt;/li>
&lt;li>Computing the normalization factor of a complex probability distribution can be exceedingly challenging.&lt;/li>
&lt;/ol>
&lt;p>Dealing with unnormalized probabilities within the stick-breaking scheme is very simple. All we need to do is adjust the length of the &amp;ldquo;stick&amp;rdquo; to match the actual sum of probabilities. This adjustment can be achieved by replacing &lt;code>u = rand()&lt;/code> with &lt;code>u = rand() * sum(p)&lt;/code> in the &lt;code>categorical_sampler1&lt;/code> function.&lt;/p>
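&lt;p>As a minimal sketch of this one-line change (under the hypothetical name &lt;code>categorical_sampler1_unnormalized&lt;/code>), the adjusted sampler could look like:&lt;/p>

```julia
# Stick-breaking for unnormalized probabilities: a sketch of the one-line
# change described above, under a hypothetical function name.
function categorical_sampler1_unnormalized(p)
    i = 1
    c = p[1]
    u = rand() * sum(p)  # stretch the "stick" to the total weight
    while u > c
        c += p[i+=1]
    end
    return i
end
```

&lt;p>For example, with weights &lt;code>[2.0, 6.0, 2.0]&lt;/code>, the second category should be returned about 60% of the time.&lt;/p>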
&lt;h2 id="log-probabilities">Log probabilities&lt;/h2>
&lt;p>When working with probability distributions, we frequently need to compute products of probabilities, such as when determining the intersection of events. Depending on the number of terms involved and their respective normalization factors, these products can become very large or very small, and both cases can result in numerical stability problems. Consequently, it is standard practice to use log probabilities, the natural logarithms of the actual probabilities, throughout the entire computation.&lt;/p>
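&lt;p>A quick illustration of the problem: multiplying a thousand probabilities of 0.01 underflows to exactly zero in double precision, while the equivalent sum of log probabilities remains perfectly representable.&lt;/p>

```julia
# Underflow demonstration: the direct product of many small probabilities
# is exactly 0.0 in Float64, but the sum of their logarithms is finite.
p_direct = prod(fill(0.01, 1000))      # underflows to 0.0
logp_sum = sum(fill(log(0.01), 1000))  # ≈ -4605.17, no underflow
```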
&lt;h3 id="logsumexp">LogSumExp&lt;/h3>
&lt;p>Unlike the simple modification to incorporate unnormalized probabilities, sampling from a categorical distribution given its log event probabilities is tricky. The problem here is how to calculate \(\ln(p_1+p_2)\) given \(\ln p_1\) and \(\ln p_2\), where \(p_1\) and \(p_2\) are event probabilities. One workaround is to use the mathematical identity &lt;/p>
\[\ln(p_1+p_2)=\alpha+\ln[\exp(\ln p_1-\alpha)+\exp(\ln p_2-\alpha)].\]
&lt;p> In this equation, we choose the value of \(\alpha\) so that computing \(\exp(\ln p_1 - \alpha)\) and \(\exp(\ln p_2 - \alpha)\) is more numerically stable than directly computing \(\exp(\ln p_1)\) and \(\exp(\ln p_2)\)&lt;sup id="fnref:2">&lt;a href="#fn:2" class="footnote-ref" role="doc-noteref">2&lt;/a>&lt;/sup>. This algorithm is widely implemented in software packages; in Julia, for instance, it is called &lt;code>logaddexp&lt;/code> in &lt;a href="https://juliastats.org/LogExpFunctions.jl/stable/" target="_blank" rel="noopener">LogExpFunctions.jl&lt;/a>. Similarly, there is also &lt;code>logsumexp&lt;/code>, which generalizes &lt;code>logaddexp&lt;/code> to more than two operands. Therefore, we can write a new sampler as follows:&lt;/p>
&lt;div class="highlight">&lt;pre tabindex="0" class="chroma">&lt;code class="language-julia" data-lang="julia">&lt;span class="line">&lt;span class="cl">&lt;span class="k">function&lt;/span> &lt;span class="n">categorical_sampler2&lt;/span>&lt;span class="p">(&lt;/span>&lt;span class="n">logp&lt;/span>&lt;span class="p">)&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl"> &lt;span class="n">i&lt;/span> &lt;span class="o">=&lt;/span> &lt;span class="mi">1&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl"> &lt;span class="n">c&lt;/span> &lt;span class="o">=&lt;/span> &lt;span class="n">logp&lt;/span>&lt;span class="p">[&lt;/span>&lt;span class="mi">1&lt;/span>&lt;span class="p">]&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl"> &lt;span class="n">u&lt;/span> &lt;span class="o">=&lt;/span> &lt;span class="n">log&lt;/span>&lt;span class="p">(&lt;/span>&lt;span class="n">rand&lt;/span>&lt;span class="p">())&lt;/span> &lt;span class="o">+&lt;/span> &lt;span class="n">logsumexp&lt;/span>&lt;span class="p">(&lt;/span>&lt;span class="n">logp&lt;/span>&lt;span class="p">)&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl"> &lt;span class="k">while&lt;/span> &lt;span class="n">c&lt;/span> &lt;span class="o">&amp;lt;&lt;/span> &lt;span class="n">u&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl"> &lt;span class="n">c&lt;/span> &lt;span class="o">=&lt;/span> &lt;span class="n">logaddexp&lt;/span>&lt;span class="p">(&lt;/span>&lt;span class="n">c&lt;/span>&lt;span class="p">,&lt;/span> &lt;span class="n">logp&lt;/span>&lt;span class="p">[&lt;/span>&lt;span class="n">i&lt;/span>&lt;span class="o">+=&lt;/span>&lt;span class="mi">1&lt;/span>&lt;span class="p">])&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl"> &lt;span class="k">end&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl"> &lt;span class="k">return&lt;/span> &lt;span class="n">i&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&lt;span class="k">end&lt;/span>
&lt;/span>&lt;/span>&lt;/code>&lt;/pre>&lt;/div>&lt;h3 id="softmax">Softmax&lt;/h3>
&lt;p>Another closely related approach involves transforming all log probabilities into normalized probabilities within the real space, with enhanced numerical stability. This procedure is commonly referred to as the &lt;code>softmax&lt;/code>&lt;sup id="fnref:3">&lt;a href="#fn:3" class="footnote-ref" role="doc-noteref">3&lt;/a>&lt;/sup> function: &lt;/p>
\[\mathrm{softmax}(\ln p_1,\ln p_2,\dots,\ln p_N)_n=\frac{p_n}{\sum_{m=1}^{N} p_m}.\]
&lt;p> With &lt;code>softmax&lt;/code>, instead of writing any new functions, we can simply call &lt;code>categorical_sampler1(softmax(logp))&lt;/code>.&lt;/p>
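&lt;p>If you prefer not to pull in a package, a common pattern (a sketch, not the LogExpFunctions.jl implementation) is to subtract the maximum log probability before exponentiating, which keeps every exponent nonpositive:&lt;/p>

```julia
# Numerically stable softmax sketch: subtracting the maximum means the
# largest exponent is exactly zero, so exp can never overflow.
function stable_softmax(logp)
    w = exp.(logp .- maximum(logp))
    return w ./ sum(w)
end
```

&lt;p>For instance, &lt;code>stable_softmax([-1000.0, -1001.0])&lt;/code> returns finite probabilities, whereas naively computing &lt;code>exp.(logp) ./ sum(exp.(logp))&lt;/code> yields &lt;code>NaN&lt;/code> from 0/0.&lt;/p>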
&lt;p>While both &lt;code>logsumexp&lt;/code> and &lt;code>softmax&lt;/code> are valid approaches, neither is entirely free of the numerical instability risk: they still require some calculations in the real space. Remarkably, it is conceivable to accomplish all computations exclusively within the logarithmic space using the &lt;a href="https://en.wikipedia.org/wiki/Gumbel_distribution#Standard_Gumbel_distribution" target="_blank" rel="noopener">standard Gumbel distribution&lt;/a>.&lt;/p>
&lt;h2 id="the-standard-gumbel-distribution">The standard Gumbel distribution&lt;/h2>
&lt;p>The standard Gumbel distribution is the special case of the Gumbel distribution in which the two parameters, location and scale, are equal to zero and one, respectively. Consequently, the PDF of the standard Gumbel distribution takes the form: &lt;/p>
\[f\left(x\right)=\exp\left[-x-\exp\left(-x\right)\right].\]
&lt;p> Although this PDF may appear daunting due to the exponential within the exponent, it in fact offers two convenient properties. These properties, explained in greater detail in the following sections, help us generate samples from a categorical distribution.&lt;/p>
&lt;h3 id="sampling-from-the-standard-gumbel-distribution">Sampling from the standard Gumbel distribution&lt;/h3>
&lt;p>The first outcome is how easy it is to sample from the standard Gumbel distribution: its PDF can actually be analytically integrated to obtain an invertible cumulative distribution function (CDF) &lt;/p>
\[F\left(x\right)=\exp\left[-\exp\left(-x\right)\right],\]
&lt;p> while its inverse is &lt;/p>
\[F^{-1}\left(u\right)=-\ln\left(-\ln u \right).\]
&lt;p> Therefore, according to the &lt;a href="https://en.wikipedia.org/wiki/Inverse_transform_sampling#Formal_statement" target="_blank" rel="noopener">fundamental theorem of simulation&lt;/a>, sampling from the standard Gumbel distribution is as easy as calculating \(F^{-1}\left(u\right)\) where \(u\) is a uniform random number between zero and one.&lt;/p>
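&lt;p>In code, this amounts to a single line (a sketch using the inverse CDF above; the name &lt;code>gumbel_rand&lt;/code> is hypothetical):&lt;/p>

```julia
# Inverse-transform sampling: with u ~ Uniform(0, 1), the value
# F⁻¹(u) = -log(-log(u)) follows the standard Gumbel distribution.
gumbel_rand(n) = -log.(-log.(rand(n)))
```

&lt;p>A quick sanity check: the mean of the standard Gumbel distribution is the Euler&amp;ndash;Mascheroni constant \(\gamma \approx 0.5772\), which the empirical mean of a large sample should approach.&lt;/p>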
&lt;h3 id="the-gumbel-max-trick">The Gumbel-Max trick&lt;/h3>
&lt;p>Now, consider a target categorical distribution with \(N\) unnormalized log event probabilities \(\ln p_1, \ln p_2, \dots, \ln p_N\). Using the algorithm outlined earlier, we can effortlessly generate the same number of independent and identically distributed random variables following the standard Gumbel distribution: \(x_1, x_2, \ldots, x_N\). Interestingly, the probability that \(n\) is the index maximizing \(x_n + \ln p_n\) turns out to be precisely \(p_n/\sum_{m=1}^N p_m\). This indicates that \(\arg\max_n (x_n + \ln p_n)\) is itself a random variable that follows the target categorical distribution exactly, and no calculation is done in the real space!&lt;/p>
&lt;p>This result is often referred to as the &amp;ldquo;Gumbel-Max trick&amp;rdquo;. Although I provide the full derivation in &lt;a href="https://github.com/lanceXwq/lancexwq.github.io/tree/main/content/post/categorical-gumbel/derivation.pdf" target="_blank" rel="noopener">this document&lt;/a>, deriving this result by yourself is highly recommended. Implementing this trick in Julia can be done as:&lt;/p>
&lt;div class="highlight">&lt;pre tabindex="0" class="chroma">&lt;code class="language-julia" data-lang="julia">&lt;span class="line">&lt;span class="cl">&lt;span class="k">function&lt;/span> &lt;span class="n">categorical_sampler3&lt;/span>&lt;span class="p">(&lt;/span>&lt;span class="n">logp&lt;/span>&lt;span class="p">)&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl"> &lt;span class="n">x&lt;/span> &lt;span class="o">=&lt;/span> &lt;span class="o">-&lt;/span>&lt;span class="n">log&lt;/span>&lt;span class="o">.&lt;/span>&lt;span class="p">(&lt;/span>&lt;span class="o">-&lt;/span>&lt;span class="n">log&lt;/span>&lt;span class="o">.&lt;/span>&lt;span class="p">(&lt;/span>&lt;span class="n">rand&lt;/span>&lt;span class="p">(&lt;/span>&lt;span class="n">length&lt;/span>&lt;span class="p">(&lt;/span>&lt;span class="n">logp&lt;/span>&lt;span class="p">))))&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl"> &lt;span class="p">(&lt;/span>&lt;span class="o">~&lt;/span>&lt;span class="p">,&lt;/span> &lt;span class="n">n&lt;/span>&lt;span class="p">)&lt;/span> &lt;span class="o">=&lt;/span> &lt;span class="n">findmax&lt;/span>&lt;span class="p">(&lt;/span>&lt;span class="n">x&lt;/span> &lt;span class="o">.+&lt;/span> &lt;span class="n">logp&lt;/span>&lt;span class="p">)&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl"> &lt;span class="k">return&lt;/span> &lt;span class="n">n&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&lt;span class="k">end&lt;/span>
&lt;/span>&lt;/span>&lt;/code>&lt;/pre>&lt;/div>&lt;p>Verifying the equivalence of the samplers in this blog post is trivial; you can do it yourself or refer to &lt;a href="https://github.com/lanceXwq/lancexwq.github.io/tree/main/content/post/categorical-gumbel/scripts/code.jl" target="_blank" rel="noopener">this example file&lt;/a>.&lt;/p>
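&lt;p>As a minimal self-contained check (restating the Gumbel-Max sampler with &lt;code>argmax&lt;/code>; the name &lt;code>gumbel_max_sampler&lt;/code> is hypothetical), the empirical frequencies should match the target probabilities:&lt;/p>

```julia
# Gumbel-Max sampler restated in one line, followed by an empirical check
# that its sample frequencies reproduce the target probabilities.
gumbel_max_sampler(logp) = argmax(-log.(-log.(rand(length(logp)))) .+ logp)

logp = log.([0.2, 0.5, 0.3])
counts = zeros(Int, 3)
for _ in 1:100_000
    counts[gumbel_max_sampler(logp)] += 1
end
freq = counts ./ 100_000  # ≈ [0.2, 0.5, 0.3]
```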
&lt;h2 id="additional-notes">Additional notes&lt;/h2>
&lt;p>Although the Gumbel-Max trick allows all computations to be done in log space, it is not always the go-to choice for your code. First, numerical stability may not pose a significant concern when both the model and the data are well behaved. Moreover, if the probability of an event is so much smaller than the others that the &lt;code>softmax&lt;/code> operation could introduce numerical instability, that event may never be sampled within a given timeframe anyway, regardless of the algorithm&amp;rsquo;s stability. In these cases, one may prioritize computational efficiency over numerical stability. (Precision almost always comes at the cost of computational expense.)&lt;/p>
&lt;p>On the other hand, if we look beyond numerical stability, the Gumbel-Max trick still offers distinct advantages. Consider training a neural network via backpropagation, which relies on gradient computations. This implies that the function embodied by a neural network node must be differentiable. In certain scenarios, this function might involve sampling from a categorical distribution, as in the case of an image classifier. However, the stick-breaking algorithm, by design, can only yield discrete outcomes and therefore lacks differentiability. Conversely, the \(\arg\max\) function in &lt;code>categorical_sampler3&lt;/code> can be substituted with a differentiable &lt;code>softmax&lt;/code> function, thereby enabling gradient computation and backpropagation. This transformation is commonly referred to as the Gumbel-Softmax technique&lt;sup id="fnref:4">&lt;a href="#fn:4" class="footnote-ref" role="doc-noteref">4&lt;/a>&lt;/sup>.&lt;/p>
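&lt;p>A sketch of this relaxation (following the formulation in the cited paper, with a temperature parameter \(\tau\); the name &lt;code>gumbel_softmax&lt;/code> is hypothetical):&lt;/p>

```julia
# Gumbel-Softmax sketch: replace the hard argmax with a softmax at
# temperature τ. As τ → 0 the output approaches a one-hot sample; for
# τ > 0 the map stays differentiable.
function gumbel_softmax(logp, τ)
    z = (-log.(-log.(rand(length(logp)))) .+ logp) ./ τ
    w = exp.(z .- maximum(z))  # stable softmax, as before
    return w ./ sum(w)
end
```

&lt;p>The output is a probability vector that sums to one; a small \(\tau\) concentrates nearly all mass on a single category.&lt;/p>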
&lt;div class="footnotes" role="doc-endnotes">
&lt;hr>
&lt;ol>
&lt;li id="fn:1">
&lt;p>This scheme is only suitable for distributions with a finite number of categories.&amp;#160;&lt;a href="#fnref:1" class="footnote-backref" role="doc-backlink">&amp;#x21a9;&amp;#xfe0e;&lt;/a>&lt;/p>
&lt;/li>
&lt;li id="fn:2">
&lt;p>People typically choose the larger value between \(\ln p_1\) and \(\ln p_2\) to be \(\alpha\).&amp;#160;&lt;a href="#fnref:2" class="footnote-backref" role="doc-backlink">&amp;#x21a9;&amp;#xfe0e;&lt;/a>&lt;/p>
&lt;/li>
&lt;li id="fn:3">
&lt;p>Roughly, &amp;ldquo;softmax&amp;rdquo; means &amp;ldquo;&lt;a href="https://en.wikipedia.org/wiki/Softmax_function#Smooth_arg_max" target="_blank" rel="noopener">soft (smooth) \(\arg\max\)&lt;/a>&amp;rdquo;.&amp;#160;&lt;a href="#fnref:3" class="footnote-backref" role="doc-backlink">&amp;#x21a9;&amp;#xfe0e;&lt;/a>&lt;/p>
&lt;/li>
&lt;li id="fn:4">
&lt;p>&lt;a href="https://arxiv.org/abs/1611.01144" target="_blank" rel="noopener">This paper&lt;/a> contains more details on this topic.&amp;#160;&lt;a href="#fnref:4" class="footnote-backref" role="doc-backlink">&amp;#x21a9;&amp;#xfe0e;&lt;/a>&lt;/p>
&lt;/li>
&lt;/ol>
&lt;/div></description></item><item><title>Optimization Techniques in Scientific Computing (Part III)</title><link>https://lancexwq.netlify.app/post/optimization-iii/</link><pubDate>Tue, 08 Aug 2023 00:00:00 +0000</pubDate><guid>https://lancexwq.netlify.app/post/optimization-iii/</guid><description>&lt;ul>
&lt;li>&lt;a href="#introduction-and-recap">Introduction and recap&lt;/a>&lt;/li>
&lt;li>&lt;a href="#run-code-on-a-gpu">Run code on a GPU&lt;/a>
&lt;ul>
&lt;li>&lt;a href="#first-attempt">First attempt&lt;/a>&lt;/li>
&lt;li>&lt;a href="#further-vectorization">Further vectorization&lt;/a>&lt;/li>
&lt;li>&lt;a href="#batch-matrix-multiplication">Batch matrix multiplication&lt;/a>&lt;/li>
&lt;li>&lt;a href="#final-boost">Final boost&lt;/a>&lt;/li>
&lt;/ul>
&lt;/li>
&lt;li>&lt;a href="#conclusion">Conclusion&lt;/a>&lt;/li>
&lt;/ul>
&lt;h2 id="introduction-and-recap">Introduction and recap&lt;/h2>
&lt;p>In my &lt;a href="https://lancexwq.github.io/tag/optimization/?q=Optimization%20Techniques%20in%20Scientific%20Computing" target="_blank" rel="noopener">previous two blog posts&lt;/a> on optimization techniques in scientific computing, I discussed concepts such as vectorization and parallelism in the context of my single-molecule video simulation&lt;sup id="fnref:1">&lt;a href="#fn:1" class="footnote-ref" role="doc-noteref">1&lt;/a>&lt;/sup>, which can be mathematically formulated as calculating a 3D array \(V\) with &lt;/p>
\[V_{fij}=\sum_n \exp[-(x^p_{fi}-x_{fn})^2-(y^p_{fj}-y_{fn})^2].\]
&lt;p> We started with &lt;code>video_sim_v1&lt;/code>&lt;/p>
&lt;div class="highlight">&lt;pre tabindex="0" class="chroma">&lt;code class="language-julia" data-lang="julia">&lt;span class="line">&lt;span class="cl">&lt;span class="k">function&lt;/span> &lt;span class="n">video_sim_v1&lt;/span>&lt;span class="p">(&lt;/span>&lt;span class="n">xᵖ&lt;/span>&lt;span class="p">,&lt;/span> &lt;span class="n">yᵖ&lt;/span>&lt;span class="p">,&lt;/span> &lt;span class="n">x&lt;/span>&lt;span class="p">,&lt;/span> &lt;span class="n">y&lt;/span>&lt;span class="p">)&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl"> &lt;span class="n">F&lt;/span> &lt;span class="o">=&lt;/span> &lt;span class="n">size&lt;/span>&lt;span class="p">(&lt;/span>&lt;span class="n">x&lt;/span>&lt;span class="p">,&lt;/span> &lt;span class="mi">2&lt;/span>&lt;span class="p">)&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl"> &lt;span class="n">v&lt;/span> &lt;span class="o">=&lt;/span> &lt;span class="kt">Array&lt;/span>&lt;span class="p">{&lt;/span>&lt;span class="kt">eltype&lt;/span>&lt;span class="p">(&lt;/span>&lt;span class="kt">x&lt;/span>&lt;span class="p">),&lt;/span>&lt;span class="mi">3&lt;/span>&lt;span class="p">}(&lt;/span>&lt;span class="nb">undef&lt;/span>&lt;span class="p">,&lt;/span> &lt;span class="n">length&lt;/span>&lt;span class="p">(&lt;/span>&lt;span class="n">xᵖ&lt;/span>&lt;span class="p">),&lt;/span> &lt;span class="n">length&lt;/span>&lt;span class="p">(&lt;/span>&lt;span class="n">yᵖ&lt;/span>&lt;span class="p">),&lt;/span> &lt;span class="n">F&lt;/span>&lt;span class="p">)&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl"> &lt;span class="k">for&lt;/span> &lt;span class="n">f&lt;/span> &lt;span class="k">in&lt;/span> &lt;span class="mi">1&lt;/span>&lt;span class="o">:&lt;/span>&lt;span class="n">F&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl"> &lt;span class="n">PSFˣ&lt;/span> &lt;span class="o">=&lt;/span> &lt;span class="n">exp&lt;/span>&lt;span class="o">.&lt;/span>&lt;span class="p">(&lt;/span>&lt;span class="o">-&lt;/span>&lt;span class="p">(&lt;/span>&lt;span class="n">xᵖ&lt;/span> &lt;span class="o">.-&lt;/span> &lt;span class="n">Transpose&lt;/span>&lt;span class="p">(&lt;/span>&lt;span class="n">view&lt;/span>&lt;span class="p">(&lt;/span>&lt;span class="n">x&lt;/span>&lt;span class="p">,&lt;/span> &lt;span class="o">:&lt;/span>&lt;span class="p">,&lt;/span> &lt;span class="n">f&lt;/span>&lt;span class="p">)))&lt;/span> &lt;span class="o">.^&lt;/span> &lt;span class="mi">2&lt;/span>&lt;span class="p">)&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl"> &lt;span class="n">PSFʸ&lt;/span> &lt;span class="o">=&lt;/span> &lt;span class="n">exp&lt;/span>&lt;span class="o">.&lt;/span>&lt;span class="p">(&lt;/span>&lt;span class="o">-&lt;/span>&lt;span class="p">(&lt;/span>&lt;span class="n">view&lt;/span>&lt;span class="p">(&lt;/span>&lt;span class="n">y&lt;/span>&lt;span class="p">,&lt;/span> &lt;span class="o">:&lt;/span>&lt;span class="p">,&lt;/span> &lt;span class="n">f&lt;/span>&lt;span class="p">)&lt;/span> &lt;span class="o">.-&lt;/span> &lt;span class="n">Transpose&lt;/span>&lt;span class="p">(&lt;/span>&lt;span class="n">yᵖ&lt;/span>&lt;span class="p">))&lt;/span> &lt;span class="o">.^&lt;/span> &lt;span class="mi">2&lt;/span>&lt;span class="p">)&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl"> &lt;span class="n">v&lt;/span>&lt;span class="p">[&lt;/span>&lt;span class="o">:&lt;/span>&lt;span class="p">,&lt;/span> &lt;span class="o">:&lt;/span>&lt;span class="p">,&lt;/span> &lt;span class="n">f&lt;/span>&lt;span class="p">]&lt;/span> &lt;span class="o">=&lt;/span> &lt;span class="n">PSFˣ&lt;/span> &lt;span class="o">*&lt;/span> &lt;span class="n">PSFʸ&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl"> &lt;span class="k">end&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl"> &lt;span class="k">return&lt;/span> &lt;span class="n">v&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&lt;span class="k">end&lt;/span>
&lt;/span>&lt;/span>&lt;/code>&lt;/pre>&lt;/div>&lt;p>then found that introducing multithreading as follows significantly improves the performance.&lt;/p>
&lt;div class="highlight">&lt;pre tabindex="0" class="chroma">&lt;code class="language-julia" data-lang="julia">&lt;span class="line">&lt;span class="cl">&lt;span class="k">function&lt;/span> &lt;span class="n">video_sim_v3&lt;/span>&lt;span class="p">(&lt;/span>&lt;span class="n">xᵖ&lt;/span>&lt;span class="p">,&lt;/span> &lt;span class="n">yᵖ&lt;/span>&lt;span class="p">,&lt;/span> &lt;span class="n">x&lt;/span>&lt;span class="p">,&lt;/span> &lt;span class="n">y&lt;/span>&lt;span class="p">)&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl"> &lt;span class="n">F&lt;/span> &lt;span class="o">=&lt;/span> &lt;span class="n">size&lt;/span>&lt;span class="p">(&lt;/span>&lt;span class="n">x&lt;/span>&lt;span class="p">,&lt;/span> &lt;span class="mi">2&lt;/span>&lt;span class="p">)&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl"> &lt;span class="n">v&lt;/span> &lt;span class="o">=&lt;/span> &lt;span class="kt">Array&lt;/span>&lt;span class="p">{&lt;/span>&lt;span class="kt">eltype&lt;/span>&lt;span class="p">(&lt;/span>&lt;span class="kt">x&lt;/span>&lt;span class="p">),&lt;/span>&lt;span class="mi">3&lt;/span>&lt;span class="p">}(&lt;/span>&lt;span class="nb">undef&lt;/span>&lt;span class="p">,&lt;/span> &lt;span class="n">length&lt;/span>&lt;span class="p">(&lt;/span>&lt;span class="n">xᵖ&lt;/span>&lt;span class="p">),&lt;/span> &lt;span class="n">length&lt;/span>&lt;span class="p">(&lt;/span>&lt;span class="n">yᵖ&lt;/span>&lt;span class="p">),&lt;/span> &lt;span class="n">F&lt;/span>&lt;span class="p">)&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl"> &lt;span class="n">Threads&lt;/span>&lt;span class="o">.&lt;/span>&lt;span class="nd">@threads&lt;/span> &lt;span class="k">for&lt;/span> &lt;span class="n">f&lt;/span> &lt;span class="k">in&lt;/span> &lt;span class="mi">1&lt;/span>&lt;span class="o">:&lt;/span>&lt;span class="n">F&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl"> &lt;span class="n">PSFˣ&lt;/span> &lt;span class="o">=&lt;/span> &lt;span class="n">exp&lt;/span>&lt;span class="o">.&lt;/span>&lt;span class="p">(&lt;/span>&lt;span class="o">-&lt;/span>&lt;span class="p">(&lt;/span>&lt;span class="n">xᵖ&lt;/span> &lt;span class="o">.-&lt;/span> &lt;span class="n">Transpose&lt;/span>&lt;span class="p">(&lt;/span>&lt;span class="n">view&lt;/span>&lt;span class="p">(&lt;/span>&lt;span class="n">x&lt;/span>&lt;span class="p">,&lt;/span> &lt;span class="o">:&lt;/span>&lt;span class="p">,&lt;/span> &lt;span class="n">f&lt;/span>&lt;span class="p">)))&lt;/span> &lt;span class="o">.^&lt;/span> &lt;span class="mi">2&lt;/span>&lt;span class="p">)&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl"> &lt;span class="n">PSFʸ&lt;/span> &lt;span class="o">=&lt;/span> &lt;span class="n">exp&lt;/span>&lt;span class="o">.&lt;/span>&lt;span class="p">(&lt;/span>&lt;span class="o">-&lt;/span>&lt;span class="p">(&lt;/span>&lt;span class="n">view&lt;/span>&lt;span class="p">(&lt;/span>&lt;span class="n">y&lt;/span>&lt;span class="p">,&lt;/span> &lt;span class="o">:&lt;/span>&lt;span class="p">,&lt;/span> &lt;span class="n">f&lt;/span>&lt;span class="p">)&lt;/span> &lt;span class="o">.-&lt;/span> &lt;span class="n">Transpose&lt;/span>&lt;span class="p">(&lt;/span>&lt;span class="n">yᵖ&lt;/span>&lt;span class="p">))&lt;/span> &lt;span class="o">.^&lt;/span> &lt;span class="mi">2&lt;/span>&lt;span class="p">)&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl"> &lt;span class="n">v&lt;/span>&lt;span class="p">[&lt;/span>&lt;span class="o">:&lt;/span>&lt;span class="p">,&lt;/span> &lt;span class="o">:&lt;/span>&lt;span class="p">,&lt;/span> &lt;span class="n">f&lt;/span>&lt;span class="p">]&lt;/span> &lt;span class="o">=&lt;/span> &lt;span class="n">PSFˣ&lt;/span> &lt;span class="o">*&lt;/span> &lt;span class="n">PSFʸ&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl"> &lt;span class="k">end&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl"> &lt;span class="k">return&lt;/span> &lt;span class="n">v&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&lt;span class="k">end&lt;/span>
&lt;/span>&lt;/span>&lt;/code>&lt;/pre>&lt;/div>&lt;p>Eventually, &lt;code>video_sim_v3&lt;/code> yields a benchmark of &lt;code>12.925 ms (1450 allocations: 123.47 MiB)&lt;/code> on my eight-thread Intel i7 7700K.&lt;/p>
&lt;p>In part II of this blog series, I also loosely alluded to the dilemma we face in further optimization:&lt;/p>
&lt;ul>
&lt;li>The number of independent frames can be much larger than the number of threads on a CPU&lt;sup id="fnref:2">&lt;a href="#fn:2" class="footnote-ref" role="doc-noteref">2&lt;/a>&lt;/sup>.&lt;/li>
&lt;li>Multiprocessing on a cluster incurs significant communication overhead and development challenges, which ultimately outweigh the potential performance gains.&lt;/li>
&lt;/ul>
&lt;p>Basically, we want a solution that can efficiently execute numerous relatively lightweight computational tasks in parallel, while maintaining minimal communication overhead. Interestingly, such a solution already exists, and it takes the form of a GPU. According to the experts from &lt;a href="https://www.intel.com/content/www/us/en/products/docs/processors/cpu-vs-gpu.html" target="_blank" rel="noopener">Intel&lt;/a>,&lt;/p>
&lt;blockquote>
&lt;p>The GPU is a processor that is made up of many smaller and more specialized cores. By working together, the cores deliver massive performance when a processing task can be divided up and processed across many cores.&lt;/p>
&lt;/blockquote>
&lt;h2 id="run-code-on-a-gpu">Run code on a GPU&lt;/h2>
&lt;p>Originally popularized in the deep learning community, accelerating scientific computation with GPUs is rapidly gaining attention from researchers across various domains. Thanks to the continuous efforts of scientists and software developers, writing GPU code has become much easier than it used to be. Under some circumstances, once properly set up, running code originally written for CPUs on a GPU can be achieved by changing merely a few lines.&lt;/p>
&lt;p>At the moment, the three leading chipmakers, Nvidia, AMD, and Intel, all offer their own platforms for GPU computation&lt;sup id="fnref:3">&lt;a href="#fn:3" class="footnote-ref" role="doc-noteref">3&lt;/a>&lt;/sup>. Due to its relatively higher popularity, I will use &lt;a href="https://en.wikipedia.org/wiki/CUDA" target="_blank" rel="noopener">CUDA&lt;/a> from Nvidia in this blog. For detailed guidance on installation and integration with Julia, please refer to &lt;a href="https://github.com/JuliaGPU/CUDA.jl" target="_blank" rel="noopener">CUDA.jl&lt;/a> and &lt;a href="https://cuda.juliagpu.org/stable/" target="_blank" rel="noopener">its documentation&lt;/a>.&lt;/p>
&lt;h3 id="first-attempt">First attempt&lt;/h3>
&lt;p>Once &lt;code>CUDA.jl&lt;/code> is installed, verified, and loaded, running &lt;code>video_sim_v1&lt;/code> on an Nvidia GPU simply requires passing the arguments as CUDA arrays, e.g. &lt;code>video_sim_v1(CuArray(xᵖ), CuArray(yᵖ), CuArray(x), CuArray(y))&lt;/code>.&lt;/p>
&lt;p>You may expect magic to happen, but instead a warning (or sometimes an error) pops up regarding &lt;code>performing scalar indexing on task&lt;/code>. What&amp;rsquo;s more, the warning message says &lt;code>such implementations *do not* execute on the GPU, but very slowly on the CPU&lt;/code>, indicating that our first attempt has failed. The cause of this failure is clear from the warning message: CUDA does not accept scalar indexing of a GPU array, like &lt;code>v[:, :, f]&lt;/code>. Consequently, the solution entails completely vectorizing the code, eliminating the for-loop iteration over \(f\).&lt;/p>
&lt;h3 id="further-vectorization">Further vectorization&lt;/h3>
&lt;p>As stated multiple times thus far, our problem does not align directly with any basic vector operation. However, we can be clever and slightly restructure our data, enabling the potential for vectorization. An approach to achieve this is illustrated in the following figure.&lt;/p>
&lt;p>
&lt;figure id="figure-block-diagonal-matrix-multiplication">
&lt;div class="d-flex justify-content-center">
&lt;div class="w-100" >&lt;img alt="Block diagonal matrix multiplication to calculate PSF." srcset="
/post/optimization-iii/fig1_hufbdff24dc123ad47906c4222b1604286_18213_04666b017c909e48e2e60460767337f0.webp 400w,
/post/optimization-iii/fig1_hufbdff24dc123ad47906c4222b1604286_18213_94cf65fdceef65aa92e944897afac24f.webp 760w,
/post/optimization-iii/fig1_hufbdff24dc123ad47906c4222b1604286_18213_1200x1200_fit_q75_h2_lanczos_3.webp 1200w"
src="https://lancexwq.netlify.app/post/optimization-iii/fig1_hufbdff24dc123ad47906c4222b1604286_18213_04666b017c909e48e2e60460767337f0.webp"
width="760"
height="398"
loading="lazy" data-zoomable />&lt;/div>
&lt;/div>&lt;figcaption>
Block diagonal matrix multiplication
&lt;/figcaption>&lt;/figure>
&lt;/p>
&lt;p>Here, \(PSF^x\), \(PSF^y\), and \(V\) are restructured as block-diagonal matrices. Blocks sharing the same color correspond to the same frame, while any remaining elements within these matrices are set to zero, visually represented as white-colored sections. As a result, all the frames can be simulated through one matrix multiplication.&lt;/p>
&lt;p>While this approach is indeed valid, I would not recommend implementing it yourself, due to the potentially vast dimensions of these block matrices. A naive implementation that does not handle memory allocation efficiently could greatly worsen overall performance.&lt;/p>
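Purely for illustration (not a recommendation), Julia's standard-library SparseArrays can build such block-diagonal factors without densely storing the zero blocks. A minimal sketch on hypothetical toy sizes, with two "frames" whose factor blocks are chosen arbitrarily:

```julia
using SparseArrays, LinearAlgebra

# Hypothetical toy sizes: two frames, PSFˣ blocks 3×2, PSFʸ blocks 4×2.
A1, A2 = rand(3, 2), rand(3, 2)   # per-frame PSFˣ factors
B1, B2 = rand(4, 2), rand(4, 2)   # per-frame PSFʸ factors

# Block-diagonal factors: zeros outside the blocks are never stored.
PSFx = blockdiag(sparse(A1), sparse(A2))   # 6×4 sparse
PSFy = blockdiag(sparse(B1), sparse(B2))   # 8×4 sparse

V = PSFx * transpose(PSFy)   # one multiplication covers both frames

# Diagonal blocks of V are the per-frame products; off-diagonal blocks vanish.
@assert Matrix(V[1:3, 1:4]) ≈ A1 * transpose(B1)
@assert Matrix(V[4:6, 5:8]) ≈ A2 * transpose(B2)
@assert all(iszero, V[1:3, 5:8])
```

Even so, the sparse product above still materializes the full result and pays bookkeeping overhead per nonzero, which is exactly why the batched approach below is preferable.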
&lt;p>Are there better solutions? The answer is yes. This problem we are facing, namely numerous independent (and typically small) matrix multiplications of identical sizes, is not unique to us. In fact, it is common enough that people have named it &amp;ldquo;batch matrix multiplication&amp;rdquo;.&lt;/p>
&lt;h3 id="batch-matrix-multiplication">Batch matrix multiplication&lt;/h3>
&lt;p>
&lt;figure id="figure-batched-matrix-multiplication">
&lt;div class="d-flex justify-content-center">
&lt;div class="w-100" >&lt;img alt="Batched matrix multiplication to calculate PSF." srcset="
/post/optimization-iii/fig2_hue4fe31342a3d6a87683d56c97d6a780a_10142_f3d3ba72c75220f005441259b1add0d9.webp 400w,
/post/optimization-iii/fig2_hue4fe31342a3d6a87683d56c97d6a780a_10142_ee686c2dfb8920e0b6e127c982785240.webp 760w,
/post/optimization-iii/fig2_hue4fe31342a3d6a87683d56c97d6a780a_10142_1200x1200_fit_q75_h2_lanczos_3.webp 1200w"
src="https://lancexwq.netlify.app/post/optimization-iii/fig2_hue4fe31342a3d6a87683d56c97d6a780a_10142_f3d3ba72c75220f005441259b1add0d9.webp"
width="590"
height="243"
loading="lazy" data-zoomable />&lt;/div>
&lt;/div>&lt;figcaption>
Batched matrix multiplication
&lt;/figcaption>&lt;/figure>
&lt;/p>
&lt;p>Although batch matrix multiplication is widely recognized and efficiently implemented, it may not always be easy to find the correct function within your programming language. Occasionally, batch matrix multiplication goes by different names. For instance, in MATLAB, it is referred to as &amp;ldquo;&lt;a href="https://www.mathworks.com/help/matlab/ref/pagemtimes.html" target="_blank" rel="noopener">page-wise matrix multiplication&lt;/a>&amp;rdquo;. In certain cases, additional packages are required, and quite often, these packages belong to deep-learning libraries! In Python, you can call &lt;code>torch.bmm&lt;/code> from &lt;a href="https://pytorch.org/docs/stable/generated/torch.bmm.html" target="_blank" rel="noopener">PyTorch&lt;/a>, while Julia offers &lt;code>batched_mul&lt;/code> through &lt;a href="https://fluxml.ai/Flux.jl/stable/models/nnlib/#NNlib.batched_mul" target="_blank" rel="noopener">Flux.jl&lt;/a>. Using &lt;code>batched_mul&lt;/code>, we can write new code as follows:&lt;/p>
&lt;div class="highlight">&lt;pre tabindex="0" class="chroma">&lt;code class="language-julia" data-lang="julia">&lt;span class="line">&lt;span class="cl">&lt;span class="k">function&lt;/span> &lt;span class="n">video_sim_GPU_v2&lt;/span>&lt;span class="p">(&lt;/span>&lt;span class="n">xᵖ&lt;/span>&lt;span class="p">,&lt;/span> &lt;span class="n">yᵖ&lt;/span>&lt;span class="p">,&lt;/span> &lt;span class="n">x&lt;/span>&lt;span class="p">,&lt;/span> &lt;span class="n">y&lt;/span>&lt;span class="p">)&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl"> &lt;span class="n">PSFˣ&lt;/span> &lt;span class="o">=&lt;/span> &lt;span class="n">exp&lt;/span>&lt;span class="o">.&lt;/span>&lt;span class="p">(&lt;/span>&lt;span class="o">-&lt;/span>&lt;span class="p">(&lt;/span>&lt;span class="n">reshape&lt;/span>&lt;span class="p">(&lt;/span>&lt;span class="n">x&lt;/span>&lt;span class="p">,&lt;/span> &lt;span class="mi">1&lt;/span>&lt;span class="p">,&lt;/span> &lt;span class="n">size&lt;/span>&lt;span class="p">(&lt;/span>&lt;span class="n">x&lt;/span>&lt;span class="p">)&lt;/span>&lt;span class="o">...&lt;/span>&lt;span class="p">)&lt;/span> &lt;span class="o">.-&lt;/span> &lt;span class="n">xᵖ&lt;/span>&lt;span class="p">)&lt;/span> &lt;span class="o">.^&lt;/span> &lt;span class="mi">2&lt;/span>&lt;span class="p">)&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl"> &lt;span class="n">PSFʸ&lt;/span> &lt;span class="o">=&lt;/span> &lt;span class="n">exp&lt;/span>&lt;span class="o">.&lt;/span>&lt;span class="p">(&lt;/span>&lt;span class="o">-&lt;/span>&lt;span class="p">(&lt;/span>&lt;span class="n">reshape&lt;/span>&lt;span class="p">(&lt;/span>&lt;span class="n">y&lt;/span>&lt;span class="p">,&lt;/span> &lt;span class="mi">1&lt;/span>&lt;span class="p">,&lt;/span> &lt;span class="n">size&lt;/span>&lt;span class="p">(&lt;/span>&lt;span class="n">y&lt;/span>&lt;span class="p">)&lt;/span>&lt;span class="o">...&lt;/span>&lt;span class="p">)&lt;/span> &lt;span class="o">.-&lt;/span> &lt;span class="n">yᵖ&lt;/span>&lt;span class="p">)&lt;/span> &lt;span class="o">.^&lt;/span> &lt;span class="mi">2&lt;/span>&lt;span class="p">)&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl"> &lt;span class="k">return&lt;/span> &lt;span class="n">batched_mul&lt;/span>&lt;span class="p">(&lt;/span>&lt;span class="n">PSFˣ&lt;/span>&lt;span class="p">,&lt;/span> &lt;span class="n">batched_adjoint&lt;/span>&lt;span class="p">(&lt;/span>&lt;span class="n">PSFʸ&lt;/span>&lt;span class="p">))&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&lt;span class="k">end&lt;/span>
&lt;/span>&lt;/span>&lt;/code>&lt;/pre>&lt;/div>&lt;p>Here, &lt;code>reshape&lt;/code> is called to construct &lt;code>PSFˣ&lt;/code> and &lt;code>PSFʸ&lt;/code> as 3D arrays, and &lt;code>batched_adjoint&lt;/code> is just the &amp;ldquo;batched&amp;rdquo; version of transpose.&lt;/p>
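To make the reshape concrete: `batched_mul(A, batched_adjoint(B))` computes, for each slice `f`, the product `A[:, :, f] * B[:, :, f]'`. A dependency-free sketch on tiny assumed dimensions, emulating that slice-wise semantics with a plain loop and checking it against the `video_sim_v1` formulation:

```julia
using LinearAlgebra

# Tiny assumed dimensions: 4×4 pixels, 3 molecules, 2 frames.
P, N, F = 4, 3, 2
xᵖ = collect(range(-1, 1, length=P)); yᵖ = copy(xᵖ)
x = randn(N, F); y = randn(N, F)

# reshape makes PSFˣ and PSFʸ 3D arrays: pixels × molecules × frames.
PSFˣ = exp.(-(reshape(x, 1, size(x)...) .- xᵖ) .^ 2)
PSFʸ = exp.(-(reshape(y, 1, size(y)...) .- yᵖ) .^ 2)

# What batched_mul(PSFˣ, batched_adjoint(PSFʸ)) computes, slice by slice:
V = similar(PSFˣ, P, P, F)
for f in 1:F
    V[:, :, f] = PSFˣ[:, :, f] * PSFʸ[:, :, f]'
end

# ...which agrees with the per-frame products of video_sim_v1.
for f in 1:F
    V₁ = exp.(-(xᵖ .- transpose(view(x, :, f))) .^ 2) *
         exp.(-(view(y, :, f) .- transpose(yᵖ)) .^ 2)
    @assert V[:, :, f] ≈ V₁
end
```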
&lt;p>Benchmarking &lt;code>video_sim_GPU_v2&lt;/code> on my CPU (i7 7700K) and my GPU (GeForce GTX 1060) yields &lt;code>9.550 ms (75 allocations: 73.44 MiB)&lt;/code> and &lt;code>3.127 ms (9 GPU allocations: 73.468 MiB)&lt;/code>&lt;sup id="fnref:4">&lt;a href="#fn:4" class="footnote-ref" role="doc-noteref">4&lt;/a>&lt;/sup>, respectively. Both of them beat the multithreaded &lt;code>video_sim_v3&lt;/code>!&lt;/p>
&lt;h3 id="final-boost">Final boost&lt;/h3>
&lt;p>The benchmarks I&amp;rsquo;ve showcased so far are based on double-precision floating-point (float64) numbers. However, GPUs are frequently optimized for single-precision floating-point (float32) numbers. For instance, once switched to float32, &lt;code>video_sim_GPU_v2&lt;/code>&amp;rsquo;s benchmark becomes &lt;code>660.627 μs (11 GPU allocations: 36.736 MiB)&lt;/code>, another fivefold acceleration!&lt;/p>
&lt;p>Therefore, it is frequently advantageous to craft your GPU code to support both float64 and float32, and then assess whether altering the data type affects your outcome. If there&amp;rsquo;s no impact, simply proceed with float32!&lt;/p>
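Writing the simulator generic over the element type makes this comparison trivial: the same function handles both precisions, and one only converts the inputs. A minimal CPU-side sketch (the helper `psf_frame` and its sizes are illustrative, not the blog's actual benchmark code):

```julia
# Type-generic single-frame simulator: works for Float64 and Float32 alike.
function psf_frame(xᵖ, yᵖ, x, y)
    exp.(-(xᵖ .- transpose(x)) .^ 2) * exp.(-(y .- transpose(yᵖ)) .^ 2)
end

xᵖ = collect(range(-1, 1, length=64)); yᵖ = copy(xᵖ)
x = randn(5); y = randn(5)    # five molecules at arbitrary positions

I64 = psf_frame(xᵖ, yᵖ, x, y)                                      # Float64
I32 = psf_frame(Float32.(xᵖ), Float32.(yᵖ), Float32.(x), Float32.(y))

@assert eltype(I32) == Float32
# Check that the precision loss is acceptable for this problem.
@assert maximum(abs.(I64 .- I32)) < 1e-4
```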
&lt;h2 id="conclusion">Conclusion&lt;/h2>
&lt;p>Finally, we have arrived at the conclusion of my blog series concerning optimization techniques for scientific computation. I hope you have enjoyed this journey and learned something useful. Please feel free to get in touch with me should you wish to connect or share your thoughts!&lt;/p>
&lt;div class="footnotes" role="doc-endnotes">
&lt;hr>
&lt;ol>
&lt;li id="fn:1">
&lt;p>In case you haven&amp;rsquo;t read the preceding blogs, I strongly encourage you to take a moment to review their problem description sections. This will provide you with a better picture of the issue I&amp;rsquo;m trying to address.&amp;#160;&lt;a href="#fnref:1" class="footnote-backref" role="doc-backlink">&amp;#x21a9;&amp;#xfe0e;&lt;/a>&lt;/p>
&lt;/li>
&lt;li id="fn:2">
&lt;p>As of the date of this blog, even the most advanced desktop CPU, AMD Ryzen™ Threadripper™ PRO 5995WX (~$6,000), only has 128 threads, while frame number can easily be over 1,000.&amp;#160;&lt;a href="#fnref:2" class="footnote-backref" role="doc-backlink">&amp;#x21a9;&amp;#xfe0e;&lt;/a>&lt;/p>
&lt;/li>
&lt;li id="fn:3">
&lt;p>&lt;a href="https://en.wikipedia.org/wiki/CUDA" target="_blank" rel="noopener">CUDA&lt;/a> from Nvidia, &lt;a href="https://en.wikipedia.org/wiki/ROCm" target="_blank" rel="noopener">ROCm&lt;/a> from AMD, and &lt;a href="https://en.wikipedia.org/wiki/OneAPI_%28compute_acceleration%29" target="_blank" rel="noopener">OneAPI&lt;/a> from Intel.&amp;#160;&lt;a href="#fnref:3" class="footnote-backref" role="doc-backlink">&amp;#x21a9;&amp;#xfe0e;&lt;/a>&lt;/p>
&lt;/li>
&lt;li id="fn:4">
&lt;p>GPU memory allocation is measured by &lt;code>CUDA.@time&lt;/code>, see this &lt;a href="https://cuda.juliagpu.org/stable/development/profiling/" target="_blank" rel="noopener">page&lt;/a>.&amp;#160;&lt;a href="#fnref:4" class="footnote-backref" role="doc-backlink">&amp;#x21a9;&amp;#xfe0e;&lt;/a>&lt;/p>
&lt;/li>
&lt;/ol>
&lt;/div></description></item><item><title>Optimization Techniques in Scientific Computing (Part II)</title><link>https://lancexwq.netlify.app/post/optimizationp-ii/</link><pubDate>Mon, 07 Aug 2023 00:00:00 +0000</pubDate><guid>https://lancexwq.netlify.app/post/optimizationp-ii/</guid><description>&lt;ul>
&lt;li>&lt;a href="#introduction-and-recap">Introduction and recap&lt;/a>&lt;/li>
&lt;li>&lt;a href="#the-first-implementation">The first implementation&lt;/a>&lt;/li>
&lt;li>&lt;a href="#optimization-ideas">Optimization ideas&lt;/a>&lt;/li>
&lt;li>&lt;a href="#parallelism">Parallelism&lt;/a>
&lt;ul>
&lt;li>&lt;a href="#core-level">Core-level&lt;/a>&lt;/li>
&lt;li>&lt;a href="#node-level">Node-level&lt;/a>&lt;/li>
&lt;li>&lt;a href="#cluster-level">Cluster-level&lt;/a>&lt;/li>
&lt;li>&lt;a href="#key-consideration">Key consideration&lt;/a>&lt;/li>
&lt;/ul>
&lt;/li>
&lt;/ul>
&lt;h2 id="introduction-and-recap">Introduction and recap&lt;/h2>
&lt;p>In &lt;a href="https://lancexwq.github.io/post/optimization-i/" target="_blank" rel="noopener">my previous blog&lt;/a>, I introduced some straightforward yet valuable optimization techniques. While these techniques are generally suitable for relatively simple problems, such as the example presented in my previous blog, they may prove inadequate when dealing with more complex and realistic issues. Specifically, in the previous example, my objective was to simulate a single-molecule microscope image. However, people frequently need to process multiple independent images (e.g., frames in a video). In this blog, I will discuss additional techniques within the context of this video simulation problem.&lt;/p>
&lt;p>To quickly recap, the previous example involves calculating the total contribution, labeled as \(I\), from all molecules. These molecules are indexed by \(n\), and they relate to each pixel, which is indexed by \(i\) and \(j\). Referring to the assumptions discussed earlier, we can express the algorithm&amp;rsquo;s mathematical form as follows: &lt;/p>
\[I_{ij}=\sum_n \exp[-(x^p_i-x_n)^2-(y^p_j-y_n)^2].\]
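Because the exponential factorizes over \(x\) and \(y\), this sum is exactly a product of two Gaussian factor matrices, which is the optimized form from the previous blog. A small sketch verifying the identity against a direct evaluation of the formula (toy sizes and positions are arbitrary):

```julia
# I[i, j] = Σₙ exp(-(xᵖ[i]-x[n])²) · exp(-(yᵖ[j]-y[n])²),
# i.e. a single matrix product of the two factor matrices.
xᵖ = collect(range(-1, 1, length=8)); yᵖ = copy(xᵖ)
x = [0.1, -0.3]; y = [0.2, 0.4]          # two molecules

I = exp.(-(xᵖ .- transpose(x)) .^ 2) * exp.(-(y .- transpose(yᵖ)) .^ 2)

# Direct evaluation of the double-index sum agrees:
I_ref = [sum(exp(-(xᵖ[i] - x[n])^2 - (yᵖ[j] - y[n])^2) for n in eachindex(x))
         for i in eachindex(xᵖ), j in eachindex(yᵖ)]
@assert I ≈ I_ref
```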
&lt;p> Now, shifting our focus to the present issue that involves multiple independent images (or frames), we extend the same calculation to each individual image, denoted as \(f\). As a result, the mathematical representation for this new problem takes the following shape (where $V$ stands for video): &lt;/p>
\[V_{fij}=\sum_n \exp[-(x^p_{fi}-x_{fn})^2-(y^p_{fj}-y_{fn})^2].\]
&lt;h2 id="the-first-implementation">The first implementation&lt;/h2>
&lt;p>
&lt;figure id="figure-calculate-psfs-in-a-loop">
&lt;div class="d-flex justify-content-center">
&lt;div class="w-100" >&lt;img alt="Calculate PSFs in a loop." srcset="
/post/optimizationp-ii/fig_loop_hubbbb7445c0883065e0be80260df2da06_21327_f007f7123984859dfb5d7e1f34a5eeab.webp 400w,
/post/optimizationp-ii/fig_loop_hubbbb7445c0883065e0be80260df2da06_21327_7d741b6ab2077fd82bf6c349c9a93ecf.webp 760w,
/post/optimizationp-ii/fig_loop_hubbbb7445c0883065e0be80260df2da06_21327_1200x1200_fit_q75_h2_lanczos_3.webp 1200w"
src="https://lancexwq.netlify.app/post/optimizationp-ii/fig_loop_hubbbb7445c0883065e0be80260df2da06_21327_f007f7123984859dfb5d7e1f34a5eeab.webp"
width="535"
height="602"
loading="lazy" data-zoomable />&lt;/div>
&lt;/div>&lt;figcaption>
Calculate PSFs in a loop
&lt;/figcaption>&lt;/figure>
&lt;/p>
&lt;p>Based on the description so far, we can readily enclose a for-loop iterating over \(f\) around the previously optimized code to create the initial version of our single-molecule video simulation code&lt;sup id="fnref:1">&lt;a href="#fn:1" class="footnote-ref" role="doc-noteref">1&lt;/a>&lt;/sup>:&lt;/p>
&lt;div class="highlight">&lt;pre tabindex="0" class="chroma">&lt;code class="language-julia" data-lang="julia">&lt;span class="line">&lt;span class="cl">&lt;span class="k">function&lt;/span> &lt;span class="n">video_sim_v1&lt;/span>&lt;span class="p">(&lt;/span>&lt;span class="n">xᵖ&lt;/span>&lt;span class="p">,&lt;/span> &lt;span class="n">yᵖ&lt;/span>&lt;span class="p">,&lt;/span> &lt;span class="n">x&lt;/span>&lt;span class="p">,&lt;/span> &lt;span class="n">y&lt;/span>&lt;span class="p">)&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl"> &lt;span class="n">F&lt;/span> &lt;span class="o">=&lt;/span> &lt;span class="n">size&lt;/span>&lt;span class="p">(&lt;/span>&lt;span class="n">x&lt;/span>&lt;span class="p">,&lt;/span> &lt;span class="mi">2&lt;/span>&lt;span class="p">)&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl"> &lt;span class="n">V&lt;/span> &lt;span class="o">=&lt;/span> &lt;span class="kt">Array&lt;/span>&lt;span class="p">{&lt;/span>&lt;span class="kt">eltype&lt;/span>&lt;span class="p">(&lt;/span>&lt;span class="kt">x&lt;/span>&lt;span class="p">),&lt;/span>&lt;span class="mi">3&lt;/span>&lt;span class="p">}(&lt;/span>&lt;span class="nb">undef&lt;/span>&lt;span class="p">,&lt;/span> &lt;span class="n">length&lt;/span>&lt;span class="p">(&lt;/span>&lt;span class="n">xᵖ&lt;/span>&lt;span class="p">),&lt;/span> &lt;span class="n">length&lt;/span>&lt;span class="p">(&lt;/span>&lt;span class="n">yᵖ&lt;/span>&lt;span class="p">),&lt;/span> &lt;span class="n">F&lt;/span>&lt;span class="p">)&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl"> &lt;span class="k">for&lt;/span> &lt;span class="n">f&lt;/span> &lt;span class="k">in&lt;/span> &lt;span class="mi">1&lt;/span>&lt;span class="o">:&lt;/span>&lt;span class="n">F&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl"> &lt;span class="n">PSFˣ&lt;/span> &lt;span class="o">=&lt;/span> &lt;span class="n">exp&lt;/span>&lt;span class="o">.&lt;/span>&lt;span class="p">(&lt;/span>&lt;span class="o">-&lt;/span>&lt;span class="p">(&lt;/span>&lt;span class="n">xᵖ&lt;/span> &lt;span class="o">.-&lt;/span> &lt;span class="n">Transpose&lt;/span>&lt;span class="p">(&lt;/span>&lt;span class="n">view&lt;/span>&lt;span class="p">(&lt;/span>&lt;span class="n">x&lt;/span>&lt;span class="p">,&lt;/span> &lt;span class="o">:&lt;/span>&lt;span class="p">,&lt;/span> &lt;span class="n">f&lt;/span>&lt;span class="p">)))&lt;/span> &lt;span class="o">.^&lt;/span> &lt;span class="mi">2&lt;/span>&lt;span class="p">)&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl"> &lt;span class="n">PSFʸ&lt;/span> &lt;span class="o">=&lt;/span> &lt;span class="n">exp&lt;/span>&lt;span class="o">.&lt;/span>&lt;span class="p">(&lt;/span>&lt;span class="o">-&lt;/span>&lt;span class="p">(&lt;/span>&lt;span class="n">view&lt;/span>&lt;span class="p">(&lt;/span>&lt;span class="n">y&lt;/span>&lt;span class="p">,&lt;/span> &lt;span class="o">:&lt;/span>&lt;span class="p">,&lt;/span> &lt;span class="n">f&lt;/span>&lt;span class="p">)&lt;/span> &lt;span class="o">.-&lt;/span> &lt;span class="n">Transpose&lt;/span>&lt;span class="p">(&lt;/span>&lt;span class="n">yᵖ&lt;/span>&lt;span class="p">))&lt;/span> &lt;span class="o">.^&lt;/span> &lt;span class="mi">2&lt;/span>&lt;span class="p">)&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl"> &lt;span class="n">V&lt;/span>&lt;span class="p">[&lt;/span>&lt;span class="o">:&lt;/span>&lt;span class="p">,&lt;/span> &lt;span class="o">:&lt;/span>&lt;span class="p">,&lt;/span> &lt;span class="n">f&lt;/span>&lt;span class="p">]&lt;/span> &lt;span class="o">=&lt;/span> &lt;span class="n">PSFˣ&lt;/span> &lt;span class="o">*&lt;/span> &lt;span class="n">PSFʸ&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl"> &lt;span class="k">end&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl"> &lt;span class="k">return&lt;/span> &lt;span class="n">V&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&lt;span class="k">end&lt;/span>
&lt;/span>&lt;/span>&lt;/code>&lt;/pre>&lt;/div>&lt;p>Two points to note in the code above:&lt;/p>
&lt;ul>
&lt;li>&lt;code>x&lt;/code> and &lt;code>y&lt;/code> are both arrays of dimensions \(N\times F\), where \(N\) and \(F\) represent the number of molecules and the number of frames, respectively.&lt;/li>
&lt;li>It appears that we have made the bold assumption that all frames contain an equal number of molecules. However, this assumption is acceptable, since molecules that should not appear in a frame can be positioned far away from the field-of-view, thereby contributing nothing.&lt;/li>
&lt;/ul>
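Test data matching these conventions can be generated along these lines (the sizes are those of the benchmark below; the pixel grid and molecule positions are arbitrary choices for illustration):

```julia
P, N, F = 256, 20, 100          # pixels per side, molecules, frames

xᵖ = collect(range(0, 1, length=P))   # pixel-center coordinates
yᵖ = collect(range(0, 1, length=P))

x = rand(N, F)                  # molecule positions, one column per frame
y = rand(N, F)

# A molecule absent from frame 1 can simply be parked far away:
# exp(-(1e6)²) underflows to zero, so it contributes nothing.
x[1, 1] = 1e6

@assert size(x) == (N, F) && size(y) == (N, F)
```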
&lt;p>Benchmarking &lt;code>video_sim_v1&lt;/code> using a dataset comprising 20 molecules and 100 frames (each with 256\(\times\)256 pixels) yields &lt;code>50.927 ms (1402 allocations: 123.47 MiB)&lt;/code>. Our overarching goal entails improving upon this benchmark.&lt;/p>
&lt;h2 id="optimization-ideas">Optimization ideas&lt;/h2>
&lt;p>Before exploring new techniques, let&amp;rsquo;s take a moment to consider whether we can apply anything from &lt;a href="https://lancexwq.github.io/post/optimization-i/" target="_blank" rel="noopener">my previous blog&lt;/a>. Since we have only added one extra loop, there isn&amp;rsquo;t much opportunity to reduce memory allocation. What&amp;rsquo;s more, this extra loop cannot be easily eliminated through vectorization, as the formula specified here doesn&amp;rsquo;t align with basic matrix (or tensor) operations. Consequently, we must use other techniques to tackle this challenge.&lt;/p>
&lt;p>In this video simulation problem, it is important to note that all frames are independent of each other. As a result, there is potential to simulate frames simultaneously, or in other words, in parallel.&lt;/p>
&lt;h2 id="parallelism">Parallelism&lt;/h2>
&lt;p>Parallelizing an algorithm is much easier said than done. In view of the intricate nature of contemporary computational infrastructures, attaining parallelism in the present era involves three major tiers: core-level parallelism, node-level parallelism, and cluster-level parallelism&lt;sup id="fnref:2">&lt;a href="#fn:2" class="footnote-ref" role="doc-noteref">2&lt;/a>&lt;/sup>. In the upcoming sections, I will delve into a single common scheme within each tier and examine its relevance within the context of our specific problem.&lt;/p>
&lt;h3 id="core-level">Core-level&lt;/h3>
&lt;p>The initial question that may arise is: how is it possible to achieve parallelism on a single core? To illustrate, consider a situation where a program operates on 64-bit integers, and a processor core can fetch 256 bits of data in a single operation. In such a scenario, it becomes viable to load four integers as a vector and perform a single vectorized iteration of the original operation, potentially yielding a theoretical fourfold speedup&lt;sup id="fnref:3">&lt;a href="#fn:3" class="footnote-ref" role="doc-noteref">3&lt;/a>&lt;/sup>. This approach to parallelization is commonly known as &amp;ldquo;&lt;a href="https://en.wikipedia.org/wiki/Single_instruction,_multiple_data" target="_blank" rel="noopener">single instruction, multiple data&lt;/a>&amp;rdquo; (SIMD).&lt;/p>
&lt;p>The straightforward concept of SIMD, on one hand, allows many modern programming languages to identify points within an algorithm where SIMD can be employed and apply it automatically. On the other hand, SIMD is frequently constrained to basic operations such as addition or multiplication. Hence, whether &lt;code>video_sim_v1&lt;/code> can be enhanced through this method remains uncertain; nevertheless, it is worth a try.&lt;/p>
&lt;p>In &lt;a href="https://julialang.org/" target="_blank" rel="noopener">Julia&lt;/a>, it is possible to enforce vectorization by employing the &lt;code>@simd&lt;/code> macro, placed before a for-loop involving independent iterations. This technique results in the creation of &lt;code>video_sim_v2&lt;/code>:&lt;/p>
&lt;div class="highlight">&lt;pre tabindex="0" class="chroma">&lt;code class="language-julia" data-lang="julia">&lt;span class="line">&lt;span class="cl">&lt;span class="k">function&lt;/span> &lt;span class="n">video_sim_v2&lt;/span>&lt;span class="p">(&lt;/span>&lt;span class="n">xᵖ&lt;/span>&lt;span class="p">,&lt;/span> &lt;span class="n">yᵖ&lt;/span>&lt;span class="p">,&lt;/span> &lt;span class="n">x&lt;/span>&lt;span class="p">,&lt;/span> &lt;span class="n">y&lt;/span>&lt;span class="p">)&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl"> &lt;span class="n">F&lt;/span> &lt;span class="o">=&lt;/span> &lt;span class="n">size&lt;/span>&lt;span class="p">(&lt;/span>&lt;span class="n">x&lt;/span>&lt;span class="p">,&lt;/span> &lt;span class="mi">2&lt;/span>&lt;span class="p">)&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl"> &lt;span class="n">V&lt;/span> &lt;span class="o">=&lt;/span> &lt;span class="kt">Array&lt;/span>&lt;span class="p">{&lt;/span>&lt;span class="kt">eltype&lt;/span>&lt;span class="p">(&lt;/span>&lt;span class="kt">x&lt;/span>&lt;span class="p">),&lt;/span>&lt;span class="mi">3&lt;/span>&lt;span class="p">}(&lt;/span>&lt;span class="nb">undef&lt;/span>&lt;span class="p">,&lt;/span> &lt;span class="n">length&lt;/span>&lt;span class="p">(&lt;/span>&lt;span class="n">xᵖ&lt;/span>&lt;span class="p">),&lt;/span> &lt;span class="n">length&lt;/span>&lt;span class="p">(&lt;/span>&lt;span class="n">yᵖ&lt;/span>&lt;span class="p">),&lt;/span> &lt;span class="n">F&lt;/span>&lt;span class="p">)&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl"> &lt;span class="nd">@simd&lt;/span> &lt;span class="k">for&lt;/span> &lt;span class="n">f&lt;/span> &lt;span class="k">in&lt;/span> &lt;span class="mi">1&lt;/span>&lt;span class="o">:&lt;/span>&lt;span class="n">F&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl"> &lt;span class="n">PSFˣ&lt;/span> &lt;span class="o">=&lt;/span> &lt;span class="n">exp&lt;/span>&lt;span class="o">.&lt;/span>&lt;span class="p">(&lt;/span>&lt;span class="o">-&lt;/span>&lt;span class="p">(&lt;/span>&lt;span class="n">xᵖ&lt;/span> &lt;span class="o">.-&lt;/span> &lt;span class="n">Transpose&lt;/span>&lt;span class="p">(&lt;/span>&lt;span class="n">view&lt;/span>&lt;span class="p">(&lt;/span>&lt;span class="n">x&lt;/span>&lt;span class="p">,&lt;/span> &lt;span class="o">:&lt;/span>&lt;span class="p">,&lt;/span> &lt;span class="n">f&lt;/span>&lt;span class="p">)))&lt;/span> &lt;span class="o">.^&lt;/span> &lt;span class="mi">2&lt;/span>&lt;span class="p">)&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl"> &lt;span class="n">PSFʸ&lt;/span> &lt;span class="o">=&lt;/span> &lt;span class="n">exp&lt;/span>&lt;span class="o">.&lt;/span>&lt;span class="p">(&lt;/span>&lt;span class="o">-&lt;/span>&lt;span class="p">(&lt;/span>&lt;span class="n">view&lt;/span>&lt;span class="p">(&lt;/span>&lt;span class="n">y&lt;/span>&lt;span class="p">,&lt;/span> &lt;span class="o">:&lt;/span>&lt;span class="p">,&lt;/span> &lt;span class="n">f&lt;/span>&lt;span class="p">)&lt;/span> &lt;span class="o">.-&lt;/span> &lt;span class="n">Transpose&lt;/span>&lt;span class="p">(&lt;/span>&lt;span class="n">yᵖ&lt;/span>&lt;span class="p">))&lt;/span> &lt;span class="o">.^&lt;/span> &lt;span class="mi">2&lt;/span>&lt;span class="p">)&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl"> &lt;span class="n">V&lt;/span>&lt;span class="p">[&lt;/span>&lt;span class="o">:&lt;/span>&lt;span class="p">,&lt;/span> &lt;span class="o">:&lt;/span>&lt;span class="p">,&lt;/span> &lt;span class="n">f&lt;/span>&lt;span class="p">]&lt;/span> &lt;span class="o">=&lt;/span> &lt;span class="n">PSFˣ&lt;/span> &lt;span class="o">*&lt;/span> &lt;span class="n">PSFʸ&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl"> &lt;span class="k">end&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl"> &lt;span class="k">return&lt;/span> &lt;span class="n">V&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&lt;span class="k">end&lt;/span>
&lt;/span>&lt;/span>&lt;/code>&lt;/pre>&lt;/div>&lt;p>A benchmark analysis yields a result of &lt;code>51.017 ms (1402 allocations: 123.47 MiB)&lt;/code>, indicating a lack of performance improvement. It seems that Julia has indeed automatically vectorized the code in this case.&lt;/p>
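For contrast, the kind of loop where &lt;code>@simd&lt;/code> genuinely matters is a tight reduction over plain numeric data, where the annotation licenses the compiler to reassociate the accumulation across vector lanes. A minimal sketch (a hypothetical dot product, not part of the simulation code):

```julia
# @simd allows floating-point reassociation, so the accumulator can be
# split across SIMD lanes; each iteration touches plain numeric data.
function dot_simd(a, b)
    s = zero(eltype(a))
    @simd for i in eachindex(a, b)
        @inbounds s += a[i] * b[i]
    end
    return s
end

a, b = rand(10_000), rand(10_000)
@assert dot_simd(a, b) ≈ sum(a .* b)
```

In `video_sim_v1`, by contrast, each iteration of the `f` loop is a broadcast plus a matrix multiplication, far beyond the basic operations SIMD handles, which is consistent with the unchanged benchmark.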
&lt;h3 id="node-level">Node-level&lt;/h3>
&lt;p>Moving up a level, there is parallelism on a node (often a single computer), typically achieved through &lt;a href="https://en.wikipedia.org/wiki/Multithreading_%28computer_architecture%29" target="_blank" rel="noopener">multithreading&lt;/a>. Multithreading requires multiple processor cores (either physical or virtual&lt;sup id="fnref:4">&lt;a href="#fn:4" class="footnote-ref" role="doc-noteref">4&lt;/a>&lt;/sup>), with each core associated with a separate thread, and it facilitates the simultaneous execution of these threads, all while utilizing the same memory pool. It is important to note that implementing multithreading demands careful consideration to avoid conflicts between threads. Fortunately, developers have often shouldered much of this responsibility, alleviating users of this burden.&lt;/p>
&lt;p>In Julia, multithreading a for-loop can be as easy as follows&lt;sup id="fnref:5">&lt;a href="#fn:5" class="footnote-ref" role="doc-noteref">5&lt;/a>&lt;/sup>:&lt;/p>
&lt;div class="highlight">&lt;pre tabindex="0" class="chroma">&lt;code class="language-julia" data-lang="julia">&lt;span class="line">&lt;span class="cl">&lt;span class="k">function&lt;/span> &lt;span class="n">video_sim_v3&lt;/span>&lt;span class="p">(&lt;/span>&lt;span class="n">xᵖ&lt;/span>&lt;span class="p">,&lt;/span> &lt;span class="n">yᵖ&lt;/span>&lt;span class="p">,&lt;/span> &lt;span class="n">x&lt;/span>&lt;span class="p">,&lt;/span> &lt;span class="n">y&lt;/span>&lt;span class="p">)&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl"> &lt;span class="n">F&lt;/span> &lt;span class="o">=&lt;/span> &lt;span class="n">size&lt;/span>&lt;span class="p">(&lt;/span>&lt;span class="n">x&lt;/span>&lt;span class="p">,&lt;/span> &lt;span class="mi">2&lt;/span>&lt;span class="p">)&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl"> &lt;span class="n">V&lt;/span> &lt;span class="o">=&lt;/span> &lt;span class="kt">Array&lt;/span>&lt;span class="p">{&lt;/span>&lt;span class="kt">eltype&lt;/span>&lt;span class="p">(&lt;/span>&lt;span class="kt">x&lt;/span>&lt;span class="p">),&lt;/span>&lt;span class="mi">3&lt;/span>&lt;span class="p">}(&lt;/span>&lt;span class="nb">undef&lt;/span>&lt;span class="p">,&lt;/span> &lt;span class="n">length&lt;/span>&lt;span class="p">(&lt;/span>&lt;span class="n">xᵖ&lt;/span>&lt;span class="p">),&lt;/span> &lt;span class="n">length&lt;/span>&lt;span class="p">(&lt;/span>&lt;span class="n">yᵖ&lt;/span>&lt;span class="p">),&lt;/span> &lt;span class="n">F&lt;/span>&lt;span class="p">)&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl"> &lt;span class="n">Threads&lt;/span>&lt;span class="o">.&lt;/span>&lt;span class="nd">@threads&lt;/span> &lt;span class="k">for&lt;/span> &lt;span class="n">f&lt;/span> &lt;span class="k">in&lt;/span> &lt;span class="mi">1&lt;/span>&lt;span class="o">:&lt;/span>&lt;span class="n">F&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl"> &lt;span class="n">PSFˣ&lt;/span> &lt;span class="o">=&lt;/span> &lt;span class="n">exp&lt;/span>&lt;span class="o">.&lt;/span>&lt;span class="p">(&lt;/span>&lt;span class="o">-&lt;/span>&lt;span class="p">(&lt;/span>&lt;span class="n">xᵖ&lt;/span> &lt;span class="o">.-&lt;/span> &lt;span class="n">Transpose&lt;/span>&lt;span class="p">(&lt;/span>&lt;span class="n">view&lt;/span>&lt;span class="p">(&lt;/span>&lt;span class="n">x&lt;/span>&lt;span class="p">,&lt;/span> &lt;span class="o">:&lt;/span>&lt;span class="p">,&lt;/span> &lt;span class="n">f&lt;/span>&lt;span class="p">)))&lt;/span> &lt;span class="o">.^&lt;/span> &lt;span class="mi">2&lt;/span>&lt;span class="p">)&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl"> &lt;span class="n">PSFʸ&lt;/span> &lt;span class="o">=&lt;/span> &lt;span class="n">exp&lt;/span>&lt;span class="o">.&lt;/span>&lt;span class="p">(&lt;/span>&lt;span class="o">-&lt;/span>&lt;span class="p">(&lt;/span>&lt;span class="n">view&lt;/span>&lt;span class="p">(&lt;/span>&lt;span class="n">y&lt;/span>&lt;span class="p">,&lt;/span> &lt;span class="o">:&lt;/span>&lt;span class="p">,&lt;/span> &lt;span class="n">f&lt;/span>&lt;span class="p">)&lt;/span> &lt;span class="o">.-&lt;/span> &lt;span class="n">Transpose&lt;/span>&lt;span class="p">(&lt;/span>&lt;span class="n">yᵖ&lt;/span>&lt;span class="p">))&lt;/span> &lt;span class="o">.^&lt;/span> &lt;span class="mi">2&lt;/span>&lt;span class="p">)&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl"> &lt;span class="n">V&lt;/span>&lt;span class="p">[&lt;/span>&lt;span class="o">:&lt;/span>&lt;span class="p">,&lt;/span> &lt;span class="o">:&lt;/span>&lt;span class="p">,&lt;/span> &lt;span class="n">f&lt;/span>&lt;span class="p">]&lt;/span> &lt;span class="o">=&lt;/span> &lt;span class="n">PSFˣ&lt;/span> &lt;span class="o">*&lt;/span> &lt;span class="n">PSFʸ&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl"> &lt;span class="k">end&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl"> &lt;span class="k">return&lt;/span> &lt;span class="n">V&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&lt;span class="k">end&lt;/span>
&lt;/span>&lt;/span>&lt;/code>&lt;/pre>&lt;/div>&lt;p>My desktop computer is equipped with four physical CPU cores, which translate into eight threads. Benchmarking &lt;code>video_sim_v3&lt;/code> with all eight threads yields &lt;code>12.925 ms (1450 allocations: 123.47 MiB)&lt;/code>, a speedup of almost four times compared with &lt;code>video_sim_v1&lt;/code>.&lt;/p>
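&lt;p>For readers who want to reproduce numbers like these: Julia must be started with multiple threads (e.g. &lt;code>julia --threads=8&lt;/code>), and the timings quoted in this post match the output format of &lt;code>@btime&lt;/code> from the BenchmarkTools package. A minimal, self-contained sketch (assuming BenchmarkTools is installed; the toy workload below is a hypothetical stand-in, not &lt;code>video_sim_v3&lt;/code> itself):&lt;/p>

```julia
using BenchmarkTools

# Threads.@threads can use at most this many threads; start Julia with
# e.g. `julia --threads=8` to make all eight available.
println("Number of threads: ", Threads.nthreads())

# A toy stand-in for the frame loop: each "frame" f fills one slice of
# the output array independently, so the iterations parallelize well.
function toy_frames(F, n)
    V = Array{Float64,3}(undef, n, n, F)
    Threads.@threads for f in 1:F
        V[:, :, f] .= exp.(-abs2.(randn(n, n)))
    end
    return V
end

# @btime reports the minimum run time and total allocations, which is
# the format quoted throughout this post.
@btime toy_frames(50, 256);
```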
&lt;h3 id="cluster-level">Cluster-level&lt;/h3>
&lt;p>Now assume you have access to a cluster, which is not uncommon at universities and institutes nowadays. You could then consider modifying the algorithm to execute across multiple processors spanning numerous computers. A frequently employed strategy is &lt;a href="https://en.wikipedia.org/wiki/Multiprocessing" target="_blank" rel="noopener">multiprocessing&lt;/a>.&lt;/p>
&lt;p>With the concept of multithreading in mind, we can easily comprehend multiprocessing as the simultaneous operation of multiple processes, where each process has access only to its own designated memory space. This fundamental distinction from multithreading requires some &amp;ldquo;coding maneuvers&amp;rdquo;, as users must now decide how data is allocated to individual processes. In the context of our example problem, implementing multiprocessing requires rather major changes to the code, contradicting the very impetus driving my blog posts. Therefore, I only provide a preliminary example in &lt;a href="">this GitHub repository of mine&lt;/a>.&lt;/p>
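&lt;p>To give a flavor of what multiprocessing looks like in Julia without reproducing the full example, here is an illustrative sketch (not the actual repository code) using the standard &lt;code>Distributed&lt;/code> library, where each frame is dispatched to a worker process via &lt;code>pmap&lt;/code>; all names and test data below are hypothetical:&lt;/p>

```julia
using Distributed

addprocs(4)  # launch 4 local workers; on a cluster, a ClusterManager is used instead

# Workers do not share the main process's memory, so any code they run
# must be made available to them explicitly with @everywhere.
@everywhere using LinearAlgebra
@everywhere function frame_psf(xᵖ, yᵖ, xf, yf)
    PSFˣ = exp.(-(xᵖ .- transpose(xf)) .^ 2)
    PSFʸ = exp.(-(yf .- transpose(yᵖ)) .^ 2)
    return PSFˣ * PSFʸ
end

xᵖ = range(0, 10, length = 256)
yᵖ = range(0, 10, length = 256)
x = 10 .* rand(20, 50)  # hypothetical molecule positions, one column per frame
y = 10 .* rand(20, 50)

# pmap copies each frame's inputs to a worker and its result back.
frames = pmap(f -> frame_psf(xᵖ, yᵖ, x[:, f], y[:, f]), 1:size(x, 2))
```

&lt;p>Note that &lt;code>pmap&lt;/code> must move each frame&amp;rsquo;s inputs and outputs between processes; this data movement is the communication overhead discussed below.&lt;/p>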
&lt;h3 id="key-consideration">Key consideration&lt;/h3>
&lt;p>
&lt;figure id="figure-communication-overhead-vs-computational-cost">
&lt;div class="d-flex justify-content-center">
&lt;div class="w-100" >&lt;img alt="Communication overhead vs. computational cost." srcset="
/post/optimizationp-ii/fig1_hu8afa2284ccdba917dbb6eb0b766fe8fd_25065_d1f81b9e56e3bfa579736f62e0348575.webp 400w,
/post/optimizationp-ii/fig1_hu8afa2284ccdba917dbb6eb0b766fe8fd_25065_31b7c465da11fe051b0f382fb637da15.webp 760w,
/post/optimizationp-ii/fig1_hu8afa2284ccdba917dbb6eb0b766fe8fd_25065_1200x1200_fit_q75_h2_lanczos_3.webp 1200w"
src="https://lancexwq.netlify.app/post/optimizationp-ii/fig1_hu8afa2284ccdba917dbb6eb0b766fe8fd_25065_d1f81b9e56e3bfa579736f62e0348575.webp"
width="760"
height="302"
loading="lazy" data-zoomable />&lt;/div>
&lt;/div>&lt;figcaption>
Communication overhead vs. computational cost
&lt;/figcaption>&lt;/figure>
&lt;/p>
&lt;p>While I have aimed to keep the discussion in this blog at a surface level, it is entirely reasonable to feel confused when deciding upon a parallelization scheme. &amp;#x1f604; The crucial factor to bear in mind is that communication overhead grows with the number of processors engaged, and this overhead can eventually outweigh the performance gained from distributing the work.&lt;/p>
&lt;p>As of the post date of this blog, it is generally advisable to experiment with SIMD and multithreading in your code, as they are relatively easy to test. On the other hand, multiprocessing is worth considering only when each discrete task takes several seconds to execute and the amount of inter-process communication remains minimal.&lt;/p>
&lt;p>Although it has been a long journey, our quest remains incomplete. There is one more approach, which has been gaining popularity in recent years, that we can test: in the third part of my blog, I will discuss GPU computation.&lt;/p>
&lt;div class="footnotes" role="doc-endnotes">
&lt;hr>
&lt;ol>
&lt;li id="fn:1">
&lt;p>&lt;code>view(x, :, f)&lt;/code> serves the same purpose as &lt;code>x[:, f]&lt;/code> but with smaller memory allocation.&amp;#160;&lt;a href="#fnref:1" class="footnote-backref" role="doc-backlink">&amp;#x21a9;&amp;#xfe0e;&lt;/a>&lt;/p>
&lt;/li>
&lt;li id="fn:2">
&lt;p>Please note that these concepts are not mutually exclusive.&amp;#160;&lt;a href="#fnref:2" class="footnote-backref" role="doc-backlink">&amp;#x21a9;&amp;#xfe0e;&lt;/a>&lt;/p>
&lt;/li>
&lt;li id="fn:3">
&lt;p>You should now recognize that SIMD is closely related to vectorization (introduced in &lt;a href="https://lancexwq.github.io/post/optimization-i/" target="_blank" rel="noopener">my previous blog&lt;/a>). In fact, vectorization constitutes a specific implementation of SIMD principles.&amp;#160;&lt;a href="#fnref:3" class="footnote-backref" role="doc-backlink">&amp;#x21a9;&amp;#xfe0e;&lt;/a>&lt;/p>
&lt;/li>
&lt;li id="fn:4">
&lt;p>For example, see &lt;a href="https://en.wikipedia.org/wiki/Hyper-threading" target="_blank" rel="noopener">hyper-threading&lt;/a>.&amp;#160;&lt;a href="#fnref:4" class="footnote-backref" role="doc-backlink">&amp;#x21a9;&amp;#xfe0e;&lt;/a>&lt;/p>
&lt;/li>
&lt;li id="fn:5">
&lt;p>In order to enable multithreading, certain programming languages may require additional parameters during startup. &lt;a href="https://docs.julialang.org/en/v1/manual/multi-threading/" target="_blank" rel="noopener">This page&lt;/a> shows how to do it in Julia.&amp;#160;&lt;a href="#fnref:5" class="footnote-backref" role="doc-backlink">&amp;#x21a9;&amp;#xfe0e;&lt;/a>&lt;/p>
&lt;/li>
&lt;/ol>
&lt;/div></description></item><item><title>Optimization Techniques in Scientific Computing (Part I)</title><link>https://lancexwq.netlify.app/post/optimization-i/</link><pubDate>Fri, 28 Jul 2023 00:00:00 +0000</pubDate><guid>https://lancexwq.netlify.app/post/optimization-i/</guid><description>&lt;ul>
&lt;li>&lt;a href="#introduction">Introduction&lt;/a>&lt;/li>
&lt;li>&lt;a href="#description-of-the-problem">Description of the problem&lt;/a>&lt;/li>
&lt;li>&lt;a href="#a-naive-implementation">A naive implementation&lt;/a>&lt;/li>
&lt;li>&lt;a href="#optimization-ideas">Optimization ideas&lt;/a>
&lt;ul>
&lt;li>&lt;a href="#follow-memory-layout">Follow memory layout&lt;/a>&lt;/li>
&lt;li>&lt;a href="#reduce-memory-allocation">Reduce memory allocation&lt;/a>&lt;/li>
&lt;li>&lt;a href="#vectorization">Vectorization&lt;/a>&lt;/li>
&lt;/ul>
&lt;/li>
&lt;/ul>
&lt;h2 id="introduction">Introduction&lt;/h2>
&lt;p>Scientific research is inherently linked to the collection and analysis of data. In today&amp;rsquo;s world, the volume of data involved in most scientific research projects far exceeds what can be managed manually. As a result, scientific computing has become an essential requirement for conducting research.&lt;/p>
&lt;p>Despite the remarkable progress in modern programming languages and numerical tools, continually enhanced by scientists and software developers, non-experts still face challenges when it comes to conducting efficient computations on computers for specific research purposes. This can be attributed to the following factors:&lt;/p>
&lt;ul>
&lt;li>Achieving optimal code performance necessitates a comprehensive evaluation of various factors, including hardware specifications, software components, algorithmic efficiency, and the scale of computations.&lt;/li>
&lt;li>Scientists need to find a balance between computational efficiency and development efficiency. They cannot afford to spend excessive time conducting meticulous benchmarks and analyzing their statistics.&lt;/li>
&lt;li>While programming languages and numerical tools frequently offer extensive performance tips, they are typically presented in technical jargon, making them less accessible to non-experts.&lt;/li>
&lt;li>Additionally, the examples provided in these resources often lack interconnectedness, making it challenging to grasp their practical application collectively.&lt;/li>
&lt;/ul>
&lt;p>To tackle these concerns, this blog post aims to offer a concise overview of various general and highly effective optimization techniques that are relatively straightforward to implement. The focus will be on a problem I encountered during my research on single-molecule imaging. I will begin with a naive version of my code and gradually enhance its performance.&lt;/p>
&lt;p>&amp;#x2757; All the code below is provided as an interactive notebook at &lt;a href="https://github.com/lanceXwq/lancexwq.github.io/tree/main/content/post/optimization-I/scripts" target="_blank" rel="noopener">my GitHub&lt;/a>.&lt;/p>
&lt;h2 id="description-of-the-problem">Description of the problem&lt;/h2>
&lt;p>In my research, which involves applications such as &lt;a href="https://en.wikipedia.org/wiki/Super-resolution_microscopy" target="_blank" rel="noopener">super-resolution imaging&lt;/a>, I frequently need to simulate microscope images of individual molecules in the visible spectrum using photon-sensing devices. In this scenario, individual molecules can be accurately represented as point emitters, meaning they are so small that their structures and shapes become insignificant. However, from a physics standpoint, we cannot simply observe sharp, bright dots in microscope images due to two reasons:&lt;/p>
&lt;ul>
&lt;li>The diffraction of light causes a point object to appear as an expanded blur, often referred to as the &lt;a href="https://en.wikipedia.org/wiki/Point_spread_function" target="_blank" rel="noopener">point spread function (PSF)&lt;/a>.&lt;/li>
&lt;li>In images, PSFs are pixelated because the pixel sizes of the detectors are often comparable to the width of a PSF.&lt;/li>
&lt;/ul>
&lt;p>
&lt;figure id="figure-point-emitter-to-a-pixelated-image">
&lt;div class="d-flex justify-content-center">
&lt;div class="w-100" >&lt;img alt="A comparison of a point emitter, its Gaussian PSF, and the actual pixelated image." srcset="
/post/optimization-i/fig1_hu9b0e52963e2938886e98fe264543e483_16989_67d74e546d28d9422d922e069dcc5c3f.webp 400w,
/post/optimization-i/fig1_hu9b0e52963e2938886e98fe264543e483_16989_7a120171fe7dae3d84c3aa4d1c096184.webp 760w,
/post/optimization-i/fig1_hu9b0e52963e2938886e98fe264543e483_16989_1200x1200_fit_q75_h2_lanczos_3.webp 1200w"
src="https://lancexwq.netlify.app/post/optimization-i/fig1_hu9b0e52963e2938886e98fe264543e483_16989_67d74e546d28d9422d922e069dcc5c3f.webp"
width="760"
height="228"
loading="lazy" data-zoomable />&lt;/div>
&lt;/div>&lt;figcaption>
Point emitter to a pixelated image
&lt;/figcaption>&lt;/figure>
&lt;/p>
&lt;p>Simulating a single-molecule image involves converting point emitters into their corresponding PSFs and then pixelating the entire image. In these simulations (as well as in the corresponding experimental setups), it is often reasonable to assume that the point emitters are sufficiently far apart from each other, allowing for independent photon emissions. This means that we can calculate the PSF for each molecule individually and combine them.&lt;/p>
&lt;p>Since providing an accurate and detailed simulation process is beyond the scope of this blog, we will make the following approximations:&lt;/p>
&lt;ul>
&lt;li>The pixel size is so small that the impact of pixelization can be considered negligible.&lt;/li>
&lt;li>The PSF is a 2D Gaussian. In other words, for a molecule located at coordinates \((x_n, y_n)\), its influence on the pixel at \((x^p_i, y^p_j)\) is determined by \[PSF_{ijn}=\exp[-(x^p_i-x_n)^2-(y^p_j-y_n)^2]\] where \(n\) represents the molecule index, and \(i\) and \(j\) denote the pixel indices.&lt;/li>
&lt;/ul>
&lt;h2 id="a-naive-implementation">A naive implementation&lt;/h2>
&lt;p>Now, let&amp;rsquo;s move forward with writing a straightforward simulation code. To accomplish this, I will utilize &lt;a href="https://julialang.org/" target="_blank" rel="noopener">Julia&lt;/a> as the preferred programming language. As mentioned previously, we need to compute the PSF value for every \(n\), \(i\), and \(j\). Translating this sentence into code results in the following:&lt;/p>
&lt;div class="highlight">&lt;pre tabindex="0" class="chroma">&lt;code class="language-julia" data-lang="julia">&lt;span class="line">&lt;span class="cl">&lt;span class="k">function&lt;/span> &lt;span class="n">image_sim_v1&lt;/span>&lt;span class="p">(&lt;/span>&lt;span class="n">xᵖ&lt;/span>&lt;span class="p">,&lt;/span> &lt;span class="n">yᵖ&lt;/span>&lt;span class="p">,&lt;/span> &lt;span class="n">x&lt;/span>&lt;span class="p">,&lt;/span> &lt;span class="n">y&lt;/span>&lt;span class="p">)&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl"> &lt;span class="n">PSF&lt;/span> &lt;span class="o">=&lt;/span> &lt;span class="n">zeros&lt;/span>&lt;span class="p">(&lt;/span>&lt;span class="n">length&lt;/span>&lt;span class="p">(&lt;/span>&lt;span class="n">x&lt;/span>&lt;span class="p">),&lt;/span> &lt;span class="n">length&lt;/span>&lt;span class="p">(&lt;/span>&lt;span class="n">xᵖ&lt;/span>&lt;span class="p">),&lt;/span> &lt;span class="n">length&lt;/span>&lt;span class="p">(&lt;/span>&lt;span class="n">yᵖ&lt;/span>&lt;span class="p">))&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl"> &lt;span class="k">for&lt;/span> &lt;span class="n">n&lt;/span> &lt;span class="k">in&lt;/span> &lt;span class="n">eachindex&lt;/span>&lt;span class="p">(&lt;/span>&lt;span class="n">x&lt;/span>&lt;span class="p">),&lt;/span> &lt;span class="n">i&lt;/span> &lt;span class="k">in&lt;/span> &lt;span class="n">eachindex&lt;/span>&lt;span class="p">(&lt;/span>&lt;span class="n">xᵖ&lt;/span>&lt;span class="p">),&lt;/span> &lt;span class="n">j&lt;/span> &lt;span class="k">in&lt;/span> &lt;span class="n">eachindex&lt;/span>&lt;span class="p">(&lt;/span>&lt;span class="n">yᵖ&lt;/span>&lt;span class="p">)&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl"> &lt;span class="n">PSF&lt;/span>&lt;span class="p">[&lt;/span>&lt;span class="n">n&lt;/span>&lt;span class="p">,&lt;/span> &lt;span class="n">i&lt;/span>&lt;span class="p">,&lt;/span> &lt;span class="n">j&lt;/span>&lt;span class="p">]&lt;/span> &lt;span class="o">=&lt;/span> &lt;span class="n">exp&lt;/span>&lt;span class="p">(&lt;/span>&lt;span class="o">-&lt;/span>&lt;span class="p">(&lt;/span>&lt;span class="n">xᵖ&lt;/span>&lt;span class="p">[&lt;/span>&lt;span class="n">i&lt;/span>&lt;span class="p">]&lt;/span> &lt;span class="o">-&lt;/span> &lt;span class="n">x&lt;/span>&lt;span class="p">[&lt;/span>&lt;span class="n">n&lt;/span>&lt;span class="p">])&lt;/span>&lt;span class="o">^&lt;/span>&lt;span class="mi">2&lt;/span> &lt;span class="o">-&lt;/span> &lt;span class="p">(&lt;/span>&lt;span class="n">yᵖ&lt;/span>&lt;span class="p">[&lt;/span>&lt;span class="n">j&lt;/span>&lt;span class="p">]&lt;/span> &lt;span class="o">-&lt;/span> &lt;span class="n">y&lt;/span>&lt;span class="p">[&lt;/span>&lt;span class="n">n&lt;/span>&lt;span class="p">])&lt;/span>&lt;span class="o">^&lt;/span>&lt;span class="mi">2&lt;/span>&lt;span class="p">)&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl"> &lt;span class="k">end&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl"> &lt;span class="k">return&lt;/span> &lt;span class="n">dropdims&lt;/span>&lt;span class="p">(&lt;/span>&lt;span class="n">sum&lt;/span>&lt;span class="p">(&lt;/span>&lt;span class="n">PSF&lt;/span>&lt;span class="p">,&lt;/span> &lt;span class="n">dims&lt;/span> &lt;span class="o">=&lt;/span> &lt;span class="mi">1&lt;/span>&lt;span class="p">),&lt;/span> &lt;span class="n">dims&lt;/span> &lt;span class="o">=&lt;/span> &lt;span class="mi">1&lt;/span>&lt;span class="p">)&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&lt;span class="k">end&lt;/span>
&lt;/span>&lt;/span>&lt;/code>&lt;/pre>&lt;/div>&lt;p>For a test case with 20 emitting molecules in a 256\(\times\)256 image, this short function does get the job done (see the image below). A brief benchmark of this function yields &lt;code>13.827 ms (7 allocations: 10.50 MiB)&lt;/code>.&lt;/p>
&lt;p>
&lt;figure id="figure-many-point-emitters-to-the-final-image">
&lt;div class="d-flex justify-content-center">
&lt;div class="w-100" >&lt;img alt="Many point emitters to the final image." srcset="
/post/optimization-i/fig2_huab7a496d2b00d624a24ee68abca9161c_29393_db43ef5c1ddfba68df58f6d0fc7123dc.webp 400w,
/post/optimization-i/fig2_huab7a496d2b00d624a24ee68abca9161c_29393_f6e9aa77ea8983c3843e5c515160975e.webp 760w,
/post/optimization-i/fig2_huab7a496d2b00d624a24ee68abca9161c_29393_1200x1200_fit_q75_h2_lanczos_3.webp 1200w"
src="https://lancexwq.netlify.app/post/optimization-i/fig2_huab7a496d2b00d624a24ee68abca9161c_29393_db43ef5c1ddfba68df58f6d0fc7123dc.webp"
width="750"
height="300"
loading="lazy" data-zoomable />&lt;/div>
&lt;/div>&lt;figcaption>
Many point emitters to the final image
&lt;/figcaption>&lt;/figure>
&lt;/p>
&lt;h2 id="optimization-ideas">Optimization ideas&lt;/h2>
&lt;p>Now we can begin the process of optimizing &lt;code>image_sim_v1&lt;/code>. Let&amp;rsquo;s start with some simple modifications then move on to more involved techniques.&lt;/p>
&lt;h3 id="follow-memory-layout">Follow memory layout&lt;/h3>
&lt;p>Before explaining anything in words, let&amp;rsquo;s take a look at the following code:&lt;/p>
&lt;div class="highlight">&lt;pre tabindex="0" class="chroma">&lt;code class="language-julia" data-lang="julia">&lt;span class="line">&lt;span class="cl">&lt;span class="k">function&lt;/span> &lt;span class="n">image_sim_v2&lt;/span>&lt;span class="p">(&lt;/span>&lt;span class="n">xᵖ&lt;/span>&lt;span class="p">,&lt;/span> &lt;span class="n">yᵖ&lt;/span>&lt;span class="p">,&lt;/span> &lt;span class="n">x&lt;/span>&lt;span class="p">,&lt;/span> &lt;span class="n">y&lt;/span>&lt;span class="p">)&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl"> &lt;span class="n">PSF&lt;/span> &lt;span class="o">=&lt;/span> &lt;span class="n">zeros&lt;/span>&lt;span class="p">(&lt;/span>&lt;span class="n">length&lt;/span>&lt;span class="p">(&lt;/span>&lt;span class="n">x&lt;/span>&lt;span class="p">),&lt;/span> &lt;span class="n">length&lt;/span>&lt;span class="p">(&lt;/span>&lt;span class="n">xᵖ&lt;/span>&lt;span class="p">),&lt;/span> &lt;span class="n">length&lt;/span>&lt;span class="p">(&lt;/span>&lt;span class="n">yᵖ&lt;/span>&lt;span class="p">))&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl"> &lt;span class="k">for&lt;/span> &lt;span class="n">j&lt;/span> &lt;span class="k">in&lt;/span> &lt;span class="n">eachindex&lt;/span>&lt;span class="p">(&lt;/span>&lt;span class="n">yᵖ&lt;/span>&lt;span class="p">),&lt;/span> &lt;span class="n">i&lt;/span> &lt;span class="k">in&lt;/span> &lt;span class="n">eachindex&lt;/span>&lt;span class="p">(&lt;/span>&lt;span class="n">xᵖ&lt;/span>&lt;span class="p">),&lt;/span> &lt;span class="n">n&lt;/span> &lt;span class="k">in&lt;/span> &lt;span class="n">eachindex&lt;/span>&lt;span class="p">(&lt;/span>&lt;span class="n">x&lt;/span>&lt;span class="p">)&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl"> &lt;span class="n">PSF&lt;/span>&lt;span class="p">[&lt;/span>&lt;span class="n">n&lt;/span>&lt;span class="p">,&lt;/span> &lt;span class="n">i&lt;/span>&lt;span class="p">,&lt;/span> &lt;span class="n">j&lt;/span>&lt;span class="p">]&lt;/span> &lt;span class="o">=&lt;/span> &lt;span class="n">exp&lt;/span>&lt;span class="p">(&lt;/span>&lt;span class="o">-&lt;/span>&lt;span class="p">(&lt;/span>&lt;span class="n">xᵖ&lt;/span>&lt;span class="p">[&lt;/span>&lt;span class="n">i&lt;/span>&lt;span class="p">]&lt;/span> &lt;span class="o">-&lt;/span> &lt;span class="n">x&lt;/span>&lt;span class="p">[&lt;/span>&lt;span class="n">n&lt;/span>&lt;span class="p">])&lt;/span>&lt;span class="o">^&lt;/span>&lt;span class="mi">2&lt;/span> &lt;span class="o">-&lt;/span> &lt;span class="p">(&lt;/span>&lt;span class="n">yᵖ&lt;/span>&lt;span class="p">[&lt;/span>&lt;span class="n">j&lt;/span>&lt;span class="p">]&lt;/span> &lt;span class="o">-&lt;/span> &lt;span class="n">y&lt;/span>&lt;span class="p">[&lt;/span>&lt;span class="n">n&lt;/span>&lt;span class="p">])&lt;/span>&lt;span class="o">^&lt;/span>&lt;span class="mi">2&lt;/span>&lt;span class="p">)&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl"> &lt;span class="k">end&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl"> &lt;span class="k">return&lt;/span> &lt;span class="n">dropdims&lt;/span>&lt;span class="p">(&lt;/span>&lt;span class="n">sum&lt;/span>&lt;span class="p">(&lt;/span>&lt;span class="n">PSF&lt;/span>&lt;span class="p">,&lt;/span> &lt;span class="n">dims&lt;/span> &lt;span class="o">=&lt;/span> &lt;span class="mi">1&lt;/span>&lt;span class="p">),&lt;/span> &lt;span class="n">dims&lt;/span> &lt;span class="o">=&lt;/span> &lt;span class="mi">1&lt;/span>&lt;span class="p">)&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&lt;span class="k">end&lt;/span>
&lt;/span>&lt;/span>&lt;/code>&lt;/pre>&lt;/div>&lt;p>By comparing &lt;code>image_sim_v2&lt;/code> with &lt;code>image_sim_v1&lt;/code>, I merely altered the order of the nested for-loops. Upon benchmarking &lt;code>image_sim_v2&lt;/code>, which recorded &lt;code>8.362 ms (7 allocations: 10.50 MiB)&lt;/code>, we obtain a performance improvement of over 30% through this seemingly insignificant modification!&lt;/p>
&lt;p>The explanation is straightforward: variables are stored in a computer&amp;rsquo;s memory, and accessing this memory requires time. Objects such as arrays are usually stored in a continuous block of memory, and retrieving variables in the order they are stored naturally results in faster retrieval. By interchanging the order of the &lt;code>j&lt;/code>-loop and the &lt;code>n&lt;/code>-loop, the innermost loop of &lt;code>image_sim_v2&lt;/code> consistently operates on a contiguous memory block. It&amp;rsquo;s important to note that different programming languages may have different memory layout conventions, so it&amp;rsquo;s advisable to consult the documentation for specific details.&lt;/p>
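&lt;p>For reference, Julia (like Fortran and MATLAB) is column-major: the first index varies fastest in memory. The effect is easy to see in a self-contained toy example (array size chosen arbitrarily):&lt;/p>

```julia
# Column-major storage: A[i, j] and A[i+1, j] are adjacent in memory,
# so the inner loop should walk down a column.
function sum_by_columns(A)
    s = zero(eltype(A))
    for j in axes(A, 2), i in axes(A, 1)  # i (first index) is innermost
        s += A[i, j]
    end
    return s
end

function sum_by_rows(A)
    s = zero(eltype(A))
    for i in axes(A, 1), j in axes(A, 2)  # j is innermost: strided access
        s += A[i, j]
    end
    return s
end

A = rand(4096, 4096)
# Both compute the same sum, but the column-order version is typically
# several times faster on arrays too large for the CPU cache.
@time sum_by_columns(A)
@time sum_by_rows(A)
```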
&lt;h3 id="reduce-memory-allocation">Reduce memory allocation&lt;/h3>
&lt;p>Similar to accessing memory, memory allocation can also be a time-consuming process. In general, implementing the same algorithm with reduced memory allocation results in improved performance. Furthermore, this improvement tends to be more pronounced when dealing with larger datasets.&lt;/p>
&lt;p>Is there unnecessary memory allocation in &lt;code>image_sim_v2&lt;/code>? The answer is yes. It should be noted that there is no need to store the PSF of each molecule, as we are solely concerned with the final image. Consequently, we can allocate memory for just one image and update the value of each pixel:&lt;/p>
&lt;div class="highlight">&lt;pre tabindex="0" class="chroma">&lt;code class="language-julia" data-lang="julia">&lt;span class="line">&lt;span class="cl">&lt;span class="k">function&lt;/span> &lt;span class="n">image_sim_v3&lt;/span>&lt;span class="p">(&lt;/span>&lt;span class="n">xᵖ&lt;/span>&lt;span class="p">,&lt;/span> &lt;span class="n">yᵖ&lt;/span>&lt;span class="p">,&lt;/span> &lt;span class="n">x&lt;/span>&lt;span class="p">,&lt;/span> &lt;span class="n">y&lt;/span>&lt;span class="p">)&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl"> &lt;span class="n">PSF&lt;/span> &lt;span class="o">=&lt;/span> &lt;span class="n">zeros&lt;/span>&lt;span class="p">(&lt;/span>&lt;span class="n">length&lt;/span>&lt;span class="p">(&lt;/span>&lt;span class="n">xᵖ&lt;/span>&lt;span class="p">),&lt;/span> &lt;span class="n">length&lt;/span>&lt;span class="p">(&lt;/span>&lt;span class="n">yᵖ&lt;/span>&lt;span class="p">))&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl"> &lt;span class="k">for&lt;/span> &lt;span class="n">j&lt;/span> &lt;span class="k">in&lt;/span> &lt;span class="n">eachindex&lt;/span>&lt;span class="p">(&lt;/span>&lt;span class="n">yᵖ&lt;/span>&lt;span class="p">),&lt;/span> &lt;span class="n">i&lt;/span> &lt;span class="k">in&lt;/span> &lt;span class="n">eachindex&lt;/span>&lt;span class="p">(&lt;/span>&lt;span class="n">xᵖ&lt;/span>&lt;span class="p">),&lt;/span> &lt;span class="n">n&lt;/span> &lt;span class="k">in&lt;/span> &lt;span class="n">eachindex&lt;/span>&lt;span class="p">(&lt;/span>&lt;span class="n">x&lt;/span>&lt;span class="p">)&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl"> &lt;span class="n">PSF&lt;/span>&lt;span class="p">[&lt;/span>&lt;span class="n">i&lt;/span>&lt;span class="p">,&lt;/span> &lt;span class="n">j&lt;/span>&lt;span class="p">]&lt;/span> &lt;span class="o">+=&lt;/span> &lt;span class="n">exp&lt;/span>&lt;span class="p">(&lt;/span>&lt;span class="o">-&lt;/span>&lt;span class="p">(&lt;/span>&lt;span class="n">xᵖ&lt;/span>&lt;span class="p">[&lt;/span>&lt;span class="n">i&lt;/span>&lt;span class="p">]&lt;/span> &lt;span class="o">-&lt;/span> &lt;span class="n">x&lt;/span>&lt;span class="p">[&lt;/span>&lt;span class="n">n&lt;/span>&lt;span class="p">])&lt;/span>&lt;span class="o">^&lt;/span>&lt;span class="mi">2&lt;/span> &lt;span class="o">-&lt;/span> &lt;span class="p">(&lt;/span>&lt;span class="n">yᵖ&lt;/span>&lt;span class="p">[&lt;/span>&lt;span class="n">j&lt;/span>&lt;span class="p">]&lt;/span> &lt;span class="o">-&lt;/span> &lt;span class="n">y&lt;/span>&lt;span class="p">[&lt;/span>&lt;span class="n">n&lt;/span>&lt;span class="p">])&lt;/span>&lt;span class="o">^&lt;/span>&lt;span class="mi">2&lt;/span>&lt;span class="p">)&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl"> &lt;span class="k">end&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl"> &lt;span class="k">return&lt;/span> &lt;span class="n">PSF&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&lt;span class="k">end&lt;/span>
&lt;/span>&lt;/span>&lt;/code>&lt;/pre>&lt;/div>&lt;p>Benchmarking &lt;code>image_sim_v3&lt;/code> recorded &lt;code>8.290 ms (2 allocations: 512.05 KiB)&lt;/code>. While there is potential for further improvement by optimizing memory usage, such as exploring &amp;ldquo;mutating functions&amp;rdquo;, pursuing this path is no longer fruitful. Although &lt;code>image_sim_v3&lt;/code> reduced memory allocation by a factor of 20, it shaved less than 0.3 ms off the computation time. This outcome was expected since the test case was intentionally designed to be small. Therefore, it is now time to focus on algorithmic optimizations.&lt;/p>
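&lt;p>For completeness, a &amp;ldquo;mutating function&amp;rdquo; in Julia conventionally ends in &lt;code>!&lt;/code> and writes into a caller-supplied buffer, so repeated calls allocate nothing extra. A sketch of what that could look like here (the name &lt;code>image_sim!&lt;/code> and the test data are hypothetical):&lt;/p>

```julia
# Mutating variant: the output array is allocated once by the caller
# and reused across simulations.
function image_sim!(PSF, xᵖ, yᵖ, x, y)
    fill!(PSF, 0)
    for j in eachindex(yᵖ), i in eachindex(xᵖ), n in eachindex(x)
        PSF[i, j] += exp(-(xᵖ[i] - x[n])^2 - (yᵖ[j] - y[n])^2)
    end
    return PSF
end

xᵖ = range(0, 10, length = 256)
yᵖ = range(0, 10, length = 256)
x, y = 10 .* rand(20), 10 .* rand(20)  # 20 hypothetical molecule positions

PSF = zeros(length(xᵖ), length(yᵖ))  # allocate once...
image_sim!(PSF, xᵖ, yᵖ, x, y)        # ...then calls like this allocate nothing
```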
&lt;h3 id="vectorization">Vectorization&lt;/h3>
&lt;p>Vectorization is arguably the most crucial technique discussed in this blog. Its concept is straightforward: execute computations in a manner that aligns with array (matrix or vector) operations (such as matrix multiplication and element-wise operations). By adopting this approach, we can achieve two significant advantages:&lt;/p>
&lt;ul>
&lt;li>Eliminate the need for slow for-loops, which tend to hinder performance in languages like Python or MATLAB.&lt;/li>
&lt;li>Leverage optimized (or even parallelized) routines that greatly enhance efficiency.&lt;/li>
&lt;/ul>
&lt;p>Vectorizing an algorithm is conceptually simple; recognizing the opportunity to vectorize is often the harder part. Based on my personal experience, vectorization should at least be attempted whenever for-loops are involved.&lt;/p>
&lt;p>As a specific example, I will describe the thought process for my problem, which currently has three for-loops. First, from &lt;a href="#description-of-the-problem">this section&lt;/a>, we know the final image, denoted as \(I\), is obtained through &lt;/p>
\[I_{ij}=\sum_n PSF_{ijn},\]
&lt;p> but we can also write &lt;/p>
\[PSF_{ijn}=PSF^x_{in}PSF^y_{nj}\]
&lt;p> where &lt;/p>
\[PSF^x_{in}=\exp[-(x^p_i-x_n)^2]~\text{and}~PSF^y_{nj}=\exp[-(y^p_j-y_n)^2].\]
&lt;p> Therefore, we have &lt;/p>
\[I_{ij}=\sum_n PSF^x_{in}PSF^y_{nj}.\]
&lt;p>After this brief re-organization of math, we have arrived at an expression that highly resembles matrix multiplication! Now it is quite clear how we are going to proceed:&lt;/p>
&lt;ol>
&lt;li>Construct two matrices, \(PSF^x\) and \(PSF^y\), with array subtraction&lt;sup id="fnref:1">&lt;a href="#fn:1" class="footnote-ref" role="doc-noteref">1&lt;/a>&lt;/sup>, element-wise square, and element-wise exponential.&lt;/li>
&lt;li>Perform a matrix multiplication between \(PSF^x\) and \(PSF^y\).&lt;/li>
&lt;/ol>
&lt;p>
&lt;figure id="figure-vectorized-psf-calculation">
&lt;div class="d-flex justify-content-center">
&lt;div class="w-100" >&lt;img alt="A simple vectorization scheme." srcset="
/post/optimization-i/fig_vec_huf89d4ba5ab0989bcacb7b454c0316b33_22540_a88c5f0091cf27d6de126925d8156731.webp 400w,
/post/optimization-i/fig_vec_huf89d4ba5ab0989bcacb7b454c0316b33_22540_c44facfd7ff0afe9a2b98ab1ee8d8301.webp 760w,
/post/optimization-i/fig_vec_huf89d4ba5ab0989bcacb7b454c0316b33_22540_1200x1200_fit_q75_h2_lanczos_3.webp 1200w"
src="https://lancexwq.netlify.app/post/optimization-i/fig_vec_huf89d4ba5ab0989bcacb7b454c0316b33_22540_a88c5f0091cf27d6de126925d8156731.webp"
width="760"
height="345"
loading="lazy" data-zoomable />&lt;/div>
&lt;/div>&lt;figcaption>
Vectorized PSF calculation
&lt;/figcaption>&lt;/figure>
&lt;/p>
&lt;p>In Julia code, we have&lt;/p>
&lt;div class="highlight">&lt;pre tabindex="0" class="chroma">&lt;code class="language-julia" data-lang="julia">&lt;span class="line">&lt;span class="cl">&lt;span class="k">function&lt;/span> &lt;span class="n">image_sim_v4&lt;/span>&lt;span class="p">(&lt;/span>&lt;span class="n">xᵖ&lt;/span>&lt;span class="p">,&lt;/span> &lt;span class="n">yᵖ&lt;/span>&lt;span class="p">,&lt;/span> &lt;span class="n">x&lt;/span>&lt;span class="p">,&lt;/span> &lt;span class="n">y&lt;/span>&lt;span class="p">)&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl"> &lt;span class="n">PSFˣ&lt;/span> &lt;span class="o">=&lt;/span> &lt;span class="n">exp&lt;/span>&lt;span class="o">.&lt;/span>&lt;span class="p">(&lt;/span>&lt;span class="o">-&lt;/span>&lt;span class="p">(&lt;/span>&lt;span class="n">xᵖ&lt;/span> &lt;span class="o">.-&lt;/span> &lt;span class="n">Transpose&lt;/span>&lt;span class="p">(&lt;/span>&lt;span class="n">x&lt;/span>&lt;span class="p">))&lt;/span> &lt;span class="o">.^&lt;/span> &lt;span class="mi">2&lt;/span>&lt;span class="p">)&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl"> &lt;span class="n">PSFʸ&lt;/span> &lt;span class="o">=&lt;/span> &lt;span class="n">exp&lt;/span>&lt;span class="o">.&lt;/span>&lt;span class="p">(&lt;/span>&lt;span class="o">-&lt;/span>&lt;span class="p">(&lt;/span>&lt;span class="n">y&lt;/span> &lt;span class="o">.-&lt;/span> &lt;span class="n">Transpose&lt;/span>&lt;span class="p">(&lt;/span>&lt;span class="n">yᵖ&lt;/span>&lt;span class="p">))&lt;/span> &lt;span class="o">.^&lt;/span> &lt;span class="mi">2&lt;/span>&lt;span class="p">)&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl"> &lt;span class="k">return&lt;/span> &lt;span class="n">PSFˣ&lt;/span> &lt;span class="o">*&lt;/span> &lt;span class="n">PSFʸ&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&lt;span class="k">end&lt;/span>
&lt;/span>&lt;/span>&lt;/code>&lt;/pre>&lt;/div>&lt;p>&lt;code>image_sim_v4&lt;/code>&amp;rsquo;s benchmark recorded &lt;code>101.157 μs (14 allocations: 752.33 KiB)&lt;/code>, 80x faster than &lt;code>image_sim_v3&lt;/code>!&lt;/p>
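&lt;p>Whenever an algorithm is reorganized like this, it is worth verifying that the optimized version still produces the same image up to floating-point rounding. A quick self-contained sanity check (restating both functions, with hypothetical test data):&lt;/p>

```julia
using LinearAlgebra

# Loop-based version (image_sim_v3), restated from above.
function image_sim_v3(xᵖ, yᵖ, x, y)
    PSF = zeros(length(xᵖ), length(yᵖ))
    for j in eachindex(yᵖ), i in eachindex(xᵖ), n in eachindex(x)
        PSF[i, j] += exp(-(xᵖ[i] - x[n])^2 - (yᵖ[j] - y[n])^2)
    end
    return PSF
end

# Vectorized version (image_sim_v4), restated from above.
function image_sim_v4(xᵖ, yᵖ, x, y)
    PSFˣ = exp.(-(xᵖ .- Transpose(x)) .^ 2)
    PSFʸ = exp.(-(y .- Transpose(yᵖ)) .^ 2)
    return PSFˣ * PSFʸ
end

# Hypothetical test data: 20 molecules on a 256×256 pixel grid.
xᵖ = collect(range(0, 10, length = 256))
yᵖ = collect(range(0, 10, length = 256))
x, y = 10 .* rand(20), 10 .* rand(20)

# The two versions sum the per-molecule PSFs in different orders, so
# they agree up to rounding rather than bit-for-bit.
@assert image_sim_v3(xᵖ, yᵖ, x, y) ≈ image_sim_v4(xᵖ, yᵖ, x, y)
```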
&lt;p>At this stage, we have essentially reached the limit of potential improvements for this simple example. Additional optimizations could involve the utilization of hardware-specific math libraries&lt;sup id="fnref:2">&lt;a href="#fn:2" class="footnote-ref" role="doc-noteref">2&lt;/a>&lt;/sup> and datatype-specific operations, but these aspects are beyond the scope of this blog. However, this does not signal the end of our discussion, as we can introduce a slightly more complex (and realistic) example that allows us to explore more advanced optimization techniques. I will continue this discussion in my next blog post.&lt;/p>
&lt;div class="footnotes" role="doc-endnotes">
&lt;hr>
&lt;ol>
&lt;li id="fn:1">
&lt;p>Refer to &lt;a href="https://www.mathworks.com/help/matlab/matlab_prog/compatible-array-sizes-for-basic-operations.html" target="_blank" rel="noopener">this webpage&lt;/a> for compatible array sizes regarding array subtraction and more.&amp;#160;&lt;a href="#fnref:1" class="footnote-backref" role="doc-backlink">&amp;#x21a9;&amp;#xfe0e;&lt;/a>&lt;/p>
&lt;/li>
&lt;li id="fn:2">
&lt;p>Such as Intel Math Kernel Library (MKL) and AMD Optimizing CPU Libraries (AOCL).&amp;#160;&lt;a href="#fnref:2" class="footnote-backref" role="doc-backlink">&amp;#x21a9;&amp;#xfe0e;&lt;/a>&lt;/p>
&lt;/li>
&lt;/ol>
&lt;/div></description></item></channel></rss>