<?xml version="1.0" encoding="utf-8" standalone="yes"?><rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom"><channel><title>Optimization | Lance Xu</title><link>https://lancexwq.netlify.app/tag/optimization/</link><atom:link href="https://lancexwq.netlify.app/tag/optimization/index.xml" rel="self" type="application/rss+xml"/><description>Optimization</description><generator>Hugo Blox Builder (https://hugoblox.com)</generator><language>en-us</language><lastBuildDate>Tue, 08 Aug 2023 00:00:00 +0000</lastBuildDate><image><url>https://lancexwq.netlify.app/media/icon_hu0b7a4cb9992c9ac0e91bd28ffd38dd00_9727_512x512_fill_lanczos_center_3.png</url><title>Optimization</title><link>https://lancexwq.netlify.app/tag/optimization/</link></image><item><title>Optimization Techniques in Scientific Computing (Part III)</title><link>https://lancexwq.netlify.app/post/optimization-iii/</link><pubDate>Tue, 08 Aug 2023 00:00:00 +0000</pubDate><guid>https://lancexwq.netlify.app/post/optimization-iii/</guid><description>&lt;ul>
&lt;li>&lt;a href="#introduction-and-recap">Introduction and recap&lt;/a>&lt;/li>
&lt;li>&lt;a href="#run-code-on-a-gpu">Run code on a GPU&lt;/a>
&lt;ul>
&lt;li>&lt;a href="#first-attempt">First attempt&lt;/a>&lt;/li>
&lt;li>&lt;a href="#further-vectorization">Further vectorization&lt;/a>&lt;/li>
&lt;li>&lt;a href="#batch-matrix-multiplication">Batch matrix multiplication&lt;/a>&lt;/li>
&lt;li>&lt;a href="#final-boost">Final boost&lt;/a>&lt;/li>
&lt;/ul>
&lt;/li>
&lt;li>&lt;a href="#conclusion">Conclusion&lt;/a>&lt;/li>
&lt;/ul>
&lt;h2 id="introduction-and-recap">Introduction and recap&lt;/h2>
&lt;p>In my &lt;a href="https://lancexwq.github.io/tag/optimization/?q=Optimization%20Techniques%20in%20Scientific%20Computing" target="_blank" rel="noopener">previous two blogs&lt;/a> on optimization techniques in scientific computing, I discussed concepts such as vectorization and parallelism in the context of my single-molecule video simulation&lt;sup id="fnref:1">&lt;a href="#fn:1" class="footnote-ref" role="doc-noteref">1&lt;/a>&lt;/sup>, which can be mathematically formulated as calculating the 3D array \(V\) with &lt;/p>
\[V_{fij}=\sum_n \exp[-(x^p_{fi}-x_{fn})^2-(y^p_{fj}-y_{fn})^2].\]
&lt;p> We started with &lt;code>video_sim_v1&lt;/code>&lt;/p>
&lt;div class="highlight">&lt;pre tabindex="0" class="chroma">&lt;code class="language-julia" data-lang="julia">&lt;span class="line">&lt;span class="cl">&lt;span class="k">function&lt;/span> &lt;span class="n">video_sim_v1&lt;/span>&lt;span class="p">(&lt;/span>&lt;span class="n">xᵖ&lt;/span>&lt;span class="p">,&lt;/span> &lt;span class="n">yᵖ&lt;/span>&lt;span class="p">,&lt;/span> &lt;span class="n">x&lt;/span>&lt;span class="p">,&lt;/span> &lt;span class="n">y&lt;/span>&lt;span class="p">)&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl"> &lt;span class="n">F&lt;/span> &lt;span class="o">=&lt;/span> &lt;span class="n">size&lt;/span>&lt;span class="p">(&lt;/span>&lt;span class="n">x&lt;/span>&lt;span class="p">,&lt;/span> &lt;span class="mi">2&lt;/span>&lt;span class="p">)&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl"> &lt;span class="n">v&lt;/span> &lt;span class="o">=&lt;/span> &lt;span class="kt">Array&lt;/span>&lt;span class="p">{&lt;/span>&lt;span class="kt">eltype&lt;/span>&lt;span class="p">(&lt;/span>&lt;span class="kt">x&lt;/span>&lt;span class="p">),&lt;/span>&lt;span class="mi">3&lt;/span>&lt;span class="p">}(&lt;/span>&lt;span class="nb">undef&lt;/span>&lt;span class="p">,&lt;/span> &lt;span class="n">length&lt;/span>&lt;span class="p">(&lt;/span>&lt;span class="n">xᵖ&lt;/span>&lt;span class="p">),&lt;/span> &lt;span class="n">length&lt;/span>&lt;span class="p">(&lt;/span>&lt;span class="n">yᵖ&lt;/span>&lt;span class="p">),&lt;/span> &lt;span class="n">F&lt;/span>&lt;span class="p">)&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl"> &lt;span class="k">for&lt;/span> &lt;span class="n">f&lt;/span> &lt;span class="k">in&lt;/span> &lt;span class="mi">1&lt;/span>&lt;span class="o">:&lt;/span>&lt;span class="n">F&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl"> &lt;span class="n">PSFˣ&lt;/span> &lt;span class="o">=&lt;/span> &lt;span class="n">exp&lt;/span>&lt;span class="o">.&lt;/span>&lt;span class="p">(&lt;/span>&lt;span class="o">-&lt;/span>&lt;span class="p">(&lt;/span>&lt;span class="n">xᵖ&lt;/span> &lt;span class="o">.-&lt;/span> &lt;span class="n">Transpose&lt;/span>&lt;span class="p">(&lt;/span>&lt;span class="n">view&lt;/span>&lt;span class="p">(&lt;/span>&lt;span class="n">x&lt;/span>&lt;span class="p">,&lt;/span> &lt;span class="o">:&lt;/span>&lt;span class="p">,&lt;/span> &lt;span class="n">f&lt;/span>&lt;span class="p">)))&lt;/span> &lt;span class="o">.^&lt;/span> &lt;span class="mi">2&lt;/span>&lt;span class="p">)&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl"> &lt;span class="n">PSFʸ&lt;/span> &lt;span class="o">=&lt;/span> &lt;span class="n">exp&lt;/span>&lt;span class="o">.&lt;/span>&lt;span class="p">(&lt;/span>&lt;span class="o">-&lt;/span>&lt;span class="p">(&lt;/span>&lt;span class="n">view&lt;/span>&lt;span class="p">(&lt;/span>&lt;span class="n">y&lt;/span>&lt;span class="p">,&lt;/span> &lt;span class="o">:&lt;/span>&lt;span class="p">,&lt;/span> &lt;span class="n">f&lt;/span>&lt;span class="p">)&lt;/span> &lt;span class="o">.-&lt;/span> &lt;span class="n">Transpose&lt;/span>&lt;span class="p">(&lt;/span>&lt;span class="n">yᵖ&lt;/span>&lt;span class="p">))&lt;/span> &lt;span class="o">.^&lt;/span> &lt;span class="mi">2&lt;/span>&lt;span class="p">)&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl"> &lt;span class="n">v&lt;/span>&lt;span class="p">[&lt;/span>&lt;span class="o">:&lt;/span>&lt;span class="p">,&lt;/span> &lt;span class="o">:&lt;/span>&lt;span class="p">,&lt;/span> &lt;span class="n">f&lt;/span>&lt;span class="p">]&lt;/span> &lt;span class="o">=&lt;/span> &lt;span class="n">PSFˣ&lt;/span> &lt;span class="o">*&lt;/span> &lt;span class="n">PSFʸ&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl"> &lt;span class="k">end&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl"> &lt;span class="k">return&lt;/span> &lt;span class="n">v&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&lt;span class="k">end&lt;/span>
&lt;/span>&lt;/span>&lt;/code>&lt;/pre>&lt;/div>&lt;p>then found that introducing multithreading as follows significantly improves the performance.&lt;/p>
&lt;div class="highlight">&lt;pre tabindex="0" class="chroma">&lt;code class="language-julia" data-lang="julia">&lt;span class="line">&lt;span class="cl">&lt;span class="k">function&lt;/span> &lt;span class="n">video_sim_v3&lt;/span>&lt;span class="p">(&lt;/span>&lt;span class="n">xᵖ&lt;/span>&lt;span class="p">,&lt;/span> &lt;span class="n">yᵖ&lt;/span>&lt;span class="p">,&lt;/span> &lt;span class="n">x&lt;/span>&lt;span class="p">,&lt;/span> &lt;span class="n">y&lt;/span>&lt;span class="p">)&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl"> &lt;span class="n">F&lt;/span> &lt;span class="o">=&lt;/span> &lt;span class="n">size&lt;/span>&lt;span class="p">(&lt;/span>&lt;span class="n">x&lt;/span>&lt;span class="p">,&lt;/span> &lt;span class="mi">2&lt;/span>&lt;span class="p">)&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl"> &lt;span class="n">v&lt;/span> &lt;span class="o">=&lt;/span> &lt;span class="kt">Array&lt;/span>&lt;span class="p">{&lt;/span>&lt;span class="kt">eltype&lt;/span>&lt;span class="p">(&lt;/span>&lt;span class="kt">x&lt;/span>&lt;span class="p">),&lt;/span>&lt;span class="mi">3&lt;/span>&lt;span class="p">}(&lt;/span>&lt;span class="nb">undef&lt;/span>&lt;span class="p">,&lt;/span> &lt;span class="n">length&lt;/span>&lt;span class="p">(&lt;/span>&lt;span class="n">xᵖ&lt;/span>&lt;span class="p">),&lt;/span> &lt;span class="n">length&lt;/span>&lt;span class="p">(&lt;/span>&lt;span class="n">yᵖ&lt;/span>&lt;span class="p">),&lt;/span> &lt;span class="n">F&lt;/span>&lt;span class="p">)&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl"> &lt;span class="n">Threads&lt;/span>&lt;span class="o">.&lt;/span>&lt;span class="nd">@threads&lt;/span> &lt;span class="k">for&lt;/span> &lt;span class="n">f&lt;/span> &lt;span class="k">in&lt;/span> &lt;span class="mi">1&lt;/span>&lt;span class="o">:&lt;/span>&lt;span class="n">F&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl"> &lt;span class="n">PSFˣ&lt;/span> &lt;span class="o">=&lt;/span> &lt;span class="n">exp&lt;/span>&lt;span class="o">.&lt;/span>&lt;span class="p">(&lt;/span>&lt;span class="o">-&lt;/span>&lt;span class="p">(&lt;/span>&lt;span class="n">xᵖ&lt;/span> &lt;span class="o">.-&lt;/span> &lt;span class="n">Transpose&lt;/span>&lt;span class="p">(&lt;/span>&lt;span class="n">view&lt;/span>&lt;span class="p">(&lt;/span>&lt;span class="n">x&lt;/span>&lt;span class="p">,&lt;/span> &lt;span class="o">:&lt;/span>&lt;span class="p">,&lt;/span> &lt;span class="n">f&lt;/span>&lt;span class="p">)))&lt;/span> &lt;span class="o">.^&lt;/span> &lt;span class="mi">2&lt;/span>&lt;span class="p">)&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl"> &lt;span class="n">PSFʸ&lt;/span> &lt;span class="o">=&lt;/span> &lt;span class="n">exp&lt;/span>&lt;span class="o">.&lt;/span>&lt;span class="p">(&lt;/span>&lt;span class="o">-&lt;/span>&lt;span class="p">(&lt;/span>&lt;span class="n">view&lt;/span>&lt;span class="p">(&lt;/span>&lt;span class="n">y&lt;/span>&lt;span class="p">,&lt;/span> &lt;span class="o">:&lt;/span>&lt;span class="p">,&lt;/span> &lt;span class="n">f&lt;/span>&lt;span class="p">)&lt;/span> &lt;span class="o">.-&lt;/span> &lt;span class="n">Transpose&lt;/span>&lt;span class="p">(&lt;/span>&lt;span class="n">yᵖ&lt;/span>&lt;span class="p">))&lt;/span> &lt;span class="o">.^&lt;/span> &lt;span class="mi">2&lt;/span>&lt;span class="p">)&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl"> &lt;span class="n">v&lt;/span>&lt;span class="p">[&lt;/span>&lt;span class="o">:&lt;/span>&lt;span class="p">,&lt;/span> &lt;span class="o">:&lt;/span>&lt;span class="p">,&lt;/span> &lt;span class="n">f&lt;/span>&lt;span class="p">]&lt;/span> &lt;span class="o">=&lt;/span> &lt;span class="n">PSFˣ&lt;/span> &lt;span class="o">*&lt;/span> &lt;span class="n">PSFʸ&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl"> &lt;span class="k">end&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl"> &lt;span class="k">return&lt;/span> &lt;span class="n">v&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&lt;span class="k">end&lt;/span>
&lt;/span>&lt;/span>&lt;/code>&lt;/pre>&lt;/div>&lt;p>Eventually, &lt;code>video_sim_v3&lt;/code> yields a benchmark of &lt;code>12.925 ms (1450 allocations: 123.47 MiB)&lt;/code> on my eight-thread Intel i7-7700K.&lt;/p>
&lt;p>In Part II of this blog series, I also loosely alluded to the dilemma we face in further optimization:&lt;/p>
&lt;ul>
&lt;li>The number of independent frames can be much larger than the number of threads on a CPU&lt;sup id="fnref:2">&lt;a href="#fn:2" class="footnote-ref" role="doc-noteref">2&lt;/a>&lt;/sup>.&lt;/li>
&lt;li>Multiprocessing on a cluster incurs significant communication overhead and development challenges, which ultimately outweigh the potential performance gains.&lt;/li>
&lt;/ul>
&lt;p>Basically, we want a solution that can efficiently execute numerous relatively lightweight computational tasks in parallel, while maintaining minimal communication overhead. Interestingly, such a solution already exists, and it takes the form of a GPU. According to the experts from &lt;a href="https://www.intel.com/content/www/us/en/products/docs/processors/cpu-vs-gpu.html" target="_blank" rel="noopener">Intel&lt;/a>,&lt;/p>
&lt;blockquote>
&lt;p>The GPU is a processor that is made up of many smaller and more specialized cores. By working together, the cores deliver massive performance when a processing task can be divided up and processed across many cores.&lt;/p>
&lt;/blockquote>
&lt;h2 id="run-code-on-a-gpu">Run code on a GPU&lt;/h2>
&lt;p>Originally popularized in the deep learning community, accelerating scientific computations with GPUs is rapidly gaining attention from researchers across various domains. Thanks to the continuous efforts of scientists and software developers, writing GPU code has become much easier than it used to be. In some circumstances, once properly set up, code originally written for CPUs can run on a GPU after changing merely a few lines.&lt;/p>
&lt;p>At the moment, the three leading chip companies, Nvidia, AMD, and Intel, all offer their own platforms for GPU computation&lt;sup id="fnref:3">&lt;a href="#fn:3" class="footnote-ref" role="doc-noteref">3&lt;/a>&lt;/sup>. Due to its relatively higher popularity, I will use &lt;a href="https://en.wikipedia.org/wiki/CUDA" target="_blank" rel="noopener">CUDA&lt;/a> from Nvidia in this blog. For detailed guidance on installation and integration with Julia, please refer to &lt;a href="https://github.com/JuliaGPU/CUDA.jl" target="_blank" rel="noopener">CUDA.jl&lt;/a> and &lt;a href="https://cuda.juliagpu.org/stable/" target="_blank" rel="noopener">its documentation&lt;/a>.&lt;/p>
&lt;h3 id="first-attempt">First attempt&lt;/h3>
&lt;p>Once &lt;code>CUDA.jl&lt;/code> is installed, verified, and loaded, running &lt;code>video_sim_v1&lt;/code> on an Nvidia GPU simply requires passing the arguments as CUDA arrays, as in &lt;code>video_sim_v1(CuArray(xᵖ), CuArray(yᵖ), CuArray(x), CuArray(y))&lt;/code>.&lt;/p>
&lt;p>You may expect magic to happen, but instead a warning (or sometimes an error) pops up regarding &lt;code>performing scalar indexing on task&lt;/code>. What&amp;rsquo;s more, the warning message also says &lt;code>such implementations *do not* execute on the GPU, but very slowly on the CPU&lt;/code>, indicating that our first attempt has failed. The cause of this failure is clear from the warning message: CUDA does not accept scalar indexing of a GPU array, such as &lt;code>v[:, :, f]&lt;/code>. Consequently, the solution entails completely vectorizing the code, eliminating the for-loop over \(f\).&lt;/p>
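&lt;p>As a sketch of this behavior (assuming a working &lt;code>CUDA.jl&lt;/code> setup on an Nvidia GPU), the scalar-indexing fallback can be turned into a hard error, and re-enabled locally for debugging:&lt;/p>

```julia
using CUDA  # requires an Nvidia GPU and a working CUDA.jl installation

# Turn the scalar-indexing fallback into an error instead of a warning,
# so accidental element-by-element execution on the CPU cannot slip through.
CUDA.allowscalar(false)

v = CUDA.zeros(Float64, 256, 256, 100)

# v[1, 1, 1]               # would now throw a scalar-indexing error

# For debugging only, opt back in for a single small expression:
CUDA.@allowscalar v[1, 1, 1]
```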
&lt;h3 id="further-vectorization">Further vectorization&lt;/h3>
&lt;p>As stated multiple times thus far, our problem does not align directly with any basic vector operation. However, we can be clever and slightly restructure our data to make vectorization possible. One approach is illustrated in the following figure.&lt;/p>
&lt;p>
&lt;figure id="figure-block-diagonal-matrix-multiplication">
&lt;div class="d-flex justify-content-center">
&lt;div class="w-100" >&lt;img alt="Block diagonal matrix multiplication to calculate PSF." srcset="
/post/optimization-iii/fig1_hufbdff24dc123ad47906c4222b1604286_18213_04666b017c909e48e2e60460767337f0.webp 400w,
/post/optimization-iii/fig1_hufbdff24dc123ad47906c4222b1604286_18213_94cf65fdceef65aa92e944897afac24f.webp 760w,
/post/optimization-iii/fig1_hufbdff24dc123ad47906c4222b1604286_18213_1200x1200_fit_q75_h2_lanczos_3.webp 1200w"
src="https://lancexwq.netlify.app/post/optimization-iii/fig1_hufbdff24dc123ad47906c4222b1604286_18213_04666b017c909e48e2e60460767337f0.webp"
width="760"
height="398"
loading="lazy" data-zoomable />&lt;/div>
&lt;/div>&lt;figcaption>
Block diagonal matrix multiplication
&lt;/figcaption>&lt;/figure>
&lt;/p>
&lt;p>Here, \(PSF^x\), \(PSF^y\), and \(V\) are restructured as block-diagonal matrices. Blocks sharing the same color correspond to the same frame, while any remaining elements within these matrices are set to zero, visually represented as white-colored sections. As a result, all the frames can be simulated through one matrix multiplication.&lt;/p>
&lt;p>While this approach is indeed valid, I would not recommend implementing it yourself, because of the potentially vast dimensions of these block matrices: a naive implementation without careful memory-allocation handling could greatly worsen overall performance.&lt;/p>
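&lt;p>For illustration only, the block-diagonal idea can be sketched with Julia&amp;rsquo;s standard-library &lt;code>SparseArrays&lt;/code>, whose &lt;code>blockdiag&lt;/code> keeps the zero blocks implicit; the toy frame sizes below are made up, and this demonstrates the structure rather than a recommended implementation:&lt;/p>

```julia
using SparseArrays

# Two toy "frames": A-blocks play the role of PSFˣ (pixels × molecules),
# B-blocks the role of transpose(PSFʸ) (molecules × pixels).
A1, A2 = rand(4, 2), rand(4, 2)
B1, B2 = rand(2, 4), rand(2, 4)

# One multiplication of two block-diagonal matrices computes all frames;
# sparse storage avoids materializing the off-diagonal zeros.
V = blockdiag(sparse(A1), sparse(A2)) * blockdiag(sparse(B1), sparse(B2))

# The diagonal blocks of V are exactly the per-frame products:
V[1:4, 1:4] ≈ A1 * B1  # frame 1
V[5:8, 5:8] ≈ A2 * B2  # frame 2
```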
&lt;p>Are there better solutions? The answer is yes. This problem we are facing, namely numerous independent (and typically small) matrix multiplications of identical sizes, is not unique to us. In fact, it is common enough that people have named it &amp;ldquo;batch matrix multiplication&amp;rdquo;.&lt;/p>
&lt;h3 id="batch-matrix-multiplication">Batch matrix multiplication&lt;/h3>
&lt;p>
&lt;figure id="figure-batched-matrix-multiplication">
&lt;div class="d-flex justify-content-center">
&lt;div class="w-100" >&lt;img alt="Batched matrix multiplication to calculate PSF." srcset="
/post/optimization-iii/fig2_hue4fe31342a3d6a87683d56c97d6a780a_10142_f3d3ba72c75220f005441259b1add0d9.webp 400w,
/post/optimization-iii/fig2_hue4fe31342a3d6a87683d56c97d6a780a_10142_ee686c2dfb8920e0b6e127c982785240.webp 760w,
/post/optimization-iii/fig2_hue4fe31342a3d6a87683d56c97d6a780a_10142_1200x1200_fit_q75_h2_lanczos_3.webp 1200w"
src="https://lancexwq.netlify.app/post/optimization-iii/fig2_hue4fe31342a3d6a87683d56c97d6a780a_10142_f3d3ba72c75220f005441259b1add0d9.webp"
width="590"
height="243"
loading="lazy" data-zoomable />&lt;/div>
&lt;/div>&lt;figcaption>
Batched matrix multiplication
&lt;/figcaption>&lt;/figure>
&lt;/p>
&lt;p>Although batch matrix multiplication is widely recognized and efficiently implemented, it may not always be easy to find the correct function within your programming language. Occasionally, batch matrix multiplication goes by different names. For instance, in MATLAB, it is referred to as &amp;ldquo;&lt;a href="https://www.mathworks.com/help/matlab/ref/pagemtimes.html" target="_blank" rel="noopener">page-wise matrix multiplication&lt;/a>&amp;rdquo;. In certain cases, additional packages are required, and quite often, these packages belong to deep-learning libraries! In Python, you can call &lt;code>torch.bmm&lt;/code> from &lt;a href="https://pytorch.org/docs/stable/generated/torch.bmm.html" target="_blank" rel="noopener">PyTorch&lt;/a>, while Julia offers &lt;code>batched_mul&lt;/code> through &lt;a href="https://fluxml.ai/Flux.jl/stable/models/nnlib/#NNlib.batched_mul" target="_blank" rel="noopener">Flux.jl&lt;/a>. Using &lt;code>batched_mul&lt;/code>, we can write a new version as follows:&lt;/p>
&lt;div class="highlight">&lt;pre tabindex="0" class="chroma">&lt;code class="language-julia" data-lang="julia">&lt;span class="line">&lt;span class="cl">&lt;span class="k">function&lt;/span> &lt;span class="n">video_sim_GPU_v2&lt;/span>&lt;span class="p">(&lt;/span>&lt;span class="n">xᵖ&lt;/span>&lt;span class="p">,&lt;/span> &lt;span class="n">yᵖ&lt;/span>&lt;span class="p">,&lt;/span> &lt;span class="n">x&lt;/span>&lt;span class="p">,&lt;/span> &lt;span class="n">y&lt;/span>&lt;span class="p">)&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl"> &lt;span class="n">PSFˣ&lt;/span> &lt;span class="o">=&lt;/span> &lt;span class="n">exp&lt;/span>&lt;span class="o">.&lt;/span>&lt;span class="p">(&lt;/span>&lt;span class="o">-&lt;/span>&lt;span class="p">(&lt;/span>&lt;span class="n">reshape&lt;/span>&lt;span class="p">(&lt;/span>&lt;span class="n">x&lt;/span>&lt;span class="p">,&lt;/span> &lt;span class="mi">1&lt;/span>&lt;span class="p">,&lt;/span> &lt;span class="n">size&lt;/span>&lt;span class="p">(&lt;/span>&lt;span class="n">x&lt;/span>&lt;span class="p">)&lt;/span>&lt;span class="o">...&lt;/span>&lt;span class="p">)&lt;/span> &lt;span class="o">.-&lt;/span> &lt;span class="n">xᵖ&lt;/span>&lt;span class="p">)&lt;/span> &lt;span class="o">.^&lt;/span> &lt;span class="mi">2&lt;/span>&lt;span class="p">)&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl"> &lt;span class="n">PSFʸ&lt;/span> &lt;span class="o">=&lt;/span> &lt;span class="n">exp&lt;/span>&lt;span class="o">.&lt;/span>&lt;span class="p">(&lt;/span>&lt;span class="o">-&lt;/span>&lt;span class="p">(&lt;/span>&lt;span class="n">reshape&lt;/span>&lt;span class="p">(&lt;/span>&lt;span class="n">y&lt;/span>&lt;span class="p">,&lt;/span> &lt;span class="mi">1&lt;/span>&lt;span class="p">,&lt;/span> &lt;span class="n">size&lt;/span>&lt;span class="p">(&lt;/span>&lt;span class="n">y&lt;/span>&lt;span class="p">)&lt;/span>&lt;span class="o">...&lt;/span>&lt;span class="p">)&lt;/span> &lt;span class="o">.-&lt;/span> &lt;span class="n">yᵖ&lt;/span>&lt;span class="p">)&lt;/span> &lt;span class="o">.^&lt;/span> &lt;span class="mi">2&lt;/span>&lt;span class="p">)&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl"> &lt;span class="k">return&lt;/span> &lt;span class="n">batched_mul&lt;/span>&lt;span class="p">(&lt;/span>&lt;span class="n">PSFˣ&lt;/span>&lt;span class="p">,&lt;/span> &lt;span class="n">batched_adjoint&lt;/span>&lt;span class="p">(&lt;/span>&lt;span class="n">PSFʸ&lt;/span>&lt;span class="p">))&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&lt;span class="k">end&lt;/span>
&lt;/span>&lt;/span>&lt;/code>&lt;/pre>&lt;/div>&lt;p>Here, &lt;code>reshape&lt;/code> is called to construct &lt;code>PSFˣ&lt;/code> and &lt;code>PSFʸ&lt;/code> as 3D arrays, and &lt;code>batched_adjoint&lt;/code> is just the &amp;ldquo;batched&amp;rdquo; version of transpose.&lt;/p>
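&lt;p>To make the semantics concrete, here is a plain-Julia reference sketch (the helper names &lt;code>batched_mul_ref&lt;/code> and &lt;code>batched_adjoint_ref&lt;/code> are hypothetical, written only to spell out what &lt;code>batched_mul&lt;/code> and &lt;code>batched_adjoint&lt;/code> compute slice by slice):&lt;/p>

```julia
# Hypothetical reference versions, spelling out the per-frame semantics that
# batched_mul/batched_adjoint provide in one optimized call (dispatching to
# a batched GEMM on GPU arrays).
function batched_mul_ref(A::AbstractArray{T,3}, B::AbstractArray{T,3}) where {T}
    F = size(A, 3)
    C = Array{T,3}(undef, size(A, 1), size(B, 2), F)
    for f in 1:F
        C[:, :, f] = view(A, :, :, f) * view(B, :, :, f)
    end
    return C
end

batched_adjoint_ref(A) = permutedims(conj.(A), (2, 1, 3))

# Tiny example: 8 pixels per axis, 3 molecules, 5 frames.
xᵖ, yᵖ = rand(8), rand(8)
x, y = rand(3, 5), rand(3, 5)
PSFˣ = exp.(-(reshape(x, 1, size(x)...) .- xᵖ) .^ 2)  # 8×3×5
PSFʸ = exp.(-(reshape(y, 1, size(y)...) .- yᵖ) .^ 2)  # 8×3×5
V = batched_mul_ref(PSFˣ, batched_adjoint_ref(PSFʸ))  # 8×8×5
```

&lt;p>Each slice &lt;code>V[:, :, f]&lt;/code> then matches the per-frame product computed inside &lt;code>video_sim_v1&lt;/code>.&lt;/p>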
&lt;p>Benchmarking &lt;code>video_sim_GPU_v2&lt;/code> on my CPU (i7-7700K) and my GPU (GeForce GTX 1060) yields &lt;code>9.550 ms (75 allocations: 73.44 MiB)&lt;/code> and &lt;code>3.127 ms (9 GPU allocations: 73.468 MiB)&lt;/code>&lt;sup id="fnref:4">&lt;a href="#fn:4" class="footnote-ref" role="doc-noteref">4&lt;/a>&lt;/sup>, respectively. Both beat the multithreaded &lt;code>video_sim_v3&lt;/code>!&lt;/p>
&lt;h3 id="final-boost">Final boost&lt;/h3>
&lt;p>The benchmarks I&amp;rsquo;ve showcased so far are based on double-precision floating-point (Float64) numbers. However, GPUs are frequently optimized for single-precision floating-point (Float32) numbers. For instance, after switching to Float32, &lt;code>video_sim_GPU_v2&lt;/code>&amp;rsquo;s benchmark becomes &lt;code>660.627 μs (11 GPU allocations: 36.736 MiB)&lt;/code>, another fivefold acceleration!&lt;/p>
&lt;p>Therefore, it is frequently advantageous to craft your GPU code to support both Float64 and Float32, and then assess whether altering the datatype affects your results. If there&amp;rsquo;s no impact, simply proceed with Float32!&lt;/p>
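&lt;p>In Julia this usually costs nothing extra: if the code is written generically, the element type simply follows the inputs, so it suffices to convert the arrays once at the boundary. A minimal sketch, with a hypothetical one-dimensional helper &lt;code>psf_1d&lt;/code>:&lt;/p>

```julia
# Type-generic helper: the output element type follows the inputs,
# so the same code serves both Float64 and Float32.
psf_1d(xᵖ, x) = exp.(-(xᵖ .- transpose(x)) .^ 2)

xᵖ64, x64 = rand(64), rand(20)
xᵖ32, x32 = Float32.(xᵖ64), Float32.(x64)  # convert once at the boundary

eltype(psf_1d(xᵖ64, x64))  # Float64
eltype(psf_1d(xᵖ32, x32))  # Float32
```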
&lt;h2 id="conclusion">Conclusion&lt;/h2>
&lt;p>Finally, we have arrived at the conclusion of my blog series concerning optimization techniques for scientific computation. I hope you have enjoyed this journey and learned something useful. Please feel free to get in touch with me should you wish to connect or share your thoughts!&lt;/p>
&lt;div class="footnotes" role="doc-endnotes">
&lt;hr>
&lt;ol>
&lt;li id="fn:1">
&lt;p>In case you haven&amp;rsquo;t read the preceding blogs, I strongly encourage you to take a moment to review their problem description sections. This will provide you with a better picture of the issue I&amp;rsquo;m trying to address.&amp;#160;&lt;a href="#fnref:1" class="footnote-backref" role="doc-backlink">&amp;#x21a9;&amp;#xfe0e;&lt;/a>&lt;/p>
&lt;/li>
&lt;li id="fn:2">
&lt;p>As of the date of this blog, even the most advanced desktop CPU, the AMD Ryzen™ Threadripper™ PRO 5995WX (~$6,000), only has 128 threads, while the number of frames can easily exceed 1,000.&amp;#160;&lt;a href="#fnref:2" class="footnote-backref" role="doc-backlink">&amp;#x21a9;&amp;#xfe0e;&lt;/a>&lt;/p>
&lt;/li>
&lt;li id="fn:3">
&lt;p>&lt;a href="https://en.wikipedia.org/wiki/CUDA" target="_blank" rel="noopener">CUDA&lt;/a> from Nvidia, &lt;a href="https://en.wikipedia.org/wiki/ROCm" target="_blank" rel="noopener">ROCm&lt;/a> from AMD, and &lt;a href="https://en.wikipedia.org/wiki/OneAPI_%28compute_acceleration%29" target="_blank" rel="noopener">OneAPI&lt;/a> from Intel.&amp;#160;&lt;a href="#fnref:3" class="footnote-backref" role="doc-backlink">&amp;#x21a9;&amp;#xfe0e;&lt;/a>&lt;/p>
&lt;/li>
&lt;li id="fn:4">
&lt;p>GPU memory allocation is measured by &lt;code>CUDA.@time&lt;/code>, see this &lt;a href="https://cuda.juliagpu.org/stable/development/profiling/" target="_blank" rel="noopener">page&lt;/a>.&amp;#160;&lt;a href="#fnref:4" class="footnote-backref" role="doc-backlink">&amp;#x21a9;&amp;#xfe0e;&lt;/a>&lt;/p>
&lt;/li>
&lt;/ol>
&lt;/div></description></item><item><title>Optimization Techniques in Scientific Computing (Part II)</title><link>https://lancexwq.netlify.app/post/optimizationp-ii/</link><pubDate>Mon, 07 Aug 2023 00:00:00 +0000</pubDate><guid>https://lancexwq.netlify.app/post/optimizationp-ii/</guid><description>&lt;ul>
&lt;li>&lt;a href="#introduction-and-recap">Introduction and recap&lt;/a>&lt;/li>
&lt;li>&lt;a href="#the-first-implementation">The first implementation&lt;/a>&lt;/li>
&lt;li>&lt;a href="#optimization-ideas">Optimization ideas&lt;/a>&lt;/li>
&lt;li>&lt;a href="#parallelism">Parallelism&lt;/a>
&lt;ul>
&lt;li>&lt;a href="#core-level">Core-level&lt;/a>&lt;/li>
&lt;li>&lt;a href="#node-level">Node-level&lt;/a>&lt;/li>
&lt;li>&lt;a href="#cluster-level">Cluster-level&lt;/a>&lt;/li>
&lt;li>&lt;a href="#key-consideration">Key consideration&lt;/a>&lt;/li>
&lt;/ul>
&lt;/li>
&lt;/ul>
&lt;h2 id="introduction-and-recap">Introduction and recap&lt;/h2>
&lt;p>In &lt;a href="https://lancexwq.github.io/post/optimization-i/" target="_blank" rel="noopener">my previous blog&lt;/a>, I introduced some straightforward yet valuable optimization techniques. While these techniques are generally suitable for relatively simple problems, such as the example presented there, they may prove inadequate when dealing with more complex and realistic issues. Specifically, in the previous example, my objective was to simulate a single-molecule microscope image. However, people frequently need to process multiple independent images (e.g., frames in a video). In this blog, I will discuss additional techniques within the context of this video simulation problem.&lt;/p>
&lt;p>To quickly recap, the previous example involves calculating the total contribution, labeled as \(I\), from all molecules. These molecules are indexed by \(n\), and they relate to each pixel, which is indexed by \(i\) and \(j\). Referring to the assumptions discussed earlier, we can express the algorithm&amp;rsquo;s mathematical form as follows: &lt;/p>
\[I_{ij}=\sum_n \exp[-(x^p_i-x_n)^2-(y^p_j-y_n)^2].\]
&lt;p> Now, shifting our focus to the present issue that involves multiple independent images (or frames), we extend the same calculation to each individual image, denoted as \(f\). As a result, the mathematical representation for this new problem takes the following shape (where \(V\) stands for video): &lt;/p>
\[V_{fij}=\sum_n \exp[-(x^p_{fi}-x_{fn})^2-(y^p_{fj}-y_{fn})^2].\]
&lt;h2 id="the-first-implementation">The first implementation&lt;/h2>
&lt;p>
&lt;figure id="figure-calculate-psfs-in-a-loop">
&lt;div class="d-flex justify-content-center">
&lt;div class="w-100" >&lt;img alt="Calculate PSFs in a loop." srcset="
/post/optimizationp-ii/fig_loop_hubbbb7445c0883065e0be80260df2da06_21327_f007f7123984859dfb5d7e1f34a5eeab.webp 400w,
/post/optimizationp-ii/fig_loop_hubbbb7445c0883065e0be80260df2da06_21327_7d741b6ab2077fd82bf6c349c9a93ecf.webp 760w,
/post/optimizationp-ii/fig_loop_hubbbb7445c0883065e0be80260df2da06_21327_1200x1200_fit_q75_h2_lanczos_3.webp 1200w"
src="https://lancexwq.netlify.app/post/optimizationp-ii/fig_loop_hubbbb7445c0883065e0be80260df2da06_21327_f007f7123984859dfb5d7e1f34a5eeab.webp"
width="535"
height="602"
loading="lazy" data-zoomable />&lt;/div>
&lt;/div>&lt;figcaption>
Calculate PSFs in a loop
&lt;/figcaption>&lt;/figure>
&lt;/p>
&lt;p>Based on the description so far, we can readily enclose a for-loop iterating over \(f\) around the previously optimized code to create the initial version of our single-molecule video simulation code&lt;sup id="fnref:1">&lt;a href="#fn:1" class="footnote-ref" role="doc-noteref">1&lt;/a>&lt;/sup>:&lt;/p>
&lt;div class="highlight">&lt;pre tabindex="0" class="chroma">&lt;code class="language-julia" data-lang="julia">&lt;span class="line">&lt;span class="cl">&lt;span class="k">function&lt;/span> &lt;span class="n">video_sim_v1&lt;/span>&lt;span class="p">(&lt;/span>&lt;span class="n">xᵖ&lt;/span>&lt;span class="p">,&lt;/span> &lt;span class="n">yᵖ&lt;/span>&lt;span class="p">,&lt;/span> &lt;span class="n">x&lt;/span>&lt;span class="p">,&lt;/span> &lt;span class="n">y&lt;/span>&lt;span class="p">)&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl"> &lt;span class="n">F&lt;/span> &lt;span class="o">=&lt;/span> &lt;span class="n">size&lt;/span>&lt;span class="p">(&lt;/span>&lt;span class="n">x&lt;/span>&lt;span class="p">,&lt;/span> &lt;span class="mi">2&lt;/span>&lt;span class="p">)&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl"> &lt;span class="n">V&lt;/span> &lt;span class="o">=&lt;/span> &lt;span class="kt">Array&lt;/span>&lt;span class="p">{&lt;/span>&lt;span class="kt">eltype&lt;/span>&lt;span class="p">(&lt;/span>&lt;span class="kt">x&lt;/span>&lt;span class="p">),&lt;/span>&lt;span class="mi">3&lt;/span>&lt;span class="p">}(&lt;/span>&lt;span class="nb">undef&lt;/span>&lt;span class="p">,&lt;/span> &lt;span class="n">length&lt;/span>&lt;span class="p">(&lt;/span>&lt;span class="n">xᵖ&lt;/span>&lt;span class="p">),&lt;/span> &lt;span class="n">length&lt;/span>&lt;span class="p">(&lt;/span>&lt;span class="n">yᵖ&lt;/span>&lt;span class="p">),&lt;/span> &lt;span class="n">F&lt;/span>&lt;span class="p">)&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl"> &lt;span class="k">for&lt;/span> &lt;span class="n">f&lt;/span> &lt;span class="k">in&lt;/span> &lt;span class="mi">1&lt;/span>&lt;span class="o">:&lt;/span>&lt;span class="n">F&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl"> &lt;span class="n">PSFˣ&lt;/span> &lt;span class="o">=&lt;/span> &lt;span class="n">exp&lt;/span>&lt;span class="o">.&lt;/span>&lt;span class="p">(&lt;/span>&lt;span class="o">-&lt;/span>&lt;span class="p">(&lt;/span>&lt;span class="n">xᵖ&lt;/span> &lt;span class="o">.-&lt;/span> &lt;span class="n">Transpose&lt;/span>&lt;span class="p">(&lt;/span>&lt;span class="n">view&lt;/span>&lt;span class="p">(&lt;/span>&lt;span class="n">x&lt;/span>&lt;span class="p">,&lt;/span> &lt;span class="o">:&lt;/span>&lt;span class="p">,&lt;/span> &lt;span class="n">f&lt;/span>&lt;span class="p">)))&lt;/span> &lt;span class="o">.^&lt;/span> &lt;span class="mi">2&lt;/span>&lt;span class="p">)&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl"> &lt;span class="n">PSFʸ&lt;/span> &lt;span class="o">=&lt;/span> &lt;span class="n">exp&lt;/span>&lt;span class="o">.&lt;/span>&lt;span class="p">(&lt;/span>&lt;span class="o">-&lt;/span>&lt;span class="p">(&lt;/span>&lt;span class="n">view&lt;/span>&lt;span class="p">(&lt;/span>&lt;span class="n">y&lt;/span>&lt;span class="p">,&lt;/span> &lt;span class="o">:&lt;/span>&lt;span class="p">,&lt;/span> &lt;span class="n">f&lt;/span>&lt;span class="p">)&lt;/span> &lt;span class="o">.-&lt;/span> &lt;span class="n">Transpose&lt;/span>&lt;span class="p">(&lt;/span>&lt;span class="n">yᵖ&lt;/span>&lt;span class="p">))&lt;/span> &lt;span class="o">.^&lt;/span> &lt;span class="mi">2&lt;/span>&lt;span class="p">)&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl"> &lt;span class="n">V&lt;/span>&lt;span class="p">[&lt;/span>&lt;span class="o">:&lt;/span>&lt;span class="p">,&lt;/span> &lt;span class="o">:&lt;/span>&lt;span class="p">,&lt;/span> &lt;span class="n">f&lt;/span>&lt;span class="p">]&lt;/span> &lt;span class="o">=&lt;/span> &lt;span class="n">PSFˣ&lt;/span> &lt;span class="o">*&lt;/span> &lt;span class="n">PSFʸ&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl"> &lt;span class="k">end&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl"> &lt;span class="k">return&lt;/span> &lt;span class="n">V&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&lt;span class="k">end&lt;/span>
&lt;/span>&lt;/span>&lt;/code>&lt;/pre>&lt;/div>&lt;p>Two points to note in the code above:&lt;/p>
&lt;ul>
&lt;li>&lt;code>x&lt;/code> and &lt;code>y&lt;/code> are both arrays of dimensions \(N\times F\), where \(N\) and \(F\) represent the number of molecules and the number of frames, respectively.&lt;/li>
&lt;li>It appears that we have made a bold assumption that all frames contain an equal number of molecules. However, this assumption is acceptable since molecules that should not appear in a frame can be positioned far away from the field-of-view, thereby making no contribution.&lt;/li>
&lt;/ul>
&lt;p>Benchmarking &lt;code>video_sim_v1&lt;/code> using a dataset comprising 20 molecules and 100 frames (each with 256\(\times\)256 pixels) yields &lt;code>50.927 ms (1402 allocations: 123.47 MiB)&lt;/code>. Our goal is to improve upon this benchmark.&lt;/p>
&lt;h2 id="optimization-ideas">Optimization ideas&lt;/h2>
&lt;p>Before exploring new techniques, let&amp;rsquo;s take a moment to consider whether we can apply anything from &lt;a href="https://lancexwq.github.io/post/optimization-i/" target="_blank" rel="noopener">my previous blog&lt;/a>. Since we have only added one extra loop, there isn&amp;rsquo;t much opportunity to reduce memory allocation. What&amp;rsquo;s more, this extra loop cannot be easily eliminated through vectorization, as the formula specified here doesn&amp;rsquo;t align with basic matrix (or tensor) operations. Consequently, we must use other techniques to tackle this challenge.&lt;/p>
&lt;p>In this video simulation problem, it is important to note that all frames are independent of each other. As a result, there is potential to simulate frames simultaneously, or in other words, in parallel.&lt;/p>
&lt;h2 id="parallelism">Parallelism&lt;/h2>
&lt;p>Parallelizing an algorithm is much easier said than done. Given the layered nature of contemporary computational infrastructure, parallelism today involves three major tiers: core-level parallelism, node-level parallelism, and cluster-level parallelism&lt;sup id="fnref:2">&lt;a href="#fn:2" class="footnote-ref" role="doc-noteref">2&lt;/a>&lt;/sup>. In the upcoming sections, I will examine one common scheme within each tier and assess its relevance to our specific problem.&lt;/p>
&lt;h3 id="core-level">Core-level&lt;/h3>
&lt;p>The first question that may arise is: how is it possible to achieve parallelism on a single core? To illustrate, consider a program that operates on 64-bit integers, running on a processor core that can fetch 256 bits of data in a single operation. In that case, four integers can be loaded as a vector and processed in a single vectorized iteration of the original operation, yielding a theoretical speedup of up to fourfold&lt;sup id="fnref:3">&lt;a href="#fn:3" class="footnote-ref" role="doc-noteref">3&lt;/a>&lt;/sup>. This approach to parallelization is commonly known as &amp;ldquo;&lt;a href="https://en.wikipedia.org/wiki/Single_instruction,_multiple_data" target="_blank" rel="noopener">single instruction, multiple data&lt;/a>&amp;rdquo; (SIMD).&lt;/p>
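&lt;p>To make the four-integers-per-instruction picture concrete, here is a toy Python sketch (the blog&amp;rsquo;s code is Julia; this model is purely conceptual, since real SIMD happens in hardware and gains no speed in interpreted Python):&lt;/p>

```python
# Toy model of 4-wide SIMD: one "vector instruction" combines four
# 64-bit integers at a time instead of looping element by element.
def scalar_add(a, b):
    # one addition per loop iteration
    return [ai + bi for ai, bi in zip(a, b)]

def simd_add(a, b, width=4):
    out = []
    for i in range(0, len(a), width):
        # conceptually a single instruction operating on a whole
        # `width`-element lane of both inputs
        lane_a = a[i:i + width]
        lane_b = b[i:i + width]
        out.extend(ai + bi for ai, bi in zip(lane_a, lane_b))
    return out
```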
&lt;p>The simplicity of SIMD cuts both ways. On one hand, it allows many modern programming languages to identify points within an algorithm where SIMD can be employed and apply it automatically. On the other hand, SIMD is typically limited to basic operations such as addition or multiplication, so whether &lt;code>video_sim_v1&lt;/code> can be improved this way is uncertain. Nevertheless, it is worth a try.&lt;/p>
&lt;p>In &lt;a href="https://julialang.org/" target="_blank" rel="noopener">Julia&lt;/a>, it is possible to enforce vectorization by employing the &lt;code>@simd&lt;/code> macro, placed before a for-loop involving independent iterations. This technique results in the creation of &lt;code>video_sim_v2&lt;/code>:&lt;/p>
&lt;div class="highlight">&lt;pre tabindex="0" class="chroma">&lt;code class="language-julia" data-lang="julia">&lt;span class="line">&lt;span class="cl">&lt;span class="k">function&lt;/span> &lt;span class="n">video_sim_v2&lt;/span>&lt;span class="p">(&lt;/span>&lt;span class="n">xᵖ&lt;/span>&lt;span class="p">,&lt;/span> &lt;span class="n">yᵖ&lt;/span>&lt;span class="p">,&lt;/span> &lt;span class="n">x&lt;/span>&lt;span class="p">,&lt;/span> &lt;span class="n">y&lt;/span>&lt;span class="p">)&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl"> &lt;span class="n">F&lt;/span> &lt;span class="o">=&lt;/span> &lt;span class="n">size&lt;/span>&lt;span class="p">(&lt;/span>&lt;span class="n">x&lt;/span>&lt;span class="p">,&lt;/span> &lt;span class="mi">2&lt;/span>&lt;span class="p">)&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl"> &lt;span class="n">V&lt;/span> &lt;span class="o">=&lt;/span> &lt;span class="kt">Array&lt;/span>&lt;span class="p">{&lt;/span>&lt;span class="kt">eltype&lt;/span>&lt;span class="p">(&lt;/span>&lt;span class="kt">x&lt;/span>&lt;span class="p">),&lt;/span>&lt;span class="mi">3&lt;/span>&lt;span class="p">}(&lt;/span>&lt;span class="nb">undef&lt;/span>&lt;span class="p">,&lt;/span> &lt;span class="n">length&lt;/span>&lt;span class="p">(&lt;/span>&lt;span class="n">xᵖ&lt;/span>&lt;span class="p">),&lt;/span> &lt;span class="n">length&lt;/span>&lt;span class="p">(&lt;/span>&lt;span class="n">yᵖ&lt;/span>&lt;span class="p">),&lt;/span> &lt;span class="n">F&lt;/span>&lt;span class="p">)&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl"> &lt;span class="nd">@simd&lt;/span> &lt;span class="k">for&lt;/span> &lt;span class="n">f&lt;/span> &lt;span class="k">in&lt;/span> &lt;span class="mi">1&lt;/span>&lt;span class="o">:&lt;/span>&lt;span class="n">F&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl"> &lt;span class="n">PSFˣ&lt;/span> &lt;span class="o">=&lt;/span> &lt;span class="n">exp&lt;/span>&lt;span class="o">.&lt;/span>&lt;span class="p">(&lt;/span>&lt;span class="o">-&lt;/span>&lt;span class="p">(&lt;/span>&lt;span class="n">xᵖ&lt;/span> &lt;span class="o">.-&lt;/span> &lt;span class="n">Transpose&lt;/span>&lt;span class="p">(&lt;/span>&lt;span class="n">view&lt;/span>&lt;span class="p">(&lt;/span>&lt;span class="n">x&lt;/span>&lt;span class="p">,&lt;/span> &lt;span class="o">:&lt;/span>&lt;span class="p">,&lt;/span> &lt;span class="n">f&lt;/span>&lt;span class="p">)))&lt;/span> &lt;span class="o">.^&lt;/span> &lt;span class="mi">2&lt;/span>&lt;span class="p">)&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl"> &lt;span class="n">PSFʸ&lt;/span> &lt;span class="o">=&lt;/span> &lt;span class="n">exp&lt;/span>&lt;span class="o">.&lt;/span>&lt;span class="p">(&lt;/span>&lt;span class="o">-&lt;/span>&lt;span class="p">(&lt;/span>&lt;span class="n">view&lt;/span>&lt;span class="p">(&lt;/span>&lt;span class="n">y&lt;/span>&lt;span class="p">,&lt;/span> &lt;span class="o">:&lt;/span>&lt;span class="p">,&lt;/span> &lt;span class="n">f&lt;/span>&lt;span class="p">)&lt;/span> &lt;span class="o">.-&lt;/span> &lt;span class="n">Transpose&lt;/span>&lt;span class="p">(&lt;/span>&lt;span class="n">yᵖ&lt;/span>&lt;span class="p">))&lt;/span> &lt;span class="o">.^&lt;/span> &lt;span class="mi">2&lt;/span>&lt;span class="p">)&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl"> &lt;span class="n">V&lt;/span>&lt;span class="p">[&lt;/span>&lt;span class="o">:&lt;/span>&lt;span class="p">,&lt;/span> &lt;span class="o">:&lt;/span>&lt;span class="p">,&lt;/span> &lt;span class="n">f&lt;/span>&lt;span class="p">]&lt;/span> &lt;span class="o">=&lt;/span> &lt;span class="n">PSFˣ&lt;/span> &lt;span class="o">*&lt;/span> &lt;span class="n">PSFʸ&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl"> &lt;span class="k">end&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl"> &lt;span class="k">return&lt;/span> &lt;span class="n">V&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&lt;span class="k">end&lt;/span>
&lt;/span>&lt;/span>&lt;/code>&lt;/pre>&lt;/div>&lt;p>A benchmark analysis yields a result of &lt;code>51.017 ms (1402 allocations: 123.47 MiB)&lt;/code>, indicating a lack of performance improvement. It seems that Julia has indeed automatically vectorized the code in this case.&lt;/p>
&lt;h3 id="node-level">Node-level&lt;/h3>
&lt;p>Moving up a level, there is parallelism within a node (often a single computer), which is typically achieved through &lt;a href="https://en.wikipedia.org/wiki/Multithreading_%28computer_architecture%29" target="_blank" rel="noopener">multithreading&lt;/a>. Multithreading requires multiple processor cores (either physical or virtual&lt;sup id="fnref:4">&lt;a href="#fn:4" class="footnote-ref" role="doc-noteref">4&lt;/a>&lt;/sup>), with each core running a separate thread. These threads execute simultaneously while sharing the same memory pool. Note that implementing multithreading demands care to avoid conflicts between threads; fortunately, developers have often shouldered much of this responsibility, relieving users of the burden.&lt;/p>
&lt;p>In Julia, multithreading a for-loop can be as easy as follows&lt;sup id="fnref:5">&lt;a href="#fn:5" class="footnote-ref" role="doc-noteref">5&lt;/a>&lt;/sup>:&lt;/p>
&lt;div class="highlight">&lt;pre tabindex="0" class="chroma">&lt;code class="language-julia" data-lang="julia">&lt;span class="line">&lt;span class="cl">&lt;span class="k">function&lt;/span> &lt;span class="n">video_sim_v3&lt;/span>&lt;span class="p">(&lt;/span>&lt;span class="n">xᵖ&lt;/span>&lt;span class="p">,&lt;/span> &lt;span class="n">yᵖ&lt;/span>&lt;span class="p">,&lt;/span> &lt;span class="n">x&lt;/span>&lt;span class="p">,&lt;/span> &lt;span class="n">y&lt;/span>&lt;span class="p">)&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl"> &lt;span class="n">F&lt;/span> &lt;span class="o">=&lt;/span> &lt;span class="n">size&lt;/span>&lt;span class="p">(&lt;/span>&lt;span class="n">x&lt;/span>&lt;span class="p">,&lt;/span> &lt;span class="mi">2&lt;/span>&lt;span class="p">)&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl"> &lt;span class="n">V&lt;/span> &lt;span class="o">=&lt;/span> &lt;span class="kt">Array&lt;/span>&lt;span class="p">{&lt;/span>&lt;span class="kt">eltype&lt;/span>&lt;span class="p">(&lt;/span>&lt;span class="kt">x&lt;/span>&lt;span class="p">),&lt;/span>&lt;span class="mi">3&lt;/span>&lt;span class="p">}(&lt;/span>&lt;span class="nb">undef&lt;/span>&lt;span class="p">,&lt;/span> &lt;span class="n">length&lt;/span>&lt;span class="p">(&lt;/span>&lt;span class="n">xᵖ&lt;/span>&lt;span class="p">),&lt;/span> &lt;span class="n">length&lt;/span>&lt;span class="p">(&lt;/span>&lt;span class="n">yᵖ&lt;/span>&lt;span class="p">),&lt;/span> &lt;span class="n">F&lt;/span>&lt;span class="p">)&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl"> &lt;span class="n">Threads&lt;/span>&lt;span class="o">.&lt;/span>&lt;span class="nd">@threads&lt;/span> &lt;span class="k">for&lt;/span> &lt;span class="n">f&lt;/span> &lt;span class="k">in&lt;/span> &lt;span class="mi">1&lt;/span>&lt;span class="o">:&lt;/span>&lt;span class="n">F&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl"> &lt;span class="n">PSFˣ&lt;/span> &lt;span class="o">=&lt;/span> &lt;span class="n">exp&lt;/span>&lt;span class="o">.&lt;/span>&lt;span class="p">(&lt;/span>&lt;span class="o">-&lt;/span>&lt;span class="p">(&lt;/span>&lt;span class="n">xᵖ&lt;/span> &lt;span class="o">.-&lt;/span> &lt;span class="n">Transpose&lt;/span>&lt;span class="p">(&lt;/span>&lt;span class="n">view&lt;/span>&lt;span class="p">(&lt;/span>&lt;span class="n">x&lt;/span>&lt;span class="p">,&lt;/span> &lt;span class="o">:&lt;/span>&lt;span class="p">,&lt;/span> &lt;span class="n">f&lt;/span>&lt;span class="p">)))&lt;/span> &lt;span class="o">.^&lt;/span> &lt;span class="mi">2&lt;/span>&lt;span class="p">)&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl"> &lt;span class="n">PSFʸ&lt;/span> &lt;span class="o">=&lt;/span> &lt;span class="n">exp&lt;/span>&lt;span class="o">.&lt;/span>&lt;span class="p">(&lt;/span>&lt;span class="o">-&lt;/span>&lt;span class="p">(&lt;/span>&lt;span class="n">view&lt;/span>&lt;span class="p">(&lt;/span>&lt;span class="n">y&lt;/span>&lt;span class="p">,&lt;/span> &lt;span class="o">:&lt;/span>&lt;span class="p">,&lt;/span> &lt;span class="n">f&lt;/span>&lt;span class="p">)&lt;/span> &lt;span class="o">.-&lt;/span> &lt;span class="n">Transpose&lt;/span>&lt;span class="p">(&lt;/span>&lt;span class="n">yᵖ&lt;/span>&lt;span class="p">))&lt;/span> &lt;span class="o">.^&lt;/span> &lt;span class="mi">2&lt;/span>&lt;span class="p">)&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl"> &lt;span class="n">V&lt;/span>&lt;span class="p">[&lt;/span>&lt;span class="o">:&lt;/span>&lt;span class="p">,&lt;/span> &lt;span class="o">:&lt;/span>&lt;span class="p">,&lt;/span> &lt;span class="n">f&lt;/span>&lt;span class="p">]&lt;/span> &lt;span class="o">=&lt;/span> &lt;span class="n">PSFˣ&lt;/span> &lt;span class="o">*&lt;/span> &lt;span class="n">PSFʸ&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl"> &lt;span class="k">end&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl"> &lt;span class="k">return&lt;/span> &lt;span class="n">V&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&lt;span class="k">end&lt;/span>
&lt;/span>&lt;/span>&lt;/code>&lt;/pre>&lt;/div>&lt;p>My desktop computer is equipped with four physical CPU cores, which translate into eight threads. Benchmarking &lt;code>video_sim_v3&lt;/code> with all eight threads demonstrates a remarkable speedup of almost four times compared to &lt;code>video_sim_v1&lt;/code>, clocking in at &lt;code>12.925 ms (1450 allocations: 123.47 MiB)&lt;/code>.&lt;/p>
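&lt;p>For reference, the same frame-parallel idea can be sketched in Python (the function names and array shapes below mirror the Julia code but are my own; NumPy releases the GIL inside many of its kernels, which is what allows threads to help here):&lt;/p>

```python
import numpy as np
from concurrent.futures import ThreadPoolExecutor

def sim_frame(xp, yp, x, y, f):
    # per-axis Gaussian PSF factors for frame f, combined by a matrix
    # product, mirroring PSFx * PSFy in the Julia code
    psf_x = np.exp(-(xp[:, None] - x[None, :, f]) ** 2)      # (pixels_x, molecules)
    psf_y = np.exp(-(y[:, f][:, None] - yp[None, :]) ** 2)   # (molecules, pixels_y)
    return psf_x @ psf_y                                     # (pixels_x, pixels_y)

def video_sim_threaded(xp, yp, x, y):
    F = x.shape[1]
    V = np.empty((len(xp), len(yp), F))
    # frames are independent, so each one can be simulated on its own thread
    with ThreadPoolExecutor() as pool:
        frames = pool.map(lambda f: sim_frame(xp, yp, x, y, f), range(F))
        for f, frame in enumerate(frames):
            V[:, :, f] = frame
    return V
```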
&lt;h3 id="cluster-level">Cluster-level&lt;/h3>
&lt;p>Now, if you have access to a cluster, which is not uncommon at universities and institutes nowadays, you could even consider modifying the algorithm to execute across multiple processors spanning numerous computers. A frequently employed strategy is &lt;a href="https://en.wikipedia.org/wiki/Multiprocessing" target="_blank" rel="noopener">multiprocessing&lt;/a>.&lt;/p>
&lt;p>With the concept of multithreading in mind, multiprocessing is easy to grasp: multiple processors operate simultaneously, but each one has access only to its own designated memory space. This fundamental distinction from multithreading requires some &amp;ldquo;coding maneuvers&amp;rdquo;, as users must now decide how data are allocated to individual processes. In the context of our example problem, implementing multiprocessing requires rather major changes to the code, contradicting the very impetus driving my blog posts. Therefore, I only provide a preliminary example in &lt;a href="">this GitHub repository of mine&lt;/a>.&lt;/p>
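&lt;p>To illustrate the kind of &amp;ldquo;coding maneuver&amp;rdquo; involved, a minimal Python sketch of the bookkeeping step, deciding which frames each process owns, might look like the following (the function name is my own, and this is only the partitioning, not a full multiprocessing implementation):&lt;/p>

```python
def partition_frames(num_frames, num_procs):
    # Split frame indices 0..num_frames-1 into num_procs contiguous chunks,
    # as evenly as possible, so each process holds in its own memory space
    # only the data for the frames it owns.
    base, extra = divmod(num_frames, num_procs)
    sizes = [base + 1] * extra + [base] * (num_procs - extra)
    chunks, start = [], 0
    for size in sizes:
        chunks.append(list(range(start, start + size)))
        start += size
    return chunks
```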
&lt;h3 id="key-consideration">Key consideration&lt;/h3>
&lt;p>
&lt;figure id="figure-communication-overhead-vs-computational-cost">
&lt;div class="d-flex justify-content-center">
&lt;div class="w-100" >&lt;img alt="Communication overhead vs. computational cost." srcset="
/post/optimizationp-ii/fig1_hu8afa2284ccdba917dbb6eb0b766fe8fd_25065_d1f81b9e56e3bfa579736f62e0348575.webp 400w,
/post/optimizationp-ii/fig1_hu8afa2284ccdba917dbb6eb0b766fe8fd_25065_31b7c465da11fe051b0f382fb637da15.webp 760w,
/post/optimizationp-ii/fig1_hu8afa2284ccdba917dbb6eb0b766fe8fd_25065_1200x1200_fit_q75_h2_lanczos_3.webp 1200w"
src="https://lancexwq.netlify.app/post/optimizationp-ii/fig1_hu8afa2284ccdba917dbb6eb0b766fe8fd_25065_d1f81b9e56e3bfa579736f62e0348575.webp"
width="760"
height="302"
loading="lazy" data-zoomable />&lt;/div>
&lt;/div>&lt;figcaption>
Communication overhead vs. computational cost
&lt;/figcaption>&lt;/figure>
&lt;/p>
&lt;p>While I have aimed to keep the discussion in this blog at a surface level, it is entirely reasonable to feel unsure when deciding upon a parallelization scheme. &amp;#x1f604; The crucial factor to bear in mind is that engaging more processors also increases communication overhead, and this overhead can outweigh the performance gained from distributing tasks.&lt;/p>
&lt;p>As of the post date of this blog, it is generally advisable to experiment with SIMD and multithreading in your code, as they are relatively easy to test. Multiprocessing, on the other hand, is worth considering only when each discrete task takes several seconds to execute and inter-process communication remains minimal.&lt;/p>
&lt;p>Although it has been a long journey, our quest remains incomplete. There is one more concept, one that has been gaining popularity in recent years, that we can test. In the third part of my blog, I will discuss GPU computation.&lt;/p>
&lt;div class="footnotes" role="doc-endnotes">
&lt;hr>
&lt;ol>
&lt;li id="fn:1">
&lt;p>&lt;code>view(x, :, f)&lt;/code> serves the same purpose as &lt;code>x[:, f]&lt;/code> but avoids allocating a copy of the data.&amp;#160;&lt;a href="#fnref:1" class="footnote-backref" role="doc-backlink">&amp;#x21a9;&amp;#xfe0e;&lt;/a>&lt;/p>
&lt;/li>
&lt;li id="fn:2">
&lt;p>Please note that these concepts are not mutually exclusive.&amp;#160;&lt;a href="#fnref:2" class="footnote-backref" role="doc-backlink">&amp;#x21a9;&amp;#xfe0e;&lt;/a>&lt;/p>
&lt;/li>
&lt;li id="fn:3">
&lt;p>You should now recognize that SIMD is closely related to vectorization (introduced in &lt;a href="https://lancexwq.github.io/post/optimization-i/" target="_blank" rel="noopener">my previous blog&lt;/a>). In fact, vectorization constitutes a specific implementation of SIMD principles.&amp;#160;&lt;a href="#fnref:3" class="footnote-backref" role="doc-backlink">&amp;#x21a9;&amp;#xfe0e;&lt;/a>&lt;/p>
&lt;/li>
&lt;li id="fn:4">
&lt;p>For example, see &lt;a href="https://en.wikipedia.org/wiki/Hyper-threading" target="_blank" rel="noopener">hyper-threading&lt;/a>&amp;#160;&lt;a href="#fnref:4" class="footnote-backref" role="doc-backlink">&amp;#x21a9;&amp;#xfe0e;&lt;/a>&lt;/p>
&lt;/li>
&lt;li id="fn:5">
&lt;p>In order to enable multithreading, certain programming languages may require additional parameters during startup; &lt;a href="https://docs.julialang.org/en/v1/manual/multi-threading/" target="_blank" rel="noopener">this page&lt;/a> shows how to do it in Julia.&amp;#160;&lt;a href="#fnref:5" class="footnote-backref" role="doc-backlink">&amp;#x21a9;&amp;#xfe0e;&lt;/a>&lt;/p>
&lt;/li>
&lt;/ol>
&lt;/div></description></item><item><title>Optimization Techniques in Scientific Computing (Part I)</title><link>https://lancexwq.netlify.app/post/optimization-i/</link><pubDate>Fri, 28 Jul 2023 00:00:00 +0000</pubDate><guid>https://lancexwq.netlify.app/post/optimization-i/</guid><description>&lt;ul>
&lt;li>&lt;a href="#introduction">Introduction&lt;/a>&lt;/li>
&lt;li>&lt;a href="#description-of-the-problem">Description of the problem&lt;/a>&lt;/li>
&lt;li>&lt;a href="#a-naive-implementation">A naive implementation&lt;/a>&lt;/li>
&lt;li>&lt;a href="#optimization-ideas">Optimization ideas&lt;/a>
&lt;ul>
&lt;li>&lt;a href="#follow-memory-layout">Follow memory layout&lt;/a>&lt;/li>
&lt;li>&lt;a href="#reduce-memory-allocation">Reduce memory allocation&lt;/a>&lt;/li>
&lt;li>&lt;a href="#vectorization">Vectorization&lt;/a>&lt;/li>
&lt;/ul>
&lt;/li>
&lt;/ul>
&lt;h2 id="introduction">Introduction&lt;/h2>
&lt;p>Scientific research is inherently linked to the collection and analysis of data. In today&amp;rsquo;s world, the volume of data involved in most scientific research projects far exceeds what can be managed manually. As a result, scientific computing has become an essential requirement for conducting research.&lt;/p>
&lt;p>Despite the remarkable progress in modern programming languages and numerical tools, continually enhanced by scientists and software developers, non-experts still face challenges when it comes to conducting efficient computations on computers for specific research purposes. This can be attributed to the following factors:&lt;/p>
&lt;ul>
&lt;li>Achieving optimal code performance necessitates a comprehensive evaluation of various factors, including hardware specifications, software components, algorithmic efficiency, and the scale of computations.&lt;/li>
&lt;li>Scientists need to find a balance between computational efficiency and development efficiency. They cannot afford to spend excessive time conducting meticulous benchmarks and analyzing their statistics.&lt;/li>
&lt;li>While programming languages and numerical tools frequently offer extensive performance tips, they are typically presented in technical jargon, making them less accessible to non-experts.&lt;/li>
&lt;li>Additionally, the examples provided in these resources often lack interconnectedness, making it challenging to grasp their practical application collectively.&lt;/li>
&lt;/ul>
&lt;p>To tackle these concerns, this blog post aims to offer a concise overview of various general and highly effective optimization techniques that are relatively straightforward to implement. The focus will be on a problem I encountered during my research on single-molecule imaging. I will begin with a naive version of my code and gradually enhance its performance.&lt;/p>
&lt;p>&amp;#x2757; All the code below is provided as an interactive notebook on &lt;a href="https://github.com/lanceXwq/lancexwq.github.io/tree/main/content/post/optimization-I/scripts" target="_blank" rel="noopener">my GitHub&lt;/a>.&lt;/p>
&lt;h2 id="description-of-the-problem">Description of the problem&lt;/h2>
&lt;p>In my research, which involves applications such as &lt;a href="https://en.wikipedia.org/wiki/Super-resolution_microscopy" target="_blank" rel="noopener">super-resolution imaging&lt;/a>, I frequently need to simulate microscope images of individual molecules in the visible spectrum using photon-sensing devices. In this scenario, individual molecules can be accurately represented as point emitters, meaning they are so small that their structures and shapes become insignificant. However, from a physics standpoint, we cannot simply observe sharp, bright dots in microscope images due to two reasons:&lt;/p>
&lt;ul>
&lt;li>The diffraction of light causes a point object to appear as an expanded blur, often referred to as the &lt;a href="https://en.wikipedia.org/wiki/Point_spread_function" target="_blank" rel="noopener">point spread function (PSF)&lt;/a>.&lt;/li>
&lt;li>In images, PSFs are pixelated because the pixel sizes of the detectors are often comparable to the width of a PSF.&lt;/li>
&lt;/ul>
&lt;p>
&lt;figure id="figure-point-emitter-to-a-pixelated-image">
&lt;div class="d-flex justify-content-center">
&lt;div class="w-100" >&lt;img alt="A comparison of a point emitter, its Gaussian PSF, and the actual pixelated image." srcset="
/post/optimization-i/fig1_hu9b0e52963e2938886e98fe264543e483_16989_67d74e546d28d9422d922e069dcc5c3f.webp 400w,
/post/optimization-i/fig1_hu9b0e52963e2938886e98fe264543e483_16989_7a120171fe7dae3d84c3aa4d1c096184.webp 760w,
/post/optimization-i/fig1_hu9b0e52963e2938886e98fe264543e483_16989_1200x1200_fit_q75_h2_lanczos_3.webp 1200w"
src="https://lancexwq.netlify.app/post/optimization-i/fig1_hu9b0e52963e2938886e98fe264543e483_16989_67d74e546d28d9422d922e069dcc5c3f.webp"
width="760"
height="228"
loading="lazy" data-zoomable />&lt;/div>
&lt;/div>&lt;figcaption>
Point emitter to a pixelated image
&lt;/figcaption>&lt;/figure>
&lt;/p>
&lt;p>Simulating a single-molecule image involves converting point emitters into their corresponding PSFs and then pixelating the entire image. In these simulations (as well as in the corresponding experimental setups), it is often reasonable to assume that the point emitters are sufficiently far apart from each other, allowing for independent photon emissions. This means that we can calculate the PSF for each molecule individually and combine them.&lt;/p>
&lt;p>Since providing an accurate and detailed simulation process is beyond the scope of this blog, we will make the following approximations:&lt;/p>
&lt;ul>
&lt;li>The pixel size is so small that the impact of pixelization can be considered negligible.&lt;/li>
&lt;li>The PSF is a 2D Gaussian. In other words, for a molecule located at coordinates \((x_n, y_n)\), its influence on the pixel at \((x^p_i, y^p_j)\) is determined by \[PSF_{ijn}=\exp[-(x^p_i-x_n)^2-(y^p_j-y_n)^2]\] where \(n\) represents the molecule index, and \(i\) and \(j\) denote the pixel indices.&lt;/li>
&lt;/ul>
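&lt;p>Under these two approximations, the formula translates directly into a few lines of array code. As a hedged sketch in NumPy (the blog itself uses Julia, and the function name here is mine), the summed-PSF image is:&lt;/p>

```python
import numpy as np

def psf_image(xp, yp, x, y):
    # PSF_ijn = exp(-(xp_i - x_n)**2 - (yp_j - y_n)**2), then sum over
    # the molecule index n to combine independent emitters
    dx2 = (xp[:, None] - x[None, :]) ** 2  # (pixels_x, molecules)
    dy2 = (yp[:, None] - y[None, :]) ** 2  # (pixels_y, molecules)
    return np.exp(-(dx2[:, None, :] + dy2[None, :, :])).sum(axis=2)
```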
&lt;h2 id="a-naive-implementation">A naive implementation&lt;/h2>
&lt;p>Now, let&amp;rsquo;s move forward with writing a straightforward simulation code. To accomplish this, I will utilize &lt;a href="https://julialang.org/" target="_blank" rel="noopener">Julia&lt;/a> as the preferred programming language. As mentioned previously, we need to compute the PSF value for every \(n\), \(i\), and \(j\). Translating this sentence into code results in the following:&lt;/p>
&lt;div class="highlight">&lt;pre tabindex="0" class="chroma">&lt;code class="language-julia" data-lang="julia">&lt;span class="line">&lt;span class="cl">&lt;span class="k">function&lt;/span> &lt;span class="n">image_sim_v1&lt;/span>&lt;span class="p">(&lt;/span>&lt;span class="n">xᵖ&lt;/span>&lt;span class="p">,&lt;/span> &lt;span class="n">yᵖ&lt;/span>&lt;span class="p">,&lt;/span> &lt;span class="n">x&lt;/span>&lt;span class="p">,&lt;/span> &lt;span class="n">y&lt;/span>&lt;span class="p">)&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl"> &lt;span class="n">PSF&lt;/span> &lt;span class="o">=&lt;/span> &lt;span class="n">zeros&lt;/span>&lt;span class="p">(&lt;/span>&lt;span class="n">length&lt;/span>&lt;span class="p">(&lt;/span>&lt;span class="n">x&lt;/span>&lt;span class="p">),&lt;/span> &lt;span class="n">length&lt;/span>&lt;span class="p">(&lt;/span>&lt;span class="n">xᵖ&lt;/span>&lt;span class="p">),&lt;/span> &lt;span class="n">length&lt;/span>&lt;span class="p">(&lt;/span>&lt;span class="n">yᵖ&lt;/span>&lt;span class="p">))&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl"> &lt;span class="k">for&lt;/span> &lt;span class="n">n&lt;/span> &lt;span class="k">in&lt;/span> &lt;span class="n">eachindex&lt;/span>&lt;span class="p">(&lt;/span>&lt;span class="n">x&lt;/span>&lt;span class="p">),&lt;/span> &lt;span class="n">i&lt;/span> &lt;span class="k">in&lt;/span> &lt;span class="n">eachindex&lt;/span>&lt;span class="p">(&lt;/span>&lt;span class="n">xᵖ&lt;/span>&lt;span class="p">),&lt;/span> &lt;span class="n">j&lt;/span> &lt;span class="k">in&lt;/span> &lt;span class="n">eachindex&lt;/span>&lt;span class="p">(&lt;/span>&lt;span class="n">yᵖ&lt;/span>&lt;span class="p">)&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl"> &lt;span class="n">PSF&lt;/span>&lt;span class="p">[&lt;/span>&lt;span class="n">n&lt;/span>&lt;span class="p">,&lt;/span> &lt;span class="n">i&lt;/span>&lt;span class="p">,&lt;/span> &lt;span class="n">j&lt;/span>&lt;span class="p">]&lt;/span> &lt;span class="o">=&lt;/span> &lt;span class="n">exp&lt;/span>&lt;span class="p">(&lt;/span>&lt;span class="o">-&lt;/span>&lt;span class="p">(&lt;/span>&lt;span class="n">xᵖ&lt;/span>&lt;span class="p">[&lt;/span>&lt;span class="n">i&lt;/span>&lt;span class="p">]&lt;/span> &lt;span class="o">-&lt;/span> &lt;span class="n">x&lt;/span>&lt;span class="p">[&lt;/span>&lt;span class="n">n&lt;/span>&lt;span class="p">])&lt;/span>&lt;span class="o">^&lt;/span>&lt;span class="mi">2&lt;/span> &lt;span class="o">-&lt;/span> &lt;span class="p">(&lt;/span>&lt;span class="n">yᵖ&lt;/span>&lt;span class="p">[&lt;/span>&lt;span class="n">j&lt;/span>&lt;span class="p">]&lt;/span> &lt;span class="o">-&lt;/span> &lt;span class="n">y&lt;/span>&lt;span class="p">[&lt;/span>&lt;span class="n">n&lt;/span>&lt;span class="p">])&lt;/span>&lt;span class="o">^&lt;/span>&lt;span class="mi">2&lt;/span>&lt;span class="p">)&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl"> &lt;span class="k">end&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl"> &lt;span class="k">return&lt;/span> &lt;span class="n">dropdims&lt;/span>&lt;span class="p">(&lt;/span>&lt;span class="n">sum&lt;/span>&lt;span class="p">(&lt;/span>&lt;span class="n">PSF&lt;/span>&lt;span class="p">,&lt;/span> &lt;span class="n">dims&lt;/span> &lt;span class="o">=&lt;/span> &lt;span class="mi">1&lt;/span>&lt;span class="p">),&lt;/span> &lt;span class="n">dims&lt;/span> &lt;span class="o">=&lt;/span> &lt;span class="mi">1&lt;/span>&lt;span class="p">)&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&lt;span class="k">end&lt;/span>
&lt;/span>&lt;/span>&lt;/code>&lt;/pre>&lt;/div>&lt;p>For a test case with 20 emitting molecules in a 256\(\times\)256 image, this short function does get the job done (see the image below). A brief benchmark of this function yields &lt;code>13.827 ms (7 allocations: 10.50 MiB)&lt;/code>.&lt;/p>
&lt;p>
&lt;figure id="figure-many-point-emitters-to-the-final-image">
&lt;div class="d-flex justify-content-center">
&lt;div class="w-100" >&lt;img alt="Many point emitters to the final image." srcset="
/post/optimization-i/fig2_huab7a496d2b00d624a24ee68abca9161c_29393_db43ef5c1ddfba68df58f6d0fc7123dc.webp 400w,
/post/optimization-i/fig2_huab7a496d2b00d624a24ee68abca9161c_29393_f6e9aa77ea8983c3843e5c515160975e.webp 760w,
/post/optimization-i/fig2_huab7a496d2b00d624a24ee68abca9161c_29393_1200x1200_fit_q75_h2_lanczos_3.webp 1200w"
src="https://lancexwq.netlify.app/post/optimization-i/fig2_huab7a496d2b00d624a24ee68abca9161c_29393_db43ef5c1ddfba68df58f6d0fc7123dc.webp"
width="750"
height="300"
loading="lazy" data-zoomable />&lt;/div>
&lt;/div>&lt;figcaption>
Many point emitters to the final image
&lt;/figcaption>&lt;/figure>
&lt;/p>
&lt;h2 id="optimization-ideas">Optimization ideas&lt;/h2>
&lt;p>Now we can begin the process of optimizing &lt;code>image_sim_v1&lt;/code>. Let&amp;rsquo;s start with some simple modifications, then move on to more involved techniques.&lt;/p>
&lt;h3 id="follow-memory-layout">Follow memory layout&lt;/h3>
&lt;p>Before explaining anything in words, let&amp;rsquo;s take a look at the following code:&lt;/p>
&lt;div class="highlight">&lt;pre tabindex="0" class="chroma">&lt;code class="language-julia" data-lang="julia">&lt;span class="line">&lt;span class="cl">&lt;span class="k">function&lt;/span> &lt;span class="n">image_sim_v2&lt;/span>&lt;span class="p">(&lt;/span>&lt;span class="n">xᵖ&lt;/span>&lt;span class="p">,&lt;/span> &lt;span class="n">yᵖ&lt;/span>&lt;span class="p">,&lt;/span> &lt;span class="n">x&lt;/span>&lt;span class="p">,&lt;/span> &lt;span class="n">y&lt;/span>&lt;span class="p">)&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl"> &lt;span class="n">PSF&lt;/span> &lt;span class="o">=&lt;/span> &lt;span class="n">zeros&lt;/span>&lt;span class="p">(&lt;/span>&lt;span class="n">length&lt;/span>&lt;span class="p">(&lt;/span>&lt;span class="n">x&lt;/span>&lt;span class="p">),&lt;/span> &lt;span class="n">length&lt;/span>&lt;span class="p">(&lt;/span>&lt;span class="n">xᵖ&lt;/span>&lt;span class="p">),&lt;/span> &lt;span class="n">length&lt;/span>&lt;span class="p">(&lt;/span>&lt;span class="n">yᵖ&lt;/span>&lt;span class="p">))&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl"> &lt;span class="k">for&lt;/span> &lt;span class="n">j&lt;/span> &lt;span class="k">in&lt;/span> &lt;span class="n">eachindex&lt;/span>&lt;span class="p">(&lt;/span>&lt;span class="n">yᵖ&lt;/span>&lt;span class="p">),&lt;/span> &lt;span class="n">i&lt;/span> &lt;span class="k">in&lt;/span> &lt;span class="n">eachindex&lt;/span>&lt;span class="p">(&lt;/span>&lt;span class="n">xᵖ&lt;/span>&lt;span class="p">),&lt;/span> &lt;span class="n">n&lt;/span> &lt;span class="k">in&lt;/span> &lt;span class="n">eachindex&lt;/span>&lt;span class="p">(&lt;/span>&lt;span class="n">x&lt;/span>&lt;span class="p">)&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl"> &lt;span class="n">PSF&lt;/span>&lt;span class="p">[&lt;/span>&lt;span class="n">n&lt;/span>&lt;span class="p">,&lt;/span> &lt;span class="n">i&lt;/span>&lt;span class="p">,&lt;/span> &lt;span class="n">j&lt;/span>&lt;span class="p">]&lt;/span> &lt;span class="o">=&lt;/span> &lt;span class="n">exp&lt;/span>&lt;span class="p">(&lt;/span>&lt;span class="o">-&lt;/span>&lt;span class="p">(&lt;/span>&lt;span class="n">xᵖ&lt;/span>&lt;span class="p">[&lt;/span>&lt;span class="n">i&lt;/span>&lt;span class="p">]&lt;/span> &lt;span class="o">-&lt;/span> &lt;span class="n">x&lt;/span>&lt;span class="p">[&lt;/span>&lt;span class="n">n&lt;/span>&lt;span class="p">])&lt;/span>&lt;span class="o">^&lt;/span>&lt;span class="mi">2&lt;/span> &lt;span class="o">-&lt;/span> &lt;span class="p">(&lt;/span>&lt;span class="n">yᵖ&lt;/span>&lt;span class="p">[&lt;/span>&lt;span class="n">j&lt;/span>&lt;span class="p">]&lt;/span> &lt;span class="o">-&lt;/span> &lt;span class="n">y&lt;/span>&lt;span class="p">[&lt;/span>&lt;span class="n">n&lt;/span>&lt;span class="p">])&lt;/span>&lt;span class="o">^&lt;/span>&lt;span class="mi">2&lt;/span>&lt;span class="p">)&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl"> &lt;span class="k">end&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl"> &lt;span class="k">return&lt;/span> &lt;span class="n">dropdims&lt;/span>&lt;span class="p">(&lt;/span>&lt;span class="n">sum&lt;/span>&lt;span class="p">(&lt;/span>&lt;span class="n">PSF&lt;/span>&lt;span class="p">,&lt;/span> &lt;span class="n">dims&lt;/span> &lt;span class="o">=&lt;/span> &lt;span class="mi">1&lt;/span>&lt;span class="p">),&lt;/span> &lt;span class="n">dims&lt;/span> &lt;span class="o">=&lt;/span> &lt;span class="mi">1&lt;/span>&lt;span class="p">)&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&lt;span class="k">end&lt;/span>
&lt;/span>&lt;/span>&lt;/code>&lt;/pre>&lt;/div>&lt;p>Comparing &lt;code>image_sim_v2&lt;/code> with &lt;code>image_sim_v1&lt;/code>, the only change is the order of the nested for-loops. Yet benchmarking &lt;code>image_sim_v2&lt;/code> records &lt;code>8.362 ms (7 allocations: 10.50 MiB)&lt;/code>, a performance improvement of over 30% from this seemingly insignificant modification!&lt;/p>
&lt;p>The explanation is straightforward: variables are stored in a computer&amp;rsquo;s memory, and accessing this memory takes time. Objects such as arrays are usually stored in a contiguous block of memory, and accessing their elements in the order they are stored is naturally faster. Julia stores arrays in column-major order, meaning the first index varies fastest in memory. By interchanging the order of the &lt;code>j&lt;/code>-loop and the &lt;code>n&lt;/code>-loop, the innermost loop of &lt;code>image_sim_v2&lt;/code> always operates on a contiguous memory block. Note that different programming languages may use different memory layout conventions (for instance, C and NumPy default to row-major order), so it&amp;rsquo;s advisable to consult the documentation for specific details.&lt;/p>
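&lt;p>To make the effect concrete, here is a minimal sketch of the same loop-order experiment in isolation (the function names &lt;code>fill_colmajor!&lt;/code> and &lt;code>fill_rowmajor!&lt;/code> are illustrative, not from the simulation code):&lt;/p>
&lt;div class="highlight">&lt;pre tabindex="0" class="chroma">&lt;code class="language-julia" data-lang="julia">using BenchmarkTools

A = zeros(1000, 1000)

# Innermost loop runs over the first index: contiguous access in Julia
function fill_colmajor!(A)
    for j in axes(A, 2), i in axes(A, 1)
        A[i, j] = i + j
    end
end

# Innermost loop runs over the second index: strided access
function fill_rowmajor!(A)
    for i in axes(A, 1), j in axes(A, 2)
        A[i, j] = i + j
    end
end

@btime fill_colmajor!($A)  # typically several times faster than the strided version
@btime fill_rowmajor!($A)
&lt;/code>&lt;/pre>&lt;/div>
&lt;p>The two functions compute identical results; only the memory-access pattern differs.&lt;/p>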
&lt;h3 id="reduce-memory-allocation">Reduce memory allocation&lt;/h3>
&lt;p>Similar to accessing memory, memory allocation can also be a time-consuming process. In general, implementing the same algorithm with reduced memory allocation results in improved performance. Furthermore, this improvement tends to be more pronounced when dealing with larger datasets.&lt;/p>
&lt;p>Is there unnecessary memory allocation in &lt;code>image_sim_v2&lt;/code>? The answer is yes. It should be noted that there is no need to store the PSF of each molecule, as we are solely concerned with the final image. Consequently, we can allocate memory for just one image and update the value of each pixel:&lt;/p>
&lt;div class="highlight">&lt;pre tabindex="0" class="chroma">&lt;code class="language-julia" data-lang="julia">&lt;span class="line">&lt;span class="cl">&lt;span class="k">function&lt;/span> &lt;span class="n">image_sim_v3&lt;/span>&lt;span class="p">(&lt;/span>&lt;span class="n">xᵖ&lt;/span>&lt;span class="p">,&lt;/span> &lt;span class="n">yᵖ&lt;/span>&lt;span class="p">,&lt;/span> &lt;span class="n">x&lt;/span>&lt;span class="p">,&lt;/span> &lt;span class="n">y&lt;/span>&lt;span class="p">)&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl"> &lt;span class="n">PSF&lt;/span> &lt;span class="o">=&lt;/span> &lt;span class="n">zeros&lt;/span>&lt;span class="p">(&lt;/span>&lt;span class="n">length&lt;/span>&lt;span class="p">(&lt;/span>&lt;span class="n">xᵖ&lt;/span>&lt;span class="p">),&lt;/span> &lt;span class="n">length&lt;/span>&lt;span class="p">(&lt;/span>&lt;span class="n">yᵖ&lt;/span>&lt;span class="p">))&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl"> &lt;span class="k">for&lt;/span> &lt;span class="n">j&lt;/span> &lt;span class="k">in&lt;/span> &lt;span class="n">eachindex&lt;/span>&lt;span class="p">(&lt;/span>&lt;span class="n">yᵖ&lt;/span>&lt;span class="p">),&lt;/span> &lt;span class="n">i&lt;/span> &lt;span class="k">in&lt;/span> &lt;span class="n">eachindex&lt;/span>&lt;span class="p">(&lt;/span>&lt;span class="n">xᵖ&lt;/span>&lt;span class="p">),&lt;/span> &lt;span class="n">n&lt;/span> &lt;span class="k">in&lt;/span> &lt;span class="n">eachindex&lt;/span>&lt;span class="p">(&lt;/span>&lt;span class="n">x&lt;/span>&lt;span class="p">)&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl"> &lt;span class="n">PSF&lt;/span>&lt;span class="p">[&lt;/span>&lt;span class="n">i&lt;/span>&lt;span class="p">,&lt;/span> &lt;span class="n">j&lt;/span>&lt;span class="p">]&lt;/span> &lt;span class="o">+=&lt;/span> &lt;span class="n">exp&lt;/span>&lt;span class="p">(&lt;/span>&lt;span class="o">-&lt;/span>&lt;span class="p">(&lt;/span>&lt;span class="n">xᵖ&lt;/span>&lt;span class="p">[&lt;/span>&lt;span class="n">i&lt;/span>&lt;span class="p">]&lt;/span> &lt;span class="o">-&lt;/span> &lt;span class="n">x&lt;/span>&lt;span class="p">[&lt;/span>&lt;span class="n">n&lt;/span>&lt;span class="p">])&lt;/span>&lt;span class="o">^&lt;/span>&lt;span class="mi">2&lt;/span> &lt;span class="o">-&lt;/span> &lt;span class="p">(&lt;/span>&lt;span class="n">yᵖ&lt;/span>&lt;span class="p">[&lt;/span>&lt;span class="n">j&lt;/span>&lt;span class="p">]&lt;/span> &lt;span class="o">-&lt;/span> &lt;span class="n">y&lt;/span>&lt;span class="p">[&lt;/span>&lt;span class="n">n&lt;/span>&lt;span class="p">])&lt;/span>&lt;span class="o">^&lt;/span>&lt;span class="mi">2&lt;/span>&lt;span class="p">)&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl"> &lt;span class="k">end&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl"> &lt;span class="k">return&lt;/span> &lt;span class="n">PSF&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&lt;span class="k">end&lt;/span>
&lt;/span>&lt;/span>&lt;/code>&lt;/pre>&lt;/div>&lt;p>Benchmarking &lt;code>image_sim_v3&lt;/code> recorded &lt;code>8.290 ms (2 allocations: 512.05 KiB)&lt;/code>. While there is room to further optimize memory usage, such as exploring &amp;ldquo;mutating functions&amp;rdquo;, pursuing this path yields diminishing returns: although &lt;code>image_sim_v3&lt;/code> reduced memory allocation by a factor of 20, computation time decreased by less than 0.3 ms. This outcome was expected since the test case was intentionally designed to be small. Therefore, it is now time to focus on algorithmic optimizations.&lt;/p>
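&lt;p>For completeness, a mutating version might look like the following sketch (the trailing &lt;code>!&lt;/code> is the Julia convention for functions that modify their arguments; the name &lt;code>image_sim!&lt;/code> is my own, not from the benchmarks above):&lt;/p>
&lt;div class="highlight">&lt;pre tabindex="0" class="chroma">&lt;code class="language-julia" data-lang="julia"># The caller allocates the image buffer once and reuses it across calls
function image_sim!(I, xᵖ, yᵖ, x, y)
    fill!(I, 0)
    for j in eachindex(yᵖ), i in eachindex(xᵖ), n in eachindex(x)
        I[i, j] += exp(-(xᵖ[i] - x[n])^2 - (yᵖ[j] - y[n])^2)
    end
    return I
end

I = zeros(length(xᵖ), length(yᵖ))  # allocate once
image_sim!(I, xᵖ, yᵖ, x, y)        # no allocations inside the call
&lt;/code>&lt;/pre>&lt;/div>
&lt;p>This matters most when the same function is called many times, e.g. once per video frame, since the buffer is allocated only once.&lt;/p>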
&lt;h3 id="vectorization">Vectorization&lt;/h3>
&lt;p>Vectorization is arguably the most crucial technique discussed in this blog. Its concept is straightforward: execute computations in a manner that aligns with array (matrix or vector) operations (such as matrix multiplication and element-wise operations). By adopting this approach, we can achieve two significant advantages:&lt;/p>
&lt;ul>
&lt;li>Eliminate the need for slow for-loops, which tend to hinder performance in languages like Python or MATLAB.&lt;/li>
&lt;li>Leverage optimized (or even parallelized) routines that greatly enhance efficiency.&lt;/li>
&lt;/ul>
&lt;p>Vectorizing an algorithm is conceptually simple; the harder part is often recognizing the opportunity to apply it. In my experience, vectorization should at least be attempted whenever for-loops are involved.&lt;/p>
&lt;p>As a specific example, I will describe the thought process for my problem, which currently has three nested for-loops. First, from &lt;a href="#description-of-the-problem">this section&lt;/a>, we know the final image, denoted as \(I\), is obtained through &lt;/p>
\[I_{ij}=\sum_n PSF_{ijn},\]
&lt;p> but we can also write &lt;/p>
\[PSF_{ijn}=PSF^x_{in}PSF^y_{nj}\]
&lt;p> where &lt;/p>
\[PSF^x_{in}=\exp[-(x^p_i-x_n)^2]~\text{and}~PSF^y_{nj}=\exp[-(y^p_j-y_n)^2].\]
&lt;p> Therefore, we have &lt;/p>
\[I_{ij}=\sum_n PSF^x_{in}PSF^y_{nj}.\]
&lt;p>After this brief reorganization of the math, we have arrived at an expression that is exactly a matrix multiplication! Now it is quite clear how to proceed:&lt;/p>
&lt;ol>
&lt;li>Construct two matrices, \(PSF^x\) and \(PSF^y\), with array subtraction&lt;sup id="fnref:1">&lt;a href="#fn:1" class="footnote-ref" role="doc-noteref">1&lt;/a>&lt;/sup>, element-wise square, and element-wise exponential.&lt;/li>
&lt;li>Perform a matrix multiplication between \(PSF^x\) and \(PSF^y\).&lt;/li>
&lt;/ol>
&lt;p>
&lt;figure id="figure-vectorized-psf-calculation">
&lt;div class="d-flex justify-content-center">
&lt;div class="w-100" >&lt;img alt="A simple vectorization scheme." srcset="
/post/optimization-i/fig_vec_huf89d4ba5ab0989bcacb7b454c0316b33_22540_a88c5f0091cf27d6de126925d8156731.webp 400w,
/post/optimization-i/fig_vec_huf89d4ba5ab0989bcacb7b454c0316b33_22540_c44facfd7ff0afe9a2b98ab1ee8d8301.webp 760w,
/post/optimization-i/fig_vec_huf89d4ba5ab0989bcacb7b454c0316b33_22540_1200x1200_fit_q75_h2_lanczos_3.webp 1200w"
src="https://lancexwq.netlify.app/post/optimization-i/fig_vec_huf89d4ba5ab0989bcacb7b454c0316b33_22540_a88c5f0091cf27d6de126925d8156731.webp"
width="760"
height="345"
loading="lazy" data-zoomable />&lt;/div>
&lt;/div>&lt;figcaption>
Vectorized PSF calculation
&lt;/figcaption>&lt;/figure>
&lt;/p>
&lt;p>In Julia code, we have&lt;/p>
&lt;div class="highlight">&lt;pre tabindex="0" class="chroma">&lt;code class="language-julia" data-lang="julia">&lt;span class="line">&lt;span class="cl">&lt;span class="k">function&lt;/span> &lt;span class="n">image_sim_v4&lt;/span>&lt;span class="p">(&lt;/span>&lt;span class="n">xᵖ&lt;/span>&lt;span class="p">,&lt;/span> &lt;span class="n">yᵖ&lt;/span>&lt;span class="p">,&lt;/span> &lt;span class="n">x&lt;/span>&lt;span class="p">,&lt;/span> &lt;span class="n">y&lt;/span>&lt;span class="p">)&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl"> &lt;span class="n">PSFˣ&lt;/span> &lt;span class="o">=&lt;/span> &lt;span class="n">exp&lt;/span>&lt;span class="o">.&lt;/span>&lt;span class="p">(&lt;/span>&lt;span class="o">-&lt;/span>&lt;span class="p">(&lt;/span>&lt;span class="n">xᵖ&lt;/span> &lt;span class="o">.-&lt;/span> &lt;span class="n">Transpose&lt;/span>&lt;span class="p">(&lt;/span>&lt;span class="n">x&lt;/span>&lt;span class="p">))&lt;/span> &lt;span class="o">.^&lt;/span> &lt;span class="mi">2&lt;/span>&lt;span class="p">)&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl"> &lt;span class="n">PSFʸ&lt;/span> &lt;span class="o">=&lt;/span> &lt;span class="n">exp&lt;/span>&lt;span class="o">.&lt;/span>&lt;span class="p">(&lt;/span>&lt;span class="o">-&lt;/span>&lt;span class="p">(&lt;/span>&lt;span class="n">y&lt;/span> &lt;span class="o">.-&lt;/span> &lt;span class="n">Transpose&lt;/span>&lt;span class="p">(&lt;/span>&lt;span class="n">yᵖ&lt;/span>&lt;span class="p">))&lt;/span> &lt;span class="o">.^&lt;/span> &lt;span class="mi">2&lt;/span>&lt;span class="p">)&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl"> &lt;span class="k">return&lt;/span> &lt;span class="n">PSFˣ&lt;/span> &lt;span class="o">*&lt;/span> &lt;span class="n">PSFʸ&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&lt;span class="k">end&lt;/span>
&lt;/span>&lt;/span>&lt;/code>&lt;/pre>&lt;/div>&lt;p>&lt;code>image_sim_v4&lt;/code>&amp;rsquo;s benchmark recorded &lt;code>101.157 μs (14 allocations: 752.33 KiB)&lt;/code>, 80x faster than &lt;code>image_sim_v3&lt;/code>!&lt;/p>
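&lt;p>When refactoring like this, it is worth checking that the vectorized version agrees with the loop version. A sanity check along these lines (my own addition; the grids and emitter positions below are illustrative) could be:&lt;/p>
&lt;div class="highlight">&lt;pre tabindex="0" class="chroma">&lt;code class="language-julia" data-lang="julia">xᵖ, yᵖ = range(-5, 5, 100), range(-5, 5, 100)  # example pixel grids
x, y = randn(50), randn(50)                     # example emitter positions

# Floating-point results differ slightly between the loop and BLAS versions,
# so compare with isapprox (≈) rather than exact equality
@assert image_sim_v3(xᵖ, yᵖ, x, y) ≈ image_sim_v4(xᵖ, yᵖ, x, y)
&lt;/code>&lt;/pre>&lt;/div>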
&lt;p>At this stage, we have essentially reached the limit of potential improvements for this simple example. Additional optimizations could involve the utilization of hardware-specific math libraries&lt;sup id="fnref:2">&lt;a href="#fn:2" class="footnote-ref" role="doc-noteref">2&lt;/a>&lt;/sup> and datatype-specific operations, but these aspects are beyond the scope of this blog. However, this does not signal the end of our discussion, as we can introduce a slightly more complex (and realistic) example that allows us to explore more advanced optimization techniques. I will continue this discussion in my next blog post.&lt;/p>
&lt;div class="footnotes" role="doc-endnotes">
&lt;hr>
&lt;ol>
&lt;li id="fn:1">
&lt;p>Refer to &lt;a href="https://www.mathworks.com/help/matlab/matlab_prog/compatible-array-sizes-for-basic-operations.html" target="_blank" rel="noopener">this webpage&lt;/a> for compatible array sizes regarding array subtraction and more.&amp;#160;&lt;a href="#fnref:1" class="footnote-backref" role="doc-backlink">&amp;#x21a9;&amp;#xfe0e;&lt;/a>&lt;/p>
&lt;/li>
&lt;li id="fn:2">
&lt;p>Such as Intel Math Kernel Library (MKL) and AMD Optimizing CPU Libraries (AOCL).&amp;#160;&lt;a href="#fnref:2" class="footnote-backref" role="doc-backlink">&amp;#x21a9;&amp;#xfe0e;&lt;/a>&lt;/p>
&lt;/li>
&lt;/ol>
&lt;/div></description></item></channel></rss>