Transcending the memory limitations of a single GPU with PreonLab 6.1

March 15, 2024
Siddharth Marathe

With every PreonLab update, we aim to continuously enhance your simulation experience by increasing efficiency and reducing the memory footprint.

While simulation on CPU is still the bread and butter for CFD, it is undeniable that GPUs can provide a significant performance boost and reduce computation time.
Nevertheless, one advantage CPUs generally have over GPUs is a larger memory space. Due to the limited memory of a single GPU card, simulating very large scenes with many particles can be quite challenging. In addition, importing large tensor fields such as airflows can occupy a lot of precious memory on the graphics card. While PreonLab can cleverly resample such airflows to fit on single-GPU hardware, there will always come a point at which sacrificing accuracy is inevitable.
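To get a feel for why imported airflows are so memory-hungry, here is a rough back-of-envelope sketch. The grid size and float32 storage are illustrative assumptions, not how PreonLab actually stores such fields:

```python
# Rough memory estimate for an imported airflow (velocity) field.
# Hypothetical numbers: a uniform 512^3 grid storing a 3-component
# float32 velocity vector per cell; PreonLab's internal storage differs.
nx = ny = nz = 512
components = 3          # vx, vy, vz
bytes_per_value = 4     # float32
total_bytes = nx * ny * nz * components * bytes_per_value
print(f"{total_bytes / 2**30:.1f} GiB")  # 1.5 GiB for a single time step
```

Even this modest grid consumes gigabytes per time step, which is why resampling, or simply more GPU memory, quickly becomes necessary.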

One logical solution is making use of state-of-the-art GPU hardware that can accommodate large simulation scenes. Professional GPU cards like Nvidia’s H100 GPUs can already offer memory space up to 80 GB. However, does this mean that GPU simulations are only possible through the acquisition of larger and larger professional GPU cards? And what about scenes which might require even more memory space than the latest hardware available?


Our solution

We believe that GPU simulations should not require acquiring expensive new GPU hardware every time; users should also be able to fully utilize the GPU hardware they already have access to.
To overcome this single-GPU limitation, PreonLab 6.1 introduces remarkable multi-GPU capabilities, so that you can harness multiple GPU cards and accelerate simulation performance, however large and demanding your simulation scenario might be.

This means that you can now combine multiple GPU cards on a single cluster node and increase the total memory space at your disposal. Simulate your most challenging, memory-hungry scenes, or increase the particle resolution for more accuracy. Let the power of additional GPU processor cores speed up your simulations and give you insights within a fraction of the time you needed before.

This sounds like the perfect solution!

But what exactly makes us say that the multi-GPU capabilities of PreonLab 6.1 are indeed so impressive?

We recently conducted some performance benchmarks for different types of simulations on CPU, GPU and multi-GPU hardware to demonstrate exactly this, and would now like to share some of these results in this article.

Hardware used for the performance benchmarks:

For all the performance benchmarks in the following section, we compared the normalized simulation time between a dual-socket CPU, a single GPU card, and combinations of up to 4 GPU cards on a single cluster node. For this purpose, we considered two types of professional GPU cards.

Note: While these simulation scenes already fit on one of the professional GPU cards, running them on multiple GPU cards of the same type should result in a close-to-linear speed-up. In most cases, the scenes occupy nearly all the available memory on a single GPU card, and any change to the scene that requires more memory would exceed it.

Additionally, we will take a look at how a consumer GPU card installed on one of our workstations performs against one of the professional GPU cards, for a scene simulated with the snow solver and an imported airflow, sized so that it fits on both cards. Table 1 gives a more detailed overview of the hardware.

Nr. | Name                       | Number | CPU threads / CUDA SMs | TDP (W)
----|----------------------------|--------|------------------------|---------
1   | AMD EPYC 7543 CPU          | 2x     | 2x 64                  | 2x 225
2   | Nvidia A100 40 GB GPU      | 1-4x   | 108-432                | 1-4x 400
3   | Nvidia L40 48 GB GPU       | 1-2x   | 142-284                | 1-2x 300
4   | Nvidia RTX 4090 24 GB GPU  | 1x     | 128                    | 1x 450

Table 1

Overview of the hardware used for the performance benchmarks in this article.

Marin Tank

Performance Benchmark 1


Video 1

Marin Tank Simulation Performance Scaling on Multi-GPU

The first step for us is always to have a look at how the PreonSolver performs on any new platform, in this case: multi-GPU hardware. Video 1 shows the performance data for the well-established Marin Tank simulation between CPU and combinations of 1 to 4 professional GPUs.

The simulation consists of around 88 million fluid particles. As the video shows, the simulation scales in a near perfect manner from a single GPU to 2 GPUs as well as from 2 GPUs to 4 GPUs with a scaling factor of 1.9x in each case!
This is a wonderful start.
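The "near perfect" claim can be quantified as parallel scaling efficiency, i.e. the fraction of ideal linear scaling actually achieved. The 1.9x figure is the one reported above for the Marin Tank benchmark; the raw timings themselves are not published here:

```python
# Parallel scaling efficiency from measured speed-up factors.
def scaling_efficiency(speedup, gpu_ratio):
    """Fraction of ideal (linear) scaling actually achieved."""
    return speedup / gpu_ratio

# 1.9x per GPU-doubling, as reported for the Marin Tank benchmark.
print(scaling_efficiency(1.9, 2))        # 0.95 efficiency per doubling
print(scaling_efficiency(1.9 * 1.9, 4))  # 1 -> 4 GPUs: ~0.90 overall
```

An efficiency of 0.95 per doubling means only about 5% of the added hardware is lost to communication and synchronization overhead.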

Impinging Jet

Performance Benchmark 2

An important aspect of many applications is thermal analysis. Several thermal phenomena require very small particle sizes to accurately resolve the heat transfer across the regions of interest. This increases the particle count and, inevitably, the computational time as well as the memory footprint. When this principle is extended to an entire E-motor, multi-GPU becomes essential.
Video 2.1 shows the performance data for a thermal impinging jet benchmark simulation, comparing CPU, a single GPU, and 2 GPUs.


Video 2.1

Impinging Jet Thermal Simulation Performance Scaling on Multi-GPU

The simulation consists of around 39.2 million fluid particles. It benefits from a high speed-up of 8x from CPU to GPU, and scales almost perfectly from a single GPU to 2 GPUs with a scaling factor of 1.97x, for a total speed-up of 15.76x! Again, great news for thermal simulations.
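The total figure is simply the two reported speed-ups composed: CPU to single GPU, then single GPU to 2 GPUs. The same composition explains the totals quoted for the later benchmarks as well:

```python
# How the total speed-up composes for the impinging jet benchmark.
cpu_to_gpu = 8.0      # reported CPU -> single-GPU speed-up
gpu_scaling = 1.97    # reported 1 GPU -> 2 GPU scaling factor
total = cpu_to_gpu * gpu_scaling
print(f"{total:.2f}x")  # 15.76x, matching the reported total
```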


Video 2.2

Cooling Simulation of a 360° Section of an E-motor with PreonLab 6.1.

Speaking of an entire E-motor, Video 2.2 shows a simulation of the entire 360° section of an E-motor cooling application, where the copper windings are cooled by 4 oil jets from within the rotating shaft. The simulation was performed with PreonLab 6.1 on 2x L40 GPUs on our cluster. With the help of multi-GPU, the computational time can be reduced to the extent that project turnaround times become realistic, even for comprehensive simulations like this one.

Snow

Performance Benchmark 3


Video 3

Snow in Car Wheel Simulation Performance Scaling on Multi-GPU

We have seen how multi-GPU performs for two popular benchmark simulations. Yet, PreonLab's multi-platform support can never be complete without the availability of our elastoplastic snow model. Calculating elastic and plastic deformations can be computationally expensive, especially when fine resolution is needed to mimic snowflakes. Video 3 shows how snow simulations can be expected to scale on multi-GPU.

The simulation shown in the video consists of around 34 million fluid particles. Once more, there is a high speed-up of 6.17x from CPU to GPU, and the performance scales in a near-perfect manner from a single GPU to 2 GPUs with a scaling factor of 1.86x, for a total speed-up of 11.47x!

Deep Water Wading

Performance Benchmark 4


Video 4

Deep Water Wading Simulation Performance Scaling on Multi-GPU

Next, we turn our attention to a water wading application simulation and the performance boost PreonLab 6.1 can offer such simulations. You can see the results for a deep-water wading performance benchmark that we performed with PreonLab 6.1 between CPU, GPU and 2 GPUs in Video 4.
The simulation consists of around 31.2 million fluid particles and uses the Car Suspension Model (CSM) to calculate the deflection of the springs as well as the hydrodynamic forces acting on the car, as the car drives through the deep wading channel.
Result: the simulation runs 4 times faster on a single GPU than on the CPU, and scales from a single GPU to 2 GPUs with an impressive, near-perfect factor of 1.95x, resulting in a total speed-up of about 8 times! This means that challenging wading simulations like this one, which previously took more than a week, can now be completed within a day.

Snow + Airflow

Performance Benchmark 5


Video 5

Snow with Airflow Simulation Performance Scaling on Multi-GPU

Perhaps you have wondered how PreonLab's extended GPU capabilities can help with simulations you want to perform right on your personal machine?
As mentioned in the hardware overview section, we also compared simulation performance between a dual-socket CPU, a professional GPU card with 48 GB of memory installed on our cluster, and a consumer GPU card with 24 GB of memory installed on one of our workstations.

You can see the results in Video 5.

The simulation consists of a maximum of 7.8 million snow particles and was run for 12 seconds of physical time. The number of particles was calibrated so that the scene fits well on both graphics cards, without exceeding the memory limit of the workstation card. As you can see, the speed-up between the CPU and each GPU card is very similar.

Conclusion

Here’s how we see it: PreonLab 6.1’s extended GPU and multi-GPU support will benefit everyone, from the largest GPUs to the most popular ones on the market. Your choice of GPU should depend only on what and how much you want to simulate; PreonLab 6.1 will take care of the rest.

TIP: The higher the number of particles per GPU card, the greater the performance benefit you can expect. If the number of particles per GPU is too low, using multiple GPUs might not always be the way to go!
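A minimal sketch of how one might apply this tip. The break-even threshold below is a made-up placeholder for illustration only, NOT an official PreonLab recommendation; in practice it depends on the solver, the scene, and the hardware:

```python
# Illustrative helper for the tip above: is adding GPUs likely worthwhile?
# The threshold is a hypothetical placeholder, not a PreonLab figure.
MIN_PARTICLES_PER_GPU = 5_000_000  # assumed break-even point

def worth_adding_gpus(total_particles, num_gpus):
    """True if each GPU would still have enough work to scale well."""
    return total_particles / num_gpus >= MIN_PARTICLES_PER_GPU

print(worth_adding_gpus(88_000_000, 4))  # True  (22M particles per GPU)
print(worth_adding_gpus(2_000_000, 4))   # False (0.5M particles per GPU)
```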

Outlook - Transcending limitations and reaching new horizons

Simulating on multi-GPU kills two birds with one stone: you can speed up simulations with additional GPU hardware, and you can simulate scenes that would ordinarily not fit on a single GPU card. Thus, with PreonLab 6.1, we take another big step towards our goal of providing the ultimate simulation tool with the finest multi-platform support. You can read more about our vision of platform independence for simulations in this article.

Furthermore, we envision that PreonLab's unique memory-reducing features, such as continuous particle size (CPS), dynamic sampling, and adaptive sampling on GPU, coupled with reliable multi-GPU support, will open the doors to many more demanding and advanced simulation possibilities.

In the long term, this combination of powerful hardware and true software innovation will prove key to tackling the exciting simulation challenges on the horizon.
