
ATTN: Given the great success of the 2022 hackathon, we are now planning the 2023 hackathon. If you are interested in participating, please email us at training.nci@anu.edu.au to reserve a spot.

What is it?

If you have an application running on Gadi and you are thinking about porting it to GPUs to increase its speed, or if your AI/ML application is already using GPUs and you could use a helping hand getting it to that next level of performance, consider applying to participate in the NCI-NVIDIA HPC-AI Hackathon.

Whether your code is a traditional HPC-centric application or it focuses on AI/ML technologies, the goal of the event is to port and optimise codes on GPUs in a focused and highly collaborative environment. GPU resources at NCI will be in use during this Hackathon. We are expecting 10-12 research teams across various scientific research domains to participate.

This Hackathon is one of several high-impact international training events hosted at NCI this year. Through a range of bootcamps in 2022, we have been building our users’ capacity to make good use of our GPU resources. The precursor events, bootcamps in data science, distributed deep learning, NVIDIA CUDA Python, OpenACC and CUDA C/Fortran, provided hands-on learning and exercises for skills development. Participants learned how to profile their code, find the computational sweet spots, and improve their performance with different strategies and tools. Finally, during the hackathon they were able to apply what they learned to their own applications. This was the culminating opportunity for NCI users to get dedicated help and guidance from both NCI and NVIDIA mentors.

An extra benefit of being part of the hackathon is that teams continue to be supported by subject matter experts as an ongoing engagement commitment. Read our story here.

Key dates:

EOI submissions open – Submit your Expression of Interest using this form.
EOI submissions close – The closing date has been extended by one week.
Full proposal submissions open – We will request a more detailed description of your application so that we can best match tutors to help you during the hackathon. Please register using this link.
Full proposal submissions close.
September-October 2022: Participants develop requisite skills and knowledge.
Hackathon: see the schedule below.

Hackathon Schedule:

NCI GPU Hackathon 2022 Team/Mentor Meeting - online

NCI GPU Hackathon 2022 Day 1 - online

NCI GPU Hackathon 2022 Day 2 - onsite

NCI GPU Hackathon 2022 Day 3 - onsite

NCI GPU Hackathon 2022 Day 4 - onsite

Event Format:

We will run the hackathon in hybrid mode; see details on the registration page. The last three days, 2-4 November, will be hosted by NCI and NVIDIA at the ANU campus in Canberra, Australia. This face-to-face coding hackathon makes debugging and walking through issues much more efficient and improves the whole hackathon experience.

Event Outcome:

We helped 9 teams from ANU, UQ, Monash University, UNSW Sydney and UNSW Canberra with various scientific HPC and/or AI applications. Every team gained knowledge and support, and achieved a decent speedup on the GPU accelerators available on Gadi.

Here is a short summary of each team's experience:

Hypersonics Team

The UNSW Canberra Hypersonics team consists of two PhD students, Luke Pollock and Zacharia Tuten, and a part-time masters student, Luke Rooney. Our group specialises in the study of hypersonic vehicles. Our objective for this hackathon was to optimise a fundamental piece of aerodynamic analysis software that will be embedded in a larger machine learning code and called repeatedly. GPU computing was completely new to us at the beginning of this hackathon, but it was something we wanted to explore.

On the first day of the hackathon our mentors were able to pair our code with some easy-to-implement GPU optimisation tools for Python. They also shared some excellent visual code profiling tools with us, which helped identify the weaker points of our code. This practical guidance from our mentors was invaluable; the coding practices we picked up in these early stages of the hackathon are things we will carry into our studies and working lives.

On the second day, the first in-person day, we spent most of the time profiling the code and discovered some overlooked coding mistakes that were slowing down the overall performance: the code was calling a function it shouldn't have been. The outputs were unchanged, but removing this call halved the runtime of our code.

Following this, the next morning we were taken through Gadi's front end so that we could use it to begin adding GPU optimisation to our code. Our team was new to Gadi and was pleasantly surprised by how easy it was to migrate our code over and install dependencies on the platform. The tour of the actual facility was incredibly impressive as well.

On the morning of day 3 we managed to get our code optimised with CuPy, which led to some impressive speedups we weren't expecting, making the planned machine learning expansion a much more realistic possibility in the future.
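As a rough illustration of this kind of change (the team's actual solver is not shown here, and the function below is purely hypothetical), a dense NumPy routine can often be moved to the GPU with CuPy with only small edits:

```python
# Hypothetical sketch only: moving a dense NumPy linear solve onto the GPU
# with CuPy. The function name and problem size are illustrative, not the team's code.
import numpy as np
import cupy as cp

def influence_solve_cpu(A, b):
    # Original CPU path: dense solve with NumPy.
    return np.linalg.solve(A, b)

def influence_solve_gpu(A, b):
    # GPU path: copy to the device, solve with CuPy, copy the result back.
    A_gpu, b_gpu = cp.asarray(A), cp.asarray(b)
    x_gpu = cp.linalg.solve(A_gpu, b_gpu)
    return cp.asnumpy(x_gpu)

A = np.random.rand(4096, 4096)
b = np.random.rand(4096)
x = influence_solve_gpu(A, b)
```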

Overall, our team had a great time at the hackathon. We met a lot of extremely clever researchers, including, of all people, another Hypersonics group, who were great to chat to. The mentors were also fantastic for developing our research and professional coding practices, as well as being an extremely friendly and approachable bunch of people.

We were initially hesitant to sign up for the hackathon, as our code base is in its early stages of development, but we would recommend that anyone in the same situation just sign up! Approaching GPU optimisation from the very start of development with the help of experts has resulted in an extremely efficient basis for our project and has taught us new approaches for future development. The NCI team is fantastic and runs an excellent program!

CLEX Team

Speeding up the downscaling of climate data using GPU-enabled machine learning

Our ‘CLEX’ team includes two postdocs, Sanaa Hobeichi and Nidhi Nishant, and a technical expert, Sam Green. We are part of the Centre of Excellence for Climate Extremes and are based at the University of New South Wales in Sydney. We are working on a project developing and optimising machine learning models for downscaling climate data. By joining the hackathon, we had the chance to work closely with NVIDIA experts Juntao Yang and Yi Kwan Zheng on optimising this work on GPUs.

Figure 1: Low resolution (Left) and High resolution (Right) maps of Evapotranspiration over part of New South Wales. The aim of this project is to use GPU-enabled machine learning to produce the high resolution map from the low resolution one.

The maps in Figure 1 show a climate variable, evapotranspiration, estimated for one day over part of New South Wales at low resolution (map on the left) and at high resolution (map on the right, 9× higher). Evapotranspiration determines how much water the plants transpire and the soil evaporates. Knowing evapotranspiration at the fine scale provides essential information for local adaptation planning for climate change impacts on agriculture, water supply, and fire risk, among others.

We brought to the Hackathon a working Python script that generates high-resolution maps of evapotranspiration, similar to the one on the right, from low-resolution ones by building neural network models that emulate downscaling using the GPU-enabled torch.nn package. This script builds a neural network for every gridcell in the low-resolution map, which contains more than 6000 gridcells, in a sequential way. A single gridcell takes 1.5 minutes to process, so it takes 150 hours to downscale a region of 6000 gridcells. While there are several ways to parallelise processes on CPUs, we didn't know how to run multiple processes on a GPU. Multi-processing our code on the GPU would result in a significant speed-up of downscaling, which was our main goal in this hackathon.
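To give a feel for the sequential pattern described above, here is a heavily simplified sketch; the real model architecture, data handling and training loop differ, and the array sizes below are made up:

```python
# Heavily simplified sketch of the sequential per-gridcell training described
# above; model size, epochs and data are placeholders, not the real workflow.
import torch
import torch.nn as nn

device = "cuda" if torch.cuda.is_available() else "cpu"

def train_one_gridcell(x, y, epochs=100):
    # One small fully connected network per low-resolution gridcell.
    model = nn.Sequential(nn.Linear(x.shape[1], 64), nn.ReLU(),
                          nn.Linear(64, y.shape[1])).to(device)
    opt = torch.optim.Adam(model.parameters(), lr=1e-3)
    loss_fn = nn.MSELoss()
    x, y = x.to(device), y.to(device)
    for _ in range(epochs):
        opt.zero_grad()
        loss = loss_fn(model(x), y)
        loss.backward()
        opt.step()
    return model

# Dummy stand-ins for the low-resolution predictors and the finer-scale targets.
n_gridcells = 8              # the real map has more than 6000 gridcells
predictors = [torch.randn(365, 10) for _ in range(n_gridcells)]
targets = [torch.randn(365, 9) for _ in range(n_gridcells)]

# Sequential loop: each gridcell waits for the previous one to finish.
models = [train_one_gridcell(x, y) for x, y in zip(predictors, targets)]
```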

Our journey summarised:

Online Day 0: We met our mentors and explained how our code worked. We also outlined our goals for the event.

Online Day 1: Our mentors explained how to use NVIDIA Nsight Systems to profile our code and helped us analyse how efficiently our script used the CPUs and GPU. We ran the Nsight Systems tool and annotated our script using the NVTX library. We found that we were only using ~25% of the GPU memory and 5 of the 16 CPUs requested. As a result, we needed to find a way to build multiple neural networks in parallel so that we make full use of the GPU and CPUs.
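A minimal sketch of the NVTX annotation pattern (the function names and profiling command below are illustrative, not the team's actual script):

```python
# Sketch of NVTX range annotations used to label code regions so they show up
# on the Nsight Systems timeline. Run under the profiler with something like:
#   nsys profile --trace=nvtx,cuda python downscale.py
import nvtx

@nvtx.annotate("load_data", color="green")
def load_data():
    return list(range(8))          # placeholder for reading the climate data

@nvtx.annotate("train_gridcell", color="blue")
def train_gridcell(cell):
    pass                           # placeholder for building/training one model

def main():
    cells = load_data()
    for cell in cells:
        with nvtx.annotate("one_cell", color="orange"):
            train_gridcell(cell)

if __name__ == "__main__":
    main()
```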

Day 2: Our mentors helped us upgrade our script so that GPU multi-processing is possible. To achieve this, we took advantage of PyTorch's torch.multiprocessing library to assign each gridcell to its own process on the GPU. While doing this, we also refactored the script to reduce repetition and included more NVTX annotations to add information to the profiler output.
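A minimal sketch of that multiprocessing pattern, with a stub in place of the per-gridcell training:

```python
# Minimal sketch of running several gridcell workers as separate processes on
# one GPU with torch.multiprocessing; the worker body is a stub here.
import torch
import torch.multiprocessing as mp

def worker(cell_id):
    # Each process does its own gridcell's work on the (shared) GPU.
    x = torch.randn(365, 10, device="cuda")
    # ... build and train the neural network for this gridcell here ...
    torch.cuda.synchronize()
    print(f"gridcell {cell_id} done")

if __name__ == "__main__":
    mp.set_start_method("spawn")   # required when CUDA is used in child processes
    n_parallel = 4                 # Figure 2: four processes saturated one GPU
    procs = [mp.Process(target=worker, args=(i,)) for i in range(n_parallel)]
    for p in procs:
        p.start()
    for p in procs:
        p.join()
```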

Day 3: We evaluated the predicted evapotranspiration results to make sure the parallel neural networks were performing correctly. We used the Nsight tool to confirm that the processes were running in parallel and that there were no obvious bottlenecks. Then we started to analyse the number of parallel gridcells versus the runtime of the job. As the day ended, we were excited to see how many processes could run in parallel in order to maximise the use of the GPU.

Day 4: We increased the number of processes running in parallel and tracked various CPU/GPU usage parameters to determine the optimal number of processes that enables full usage of one GPU. Figure 2 shows the CPU and GPU usage and the runtime as we increase the number of processes (number of gridcells) run in parallel. We determined that a single GPU can handle four processes in parallel. Future work will be to scale this application across multiple GPUs to further reduce the runtime.



Figure 2: The relationship between the number of gridcells downscaled in parallel and the usage of CPU and GPU resources.

Figure 3: The relationship between the number of gridcells and the time needed for downscaling, for parallel and sequential processing.

The NCI hackathon was a great opportunity to work closely with NVIDIA GPU experts. Their guidance helped us learn how to convert our code from sequential execution to multiprocessing on GPUs, how to profile our code and find bottlenecks, how to use GPU and CPU resources more efficiently, and how to optimise these for our current application. As a result, we created a downscaling emulator that uses fewer resources and is so far 4 times faster (see Figure 3). This is still a work in progress, and we are excited to see how much further we can reduce the runtime on multiple GPUs and to apply these new multiprocessing skills to future applications in climate science.

Hyperhacks Team

For as long as there has been computational fluid dynamics, users have complained: "Why is my calculation so slow?" For the developers of CFD codes, the latest development in disappointing our users has been Graphics Processing Units, or GPUs, a class of parallel compute hardware that is specialised for performing clusters of arithmetic operations all at once.

Though GPUs were originally designed for video games and other 3D rendering tasks, they have recently been applied to a diverse range of mathematically intense calculations, including machine learning, bioinformatics, and physics simulations such as CFD. The Hyperhacks team is part of the numerical simulations branch of the University of Queensland's Centre for Hypersonics, consisting of PhD students Robert Watt and Christine Mittler, postdoctoral researcher Nick Gibbons, and honorary fellow Peter Jacobs. For the hackathon we built a stripped-down flow solver with the whimsical name "Chicken", and flew to Canberra to test, debug, and optimise it with the help of NCI and NVIDIA's GPU programming experts.

Figure: A forward-facing step at Mach 3, computed using the Chicken flow solver.

Although Chicken is simple compared to a production CFD code, it is still a very complicated algorithm compared to a typical general-purpose GPU application, and we discovered several issues with the code's I/O, boundary conditions, and especially memory access in the gradient and reconstruction steps, which seem to be the program's primary execution bottleneck. However, the expertise we gained in using CUDA profiling tools enabled us to identify and start working on these issues, achieving a factor of 4 speedup in the pure GPU code. We are grateful to all the NVIDIA and NCI staff for their help, but especially to our mentor Wei Fang, who helped us run the profilers, pull out the results, and figure out what all of that data was trying to tell us.

LTRAC-GO Team

Team LTRAC-GO comprises research fellow Minghang Li, PhD student Antonio Matas, Senior Lecturer Callum Atkinson and Professor Julio Soria from the Department of Mechanical and Aerospace Engineering at Monash University, together with mentor Jeffrey Adie, solutions architect at NVIDIA.

Our aim at the NCI GPU Hackathon was to port part of the OpenTBL code to the GPU. OpenTBL is a numerical code for the direct simulation of turbulent boundary layers which uses hybrid OpenMP-MPI parallelisation. At the Hackathon, we aimed to accelerate the code using OpenACC, a programming standard designed to simplify GPU programming. This choice was motivated by the fact that OpenACC allows the creation of efficient parallel GPU code with only minor modifications to the original CPU code.

The first challenge we came across was compiling the code under the NVHPC environment. The required libraries, specifically HDF5 and zlib, were only built on Gadi with Intel and GNU compiler support. With the help of our mentor Jeffrey Adie, we succeeded in compiling the code by building our own libraries for the gpuvolta architecture. A dynamic library linking failure prevented us from compiling the code for the dgxa100 architecture, so during the remaining days of the Hackathon we focused on accelerating the OpenTBL code exclusively on the gpuvolta nodes.

Due to the complexity of the code, we focused our attention on parallelising some of the key operators that are frequently called and executed in each step of a computational run: the interpolation and differentiation operators. When we first added OpenACC clauses to both operators, a computational run took longer than with the original CPU code. Using the NVIDIA profiling tool Nsight Systems and debugging at runtime enabled us to better understand how the code interacts with the device, by examining kernel launches, memory allocation, data movements, etc. This insight helped us restructure our targeted operators, and we were able to demonstrate a 70x speedup in the kernel compute time. However, for the small test case used at the Hackathon, a significant amount of time was spent on data movements, especially from host to device, so there was no overall performance gain.

The NCI GPU Hackathon has been a good starting point for porting specific parts of the OpenTBL code to GPUs. We will continue this investigation by further examining and optimising the data movements and running tests on larger simulation cases that make full use of the GPU capabilities. It has been a very enriching experience to learn from NCI and NVIDIA experts and from the rest of the participants. We would like to extend our gratitude and a special thank you to Jeffrey Adie, who assisted us throughout the event and whose guidance made our progress possible. We are also grateful to Jingbo Wang and to the rest of the NCI and NVIDIA members who made this event possible.

Rxn-net Team

Team rxn-net comprises three PhD students (Jiaxin Fan at UNSW, and Matt McDermott and Max Gallant from LBNL in Berkeley, CA, USA) and Rui Yang, our mentor from NCI. Our team came to this hackathon with a fully functioning Python code that constructs reaction networks for predicting the intermediate phases that form when inorganic solids react (e.g. during the synthesis of battery cathodes or piezoelectric devices). We construct these networks using data from the Materials Project, a database of density functional theory calculation results that includes thermodynamic values such as the enthalpy of formation of each material. Specifically, the process entails the following steps:

  1. Collect all the known phases that fall within the chemical system of interest
  2. Enumerate all the possible reactions between them
  3. Construct a network of reactions connecting sets of phases together
  4. Use path-finding algorithms to find the most thermodynamically favourable series of reactions between specific sets of precursors and products


From a computational perspective, a notable feature of this process is that the number of reactions included in the network grows combinatorially with the number of phases considered. Since part of the pathfinding process involves identifying linear combinations of reactions that yield the desired overall reaction (that is, coefficients on the reactant and product phases that represent a reaction conserving mass), we end up performing a combinatorially large number of matrix pseudoinverse operations to identify these combinations. At this hackathon, our goal was to move these operations onto the GPU.
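As a toy illustration of that idea (not the team's actual formulation), a pseudoinverse gives the least-squares weights on a set of candidate reactions that reproduce a target overall reaction:

```python
# Toy illustration of the balancing step: find weights on candidate reactions
# whose weighted sum reproduces a target overall reaction. The matrices are
# made up; the real code builds them from Materials Project data.
import numpy as np

# Rows = candidate reactions, columns = phases
# (negative coefficients for reactants, positive for products).
reactions = np.array([
    [-1.0,  1.0,  0.0,  0.0],
    [ 0.0, -1.0,  1.0,  0.0],
    [ 0.0,  0.0, -1.0,  1.0],
])
target = np.array([-1.0, 0.0, 0.0, 1.0])   # overall reaction: phase 0 -> phase 3

# Least-squares weights via the Moore-Penrose pseudoinverse.
weights = np.linalg.pinv(reactions.T) @ target
print(weights)                                       # -> [1. 1. 1.]
print(np.allclose(reactions.T @ weights, target))    # True: the combination balances
```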

Our first step was to install our code on Gadi. Part of this involved removing a dependency on graph-tool, which would have had to be compiled separately. Fortunately, we were able to replace it with a Rust implementation of networkx with relatively little effort. We also collected some baseline performance measurements for our code running on CPUs alone.

Next, we identified CuPy as the simplest way to start using the GPU resources. We first attempted to use the library as a drop-in replacement for NumPy. This actually yielded substantially worse GPU performance than the CPU version, because the CPU code was written for Numba (and therefore had to avoid some useful NumPy functionality that Numba doesn't support). We rewrote the subroutine in pure CuPy, which improved performance and simplified the code.

After moving some of the work to GPUs, we noticed that the GPUs were taking much longer to complete their work than the CPUs. To better understand what was happening, we took out some of the pre- and post-processing work and parallelized it separately. We also implemented a simple scheme for splitting work between the CPUs and GPUs.

We had already parallelized this process using the Python package `ray`, which distributes workloads across multiple processors and multiple nodes in a compute cluster. As a result, one of our big challenges was to implement load distribution between the GPUs and CPUs on the Gadi nodes.
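A rough sketch of that splitting pattern, using a pseudoinverse-heavy stand-in workload and treating the CPU/GPU split fraction as a tuning knob (the real code is considerably more involved):

```python
# Rough sketch of splitting a batch of pseudoinverse tasks between a CPU
# worker and a GPU worker with ray; the split fraction is a tuning knob and
# the workload here is a stand-in for the real reaction-network step.
import numpy as np
import ray

ray.init()

@ray.remote(num_cpus=1)
def pinv_cpu(mats):
    return [np.linalg.pinv(m) for m in mats]

@ray.remote(num_gpus=1)
def pinv_gpu(mats):
    import cupy as cp
    return [cp.asnumpy(cp.linalg.pinv(cp.asarray(m))) for m in mats]

mats = [np.random.rand(40, 60) for _ in range(1000)]
gpu_share = int(0.5 * len(mats))                 # how much of the batch the GPU gets
futures = [pinv_gpu.remote(mats[:gpu_share]), pinv_cpu.remote(mats[gpu_share:])]
results = ray.get(futures[0]) + ray.get(futures[1])
```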

These changes yielded big improvements in our code's performance, but they also revealed that the core function we needed (cupy.linalg.pinv) has a very slow implementation, so we didn't get much benefit from using the GPUs (until we reimplement that function ourselves, at least!). Nevertheless, this hackathon was a great opportunity to learn the basics of Python tools for GPUs, become familiar with the Gadi cluster, and try out some GPU profiling tools. Thanks to everyone at NCI for hosting this hackathon, and especially Rui for all his help setting up our development environments on Gadi and clarifying the ray specifics.

WaveQLab3D Team

WaveQLab3D is a petascale solver for simulating seismic wave propagation in complex media and fault rupture dynamics in elastic solids, with the goal of understanding and unravelling earthquakes. It uses high-order accurate finite difference summation-by-parts operators to approximate spatial derivatives, and boundary-conforming curvilinear meshes to resolve complex faults and non-planar free-surface topography. The code is implemented in Fortran and parallelised with MPI.

From the outset we knew that the slowest part of the code was the finite difference stencil, since it needs to compute the derivative at every point in the mesh, which for large problems can be on the order of 2 billion points.

Bharat Sharma, our mentor from NVIDIA, suggested the roadmap of first porting the code to OpenACC with managed memory. This setting lets the compiler take control of moving data back and forth between the CPU and GPU. Our first day was therefore spent moving the inner finite difference stencil to the GPU and re-profiling our code with managed memory on.  We rapidly discovered that, while this is a decent intermediate approach for porting code, to really see a speed up we would have to take control of the memory management ourselves. Moving the finite difference boundary calculations also meant we had to refactor one of the largest functions in the code, since the existing version wouldn’t fit in GPU memory.

During the second day we moved on to taking over the memory management ourselves which brought on new issues to work through. We also had a plan of how to refactor the boundary function from day one. By the end of the second day, we were at the point where our two most expensive functions were running on the GPU, albeit with the refactoring still ongoing for the boundary function. We also took control of the memory transfers from CPU to GPU and vice versa, with the bulk of the data transfer placed at the beginning of code execution to minimise the overhead. The game was then to gradually move more functions to the GPU, raising the internal memory transfer further up the hierarchy of function calls to reduce CPU-GPU communication. To this end, we realised at the end of day two that we could move a third function (which we had originally ignored) to the GPU, improving the code speed further.

On the final day, we cleared up the bugs with the third GPU function and implemented asynchronous execution of parallel functions. This way the code could continue streaming data to the GPU while another function was running. Finally, we re-profiled our code and in the end, we measured roughly a 100x speed up on the most costly functions in our code, with the inner and boundary finite difference stencils coming down from roughly 500-600ms on a small grid with 4 CPUs, to 3.6ms plus around 60ms data transfer time on 4 GPUs (the CPU code has no data transfer).

We'd like to thank NCI and NVIDIA for organising the Hackathon. We’d also like to thank Bharat Sharma for his knowledge and patience. It's remarkable how quickly you can pick up a new skill when you're locked in a room for a few days, able to focus on a single project and able to access a mentor who knows their way around the code to help you debug. We will be continuing with the knowledge we gained at the Hackathon to further improve our code over the coming weeks and months.

Seis Team

“The speedup of 1000x was a surprise to us,” said Dr Sheng Wang of Team Seis, from the Research School of Earth Sciences at The Australian National University. The team focuses on cross-correlation analyses of big seismic data to address cutting-edge problems in investigating and understanding the structure and dynamics of the Earth’s and planetary interiors. “By changing only a portion of our CPU-based codes to GPU ones, and with guidance and help from mentors and resources at NCI, the cross-correlation computation time dropped from minutes to a few seconds for a prototype problem. We look forward to applying that to real problems, and we have a vision of migrating all our cross-correlation codes to GPU versions, which would definitely help not only us but every researcher like us who needs to deal with big seismic data.”

EXESS Team

EXESS is a quantum chemistry program which specialises in using high performance computing resources, especially GPUs. It has been under development for the past three years, achieving impressive results such as scaling across the entirety of the Summit supercomputer at Oak Ridge National Laboratory. The code is implemented in C++ and uses CUDA as the offloading language. The EXESS team is exploring the use of HIP as an avenue towards using AMD and Intel GPUs.

We came to this Hackathon with two intentions: 

  • Improve the performance of our RI-MP2 algorithm on A100 GPUs 
  • Implement a small kernel scheduler for increased performance 

Using an extensive set of profilers and the help of the mentors, we were able to optimise the RI-MP2 algorithm for A100 GPUs. Figure EXESS-1 shows the strong scaling on up to 8 GPUs for a system comprising 150 water molecules (450 atoms) using the cc-pVDZ/cc-pVDZ-RIFIT basis set combination:

Additionally, Figure EXESS-2 shows the TFLOP/s achieved relative to the theoretical peak of the A100 (19.5 TFLOP/s):

With 8 A100s, the EXESS program achieves 86% of the theoretical peak, reaching 134 TFLOP/s out of the 156 TFLOP/s available.

This was done through a combination of pinned and shared memory use and of multiple streams to hide memory latency.
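EXESS itself is written in C++/CUDA, but the general latency-hiding pattern mentioned above (pinned host buffers plus multiple streams so that data transfers overlap with compute) can be sketched in Python with CuPy, purely as an illustration:

```python
# Illustration only: EXESS is C++/CUDA, but the same latency-hiding pattern
# (pinned host buffers plus multiple streams so copies overlap with compute)
# looks like this in CuPy.
import numpy as np
import cupy as cp
import cupyx

n_chunks, n = 4, 2048
streams = [cp.cuda.Stream(non_blocking=True) for _ in range(n_chunks)]

# Pinned (page-locked) host buffers allow asynchronous host-to-device copies.
host_chunks = [cupyx.empty_pinned((n, n), dtype=np.float64) for _ in range(n_chunks)]
for h in host_chunks:
    h[...] = np.random.rand(n, n)

results = []
for h, s in zip(host_chunks, streams):
    with s:                         # work below is queued on this stream
        d = cp.asarray(h)           # copy from the pinned buffer on this stream
        results.append(d @ d)       # compute follows the copy, overlapping other streams

for s in streams:
    s.synchronize()
```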

The small kernel scheduler implementation turned out to introduce a couple of segmentation faults and couldn’t be completed in time during the hackathon.


Caleb Team


Team Caleb is a joint team from the UNSW Water Research Centre and the University of Melbourne; we are researchers in the field of statistical hydrology. We had an existing algorithm for the stochastic generation of daily and sub-daily rainfall at any site across Australia. The existing algorithm was written predominantly in Python with some elements in Fortran. As the algorithm is used to generate ensembles of simulations, the original aim was to parallelise it across GPU cores. However, after profiling the code and parallelising it for CPUs, we realised that the current form of the code was unsuited to distribution across GPU cores owing to the heavy logic involved. As such, we began the process of refactoring the code and vectorising components of it to be sent to the GPU for processing. Initial results showed speedups of 10x up to 100x! Our understanding of how to write code that can efficiently utilise the processing power available from GPUs has been greatly enhanced, and we look forward to continuing the journey of GPU programming.
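As a generic sketch of that kind of refactor (the real rainfall generator has far more state and logic than this toy example, and the parameters below are invented), a per-site, per-day Python loop can be replaced by whole-array operations that run on the GPU via CuPy:

```python
# Generic sketch of the vectorisation step described above; the real rainfall
# generator is far more involved than this toy wet-day/depth model.
import numpy as np
import cupy as cp

rng = np.random.default_rng(0)
n_sites, n_days = 1000, 365
wet_prob = rng.uniform(0.1, 0.5, size=n_sites)      # per-site chance of a wet day
mean_depth = rng.uniform(2.0, 10.0, size=n_sites)   # per-site mean rainfall depth (mm)

def simulate_loop():
    # Original style: explicit Python loops over sites and days (slow, heavy logic).
    out = np.zeros((n_sites, n_days))
    for i in range(n_sites):
        for d in range(n_days):
            if rng.random() < wet_prob[i]:
                out[i, d] = rng.exponential(mean_depth[i])
    return out

def simulate_vectorised():
    # Refactored style: whole-array operations, executed on the GPU with CuPy.
    wet = cp.random.random((n_sites, n_days)) < cp.asarray(wet_prob)[:, None]
    depth = cp.random.exponential(1.0, (n_sites, n_days)) * cp.asarray(mean_depth)[:, None]
    return cp.asnumpy(cp.where(wet, depth, 0.0))
```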




