This Parallel Python workshop teaches cutting-edge techniques for working with big data and processing it in parallel using Python. It is suitable for all participants who want to enhance their data science capabilities.
Python is one of the most widely used programming languages worldwide, with applications in almost every data-oriented domain. The Python data science ecosystem is a rich platform for scaling up workflows, enhancing scientific research and improving insight. However, Python's performance can become a limiting factor when datasets are large or computations are demanding. Parallel computing and efficient data handling can overcome this barrier, enhancing research throughput.
Parallel Python is also available online on the NCI Teachable website.
Prerequisites
Basic experience with Python is required.
Some grasp of array processing with NumPy would be helpful but is not required as we will do a brief refresher during the course.
The training session runs on the NCI Open OnDemand (OOD) service. Attendees are encouraged to review the following page for background information: Open OnDemand (OOD) Service
Objectives
The training is designed to be a first parallel programming course for scientists. As such, it aims to help attendees:
Understand array programming with NumPy
Work with large and possibly heterogeneous data using xarray
Perform parallel computation using Dask
Learning Outcomes
At the completion of this training session, you will be able to:
Use vectorized computation with NumPy
Load, annotate and work with data using xarray
Serialise large datasets to file using xarray
Load data from the cloud using OPeNDAP and xarray
Parallelise common workflows and arbitrary code using Dask
Combine Dask and xarray for big data processing
Combine Dask and GPUs for maximum data throughput
Feel confident in your data science skills to tackle your own problems
Topics Covered
Array programming in NumPy
Array data structures and hierarchies
Loading and saving data efficiently to disk
Cloud-native computing
Parallel computing with Dask
Combining Python packages for enhanced functionality
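
As a small taste of the first topic, the sketch below contrasts an explicit Python loop with a vectorized NumPy call for the same computation. It is an illustrative example only, not taken from the course material; the function names and the sample data are hypothetical.

```python
import numpy as np


def sum_of_squares_loop(values):
    """Sum of squares using an explicit Python loop (one element at a time)."""
    total = 0.0
    for v in values:
        total += v * v
    return total


def sum_of_squares_vectorized(values):
    """Sum of squares using a single NumPy call that works on the whole array."""
    arr = np.asarray(values, dtype=np.float64)
    return float(np.dot(arr, arr))


# Hypothetical sample data: the numbers 0.0, 1.0, ..., 999.0
data = np.arange(1_000, dtype=np.float64)

loop_result = sum_of_squares_loop(data)
vec_result = sum_of_squares_vectorized(data)

print(loop_result, vec_result)
```

Both functions return the same value. The vectorized version hands the loop to NumPy's compiled routines, which is typically much faster on large arrays.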