The Dask-themed notebook tutorials demonstrate how to use Dask on data collections hosted at the NCI, as well as on data extracted from external databases.
Notebook availability
NCI filesystem path: /g/data/dk92/notebooks/examples-dask
GitHub: https://github.com/NCI-data-analysis-platform/examples-dask
To preview these notebooks: https://nbviewer.jupyter.org/github/NCI-data-analysis-platform/examples-dask/tree/main/
filename | description | dataset | data project to join |
---|---|---|---|
Dask_01_basics.ipynb | Dask lazy loading; ProgressBar; reduction | none | none |
Dask_02_data_chunks_CMIP6.ipynb | Dask array basics; NetCDF chunks vs dask chunks; chunking practices | ESGF CMIP6 Replication Data | oi10 |
Dask_03_fundamentals_Delayed.ipynb | Dask.delayed feature; parallelise a for loop | none | none |
Dask_04_delayed_pandas_palioceanography.ipynb | Parallelise sequential code using Dask delayed | CSV files downloaded from a Nature Geoscience paper | none |
Dask_05_dataframes_ACTweather.ipynb | Read ACT weather data into a Dask DataFrame; save to Parquet for better performance; compare dask.dataframe with pandas | weather data downloaded from the BoM website | none |
Dask_06_schedulers_ACTweather.ipynb | Introduce Dask schedulers; apply scheduler options to weather station data | weather data downloaded from the BoM website | none |
Dask_07_numpy_temperature.ipynb | Introduce Dask.array chunks; parallelise code; performance comparison with real data examples | Australian temperature data provided by the BoM | none |
Dask_08_xarray_CMIP6.ipynb | Use standard xarray operations on Dask Array; persist data into memory to speed up I/O; customise workflows and automatic parallelisation | | fs38 |
Dask_09_xarray_precipitation.ipynb | Calculate the intra-ensemble range of mean daily temperature and average seasonal precipitation in Australia using historical precipitation data from the CESM2 model within CMIP6 | | fs38 |
Dask_10_interactive_visualisation_CMIP6.ipynb | Calculate time and zonal mean of the temperature of CMIP6 GFDL models and interactively visualise data | | oi10 |
Dask_11_diagnositc_tools.ipynb | Introduce a few diagnostic tools such as visualising task graphs, local and distributed diagnostics tools | | fs38 |
Dask_12_intensive_calculation_cmip6.ipynb | Explore some of the Coupled Model Intercomparison Project Phase 6 (CMIP6) replication data to demonstrate how Dask handles expensive calculations | | oi10 |
Dask_13_intensive_calculation_eReef.ipynb | Calculate sea level variability using near-real-time and hindcast hydrodynamic models for the Great Barrier Reef | eReefs | fx3 |
Dask_14_distributed_dataframes_geochem.ipynb | Persist common intermediate results in memory and use indices to improve calculation efficiency | OZCHEM - Geoscience Australia's national whole-rock geochemical dataset | dk92 |
Dask_15_distributed_advanced.ipynb | Introduce distributed futures; persist data into memory; asynchronous computation; debugging approaches; how to choose the number of Dask workers | none | none |
Dask_16_memory_compute_management.ipynb | Strategies for managing larger-than-memory data using partitioning; saving data to disk; freeing RAM; executing in the background | | oi10 |
Dask_17_bag.ipynb | Parse JSON objects as dictionaries and apply map, filter and groupby functions | JSON files | none |
Dask_18_machine_learning.ipynb | Distributed training; training on larger-than-memory datasets | none | none |
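
As a rough illustration of the ideas the first few notebooks cover (lazy evaluation, `dask.delayed`, and chunked arrays), here is a minimal sketch. It is not taken from the notebooks. The function names `square`, `lazy_sum_of_squares` and `chunked_mean` and the toy input data are hypothetical, and the sketch assumes `dask` and `numpy` are installed.

```python
import dask
import dask.array as da
import numpy as np


@dask.delayed
def square(x):
    # Hypothetical toy task; it runs only when the graph is computed.
    return x * x


def lazy_sum_of_squares(values):
    # Build a task graph of independent square() calls plus one final sum.
    # Nothing is executed here; a Delayed object is returned.
    tasks = [square(v) for v in values]
    return dask.delayed(sum)(tasks)


def chunked_mean(data, chunk_size):
    # Wrap an in-memory NumPy array in a Dask array split into chunks,
    # then define a lazy reduction over those chunks.
    arr = da.from_array(data, chunks=chunk_size)
    return arr.mean()


total = lazy_sum_of_squares(range(5))  # still lazy
mean = chunked_mean(np.arange(1000, dtype="float64"), chunk_size=100)  # still lazy

# compute() triggers execution of each task graph.
total_value = total.compute()
mean_value = mean.compute()
print(total_value, mean_value)
```

Building the graph first and calling `compute()` last lets Dask schedule the independent `square` calls and the per-chunk reductions in parallel. With the real datasets listed above, the chunk size would normally be chosen with the on-disk NetCDF chunking in mind rather than set arbitrarily as it is here.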