Abstract

The NCI-Weatherbench dataset supports the development of medium-range weather forecasting models. It is a modified version of the original WeatherBench dataset, as described below.

AI/ML-based approaches, particularly deep learning models, have become powerful prediction tools in the last few years. The problem can be stated as predicting global weather patterns for the coming days from weather patterns observed and stored over previous years or decades. Recently, there has been an explosion of deep learning-based weather prediction models. One hallmark of these models is that they require a huge amount of training data. A deep learning model typically goes through several test and evaluation phases, which can be costly in terms of processing time and disk space. A cost-saving measure is to test and evaluate the model on a low-resolution dataset and then train the final model on high-resolution data. However, there is no common dataset that provides data at several resolutions and can be used to evaluate and intercompare models. Here we present an ML/AI dataset for medium-range weather forecasting.

This project is based on the Pangeo WeatherBench project, which creates a machine learning benchmark from the ERA5 reanalysis dataset, and extends that dataset significantly. ERA5 provides several dozen parameters at 37 pressure levels, of which WeatherBench uses a subset. WeatherBench is based on a paper (https://arxiv.org/pdf/2002.00469.pdf) that discusses how to transform ERA5 into a machine-learning-compatible dataset. Based on experiments, the project identifies 15 parameters at 13 pressure levels and five constants as the most impactful for machine learning. ERA5 contains data only at 0.25° resolution (721×1440 grid points), whereas WeatherBench processes the data into three resolutions: 5.625° (32×64 grid points), 2.8125° (64×128 grid points) and 1.40625° (128×256 grid points).
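The grid sizes quoted above follow directly from dividing the globe by the grid spacing. The short sketch below reproduces that arithmetic; the function name `grid_shape` is only illustrative and is not part of WeatherBench. It assumes a regular grid whose latitude rows do not include the poles, which is why it applies to the three coarse grids but not to the native ERA5 grid (ERA5 includes both poles and so has 721 rather than 720 latitude rows).

```python
def grid_shape(resolution_deg: float) -> tuple[int, int]:
    """Return (n_lat, n_lon) for a regular global grid with the given spacing.

    Assumes latitude rows sit at cell centres, so the poles are not included.
    """
    n_lat = round(180 / resolution_deg)
    n_lon = round(360 / resolution_deg)
    return n_lat, n_lon


for res in (5.625, 2.8125, 1.40625):
    n_lat, n_lon = grid_shape(res)
    print(f"{res} deg -> {n_lat} x {n_lon} = {n_lat * n_lon} grid points")
```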

For NCI-Weatherbench, the data is derived from the NCI ERA5 archive and processed to facilitate ML/AI training at low cost. The dataset follows the Pangeo WeatherBench specification but extends it by a significant margin: the original Pangeo dataset contains 40 years of data (1979-2018), whereas this benchmark contains 64 years (1959-2022), giving researchers a larger selection of data. Because the dataset is already stored on NCI disk, no data download is needed. The dataset contains three resolution subsets (1.40625°, 2.8125°, and 5.625°), each containing 14 variables (total_precipitation, toa_incident_solar_radiation, geopotential, vorticity, 10m_v_component_of_wind, v_component_of_wind, 10m_u_component_of_wind, u_component_of_wind, total_cloud_cover, temperature, relative_humidity, specific_humidity, potential_vorticity, 2m_temperature). We hope that NCI-Weatherbench will greatly facilitate deep learning-based weather forecasting research among our Australian peers.

Objectives

  1. To create a software ecosystem based on NetCDF, Xarray, and Dask that builds the Weatherbench dataset from the NCI ERA5 archive (/g/data/rt52) according to a standard specification.
  2. To extend the time span of the dataset. The original WeatherBench has 40 years of data (1979 to 2018), whereas NCI-Weatherbench contains more than 60 years (1959 to 2022), giving researchers more data to choose from.
  3. To reduce the service units required for regridding, the process of interpolating data from one grid resolution to another. We have used several strategies, including memory optimization, weight reuse, and parallelization (both processes and threads).
  4. To evaluate different compute node types and configurations. Results show that service unit consumption can be reduced by up to 57.89% compared with the unoptimized configuration.
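To illustrate the weight reuse mentioned in objective 3, the following is a minimal NumPy sketch, not the project's actual implementation. It assumes a rectilinear latitude/longitude grid, on which bilinear interpolation separates into two 1-D linear interpolations. The helper `linear_weights`, the target grid coordinates, and the synthetic input are all assumptions made for illustration; longitude wrap-around and pole handling are ignored. The weight matrices are built once and then applied to every time step.

```python
import numpy as np


def linear_weights(src: np.ndarray, dst: np.ndarray) -> np.ndarray:
    """Build a (len(dst), len(src)) matrix of 1-D linear interpolation weights.

    `src` must be strictly increasing; `dst` points are clipped to its range.
    """
    dst = np.clip(dst, src[0], src[-1])
    hi = np.clip(np.searchsorted(src, dst), 1, len(src) - 1)
    lo = hi - 1
    frac = (dst - src[lo]) / (src[hi] - src[lo])
    weights = np.zeros((len(dst), len(src)))
    rows = np.arange(len(dst))
    weights[rows, lo] = 1.0 - frac
    weights[rows, hi] = frac
    return weights


# Source grid at 0.25 degrees (721 x 1440), as in ERA5.
src_lat = np.linspace(-90.0, 90.0, 721)
src_lon = np.arange(1440) * 0.25

# Assumed target grid at 5.625 degrees (32 x 64), latitudes at cell centres.
res = 5.625
dst_lat = np.arange(-90.0 + res / 2, 90.0, res)
dst_lon = np.arange(0.0, 360.0, res)

# Compute the weights once ...
w_lat = linear_weights(src_lat, dst_lat)  # shape (32, 721)
w_lon = linear_weights(src_lon, dst_lon)  # shape (64, 1440)

# ... and reuse them for every time step (synthetic data, 3 steps).
fields = np.stack([
    step + 2.0 * src_lat[:, None] + 3.0 * src_lon[None, :] for step in range(3)
])  # shape (3, 721, 1440)
regridded = w_lat @ fields @ w_lon.T  # shape (3, 32, 64)
print(regridded.shape)
```

Because the weights depend only on the two grids, the expensive setup is paid once per resolution rather than once per file. Production regridding tools usually store such weights as sparse matrices; dense matrices are used here only to keep the sketch short.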

Data

Data is located at: /g/data/wb00/NCI-Weatherbench

The following diagram shows the folder structure for the three resolutions:

NCI-Weatherbench
├── NCI-Weatherbench
│   ├── 1.40625deg
│   │   ├── 10m_u_component_of_wind
│   │   ├── 10m_v_component_of_wind
│   │   ├── 2m_temperature
│   │   ├── geopotential
│   │   ├── potential_vorticity
│   │   ├── relative_humidity
│   │   ├── specific_humidity
│   │   ├── temperature
│   │   ├── toa_incident_solar_radiation
│   │   ├── total_cloud_cover
│   │   ├── total_precipitation
│   │   ├── u_component_of_wind
│   │   ├── v_component_of_wind
│   │   └── vorticity
│   ├── 2.8125deg
│   │   ├── 10m_u_component_of_wind
│   │   ├── 10m_v_component_of_wind
│   │   ├── 2m_temperature
│   │   ├── geopotential
│   │   ├── potential_vorticity
│   │   ├── relative_humidity
│   │   ├── specific_humidity
│   │   ├── temperature
│   │   ├── toa_incident_solar_radiation
│   │   ├── total_cloud_cover
│   │   ├── total_precipitation
│   │   ├── u_component_of_wind
│   │   ├── v_component_of_wind
│   │   └── vorticity
│   ├── 5.625deg
│   │   ├── 10m_u_component_of_wind
│   │   ├── 10m_v_component_of_wind
│   │   ├── 2m_temperature
│   │   ├── constants.nc
│   │   ├── geopotential
│   │   ├── potential_vorticity
│   │   ├── relative_humidity
│   │   ├── specific_humidity
│   │   ├── temperature
│   │   ├── toa_incident_solar_radiation
│   │   ├── total_cloud_cover
│   │   ├── total_precipitation
│   │   ├── u_component_of_wind
│   │   ├── v_component_of_wind
│   │   └── vorticity
│   └── license.txt

Inside the subfolders, one can find the netCDF4 files with the processed data. 
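As a rough guide to locating individual files, the sketch below builds the path of one yearly file. It assumes the published files follow the same `<variable>_<year>_<resolution>deg.nc` naming that appears in the processing log further down this page, and that the root directory matches the tree shown above. Both are assumptions that should be checked against the actual directory listing, and `ROOT` and `yearly_file` are hypothetical names.

```python
from pathlib import Path

# Assumed root, combining the data location with the tree shown above.
ROOT = Path("/g/data/wb00/NCI-Weatherbench/NCI-Weatherbench")


def yearly_file(variable: str, year: int, resolution: str, root: Path = ROOT) -> Path:
    """Return the assumed path of one yearly netCDF file.

    `resolution` is passed as a string such as "5.625" so that it is used
    verbatim in the folder and file names.
    """
    folder = root / f"{resolution}deg" / variable
    return folder / f"{variable}_{year}_{resolution}deg.nc"


paths = [yearly_file("temperature", year, "5.625") for year in range(1959, 2023)]
print(paths[0])
```

A list of paths like this can be passed to `xarray.open_mfdataset` to open several years as one dataset.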

Processing cost 

The following output shows an example of the data creation process. The top box shows information common to all ranks: here the variable `t` is being processed at 13 pressure levels, the output resolution is 5.625 degrees, and all data is read from /g/data/rt52/era5. The bottom box shows rank-specific information: rank 0 processes the year 1979 and takes about two hours to process all the NCI ERA5 files for that year. The original data is 386GB at 0.25 degrees, while the processed data is a single 890MB file. The code can be run in parallel using MPI, with each rank processing one year of ERA5 data. When run in parallel, the 64 years of data (1959-2022) are processed by 64 ranks at once, which shortens the overall processing time.

Processing data
╔════════════════════════════════════════════════════════════════════════════════════════════════════════════════════════╗
║ ╔════════════════════════════════════════════════════════════════════════════════════════════════════════════════════╗ ║
║ ║                                                    Common Param                                                    ║ ║
║ ║           ERA_param ▶ t                                                                                            ║ ║
║ ║             out_deg ▶ 5.625000                                                                                     ║ ║
║ ║           algorithm ▶ bilinear                                                                                     ║ ║
║ ║      pressure_level ▶ [50, 100, 150, 200, 250, 300, 400, 500, 600, 700, 850, 925, 1000]                            ║ ║
║ ║       in_dir_prefix ▶ /g/data/rt52/era5/pressure-levels/reanalysis/t                                               ║ ║
║ ║     out_file_prefix ▶ /g/data/wb00/admin/testing/NCI_weatherbench/5.625deg/temperature/temperature                 ║ ║
║ ╚════════════════════════════════════════════════════════════════════════════════════════════════════════════════════╝ ║
╚════════════════════════════════════════════════════════════════════════════════════════════════════════════════════════╝

┌────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────┐
┊                                                Thu Nov 30 17:25:53 2023                                                ┊
┊        rank ▻ 0                                                                                                        ┊
┊        node ▻ gadi-cpu-clx-0031.gadi.nci.org.au                                                                        ┊
┊        year ▻ 1979                                                                                                     ┊
┊      in_dir ▻ /g/data/rt52/era5/pressure-levels/reanalysis/t/1979/*                                                    ┊
┊    out_file ▻ /g/data/wb00/admin/testing/NCI_weatherbench/5.625deg/temperature/temperature_1979_5.625deg.nc            ┊
┊  Dimension: ▻ {'longitude': 1440, 'latitude': 721, 'level': 37, 'time': 8760}                                          ┊
┊  Variables: ▻ ['t']                                                                                                    ┊
└────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────┘

[Thu 30 19:23:23]  End rank: 0,   Out file: /g/data/wb00/admin/testing/NCI_weatherbench/5.625deg/temperature/temperature_1979_5.625deg.nc    ( 117m, 29s)
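The one-year-per-rank layout described above can be sketched as a simple mapping from MPI rank to year. This is an illustrative sketch rather than the project's code: the function `years_for_rank` is hypothetical, and it uses a round-robin assignment so that it also works when fewer ranks than years are available. With 64 ranks over 1959-2022 it reduces to exactly one year per rank. In the log above, rank 0 handles 1979, which corresponds to `first_year=1979`.

```python
def years_for_rank(rank: int, size: int,
                   first_year: int = 1959, last_year: int = 2022) -> list[int]:
    """Years assigned to `rank` when `size` ranks share the range round-robin."""
    if not 0 <= rank < size:
        raise ValueError("rank must satisfy 0 <= rank < size")
    return list(range(first_year + rank, last_year + 1, size))


# With mpi4py, rank and size would typically come from the communicator:
#   from mpi4py import MPI
#   comm = MPI.COMM_WORLD
#   rank, size = comm.Get_rank(), comm.Get_size()
rank, size = 0, 64  # stand-in values so the sketch runs without MPI

for year in years_for_rank(rank, size):
    print(f"rank {rank} would process year {year}")
```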


We also performed experiments with the processing cost and resource allocation. Results show that the service unit consumption varies with processor types and memory allocations. Thus, it is possible to change the resource parameters to obtain better efficiency.

Queue                     Normal     Normal-Memory optimized  Mega mem   Sapphire Rapids  Huge mem
Service Units             2962.93    2197.44                  1810.73    1377.19          1247.44
Improvement               ---        25.83%                   38.88%     53.51%           57.89%
NCPUs Used                480        288                      48         208              48
CPU Time Used (hh:mm:ss)  73:33:12   70:39:43                 126:10:56  69:49:11         84:25:48
Memory Used               1.31TB     776.16GB                 2.68TB     845.14GB         930.25GB
Walltime Used (hh:mm:ss)  03:05:11   03:48:54                 07:32:41   03:18:38         08:39:46
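The improvement figures in the table above are relative savings in service units against the Normal queue run. The short sketch below recomputes them from the listed service units. The values are copied from the table, and the function name `improvement_percent` is illustrative. The recomputed percentages agree with the listed ones to within rounding in the last digit.

```python
# Service units per queue, copied from the table above.
service_units = {
    "Normal": 2962.93,
    "Normal-Memory optimized": 2197.44,
    "Mega mem": 1810.73,
    "Sapphire Rapids": 1377.19,
    "Huge mem": 1247.44,
}


def improvement_percent(baseline: float, value: float) -> float:
    """Relative saving of `value` against `baseline`, in percent."""
    return 100.0 * (baseline - value) / baseline


baseline = service_units["Normal"]
improvements = {
    queue: improvement_percent(baseline, su) for queue, su in service_units.items()
}
for queue, pct in improvements.items():
    print(f"{queue:<24} {pct:6.2f}%")
```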


Comparing NCI-Weatherbench with the Pangeo dataset

As mentioned before, NCI-Weatherbench spans 64 years, whereas the original Pangeo dataset has 40 years' worth of data. Figure 1 below compares the mean and standard deviation of the two datasets. The original Pangeo dataset is shown in blue and NCI-Weatherbench in green. For the years covered by both datasets, the means and standard deviations are very close to each other; for the remaining years, only NCI-Weatherbench data is shown.
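A comparison of this kind can be sketched as follows. This is not the code used to produce the figure: the data is synthetic, the function names are hypothetical, and the statistics are plain unweighted means over the grid, whereas an area-weighted mean is often preferred on a latitude/longitude grid.

```python
import numpy as np


def yearly_stats(fields_by_year: dict) -> dict:
    """Map each year to the (mean, standard deviation) of its field."""
    return {
        year: (float(np.mean(field)), float(np.std(field)))
        for year, field in fields_by_year.items()
    }


def compare_common_years(stats_a: dict, stats_b: dict) -> dict:
    """Absolute differences in mean and std for the years present in both."""
    common = sorted(set(stats_a) & set(stats_b))
    return {
        year: (abs(stats_a[year][0] - stats_b[year][0]),
               abs(stats_a[year][1] - stats_b[year][1]))
        for year in common
    }


# Synthetic stand-ins: 64 years for one dataset, 40 years for the other.
rng = np.random.default_rng(0)
nci = {year: rng.normal(280.0, 10.0, size=(32, 64)) for year in range(1959, 2023)}
pangeo = {year: nci[year] + rng.normal(0.0, 0.01, size=(32, 64))
          for year in range(1979, 2019)}

differences = compare_common_years(yearly_stats(nci), yearly_stats(pangeo))
print(len(differences), "common years")
```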


 


Next, Figure 2 compares the land-sea mask from the two datasets. Pangeo WeatherBench is shown on the left and NCI-Weatherbench on the right. Both processed datasets look quite similar.

A Deep Learning use case: ClimaX model

NCI-Weatherbench can be used for deep learning model training and inference. Figure 3 shows an example of inference with the ClimaX model. NCI-Weatherbench is used as the ground truth and shown on the left, while the predictions from the ClimaX model are shown on the right. It is also possible to train any model from scratch with the NCI-Weatherbench data.


Figure 3. ClimaX weather forecast model on NCI-Weatherbench