You can load the following module to run this example.
module use /g/data/up99/modulefiles
module load NCI-geophys/23.04
In a Jupyter notebook or a Python script, start a SparkSession.
from intake_spark.base import SparkHolder
h = SparkHolder(True, [('catalog', )], {})
h.setup()
session = h.session[0]
session.conf.set("spark.sql.caseSensitive", "true")
Load the Geophysics dataset catalog file for project iv65.
import intake
catalog = intake.open_catalog("/g/data/dk92/catalog/yml/geophysics-iv65.yml")
df = catalog.mydata.to_spark()
There are 5 columns in the Spark DataFrame.
print(df.columns)
OUT:
['attributes', 'dimensions', 'file_uri', 'format', 'variables']
The attributes column is a struct that contains the following sub-columns.
df.select("attributes").printSchema()
OUT:
root
 |-- attributes: struct (nullable = true)
 |    |-- Conventions: string (nullable = true)
 |    |-- CreationMethod: string (nullable = true)
 |    |-- CreationTime: string (nullable = true)
 |    |-- GDAL: string (nullable = true)
 |    |-- GDAL_COLORSPACE: string (nullable = true)
 |    |-- GDAL_COMPRESSION_RATE_TARGET: long (nullable = true)
 |    |-- GDAL_VERSION: long (nullable = true)
 |    |-- IntrepidSourceDataset: string (nullable = true)
 |    |-- NCO: string (nullable = true)
 |    |-- cdm_data_type: string (nullable = true)
 |    |-- date_created: string (nullable = true)
 |    |-- date_modified: string (nullable = true)
 |    |-- doi: string (nullable = true)
 |    |-- ecat_id: string (nullable = true)
 |    |-- geospatial_bounds: string (nullable = true)
 |    |-- geospatial_bounds_crs: string (nullable = true)
 |    |-- geospatial_lat_max: double (nullable = true)
 |    |-- geospatial_lat_min: double (nullable = true)
 |    |-- geospatial_lat_resolution: string (nullable = true)
 |    |-- geospatial_lat_units: string (nullable = true)
 |    |-- geospatial_lon_max: double (nullable = true)
 |    |-- geospatial_lon_min: double (nullable = true)
 |    |-- geospatial_lon_resolution: string (nullable = true)
 |    |-- geospatial_lon_units: string (nullable = true)
 |    |-- history: string (nullable = true)
 |    |-- institution: string (nullable = true)
 |    |-- keywords: string (nullable = true)
 |    |-- licence: string (nullable = true)
 |    |-- license: string (nullable = true)
 |    |-- location_accuracy_max: double (nullable = true)
 |    |-- location_accuracy_min: double (nullable = true)
 |    |-- median_sample_spacing_m: double (nullable = true)
 |    |-- metadata_link: string (nullable = true)
 |    |-- nominal_pixel_size_lat_degrees: double (nullable = true)
 |    |-- nominal_pixel_size_lon_degrees: double (nullable = true)
 |    |-- nominal_pixel_size_x_metres: double (nullable = true)
 |    |-- nominal_pixel_size_y_metres: double (nullable = true)
 |    |-- pixel_count: long (nullable = true)
 |    |-- product_version: string (nullable = true)
 |    |-- source: string (nullable = true)
 |    |-- summary: string (nullable = true)
 |    |-- survey_id: string (nullable = true)
 |    |-- time_coverage_duration: string (nullable = true)
 |    |-- time_coverage_end: string (nullable = true)
 |    |-- time_coverage_start: string (nullable = true)
 |    |-- title: string (nullable = true)
 |    |-- uuid: string (nullable = true)
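Sub-columns of the attributes struct are addressed in select() with dotted paths such as attributes.survey_id. The following is a minimal sketch of a helper that builds such paths from a list of field names; struct_fields is a hypothetical name introduced here for illustration and is not part of Spark or intake.

```python
def struct_fields(parent, names):
    """Build dotted column paths such as 'attributes.survey_id'."""
    return [f"{parent}.{name}" for name in names]


# The four bounding-box fields that appear in the schema of this catalog.
bounds = struct_fields(
    "attributes",
    [
        "geospatial_lat_min",
        "geospatial_lat_max",
        "geospatial_lon_min",
        "geospatial_lon_max",
    ],
)
print(bounds)
```

The resulting list can be passed directly to df.select(bounds) in place of a hand-typed list of column paths.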
View the geospatial range.
df.select(["attributes.geospatial_lat_min","attributes.geospatial_lat_max","attributes.geospatial_lon_min","attributes.geospatial_lon_max"]).distinct().show()
OUT:
+------------------+------------------+------------------+------------------+
|geospatial_lat_min|geospatial_lat_max|geospatial_lon_min|geospatial_lon_max|
+------------------+------------------+------------------+------------------+
|         -27.02483|         -25.98154|          121.4743|          123.0109|
|         -26.06157|         -23.30061|           143.668|          146.0421|
|         -14.05401|         -11.46848|          140.7658|          143.0429|
|          -37.7517|         -37.38455|          147.7865|          148.9927|
|         -35.05652|         -33.94444|          149.9239|          151.2641|
|         -20.85701|         -20.66078|          139.9486|          140.1133|
|         -23.37588|         -20.99958|          142.9629|           145.799|
|         -36.51196|         -35.91739|          147.3298|          148.1426|
|         -32.46287|         -31.50878|          146.5587|          147.0497|
|          -16.0142|         -14.98587|          124.4862|          126.0145|
|         -22.10119|          -19.9812|          123.4392|          126.0165|
|         -31.04056|         -29.46312|          139.4585|          141.0376|
|          -27.0204|         -24.82354|          148.2168|          151.9096|
|         -27.10398|         -24.98777|           151.465|          153.0789|
|         -34.78688|         -32.49003|          140.9833|          143.6984|
|         -28.96349|         -27.78879|          150.7739|          152.3096|
|         -21.51489|         -19.95301|          141.2282|          144.6018|
|         -29.00867|         -26.99114|          139.4891|          141.0085|
|         -37.16151|         -36.81886|          142.5527|          142.9063|
|         -38.01049|         -37.33214|          143.3085|          144.0336|
+------------------+------------------+------------------+------------------+
only showing top 20 rows
Filter by region, keeping only datasets whose bounding box lies within latitude -15 to -10 and longitude 141 to 150.
# The bounding-box fields are doubles, so compare them with numeric literals.
filter_df = df.filter(
    (df.attributes.geospatial_lat_min > -15)
    & (df.attributes.geospatial_lat_max < -10)
    & (df.attributes.geospatial_lon_min > 141)
    & (df.attributes.geospatial_lon_max < 150)
)
filter_df.select(["attributes.geospatial_lat_min","attributes.geospatial_lat_max","attributes.geospatial_lon_min","attributes.geospatial_lon_max"]).distinct().show()
OUT:
+------------------+------------------+------------------+------------------+
|geospatial_lat_min|geospatial_lat_max|geospatial_lon_min|geospatial_lon_max|
+------------------+------------------+------------------+------------------+
|         -13.08931|          -10.8644|          141.4076|          143.5503|
|         -13.06042|         -11.58125|          141.3896|          143.5104|
|         -14.12344|         -12.87207|          141.1922|          143.8161|
|         -14.12344|         -12.87207|          141.1922|          143.9164|
|         -14.12343|         -12.87207|          141.1922|          143.8161|
|         -13.08931|         -11.92395|          141.2204|          143.9286|
|         -11.83098|         -11.53919|          142.4397|          142.7641|
|         -10.59019|         -10.58469|          142.2253|          142.2978|
|         -13.67501|         -11.98582|          141.9298|          143.0608|
|         -12.66251|         -12.37281|          142.8504|          143.0599|
+------------------+------------------+------------------+------------------+
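The four comparisons in this filter keep a dataset only when its whole bounding box lies strictly inside the query region; a dataset that merely overlaps the region edge is dropped. The plain-Python sketch below illustrates the difference on two rows taken from the outputs above. BBox, contained_in and overlaps are hypothetical names used only for this illustration, and the overlap test is an alternative shown for comparison, not something the original filter does.

```python
from typing import NamedTuple


class BBox(NamedTuple):
    lat_min: float
    lat_max: float
    lon_min: float
    lon_max: float


def contained_in(box: BBox, region: BBox) -> bool:
    """True if box lies strictly inside region (same logic as the Spark filter)."""
    return (
        box.lat_min > region.lat_min
        and box.lat_max < region.lat_max
        and box.lon_min > region.lon_min
        and box.lon_max < region.lon_max
    )


def overlaps(box: BBox, region: BBox) -> bool:
    """True if box and region share any area (a looser, alternative test)."""
    return (
        box.lat_min < region.lat_max
        and box.lat_max > region.lat_min
        and box.lon_min < region.lon_max
        and box.lon_max > region.lon_min
    )


region = BBox(-15.0, -10.0, 141.0, 150.0)
inside = BBox(-13.08931, -10.8644, 141.4076, 143.5503)    # row kept by the filter
partial = BBox(-14.05401, -11.46848, 140.7658, 143.0429)  # lon_min is west of 141

print(contained_in(inside, region))   # True
print(contained_in(partial, region))  # False
print(overlaps(partial, region))      # True
```

If datasets that only partly cover the region are also of interest, the comparisons in the Spark filter could be rearranged along the lines of overlaps.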
Check the column values of the filtered DataFrame.
filter_df.select(["file_uri","attributes.survey_id","attributes.cdm_data_type"]).distinct().show(truncate=False)
OUT:
+-----------------------------------------------------------------------------------------------------------------------------------------------+---------+-------------+
|file_uri                                                                                                                                       |survey_id|cdm_data_type|
+-----------------------------------------------------------------------------------------------------------------------------------------------+---------+-------------+
|/g/data/iv65/Geoscience_Australia_Geophysics_Reference_Data_Collection/airborne_geophysics/GA/grid/P520/P520-grid-dem_geoid.nc                 |null     |null         |
|/g/data/iv65/Geoscience_Australia_Geophysics_Reference_Data_Collection/airborne_geophysics/QLD/line/P1326/P1326-line-elevation.nc              |0        |null         |
|/g/data/iv65/Geoscience_Australia_Geophysics_Reference_Data_Collection/airborne_geophysics/GA/line/P521/P521-line-radiometric-AWAGS_RAD_2015.nc|521      |null         |
|/g/data/iv65/Geoscience_Australia_Geophysics_Reference_Data_Collection/airborne_geophysics/GA/line/P521/P521-line-magnetic.nc                  |521      |null         |
|/g/data/iv65/Geoscience_Australia_Geophysics_Reference_Data_Collection/airborne_geophysics/GA/line/P520/P520-line-magnetic.nc                  |520      |null         |
|/g/data/iv65/Geoscience_Australia_Geophysics_Reference_Data_Collection/airborne_geophysics/GA/line/P521/P521-line-magnetic-AWAGS_MAG_2010.nc   |521      |null         |
|/g/data/iv65/Geoscience_Australia_Geophysics_Reference_Data_Collection/airborne_geophysics/GA/line/P520/P520-line-radiometric.nc               |520      |null         |
|/g/data/iv65/Geoscience_Australia_Geophysics_Reference_Data_Collection/airborne_geophysics/GA/line/P520/P520-line-radiometric-AWAGS_RAD_2015.nc|520      |null         |
|/g/data/iv65/Geoscience_Australia_Geophysics_Reference_Data_Collection/airborne_geophysics/GA/line/P521/P521-line-radiometric.nc               |521      |null         |
|/g/data/iv65/Geoscience_Australia_Geophysics_Reference_Data_Collection/airborne_geophysics/GA/line/P520/P520-line-magnetic-AWAGS_MAG_2010.nc   |520      |null         |
|/g/data/iv65/Geoscience_Australia_Geophysics_Reference_Data_Collection/ground_gravity/QLD/point/P196531/P196531-point-gravity.nc               |196531   |Point        |
|/g/data/iv65/Geoscience_Australia_Geophysics_Reference_Data_Collection/ground_gravity/QLD/point/P197957/P197957-point-gravity.nc               |197957   |Point        |
|/g/data/iv65/Geoscience_Australia_Geophysics_Reference_Data_Collection/ground_gravity/QLD/point/P196691/P196691-point-gravity.nc               |196691   |Point        |
|/g/data/iv65/Geoscience_Australia_Geophysics_Reference_Data_Collection/ground_gravity/QLD/point/P197857/P197857-point-gravity.nc               |197857   |Point        |
+-----------------------------------------------------------------------------------------------------------------------------------------------+---------+-------------+
Narrow the search further by keeping only the datasets whose cdm_data_type is "Point".
filter_df = filter_df.filter(df.attributes.cdm_data_type == "Point")
filter_df.select(["file_uri","attributes.survey_id","attributes.cdm_data_type"]).distinct().show(truncate=False)
OUT:
+--------------------------------------------------------------------------------------------------------------------------------+---------+-------------+
|file_uri                                                                                                                        |survey_id|cdm_data_type|
+--------------------------------------------------------------------------------------------------------------------------------+---------+-------------+
|/g/data/iv65/Geoscience_Australia_Geophysics_Reference_Data_Collection/ground_gravity/QLD/point/P196531/P196531-point-gravity.nc|196531   |Point        |
|/g/data/iv65/Geoscience_Australia_Geophysics_Reference_Data_Collection/ground_gravity/QLD/point/P197957/P197957-point-gravity.nc|197957   |Point        |
|/g/data/iv65/Geoscience_Australia_Geophysics_Reference_Data_Collection/ground_gravity/QLD/point/P196691/P196691-point-gravity.nc|196691   |Point        |
|/g/data/iv65/Geoscience_Australia_Geophysics_Reference_Data_Collection/ground_gravity/QLD/point/P197857/P197857-point-gravity.nc|197857   |Point        |
+--------------------------------------------------------------------------------------------------------------------------------+---------+-------------+
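To hand the matching files on to other tools, the file_uri values can be pulled out of the Spark DataFrame into a plain Python list. The following is a minimal sketch that relies only on the DataFrame methods select, distinct and collect; collect_file_uris is a hypothetical helper name introduced here, not part of Spark or intake.

```python
def collect_file_uris(spark_df):
    """Return the distinct file_uri values of a Spark DataFrame as a sorted list.

    collect() brings the rows back to the driver, so this is intended for
    small, already filtered results such as the one above.
    """
    rows = spark_df.select("file_uri").distinct().collect()
    return sorted(row["file_uri"] for row in rows)
```

Calling collect_file_uris(filter_df) on the filtered DataFrame from this example should give the four ground gravity file paths shown in the output, ready to be opened with a NetCDF reader of your choice.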