You can load the following module to run this example.

module use /g/data/up99/modulefiles
module load NCI-geophys/23.04

In a Jupyter notebook or a Python script, start a SparkSession.

from intake_spark.base import SparkHolder
h = SparkHolder(True, [('catalog', )], {})
h.setup()
session = h.session[0]
session.conf.set("spark.sql.caseSensitive", "true")

Load the Geophysics dataset catalog file for project iv65.

import intake
catalog = intake.open_catalog("/g/data/dk92/catalog/yml/geophysics-iv65.yml")
df = catalog.mydata.to_spark()

There are 5 columns in the Spark DataFrame.

print(df.columns)

OUT:
['attributes', 'dimensions', 'file_uri', 'format', 'variables']

The attributes column contains the following sub-columns.

df.select("attributes").printSchema()

OUT:
root
 |-- attributes: struct (nullable = true)
 |    |-- Conventions: string (nullable = true)
 |    |-- CreationMethod: string (nullable = true)
 |    |-- CreationTime: string (nullable = true)
 |    |-- GDAL: string (nullable = true)
 |    |-- GDAL_COLORSPACE: string (nullable = true)
 |    |-- GDAL_COMPRESSION_RATE_TARGET: long (nullable = true)
 |    |-- GDAL_VERSION: long (nullable = true)
 |    |-- IntrepidSourceDataset: string (nullable = true)
 |    |-- NCO: string (nullable = true)
 |    |-- cdm_data_type: string (nullable = true)
 |    |-- date_created: string (nullable = true)
 |    |-- date_modified: string (nullable = true)
 |    |-- doi: string (nullable = true)
 |    |-- ecat_id: string (nullable = true)
 |    |-- geospatial_bounds: string (nullable = true)
 |    |-- geospatial_bounds_crs: string (nullable = true)
 |    |-- geospatial_lat_max: double (nullable = true)
 |    |-- geospatial_lat_min: double (nullable = true)
 |    |-- geospatial_lat_resolution: string (nullable = true)
 |    |-- geospatial_lat_units: string (nullable = true)
 |    |-- geospatial_lon_max: double (nullable = true)
 |    |-- geospatial_lon_min: double (nullable = true)
 |    |-- geospatial_lon_resolution: string (nullable = true)
 |    |-- geospatial_lon_units: string (nullable = true)
 |    |-- history: string (nullable = true)
 |    |-- institution: string (nullable = true)
 |    |-- keywords: string (nullable = true)
 |    |-- licence: string (nullable = true)
 |    |-- license: string (nullable = true)
 |    |-- location_accuracy_max: double (nullable = true)
 |    |-- location_accuracy_min: double (nullable = true)
 |    |-- median_sample_spacing_m: double (nullable = true)
 |    |-- metadata_link: string (nullable = true)
 |    |-- nominal_pixel_size_lat_degrees: double (nullable = true)
 |    |-- nominal_pixel_size_lon_degrees: double (nullable = true)
 |    |-- nominal_pixel_size_x_metres: double (nullable = true)
 |    |-- nominal_pixel_size_y_metres: double (nullable = true)
 |    |-- pixel_count: long (nullable = true)
 |    |-- product_version: string (nullable = true)
 |    |-- source: string (nullable = true)
 |    |-- summary: string (nullable = true)
 |    |-- survey_id: string (nullable = true)
 |    |-- time_coverage_duration: string (nullable = true)
 |    |-- time_coverage_end: string (nullable = true)
 |    |-- time_coverage_start: string (nullable = true)
 |    |-- title: string (nullable = true)
 |    |-- uuid: string (nullable = true)

View the distinct geospatial bounding boxes recorded for the datasets.

df.select(["attributes.geospatial_lat_min","attributes.geospatial_lat_max","attributes.geospatial_lon_min","attributes.geospatial_lon_max"]).distinct().show()

OUT:
+------------------+------------------+------------------+------------------+
|geospatial_lat_min|geospatial_lat_max|geospatial_lon_min|geospatial_lon_max|
+------------------+------------------+------------------+------------------+
|         -27.02483|         -25.98154|          121.4743|          123.0109|
|         -26.06157|         -23.30061|           143.668|          146.0421|
|         -14.05401|         -11.46848|          140.7658|          143.0429|
|          -37.7517|         -37.38455|          147.7865|          148.9927|
|         -35.05652|         -33.94444|          149.9239|          151.2641|
|         -20.85701|         -20.66078|          139.9486|          140.1133|
|         -23.37588|         -20.99958|          142.9629|           145.799|
|         -36.51196|         -35.91739|          147.3298|          148.1426|
|         -32.46287|         -31.50878|          146.5587|          147.0497|
|          -16.0142|         -14.98587|          124.4862|          126.0145|
|         -22.10119|          -19.9812|          123.4392|          126.0165|
|         -31.04056|         -29.46312|          139.4585|          141.0376|
|          -27.0204|         -24.82354|          148.2168|          151.9096|
|         -27.10398|         -24.98777|           151.465|          153.0789|
|         -34.78688|         -32.49003|          140.9833|          143.6984|
|         -28.96349|         -27.78879|          150.7739|          152.3096|
|         -21.51489|         -19.95301|          141.2282|          144.6018|
|         -29.00867|         -26.99114|          139.4891|          141.0085|
|         -37.16151|         -36.81886|          142.5527|          142.9063|
|         -38.01049|         -37.33214|          143.3085|          144.0336|
+------------------+------------------+------------------+------------------+
only showing top 20 rows
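
The same four nested column names are typed out again in the next step, so it can help to build the list once. The following is a minimal sketch; the helper name nested_columns is hypothetical and is not part of intake-spark or PySpark. It only builds strings, so it runs without a SparkSession.

```python
from typing import List, Sequence


def nested_columns(parent: str, names: Sequence[str]) -> List[str]:
    """Build dotted column paths such as 'attributes.geospatial_lat_min'."""
    return [f"{parent}.{name}" for name in names]


# The four bounding-box sub-columns of the attributes struct.
bounds_cols = nested_columns(
    "attributes",
    [
        "geospatial_lat_min",
        "geospatial_lat_max",
        "geospatial_lon_min",
        "geospatial_lon_max",
    ],
)

print(bounds_cols)

# With the DataFrame from this page, the list can be passed to select():
# df.select(bounds_cols).distinct().show()
```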

Filter to datasets whose bounding box lies within latitude -15 to -10 and longitude 141 to 150.

# The bounds columns are doubles, so compare against numbers rather than strings.
filter_df = df.filter(
    (df.attributes.geospatial_lat_min > -15)
    & (df.attributes.geospatial_lat_max < -10)
    & (df.attributes.geospatial_lon_min > 141)
    & (df.attributes.geospatial_lon_max < 150)
)
filter_df.select(["attributes.geospatial_lat_min","attributes.geospatial_lat_max","attributes.geospatial_lon_min","attributes.geospatial_lon_max"]).distinct().show()

OUT:
+------------------+------------------+------------------+------------------+
|geospatial_lat_min|geospatial_lat_max|geospatial_lon_min|geospatial_lon_max|
+------------------+------------------+------------------+------------------+
|         -13.08931|          -10.8644|          141.4076|          143.5503|
|         -13.06042|         -11.58125|          141.3896|          143.5104|
|         -14.12344|         -12.87207|          141.1922|          143.8161|
|         -14.12344|         -12.87207|          141.1922|          143.9164|
|         -14.12343|         -12.87207|          141.1922|          143.8161|
|         -13.08931|         -11.92395|          141.2204|          143.9286|
|         -11.83098|         -11.53919|          142.4397|          142.7641|
|         -10.59019|         -10.58469|          142.2253|          142.2978|
|         -13.67501|         -11.98582|          141.9298|          143.0608|
|         -12.66251|         -12.37281|          142.8504|          143.0599|
+------------------+------------------+------------------+------------------+
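
The four comparisons in the filter amount to a strict containment test: a dataset is kept only when its whole bounding box sits inside the search region, so datasets that merely overlap the region are dropped. The plain-Python sketch below illustrates that logic on two bounding boxes copied from the tables above. The names BBox and strictly_inside are hypothetical and exist only for this illustration.

```python
from typing import NamedTuple


class BBox(NamedTuple):
    lat_min: float
    lat_max: float
    lon_min: float
    lon_max: float


def strictly_inside(box: BBox, region: BBox) -> bool:
    """Return True when every edge of box lies strictly inside region."""
    return (
        box.lat_min > region.lat_min
        and box.lat_max < region.lat_max
        and box.lon_min > region.lon_min
        and box.lon_max < region.lon_max
    )


# Search region used in the filter step.
region = BBox(lat_min=-15.0, lat_max=-10.0, lon_min=141.0, lon_max=150.0)

# A row that appears in the filtered output.
kept = BBox(-13.08931, -10.8644, 141.4076, 143.5503)
# A row from the unfiltered listing: its western edge (140.7658) is west of 141.
dropped = BBox(-14.05401, -11.46848, 140.7658, 143.0429)

print(strictly_inside(kept, region))     # True
print(strictly_inside(dropped, region))  # False
```

To match datasets that only intersect the region, the comparisons would need to be swapped for an overlap test instead.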

Check the column values of the filtered DataFrame.

filter_df.select(["file_uri","attributes.survey_id","attributes.cdm_data_type"]).distinct().show(truncate=False)

OUT:
+-----------------------------------------------------------------------------------------------------------------------------------------------+---------+-------------+
|file_uri                                                                                                                                       |survey_id|cdm_data_type|
+-----------------------------------------------------------------------------------------------------------------------------------------------+---------+-------------+
|/g/data/iv65/Geoscience_Australia_Geophysics_Reference_Data_Collection/airborne_geophysics/GA/grid/P520/P520-grid-dem_geoid.nc                 |null     |null         |
|/g/data/iv65/Geoscience_Australia_Geophysics_Reference_Data_Collection/airborne_geophysics/QLD/line/P1326/P1326-line-elevation.nc              |0        |null         |
|/g/data/iv65/Geoscience_Australia_Geophysics_Reference_Data_Collection/airborne_geophysics/GA/line/P521/P521-line-radiometric-AWAGS_RAD_2015.nc|521      |null         |
|/g/data/iv65/Geoscience_Australia_Geophysics_Reference_Data_Collection/airborne_geophysics/GA/line/P521/P521-line-magnetic.nc                  |521      |null         |
|/g/data/iv65/Geoscience_Australia_Geophysics_Reference_Data_Collection/airborne_geophysics/GA/line/P520/P520-line-magnetic.nc                  |520      |null         |
|/g/data/iv65/Geoscience_Australia_Geophysics_Reference_Data_Collection/airborne_geophysics/GA/line/P521/P521-line-magnetic-AWAGS_MAG_2010.nc   |521      |null         |
|/g/data/iv65/Geoscience_Australia_Geophysics_Reference_Data_Collection/airborne_geophysics/GA/line/P520/P520-line-radiometric.nc               |520      |null         |
|/g/data/iv65/Geoscience_Australia_Geophysics_Reference_Data_Collection/airborne_geophysics/GA/line/P520/P520-line-radiometric-AWAGS_RAD_2015.nc|520      |null         |
|/g/data/iv65/Geoscience_Australia_Geophysics_Reference_Data_Collection/airborne_geophysics/GA/line/P521/P521-line-radiometric.nc               |521      |null         |
|/g/data/iv65/Geoscience_Australia_Geophysics_Reference_Data_Collection/airborne_geophysics/GA/line/P520/P520-line-magnetic-AWAGS_MAG_2010.nc   |520      |null         |
|/g/data/iv65/Geoscience_Australia_Geophysics_Reference_Data_Collection/ground_gravity/QLD/point/P196531/P196531-point-gravity.nc               |196531   |Point        |
|/g/data/iv65/Geoscience_Australia_Geophysics_Reference_Data_Collection/ground_gravity/QLD/point/P197957/P197957-point-gravity.nc               |197957   |Point        |
|/g/data/iv65/Geoscience_Australia_Geophysics_Reference_Data_Collection/ground_gravity/QLD/point/P196691/P196691-point-gravity.nc               |196691   |Point        |
|/g/data/iv65/Geoscience_Australia_Geophysics_Reference_Data_Collection/ground_gravity/QLD/point/P197857/P197857-point-gravity.nc               |197857   |Point        |
+-----------------------------------------------------------------------------------------------------------------------------------------------+---------+-------------+
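
In this listing, survey_id is null or 0 for some rows, while the file name still carries a P-number. Where survey_id is populated, it matches that number (for example 521 and P521). Assuming that pattern holds, which is an inference from the rows above and not a documented naming rule, the pieces of a file name can be recovered as sketched below. The function parse_file_uri and the group names survey, kind and product are hypothetical.

```python
import re
from pathlib import PurePosixPath
from typing import Dict, Optional

# Pattern inferred from the listing above: P<digits>-<kind>-<product>.nc
FILE_NAME = re.compile(r"^P(?P<survey>\d+)-(?P<kind>[a-z]+)-(?P<product>.+)\.nc$")


def parse_file_uri(uri: str) -> Optional[Dict[str, str]]:
    """Split the final path component into survey, kind and product parts.

    Returns None when the file name does not follow the assumed pattern.
    """
    match = FILE_NAME.match(PurePosixPath(uri).name)
    return match.groupdict() if match else None


# A path from the listing above whose survey_id attribute is null.
uri = (
    "/g/data/iv65/Geoscience_Australia_Geophysics_Reference_Data_Collection/"
    "airborne_geophysics/GA/grid/P520/P520-grid-dem_geoid.nc"
)

print(parse_file_uri(uri))
# {'survey': '520', 'kind': 'grid', 'product': 'dem_geoid'}
```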

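Narrow the search further by keeping only the datasets whose cdm_data_type is "Point". Rows where cdm_data_type is null do not satisfy the comparison and are dropped.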

filter_df = filter_df.filter(
    filter_df.attributes.cdm_data_type == "Point"
)
filter_df.select(["file_uri","attributes.survey_id","attributes.cdm_data_type"]).distinct().show(truncate=False)

OUT:
+--------------------------------------------------------------------------------------------------------------------------------+---------+-------------+
|file_uri                                                                                                                        |survey_id|cdm_data_type|
+--------------------------------------------------------------------------------------------------------------------------------+---------+-------------+
|/g/data/iv65/Geoscience_Australia_Geophysics_Reference_Data_Collection/ground_gravity/QLD/point/P196531/P196531-point-gravity.nc|196531   |Point        |
|/g/data/iv65/Geoscience_Australia_Geophysics_Reference_Data_Collection/ground_gravity/QLD/point/P197957/P197957-point-gravity.nc|197957   |Point        |
|/g/data/iv65/Geoscience_Australia_Geophysics_Reference_Data_Collection/ground_gravity/QLD/point/P196691/P196691-point-gravity.nc|196691   |Point        |
|/g/data/iv65/Geoscience_Australia_Geophysics_Reference_Data_Collection/ground_gravity/QLD/point/P197857/P197857-point-gravity.nc|197857   |Point        |
+--------------------------------------------------------------------------------------------------------------------------------+---------+-------------+



