We’re hosting a tutorial to introduce the NCI data catalog and its two indexing schemes: Intake-ESM and Intake-Spark. A data catalog helps users discover and access datasets through structured metadata, while indexing improves performance by enabling fast, targeted searches. Built on the Python Intake package, these tools support scalable, memory-efficient access to large datasets. At NCI, Intake-Spark uses Parquet-based indexes for high-performance querying with Spark, while Intake-ESM uses lightweight CSV-based indexes ideal for climate data workflows. This session will include hands-on Jupyter Notebook examples showing how to use the catalog in data analysis and machine learning workflows. You’ll learn how to search, load, and filter datasets efficiently from the /g/data collections. The tutorial is ideal for researchers working with large-scale data or looking to streamline their pipelines. Stay tuned for scheduling details—everyone is welcome!
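
As a rough illustration of what a lightweight CSV-based index does, here is a minimal sketch using only the Python standard library. It is not the Intake-ESM API or the NCI catalog schema: the column names, file paths, and the `search_index` helper are all hypothetical, chosen only to show how a metadata search narrows a collection down to the files that match.

```python
import csv
import io

# Hypothetical index rows: the columns and paths are illustrative only,
# not the schema of any real NCI catalog.
INDEX_CSV = """path,variable,frequency,experiment
/example/data/tas_mon_historical.nc,tas,mon,historical
/example/data/tas_day_historical.nc,tas,day,historical
/example/data/pr_mon_ssp585.nc,pr,mon,ssp585
"""


def search_index(index_text, **criteria):
    """Return the file paths of index rows matching every column=value pair."""
    reader = csv.DictReader(io.StringIO(index_text))
    return [
        row["path"]
        for row in reader
        if all(row[column] == value for column, value in criteria.items())
    ]


# Only the small index is scanned; no data file is opened during the search.
matches = search_index(INDEX_CSV, variable="tas", frequency="mon")
print(matches)
```

The point of the sketch is that the search touches only the small index, so the data files themselves are opened after the selection has been narrowed. The tutorial notebooks cover the real catalog interfaces.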

To take part in the hands-on practice sessions, please check the prerequisites well in advance of the workshop.

Prerequisites

Join the following projects before the tutorial:

vp91: NCI Training Project

dk92: NCI-data-analysis virtual environment group

wb00: AI/DL reference data collection

oi10: ESGF CMIP6 Replication Data
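
One possible way to confirm the memberships from a login session is sketched below. It assumes that NCI project membership shows up as Unix group membership on the system where the script runs; the helper names `missing_projects` and `current_group_names` are made up for this example. Treat it as a convenience check, not an official NCI tool.

```python
import grp
import os

# Project codes listed in the prerequisites above.
REQUIRED_PROJECTS = ["vp91", "dk92", "wb00", "oi10"]


def missing_projects(required, current_groups):
    """Return the required project codes that are absent from current_groups."""
    current = set(current_groups)
    return [project for project in required if project not in current]


def current_group_names():
    """Return the names of the Unix groups the current process belongs to."""
    names = []
    for gid in os.getgroups():
        try:
            names.append(grp.getgrgid(gid).gr_name)
        except KeyError:
            # Some group IDs may have no name entry; skip them.
            pass
    return names


# Assumption: project membership is visible as Unix group membership.
missing = missing_projects(REQUIRED_PROJECTS, current_group_names())
if missing:
    print("Not yet a member of:", ", ".join(missing))
else:
    print("All prerequisite projects found.")
```

Newly approved memberships may not appear until you start a fresh login session.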

Agenda

Exploring the NCI Data Catalog with Intake-Spark and Intake-ESM

Date: 11 June 2025  Time: 2:00 PM – 4:00 PM

Registration

| Time | Speaker | Session |
|---|---|---|
| 2:00 – 2:15 PM | Dr. Rui Yang | Welcome and Set Up ARE JupyterLab Session |
| 2:15 – 2:30 PM | Dr. Hannes Hollmann | Overview of NCI’s Data Services |
| 2:30 – 2:45 PM | Dr. Rui Yang | Introduction to NCI’s Data Catalog and Indexing Schemes |
| 2:45 – 3:00 PM | Dr. Rui Yang | Hands-on Practice: Working with the Intake-ESM Indexing Scheme |
| 3:00 – 3:15 PM | Dr. Rui Yang | Hands-on Practice: Applying the Intake-ESM Scheme in AI/ML Workflows |
| 3:15 – 3:20 PM | | Short break |
| 3:20 – 3:45 PM | Dr. Rui Yang | Hands-on Practice: Using the Intake-Spark Indexing Scheme |
| 3:45 – 4:00 PM | All | Open Discussion: Q&A, Support Needs, and Feedback |

Instructions

Launching the NCI Data Catalog and Indexing Tutorial on ARE JupyterLab

