Meeting notes 2024

2024 meeting notes and ongoing actions…

11/4/2024 – weekly meeting 9 2024

Hannes, Paola, Claire. Apologies: Chloe, Thomas

CLEX ran out of quota at NCI because they didn’t realise the CoE had been extended and quota transfers were meant to be in place.
CF checking
- Hannes has found Kelsey’s old CF checking scripts and a tutorial
Clex datasets
- Currently no plan for ua8, e.g.
- Note rr7 was never finally closed, but now only contains MERRA2
- ua8 holds the MERRA2 data that was added after rr7 ran out of space
- “It’s a mess” “oh gees”
  - FROGS indices need to be published
  - SODA is probably complete
  - Some datasets are downloaded subsets on researcher request and could be deleted
  - CMEMS (older version of AVISO)
  - AUS2200 needs to be published
  - C20C mostly to be deleted but some to be published
  - Large Ensemble data (from ua6) – sync to xv83?
  - Paola to sort out as much as she can and then we’ll review in a few weeks
- hh5 will persist beyond June 30 and has access to MDSS so some data can be archived there
- When a project is decommissioned, access to its MDSS is lost – even if the data is still there, it’s inaccessible because you need project membership to see it.
xarray
- May change the way they handle geospatial coordinates – the way they store as floating points creates problems with raster
- Mike Sumner submitting issues to Ryan Abernathey
  - When subsetting by ‘nearest’ can get columns of NaNs due to floating point representation
  - Need projection information inherent in the dataset to allow accurate subsetting
  - But it’s sort of against what xarray stands for – it’s not simplifying, it’s more complex to accurately support geospatial rasters
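The floating-point problem noted above can be reproduced without xarray. A minimal sketch in plain Python, assuming two grids whose coordinate axes were generated in different ways. The function names are illustrative only and are not xarray API. Exact matching leaves gaps, which is where NaNs would appear, while nearest matching with a small tolerance finds every point.

```python
def axis_by_multiplication(start, step, n):
    """Coordinates computed as start + i * step."""
    return [start + i * step for i in range(n)]


def axis_by_accumulation(start, step, n):
    """Coordinates computed by repeatedly adding step (rounding error accumulates)."""
    values, x = [], start
    for _ in range(n):
        values.append(x)
        x += step
    return values


def match_exact(targets, axis):
    """Index of each target in axis, or None where no bit-identical value exists."""
    lookup = {value: i for i, value in enumerate(axis)}
    return [lookup.get(t) for t in targets]


def match_nearest(targets, axis, tolerance):
    """Index of the nearest axis value, or None if it is further than tolerance."""
    matches = []
    for t in targets:
        i = min(range(len(axis)), key=lambda k: abs(axis[k] - t))
        matches.append(i if abs(axis[i] - t) <= tolerance else None)
    return matches


grid_a = axis_by_multiplication(0.0, 0.1, 10)
grid_b = axis_by_accumulation(0.0, 0.1, 10)

exact = match_exact(grid_a, grid_b)                      # contains None entries
nearest = match_nearest(grid_a, grid_b, tolerance=1e-9)  # every point matched

print("exact:  ", exact)
print("nearest:", nearest)
```

The tolerance has to be chosen with knowledge of the grid spacing, which is the same information a projection-aware dataset would carry.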
Paola away next week, and Hannes unable to attend as acting DS mgr.

28/3/2024 – weekly meeting 8 2024

Hannes, Dougie, Claire, Paola, Ben L, Paul B, Gen

Intake special!
- Hannes presenting NCI’s Intake work
  - Building intake catalogues, particularly around climate data
  - Demo scripts to use Intake catalogues
  - Drivers support diverse file types, and the user doesn’t need to know about the formats
- Dougie has written an ACCESS-NRI Intake catalogue (using a different driver again) which is a catalogue of catalogues, ie a catalogue of Intake sources (could be -ESM, -spark or any other driver, currently only Intake-ESM).
- Intake v2 coming – Dougie watched a talk and it looks cool but it breaks things – need to pin in environments or updates will pick up the alpha version.
  - Can write transformations that add metadata, do operations on data, etc.
  - “Intake take 2”, there’s some info on Intake readthedocs
  - Paul – individual files not really visible in the catalogue anymore. Harder to read (though they were getting that way anyway). Each file represented by a sort of ID. Introspects files to add to catalogue and dynamically applies. Allows less boilerplate code and removes need for customisation. Get a version controlled file that’s like a meta language implementing functionality from generic frameworks (data sources, readers, transformations).
- Paola – Previous limitation was choosing a single way to concatenate files, so v2 should allow people to join files differently.
  - Map files in a relational way sounds very good.
  - Move from user-focussed to managing centralised datasets.
- Dougie – v2 utility for simple datasets is clear, but not sure how it works with more complex things like Intake-ESM which is very bespoke and handles data concatenation and stuff. Additional work will be required to port Intake-ESM to v2
- Ben/Paul – also looking at kerchunk. Overlaps with Intake. Precomputes some work you’d need to do every time (e.g. concatenation of metadata), and has a plain xarray interface to datasets. Save view of where chunks are on disk. This alleviates time to build mfdataset calls.
- Dougie – There’s some projects trying to make kerchunk better to use.
  - https://github.com/TomNicholas/VirtualiZarr
  - https://github.com/NikosAlexandris/rekx
- Ben – There’s STAC catalogues… can build a Kerchunk catalogue on that but it doesn’t handle inconsistent datasets like variables changing names, coming and going, resolution changes. How do you handle it in Intake?
- Paola – do a search e.g. for CMIP6 data and try to load them, but if one of the matching datasets has an issue, the whole lot will fail to load.
  - Kerchunk would be very useful e.g. for 10min BoM data
  - IMOS trying to use kerchunk with their S3 storage where it’s very inefficient to retrieve all data to do a subset. But as a service provider they need to use stable libraries.
- Ben – issue with apparent non-deterministic behaviour (there’s formal issues around this)
  - Can handle netCDF3 and netCDF4 but can’t concatenate them together.
  - Paul – some of the issues are not kerchunk’s but actually dask/zarr. E.g. xarray can handle fixing individual files but zarr can’t handle non-uniform chunking along a dimension. Zarr will be changing this which will alleviate some of the challenges of poor data.
- Dougie – Consolidate chunks on load – ie define chunking and rechunk on load in order to concatenate dataset components together. Requires a good understanding of the underlying data and its issues – that’s a huge undertaking for e.g. 60 NCI catalogues
  - Need to load the catalogue JSON onto every dask worker which needs a lot of memory
  - Paul – implement parquet support for indexing (convert from JSON)
  - Template repeated parts of paths to reduce JSON
  - Compression layer over text.
  - Ben – Append new data as it comes in – effectively generating a new large JSON then converting to parquet so it’s still problematic on memory
- Martin Durant started both kerchunk and intake!
  - Martin is focussed on Intake v2 but there’s 43 kerchunk contributors and Rich Signell (USGS) has funding to push kerchunk development along
  - xarray will attempt to merge datasets with different chunking, it in theory helps the user experience but can slow things down so much it maybe isn’t a good thing!
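On the memory cost of large reference JSON files raised above (templating repeated path parts, a compression layer over text): a rough sketch of both ideas in plain Python. The reference layout, the path prefix and the `{{p}}` token are all made up for illustration. This mimics the idea only and is not kerchunk’s implementation or file format.

```python
import gzip
import json

# Hypothetical reference set: chunk key -> [file path, byte offset, byte length].
PREFIX = "/g/data/xx00/some/long/shared/directory/structure/"
refs = {
    f"tas/{i}.0.0": [f"{PREFIX}file_{i // 100:04d}.nc", 4096 * (i % 100), 4096]
    for i in range(10_000)
}


def template_paths(refs, prefix, token="{{p}}"):
    """Store the shared path prefix once and replace it with a short token."""
    templated = {
        key: [path.replace(prefix, token, 1), offset, length]
        for key, (path, offset, length) in refs.items()
    }
    return {"templates": {"p": prefix}, "refs": templated}


def expand_paths(doc, token="{{p}}"):
    """Inverse of template_paths: restore the full path in every reference."""
    prefix = doc["templates"]["p"]
    return {
        key: [path.replace(token, prefix, 1), offset, length]
        for key, (path, offset, length) in doc["refs"].items()
    }


plain = json.dumps(refs).encode()
templated = json.dumps(template_paths(refs, PREFIX)).encode()
compressed = gzip.compress(plain)

print(f"plain JSON:     {len(plain):>9} bytes")
print(f"templated JSON: {len(templated):>9} bytes")
print(f"gzipped JSON:   {len(compressed):>9} bytes")
```

Both approaches shrink what sits on disk, but a worker that parses the whole document still holds every reference in memory, which is the argument for a columnar index such as parquet.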
Persistency for hh5 environments and catalogues beyond CLEX
- NRI may take them on, but this also needs negotiating with NCI, e.g. merge with dk92? Possibly a collaboration between NCI and NRI. Claire Carouge is discussing with Paola: NRI would be co-responsible with NCI for the climate Python environment, removing NCI’s need to maintain its own Python environment, with NRI’s envs endorsed by NCI.
CF/ACDD checking
- Hannes to add the same checker (IOOS) and wrapper that Paola uses in hh5 into dk92. Need to check with Kelsey about the summary Python script that NCI used to use. Does that still exist, or should we use Paola’s?

21/3/2024 – weekly meeting 7 2024

Paola, Chloe, Claire

ACS reference data downloads
- ia39_download user now working to download data automatically
- However it would be wise to change ownership of everything to this user, rather than relying on the write group ACL
- Best method is probably to make everything in the reference collection owned by the functional user, and securely share the secret to access this user with the writers group so any member of the group can access it to modify things as needed, instead of being solely dependent on Paola
- Code sits alongside the data, but the ia39_download user needs to be able to write to the download logs
- Downloads (including auth, logging) are handled through GitHub Actions.
- Needed to add ia39_download user to hh5 group.
- Andrew Wellington very responsive
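For the ownership change suggested above, a small audit script could list what still needs fixing before anything is changed. A minimal sketch, assuming a POSIX filesystem. The function name is hypothetical, and the demo runs on a throwaway directory rather than the real collection.

```python
import os
import tempfile


def files_not_owned_by(root, uid):
    """Return paths under root whose owner is not the given numeric user ID."""
    mismatched = []
    for dirpath, dirnames, filenames in os.walk(root):
        for name in dirnames + filenames:
            path = os.path.join(dirpath, name)
            # lstat so that symlinks are checked themselves, not their targets
            if os.lstat(path).st_uid != uid:
                mismatched.append(path)
    return sorted(mismatched)


# Demo on a temporary directory holding one subdirectory and one log file.
with tempfile.TemporaryDirectory() as root:
    os.makedirs(os.path.join(root, "logs"))
    with open(os.path.join(root, "logs", "download.log"), "w") as handle:
        handle.write("ok\n")

    own_uid = os.getuid()
    clean = files_not_owned_by(root, own_uid)        # everything is ours
    flagged = files_not_owned_by(root, own_uid + 1)  # pretend another owner is expected

print(clean)
print(len(flagged))
```

On the real system the expected ID would come from something like `pwd.getpwnam("ia39_download").pw_uid`, assuming that account resolves on the host where the check runs.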
ESMValTool workshop
- Broadly useful for participants, maybe less so for the organisers, not much focus on specific ACCESS runs
- Claire focussed on using CORDEX data
- Alberto working on parsing Clef or Intake searches to produce ESMValTool-ready recipe input
- Gen and Christine working on Jupyter Notebook integration (be great to be able to use ESMValTool alongside other tools).
CORDEX data
- Mixed states of publication – CCAM and BARPA funded through ACS
  - Qld DES funded to publish a small amount of vars
  - NarCLIM?
  - WA NarCLIM not supported to publish yet
CLEX ending at end of this year
- Some people moving to W21C
- Paola can’t guarantee storage beyond the end of this year though
- hh5 underpins so many researchers!
- Maybe NRI might take over maintenance, but could be told to use the NCI envs instead. NCI envs do not seem to be sufficiently agile, hh5 is very responsive to new requests, problems and changes. Also support for bespoke locally developed packages.

14/3/2024 – weekly meeting 6 2024

Paola, Hannes

Discussed CORDEX
- Hannes had a bunch of questions – delegated work from Yiling
- CMORising
- CCAM and/or NarCLIM? Other data?

29/2/2024 – weekly meeting 5 2024

Paola, Thomas, Hannes, Claire

Decommissioning plan for CLEX NCI facilities if CMS data support funding isn’t continued to W21C
- 300TB data (ua8, rr7 ongoing data mgmt)
- hh5 conda environments and storage
- There’s a lot of dependencies on Paola’s NCI ident but if she stops working what happens?
- Better managed data has moved to ia39
- Maybe NRI will give some storage for publishing e.g. AUS2200
- Paola puts a backup of everything published to ks32 to MDSS
Still need to check issues with ia39 functional user for automated downloads with GitHub Actions (Paola)
ACDG Big Data
- Most incomplete but be good to draw a line under it (Overview — Working with Big/Challenging Data Collections (acdguide.github.io))
- Paige keen to just make clear what’s incomplete and how people can contribute.
- Point to existing much better resources for learning Dask etc now, and make the focus Australian tools
- Note that ACCESS-NRI Discourse exists but can be harder to find the gems among the discussion sometimes.
- Establish moderators so people can raise issues asking for resources to be included
ESMValTool
- ACCESS-NRI hosted a community discussion but mostly attended by NRI and NCI folk, I was the only other Australian (Plus LLNL and ECCC)
- Paige Martin now at NRI working with Romain, brings a good broad overview of tools used elsewhere
- Gab Abramowitz to work with NRI to enhance ILAMB
- Thomas writing his own evaluation tools with xarray and intake, xmip, datatree
- Note there are other tools too like PCMDI Metrics Package, icclim, xclim
- Inclusion of computing derived variables e.g. in ocean domain?
- Thomas – aim to share metrics as widely as possible so it’s only written once. Claire – this is the goal of ESMValTool.
- Need to write in one of our books – Big Data or Governance – how to write pluggable code for other tools, not specific to COSIMA or ESMValTool or whatever.
NCI data services working on intake to improve internal QA/QC
- intake-esm is not really maintained: Anderson has another job now, which Dougie ran into when he needed some changes made. It may not work with Intake v2. Who can look after this in the future?
- Hannes is using intake-spark – less curated, scrape all metadata from all netCDF files, should be more robust to intake2 transition
- Scrape all netCDF files for each publication, put into parquet files, with intake can open easily
- Who should take care of intake-esm catalogues? “who loses the staring contest”? How to maintain confidence in cataloguing?

22/2/2024 – weekly meeting 4 2024

Paola, Claire, Chloe

CSIRO has uninstalled Zoom from our laptops, tedious
ACCESS-NRI jobs – 3 team lead positions advertised
- Paola would be well suited but has concerns about applying
- ACCESS testing – Martin is pivotal, doco/process isn’t clear
- One of the jobs is TL of Ocean modelling but that team has always existed, just doesn’t have a team lead
- Secondments?
ACDG Cross-Inst data sharing report
- Paola to send to Andy P, get his okay then send to Angela for publishing

15/2/2024 – weekly meeting 3? 2024

Paola, Gen, Claire

Regridded data for both CMIP and CORDEX are available in BoM project lp01
ACDG Cross-Institutional Data Sharing – Ethics approvals all done now and we’re good to publish the report
- Good lesson in needs for approvals and processes!!
- Chloe has also gone through the same process for her NCESS consultation
ACS reference data
- CMORPH (was it?) data needed updating in ia39, had accidentally hardcoded the 00:00 time so needed to fix that.
- There’s a lot of files so will need concatenation when Paola has time
- Tried to use the functional user on Gadi through GitHub Actions but it didn’t work. Need to check if the key is working.
ACDG Metadata portal
- Need to add more records but otherwise going well
ACDG Governance book
- Close to final now, should arrange another meeting
- Tidy up ‘Create’ section and ‘Publishing’
ACDG Big Data
- Paige is back in Australia now, we should spin this back up?

25/1/2024 – weekly meeting 1 2024

Thomas, Claire, Gen, Chloe. Apologies: Paola

2024 is already busy – and tomorrow is a holiday!
AGCD consolidation on hold
Data backups
- No backup strategy still for published CMIP5/6 data but there is a backup of the pre-CMORised CMIP6 data on CSIRO /datastore
- xv83, ia39 and hq89 have a lot of storage but it’s stretched and there’s no backup for any of it, and not all is urgently needed on disk
- Stream data to tape when publishing to make a backup copy and delete from work disk
- Also asked to publish Qld CCAM data through ACS storage
- Initial inquiries with IM&T “Non-Standard Requests Team” quoted $330k p.a. to store PBs of data which seemed odd so we went to Joseph Antony, who confirmed we can indeed use /datastore – give feedback via Gareth maybe???
- Follow up with Steve and Gareth established we can stream data in parallel to datastore using Globus after resolving some issues (first tests were about 1/10 the speed of the parallel rsync approach), Steve McMahon has been doing some testing and we are good to proceed with backing up CCAM data. Quoted 4PB in 5mo. Our goal was 1PB/3mo so this is good.
- Figure out what to do with CMIP data
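To put the quoted backup rate next to the goal, a quick back-of-envelope conversion to sustained throughput. This assumes decimal petabytes and 30-day months, neither of which is stated in the quote.

```python
SECONDS_PER_MONTH = 30 * 24 * 3600  # assumption: 30-day months
BYTES_PER_PB = 10**15               # assumption: decimal petabytes


def sustained_rate_mb_per_s(petabytes, months):
    """Average transfer rate in MB/s needed to move the volume in the given time."""
    total_bytes = petabytes * BYTES_PER_PB
    return total_bytes / (months * SECONDS_PER_MONTH) / 10**6


quoted = sustained_rate_mb_per_s(4, 5)  # 4 PB in 5 months
goal = sustained_rate_mb_per_s(1, 3)    # 1 PB in 3 months

print(f"quoted: {quoted:.0f} MB/s, goal: {goal:.0f} MB/s, ratio: {quoted / goal:.1f}x")
```

Under those assumptions the quoted rate is about 2.4 times the goal, averaged over the whole period.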
Gen not permanent yet so can’t rock the boat at BoM 🙂
We are now hosted on WordPress as public Confluence access had to be removed
Intake
- Gen has made good headway in the last couple of weeks
- In lp01, some catalogues cover external data and some cover the lp01 regridded output data.
- This is Gen’s ACS focus but there is overlap with NESP
- Cataloguing BARRA2, BARPA and CCAM data (from xv83) – but CCAM will move to hq89 next week for a few forcing datasets, will be equivalent to py18 for BARPA.
- BARPA starting with historical for all models, CCAM starting with a few models (ERA5, ACCESS) for all scenarios
- Gen skilling up to take over Francois’ data processing role
- Intake catalogues do not replace good file structure and metadata!! Still needed and it’s needed just to build the catalogue too!
- Claire – Kerchunk is sometimes useful too
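On the point above that a catalogue still depends on good file structure: a catalogue builder typically derives its columns from the file names, so files that ignore the convention cannot be indexed. A minimal sketch, assuming CMIP6-style file names. The parser and the second file name are made up for illustration, and fixed-field files without a time range are deliberately not handled.

```python
CMIP6_FIELDS = [
    "variable_id",
    "table_id",
    "source_id",
    "experiment_id",
    "member_id",
    "grid_label",
    "time_range",
]


def parse_cmip6_filename(filename):
    """Return a catalogue row for a CMIP6-style file name, or None if it does not fit."""
    if not filename.endswith(".nc"):
        return None
    parts = filename[: -len(".nc")].split("_")
    # Simplification: only names with all seven underscore-separated fields are accepted.
    if len(parts) != len(CMIP6_FIELDS):
        return None
    return dict(zip(CMIP6_FIELDS, parts))


good = parse_cmip6_filename("tas_Amon_ACCESS-ESM1-5_historical_r1i1p1f1_gn_185001-201412.nc")
bad = parse_cmip6_filename("tas_monthly_final_v2.nc")  # hypothetical ad hoc name

print(good)
print(bad)
```

The well-formed name yields a full row, while the ad hoc name yields nothing that could be searched on.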