Meeting notes 2024
2024 meeting notes and ongoing actions…
11/4/2024 – weekly meeting 9 2024
Hannes, Paola, Claire. Apologies: Chloe, Thomas
- CLEX ran out of quota at NCI because they didn’t realise the CoE had been extended and quota transfers were meant to be in place.
- CF checking
- Hannes has found Kelsey’s old CF checking scripts and a tutorial
- Clex datasets
- Currently no plan for ua8, e.g.
- Note rr7 was never finally closed, but now only contains MERRA2
- ua8 contains MERRA2 when rr7 ran out of space
- “It’s a mess” “oh gees”
- FROGS indices needs to be published
- SODA is probably complete
- Some datasets are downloaded subsets on researcher request and could be deleted
- CMEMS (older version of AVISO)
- AUS2200 needs to be published
- C20C mostly to be deleted but some to be published
- Large Ensemble data (from ua6) – sync to xv83?
- Paola to sort out as much as she can and then we’ll review in a few weeks
- hh5 will persist beyond June 30 and has access to MDSS so some data can be archived there
- When a project is decommissioned, access to its MDSS is lost – even if the data is still there, it’s inaccessible because you need project membership to see it.
- xarray
- May change the way they handle geospatial coordinates – the way they store as floating points creates problems with raster
- Mike Sumner submitting issues to Ryan Abernathey
- When subsetting by ‘nearest’ can get columns of NaNs due to floating point representation
- Need projection information inherent in the dataset to allow accurate subsetting
- But it’s sort of against what xarray stands for – it’s not simplifying, it’s more complex to accurately support geospatial rasters
- Paola away next week, and Hannes unable to attend as acting DS mgr.
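Sketch for the xarray item above (added for illustration, not from the meeting, and not a reproduction of the issues Mike reported). It shows why matching on stored floating-point coordinates is fragile, and how knowing the grid geometry (origin and resolution, i.e. the kind of information a projection/transform carries) lets you recover an exact integer index instead. The 0.1-degree grid, `ORIGIN`, `RES` and both helper names are assumptions.

```python
# Hypothetical 0.1-degree grid; origin, resolution and size are illustrative.
ORIGIN = 0.0
RES = 0.1
N = 10

# Coordinates stored as floats, the way a dataset would carry them.
coords = [ORIGIN + i * RES for i in range(N)]


def exact_lookup(value):
    """Exact float match: fragile, because 3 * 0.1 is not exactly 0.3."""
    return coords.index(value) if value in coords else None


def affine_lookup(value):
    """Recover the integer cell index from origin and resolution."""
    idx = round((value - ORIGIN) / RES)
    return idx if 0 <= idx < N else None


print(repr(coords[3]))        # the stored coordinate for cell 3
print(exact_lookup(0.3))      # exact match on the float fails
print(affine_lookup(0.3))     # index recovered from the grid geometry
```

The point is only that the grid definition, not the float values, is the reliable source of truth for subsetting.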
28/3/2024 – weekly meeting 8 2024
Hannes, Dougie, Claire, Paola, Ben L, Paul B, Gen
- Intake special!
- Hannes presenting NCI’s Intake work
- Building intake catalogues, particularly around climate data
- Demo scripts to use Intake catalogues
- Drivers support diverse file types, and the user doesn’t need to know about the formats
- Dougie has written an ACCESS-NRI Intake catalogue (using a different driver again) which is a catalogue of catalogues, i.e. a catalogue of Intake sources (could be -ESM, -spark or any other driver; currently only Intake-ESM).
- Intake v2 coming – Dougie watched a talk and it looks cool but it breaks things – need to pin in environments or updates will pick up the alpha version.
- Can write transformations that add metadata, do operations on data, etc.
- “Intake take 2”, there’s some info on Intake readthedocs
- Paul – individual files not really visible in the catalogue anymore. Harder to read (though they were getting that way anyway). Each file represented by a sort of ID. Introspects files to add to catalogue and dynamically applies. Allows less boilerplate code and removes need for customisation. Get a version controlled file that’s like a meta language implementing functionality from generic frameworks (data sources, readers, transformations).
- Paola – Previous limitation was choosing a single way to concatenate files, so v2 should allow people to join files differently.
- Map files in a relational way sounds very good.
- Move from user-focussed to managing centralised datasets.
- Dougie – v2 utility for simple datasets is clear, but not sure how it works with more complex things like Intake-ESM which is very bespoke and handles data concatenation and stuff. Additional work will be required to port Intake-ESM to v2
- Ben/Paul – also looking at kerchunk. Overlaps with Intake. Precomputes some work you’d need to do every time (e.g. concatenation of metadata), and has a plain xarray interface to datasets. Saves a view of where chunks are on disk, which cuts the time taken to build mfdataset calls.
- Dougie – There’s some projects trying to make kerchunk better to use.
- https://github.com/TomNicholas/VirtualiZarr
- https://github.com/NikosAlexandris/rekx
- Ben – There’s STAC catalogues… can build a Kerchunk catalogue on that but it doesn’t handle inconsistent datasets like variables changing names, coming and going, resolution changes. How do you handle it in Intake?
- Paola – do a search e.g. for CMIP6 data and try to load them, but if one of the matching datasets has an issue, the whole lot will fail to load.
- Kerchunk would be very useful e.g. for 10min BoM data
- IMOS trying to use kerchunk with their S3 storage where it’s very inefficient to retrieve all data to do a subset. But as a service provider they need to use stable libraries.
- Ben – issue with apparent non-deterministic behaviour (there’s formal issues around this)
- Can handle netCDF3 and netCDF4 but can’t concatenate them together.
- Paul – some of the issues are not kerchunk’s but actually dask/zarr. E.g. xarray can handle fixing individual files but zarr can’t handle non-uniform chunking along a dimension. Zarr will be changing this which will alleviate some of the challenges of poor data.
- Dougie – Consolidate chunks on load – i.e. define chunking and rechunk on load in order to concatenate dataset components together. Requires a good understanding of the underlying data and its issues – that’s a huge undertaking for e.g. 60 NCI catalogues
- Need to load the catalogue JSON onto every dask worker which needs a lot of memory
- Paul – implement parquet support for indexing (convert from JSON)
- Template repeated parts of paths to reduce JSON
- Compression layer over text.
- Ben – Append new data as it comes in – effectively generating a new large JSON then converting to parquet so it’s still problematic on memory
- Martin Durant started both kerchunk and intake!
- Martin is focussed on Intake v2 but there’s 43 kerchunk contributors and Rich Signell (USGS) has funding to push kerchunk development along
- xarray will attempt to merge datasets with different chunking; in theory this helps the user experience, but it can slow things down so much that it maybe isn’t a good thing!
- Persistency for hh5 environments and catalogues beyond CLEX
- NRI may take them on, but also need to negotiate with NCI, e.g. merge with dk92? Possibly a collaboration between NCI and NRI. Claire Carouge is discussing with Paola: NRI would be co-responsible with NCI for the climate Python environment, which removes NCI’s need to maintain a Python environment, and NRI’s envs would be endorsed by NCI.
- CF/ACDD checking
- Hannes to add the same checker (IOOS) and wrapper that Paola uses in hh5 into dk92. Need to check with Kelsey about summary python script that NCI used to use. Does that still exist or use Paola’s?
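Sketch for Paul’s kerchunk point above about templating repeated parts of paths to reduce the reference JSON (added for illustration, not from the meeting). The layout is loosely modelled on kerchunk’s version 1 reference format (a `templates` mapping plus `{{u}}` placeholders); the `/g/data/xx00/...` paths, the chunk keys and the helper names `template_refs` / `expand_refs` are hypothetical, and only the standard library is used.

```python
import json
import os


def template_refs(refs):
    """Factor the common directory prefix of all chunk paths into one template."""
    prefix = os.path.commonpath([entry[0] for entry in refs.values()])
    templated = {
        key: ["{{u}}" + url[len(prefix):], offset, length]
        for key, (url, offset, length) in refs.items()
    }
    return {"version": 1, "templates": {"u": prefix}, "refs": templated}


def expand_refs(doc):
    """Inverse of template_refs: substitute the templates back into each path."""
    out = {}
    for key, (url, offset, length) in doc["refs"].items():
        for name, value in doc["templates"].items():
            url = url.replace("{{" + name + "}}", value)
        out[key] = [url, offset, length]
    return out


# Hypothetical references: chunk key -> [file path, byte offset, byte length].
refs = {
    f"tas/{i}.0.0": [f"/g/data/xx00/model/output/tas_{i // 4:03d}.nc", 4096 * (i % 4), 4096]
    for i in range(1000)
}

doc = template_refs(refs)
plain_size = len(json.dumps(refs))
templated_size = len(json.dumps(doc))
print(f"plain: {plain_size} chars, templated: {templated_size} chars")
```

The saving grows with path length and number of chunks, but the whole reference set still has to be held in memory, which is the concern raised about loading the JSON onto every dask worker.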
21/3/2024 – weekly meeting 7 2024
Paola, Chloe, Claire
- ACS reference data downloads
- ia39_download user now working to download data automatically
- However it would be wise to change ownership of everything to this user, rather than relying on the write group ACL
- Best method is probably to make everything in the reference collection owned by the functional user, and securely share the secret to access this user with the writers group so any member of the group can access it to modify things as needed, instead of being solely dependent on Paola
- Code sits alongside the data, but the ia39_download user needs to be able to write to the download logs
- Downloads (including auth, logging) are handled through GitHub actions.
- Needed to add ia39_download user to hh5 group.
- Andrew Wellington very responsive
- ESMValTool workshop
- Broadly useful for participants, maybe less so for the organisers, not much focus on specific ACCESS runs
- Claire focussed on using CORDEX data
- Alberto working on parsing Clef or Intake searches to produce ESMValTool-ready recipe input
- Gen and Christine working on Jupyter Notebook integration (be great to be able to use ESMValTool alongside other tools).
- CORDEX data
- Mixed states of publication – CCAM and BARPA funded through ACS
- Qld DES funded to publish a small amount of vars
- NarCLIM?
- WA NarCLIM not supported to publish yet
- CLEX ending at end of this year
- Some people moving to W21C
- Paola can’t guarantee storage beyond the end of this year though
- hh5 underpins so many researchers!
- Maybe NRI might take over maintenance, but could be told to use the NCI envs instead. NCI envs do not seem to be sufficiently agile, hh5 is very responsive to new requests, problems and changes. Also support for bespoke locally developed packages.
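Sketch for the ACS reference data item above, where the plan is to make everything in the collection owned by the functional user rather than relying on the group ACL (added for illustration, not from the meeting). Before changing ownership it can help to list what is not yet owned by that user. The helper name `not_owned_by` is hypothetical, the demo runs on a temporary directory rather than any real project path, and it assumes a POSIX system.

```python
import os
import tempfile
from pathlib import Path


def not_owned_by(root, uid):
    """Return paths under root whose owner is not the given numeric uid."""
    strays = []
    for dirpath, dirnames, filenames in os.walk(root):
        for name in dirnames + filenames:
            path = Path(dirpath) / name
            # lstat so that symlinks are checked themselves, not their targets
            if path.lstat().st_uid != uid:
                strays.append(path)
    return sorted(strays)


# Demo on a throwaway directory: everything here is owned by the current user.
with tempfile.TemporaryDirectory() as tmp:
    (Path(tmp) / "logs").mkdir()
    (Path(tmp) / "logs" / "download.log").write_text("ok\n")
    print(not_owned_by(tmp, os.getuid()))
```

On the real collection the uid would be that of the functional user, and the resulting list is what would need its ownership changed.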
14/3/2024 – weekly meeting 6 2024
Paola, Hannes
- Discussed CORDEX
- Hannes had a bunch of questions – delegated work from Yiling
- CMORising
- CCAM and/or NarCLIM? Other data?
29/2/2024 – weekly meeting 5 2024
Paola, Thomas, Hannes, Claire
- Decommissioning plan for CLEX NCI facilities if CMS data support funding isn’t continued to W21C
- 300TB data (ua8, rr7 ongoing data mgmt)
- hh5 conda environments and storage
- There’s a lot of dependencies on Paola’s NCI ident but if she stops working what happens?
- Better managed data has moved to ia39
- Maybe NRI will give some storage for publishing e.g. AUS2200
- Paola puts a backup of everything published to ks32 to MDSS
- Still need to check issues with ia39 functional user for automated downloads with GitHub Actions (Paola)
- ACDG Big Data
- Most incomplete but be good to draw a line under it (Overview — Working with Big/Challenging Data Collections (acdguide.github.io))
- Paige keen to just make clear what’s incomplete and how people can contribute.
- Point to existing much better resources for learning Dask etc now, and make the focus Australian tools
- Note that ACCESS-NRI Discourse exists but can be harder to find the gems among the discussion sometimes.
- Establish moderators so people can raise issues asking for resources to be included
- ESMValTool
- ACCESS-NRI hosted a community discussion but mostly attended by NRI and NCI folk, I was the only other Australian (Plus LLNL and ECCC)
- Paige Martin now at NRI working with Romain, brings a good broad overview of tools used elsewhere
- Gab Abramowitz to work with NRI to enhance iLAMB
- Thomas writing his own evaluation tools with xarray and intake, xmip, datatree
- Note there are other tools too like PCMDI Metrics Package, icclim, xclim
- Inclusion of computing derived variables e.g. in ocean domain?
- Thomas – aim to share metrics as widely as possible so it’s only written once. Claire – this is the goal of ESMValTool.
- Need to write in one of our books – Big Data or Governance – how to write pluggable code for other tools, not specific to COSIMA or ESMValTool or whatever.
- NCI data services working on intake to improve internal QA/QC
- intake-esm not really maintained, Anderson has another job now, Dougie needed some changes made and it was not very maintained, may not work with intake2. Who can look after this in the future?
- Hannes is using intake-spark – less curated, scrape all metadata from all netCDF files, should be more robust to intake2 transition
- Scrape all netCDF files for each publication, put into parquet files, with intake can open easily
- Who should take care of intake-esm catalogues? “who loses the staring contest”? How to maintain confidence in cataloguing?
22/2/2024 – weekly meeting 4 2024
Paola, Claire, Chloe
- CSIRO has uninstalled Zoom from our laptops, tedious
- ACCESS-NRI jobs – 3 team lead positions advertised
- Paola would be well suited but has concerns about applying
- ACCESS testing – Martin is pivotal, doco/process isn’t clear
- One of the jobs is TL of Ocean modelling but that team has always existed, just doesn’t have a team lead
- Secondments?
- ACDG Cross-Inst data sharing report
- Paola to send to Andy P, get his okay then send to Angela for publishing
15/2/2024 – weekly meeting 3? 2024
Paola, Gen, Claire
- Regridded data for both CMIP and CORDEX are available in BoM project lp01
- ACDG Cross-Institutional Data Sharing – Ethics approvals all done now and we’re good to publish the report
- Good lesson in needs for approvals and processes!!
- Chloe has also gone through the same process for her NCESS consultation
- ACS reference data
- CMORPH (was it?) data needed updating in ia39, had accidentally hardcoded the 00:00 time so needed to fix that.
- There’s a lot of files so will need concatenation when Paola has time
- Tried to use the functional user on Gadi through github actions but it didn’t work. Need to check if the key is working.
- ACDG Metadata portal
- Need to add more records but otherwise going well
- ACDG Governance book
- Close to final now, should arrange another meeting
- Tidy up ‘Create’ section and ‘Publishing’
- ACDG Big Data
- Paige is back in Australia now, we should spin this back up?
25/1/2024 – weekly meeting 1 2024
Thomas, Claire, Gen, Chloe. Apologies: Paola
- 2024 is already busy – and tomorrow is a holiday!
- AGCD consolidation on hold
- Data backups
- No backup strategy still for published CMIP5/6 data but there is a backup of the pre-CMORised CMIP6 data on CSIRO /datastore
- xv83, ia39, hq89 have a lot of storage but it’s stretched and there’s no backup for any of it, and not all is urgently needed on disk
- Stream data to tape when publishing to make a backup copy and delete from work disk
- Also asked to publish Qld CCAM data through ACS storage
- Initial inquiries with IM&T “Non-Standard Requests Team” quoted $330k p.a. to store PBs of data which seemed odd so we went to Joseph Antony, who confirmed we can indeed use /datastore – give feedback via Gareth maybe???
- Follow up with Steve and Gareth established we can stream data in parallel to datastore using Globus after resolving some issues (first tests were about 1/10 the speed of the parallel rsync approach). Steve McMahon has been doing some testing and we are good to proceed with backing up CCAM data. Quoted 4PB in 5mo. Our goal was 1PB/3mo so this is good.
- Figure out what to do with CMIP data
- Gen not permanent yet so can’t rock the boat at BoM 🙂
- We are now hosted on WordPress as public Confluence access had to be removed
- Intake
- Gen has made good headway in the last couple of weeks
- In lp01, some catalogue external stuff, some are the lp01 regridded output data.
- This is Gen’s ACS focus but there is overlap with NESP
- Cataloguing BARRA2, BARPA and CCAM data (from xv83) – but CCAM will move to hq89 next week for a few forcing datasets, will be equivalent to py18 for BARPA.
- BARPA starting with historical for all models, CCAM starting with a few models (ERA5, ACCESS) for all scenarios
- Gen skilling up to take over Francois’ data processing role
- Intake catalogues do not replace good file structure and metadata!! Still needed and it’s needed just to build the catalogue too!
- Claire – Kerchunk is sometimes useful too
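Worked arithmetic for the /datastore backup item above (added for illustration, not from the meeting): comparing the quoted 4PB in 5 months against the 1PB per 3 months goal as sustained transfer rates. Decimal petabytes and a 30-day month are assumptions, and `sustained_rate_mb_s` is a hypothetical helper name.

```python
PB = 1e15                            # bytes in a decimal petabyte (assumption)
SECONDS_PER_MONTH = 30 * 24 * 3600   # 30-day month (assumption)


def sustained_rate_mb_s(petabytes, months):
    """Average rate in MB/s needed to move `petabytes` in `months`."""
    return petabytes * PB / (months * SECONDS_PER_MONTH) / 1e6


quoted = sustained_rate_mb_s(4, 5)   # what was quoted
goal = sustained_rate_mb_s(1, 3)     # what we were aiming for
print(f"quoted: {quoted:.0f} MB/s, goal: {goal:.0f} MB/s, ratio: {quoted / goal:.1f}x")
```

Whatever month length is assumed, the ratio is the same: the quoted rate is 2.4 times the goal.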