Meeting notes 2024

2024 meeting notes and ongoing actions…

18/7/2024 – weekly meeting 18 2024

Paola, Jo, Claire, Thomas

  • Paola attending NCI training this afternoon
  • NCI GeoNetwork change
    • Can now add records without creating a DOI
    • This is appropriate for listing ia39 as a collection and its containing datasets
  • NRI data
    • Kelsey confirmed that a data focussed role won’t be created for a while
    • Thinking of putting someone on a contract, but the new EBA is a problem for that
    • Will hire Clare Richards (BoM retiree) as a casual to do high level data governance
    • Clare focussed at the licence level, not technical level
    • Intake catalogue builders (e.g. those built by Dougie) are quite fixed and not necessarily compatible with how other groups might do it
    • Catalogue builder needed for every version and could need a different one for ESM1.5, OM2, OM3 etc. 
    • Needing to build a class every time is heavy-handed; CLEX just used a regex instead of trying to make the data more uniform.
    • NRI started cataloguing at an earlier level in the modelling process, which has some pros but makes it difficult for other data
    • Paola hoped to use them in MoPPER but can’t as too much additional info is needed from the user
    • Raw ACCESS output has all variables in one file (need to look for specific fields at least in the UM) – replicated files, add mapped value in Intake catalogue for MoPPER output
      • Things coming out of ARCHIVER should be okay. AUS2200 could be more difficult but they were already preprocessed out. Don’t know if it’ll work for eg ACCESS-ESM straight out of the model – but might accidentally pick up Restart files.
      • Makes more sense to operate on cleaned Archiver-type output than raw UM output.
      • One command to create mappings. From this, can post-process the data.
      • One command to create intake catalogue.
    • Getting someone seconded from elsewhere in ANU to work on Intake catalogues. Dougie is in (currently leading) the Ocean Modelling team; he did the Intake catalogues on his own initiative.
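
The regex approach mentioned above (instead of a catalogue-builder class per model version) can be sketched in a few lines of Python. The filename convention and field names below are hypothetical, not MoPPER's or CLEX's actual pattern:

```python
import re

# Hypothetical filename convention: <model>_<frequency>_<variable>_<YYYYMM>.nc
PATTERN = re.compile(
    r"(?P<model>[A-Za-z0-9.-]+)_(?P<frequency>mon|day|6hr)_"
    r"(?P<variable>[A-Za-z0-9]+)_(?P<date>\d{6})\.nc$"
)

def catalogue_entry(path):
    """Parse one filename into a catalogue record, or None if it doesn't match."""
    m = PATTERN.search(path)
    return {**m.groupdict(), "path": path} if m else None

paths = [
    "output/ACCESS-ESM1.5_mon_tas_185001.nc",
    "output/restart_dump.nc",   # non-matching files (e.g. restarts) are skipped
]
entries = [e for e in map(catalogue_entry, paths) if e]
```

The appeal is exactly what was raised in the meeting: one regex per naming convention, no class per model version, and files that don't match (like restarts) simply fall out.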

27/6/2024 – weekly meeting 17 2024

Thomas, Paola, Hannes, Jo, Claire

  • Object storage
    • Pawsey Acacia training on Tuesday was attended by Paola, Claire, Thomas and Hannes (and also Michael Sumner)
    • Introduction to how object storage works
    • Pros and cons – parallel write very attractive, not really possible with netCDF
    • Be good to collate into a CLEX blog post of info relevant to the climate community
    • Thomas follows Pangeo and developments by Ryan Abernathey & Joe Hamman closely
    • When using zarr on posix filesystems you should definitely use zipstore to reduce inode problems
    • Hard to know exactly what the problems are from a systems perspective
    • As a community develop shared experiences
    • Thomas has a poster at ACOMO-COSIMA next week and will be giving a talk to COSIMA in August (8th)
    • Conceptual understandings are often lacking in the community – 2 copies of the data if you have both zarr and netCDF
    • Paola to explore nczarr in netCDF 4.9.2 on Gadi (following Claire’s notes from 4.8.0 testing in the Big Data book) – Page not found · GitHub Pages (acdguide.github.io)(?)
    • Object store cheaper than posix -> 10% power consumption means much more environmentally friendly and can have much more storage
    • Understood that object metadata is JSON, so it was surprising to learn that you have to retrieve the whole object to update just the metadata, and that metadata can be hard to retain, which is concerning. We want to be able to edit metadata without rewriting files.
    • Pawsey seem to be a bit ahead of NCI in terms of object storage, but it’s viewed as “warm tape”, whereas we’re keen to be hitting it and computing directly against it without needing to stage to /scratch
    • Pawsey training called in a storage expert to come and answer questions which was great. He indicated they’ll look at putting a database in front of the objects for searching and additional metadata handling
  • Welcome Jo Croucher!
    • Jo works in the Data Collections team at NCI with Hannes
    • Focus on data publishing and NCI data catalogue, and onboarding data
    • Started in health research and retrained as a librarian, worked in that area for a number of years before moving to NCI
  • Voila and remote data access
    • Portal-like jupyter lab scripts
    • Hannes wasn’t able to get it working to an acceptable level for today, but Nigel may join us when he gets it working well enough
    • ipywidgets with maps: look at available geophysics data, draw a polygon, then identify the data available in that region and load it into a dataframe using Intake.
    • What are portals really for?! This could replace some simpler portals
    • We talked about this to ease finding satellite data on Gadi
    • You choose the collection at present
    • AODN data matching use case (model validation – Blake Seers)
      • AODN hackathon recently for CARS data, want to directly access AODN data via parquet on S3
    • The future is probably not accessing data via THREDDS/OPeNDAP; it is fundamentally very functional, but S3 offers greater performance possibilities
    • netCDF = standards compliance, FAIR; zarr-like = much more efficient for storage and retrieval for serving/remote access
      • But note that 10 years ago data was all netCDF but metadata was still super poor. Tools writing appropriate attributes directly was really the turning point.
    • We want to know full provenance in our data.
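
On the zipstore point above: a Zarr store is essentially a key→bytes mapping, so packing all the small chunk objects into one zip archive collapses thousands of inodes into a single file. A stdlib sketch of the idea (zarr's own `ZipStore` is what you'd actually use; the array name and chunk count here are invented):

```python
import io
import zipfile

# Pretend these are the chunk objects a zarr array would write: one key per
# chunk, which on a posix filesystem means one inode per chunk file.
chunks = {f"temperature/0.{i}": bytes(64) for i in range(1000)}

# Packed into a single zip archive, the whole store costs one inode.
buf = io.BytesIO()
with zipfile.ZipFile(buf, "w") as zf:
    for key, data in chunks.items():
        zf.writestr(key, data)

# Every chunk is still individually addressable by its key.
with zipfile.ZipFile(buf) as zf:
    names = zf.namelist()
    first = zf.read("temperature/0.0")
```

Random access by key is preserved (zip central directories support it), which is why zipstore works for read-heavy analysis while sidestepping inode quotas.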

20/6/2024 – weekly meeting 16 2024

Hannes, Claire

  • Checking netCDF data
    • When NCI onboards new datasets Hannes tries to check them for CF, ACDD, CORDEX etc standards as much as possible.
    • Can you use CMOR? No
    • IOOS compliance-checker and cfchecker both stop at CF-1.8
    • Commented on an open issue to query this – there was a ticket on CF-1.9 and 1.10, closed last year saying support was coming https://github.com/ioos/compliance-checker/issues/972
  • Data access/sharing across institutions
    • It’s difficult for various technical reasons – posix storage limitations, S3 auth limitations, THREDDS availability…
    • GA data often duplicated or apparently duplicated across storage platforms and it’s hard to be sure
    • Working on a tool (Voila?) to visualise data on maps in Jupyter notebooks – to demo next week
  • Hannes on leave for 6 months, visiting Germany for 2 months, next week will be last meeting
    • Adjust meeting time so Jo can attend?
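
While the compliance checkers catch up with newer CF versions, a crude first-pass attribute check can be scripted by hand. The sketch below is nothing like a real CF check: the required-attribute lists are illustrative only, and the plain dicts stand in for what `netCDF4.Dataset` would expose:

```python
# Illustrative-only attribute lists - a real CF check does far more than this.
REQUIRED_GLOBAL = ("Conventions", "title", "source")
REQUIRED_VARIABLE = ("units", "standard_name")

def missing_attrs(global_attrs, variables):
    """List missing attributes, given plain dicts standing in for what
    netCDF4.Dataset would expose."""
    problems = [f"global:{a}" for a in REQUIRED_GLOBAL if a not in global_attrs]
    for name, attrs in variables.items():
        problems += [f"{name}:{a}" for a in REQUIRED_VARIABLE if a not in attrs]
    return problems

issues = missing_attrs(
    {"Conventions": "CF-1.8", "title": "demo dataset"},
    {"tas": {"units": "K"}},
)
```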

13/6/2024 – weekly meeting 15 2024

Thomas, Claire, Paola, Hannes

  • Zarr file management
    • Use of Zarr zipstore is critical – v45 went over quota yesterday due to a zarr that was 700k inodes.
    • CMS had a difficult help ticket, worked out it was a zarr problem, but user didn’t appreciate how many files it was creating. Didn’t really understand why they were doing what they were anyway – following online examples. Thomas provided advice.
    • It’s not only the use of zarr, the chunking is critical – it’s very easy to end up with much smaller chunks than is efficient for memory anyway, which creates crazy numbers of inodes for a small file.
    • Thomas will be giving a poster at COSIMA-ACOMO and other presentations in the coming months about the importance of optimising performance through use of zip store, appropriate chunking strategies, etc.
    • Note that Zarr’s inode cost is only an issue on posix filesystems – object stores handle them very well. National HPC centres need to offer object stores then we can leverage these hybrid systems of HPC nodes with object storage.
      • ROI for NCI to offer object store – why not currently available and is it in the roadmap?
    • Pawsey is running an Acacia Object Storage Workshop on 25 June – a comprehensive 5.5-hour online session on using their Acacia object store: fundamentals, hands-on practice, and expert guidance on optimising data management with Acacia
      • Paola keen to upskill ahead of job search
      • This training looks useful for all of us to attend.
    • Paola previously used Mediaflux (object storage) for weather@home work, learning curve but made data retrieval much easier.
    • ArrayLake is where it’s at e.g. in the US, a lot of potential – Thomas: Random dump of links: 
      https://earthmover.io
      Arraylake: A Cloud-Native Data Lake Platform for Earth System Science
      https://youtu.be/tlACkUYYu7A?si=NjxHt_tBTRIgDJaq 
    • Solving the problem of cross-institutional data access by getting everything into the cloud.
    • FAIR: zarr not widely supported so not very FAIR, and lose the netCDF metadata standards etc., but making everything cloud-accessible helps enable FAIRness too
    • Zarr/object storage lets you directly access specific parts of a file.
      • Access directly from home, or spin up data-adjacent compute
    • Break away from ivory towers.
    • At NCI, legacy data is an issue.
      • That’s what kerchunk is for
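
The inode blow-ups discussed above (like the 700k-inode zarr in v45) follow directly from chunk arithmetic: the number of chunk objects is the product over dimensions of ceil(shape/chunk). A quick sketch with made-up array shapes:

```python
import math

def n_chunks(shape, chunks):
    """Number of chunk objects (inodes on a posix filesystem, before zipstore)
    a zarr array with this shape and chunking will create."""
    return math.prod(math.ceil(s / c) for s, c in zip(shape, chunks))

# Made-up example: 100 years of daily 0.25-degree global fields.
# Chunked far too finely, one timestep per chunk:
tiny = n_chunks(shape=(36500, 721, 1440), chunks=(1, 240, 240))
# Chunked at a year per chunk, whole horizontal field:
sane = n_chunks(shape=(36500, 721, 1440), chunks=(365, 721, 1440))
```

Same array, same data volume: the fine chunking costs hundreds of thousands of inodes, the coarse one a hundred, which is the difference between a quota blow-out and a non-event.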

30/5/2024 – weekly meeting 14 2024

Paola, Claire, Thomas, Hannes

  • Gadi is down today
    • No unscheduled outage events noted since April… at odds with experiences with filesystems
    • Inconsistent performance from one day to another on the same filesystem/project/files
    • dask is hard to understand, so it might not just be the filesystem; it is hard to tell where slowdowns are, and the inconsistency is the clue.
      • Watching dask dashboard is instructive – can appear to be running but memory stalled – sometimes refreshing page makes a difference and somehow restarts
      • Sometimes just leaving it alone lets it work out – e.g. task completes but dask dashboard shows still running – not synchronised. Can’t trust the dashboard to be accurate and timely.
      • Do prototyping on ARE to tune
      • Changing timeout limits can help but it’s magic
    • Some filesystems are more stable than others, and ours seem to be more affected (gdata1a, 6, 4?)
      • Those with climate data on them seem to be less stable; that could be bad luck for us, or maybe it’s because of our access patterns.
    • Overseas people are moving to object stores, not filesystems. In the US there is a lot of use of commercial cloud. Pawsey might be a good option via Acacia, but it isn’t currently available to us.
    • Zarr just isn’t working on filesystems – Andrey says it’s not supported or recommended by NCI *despite* NCI documentation recommending it!! Sigh. 
      • NCI (and CSIRO) genuinely thought their scratch filesystems were more performant than they seem to be…
    • Gadi refresh/replacement SHOULD be going to tender. New machine must surely support object storage! When will that be?
    • Working with some datasets is difficult because the data itself has inconsistent chunking. If this could be fixed at source (ie NCI reprocess data ingested) this could resolve a bunch of issues.
      • Potentially this could happen following TDS and GeoNetwork updates, NCI (Hannes’ team) are interested in uplifting the on-disk data
    • Similar to kerchunk convo a few weeks ago, we shouldn’t have to fix these issues in software.
      • “All the technology can’t un#(%* your data” – Adam Steer
    • COSIMA data isn’t good, and the group is not receptive to feedback 🙁
    • NCI has good staff but we need to be able to have confidence that they’re working on uplifting their infrastructure to include object storage.
    • Zarr as a back end to netCDF will be great ( https://docs.unidata.ucar.edu/netcdf-c/current/md__media_psf_Home_Desktop_netcdf_releases_v4_9_2_release_netcdf_c_docs_nczarr.html ) when it happens – we need performance, it’s not about the specific format.
    • Should talk about zarr in the Climate Data Guidelines book.
  • Thomas will present some of what he works on next time – Hannes would love to actually see it 🙂
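
On the inconsistent-chunking point above: it's cheap to detect up front, before dask trips over it. A minimal sketch, assuming you've already read each file's chunk shape (the dicts stand in for metadata you'd get from h5py or netCDF4; the filenames are invented):

```python
def inconsistent_chunking(chunk_shapes):
    """Return the files whose chunk shape differs from the first file's.

    `chunk_shapes` maps path -> chunk shape; in real life you'd fill it
    from h5py or netCDF4 metadata rather than by hand.
    """
    paths = list(chunk_shapes)
    reference = chunk_shapes[paths[0]]
    return [p for p in paths[1:] if chunk_shapes[p] != reference]

# Invented example: one year in the series was reprocessed differently.
bad = inconsistent_chunking({
    "pr_1950.nc": (1, 360, 720),
    "pr_1951.nc": (1, 360, 720),
    "pr_1952.nc": (12, 90, 180),
})
```

Running something like this at ingest time is the "fix it at source" option discussed above, rather than every downstream user discovering the mismatch in software.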

23/5/2024 – weekly meeting 13 2024

Paola, Alicia, Thomas, Hannes, Claire. Apologies: Chloe

  • Model documentation
    • Incredibly hard to work out the meaning of variables in raw model output for MOM – both grid and physical vars
    • NRI could add a lot of value by supporting/documenting MOM set up – it’s really challenging for Paola and Thomas who both have strong oceanography backgrounds!
    • ACCESS-OM2 release – but didn’t we already have OM2? This is the spack version? Are component models all the same versions?
    • Role of NRI in modelling – improvements vs focus on supporting configurations. Path from model development into stable roll out
    • A lot of developments, teams and tools still in their infancy.
  • NCI THREDDS server migration
    • Moving from dapds00.nci.org.au to thredds.nci.org.au
    • Update any dependencies – need to find
    • Climatechangeinaustralia is okay

9/5/2024 – weekly meeting 11 2024

Hannes, Paola, Claire, Gen

  • Data archiving
    • Claire backing up 4PB data to CSIRO tape using Globus using Steve McMahon’s management script to release 10TB at a time, getting a throughput approaching 400MB/s.
    • Data stored on MDSS for CLEX for projects that are ending will become inaccessible as no users will belong to the project to be able to see it, so querying NCI if project assignment can be changed on MDSS without having to pull the data back off tape and re-archive.
  •  Data processing – pacemaker experiments
    • ACCESS outputs variables in, say, mol/s but CMOR wants, say, kg/m2/s – can we keep different units or do we have to find a conversion?
      • mol/s is a recognised unit but seems like an assumption is made about the composition of “sea salt” being deposited
      • Matt Woodhouse happy with mol/s but is it okay to leave it in those units
      • MOPPER will pass it through CMOR but doesn’t have to follow CMIP6 convention if not appropriate
  • Claire and Chloe talking about “FAIRest of them all” and data stewardship at CTDIS next week
  • Paola working on a data risk assessment associated with the CoE closure
    • Lots of researchers promise to clean up their data later but then never get to it
    • Need to enforce deadlines to clean up data by!
  • lp01
    • Got a bunch of CCAM and BARPA regridded monthly to 1.5 degrees, have done state and NRM region averages. Consistent with CMIP5 and CMIP6 data ready for comparison, so can demonstrate the value-add of downscaling in CORDEX data
  • Jobs
    • Gen’s contract ends today so working on renewal but might disappear for a few weeks.
    • ACCESS-NRI will advertise a Data Steward position – TL without a team to start with
    • NCI also likely to hire a data specialist with a climate focus – to replace Yiling who is now manager.
    • W21C (‘replaces’ CLEX CoE) will have a position to be advertised in June for technical support in Melbourne, a more ML-focussed one in Sydney, and one at ANU; these will be the CMS replacement roles.
      • W21C will be smaller and have a narrower focus – more about regional modelling and weather events
    • Wilma is CLEX and, due to parental leave, will be with CLEX for longer than it exists, so Paola is managing availability for postdocs and people like her who will keep working for CLEX after it ends and still need access to storage and SUs
    • https://www.21centuryweather.org.au 
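
The mol/s vs kg/m2/s question above is ultimately a molar-mass and grid-cell-area conversion. A hedged sketch – the 58.44 g/mol figure assumes "sea salt" is pure NaCl, which is exactly the compositional assumption flagged in the meeting, and the flux and cell area are invented:

```python
MOLAR_MASS_NACL = 58.44e-3   # kg/mol - ASSUMES "sea salt" is pure NaCl

def mol_s_to_kg_m2_s(flux_mol_s, cell_area_m2, molar_mass=MOLAR_MASS_NACL):
    """Convert a grid-cell deposition flux from mol/s to kg m-2 s-1."""
    return flux_mol_s * molar_mass / cell_area_m2

# 100 mol/s deposited over a hypothetical 10 km x 10 km grid cell:
flux = mol_s_to_kg_m2_s(100.0, 1e8)
```

The arithmetic is trivial; the hard part (as noted) is whether the composition assumption behind the molar mass is defensible, which is a science question, not a units one.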

2/5/2024 – weekly meeting 10 2024

Paola, Thomas, Chloe, Claire

  • CLEX datasets/handover
    • 1PB of user data on /g/data
    • Another PB on MDSS but that’s more reasonable.
    • NRI are willing to look after conda environments (hh5) with NCI, e.g. manage the continuous updates for NCI
      • CLEX hh5 policy is to only install things that are available on conda/pypi, and updated in last 5 years.
    • Haven’t had a chance to revisit the datasets yet since returning from leave.
    • Many datasets may disappear other than those handed to ACS.
    • MERRA2 is the most difficult – it’s big but may not have a large user base.
    • Should do a risk assessment when you open/close a centre – who is CLEX still responsible for (e.g. students/postdocs) that may not still have access to the data they need when clex projects shut down.
    • What happens to the LIEF grant?
      • Underpins hh5 and the ERA5 downloads
      • Some storage got redistributed but what happens now? None is reserved for datasets e.g. data remaining in ua8.
    • No routine clean up and no clear deletion policy anyway
    • Storage grant may have been used for working data which is very risky as it’s impossible to know if it’s actually still needed.
    • Paola trying to get the data under control or at least a plan by the end of May.
    • So much unmanaged data it’s like trying to find a specific thing in a hoarder house (!!!)
    • Very few CMS staff will migrate from CLEX to W21C so there’s a lot of risk as well in terms of people who know the history of the existing data.
      • W21C started in Feb but have not hired any CMS people yet
    • ACCESS-NRI may hire a data person.

11/4/2024 – weekly meeting 9 2024

Hannes, Paola, Claire. Apologies: Chloe, Thomas

  • CLEX ran out of quota at NCI because they didn’t realise the CoE had been extended and quota transfers were meant to be in place.
  • CF checking
    • Hannes has found Kelsey’s old CF checking scripts and a tutorial
  • Clex datasets
    • Currently no plan for ua8, e.g.
    • Note rr7 was never finally closed, but now only contains MERRA2
    • ua8 contains MERRA2 from when rr7 ran out of space
    • “It’s a mess” “oh gees”
      • FROGS indices needs to be published
      • SODA is probably complete
      • Some datasets are downloaded subsets on researcher request and could be deleted
      • CMEMS (older version of AVISO)
      • AUS2200 needs to be published
      • C20C mostly to be deleted but some to be published
      • Large Ensemble data (from ua6) – sync to xv83?
      • Paola to sort out as much as she can and then we’ll review in a few weeks
    • hh5 will persist beyond June 30 and has access to MDSS so some data can be archived there
    • When a project is decommissioned, access to its MDSS is lost – even if the data is still there, it’s inaccessible because you need project membership to see it.
  • xarray 
    • May change the way they handle geospatial coordinates – storing them as floating points creates problems with raster data
    • Mike Sumner submitting issues to Ryan Abernathy
      • When subsetting by ‘nearest’ can get columns of NaNs due to floating point representation
      • Need projection information inherent in the dataset to allow accurate subsetting
      • But it’s sort of against what xarray stands for – it’s not simplifying, it’s more complex to accurately support geospatial rasters
  • Paola away next week, and Hannes unable to attend as acting DS mgr.
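
The NaN columns from 'nearest' subsetting mentioned above come down to floating-point coordinate representation. A stdlib sketch of why exact matching fails and a tolerance-based lookup doesn't (the coordinates are invented, and the `nearest` helper is only a rough analogue of xarray's `sel(method='nearest', tolerance=...)`):

```python
# Coordinates as stored in a file, built by multiplying an index by a step,
# versus the "exact" values a user asks for:
step = 0.1
built = [i * step for i in range(5)]
requested = [0.0, 0.1, 0.2, 0.3, 0.4]

exact_hits = [b == r for b, r in zip(built, requested)]   # 0.3 fails!

def nearest(coords, target, tol=1e-9):
    """Index of the nearest coordinate, or None if outside tolerance -
    a rough analogue of xarray's sel(method='nearest', tolerance=...)."""
    idx = min(range(len(coords)), key=lambda i: abs(coords[i] - target))
    return idx if abs(coords[idx] - target) <= tol else None
```

3 × 0.1 is not the same double as the literal 0.3, so an exact match silently misses that column; a projection-aware or tolerance-aware lookup is what makes the subsetting accurate.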

28/3/2024 – weekly meeting 8 2024

Hannes, Dougie, Claire, Paola, Ben L, Paul B, Gen

  • Intake special!
    • Hannes presenting NCI’s Intake work
      • Building intake catalogues, particularly around climate data
      • Demo scripts to use Intake catalogues
      • Drivers support diverse file types, and the user doesn’t need to know about the formats
    • Dougie has written an ACCESS-NRI Intake catalogue (using a different driver again) which is a catalogue of catalogues, ie a catalogue of Intake sources (could be -ESM, -spark or any other driver, currently only Intake-ESM).
    • Intake v2 coming – Dougie watched a talk and it looks cool but it breaks things – need to pin in environments or updates will pick up the alpha version.
      • Can write transformations that add metadata, do operations on data, etc.
      • “Intake take 2”, there’s some info on Intake readthedocs
      • Paul – individual files not really visible in the catalogue anymore. Harder to read (though they were getting that way anyway). Each file represented by a sort of ID. Introspects files to add to catalogue and dynamically applies. Allows less boilerplate code and removes need for customisation. Get a version controlled file that’s like a meta language implementing functionality from generic frameworks (data sources, readers, transformations).
    • Paola – Previous limitation was choosing a single way to concatenate files, so v2 should allow people to join files differently.
      • Map files in a relational way sounds very good.
      • Move from user-focussed to managing centralised datasets.
    • Dougie – v2 utility for simple datasets is clear, but not sure how it works with more complex things like Intake-ESM which is very bespoke and handles data concatenation and stuff. Additional work will be required to port Intake-ESM to v2
    • Ben/Paul – also looking at kerchunk. Overlaps with Intake. Precomputes some work you’d need to do every time (e.g. concatenation of metadata), and has a plain xarray interface to datasets. Save view of where chunks are on disk. This alleviates time to build mfdataset calls.
    • Dougie – There’s some projects trying to make kerchunk better to use.
      • https://github.com/TomNicholas/VirtualiZarr 
      • https://github.com/NikosAlexandris/rekx
    • Ben – There’s STAC catalogues… can build a Kerchunk catalogue on that but it doesn’t handle inconsistent datasets like variables changing names, coming and going, resolution changes. How do you handle it in Intake?
    • Paola – do a search e.g. for CMIP6 data and try to load them, but if one of the matching datasets has an issue, the whole lot will fail to load.
      • Kerchunk would be very useful e.g. for 10min BoM data
      • IMOS trying to use kerchunk with their S3 storage where it’s very inefficient to retrieve all data to do a subset. But as a service provider they need to use stable libraries.
    •  Ben – issue with apparent non-deterministic behaviour (there’s formal issues around this)
      • Can handle netCDF3 and netCDF4 but can’t concatenate them together.
      • Paul – some of the issues are not kerchunks but actually dask/zarr. E.g. xarray can handle fixing individual files but zarr can’t handle non-uniform chunking along a dimension. Zarr will be changing this which will alleviate some of the challenges of poor data.
    •  Dougie – Consolidate chunks on load – ie define chunking and rechunk on load in order to concatenate dataset components together. Requires a good understanding of the underlying data and its issues – that’s a huge undertaking for e.g. 60 NCI catalogues
      • Need to load the catalogue JSON onto every dask worker which needs a lot of memory
      • Paul – implement parquet support for indexing (convert from JSON)
      • Template repeated parts of paths to reduce JSON
      • Compression layer over text.
      • Ben – Append new data as it comes in – effectively generating a new large JSON then converting to parquet so it’s still problematic on memory
    •  Martin Durant started both kerchunk and intake!
      • Martin is focussed on Intake v2, but there are 43 kerchunk contributors and Rich Signell (USGS) has funding to push kerchunk development along
      • xarray will attempt to merge datasets with different chunking, it in theory helps the user experience but can slow things down so much it maybe isn’t a good thing!
  • Persistency for hh5 environments and catalogues beyond CLEX
    • NRI may take them on, but needs to negotiate with NCI too, e.g. merge with dk92? Possibly a collaboration between NCI and NRI: Claire Carouge is discussing with Paola being co-responsible for a climate python environment with NCI, which would remove NCI’s need to maintain a Python environment, with NRI’s envs endorsed by NCI.
  • CF/ACDD checking
    • Hannes to add the same checker (IOOS) and wrapper that Paola uses in hh5 into dk92. Need to check with Kelsey about the summary python script that NCI used to use – does that still exist, or should we use Paola’s?
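
For reference on the kerchunk discussion above: a kerchunk reference set is just JSON mapping zarr-style chunk keys to (url, offset, length) triples into the original netCDF files. A hand-rolled sketch of the format – paths, byte offsets and lengths are all invented:

```python
import json

# Invented kerchunk-style reference set: zarr metadata stored inline as JSON
# strings, chunk keys pointing at byte ranges inside the original netCDF files.
refs = {
    "version": 1,
    "refs": {
        ".zgroup": json.dumps({"zarr_format": 2}),
        "tas/0.0.0": ["/g/data/xx00/tas_1950.nc", 8192, 262144],
        "tas/1.0.0": ["/g/data/xx00/tas_1951.nc", 8192, 262144],
    },
}

# A reader can plan exactly which byte ranges to fetch, e.g. for all tas chunks:
total_bytes = sum(v[2] for k, v in refs["refs"].items() if k.startswith("tas/"))
```

This is why kerchunk precomputes the work of `open_mfdataset`: the concatenated view and byte locations are resolved once, at catalogue-build time, instead of on every open.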

21/3/2024 – weekly meeting 7 2024 

Paola, Chloe, Claire

  • ACS reference data downloads
    • ia39_download user now working to download data automatically
    • However it would be wise to change ownership of everything to this user, rather than relying on the write group ACL
    • Best method is probably to make everything in the reference collection owned by the functional user, and securely share the secret to access this user with the writers group so any member of the group can access it to modify things as needed, instead of being solely dependent on Paola
    • Code sits alongside the data, but the ia39_download user needs to be able to write to the download logs
    • Downloads (including auth, logging) are handled through GitHub actions.
    • Needed to add ia39_download user to hh5 group.
    • Andrew Wellington very responsive
  • ESMValTool workshop
    • Broadly useful for participants, maybe less so for the organisers, not much focus on specific ACCESS runs
    • Claire focussed on using CORDEX data
    • Alberto working on parsing Clef or Intake searches to produce ESMValTool-ready recipe input
    • Gen and Christine working on Jupyter Notebook integration (be great to be able to use ESMValTool alongside other tools).
  • CORDEX data
    • Mixed states of publication – CCAM and BARPA funded through ACS
      • Qld DES funded to publish a small amount of vars
      • NarCLIM?
      • WA NarCLIM not supported to publish yet
  • CLEX ending at end of this year
    • Some people moving to W21C 
    • Paola can’t guarantee storage beyond the end of this year though
    • hh5 underpins so many researchers!
    • Maybe NRI might take over maintenance, but could be told to use the NCI envs instead. NCI envs do not seem to be sufficiently agile, hh5 is very responsive to new requests, problems and changes. Also support for bespoke locally developed packages.
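
The GitHub Actions download setup described above might look roughly like the sketch below; the workflow name, schedule, secret name and script path are all invented for illustration:

```yaml
# Hypothetical workflow: run the reference-data download on a schedule,
# authenticating as the ia39_download functional user via a repository secret.
name: acs-reference-downloads
on:
  schedule:
    - cron: "0 18 * * *"    # daily
  workflow_dispatch: {}     # allow manual runs too
jobs:
  download:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - name: Run download script against Gadi
        env:
          GADI_SSH_KEY: ${{ secrets.IA39_DOWNLOAD_KEY }}
        run: ./scripts/run_download.sh
```

Keeping the secret in the repository (rather than with one person) is what makes any member of the writers group able to run or fix the downloads, per the ownership discussion above.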

14/3/2024 – weekly meeting 6 2024

Paola, Hannes

  • Discussed CORDEX
    • Hannes had a bunch of questions – delegated work from Yiling
    • CMORising
    • CCAM and/or NarCLIM? Other data?

29/2/2024 – weekly meeting 5 2024

Paola, Thomas, Hannes, Claire

  • Decommissioning plan for CLEX NCI facilities if CMS data support funding isn’t continued to W21C 
    • 300TB data (ua8, rr7 ongoing data mgmt)
    • hh5 conda environments and storage
    • There’s a lot of dependencies on Paola’s NCI ident but if she stops working what happens?
    • Better managed data has moved to ia39
    • Maybe NRI will give some storage for publishing e.g. AUS2200
    • Paola puts a backup of everything published to ks32 to MDSS
  • Still need to check issues with ia39 functional user for automated downloads with Github Actions (Paola)
  • AGCD Big Data
    • Mostly incomplete, but it would be good to draw a line under it (Overview — Working with Big/Challenging Data Collections (acdguide.github.io))
    • Paige keen to just make clear what’s incomplete and how people can contribute.
    • Point to existing much better resources for learning Dask etc now, and make the focus Australian tools
    • Note that ACCESS-NRI Discourse exists but can be harder to find the gems among the discussion sometimes.
    • Establish moderators so people can raise issues asking for resources to be included
  • ESMValTool 
    • ACCESS-NRI hosted a community discussion but mostly attended by NRI and NCI folk, I was the only other Australian (Plus LLNL and ECCC)
    • Paige Martin now at NRI working with Romain, brings a good broad overview of tools used elsewhere
    • Gab Abramowitz to work with NRI to enhance iLAMB
    • Thomas writing his own evaluation tools with xarray and intake, xmip, datatree
    • Note there are other tools too like PCMDI Metrics Package, icclim, xclim,
    • Inclusion of computing derived variables e.g. in ocean domain?
    • Thomas – aim to share metrics as widely as possible so it’s only written once. Claire – this is the goal of ESMValTool.
    • Need to write in one of our books – Big Data or Governance – how to write pluggable code for other tools, not specific to COSIMA or ESMValTool or whatever. 
  • NCI data services working on intake to improve internal QA/QC
    • intake-esm not really maintained – Anderson has another job now; Dougie needed some changes made but it is not actively maintained, and it may not work with Intake v2. Who can look after this in the future?
    • Hannes is using intake-spark – less curated, scrape all metadata from all netCDF files, should be more robust to intake2 transition
    • Scrape all netCDF files for each publication, put into parquet files, with intake can open easily
    • Who should take care of intake-esm catalogues? “who loses the staring contest”? How to maintain confidence in cataloguing?

22/2/2024 – weekly meeting 4 2024

Paola, Claire, Chloe

  • CSIRO has uninstalled Zoom from our laptops, tedious
  • ACCESS-NRI jobs – 3 team lead positions advertised
    • Paola would be well suited but has concerns about applying
    • ACCESS testing – Martin is pivotal, doco/process isn’t clear
    • One of the jobs is TL of Ocean modelling but that team has always existed, just doesn’t have a team lead
    • Secondments?
  • ACDG Cross-Inst data sharing report
    • Paola to send to Andy P, get his okay then send to Angela for publishing

15/2/2024 – weekly meeting 3? 2024

Paola, Gen, Claire

  • Regridded data for both CMIP and CORDEX are available in BoM project lp01
  • ACDG Cross-Institutional Data Sharing – Ethics approvals all done now and we’re good to publish the report
    • Good lesson in needs for approvals and processes!!
    • Chloe has also gone through the same process for her NCESS consultation
  • ACS reference data
    • CMORPH (was it?) data needed updating in ia39, had accidentally hardcoded the 00:00 time so needed to fix that. 
    • There’s a lot of files so will need concatenation when Paola has time
    • Tried to use the functional user on Gadi through github actions but it didn’t work. Need to check if the key is working.
  • ACDG Metadata portal
    • Need to add more records but otherwise going well
  • ACDG Governance book
    • Close to final now, should arrange another meeting
    • Tidy up ‘Create’ section and ‘Publishing’
  • ACDG Big Data
    • Paige is back in Australia now, we should spin this back up?

25/1/2024 – weekly meeting 1 2024

Thomas, Claire, Gen, Chloe. Apologies: Paola

  • 2024 is already busy – and tomorrow is a holiday!
  • AGCD consolidation on hold
  • Data backups
    • Still no backup strategy for published CMIP5/6 data, but there is a backup of the pre-CMORised CMIP6 data on CSIRO /datastore
    • xv83, ia39 and hq89 have a lot of storage but it’s stretched, there’s no backup for any of it, and not all of it is urgently needed on disk
    • Stream data to tape when publishing to make a backup copy and delete from work disk
    • Also asked to publish Qld CCAM data through ACS storage
    • Initial inquiries with IM&T “Non-Standard Requests Team” quoted $330k p.a. to store PBs of data which seemed odd so we went to Joseph Antony, who confirmed we can indeed use /datastore – give feedback via Gareth maybe???
    • Follow up with Steve and Gareth established we can stream data in parallel to datastore using Globus after resolving some issues (first tests were about 1/10 the speed of the parallel rsync approach), Steve McMahon has been doing some testing and we are good to proceed with backing up CCAM data. Quoted 4PB in 5mo. Our goal was 1PB/3mo so this is good.
    • Figure out what to do with CMIP data
  • Gen not permanent yet so can’t rock the boat at BoM 🙂 
  • We are now hosted on WordPress as public Confluence access had to be removed
  • Intake 
    • Gen has made good headway in the last couple of weeks 
    • In lp01: some catalogues cover external data, some the lp01 regridded output data.
    • This is Gen’s ACS focus but there is overlap with NESP
    • Cataloguing BARRA2, BARPA and CCAM data (from xv83) – but CCAM will move to hq89 next week for a few forcing datasets, will be equivalent to py18 for BARPA.
    • BARPA starting with historical for all models, CCAM starting with a few models (ERA5, ACCESS) for all scenarios
    • Gen skilling up to take over Francois’ data processing role
    • Intake catalogues do not replace good file structure and metadata!! Still needed and it’s needed just to build the catalogue too!
    • Claire – Kerchunk is sometimes useful too
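
Sanity-checking the quoted backup rates above (4 PB in 5 months vs the 1 PB/3 months goal, and the ~400 MB/s Globus throughput noted on 9/5): a quick back-of-envelope, using decimal units and assuming 30-day months:

```python
def required_mb_s(petabytes, days):
    """Average sustained throughput (MB/s) to move `petabytes` in `days`,
    using decimal units (1 PB = 1e9 MB)."""
    return petabytes * 1e9 / (days * 86400)

quoted = required_mb_s(4, 5 * 30)   # quoted: 4 PB in ~5 months
goal = required_mb_s(1, 3 * 30)     # our goal: 1 PB in ~3 months
```

The quote works out to roughly 310 MB/s sustained, consistent with the ~400 MB/s Globus throughput seen on the CLEX tape backups, and well above the ~130 MB/s the 1 PB/3 months goal needs.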