Meeting notes archive 2022

22 Dec 2022 climate data weekly meeting 27 2022

Chloe, Claire, Alicia

  • ACS
    • Still on hold for CSIRO at present
    • xv83 and ia39 have migrated to enable larger quotas, but both still seem to be reporting they’re on gdata5?
    • Marcus reports a slowdown of post-processing
  • CMIP7
    • Chloe is on the data request planning committee
    • Lots of meetings across the WCRP going on at the moment
  • Environment BU
    • how final is the current structure? Most staff are expecting further changes in 6 months’ time
    • Chloe may move into Climate Intelligence but does that leave Earth Systems less well represented in the D&D space?

15 Dec 2022 climate data weekly meeting 26 2022

Paola, Sam, Alicia, Chloe, Claire

  • xclim now included in hh5 env
    • CMIP6 pre-processor
    • Utilise the Intake-ESM catalogue – working on how to best do this
    • What to do with shorter time periods?
    • New Intake-ESM version is much better but it had an issue that destroyed our catalogues
      • Better handling of parallel file access
      • conda is not yet seeing the right version, but the hh5 env will be built specifically for the new patch
  • ACCESS-NRI
    • Confusion around use of ARCCSS slack vs CWS-help vs NRI “hive” 
    • Some people think slack is closing, confusion about where people should post questions
    • NRI staff requesting people’s notebooks working with CMIP6 – unclear about NRI remit beyond ACCESS?
    • Support infrastructure meeting to be held later today
    • Hive has the advantage over slack of being long term and public, however what is the specific scope?
    • There may be some people unwilling to post publicly and display lack of knowledge, too.
  • ia39 ACS reference datasets
    • work likely to resume with Sam in January
    • C20C – clean up and move to ia39 (80TB) and free up room for MERRA2 in ua8 instead of split between there and rr7.
      • C20C / 20th Century Reanalysis v2 only used by Michael Grose, v3 used by CLEX researchers
    • MERRA2 can also go to ia39
    • JRA55-do product being ended next year some time

01 Dec 2022 climate data weekly meeting 25 2022

Claire, Paola, Sam. Apologies: Chloe

  • Sam is in CMS team and is upskilling in data.
  • Sam has been working on GPU parallelising python (pytorch) downscaling code with Sanaa
  • Upcoming ECMWF MOOC on ML in climate: “Machine Learning in Weather & Climate” (ecmwf.int)
    • CMS team all registered, so has Claire but haven’t heard back.
  • ACCESS workshop in February? 
  • Joint NCI-CLEX ML training in Melbourne next year
  • CMS converting CMS wiki to Jupyter book
  • ERA5 issue identified, some data needs to be redownloaded.
  • CMIP6 downloads going smoothly.
  • hh5 conda – now includes xMIP, wavespectra, pyinterp
  • CMIP6 processing – xMIP
    • Uses hh5 Intake catalogue but Paola had to make some mods to work at NCI
    • NRI keen to use xmip notebooks – preprocess data for comparison
    • Also ESMValTool contains ESMValCore which contains CMORisers for all major datasets.
  • Dataset Governance WG
    • Paola’s PR to include checklist is ready to go
    • Link to a google doc that can be copied and used interactively for the checklist

24 Nov 2022 climate data weekly meeting 24 2022

Claire, Paola. Apologies: Chloe

  • Climate working groups
    • Big Data restructure has been pushed, can start making edits again now
    • Admonition boxes need titles
    • Preference (e.g. Dougie) to not have concrete examples in the book but rather to point to external sources
    • Need some direction from Paige about where we need to go/what’s required from us.
    • In dataset guidelines explain best practice and reasons for additional steps outside of creating data. https://github.com/ACDguide/Governance/issues/5
  • CLEX blog being rewritten as Jupyter book
  • Data reuse
    • Important for researchers to be clear on the licences and right to use of data they hold
    • e.g. when purchasing data from BoM, ensure data is available for all use and reuse.
    • Process akin to CSIRO’s RDP needed for checking data restrictions (but legal stuff is hard!)
  • CLEX workshop
    • Paola’s poster had links to our books etc but the CMS and comms posters were separate to the science posters
    • Survey asking what people want training in, common one was working with big data – our book covers a lot of it!
    • CMS need to demonstrate their value, some clarity required around CMS vs NRI roles re. training etc

13 Oct 2022 climate data weekly meeting 23 2022

Chloe, Claire, Paola.

  • CSIRO restructure – O&A merging with L&W to become “Environment”
  • CMS team
    • Dale Roberts is new starter at UniMelb, Paola in Melbourne for induction week
    • Helpdesk
      • Request for StackOverflow style helpdesk
      • Try running a Discourse server (set up a VM to host it), test it out with the community at Clex workshop in November
      • Check with NRI that this is okay and doesn’t clash with their plans
      • Possible slack integration
      • Able to exist openly and outside helpdesk, easily transferable
    • Clex future – new centre may happen, could be based at Monash, more model based.
  • Big Data WG 
    • Meeting today but Paola is on a flight, Dougie is on a flight, and Claire has a clash
    • Paige is available but maybe we need to reschedule to discuss Paola & Dougie’s jamboard for book restructure
  • Nectar VMs
    • Monash running security scans on their Nectar node, surprise emails where uplift issues are identified
    • Dale has experience running production servers, so may be able to assist moving Paola’s dev systems into production e.g. InvenioRDM
  • ACS reference data
    • Sam may join the group (Paola’s team), he seems to enjoy data management – doing things in structured ways really helps
    • Also CMIP6 downloads are going really well, Syazwan being very responsive and following up issues.
  • Single Access WG
    • New version of Invenio released, properly supports subject keywords
      • vocabulary based fields
      • Defined in configuration, no code changes needed
    • Installed currently on test instance but not OneClimate yet
    • ElasticSearch will no longer be free so they’ll move the back end to OpenSearch, installed with the new one
    • Caltech library is using the same system and is also interested in Handles instead of DOIs; this looks like it should be possible, and if not it’s a feature they’ll add if they haven’t already

08 Sep 2022 climate data weekly meeting 22 2022

Claire, Paola, Thomas

  • Codebreak help – combining many files 
    • Opening many files at once (e.g. GPCP), building on a blog post from Scott
    • Should be able to update this now using open_mfdataset in parallel (use parallel=True), if performance is better?
    • GPCP has daily precipitation per file, so it’s a nightmare to concatenate and chunk
    • Relevant more broadly because this dataset is in ia39, so resources need to be available more broadly than Clex
    • Does Intake catalogue help cut down the work in finding/indexing files?
    • Contribute to Big Data book
    • Dask bag is no longer required now that we can use parallel.
    • xarray and dask change so fast!
  • Clex CMS
    • ANU hiring on hold, no appropriate candidates at the moment
    • Need external support, e.g. from GFDL who can access NCI and help the users
    • Dale Roberts starting soon with Uni Melbourne but remote in Canberra
  • ACS Code & Data Group
    • Need to meet re permissions, writers groups etc
    • iMERG data downloaded
    • Reminder to ua8 users to switch to ia39
  • Big data book
    • Hard to talk about restructure without seeing it
    • A lot of titles are no longer representative of their contents
    • Concept vs examples
      • Intake catalogue in particular needs explanation AND examples
    • xarray have fixed some Zarr problems
      • Zarr backend now enabled for nczarr
      • PyNIO now given up completely
      • Full support for zarr python
    • How to keep relevant with constant updates to libraries and methods – update strategy for blogs/advice/examples?
  • ARE
    • Paola hasn’t been able to reliably reproduce her delay issues
    • It’s quite unstable (kernel failures) but too unpredictable to submit a help ticket
    • More stable if you request a full node but that’s wasteful and unlikely to be more efficient
    • Training people how to identify their bottlenecks and get them out of the way (do you just need a whole lot of memory for one step? Do that in a separate job?)
    • Trade-off between effort to automate/optimise vs getting a result sooner, less efficiently
    • Compute time costs, but so does human time!

01 Sep 2022 climate data weekly meeting 21 2022

Claire, Chloe, Thomas, Alicia. Apologies: Paola

  • xarray-datatree
    • New tool out of Pangeo https://medium.com/pangeo/easy-ipcc-part-1-multi-model-datatree-469b87cf9114 
    • Alternative to ESMValTool?
    • LLNL rewriting their tool based on xarray
    • Would make things so much easier for CMIP5 work Thomas did earlier
    • Will our hh5 Intake catalogues work to underpin this tool?
    • Will be installed in hh5 soon.
    • Julius’ CMIP6 preprocessing tool now renamed “xmip”, this package has grown out of it.
    • Being built on xarray this may ultimately be included natively in xarray down the track
    • There’s still lots of options to be added and a bunch of questions but it’ll be good to get our hands on it and have a play
    • NRI metrics – can these tools be used or are bespoke things or PCMDI metrics better?
    • Community have by and large transitioned away from Matlab
  • BARPA
    • Do we have access to BARRA-2/BARPA products yet?
    • Some test DRS’d data in ia39
      • /g/data/ia39/australian-climate-service/test-data/CORDEX-CMIP6/output
      • 1979-2007 currently there, recently expanded
    • Vanessa H to meet with Chun Hsu to discuss test data for ACS – follow up with Alicia.
    • Axiom being used by BoM as well as CSIRO
    • Some axiom improvements requested by Claire, Ben will implement them so they’re broadly available.
    • Axiom docs now published https://axiom.readthedocs.io/en/latest/
  • CMIP7
    • Call out for people wanting to contribute to CMIP7 experiment design
    • Chloe is co-leading the data request with Martin Juckes 
    • Is CSIRO contribution to CMIP7 given/likely?
    • CMIP is mentioned throughout the O&A Modelling Assessment document, but despite wide uptake it does not attract income so may again be an issue in the next generation.
    • Value to us joining these committees for personal careers and CSIRO impact.
    • Is cloud storage something we need to design for in CMIP7?
    • Is zarr a file format we should be looking at?
    • netCDF tools and standards vs performance
    • zarr backend for netCDF – development? timeline?
    • zarr zipstore important to minimise inode usage, performance hit still better than netCDFs
    • Can Julia read zarrs? There is a zarr.jl so hopefully progressing, 9 contributors.
  • CSIRO data stewardship CoP
    • Checklists and guidelines working group has two sub-groups:
      • project mgmt and project closure (Chloe is a member)
      • software & data and software inception (Claire is a member)
  • xCDAT
    • https://xcdat.readthedocs.io/en/latest/
    • xCDAT: Xarray Climate Data Analysis Tools
    • NCAR pivoting older tools like NCL to new tools like python and xarray
    • Eventually be able to run any NCL/CDAT commands but have them backed by xarray
    • Right now people are pivoting to xarray but in 5 years time maybe it’ll be Julia?
    • Julia doesn’t have the vibrant community that python has right now though nor the depth of tools/libraries
    • ESMValTool a mix of Iris and Julia which is different

25 Aug 2022 climate data weekly meeting 20 2022

Thomas, Paola, Claire

  • Dask
    • Good discussion about handling opening many netCDF files at once – https://discourse.pangeo.io/t/best-practices-to-go-from-1000s-of-netcdf-files-to-analyses-on-a-hpc-cluster/588/14
    • Strange behaviour noticed on an ARE map-blocks workflow: the task graph and worker reports on the dask dashboard seem very delayed and do not reflect what’s actually happening. The kernel was reported idle while the cell was reported running ([*]), and workers were sitting on 2%, so it seemed like nothing was happening (the memory report might have been right). When the cell completed, the task graph and workers were still updating and showing high activity. Have not seen this behaviour on OOD, is this just an ARE thing?
    • Is effort needed from dask engineers to work directly with NCI, Pawsey etc to resolve these problems? Upskill Australian HPC engineers to support a dask-enabled python data science solution
    • Paola’s problem may be a bit unique, but still issues with ARE may not be unique?
    • Cost recovery is an issue with ARE – important to realise that jobs are expensive so you need to make maximal use of the resources.
    • Is it viable to do a job comparison between OOD and ARE?
    • New Clex member working on ML workflow and learning ARE in the process, was asked to produce training doco instead of getting access to technical staff??
    • Can we form a community of practice on our side instead of going through NCI Helpdesk?
      • Leverage Big Data ACDGuide group
      • Approach ACCESS-NRI in 6 months when they’re more established
    • HPC based dask-enabled python workflows are going to be A Thing for the next 5-10 years so significant efforts to make them work well are worthwhile.
    • It would be good to know who works on the ARE so we can contact them directly via NCI help – someone under Allan?
    • xarray map_blocks is different to dask map_blocks – dask requires specifying meta about the output type (e.g. a NumPy float64 array), and the arguments are slightly different.
      • Make a simplified example for the Jupyter book, and create a full blown example in a blog post or jupyter notebook.
  • Clex staffing
    • ANU position still open, need more candidates!
  • Climate data guidelines
    • Paola’s PR has raised a couple of independent issues around e.g. missing-value advice
    • Ping Alicia to check PR if she has time
  • ACCESS-NRI
    • Need for ACCESS training
    • Arnold has asked Holger for ACCESS-ESM (payu) training
    • Chloe prepared ACCESS training materials last year, publicly visible – https://confluence.csiro.au/display/ACCESS/ACCESS+Training
    • Need a meeting between people who have ACCESS doco and training materials
    • Need agreement between non-NRI people who may contribute doco to the NRI to work out who’s doing what – us, Scott, etc, and coordinating with Dougie, Kelsey, ClaireC etc.
    • Need for community committees
  • Start closing AMOS WGs in coming months and start transitioning these to inputs and references for other groups like the NRI doco, we can’t be on too many groups and committees at once!
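
For the xarray versus dask map_blocks item under Dask above, a simplified sketch of the difference might look like the following. It assumes numpy, dask and xarray are installed; the array shape, chunking and the `double` function are made up for illustration. `dask.array.map_blocks` works on bare arrays and is told the output type through `dtype` and `meta`, while `xarray.map_blocks` hands labelled blocks to the function and takes a `template` describing the result.

```python
import dask.array as da
import numpy as np
import xarray as xr


def double(block):
    """Applied independently to each block; works for numpy and xarray blocks alike."""
    return block * 2


# dask: operates on plain arrays; dtype/meta describe the output without computing it
darr = da.ones((4, 6), chunks=(2, 3), dtype=np.float64)
dask_result = da.map_blocks(
    double, darr, dtype=np.float64, meta=np.array((), dtype=np.float64)
)

# xarray: the function receives a DataArray per block; template describes the result
xarr = xr.DataArray(darr, dims=("y", "x"))
xr_result = xr.map_blocks(double, xarr, template=xarr)

# Nothing has run yet; compute() triggers the per-block work
dask_values = dask_result.compute()
xr_values = xr_result.compute().values
print(dask_values.shape, float(dask_values.sum()), float(xr_values.sum()))
```

Both calls are lazy and give the same numbers here. The practical difference is in what the function sees (a bare array versus a labelled block) and in how the output is described up front.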

11 Aug 2022 climate data weekly meeting 19 2022

Chloe, Claire, Paola. Apologies: Thomas

  • Data & Digital O&A
    • Chloe asked to be on DAP/RDP board – busy but good opportunity
  • CLEX staffing
    • Sam doing very well, quite independent
    • Ramsay learning how to do conda env management
  • ACDGuide Big Data book
    • Paola and Dougie working on how the Computations section could be better restructured
    • No clear logical way to break it up though
    • Paola working on restructure in an adjacent branch
    • Alignment with Scott’s notebook
    • Need someone to review and provide feedback
  • ARE
    • Paola having hassles with memory crashing (normal queue) – frequent kernel failures
    • Claire had some success in bigmembw queue
    • Damien realised he has to manually specify ‘analysis’ or ‘copyq’ in order to be able to interact with git.
    • Paola still struggling with the poorly structured netCDF she was working with 
      • Moved back to OOD but dask dashboard not working now
      • data was written with a geostrophic velocity package (`gsw`)
      • And now the dask dashboard is back again – intermittent problem?
    • Stability issues are a serious problem for training novice staff, really want reliable systems to minimise problems!
  • ia39 reference collection
    • Some progress, kind of slow
    • Trying to re-download for reproducibility rather than copy from ua8
    • iMERG is very slow to convert HDF to netCDF via openDAP and download
    • Have to limit how much we try to do at once – Jenkins jobs can only run for 24hrs
    • Two streams, one to replicate and one for when we’re just doing regular updates
    • Data concatenation isn’t optimised 
  • Single access work
    • Started the wiki describing the governance – how to add and edit entries, search etc.
    • Want to get about 50 catalogue entries in for demo purposes
    • Wiki edits/suggestions welcome: ACDG data portal, ACDguide/invenioClim Wiki (github.com)
    • Need to add how to add new entries
    • Need to add info on communities
    • Will add copyright footer and maybe ACDG logo on right hand side
    • Wiki doesn’t give many choices, give page name and add markdown, it’s easy to work with.
    • Once documentation part is fixed will return to adding records
  • ACCESS-NRI meeting
    • Paola meeting with Kelsey, Aidan and possibly new person 
    • Figure out synergies between NRI and CLEX with documentation and training
    • How to contribute to common documentation source?
    • Frustrating that you need a MOSRS account just to see documentation about ACCESS
    • accessdev Trac wiki could really use a clean up or migration – but Confluence would also be a barrier to community participation
    • wikis don’t provide great experiences
    • Need for formatting languages that are quite natural to move from one to another – html, pdf etc to markdown or whatever, unlike wikis which all have their own slightly different syntaxes
  • O&A Data Science Network
    • Need for Resources wiki
    • Actually there’s already a start on that in the DSN Notebook
    • Would Jupyterbook be a better solution?

04 Aug 2022 climate data weekly meeting 18 2022

Paola, Alicia, Claire. Apologies: Thomas

  • Working groups
    • Governance
      • Alicia has reviewed Paola’s PR, more still to do. Claire’s review outstanding
    • Single access
      • Paola working on the catalogue this week
      • Need to tidy up governance information (jupyterbook, git wiki?) to more clearly demonstrate search functionality etc
      • Templates changed in new community version
      • Subjects are handled differently to resource types – need hierarchical structure but it’s not quite right for our needs at the moment.
      • ElasticSearch based, would be much easier for an expert to fix!
  • CMS staff
    • New starter Sam has commenced, Paola pointed him at big data and governance books to get started
    • DR has accepted the UniMelb position, starting in October
  • Data dramas
    • Clex examples with problematic chunking
    • Badly structured file (poor compression, uncertain netCDF version) was 30GB but reduced to 4GB! Lots of problems with this file, many extra dimensions and incorrect coordinates and dimension ordering. 
    • This file mis-structuring would be a good map block example for the big data book
    • Awareness of dask utilisation and scaling – Roger Bodman GIL problem where he was charged for 48 cores but only getting 1 active at a time. Fixed this with Scott and climtas.nci but may still be using more cores than is sensible for this use case, check efficiency. Roger to work with Damien on fixing and standardising CRE code.
    • Risk of SU wastage with ARE
    • Payu bug causing very inefficient SU usage, requesting much longer jobs than are required.
  • ACCESS-NRI
    • Paola has meeting with Kelsey, Aidan and other NRI TL next week
    • Need to clarify roles of CLEX and NRI in terms of model support etc.
    • Kelsey – documentation and training. Need to sort out CLEX documentation re models
    • NRI not doing direct user support – so who does??
    • Be clear about who is responsible for various concerns
    • Stackoverflow community style support isn’t necessarily the answer, good documentation and helpdesk/slack is a more consistent model
    • Still waiting to find out who will join NRI from CSIRO and BoM but a few likely
    • Need for “open” approaches. COSIMA is an example but it’s not gold standard. Jupyter books are great, however its approach to doco needs improvement
  • aus-ref-clim-data-nci
    • iMERG was in ua8, need to migrate replication to ia39
      • Underlying data is HDF5, get through openDAP to convert to .nc4 on the fly
      • Use PyDAP, pass list of stores in the same way you’d do list of files with open_mfdataset but doesn’t work – because it’s HDF5?
      • Try concatenation but very slow. 
      • Files are fixed time, one timestep per file.
      • CDO was producing 6GB files instead of 0.5GB!
      • NCO working okay but must use ncrcat not ncecat. Also rechunk to multiple times. Very slow to run though.
      • Takes a while to convert, download and process, long running jobs.
      • Once we’ve got the concatenated version we should be able to remove the original files.
        • Any concerns if bad data is found in the record, can you inject new data into the middle of the file or would it be necessary to reconcatenate everything?
        • Concatenate per day so viable to recreate whole day files if errors are found
        • In ua8 did monthly concatenation but daily seems better for time to produce.
      • Need to sort out ia39 permissions/writers group implementation.
  • CPM (convection permitting) modelling growing
    • Is this used for long range projections or only forecasting? High compute costs but very relevant for TCs etc.
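
The GIL example under Data dramas above (charged for 48 cores, only one active at a time) can be put in numbers with a simple efficiency ratio: CPU time actually used, divided by walltime multiplied by the cores requested. This is a minimal sketch. The function name and the one-hour job are made up for illustration, and the inputs would come from whatever the scheduler reports for a finished job.

```python
def cpu_efficiency(cpu_time_s: float, walltime_s: float, ncpus: int) -> float:
    """Fraction of the requested core time that was actually used."""
    if walltime_s <= 0 or ncpus <= 0:
        raise ValueError("walltime_s and ncpus must be positive")
    return cpu_time_s / (walltime_s * ncpus)


# Hypothetical job: 48 cores requested for one hour, but only one core ever busy,
# so the CPU time used equals the walltime.
walltime_s = 3600.0
efficiency = cpu_efficiency(cpu_time_s=walltime_s, walltime_s=walltime_s, ncpus=48)
print(f"{efficiency:.1%}")
```

One busy core out of 48 is about 2% efficiency, so a number that low is a hint that the work is effectively serial and a smaller request would cost far fewer SUs.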

21 Jul 2022 climate data weekly meeting 17 2022

Paola, Alicia, Claire

  • Clex
    • Paola is now CMS team leader
    • Clex have been hiring into the CMS team to replace Claire and Aidan
  • ACD working groups
    • Some minor corrections required to Claire’s PR in Big Data book
      • ARE info added along with Gadi
      • Dougie points out restructure is required
      • Need some content on forecast cycle data
    • Paola has a PR in preparation for Governance book
      • Inclusion of data management practices for NCI data projects
    • Enabling cross institutional access
      • Use cases are in now
      • Mind map is a lot of work, synergies are less obvious than we’d expected
      • Things like expertise are noted commonalities though
  • CMIP6 replicas
    • `oi10` is full again.
    • One more GCM required for ACS downscaling
    • Syazwan indicates more quota will be added soon.
    • Are we able to track access times? 
      • works on `/scratch`, but MAS harvesting for Clef may still be affecting `atime` on CMIP data
      • Is use of Intake instead of MAS a potential solution here or does building Intake catalogues also update atime? 
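
On tracking access times: POSIX file systems record a last-access time (`atime`) that Python exposes through `os.stat`. The sketch below reports how long ago a file was last accessed. The helper name is made up for illustration, and whether `atime` is updated on every read depends on how the file system is mounted, which is an assumption to check on each system rather than something these notes establish.

```python
import os
import tempfile
import time
from typing import Optional


def days_since_access(path: str, now: Optional[float] = None) -> float:
    """Days elapsed since the file's recorded last-access time (st_atime)."""
    now = time.time() if now is None else now
    return (now - os.stat(path).st_atime) / 86400.0


# Demo on a throwaway file whose atime is set by hand
with tempfile.NamedTemporaryFile(delete=False) as tmp:
    tmp.write(b"example")
    demo_path = tmp.name

reference_now = 1_700_000_000.0  # fixed timestamp so the demo is reproducible
thirty_days_ago = reference_now - 30 * 86400
os.utime(demo_path, (thirty_days_ago, thirty_days_ago))  # sets (atime, mtime)
age_days = days_since_access(demo_path, now=reference_now)
os.remove(demo_path)
print(round(age_days, 2))
```

Recording `st_atime` before and after a catalogue build or a harvest would be one way to test the question raised above about which tools touch `atime`.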

07 Jul 2022 climate data weekly meeting 16 2022

Paola, Alicia, Thomas, Chloe, Claire

  • NCI reference climate dataset release
    • We are ready to release an announcement about the reference datasets in `ia39`
    • https://github.com/aus-ref-clim-data-nci
    • Soft launch message prepared by Chloe, Damien and Paola
    • Will appear in next CLEX newsletter
    • Want to start migrating CLEX users from ua8 to ia39
    • Other datasets can be added to the managed collection
    • Discussion about posix writers group vs ACLs to manage access
      • posix is clear to less expert users
      • could break NCI accounting (will count toward group usage but not project quota??)
      • ACLs are safer and preferred but less transparent
    • Could add regridded ACCESS-S2 data to this collection (Thomas to follow up)
    • Need doco and guidance materials
    • Add an “Advanced” section for data custodians/stewards to the ACDGuide Governance book
  • ESGF replication
    • Additional data download requests go to Syazwan
    • Working through the ACS downscaling requests
    • DCPP data can be replicated
  • Creating a Jupyter book
    • Paola gave Claire some tips on how to set up new books

30 Jun 2022 climate data weekly meeting 15 2022

Paola, Alicia, Claire, Chloe

  • CMIP data
    • Syazwan now catching up with downloads, NIU slow but getting there
    • How to add new runs – ACCESS Archiver does first step, but confusion about how to proceed to register products
    • Paola has a ticket with Yiling, Chloe helping.
  • Intake catalogues/ia39 reference data
    • Paola is becoming a regex expert and should be paid more! 
    • Now have an intake for replicated data in `ia39` (as well as those in `hh5` for CMIP and CORDEX data)
    • Paola has started a Jupyter book of downloaded data – https://aus-ref-clim-data-nci.github.io/aus-ref-clim-data-nci/intro.html
    • Create catalogue YAML by ingesting DRS/scanning filesystem. May need post-processing.
    • Need to force some fields, e.g. the precip datasets don’t contain the variable “precipitation”!
    • One dataset doesn’t work because intake doesn’t seem to deal well with aggregating CSV files
    • Creating the catalogue isn’t easy but once done functionality is great
    • Demo notebook https://github.com/aus-ref-clim-data-nci/acs-replica-intake/blob/main/acs-replica-demo.ipynb
    • Need a paragraph for CLEX newsletter and also CSC update – CLEX due Wed 5th.
      • Do we want to agree a particular launch date? Available now?
    • Script that syncs README for each dataset to the contents of the book. This means we just need to keep the READMEs up to date and the book will update. New datasets need to be added to the book of course though.
    • Need a document (findable!) that makes all these books, notebooks etc easy to find!
    • Use public readme for organisation in github??
    • Link to each repo from central ACS – needs a README itself! https://github.com/AusClimateService/AusClimateService
  • Github
    • Use of github – wiki, tasks, issues etc. in this project?
    • A project is a collection of tasks/issues
    • e.g. Governance book has projects for each element of the book – kanban board, can add cards from current open issues
    • Way of bunching issues (different to labels I guess, more organised and clearer for collaborators)
    • Misconception among researchers that having code on github is “published”, but still need a DOI!
    • Other options – Bitbucket + JIRA e.g.; InvenioRDM do dev through discord which is great for making it open.
  • Clex restructure coming up
  • ACCESS-NRI taking off, Aidan H will be working on software release candidates
  • Mackallah et al CMIP6 data paper proofs are back, hopefully will be out soon.
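
The catalogue-building steps under Intake catalogues above (scan the filesystem, pull fields out of the DRS with a regex, force fields that the files get wrong) could be sketched roughly as follows. This is not the code used for the ia39 catalogue and it does not use intake itself. The path layout, field names and the `FORCED_FIELDS` override are all hypothetical, chosen only to show the pattern.

```python
import re

# Hypothetical DRS: <root>/<dataset>/<version>/<frequency>/<variable>/<file>.nc
DRS_PATTERN = re.compile(
    r"/(?P<dataset>[^/]+)/(?P<version>v[^/]+)/(?P<frequency>[^/]+)/"
    r"(?P<variable>[^/]+)/(?P<filename>[^/]+\.nc)$"
)

# Hypothetical overrides for datasets that use a non-standard variable name
FORCED_FIELDS = {"DATASET_A": {"variable": "precipitation"}}


def parse_drs(path):
    """Return a dict of catalogue fields for one file path, or None if it does not match."""
    match = DRS_PATTERN.search(path)
    if match is None:
        return None
    record = match.groupdict()
    record["path"] = path
    # Apply any forced fields last so they win over what the path says
    record.update(FORCED_FIELDS.get(record["dataset"], {}))
    return record


paths = [
    "/data/DATASET_A/v1-0/day/precip/file_2000.nc",
    "/data/DATASET_B/v2-1/mon/tas/file_2000.nc",
    "/data/DATASET_B/README.md",  # not a data file, so it is skipped
]
records = [r for r in map(parse_drs, paths) if r is not None]
for r in records:
    print(r["dataset"], r["frequency"], r["variable"])
```

Records like these would still need writing out in whatever form the catalogue tool expects, which is where the post-processing mentioned above comes in.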

26 May 2022 climate data weekly meeting 14 2022

Claire, Paola

  • CESM-LME data
    • Moved from ua6 to xv83
    • Paola had more recent data in ua8
    • The datasets seem to be not quite compatible though, need to be combined and cleaned up.
  • CMIP post-processing/archiver
    • Data movements getting stuck, wasting SUs on job
    • um2netcdf getting stuck
    • If you re-launch the archiver will it skip completed files?
    • ice files problematic?
    • CPU at 2% while memory usage high so it’s just idling
  • Archiving data for past researchers
    • Storage management getting more proactive.
  • Single access WG
    • InvenioRDM has new LTS version
    • Setting up in new larger Nectar (ARDC cloud) instance
    • Need to work through the test ingested items
    • Haven’t yet checked if they fixed some of our issues
    • Changes to communities which might help us
    • Skip next meetings but remind everyone to review some records
    • Still not happy with subjects and subject nesting.
    • Should we look at adding custom fields?
    • ARDC making cloud config a bit more flexible and give access to greater resources by default
    • Invenio added roles – anyone can contribute to public community, reviewers can edit and comment etc.
    • Community members could add discussion about data quality/fitness for purpose etc. e.g. a ‘data curation’ community.
    • Support for custom “sets”
  • post-processing for publishing
    • Chloe’s tool is available 
    • Should still work with people like Paola to prepare data for publication though
  • ACS data
    • Extremes indices downloaded
    • FROGS e.g. still needs some preparation
    • Intake catalogue created, adding to jenkins to keep updated
    • Communicate to the community that the data and catalogue are available
    • Could do a jupyter book for the data collection showing what data are available – way to document that’s user-friendly for scientists to read when we don’t have a web portal to document this stuff.

19 May 2022 climate data weekly meeting 13 2022

Claire, Alicia, Rebecca, Paola. Apologies: Chloe

  • Alicia returns to John’s team on June 1
    • Link current CSA work to ACS work
  • CORDEX 
    • Rebecca is publishing CORDEX data at NCI for Marcus
    • With Kelsey leaving, does anyone at NCI know the process? Yiling? Maybe Chloe can help?
    • Data is temporarily in `xv83` 
  • Data guidelines
    • We need to update the publishing section particularly for preparing for ESGF publication
    • CMORising (APP4 vs axiom/pproc)
    • Process with NCI to transfer to ESGF node and quarantine for checks
    • Chloe has extensive recent experience with ESGF publishing at NCI so may be able to assist on both sides
    • Has the CCAM model been registered in ES-DOC ready for publishing?
  • NCI status
    • Kelsey leaving so some services temporarily on hold while people are re-trained
    • Downloading seems to be continuing (Syazwan) but clef indices may not be updated – Kashif may be working on the database
    • Yiling handling data publishing – including ESGF?
    • Things are probably in flux a bit at NCI
    • Data decommissioning is on hold
  • Data processing from xv83
    • Alicia processing data for Josef (based on CMIP5). Where would that go?
      • Akin to the data in ua6_4 maybe?
    • Do we need a project for data underpinning web services?
    • This is for Tom R’s FCDI – does he have a project specifically for this?

12 May 2022 climate data weekly meeting 12 2022

Claire, Alicia, Paola, Thomas, Chloe, Tim

  • Data staffing status
    • Alicia returning to O&A shortly, working in ACS and continuing engagement in CSA
      • Talk to John about our needs (e.g. ACS data and code mgmt group)
    • ACCESS-NRI
      • Kelsey moving to data lead 
    • Data publishing at NCI going slowly
    • Data replications going slowly (though that is expected!)
  • O&A Modelling review
    • Brains trust composition should include a data presence 
      • Presence from data & digital (“Claire & Clothilde” – Chloe??)
      • Interaction with O&A D&D lead?
    • Need better code and data management
      • recommendation to better fund and support 
  • CMIP6 data prep & storage
    • APP4 tool getting used by Clex now! Yay!
    • Looking for storage ahead of data publishing for “CMIP6+”
    • Antarctic MIP is not part of ESGF but being used there too
    • Nerilie’s stuff can probably go in `p73`
    • Researchers to coordinate with Paola and Chloe – use appropriate structures, vocabularies, document experiments
    • Ensure Chloe and Paola are both watching all relevant CWS-HELP tickets
  • ESCI data post-processing
    • Jacob Weiss contracted for remainder of FY to process (at least some) ESCI data
    • Andrew Dowdy keen to get data processed
  • “just say no” – hah!
  • Chunking
  • ua6 clean-up
    • much of what’s in “processed_for_TimE” is for `ua6_4` authoritative
    • What’s in people’s directories still needs cleaning up
      • duplicates
      • sometimes duplicates that don’t match – incomplete metadata
      • Tim to work with Craig on this
  • Data archiving
    • Rob Bell has useful scripts for moving files around between NCI and CSIRO archives and disk – Tim to share
      • Looking for a maintainer
    • Useful for helping with `/scratch` cleanup?
    • Similar to the `parallel rsync`
    • Chloe’s repo is at https://git.nci.org.au/cm2704/dm_tools
    • Is globus useful for us (yet)?
    • Rob’s criticism: `/scratch` purging policy at NCI appears less well thought-out than CSIRO’s
    • 100-day purging is really tough for long model runs
    • Users end up regularly `touch`ing files.
    • Very limited doco for MDSS at NCI and no guidance about managing scratch purging with MDSS
    • If people start archiving to MDSS won’t that create its own problems with limited capacity!?
  • CSIRO PID service
    • Tim’s PID service that creates a namespace in which you can resolve links to different things – code, data, sensors, etc
      • PID, ROR provide IDs for people, orgs but other things are missing say in instrumentation
    • Managed in a git repo – just pull!
    • Internal and external views so can be used outside of CSIRO
  • Extremes indices
    • Downloaded to `xv83`
    • Paola to build intake catalogue
    • `hh5` intake catalogue includes CMIP5, 6, COSIMA, CORDEX, can we add in here too? 
    • Discuss with Damien whether to include in hh5 catalogue above or make a separate one for it.
    • Could also use intake to remap variable names
  • Climate Mission
    • Using notebooks in cloud, code will be open but put data and stuff in marketplace to sell it
    • Go to website, select inputs and run code in the cloud, deliver products to paying customers
    • Data API (is actually backed by notebooks)
    • Save outputs on the back end so similar requests don’t need to be re-run
    • Tim will demo this soon
  • InvenioRDM data catalogue
    • Scott currently not active as he gets up and running at BoM
    • Can Tim support Paola so she’s not a single point of failure?
    • Getting pretty stable now and fairly rapid to populate.
    • Development is done openly on Discord, makes it really great to see whether issues are being worked on as a user
    • DOI tag shouldn’t be shown if it’s not a real one
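
On the `/scratch` purging point under Data archiving: rather than regularly `touch`ing everything, one option is to list only the files approaching the purge age so they can be archived or reviewed. A minimal sketch, assuming the purge is keyed on last access time and using the 100-day figure from the notes. The function name, the 80-day warning threshold and the example path are hypothetical, and this is not an NCI-provided tool.

```python
import os
import time


def files_near_purge(root, warn_days=80, now=None):
    """Yield (path, idle_days) for files under `root` not accessed in `warn_days`.

    Assumes purging is based on access time (st_atime); adjust if the
    filesystem policy uses a different timestamp.
    """
    now = time.time() if now is None else now
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            path = os.path.join(dirpath, name)
            try:
                atime = os.stat(path).st_atime
            except OSError:
                continue  # file vanished or is unreadable; skip it
            idle_days = (now - atime) / 86400
            if idle_days >= warn_days:
                yield path, idle_days


if __name__ == "__main__":
    # Hypothetical path; replace with your own scratch directory.
    for path, idle in files_near_purge("/scratch/PROJECT/USER"):
        print(f"{idle:6.0f} days idle  {path}")
```

The output could feed an archiving step (e.g. to MDSS) instead of resetting timestamps, though the capacity concern raised in the notes still applies.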

05 May 2022 climate data weekly meeting 11 2022

Thomas, Chloe, Paola, Claire

  • O&A Digital Lead
    • EoIs (internal) open for an O&A Digital lead, but 0.2FTE for remainder of calendar year
    • It’s far too much for that sort of allocation
    • If any of us do end up putting in an EoI, the best we can offer is a white paper by end of year describing what the role would require
    • Additional complication of reducing our project loads to 0.8 next FY
  • ACCESS-NRI and CLEX
    • Claire C moving to NRI to take up leadership of land modelling (CABLE) team
    • Scott’s position in Clex has not yet been replaced (may be advertised now?)
    • ACCESS-NRI likely to take a lot of Clex CMS and related people
    • Canberra based is likely to severely limit talent availability (Melbourne and Hobart experts not willing to relocate)
      • CSIRO staff on secondment may not have to move? BoM too?
    • Claire’s position will be replaced but team leadership likely to go to an existing CMS team member
    • What is the future of Clex after current funding ends?
      • Clex current funding ends June 2024 but probably extended to Dec 2024
    • NRI have a data person lined up – zarr future etc?
    • Data mgmt will need to improve dramatically c.f. COSIMA, good time to be contributing expertise
  • Short term contracting to CSIRO
    • We have options to hire students on contract this FY to help with ACS overload
    • Catherine Gregory is not available
    • Nick Pitman? not sure if he’s available – checking
  • Enabling access WG
    • Started draft report
    • Need to follow up with Paige re Pangeo – Thomas to lead
    • Still no response from NCI
    • ARDC?
    • ACCESS-NRI? (software release team?)

28 Apr 2022 climate data weekly meeting 10 2022

Paola, Claire

  • Climate data working groups
    • Enabling access WG
      • All use cases now in except NCI – Claire to follow up
      • Ian has changed jobs but still involved
      • Need to clarify use of “cloud” nomenclature
      • Check with Paige if Pangeo want to contribute
      • Finalising most use cases as approved
      • Ned to start pulling the report together, 2 sections: one summarising the use cases, the other talking about implications for the climate community
    • Single access 
      • Some issues with Invenio DOI implementation
      • Something has broken some records?? Paige’s software entry maybe?
      • Still need to review the items we are assigned
      • Lots to do here
    • Big data
      • Paige has moved to the US but keen to continue leading the group
    • Data governance
      • Ticking along okay
  • CMIP replication
    • Claire requesting data for CORDEX downscaling
    • Some datasets only partially replicated, needed re-request
    • oi10 quota increased but still limited, still no on-going replication
    • similar hassles with ERA5
  • ACS code & data
    • replication into ia39 picking up
    • ETCCDI extremes indices are being replicated now, some problems with filename conventions
    • Francois has regridded ETCCDI data in lp01 but ia39 will host the full original data.

14 Apr 2022 climate data weekly meeting 9 2022

Chloe, Thomas, Claire

  • CSIRO HPC migration
    • The Canberra Data Centre is being migrated next year <scream>
    • Any time constraints to be communicated to Tim Ho asap
  • Data pain points
    • scratch purging?
    • quota pathway for data replication
    • dataset maintenance – how to implement fixes and corrections? (e.g. ACCESS-S2, ERA5 inconsistent scale & offset factors)
    • research vs data focus – whose responsibility is it to get data right?
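
The scale and offset issue raised under dataset maintenance comes from netCDF packing, where values are stored as integers and decoded as `packed * scale_factor + add_offset`. If the factors differ between files, each file has to be decoded with its own attributes before values are combined. A minimal sketch with made-up numbers (the two dictionaries stand in for files and are not real ERA5 values):

```python
def unpack(packed, scale_factor, add_offset):
    """Decode packed integers: value = packed * scale_factor + add_offset."""
    return [p * scale_factor + add_offset for p in packed]


# Two hypothetical files with identical packed integers but different factors.
file_a = {"packed": [0, 40, 80], "scale_factor": 0.5, "add_offset": 250.0}
file_b = {"packed": [0, 40, 80], "scale_factor": 0.25, "add_offset": 260.0}


def decode(f):
    """Decode one file using that file's own packing attributes."""
    return unpack(f["packed"], f["scale_factor"], f["add_offset"])


# Correct: decode each file separately, then combine the physical values.
correct = decode(file_a) + decode(file_b)

# Wrong: combine the packed integers and decode with file_a's factors only.
wrong = unpack(
    file_a["packed"] + file_b["packed"],
    file_a["scale_factor"],
    file_a["add_offset"],
)

print("correct:", correct)
print("wrong:  ", wrong)
```

The second half of the two lists differs, which is the kind of silent error a dataset fix or correction process would need to catch.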

31 Mar 2022 climate data weekly meeting 8 2022

Paola, Claire, Chloe, Alicia, Thomas

  • ua6 decommissioning
    • Claire has backed up all LME-type data to xv83
    • Currently copying “processed” across – to be discussed with Tim Erwin
    • Need to chgrp data in xv83 to xv83
  • Single access WG
    • InvenioRDM is updated and working
    • Improvements to automatic record ingest
    • WG meeting tomorrow to discuss further
  • ERA5
    • More storage available, NCI looking to download additional data, e.g. potential vorticity
  • Climate indices (Copernicus)
    • Paola is going to download them 
    • NCI tentatively agreed to manage but not in the short term
    • Modified ERA5 download code for indices
    • CDS server upgrade caused outage
    • Downloading to ua8
    • Could mount ia39 or xv83 if needed?
      • Largest is probably FROGS
  • ACS data: xv83/ia39
    • Need to establish policies/procedures, restructure existing
    • Ensure group ownership is correct throughout
    • Where will BARRA/BARPA go?
    • Indices to go in ia39?
    • Anything going under aus-ref-clim-data-nci should sit in one directory to indicate it’s not an Australian contribution
    • Need READMEs with replicated data.
  • Pangeo and object storage
    • Analysis Ready Cloud Optimised (ARCO) efforts
    • Links for quick discussion:
    • There is a working group between these folk and ESGF (good!!)
    • Again lack of connection with Australia but at least we can see these things
    • Thomas will circulate to Mark Gray (Pawsey) following our chat last week
    • Discussion about object storage (metadata indexing) and zarr storage approach (inodes for chunks, zip into single file)
    • How do we account for both approaches in our work when we may be working across posix and object-based systems?
      • zarrs work on HPC (e.g. DCFP ensemble)
      • Experiment, is all DRS metadata contained in file metadata? Can we port some ACCESS Data to new Pawsey system without loss of metadata info?
        • Metadata curation is vital!
      • Need for duplication zarr ↔ netCDF to support publication standards vs optimal performance?
      • nczarr future – hopefully we’ll be able to use zarr as the netCDF back end instead of HDF
      • Need lots of proof of concepts so you’re able to pivot when new standards emerge.
      • But then so many clients want CSV so we know there’s no one answer 
    • Should we cover some discussion of this in the Governance WG document?
  • CCAM atmospheric vs ocean-coupled CCAM

24 Mar 2022 climate data weekly meeting 7 2022

Paola, Claire, Chloe

  • Enabling access WG
    • Pawsey have offered Mark Gray to talk to us
    • Mark will join our WG meeting tomorrow
    • That means we’ll have heard from everyone except NCI
    • BoM have provided 2 use cases (ACORN-SAT and AGCD)
    • Justin works part time and doesn’t work Fridays but he will fill in one for their cloud direction
    • Still to follow up with Sally Lowenstein?
  • Data governance WG
    • Katie accidentally broke the ToC file when she added a new page
    • Use of Main branch problematic – training on branches?
    • Metadata vs data viewpoints – what is a “good” record may be different according to who you are/what you’re looking for
      • A complete set of metadata may not describe data quality, uncertainty etc.
      • Do we have ANY files/records that we would consider “good”?!
  • Persistent ID training
    • Tim E running training on new PID

10 Mar 2022 climate data weekly meeting 6 2022

Paola, Alicia, Chloe, Claire

  • oi10 quota
    • ua6 and oi10 are on different filesystems
    • Claire queried if quota can be transferred if oi10 can be migrated to the next gdata filesystem. No response yet
    • Yiling should be in touch regarding clean up of ua6 deletion folder
    • No ongoing replication but new data requests still being actioned
    • Request 6-monthly meetings to check the project is continuing to deliver what is needed for the researchers
    • Is MAS updating as expected? Are retracted data being deleted from filesystem and removed from MAS still? Probably but not sure?
      • Do we need to go back to looking at intake alternative?
  • NCI quota management
    • Unclear if MDSS has quotas now – maybe soft but not hard?
    • `nci_account -v` used to print further information including funding sources
    • What to do with “dark data”
    • Claire C meeting with NCI tomorrow, will raise some of these issues
    • Data retirement policy??
  • Climate data tutorial
    • Alicia is preparing a tutorial on using CSA data for farmers
    • When to use ensemble means, 90th percentiles, different models?
    • Discussion of CCiA and how to present ensembles, models, storylines
  • CRE data
    • Still using CMIP5 – reproducing CCiA analyses etc
    • Plans to move to CMIP6?
    • Mahesh keen to move CSA to CMIP6 but not until differences are understood scientifically

03 Mar 2022 climate data weekly meeting 5 2022

Alicia, Paola, Thomas, Claire, Chloe

  • xmhw package
    • Paola added to conda, did readthedocs and used github actions, good learning
    • Thomas very supportive and keen to contribute when he has time.
    • Some additions e.g. extremes
    • Likes black to standardise python styling, and flake8 for codestyle
  • ua6 and oi10 quotas
    • ua6 has ~120TB ready to be deleted
    • oi10 has less than 50TB available
    • They may be on different filesystems but we should request deletion and quota transfer
    • Both appear to be on /g/data4
    • There is some CESM data spread across ua8 and ua6.
    • Recall ua6_4 is a separate data project
    • Maybe we can back stuff up to xv83 for ACS (CESM) and sorting (processed) 
      • Claire will start actioning this
    • Claire to contact Kelsey re removal of the deletion folder
    • Tim will probably need to review what’s in processed
  • InvenioRDM single access portal
    • Updated, and a new version next week
    • Paola tested that you can ingest and re-ingest records
    • Dump database and re-add works
    • Checked DAP API
      • Doesn’t return all metadata
      • DAP hosts a lot of tangential climate data so harder to filter than NCI
    • Two processes – modified script for CSIRO DAP API – title, description, URL then review manually to check what gets finally ingested then download the full records (via wget)
    • Communities is in the next Invenio release, how to use them? Manage editing? (e.g. ‘CSIRO’ community to review and edit DAP-derived entries, currently only one person can edit entries they create). Roles don’t work quite right yet without communities in.
  • ACS
    • Paola to work on next datasets
    • Circulated survey to Clex CIs
  • Women in Data Science talk next Monday

24 Feb 2022 climate data weekly meeting 4 2022

Alicia, Thomas, Paola, Claire

  • CLEX staff changes
    • Scott Wales has a job at the BoM starting soon
    • Going to a team environment
    • Finishes this week (tomorrow is last full day!)
    • Will still be involved in `accessdev` (but not jenkins)
  • Regridding tripolar data
    • Some BoM products are difficult to regrid
    • people tend to use bilinear interpolation but sometimes need conservative normed to deal with coastline issues
    • Need `gridspec` file or include in file metadata
    • Scott has notebook examples (referenced in the jupyter book)
    • Add a whole section on regridding to the jupyter book?
    • Need a route to ask BoM for these support files to be produced/circulated – can Scott help here?
  • oi10 and ua6 storage availability
    • oi10 has less than 50TB remaining
    • Automatic replication has been stopped
    • ua6 has at least 120TB of data flagged for deletion, can we ask Kelsey to delete it and move the quota across?
    • ua6 and oi10 may be on different filesystems though so they may not transfer?
    • Claire and Chloe reviewing routine replication requests
  • NCI Updates
    • `/scratch` purging is going to start after all, in April!
    • Adapter scheme to provide merit allocation to cloud and data
      • Step forward – way to apply for data and cloud, but short-lived, no clear view to ongoing data storage
  • ACS survey
    • Damien has circulated but only 9 responses so far
    • Paola still needs to circulate to Clex – base on Damien’s email?
      • Need to explain more in the Clex context wrt ua8
  • zarr and kerchunk

10 Feb 2022 climate data weekly meeting 3 2022

Alicia, Claire, Paola. 

  • NCI-Fujitsu workshop
    • Full week of workshops but only open on Tuesday
    • Claire C and Aidan H invited to other days
    • Different groups have different experiences – other researchers say NCI are good at collaboration
    • What does “collaboration” mean to different people?
    • Justin Freeman talked about needs of HPC-cloud bridging
      • Would make a great use case for the enabling access WG
      • Linda Eitelberg from BoM (senior data analyst, works for Justin) was the only female presenter
        • TC storm surge, tsunami warning system.
        • Database of scenarios, model as needed
        • Jupyter notebook in containers
      • Innovating – prototyping and trialling what services might look like in 5-10 years
      • Talked about Met Office work with MS
      • Maxor – private company working with cloud HPC to deliver forecasts faster, also Climate Vision
      • Microsoft Planetary Computer
      • Microservers, serverless computing
      • What they can and can’t do at NCI
      • Community driven motivation to bridge cloud and HPC
      • Motivation: only pay for what you use
      • Forecast – start simulation when you need without waiting for resource availability
    • Andy Hogg mostly talked about OOD
      • random performance variations
      • NFS data mounts slow
      • shared nodes
      • Cluster scaling clunky but worked okay.
      • Suggest OOD should have head nodes and native data mounts.
    • Afternoon: Companies promoting. We will solve governance etc.
  • CMEMS sea level data
    • New daily data product using a new algorithm, 1993-present in ua8
  • Big data WG
    • Paola has done a lot of work on tools page
    • Alicia reviewed
    • It’s a bit hard to review PR compared to being able to see the whole book (can’t see that until it’s merged unless you run your own book on a local branch)
    • Tim can help with R
    • Damien keen to help review/contribute
    • Paola has started writing about ML packages too
    • Alicia needs to add more to intake to deal with distributed client work.

03 Feb 2022 climate data weekly meeting 2 2022

Alicia, Paola, Claire. Apologies: Chloe

  • Data guidelines working group
    • Data QA/QC – how do we each do it?
      • Chloe’s automatic plots for ACCESS output for quick visual check of t=0, mean trend
      • Flagging when values are out of expected range (but watch for floating point overflows)
      • Animate and view outputs to check they “look” right
      • metrics toolboxes for model evaluation (e.g. PCMDI metrics package, ESMValTool)
    • Sharon Tickell has retirement policy needs too
  • Enabling access working group
    • Ian and Paola have talked to someone from Copernicus
      • CDS procures data, which then goes through a QA process
      • Each dataset has nominated human reviewers
      • Long process after data submission 
        • Work with data as it is – add metadata to make sure there’s enough information to put into a workflow
        • Massage data into a common data model
        • Structure for hypercube for analysis tools
        • If regridding is required it seems to be done on the fly
        • Don’t force data to be on a common structure (unlike ESMValTool) but instead add sufficient information so that it can be read into the common data model
        • Methods to handle global and different regional models
      • Main issues: Funding, sufficient experts, many different formats and variety of data.
      • Informed by policy, political situation (who provides funding), stakeholder needs, as a side effect can provide data to anyone in the world!
      • Show example tenders for datasets
      • Use of python xarray to work with the ‘hypercube’
      • seasonal prediction different process to other data
      • speed for different use cases
  • Big data working group
    • Paola has done a lot of work on the tools page in her local branch, will push before next meeting for review after a bit more tidying.
    • Software licence considerations – often overlooked.
      • Do we check what input licences are before we define the licence on tools we create?
      • Python tool to query licences in dependency tree
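
As an illustration of the licence query idea, the standard library's `importlib.metadata` can walk the installed dependencies of a package and report what each one declares. A minimal sketch only: the function names are hypothetical, it reads self-declared metadata (which may be missing or imprecise), and it is not the specific tool mentioned in the meeting.

```python
import re
from importlib import metadata


def licence_of(dist_name):
    """Best-effort licence string for one installed distribution."""
    meta = metadata.metadata(dist_name)
    # Prefer trove classifiers, e.g. "License :: OSI Approved :: MIT License".
    classifiers = [
        c.split(" :: ")[-1]
        for c in (meta.get_all("Classifier") or [])
        if c.startswith("License ::")
    ]
    if classifiers:
        return "; ".join(classifiers)
    text = meta.get("License-Expression") or meta.get("License") or ""
    # The License field sometimes holds the full licence text; keep line one.
    lines = text.strip().splitlines()
    return lines[0] if lines else "UNKNOWN"


def dependency_licences(root, seen=None):
    """Walk the installed dependency tree of `root`, mapping name -> licence."""
    seen = {} if seen is None else seen
    key = root.lower().replace("_", "-")
    if key in seen:
        return seen
    try:
        seen[key] = licence_of(root)
        requirements = metadata.requires(root) or []
    except metadata.PackageNotFoundError:
        seen[key] = "NOT INSTALLED"
        return seen
    for req in requirements:
        if re.search(r"extra\s*==", req):
            continue  # skip optional extras
        name = re.match(r"[A-Za-z0-9][A-Za-z0-9._-]*", req)
        if name:
            dependency_licences(name.group(0), seen)
    return seen


if __name__ == "__main__":
    # "pip" is only an example root; use the package you plan to release.
    for pkg, lic in sorted(dependency_licences("pip").items()):
        print(f"{pkg:30s} {lic}")
```

Running this before choosing a licence for a new tool gives a quick view of which input licences need a closer look.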

27 Jan 2022 climate data weekly meeting 1 2022

Alicia, Claire, Paola. Apologies: Chloe

  • Data Guidelines WG
    • Katie sent Claire a lot of resources on data retraction/retirement for the guidelines jupyterbook
  • Alicia is working with Josef S on a Qld project similar to CSI through D61 on CCAM data
  • Working with challenging data WG
    • Paola has been working on the python tools page
      • including integrating Damien’s tools survey
      • formatting/layout experimentation
      • “glossary” approach to allow cross-linking
  • ACS reference data
    • quota transfer from xv83 to ia39 has finally happened
    • Can resume work on reanalysis/obs collections
    • Have done the first small collections so can start looking at next priority from bigger datasets now
  • PyAOS Carpentries training at ICSHMO 22
    • Damien and Claire teaching
    • helpers include Chloe, Paige and Holger, but could probably use a few more
    • Might run an internal CSIRO PyAOS workshop in the next 6 months too
  • Single access WG
    • Paola contacted InvenioRDM developers about lack of URL links
    • Scheduled for their next sprint, they acknowledge the record view is rather lacking.
    • Next release is scheduled for February, so presumably then or by April the URLs should be fixed.
    • Paola has been working through Vocabularies issues
      • Spatial resolution, bounding box etc may be supported in API but not the form view
      • Rough degree→km conversions as options
      • Temporal resolutions individually added for sub-daily, but only down to 30min. Obs are 10min?
      • Track/trajectory data type
      • NCAR climate guide may be a good reference for what search options they’ve used in ‘advanced search’
    • Minor changes to front page
  • “Climate in the Cloud”
    • Cloud-based tool for climate education
    • Seems to not be particularly advanced yet
    • UNSW “simple climate model”
    • duck trajectories! 
    • Citizen science projects
  • Women in Data Science panel
    • Paola and Chloe participating
    • Part of larger global event
    • CMS doing a lot of work with ARDC at the moment
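
For the rough degree→km conversions mentioned under the vocabularies work in the Single access WG item, a spherical-Earth approximation is usually enough for a search option. A minimal sketch, assuming a mean Earth radius of 6371 km; the function name is hypothetical and this is not part of InvenioRDM.

```python
import math

EARTH_RADIUS_KM = 6371.0  # assumed mean radius, spherical approximation


def degrees_to_km(resolution_deg, latitude_deg=0.0):
    """Return (north_south_km, east_west_km) for a grid spacing in degrees.

    North-south spacing is constant on a sphere; east-west spacing
    shrinks with the cosine of latitude.
    """
    km_per_degree = 2 * math.pi * EARTH_RADIUS_KM / 360
    north_south = resolution_deg * km_per_degree
    east_west = north_south * math.cos(math.radians(latitude_deg))
    return north_south, east_west


if __name__ == "__main__":
    for res in (0.05, 0.25, 1.0, 2.5):
        ns, ew = degrees_to_km(res, latitude_deg=-35.0)
        print(f"{res:5.2f} deg  ~{ns:6.1f} km N-S  ~{ew:6.1f} km E-W at 35S")
```

One degree comes out at roughly 111 km north-south, so catalogue options could be labelled with rounded values rather than exact ones.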