Understanding CM2 output
To begin this tutorial, it is expected that you have completed the previous module, Using CM2 suites in Rose and Cylc, from which this module picks up directly. It is also recommended that you watch the Getting Started with ACCESS Webinar prior to working through this material if you are not already familiar with ACCESS-CM2. For a technical guide on setting up with NCI and accessdev, please see Setting up for ACCESS-CM2.
In this tutorial we will resolve our failed suite (see Part IV) by learning where error logs are stored in the cylc-run directory, and debugging the failure. We will then learn some tools for restarting failed suites and controlling running suites. Finally, we will explore the cylc-run directory more thoroughly and examine our model output data.
A recording is available in which this tutorial material is demonstrated (5th February, 2021): https://www.youtube.com/watch?v=tw05r9-o_SI
Part V: Cylc error messages and basic suite debugging
In Part IV of Using CM2 suites in Rose and Cylc, our suite installed, ran a host of initial tasks (e.g., compiling the model components and preparing input files), and attempted to initialise its first one-month cycle of the coupled model. At this point, the Cylc task ‘coupled’ (the heart of ACCESS-CM2) failed, and the suite was halted. To investigate the cause of this failure, we need to look at the logs from the suite run, which are contained in the cylc-run directory.
-
Open a new terminal, log into Gadi, and navigate to the cylc-run directory:
Note: Suite logs are generated on Gadi (where the tasks are run); however, these are always copied back into a mirror of the cylc-run directory after each job, so accessdev can also be used for this step (unless a Gadi-accessdev communication error has occurred).

$ ssh -X USER_ID@gadi.nci.org.au
# or: $ ssh -X USER_ID@accessdev.nci.org.au
$ cd ~/cylc-run/u-[suite_name]
$ ls
app/  log/  log.20210129T060522Z/  meta/  ozone.rc  rose-suite.info  share/  suite.rc  suite.rc.processed  work/
# the log subdirectory above is always a symlink to the latest set of logs, which are timestamped
$ ls -l log
lrwxrwxrwx 1 USER_ID p66 20 Jan 29 17:05 log -> log.20210129T060522Z
$ ls log/job/18500101
coupled  fcm_make2_drivers  fcm_make2_um  install_ancil  install_warm  make2_cice  make2_mom
# logs are separated into cycles (internal simulation dates, which for us began on 18500101; Jan 1, 1850), and by 'job' (i.e. Cylc task)
$ ls log/job/18500101/coupled
01  NN
# a further separation into 'attempts' (consecutive failed/successful tasks). NN is always a symlink to the most recent attempt
$ ls log/job/18500101/coupled/NN/
job  job.err  job.out  job.status
# each task produces four log files
$ less log/job/18500101/coupled/NN/job
# this is the Cylc script that wrapped the coupled task. We can ignore this one
$ less log/job/18500101/coupled/NN/job.status
# this file contains information about the aforementioned script, including the PBS job_id
$ less log/job/18500101/coupled/NN/job.err
# the STDERR stream from the coupled task. Many error messages can be found here, particularly from inside the UM itself
# at the very end of this log, we can find a relevant, but ultimately unhelpful message:
2021-01-31T23:18:25Z CRITICAL - failed/EXIT
$ less log/job/18500101/coupled/NN/job.out
# the STDOUT stream from the coupled task, including PBS usage info. Also a common place for error messages, including the one we seek at the end:
???!!!???!!!???!!!???!!!???!!!
ERROR
???!!!???!!!???!!!???!!!???!!!
? Error code: 1
? Error from routine: io:file_open
? Error message: Failed to open file /g/data/access/TIDS/CMIP6_ANCIL/data/ancils/n96e/timeslice_1850/OzoneConc/v1/mmro3_monthly_CMIP6_1850_N96_edited_ancil_2anc
? Error from processor: 0
? Error number: 11
????????????????????????????????????????????????????????????????????????????????
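When you don't know in advance which task or log holds the error, a recursive grep over the log tree is often faster than opening files one by one. Below is a minimal sketch; the directory tree it builds is a mock of the real cylc-run layout, purely for illustration (on a real suite you would run the same grep from ~/cylc-run/u-[suite_name]):

```shell
# Build a miniature mock of the cylc-run log tree (illustration only)
mkdir -p mock-cylc-run/log/job/18500101/coupled/01
printf '? Error from routine: io:file_open\n' \
    > mock-cylc-run/log/job/18500101/coupled/01/job.out

# List every job log containing common error markers
grep -rl -e 'Error from routine' -e 'CRITICAL' mock-cylc-run/log/job
```

This narrows the search to a handful of files, which can then be read with less as above.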
-
Upon investigating the log files, it seems that our ozone ancillary file (which we changed from the default historical timeseries to a fixed 1850 profile) failed to load. Time to explore further:
Note: feel free to peruse the various ancillary files we have generated for our ACCESS-CM2 CMIP6 submission (/g/data/access/TIDS/CMIP6_ANCIL/data/ancils/n96e)

$ ls /g/data/access/TIDS/CMIP6_ANCIL/data/ancils/n96e/timeslice_1850/OzoneConc/v1/mmro3_monthly_CMIP6_1850_N96_edited_ancil_2anc
ls: cannot access '/g/data/access/TIDS/CMIP6_ANCIL/data/ancils/n96e/timeslice_1850/OzoneConc/v1/mmro3_monthly_CMIP6_1850_N96_edited_ancil_2anc': No such file or directory
# this confirms that the file does not exist (permission issues can also cause this)
$ ls /g/data/access/TIDS/CMIP6_ANCIL/data/ancils/n96e/timeslice_1850/OzoneConc/v1/
mmro3_monthly_CMIP6_1850_N96_edited-ancil_2anc
# fortunately for us, the failure was simply due to a small typo in the file name we inserted into the Rose suite. Who would have known?!...
-
Let’s go back to our suite on accessdev and fix this up:
Note: see Using CM2 suites in Rose and Cylc Part III, step 2.
Rose:
i) $ rose edit  # open the Rose GUI
ii) navigate to 'um -> namelist -> Reconfiguration and Ancillary Control -> Configure ancils and initialise dump fields -> 24d9c434'
iii) set 'ancilfilename' to "$CMIP6_ANCILS/n96e/timeslice_1850/OzoneConc/v1/mmro3_monthly_CMIP6_1850_N96_edited-ancil_2anc"
# in this experiment we will fix ozone to 1850 levels
# Save and close the Rose GUI

OR

$ nano app/um/rose-app.conf
# under [namelist:items(24d9c434)], correct the following variable:
ancilfilename='$CMIP6_ANCILS/n96e/timeslice_1850/OzoneConc/v1/mmro3_monthly_CMIP6_1850_N96_edited-ancil_2anc'
$ <Ctrl> + x; y; <Enter>  # to save and exit nano
Part VI: Controlling active suites
You may have noticed that the suite is still technically running (albeit halted currently). We will now learn some Rose/Cylc commands for controlling active suites, and restarting stopped suites.
-
If your Cylc GUI has been closed (or go ahead and close it now) and you are unsure whether your suite is actually still running, you can easily scan for active suites, and reopen the GUI if desired:
$ cylc scan
u-cb956 USER_ID@accessdev.nci.org.au:43157
# my test suite is still active
$ rose suite-gcontrol
OR
$ gcylc u-[suite_name] &
-
For suite updates like this, Cylc does not need to be stopped. Updating an active suite is called a ‘reload’: the suite is re-installed and Cylc is updated with the changes:
Note: many actions can be performed within the Cylc GUI, including putting the suite (or individual tasks) on hold, and even viewing/editing job logs

$ rose suite-run --reload
...
[INFO] u-cb956: reload complete.
# the failed 'coupled' job can be manually re-triggered by right-clicking 'coupled' in the Cylc GUI and selecting 'Trigger (run now)'
# we won't do this now, however...
-
While not necessary for the purpose of getting this suite running, let’s learn how to stop and restart suites:
$ rose suite-shutdown
Really shutdown u-cb956? [y or n (default)] $ y <Enter>
# using this method, Rose will attempt to communicate with any actively running Cylc tasks, and shut down the suite in a safe and consistent manner
# this should work just fine for this example, but sometimes you may need to kill Cylc more directly

OR

$ cylc stop --kill u-[suite_name]
# Cylc will attempt to terminate all tasks on Gadi before stopping the suite; a good way to save on unnecessary SU usage

OR

$ cylc stop --now u-[suite_name]
# Cylc will terminate the suite immediately; however, any active tasks on Gadi will continue unimpeded and must be terminated manually via a terminal logged into Gadi
# if the suite is restarted while tasks are active on Gadi, Cylc will attempt to re-connect with them
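If `cylc stop --now` does leave tasks running on Gadi, their PBS job IDs can be recovered from each task's job.status log and the jobs killed with `qdel`. A minimal sketch using a mocked job.status file (the CYLC_BATCH_SYS_JOB_ID field name matches what Cylc 7 writes, but treat it as an assumption and check your own job.status):

```shell
# Mock job.status file; the real one lives at
# log/job/[cycle]/[task]/NN/job.status (field names assumed from Cylc 7)
cat > mock-job.status <<'EOF'
CYLC_BATCH_SYS_NAME=pbs
CYLC_BATCH_SYS_JOB_ID=12345678.gadi-pbs
EOF

# Extract the PBS job id; the stray task could then be killed on Gadi with:
#   qdel "$job_id"
job_id=$(sed -n 's/^CYLC_BATCH_SYS_JOB_ID=//p' mock-job.status)
echo "$job_id"
# → 12345678.gadi-pbs
```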
-
With our suite now inactive, there are two main ways to get it up and running again:
$ rose suite-run --restart
# this will re-install the suite (i.e. no need for a 'reload') and reopen Cylc in the same state as when it was stopped (you may need to manually trigger failed tasks)
# an alternative restart method, but without the re-installation:
$ rose suite-restart

OR

$ rose suite-run --new
# this will overwrite any previous runs of the suite and begin completely afresh
# WARNING: this will overwrite any and all existing model output and logs; recommended for suite testing only
- The suite should now run for two months, with a re-submission occurring after the first month (ozone redistribution occurs every six months).
You can track the progress of your suite either through the Cylc GUI on accessdev, or via the online tool ‘Cylc Review’ (https://accessdev.nci.org.au/cylc-review/).
Part VII: Model Output Data
While ACCESS-CM2 is running, model output data files move between the share/ and work/ directories (symlinked in the cylc-run directory, but actually located on Gadi’s /scratch disk). At the end of each cycle, restart files (model snapshots) and history files (model output data; the atmospheric component of which is controlled by the STASH) are moved to your archive, also on /scratch.
-
In a terminal logged into Gadi, let’s have a look at our archive:
$ cd /scratch/[project]/[USER_ID]/archive/[suite_name]
$ ls *
history:
atm/  cpl/  ice/  ocn/

restart:
atm/  cpl/  ice/  ocn/
# here we can find restart and history files for each of the 4 main components of ACCESS-CM2: atm (UM), cpl (OASIS3-MCT), ocn (MOM) & ice (CICE)
-
It’s worth knowing what the restart files look like:
Note: Restart files are full dumps of all prognostic fields from each component of ACCESS-CM2, performed at the end of each simulated month.

$ ls restart/atm
cb957a.da18500201_00  cb957a.da18500301_00  cb957.xhist-18500131  cb957.xhist-18500228
# two atmospheric files per month (only *.da* contains model data)
$ ls restart/cpl
a2i.nc-18500131  a2i.nc-18500228  i2a.nc-18500131  i2a.nc-18500228  o2i.nc-18500131  o2i.nc-18500228
# three coupler files per month; one for each direction of information through the coupler (a2i = atmos to ice, i2a = ice to atmos, o2i = ocean to ice)
$ ls restart/ice
iced.1850-02-01-00000.nc  iced.1850-03-01-00000.nc  ice.restart_file-18500131  ice.restart_file-18500228  mice.nc-18500131  mice.nc-18500228
# three sea-ice files per month
$ ls restart/ocn
restart-18500131.tar  restart-18500228.tar
# one tarred ocean file per month, each containing 16 separate files
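Since each component stamps its restarts differently, a quick sanity check after a run is to extract the date stamps and compare them across components. A sketch with mocked file names taken from the listing above (the parsing patterns are assumptions based on those names):

```shell
# Mock a few restart file names from the listing above (illustration only)
mkdir -p mock-archive/restart/atm mock-archive/restart/ocn
touch mock-archive/restart/atm/cb957a.da18500201_00 \
      mock-archive/restart/atm/cb957a.da18500301_00 \
      mock-archive/restart/ocn/restart-18500131.tar \
      mock-archive/restart/ocn/restart-18500228.tar

# Pull the YYYYMMDD stamps out of each component's file names
ls mock-archive/restart/atm/*.da* | grep -o '[0-9]\{8\}'
ls mock-archive/restart/ocn/*.tar | grep -o '[0-9]\{8\}'
# Note the conventions differ: the UM dump carries the date it is valid for
# (the 1st of the next month), while the MOM tarball carries the last day
# of the month just simulated
```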
-
UM restart files are not converted to netCDF by the suite (unlike the history files) but remain in the UKMO-proprietary UM ‘.pp file’ format, and you may need to inspect them. For this, we recommend the tool ‘xconv’, which can also convert UM files to netCDF format:
Note: The Python library ‘iris’, which is useful for analysing and visualising meteorological and oceanographic data sets, is also able to read and write UM .pp files. See Iris 3.8.0.dev126 documentation (scitools-iris.readthedocs.io)
Note: ‘iris’ is in fact the library used for the netcdf_conversion task in ACCESS-CM2 suites
$ /projects/access/apps/xconv/1.94/xconv restart/atm/cb957a.da18500201_00 &
# to call the xconv binary directly from the access project
# the xconv GUI will open in an interactive window

xconv:
i) double click on any field (e.g. field 17: SURFACE TEMPERATURE AFTER TIMESTEP)
# you can now see coordinate data in the right-hand window
# switch between the dimensions (x,y,z,t) to see coordinate values, including the time stamp
ii) select 'View Data' (above the coordinate window)
# to inspect the cell-by-cell values of the field
iii) select 'Plot Data' (next to 'View Data')
# for an auto-generated plot of the field
iv) select multiple fields from the main window (any will do)
# these are the fields that will be saved into a netCDF file in the next step
v) set the 'Output file name' to 'xconv_test.nc' (top of the GUI)
# or any file name you like
vi) ensure that the 'Output format' is set to 'Netcdf' (top of the GUI)
# this should be set automatically
vii) select 'Convert' (to the left, under the main field window)
# your file will be generated within the directory from which you launched xconv
# you can use this method to convert any UM .pp file to netCDF, not just restart files
-
Let’s now have a look at the history data:
$ ls history/atm
cb957a.p71850feb  cb957a.p71850jan  cb957a.p81850feb  cb957a.p81850jan  cb957a.pd1850feb  cb957a.pd1850jan  cb957a.pm1850feb  cb957a.pm1850jan
cb957a.p71850feb.nc  cb957a.p71850jan.nc  cb957a.p81850feb.nc  cb957a.p81850jan.nc  cb957a.pd1850feb.nc  cb957a.pd1850jan.nc  cb957a.pm1850feb.nc  cb957a.pm1850jan.nc
# four UM atmospheric files are created per month: 3-hourly (*.p7*), 6-hourly (*.p8*), daily (*.pd*) & monthly (*.pm*)
# each has a corresponding netCDF file, created by the netcdf_conversion job
$ ls history/cpl
# empty; the coupler does not generate history files, only restarts
$ ls history/ice
iceh_d.1850-01.nc  iceh_d.1850-02.nc  iceh_m.1850-01.nc  iceh_m.1850-02.nc
# two sea-ice files per month: daily (iceh_d.*) & monthly (iceh_m.*)
$ ls history/ocn
ocean_daily.nc-18500131  ocean_month.nc-18500131  ocean_scalar.nc-18500131  ocean_daily.nc-18500228  ocean_month.nc-18500228  ocean_scalar.nc-18500228
# three ocean files per month: daily, monthly & scalar (time-invariant fields)
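With four UM streams plus a netCDF copy of each, it is easy to lose track of whether every month produced a complete set; counting files per stream is a quick check. A sketch with mocked file names from the listing above:

```shell
# Mock the daily and monthly UM streams from the listing above
mkdir -p mock-archive/history/atm
touch mock-archive/history/atm/cb957a.pd1850jan mock-archive/history/atm/cb957a.pd1850feb \
      mock-archive/history/atm/cb957a.pm1850jan mock-archive/history/atm/cb957a.pm1850feb

# Count files per UM output stream; each count should equal the number of
# simulated months (here, 2)
for stream in pd pm; do
    count=$(ls mock-archive/history/atm/cb957a.${stream}* | wc -l)
    echo "${stream}: ${count}"
done
```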
-
Let’s have a closer look at the atmospheric netCDF model output files. First, we need to load the netCDF module into our login instance (enabling us to view netCDF files, among many other tasks); then we can use ‘ncdump’ to view file metadata:
$ module load netcdf
# load just the basic netCDF commands

OR

$ module use /g/data/hh5/public/modules
# this requires access to CLEX's hh5 project on NCI (https://my.nci.org.au/mancini/project/hh5/join)
$ module load conda/analysis3
# load CLEX's 'analysis3' conda environment, which includes many useful modules (primarily Python-focused), including netCDF tools and iris

$ module list
Currently Loaded Modulefiles:
  1) netcdf/4.7.3
OR
Currently Loaded Modulefiles:
  1) openmpi/4.0.2   2) conda/analysis3-20.10(analysis3)
# loading too many modules into a single login instance can easily produce conflicts; { $ module purge } will clear all loaded modules; or just open a fresh terminal

$ ncdump -c history/atm/cb957a.pm1850jan.nc | less
# the -c flag tells ncdump to only read the file header and coordinate information, rather than all of the model data
# I like to pipe ncdump to less, to keep my terminal history visible
-
With a netCDF-converted UM file open (in this case, a monthly data file from the first month of our test run), the first thing we can see are the dimensions:
Note: We will not cover the general structure or usage of netCDF files, however if you are new to netCDF there are many tutorials and descriptions online. A good introduction is: https://pro.arcgis.com/en/pro-app/latest/help/data/imagery/fundamentals-of-netcdf.htm
netcdf cb957a.pm1850jan {
dimensions:
	time = UNLIMITED ; // (1 currently)
	model_theta_level_number = 85 ;  # full model levels (edges of vertical grid cells)
	lat = 144 ;
	lon = 192 ;
	bnds = 2 ;
	model_rho_level_number = 85 ;  # half model levels (centre of vertical grid cells)
	pseudo_level = 5 ;  # sea-ice depths
	pseudo_level_0 = 6 ;  # atmospheric optical wavelength bands
	lon_u = 192 ;
	lat_v = 145 ;
	pseudo_level_1 = 17 ;  # CABLE tiles (land-use)
	pseudo_level_2 = 13 ;  # CABLE PFTs (plant functional types)
	depth = 6 ;  # soil depths
	pressure = 19 ;  # atmospheric pressure levels
-
Followed by the metadata of the monthly STASH fields (variables):
Note: the field names are set according to the STASH field from which they originate, to ensure both clarity and provenance. In this example, we can see metadata for the STASH item m01s00i004 (http://reference.metoffice.gov.uk/um/stash/m01s00i004)

variables:
	float fld_s00i004(time, model_theta_level_number, lat, lon) ;  # corresponds to STASH item m01s00i004
		fld_s00i004:_FillValue = 1.e+20f ;
		fld_s00i004:standard_name = "air_potential_temperature" ;
		fld_s00i004:long_name = "THETA AFTER TIMESTEP" ;
		fld_s00i004:units = "K" ;
		fld_s00i004:um_stash_source = "m01s00i004" ;
		fld_s00i004:missing_value = 1.e+20f ;
		fld_s00i004:cell_methods = "time: mean" ;
		fld_s00i004:grid_mapping = "latitude_longitude" ;
		fld_s00i004:coordinates = "sigma_theta surface_altitude theta_level_height" ;
...
...
data:

 time = -43813.5 ;

 model_theta_level_number = 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20,
    21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31, 32, 33, 34, 35, 36, 37, 38, 39, 40,
    41, 42, 43, 44, 45, 46, 47, 48, 49, 50, 51, 52, 53, 54, 55, 56, 57, 58, 59, 60,
    61, 62, 63, 64, 65, 66, 67, 68, 69, 70, 71, 72, 73, 74, 75, 76, 77, 78, 79, 80,
    81, 82, 83, 84, 85 ;

 lat = -89.375, -88.125, -86.875, -85.625, -84.375, -83.125, -81.875, -80.625,
    -79.375, -78.125, -76.875, -75.625, -74.375, -73.125, -71.875, -70.625,
    -69.375, -68.125, -66.875, -65.625, -64.375, -63.125, -61.875, -60.625,
    -59.375, -58.125, -56.875, -55.625, -54.375, -53.125, -51.875, -50.625,
    -49.375, -48.125, -46.875, -45.625, -44.375, -43.125, -41.875, -40.625,
    -39.375, -38.125, -36.875, -35.625, -34.375, -33.125, -31.875, -30.625,
    -29.375, -28.125, -26.875, -25.625, -24.375, -23.125, -21.875, -20.625,
    -19.375, -18.125, -16.875, -15.625, -14.375, -13.125, -11.875, -10.625,
    -9.375, -8.125, -6.875, -5.625, -4.375, -3.125, -1.875, -0.625,
    0.625, 1.875, 3.125, 4.375, 5.625, 6.875, 8.125, 9.375,
    10.625, 11.875, 13.125, 14.375, 15.625, 16.875, 18.125, 19.375,
    20.625, 21.875, 23.125, 24.375, 25.625, 26.875, 28.125, 29.375,
    30.625, 31.875, 33.125, 34.375, 35.625, 36.875, 38.125, 39.375,
    40.625, 41.875, 43.125, 44.375, 45.625, 46.875, 48.125, 49.375,
    50.625, 51.875, 53.125, 54.375, 55.625, 56.875, 58.125, 59.375,
    60.625, 61.875, 63.125, 64.375, 65.625, 66.875, 68.125, 69.375,
    70.625, 71.875, 73.125, 74.375, 75.625, 76.875, 78.125, 79.375,
    80.625, 81.875, 83.125, 84.375, 85.625, 86.875, 88.125, 89.375 ;
...
...
# press 'q' to exit less
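The bare value `time = -43813.5` only makes sense alongside the time variable's units attribute (visible in the full `ncdump -h` output). Assuming an epoch of 'days since 1970-01-01' (an assumption; confirm against your own file), the value decodes to 12:00 UTC on 16 January 1850, i.e. the midpoint of the averaged month:

```shell
# Decode -43813.5 days relative to an assumed 1970-01-01 epoch (GNU date);
# the half day is expressed as -43813 days plus a further -12 hours
date -u -d "1970-01-01 -43813 days -12 hours" +%Y-%m-%dT%H:%MZ
```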
-
Similarly, for the ocean and sea-ice output files:
Note: MOM and CICE use netCDF format natively, and field names are given by component-specific namelists, which are not covered in this tutorial.

$ ncdump -c history/ocn/ocean_month.nc-18500131 | less
netcdf ocean_month {
dimensions:
	xt_ocean = 360 ;
	yt_ocean = 300 ;
	time = UNLIMITED ; // (1 currently)
...
...
variables:
...
	float sst(time, yt_ocean, xt_ocean) ;
		sst:long_name = "Potential temperature" ;
		sst:units = "K" ;
		sst:valid_range = -10.f, 500.f ;
		sst:missing_value = -1.e+20f ;
		sst:_FillValue = -1.e+20f ;
		sst:cell_methods = "time: mean" ;
		sst:time_avg_info = "average_T1,average_T2,average_DT" ;
		sst:coordinates = "geolon_t geolat_t" ;
		sst:standard_name = "sea_surface_temperature" ;
...
...
$ ncdump -c history/ice/iceh_m.1850-01.nc
netcdf iceh_m.1850-01 {
dimensions:
	d2 = 2 ;
	ni = 360 ;
	nj = 300 ;
	nc = 5 ;
...
...
variables:
	float hi(time, nj, ni) ;
		hi:units = "m" ;
		hi:long_name = "grid cell mean ice thickness" ;
		hi:coordinates = "TLON TLAT time" ;
		hi:cell_measures = "area: tarea" ;
		hi:missing_value = 1.e+30f ;
		hi:_FillValue = 1.e+30f ;
		hi:cell_methods = "time: mean" ;
		hi:time_rep = "averaged" ;
...
...
- Tutorial complete.
Part VIII: Useful links
accessdev trac: https://accessdev.nci.org.au/trac
2016 ACCESS-CM2 Training: https://accessdev.nci.org.au/trac/wiki/access/AccessUserTrainingMar2016
Getting started with ACCESS-CM2 document: http://nespclimate.com.au/wp-content/uploads/2020/10/Instruction-document-Getting_started_with_ACCESS.pdf
CSIRO-ACCESS Research page: https://research.csiro.au/access/
CLEX: https://climateextremes.org.au/
CLEX CMS blog: https://climate-cms.org/
CLEX CMS training videos: https://www.youtube.com/user/COECSSCMS
CLEX CMS Github: https://github.com/coecms/
CLEX Help Desk: cws_help@nci.org.au
NESP: http://nespclimate.com.au/
NCI Gadi User Guide: https://opus.nci.org.au/display/Help/Gadi+User+Guide
NCI Mancini: https://my.nci.org.au/mancini/login
UK Met Office trac: https://code.metoffice.gov.uk/trac/home
UM STASH: https://reference.metoffice.gov.uk/um/stash
Rose command reference: https://metomi.github.io/rose/doc/html/api/command-reference.html
Cylc documentation: https://cylc.github.io/
Cylc tutorial: https://cylc.github.io/cylc-doc/stable/html/tutorial/index.html
ESGF CMIP6 Data search: https://esgf-node.llnl.gov/search/cmip6/
Iris documentation & guide: scitools-iris.readthedocs.io