Understanding CM2 output

To begin this tutorial, it is expected that you have completed the previous module, Using CM2 suites in Rose and Cylc, from which this module picks up directly. It is also recommended that you watch the Getting Started with ACCESS Webinar prior to working through this material if you are not already familiar with ACCESS-CM2. For a technical guide on setting up with NCI and accessdev, please see Setting up for ACCESS-CM2.

In this tutorial we will resolve our failed suite (see Part IV) by learning where error logs are stored in the cylc-run directory and debugging the failure. We will then learn some tools for restarting failed suites, and for controlling running suites. Finally, we will explore the cylc-run directory more thoroughly and examine our model output data.

A recording is available in which this tutorial material is demonstrated (5th February, 2021): https://www.youtube.com/watch?v=tw05r9-o_SI

Part V: Cylc error messages and basic suite debugging

In Part IV of Using CM2 suites in Rose and Cylc, our suite installed, ran a host of initial tasks (e.g., compiling the model components and preparing input files), and attempted to initiate its first one-month cycle of the coupled model. At this point, the Cylc task ‘coupled’ (the heart of ACCESS-CM2) failed, and the suite was halted. To investigate the cause of this failure, we need to look at the logs from the suite run, which are contained in the cylc-run directory.

  1. Open a new terminal, log into Gadi, and navigate to the cylc-run directory:
    Note: Suite logs are generated on Gadi (where the tasks are run); however, they are always copied back into a mirror of the cylc-run directory on accessdev after each job, so accessdev can also be used for this step (unless a Gadi-accessdev communication error has occurred).

    $ ssh -X USER_ID@gadi.nci.org.au
    # or: $ ssh -X USER_ID@accessdev.nci.org.au
    $ cd ~/cylc-run/u-[suite_name]
    $ ls
    app/  log/  log.20210129T060522Z/  meta/  ozone.rc  rose-suite.info  share/  suite.rc  suite.rc.processed  work/
    # the log subdirectory above is always a symlink to the latest set of logs, which are timestamped
    $ ls -l log
    lrwxrwxrwx 1 USER_ID p66 20 Jan 29 17:05 log -> log.20210129T060522Z
    $ ls log/job/18500101
    coupled  fcm_make2_drivers  fcm_make2_um  install_ancil  install_warm  make2_cice  make2_mom
    # logs are separated into cycles (internal simulation dates which for us began on 18500101; Jan 1, 1850), and by 'job' (i.e. Cylc task)
    $ ls log/job/18500101/coupled
    01  NN
    # a further separation into 'attempts' (consecutive failed/successful runs of the task). NN is always a symlink to the most recent attempt
    $ ls log/job/18500101/coupled/NN/
    job  job.err  job.out  job.status
    # each task will produce four log files
    $ less log/job/18500101/coupled/NN/job
    # this is the Cylc script that wrapped the coupled task. We can ignore this one
    $ less log/job/18500101/coupled/NN/job.status
    # this file contains information about the aforementioned script, including the PBS job_id
    $ less log/job/18500101/coupled/NN/job.err
    # the STDERR stream from the coupled task. Many error messages can be found here, particularly from inside the UM itself
    # at the very end of this log, we can find a relevant, but ultimately unhelpful message:
    2021-01-31T23:18:25Z CRITICAL - failed/EXIT
    $ less log/job/18500101/coupled/NN/job.out
    # the STDOUT stream from the coupled task, including PBS usage info. Also a common place for error messages to be printed, including the one which we seek at the end:
    ???!!!???!!!???!!!???!!!???!!!       ERROR        ???!!!???!!!???!!!???!!!???!!!
    ?  Error code: 1
    ?  Error from routine: io:file_open
    ?  Error message: Failed to open file /g/data/access/TIDS/CMIP6_ANCIL/data/ancils/n96e/timeslice_1850/OzoneConc/v1/mmro3_monthly_CMIP6_1850_N96_edited_ancil_2anc
    ?  Error from processor: 0
    ?  Error number: 11
    ????????????????????????????????????????????????????????????????????????????????
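When a job.out runs to thousands of lines, the UM error banner can be pulled out with grep rather than paged through. A minimal sketch; the miniature log below is fabricated so the commands run anywhere, whereas on Gadi you would point grep at log/job/18500101/coupled/NN/job.out:

```shell
# Fabricate a tiny job.out containing a UM-style error banner
# (a stand-in for the real log on Gadi)
cat > /tmp/mock_job.out <<'EOF'
... thousands of lines of PBS and model output ...
???!!!???!!!???!!!???!!!???!!!       ERROR        ???!!!???!!!???!!!???!!!???!!!
?  Error code: 1
?  Error from routine: io:file_open
?  Error message: Failed to open file
EOF

# Print the banner line plus the three lines that follow it
grep -A3 'ERROR' /tmp/mock_job.out
```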
  2. Upon investigating the log files, it seems that our Ozone ancillary file (which we changed from the default historical timeseries to a fixed 1850 profile) failed to load. Time to explore further:
    Note: feel free to peruse the various ancillary files we have generated for our ACCESS-CM2 CMIP6 submission (/g/data/access/TIDS/CMIP6_ANCIL/data/ancils/n96e)

    $ ls /g/data/access/TIDS/CMIP6_ANCIL/data/ancils/n96e/timeslice_1850/OzoneConc/v1/mmro3_monthly_CMIP6_1850_N96_edited_ancil_2anc
    ls: cannot access '/g/data/access/TIDS/CMIP6_ANCIL/data/ancils/n96e/timeslice_1850/OzoneConc/v1/mmro3_monthly_CMIP6_1850_N96_edited_ancil_2anc': No such file or directory
    # this confirms that the file does not exist (permission issues can also cause this)
    $ ls /g/data/access/TIDS/CMIP6_ANCIL/data/ancils/n96e/timeslice_1850/OzoneConc/v1/
    mmro3_monthly_CMIP6_1850_N96_edited-ancil_2anc
    # fortunately for us, the failure was simply due to a small typo in the file name we inserted into the Rose suite. Who would have known?!...
  3. Let’s go back to our suite on accessdev and fix this up:
    Note: see Using CM2 suites in Rose and Cylc Part III, step 2.

    Rose: 
        i) $ rose edit
           # open the Rose GUI
       ii) navigate to 'um -> namelist -> Reconfiguration and Ancillary Control -> Configure ancils and initialise dump fields -> 24d9c434'    
      iii) set 'ancilfilename' to "$CMIP6_ANCILS/n96e/timeslice_1850/OzoneConc/v1/mmro3_monthly_CMIP6_1850_N96_edited-ancil_2anc"    
           # in this experiment we will fix Ozone to 1850 levels
           # Save and close the Rose GUI
    OR
    $ nano app/um/rose-app.conf
    # under [namelist:items(24d9c434)], correct the following variable:
    ancilfilename='$CMIP6_ANCILS/n96e/timeslice_1850/OzoneConc/v1/mmro3_monthly_CMIP6_1850_N96_edited-ancil_2anc'
    $ <Ctrl> + x; y; <Enter>
    # to save and exit nano
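The same one-character fix can also be scripted with sed instead of edited by hand. A sketch using a mock copy of the file (fabricated here so it runs anywhere; in practice the target is app/um/rose-app.conf inside the suite directory):

```shell
# Build a mock rose-app.conf containing the broken ancilfilename entry
mkdir -p /tmp/mock_suite/app/um
cat > /tmp/mock_suite/app/um/rose-app.conf <<'EOF'
[namelist:items(24d9c434)]
ancilfilename='$CMIP6_ANCILS/n96e/timeslice_1850/OzoneConc/v1/mmro3_monthly_CMIP6_1850_N96_edited_ancil_2anc'
EOF

# Correct the typo'd 'edited_ancil' to 'edited-ancil' in place
sed -i 's/edited_ancil_2anc/edited-ancil_2anc/' /tmp/mock_suite/app/um/rose-app.conf
grep ancilfilename /tmp/mock_suite/app/um/rose-app.conf
```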

Part VI: Controlling active suites

You may have noticed that the suite is still technically running (albeit currently halted). We will now learn some Rose/Cylc commands for controlling active suites, and for restarting stopped suites.

  1. If your Cylc GUI has been closed (or go ahead and close it now) and you are unsure whether your suite is actually still running, you can easily scan for active suites, and reopen the GUI if desired:

    $ cylc scan
    u-cb956 USER_ID@accessdev.nci.org.au:43157
    # my test suite is still active
    $ rose suite-gcontrol
    OR
    $ gcylc u-[suite_name] &
  2. For suite updates like this, the suite does not need to be stopped. Updating an active suite is called a ‘reload’, where the suite is ‘re-installed’ and Cylc is updated with the changes:
    Note: many actions can be performed within the Cylc GUI, including putting the suite (or individual tasks) on hold, and even viewing/editing job logs

    $ rose suite-run --reload
    ...
    [INFO] u-cb956: reload complete.
    # the failed 'coupled' job can be manually re-triggered by right-clicking 'coupled' in the Cylc GUI, and selecting 'Trigger (run now)'
    # we won't do this now, however...
  3. While not necessary for the purpose of getting this suite running, let’s learn how to stop and restart suites:

    $ rose suite-shutdown
    Really shutdown u-cb956? [y or n (default)]
    $ y <Enter>
    # using this method, Rose will attempt to communicate with any actively running Cylc tasks, and shut down the suite in a safe and consistent manner
    # this should work just fine for this example, but sometimes you may need to kill Cylc more directly
    OR
    $ cylc stop --kill u-[suite_name]
    # Cylc will attempt to terminate all tasks on Gadi, before stopping the suite; a good way to save on unnecessary SU usage
    OR
    $ cylc stop --now u-[suite_name]
    # Cylc will terminate the suite immediately; however, any active tasks on Gadi will continue unimpeded and must be terminated manually via a terminal logged into Gadi
    # if the suite is restarted while tasks are active on Gadi, Cylc will attempt to re-connect with them
  4. With our suite now inactive, there are two main ways to get it up and running again:

    $ rose suite-run --restart
    # this will re-install the suite (i.e. no need for a 'reload') and reopen Cylc in the same state as when it was stopped (you may need to manually trigger failed tasks)
    # an alternative restart method, but without the re-installation: $ rose suite-restart
    OR
    $ rose suite-run --new
    # this will overwrite any previous runs of the suite and begin completely afresh
    # WARNING: This will overwrite any and all existing model output and logs; recommended for suite testing only
  5. The suite should now run for two months, with a re-submission occurring after the first month (ozone redistribution occurs every six months).
    You can track the progress of your suite either through the Cylc GUI on accessdev, or via the online tool ‘Cylc Review’ (https://accessdev.nci.org.au/cylc-review/).

Part VII: Model Output Data

While ACCESS-CM2 is running, model output data files move between the share/ and work/ directories (symlinked in the cylc-run directory, but actually located on Gadi’s /scratch disk). At the end of each cycle, restart files (model snapshots) and history files (model output data, the atmospheric component of which is controlled by the STASH) are moved to your archive, also on /scratch.

  1. In a terminal logged into Gadi, let’s have a look at our archive:

    $ cd /scratch/[project]/[USER_ID]/archive/[suite_name]
    $ ls *
    history:
    atm/  cpl/  ice/  ocn/
    restart:
    atm/  cpl/  ice/  ocn/
    # here we can find restart and history files for each of the 4 main components of ACCESS-CM2: atm (UM), cpl (OASIS3-MCT), ocn (MOM) & ice (CICE)
  2. It’s worth knowing what the restart files look like:
    Note: Restart files are full dumps of all prognostic fields from each component of ACCESS-CM2, performed at the end of each simulation month.

    $ ls restart/atm
    cb957a.da18500201_00  cb957a.da18500301_00  cb957.xhist-18500131  cb957.xhist-18500228
    # two atmospheric files per month (only *.da* contains model data)
    $ ls restart/cpl
    a2i.nc-18500131  a2i.nc-18500228  i2a.nc-18500131  i2a.nc-18500228  o2i.nc-18500131  o2i.nc-18500228
    # three coupler files per month; one for each direction of information through the coupler (a2i = atmos to ice, i2a = ice to atmos, o2i = ocean to ice)
    $ ls restart/ice
    iced.1850-02-01-00000.nc  iced.1850-03-01-00000.nc  ice.restart_file-18500131  ice.restart_file-18500228  mice.nc-18500131  mice.nc-18500228
    # three sea-ice files per month
    $ ls restart/ocn
    restart-18500131.tar  restart-18500228.tar
    # one tarred ocean file per month, each containing 16 separate files
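The contents of a tarred ocean restart can be listed without unpacking it, using tar -tf. A sketch with a throwaway tarball (the member name below is hypothetical; the real tars hold the 16 MOM restart files mentioned above):

```shell
# Create a throwaway stand-in for restart-18500131.tar
# (member name is illustrative only)
mkdir -p /tmp/ocn_restart
touch /tmp/ocn_restart/ocean_temp_salt.res.nc
tar -cf /tmp/restart-18500131.tar -C /tmp ocn_restart

# List members without extracting; on Gadi:
#   tar -tf restart/ocn/restart-18500131.tar
tar -tf /tmp/restart-18500131.tar
```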
  3. UM restart files are not converted into netCDF by the suite (unlike the history files), but remain in UM-based ‘.pp file’ format (UKMO-proprietary), and you may need to inspect them. For this, we recommend the tool ‘xconv’, which can also convert UM files to netCDF format:
    Note: The Python library ‘iris’, which is useful for analysing and visualising meteorological and oceanographic data sets, is also able to read and write UM .pp files. See the Iris documentation (scitools-iris.readthedocs.io)
    Note: ‘iris’ is in fact the library used for the netcdf_conversion task in ACCESS-CM2 suites

    $ /projects/access/apps/xconv/1.94/xconv restart/atm/cb957a.da18500201_00 &
    # to call the xconv binary file directly from the access project
    # the xconv GUI will open in an interactive window
    xconv:
        i) double click on any field (e.g. field 17: SURFACE TEMPERATURE AFTER TIMESTEP)
           # you can now see coordinate data in the right-hand window
           # switch between the dimensions (x,y,z,t) to see coordinate values, including the time stamp
       ii) select 'View Data' (above the coordinate window)
           # to inspect the cell-by-cell values of the field
      iii) select 'Plot Data' (next to 'View Data')
           # for an auto-generated plot of the field
       iv) select multiple fields from the main window (any will do)
           # these are fields that will be saved into a netCDF file in the next step
        v) set the 'Output file name' as 'xconv_test.nc' (top of the GUI)
           # or to any file name you like
       vi) ensure that the 'Output format' is set to 'Netcdf' (top of the GUI)
           # this should be automatically set
      vii) select 'Convert' (to the left, under the main field window)
           # your file will be generated within the directory from which you launched xconv
           # you can use this method to convert any UM .pp file to netCDF, not just restart files
  4. Let’s now have a look at the history data:

    $ ls history/atm
    cb957a.p71850feb     cb957a.p71850jan     cb957a.p81850feb     cb957a.p81850jan     cb957a.pd1850feb     cb957a.pd1850jan     cb957a.pm1850feb     cb957a.pm1850jan
    cb957a.p71850feb.nc  cb957a.p71850jan.nc  cb957a.p81850feb.nc  cb957a.p81850jan.nc  cb957a.pd1850feb.nc  cb957a.pd1850jan.nc  cb957a.pm1850feb.nc  cb957a.pm1850jan.nc
    # four UM atmospheric files are created by the coupled task per month: 3-hourly (*.p7*), 6-hourly (*.p8*), daily (*.pd*) & monthly (*.pm*)
    # each has a corresponding netCDF file, created by the netcdf_conversion job
    $ ls history/cpl
    # empty; the coupler does not generate history files, only restarts
    $ ls history/ice
    iceh_d.1850-01.nc  iceh_d.1850-02.nc  iceh_m.1850-01.nc  iceh_m.1850-02.nc
    # two sea-ice files per month: daily (iceh_d.*) & monthly (iceh_m.*) 
    $ ls history/ocn
    ocean_daily.nc-18500131  ocean_month.nc-18500131  ocean_scalar.nc-18500131
    ocean_daily.nc-18500228  ocean_month.nc-18500228  ocean_scalar.nc-18500228
    # three ocean files per month: daily, monthly & scalar (time-invariant fields)
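The UM history file naming above can be decoded mechanically. A small helper sketch (the function is ours for illustration, not part of the suite; it maps the stream letter to the frequencies listed above):

```shell
# Map the stream code in a UM history file name to its output frequency,
# following the convention above (p7=3-hourly, p8=6-hourly, pd=daily, pm=monthly)
decode_um_stream() {
  case "$1" in
    *a.p7*) echo "3-hourly" ;;
    *a.p8*) echo "6-hourly" ;;
    *a.pd*) echo "daily" ;;
    *a.pm*) echo "monthly" ;;
    *)      echo "unknown" ;;
  esac
}

decode_um_stream cb957a.pm1850jan   # -> monthly
```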
  5. Let’s have a closer look at the atmospheric netCDF model output data files. First, we need to load the netCDF module into our login instance (enabling us to view netCDF files, among other tasks); then we can use ‘ncdump’ to view file metadata:

    $ module load netcdf
    # load just the basic netCDF commands
    OR
    $ module use /g/data/hh5/public/modules
    # this requires access to CLEX's hh5 project on NCI (https://my.nci.org.au/mancini/project/hh5/join)
    $ module load conda/analysis3
    # load CLEX's 'analysis3' conda environment, which includes many useful modules (primarily Python-focused), including netCDF tools and iris
    $ module list 
    Currently Loaded Modulefiles:
     1) netcdf/4.7.3
    OR
    Currently Loaded Modulefiles:
     1) openmpi/4.0.2   2) conda/analysis3-20.10(analysis3)
    # loading too many modules into a single login instance can easily produce conflicts; { $ module purge } will clear all loaded modules; or just open a fresh terminal
    $ ncdump -c history/atm/cb957a.pm1850jan.nc | less
    # the -c flag tells ncdump to only read the file header and coordinate information, rather than all of the model data
    # I like to pipe ncdump to less, to keep my terminal history visible
  6. With a netCDF-converted UM file open (in this case, a monthly data file from the first month of our test run), the first thing we see is the list of dimensions:
    Note: We will not cover the general structure or usage of netCDF files, however if you are new to netCDF there are many tutorials and descriptions online. A good introduction is: https://pro.arcgis.com/en/pro-app/latest/help/data/imagery/fundamentals-of-netcdf.htm

    netcdf cb957a.pm1850jan {
    dimensions:
            time = UNLIMITED ; // (1 currently)
            model_theta_level_number = 85 ;  # full model levels (edges of vertical grid cells)
            lat = 144 ;
            lon = 192 ;
            bnds = 2 ;
            model_rho_level_number = 85 ;  # half model levels (center of vertical grid cells)
            pseudo_level = 5 ;  # sea-ice depths
            pseudo_level_0 = 6 ;  # atmospheric optical wavelength bands
            lon_u = 192 ;
            lat_v = 145 ;
            pseudo_level_1 = 17 ;  # CABLE tiles (land-use)
            pseudo_level_2 = 13 ;  # CABLE PFTs (plant functional types)
            depth = 6 ;  # Soil depth
            pressure = 19 ;  # atmospheric pressure levels

  7. Followed by the metadata of the monthly STASH fields (variables):
    Note: each field name is set according to the STASH field from which it originates, to ensure both clarity and provenance. In this example, we can see metadata for the STASH item m01s00i004 (http://reference.metoffice.gov.uk/um/stash/m01s00i004)

    variables:
            float fld_s00i004(time, model_theta_level_number, lat, lon) ;  # corresponds to STASH item m01s00i004
                    fld_s00i004:_FillValue = 1.e+20f ;
                    fld_s00i004:standard_name = "air_potential_temperature" ;
                    fld_s00i004:long_name = "THETA AFTER TIMESTEP" ;
                    fld_s00i004:units = "K" ;
                    fld_s00i004:um_stash_source = "m01s00i004" ;
                    fld_s00i004:missing_value = 1.e+20f ;
                    fld_s00i004:cell_methods = "time: mean" ;
                    fld_s00i004:grid_mapping = "latitude_longitude" ;
                    fld_s00i004:coordinates = "sigma_theta surface_altitude theta_level_height" ;
    ...
    ...
    data:
     time = -43813.5 ;
     model_theta_level_number = 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 
        15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31, 32, 
        33, 34, 35, 36, 37, 38, 39, 40, 41, 42, 43, 44, 45, 46, 47, 48, 49, 50, 
        51, 52, 53, 54, 55, 56, 57, 58, 59, 60, 61, 62, 63, 64, 65, 66, 67, 68, 
        69, 70, 71, 72, 73, 74, 75, 76, 77, 78, 79, 80, 81, 82, 83, 84, 85 ;
     lat = -89.375, -88.125, -86.875, -85.625, -84.375, -83.125, -81.875, 
        -80.625, -79.375, -78.125, -76.875, -75.625, -74.375, -73.125, -71.875, 
        -70.625, -69.375, -68.125, -66.875, -65.625, -64.375, -63.125, -61.875, 
        -60.625, -59.375, -58.125, -56.875, -55.625, -54.375, -53.125, -51.875, 
        -50.625, -49.375, -48.125, -46.875, -45.625, -44.375, -43.125, -41.875, 
        -40.625, -39.375, -38.125, -36.875, -35.625, -34.375, -33.125, -31.875, 
        -30.625, -29.375, -28.125, -26.875, -25.625, -24.375, -23.125, -21.875, 
        -20.625, -19.375, -18.125, -16.875, -15.625, -14.375, -13.125, -11.875, 
        -10.625, -9.375, -8.125, -6.875, -5.625, -4.375, -3.125, -1.875, -0.625, 
        0.625, 1.875, 3.125, 4.375, 5.625, 6.875, 8.125, 9.375, 10.625, 11.875, 
        13.125, 14.375, 15.625, 16.875, 18.125, 19.375, 20.625, 21.875, 23.125, 
        24.375, 25.625, 26.875, 28.125, 29.375, 30.625, 31.875, 33.125, 34.375, 
        35.625, 36.875, 38.125, 39.375, 40.625, 41.875, 43.125, 44.375, 45.625, 
        46.875, 48.125, 49.375, 50.625, 51.875, 53.125, 54.375, 55.625, 56.875, 
        58.125, 59.375, 60.625, 61.875, 63.125, 64.375, 65.625, 66.875, 68.125, 
        69.375, 70.625, 71.875, 73.125, 74.375, 75.625, 76.875, 78.125, 79.375, 
        80.625, 81.875, 83.125, 84.375, 85.625, 86.875, 88.125, 89.375 ;
    ...
    ...
    # press 'q' to exit less
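As a quick sanity check on the grid above: the dump lists 144 latitudes running from -89.375 to 89.375, which implies the regular 1.25° latitude spacing of the N96 grid. A one-liner to confirm, using only the values shown in the dump:

```shell
# 144 latitude points spanning -89.375..89.375 => spacing of 1.25 degrees
awk 'BEGIN { printf "%.2f\n", (89.375 - (-89.375)) / (144 - 1) }'
# -> 1.25
```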
  8. The ocean and sea-ice output files are similar:
    Note: MOM and CICE use netCDF format natively, and field names are given by component-specific namelists which are not covered in this tutorial.

    $ ncdump -c history/ocn/ocean_month.nc-18500131 | less
    netcdf ocean_month {
    dimensions:
            xt_ocean = 360 ;
            yt_ocean = 300 ;
            time = UNLIMITED ; // (1 currently)
    ...
    ...
    variables:
    ...
            float sst(time, yt_ocean, xt_ocean) ;
                    sst:long_name = "Potential temperature" ;
                    sst:units = "K" ;
                    sst:valid_range = -10.f, 500.f ;
                    sst:missing_value = -1.e+20f ;
                    sst:_FillValue = -1.e+20f ;
                    sst:cell_methods = "time: mean" ;
                    sst:time_avg_info = "average_T1,average_T2,average_DT" ;
                    sst:coordinates = "geolon_t geolat_t" ;
                    sst:standard_name = "sea_surface_temperature" ;
    ...
    ...
    $ ncdump -c history/ice/iceh_m.1850-01.nc
    netcdf iceh_m.1850-01 {
    dimensions:
            d2 = 2 ;
            ni = 360 ;
            nj = 300 ;
            nc = 5 ;
    ...
    ...
    variables:
            float hi(time, nj, ni) ;
                    hi:units = "m" ;
                    hi:long_name = "grid cell mean ice thickness" ;
                    hi:coordinates = "TLON TLAT time" ;
                    hi:cell_measures = "area: tarea" ;
                    hi:missing_value = 1.e+30f ;
                    hi:_FillValue = 1.e+30f ;
                    hi:cell_methods = "time: mean" ;
                    hi:time_rep = "averaged" ;
    ...
    ...
  9. Tutorial complete.

accessdev trac: https://accessdev.nci.org.au/trac

2016 ACCESS-CM2 Training: https://accessdev.nci.org.au/trac/wiki/access/AccessUserTrainingMar2016

Getting starting with ACCESS-CM2 document: http://nespclimate.com.au/wp-content/uploads/2020/10/Instruction-document-Getting_started_with_ACCESS.pdf

CSIRO-ACCESS Research page: https://research.csiro.au/access/

CLEX: https://climateextremes.org.au/

CLEX CMS blog: https://climate-cms.org/

CLEX CMS training videos: https://www.youtube.com/user/COECSSCMS

CLEX CMS Github: https://github.com/coecms/

CLEX Help Desk: cws_help@nci.org.au

NESP: http://nespclimate.com.au/

NCI Gadi User Guide: https://opus.nci.org.au/display/Help/Gadi+User+Guide

NCI Mancini: https://my.nci.org.au/mancini/login

UK Met Office trac: https://code.metoffice.gov.uk/trac/home

UM STASH: https://reference.metoffice.gov.uk/um/stash

Rose command reference: https://metomi.github.io/rose/doc/html/api/command-reference.html

Cylc documentation: https://cylc.github.io/

Cylc tutorial: https://cylc.github.io/cylc-doc/stable/html/tutorial/index.html

ESGF CMIP6 Data search: https://esgf-node.llnl.gov/search/cmip6/

Iris documentation & guide: scitools-iris.readthedocs.io