Harvesting Metadata

Overview

The Data Access Portal exposes metadata for harvesting in a variety of ways. Metadata can be harvested flexibly via the API. A complete metadata record for each collection can also be harvested from the header of the collection landing page in accordance with the specifications of schema.org.

The DAP used to expose metadata for harvesting through an Open Archives Initiative Protocol for Metadata Harvesting (OAI-PMH) endpoint, but this has now been retired.

Harvesting Using the API

Metadata can be harvested from the API in a variety of formats.

To do this:

  1. Search the Data Access Portal for the collections you wish to harvest. The https://data.csiro.au/dap/ws/v2/collections/{ext} endpoint is useful for doing this.
  2. Use the query results to find the Fedora PIDs.
  3. Once you have a set of identifiers of collections you want to harvest, use the https://data.csiro.au/dap/ws/v2/collections/{id}/metadata endpoint to retrieve a list of the metadata types that can be used to access that collection’s metadata.
  4. Select a metadata type from the list and use the https://data.csiro.au/dap/ws/v2/collections/{id}/metadata/{type} endpoint to harvest the metadata from the collection in your preferred format.
  5. Results are returned as XML by default. To request JSON instead, append .json to your request or include the header "Accept: application/json".
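
The steps above can be sketched in Python as follows. This is a minimal illustration rather than the portal's own harvesting script. It assumes the base URL https://data.csiro.au/dap/ws/v2 (the one used by the tags endpoint on this page), and the function names are hypothetical. The search step and the parsing of responses are left out because the response structure is not described here, so the sketch returns raw response bodies.

```python
import urllib.request

# Assumed base URL, matching the tags endpoint documented on this page.
BASE_URL = "https://data.csiro.au/dap/ws/v2"


def metadata_types_url(collection_id):
    """URL listing the metadata types available for one collection."""
    return f"{BASE_URL}/collections/{collection_id}/metadata"


def metadata_url(collection_id, metadata_type, as_json=False):
    """URL for one collection's metadata in a given type, e.g. 'dc'."""
    url = f"{BASE_URL}/collections/{collection_id}/metadata/{metadata_type}"
    # Appending .json is one of the two documented ways to request JSON.
    return url + ".json" if as_json else url


def fetch(url, accept=None, timeout=30):
    """GET a URL and return the response body as bytes."""
    # Passing accept="application/json" is the other way to request JSON.
    headers = {"Accept": accept} if accept else {}
    request = urllib.request.Request(url, headers=headers)
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return response.read()


def harvest(collection_ids, metadata_type, fetcher=fetch):
    """Return {collection_id: raw metadata bytes} for each identifier."""
    return {
        cid: fetcher(metadata_url(cid, metadata_type))
        for cid in collection_ids
    }
```

For example, harvest(ids, "dc") would request Dublin Core records for every identifier in ids. The fetcher argument exists so that the network call can be swapped out, for instance to add rate limiting or to test without contacting the portal.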

There is an example Python script that can be used for harvesting metadata from the DAP: metadata_harvester

Metadata Formats

The default metadata types available for all DAP collections are detailed in the following table. The API serves XML by default, even when the metadata schema usually requires JSON (e.g. schema.org). You can explicitly request that results be returned in JSON by appending .json to your request or including the header "Accept: application/json".

| Format | Used by | Type |
| --- | --- | --- |
| RIF-CS | All publicly accessible DAP collection records. Party records for lead researchers. Activity records. | rif |
| Dublin Core | All publicly accessible DAP collection records. | dc |
| Core Scientific Metadata Model | All publicly accessible DAP collection records. | csmd |
| Schema.org | All publicly accessible DAP collection records. | schemaOrg |

Some publicly accessible DAP collections also use custom metadata schemas. The implementation of these schemas in the DAP is detailed in the Models section of our Swagger documentation.

| Format | Notes | Model | Type |
| --- | --- | --- | --- |
| ANZLIC | This is an implementation of the geospatial metadata schema ISO 19115 in the context of Australia/New Zealand. | AnzlicDto | anzlic |
| Darwin Core | A metadata schema based on taxa, for sharing biological information. | DarwinCoreDto | dwc |
| Marine Community Profile | This is based on ISO 19115 and is primarily used by collections that have been imported by National Collections and Marine Infrastructure staff from Marlin records. | MarineCommunityProfileDto | mcp* |

*Some collections that have been described using the Marine Community Profile have been erroneously tagged as anzlic.

Sets

Sets are lists of collections grouped for harvesting by a repository, funder or research facility. You can access a specific set at the https://data.csiro.au/dap/ws/v2/tags/{tag} endpoint.

You can get a complete list of the tags used to access the different sets at https://data.csiro.au/dap/ws/v2/tags. This list indicates the default metadata schema in which results are returned for each set. If you would like to harvest metadata in an alternative format, you can extract the identifiers from the results and use them to retrieve the metadata in the format you prefer.

The examples below are not a complete list.

| Set | Tag | Notes | Access the set |
| --- | --- | --- | --- |
| Atlas of Living Australia | ALA | Collections that have been flagged by their depositors as being of potential interest to the Atlas of Living Australia. | Access ALA set |
| TERN Soils | TERN_Soils | A set of collections funded by the Terrestrial Ecosystem Research Network. Includes the Soil and Landscape Grid of Australia. | Access TERN_Soils set |
| PACCSAP | PACCSAP | Pacific-Australia Climate Change Science Adaptation Planning program. | Access PACCSAP set |
| TERN ACEF | TERN_ACEF | Australian Coastal Ecosystems Facility. | Access TERN_ACEF set |
| Marine National Facility | MNF | Marine National Facility voyage data collections. | Access MNF set |
| Grains Research and Development Corporation | GRDC | A set of collections funded by the Grains Research and Development Corporation. | Access GRDC set |
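
As a rough sketch, the two tag endpoints described in this section can be wrapped in Python as shown below. The function names are hypothetical, and the response body is returned unparsed because its structure depends on the default metadata schema of each set.

```python
import urllib.request
from urllib.parse import quote

# Endpoint that lists all tags, as given in this section.
TAGS_URL = "https://data.csiro.au/dap/ws/v2/tags"


def set_url(tag):
    """URL of the set identified by a tag such as 'ALA' or 'TERN_Soils'."""
    # Percent-encode the tag in case it contains reserved characters.
    return f"{TAGS_URL}/{quote(tag, safe='')}"


def fetch_set(tag, timeout=30):
    """Return the raw response body for one set as bytes."""
    with urllib.request.urlopen(set_url(tag), timeout=timeout) as response:
        return response.read()
```

Calling fetch_set("MNF") would request the Marine National Facility set in whatever schema the portal uses by default for that tag.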

Harvesting Using Headers

Complete Schema.org metadata is provided in JSON format in the HTML header of every collection landing page. It is therefore possible to harvest metadata from the Data Access Portal by crawling every collection landing page and scraping metadata from the headers. To do this, crawlers can use the sitemap to retrieve a list of URLs to all DAP collections. If you harvest metadata using this method, you will need to process the JavaScript on each page, e.g. by using a headless browser.

If you use this method of harvesting, we request that you set a custom User Agent string that includes the word “bot” or “crawler” (case insensitive), so that we can filter this activity out of our usage logs.
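
The scraping step can be illustrated with a short Python sketch. It assumes that the Schema.org record is embedded in a script element of type application/ld+json, which is the usual convention but is not confirmed on this page. It also assumes that the page has already been rendered (for example by a headless browser) and is passed in as a string. The class, function and User Agent names are hypothetical.

```python
import json
from html.parser import HTMLParser

# Hypothetical User Agent string; it contains "bot" as the portal requests.
USER_AGENT = "example-metadata-harvester-bot/0.1"


class JsonLdExtractor(HTMLParser):
    """Collect the parsed contents of application/ld+json script tags."""

    def __init__(self):
        super().__init__()
        self._in_json_ld = False
        self._chunks = []
        self.documents = []

    def handle_starttag(self, tag, attrs):
        # Assumption: the metadata sits in a JSON-LD script element.
        if tag == "script" and dict(attrs).get("type") == "application/ld+json":
            self._in_json_ld = True
            self._chunks = []

    def handle_data(self, data):
        if self._in_json_ld:
            self._chunks.append(data)

    def handle_endtag(self, tag):
        if tag == "script" and self._in_json_ld:
            self._in_json_ld = False
            self.documents.append(json.loads("".join(self._chunks)))


def extract_schema_org(rendered_html):
    """Return a list of JSON-LD documents found in rendered page HTML."""
    parser = JsonLdExtractor()
    parser.feed(rendered_html)
    parser.close()
    return parser.documents
```

The USER_AGENT value would be passed to whichever headless browser or HTTP client fetches the landing pages. extract_schema_org returns an empty list when a page contains no matching script element.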