Harvesting Metadata
Overview
The Data Access Portal exposes metadata for harvesting in a variety of ways. Metadata can be harvested flexibly via the API. A complete metadata record for each collection can also be harvested from the header of the collection landing page in accordance with the specifications of schema.org.
The DAP used to expose metadata for harvesting through an Open Archives Initiative Protocol for Metadata Harvesting (OAI-PMH) endpoint, but this has now been retired.
Harvesting Using the API
Metadata can be harvested from the API in a variety of formats.
To do this:
- Search the Data Access Portal for the collections you wish to harvest. The https://data.csiro.au/dap/ws/v2/collections/{ext} endpoint is useful for doing this.
- Use the query results to find the Fedora PIDs.
- Once you have a set of identifiers of collections you want to harvest, use the https://data.csiro.au/dap/ws/v2/collections/{id}/metadata endpoint to retrieve a list of the metadata types that can be used to access that collection’s metadata.
- Select a metadata type from the list and use the https://data.csiro.au/dap/ws/v2/collections/{id}/metadata/{type} endpoint to harvest the metadata from the collection in your preferred format.
- The default encoding that will be returned is XML; however, you can also request that results be returned as JSON by appending .json to your request or including the header "Accept: application/json".
There is an example Python script that can be used for harvesting metadata from the DAP: metadata_harvester
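Separately from that script, the steps above can be sketched in a few lines of Python. This is a minimal illustration, not the official harvester: the base URL is assumed to match the host used by the tags endpoint on this page, the identifier `csiro:0000` is a made-up placeholder, and the helper names are hypothetical.

```python
import urllib.request

# Assumption: same host and path prefix as the tags endpoint on this page.
BASE_URL = "https://data.csiro.au/dap/ws/v2"


def metadata_types_url(collection_id, base_url=BASE_URL):
    """URL that lists the metadata types available for one collection."""
    return f"{base_url}/collections/{collection_id}/metadata"


def metadata_url(collection_id, metadata_type, base_url=BASE_URL):
    """URL for one collection's metadata in a given type (e.g. 'dc' or 'rif')."""
    return f"{base_url}/collections/{collection_id}/metadata/{metadata_type}"


def fetch(url, timeout=30):
    """GET a URL and return the response body as text (XML by default)."""
    with urllib.request.urlopen(url, timeout=timeout) as response:
        return response.read().decode("utf-8")


def harvest(collection_ids, metadata_type="dc", fetcher=fetch):
    """Return {collection_id: metadata_text} for each identifier supplied."""
    return {
        cid: fetcher(metadata_url(cid, metadata_type))
        for cid in collection_ids
    }


if __name__ == "__main__":
    # 'csiro:0000' is a placeholder, not a real collection identifier.
    print(metadata_types_url("csiro:0000"))
    print(metadata_url("csiro:0000", "dc"))
```

Passing the fetch function in as a parameter keeps the URL logic easy to check without touching the network, and makes it straightforward to add rate limiting or retries later.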
Metadata Formats
The default metadata types available for all DAP collections are detailed in the following table. The API defaults to serving XML as the encoding, even when the metadata schema usually requires JSON (e.g. schema.org). You can explicitly request that results be returned in JSON by appending .json to your request or including the header "Accept: application/json".
Format | Used by | Type |
RIF-CS | All publicly accessible DAP collection records. Party records for lead researchers. Activity records. | rif |
Dublin Core | All publicly accessible DAP collection records. | dc |
Core Scientific Metadata Model | All publicly accessible DAP collection records. | csmd |
Schema.org | All publicly accessible DAP collection records. | schemaOrg |
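The two ways of asking for JSON described above can be illustrated as follows. This is a sketch only: the function names are hypothetical and the URL in the usage section contains a placeholder identifier. It builds the requests without sending them.

```python
import urllib.request


def json_request_by_suffix(url):
    """Ask for JSON by appending .json to the request URL."""
    return urllib.request.Request(url + ".json")


def json_request_by_header(url):
    """Ask for JSON by sending an Accept header with the unchanged URL."""
    return urllib.request.Request(url, headers={"Accept": "application/json"})


if __name__ == "__main__":
    # Placeholder identifier; 'schemaOrg' is the type listed in the table above.
    url = "https://data.csiro.au/dap/ws/v2/collections/csiro:0000/metadata/schemaOrg"
    print(json_request_by_suffix(url).full_url)
    print(json_request_by_header(url).get_header("Accept"))
```

Either request object can then be passed to `urllib.request.urlopen`, and the body decoded with `json.loads`.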
Some publicly accessible DAP collections also use custom metadata schemas. The implementation of these schemas in the DAP is detailed in the Models section of our Swagger documentation.
Format | Notes | Model | Type |
ANZLIC | This is an implementation of the geospatial metadata schema ISO 19115 in the context of Australia/New Zealand. | AnzlicDto | anzlic |
Darwin Core | A metadata schema based on taxa, for sharing biological information. | DarwinCoreDto | dwc |
Marine Community Profile | This is based on ISO 19115 and is primarily used by collections that have been imported by National Collections and Marine Infrastructure staff from Marlin records. | MarineCommunityProfileDto | mcp* |
*Some collections that have been described using the Marine Community Profile have been erroneously tagged as anzlic.
Sets
Sets are lists of collections grouped for harvesting by a repository, funder or research facility. You can access a specific set at the https://data.csiro.au/dap/ws/v2/tags/{tag} endpoint.
You can get a complete list of the tags used to access the different sets at https://data.csiro.au/dap/ws/v2/tags. This list indicates the default metadata schema in which results are returned for each set. If you would like to harvest metadata in an alternative format, you can extract the identifiers from the results and use them to retrieve the metadata in the format you prefer.
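As a rough sketch of that last step, the snippet below builds the URL for a set and collects identifier values from a decoded JSON response. The structure of the response is not described on this page, so the field name `id` is purely an assumption; inspect a real response and substitute whichever key actually holds the collection identifier. The helper names are hypothetical.

```python
import json
import urllib.request

TAGS_URL = "https://data.csiro.au/dap/ws/v2/tags"


def tag_url(tag):
    """URL for one set, e.g. tag_url('ALA')."""
    return f"{TAGS_URL}/{tag}"


def fetch_json(url, timeout=30):
    """GET a URL with an Accept header for JSON and decode the body."""
    request = urllib.request.Request(url, headers={"Accept": "application/json"})
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return json.loads(response.read().decode("utf-8"))


def find_values(node, key):
    """Walk decoded JSON depth-first and list every value stored under `key`."""
    found = []
    if isinstance(node, dict):
        for name, value in node.items():
            if name == key:
                found.append(value)
            found.extend(find_values(value, key))
    elif isinstance(node, list):
        for item in node:
            found.extend(find_values(item, key))
    return found


if __name__ == "__main__":
    # Made-up sample data standing in for a real response.
    sample = {"results": [{"id": "csiro:0001"}, {"id": "csiro:0002"}]}
    print(tag_url("ALA"))
    print(find_values(sample, "id"))  # "id" is an assumed field name
```

The identifiers collected this way can then be fed to the per-collection metadata endpoint to retrieve each record in a different format.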
The examples below are not a complete list.
Set | Tag | Notes | Access the set |
Atlas of Living Australia | ALA | Collections that have been flagged by their depositors as being of potential interest to the Atlas of Living Australia. | Access ALA set |
TERN Soils | TERN_Soils | A set of collections funded by the Terrestrial Ecosystem Research Network. Includes the Soil and Landscape Grid of Australia. | Access TERN_Soils set |
PACCSAP | PACCSAP | Pacific-Australia Climate Change Science Adaptation Planning program. | Access PACCSAP set |
TERN ACEF | TERN_ACEF | Australian Coastal Ecosystems Facility. | Access TERN_ACEF set |
Marine National Facility | MNF | Marine National Facility voyage data collections. | Access MNF set |
Grains Research and Development Corporation | GRDC | A set of collections funded by the Grains Research and Development Corporation. | Access GRDC set |
Harvesting Using Headers
Complete Schema.org metadata is provided in JSON format in the HTML header of every collection landing page. It is therefore possible to harvest metadata from the Data Access Portal by crawling every collection landing page and scraping metadata from the headers. To do this, crawlers can use the sitemap to retrieve a list of URLs to all DAP collections. If you harvest metadata using this method, you will need to process the JavaScript on each page, e.g. by using a headless browser.
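Once a page has been rendered (for example by a headless browser), the embedded metadata can be pulled out of the resulting HTML. The sketch below assumes the metadata sits in a `<script type="application/ld+json">` element, which is the usual way of embedding Schema.org JSON but is not confirmed on this page. It uses only the Python standard library, and the class and function names are hypothetical.

```python
import json
from html.parser import HTMLParser


class JsonLdExtractor(HTMLParser):
    """Collect the decoded contents of every JSON-LD script element."""

    def __init__(self):
        super().__init__()
        self._in_jsonld = False
        self._chunks = []
        self.records = []

    def handle_starttag(self, tag, attrs):
        # Assumption: the metadata is embedded as application/ld+json.
        if tag == "script" and dict(attrs).get("type") == "application/ld+json":
            self._in_jsonld = True
            self._chunks = []

    def handle_data(self, data):
        # Script text may arrive in more than one piece, so accumulate it.
        if self._in_jsonld:
            self._chunks.append(data)

    def handle_endtag(self, tag):
        if tag == "script" and self._in_jsonld:
            self.records.append(json.loads("".join(self._chunks)))
            self._in_jsonld = False


def extract_jsonld(html):
    """Return a list of decoded JSON-LD records found in rendered HTML."""
    parser = JsonLdExtractor()
    parser.feed(html)
    parser.close()
    return parser.records


if __name__ == "__main__":
    # Made-up page content for illustration only.
    page = (
        "<html><head>"
        '<script type="application/ld+json">{"@type": "Dataset"}</script>'
        "</head><body></body></html>"
    )
    print(extract_jsonld(page))
```

The input to `extract_jsonld` should be the HTML after JavaScript has run, since the raw page source may not yet contain the metadata.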
If you use this method of harvesting, we request that you do so with a custom User Agent string that includes the word “bot” or “crawler” (case insensitive), so that we can filter this activity out of our usage logs.
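A simple way to follow this request is to check the User Agent string before any request is sent. The sketch below is illustrative: the agent name `example-metadata-harvester-bot/0.1` and the helper names are hypothetical. It uses `urllib` to build a plain HTTP request, such as one for the sitemap; a headless browser would need the same string set through its own configuration.

```python
import urllib.request

# Hypothetical agent name; choose one that identifies your own harvester.
USER_AGENT = "example-metadata-harvester-bot/0.1"


def is_identifiable(user_agent):
    """True if the string contains 'bot' or 'crawler', ignoring case."""
    lowered = user_agent.lower()
    return "bot" in lowered or "crawler" in lowered


def build_request(url, user_agent=USER_AGENT):
    """Build a request that carries an identifiable User-Agent header."""
    if not is_identifiable(user_agent):
        raise ValueError("User agent must include 'bot' or 'crawler'")
    return urllib.request.Request(url, headers={"User-Agent": user_agent})


if __name__ == "__main__":
    # Placeholder URL; substitute the sitemap or page you intend to fetch.
    request = build_request("https://example.org/sitemap.xml")
    # urllib stores header names capitalised as 'User-agent'.
    print(request.get_header("User-agent"))
```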