Preservation Plan
Purpose
CSIRO recognises the significant value in the data generated by its substantial investment in research. Sharing of research data is a central aim of research data management activities and it is essential for transparency, reproducibility, and innovation. The Data Access Portal provides the basis for the management of CSIRO’s digital research data assets in terms of storage, retention, and accessibility for reference, use, or reuse. For the purposes of this Preservation Plan, research data assets are those selected for long-term storage and management to enable validation of research findings and re-use of high-value or unique data. This Preservation Plan describes CSIRO’s approach to the processes and responsibilities of long-term retention and preservation of data assets held in the Data Access Portal for use by its user community.
The user community of the Data Access Portal includes researchers, industry stakeholders, students, and policymakers from a broad range of the sciences and the wider community. To ensure long-term access to the collections in the Data Access Portal, CSIRO commits to maintaining the authenticity, reliability, and logical integrity of the data in formats suitable for reuse.
Mission
The mission of the Data Access Portal is to provide access to CSIRO’s research data and ensure its long-term preservation and persistence.
The objective of the CSIRO Research Data Service is to ensure that the organisation captures, publishes and manages the right data to support innovation, collaboration, and scientific integrity. The Research Data Service will:
- ensure research data under the custodianship of CSIRO is securely stored, easily located and, where appropriate, accessible to others for reuse;
- provide the ability to publish data to support reproducibility and research integrity;
- manage data in an increasingly collaborative research environment;
- respond to government and funding body requirements to share the data arising from publicly funded research.
The service is a collaboration between Information Management Technology teams and internal research partners to deliver a holistic solution for the management of research data in CSIRO. The Data Access Portal is the point of capture, and discovery portal, for CSIRO’s research data assets.
The CSIRO Data Access Portal optimises the findability of CSIRO’s research data collections. External systems that harvest the metadata include data.gov.au, Research Data Australia, Google’s Dataset Search and other subject discipline portals.
Scope
The scope of this Preservation Plan is limited to the collections in the Data Access Portal. The Data Access Portal includes data assets generated by CSIRO. It may include third-party data that has been subject to a data deposit agreement and meets the criteria detailed in External Data Applications. CSIRO is committed to preserving the data that falls within our scope of responsibility.
Objectives
The data collections preservation service at CSIRO is designed to meet these objectives:
- An integrated and interoperable data ecosystem with:
  - established accountability for data at all levels;
  - a default assumption of openness, while ensuring that licensing, ethical and contractual obligations are honoured;
  - support from data management and data governance tools;
  - acknowledgment of CSIRO culture and specific needs through the development of a data governance-oriented social architecture;
  - an approach where data is valued by the organisation and is managed in a way that enables us to realise value;
- To preserve the data assets in the Data Access Portal permanently;
- To survey file formats and assess their risk profile for long-term preservation; and
- To be a trusted digital repository.
Legal and regulatory framework
CSIRO is an Australian Government corporate entity, with a Board and Chief Executive. We’re constituted by and operate under the provisions of the Science and Industry Research Act 1949, which sets out our functions and powers, as well as those of our Minister, Board and Chief Executive. The governance, performance and accountability of our operations, including the use and management of public resources, are set out in the Public Governance, Performance and Accountability Act 2013 and related rules.
The legal and regulatory frameworks the Data Access Portal follows for preservation and access to datasets include:
- CSIRO’s Management of Research Data;
- Archives Act 1983;
- Australian Code for Responsible Conduct of Research;
- Copyright Act 1968;
- Digital Continuity 2020; and
- Privacy Act 1988.
Roles and responsibilities
The Data Access Portal is an institutional repository that has been available since 2011. The service is owned by the Information Management Technology Department of CSIRO and is supported by qualified technical and information services staff. The Data Access Portal Development program ensures it operates on best practice principles in data asset preservation and technical capability. It is funded within the Information Management Technology budget. To manage and maintain the integrity of the Data which is stored in the Data Repository and to enable CSIRO to release such Data, the rights granted by the depositor include the right for CSIRO to generate metadata, archive, convert, copy or migrate the Data onto any computer systems managed by CSIRO to ensure the future preservation and accessibility of the Data.
Depositors are CSIRO Officers who add research data to the Data Access Portal due to publisher, funder, or end of project requirements. CSIRO Officers acknowledge their responsibilities at the time of deposit by having worked through a Deposit Checklist or Software Licence Selection Process. Where a collection is to be made available externally, either publicly or with restricted access for external partners, the depositor is required to submit the collection to an approver before it is published.
Approvers are CSIRO science leaders who approve the publishing of the collection based on the Deposit Checklist or Software Licence Selection Process.
Applications are accepted from external data producers but inclusion in the Data Access Portal is subject to criteria detailed in External Data Applications.
The Data Access Portal is developed and managed by CSIRO Information Management and Technology. It is governed by the Data Management Systems Board, comprising the Authority and Chair of the Board (IMT), Senior Users (Science Business Units), Senior Supplier (IMT), Deputy Authority (IMT), and Specialist Advisors (IMT). In addition, the Scientific Domains have their own project boards. Pulsar Observations is governed by the CSIRO Australian Telescope National Facility (ATNF) Data Access Project Board. National Research Collections Australia has a temporary project board, the NRCA Digital Infrastructure Upgrade Project Board, that will operate until it transfers to the Data Management Systems Board on completion of the project. The project boards are a key element of the Project Management framework and provide direction and advice to the Project Managers to ensure that each project achieves the expected outputs, outcomes, and impacts. The activities in scope are articulated in the approved Project Intent Statements (the Business Cases).
Development and support of the CSIRO Data Access Portal software and infrastructure is undertaken by a project manager and project team. The project team consists of a pool of thirty-one technical and research data services staff with a mix of full-time and part-time roles, including 10 Software Developers, 5 Business Analysts, 6 Test Analysts, 2 Infrastructure Specialists, 2 Solution Architects, and 6 Research Data Services staff. The Research Data Services staff liaise with, and provide advice and training to, CSIRO staff and the user community.
Implementing the preservation strategy
During the development of the Data Access Portal in the early 2010s, the Reference Model for an Open Archival Information System (hereafter referred to as OAIS Reference Model) was adopted to inform the processes for preserving originally submitted content.
The Data Access Portal’s processes are outlined in the RDS Functional Model shown in Figure 1 below.
Figure 1 CSIRO RDS Functional Model
Deposit checklist
To follow the OAIS Reference Model’s guidance on accepting appropriate information, the depositor and approver are required to complete a Deposit Checklist or Software Licence Selection Process. These documents are off-system tools to guide depositors and approvers through key contractual, legal, ethical, and quality questions related to publicly publishing collections. Depositors produce a draft collection that they submit to an approver. The approver has the appropriate delegation and authority rank and performs the final check before a collection is published. The benefit of the Deposit Checklist is that the depositor and approver consider, prior to deposit, key issues that affect quality and preservation. Depositors and approvers have access to guidance in the form of written guides, training, and consultation with staff from a wide range of disciplines within CSIRO including commercial contracts, legal, ethics, data support, and information technology.
Deposit
The deposit function, or ingest function of the OAIS Reference Model, accepts metadata and files into the deposit module of the CSIRO Data Access Portal from the depositor. The repository system verifies the integrity and completeness of the information. The files are kept as originally supplied and are retained for preservation. The metadata is extracted and saved for search and retrieval and is added to the Data and Metadata Management database. Files are kept in on-premise object storage which is mirrored to a second geographic location, and an additional copy of all collection files for all versions is saved onto tape. Links between the metadata database and the storage system for the files are maintained. All processes undertaken in the system are kept in logs. Depositors and approvers are notified when key processes such as publishing are complete.
Archival storage
The Archival Storage functions include long-term storage, maintenance, and retrieval of the files. This function ensures the files remain the same as deposited and are accessible to the user community. To maintain the integrity of the files a checksum is created for each file and for all of the files in the collection. Management of the archival storage includes integrity checking, management of storage hierarchy, replacement of media, error checking, and providing files for the Access function.
Data storage is fit for purpose. At the collection level, the data is stored in on-premise object storage that is mirrored in a separate Australian state, and an additional copy of all collection files for all versions is saved onto tape. Collections in each geographic location are generated from the same source copy. Checksums are created for each file in a collection and stored with each storage layer. The checksums are constantly monitored in each of the storage systems. Storage media are also monitored for faults.
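The checksum approach described above can be illustrated with a minimal Python sketch. It computes a checksum for each file and one for the collection as a whole, then compares a stored manifest with the files currently on disk. This plan does not specify the hash algorithm or how checksums are recorded, so the use of SHA-256, the manifest layout, and all function names below are assumptions made for illustration only and do not describe the Data Access Portal's actual implementation.

```python
import hashlib
from pathlib import Path


def file_checksum(path: Path, chunk_size: int = 1 << 20) -> str:
    """Return the SHA-256 hex digest of one file, read in chunks."""
    digest = hashlib.sha256()  # assumption: the plan does not name an algorithm
    with path.open("rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def collection_manifest(root: Path) -> dict[str, str]:
    """Map each file path (relative to the collection root) to its checksum."""
    return {
        p.relative_to(root).as_posix(): file_checksum(p)
        for p in sorted(root.rglob("*"))
        if p.is_file()
    }


def collection_checksum(manifest: dict[str, str]) -> str:
    """Derive a single checksum for the collection from its sorted manifest."""
    digest = hashlib.sha256()
    for name in sorted(manifest):
        digest.update(f"{name}\0{manifest[name]}\n".encode("utf-8"))
    return digest.hexdigest()


def changed_files(root: Path, stored: dict[str, str]) -> list[str]:
    """List files that are missing, new, or whose checksum no longer matches."""
    current = collection_manifest(root)
    return sorted(
        name
        for name in stored.keys() | current.keys()
        if stored.get(name) != current.get(name)
    )
```

In a sketch like this, a manifest would be recorded at deposit and `changed_files` run periodically against each storage copy; an empty result indicates that the files are unchanged since deposit.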
Data and metadata management
Data and Metadata management functions are responsible for maintaining the integrity of the database and include descriptive metadata, finding aids and system information. Descriptive metadata describes a collection and is used to support archive operations. This function receives queries from the search function of the user interface and generates a result for the user community.
Reports are generated on the use of the Data Access Portal from the Deposit function and include summaries of holdings. Google Analytics and Logging as a Service are used to collate download and page views. Updates to collections are received from the Deposit function and a new version is created.
Version control
To maintain the authenticity of a Data Access Portal collection, any alteration is recorded accurately through the use of version control. Changes to metadata and/or files in the Data Access Portal create a new version. A new digital object identifier for a collection is automatically minted if the files in the new version are changed through deletion, replacement, or addition. The collection’s original version of metadata and files is retained. The digital object identifier of the original version will resolve to the landing page where the metadata and files are the same as originally deposited. A popup alert will advise if a new version is available. The current version will always be returned in search results from the Data Access Portal user interface.
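The versioning rules above can be sketched in a few lines of Python. This is a simplified model, not the Data Access Portal's implementation: the class names, the placeholder DOI prefix `10.0000/example`, and the `_mint_doi` helper are hypothetical. The sketch also assumes that a metadata-only change keeps the existing digital object identifier, which is implied but not stated outright in this plan.

```python
from dataclasses import dataclass, field
from itertools import count


@dataclass(frozen=True)
class Version:
    number: int
    doi: str
    metadata: dict
    files: frozenset  # set of (filename, checksum) pairs


@dataclass
class Collection:
    versions: list = field(default_factory=list)
    _doi_seq: count = field(default_factory=lambda: count(1))

    def _mint_doi(self) -> str:
        # Hypothetical stand-in for a call to a DOI registration service.
        return f"10.0000/example.{next(self._doi_seq)}"

    def deposit(self, metadata: dict, files: set) -> Version:
        """Record a new version; earlier versions are never modified."""
        files = frozenset(files)
        if self.versions:
            prev = self.versions[-1]
            if files == prev.files and metadata == prev.metadata:
                return prev  # nothing changed, so no new version
        if not self.versions or files != self.versions[-1].files:
            doi = self._mint_doi()  # first deposit, or files added/replaced/deleted
        else:
            doi = self.versions[-1].doi  # assumption: metadata-only change keeps DOI
        version = Version(len(self.versions) + 1, doi, dict(metadata), files)
        self.versions.append(version)
        return version

    @property
    def current(self) -> Version:
        """The version returned by default, for example in search results."""
        return self.versions[-1]
```

Because versions are only ever appended, the original deposit remains available under its original identifier while `current` always points to the latest version.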
Data collection withdrawal
If the depositor withdraws a collection, the access conditions are changed from publicly accessible to restricted. Thereafter, only the system administrators and the depositor will have access to the collection. Digital Object Identifiers for withdrawn collections will resolve to a tombstone page informing the user community that the collection has been withdrawn.
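The withdrawal behaviour described above amounts to a small state change, sketched here in Python. The class, field, and method names are hypothetical, and the access check is deliberately simplified: it does not model restricted access for external partners.

```python
from dataclasses import dataclass
from enum import Enum


class Access(Enum):
    PUBLIC = "public"
    RESTRICTED = "restricted"


@dataclass
class CollectionRecord:
    doi: str
    depositor: str
    access: Access = Access.PUBLIC
    withdrawn: bool = False

    def withdraw(self) -> None:
        """Withdrawal restricts access; the record itself is retained."""
        self.withdrawn = True
        self.access = Access.RESTRICTED

    def can_view(self, user: str, admins: set) -> bool:
        """Public collections are open to all; otherwise only the
        depositor and system administrators may view (simplified)."""
        if self.access is Access.PUBLIC and not self.withdrawn:
            return True
        return user == self.depositor or user in admins

    def resolve_doi(self) -> str:
        """Return which kind of page the identifier leads to."""
        return "tombstone" if self.withdrawn else "landing"
```

Keeping the identifier resolvable after withdrawal means that existing citations still lead to an explanation rather than a broken link.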
Access
The Access function of the Data Access Portal provides the user community with an interface to find, access, ask for assistance and provide feedback. The Access function includes the user interface where the user community can search, dereference a collection’s persistent identifier, download files, and request files that need to be made accessible from the archival storage. The Archival Storage function makes the files available to the requestor using the chosen access protocol. The user community can contact repository staff for assistance via email and telephone. The Access function implements the relevant security for a collection based on the access conditions that a depositor has applied to a collection.
Preservation planning
The functions of preservation planning are to monitor the needs of the depositors, approvers and user community. Additionally, this function monitors technology to ensure ongoing system suitability for the continued preservation of, and access to, the Data Access Portal’s collections. In the Deposit Checklist, depositors are asked to consider a suitable format and size for long-term sharing and reuse. It is preferred that depositors use open and domain standard file formats. Further analysis is required to take stock of the file formats that exist for collections that are not Scientific Domains. At present, the Linux/Unix file utility is used to perform file format identification on ingest. The metadata from this tool is in the process of being reviewed to understand the file formats in current use. In addition, an investigation of other file format identification tools will be completed to understand their fit within the context of the CSIRO Data Access Portal. Once complete, the repository will be able to further develop the preservation approach to file formats and migration. It is expected that this analysis will be completed in 2024.
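A file format survey of the kind described above could be approached as in the following Python sketch, which calls the `file` utility and tallies the results. The function names are hypothetical, and the choice to request MIME types (via the utility's `--brief` and `--mime-type` options) is an assumption; this plan does not state which options or output fields the Data Access Portal records on ingest.

```python
import subprocess
from collections import Counter
from pathlib import Path
from typing import Callable, Iterable


def identify_format(path: Path) -> str:
    """Ask the `file` utility for the MIME type of one file."""
    result = subprocess.run(
        ["file", "--brief", "--mime-type", str(path)],
        capture_output=True,
        text=True,
        check=True,  # raise if the utility reports an error
    )
    return result.stdout.strip()


def survey_formats(
    paths: Iterable[Path],
    identify: Callable[[Path], str] = identify_format,
) -> Counter:
    """Count how many files were identified as each format."""
    return Counter(identify(p) for p in paths)
```

The `identify` parameter allows another identification tool to be substituted without changing the tallying logic, which may be useful when comparing tools. A survey of one collection directory could then be run as `survey_formats(p for p in Path("collection").rglob("*") if p.is_file())`, where `collection` is a placeholder path.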
Administration
The Administration function includes the day-to-day management of the Data Access Portal for Deposit, Archival Storage, Data and Metadata Management, Access and Preservation Planning functions. Managing the interaction with depositors, approvers and the user community is an additional function of Administration. It is responsible for establishing standards and policies and for monitoring system performance. Administration oversees the archiving and access systems of the Data Access Portal and ensures they are kept up-to-date.
Content coverage
CSIRO is committed to ensuring the long-term availability of the data it holds by ensuring technology is adapted to changes in storage and application technologies. The Data Access Portal includes a broad range of file formats in its collections that span the scientific disciplines within CSIRO. A wide range of file formats are accepted as preservation is based on the common practices of a discipline. Yet, file formats may become obsolete due to software or hardware dependencies. The preference is for data formats that are open and accessible by many software applications and are commonly used within the scientific discipline.
IT Architecture
The IT architecture for the preservation of the collections in the Data Access Portal is fit for purpose and developed and maintained by CSIRO. Plans for infrastructure development are considered within the Data Access Portal Development Project.
The application for the Data Access Portal is developed within CSIRO and there are at least three releases of the software each year. A record of the software configuration is maintained in a Bitbucket code repository. The release history of the software is publicly available to the Data Access Portal user community.
Security
Appropriate security measures ensure Data Access Portal collections are protected from unauthorised use, accidental modification or loss.
Access to the storage of collections in the Data Access Portal is restricted to the Administration team. Their responsibilities are outlined in the CSIRO Information Security Procedure. All administrators have obtained a security clearance that is administered by the Australian Commonwealth Government. Server rooms are located in multiple locations across Australia. Storage of Data Access Portal collections is mirrored in two data centres by default and an additional copy is kept on tape.
For the Data Access Portal application, backups are performed on critical information in the system, and documentation exists to permit the creation of replacement servers, which would allow restoration of service within 2-3 days. Business continuity procedures have been documented.
Data is secured, and users are traceable, through authenticated access for depositing and for accessing restricted collections. The Administration team is able to track user interaction with the application using a combination of application logs, server logs and user access logs. Google Analytics and Logging as a Service are used for determining downloads. The CSIRO and the Australian Privacy Principles – Privacy Policy provide procedures for the management and protection of personal information that CSIRO collects and holds.
To test the security measures employed by the Data Access Portal, the CSIRO Cyber Security Services team undertakes regular security and penetration tests. The framework that guides CSIRO Cyber Security Services is provided by the Australian Government under the Protective Security Policy Framework (PSPF) and the Information Security Manual (ISM). The Server Ops and Database team undertake regular maintenance of the underlying infrastructure by applying security patches as required.
Sustainability
CSIRO is an Australian Government corporate entity. CSIRO’s parent began as the Advisory Council of Science and Industry in 1916. CSIRO is constituted by and operates under the provisions of the Australian Government Science and Industry Research Act 1949. CSIRO’s main funding is directly from the Australian Government and this funding is based on a triennial funding program. If funding were to cease, CSIRO would have approximately three years’ notice to plan for the succession of the Data Access Portal’s collections.
Acknowledgements
This plan was developed with reference to the following documents:
Data Archiving and Networked Services (DANS), 2021, Preservation Plan, viewed 8 December 2017, <https://dans.knaw.nl/en/preservationplan/>.
Inter-university Consortium for Political and Social Research (ICPSR), 2009, Principles and Good Practice for Preserving Data, International Household Survey Network, IHSN Working Paper No 003, viewed 8 December 2017, <http://www.ihsn.org/sites/default/files/resources/IHSN-WP003.pdf>.
UK Data Archive, 2021, Preservation Policy, Version 12.00. <http://www.data-archive.ac.uk/media/514523/cd062-preservationpolicy.pdf>.