Lifecycle-Driven Data Requirement

Summary: Data requirements must be clearly defined throughout the data lifecycle, taking into account the ethical considerations and responsibilities of all stakeholders.

Type of pattern: Process pattern

Type of objective: Trustworthiness

Target users: Business analysts

Impacted stakeholders: Developers, data scientists, testers, operators

Lifecycle stages: Requirements

Relevant AI ethics principles: Human, societal and environmental wellbeing, human-centered values, fairness, privacy protection and security, reliability and safety, transparency and explainability, contestability, accountability

Mapping to AI regulations/standards: EU AI Act, ISO/IEC 42001:2023 Standard.

Context: The effectiveness of an AI model relies heavily on the quality of the data used to train and evaluate it. The data lifecycle comprises several key phases: data collection, cleaning, preparation, validation, analysis, and termination. However, data requirements often cover only the analysis phase and neglect the other phases. This narrow focus can cause downstream ethical problems, such as unreliable models, lack of accountability, and unfairness. To ensure trust in AI systems, the entire data lifecycle must be managed carefully.

Problem: How can we make sure data is responsibly used and managed?

Solution: To ensure that data is used and managed responsibly, data requirements should be explicitly listed and specified for every phase of the data lifecycle: collection, cleaning, preparation, validation, analysis, and termination. These requirements should take into account all relevant ethical principles and all stakeholders involved, including data providers, data engineers, data scientists, data consumers, and data auditors. A comprehensive data requirements specification document can be used to record and manage these requirements [1]; it can include detailed requirements for each lifecycle phase (e.g., requirements on data sources and collection methods). In addition to the requirements themselves, the specification should include basic information about the dataset, such as its vision, motivation, intended and non-intended uses, examples of data instances, and stakeholders consulted. For traceability and accountability, the specification document should have a clearly assigned owner, a creation date, and a last-updated date.
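As an illustration only, the specification document described above could be represented as a small data structure. The sketch below is not the template from [1] or [2]; all class names, field names, and example values (Phase, DataRequirement, DataRequirementsSpec, "loan-applications", the owner address) are hypothetical. It assumes Python 3.9 or later and shows one possible way to record the owner, dates, and dataset information, and to check which lifecycle phases still have no requirements.

```python
from dataclasses import dataclass, field
from datetime import date
from enum import Enum


class Phase(Enum):
    """Data lifecycle phases named in this pattern."""
    COLLECTION = "collection"
    CLEANING = "cleaning"
    PREPARATION = "preparation"
    VALIDATION = "validation"
    ANALYSIS = "analysis"
    TERMINATION = "termination"


@dataclass
class DataRequirement:
    """One requirement attached to a single lifecycle phase (hypothetical shape)."""
    phase: Phase
    statement: str            # the requirement itself
    principles: list[str]     # AI ethics principles it addresses
    stakeholders: list[str]   # roles responsible for or affected by it


@dataclass
class DataRequirementsSpec:
    """Hypothetical in-memory form of a data requirements specification."""
    dataset_name: str
    owner: str                # clearly assigned owner, for accountability
    created: date             # creation date, for traceability
    last_updated: date
    vision: str = ""
    motivation: str = ""
    intended_uses: list[str] = field(default_factory=list)
    non_intended_uses: list[str] = field(default_factory=list)
    example_instances: list[str] = field(default_factory=list)
    stakeholders_consulted: list[str] = field(default_factory=list)
    requirements: list[DataRequirement] = field(default_factory=list)

    def uncovered_phases(self) -> list[Phase]:
        """Return lifecycle phases that have no requirement yet, in lifecycle order."""
        covered = {r.phase for r in self.requirements}
        return [p for p in Phase if p not in covered]


# Example values below are invented for illustration.
spec = DataRequirementsSpec(
    dataset_name="loan-applications",
    owner="business-analyst@example.org",
    created=date(2024, 1, 15),
    last_updated=date(2024, 3, 1),
    intended_uses=["credit risk model training"],
    non_intended_uses=["marketing segmentation"],
)
spec.requirements.append(DataRequirement(
    phase=Phase.COLLECTION,
    statement="Collect only records for which consent is documented.",
    principles=["privacy protection and security"],
    stakeholders=["data providers", "data engineers"],
))
spec.requirements.append(DataRequirement(
    phase=Phase.ANALYSIS,
    statement="Report model performance per demographic group.",
    principles=["fairness"],
    stakeholders=["data scientists", "data auditors"],
))

missing = spec.uncovered_phases()
print([p.value for p in missing])
# prints ['cleaning', 'preparation', 'validation', 'termination']
```

A check such as uncovered_phases() reflects the concern raised in the Context: a specification that covers only the analysis phase is flagged as incomplete rather than silently accepted.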

Benefits:

  • Improved data quality: When data requirements are specified and managed throughout the data lifecycle, the quality of the data can be improved, which in turn leads to more accurate and reliable AI models.
  • Compliance with regulations: A data requirements specification can help organizations comply with responsible AI (RAI) regulations.

Drawbacks:

  • Different vocabularies: Stakeholders may use different vocabularies when describing data requirements.
  • Development inefficiency: To speed up development and reduce cost, the development team may start collecting data before the data requirements specification is complete.

Related patterns:

Known uses:

  • Google has created a template for dataset requirements specification [2], which can be used to manage and track data requirements.
  • Requirements, including data requirements, should be regularly updated to reflect the changing needs of users [3].
  • Data governance should be implemented and followed throughout the entire lifecycle of data [4].

References:

[1] Zhu, L., et al., AI and Ethics–Operationalising Responsible AI. arXiv preprint arXiv:2105.08867, 2021.

[2] Hutchinson, B., et al., Towards Accountability for Machine Learning Datasets: Practices from Software Engineering and Infrastructure, in Proceedings of the 2021 ACM Conference on Fairness, Accountability, and Transparency. 2021, Association for Computing Machinery: Virtual Event, Canada. p. 560–575.

[3] Vogelsang, A. and M. Borg. Requirements engineering for machine learning: Perspectives from data scientists. in 2019 IEEE 27th International Requirements Engineering Conference Workshops (REW). 2019. IEEE.

[4] Lee, S.U., L. Zhu, and R. Jeffery. A data governance framework for platform ecosystem process management. in International Conference on Business Process Management. 2018. Springer.