Data Archives - Diversity and Inclusion in AI Guidelines

Data plays an essential role in AI systems since it is typically through very large historical datasets that AI algorithms learn and find patterns to deliver predictions and automate decisions. What, how, why, by whom, and for whom data is collected, labelled, modelled, stored, and applied has many diversity and inclusion implications. Positive and negative biases are present in the large datasets and algorithmic processes used in the development of AI models. Unwanted data biases often arise when algorithms are trained on one type of data and cannot extrapolate accurately beyond that data. In other types of AI systems (known as Symbolic AI), small datasets are used both as input and validation points in building and improving knowledge-based systems. It is critical to have a fair and inclusive representation of everyone who will be impacted by AI without any unwanted or negative bias that leads to discrimination and harm.

D09 Improve feature-based labelling and formulate more precise notions about user identity using qualitative data from social media sources

Apply more inclusive and socially just data labelling methodologies, such as Intersectional Labeling Methodology, to address gender bias. Rather than relying on static, binary gender in a face classification infrastructure, application designers should embrace and demand improvements to feature-based labelling. For instance, labels based on neutral performative markers (e.g., beard, makeup, dress) could replace gender classification in the facial analysis model, allowing third parties and individuals who come into contact with facial analysis applications to embrace their own interpretations of those features. Instead of focusing on improving methods of gender classification, application designers could use labelling alongside other qualitative data, such as Instagram captions, to formulate more precise notions about user identity.
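As a rough illustration of feature-based labelling, the sketch below stores only observable markers and an optional self-authored caption, with no gender field at all. The class name `FaceAnnotation`, the helper `to_label_vector`, and the three-item marker vocabulary are hypothetical, taken from the examples in the text; a real vocabulary would be agreed with affected stakeholders.

```python
from dataclasses import dataclass, field

# Hypothetical marker vocabulary, taken from the examples in the guideline text.
ALLOWED_MARKERS = frozenset({"beard", "makeup", "dress"})
VOCABULARY = tuple(sorted(ALLOWED_MARKERS))  # fixed order for label vectors


@dataclass
class FaceAnnotation:
    """One annotated image: observable markers only, no gender attribute."""
    image_id: str
    markers: set = field(default_factory=set)
    caption: str = ""  # optional qualitative context written by the person

    def add_marker(self, marker: str) -> None:
        # Reject anything outside the agreed vocabulary, including gender terms.
        if marker not in ALLOWED_MARKERS:
            raise ValueError(f"unknown marker: {marker!r}")
        self.markers.add(marker)


def to_label_vector(annotation: FaceAnnotation) -> list:
    """Multi-label 0/1 vector in VOCABULARY order."""
    return [1 if m in annotation.markers else 0 for m in VOCABULARY]
```

Because the labels are independent markers, a downstream model becomes a multi-label classifier, and any interpretation of those markers is left to the people concerned.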

D08 Document social descriptors when scraping data from different sources and perform compatibility analysis

Developers should attend to and document the social descriptors (for example, age, gender, and geolocation) when scraping data from different sources, including websites, databases, social media platforms, enterprise applications, or legacy systems. Context is important when the same data is later used for a different purpose, such as asking a new question about an existing data set. A compatibility analysis should be performed to ensure that potential sources of bias are identified and mitigation plans are made. This analysis would capture context shifts in new uses of data sets, identifying whether or how these could produce specific bias issues.
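One possible shape for such a compatibility analysis is sketched below. It assumes each scraped source carries a small provenance record listing its documented social descriptors, and it compares the distribution of each descriptor in the dataset with a sample from the new context of use. The names `SourceRecord`, `descriptor_shift`, and `compatibility_report`, and the 0.2 threshold, are illustrative assumptions, and total variation distance is only one of several measures that could be used.

```python
from collections import Counter
from dataclasses import dataclass


@dataclass(frozen=True)
class SourceRecord:
    """Provenance documented at scraping time (hypothetical schema)."""
    source: str          # where the data came from
    collected_for: str   # original purpose
    descriptors: tuple   # social descriptors captured, e.g. ("age_band", "region")


def _shares(rows, key):
    counts = Counter(row[key] for row in rows)
    total = sum(counts.values())
    if total == 0:
        raise ValueError("cannot compare an empty set of rows")
    return {value: n / total for value, n in counts.items()}


def descriptor_shift(dataset_rows, target_rows, key):
    """Total variation distance between two distributions of one descriptor."""
    p, q = _shares(dataset_rows, key), _shares(target_rows, key)
    return 0.5 * sum(abs(p.get(v, 0.0) - q.get(v, 0.0)) for v in set(p) | set(q))


def compatibility_report(record, new_purpose, dataset_rows, target_rows, threshold=0.2):
    """Flag descriptors whose distribution shifts more than `threshold`."""
    flagged = {}
    for key in record.descriptors:
        shift = descriptor_shift(dataset_rows, target_rows, key)
        if shift > threshold:
            flagged[key] = round(shift, 3)
    return {
        "source": record.source,
        "original_purpose": record.collected_for,
        "new_purpose": new_purpose,
        "flagged": flagged,
    }
```

A flagged descriptor is a prompt for a documented mitigation plan, not an automatic verdict; shifts in meaning or context that are not visible in the recorded descriptors still need human review.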

D07 Assess dataset suitability factors

Dataset suitability factors should be assessed. These include statistical methods for mitigating representation issues, the socio-technical context of deployment, and the interaction of human factors with the AI system. Teams should ask whether suitable datasets exist that fit the purpose of the various applications, domains, and tasks for the planned AI system.
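Reweighting is one commonly used statistical method for mitigating representation issues, and a minimal sketch is shown here as an illustration only. The function name `inverse_frequency_weights` is hypothetical, and the choice of equal total weight per group is an assumption; other targets, such as matching a reference population, are equally plausible.

```python
from collections import Counter


def inverse_frequency_weights(groups):
    """Return one weight per record so every group carries equal total weight.

    `groups` is a list with the group label of each record. Weights are
    scaled so that they sum to the number of records.
    """
    if not groups:
        raise ValueError("groups must not be empty")
    counts = Counter(groups)
    n, k = len(groups), len(counts)
    return [n / (k * counts[g]) for g in groups]
```

Reweighting cannot create information about a group that is barely present in the data, so it complements rather than replaces the question of whether a suitable dataset exists.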

D06 Consider context issues and context drift during model selection and development

Context should be taken into consideration during model selection to avoid or limit biased results for sub-populations. Caution should be taken in systems designed to use aggregated data about groups to predict individual behaviour, as biased outcomes can occur. “Unintentional weightings of certain factors can cause algorithmic results that exacerbate and reinforce societal inequities,” for example, predicting educational performance based on an individual’s racial or ethnic identity. Observed context drift in data should be documented via data transparency mechanisms capturing where and how the data is used and its appropriateness for that context.

Harvard researchers have expanded the definition of data transparency, noting that some raw data sets are too sensitive to be released publicly, and incorporating guidance on development processes to reduce the risk of harmful and discriminatory impacts:
• “In addition to releasing training and validation data sets whenever possible, agencies shall make publicly available summaries of relevant statistical properties of the data sets that can aid in interpreting the decisions made using the data, while applying state-of-the-art methods to preserve the privacy of individuals.
• When appropriate, privacy-preserving synthetic data sets can be released in lieu of real data sets to expose certain features of the data if real data sets are sensitive and cannot be released to the public.”

Teams should use transparency frameworks and independent standards; conduct and publish the results of independent audits; and open non-sensitive data and source code to outside inspection.
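To make the idea of publishing privacy-preserving statistical summaries concrete, the sketch below adds Laplace noise to per-category counts, which is the basic mechanism behind differentially private histograms. This is a swapped-in example technique, not a method named by the source. The function names and the `epsilon` parameter are assumptions, the category list is assumed to be public and fixed in advance, and each individual is assumed to contribute one record. It is not a vetted implementation; a real release should rely on an audited privacy library.

```python
import random
from collections import Counter


def laplace_noise(scale, rng):
    """Sample Laplace(0, scale) as the difference of two exponentials with mean `scale`."""
    return rng.expovariate(1.0 / scale) - rng.expovariate(1.0 / scale)


def noisy_group_counts(values, categories, epsilon, rng=None):
    """Return a noisy count for every category in `categories`.

    Adding or removing one record changes one count by 1, so the noise
    scale is 1 / epsilon. Smaller epsilon means more noise and more privacy.
    """
    if epsilon <= 0:
        raise ValueError("epsilon must be positive")
    rng = rng or random.Random()
    counts = Counter(values)
    scale = 1.0 / epsilon
    # Iterate over the fixed category list so absent categories are also reported.
    return {c: counts.get(c, 0) + laplace_noise(scale, rng) for c in categories}
```

Noisy counts can be negative or non-integer; whether to round or clip them before publication is a presentation choice that does not weaken the privacy guarantee, since it only post-processes the released values.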

D05 Recognize relationships between access issues, infrastructure, capacity building, and data sovereignty

Access, including cloud and offline data hosting, should be attended to because government and industry generally build and manage these on their own terms. Access is directly connected to capacity building (teams and stakeholders) and data sovereignty issues.

D04 Understand and adhere to data sovereignty praxis

The concept of, and practices supporting, data sovereignty is a critical element in the AI ecosystem. It covers considerations of the “use, management and ownership of AI to house, analyze and disseminate valuable or sensitive data”. Although definitions are context-dependent, operationally, data sovereignty requires that stakeholders within an AI ecosystem, and other relevant representatives from outside stakeholder cohorts, be included as partners throughout the AI-LC. Data sovereignty should be explored from and with the perspectives of those whose data is being used. These alternative and diverse perspectives can be captured and fed back into AI Literacy programs, exemplifying how people can affect and enrich AI both conceptually and materially. Various Indigenous technologists, researchers, artists, and activists have progressed the concept of, and protocols for, Indigenous data sovereignty in AI. This involves “Indigenous control over the protection and use of data that is collected from our communities, including statistics, cultural knowledge and even user data,” and moving beyond the representation of impacted users to “maximising the generative capacity of truly diverse groups.”

D03 Establish clear procedures for ensuring data privacy and offering opt-out options

Data privacy should be at the forefront, particularly when data from marginalized populations are involved. End users should be offered choices about privacy and ethics in the collection, storage, and use of data. Opt-out methods for data collected for model training and model application should be offered where possible.
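A minimal sketch of how opt-outs might be enforced in a data pipeline follows. It assumes a hypothetical registry that maps each user ID to the set of purposes that user has opted out of, with "training" and "application" mirroring the two uses named above; the record layout and the function name `apply_opt_outs` are likewise illustrative.

```python
# Purposes a person can opt out of (assumed names, mirroring the text above).
PURPOSES = ("training", "application")


def apply_opt_outs(records, opt_outs, purpose):
    """Drop records whose owner has opted out of `purpose`.

    records:  list of dicts, each with a "user_id" key
    opt_outs: dict mapping user_id -> set of purposes the user opted out of
    """
    if purpose not in PURPOSES:
        raise ValueError(f"unknown purpose: {purpose!r}")
    return [r for r in records if purpose not in opt_outs.get(r["user_id"], set())]
```

Running the filter at the point where data enters training or inference, rather than only at collection time, lets an opt-out registered later still take effect on the next run.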

D02 Involve stakeholders and ‘non-experts’ in the selection, collection, and analysis of demographically representative qualitative data

Representatives of impacted stakeholders should be identified and partnered with on data collection methods. This is particularly important when identifying new or non-traditional data-gathering resources and methods. To increase representativeness and responsible interpretation, include diverse viewpoints, and not only those of experts, when collecting and analyzing specific datasets. Technology or datasets deemed non-problematic by one group may be predicted to be disastrous by others. Training data sets should be demographically representative of the cohorts or communities that the AI system will impact.
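A simple numeric check can support, though not replace, the stakeholder partnership described above. The sketch below compares each group's share in a training set with a reference share for the impacted community. The function name `representation_gaps`, the 0.05 tolerance, and the idea that reference shares come from a source agreed with stakeholders are all assumptions made for illustration.

```python
from collections import Counter


def representation_gaps(sample_groups, reference_shares, tolerance=0.05):
    """Return {group: observed_share - reference_share} for gaps beyond `tolerance`.

    sample_groups:    list with the group label of each training record
    reference_shares: dict mapping group -> expected share in the impacted community
    """
    total = len(sample_groups)
    if total == 0:
        raise ValueError("sample_groups must not be empty")
    counts = Counter(sample_groups)
    gaps = {}
    for group, expected in reference_shares.items():
        observed = counts.get(group, 0) / total
        if abs(observed - expected) > tolerance:
            gaps[group] = round(observed - expected, 3)  # negative = under-represented
    return gaps
```

Which groups appear in `reference_shares`, and how they are defined, is itself a decision to make with the communities concerned; the check only reports on the categories it is given.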

Data Scientist

D01 Establish a clear rationale for data collection

For data collection involving human subjects, why, how and by whom data is being collected should be established in the Pre-Design stage. Potential data challenges or data bias issues that have implications for diversity and inclusion should be identified by key stakeholders and data scientists. For example, in the health application domain, diverse data sources […]