Collaborative Curation & Collection Science

Australian Tree Seed Centre, CSIRO, Canberra
Leveraging the complementary capabilities of human and machine intelligence to create direct connections between digitisation, curation, and data integration and management.
The Challenge
The broader vision for the collections of the future is to fully leverage the complementary capabilities of human and machine intelligence to create direct connections between digitisation, curation, and data integration and management. Building a flexible, interactive, two-way interface between curators, researchers and machines is a long-term goal. A critical first step is to improve our ability to manage and curate the specimens. While there will always be a need for humans in the system, there are many tasks that could be assisted by “digital curators”.
Our Response
This project aims to identify and evaluate the first practical steps towards a broader concept of what a ‘digital curator’ could do, and to provide a ‘proof-of-concept’ demonstration of the value of integrating machine and human intelligence to assist with collection management. The first step involves identifying the types of activities in which a digital curator could provide substantial assistance to human collection staff, including explicit identification of tasks where machine-human interaction and collaboration is a critical element. The design of the AI and the associated collaborative workflows will improve data quality and help users extract the most relevant and accurate information from the databases.
Impact
This collaborative digital curation capability is being built with the flexibility to ultimately be deployed across collections within CSIRO and globally. Natural history collections worldwide house an estimated 2-4 billion specimens, representing an irreplaceable record of life on Earth. However, less than 20% of these specimens are easily accessible to researchers and decision-makers. With mounting pressures including climate change, biodiversity loss, and biosecurity threats, making this knowledge more accessible has never been more crucial. Providing new methods to obtain accurate and timely data from these specimens to the global community will enable researchers and practitioners to address some of the environmental and societal issues we face today and in the future.
A prototype application for improving specimen metadata extraction from digital specimen images using human-AI collaborative workflows has been developed and is currently being trialled within the National Research Collections. It will be released as open-source software.
Publications
- Alan Stenhouse, Nicole Fisher, Alexander Schmidt-Lebuhn, Brendan Lepschi, Juanita Rodriguez, Federica Turco, Andrew Reeson, Cécile Paris, Pete Thrall. 2025. A Vision of Human-AI Collaboration for Enhanced Biological Collection Curation and Research. BioScience. https://doi.org/10.1093/biosci/biaf021
- This paper explores how AI-based “digital curators” could revolutionize natural history collections by assisting human experts with routine tasks while preserving crucial human oversight. This collaborative human-AI approach aims to enhance curatorial capabilities and make vast biological collections more accessible to address critical challenges including biodiversity conservation and climate change adaptation.
- Alan Stenhouse, Nicole Fisher, Alexander Schmidt-Lebuhn, Brendan Lepschi, Juanita Rodriguez, Federica Turco, Emma Toms, Andrew Reeson, Cécile Paris, Pete Thrall, 2023. Improving Biological Collections Data through Human-AI Collaboration. Biodiversity Information Science and Standards 7: e112488. https://doi.org/10.3897/biss.7.112488
- This paper introduces our prototype application for human-AI collaborative metadata extraction and curation.
January 2025: BioScience journal from Oxford University Press has accepted our paper “A Vision of Human-AI Collaboration for Enhanced Biological Collection Curation and Research” for publication.
October 2023: CINTEL 2023 Meet and Mingle event. Pete Thrall and Alan Stenhouse presented updates on the Collaborative Curation & Collection Science project during the CINTEL annual face-to-face event, our CINTEL Meet and Mingle, held in Adelaide.
Pete provided an overview of the project, followed by Alan with an update on his progress so far. They also contributed to a panel discussion with other projects in the Collaborative Discovery research stream, uncovering common themes and challenges across the projects. See the presentations and panel discussion below.
Presenters are: Dr. Pete Thrall, Dr. Alan Stenhouse, Dr. Rad Suchecki, Dr. Maisie Li, Dr. Magda Guglielmo and Dr. Albert Ardevol.
October 2023: Alan presented his research on Improving Biological Collections Data through Human-AI Collaboration as part of the “AI Contributions to biodiversity data & data standardisation: Opportunities and challenges” symposium at the Biodiversity Information Standards 2023 Conference, held during October in Hobart, Tasmania.
Hi there. Let’s get started. I’m Alan Stenhouse and I’ll tell you today a little bit about my work developing a prototype system for specimen metadata transcription in the biological collections, as well as give you some idea of our overall vision and aims within the collections domain.
I’m part of the Collaborative Intelligence Future Science Platform within CSIRO here in Australia, and our aim overall is to develop the science to help humans and AI work better together to exceed the performance of either alone. As you can imagine, this involves multiple disciplines. In addition to this project within the National Research Collections of Australia, we have projects in other domains as well as some foundational CINTEL projects looking at key aspects of human-AI collaboration such as trust, workflows, and skills.
Our vision overall is to transform collection science to tackle diverse environmental challenges by fully leveraging the complementary capabilities of human and machine intelligence to improve digitization, data integration and management, and delivery. So we’re developing methods and tools around these targets: existing specimens in digital format, especially the specimen metadata; collaborative data curation, that is finding and fixing errors and anomalies; and releasing the data for further taxonomic and other research purposes.
Within the collections there are about 15 million specimens, most of which are still to be digitized and made more widely available. To give some idea of scale, there are about 12 million insect specimens in the collection, and we’ve just finished digitizing the herbarium, which is 1.2 million specimens. As you can imagine, these pose a few challenges, particularly around labels containing not so much typed but handwritten text in a whole variety of formats, as well as different data formats, different languages, and so on.
Here are some samples from the National Insect Collection—you can see this one is upside down somehow, interesting, though in fact that usually doesn’t matter—and some from the herbarium. I included this one because I read a book on Banks earlier this year and was interested to see this specimen in the collection.
Our initial focus has been on developing an application for our collections team to enable improved metadata transcription using AI services and human-AI collaboration. I’ve been developing this prototype using existing AI models for optical character recognition, translation, and using large language models for metadata entity extraction or named entity recognition. Here we see two of the screens: the specimen image showing the bounding boxes around the text items which comes from the OCR process, and the metadata screen on the right showing the entities which have been automatically extracted using the LLM.
The first step is performing optical character recognition on the specimens. I’m using Google Vision, which works well and can be accessed easily through its API, but there are still errors—and I have to say humans also have problems deciphering some of the handwriting. Overall, though, Google Vision does a great job. We can also use another API to perform translation from many languages, which is an issue in at least our collection, and I’m sure in others too. Translation has improved dramatically in the last few years with newer AI models.
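As a minimal sketch of what this step can look like in practice (the function names and handling here are illustrative, not the prototype’s actual code), one might call the Google Cloud Vision and Translation client libraries like this:

```python
# A minimal sketch of the OCR and translation steps using the Google
# Cloud client libraries (pip install google-cloud-vision
# google-cloud-translate). Function names and handling are illustrative,
# not the prototype's actual code.
from google.cloud import vision
from google.cloud import translate_v2 as translate

def ocr_label(image_path: str) -> str:
    """Run OCR on a specimen label image and return the detected text."""
    client = vision.ImageAnnotatorClient()
    with open(image_path, "rb") as f:
        image = vision.Image(content=f.read())
    # document_text_detection copes with dense and handwritten text; the
    # response also carries per-word bounding boxes, which is what drives
    # the box overlay on the specimen-image screen.
    response = client.document_text_detection(image=image)
    if response.error.message:
        raise RuntimeError(response.error.message)
    return response.full_text_annotation.text

def translate_to_english(text: str) -> str:
    """Translate non-English label text to English."""
    return translate.Client().translate(text, target_language="en")["translatedText"]
```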
After OCR, we then need to extract the entities from each image so that these can ultimately be put into our collection database. At the moment we’re using large language models—GPT models from OpenAI—as they supply an API for easy programmatic use with good performance. This enables us to extract a larger range of metadata from specimens more easily, including species names, collectors, determiners, dates, latitude-longitude coordinates and other location information, as well as more complex data such as localities, habitat, and notes of various sorts.
Here are the results of a query using the GPT-4 model to extract the metadata from the specimen text. Note that there are some errors, and these come from both the LLM and the OCR process. The prompts and parameters that we supply to the LLM to define our requirements are key, and this includes how we define the entities or metadata elements that we want to extract. Some initial testing can be done using the OpenAI playground at the URL on screen, and we can refine these further using our prototype. For anybody who hasn’t used ChatGPT—though I presume a lot of people have tried it—the playground is very useful for experimenting with various types of queries, and as far as I know it’s free to access.
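As a rough sketch of what such an extraction query can look like programmatically, the snippet below calls the OpenAI chat API; the prompt wording, entity list and model choice are illustrative assumptions rather than the project’s exact configuration:

```python
# A hedged sketch of the entity-extraction call (pip install openai).
# The prompt text, entity names and model choice are illustrative
# assumptions, not the project's exact configuration.
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

PROMPT = """Extract the following entities from the specimen label text
and return them as JSON: scientificName, collector, determiner,
collectionDate, latitude, longitude, locality, habitat, notes.
Use null for anything that is not present.

Label text:
{label_text}"""

def extract_entities(label_text: str) -> str:
    response = client.chat.completions.create(
        model="gpt-4",
        # temperature=0 makes the output near-deterministic, which also
        # reduces the chance of hallucinated values (see the Q&A below).
        temperature=0,
        messages=[{"role": "user", "content": PROMPT.format(label_text=label_text)}],
    )
    return response.choices[0].message.content
```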
There are trade-offs when using GPT-3.5 versus GPT-4: 3.5 is faster, while 4 is more accurate but unfortunately costs about 20 times as much. The result in this example probably cost three cents or so.
Some models provide estimates of uncertainty about their predictions, which may be useful when checking the data. Here I show a simple visualization within our prototype that uses threshold levels of uncertainty to highlight uncertain items or entities. Note this reflects only the uncertainty the large language model has about its predictions, which does not necessarily bear any relation to how accurate the actual extraction is.
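One possible way to compute such a highlight, assuming the model exposes per-token log-probabilities (as the OpenAI API does when requested with logprobs=True), is sketched below; the threshold values are arbitrary illustrations:

```python
# A sketch of how such highlighting might be computed, assuming the
# model returns per-token log-probabilities (e.g. an OpenAI chat
# completion requested with logprobs=True). Thresholds are arbitrary.
import math

UNCERTAIN, VERY_UNCERTAIN = 0.8, 0.5  # token-probability thresholds

def flag_uncertain_tokens(logprob_items):
    """Yield (token, level) for tokens whose probability falls below a threshold.

    `logprob_items` is a sequence of objects with .token and .logprob,
    e.g. response.choices[0].logprobs.content from the OpenAI client.
    """
    for item in logprob_items:
        p = math.exp(item.logprob)  # log-probability -> probability
        if p < VERY_UNCERTAIN:
            yield item.token, "very uncertain"
        elif p < UNCERTAIN:
            yield item.token, "uncertain"
```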
So how can we improve the results? By using collaborative and interactive refinement. For example, we can edit the OCR text while looking at the context of the specimen image—I don’t show that here, but in the prototype you can do all of this. It’s not really scalable, though, when you’re talking about handling thousands, tens of thousands, or hundreds of thousands of images. We can also retry entity extraction using a different model, for example GPT-4 rather than 3.5.
We can refine the prompts that we use to provide better context and guidance. You should provide examples and specify formatting requirements or other data modifications that you might want. Here, for example, I specify that all dates should be converted to year-month-day format. You should also specify the entities you want extracted and the output format, for example tab-separated or JSON. So you can see down the bottom there, those items are built up into one query along with the text from the OCR process, and they affect what you actually end up extracting.
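To make that concrete, here is an illustrative refined prompt of the kind described; the wording, field names and worked example are invented for illustration rather than taken from the project’s actual prompts:

```python
# An invented example of a refined prompt with formatting rules and a
# worked (few-shot) example; field names and the sample label are
# illustrative only. It would be filled in with
# REFINED_PROMPT.format(label_text=ocr_text) before being sent.
REFINED_PROMPT = """Extract the entities listed below from the specimen label text.
Rules:
- Convert all dates to YYYY-MM-DD format.
- Return tab-separated values in this column order:
  scientificName, collector, collectionDate, locality.
- Leave a column empty if the information is absent.

Example:
Label: "Acacia dealbata Link. Coll. J. Smith, 3 Oct 1911. Near Bombala, NSW."
Output: Acacia dealbata Link\tJ. Smith\t1911-10-03\tNear Bombala, NSW

Label: "{label_text}"
Output:"""
```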
We’re also starting to build in some cross-checks and entity lookups using online references, so I very much appreciate the work that a lot of you are doing to provide APIs to these services.
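As one example of such a cross-check, a scientific name can be fuzzy-matched against the public GBIF name-matching API; which services the prototype actually queries isn’t specified here, so GBIF stands in as an illustration:

```python
# One illustrative cross-check: fuzzy-match a scientific name against
# the public GBIF name-matching service (pip install requests). GBIF
# stands in here for whichever reference APIs are actually used.
import requests

def check_name_against_gbif(name: str) -> dict:
    """Ask GBIF's species-match API how well a name resolves."""
    r = requests.get(
        "https://api.gbif.org/v1/species/match",
        params={"name": name},
        timeout=10,
    )
    r.raise_for_status()
    match = r.json()
    # matchType is e.g. EXACT, FUZZY or NONE; a FUZZY match often
    # signals an OCR or transcription error worth a human look.
    return {
        "matchType": match.get("matchType"),
        "canonicalName": match.get("canonicalName"),
        "confidence": match.get("confidence"),
    }

print(check_name_against_gbif("Acacia dealbta"))  # deliberately misspelled
```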
Future steps include developing other digital curators, if I can use that term, such as data curation, trait extraction, and data-linking assistants to augment our human curators. We’ll also look at extending the knowledge base, as large language models are huge but limited and relatively fixed, and methods for extending their capabilities are rapidly emerging. We can fine-tune the models with our own data, which may provide more accurate results but has time and cost implications. We can also use retrieval augmented generation, where we develop and use our own data sources to provide more relevant contextual information alongside the queries to the LLM and so improve the responses.
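As a loose sketch of the retrieval-augmented generation idea, under the assumption of a simple lexical retriever over in-house reference records (a production system would use a proper index), the shape is roughly:

```python
# A loose sketch of retrieval-augmented generation for this setting.
# The "retriever" here is a toy word-overlap ranking over an in-memory
# list of reference records; a real system would use a proper index.
def retrieve_context(label_text: str, reference_records: list[str], k: int = 3) -> list[str]:
    """Rank reference records by crude lexical overlap with the label text."""
    words = set(label_text.lower().split())
    ranked = sorted(
        reference_records,
        key=lambda rec: len(words & set(rec.lower().split())),
        reverse=True,
    )
    return ranked[:k]

def build_rag_prompt(label_text: str, reference_records: list[str]) -> str:
    """Prepend the most relevant reference records to the extraction query."""
    context = "\n".join(retrieve_context(label_text, reference_records))
    return (
        "Known collectors, localities and names from our own database:\n"
        f"{context}\n\n"
        "Using that context where it is relevant, extract the entities from "
        f"this label text:\n{label_text}"
    )
```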
I’ll also be exploring how we can usefully integrate LLMs, ontologies, knowledge graphs, neural graphs, and other emerging methods to support and enhance our digital curators. I’m especially interested in how we can integrate these methods into powerful, easy-to-use tools and workflows that assist collections teams both here and around the world to enhance and improve their data management and curation. Thanks to our team, and thank you for listening. I’m happy to take any questions, and do get in touch anytime.
[Applause]
Thanks very much, Alan. Okay, so there is a question online from Ruka: “You just mentioned in your talk that you translate stuff before feeding it to GPT. Have you tried doing it without translation? It understands other languages remarkably well.”
“Yeah, sure, and it does do that well. I do use Google Translate because it does a good job of translation, but you can also ask the large language model, as part of your query, to translate and extract, so it’ll extract in the language you want. I haven’t done it in many other languages—I could do it in some, but the problem for me is evaluating the results. Even so, I’ve tried some and it translates and extracts very well.”
“Are there any questions in the room?”
“So, Patrick Ruch, Swiss Institute of Bioinformatics in Geneva. You are using a generative model to do information extraction, but in principle people would use an analytic model like BERT rather than a generative one, and the reason they would do that is because rarely, but occasionally, the generative model will hallucinate. You did not mention this issue in your analysis. Are there cases where there is hallucination?”
“There were some in the past that I noticed. There’s a parameter you can set—and this is something you can try in the playground—called, I’d just forgotten it off the top of my head… temperature. If you set that to zero the model is more deterministic, so you lose the randomness and are less likely to get hallucinations. But you also get errors of course, and particularly the GPT-3.5 model has a lot more problems—at times it won’t do anything and will return basically nothing. So again, this is why we’re not relying on this alone; it’s a collaboration: the results come back and humans then curate that data.
This is going to be used in the wild—or actually within our group—in about February, as the collection team is moving to a new building so they’re going to have some time spared to actually go through and do all this. And the time taken to do all this and the cost is minimal overall, even though it has a cost. I calculated that even using GPT-4, processing 100,000 images was something like $1,700 in total. It’s not perfect, so there is always work to do, but getting the range of full-text things like habitats, localities and other information into at least roughly the right entities as a starting point is, I hope, good progress.”
“Okay, thank you. But cost-wise, using a BERT model or RoBERTa would cost essentially nothing, apart from having to train it, right?”
“Yeah. Again, I’ve used it because I’m not a machine learning person—I did talk to them, and this was an easy way for me to get started. So I totally agree: you’re going to get much better results, as I said, I think, with fine-tuning, which is basically a similar process. And if anybody has any good ideas—because we have all this data in the collections, I did think we could use that to do this tuning of models.”
“So we have one more question in the chat, and it’s from Mick: ‘Did you consider Chain of Thought prompting or alternatives?’”
“I have considered it, but I guess all of that is a question of time and cost and so on, as you’re charged for the amount of tokens—effectively the amount of text—that you put in and get back. And I wanted easy-to-handle responses, so either tabular or JSON or some other format; that was my intention. But I’m sure there are other ways of doing it, so feel free to experiment and tell us your results.”
“Great, thanks. I think we need to move on to the next speaker. Thanks very much, that was interesting.”
[Applause]
Alan presenting his talk on “Improving Biological Collections Data through Human-AI Collaboration” at the 2023 Biodiversity Information Standards conference in Hobart.
October 2022: CINTEL Meet and Mingle event, Canberra. At the beginning of the CINTEL project, all early career researchers put together a “pitch” to present their projects. In the video below, Alan presents his initial ideas for AI-based collaborative digital collections curation, followed by some discussion. Pete also provides an overview of the project and was joined by Alan for further discussion. See the inspiring presentations and discussions below.
Overview of Alan’s presentation as a sketchnote:
Next up we have, joining us remotely, Alan Stenhouse, who completed his PhD in ecology with research on improving the quality of biodiversity data collected using citizen science projects. As part of his PhD research, he developed species observation recording apps—I believe there’s an echidna one that I’m familiar with that you were involved in, Alan. He’s also had a wide-ranging career in software development internationally, with a later focus on mobile applications. So Alan, thank you so much for joining us from afar. Where are you today?
I’m in Adelaide. Yeah, sorry not to be there in person; just a bit of the precautionary principle going on with some sort of cold symptoms. So I’m very sorry not to be meeting everybody.
No, not a problem. Thanks to the technology that we’ve managed to develop for events over the last few years, you can still be with us, so it’s lovely to have you here. Shall we take a look at your video, Alan?
Yeah. Have a look.
To boldly go where no one has gone before. Sound familiar? Captain Kirk recently went to space for the first time: the actor William Shatner, at the age of 90, ventured where few have ever been, and recounts how surprised he was.
“It was among the strongest feelings of grief I have ever encountered. The contrast between the vicious coldness of space and the warm nurturing of earth below filled me with overwhelming sadness.
Every day we are confronted with the knowledge of further destruction of earth at our hands, the extinction of animal species, of flora and fauna, things that took five billion years to evolve, and suddenly we will never see them again because of the interference of mankind. It filled me with dread. My trip to space was supposed to be a celebration; instead it felt like a funeral.”
So what has this got to do with us? Well, we are the caretakers of earth for future generations. In our biological collections, we have plant and animal specimens that can provide incredibly valuable information about our world. However, to make better use of these, we first need to get the specimens into digital format.
This can be expensive and takes a lot of time and effort. There are over 15 million specimens in the national research collections within CSIRO, and between two and four billion specimens in collections worldwide. In addition, scientists with the expertise to name, describe, and classify living things are lacking.
These experts in species identification and classification are vital in addressing key issues in society, including climate change, biosecurity, and environmental and human health. So how can we help them?
As part of the CINTEL Future Science Platform, I’m looking at the concept of AI-based digital curators and how human curators might work with them. This collaboration between human and digital curators is vital for enhancing the value of our biological collections.
I aim to develop the concept of the digital curator or a set of intelligent digital curatorial assistants and methods of collaborating with them to enhance and combine our complementary skills.
I am in the beginning stages of my research, and I’m initially aiming at improving the speed and accuracy of digitization processes using AI methods such as computer vision and natural language processing.
As we know that AI won’t always be correct, developing suitable methods of communication and interaction is key to providing a useful and usable system in the future.
As these digital biological collections are enhanced through connections to genetic, environmental, and other data sets, I aim to enable humans and AI to collaboratively explore and examine broader uses of this data to address global challenges, in areas ranging from human and environmental health, to engineering at large, small and nano scales using biomimicry, through to adapting and responding to climate change.
Finding new ways to support and enhance these collaborative processes can ensure we unlock the amazing value stored in our global biological collections. Finally, to slightly modify Captain Kirk’s mission statement: Collaboration with AI, a new frontier. These are the voyages of the CINTEL group.
I’d like to thank Pete Thrall and our team from the National Research Collections and Data61, who I’m working with on this project. And if anyone here would like to collaborate, please get in touch. Thanks.
Fantastic, Alan, well done. Now are you focused on any of the national collections in particular?
Yeah, at the moment there’s a lot of work going on in both the herbarium (so the plant collection) and the insect collection, so a lot of digitization effort is there. That’s where I’m focusing my attention currently, but the outcomes should be applicable across collections and hopefully into other domains as well.
Fantastic. And you mentioned a set of digital curatorial assistants. Can you go into more about what you mean by that?
Yeah, I guess I meant that I’m aiming to develop a set of tools utilizing AI in particular contexts, which humans can use depending on their requirements. For example, currently human intelligence is used for most parts of the digitization process, and that includes some complicated and repetitive parts such as metadata transcription. What I mean by metadata: when you’ve got a specimen, a plant or insect or other specimen, it often has a label attached to it. This label could be, in our collections anyway, normally in English, but it could be in other languages. It’s often typewritten, but it could also be handwritten, and it has a range of specific data. We want to capture all of it, but we also want to capture specific bits out of each of those labels. So given the nature and scale of the task, it would be nice to use some interactive AI-based tools to help with that. And that’s probably an easy example—I say it’s easy, but it’s maybe not quite so easy as we might think. And there are many more examples of these curatorial assistants.
Sounds like they’d save a lot of time.
Yeah, certainly hope so. Yeah.
So we’ve got a question from the audience. Alan. To what extent is AI used for classification of collections currently, and how do you see it developing into a more collaborative version?
I don’t think it’s used for the classification of species, because if they’re already in the collections then, as far as I know at least, they’re generally already classified. But, sorry, there’s a plane going over here at the... oh.
That’s okay. We can’t hear the plane.
Okay, good. Species classification from images is now much more widely available, but generally those species classifiers are mostly very narrow, so the context is very important. I think that will almost definitely be one of the curatorial assistants in the future, whereby we may be able to use those classifiers to go through, for example, a whole collection of images and see where things have possibly been misclassified, because there is a huge amount of material and we can’t go through it all by hand. Also, species names sometimes change, which is another issue, so we may want to go through the metadata; and when new species are identified, these may already be in collections. When we think about worldwide collections, the scale is very large. So I think that is quite an apt assistant, I guess, that we will have in the future.
Absolutely. Got another audience question here. It says: Alan, I perceive your work as having incredible power to present AI for social good. Do you have plans to present your work to a wider social audience? Your work can change public perceptions.
I have to say that’s one of my big motivations for doing this work, and for changing my career from software to—well, ecology was my PhD, but applied software in that area. I haven’t thought about it in those terms because it’s very early days for me here. But I do think our biological collections are marvellous. Nature and our environment all around us is just incredible, and we don’t value it enough; it is our life support system. And this is the point I was trying to make with the opening of my video: how going into space is a marvellous achievement for mankind, and yet we don’t even know everything about our own world, and we’re in the process of destroying it. These people who go into space and look back and go, oh, look at that beautiful thing down there—this has a name, apparently: the overview effect. When you get out and see what it is we have, we suddenly appreciate it more.
Alan, do you have any guiding principles for defining the form of collaboration between humans and AI?
That’s a very good question, and I think answering it is a key part of our general research. I guess I have a personal concept for interaction in general, which I call the three Cs: control, consistency and context. I think those will prove useful here, and they probably have some tie-ups with trust and so on, which we’ve talked about previously. From my perspective, control implies the human, the user, is in control and thus is responsible, which I think is a key issue for AI use. Consistency implies some, not necessarily standard, but consistent system behaviours depending on the situation. And when I say the situation, that’s the third C, the context, or at least part of the context. There are lots of details underlying those three overall Cs, and if I were to add another two: communication and collaboration, which are obviously key parts of our whole FSP.
Absolutely. Well, thank you so much, Alan. Appreciate your time.
Thank you.
Alan’s postdoc pitch presentation from the CINTEL Meet and Mingle event in October 2022.
Next up is the National Collections and Marine Infrastructure.
Let’s watch the video.
Hi everyone. We’re from the National Research Collections Australia.
Today we’re going to talk about a CINTEL FSP project that aims to help us more effectively manage and unlock the vast amounts of biological data contained in these collections. But before we do that, we want to give you a little bit of background about what the collections actually contain.
Our biological collections consist of seven major collections, mostly in Canberra; we have a couple in Hobart and another collection, the Tropical Herbarium, up in Cairns. Altogether, these collections—covering plants, mammals, birds, insects, fish and algae—total about 15-plus million specimens.
We also have a new five-year strategy. This strategy aims to take the national research collections forward into the 21st century and position us at the forefront of collection-based biodiversity research. There are four main pillars to this strategy: securing the collections, enhancing and value-adding to the collections, mobilizing the data in these collections, and delivering the data in these collections to address a whole range of national and international challenges. Specimen curation, digitization, data mobilization, and new analytic approaches play key roles across all of these areas.
And our collections are very diverse. We have a very large number of specimens, but we also have a lot of different kinds of data. Each specimen we have is linked to metadata that we need to database properly and also curate.
And on top of that, we are creating new data that we attach to each of our specimens: we are generating images, genome sequences, and CT scans. And as we generate more data, it becomes more difficult to manage it all.
For example, in the herbarium we are undertaking a mass digitization of the collection, and we are going to generate more than 200 terabytes of data. And this is only the beginning.
So with this big data we have what we call the digital transformation, and we have a challenge, right? Because our specimens hold information on when the specimen was collected; they also hold information on what the specimen looks like—its measurements, its colours—and they include genomic data.
So what are we going to do with all this new information? The research community is relying on museums to answer a lot of different kinds of questions.
And we need to decide what to do first. We need to prioritize, right? Because we can’t do everything. So how do we ensure that this data and information is of high quality, discoverable and useful for the community, and effectively link it with global biodiversity expertise and initiatives?
Our data, if mobilized, will be useful for tackling a diversity of challenges at both the national and the global scale. However, because we have so many specimens, it is very difficult for humans alone to manage them effectively. And because we’ve got so many different taxonomic groups, there just aren’t enough experts to cover all of them.
So our vision is that we can use collaborative intelligence to combine human expertise and decision-making with intelligent scripted and AI colleagues, so to speak, to help us tackle this challenge.
I want to show you very quickly the kind of data that we have and what it is used for, just to give you a bit of an idea. Now, I come from the herbarium myself, so my example is drawn from there—this is a herbarium specimen—however, the same applies, with only slight differences, across all our collections, both in CSIRO and outside. This specimen here, as you can see, has got a bunch of labels on it, and it has got the plant itself.
Data has been derived from this specimen across a lot of fields: everything from its identification—the species name, where and when it was collected, and by whom—to its morphology. As you can see here, for example, there is a little label that says this particular plant is actually the reference for a drawing in the Flora of Australia, just as an example.
And these different kinds of data, in different combinations, are used for science in every area from biogeography across evolutionary biology, ethnobotany, and even genomics or community and landscape ecology.
However, before we can make our data maximally useful for tackling these challenges and for research, we have to ensure that it is as correct, secure and consistent as possible. And there we’ve got a great chain of challenges: as mentioned, we’ve got millions upon millions of specimens, both in CSIRO and in other organizations.
And we have limited human capacity to find inconsistencies and errors. There are many different kinds of errors, but I’m again showing two problems from my world here on the right. There are data entry errors, even in things like geospatial coordinates: in this case, you see that a specimen of a daisy species that occurs in the Blue Mountains has been geocoded as being in the ocean. And under that you see an example where duplicates of the very same plant specimen that have been distributed to different collections have slightly different data associated with them and have even been identified as two different species.
It is an extreme challenge for us humans to keep millions and millions of data points accurate, consistent and coherent across collections. This is where dedicated machine intelligence and other automated approaches would really make a difference to us.
So we’ve outlined during this talk some of the general curation problems that we have. We have limited staff resources available to manage ever-growing collections and their associated metadata. In addition, we’re in the process of accelerating the digitization of collections, which means a huge amount of data of that type, but also greater and greater demands for collection data to be used by researchers and end users.
The sheer volume of data and the need to manage data quality is beyond our capacity to handle as humans alone.
So what can CINTEL do for us? CINTEL is going to help us realize the concept of building a digital curator as a way to grow our capacity and help us mobilize and add value to the specimen data in these collections. CINTEL offers a platform to test these kinds of ideas, with the aim of generating a usable tool.
The outcomes of this will be to enable more streamlined collection management and error checking, increase our ability to effectively use the specialized skills and knowledge of human curators, improve the usability, reliability and robustness of our data, and generate new research directions and approaches down the track. Further into the future, it might be possible to do other things to really accelerate and add to the collection science we do around these collections: machine-driven linking of data layers, for example linking together genomes and environmental data to realize the extended specimen concept, or an interactive machine-human interface to engage the global taxonomic network.
There is also increasing our ability to compile and deliver specimen data, images and other associated information to researchers. Perhaps even AI-assisted species identification is a possibility in the future. And ultimately, seamless data delivery to researchers, end users and aggregators such as the Atlas of Living Australia.
So one way that we like to think about this is that what we’re about here is creating a digital colleague. Imagine, if you will, a team member who’s just like any other team member, except that they’re digital rather than a human being. Some of the underpinning questions this project will address include: what capabilities would we want this team member to have? What would their job description look like? How would they work with other team members? What would facilitate trust between digital and human colleagues? And how would these team members, both human and digital, learn and evolve together?
We’re very excited to have the opportunity to tackle some of these questions as part of the CINTEL FSP, and we really are looking forward to seeing where this project takes us. Thank you.
Now, to talk more about this project, please welcome Pete Thrall, who is the group leader for digitization within CSIRO’s National Collections and Marine Infrastructure business unit. And welcome back, joining us remotely, Alan Stenhouse—if we have him there. There we are; we’ve all made it. The technology is working. It’s fantastic. Hello. Welcome.
Hi, thanks.
Thank you so much for that presentation. I think it’s really obvious to see how CINTEL could be beneficial in the work that you’re doing, but what challenges specifically do you think a digital curator will bring to the collections workflows, and how would you overcome them?
Well, there’s one basic challenge, which is just trying to decide what to do first, and we’re still working through that. I think trust is a key one. The data represented by the specimen collections is irreplaceable; we need to make sure it’s right. Building that trust between the human curatorial and collection staff and what this digital curator would do is, I think, a really key first step. And there’s the question of what kinds of jobs we would want to give that curator in the first instance, what the interface would need to look like, and how it would need to work in a way that maximizes the ability to grow that trust, so that we start to delegate more and more kinds of tasks to that digital capability. I think that’s probably a really key underpinning one.
Yeah, absolutely. So Alan, how do you see the interaction between human and machine curator working in the collection?
Yeah, well, I guess that’s one of the key questions, as I probably said in my talk earlier, across a lot of our domain areas of exploration. The interaction and communication between human and digital curator is part of what I’m hoping to address or explore, and it will lead to further research questions. Obviously we’ll make some digital curators that help automatically with some task, whether it’s data quality checking, curation, or helping with the metadata: the curator does something and then communicates the results back to us. As I also indicated with control, consistency and context: we provide the human expertise to assist, and we effectively control the end result. So when it comes back to us with results and we can verify that, yep, that’s looking pretty good, that trust level starts building up. But we should also ensure that it asks questions and shows its level of confidence when it gives us a result, so that we learn when we need to pay attention to these interactions. This two-way communication is very important, and it also helps improve the AI curator’s model over time.
Absolutely. That makes perfect sense. Now, I imagine a future where you’ve got this perfectly operational digital curator that is absolutely perfect—I did say it’s an imaginary future. Ideally, which one task would you just love to be able to hand over and never have to deal with as a person again?
Well, I’ve got a thought about that, although the collection folks out in the audience may have other thoughts. I’m not a collection scientist myself. As we mentioned in the talk, there’s this vast amount of data in these collections: specific data on the specimen labels, associated genetic data, associated environmental data, et cetera. And it’s impossible for these data sets not to have errors in them. We do our best, but it’s beyond human capacity to really manage that effectively. I guess for me the ideal would be to have that digital curator able to roam around at will in the databases and, you know, find these anomalies, fix the things it can, and let us know which things it can’t fix and needs help with. Maybe even help with preliminary identification of specimens. And really be that valued team member that’s helping to manage the databases in a way that ensures we are delivering the highest possible quality data we can, to tackle some of the kinds of environmental issues that we’re constantly being asked to address.
Love it. Well, I’d love to chat with you more, but unfortunately we are out of time. (No worries.) But please join me in thanking Pete and Alan, everyone. Thank you.
Pete’s presentation providing an overview of the CINTEL Collaborative Collections Curation and Management project.
The Team
Pete Thrall, Alan Stenhouse, Brendan Lepschi, Federica Turco, Juanita Rodriguez Arrieta, Alexander Schmidt-Lebuhn, Nicole Fisher, Emma Toms