Machine Assisted Genome Annotation

We are developing a collaborative human-AI workflow that applies recent developments in machine learning to the problem of genome annotation. The workflow will allow domain experts to efficiently curate automated annotations within an interactive, learning system.

The Challenge

Over the past two decades, the number of sequenced genomes has exploded: completed (assembled) genome sequences now exist for many thousands of species, and for multiple individuals within species. Such assemblies are a necessary but not sufficient input for scientific discovery in fields including health and medicine, agriculture, and biosecurity. In addition to being complete and contiguous, genome sequences require accurate structural and functional annotation. Functional annotation describes how the genome sequence is used to create biological phenotypes. Annotations may vary between species and between individuals, as biological diversity shows.

Meticulous, mostly manual genome annotation (GA), carried out for model species by countless experts over many years, is the gold standard, but it does not scale to multiple large and complex genomes. The alternative, automated GA, relies largely on transferring annotations from one species to another, and the source species is often itself automatically annotated. Biology, however, is complex and messy, making this an error-prone process that should be followed by manual curation. In practice, automated annotations are either not curated at all or only partially curated, with most of the information extrapolated from other species’ annotations. There is an urgent need for a scalable combination of the two approaches, in which the interplay between automation and a domain expert allows accurate and timely annotation of genomes.

Our Response

Researchers from CSIRO’s Collaborative Intelligence (CINTEL) Future Science Platform and the Crop bioinformatics and data science team are developing a collaborative human-AI workflow that applies the latest developments in machine learning (ML) to the problem of genome annotation. The workflow will allow domain experts to efficiently curate automated GA within an interactive, learning system.

The quality of GAs generated by this system is expected to be close to that of manual annotations while requiring far less human expert time, with further improvements expected from the better-labelled data generated through the collaborative process. The specifics of the human-machine interactions will be defined in the early stages of the project. The project will also record the details of all such interactions so that, depending on the type of ML/AI used, they can be fed back into the system. The collected information will be available for downstream review and will form a basis for the expert’s own learning, resulting in collaborative learning. Compared with the traditional curation process, the project envisages efficiency gains, for example from the expert user verifying the annotations of multiple gene family members in a single step. Crucially, in addition to refined labelling of the input data set, which can improve baseline ML performance, the collaborative approach will facilitate learning by both parties, leading to improvements in the AI and in human expert efficiency.

Impact

The applied use of genomics data is fundamental to the goal of increasing crop yields by accelerating selective breeding. Collaborative intelligence can increase the capacity to link functional information and knowledge to gene and regulatory sequences, and so contribute to crop design and selective breeding. A successfully developed collaborative discovery workflow will also be applicable to anyone relying on annotated reference genomes, including researchers in many of CSIRO’s business units (Agriculture and Food, Health and Biosecurity, National Collections and Marine Infrastructure) as well as external users such as breeding companies and the biotechnology sector. More reliable and timely annotation of genomes will accelerate the discovery, selection and editing of gene targets (e.g., for disease resistance) and enable insights into the underlying biological mechanisms.

Team

Rad Suchecki, Alex Whan, Maisie Li, Meredith McNeil

External Collaborators

Dr David Starns

University of Liverpool

Students

Madeline Fechner

University of Adelaide

  • Vacation Studentship: “Prompting Large Language Models for Gene Functional Annotation”