Project 7

August 8th, 2023

NLP and Information Retrieval in Science.

Project location:

Marsfield or Eveleigh (NSW)

Desirable skills:

Background in Computer Science, with training in Natural Language Processing and machine learning

Supervisory project team:

Maciek Rybinski, Xiang Dai, Sarvnaz Karimi, Necva Bolucu and Stephen Wan

Contact person:

Principal Research Scientist, Data61

Project description:

This project targets core NLP research to improve the performance of tailored GPT solutions for scientific discovery.


Because long text is difficult to model, NLP models and widely used benchmarks usually operate at the sentence or paragraph level. Many popular datasets are sampled from the abstracts of scientific literature. This is problematic because important findings may appear in any part of an article, not only in its abstract. In this project, we aim to construct models capable of processing full-text scholarly articles, surpassing the processing capacity of most current NLP models. This could include research in sub-domains of information extraction, LLMs (Large Language Models), generative NLP and text summarisation.

Activities and expected outcomes include:

1. We aim to benchmark the capability of existing large language models on publicly available entity linking datasets and the expected outcome will be a journal or conference paper.
2. We aim to develop new entity linking methods on top of existing large language models and focus on improving their few-shot and zero-shot learning capability. The expected outcome will be a reusable tool and a corresponding publication.
3. We aim to enhance large language models by injecting factual and relational knowledge so that they can be better at entity linking.
4. We aim to benchmark the capability of existing large language models on publicly available document-level information extraction datasets and the expected outcome will be a journal or conference paper.
5. We aim to develop new retrieval-based methods for long document processing. The expected outcome will be a reusable tool and a corresponding publication.
6. We aim to build a domain-specific benchmark dataset for document-level information extraction tasks. The expected outcome will be cross-discipline collaboration, reusable resources and a corresponding publication.
7. We aim to examine cross-capabilities, i.e., zero-shot and few-shot inference on long documents. The expected outcome will be a corresponding publication.
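
As a rough illustration of the retrieval-based direction in activity 5, the sketch below splits a long document into overlapping chunks and keeps only the chunks most relevant to a query, so that a downstream model with a limited input length sees a short, focused context. It is a minimal sketch under stated assumptions, not the project's method: the function names, the chunk sizes and the simple TF-IDF-style scoring are all hypothetical choices made for this example.

```python
import math
import re
from collections import Counter


def split_into_chunks(text, chunk_size=50, overlap=10):
    """Split text into overlapping windows of whitespace-separated tokens.

    chunk_size and overlap are illustrative defaults, not tuned values.
    """
    if overlap >= chunk_size:
        raise ValueError("overlap must be smaller than chunk_size")
    tokens = text.split()
    step = chunk_size - overlap
    chunks = []
    for start in range(0, len(tokens), step):
        chunks.append(" ".join(tokens[start:start + chunk_size]))
        if start + chunk_size >= len(tokens):
            break  # the last window already reaches the end of the text
    return chunks


def tokenize(text):
    """Lowercase the text and keep alphanumeric runs as terms."""
    return re.findall(r"[a-z0-9]+", text.lower())


def rank_chunks(query, chunks, top_k=2):
    """Return up to top_k (index, chunk) pairs, best match first.

    Scoring is a simple TF-IDF-weighted term overlap (an assumption made
    for this sketch); chunks that share no term with the query are dropped.
    """
    chunk_counts = [Counter(tokenize(chunk)) for chunk in chunks]
    doc_freq = Counter()
    for counts in chunk_counts:
        doc_freq.update(counts.keys())
    n_chunks = len(chunks)
    query_terms = set(tokenize(query))

    scored = []
    for idx, counts in enumerate(chunk_counts):
        score = 0.0
        for term in query_terms:
            if term in counts:
                idf = math.log((n_chunks + 1) / (doc_freq[term] + 1)) + 1.0
                score += counts[term] * idf
        scored.append((score, idx))

    # Highest score first; ties keep document order.
    scored.sort(key=lambda pair: (-pair[0], pair[1]))
    return [(idx, chunks[idx]) for score, idx in scored[:top_k] if score > 0]
```

In practice the selected chunks would be passed to a language model for extraction or summarisation, and the lexical scoring could be swapped for a learned retriever; both of those steps are outside this sketch.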
