We apply our expertise in human factors, psychometrics, machine learning, information retrieval, temporal statistical analysis, and codifying natural language to draw insights from a variety of data sources. These insights help identify disease outbreaks (e.g., from social media), adverse side effects, suicidal ideation, or the best treatment for a specific patient.

Current projects:

Human Factors in eHealth:

In these cross-Business Unit projects, we conduct human-centred design research for new health service technologies and health assessment tools. The main goal of this work is to create eHealth technologies that lead to better health outcomes. For example, we are developing a mobile speech assessment and decision support tool that child carers can use to test for potential speech disorders in young children. We use gamification to encourage children to speak a set of specific words, and we develop novel signal processing and machine learning algorithms to assess speech development problems. In prior work, we developed brain-computer interfaces (using EEG and eye-gaze tracking) for physical rehabilitation.

Syndromic Surveillance from Social Media:

The World Health Organisation states that more than 60% of early signals for epidemic outbreaks come from unofficial information sources such as social media. The goal of this project is to harness social media posts for early detection of epidemics that affect the community. We combine time series monitoring algorithms with advances in computational linguistics. We have worked on detecting syndromes such as influenza, using a statistical monitoring algorithm based on time-between-events. We are currently investigating acute disease events (i.e., sudden disease outbreaks such as the thunderstorm asthma outbreak in Melbourne in November 2016) and pandemics (i.e., widespread disease outbreaks such as the Ebola outbreak in Africa in 2014). In 2018, we also won first place in an international shared task on detecting vaccination usage from tweets.
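To make the time-between-events idea concrete, here is a minimal sketch in Python. It is not the project's algorithm: it assumes that, under normal conditions, the gaps between relevant posts are exponentially distributed with a known baseline mean, and it raises an alert after a run of unusually short gaps. The function names, the `alpha` quantile, and the `run_length` rule are all illustrative assumptions.

```python
import math


def short_gap_threshold(baseline_mean_gap, alpha=0.05):
    """Gap length below which only a fraction `alpha` of baseline gaps fall.

    Assumption: baseline gaps are exponentially distributed, so
    P(gap < t) = 1 - exp(-t / mean), which gives t = -mean * ln(1 - alpha).
    """
    return -baseline_mean_gap * math.log(1.0 - alpha)


def flag_bursts(timestamps, baseline_mean_gap, alpha=0.05, run_length=3):
    """Return indices of events that complete a run of consecutive short gaps.

    `timestamps` are event times in seconds, in ascending order.
    The run-length rule is a hypothetical choice to reduce false alarms.
    """
    threshold = short_gap_threshold(baseline_mean_gap, alpha)
    alerts = []
    run = 0
    for i in range(1, len(timestamps)):
        gap = timestamps[i] - timestamps[i - 1]
        # Extend the run on a short gap, otherwise reset it.
        run = run + 1 if gap < threshold else 0
        if run >= run_length:
            alerts.append(i)
    return alerts


# Example: posts normally arrive about every 10 minutes (600 s),
# then several arrive within seconds of each other.
events = [0, 600, 1300, 1310, 1320, 1325, 2000]
alerts = flag_bursts(events, baseline_mean_gap=600.0)
```

Monitoring the gaps rather than daily counts means an alert can be raised as soon as a burst of posts arrives, without waiting for a fixed reporting window to close.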


Medical Decision Support:

As part of the Precision Health Future Science Platform, we investigate how search, information retrieval, and natural language technology can be applied to decision support within the medical domain. Our research covers two aspects:

(1) Precision medicine: Given a patient profile (demographics, disease, genes, or other relevant personal information), we retrieve scientific evidence or suitable clinical trials that can help patients and doctors improve or maintain the patient's health.

(2) Clinical decision support: Clinical Decision Support (CDS) systems aim to assist clinicians in their daily decision-making on diagnosis, tests, and treatments by providing relevant evidence from the scientific literature. This promise, however, is yet to be fulfilled: searching for literature relevant to a given patient condition is still an active research topic. As part of this research, we have developed a publicly available system called Apples to Apples.

  • Apples to Apples (A2A) Platform:
    We developed a platform called Apples to Apples (A2A) to facilitate experimentation and hypothesis testing for information retrieval and BioNLP researchers working on clinical decision support. It provides a wide range of query and document processing techniques that have been explored in the biomedical search domain. This platform is made public for researchers globally through the following link:
    The platform currently uses datasets from the TREC CDS track, an international shared task run by NIST (National Institute of Standards and Technology) from 2014 to 2016 to address this research gap.


Informatics for Radiology:

This work covers several projects in collaboration with CSIRO, university, and health sector parties. Our activities related to radiology are divided into two parts:

(1) processing radiology reports to identify cases of abnormality as well as assigning diagnostic codes to the reports; and

(2) generating radiology reports from x-ray images.

An example of a generated radiology report is shown below (credit to Sonit Singh’s system):

[Image: chest X-ray]

Radiologist report: No acute cardiopulmonary abnormality. No pleural effusions. No pneumothorax. No focal areas of consolidation. Heart size within normal limits. Osseous structures intact.

Our machine-generated report: No evidence of active disease. The heart size and pulmonary vascularity appear within normal limits. The lungs are free of focal airspace disease. No pleural effusion or pneumothorax is seen.

Note: The sample x-ray is from the Indiana University Chest X-ray collection. (Demner-Fushman, D., Kohli, M. D., Rosenman, M. B., Shooshan, S. E., Rodriguez, L., Antani, S., Thoma, G. R., & McDonald, C. J. (2016). Preparing a collection of radiology examinations for distribution and retrieval. Journal of the American Medical Informatics Association, 23(2), 304-310. doi: 10.1093/jamia/ocv080).


Completed projects:

Diagnostic Coding Using ICD-10 Codes:

In collaboration with CSIRO Health and Biosecurity and the NSW Ministry of Health, this project sought to improve the classification of death certificates for major diseases such as diabetes, influenza, pneumonia, and HIV. Machine learning and keyword-matching techniques were trialled and validated, achieving high accuracy.
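As a rough illustration of the keyword-matching side of this kind of coding, the Python sketch below maps free-text cause-of-death entries to ICD-10 code blocks for the four diseases named above. It is a simplified assumption, not the project's rule set: the patterns are hypothetical, the code blocks are the broad ICD-10 ranges for each disease, and real certificates would need handling for negation, abbreviations, and misspellings.

```python
import re

# Hypothetical keyword patterns mapped to broad ICD-10 code blocks.
# These rules are illustrative only and far from complete.
KEYWORD_RULES = {
    "E10-E14": [r"\bdiabet\w*"],                       # diabetes mellitus
    "J09-J11": [r"\binfluenza\b", r"\bflu\b"],         # influenza
    "J12-J18": [r"\bpneumonia\b"],                     # pneumonia
    "B20-B24": [r"\bhiv\b",
                r"\bhuman immunodeficiency virus\b"],  # HIV disease
}


def match_icd10_blocks(text):
    """Return the sorted ICD-10 blocks whose keywords appear in `text`."""
    lowered = text.lower()
    return sorted(
        block
        for block, patterns in KEYWORD_RULES.items()
        if any(re.search(pattern, lowered) for pattern in patterns)
    )


blocks = match_icd10_blocks(
    "Pneumonia following influenza; type 2 diabetes mellitus"
)
```

A rule table like this is easy to audit, which is one reason keyword matching is often trialled alongside machine learning classifiers for coding tasks.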

Informatics for Pharmacovigilance:

  • Adverse Drug Event Detection for Pharmacovigilance:
    Social media is becoming an increasingly important source of information to complement traditional pharmacovigilance methods. To identify signals of potential adverse drug reactions, it is necessary to first identify medical concepts in the social media text. The project tested a range of existing concept identification methods in a controlled setting on a known corpus. The project also developed CADEminer, which mines consumer reviews for adverse effects.

  • Duplicate Detection in Spontaneous Reporting of Adverse Events:
    Data duplication is a significant problem in databases of adverse drug reactions, as reports often come from a variety of sources. The databases are large, which makes duplicate detection expensive. The project developed a parallelized method for de-duplication using Spark.

  • CSIRO Adverse Drug Event Corpus (CADEC):
    CADEC is a richly annotated corpus of medical forum posts on patient-reported Adverse Drug Events (ADEs). The corpus is sourced from posts on social media, and contains text that is largely written in colloquial language and often deviates from formal English grammar and punctuation rules. Annotations contain mentions of concepts such as drugs, adverse effects, symptoms, and diseases, linked to their corresponding concepts in controlled vocabularies, namely SNOMED Clinical Terms and MedDRA. The quality of the annotations is ensured by annotation guidelines, multi-stage annotation, measurement of inter-annotator agreement, and a final review of the annotations by a clinical terminologist. This corpus is useful for studies in information extraction, or more generally text mining, from social media to detect possible adverse drug reactions from direct patient reports. The corpus is publicly available at

Social Media Analytics for Mental Health Research:

In this body of work, we ask how social media data can:

(1) supplement and support research in mental health; and
(2) facilitate the development of new mental health services.

We studied whether social media data (Twitter) can provide a real-time measure of emotion and whether this measure correlates with existing mental health metrics. We also examined whether suicidal ideation can be detected on Twitter, both to support research in mental health (for example, to study stigma around suicidal ideation) and to investigate socio-technical issues in health services around suicidal ideation detection. Quick link:

Relevant expertise:

  • Human factors
  • Psychometrics
  • Machine learning
  • Information retrieval
  • Temporal statistical analysis
  • Codifying natural language