Project 1

August 12th, 2023

Multimodal Large Language Model for Adaptive Knowledge Transfer – From Vision-Language Modelling to Multimodal Learning in the Wild

Project Location:

This project can be completed from the following sites: Clayton (VIC), Eveleigh (NSW), Dutton Park (QLD)

Desirable skills:

  • Machine learning and/or computer vision expertise
  • Proficiency in Python and PyTorch
  • Some knowledge of foundation models and deep learning

Supervisory project team:

Lina Yao, Melanie Ayre, Sarvnaz Karimi, Stephen Wan, Necva Bolucu, Dadong Wang, He Zhao, Andrew Hellicar, and Piotr Koniusz

Contact people:

Senior Principal Research Scientist and Science Lead for Translational Machine Learning

Principal Research Scientist at Data61

Project Description:

Relying on abundant text, image, and paired data, large-scale vision-language models (such as CLIP, Stable Diffusion, and GPT-4) have shown that large-scale foundation models can be built successfully for real-world applications. Existing multimodal learning mainly focuses on a restricted set of modalities, such as vision, language, and audio, for which high-quality training data are easily accessible. However, in many real-world applications, especially in interdisciplinary and scientific research fields, the available data are too limited to support learning a large-scale multimodal model from scratch. For example, it is hard to scale up paired data for analysing the relationship between climate change (and extreme weather) and human activities; collecting large experimental datasets for studying material characteristics is also challenging.


To accelerate the generalization of multimodal learning to understudied domains, we consider whether the knowledge in pre-trained large-scale multimodal foundation models such as CLIP and GPT-4 (and the experience gained in developing these models) can be adapted to more diverse and under-represented modalities in the wild, such as various kinds of tabular, sensor, and cross-media data.


This project aims to study adaptive multimodal learning: adapting and transferring the knowledge learned by a large-scale foundation model to new tasks with limited and non-ideal data. Six sub-challenges and sub-tasks will be studied:

1. We will study the adaptation of large-scale pre-trained models to downstream tasks with the same or similar modalities. Given the similar modality characteristics, this task can be handled by aligning the representation of the new modality with that of the pre-trained model (a minimal alignment sketch is given after this list). For example, starting from a pre-trained vision-language model, such adaptation methods can help to interpret satellite cloud images for weather analysis and prediction.

2. Adapting a pre-trained model to under-represented new modalities with a high level of heterogeneity is commonly required in real-world applications but is more challenging. We will investigate how to identify transferable knowledge about modality interactions and then adapt the pre-trained model accordingly. For example, we can exploit the vision-language and audio-language interactions captured by a pre-trained large model to adapt those interactions to new modalities, such as time-series and tabular data measuring the health status of patients.

3. We will investigate how to continually learn multimodal models from different modalities observed incrementally in a stream. Beyond adaptation alone, we want the model to continually grow and accumulate multimodal knowledge toward a more generalizable model. We will study how to continually develop a shareable representation space for the incrementally accumulated modalities, in which different modalities can interact with each other within a universal model (see the continual-expansion sketch after this list).

4. We will investigate generalised prompt learning for models with diverse modalities. Text-based prompting has been successful for large language models (LLMs) and vision-language models (e.g., CLIP). First, with language models at the centre, we will study how to adapt text prompts and LLM-based querying, understanding, and reasoning, in the form of in-context learning, to tasks with diverse modalities (e.g., managing and generating preliminary analysis reports for images, audio, and signals from astronomical observations). Second, we will investigate generalising the prompt from text to more modalities (a prompt-learning sketch follows this list). For example, in the prediction of material characteristics (e.g., the density of states), apart from the material property and energy level, the interaction type between material and energy needs to be specified, which can be modelled as a general type of prompt.

5. We will explore inter-modality semantic correlation. First, cross-media data come in various formats, such as text, images, and audio, which introduces diverse data representations. Second, these cross-media data come from heterogeneous sources on different media platforms, such as YouTube and Netflix. How to capture their semantic correspondences, align these diverse representations, and map them into a shared feature space therefore still requires further research on cross-media semantic interactions.

6. Cross-media retrieval is a promising and practical task in big-data applications. It takes text queries to retrieve matching visual data, which requires bridging the semantic gaps between different modalities (a minimal retrieval sketch follows this list). Large language models (LLMs) have shown strong capabilities in text generation, document summarization, and question answering, and have considerable potential in cross-media domains for helping models better understand linguistic content by supplying richer textual information. Cross-media retrieval also has vast application prospects in video search, zero-shot learning, and e-commerce product search.
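
As a concrete starting point for sub-task (1), the following minimal PyTorch sketch illustrates the alignment idea: a new-modality encoder is trained with a CLIP-style contrastive loss against features produced by a frozen pre-trained encoder. The NewModalityEncoder module, the dimensions, and the random placeholder tensors are illustrative assumptions, not the project's final design.

import torch
import torch.nn as nn
import torch.nn.functional as F

class NewModalityEncoder(nn.Module):
    """Encodes a new modality (e.g. multi-channel satellite imagery features)
    into the pre-trained model's embedding space. Illustrative placeholder."""
    def __init__(self, in_dim: int, embed_dim: int = 512):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(in_dim, 1024),
            nn.ReLU(),
            nn.Linear(1024, embed_dim),
        )

    def forward(self, x):
        return F.normalize(self.net(x), dim=-1)

def alignment_loss(new_emb, anchor_emb, temperature=0.07):
    """Symmetric InfoNCE loss between paired new-modality embeddings and
    anchor embeddings from the frozen pre-trained model; both inputs are
    L2-normalised tensors of shape [batch, dim]."""
    logits = new_emb @ anchor_emb.t() / temperature
    targets = torch.arange(new_emb.size(0), device=new_emb.device)
    return 0.5 * (F.cross_entropy(logits, targets)
                  + F.cross_entropy(logits.t(), targets))

# Usage: anchor_emb would come from a frozen pre-trained encoder (no gradients);
# only the new-modality encoder is updated.
encoder = NewModalityEncoder(in_dim=2048)
new_emb = encoder(torch.randn(8, 2048))                  # paired batch of new-modality inputs
anchor_emb = F.normalize(torch.randn(8, 512), dim=-1)    # stand-in for frozen CLIP-style features
loss = alignment_loss(new_emb, anchor_emb)
loss.backward()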
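
For sub-task (3), a minimal sketch of the continual-expansion idea: each newly observed modality gets its own encoder mapping into the shared space, while previously learned encoders are frozen so that accumulated knowledge is retained. The UniversalMultimodalModel class and the placeholder encoders are assumptions for illustration only, not a committed design.

import torch
import torch.nn as nn

class UniversalMultimodalModel(nn.Module):
    """Shared representation space that grows as modalities arrive in a stream."""
    def __init__(self, embed_dim: int = 512):
        super().__init__()
        self.embed_dim = embed_dim
        self.encoders = nn.ModuleDict()          # one encoder per modality

    def add_modality(self, name: str, encoder: nn.Module):
        """Register an encoder for a newly observed modality and freeze all
        previously learned encoders so earlier knowledge is retained."""
        for old in self.encoders.values():
            for p in old.parameters():
                p.requires_grad = False
        self.encoders[name] = encoder            # only this encoder stays trainable

    def encode(self, name: str, x: torch.Tensor) -> torch.Tensor:
        """Map an input of the named modality into the shared space."""
        return self.encoders[name](x)

# Usage: start from encoders tied to the pre-trained model, then add encoders
# for sensor or tabular data as they become available; each new encoder is
# trained (e.g. with the alignment loss above) while the older ones stay frozen.
model = UniversalMultimodalModel()
model.add_modality("vision", nn.Linear(768, 512))        # placeholder vision head
model.add_modality("sensor", nn.Linear(64, 512))         # placeholder sensor encoder (trainable)
shared = model.encode("sensor", torch.randn(8, 64))      # [8, 512] in the shared space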
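
For sub-task (4), a minimal sketch of prompt learning in the style of CoOp: a small set of continuous context vectors is learned while the pre-trained text encoder stays frozen, and the same mechanism can be generalised by treating task conditions (such as the material-energy interaction type) as prompt tokens. The module and dimensions below are illustrative assumptions.

import torch
import torch.nn as nn

class LearnablePrompt(nn.Module):
    """Learnable context vectors prepended to token embeddings (CoOp-style)."""
    def __init__(self, n_ctx: int = 16, token_dim: int = 512):
        super().__init__()
        # Continuous prompt vectors optimised on the downstream task.
        self.ctx = nn.Parameter(torch.randn(n_ctx, token_dim) * 0.02)

    def forward(self, class_token_embs: torch.Tensor) -> torch.Tensor:
        """Prepend the learned context to each class's token embeddings.
        class_token_embs: [n_classes, n_tokens, token_dim]."""
        n_classes = class_token_embs.size(0)
        ctx = self.ctx.unsqueeze(0).expand(n_classes, -1, -1)
        return torch.cat([ctx, class_token_embs], dim=1)

# Usage: only the prompt parameters receive gradients; the prompted sequence is
# fed to a frozen pre-trained text encoder (e.g. CLIP's) to produce task-specific
# classifier weights.
prompt = LearnablePrompt()
class_embs = torch.randn(10, 4, 512)       # placeholder token embeddings of 10 class names
prompted = prompt(class_embs)              # [10, 20, 512], passed on to the frozen encoder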
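
For sub-task (6), a minimal sketch of text-to-visual retrieval in a shared embedding space: visual embeddings are pre-computed with the image tower of a vision-language model, a text query is embedded into the same space (possibly after LLM-based query enrichment), and matches are ranked by cosine similarity. The random tensors below stand in for real CLIP-style features.

import torch
import torch.nn.functional as F

def retrieve(query_emb: torch.Tensor, gallery_embs: torch.Tensor, k: int = 5):
    """Return the indices of the top-k gallery items for one text query.
    query_emb: [dim], gallery_embs: [n_items, dim]; both live in the shared space."""
    q = F.normalize(query_emb, dim=-1)
    g = F.normalize(gallery_embs, dim=-1)
    scores = g @ q                              # cosine similarities, shape [n_items]
    return scores.topk(k).indices

# Usage with placeholder embeddings (in practice, pre-computed image/video
# features and an embedded, possibly LLM-enriched, text query).
gallery = torch.randn(1000, 512)
query = torch.randn(512)
top5 = retrieve(query, gallery)                 # indices of the 5 best matches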


This project will include four students. Student #1 will mainly contribute to sub-tasks (1) and (2), focusing on developing fundamental adaptive multimodal learning methods with different levels of challenge. Student #2 will mainly contribute to sub-task (3), focusing on continual/incremental learning approaches for diverse modalities in a stream, and sub-task (4), focusing on generalised prompt learning techniques. Student #3 will mainly contribute to sub-task (5), the fundamental methods for exploring and discovering inter-modality semantic correlations. Student #4 will mainly contribute to sub-task (6), validation and evaluation in the context of cross-modality retrieval and related tasks. The research of the four students will support each other, and all students will contribute to the literature review, experiments, and publications.


The project’s outcome, focusing on the unification of adaptive multimodal LLMs and their application in real-world scenarios, aligns with the principles of digital science by embracing the convergence of computational technologies and analytical methods. By integrating vision-language modelling with multimodal learning, the project leverages cutting-edge digital tools to synthesize and interpret complex data across various sensory domains. This enables a more dynamic, responsive learning environment that can be applied in diverse contexts ranging from industry to academic research. Such alignment with digital science not only reinforces the technological aspect of the research but also enhances the transfer and accessibility of knowledge in the digital age, bridging gaps between traditional scientific exploration and contemporary digital innovation.


The expected project outcomes also include publications: submissions to top-tier conferences (CORE A*) and journals (SJR Q1), such as NeurIPS, SIGIR, WebConf (WWW), ICLR, AAAI, KDD, IJCAI, WSDM, CIKM, IEEE TPAMI, and TNNLS.