Project 5

August 8th, 2023

Interpretability of Vision and Language Models for Effective Human-Robot Teaming

Project location:

This project can be completed from the following sites: Clayton (VIC), Black Mountain (ACT), Pullenvale (QLD)

Desirable skills:

 – Strong linear algebra and computer vision foundations
 – Basic understanding of machine learning algorithms, deep learning models, and visualization
 – Familiarity with 2D object detection, semantic and/or instance-level segmentation
 – Strong expertise in at least one of Python or C/C++ in a Linux environment (required)
 – Development experience with TensorFlow or PyTorch (required)
 – Experience with distributed and/or robotics computing (Slurm, ROS) (preferred)

Supervisory project team:

Zeeshan Hayder, David Ahmedt, Stano Funiak, Hashini Senaratne and Dana Kulic

This project will involve external collaboration with ANU, Monash University, and other Australian universities.

Contact person:

Research Scientist, Data61

Project description:

Harnessing the collaborative potential of vision and language models is an emerging research area with applications in visual storytelling, content understanding and moderation, and remote robotic operations. Traditional multi-modal learning methods are limited by fixed fusion strategies, a lack of cross-modal reasoning and interpretability, and a dependency on domain-specific data. Furthermore, information asymmetry (the inability to communicate raw measurements and the full user intent) prevents the use of these models in decentralised settings. This project aims to leverage the synergy between vision and language (V-L) models to address novel tasks, such as human-robot teaming for remote robotic operations and cross-modal image retrieval, enabling more informed and accurate decisions in challenging and dynamic environments. We study representation learning of V-L models, using multi-modal prompt learning and model fine-tuning to capture the nuances of and relationships between vision and language, and we build infrastructure to enable rapid evaluation through a combination of simulated, hybrid, and physical human-robot collaboration experiments. We also study the use of V-L models at inference time in the presence of information asymmetry, focusing on the high-level control of a team of robots and on the use of learned V-L representations to communicate the operator's intent to the robot and a summary of the environment to the operator.

The scholarships will support four PhD students with a multidisciplinary supervision team drawn from three CPS research groups.

Projects 1 & 2: The research in these projects will be focused on developing novel vision and language (V-L) learning methods to detect, classify and recognize actions from large-scale videos, and to develop novel methods for visual sensing and navigation. Aspirational aims include efficient image analytics for prompt learning and to benchmark its performance in the real-world environment. The students will investigate 1) CLIP model’s ability to comprehend both visual and textual inputs without the need for task-specific fine-tuning, 2) VisualBERT model to understand visual content by incorporating image information alongside textual context, 3) DALL-E model to understand and creating visual content from text, and 4) SAM foundation model to learn better semantic contexts. Large-scale analysis of clip, visual Bert, Dall-e and SAM will enable us to effectively interpret the models and learn prompts in a more natural manner. 5)  Dialog-based control using the vision language models.

Project 3: This project will develop and evaluate explainability (xAI) strategies for human-robot supervision. A key research question is how best to represent the robot's knowledge and capabilities: by showing trajectories, by visualising the reward or policy, by visualising the value function (cost-to-go) or features of the reward, or by conveying the level of certainty or inverse capability. We will develop alternative approaches for representing the learnt robot knowledge and evaluate how these representations influence human understanding of the robot's capability, trust, task division, subsequent teaching/feedback, and task performance.
 
Project 4: In this project, we will develop and evaluate communication learning strategies within the context of field robot supervision. First, we will develop metrics for assessing communication quality in a way that is non-disruptive to the user. The proposed approach and metrics will then be evaluated in a series of user studies that will determine the appropriate reward signal for learning and whether models learned for an individual or for a population are more effective.

By developing tools for different disciplines, this project cultivates collaboration across diverse groups within CSIRO as well as two Australian universities. Specific outcomes include:
 • Development of a customised vision-language model tailored for human-robot visual navigation
 • Development of a multimodal fusion technique that processes both visual and textual inputs
 • Development of compact, student-teacher-based models using the latest foundation models
 • Development of a protocol for seamless control of the robot, based on shared understanding between the human and robotic agents
 • Development of a collaborative inference strategy for remote robotic operations (such as in the Arctic or on the Moon), where the robot reasons about its environment and only a high-level summary of the environment is communicated between the robot and the operator.