Data Platforms, Data Link and Model Management

For a growing number of usage scenarios, data analytics and machine learning are effective at handling real-world tasks with little human involvement. Demand for such systems is rising rapidly, yet building one takes substantial effort. These systems are often data- and compute-intensive, which makes timely decision making a challenge; real-world applications also present unexpected situations, and evaluating analytics outcomes or machine learning models learned from known data under new situations, and under varying policy frameworks, is non-trivial. When multiple stakeholders are involved, there are complex trade-offs driven by their differing interests and risk appetites around data sharing, for example, who can see what data, under what conditions, through what types of data connections.

We investigate the processes of machine learning model construction by reviewing existing practices and building concrete data analytics applications. We develop software systems that make these processes efficient and manageable, and validate our solutions by applying them to concrete applications in our target areas and measuring their effectiveness.

Projects and Research Activities

  • Data integration: we develop a fast and accurate Spark-based duplicate detection system, a causality-driven data integration system for drug side-effect discovery, and techniques that exploit data similarity for storage compression and deduplication, so that rapidly growing "similar" data can be accommodated efficiently.
  • “Big data” handling: we develop large-scale distributed systems for radio telescope data processing and apply deep learning techniques (TensorFlow on large GPU clusters) to automate data processing and remove bottlenecks. We collaborate with the Five-hundred-meter Aperture Spherical radio Telescope (FAST) team at the Chinese Academy of Sciences on data sharing, archiving, provenance data collection, multiple-time-series analytics, and pulsar search result ranking methods. We also develop efficient methods for managing shared cloud resources for both data- and compute-intensive workloads.
  • Hierarchically Distributed Data Matrix (HDM): a distributed execution engine that provides a functional meta-data abstraction for Big Data processing. Its built-in planner and optimizer simplify the functional computation graph (10 to 60% faster than Spark 1.6.2, depending on the operation type), and it supports multi-cluster architectures as well as dependency and execution-history management.
  • Knowledge Graph as dataset linkage: information of interest to governments, businesses and consumers, for decision making, crisis handling and supporting economic activity, is often distributed in a long-tail manner across shared data. Existing data sharing portals often rely on basic Elasticsearch-style keyword search over metadata provided by the data owner to identify useful information. This greatly limits the potential of linking shared data from different owners for complex analytics. We propose to use large language model and knowledge graph techniques to enrich metadata and establish deeper links among shared data.
  • Model and Data Co-versioning and Dependency Tracking: distributed ledger technologies (including blockchain) can provide a single shared view of dependency data for many parties. We propose a blockchain-based system that tracks the dependencies between datasets and models at development time and queries/predictions at operation time. The system uses a blockchain to provide an immutable provenance service across the development and operation stages of the ML system, and a graph-based query service at the operation stage.
  • AI Ethics Operationalization: although AI is solving real-world challenges and transforming industries, there are serious concerns about its ability to behave and make decisions responsibly. Existing principles are typically high-level and do not give technologists concrete guidance on how to implement AI responsibly. We therefore propose to develop and operationalize software engineering guidelines for developers and technologists to build AI systems in a responsible way. Such engineering guidelines and responsibility mechanisms can be applied to the data platform.
  • Privacy by Design: privacy patterns have been proposed in both industry and academia as reusable design solutions to general privacy issues. The existing privacy patterns, however, lack a software design perspective, and there has been no thorough investigation of the trade-offs among quality attributes when privacy patterns are applied to a system. We apply privacy patterns to the data platform design and analyse the resulting privacy properties and their impact on other quality attributes, such as performance.
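The duplicate-detection work above combines a blocking step with a pairwise similarity measure. As a minimal, single-machine sketch of that idea (the record texts, threshold, and function names here are illustrative assumptions, not the actual Spark implementation):

```python
def tokens(s):
    """Split a record into a set of lower-cased word tokens."""
    return set(s.lower().split())

def jaccard(a, b):
    """Jaccard similarity between two token sets."""
    return len(a & b) / len(a | b) if a | b else 0.0

def detect_duplicates(records, threshold=0.8):
    """Return index pairs of records judged near-duplicates."""
    # Blocking: only records sharing at least one token become
    # candidate pairs, avoiding the full O(n^2) comparison.
    block = {}
    for i, r in enumerate(records):
        for t in tokens(r):
            block.setdefault(t, []).append(i)
    candidates = set()
    for ids in block.values():
        for x in range(len(ids)):
            for y in range(x + 1, len(ids)):
                candidates.add((ids[x], ids[y]))
    # Verification: keep only pairs above the similarity threshold.
    return [
        (i, j) for i, j in sorted(candidates)
        if jaccard(tokens(records[i]), tokens(records[j])) >= threshold
    ]

# Hypothetical adverse-drug-reaction reports.
reports = [
    "headache after drug A",
    "headache after Drug A",
    "nausea reported with drug B",
]
pairs = detect_duplicates(reports)  # -> [(0, 1)]
```

In the Spark setting, the blocking step maps naturally onto a shuffle by token key and the verification step onto a per-partition pairwise comparison.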
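The knowledge-graph bullet above proposes enriching dataset metadata to link data from different owners. A toy sketch of the linking step follows; in the proposal an LLM would extract the entities, so the fixed lexicon, catalog contents, and function names here are all assumptions for illustration:

```python
# Stand-in entity lexicon; an LLM-based extractor would replace this.
LEXICON = {"rainfall", "recycling", "emissions", "lithium"}

def extract_entities(description):
    """Pick out known entities mentioned in a metadata description."""
    return {w.strip(".,").lower() for w in description.split()} & LEXICON

def build_graph(catalog):
    """catalog: {dataset_id: description} -> set of (dataset, 'mentions', entity) triples."""
    return {
        (ds, "mentions", ent)
        for ds, desc in catalog.items()
        for ent in extract_entities(desc)
    }

def related(graph, dataset_id):
    """Datasets linked to dataset_id through a shared entity."""
    ents = {e for d, _, e in graph if d == dataset_id}
    return {d for d, _, e in graph if e in ents and d != dataset_id}

# Hypothetical shared-data catalog.
catalog = {
    "epa_recycling": "Monthly recycling and emissions figures by region.",
    "grid_emissions": "Hourly grid emissions intensity.",
    "mine_output": "Lithium mine output, quarterly.",
}
g = build_graph(catalog)
```

A query such as `related(g, "grid_emissions")` then surfaces datasets from other owners that basic keyword search over owner-provided metadata would miss.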
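The co-versioning bullet above describes an immutable provenance service plus a graph-based dependency query. A minimal in-process sketch of both, using a hash chain in place of a real blockchain (class and payload names are illustrative assumptions):

```python
import hashlib
import json

class ProvenanceLedger:
    """Append-only, hash-chained log linking datasets, models and
    predictions; a single-node stand-in for the blockchain service."""

    def __init__(self):
        self.entries = []

    def append(self, kind, payload, depends_on=None):
        """Record an artifact and its dependencies; returns its hash."""
        prev = self.entries[-1]["hash"] if self.entries else "0" * 64
        entry = {
            "kind": kind,              # "dataset" | "model" | "prediction"
            "payload": payload,
            "depends_on": depends_on or [],
            "prev": prev,              # link to the previous entry
        }
        entry["hash"] = hashlib.sha256(
            json.dumps(entry, sort_keys=True).encode()
        ).hexdigest()
        self.entries.append(entry)
        return entry["hash"]

    def verify(self):
        """Recompute every hash to detect any tampering with the chain."""
        prev = "0" * 64
        for e in self.entries:
            body = {k: v for k, v in e.items() if k != "hash"}
            digest = hashlib.sha256(
                json.dumps(body, sort_keys=True).encode()
            ).hexdigest()
            if e["prev"] != prev or e["hash"] != digest:
                return False
            prev = e["hash"]
        return True

    def lineage(self, h):
        """Graph-style query: walk dependency edges back to the roots."""
        by_hash = {e["hash"]: e for e in self.entries}
        out, stack = [], [h]
        while stack:
            e = by_hash[stack.pop()]
            out.append((e["kind"], e["payload"]))
            stack.extend(e["depends_on"])
        return out

# Development-time artifacts, then an operation-time prediction.
ledger = ProvenanceLedger()
d = ledger.append("dataset", "adr_reports_v1")
m = ledger.append("model", "dedup_classifier_v2", depends_on=[d])
p = ledger.append("prediction", "query_42", depends_on=[m])
```

`ledger.lineage(p)` traces a prediction back through the model to the dataset it was trained on, which is the cross-stage dependency view the proposed system provides to all parties.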

CSIRO Mission Alignment

The project aligns with the Missions that involve the supply chain of an industry, or multiple stakeholders who need to share and process their data collaboratively, such as Ending Plastic Waste, Net Zero, and Critical Energy Metals.

Relevant Publications

  • R. Nadeem, H. Wu, H. Paik, C. Wang, A Case Based Deep Neural Network Interpretability Framework and Its User Study, International Conference on Web Information Systems Engineering, 147-161.
  • S. Verma, C. Wang, L. Zhu, W. Liu, Towards Effective Data Augmentations via Unbiased GAN Utilization, Pacific Rim International Conference on Artificial Intelligence, 555-567.
  • S. Verma, C. Wang, L. Zhu, W. Liu, A Compliance Checking Framework for DNN Models, Proc. 28th International Joint Conference on Artificial Intelligence.
  • H. Wu, C. Wang, J. Yin, K. Lu, L. Zhu, Sharing Deep Neural Network Models with Interpretation, Proc. 27th International World Wide Web Conference (WWW), 2018.
  • H. Wu, C. Wang, Y. Fu, S. Sakr, K. Lu, L. Zhu, A Differentiated Caching Mechanism to Enable Primary Storage Deduplication in Clouds, IEEE Transactions on Parallel and Distributed Systems (TPDS), 2018.
  • H. Wu, C. Wang, Y. Fu, S. Sakr, L. Zhu, K. Lu, HPDedup: A Hybrid Prioritized Data Deduplication Mechanism for Primary Storage in the Cloud, 33rd International Conference on Massive Storage Systems and Technology (MSST), 2017.
  • X. Zhou, L. Chen, Y. Zhang, D. Qin, L. Cao, G. Huang, C. Wang, Enhancing online video recommendation using social user interactions, The VLDB Journal 26 (5), 637-656, 2017.
  • D. Wu, S. Sakr, L. Zhu, H. Wu, Towards Big Data Analytics across Multiple Clusters. Proc. 17th IEEE/ACM International Symposium on Cluster, Cloud and Grid Computing (CCGrid), 2017.
  • D. Wu, L. Zhu, Q. Lu, S. Sakr, HDM: A Composable Framework for Big Data Processing, IEEE Transactions on Big Data, 2017.
  • C. Wang, S. Karimi, Parallel Duplicate Detection in Adverse Drug Reaction Databases with Spark, Proc. 19th International Conference on Extending Database Technology (EDBT), 2016.
  • S. Karimi, C. Wang, A. Metke-Jimenez, R. Gaire, C. Paris, Text and data mining techniques in adverse drug reaction detection, ACM Computing Surveys (CSUR) 47(4), 2015.
  • K. Ye, Z. Wu, C. Wang, B. B. Zhou, W. Si, X. Jiang, A. Y. Zomaya, “Profiling-Based Workload Consolidation and Migration in Virtualized Data Centers,” IEEE Transactions on Parallel and Distributed Systems (TPDS), vol. 26, no. 3, pp. 878-890, March 2015.
  • C. Wang, S. Karimi, Causality driven data integration for adverse drug reaction discovery, Health Informatics Society Australia (HISA) Big Data Conference, 2014.
  • X. Liu, C. Wang, B. B. Zhou, J. Chen, T. Yang and A. Y. Zomaya, Priority-Based Consolidation of Parallel Workloads in the Cloud, IEEE Transactions on Parallel and Distributed Systems (TPDS), vol. 24, no. 9, pp. 1874-1883, Sept. 2013.
  • J. Chen, C. Wang, B. B. Zhou, L. Sun, Y. C. Lee, and A. Y. Zomaya. 2011. Tradeoffs between Profit and Customer Satisfaction for Service Provisioning in the Cloud. Proc. 20th international symposium on High performance distributed computing (HPDC ’11). ACM, New York, NY, USA, 229-238.
  • C. Wang, Y. Zhou, A Collaborative Monitoring Mechanism for Making a Multitenant Platform Accountable. USENIX HotCloud 2010.


Research Team/Group Involved

Architecture and Analytics Platforms Team

Information Security and Privacy Research Group