CRP with Macquarie University

Autonomic Security for Cloud Infrastructure

Data61 SCS, Macquarie University
Partners:	Prof Jian Yang

Cloud computing is burgeoning. At present, more than 50% of businesses worldwide have deployed all or part of their IT solutions in the cloud. Cloud computing offers several new opportunities for IT practitioners. For example, the “on-demand” nature of the cloud allows businesses to deploy their systems more efficiently, and its multi-tenancy feature enables businesses to manage various systems using the same account. “Infrastructure as a Service (IaaS)”, the most fundamental service model in cloud computing, provides the infrastructure of cloud systems and facilitates the provisioning of virtualized cloud resources for IT organizations. IaaS faces security issues because cloud infrastructure can be compromised by hackers and malicious users. New features of the cloud, such as its “on-demand” nature and multi-tenancy, make it even easier for intruders to attack cloud infrastructure. It has been reported that 43% of cloud consumers have encountered attacks on the infrastructure of their cloud systems.

While it is good security practice for businesses to follow proper security guidelines exactly when deploying IT solutions, security threat scenarios vary widely and some attacks may be unpreventable. Existing mechanisms for preventing cloud security threats, such as restricting user privileges in a cloud ACL (Access Control List), cannot cater for circumstances where user account credentials are revealed to hackers. As such, security threat prevention alone is not enough to maintain system availability, and an autonomic mechanism is needed for handling failures that result from security threats in cloud infrastructure. Existing strategies for dealing with security failures in cloud infrastructure, such as cloud Disaster Recovery (DR) mechanisms, have several drawbacks: 1) it is difficult to maintain data consistency between the master datacentre and the slave datacentre(s), especially when the data in the master datacentre change very frequently; 2) even if the slave datacentre(s) take over service from the master datacentre upon failure, the same security failure might happen again because they are replicas of the master datacentre and have the same security vulnerabilities; 3) system operators need to spend more time and effort maintaining both the master datacentre and the slave datacentre(s); 4) under the “pay as you go” cost model offered by the cloud, the monetary cost of a “master-slave” cloud architecture is relatively high compared to a “single-site” architecture; 5) implementing these strategies usually requires expert knowledge (e.g. which availability zone in the cloud is more secure, or what the best replication frequency is), which may not always be available.

Therefore, in order to bridge the gaps in existing security failure handling mechanisms, this research proposes a more sophisticated security failure detection and recovery framework for cloud infrastructure. The main research challenge lies in the autonomy and automation of this security solution. Specifically, there are four research questions: 1) how to assess the risk of cloud infrastructure to determine its security risk value; 2) how to detect and diagnose security attacks and intrusions on cloud infrastructure in an automated way; 3) how to recover from security failures in an automatic and generic manner if security breaches are detected; 4) how to prevent security breaches if they have not yet occurred. Our proposed framework works as follows. First, we perform risk assessment for the infrastructure of a cloud system, based on its current system state, to determine its overall risk value. The risk value is calculated by analysing the vulnerability of each resource component in the cloud system infrastructure and investigating how the infrastructure can be influenced by these vulnerabilities. If the risk value is greater than a predefined threshold, we perform security attack detection and diagnosis. Security attacks are detected and diagnosed by analysing runtime logs generated by cloud platforms and cloud management tools. Once security attacks are detected, the framework triggers the recovery component to perform recovery based on system state transition, and we then perform risk assessment again on the recovered cloud system infrastructure to make sure its latest risk value is below the threshold. If no attack is detected, the prevention component takes effect and performs security threat prevention activities.
One challenge with our proposed framework is determining the optimal recovery strategy; another is enabling the framework to handle security issues in infrastructure while cloud operations (e.g. upgrade) are being performed on it.
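
The control flow of the framework described above can be illustrated with a minimal Python sketch. It is not the project's implementation: the `Resource` fields, the impact-weighted average used as the risk value, the outcome labels, and the `detect`, `recover` and `prevent` callables are all hypothetical placeholders, and the behaviour when the risk value is at or below the threshold (no action) is an assumption, since the description does not specify it.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class Resource:
    name: str
    vulnerability: float  # hypothetical score in [0, 1]
    impact: float         # hypothetical weight: influence on the infrastructure


def assess_risk(resources: List[Resource]) -> float:
    """Assumed aggregate: impact-weighted mean of vulnerability scores."""
    total_impact = sum(r.impact for r in resources)
    if total_impact == 0:
        return 0.0
    return sum(r.vulnerability * r.impact for r in resources) / total_impact


def run_cycle(
    resources: List[Resource],
    threshold: float,
    detect: Callable[[], List[str]],
    recover: Callable[[List[Resource], List[str]], List[Resource]],
    prevent: Callable[[List[Resource]], None],
) -> str:
    """One pass of the assess -> detect -> recover-or-prevent workflow."""
    if assess_risk(resources) <= threshold:
        return "no-action"  # assumption: nothing is triggered below the threshold

    attacks = detect()  # e.g. derived from platform and management-tool logs
    if not attacks:
        prevent(resources)  # no attack found: apply preventive measures
        return "prevented"

    recovered = recover(resources, attacks)  # state-transition-based recovery
    # Re-assess the recovered infrastructure against the same threshold.
    if assess_risk(recovered) > threshold:
        return "recovery-insufficient"
    return "recovered"
```

For example, calling `run_cycle` with a `detect` callable that returns an empty list exercises the prevention branch, while one that returns a non-empty list of attack labels triggers recovery followed by a second risk assessment.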

We will evaluate our framework on the AWS EC2 platform, which is the most widely used public cloud platform in the world. The framework is to be evaluated using three types of cloud systems: 1) stateless cloud instances attached to an auto scaling group and a load balancer; 2) typical three-tier lightweight cloud web applications; 3) data-intensive business applications with a large-scale relational database system (RDS). The evaluation metrics we will investigate are recovery time, monetary cost of recovery, impact of recovery on cloud systems, and stakeholder satisfaction. In addition, we will compare our proposed security solution with existing cloud security strategies.