Statistical and econometric models for the analysis of large longitudinal data
Longitudinal data are increasingly being collected at such rates that it is becoming difficult to analyse them with existing statistical methods and with current software and hardware. For example, electricity smart meter technology is increasingly deployed in residential and commercial buildings. In such a system, energy usage is recorded and electronically transmitted back to the distributor every 30 minutes. For a state such as Victoria, Australia, this means that over 35 billion observations are collected every year from approximately 2.1 million households.
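To see how this figure arises, a rough calculation, assuming every household transmits a reading at every half-hourly interval, gives

\[
2.1 \times 10^{6}\ \text{households} \times 48\ \text{readings per day} \times 365\ \text{days} \approx 3.7 \times 10^{10}\ \text{observations per year},
\]

consistent with the figure of over 35 billion quoted above.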
The introduction of smart meters affords the opportunity to better model and understand residential and business energy usage patterns between months, between days and within days, something that is not possible using only quarterly energy usage information. Similarly, to improve compliance and government services to customers, governments collect income, financial and asset information from their customers over time. Analysing over ten years' worth of such data for more than ten million customers means that the number of observations will exceed 100 million.
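Again as a rough check, and assuming conservatively only one record per customer per year,

\[
10^{7}\ \text{customers} \times 10\ \text{years} \times 1\ \text{record per customer per year} = 10^{8}\ \text{observations},
\]

so any additional records per customer per year push the total well beyond 100 million.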
In both examples above it is important to account for any temporal and spatial correlations, as the uncertainties associated with model parameters, predictions and forecasts could otherwise be severely under-estimated. This could hinder good decision making, since patterns may be identified that do not really exist, resulting in incorrect inferences. For example, in the energy case, if the uncertainties associated with peak energy usage in different regions are under-estimated, the system may not meet reliability targets for those peak periods. For the government services data, good parameter uncertainty estimates assist in identifying the variables that drive particular behaviours, which in turn can lead to better financial outcomes for customers.
The scientific challenge is to account for these correlations using appropriately complex statistical models when the data sets are very large. This stream of research is focussed on developing methodology for the statistical analysis of large data sets that are correlated both temporally and spatially, in order to help improve decision making.