Big Data Challenges for the Science of Small Things
Comprehensive sampling of large, detailed and heterogeneous structural configuration spaces is a daunting task, even for small data sets containing hundreds of possible structures. When confronted with millions of possible structures, technical challenges accompany the scientific ones, but so do new insights that cannot be extracted if we pre-select one structure and assume it is “representative”.
In this project we explore and describe these complicated configuration spaces using high-throughput computational simulations, and examine the global correlation of structure and properties using a range of simple and sophisticated statistical methods. Depending on the specific problem, data sets can range from hundreds to hundreds of thousands of unique configurations, the importance of which can be assigned using probability distribution functions.
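As one illustration of this weighting step, the minimal sketch below assigns Boltzmann-style importance weights to an ensemble of configurations from their computed energies. It assumes a Boltzmann distribution as the probability distribution function; the energies, temperature and function name are hypothetical placeholders, not values or code from this project.

```python
import numpy as np

# Boltzmann constant in eV/K (assuming configuration energies are in eV).
K_B = 8.617333262e-5

def boltzmann_weights(energies, temperature=300.0):
    """Assign a normalized probability (importance) to each configuration.

    energies    : total energies of the unique configurations (eV)
    temperature : ensemble temperature in kelvin (hypothetical default)
    """
    energies = np.asarray(energies, dtype=float)
    beta = 1.0 / (K_B * temperature)
    # Shift by the minimum energy for numerical stability before exponentiating.
    unnormalized = np.exp(-beta * (energies - energies.min()))
    return unnormalized / unnormalized.sum()

# Example: three hypothetical defect configurations, energies in eV.
weights = boltzmann_weights([-10.02, -9.98, -9.50])
print(weights)  # The lowest-energy structure dominates the ensemble.
```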
This data- and computation-intensive workflow demands a hybrid HPC/Big Data platform that is robust, resilient and flexible. To meet this challenge we are developing an in-house high-throughput simulation engine that realizes an interactive and intelligent pipeline of defective nanostructure generation, characterization, large-scale computation, big data processing, and analysis and mining with machine learning algorithms. Combining HPC infrastructure with the Apache HBase™/Spark™ ecosystem, we can support a range of popular simulation packages while reducing unnecessary repetition and focusing more simulations on regions of interest.
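To make the deduplication and focusing steps concrete, here is a minimal PySpark sketch of how such a pipeline might skip already-computed structures and select a region of interest. The storage paths, column names and thresholds are hypothetical assumptions, and the HBase-backed storage layer is elided; this is not the project's actual engine.

```python
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("defect-screening").getOrCreate()

# Completed simulation results; path and columns
# (structure_id, formation_energy_ev, band_gap_ev) are hypothetical.
results = spark.read.parquet("hdfs:///simulations/results.parquet")

# Candidate configurations generated upstream of the pipeline.
candidates = spark.read.parquet("hdfs:///simulations/candidates.parquet")

# 1. Reduce unnecessary repetition: keep only candidates not yet simulated.
pending = candidates.join(results, on="structure_id", how="left_anti")

# 2. Focus on a region of interest: e.g. low formation energy and a
#    band gap inside a hypothetical target window.
interesting = (
    results
    .filter((F.col("formation_energy_ev") < 1.0) &
            F.col("band_gap_ev").between(1.0, 2.0))
    .orderBy("formation_energy_ev")
)

interesting.show(10)
```

The left-anti join is one idiomatic Spark way to drop structures that already have results before dispatching new HPC jobs, which is the sense in which the pipeline avoids repeated work.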
For more information, contact the Project Leader, Dr Baichuan Sun.