Combining nuanced statistical methods with a robust parallel computational platform has enabled a modeling scheme that better predicts environmental conditions while being efficient enough to cover millions of monitoring locations.
The new modeling approach developed by KAUST tackles a longstanding obstacle to improved weather and climate prediction: how to implement non-Gaussian statistics for very large geospatial datasets.
“In spatial statistics, the main objective is to use data observed at monitoring stations to predict the conditions at unobserved locations,” explains Sagnik Mondal, a Ph.D. student from Marc Genton’s statistics research group. “These types of predictions are necessary for many kinds of weather and climate applications. Nowadays, however, the number of observation locations can reach millions, which is beyond the capability of traditional computational approaches, and the traditional Gaussian models fail to statistically capture extreme values.”
A Gaussian model is a straightforward statistical description of a dataset based on an average “mean” value and a symmetric spread of values above and below it: the iconic “bell curve.” However, many environmental variables and their derivatives, such as rainfall intensity, wind speed, days without rain or days above a certain temperature, are not symmetric in their distribution. Rather, their probabilities peak close to zero, but on rare occasions the values can reach very high extremes. This long “tail” of low-probability extreme values cannot be captured by Gaussian models but is becoming increasingly important under climate change.
“In this work, we applied the Tukey g-and-h model, which is a non-Gaussian spatial model with two additional parameters to accommodate asymmetric distributions and better capture extreme values,” says Mondal.
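To make the role of the two extra parameters concrete, here is a minimal sketch of the standard Tukey g-and-h transformation applied to ordinary Gaussian values. It is an illustration only, not the KAUST team's implementation: the function name `tukey_gh` and the values g = 0.5 and h = 0.1 are assumptions chosen for the example, and the spatial model described in the article involves more than this marginal transformation.

```python
import numpy as np

def tukey_gh(z, g=0.0, h=0.0):
    """Apply the Tukey g-and-h transformation to standard normal values z.

    g controls skewness (asymmetry); h >= 0 controls tail heaviness.
    With g = 0 and h = 0 the values are returned unchanged (Gaussian case).
    """
    z = np.asarray(z, dtype=float)
    if h < 0:
        raise ValueError("h must be non-negative")
    tail = np.exp(h * z**2 / 2.0)      # stretches values far from zero
    if g == 0.0:
        return z * tail                # limiting case as g -> 0
    return np.expm1(g * z) / g * tail  # (exp(g*z) - 1) / g, then stretch

# Illustrative (assumed) parameter values, not taken from the study.
rng = np.random.default_rng(0)
z = rng.standard_normal(100_000)       # symmetric "bell curve" sample
y = tukey_gh(z, g=0.5, h=0.1)          # skewed, heavier-tailed sample

print("Gaussian mean vs. median:   ", z.mean(), np.median(z))
print("Tukey g-and-h mean vs. median:", y.mean(), np.median(y))
```

With a positive g the transformed sample has a long right tail, so its mean sits above its median, whereas the Gaussian sample stays roughly symmetric.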
While the Tukey g-and-h model is clearly beneficial for weather data, fitting it with traditional sequential computation is too slow to be practical for large geospatial datasets. Parallelizing the computations, however, makes the approach significantly more efficient.