Function for identifying most suitable geographically weighted regression model
A proposed improvement to the geographically weighted regression function in R for effective comparison of model fit between local and global models.
Rationale
The Geographically Weighted Regression (GWR) is an exploratory analytical tool that generates local regressions, which makes the spatial analysis more sensitive to conditions that vary locally over the area of interest. This is in contrast to a linear regression, which uses identical coefficients across the entire study area, since it is based on the unrealistic hypothesis of spatial uniformity of the effect of explanatory variables on the dependent variable (Brunsdon et al. 1996). The GWR is preferred over the linear regression in the presence of spatial heterogeneity, and allows the values and significance of a model to vary spatially in a continuous manner (Bellefon and Floch, 2018). This is especially because local GWR regression can “reduce spatial autocorrelation in residuals compared to using a global estimator” (Bates and Pryce, 2014: 9).
However, while GWR models are appropriate for when data is not described well by a global model, there are spatial regions where the explanatory power of a global model may be higher such as data without certain geographic effect. This, gwr.all(), function helps to provide methodological justification for using a more complex model of GWR, rather than linear regression.
The function, gwr.all(), strives to provide useful comparative measures to determine the most appropriate model. gwr.all() allows the user to select the most appropriate model, between a global regression and a local regression model, through a two-pronged approach: comparing R-squared values and comparing Akaike Information Criterion (AIC) values. The AIC is a relative measure of model parsimony and provides a relative estimate of the quality of statistical models for the same data set. It considers the Kullback-Liebler information distance between statistical distributions (Fig 1), which can be construed as “information loss” when comparing the two distributions.
Description of function
The function calculates local GWR, local R-squared values, global R-squared value, as well as the AIC of GWR, of the linear regression, and the difference in AIC (ΔAIC) values. To use this function, an object of a Formal Class Large Spatial Polygons Data Frame that has incorporated both explanatory and dependent variables must be created beforehand.
The function prints two complementary maps: one with the local geographically weighted regressions and another showing the local R-squared values. The latter map shows the goodness-of-fit of each local GWR. It hence provides the explanatory power of the local GWR, since it explains the extent to which the variance of the dependent variable is explained by the variance of the independent variable. The function also prints the global R-squared value for comparison against the local R-squared values.
Alongside the maps, the AIC values of the linear regression model and the GWR model are printed. Models with smaller AIC values are preferable, since they minimise Kullback-Liebler information loss. The larger the ∆AIC, the stronger the evidence for one model over the other; ∆AIC>3 shows strong support for the model with the lower AIC value.
setwd("/Users/zenn/Documents/School/POLS0010/W1/Camden")Health <- read.csv("tables/KS301EW_oa11.csv")
Health <- Health[, c(1,24,25)]
Health$GoodHealth <- Health$KS301EW0023 + Health$KS301EW0024
Health <- Health[, c(1,4)]
names(Health) <- c("OA", "Good_Health")Qualification <- read.csv("tables/KS501EW_oa11.csv")Qualification <- Qualification[, c(1,20)]
names(Qualification) <- c("OA", "Qualification")Census.Data <- merge(Health, Qualification, by="OA")write.csv(Census.Data, "practical_data_assignment.csv", row.names=F)
Output.Areas<- readOGR(".", "Camden_oa11")
OA.Census <- merge(Output.Areas, Census.Data,
by.x="OA11CD",
by.y="OA")# Sets coordinate system to the British Nat Grid
proj4string(OA.Census) <- CRS("+init=epsg:27700")
gwr.all <- function(x, y, shapefile){
bw <- gwr.sel(y~x, data=shapefile, adapt=T)
model <- gwr(y~x, data=shapefile, adapt=bw, hatmatrix=T, se.fit=T)
results <- as.data.frame(model$SDF)
globalR2 <- (1 - (model$results$rss/model$gTSS))shapefile$x_coeff <- results$x
shapefile$localR2 <- results$localR2model.linear <- lm(y~x, data=shapefile)
AIC.linear <- as.numeric(AIC(model.linear))results1 <- as.data.frame(model$results)
AIC.gwr <- as.numeric(results1$AICc)AIC.delta = abs(AIC.gwr - AIC.linear)map1 <- tm_shape(shapefile) +
tm_fill("x_coeff", style="quantile") +
tm_layout(frame=F, legend.text.size=1, legend.title.size=1.5,
main.title=" ", main.title.size = 1.5)map2 <- tm_shape(shapefile) +
tm_fill("localR2", style="quantile", palette="Blues", title="Local_R2") +
tm_layout(frame=F, legend.text.size=1, legend.title.size=1.5,
main.title=paste("Global R2 =", round(globalR2, 3)), main.title
.size = 1.5)# print AIC values
plot(c(0,1), c(0,1), ann = F, bty = 'n', type = 'n', xaxt = 'n', yaxt = 'n'
) +
text(x = 0.5, y = 0.6, paste("AIC for linear regression =", round(AIC.lin
ear, 3)),
cex = 1, col = "black") +
text(x = 0.5, y = 0.5, paste("AIC for GWR =", round(AIC.gwr, 3)),
cex = 1, col = "black") +
text(x = 0.5, y = 0.4, paste("∆AIC =", round(AIC.delta, 3)),
cex = 1, col = "black")map3 <- recordPlot()
png(filename="maps.png", height=480, width=960, units="px")
grid.newpage() # creates a clear grid# assigns cell size of grid, here 1x2:
pushViewport(viewport(layout=grid.layout(1,2)))# prints a map object into a defined cell
print(map1, vp=viewport(layout.pos.col = 1, layout.pos.row = 1))
print(map2, vp=viewport(layout.pos.col = 2, layout.pos.row = 1))dev.off()png(filename="aic.png")
print(map3, vp=viewport(layout.pos.col = 2, layout.pos.row = 1))
dev.off()
}
gwr.all(x = OA.Census$Qualification, y = OA.Census$Good_Health, shapefile = O
A.Census)
This function is demonstrated through an analysis of the ‘Qualification’ and ‘Good Health’ variables in Camden. The data used is from CDRC 2011 Census Geodata pack for Camden.
The first map with the legend label “x_coeff” shows the local parameters from the GWR, while the second map shows the model performance of the GWR for each area. The central-west and central regions of Camden have the highest GWR coefficients at around 0.374–0.511. The local R-squared values of these regions are moderate, experiencing two bands of 0.504–0.593 and 0.593–0.667. This shows that the high GWR coefficients in these areas have a moderately high explanatory power for the regression. The analysis of ∆AIC values shows that GWR is a better model fit than the global model.
Potential applications
While the above function demonstrated the spatially varying relationship between qualification level and reported health, the analysis can be extended to other relationships as well. GWR is gaining popularity in social sciences and economics, investigating correlations between house prices and other covariates. In these aspects, gwr.all() will be useful in justifying the strength of one model.
GWR is one of the leading geo-statistical methods for population health studies, due to the locality of health outcomes (Young and Gotaway, 2010); it enables a deeper understanding of spatial and etiological processes. By providing local parameters, GWR allows researchers to deal with information beyond global models in order to implement targeted, place-specific health policies. A useful example would be the study by Mohammadinia et al (2017) that looked at modelling human leptospirosis based on environmental factors. gwr.all() will be effective in checking for spatial nonstationarity, to not only provide local parameters but also determine whether GWR or a global model should be more parsimonious; especially since some diseases, such as hereditary diseases or other diseases without geographical risk factors, may not experience significant geographical variations.
Multiple linear regression models have also been used for climatological data, but certain climate phenomena and trends are varying through space, so a comparison of goodness-of-fit of models would prove useful.
Limitations
First, accompanying text is required to explain the GWR coefficient map, due to the legend title being fixed as “x_coeff”. The function also assumes that the spatial dataset has already been merged with the shapefile prior. This function does not allow the user to choose bandwidth size; it is always specified to be adaptive to the bandwidth determined in object bw.
The function is also limited to two variables, such that multivariable regression analysis cannot be carried out. As such, the GWR is susceptible to the endogeneity through the omitted variable bias which may overestimate the local coefficients. Given that GWR carries out multiple local regressions, this function would not be efficient for large datasets. For example, for n=150000, two weeks of computing would be required to produce outputs. The GWR model also has limited external validity, and cannot be used in spatial extrapolation. Finally, AIC compares only relative quality between models, but does not reveal anything about absolute quality.
For further development of the function, additional parameters could be included to create a multivariate regression model. Additionally, spatial autocorrelation of the regression residuals can also be incorporated into the function through a Moran’s I test.
References
Burnham, K.P., Anderson, R.D. and Huyvaert, K.P. (2011) ‘AIC model selection and multimodel inference in behavioral ecology: some background, observations, and comparisons’ in Behavioral Ecology and Sociobiology, 65, 1, 23–35.
Bates, E. and Pryce, Gwilym (2014). Geographically Weighted Regression as an Automatic Valuation Method An Application to the Cross Price Elasticity for Housing. Glasgow: University of Glasgow.
Brunsdon C., Fotheringham, A.S. and Charlton, M.E. (1996) Geographically Weighted Regression: A Method for Exploring Spatial Nonstationarity. Ohio: The Ohio State University.
De Bellefon M., Floch J. (2018) ‘Geographically Weighted Regression’ in Handbook of Spatial Analysis (pp.231–254). Europe: Eurostat.
Mohammadinia A.I., Alimohammadi A. and Saeidian B. (2017) ‘Efficiency of Geographically Weighted Regression in Modeling Human Leptospirosis Based on Environmental Factors in Gilan Province, Iran’ in Geosciences (Switzerland), 7, 4, 136, 1–12.
Young L.J. and Gotway C.A. (2010) ‘Using geostatistical methods in the analysis of public health data: the final frontier?’ in Geostatistics for Environmental Applications, 16, 89–98.