lec4.Rmd

### Spatial modelling and spatial interpolation
\(
\newcommand{\Cor}{{\rm Corr}}
\)


* Simple ways of interpolation
* Simple statistical models for interpolation
* Geostatistical interpolation
* Deterministic models
* Combined approaches

### Taking a step back
Why do we need models?
* to understand _relations_ or _processes_
* to assess (predict, forcast, map) something we do 
or did not measure _and cannot see_
* to assess the consequence of decisions (scenarios)
where we cannot measure

### A sample data set
```{r}
library(sp)
data(meuse.riv)
data(meuse)
coordinates(meuse) = c("x", "y")
meuse.sr = SpatialPolygons(list(Polygons(list(Polygon(meuse.riv)),"meuse.riv")))
spplot(meuse, "zinc", col.regions=bpy.colors(),  main = "zinc, ppm",
       cuts = c(100,200,400,700,1200,2000), key.space = "right",
       sp.layout= list("sp.polygons", meuse.sr, fill = "lightblue")
)
```

### Thiessen "polygons", 1-NN
```{r}

library(gstat)
data(meuse)
coordinates(meuse) = ~x+y
data(meuse.grid)
gridded(meuse.grid) = ~x+y

meuse.th = idw(zinc~1, meuse, meuse.grid, nmax = 1)

spplot(meuse.th[1], 
	col.regions = bpy.colors(30), cuts = 29,
	sp.layout = list("sp.points", meuse, col = 3, cex=.5),
	main = "Zinc, 1-nearest neighbour")
```

### Zinc concentration vs. distance to river
```{r}
library(sp)
data(meuse)
plot(log(zinc)~sqrt(dist),meuse)
abline(lm(log(zinc)~sqrt(dist),meuse))
```

### Zinc conc. vs. distance to river: map of linear trend
```{r}
meuse.grid$sqrtdist = sqrt(meuse.grid$dist)
spplot(meuse.grid["sqrtdist"], col.regions = bpy.colors())
```

### Inverse distance weighted interpolation
Uses a **weighted** average:
$$\hat{Z}(s_0) = \sum_{i=1}^n \lambda_i Z(s_i)$$
with $s_0 = \{x_0, y_0\}$,
or $s_0 = \{x_0, y_0, \mbox{depth}_0\}$
weights inverse proportional to power $p$ of distance:
$$\lambda_i = \frac{|s_i-s_0|^{-p}}{\sum_{i=1}^n |s_i-s_0|^{-p}}$$
* power $p$: tuning parameter
* if for some $i$, $|s_i-s_0| = 0$, then $\lambda_i = 1$ and other
weights become zero
* $\Rightarrow$ **exact** interpolator

### Effect of power $p$
```{r}
library(sp)
set.seed(13531)
pts = data.frame(x = 1:10, y = rep(0,10), z = rnorm(10)*3 + 6)
coordinates(pts) = ~x+y
xpts = 0:1100 / 100
grd = data.frame(x = xpts, y = rep(0, length(xpts)))
gridded(grd) = ~x+y
library(gstat)
plot(z ~ x, as(pts, "data.frame"), xlab ='location', ylab = 'attribute value')
lines(xpts, idw(z~1, pts, grd, idp = 2)$var1.pred)
lines(xpts, idw(z~1, pts, grd, idp = 5)$var1.pred, col = 'darkgreen')
lines(xpts, idw(z~1, pts, grd, idp = .5)$var1.pred, col = 'red')
legend("bottomright", c("2", ".5", "5"), lty = 1, 
	col=c("black", "red", "darkgreen"), title = "invese distance power")
```

### Simple statistical models for interpolation

```{r}
library(sp)
data(meuse)
plot(zinc~dist, meuse)
plot(log(zinc)~sqrt(dist), meuse)
```

### Time series versus spatial data
Differences:
* spatial data live in 2 (or 3) dimensions
* there's no past and future
* there's no simple conditional independence (AR)
Correspondences
* nearby observations are more alike (auto-correlation)
* we can form moving averages
* coordinate reference systems are a bit like time zones and DST

### What information do we have?
* We have measurements $Z(x)$, with $x$ two-dimensional
(location on the map)
* we have $x$ and $y$
* we may have land use data 
* we may have soil type or geological data 
* we may have remote sensing imagery 
* we may have all kinds of relevant information, related
to processes that cause (or result from) $Z(x)$ 
* we have google maps 

_We don't want to ignore anything important_

### Regression or correlation?
(Correlation between two different variables, no auto-correlation:) 
```{r}
n = 500
x = rnorm(n)
y = rnorm(n)
R = matrix(c(1,0,0,1),2,2)
R
plot(cbind(x,y) %*% chol(R))
title("zero correlation")
R[2,1] = R[1,2] = 0.5
plot(cbind(x,y) %*% chol(R))
title("correlation 0.5")
R[2,1] = R[1,2] = 0.9
plot(cbind(x,y) %*% chol(R))
title("correlation 0.9")
R[2,1] = R[1,2] = 0.98
plot(cbind(x,y) %*% chol(R))
title("correlation 0.98")
```

### Correlation vs. regression:
Correspondences:
* both are in the "classic statistics" book, and may involve hypothesis testing
* both deal with _two_ continuous variables
* both look at (first order) _linear_ relations
* when correlation is significant, the regression _slope_ is significant
Differences:
* Regression distinguishes $y$ from $x$: $y$ depends on $x$, not reverse;
* the line $y=ax+b$ is _not_ equal to the line $x = cy + d$
* Correlation is symmetric: $\Cor(x,y)=\Cor(y,x)$
* Correlation coefficient is unitless and within $[-1,1]$, 
regression coeficients have data units
* Regression is concerned with _prediction_ of $y$ from $x$.

### The power of regression models for spatial prediction
... is hard to overestimate. Regression and correlation are the fork
and knife of statistics.
* linear models have endless application: polynomials, interactions, nested
effects, ANOVA/ANCOVA models, hypothesis testing, lack of fit testing, ...
* predictors can be transformed non-linearly
* linear models can be generalized: logistic regression, Poisson regression, ...,
to cope with discrete data (0/1 data, counts, log-normal)
* many derived techniques solve one particular issue in regression, e.g.:
 * ridge regression solves collinearity (extreme correlation among predictors)
 * stepwise regression automatically selects "best" models among many candidates
 * classification and regression trees

# Why is regression difficult in spatial problems?
Regression models assume independent observations. Spatial data are always
to some degree spatially correlated. 

This does not mean we should discard regression, but rather think about
* to which extent is an outcome dependent on independence?
* to which extent is regression _robust_ agains a violated assumption 
of independent observations?
* to which extent _is_ the assumption violated? (how strong is the
correlation)

### What is spatial correlation?
Waldo Tobler's first law in geography:

Everything is related to everything else, but near things are more
related than distant things." [Tobler, 1970, p.236]

TOBLER, W. R. (1970). "A computer model simulation of urban growth in
the Detroit region". Economic Geography, 46(2): 234-240.

_But how then is "being related" expressed?_