## V. ALGORITHMS FOR SPATIAL DATA

Depending on the type of conception of space and the measurement level, different algorithms are applied. The widely used data mining algorithms for spatial data are:

1) Spatial Auto-regressive model

2) Markov Random Field model

3) Geographically weighted regression

4) Fractal models

5) Map-Reduce algorithm for polygon retrieval

6) The General Spatial Interaction Model

Guo et al [18] propose a Map-Reduce based parallel polygon retrieval algorithm, as it is a fundamental operation that is often computed under real time constraints. This algorithm first hierarchically indexes the spatial terrain data using a quad-tree index, with the help of which a significant amount of data gets filtered out in the pre-processing stage based on the queried object. The algorithms 1 to 6 mentioned above are widely used when the spatial data is of area data as mentioned in sectionI-B and algorithms 7 and 8 are for spatial interaction data.

**TABLE I**

** COMPARISON OF HADOOP AND SPARK BASED SYSTEMS**

### A. *Algorithms for Area data*

**1) Spatial Autoregressive model:** The assumption of independent observations in the context of area data is unlikely to be appropriate, because of the possibility of spatial dependence. Spatial dependence is where the observed values in one areal unit depend on the values of neighbouring observations at near-by areal units.

This spatial dependence can be introduced into a model:

1. When there is spatial correlation in dependent variable, which is called as Spatial lag model,

2. When there is dependence in the error terms, it is called as Spatial error dependence [20]

3. When spatial dependence is in the regressor variables, it is called as cross-regressive model also called as Spatial Lagged X(SLX) models [20].

The spatial lag models are used when there is spatial correlation in the dependent variable. These spatial correlations are motivated by theoretical arguments that emphasize the importance of neighbourhood effects. Spatial lag models are extensions of regression models of type

1. They allow observations of dependent variable in i to depend on observations in neighbouring j not equal to i.

where, yi is an observed dependent variable, $X _i. _q $ is observed explanatory variables with q = 1,..,Q, q is the regression co-efficient and $\in _i$ is the error term.

The basic spatial lag model i.e first order spatial autoregressive(SAR) model is of the form 2

where, the error term $\epsilon_i$ is independently identically distributed. Wij is the (i; j)th element of spatial weight matrix W. The scalar $\rho$ in equation:2 is a parameter that is to be estimated which determines the strength of the spatial autoregressive relation between yi and $\sum_{j=1}^nW_i._jy_j.\sum_{j=1}^nW_i._jy_j$$ is a linear combination of spatially related observations.

**2) Markov Random Field models:**: The Markov Random Field also called as Conditional Autoregressive (CAR)) models are popular approach for analyzing spatial and network data. These models are widely used to represent the local dependency between the random variables. Consider Y = (y1, y2,.., yn) and the set p($Y _i$ | $Y _j$ , j not equal to i) then we know that p(y1, y2,…,yn) determines p($Y _i$ | $Y _j$ , j not equal to i) (full conditional distributions). But if p($Y _i$ | $Y _j$ , j not equal to i) determines p(y1, y2,..,yn), the joint distribution is called as Markov Random Field. This notion of using local specification to determine a joint (or global) distribution in the form $p(Y _i | Y _j)$ , j not equal to i) = p($Y _i$ | $Y _j$ $\in$ $\partial i$) is referred to as Markov Random Field.

But when does the set p($Y _i$ | $Y _j$ $\in$ $\partial i$) uniquely determine p(y1, y2,.., $Y _n$)? To answer this, we need some important aspects in this regard like that of a Clique and a Potential.

A Clique is a set of cells (equivalently indices) such that each element is a neighbour of every other element.

A Potential function (or simply potential) of order is a function of k arguments that is exchangeable in these arguments. The arguments of the potential would be the values taken by variables associated with the cells for a clique of size k.

The Hammersely-Clifford theorem says that, if we have a Markov Random Field (i.e., p($Y _i | Y _j \in \partial i$) uniquely determine p(y1, y2,…, $Y _n$)), then the later is a Gibbs distribution. The Geman and Geman (1984) result says that, if we have a joint Gibbs distribution, then we have a Markov Random Field. As a result, they argued that to sample a Markov random field, one could sample from its associated Gibbs distribution. Kaise, Lahiri and Nordman (2012) defined a Conclique as a set of locations such that no location in the set is a neighbour of any other location in the set. Any two members of a conclique may share common neighbours, but they cannot be neighbours themselves. Additionally every set of a singleton.

Location can be treated as both a clique and a conclique. In the parlance of graphs, the analog of a conclique is a so-called ”independent set”, defined by a set of vertices in which no two vertices share an edge. This particular graph terminology conflicts with the probabilistic notion of independence, while a ”conclique” truly represents a conditionally independent set of locations in a Markov random field model. The full conditional p($Y _j$|$Y _j$ ; j not equal to i) is the conditional cdf of $Y _i$ under the model. Kaise, Lahiri and Nordman

The key property of this generalized spatial residuals is: Let $C _j$ : j = 1,,,,,q be a collection that partition the integer grid. Under the conditional model, spatial residuals within a conclique are iid Uniform(0; 1)-distributed:

Using the conditional independence of random variables at locations within a conclique [23] have proposed a conclique based Gibbs Sampling algorithm for sampling from a Markov random field. They have shown that it is provably fast and the sampler is also provably geometrically ergodic (i.e., it mixes at a fast rate), which is unusual for spatial data. Kaise, Lahiri and Nordman [22] provide a methodology for performing GOF tests using concliques. Concliques based Gibbs Sampling allows for fast approximation of the reference distribution for the GOF test statistics.