All-Nearest Neighbor Queries

Given two object sets $A$ and $B$, an all-nearest neighbor (all-NN) query finds, for each object in $A$, a nearest neighbor from $B$. The formal definition is given below.

Definition: Given two object sets $A$ and $B$, an all-NN query $(A,B)$ is to find, for each object $a_i\in A$, an object $b_j\in B$ such that for all other objects $b_k\in B$, $dist(b_j,a_i)\leq dist(b_k,a_i)$.

The result of an all-NN($A,B$) query consists of $|A|$ object pairs $\langle a_i,b_i\rangle$, where $|A|$ is the cardinality of set $A$ and $b_i$ is the nearest neighbor of $a_i$ in set $B$. Note that all-NN queries are not commutative, i.e., all-NN($A,B$) $\neq$ all-NN($B,A$) in general.
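
To make the definition concrete, the following brute-force Python sketch computes all-NN($A,B$) by scanning $B$ once for each object of $A$; the function name and the use of two-dimensional points are illustrative only, and the $O(|A|\cdot|B|)$ cost is exactly what the methods discussed below aim to avoid.

\begin{verbatim}
import math

def all_nn(A, B):
    """Brute-force all-NN(A, B): for each point a in A, scan all of B
    and keep the closest point.  O(|A| * |B|) distance computations."""
    return [(a, min(B, key=lambda b: math.dist(a, b))) for a in A]

# Non-commutativity: all_nn(A, B) returns |A| pairs while all_nn(B, A)
# returns |B| pairs, and the matched partners generally differ.
A = [(0, 0), (5, 5)]
B = [(1, 0), (4, 6), (9, 9)]
print(all_nn(A, B))   # [((0, 0), (1, 0)), ((5, 5), (4, 6))]
\end{verbatim}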

These queries arise in several applications. For example, a query may be issued to find the nearest hospital for each housing society of a city, which is useful for urban planning. Such queries may also be issued for resource allocation problems, and several clustering and outlier detection algorithms use all-NN queries for efficiency. Similar to aggregate nearest neighbor queries, all-NN queries can also be used to detect abnormalities in very large circuits [NO97].

A few spatial join methods have been proposed that can be adapted to answer all-NN queries. Brinkhoff et al. [BKS93] proposed a join method in which both data sets are assumed to be indexed by R-trees. The algorithm returns the objects from the sets $A$ and $B$ that intersect each other. It traverses both trees and prunes any entry pair $\langle E_A,E_B\rangle$ if the nodes $E_A$ and $E_B$ do not overlap; such nodes cannot contain any object $a_i$ that intersects some object $b_j$. Clearly, the spatial join problem is different from the all-NN problem, because even if two nodes do not overlap they can still contain two points $a_i$ and $b_j$ such that $b_j$ is the NN of $a_i$. However, this spatial join technique can be slightly modified to answer the closest-pair problem, which has been used to answer all-NN queries [HS98,CMTV00]. The closest-pair (CP) problem is to find a pair $\langle a_i,b_j\rangle$ from two sets $A$ and $B$ such that there is no other pair $\langle a_x,b_y\rangle$ with $dist(a_x,b_y)<dist(a_i,b_j)$. The main difference from the join is that nodes that do not overlap sometimes have to be visited: the algorithm traverses the two R-tree indexes and recursively visits the entry pair $\langle E_A,E_B\rangle$ that has the minimum distance among all candidate pairs, as sketched below. Böhm et al. [BK04] propose a $k$-nearest neighbor join method that retrieves the $k$ NNs from set $B$ for each object in $A$; for $k=1$, the problem is the same as an all-NN query. GORDER [Hon04] is another method to process $k$ nearest neighbor joins.
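
The best-first CP traversal described above can be sketched as follows. The in-memory Node layout and the function names are assumptions made for illustration; an actual implementation of [HS98,CMTV00] operates on disk-resident R-tree pages. Entry pairs are kept in a priority queue keyed on the minimum distance between their MBRs, so the first concrete point pair reached is a closest pair.

\begin{verbatim}
import heapq
import itertools
from dataclasses import dataclass
from typing import Optional

_tie = itertools.count()  # tie-breaker so the heap never compares Nodes

@dataclass
class Node:
    mbr: tuple                       # (x_lo, y_lo, x_hi, y_hi)
    children: Optional[list] = None  # internal node: list of child Nodes
    points: Optional[list] = None    # leaf node: list of (x, y) points

def min_min_dist(m, n):
    """Smallest possible squared distance between a point in MBR m and
    a point in MBR n (zero if the rectangles overlap)."""
    dx = max(n[0] - m[2], m[0] - n[2], 0.0)
    dy = max(n[1] - m[3], m[1] - n[3], 0.0)
    return dx * dx + dy * dy

def closest_pair(root_a, root_b):
    """Best-first CP search: entry pairs are expanded in increasing
    lower-bound distance, so the first concrete point pair popped
    from the heap is a closest pair."""
    heap = [(min_min_dist(root_a.mbr, root_b.mbr), next(_tie), root_a, root_b)]
    while heap:
        d, _, ea, eb = heapq.heappop(heap)
        if ea is None:                    # eb is a concrete (a_i, b_j) pair
            return eb
        if ea.children is None and eb.children is None:   # two leaves
            for a in ea.points:
                for b in eb.points:
                    d2 = (a[0] - b[0]) ** 2 + (a[1] - b[1]) ** 2
                    heapq.heappush(heap, (d2, next(_tie), None, (a, b)))
        else:                             # expand whichever sides have children
            for ca in (ea.children or [ea]):
                for cb in (eb.children or [eb]):
                    heapq.heappush(
                        heap, (min_min_dist(ca.mbr, cb.mbr), next(_tie), ca, cb))
    return None

ta = Node((0, 0, 1, 1), children=[Node((0, 0, 1, 1), points=[(0, 0), (1, 1)])])
tb = Node((5, 5, 9, 9), children=[Node((5, 5, 9, 9), points=[(5, 5), (9, 9)])])
print(closest_pair(ta, tb))  # ((1, 1), (5, 5))
\end{verbatim}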

A straightforward approach to answering an all-NN query is to perform one NN query on data set $B$ for each object in $A$. In [BKSS90], the authors propose several improvements to reduce CPU and I/O costs. Zhang et al. [ZPMT04] proposed two approaches for the case where data set $B$ is indexed. The multiple nearest neighbor (MNN) method is similar to an index nested-loop join operation: for each object in $A$, MNN applies a NN query on the R-tree of data set $B$. The order in which the objects from $A$ are processed is important, because if two objects that are near each other are processed one after the other, a large percentage of the R-tree pages needed by the second object will still be available in the LRU buffer. To achieve a better order, data set $A$ is sorted by a space-filling curve [Bia69] (see the sketch below), which reduces the I/O cost of MNN. The CPU cost of MNN, however, is still very high. To reduce it, the authors propose the batched nearest neighbor (BNN) method. BNN splits the points of $A$ into $n$ disjoint groups and traverses the R-tree of $B$ only once per group, which significantly reduces the number of distance computations and hence the CPU cost. They also presented a hash-based method, but their experiments showed that BNN outperforms it in most cases.
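
As an illustration of the ordering step in MNN, the sketch below sorts the objects of $A$ along a Z-order (Morton) curve. Note that [ZPMT04] may use a different space-filling curve, and quantising coordinates to non-negative integer grid cells is an assumption made here for simplicity.

\begin{verbatim}
def morton_key(x, y, bits=16):
    """Interleave the low `bits` bits of non-negative integer grid
    coordinates x and y to form a Z-order (Morton) key."""
    key = 0
    for i in range(bits):
        key |= ((x >> i) & 1) << (2 * i)       # x bits at even positions
        key |= ((y >> i) & 1) << (2 * i + 1)   # y bits at odd positions
    return key

def z_order(points, bits=16):
    """Sort 2-D integer points along the Z-order curve so that points
    processed consecutively are usually close in space, improving the
    LRU buffer hit rate of the successive NN queries."""
    return sorted(points, key=lambda p: morton_key(p[0], p[1], bits))

print(z_order([(8, 8), (0, 1), (9, 9), (1, 0)]))
# [(1, 0), (0, 1), (8, 8), (9, 9)]
\end{verbatim}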

The latest work on all-NN queries is by Chen and Patel [CP07], who improve the pruning bounds presented by Corral et al. [CMTV04]. Let $M$ and $N$ be two minimum bounding rectangles (MBRs) for data points in $A$ and $B$, respectively. Corral et al. define $MINMINDIST(M,N)$ as the minimum distance between the two MBRs, i.e., the minimum possible distance between any two points inside the MBRs (one point in $M$ and one in $N$). They define $MAXMAXDIST(M,N)$ as the maximum distance between the two MBRs. These distance metrics are illustrated in Fig. [*]. The pruning condition is that an MBR $N'$ and its child entries can be pruned if $MAXMAXDIST(M,N)<MINMINDIST(M,N')$, because then for every point $a_i\in M$ there always exists a point in $N$ that lies closer to $a_i$ than all points in $N'$.
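
For axis-aligned MBRs in the plane, both metrics are simple to compute, as the following sketch shows; squared distances are used to avoid square roots, which does not affect any comparison (min_min_dist repeats the helper from the CP sketch above).

\begin{verbatim}
def min_min_dist(m, n):
    """MINMINDIST(M, N): smallest possible squared distance between a
    point in MBR m and a point in MBR n; zero when the MBRs overlap.
    MBRs are given as (x_lo, y_lo, x_hi, y_hi)."""
    dx = max(n[0] - m[2], m[0] - n[2], 0.0)
    dy = max(n[1] - m[3], m[1] - n[3], 0.0)
    return dx * dx + dy * dy

def max_max_dist(m, n):
    """MAXMAXDIST(M, N): largest possible squared distance between a
    point in m and a point in n, realised at a pair of far corners."""
    dx = max(m[2] - n[0], n[2] - m[0])
    dy = max(m[3] - n[1], n[3] - m[1])
    return dx * dx + dy * dy

def can_prune(m, n, n_prime):
    """Pruning rule of Corral et al.: N' cannot contain the NN of any
    point in M if some point of N is guaranteed to be closer."""
    return max_max_dist(m, n) < min_min_dist(m, n_prime)
\end{verbatim}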

Figure: Different Pruning Metrics for All-NN Queries
\includegraphics[width=6.0in]{background/fig/all-nn.eps}

Chen and Patel note that the pruning metric $MAXMAXDIST$ is unnecessarily conservative and can be improved. They define another pruning metric, $MINMAXMINDIST$, termed $NXNDIST$ for short and shown in Fig. [*]. Intuitively, $NXNDIST(M,N)$ is the maximum possible distance from any object $a_i\in M$ to its nearest neighbor $b_i\in N$. An MBR $N'$ and its child entries can be pruned if $NXNDIST(M,N)<MINMINDIST(M,N')$, because by the definitions of MBR and $NXNDIST$, for every point $a_i\in M$ there always exists a point in $N$ that lies closer to $a_i$ than all points in $N'$. We omit the mathematical definition and computation of $NXNDIST$ due to their complexity and refer interested readers to [CP07].
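
Although the computation of $NXNDIST$ itself is omitted here, the way it tightens the pruning rule can be sketched. In the sketch below, nxn_dist is merely a placeholder for the computation given in [CP07], and min_min_dist is the helper defined in the earlier sketch.

\begin{verbatim}
def prune_entries(m, entries, nxn_dist, min_min_dist):
    """Keep only those child MBRs (from a node of B's R-tree) that may
    contain the NN of some point in MBR m.  nxn_dist stands in for the
    NXNDIST computation of [CP07].  Since NXNDIST(M,N) <= MAXMAXDIST(M,N),
    this rule prunes at least as many entries as the MAXMAXDIST rule."""
    bound = min(nxn_dist(m, n) for n in entries)
    return [n for n in entries if min_min_dist(m, n) <= bound]
\end{verbatim}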
