Given two object sets $A$ and $B$, an all-nearest neighbor (all-NN) query finds, for each object in $A$, a nearest neighbor from $B$. The formal definition is given below.
Definition: Given two object sets $A$ and $B$, an all-NN query (all-NN$(A,B)$) is to find, for each object $a \in A$, an object $b \in B$ such that for all other objects $b' \in B$, $dist(a,b) \leq dist(a,b')$.
The result of an all-NN$(A,B)$ query consists of $|A|$ object pairs $(a,b)$, where $|A|$ is the cardinality of set $A$ and $b$ is the nearest neighbor of $a$ in set $B$. Note that all-NN queries are not commutative. More specifically, all-NN$(A,B) \neq$ all-NN$(B,A)$.
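As a concrete illustration, the definition above can be sketched as a brute-force computation in a few lines of Python (the point sets here are made up for illustration; a real system would use an index rather than scanning $B$ for every object):

```python
from math import dist  # Euclidean distance between two points (Python 3.8+)

def all_nn(A, B):
    """Brute-force all-NN(A, B): pair each a in A with its nearest b in B."""
    return [(a, min(B, key=lambda b: dist(a, b))) for a in A]

A = [(0.0, 0.0), (4.0, 0.0)]
B = [(1.0, 0.0), (9.0, 0.0)]

print(all_nn(A, B))  # both points of A are nearest to (1.0, 0.0)
print(all_nn(B, A))  # a different pairing: all-NN is not commutative
```

Running both directions on the same two sets makes the non-commutativity visible: the pairings returned by all-NN$(A,B)$ and all-NN$(B,A)$ differ.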
These queries are common in several applications. For example, a query may be issued to find the nearest hospital for each housing society of a city, which is useful for urban planning; such queries also arise in resource allocation problems. Several clustering and outlier detection algorithms use all-NN queries for efficiency. Similar to aggregate nearest neighbor queries, all-NN queries can also be used to detect abnormalities in very large circuits [NO97].
A few spatial join methods have been proposed that can be used to answer all-NN queries. Brinkhoff et al. [BKS93] proposed a join method where both data sets $A$ and $B$ are assumed to be indexed by R-trees. The algorithm returns the pairs of objects from the sets $A$ and $B$ that intersect each other. It traverses both trees and prunes any entry pair $(E_A, E_B)$ if the nodes $E_A$ and $E_B$ do not overlap. The intuition is that such nodes cannot contain any object $a \in A$ that intersects some object $b \in B$.
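The overlap-based pruning in the intersection join amounts to a simple rectangle test; a minimal sketch, assuming axis-aligned MBRs with made-up node coordinates:

```python
def overlaps(r, s):
    """True if two axis-aligned MBRs r = (xlo, ylo, xhi, yhi) and s intersect."""
    return r[0] <= s[2] and s[0] <= r[2] and r[1] <= s[3] and s[1] <= r[3]

# An entry pair of the join is descended only when the node MBRs overlap.
node_a = (0, 0, 2, 2)
node_b = (3, 3, 5, 5)   # disjoint from node_a -> the pair is pruned
node_c = (1, 1, 4, 4)   # overlaps node_a  -> the pair is visited

print(overlaps(node_a, node_b))  # False
print(overlaps(node_a, node_c))  # True
```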
Clearly, the problem of spatial join is different from finding all-NN because even if two nodes do not overlap, they can still contain two points $a$ and $b$ such that $b$ is the NN of $a$. However, this spatial join technique can be slightly modified to answer the closest-pair problem, which has been used to answer all-NN queries [HS98,CMTV00]. A closest-pair (CP) problem is to find a pair $(a,b)$ from two sets $A$ and $B$ such that there is no other pair $(a',b')$ with $dist(a',b') < dist(a,b)$. The above spatial join method can be used to answer CP queries; the difference is that sometimes nodes that do not overlap must still be visited. The algorithm traverses the two R-tree indexes and recursively visits the entry pair that has the minimum distance among all entry pairs. Böhm et al. [BK04] propose a $k$-nearest neighbor join method which searches the $k$ NNs from set $B$ for each object in $A$. If $k = 1$, the problem is the same as an all-NN query. GORDER [Hon04] is another method to process $k$ nearest neighbor joins.
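The relationship between $k$-NN joins and all-NN queries can be sketched with a brute-force join (illustrative only; the actual join algorithms of [BK04] and [Hon04] rely on indexing, sorting, and pruning):

```python
from math import dist  # Euclidean distance (Python 3.8+)

def knn_join(A, B, k):
    """Brute-force k-NN join: map each a in A to its k closest objects of B."""
    return {a: sorted(B, key=lambda b: dist(a, b))[:k] for a in A}

A = [(0.0, 0.0)]
B = [(1.0, 0.0), (2.0, 0.0), (5.0, 0.0)]

print(knn_join(A, B, k=2))  # the two closest objects of B for each a
# With k = 1 the join returns exactly the all-NN(A, B) pairing.
print(knn_join(A, B, k=1))
```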
A straightforward approach to answer an all-NN query is to perform one NN query on data set $B$ for each object in $A$. In [BKSS90], the authors propose several improvements to reduce CPU and I/O costs. Zhang et al. [ZPMT04] proposed two approaches for the case when data set $B$ is indexed. The multiple nearest neighbor (MNN) method is similar to an index-nested-loop join operation. For each object in $A$, MNN applies an NN query on the R-tree of data set $B$. The order in which the objects from $A$ are processed is important: if two objects that are near each other are processed one after the other, a large percentage of the R-tree pages will still be available in the LRU buffer when the second object is processed.
To achieve a better order, the data set $A$ is sorted by a space-filling curve [Bia69], which reduces the I/O cost of MNN. However, the CPU cost of MNN is still very high. To reduce it, they propose the batched nearest neighbor (BNN) method. BNN splits the points in $A$ into $n$ disjoint groups. The R-tree of $B$ is then traversed only once for each group, which significantly reduces the number of distance computations and hence the CPU cost. They also presented a hash-based method, but their experiments showed that BNN outperforms the hash-based method in most settings.
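A common choice of space-filling curve is the Z-order (Morton) curve, computed by interleaving the bits of the coordinates; a minimal sketch (the specific curve used in [ZPMT04] may differ, and the points here are made up):

```python
def z_value(x, y, bits=16):
    """Morton/Z-order key: interleave the bits of integer coordinates x and y."""
    z = 0
    for i in range(bits):
        z |= ((x >> i) & 1) << (2 * i)        # x bit goes to even position
        z |= ((y >> i) & 1) << (2 * i + 1)    # y bit goes to odd position
    return z

points = [(0, 0), (7, 7), (1, 0), (6, 7), (0, 1)]
ordered = sorted(points, key=lambda p: z_value(*p))
print(ordered)  # spatially close points end up adjacent in the order
```

Processing $A$ in this order keeps consecutive NN queries in nearby regions of the R-tree of $B$, which is exactly what makes the LRU buffer effective for MNN.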
The latest work on all-NN queries is by Chen and Patel [CP07]. They improve the pruning bounds presented by Corral et al. [CMTV04]. Let $M_A$ and $M_B$ be two minimum bounding rectangles (MBRs) for data points in $A$ and $B$, respectively. Corral et al. define $MINMINDIST(M_A, M_B)$ as the minimum distance between the two MBRs. The intuition is that this is the minimum possible distance between any two points in the MBRs (one point in $M_A$ and one in $M_B$). They define $MAXMAXDIST(M_A, M_B)$ as the maximum distance between the two MBRs $M_A$ and $M_B$. These distance metrics are illustrated in the figure. The pruning condition is that an MBR $M_B$ and its child entries can be pruned if $MINMINDIST(M_A, M_B) > MAXMAXDIST(M_A, M_B')$ for some other MBR $M_B'$, because for every point $a \in M_A$ there always exists a point in $M_B'$ that lies closer to $a$ than all points in $M_B$.
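To make the bounds concrete, here is a minimal 2-D sketch of the two metrics and the pruning test; the function names mirror the metric names of Corral et al., but the implementations and rectangles are illustrative:

```python
from math import sqrt

def minmindist(r, s):
    """Minimum distance between any point of MBR r and any point of MBR s.
    MBRs are given as ((xlo, ylo), (xhi, yhi))."""
    d2 = 0.0
    for i in range(2):
        gap = max(s[0][i] - r[1][i], r[0][i] - s[1][i], 0.0)  # 0 if overlapping
        d2 += gap * gap
    return sqrt(d2)

def maxmaxdist(r, s):
    """Maximum distance between any point of MBR r and any point of MBR s."""
    d2 = 0.0
    for i in range(2):
        span = max(abs(r[1][i] - s[0][i]), abs(s[1][i] - r[0][i]))
        d2 += span * span
    return sqrt(d2)

MA  = ((0.0, 0.0), (1.0, 1.0))   # MBR of points from A
MB  = ((4.0, 0.0), (5.0, 1.0))   # candidate node of B's R-tree
MB2 = ((1.0, 0.0), (2.0, 1.0))   # another node of B's R-tree

# Corral et al.'s rule: MB can be pruned for MA when its minimum possible
# distance exceeds the maximum possible distance to some other node MB2.
print(minmindist(MA, MB) > maxmaxdist(MA, MB2))  # True: MB is pruned
```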
Chen and Patel note that the pruning metric $MAXMAXDIST$ is unnecessarily conservative and can be improved. They define another pruning metric, termed $NXNDIST$ in short, also shown in the figure. Intuitively, $NXNDIST(M_A, M_B)$ is the maximum possible distance of any object in $M_A$ to its nearest neighbor in $M_B$. An MBR $M_B$ and its child entries can be pruned if $MINMINDIST(M_A, M_B) > NXNDIST(M_A, M_B')$ for some other MBR $M_B'$, because by the definitions of MBR and $NXNDIST$, for every point $a \in M_A$ there always exists a point in $M_B'$ that lies closer to $a$ than all points in $M_B$. We omit the mathematical definition and calculation of $NXNDIST$ due to its complexity and refer interested readers to [CP07].
Muhammad Aamir Cheema 2007-10-11