Most of the work on $k$NN queries assumes that the data is disk-resident and can be
scanned multiple times. However, such work is not suitable for processing
data streams, which typically require one-pass algorithms because the data is
not stored on disk and is too large to fit in main memory. Consequently, queries on
data streams do not have access to the entire data set, and query answers are
typically approximate.
Bern et al. [Ber93] presented a quadtree-based algorithm
to answer approximate NN queries. The approximate answer guarantees that the
distance from the query point to the returned point is at most a factor, determined by
the dimensionality $d$, times the distance from the query point to any other point in the
data structure. However, their approach is limited to finding only a single nearest neighbor,
and they did not discuss how to extend it to answer $k$ nearest neighbor queries. Moreover,
the relative error depends on the dimensionality $d$.
The problem of $\epsilon$-approximate $k$NN queries was studied in [AM93].
Let $d_i$ be the distance of the $i$-th actual nearest neighbor of a query point $q$
from $q$; an $\epsilon$-approximate $k$NN query finds a set $kNN'$ containing $k$
objects such that $d_i' \leq (1+\epsilon)\,d_i$ for every $1 \leq i \leq k$, where
$d_i'$ is the distance of the $i$-th object in $kNN'$ from $q$. The relative error
$\epsilon$ is a user specified constant. The algorithm proposed in [AM93] requires
time exponential in the dimensionality $d$ and linear space. Follow-up
studies [AMN$^+$98,KOR98,LCGMW02] improved its time/space
requirements.
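
To make the relative-error guarantee concrete, the following sketch checks whether a candidate answer set satisfies the $\epsilon$-approximate $k$NN condition above. The function names and the use of Euclidean distance are illustrative choices of ours and are not part of [AM93].
\begin{verbatim}
import math

def dist(p, q):
    """Euclidean distance between two equal-length point tuples."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))

def is_epsilon_approximate(query, answer, dataset, k, eps):
    """Check the relative-error guarantee: the i-th reported neighbor may be
    at most (1 + eps) times farther from the query than the i-th true one."""
    if len(answer) < k:
        return False
    true_d = sorted(dist(query, p) for p in dataset)[:k]
    ans_d = sorted(dist(query, p) for p in answer)[:k]
    return all(da <= (1 + eps) * dt for da, dt in zip(ans_d, true_d))
\end{verbatim}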
Koudas et al. [KOTZ04] were the first to propose a solution to
approximate $k$NN queries with absolute error bounds. Given a dataset $P$, a query point $q$,
and a user specified value $\epsilon$, they
find a set $kNN'(q)$ containing $k$ points of $P$ such that for any $p' \in kNN'(q)$ there exists
a point $p \in kNN(q)$ (the actual $k$NN set) such that $dist(q,p') \leq dist(q,p) + \epsilon$. Their
technique is called DISC (aDaptive Indexing on Stream by space-filling Curve).
They partition the space into a grid so that the maximum distance between any two points in
any cell is at most $\epsilon$. To avoid storing all the points arriving in the system, they keep only
$K$ objects in each cell, where $K$ is an upper bound on the query parameter $k$, and discard
the rest. They prove that an exact $k$NN search on these stored objects corresponds to a valid
approximate $k$NN answer over the original data set.
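
The following sketch illustrates the flavor of this grid-based summarization; the class and parameter names, and the brute-force scan over the retained points, are ours (DISC itself answers queries through its space-filling-curve index). Cells are sized so that their diagonal is at most $\epsilon$, at most $K$ points are retained per cell, and an exact $k$NN search is run over the retained points.
\begin{verbatim}
import heapq
import math
from collections import defaultdict

class GridSummary:
    """Stream summary in the spirit of DISC (illustrative only): cells have
    diagonal at most eps and retain at most max_k points each."""

    def __init__(self, dim, eps, max_k):
        self.side = eps / math.sqrt(dim)   # cell side => cell diagonal <= eps
        self.max_k = max_k                 # upper bound on the k a query may ask
        self.cells = defaultdict(list)     # cell coordinates -> retained points

    def _cell(self, p):
        return tuple(int(math.floor(x / self.side)) for x in p)

    def insert(self, p):
        bucket = self.cells[self._cell(p)]
        if len(bucket) < self.max_k:       # keep only max_k points per cell,
            bucket.append(p)               # later arrivals in a full cell are dropped

    def knn(self, q, k):
        """Exact kNN over the retained points (k <= max_k); a plain scan is
        used here in place of DISC's space-filling-curve index."""
        d = lambda p: math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))
        pts = (p for bucket in self.cells.values() for p in bucket)
        return heapq.nsmallest(k, pts, key=d)
\end{verbatim}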
DISC indexes the data points with a B-tree that uses a space-filling curve mechanism to facilitate
fast updates and query processing. The index can be adjusted either to 1) optimize memory
utilization to answer $k$NN queries under certain accuracy requirements or 2) achieve the best
possible accuracy under a given memory constraint. DISC can process both snapshot and continuous
$k$NN queries.
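
As an illustration of how grid cells can be ordered by a space-filling curve and stored in a one-dimensional index such as a B-tree, the sketch below computes a Z-order (Morton) key by bit interleaving; the choice of curve and the parameters here are illustrative and not necessarily those used by DISC.
\begin{verbatim}
def z_order_key(cell, bits=16):
    """Interleave the bits of the (non-negative) cell coordinates into a single
    Z-order (Morton) key, so that cells can be stored in an ordered structure
    such as a B-tree; 'bits' is the number of bits used per coordinate."""
    key = 0
    for bit in range(bits):
        for dim, coord in enumerate(cell):
            key |= ((coord >> bit) & 1) << (bit * len(cell) + dim)
    return key

# Example: Morton keys of two 2-d grid cells.
print(z_order_key((3, 5)), z_order_key((3, 6)))   # 39 45
\end{verbatim}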