Approximate NN Queries

Most of the work on NN queries assumes that the data is disk-resident and can be scanned multiple times. However, such work is not suitable for processing data streams that typically require a one-pass algorithms as the data is not stored on disk and it is too large to fit in main memory. So, queries on data streams do not have access to the entire data set and query answers are typically approximate.

Bern et al. [Ber93] presented a quadtree based algorithm to answer approximate NN queries. The approximate answer guarantees that the distance from the query point to the returned point is at most $12.d^{1/2}$ times the distance from the query point to any other point in the data structure where is the dimensionality. However, their approach is limited to find only a single NN and they did not discuss the extension of their approach to answer nearest neighbors. Moreover, the relative error depends on the dimensionality .

Problem of $(1+\epsilon)$ -approximate NN queries was studied in [AM93]. Let be the distance of $k^{th}$ actual nearest neighbor of from , an approximate $(1+\epsilon)k$ NN query is to find a set of objects $(1+\epsilon)k$ NN that contains objects so that $q_{(1+\epsilon)}.dist_k \leq q.dist_k\times(1+\epsilon)$ where $q_{(1+\epsilon)}.dist_k$ is the distance of $k^{th}$ object in $(1+\epsilon)k$ NN from . The relative error $\epsilon$ is a user specified constant. The algorithm proposed in [AM93] required exponential time in and linear space. Follow-up studies [AMN$^+$98,KOR98,LCGMW02] improved its time/space requirements.

Koudas et al. [KOTZ04] were the first to propose a solution to approximate NN queries with absolute error bounds. Given a dataset , a query point and a user specified value , they find a set NN that contains points in such that for any $p'\in ekNN$ there exists a point $p\in kNN$ (the actual set) such that $dist(p',q)\leq (dist(p,q)+e$ ). Their technique is called DISC (aDaptive Indexing on Stream by space-filling Curve). They partition the space into a grid so that the maximum distance between any two points in any cell is at most . To avoid storing all the points arriving in system, they keep only $K\geq k$ objects in each cell and discard the rest. They prove that the exact NN search on these stored objects corresponds to a valid kNN answer over the original data set. DISC indexes the data points with a B-tree that uses a space-filling curve mechanism to facilitate fast updates and query processing. The index can be adjusted to either 1) optimize memory utilization to answer kNN queries under certain accuracy requirements or 2) achieve the best possible accuracy under a given memory constraint. DISC can process both snapshot and continuous kNN queries.

Muhammad Aamir Cheema 2007-10-11