Reproducibility for RkNN Experiments

Reproducibility for RkNN Experiments

This page contains source code, data sets, pseudocodes, and scripts for our paper "Reverse k Nearest Neighbors Query Processing: Experiments and Analysis" published in PVLDB 2015.

Copyright and Disclaimer

Copyright (c) 2014, Shiyu Yang, Muhammad Aamir Cheema
All rights reserved.

Redistribution and use in source and binary forms, with or without modification, are permitted provided that the following conditions are met:

* Redistributions of source code must retain the above copyright notice, this list of conditions and the following disclaimer.
* Redistributions in binary form must reproduce the above copyright notice, this list of conditions and the following disclaimer in the documentation and/or other materials provided with the distribution.
* The names of the authors may not be used to endorse or promote products derived from this software without specific prior written permission.

THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS "AS IS" AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT OWNER OR CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.

Download

Click here to download the package. Note that this package contains 32-bit binaries. If you compile the code to 64-bit binaries, you will need to use 64-bit R-tree generator. In that case, download 64-bit data (and R-tree) generator and replace these files with the files in ./bin folder in the original package.

This package contains the following:

Scripts to generate synthetic data sets(./createRaw.sh) and R-trees (./createRTree.sh) used in the paper. Scripts to run experiments shown in the paper (./runExperiments.sh and ./runAnalysis.sh). Scripts to draw figures shown in the paper (./drawFig_exp.sh and ./drawFig_analysis.sh).
./bin folder contains a program (./bin/datagen_multid) to generate synthetic data sets, a program (./bin/rtree_gen_2d) to generate RTrees using the raw data files stored in ./Raw_files folder and the RKNN algorithms (./bin/rknn) .
./code folder contains the source code of Six-regions, TPL, TPL++, FINCH, InfZone and SLICE algorithms.
./draw folder contains the scripts to draw the figures shown in the paper.
./fig folder initially contains the experimental figures shown in the paper. The figures generated by the scripts will overwrite the figures in this folder.
./Raw_files is initially empty. It will contain the raw data sets (synthetic data sets generated by ./bin/datagen_multid) and the real data sets.
./results folder is initially empty. The output of ./bin/rknn will be stored in this folder.
./RTrees folder is initially empty. The RTrees generated by ./bin/rtree_gen_2d will be stored in this folder.

Click here and download the real data sets. You will need to move these files to ./Raw_files folder found in the extracted package.

Instructions

To reproduce the experimental results in the paper, run runAll.sh or follow the steps below.

run buildRkNN.sh to compile the source code and create the binary files. The compiled binary files will be stored in ./bin folder. If your system is 64-bits and you are using 32-bit files for RTree generator etc., please make sure that you modify buildRkNN.sh to compile 32 bit binaries.
run createRaw.sh. It will create the synthetic data sets by following the parameters specified in the paper. The raw data files will be stored in ./Raw_files.
run createRTree.sh. It will create R-trees for all the data sets used in the paper. The R-trees will be stored in ./RTrees. Note: you will need to make sure that the real data sets have been copied to ./Raw_files before running this step.
run runExperiments.sh. It will run all the experiments shown in the Experiments section of the paper. The results (stats) of each algorithm for each experiment are stored in ./results folder.
run runAnalysis.sh to run the experiments shown in Introduction section and the Algorithms section. This will store the results in ./results folder.
run drawFig_exp.sh and drawFig_analysis.sh. These scripts will use the results files in ./results folder and will create the figures accordingly. Due to different CPU time on different systems, the appearance of the figures may not be optimal. However, it can be fixed by modifying the scripts appropriately.

Pseudocode

As mentioned in the paper, we implemented some simple yet effective optimizations in each algorithm. Click here to download the pseudocodes that we followed in our implementation. We have tried our best to implement each algorithm in an efficient and fair manner. If you have suggestions to improve certain algorithm, please contact us.

FAQs

This list of FAQs will be maintained (until mentioned otherwise). If you have any question, please contact us.

Question. We downloaded the code and tried to run the executables but got the following error messgae:
/lib/ld-linux.so.2: bad ELF interpreter: No such file or directory
Answer. It appears that you are trying to run 32-bit binaries on a 64 bit system that is missing some of the 32 bits libraries. Please either use 64-bit binaries (link provided in Download section above) or install the 32 bit libraries. For more details, click here
Question. Why the number of I/Os for six-regions and FINCH are different each time I run the same experiment.
Answer. Since the buffer uses random eviction strategy, the number of I/Os may be different each time an experiment is run especially when the number of buffers used is small (although the difference is not huge). Since TPL, TPL++, InfZone and SLICE do not use the buffer, the number of I/Os for these algorithms do not change.
Question. I modified the script ./createRaw.sh to generate two small synthetic data sets containing 1000 and 1001 points, respectively (see below).
for size in 1000 1001 do ./datagen_multid $size unif 2 done
I noticed that the first 1000 points of the two synthetic data sets are always the same. Is there something wrong with the data generatore (./bin/datagen_multid)?
Answer. The data generator works fine on a few machines we tested, therefore, we are not able to reproduce the error for extensive testing. However, another user had reported the same problem. Note that the psuedo-random generator takes time(NULL) as a seed, i.e. srand( time (NULL) ). The apparent reason of the reported problem is that the same seed is being passed for each call to ./datagen_multid (may be the data sets being generated are too small???). This problem can be fixed by adding sleep 1 in ./createRaw.sh before every call to ./datagen_multid. This ensures that it gets a different seed every time it is called. If you are curious how the data generator works, you can download the source code of the data generator.

Reporting Bugs

We have extensively tested the source code by comparing the results of all algorithms with a Brute Force algorithm for more than a Million queries on various data setts using different parameters. However, we cannot guarantee that the source code is bug-free. If you spot a bug, please contact me (email given at the home page) or Shiyu Yang (yangs AT CSE dot UNSW dot EDU dot AU).