Code
The code for our leapfrog implementation for Apache Jena is available here.
Dataset
The dataset used was a reduced version of the Wikidata truthy dump from November 15, 2018. The original dump and its reduced version are available on Zenodo.
Repeating the experiments
Prerequisites
- Any x64 Linux distribution with glibc support
- Java 8
- Python (either 2 or 3 works)
- bzip2
- On a Debian-based distro:
sudo apt install bzip2
- pip
Some of the following steps can take hours to complete, so we recommend using tmux to execute them.
Getting the repo and the dataset
- Clone this repository. If you use SSH keys:
git clone git@github.com:cirojas/leapfrog-benchmark.git
or, if you don't:
git clone https://github.com/cirojas/leapfrog-benchmark.git
- Download the dataset used and move it to the benchmark folder
- Extract it
bzip2 -d wikidata-wcg-filtered.nt.bz2
- Or you can construct the dataset from the truthy Wikidata dump
Create the database for Jena and leapfrog
- Download the file apache-jena-3.9.0.tar.gz from the Apache Jena downloads page or here and move it into the jena folder
- Change directory into the jena folder
- Extract it
tar -xf apache-jena-3.9.0.tar.gz
- Create the database for Jena
apache-jena-3.9.0/bin/tdbloader2 --loc=db/jena ../wikidata-wcg-filtered.nt
- Edit the file apache-jena-3.9.0/bin/tdbloader2index with any text editor. After line 389, which reads
generate_index "$K3 $K1 $K2" "$DATA_TRIPLES" OSP
add the following lines:
generate_index "$K1 $K3 $K2" "$DATA_TRIPLES" SOP
generate_index "$K2 $K1 $K3" "$DATA_TRIPLES" PSO
generate_index "$K3 $K2 $K1" "$DATA_TRIPLES" OPS
then save and exit.
- Create the database for the leapfrog implementation
apache-jena-3.9.0/bin/tdbloader2 --loc=db/leapfrog ../wikidata-wcg-filtered.nt
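The three generate_index lines added for the leapfrog database complete the set of triple orderings: together with the SPO, POS and OSP indexes that the unmodified loader builds, they cover all six permutations of subject, predicate and object. The short Python sketch below checks this and derives the key arguments for each index name. The mapping of $K1, $K2, $K3 to subject, predicate, object is inferred from the OSP line quoted above, and the names DEFAULT_INDEXES, ADDED_INDEXES and key_order are illustrative only, not part of Jena.

```python
from itertools import permutations

# Indexes built by the unmodified tdbloader2index script.
DEFAULT_INDEXES = {"SPO", "POS", "OSP"}
# Indexes produced by the three added generate_index lines.
ADDED_INDEXES = {"SOP", "PSO", "OPS"}

# Every possible ordering of subject (S), predicate (P) and object (O).
ALL_ORDERS = {"".join(p) for p in permutations("SPO")}


def key_order(index_name):
    """Translate an index name such as 'SOP' into its key argument string.

    Assumes $K1 = subject, $K2 = predicate, $K3 = object, as suggested by
    the existing line: generate_index "$K3 $K1 $K2" "$DATA_TRIPLES" OSP
    """
    position = {"S": "$K1", "P": "$K2", "O": "$K3"}
    return " ".join(position[c] for c in index_name)


if __name__ == "__main__":
    # The default and added indexes together cover all six orderings.
    print(sorted(ALL_ORDERS - DEFAULT_INDEXES - ADDED_INDEXES))
    for name in sorted(ADDED_INDEXES):
        print('generate_index "%s" "$DATA_TRIPLES" %s' % (key_order(name), name))
```

Running the script prints an empty list for the missing orderings, followed by the three lines to add (in alphabetical order of index name).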
Create the database for Blazegraph
- Download the Blazegraph jar from its SourceForge page or from here and move it into the blazegraph folder
- Change directory into the blazegraph folder
- Create the database
java -Xmx20g -cp blazegraph.jar com.bigdata.rdf.store.DataLoader load.properties ../wikidata-wcg-filtered.nt
Create the database for Virtuoso Open Source
- Download Virtuoso Open Source Edition v7.2.5.1 from its GitHub releases page or from here and move it into the virtuoso folder
- Change directory into the virtuoso folder
- Extract it
tar -xf virtuoso-opensource.x86_64-generic_glibc25-linux-gnu.tar.gz
- Start the server
virtuoso-opensource/bin/virtuoso-t -c virtuoso.ini
- The server can take some time to start. Wait a minute, then start the interactive SQL client:
virtuoso-opensource/bin/isql localhost:1111
and enter the following commands:
ld_dir('..', '*.nt', 'http://wikidata.org');
rdf_loader_run();
exit();
- Shut down the server
virtuoso-opensource/bin/isql localhost:1111 -K
Run the benchmark
- Change directory into the benchmark folder
- bash run-benchmark.sh queries/bgps
- bash run-benchmark.sh queries/optionals
The results are now available in the folders queries/bgps/output and queries/optionals/output. For each query pattern you will find a folder containing four files, one for each database. Each line of a file contains three values separated by semicolons:
queryNumber;numberOfResults;executionTimeInNanoseconds
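To compare the databases, each result file can be parsed and summarized with a few lines of Python. The following is a minimal sketch, assuming Python 3 and that every non-empty line holds exactly the three semicolon-separated integer values described above. The function names parse_results, median_ms and summarize are illustrative and are not part of the repository.

```python
import statistics
from pathlib import Path


def parse_results(text):
    """Parse result lines into (query number, result count, nanoseconds) tuples."""
    rows = []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue  # skip blank lines
        query, results, nanos = line.split(";")
        rows.append((int(query), int(results), int(nanos)))
    return rows


def median_ms(rows):
    """Median execution time in milliseconds (1 ms = 1,000,000 ns)."""
    return statistics.median(nanos for _, _, nanos in rows) / 1e6


def summarize(path):
    """Read one result file and return (number of queries, median time in ms)."""
    rows = parse_results(Path(path).read_text())
    return len(rows), median_ms(rows)
```

Calling summarize on each of the four files in a query pattern folder gives one (count, median) pair per database. The median is used here only as an example of a summary statistic; any other aggregate can be computed from the parsed rows in the same way.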
Building the dataset
- Download the Wikidata truthy dump wikidata-wcg.nt.bz2 from here.
- Extract it
bzip2 -d wikidata-wcg.nt.bz2
- Move it to the wikidata-filter folder and change directory to that folder.
- Execute
python remove_labels_and_descriptions.py
to remove labels and descriptions from Wikidata, along with strings in languages other than English.
- Execute
python remove_properties.py
to remove all properties listed in removed_properties.txt. In our case we removed all properties that appeared more than 1,000,000 times or fewer than 1,000 times.
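A list of properties to remove can be derived by counting how often each predicate occurs in the N-Triples file. The sketch below illustrates that idea in Python 3. It is not the repository's script, the function names are hypothetical, and it assumes one triple per line with whitespace-separated subject and predicate, as in standard N-Triples.

```python
from collections import Counter


def predicate_of(line):
    """Return the predicate of one N-Triples line, or None for blank/comment lines."""
    line = line.strip()
    if not line or line.startswith("#"):
        return None
    # Subject and predicate never contain whitespace, so two splits suffice;
    # the remainder (object and final dot) may contain spaces inside literals.
    parts = line.split(None, 2)
    return parts[1] if len(parts) == 3 else None


def count_predicates(lines):
    """Count occurrences of every predicate in an iterable of N-Triples lines."""
    counts = Counter()
    for line in lines:
        pred = predicate_of(line)
        if pred is not None:
            counts[pred] += 1
    return counts


def properties_to_remove(counts, low=1000, high=1000000):
    """Predicates that appear fewer than `low` or more than `high` times."""
    return sorted(p for p, n in counts.items() if n < low or n > high)
```

Because count_predicates accepts any iterable of lines, an open file object can be passed directly, so the dump is streamed rather than loaded into memory.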
Getting random queries for the benchmark
For each query pattern we created a Java program that finds 50 random sets of properties with at least 1 result.
The jars are in the find-queries folder.
To find queries, execute
java -jar find_XYZ.jar [jena-database-location] properties_wikidata.txt
where properties_wikidata.txt is a file with the properties that can be chosen.
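The search for random queries can be pictured as rejection sampling: draw a random set of properties, test whether the resulting query has at least one result, and keep the set if it does. The Python 3 sketch below shows that loop only. It is not the code in the jars; the has_results callback stands in for running the query against the Jena database, and the duplicate check and attempt limit are assumptions added for illustration.

```python
import random


def find_property_sets(properties, set_size, has_results,
                       wanted=50, max_attempts=100000, seed=None):
    """Collect up to `wanted` distinct random property sets that have results.

    properties   -- list of candidate properties
    set_size     -- number of properties in each set (depends on the pattern)
    has_results  -- hypothetical callback: returns True if the query built
                    from the given properties yields at least one result
    """
    rng = random.Random(seed)
    found = []
    seen = set()
    for _ in range(max_attempts):
        if len(found) == wanted:
            break
        candidate = tuple(rng.sample(properties, set_size))
        if candidate in seen:
            continue  # do not test the same set twice
        seen.add(candidate)
        if has_results(candidate):
            found.append(candidate)
    return found
```

The attempt limit keeps the loop from running forever when few property combinations produce results; in that case fewer than the requested number of sets is returned.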
Results
You can find our results in our repository.