Code
The code for our leapfrog implementation for Apache Jena is available here.
Dataset
The dataset is a reduced version of the Wikidata truthy dump from November 15, 2018. The original dump and its reduced version are available on Zenodo.
Repeating the experiments
Prerequisites
- Any x64 Linux distribution with glibc support
- Java 8
- Python (either 2 or 3 works)
- bzip2
- On a Debian-based distro:
sudo apt install bzip2
- pip
Some of the following steps can take hours to complete, so we recommend using tmux to execute them.
Getting the repo and the dataset
- Clone this repository. If you use SSH keys:
git clone git@github.com:cirojas/leapfrog-benchmark.git
Otherwise:
git clone https://github.com/cirojas/leapfrog-benchmark.git
- Download the dataset and move it to the benchmark folder.
- Extract it:
bzip2 -d wikidata-wcg-filtered.nt.bz2
- Alternatively, you can construct the dataset from the truthy Wikidata dump (see Building the dataset).
Create the database for Jena and leapfrog
- Download the file apache-jena-3.9.0.tar.gz from the Apache Jena downloads page or from here and move it into the jena folder.
- Change directory into the jena folder.
- Extract it:
tar -xf apache-jena-3.9.0.tar.gz
- Create the database for Jena:
apache-jena-3.9.0/bin/tdbloader2 --loc=db/jena ../wikidata-wcg-filtered.nt
- Edit the file apache-jena-3.9.0/bin/tdbloader2index with any text editor. After line 389,
generate_index "$K3 $K1 $K2" "$DATA_TRIPLES" OSP
add the following lines:
generate_index "$K1 $K3 $K2" "$DATA_TRIPLES" SOP
generate_index "$K2 $K1 $K3" "$DATA_TRIPLES" PSO
generate_index "$K3 $K2 $K1" "$DATA_TRIPLES" OPS
then save and exit.
- Create the database for the leapfrog implementation:
apache-jena-3.9.0/bin/tdbloader2 --loc=db/leapfrog ../wikidata-wcg-filtered.nt
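The three lines added to tdbloader2index complete the set of triple orderings used by the leapfrog database. As a sanity check, the following Python sketch derives those lines from the OSP line quoted above. It assumes that K1, K2, and K3 are the sort keys for subject, predicate, and object (inferred from the OSP line), and that the unmodified script already builds SPO, POS, and OSP. The helper names are illustrative and not part of Jena.

```python
from itertools import permutations

# Assumption: K1, K2, K3 are the sort keys for subject, predicate, object,
# as implied by the existing line `generate_index "$K3 $K1 $K2" ... OSP`.
SORT_KEY = {"S": "$K1", "P": "$K2", "O": "$K3"}

# Assumption: orderings the unmodified tdbloader2index already builds.
EXISTING = {"SPO", "POS", "OSP"}


def generate_index_line(ordering):
    """Return the shell line that builds the index for one ordering, e.g. 'SOP'."""
    keys = " ".join(SORT_KEY[position] for position in ordering)
    return 'generate_index "{}" "$DATA_TRIPLES" {}'.format(keys, ordering)


def missing_orderings():
    """All permutations of S, P, O that are not in EXISTING, sorted by name."""
    all_orderings = {"".join(p) for p in permutations("SPO")}
    return sorted(all_orderings - EXISTING)


if __name__ == "__main__":
    for ordering in missing_orderings():
        print(generate_index_line(ordering))
```

Running the script prints one generate_index line for each of OPS, PSO, and SOP, which can be compared against the lines added by hand.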
Create the database for Blazegraph
- Download the Blazegraph jar from its SourceForge page or from here and move it into the blazegraph folder.
- Change directory into the blazegraph folder.
- Create the database:
java -Xmx20g -cp blazegraph.jar com.bigdata.rdf.store.DataLoader load.properties ../wikidata-wcg-filtered.nt
Create the database for Virtuoso Opensource
- Download Virtuoso Open Source Edition v7.2.5.1 from its GitHub releases page or from here and move it into the virtuoso folder.
- Change directory into the virtuoso folder.
- Extract it:
tar -xf virtuoso-opensource.x86_64-generic_glibc25-linux-gnu.tar.gz
- Start the server:
virtuoso-opensource/bin/virtuoso-t -c virtuoso.ini
- The server can take some time to start. Wait a minute, then start the interactive SQL client:
virtuoso-opensource/bin/isql localhost:1111
and enter the following commands:
ld_dir('..', '*.nt', 'http://wikidata.org');
rdf_loader_run();
exit();
- Shut down the server:
virtuoso-opensource/bin/isql localhost:1111 -K
Run the benchmark
- Change directory into the benchmark folder.
- Run the queries in queries/bgps:
bash run-benchmark.sh queries/bgps
- Run the queries in queries/optionals:
bash run-benchmark.sh queries/optionals
The results are now available in the folders queries/bgps/output and queries/optionals/output.
For each query pattern you will find a folder containing four files, one for each database. Each line of a file contains three values separated by semicolons:
queryNumber;numberOfResults;executionTimeInNanoseconds
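To inspect a result file, something like the following Python 3 sketch may help. It is not part of the repository. It assumes every line holds three integer fields in the order described above, and the function names are hypothetical. If the benchmark writes anything else (for example a marker for timeouts), the parser would need adjusting.

```python
import statistics
from collections import namedtuple

QueryResult = namedtuple("QueryResult", ["query_number", "num_results", "time_ns"])


def parse_result_line(line):
    """Parse one 'queryNumber;numberOfResults;executionTimeInNanoseconds' line.

    Assumption: all three fields are plain integers.
    """
    query_number, num_results, time_ns = line.strip().split(";")
    return QueryResult(int(query_number), int(num_results), int(time_ns))


def load_results(path):
    """Read a result file, skipping empty lines."""
    with open(path) as handle:
        return [parse_result_line(line) for line in handle if line.strip()]


def summarize(results):
    """Return the query count plus mean and median execution time in milliseconds."""
    times_ms = [r.time_ns / 1e6 for r in results]
    return {
        "queries": len(results),
        "mean_ms": statistics.mean(times_ms),
        "median_ms": statistics.median(times_ms),
    }
```

Calling summarize(load_results(path)) on each of the four files in a pattern folder gives a quick per-database comparison.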
Building the dataset
- Download the Wikidata truthy dump wikidata-wcg.nt.bz2 from here.
- Extract it:
bzip2 -d wikidata-wcg.nt.bz2
- Move it to the wikidata-filter folder and change directory to that folder.
- Execute
python remove_labels_and_descriptions.py
to remove labels and descriptions from Wikidata, along with strings in languages other than English.
- Execute
python remove_properties.py
to remove all properties listed in removed_properties.txt. In our case, we removed all properties that appeared more than 1,000,000 times or fewer than 1,000 times.
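The frequency rule above can be illustrated with a short Python sketch. This is not the script shipped in wikidata-filter, and how removed_properties.txt was actually produced is not described here. The sketch only shows one way to count predicate occurrences in an N-Triples file and select those outside the 1,000 to 1,000,000 range. All function names are hypothetical.

```python
from collections import Counter


def predicate_of(line):
    """Return the predicate term of one N-Triples line, or None for blanks/comments."""
    line = line.strip()
    if not line or line.startswith("#"):
        return None
    # Subject and predicate contain no whitespace in N-Triples,
    # so splitting off the first two fields is enough.
    parts = line.split(None, 2)
    return parts[1] if len(parts) == 3 else None


def count_predicates(lines):
    """Count how many triples use each predicate."""
    counts = Counter()
    for line in lines:
        predicate = predicate_of(line)
        if predicate is not None:
            counts[predicate] += 1
    return counts


def properties_to_remove(counts, min_count=1000, max_count=1000000):
    """Predicates appearing fewer than min_count or more than max_count times."""
    return sorted(p for p, n in counts.items() if n < min_count or n > max_count)
```

Because count_predicates accepts any iterable of lines, an open file handle can be passed directly, so the dump is streamed and never loaded into memory in full.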
Getting random queries for the benchmark
For each query pattern, we created a Java program that finds 50 random sets of properties with at least one result.
The jars are in the find-queries folder.
To find a query, execute
java -jar find_XYZ.jar [jena-database-location] properties_wikidata.txt
where properties_wikidata.txt is a file listing the properties that can be chosen.
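The selection idea can be sketched in Python as rejection sampling: draw a random set of properties, keep it if the corresponding query returns at least one result, and repeat until enough sets are found. This is an illustration only, not the logic of the Java programs. The has_results callback is a hypothetical stand-in for running the query against the Jena database, and sampling without replacement is an assumption.

```python
import random


def find_property_sets(properties, set_size, has_results,
                       wanted=50, max_attempts=100000, seed=None):
    """Collect up to `wanted` distinct random property tuples accepted by has_results.

    has_results is a hypothetical callback: it receives a tuple of properties
    and returns True if the query built from them yields at least one result.
    """
    rng = random.Random(seed)
    found = []
    seen = set()
    for _ in range(max_attempts):
        if len(found) == wanted:
            break
        # Assumption: a property appears at most once per set.
        candidate = tuple(rng.sample(properties, set_size))
        if candidate in seen:
            continue  # do not test the same tuple twice
        seen.add(candidate)
        if has_results(candidate):
            found.append(candidate)
    return found
```

The max_attempts bound keeps the loop from running forever when fewer than the wanted number of property sets produce results.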
Results
You can find our results in our repository.