Tamer Özsu (University of Waterloo) gave a one-hour seminar on RDF Data Management today at Politecnico di Milano.
What I found intriguing is the database and modeling perspective he has on RDF.
We all know that RDF is a machine-readable format for describing resources on the web, based on triples. A triple is a simple concept composed of three terms: subject, predicate, object.
Every resource is identified by a URI. Subjects and predicates are URIs, while objects can be either URIs or literals.
Terms can have a type (rdf:type), which roughly corresponds to a class in the modeling sense.
Triples define a graph whose vertices are URIs and literals.
RDF can be queried through SPARQL.
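To make the triple model concrete, here is a minimal sketch (not a real SPARQL engine): triples as Python 3-tuples and a tiny pattern matcher where `None` plays the role of a SPARQL variable. The `ex:` names are made up for illustration.

```python
# Triples as (subject, predicate, object) tuples; ex: names are hypothetical.
triples = {
    ("ex:alice", "rdf:type", "ex:Person"),
    ("ex:alice", "ex:name", '"Alice"'),
    ("ex:bob", "rdf:type", "ex:Person"),
    ("ex:bob", "ex:knows", "ex:alice"),
}

def match(pattern, data):
    """Return the triples matching an (s, p, o) pattern; None is a wildcard."""
    s, p, o = pattern
    return {t for t in data
            if (s is None or t[0] == s)
            and (p is None or t[1] == p)
            and (o is None or t[2] == o)}

# Roughly what "SELECT ?x WHERE { ?x rdf:type ex:Person }" asks:
people = {s for (s, _, _) in match((None, "rdf:type", "ex:Person"), triples)}
# people == {"ex:alice", "ex:bob"}
```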
From a database perspective, a naive solution is to define a global schema with a single table of three columns (Subject, Predicate, Object), upon which one could run SQL queries.
However, this solution is problematic because every additional triple pattern in a query forces a self-join on that one huge table.
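The self-join problem can be seen with an in-memory SQLite sketch (the data and `ex:` names are invented for illustration): a query with just two triple patterns already needs the triple table joined with itself.

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE triples (s TEXT, p TEXT, o TEXT)")
con.executemany("INSERT INTO triples VALUES (?, ?, ?)", [
    ("ex:bob", "ex:knows", "ex:alice"),
    ("ex:alice", "ex:name", "Alice"),
    ("ex:bob", "ex:name", "Bob"),
])

# "What is the name of someone bob knows?" — two triple patterns,
# so the single table must be joined with itself once per extra pattern.
rows = con.execute("""
    SELECT t2.o
    FROM triples t1
    JOIN triples t2 ON t1.o = t2.s
    WHERE t1.s = 'ex:bob' AND t1.p = 'ex:knows' AND t2.p = 'ex:name'
""").fetchall()
# rows == [("Alice",)]
```

A realistic SPARQL query with ten patterns would mean a ten-way self-join, which is exactly why this layout does not scale.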
The real relational solutions that one can apply are:
- Property table: each class of objects goes into a different table (as in normalized relations)
- Vertically partitioned tables: build one two-column table per predicate, containing subjects and objects, ordered by subject
- Exhaustive indexing: keep the single triple table but build all the possible indexes upon it
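The vertically partitioned layout can be sketched in a few lines of Python (the data is hypothetical): one two-column (subject, object) table per predicate, sorted by subject, so a query touching one predicate scans only that predicate's table instead of the whole triple store.

```python
from collections import defaultdict

triples = [
    ("ex:alice", "rdf:type", "ex:Person"),
    ("ex:bob", "rdf:type", "ex:Person"),
    ("ex:bob", "ex:knows", "ex:alice"),
]

# One (subject, object) table per predicate, ordered by subject.
tables = defaultdict(list)
for s, p, o in triples:
    tables[p].append((s, o))
for p in tables:
    tables[p].sort()

# All subjects with rdf:type ex:Person — only the rdf:type table is touched:
persons = [s for s, o in tables["rdf:type"] if o == "ex:Person"]
# persons == ["ex:alice", "ex:bob"]
```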
However, typical solutions do not go through relational schemas at all. They work directly on the RDF graph, treating query answering as subgraph matching via graph isomorphism. Off-the-shelf algorithms, though, do not scale to the size of RDF graphs.
Typically, RDF graphs are not generic graphs: they often exhibit “star-shaped patterns”, where one core object is connected to several properties.
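A star-shaped pattern is simple enough to sketch directly (the vertices and labels below are invented): one center variable with several required outgoing edges, matched against per-vertex adjacency sets with a plain subset test.

```python
# Adjacency map: subject -> set of outgoing (predicate, object) edges.
adj = {
    "ex:alice": {("rdf:type", "ex:Person"), ("ex:livesIn", "ex:Milano")},
    "ex:bob":   {("rdf:type", "ex:Person"), ("ex:livesIn", "ex:Waterloo")},
}

# Star query: one center ?x that must have both of these edges.
star = {("rdf:type", "ex:Person"), ("ex:livesIn", "ex:Milano")}

# A vertex matches the star iff the required edges are a subset of its edges.
matches = [v for v, edges in adj.items() if star <= edges]
# matches == ["ex:alice"]
```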
Özsu proposes an encoding technique for optimizing query processing.
The idea is to build an adjacency table: for every vertex, collect all its adjacent nodes and edge labels, encode them with an n-gram technique (n = 3), map the result to a binary signature, and OR all the signatures together. If the query is encoded in the same way, a bitwise inclusion test identifies the set of nodes that can possibly match it. Querying is then a two-step process. The first step runs an inclusion query, which extracts all the data nodes whose signature includes a given query node's signature. To make this efficient, the signatures are organized in an S-tree, on which the inclusion query is applied. This yields a set of candidate nodes, which are then joined based on the predicate labels that connect the query nodes. To reduce the space of the join, one can apply pruning via VS-tree techniques.
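The signature idea can be sketched as follows. This is only an illustration under stated assumptions (64-bit signatures, Python's built-in `hash` as the bit-mapping function, made-up vertex names); the actual encoding in Özsu's work differs.

```python
def trigrams(label):
    """Break a label into n-grams with n = 3."""
    return {label[i:i + 3] for i in range(max(1, len(label) - 2))}

def signature(labels, bits=64):
    """OR together one bit per trigram of each adjacent label."""
    sig = 0
    for label in labels:
        for g in trigrams(label):
            sig |= 1 << (hash(g) % bits)
    return sig

# Data vertex signatures: encode the edge labels around each vertex.
data_sigs = {
    "v1": signature(["knows", "livesIn", "name"]),
    "v2": signature(["type"]),
}

# A query vertex requiring a "knows" edge, encoded the same way.
q = signature(["knows"])

# Filtering: keep vertices whose signature includes all the query's bits.
candidates = [v for v, sig in data_sigs.items() if q & sig == q]
# "v1" is always a candidate; hash collisions may add false positives.
```

Because bits are only ever OR-ed in, a true match can never be filtered out, but distinct trigrams can collide on the same bit, which is exactly why a verification step is still needed.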
The filtering result is complete but not sound, in the sense that no real match is lost but many false positives may remain. In a second step, you therefore verify that the extracted candidates actually match the query.
Aggregation obviously adds further complexity.
[Image: RDF graph or neuron connections?]