|V.S. Subrahmanian (Univ. of Maryland)|
As part of the Data Management in the Social Semantic Web workshop (DMSSW workshop) at the ASONAM 2012 conference in Istanbul, Turkey, V.S. Subrahmanian (University of Maryland) gave an interesting talk on efficient subgraph matching on (social) networks.
Queries are defined as graphs themselves, with some of the places defined as constants and some defined as variables.
The complexity of queries over graphs is high, due to the large number of joins to be performed even in case of fairly simple queries.
The size of the query is typically relatively small with respect to the entire dataset (network). The proposed approaches are useful for scale of at least tens of millions of nodes in the network.
How to work on disk
The mechanism implemented is called DOGMA index and applies an algorithm called K-merge.
The algorithm builds a hierarchical index where I put at most K nodes of the graph in each index item. For obtaining that, I merge together connected nodes.You can do that randomly or more intelligently by trying to minimizing connections between nodes in different index items.
|Example of DOGMA Index, where nodes of the original network (at the bottom) are merged in higher level representations in the level above (in this example, K = 4, since we have 4 nodes in each index position).|
I don’t want to build the index by partitioning the whole graph, because it’s painful for large graphs.
I start from a G0 graph, and I merge nodes until I get G1, G2, Gn graphs, each of them is more or less half the size of the previous, until Gn has K nodes or less. Then, I build the dogma index over Gn.
For the query, I can use a basic approach: identify the variable nodes that are immediately close to a constant node, and then find the possible satisfying values for those variables, starting from the constants. I can apply conditions considering distance constraints between constants and variables, as well as between candidate variable names. To allow this, I also save in every node of the index the distance of closest node in the other nodes.
How to work on the cloud
This approach has been also implemented in the cloud, through the so called COSI Architecture, assuming a cloud of k+1 computing nodes. The implementation of the edge cuts that generates the index must be very quick and produce fairly good cuts (but not necessarily optimal).
The image below lists some of the references to S.V. works on the topic.
|Some references to V.S. Subrahmanian works on subgraph matching.|