ExploreWeb workshop on exploration of (Semantic) Web data at ICWE 2011

Together with Piero Fraternali and Daniel Schwabe I organized a workshop at ICWE 2011 on Search, Exploration and Navigation of Web Data Sources, named ExploreWeb 2011.
It was a new challenge for us, because it was the first edition, but I can say it was quite a successful event.
We got 12 submissions and we accepted 7 of them, and furthermore we invited Soren Auer as a keynote speaker to start the day. The attendance was also very good, as we got 25+ people in the room for the whole day.
Here is a quick summary of the day.

Soren Auer: Exploration and other stages of the Linked Data Life Cycle

Soren Auer at exploreWeb 2011.

Soren Auer from the University of Leipzig opened the ExploreWeb workshop with a very nice keynote talk on the entire life cycle of Linked Data.
The talk was centered on the requirements imposed by the continuous growth of the Linked Open Data (LOD) cloud and on the life cycle associated with LD contents. This life cycle comprises the following phases:

  1. Extraction of LOD: extracting linked data is a challenge per se. For instance, in the case of the DBpedia extraction from Wikipedia, the issues to be considered include keeping the semantic version aligned with the user-generated one, while coping with the messy and incoherent “schemas” offered by Wikipedia infoboxes. To cover this aspect, a “Mapping Wiki” based on a higher-level ontology has been created for defining the mapping between labels in Wikipedia.
  2. Storage and Querying of LOD: the critical issue here is that it’s still 5 to 50 times slower than an RDBMS. On the other hand, it obviously grants increased flexibility (especially at the schema manipulation level). A new benchmark recently performed by Soren’s group provides new performance results for Virtuoso, Sesame, Jena, and BigOWLIM. The benchmark was performed on 25 frequent DBpedia queries and shows that Virtuoso is consistently twice as fast as its competitors, while Jena is confirmed as the worst-performing platform.
  3. Authoring of LOD: different approaches can be adopted, including Semantic Wikis (e.g., OntoWiki), in which users do not edit text but semantic descriptions built with forms. We can identify two main classes of semantic wikis: semantic text wikis and semantic data wikis. A different approach is adopted by the new RDFa Content Editor, which uses OpenCalais and other APIs to help annotate the text within a WYSIWYG environment.
  4. Linking LOD: approaches to linking can be automatic, semi-automatic (e.g., see the tools SILK and LIMES), or manual (e.g., see Sindice in UIs and Semantic Pingback).
  5. Evolution of LOD: the evolution of linked data is a critical problem, not yet fully addressed. The EvoPat project is a first attempt to formalize the problem and the solution, by defining a set of evolution patterns and anti-patterns. Some features are already integrated into OntoWiki.
  6. Exploration of LOD: challenging because of size, heterogeneity, and distributedness. Examples mentioned include spatial and faceted exploration of LinkedGeoData; Freebase, described as the best search assistant for linked data; Parallax and the neofonie faceted browser; and domain-specific exploration tools (relationship finder on RDF), visual query builders, …
  7. Visualization of LOD: on this, Soren highlighted that with the continuously growing size of LOD, (semantic) data visualization will become more and more important. He presented some preliminary approaches, but a lot of work still needs to be done in this field.

In the discussion and Q&A that followed the keynote, the hot topics were the performance benchmark and the authoring of LOD by end users versus expert/technical users.

Alessandro Bozzon: A Conceptual Framework for Linked Data Exploration

Alessandro discussed the motivation behind the problem of exploring and integrating linked data sources, and then described the Search Computing approach to linked data exploration, which applies the general-purpose SeCo framework to the specific needs of LOD.
More on this can be found on the Search Computing web site, including also a demo and a video.

Daniel Schwabe: Support for reusable explorations of Linked Data in the Semantic Web
Daniel Schwabe started his talk with some strong motivation statements.
One of the main benefits of linked data should be that data bring their own self-description.
However, if you work on it you may end up doing really dirty work on the data to make them linked.

When you go to exploration interfaces, the expectations of end users might be very different from what exploration tools built for tech-savvy users offer. That gap needs to be filled, and Rexplorator moves in that direction. Explorator was presented at the Linked Data workshop (LDOW) in Madrid in 2009. Its extension Rexplorator was demonstrated at ISWC 2010 and has now been presented extensively at the ExploreWeb workshop.
With it, you can compose functions, parametrize operators, and store and reuse “use cases”, with a query-by-example approach. The UI lets you think that you are dealing with resources and sets of resources, but actually the system is dealing only with triples and SPARQL queries.
A pretty interesting approach, which has something in common with the Search Computing one, and also features a great UI and expressive power. It also covers faceted search.
Rexplorator is an MVC-based application implemented in Ruby using the ActiveRDF DSL.
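
To give a rough feeling of what "sets of resources on top of triples" can mean, here is a minimal sketch in Python. It is not Rexplorator's code (which is written in Ruby on ActiveRDF): the triples, the helper names subjects_where and compose, and the ex: identifiers are all made up for illustration, and an in-memory set stands in for a SPARQL endpoint.

```python
# Toy triple store: (subject, predicate, object). All data is invented.
TRIPLES = {
    ("ex:alice", "rdf:type", "ex:Researcher"),
    ("ex:bob", "rdf:type", "ex:Researcher"),
    ("ex:carol", "rdf:type", "ex:Student"),
    ("ex:alice", "ex:worksOn", "ex:LinkedData"),
    ("ex:bob", "ex:worksOn", "ex:Databases"),
    ("ex:carol", "ex:worksOn", "ex:LinkedData"),
}


def subjects_where(predicate, obj, triples=TRIPLES):
    """Return the set of resources that have the given predicate/object pair."""
    return {s for (s, p, o) in triples if p == predicate and o == obj}


def compose(*operations):
    """Build a reusable, parameterized exploration out of simpler operations.

    Each operation maps a parameter dict to a set of resources;
    the composed exploration intersects all the resulting sets.
    """
    def exploration(**params):
        result_sets = [op(params) for op in operations]
        return set.intersection(*result_sets)
    return exploration


# A stored "use case": resources of a given type working on a given topic.
by_type_and_topic = compose(
    lambda p: subjects_where("rdf:type", p["rdf_type"]),
    lambda p: subjects_where("ex:worksOn", p["topic"]),
)

print(sorted(by_type_and_topic(rdf_type="ex:Researcher", topic="ex:LinkedData")))
```

The user only sees a named exploration with two parameters, while underneath everything is filtering over triples. The same stored exploration can be rerun with different parameter values, which is the spirit of the reusable "use cases" mentioned above.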

In-Young Ko

Han-Gyu Ko and In-Young Ko. Generation of Semantic Clouds based on Linked Data for Efficient Multimedia Semantic Annotations
The presentation started from the definition of the requirements of semantic cloud generation: the idea is to produce tag clouds and help people annotate multimedia contents (e.g., for IP-TV contents).
The requirements include being able to:

  • identify the optimal number of tag clouds
  • balance the size of the different clouds shown to the users
  • check the coherency between the clouds and avoid ambiguity of each cloud.

The proposed lifecycle includes three phases:

  1. locating the spotting points: with a context-aware search of linked data, starting from the more important and densely connected nodes. More general nodes are more likely to be selected
  2. selecting the relations to traverse: the aim here is to reduce the RDF graph to the set of relevant relations only
  3. identifying term similarity and clustering the tags.
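
To make the first two phases a bit more concrete, here is a minimal sketch in Python. It is my own simplification, not the authors' algorithm: the graph, the relation names and the functions spotting_point and collect_tags are invented for illustration, "most densely connected" is reduced to "highest number of outgoing edges", and the similarity and clustering phase is left out.

```python
from collections import deque

# Toy RDF-like graph: node -> list of (relation, neighbour). Invented data.
GRAPH = {
    "Football": [
        ("skos:narrower", "Goalkeeper"),
        ("skos:narrower", "Penalty"),
        ("ex:sponsoredBy", "AcmeCola"),
    ],
    "Goalkeeper": [("skos:related", "Penalty")],
    "Penalty": [],
    "AcmeCola": [("skos:related", "SoftDrink")],
    "SoftDrink": [],
}


def spotting_point(graph):
    """Phase 1 (simplified): pick the node with the most outgoing edges."""
    return max(graph, key=lambda node: len(graph[node]))


def collect_tags(graph, start, relevant_relations, max_depth=2):
    """Phase 2 (simplified): breadth-first traversal over relevant relations only."""
    seen = {start}
    queue = deque([(start, 0)])
    while queue:
        node, depth = queue.popleft()
        if depth == max_depth:
            continue  # do not expand nodes beyond the depth limit
        for relation, neighbour in graph.get(node, []):
            if relation in relevant_relations and neighbour not in seen:
                seen.add(neighbour)
                queue.append((neighbour, depth + 1))
    return seen


start = spotting_point(GRAPH)
tags = collect_tags(GRAPH, start, {"skos:narrower", "skos:related"})
print(start, sorted(tags))
```

Dropping the ex:sponsoredBy relation keeps the sponsor and everything reachable only through it out of the candidate tags, which is the point of reducing the graph to the relevant relations before building the clouds.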

Compared with simpler approaches for constructing clouds (e.g., based on rdf:type and SKOS parsing), this approach leads to better and more meaningful clouds of keywords.
The implemented system overlays the generated clouds on the IPTV screen and lets people select the tags.

Mamoun Abu Helou. Segmentation of Geo-Referenced Queries

This work aimed at manipulating natural language, multi-objective queries so as to split them into several simple single-aim queries.
The focus of the work was limited to geographical queries. It exploited GeoWordNet, YAGO, GeoNames, and the Google GeoCoder API for identifying the important geographical concepts in the query. Both instances (e.g., Louvre) and classes (e.g., museum) can be identified.
A benchmark over 250 queries shows promising results for the approach.
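
Just to illustrate the idea of splitting a multi-objective query and spotting geographical instances and classes, here is a toy sketch in Python. It is not the method presented in the paper: the two tiny word lists are hand-made stand-ins for the external resources, the function segment_query is hypothetical, and the splitting rule (commas and the words "and" / "then") is a deliberately naive assumption.

```python
import re

# Tiny hand-made gazetteer standing in for the external geo resources.
GEO_INSTANCES = {"louvre", "paris", "milan"}
GEO_CLASSES = {"museum", "hotel", "restaurant"}


def segment_query(query):
    """Split a query on simple conjunctions and tag geographic terms."""
    parts = re.split(r"\s*(?:,|\band\b|\bthen\b)\s*", query.lower())
    segments = []
    for part in parts:
        part = part.strip()
        if not part:
            continue  # skip empty pieces produced by adjacent separators
        words = re.findall(r"[a-z]+", part)
        segments.append({
            "text": part,
            "instances": [w for w in words if w in GEO_INSTANCES],
            "classes": [w for w in words if w in GEO_CLASSES],
        })
    return segments


for segment in segment_query("a hotel near the Louvre and a restaurant in Paris"):
    print(segment)
```

Each resulting segment is a simple single-aim query (one about a hotel near the Louvre, one about a restaurant in Paris) that could then be handled on its own.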

Peter Dolog. SimSpectrum: A Similarity Based Spectral Clustering Approach to Generate a Tag Cloud
Peter’s work addressed the specific problem of clustering within tag clouds. There are some problems in clouds: recent tags are overlooked because they have lower frequency; frequent ones are often useless; …
The presentation delved into the selection of the best algorithms for clustering tags.
The aim was to reduce the number of tags, pick the most relevant ones, and place semantically close terms near each other in the cloud. The approach was evaluated in terms of coverage, overlap and relevance between the queries and the generated clouds, in the medical field.

Matthias Keller. A Unified Approach for Modeling Navigation over Hierarchical, Linear and Networked Structures

This was a visionary presentation on the needs and possible directions for a navigational model for data structures.
Data structures are very diverse (trees, graphs, …), and extracting the hyperlink/access structure from the content structure is very difficult (basically there is no automatic transformation between the two). CMSs enable some of this, but with limited expressive power and difficult configuration.
The idea is then to model:

  • the content organization supporting different graph-based content structures
  • the description of the access structures
  • the relation between the content and the access structure

The authors propose a graphical notation for covering these requirements and define some navigation patterns using this language.

Rober Morales-Chaparro. Data-driven and User-driven Multidimensional Data Visualization
This work aims at automatically extracting a set of optimal visualizations of complex data, covering the entire lifecycle:

  1. the data model 
  2. the data mining
  3. the information model
  4. the visualization proposal engine
  5. the visualization model
  6. the code generation
  7. and the final generated application for the end user

To conclude, here is a simple tag cloud generated for the content discussed during the workshop:

To keep updated on my activities you can subscribe to the RSS feed of my blog or follow my twitter account (@MarcoBrambi).

Search Computing demonstration at WWW 2011, Hyderabad, India

Together with Alessandro Bozzon, I’ve presented a demonstration of the search computing exploratory search paradigm at WWW 2011.

The demonstrated scenario is in the real estate and job search field. Suppose that a user wants to find a new job matching a specific expertise in a certain city. Based on his findings, he also wants to search for housing opportunities in the nearby neighbourhoods. He then wants to check additional information on the quality of life in the area and on the availability of services (public transportation, schools for his children, and so on). The final decision will be based on a complex function of all these aspects. The figure below shows the graph of the existing and registered searchable concepts within this scenario. All these concepts are searched through third-party services.

 
Here is a short video with a summary of the demonstration:

Here you can see Alessandro at work, demonstrating the approach to a visitor (big prize if you guess who he is :) ):

Btw, if you are looking for some more exciting pictures I took in Hyderabad, India you can have a look at this Flickr set of pictures from Hyderabad (while at WWW 2011).



Our paper at WWW 2010

The paper by Bozzon, A., Brambilla, M., Ceri, S., Fraternali, P. on Liquid query: multi-domain exploratory search on the web has been published in the proceedings of the 19th international conference on World Wide Web (WWW conference 2010, Raleigh, NC, USA). ACM, New York, NY, USA, pp. 161-170.

Further details available here:

http://dbgroup.como.polimi.it/brambilla/node/131

New Book on Search Computing

The new Springer LNCS book (volume 5950)

Search Computing – Challenges and Directions

edited by Stefano Ceri and Marco Brambilla is in print.

A preview of the book (with interactive table of contents) is now available. Publication of the paper version of the book is scheduled for March 2010.

Invited Challenge Talk on Search Computing at ESEC/ ACM SIGSOFT FSE 2009

I have been invited to present a Challenge Talk at ESEC/FSE 2009, in Amsterdam on Search Computing. I presented the following talk on 28 August 2009 at Vrije University:

Title: Engineering Search Computing Applications: Vision and Challenges
Abstract: Search computing is a novel discipline whose goal is to answer complex, multi-domain queries. Such queries typically require combining in their results domain knowledge extracted from multiple Web resources; therefore, conventional crawling and indexing techniques, which look at individual Web pages, are not adequate for them. This talk will sketch the main characteristics of search computing and highlight how various classical computer science disciplines – including software engineering, Web engineering, service-oriented architectures, data management, and human-computer interaction – are challenged by the search computing approach.