Instrumenting Continuous Knowledge Extraction, Sharing, and Benchmarking

This is a contribution in response to the Call for Linked Research for the workshop at ESWC 2017 entitled Enabling Decentralised Scholarly Communication.

Authors: Marco Brambilla, Emanuele Della Valle, Andrea Mauri, Riccardo Tommasini.

Affiliation: Politecnico di Milano, DEIB, Data Science Lab. Milano, Italy.

You can also read and download the FULL ARTICLE IN PDF.
“Nanos gigantum humeris insidentes”
(Bernard of Chartres, 1115 ca.)

Introduction

Science aims at creating new knowledge on top of existing knowledge, through the observation of physical phenomena, their modeling, and their empirical validation. This combines the well-known motto “standing on the shoulders of giants” (attributed to Bernard of Chartres and subsequently rephrased by Isaac Newton) with the need to try and validate new experiments.
However, knowledge in the world continuously evolves, at a pace that cannot be tracked even by large crowdsourced bodies of knowledge such as Wikipedia. A large share of generated data is not currently analysed and consolidated into exploitable information and knowledge (Ackoff 1989). In particular, the process of ontological knowledge discovery tends to focus on the most popular items, those most often quoted or referenced, and is less effective at discovering less popular items, belonging to the so-called long tail, i.e. the portion of the entity distribution having fewer occurrences (Brambilla 2016).
This becomes a challenge for practitioners, enterprises, and scholars and researchers, who need to stay up to date with innovation and emerging facts. The scientific community also needs to make sure there is a structured and formal way to represent, store and access such knowledge, for instance as ontologies or linked data sources.
Our idea is to propose a vision towards a set of (possibly integrated) publicly available tools that can help scholars keep pace with evolving knowledge. This implies the capability of integrating informal sources, such as social networks, blogs, and user-generated content in general. One can conjecture that somewhere, within the massive content shared by people online, any low-frequency, emerging concept or fact has left some traces. The challenge is to detect such traces, assess their relevance and trustworthiness, and transform them into formalized knowledge (Stieglitz 2014). An appropriate set of tools that improves the effectiveness of knowledge extraction, storage, analysis, publishing and experimental benchmarking could be extremely beneficial for the entire research community across fields and interests.

Our Vision towards Continuous Knowledge Extraction and Publishing

We foresee a paradigm where knowledge seeds can be planted, and subsequently grow, finally leading to the generation and collection of new knowledge, as depicted in the exemplary process shown below: knowledge seeding (through types, context variables, and example instances), growing (for instance by exploring social media), and harvesting for extracting concepts (instances and types).

[Figure: the knowledge seeding, growing, and harvesting paradigm]

We advocate for a set of tools that, when implemented and integrated, would enable the following prospective scenario:

  • possibility of selecting any kind of source of raw data, independently of its format, type or semantics (spanning quantitative data, textual content, multimedia content), covering both data streams and pull-based data sources;
  • possibility of applying different data cleaning and data analysis pipelines to the different sources, in order to increase data quality and abstraction / aggregation;
  • possibility of integrating the selected sources;
  • possibility of running homogeneous knowledge extraction processes over the integrated sources;
  • possibility of publishing the results of the analysis and semantic enrichment as new and further (richer) data sources and streams, in a coherent, standard and semantic way.

This enables the generation of new sources, which in turn can be used in subsequent knowledge extraction processes of the same kind. The results of this process must be available at any stage, so that they can be shared to build an open, integrated and continuously evolving body of knowledge for research, innovation, and dissemination purposes.

A Preliminary Feasibility Perspective

Whilst beneficial and powerful, the vision we propose is far from being achieved today. However, we are convinced that it is not out of reach in the mid term. To give a hint of this, we report here our experience with the research, design and implementation of a few tools that point in the proposed direction:

  1. Social Knowledge Extractor (SKE) is a publicly available tool for discovering emerging knowledge by extracting it from social content. Once instrumented by experts through a very simple initialization, the tool is capable of finding emerging entities by means of a mixed syntactic-semantic method. The method uses seeds, i.e. prototypes of emerging entities provided by experts, for generating candidates; it then associates candidates with feature vectors, built from the terms occurring in their social content, and ranks the candidates by their distance from the centroid of the seeds, returning the top candidates as the result. The tool can run continuously or in periodic iterations, using the results as new seeds. Our research on this has been published in (Brambilla et al., 2017). A simplified implementation is currently available online for demo purposes at http://datascience.deib.polimi.it/social-knowledge/, and the code is available as open-source under an Apache 2.0 license on GitHub at https://github.com/DataSciencePolimi/social-knowledge-extractor.
  2. TripleWave is a tool for disseminating and exchanging RDF streams on the Web. For the purpose of processing information streams in real time and at Web scale, TripleWave integrates with RDF Stream Processing (RSP) and Stream Reasoning (SR), which are solutions that combine semantic technologies with stream and event processing techniques. In particular, it integrates with an existing ecosystem of solutions to query, reason over, and perform real-time processing on heterogeneous and distributed data streams. TripleWave can be fed with existing Web streams (e.g. Twitter and Wikipedia streams) or time-annotated RDF datasets (e.g. the Linked Sensor Data dataset), and it can be invoked through both pull- and push-based mechanisms, thus enabling RSP engines to automatically register and receive data from TripleWave. The tool has been described in (Mauri et al., 2016) and the code is available as open-source on GitHub at https://github.com/streamreasoning/TripleWave/.
  3. RSPlab enables the efficient design and execution of reproducible experiments, as well as the sharing of their results. It integrates two existing RSP benchmarks (LSBench and CityBench) and two RSP engines (C-SPARQL engine and CQELS). It provides a programmatic environment to: deploy RDF streams and RSP engines in the cloud; interact with them using TripleWave and RSP Services; continuously monitor their performance and collect statistics. RSPlab is released as open-source under an Apache 2.0 license, is currently under submission at ISWC – Resources Track, and is available on GitHub at https://github.com/streamreasoning/rsplab.
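
To make the ranking idea behind SKE (item 1 above) more concrete, here is a minimal sketch in Python. It is not the actual SKE implementation: the function names, the plain bag-of-words feature vectors, and the choice of cosine distance are all assumptions made for illustration, since the text above only states that candidates are ranked by their distance from the centroid of the seeds.

```python
from collections import Counter
import math


def term_vector(texts):
    """Bag-of-words vector (term -> relative frequency) for a list of texts.
    Assumption: plain whitespace tokenization, no semantic enrichment."""
    counts = Counter(word for text in texts for word in text.lower().split())
    total = sum(counts.values())
    return {term: n / total for term, n in counts.items()} if total else {}


def centroid(vectors):
    """Term-by-term average of a list of sparse vectors."""
    terms = set().union(*vectors)
    return {t: sum(v.get(t, 0.0) for v in vectors) / len(vectors) for t in terms}


def cosine_distance(a, b):
    """1 - cosine similarity; returns 1.0 when either vector is empty.
    Assumption: the source does not say which distance is used."""
    dot = sum(value * b.get(term, 0.0) for term, value in a.items())
    norm_a = math.sqrt(sum(x * x for x in a.values()))
    norm_b = math.sqrt(sum(x * x for x in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 1.0
    return 1.0 - dot / (norm_a * norm_b)


def rank_candidates(seed_texts, candidate_texts, top_k=3):
    """Rank candidates by distance from the seed centroid, closest first.
    Both arguments map an entity name to the list of texts mentioning it."""
    seed_centroid = centroid([term_vector(t) for t in seed_texts.values()])
    scored = [
        (cosine_distance(term_vector(texts), seed_centroid), name)
        for name, texts in candidate_texts.items()
    ]
    return [name for _, name in sorted(scored)[:top_k]]
```

With made-up fashion-related seeds, a candidate whose social content shares vocabulary with the seeds ranks above one that shares none; in a real setting the feature vectors would be much richer than raw term frequencies.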

Conclusions and Outlook

We believe that knowledge intake by scholars is going to become more and more time-consuming and expensive, due to the amount of knowledge that is being built and shared every day. We envision a comprehensive approach based on integrated tools for data collection, cleaning, integration, analysis and semantic representation. These tools can be run continuously to keep formalized knowledge bases aligned with the evolution of knowledge, with limited cost and high recall on the facts and concepts that emerge or decay. They do not need to be implemented by the same vendor or provider; we instead advocate for open-source publishing of all the implementations, as well as for the definition of an agreed-upon integration platform that allows them all to communicate appropriately.

References:

  • Russell L. Ackoff. From data to wisdom. Journal of Applied Systems Analysis 16, 3–9, 1989.

  • Marco Brambilla, Stefano Ceri, Florian Daniel, Emanuele Della Valle. On the quest for changing knowledge. In Proceedings of the Workshop on Data-Driven Innovation on the Web – DDI 16. ACM Press, 2016.

  • Stefan Stieglitz, Linh Dang-Xuan, Axel Bruns, Christoph Neuberger. Social Media Analytics. Business & Information Systems Engineering 6, 89–96. Springer Nature, 2014.

  • Marco Brambilla, Stefano Ceri, Emanuele Della Valle, Riccardo Volonterio, Felix Xavier Acero Salazar. Extracting Emerging Knowledge from Social Media. In Proceedings of the 26th International Conference on World Wide Web – WWW 17. ACM Press, 2017.

  • Andrea Mauri, Jean-Paul Calbimonte, Daniele Dell’Aglio, Marco Balduini, Marco Brambilla, Emanuele Della Valle, Karl Aberer. TripleWave: Spreading RDF Streams on the Web. In Lecture Notes in Computer Science, 140–149. Springer International Publishing, 2016.

Extracting Emerging Knowledge from Social Media

Today I presented our full paper titled “Extracting Emerging Knowledge from Social Media” at the WWW 2017 conference.

The work is based on a rather obvious assumption, i.e., that knowledge in the world continuously evolves, and that ontologies are largely incomplete with respect to low-frequency data, belonging to the so-called long tail.

Socially produced content is an excellent source for discovering emerging knowledge: it is huge, and immediately reflects the relevant changes which hide emerging entities.

In the paper we propose a method and a tool for discovering emerging entities by extracting them from social media.

Once instrumented by experts through a very simple initialization, the method is capable of finding emerging entities; we propose a mixed syntactic + semantic method. The method uses seeds, i.e. prototypes of emerging entities provided by experts, for generating candidates; it then associates candidates with feature vectors, built from the terms occurring in their social content, and ranks the candidates by their distance from the centroid of the seeds, returning the top candidates as the result.

The method can be continuously or periodically iterated, using the results as new seeds.
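
The iteration can be pictured with a small sketch. This is only an illustration under assumptions, not the code of the tool: `iterate_extraction` and `extract_step` are hypothetical names, and dropping already-known entities and stopping when nothing new is found are choices made here for the example.

```python
def iterate_extraction(seeds, extract_step, rounds=3):
    """Run extract_step repeatedly, feeding each round's results back in as seeds.

    extract_step is a hypothetical callable: it takes a set of seed names and
    returns an iterable of newly discovered entity names.
    Returns the list of new entities found in each round.
    """
    known = set(seeds)    # everything seen so far, to avoid rediscovering it
    current = set(seeds)  # seeds used for the next round
    history = []
    for _ in range(rounds):
        found = set(extract_step(current)) - known
        if not found:
            break  # assumption: stop early once a round yields nothing new
        history.append(sorted(found))
        known |= found
        current = found  # the results become the new seeds
    return history


# Toy stand-in for one extraction round: entities co-mentioned with the seeds.
co_mentions = {"a": ["b"], "b": ["c"], "c": ["a"]}


def toy_step(seeds):
    return [name for seed in seeds for name in co_mentions.get(seed, [])]
```

Calling `iterate_extraction(["a"], toy_step, rounds=5)` on the toy data discovers `b` in the first round and `c` in the second, then stops because the third round only leads back to the original seed.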

The PDF of the full paper presented at WWW 2017 is available online (open access with a Creative Commons license).

You can also check out the slides of my presentation on Slideshare.

A demo version of the tool is available online for free use, thanks also to our partners Dandelion and Microsoft Azure.

You can TRY THE TOOL NOW if you want.