Instrumenting Continuous Knowledge Extraction, Sharing, and Benchmarking

This is a contribution in response to the Call for Linked Research for the workshop at ESWC 2017 entitled Enabling Decentralised Scholarly Communication.

Authors: Marco Brambilla, Emanuele Della Valle, Andrea Mauri, Riccardo Tommasini.

Affiliation: Politecnico di Milano, DEIB, Data Science Lab. Milano, Italy.

You can also read and download the FULL ARTICLE IN PDF.
“Nanos gigantum humeris insidentes”
(Bernard of Chartres, 1115 ca.)


Science aims at creating new knowledge on top of existing knowledge, through the observation of physical phenomena, their modeling and their empirical validation. This combines the well-known motto “standing on the shoulders of giants” (attributed to Bernard of Chartres and subsequently rephrased by Isaac Newton) with the need to try and validate new experiments.
However, knowledge in the world continuously evolves, at a pace that cannot be traced even by large crowdsourced bodies of knowledge such as Wikipedia. A large share of generated data is not currently analysed and consolidated into exploitable information and knowledge (Ackoff 1989). In particular, the process of ontological knowledge discovery tends to focus on the most popular items, those which are most quoted or referenced, and is less effective in discovering less popular items, belonging to the so-called long tail, i.e. the portion of the entity distribution having fewer occurrences (Brambilla 2016).
This becomes a challenge for practitioners, enterprises and scholars / researchers, who need to stay up to date with innovation and emerging facts. The scientific community also needs to make sure there is a structured and formal way to represent, store and access such knowledge, for instance as ontologies or linked data sources.
Our idea is to propose a vision towards a set of (possibly integrated) publicly available tools that can help scholars keep pace with evolving knowledge. This implies the capability of integrating informal sources, such as social networks, blogs, and user-generated content in general. One can conjecture that somewhere, within the massive content shared by people online, any low-frequency, emerging concept or fact has left some traces. The challenge is to detect such traces, assess their relevance and trustworthiness, and transform them into formalized knowledge (Stieglitz 2014). An appropriate set of tools that can improve the effectiveness of knowledge extraction, storage, analysis, publishing and experimental benchmarking could be extremely beneficial for the entire research community across fields and interests.

Our Vision towards Continuous Knowledge Extraction and Publishing

We foresee a paradigm where knowledge seeds can be planted, and subsequently grow, finally leading to the generation and collection of new knowledge, as depicted in the exemplary process shown below: knowledge seeding (through types, context variables, and example instances), growing (for instance by exploring social media), and harvesting for extracting concepts (instances and types).

We advocate for a set of tools that, when implemented and integrated, make the following scenario possible:

  • possibility of selecting any kind of source of raw data, independently of its format, type or semantics (spanning quantitative data, textual content, multimedia content), covering both data streams and pull-based data sources;
  • possibility of applying different data cleaning and data analysis pipelines to the different sources, in order to increase data quality and abstraction / aggregation;
  • possibility of integrating the selected sources;
  • possibility of running homogeneous knowledge extraction processes over the integrated sources;
  • possibility of publishing the results of the analysis and semantic enrichment as new and further (richer) data sources and streams, in a coherent, standard and semantic way.

This enables generation of new sources which in turn can be used in subsequent knowledge extraction processes of the same kind. The results of this process must be available at any stage to be shared for building an open, integrated and continuously evolving knowledge for research, innovation, and dissemination purposes.

A Preliminary Feasibility Perspective

Whilst beneficial and powerful, the vision we propose is far from being achieved today. However, we are convinced that it is not out of reach in the mid term. To give a hint of this, we report here our experience with the research, design and implementation of a few tools that point in the proposed direction:

  1. Social Knowledge Extractor (SKE) is a publicly available tool for discovering emerging knowledge by extracting it from social content. Once instrumented by experts through very simple initialization, the tool is capable of finding emerging entities by means of a mixed syntactic-semantic method. The method uses seeds, i.e. prototypes of emerging entities provided by experts, for generating candidates; then it associates candidates with feature vectors, built from the terms occurring in their social content, and ranks the candidates by their distance from the centroid of the seeds, returning the top candidates as the result. The tool can run continuously or with periodic iterations, using the results as new seeds. Our research on this has been published in (Brambilla et al., 2017); a simplified implementation is currently available online for demo purposes, and the code is available as open source under an Apache 2.0 license on GitHub.
  2. TripleWave is a tool for disseminating and exchanging RDF streams on the Web. To process information streams in real time and at Web scale, TripleWave integrates with RDF Stream Processing (RSP) and Stream Reasoning (SR), which combine semantic technologies with stream and event processing techniques. In particular, it integrates with an existing ecosystem of solutions to query, reason and perform real-time processing over heterogeneous and distributed data streams. TripleWave can be fed with existing Web streams (e.g. Twitter and Wikipedia streams) or time-annotated RDF datasets (e.g. the Linked Sensor Data dataset), and it can be invoked through both pull- and push-based mechanisms, thus enabling RSP engines to automatically register and receive data from TripleWave. The tool has been described in (Mauri et al., 2016) and the code is available as open source on GitHub.
  3. RSPlab enables efficient design and execution of reproducible experiments, as well as sharing of the results. It integrates two existing RSP benchmarks (LSBench and CityBench) and two RSP engines (C-SPARQL engine and CQELS). It provides a programmatic environment to: deploy RDF streams and RSP engines in the cloud; interact with them using TripleWave and RSP Services; continuously monitor their performance and collect statistics. RSPlab is released as open source under an Apache 2.0 license, is currently under submission at the ISWC Resources Track, and is available on GitHub.
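
To make the ranking step of SKE (item 1 above) more concrete, here is a minimal sketch. It is not the published implementation: it assumes plain bag-of-words term frequencies as feature vectors and Euclidean distance from the seed centroid, and all function names and sample texts are hypothetical.

```python
import math
from collections import Counter


def term_vector(text):
    """Bag-of-words term frequencies for one entity's social content (assumed feature choice)."""
    return Counter(text.lower().split())


def centroid(vectors):
    """Component-wise mean of a list of sparse term vectors."""
    total = Counter()
    for vec in vectors:
        total.update(vec)
    return {term: count / len(vectors) for term, count in total.items()}


def distance(vec, center):
    """Euclidean distance between a sparse vector and the centroid (assumed metric)."""
    terms = set(vec) | set(center)
    return math.sqrt(sum((vec.get(t, 0) - center.get(t, 0)) ** 2 for t in terms))


def rank_candidates(seed_texts, candidate_texts, top_k=2):
    """Return the names of the top_k candidates closest to the centroid of the seeds."""
    center = centroid([term_vector(t) for t in seed_texts.values()])
    scored = sorted(
        (distance(term_vector(text), center), name)
        for name, text in candidate_texts.items()
    )
    return [name for _, name in scored[:top_k]]


# Hypothetical seeds (provided by experts) and candidates (found in social content)
seeds = {
    "designer_a": "fashion show milan runway collection",
    "designer_b": "runway collection fashion week milan",
}
candidates = {
    "designer_c": "new collection runway fashion milan",
    "football_club": "match goal league stadium",
    "designer_d": "fashion week atelier collection",
}
print(rank_candidates(seeds, candidates))
```

With this toy data the two fashion-related candidates are ranked ahead of the unrelated one, because their vocabulary lies closer to the centroid of the seeds.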

Conclusions and Outlook

We believe that knowledge intake by scholars is going to become more and more time consuming and expensive, due to the amount of knowledge that is being built and shared every day. We envision a comprehensive approach based on integrated tools for data collection, cleaning, integration, analysis and semantic representation, which can be run continuously to keep formalized knowledge bases aligned with the evolution of knowledge, with limited cost and high recall on the facts and concepts that emerge or decay. These tools do not need to be implemented by the same vendor or provider; we instead advocate for open-source publishing of all the implementations, as well as for the definition of an agreed-upon integration platform that allows them all to interoperate appropriately.


  • Russell L. Ackoff. From data to wisdom. Journal of Applied Systems Analysis 16, 3–9, 1989.

  • Marco Brambilla, Stefano Ceri, Florian Daniel, Emanuele Della Valle. On the quest for changing knowledge. In Proceedings of the Workshop on Data-Driven Innovation on the Web – DDI 16. ACM Press, 2016.

  • Stefan Stieglitz, Linh Dang-Xuan, Axel Bruns, Christoph Neuberger. Social Media Analytics. Business & Information Systems Engineering 6, 89–96. Springer Nature, 2014.

  • Marco Brambilla, Stefano Ceri, Emanuele Della Valle, Riccardo Volonterio, Felix Xavier Acero Salazar. Extracting Emerging Knowledge from Social Media. In Proceedings of the 26th International Conference on World Wide Web – WWW 17. ACM Press, 2017.

  • Andrea Mauri, Jean-Paul Calbimonte, Daniele Dell’Aglio, Marco Balduini, Marco Brambilla, Emanuele Della Valle, Karl Aberer. TripleWave: Spreading RDF Streams on the Web. In Lecture Notes in Computer Science, 140–149. Springer International Publishing, 2016.

Model-driven Development of User Interfaces for IoT via Domain-specific Components & Patterns

This is the summary of a joint contribution with Eric Umuhoza to ICEIS 2017 on Model-driven Development of User Interfaces for IoT via Domain-specific Components & Patterns.
Internet of Things technologies and applications are evolving and continuously gaining traction in all fields and environments, including homes, cities, services, industry and commercial enterprises. However, many problems still need to be addressed.
For instance, the IoT vision is mainly focused on the technological and infrastructure aspects, and on the management and analysis of the huge amount of generated data, while so far the development of front-ends and user interfaces for IoT has not played a relevant role in research.
In contrast, we believe that user interfaces in the IoT ecosystem can play a key role in the acceptance of solutions by final adopters.
In this paper we present a model-driven approach to the design of IoT interfaces, by defining a specific visual design language and design patterns for IoT applications, and we show them at work. The language we propose is defined as an extension of the OMG standard language called IFML.

The slides of this talk are available online on Slideshare as usual.

Extracting Emerging Knowledge from Social Media

Today I presented our full paper titled “Extracting Emerging Knowledge from Social Media” at the WWW 2017 conference.

The work is based on a rather obvious assumption, i.e., that knowledge in the world continuously evolves, and that ontologies are largely incomplete as regards low-frequency data, belonging to the so-called long tail.

Socially produced content is an excellent source for discovering emerging knowledge: it is huge, and immediately reflects the relevant changes which hide emerging entities.

In the paper we propose a method and a tool for discovering emerging entities by extracting them from social media.

Once instrumented by experts through very simple initialization, the method is capable of finding emerging entities; we propose a mixed syntactic + semantic method. The method uses seeds, i.e. prototypes of emerging entities provided by experts, for generating candidates; then, it associates candidates to feature vectors, built by using terms occurring in their social content, and then ranks the candidates by using their distance from the centroid of seeds, returning the top candidates as result.

The method can be continuously or periodically iterated, using the results as new seeds.
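
The iteration idea can be illustrated with a small sketch. Note that this is only an assumption-laden toy: the scoring function below is a stand-in (Jaccard overlap between a candidate's terms and the pooled seed vocabulary), not the centroid-distance ranking of the paper, and all names and data are hypothetical. The point is to show how promoting results to seeds lets later rounds reach entities that the initial seeds alone would miss.

```python
def overlap_score(terms, seed_terms):
    """Stand-in relevance score: Jaccard overlap with the pooled seed vocabulary."""
    return len(terms & seed_terms) / len(terms | seed_terms)


def iterate_extraction(seeds, pool, rounds=2, per_round=1):
    """Repeatedly promote the best-scoring candidates to seeds.

    seeds and pool map entity names to the set of terms found in their social content.
    Returns the discovered entity names in the order in which they were promoted.
    """
    seeds = dict(seeds)  # work on copies, leave the inputs untouched
    pool = dict(pool)
    discovered = []
    for _ in range(rounds):
        if not pool:
            break
        seed_terms = set().union(*seeds.values())
        # Best score first; ties are broken alphabetically for determinism
        ranked = sorted(pool, key=lambda name: (-overlap_score(pool[name], seed_terms), name))
        for name in ranked[:per_round]:
            seeds[name] = pool.pop(name)  # the result becomes a new seed
            discovered.append(name)
    return discovered


# Hypothetical data
seeds = {"s1": {"fashion", "runway", "milan"}}
pool = {
    "c1": {"fashion", "runway", "atelier"},
    "c2": {"atelier", "couture", "paris"},
    "c3": {"football", "league"},
}
print(iterate_extraction(seeds, pool))
```

In this toy run, c2 shares no term with the initial seed and only becomes reachable in the second round, once c1 has been promoted and has contributed the term "atelier" to the seed vocabulary.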

The PDF of the full paper presented at WWW 2017 is available online (open access with Creative Commons license).

You can also check out the slides of my presentation on Slideshare.

A demo version of the tool is available online for free use, thanks also to our partners Dandelion and Microsoft Azure.

You can TRY THE TOOL NOW if you want.

Spark-based Big Data Analysis of Semantic IFML Models and Web Logs for Enhanced User Behavior Analytics

I’d like to report on our demonstration paper at WWW 2017, focusing on Spark-based Big Data Analysis of Semantic IFML Models and Web Logs for Enhanced User Behavior Analytics.

The motivation of the work is that no approaches exist for merging web log analysis and statistics with information about the Web application structure, content and semantics. Indeed, basic Web analytics tools are widespread and provide statistics about Web site navigation at the syntactic level only: they analyze the user interaction at page level in terms of page views, entry and landing pages, page views per visit, and so on. Unfortunately, those tools provide precise statistics neither about the content and semantics of the visited pages, nor about the actual reactions of users to the actual content (instances) they are shown.

With our work we demonstrate the advantages of combining Web application models with runtime navigation logs, with the aim of deepening the understanding of users’ behaviour.

We propose a model-driven approach that combines user interaction modeling (based on the IFML standard), full code generation of the designed application, user tracking at runtime through logging of runtime component execution and user activities, integration with page content details, generation of integrated schema-less data streams, and application of large-scale analytics and visualization tools for big data, by applying data visualization techniques that build direct representation of statistics on the IFML visual models of the Web application.
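
To give a flavour of what combining logs with model information means, here is a minimal sketch in plain Python (the actual work runs on Spark; plain Python is used here only to keep the example self-contained). The record layout, the component identifiers and the field names are hypothetical and are not taken from the paper.

```python
from collections import Counter

# Hypothetical excerpt of model information: view components of the
# application model and the content entity each of them publishes
model_components = {
    "cmp_12": {"page": "ProductDetails", "entity": "Product"},
    "cmp_07": {"page": "Home", "entity": "Category"},
}

# Hypothetical runtime log records, one per component execution
log_records = [
    {"user": "u1", "component": "cmp_07", "instance": "shoes"},
    {"user": "u1", "component": "cmp_12", "instance": "sneaker-42"},
    {"user": "u2", "component": "cmp_12", "instance": "sneaker-42"},
    {"user": "u2", "component": "cmp_12", "instance": "boot-7"},
]


def enrich(records, components):
    """Attach model-level information (page, entity) to each log record."""
    for rec in records:
        meta = components.get(rec["component"], {})
        yield {**rec, **meta}


def views_per_instance(records, components):
    """Count views per (entity, instance) pair: a content-level statistic
    that page-level analytics alone would not provide."""
    return Counter(
        (rec.get("entity"), rec["instance"]) for rec in enrich(records, components)
    )


stats = views_per_instance(log_records, model_components)
print(stats.most_common(1))
```

The join key here is the component identifier; once the records carry model information, the same counts could be projected back onto the visual model, which is the kind of visualization the approach aims at.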

The paper describing the approach is available in the WWW 2017 proceedings.

The video of the demo is available on YouTube.

Social Media Behaviour during Live Events: the Milano Fashion Week #MFW case

Social media are becoming more and more important in the context of live events, such as fairs, exhibits, festivals, concerts, and so on, as they play an essential role in communicating them to fans, interest groups, and the general population. These kinds of events are geo-localized within a city or territory and are scheduled within a public calendar.

Together with the people in the Fashion in Process group of Politecnico di Milano, we studied the impact on social media of a specific scenario, the Milano Fashion Week (MFW), which is an important event in Milano for the whole fashion business.

We presented this work at the Location and the Web workshop co-located with the WWW 2017 Conference in Perth, Australia.

We focus on how social content about the event spreads in space. We build different clusters of fashion brands, characterize several features of their propagation in space, and correlate them with the popularity of the brand and with temporal propagation.

We show that the clusters along the space, time and popularity dimensions are only loosely correlated, and therefore trying to understand the dynamics of the events based on popularity aspects alone would not be appropriate.

The paper is available as an open access PDF on the WWW 2017 Conference web site. You can download it here.

The PowerPoint presentation is available on SlideShare.

Data Science for Good City Life

On March 10, 2017 we hosted a seminar by Daniele Quercia in the Como Campus of Politecnico di Milano, on the topic:

Good City Life

Daniele Quercia

Daniele Quercia leads the Social Dynamics group at Bell Labs in Cambridge. He has been named one of Fortune magazine’s 2014 Data All-Stars, and spoke about “happy maps” at TED. His research focuses on urban informatics and has received best paper awards from Ubicomp 2014 and from ICWSM 2015, and an honourable mention from ICWSM 2013. He was Research Scientist at Yahoo Labs, a Horizon senior researcher at the University of Cambridge, and Postdoctoral Associate at the department of Urban Studies and Planning at MIT. He received his PhD from UC London. His thesis was sponsored by Microsoft Research and was nominated for BCS Best British PhD dissertation in Computer Science.

His presentation will contrast the corporate smart-city rhetoric about efficiency, predictability, and security with a different perspective on cities, which I find very inspiring and visionary.

“You’ll get to work on time; no queue when you go shopping, and you are safe because of CCTV cameras around you”. Well, all these things make a city acceptable, but they don’t make a city great.


Daniele is launching – a global group of like-minded people who are passionate about building technologies whose focus is not necessarily to create a smart city but to give a good life to city dwellers. The future of the city is, first and foremost, about people, and those people are increasingly networked. We will see how a creative use of network-generated data can tackle hitherto unanswered research questions. Can we rethink existing mapping tools [happy-maps]? Is it possible to capture smellscapes of entire cities and celebrate good odors [smelly-maps]? And soundscapes [chatty-maps]?

The complete video of the seminar was streamed live on YouTube and is now available online.

The seminar was open to the public and hosted at the Polo Regionale di Como headquarters of Politecnico di Milano, located in Via Anzani 42, III floor, Como.

You can also download the Good City Life flyer.

When a Smart City gets Personal

When people talk about smart cities, the tendency is to think about them in a technology-oriented or sociology-oriented manner.

However, smart cities are the places where we now live and work every day.

Here is a very broad perspective (in Italian) on the experience of big data analysis and smart city instrumentation for the town of Como, in Italy: an account of how phone calls, mobility data, social media and people counters can contribute to making and evaluating decisions.


You can read it on my Medium channel.


The role of Big Data in Banks

I was listening to R. Martin Chavez, Goldman Sachs deputy CFO, just last month at the ComputeFest 2017 event, more precisely the Symposium on the Future of Computation in Science and Engineering on “Data, Dollars, and Algorithms: The Computational Economy”, held at Harvard on Thursday, January 19, 2017.

His claim was that

Banks are essentially API providers.

The entire structure and infrastructure of Goldman Sachs is being reorganized around this idea. His case is that you should not compare a bank with a shop or store; you should compare it with Google. Just imagine that every time you want to search on Google you need to get in touch with (i.e., make a phone call or submit a request to) some Google employee, who at some point comes back to you with the result. Nonsense, right? Well, this is what actually happens with banks. It was happening with consumer-oriented banks before online banking, and it’s still largely happening for business banks.

But this is going to change. The amount of data and the speed and volume of financial transactions don’t allow that any more.

Banks are actually among the richest organizations (not just in terms of money, but also in terms of data ownership). But they are also craving further “less official” big data sources.

Juri Marcucci: Importance of Big Data for Central (National) Banks.

Today at the ISTAT National Big Data Committee meeting in Rome, Juri Marcucci from Bank of Italy discussed their research activity in integration of Google Trends information in their financial predictive analytics.

Google Trends provides insights into user interests in general, expressed as the probability that a random user is going to search for a particular keyword (normalized and scaled, also with geographical detail down to city level).
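
As a rough illustration of what “normalized and scaled” means here, the sketch below turns raw search counts into an index of this kind. It assumes the commonly documented scheme (each data point divided by the total number of searches in the same period, then rescaled so that the peak equals 100); the function name and the numbers are hypothetical, and this is not the pipeline used by Bank of Italy.

```python
def trends_index(keyword_counts, total_counts):
    """Relative search interest per period, rescaled so that the peak is 100.

    keyword_counts[i]: searches for the keyword in period i (hypothetical input)
    total_counts[i]:   all searches in period i, used for normalization
    """
    shares = [k / t for k, t in zip(keyword_counts, total_counts)]
    peak = max(shares)
    if peak == 0:
        return [0 for _ in shares]
    return [round(100 * share / peak) for share in shares]


# Hypothetical monthly counts: raw volume grows in month 2 and drops in month 3,
# but the relative interest in months 2 and 3 is the same
keyword_counts = [50, 120, 90]
total_counts = [10_000, 12_000, 9_000]
print(trends_index(keyword_counts, total_counts))
```

The normalization is what makes periods (or regions) with very different overall search volumes comparable.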

Bank of Italy is using Google Trends data to complement their prediction of unemployment rates in the short and mid term. It’s definitely a big challenge, but preliminary results are promising in terms of confidence in the obtained models. More details are available in this paper.

Paolo Giudici from University of Pavia showed how one can correlate the risk of bank defaults with their exposure on Twitter:

Paolo Giudici: bank risk contagion based (also) on Twitter data.

Obviously, all this must take into account the bias of the sources and the quality of the data collected. This was also pointed out by Paolo Giudici from University of Pavia. Assessment of the “trustability” of online sources is crucial. In their research, they defined the T-index for Twitter accounts in much the same way as academics define the h-index for the relevance of publications, as reported in the photographed slide below.

Paolo Giudici: T-index describing the quality of Twitter authors in finance.
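
For readers not familiar with the h-index, the following sketch shows the standard h-index computation applied to per-tweet counts. The exact definition of the T-index is the one on the slide and is not reproduced here; applying the h-index rule to retweet counts is my assumption for illustration purposes, and the function name and data are hypothetical.

```python
def h_style_index(counts):
    """Largest h such that at least h items have a count of at least h each."""
    ranked = sorted(counts, reverse=True)
    h = 0
    for position, value in enumerate(ranked, start=1):
        if value >= position:
            h = position
        else:
            break
    return h


# Hypothetical retweet counts for the tweets of one account
retweets_per_tweet = [10, 8, 5, 4, 3, 0]
print(h_style_index(retweets_per_tweet))  # prints 4
```

As with publications, an index of this kind rewards accounts that are consistently picked up by others, rather than accounts with a single viral tweet.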

It’s very interesting to see how creative the use of (non-traditional, web-based) big data is becoming, in very diverse fields, including very traditional ones like macroeconomics and finance.

And once again, I think the biggest challenges and opportunities come from the fusion of multiple data sources together: mobile phones, financial tracks, web searches, online news, social networks, and official statistics.

This is also the path that ISTAT (the official institute for Italian statistics) is pursuing. For instance, in the calculation of official national inflation rates, web scraping techniques (for e-commerce prices) covering more than 40,000 product prices are integrated in the process too.



The Dawn of a new Digital Renaissance in Cultural Heritage

Fluxedo joined forces with the Observatory of Digital Innovation in Arts & Culture Heritage (Osservatorio per l’innovazione digitale nei beni e attività culturali) by the School of Management (MIP) of Politecnico di Milano, for covering the social media analytics of Italian and international museums.

The results of the work have been presented during an event on January 19th, 2017 hosted by Piccolo Teatro di Milano, which was very successful.

The live dashboard of the SocialOmeters analysis on the museums is available here:


A summary of the event through its social media content, generated via Storify, is available here.


The official hashtag of the event, #OBAC17, became a trending topic on Twitter in Italy, with 579 tweets, 187 users, around 600 likes and retweets, and a potential audience of 2.2 million users.

The event had huge visibility in the national media, as reported in this press review:

1.      La rivoluzione dei musei online. Il primato di Triennale e Pinacoteca – read Il Corriere della Sera Milano

2.      L’innovazione prolifera (ma fatica) – read Il Sole 24 Ore Nòva

3.      I musei italiani e la digitalizzazione: il punto del Politecnico di Milano – read Advertiser

4.      Osservatorio Politecnico, musei social ma con pochi servizi digitali – read Arte Magazine

5.      Arte & Innovazione. Musei italiani sempre più social (52%) e virtuali (20%) – read Corriere del Web

6.      Boom di visitatori nei musei ma è flop dei servizi digitali – read Il Sole 24 Ore Blog

7.      Capitolini, comunali e Maxxi di Roma tra i musei più popolari sui social network – read La Repubblica Roma

8.      La pagina Facebook della Reggia di Venaria è la più apprezzata d’Italia con oltre 166 mila “like” – read La Stampa Torino

9.      Il 52% dei musei italiani è social ma i servizi digitali per la fruizione delle opere sono limitati. Un’analisi dell’Osservatorio Innovazione Digitale nei Beni e Attività Culturali – read Lombard Street

10.  Musei sempre più social, ecco i più cliccati – read TTG Italia

11.  Tanta cultura, poco digitale: solo il 52% dei musei italiani è sui social e il 43% non ha ancora un sito – read Vodafone News

12.  Tra Twitter e Instagram, 52% musei italiani punta sui social media – read ADNKronos

13.  Musei strizzano occhi a social, ma strada è lunga – read ANSA ViaggiArt

14.  Oltre la metà dei musei italiani è online e sui social, ma i servizi digitali evoluti e quelli on site sono ancora scarsi – read Brand News

15.  Il 52% dei musei italiani è social ma i servizi digitali per la fruizione delle opere sono limitati – read DailyNet

16.  Musei italiani sempre più social, ma i servizi digitali sono limitati – read Diario Innovazione

17.  Musei e social network – read Inside Art

18.  Musei Vaticani e Maxxi tra i più social d’Italia – read Il Messaggero

19.  I musei si fanno spazio sui social – read Italia Oggi

20.  Musei italiani social, ma non troppo – read La Repubblica

21.  Capitolini e Maxxi da record sui social – read La Repubblica Roma

22.  Musei lucani poco social e poco visitabili in web – read La Siritide

23.  Venaria regina dei social – read La Stampa Torino

24.  In calo nel 2016 il numero degli ingressi nei luoghi di cultura in Basilicata – read Oltre

25.  Musei sempre più social, ma poco interattivi – read QN ILGIORNO – il Resto del Carlino – LA NAZIONE

26.  Dal marketing alle guide per disabili. Cultura, boom dell’industria digitale – read QN ILGIORNO – il Resto del Carlino – LA NAZIONE

27.  Musei romani sempre più social – read Radio Colonna

28.  Social network: il Maxxi tra i musei più popolari – read Roma2Oggi

29.  Musei italiani sempre più social, ma la strada è ancora lunga – read Travel No Stop

30.  Musei italiani sempre più social e virtuali – read Uomini & Donne della Comunicazione

31.  Tra Twitter e Instagram, un museo su due in Italia scommette sui …  – read Italia per Me

32.  Beni culturali, musei lucani poco social – read La Nuova del Sud

33.  Fb, Instagram e Twitter: i musei italiani puntano sui social ma non basta – read  La Repubblica

34.  Tra Twitter e Instagram, un museo su due in Italia scommette sui social media – read La Stampa

35.  Il 52% dei musei italiani è social ma i servizi digitali per la fruizione delle opere sono limitati – read Sesto Potere

36.  Il 52% dei musei italiani è social, ma la fruizione delle opere digital è limitata – read Il Sole 24 Ore

The Harvard-Politecnico Joint Program on Data Science in full bloom

After months of preparation, here we are.

This week we kicked off the second edition of the DataShack program on Data Science that brings together interdisciplinary teams of data science, software engineering & computer science, and design students from Harvard (Institute of Applied Computational Science) and Politecnico di Milano (faculties of Engineering and Design).

The students will address big data extraction, analysis, and visualization problems provided by two real-world stakeholders in Italy: the Como city municipality and Moleskine.

The Moleskine Data-Shack project will explore the popularity and success of different Moleskine products co-branded with other famous brands (also known as special editions) and launched in specific periods in time. The main field of analysis is the impact that different products have on social media channels. Social media analysis will then be correlated with product distribution and sales performance data, along multiple dimensions (temporal, geographical, etc.) and product features.

The Como project consists of collecting and analyzing data about the city and the way people live and move within it, by integrating multiple and diverse data sources. The problems to be addressed may include providing estimates of human density and movements within the city, predicting the impact of hypothetical future events, determining the best allocation of sensors in the streets, and defining optimal user experience and interaction for exploring the city data.

The kickoff meeting of the DataShack 2017 projects, at Harvard. Faculty members Pavlos Protopapas, Stefano Ceri, Paola Bertola, Paolo Ciuccarelli and myself (Marco Brambilla) are involved in the program.

The teams have been formed, and the problems assigned. I really look forward to advising the groups in the coming months and seeing the results that will come out. The students have already shown commitment and engagement. I’m confident that they will be excellent and innovative this year!

For further activities on data science within our group you can refer to the DataScience Lab site, Socialometers, and Urbanscope.