Here I collect my write-ups of the first three keynote speeches of the conference:
Over one billion cars interact with each other on the road every day. Each driver has their own driving style, which can affect safety, fuel economy and road congestion. Knowledge about a driver's style could be used to encourage “better” driving behaviour through immediate feedback while driving, or by scaling auto insurance rates based on the aggressiveness of the driving style.
In this work we report on our study of driving behaviour profiling based on unsupervised data mining methods. The main goal is to detect the different driving behaviours, and thus to cluster drivers with similar behaviour. This paves the way to new business models related to the driving sector, such as Pay-How-You-Drive insurance policies and car rentals. Here is the presentation I gave on this topic:
Driver behavioural characteristics are studied by collecting information from GPS sensors on the cars and by applying three different analysis approaches (DP-means, Hidden Markov Models, and Behavioural Topic Extraction) to the contextual scene detection problem on car trips, in order to detect different behaviours along each trip. Drivers are then clustered into profiles of similar behaviour, and the results are compared against a human-defined ground truth on driver classification.
The proposed framework is tested on a real dataset of sampled car signals. While the approaches show significant differences in trip-segment classification, the coherence of the final driver clustering results is surprisingly high.
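To give a concrete flavour of the first approach: DP-means behaves like k-means but spawns a new cluster whenever a point falls farther than a penalty threshold from every existing centroid, so the number of behaviours need not be fixed in advance. Below is a minimal sketch on toy 2-D data; the features, values and the threshold are all hypothetical and stand in for the GPS-derived signals used in the study:

```python
import numpy as np

def dp_means(X, lam, n_iter=20):
    """DP-means: assign each point to its nearest centroid, but open a new
    cluster whenever the squared distance exceeds the penalty lam."""
    centroids = [X.mean(axis=0)]                     # start from one cluster
    labels = np.zeros(len(X), dtype=int)
    for _ in range(n_iter):
        for i, x in enumerate(X):
            d2 = [np.sum((x - c) ** 2) for c in centroids]
            j = int(np.argmin(d2))
            if d2[j] > lam:                          # too far from everything:
                centroids.append(x.copy())           # spawn a new cluster
                j = len(centroids) - 1
            labels[i] = j
        # recompute centroids (empty clusters keep their previous position)
        centroids = [X[labels == j].mean(axis=0) if np.any(labels == j)
                     else centroids[j] for j in range(len(centroids))]
    return labels

# Two well-separated "behaviours" in a toy 2-D feature space.
X = np.array([[0.0, 0.0], [0.5, 0.0], [0.0, 0.5],
              [10.0, 10.0], [10.5, 10.0], [10.0, 10.5]])
labels = dp_means(X, lam=4.0)
print(labels)   # first three points end up in one cluster, last three in another
```

With a larger lam the algorithm collapses everything into one cluster; with a smaller one it fragments the trips, which is exactly the knob that controls how fine-grained the detected behaviours are.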
This work has been published at the 4th IEEE Big Data Conference, held in Boston in December 2017. If you are interested in further contributions at the conference, here you can find my summaries of the keynote speeches on human-in-the-loop machine learning and on increasing human perception through text mining.
In the context of Domain-Specific Modeling Language (DSML) development, the involvement of end-users is crucial to assure that the resulting language satisfies their needs.
In our paper, presented at SLE 2017 in Vancouver, Canada, on October 24th within the SPLASH Conference context, we discuss how crowdsourcing tasks can be exploited to assist in domain-specific language definition processes. This is in line with the vision towards the cognification of model-driven engineering.
The slides are available on slideshare:
Indeed, crowdsourcing has emerged as a novel paradigm where humans are employed to perform computational and information collection tasks. In language design, by relying on the crowd, it is possible to show an early version of the language to a wider spectrum of users, thus increasing the validation scope and eventually promoting its acceptance and adoption.
We propose a systematic (and automatic) method for creating crowdsourcing campaigns aimed at refining the graphical notation of DSMLs. The method defines a set of steps to identify, create and order the questions for the crowd. As a result, developers are provided with a set of notation choices that best fit end-users’ needs. We also report on an experiment validating the approach.
Improving the quality of the language notation may dramatically improve acceptance and adoption, as well as the way people use your notation and the associated tools.
Essentially, our idea is to submit to the crowd a set of questions regarding the concrete syntax of visual modeling languages, and to collect opinions. Based on different strategies, we generate an optimal notation and then check how good it is.
In the paper we also validate the approach and experiment with it in a practical use case, namely by studying some variations of the BPMN modeling language.
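As a toy illustration of the aggregation step, a simple majority strategy over the crowd's answers might look as follows (question ids, alternatives and vote counts are hypothetical; the paper compares several, more refined strategies):

```python
from collections import Counter

# Hypothetical crowd answers: one notation question per concrete-syntax
# element, each with the alternative chosen by each worker.
answers = {
    "task-shape":   ["rounded-rect", "rounded-rect", "circle", "rounded-rect"],
    "event-colour": ["red", "orange", "red"],
}

def majority_notation(answers):
    """Pick, for each notation question, the alternative most voted by the crowd."""
    return {q: Counter(votes).most_common(1)[0][0] for q, votes in answers.items()}

print(majority_notation(answers))
# {'task-shape': 'rounded-rect', 'event-colour': 'red'}
```

The output is exactly the "set of notation choices that best fit end-users' needs" mentioned above, under the simplest possible aggregation strategy.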
The full paper can be found here: https://dl.acm.org/citation.cfm?doid=3136014.3136033. The paper is titled: “Better Call the Crowd: Using Crowdsourcing to Shape the Notation of Domain-Specific Languages” and was co-authored by Marco Brambilla, Jordi Cabot, Javier Luis Cánovas Izquierdo, and Andrea Mauri.
You can also access the Web version on Jordi Cabot blog.
The artifacts described in this paper are also referenced on findresearch.org, namely referring to the following materials:
- Overview on crowdsearcher site: http://crowdsearcher.deib.polimi.it/casestudies/
- Code of the UI (including the configuration of the tasks and of the modeling alternatives analyzed): https://github.com/janez87/tef
- Code of the crowdsourcing platform: https://github.com/janez87/crowdsearcher/tree/modeling-new
- Results summary: http://crowdsearcher.deib.polimi.it/casestudies/crowd-experiment-results-sle2017.xlsx
For centuries, science (in German “Wissenschaft”) has aimed to create (“schaften”) new knowledge (“Wissen”) from the observation of physical phenomena, their modelling, and empirical validation.
Recently, a new source of knowledge has emerged: not (only) the physical world any more, but the virtual world, namely the Web with its ever-growing stream of data materialized in the form of social network chatter, content produced on demand by crowds of people, and messages exchanged among interlinked devices in the Internet of Things. The knowledge we may find there can be dispersed, informal, contradictory, unsubstantiated and ephemeral today, while already tomorrow it may be commonly accepted.
The challenge is once again to capture and create consolidated knowledge that is new, has not been formalized yet in existing knowledge bases, and is buried inside a big, moving target (the live stream of online data).
The myth is that existing tools (spanning fields like the semantic web, machine learning, statistics, NLP, and so on) suffice for the objective. While this is still far from true, some existing approaches are actually addressing the problem and provide preliminary insights into what successful attempts may lead to.
I gave a few keynote speeches on this matter (at ICEIS, KDWEB, …), and I also use this argument as a motivating class in academic courses, to let students understand how crucial it is to focus on the problems related to big data modeling and analysis. The talk, reported in the slides below, explores, through real industrial use cases, the mixed realistic-utopian domain of data analysis and knowledge extraction, and reports on some tools and cases where the digital and physical worlds have been brought together for a better understanding of our society.
The presentation is available on SlideShare and is reported here below:
In our most recent study, we analysed user behaviour and profiles, as well as the textual and visual content posted on social media for art and culture events.
The corresponding paper has been presented at CD-MAKE 2017 in Reggio Calabria on August 31st, 2017.
Nowadays people share everything on online social networks, from daily life stories to the latest local and global news and events. In our paper, we address the specific problem of user behavioural profiling in the context of cultural and artistic events.
We propose a specific analysis pipeline that aims at examining the profile of online users based on the textual content they published online. The pipeline covers the following aspects: data extraction and enrichment, topic modeling based on LDA, dimensionality reduction, user clustering, prediction of interest, and content analysis, including profiling of images and subjects.
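The core of such a pipeline can be sketched in a few lines with scikit-learn on toy data; the documents, parameter values and library choice are illustrative, not the exact tooling used in the paper:

```python
# Minimal sketch of the profiling pipeline: bag-of-words -> LDA topic
# mixtures -> clustering of users by topic profile. The per-user documents
# below are hypothetical stand-ins for concatenated social media posts.
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import LatentDirichletAllocation
from sklearn.cluster import KMeans

user_docs = [
    "art installation lake walk pier",
    "pier lake art walk floating",
    "traffic parking queue crowd delay",
    "queue crowd traffic delay parking",
]

counts = CountVectorizer().fit_transform(user_docs)        # bag of words
doc_topics = LatentDirichletAllocation(
    n_components=2, random_state=0).fit_transform(counts)  # per-user topic mixtures
clusters = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(doc_topics)
print(clusters)
```

Each row of `doc_topics` is a probability distribution over topics, which is what makes the subsequent clustering of users by interest meaningful.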
We show our approach at work for the monitoring of participation in a large-scale artistic installation that attracted more than 1.5 million visitors in just two weeks (namely The Floating Piers, by Christo and Jeanne-Claude). In the paper we report our findings and discuss the pros and cons of the work.
The full paper is published by Springer in the LNCS series in volume 10410, pages 219-236.
The slides used for the presentation are available on SlideShare:
Together with the Urbanscope team, we gave a TedX talk on the topics and results of the project here at Politecnico di Milano. The talk was actually given by our junior researchers, as we wanted it to be a choral performance as opposed to the typical one-man show.
The message is that cities are not mere physical and organizational devices: they are informational landscapes, where places are shaped more by streams of data and less by traditional physical evidence. We devise tools and analyses for understanding these streams and the phenomena they represent, in order to better understand our cities.
Two layers coexist: a thick and dynamic layer of digital traces – the informational membrane – grows every day on top of the material layer of the territory, the buildings and the infrastructures. The observation, analysis and representation of these two layers combined provide valuable insights into how the city is used and lived.
Urbanscope is a research laboratory that experiments with the collection, organization, analysis, and visualization of cross-domain geo-referenced data.
The research team is based at Politecnico di Milano and encompasses researchers with competencies in Computing Engineering, Communication and Information Design, Management Engineering, and Mathematics.
The aim of Urbanscope is to systematically produce compelling views on urban systems to foster understanding and decision making. Views are like new lenses of a macroscope: they are designed to support the recognition of specific patterns thus enabling new perspectives.
If you enjoyed the show, you can explore our beta application at:
and discover the other data science activities we are conducting at the Data Science Lab of Politecnico, DEIB.
You can also read and download the full article in PDF
Science aims at creating new knowledge upon the existing one, from the observation of physical phenomena, their modeling and empirical validation. This combines the well-known motto “standing on the shoulders of giants” (attributed to Bernard of Chartres and subsequently rephrased by Isaac Newton) with the need to attempt and validate new experiments.
However, knowledge in the world continuously evolves, at a pace that cannot be traced even by large crowdsourced bodies of knowledge such as Wikipedia. A large share of generated data is not currently analysed and consolidated into exploitable information and knowledge (Ackoff 1989). In particular, the process of ontological knowledge discovery tends to focus on the most popular items, those which are most quoted or referenced, and is less effective in discovering less popular items belonging to the so-called long tail, i.e. the portion of the entity distribution having fewer occurrences (Brambilla 2016).
This becomes a challenge for practitioners, enterprises, and scholars and researchers, who need to stay up to date with innovation and emerging facts. The scientific community also needs to make sure there is a structured and formal way to represent, store and access such knowledge, for instance as ontologies or linked data sources.
Our idea is to propose a vision towards a set of (possibly integrated) publicly available tools that can help scholars keep pace with the evolving knowledge. This implies the capability of integrating informal sources, such as social networks, blogs, and user-generated content in general. One can conjecture that somewhere, within the massive content shared by people online, any low-frequency, emerging concept or fact has left some traces. The challenge is to detect such traces, assess their relevance and trustworthiness, and transform them into formalized knowledge (Stieglitz 2014). An appropriate set of tools that can improve the effectiveness of knowledge extraction, storage, analysis, publishing and experimental benchmarking could be extremely beneficial for the entire research community, across fields and interests.
Our Vision towards Continuous Knowledge Extraction and Publishing
We foresee a paradigm where knowledge seeds can be planted, and subsequently grow, finally leading to the generation and collection of new knowledge, as depicted in the exemplary process shown below: knowledge seeding (through types, context variables, and example instances), growing (for instance by exploring social media), and harvesting for extracting concepts (instances and types).
We advocate for a set of tools that, when implemented and integrated, enable the following prospective scenario:
- possibility of selecting any kind of source of raw data, independently of its format, type or semantics (spanning quantitative data, textual content, and multimedia content), covering both data streams and pull-based data sources;
- possibility of applying different data cleaning and data analysis pipelines to the different sources, in order to increase data quality and abstraction / aggregation;
- possibility of integrating the selected sources;
- possibility of running homogeneous knowledge extraction processes of the integrated sources;
- possibility of publishing the results of the analysis and semantic enrichment as new and further (richer) data sources and streams, in a coherent, standard and semantic way.
This enables generation of new sources which in turn can be used in subsequent knowledge extraction processes of the same kind. The results of this process must be available at any stage to be shared for building an open, integrated and continuously evolving knowledge for research, innovation, and dissemination purposes.
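The loop described above can be sketched as a chain of composable stages. Everything below — stage names, toy records, the seed-term filter — is hypothetical and stands in for the real cleaning, integration, extraction, and semantic-publishing machinery:

```python
# Each stage is a plain function, so published output can be fed back
# in as a new source, closing the continuous-extraction loop.
def clean(records):
    # data cleaning: drop empties, normalize whitespace and case
    return [r.strip().lower() for r in records if r.strip()]

def integrate(*sources):
    # source integration: deduplicated union of all cleaned records
    return sorted(set(r for src in sources for r in src))

def extract(records, seed_terms):
    # naive "knowledge extraction": keep records mentioning a seed term
    return [r for r in records if any(t in r for t in seed_terms)]

def publish(facts):
    # stand-in for semantic publishing (e.g. as an RDF stream)
    return {"facts": facts, "format": "demo"}

tweets = ["  New BRAND opens flagship store ", "weather is nice", ""]
posts  = ["new brand opens flagship store", "cats are great"]

published = publish(extract(integrate(clean(tweets), clean(posts)),
                            seed_terms=["brand"]))
print(published["facts"])
# ['new brand opens flagship store']
```

The point of the sketch is the shape, not the logic: because every stage consumes and produces plain collections, the published output is itself a valid input for the next round of extraction.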
A Preliminary Feasibility Perspective
Whilst beneficial and powerful, the vision we propose is far from being achieved today. However, we are convinced that it is not out of reach in the mid term. To give a hint of this, we report here our experience with the research, design and implementation of a few tools that point in the proposed direction:
- Social Knowledge Extractor (SKE) is a publicly available tool for discovering emerging knowledge by extracting it from social content. Once instrumented by experts through a very simple initialization, the tool is capable of finding emerging entities by means of a mixed syntactic-semantic method. The method uses seeds, i.e. prototypes of emerging entities provided by experts, to generate candidates; it then associates candidates with feature vectors, built from the terms occurring in their social content, and ranks the candidates by their distance from the centroid of the seeds, returning the top candidates as the result. The tool can run continuously or in periodic iterations, using the results as new seeds. Our research on this has been published in (Brambilla et al., 2017); a simplified implementation is currently available online for demo purposes, and the code is available as open source under an Apache 2.0 license on GitHub at https://github.com/DataSciencePolimi/social-knowledge-extractor.
- TripleWave is a tool for disseminating and exchanging RDF streams on the Web. For the purpose of processing information streams in real time and at Web scale, TripleWave integrates nicely with RDF Stream Processing (RSP) and Stream Reasoning (SR) as solutions for combining semantic technologies with stream and event processing techniques. In particular, it integrates with an existing ecosystem of solutions to query, reason over, and perform real-time processing on heterogeneous and distributed data streams. TripleWave can be fed with existing Web streams (e.g. Twitter and Wikipedia streams) or time-annotated RDF datasets (e.g. the Linked Sensor Data dataset), and it can be invoked through both pull- and push-based mechanisms, thus enabling RSP engines to automatically register and receive data from TripleWave. The tool has been described in (Mauri et al., 2016) and the code is available as open source on GitHub at https://github.com/streamreasoning/TripleWave/.
- RSPlab enables the efficient design and execution of reproducible experiments, as well as the sharing of results. It integrates two existing RSP benchmarks (LSBench and CityBench) and two RSP engines (the C-SPARQL engine and CQELS). It provides a programmatic environment to deploy RDF streams and RSP engines in the cloud, interact with them using TripleWave and RSP Services, continuously monitor their performance, and collect statistics. RSPlab is released as open source under an Apache 2.0 license, is currently under submission at the ISWC Resources Track, and is available on GitHub
We believe that knowledge intake by scholars is going to become more and more time-consuming and expensive, due to the amount of knowledge that is built and shared every day. We envision a comprehensive approach, based on integrated tools for data collection, cleaning, integration, analysis and semantic representation, that can be run continuously to keep formalized knowledge bases aligned with the evolution of knowledge, with limited cost and high recall on the facts and concepts that emerge or decay. These tools do not need to be implemented by the same vendor or provider; we instead advocate for open-source publishing of all the implementations, as well as for the definition of an agreed-upon integration platform that allows them all to communicate appropriately.
Outlook on Research Resource Sharing
Russell L. Ackoff. From data to wisdom. Journal of applied systems analysis 16, 3–9 (1989).
Marco Brambilla, Stefano Ceri, Florian Daniel, Emanuele Della Valle. On the quest for changing knowledge. In Proceedings of the Workshop on Data-Driven Innovation on the Web – DDI 16. ACM Press, 2016. Link
Stefan Stieglitz, Linh Dang-Xuan, Axel Bruns, Christoph Neuberger. Social Media Analytics. Business & Information Systems Engineering 6, 89–96 Springer Nature, 2014. Link
Marco Brambilla, Stefano Ceri, Emanuele Della Valle, Riccardo Volonterio, Felix Xavier Acero Salazar. Extracting Emerging Knowledge from Social Media. In Proceedings of the 26th International Conference on World Wide Web – WWW 17. ACM Press, 2017. Link
Andrea Mauri, Jean-Paul Calbimonte, Daniele Dell’Aglio, Marco Balduini, Marco Brambilla, Emanuele Della Valle, Karl Aberer. TripleWave: Spreading RDF Streams on the Web. In Lecture Notes in Computer Science, pp. 140–149. Springer International Publishing, 2016. Link
(*) Note: the current version includes content in response to an online open review.
This is the summary of a joint contribution with Eric Umuhoza to ICEIS 2017 on Model-driven Development of User Interfaces for IoT via Domain-specific Components & Patterns.
Internet of Things technologies and applications are evolving and continuously gaining traction in all fields and environments, including homes, cities, services, industry and commercial enterprises. However, many problems still need to be addressed.
For instance, the IoT vision is mainly focused on the technological and infrastructural aspects, and on the management and analysis of the huge amounts of generated data, while so far the development of front ends and user interfaces for IoT has not played a relevant role in research.
On the contrary, we believe that user interfaces in the IoT ecosystem can play a key role in the acceptance of solutions by final adopters.
In this paper we present a model-driven approach to the design of IoT interfaces, by defining a specific visual design language and design patterns for IoT applications, and we show them at work. The language we propose is defined as an extension of the OMG standard language called IFML.
The slides of this talk are available online on Slideshare as usual:
Today I presented our full paper titled “Extracting Emerging Knowledge from Social Media” at the WWW 2017 conference.
The work is based on a rather obvious assumption, i.e., that knowledge in the world continuously evolves, and ontologies are largely incomplete for what concerns low-frequency data, belonging to the so-called long tail.
Socially produced content is an excellent source for discovering emerging knowledge: it is huge, and it immediately reflects the relevant changes in which emerging entities hide.
In the paper we propose a method and a tool for discovering emerging entities by extracting them from social media.
Once instrumented by experts through a very simple initialization, the method is capable of finding emerging entities by means of a mixed syntactic-semantic approach. It uses seeds, i.e. prototypes of emerging entities provided by experts, to generate candidates; it then associates candidates with feature vectors, built from the terms occurring in their social content, ranks the candidates by their distance from the centroid of the seeds, and returns the top candidates as the result.
The method can be continuously or periodically iterated, using the results as new seeds.
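The ranking step can be illustrated with a small, purely hypothetical sketch: candidate entities are represented as term-frequency vectors over their social content and ordered by Euclidean distance from the centroid of the seed vectors (the vocabulary, entity names and counts below are invented, and the actual method builds richer features):

```python
import numpy as np

# Term order for the frequency vectors below (hypothetical vocabulary).
vocab = ["startup", "founder", "funding", "recipe", "kitchen"]

# Seeds: expert-provided prototypes of the emerging-entity type.
seeds = np.array([[3, 2, 4, 0, 0],
                  [4, 1, 3, 0, 0]], dtype=float)

# Candidates generated from social content, as term-frequency vectors.
candidates = {
    "acme-ai":   np.array([2, 3, 3, 0, 0], dtype=float),
    "cook-blog": np.array([0, 0, 1, 4, 3], dtype=float),
}

centroid = seeds.mean(axis=0)                       # centre of the seed cloud
ranking = sorted(candidates,
                 key=lambda name: np.linalg.norm(candidates[name] - centroid))
print(ranking)
# ['acme-ai', 'cook-blog']
```

Candidates whose social-content vocabulary resembles the seeds' land near the centroid and rise to the top; feeding the top-ranked candidates back in as new seeds gives the iterative behaviour described above.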
You can also check out the slides of my presentation on Slideshare.
Social media are getting more and more important in the context of live events, such as fairs, exhibits, festivals, concerts, and so on, as they play an essential role in communicating them to fans, interest groups, and the general population. These kinds of events are geo-localized within a city or territory and are scheduled within a public calendar.
Together with the people in the Fashion in Process group of Politecnico di Milano, we studied the impact on social media of a specific scenario, the Milano Fashion Week (MFW), which is an important event in Milano for the whole fashion business.
We focus our attention on the spreading of social content in space, measuring how the event propagates geographically. We build different clusters of fashion brands, characterize several features of spatial propagation, and correlate them with brand popularity and temporal propagation.
We show that the clusters along the space, time and popularity dimensions are loosely correlated; therefore, trying to understand the dynamics of the events based on popularity aspects alone would not be appropriate.
A subsequent paper on the temporal analysis of the same event “Temporal Analysis of Social Media Response to Live Events: The Milano Fashion Week”, focusing on Granger Causality and other measures, has been published at ICWE 2017 and is available in the proceedings by Springer.
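To give a flavour of that temporal analysis, here is a minimal NumPy illustration of the intuition behind a lag-1 Granger test: does adding one series' past improve the prediction of another series? The synthetic data and the bare-bones F statistic below are illustrative; the paper's actual analysis is more elaborate:

```python
import numpy as np

def granger_lag1_F(x, y):
    """F statistic for lag-1 Granger causality of x on y: does adding
    x[t-1] to a regression of y[t] on y[t-1] reduce the error?"""
    yt, ylag, xlag = y[1:], y[:-1], x[:-1]
    n = len(yt)

    def rss(design):
        A = np.column_stack([np.ones(n)] + design)
        beta, *_ = np.linalg.lstsq(A, yt, rcond=None)
        resid = yt - A @ beta
        return float(resid @ resid), A.shape[1]

    rss_r, k_r = rss([ylag])           # restricted: y's own past only
    rss_u, k_u = rss([ylag, xlag])     # unrestricted: plus x's past
    return ((rss_r - rss_u) / (k_u - k_r)) / (rss_u / (n - k_u))

rng = np.random.default_rng(0)
x = rng.normal(size=300)
y = np.empty(300)
y[0] = 0.0
for t in range(1, 300):
    y[t] = 0.9 * x[t - 1] + 0.1 * rng.normal()   # y is driven by lagged x

f_xy = granger_lag1_F(x, y)   # large: x's past predicts y
f_yx = granger_lag1_F(y, x)   # small: y's past does not predict x
print(f_xy > f_yx)
# True
```

The asymmetry between the two F values is what lets this kind of analysis distinguish, say, brands whose posting activity anticipates the crowd's response from brands that merely follow it.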
The PowerPoint presentation is available on SlideShare.