Exploring the bi-verse: a trip across the digital and physical ecospheres

I’ve been invited to give a keynote talk at the WISE 2022 Conference. Thinking about it, I decided to focus on my idea of a bi-verse. To me, the bi-verse is the duality between the physical and digital worlds.

On one side, the Web and social media are the environments where people post their content, opinions, activities, and resources. As a result, a considerable amount of user-generated content is produced every day for a wide variety of purposes.

On the other side, people live their everyday lives immersed in the physical world, where society, economy, politics, and personal relations continuously evolve. These two opposite and complementary environments are today fully integrated: they reflect each other and interact with each other ever more strongly.

Exploring and studying content and data coming from both environments offers a great opportunity to understand the ever-evolving modern society, in terms of topics of interest, events, relations, and behavior.

This slidedeck summarizes my contribution:

In my speech, I discuss business cases and socio-political scenarios, to show how we can extract insights and understand reality by combining and analyzing data from the digital and physical worlds, so as to reach a better overall picture of reality itself. Along this path, we need to take into account that reality is complex and varies in time, space, and many other dimensions, including societal and economic variables. The speech highlights the main challenges that need to be addressed and outlines some data science strategies that can be applied to tackle these specific challenges.

The Final TRIGGER Conference

We will join and contribute to the final TRIGGER conference, which is scheduled for May 31st, 2022 in Brussels.

The theme is: “Rethinking the EU’s role in global governance”. In this context, the TRIGGER project is going to present the main research outcomes of the H2020 research program that started in 2018, setting the stage for the collaboration among 14 international partners. 

We will present our main contributions, namely PERSEUS and COCTEAU.

A quick intro to PERSEUS is available in this video:

Further details about the event are available here:

Generation of Realistic Navigation Paths for Web Site Testing using RNNs and GANs

Web logs record the navigation activity generated by a set of users on a given website. This type of data is fundamental because it captures how users behave and how they interact with the company’s product itself (website or application). If a company could obtain a realistic web log before the release of its product, it would have a significant advantage: it could use log analysis to identify the least navigated pages, or those to bring to the foreground.
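As a toy illustration of what such a log contains, the sketch below (with invented IPs, timestamps, and pages) groups raw requests into per-user navigation paths, splitting a session when two consecutive requests are more than 30 minutes apart:

```python
from datetime import datetime, timedelta

# Hypothetical log entries: (client IP, timestamp, requested page).
RAW_LOG = [
    ("10.0.0.1", "2020-01-01 10:00:00", "/home"),
    ("10.0.0.1", "2020-01-01 10:05:00", "/products"),
    ("10.0.0.2", "2020-01-01 10:06:00", "/home"),
    ("10.0.0.1", "2020-01-01 11:30:00", "/home"),  # >30 min gap: new session
]

def sessionize(entries, gap=timedelta(minutes=30)):
    """Group log entries into navigation paths (sessions) per client IP."""
    by_user = {}
    for ip, ts, page in entries:
        by_user.setdefault(ip, []).append((datetime.fromisoformat(ts), page))
    sessions = []
    for ip, visits in by_user.items():
        visits.sort()
        current = [visits[0][1]]
        for (prev_t, _), (t, page) in zip(visits, visits[1:]):
            if t - prev_t > gap:            # long pause ends the session
                sessions.append((ip, current))
                current = []
            current.append(page)
        sessions.append((ip, current))
    return sessions

paths = sessionize(RAW_LOG)
# e.g. [("10.0.0.1", ["/home", "/products"]), ("10.0.0.1", ["/home"]),
#       ("10.0.0.2", ["/home"])]
```

The resulting per-user paths are exactly the kind of sequences the generation methods below aim to reproduce.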

A large audience of users and typically a long time frame are needed to produce sensible and useful log data, which makes collecting it an expensive task.

To address this limit, we propose a method that focuses on the generation of REALISTIC NAVIGATIONAL PATHS, i.e., web logs.

Our approach is extremely relevant because it both tackles the lack of publicly available data about web navigation logs and can be adopted in industry for the AUTOMATIC GENERATION OF REALISTIC TEST SETTINGS for Web sites yet to be deployed.

The generation has been implemented using deep learning methods for producing realistic navigation activities, namely:

  • Recurrent Neural Networks (RNNs), which are very well suited to temporally evolving data;
  • Generative Adversarial Networks (GANs): neural networks aimed at generating new data, such as images or text, very similar to the original ones and sometimes indistinguishable from them, which have become increasingly popular in recent years.
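For intuition on the recurrent side, here is a minimal sampling loop for a single-layer RNN in plain NumPy. The page vocabulary, hidden size, and weights are all invented for illustration; a model trained on real web logs would replace the random parameters:

```python
import numpy as np

# Illustrative sketch only: an UNTRAINED RNN that samples a navigation
# path token by token. Random weights just show the mechanics.
rng = np.random.default_rng(0)
PAGES = ["<start>", "/home", "/products", "/cart", "/checkout", "<end>"]
V, H = len(PAGES), 16

Wxh = rng.normal(scale=0.1, size=(H, V))   # input-to-hidden weights
Whh = rng.normal(scale=0.1, size=(H, H))   # hidden-to-hidden (recurrence)
Why = rng.normal(scale=0.1, size=(V, H))   # hidden-to-output weights

def sample_path(max_len=10):
    h = np.zeros(H)
    token = 0                               # start from the <start> token
    path = []
    for _ in range(max_len):
        x = np.eye(V)[token]                # one-hot encoding of last page
        h = np.tanh(Wxh @ x + Whh @ h)      # recurrent state update
        logits = Why @ h
        logits -= logits.max()              # numerically stable softmax
        probs = np.exp(logits) / np.exp(logits).sum()
        token = rng.choice(V, p=probs)      # sample the next page
        if PAGES[token] == "<end>":
            break
        path.append(PAGES[token])
    return path

path = sample_path()
```

Training (backpropagation through time, or the adversarial game in the GAN variant) is what turns these random transitions into realistic ones.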

We ran experiments using open data sets of web logs for training, and ran tests to assess the performance of the methods. Results in generating new web log data are quite good, as reported in the summary table below, with respect to the two evaluation metrics adopted (BLEU and human evaluation).


Comparison of the performance of a baseline statistical approach, RNN, and GAN for generating realistic web logs. Evaluation is done using human assessments and the BLEU metric.
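As a rough illustration of the BLEU side of the evaluation, the simplified sketch below computes modified n-gram precision up to bigrams, with a brevity penalty, over hypothetical navigation paths. Real evaluations would normally rely on a library implementation with multiple references and higher-order n-grams:

```python
import math
from collections import Counter

def ngrams(seq, n):
    """Multiset of n-grams of a sequence."""
    return Counter(tuple(seq[i:i + n]) for i in range(len(seq) - n + 1))

def bleu(candidate, reference, max_n=2):
    """Simplified BLEU against a single reference path."""
    precisions = []
    for n in range(1, max_n + 1):
        cand, ref = ngrams(candidate, n), ngrams(reference, n)
        overlap = sum(min(c, ref[g]) for g, c in cand.items())
        total = max(sum(cand.values()), 1)
        precisions.append(max(overlap, 1e-9) / total)  # clipped precision
    # Brevity penalty: punish candidates shorter than the reference.
    bp = 1.0 if len(candidate) >= len(reference) \
        else math.exp(1 - len(reference) / len(candidate))
    return bp * math.exp(sum(math.log(p) for p in precisions) / max_n)

# Invented example paths:
ref = ["/home", "/products", "/cart", "/checkout"]
gen = ["/home", "/products", "/checkout"]
score = bleu(gen, ref)   # strictly between 0 and 1 here
```

A generated path identical to the reference scores 1.0; partial overlaps score proportionally lower.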


Our study is described in detail in the paper published at ICWE 2020 – International Conference on Web Engineering, with DOI: 10.1007/978-3-030-50578-3. It’s available online on the Springer Web site and can be cited as:

Pavanetto S., Brambilla M. (2020) Generation of Realistic Navigation Paths for Web Site Testing Using Recurrent Neural Networks and Generative Adversarial Neural Networks. In: Bielikova M., Mikkonen T., Pautasso C. (eds) Web Engineering. ICWE 2020. Lecture Notes in Computer Science, vol 12128. Springer, Cham

The slides are online too:

Together with a short presentation video:


Coronavirus stories and data

Coronavirus COVID-19 is an extreme challenge for our society, economy, and individual lives. Governments should have learnt from each other: the impact has been spreading slowly across countries, so there has been plenty of time to take action. But apparently people and governments can’t grasp the risk until it’s upon them. And the way European and American governments are acting is too slow and incremental.

I live in Italy, which ranks second in the world for healthcare quality. The mindset of “this won’t happen here” was the attitude at the beginning of this challenge, and look at what happened. I’m reporting here two links to articles that mention a data-driven vision, but also the human, psychological, and behavioural aspects involved. They are two simple stories that report the Italian perspective on the virus.

Coronavirus Stories From Italy

And why now it’s the time for YOU to worry, fellow Europeans and Americans

#Coronavirus: Updates from the Italian Front

A preview of what will happen in a week in the rest of the world. Things have dramatically changed in our society.

IEEE Big Data Conference 2017: take home messages from the keynote speakers

I collected here the list of my write-ups of the first three keynote speeches of the conference:

Driving Style and Behavior Analysis based on Trip Segmentation over GPS Information through Unsupervised Learning

Over one billion cars interact with each other on the road every day. Each driver has their own driving style, which can impact safety, fuel economy, and road congestion. Knowledge of a driver’s style could be used to encourage “better” driving behaviour through immediate feedback while driving, or by scaling auto insurance rates based on the aggressiveness of the driving style.
In this work we report on our study of driving behaviour profiling based on unsupervised data mining methods. The main goal is to detect the different driving behaviours, and thus to cluster drivers with similar behaviour. This paves the way to new business models related to the driving sector, such as Pay-How-You-Drive insurance policies and car rentals. Here is the presentation I gave on this topic:

Driver behavioural characteristics are studied by collecting information from GPS sensors on the cars and by applying three different analysis approaches (DP-means, Hidden Markov Models, and Behavioural Topic Extraction) to the contextual scene detection problem on car trips, in order to detect different behaviours along each trip. Subsequently, drivers are clustered into similar profiles, and the results are compared with a human-defined ground truth on driver classification.

The proposed framework is tested on a real dataset containing sampled car signals. While the different approaches show relevant differences in trip segment classification, the coherence of the final driver clustering results is surprisingly high.
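To give a flavour of one of the three approaches, here is a minimal DP-means sketch: it behaves like k-means, except that a point farther than a threshold λ from every current centroid spawns a new cluster, so the number of driving-style clusters is not fixed upfront. The two-dimensional features (think per-segment speed and acceleration statistics) are synthetic and purely illustrative:

```python
import numpy as np

def dp_means(X, lam, n_iter=20):
    """DP-means: k-means variant that opens a new cluster whenever a
    point is farther than `lam` from all existing centroids."""
    centroids = [X.mean(axis=0)]
    assign = np.zeros(len(X), dtype=int)
    for _ in range(n_iter):
        for i, x in enumerate(X):
            d = [np.linalg.norm(x - c) for c in centroids]
            j = int(np.argmin(d))
            if d[j] > lam:                      # too far: open a new cluster
                centroids.append(x.copy())
                assign[i] = len(centroids) - 1
            else:
                assign[i] = j
        for k in range(len(centroids)):         # recompute centroids
            members = X[assign == k]
            if len(members):
                centroids[k] = members.mean(axis=0)
    return assign, centroids

# Two well-separated synthetic "driving styles" (invented numbers):
rng = np.random.default_rng(1)
calm = rng.normal([30.0, 0.5], 0.5, size=(20, 2))        # low speed/accel
aggressive = rng.normal([90.0, 3.0], 0.5, size=(20, 2))  # high speed/accel
X = np.vstack([calm, aggressive])
labels, cents = dp_means(X, lam=10.0)
# labels separate the two styles into two distinct clusters
```

The choice of λ plays the role that k plays in k-means: smaller values yield more, finer-grained behaviour clusters.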


This work has been published at the 4th IEEE Big Data Conference, held in Boston in December 2017. The full paper can be cited as:

M. Brambilla, P. Mascetti and A. Mauri, “Comparison of different driving style analysis approaches based on trip segmentation over GPS information,” 2017 IEEE International Conference on Big Data (Big Data), Boston, MA, 2017, pp. 3784-3791.
doi: 10.1109/BigData.2017.8258379

You can download the full paper PDF from the IEEE Explore Library, at this url:


If you are interested in further contributions at the conference, here you can find my summaries of the keynote speeches on human-in-the-loop machine learning and on increasing human perception through text mining.

A Curated List of WWW 2017 Papers for Data Science and Web Science

This year, the WWW 2017 conference is definitely placing a lot of emphasis on Web Science and Data Science.

I’m recording here a list of papers I found interesting at the conference, related to this topic. Disclaimer: the list may be incomplete, as I did not go through all the papers. If you want full coverage of the conference, you can browse the full WWW proceedings, which are entirely available online as open-access Creative Commons content.

Anyway, here is my list:

Prices and Subsidies in the Sharing Economy (Page 53)
Zhixuan Fang (Tsinghua University)
Longbo Huang (Tsinghua University)
Adam Wierman (California Institute of Technology)

Understanding and Discovering Deliberate Self-harm Content in Social Media (Page 93)
Yilin Wang (Arizona State University)
Jiliang Tang (Michigan State University)
Jundong Li (Arizona State University)
Baoxin Li (Arizona State University)
Yali Wan (University of Washington)
Clayton Mellina (Yahoo Research)
Neil O’Hare (Yahoo Research)
Yi Chang (Huawei Research America)

Cataloguing Treatments Discussed and Used in Online Autism Communities (Page 123)
Shaodian Zhang (Columbia University)
Tian Kang (Columbia University)
Lin Qiu (Shanghai Jiao Tong University)
Weinan Zhang (Shanghai Jiao Tong University)
Yong Yu (Shanghai Jiao Tong University)
Noémie Elhadad (Columbia University)

Neural Collaborative Filtering (Page 173)
Xiangnan He (National University of Singapore)
Lizi Liao (National University of Singapore)
Hanwang Zhang (Columbia University)
Liqiang Nie (Shandong University)
Xia Hu (Texas A&M University)
Tat-Seng Chua (National University of Singapore)

Exact Computation of Influence Spread by Binary Decision Diagrams (Page 947)
Takanori Maehara (Shizuoka University & RIKEN Center for Advanced Intelligence Project)
Hirofumi Suzuki (Hokkaido University)
Masakazu Ishihata (Hokkaido University)

Secure Centrality Computation Over Multiple Networks (Page 957)
Gilad Asharov (Cornell-Tech)
Francesco Bonchi (ISI Foundation)
David García Soriano (Eurecat & Pompeu Fabra University)
Tamir Tassa (The Open University)

Interplay between Social Influence and Network Centrality: A Comparative Study on Shapley Centrality and Single-Node-Influence Centrality (Page 967)
Wei Chen (Microsoft Research)
Shang-Hua Teng (University of Southern California)

Portfolio Optimization for Influence Spread (Page 977)
Naoto Ohsaka (The University of Tokyo)
Yuichi Yoshida (National Institute of Informatics & Preferred Infrastructure, Inc.)

Extracting and Ranking Travel Tips from User-Generated Reviews (Page 987)
Ido Guy (Ben-Gurion University of the Negev & eBay Research)
Avihai Mejer (Yahoo Research)
Alexander Nus (Yahoo Research)
Fiana Raiber (Technion – Israel Institute of Technology)

Information Extraction in Illicit Web Domains (Page 997)
Mayank Kejriwal (University of Southern California)
Pedro Szekely (University of Southern California)

Learning to Extract Events from Knowledge Base Revisions (Page 1007)
Alexander Konovalov (Ohio State University)
Benjamin Strauss (Ohio State University)
Alan Ritter (Ohio State University)
Brendan O’Connor (University of Massachusetts, Amherst)

CoType: Joint Extraction of Typed Entities and Relations with Knowledge Bases (Page 1015)
Xiang Ren (University of Illinois at Urbana-Champaign)
Zeqiu Wu (University of Illinois at Urbana-Champaign)
Wenqi He (University of Illinois at Urbana-Champaign)
Meng Qu (University of Illinois at Urbana-Champaign)
Clare R. Voss (Army Research Laboratory)
Heng Ji (Rensselaer Polytechnic Institute)
Tarek F. Abdelzaher (University of Illinois at Urbana-Champaign)
Jiawei Han (University of Illinois at Urbana-Champaign)

Using Crowdsourcing for Domain-Specific Languages Specification

In the context of Domain-Specific Modeling Language (DSML) development, the involvement of end-users is crucial to assure that the resulting language satisfies their needs.

In our paper presented at SLE 2017 in Vancouver, Canada, on October 24th within the SPLASH Conference context, we discuss how crowdsourcing tasks can be exploited to assist in domain-specific language definition processes. This is in line with the vision towards cognification of model-driven engineering.

The slides are available on slideshare:


Indeed, crowdsourcing has emerged as a novel paradigm where humans are employed to perform computational and information collection tasks. In language design, by relying on the crowd, it is possible to show an early version of the language to a wider spectrum of users, thus increasing the validation scope and eventually promoting its acceptance and adoption.

Ready to accept improper use of your tools?

We propose a systematic (and automatic) method for creating crowdsourcing campaigns aimed at refining the graphical notation of DSMLs. The method defines a set of steps to identify, create and order the questions for the crowd. As a result, developers are provided with a set of notation choices that best fit end-users’ needs. We also report on an experiment validating the approach.

Improving the quality of the language notation may dramatically improve acceptance and adoption, as well as the way people use your notation and the associated tools.

Essentially, our idea is to submit to the crowd a set of questions regarding the concrete syntax of visual modeling languages, and collect opinions. Based on different strategies, we generate an optimal notation and then we check how good it is.
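As a toy sketch of the aggregation step, the snippet below picks, for each concrete-syntax question posed to the crowd, the notation option with the most votes. The question names and options are invented for illustration and are not the ones used in the paper:

```python
from collections import Counter

# Hypothetical crowd answers: one list of votes per notation question.
votes = {
    "task_shape":   ["rounded_rect", "rounded_rect", "circle", "rounded_rect"],
    "gateway_icon": ["diamond", "diamond", "hexagon"],
    "flow_arrow":   ["solid", "dashed", "solid", "solid", "dashed"],
}

def aggregate_notation(votes):
    """Majority voting: one winning notation choice per question."""
    return {q: Counter(opts).most_common(1)[0][0] for q, opts in votes.items()}

notation = aggregate_notation(votes)
# → {"task_shape": "rounded_rect", "gateway_icon": "diamond",
#    "flow_arrow": "solid"}
```

More refined strategies could weight voters by expertise or resolve ties by asking follow-up questions; majority voting is just the simplest aggregation rule.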

In the paper we also validate the approach and experiment with it in a practical use case, namely studying some variations of the BPMN modeling language.

The full paper can be found here: https://dl.acm.org/citation.cfm?doid=3136014.3136033. The paper is titled: “Better Call the Crowd: Using Crowdsourcing to Shape the Notation of Domain-Specific Languages” and was co-authored by Marco Brambilla, Jordi Cabot, Javier Luis Cánovas Izquierdo, and Andrea Mauri.

You can also access the Web version on Jordi Cabot blog.

The artifacts described in this paper are also referenced on findresearch.org, namely referring to the following materials:

Myths and Challenges in Knowledge Extraction and Big Data Analysis

For centuries, science (in German “Wissenschaft”) has aimed to create (“schaffen”) new knowledge (“Wissen”) from the observation of physical phenomena, their modelling, and empirical validation.

Recently, a new source of knowledge has emerged: not (only) the physical world any more, but the virtual world, namely the Web with its ever-growing stream of data materialized in the form of social network chattering, content produced on demand by crowds of people, and messages exchanged among interlinked devices in the Internet of Things. The knowledge we may find there can be dispersed, informal, contradictory, unsubstantiated, and ephemeral today, while already tomorrow it may be commonly accepted.

The challenge is once again to capture and create consolidated knowledge that is new, has not been formalized yet in existing knowledge bases, and is buried inside a big, moving target (the live stream of online data).

The myth is that existing tools (spanning fields like semantic web, machine learning, statistics, NLP, and so on) suffice for the objective. While this may still be far from true, some existing approaches actually address the problem and provide preliminary insights into what successful attempts may lead to.

I gave a few keynote speeches on this matter (at ICEIS, KDWEB, …), and I also use this argument as a motivating class in academic courses, to let students understand how crucial it is to focus on the problems related to big data modeling and analysis. The talk, reported in the slides below, explores through real industrial use cases the mixed realistic-utopian domain of data analysis and knowledge extraction, and reports on some tools and cases where the digital and physical worlds have been brought together to better understand our society.

The presentation is available on SlideShare and is reported here below:

Urban Data Science Bootcamp

We organize a crash course on how the science of urban data can be applied to solve metropolitan issues.


The course is a 2-day face-to-face event with teaching sessions, workshops, case study discussions, and hands-on activities for non-IT professionals in the field of city management. It is offered in two editions during the year:

  • in Milan, Italy, on November 8th-9th, 2017
  • in Amsterdam, The Netherlands, on November 30th-December 1st, 2017.

You can download the flyer and program of the Urban datascience bootcamp 2017.

Ideal participants include: Civil servants, Professionals, Students, Urban planners, and managers of city utilities and services. No previous experience in data science or computer science is required. Attendees should have experience in areas such as economic affairs, urban development, management support, strategy & innovation, health & care, public order & safety.

Data is the catalyst needed to make the smart city vision a reality in a transparent and evidence-based (i.e., data-driven) manner. The skills required for data-driven urban analysis and design activities are diverse, ranging from data collection (field work, crowdsensing, physical sensor processing, etc.), to data processing with established big data technology frameworks, to data exploration for finding patterns and outliers in spatio-temporal data streams, to data visualization conveying the right information in the right manner.

The CrowdInsights professional school “Urban Data Science Bootcamp” provides a no-frills, hands-on introduction to the science of urban data: from data creation to data analysis, data visualization, and sense-making, the bootcamp will introduce more than 10 real-world use cases that exemplify how urban data can be applied to solve metropolitan issues. Attendees will explore the challenges and opportunities that come from the adoption of novel types of urban data sources, including social media, mobile phone data, IoT networks, etc.