Crash Course in Data Science at PoliMi

For the second time, we are proposing a “night-time” multidisciplinary interactive mini-course that introduces data science concepts, methods, and use cases to bachelor students of Politecnico di Milano (as well as master students of other schools, such as management, design, and architecture).

The full program of the mini-course is:

Day | Topic | Instructor | Classroom (DEIB) | Materials
4/2/2019 | Intro to big data and data science. (Re)discovering SQL | Ceri / Brambilla | Seminari | Intro; SQL
5/2/2019 | Big Data and NoSQL | Brambilla | Conferenze | NoSQL overview; NoSQL Databases; Graph Databases
8/2/2019 | Data analysis: dimensionality, clustering | Brambilla | Seminari | Dimensionality Reduction & Clustering
12/2/2019 | Data analysis: classification & hands-on machine learning, AI, neural networks, deep learning | Brambilla / Ramponi / Di Giovanni | Conferenze | Classification, neural networks, CNN, RNN, DNN, deep learning
14/2/2019 | Hands-on data analysis | Ramponi / Di Giovanni | Seminari | Python-datascience; NN-Keras
20/2/2019 | Scenarios: genomics, bots and fake news | Ceri / Daniel | Seminari | Bots and fake news
21/2/2019 | Statistics in practice | Vantini | Seminari |
27/2/2019 | Data visualization | Ciuccarelli | Seminari | Datascience Challenges

Possible Theses and Projects

The course is in Italian, with teaching materials in English.

Classes are always from 5:30pm to 7:00pm.

You can read more at:

http://datascience.deib.polimi.it/course/crash-course-data-science-peopledeib/ 

Or you can get in touch if you want more details: marco.brambilla@polimi.it.


Understanding Polarized Political Events through Social Media Analysis

Predicting the outcome of elections is a topic that has been extensively studied in political polls, which have generally provided reliable predictions by means of statistical models. In recent years, online social media platforms have become a potential alternative to traditional polls, since they provide large amounts of post and user data, also referring to socio-political aspects.

In this context, we designed a research project aimed at defining a user modeling pipeline to analyze discussions and opinions shared on social media regarding polarized political events (such as a public poll or referendum).

The pipeline follows a four-step methodology.

 

  • First, social media posts and user metadata are crawled.
  • Second, a filtering mechanism is applied to filter out spammers and bots.
  • Third, demographic information is extracted for the valid users, namely gender, age, ethnicity, and location.
  • Fourth, the political polarity of the users with respect to the analyzed event is predicted.
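The four steps above can be sketched as a minimal pipeline. Everything here is illustrative: the toy posts, the posts-per-day bot threshold, the empty demographic fields, and the keyword polarity lexicon are stand-ins, not the models used in the paper.

```python
# Sketch of the four-step user-modeling pipeline (illustrative heuristics only).

def crawl():  # step 1: normally a social-media API call; toy data here
    return [
        {"user": "u1", "posts": ["yes to independence", "vote yes"]},
        {"user": "u2", "posts": ["vote no"] * 60},           # spam-like volume
        {"user": "u3", "posts": ["no thanks", "I vote no"]},
    ]

def filter_bots(users, max_posts=50):                        # step 2: naive bot filter
    return [u for u in users if len(u["posts"]) <= max_posts]

def add_demographics(users):                                 # step 3: placeholder extraction
    for u in users:
        u["demographics"] = {"gender": None, "age": None, "location": None}
    return users

def predict_polarity(users, pro={"yes"}, against={"no"}):    # step 4: keyword polarity score
    for u in users:
        words = " ".join(u["posts"]).split()
        score = sum(w in pro for w in words) - sum(w in against for w in words)
        u["polarity"] = "favor" if score > 0 else "against"
    return users

users = predict_polarity(add_demographics(filter_bots(crawl())))
print([(u["user"], u["polarity"]) for u in users])
```

The spam-like account is dropped in step 2, so only the two genuine users receive a polarity label.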

In the scope of this work, our proposed pipeline is applied to two referendum scenarios:

  • independence of Catalonia in Spain
  • autonomy of Lombardy in Italy

We used these real-world examples to assess the performance of the approach in terms of its capability to collect correct insights on the demographics of social media users and to predict the poll results based on the opinions they shared.


Experiments show that the method was effective in predicting the political trends for the Catalonia case, but not for the Lombardy case. Among the various possible motivations for this, we noticed that, in general, Twitter was more representative of the users opposing the referendum than of those in favor.

The work has been presented at the KDWEB workshop at the ICWE 2018 conference.

A preprint of the paper can be downloaded from ArXiv and cited as reported here:

Roberto Napoli, Ali Mert Ertugrul, Alessandro Bozzon, Marco Brambilla. A User Modeling Pipeline for Studying Polarized Political Events in Social Media. KDWeb Workshop 2018, co-located with ICWE 2018, Caceres, Spain, June 2018. arXiv:1807.09459

Data Cleaning for Knowledge Extraction and Understanding on Social Media

 

Social media platforms let users share their opinions through textual or multimedia content. In many settings, this becomes a valuable source of knowledge that can be exploited for specific business objectives. Brands and companies often ask for social media monitoring to understand the stance, opinion, and sentiment of their customers and of their current and potential audience. This is crucial because it lets them spot trends and future commercial and marketing opportunities.

However, all this relies on a solid and reliable data collection phase, which guarantees that all the analyses, extractions, and predictions are applied to clean, solid, and focused data. Indeed, the typical topic-based collection of social media content, performed through keyword-based search, tends to yield very noisy results.

We recently ran a simple study aimed at cleaning the data collected from social content within specific domains or related to given topics of interest. We propose a basic method for data cleaning and removal of off-topic content based on supervised machine learning techniques, i.e., classification, applied to data collected from social media platforms through keyword search on a specific topic. We define a general method and then validate it through a data extraction experiment on Twitter, targeting a set of famous cultural institutions in Italy, including theaters, museums, and other venues.

For this case, we collaborated with domain experts to label the dataset, and then we evaluated and compared the performance of classifiers that are trained with different feature extraction strategies.
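To make the classification step concrete, here is a toy bag-of-words Naive Bayes filter that labels posts as on-topic or off-topic. The training samples, the Naive Bayes choice, and the Laplace smoothing are assumptions for this sketch, not the feature extraction strategies or classifiers compared in the paper.

```python
# Toy on-topic / off-topic text classifier: multinomial Naive Bayes
# with add-one (Laplace) smoothing over bag-of-words counts.
import math
from collections import Counter

def train(samples):
    """samples: list of (text, label) pairs; returns (word counts, doc counts)."""
    counts = {"on": Counter(), "off": Counter()}
    docs = Counter()
    for text, label in samples:
        docs[label] += 1
        counts[label].update(text.lower().split())
    return counts, docs

def classify(model, text):
    counts, docs = model
    total = sum(docs.values())
    vocab = len(set(counts["on"]) | set(counts["off"]))
    best, best_lp = None, -math.inf
    for label in counts:
        lp = math.log(docs[label] / total)                        # class prior
        n = sum(counts[label].values())
        for w in text.lower().split():
            lp += math.log((counts[label][w] + 1) / (n + vocab))  # smoothed likelihood
        if lp > best_lp:
            best, best_lp = label, lp
    return best

# Hypothetical labeled examples for a "cultural venues" topic.
train_set = [
    ("exhibition opening at the museum tonight", "on"),
    ("new opera premiere at the theatre", "on"),
    ("great pizza at the museum cafe lol", "off"),
    ("selling tickets cheap follow me", "off"),
]
model = train(train_set)
print(classify(model, "premiere tonight at the opera theatre"))
```

In the real study the labels came from domain experts and several feature extraction strategies were compared; this sketch only shows the shape of the supervised filtering step.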

The work has been presented at the KDWEB workshop at the ICWE 2018 conference.

A preprint of the paper can be downloaded and cited as reported here:

Emre Calisir, Marco Brambilla. The Problem of Data Cleaning for Knowledge Extraction from Social Media. KDWeb Workshop 2018, co-located with ICWE 2018, Caceres, Spain, June 2018.

The slides used in the workshop are available online here:

 

Iterative knowledge extraction from social networks

Yesterday, we presented a new work at The Web Conference in Lyon, continuing the research line on knowledge extraction from human-generated content started with our paper “Extracting Emerging Knowledge from Social Media”, presented at the WWW 2017 Conference (see also this past post).

Our motivation starts from the fact that knowledge in the world continuously evolves, and thus ontologies and knowledge bases are largely incomplete, especially regarding data belonging to the so-called long tail. Therefore, we proposed a method for discovering emerging knowledge by extracting it from social content. Once initialized by domain experts, the method is capable of finding relevant entities by means of a mixed syntactic-semantic method. The method uses seeds, i.e., prototypes of emerging entities provided by experts, for generating candidates; it then associates each candidate with a feature vector built from the terms occurring in its social content, ranks the candidates by their distance from the centroid of the seeds, and returns the top candidates.
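The seed-centroid ranking step can be sketched as follows. The plain bag-of-words vectors, cosine similarity, and example texts are simplifications I introduce for illustration; the paper uses richer mixed syntactic-semantic features.

```python
# Rank candidate entities by similarity of their term vectors
# to the centroid of the seed entities' term vectors.
import math
from collections import Counter

def vec(text):
    return Counter(text.lower().split())

def cosine(a, b):
    dot = sum(a[w] * b[w] for w in a)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def centroid(vectors):
    c = Counter()
    for v in vectors:
        c.update(v)
    return Counter({w: n / len(vectors) for w, n in c.items()})

def rank_candidates(seed_texts, candidate_texts, top_k=2):
    c = centroid([vec(t) for t in seed_texts])
    ranked = sorted(candidate_texts, key=lambda t: cosine(vec(t), c), reverse=True)
    return ranked[:top_k]

seeds = ["fashion designer milan runway", "emerging designer fashion week"]
candidates = ["young designer debuts at fashion week",
              "stock market closes higher",
              "runway show in milan"]
print(rank_candidates(seeds, candidates))
```

The iterative variant studied in the paper simply feeds the returned top candidates back in as the seeds of the next round.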

Based on this foundational idea, we explored the possibility of running our method iteratively, using the results as new seeds. In this paper we address the following research questions:

  1. How does the reconstructed domain knowledge evolve if the candidates of one extraction are recursively used as seeds?
  2. How does the reconstructed domain knowledge spread geographically?
  3. Can the method be used to inspect the past, present, and future of knowledge?
  4. Can the method be used to find emerging knowledge?

This is the presentation given at the conference:

This work was presented at The Web Conference 2018, in the Modeling Social Media (MSM) workshop.

The paper is in the official proceedings of the conference through the ACM Digital Library.

You can also find here a PDF preprint version of “Iterative Knowledge Extraction from Social Networks” by Brambilla et al.

 

IEEE Big Data Conference 2017: take home messages from the keynote speakers

I collected here the list of my write-ups of the first three keynote speeches of the conference:

Driving Style and Behavior Analysis based on Trip Segmentation over GPS Information through Unsupervised Learning

Over one billion cars interact with each other on the road every day. Each driver has his own driving style, which could impact safety, fuel economy and road congestion. Knowledge about the driving style of the driver could be used to encourage “better” driving behaviour through immediate feedback while driving, or by scaling auto insurance rates based on the aggressiveness of the driving style.
In this work we report on our study of driving behaviour profiling based on unsupervised data mining methods. The main goal is to detect the different driving behaviours, and thus to cluster drivers with similar behaviour. This paves the way to new business models in the driving sector, such as Pay-How-You-Drive insurance policies and car rentals. Here is the presentation I gave on this topic:

Driver behavioural characteristics are studied by collecting information from GPS sensors on the cars and by applying three different analysis approaches (DP-means, Hidden Markov Models, and Behavioural Topic Extraction) to the contextual scene detection problem on car trips, in order to detect different behaviours along each trip. Subsequently, drivers are clustered into similar profiles based on these segments, and the results are compared with a human-defined ground-truth classification of the drivers.

The proposed framework is tested on a real dataset containing sampled car signals. While the different approaches show relevant differences in trip segment classification, the coherence of the final driver clustering results is surprisingly high.
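Of the three approaches, DP-means is the simplest to sketch: unlike k-means, it does not fix the number of clusters in advance, but spawns a new cluster whenever a point lies farther than a threshold λ from every existing centroid. The per-segment features (mean speed, mean absolute acceleration) and the threshold below are made-up values for illustration, not the signals or parameters used in the paper.

```python
# Toy DP-means over per-trip-segment feature vectors.
import math

def dpmeans(points, lam, iters=10):
    centroids = [points[0]]
    assign = [0] * len(points)
    for _ in range(iters):
        for i, p in enumerate(points):
            dists = [math.dist(p, c) for c in centroids]
            j = min(range(len(dists)), key=dists.__getitem__)
            if dists[j] > lam:                 # too far from all clusters: spawn a new one
                centroids.append(p)
                assign[i] = len(centroids) - 1
            else:
                assign[i] = j
        for j in range(len(centroids)):        # recompute centroids as member means
            members = [p for p, a in zip(points, assign) if a == j]
            if members:
                centroids[j] = tuple(sum(col) / len(members) for col in zip(*members))
    return assign, centroids

# (mean speed km/h, mean |acceleration| m/s^2) per trip segment
segments = [(30, 0.5), (32, 0.6), (110, 1.8), (115, 2.0), (31, 0.4)]
labels, centers = dpmeans(segments, lam=20)
print(labels)
```

With this data the city-speed and highway-speed segments fall into two clusters; in the framework, drivers are then profiled by how their trips distribute over such segment clusters.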

 


This work has been published at the 4th IEEE Big Data Conference, held in Boston in December 2017. The full paper can be cited as:

M. Brambilla, P. Mascetti and A. Mauri, “Comparison of different driving style analysis approaches based on trip segmentation over GPS information,” 2017 IEEE International Conference on Big Data (Big Data), Boston, MA, 2017, pp. 3784-3791.
doi: 10.1109/BigData.2017.8258379

You can download the full paper PDF from the IEEE Xplore Digital Library, at this URL:

https://ieeexplore.ieee.org/document/8258379/

If you are interested in further contributions at the conference, here you can find my summaries of the keynote speeches on human-in-the-loop machine learning and on increasing human perception through text mining.

How Fashionable is Digital Data-Driven Fashion?

Within the context of our data science research track, we have been involved a lot in fashion industry problems recently.

We already showcased some studies in fashion, for instance related to the analysis of the Milano Fashion Week events and their social media impact.

Starting this year, we are also involved in a research and innovation project called FaST – Fashion Sensing Technology. FaST is a project meant to design, experiment with, and implement an ICT tool that monitors and analyzes the activity of emerging Italian fashion brands on social media. FaST aims at providing SMEs in the fashion industry with the ability to better understand and measure the behaviours and opinions of consumers on social media, through the study of the interactions between brands and their communities, as well as at supporting brands’ strategic business decisions.

Given the importance of fashion as an economic and cultural resource for the Lombardy Region and for Italy as a whole, the project aims at leveraging the opportunities offered by the creation of a hybrid fashion-digital value chain, in order to design a tool that allows the codification of new organizational models. Furthermore, the project wants to promote process innovation within the fashion industry with a customer-centric approach, as well as the design of services that could update and innovate both creative processes and the retail channel which, as of today, is the core of the sustainability and competitiveness of brands and companies on domestic and international markets.

Within the project, we study social presence and digital / communication strategies of brands, and we will look for space for optimization. We are already crunching a lot of data and running large scale analyses on the topic. We will share our exciting results as soon as available!

 

Acknowledgements

FaST – Fashion Sensing Technology is a project supported by Regione Lombardia through the European Regional Development Fund (grant: “Smart Fashion & Design”). The project is being developed by Politecnico di Milano – Design dept. and Electronics, Information and Bioengineering dept. – in collaboration with Wemanage Group, Studio 4SIGMA, and CGNAL.


Using Crowdsourcing for Domain-Specific Languages Specification

In the context of Domain-Specific Modeling Language (DSML) development, the involvement of end-users is crucial to assure that the resulting language satisfies their needs.

In our paper presented at SLE 2017 in Vancouver, Canada, on October 24th within the SPLASH Conference context, we discuss how crowdsourcing tasks can be exploited to assist in domain-specific language definition processes. This is in line with the vision towards the cognification of model-driven engineering.

The slides are available on slideshare:

 

Indeed, crowdsourcing has emerged as a novel paradigm where humans are employed to perform computational and information collection tasks. In language design, by relying on the crowd, it is possible to show an early version of the language to a wider spectrum of users, thus increasing the validation scope and eventually promoting its acceptance and adoption.

Ready to accept improper use of your tools?

We propose a systematic (and automatic) method for creating crowdsourcing campaigns aimed at refining the graphical notation of DSMLs. The method defines a set of steps to identify, create and order the questions for the crowd. As a result, developers are provided with a set of notation choices that best fit end-users’ needs. We also report on an experiment validating the approach.

Improving the quality of the language notation may dramatically improve acceptance and adoption, as well as the way people use the notation and the associated tools.

Essentially, our idea is to submit to the crowd a set of questions regarding the concrete syntax of visual modeling languages and to collect opinions. Based on different strategies, we generate an optimal notation and then check how good it is.
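One of the simplest possible aggregation strategies is a per-concept majority vote over the workers' answers. The concepts and notation alternatives below are hypothetical examples, not the questions from our experiment, and the paper compares several strategies beyond plain majority.

```python
# Pick, for each language concept, the notation alternative
# most frequently chosen by crowd workers.
from collections import Counter

def majority_notation(votes):
    """votes: {concept: [alternative chosen by each worker]} -> {concept: winner}"""
    return {concept: Counter(choices).most_common(1)[0][0]
            for concept, choices in votes.items()}

votes = {
    "task":    ["rounded-box", "rounded-box", "circle"],
    "gateway": ["diamond", "diamond", "diamond"],
}
print(majority_notation(votes))
```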

In the paper we also validate the approach and experiment with it in a practical use case, namely the study of some variations of the BPMN modeling language.

The full paper can be found here: https://dl.acm.org/citation.cfm?doid=3136014.3136033. The paper is titled: “Better Call the Crowd: Using Crowdsourcing to Shape the Notation of Domain-Specific Languages” and was co-authored by Marco Brambilla, Jordi Cabot, Javier Luis Cánovas Izquierdo, and Andrea Mauri.

You can also access the Web version on Jordi Cabot blog.

The artifacts described in this paper are also referenced on findresearch.org, namely referring to the following materials:

Myths and Challenges in Knowledge Extraction and Big Data Analysis

For centuries, science (in German “Wissenschaft”) has aimed to create (“schaften”) new knowledge (“Wissen”) from the observation of physical phenomena, their modelling, and empirical validation.

Recently, a new source of knowledge has emerged: not (only) the physical world any more, but the virtual world, namely the Web with its ever-growing stream of data materialized in the form of social network chattering, content produced on demand by crowds of people, messages exchanged among interlinked devices in the Internet of Things. The knowledge we may find there can be dispersed, informal, contradicting, unsubstantiated and ephemeral today, while already tomorrow it may be commonly accepted.

The challenge is once again to capture and create consolidated knowledge that is new, has not been formalized yet in existing knowledge bases, and is buried inside a big, moving target (the live stream of online data).

The myth is that existing tools (spanning fields like the semantic web, machine learning, statistics, NLP, and so on) suffice for the objective. While this is still far from true, some existing approaches are actually addressing the problem and provide preliminary insights into the possibilities that successful attempts may open up.

I gave a few keynote speeches on this matter (at ICEIS, KDWEB, …), and I also use this argument as a motivating class in academic courses, to let students understand how crucial it is to focus on the problems related to big data modeling and analysis. The talk, reported in the slides below, explores, through real industrial use cases, the mixed realistic-utopian domain of data analysis and knowledge extraction, and reports on some tools and cases where the digital and physical worlds have been brought together for a better understanding of our society.

The presentation is available on SlideShare and is reported here below:

Urban Data Science Bootcamp

We organize a crash-course on how the science of urban data can be applied to solve metropolitan issues.


The course is a 2-day face-to-face event with teaching sessions, workshops, case study discussions, and hands-on activities for non-IT professionals in the field of city management. It is offered in two editions during the year:

  • in Milan, Italy, on November 8th-9th, 2017
  • in Amsterdam, The Netherlands, on November 30th-December 1st, 2017.

You can download the flyer and program of the Urban datascience bootcamp 2017.

Ideal participants include civil servants, professionals, students, urban planners, and managers of city utilities and services. No previous experience in data science or computer science is required. Attendees should have experience in areas such as economic affairs, urban development, management support, strategy & innovation, health & care, or public order & safety.

Data is the catalyst needed to make the smart city vision a reality in a transparent and evidence-based (i.e., data-driven) manner. The skills required for data-driven urban analysis and design activities are diverse, ranging from data collection (field work, crowdsensing, physical sensor processing, etc.), to data processing with established big data technology frameworks, data exploration to find patterns and outliers in spatio-temporal data streams, and data visualization conveying the right information in the right manner.

The CrowdInsights professional school “Urban Data Science Bootcamp” provides a no-frills, hands-on introduction to the science of urban data; from data creation to data analysis, data visualization, and sense-making, the bootcamp will introduce more than 10 real-world use cases that exemplify how urban data can be applied to solve metropolitan issues. Attendees will explore the challenges and opportunities that come from the adoption of novel types of urban data sources, including social media, mobile phone data, IoT networks, etc.