Coronavirus stories and data

Coronavirus COVID-19 is an extreme challenge for our society, economy, and individual lives. However, governments should have learnt from each other. The impact has been spreading slowly across countries. There has been plenty of time to take action. But apparently people and governments can’t grasp the risk until it’s upon them. And the way European and American governments are acting is too slow and incremental.

I live in Italy, where we rank second in the world for healthcare quality. “This won’t happen here” was the attitude at the beginning of this challenge, and look at what happened. Here I’m sharing two links to articles that offer a data-driven view but also cover the human, psychological and behavioural aspects involved. They are two simple stories that report the Italian perspective on the virus.

Coronavirus Stories From Italy

And why now it’s the time for YOU to worry, fellow Europeans and Americans

#Coronavirus: Updates from the Italian Front

A preview of what will happen in a week in the rest of the world. Things have dramatically changed in our society

Data Science for Business Innovation. Live courses for executives and managers in Italy and The Netherlands

Starting October 2019, we open a new opportunity for companies:

a 2-day hands-on course on Data-driven innovation for executives and managers.

The course is specially developed for executives, managers, and decision-makers who need to master the foundations of data analysis in order to take informed decisions on data-driven business, innovation paths and strategies within the enterprise. It consists of keynotes, success stories, and quick introductory lectures spanning big data, machine learning, data valorization and communication. The course covers terminology and concepts, tools and methods, use cases and success stories of data science applications.

The course explains what value Data Science can create, what Data Science can solve, what the difference is between descriptive, predictive and prescriptive analytics, and what the roles of machine learning and artificial intelligence are.

The teaching style will be very practical, with use cases, hands-on sessions, workgroup activities, and networking sessions for applying what you learn directly on real projects.

The live events will be:

If you are interested, you can visit the pages for the Italian [ITA] and English [ENG] editions respectively, and/or download the detailed brochures:

You can always get in touch to ask for more details.

Similar initiatives that we held in the past include the Urban Data Science Bootcamp, delivered in Milano and Amsterdam in 2017 (see a Medium story on the event here to get a sense of the style and activities, keeping in mind that the examples reported there concern the specific smart city sector).

The event is also integrated with an online mini MOOC available on Coursera.

The course is offered by Politecnico di Milano in collaboration with Cefriel and EIT Digital.

 

Are open source projects governed by rich clubs?

The network of collaborations in an open source project can reveal relevant emergent properties that influence its prospects of success.

In our recent joint work with the Open University of Catalunya / ICREA, we analyze open source projects to determine whether they exhibit rich-club behavior, that is, a phenomenon where contributors with a high number of collaborations (i.e., strongly connected within the collaboration network) are likely to cooperate with other well-connected individuals.
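To make the definition concrete, here is a minimal sketch of the standard, unnormalized rich-club coefficient: among the nodes with degree greater than k, it measures the fraction of possible links that actually exist. The contributor names and the toy network are invented for illustration, and the code is not taken from the paper.

```python
from collections import defaultdict
from itertools import combinations


def rich_club_coefficient(edges, k):
    """Unnormalized rich-club coefficient phi(k) of an undirected graph.

    phi(k) = 2 * E_k / (N_k * (N_k - 1)), where N_k is the number of nodes
    with degree greater than k and E_k is the number of edges among them.
    """
    # Build adjacency sets, ignoring self-loops and duplicate edges.
    neighbors = defaultdict(set)
    for u, v in edges:
        if u != v:
            neighbors[u].add(v)
            neighbors[v].add(u)

    rich = {node for node, adj in neighbors.items() if len(adj) > k}
    if len(rich) < 2:
        return 0.0  # convention chosen here: fewer than two nodes form no club

    links = sum(1 for u, v in combinations(rich, 2) if v in neighbors[u])
    return 2 * links / (len(rich) * (len(rich) - 1))


# Hypothetical collaboration network: three core developers who all work
# together, plus occasional contributors attached to a single core member.
collaborations = [
    ("ana", "bo"), ("ana", "cy"), ("bo", "cy"),
    ("ana", "dee"), ("bo", "eli"), ("cy", "fay"), ("cy", "gus"),
]

print(rich_club_coefficient(collaborations, k=2))
```

In this toy network the three core developers are fully connected to each other, so the coefficient for k=2 is 1.0. Published analyses usually also normalize the value against randomized graphs with the same degree sequence; that step is left out of this sketch.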

The presence or absence of a rich-club has an impact on the sustainability and robustness of the project. In fact, if a member of the rich club leaves the project, it is easier for other members of the rich club to take over. Fewer collaborations would require more effort from more users.

The work has been presented at OpenSym 2019, the 15th International Symposium on Open Collaboration, in Skövde (Sweden), on August 20-22, 2019.

The full paper is available on the conference Web Site (or locally here), and the slides presenting our results are available on Slideshare:

For this analysis, we build and study a dataset with the 100 most popular projects in GitHub, exploiting connectivity patterns in the graph structure of collaborations that arise from commits, issues and pull requests. Results show that rich-club behavior is present in all the projects, but only a few of them have an evident club structure.
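As an illustration of how collaboration graphs can arise from commits, issues and pull requests, the sketch below links two users whenever they are active on the same item. That linking rule, the record format and all the names are assumptions made for this example; the paper may define collaborations differently.

```python
from collections import defaultdict
from itertools import combinations

# Hypothetical activity records: (layer, item identifier, user).
events = [
    ("commits", "src/app.js", "ana"),
    ("commits", "src/app.js", "bo"),
    ("issues", "#12", "bo"),
    ("issues", "#12", "cy"),
    ("issues", "#12", "ana"),
    ("pull_requests", "#15", "cy"),
    ("pull_requests", "#15", "dee"),
]


def collaboration_layers(events):
    """Return {layer: set of user pairs}, linking users who share an item."""
    participants = defaultdict(set)
    for layer, item, user in events:
        participants[(layer, item)].add(user)

    layers = defaultdict(set)
    for (layer, _item), users in participants.items():
        # Sorting makes each undirected pair appear in one canonical order.
        for pair in combinations(sorted(users), 2):
            layers[layer].add(pair)
    return dict(layers)


layers = collaboration_layers(events)
overall = set().union(*layers.values())  # all layers merged into one graph

for name, edges in layers.items():
    print(name, sorted(edges))
print("overall", sorted(overall))
```

Keeping one edge set per source makes it possible to study each layer on its own and then merge them into a single interaction graph.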

For instance, this network of contributors for the Materialize project seems to go against the open source paradigm. The project is “owned” by very few users:

Established in 2014 by a team of 4 developers, at the time of the analysis it featured 3,853 commits and 252 contributors. Nevertheless, the project only has two top contributors (with more than 1,000 commits), who belong to the original team, and no other frequent contributors.

For all the projects, we compute coefficients both for single source graphs and the overall interaction graph, showing that rich-club behavior varies across different layers of software development. We provide possible explanations of our results, as well as implications for further analysis.

Data Science for Business Innovation. A new MOOC on Coursera

Breaking news!

We just published our new MOOC “Data Science for Business Innovation” on Coursera!

Our course is available for free on Coursera and is jointly offered by Politecnico di Milano and EIT Digital, as a compendium of the must-have expertise in data science for non-technical people, including executives and middle managers, to foster data-driven innovation.

The course is an introductory, non-technical overview of the concepts of data science.

You can enrol in the first edition of the course starting today.

The course is completely free and you can enjoy content at any time, with professional English speakers and animated, engaging materials.

Here is a short intro to the course:

The course consists of introductory lectures spanning big data, machine learning, data valorization and communication.
All the remaining details can be found on Coursera:


Topics cover the essential concepts and intuitions on data needs, data analysis, machine learning methods, respective pros and cons, and practical applicability issues. The course covers terminology and concepts, tools and methods, use cases and success stories of data science applications.

The course explains what Data Science is and why it is so hyped. It discusses the value that Data Science can create, the main classes of problems that Data Science can solve, the difference between descriptive, predictive and prescriptive analytics, and the roles of machine learning and artificial intelligence.

From a more technical perspective, the course covers supervised, unsupervised and semi-supervised methods, and explains what can be obtained with classification, clustering, and regression techniques. It discusses the role of NoSQL data models and technologies, and the role and impact of scalable cloud-based computation platforms.

All topics are covered with example-based lectures, discussing use cases, success stories and realistic examples.

If you are interested in these topics, feel free to look at it on Coursera.

We look forward to seeing you there!

Content-based Classification of Political Inclinations of Twitter Users

Social networks are huge continuous sources of information that can be used to analyze people’s behavior and thoughts.

Our goal is to extract such information and predict political inclinations of users.

In particular, we investigate the importance of syntactic features of texts written by users when they post on social media. Our hypothesis is that people belonging to the same political party write in similar ways, thus they can be classified properly on the basis of the words that they use.

We analyze tweets because Twitter is commonly used in Italy for discussing politics; moreover, it provides an official API that can be easily exploited for data extraction. Many classifiers were applied to different kinds of features and NLP vectorization methods in order to obtain the best method capable of confirming our hypothesis.
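The hypothesis can be illustrated with one of the simplest word-based setups: a bag-of-words representation fed to a multinomial Naive Bayes classifier. This is only a sketch of the general idea, not the method selected in the paper; the party labels and example texts are invented.

```python
import math
from collections import Counter, defaultdict


def tokenize(text):
    """Lowercase bag-of-words tokens; a deliberately crude vectorization."""
    words = (w.strip(".,!?#@:;\"'") for w in text.lower().split())
    return [w for w in words if w]


class NaiveBayes:
    """Multinomial Naive Bayes with add-one (Laplace) smoothing."""

    def fit(self, texts, labels):
        self.word_counts = defaultdict(Counter)
        self.class_counts = Counter(labels)
        for text, label in zip(texts, labels):
            self.word_counts[label].update(tokenize(text))
        self.vocabulary = {w for c in self.word_counts.values() for w in c}
        return self

    def predict(self, text):
        total_docs = sum(self.class_counts.values())
        scores = {}
        for label, n_docs in self.class_counts.items():
            counts = self.word_counts[label]
            denom = sum(counts.values()) + len(self.vocabulary)
            score = math.log(n_docs / total_docs)  # class prior
            for word in tokenize(text):
                if word in self.vocabulary:  # unseen words are skipped
                    score += math.log((counts[word] + 1) / denom)
            scores[label] = score
        return max(scores, key=scores.get)


# Invented training texts for two hypothetical parties.
texts = [
    "lower taxes for small business",
    "cut taxes and support business",
    "protect the environment and public health",
    "invest in public schools and health",
]
labels = ["party_a", "party_a", "party_b", "party_b"]

model = NaiveBayes().fit(texts, labels)
print(model.predict("taxes on business"))
print(model.predict("public health funding"))
```

A real experiment would compare several vectorizations and classifiers on held-out users, as the paragraph above describes.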

To evaluate their accuracy, a set of current Italian deputies with consistent activity on Twitter has been selected as ground truth, and we have then predicted their political party. Using the results of our analysis, we also got interesting insights into current Italian politics. Here are the clusters of users:


Results in understanding political alignment are quite good, as reported in the confusion matrix.
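For readers unfamiliar with confusion matrices, the sketch below shows how one is built from true and predicted labels, with rows for the true class and columns for the predicted class. The labels are invented and are not results from the paper.

```python
from collections import Counter


def confusion_matrix(true_labels, predicted_labels):
    """Count (true, predicted) pairs; rows are true classes."""
    labels = sorted(set(true_labels) | set(predicted_labels))
    pairs = Counter(zip(true_labels, predicted_labels))
    matrix = [[pairs[(t, p)] for p in labels] for t in labels]
    return labels, matrix


# Invented labels for six accounts across three hypothetical parties.
truth = ["A", "A", "B", "B", "C", "C"]
predicted = ["A", "B", "B", "B", "C", "A"]

labels, matrix = confusion_matrix(truth, predicted)
# Accuracy is the share of predictions on the diagonal.
accuracy = sum(matrix[i][i] for i in range(len(labels))) / len(truth)

for label, row in zip(labels, matrix):
    print(label, row)
print("accuracy", round(accuracy, 2))
```

Off-diagonal cells show which parties get confused with each other, which is often more informative than the accuracy alone.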

Our study is described in detail in the paper published in the IEEE Big Data 2018 conference and linked at:

DOI: 10.1109/BigData.2018.8622040

The article can be downloaded here, if you don’t have access to the IEEE library.

You can also look at the slides on SlideShare:

You can cite the paper as follows:

M. Di Giovanni, M. Brambilla, S. Ceri, F. Daniel and G. Ramponi, “Content-based Classification of Political Inclinations of Twitter Users,” 2018 IEEE International Conference on Big Data (Big Data), Seattle, WA, USA, 2018, pp. 4321-4327.
doi: 10.1109/BigData.2018.8622040


News Sharing Behaviour on Twitter. A Dataset and a Pipeline

Online social media are changing the news industry and revolutionizing the traditional role of journalists and newspapers. In this scenario, investigating the behaviour of users in relationship to news sharing is relevant, as it provides means for understanding the impact of online news, their propagation within social communities, their impact on the formation of opinions, and also for effectively detecting individual stance relative to specific news or topics, as well as for understanding the role of journalism today.

Our contribution is two-fold.

First, we build a robust pipeline for collecting datasets describing news sharing; the pipeline takes as input a list of news sources and generates a large collection of articles, of the accounts that provide them on the social media either directly or by retweeting, and of the social activities performed by these accounts.

The dataset is published on Harvard Dataverse:

https://doi.org/10.7910/DVN/5XRZLH

Second, we also provide a large-scale dataset that can be used to study the social behavior of Twitter users and their involvement in the dissemination of news items. Finally, we show an application of our data collection in the context of political stance classification and we suggest other potential usages of the presented resources.

The code is published on GitHub:

https://github.com/DataSciencePolimi/NewsAnalyzer

The details of our approach are published in a paper at ICWSM 2019, accessible online.

You can cite the paper as:

Giovanni Brena, Marco Brambilla, Stefano Ceri, Marco Di Giovanni, Francesco Pierri, Giorgia Ramponi. News Sharing User Behaviour on Twitter: A Comprehensive Data Collection of News Articles and Social Interactions. AAAI ICWSM 2019, pp. 592-597.

Slides are on Slideshare:

You can also download a summary poster.


 

Brand Community Analysis using Graph Representation Learning on Social Networks – with a Fashion Case

In an increasingly connected world, new and complex interaction patterns can be extracted from the communication between people.

This is extremely valuable for brands, which can better understand the interests of users and the trends on social media to better target their products. In this paper, we aim to analyze the communities that arise around commercial brands on social networks to understand the meaning of similarity, collaboration, and interaction among users.

We exploit the network that builds around the brands by encoding it into a graph model. We build a social network graph, considering user nodes and friendship relations; then we compare it with a heterogeneous graph model, where posts and hashtags are also considered as nodes and connected to the different node types; finally, we also build a reduced network, generated by inducing direct user-to-user connections through the intermediate nodes (posts and hashtags). These different variants are encoded using graph representation learning, which generates a numerical vector for each node. Machine learning techniques are applied to these vectors to extract valuable insights for each user and for the communities they belong to.
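The reduced network can be pictured as a projection of the heterogeneous graph: two users become directly connected when they share a post or a hashtag. The sketch below shows that step only; the edge weighting (one unit per shared intermediate node), the `post:` and `tag:` prefixes and all the names are assumptions made for this example, and the embedding step is not covered.

```python
from collections import defaultdict
from itertools import combinations

# Hypothetical heterogeneous edges: (user, intermediate node).
# Intermediate nodes are posts or hashtags, told apart by a prefix.
user_links = [
    ("ana", "post:1"), ("bo", "post:1"),
    ("ana", "tag:streetwear"), ("cy", "tag:streetwear"),
    ("bo", "tag:streetwear"),
    ("dee", "tag:vintage"),
]


def reduce_to_user_graph(user_links):
    """Induce weighted user-to-user edges through shared posts or hashtags."""
    users_by_item = defaultdict(set)
    for user, item in user_links:
        users_by_item[item].add(user)

    weights = defaultdict(int)
    for users in users_by_item.values():
        for pair in combinations(sorted(users), 2):
            weights[pair] += 1  # one unit per shared intermediate node
    return dict(weights)


reduced = reduce_to_user_graph(user_links)
for (u, v), weight in sorted(reduced.items()):
    print(u, v, weight)
```

Users who share nothing with anyone else (here `dee`) simply do not appear in the reduced network.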

We report on our experiments performed on an emerging fashion brand on Instagram, and we show that our approach is able to discriminate potential customers for the brand, and to highlight meaningful sub-communities composed of users that share the same kind of content on social networks.

The use case is taken from a joint research project with the Fashion in Process group in the Design Department of Politecnico di Milano, within the framework of FAST (Fashion Sensing Technology).

This study has been published as part of ACM SAC 2019, Cyprus.

Here is the slideset presenting the idea:

The paper can be referenced as:

Marco Brambilla, Mattia Gasparini: Brand Community Analysis On Social Networks Using Graph Representation Learning. ACM Symposium on Applied Computing (SAC) 2019, pp. 2060-2069.

The link to the officially published paper in the ACM Library will be available shortly.

A sneak peek at the European Union Ethics Guidelines for AI

A few days ago, politico.eu published a preview of the document that the European Union will issue as guidance for ethical issues related to artificial intelligence and machine learning.

The document was written by the High-level Expert Group on Artificial Intelligence, appointed by the European Commission.

This advance version of the document is available online now for a sneak peek.

The official version will be released shortly.

Besides the actual and technical content, this step is important as a principle too, because a governmental institution rarely feels the need to take such positions on scientific/technical evolution. This pronouncement makes it clear how strategic and crucial AI and ML are deemed today, also from a political perspective.

If you want to read more about Europe’s take on AI, you can also read this article on Medium.

Improving Topic Modeling with Knowledge Graph Embeddings

Topic modeling techniques have been applied in many scenarios in recent years, spanning textual content, as well as many different data sources. Existing research in this field continuously tries to improve the accuracy and coherence of the results. Some recent works propose new methods that bring the semantic relations between words into the topic modeling process, by employing vector embeddings over knowledge bases.

In our recent paper presented at the AAAI-MAKE Spring Symposium 2019, held at Stanford University, we studied how knowledge graph embeddings affect topic modeling performance on textual content. In particular, the objective of the work is to determine which aspects of knowledge graph embedding have a significant and positive impact on the accuracy of the extracted topics.

We improve the state of the art by integrating some advanced graph embedding approaches (specifically designed for knowledge graphs) within the topic extraction process.
We also studied how the knowledge base could be expanded by using dataset-specific relations between the words.
We implemented the method and we validated it with a set of experiments with 2 variations of the knowledge base, 7 embedding methods, and 2 methods for incorporation of the embeddings into the topic modeling framework, also considering different parameterizations of topic number and embedding dimensionality.
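
To give a sense of the size of this experimental setup, the sketch below enumerates the combinations of knowledge base variant, embedding method and incorporation method. The labels are placeholders, since only the counts are stated here; topic number and embedding dimensionality would multiply the grid further.

```python
from itertools import product

# Placeholder labels: the post gives the counts, not the names.
knowledge_bases = ["kb_variant_1", "kb_variant_2"]
embedding_methods = [f"embedding_method_{i}" for i in range(1, 8)]
incorporation_methods = ["incorporation_1", "incorporation_2"]

# Every combination of the three choices is one experimental configuration.
configurations = list(
    product(knowledge_bases, embedding_methods, incorporation_methods)
)

print(len(configurations))  # 2 * 7 * 2 = 28
print(configurations[0])
```
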
Besides the specific technical results, the work also aims at showing the potential of integrating statistical methods with knowledge-centric methods. The full extent of the impact of these techniques shall be explored further in the future.
The details of the work are reported in the paper, which is available online here, and in the slides, also available online (on SlideShare and here below).

Possible Theses in Data Science

Here is a presentation that summarizes some of the relevant topics currently available for theses within the Data Science Lab under my supervision.

Feel free to get in touch in case you are interested.