Large-Scale Analysis of On-line Conversation about Vaccines before COVID-19

Frequent words and co-occurrences used by pro-vaccination and anti-vaccination communities.

In this study, we map the Twitter discourse around vaccinations in English over four years, in order to:

  • discover the volumes and trends of the conversation;
  • compare the discussion on Twitter with newspapers’ content; and
  • classify people as pro- or anti-vaccination and explore how their behavior differs.

Datasets. We collected four years of Twitter data (January 2016 – January 2020) about vaccination, before the advent of the Covid-19 pandemic, using three keywords: ’vaccine’, ’vaccination’, and ’immunization’, obtaining around 6.5 million tweets. The collection has been analyzed across multiple dimensions and aspects.

General Analysis. The analysis shows that the number of tweets related to the topic increased through the years, peaking in 2019. Among other factors, we identified the 2019 measles outbreak as one of the main reasons for the growth, given the correlation of the tweet volume with CDC (Centers for Disease Control and Prevention) data on measles cases in the United States in 2019 and with the high number of newspaper articles on the topic, both of which increased significantly in 2019. Other demographic, space-time, and content analyses have been performed too.

Subjects. Besides the general data analysis, we considered a number of specific topics often addressed within the vaccine conversation, such as the flu vaccine, HPV, polio, and others. We identified the temporal trends and performed specific analyses related to these subjects, also in connection with the respective media coverage.

News Sources. We analyzed the news sources most cited in the tweets, which include YouTube, NaturalNews (which is generally considered a biased and fake news website), and Facebook. Overall, among the most cited sources, 32% can be labeled as reliable and 25% as conspiracy/fake news sources. Furthermore, 32% of the references point to social networks (including YouTube). This analysis shows how social media and unreliable sources of information frequently drive vaccine-related conversation on Twitter.
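As a rough illustration of how such shares could be computed, the sketch below groups the URLs cited in tweets by domain and by category. The `DOMAIN_CATEGORY` map, the `cdc.gov` entry, and the function names are assumptions made for this example; the actual labelling of sources used in the study is not listed in this post.

```python
from collections import Counter
from urllib.parse import urlparse

# Hypothetical category map, for illustration only: the real labelling of
# sources used in the study is not given in this post.
DOMAIN_CATEGORY = {
    "youtube.com": "social",
    "facebook.com": "social",
    "naturalnews.com": "conspiracy/fake",
    "cdc.gov": "reliable",
}


def domain_of(url):
    """Return the host of a URL, lowercased and without a leading 'www.'."""
    host = urlparse(url).netloc.lower()
    return host[4:] if host.startswith("www.") else host


def category_shares(urls):
    """Map each category to its share of the given URLs (unknown hosts -> 'other')."""
    counts = Counter(DOMAIN_CATEGORY.get(domain_of(u), "other") for u in urls)
    total = sum(counts.values())
    return {cat: n / total for cat, n in counts.items()} if total else {}
```

Calling `category_shares` on the list of links extracted from the tweets would return a dictionary of fractions, one per category.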

User Stance. We applied stance analysis to the authors of the tweets, to determine each user’s orientation toward a given (pre-chosen) target of interest. Our initial content analysis revealed that a large amount of the content is of a satirical or derisive nature, causing a number of classification techniques to perform poorly on the dataset. Given that other studies considered the presence of stance-indicative hashtags an effective way to discover polarized tweets and users, we applied a rule-based classification, based on a selection of 100+ hashtags that allowed us to automatically classify a tweet as pro-vaccination or vaccination-skeptic, obtaining a total of 250,000+ classified tweets over the 4 years.
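A minimal sketch of a hashtag rule of this kind is shown below. The two hashtag sets are placeholders chosen for the example, not the 100+ hashtags actually selected in the study, and leaving tweets that match both sides unclassified is our assumption.

```python
import re

# Placeholder hashtag sets (assumption): the study used a selection of 100+
# hashtags that is not listed in this post.
PRO_HASHTAGS = {"#vaccineswork", "#vaccinessavelives"}
ANTI_HASHTAGS = {"#vaccineinjury", "#novaccines"}

HASHTAG_RE = re.compile(r"#\w+")


def classify_tweet(text):
    """Return 'pro', 'anti', or None when the hashtags give no clear signal."""
    tags = {tag.lower() for tag in HASHTAG_RE.findall(text)}
    has_pro = bool(tags & PRO_HASHTAGS)
    has_anti = bool(tags & ANTI_HASHTAGS)
    if has_pro and not has_anti:
        return "pro"
    if has_anti and not has_pro:
        return "anti"
    # No stance hashtag, or hashtags from both sides: leave unclassified.
    return None
```

A rule like this ignores the wording of the tweet entirely, which is what makes it usable on satirical or derisive content.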

Share of pro- and anti- vaccine discourse in time. Pro-vaccine tweet volumes appear to be larger than anti-vaccine tweets and to increase over time.

The words used by the two groups of users to discuss vaccine-related topics are profoundly different, as are the sources of information they refer to. Anti-vaccine users cited mostly fake news websites and very few reliable sources, which are instead largely cited by pro-vaccine users. Social media (primarily YouTube) represent a large portion of linked content in both cases.

Additionally, we performed demographic (age, gender, ethnicity) and spatial analyses over the two categories of users, with the aim of understanding the features of the two communities. Our analysis also shows to what extent the different U.S. states are polarized for or against vaccination on Twitter.

Stance of US states towards vaccination.

A video presenting our research is available on YouTube:

This work has been presented at the IC2S2 conference.

The cover image by NIAID is licensed under CC BY 2.0.

Are open source projects governed by rich clubs?

The network of collaborations in an open source project can reveal relevant emergent properties that influence its prospects of success.

In our recent joint work with the Open University of Catalunya / ICREA, we analyze open source projects to determine whether they exhibit rich-club behavior, that is, a phenomenon where contributors with a high number of collaborations (i.e., strongly connected within the collaboration network) are likely to cooperate with other well-connected individuals.

The presence or absence of a rich club has an impact on the sustainability and robustness of the project. In fact, if a member of the rich club leaves the project, it is easier for other members of the rich club to take over. Fewer collaborations would require more effort from more users.

The work has been presented at OpenSym 2019, the 15th International Symposium on Open Collaboration, in Skövde (Sweden), on August 20-22, 2019.

The full paper is available on the conference Web Site (or locally here), and the slides presenting our results are available on Slideshare:

For this analysis, we build and study a dataset with the 100 most popular projects on GitHub, exploiting connectivity patterns in the graph structure of collaborations that arise from commits, issues and pull requests. Results show that rich-club behavior is present in all the projects, but only a few of them have an evident club structure.

For instance, this network of contributors for the Materialize project seems to go against the open source paradigm. The project is “owned” by very few users:

Established in 2014 by a team of 4 developers, at the time of the analysis it featured 3,853 commits and 252 contributors. Nevertheless, the project only has two top contributors (with more than 1,000 commits), who belong to the original team, and no other frequent contributors.

For all the projects, we compute coefficients both for single source graphs and the overall interaction graph, showing that rich-club behavior varies across different layers of software development. We provide possible explanations of our results, as well as implications for further analysis.
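To make the notion of a rich-club coefficient concrete, here is a small sketch of the standard unnormalized definition, phi(k) = 2·E_k / (N_k·(N_k − 1)), where N_k is the number of nodes with degree greater than k and E_k is the number of edges among them. The exact variant and any normalization used in the paper are not stated in this post, so treat this as an illustration under that assumption; the function name is ours.

```python
from collections import defaultdict


def rich_club_coefficient(edges, k):
    """Unnormalized rich-club coefficient phi(k) of an undirected simple graph.

    edges: iterable of (u, v) pairs; self-loops and duplicate edges are ignored.
    Returns None when fewer than two nodes have degree greater than k.
    """
    neighbors = defaultdict(set)
    for u, v in edges:
        if u != v:
            neighbors[u].add(v)
            neighbors[v].add(u)

    # The "club" at level k: nodes whose degree is strictly greater than k.
    rich = {node for node, nbrs in neighbors.items() if len(nbrs) > k}
    if len(rich) < 2:
        return None

    # Each edge inside the club is seen once from each endpoint, hence // 2.
    links = sum(1 for u in rich for v in neighbors[u] if v in rich) // 2
    return 2 * links / (len(rich) * (len(rich) - 1))
```

A value close to 1 at high k means that the best-connected contributors form an almost complete subgraph among themselves.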

Brand Community Analysis using Graph Representation Learning on Social Networks – with a Fashion Case

In an increasingly connected world, new and complex interaction patterns can be extracted from the communication between people.

This is extremely valuable for brands, which can better understand the interests of users and the trends on social media in order to better target their products. In this paper, we aim to analyze the communities that arise around commercial brands on social networks to understand the meaning of similarity, collaboration, and interaction among users.

We exploit the network that builds around the brands by encoding it into a graph model. We build a social network graph, considering user nodes and friendship relations; then we compare it with a heterogeneous graph model, where posts and hashtags are also considered as nodes and connected to the different node types; finally, we build a reduced network, generated by inducing direct user-to-user connections through the intermediate nodes (posts and hashtags). These different variants are encoded using graph representation learning, which generates a numerical vector for each node. Machine learning techniques are applied to these vectors to extract valuable insights for each user and for the communities they belong to.
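The reduced network can be pictured as a projection of the heterogeneous graph onto its user nodes. The sketch below shows one common way to do this: two users are connected whenever they are linked to the same post or hashtag, and the edge weight counts how many such intermediate nodes they share. The weighting rule and the function name are assumptions for this example; the paper may induce the connections differently.

```python
from collections import defaultdict
from itertools import combinations


def induce_user_graph(user_item_edges):
    """Project user-item links onto a weighted user-to-user graph.

    user_item_edges: iterable of (user, item) pairs, where an item is a post
    or a hashtag. Returns {(user_a, user_b): shared_item_count}, with the two
    users of each key in sorted order.
    """
    users_by_item = defaultdict(set)
    for user, item in user_item_edges:
        users_by_item[item].add(user)

    weights = defaultdict(int)
    for users in users_by_item.values():
        # Every pair of users attached to the same item gets one more unit of weight.
        for a, b in combinations(sorted(users), 2):
            weights[(a, b)] += 1
    return dict(weights)
```

The resulting user-only graph can then be fed to the same embedding step as the other two variants.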

We report on our experiments performed on an emerging fashion brand on Instagram, and we show that our approach is able to discriminate potential customers for the brand, and to highlight meaningful sub-communities composed of users that share the same kind of content on social networks.

The use case is taken from a joint research project with the Fashion in Process group in the Design Department of Politecnico di Milano, within the framework of FAST (Fashion Sensing Technology).

This study has been published by Springer as part of ACM SAC 2019, Cyprus.

Here is the slideset presenting the idea:

The paper can be referenced as:

Marco Brambilla, Mattia Gasparini: Brand Community Analysis On Social Networks Using Graph Representation Learning. ACM Symposium on Applied Computing (SAC) 2019, pp. 2060-2069.

The link to the officially published paper in the ACM Library will be available shortly.

Possible Theses in Data Science

Here is a presentation that summarizes some of the relevant topics currently available for theses within the Data Science Lab under my supervision.

Feel free to get in touch in case you are interested.

Predictive Analysis on U.S. Midterm Elections on Twitter with RNN

We implemented an analysis (meaning both a method and a system) that aims to gauge local support for the two major US political parties in the 68 most competitive House of Representatives districts during the 2018 U.S. mid-term elections.

The analysis attempts to mirror the “Generic Ballot” poll, i.e., a survey of voters of a particular district which aims to measure the local popularity of national parties by querying participants on the likelihood they would vote for a “generic” Democrat or Republican candidate. We collect the tweets mentioning national parties and politicians in the 68 most competitive districts. By most competitive we mean that they are rated as toss-up (50%-50%) or lean by the Cook Political Report.

This means we are addressing an extremely challenging analysis and prediction problem, while disregarding the simpler cases (everyone is good at predicting the obvious!).

Our solution employs the Twitter Search API to query for tweets mentioning a national leader or party, posted from a limited geographic region (i.e., each specific congressional district). For example, the following query extracts tweets on Republicans:

TRUMP OR REPS OR Republicans OR Republican OR MCCCONNELL OR ‘MIKE PENCE’ OR ‘PAUL RYAN’ OR #Republicans OR #REPS OR @realDonaldTrump OR @SpeakerRyan OR @senatemajldr OR @VP OR GOP OR @POTUS

To limit the search to each congressional district, we use the geocode field in the search query of the API, which queries a circular area based on the coordinates of the center and the radius. Because of the irregular shape of the congressional districts, multiple queries are needed for each of them, therefore we built a custom set of bubbles that approximate the district shape.
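The sketch below illustrates how a set of bubbles could be turned into search parameters, using the `latitude,longitude,radius` format (with a `km` or `mi` unit) that the standard Twitter Search API expects in its `geocode` field. The function names and the idea of de-duplicating by tweet id are assumptions for this example: since the bubbles overlap, the same tweet may be returned by more than one query.

```python
def geocode_param(lat, lon, radius_km):
    """Format a circular area as the 'latitude,longitude,radius' geocode string."""
    return f"{lat},{lon},{radius_km}km"


def build_search_params(query, bubbles):
    """Build one set of search parameters per bubble approximating a district.

    bubbles: iterable of (lat, lon, radius_km) tuples (hypothetical values,
    to be drawn by hand or derived from the district boundaries).
    """
    return [
        {"q": query, "geocode": geocode_param(lat, lon, radius_km)}
        for lat, lon, radius_km in bubbles
    ]


def dedupe_by_id(tweets):
    """Keep the first occurrence of each tweet, since bubbles may overlap."""
    seen = set()
    unique = []
    for tweet in tweets:
        if tweet["id"] not in seen:
            seen.add(tweet["id"])
            unique.append(tweet)
    return unique
```

Each dictionary returned by `build_search_params` corresponds to one API request; the results of all requests for a district are merged and passed through `dedupe_by_id`.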

For the analysis of the tweets, we adopted a Recurrent Neural Network, namely an RNN-LSTM binary classifier trained on tweets.

To build training and testing data, we collected tweets of users with a clear political affiliation, including candidates, political activists, and also lesser-known users well versed in the political vernacular.
The accounts selected yielded around 280,000 tweets in 6 months before election day, labeled based on the author’s political affiliation.

Notice that the method is a general-purpose, language-independent political analysis framework that can be applied to any national or local context.

Further details and the results can be found on this Medium post.

This work has been published as a short scientific paper presented at the IEEE Big Data Conference in Seattle, WA, in December 2018, and described in a previous Medium post by Antonio Lopardo.

You can also download a poster format reporting the work:


In case you want to cite the work, you can do it in this way:

A. Lopardo and M. Brambilla, “Analyzing and Predicting the US Midterm Elections on Twitter with Recurrent Neural Networks,” 2018 IEEE International Conference on Big Data (Big Data), Seattle, WA, USA, 2018, pp. 5389–5391.
doi: 10.1109/BigData.2018.8622441.
URL: http://ieeexplore.ieee.org/stamp/stamp.jsp?tp=&arnumber=8622441&isnumber=8621858

The online running prototype, the full description of the project, its results, and source code are available at http://www.twitterpoliticalsentiment.com/USA/.

Understanding Polarized Political Events through Social Media Analysis

Predicting the outcome of elections is a topic that has been extensively studied in political polls, which have generally provided reliable predictions by means of statistical models. In recent years, online social media platforms have become a potential alternative to traditional polls, since they provide large amounts of post and user data, also referring to socio-political aspects.

In this context, we designed a study that aimed at defining a user modeling pipeline to analyze discussions and opinions shared on social media regarding polarized political events (such as a public poll or referendum).

The pipeline follows a four-step methodology.

 

  • First, social media posts and user metadata are crawled.
  • Second, a filtering mechanism is applied to filter out spammers and bot users.
  • Third, demographic information is extracted for the valid users, namely gender, age, ethnicity and location.
  • Fourth, the political polarity of the users with respect to the analyzed event is predicted.
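The four steps above can be sketched as a simple chain of functions. The skeleton below only shows how the stages fit together; the four callables passed in (`crawl`, `is_bot`, `extract_demographics`, `predict_polarity`) are hypothetical placeholders for the actual components described in the paper.

```python
def run_pipeline(crawl, is_bot, extract_demographics, predict_polarity):
    """Chain the four stages and return one enriched record per valid user.

    All four arguments are caller-supplied callables (placeholders here):
      crawl()                    -> iterable of user dicts (with their posts)
      is_bot(user)               -> True for spammers and bots
      extract_demographics(user) -> dict with gender, age, ethnicity, location
      predict_polarity(user)     -> stance label for the analyzed event
    """
    # Step 1: crawl posts and user metadata.
    users = list(crawl())

    # Step 2: filter out spammers and bot users.
    valid_users = [user for user in users if not is_bot(user)]

    results = []
    for user in valid_users:
        record = dict(user)
        # Step 3: extract demographic information.
        record["demographics"] = extract_demographics(user)
        # Step 4: predict the political polarity with respect to the event.
        record["polarity"] = predict_polarity(user)
        results.append(record)
    return results
```

Keeping each stage behind its own function makes it possible to swap, for instance, the bot filter or the polarity classifier without touching the rest of the pipeline.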

In the scope of this work, our proposed pipeline is applied to two referendum scenarios:

  • independence of Catalonia in Spain
  • autonomy of Lombardy in Italy

We used these real-world examples to assess the performance of the approach with respect to the capability of collecting correct insights on the demographics of social media users and of predicting the poll results based on the opinions shared by the users.


Experiments show that the method was effective in predicting the political trends for the Catalonia case, but not for the Lombardy case. Among the possible reasons for this, we noticed that in general Twitter was more representative of the users opposing the referendum than of those in favor.

The work has been presented at the KDWEB workshop at the ICWE 2018 conference.

A preprint of the paper can be downloaded from ArXiv and cited as reported here:

Roberto Napoli, Ali Mert Ertugrul, Alessandro Bozzon, Marco Brambilla. A User Modeling Pipeline for Studying Polarized Political Events in Social Media. KDWeb Workshop 2018, co-located with ICWE 2018, Caceres, Spain, June 2018. arXiv:1807.09459

Data Cleaning for Knowledge Extraction and Understanding on Social Media

 

Social media platforms let users share their opinions through textual or multimedia content. In many settings, this becomes a valuable source of knowledge that can be exploited for specific business objectives. Brands and companies often ask to monitor social media as sources for understanding the stance, opinion, and sentiment of their customers, audience and potential audience. This is crucial for them because it lets them understand the trends and future commercial and marketing opportunities.

However, all this relies on a solid and reliable data collection phase, which ensures that all the analyses, extractions and predictions are applied to clean, solid and focused data. Indeed, the usual topic-based collection of social media content performed through keyword-based search typically yields very noisy results.

We recently implemented a simple study aiming at cleaning the data collected from social content, within specific domains or related to given topics of interest. We propose a basic method for data cleaning and removal of off-topic content based on supervised machine learning techniques, i.e. classification, over data collected from social media platforms based on keywords regarding a specific topic. We define a general method for this and then we validate it through an experiment of data extraction from Twitter, with respect to a set of famous cultural institutions in Italy, including theaters, museums, and other venues.
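To give an idea of what such a supervised filter looks like, here is a tiny bag-of-words Naive Bayes classifier that labels a tweet as on-topic or off-topic. Naive Bayes is used here only because it fits in a few lines; the classifiers and feature extraction strategies actually compared in the paper are not detailed in this post, and the class and function names are ours.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lowercase a text and split it into simple word, hashtag and mention tokens."""
    return re.findall(r"[a-z0-9#@']+", text.lower())


class NaiveBayesFilter:
    """Multinomial Naive Bayes with add-one smoothing over bag-of-words features."""

    def fit(self, texts, labels):
        self.classes = sorted(set(labels))
        self.doc_counts = Counter(labels)
        self.word_counts = {c: Counter() for c in self.classes}
        for text, label in zip(texts, labels):
            self.word_counts[label].update(tokenize(text))
        self.vocab = set().union(*self.word_counts.values())
        return self

    def predict(self, text):
        total_docs = sum(self.doc_counts.values())
        best_class, best_score = None, -math.inf
        for c in self.classes:
            total_words = sum(self.word_counts[c].values())
            # Log prior of the class plus smoothed log likelihood of each known word.
            score = math.log(self.doc_counts[c] / total_docs)
            for word in tokenize(text):
                if word in self.vocab:
                    score += math.log(
                        (self.word_counts[c][word] + 1) / (total_words + len(self.vocab))
                    )
            if score > best_score:
                best_class, best_score = c, score
        return best_class
```

In practice the training set would be the tweets labeled by domain experts, and the trained filter would be applied to the remaining keyword-based collection to drop the off-topic part.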

For this case, we collaborated with domain experts to label the dataset, and then we evaluated and compared the performance of classifiers that are trained with different feature extraction strategies.

The work has been presented at the KDWEB workshop at the ICWE 2018 conference.

A preprint of the paper can be downloaded and cited as reported here:

Emre Calisir, Marco Brambilla. The Problem of Data Cleaning for Knowledge Extraction from Social Media. KDWeb Workshop 2018, co-located with ICWE 2018, Caceres, Spain, June 2018.

The slides used in the workshop are available online here:

 

Iterative knowledge extraction from social networks

Yesterday, we presented a new work at The Web Conference in Lyon along the research line on knowledge extraction from human generated content started with our paper “Extracting Emerging Knowledge from Social Media” presented at the WWW 2017 Conference (see also this past post).

Our motivation starts from the fact that knowledge in the world continuously evolves, and thus ontologies and knowledge bases are largely incomplete, especially regarding data belonging to the so-called long tail. Therefore, we proposed a method for discovering emerging knowledge by extracting it from social content. Once initialized by domain experts, the method is capable of finding relevant entities by means of a mixed syntactic-semantic method. The method uses seeds, i.e. prototypes of emerging entities provided by experts, for generating candidates; then, it associates candidates to feature vectors built by using terms occurring in their social content and ranks the candidates by using their distance from the centroid of seeds, returning the top candidates.
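The ranking step can be illustrated with a short sketch: each seed and candidate is represented by a vector of term counts, the seed vectors are averaged into a centroid, and candidates are sorted by their distance from it. The use of raw term counts and of cosine distance are assumptions made for this example, as are the function names; the original method may use different features and metrics.

```python
import math
from collections import Counter


def term_vector(texts):
    """Build a term-count vector from the social content of one entity."""
    return Counter(word for text in texts for word in text.lower().split())


def centroid(vectors):
    """Average a non-empty list of sparse term vectors."""
    total = Counter()
    for vector in vectors:
        total.update(vector)
    return {term: count / len(vectors) for term, count in total.items()}


def cosine_distance(a, b):
    """Return 1 - cosine similarity between two sparse vectors (1.0 if either is empty)."""
    dot = sum(value * b.get(term, 0.0) for term, value in a.items())
    norm_a = math.sqrt(sum(value * value for value in a.values()))
    norm_b = math.sqrt(sum(value * value for value in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 1.0
    return 1.0 - dot / (norm_a * norm_b)


def rank_candidates(seed_vectors, candidate_vectors, top_n):
    """Return the names of the top_n candidates closest to the centroid of the seeds.

    candidate_vectors: dict mapping a candidate name to its term vector.
    """
    center = centroid(seed_vectors)
    ranked = sorted(
        candidate_vectors,
        key=lambda name: cosine_distance(candidate_vectors[name], center),
    )
    return ranked[:top_n]
```

Candidates whose content shares no terms with the seeds end up at distance 1 and fall to the bottom of the ranking.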

Based on this foundational idea, we explored the possibility of running our method iteratively, using the results as new seeds. In this paper we address the following research questions:

  1. How does the reconstructed domain knowledge evolve if the candidates of one extraction are recursively used as seeds?
  2. How does the reconstructed domain knowledge spread geographically?
  3. Can the method be used to inspect the past, present, and future of knowledge?
  4. Can the method be used to find emerging knowledge?

This is the presentation given at the conference:

This work was presented at The Web Conference 2018, in the Modeling Social Media (MSM) workshop.

The paper is in the official proceedings of the conference through the ACM Digital Library.

You can also find here a PDF preprint version of “Iterative Knowledge Extraction from Social Networks” by Brambilla et al.

 

How Fashionable is Digital Data-Driven Fashion?

Within the context of our data science research track, we have been involved a lot in fashion industry problems recently.

We already showcased some studies in fashion, for instance related to the analysis of the Milano Fashion Week events and their social media impact.

Starting this year, we are also involved in a research and innovation project called FaST – Fashion Sensing Technology. FaST is a project meant to design, experiment with, and implement an ICT tool that could monitor and analyze the activity of Italian emerging Fashion brands on social media. FaST aims at providing SMEs in the Fashion industry with the ability to better understand and measure the behaviours and opinions of consumers on social media, through the study of the interactions between brands and their communities, as well as support a brand’s strategic business decisions.

Given the importance of Fashion as an economic and cultural resource for the Lombardy Region and Italy as a whole, the project aims at leveraging the opportunities given by the creation of a hybrid fashion-digital value chain, in order to design a tool that would allow the codification of new organizational models. Furthermore, the project wants to promote process innovation within the fashion industry with a customer-centric approach, as well as the design of services that could update and innovate both creative processes and the retail channel which, as of today, represents the core of the sustainability and competitiveness of brands and companies on domestic and international markets.

Within the project, we study the social presence and digital / communication strategies of brands, and we will look for room for optimization. We are already crunching a lot of data and running large-scale analyses on the topic. We will share our exciting results as soon as they are available!

 

Acknowledgements

FaST – Fashion Sensing Technology is a project supported by Regione Lombardia through the European Regional Development Fund (grant: “Smart Fashion & Design”). The project is being developed by Politecnico di Milano – Design dept. and Electronics, Information and Bioengineering dept. – in collaboration with Wemanage Group, Studio 4SIGMA, and CGNAL.


Myths and Challenges in Knowledge Extraction and Big Data Analysis

For centuries, science (in German “Wissenschaft”) has aimed to create (“schaffen”) new knowledge (“Wissen”) from the observation of physical phenomena, their modelling, and empirical validation.

Recently, a new source of knowledge has emerged: not (only) the physical world any more, but the virtual world, namely the Web with its ever-growing stream of data materialized in the form of social network chattering, content produced on demand by crowds of people, messages exchanged among interlinked devices in the Internet of Things. The knowledge we may find there can be dispersed, informal, contradicting, unsubstantiated and ephemeral today, while already tomorrow it may be commonly accepted.

The challenge is once again to capture and create consolidated knowledge that is new, has not been formalized yet in existing knowledge bases, and is buried inside a big, moving target (the live stream of online data).

The myth is that existing tools (spanning fields like semantic web, machine learning, statistics, NLP, and so on) suffice for the objective. While this may still be far from true, some existing approaches are actually addressing the problem and provide preliminary insights into the possibilities that successful attempts may lead to.

I gave a few keynote speeches on this matter (at ICEIS, KDWEB,…), and I also use this argument as a motivating class in academic courses to help students understand how crucial it is to focus on the problems related to big data modeling and analysis. The talk, reported in the slides below, explores, through real industrial use cases, the mixed realistic-utopian domain of data analysis and knowledge extraction, and reports on some tools and cases where the digital and physical worlds have been brought together to better understand our society.

The presentation is available on SlideShare and is reported here below: