The VaccinEU dataset of COVID-19 Vaccine Conversations on Twitter in French, German, and Italian

Despite the increasing limitations for unvaccinated people, in many European countries, there is still a non-negligible fraction of individuals who refuse to get vaccinated against SARS-CoV-2, undermining governmental efforts to eradicate the virus.

Within the PERISCOPE project, we studied the role of online social media in influencing individuals’ opinions about getting vaccinated by designing a large-scale collection of Twitter messages in three different languages (French, German, and Italian) and providing public access to the data collected. This work was carried out in collaboration with the Observatory on Social Media, Indiana University, Bloomington, USA.

Focusing on the European context, we devised an open dataset called VaccinEU that aims to help researchers better understand the impact of online (mis)information about vaccines and design more accurate communication strategies to maximize vaccination coverage.

The dataset is openly accessible in a Dataverse repository and a GitHub repository.

Furthermore, a description has been published in a paper at ICWSM 2022 (open access), which can be cited as:

Di Giovanni, M., Pierri, F., Torres-Lugo, C., & Brambilla, M. (2022). VaccinEU: COVID-19 Vaccine Conversations on Twitter in French, German and Italian. Proceedings of the International AAAI Conference on Web and Social Media, 16(1), 1236-1244. https://ojs.aaai.org/index.php/ICWSM/article/view/19374

Large-Scale Analysis of On-line Conversation about Vaccines before COVID-19

Frequent words and co-occurrences used by pro-vaccination and anti-vaccination communities.

In this study, we map the English-language Twitter discourse around vaccinations over four years, in order to:

  • discover the volumes and trends of the conversation;
  • compare the discussion on Twitter with newspapers’ content; and
  • classify people as pro- or anti-vaccination and explore how their behavior differs.

Datasets. We collected four years of Twitter data (January 2016 – January 2020) about vaccination, before the advent of the COVID-19 pandemic, using three keywords: ’vaccine’, ’vaccination’, and ’immunization’, obtaining around 6.5 million tweets. The collection has been analyzed across multiple dimensions and aspects.

General Analysis. The analysis shows that the number of tweets related to the topic increased through the years, peaking in 2019. Among other factors, we identified the 2019 measles outbreak as one of the main reasons for the growth, given the correlation of the tweet volume with CDC (Centers for Disease Control and Prevention) data on measles cases in the United States in 2019 and with the high number of newspaper articles on the topic, both of which increased significantly in 2019. Other demographic, space-time, and content analyses have been performed too.
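As a rough illustration of this kind of check, the sketch below computes a correlation coefficient between two monthly series. It assumes Pearson correlation, which this post does not specify, and the numbers are invented for the example; they are not the study's data.

```python
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation coefficient between two equally long series."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("series must have the same length (at least 2)")
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant series")
    return cov / sqrt(var_x * var_y)


# Invented monthly values, for illustration only (not the study's data).
tweet_volume = [100, 120, 180, 260, 240, 150]
measles_cases = [10, 14, 25, 40, 35, 18]
r = pearson(tweet_volume, measles_cases)
```

A value of `r` close to 1 would indicate that the two series rise and fall together, which is the pattern described above for 2019.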

Subjects. Besides the general data analysis, we considered a number of specific topics often addressed within the vaccine conversation, such as the flu vaccine, HPV, polio, and others. We identified the temporal trends and performed specific analyses related to these subjects, also in connection with the respective media coverage.

News Sources. We analyzed the news sources most cited in the tweets, which include YouTube, NaturalNews (which is generally considered a biased and fake news website), and Facebook. Overall, among the most cited sources, 32% can be labeled as reliable and 25% as conspiracy/fake news sources. Furthermore, 32% of the references point to social networks (including YouTube). This analysis shows how social media and unreliable sources of information frequently drive vaccine-related conversation on Twitter.
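A minimal sketch of how such shares could be computed from the URLs found in tweets is shown here. The `SOURCE_LABELS` mapping is a hypothetical, hand-made example; the actual list of labeled sources used in the study is not given in this post.

```python
from collections import Counter
from urllib.parse import urlparse

# Hypothetical labels for illustration; not the study's actual source list.
SOURCE_LABELS = {
    "youtube.com": "social",
    "facebook.com": "social",
    "naturalnews.com": "conspiracy/fake",
    "cdc.gov": "reliable",
}


def domain_of(url):
    """Return the lowercase host of a URL, without port or leading 'www.'."""
    host = urlparse(url).netloc.lower().split(":")[0]
    return host[4:] if host.startswith("www.") else host


def category_shares(urls):
    """Fraction of cited URLs falling in each source category."""
    counts = Counter(SOURCE_LABELS.get(domain_of(u), "unlabeled") for u in urls)
    total = sum(counts.values())
    return {cat: n / total for cat, n in counts.items()} if total else {}
```

Domains missing from the mapping are counted as `"unlabeled"` rather than dropped, so the shares always sum to one.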

User Stance. We applied stance analysis to the authors of the tweets, to determine each user’s orientation toward a given (pre-chosen) target of interest. Our initial content analysis revealed that a large amount of the content is of a satirical or derisive nature, causing a number of classification techniques to perform poorly on the dataset. Given that other studies considered the presence of stance-indicative hashtags an effective way to discover polarized tweets and users, we applied a rule-based classification, based on a selection of 100+ hashtags that allowed us to automatically classify a tweet as pro-vaccination or vaccination-skeptic, obtaining a total of 250,000+ classified tweets over the 4 years.
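The rule can be sketched as follows. The two hashtag sets are tiny illustrative examples chosen for this sketch, not the 100+ hashtags selected in the study, and leaving tweets with conflicting hashtags unclassified is an assumption.

```python
import re

# Tiny illustrative sets; the study's 100+ hashtags are not listed in this post.
PRO_HASHTAGS = {"#vaccineswork", "#vaccinessavelives"}
SKEPTIC_HASHTAGS = {"#vaccineinjury", "#learntherisk"}

HASHTAG_RE = re.compile(r"#\w+")


def classify_tweet(text):
    """Label a tweet by its stance-indicative hashtags, or return None."""
    tags = {tag.lower() for tag in HASHTAG_RE.findall(text)}
    has_pro = bool(tags & PRO_HASHTAGS)
    has_skeptic = bool(tags & SKEPTIC_HASHTAGS)
    if has_pro and not has_skeptic:
        return "pro-vaccination"
    if has_skeptic and not has_pro:
        return "vaccination-skeptic"
    return None  # no stance hashtag, or conflicting ones (assumed policy)
```

Because the rule only looks at hashtags, it sidesteps the satire problem mentioned above, at the cost of leaving most tweets unclassified.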

Share of pro- and anti-vaccine discourse over time. Pro-vaccine tweet volumes appear to be larger than anti-vaccine tweet volumes and to increase over time.

The words used by the two groups of users to discuss vaccine-related topics are profoundly different, as are the sources of information they refer to. Anti-vaccine users cited mostly fake news websites and very few reliable sources, which are instead largely cited by pro-vaccine users. Social media (primarily YouTube) represent a large portion of linked content in both cases.

Additionally, we performed demographic (age, gender, ethnicity) and spatial analyses over the two categories of users with the aim of understanding the features of the two communities. Our analysis also shows to what extent the different U.S. states are polarized for or against vaccination on Twitter.

Stance of US states towards vaccination.

A video presenting our research is available on YouTube:

This work has been presented at the IC2S2 conference.

The cover image by NIAID is licensed under CC BY 2.0.

Content-based Classification of Political Inclinations of Twitter Users

Social networks are huge continuous sources of information that can be used to analyze people’s behavior and thoughts.

Our goal is to extract such information and predict political inclinations of users.

In particular, we investigate the importance of syntactic features of texts written by users when they post on social media. Our hypothesis is that people belonging to the same political party write in similar ways, thus they can be classified properly on the basis of the words that they use.

We analyze tweets because Twitter is commonly used in Italy for discussing politics; moreover, it provides an official API that can be easily exploited for data extraction. Many classifiers were applied to different kinds of features and NLP vectorization methods in order to find the method best able to confirm our hypothesis.
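To give a feel for the idea, here is a deliberately simple stand-in: a bag-of-words vectorization with a cosine-similarity match against one aggregate word profile per party. It is not one of the classifiers or vectorization methods evaluated in the paper, and the party labels and texts used with it would be placeholders.

```python
from collections import Counter
from math import sqrt


def bag_of_words(text):
    """Lowercased whitespace tokens with their counts."""
    return Counter(text.lower().split())


def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[w] * b[w] for w in a if w in b)
    norm_a = sqrt(sum(v * v for v in a.values()))
    norm_b = sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def train(labeled_texts):
    """Sum word counts per label: one aggregate profile per party."""
    profiles = {}
    for text, label in labeled_texts:
        profiles.setdefault(label, Counter()).update(bag_of_words(text))
    return profiles


def predict(profiles, text):
    """Return the label whose profile is most similar to the text."""
    vec = bag_of_words(text)
    return max(profiles, key=lambda label: cosine(profiles[label], vec))
```

For example, after `profiles = train([("cut taxes now", "party_a"), ("fund public schools", "party_b")])`, a call such as `predict(profiles, "lower taxes")` picks the party whose vocabulary overlaps most with the new text.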

To evaluate their accuracy, we selected as ground truth a set of current Italian deputies with consistent activity on Twitter, and we then predicted their political party. Using the results of our analysis, we also gained interesting insights into current Italian politics. Here are the clusters of users:

[Figure: clusters of users]

The results in predicting political alignment are quite good, as reported in the confusion matrix:

[Figure: confusion matrix of predicted parties]

Our study is described in detail in the paper published in the IEEE Big Data 2018 conference and linked at:

DOI: 10.1109/BigData.2018.8622040

The article can be downloaded here, if you don’t have access to the IEEE library.

You can also look at the slides on SlideShare:

You can cite the paper as follows:

M. Di Giovanni, M. Brambilla, S. Ceri, F. Daniel and G. Ramponi, “Content-based Classification of Political Inclinations of Twitter Users,” 2018 IEEE International Conference on Big Data (Big Data), Seattle, WA, USA, 2018, pp. 4321-4327.
doi: 10.1109/BigData.2018.8622040

News Sharing Behaviour on Twitter. A Dataset and a Pipeline

Online social media are changing the news industry and revolutionizing the traditional role of journalists and newspapers. In this scenario, investigating the behaviour of users in relation to news sharing is relevant, as it provides means for understanding the impact of online news, their propagation within social communities, and their influence on the formation of opinions, as well as for effectively detecting individual stance relative to specific news or topics and for understanding the role of journalism today.

Our contribution is two-fold.

First, we build a robust pipeline for collecting datasets describing news sharing; the pipeline takes as input a list of news sources and generates a large collection of articles, of the accounts that provide them on the social media either directly or by retweeting, and of the social activities performed by these accounts.

The dataset is published on Harvard Dataverse:

https://doi.org/10.7910/DVN/5XRZLH

Second, we also provide a large-scale dataset that can be used to study the social behavior of Twitter users and their involvement in the dissemination of news items. Finally, we show an application of our data collection in the context of political stance classification, and we suggest other potential usages of the presented resources.

The code is published on GitHub:

https://github.com/DataSciencePolimi/NewsAnalyzer

The details of our approach are published in a paper at ICWSM 2019, accessible online.

You can cite the paper as:

Giovanni Brena, Marco Brambilla, Stefano Ceri, Marco Di Giovanni, Francesco Pierri, Giorgia Ramponi. News Sharing User Behaviour on Twitter: A Comprehensive Data Collection of News Articles and Social Interactions. AAAI ICWSM 2019, pp. 592-597.

Slides are on Slideshare:

You can also download a summary poster.

Understanding Polarized Political Events through Social Media Analysis

Predicting the outcome of elections is a topic that has been extensively studied in political polls, which have generally provided reliable predictions by means of statistical models. In recent years, online social media platforms have become a potential alternative to traditional polls, since they provide large amounts of post and user data, also referring to socio-political aspects.

In this context, we designed a study aimed at defining a user modeling pipeline to analyze discussions and opinions shared on social media regarding polarized political events (such as a public poll or referendum).

The pipeline follows a four-step methodology:

  • First, social media posts and user metadata are crawled.
  • Second, a filtering mechanism is applied to remove spammers and bot users.
  • Third, demographic information is extracted for the valid users, namely gender, age, ethnicity, and location.
  • Fourth, the political polarity of the users with respect to the analyzed event is predicted.
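The four steps can be sketched as a chain of small functions. Everything named here (`User`, `crawl`, `filter_valid`, `add_demographics`, `predict_polarity`, `run_pipeline`) is hypothetical and only mirrors the structure of the list above: the bot flag is assumed to come from an upstream detector, and the demographic and polarity models are passed in as plain callables.

```python
from dataclasses import dataclass, field


@dataclass
class User:
    handle: str
    posts: list
    is_bot: bool = False  # assumed to come from some upstream bot/spam detector
    demographics: dict = field(default_factory=dict)
    polarity: str = ""


def crawl(raw_records):
    """Step 1: turn crawled (handle, posts, is_bot) records into User objects."""
    return [User(handle, list(posts), is_bot) for handle, posts, is_bot in raw_records]


def filter_valid(users):
    """Step 2: drop spammers and bots (and, as an assumption, users with no posts)."""
    return [u for u in users if not u.is_bot and u.posts]


def add_demographics(users, infer_demographics):
    """Step 3: attach gender, age, ethnicity, and location estimates."""
    for u in users:
        u.demographics = infer_demographics(u)
    return users


def predict_polarity(users, classify):
    """Step 4: label each user's stance toward the analyzed event."""
    for u in users:
        u.polarity = classify(u)
    return users


def run_pipeline(raw_records, infer_demographics, classify):
    """Run the four steps in order and return the annotated valid users."""
    users = filter_valid(crawl(raw_records))
    users = add_demographics(users, infer_demographics)
    return predict_polarity(users, classify)
```

Keeping the models as parameters makes it possible to swap in different demographic or polarity predictors for each referendum without touching the pipeline itself.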

In the scope of this work, our proposed pipeline is applied to two referendum scenarios:

  • independence of Catalonia in Spain
  • autonomy of Lombardy in Italy

We used these real-world examples to assess the performance of the approach with respect to the capability of collecting correct insights on the demographics of social media users and of predicting the poll results based on the opinions shared by the users.

Experiments show that the method was effective in predicting the political trends for the Catalonia case, but not for the Lombardy case. Among the various reasons for this, we noticed that in general Twitter was more representative of the users opposing the referendum than of those in favor.

The work has been presented at the KDWEB workshop at the ICWE 2018 conference.

A preprint of the paper can be downloaded from arXiv and cited as reported here:

Roberto Napoli, Ali Mert Ertugrul, Alessandro Bozzon, Marco Brambilla. A User Modeling Pipeline for Studying Polarized Political Events in Social Media. KDWeb Workshop 2018, co-located with ICWE 2018, Caceres, Spain, June 2018. arXiv:1807.09459

Iterative knowledge extraction from social networks

Yesterday, we presented a new work at The Web Conference in Lyon, along the research line on knowledge extraction from human-generated content that started with our paper “Extracting Emerging Knowledge from Social Media”, presented at the WWW 2017 Conference (see also this past post).

Our motivation starts from the fact that knowledge in the world continuously evolves, and thus ontologies and knowledge bases are largely incomplete, especially regarding data belonging to the so-called long tail. Therefore, we proposed a method for discovering emerging knowledge by extracting it from social content. Once initialized by domain experts, the method is capable of finding relevant entities by means of a mixed syntactic-semantic method. The method uses seeds, i.e. prototypes of emerging entities provided by experts, for generating candidates; then, it associates candidates to feature vectors built by using terms occurring in their social content and ranks the candidates by using their distance from the centroid of seeds, returning the top candidates.
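The ranking step can be sketched roughly as follows. The sketch assumes raw term counts as features and Euclidean distance from the seed centroid; this post does not state which weighting or distance the method actually uses, and all function names are made up for the example.

```python
from collections import Counter
from math import sqrt


def feature_vector(texts):
    """Term-count vector built from the terms in an entity's social content."""
    return Counter(word for text in texts for word in text.lower().split())


def centroid(vectors):
    """Average of a non-empty list of sparse term vectors."""
    total = Counter()
    for vec in vectors:
        total.update(vec)
    return {term: count / len(vectors) for term, count in total.items()}


def euclidean(a, b):
    """Euclidean distance between two sparse vectors (missing terms count as 0)."""
    terms = set(a) | set(b)
    return sqrt(sum((a.get(t, 0) - b.get(t, 0)) ** 2 for t in terms))


def rank_candidates(seed_content, candidate_content, top_k=3):
    """Return the names of the top_k candidates closest to the seed centroid."""
    seed_centroid = centroid([feature_vector(t) for t in seed_content.values()])
    scored = [
        (euclidean(feature_vector(texts), seed_centroid), name)
        for name, texts in candidate_content.items()
    ]
    return [name for _, name in sorted(scored)[:top_k]]
```

Both arguments map an entity name to the list of texts associated with it; candidates whose vocabulary resembles that of the seeds end up at the top of the ranking.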

Based on this foundational idea, we explored the possibility of running our method iteratively, using the results as new seeds. In this paper we address the following research questions:

  1. How does the reconstructed domain knowledge evolve if the candidates of one extraction are recursively used as seeds?
  2. How does the reconstructed domain knowledge spread geographically?
  3. Can the method be used to inspect the past, present, and future of knowledge?
  4. Can the method be used to find emerging knowledge?

This is the presentation given at the conference:

This work was presented at The Web Conference 2018, in the Modeling Social Media (MSM) workshop.

The paper is in the official proceedings of the conference through the ACM Digital Library.

You can also find here a PDF preprint version of “Iterative Knowledge Extraction from Social Networks” by Brambilla et al.

How Fashionable is Digital Data-Driven Fashion?

Within the context of our data science research track, we have been involved a lot in fashion industry problems recently.

We already showcased some studies in fashion, for instance related to the analysis of the Milano Fashion Week events and their social media impact.

Starting this year, we are also involved in a research and innovation project called FaST – Fashion Sensing Technology. FaST is a project meant to design, experiment with, and implement an ICT tool that could monitor and analyze the activity of Italian emerging Fashion brands on social media. FaST aims at providing SMEs in the Fashion industry with the ability to better understand and measure the behaviours and opinions of consumers on social media, through the study of the interactions between brands and their communities, as well as support a brand’s strategic business decisions.

Given the importance of Fashion as an economic and cultural resource for the Lombardy Region and Italy as a whole, the project aims at leveraging the opportunities offered by the creation of a hybrid fashion-digital value chain, in order to design a tool that would allow the codification of new organizational models. Furthermore, the project aims to promote process innovation within the fashion industry with a customer-centric approach, as well as the design of services that could update and innovate both creative processes and the retail channel, which, as of today, is at the core of the sustainability and competitiveness of brands and companies on domestic and international markets.

Within the project, we study the social presence and digital communication strategies of brands, and we will look for room for optimization. We are already crunching a lot of data and running large-scale analyses on the topic. We will share our exciting results as soon as they are available!

Acknowledgements

FaST – Fashion Sensing Technology is a project supported by Regione Lombardia through the European Regional Development Fund (grant: “Smart Fashion & Design”). The project is being developed by Politecnico di Milano – Design dept. and Electronics, Information and Bioengineering dept. – in collaboration with Wemanage Group, Studio 4SIGMA, and CGNAL.

Extracting Emerging Knowledge from Social Media

Today I presented our full paper titled “Extracting Emerging Knowledge from Social Media” at the WWW 2017 conference.

The work is based on a rather obvious assumption, i.e., that knowledge in the world continuously evolves, and ontologies are largely incomplete with regard to low-frequency data, belonging to the so-called long tail.

Socially produced content is an excellent source for discovering emerging knowledge: it is huge, and immediately reflects the relevant changes which hide emerging entities.

In the paper we propose a method and a tool for discovering emerging entities by extracting them from social media.

Once instrumented by experts through very simple initialization, the method is capable of finding emerging entities; we propose a mixed syntactic + semantic method. The method uses seeds, i.e. prototypes of emerging entities provided by experts, for generating candidates; then, it associates candidates to feature vectors, built by using terms occurring in their social content, and then ranks the candidates by using their distance from the centroid of seeds, returning the top candidates as result.

The method can be continuously or periodically iterated, using the results as new seeds.
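The iteration can be sketched as a simple loop around any extraction function. Here `extract` is a placeholder for one run of the method, and replacing the seeds with only the newly found entities at each round is an assumption; this post does not say whether old seeds are kept.

```python
def iterate_extraction(initial_seeds, extract, rounds=3):
    """Run `extract` repeatedly, feeding each round's new results back as seeds.

    `extract` is a placeholder: any callable mapping a set of seeds to an
    iterable of extracted entities. Returns the new entities found per round.
    """
    seeds = set(initial_seeds)
    known = set(seeds)
    history = []
    for _ in range(rounds):
        found = set(extract(seeds)) - known
        if not found:
            break  # nothing new emerged, so further rounds would not help
        history.append(sorted(found))
        known |= found
        seeds = found  # assumption: new results replace the previous seeds
    return history
```

Tracking the entities already seen keeps the loop from cycling back to earlier seeds and lets it stop early once no new entity emerges.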

The PDF of the full paper presented at WWW 2017 is available online (open access with a Creative Commons license).

You can also check out the slides of my presentation on Slideshare.

A version of the tool is available online for free use, thanks also to our partners Dandelion API and Microsoft Azure. The most recent version of the tool is available on GitHub here.

Social Media Behaviour during Live Events: the Milano Fashion Week #MFW case

Social media are getting more and more important in the context of live events, such as fairs, exhibits, festivals, concerts, and so on, as they play an essential role in communicating them to fans, interest groups, and the general population. These kinds of events are geo-localized within a city or territory and are scheduled within a public calendar.

Together with the people in the Fashion in Process group of Politecnico di Milano, we studied the impact on social media of a specific scenario, the Milano Fashion Week (MFW), which is an important event in Milano for the whole fashion business.

We presented this work at the Location and the Web workshop co-located with the WWW 2017 Conference in Perth, Australia.

We focus our attention on how social content spreads in space, measuring the spatial propagation of the event. We build different clusters of fashion brands, we characterize several features of propagation in space, and we correlate them with the popularity of the brand and with temporal propagation.

We show that the clusters along space, time and popularity dimensions are loosely correlated, and therefore trying to understand the dynamics of the events only based on popularity aspects would not be appropriate.

The paper is available as an open-access PDF on the WWW 2017 Conference web site. You can download it here.

A subsequent paper on the temporal analysis of the same event, “Temporal Analysis of Social Media Response to Live Events: The Milano Fashion Week”, focusing on Granger causality and other measures, has been published at ICWE 2017 and is available in the proceedings by Springer.

The PowerPoint presentation is available on SlideShare.

Modeling and Analyzing Engagement in Social Network Challenges

Within a completely new line of research, we are exploring the power of modeling for human behaviour analysis, especially within social networks and/or on the occasion of large-scale live events. Participation in challenges within social networks is a very effective instrument for promoting a brand or event, and it is therefore regarded as an excellent marketing tool.
Our first research work was published in November 2016 at the WISE Conference, covering the analysis of user engagement within social network challenges.
In this paper, we take the challenge organizer’s perspective, and we study how to raise the engagement of players in challenges where the players are stimulated to create and evaluate content, thereby indirectly raising the awareness about the brand or event itself. Slides are available on SlideShare:

We illustrate a comprehensive model of the actions and strategies that can be exploited for progressively boosting the social engagement during the challenge evolution. The model studies the organizer-driven management of interactions among players, and evaluates the effectiveness of each action in light of several other factors (time, repetition, third party actions, interplay between different social networks, and so on).
We evaluate the model through a set of experiments on a real case, the YourExpo2015 challenge. Overall, our experiments lasted 9 weeks and engaged around 800,000 users on two different social platforms; our quantitative analysis assesses the validity of the model.

The paper is published by Springer here.

To keep updated on my activities, you can subscribe to the RSS feed of my blog or follow my Twitter account (@MarcoBrambi).