Despite the increasing limitations for unvaccinated people, in many European countries, there is still a non-negligible fraction of individuals who refuse to get vaccinated against SARS-CoV-2, undermining governmental efforts to eradicate the virus.
Within the PERISCOPE project, we studied the role of online social media in influencing individuals’ opinions about getting vaccinated by designing a large-scale collection of Twitter messages in three different languages (French, German, and Italian) and providing public access to the data collected. This work was implemented in collaboration with the Observatory on Social Media, Indiana University, Bloomington, USA.
Focusing on the European context, we devised an open dataset called VaccinEU that aims to help researchers better understand the impact of online (mis)information about vaccines and design more accurate communication strategies to maximize vaccination coverage.
Furthermore, a description has been published in a paper at ICWSM 2022 (open access), which can be cited as:
Di Giovanni, M., Pierri, F., Torres-Lugo, C., & Brambilla, M. (2022). VaccinEU: COVID-19 Vaccine Conversations on Twitter in French, German and Italian. Proceedings of the International AAAI Conference on Web and Social Media, 16(1), 1236-1244. https://ojs.aaai.org/index.php/ICWSM/article/view/19374
In this study, we map the Twitter discourse around vaccinations in English over four years, in order to:
discover the volumes and trends of the conversation;
compare the discussion on Twitter with newspapers’ content; and
classify people as pro- or anti-vaccination and explore how their behavior differs.
Datasets. We collected four years of Twitter data (January 2016 – January 2020) about vaccination, before the advent of the Covid-19 pandemic, using three keywords: ’vaccine’, ’vaccination’, and ’immunization’, obtaining around 6.5 million tweets. The collection has been analyzed across multiple dimensions and aspects.
General Analysis. The analysis shows that the number of tweets related to the topic increased through the years, peaking in 2019. Among other factors, we identified the 2019 measles outbreak as one of the main reasons for the growth, given the correlation of the tweet volume with CDC (Centers for Disease Control and Prevention) data on measles cases in the United States in 2019 and with the high number of newspaper articles on the topic, both of which increased significantly in 2019. Other demographic, space-time, and content analyses have been performed too.
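To make the correlation argument concrete, here is a minimal sketch of how a Pearson correlation between a monthly tweet-volume series and a monthly measles-case series could be computed. The post does not say which correlation measure was used, so Pearson is an assumption, and the numbers below are made up for illustration; they are not the study's data.

```python
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation coefficient between two equally long series."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two series of equal length >= 2")
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant series")
    return cov / sqrt(var_x * var_y)


# Hypothetical monthly values (illustration only, not real data).
tweets_per_month = [100, 120, 180, 260, 240, 150]
measles_cases_per_month = [10, 14, 30, 55, 48, 20]

r = pearson(tweets_per_month, measles_cases_per_month)
print(f"Pearson r = {r:.3f}")
```

A value of r close to 1 would indicate that the two series rise and fall together; it does not by itself establish that the outbreak caused the growth in tweets.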
Subjects. Besides the general data analysis, we considered a number of specific topics often addressed within the vaccine conversation, such as the flu vaccine, HPV, polio, and others. We identified the temporal trends and performed specific analyses related to these subjects, also in connection with the respective media coverage.
News Sources. We analyzed the news sources most cited in the tweets, which include YouTube, NaturalNews (which is generally considered a biased and fake news website), and Facebook. Overall, among the most cited sources, 32% can be labeled as reliable and 25% as conspiracy/fake news sources. Furthermore, 32% of the references point to social networks (including YouTube). This analysis shows how social media and non-reliable sources of information frequently drive vaccine-related conversation on Twitter.
User Stance. We applied stance analysis to the authors of the tweets, to determine each user’s orientation toward a given (pre-chosen) target of interest. Our initial content analysis revealed that a large amount of the content is of a satirical or derisive nature, causing a number of classification techniques to perform poorly on the dataset. Given that other studies considered the presence of stance-indicative hashtags an effective way to discover polarized tweets and users, a rule-based classification was applied, based on a selection of 100+ hashtags that allowed us to automatically classify a tweet as pro-vaccination or vaccination-skeptic, obtaining a total of 250,000+ classified tweets over the 4 years.
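A hashtag-based rule of this kind can be sketched in a few lines. The hashtag sets below are hypothetical placeholders (the actual selection of 100+ hashtags is not listed in this post), and leaving tweets with mixed or no stance hashtags unclassified is an assumption of this sketch, not necessarily the rule used in the study.

```python
import re

# Hypothetical hashtag lists, for illustration only.
PRO_HASHTAGS = {"vaccineswork", "vaccinessavelives"}
SKEPTIC_HASHTAGS = {"vaccineinjury", "novax"}

HASHTAG_RE = re.compile(r"#(\w+)")


def classify_tweet(text):
    """Return a stance label based on hashtags, or None if undecided."""
    tags = {tag.lower() for tag in HASHTAG_RE.findall(text)}
    has_pro = bool(tags & PRO_HASHTAGS)
    has_skeptic = bool(tags & SKEPTIC_HASHTAGS)
    if has_pro and not has_skeptic:
        return "pro-vaccination"
    if has_skeptic and not has_pro:
        return "vaccination-skeptic"
    # Assumption: tweets with both kinds of hashtags, or none, stay unlabeled.
    return None


print(classify_tweet("Flu shot done today #VaccinesWork"))
```

Relying on explicit hashtags sidesteps the satire problem, at the price of labeling only the fraction of tweets that carry one of the selected hashtags.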
The words used by the two groups of users to discuss vaccine-related topics are profoundly different, as are the sources of information they refer to. Anti-vaccine users cited mostly fake news websites and very few reliable sources, which are instead largely cited by pro-vaccine users. Social media (primarily YouTube) represent a large portion of linked content in both cases.
Additionally, we performed demographic (age, gender, ethnicity) and spatial analyses over the two categories of users with the aim of understanding the features of the two communities. Our analysis also shows to what extent the different U.S. states are polarized for or against vaccination on Twitter.
A video presenting our research is available on YouTube.
This work has been presented at the IC2S2 conference.
Just last month I listened to R. Martin Chavez, Goldman Sachs deputy CFO, at the ComputeFest 2017 event at Harvard, more precisely at the Symposium on the Future of Computation in Science and Engineering on “Data, Dollars, and Algorithms: The Computational Economy”, held on Thursday, January 19, 2017.
His claim was that
Banks are essentially API providers.
The entire structure and infrastructure of Goldman Sachs is being restructured for that. His case is that you should not compare a bank with a shop or store; you should compare it with Google. Just imagine that every time you want to search on Google you need to get in touch (i.e., make a phone call or submit a request) with some Google employee, who at some point comes back to you with the result. Nonsense, right? Well, this is what actually happens with banks. It was happening with consumer-oriented banks before online banking, and it’s still largely happening for business banks.
But this is going to change. The amount of data and the speed and volume of financial transactions no longer allow that.
Banks are actually among the richest organizations, not just in terms of money but also in data ownership. Yet they are also craving further “less official” big data sources.
Today at the ISTAT National Big Data Committee meeting in Rome, Juri Marcucci from the Bank of Italy discussed their research on integrating Google Trends information into their financial predictive analytics.
Google Trends provides insights into user interests in general, expressed as the probability that a random user is going to search for a particular keyword (normalized and scaled, with geographical detail down to city level).
The Bank of Italy is using Google Trends data to complement their prediction of unemployment rates in the short and medium term. It’s definitely a big challenge, but preliminary results are promising in terms of confidence in the obtained models. More details are available in this paper.
Paolo Giudici from the University of Pavia showed how one can correlate the risk of bank defaults with the banks’ exposure on Twitter.
Obviously, all this must take into account the bias of the sources and the quality of the data collected. This was also pointed out by Paolo Giudici from the University of Pavia. Assessing the “trustability” of online sources is crucial. In their research, they defined the T-index for Twitter accounts in much the same way academics define the h-index for the relevance of publications, as reported in the photographed slide below.
It’s very interesting to see how creative the use of (non-traditional, web-based) big data is becoming in very diverse fields, including very traditional ones like macroeconomics and finance.
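For readers unfamiliar with the h-index, here is a minimal sketch of how it is computed: it is the largest h such that at least h items each have a count of at least h. The exact definition of the T-index is not given in this post, so applying the same rule to per-tweet retweet counts, as done in the example, is purely an assumption for illustration.

```python
def h_index(counts):
    """Largest h such that at least h items have a count >= h."""
    ranked = sorted(counts, reverse=True)
    h = 0
    for rank, count in enumerate(ranked, start=1):
        if count >= rank:
            h = rank
        else:
            break
    return h


# Hypothetical retweet counts for one account's tweets (illustration only).
retweets_per_tweet = [10, 8, 5, 4, 3]
print(h_index(retweets_per_tweet))  # 4: four tweets with at least 4 retweets each
```

As with publications, such an index rewards accounts that are consistently picked up by others rather than accounts with a single viral message.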
And once again, I think the biggest challenges and opportunities come from the fusion of multiple data sources together: mobile phones, financial tracks, web searches, online news, social networks, and official statistics.
This is also the path that ISTAT (the official institute for Italian statistics) is pursuing. For instance, in the calculation of official national inflation rates, web scraping techniques (for e-commerce prices) covering more than 40,000 product prices are integrated into the process too.
Ekaterina Shabunina recently graduated under my supervision as an M.Sc. student of the Como Campus of Politecnico di Milano with a thesis titled “Approach based on CRF to Sentiment Classification of Twitter Streams related to Companies”. Thanks to the innovation of her work, she won the Grand Prize 2013 for the GSE Academic Award for Excellence.
The work is based on the assumption that information produced and shared on social networks is becoming more and more interesting as a source for inferring trends and happenings in the real world. She applied sentiment classification to Twitter streams related to companies and ran a statistical correlation analysis against the variation of the companies’ securities prices. Tweets are labeled with a tailored classification model, which by itself exhibits solid performance indicators, and are then correlated to stock market values. The approach applies the Conditional Random Fields probabilistic model to company-related Twitter data streams and shows that there is a high correlation between the classified results and the stock market values, even when adopting a very simple feature model. In particular, it presents a near-perfect adherence of the accumulated number of net positive tweets to the stock’s closing price, with an ideal level of significance of the regression and a 97.56% explanatory capacity of the achieved fitted equation in the best case.
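The regression step can be illustrated with a small sketch: accumulate the daily net positive tweets (positive minus negative), fit a straight line against the closing price, and read the explanatory capacity as the coefficient of determination R². Interpreting “explanatory capacity” as R² and using a simple linear fit are assumptions here, and all numbers are invented for illustration; they are not the thesis data.

```python
from itertools import accumulate


def fit_line(xs, ys):
    """Ordinary least squares fit y = intercept + slope * x, with R^2."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    ss_tot = sum((y - mean_y) ** 2 for y in ys)
    if sxx == 0 or ss_tot == 0:
        raise ValueError("fit is undefined for a constant series")
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    ss_res = sum((y - (intercept + slope * x)) ** 2 for x, y in zip(xs, ys))
    return slope, intercept, 1 - ss_res / ss_tot


# Hypothetical daily counts of tweets classified as positive / negative.
daily_positive = [5, 7, 6, 9, 4]
daily_negative = [2, 3, 1, 2, 5]
net = [p - n for p, n in zip(daily_positive, daily_negative)]
cumulative_net = list(accumulate(net))  # running total of net positive tweets

# Hypothetical closing prices for the same days.
closing_prices = [10.2, 10.8, 11.1, 12.0, 11.7]

slope, intercept, r_squared = fit_line(cumulative_net, closing_prices)
print(f"price ~ {intercept:.2f} + {slope:.3f} * cumulative_net, R^2 = {r_squared:.4f}")
```

An R² near 1 means the fitted line accounts for almost all of the variation in the closing price over the observed period.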
The project will be presented at the GSE Management Summit in Barcelona on October 14th, 2013. Here is a short interview with Ekaterina.
GSE (Guide Share Europe), a non-profit association of companies, organizations and individuals who are involved in Information and Communication Technology (ICT) solutions based on IBM architectures, established the GSE Academic Award for Excellence for students.
Further information about the awards is available on the GSE website.
To keep updated on my activities you can subscribe to the RSS feed of my blog or follow my Twitter account (@MarcoBrambi).