Blog

online news and social media

News Sharing Behaviour on Twitter. A Dataset and a Pipeline

Online social media are changing the news industry and revolutionizing the traditional role of journalists and newspapers. In this scenario, investigating how users share news is relevant: it helps in understanding the impact of online news, its propagation within social communities, and its influence on the formation of opinions, as well as in detecting individual stance on specific news or topics and in understanding the role of journalism today.

Our contribution is two-fold.

First, we build a robust pipeline for collecting datasets that describe news sharing; the pipeline takes as input a list of news sources and generates a large collection of articles, of the accounts that share them on social media (either directly or by retweeting), and of the social activities performed by these accounts.

The dataset is published on Harvard Dataverse:

https://doi.org/10.7910/DVN/5XRZLH

Second, we also provide a large-scale dataset that can be used to study the social behavior of Twitter users and their involvement in the dissemination of news items. Finally, we show an application of our data collection in the context of political stance classification and we suggest other potential uses of the presented resources.

The code is published on GitHub:

https://github.com/DataSciencePolimi/NewsAnalyzer

The details of our approach are published in a paper at ICWSM 2019, accessible online.

You can cite the paper as:

Giovanni Brena, Marco Brambilla, Stefano Ceri, Marco Di Giovanni, Francesco Pierri, Giorgia Ramponi. News Sharing User Behaviour on Twitter: A Comprehensive Data Collection of News Articles and Social Interactions. AAAI ICWSM 2019, pp. 592-597.

Slides are on Slideshare:

You can also download a summary poster.


Brand Community Analysis using Graph Representation Learning on Social Networks – with a Fashion Case

In an increasingly connected world, new and complex interaction patterns can be extracted from the communication between people.

This is extremely valuable for brands, which can better understand the interests of users and the trends on social media in order to target their products more effectively. In this paper, we aim to analyze the communities that arise around commercial brands on social networks to understand the meaning of similarity, collaboration, and interaction among users.

We exploit the network that builds around the brands by encoding it into a graph model. We build a social network graph, considering user nodes and friendship relations; then we compare it with a heterogeneous graph model, where posts and hashtags are also considered as nodes and connected to the other node types; finally, we build a reduced network, generated by inducing direct user-to-user connections through the intermediate nodes (posts and hashtags). These different variants are encoded using graph representation learning, which generates a numerical vector for each node. Machine learning techniques are applied to these vectors to extract valuable insights for each user and for the communities they belong to.
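As a minimal sketch of the reduced network described above, the snippet below induces user-to-user connections from users that share an intermediate node. The edge list, the node names, and the choice of weighting an edge by the number of shared nodes are assumptions made for illustration; the paper may construct the reduced network differently.

```python
from collections import defaultdict
from itertools import combinations

# Hypothetical heterogeneous edges: (user, intermediate node), where the
# intermediate node is a post or a hashtag. All names are illustrative.
hetero_edges = [
    ("alice", "post:1"), ("bob", "post:1"),
    ("bob", "tag:fashion"), ("carol", "tag:fashion"),
    ("alice", "tag:fashion"),
]

def induce_user_graph(edges):
    """Connect two users whenever they share at least one intermediate node.

    Returns a dict mapping an alphabetically sorted user pair to the number
    of intermediate nodes the two users have in common (assumed weighting).
    """
    users_by_node = defaultdict(set)
    for user, node in edges:
        users_by_node[node].add(user)

    weights = defaultdict(int)
    for users in users_by_node.values():
        for u, v in combinations(sorted(users), 2):
            weights[(u, v)] += 1
    return dict(weights)

reduced = induce_user_graph(hetero_edges)
for pair, weight in sorted(reduced.items()):
    print(pair, weight)
```

The resulting weighted user graph could then be passed to a node embedding method in the same way as the friendship graph.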

We report on our experiments performed on an emerging fashion brand on Instagram, and we show that our approach is able to identify potential customers for the brand and to highlight meaningful sub-communities composed of users who share the same kind of content on social networks.

The use case is taken from a joint research project with the Fashion in Process group in the Design Department of Politecnico di Milano, within the framework of FAST (Fashion Sensing Technology).

This study has been published by Springer as part of ACM SAC 2019, Cyprus.

Here is the slideset presenting the idea:

The paper can be referenced as:

Marco Brambilla, Mattia Gasparini: Brand Community Analysis On Social Networks Using Graph Representation Learning. ACM Symposium on Applied Computing (SAC) 2019, pp. 2060-2069.

The link to the officially published paper in the ACM Library will be available shortly.

A sneak peek at the European Union Ethics Guidelines for AI

A few days ago, politico.eu published a preview of the document that the European Union will issue as guidance for ethical issues related to artificial intelligence and machine learning.

The document was written by the High-level Expert Group on Artificial Intelligence, appointed by the European Commission.

This advance version of the document is now available online for a sneak peek.

The official version will be released shortly.

Besides its actual technical content, this step is also important as a matter of principle, because governmental institutions rarely feel the need to take such positions on scientific and technical evolution. This pronouncement makes it clear how strategic and crucial AI and ML are deemed today, also from a political perspective.

If you want to read more about Europe’s take on AI, you can also read this article on Medium.

Improving Topic Modeling with Knowledge Graph Embeddings

Topic modeling techniques have been applied in many scenarios in recent years, spanning textual content as well as many other data sources. Existing research in this field continuously tries to improve the accuracy and coherence of the results. Some recent works propose new methods that incorporate the semantic relations between words into the topic modeling process by employing vector embeddings over knowledge bases.

In our recent paper presented at the AAAI-MAKE Spring Symposium 2019, held at Stanford University, we studied how knowledge graph embeddings affect topic modeling performance on textual content. In particular, the objective of the work is to determine which aspects of knowledge graph embedding have a significant and positive impact on the accuracy of the extracted topics.

We improve the state of the art by integrating some advanced graph embedding approaches (specifically designed for knowledge graphs) within the topic extraction process.
We also studied how the knowledge base could be expanded by using dataset-specific relations between the words.
We implemented the method and we validated it with a set of experiments with 2 variations of the knowledge base, 7 embedding methods, and 2 methods for incorporation of the embeddings into the topic modeling framework, also considering different parameterizations of topic number and embedding dimensionality.
Besides the specific technical results, the work also aims at showing the potential of integrating statistical methods with knowledge-centric methods. The full extent of the impact of these techniques shall be explored further in the future.
The details of the work are reported in the paper, which is available online here, and in the slides, also available online (on SlideShare and here below).
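To give a sense of the size of the experimental grid described above, here is a small sketch that enumerates the configurations. Only the counts (2 knowledge base variations, 7 embedding methods, 2 incorporation methods) come from the post; the labels are placeholders, and the topic numbers and embedding dimensionalities are hypothetical values chosen for illustration.

```python
from itertools import product

# Placeholder labels: the post gives only the counts (2, 7, 2), not the names.
knowledge_bases = [f"kb_variant_{i}" for i in range(1, 3)]
embedding_methods = [f"embedding_{i}" for i in range(1, 8)]
incorporation_methods = [f"incorporation_{i}" for i in range(1, 3)]

# Hypothetical parameter values, chosen only for illustration.
topic_numbers = [20, 50]
embedding_dims = [50, 100]

# Every combination of knowledge base, embedding method, and incorporation method.
core_configs = list(product(knowledge_bases, embedding_methods, incorporation_methods))

# Each core configuration is then run for every parameterization.
full_grid = list(product(core_configs, topic_numbers, embedding_dims))

print(len(core_configs))  # 2 * 7 * 2 combinations
print(len(full_grid))     # multiplied by the assumed parameter values
```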

Possible Theses in Data Science

Here is a presentation that summarizes some of the relevant topics currently available for theses within the Data Science Lab under my supervision.

Feel free to get in touch in case you are interested.

Predictive Analysis on U.S. Midterm Elections on Twitter with RNN

We implemented an analysis (meaning both a method and a system) that aims to gauge local support for the two major US political parties in the 68 most competitive House of Representatives districts during the 2018 U.S. midterm elections.

The analysis attempts to mirror the “Generic Ballot” poll, i.e., a survey of voters of a particular district which aims to measure the local popularity of national parties by asking participants how likely they would be to vote for a “generic” Democrat or Republican candidate. We collect the tweets mentioning national parties and politicians in the 68 most competitive districts. By most competitive we mean that they are rated as toss-up (50%-50%) or lean by the Cook Political Report.

This means we are addressing an extremely challenging analysis and prediction problem, while disregarding the simpler cases (everyone is good at predicting the obvious!).

Our solution employs the Twitter Search API to query for tweets mentioning a national leader or party, posted from a limited geographic region (i.e., each specific congressional district). For example, the following query extracts tweets on Republicans:

TRUMP OR REPS OR Republicans OR Republican OR MCCONNELL OR ‘MIKE PENCE’ OR ‘PAUL RYAN’ OR #Republicans OR #REPS OR @realDonaldTrump OR @SpeakerRyan OR @senatemajldr OR @VP OR GOP OR @POTUS

To limit the search to each congressional district, we use the geocode field in the search query of the API, which queries a circular area defined by the coordinates of its center and its radius. Because of the irregular shape of the congressional districts, multiple queries are needed for each of them; we therefore built a custom set of bubbles that approximate the shape of each district.
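The bubble approach above can be sketched as follows. The geocode value follows the `latitude,longitude,radius` format (with a `km` or `mi` unit) used by the Twitter standard search API; the district identifier, the coordinates, the shortened term list, and the helper names are hypothetical, and no network call is made.

```python
# Hypothetical bubbles approximating one district:
# (latitude, longitude, radius in km). Values are illustrative only.
DISTRICT_BUBBLES = {
    "XX-01": [(40.10, -75.30, 12.0), (40.22, -75.18, 8.5)],
}

# Shortened, illustrative version of the term list shown above.
REPUBLICAN_TERMS = ["TRUMP", "GOP", "#Republicans", '"PAUL RYAN"']

def build_queries(district, terms, bubbles=DISTRICT_BUBBLES):
    """Return one search-parameter dict per bubble of the given district."""
    q = " OR ".join(terms)
    return [
        {"q": q, "geocode": f"{lat},{lon},{radius_km}km"}
        for lat, lon, radius_km in bubbles[district]
    ]

def deduplicate(tweets):
    """Overlapping bubbles can return the same tweet twice; keep one per id."""
    unique = {}
    for tweet in tweets:
        unique.setdefault(tweet["id"], tweet)
    return list(unique.values())

queries = build_queries("XX-01", REPUBLICAN_TERMS)
for params in queries:
    print(params)
```

Since the bubbles of a district may overlap, results from the individual queries would need to be merged and deduplicated before analysis, which is what `deduplicate` illustrates.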

For the analysis of the tweets, we adopted a Recurrent Neural Network, namely an RNN-LSTM binary classifier trained on tweets.

To build training and testing data, we collected tweets of users with a clear political affiliation, including candidates, political activists, and also lesser-known users well versed in the political vernacular.
The selected accounts yielded around 280,000 tweets in the 6 months before election day, labeled based on the author’s political affiliation.
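The labeling step above amounts to assigning each tweet the affiliation of its author. A minimal sketch, assuming a hand-curated mapping from accounts to parties (the account names, the tweet structure, and the numeric label encoding are hypothetical):

```python
# Hypothetical accounts with a known affiliation:
# "D" (Democrat) or "R" (Republican).
AFFILIATION = {"@dem_candidate": "D", "@rep_activist": "R"}

# Assumed numeric encoding for a binary classifier.
LABEL_ID = {"D": 0, "R": 1}

def label_tweets(tweets, affiliation=AFFILIATION):
    """Label each tweet with its author's affiliation; skip unknown authors."""
    return [
        (tweet["text"], LABEL_ID[affiliation[tweet["user"]]])
        for tweet in tweets
        if tweet["user"] in affiliation
    ]

sample = [
    {"user": "@dem_candidate", "text": "Healthcare is on the ballot."},
    {"user": "@rep_activist", "text": "Tax cuts are working."},
    {"user": "@unknown_user", "text": "Just voted!"},
]
labeled = label_tweets(sample)
print(labeled)
```

The resulting (text, label) pairs are the kind of input a binary text classifier would be trained on.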

Note that the method is a general-purpose, language-independent political analysis framework that can be applied to any national or local context.

Further details and the results can be found on this Medium post.

This work has been published as a short scientific paper presented at the IEEE Big Data Conference in Seattle, WA, in December 2018, and described in a previous Medium post by Antonio Lopardo.

You can also download a poster format reporting the work:


In case you want to cite the work, you can do it in this way:

A. Lopardo and M. Brambilla, “Analyzing and Predicting the US Midterm Elections on Twitter with Recurrent Neural Networks,” 2018 IEEE International Conference on Big Data (Big Data), Seattle, WA, USA, 2018, pp. 5389–5391.
doi: 10.1109/BigData.2018.8622441.
URL: http://ieeexplore.ieee.org/stamp/stamp.jsp?tp=&arnumber=8622441&isnumber=8621858

The online running prototype, the full description of the project, its results, and source code are available at http://www.twitterpoliticalsentiment.com/USA/.

Crash Course in Data Science at PoliMi

For the second time, we are proposing a “night-time” multidisciplinary interactive mini-course that introduces data science concepts, methods and use cases to bachelor students (and master students of different faculties such as management, design, architecture, and so on) of Politecnico di Milano.

The full program of the mini-course is:

Day | Topic | Instructor | Classroom (DEIB) | Materials
4/2/2019 | Intro to big data and data science. (Re)discovering SQL. | Ceri / Brambilla | Seminari | Intro; SQL
5/2/2019 | Big Data and NoSQL. | Brambilla | Conferenze | NoSQL overview; NoSQL Databases; Graph Databases
8/2/2019 | Data analysis: dimensionality, clustering | Brambilla | Seminari | Dimensionality Reduction & Clustering
12/2/2019 | Data analysis: classification & hands-on on machine learning, AI, neural networks, deep learning | Brambilla / Ramponi / Di Giovanni | Conferenze | Classification, neural networks, CNN, RNN, DNN, Deep learning
14/2/2019 | Hands-on data analysis | Ramponi / Di Giovanni | Seminari | Python-datascience; NN-Keras
20/2/2019 | Scenarios: Genomics, Bots and Fake News | Ceri / Daniel | Seminari | Bots and fake news
21/2/2019 | Statistics in practice | Vantini | Seminari |
27/2/2019 | Data visualization | Ciuccarelli | Seminari | Datascience Challenges; Possible Theses and Projects

The course is in Italian, with teaching materials in English.

Classes are always from 5:30pm to 7:00pm.

You can read more at:

http://datascience.deib.polimi.it/course/crash-course-data-science-peopledeib/ 

Or you can get in touch if you want more details: marco.brambilla@polimi.it.


Understanding Polarized Political Events through Social Media Analysis

Predicting the outcome of elections is a topic that has been extensively studied in political polls, which have generally provided reliable predictions by means of statistical models. In recent years, online social media platforms have become a potential alternative to traditional polls, since they provide large amounts of post and user data, also referring to socio-political aspects.

In this context, we designed a study aimed at defining a user modeling pipeline to analyze discussions and opinions shared on social media regarding polarized political events (such as a public poll or referendum).

The pipeline follows a four-step methodology.

 

  • First, social media posts and user metadata are crawled.
  • Second, a filtering mechanism is applied to remove spammers and bot users.
  • Third, demographic information is extracted for the valid users, namely gender, age, ethnicity and location.
  • Fourth, the political polarity of the users with respect to the analyzed event is predicted.
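The four steps above can be sketched as a chain of functions. Everything in this sketch is a stand-in: the function names, the post-count rule for bots, the profile lookup, and the hashtag-based polarity rule are placeholder assumptions, not the models used in the paper.

```python
from collections import defaultdict

def crawl(posts):
    """Step 1: group crawled posts by author (stand-in for a real crawler)."""
    by_user = defaultdict(list)
    for post in posts:
        by_user[post["user"]].append(post["text"])
    return dict(by_user)

def filter_bots(by_user, max_posts=3):
    """Step 2: drop suspected spammers and bots (placeholder rule: too many posts)."""
    return {u: texts for u, texts in by_user.items() if len(texts) <= max_posts}

def extract_demographics(user, profiles):
    """Step 3: gender, age, ethnicity, location (placeholder profile lookup)."""
    return profiles.get(user, {})

def predict_polarity(texts, yes_terms=("#yes",), no_terms=("#no",)):
    """Step 4: placeholder polarity rule based on hashtag counts."""
    joined = " ".join(texts).lower()
    score = sum(joined.count(t) for t in yes_terms)
    score -= sum(joined.count(t) for t in no_terms)
    if score > 0:
        return "favor"
    if score < 0:
        return "against"
    return "neutral"

def run_pipeline(posts, profiles):
    """Chain the four steps and return one record per valid user."""
    users = filter_bots(crawl(posts))
    return {
        user: {
            "demographics": extract_demographics(user, profiles),
            "polarity": predict_polarity(texts),
        }
        for user, texts in users.items()
    }

posts = [
    {"user": "anna", "text": "I will vote #yes on Sunday"},
    {"user": "bruno", "text": "Definitely #no for me"},
] + [{"user": "spam_bot", "text": "#yes #yes #yes"}] * 5
profiles = {"anna": {"location": "Milan"}}

result = run_pipeline(posts, profiles)
print(result)
```

Keeping each step as a separate function makes it possible to swap a placeholder for a trained model without touching the rest of the chain.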

In the scope of this work, our proposed pipeline is applied to two referendum scenarios:

  • independence of Catalonia in Spain
  • autonomy of Lombardy in Italy

We used these real-world examples to assess the performance of the approach with respect to the capability of collecting correct insights on the demographics of social media users and of predicting the poll results based on the opinions shared by the users.


Experiments show that the method was effective in predicting the political trends for the Catalonia case, but not for the Lombardy case. Among the possible reasons for this, we noticed that in general Twitter was more representative of the users opposing the referendum than of those in favor.

The work has been presented at the KDWEB workshop at the ICWE 2018 conference.

A preprint of the paper can be downloaded from ArXiv and cited as reported here:

Roberto Napoli, Ali Mert Ertugrul, Alessandro Bozzon, Marco Brambilla. A User Modeling Pipeline for Studying Polarized Political Events in Social Media. KDWeb Workshop 2018, co-located with ICWE 2018, Caceres, Spain, June 2018. arXiv:1807.09459

Data Cleaning for Knowledge Extraction and Understanding on Social Media

 

Social media platforms let users share their opinions through textual or multimedia content. In many settings, this becomes a valuable source of knowledge that can be exploited for specific business objectives. Brands and companies often ask to monitor social media as a source for understanding the stance, opinion, and sentiment of their customers, audience, and potential audience. This is crucial for them because it lets them understand trends and future commercial and marketing opportunities.

However, all this relies on a solid and reliable data collection phase, which ensures that all the analyses, extractions, and predictions are applied to clean and focused data. Indeed, the usual topic-based collection of social media content, performed through keyword-based search, typically yields very noisy results.

We recently carried out a simple study aimed at cleaning the data collected from social content, within specific domains or related to given topics of interest. We propose a basic method for data cleaning and removal of off-topic content based on supervised machine learning techniques (i.e., classification) applied to data collected from social media platforms through keywords regarding a specific topic. We define a general method for this and then validate it through an experiment of data extraction from Twitter, with respect to a set of famous cultural institutions in Italy, including theaters, museums, and other venues.

For this case, we collaborated with domain experts to label the dataset, and then we evaluated and compared the performance of classifiers that are trained with different feature extraction strategies.
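To illustrate the idea of classifying keyword-matched posts as on-topic or off-topic, here is a minimal sketch. The post does not state which classifiers or feature extraction strategies were compared, so the small multinomial Naive Bayes model with bag-of-words features below is a stand-in, and the labeled examples (built around the ambiguous keyword "scala") are invented for illustration.

```python
import math
from collections import Counter

def tokenize(text):
    """Bag-of-words feature extraction: lowercase whitespace tokens."""
    return text.lower().split()

class NaiveBayes:
    """Tiny multinomial Naive Bayes with add-one smoothing (stand-in classifier)."""

    def fit(self, texts, labels):
        self.word_counts = {label: Counter() for label in set(labels)}
        self.doc_counts = Counter(labels)
        for text, label in zip(texts, labels):
            self.word_counts[label].update(tokenize(text))
        self.vocab = {w for counts in self.word_counts.values() for w in counts}
        return self

    def predict(self, text):
        total_docs = sum(self.doc_counts.values())
        best_label, best_score = None, -math.inf
        for label, counts in self.word_counts.items():
            total_words = sum(counts.values())
            # Log prior plus smoothed log likelihood of each token.
            score = math.log(self.doc_counts[label] / total_docs)
            for word in tokenize(text):
                score += math.log((counts[word] + 1) / (total_words + len(self.vocab)))
            if score > best_score:
                best_label, best_score = label, score
        return best_label

# Invented training data: tweets matching the keyword "scala",
# labeled by hand as on-topic (the theatre) or off-topic (the language).
train = [
    ("tonight opera at teatro alla scala in milan", "on"),
    ("ballet season opens at la scala theatre", "on"),
    ("learning scala and spark for big data", "off"),
    ("scala compiler error in my code", "off"),
]
model = NaiveBayes().fit([t for t, _ in train], [y for _, y in train])

print(model.predict("opera tickets for la scala"))
print(model.predict("scala code error"))
```

Comparing feature extraction strategies would then mean replacing `tokenize` (for example with character n-grams) and measuring the change in accuracy on held-out labeled data.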

The work has been presented at the KDWEB workshop at the ICWE 2018 conference.

A preprint of the paper can be downloaded and cited as reported here:

Emre Calisir, Marco Brambilla. The Problem of Data Cleaning for Knowledge Extraction from Social Media. KDWeb Workshop 2018, co-located with ICWE 2018, Caceres, Spain, June 2018.

The slides used in the workshop are available online here:

 

Iterative knowledge extraction from social networks

Yesterday, we presented a new work at The Web Conference in Lyon along the research line on knowledge extraction from human generated content started with our paper “Extracting Emerging Knowledge from Social Media” presented at the WWW 2017 Conference (see also this past post).

Our motivation starts from the fact that knowledge in the world continuously evolves, and thus ontologies and knowledge bases are largely incomplete, especially regarding data belonging to the so-called long tail. Therefore, we proposed a method for discovering emerging knowledge by extracting it from social content. Once initialized by domain experts, the method is capable of finding relevant entities by means of a mixed syntactic-semantic method. The method uses seeds, i.e. prototypes of emerging entities provided by experts, to generate candidates; then, it associates each candidate with a feature vector built from the terms occurring in its social content, and ranks the candidates by their distance from the centroid of the seeds, returning the top candidates.
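The ranking step described above can be sketched as follows. The term-count features and the Euclidean distance are assumptions made for illustration (the post does not specify the feature weighting or the distance measure), and the seed and candidate texts are invented.

```python
import math
from collections import Counter

def vectorize(text, vocabulary):
    """Term-count feature vector over a fixed vocabulary (assumed features)."""
    counts = Counter(text.lower().split())
    return [counts[term] for term in vocabulary]

def centroid(vectors):
    """Component-wise mean of a list of equal-length vectors."""
    return [sum(column) / len(vectors) for column in zip(*vectors)]

def rank_candidates(seed_texts, candidate_texts, top_k):
    """Rank candidates by Euclidean distance from the centroid of the seeds."""
    all_texts = list(seed_texts.values()) + list(candidate_texts.values())
    vocabulary = sorted({term for text in all_texts for term in text.lower().split()})
    seed_centroid = centroid([vectorize(t, vocabulary) for t in seed_texts.values()])
    distances = {
        name: math.dist(vectorize(text, vocabulary), seed_centroid)
        for name, text in candidate_texts.items()
    }
    # Closest candidates first.
    return sorted(distances, key=distances.get)[:top_k]

# Invented social content for two expert-provided seeds and two candidates.
seeds = {
    "seed_1": "fashion designer milan runway",
    "seed_2": "fashion brand milan collection",
}
candidates = {
    "cand_1": "emerging fashion designer milan",
    "cand_2": "football match tonight",
}
ranking = rank_candidates(seeds, candidates, top_k=2)
print(ranking)
```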

Based on this foundational idea, we explored the possibility of running our method iteratively, using the results as new seeds. In this paper we address the following research questions:

  1. How does the reconstructed domain knowledge evolve if the candidates of one extraction are recursively used as seeds?
  2. How does the reconstructed domain knowledge spread geographically?
  3. Can the method be used to inspect the past, present, and future of knowledge?
  4. Can the method be used to find emerging knowledge?
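The iterative scheme behind these questions, where the results of one extraction become the seeds of the next, can be sketched as a simple loop. The `extract` argument stands for any single-round extraction function; the toy neighbour map and the choice to discard already-seen entities are assumptions made for illustration.

```python
def iterate_extraction(initial_seeds, extract, rounds):
    """Run `extract` repeatedly, feeding each round's new candidates back as seeds.

    Returns the list of seed sets used in each round, starting with the
    expert-provided seeds. Stops early when a round finds nothing new.
    """
    seeds = list(initial_seeds)
    history = [seeds]
    seen = set(seeds)
    for _ in range(rounds):
        # Assumption: entities found in earlier rounds are not reused.
        candidates = [c for c in extract(seeds) if c not in seen]
        if not candidates:
            break
        seen.update(candidates)
        seeds = candidates
        history.append(seeds)
    return history

# Toy single-round extraction: each entity "leads to" a fixed neighbour.
NEIGHBOURS = {"a": ["b"], "b": ["c"], "c": ["a"]}

def toy_extract(seeds):
    return [n for s in seeds for n in NEIGHBOURS.get(s, [])]

history = iterate_extraction(["a"], toy_extract, rounds=5)
print(history)
```

Inspecting `history` round by round is one way to observe how the reconstructed domain drifts away from the original seeds.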

This is the presentation given at the conference:

This work was presented at The Web Conference 2018, in the Modeling Social Media (MSM) workshop.

The paper is in the official proceedings of the conference through the ACM Digital Library.

You can also find here a PDF preprint version of “Iterative Knowledge Extraction from Social Networks” by Brambilla et al.