Marco Brambilla

M.Sc. Thesis Topics and Proposals at Polimi Data Science Lab – 2023/24

December 21, 2023 by Marco Brambilla, posted in data science

Within the context of our Data Science Lab and research team, we offer a variety of thesis options.

Check them out in these slide decks:

M.Sc. Thesis Topics and Proposals @ Polimi Data Science Lab – 2024 – prof. Brambilla Marco from Marco Brambilla

Thesis Topics and Proposals @ Polimi Data Science Lab – 2023 – prof. Brambilla Marco from Marco Brambilla

Hierarchical Transformers for User Semantic Similarity.

June 15, 2023 by Marco Brambilla, posted in big data, data science, event, social network, social network analysis, social science

Our research o Hierarchical Transformers for User Semantic Similarity has been presented at ICWE 2023.

We discuss the use of hierarchical transformers for user semantic similarity in the context of analyzing users’ behavior and profiling social media users. The objectives of the research include finding the best model for computing semantic user similarity, exploring the use of transformer-based models, and evaluating whether the embeddings reflect the desired similarity concept and can be used for other tasks.

The full paper is published online by Springer in the official Conference Proceedings at this url:

https://link.springer.com/chapter/10.1007/978-3-031-34444-2_11

This work aims to compute accurate user similarities on Twitter just using the textual content shared by users, a feature known to be easy and quick to collect. We design and train a 2-stages hierarchical Transformer-based model, whose first stage independently elaborates single tweets, and its second stage combines the embeddings of the tweets to obtain user-level representations. To evaluate our model we design a ranking task involving many accounts, automatically collected and labeled without the need for human annotators. We extensively investigate hyper-parameters to obtain the best model configuration.

The slides about the work are available here:

Hierarchical Transformersfor User Semantic Similarity – ICWE 2023 from Marco Brambilla

We use a large dataset of Twitter users and apply an automatic labeling approach. The dataset consists of English tweets posted in November and December 2020, totaling about 27GB of compressed data. Preprocessing steps include filtering out short texts, cleaning user connections, and selecting a benchmark set of users for evaluation.

The models used in the study include hierarchical transformers, and the tweet embeddings are obtained using four Transformer-based models: RoBERTa2, BERTweet3, Sentence BERT4, and Twitter4SSE5. We test different techniques for processing tweet embeddings to generate accurate user embeddings, including mean pooling, recurrence over BERT (RoBERT), and transformer over BERT (ToBERT).

Since Transformer architectures are known to work well on short text, we cannot use them on extensive collections of tweets describing the activity of a user. Therefore, we propose a hierarchical structure of transformer models to be used as shown in this schema:

2-stages hierarchical Transformer-based model, whose first stage independently elaborates single tweets, and its second stage combines the embeddings of the tweets to obtain user-level representations.

The evaluation of the models is done on a set of 5,000 users, comparing user similarities with 30 other candidate users, 5 of which are considered similar and 25 considered dissimilar. The evaluation metrics used include mean average precision (MAP), mean reciprocal rank (MRR) at 10, and normalized discounted cumulative gain (nDCG).

The optimization process involves selecting a loss function and using the AdamW optimizer with specific hyperparameters.

Hierarchical Transformers for User Semantic Similarity

We also check whether the obtained embeddings reflect our idea of similarity by testing them on further tasks, including community visualization, outlier detection, and polarization quantification.

The results show that the hierarchical approach with a Stage-1 Twitter4SSE model and a Stage-2 Transformer model performs the best among the alternatives.

In conclusion, the research provides a large unbiased dataset for user similarity analysis, presents a hierarchical language model optimized for accurate user similarity computation, and validates the models’ performance on similarity tasks, with potential applications to related problems.
The future work includes investigating the impact of time and topic drift on the models’ performance.

The paper can be cited as:

Marco Di Giovanni, Marco Brambilla (2023). Hierarchical Transformers for User Semantic Similarity. In: Garrigós, I., Murillo Rodríguez, J.M., Wimmer, M. (eds) Web Engineering. ICWE 2023. Lecture Notes in Computer Science, vol 13893. Springer, Cham. https://doi.org/10.1007/978-3-031-34444-2_11

Visual rendering of additional tasks performed using the generated embeddings: detection of communities (left, projected on 2 principal components) and detection of polarization (right, density distribution over the first principal component). Apparently, the model works quite well.

Online courses on Data, Policies, and COVID-19

February 15, 2023 by Marco Brambilla, posted in big data, data science, social science

Our lab is participating in the PERISCOPE H2020 project, a partnership of 30+ top European universities and associations of professionals worked together for the last two years to study data, policies, actions, and effects of pandemic management. Besides the high-impact research results, the consortium worked on implementing educational materials and courses.

Among those, five online courses (MOOCs) that collect technical and policy solutions to pandemic challenges have been published on Coursera:

“Reaching Vulnerable Groups through Pandemic Policy”, developed by Ghent University and the London School of Economics.
“One Health: pandemic preparedness, prevention, and response”, produced by the Karolinska Institutet and the Federation of European Academies of Medicine (FEAM).
“Strengthening territorial response for better health”, realised by the European Regional and Local Health Authorities (EUREGHA), Mental Health Europe (MHE), and Agència de Qualitat i Avaluació Sanitàries de Catalunya (AQuAS).
“Data science perspectives on pandemics management”, resulted from the collaboration between Politecnico di Milano, University of Pavia, ISI Foundation, and the Delft University of Technology.
“Trustworthy Artificial Intelligence for Healthcare Management”, created by the Goethe University Frankfurt.

Ahead of the publication, the courses were tested by health authorities, policymakers, and public bodies. All courses are free to access.

You can access the courses from this list on Coursera.

Exploring the bi-verse: a trip across the digital and physical ecospheres

November 4, 2022 by Marco Brambilla, posted in big data, social network, social network analysis, smartcity, event, data science, IoT

I’ve been invited to give a keynote talk at the WISE 2022 Conference. Thinking about it, I decided to focus on my idea of a bi-verse. To me, the bi-verse is the duality between the physical and digital worlds.

On one side, the Web and social media are the environments where people post their content, opinions, activities, and resources. Therefore, a considerable amount of user-generated content is produced every day for a wide variety of purposes.

On the other side, people live their everyday life immersed in the physical world, where society, economy, politics, and personal relations continuously evolve. These two opposite and complementary environments are today fully integrated: they reflect each other and they interact with each other in a stronger and stronger way.

Exploring and studying content and data coming from both environments offers a great opportunity to understand the ever-evolving modern society, in terms of topics of interest, events, relations, and behavior.

This slidedeck summarizes my contribution:

In my speech, I discuss business cases and socio-political scenarios, to show how we can extract insights and understand reality by combining and analyzing data from the digital and physical world, so as to reach a better overall picture of reality itself. Along this path, we need to keep into account that reality is complex and varies in time, space, and many other dimensions, including societal and economic variables. The speech highlights the main challenges that need to be addressed and outlines some data science strategies that can be applied to tackle these specific challenges.

The Role of Human Knowledge in Explainable AI

July 11, 2022 by Marco Brambilla, posted in big data, data science

Machine learning and AI are facing a new challenge: making models more explainable.

This means to develop new methodologies to describe the behaviour of widely adopted black-box models, i.e., high-performing models whose internal logic is challenging to describe, justify, and understand from a human perspective.

The final goal of an explainability method is to faithfully describe the behaviour of a (black-box) model to users who can get a better understanding of its logic, thus increasing the trust and acceptance of the system.

Unfortunately, state-of-the-art explainability approaches may not be enough to guarantee the full understandability of explanations from a human perspective. For this reason, human-in-the-loop methods have been widely employed to enhance and/or evaluate explanations of machine learning models. These approaches focus on collecting human knowledge that AI systems can then employ or involving humans to achieve their objectives (e.g., evaluating or improving the system).

Based on these assumptions and requirements, we published a review article that aims to present a literature overview on collecting and employing human knowledge to improve and evaluate the understandability of machine learning models through human-in-the-loop approaches. The paper features a discussion on the challenges, state-of-the-art, and future trends in explainability.

The paper starts from the definition of the notion of “explanation” as an “interface between humans and a decision-maker that is, at the same time, both an accurate proxy of the decision-maker and comprehensible to humans”. Such a description highlights two fundamental features an explanation should have. It must be accurate, i.e., it must faithfully represent the model’s behaviour, and comprehensible, i.e., any human should be able to understand the meaning it conveys.

The Role of Human Knowledge in Explainable AI

The figure above summarizes the four main ways to use human knowledge in explainability, namely: knowledge collection for explainability (red), explainability evaluation (green), understanding human’s perspective in explainability (blue), and improving model explainability (yellow). In the schema, the icons represent human actors.

You may cite the paper as:

Tocchetti, Andrea; Brambilla, Marco. The Role of Human Knowledge in Explainable AI. Data 2022, 7, 93. https://doi.org/10.3390/ data7070093

The VaccinEU dataset of COVID-19 Vaccine Conversations on Twitter in French, German, and Italian

June 29, 2022 by Marco Brambilla, posted in big data, data science, social network, social network analysis, social science

Despite the increasing limitations for unvaccinated people, in many European countries, there is still a non-negligible fraction of individuals who refuse to get vaccinated against SARS-CoV-2, undermining governmental efforts to eradicate the virus.

Within the PERISCOPE project, we studied the role of online social media in influencing individuals’ opinions about getting vaccinated by designing a large-scale collection of Twitter messages in three different languages — French, German, and Italian — and providing public access to the data collected. This work was implemented in collaboration with Observatory on Social Media, Indiana University, Bloomington, USA.

Focusing on the European context, we devised an open dataset called VaccinEU, that aims to help researchers to better understand the impact of online (mis)information about vaccines and design more accurate communication strategies to maximize vaccination coverage.

The dataset is openly accessible in a Dataverse repository and a GitHub repository.

Furthermore, a description has been published in a paper at ICWSM 2022 (open access), which can be cited as:

Di Giovanni, M., Pierri, F., Torres-Lugo, C., & Brambilla, M. (2022). VaccinEU: COVID-19 Vaccine Conversations on Twitter in French, German and Italian. Proceedings of the International AAAI Conference on Web and Social Media, 16(1), 1236-1244. https://ojs.aaai.org/index.php/ICWSM/article/view/19374

Model Driven Software Engineering in Practice now published by Springer Nature

June 29, 2022 by Marco Brambilla, posted in modeling, software engineering

Starting June 2022, our book “Model Driven Software Engineering in Practice” (co-authored with Jordi Cabot and Manuel Wimmer) is now also available via Springer . This means the price is actually lower, and if you are affiliated with an academic institution, you may even have free access to the book through your institutional access. Check it here.

Together with the book, we provided free bonus material including: over 500 slides on MDE ready to use for classes; and all the book examples in the book’s github repository.

As of today, there are 130 institutions already using the book . Make sure you join the list if you are not there yet.

The new book cover. Not particularly appealing, but aligned with the Springer style.

EXP-Crowd: Gamified Crowdsourcing for AI Explainability

June 1, 2022 by Marco Brambilla, posted in big data, data science

The spread of AI and black-box machine learning models makes it necessary to explain their behavior. Consequently, the research field of Explainable AI was born. The main objective of an Explainable AI system is to be understood by a human as the final beneficiary of the model.

In our research we just published on Frontiers in Artificial Intelligence, we frame the explainability problem from the crowd’s point of view and engage both users and AI researchers through a gamified crowdsourcing framework. We research whether it’s possible to improve the crowd’s understanding of black-box models and the quality of the crowdsourced content by engaging users in gamified activities through a crowdsourcing framework called EXP-Crowd. While users engage in such activities, AI researchers organize and share AI- and explainability-related knowledge to educate users.

The next diagram shows the interaction flows of researchers (dashed cyan arrows) and users (orange plain arrows) with the activities devised within our framework. Researchers organize users’ knowledge and set up activities to collect data. As users engage with such activities, they provide Content to researchers. In turn, researchers give the user feedback about the activity they performed. Such feedback aims to improve users’ understanding of the activity itself, the knowledge, and the context provided within it.

Interaction flows of researchers (dashed cyan arrows) and users (orange plain arrows) in the EXP-Crowd framework.

In our recent paper published on Frontiers in Artificial Intelligence, we present the preliminary design of a game with a purpose (G.W.A.P.) to collect features describing real-world entities which can be used for explainability purposes.

One of the crucial steps in the process is the questions and annotation challenge, where Player 1 asks yes/no questions about the entity to be explained. Player 2 answers such questions, and then is asked to complete a series of simple tasks to identify the guessed feature by answering questions and potentially annotating the picture as shown below.

Questioning and annotation steps within the explanation game.

If you are interested in more details, you can read the full EXP-Crowd paper on the journal site (full open access):

You can cite the paper as:

Tocchetti A., Corti L., Brambilla M., and Celino I. (2022). EXP-Crowd: A Gamified Crowdsourcing Framework for Explainability. Frontiers in Artificial Intelligence 5:826499. doi: 10.3389/frai.2022.826499

Analysis of Online Reviews for Evaluating the Quality of Cultural Tourism

December 3, 2021December 3, 2021 by Marco Brambilla, posted in big data, data science, smartcity, social network, social network analysis

Online reviews have long represented a valuable source for data analysis in the tourism field, but these data sources have been mostly studied in terms of the numerical ratings offered by the review platforms.

In a recent article (available as full open-access) and a respective blog post, we explored if social media and online review platforms can be a good source of quantitative evaluation of service quality of cultural venues, such as museums, theaters and so on. Our paper applies automatic analysis of online reviews, by comparing two different automated analysis approaches to evaluate which of the two is more adequate for assessing the quality dimensions. The analysis covers user-generated reviews over the top 100 Italian museums.

Specifically, we compare two approaches:

a ‘top-down’ approach that is based on a supervised classification based upon strategic choices defined by policy makers’ guidelines at the national level;
a ‘bottom-up’ approach that is based on an unsupervised topic model of the online words of reviewers.

The misalignment of the results of the ‘top-down’ strategic studies and ‘bottom-up’ data-driven approaches highlights how data science can offer an important contribution to decision making in cultural tourism. Both the analysis approaches have been applied to the same dataset of 14,250 Italian reviews.

We identified five quality dimensions that follow the ‘top-down’ perspective: Ticketing and Welcoming, Space, Comfort, Activities, and Communication. Each of these dimensions has been considered as a class in a classification problem over user reviews. The top down approach allowed us to tag each review as descriptive of one of those 5 dimensions. Classification has been implemented both as a machine learning classification problem (using BERT, accuracy 88%) and as and keyword-based tagging (accuracy 80%).

The ‘bottom-up’ approach has been implemented through an unsupervised topic modelling approach, namely LDA (Latent Dirichlet Allocation), implemented and tuned over a range up to 30 topics. The best ‘bottom-up’ model we selected identifies 13 latent dimensions in review texts. We further integrated them in 3 main topics: Museum Cultural Heritage, Personal Experience and Museum Services.

The ‘top-down’ approach (based on a set of keywords defined from the standards issued by the policy maker) resulted in 63% of online reviews that did not fit into any of the predefined quality dimension.

63% of the reviews could not be assessed against the official top-down service quality categories.

The ‘bottom-up’ data-driven approach overcomes this limitation by searching for the aspects of interest using reviewers’ own words. Indeed, usually museum reviews discuss more about a museum’s cultural heritage aspects (46% average probability) and personal experiences (31% average probability) than the services offered by the museum (23% average probability).

Among the various quantitative findings of the study, I think the most important point is that the aspects considered as quality dimensions by the decision maker can be highly different from those aspects perceived as quality dimensions by museum visitors.

You can find out more about this analysis by reading the full article published online as open-access, or this longer blog post . The full reference to the paper is:

Agostino, D.; Brambilla, M.; Pavanetto, S.; Riva, P. The Contribution of Online Reviews for Quality Evaluation of Cultural Tourism Offers: The Experience of Italian Museums. Sustainability 2021, 13, 13340. https://doi.org/10.3390/su132313340

Share this:

Share this:

Share this:

Share this:

Share this:

Share this:

Share this:

Share this:

Share this:

Share this: