Analysis of Online Reviews for Evaluating the Quality of Cultural Tourism

Online reviews have long represented a valuable source for data analysis in the tourism field, but these data sources have been mostly studied in terms of the numerical ratings offered by the review platforms.

In a recent article (available in full open access) and an accompanying blog post, we explored whether social media and online review platforms can be a good source for the quantitative evaluation of the service quality of cultural venues, such as museums and theaters. Our paper applies automatic analysis of online reviews, comparing two automated analysis approaches to evaluate which of the two is more adequate for assessing quality dimensions. The analysis covers user-generated reviews of the top 100 Italian museums.

Specifically, we compare two approaches:

  • a ‘top-down’ approach, based on supervised classification driven by strategic choices defined in policy makers’ guidelines at the national level; 
  • a ‘bottom-up’ approach, based on an unsupervised topic model of the reviewers’ own words.

The misalignment between the results of the ‘top-down’ strategic studies and the ‘bottom-up’ data-driven approaches highlights how data science can offer an important contribution to decision making in cultural tourism. Both analysis approaches have been applied to the same dataset of 14,250 Italian reviews.

We identified five quality dimensions following the ‘top-down’ perspective: Ticketing and Welcoming, Space, Comfort, Activities, and Communication. Each of these dimensions has been considered as a class in a classification problem over user reviews, so that the top-down approach allowed us to tag each review as descriptive of one of those five dimensions. Classification has been implemented both as a machine learning problem (using BERT, accuracy 88%) and as keyword-based tagging (accuracy 80%).
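The keyword-based variant can be sketched in a few lines. Note that the keyword lists below are illustrative placeholders I made up for the example, not the ones derived from the official guidelines:

```python
# Minimal sketch of the keyword-based 'top-down' tagger.
# The keyword lists are illustrative, not those from the policy guidelines.

KEYWORDS = {
    "Ticketing and Welcoming": ["ticket", "entrance", "queue", "staff"],
    "Space": ["room", "building", "layout"],
    "Comfort": ["seat", "rest", "toilet", "air conditioning"],
    "Activities": ["tour", "workshop", "exhibition"],
    "Communication": ["sign", "label", "audio guide", "explanation"],
}

def tag_review(text: str) -> list[str]:
    """Return the quality dimensions whose keywords appear in the review."""
    lowered = text.lower()
    return [dim for dim, words in KEYWORDS.items()
            if any(w in lowered for w in words)]

print(tag_review("Long queue at the entrance, but the audio guide was great"))
# → ['Ticketing and Welcoming', 'Communication']
```

A review matching no keyword list at all is exactly the "unclassifiable" case discussed below.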

The ‘bottom-up’ approach has been implemented through an unsupervised topic modelling technique, namely LDA (Latent Dirichlet Allocation), implemented and tuned over a range of up to 30 topics. The best ‘bottom-up’ model we selected identifies 13 latent dimensions in review texts, which we further aggregated into 3 main topics: Museum Cultural Heritage, Personal Experience and Museum Services.
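A minimal sketch of such a tuning loop, using scikit-learn's LDA over a toy corpus (the actual study works on 14,250 reviews and explores up to 30 topics; corpus and grid here are illustrative):

```python
# Sketch of the 'bottom-up' LDA tuning loop on a toy corpus.
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import LatentDirichletAllocation

reviews = [
    "wonderful paintings and frescoes, a true masterpiece collection",
    "the staff was kind and the ticket queue moved quickly",
    "great experience with my family, the kids loved the visit",
    "beautiful sculptures and ancient roman artifacts on display",
]

vectorizer = CountVectorizer(stop_words="english")
X = vectorizer.fit_transform(reviews)

best_model, best_perplexity = None, float("inf")
for k in range(2, 5):          # the study explored up to 30 topics
    lda = LatentDirichletAllocation(n_components=k, random_state=0).fit(X)
    p = lda.perplexity(X)      # lower perplexity = better fit
    if p < best_perplexity:
        best_model, best_perplexity = lda, p

print(best_model.n_components)
```

In practice one would also inspect topic coherence and the top words per topic before picking the final model, which is how latent dimensions get human-readable labels.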

The ‘top-down’ approach (based on a set of keywords defined from the standards issued by the policy maker) resulted in 63% of online reviews that did not fit into any of the predefined quality dimensions.

63% of the reviews could not be assessed against the official top-down service quality categories.

The ‘bottom-up’ data-driven approach overcomes this limitation by searching for the aspects of interest using the reviewers’ own words. Indeed, museum reviews usually discuss a museum’s cultural heritage (46% average probability) and personal experiences (31% average probability) more than the services offered by the museum (23% average probability).

Among the various quantitative findings of the study, I think the most important point is that the aspects considered quality dimensions by the decision maker can differ greatly from the aspects perceived as quality dimensions by museum visitors.

You can find out more about this analysis by reading the full article, published online as open access, or the longer blog post. The full reference to the paper is:

Agostino, D.; Brambilla, M.; Pavanetto, S.; Riva, P. The Contribution of Online Reviews for Quality Evaluation of Cultural Tourism Offers: The Experience of Italian Museums. Sustainability 2021, 13, 13340. https://doi.org/10.3390/su132313340


News Sharing Behaviour on Twitter. A Dataset and a Pipeline

Online social media are changing the news industry and revolutionizing the traditional role of journalists and newspapers. In this scenario, investigating how users share news is relevant: it provides means for understanding the impact of online news, their propagation within social communities, and their influence on the formation of opinions, as well as for effectively detecting individuals’ stance on specific news or topics and for understanding the role of journalism today.

Our contribution is two-fold.

First, we build a robust pipeline for collecting datasets describing news sharing; the pipeline takes as input a list of news sources and generates a large collection of articles, of the accounts that share them on social media (either directly or by retweeting), and of the social activities performed by those accounts.
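As a hypothetical sketch of that grouping step, here is how tweets linking to the input sources could be organized into article and account collections (names and schema are my own illustration, not the released dataset's actual structure):

```python
# Illustrative skeleton of the pipeline's grouping step.
# Entity names and fields are invented for the example.
from dataclasses import dataclass, field

@dataclass
class Article:
    url: str
    source: str          # one of the input news sources

@dataclass
class Account:
    user_id: str
    shared: list = field(default_factory=list)   # article URLs shared/retweeted

def collect(news_sources, raw_tweets):
    """Group tweets linking to the given sources into articles and accounts."""
    articles, accounts = {}, {}
    for t in raw_tweets:
        if t["source"] in news_sources:          # keep only listed sources
            articles.setdefault(t["url"], Article(t["url"], t["source"]))
            acc = accounts.setdefault(t["user"], Account(t["user"]))
            acc.shared.append(t["url"])
    return list(articles.values()), list(accounts.values())

tweets = [
    {"user": "u1", "url": "nyt.com/a1", "source": "nytimes"},
    {"user": "u2", "url": "nyt.com/a1", "source": "nytimes"},
    {"user": "u1", "url": "blog.com/x", "source": "randomblog"},
]
articles, accounts = collect({"nytimes"}, tweets)
print(len(articles), len(accounts))   # 1 article shared by 2 accounts
```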

The dataset is published on Harvard Dataverse:

https://doi.org/10.7910/DVN/5XRZLH

Second, we provide a large-scale dataset that can be used to study the social behaviour of Twitter users and their involvement in the dissemination of news items. Finally, we show an application of our data collection to political stance classification, and we suggest other potential usages of the presented resources.

The code is published on GitHub:

https://github.com/DataSciencePolimi/NewsAnalyzer

The details of our approach are published in a paper at ICWSM 2019, accessible online.

You can cite the paper as:

Giovanni Brena, Marco Brambilla, Stefano Ceri, Marco Di Giovanni, Francesco Pierri, Giorgia Ramponi. News Sharing User Behaviour on Twitter: A Comprehensive Data Collection of News Articles and Social Interactions. AAAI ICWSM 2019, pp. 592-597.

Slides are on Slideshare:

You can also download a summary poster.


 

Understanding Polarized Political Events through Social Media Analysis

Predicting the outcome of elections is a topic that has been extensively studied in political polls, which have generally provided reliable predictions by means of statistical models. In recent years, online social media platforms have become a potential alternative to traditional polls, since they provide large amounts of post and user data, also referring to socio-political aspects.

In this context, we designed a study aimed at defining a user modeling pipeline to analyze discussions and opinions shared on social media regarding polarized political events (such as a public poll or referendum).

The pipeline follows a four-step methodology.

 

  • First, social media posts and users metadata are crawled.
  • Second, a filtering mechanism is applied to filter out spammers and bot users.
  • Third, demographic information is extracted for the valid users, namely gender, age, ethnicity and location.
  • Fourth, the political polarity of the users with respect to the analyzed event is predicted.
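The last step can be illustrated with a deliberately simplistic hashtag-based scorer; the hashtag lists are invented for the example, and the study's actual polarity predictor is more sophisticated than this:

```python
# Toy illustration of polarity prediction from a user's hashtags.
# Hashtag lists are invented, not those used in the study.

PRO = {"#si", "#independence"}
CON = {"#no", "#union"}

def user_polarity(hashtags):
    """Score a user's posts: +1 per pro hashtag, -1 per con hashtag."""
    score = sum(+1 if h in PRO else -1 if h in CON else 0 for h in hashtags)
    return "pro" if score > 0 else "con" if score < 0 else "neutral"

print(user_polarity(["#si", "#independence", "#no"]))  # → pro
```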

In the scope of this work, our proposed pipeline is applied to two referendum scenarios:

  • independence of Catalonia in Spain
  • autonomy of Lombardy in Italy

We used these real-world examples to assess the performance of the approach with respect to the capability of collecting correct insights on the demographics of social media users and of predicting the poll results based on the opinions shared by the users.


Experiments show that the method was effective in predicting the political trend for the Catalonia case, but not for the Lombardy case. Among the various possible motivations, we noticed that, in general, Twitter was more representative of the users opposing the referendum than of those in favour.

The work has been presented at the KDWEB workshop at the ICWE 2018 conference.

A preprint of the paper can be downloaded from ArXiv and cited as reported here:

Roberto Napoli, Ali Mert Ertugrul, Alessandro Bozzon, Marco Brambilla. A User Modeling Pipeline for Studying Polarized Political Events in Social Media. KDWeb Workshop 2018, co-located with ICWE 2018, Caceres, Spain, June 2018. arXiv:1807.09459

IEEE Big Data Conference 2017: take home messages from the keynote speakers

I collected here the list of my write-ups of the first three keynote speeches of the conference:

Driving Style and Behavior Analysis based on Trip Segmentation over GPS Information through Unsupervised Learning

Over one billion cars interact with each other on the road every day. Each driver has his own driving style, which could impact safety, fuel economy and road congestion. Knowledge about the driving style of the driver could be used to encourage “better” driving behaviour through immediate feedback while driving, or by scaling auto insurance rates based on the aggressiveness of the driving style.
In this work we report on our study of driving behaviour profiling based on unsupervised data mining methods. The main goal is to detect the different driving behaviours, and thus to cluster drivers with similar behaviour. This paves the way to new business models related to the driving sector, such as Pay-How-You-Drive insurance policies and car rentals. Here is the presentation I gave on this topic:

Driver behavioural characteristics are studied by collecting information from GPS sensors on the cars and by applying three different analysis approaches (DP-means, Hidden Markov Models, and Behavioural Topic Extraction) to the contextual scene detection problem on car trips, in order to detect different behaviours along each trip. Drivers are subsequently clustered into similar profiles, and the results are compared with a human-defined ground truth classification of drivers.
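Of the three approaches, DP-means is the simplest to sketch: it behaves like k-means, but spawns a new cluster whenever a point lies farther than a penalty distance λ from every existing centroid, so the number of clusters emerges from the data. A minimal illustrative implementation (not our actual code):

```python
# Minimal DP-means sketch: k-means with a distance penalty lam that
# spawns a new cluster for points far from every existing centroid.
import numpy as np

def dp_means(X, lam, n_iter=10):
    centroids = [X[0]]
    for _ in range(n_iter):
        labels = []
        for x in X:
            d = [np.linalg.norm(x - c) for c in centroids]
            if min(d) > lam:
                centroids.append(x)              # open a new cluster
                labels.append(len(centroids) - 1)
            else:
                labels.append(int(np.argmin(d)))
        # recompute each centroid as the mean of its assigned points
        centroids = [X[np.array(labels) == k].mean(axis=0)
                     for k in range(len(centroids))]
    return np.array(labels), centroids

X = np.array([[0.0, 0.0], [0.1, 0.0], [5.0, 5.0], [5.1, 5.0]])
labels, centroids = dp_means(X, lam=2.0)
print(len(centroids))  # → 2 clusters emerge from the data
```

The same λ-driven idea applies to trip segments described by GPS-derived features (speed, acceleration, etc.) rather than 2D points.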

The proposed framework is tested on a real dataset containing sampled car signals. While the different approaches show relevant differences in trip segment classification, the coherence of the final driver clustering results is surprisingly high.

 


This work has been published at the 4th IEEE Big Data Conference, held in Boston in December 2017. The full paper can be cited as:

M. Brambilla, P. Mascetti and A. Mauri, “Comparison of different driving style analysis approaches based on trip segmentation over GPS information,” 2017 IEEE International Conference on Big Data (Big Data), Boston, MA, 2017, pp. 3784-3791.
doi: 10.1109/BigData.2017.8258379

You can download the full paper PDF from the IEEE Xplore Digital Library, at this URL:

https://ieeexplore.ieee.org/document/8258379/

If you are interested in further contributions at the conference, here you can find my summaries of the keynote speeches on human-in-the-loop machine learning and on increasing human perception through text mining.

A Curated List of WWW 2017 Papers for Data Science and Web Science

This year the WWW 2017 conference is definitely placing a lot of emphasis on Web Science and Data Science.

I’m recording here a list of papers I found interesting at the conference, related to this topic. Disclaimer: the list may be incomplete, as I did not go through all the papers. If you want full coverage of the conference, you can browse the full WWW proceedings, which are entirely available online as open-access Creative Commons content.

Anyway, here is my list:

Prices and Subsidies in the Sharing Economy (Page 53)
Zhixuan Fang (Tsinghua University)
Longbo Huang (Tsinghua University)
Adam Wierman (California Institute of Technology)

Understanding and Discovering Deliberate Self-harm Content in Social Media (Page 93)
Yilin Wang (Arizona State University)
Jiliang Tang (Michigan State University)
Jundong Li (Arizona State University)
Baoxin Li (Arizona State University)
Yali Wan (University of Washington)
Clayton Mellina (Yahoo Research)
Neil O’Hare (Yahoo Research)
Yi Chang (Huawei Research America)

Cataloguing Treatments Discussed and Used in Online Autism Communities (Page 123)
Shaodian Zhang (Columbia University)
Tian Kang (Columbia University)
Lin Qiu (Shanghai Jiao Tong University)
Weinan Zhang (Shanghai Jiao Tong University)
Yong Yu (Shanghai Jiao Tong University)
Noémie Elhadad (Columbia University)

Neural Collaborative Filtering (Page 173)
Xiangnan He (National University of Singapore)
Lizi Liao (National University of Singapore)
Hanwang Zhang (Columbia University)
Liqiang Nie (Shandong University)
Xia Hu (Texas A&M University)
Tat-Seng Chua (National University of Singapore)

Exact Computation of Influence Spread by Binary Decision Diagrams (Page 947)
Takanori Maehara (Shizuoka University & RIKEN Center for Advanced Intelligence Project)
Hirofumi Suzuki (Hokkaido University)
Masakazu Ishihata (Hokkaido University)

Secure Centrality Computation Over Multiple Networks (Page 957)
Gilad Asharov (Cornell-Tech)
Francesco Bonchi (ISI Foundation)
David García Soriano (Eurecat & Pompeu Fabra University)
Tamir Tassa (The Open University)

Interplay between Social Influence and Network Centrality: A Comparative Study on Shapley Centrality and Single-Node-Influence Centrality (Page 967)
Wei Chen (Microsoft Research)
Shang-Hua Teng (University of Southern California)

Portfolio Optimization for Influence Spread (Page 977)
Naoto Ohsaka (The University of Tokyo)
Yuichi Yoshida (National Institute of Informatics & Preferred Infrastructure, Inc.)

Extracting and Ranking Travel Tips from User-Generated Reviews (Page 987)
Ido Guy (Ben-Gurion University of the Negev & eBay Research)
Avihai Mejer (Yahoo Research)
Alexander Nus (Yahoo Research)
Fiana Raiber (Technion – Israel Institute of Technology)

Information Extraction in Illicit Web Domains (Page 997)
Mayank Kejriwal (University of Southern California)
Pedro Szekely (University of Southern California)

Learning to Extract Events from Knowledge Base Revisions (Page 1007)
Alexander Konovalov (Ohio State University)
Benjamin Strauss (Ohio State University)
Alan Ritter (Ohio State University)
Brendan O’Connor (University of Massachusetts, Amherst)

CoType: Joint Extraction of Typed Entities and Relations with Knowledge Bases (Page 1015)
Xiang Ren (University of Illinois at Urbana-Champaign)
Zeqiu Wu (University of Illinois at Urbana-Champaign)
Wenqi He (University of Illinois at Urbana-Champaign)
Meng Qu (University of Illinois at Urbana-Champaign)
Clare R. Voss (Army Research Laboratory)
Heng Ji (Rensselaer Polytechnic Institute)
Tarek F. Abdelzaher (University of Illinois at Urbana-Champaign)
Jiawei Han (University of Illinois at Urbana-Champaign)

Urban Data Science Bootcamp

We organize a crash-course on how the science of urban data can be applied to solve metropolitan issues.


The course is a 2-day face-to-face event with teaching sessions, workshops, case study discussions and hands-on activities for non-IT professionals in the field of city management. It is offered in two editions during the year:

  • in Milan, Italy, on November 8th-9th, 2017
  • in Amsterdam, The Netherlands, on November 30th-December 1st, 2017.

You can download the flyer and program of the Urban datascience bootcamp 2017.

Ideal participants include civil servants, professionals, students, urban planners, and managers of city utilities and services. No previous experience in data science or computer science is required. Attendees should have experience in areas such as economic affairs, urban development, management support, strategy & innovation, health & care, or public order & safety.

Data is the catalyst needed to make the smart city vision a reality in a transparent and evidence-based (i.e. data-driven) manner. The skills required for data-driven urban analysis and design activities are diverse, ranging from data collection (field work, crowdsensing, physical sensor processing, etc.) and data processing with established big data technology frameworks, to data exploration for finding patterns and outliers in spatio-temporal data streams, and data visualization that conveys the right information in the right manner.

The CrowdInsights professional school “Urban Data Science Bootcamp” provides a no-frills, hands-on introduction to the science of urban data; from data creation to data analysis, data visualization and sense-making, the bootcamp will introduce more than 10 real-world use cases that exemplify how urban data can be applied to solve metropolitan issues. Attendees will explore the challenges and opportunities that come from the adoption of novel types of urban data sources, including social media, mobile phone data, IoT networks, etc.

Urbanscope: Digital Whispers from the Urban Landscape. TedX Talk Video

Together with the Urbanscope team, we gave a TedX talk on the topics and results of the project here at Politecnico di Milano. The talk was actually given by our junior researchers, as we wanted it to be a choral performance as opposed to the typical one-man show.

The message is that cities are not mere physical and organizational devices: they are informational landscapes where places are shaped more by streams of data than by traditional physical evidence. We devise tools and analyses for understanding these streams and the phenomena they represent, in order to better understand our cities.

Two layers coexist: a thick and dynamic layer of digital traces – the informational membrane – grows everyday on top of the material layer of the territory, the buildings and the infrastructures. The observation, the analysis and the representation of these two layers combined provides valuable insights on how the city is used and lived.

You can now find the video of the talk on the official TedX YouTube channel:

Urbanscope is a research laboratory where the collection, organization, analysis, and visualization of cross-domain geo-referenced data are experimented with.
The research team is based at Politecnico di Milano and encompasses researchers with competencies in Computing Engineering, Communication and Information Design, Management Engineering, and Mathematics.

The aim of Urbanscope is to systematically produce compelling views on urban systems to foster understanding and decision making. Views are like new lenses of a macroscope: they are designed to support the recognition of specific patterns thus enabling new perspectives.

If you enjoyed the show, you can explore our beta application at:

http://www.urbanscope.polimi.it

and discover the other data science activities we are conducting at the Data Science Lab of Politecnico, DEIB.

 

The role of Big Data in Banks

I was listening to R. Martin Chavez, Goldman Sachs deputy CFO, just last month at the ComputeFest 2017 event, more precisely the Symposium on the Future of Computation in Science and Engineering, “Data, Dollars, and Algorithms: The Computational Economy”, held at Harvard on Thursday, January 19, 2017.

His claim was that

Banks are essentially API providers.

The entire structure and infrastructure of Goldman Sachs is being restructured for that. His point is that you should not compare a bank with a shop or store; you should compare it with Google. Just imagine that every time you wanted to search on Google you needed to get in touch with (i.e., call or submit a request to) some Google employee, who at some point would come back to you with the result. Nonsense, right? Well, this is what actually happens with banks. It was happening with consumer-oriented banks before online banking, and it’s still largely happening for business banks.

But this is going to change. The amount of data and the speed and volume of financial transactions don’t allow it any more.

Banks are actually among the richest players (not [just] in terms of money, but in data ownership). But they are also craving further, “less official” big data sources.

Juri Marcucci: Importance of Big Data for Central (National) Banks.

Today at the ISTAT National Big Data Committee meeting in Rome, Juri Marcucci from the Bank of Italy discussed their research on integrating Google Trends information into their financial predictive analytics.

Google Trends provides insights into user interests in general, expressed as the probability that a random user will search for a particular keyword (normalized and scaled, with geographical detail down to the city level).

The Bank of Italy is using Google Trends data to complement its short- and mid-term predictions of unemployment rates. It’s definitely a big challenge, but preliminary results are promising in terms of the confidence of the obtained models. More details are available in this paper.
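The underlying idea can be illustrated as a simple check of how strongly a search-interest series co-moves with the target indicator. The numbers below are invented for the example, not the Bank of Italy's data:

```python
# Toy illustration: correlating a Google-Trends-style search-interest
# series with an unemployment indicator. All numbers are invented.
import pandas as pd

df = pd.DataFrame({
    "trends_job_search": [40, 45, 55, 60, 58, 70],   # 0-100 scaled interest
    "unemployment_rate": [8.1, 8.3, 8.9, 9.4, 9.2, 10.1],
})
corr = df["trends_job_search"].corr(df["unemployment_rate"])
print(round(corr, 2))  # strong positive co-movement in this toy data
```

In a real nowcasting setting, the Trends series would enter a forecasting model as an additional regressor rather than being used via raw correlation.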

Paolo Giudici from the University of Pavia showed how one can correlate the risk of bank defaults with their exposure on Twitter:

Paolo Giudici: bank risk contagion based (also) on Twitter data.

Obviously, all this must take into account the bias of the sources and the quality of the collected data, as Giudici also pointed out. Assessing the “trustability” of online sources is crucial. In their research, they defined a T-index for Twitter accounts in much the same way academics define the h-index for the relevance of publications, as reported in the photographed slide below.
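Following the h-index analogy, such a score can be computed as the largest t such that the account has at least t tweets each with at least t retweets. This is my own reading of the analogy applied to retweet counts, not the authors' exact formula:

```python
# h-index-style score over per-tweet retweet counts (illustrative
# reading of the T-index analogy, not the authors' exact definition).

def t_index(retweet_counts):
    """Largest t with at least t tweets having >= t retweets each."""
    counts = sorted(retweet_counts, reverse=True)
    t = 0
    for i, c in enumerate(counts, start=1):
        if c >= i:
            t = i
        else:
            break
    return t

print(t_index([10, 8, 5, 4, 3, 0]))  # → 4
```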

Paolo Giudici: T-index describing the quality of Twitter authors in finance.

It’s very interesting to see how creative the use of (non-traditional, web-based) big data is becoming, in very diverse fields, including very traditional ones like macroeconomics and finance.

And once again, I think the biggest challenges and opportunities come from the fusion of multiple data sources together: mobile phones, financial tracks, web searches, online news, social networks, and official statistics.

This is also the path that ISTAT (the official Italian statistics institute) is pursuing. For instance, web scraping techniques covering more than 40,000 e-commerce product prices are now integrated into the calculation of official national inflation rates.

 

 

The Harvard-Politecnico Joint Program on Data Science in full bloom

After months of preparation, here we are.

This week we kicked off the second edition of the DataShack program on Data Science that brings together interdisciplinary teams of data science, software engineering & computer science, and design students from Harvard (Institute of Applied Computational Science) and Politecnico di Milano (faculties of Engineering and Design).

The students will address big data extraction, analysis, and visualization problems provided by two real-world stakeholders in Italy: the Como city municipality and Moleskine.

The Moleskine Data-Shack project will explore the popularity and success of different Moleskine products co-branded with other famous brands (also known as special editions) and launched in specific periods in time. The main field of analysis is the impact that different products have on social media channels. Social media analysis then will be correlated with product distribution and sales performance data, along multiple dimensions (temporal, geographical, etc.) and product features.

The Como project consists of collecting and analyzing data about the city and the way people live and move within it, by integrating multiple and diverse data sources. The problems to be addressed may include providing estimates of human density and movements within the city, predicting the impact of hypothetical future events, determining the best allocation of sensors in the streets, and defining optimal user experience and interaction for exploring the city data.

The kickoff meeting of the DataShack 2017 projects, at Harvard. Faculty members Pavlos Protopapas, Stefano Ceri, Paola Bertola, Paolo Ciuccarelli and myself (Marco Brambilla) are involved in the program.

The teams have been formed and the problems assigned. I really look forward to advising the groups over the coming months and seeing the results that will come out. The students have already shown commitment and engagement, and I’m confident that they will be excellent and innovative this year!

For further activities on data science within our group you can refer to the DataScience Lab site, Socialometers, and Urbanscope.