In our most recent study, we analysed the user behaviour and profile, as well as the textual and visual content posted on social media for art and culture events.
The corresponding paper has been presented at CD-MAKE 2017 in Reggio Calabria on August 31st, 2017.
Nowadays people share everything on online social networks, from daily life stories to the latest local and global news and events. In our paper, we address the specific problem of user behavioural profiling in the context of cultural and artistic events.
We propose a specific analysis pipeline that aims at examining the profile of online users, based on the textual content they published online. The pipeline covers the following aspects: data extraction and enrichment, topic modeling based on LDA, dimensionality reduction, user clustering, prediction of interest, content analysis including profiling of images and subjects.
We show our approach at work for the monitoring of participation to a large-scale artistic installation that collected more than 1.5 million visitors in just two weeks (namely The Floating Piers, by Christo and Jeanne-Claude). In the paper we report our findings and discuss the pros and cons of the work.
I was listening at R. Martin Chavez, Goldman Sachs deputy CFO just last month in Harvard at the ComputeFest 2017 event, more precisely, the SYMPOSIUM ON THE FUTURE OF COMPUTATION IN SCIENCE AND ENGINEERING on “Data, Dollars, and Algorithms: The Computational Economy” held in Harvard on Thursday, January 19, 2017.
His claim was that
Banks are essentially API providers.
The entire structure and infrastructure of Goldman Sachs is being restructured for that. His case is that you should not compare a bank with a shop or store, you should compare it with Google. Just imagine that every time you want to search on Google you need to get in touch (i.e., make a phone call or submit a request) to some Google employee, who at some points comes back to you with the result. Non sense, right? Well, but this is what actually happens with banks. It was happening with consumer-oriented banks before online banking, and it’s still largely happening for business banks.
But this is going to change. Amount of data and speed and volume of financial transaction doesn’t allow that any more.
Banks are actually among the richest (not [just] in terms of money, but in data ownership). But they are also craving for further “less official” big data sources.
Today at the ISTAT National Big Data Committee meeting in Rome, Juri Marcucci from Bank of Italy discussed their research activity in integration of Google Trends information in their financial predictive analytics.
Google Trends provide insights of user interests in general, as the probability that a random user is going to search for a particular keyword (normalized and scaled, also with geographical detail down to city level).
Bank of Italy is using Google Trends data for complementing their prediction of unemployment rates in short and mid term. It’s definitely a big challenge, but preliminary results are promising in terms of confidence on the obtained models. More details are available in this paper.
Paolo Giudici from University of Pavia showed how one can correlate the risk of bank defaults with their exposition on Twitter:
Obviously, all this must take into account the bias of the sources and the quality of the data collected. This was pointed out also by Paolo Giudici from University of Pavia. Assessment of “trustability” of online sources is crucial. In their research, they defined the T-index on Twitter accounts in a very similar way academics define the h-index for relevance of publications, as reported in the photographed slide below.
It’s very interesting to see how creative the use of (non-traditional, web based) big data is becoming, in very diverse fields, including very traditional ones like macroeconomy and finance.
And once again, I think the biggest challenges and opportunities come from the fusion of multiple data sources together: mobile phones, financial tracks, web searches, online news, social networks, and official statistics.
This is also the path that ISTAT (the official institute for Italian statistics) is pursuing. For instance, in the calculation of official national inflation rates, web scraping techniques (for ecommerce prices) upon more than 40.000 product prices are integrated in the process too.
This year we decided to be present at ICWSM 2016 in Cologne, with two contributions that basically blend model driven software engineering and big data analysis, to provide value to users and citizens both in terms of high quality software and added value information provision.
Studying Multicultural Diversity of Cities and Neighborhoods through Social Media Language Detection, presented at the CityLab workshop at ICWSM 2016. The focus of this work is to study cities as melting pots of people with different culture, religion, and language. Through multilingual analysis of Twitter contents shared within a city, we analyze the prevalent language in the different neighborhoods of the city and we compare the results with census data, in order to highlight any parallelisms or discrepancies between the two data sources. We show that the officially identified neighborhoods are actually representing significantly different communities and that the use of the social media as a data source helps to detect those weak signals that are not captured from traditional data. Slides here:
We now continuously look for new dataset and computational challenges. Feel free to ask or to propose ideas!
To keep updated on my activities you can subscribe to the RSS feed of my blog or follow my twitter account (@MarcoBrambi).