Data Cleaning for Knowledge Extraction and Understanding on Social Media

 

Social media platforms let users share their opinions through textual or multimedia content. In many settings, this becomes a valuable source of knowledge that can be exploited for specific business objectives. Brands and companies often want to monitor social media to understand the stance, opinion, and sentiment of their customers, audience, and potential audience. This is crucial for them because it lets them understand trends and spot future commercial and marketing opportunities.

However, all this relies on a solid and reliable data collection phase, which ensures that all analyses, extractions, and predictions are applied to clean, solid, and focused data. Indeed, topic-based collection of social media content through keyword search typically yields very noisy results.

We recently carried out a simple study on cleaning the data collected from social content within specific domains or related to given topics of interest. We propose a basic method for data cleaning and removal of off-topic content based on supervised machine learning techniques, i.e. classification, over data collected from social media platforms using keywords on a specific topic. We define a general method and then validate it through an experiment of data extraction from Twitter, with respect to a set of famous cultural institutions in Italy, including theaters, museums, and other venues.

For this case, we collaborated with domain experts to label the dataset, and then we evaluated and compared the performance of classifiers that are trained with different feature extraction strategies.
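To make the idea concrete, here is a minimal sketch of off-topic filtering as supervised classification, comparing two feature extraction strategies. It is not the classifier or the feature set used in the paper: the Naive Bayes model, the two feature extractors, the example posts, and the "on"/"off" labels are all assumptions invented for illustration.

```python
import math
import re
from collections import Counter, defaultdict


def word_features(text):
    """Bag-of-words features: lowercase tokens, keeping # and @."""
    return re.findall(r"[a-z0-9#@]+", text.lower())


def char_ngram_features(text, n=3):
    """Character n-grams of the whitespace-normalized, lowercased text."""
    s = re.sub(r"\s+", " ", text.lower().strip())
    return [s[i:i + n] for i in range(len(s) - n + 1)]


class NaiveBayes:
    """Multinomial Naive Bayes with add-one (Laplace) smoothing."""

    def __init__(self, extract):
        self.extract = extract                     # feature extraction strategy
        self.class_docs = Counter()                # training posts per class
        self.feat_counts = defaultdict(Counter)    # feature counts per class
        self.vocab = set()

    def fit(self, texts, labels):
        for text, label in zip(texts, labels):
            feats = self.extract(text)
            self.class_docs[label] += 1
            self.feat_counts[label].update(feats)
            self.vocab.update(feats)
        return self

    def predict(self, text):
        total_docs = sum(self.class_docs.values())
        best_label, best_score = None, -math.inf
        for label, n_docs in self.class_docs.items():
            counts = self.feat_counts[label]
            denom = sum(counts.values()) + len(self.vocab)
            score = math.log(n_docs / total_docs)
            for feat in self.extract(text):
                if feat in self.vocab:  # skip features never seen in training
                    score += math.log((counts[feat] + 1) / denom)
            if score > best_score:
                best_label, best_score = label, score
        return best_label


# Invented examples: the keyword "Scala" matches both a theater and a
# programming language, which is the kind of noise keyword search produces.
TRAIN = [
    ("Wonderful opera tonight at La Scala in Milan", "on"),
    ("Tickets for the ballet season at Teatro alla Scala", "on"),
    ("The orchestra at La Scala was superb", "on"),
    ("Learning Scala and functional programming this week", "off"),
    ("New Scala compiler release fixes type inference bugs", "off"),
    ("Scala vs Kotlin for backend programming", "off"),
]

if __name__ == "__main__":
    texts, labels = zip(*TRAIN)
    for name, extract in [("words", word_features), ("char 3-grams", char_ngram_features)]:
        model = NaiveBayes(extract).fit(texts, labels)
        print(name, model.predict("Booking opera tickets at La Scala"))
```

Passing the extractor as a parameter keeps the classifier fixed while the feature strategy varies, which is the comparison described above. On real data the labels would come from the domain experts, and the evaluation would use a held-out test set rather than single predictions.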

The work has been presented at the KDWEB workshop at the ICWE 2018 conference.

A preprint of the paper can be downloaded and cited as reported here:

Emre Calisir, Marco Brambilla. The Problem of Data Cleaning for Knowledge Extraction from Social Media. KDWeb Workshop 2018, co-located with ICWE 2018, Caceres, Spain, June 2018.

The slides used in the workshop are available online here:

 

IEEE Big Data Conference 2017: take home messages from the keynote speakers

I collected here the list of my write-ups of the first three keynote speeches of the conference:

Driving Style and Behavior Analysis based on Trip Segmentation over GPS Information through Unsupervised Learning

Over one billion cars interact with each other on the road every day. Each driver has their own driving style, which can affect safety, fuel economy, and road congestion. Knowledge about a driver's style could be used to encourage “better” driving behaviour through immediate feedback while driving, or by scaling auto insurance rates based on the aggressiveness of the driving style.
In this work we report on our study of driving behaviour profiling based on unsupervised data mining methods. The main goal is to detect the different driving behaviours, and thus to cluster drivers with similar behaviour. This paves the way to new business models related to the driving sector, such as Pay-How-You-Drive insurance policies and car rentals. Here is the presentation I gave on this topic:

Driver behavioural characteristics are studied by collecting information from GPS sensors on the cars and by applying three different analysis approaches (DP-means, Hidden Markov Models, and Behavioural Topic Extraction) to the problem of contextual scene detection on car trips, in order to detect different behaviours along each trip. Drivers are then clustered into similar profiles based on those behaviours, and the results are compared with a human-defined ground truth on driver classification.

The proposed framework is tested on a real dataset containing sampled car signals. While the different approaches show relevant differences in trip segment classification, the coherence of the final driver clustering results is surprisingly high.
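As an illustration of the first of the three approaches, the following is a minimal sketch of DP-means. It behaves like k-means, except that a point whose squared distance to every centroid exceeds a threshold opens a new cluster, so the number of clusters does not have to be fixed in advance. The feature vectors (imagined here as normalized speed and acceleration per trip sample) and the threshold value are assumptions made up for the example, not the paper's data or settings.

```python
def dp_means(points, lam, max_iter=100):
    """Cluster points with DP-means.

    lam is the squared-distance threshold above which a point
    opens a new cluster instead of joining the nearest one.
    """
    def sqdist(a, b):
        return sum((x - y) ** 2 for x, y in zip(a, b))

    def mean(group):
        return tuple(sum(coord) / len(group) for coord in zip(*group))

    centroids = [mean(points)]       # start from a single global cluster
    labels = [0] * len(points)
    for _ in range(max_iter):
        changed = False
        for i, p in enumerate(points):
            dists = [sqdist(p, c) for c in centroids]
            k = min(range(len(dists)), key=dists.__getitem__)
            if dists[k] > lam:       # too far from every centroid
                centroids.append(tuple(p))
                k = len(centroids) - 1
            if labels[i] != k:
                labels[i] = k
                changed = True
        # Recompute centroids and renumber, dropping clusters left empty.
        groups = {}
        for lbl, p in zip(labels, points):
            groups.setdefault(lbl, []).append(p)
        kept = sorted(groups)
        remap = {old: new for new, old in enumerate(kept)}
        centroids = [mean(groups[old]) for old in kept]
        labels = [remap[lbl] for lbl in labels]
        if not changed:
            break
    return labels, centroids


# Hypothetical (speed, acceleration) samples scaled to [0, 1].
SAMPLES = [(0.1, 0.2), (0.15, 0.25), (0.2, 0.1),
           (0.9, 0.8), (0.85, 0.9), (0.95, 0.85)]

if __name__ == "__main__":
    labels, centroids = dp_means(SAMPLES, lam=0.1)
    print(labels, centroids)
```

The threshold controls granularity: a small value yields many fine-grained behaviour segments, while a large one collapses everything into a single cluster, so in practice it has to be tuned on the data.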

 


This work has been published at the 4th IEEE Big Data Conference, held in Boston in December 2017. The full paper can be cited as:

M. Brambilla, P. Mascetti and A. Mauri, “Comparison of different driving style analysis approaches based on trip segmentation over GPS information,” 2017 IEEE International Conference on Big Data (Big Data), Boston, MA, 2017, pp. 3784-3791.
doi: 10.1109/BigData.2017.8258379

You can download the full paper PDF from the IEEE Xplore Digital Library, at this URL:

https://ieeexplore.ieee.org/document/8258379/

If you are interested in further contributions at the conference, here you can find my summaries of the keynote speeches on human-in-the-loop machine learning and on increasing human perception through text mining.

How Fashionable is Digital Data-Driven Fashion?

Within the context of our data science research track, we have been involved a lot in fashion industry problems recently.

We already showcased some studies in fashion, for instance related to the analysis of the Milano Fashion Week events and their social media impact.

Starting this year, we are also involved in a research and innovation project called FaST – Fashion Sensing Technology. FaST is a project meant to design, experiment with, and implement an ICT tool that could monitor and analyze the activity of Italian emerging Fashion brands on social media. FaST aims at providing SMEs in the Fashion industry with the ability to better understand and measure the behaviours and opinions of consumers on social media, through the study of the interactions between brands and their communities, as well as support a brand’s strategic business decisions.

Given the importance of Fashion as an economic and cultural resource for the Lombardy Region and Italy as a whole, the project aims at leveraging the opportunities offered by the creation of a hybrid fashion-digital value chain, in order to design a tool that would allow the codification of new organizational models. Furthermore, the project wants to promote process innovation within the fashion industry with a customer-centric approach, as well as the design of services that could update and innovate both creative processes and the retail channel, which today is at the core of the sustainability and competitiveness of brands and companies on domestic and international markets.

Within the project, we study social presence and digital / communication strategies of brands, and we will look for space for optimization. We are already crunching a lot of data and running large scale analyses on the topic. We will share our exciting results as soon as available!

 

Acknowledgements

FaST – Fashion Sensing Technology is a project supported by Regione Lombardia through the European Regional Development Fund (grant: “Smart Fashion & Design”). The project is being developed by Politecnico di Milano – Design dept. and Electronics, Information and Bioengineering dept. – in collaboration with Wemanage Group, Studio 4SIGMA, and CGNAL.


A Curated List of WWW 2017 Papers for Data Science and Web Science

This year, the WWW 2017 conference is definitely placing a lot of emphasis on Web Science and Data Science.

I’m recording here a list of papers I found interesting at the conference, related to this topic. Disclaimer: the list may be incomplete, as I did not go through all the papers. So in case you want full coverage of the conference, you can just browse the full WWW proceedings, which are entirely available online as open-access creative commons content.

Anyway, here is my list:

Prices and Subsidies in the Sharing Economy (Page 53)
Zhixuan Fang (Tsinghua University)
Longbo Huang (Tsinghua University)
Adam Wierman (California Institute of Technology)

Understanding and Discovering Deliberate Self-harm Content in Social Media (Page 93)
Yilin Wang (Arizona State University)
Jiliang Tang (Michigan State University)
Jundong Li (Arizona State University)
Baoxin Li (Arizona State University)
Yali Wan (University of Washington)
Clayton Mellina (Yahoo Research)
Neil O’Hare (Yahoo Research)
Yi Chang (Huawei Research America)

Cataloguing Treatments Discussed and Used in Online Autism Communities (Page 123)
Shaodian Zhang (Columbia University)
Tian Kang (Columbia University)
Lin Qiu (Shanghai Jiao Tong University)
Weinan Zhang (Shanghai Jiao Tong University)
Yong Yu (Shanghai Jiao Tong University)
Noémie Elhadad (Columbia University)

Neural Collaborative Filtering (Page 173)
Xiangnan He (National University of Singapore)
Lizi Liao (National University of Singapore)
Hanwang Zhang (Columbia University)
Liqiang Nie (Shandong University)
Xia Hu (Texas A&M University)
Tat-Seng Chua (National University of Singapore)

Exact Computation of Influence Spread by Binary Decision Diagrams (Page 947)
Takanori Maehara (Shizuoka University & RIKEN Center for Advanced Intelligence Project)
Hirofumi Suzuki (Hokkaido University)
Masakazu Ishihata (Hokkaido University)

Secure Centrality Computation Over Multiple Networks (Page 957)
Gilad Asharov (Cornell-Tech)
Francesco Bonchi (ISI Foundation)
David García Soriano (Eurecat & Pompeu Fabra University)
Tamir Tassa (The Open University)

Interplay between Social Influence and Network Centrality: A Comparative Study on Shapley Centrality and Single-Node-Influence Centrality (Page 967)
Wei Chen (Microsoft Research)
Shang-Hua Teng (University of Southern California)

Portfolio Optimization for Influence Spread (Page 977)
Naoto Ohsaka (The University of Tokyo)
Yuichi Yoshida (National Institute of Informatics & Preferred Infrastructure, Inc.)

Extracting and Ranking Travel Tips from User-Generated Reviews (Page 987)
Ido Guy (Ben-Gurion University of the Negev & eBay Research)
Avihai Mejer (Yahoo Research)
Alexander Nus (Yahoo Research)
Fiana Raiber (Technion – Israel Institute of Technology)

Information Extraction in Illicit Web Domains (Page 997)
Mayank Kejriwal (University of Southern California)
Pedro Szekely (University of Southern California)

Learning to Extract Events from Knowledge Base Revisions (Page 1007)
Alexander Konovalov (Ohio State University)
Benjamin Strauss (Ohio State University)
Alan Ritter (Ohio State University)
Brendan O’Connor (University of Massachusetts, Amherst)

CoType: Joint Extraction of Typed Entities and Relations with Knowledge Bases (Page 1015)
Xiang Ren (University of Illinois at Urbana-Champaign)
Zeqiu Wu (University of Illinois at Urbana-Champaign)
Wenqi He (University of Illinois at Urbana-Champaign)
Meng Qu (University of Illinois at Urbana-Champaign)
Clare R. Voss (Army Research Laboratory)
Heng Ji (Rensselaer Polytechnic Institute)
Tarek F. Abdelzaher (University of Illinois at Urbana-Champaign)
Jiawei Han (University of Illinois at Urbana-Champaign)

Myths and Challenges in Knowledge Extraction and Big Data Analysis

For centuries, science (in German “Wissenschaft”) has aimed to create (“schaffen”) new knowledge (“Wissen”) from the observation of physical phenomena, their modelling, and empirical validation.

Recently, a new source of knowledge has emerged: not (only) the physical world any more, but the virtual world, namely the Web with its ever-growing stream of data materialized in the form of social network chatter, content produced on demand by crowds of people, and messages exchanged among interlinked devices in the Internet of Things. The knowledge we may find there can be dispersed, informal, contradictory, unsubstantiated, and ephemeral today, while tomorrow it may already be commonly accepted.

The challenge is once again to capture and create consolidated knowledge that is new, has not been formalized yet in existing knowledge bases, and is buried inside a big, moving target (the live stream of online data).

The myth is that existing tools (spanning fields like semantic web, machine learning, statistics, NLP, and so on) suffice for this objective. While this may still be far from true, some existing approaches are actually addressing the problem and provide preliminary insights into the possibilities that successful attempts may lead to.

I gave a few keynote speeches on this matter (at ICEIS, KDWEB,…), and I also use this argument as a motivating class in academic courses, to help students understand how crucial it is to focus on the problems related to big data modeling and analysis. The talk, reported in the slides below, explores the mixed realistic-utopian domain of data analysis and knowledge extraction through real industrial use cases, and reports on some tools and cases where the digital and physical worlds have been brought together to better understand our society.

The presentation is available on SlideShare and is reported here below:

Urban Data Science Bootcamp

We organize a crash-course on how the science of urban data can be applied to solve metropolitan issues.


The course is a two-day face-to-face event with teaching sessions, workshops, case study discussions, and hands-on activities for non-IT professionals in the field of city management. It is offered in two editions during the year:

  • in Milan, Italy, on November 8th-9th, 2017
  • in Amsterdam, The Netherlands, on November 30th-December 1st, 2017.

You can download the flyer and program of the Urban Data Science Bootcamp 2017.

Ideal participants include: Civil servants, Professionals, Students, Urban planners, and managers of city utilities and services. No previous experience in data science or computer science is required. Attendees should have experience in areas such as economic affairs, urban development, management support, strategy & innovation, health & care, public order & safety.

Data is the catalyst needed to make the smart city vision a reality in a transparent and evidence-based (i.e. data-driven) manner. The skills required for data-driven urban analysis and design activities are diverse: data collection (field work, crowdsensing, physical sensor processing, etc.); data processing with established big data technology frameworks; data exploration to find patterns and outliers in spatio-temporal data streams; and data visualization that conveys the right information in the right manner.

The CrowdInsights professional school “Urban Data Science Bootcamp” provides a no-frills, hands-on introduction to the science of urban data. From data creation to data analysis, data visualization, and sense-making, the bootcamp will introduce more than 10 real-world application use cases that exemplify how urban data can be applied to solve metropolitan issues. Attendees will explore the challenges and opportunities that come from the adoption of novel types of urban data sources, including social media, mobile phone data, IoT networks, etc.

Analysis of user behaviour and social media content for art and culture events

In our most recent study, we analysed the user behaviour and profile, as well as the textual and visual content posted on social media for art and culture events.

The corresponding paper was presented at CD-MAKE 2017 in Reggio Calabria on August 31st, 2017.

Nowadays people share everything on online social networks, from daily life stories to the latest local and global news and events. In our paper, we address the specific problem of user behavioural profiling in the context of cultural and artistic events.

We propose a specific analysis pipeline that aims at examining the profile of online users, based on the textual content they published online. The pipeline covers the following aspects: data extraction and enrichment, topic modeling based on LDA, dimensionality reduction, user clustering, prediction of interest, content analysis including profiling of images and subjects.
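The topic modeling step of such a pipeline can be sketched as follows. This is a toy collapsed Gibbs sampler for LDA written for illustration, not the implementation used in the paper. The user names, word lists, and hyperparameters are invented, and grouping users by their dominant topic is only a crude stand-in for the dimensionality reduction and clustering stages.

```python
import random
from collections import defaultdict


def lda_gibbs(docs, n_topics, n_iter=200, alpha=0.1, beta=0.01, seed=0):
    """Collapsed Gibbs sampling for LDA.

    docs is a list of token lists; returns one topic mixture per document.
    """
    rng = random.Random(seed)
    vocab_size = len({w for doc in docs for w in doc})
    doc_topic = [[0] * n_topics for _ in docs]                 # tokens per (doc, topic)
    topic_word = [defaultdict(int) for _ in range(n_topics)]   # tokens per (topic, word)
    topic_total = [0] * n_topics                               # tokens per topic
    assign = []
    for d, doc in enumerate(docs):        # random initial topic for each token
        zs = []
        for w in doc:
            z = rng.randrange(n_topics)
            zs.append(z)
            doc_topic[d][z] += 1
            topic_word[z][w] += 1
            topic_total[z] += 1
        assign.append(zs)
    for _ in range(n_iter):
        for d, doc in enumerate(docs):
            for i, w in enumerate(doc):
                z = assign[d][i]          # remove the token's current assignment
                doc_topic[d][z] -= 1
                topic_word[z][w] -= 1
                topic_total[z] -= 1
                weights = [
                    (doc_topic[d][k] + alpha)
                    * (topic_word[k][w] + beta)
                    / (topic_total[k] + vocab_size * beta)
                    for k in range(n_topics)
                ]
                z = rng.choices(range(n_topics), weights=weights)[0]
                assign[d][i] = z          # record the resampled topic
                doc_topic[d][z] += 1
                topic_word[z][w] += 1
                topic_total[z] += 1
    theta = []
    for d, doc in enumerate(docs):
        denom = len(doc) + n_topics * alpha
        theta.append([(doc_topic[d][k] + alpha) / denom for k in range(n_topics)])
    return theta


def group_by_dominant_topic(theta, names):
    """Group users by the topic with the highest weight in their mixture."""
    groups = defaultdict(list)
    for name, mixture in zip(names, theta):
        groups[max(range(len(mixture)), key=mixture.__getitem__)].append(name)
    return dict(groups)


# Invented users; each document stands for all the posts of one user.
USERS = {
    "user_a": "lake walk pier sunset photo lake pier".split(),
    "user_b": "pier lake photo sunset walk water".split(),
    "user_c": "queue train traffic crowd queue delay".split(),
    "user_d": "traffic delay train crowd queue parking".split(),
}

if __name__ == "__main__":
    names = list(USERS)
    theta = lda_gibbs([USERS[n] for n in names], n_topics=2)
    print(group_by_dominant_topic(theta, names))
```

Each user ends up with a vector of topic weights that sums to one, which is the representation the later stages (dimensionality reduction, clustering, interest prediction) would consume. The sampler is stochastic, so results depend on the seed and on the number of iterations.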

We show our approach at work for monitoring participation in a large-scale artistic installation that attracted more than 1.5 million visitors in just two weeks (namely The Floating Piers, by Christo and Jeanne-Claude). In the paper we report our findings and discuss the pros and cons of the work.

The full paper is published by Springer in the LNCS series in volume 10410, pages 219-236.

The slides used for the presentation are available on SlideShare:

 

Urbanscope: Digital Whispers from the Urban Landscape. TEDx Talk Video

Together with the Urbanscope team, we gave a TEDx talk on the topics and results of the project here at Politecnico di Milano. The talk was actually given by our junior researchers, as we wanted it to be a choral performance as opposed to the typical one-man show.

The message is that cities are not merely physical and organizational devices: they are informational landscapes where places are shaped more by streams of data and less by traditional physical evidence. We devise tools and analyses for understanding these streams and the phenomena they represent, in order to better understand our cities.

Two layers coexist: a thick and dynamic layer of digital traces – the informational membrane – grows every day on top of the material layer of the territory, the buildings, and the infrastructures. The observation, analysis, and representation of these two layers combined provide valuable insights into how the city is used and lived.

You can now find the video of the talk on the official TEDx YouTube channel:

Urbanscope is a research laboratory that experiments with the collection, organization, analysis, and visualization of cross-domain geo-referenced data.
The research team is based at Politecnico di Milano and encompasses researchers with competencies in Computer Engineering, Communication and Information Design, Management Engineering, and Mathematics.

The aim of Urbanscope is to systematically produce compelling views on urban systems to foster understanding and decision making. Views are like new lenses of a macroscope: they are designed to support the recognition of specific patterns thus enabling new perspectives.

If you enjoyed the show, you can explore our beta application at:

http://www.urbanscope.polimi.it

and discover the other data science activities we are conducting at the Data Science Lab of Politecnico, DEIB.

 

Modeling, Modeling, Modeling: From Web to Enterprise to Crowd to Social

This is our perspective on the world: it’s all about modeling. 

So, why is it that model-driven engineering is not taking over the whole technological and social eco-system?

Let me make the case that it is.

On the occasion of the 25th edition of the Italian Symposium of Database Systems (SEBD 2017), we (Stefano Ceri and I) were asked to write a retrospective on the last years of database and systems research from our perspective, published in a dedicated volume by Springer, A Comprehensive Guide Through the Italian Database Research Over the Last 25 Years. After some brainstorming, we agreed that it all boils down to this: modeling, modeling, modeling.

A long time ago, in the last century, the International DB Research Community used to meet to assess new research directions, starting the meetings with two-minute gong shows in which each participant stated an opinion to influence the follow-up discussion. Bruce Lindsay from IBM had just been quoted for his message:

There are 3 important things in data management: performance, performance, performance.

Stefano Ceri had a chance to speak out immediately after and to give a syntactically similar but semantically orthogonal message:

There are 3 important things in data management: modeling, modeling, modeling.

Data management is continuously evolving for serving the needs of an increasingly connected society. New challenges apply not only to systems and technology, but also to the models and abstractions for capturing new application requirements.

In our retrospective paper, we describe several models and abstractions which have been progressively designed to capture new forms of data-centered interactions in the last twenty-five years – a period of huge changes due to the spread of web-based applications and the increasingly relevant role of social interactions.

We first focus on Web-based applications for individuals and then on applications among enterprises, which is all about WebML and IFML. We then discuss how these applications may include rankings computed using services or crowds, which relates to our work on crowdsourcing (liquid query and the crowdsearcher tool). We conclude with hints at recent research on how social sources can be used for capturing emerging knowledge (the social knowledge extractor perspective and tooling).


All in all, modeling as a cognitive tool is all around us, and is growing in terms of potential impact thanks to formal cognification.

It’s also true that model-driven engineering is not necessarily the tool of choice for this to happen. Why? As technicians, we always tend to blame the customer for not understanding our product. But maybe we should look at ourselves and at the kind of tools (conceptual and technical) the MDE community is offering. I’m pretty sure we could find plenty of room for improvement.

Any idea on how to do this?