IEEE Big Data Conference 2017: take home messages from the keynote speakers

I collected here the list of my write-ups of the first three keynote speeches of the conference:

Driving Style and Behavior Analysis based on Trip Segmentation over GPS Information through Unsupervised Learning

Over one billion cars interact with each other on the road every day. Each driver has their own driving style, which can impact safety, fuel economy, and road congestion. Knowledge about a driver’s style could be used to encourage “better” driving behaviour, either through immediate feedback while driving or by scaling auto insurance rates based on how aggressive the driving style is.
In this work we report on our study of driving behaviour profiling based on unsupervised data mining methods. The main goal is to detect the different driving behaviours, and thus to cluster drivers with similar behaviour. This paves the way to new business models related to the driving sector, such as Pay-How-You-Drive insurance policies and car rentals. Here is the presentation I gave on this topic:

Driver behavioural characteristics are studied by collecting information from GPS sensors on the cars and by applying three different analysis approaches (DP-means, Hidden Markov Models, and Behavioural Topic Extraction) to the contextual scene detection problem on car trips, in order to detect different behaviours along each trip. Drivers are then clustered into similar profiles based on these segments, and the results are compared with a human-defined ground truth on driver classification.
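
As a flavour of how the segmentation step can work, here is a minimal sketch of DP-means clustering applied to per-segment features (e.g., mean speed and mean acceleration). The feature choice and the lambda threshold are illustrative assumptions, not the exact setup used in the paper.

```python
import numpy as np

def dp_means(X, lam, n_iter=20):
    """Minimal DP-means (Kulis & Jordan): like k-means, but a point farther
    than `lam` from every centroid spawns a new cluster, so k is not fixed."""
    centroids = [X[0].copy()]
    labels = np.zeros(len(X), dtype=int)
    for _ in range(n_iter):
        for i, x in enumerate(X):
            dists = np.linalg.norm(np.asarray(centroids) - x, axis=1)
            if dists.min() > lam:
                centroids.append(x.copy())      # open a new behaviour cluster
                labels[i] = len(centroids) - 1
            else:
                labels[i] = int(dists.argmin())
        for k in range(len(centroids)):          # recompute centroids
            members = X[labels == k]
            if len(members):
                centroids[k] = members.mean(axis=0)
    return labels, np.asarray(centroids)

# Illustrative per-segment features: [mean speed (km/h), mean |acceleration| (m/s^2)]
rng = np.random.default_rng(0)
segments = np.vstack([rng.normal([35, 0.6], 1.0, (50, 2)),   # calm urban driving
                      rng.normal([90, 2.2], 1.0, (50, 2))])  # aggressive highway driving
labels, centroids = dp_means(segments, lam=15.0)
print(f"{len(centroids)} behaviour clusters detected")
```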

The proposed framework is tested on a real dataset of sampled car signals. While the different approaches show significant differences in trip segment classification, the coherence of the final driver clustering results is surprisingly high.

This work has been published at the 4th IEEE Big Data Conference, held in Boston in December 2017. The full paper can be cited as:

M. Brambilla, P. Mascetti and A. Mauri, “Comparison of different driving style analysis approaches based on trip segmentation over GPS information,” 2017 IEEE International Conference on Big Data (Big Data), Boston, MA, 2017, pp. 3784-3791.
doi: 10.1109/BigData.2017.8258379

You can download the full paper PDF from the IEEE Xplore Digital Library, at this URL:

https://ieeexplore.ieee.org/document/8258379/

If you are interested in further contributions at the conference, here you can find my summaries of the keynote speeches on human-in-the-loop machine learning and on increasing human perception through text mining.

Urban Data Science Bootcamp

We are organizing a crash course on how the science of urban data can be applied to solve metropolitan issues.


The course is a 2-day face-to-face event with teaching sessions, workshops, case study discussions and hands-on activities for non-IT professionals in the field of city management. It is offered in two editions during the year:

  • in Milan, Italy, on November 8th-9th, 2017
  • in Amsterdam, The Netherlands, on November 30th-December 1st, 2017.

You can download the flyer and program of the Urban Data Science Bootcamp 2017.

Ideal participants include civil servants, professionals, students, urban planners, and managers of city utilities and services. No previous experience in data science or computer science is required. Attendees should have experience in areas such as economic affairs, urban development, management support, strategy & innovation, health & care, or public order & safety.

Data is the catalyst needed to make the smart city vision a reality in a transparent and evidence-based (i.e., data-driven) manner. The skills required for data-driven urban analysis and design are diverse, ranging from data collection (field work, crowdsensing, physical sensor processing, etc.) and data processing with established big data technology frameworks, to data exploration for finding patterns and outliers in spatio-temporal data streams, and data visualization that conveys the right information in the right manner.

The CrowdInsights professional school “Urban Data Science Bootcamp” provides a no-frills, hands-on introduction to the science of urban data: from data creation to data analysis, data visualization and sense-making, the bootcamp introduces more than 10 real-world use cases that exemplify how urban data can be applied to solve metropolitan issues. Attendees will explore the challenges and opportunities that come with the adoption of novel types of urban data sources, including social media, mobile phone data, IoT networks, etc.

Social Media Behaviour during Live Events: the Milano Fashion Week #MFW case

Social media are becoming more and more important in the context of live events, such as fairs, exhibits, festivals, and concerts, as they play an essential role in communicating these events to fans, interest groups, and the general population. Such events are geo-localized within a city or territory and are scheduled in a public calendar.

Together with the people in the Fashion in Process group of Politecnico di Milano, we studied the impact on social media of a specific scenario, the Milano Fashion Week (MFW), which is an important event in Milano for the whole fashion business.

We presented this work at the Location and the Web workshop co-located with the WWW 2017 Conference in Perth, Australia.

We focus our attention on the spreading of social content in space, measuring how content about the event propagates geographically. We build different clusters of fashion brands, characterize several features of their spatial propagation, and correlate them with brand popularity and temporal propagation.

We show that the clusters along the space, time, and popularity dimensions are only loosely correlated; therefore, trying to understand the dynamics of the events based on popularity aspects alone would not be appropriate.

The paper is available as an open-access PDF on the WWW 2017 Conference web site. You can download it here.

A subsequent paper on the temporal analysis of the same event, “Temporal Analysis of Social Media Response to Live Events: The Milano Fashion Week”, focusing on Granger causality and other measures, has been published at ICWE 2017 and is available in the Springer proceedings.
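
For readers unfamiliar with the measure, a Granger-causality test checks whether past values of one time series help predict another. Below is a minimal sketch using statsmodels on two hypothetical daily engagement series for a brand on two platforms; the data and lag choice are purely illustrative and are not taken from the paper.

```python
import numpy as np
import pandas as pd
from statsmodels.tsa.stattools import grangercausalitytests

# Hypothetical daily engagement counts for one brand on two platforms
rng = np.random.default_rng(42)
twitter = rng.poisson(100, 60).astype(float)
instagram = np.roll(twitter, 2) + rng.normal(0, 5, 60)  # reacts ~2 days after Twitter

# Tests whether the series in the 2nd column Granger-causes the one in the 1st column
data = pd.DataFrame({"instagram": instagram, "twitter": twitter})
results = grangercausalitytests(data[["instagram", "twitter"]], maxlag=3)
```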

The PowerPoint presentation is available on SlideShare.

The role of Big Data in Banks

Just last month I was listening to R. Martin Chavez, Goldman Sachs deputy CFO, at Harvard during the ComputeFest 2017 event; more precisely, at the Symposium on the Future of Computation in Science and Engineering on “Data, Dollars, and Algorithms: The Computational Economy”, held at Harvard on Thursday, January 19, 2017.

His claim was that

Banks are essentially API providers.

The entire structure and infrastructure of Goldman Sachs is being rebuilt around that idea. His point is that you should not compare a bank with a shop or store; you should compare it with Google. Just imagine that every time you wanted to search on Google you had to get in touch with (i.e., call or submit a request to) some Google employee, who at some point would come back to you with the result. Nonsense, right? Well, this is what actually happens with banks. It was happening with consumer-oriented banks before online banking, and it’s still largely happening for business banks.

But this is going to change. The amount of data and the speed and volume of financial transactions no longer allow it.

Banks are actually among the richest players (not [just] in terms of money, but in terms of data ownership). Yet they are also craving further, “less official” big data sources.

Juri Marcucci: Importance of Big Data for Central (National) Banks.

Today at the ISTAT National Big Data Committee meeting in Rome, Juri Marcucci from the Bank of Italy discussed their research on integrating Google Trends information into their financial predictive analytics.

Google Trends provides insights into user interests in general, expressed as the probability that a random user will search for a particular keyword (normalized and scaled, with geographical detail down to the city level).

The Bank of Italy is using Google Trends data to complement its short- and mid-term predictions of unemployment rates. It’s definitely a big challenge, but preliminary results are promising in terms of the confidence of the obtained models. More details are available in this paper.
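
As an illustration of how such a signal can be retrieved and compared against an official statistic, here is a small sketch using pytrends, an unofficial Google Trends client; the keyword, geography, and the naive lagged correlation are my own assumptions and have nothing to do with the Bank of Italy’s actual models.

```python
import pandas as pd
from pytrends.request import TrendReq

# Monthly search interest in Italy for a job-search keyword (illustrative choice)
pytrends = TrendReq(hl="it-IT", tz=60)
pytrends.build_payload(["offerte di lavoro"], timeframe="2010-01-01 2017-12-31", geo="IT")
search_interest = pytrends.interest_over_time()["offerte di lavoro"].resample("M").mean()

def lagged_correlation(unemployment: pd.Series, searches: pd.Series, lag: int = 1) -> float:
    """Correlation between the official monthly unemployment rate and the
    search-interest index shifted by `lag` months (a crude first check,
    not a forecasting model)."""
    aligned = pd.concat([unemployment, searches.shift(lag)], axis=1).dropna()
    return aligned.corr().iloc[0, 1]
```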

Paolo Giudici from the University of Pavia showed how one can correlate the risk of bank defaults with their exposure on Twitter:

Paolo Giudici: bank risk contagion based (also) on Twitter data.

Obviously, all this must take into account the bias of the sources and the quality of the collected data, as Paolo Giudici also pointed out. Assessing the “trustability” of online sources is crucial. In their research, they defined a T-index for Twitter accounts in a way very similar to how academics define the h-index for the relevance of publications, as reported in the photographed slide below.

Paolo Giudici: T-index describing the quality of Twitter authors in finance.
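
The photographed slide is not fully legible here, but since the T-index is defined in close analogy to the h-index, one plausible reading (my assumption, not the authors’ exact formula) is: an account has T-index t if at least t of its tweets on the topic each obtained at least t retweets. A minimal sketch:

```python
def t_index(retweet_counts: list[int]) -> int:
    """Hypothetical h-index-style score: the largest t such that the account
    has at least t tweets with at least t retweets each (assumed definition)."""
    counts = sorted(retweet_counts, reverse=True)
    t = 0
    for i, c in enumerate(counts, start=1):
        if c >= i:
            t = i
        else:
            break
    return t

# Example: an account whose finance-related tweets got these retweet counts
print(t_index([25, 12, 8, 8, 5, 3, 1]))  # -> 5
```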

It’s very interesting to see how creative the use of (non-traditional, web-based) big data is becoming, in very diverse fields, including very traditional ones like macroeconomics and finance.

And once again, I think the biggest challenges and opportunities come from the fusion of multiple data sources together: mobile phones, financial tracks, web searches, online news, social networks, and official statistics.

This is also the path that ISTAT (the official Italian statistics institute) is pursuing. For instance, web scraping of e-commerce prices covering more than 40,000 products is now integrated into the calculation of the official national inflation rate.


The Harvard-Politecnico Joint Program on Data Science in full bloom

After months of preparation, here we are.

This week we kicked off the second edition of the DataShack program on Data Science that brings together interdisciplinary teams of data science, software engineering & computer science, and design students from Harvard (Institute of Applied Computational Science) and Politecnico di Milano (faculties of Engineering and Design).

The students will address big data extraction, analysis, and visualization problems provided by two real-world stakeholders in Italy: the Como city municipality and Moleskine.

The Moleskine DataShack project will explore the popularity and success of different Moleskine products co-branded with other famous brands (also known as special editions) and launched at specific points in time. The main field of analysis is the impact that different products have on social media channels. Social media analysis will then be correlated with product distribution and sales performance data, along multiple dimensions (temporal, geographical, etc.) and product features.

The Como project consists of collecting and analyzing data about the city and the way people live and move within it, by integrating multiple and diverse data sources. The problems to be addressed may include estimating human density and movements within the city, predicting the impact of hypothetical future events, determining the best allocation of sensors in the streets, and defining the optimal user experience and interaction for exploring the city data.

The kickoff meeting of the DataShack 2017 projects, at Harvard. Faculty members Pavlos Protopapas, Stefano Ceri, Paola Bertola, Paolo Ciuccarelli, and myself (Marco Brambilla) are involved in the program.

The teams have been formed and the problems assigned. I really look forward to advising the groups in the coming months and seeing the results that will come out. The students have already shown commitment and engagement. I’m confident that they will be excellent and innovative this year!

For further activities on data science within our group you can refer to the DataScience Lab site, Socialometers, and Urbanscope.


Modeling and Analyzing Engagement in Social Network Challenges

Within a completely new line of research, we are exploring the power of modeling for human behaviour analysis, especially within social networks and/or on the occasion of large-scale live events. Participation in challenges within social networks is a very effective instrument for promoting a brand or event, and it is therefore regarded as an excellent marketing tool.
Our first piece of research was published in November 2016 at the WISE Conference, covering the analysis of user engagement within social network challenges.
In this paper, we take the challenge organizer’s perspective, and we study how to raise the engagement of players in challenges where they are stimulated to create and evaluate content, thereby indirectly raising awareness about the brand or event itself. Slides are available on SlideShare:

We illustrate a comprehensive model of the actions and strategies that can be exploited to progressively boost social engagement during the challenge’s evolution. The model studies the organizer-driven management of interactions among players, and evaluates the effectiveness of each action in light of several other factors (time, repetition, third-party actions, interplay between different social networks, and so on).
We evaluate the model through a set of experiments on a real case, the YourExpo2015 challenge. Overall, our experiments lasted 9 weeks and engaged around 800,000 users on two different social platforms; our quantitative analysis assesses the validity of the model.

The paper is published by Springer here.


To keep updated on my activities you can subscribe to the RSS feed of my blog or follow my twitter account (@MarcoBrambi).

"What’s special about us?" Harvard computational science symposium on Brain+Computer systems

On Friday, January 22, 2016 I attended a very interesting symposium organised by the Harvard Institute for Applied Computational Science on “Brain + Machines: Exploring the Frontiers of Neuroscience and Computer Science”.

Although it fell outside my main research fields, I found it very interesting and enlightening, and the discussed topics could also imply a crucial role for modelling practices.
The introductory speech by David Cox addressed the role and scope of brain studies. First, he pointed out that when we say we want to study the brain, at a deep level we are saying we want to study ourselves.
Indeed, we all perceive the human species as special. But why is that? We are not the biggest, longest-living, most numerous, or most adapted species. We simply cover a niche, like any other species.

What’s special about us is complexity: not in a general sense (nature is full of complexity), but specifically the complexity of our brain.
Our brain includes 100 billion neurons and 100 trillion connections.
We are able to deal with complex information in incredible ways, because each neuron is actually a small computer, and globally our brain is enormously more powerful than any computer built so far.
We therefore build clusters of computers. But this is still not enough to reach the brain’s power; we need to understand how the brain works in order to treat and replicate it.
Typical and crucial problems to study include vision and image processing, positioning and mobility, and so on.
That’s where I think modelling can play a crucial role.
As we clearly pointed out in our book Model-Driven Software Engineering in Practice, modelling and abstraction are a natural way of working for our brain. And I got confirmation of this from renowned luminaries at Harvard today.
I really think that, should we discover the modelling approaches of our mind, we could unlock a lot of important aspects of several research fields.
Just imagine if:

  • we could represent human brain processes through models
  • we could replicate these processes and apply modelling techniques for improving, transforming and exploiting such models.

This would pave the way to countless applications and research directions. However, one big challenge opens up for the modelling community: are we able to deal with models including trillions of items?

Any further insights on this?

If you want further details on the event, check out the official website of the symposium here and my storified social media report here.

To keep updated on my activities you can subscribe to the RSS feed of my blog or follow my twitter account (@MarcoBrambi).

Efficient Subgraph Matching – Keynote by V.S. Subrahmanian at ASONAM 2012

V.S. Subrahmanian (Univ. of Maryland)
As part of the Data Management in the Social Semantic Web workshop (DMSSW workshop) at the ASONAM 2012 conference in Istanbul, Turkey, V.S. Subrahmanian (University of Maryland) gave an interesting talk on efficient subgraph matching on (social) networks.
Queries are defined as graphs themselves, with some of the nodes bound to constants and others defined as variables.
The complexity of queries over graphs is high, due to the large number of joins to be performed even in case of fairly simple queries.
The size of the query is typically quite small with respect to the entire dataset (network). The proposed approaches are useful at scales of at least tens of millions of nodes in the network.

How to work on disk

The mechanism implemented is called the DOGMA index and applies an algorithm called K-merge.
The algorithm builds a hierarchical index where I put at most K nodes of the graph in each index item. To obtain that, I merge together connected nodes. You can do that randomly, or more intelligently by trying to minimize the connections between nodes in different index items.
Example of a DOGMA index, where nodes of the original network (at the bottom) are merged into higher-level representations in the level above (in this example, K = 4, since we have 4 nodes in each index position).
I don’t want to build the index by partitioning the whole graph, because it’s painful for large graphs. 
I start from a graph G0, and I merge nodes to obtain graphs G1, G2, ..., Gn, each roughly half the size of the previous one, until Gn has at most K nodes. Then I build the DOGMA index over Gn.
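
To make the coarsening idea concrete, here is a rough sketch of one such merging pass, repeated until at most K super-nodes remain. This is my own reconstruction of the general idea, not the actual DOGMA/K-merge implementation: pairs of adjacent nodes are merged (here in arbitrary order; a smarter variant would pick pairs so as to minimize edges crossing different super-nodes).

```python
import networkx as nx

def coarsen_once(g: nx.Graph) -> nx.Graph:
    """Merge disjoint pairs of adjacent nodes, roughly halving the graph."""
    merged, mapping = set(), {}
    for u, v in g.edges():
        if u not in merged and v not in merged:
            mapping[u] = mapping[v] = (u, v)   # u and v collapse into one super-node
            merged.update((u, v))
    for n in g.nodes():
        mapping.setdefault(n, (n,))            # unmatched nodes survive as singletons
    coarse = nx.Graph()
    coarse.add_nodes_from(set(mapping.values()))
    for u, v in g.edges():
        if mapping[u] != mapping[v]:           # keep edges between different super-nodes
            coarse.add_edge(mapping[u], mapping[v])
    return coarse

def build_hierarchy(g0: nx.Graph, k: int) -> list:
    """Builds G0, G1, ..., Gn, stopping when the coarsest graph has at most k nodes."""
    levels = [g0]
    while levels[-1].number_of_nodes() > k:
        coarser = coarsen_once(levels[-1])
        if coarser.number_of_nodes() == levels[-1].number_of_nodes():
            break                              # nothing left to merge (e.g., no edges)
        levels.append(coarser)
    return levels
```
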
For the query, I can use a basic approach: identify the variable nodes that are immediately adjacent to a constant node, and then find the possible satisfying values for those variables, starting from the constants. I can apply conditions based on distance constraints between constants and variables, as well as between candidate variable bindings. To allow this, every node of the index also stores the distance to the closest node in the other index items.
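
And a rough sketch of that basic answering step (again my reconstruction, not the actual DOGMA query engine): starting from the nodes matched to constants, compute shortest-path distances in the data graph and keep as candidates for each variable only the nodes that respect the distance bounds implied by the query graph.

```python
import networkx as nx

def variable_candidates(data: nx.Graph, constant_match: dict, dist_bound: dict) -> dict:
    """`constant_match` maps each query constant to its node in the data graph;
    `dist_bound[var][const]` is the maximum allowed distance between a binding
    for `var` and that constant (both structures are illustrative)."""
    dist_from = {c: nx.single_source_shortest_path_length(data, n)
                 for c, n in constant_match.items()}
    candidates = {}
    for var, bounds in dist_bound.items():
        candidates[var] = [n for n in data.nodes()
                           if all(dist_from[c].get(n, float("inf")) <= d
                                  for c, d in bounds.items())]
    return candidates
```

The surviving candidates can then be joined edge by edge to produce the actual subgraph matches.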

How to work on the cloud

This approach has also been implemented in the cloud, through the so-called COSI architecture, assuming a cloud of k+1 computing nodes. The computation of the edge cuts that generate the index must be very quick and produce fairly good cuts (though not necessarily optimal ones).
The image below lists some of the references to V.S. Subrahmanian's work on the topic.
Some references to V.S. Subrahmanian's work on subgraph matching.

To keep updated on my activities you can subscribe to the RSS feed of my blog or follow my twitter account (@MarcoBrambi).

Keynote by David Campbell (Microsoft) on Big Data Challenges at VLDB 2011, Seattle

This post is about the insights I got from the interesting keynote speech by David Campbell (Microsoft) on Big Data Challenges that was given on August 31st 2011 at VLDB 2011, Seattle.

The challenge of big data is not interesting just because of the “big” per se. It’s a multi-faceted concept, and all of its perspectives need to be considered.
The point is that this big data must be available on small devices and with a shorter time-to-concept (or time-to-insight) than in the past.
We can no longer afford the traditional paradigm in which the lifecycle is:

  • pose question
  • conceptual model
  • collect data
  • logical model
  • physical model
  • answer question

The question lifecycle can be summarized by the graph below:

However, the current lead time of this cycle is too long (weeks or months). The true challenge is that we have much more data than we can model. The bottleneck is becoming the modeling phase, as shown below:

The right cycle to adopt is the sensemaking loop developed by Pirolli and Card in 2005 in the intelligence analysis community.
The notion is to have a frame that explains the data and, vice versa, data that supports the explanatory frame, in a continuous feedback and interdependent relationship (see the data-frame theory of sensemaking by Klein et al.).
So far this has been viable in modeled domains, while big data extends it to unmodeled domains.
This calls for enabling automatic model generation.
The other challenge is to guarantee that the new paradigm can encompass traditional data applications, so that it gets the best of both traditional data and big data.
A few patterns have been identified for big data:

  • Digital shoebox: retain all the ambient data to enable sensemaking. This is motivated by the cost of data acquisition and data storage going toward zero. I simply augment the raw data with a sourceID and an instanceID and keep it for future usage or sensemaking (see the sketch after this list).
  • Information production: turn the acquired data from the digital shoebox into other events, states, and results, thus transforming raw data into information (which may still require subsequent processing). The results go back into the digital shoebox.
  • Model development: enable sensemaking directly over the digital shoebox without extensive up-front modeling, so as to create knowledge. Simple visualizations often suffice for getting the big picture of a trend or a behaviour (e.g., home automation sensors can reveal the habits of a family).
  • Monitor, mine, manage: develop and use the generated models to perform active management or intervention. Models (or algorithms) are automatically generated so that they can be deployed as a new system (e.g., think of fraud detection and other fields).
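
Here is a tiny sketch of what the first pattern could look like in practice: an append-only store where each raw reading is retained together with a sourceID and an instanceID. The JSON-lines file and field names are illustrative choices of mine, not a format prescribed in the talk.

```python
import json, time, uuid

class DigitalShoebox:
    """Append-only store for raw ambient events, kept for later sensemaking."""

    def __init__(self, path: str = "shoebox.jsonl"):
        self.path = path

    def keep(self, source_id: str, payload: dict) -> str:
        """Augment a raw reading with sourceID/instanceID and retain it."""
        instance_id = str(uuid.uuid4())
        record = {"source_id": source_id,
                  "instance_id": instance_id,
                  "ts": time.time(),
                  "payload": payload}
        with open(self.path, "a") as f:
            f.write(json.dumps(record) + "\n")
        return instance_id

# Example: retaining readings from a home-automation temperature sensor
box = DigitalShoebox()
box.keep("home.kitchen.temp", {"celsius": 21.4})
```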

I think that these patterns can actually be seen more as new development phases than as patterns. Their application can significantly shorten the time-to-insight and is independent of the size of the data source.
On the other hand, I think this paradigm applies more to sensor data than to big data in general (e.g., data sources on the web), but it still has huge potential for personal information management, social networking data, and enterprise management.

To keep updated on my activities you can subscribe to the RSS feed of my blog or follow my twitter account (@MarcoBrambi).