This post collects the insights I took from the interesting keynote speech by David Campbell (Microsoft) on Big Data Challenges, given on August 31st, 2011 at VLDB 2011 in Seattle.
The challenge of big data is not interesting just because of the “big” per se. It is a multi-faceted concept, and all of its perspectives need to be considered.
The point is that this big data must be available on small devices, with a shorter time-to-concept or time-to-insight than in the past.
We can no longer afford the traditional paradigm, in which the lifecycle is:
- pose question
- conceptual model
- collect data
- logical model
- physical model
- answer question
The question lifecycle can be summarized by the graph below:
However, the current lead time of this lifecycle is too long (weeks or months). The true challenge is that we have much more data than we can model. The modeling phase is becoming the bottleneck, as shown below:
The correct cycle to adopt is the sensemaking cycle developed by Pirolli and Card in 2005 within the intelligence analysis community.
The notion is to have a frame that explains the data while, vice versa, the data supports the explanatory frame, in a continuous and interdependent feedback relationship (see the data-frame theory of sensemaking by Klein et al.).
So far, this is viable in modeled domains, while big data extends it to unmodeled domains.
This requires enabling automatic model generation.
The other challenge is to ensure that the new paradigm can encompass traditional data applications and get the best of both traditional data and big data.
A few patterns have been identified for big data:
- Digital shoebox: retain all the ambient data to enable sensemaking. This is motivated by the cost of data acquisition and data storage trending toward zero. You simply augment the raw data with a sourceID and an instanceID and keep it for future use or sensemaking.
- Information production: turn the data acquired in the digital shoebox into other events, states, and results, thus transforming raw data into information (which still requires subsequent processing). The results go back into the digital shoebox.
- Model development: enable sensemaking directly over the digital shoebox without extensive up-front modeling, so as to create knowledge. Simple visualizations often suffice for getting the big picture of a trend or a behaviour (e.g., home automation sensors can reveal the habits of a family).
- Monitor, mine, manage: develop and use generated models to perform active management or intervention. Models (or algorithms) are automatically generated so that they can be installed as a new system (e.g., think of fraud detection or other fields).
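To make the first two patterns more concrete, here is a minimal Python sketch of how I picture them. It is my own illustration, not something shown in the keynote: the `Shoebox` class, its `retain` and `produce` methods, and the `hall-sensor` readings are all hypothetical names and data. Only the idea of tagging raw data with a sourceID and an instanceID, and of feeding derived results back into the shoebox, comes from the talk.

```python
import itertools
from dataclasses import dataclass, field
from typing import Any, Callable, Optional


@dataclass
class Shoebox:
    """Hypothetical append-only store: keeps every raw record, tagged but unmodeled."""
    records: list = field(default_factory=list)
    _counter: Any = field(default_factory=itertools.count, repr=False)

    def retain(self, source_id: str, payload: dict) -> dict:
        # Digital shoebox: augment the raw data with sourceID and instanceID, nothing more.
        record = {
            "sourceID": source_id,
            "instanceID": next(self._counter),
            "payload": payload,
        }
        self.records.append(record)
        return record

    def produce(
        self,
        source_id: str,
        derive: Callable[[dict], Optional[dict]],
        derived_source: str,
    ) -> list:
        # Information production: derive events from one source and
        # put the results back into the same shoebox.
        # The list comprehension takes a snapshot, so appending while looping is safe.
        derived = []
        for rec in [r for r in self.records if r["sourceID"] == source_id]:
            result = derive(rec["payload"])
            if result is not None:
                derived.append(
                    self.retain(derived_source, {"from": rec["instanceID"], **result})
                )
        return derived


# Made-up home automation readings, purely for illustration.
box = Shoebox()
box.retain("hall-sensor", {"hour": 7, "motion": True})
box.retain("hall-sensor", {"hour": 13, "motion": False})
box.retain("hall-sensor", {"hour": 19, "motion": True})

# Turn raw motion readings into "presence" events; readings without motion yield nothing.
events = box.produce(
    "hall-sensor",
    lambda p: {"event": "presence", "hour": p["hour"]} if p["motion"] else None,
    "presence-events",
)
```

The raw readings are never modified or discarded; the derived events simply sit next to them and point back to their origin through the instanceID, so later sensemaking can still start from the unmodeled data.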
I think these patterns are better described as new development phases than as patterns. Applying them can significantly shorten the time-to-insight, and this holds independently of the size of the data source.
On the other hand, I think this paradigm applies more to sensor data than to big data in general (e.g., data sources on the web), but it still has huge potential for personal information management, social networking data, and enterprise management.