Myths and Challenges in Knowledge Extraction and Big Data Analysis

For centuries, science (in German “Wissenschaft”) has aimed to create (“schaffen”) new knowledge (“Wissen”) from the observation of physical phenomena, their modelling, and empirical validation.

Recently, a new source of knowledge has emerged: no longer (only) the physical world, but the virtual one, namely the Web with its ever-growing stream of data materialized as social network chatter, content produced on demand by crowds of people, and messages exchanged among interlinked devices in the Internet of Things. The knowledge we may find there can be dispersed, informal, contradictory, unsubstantiated and ephemeral today, while tomorrow it may already be commonly accepted.

The challenge is once again to capture and consolidate knowledge that is new, has not yet been formalized in existing knowledge bases, and is buried inside a big, moving target (the live stream of online data).

The myth is that existing tools (spanning fields like the semantic web, machine learning, statistics, NLP, and so on) suffice for this objective. While this may still be far from true, some existing approaches are actually addressing the problem and provide preliminary insights into what successful attempts may lead to.

I gave a few keynote speeches on this matter (at ICEIS, KDWEB, …), and I also use this argument as a motivating class in academic courses, to let students understand how crucial it is to focus on the problems of big data modeling and analysis. The talk, reported in the slides below, explores the mixed realistic-utopian domain of data analysis and knowledge extraction through real industrial use cases, and reports on some tools and cases where the digital and physical worlds have been brought together to better understand our society.

The presentation is available on SlideShare and is reported here below:

Analysis of user behaviour and social media content for art and culture events

In our most recent study, we analysed user behaviour and profiles, as well as the textual and visual content posted on social media about art and culture events.

The corresponding paper has been presented at CD-MAKE 2017 in Reggio Calabria on August 31st, 2017.

Nowadays people share everything on online social networks, from daily life stories to the latest local and global news and events. In our paper, we address the specific problem of user behavioural profiling in the context of cultural and artistic events.

We propose a specific analysis pipeline that aims at examining the profile of online users, based on the textual content they published online. The pipeline covers the following aspects: data extraction and enrichment, topic modeling based on LDA, dimensionality reduction, user clustering, prediction of interest, content analysis including profiling of images and subjects.
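
As an illustration only (not the actual project code), the following minimal sketch shows how such a pipeline could be wired together with scikit-learn; the per-user documents, parameters and cluster counts are hypothetical placeholders.

```python
# Minimal sketch of the analysis pipeline described above. Assumption: the
# data extraction/enrichment step already produced one text document per user.
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import LatentDirichletAllocation, PCA
from sklearn.cluster import KMeans

user_docs = [
    "floating piers christo installation lake iseo",
    "museum exhibition contemporary art milano",
    "concert music festival weekend friends",
]  # hypothetical placeholder: one aggregated document per user

# 1. Vectorize the textual content.
vectorizer = CountVectorizer(max_features=5000, stop_words="english")
X = vectorizer.fit_transform(user_docs)

# 2. Topic modeling with LDA: each user becomes a topic-distribution vector.
lda = LatentDirichletAllocation(n_components=2, random_state=0)
topics = lda.fit_transform(X)

# 3. Dimensionality reduction of the topic space.
reduced = PCA(n_components=2).fit_transform(topics)

# 4. User clustering on the reduced representation.
labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(reduced)
print(labels)
```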

We show our approach at work on the monitoring of participation in a large-scale artistic installation that attracted more than 1.5 million visitors in just two weeks (namely The Floating Piers, by Christo and Jeanne-Claude). In the paper we report our findings and discuss the pros and cons of the approach.

The full paper is published by Springer in the LNCS series in volume 10410, pages 219-236.

The slides used for the presentation are available on SlideShare:

 

Urbanscope: Digital Whispers from the Urban Landscape. TedX Talk Video

Together with the Urbanscope team, we gave a TedX talk on the topics and results of the project here at Politecnico di Milano. The talk was actually given by our junior researchers, as we wanted it to be a choral performance as opposed to the typical one-man show.

The message is that cities are not mere physical and organizational devices: they are informational landscapes, where places are shaped more by streams of data and less by traditional physical evidence. We devise tools and analyses for understanding these streams and the phenomena they represent, in order to better understand our cities.

Two layers coexist: a thick and dynamic layer of digital traces – the informational membrane – grows every day on top of the material layer of the territory, its buildings and infrastructures. The observation, analysis and representation of these two layers combined provide valuable insights into how the city is used and lived.

You can now find the video of the talk on the official TedX YouTube channel:

Urbanscope is a research laboratory where the collection, organization, analysis, and visualization of cross-domain geo-referenced data are experimented with.
The research team is based at Politecnico di Milano and encompasses researchers with competencies in Computer Engineering, Communication and Information Design, Management Engineering, and Mathematics.

The aim of Urbanscope is to systematically produce compelling views on urban systems to foster understanding and decision making. Views are like new lenses of a macroscope: they are designed to support the recognition of specific patterns thus enabling new perspectives.

If you enjoyed the show, you can explore our beta application at:

http://www.urbanscope.polimi.it

and discover the other data science activities we are conducting at the Data Science Lab of Politecnico, DEIB.

 

Modeling, Modeling, Modeling: From Web to Enterprise to Crowd to Social

This is our perspective on the world: it’s all about modeling. 

So, why is it that model-driven engineering is not taking over the whole technological and social eco-system?

Let me make the case that it is.

On the occasion of the 25th edition of the Italian Symposium on Database Systems (SEBD 2017), we (Stefano Ceri and I) were asked to write a retrospective on the last years of database and systems research from our perspective, published by Springer in the dedicated volume “A Comprehensive Guide Through the Italian Database Research Over the Last 25 Years”. After some brainstorming, we agreed that it all boils down to this: modeling, modeling, modeling.

A long time ago, in the past century, the international database research community used to meet to assess new research directions, opening the meetings with 2-minute gong shows in which everyone stated their opinion and tried to influence the follow-up discussion. Bruce Lindsay from IBM had just been quoted for his message:

There are 3 important things in data management: performance, performance, performance.

Stefano Ceri had the chance to speak immediately after, and gave a syntactically similar but semantically orthogonal message:

There are 3 important things in data management: modeling, modeling, modeling.

Data management is continuously evolving for serving the needs of an increasingly connected society. New challenges apply not only to systems and technology, but also to the models and abstractions for capturing new application requirements.

In our retrospective paper, we describe several models and abstractions which have been progressively designed to capture new forms of data-centered interactions in the last twenty-five years – a period of huge changes due to the spread of web-based applications and the increasingly relevant role of social interactions.

We initially focus on Web-based applications for individuals, then discuss applications among enterprises – this is all about WebML and IFML; we then discuss how these applications may include rankings computed using services or crowds, which relates to our work on crowdsourcing (the Liquid Query and CrowdSearcher tools); we conclude with hints at recent research on how social sources can be used for capturing emerging knowledge (the social knowledge extractor perspective and tooling).


All in all, modeling as a cognitive tool is all around us, and is growing in terms of potential impact thanks to formal cognification.

It’s also true that model-driven engineering is not necessarily the tool of choice for this to happen. Why? As technicians, we always tend to blame the customer for not understanding our product. But maybe we should look at ourselves and at the kind of tools (conceptual and technical) the MDE community is offering. I’m pretty sure we could find plenty of room for improvement.

Any idea on how to do this?

 

A Vision towards the Cognification of Model-driven Software Engineering

Jordi Cabot, Robert Clarisó, Marco Brambilla and Sébastien Gerard submitted a visionary paper on Cognifying Model-driven Software Development to the GrandMDE workshop (Grand Challenges in Modeling), co-located with STAF 2017 in Marburg (Germany), on July 17, 2017. The paper advocates for the cross-domain fertilization of disciplines such as machine learning and artificial intelligence, behavioural analytics, social studies, cognitive science, crowdsourcing and many more, in order to help model-driven software development.
But actually, what is cognification?

Cognification is the application of knowledge to boost the performance and impact of any process.

It is recognized as one of the 12 technological forces that will shape our future. We are flooded with data, ideas, people, activities, businesses, and goals. But this flood could even be helpful.

The thesis of our paper is that cognification will also revolutionize the way software is built. In particular, we discuss the opportunities and challenges of cognifying Model-Driven Software Engineering (MDSE or MDE) tasks.

MDE has seen limited adoption in the software development industry, probably because the perception from developers’ and managers’ perspective is that its benefits do not outweigh its costs.

We believe cognification could drastically improve the benefits and reduce the costs of adopting MDSE, and thus boost its adoption.

At the practical level, cognification comprises tools that range from artificial intelligence (machine learning, deep learning) to human cognitive capabilities, exploited through online activities, crowdsourcing, gamification and so on.

Opportunities (and challenges) for MDE

Here is a set of MDSE tasks and tools whose benefits can be especially boosted thanks to cognification.

  • A modeling bot playing the role of virtual assistant in the modeling tasks
  • A model inferencer able to deduce a common schema behind a set of unstructured data coming from the software process (see the sketch after this list)
  • A code generator able to learn the style and best practices of a company
  • A real-time model reviewer able to give continuous quality feedback
  • A morphing modeling tool, able to adapt its interface at run-time
  • A semantic reasoning platform able to map modeled concepts to existing ontologies
  • A data fusion engine that is able to perform semantic integration and impact analysis of design-time models with runtime data
  • A tool for collaboration between domain experts and modeling designers
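
To make one of these ideas slightly more concrete, here is a minimal, purely illustrative sketch of what the model inferencer mentioned above could do at its simplest: deduce a rough common schema, as field names and observed types, from a set of unstructured JSON-like records. Everything in it is an invented toy illustration, not an existing MDE tool.

```python
# Illustrative sketch (not an existing tool): infer a rough common schema
# from a set of unstructured JSON-like records produced by a software process.
from collections import defaultdict

def infer_schema(records):
    """Return {field: set of observed Python type names} across all records."""
    schema = defaultdict(set)
    for record in records:
        for field, value in record.items():
            schema[field].add(type(value).__name__)
    return dict(schema)

# Hypothetical sample of heterogeneous process data.
records = [
    {"id": 1, "author": "alice", "duration": 12.5},
    {"id": 2, "author": "bob", "duration": 7, "failed": True},
]
print(infer_schema(records))
# e.g. {'id': {'int'}, 'author': {'str'}, 'duration': {'float', 'int'}, 'failed': {'bool'}}
```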

A disclaimer

Obviously, we are aware that some research initiatives aiming at cognifying specific tasks in Software Engineering exist (including some activities of ours). But what we claim here is a change in magnitude of their coverage, integration, and impact in the short-term future.

If you want to get a more detailed description, you can go through the detailed post by Jordi Cabot that reports the whole content of the paper.

Instrumenting Continuous Knowledge Extraction, Sharing, and Benchmarking

This is a contribution in response to the Call for Linked Research for the workshop at ESWC 2017 entitled Enabling Decentralised Scholarly Communication.

Authors: Marco Brambilla, Emanuele Della Valle, Andrea Mauri, Riccardo Tommasini.

Affiliation: Politecnico di Milano, DEIB, Data Science Lab. Milano, Italy.

You can also read and download the full article in PDF
“Nanos gigantum humeris insidentes”
(Bernard of Chartres, 1115 ca.)

Introduction

Science aims at creating new knowledge on top of existing knowledge, from the observation of physical phenomena, their modeling and empirical validation. This combines the well-known motto “standing on the shoulders of giants” (attributed to Bernard of Chartres and subsequently rephrased by Isaac Newton) with the need to try out and validate new experiments.
However, knowledge in the world continuously evolves, at a pace that cannot be traced even by large crowdsourced bodies of knowledge such as Wikipedia. A large share of the generated data is not currently analysed and consolidated into exploitable information and knowledge (Ackoff 1989). In particular, the process of ontological knowledge discovery tends to focus on the most popular items, those which are most quoted or referenced, and is less effective in discovering less popular items, belonging to the so-called long tail, i.e. the portion of the entity distribution having fewer occurrences (Brambilla 2016).
This becomes a challenge for practitioners, enterprises and scholars/researchers, who need to keep up to date with innovation and emerging facts. The scientific community also needs to make sure there is a structured and formal way to represent, store and access such knowledge, for instance as ontologies or linked data sources.
Our idea is to propose a vision towards a set of (possibly integrated) publicly available tools that can help scholars keep pace with the evolving knowledge. This implies the capability of integrating informal sources, such as social networks, blogs, and user-generated content in general. One can conjecture that somewhere, within the massive content shared by people online, any low-frequency, emerging concept or fact has left some traces. The challenge is to detect such traces, assess their relevance and trustworthiness, and transform them into formalized knowledge (Stieglitz 2014). An appropriate set of tools that can improve the effectiveness of knowledge extraction, storage, analysis, publishing and experimental benchmarking could be extremely beneficial for the entire research community across fields and interests.

Our Vision towards Continuous Knowledge Extraction and Publishing

We foresee a paradigm where knowledge seeds can be planted, and subsequently grow, finally leading to the generation and collection of new knowledge, as depicted in the exemplary process shown below: knowledge seeding (through types, context variables, and example instances), growing (for instance by exploring social media), and harvesting for extracting concepts (instances and types).

We advocate for a set of tools that, once implemented and integrated, enable the following scenario (see the sketch after this list):

  • the possibility of selecting any kind of source of raw data, independently of its format, type or semantics (spanning quantitative data, textual content, and multimedia content), covering both data streams and pull-based data sources;
  • the possibility of applying different data cleaning and data analysis pipelines to the different sources, in order to increase data quality and the level of abstraction/aggregation;
  • the possibility of integrating the selected sources;
  • the possibility of running homogeneous knowledge extraction processes over the integrated sources;
  • the possibility of publishing the results of the analysis and semantic enrichment as new and further (richer) data sources and streams, in a coherent, standard and semantic way.
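
As a purely schematic sketch of how these capabilities could be chained (all function names and behaviours below are placeholders invented for illustration, not existing tools), note how the published output can feed back into the pipeline as a new source:

```python
# Schematic skeleton of the envisioned tool chain; every function here is a
# placeholder for a component that would need to be implemented or integrated.

def select_sources(source_ids):
    """Return raw items from streams or pull-based sources (placeholder)."""
    return [{"source": s, "text": "raw content from " + s} for s in source_ids]

def clean(items):
    """Source-specific cleaning / abstraction pipeline (placeholder)."""
    return [dict(item, text=item["text"].strip().lower()) for item in items]

def integrate(items):
    """Merge items from different sources into one integrated view (placeholder)."""
    return items

def extract_knowledge(items):
    """Homogeneous knowledge extraction over the integrated data (placeholder)."""
    return [{"entity": item["text"].split()[-1], "evidence": item} for item in items]

def publish(facts):
    """Publish results as a new semantic source, reusable as input (placeholder)."""
    return facts

new_source = publish(extract_knowledge(integrate(clean(select_sources(["twitter", "blogs"])))))
print(new_source)
```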

This enables generation of new sources which in turn can be used in subsequent knowledge extraction processes of the same kind. The results of this process must be available at any stage to be shared for building an open, integrated and continuously evolving knowledge for research, innovation, and dissemination purposes.

A Preliminary Feasibility Perspective

Whilst beneficial and powerful, the vision we propose is far from being achieved today. However, we are convinced that it is not out of reach in the mid term. To give a hint of this, we report here our experience with the research, design and implementation of a few tools that point in the proposed direction:

  1. Social Knowledge Extractor (SKE) is a publicly available tool for discovering emerging knowledge by extracting it from social content. Once instrumented by experts through a very simple initialization, the tool is capable of finding emerging entities by means of a mixed syntactic-semantic method. The method uses seeds, i.e. prototypes of emerging entities provided by experts, to generate candidates; it then associates candidates with feature vectors, built using terms occurring in their social content, and ranks the candidates by their distance from the centroid of the seeds, returning the top candidates as results. The tool can run continuously or with periodic iterations, using the results as new seeds. Our research on this has been published in (Brambilla et al., 2017); a simplified implementation is currently available online for demo purposes
    at http://datascience.deib.polimi.it/social-knowledge/,
    and the code is available as open-source under an Apache 2.0 license on GitHub at https://github.com/DataSciencePolimi/social-knowledge-extractor.
  2. TripleWave is a tool for disseminating and exchanging RDF streams on the Web. For the purpose of processing information streams in real time and at Web scale, TripleWave integrates nicely with RDF Stream Processing (RSP) and Stream Reasoning (SR) as solutions that combine semantic technologies with stream and event processing techniques. In particular, it integrates with an existing ecosystem of solutions to query, reason over, and perform real-time processing on heterogeneous and distributed data streams. TripleWave can be fed with existing Web streams (e.g. Twitter and Wikipedia streams) or time-annotated RDF datasets (e.g. the Linked Sensor Data dataset), and it can be invoked through both pull- and push-based mechanisms, thus enabling RSP engines to automatically register and receive data from TripleWave. The tool has been described in (Mauri et al., 2016) and the code is available as open source on GitHub at https://github.com/streamreasoning/TripleWave/.
  3. RSPlab enables the efficient design and execution of reproducible experiments, as well as the sharing of their results. It integrates two existing RSP benchmarks (LSBench and CityBench) and two RSP engines (C-SPARQL engine and CQELS). It provides a programmatic environment to deploy RDF streams and RSP engines in the cloud, interact with them using TripleWave and RSP Services, continuously monitor their performance, and collect statistics. RSPlab is released as open source under an Apache 2.0 license, is currently under submission at the ISWC Resources Track, and is available on GitHub
    at https://github.com/streamreasoning/rsplab.

Conclusions

We believe that knowledge intake by scholars is going to become more and more time consuming and expensive, due to the amount of knowledge that is built and shared every day. We envision a comprehensive approach based on integrated tools for data collection, cleaning, integration, analysis and semantic representation that can be run continuously, keeping the formalized knowledge bases aligned with the evolution of knowledge, at limited cost and with high recall on the facts and concepts that emerge or decay. These tools do not need to be implemented by the same vendor or provider; we instead advocate for open-source publishing of all the implementations, as well as for the definition of an agreed-upon integration platform that allows them all to communicate appropriately.

Outlook on Research Resource Sharing

As we envisioned an ecosystem that includes, but is not limited to, modules for extraction, sharing and benchmarking, two research questions require investigation in the immediate future:
First, how can we design and publish new resources for such an ecosystem? Do they exist already? It is important to understand what else is available out there. Researchers commonly support their scientific studies with resources that, if released, can benefit the whole community. The release process must comply with a scientific method that ensures repeatability and reproducibility. However, a standard, agreed-upon methodology that guides this process does not exist yet.
Second, how should we combine these resources towards shared research workflows? To investigate this research question, we need a platform that enables researchers to deploy their resources and interact with the ecosystem. Therefore, we call for an open discussion about how this integration should be done.

References

  • Russell L. Ackoff. From Data to Wisdom. Journal of Applied Systems Analysis 16, 3–9 (1989).

  • Marco Brambilla, Stefano Ceri, Florian Daniel, Emanuele Della Valle. On the Quest for Changing Knowledge. In Proceedings of the Workshop on Data-Driven Innovation on the Web (DDI ’16). ACM Press, 2016.

  • Stefan Stieglitz, Linh Dang-Xuan, Axel Bruns, Christoph Neuberger. Social Media Analytics. Business & Information Systems Engineering 6, 89–96. Springer Nature, 2014.

  • Marco Brambilla, Stefano Ceri, Emanuele Della Valle, Riccardo Volonterio, Felix Xavier Acero Salazar. Extracting Emerging Knowledge from Social Media. In Proceedings of the 26th International Conference on World Wide Web (WWW ’17). ACM Press, 2017.

  • Andrea Mauri, Jean-Paul Calbimonte, Daniele Dell’Aglio, Marco Balduini, Marco Brambilla, Emanuele Della Valle, Karl Aberer. TripleWave: Spreading RDF Streams on the Web. In Lecture Notes in Computer Science, pp. 140–149. Springer International Publishing, 2016.

(*) Note: the current version includes content in response to an online open review.

Extracting Emerging Knowledge from Social Media

Today I presented our full paper titled “Extracting Emerging Knowledge from Social Media” at the WWW 2017 conference.

The work is based on a rather obvious assumption, i.e., that knowledge in the world continuously evolves, and ontologies are largely incomplete for what concerns low-frequency data, belonging to the so-called long tail.

Socially produced content is an excellent source for discovering emerging knowledge: it is huge, and it immediately reflects relevant changes, within which emerging entities are hidden.

In the paper we propose a method and a tool for discovering emerging entities by extracting them from social media.

Once instrumented by experts through a very simple initialization, the method is capable of finding emerging entities; we propose a mixed syntactic-semantic method. The method uses seeds, i.e. prototypes of emerging entities provided by experts, to generate candidates; it then associates candidates with feature vectors, built using terms occurring in their social content, and ranks the candidates by their distance from the centroid of the seeds, returning the top candidates as results.

The method can be continuously or periodically iterated, using the results as new seeds.
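
For illustration only, here is a minimal sketch of the core ranking step under assumed toy data; it is not the actual implementation of the tool, just the general idea of centroid-based ranking of candidate term vectors.

```python
# Minimal sketch of the core ranking step: represent seeds and candidates as
# term vectors of their social content, then rank candidates by distance from
# the seed centroid. All names and data below are illustrative assumptions.
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer

seed_texts = {           # prototypes of emerging entities provided by experts
    "seed_brand_a": "streetwear drop limited sneakers collab",
    "seed_brand_b": "capsule collection sneakers streetwear launch",
}
candidate_texts = {      # candidates generated from social content
    "candidate_x": "new sneakers collab limited drop hype",
    "candidate_y": "pasta recipe dinner tomato garlic",
}

vectorizer = TfidfVectorizer()
all_texts = list(seed_texts.values()) + list(candidate_texts.values())
vectors = vectorizer.fit_transform(all_texts).toarray()

seed_vecs = vectors[: len(seed_texts)]
cand_vecs = vectors[len(seed_texts):]
centroid = seed_vecs.mean(axis=0)

# Rank candidates by Euclidean distance from the seed centroid (closest first).
distances = np.linalg.norm(cand_vecs - centroid, axis=1)
ranking = sorted(zip(candidate_texts, distances), key=lambda pair: pair[1])
print(ranking)  # the top candidates can be fed back as new seeds for iteration
```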

The PDF of the full paper presented at WWW 2017 is available online (open access with Creative Common license).

You can also check out the slides of my presentation on Slideshare.

A demo version of the tool is available online for free use, thanks also to our partners Dandelion and Microsoft Azure.

You can TRY THE TOOL NOW if you want.

Spark-based Big Data Analysis of Semantic IFML Models and Web Logs for Enhanced User Behavior Analytics

I’d like to report on our demonstration paper at WWW 2017, focusing on Spark-based Big Data Analysis of Semantic IFML Models and Web Logs for Enhanced User Behavior Analytics.

The motivation of the work is that no approaches exist for merging web log analysis and statistics with information about the Web application structure, content and semantics. Indeed, basic Web analytics tools are widespread and provide statistics about Web site navigation at the syntactic level only: they analyze the user interaction at the page level in terms of page views, entry and landing pages, page views per visit, and so on. Unfortunately, those tools provide precise statistics neither about the content and semantics of the visited pages, nor about the actual reactions of users to the actual content (instances) they are shown.

With our work we demonstrate the advantages of combining Web application models with runtime navigation logs, with the purpose of deepening the understanding of user behaviour.

We propose a model-driven approach that combines: user interaction modeling (based on the IFML standard); full code generation of the designed application; user tracking at runtime through logging of component execution and user activities; integration with page content details; generation of integrated schema-less data streams; and the application of large-scale big data analytics and visualization tools, using data visualization techniques that represent the statistics directly on the IFML visual models of the Web application.
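
As a simplified, hypothetical sketch of the kind of Spark job involved (column names, component identifiers and data are assumptions, not the actual demo code), runtime log events can be joined with IFML component metadata so that the statistics refer to model elements rather than raw page URLs:

```python
# Simplified sketch (not the actual demo code): join runtime log events with
# IFML model metadata in Spark, so statistics refer to model elements rather
# than raw page URLs. All column names and values are assumptions.
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("ifml-log-analytics").getOrCreate()

# Runtime logs: one row per executed component / user event.
logs = spark.createDataFrame(
    [("u1", "cmp_42", "2017-04-03T10:00:00"),
     ("u2", "cmp_42", "2017-04-03T10:05:00"),
     ("u1", "cmp_77", "2017-04-03T10:06:00")],
    ["user_id", "component_id", "timestamp"],
)

# IFML model metadata exported from the design-time models.
model = spark.createDataFrame(
    [("cmp_42", "ProductList", "ViewContainer:Catalog"),
     ("cmp_77", "ProductDetails", "ViewContainer:Product")],
    ["component_id", "component_name", "container"],
)

# Enrich each log event with model-level semantics, then aggregate.
stats = (logs.join(model, "component_id")
             .groupBy("container", "component_name")
             .agg(F.countDistinct("user_id").alias("users"),
                  F.count("*").alias("events")))
stats.show()
```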

The paper describing the approach is available in the WWW 2017 proceedings.

The video of the demo is available on YouTube:

The Dawn of a new Digital Renaissance in Cultural Heritage

Fluxedo joined forces with the Observatory of Digital Innovation in Arts & Culture Heritage (Osservatorio per l’innovazione digitale nei beni e attività culturali) by the School of Management (MIP) of Politecnico di Milano, for covering the social media analytics of Italian and international museums.

The results of the work were presented during a very successful event hosted by Piccolo Teatro di Milano on January 19th, 2017.

The live dashboard of the SocialOmeters analysis on the museums is available here:

www.socialometers.com/osservatoriomusei/


A summary of the event, built from its social media content via Storify, is available here.


The official hashtag of the event, #OBAC17, became a Twitter trend in Italy, with 579 tweets, 187 users, around 600 likes and retweets, and a potential audience of 2.2 million users.

The event had huge visibility in the national media, as reported in this press review:

1.      La rivoluzione dei musei online. Il primato di Triennale e Pinacoteca – read Il Corriere della Sera Milano

2.      L’innovazione prolifera (ma fatica) – read Il Sole 24 Ore Nòva

3.      I musei italiani e la digitalizzazione: il punto del Politecnico di Milano – read Advertiser

4.      Osservatorio Politecnico, musei social ma con pochi servizi digitali – read Arte Magazine

5.      Arte & Innovazione. Musei italiani sempre più social (52%) e virtuali (20%) – read Corriere del Web

6.      Boom di visitatori nei musei ma è flop dei servizi digitali – read Il Sole 24 Ore Blog

7.      Capitolini, comunali e Maxxi di Roma tra i musei più popolari sui social network – read La Repubblica Roma

8.      La pagina Facebook della Reggia di Venaria è la più apprezzata d’Italia con oltre 166 mila “like” – read La Stampa Torino

9.      Il 52% dei musei italiani è social ma i servizi digitali per la fruizione delle opere sono limitati. Un’analisi dell’Osservatorio Innovazione Digitale nei Beni e Attività Culturali – read Lombard Street

10.  Musei sempre più social, ecco i più cliccati  – read TTG Italia

11.  Tanta cultura, poco digitale: solo il 52% dei musei italiani è sui social e il 43% non ha ancora un sito – read Vodafone News

12.  Tra Twitter e Instagram, 52% musei italiani punta sui social media – read ADNKronos

13.  Musei strizzano occhi a social, ma strada è lunga – read ANSA ViaggiArt

14.  Oltre la metà dei musei italiani è online e sui social, ma i servizi digitali evoluti e quelli on site sono ancora scarsi – read Brand News

15.  Il 52% dei musei italiani è social ma i servizi digitali per la fruizione delle opere sono limitati – read DailyNet

16.  Musei italiani sempre più social, ma i servizi digitali sono limitati – read Diario Innovazione

17.  Musei e social network – read Inside Art

18.  Musei Vaticani e Maxxi tra i più social d’Italia – read Il Messaggero

19.  I musei si fanno spazio sui social – read Italia

20.  Musei italiani social, ma non troppo – read La Repubblica

21.  Capitolini e Maxxi da record sui social – read La Repubblica Roma

22.  Musei lucani poco social e poco visitabili in web – read La Siritide

23.  Venaria regina dei social – read La Stampa Torino

24.  In calo nel 2016 il numero degli ingressi nei luoghi di cultura in Basilicata – read Oltre

25.  Musei sempre più social, ma poco interattivi – read QN ILGIORNO – il Resto del Carlino – LA NAZIONE

26.  Dal marketing alle guide per disabili. Cultura, boom dell’industria digitale – read QN ILGIORNO – il Resto del Carlino – LA NAZIONE

27.  Musei romani sempre più social – read Radio Colonna

28.  Social network: il Maxxi tra i musei più popolari – read Roma2Oggi

29.  Musei italiani sempre più social, ma la strada è ancora lunga – read Travel No Stop

30.  Musei italiani sempre più social e virtuali – read Uomini & Donne della Comunicazione

31.  Tra Twitter e Instagram, un museo su due in Italia scommette sui …  – read Italia per Me

32.  Beni culturali, musei lucani poco social – read La Nuova del Sud

33.  Fb, Instagram e Twitter: i musei italiani puntano sui social ma non basta – read  La Repubblica

34.  Tra Twitter e Instagram, un museo su due in Italia scommette sui social media – read La Stampa

35.  Il 52% dei musei italiani è social ma i servizi digitali per la fruizione delle opere sono limitati – read Sesto Potere

36.  Il 52% dei musei italiani è social, ma la fruizione delle opere digital è limitata – read Il Sole 24 Ore

Keynote from Google Research on Building Knowledge Bases at #ICWE2016

I report here some highlights of the keynote speech by Xin Luna Dong at the 16th International Conference on Web Engineering (ICWE 2016). Incidentally, she is now moving to Amazon to start a new project on building an Amazon knowledge base.
Building knowledge bases still remains a challenging task.
First, one has to decide how to build the knowledge: automatically or manually?
A 2014 survey listed the largest efforts in knowledge building: the top 4 approaches are manually curated, while the bottom 3 are automatic.
Google’s Knowledge Vault and Knowledge Graph are the big winners in terms of volume.
When you move to long-tail content, curation does not scale: automation must be viable and precise.
This is in line with our own research line on extracting changing knowledge, which we are just starting (we presented a short paper at a Web Science 2016 workshop last month).
Where can knowledge be extracted from? In Knowledge Vault:
  • largest share of the content comes from DOM structured documents
  • then textual content
  • then annotated content
  • and a small share from web tables

Knowledge Vault is a matrix-based approach to knowledge building, with rows = entities and columns = attributes.
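
As a toy illustration of this matrix view (not the actual Knowledge Vault implementation), extracted facts can be pivoted into an entity-by-attribute table holding the estimated probability that each extracted fact is correct:

```python
# Toy illustration of the matrix view (rows = entities, columns = attributes):
# pivot (entity, attribute, probability) facts into an entity-by-attribute
# table of estimated correctness probabilities. Data is invented.
import pandas as pd

facts = pd.DataFrame(
    [("The_Floating_Piers", "artist", 0.98),
     ("The_Floating_Piers", "location", 0.95),
     ("Some_New_Event", "location", 0.40)],
    columns=["entity", "attribute", "probability"],
)

matrix = facts.pivot_table(index="entity", columns="attribute",
                           values="probability", aggfunc="max")
print(matrix)
```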

It assumes the entities to be already available (e.g. in Freebase), and builds a training set over them.
One can build KBs by grouping triples into buckets with similar probability of being correct; it is important to estimate this correctness probability precisely.
Errors can include mistakes on:
  • triple identification
  • entity linkage
  • predicate linkage
  • source data

Besides general purpose KBs, Google built lightweight vertical knowledge bases (more than 100 available now).

When extracting knowledge, the ingredients are: the data sources, the extraction approach, the data items themselves, and the facts with their probability of truth.

Several models can be used for extracting knowledge. Two extremes of the spectrum are:

  1. Single-truth model. Every fact has only one true value: we trust the value claimed by the highest number of data sources.
  2. Multi-layer model. It separates source quality from extractor quality, and data errors from extraction errors. One can build a knowledge-based trust model that defines the trustworthiness of web pages, and compare this measure with the PageRank of the same pages (see the toy contrast sketched after this list).
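
To make the contrast concrete, here is a toy sketch (with invented observations and accuracy estimates, not Google’s actual models) of how the two extremes can disagree on a single fact:

```python
# Toy sketch of the two extremes for a single fact, with invented observations
# and per-source accuracy estimates; not Google's actual models.
from collections import Counter, defaultdict

observations = {      # source -> value extracted for the same fact
    "site_a": "Rome", "site_b": "Rome",
    "site_c": "Milan", "site_d": "Milan", "site_e": "Milan",
}
source_accuracy = {   # assumed estimates of each source's quality
    "site_a": 0.90, "site_b": 0.85,
    "site_c": 0.30, "site_d": 0.35, "site_e": 0.30,
}

# 1. Single-truth model: trust the value claimed by the most sources.
majority_value, _ = Counter(observations.values()).most_common(1)[0]

# 2. Quality-aware model: weight each claim by its source's estimated accuracy,
#    separating source quality from the raw count of claims.
weighted_support = defaultdict(float)
for source, value in observations.items():
    weighted_support[value] += source_accuracy[source]
quality_value = max(weighted_support, key=weighted_support.get)

print(majority_value)  # 'Milan' (claimed by three low-quality sources)
print(quality_value)   # 'Rome'  (two high-quality sources outweigh them)
```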

In general, the challenge is to move from individual information and data points, to integrated and connected knowledge. Building the right edges is really hard though.
Overall, a lot of ingredients influence the correctness of knowledge: temporal aspects, data source correctness, the capability of extraction and validation, and so on.

In summary: plenty of research challenges to be addressed, both by the data science and the modeling communities!

To keep updated on my activities you can subscribe to the RSS feed of my blog or follow my twitter account (@MarcoBrambi).