Keynote from Google Research on Building Knowledge Bases at #ICWE2016

I report here some highlights of the keynote speech by Xin Luna Dong at the 16th International Conference on Web Engineering (ICWE 2016). Incidentally, she is now moving to Amazon to start a new project on building an Amazon knowledge base.
Building knowledge bases remains a challenging task.
First, one has to decide how to build the knowledge: automatically or manually?
A 2014 survey listed the large knowledge-building efforts of the time: the top 4 in that list are manually curated, the bottom 3 are built automatically.
Google's Knowledge Vault and Knowledge Graph are the big winners in terms of volume.
When you move to long tail content, curation does not scale. Automation must be viable and precise.
This is in line with the research we are starting on Extracting Changing Knowledge (we presented a short paper at a Web Science 2016 workshop last month).
Where can knowledge be extracted from? In Knowledge Vault:
  • largest share of the content comes from DOM structured documents
  • then textual content
  • then annotated content
  • and a small share from web tables

Knowledge Vault is a matrix-based approach to knowledge building, with rows = entities and columns = attributes.
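To make the matrix view concrete, here is a minimal sketch that pivots a few triples into rows = entities and columns = attributes. The triples, the attribute names and the `to_matrix` helper are made up for illustration; this is not how Knowledge Vault stores its data.

```python
from collections import defaultdict

# Hypothetical (entity, attribute, value) triples, for illustration only.
triples = [
    ("Barack Obama", "birth_place", "Honolulu"),
    ("Barack Obama", "profession", "Politician"),
    ("Tom Cruise", "birth_place", "Syracuse"),
]

def to_matrix(triples):
    """Pivot triples into a nested dict: rows = entities, columns = attributes."""
    matrix = defaultdict(dict)
    for entity, attribute, value in triples:
        matrix[entity][attribute] = value
    return dict(matrix)

matrix = to_matrix(triples)

# An empty cell is a value that knowledge extraction would still have to fill.
missing = matrix["Tom Cruise"].get("profession")
```

Here `missing` is `None`: the row for that entity exists, but the cell for that attribute is still empty.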

It assumes the entities are already available (e.g. in Freebase), and builds its training on top of them.
One can build KBs by grouping triples into buckets with a similar probability of being correct. It's important to estimate that correctness probability precisely.
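One way to check whether the estimated probabilities are precise is to bucket the triples by predicted probability and compare each bucket's average prediction with the share of triples that turn out to be correct. The sketch below assumes a small hand-labelled sample; the function name, the number of buckets and the data are my own illustrative choices, not something shown in the keynote.

```python
def calibration_buckets(scored_triples, n_buckets=5):
    """Group (probability, is_correct) pairs into equal-width probability buckets.

    Returns one tuple per non-empty bucket:
    (low, high, mean_predicted_probability, observed_accuracy, count).
    """
    buckets = [[] for _ in range(n_buckets)]
    for prob, is_correct in scored_triples:
        # Clamp so that prob == 1.0 falls into the last bucket.
        idx = min(int(prob * n_buckets), n_buckets - 1)
        buckets[idx].append((prob, is_correct))

    report = []
    for i, items in enumerate(buckets):
        if not items:
            continue
        mean_pred = sum(p for p, _ in items) / len(items)
        accuracy = sum(1 for _, ok in items if ok) / len(items)
        report.append((i / n_buckets, (i + 1) / n_buckets, mean_pred, accuracy, len(items)))
    return report

# Hypothetical sample: (predicted probability, manually verified correctness).
scored = [
    (0.95, True), (0.92, True), (0.90, False),
    (0.55, True), (0.45, False),
    (0.15, False), (0.10, False),
]

report = calibration_buckets(scored)
for low, high, mean_pred, accuracy, count in report:
    print(f"[{low:.1f}, {high:.1f}) predicted={mean_pred:.2f} observed={accuracy:.2f} n={count}")
```

If the probabilities are well estimated, the predicted and observed columns stay close in every bucket; a large gap means the scores in that range should not be trusted as they are.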
Errors can include mistakes on:
  • triple identification
  • entity linkage
  • predicate linkage
  • source data

Besides general purpose KBs, Google built lightweight vertical knowledge bases (more than 100 available now).

When extracting knowledge, the ingredients are: the data source, the extraction approach, the data items themselves, the facts, and their probability of being true.

Several models can be used for extracting knowledge. Two extremes of the spectrum are:

  1. Single-truth model. Every data item has only one true value. We trust the value provided by the highest number of data sources.
  2. Multilayer model. It separates source quality from extractor quality, and data errors from extraction errors. One can build a knowledge-based trust model that defines the trustworthiness of web pages. One can compare this measure with the PageRank of web pages.
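The difference between trusting the crowd and trusting source quality can be sketched in a few lines. The first function is plain majority voting, as in the single-truth model. The second is NOT the multilayer model from the keynote: it is a much simpler stand-in that weights each vote by the log-odds of an assumed, already known source accuracy, just to show how source quality can overturn a majority. Source names, claims and accuracies are all hypothetical.

```python
import math
from collections import Counter, defaultdict

# Hypothetical claims: (source, data_item, value).
OBAMA = ("Barack Obama", "birth_place")
CRUISE = ("Tom Cruise", "birth_place")
claims = [
    ("site_a", OBAMA, "Honolulu"),
    ("site_b", OBAMA, "Honolulu"),
    ("site_c", OBAMA, "Chicago"),
    ("site_a", CRUISE, "Syracuse"),
    ("site_c", CRUISE, "New York"),
    ("site_d", CRUISE, "New York"),
]

# Assumed source accuracies (in a real system these must be estimated).
accuracy = {"site_a": 0.9, "site_b": 0.8, "site_c": 0.3, "site_d": 0.3}

def majority_vote(claims):
    """Single-truth voting: keep the value claimed by the most sources."""
    votes = defaultdict(Counter)
    for _source, item, value in claims:
        votes[item][value] += 1
    return {item: counter.most_common(1)[0][0] for item, counter in votes.items()}

def weighted_vote(claims, accuracy):
    """Weight each vote by log(a / (1 - a)), the log-odds of the source accuracy."""
    scores = defaultdict(lambda: defaultdict(float))
    for source, item, value in claims:
        a = accuracy[source]
        scores[item][value] += math.log(a / (1 - a))
    return {item: max(values, key=values.get) for item, values in scores.items()}

by_majority = majority_vote(claims)
by_quality = weighted_vote(claims, accuracy)
```

In this toy data both methods agree on the first item, while on the second one the majority picks "New York" and the quality-weighted vote picks "Syracuse", because the two sources agreeing on "New York" are assumed to be unreliable.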

In general, the challenge is to move from individual information and data points, to integrated and connected knowledge. Building the right edges is really hard though.
Overall, a lot of ingredients influence the correctness of knowledge: temporal aspects, data source correctness, capability of extraction and validation, and so on.

In summary: plenty of research challenges to be addressed, both by the data science and modeling communities!

To keep updated on my activities you can subscribe to the RSS feed of my blog or follow my twitter account (@MarcoBrambi).
