- largest share of the content comes from DOM structured documents
- then textual content
- then annotated content
- and a small share from web tables
Knowledge Vault is a matrix based approach to knowledge building, with rows = entities and columns= attributes.
- triple identification
- entity linkage
- predicate linkage
- source data
Besides general purpose KBs, Google built lightweight vertical knowledge bases (more than 100 available now).
When extracting knowledge, the ingredients are: datasource, extractor approach, the data items themselves, facts and their probability of truth.
Several models can be used for extracting knowledge. Two extremes of the spectrum are:
- Single-truth model. Every fact has only one truth. We trust the value of the highest number of datasources.
- Multilaeyer model. separates source quality from extractor quality and data errors from extraction errors. One can build a knowledge-based trust model, defining trustworthiness of web pages. One can compare this measure with respect to page rank of web pages:
In general, the challenge is to move from individual information and data points, to integrated and connected knowledge. Building the right edges is really hard though.
Overall, a lot of ingredients influence the correctness of knowledge: temporal aspects, data source correctness, capability of extraction and validation, and so on–
In summary: Plenty of research challenges to be addressed, both by the datascience and modeling communities!