A Vision towards the Cognification of Model-driven Software Engineering

Jordi Cabot, Robert Clarisó, Marco Brambilla and Sébastien Gerard submitted a visionary paper on Cognifying Model-driven Software Development to the workshop GrandMDE (Grand Challenges in Modeling), co-located with STAF 2017 in Marburg (Germany), on July 17, 2017. The paper advocates for the cross-domain fertilization of disciplines such as machine learning and artificial intelligence, behavioural analytics, social studies, cognitive science, crowdsourcing and many more, in order to help model-driven software development.
But actually, what is cognification?

Cognification is the application of knowledge to boost the performance and impact of any process.

It is recognized as one of the 12 technological forces that will shape our future. We are flooded with data, ideas, people, activities, businesses, and goals. But this flood could actually turn out to be helpful.

The thesis of our paper is that cognification will also revolutionize the way software is built. In particular, we discuss the opportunities and challenges of cognifying Model-Driven Software Engineering (MDSE or MDE) tasks.

MDE has seen limited adoption in the software development industry, probably because developers and managers perceive that its benefits do not outweigh its costs.

We believe cognification could drastically improve the benefits and reduce the costs of adopting MDSE, and thus boost its adoption.

At the practical level, cognification comprises tools that range from artificial intelligence (machine learning, deep learning) to human cognitive capabilities exploited through online activities, crowdsourcing, gamification, and so on.

Opportunities (and challenges) for MDE

Here is a set of MDSE tasks and tools whose benefits can be especially boosted thanks to cognification.

  • A modeling bot playing the role of virtual assistant in the modeling tasks
  • A model inferencer able to deduce a common schema behind a set of unstructured data coming from the software process (see the sketch after this list)
  • A code generator able to learn the style and best practices of a company
  • A real-time model reviewer able to give continuous quality feedback
  • A morphing modeling tool, able to adapt its interface at run-time
  • A semantic reasoning platform able to map modeled concepts to existing ontologies
  • A data fusion engine that is able to perform semantic integration and impact analysis of design-time models with runtime data
  • A tool for collaboration between domain experts and modeling designers
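
To make the model inferencer item a bit more concrete, here is a minimal sketch of what such a tool could do, assuming the unstructured input arrives as JSON-like records; all names and the schema representation are illustrative, not a prescribed MDE solution:

```python
# Minimal sketch of a "model inferencer": derive a tentative schema
# (field names, types, optionality) from a set of JSON-like records.
# A real tool would emit a proper metamodel and handle nested structures.

from collections import defaultdict

def infer_schema(records):
    """Return {field: {"types": [...], "optional": bool}} for a list of dicts."""
    field_types = defaultdict(set)
    field_counts = defaultdict(int)
    for record in records:
        for field, value in record.items():
            field_types[field].add(type(value).__name__)
            field_counts[field] += 1
    total = len(records)
    return {
        field: {"types": sorted(types), "optional": field_counts[field] < total}
        for field, types in field_types.items()
    }

if __name__ == "__main__":
    sample = [
        {"id": 1, "name": "Order A", "amount": 10.5},
        {"id": 2, "name": "Order B"},   # 'amount' missing -> inferred as optional
    ]
    print(infer_schema(sample))
```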

A disclaimer

Obviously, we are aware that research initiatives aiming at cognifying specific tasks in Software Engineering already exist (including some of our own activities). What we claim here is a change in the magnitude of their coverage, integration, and impact in the short-term future.

If you want a more detailed description, you can read the post by Jordi Cabot that reports the full content of the paper.

Instrumenting Continuous Knowledge Extraction, Sharing, and Benchmarking

This is a contribution in response to the Call for Linked Research for the workshop at ESWC 2017 entitled Enabling Decentralised Scholarly Communication.

Authors: Marco Brambilla, Emanuele Della Valle, Andrea Mauri, Riccardo Tommasini.

Affiliation: Politecnico di Milano, DEIB, Data Science Lab. Milano, Italy.

You can also read and download the full article in PDF.
“Nanos gigantum humeris insidentes”
(Bernard of Chartres, ca. 1115)

Introduction

Science aims at creating new knowledge on top of existing knowledge, through the observation of physical phenomena, their modeling, and empirical validation. This combines the well-known motto “standing on the shoulders of giants” (attributed to Bernard of Chartres and subsequently rephrased by Isaac Newton) with the need to try and validate new experiments.
However, knowledge in the world continuously evolves, at a pace that cannot be tracked even by large crowdsourced bodies of knowledge such as Wikipedia. A large share of the generated data is not currently analysed and consolidated into exploitable information and knowledge (Ackoff 1989). In particular, the process of ontological knowledge discovery tends to focus on the most popular items, those which are most often quoted or referenced, and is less effective in discovering less popular items belonging to the so-called long tail, i.e. the portion of the entity distribution with fewer occurrences (Brambilla 2016).
This becomes a challenge for practitioners, enterprises and scholars/researchers, who need to stay up to date with innovation and emerging facts. The scientific community also needs to make sure there is a structured and formal way to represent, store and access such knowledge, for instance as ontologies or linked data sources.
Our idea is to propose a vision towards a set of (possibly integrated) publicly available tools that can help scholars keep pace with evolving knowledge. This implies the capability of integrating informal sources, such as social networks, blogs, and user-generated content in general. One can conjecture that somewhere, within the massive content shared by people online, any low-frequency, emerging concept or fact has left some trace. The challenge is to detect such traces, assess their relevance and trustworthiness, and transform them into formalized knowledge (Stieglitz 2014). An appropriate set of tools that can improve the effectiveness of knowledge extraction, storage, analysis, publishing and experimental benchmarking would be extremely beneficial for the entire research community, across fields and interests.

Our Vision towards Continuous Knowledge Extraction and Publishing

We foresee a paradigm where knowledge seeds can be planted, and subsequently grow, finally leading to the generation and collection of new knowledge, as depicted in the exemplary process shown below: knowledge seeding (through types, context variables, and example instances), growing (for instance by exploring social media), and harvesting for extracting concepts (instances and types).

(Figure: the knowledge seeding, growing and harvesting paradigm)
We advocate for a set of tools that, when implemented and integrated, enable the following prospective scenario:

  • possibility of selecting any kind of source of raw data, independently of its format, type or semantics (spanning quantitative data, textual content, and multimedia content), and covering both data streams and pull-based data sources;
  • possibility of applying different data cleaning and data analysis pipelines to the different sources, in order to increase data quality and the level of abstraction/aggregation;
  • possibility of integrating the selected sources;
  • possibility of running homogeneous knowledge extraction processes on the integrated sources;
  • possibility of publishing the results of the analysis and semantic enrichment as new and richer data sources and streams, in a coherent, standard and semantic way.

This enables the generation of new sources which in turn can be used in subsequent knowledge extraction processes of the same kind. The results of this process must be available at any stage, so that they can be shared for building an open, integrated and continuously evolving body of knowledge for research, innovation, and dissemination purposes.
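
The capabilities listed above could be wired together as in the following minimal sketch, where every stage is a plain Python callable and all names are placeholders; a real deployment would plug in actual cleaning, integration and semantic publishing components:

```python
# Minimal sketch of the envisioned pipeline: source selection, cleaning,
# integration, knowledge extraction, and publication of the enriched result
# as a new source. Every stage here is a deliberately naive placeholder.

def clean(record):
    # trivial cleaning: trim whitespace from string values
    return {k: v.strip() if isinstance(v, str) else v for k, v in record.items()}

def integrate(sources):
    # naive integration: concatenate the cleaned records of all sources
    return [clean(r) for source in sources for r in source]

def extract_knowledge(records):
    # placeholder extraction: collect distinct values per attribute
    facts = {}
    for r in records:
        for k, v in r.items():
            facts.setdefault(k, set()).add(v)
    return facts

def publish(facts):
    # a real implementation would publish RDF or a stream; here we just return it
    return {k: sorted(v) for k, v in facts.items()}

if __name__ == "__main__":
    twitter_like = [{"topic": " graphene ", "user": "a"}]
    blog_like = [{"topic": "graphene", "user": "b"}]
    print(publish(extract_knowledge(integrate([twitter_like, blog_like]))))
```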

A Preliminary Feasibility Perspective

Whilst beneficial and powerful, the vision we propose is far from being achieved today. However, we are convinced that it is not out of reach in the mid term. To give a hint of this, we report here our experience with the research, design and implementation of a few tools that point in the proposed direction:

  1. Social Knowledge Extractor (SKE) is a publicly available tool for discovering emerging knowledge by extracting it from social content. Once instrumented by experts through a very simple initialization, the tool is capable of finding emerging entities by means of a mixed syntactic-semantic method. The method uses seeds, i.e. prototypes of emerging entities provided by experts, to generate candidates; it then associates candidates with feature vectors, built from the terms occurring in their social content, and ranks the candidates by their distance from the centroid of the seeds, returning the top candidates as result (see the sketch after this list). The tool can run continuously or in periodic iterations, using the results as new seeds. Our research on this has been published in (Brambilla et al., 2017); a simplified implementation is currently available online for demo purposes
    at http://datascience.deib.polimi.it/social-knowledge/,
    and the code is available as open-source under an Apache 2.0 license on GitHub at https://github.com/DataSciencePolimi/social-knowledge-extractor.
  2. TripleWave is a tool for disseminating and exchanging RDF streams on the Web. For the purpose of processing information streams in real time and at Web scale, TripleWave integrates nicely with RDF Stream Processing (RSP) and Stream Reasoning (SR) as solutions that combine semantic technologies with stream and event processing techniques. In particular, it integrates with an existing ecosystem of solutions to query, reason and perform real-time processing over heterogeneous and distributed data streams. TripleWave can be fed with existing Web streams (e.g. Twitter and Wikipedia streams) or time-annotated RDF datasets (e.g. the Linked Sensor Data dataset), and it can be invoked through both pull- and push-based mechanisms, thus enabling RSP engines to automatically register and receive data from TripleWave. The tool has been described in (Mauri et al., 2016) and the code is available as open source on GitHub at https://github.com/streamreasoning/TripleWave/.
  3. RSPlab enables the efficient design and execution of reproducible experiments, as well as the sharing of results. It integrates two existing RSP benchmarks (LSBench and CityBench) and two RSP engines (C-SPARQL engine and CQELS). It provides a programmatic environment to deploy RDF streams and RSP engines in the cloud, interact with them using TripleWave and RSP Services, and continuously monitor their performance and collect statistics. RSPlab is released as open source under an Apache 2.0 license, is currently under submission to the ISWC Resources Track, and is available on GitHub
    at https://github.com/streamreasoning/rsplab.
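
The centroid-based ranking described for SKE (item 1) can be sketched as follows. This is a simplified re-implementation of the idea based on the description above, not the actual SKE code, and the bag-of-words features are deliberately naive:

```python
# Sketch of the mixed syntactic-semantic ranking described for SKE:
# seeds and candidates are represented as bag-of-words vectors built from
# their social content, and candidates are ranked by their distance from
# the centroid of the seed vectors. Simplified; not the actual SKE code.

import math
from collections import Counter

def vectorize(text):
    return Counter(text.lower().split())

def centroid(vectors):
    total = Counter()
    for v in vectors:
        total.update(v)
    return {term: count / len(vectors) for term, count in total.items()}

def distance(v, c):
    terms = set(v) | set(c)
    return math.sqrt(sum((v.get(t, 0) - c.get(t, 0)) ** 2 for t in terms))

def rank_candidates(seed_texts, candidate_texts, top_k=3):
    c = centroid([vectorize(t) for t in seed_texts])
    scored = [(distance(vectorize(t), c), name) for name, t in candidate_texts.items()]
    return [name for _, name in sorted(scored)[:top_k]]

if __name__ == "__main__":
    seeds = ["new sustainable fashion brand launch",
             "eco fashion designer collection"]
    candidates = {"BrandX": "sustainable fashion collection launch",
                  "CarDealerY": "used cars discount offer"}
    print(rank_candidates(seeds, candidates))  # BrandX is ranked first
```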

Conclusions

We believe that knowledge intake by scholars is going to become more and more time-consuming and expensive, due to the amount of knowledge that is built and shared every day. We envision a comprehensive approach based on integrated tools for data collection, cleaning, integration, analysis and semantic representation that can be run continuously, keeping formalized knowledge bases aligned with the evolution of knowledge, with limited cost and high recall on the facts and concepts that emerge or decay. These tools do not need to be implemented by the same vendor or provider; we instead advocate for open-source publishing of all the implementations, as well as for the definition of an agreed-upon integration platform that allows them all to communicate appropriately.

Outlook on Research Resource Sharing

Since we envision an ecosystem that includes, but is not limited to, modules for extraction, sharing and benchmarking, two research questions require investigation in the immediate future:
First, how can we design and publish new resources for such an ecosystem? Do they already exist? It is important to understand what else is available out there. Researchers commonly support their scientific studies with resources that could benefit the whole community, if released. The release process must comply with a scientific method that ensures repeatability and reproducibility. However, a standard, agreed-upon methodology that guides this process does not exist yet.
Second, how should we combine these resources towards shared research workflows? To investigate this research question, we need a platform that enables researchers to deploy their resources and interact with the ecosystem. Therefore, we call for an open discussion about how this integration should be done.

References

  • Russell L. Ackoff. From data to wisdom. Journal of Applied Systems Analysis 16, 3–9 (1989).

  • Marco Brambilla, Stefano Ceri, Florian Daniel, Emanuele Della Valle. On the quest for changing knowledge. In Proceedings of the Workshop on Data-Driven Innovation on the Web (DDI ’16). ACM Press, 2016.

  • Stefan Stieglitz, Linh Dang-Xuan, Axel Bruns, Christoph Neuberger. Social Media Analytics. Business & Information Systems Engineering 6, 89–96. Springer Nature, 2014.

  • Marco Brambilla, Stefano Ceri, Emanuele Della Valle, Riccardo Volonterio, Felix Xavier Acero Salazar. Extracting Emerging Knowledge from Social Media. In Proceedings of the 26th International Conference on World Wide Web (WWW ’17). ACM Press, 2017.

  • Andrea Mauri, Jean-Paul Calbimonte, Daniele Dell’Aglio, Marco Balduini, Marco Brambilla, Emanuele Della Valle, Karl Aberer. TripleWave: Spreading RDF Streams on the Web. In Lecture Notes in Computer Science, 140–149. Springer International Publishing, 2016.

(*) Note: the current version includes content in response to an online open review.

ExploreWeb workshop on exploration of (Semantic) Web data at ICWE 2011

Together with Piero Fraternali and Daniel Schwabe, I organized a workshop at ICWE 2011 on Search, Exploration and Navigation of Web Data Sources, named ExploreWeb 2011.
It was a new challenge for us, since it was the first edition of the workshop, but I can say it was quite a successful event.
We received 12 submissions and accepted 7 of them; in addition, we invited Soren Auer as a keynote speaker to start the day. Attendance was also very good, with 25+ people in the room for the whole day.
Here is a quick summary of the day.

Soren Auer: Exploration and other stages of the Linked Data Life Cycle

Soren Auer at exploreWeb 2011.

Soren Auer from the University of Leipzig gave a very nice keynote talk at the beginning of the ExploreWeb workshop on the entire lifecycle of Linked Data.
The talk was centered on the requirements imposed by the continuous growth of the Linked Open Data (LOD) cloud and on the life cycle associated with Linked Data content. This life cycle comprises the following phases:

  1. Extraction of LOD: extracting linked data is a challenge per se. For instance, in the case of the DBpedia extraction from Wikipedia, the issues to be considered include keeping the semantic version aligned with the user-generated one, while at the same time coping with the messy and incoherent “schemas” offered by Wikipedia infoboxes. To cover this aspect, a “Mapping Wiki” based on a higher-level ontology has been created for defining the mapping between labels in Wikipedia.
  2. Storage and Querying of LOD: the critical issue here is that it is still 5 to 50 times slower than an RDBMS. On the other hand, it obviously grants increased flexibility (especially at the schema manipulation level). A new benchmark recently performed by Soren’s group provides fresh performance results for Virtuoso, Sesame, Jena, and BigOWLIM. The benchmark was run on 25 frequent DBpedia queries and shows that Virtuoso is consistently about twice as fast as its competitors, while Jena is confirmed as the worst-performing platform.
  3. Authoring of LOD: different approaches can be adopted, including semantic wikis (e.g., OntoWiki), in which users do not edit text but semantic descriptions built with forms. We can identify two main classes of semantic wikis: semantic text wikis and semantic data wikis. A new approach is adopted by the RDFa Content Editor, which uses OpenCalais and other APIs to help annotate text within a WYSIWYG environment.
  4. Linking LOD: approaches to linking can be automatic, semi-automatic (e.g., see the tools SILK and LIMES), or manual (e.g., see Sindice in UIs and Semantic Pingback). A toy sketch of automatic label-based matching follows this list.
  5. Evolution of LOD: the evolution of linked data is a critical problem, not yet fully addressed. The EvoPat project is a first attempt to formalize the problem and a solution, by defining a set of evolution patterns and anti-patterns. Some features are already integrated into OntoWiki.
  6. Exploration of LOD: challenging because of size, heterogeneity, and distribution. Examples mentioned include the spatial and faceted exploration of LinkedGeoData, Freebase as arguably the best search assistant for linked data, Parallax, the Neofonie faceted browser, domain-specific exploration tools (such as a relationship finder over RDF), and visual query builders.
  7. Visualization of LOD: on this point, Soren highlighted that, with the continuously growing size of LOD, (semantic) data visualization will become more and more important. He presented some preliminary approaches, but a lot of work still needs to be done in this field.
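
As a toy illustration of the automatic linking of phase 4, here is a hedged sketch of label-based instance matching in the spirit of tools such as SILK and LIMES; the real tools rely on far richer similarity metrics, blocking strategies and declarative link specifications:

```python
# Toy sketch of automatic link discovery between two datasets based on
# label similarity. Real link-discovery frameworks (SILK, LIMES) offer
# configurable metrics and scalable blocking; this is only an illustration.

from difflib import SequenceMatcher

def label_similarity(a, b):
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()

def discover_links(source, target, threshold=0.85):
    """source/target: {uri: label}. Returns candidate owl:sameAs pairs."""
    links = []
    for s_uri, s_label in source.items():
        for t_uri, t_label in target.items():
            if label_similarity(s_label, t_label) >= threshold:
                links.append((s_uri, "owl:sameAs", t_uri))
    return links

if __name__ == "__main__":
    dbpedia = {"dbpedia:Leipzig": "Leipzig"}
    geonames = {"geonames:2879139": "Leipzig, Germany"}
    print(discover_links(dbpedia, geonames, threshold=0.6))
```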

In the discussion and Q&A that followed the keynote, the hot topics were the performance benchmark and the authoring of LOD by end users versus expert/technical users.

Alessandro Bozzon: A Conceptual Framework for Linked Data Exploration

Alessandro Bozzon

Alessandro discussed the motivation for the problem of exploration and integration of linked data sources and then described the Search Computing approach to linked data exploration, which applies the general-purpose SeCo framework to the specific needs of LOD.
More on this can be found on the Search Computing web site, including a demo and a video.

Daniel Schwabe: Support for reusable explorations of Linked Data in the Semantic Web
Daniel Schwabe started his talk with some strong motivation statements.
One of the main benefits of linked data should be that data bring their own self-description.
However, if you work with it, you may end up doing really dirty work on the data to make it linked.

Daniel Schwabe

When you move to exploration interfaces, the expectations of end users might be very different from what exploration tools for tech-savvy users offer. That gap needs to be filled, and Rexplorator moves in that direction. Explorator was presented at the Linked Data on the Web workshop (LDOW) in Madrid in 2009; its extension Rexplorator was demonstrated at ISWC 2010 and has now been presented extensively at the ExploreWeb workshop.
With it, you can compose functions, parametrize operators, and store and reuse “use cases”, following a query-by-example approach. The UI lets you think that you are dealing with resources and sets of resources, while the system actually deals only with triples and SPARQL queries (a sketch of this idea follows below).
A pretty interesting approach, which has something in common with the Search Computing one, and also features great UI and expressive power. It also covers faceted search.
Rexplorator is an MVC-based application implemented in Ruby using the ActiveRDF DSL.
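
To make the “sets of resources versus triples” remark concrete: a UI-level set such as “museums located in Paris” ultimately boils down to a plain SPARQL query. Below is a minimal sketch in Python (rather than the tool’s Ruby stack), using the SPARQLWrapper library against the public DBpedia endpoint; the endpoint, prefixes and property names are assumptions for illustration, not Rexplorator’s internals:

```python
# Sketch of how a UI-level "set of resources" can reduce to a SPARQL query.
# Requires the SPARQLWrapper package and network access to dbpedia.org;
# the ontology properties used here are assumptions, not guaranteed mappings.

from SPARQLWrapper import SPARQLWrapper, JSON

sparql = SPARQLWrapper("https://dbpedia.org/sparql")
sparql.setQuery("""
    PREFIX dbo: <http://dbpedia.org/ontology/>
    PREFIX dbr: <http://dbpedia.org/resource/>
    SELECT ?museum WHERE {
        ?museum a dbo:Museum ;
                dbo:location dbr:Paris .
    } LIMIT 5
""")
sparql.setReturnFormat(JSON)
results = sparql.query().convert()
for binding in results["results"]["bindings"]:
    print(binding["museum"]["value"])   # each result is one member of the "set"
```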

In-Young Ko

Han-Gyu Ko and In-Young Ko. Generation of Semantic Clouds based on Linked Data for Efficient Multimedia Semantic Annotations
The presentation started from the definition of the requirements for semantic cloud generation: the idea is to produce tag clouds and help people annotate multimedia content (e.g., IP-TV content).
The requirements include being able to:

  • identify the optimal number of tag clouds
  • balance the size of the different clouds shown to the users
  • check the coherence between the clouds and avoid ambiguity within each cloud.

The proposed lifecycle includes three phases:

  1. locating the spotting points: through a context-aware search of linked data, starting from the more important and densely connected nodes. More general nodes are more likely to be selected
  2. selecting the relations to traverse: the aim here is to reduce the RDF graph to the set of relevant relations only
  3. identifying term similarity and clustering the tags (see the sketch at the end of this subsection).

Compared with simpler approaches for constructing clouds (e.g., based on rdf:type and SKOS parsing), this approach leads to better and more meaningful clouds of keywords.
The implemented system overlays the generated clouds on the IPTV screen and lets people select the tags.
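
Phase 3 (term similarity and clustering) can be illustrated with a minimal sketch that groups tags whose feature vectors are similar; the cosine measure, the threshold and the greedy grouping are placeholders for the method actually used in the paper:

```python
# Minimal sketch of phase 3: group tags whose feature vectors are similar,
# using cosine similarity and a naive single-link grouping. The actual paper
# defines its own similarity measure and clustering strategy.

import math

def cosine(a, b):
    dot = sum(a[k] * b.get(k, 0.0) for k in a)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def cluster_tags(vectors, threshold=0.7):
    """vectors: {tag: {feature: weight}}. Greedy single-link clustering."""
    clusters = []
    for tag, vec in vectors.items():
        for cluster in clusters:
            if any(cosine(vec, vectors[t]) >= threshold for t in cluster):
                cluster.append(tag)
                break
        else:
            clusters.append([tag])
    return clusters

if __name__ == "__main__":
    vecs = {"soccer":   {"sport": 1.0, "goal": 0.8},
            "football": {"sport": 0.9, "goal": 0.7},
            "piano":    {"music": 1.0}}
    print(cluster_tags(vecs))  # [['soccer', 'football'], ['piano']]
```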

Mamoun Abu Helou. Segmentation of Geo-Referenced Queries

Mamoun Abu Helou

This work aimed at manipulating natural language, multi-objective queries so as to split them into several simple single-aim queries.
The focus of the work was limited to geographical queries. It exploited Geowordnet, Yago, GeoNames, and Google GeoCoder API for identifying the important geographical concepts in the query. Both instances (e.g., Louvre) and classes (e.g., museum) can be identified.
A benchmark over 250 queries shows promising results for the approach.

Peter Dolog. SimSpectrum: A Similarity Based Spectral Clustering Approach to Generate a Tag Cloud
Peter’s work addressed the specific problem of clustering within tag clouds. There are several known problems with tag clouds: recent tags are overlooked because they have lower frequency; very frequent ones are often useless; and so on.
The presentation delved into the discussion on the selection of the best algorithms for clustering of tags.
The aim was to reduce the number of tags, pick the most relevant ones, and place semantically close terms at nearby locations in the cloud. The approach was evaluated in terms of coverage, overlap and relevance between the queries and the generated clouds, in the medical domain.
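
As a hedged illustration of the general technique, here is a sketch of similarity-based spectral clustering of tags using scikit-learn on a precomputed similarity matrix; the similarity values and the number of clusters are invented for the example and do not reproduce SimSpectrum itself:

```python
# Hedged sketch: spectral clustering of tags from a precomputed tag-tag
# similarity matrix (affinities are made up; SimSpectrum defines its own
# similarity measure and cluster selection strategy).

import numpy as np
from sklearn.cluster import SpectralClustering

tags = ["heart", "cardiac", "lung", "pulmonary"]
similarity = np.array([
    [1.0, 0.9, 0.2, 0.1],
    [0.9, 1.0, 0.1, 0.2],
    [0.2, 0.1, 1.0, 0.8],
    [0.1, 0.2, 0.8, 1.0],
])

labels = SpectralClustering(n_clusters=2, affinity="precomputed",
                            random_state=0).fit_predict(similarity)
for tag, label in zip(tags, labels):
    print(label, tag)   # expected grouping: {heart, cardiac} vs {lung, pulmonary}
```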

Matthias Keller. A Unified Approach for Modeling Navigation over Hierarchical, Linear and Networked Structures

Matthias Keller

This was a visionary presentation on the needs and possible directions for a navigational model for data structures.
Data structures are very diverse (trees, graphs, …), and extracting the hyperlink/access structure from the content structure is very difficult (basically, there is no automatic transformation between the two). CMSs enable some of this, but with limited expressive power and difficult configurability.
The idea is then to model:

  • the content organization supporting different graph-based content structures
  • the description of the access structures
  • the relation between the content and the access structure

The authors propose a graphical notation for covering these requirements and define some navigation patterns using this language.
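
A minimal sketch of the separation the authors advocate, keeping the content structure, the access structure and the relation between them as distinct artifacts; the class and field names are hypothetical and do not reproduce the proposed graphical notation:

```python
# Minimal sketch of separating content structure, access structure, and the
# relation between them; class and field names are illustrative only.

from dataclasses import dataclass, field

@dataclass
class ContentNode:
    name: str
    children: list = field(default_factory=list)   # hierarchical content structure

@dataclass
class AccessNode:
    label: str
    links_to: list = field(default_factory=list)   # navigation/access structure

@dataclass
class NavigationModel:
    content_root: ContentNode
    access_root: AccessNode
    mapping: dict = field(default_factory=dict)    # access label -> content name

if __name__ == "__main__":
    articles = ContentNode("Articles", [ContentNode("2011"), ContentNode("2010")])
    menu = AccessNode("Archive", [AccessNode("By year")])
    model = NavigationModel(articles, menu, {"By year": "Articles"})
    print(model.mapping)
```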

Rober Morales-Chaparro. Data-driven and User-driven Multidimensional Data Visualization
This work aims at automatically deriving a set of optimal visualizations of complex data, covering the entire lifecycle (a toy sketch of step 4 follows the list):

  1. the data model 
  2. the data mining
  3. the information model
  4. the visualization proposal engine
  5. the visualization model
  6. the code generation
  7. and the final generated application for the end user
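
As announced above, here is a toy sketch of the visualization proposal step (step 4): choosing a chart type from simple characteristics of the data. The heuristics are invented for illustration and do not reproduce the authors’ engine:

```python
# Toy sketch of a visualization proposal step: suggest a chart type from
# simple characteristics of the dataset. Heuristics are invented and do not
# reproduce the authors' proposal engine.

def propose_visualization(columns):
    """columns: {name: 'numeric' | 'categorical' | 'temporal'}."""
    kinds = list(columns.values())
    if "temporal" in kinds and "numeric" in kinds:
        return "line chart"
    if kinds.count("numeric") >= 2:
        return "scatter plot"
    if "categorical" in kinds and "numeric" in kinds:
        return "bar chart"
    return "table"

if __name__ == "__main__":
    print(propose_visualization({"date": "temporal", "sales": "numeric"}))      # line chart
    print(propose_visualization({"country": "categorical", "gdp": "numeric"}))  # bar chart
```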

To conclude, we also generated a simple tag cloud for the content discussed during the workshop.
