Documentation

Introduction

The goal of the hackathon was to devise a preliminary instantiation of a generic architecture for question answering on Linked Data. In the following, we begin by presenting the key requirements for the architecture. We then present the architecture itself and describe the functionality of each of its modules. We include preliminary implementations and identify components that still need to be implemented. Finally, we describe our approach to evaluating the system.

Architecture

Key requirements

The architecture was designed to be as generic as possible while remaining easy to understand, implement and use. Our first key requirement was to ensure that no programming language is imposed on the user. The motivation behind this requirement was simply that certain programming languages are better suited for certain tasks. Given the variety of tasks required to achieve high-quality question answering, enforcing a single programming language would have limited the functionality and extensibility of the framework. We thus decided that all modules would be implemented as web services.

Our second requirement was to reuse existing standards as much as possible. We thus decided that all services are to generate and consume JSON objects according to the architectural design below.
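To illustrate, a message exchanged between two services might be a simple JSON object such as the one constructed below; the field names (question, language, score) are our assumptions and would be fixed by the architectural design described next.

    import json

    # Hypothetical message passed between services; the concrete fields
    # are an assumption, to be fixed by the architectural design below.
    message = {
        "question": "Which drugs have side effects on the heart?",
        "language": "en",
        "score": 1.0,
    }
    print(json.dumps(message, indent=2))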

Our third requirement was that of provenance tracking. We thus chose to add the ID of each service to its JSON output, making the contribution of each module easy to track throughout the QA process.
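A minimal sketch of how a service could implement this provenance tagging, assuming the trail is kept in a dedicated field of the JSON message (the field name "provenance" is our assumption, not part of the specification):

    SERVICE_ID = "template-generation-v1"  # hypothetical service identifier

    def tag_provenance(message: dict) -> dict:
        # Append this service's ID to the provenance trail so the
        # contribution of each module can be tracked across the workflow.
        message.setdefault("provenance", []).append(SERVICE_ID)
        return message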

Design

Several types of architecture can be envisaged for QA. We assumed the QA process to be a workflow in which a controller decides on the workflow to employ, stores metadata on the current workflow and is free to call components in the order it requires. Each component, in turn, expects a particular type of JSON object as input and returns JSON as output. Depending on their implementation, components are free to access as many other components as required.
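Under these assumptions, a minimal controller sketch could look as follows; the endpoint URLs and the fixed call order are purely illustrative, since a real controller would select the workflow dynamically and store metadata about it:

    import requests

    # Hypothetical endpoints; a deployed controller would configure or
    # discover these, and choose the call order per workflow.
    PIPELINE = [
        "http://localhost:8001/template-generation",
        "http://localhost:8002/disambiguation",
        "http://localhost:8003/query-generation",
        "http://localhost:8004/answer-generation",
    ]

    def run_workflow(question: str) -> dict:
        message = {"question": question}
        for endpoint in PIPELINE:
            # Each component consumes the current JSON message and
            # returns an enriched JSON message.
            message = requests.post(endpoint, json=message).json()
        return message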

Overall, six modules were specified as integral parts of the QA process.

  1. Question Generation: This module takes a question or question fragment as input and returns a set of scored questions or question fragments as output. Questions are generated from sample templates and word dictionaries. An autocomplete user interface can be built on top of this module.

  2. Template Generation: Takes in a question and generates pseudo-queries as well as a list of strings (so-called slots) for which data from the knowledge base is needed. The pseudo-queries are scored. This module can be agnostic of the underlying knowledge base.

  3. Disambiguation: This module takes the output of the template generation and returns URIs that map to the slots.

  4. Query Generation: This module combines the results of the disambiguation and the template generation into executable queries (e.g., SPARQL queries); a sketch of such a module exposed as a web service follows this list.

  5. Answer Generation: The answer generation module takes a list of queries and endpoints as input and returns the results of the QA system.

  6. Rendering: This module renders the results of the QA system in a user-friendly manner.
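As an illustration of how a single module can be exposed as a web service, the following sketch wraps a stub query generation step with Flask; the message fields (pseudo_query, slot_uris, queries) and the naive slot substitution are assumptions, not part of the specification:

    from flask import Flask, request, jsonify

    app = Flask(__name__)

    @app.route("/query-generation", methods=["POST"])
    def query_generation():
        message = request.get_json()
        # Hypothetical fields: a pseudo-query containing placeholders and
        # a slot-to-URI mapping produced by the disambiguation module.
        query = message["pseudo_query"]
        for slot, uri in message["slot_uris"].items():
            query = query.replace(slot, "<" + uri + ">")
        message["queries"] = [query]
        return jsonify(message)

    if __name__ == "__main__":
        app.run(port=8003)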

Other modules that might be considered when building a QA system include:

  1. Named Entity Recognition

  2. Path Finder

  3. Full-Text Index

Evaluation

Evaluation of QA systems typically involves a set of pre-defined questions, example queries, and answers to the questions. Question Answering over Linked Data (QALD) is an evaluation campaign on multilingual question answering over linked data. QALD-4 provides a set of biomedical questions drawn from DrugBank, SIDER and Diseaseome. We wanted to extend this evaluation to a broader set of questions of increasing complexity, and to consider data from DBpedia, Bio2RDF and BioGateway. Towards this goal, we

(1) established a set of descriptors to characterize questions [document][spreadsheet]

(2) created questions, queries, and answers over our KB.
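Since QALD scores systems by precision, recall, and F-measure per question, a minimal sketch of the comparison against the gold standard could look as follows (representing answers as sets of strings is our assumption about the gold-standard format):

    def evaluate(system_answers: set, gold_answers: set) -> dict:
        # Per-question precision, recall and F-measure, as used in QALD.
        if not system_answers or not gold_answers:
            return {"precision": 0.0, "recall": 0.0, "f_measure": 0.0}
        correct = len(system_answers & gold_answers)
        precision = correct / len(system_answers)
        recall = correct / len(gold_answers)
        f_measure = 2 * precision * recall / (precision + recall) if correct else 0.0
        return {"precision": precision, "recall": recall, "f_measure": f_measure}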