Easing the Questioning of Semantic Biomedical Data

Researchers have been using semantic technologies as essential tools to structure knowledge. This is particularly relevant in the biomedical domain, where large datasets are continuously generated. Semantic technologies offer the ability to describe data and to map and link distributed repositories, creating a network where the search interface is a single entry point. However, the increasing number of publicly available semantic data repositories is creating new challenges related to their exploration. Despite being human- and machine-readable, these technologies are much more challenging for end-users. Querying services usually require mastering formal languages, and that knowledge is beyond the typical user's expertise, which is a critical obstacle to adopting semantic web information systems. In particular, the questioning of biomedical data presents specific challenges for which there are still no mature proposals for production environments. This paper presents a solution to query biomedical semantic databases using natural language. The system sits at the intersection between semantic parsing and the use of templates. It makes it possible to extract information in a friendly way for users who are not experts in semantic queries.


I. INTRODUCTION
The digitization of science in all research institutions has transformed science into a set of data-driven activities, enabling the exponential advancement of human knowledge [1]. This deluge of digital records resulted in numerous data repositories in the most varied formats, from simple spreadsheets to sophisticated databases. This situation made data reuse a challenge, especially in the long tail of science, where information remains closed and accessible only to the members of the research group that produced it [2]. In the case of biomedical sciences, we find that the wide variety of repositories responds to concrete needs. Some examples are electronic health record databases [3], data resulting from genetic studies [4], massive collections of medical images [5], and the metadata describing biobanks [6]. Scientific practice has established that the secondary use of data benefits various health research areas, significantly impacting the population's quality of life [7]. Therefore, researchers must have access to the best tools for sharing their data with their peers for the community's benefit.
Research in information systems has tried to solve the integration and interoperability of data distributed on the Internet from early on. The Semantic Web (SW) and Linked Data (LD) principles responded to those challenges, and their use gained traction in the biomedical community [8]. Semantic technologies are at the core of many systems used, for example, in areas as diverse as translational medicine, systems biology, and biopharmaceutics [9]. With the SW, the structuring of knowledge domains gained a powerful tool for formalization, the Web Ontology Language (OWL), which abstractly identifies classes, properties, and individuals [10]. The success of this approach is evident in the NCBO BioPortal, where many biomedical ontologies and terminologies are available [11].
The Resource Description Framework (RDF) is the SW's data model, establishing a basic structure, the RDF triple, made of a subject, a predicate, and an object. This simple way of specifying semantic units of information allows capturing the richness of biomedical data in a scalable way [12]. The subject-predicate-object representation, together with ontologies, enables the annotation of knowledge and the creation of semantic repositories that can be massive. It is, therefore, necessary to have tools capable of questioning this data to obtain answers and create new knowledge. The standard strategy available out-of-the-box is the use of formal languages such as SPARQL [13]. Formal languages allow a vast range of options for forming queries, structured with their logical forms. For example, in SPARQL, if we need to retrieve variables and their bindings directly, we use the SELECT form, and to obtain a boolean indicating whether a pattern matches, we use ASK. Although powerful, these and other constructs are difficult for non-IT people to use, which limits the adoption of such systems.
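To make the SELECT/ASK distinction concrete, the following minimal sketch stores triples as Python tuples and mimics the two query forms over a single triple pattern. It is not a SPARQL engine; the `ex:` identifiers, the sample data, and the helpers `match`, `select`, and `ask` are hypothetical and chosen only for illustration.

```python
# Toy illustration of RDF triples and of the SELECT / ASK distinction.
# This is NOT a SPARQL engine: it matches a single triple pattern only.
# All identifiers (ex:patient1, ex:hasGender, ...) are hypothetical.

TRIPLES = [
    ("ex:patient1", "ex:hasGender", "female"),
    ("ex:patient2", "ex:hasGender", "male"),
    ("ex:patient1", "ex:hasDiagnosis", "ex:HuntingtonDisease"),
]


def match(pattern, triples):
    """Yield one binding dict per triple that matches the pattern.

    Terms starting with '?' are variables; anything else must match exactly.
    """
    for triple in triples:
        bindings = {}
        for term, value in zip(pattern, triple):
            if term.startswith("?"):
                # A repeated variable must bind to the same value.
                if bindings.setdefault(term, value) != value:
                    break
            elif term != value:
                break
        else:
            yield bindings


def select(pattern, triples):
    """SELECT-like behaviour: return every variable binding."""
    return list(match(pattern, triples))


def ask(pattern, triples):
    """ASK-like behaviour: return True if at least one match exists."""
    return any(True for _ in match(pattern, triples))


rows = select(("?s", "ex:hasGender", "?g"), TRIPLES)
exists = ask(("ex:patient2", "ex:hasDiagnosis", "?d"), TRIPLES)
```

Here `rows` holds one binding per patient, whereas `exists` is a single boolean, which mirrors the difference between retrieving bindings and testing whether a pattern matches.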
One way to overcome the difficulties presented by systems that use formal languages is to create interfaces that accept natural language. This strategy frees users from the burden of mastering logical formalism and gives more users the opportunity to take advantage of stored knowledge. Despite the benefits that these systems promise, the technology is not yet mature enough, and there is a need to investigate new solutions [14]. This paper presents a solution to query biomedical semantic databases using natural language, combining semantic parsing and templates.
We organized the rest of the paper as follows. Section II overviews the related work in question-answering over knowledge bases. Section III presents our solution for questioning semantic data, integrated into a semantic data creation tool. In Section IV, we use the tool to transform and explore data of patients with Huntington disease. Finally, Section V closes the paper with conclusions.
II. QUESTION-ANSWERING OVER KNOWLEDGE BASES
Generally, we call question-answering (QA) systems those that interface with databases through natural language (NL). The goal is to obtain precise information supported by the data without using formal query languages. The implementation of these systems has been investigated for the most varied data types, considering the specificities of the data being questioned. Thus, some solutions specialize in conventional relational databases, and others question unstructured data such as text corpora [15]. In addition to these, a particular set of linguistic interfaces aims to take advantage of information residing in semantic databases. While sharing Natural Language Processing (NLP) challenges with the first types of systems, they nevertheless present particularities that deserve to be highlighted [16]. When the way entities are expressed in the NL question differs from the forms used in the knowledge base (KB), we face a lexical gap (e.g. "the King" in the NL question vs. dbp:Elvis_Presley in DBpedia). The fact that the same phrase can represent several entities gives rise to ambiguity (homographs, e.g. money bank vs. river bank). Also problematic in specific contexts is the processing of complex questions that ask for aggregated, filtered, or ordered outputs. Finally, multilingualism refers to using the same interface to ask questions in several NLs and/or over multilingual KBs. Several proposals have emerged to tackle these difficulties, and they can be grouped into the four architectural styles described in the following subsections.

A. Semantic Parsing Pipelines
The most common QA systems process data sequentially from input to output. The information passes through the elements of a pipeline and is transformed until it reaches a logical form digestible by a conventional SPARQL query engine. The typical architecture for this type of solution is shown in Figure 1 and consists of several blocks, commonly called filters. Several of these blocks correspond to well-known NLP elements, and many implementations are available to build tailor-made solutions. When creating the parse tree, we usually perform tokenization, named-entity recognition (NER), part-of-speech (POS) tagging, and dependency parsing [17]. With this design, improving a particular component improves the system as a whole. The next transformation is entity linking (EL) [18]. Although we have good solutions for constructing the parse tree, EL poses demanding challenges when dealing with lexical gaps and ambiguities. As Ruseti et al. did, we can use an ontology to reduce ambiguity [19], but often none is available. Once the EL process is complete, a final module is responsible for transforming the parse tree, with its entities and relations correctly linked, into a SPARQL query.
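The sequence of filters described above can be sketched as a chain of small functions. This is a deliberately simplified illustration, assuming a hand-written lexicon in place of real NER, POS tagging, and dependency parsing; the lexicon entries, the `ex:` URIs, and the function names are hypothetical.

```python
# Toy semantic-parsing pipeline: tokenize -> link entities/relations -> build query.
# The lexicons, the URIs, and the query shape are hypothetical; a real system
# would use NLP components (NER, POS tagging, dependency parsing) instead.
import re

# Hypothetical lexicons mapping surface forms to KB identifiers.
ENTITY_LEXICON = {"huntington disease": "ex:HuntingtonDisease"}
RELATION_LEXICON = {"symptoms": "ex:hasSymptom", "genes": "ex:associatedGene"}


def tokenize(question):
    """Lowercase the question and split it into word tokens."""
    return re.findall(r"[a-z0-9]+", question.lower())


def link(tokens):
    """Return the (entity, relation) pair found in the tokens; None marks a miss."""
    text = " ".join(tokens)
    entity = next((uri for form, uri in ENTITY_LEXICON.items() if form in text), None)
    relation = next((uri for form, uri in RELATION_LEXICON.items() if form in tokens), None)
    return entity, relation


def to_sparql(entity, relation):
    """Serialize the linked entity and relation as a SPARQL SELECT string."""
    if entity is None or relation is None:
        raise ValueError("lexical gap: could not link the question to the KB")
    return f"SELECT ?answer WHERE {{ {entity} {relation} ?answer }}"


def pipeline(question):
    """Run the filters in sequence, each consuming the previous output."""
    return to_sparql(*link(tokenize(question)))


query = pipeline("What are the symptoms of Huntington disease?")
```

Because each filter has a narrow contract, one stage (for instance `link`) can be replaced by a better component without touching the others. A question that the lexicon cannot cover, such as one about "the King", raises an error in this sketch, which is how the lexical gap shows up in practice.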

B. Subgraphs Matching
One way to avoid difficulties with the semantic pipeline's last filters is to replace them with an architectural block for constructing subgraphs, as depicted in Figure 2. Usually, this kind of solution builds upon realising that executing a formal query is equivalent to finding a subgraph [20]. Beyond this observation, it is possible to construct the answer to a question by navigating the semantic graph nodes to collect triple candidates for the final solution. Therefore, we are dealing with a search problem in a space that can be prohibitively large without considering appropriate heuristics [21]. At the end of the process, we need a strategy for selecting the most likely response.
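The search view of the problem can be illustrated with a bounded breadth-first expansion over a tiny graph. The graph, the hop limit, and the keyword-overlap score below are hypothetical stand-ins for the heuristics and ranking strategies that a real system would require.

```python
# Toy subgraph search: expand from a seed node and collect candidate triples.
# The graph, the hop limit, and the keyword-overlap score are hypothetical
# stand-ins for the heuristics a real system would need.
from collections import deque

GRAPH = [
    ("ex:HD", "ex:causedBy", "ex:HTT"),
    ("ex:HTT", "ex:locatedOn", "ex:Chromosome4"),
    ("ex:HD", "ex:hasSymptom", "ex:Chorea"),
    ("ex:Chromosome4", "ex:partOf", "ex:HumanGenome"),
]


def candidates(seed, graph, max_hops):
    """Breadth-first expansion, bounded by max_hops to limit the search space."""
    seen, found = {seed}, []
    queue = deque([(seed, 0)])
    while queue:
        node, depth = queue.popleft()
        if depth == max_hops:
            continue  # do not expand nodes at the hop limit
        for s, p, o in graph:
            if s == node:
                found.append((s, p, o))
                if o not in seen:
                    seen.add(o)
                    queue.append((o, depth + 1))
    return found


def best(found, keywords):
    """Select the candidate whose predicate mentions the most question keywords."""
    def score(triple):
        predicate = triple[1].lower()
        return sum(1 for word in keywords if word in predicate)
    return max(found, key=score)


found = candidates("ex:HD", GRAPH, max_hops=2)
answer = best(found, {"located"})
```

With `max_hops=2`, the triple about `ex:HumanGenome` is never collected, which shows how a simple bound prunes the space before the final selection step.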

C. Template-based QA
When looking for the answer to complex questions, the previous systems are not the most suitable. The challenges posed by the lexical gap and ambiguity cannot always be solved satisfactorily by strict semantic pipelines. The use of templates allows a more accurate operation in tackling these problems [22]. A template is a query skeleton with an arbitrary degree of complexity, fitted to the KB, with slots to fill with information from entities and relations. Figure 3 outlines this type of solution.
The creation of templates is performed offline by analysing the questions to be asked and the KB data. Solutions with a manual annotation component are common, which is an obvious limitation. Having more templates is better, but their quality is also essential. Therefore, fully automatic template generation must proceed carefully. One way is to use textual information that extends the KB [23]. The online phase is easy to describe: a question is matched with a template to produce a logical form.
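The online phase can be sketched as pattern matching followed by slot filling. In this illustration the templates are hand-written regular expressions paired with query skeletons; the patterns, the `ex:` vocabulary, and the lexicon are hypothetical, and nothing here reflects how templates are generated offline.

```python
# Toy template-based QA: match a question against a pattern, then fill the
# slots of the paired query skeleton. Patterns, skeletons, and the lexicon
# are hypothetical and hand-written for illustration.
import re

TEMPLATES = [
    (
        re.compile(r"^how many (?P<cls>\w+) have (?P<val>[\w ]+)\?$", re.IGNORECASE),
        "SELECT (COUNT(?x) AS ?n) WHERE {{ ?x a {cls} ; ex:hasCondition {val} }}",
    ),
    (
        re.compile(r"^which (?P<cls>\w+) have (?P<val>[\w ]+)\?$", re.IGNORECASE),
        "SELECT ?x WHERE {{ ?x a {cls} ; ex:hasCondition {val} }}",
    ),
]

# Hypothetical lexicon linking surface forms to KB identifiers.
LEXICON = {"patients": "ex:Patient", "chorea": "ex:Chorea"}


def answer_query(question):
    """Return the filled query for the first matching template, else None."""
    for pattern, skeleton in TEMPLATES:
        m = pattern.match(question.strip())
        if m is None:
            continue
        slots = {name: LEXICON.get(text.lower()) for name, text in m.groupdict().items()}
        if None in slots.values():
            return None  # lexical gap: a slot could not be linked to the KB
        return skeleton.format(**slots)
    return None


query = answer_query("How many patients have chorea?")
```

The aggregate in the first skeleton shows why templates suit complex questions: the counting logic lives in the skeleton, so the online phase only has to fill two slots.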

D. QA based on Information Extraction
When we proceed to the direct extraction of triples, we are in the presence of information extraction systems, where we completely bypass the creation of a logical form. The use of machine learning techniques to create vector representations is common (see Figure 4).
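As a rough illustration of ranking answers without building a logical form, the sketch below compares a question with candidate triples in a shared vector space. Bag-of-words counts stand in for the learned representations such systems typically use; the candidate triples and function names are hypothetical.

```python
# Toy answer ranking with vector representations: no logical form is built.
# Bag-of-words counts stand in for learned embeddings; the candidate triples
# are hypothetical.
import math
from collections import Counter

CANDIDATES = [
    ("Huntington disease", "caused by gene", "HTT"),
    ("Huntington disease", "has symptom", "chorea"),
    ("HTT", "located on chromosome", "4"),
]


def embed(text):
    """Map text to a sparse word-count vector."""
    return Counter(text.lower().replace("?", " ").split())


def cosine(a, b):
    """Cosine similarity between two sparse vectors."""
    dot = sum(a[w] * b[w] for w in a if w in b)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0


def rank(question, candidates):
    """Return the candidate whose subject and predicate best match the question."""
    q = embed(question)
    return max(candidates, key=lambda t: cosine(q, embed(f"{t[0]} {t[1]}")))


best = rank("Which gene is Huntington disease caused by?", CANDIDATES)
```

The object of the top-ranked triple (`best[2]`) is taken directly as the answer, so no query is ever generated or executed.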

III. SCALEUS-FD FOR QUESTION-ANSWERING
SCALEUS-FD is a semantic web tool developed to allow data integration [25], and it is available as open-source at https://github.com/bioinformatics-ua/scaleus-fair. Briefly, its main features are:
• Very easy to deploy and start using;
• Ontology-independent;
• RDF resource loading (.ttl, .rdf, .owl, .nt, .jsonld, .rj, .n3, .trig, .trix, .trdf, .rt);
• Supports importing data from spreadsheets (.xlsx, .xls, .ods);
• Support for multiple datasets;
• Text search;
• SPARQL queries;
• Query federation to the available data;
• Inference support;
• Metadata creation allowing search engine indexing;
• Web services API.
The application offers semantic data for remote access, allowing indexation by search engines that crawl Data Catalog Vocabulary (DCAT) descriptions. Figure 5 shows the software architecture. Users interact through a graphical interface, and a web services API enables machine-to-machine operations. Next, in the first subsection, we outline features related to creating semantic data and metadata (Data Handler and Metadata Handler). In the second subsection, we cover the QA Module.

A. Semantic Data and Metadata Modules
The Data Handler module is responsible for transforming information provided in a non-semantic format, such as data tables. The creation of semantic data maps the input data entities to triples and stores them in the KB. The user is free to establish a semantic scheme by creating convenient relations between data. Semantic prefixes can be chosen freely, and they can be created and stored for future use. Naturally, all transactions with the application's databases must ensure data integrity. The transaction database (TDB) components prevent data from being corrupted during create, read, update, and delete operations.
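The idea of mapping table cells to triples can be sketched in a few lines. The sample rows, the base URI, and the column-to-predicate mapping below are hypothetical; they do not reproduce the tool's implementation or any mapping used in this paper.

```python
# Toy version of mapping tabular rows to triples. The column-to-predicate
# mapping, the base URI, and the sample rows are hypothetical; they do not
# reproduce the mapping used by the tool.
import csv
import io

SHEET = "subject,gender\nP001,female\nP002,male\n"

# Hypothetical user-defined mapping: column name -> predicate.
MAPPING = {"gender": "foaf:gender"}
BASE = "http://example.org/patient/"


def rows_to_triples(text, key_column, mapping, base):
    """Turn each row into one triple per mapped column, keyed by key_column."""
    triples = []
    for row in csv.DictReader(io.StringIO(text)):
        subject = f"<{base}{row[key_column]}>"
        for column, predicate in mapping.items():
            value = row.get(column)
            if value:  # skip empty cells rather than emit empty literals
                triples.append((subject, predicate, f'"{value}"'))
    return triples


triples = rows_to_triples(SHEET, "subject", MAPPING, BASE)
```

Each row identifier becomes an HTTP URI for the subject, and each mapped column contributes one predicate-object pair, so adding a column to `MAPPING` extends the semantic scheme without changing the code.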
The metadata module ensures that data is Findable, Accessible, Interoperable, and Reusable, following the FAIR principles [26], commonly adopted in data stewardship. We ensure interoperability by using HTTP URIs to identify resources uniquely. We use the DCAT specification to characterize different layers of machine-readable metadata for describing the organizational schema catalog-dataset-distribution, which allows automatic indexation by search engines. Both data and metadata services are available through a REST API.
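The catalog-dataset-distribution layering can be made concrete with a small Turtle generator. The resource URIs and the title are hypothetical; only the vocabulary terms (dcat:Catalog, dcat:dataset, dcat:Dataset, dcat:distribution, dcat:Distribution, dcat:accessURL, dct:title) are taken from DCAT and Dublin Core. This is a sketch of the metadata shape, not of the tool's own output.

```python
# Toy generator of a DCAT catalog-dataset-distribution description in Turtle.
# The URIs and the title are hypothetical; only the DCAT and Dublin Core
# terms come from the vocabularies themselves.

PREFIXES = (
    "@prefix dcat: <http://www.w3.org/ns/dcat#> .\n"
    "@prefix dct: <http://purl.org/dc/terms/> .\n"
)


def describe(catalog, dataset, distribution, title, access_url):
    """Return Turtle text linking the three DCAT layers."""
    return (
        f"{PREFIXES}\n"
        f"<{catalog}> a dcat:Catalog ;\n"
        f"    dcat:dataset <{dataset}> .\n\n"
        f"<{dataset}> a dcat:Dataset ;\n"
        f'    dct:title "{title}" ;\n'
        f"    dcat:distribution <{distribution}> .\n\n"
        f"<{distribution}> a dcat:Distribution ;\n"
        f"    dcat:accessURL <{access_url}> .\n"
    )


turtle = describe(
    "http://example.org/catalog",
    "http://example.org/dataset/hd",
    "http://example.org/dataset/hd/ttl",
    "Example dataset",
    "http://example.org/download/hd.ttl",
)
```

Every resource is identified by an HTTP URI, and a crawler can follow the links from the catalog down to the access URL of a distribution.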

B. QA Module
The QA module allows querying the stored semantic data. On the one hand, we can operate in the traditional way by using SPARQL. This option enables advanced users to exploit all the power that a logical query language offers to construct very complex queries. On the other hand, the possibility of asking questions in natural language (in English) allows users less familiar with formal query languages to consult the knowledge stored in the KB. We integrated into the module the linguistic processing tools that allow us to do semantic parsing. Thus, the information is processed by transforming the NL question into a formal query that is then used internally to obtain the answers. But the strength of the solution is the possibility of using templates in the information retrieval process.
We can create templates in two ways. On the one hand, it is possible to provide curated, manually crafted lists. This approach has the advantage of capturing the users' intentions more precisely. However, it also has significant limitations. It does not scale well in production environments, where questioning needs give rise to new questions not covered by the previously created lists. A more efficient approach is to automate the creation of templates, as carried out by the QA module (Figure 6). As we can see in the figure's right branch, the system's online phase transforms the natural language question into an intermediate form to pair with the appropriate template. A query is created in a formal language after filling in the slots with specific entities and relations. After this process, the final answer derives from a SPARQL query generated internally by the system.
In the offline phase, we train a deep learning model to create templates automatically. This way, we acquire more contextual information about the KB. A typical example of this procedure is the use of Wikipedia texts to expand DBpedia's knowledge. This stage is challenging since success depends on the careful choice of the set of texts we use. For instance, for a KB created by automatically extracting triples from some text corpus, this corpus can be reused to create the templates.

IV. QUESTIONING SEMANTIC BIOMEDICAL DATA
To test the tool, we started by loading a spreadsheet with data from patients with Huntington disease (HD) and transforming it into the semantic format. For the sake of security and privacy, this cohort's data has been anonymized. For this example, we decided to select only a small set of headers: subject, gender, and the columns related to the Problem Behaviours Assessment (PBA-s) items [27]. We used concepts from the Dublin Core Metadata Initiative, the FOAF Vocabulary Specification, and the Human Phenotype Ontology. Table I shows the mapping we performed. With the data transformed and adequately loaded, we can ask questions using a graphical interface (see Figure 7). For simplicity, SPARQL queries and NL questions share the same input form, since the system recognizes the input type and processes it transparently.
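One possible way to distinguish the two kinds of input behind a shared form is a keyword check on the start of the text. The heuristic below is an assumption made purely for illustration; the paper does not describe how the tool recognizes the input type.

```python
# Toy dispatcher deciding whether an input is SPARQL or natural language.
# The keyword heuristic is an assumption made for illustration; the text
# does not describe how the tool recognizes the input type.
import re

# A SPARQL query starts with an optional prologue (BASE/PREFIX) followed by
# one of the four query forms.
SPARQL_START = re.compile(r"^\s*(BASE|PREFIX|SELECT|ASK|CONSTRUCT|DESCRIBE)\b", re.IGNORECASE)


def input_type(text):
    """Return 'sparql' when the text begins like a SPARQL query, else 'nl'."""
    return "sparql" if SPARQL_START.match(text) else "nl"


kinds = [
    input_type("SELECT ?s WHERE { ?s ?p ?o } LIMIT 5"),
    input_type("How many patients are female?"),
]
```

A check this simple misclassifies NL requests that happen to begin with a SPARQL keyword (for example, "Describe the cohort"), so a more robust alternative would be to attempt a full SPARQL parse and fall back to the NL path on failure.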

V. CONCLUSION
The conversion of biomedical data into a semantic format allows the sharing of relevant information between research groups. However, in addition to this essential data processing step, the ability of systems to ease information retrieval is also critical. Interfaces that accept input in natural language encourage the adoption of semantic solutions. In this paper, we have proposed a tool for creating semantic data that allows users to pose questions in natural language. We believe that this tool can become part of the researchers' toolbox for sharing data.