Data Integration in the Life Sciences (DILS2018)
12
2018
342
5 hours 8 minutes
12 results
18:25
30
Henry, Vincent
Alzheimer’s disease (AD) pathophysiology is still imperfectly understood, and current paradigms have not led to a curative outcome. Omics technologies offer great promise for improving our understanding and generating new hypotheses. However, integration and interpretation of such data pose major challenges, calling for adequate knowledge models. AlzPathway is a disease map that gives a detailed and broad account of AD pathophysiology. However, AlzPathway lacks formalism, which can lead to ambiguity and misinterpretation. Ontologies are an adequate framework to overcome this limitation, through their axiomatic definitions and logical reasoning properties. We introduce the AD Map Ontology (ADMO), an ontological upper model based on systems biology terms. We then propose to convert AlzPathway into an ontology and to integrate it into ADMO. We demonstrate that this allows one to deal with issues related to redundancy, naming, consistency, process classification and pathway relationships. Further, it opens opportunities to expand the model using elements from other resources, such as generic pathways from Reactome or clinical features contained in the ADO (AD Ontology).
2018
Technische Informationsbibliothek (TIB)
23:24
17
Friedrich, Andreas
Recent publications have shown that the majority of studies cannot be adequately reproduced. The underlying causes seem to be diverse. Usage of the wrong statistical tools can lead to the reporting of dubious correlations as significant results. Missing information from lab protocols or other metadata can make verification impossible. Especially with the advent of Big Data in the life sciences and the associated measurement of thousands of multi-omics samples, researchers depend more than ever on adequate metadata annotation. In recent years, the scientific community has created multiple experimental design standards, which try to define the minimum information necessary to make experiments reproducible. Tools help with the creation or analysis of this abundance of metadata, but are often still based on spreadsheet formats and lack intuitive visualizations. We present an interactive graph visualization tailored to experiments using a factorial experimental design. Our solution summarizes sample sources and extracted samples based on similarity of independent variables, enabling a quick grasp of the scientific question at the core of the experiment even for large studies. We support the ISA-Tab standard, enabling visualization of diverse omics experiments. As part of our platform for data-driven biomedical research, our implementation offers additional features to detect the status of data generation and more.
2018
Technische Informationsbibliothek (TIB)
17:37
24
Verma, Ghanshyam et al.
Gene expression profiles help to capture the functional state in the body and to determine dysfunctional conditions in individuals. In principle, respiratory and other viral infections can be judged from blood samples; however, it has not yet been determined which gene expression levels are predictive, in particular for the early transition states of the disease onset. For these reasons, we analyse the expression levels of infected and non-infected individuals to determine genes (potential biomarkers) which are active during the progression of the disease. We use machine learning (ML) classification algorithms to determine the state of respiratory viral infections in humans, exploiting time-dependent gene expression measurements; the study comprises four respiratory viruses (H1N1, H3N2, RSV, and HRV), seven distinct clinical studies and 104 healthy test candidates overall. From the overall set of 12,023 genes, we identified the 10 top-ranked genes which proved to be most discriminatory with regard to prediction of the infection state. Our two models focus on the time stamp nearest to t = 48 hours and nearest to t = "Onset Time", denoting the symptom onset (at different time points) according to the candidate's specific immune system response to the viral infection. We evaluated algorithms including k-Nearest Neighbour (k-NN), Random Forest, linear Support Vector Machine (SVM), and SVM with radial basis function (RBF) kernel, in order to classify whether the gene expression sample collected at early time point t is infected or not infected. The "Onset Time" appears to play a vital role in prediction and in the identification of the ten most discriminatory genes.
2018
Technische Informationsbibliothek (TIB)
36:02
25
Gerbel, Svetlana
The reuse of routine healthcare data for research purposes is challenging not only because of the volume of the data but also because of the variety of clinical information systems. A data-warehouse-based approach enables researchers to use heterogeneous data sets by consolidating and aggregating data from various sources. This paper presents the Enterprise Clinical Research Data Warehouse (ECRDW) of the Hannover Medical School (MHH). ECRDW has been developed since 2011 using the Microsoft SQL Server Data Warehouse and Business Intelligence technology and has been in operation since 2013 as an interdisciplinary platform for research-relevant questions at the MHH. ECRDW incrementally integrates heterogeneous data sources and currently contains (as of 8/2018) data of more than 2.1 million distinct patients with more than 500 million single data points (diagnoses, lab results, vital signs, medical records, as well as metadata to linked data, e.g. biospecimen or images).
2018
Technische Informationsbibliothek (TIB)
22:24
6
Reis, Júlio César dos
The extraction of codes from Electronic Health Records (EHR) data is an important task because extracted codes can be used for different purposes such as billing and reimbursement, quality control, epidemiological studies, and cohort identification for clinical trials. The codes are based on standardized vocabularies. Diagnoses, for example, are frequently coded using the International Classification of Diseases (ICD), which is a taxonomy of diagnosis codes organized in a hierarchical structure. Extracting codes from free-text medical notes in EHR, such as the discharge summary, requires reviewing patient data in search of information that can be coded in a standardized manner. Manual code assignment by humans is a complex and time-consuming process. The use of machine learning and natural language processing approaches has been receiving increasing attention as a way to automate the process of ICD coding. In this article, we investigate the use of Support Vector Machines (SVM) and the binary relevance method for multi-label classification in the task of automatic ICD coding from free-text discharge summaries. In particular, we explored the role of SVM parameter optimization and class weighting for addressing class imbalance. Experiments conducted with the Medical Information Mart for Intensive Care III (MIMIC III) database reached a macro-averaged F1 score of 49.86% for the 100 most frequent diagnoses. Our findings indicated that optimization of SVM parameters and the use of class weighting can improve the effectiveness of the classifier.
2018
Technische Informationsbibliothek (TIB)
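The binary relevance method and class weighting mentioned in this abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the function names and the toy codes are hypothetical, the weighting uses the common "balanced" heuristic (n_samples / (n_classes * class_count)) as an assumption, and the SVM training step itself is left out.

```python
from collections import Counter


def binary_relevance_targets(label_sets):
    """Turn multi-label annotations into one 0/1 target vector per label."""
    labels = sorted(set().union(*label_sets))
    return {lab: [1 if lab in s else 0 for s in label_sets] for lab in labels}


def balanced_class_weights(targets):
    """Weight each class by n_samples / (n_classes * class_count)."""
    counts = Counter(targets)
    n = len(targets)
    return {cls: n / (len(counts) * c) for cls, c in counts.items()}


# Hypothetical toy data: code sets assigned to four discharge summaries.
notes_codes = [{"I10", "E11"}, {"I10"}, {"I10"}, {"J18"}]

# One independent binary problem per code (binary relevance).
targets = binary_relevance_targets(notes_codes)

# Per-code class weights; rare positive classes receive larger weights.
weights = {lab: balanced_class_weights(y) for lab, y in targets.items()}
```

Each entry of `targets` would be handed to its own binary classifier, with the matching entry of `weights` counteracting the scarcity of positive examples for infrequent codes.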
19:21
35
Christen, Victor et al.
Many tools exist to annotate mentions of medical entities in documents with concepts from biomedical ontologies. To improve the overall quality of the annotation process, we propose the use of machine learning to combine the results of different annotation tools. We comparatively evaluate the results of the machine-learning-based approach against the results of the single tools and a simpler set-based result combination.
2018
Technische Informationsbibliothek (TIB) et al.
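The abstract does not say which set operations the "set-based result combination" uses. The sketch below shows one plausible reading (union, intersection and majority vote over the tools' annotation sets); the function names, tool outputs and concept identifiers are hypothetical placeholders.

```python
from collections import Counter


def combine_union(results):
    """Keep every annotation proposed by at least one tool."""
    return set().union(*results)


def combine_intersection(results):
    """Keep only annotations on which all tools agree."""
    return set.intersection(*results) if results else set()


def combine_majority(results):
    """Keep annotations proposed by more than half of the tools."""
    votes = Counter(a for r in results for a in r)
    return {a for a, v in votes.items() if v > len(results) / 2}


# Hypothetical tool outputs: (mention, concept id) pairs with made-up ids.
tool_a = {("fever", "CONCEPT_FEVER"), ("cough", "CONCEPT_COUGH")}
tool_b = {("fever", "CONCEPT_FEVER")}
tool_c = {("fever", "CONCEPT_FEVER"), ("cough", "CONCEPT_COUGH"),
          ("cold", "CONCEPT_COLD")}
results = [tool_a, tool_b, tool_c]

union = combine_union(results)
intersection = combine_intersection(results)
majority = combine_majority(results)
```

Union tends to favour recall, intersection favours precision, and majority voting sits in between; a learned combination as proposed in the talk would replace these fixed rules with a trained decision.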
52:04
16
Pühler, Alfred
2018
Technische Informationsbibliothek (TIB)
18:37
71
Najafabadipour, Marjan et al.
The recent rapid increase in the generation of clinical data and the rapid development of computational science enable us to extract new insights from massive datasets in the healthcare industry. Oncological Electronic Health Records (EHRs) are creating rich databases for documenting patients' histories, and they potentially contain many patterns that can help in better management of the disease. However, these patterns are locked within the free-text (unstructured) portions of EHRs, which limits the ability of health professionals to extract useful information from them and, ultimately, to perform the Query and Answering (Q&A) process accurately. The Information Extraction (IE) process requires Natural Language Processing (NLP) techniques to assign semantics to these patterns. Therefore, in this paper, we analyze the design of annotators for specific lung cancer concepts that can be integrated into the Apache Unstructured Information Management Architecture (UIMA) framework. In addition, we explain the details of the generation and storage of annotation outcomes.
2018
Technische Informationsbibliothek (TIB)
22:00
48
Jha, Alokkumar et al.
Visualization of Gene Expression (GE) is a challenging task since the number of genes and their associations are difficult to predict across various sets of biological studies. GE can be used to understand tissue-gene-protein relationships. Currently, the heatmap is the standard visualization technique to depict GE data. However, heatmaps only cover clusters of highly dense regions. They do not provide interaction, functional annotation, or a pooled understanding from higher to lower expression. In the present paper, we propose a graph-based technique, based on color encoding from higher to lower expression, along with functional annotation. This visualization technique is highly interactive (heatmaps are mainly static maps). The visualization system explains the association between overlapping genes with and without tissue types. Traditional visualization techniques (viz. heatmaps) generally explain each association in a distinct map. For example, overlapping genes, their interactions based on co-expression, and the expression cut-off require three distinct heatmaps. We demonstrate the usability using an ortholog study of GE and visualize GE using GExpressionMap. We further compare and benchmark our approach against existing visualization techniques. Our approach also reduces the need to further cluster the expressed gene networks in order to understand over- and under-expression. Further, it provides interactions based on the co-expression network, which itself creates co-expression clusters. GExpressionMap provides a unique graph-based visualization for GE data with their functional annotation and the associated interactions among the DEGs (Differentially Expressed Genes).
2018
Technische Informationsbibliothek (TIB)
31:17
16
Fiebeck, Johanna et al.
Data in healthcare and routine medical treatment is growing fast. Because of this growth and the variety of the data, possible correlations within it are becoming ever more complex. Popular tools that facilitate the daily routine of clinical researchers are increasingly based on machine learning (ML) algorithms. Those tools might facilitate data management, data integration or even content classification. Besides commercial products, there are many solutions developed by users themselves for their own specific research questions or tasks. One of these tasks is described in this work: classifying the Weber fracture, an ankle joint fracture, from radiological findings with the help of supervised machine learning algorithms. To do so, the findings were first processed with common natural language processing (NLP) methods. For the classification part, we used the bag-of-words approach to bring together the medical findings on the one hand and the metadata of the findings on the other hand, and compared several common classifiers to obtain the best results. In order to conduct this study, we used the data and the technology of the Enterprise Clinical Research Data Warehouse (ECRDW) of Hannover Medical School. This paper shows the implementation of machine learning and NLP techniques in the data warehouse integration process in order to provide consolidated, processed and qualified data to be queried for teaching and research purposes.
2018
Technische Informationsbibliothek (TIB)
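The bag-of-words approach named in this abstract can be sketched as follows. This is only an illustration under stated assumptions: the tokenizer, the function names and the two example findings are hypothetical and not taken from the study, and the classifier comparison is not shown.

```python
import re
from collections import Counter


def bag_of_words(text):
    """Lower-case the text and count alphabetic tokens."""
    tokens = re.findall(r"[a-zäöüß]+", text.lower())
    return Counter(tokens)


def vectorize(texts):
    """Build a shared vocabulary and one count vector per text."""
    bags = [bag_of_words(t) for t in texts]
    vocab = sorted(set().union(*bags))
    return vocab, [[bag[w] for w in vocab] for bag in bags]


# Hypothetical radiological findings, invented for illustration only.
findings = ["Weber B fracture of the ankle", "No fracture, ankle intact"]
vocab, matrix = vectorize(findings)
```

Each row of `matrix` is a fixed-length count vector that a supervised classifier could consume; metadata features could be appended as extra columns.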
22:37
40
Wiese, Lena et al.
Genome analysis is a major precondition for future advances in the life sciences. The complex organization of genome data and the interactions between genomic components can often be modeled and visualized in graph structures. In this paper we propose the integration of several data sets into a graph database. We study the aptness of the database system in terms of analysis and visualization of a genome regulatory network (GRN) by running a benchmark on it. Major advantages of using a database system are the modifiability of the data set, the immediate visualization of query results as well as built-in indexing and caching features.
2018
Technische Informationsbibliothek (TIB)
24:10
14
Stocker, Markus et al.
Scientific information communicated in scholarly literature remains largely inaccessible to machines. The global scientific knowledge base is little more than a collection of (digital) documents. The main reason is that the document is the principal form of communication and, since underlying data, software and other materials mostly remain unpublished, the scholarly article is essentially the only form used to communicate scientific information. Based on a use case in the life sciences, we argue that virtual research environments and semantic technologies are transforming the capability of research infrastructures to systematically acquire and curate machine-readable scientific information communicated in scholarly literature.
2018
Technische Informationsbibliothek (TIB)