Free-text scientific literature is a potentially valuable source of data for uncovering the often hidden relationships between genes, diseases and phenotypes. Phenotypic descriptions cover abnormalities in anatomical structures, processes and behaviours, for example 'growth delay' and 'body weight loss'. Such descriptions form the basis for determining the existence and treatment of a disease but, because of their inherent complexity, have previously received less attention from the text mining community. In recent years, a small number of expert curators have invested significant effort in creating coding systems for phenotypes (called "ontologies"), such as the Human Phenotype Ontology (HP) and the Mammalian Phenotype Ontology (MP). The PheneBank project proposes to support and speed up curation using terms discovered directly from the literature and to automatically integrate them with such standard ontologies.
The project seeks to mine texts for statistically significant associations between phenotypes, diseases and genes. Earlier approaches have suffered from a lack of deep semantic representations of the phenotypes they targeted. Our deep learning-based approach attempts to overcome this issue by reducing the uncertainty in mapping between the textual and ontological forms of phenotypes. The approach builds on groundbreaking research at the European Bioinformatics Institute by the PI (Collier) and at the Wellcome Trust Sanger Institute by the Co-investigator (Smedley), including terminology alignment of phenotypes using pairwise scoring of the conceptual elements that make up each phenotype.