Projects 2024

Bioinformatics

Biofilms are microbial communities that form when microorganisms attach to surfaces. Scientific imaging is an accepted research technique for investigating, analyzing, and understanding biofilm systems. Because biofilms and their surrounding habitats are complex and diverse, many types of data formats exist for assessing biofilm structure and composition. One of the most common ways to resolve structural aspects of biofilms, as well as structure-function relationships, is laser-based two- and three-dimensional imaging. Our goal is to develop tools that rapidly identify biofilm regions of interest in images from these microscopes, along with machine learning techniques that extract information, objects, and key features that are difficult for humans to recognize in biofilm images. We will also focus on developing applications for managing large volumes of biofilm-specific images.
Connecting genomic regions to phenotypes is critical in many biological fields, from medicine to conservation to agriculture and beyond. However, it requires large numbers of genomes and associated phenotype data to capture diversity and provide enough samples for training and testing, which makes pangenomics difficult to scale in eukaryotic organisms with their large, complex genomes. The problem is complicated by heterogeneity in assembly quality, which affects pangenomic graphs. The best strategies for working with heterogeneous datasets and for quantifying the resulting uncertainty in phenotype prediction have not been well studied in pangenomics. Genomes have the added issue of being related by descent, and this evolutionary relatedness can lead to problems such as false-positive connections between genomic regions and phenotypes.
Most pangenomic graphs are created within a single species. Expanding across evolutionary distance to capture the variation contained in a larger clade is difficult for two reasons: nucleotide divergence greatly expands the number of paths in the graph, and many regions that are functionally equivalent between genomes lack enough sequence conservation to be recognized. However, being able to recognize and access genetic diversity from more distantly related organisms is important because valuable traits that do not exist in the species of interest, such as disease resistance, can be identified and brought in from these relatives or, once recognized, edited directly into the genome of the species of interest.
Neurons in the brainstem medullary reticular formation govern vital motor behaviors, such as breathing, vocalization, swallowing, and chewing. Understanding these neural circuits requires identifying and localizing the reticular subpopulations that control these functions. A major obstacle to identifying these nuclei is the lack of clear cytoarchitectonic boundaries and molecular markers delineating functional nuclei within the reticular formation. This project will investigate three-dimensional proximity and similarities in neuronal gene expression profiles to determine a mapping for motor behaviors in the brainstem.

Medical Informatics

Accurate diagnosis of lung lesions in computed tomography (CT) depends on many factors, including the radiologists’ ability to detect and correctly interpret these lesions. Computer-aided diagnosis (CAD) systems can be used to measurably increase the accuracy of radiologists in this task. Various CAD systems have been developed over the years for the detection and classification of pulmonary nodules. Most of these systems mimic domain knowledge in order to extract image content and use a comparison with ground truth for evaluation. However, these systems work in an algorithmic fashion that is only tenuously related to human perception and characterization of image features. In the image retrieval community, this is known as the semantic gap problem – the lack of coincidence between the quantitative information that may be extracted computationally from the image data and the visual interpretation of this data by human observers. Students working on these projects will 1) establish the link between computer-based image features of lung nodules in CT scans and visual descriptors defined by human experts in the Lung Image Database Consortium (LIDC) terminology and 2) integrate these links into content-based lung nodule image retrieval (CBIR) systems.
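As a rough illustration of the retrieval step in a content-based lung nodule retrieval system, the Python sketch below ranks stored nodules by the Euclidean distance between their feature vectors and a query vector. The feature values, the nodule ids, and the `retrieve_similar` helper are hypothetical; they are not taken from the LIDC data or from the project's code, and a real system would also need the link to the expert-defined visual descriptors.

```python
import math

# Hypothetical, made-up feature vectors (for example: normalized size,
# contrast, roundness). These are NOT real LIDC measurements.
NODULE_FEATURES = {
    "nodule_a": [0.20, 0.80, 0.10],
    "nodule_b": [0.25, 0.75, 0.15],
    "nodule_c": [0.90, 0.10, 0.70],
}


def euclidean(u, v):
    """Euclidean distance between two equal-length feature vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))


def retrieve_similar(query, database, k=2):
    """Return the ids of the k database entries closest to the query vector."""
    ranked = sorted(database, key=lambda name: euclidean(query, database[name]))
    return ranked[:k]


if __name__ == "__main__":
    # Query with a vector close to nodule_a and nodule_b.
    print(retrieve_similar([0.22, 0.78, 0.12], NODULE_FEATURES, k=2))
```

Euclidean distance is used here only because it is simple; which distance agrees best with radiologists' perception of similarity is exactly the kind of question the semantic gap raises.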
Chronic fatigue syndrome is a prevalent disease whose cause is poorly understood. Some research has shown a link to the Epstein-Barr virus, which is implicated in mononucleosis. In this project, we will perform complex statistical analysis of a data set gathered by the DePaul Psychology department: blood proteins measured in students at three stages (healthy volunteers, students who developed mono, and a six-month follow-up). We will analyze individual proteins to predict whether individuals are likely to develop chronic fatigue syndrome after getting mono, and use correlational analysis to determine whether patterns of protein co-activation distinguish healthy controls from individuals with chronic fatigue syndrome. A truly multidisciplinary project, it will also include regular meetings with the research group in the Psychology department.
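One simple form the correlational analysis could take is comparing the Pearson correlation between a pair of proteins in each group. The sketch below assumes exactly that; the protein names and measurements are invented for illustration and do not come from the DePaul data set.

```python
import math


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length samples."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / math.sqrt(var_x * var_y)


# Made-up measurements of two hypothetical proteins in two groups.
controls = {"protein_1": [1.0, 2.0, 3.0, 4.0], "protein_2": [2.1, 3.9, 6.2, 7.8]}
cases = {"protein_1": [1.0, 2.0, 3.0, 4.0], "protein_2": [7.9, 6.1, 4.2, 1.8]}

# Correlation of the same protein pair, computed separately per group.
r_controls = pearson(controls["protein_1"], controls["protein_2"])
r_cases = pearson(cases["protein_1"], cases["protein_2"])

if __name__ == "__main__":
    print(round(r_controls, 3), round(r_cases, 3))
```

In this toy data the pair rises together in one group and moves in opposite directions in the other. With real data, many protein pairs would be tested, so some correction for multiple comparisons would be needed.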
Traumatic brain injury (TBI) is well known to be related to intimate partner violence (IPV). However, the exact relationship is hard to quantify given the stigma and shame associated with IPV and the difficulty of assessing TBI. This project aims to better understand this relationship by investigating the intake reports of patients entering the emergency department because of IPV, and any subsequent reports, to assess TBI. We are also interested in understanding how the COVID-19 lockdown further affected the complex relationship between IPV and TBI.

Healthcare Informatics

Extracting scientific facts from unstructured text is difficult because of the ambiguity of language and the complexity of the scientific named entities and relations to be extracted. This problem is well illustrated by the extraction of polymer names and their properties. Even when the property is a temperature, identifying the polymer it belongs to may require expertise because of acronyms, synonyms, complicated naming conventions, and the fact that new polymer names are “introduced” to the vernacular as polymer science advances. While domain-specific machine learning toolkits exist that address these challenges, perhaps the greatest challenge is the lack of labeled data for training these models, since producing labels is time-consuming, error-prone, and costly. We have previously worked on Ensemble Labeling for Scientific Information Extraction (ELSIE), which identifies sentences that contain the information to be extracted as a first step towards extracting the target information. We have extended ELSIE to identify important paragraphs, because the information was sometimes scattered across sentences. Through ELSIE-Blob we are now able to extract more important sentences from publications. The next step in this project is to extract scientific facts from the relevant sentences.
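To make the general idea of ensemble labeling concrete, the sketch below combines a few simple heuristics by majority vote to flag sentences that may report a polymer temperature property. The heuristics, function names, and vote threshold are hypothetical illustrations; they are not ELSIE's actual labeling functions or its combination rule.

```python
import re


def mentions_temperature(sentence):
    """Vote yes if a number is followed by a temperature unit (°C or K)."""
    return bool(re.search(r"\d+(?:\.\d+)?\s*(?:°C|K)\b", sentence))


def mentions_property_keyword(sentence):
    """Vote yes if the sentence names a thermal property."""
    pattern = r"glass transition|melting|\bTg\b|\bTm\b"
    return bool(re.search(pattern, sentence, re.IGNORECASE))


def mentions_polymer_like_name(sentence):
    """Vote yes on a 'poly...' word or a short uppercase acronym (crude proxy)."""
    return bool(re.search(r"\b[Pp]oly\w*|\b[A-Z]{2,5}\b", sentence))


# Hypothetical ensemble of weak labelers.
LABELERS = [mentions_temperature, mentions_property_keyword, mentions_polymer_like_name]


def is_candidate(sentence, min_votes=2):
    """Flag a sentence when at least min_votes labelers vote yes."""
    votes = sum(labeler(sentence) for labeler in LABELERS)
    return votes >= min_votes


if __name__ == "__main__":
    examples = [
        "The glass transition temperature of polystyrene (PS) is about 373 K.",
        "Samples were stored overnight before testing.",
        "We thank the NSF for funding.",
    ]
    for text in examples:
        print(is_candidate(text), "|", text)
```

Requiring agreement between several weak heuristics is what keeps a single noisy signal, such as an unrelated acronym, from flagging a sentence on its own.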
Established in 2000, the Sinai Urban Health Institute (SUHI) is the research arm of Sinai Chicago. SUHI is a nationally recognized community research center that works in partnership with community members and organizations to identify and address health inequities in some of the most underserved communities in the city. SUHI was an early adopter of the Community Health Worker (CHW) model. CHWs act as liaisons between patients and the hospital and help address inequities by attending to patients' needs and connecting them with resources. Patients are referred to CHWs and asked to fill out a Social Determinants of Health (SDoH) survey. Dr. Tchoua is working with SUHI data to provide data-driven evidence of the positive impact of CHWs on the Emergency Department (ED) 30-day readmission rate, an important outcome of patient health and an important metric of hospital quality of care. This study also aims to highlight important aspects of the program and to make recommendations that can improve it. The ultimate goal of the project is to promote the use of SDoH data in more Sinai clinics and in other hospitals.
Leveraging collaboration with Rush University Hospital, this project will study syndemics, that is, subgroups of diseases that occur together in patients and together contribute to excess burden of disease in a certain population. Syndemics, or synergistic epidemics, are the aggregation of two or more concurrent or sequential epidemics or disease clusters in a population with biological interactions that exacerbate the prognosis and burden of disease. For example, the SAVA syndemic comprises substance abuse, violence, and AIDS, three conditions that disproportionately afflict those living in poverty in US cities. The problem of identifying syndemics can be formulated as a complex application of clustering. It can also be placed under the larger umbrella of personalized medicine, in the sense that identifying subgroups of patients who are similar (who share a set of characteristics and/or diagnoses) enables physicians to fine-tune treatment for individual patients. Instead of prescribing treatments for the “average patient”, physicians can use the context from clusters closely related to a patient to prescribe personalized treatment.
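As a minimal sketch of the clustering formulation, the Python example below groups conditions that frequently co-occur in the same patients. It assumes each patient record is a set of diagnosis labels, measures co-occurrence with Jaccard similarity, and links conditions whose similarity meets a threshold. The records, labels, threshold, and `cluster_conditions` helper are all hypothetical, and co-occurrence alone does not establish the biological interactions that define a syndemic.

```python
from itertools import combinations

# Made-up patient records: each patient is a set of diagnosis labels.
PATIENTS = [
    {"substance_abuse", "violence", "hiv"},
    {"substance_abuse", "violence", "hiv"},
    {"substance_abuse", "hiv"},
    {"diabetes", "hypertension"},
    {"diabetes", "hypertension"},
    {"hypertension"},
]


def jaccard(a, b):
    """Jaccard similarity of two sets (size of intersection over size of union)."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def cluster_conditions(patients, threshold=0.5):
    """Group conditions connected by co-occurrence similarity >= threshold.

    This is single-linkage clustering cut at the threshold, implemented as
    connected components with a small union-find structure.
    """
    # Map each condition to the set of patient indices that carry it.
    carriers = {}
    for idx, record in enumerate(patients):
        for condition in record:
            carriers.setdefault(condition, set()).add(idx)

    parent = {c: c for c in carriers}

    def find(c):
        while parent[c] != c:
            parent[c] = parent[parent[c]]  # path compression
            c = parent[c]
        return c

    for c1, c2 in combinations(sorted(carriers), 2):
        if jaccard(carriers[c1], carriers[c2]) >= threshold:
            parent[find(c1)] = find(c2)

    groups = {}
    for c in carriers:
        groups.setdefault(find(c), set()).add(c)
    return sorted(sorted(g) for g in groups.values())


if __name__ == "__main__":
    print(cluster_conditions(PATIENTS))
```

The same idea can be turned around to cluster patients rather than conditions, which is closer to the personalized-medicine use described above.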