The advanced forms of age-related macular degeneration (AMD) are a major health burden and can lead to irreversible vision loss in the elderly population. Effective tools for predicting the prognosis of advanced AMD, which would enable early preventative interventions, are lacking because retinal image scans look similar across patients in the early stage while prognosis paths vary widely. Existing prognosis models have several limitations. First, previous studies assume constant time intervals between doctor visits; in real-world clinical settings, however, visits happen at irregular intervals. The constant-interval assumption leads to over-optimistic prediction results on specific training data sets while failing to generalize to new patient data sets. Second, current studies predict only one form of advanced AMD at a time. Third, computer-based prognosis results are typically not validated on new patients, making it difficult to evaluate the generalizability of the proposed approaches. Lastly, the models lack interpretability, and it is hard to explain how a computer-based prognosis determination has been made. Students working on this project will design, develop, and evaluate machine learning models that detect the images most relevant to AMD biomarkers, handle unevenly spaced sequences of optical coherence tomography (OCT) images, and predict all advanced AMD forms, supporting the interpretability and explainability of computer-aided prognosis models.
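One common way to handle unevenly spaced visit sequences is to decay the model's memory according to the time elapsed between scans, as in time-aware recurrent models. A minimal Python sketch of that idea (the decay rate, blending weight, and visit data below are illustrative placeholders, not values from this project):

```python
import math

def decay_hidden(h, delta_t, gamma=0.1):
    # The longer the gap since the last visit, the less the old
    # state should matter: shrink it by exp(-gamma * delta_t).
    w = math.exp(-gamma * delta_t)
    return [w * v for v in h]

def step(h, x, delta_t, alpha=0.5):
    # One update of a toy time-aware recurrent cell: decay the
    # previous state, then blend it with the new OCT features.
    h = decay_hidden(h, delta_t)
    return [alpha * hv + (1 - alpha) * xv for hv, xv in zip(h, x)]

# Irregularly spaced visits: (feature vector, months since previous visit)
visits = [([1.0, 0.0], 0.0), ([0.5, 1.0], 3.0), ([0.2, 0.8], 11.0)]
h = [0.0, 0.0]
for x, dt in visits:
    h = step(h, x, dt)
```

In a full model the decay and blending weights would be learned jointly with the rest of the network; the sketch only shows where the visit-gap information enters the update.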
Accurate diagnosis of lung lesions in computed tomography (CT) depends on many factors, including the radiologists’ ability to detect and correctly interpret these lesions. Computer-aided diagnosis (CAD) systems can be used to measurably increase the accuracy of radiologists in this task. Various CAD systems have been developed over the years for the detection and classification of pulmonary nodules. Most of these systems mimic domain knowledge in order to extract image content and use a comparison with ground truth for evaluation. However, these systems work in an algorithmic fashion that is only tenuously related to human perception and characterization of image features. In the image retrieval community, this is known as the semantic gap problem – the lack of coincidence between the quantitative information that may be extracted computationally from the image data and the visual interpretation of this data by human observers. Students working on these projects will 1) establish the link between computer-based image features of lung nodules in CT scans and visual descriptors defined by human experts in the Lung Image Database Consortium (LIDC) terminology and 2) integrate these links into content-based lung nodule image retrieval (CBIR) systems.
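As an illustration of the retrieval side, a CBIR system ranks database nodules by distance in computed-feature space and returns their expert-assigned semantic descriptors; whether near neighbours actually share those descriptors is the semantic gap question. A minimal sketch in which the features and LIDC-style ratings are made-up toy values:

```python
def euclidean(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b)) ** 0.5

def retrieve(query, database, k=2):
    # Rank stored nodules by distance between computed image features;
    # the semantic gap question is whether the nearest neighbours also
    # share the experts' visual descriptors (e.g., spiculation ratings).
    return sorted(database, key=lambda n: euclidean(query, n["features"]))[:k]

# Hypothetical database: computed features plus an LIDC-style rating.
nodules = [
    {"id": "A", "features": [0.1, 0.9], "spiculation": 1},
    {"id": "B", "features": [0.8, 0.2], "spiculation": 5},
    {"id": "C", "features": [0.2, 0.8], "spiculation": 2},
]
neighbours = retrieve([0.12, 0.88], nodules)
```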
Chronic fatigue syndrome is a prevalent disease, and little is known about its cause. Some research has shown a link to the Epstein-Barr virus, which is implicated in mononucleosis. In this project, we will perform complex statistical analysis of a data set gathered by the DePaul Psychology department: blood proteins measured in students at three stages, as healthy volunteers, after developing mono, and at a six-month follow-up. We will analyze individual proteins to predict whether individuals are likely to develop chronic fatigue syndrome after getting mono, and also use correlational analysis to determine whether there are patterns of protein co-activation that distinguish healthy controls from individuals with chronic fatigue syndrome. A truly multi-disciplinary project, it will also include regular meetings with the research group in the Psychology department.
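The co-activation analysis amounts to comparing protein-protein correlations between groups. A minimal sketch with made-up protein measurements (a real analysis would also need significance testing and multiple-comparison correction):

```python
def pearson(xs, ys):
    # Pearson correlation between two proteins' measurements
    # across the same set of participants.
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = sum((x - mx) ** 2 for x in xs) ** 0.5
    sy = sum((y - my) ** 2 for y in ys) ** 0.5
    return cov / (sx * sy)

# Toy data: the same two proteins measured in two groups. A difference
# in correlation would suggest group-specific co-activation patterns.
controls = {"p1": [1.0, 2.0, 3.0, 4.0], "p2": [1.1, 2.1, 2.9, 4.2]}
patients = {"p1": [1.0, 2.0, 3.0, 4.0], "p2": [4.0, 1.0, 3.5, 0.5]}
r_controls = pearson(controls["p1"], controls["p2"])
r_patients = pearson(patients["p1"], patients["p2"])
```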
Biofilms refer to microbial life on surfaces: microorganisms attach to surfaces and develop biofilms. Scientific imaging has become an established research technique for investigating, analyzing, and understanding biofilm systems. Because of the complexity and diversity of biofilms and their surrounding habitats, different data formats exist to assess biofilm structure and composition. One of the most common ways to resolve structural aspects of biofilms, as well as structure-function relationships, is laser-based two- and three-dimensional imaging. Our goal is to develop tools that rapidly identify biofilm regions of interest in these microscope images, together with machine learning techniques that extract objects and key features that are difficult for human observers to recognize in biofilm-associated images. We will also focus on developing applications for managing large volumes of biofilm-specific images.
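A first pass at region-of-interest detection in such images is often a simple intensity threshold over the pixel grid. A minimal sketch with a toy image (real pipelines would use learned or adaptive thresholds plus morphological cleanup):

```python
def mean_threshold_mask(image):
    # Mark pixels brighter than the image mean as candidate
    # biofilm regions of interest.
    pixels = [px for row in image for px in row]
    thresh = sum(pixels) / len(pixels)
    return [[1 if px > thresh else 0 for px in row] for row in image]

# Toy 3x4 intensity image with a bright "biofilm" patch on the right.
image = [
    [10, 12, 200, 210],
    [11, 13, 205, 220],
    [ 9, 10, 198, 215],
]
mask = mean_threshold_mask(image)
```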
Connecting genomic regions to phenotypes is critical in many biological fields, from medicine to conservation to agriculture and beyond. But it requires large numbers of genomes and associated phenotype data in order to capture diversity and provide enough samples for training and testing, making pangenomics difficult to scale in eukaryotic organisms with their large, complex genomes. This is further complicated by heterogeneity in assembly quality, which affects pangenomic graphs. The best strategies for working with heterogeneous data sets, and for quantifying the resulting uncertainty in phenotype prediction, have not been well studied in pangenomics. Genomes have the added issue of being related by descent; this evolutionary relatedness can lead to problems such as false-positive connections between genomic regions and phenotypes.
Most pangenomic graphs are created within single species. Expanding across evolutionary distance to capture the variation contained in a larger clade is difficult because nucleotide divergence greatly expands the number of paths in the graph, and many regions that are functionally equivalent between genomes lack enough sequence conservation to be recognized. However, being able to recognize and access genetic diversity from more distantly related organisms is important: traits that do not exist in the species of interest, such as disease resistance, can be identified and brought in from these relatives or, once recognized, edited directly into the genome of the species of interest.
Data science is intrinsically inter-disciplinary; however, end-users of machine learning models are not always trained data scientists. At the same time, it is crucial that these models be infused with domain knowledge in order to increase explainability and trust in their output. Our goal in this project is to provide domain-aware confidence scores and enable domain experts to interact with and guide the clustering. Our hypothesis is that, given confidence scores, end-users will be more willing to trust and adopt machine learning models. We test this hypothesis in materials informatics, a field that has the potential to greatly reduce time-to-market and development costs for new materials by leveraging machine learning and large datasets for targeted design. For example, automated phase mapping seeks to discover groups of material samples with similar structure. This is challenging because the number of measurements per sample far exceeds the number of samples to cluster, making the results difficult to interpret and generalize. Toward our goal, we are building a dashboard that compares clustering methods. We envision that scientists will not only assess confidence scores but also interact with the results, merging and splitting clusters to guide the discovery process. We describe the signals in terms of peaks and other interpretable features; we compare and contrast multiple clustering techniques and provide several visualization options (e.g., layered graphs, samples closest to and farthest from centroids) to assist domain experts through the clustering of this complex data.
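One simple per-sample confidence score such a dashboard can surface is the silhouette value, which compares a sample's distance to its own cluster against its distance to the nearest other cluster. A minimal one-dimensional sketch with toy values (real diffraction signals are high-dimensional):

```python
def silhouette(x, own, other):
    # Silhouette of sample x: a = mean distance to the rest of its own
    # cluster, b = mean distance to the nearest other cluster.
    # Values near 1 mean a confident assignment; near 0, a borderline one.
    a = sum(abs(x - y) for y in own) / len(own)
    b = sum(abs(x - y) for y in other) / len(other)
    return (b - a) / max(a, b)

cluster1 = [1.0, 1.2, 0.9]
cluster2 = [8.0, 8.3, 7.9]
s_core = silhouette(1.0, [1.2, 0.9], cluster2)  # deep inside cluster 1
s_edge = silhouette(4.5, [1.2, 0.9], cluster2)  # between the clusters
```

Low-confidence samples like the second one are exactly the ones a domain expert might inspect, merge, or split.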
Phase mapping has traditionally been a bottleneck of the high-throughput materials discovery cycle: synthesis and characterization experiments can be performed on several materials per day, while the manual effort required to solve a given phase mapping problem limits throughput to only several phase diagrams per year. We propose a method to accelerate the process by combining machine learning and expert knowledge in the discovery of different phases (groups of samples with similar material structure) in X-ray diffraction (XRD) data. Our hypothesis is that we can automate parts of the process to alleviate the burden on the expert materials scientists who have to analyze this data. Previous work has explored active learning to guide the order of measurement and analysis of XRD data, and a Gaussian Process classifier (GPC) has been demonstrated to accelerate the discovery of phase-region boundaries by assigning uncertainty to the clustering. We now propose to further accelerate and improve the process by incorporating human input: at each iteration, the system chooses between the GPC's suggestions and the human's input in order to converge faster and more accurately toward the correct clustering solution.
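The core of the proposed loop is an acquisition step that either follows the model's uncertainty or defers to the expert. A minimal sketch, where the uncertainty values stand in for GPC predictive uncertainties:

```python
def next_measurement(uncertainties, human_choice=None):
    # One iteration of the proposed loop: if the expert intervenes,
    # follow their suggestion; otherwise query the composition where
    # the classifier is least certain (uncertainty sampling).
    if human_choice is not None:
        return human_choice
    return max(range(len(uncertainties)), key=uncertainties.__getitem__)

# Placeholder per-sample uncertainties from a GP classifier.
u = [0.05, 0.40, 0.90, 0.20]
chosen = next_measurement(u)                      # model picks index 2
overridden = next_measurement(u, human_choice=1)  # expert overrides
```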
Extracting scientific facts from unstructured text is difficult due to the ambiguity of natural language and the complexity of the scientific named entities and relations to be extracted. This problem is well illustrated by the extraction of polymer names and their properties. Even when the property is a temperature, identifying the polymer name the temperature refers to may require expertise, because of acronyms, synonyms, complicated naming conventions, and the fact that new polymer names keep being "introduced" to the vernacular as polymer science advances. While there exist domain-specific machine learning toolkits that address these challenges, perhaps the greatest challenge is the lack of labeled data, which is time-consuming, error-prone, and costly to produce, for training these machine learning models. We have previously worked on Ensemble Labeling for Scientific Information Extraction (ELSIE) to identify sentences that contain the information to be extracted, as a first step toward extracting the target information. We have extended ELSIE to identify important paragraphs, as the information is sometimes scattered across sentences; through ELSIE-Blob we are now able to extract more of the important sentences from publications. The next step in this project is to extract the scientific facts themselves from the relevant sentences.
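The ensemble-labeling idea can be illustrated with a few heuristic labeling functions that vote on whether a sentence likely states a polymer property; the rules below are simplified stand-ins for illustration, not ELSIE's actual labeling functions:

```python
import re

def lf_temperature(sentence):
    # Fires when the sentence contains a temperature-like value.
    return 1 if re.search(r"\d+\s*(°C|K)\b", sentence) else 0

def lf_property_keyword(sentence):
    # Fires on a property phrase such as "glass transition".
    return 1 if "glass transition" in sentence.lower() else 0

def lf_polymer_mention(sentence):
    # Fires on a plausible polymer name ("poly..." token).
    return 1 if re.search(r"\bpoly\w+", sentence.lower()) else 0

LFS = [lf_temperature, lf_property_keyword, lf_polymer_mention]

def is_relevant(sentence, threshold=2):
    # Majority vote over the weak labeling functions: cheap,
    # noisy labels instead of hand annotation.
    return sum(lf(sentence) for lf in LFS) >= threshold
```

Sentences the ensemble flags become (noisy) training labels, sidestepping some of the manual annotation cost.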
Deep Learning with Applications to Biomedical Informatics
One of the most powerful tools in the computer vision toolbox is the convolutional neural net: a combination of learned multi-scale convolutions and a fully connected deep neural net. Unfortunately, while able to tackle an immense variety of complex vision tasks, the fully connected deep neural net is a black box: there are few satisfying ways to characterize in intuitive or easily explained ways how the neural classifier makes its decisions. On the other hand, the deep neural nets provide a seamless mechanism to learn weights from a given loss function all the way back to the initial weights of the multi-scale convolutions. This project will explore replacing the fully connected deep neural net with more traditional classifiers, such as decision trees, which are explainable and intuitive. This project will require significant knowledge of differential calculus.
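The replacement can be prototyped without a deep learning framework: pool the convolutional feature maps into a feature vector, then fit a shallow tree on those features. A minimal sketch using a depth-1 tree (decision stump) and toy features; making such a tree differentiable so the convolutions can still be trained end-to-end is exactly where the calculus comes in:

```python
def global_average_pool(feature_maps):
    # Collapse each (flattened) feature map to one number, as a CNN
    # backbone would before its classification head.
    return [sum(fmap) / len(fmap) for fmap in feature_maps]

def fit_stump(features, labels):
    # Depth-1 decision tree: pick the (feature index, threshold) pair
    # whose split best matches the binary labels. Explainable by
    # construction: the learned rule is simply "feature j >= t".
    best_acc, best_rule = -1.0, None
    for j in range(len(features[0])):
        for t in sorted({f[j] for f in features}):
            preds = [1 if f[j] >= t else 0 for f in features]
            acc = sum(p == y for p, y in zip(preds, labels)) / len(labels)
            if acc > best_acc:
                best_acc, best_rule = acc, (j, t)
    return best_rule

# Toy pooled features for four images and their class labels.
X = [[0.1, 0.9], [0.2, 0.8], [0.7, 0.3], [0.9, 0.1]]
y = [0, 0, 1, 1]
rule = fit_stump(X, y)  # a perfect split on feature 0 at threshold 0.7
```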
General Transportation Analytics
Transportation analytics is an area rich in data visualization, geospatial analysis, time series analysis, and predictive modeling. This research area includes analyzing how people move from point A to point B. This can include analyzing congestion and speed on transportation networks, public transit, how traffic patterns change over time, how short-term or long-term events affect these changes (construction, large social gatherings, the addition of new roads or public transport, etc.), choice of transportation mode, large external factors such as COVID-19 impacting the transportation networks, and more. Other possibilities include machine learning and deep learning applications in connected and autonomous vehicles, safe driving analysis (seat belt use, distracted driving, etc.), bike transport and bike infrastructure, work zone mobility analysis, and road safety and vehicle crash analysis. Transportation analytics can be combined with other data sources, including but not limited to demographics, health, economics, and employment, to make the analysis richer and more meaningful. There are many possibilities in transportation analytics, and the topic can be chosen based on the student's interest. Some possible research areas are:
Traffic crashes have a significant economic impact, both through property damage and through lost time. The most vulnerable road users are pedestrians and cyclists. Identifying crash-prone locations will help traffic safety agencies, transportation planners, and law enforcement prioritize their efforts and resources to minimize the risk of accidents.
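A simple starting point for finding crash-prone locations is to bin crash coordinates into a spatial grid and rank cells by crash count; density-based clustering would be a natural next step. A minimal sketch with made-up coordinates:

```python
from collections import Counter

def hotspot_cells(crashes, cell=0.01):
    # Bin (lat, lon) pairs into grid cells roughly 1 km on a side
    # (0.01 degrees of latitude) and count crashes per cell; the
    # most frequent cells are candidate hotspots.
    counts = Counter((int(lat // cell), int(lon // cell)) for lat, lon in crashes)
    return counts.most_common()

# Toy crashes: three near one intersection, one elsewhere.
crashes = [
    (41.8781, -87.6298),
    (41.8783, -87.6291),
    (41.8785, -87.6295),
    (41.9000, -87.7000),
]
ranked = hotspot_cells(crashes)
```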
COVID-19 has had a dramatic impact on transportation networks. Analyzing the changes in road congestion, public transit ridership, rideshare usage, airplane trips, etc. before, during, and after the pandemic might reveal interesting patterns.
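One way to structure the before/during/after comparison is to assign each day to a period and aggregate a metric per period. A minimal sketch; the cutoff dates and ridership numbers below are illustrative placeholders, not official pandemic boundaries or real data:

```python
from datetime import date
from statistics import mean

def period(d, start=date(2020, 3, 15), reopen=date(2021, 6, 1)):
    # Placeholder cutoffs splitting the timeline into three periods.
    if d < start:
        return "before"
    return "during" if d < reopen else "after"

def mean_by_period(records):
    # records: (date, daily transit ridership) pairs.
    groups = {}
    for d, riders in records:
        groups.setdefault(period(d), []).append(riders)
    return {p: mean(v) for p, v in groups.items()}

# Toy daily ridership showing a pandemic dip and partial recovery.
records = [
    (date(2020, 1, 10), 1000), (date(2020, 2, 10), 1040),
    (date(2020, 5, 10), 250),  (date(2021, 1, 10), 310),
    (date(2022, 1, 10), 700),  (date(2022, 6, 10), 760),
]
summary = mean_by_period(records)
```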