Motorola Foundation Scholars Projects 2023

Deep Learning with Applications to Biomedical Informatics

One of the most powerful tools in the computer vision toolbox is the convolutional neural net: a combination of learned multi-scale convolutions and a fully connected deep neural net. Unfortunately, while it can tackle an immense variety of complex vision tasks, the fully connected deep neural net is a black box: there are few satisfying ways to characterize, intuitively or in easily explained terms, how the neural classifier makes its decisions. On the other hand, deep neural nets provide a seamless mechanism to learn all weights from a given loss function, from the classifier all the way back to the first multi-scale convolutions. This project will explore replacing the fully connected deep neural net with more traditional classifiers, such as decision trees, which are explainable and intuitive. This project will require significant knowledge of differential calculus.

Materials Informatics

Data science is intrinsically interdisciplinary; however, end users of machine learning models are not always trained data scientists. At the same time, it is crucial that these models be infused with domain knowledge in order to increase explainability and trust in their output. Our goal in this project is to provide domain-aware confidence scores and enable domain experts to interact with and guide the clustering. Our hypothesis is that, given confidence scores, end users will be more willing to trust and adopt machine learning models. We test this hypothesis with materials informatics, a field that has the potential to greatly reduce time-to-market and development costs for new materials because it leverages machine learning and large datasets for targeted design. For example, automated phase mapping seeks to discover samples of material mixtures with similar structure. This is challenging because the number of measurements per sample far exceeds the number of samples to cluster, making the results difficult to interpret and generalize. Towards our goal, we are building a dashboard comparing clustering methods. We envision that scientists will not only be able to assess confidence scores but also interact with results, merging and splitting clusters and guiding the discovery process. We describe the signals in terms of peaks and other interpretable features; we compare and contrast multiple clustering techniques and provide several visualization options (e.g., layered graphs, samples closest to and farthest from centroids) to assist domain experts through the clustering of this complex data.
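As a rough illustration of what a per-sample confidence score and a "closest to and farthest from centroids" view might look like, here is a minimal Python sketch. The margin-based score (one minus the ratio of the nearest to the second-nearest centroid distance) and the function names are assumptions made for this example, not the project's actual scoring method, and the sketch expects at least two centroids.

```python
import math


def distance(a, b):
    """Euclidean distance between two equal-length feature vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def assign_with_confidence(samples, centroids):
    """Return (cluster_index, confidence) for each sample.

    Confidence is a hypothetical margin score, 1 - d_nearest / d_second:
    close to 1 when the sample sits on its centroid, close to 0 when it
    lies on the boundary between two clusters.
    """
    results = []
    for s in samples:
        dists = sorted((distance(s, c), i) for i, c in enumerate(centroids))
        (d1, nearest), (d2, _) = dists[0], dists[1]
        confidence = 0.0 if d2 == 0 else 1.0 - d1 / d2
        results.append((nearest, confidence))
    return results


def extremes(samples, centroids, cluster):
    """Return the members of `cluster` closest to and farthest from its centroid."""
    labeled = assign_with_confidence(samples, centroids)
    members = [s for s, (k, _) in zip(samples, labeled) if k == cluster]
    if not members:
        return None
    members.sort(key=lambda s: distance(s, centroids[cluster]))
    return members[0], members[-1]


# Made-up 2-D feature vectors (e.g., two peak descriptors per sample).
samples = [(0.0, 0.0), (1.0, 0.0), (4.0, 0.0), (10.0, 0.0)]
centroids = [(0.0, 0.0), (10.0, 0.0)]
labels = assign_with_confidence(samples, centroids)
```

A low score flags samples that a domain expert may want to inspect first when deciding whether to merge or split clusters.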

Phase mapping has traditionally been a bottleneck of the high-throughput materials discovery cycle: the synthesis and characterization experiments can be performed on several materials per day, while the manual effort required to solve a given phase mapping problem limits the throughput to only several phase diagrams per year. We propose a method that would accelerate the process by combining machine learning and expert knowledge in the discovery of different phases (samples with similar structures of materials) in X-ray diffraction (XRD) data. Our hypothesis is that we can automate parts of the process to alleviate the burden on the expert materials scientists who must analyze this data. Previous work has explored active learning to guide the order of measurement and analysis of XRD data. A Gaussian Process (GP) has been demonstrated to accelerate the discovery of phase region boundaries by assigning uncertainty to the clustering. We now propose to further accelerate and improve the process by incorporating human input. At each iteration, the system chooses between the GPC suggestions and the human input in order to converge faster and more accurately towards the correct clustering solution.

Extracting scientific facts from unstructured text is difficult because of the ambiguity of the language and the complexity of the scientific named entities and relations to be extracted. This problem is well illustrated through the extraction of polymer names and their properties. Even in cases where the property is a temperature, identifying the polymer to which the temperature refers may require expertise due to the use of acronyms, synonyms, complicated naming conventions, and the fact that new polymer names are being “introduced” to the vernacular as polymer science advances. While there exist domain-specific machine learning toolkits that address these challenges, perhaps the greatest challenge is the lack of labeled data to train these machine learning models, because labeling is time-consuming, error-prone, and costly. We have previously worked on Ensemble Labeling for Scientific Information Extraction (ELSIE) to identify sentences that contain the information to be extracted, as a first step towards extracting the target information. We have extended ELSIE to identify important paragraphs, as the information was sometimes scattered across sentences. Through ELSIE-Blob we are now able to extract more important sentences from publications. The next step in this project is to extract scientific facts from relevant sentences.
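To make the idea of ensemble labeling concrete, the following Python sketch combines a few weak, rule-based labelers by vote to flag sentences that likely report a polymer property. It is only an illustration: the regular expressions, the `votes` and `is_relevant` names, and the two-vote threshold are assumptions for this example and do not describe how ELSIE itself is implemented.

```python
import re

# Hypothetical weak labelers; each one alone is noisy.
TEMPERATURE = re.compile(r"-?\d+(?:\.\d+)?\s*(?:°\s*C|K\b)")
PROPERTY = re.compile(r"glass transition|melting|\bTg\b|\bTm\b", re.IGNORECASE)
# Words starting with "poly", excluding generic terms such as "polymer(s)".
POLYMER = re.compile(r"\bpoly(?!mer)\w*", re.IGNORECASE)

LABELERS = [
    lambda s: bool(TEMPERATURE.search(s)),  # mentions a temperature value
    lambda s: bool(PROPERTY.search(s)),     # mentions a thermal property
    lambda s: bool(POLYMER.search(s)),      # mentions a polymer-like name
]


def votes(sentence):
    """Count how many labelers fire on the sentence."""
    return sum(1 for labeler in LABELERS if labeler(sentence))


def is_relevant(sentence, min_votes=2):
    """Flag the sentence when at least `min_votes` labelers agree."""
    return votes(sentence) >= min_votes


sentences = [
    "The glass transition temperature of polystyrene is 100 °C.",
    "Polymers are widely used in packaging.",
    "The sample was heated to 150 °C.",
]
relevant = [s for s in sentences if is_relevant(s)]
```

Only the first sentence is flagged here; the third mentions a temperature but no property or polymer name, so a single labeler firing is not enough.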

Bioinformatics

Biofilms refer to microbial life on surfaces. Microorganisms attach to surfaces and develop biofilms. Scientific imaging has proved to be an accepted research technique for investigating, analyzing and understanding biofilm systems. Due to the complexity and diversity of biofilms, as well as the surrounding habitat, different types of data formats exist to assess biofilm structure and composition. One of the most common ways to resolve structural aspects of biofilms, as well as structure-function relationships, is laser-based two- and three-dimensional imaging. Our goal is to develop tools that rapidly identify biofilm regions of interest in images from these microscopes, along with machine learning techniques that extract information, objects and key features that are difficult for humans to recognize in biofilm-associated images. We will also focus on developing applications for managing large volumes of biofilm-specific images.

Connecting genomic regions to phenotypes is critical in many biological fields, from medicine to conservation to agriculture and beyond. But it requires large numbers of genomes and associated phenotype data in order to capture diversity and provide enough samples for testing and training, making pangenomics difficult to scale in eukaryotic organisms with their large, complex genomes. This is further complicated by heterogeneity in assembly quality, which affects pangenomic graphs. The best strategies for working with heterogeneous datasets and quantifying any resulting uncertainty in phenotype prediction have not been well studied in pangenomics. Genomes have the added issue of being related by descent, and this evolutionary relatedness can lead to problems such as false-positive connections between genomic regions and phenotypes.

Most pangenomic graphs are created within single species. Expanding across evolutionary distance in order to capture variation contained in a larger clade is difficult because nucleotide divergence greatly expands the number of paths in the graph, and many regions that are functionally equivalent between genomes don’t have enough sequence conservation to be recognized. However, being able to recognize and access genetic diversity from more distantly related organisms matters because valuable traits that don’t exist in the species of interest, such as disease resistance, can be identified and brought in from these relatives or, once recognized, edited directly into the genome of the species of interest.

Medical Informatics

Transportation Analytics

General Transportation Analytics

Transportation analytics is an area rich in data visualization, geospatial analysis, time series analysis, and predictive modeling. This research area analyzes how people move from point A to point B. Topics include congestion and speed on transportation networks, public transit, how traffic patterns change over time, how short-term or long-term events (construction, large social gatherings, the addition of new roads or public transport) affect these patterns, choice of transportation mode, and the impact of large external factors such as COVID-19 on transportation networks. Other possibilities include machine learning and deep learning applications in connected and autonomous vehicles, safe driving analysis (safety belt use, distracted driving), bike transport and bike infrastructure, work zone mobility analysis, and road safety and vehicle crash analysis. Transportation analytics can be combined with other data sources, including but not limited to demographics, health, economics, and employment, to make the analysis richer and more meaningful. There are many possibilities in transportation analytics, and the topic can be chosen based on the student’s interest. Some possible research areas are listed below:

Traffic crashes have a significant impact on the economy, both in the form of property damage and in the form of lost time. The most vulnerable road users in traffic crashes are pedestrians and cyclists. Identifying crash-prone locations will help traffic safety, transportation planning, and law enforcement agencies prioritize their efforts and resources to minimize the risk of accidents.
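One simple way to start identifying crash-prone locations is to bin crash coordinates into a regular grid and rank the cells by crash count. The Python sketch below assumes crash records are available as (latitude, longitude) pairs; the function names, the 0.01-degree cell size, and the coordinates are made up for illustration.

```python
import math
from collections import Counter


def cell_of(lat, lon, cell_deg=0.01):
    """Map a coordinate to the integer index of its grid cell."""
    return (math.floor(lat / cell_deg), math.floor(lon / cell_deg))


def crash_hotspots(crashes, cell_deg=0.01, top_k=3):
    """Return the top_k grid cells by crash count as (cell, count) pairs."""
    counts = Counter(cell_of(lat, lon, cell_deg) for lat, lon in crashes)
    return counts.most_common(top_k)


# Made-up crash locations (latitude, longitude).
crashes = [
    (41.8812, -87.6234),
    (41.8815, -87.6239),
    (41.8819, -87.6231),
    (41.9005, -87.6502),
    (41.9008, -87.6507),
    (41.7503, -87.7004),
]
hotspots = crash_hotspots(crashes, top_k=2)
```

A fixed grid is a crude choice: cell boundaries can split a real hotspot, and raw counts ignore traffic volume. A fuller analysis would likely normalize by exposure and could filter for crashes involving pedestrians and cyclists.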