2026 Program Projects

Artificial Intelligence

This project aims to develop Graph-µP, the first principled scaling framework for Graph Neural Networks (GNNs). While GNNs are widely used in domains such as drug discovery, materials science, and recommendation systems, they lack the reliable scaling laws that have powered advances in CNNs and Transformers. In practice, scaling a GNN’s width or depth requires extensive hyperparameter retuning, making large-model development costly and preventing consistent evaluation across model sizes. This project adapts recent advances in µ-Parameterization (µP) to GNNs, combining mean-field theoretical analysis with empirical validation to establish hyperparameter transfer rules that remain stable as model size grows. By enabling hyperparameters tuned on small models to work seamlessly on larger ones, Graph-µP aims to unlock scalable and efficient graph learning.
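As a concrete illustration, the sketch below applies the standard µP recipe (initialization variance scaled by 1/fan-in, Adam learning rate shrunk with width) to a toy graph convolution layer in PyTorch; the layer, widths, and scaling constants are illustrative stand-ins, not the project's final Graph-µP rules.
```python
# Minimal sketch of muP-style hyperparameter transfer on a toy GCN layer.
# The scaling rules shown (init std ~ 1/sqrt(fan_in), Adam LR ~ 1/width)
# follow the standard muP recipe and are illustrative only.
import torch
import torch.nn as nn

class GCNLayer(nn.Module):
    def __init__(self, in_dim, out_dim):
        super().__init__()
        self.weight = nn.Parameter(torch.empty(in_dim, out_dim))
        # muP: initialize hidden weights with variance 1 / fan_in
        nn.init.normal_(self.weight, std=in_dim ** -0.5)

    def forward(self, x, adj):
        # aggregate over neighbors, then apply the linear transform
        return adj @ x @ self.weight

def make_optimizer(model, base_lr, width, base_width=64):
    # muP: shrink the Adam learning rate as width grows, so a base_lr
    # tuned at base_width transfers to larger widths without retuning
    return torch.optim.Adam(model.parameters(), lr=base_lr * base_width / width)

# The same base_lr is reused across widths instead of being retuned per size.
for width in (64, 256, 1024):
    layer = GCNLayer(width, width)
    opt = make_optimizer(layer, base_lr=1e-2, width=width)
```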
The internet is full of useful educational videos in formats such as lectures, presentations, and tutorials. Our goal is to process these videos at scale for purposes such as information retrieval, summarization, navigation, and intelligent tutoring. Training tools that can accurately extract and analyze this content at scale requires state-of-the-art deep neural network models, which in turn require massive amounts of training data; existing datasets for the educational domain, however, are limited in size and scope. The goal of this project is to create a curated, high-quality dataset that can be used to pretrain these AI models. Pretraining is a key stage in preparing any deep learning model: unlike the final fine-tuning stage, its goal is to leverage as much unannotated data as possible and have the model learn a representation from it. This often relies on a pretext task, in which the model must complete a simple but challenging task whose inputs and outputs can be generated accurately by a simple algorithm, without further human supervision. In some cases, we can also use simpler models trained on smaller datasets to create “weak labels”: noisy annotations that can accelerate the human annotation process, or that can be used to pretrain models at scale on data for which only weak labels are available, after which the model is fine-tuned on smaller, carefully human-annotated datasets.
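To make the pretext-task idea concrete, the following minimal sketch generates algorithmic labels for a temporal-order task on video frames: the model must predict whether two sampled frames appear in their original order. The encoder and tensor shapes are placeholders, assuming PyTorch; this is one example pretext task, not the project's chosen one.
```python
# Self-supervised pretext task sketch: labels are generated algorithmically,
# with no human annotation, by sampling frame pairs and optionally swapping them.
import random
import torch
import torch.nn as nn

def make_pretext_pair(video_frames):
    """video_frames: tensor of shape (T, C, H, W) from one video."""
    i, j = sorted(random.sample(range(video_frames.shape[0]), 2))
    if random.random() < 0.5:
        return video_frames[i], video_frames[j], 1  # correct temporal order
    return video_frames[j], video_frames[i], 0      # shuffled order

class OrderClassifier(nn.Module):
    def __init__(self, encoder, feat_dim):
        super().__init__()
        self.encoder = encoder                  # backbone being pretrained
        self.head = nn.Linear(2 * feat_dim, 2)  # discarded after pretraining

    def forward(self, a, b):
        za, zb = self.encoder(a), self.encoder(b)
        return self.head(torch.cat([za, zb], dim=-1))
```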

Bioinformatics

G-quadruplexes (G4s) are unique structures formed by guanine-rich DNA or RNA sequences that play significant roles in genome stability, telomere maintenance, and cancer biology. In cells, the dynamic structure of G4s is regulated by G4-binding proteins (G4BPs), and G4-G4BP interactions are involved in various biological processes, including viral replication, cancer progression, cell homeostasis, the DNA damage response, and transcriptional and translational regulation. Various protein domains with RNA-binding capabilities have been identified, and it is becoming increasingly evident that RNA and DNA influence protein functions more frequently than previously realized. Exploring the formation of G4s and their interactions with G4BPs across the transcriptome is essential for therapeutic targeting. This project will investigate the design and development of a scalable software pipeline that uses Artificial Intelligence and Information Retrieval methods to capture, rank, predict, and retrieve novel G4s, G4BPs, and G4-G4BP interactions from public genomic and proteomic data, including research papers, datasets, and databases.
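As one example of the kind of signal such a pipeline could rank, the sketch below scans a sequence for the canonical putative-G4 motif (four tracts of three or more guanines separated by loops of one to seven bases) using a regular expression; the thresholds follow the commonly used pattern and are illustrative.
```python
# Rule-based candidate G4 detection: a simple baseline signal that a larger
# AI/IR pipeline could rank and combine with other evidence.
import re

# Four G-tracts of >= 3 Gs, separated by loops of 1-7 bases.
G4_PATTERN = re.compile(r"(G{3,}[ACGT]{1,7}){3}G{3,}")

def find_candidate_g4s(sequence):
    """Return (start, end, matched subsequence) for putative G4 motifs."""
    return [(m.start(), m.end(), m.group(0))
            for m in G4_PATTERN.finditer(sequence.upper())]

print(find_candidate_g4s("ttGGGagaGGGttttGGGtgtGGGca"))
```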

High-Performance Computing

Increasing data volumes, particularly in science and engineering, have resulted in the widespread adoption of parallel and distributed file systems for data storage and access. However, as file system sizes and the amount of data “owned” by users have grown, it has become increasingly difficult to discover and locate data among the terabytes or petabytes of accessible data. While it is now routine to search for data on a personal computer or discover data online at the click of a button, no equivalent method exists for discovering data on the large parallel and distributed file systems of High-Performance Computing systems. Popular search solutions, such as Apache Lucene, were designed and implemented to run on commodity hardware and thus face significant limitations in achieving good efficiency on large-scale storage systems with many-core architectures, multiple NUMA nodes, and multiple NVMe storage devices. This project will investigate the design and development of a scalable software tool that exploits the properties of modern High-Performance Computing systems to build a persistent index compatible with the Apache Lucene index structure, adapting optimization techniques developed in prior work.
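As a structural illustration, the sketch below builds a toy inverted index with a shard-per-worker, merge-at-the-end layout, the general shape an HPC-aware indexer might parallelize across cores or NUMA domains. The tokenizer, corpus, and in-memory dictionaries are placeholders; a real tool would emit Lucene-compatible segments.
```python
# Toy sharded inverted-index construction: each worker builds a partial
# index over its chunk of documents, and shards are merged at the end.
from collections import defaultdict
from multiprocessing import Pool

def build_shard(docs):
    """docs: list of (doc_id, text). Returns term -> sorted posting list."""
    postings = defaultdict(set)
    for doc_id, text in docs:
        for term in text.lower().split():  # placeholder tokenizer
            postings[term].add(doc_id)
    return {t: sorted(ids) for t, ids in postings.items()}

def merge_shards(shards):
    merged = defaultdict(list)
    for shard in shards:
        for term, ids in shard.items():
            merged[term].extend(ids)
    return {t: sorted(ids) for t, ids in merged.items()}

if __name__ == "__main__":
    corpus = [(i, f"sample document number {i}") for i in range(8)]
    chunks = [corpus[i::4] for i in range(4)]  # one chunk per worker
    with Pool(4) as pool:
        index = merge_shards(pool.map(build_shard, chunks))
```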
For decades, advancements in information retrieval technologies have changed how we discover data on computer systems. It is commonplace to quickly search the internet, and recent advancements in Artificial Intelligence (AI) have made neural and semantic search prevalent for millions of individuals. This growth in AI applications has shown the substantial benefits of neural search techniques, yet no information retrieval tools currently scale well on High-Performance Computing (HPC) systems. One of the primary constraints in building large information retrieval systems is index construction time, a constraint that is especially acute for neural and semantic retrieval systems. Recent efforts have improved index construction time, but many of these advances are designed for more traditional computing systems and may not be optimal for HPC. This project aims to identify common performance bottlenecks in the processing pipelines of neural and semantic indexes and to address them with optimization techniques that exploit the properties of HPC systems to improve index construction time.
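To illustrate where construction time goes, the sketch below builds a small approximate-nearest-neighbor index over placeholder embeddings, assuming the FAISS library's HNSW index as a stand-in for a semantic index; the dimensionality, batch size, and HNSW parameters are illustrative knobs, not tuned values.
```python
# Embed-then-index sketch: the add() loop is where most wall-clock time goes,
# and it is the stage an HPC-aware build would parallelize and tune.
import numpy as np
import faiss

dim, n_docs = 384, 100_000
embeddings = np.random.rand(n_docs, dim).astype("float32")  # placeholder vectors

index = faiss.IndexHNSWFlat(dim, 32)   # M = 32 graph neighbors per node
index.hnsw.efConstruction = 200        # build-time quality/speed trade-off

for start in range(0, n_docs, 10_000):
    index.add(embeddings[start:start + 10_000])
```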

Medical Informatics

Understanding how diseases relate to one another is central to improving clinical decision-making, especially in multimorbidity settings where patients present with multiple interacting chronic conditions. Prior work has used clustering techniques to identify persistent disease clusters that reliably co-occur across patients, but clustering treats each disease as an isolated category and cannot capture deeper semantic relationships such as shared biological mechanisms, overlapping symptoms, or similar progression patterns. This project instead learns disease embeddings—continuous vector representations of diseases—derived from ICD descriptions, medical literature, and structured EHR signals. In this learned latent space, diseases that are conceptually or clinically related are positioned closer together, enabling more nuanced discovery of multimorbidity patterns than discrete clustering alone. By integrating these embeddings with patient data, the project aims to identify meaningful disease groupings, quantify disease–disease similarity, and develop clinician-interpretable tools for exploring the latent structure of chronic conditions.
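A minimal sketch of the embedding idea, assuming the sentence-transformers package: ICD code descriptions are encoded into vectors, and disease similarity is read off as cosine similarity. The model name and the three example codes are illustrative, and the project would fold in literature and EHR signals beyond this.
```python
# Embed ICD descriptions and query disease-disease similarity in the
# learned latent space.
import numpy as np
from sentence_transformers import SentenceTransformer

descriptions = {
    "E11": "Type 2 diabetes mellitus",
    "I10": "Essential (primary) hypertension",
    "J45": "Asthma",
}

model = SentenceTransformer("all-MiniLM-L6-v2")
codes = list(descriptions)
vecs = model.encode([descriptions[c] for c in codes], normalize_embeddings=True)

# Cosine similarity between normalized vectors is just a dot product.
sim = vecs @ vecs.T
for i, code in enumerate(codes):
    nearest = codes[int(np.argsort(-sim[i])[1])]  # rank 0 is the code itself
    print(code, "->", nearest)
```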
Over the past decade, driven by advances in computer technology, the rapid growth of data, and the reemergence of Artificial Intelligence, substantial effort has been devoted to improving the efficiency and effectiveness of Data Science and Machine Learning models, both in the quality of the models and in the performance of their implementations. Typically, machine learning or deep learning models can be tuned to achieve very high accuracy and precision on specific training and test datasets. However, when the same model is confronted with data outside those datasets, commonly called Out-of-Distribution data, its performance can decrease dramatically. This can have severe repercussions in critical domains: in medicine, a model could misdiagnose a patient, leading to incorrect treatment; in home robotics, it could misidentify an object or a command, leading to unintended actions. Detecting and highlighting Out-of-Distribution data points is cumbersome with popular standard tools such as Python Matplotlib and Jupyter Notebooks. This project will therefore focus on the design and implementation of a scalable, interactive Out-of-Distribution data visualization platform that automates and simplifies the process of analyzing and identifying Out-of-Distribution data. The proposed platform will be used to enhance the Machine Learning models developed in previous work on predicting lung nodule tumor malignancy from medical images.
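As one example of a per-sample signal such a platform could visualize, the sketch below computes the maximum-softmax-probability score, a standard OOD baseline, for a batch of inputs; the classifier and the flagging threshold are placeholders that would be calibrated on held-out in-distribution data.
```python
# Maximum softmax probability as an OOD score: low confidence on the
# predicted class suggests the input may be Out-of-Distribution.
import torch
import torch.nn.functional as F

@torch.no_grad()
def ood_scores(model, inputs):
    """Return the max softmax probability per sample (lower = more OOD-like)."""
    probs = F.softmax(model(inputs), dim=-1)
    return probs.max(dim=-1).values

def flag_ood(model, inputs, threshold=0.5):
    # threshold is illustrative; calibrate on held-out in-distribution data
    return ood_scores(model, inputs) < threshold
```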

Robotics

This project focuses on developing a non-invasive brain-computer interface system that enables direct neural control of robotic systems through EEG signal processing. Using commercially available or custom headset devices, the system will capture brain activity patterns and translate them into precise robot control commands with high accuracy. The project emphasizes practical applications in therapeutic and assistive contexts, specifically targeting art therapy environments where patients can control creative robotic tools through thought, and movement assistance scenarios where individuals with limited mobility can operate robotic aids through neural commands. By developing robust signal classification algorithms and tailoring the interpretation of EEG patterns to specific use cases, the system will adapt to individual users’ neural signatures and application requirements. This approach democratizes BCI technology by utilizing non-invasive methods while maintaining the accuracy and responsiveness needed for meaningful real-world applications.
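A minimal sketch of one classic decoding pipeline this kind of system could start from: per-channel EEG band-power features computed with Welch's method and fed to a linear classifier, assuming scipy and scikit-learn. The sampling rate, frequency bands, and classifier choice are illustrative and would be adapted per user and application.
```python
# Classic EEG decoding sketch: band-power features per channel, then a
# linear classifier mapping features to intended commands.
import numpy as np
from scipy.signal import welch
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

BANDS = {"alpha": (8, 13), "beta": (13, 30)}  # Hz, illustrative choices

def band_power_features(epoch, fs=250):
    """epoch: array (n_channels, n_samples) -> flat feature vector."""
    freqs, psd = welch(epoch, fs=fs, nperseg=fs)
    feats = []
    for lo, hi in BANDS.values():
        mask = (freqs >= lo) & (freqs < hi)
        feats.append(psd[:, mask].mean(axis=1))  # mean band power per channel
    return np.concatenate(feats)

def train_decoder(X, y):
    """X: epochs (n_trials, n_channels, n_samples); y: command labels."""
    feats = np.array([band_power_features(e) for e in X])
    return LinearDiscriminantAnalysis().fit(feats, y)
```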
This project develops a unified multimodal system that combines Large Language Models, visual perception, and robotic action into a cohesive control framework. The system will enable robots to interpret natural language commands, perceive their environment through vision systems, and execute appropriate physical actions. By integrating LLMs for command interpretation and movement planning with vision-language-action mapping, the platform will translate high-level human instructions into context-aware robot behaviors. The system will process visual inputs to understand scene composition, use language models to interpret user intent and generate motion plans, and create direct mappings between perception, linguistic understanding, and motor execution. This holistic approach allows robots to learn from demonstrations, adapt to environmental changes, and respond to complex instructions without requiring specialized programming knowledge.
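The sketch below shows the perceive, interpret, act data flow described above; every function in it (detect_objects, query_llm, execute) is a hypothetical placeholder for the project's vision, language, and control components rather than a real API.
```python
# One control step of a hypothetical perceive -> interpret -> act loop.
# The point is the data flow between components, not any specific API.
import json

def control_step(camera_image, instruction, detect_objects, query_llm, execute):
    scene = detect_objects(camera_image)   # e.g. [{"name": "cup", "xyz": [...]}]
    prompt = (
        "Objects in view: " + json.dumps(scene) + "\n"
        "Instruction: " + instruction + "\n"
        'Reply with JSON: {"action": ..., "target": ...}'
    )
    plan = json.loads(query_llm(prompt))   # LLM maps user intent to an action
    execute(plan["action"], plan["target"])  # motor layer carries it out
```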

Software Engineering

Modern SaaS platforms often lose users during onboarding because they fail to adapt to users’ evolving goals and experience levels. This project investigates how adaptive, context-aware recommendation can improve onboarding by interpreting user queries and actions in real time. The framework proposes a method to extract rich feature representations from complex SaaS systems and to construct a model that tracks user progress through different interface states. We then compare traditional static and adaptive onboarding flows, examining how recommendations evolve as users gain familiarity with the system. Key evaluation metrics, including time-to-activation, task success, and user confidence, are used to assess whether adaptive onboarding enhances learning efficiency and retention. The resulting prototype will be deployed and tested within real SaaS platforms used by DePaul faculty to assess its practical impact.
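As a toy illustration of adaptivity, the sketch below re-ranks next-step recommendations as a tracked familiarity estimate changes; the state graph, the familiarity score, and the filtering rule are hypothetical placeholders for the proposed models.
```python
# Toy adaptive recommendation over interface states: the same state yields
# different next steps depending on the user's estimated familiarity.
NEXT_STEPS = {
    "signed_up": ["create_project", "take_tour"],
    "create_project": ["invite_team", "import_data"],
}

def recommend(state, familiarity):
    """familiarity in [0, 1], estimated from the user's queries and actions."""
    candidates = list(NEXT_STEPS.get(state, []))
    if familiarity > 0.7:
        # experienced users skip the guided tour entirely
        candidates = [c for c in candidates if c != "take_tour"]
    else:
        # novices see the guided tour first
        candidates.sort(key=lambda c: c != "take_tour")
    return candidates[:1]

print(recommend("signed_up", familiarity=0.2))  # -> ['take_tour']
print(recommend("signed_up", familiarity=0.9))  # -> ['create_project']
```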
Large language models (LLMs) have transformed coding and software engineering tasks. Current benchmarks for evaluating LLM coding, such as SWE-bench, SWE-Lancer, and others, evaluate end results (whether an LLM fixed a bug or produced correct code) but overlook the efficiency and reasoning process behind those outcomes. This project proposes a new observability framework that measures an LLM’s contextual reasoning, assessing its ability to identify relevant code artifacts, trace faults during iterative reasoning loops, and reason efficiently about program semantics. The goal is to shift LLM evaluation from outcome-based accuracy toward a deeper understanding of how models reason about code. The results will help practitioners gain deeper insight into LLM reasoning trajectories and optimize costs.
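One process-level metric such a framework might log is sketched below: the cumulative recall of ground-truth faulty files among the files an LLM inspects across its reasoning steps. The trace format and the ground-truth set are hypothetical.
```python
# Process-level observability sketch: how quickly does the model's reasoning
# loop cover the files actually implicated in the bug?
def localization_at_step(trace, ground_truth):
    """trace: list of sets of files inspected at each reasoning step."""
    seen = set()
    scores = []
    for step_files in trace:
        seen |= step_files
        scores.append(len(seen & ground_truth) / len(ground_truth))
    return scores

trace = [{"app.py"}, {"app.py", "db.py"}, {"utils.py"}]
print(localization_at_step(trace, ground_truth={"db.py", "utils.py"}))
# -> [0.0, 0.5, 1.0]: both faulty files are found by step 3
```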
Industry leaders estimate that LLMs such as GitHub Copilot and ChatGPT now generate 30–45% of production code. However, the code they produce is not always correct; it can be syntactically valid yet semantically incorrect. Developers may fix these issues manually or rely on the same LLM to repair them, but it remains unclear whether an LLM can effectively fix bugs it introduced itself. This raises a fundamental question: does the reasoning an LLM uses to repair code align with the reasoning it used to generate it? If not, how does this misalignment vary with the depth of the underlying semantic error? To explore this, we propose a framework in which an LLM deliberately injects and repairs its own bugs while its reasoning traces are analyzed for causal consistency. By measuring the semantic alignment between generation and fix, this study investigates whether LLMs truly understand code semantics. The results of this work will provide valuable insights for self-auditing and trustworthy AI in coding tasks.
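A minimal sketch of the inject-then-repair protocol, where query_llm is a hypothetical stand-in for any LLM API; the alignment proxy shown (identifier overlap between the injection and repair explanations) is a crude illustration of the causal-consistency analysis, not the proposed measure.
```python
# Inject-then-repair sketch: the same (hypothetical) LLM introduces a bug,
# then attempts to find and fix it, and the two reasoning traces are compared.
def inject_and_repair(code, query_llm):
    """query_llm is assumed to return {'explanation': str, 'code': str}."""
    injected = query_llm(
        "Introduce one subtle semantic bug into this code. "
        "Explain your change, then output the code.\n" + code)
    repaired = query_llm(
        "This code may contain a bug. Find it, explain your reasoning, "
        "then output the fixed code.\n" + injected["code"])
    # Crude alignment proxy: do the two explanations mention the same terms?
    inj_terms = set(injected["explanation"].split())
    fix_terms = set(repaired["explanation"].split())
    overlap = len(inj_terms & fix_terms) / max(len(inj_terms), 1)
    return injected, repaired, overlap
```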