
Thesis Defense: Caleb Ellington | March 31, 2026 | 1:30pm
CBD and CPCB are proud to announce the following thesis defense:
TITLE: Learning on Heterogeneous Data with Context-Adaptive Models
Caleb Ellington
Tuesday, March 31st @ 1:30PM ET
NSH 3305, Carnegie Mellon University
Passcode: 639295
Most models assume that a dataset consists of repeated measurements from a single set of controlled experimental conditions.
Many real-world systems violate this assumption. In biology and medicine, each sample often comes with a unique context, shaped by genetics, environment, treatments, and disease history. We define this as heterogeneity, where every sample may come from a different distribution under a different context. In this setting, core modeling assumptions break down, limiting the application of both classical models and modern machine learning methods.
This thesis develops methods for learning on heterogeneous data. We approach this through context-adaptive inference: before predicting, the system uses information about the current context to specialize its parameters or computation to that instance. This perspective unifies many rare, heterogeneous, localized, dynamic, and context-dependent modeling problems under a general framework.
This work makes three primary contributions.

Aim 1 introduces contextualized modeling, a unifying family of methods that estimate context-specific models by learning a mapping from context to model parameters. This enables estimation and inference in settings where each model must be tailored to each sample.

Aim 2 demonstrates how contextualized modeling enables new forms of context-adaptive study design. In this regime, models improve with the addition of new experimental conditions rather than requiring large numbers of replicates from each condition. These studies converge on an analytical workflow: we contextualize a model of interest, perform context-adaptive inference to obtain context-specific models for every sample, and re-organize the data in terms of sample-specific model parameters, which improves sample grouping, stratification, and downstream learning.

Aim 3 generalizes and scales this approach using large-scale foundation models. Self-supervised pretraining on massive heterogeneous biological data produces generalizable and comparable representations without the explicit modeling constraints imposed by contextualized modeling, leading to similar improvements in grouping, stratification, and transfer learning.
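To make the core idea of Aim 1 concrete, the following is a minimal illustrative sketch (not the thesis code) of contextualized regression: a learned map from each sample's context c to sample-specific regression parameters beta(c), so that every sample effectively gets its own model. For simplicity the context-to-parameter map here is linear, beta(c) = A c + b, fit by gradient descent on squared error; in practice this map could be any differentiable function.

```python
# Hypothetical sketch of contextualized linear regression:
# each sample i has features x_i, context c_i, and outcome
# y_i = x_i . beta(c_i), where beta(c) = A c + b is a learned
# context-to-parameter map (assumed linear here for simplicity).
import numpy as np

rng = np.random.default_rng(0)
n, dx, dc = 200, 3, 2

# Synthetic heterogeneous data: the true parameters vary with context,
# so no single fixed-parameter model fits every sample.
C = rng.normal(size=(n, dc))          # per-sample context
X = rng.normal(size=(n, dx))          # per-sample features
A_true = rng.normal(size=(dc, dx))
beta_true = C @ A_true                # (n, dx) sample-specific parameters
y = np.sum(X * beta_true, axis=1)

# Fit the context-to-parameter map by gradient descent on squared error.
A = np.zeros((dc, dx))
b = np.zeros(dx)
lr = 0.05
for _ in range(500):
    beta = C @ A + b                       # predicted per-sample parameters
    resid = np.sum(X * beta, axis=1) - y   # per-sample residuals
    grad_beta = resid[:, None] * X / n     # d(loss)/d(beta), per sample
    A -= lr * (C.T @ grad_beta)
    b -= lr * grad_beta.sum(axis=0)

# After training, beta = C @ A + b gives a distinct model per sample,
# which can then be used to group or stratify samples.
mse = np.mean((np.sum(X * (C @ A + b), axis=1) - y) ** 2)
print(f"final MSE: {mse:.4f}")
```

The recovered per-sample parameter vectors (rows of `C @ A + b`) are the objects the workflow in Aim 2 re-organizes the data around.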
While heterogeneity is often treated as a confounder or nuisance, these contributions show that in many real-world settings heterogeneity is a critical resource for accurate modeling. Context-adaptive inference learns how mechanisms vary with context, and context-specific models enable analyses that would be impossible under traditional fixed-population assumptions. In biology and medicine, this opens new opportunities for cohort stratification, drug repurposing, and therapeutic target identification.
Committee:
Eric P. Xing, Advisor, CMU
Jian Ma, CMU
David Koes, PITT
Manolis Kellis, MIT