Workshop on Statistical and machine learning methods for health data science

发表于： 2025-07-02 点击：

Conference: July 6, 2025

Meeting venue: The Second Lecture Hall on the first floor of the Mathematics Building, Jilin University

On-campus contact person：Hongyuan Cao , [email protected]

Schedule:

Workshop arrangement:

Title: Semiparametric mixture regression for asynchronous longitudinal data using multivariate functional principal component analysis

Speaker: Yehua Li, Professor and Chair, Department of Statistics, UC-Riverside

Abstract:

The transitional phase of menopause induces significant hormonal fluctuations, exerting a profound influence on the long-term well-being of women. In an extensive longitudinal investigation of women’s health during mid- life and beyond, known as the Study of Women’s Health Across the Nation (SWAN), hormonal biomarkers are repeatedly assessed, following an asynchronous schedule compared to other error-prone covariates, such as physical and cardiovascular measurements. We conduct a subgroup analysis of the SWAN data employing a semiparametric mixture regression model, which allows us to explore how the relationship between hormonal responses and other time-varying or time-invariant covariates varies across subgroups. To address the challenges posed by asynchronous scheduling and measurement errors, we model the time-varying covariate trajectories as functional data with reduced-rank Karhunen-Lo ́eve expansions, where splines are employed to capture the mean and eigenfunctions. Treating the latent subgroup membership and the functional principal component (FPC) scores as missing data, we propose an Expectation-Maximization (EM) algorithm to effectively fit the joint model, combining the mixture regression for the hormonal response and the FPC model for the asynchronous, time-varying covariates. In addition, we explore data-driven methods to determine the optimal number of subgroups within the population. Through our comprehensive analysis of the SWAN data, we unveil a crucial subgroup structure within the aging female population, shedding light on important distinctions and patterns among women undergoing menopause.

Dr. Yehua Li is Professor and Chair of Statistics at the University of California, Riverside. He got his Ph.D. in Statistics in 2006 from Texas A&M University. Before joining UCR in 2018, he held faculty positions at the University of Georgia and Iowa State University. He is a Fellow of the American Statistical Association, a Fellow of the Institute of Mathematical Statistics, an Elected Member of the International Statistical Institute, and a recipient of the National Science Foundation CAREER Award in 2012. He has served on the editorial boards of the Canadian Journal of Statistics, the Journal of Multivariate Analysis, and Stat. His research interests include functional and longitudinal data analysis, non- and semi-parametric methods, spatial statistics, measurement error, mixture models, cardiovascular disease, and neuroimaging.

Title: Contrastive Learning on Multimodal Analysis of Electronic Health Records

Speaker: Doudou Zhou, Assistant Professor, Department of Statistics, National University of Singapore

Abstract: Electronic health record (EHR) systems contain a wealth of multimodal clinical data including structured data like clinical codes and unstructured data such as clinical notes. However, many existing EHR-focused studies has traditionally either concentrated on an individual modality or merged different modalities in a rather rudimentary fashion. This approach often results in the perception of structured and unstructured data as separate entities, neglecting the inherent synergy between them. Specifically, the two important modalities contain clinically relevant, inextricably linked and complementary health information. A more complete picture of a patient's medical history is captured by the joint analysis of the two modalities of data. Despite the great success of multimodal contrastive learning on vision-language, its potential remains under-explored in the realm of multimodal EHR, particularly in terms of its theoretical understanding. To accommodate the statistical analysis of multimodal EHR data, in this paper, we propose a novel multimodal feature embedding generative model and design a multimodal contrastive loss to obtain the multimodal EHR feature representation. Our theoretical analysis demonstrates the effectiveness of multimodal learning compared to single-modality learning and connects the solution of the loss function to the singular value decomposition of a pointwise mutual information matrix. This connection paves the way for a privacy-preserving algorithm tailored for multimodal EHR feature representation learning. Simulation studies show that the proposed algorithm performs well under a variety of configurations. We further validate the clinical utility of the proposed algorithm in real-world EHR data.

Dr. Doudou Zhou is an Assistant Professor at the Department of Statistics and Data Science, National University of Singapore. He obtained his Ph.D. in Statistics from the University of California, Davis, and completed his postdoctoral training at the Harvard T.H. Chan School of Public Health. His research interests include electronic health records, high-dimensional statistics, transfer learning, and federated learning. His work has been published in leading journals and conferences such as Journal of the American Statistical Association, Journal of Machine Learning Research, IEEE Transactions on Information Theory, and Conference on Learning Theory.

Title: Addressing Information Bias in EHR-Based COVID-19 Vaccine Effectiveness Studies: Inference under Treatment Misclassification

Speaker: Qiong Wu, Assistant Professor, Department of Biostatistics, University of Pittsburgh

Abstract:

Post-introduction evaluation of COVID-19 vaccines is essential to answer critical questions not fully addressed in clinical trials, such as effectiveness against evolving SARS-CoV-2 variants and the incidence of rare adverse events, and thus inform vaccine recommendations to the general public. However, in the U.S., the fragmentation of immunization records across multiple, often disconnected, systems results in incomplete vaccination histories in patients’ Electronic Health Records (EHRs). In this talk, I will first introduce a novel comparative effectiveness research (CER) pipeline leveraging the rank-preserving property of propensity scores and the integrated likelihood method to estimate effectiveness with misclassified treatments, in the absence of validation data. As the pandemic progressed, we achieved internal validation with reliable vaccination records by integrating EHR data with immunization information systems. Consequently, I will introduce the efficient influence function for average treatment effects in contexts where such internal validation is available. Following this, we construct an efficient estimator that achieves full statistical efficiency and has faster rates of convergence than estimating complex nuisance parameters. The proposed method is applied to data from a national network of U.S. pediatric medical centers (PEDSnet) to assess vaccination effectiveness among children and adolescents during the Omicron period.

Qiong is currently an assistant professor in the Department of Biostatistics and Health Data Science at the University of Pittsburgh. Her research interests include federated learning, transfer learning, and causal inference using real-world data, such as electronic health records (EHR) data, as well as complex-structured biomedical data, like neuroimaging data. Prior to her current position, she was a postdoctoral researcher in the Department of Biostatistics, Epidemiology, and Informatics (DBEI) at the University of Pennsylvania. She received her PhD degree from the University of Maryland in statistics.

Title: Diagnosing scientific replicability through probabilistic distinguishability

Speaker: Peng Wang, PhD Candidate in Mathematics, Jilin University

Abstract: Despite the widely recognized importance of replicability in biological research, computational methods to quantify irreplicability and identify irreplicable instances remain underdeveloped. This paper presents an efficient and robust computational framework to address this gap. To tackle the challenge of defining an acceptable level of intrinsic heterogeneity among replicable studies, we introduce a distinguishability criterion, ensuring that replicable effects, while potentially heterogeneous, can be distinguished from zero effects and maintain consistent directions. We implement a Bayesian model criticism approach, reporting a Bayesian $p$-value to identify potential irreplicable instances. Through numerical experiments, we demonstrate the efficacy of the proposed methods in detecting batch effects in high-throughput experiments and identifying instances of the publication bias. Finally, we apply the framework to multi-tissue eQTL data from the GTEx consortium, uncovering tissue-specific eQTLs that represent genuinely irreplicable biological effects.

Title: Meta-clustering of Gene Expression Data

Speaker: Yingying Wei, Associate Professor, Department of Statistics, Chinese University of Hong Kong

Abstract: Traditional meta-analyses pool effect sizes across studies to improve statistical power. Likewise, there is growing interest in joint clustering across datasets to identify disease subtypes for bulk gene expression data and to discover cell types for single-cell RNA-sequencing (scRNA-seq) data. Unfortunately, due to the prevalence of technical batch effects, directly clustering of samples from multiple gene expression datasets can lead to wrong results. Therefore, in the past several years, there has been very active research on the integration of multiple gene expression datasets. However, the discussion on when multiple gene expression datasets can be integrated for joint clustering is lacking. Obviously, if different subtypes are assayed in distinct batches, then meta-clustering would be impossible no matter what types of machine learning or statistical methods are used. In this talk, I will present our Batch-effects-correction-with-Unknown-Subtypes (BUS) framework. BUS is capable of adjusting batch effects explicitly, grouping samples that share similar characteristics into subtypes, identifying genes that distinguish subtypes and enjoying a linear-order computational complexity. The BUS framework can be adapted to perform meta-clustering for bulk gene expression data, scRNA-seq data collected from a single biological condition, and scRNA-seq data collected from multiple biological conditions, respectively. The proofs for model identifiability for the corresponding models provide insights on when multiple gene expression data can be integrated for meta-clustering and guidelines on experimental designs. Simulation studies and real data analyses show the advantages of our proposed models over state-of-the-art methods, especially when performing differential inference for scRNA-seq data collected from multiple conditions.

Yingying Wei is an Associate Professor in the Department of Statistics at the Chinese University of Hong Kong. She obtained her bachelor's degree in Mathematics from Tsinghua University and her Master’s degree (MSc Eng) in Computer Science and PhD degree in Biostatistics from Johns Hopkins University. She is interested in developing statistical methods for genomic, biomedical and social network data. She received the W. J. Youden Award in Interlaboratory Testing from the American Statistical Association in 2019.

下一篇： 2025 China-Japan Joint Meeting on Polynomial Ring Theory

娱乐城

科学研究

Workshop on Statistical and machine learning methods for health data science