Soumyadeep Roy

I am a postdoctoral scholar working with Prof. Tina Hernandez-Boussard at the Division of Computational Medicine, Department of Medicine, Stanford University.

Currently, I am building and evaluating LLM-based clinical decision support systems, specifically on two applications: (i) clinical guideline adherence over real-world patient trajectories and (ii) synthetic cohort generation for evaluating AI-based patient-to-trial matching systems like TrialGPT.

My work sits at the intersection of biomedical AI, machine learning, natural language processing, and real-world evidence. As medicine continues to generate increasingly complex clinical, genomic, and textual data, this kind of research is central to the future of AI in medicine.

I have worked with clinical data (structured EHR and unstructured notes) on Parkinson’s disease (L3S Research Center and Hannover Medical School, Germany), oncology (breast, lung, and prostate cancer) at GE Healthcare India, and postoperative pain management at Stanford Medicine.

Research Career Overview

Translational Works

[Boston GrandHack 2026] Co-developed AuriCare, a holistic pain-management decision-support concept, presented at MIT Hacking Medicine’s Boston GrandHack 2026. Blog Event

[Patent, filed 2025] Lead co-inventor on a deep-learning representation-learning system for medical-imaging-equipment sensor logs, enabling anomaly detection and predictive maintenance (Wipro GE Healthcare).

[Translation Funded Project] My PhD work on interpretable clinical trial search has been continued as a funded translational research project under the AI4ICPS programme at IIT Kharagpur, led by my co-author. Official Website Poster

news

Jul 04, 2026	Serving as a reviewer for NeurIPS 2026, ACL Rolling Review (March, May 2026) and AAAI 2027
May 10, 2026	Guest Lecturer for Stanford BMDS 223 Course “Deploying and Evaluating Fair AI in Healthcare” (Spring 2026). One lecture on “Bias Evaluation in LLMs” and one hands-on coding workshop on “Bias Audit on Real-World Data (MIMIC-IV)”
Apr 10, 2026	Our work on efficient vocabulary adaptation for the medical and legal domains was accepted to the ACL 2026 main track as a full paper. I will present in person in San Diego, California. Code Preprint
Mar 16, 2026	Presented AuriCare, a holistic pain management decision support concept, at MIT Hacking Medicine’s Boston GrandHack 2026. Demo Blog
Feb 20, 2026	Served as a reviewer for 2 conferences (FAccT 2026, ACL ARR January 2026) and 2 journals (JAMIA, Frontiers in AI)
Feb 14, 2026	Our work “LongTailQA: Benchmarking LLMs and RAG Models on Disambiguated Long-Tail Entities” was accepted to LREC 2026. This was a year-long collaborative effort with PhD students and colleagues from L3S Research Center, Germany
Dec 05, 2025	Presented our work on vocabulary adaptation (VA) for training medical language models at the Microsoft Research India (Bangalore) Friday Breakfast talk series. Link to slides
Nov 21, 2025	Our Parkinson’s disease subtyping paper with L3S Research Center, Germany, and Hannover Medical School was published in Frontiers in Artificial Intelligence, in the section Medicine and Public Health: https://doi.org/10.3389/frai.2025.1668206. Link to slides
Sep 03, 2025	Started my postdoc at Stanford Medicine with Prof. Tina Hernandez-Boussard. I will work on understanding how real-world patient trajectories deviate from clinical guidelines, and whether those deviations lead to positive patient outcomes or avoidable harm.
Aug 01, 2025	Served as a reviewer for A* conferences such as EMNLP, AAAI, and ACL ARR (July), and for journals such as Frontiers in Genetics and Knowledge and Information Systems.

selected publications

Learning Faster with Better Tokens: Parameter-Efficient Vocabulary Adaptation for Specialized Text Summarization

Gunjan Balde, Soumyadeep Roy, Mainack Mondal, and 1 more author

In Main Proceedings of the 64th Annual Meeting of the Association for Computational Linguistics, Sep 2026

Large language models pretrained on general-domain corpora often exhibit tokenization inefficiencies when applied to specialized domains. Although continual pretraining for domain adaptation partially alleviates performance degradation, it does not resolve the fundamental vocabulary mismatch. To address this gap, we introduce a targeted parameter-efficient domain adaptation approach that combines vocabulary adaptation with pretraining for LLM-based text summarization. Our unified framework augments pretrained tokenizers with domain-specific tokens while selectively replacing under-trained and unreachable tokens to limit parameter growth. We evaluate our approach on Llama-3.1-8B and Qwen2.5-7B across legal and medical summarization tasks, using a challenge-oriented evaluation protocol focused on expert-driven text and summaries, which typically have a higher concentration of over-fragmented Out-of-Vocabulary (OOV) words. The vocabulary adaptation algorithm enhances the overall quality of the summarization model by improving semantic similarity between the generated summaries and their references. In addition, the adapted model produces summaries that incorporate more appropriate novel and domain-specific words, leading to improved coherence, relevance, and faithfulness. We further observe that our proposed approach reduces training time by 35-55% over continual pretraining and reduces parameter counts by up to 37% relative to expansion-only methods.
Decision tree-based approach to robust Parkinson’s disease subtyping using clinical data of the Michael J. Fox Foundation LRRK2 cross-sectional study

Soumyadeep Roy, Stefanie Krähe, Michael Marschollek, and 3 more authors

Frontiers in Artificial Intelligence, Sep 2025

Parkinson’s Disease (PD) is a neurodegenerative disorder with high heterogeneity in clinical symptoms, progression course, treatment response, and genetic factors. Thus, PD subtyping aims to enhance understanding of disease mechanisms and to facilitate targeted interventions or treatment regimens. Data-driven PD subtyping is typically done using cluster analysis. Still, such studies face barriers to widespread adoption in clinical practice due to the following issues: (i) results are quite sensitive to study design, and actual subtype rules are not reasonably interpretable; (ii) results are not robustly replicable across multiple datasets, and most studies focus on a single dataset. This paper aims to identify novel PD subtypes using an interpretable decision-tree-based method that is robustly reproducible in an independent PD cohort. We first train a decision tree classifier on an LRRK2 dataset to determine whether a patient has early onset or late onset PD. By tracing back from the leaves of the learned decision tree, subtyping rules are established. The independent MDS dataset is used for external validation, after mapping features between the two datasets. We obtained six novel subtypes that are clinically consistent and sufficiently large across both training and external validation datasets. Finally, a clinical characterization study showed that the following clinical features may be the most important diagnostic markers for our six detected subtypes: (i) persistent asymmetry affecting the side of onset most, (ii) clinical course of 10 years or more, and (iii) postural instability not caused by other dysfunction. The subtypes identified in our study may provide relevant guidance for prognosis and therapeutic strategies. An early onset subtype (E4) can be linked to a comparatively favorable prognosis. In contrast, the mixed onset subtypes (M3 and M7) may predict faster functional decline, suggesting that patients in these groups could benefit from intensified supportive measures. One late onset subtype (L1) seems to have a more benign course, while the other two (L2 and L4) are connected with predictors of reduced quality of life and increased care dependency.
Building Trustworthy AI Models for Medicine: From Theory to Applications

Soumyadeep Roy, Sowmya S. Sundaram, Dominik Wolff, and 1 more author

In The 18th ACM International Conference on Web Search and Data Mining, Sep 2025

AI is emerging as an efficient companion in medicine. While AI holds promise for reducing the cognitive load of researchers and practitioners, its adoption is often hindered by a lack of trust in new AI advancements. We present sophisticated techniques for developing trustworthy artificial intelligence (AI) models in medicine, bridging breakthroughs in AI research with practical healthcare applications. We will discuss in depth the four stages (Design, Development, Implementation, and Evaluation) involved in building trustworthy AI models customized for the medical domain. We present various techniques for incorporating important Trustworthy AI principles like data privacy, robustness, explainability, interpretability, medical experts-in-the-loop, and risk assessment while developing AI models for medicine. In contrast to prior tutorials, we make the following two key contributions: (i) While explaining the ‘Implementation’ stage, we cover various real-world healthcare applications developed as part of research projects in academia in collaboration with medical schools in India and Germany. (ii) By including a health informatics professional as one of the tutorial organizers, we provide a fresh and much-needed perspective on the research challenges and mitigation strategies in building AI models for medicine.
MEDVOC: Vocabulary Adaptation for Fine-tuning Pre-trained Language Models on Medical Text Summarization

Gunjan Balde, Soumyadeep Roy, Mainack Mondal, and 1 more author

In Proceedings of the 33rd International Joint Conference on Artificial Intelligence, Sep 2024

This work presents a dynamic vocabulary adaptation strategy, MEDVOC, for fine-tuning pre-trained language models (PLMs) like BertSumAbs, BART, and PEGASUS for improved medical text summarization. In contrast to existing domain adaptation approaches in summarization, MEDVOC treats vocabulary as an optimizable parameter and optimizes the PLM vocabulary based on fragment score conditioned only on the downstream task’s reference summaries. Unlike previous works on vocabulary adaptation (limited only to classification tasks), optimizing vocabulary based on summarization tasks requires an additional, extremely costly intermediate fine-tuning step on large summarization datasets. To that end, our novel fragment score-based hyperparameter search significantly reduces this fine-tuning time, from 450 days to less than 2 days on average. Furthermore, while previous works on vocabulary adaptation are often tied to single PLMs, MEDVOC is designed to be deployable across multiple PLMs (with varying model vocabulary sizes, pre-training objectives, and model sizes), bridging the limited vocabulary overlap between the biomedical literature domain and PLMs. MEDVOC outperforms baselines by 15.74% in terms of Rouge-L in the zero-shot setting and shows gains of 17.28% in high Out-Of-Vocabulary (OOV) concentrations. Our human evaluation shows MEDVOC generates more faithful medical summaries (88% compared to 59% in baselines).
Beyond Accuracy: Investigating Error Types in GPT-4 Responses to USMLE Questions

Soumyadeep Roy, Aparup Khatua, Fatemeh Ghoochani, and 3 more authors

In Proceedings of the 47th International ACM SIGIR Conference on Research and Development in Information Retrieval, Sep 2024

GPT-4 demonstrates high accuracy in medical QA tasks, leading with an accuracy of 86.70%, followed by Med-PaLM 2 at 86.50%. However, around 14% of its responses remain incorrect. Additionally, current works use GPT-4 only to predict the correct option without providing any explanation, and thus do not provide any insight into the thinking process and reasoning used by GPT-4 or other LLMs. Therefore, we introduce a new domain-specific error taxonomy derived from collaboration with medical students. Our GPT-4 USMLE Error (G4UE) dataset comprises 4153 correct and 919 incorrect GPT-4 responses to United States Medical Licensing Examination (USMLE) questions. These responses are quite long (258 words on average), containing detailed explanations from GPT-4 justifying the selected option. We then launch a large-scale annotation study using the Potato annotation platform and recruit 44 medical experts through Prolific, a well-known crowdsourcing platform. We annotated 300 out of these 919 incorrect data points at a granular level for different classes and created a multi-label span to identify the reasons behind the error. In our annotated dataset, a substantial portion of GPT-4’s incorrect responses is categorized by annotators as a "Reasonable response by GPT-4". This sheds light on the challenge of discerning explanations that may lead to incorrect options, even among trained medical professionals. We also provide medical concepts and medical semantic predications extracted using the SemRep tool for every data point. We believe that it will aid in evaluating the ability of LLMs to answer complex medical questions. We make the resources available at https://github.com/roysoumya/usmle-gpt4-error-taxonomy.
GENEMASK: Fast Pretraining of Gene Sequences to Enable Few-Shot Learning

Soumyadeep Roy, Jonas Wallat, Sowmya S. Sundaram, and 2 more authors

In 26th European Conference on Artificial Intelligence ECAI 2023, Sep 2023

Large-scale language models such as DNABert and LOGO aim to learn optimal gene representations and are trained on the entire Human Reference Genome. However, standard tokenization schemes involve a simple sliding window of tokens like k-mers that do not leverage any gene-based semantics and thus may lead to (trivial) masking of easily predictable sequences, and subsequently inefficient Masked Language Modeling (MLM) training. Therefore, we propose a novel masking algorithm, GENEMASK, for MLM training of gene sequences, where we randomly identify positions in a gene sequence as mask centers and locally select the span around the mask center with the highest Normalized Pointwise Mutual Information (NPMI) to mask. We observe that in the absence of human-understandable semantics in the genomics domain (in contrast, semantic units like words and phrases are inherently available in NLP), GENEMASK-based models substantially outperform the SOTA models (DNABert and LOGO) over four benchmark gene sequence classification datasets in five few-shot settings (10 to 1000-shot). More significantly, the GENEMASK-based DNABert model is trained for less than one-tenth of the number of epochs of the original SOTA model. We also observe a strong correlation between top-ranked PMI tokens and conserved DNA sequence motifs, which may indicate the incorporation of latent genomic information. The code (including trained models) and datasets are made publicly available at https://github.com/roysoumya/GeneMask.
Interpretable Clinical Trial Search using Pubmed Citation Network

Soumyadeep Roy, Niloy Ganguly, Shamik Sural, and 1 more author

In 2023 IEEE International Conference on Digital Health (ICDH), Sep 2023

Clinical trials are an essential source of information for practicing Evidence-Based Medicine because they help to determine the efficacy of newly developed treatments and drugs. However, most of the existing trial search systems focus on a specific disease (e.g., cancer) and utilize disease-specific knowledge bases that hinder the adaptation of such methods to new diseases. In this work, we overcome both limitations and propose a graph-based model that explores both clinical trials and the Pubmed databases to alleviate the shortage of relevant clinical trials for a query. We construct a large heterogeneous graph (750K nodes and 1.2 million edges) made of clinical trials and Pubmed articles linked to clinical trials. As both the graph edges and nodes are labeled, we develop a novel metapath-based similarity search (MPSS) method to retrieve and rank clinical trials across multiple disease classes. We primarily focus on consumers and users who do not have any prior medical knowledge. As there are no trial search evaluation datasets that span multiple diseases, we contribute a high-quality, well-annotated query-relevant trial set comprising around 25 queries and, on average, approximately 95 annotated trials per query. We also perform a detailed evaluation of MPSS on the TREC Precision Medicine Benchmark Dataset, a disease-specific clinical trial search setting. We make all code and data publicly available at https://github.com/roysoumya/MPSS-clinical-trial-search.
Knowledge-Aware Neural Networks for Medical Forum Question Classification

Soumyadeep Roy, Sudip Chakraborty, Aishik Mandal, and 6 more authors

In Proceedings of the 30th ACM International Conference on Information & Knowledge Management, Sep 2021

Online medical forums have become a predominant platform for answering health-related information needs of consumers. However, with a significant rise in the number of queries and the limited availability of experts, it is necessary to automatically classify medical queries based on a consumer’s intention, so that these questions may be directed to the right set of medical experts. Here, we develop a novel medical knowledge-aware BERT-based model (MedBERT) that explicitly gives more weight to medical concept-bearing words and utilizes domain-specific side information obtained from a popular medical knowledge base. We also contribute a multi-label dataset for the Medical Forum Question Classification (MFQC) task. MedBERT achieves state-of-the-art performance on two benchmark datasets and performs very well in low-resource settings.