publications
2024
- Adaptive BPE Tokenization for Enhanced Vocabulary Adaptation in Finetuning Pretrained Language Models. Gunjan Balde, Soumyadeep Roy, Mainack Mondal, and 1 more author. In Findings of the 2024 Conference on Empirical Methods in Natural Language Processing, 2024.
In this work, we show a fundamental limitation of vocabulary adaptation approaches that use the Byte-Pair Encoding (BPE) tokenization scheme for fine-tuning pretrained language models (PLMs) to expert domains. Current approaches trivially append the target domain-specific vocabulary at the end of the PLM vocabulary. This assigns the added tokens lower priority scores and causes sub-optimal tokenization in BPE, which tokenizes text by iteratively applying merge rules. To mitigate this issue, we propose AdaptBPE, which modifies the BPE tokenization initialization phase to first perform longest string matching on the added (target) vocabulary before tokenizing at the character level. We perform an extensive evaluation of AdaptBPE versus the standard BPE over various classification and summarization tasks; AdaptBPE improves by 3.57% (in terms of accuracy) and 1.87% (in terms of Rouge-L), respectively. AdaptBPE for MEDVOC works particularly well when reference summaries have a high OOV concentration or are longer. We also conduct a human evaluation, which reveals that AdaptBPE generates more relevant and more faithful summaries than MEDVOC.
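The initialization change described in the abstract, greedy longest-match over the added vocabulary before falling back to character-level pieces, can be sketched as below. This is a minimal illustrative sketch, not the authors' implementation: the function name `adaptbpe_pretokenize` and the representation of the added vocabulary as a plain set of strings are assumptions.

```python
def adaptbpe_pretokenize(word, added_vocab):
    """Greedy longest-match on the added (target-domain) vocabulary,
    falling back to single characters (the standard BPE initialization)
    where no added token matches. Hypothetical helper for illustration."""
    pieces = []
    i = 0
    while i < len(word):
        # Try the longest substring starting at i that is in the added vocab
        match = None
        for j in range(len(word), i, -1):
            if word[i:j] in added_vocab:
                match = word[i:j]
                break
        if match:
            pieces.append(match)
            i += len(match)
        else:
            pieces.append(word[i])  # fall back to a single character
            i += 1
    return pieces

# A domain term added to the vocabulary is matched whole, instead of
# being rebuilt from low-priority merge rules:
print(adaptbpe_pretokenize("pneumonia", {"pneumo", "nia"}))  # -> ['pneumo', 'nia']
```

Standard BPE merges would then run on these initial pieces as usual; the change only affects how the sequence is seeded.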
- Unlocking Efficiency: Adaptive Masking for Gene Transformer Models. Soumyadeep Roy, Shamik Sural, and Niloy Ganguly. In Proceedings of the 27th European Conference on Artificial Intelligence, 2024.
Gene transformer models such as Nucleotide Transformer, DNABert, and LOGO are trained to learn optimal gene sequence representations using the Masked Language Modeling (MLM) objective over the complete Human Reference Genome. However, typical tokenization methods employ a basic sliding window of tokens, such as k-mers, that fails to utilize any gene-centric semantics. This can result in the (trivial) masking of easily predictable sequences, leading to inefficient MLM training. Time-variant training strategies are known to improve pretraining efficiency in both language and vision tasks. In this work, we focus on curriculum masking, where we systematically increase the difficulty of the masked token prediction task using a Pointwise Mutual Information-based difficulty criterion, since gene sequences lack well-defined semantic units analogous to the words or sentences of the NLP domain. Our proposed Curriculum Masking-based Gene Masking Strategy (CM-GEMS) demonstrates superior representation learning capabilities compared to baseline masking approaches when evaluated on downstream gene sequence classification tasks. We perform extensive evaluation in both few-shot (five datasets) and full-dataset settings (the Genomic Understanding Evaluation benchmark, consisting of 27 tasks). Our findings reveal that CM-GEMS outperforms state-of-the-art models (DNABert-2, Nucleotide Transformer, DNABert) trained for 120K steps, achieving similar results in just 10K and 1K steps. We also demonstrate that Curriculum-Learned LOGO (a 2-layer DNABert-like model) can achieve nearly 90% of the performance of the state-of-the-art model trained for 120K steps. We will make the models and code publicly available on GitHub.
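The curriculum idea, ramping up masking difficulty as training progresses, might be sketched roughly as follows. Everything here is an illustrative assumption: the name `curriculum_mask`, the per-position `pmi_score` callback, and the linear schedule; the paper's actual difficulty criterion and schedule may differ.

```python
import random

def curriculum_mask(tokens, pmi_score, step, total_steps, mask_ratio=0.15):
    """Illustrative curriculum-masking sketch (hypothetical helper, not
    the paper's code). Early in training, mask positions are chosen
    mostly at random (easy); as training progresses, an increasing
    fraction is taken from the highest-PMI positions (hard)."""
    n_mask = max(1, int(len(tokens) * mask_ratio))
    hard_frac = min(step / total_steps, 1.0)  # difficulty ramps up linearly
    n_hard = int(n_mask * hard_frac)
    # Rank positions from hardest to easiest by their PMI-based difficulty
    ranked = sorted(range(len(tokens)), key=lambda i: -pmi_score(i))
    hard = ranked[:n_hard]
    easy = random.sample(ranked[n_hard:], n_mask - n_hard)
    return sorted(hard + easy)

# At the end of training, the n_mask highest-difficulty positions are masked:
print(curriculum_mask(list("ACGTACGTAC"), lambda i: i, 100, 100, 0.3))  # -> [7, 8, 9]
```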
- MEDVOC: Vocabulary Adaptation for Fine-tuning Pre-trained Language Models on Medical Text Summarization. Gunjan Balde, Soumyadeep Roy, Mainack Mondal, and 1 more author. In Proceedings of the 33rd International Joint Conference on Artificial Intelligence, 2024.
This work presents a dynamic vocabulary adaptation strategy, MEDVOC, for fine-tuning pre-trained language models (PLMs) like BertSumAbs, BART, and PEGASUS for improved medical text summarization. In contrast to existing domain adaptation approaches in summarization, MEDVOC treats vocabulary as an optimizable parameter and optimizes the PLM vocabulary based on fragment score, conditioned only on the downstream task's reference summaries. Unlike previous works on vocabulary adaptation (limited only to classification tasks), optimizing vocabulary for summarization tasks requires an additional, extremely costly intermediate fine-tuning step on large summarization datasets. To that end, our novel fragment score-based hyperparameter search reduces this fine-tuning time very significantly, from 450 days to less than 2 days on average. Furthermore, while previous works on vocabulary adaptation are often tied to single PLMs, MEDVOC is designed to be deployable across multiple PLMs (with varying model vocabulary sizes, pre-training objectives, and model sizes), bridging the limited vocabulary overlap between the biomedical literature domain and PLMs. MEDVOC outperforms baselines by 15.74% in terms of Rouge-L in the zero-shot setting and shows gains of 17.28% at high Out-Of-Vocabulary (OOV) concentrations. Our human evaluation shows that MEDVOC generates more faithful medical summaries (88% compared to 59% for baselines).
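One plausible reading of a fragment score, along the lines of tokenizer fertility, is the average number of subword pieces a tokenizer splits each reference-summary word into; the paper's exact definition may differ. A toy sketch under that assumption (all names and the toy tokenizer are hypothetical):

```python
def fragment_score(words, tokenize):
    """Hypothetical fragment-score sketch: average number of subword
    pieces per word under a given tokenizer. A lower score means the
    vocabulary covers the domain text with fewer fragments."""
    pieces = [len(tokenize(w)) for w in words]
    return sum(pieces) / len(pieces)

# Toy tokenizer: in-vocabulary words stay whole, others split to characters.
vocab = {"summary", "medical"}
toy_tokenize = lambda w: [w] if w in vocab else list(w)
print(fragment_score(["medical", "summary", "ecg"], toy_tokenize))  # -> (1 + 1 + 3) / 3
```

A vocabulary-search procedure could then score candidate vocabularies by how much they lower this value on the downstream reference summaries.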
- Beyond Accuracy: Investigating Error Types in GPT-4 Responses to USMLE Questions. Soumyadeep Roy, Aparup Khatua, Fatemeh Ghoochani, and 3 more authors. In Proceedings of the 47th International ACM SIGIR Conference on Research and Development in Information Retrieval, 2024.
GPT-4 demonstrates high accuracy in medical QA tasks, leading with an accuracy of 86.70%, followed by Med-PaLM 2 at 86.50%. However, around 14% of errors remain. Additionally, current works use GPT-4 only to predict the correct option, without providing any explanation, and thus offer no insight into the thinking process and reasoning used by GPT-4 or other LLMs. Therefore, we introduce a new domain-specific error taxonomy derived from collaboration with medical students. Our GPT-4 USMLE Error (G4UE) dataset comprises 4153 correct and 919 incorrect GPT-4 responses to United States Medical Licensing Examination (USMLE) questions. These responses are quite long (258 words on average) and contain detailed explanations from GPT-4 justifying the selected option. We then launch a large-scale annotation study using the Potato annotation platform and recruit 44 medical experts through Prolific, a well-known crowdsourcing platform. We annotated 300 of these 919 incorrect data points at a granular level for different classes and created multi-label spans to identify the reasons behind the errors. In our annotated dataset, a substantial portion of GPT-4's incorrect responses is categorized by annotators as a "Reasonable response by GPT-4." This sheds light on the challenge of discerning explanations that may lead to incorrect options, even for trained medical professionals. We also provide medical concepts and medical semantic predications extracted using the SemRep tool for every data point. We believe this will aid in evaluating the ability of LLMs to answer complex medical questions. We make the resources available at https://github.com/roysoumya/usmle-gpt4-error-taxonomy.
2023
- GENEMASK: Fast Pretraining of Gene Sequences to Enable Few-Shot Learning. Soumyadeep Roy, Jonas Wallat, Sowmya S. Sundaram, and 2 more authors. In 26th European Conference on Artificial Intelligence (ECAI 2023), Sep 2023.
Large-scale language models such as DNABert and LOGO aim to learn optimal gene representations and are trained on the entire Human Reference Genome. However, standard tokenization schemes involve a simple sliding window of tokens, like k-mers, that does not leverage any gene-based semantics and thus may lead to (trivial) masking of easily predictable sequences and, subsequently, inefficient Masked Language Modeling (MLM) training. Therefore, we propose a novel masking algorithm, GENEMASK, for MLM training of gene sequences, where we randomly identify positions in a gene sequence as mask centers and locally select the span around each mask center with the highest Normalized Pointwise Mutual Information (NPMI) to mask. We observe that in the absence of human-understandable semantics in the genomics domain (in contrast, semantic units like words and phrases are inherently available in NLP), GENEMASK-based models substantially outperform the SOTA models (DNABert and LOGO) over four benchmark gene sequence classification datasets in five few-shot settings (10 to 1000-shot). More significantly, the GENEMASK-based DNABert model is trained for less than one-tenth the number of epochs of the original SOTA model. We also observe a strong correlation between top-ranked PMI tokens and conserved DNA sequence motifs, which may indicate the incorporation of latent genomic information. The codes (including trained models) and datasets are made publicly available at https://github.com/roysoumya/GeneMask.
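The span-selection step of GENEMASK, as described in the abstract, can be sketched as follows. This is a simplified illustration under assumptions: a precomputed NPMI lookup over fixed-length token windows, and a hypothetical function name and signature.

```python
import random

def genemask_select_spans(tokens, npmi, num_centers, span_len):
    """Sketch of GENEMASK-style span selection (hypothetical helper, not
    the authors' code): pick random mask centers, then around each center
    choose the window of `span_len` tokens with the highest NPMI to mask.
    `npmi` maps a tuple of tokens to its normalized PMI score."""
    masked = set()
    centers = random.sample(range(len(tokens)), num_centers)
    for c in centers:
        best_start, best_score = None, float("-inf")
        # Candidate windows of length span_len that contain the center
        for start in range(max(0, c - span_len + 1),
                           min(c, len(tokens) - span_len) + 1):
            window = tuple(tokens[start:start + span_len])
            score = npmi.get(window, 0.0)
            if score > best_score:
                best_start, best_score = start, score
        masked.update(range(best_start, best_start + span_len))
    return sorted(masked)
```

In an actual MLM pipeline, the returned positions would be replaced with the mask token before computing the prediction loss.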
- Interpretable Clinical Trial Search using Pubmed Citation Network. Soumyadeep Roy, Niloy Ganguly, Shamik Sural, and 1 more author. In 2023 IEEE International Conference on Digital Health (ICDH), Sep 2023.
Clinical trials are an essential source of information for practicing Evidence-Based Medicine because they help to determine the efficacy of newly developed treatments and drugs. However, most of the existing trial search systems focus on a specific disease (e.g., cancer) and utilize disease-specific knowledge bases that hinder the adaptation of such methods to new diseases. In this work, we overcome both limitations and propose a graph-based model that explores both clinical trials and the Pubmed databases to alleviate the shortage of relevant clinical trials for a query. We construct a large heterogeneous graph (750K nodes and 1.2 Million edges) made of clinical trials and Pubmed articles linked to clinical trials. As both the graph edges and nodes are labeled, we develop a novel metapath-based similarity search (MPSS) method to retrieve and rank clinical trials across multiple disease classes. We primarily focus on consumers and users that do not have any prior medical knowledge. As there are no multiple disease-wide trial search evaluation datasets, we contribute a high-quality, well-annotated query-relevant trial set comprising around 25 queries and, on average, approximately 95 annotated trials per query. We also perform a detailed evaluation of MPSS on the TREC Precision Medicine Benchmark Dataset, a disease-specific clinical trial search setting. We make all the codes and data publicly available at https://github.com/roysoumya/MPSS-clinical-trial-search.
2021
- Knowledge-Aware Neural Networks for Medical Forum Question Classification. Soumyadeep Roy, Sudip Chakraborty, Aishik Mandal, and 6 more authors. In Proceedings of the 30th ACM International Conference on Information & Knowledge Management, Sep 2021.
Online medical forums have become a predominant platform for answering health-related information needs of consumers. However, with a significant rise in the number of queries and the limited availability of experts, it is necessary to automatically classify medical queries based on a consumer’s intention, so that these questions may be directed to the right set of medical experts. Here, we develop a novel medical knowledge-aware BERT-based model (MedBERT) that explicitly gives more weightage to medical concept-bearing words, and utilize domain-specific side information obtained from a popular medical knowledge base. We also contribute a multi-label dataset for the Medical Forum Question Classification (MFQC) task. MedBERT achieves state-of-the-art performance on two benchmark datasets and performs very well in low resource settings.
- An Integrated Approach for Improving Brand Consistency of Web Content: Modeling, Analysis, and Recommendation. Soumyadeep Roy, Shamik Sural, Niyati Chhaya, and 2 more authors. ACM Trans. Web, May 2021.
A consumer-dependent (business-to-consumer) organization tends to present itself as possessing a set of human qualities, which is termed the brand personality of the company. This perception is impressed upon the consumer through the content the organization produces, be it in the form of advertisements, blogs, or magazines. A consistent brand will generate trust and retain customers over time, as they develop an affinity toward regularity and common patterns. However, maintaining a consistent messaging tone for a brand has become more challenging with the virtual explosion in the amount of content that needs to be authored and pushed to the Internet to maintain an edge in the era of digital marketing. To understand the depth of the problem, we collect around 300K web pages from around 650 companies. We develop trait-specific classification models based on the linguistic features of the content. The classifier automatically identifies web articles that are not consistent with the mission and vision of a company and further helps us discover the conditions under which consistency cannot be maintained. To address the brand inconsistency issue, we then develop a sentence ranking system that outputs the top three sentences that need to be changed to make a web article more consistent with the company's brand personality.
2019
- Understanding Brand Consistency from Web Content. Soumyadeep Roy, Niloy Ganguly, Shamik Sural, and 2 more authors. In Proceedings of the 10th ACM Conference on Web Science, May 2019.
Brands produce content to engage continually with their audience and tend to maintain a set of human characteristics in their marketing campaigns. In this era of digital marketing, they need to create a lot of content to keep up engagement with their audiences. However, such content authoring at scale introduces challenges in maintaining consistency in a brand's messaging tone, which is very important from a brand's perspective to ensure a persistent impression for its customers and audiences. In this work, we quantify brand personality and formulate its linguistic features. We score text articles extracted from brand communications on five personality dimensions: sincerity, excitement, competence, ruggedness, and sophistication, and show that a linear SVM model achieves a decent F1 score of 0.822. The linear SVM allows us to annotate a large set of data points free of any annotation error. We utilize this large annotated dataset to characterize the notion of brand consistency, i.e., maintaining a company's targeted brand personality across time and over different content categories, and make several interesting observations. To the best of our knowledge, this is the first study to investigate brand personality from companies' official websites and to formulate and analyze the notion of brand consistency at such a large scale.
- Towards an Aspect-Based Ranking Model for Clinical Trial Search. Soumyadeep Roy, Koustav Rudra, Nikhil Agrawal, and 2 more authors. In Computational Data and Social Networks, May 2019.
Clinical trials are crucial for the practice of evidence-based medicine. They provide updated and essential health-related information for patients. Sometimes, clinical trials are the first source of information about new drugs and treatments. Different stakeholders, such as trial volunteers, trial investigators, and meta-analysis researchers, often need to search for trials. In this paper, we propose an automated method to retrieve relevant trials based on the overlap of UMLS concepts between the user query and clinical trials. However, different stakeholders may have different information needs, and accordingly, we rank the retrieved clinical trials based on the following four aspects: Relevancy, Adversity, Recency, and Popularity. We aim to develop a clinical trial search system that covers multiple disease classes, instead of focusing only on the retrieval of oncology-based clinical trials. We follow a rigorous annotation scheme and create an annotated retrieval set for 25 queries across five disease categories. Our proposed method performs better than the baseline model in almost 90% of cases. We also measure the correlation between the different aspect-based ranking lists and observe a strongly negative Spearman's rank correlation coefficient between popularity and recency.
2018
- Understanding Email Interactivity and Predicting User Response to Email. Soumyadeep Roy, Nibir Pal, Kousik Dasgupta, and 1 more author. May 2018.
Email is important for task and project management, information exchange, scheduling, and social communication among users. Understanding the patterns of interaction between an email and its recipients, as well as the factors that determine user replying behavior, helps address the problem of email overload, which users face due to the increasing volume of email traffic. In this paper, we develop a binary classification model to predict an email recipient's response, based on certain email metadata and recipient email usage characteristics. For this task, we study a 2016 HackerRank contest email dataset. We first identify email interactivity patterns separately for each recipient response and then examine the factors responsible for determining the user response to an email. We then propose a novel feature selection methodology in which we profile the dataset based on user and sent-day characteristics using k-means clustering. We observe that the Decision Tree classifier performs best, with an F1 score of 0.6279, and that the fraction of emails opened by the recipient in the past is the most significant feature for the model.