Hi, I'm Antonio

I am an Assistant Professor of Computer Science at William & Mary. My research lies at the intersection of Artificial Intelligence (AI), Natural Language Processing (NLP), and Software Engineering (SE), with a strong emphasis on automating SE practices. My work promotes explainability, efficiency, and optimization from both a model-centric perspective (e.g., the robustness and adaptability of foundation models such as GitHub Copilot) and an output-centric perspective (e.g., the documentation and summarization of code components). More broadly, my research addresses the reliability and efficiency of AI systems for SE, advancing next-generation intelligent tools that improve transparency, scalability, and developer productivity.

Resource-Efficient AI for Code

Developing scalable, resource-efficient AI techniques — quantization, parameter-efficient fine-tuning, knowledge distillation — to make code intelligence models practical for real-world deployment and developer productivity.
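As a rough illustration of why parameter-efficient fine-tuning shrinks the training budget, the parameter counts behind low-rank adaptation (LoRA) can be computed directly. The dimensions below are hypothetical, not tied to any specific model:

```python
# Sketch: full fine-tuning trains every weight of a projection matrix,
# while LoRA freezes it and trains two small low-rank factors instead.
# Dimensions are illustrative only.

def full_finetune_params(d_in: int, d_out: int) -> int:
    """Every weight of a d_in x d_out projection is trainable."""
    return d_in * d_out

def lora_params(d_in: int, d_out: int, rank: int) -> int:
    """LoRA trains factors A (d_in x r) and B (r x d_out) and
    freezes the original weight matrix."""
    return d_in * rank + rank * d_out

d = 4096  # hidden size of a hypothetical transformer layer
r = 8     # LoRA rank
full = full_finetune_params(d, d)
lora = lora_params(d, d, r)
print(f"full fine-tuning: {full:,} trainable params")
print(f"LoRA (rank {r}):  {lora:,} trainable params "
      f"({100 * lora / full:.2f}% of full)")
```

With these toy numbers, the low-rank update trains well under 1% of the parameters the full projection would require, which is the core of the efficiency argument.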

AI Agents for SE

Designing autonomous AI agents that plan, reason, and execute multi-step software engineering workflows — from issue resolution to code review — with minimal human intervention.
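The plan-and-execute loop such agents follow can be sketched in miniature. Every tool, step, and name below is a hypothetical stand-in for what a real agent would delegate to LLM-driven components:

```python
# Toy sketch of an agent-style plan-and-execute loop for an SE workflow
# (issue triage -> fault localization -> patch -> test). The tools and
# the fixed plan are stand-ins for LLM-generated plans and real tooling.

def read_issue(state):
    state["issue"] = "NullPointerException in UserService.login"
    return state

def locate_fault(state):
    # A real agent would run fault localization; here we record a guess.
    state["suspect"] = "UserService.java:42"
    return state

def propose_patch(state):
    state["patch"] = f"add null check near {state['suspect']}"
    return state

def run_tests(state):
    # Stand-in for executing the project's test suite.
    state["tests_pass"] = "null check" in state["patch"]
    return state

PLAN = [read_issue, locate_fault, propose_patch, run_tests]

def run_agent():
    state = {}
    for step in PLAN:  # execute the multi-step plan, threading shared state
        state = step(state)
    return state

result = run_agent()
print(result["tests_pass"])
```

A real agent would generate and revise the plan dynamically rather than follow a fixed list; the shared mutable state threaded through each step is the part that carries over.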

Neurosymbolic Program Reasoning

Combining neural language models with symbolic reasoning — grammars, type systems, program analysis — to build AI tools that are both powerful and formally grounded for software comprehension.
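A minimal sketch of this combination, using Python's own grammar (via the standard-library `ast` module) as the symbolic component and a toy stand-in for the neural scorer:

```python
import ast

# Sketch of the neurosymbolic idea: a (mocked) neural scorer ranks
# candidate snippets, while a symbolic check -- here the language grammar
# via ast.parse -- vetoes candidates that are not even syntactically valid.

def neural_score(snippet: str) -> float:
    """Stand-in for a learned model's preference score."""
    return len(snippet)  # toy heuristic: longer candidate scores higher

def symbolically_valid(snippet: str) -> bool:
    """Symbolic gate: the snippet must parse under the language grammar."""
    try:
        ast.parse(snippet)
        return True
    except SyntaxError:
        return False

candidates = [
    "def add(a, b): return a + b",
    "def add(a, b) return a + b + 0",  # scores higher, but invalid syntax
]

# The symbolic gate filters first; the neural scorer only ranks survivors.
valid = [c for c in candidates if symbolically_valid(c)]
best = max(valid, key=neural_score)
print(best)
```

The ordering matters: the unfiltered scorer would have preferred the syntactically broken candidate, which is exactly the failure mode the symbolic gate removes.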

Causality for Software Systems

Applying causal reasoning and counterfactual analysis to understand cause-and-effect in software systems — from root-cause debugging to evaluating the true impact of AI-driven interventions on developer workflows.
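A toy simulation (entirely synthetic numbers) of why intervention matters here: when a confounder such as developer experience drives both AI-tool adoption and productivity, the observational gap overstates the tool's true effect:

```python
import random

# Synthetic illustration: experience affects both tool adoption and
# productivity, so comparing adopters to non-adopters observationally
# inflates the tool's effect. Randomized assignment (do(uses_tool))
# recovers the true effect. All numbers are made up.

random.seed(0)
TRUE_EFFECT = 2.0  # productivity points actually added by the tool

def productivity(experience, uses_tool):
    return 5 * experience + TRUE_EFFECT * uses_tool + random.gauss(0, 0.5)

def mean(xs):
    return sum(xs) / len(xs)

# Observational world: experienced developers adopt the tool more often.
obs = []
for _ in range(10_000):
    exp = random.random()
    uses = exp > 0.5  # adoption is confounded by experience
    obs.append((uses, productivity(exp, uses)))
naive = mean([p for u, p in obs if u]) - mean([p for u, p in obs if not u])

# Interventional world: tool use is assigned at random.
rct = []
for _ in range(10_000):
    exp = random.random()
    uses = random.random() > 0.5  # do(uses_tool): independent of experience
    rct.append((uses, productivity(exp, uses)))
causal = mean([p for u, p in rct if u]) - mean([p for u, p in rct if not u])

print(f"naive observational gap: {naive:.2f}")  # inflated by confounding
print(f"randomized estimate:     {causal:.2f}")  # close to TRUE_EFFECT
```

With the chosen coefficients the naive gap lands near 4.5 while the randomized estimate stays near the true effect of 2.0, which is the gap causal analysis is meant to close.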

Publications

2025 · 10 papers
Saima Afrin, Md Zahidul Haque, Antonio Mastropaolo
A Systematic Literature Review of Parameter-Efficient Fine-Tuning for Large Code Models
TOSEM 2025
The rise of Artificial Intelligence (AI), and particularly Large Language Models (LLMs) for code, has reshaped Software Engineering (SE) by enabling the automation of tasks such as code generation, bug detection, and repair. However, these models require significant computational resources for training and fine-tuning, posing challenges for real-world adoption in resource-constrained environments. To address this, the research community has increasingly turned to Parameter-Efficient Fine-Tuning (PEFT), a class of techniques that enables the adaptation of large models by updating only a small subset of parameters rather than the entire model. In this Systematic Literature Review (SLR), we examine the growing application of PEFT techniques across a wide range of software engineering tasks. We analyze how these methods are used to optimize various deep learning (DL) architectures, focusing on their impact on both performance and efficiency. Our study synthesizes findings from 28 peer-reviewed papers, identifying patterns in configuration strategies and adaptation trade-offs. The outcome of this review is a comprehensive taxonomy that categorizes PEFT usage by task type, distinguishing between generative and non-generative scenarios. Our findings aim to inform future research and guide the practical deployment of PEFT in sustainable, AI-powered software development.
@article{afrin2025peft,
  title={A Systematic Literature Review of Parameter-Efficient Fine-Tuning for Large Code Models},
  author={Saima Afrin and Md Zahidul Haque and Antonio Mastropaolo},
  journal={ACM Transactions on Software Engineering and Methodology (TOSEM)},
  year={2025}
}
Giuseppe Crupi, Rosalia Tufano, Alejandro Velasco, Antonio Mastropaolo, Denys Poshyvanyk, Gabriele Bavota
On the Effectiveness of LLM-as-a-Judge for Code Generation and Summarization
TSE 2025
Large Language Models (LLMs) have been recently exploited as judges for complex natural language processing tasks, such as Q&A. The basic idea is to delegate to an LLM the assessment of the "quality" of the output provided by an automated technique for tasks for which: (i) quantitative metrics would only tell part of the story, and; (ii) a large-scale human-based evaluation would be too expensive. We study the effectiveness of LLMs-as-a-judge for two code-related tasks, namely code generation and code summarization. For code generation, we check whether eight LLMs are able to judge the correctness of 1,405 Java methods and 1,281 Python functions generated by the same LLMs or implemented by humans. For code summarization, we compare the judgments of five LLMs to those provided by nine humans for ~1.2k summaries, related to both Java and Python functions. Our findings show that GPT-4-turbo is the best LLM in terms of judging capabilities for both tasks, with "smaller" LLMs featuring tens of billions of parameters not being able to cope with judging tasks. However, even the best-performing LLM frequently misjudges the correctness of the code and summary quality.
@article{crupi2025llmjudge,
  title={On the Effectiveness of LLM-as-a-Judge for Code Generation and Summarization},
  author={Giuseppe Crupi and Rosalia Tufano and Alejandro Velasco and Antonio Mastropaolo and Denys Poshyvanyk and Gabriele Bavota},
  journal={IEEE Transactions on Software Engineering (TSE)},
  year={2025}
}
Honglin Shu, Dong Wang, Antonio Mastropaolo, Gabriele Bavota, Yasutaka Kamei
An Empirical Study on Language Models for Generating Log Statements in Test Code
TOSEM 2025
Log statements play a critical role in modern software development, capturing essential runtime information necessary for software maintenance. We conduct an empirical study on 5,206,759 Java test methods collected from 6,405 GitHub projects to explore and disclose the effectiveness and limitations of Pre-trained Language Models (PLMs) and Large Language Models (LLMs) for generating and injecting test log statements. Our findings demonstrate that general-purpose LLMs like GPT-3.5-Turbo, when properly instructed, perform comparably to the best-performing PLMs on predicting log level. Additionally, GPT-3.5-Turbo substantially outperforms the best PLM on predicting log position, with a 33.97% improvement, while also achieving superior performance in predicting log messages in terms of BLEU and ROUGE.
@article{shu2025testlogging,
  title={An Empirical Study on Language Models for Generating Log Statements in Test Code},
  author={Honglin Shu and Dong Wang and Antonio Mastropaolo and Gabriele Bavota and Yasutaka Kamei},
  journal={ACM Transactions on Software Engineering and Methodology (TOSEM)},
  year={2025}
}
Antonio Mastropaolo, Camilo Escobar-Velásquez, Mario Linares-Vásquez
From Triumph to Uncertainty: The Journey of Software Engineering in the AI Era
TOSEM 2025
Over the last ten years, the realm of Artificial Intelligence (AI) has experienced an explosion of revolutionary breakthroughs, transforming what seemed like a far-off dream into a reality that is now deeply embedded in our everyday lives. In this paper, we aim at outlining the key elements that, based on our expertise, are vital for the smooth integration of AI into SE, all while preserving the intrinsic human creativity that has been the driving force behind the field. We delve into the intricate interplay between AI-driven automation and human innovation, exploring how these two components can work together to advance SE practices toward new methods and standards.
@article{mastropaolo2025triumph,
  title={From Triumph to Uncertainty: The Journey of Software Engineering in the AI Era},
  author={Antonio Mastropaolo and Camilo Escobar-Vel\'{a}squez and Mario Linares-V\'{a}squez},
  journal={ACM Transactions on Software Engineering and Methodology (TOSEM)},
  year={2025}
}
Saima Afrin, Bowen Xu, Antonio Mastropaolo
Is Quantization a Deal-breaker? Empirical Insights from Large Code Models
ICSME 2025
The growing scale of large language models (LLMs) not only demands extensive computational resources but also raises environmental concerns due to their increasing carbon footprint. Model quantization emerges as an effective approach that can reduce the resource demands of LLMs by decreasing parameter precision without substantially affecting performance. Our study investigates the effects of quantization on the qualitative aspects of automatically generated code. We apply Activation-aware Weight Quantization (AWQ) to two widely used code models, CodeLlama and DeepSeekCoder, to generate Java and Python code. Our findings reveal that quantization is a robust technique that not only preserves functional correctness, but also retains key qualitative code attributes sought after by developers, such as maintainability and structural simplicity.
@inproceedings{afrin2025quantization,
  title={Is Quantization a Deal-breaker? Empirical Insights from Large Code Models},
  author={Saima Afrin and Bowen Xu and Antonio Mastropaolo},
  booktitle={IEEE 41st International Conference on Software Maintenance and Evolution (ICSME)},
  year={2025}
}
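As a rough sketch of the idea behind weight quantization (plain uniform 8-bit rounding, not the AWQ algorithm studied above, which additionally scales weights by activation statistics):

```python
# Uniform 8-bit quantization sketch: map float weights to integers in
# [-127, 127] with a single per-tensor scale, then map them back.
# Toy weights; not from any real model.

def quantize(weights):
    scale = max(abs(w) for w in weights) / 127
    q = [round(w / scale) for w in weights]
    return q, scale

def dequantize(q, scale):
    return [v * scale for v in q]

weights = [0.42, -1.37, 0.05, 0.99, -0.73]
q, scale = quantize(weights)
restored = dequantize(q, scale)

# Rounding bounds the per-weight error by half the scale step.
max_err = max(abs(a - b) for a, b in zip(weights, restored))
print(q)
print(f"max reconstruction error: {max_err:.4f}")
```

Each weight now fits in one byte instead of four, and the reconstruction error stays below half a quantization step, which is why precision loss is often tolerable in practice.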
Antonio Mastropaolo, Denys Poshyvanyk
A Path Less Traveled: Reimagining Software Engineering Automation via a Neurosymbolic Paradigm
AI-SDLC 2025
The emergence of Large Code Models (LCMs) has transformed software engineering (SE) automation, driving significant advancements in tasks such as code generation, source code documentation, code review, and bug fixing. However, these advancements come with trade-offs: achieving high performance often entails exponential computational costs, reduced interpretability, and an increasing dependence on data-intensive models. In this paper, we propose Neurosymbolic Software Engineering (NSE) as a promising paradigm combining neural learning with symbolic (rule-based) reasoning, while strategically introducing a controlled source of chaos to simulate the complex dynamics of real-world software systems.
@inproceedings{mastropaolo2025neurosymbolic,
  title={A Path Less Traveled: Reimagining Software Engineering Automation via a Neurosymbolic Paradigm},
  author={Antonio Mastropaolo and Denys Poshyvanyk},
  booktitle={The 1st International Workshop on Envisioning the AI-Augmented SDLC (AI-SDLC), co-located with FSE 2025},
  year={2025}
}
Francesco Casillo, Antonio Mastropaolo, Gabriele Bavota, Vincenzo Deufemia, Carmine Gravino
Towards Generating the Rationale for Code Changes
ICPC 2025
Commit messages are essential to understand changes in software projects, providing a way for developers to communicate code evolution. Our study explores a more complex task: generating rationale explanations for code changes. We developed a method to identify rationale sentences in commit messages and compiled a dataset of 45,945 commits with their corresponding rationales. While the approach we engineered for the extraction of rationale exhibited a 75% precision, the model trained to generate the rationale only worked in a minority of cases. Our findings highlight the difficulty of the tackled task and the need for additional research in the area.
@inproceedings{casillo2025rationale,
  title={Towards Generating the Rationale for Code Changes},
  author={Francesco Casillo and Antonio Mastropaolo and Gabriele Bavota and Vincenzo Deufemia and Carmine Gravino},
  booktitle={IEEE/ACM 33rd International Conference on Program Comprehension (ICPC -- ReNE)},
  year={2025}
}
Antonio Vitale, Antonio Mastropaolo, Rocco Oliveto, Massimiliano Di Penta, Simone Scalabrino
Optimizing Datasets for Code Summarization: Is Code-Comment Coherence Enough?
ICPC 2025
Automated code summarization is a long-standing goal for code comprehension. We explore the extent to which code-comment coherence can be used to optimize code summarization datasets. We examine multiple selectivity levels of training instances from two state-of-the-art datasets (TL-CodeSum and Funcom) and evaluate the resulting models on three manually curated test sets. The results show that even halving the training set sizes does not significantly affect the model's ability to generate summaries. However, when comparing the most restrictive selection strategy with a simpler one that randomly selects instances, the resulting accuracy does not change, suggesting that current datasets contain many irrelevant examples and different quality attributes should be explored.
@inproceedings{vitale2025coherence,
  title={Optimizing Datasets for Code Summarization: Is Code-Comment Coherence Enough?},
  author={Antonio Vitale and Antonio Mastropaolo and Rocco Oliveto and Massimiliano Di Penta and Simone Scalabrino},
  booktitle={IEEE/ACM 33rd International Conference on Program Comprehension (ICPC)},
  year={2025}
}
Alejandro Velasco, Aya Garryyeva, David Nader Palacio, Antonio Mastropaolo, Denys Poshyvanyk
Toward Neurosymbolic Program Comprehension
ICPC 2025
Recent advancements in Large Language Models (LLMs) have paved the way for Large Code Models (LCMs), enabling automation in complex software engineering tasks. However, the ambition to scale these models to trillion-parameter sizes poses significant challenges, including rising computational demands and issues related to trustworthiness, bias, and interpretability. In this paper, we advocate for a Neurosymbolic research direction that combines the strengths of existing DL techniques with traditional symbolic methods, renowned for their reliability, speed, and determinism. We outline the core features and present preliminary results for our envisioned approach, aimed at establishing the first NeuroSymbolic Program Comprehension (NsPC) framework to aid in identifying defective code components.
@inproceedings{velasco2025nspc,
  title={Toward Neurosymbolic Program Comprehension},
  author={Alejandro Velasco and Aya Garryyeva and David Nader Palacio and Antonio Mastropaolo and Denys Poshyvanyk},
  booktitle={IEEE/ACM 33rd International Conference on Program Comprehension (ICPC -- ERA)},
  year={2025}
}
Saima Afrin, Joseph Call, Khai Nguyen, Oscar Chaparro, Antonio Mastropaolo
Resource-Efficient and Effective Code Summarization
FORGE 2025
Code Language Models (CLMs) have demonstrated high effectiveness in automating software engineering tasks such as bug fixing, code generation, and code documentation. However, as models grow in scale, sustainability concerns emerge. GreenAI techniques, such as QLoRA (Quantized Low-Rank Adaptation), offer a promising path for dealing with large models' sustainability. We investigate the extent to which QLoRA's capabilities in NL-to-Code tasks can be leveraged and transferred to code summarization. Our study evaluates two state-of-the-art CLMs across two programming languages. The findings confirm that QLoRA not only allows efficient fine-tuning of CLMs for code summarization but also achieves the best results with minimal parameter adjustment compared to full model fine-tuning.
@inproceedings{afrin2025qlora,
  title={Resource-Efficient and Effective Code Summarization},
  author={Saima Afrin and Joseph Call and Khai Nguyen and Oscar Chaparro and Antonio Mastropaolo},
  booktitle={ACM 2nd International Conference on AI Foundation Models and Software Engineering (FORGE)},
  year={2025}
}
2024 · 10 papers
Rosalia Tufano, Ozren Dabic, Antonio Mastropaolo, Matteo Ciniselli, Gabriele Bavota
Code Review Automation: Strengths and Weaknesses of the State of the Art
TSE 2024
We aim at characterizing the cases in which three code review automation techniques tend to succeed or fail. The study has a strong qualitative focus, with ~105 man-hours of manual inspection invested in manually analyzing correct and wrong predictions generated by the three techniques, for a total of 2,291 inspected predictions. This analysis yielded two taxonomies reporting the types of code changes on which the experimented techniques tend to succeed or fail, pointing to areas for future work. Finally, we assess the importance of research on techniques specialized for code review automation by comparing their performance with ChatGPT, finding that ChatGPT struggles to comment on code as a human reviewer would.
@article{tufano2024codereview,
  title={Code Review Automation: Strengths and Weaknesses of the State of the Art},
  author={Rosalia Tufano and Ozren Dabic and Antonio Mastropaolo and Matteo Ciniselli and Gabriele Bavota},
  journal={IEEE Transactions on Software Engineering (TSE)},
  year={2024}
}
Antonio Mastropaolo, Valentina Ferrari, Luca Pascarella, Gabriele Bavota
Log Statements Generation via Deep Learning: Widening the Support Provided to Developers
JSS 2024
Logging assists in monitoring events that transpire during the execution of software. We introduced LANCE, an approach rooted in deep learning that has demonstrated the ability to correctly inject a log statement into Java methods in ~15% of cases. To address its limitations, we present LEONID, a DL-based technique that can distinguish between methods that do and do not require the inclusion of log statements. Furthermore, LEONID supports the injection of multiple log statements within a given method when necessary, and it also enhances LANCE's proficiency in generating meaningful log messages through the combination of DL and Information Retrieval (IR).
@article{mastropaolo2024logging,
  title={Log Statements Generation via Deep Learning: Widening the Support Provided to Developers},
  author={Antonio Mastropaolo and Valentina Ferrari and Luca Pascarella and Gabriele Bavota},
  journal={Elsevier Journal of Systems and Software (JSS)},
  year={2024}
}
Antonio Mastropaolo, Emad Aghajani, Luca Pascarella, Gabriele Bavota
Automated Variable Renaming: Are We There Yet?
EMSE 2024
Identifiers form a large portion of source code. Therefore, low-quality identifiers can substantially hinder code comprehension. We present a large-scale study investigating the potential of data-driven approaches to support automated variable renaming. We experiment with three state-of-the-art techniques: a statistical language model and two DL-based models. Our quantitative and qualitative analyses show the potential of such techniques that, under specific conditions, can provide valuable recommendations and are ready to be integrated in rename refactoring tools.
@article{mastropaolo2024renaming,
  title={Automated Variable Renaming: Are We There Yet?},
  author={Antonio Mastropaolo and Emad Aghajani and Luca Pascarella and Gabriele Bavota},
  journal={Springer Empirical Software Engineering (EMSE)},
  year={2024}
}
Federica Pepe, Fiorella Zampetti, Antonio Mastropaolo, Gabriele Bavota, Massimiliano Di Penta
A Taxonomy of Self-Admitted Technical Debt in Deep Learning Systems
ICSME 2024
This paper empirically analyzes the presence of Self-Admitted Technical Debt (SATD) in DL systems. After selecting 100 open-source Python projects using popular DL frameworks, we identified SATD from their source comments and created a stratified sample of 443 SATD to analyze manually. We derived a taxonomy of DL-specific SATD through open coding, featuring seven categories and 41 leaves. Our findings indicate that DL-specific SATD differs from DL bugs found in previous studies, as it typically pertains to suboptimal solutions rather than functional problems.
@inproceedings{pepe2024satd,
  title={A Taxonomy of Self-Admitted Technical Debt in Deep Learning Systems},
  author={Federica Pepe and Fiorella Zampetti and Antonio Mastropaolo and Gabriele Bavota and Massimiliano Di Penta},
  booktitle={IEEE 40th International Conference on Software Maintenance and Evolution (ICSME)},
  year={2024}
}
Antonio Mastropaolo, Camilo Escobar-Velásquez, Mario Linares-Vásquez
The Rise and Fall (?) of Software Engineering
SE2030 2024
Over the last ten years, the realm of Artificial Intelligence has experienced an explosion of revolutionary breakthroughs. We aim at outlining the key elements vital for the smooth integration of AI into SE, all while preserving the intrinsic human creativity that has been the driving force behind the field. We delve into the intricate interplay between AI-driven automation and human innovation, exploring how these two components can work together to advance SE practices.
@inproceedings{mastropaolo2024risefall,
  title={The Rise and Fall (?) of Software Engineering},
  author={Antonio Mastropaolo and Camilo Escobar-Vel\'{a}squez and Mario Linares-V\'{a}squez},
  booktitle={Software Engineering in 2030 (SE2030)},
  year={2024}
}
Rosalia Tufano, Antonio Mastropaolo, Federica Pepe, Ozren Dabic, Massimiliano Di Penta, Gabriele Bavota
Unveiling ChatGPT's Usage in Open Source Projects: A Mining-based Study
MSR 2024 Distinguished Paper Award
We mine 1,501 commits, pull requests, and issues from open-source projects by matching regular expressions likely to indicate the usage of ChatGPT. Then, we manually analyze these instances, categorizing the task automated in the 467 true positive instances (165 commits, 159 PRs, 143 issues). This resulted in a taxonomy of 45 tasks which developers automate via ChatGPT, providing developers with valuable insights on how to exploit LLMs in their workflow and researchers with a clear overview of tasks that could benefit from automated solutions.
@inproceedings{tufano2024chatgpt,
  title={Unveiling ChatGPT's Usage in Open Source Projects: A Mining-based Study},
  author={Rosalia Tufano and Antonio Mastropaolo and Federica Pepe and Ozren Dabic and Massimiliano Di Penta and Gabriele Bavota},
  booktitle={IEEE/ACM 21st International Conference on Mining Software Repositories (MSR)},
  year={2024}
}
Federica Pepe, Vittoria Nardone, Antonio Mastropaolo, Gerardo Canfora, Massimiliano Di Penta, Gabriele Bavota
How do Hugging Face Models Document Datasets, Bias, and Licenses? An Empirical Study
ICPC 2024 Distinguished Paper Award
Taking as a case study the transformer models hosted by Hugging Face, this paper empirically investigates the transparency of pre-trained transformer models. We look at the extent to which model descriptions (i) specify the datasets being used for their pre-training, (ii) discuss their possible training bias, (iii) declare their license, and whether projects using such models take these licenses into account. Results indicate that pre-trained models still provide limited disclosure of their training datasets, possible biases, and adopted licenses. Also, we found several cases of possible licensing violations by client projects.
@inproceedings{pepe2024huggingface,
  title={How do Hugging Face Models Document Datasets, Bias, and Licenses? An Empirical Study},
  author={Federica Pepe and Vittoria Nardone and Antonio Mastropaolo and Gerardo Canfora and Massimiliano Di Penta and Gabriele Bavota},
  booktitle={IEEE/ACM 32nd International Conference on Program Comprehension (ICPC)},
  year={2024}
}
Antonio Mastropaolo, Matteo Ciniselli, Luca Pascarella, Rosalia Tufano, Emad Aghajani, Gabriele Bavota
Towards Summarizing Code Snippets Using Pre-Trained Transformers
ICPC 2024
Most recent approaches exploit deep learning to automatically document classes or functions, while very little effort has been devoted to more fine-grained documentation. In this work, we take all steps needed to train a DL model to automatically document code snippets. First, we manually built a dataset featuring 6.6k comments. Second, we used it to train a multi-task DL model achieving 84% accuracy and recall/precision higher than 80%. Third, we ran this model on 10k open source projects, automatically building a large-scale dataset that has then been used to train a new DL model able to automatically document code snippets.
@inproceedings{mastropaolo2024snippets,
  title={Towards Summarizing Code Snippets Using Pre-Trained Transformers},
  author={Antonio Mastropaolo and Matteo Ciniselli and Luca Pascarella and Rosalia Tufano and Emad Aghajani and Gabriele Bavota},
  booktitle={IEEE/ACM 32nd International Conference on Program Comprehension (ICPC)},
  year={2024}
}
Antonio Mastropaolo, Matteo Ciniselli, Massimiliano Di Penta, Gabriele Bavota
Evaluating Code Summarization Techniques: A New Metric and an Empirical Characterization
ICSE 2024
We perform a thorough empirical investigation on the complementarity of different types of metrics in capturing the quality of a generated summary. We propose to address the limitations of existing metrics by considering a new dimension, capturing the extent to which the generated summary aligns with the semantics of the documented code snippet, independently from the reference summary. To this end, we present a new metric based on contrastive learning. We empirically show that the inclusion of this novel dimension enables a more effective representation of developers' evaluations regarding the quality of automatically generated summaries.
@inproceedings{mastropaolo2024metric,
  title={Evaluating Code Summarization Techniques: A New Metric and an Empirical Characterization},
  author={Antonio Mastropaolo and Matteo Ciniselli and Massimiliano Di Penta and Gabriele Bavota},
  booktitle={IEEE/ACM 46th International Conference on Software Engineering (ICSE)},
  year={2024}
}
Antonio Mastropaolo, Fiorella Zampetti, Gabriele Bavota, Massimiliano Di Penta
Toward Automatically Completing GitHub Workflows
ICSE 2024
We present GH-WCOM (GitHub Workflow COMpletion), a Transformer-based approach supporting developers in writing GitHub workflows. To deal with such a task, we designed an abstraction process to help the learning of the transformer while still making GH-WCOM able to recommend very peculiar workflow elements such as tool options and scripting elements. Our empirical study shows that GH-WCOM provides up to 34.23% correct predictions, and the model's confidence is a reliable proxy for the recommendations' correctness likelihood.
@inproceedings{mastropaolo2024ghwcom,
  title={Toward Automatically Completing GitHub Workflows},
  author={Antonio Mastropaolo and Fiorella Zampetti and Gabriele Bavota and Massimiliano Di Penta},
  booktitle={IEEE/ACM 46th International Conference on Software Engineering (ICSE)},
  year={2024}
}
2023 · 5 papers
Matteo Ciniselli, Nathan Cooper, Luca Pascarella, Antonio Mastropaolo, Emad Aghajani, Denys Poshyvanyk, Massimiliano Di Penta, Gabriele Bavota
An Empirical Study on the Usage of Transformer Models for Code Completion
TSE 2023
We present a large-scale study exploring the capabilities of state-of-the-art Transformer-based models in supporting code completion at different granularity levels, including single tokens, one or multiple entire statements, up to entire code blocks. We experimented with several variants of RoBERTa and the Text-To-Text Transfer Transformer (T5). The achieved results show that Transformer-based models, and in particular the T5, represent a viable solution for code completion, with perfect predictions ranging from ~29% up to ~69%.
@article{ciniselli2023codecompletion,
  title={An Empirical Study on the Usage of Transformer Models for Code Completion},
  author={Matteo Ciniselli and Nathan Cooper and Luca Pascarella and Antonio Mastropaolo and Emad Aghajani and Denys Poshyvanyk and Massimiliano Di Penta and Gabriele Bavota},
  journal={IEEE Transactions on Software Engineering (TSE)},
  year={2023}
}
Antonio Mastropaolo, Nathan Cooper, David Nader Palacio, Simone Scalabrino, Denys Poshyvanyk, Rocco Oliveto, Gabriele Bavota
Using Transfer Learning for Code-Related Tasks
TSE 2023
We assess the performance of the T5 model in supporting four different code-related tasks: automatic bug-fixing, injection of code mutants, generation of assert statements, and code summarization. We pay particular attention in studying the role played by pre-training and multi-task fine-tuning on the model's performance. We show that the T5 can achieve better performance than state-of-the-art baselines, and that while pre-training helps the model, not all tasks benefit from multi-task fine-tuning.
@article{mastropaolo2023transfer,
  title={Using Transfer Learning for Code-Related Tasks},
  author={Antonio Mastropaolo and Nathan Cooper and David Nader Palacio and Simone Scalabrino and Denys Poshyvanyk and Rocco Oliveto and Gabriele Bavota},
  journal={IEEE Transactions on Software Engineering (TSE)},
  year={2023}
}
Antonio Mastropaolo, Massimiliano Di Penta, Gabriele Bavota
Towards Automatically Addressing Self-Admitted Technical Debt: How Far Are We?
ASE 2023
This paper empirically investigates the extent to which technical debt can be automatically paid back by neural-based generative models. We extract a dataset of 5,039 Self-Admitted Technical Debt removals from 595 open-source projects and experiment with seven different generative DL model configurations. Results indicate that the best model we experimented with is able to automatically fix ~2% to 8% of test instances, depending on the number of attempts. The model's pre-training plays a fundamental role in boosting performance.
@inproceedings{mastropaolo2023satd,
  title={Towards Automatically Addressing Self-Admitted Technical Debt: How Far Are We?},
  author={Antonio Mastropaolo and Massimiliano Di Penta and Gabriele Bavota},
  booktitle={IEEE/ACM 38th International Conference on Automated Software Engineering (ASE)},
  year={2023}
}
Antonio Mastropaolo, Luca Pascarella, Emanuela Guglielmi, Matteo Ciniselli, Simone Scalabrino, Rocco Oliveto, Gabriele Bavota
On the Robustness of Code Generation Techniques: An Empirical Study on GitHub Copilot
ICSE 2023
We present an empirical study in which we aim at understanding whether different but semantically equivalent natural language descriptions result in the same recommended function from GitHub Copilot. We asked Copilot to automatically generate 892 Java methods starting from their original Javadoc description. Our results show that modifying the description results in different code recommendations in ~46% of cases. Also, differences in the semantically equivalent descriptions might impact the correctness of the generated code in ~28% of cases.
@inproceedings{mastropaolo2023copilot,
  title={On the Robustness of Code Generation Techniques: An Empirical Study on GitHub Copilot},
  author={Antonio Mastropaolo and Luca Pascarella and Emanuela Guglielmi and Matteo Ciniselli and Simone Scalabrino and Rocco Oliveto and Gabriele Bavota},
  booktitle={IEEE/ACM 45th International Conference on Software Engineering (ICSE)},
  year={2023}
}
Giovanni Rosa, Antonio Mastropaolo, Simone Scalabrino, Gabriele Bavota, Rocco Oliveto
Automatically Generating Dockerfiles via Deep Learning: Challenges and Promises
ICSSP 2023
We present a study in which we aim at understanding to what extent Deep Learning can be used for generating entire Dockerfiles from scratch given a high-level specification of requirements. We defined a structured natural language specification for Dockerfile requirements and used a dataset with 670,982 instances to train and test a T5 model. The results of our evaluation show that T5 performs similarly to the more trivial IR-based baselines, and we report the open challenges associated with the application of deep learning in this context.
@inproceedings{rosa2023dockerfiles,
  title={Automatically Generating Dockerfiles via Deep Learning: Challenges and Promises},
  author={Giovanni Rosa and Antonio Mastropaolo and Simone Scalabrino and Gabriele Bavota and Rocco Oliveto},
  booktitle={IEEE/ACM 17th International Conference on Software and System Processes (ICSSP)},
  year={2023}
}
2021 · 3 papers
Simone Scalabrino, Antonio Mastropaolo, Rocco Oliveto, Gabriele Bavota
An Adaptive Search Budget Allocation Approach for Search-Based Test Case Generation
TOSEM 2021
We introduce Budget Optimization for Testing (BOT), an approach to adaptively allocate the search budget to the classes under test. BOT requires information about the branch coverage that will be achieved on each class with a given search budget. Therefore, we also introduce BRANCHOS, an approach that predicts coverage in a budget-aware way. The results of our experiments show that BRANCHOS can approximate the branch coverage in time with a low error, and BOT can significantly increase the coverage achieved by a test generation tool.
@article{scalabrino2021bot,
  title={An Adaptive Search Budget Allocation Approach for Search-Based Test Case Generation},
  author={Simone Scalabrino and Antonio Mastropaolo and Rocco Oliveto and Gabriele Bavota},
  journal={ACM Transactions on Software Engineering and Methodology (TOSEM)},
  year={2021}
}
Antonio Mastropaolo, Emad Aghajani, Luca Pascarella, Gabriele Bavota
An Empirical Study on Code Comment Completion
ICSME 2021
We tackle the problem of code comment completion: instead of generating a comment for a given code snippet from scratch, we investigate the extent to which state-of-the-art techniques can help developers write comments faster. We present a large-scale study in which we empirically assess how a simple n-gram model and the recently proposed T5 architecture perform in autocompleting a code comment the developer is typing. The results show the superiority of the T5 model, although the n-gram model remains a competitive solution.
@inproceedings{mastropaolo2021comment,
  title={An Empirical Study on Code Comment Completion},
  author={Antonio Mastropaolo and Emad Aghajani and Luca Pascarella and Gabriele Bavota},
  booktitle={IEEE/ACM 37th International Conference on Software Maintenance and Evolution (ICSME)},
  year={2021}
}
Antonio Mastropaolo, Simone Scalabrino, Nathan Cooper, David Nader Palacio, Denys Poshyvanyk, Rocco Oliveto, Gabriele Bavota
Studying the Usage of Text-To-Text Transfer Transformer to Support Code-Related Tasks
ICSE 2021
We empirically investigate how the T5 model performs when pre-trained and fine-tuned to support code-related tasks. We pre-train a T5 model on a dataset composed of natural language English text and source code. Then, we fine-tune the model by reusing datasets from four previous works targeting four tasks: bug fixing, code mutant injection, assert statement generation, and code comment generation. We show that our T5 model, which exploits additional data in the self-supervised pre-training phase, achieves performance improvements over the four baselines.
@inproceedings{mastropaolo2021t5,
  title={Studying the Usage of Text-To-Text Transfer Transformer to Support Code-Related Tasks},
  author={Antonio Mastropaolo and Simone Scalabrino and Nathan Cooper and David Nader Palacio and Denys Poshyvanyk and Rocco Oliveto and Gabriele Bavota},
  booktitle={IEEE/ACM 43rd International Conference on Software Engineering (ICSE)},
  year={2021}
}

Teaching

COLL 100: Prompt Engineering
Fall 2025
🎓 Undergraduate · ChatGPT · Copilot · DALL-E · Ethics

This first-year seminar introduces students to computing through the emerging practice of prompt engineering -- the use of natural language to direct advanced AI systems. By engaging with tools such as ChatGPT, Copilot, and DALL-E, students learn how clearly expressed ideas can be transformed into computational outcomes. The course emphasizes creativity, accessibility, and critical reflection, including the societal impact of AI, ethics, bias, and "tech for good" applications.

GenAI for Software Development
Spring 2025, Spring 2026
✍️ UG/Graduate · Deep Learning · Code Generation · LLMs

This course provides students with the foundational and technical skills needed to develop and apply Deep Learning-based tools, especially Generative AI methods, to enhance software development tasks like code generation and documentation. By the end, undergraduate students will understand Generative AI for software development, while graduate students will also be skilled in critically evaluating research and proposing innovative solutions.

AI for Software Engineering
Fall 2024, Fall 2025
🔬 Graduate · AI4SE · Automation · Research

This course is designed to equip students with an understanding of how recent advancements in Artificial Intelligence are leading to innovative automated practices in the realm of software engineering. Participants will investigate the usage of AI techniques to foster and automate software engineering processes while understanding their transformative impact on the software development lifecycle. Students will be introduced to the core concepts of conducting research at the intersection of AI and SE.

Awards & Grants

Distinguished Paper Award -- Technical Track, MSR 2024 (Mining Software Repositories Conference), Lisbon, Portugal
Distinguished Paper Award -- Research Track, ICPC 2024 (International Conference on Program Comprehension), Lisbon, Portugal
Distinguished Reviewer -- Main Track, FSE 2025, Trondheim, Norway
Distinguished Reviewer -- Demonstrations Track, FSE 2025, Trondheim, Norway
Distinguished Reviewer -- Research Track, ASE 2024, Sacramento, USA
Outstanding Reviewer Award -- JCST 2022 (Journal of Computer Science and Technology)

Service

Program Committees

2026
ICSE Research Track -- Rio de Janeiro, Brazil
2025
ASE Research Track -- Seoul, South Korea
ICSME Research Track -- Auckland, New Zealand
SCAM Research Track -- Auckland, New Zealand
FSE Research Track -- Trondheim, Norway
FSE Tools Demo Track -- Trondheim, Norway
Internetware Research Track -- Trondheim, Norway
NLBSE Workshop -- Ottawa, Canada
MSR Data and Tool Showcase Track -- Ottawa, Canada
ICPC ERA Track -- Ottawa, Canada
FORGE Industry Papers Track -- Ottawa, Canada
SANER Short Papers and Posters Track -- Montreal, Canada
2024
FORGE Research Track -- Lisbon, Portugal
ASE Research Track -- Sacramento, USA
SANER Short Papers and Posters Track -- Rovaniemi, Finland
NLBSE Workshop -- Lisbon, Portugal
2023
MSR Junior PC -- Melbourne, Australia
2022
MSR Shadow PC -- Pittsburgh, USA

Organizing Committees

Program Co-Chair -- 1st International Workshop on Benchmark Infrastructure for LLMs for Code (FSE 2025), Trondheim, Norway
Benchmarking Program Co-Chair -- FORGE 2025, Ottawa, Canada
Program Co-Chair -- ReSAISE 2024, Tsukuba, Japan
Social Media Chair -- FORGE 2024, Lisbon, Portugal
Proceedings Chair -- ICSME 2023, Bogota, Colombia

Community Service

Volunteer at ICSME 2023, Bogota, Colombia
Volunteer at ICSE 2023, Melbourne, Australia

Invited Talks

2026
Can You Trust Your AI Programmer? Trustworthiness & Evaluation of AI-Generated Code -- Guest Lecture, University of Central Florida (UCF), Spring 2026
2023
Deep-Learning for Software Engineering: Where are we heading to? -- University of Molise, Pesche, Italy
2022
On Automating Code-Related Tasks via Pre-trained Models of Code -- Invited talk, Microsoft VS Data Science Meeting

Blog

Mar 2026 · Opinion

Why I Think LLMs Won't Replace Developers

The hype is real, but so is the gap between generating code and engineering software. Here's why I think developers are safe -- for now.

Feb 2026 · Teaching

Teaching GenAI to the Next Generation

Reflections on designing a course about generative AI for software development. What worked, what surprised me, and what I'd change.

Jan 2026 · Academic Life

First Year on the Tenure Track

An honest look at the first year as an assistant professor -- the highs, the lows, and the lessons learned along the way.