Peer-reviewed work across the AI × software engineering community.
Peer-reviewed journal and conference papers spanning ICSE, FSE, ASE, TSE, TOSEM, EMSE, and friends — with citation counts triple-verified against OpenAlex, Semantic Scholar, and CrossRef. The view defaults to reviewed work in the current year; magazine articles and preprints live in their own tabs below.
Publication index.
Reviewed papers (journal & conference) come first; magazine pieces and preprints are split into their own de-emphasized tabs. Slice by year or zoom in on award-winning work — citation counts refresh nightly from a multi-source pipeline.
Reviewed papers
2026
5 papers
EMSE
2026
Evaluating the Impact of Post-Training Quantization on Large Language Models for Code Generation
Large Language Models (LLMs) have shown an impressive capability in code generation. An LLM's effectiveness generally increases with its size: the higher the number of trainable parameters, the better its ability to implement code. However, when it comes to deploying LLM-based code generators, larger LLMs pose significant challenges related to their memory (and, consequently, carbon) footprint. Previous work by Wei et al. proposed to leverage quantization techniques to reduce the memory footprint of LLM-based code generators without substantially degrading their effectiveness. In short, they studied LLMs featuring up to 16B parameters, quantizing their precision from floating point 32 bits down to int 8 bits and showing a limited impact on code generation performance. Given the fast pace at which LLM capabilities and quantization techniques are evolving, in this work we present a differentiated replication of the work by Wei et al. in which we consider (i) more recent and larger code-related LLMs, of up to 34B parameters; (ii) the latest advancements in model quantization techniques, which allow pushing the compression to the extreme quantization level of 2 bits per model parameter; and (iii) different types of calibration datasets to guide the quantization process, including code-specific ones. Our empirical evaluation reveals that the new frontier for LLM quantization is 4-bit precision, resulting in an average memory footprint reduction of 70% compared to the original model without observing any significant decrease in performance. Additionally, when the quantization becomes even more extreme (3 and 2 bits), a code-specific calibration dataset helps to limit the loss of performance.
EMSE
2026
Developers and Generative AI: A Study of Self-Admitted Usage in Open Source Projects
JAWs
2026
BRACE: Unified Benchmarking of Accuracy and Energy for Code Language Models
JAWs
2026
Search-Based Evolutionary Data Pruning for Class-Level Code Summarization
2025
10 papers
A Systematic Literature Review of Parameter-Efficient Fine-Tuning for Large Code Models
The rise of Artificial Intelligence (AI), and particularly of Large Language Models (LLMs) for code, has reshaped Software Engineering (SE) by enabling the automation of tasks such as code generation, bug detection, and repair. However, these models require significant computational resources for training and fine-tuning, posing challenges for real-world adoption in resource-constrained environments. To address this, the research community has increasingly turned to Parameter-Efficient Fine-Tuning (PEFT), a class of techniques that enables the adaptation of large models by updating only a small subset of parameters, rather than the entire model. In this Systematic Literature Review (SLR), we examine the growing application of PEFT techniques across a wide range of software engineering tasks. We analyze how these methods are used to optimize various deep learning (DL) architectures, focusing on their impact on both performance and efficiency. Our study synthesizes findings from 28 peer-reviewed papers, identifying patterns in configuration strategies and adaptation trade-offs. The outcome of this review is a comprehensive taxonomy that categorizes PEFT usage by task type, distinguishing between generative and non-generative scenarios. Our findings aim to inform future research and guide the practical deployment of PEFT in sustainable, AI-powered software development.
On the Effectiveness of LLM-as-a-Judge for Code Generation and Summarization
Large Language Models (LLMs) have been recently exploited as judges for complex natural language processing tasks, such as Q&A. The basic idea is to delegate to an LLM the assessment of the "quality" of the output provided by an automated technique for tasks for which: (i) quantitative metrics would only tell part of the story; and (ii) a large-scale human-based evaluation would be too expensive. We study the effectiveness of LLMs-as-a-judge for two code-related tasks, namely code generation and code summarization. For code generation, we check whether eight LLMs are able to judge the correctness of 1,405 Java methods and 1,281 Python functions generated by the same LLMs or implemented by humans. For code summarization, we compare the judgments of five LLMs to those provided by nine humans for ~1.2k summaries, related to both Java and Python functions. Our findings show that GPT-4-turbo is the best LLM in terms of judging capabilities for both tasks, with "smaller" LLMs featuring tens of billions of parameters not being able to cope with judging tasks. However, even the best-performing LLM frequently misjudges the correctness of the code and summary quality.
An Empirical Study on Language Models for Generating Log Statements in Test Code
Log statements play a critical role in modern software development, capturing essential runtime information necessary for software maintenance. We conduct an empirical study on 5,206,759 Java test methods collected from 6,405 GitHub projects to explore and disclose the effectiveness and limitations of Pre-trained Language Models (PLMs) and Large Language Models (LLMs) for generating and injecting test log statements. Our findings demonstrate that general-purpose LLMs like GPT-3.5-Turbo, when properly instructed, perform comparably to the best-performing PLMs on predicting log level. Additionally, GPT-3.5-Turbo substantially outperforms the best PLMs on predicting log position with a 33.97% improvement while also achieving superior performance in predicting log messages in terms of BLEU and ROUGE.
From Triumph to Uncertainty: The Journey of Software Engineering in the AI Era
Over the last ten years, the realm of Artificial Intelligence (AI) has experienced an explosion of revolutionary breakthroughs, transforming what seemed like a far-off dream into a reality that is now deeply embedded in our everyday lives. In this paper, we aim at outlining the key elements that, based on our expertise, are vital for the smooth integration of AI into SE, all while preserving the intrinsic human creativity that has been the driving force behind the field. We delve into the intricate interplay between AI-driven automation and human innovation, exploring how these two components can work together to advance SE practices to new methods and standards.
Is Quantization a Deal-breaker? Empirical Insights from Large Code Models
The growing scale of large language models (LLMs) not only demands extensive computational resources but also raises environmental concerns due to their increasing carbon footprint. Model quantization emerges as an effective approach that can reduce the resource demands of LLMs by decreasing parameter precision without substantially affecting performance. Our study investigates the effects of quantization on the qualitative aspects of automatically generated code. We apply Activation-aware Weight Quantization (AWQ) to two widely used code models, CodeLlama and DeepSeekCoder, to generate Java and Python code. Our findings reveal that quantization is a robust technique that not only preserves functional correctness, but also retains key qualitative code attributes sought after by developers, such as maintainability and structural simplicity.
AI-SDLC
2025
A Path Less Traveled: Reimagining Software Engineering Automation via a Neurosymbolic Paradigm
The emergence of Large Code Models (LCMs) has transformed software engineering (SE) automation, driving significant advancements in tasks such as code generation, source code documentation, code review, and bug fixing. However, these advancements come with trade-offs: achieving high performance often entails exponential computational costs, reduced interpretability, and an increasing dependence on data-intensive models. In this paper, we propose Neurosymbolic Software Engineering (NSE) as a promising paradigm combining neural learning with symbolic (rule-based) reasoning, while strategically introducing a controlled source of chaos to simulate the complex dynamics of real-world software systems.
Towards Generating the Rationale for Code Changes
Commit messages are essential to understand changes in software projects, providing a way for developers to communicate code evolution. Our study explores a more complex task: generating rationale explanations for code changes. We developed a method to identify rationale sentences in commit messages and compiled a dataset of 45,945 commits with their corresponding rationales. While the approach we engineered for the extraction of rationale exhibited a 75% precision, the model trained to generate the rationale only worked in a minority of cases. Our findings highlight the difficulty of the tackled task and the need for additional research in the area.
Optimizing Datasets for Code Summarization: Is Code-Comment Coherence Enough?
Automated code summarization is a long-standing goal for code comprehension. We explore the extent to which code-comment coherence can be used to optimize code summarization datasets. We examine multiple selectivity levels of training instances from two state-of-the-art datasets (TL-CodeSum and Funcom) and evaluate the resulting models on three manually curated test sets. The results show that even halving the training set sizes does not significantly affect the model's ability to generate summaries. However, when comparing the most restrictive selection strategy with a simpler one that randomly selects instances, the resulting accuracy does not change, suggesting that current datasets contain many irrelevant examples and different quality attributes should be explored.
Toward Neurosymbolic Program Comprehension
Recent advancements in Large Language Models (LLMs) have paved the way for Large Code Models (LCMs), enabling automation in complex software engineering tasks. However, the ambition to scale these models to trillion-parameter sizes poses significant challenges, including rising computational demands and issues related to trustworthiness, bias, and interpretability. In this paper, we advocate for a Neurosymbolic research direction that combines the strengths of existing DL techniques with traditional symbolic methods, which are renowned for their reliability, speed, and determinism. We outline the core features and present preliminary results for our envisioned approach, aimed at establishing the first NeuroSymbolic Program Comprehension (NsPC) framework to aid in identifying defective code components.
FORGE
2025
Resource-Efficient and Effective Code Summarization
Code Language Models (CLMs) have demonstrated high effectiveness in automating software engineering tasks such as bug fixing, code generation, and code documentation. However, as models grow in scale, sustainability concerns emerge. GreenAI techniques, such as QLoRA (Quantized Low-Rank Adaptation), offer a promising path for dealing with large models' sustainability. We investigate the extent to which QLoRA's capabilities in NL-to-Code tasks can be leveraged and transferred to code summarization. Our study evaluates two state-of-the-art CLMs across two programming languages. The findings confirm that QLoRA not only allows efficient fine-tuning of CLMs for code summarization but also achieves the best results with minimal parameter adjustment compared to full model fine-tuning.
2024
12 papers
Code Review Automation: Strengths and Weaknesses of the State of the Art
We aim at characterizing the cases in which three code review automation techniques tend to succeed or fail. The study has a strong qualitative focus, with ~105 man-hours invested in manually inspecting correct and wrong predictions generated by the three techniques, for a total of 2,291 inspected predictions. The output of this analysis is two taxonomies reporting the types of code changes on which the experimented techniques tend to succeed or to fail, pointing to areas for future work. Finally, we assess the importance of research on techniques specialized for code review automation by comparing their performance with ChatGPT, finding that ChatGPT struggles in commenting code as a human reviewer would do.
JSS
2024
Log Statements Generation via Deep Learning: Widening the Support Provided to Developers
Logging assists in monitoring events that transpire during the execution of software. We introduced LANCE, an approach rooted in deep learning that has demonstrated the ability to correctly inject a log statement into Java methods in ~15% of cases. To address its limitations, we present LEONID, a DL-based technique that can distinguish between methods that do and do not require the inclusion of log statements. Furthermore, LEONID supports the injection of multiple log statements within a given method when necessary, and it also enhances LANCE's proficiency in generating meaningful log messages through the combination of DL and Information Retrieval (IR).
EMSE
2024
Automated Variable Renaming: Are We There Yet?
Identifiers form a large portion of source code. Therefore, low-quality identifiers can substantially hinder code comprehension. We present a large-scale study investigating the potential of data-driven approaches to support automated variable renaming. We experiment with three state-of-the-art techniques: a statistical language model and two DL-based models. Our quantitative and qualitative analyses show the potential of such techniques that, under specific conditions, can provide valuable recommendations and are ready to be integrated in rename refactoring tools.
A Taxonomy of Self-Admitted Technical Debt in Deep Learning Systems
This paper empirically analyzes the presence of Self-Admitted Technical Debt (SATD) in DL systems. After selecting 100 open-source Python projects using popular DL frameworks, we identified SATD from their source comments and created a stratified sample of 443 SATD to analyze manually. We derived a taxonomy of DL-specific SATD through open coding, featuring seven categories and 41 leaves. Our findings indicate that DL-specific SATD differs from DL bugs found in previous studies, as it typically pertains to suboptimal solutions rather than functional problems.
SE 2030
2024
The Rise and Fall (?) of Software Engineering
Over the last ten years, the realm of Artificial Intelligence has experienced an explosion of revolutionary breakthroughs. We aim at outlining the key elements vital for the smooth integration of AI into SE, all while preserving the intrinsic human creativity that has been the driving force behind the field. We delve into the intricate interplay between AI-driven automation and human innovation, exploring how these two components can work together to advance SE practices.
Unveiling ChatGPT's Usage in Open Source Projects: A Mining-based Study
We mine 1,501 commits, pull requests, and issues from open-source projects by matching regular expressions likely to indicate the usage of ChatGPT. Then, we manually analyze these instances, categorizing the task automated in the 467 true positive instances (165 commits, 159 PRs, 143 issues). This resulted in a taxonomy of 45 tasks which developers automate via ChatGPT, providing developers with valuable insights on how to exploit LLMs in their workflow and researchers with a clear overview of tasks that could benefit from automated solutions.
How do Hugging Face Models Document Datasets, Bias, and Licenses? An Empirical Study
Taking as a case study the transformer models hosted by Hugging Face, this paper empirically investigates the transparency of pre-trained transformer models. We look at the extent to which model descriptions (i) specify the datasets being used for their pre-training, (ii) discuss their possible training bias, (iii) declare their license, and whether projects using such models take these licenses into account. Results indicate that pre-trained models still have a limited exposure of their training datasets, possible biases, and adopted licenses. Also, we found several cases of possible licensing violations by client projects.
Towards Summarizing Code Snippets Using Pre-Trained Transformers
Most recent approaches exploit deep learning to automatically document classes or functions, while very little effort has been devoted to more fine-grained documentation. In this work, we take all steps needed to train a DL model to automatically document code snippets. First, we manually built a dataset featuring 6.6k comments. Second, we used it to train a multi-task DL model achieving 84% accuracy and recall/precision higher than 80%. Third, we ran this model on 10k open source projects, automatically building a large-scale dataset that has then been used to train a new DL model able to automatically document code snippets.
Evaluating Code Summarization Techniques: A New Metric and an Empirical Characterization
We perform a thorough empirical investigation on the complementarity of different types of metrics in capturing the quality of a generated summary. We propose to address the limitations of existing metrics by considering a new dimension, capturing the extent to which the generated summary aligns with the semantics of the documented code snippet, independently from the reference summary. To this end, we present a new metric based on contrastive learning. We empirically show that the inclusion of this novel dimension enables a more effective representation of developers' evaluations regarding the quality of automatically generated summaries.
Toward Automatically Completing GitHub Workflows
We present GH-WCOM (GitHub Workflow COMpletion), a Transformer-based approach supporting developers in writing GitHub workflows. To deal with such a task, we designed an abstraction process to help the learning of the transformer while still making GH-WCOM able to recommend very peculiar workflow elements such as tool options and scripting elements. Our empirical study shows that GH-WCOM provides up to 34.23% correct predictions, and the model's confidence is a reliable proxy for the recommendations' correctness likelihood.
EASE
2024
How the Training Procedure Impacts the Performance of Deep Learning-based Vulnerability Patching
Generative deep learning (DL) models have been successfully adopted for vulnerability patching. However, such models require the availability of a large dataset of patches to learn from. To overcome this issue, researchers have proposed to start from models pre-trained with general knowledge, either on the programming language or on similar tasks such as bug fixing. Despite the efforts in the area of automated vulnerability patching, there is a lack of systematic studies on how these different training procedures impact the performance of DL models for such a task. This paper provides a manifold contribution to bridge this gap, by (i) comparing existing solutions of self-supervised and supervised pre-training for vulnerability patching; and (ii) for the first time, experimenting with different kinds of prompt-tuning for this task. The study required training and testing 23 DL models. We found that a supervised pre-training focused on bug-fixing, while expensive in terms of data collection, substantially improves DL-based vulnerability patching. When applying prompt-tuning on top of this supervised pre-trained model, there is no significant gain in performance. Instead, prompt-tuning is an effective and cheap solution to substantially boost the performance of self-supervised pre-trained models, i.e., those not relying on the bug-fixing pre-training.
NL4AI
2024
On the Reform of the Italian Constitution: an Interdisciplinary Text Readability Analysis
2023
5 papers
An Empirical Study on the Usage of Transformer Models for Code Completion
We present a large-scale study exploring the capabilities of state-of-the-art Transformer-based models in supporting code completion at different granularity levels, including single tokens, one or multiple entire statements, up to entire code blocks. We experimented with several variants of RoBERTa and the Text-To-Text Transfer Transformer (T5). The achieved results show that Transformer-based models, and in particular the T5, represent a viable solution for code completion, with perfect predictions ranging from ~29% up to ~69%.
Using Transfer Learning for Code-Related Tasks
We assess the performance of the T5 model in supporting four different code-related tasks: automatic bug-fixing, injection of code mutants, generation of assert statements, and code summarization. We pay particular attention to studying the role played by pre-training and multi-task fine-tuning on the model's performance. We show that the T5 can achieve better performance as compared to state-of-the-art baselines; and while pre-training helps the model, not all tasks benefit from a multi-task fine-tuning.
Towards Automatically Addressing Self-Admitted Technical Debt: How Far Are We?
This paper empirically investigates the extent to which technical debt can be automatically paid back by neural-based generative models. We extract a dataset of 5,039 Self-Admitted Technical Debt removals from 595 open-source projects and experiment with seven different generative DL model configurations. Results indicate that the best model we experimented with is able to automatically fix ~2% to 8% of test instances, depending on the number of attempts. The model's pre-training plays a fundamental role in boosting performance.
On the Robustness of Code Generation Techniques: An Empirical Study on GitHub Copilot
We present an empirical study in which we aim at understanding whether different but semantically equivalent natural language descriptions result in the same recommended function from GitHub Copilot. We asked Copilot to automatically generate 892 Java methods starting from their original Javadoc description. Our results show that modifying the description results in different code recommendations in ~46% of cases. Also, differences in the semantically equivalent descriptions might impact the correctness of the generated code by ~28%.
ICSSP
2023
Automatically Generating Dockerfiles via Deep Learning: Challenges and Promises
We present a study in which we aim at understanding to what extent Deep Learning can be used for generating entire Dockerfiles from scratch given a high-level specification of requirements. We defined a structured natural language specification for Dockerfile requirements and used a dataset with 670,982 instances to train and test a T5 model. The results of our evaluation show that T5 performs similarly to the more trivial IR-based baselines, and we report the open challenges associated with the application of deep learning in this context.
2022
2 papers
Using Deep Learning to Generate Complete Log Statements
Logging is a practice widely adopted in several phases of the software lifecycle. For example, during software development log statements allow engineers to verify and debug the system by exposing fine-grained information of the running software. While the benefits of logging are undisputed, taking proper decisions about where to inject log statements, what information to log, and at which log level (e.g., error, warning) is crucial for the logging effectiveness. In this paper, we present LANCE (Log stAtemeNt reCommEnder), the first approach supporting developers in all these decisions. LANCE features a Text-To-Text-Transfer-Transformer (T5) model that has been trained on 6,894,456 Java methods. LANCE takes as input a Java method and injects in it a full log statement, including a human-comprehensible logging message and properly choosing the needed log level and the statement location. Our results show that LANCE is able to (i) properly identify the location in the code where to inject the statement in 65.9% of Java methods requiring it; (ii) select the proper log level in 66.2% of cases; and (iii) generate a completely correct log statement including a meaningful logging message in 15.2% of cases.
Using Pre-Trained Models to Boost Code Review Automation
Code review is a practice widely adopted in open source and industrial projects. Given the non-negligible cost of such a process, researchers started investigating the possibility of automating specific code review tasks. We recently proposed Deep Learning (DL) models targeting the automation of two tasks: the first model takes as input a code submitted for review and implements in it changes likely to be recommended by a reviewer; the second takes as input the submitted code and a reviewer comment posted in natural language and automatically implements the change required by the reviewer. While the preliminary results we achieved are encouraging, both models had been tested in rather simple code review scenarios, substantially simplifying the targeted problem. This was also due to the choices we made when designing both the technique and the experiments. In this paper, we build on top of that work by demonstrating that a pre-trained Text-To-Text Transfer Transformer (T5) model can outperform previous DL models for automating code review tasks. Also, we conducted our experiments on a larger and more realistic (and challenging) dataset of code review activities.
2021
3 papers
An Adaptive Search Budget Allocation Approach for Search-Based Test Case Generation
We introduce Budget Optimization for Testing (BOT), an approach to adaptively allocate the search budget to the classes under test. BOT requires information about the branch coverage that will be achieved on each class with a given search budget. Therefore, we also introduce BRANCHOS, an approach that predicts coverage in a budget-aware way. The results of our experiments show that BRANCHOS can approximate the branch coverage in time with a low error, and BOT can significantly increase the coverage achieved by a test generation tool.
An Empirical Study on Code Comment Completion
We tackle the problem of code comment completion: instead of generating a comment for a given code from scratch, we investigate the extent to which state-of-the-art techniques can help developers in writing comments faster. We present a large-scale study in which we empirically assess how a simple n-gram model and the recently proposed T5 architecture can perform in autocompleting a code comment the developer is typing. The achieved results show the superiority of the T5 model, despite the n-gram model being a competitive solution.
Studying the Usage of Text-To-Text Transfer Transformer to Support Code-Related Tasks
We empirically investigate how the T5 model performs when pre-trained and fine-tuned to support code-related tasks. We pre-train a T5 model on a dataset composed of natural language English text and source code. Then, we fine-tune such a model by reusing datasets used in four previous works to fix bugs, inject code mutants, generate assert statements, and generate code comments. We show that our T5 model, exploiting additional data for the self-supervised pre-training phase, can achieve performance improvements over the four baselines.
2013
1 paper
ICAIL
2013
Legal documents categorization by compression
In this paper we investigate how to categorize text excerpts from Italian normative texts. Although text categorization is a problem of broader interest, we single out a specific issue. Namely, we are concerned with categorizing the set of subjects in which Italian Regions are allowed to produce norms: this is the so-called residual legislative power problem. It basically consists in making explicit a set of subjects that was originally defined only in a residual and negative fashion. The categorization of legal text fragments is acknowledged to be a difficult problem, characterized by abstract concepts along with a variety of locutions used to denote them, by convoluted sentence structure, and by several other facets. In addition, in the present case subjects often partially overlap, and a training set of sufficient size (for the problem under consideration) does not exist: all these aspects make our task challenging. In this setting, classical feature-based approaches provide poor quality results, so we explored algorithms based on compression techniques. We tested three such techniques: we illustrate their main features and report the results of an experiment in which our implementation of such algorithms is compared with the output of standard machine learning algorithms. Far from having found a silver bullet, we show that compression-based techniques provide the best results for the problem at hand, and argue that these approaches can be effectively coupled with more informative and semantically grounded ones.
Magazine (non-peer-reviewed)
Mind the Overlap: Trustworthy Evaluation for Large Code Models
In this month’s Spotlight on Transactions, we showcase an IEEE Transactions on Software Engineering article by López et al., exploring how hidden data leakage arising from interdataset code duplication artificially elevates benchmark performance and outlining actionable strategies for maintaining rigorous, reliable model assessment.
Secrets in the Synapses: When Steganography Meets Large Language Models
Digital steganography has traditionally focused on hiding information within visible media, but generative artificial intelligence is shifting the hiding place itself. Li et al., in “Steganography in Large Language Models” (IEEE Transactions on Artificial Intelligence), show how models can carry hidden information within their own parameters.
LLM-Powered Security Test Generation: Oracles, Vulnerability Probes, and Adversarial Inputs
Large language models (LLMs) can synthesize test oracles from invariants where ground truth is unavailable, translate vulnerability catalogs such as CWE, OWASP, and CVE into executable probes, and generate adversarial inputs that stress both traditional software and LLM-based systems.
The Virtue of Hallucination: When AI Mistakes Make Software Safer
“LLMorpheus: Mutation Testing Using Large Language Models,” published in IEEE Transactions on Software Engineering in 2025, proposes a large language model-driven mutation testing framework that moves beyond fixed operator catalogs by using context-aware code infilling.
Toward Reliable Security Operations Center Testing With Foundation Models
This article presents a practical method for using foundation models to generate realistic, structured scenarios that support repeatable testing of Security Operations Center (SOC) analytics as environments evolve.
The Price of Intelligence: Can AI Afford to Be Sustainable?
Orchestrated Entropy: Foundation Model Nondeterminism for SOC
Human or Machine? Rebuilding Trust in the Age of AI-Based Text Generation
From Heuristics to Intelligence: Large Language Model-Driven Test Case Generation
Prompt Alchemy: Engineering the Magic of Code
Breaking Bottlenecks in LLM Inference With Adaptive Cache Management
Code, Chaos, and Clarity: Neurosymbolic Approaches to Trustworthy Software Automation
When Databases Age: How SQL Server and MySQL Handle the Test of Time
Smarter, Not Harder: Efficient AI Training With Selective Data
Pixels of Deception: How Evolutionary Algorithms Break AI Reliability
How Artificial Intelligence Is Reshaping Our Lives
Preprints not yet peer-reviewed
Parameter-Efficient Multi-Task Fine-Tuning in Code-Related Tasks
Large Language Models (LLMs) have proven highly effective in automating software engineering tasks, bridging natural language and code semantics to achieve notable results in code generation and summarization. However, their scale incurs substantial computational costs, making full fine-tuning impractical. Parameter-Efficient Fine-Tuning (PEFT) methods like QLoRA enable efficient specialization with lower resource demands. Recent studies show QLoRA-optimized Large Code Models (LCMs) perform strongly across diverse tasks, yet it remains unclear whether this effectiveness persists when a single model is QLoRA fine-tuned for multiple code-related tasks. The interaction between Multi-task fine-tuning and QLoRA optimization, and how transfer learning affects correctness and quality of generated artifacts, remains largely unexplored. We investigate Multi-task QLoRA fine-tuning across three representative tasks: code generation, translation, and summarization. We evaluate functional correctness through execution-based and similarity-based metrics, complemented by comprehensive code quality analysis--an aspect largely overlooked in prior work. Our findings show that Multi-task QLoRA effectively leverages transfer learning, achieving competitive or superior performance relative to both Single-task QLoRA and Multi-task full fine-tuning. Larger models demonstrate more consistent balance between correctness and quality, whereas smaller models preserve functionality but exhibit a higher incidence of quality-related issues.
Not All Tokens Matter: Data-Centric Optimization for Efficient Code Summarization
Instruction-tuned Language Models (ILMs) have become essential components of modern AI systems, demonstrating exceptional versatility across a wide range of natural language and reasoning tasks. Among their most impactful applications is code generation, where ILMs--commonly referred to as Code Language Models (CLMs)--have demonstrated remarkable capability. This strength stems from their defining feature: the use of explicit task instructions during fine-tuning, which enables them to bridge natural language and code by translating human intent into executable code. While much of their progress has been driven by advances in scaling laws and training methodologies, one critical aspect remains underexplored--the impact of system prompts on the performance of both general-purpose ILMs and specialized CLMs when instantiated to assist users with code generation activities. In this study, we take a first step toward bridging this gap by systematically evaluating how system prompts of varying instructional detail, along with model scale, prompting strategy, and programming language, affect ILMs and CLMs in code generation tasks. Our evaluation framework, spanning 120 model configurations, reveals that (1) the influence of system prompts increases with model scale; (2) few-shot prompting reduces this effect compared to zero-shot; and (3) programming language matters, with Java showing greater sensitivity to system prompt variations than Python.
Towards Comprehensive Benchmarking Infrastructure for LLMs In Software Engineering
Large language models for code are advancing fast, yet our ability to evaluate them lags behind. Current benchmarks focus on narrow tasks and single metrics, which hide critical gaps in robustness, interpretability, fairness, efficiency, and real-world usability. They also suffer from inconsistent data engineering practices, limited software engineering context, and widespread contamination issues. To understand these problems and chart a path forward, we combined an in-depth survey of existing benchmarks with insights gathered from a dedicated community workshop. We identified three core barriers to reliable evaluation: the absence of software-engineering-rich datasets, overreliance on ML-centric metrics, and the lack of standardized, reproducible data pipelines. Building on these findings, we introduce BEHELM, a holistic benchmarking infrastructure that unifies software-scenario specification with multi-metric evaluation. BEHELM provides a structured way to assess models across tasks, languages, input and output granularities, and key quality dimensions. Our goal is to reduce the overhead currently required to construct benchmarks while enabling a fair, realistic, and future-proof assessment of LLMs in software engineering.
An Empirical Study on the Effects of System Prompts in Instruction-Tuned Models for Code Generation
Instruction-tuned Language Models (ILMs) have become essential components of modern AI systems, demonstrating exceptional versatility across natural language and reasoning tasks. Among their most impactful applications is code generation, where ILMs -- commonly referred to as Code Language Models (CLMs) -- translate human intent into executable programs. While progress has been driven by advances in scaling and training methodologies, one critical aspect remains underexplored: the impact of system prompts on both general-purpose ILMs and specialized CLMs for code generation. We systematically evaluate how system prompts of varying instructional detail, along with model scale, prompting strategy, and programming language, affect code assistants. Our experimental setting spans 360 configurations across four models, five system prompts, three prompting strategies, two languages, and two temperature settings. We find that (1) increasing system-prompt constraint specificity does not monotonically improve correctness -- prompt effectiveness is configuration-dependent and can help or hinder based on alignment with task requirements and decoding context; (2) for larger code-specialized models, few-shot examples can degrade performance relative to zero-shot generation, contrary to conventional wisdom; and (3) programming language matters, with Java exhibiting significantly greater sensitivity to system prompt variations than Python, suggesting language-specific prompt engineering strategies may be necessary.
Fine-grained Multi-Document Extraction and Generation of Code Change Rationale
Understanding the reasons behind past code changes is critical for many software engineering tasks, including refactoring and reviewing code, diagnosing bugs, and implementing new features. Unfortunately, locating and reconstructing this rationale can be difficult for developers because the information is often fragmented, inconsistently documented, and scattered across different artifacts such as commit messages, issue reports, and PRs. In this paper, we address this challenge in two steps. First, we conduct an empirical study of 63 commits from five open-source Java projects to analyze how rationale components (e.g., a change's goal, need, and alternative) are distributed across artifacts. We find that the rationale is highly fragmented: commit messages and pull requests primarily capture goals, while needs and alternatives are more often found in issues and PRs. Other components are scarce but found in artifacts other than commit messages. No single artifact type captures all components, underscoring the need for cross-document reasoning and synthesis. Second, we introduce ARGUS, an LLM-based approach that identifies sentences expressing goal, need, and alternative across a commit's artifacts and creates concise rationale summaries to support code comprehension and maintenance tasks. We evaluated ARGUS on the 63 commits and compared its performance against baseline variants. The best-performing version achieved 51.4% precision and 93.2% recall for rationale identification, while producing rationale summaries rated as accurate. A user study with 12 Java developers further showed that these summaries were perceived as useful and helpful for tasks such as code review, documentation, and debugging. Our results highlight the need for multi-document reasoning in capturing rationale and demonstrate the potential of ARGUS to help developers understand and maintain software systems.
Prompt-Driven Code Summarization: A Systematic Literature Review
Software documentation is essential for program comprehension, developer onboarding, code review, and long-term maintenance. Yet producing quality documentation manually is time-consuming and frequently yields incomplete or inconsistent results. Large language models (LLMs) offer a promising solution by automatically generating natural language descriptions from source code, helping developers understand code more efficiently, facilitating maintenance, and supporting downstream activities such as defect localization and commit message generation. However, the effectiveness of LLMs in documentation tasks critically depends on how they are prompted. Properly structured instructions can substantially improve model performance, making prompt engineering (the design of input prompts to guide model behavior) a foundational technique in LLM-based software engineering. Approaches such as few-shot prompting, chain-of-thought reasoning, retrieval-augmented generation, and zero-shot learning show promise for code summarization, yet current research remains fragmented. There is limited understanding of which prompting strategies work best, for which models, and under what conditions. Moreover, evaluation practices vary widely, with most studies relying on overlap-based metrics that may not capture semantic quality. This systematic literature review consolidates existing evidence, categorizes prompting paradigms, examines their effectiveness, and identifies gaps to guide future research and practical adoption.
Toward Explaining Large Language Models in Software Engineering Tasks
Recent progress in Large Language Models (LLMs) has substantially advanced the automation of software engineering (SE) tasks, enabling complex activities such as code generation and code summarization. However, the black-box nature of LLMs remains a major barrier to their adoption in high-stakes and safety-critical domains, where explainability and transparency are vital for trust, accountability, and effective human supervision. Despite increasing interest in explainable AI for software engineering, existing methods lack domain-specific explanations aligned with how practitioners reason about SE artifacts. To address this gap, we introduce FeatureSHAP, the first fully automated, model-agnostic explainability framework tailored to software engineering tasks. Based on Shapley values, FeatureSHAP attributes model outputs to high-level input features through systematic input perturbation and task-specific similarity comparisons, while remaining compatible with both open-source and proprietary LLMs. We evaluate FeatureSHAP on two bi-modal SE tasks: code generation and code summarization. The results show that FeatureSHAP assigns less importance to irrelevant input features and produces explanations with higher fidelity than baseline methods. A practitioner survey involving 37 participants shows that FeatureSHAP helps practitioners better interpret model outputs and make more informed decisions. Collectively, FeatureSHAP represents a meaningful step toward practical explainable AI in software engineering. FeatureSHAP is available at https://github.com/deviserlab/FeatureSHAP.
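The Shapley-value attribution described in the abstract above can be illustrated with a minimal sketch. This is not FeatureSHAP's implementation: the feature names (`signature`, `docstring`, `comment`) and the `toy_similarity` value function are hypothetical stand-ins for "how similar is the model output when only these input features are kept", and the exact enumeration shown here is only practical for a handful of features.

```python
from itertools import combinations
from math import factorial


def shapley_values(features, value):
    """Exact Shapley values for a value function defined on feature subsets."""
    n = len(features)
    result = {}
    for f in features:
        others = [g for g in features if g != f]
        total = 0.0
        # Enumerate every subset of the remaining features.
        for k in range(n):
            for subset in combinations(others, k):
                s = frozenset(subset)
                # Standard Shapley weight: |S|! (n - |S| - 1)! / n!
                weight = factorial(k) * factorial(n - k - 1) / factorial(n)
                # Marginal contribution of f when added to subset s.
                total += weight * (value(s | {f}) - value(s))
        result[f] = total
    return result


def toy_similarity(kept):
    """Hypothetical score: similarity of the output when only `kept` features remain."""
    score = 0.0
    if "signature" in kept:
        score += 0.5
    if "docstring" in kept:
        score += 0.3
    if "signature" in kept and "docstring" in kept:
        score += 0.2  # interaction between the two informative features
    return score  # "comment" never changes the score, so it is irrelevant


features = ["signature", "docstring", "comment"]
phi = shapley_values(features, toy_similarity)
for name, contribution in phi.items():
    print(f"{name}: {contribution:.3f}")
```

In this toy setting the irrelevant `comment` feature receives zero attribution, and the attributions sum to the difference between the score with all features and the score with none, which is the efficiency property of Shapley values.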
Smaller = Weaker? Benchmarking Robustness of Quantized LLMs in Code Generation
Quantization has emerged as a mainstream method for compressing Large Language Models (LLMs), reducing memory requirements and accelerating inference without architectural modifications. While existing research primarily focuses on evaluating the effectiveness of quantized LLMs compared to their original counterparts, the impact on robustness remains largely unexplored. In this paper, we present the first systematic investigation of how quantization affects the robustness of LLMs in code generation tasks. Through extensive experiments across four prominent LLM families (LLaMA, DeepSeek, CodeGen, and StarCoder) with parameter scales ranging from 350M to 33B, we evaluate robustness from dual perspectives: adversarial attacks on input prompts and noise perturbations on model architecture. Our findings challenge conventional wisdom by demonstrating that quantized LLMs often exhibit superior robustness compared to their full-precision counterparts, with 51.59% versus 42.86% of our adversarial experiments showing better resilience in quantized LLMs. Similarly, our noise perturbation experiments also confirm that LLMs after quantization generally withstand higher levels of weight disturbances. These results suggest that quantization not only reduces computational requirements but can actually enhance LLMs' reliability in code generation tasks, providing valuable insights for developing more robust and efficient LLM deployment strategies.
Smart but Costly? Benchmarking LLMs on Functional Accuracy and Energy Efficiency
The rapid advancement of AI technologies and their accelerated adoption in software development necessitate a systematic evaluation of their environmental impact alongside functional correctness. While prior studies have examined sustainability in large language models, existing approaches lack systematic frameworks for evaluating accuracy-energy trade-offs in Code Language Models (CLMs). In this paper, we present a framework, BRACE, to benchmark CLMs on a unified scale of energy efficiency and functional correctness (referred to as accuracy). We benchmark 22 state-of-the-art models on code generation and summarization tasks, proposing two rating methods: Concentric Incremental Rating Circles (CIRC) and Observation to Expectation Rating (OTER). CIRC provides deterministic Euclidean-based rankings with static trade-offs that are robust to outliers, and OTER offers trend-aware evaluation with dynamic trade-offs that capture the complex correlation between energy and accuracy, each offering a distinct perspective and addressing the problem in a unique way. These rating methods enable us to rate LLMs on a 1-5 scale reflecting their combined capabilities in terms of energy efficiency and functional correctness. Our analysis reveals that models generally perform better in the code summarization tasks, as they are not required to generate a grammar-based and syntactically correct output. Also, we find that models' size does not have a significant impact on their ratings, indicating that if models utilize their parameters efficiently, they can be ranked higher on these scales. The proposed BRACE framework empowers practitioners to make evidence-based model selections that balance sustainability with task requirements, guiding rating choice -- CIRC for deterministic comparisons or OTER for trend-aware evaluation -- based on deployment priorities.