
On the Transferability of Pre-trained Language Models for Low-Resource Programming Languages

Description: Pre-trained Language Models (PLMs) such as CodeBERT and GraphCodeBERT, when trained on a large corpus of code, have recently shown promising results on Software Engineering (SE) downstream tasks. A PLM is most useful if it can be leveraged to improve performance on code corpora written in low-resource programming languages, where training data is limited. In this work, we study the impact of PLMs on a low-resource programming language corpus; specifically, we choose Ruby as the study subject. A recent study by Ahmed and Devanbu reported that fine-tuning multilingual PLMs on a multilingual code corpus achieves higher performance than fine-tuning on code written in just one programming language. However, no analysis was made of monolingual PLMs. Furthermore, some programming languages are inherently different, and code written in one language usually cannot be interchanged with another; for example, Ruby and Java code have very different structures. To better understand how monolingual and multilingual PLMs affect different programming languages, we investigate 1) the performance of PLMs on Ruby for two popular SE tasks: Code Summarization and Code Search, 2) the strategy for selecting programming languages that works well when fine-tuning multilingual PLMs for Ruby, and 3) the performance of the fine-tuned PLMs on Ruby for different code lengths. Here, we bin the Ruby code by its number of tokens; understanding performance at different code lengths will enable developers to make more informed decisions about the use of PLMs based on their code.

This dataset contains the PLMs and their fine-tuned models (over a hundred trained and fine-tuned models in total). It was generated by researchers at the University of British Columbia, Singapore Management University, and JetBrains.
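
As a minimal, hypothetical sketch of the length-binning step described above (not the authors' exact pipeline), the snippet below loads the publicly available CodeBERT tokenizer from the Hugging Face Hub and groups Ruby snippets into bins by token count. The checkpoint name is the public CodeBERT release; the bin boundaries are illustrative assumptions, not the thresholds used in the study.

# Hypothetical sketch, not the authors' pipeline: tokenize Ruby code with the
# public CodeBERT tokenizer and bin each snippet by its token count.
# The bin boundaries below are illustrative assumptions.
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("microsoft/codebert-base")

ruby_snippets = [
    "def add(a, b)\n  a + b\nend",
    "class Greeter\n  def initialize(name)\n    @name = name\n  end\nend",
]

# (label, inclusive upper bound on token count)
BINS = [("short", 50), ("medium", 100), ("long", float("inf"))]

def bin_label(num_tokens):
    for label, upper in BINS:
        if num_tokens <= upper:
            return label

for code in ruby_snippets:
    n = len(tokenizer.tokenize(code))
    print(f"{n:3d} tokens -> {bin_label(n)}")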
Authors: Chen, Fuxiang; University of British Columbia
Keywords: PLM
Fine-tuned Models
Pre-trained Language Models
Ruby
CodeBERT
Field of Research: Computer and information sciences > Programming languages and software engineering > Programming language and software engineering, not elsewhere classified
Publication Date: 2022-03-23
Publisher: Federated Research Data Repository / dépôt fédéré de données de recherche
URI: https://doi.org/10.20383/102.0563

Files in Dataset
No files are available for direct web download. The entire dataset can be downloaded using Globus Transfer; this method requires a Globus account and the Globus transfer software.

Access to this dataset is subject to the following terms:
Creative Commons Public Domain Dedication (CC0 1.0) https://creativecommons.org/publicdomain/zero/1.0/
Citation
Chen, F. (2022). On the Transferability of Pre-trained Language Models for Low-Resource Programming Languages. Federated Research Data Repository. https://doi.org/10.20383/102.0563