Field
Value
Language
dc.contributor.author
Chen, Fuxiang
datacite.creator.affiliationIdentifier
https://ror.org/03rmrcq20
en_US
datacite.creator.affiliation
University of British Columbia
en_US
datacite.creator.nameIdentifier
en_US
dc.coverage.temporal
2022-03-15/2032-03-31
dc.date.accessioned
2022-03-23T20:30:25Z
dc.date.available
2022-03-23T20:30:25Z
dc.date.issued
2022-03-23
dc.identifier.uri
https://www.frdr-dfdr.ca/repo/dataset/7c3eba54-7635-4459-9523-63508e613a06
dc.identifier.uri
https://doi.org/10.20383/102.0563
dc.description
Pre-trained Language Models (PLMs) such as CodeBERT and GraphCodeBERT, when trained on a large corpus of code, have recently displayed promising results in Software Engineering (SE) downstream tasks. A PLM is most useful if it can be leveraged to improve performance on code corpora written in low-resource programming languages, where training data is limited. In this work, we study the impact of PLMs on a low-resource programming language corpus; specifically, we choose Ruby as the study subject. A recent study by Ahmed and Devanbu reported that fine-tuning multilingual PLMs on a multilingual code corpus yields higher performance than fine-tuning on code written in a single programming language. However, no analysis was made with respect to monolingual PLMs. Furthermore, some programming languages are inherently different, and code written in one language usually cannot be interchanged with code written in another; for example, Ruby and Java code have very different structures. To better understand how monolingual and multilingual PLMs affect different programming languages, we investigate 1) the performance of PLMs on Ruby for two popular SE tasks, Code Summarization and Code Search; 2) the strategy for selecting programming languages that works well when fine-tuning multilingual PLMs for Ruby; and 3) the performance of the fine-tuned PLMs on Ruby across different code lengths, where we bin the Ruby code by its number of tokens. Understanding performance across code lengths will enable developers to make more informed decisions about using PLMs on their code.
This dataset contains the PLMs and their fine-tuned models (over a hundred trained and fine-tuned models in total), generated by researchers at the University of British Columbia, Singapore Management University, and JetBrains.
en_US
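As a minimal illustration of the length-binning step mentioned in the description, the sketch below tokenizes Ruby snippets with a CodeBERT tokenizer and assigns them to token-count bins. It is not taken from the dataset itself: the "microsoft/codebert-base" checkpoint and the bin boundaries are illustrative assumptions, not the authors' exact settings.
from transformers import AutoTokenizer

# Assumed checkpoint; the dataset's own models may differ.
tokenizer = AutoTokenizer.from_pretrained("microsoft/codebert-base")

ruby_snippets = [
    "def add(a, b)\n  a + b\nend",
    "class Greeter\n  def initialize(name)\n    @name = name\n  end\n"
    "  def greet\n    \"Hello, #{@name}!\"\n  end\nend",
]

# Hypothetical token-count bins; the record does not state the boundaries used.
bins = [(0, 50), (50, 100), (100, 200), (200, float("inf"))]

def bin_of(code: str) -> str:
    # Count subword tokens produced by the PLM tokenizer and map to a bin.
    n_tokens = len(tokenizer.tokenize(code))
    for lo, hi in bins:
        if lo <= n_tokens < hi:
            return f"{lo}-{hi} tokens"
    return "unbinned"

for snippet in ruby_snippets:
    print(bin_of(snippet))
Binning by the PLM's own subword tokens (rather than raw characters or lines) keeps the length measure consistent with what the model actually sees during fine-tuning and evaluation.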
dc.publisher
Federated Research Data Repository / dépôt fédéré de données de recherche
dc.rights
Creative Commons Public Domain Dedication (CC0 1.0)
en_US
dc.rights.uri
https://creativecommons.org/publicdomain/zero/1.0/
en_US
dc.subject
PLM
en_US
dc.subject
Fine-tuned Models
en_US
dc.subject
Pre-trained Language Models
en_US
dc.subject
Ruby
en_US
dc.subject
CodeBERT
en_US
dc.title
On the Transferability of Pre-trained Language Models for Low-Resource Programming Languages
en_US
globus.shared_endpoint.name
f163c1b3-9c88-42f6-a7bb-5839ed6c4063
globus.shared_endpoint.path
/1/published/publication_558/
frdr.preservation.status
AIP generation and transfer successful
frdr.preservation.datetime
2023-11-03
datacite.publicationyear
2022
datacite.resourcetype
Dataset
en_US
datacite.fundingReference.funderName
en_US
datacite.fundingReference.awardNumber
en_US
datacite.fundingReference.awardTitle
en_US
frdr.crdc.code
RDF1020399
en_US
frdr.crdc.group_en
Computer and information sciences
en_US
frdr.crdc.class_en
Programming languages and software engineering
en_US
frdr.crdc.field_en
Programming language and software engineering, not elsewhere classified
en_US
frdr.crdc.group_fr
Informatique et systèmes d'information
fr_CA
frdr.crdc.class_fr
Langages de programmation et génie logiciel
fr_CA
frdr.crdc.field_fr
Langages de programmation et génie logiciel, non classé ailleurs
fr_CA