Field
Value
Language
dc.contributor.author
Chen, Fuxiang
datacite.creator.affiliationIdentifier
https://ror.org/03rmrcq20
en_US
datacite.creator.affiliation
University of British Columbia
en_US
datacite.creator.nameIdentifier
en_US
dc.coverage.temporal
2022-03-15/2032-03-31
dc.date.accessioned
2022-03-23T20:30:25Z
dc.date.available
2022-03-23T20:30:25Z
dc.date.issued
2022-03-23
dc.identifier.uri
https://www.frdr-dfdr.ca/repo/dataset/7c3eba54-7635-4459-9523-63508e613a06
dc.identifier.uri
https://doi.org/10.20383/102.0563
dc.description
Pre-trained Language Models (PLMs) such as CodeBERT and GraphCodeBERT, when trained on a large corpus of code, have recently displayed promising results in Software Engineering (SE) downstream tasks. A PLM is most useful if it can be leveraged to improve performance on code corpora written in low-resource programming languages, where training data is limited. In this work, we study the impact of PLMs on a low-resource programming language corpus; specifically, we choose Ruby as the study subject. A recent study by Ahmed and Devanbu reported that fine-tuning multilingual PLMs on a multilingual code corpus yields higher performance than fine-tuning on code written in a single programming language. However, no analysis was made with respect to monolingual PLMs. Furthermore, some programming languages are inherently different, and code written in one language usually cannot be interchanged with code written in another; for example, Ruby and Java code have very different structures. To better understand how monolingual and multilingual PLMs affect different programming languages, we investigate 1) the performance of PLMs on Ruby for two popular SE tasks, Code Summarization and Code Search; 2) the strategy for selecting programming languages that works well when fine-tuning multilingual PLMs for Ruby; and 3) the performance of the fine-tuned PLMs on Ruby across different code lengths, where we bin the Ruby code by its number of tokens. Understanding performance across code lengths will enable developers to make more informed decisions about using PLMs on their code.
This dataset contains the PLMs and their fine-tuned models (over a hundred trained and fine-tuned models in total), generated by researchers at the University of British Columbia, Singapore Management University, and JetBrains.
en_US
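As a minimal illustration of the length-binning step mentioned in the description, the sketch below tokenizes Ruby snippets with a CodeBERT tokenizer and assigns them to token-count bins. It is not taken from the dataset itself: the "microsoft/codebert-base" checkpoint and the bin boundaries are illustrative assumptions, not the authors' exact settings.
from transformers import AutoTokenizer

# Assumed checkpoint; the dataset's own models may differ.
tokenizer = AutoTokenizer.from_pretrained("microsoft/codebert-base")

ruby_snippets = [
    "def add(a, b)\n  a + b\nend",
    "class Greeter\n  def initialize(name)\n    @name = name\n  end\n"
    "  def greet\n    \"Hello, #{@name}!\"\n  end\nend",
]

# Hypothetical token-count bins; the record does not state the boundaries used.
bins = [(0, 50), (50, 100), (100, 200), (200, float("inf"))]

def bin_of(code: str) -> str:
    # Count subword tokens produced by the PLM tokenizer and map to a bin.
    n_tokens = len(tokenizer.tokenize(code))
    for lo, hi in bins:
        if lo <= n_tokens < hi:
            return f"{lo}-{hi} tokens"
    return "unbinned"

for snippet in ruby_snippets:
    print(bin_of(snippet))
Binning by the PLM's own subword tokens (rather than raw characters or lines) keeps the length measure consistent with what the model actually sees during fine-tuning and evaluation.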
dc.publisher
Federated Research Data Repository / dépôt fédéré de données de recherche
dc.rights
Creative Commons Public Domain Dedication (CC0 1.0)
en_US
dc.rights.uri
https://creativecommons.org/publicdomain/zero/1.0/
en_US
dc.subject
PLM
en_US
dc.subject
Fine-tuned Models
en_US
dc.subject
Pre-trained Language Models
en_US
dc.subject
Ruby
en_US
dc.subject
CodeBERT
en_US
dc.title
On the Transferability of Pre-trained Language Models for Low-Resource Programming Languages
en_US
globus.shared_endpoint.name
f163c1b3-9c88-42f6-a7bb-5839ed6c4063
globus.shared_endpoint.path
/1/published/publication_558/
frdr.preservation.status
AIP generation and transfer successful
frdr.preservation.datetime
2023-11-03
datacite.publicationyear
2022
datacite.resourcetype
Dataset
en_US
datacite.fundingReference.funderName
en_US
datacite.fundingReference.awardNumber
en_US
datacite.fundingReference.awardTitle
en_US
frdr.crdc.code
RDF1020399
en_US
frdr.crdc.group_en
Computer and information sciences
en_US
frdr.crdc.class_en
Programming languages and software engineering
en_US
frdr.crdc.field_en
Programming language and software engineering, not elsewhere classified
en_US
frdr.crdc.group_fr
Informatique et systèmes d'information
fr_CA
frdr.crdc.class_fr
Langages de programmation et génie logiciel
fr_CA
frdr.crdc.field_fr
Langages de programmation et génie logiciel, non classé ailleurs
fr_CA