Publication

From source code identifiers to natural language terms

dc.contributor.author: Carvalho, Nuno Ramos
dc.contributor.author: Almeida, José João
dc.contributor.author: Henriques, Pedro Rangel
dc.contributor.author: Pereira, Maria João
dc.date.accessioned: 2015-01-15T12:46:09Z
dc.date.available: 2015-01-15T12:46:09Z
dc.date.issued: 2015
dc.description.abstract: Program comprehension techniques often explore program identifiers to infer knowledge about programs. The relevance of source code identifiers as a source of information about programs is already established in the literature, as is their direct impact on future comprehension tasks. Most programming languages enforce some constraints on identifier strings (e.g., white spaces or commas are not allowed). Also, programmers often use word combinations and abbreviations to devise strings that represent single, or multiple, domain concepts in order to increase programming linguistic efficiency (convey more semantics writing less). These strings do not always use explicit marks to distinguish the terms used (e.g., CamelCase or underscores), so techniques often referred to as hard splitting are not enough. This paper introduces Lingua::IdSplitter, a dictionary-based algorithm for splitting and expanding strings that compose multi-term identifiers. It explores the use of general programming and abbreviation dictionaries, but also a custom dictionary automatically generated from software natural language content, prone to include application domain terms and specific abbreviations. This approach was applied to two software packages, written in C, achieving an f-measure of around 90% for correctly splitting and expanding identifiers. A comparison with current state-of-the-art approaches is also presented. [por]
dc.description.sponsorship: This work is funded by National Funds through the FCT – Fundação para a Ciência e a Tecnologia (Portuguese Foundation for Science and Technology) within project PEst-OE/EEI/UI0752/2014. We would like to thank the reviewers for their valuable insight and detailed comments, which aided in improving this paper. We would like to thank Latifa Guerrouj, Philippe Galinier, Yann-Gaël Guéhéneuc, Giuliano Antoniol, and Massimiliano Di Penta for their work in Guerrouj et al. (2012), and Emily Hill, David Binkley, Dawn Lawrie, Lori Pollock and K. Vijay-Shanker for their work in Hill et al. (2013), which allowed the experimental comparison between approaches. [por]
dc.identifier.citation: Carvalho, Nuno; Almeida, José João; Henriques, Pedro; Pereira, Maria João (2015). From source code identifiers to natural language terms. Journal of Systems and Software. ISSN 0164-1212. 100, p. 117-128 [por]
dc.identifier.doi: 10.1016/j.jss.2014.10.013
dc.identifier.issn: 0164-1212
dc.identifier.uri: http://hdl.handle.net/10198/11577
dc.language.iso: eng [por]
dc.peerreviewed: yes [por]
dc.publisher: Elsevier [por]
dc.subject: Program comprehension [por]
dc.subject: Natural language processing [por]
dc.subject: Identifier splitting [por]
dc.title: From source code identifiers to natural language terms [por]
dc.type: journal article
dspace.entity.type: Publication
oaire.awardURI: info:eu-repo/grantAgreement/FCT/5876/PEst-OE%2FEEI%2FUI0752%2F2014/PT
oaire.citation.endPage: 128 [por]
oaire.citation.startPage: 117 [por]
oaire.citation.title: Journal of Systems and Software [por]
oaire.fundingStream: 5876
person.familyName: Pereira
person.givenName: Maria João
person.identifier.ciencia-id: C912-4A49-A3B3
person.identifier.orcid: 0000-0001-6323-0071
person.identifier.rid: G-5999-2011
person.identifier.scopus-author-id: 13907870300
project.funder.identifier: http://doi.org/10.13039/501100001871
project.funder.name: Fundação para a Ciência e a Tecnologia
rcaap.rights: openAccess [por]
rcaap.type: article [por]
relation.isAuthorOfPublication: a20ccfa6-4e84-4c25-ab0d-8d6ba196ffc2
relation.isAuthorOfPublication.latestForDiscovery: a20ccfa6-4e84-4c25-ab0d-8d6ba196ffc2
relation.isProjectOfPublication: a0c3030d-e6ca-4acd-8461-f3d91b18c9e3
relation.isProjectOfPublication.latestForDiscovery: a0c3030d-e6ca-4acd-8461-f3d91b18c9e3
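
Illustrative note on the abstract: the distinction it draws between hard splitting (relying on explicit marks such as CamelCase or underscores) and dictionary-based splitting with abbreviation expansion can be sketched roughly as follows. This is a minimal Python sketch under stated assumptions, not the Lingua::IdSplitter algorithm described in the paper. The function names, the tiny `words` and `abbreviations` dictionaries, and the "fewest terms wins" heuristic are all hypothetical choices made for illustration.

```python
import re


def hard_split(identifier):
    """Split only on explicit marks: underscores and CamelCase boundaries."""
    parts = []
    for chunk in re.split(r"[_\W]+", identifier):
        # Acronym runs, capitalised or lowercase words, and digit runs.
        parts.extend(re.findall(r"[A-Z]+(?![a-z])|[A-Z]?[a-z]+|\d+", chunk))
    return [p.lower() for p in parts]


def soft_split(token, words, abbreviations):
    """Segment a markerless token using dictionaries (assumed heuristic:
    prefer the segmentation with the fewest terms), expanding abbreviations."""
    best = {0: []}  # best[i] holds the chosen segmentation of token[:i]
    for end in range(1, len(token) + 1):
        for start in range(end):
            piece = token[start:end]
            if start in best and (piece in words or piece in abbreviations):
                candidate = best[start] + [abbreviations.get(piece, piece)]
                if end not in best or len(candidate) < len(best[end]):
                    best[end] = candidate
    # Fall back to the unsplit token when no full segmentation exists.
    return best.get(len(token), [token])


def split_identifier(identifier, words, abbreviations):
    """Hard split first, then try a dictionary split on each resulting token."""
    terms = []
    for token in hard_split(identifier):
        terms.extend(soft_split(token, words, abbreviations))
    return terms


# Hypothetical toy dictionaries, for illustration only.
words = {"time", "sort", "times", "or"}
abbreviations = {"str": "string", "len": "length"}

print(hard_split("getHTTPResponse"))                        # ['get', 'http', 'response']
print(hard_split("timesort"))                               # ['timesort']
print(split_identifier("timesort", words, abbreviations))   # ['time', 'sort']
print(split_identifier("strlen", words, abbreviations))     # ['string', 'length']
```

The second call shows why hard splitting alone is not enough for identifiers without explicit marks. A custom dictionary built from a project's own natural language content, as the abstract describes, would take the place of the toy `words` and `abbreviations` sets here.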

Files

Original bundle
Name: From Source code.pdf
Size: 1.23 MB
Format: Adobe Portable Document Format
License bundle
Name: license.txt
Size: 1.75 KB
Format: Item-specific license agreed upon to submission
Description: