Beyond Syntax: How Do LLMs Understand Code?

Saved in:
Bibliographic Details
Published in: The Institute of Electrical and Electronics Engineers, Inc. (IEEE) Conference Proceedings (2025), p. 86-90
Main Author: North, Marc
Other Authors: Atapour-Abarghouei, Amir; Bencomo, Nelly
Published:
The Institute of Electrical and Electronics Engineers, Inc. (IEEE)
Subjects: Programming languages, Semantics, Engineering research, Syntax, Large language models, Software engineering, Representations, Coding, Social
Online Access: Citation/Abstract

MARC

LEADER 00000nab a2200000uu 4500
001 3217773908
003 UK-CbPIL
024 7 |a 10.1109/ICSE-NIER66352.2025.00023  |2 doi 
035 |a 3217773908 
045 2 |b d20250101  |b d20251231 
084 |a 228229  |2 nlm 
100 1 |a North, Marc  |u Durham University,CS,Durham,UK 
245 1 |a Beyond Syntax: How Do LLMs Understand Code? 
260 |b The Institute of Electrical and Electronics Engineers, Inc. (IEEE)  |c 2025 
513 |a Conference Proceedings 
520 3 |a Conference Title: 2025 IEEE/ACM 47th International Conference on Software Engineering: New Ideas and Emerging Results (ICSE-NIER). Conference Start Date: 2025 April 27. Conference End Date: 2025 May 3. Conference Location: Ottawa, ON, Canada. Within software engineering research, Large Language Models (LLMs) are often treated as ‘black boxes’, with only their inputs and outputs being considered. In this paper, we take a machine interpretability approach to examine how LLMs internally represent and process code. We focus on variable declaration and function scope, training classifier probes on the residual streams of LLMs as they process code written in different programming languages to explore how LLMs internally represent these concepts across different programming languages. We also look for specific attention heads that support these representations and examine how they behave for inputs of different languages. Our results show that LLMs have an understanding, and internal representation, of language-independent coding semantics that goes beyond the syntax of any specific programming language, using the same internal components to process code, regardless of the programming language that the code is written in. Furthermore, we find evidence that these language-independent semantic components exist in the middle layers of LLMs and are supported by language-specific components in the earlier layers that parse the syntax of specific languages and feed into these later semantic components. Finally, we discuss the broader implications of our work, particularly in relation to concerns that AI, with its reliance on large datasets to learn new programming languages, might limit innovation in programming language design. By demonstrating that LLMs have a language-independent representation of code, we argue that LLMs may be able to flexibly learn the syntax of new programming languages while retaining their semantic understanding of universal coding concepts. In doing so, LLMs could promote creativity in future programming language design, providing tools that augment rather than constrain the future of software engineering. 
653 |a Programming languages 
653 |a Semantics 
653 |a Engineering research 
653 |a Syntax 
653 |a Large language models 
653 |a Software engineering 
653 |a Representations 
653 |a Coding 
653 |a Social 
700 1 |a Atapour-Abarghouei, Amir  |u Durham University,CS,Durham,UK 
700 1 |a Bencomo, Nelly  |u Durham University,CS,Durham,UK 
773 0 |t The Institute of Electrical and Electronics Engineers, Inc. (IEEE) Conference Proceedings  |g (2025), p. 86-90 
786 0 |d ProQuest  |t Science Database 
856 4 1 |3 Citation/Abstract  |u https://www.proquest.com/docview/3217773908/abstract/embedded/L8HZQI7Z43R0LA5T?source=fedsrch
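
Illustrative note: the abstract describes training classifier probes on the residual streams of LLMs as they process code. The sketch below is not the authors' code; it is a minimal probing example under assumed choices (a small causal LM such as "gpt2", an arbitrary middle layer, and a toy hand-labelled set of snippets marking whether the final token sits inside a function scope). It only shows the general shape of the technique: extract per-token hidden states at one layer and fit a linear classifier on them.

# Minimal linear-probe sketch (hypothetical model, layer, and data).
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer
from sklearn.linear_model import LogisticRegression

MODEL = "gpt2"   # assumption: any causal LM whose hidden states are accessible
LAYER = 6        # assumption: a middle layer, where the paper locates semantic components

tok = AutoTokenizer.from_pretrained(MODEL)
model = AutoModelForCausalLM.from_pretrained(MODEL, output_hidden_states=True)
model.eval()

# Toy dataset: snippets in two languages, labelled 1 if the final token is
# inside a function body, 0 if it is at module / top level.
snippets = [
    ("def f():\n    x = 1\n    x", 1),          # Python, function scope
    ("x = 1\nx", 0),                            # Python, module scope
    ("function f() {\n  let x = 1;\n  x", 1),   # JavaScript, function scope
    ("let x = 1;\nx", 0),                       # JavaScript, top level
]

feats, labels = [], []
with torch.no_grad():
    for code, label in snippets:
        ids = tok(code, return_tensors="pt")
        out = model(**ids)
        # hidden_states[LAYER] has shape [batch, seq_len, d_model]; keep the last token.
        feats.append(out.hidden_states[LAYER][0, -1].numpy())
        labels.append(label)

# Linear probe: if the scope property is linearly decodable at this layer,
# accuracy on held-out snippets would rise above chance (here we only fit/score
# on the toy training set).
probe = LogisticRegression(max_iter=1000).fit(feats, labels)
print("train accuracy:", probe.score(feats, labels))

A cross-language comparison in the spirit of the paper would train the probe on one language's snippets and evaluate it on another's; transfer above chance would suggest a shared, language-independent representation at that layer.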