Core Concepts
Large language models blur the lines between machine- and human-authored code, but DetectCodeGPT offers a novel method to detect machine-generated code by capturing distinct stylized patterns.
Abstract
The article discusses how large language models make it difficult to distinguish machine-authored code from human-authored code. It introduces DetectCodeGPT, a method that strategically perturbs code with spaces and newlines and identifies machine-generated code by its distinct stylized patterns. The study compares lexical diversity, conciseness, and naturalness in both kinds of code to characterize how they differ. Experimental results show that DetectCodeGPT outperforms existing methods in detecting machine-generated code across various models and datasets.
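As a rough illustration of what perturbing code with spaces and newlines could look like, the following Python sketch inserts extra spaces at existing token gaps and extra blank lines after some lines. The function name `perturb_whitespace` and the `space_frac` and `newline_frac` parameters are assumptions made for this example; they are not the procedure or the settings used by DetectCodeGPT.

```python
import random


def perturb_whitespace(code: str, rng: random.Random,
                       space_frac: float = 0.2,
                       newline_frac: float = 0.2) -> str:
    """Return a copy of `code` with extra spaces and blank lines added.

    Hypothetical helper: the two probabilities are illustrative defaults.
    Only whitespace is added, so every non-whitespace character is kept
    in its original order.
    """
    out = []
    for line in code.split("\n"):
        parts = line.split(" ")
        rebuilt = parts[0]
        for part in parts[1:]:
            # With probability `space_frac`, widen this gap by one space.
            extra = " " if rng.random() < space_frac else ""
            rebuilt += " " + extra + part
        out.append(rebuilt)
        # With probability `newline_frac`, add a blank line after this line.
        if rng.random() < newline_frac:
            out.append("")
    return "\n".join(out)
```

Because the sketch also widens leading whitespace, it can alter indentation in layout-sensitive languages such as Python. It guarantees only that the non-whitespace characters are preserved.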
Directory:
- Introduction
  - Large language models revolutionize software engineering tasks like code generation.
- Problem Statement
  - The distinction between machine- and human-authored code is blurring.
- Existing Methods
  - DetectGPT, designed to identify machine-generated text, faces challenges when applied to code.
- Proposed Solution: DetectCodeGPT
  - Strategically perturbs code with spaces and newlines to capture distinct stylized patterns.
- Experimental Evaluation
  - Extensive experiments show superior performance of DetectCodeGPT in detecting machine-generated code.
- Ablation Study
  - A comparison of different perturbation strategies highlights the effectiveness of stylized perturbations.
- Impact of Perturbation Count
  - Increasing the number of perturbations improves detection performance.
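To make the role of the perturbation count concrete, here is a minimal sketch of a DetectGPT-style perturbation discrepancy: the score of the original sample minus the mean score of several perturbed copies. Everything named here is hypothetical. In particular, `toy_score` is a placeholder for a language model's scoring function (for example, a log-likelihood), and `toy_perturb` is a placeholder perturbation; neither reflects the exact scoring or perturbation used by DetectCodeGPT.

```python
import random
import statistics
from typing import Callable


def toy_perturb(code: str, rng: random.Random) -> str:
    """Placeholder perturbation: insert one space at a random position."""
    pos = rng.randrange(len(code) + 1)
    return code[:pos] + " " + code[pos:]


def toy_score(text: str) -> float:
    """Placeholder score (NOT a real model): shorter text scores higher."""
    return -float(len(text))


def perturbation_discrepancy(code: str,
                             score: Callable[[str], float],
                             perturb: Callable[[str, random.Random], str],
                             n_perturbations: int,
                             seed: int = 0) -> float:
    """Score of the original minus the mean score of perturbed copies.

    A larger `n_perturbations` averages over more samples, which makes
    the estimate of the mean perturbed score less noisy.
    """
    if n_perturbations < 1:
        raise ValueError("n_perturbations must be at least 1")
    rng = random.Random(seed)
    perturbed_scores = [score(perturb(code, rng))
                        for _ in range(n_perturbations)]
    return score(code) - statistics.fmean(perturbed_scores)
```

In a real detector, `score` would wrap a model call, and a sample would be flagged as machine-generated when its discrepancy exceeds a threshold chosen on validation data; that thresholding step is left out of this sketch.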
Stats
Experimental results show that our approach significantly outperforms state-of-the-art techniques in detecting machine-generated code.