Evaluating Large Language Models' Coding Proficiency Beyond Standard Benchmarks
Existing coding benchmarks fall short of comprehensively evaluating the program synthesis abilities of large language models (LLMs). EVOEVAL, a new benchmark suite, addresses this by evolving existing problems into more diverse and challenging domains, enabling a more rigorous assessment of LLM coding capabilities.
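To make the problem-evolution idea concrete, the sketch below shows how an LLM could be prompted to transform a seed benchmark task into a harder or more creative variant. The query_llm helper and the transformation instructions are hypothetical placeholders for illustration only; they are not EVOEVAL's released prompts or code.

def query_llm(prompt: str) -> str:
    """Placeholder for any chat-completion call (e.g., a hosted or local LLM)."""
    raise NotImplementedError("Plug in your LLM client here.")

# Hypothetical transformation instructions; the actual benchmark's prompts may differ.
TRANSFORMATIONS = {
    "difficult": "Rewrite the problem so it demands extra constraints and edge cases.",
    "creative": "Recast the problem as a novel scenario that changes the required logic.",
    "combine": "Merge this problem with a second one so a solution must satisfy both.",
}

def evolve_problem(seed_problem: str, kind: str) -> str:
    """Ask an LLM to evolve a seed benchmark problem into a new, harder variant."""
    instruction = TRANSFORMATIONS[kind]
    prompt = (
        "You are constructing a harder coding benchmark.\n"
        f"Transformation: {instruction}\n"
        "Seed problem:\n"
        f"{seed_problem}\n"
        "Return only the new problem statement."
    )
    return query_llm(prompt)

# Example usage with a HumanEval-style seed prompt (hypothetical):
# new_problem = evolve_problem(seed_task_prompt, kind="difficult")

Each evolved problem would then be paired with a reference solution and test cases before being used to score models, so that functional correctness, not surface similarity, drives the evaluation.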