The paper explores the use of large language models (LLMs) for data preprocessing (DP) through instruction-tuning, centered on the construction of the Jellyfish instruction dataset. It discusses why generic solutions for DP tasks are hard to build and highlights the strengths of LLMs in processing natural language. The experiments show that the Jellyfish models, particularly Jellyfish-13B, outperform non-LLM methods on both seen and unseen datasets, demonstrating that the models generalize to DP tasks beyond those they are tuned for. The paper also analyzes how tuning with single-task versus multi-task data affects DP performance, offering insights into which tasks contribute most to overall performance.
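To make the setup concrete, below is a minimal sketch (not code from the paper) of how an instruction-tuned DP model could be prompted for a typical preprocessing task such as entity matching. The Hugging Face model identifier `NECOUDBFM/Jellyfish-13B`, the example records, and the exact prompt wording are assumptions for illustration only; the actual prompt templates and model hosting details should be taken from the paper and its model card.

```python
# Illustrative sketch: prompting an instruction-tuned LLM for entity matching,
# one of the data preprocessing (DP) tasks discussed in the paper.
# Model ID and prompt template are assumed, not taken from the paper.
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "NECOUDBFM/Jellyfish-13B"  # assumed Hugging Face identifier
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id, device_map="auto")

# The DP task is framed as a natural-language instruction, which is the
# general idea behind instruction-tuning for DP; the wording here is a guess.
prompt = (
    "You are an expert in entity matching.\n"
    "Record A: [name: iPhone 12 Pro 128GB, brand: Apple]\n"
    "Record B: [name: Apple iPhone 12 Pro (128 GB), brand: Apple]\n"
    "Do Record A and Record B refer to the same real-world entity? "
    "Answer with 'Yes' or 'No'."
)

inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
output = model.generate(**inputs, max_new_tokens=10)
answer = tokenizer.decode(
    output[0][inputs["input_ids"].shape[1]:], skip_special_tokens=True
)
print(answer)  # expected to contain "Yes" or "No"
```

Other DP tasks covered by the paper (e.g., error detection or data imputation) would follow the same pattern, with only the instruction text changing.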
Key insights extracted from the paper by Haochen Zhan... on arxiv.org, 03-14-2024: https://arxiv.org/pdf/2312.01678.pdf