The paper explores the use of large language models (LLMs) for data preprocessing (DP) through instruction tuning, centered on the construction of the Jellyfish dataset. It discusses the difficulty of building generic solutions for DP tasks and highlights the strength of LLMs in processing natural language. Experiments show that the Jellyfish models, particularly Jellyfish-13B, outperform non-LLM methods on both seen and unseen datasets, demonstrating that they generalize to DP tasks beyond those they were tuned on. The paper also analyzes how tuning on single-task versus multi-task data affects DP performance, showing how individual tasks contribute to overall performance.
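As a rough illustration of the instruction-tuning setup, the sketch below shows how a single DP task instance, here entity matching, can be serialized into an instruction/response pair. This is a minimal sketch: the field names, instruction wording, and record format are assumptions for illustration, not the prompt templates actually used to build the Jellyfish dataset.

```python
# Minimal sketch (assumed format, not the paper's exact prompts) of casting
# a data preprocessing task, entity matching, as an instruction-tuning example.

def serialize(record: dict) -> str:
    """Flatten a record into an 'attribute: value' string."""
    return "; ".join(f"{k}: {v}" for k, v in record.items())

def make_example(left: dict, right: dict, is_match: bool) -> dict:
    """Build one instruction/response pair for entity matching."""
    instruction = (
        "You are given two product records. Decide whether they refer "
        "to the same real-world entity. Answer 'Yes' or 'No'."
    )
    prompt = f"Record A: {serialize(left)}\nRecord B: {serialize(right)}"
    return {
        "instruction": instruction,
        "input": prompt,
        "output": "Yes" if is_match else "No",
    }

if __name__ == "__main__":
    a = {"title": "iPhone 13 128GB Blue", "brand": "Apple", "price": "699"}
    b = {"title": "Apple iPhone 13 (128 GB) - Blue", "brand": "Apple", "price": "699.00"}
    print(make_example(a, b, is_match=True))
```

Pairs like this, drawn from several DP tasks (single-task or mixed multi-task), would then feed a standard supervised fine-tuning pipeline, which is the setting whose effect on DP performance the paper analyzes.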
Key ideas extracted from the source content at arxiv.org, by Haochen Zhan..., 03-14-2024
https://arxiv.org/pdf/2312.01678.pdf