The KVP10k dataset addresses the need for a large, diverse dataset built specifically for key-value pair (KVP) extraction, a core task in document information extraction. It contains 10,707 richly annotated pages drawn from a wide range of business documents, such as invoices, contracts, and reports.
Key highlights of the dataset:
Diverse Sources: The dataset covers a broad spectrum of document types and sources, including web crawl data and documents from publicfiles.fcc.gov, giving it wide coverage of real-world business documents.
Detailed Annotations: The dataset features extensive annotations, including the labeling of text as keys or values, as well as the identification of unkeyed values and unvalued keys, providing a comprehensive foundation for training and evaluating KVP extraction models.
Benchmark and Metrics: The authors define a benchmark with two distinct tasks, Entity Recognition and Key-Value Pair Detection, each with corresponding evaluation metrics, so that KVP extraction models can be assessed and compared on common ground.
Baseline Results: The authors have provided initial baseline results using an LMDX-like approach, establishing a foundation for future research and advancements in this field.
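To make the annotation categories and the pair-detection task more concrete, here is a minimal Python sketch. It is not the dataset's schema or the authors' metric: the `KVPair` class, its field names, the `pair_f1` function, and the exact-match scoring rule are all assumptions chosen for illustration. A missing key models an unkeyed value and a missing value models an unvalued key.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class KVPair:
    """One annotated item. Hypothetical structure, not the KVP10k schema."""
    key: Optional[str]    # None models an unkeyed value
    value: Optional[str]  # None models an unvalued key


def pair_f1(predicted, gold):
    """Precision, recall, and F1 over pairs, using exact matching (an assumption)."""
    pred_set, gold_set = set(predicted), set(gold)
    true_pos = len(pred_set & gold_set)
    precision = true_pos / len(pred_set) if pred_set else 0.0
    recall = true_pos / len(gold_set) if gold_set else 0.0
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0
    return precision, recall, f1


# Illustrative page annotations (made-up values, not taken from the dataset).
gold = [
    KVPair("Invoice No", "12345"),  # regular key-value pair
    KVPair("Date", None),           # unvalued key
    KVPair(None, "ACME Corp"),      # unkeyed value
]
predicted = [
    KVPair("Invoice No", "12345"),
    KVPair("Date", "2024-05-02"),   # does not match the unvalued key in gold
]

precision, recall, f1 = pair_f1(predicted, gold)
print(f"precision={precision:.2f} recall={recall:.2f} f1={f1:.2f}")
```

In this toy case only one of the two predictions matches one of the three gold items. A real benchmark would likely also compare the locations of keys and values on the page rather than their text alone.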
The KVP10k dataset aims to close a notable gap: the lack of high-quality, diverse datasets for KVP extraction, which has hindered progress in document understanding. By providing this resource, the authors hope to catalyze further research on information extraction from complex business documents.
Key insights extracted from
by Oshri Napars... at arxiv.org 05-02-2024
https://arxiv.org/pdf/2405.00505.pdf