Core Concepts
Effective knowledge cutoffs in LLMs often differ from their reported cutoff dates, largely due to deduplication failures in training pipelines and the temporal misalignment of CommonCrawl dumps.
Summary
The paper examines the complexities of knowledge cutoffs in Large Language Models (LLMs). It distinguishes a model's effective cutoff, the date up to which its knowledge of a given resource is actually current, from the reported cutoff advertised by model creators, and traces the gap to deduplication failures and temporal misalignment in the training data. The analysis reveals substantial discrepancies between reported and effective cutoffs, which undermines the accuracy and usability of LLMs for users who rely on the advertised dates.
Directory:
- Abstract:
- Effective knowledge cutoff vs. reported dates.
- Proposal for automatic determination of effective cutoffs.
- Introduction:
- Lack of transparency: pre-training data is often not released by model creators.
- Importance of understanding knowledge cutoff dates.
- Related Work:
- Documenting LLM training data.
- Membership inference attacks on LLMs.
- Methodology:
- Probing LLMs with time-spanning data to determine resource-level effective cutoffs (see the perplexity sketch after this outline).
- Time-spanning datasets for evaluation.
- Results:
- Analysis of perplexity curves across models and evaluation datasets.
- Complications:
- Deduplication challenges in training pipelines.
- Misalignment Factors:
- Impact of deduplication failures on alignment with reported dates.
- Influence of older data in CommonCrawl dumps on model training.
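
The probing approach can be pictured as measuring how well a model predicts dated versions of a resource. Below is a minimal sketch, assuming a HuggingFace-style causal LM and documents bucketed by month; the model name, bucket format, and helper names are illustrative and not the paper's exact setup.

```python
import math
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

# Illustrative open model; the paper probes several LLMs.
MODEL_NAME = "EleutherAI/pythia-1.4b"
tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
model = AutoModelForCausalLM.from_pretrained(MODEL_NAME).eval()

def doc_perplexity(text: str) -> float:
    """Per-token perplexity of one document under the model."""
    enc = tokenizer(text, return_tensors="pt", truncation=True, max_length=1024)
    with torch.no_grad():
        loss = model(**enc, labels=enc["input_ids"]).loss
    return math.exp(loss.item())

def monthly_perplexity(docs_by_month: dict[str, list[str]]) -> dict[str, float]:
    """Average perplexity per month of a time-spanning resource.

    A sustained rise after some month suggests the model never saw that
    period's version of the resource, i.e. its effective cutoff for it."""
    return {
        month: sum(doc_perplexity(d) for d in docs) / len(docs)
        for month, docs in sorted(docs_by_month.items())
    }
```

Plotting the resulting curve per resource and locating where it stops decreasing (or starts rising) yields a resource-level estimate of the effective cutoff, which can then be compared against the reported date.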
Statistics
"FalconRW removed all documents that had the top-level domain of Wikipedia, so they could use FalconRW in conjunction with a curated version of Wikipedia in the future (as they did in the unreleased main Falcon dataset). They assumed this would deduplicate the data, however, we find that due to versioning and Wikipedia mirrors, there are still near duplicate Wikipedia documents."
"Similarly, the Pile also contains accidentally duplicated Wikipedia documents."
"We show an example in Table 3 of a pair of documents which contain the same three-sentence span."
Quotes
"Imagine a layperson using an LLM for tax advice, without realizing that the effective cutoff of the tax code is 2022 and thus outdated – despite the fact that the reported cutoff is advertised as 2023."
"As re-training a LLM is prohibitively expensive, it is infeasible for LLMs to keep up with living online resources."
"Our results show that knowledge cutoffs are not as simple as they have seemed and that care must be taken both by LLM dataset curators as well as practitioners who seek to use information from these models."