[LLM 02] Data Cleaning and Tokenization
This article summarizes knowledge from Large Language Models: A Survey [1]. It is part of a series and continues LLM 01 - Large Language Model Families.

1. Data Cleaning

Data quality is pivotal to the performance of language models. Data cleaning techniques such as filtering and deduplication can significantly improve model performance.

1.1 Data Filtering

Data filtering aims to improve the quality of the training data and, in turn, the effectiveness of the trained language models....
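
To make the two cleaning steps named above concrete, here is a minimal sketch of a filter followed by exact deduplication. It is an illustration only, not the pipeline used in the survey: the function names (`normalize`, `passes_filter`, `clean_corpus`) and the thresholds (`min_words`, `max_symbol_ratio`) are hypothetical choices made for this example, and real pipelines typically use richer quality signals and near-duplicate detection.

```python
import hashlib
import re


def normalize(text: str) -> str:
    """Lowercase and collapse whitespace so trivially different copies match."""
    return re.sub(r"\s+", " ", text).strip().lower()


def passes_filter(text: str, min_words: int = 5, max_symbol_ratio: float = 0.3) -> bool:
    """Hypothetical heuristic: keep documents that are long enough and
    not dominated by non-alphanumeric symbols. Thresholds are assumptions."""
    words = text.split()
    if len(words) < min_words:
        return False
    symbols = sum(1 for ch in text if not (ch.isalnum() or ch.isspace()))
    return symbols / len(text) <= max_symbol_ratio


def clean_corpus(docs):
    """Apply the filter, then drop exact duplicates (compared after normalization)."""
    seen = set()
    kept = []
    for doc in docs:
        if not passes_filter(doc):
            continue
        digest = hashlib.sha256(normalize(doc).encode("utf-8")).hexdigest()
        if digest in seen:
            continue  # an equivalent document was already kept
        seen.add(digest)
        kept.append(doc)
    return kept
```

Hashing the normalized text keeps memory use bounded by the number of unique documents and catches copies that differ only in case or spacing. It will not catch paraphrased or lightly edited duplicates, which need a fuzzy method instead.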