[LLM 02] Data cleaning and Tokenizations

This article summarizes key knowledge from Large Language Models: A Survey 1. It continues the series from LLM 01 - Large Language Model Families. 1. Data Cleaning: Data quality is pivotal to the performance of language models. Effective data cleaning techniques, such as filtering and deduplication, can significantly enhance model performance. 1.1 Data Filtering: Data filtering aims to enhance the quality of training data and improve the effectiveness of the trained language models....

June 10, 2024 · 10 min · Thang Nguyen-Duc

[LLM 01] Large Language Model Overview

This article summarizes key knowledge from Large Language Models: A Survey 1. In this article, I summarize Large Language Model families. 1. Basic Architecture: The invention of the Transformer architecture marks another milestone in the development of LLMs. By applying self-attention to compute, in parallel for every word in a sentence or document, an “attention score” that models the influence each word has on another, Transformers allow for much more parallelization than RNNs....

May 1, 2024 · 8 min · Thang Nguyen-Duc