CS224Day-05

Pretraining

Posted by Jeffzzc on October 9, 2025

Lecture 9: Pretraining

Pretraining

  • Byte-pair encoding (BPE) algorithm

Key characteristics of BPE

  1. Advantages
  • Handles OOV: out-of-vocabulary words can be split into subwords that are in the vocabulary (e.g., "unhappiness" becomes "un + happy + ness"), greatly reducing the OOV rate;
  • Efficient vocabulary: subwords are longer than characters but shorter than words, so they shorten sequences (speeding up training and inference) while keeping the vocabulary size under control (avoiding the redundancy of a word-level vocabulary);
  • Adapts well across languages: it is especially friendly to morphologically rich languages (e.g., German and French, with many roots and suffixes), and it also works for Chinese (starting from individual characters and merging high-frequency words such as "人民" and "国家").
  2. Disadvantages
  • Depends on the training data: if the training-data distribution differs greatly from the downstream task, the merged subwords may be meaningless (e.g., low-frequency words may be segmented poorly);
  • No global optimality: each step merges only the currently most frequent pair, a greedy strategy that cannot guarantee a globally optimal subword inventory (see the sketch after this list);
  • Relatively slow: the iterative merging process repeatedly counts symbol-pair frequencies, so preprocessing large corpora (e.g., TB-scale text) is expensive.
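To make the greedy merging procedure concrete, here is a minimal sketch of BPE training. The function names (`learn_bpe`, `get_pair_counts`, `merge_pair`) and the toy word frequencies are my own illustration, not from the lecture; the `</w>` end-of-word marker follows the convention of the original BPE paper.

```python
import collections

def get_pair_counts(corpus):
    """Count adjacent symbol pairs across all words, weighted by word frequency."""
    counts = collections.Counter()
    for symbols, freq in corpus.items():
        for pair in zip(symbols, symbols[1:]):
            counts[pair] += freq
    return counts

def merge_pair(corpus, pair):
    """Replace every occurrence of `pair` with its concatenation."""
    merged = {}
    for symbols, freq in corpus.items():
        out, i = [], 0
        while i < len(symbols):
            if i < len(symbols) - 1 and (symbols[i], symbols[i + 1]) == pair:
                out.append(symbols[i] + symbols[i + 1])
                i += 2
            else:
                out.append(symbols[i])
                i += 1
        key = tuple(out)
        merged[key] = merged.get(key, 0) + freq
    return merged

def learn_bpe(word_freqs, num_merges):
    """Greedy BPE training: repeatedly merge the currently most frequent pair."""
    # Start from character-level symbols, with an end-of-word marker.
    corpus = {tuple(word) + ("</w>",): f for word, f in word_freqs.items()}
    merges = []
    for _ in range(num_merges):
        counts = get_pair_counts(corpus)
        if not counts:
            break
        best = counts.most_common(1)[0][0]
        corpus = merge_pair(corpus, best)
        merges.append(best)
    return merges

# Tiny usage example with made-up word frequencies.
merges = learn_bpe({"low": 5, "lower": 2, "newest": 6, "widest": 3}, num_merges=10)
print(merges)
```

Note how each iteration only looks at the single most frequent pair, which is exactly why the resulting subword inventory is greedy rather than globally optimal.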

Pretraining encoders (Masked LM)

The classic example is BERT: Bidirectional Encoder Representations from Transformers.

Two models were released:

• BERT-base: 12 layers, 768-dim hidden states, 12 attention heads, 110 million params.

• BERT-large: 24 layers, 1024-dim hidden states, 16 attention heads, 340 million params.
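To illustrate the masked language modeling objective that BERT is pretrained with, the sketch below randomly replaces a fraction of input tokens with a [MASK] symbol and keeps the originals as prediction targets. The 15% masking rate matches BERT's setup, but the replacement scheme here is simplified for illustration (the real recipe sometimes keeps the original token or substitutes a random one).

```python
import random

MASK_TOKEN = "[MASK]"  # placeholder symbol; the real BERT vocabulary defines its own mask id

def mask_tokens(tokens, mask_prob=0.15, seed=None):
    """Masked-LM corruption sketch: hide a random subset of tokens and return
    (corrupted input, prediction targets). Unmasked positions get a target of
    None, meaning no loss is computed there."""
    rng = random.Random(seed)
    corrupted, targets = [], []
    for tok in tokens:
        if rng.random() < mask_prob:
            corrupted.append(MASK_TOKEN)
            targets.append(tok)       # the encoder must reconstruct this token
        else:
            corrupted.append(tok)
            targets.append(None)      # nothing to predict here
    return corrupted, targets

corrupted, targets = mask_tokens("the quick brown fox jumps over the lazy dog".split(), seed=0)
print(corrupted)
print(targets)
```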

Pretraining encoder-decoders

What Raffel et al., 2019 found to work best was span corruption. Their model: T5.
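As a rough illustration of span corruption: contiguous spans of the input are replaced with sentinel tokens, and the decoder is trained to output each sentinel followed by the tokens it hid. In the sketch below the spans are fixed by hand for clarity (T5 samples them randomly), and the `<extra_id_k>` sentinel naming follows T5's vocabulary.

```python
def span_corrupt(tokens, spans):
    """Span-corruption sketch: each (start, length) span is replaced by a sentinel
    in the input; the target lists each sentinel followed by the tokens it hides,
    so the decoder reconstructs only the corrupted spans."""
    span_map = {start: length for start, length in spans}
    corrupted, target = [], []
    i, sid = 0, 0
    while i < len(tokens):
        if i in span_map:
            sentinel = f"<extra_id_{sid}>"   # sentinel naming as in T5's vocabulary
            corrupted.append(sentinel)
            target.append(sentinel)
            target.extend(tokens[i:i + span_map[i]])
            i += span_map[i]
            sid += 1
        else:
            corrupted.append(tokens[i])
            i += 1
    return corrupted, target

tokens = "Thank you for inviting me to your party last week .".split()
inp, tgt = span_corrupt(tokens, spans=[(2, 2), (8, 1)])  # corrupt "for inviting" and "last"
print(" ".join(inp))   # Thank you <extra_id_0> me to your party <extra_id_1> week .
print(" ".join(tgt))   # <extra_id_0> for inviting <extra_id_1> last
```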

Generative Pretrained Transformer (GPT)

2018's GPT was a big success in pretraining a decoder! We mentioned how pretrained decoders can be used in their capacity as language models. GPT-2, a larger (1.5B-parameter) version of GPT trained on more data, was shown to produce relatively convincing samples of natural language. GPT-3, a far larger model (175B parameters), showed that very large pretrained decoders can do in-context learning: they perform new tasks from examples or instructions given in the prompt alone, with no gradient updates.
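A quick illustration of in-context learning: the task is specified entirely in the prompt through a few input-output demonstrations, and the model continues the pattern without any parameter updates. The prompt below is a toy example; the demonstrations are illustrative only.

```python
# Minimal few-shot prompt sketch. A pretrained decoder-only LM, given this text,
# would ideally continue with the French translation " pain".
prompt = (
    "Translate English to French:\n"
    "sea otter => loutre de mer\n"
    "cheese => fromage\n"
    "plush giraffe => girafe en peluche\n"
    "bread =>"
)
print(prompt)
```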