Lecture 9 - Pretraining
Pretraining
- Byte-pair encoding (BPE) algorithm
Key properties of BPE (a minimal training sketch appears after this list)
- Advantages
  - Handles OOV words: unseen words can be split into in-vocabulary subwords (e.g., "unhappiness" → "un + happi + ness"), greatly reducing the OOV rate;
  - Efficient vocabulary: subwords are longer than characters but shorter than words, so sequences stay shorter (faster training and inference) while the vocabulary size stays manageable (avoiding the redundancy of a word-level vocabulary);
  - Adapts well across languages: especially helpful for morphologically rich languages (e.g., German and French, with many roots and suffixes), and also usable for Chinese (starting from individual characters and merging frequent words such as "人民" and "国家").
- Disadvantages
  - Data-dependent: if the training data distribution differs greatly from the downstream task, the learned merges may be uninformative (e.g., low-frequency words can be segmented poorly);
  - No global optimality: each step merges only the currently most frequent pair, a greedy strategy that cannot guarantee a globally optimal subword vocabulary;
  - Relatively slow preprocessing: the iterative merging process repeatedly recounts symbol-pair frequencies, so preprocessing very large corpora (e.g., TB-scale text) is costly.
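As a concrete reference for the greedy merging described above, here is a minimal sketch of BPE vocabulary learning, assuming a toy word-frequency dictionary already split into characters; the function names and toy corpus are illustrative, not taken from the lecture or any library.

```python
# Minimal BPE training sketch (toy corpus; names are illustrative).
import re
from collections import Counter

def get_pair_counts(vocab):
    """Count adjacent symbol pairs across all words, weighted by word frequency."""
    pairs = Counter()
    for word, freq in vocab.items():
        symbols = word.split()
        for a, b in zip(symbols, symbols[1:]):
            pairs[(a, b)] += freq
    return pairs

def merge_pair(pair, vocab):
    """Merge every standalone occurrence of `pair` into a single new symbol."""
    bigram = re.escape(" ".join(pair))
    pattern = re.compile(r"(?<!\S)" + bigram + r"(?!\S)")
    new_symbol = "".join(pair)
    return {pattern.sub(new_symbol, word): freq for word, freq in vocab.items()}

# Words pre-split into characters; </w> marks the end of a word.
vocab = {"l o w </w>": 5, "l o w e r </w>": 2,
         "n e w e s t </w>": 6, "w i d e s t </w>": 3}

for _ in range(10):                      # number of merges sets the vocab budget
    pairs = get_pair_counts(vocab)
    if not pairs:
        break
    best = max(pairs, key=pairs.get)     # greedy: merge the currently most frequent pair
    vocab = merge_pair(best, vocab)
    print(best)
```

Each printed pair becomes a merge rule; at tokenization time the same rules are replayed in order, which is why the result depends entirely on the training corpus, as noted above.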
Pretraining encoders (Masked LM)
The classic example is BERT: Bidirectional Encoder Representations from Transformers, which pretrains a Transformer encoder by masking some input tokens and training the model to predict them from bidirectional context (see the masking sketch after the model list below).
Two models were released:
• BERT-base: 12 layers, 768-dim hidden states, 12 attention heads, 110 million params.
• BERT-large: 24 layers, 1024-dim hidden states, 16 attention heads, 340 million params.
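BERT's masking scheme selects roughly 15% of input tokens for prediction; of those, 80% are replaced with [MASK], 10% with a random token, and 10% are left unchanged. The sketch below builds such masked-LM inputs; the token IDs and constants are placeholders for whatever a real WordPiece tokenizer would supply.

```python
# Sketch of BERT-style masked-LM input creation (IDs are placeholders).
import random

MASK_ID = 103          # assumed [MASK] token id
VOCAB_SIZE = 30522     # assumed vocabulary size
IGNORE_INDEX = -100    # unselected positions contribute nothing to the loss

def mask_tokens(token_ids, mask_prob=0.15):
    inputs, labels = list(token_ids), [IGNORE_INDEX] * len(token_ids)
    for i, tok in enumerate(token_ids):
        if random.random() < mask_prob:
            labels[i] = tok                               # predict the original token here
            r = random.random()
            if r < 0.8:
                inputs[i] = MASK_ID                       # 80%: replace with [MASK]
            elif r < 0.9:
                inputs[i] = random.randrange(VOCAB_SIZE)  # 10%: random token
            # remaining 10%: keep the original token unchanged
    return inputs, labels

ids = [2023, 3185, 2001, 6659]   # e.g., token ids for "this movie was terrible"
masked_inputs, targets = mask_tokens(ids)
```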
Pretraining encoder-decoders
What Raffel et al., 2019 found to work best was span corruption. Their model: T5.
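Span corruption replaces contiguous spans of the input with sentinel tokens and trains the decoder to generate the dropped-out spans. A simplified sketch follows; the spans here are chosen by hand rather than sampled, and only the sentinel names follow T5's <extra_id_n> convention.

```python
# Sketch of T5-style span corruption on a token list (spans fixed by hand).
def span_corrupt(tokens, spans):
    """spans: sorted, non-overlapping (start, end) index pairs to drop."""
    corrupted, target = [], []
    prev = 0
    for i, (start, end) in enumerate(spans):
        sentinel = f"<extra_id_{i}>"
        corrupted += tokens[prev:start] + [sentinel]   # input keeps a sentinel per span
        target += [sentinel] + tokens[start:end]       # target restores the dropped span
        prev = end
    corrupted += tokens[prev:]
    target.append(f"<extra_id_{len(spans)}>")          # final sentinel closes the target
    return corrupted, target

tokens = "Thank you for inviting me to your party last week".split()
inp, tgt = span_corrupt(tokens, [(2, 4), (8, 9)])
# inp: Thank you <extra_id_0> me to your party <extra_id_1> week
# tgt: <extra_id_0> for inviting <extra_id_1> last <extra_id_2>
```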
Generative Pretrained Transformer (GPT)
2018's GPT was a big success in pretraining a decoder! We mentioned how pretrained decoders can be used in their capacity as language models. GPT-2, a larger (1.5B-parameter) version of GPT trained on more data, was shown to produce relatively convincing samples of natural language. GPT-3, a far larger model still (175B parameters), demonstrated in-context learning: it can perform new tasks purely from instructions and examples given in its prompt, with no gradient updates.
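The sketch below illustrates few-shot in-context learning: the task demonstrations are placed directly in the prompt and the frozen language model simply continues the text. The generate call is a hypothetical placeholder, not a real API; the translation pairs follow the example format used in the GPT-3 paper.

```python
# Few-shot in-context learning: the task is specified entirely in the prompt.
prompt = (
    "Translate English to French:\n"
    "sea otter => loutre de mer\n"
    "peppermint => menthe poivrée\n"
    "cheese =>"
)
# completion = language_model.generate(prompt)   # hypothetical call, no gradient updates
# A sufficiently large pretrained decoder tends to continue with: " fromage"
```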