CS224Day-02

Dependency Parsing HW && RNNs

Posted by Jeffzzc on October 6, 2025

Lecture 4-6 Dependency Parsing HW && RNNs

Dependency Parsing HW

I finished the final part of Assignment 2; below are the final run.py along with its evaluation results:

#!/usr/bin/env python3
# -*- coding: utf-8 -*-
"""
CS224N 2023-2024: Homework 2
run.py: Run the dependency parser.
Sahil Chopra <schopra8@stanford.edu>
Haoshen Hong <haoshen@stanford.edu>
"""
from datetime import datetime
import os
import pickle
import math
import time
import argparse

from torch import nn, optim
import torch
from tqdm import tqdm

from parser_model import ParserModel
from utils.parser_utils import minibatches, load_and_preprocess_data, AverageMeter

parser = argparse.ArgumentParser(description='Train neural dependency parser in pytorch')
parser.add_argument('-d', '--debug', action='store_true', help='whether to enter debug mode')
args = parser.parse_args()

# -----------------
# Primary Functions
# -----------------
def train(parser, train_data, dev_data, output_path, batch_size=1024, n_epochs=10, lr=0.0005):
    """ Train the neural dependency parser.

    @param parser (Parser): Neural Dependency Parser
    @param train_data ():
    @param dev_data ():
    @param output_path (str): Path to which model weights and results are written.
    @param batch_size (int): Number of examples in a single batch
    @param n_epochs (int): Number of training epochs
    @param lr (float): Learning rate
    """
    best_dev_UAS = 0


    ### YOUR CODE HERE (~2-7 lines)
    ### TODO:
    ###      1) Construct Adam Optimizer in variable `optimizer`
    ###      2) Construct the Cross Entropy Loss Function in variable `loss_func` with `mean`
    ###         reduction (default)
    ###
    ### Hint: Use `parser.model.parameters()` to pass optimizer
    ###       necessary parameters to tune.
    ### Please see the following docs for support:
    ###     Adam Optimizer: https://pytorch.org/docs/stable/optim.html
    ###     Cross Entropy Loss: https://pytorch.org/docs/stable/nn.html#crossentropyloss

    optimizer = optim.Adam(parser.model.parameters(), lr=lr)
    loss_func = nn.CrossEntropyLoss(reduction='mean')

    ### END YOUR CODE

    for epoch in range(n_epochs):
        print("Epoch {:} out of {:}".format(epoch + 1, n_epochs))
        dev_UAS = train_for_epoch(parser, train_data, dev_data, optimizer, loss_func, batch_size)
        if dev_UAS > best_dev_UAS:
            best_dev_UAS = dev_UAS
            print("New best dev UAS! Saving model.")
            torch.save(parser.model.state_dict(), output_path)
        print("")


def train_for_epoch(parser, train_data, dev_data, optimizer, loss_func, batch_size):
    """ Train the neural dependency parser for single epoch.

    Note: In PyTorch we can signify train versus test and automatically have
    the Dropout Layer applied and removed, accordingly, by specifying
    whether we are training, `model.train()`, or evaluating, `model.eval()`

    @param parser (Parser): Neural Dependency Parser
    @param train_data ():
    @param dev_data ():
    @param optimizer (nn.Optimizer): Adam Optimizer
    @param loss_func (nn.CrossEntropyLoss): Cross Entropy Loss Function
    @param batch_size (int): batch size

    @return dev_UAS (float): Unlabeled Attachment Score (UAS) for dev data
    """
    parser.model.train() # Places model in "train" mode, i.e. apply dropout layer
    n_minibatches = math.ceil(len(train_data) / batch_size)
    loss_meter = AverageMeter()

    with tqdm(total=(n_minibatches)) as prog:
        for i, (train_x, train_y) in enumerate(minibatches(train_data, batch_size)):
            optimizer.zero_grad()   # remove any baggage in the optimizer
            loss = 0. # store loss for this batch here
            train_x = torch.from_numpy(train_x).long()
            train_y = torch.from_numpy(train_y.nonzero()[1]).long()

            ### YOUR CODE HERE (~4-10 lines)
            ### TODO:
            ###      1) Run train_x forward through model to produce `logits`
            ###      2) Use the `loss_func` parameter to apply the PyTorch CrossEntropyLoss function.
            ###         This will take `logits` and `train_y` as inputs. It will output the CrossEntropyLoss
            ###         between softmax(`logits`) and `train_y`. Remember that softmax(`logits`)
            ###         are the predictions (y^ from the PDF).
            ###      3) Backprop losses
            ###      4) Take step with the optimizer
            ### Please see the following docs for support:
            ###     Optimizer Step: https://pytorch.org/docs/stable/optim.html#optimizer-step

            logits = parser.model(train_x)
            loss = loss_func(logits, train_y)
            loss.backward()
            optimizer.step()

            ### END YOUR CODE
            prog.update(1)
            loss_meter.update(loss.item())

    print ("Average Train Loss: {}".format(loss_meter.avg))

    print("Evaluating on dev set",)
    parser.model.eval() # Places model in "eval" mode, i.e. don't apply dropout layer
    dev_UAS, _ = parser.parse(dev_data)
    print("- dev UAS: {:.2f}".format(dev_UAS * 100.0))
    return dev_UAS


if __name__ == "__main__":
    debug = args.debug

    assert (torch.__version__.split(".") >= ["1", "0", "0"]), "Please install torch version >= 1.0.0"

    print(80 * "=")
    print("INITIALIZING")
    print(80 * "=")
    parser, embeddings, train_data, dev_data, test_data = load_and_preprocess_data(debug)

    start = time.time()
    model = ParserModel(embeddings)
    parser.model = model
    print("took {:.2f} seconds\n".format(time.time() - start))

    print(80 * "=")
    print("TRAINING")
    print(80 * "=")
    output_dir = "results/{:%Y%m%d_%H%M%S}/".format(datetime.now())
    output_path = output_dir + "model.weights"

    if not os.path.exists(output_dir):
        os.makedirs(output_dir)

    train(parser, train_data, dev_data, output_path, batch_size=1024, n_epochs=10, lr=0.0005)

    if not debug:
        print(80 * "=")
        print("TESTING")
        print(80 * "=")
        print("Restoring the best model weights found on the dev set")
        parser.model.load_state_dict(torch.load(output_path))
        print("Final evaluation on test set",)
        parser.model.eval()
        UAS, dependencies = parser.parse(test_data)
        print("- test UAS: {:.2f}".format(UAS * 100.0))
        print("Done!")

Evaluation results:

================================================================================
TESTING
================================================================================
Restoring the best model weights found on the dev set
/workspace/CS224n/a2/run.py:159: FutureWarning: You are using `torch.load` with `weights_only=False` (the current default value), which uses the default pickle module implicitly. It is possible to construct malicious pickle data which will execute arbitrary code during unpickling (See https://github.com/pytorch/pytorch/blob/main/SECURITY.md#untrusted-models for more details). In a future release, the default value for `weights_only` will be flipped to `True`. This limits the functions that could be executed during unpickling. Arbitrary objects will no longer be allowed to be loaded via this mode unless they are explicitly allowlisted by the user via `torch.serialization.add_safe_globals`. We recommend you start setting `weights_only=True` for any use case where you don't have full control of the loaded file. Please open an issue on GitHub for any issues related to this experimental feature.
  parser.model.load_state_dict(torch.load(output_path))
Final evaluation on test set
2919736it [00:00, 57512000.83it/s]                                                                                                                                                                                                                                                             
- test UAS: 89.00
Done!

Language Models

Language models compute the probability of occurrence of a number of words in a particular sequence. The probability of a sequence of m words w_1, …, w_m is denoted as P(w_1, …, w_m).

n-gram Language Models

To compute the probabilities mentioned above, the count of each n-gram could be compared against the frequency of each word. This is called an n-gram Language Model. For instance, if the model takes bi-grams, the frequency of each bi-gram, calculated via combining a word with its previous word, would be divided by the frequency of the corresponding uni-gram.
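
To make this concrete, here is a tiny count-based bigram estimator; it is my own toy sketch (the corpus and function names are invented for illustration), not code from the course:

from collections import Counter

def train_bigram_lm(sentences):
    """Count unigrams and bigrams over a tokenized corpus."""
    unigrams, bigrams = Counter(), Counter()
    for tokens in sentences:
        tokens = ["<s>"] + tokens                      # sentence-start symbol
        for prev, curr in zip(tokens, tokens[1:]):
            unigrams[prev] += 1
            bigrams[(prev, curr)] += 1
    return unigrams, bigrams

def bigram_prob(unigrams, bigrams, prev, curr):
    """P(curr | prev) = count(prev, curr) / count(prev); zero if unseen."""
    if unigrams[prev] == 0:
        return 0.0
    return bigrams[(prev, curr)] / unigrams[prev]

corpus = [["the", "cat", "sat"], ["the", "dog", "sat"]]
uni, bi = train_bigram_lm(corpus)
print(bigram_prob(uni, bi, "the", "cat"))              # 0.5
print(bigram_prob(uni, bi, "cat", "dog"))              # 0.0: an unseen bigram, i.e. the sparsity problem below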

But from the analysis above, it is not hard to see some serious problems with this kind of LM:

  1. Sparsity: as mentioned earlier, the numerator and denominator easily end up being zero, which is exactly the sparsity of n-grams. The larger n is, the worse the problem gets, and most of the n-grams you would like to count may simply never appear.
  2. Storage: although a count-based LM is simple, we have to enumerate every possible n-gram in the corpus and count and store each one; as soon as n grows, the model size balloons.
  3. So if n is too large the model becomes too sparse and too big to store, while if n is too small the prediction accuracy drops noticeably.

The most immediate uses of a language model are text completion and text generation. For completion, a count-based LM is still usable: the suggested words really are the statistically more frequent ones, so users do not directly notice its weaknesses.

For text generation, however, a count-based LM runs into real trouble. The generated text drifts off topic as it goes, then drifts again, so the output never holds a coherent theme. Clearly the n used by such an LM cannot be large, yet once n grows, training becomes very hard. That is the dilemma of count-based LMs.

Window-based Neural Model

This model learns a distributed representation of words, along with the probability function for word sequences expressed in terms of these representations. The input word vectors are used by both the hidden layer and the output layer: the output is a softmax whose argument consists of the standard tanh() hidden layer plus a linear term, W^(3)x + b^(3), that connects all n previous input word vectors directly to the output:

y = softmax(W^(2) tanh(W^(1)x + b^(1)) + W^(3)x + b^(3))
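
A minimal PyTorch sketch of that equation, with a fixed window of n = 4 previous words; all dimensions and names here are my own illustrative choices, not the lecture's code:

import torch
import torch.nn as nn

class WindowLM(nn.Module):
    """y = softmax(W^(2) tanh(W^(1)x + b^(1)) + W^(3)x + b^(3)),
    where x is the concatenation of the n previous word embeddings."""
    def __init__(self, vocab_size, embed_dim=50, window=4, hidden=200):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, embed_dim)
        self.W1 = nn.Linear(window * embed_dim, hidden)        # hidden layer (tanh), carries b^(1)
        self.W2 = nn.Linear(hidden, vocab_size, bias=False)    # hidden -> vocabulary
        self.W3 = nn.Linear(window * embed_dim, vocab_size)    # direct (skip) connection, carries b^(3)

    def forward(self, context_ids):                            # (batch, window) word indices
        x = self.embed(context_ids).flatten(1)                 # (batch, window * embed_dim)
        logits = self.W2(torch.tanh(self.W1(x))) + self.W3(x)
        return torch.softmax(logits, dim=-1)

model = WindowLM(vocab_size=10000)
probs = model(torch.randint(0, 10000, (8, 4)))                 # 8 contexts, 4 previous words each
print(probs.shape)                                             # torch.Size([8, 10000])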

Improvements over n-gram LM:

  • No sparsity problem
  • Don't need to store all observed n-grams

Remaining problems:

  • Fixed window is too small
  • Enlarging the window enlarges W
  • The window can never be large enough!
  • x^(1) and x^(2) are multiplied by completely different weights in W, so there is no symmetry in how the inputs are processed.

Recurrent Neural Networks (RNN)

An RNN is a recurrent neural network. Why "recurrent"? Because however long the sequence is, the RNN keeps cycling through the same network: the weights are identical at every step, and different outputs arise only from different inputs.

This structure gives RNNs the following advantages:

  1. They can process inputs of any length.
  2. The model size does not grow as the input gets longer.
  3. Later steps can make use of information from earlier steps.
  4. The same weight matrix is applied at every step, which lets the network exploit regularities shared across different inputs (see the sketch right after this list).
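
To make the weight sharing concrete, here is a minimal unrolled vanilla RNN step; the sizes and names are my own choices for illustration, not the lecture's code:

import torch
import torch.nn as nn

embed_dim, hidden_dim = 50, 100
W_x = nn.Linear(embed_dim, hidden_dim, bias=False)     # input -> hidden
W_h = nn.Linear(hidden_dim, hidden_dim)                # hidden -> hidden (recurrent)

def run_rnn(inputs):                                   # inputs: (seq_len, embed_dim)
    h = torch.zeros(hidden_dim)                        # initial hidden state
    for x_t in inputs:                                 # any sequence length works...
        h = torch.tanh(W_x(x_t) + W_h(h))              # ...because the same W_x, W_h are reused each step
    return h

print(run_rnn(torch.randn(3, embed_dim)).shape)        # torch.Size([100])
print(run_rnn(torch.randn(30, embed_dim)).shape)       # ten times longer input, same parameters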

However, RNNs also have their drawbacks:

  1. Computation is slow, because each step can only be computed after the previous step has finished.
  2. When the input is very long, it is hard to capture long-distance dependencies.

Training RNNs

  • Get a big corpus of text, which is a sequence of words
  • Feed it into the RNN-LM and compute the output distribution for every step t
  • The loss at step t is the cross-entropy between the predicted probability distribution and the true next word
  • Average this over all steps to get the overall loss for the entire training set (a sketch of this computation follows below)
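
A hedged sketch of that loss computation using PyTorch's built-in nn.RNN; the shapes and hyperparameters are mine, for illustration only:

import torch
import torch.nn as nn

vocab, embed_dim, hidden = 10000, 50, 100
emb = nn.Embedding(vocab, embed_dim)
rnn = nn.RNN(embed_dim, hidden, batch_first=True)      # vanilla (Elman) RNN
out = nn.Linear(hidden, vocab)                         # hidden state -> vocabulary logits
loss_fn = nn.CrossEntropyLoss()                        # cross-entropy, averaged over all steps

tokens = torch.randint(0, vocab, (8, 21))              # a batch of 8 sequences, 21 tokens each
inputs, targets = tokens[:, :-1], tokens[:, 1:]        # predict the next word at every step

hidden_states, _ = rnn(emb(inputs))                    # (8, 20, hidden): one state per step
logits = out(hidden_states)                            # (8, 20, vocab)
loss = loss_fn(logits.reshape(-1, vocab), targets.reshape(-1))
print(loss.item())                                     # average per-step cross-entropy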

Problems with RNNs

Vanishing and exploding gradients

We call this classic RNN architecture the vanilla RNN, or simple RNN; "vanilla" means "plain, with nothing special added", a word you will often see in papers.

When we compute the gradient of the loss J at some step i with respect to an earlier hidden state h_j, we run into the vanishing/exploding gradient problem (when W is very small or very large and i and j are far apart).
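
For reference, the chain-rule expansion behind this statement (paraphrasing the standard lecture derivation in the plain notation used here) is

∂J^(i)/∂h_j = ∂J^(i)/∂h_i · ∏_{k=j+1}^{i} ∂h_k/∂h_{k-1},

and for a vanilla RNN every factor ∂h_k/∂h_{k-1} contains the same recurrent matrix W_h, so the product behaves roughly like W_h raised to the power (i - j). When the relevant singular values of W_h are below 1, this power shrinks toward zero as i - j grows (vanishing gradients); when they are above 1, it blows up (exploding gradients).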

  • With vanishing gradients, parameter updates only reflect the few nearby steps, while steps far in the past barely get updated at all, so this kind of RNN cannot handle "long-distance dependencies".
  • With exploding gradients, every gradient-descent update takes an overly large step, which makes optimization very difficult.

How to fix the vanishing/exploding gradient problems of a vanilla RNN

  • Fixing exploding gradients

As noted above, the main harm of exploding gradients is that each update step becomes far too large. The most direct remedy is to limit the step size, i.e. to shrink the step when needed. This is the "gradient clipping" trick: whenever the gradient norm exceeds some threshold, rescale the gradient down to that threshold before applying the update.
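
In PyTorch this is a single call placed between backward() and the optimizer step; the snippet below is a generic sketch with a stand-in model and loss, not code from the assignment:

import torch
from torch import nn, optim

model = nn.RNN(50, 100)                        # any recurrent model will do here
optimizer = optim.Adam(model.parameters(), lr=1e-3)

output, _ = model(torch.randn(20, 8, 50))      # (seq_len, batch, input_size)
loss = output.pow(2).mean()                    # stand-in loss, just to produce gradients
loss.backward()
torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=5.0)  # rescale if the gradient norm exceeds 5
optimizer.step()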

  • Fixing vanishing gradients

So how do we deal with vanishing gradients? Their most serious consequence is that, when parameters are updated, steps far in the past receive almost no update compared with nearby steps. Seen from another angle, because the hidden state is rewritten at every step, information from distant steps can hardly be carried forward, and therefore can hardly drive any update. In other words, the hidden state keeps being overwritten, so after only a few steps almost nothing of the original information is left. This echoes the power-of-W formula from the vanishing-gradient discussion above: both reflect the fact that a vanilla RNN cannot cope with long-distance dependencies.

  • This leads to tomorrow's topic: the LSTM.