CS224Day-01

Word Vectors and Language Models

Posted by Jeffzzc on October 5, 2025

Lectures 1-3: Word Vectors, Language Models, and Dependency Parsing

How do we get computers to process natural language?

one-hot

Later, people moved to a discrete representation of words: the one-hot encoding. For a time this approach drove many NLP tasks forward and achieved reasonable results. However, it has some obvious problems:

a. The vocabulary is huge, so one-hot vectors are extremely sparse.
b. It is hard to measure similarity between words.

For the similarity problem, one later idea was to build a word-similarity table: but that requires manually deciding how similar every pair of words is, which is clearly an impossible task. What about getting similarities from WordNet? That works at a small scale, but WordNet is obviously far from complete.
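
A tiny numpy sketch of problem b, using a toy three-word vocabulary (purely illustrative):

import numpy as np

vocab = ["hotel", "motel", "cat"]   # toy vocabulary
onehot = {w: np.eye(len(vocab))[i] for i, w in enumerate(vocab)}

# Distinct one-hot vectors are always orthogonal, so the dot product says
# "hotel" is exactly as (un)related to "motel" as it is to "cat".
print(onehot["hotel"] @ onehot["motel"])   # 0.0
print(onehot["hotel"] @ onehot["cat"])     # 0.0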

Word2Vec (low-dimensional distributed representations)

Then came the epoch-making Word2Vec. Its core idea fits in one sentence:

❝ A word's meaning should be given by the words that frequently appear around it. ❞

Let's try to express this with a formula. Within a sentence, we choose a center (target) word and set a window size to pick the context words around it.
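
In standard skip-gram notation (a reconstruction; the original formula image did not survive in this post), with corpus positions $t = 1, \dots, T$ and window size $m$, the likelihood we maximize and the loss we minimize are:

$$
L(\theta)=\prod_{t=1}^{T}\prod_{\substack{-m\le j\le m \\ j\ne 0}}P\left(w_{t+j}\mid w_{t};\theta\right),
\qquad
J(\theta)=-\frac{1}{T}\log L(\theta)
$$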

Clearly we want this probability to be as large as possible for the true center-context pairs; that is exactly what it means to use one word to predict the words around it.

However, when we try to optimize, differentiating this expression turns out to be inconvenient. If we give the center words and context words two separate sets of vectors, representing a center word by v_c and a context word by u_o, the derivatives become much easier.
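
With the two vector sets, the probability takes the usual softmax form (standard skip-gram, stated here for completeness):

$$
P(o\mid c)=\frac{\exp\left(u_{o}^{\top}v_{c}\right)}{\sum_{w\in V}\exp\left(u_{w}^{\top}v_{c}\right)}
$$

Differentiating $\log P(o\mid c)$ with respect to $v_c$ now gives the clean gradient $u_o-\sum_{w\in V}P(w\mid c)\,u_w$.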

The algorithm above is the core of the skip-gram algorithm.

GloVe Word Vectors

GloVe is a word-vector model proposed by a Stanford team in 2014. Its full name is "Global Vectors", and the key word is "global": it trains directly on global co-occurrence statistics.

GloVe uses global co-occurrence statistics between words, so the first step is to build a co-occurrence matrix. Simplifying the earlier loss function gives a far more efficient computation: rather than sliding a window across the corpus step by step, we gather all the repeated terms and compute them at once. Doing so reveals that the loss is essentially a cross-entropy.
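
For reference, the objective the GloVe paper ultimately minimizes over the co-occurrence counts $X_{ij}$ is a weighted least-squares loss:

$$
J=\sum_{i,j=1}^{V}f\left(X_{ij}\right)\left(w_{i}^{\top}\tilde{w}_{j}+b_{i}+\tilde{b}_{j}-\log X_{ij}\right)^{2}
$$

where $f$ is a weighting function that down-weights very rare pairs and caps the influence of very frequent ones.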

Moreover, GloVe vectors are trained directly against X_ij, the co-occurrence matrix; that is, the optimization targets the global statistics directly. What does this buy us?

「a」 Fuller use of the corpus statistics. 「b」 The large amount of repetition in the corpus is exploited to simplify computation.

Syntactic Structure and Dependency Parsing

A major part of NLP is analyzing the structure of language. Linguistic structure is usually viewed from one of two angles:

  1. Constituency: phrase structure (constituency = phrase structure grammar = CFG)
  2. Dependency

The former is concerned with how a sentence is built up: how words combine into phrases. Studying constituency is essentially studying grammatical structure while ignoring semantics (context-free grammars).

The latter, dependency, asks which other word each word in the sentence depends on.

  • Example: in the sentence "Look in the large crate in the kitchen by the door", "large" depends on "crate" (it modifies crate), "in the kitchen" depends on "crate" (it gives the crate's location), and "by the door" depends on "kitchen" (it gives the kitchen's location).

Traditional Transition-based Dependency Parsing

Here we focus on the "Greedy Deterministic Transition-based Parsing" method Nivre proposed in 2003, which for a time was the standard approach to dependency parsing. I'll briefly describe how it works.

We maintain a triple: a Stack, a Buffer, and a Dependency Set.

  • The Stack initially holds only a Root node;
  • The Buffer holds the sentence to be parsed;
  • The Set stores the dependencies found so far, and starts out empty.

Example: parse "I love WuHan."
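
The original post shows the parse as a table image; here is a reconstruction (row numbers are the ones the walkthrough below refers to):

| Row | Stack | Buffer | New dependency | Action |
|-----|-------|--------|----------------|--------|
| 1 | [Root] | [I, love, WuHan] | - | Shift |
| 2 | [Root, I] | [love, WuHan] | - | Shift |
| 3 | [Root, I, love] | [WuHan] | love → I | Left-Arc |
| 4 | [Root, love] | [WuHan] | - | Shift |
| 5 | [Root, love, WuHan] | [] | love → WuHan | Right-Arc |
| 6 | [Root, love] | [] | Root → love | Right-Arc |
| 7 | [Root] | [] | - | Done |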

At the second row, the Stack holds only [Root, I], which forms no dependency, so we need to bring a word in from the Buffer. The Action taken is therefore Shift (move the first word of the Buffer onto the Stack), which takes us to the third row.

At the third row the Stack has become [Root, I, love], and I and love form a dependency, with love pointing to I, i.e. a leftward arc. So we take the Left-Arc action: the dependent (here the left word of the pair, I) is removed from the Stack, and the relation is added to the Dependency Set.

How do we decide the Action? In Nivre's day this was done with machine learning and required heavy feature engineering. The features were typically binary: countless indicator conditions served as features for training the model, and features of such high dimensionality are, as you can imagine, extremely sparse. Consequently, these models spend roughly 95% of their parsing time just computing features, which is the main problem with the traditional approach.

Neural Dependency Parsing

Neural dependency parsing is a 2014 result from a Stanford team. The main idea is to replace the traditional machine-learning model with a neural network, and to replace the complex, high-dimensional sparse features with low-dimensional distributed representations. The parsing procedure itself still follows the transition-based method above.

First, to state the prediction task clearly: given the current state, i.e. the current Stack, Buffer, and Set, we build features and predict the next action.

How are features built in neural dependency parsing? The information available includes the words, the part-of-speech tags (POS tags), and the dependency labels. All three are given low-dimensional distributed representations: via embeddings, the discrete words, tags, and labels are all turned into low-dimensional vectors.

For a given state, we select certain words and relations from the Stack, Buffer, and Set, look up their embedding vectors, and concatenate them all; the concatenation is the feature representation of that state.
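
A minimal sketch of this feature construction (the table sizes and index choices below are hypothetical, numpy only):

import numpy as np

rng = np.random.default_rng(0)

# Hypothetical embedding tables: 100 words, 40 POS tags, 20 dependency labels.
E_word  = rng.normal(size=(100, 50))   # 50-d word embeddings
E_tag   = rng.normal(size=(40, 50))    # 50-d POS-tag embeddings
E_label = rng.normal(size=(20, 50))    # 50-d dependency-label embeddings

# A state is summarized by the ids of selected words/tags/labels
# (e.g. top of the stack, first word of the buffer, their children, ...).
word_ids  = [3, 17, 42]   # illustrative picks
tag_ids   = [5, 11, 2]
label_ids = [7]

# Concatenate all the looked-up embeddings into one feature vector;
# this is the input to the feed-forward network that predicts the action.
features = np.concatenate([E_word[word_ids].ravel(),
                           E_tag[tag_ids].ravel(),
                           E_label[label_ids].ravel()])
print(features.shape)   # (350,) = (3 + 3 + 1) * 50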

Below is a concrete implementation of Minibatch Dependency Parsing:

class PartialParse(object):
    def __init__(self, sentence):
        """Initializes this partial parse.

        @param sentence (list of str): The sentence to be parsed as a list of words.
                                        Your code should not modify the sentence.
        """
        # The sentence being parsed is kept for bookkeeping purposes. Do NOT alter it in your code.
        self.sentence = sentence

        ### YOUR CODE HERE (3 Lines)
        ### Your code should initialize the following fields:
        ###     self.stack: The current stack represented as a list with the top of the stack as the
        ###                 last element of the list.
        ###     self.buffer: The current buffer represented as a list with the first item on the
        ###                  buffer as the first item of the list
        ###     self.dependencies: The list of dependencies produced so far. Represented as a list of
        ###             tuples where each tuple is of the form (head, dependent).
        ###             Order for this list doesn't matter.
        ###
        ### Note: The root token should be represented with the string "ROOT"
        ### Note: If you need to use the sentence object to initialize anything, make sure to not directly 
        ###       reference the sentence object.  That is, remember to NOT modify the sentence object. 

        self.stack = ["ROOT"]
        self.buffer = sentence[:]
        self.dependencies = []

        ### END YOUR CODE


    def parse_step(self, transition):
        """Performs a single parse step by applying the given transition to this partial parse

        @param transition (str): A string that equals "S", "LA", or "RA" representing the shift,
                                left-arc, and right-arc transitions. You can assume the provided
                                transition is a legal transition.
        """
        ### YOUR CODE HERE (~7-12 Lines)
        ### TODO:
        ###     Implement a single parsing step, i.e. the logic for the following as
        ###     described in the pdf handout:
        ###         1. Shift
        ###         2. Left Arc
        ###         3. Right Arc
        if transition == "S":
            # Shift: move the first word of the buffer onto the stack
            self.stack.append(self.buffer.pop(0))
        elif transition == "LA":
            # Left-Arc: top of stack is the head, second-from-top the dependent;
            # record (head, dependent) and remove the dependent from the stack
            self.dependencies.append((self.stack[-1], self.stack[-2]))
            self.stack.pop(-2)
        elif transition == "RA":
            # Right-Arc: second-from-top is the head, top the dependent;
            # record (head, dependent) and remove the dependent from the stack
            self.dependencies.append((self.stack[-2], self.stack[-1]))
            self.stack.pop()

        ### END YOUR CODE

    def parse(self, transitions):
        """Applies the provided transitions to this PartialParse

        @param transitions (list of str): The list of transitions in the order they should be applied

        @return dependencies (list of string tuples): The list of dependencies produced when
                                                        parsing the sentence. Represented as a list of
                                                        tuples where each tuple is of the form (head, dependent).
        """
        for transition in transitions:
            self.parse_step(transition)
        return self.dependencies


def minibatch_parse(sentences, model, batch_size):
    """Parses a list of sentences in minibatches using a model.

    @param sentences (list of list of str): A list of sentences to be parsed
                                            (each sentence is a list of words and each word is of type string)
    @param model (ParserModel): The model that makes parsing decisions. It is assumed to have a function
                                model.predict(partial_parses) that takes in a list of PartialParses as input and
                                returns a list of transitions predicted for each parse. That is, after calling
                                    transitions = model.predict(partial_parses)
                                transitions[i] will be the next transition to apply to partial_parses[i].
    @param batch_size (int): The number of PartialParses to include in each minibatch


    @return dependencies (list of dependency lists): A list where each element is the dependencies
                                                    list for a parsed sentence. Ordering should be the
                                                    same as in sentences (i.e., dependencies[i] should
                                                    contain the parse for sentences[i]).
    """
    dependencies = []

    ### YOUR CODE HERE (~8-10 Lines)
    ### TODO:
    ###     Implement the minibatch parse algorithm.  Note that the pseudocode for this algorithm is given in the pdf handout.
    ###
    ###     Note: A shallow copy (as denoted in the PDF) can be made with the "=" sign in python, e.g.
    ###                 unfinished_parses = partial_parses[:].
    ###             Here `unfinished_parses` is a shallow copy of `partial_parses`.
    ###             In Python, a shallow copied list like `unfinished_parses` does not contain new instances
    ###             of the object stored in `partial_parses`. Rather both lists refer to the same objects.
    ###             In our case, `partial_parses` contains a list of partial parses. `unfinished_parses`
    ###             contains references to the same objects. Thus, you should NOT use the `del` operator
    ###             to remove objects from the `unfinished_parses` list. This will free the underlying memory that
    ###             is being accessed by `partial_parses` and may cause your code to crash.

    partial_parses = [PartialParse(sentence) for sentence in sentences]
    unfinished_parses = partial_parses[:]  # shallow copy: same PartialParse objects
    while unfinished_parses:
        # Take the first batch_size parses and get one predicted transition for each
        minibatch = unfinished_parses[:batch_size]
        transitions = model.predict(minibatch)

        for partial_parse, transition in zip(minibatch, transitions):
            partial_parse.parse_step(transition)

        # A parse is finished once its buffer is empty and only ROOT remains on its stack
        keep = [partial_parse for partial_parse in minibatch if len(partial_parse.buffer) > 0 or len(partial_parse.stack) > 1]
        unfinished_parses = keep + unfinished_parses[batch_size:]

    dependencies = [partial_parse.dependencies for partial_parse in partial_parses]

    ### END YOUR CODE

    return dependencies
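
As a quick smoke test, we can run minibatch_parse with a trivial stand-in model (DummyModel is hypothetical, not part of the assignment: it shifts while a parse's buffer has words, otherwise right-arcs):

class DummyModel:
    # Hypothetical stand-in: shift while the buffer has words, otherwise right-arc.
    def predict(self, partial_parses):
        return ["S" if len(p.buffer) > 0 else "RA" for p in partial_parses]

sentences = [["I", "love", "WuHan"], ["so", "short"]]
print(minibatch_parse(sentences, DummyModel(), batch_size=2))
# [[('love', 'WuHan'), ('I', 'love'), ('ROOT', 'I')],
#  [('so', 'short'), ('ROOT', 'so')]]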