CS224Day-01

Lecture 1-3 Word Vectors and Language Models && Dependency Parsing

如何让计算机处理自然语言

one-hot

后来，人们开始对词汇进行 「离散的表示」 ，即 「one-hot」 表示。这种方式也曾一度推动了NLP中许多任务，取得了一定的效果。然后这种方式很明显有几个问题

a. 词汇太多，用one-hot表示太稀疏 b. 难以衡量词汇之间的相似度

针对上面的相似度的问题，实际上后面有人想到了使用 「构建词语相似度表」 （word-similarity table）的方式来解决，这样首先需要人工得确定每两个词的相似性程度，这显然是不可能完成的任务，那通过WordNet来获取相似度呢？这样可以小范围的实现，但是明显WordNet是很不完整的。

Word2Vec（低维分布式表示）

再后来，划时代的Word2Vec到来了。它主要的思想就是一句话：

❝ 一个词的意义，应该由其周围经常出现的词来表达。❞

我们可以试着使用公式来表示一下：一句话中，我们选择一个中心词（center/target word），设置一个窗口（window）大小来选择上下文词（context）

我们显然希望对于真实的中心词与上下文词，这个概率值应该尽可能大，这样就说明我们可以使用一个词来预测其周围的词。

但是，我们在优化时发现，上面这个式子求导不太方便，我们如果对中心词和周围词采用两套词向量分别表示 ,即中心词用vc表示，周围词用uo表示， 求导就会容易很多

上面的算法，就是 skip-gram 算法的核心。

GloVe词向量

GloVe是斯坦福团队来2014年提出一个新的词向量，GloVe的全名叫“Global Vectors”，重点在于这个global，即它是直接利用全局的统计信息进行训练的。

GloVe会用到全局的词语之间共现的统计信息，因此我们需要首先构建共现矩阵.简化之前的损失函数，我们可以有一种更加高效的计算方法，我们不用一个窗口一个窗口慢慢地滑动计算，而是直接把这些重复的项一起计算。就会发现，本质就是交叉熵

另外，GloVe词向量的训练，是直接面对Xij，即共现矩阵进行优化的。也就是它是直接朝着全局的统计信息进行训练优化的。这样有什么好处呢？

「a」 更充分的利用统计信息 「b」 充分利用语料中的大量重复信息来简化计算

句法结构与依存句法分析

自然语言处理任务中，有很重要的一块，就是分析语言的结构。语言的结构，一般可以有两种视角：

组成关系（Constituency）：组块结构（Constituency = 短语结构语法 = CFG）
依赖关系（Dependency）

前者，主要关心的是句子是怎么构成的，词怎么组成短语。所以研究Constituency，主要是研究忽略语义的“ 语法” 结构（content-free grammars）。

后者，依赖关系，则主要关心的是句子中的每一个词，都依赖于哪个其他的词。

示例：句子 “Look in the large crate in the kitchen by the door” 中，“large” 依赖 “crate”（修饰 crate），“in the kitchen” 依赖 “crate”（说明 crate 的位置），“by the door” 依赖 “kitchen”（说明 kitchen 的位置）；

传统的基于转移的依存分析（Transition-based Parsing）

这里主要介绍Nivre在2003年提出的“Greedy Deterministic Transition-based Parsing”方法，一度成为依存分析的标准方法。这里我简单地介绍一下它的工作原理。

我们构造一个三元组，分别是Stack、Buffer和一个Dependency Set。

Stack最开始只存放一个Root节点；
Buffer则装有我们需要解析的一个句子；
Set中则保存我们分析出来的依赖关系，最开始是空的。

Example：分析 I love WuHan.

从第二行，这个时候Stack中只有[Root,I] ，不构成依赖关系，所以我们需要从Buffer中加入一个单词了，因此采取的Action是Shift（把Buffer中的首个词，移动到Stack中），于是就到了第三行。

第三行，我们的Stack变成了[Root,I,love] ，其中I和Love构成了依赖关系，且是Love指向I，即“向左指”的依赖关系，因此我们将采取“Left Arc”的action，把被依赖的词（此时就是关系中的左边的词）给移除Stack，把这个关系给放入到Dependency Set中。

如何决定Action呢？在Nivre的年代，这里使用是机器学习的方法，需要做繁重的特征工程。这里的特征，往往有个二值特征，即无数个指示条件作为特征，来训练模型，可以想象这么高纬度的特征是十分稀疏的。因此，这种模型的95%左右的解析时间都花费在计算特征上。这也是传统方法的最要问题。

神经依存分析（Neural Dependency Parsing）

神经依存分析方法，是斯坦福团队2014年的研究成果，主要就是利用了神经网络的方法代替了传统机器学习方法、用低维分布式表示来代替传统方法的复杂的高维稀疏特征表示。而整个解析的过程依然是根据之前的Transition-based方法。

首先明确，我们的预测任务是根据当前的状态，即Stack、Buffer、Set的当前状态，来构建特征，然后预测出下一步的动作。

在神经依存分析中，我们的特征是怎么构建的呢？我们可以利用的信息包括词（word）、词性（postag）和依赖关系的标签（label）。我们对这三者，都进行低维分布式表示，即通过Embedding的方法，把离散的word、label、tag都转化成低维向量表示。

对于一个状态，我们可以选取stack、Buffer、set中的某些词和关系，构成一个集合，然后把他们所有的embedding向量都拼接起来，这样就构成了该状态的特征表示。

下面是具体实现一个Minibatch Dependency Parsing的代码：

class PartialParse(object):
    def __init__(self, sentence):
        """Initializes this partial parse.

        @param sentence (list of str): The sentence to be parsed as a list of words.
                                        Your code should not modify the sentence.
        """
        # The sentence being parsed is kept for bookkeeping purposes. Do NOT alter it in your code.
        self.sentence = sentence

        ### YOUR CODE HERE (3 Lines)
        ### Your code should initialize the following fields:
        ###     self.stack: The current stack represented as a list with the top of the stack as the
        ###                 last element of the list.
        ###     self.buffer: The current buffer represented as a list with the first item on the
        ###                  buffer as the first item of the list
        ###     self.dependencies: The list of dependencies produced so far. Represented as a list of
        ###             tuples where each tuple is of the form (head, dependent).
        ###             Order for this list doesn't matter.
        ###
        ### Note: The root token should be represented with the string "ROOT"
        ### Note: If you need to use the sentence object to initialize anything, make sure to not directly 
        ###       reference the sentence object.  That is, remember to NOT modify the sentence object. 

        self.stack = ["ROOT"]
        self.buffer = sentence[:]
        self.dependencies = []

        ### END YOUR CODE


    def parse_step(self, transition):
        """Performs a single parse step by applying the given transition to this partial parse

        @param transition (str): A string that equals "S", "LA", or "RA" representing the shift,
                                left-arc, and right-arc transitions. You can assume the provided
                                transition is a legal transition.
        """
        ### YOUR CODE HERE (~7-12 Lines)
        ### TODO:
        ###     Implement a single parsing step, i.e. the logic for the following as
        ###     described in the pdf handout:
        ###         1. Shift
        ###         2. Left Arc
        ###         3. Right Arc
        if transition == "S":
            self.stack.append(self.buffer.pop(0))
        elif transition == "LA":
            self.dependencies.append((self.stack[-1], self.stack[-2]))
            self.stack.pop(-2)
        elif transition == "RA":
            self.dependencies.append((self.stack[-2], self.stack[-1]))
            self.stack.pop()

        ### END YOUR CODE

    def parse(self, transitions):
        """Applies the provided transitions to this PartialParse

        @param transitions (list of str): The list of transitions in the order they should be applied

        @return dependencies (list of string tuples): The list of dependencies produced when
                                                        parsing the sentence. Represented as a list of
                                                        tuples where each tuple is of the form (head, dependent).
        """
        for transition in transitions:
            self.parse_step(transition)
        return self.dependencies


def minibatch_parse(sentences, model, batch_size):
    """Parses a list of sentences in minibatches using a model.

    @param sentences (list of list of str): A list of sentences to be parsed
                                            (each sentence is a list of words and each word is of type string)
    @param model (ParserModel): The model that makes parsing decisions. It is assumed to have a function
                                model.predict(partial_parses) that takes in a list of PartialParses as input and
                                returns a list of transitions predicted for each parse. That is, after calling
                                    transitions = model.predict(partial_parses)
                                transitions[i] will be the next transition to apply to partial_parses[i].
    @param batch_size (int): The number of PartialParses to include in each minibatch


    @return dependencies (list of dependency lists): A list where each element is the dependencies
                                                    list for a parsed sentence. Ordering should be the
                                                    same as in sentences (i.e., dependencies[i] should
                                                    contain the parse for sentences[i]).
    """
    dependencies = []

    ### YOUR CODE HERE (~8-10 Lines)
    ### TODO:
    ###     Implement the minibatch parse algorithm.  Note that the pseudocode for this algorithm is given in the pdf handout.
    ###
    ###     Note: A shallow copy (as denoted in the PDF) can be made with the "=" sign in python, e.g.
    ###                 unfinished_parses = partial_parses[:].
    ###             Here `unfinished_parses` is a shallow copy of `partial_parses`.
    ###             In Python, a shallow copied list like `unfinished_parses` does not contain new instances
    ###             of the object stored in `partial_parses`. Rather both lists refer to the same objects.
    ###             In our case, `partial_parses` contains a list of partial parses. `unfinished_parses`
    ###             contains references to the same objects. Thus, you should NOT use the `del` operator
    ###             to remove objects from the `unfinished_parses` list. This will free the underlying memory that
    ###             is being accessed by `partial_parses` and may cause your code to crash.

    partial_parses = [PartialParse(sentence) for sentence in sentences]
    unfinished_parses = partial_parses[:]
    while unfinished_parses:
        minibatch = unfinished_parses[:batch_size]
        transitions = model.predict(minibatch)
      
        for partial_parse, transition in zip(minibatch, transitions):
            partial_parse.parse_step(transition)
      
        keep = [partial_parse for partial_parse in minibatch if len(partial_parse.buffer) > 0 or len(partial_parse.stack) > 1]
        unfinished_parses = keep + unfinished_parses[batch_size:]
  
    dependencies = [partial_parse.dependencies for partial_parse in partial_parses]

    ### END YOUR CODE

    return dependencies

Word Vectors and Language Models