ELMo Paper Notes + Source Code Analysis_vlambda Tech Blog

vlambda
2020-12-01


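1. Close reading of the paper

1.1 Stage 1: pre-training
1.2 Stage 2: applying ELMo to downstream NLP tasks
1.3 Advantages of ELMo

2. Source code analysis

2.1 What ELMo gives you
2.2 ELMo's internal execution flow

3. Applying ELMo to text classification
4. References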

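1. Close reading of the paper

Full title of the paper: Deep contextualized word representations (https://arxiv.org/abs/1802.05365)

1.1 Stage 1: pre-training

ELMo's pre-training is ordinary language model (LM) training: the model learns to predict the next word in a sentence, and in doing so learns to understand the language. Language models typically benefit from the huge amount of text that is available without any labeling.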

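ELMo is a bidirectional language model that combines a forward LM and a backward LM. Its objective is to jointly maximize the log likelihood of the forward and backward directions:

$$\sum_{k=1}^{N} \left( \log p(t_k \mid t_1, \ldots, t_{k-1}; \Theta_x, \overrightarrow{\Theta}_{LSTM}, \Theta_s) + \log p(t_k \mid t_{k+1}, \ldots, t_N; \Theta_x, \overleftarrow{\Theta}_{LSTM}, \Theta_s) \right)$$

where $\Theta_x$ are the parameters of the token representation and $\Theta_s$ are the parameters of the softmax layer; both $\Theta_x$ and $\Theta_s$ are shared by the forward and backward networks. $\overrightarrow{\Theta}_{LSTM}$ and $\overleftarrow{\Theta}_{LSTM}$ are the parameters of the forward and backward LSTM networks respectively.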
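As a rough numeric illustration of this objective, the sketch below sums the forward and backward log-probabilities of every token. The probabilities are made-up values and the function name is hypothetical; in the real model they would come from the shared softmax layer on top of the two LSTM directions.

```python
import math

def bidirectional_log_likelihood(forward_probs, backward_probs):
    """Sum of forward and backward log-probabilities over all tokens.

    forward_probs[k]  stands for p(t_k | t_1 .. t_{k-1})
    backward_probs[k] stands for p(t_k | t_{k+1} .. t_N)
    """
    if len(forward_probs) != len(backward_probs):
        raise ValueError("both directions must score the same tokens")
    return sum(math.log(f) + math.log(b)
               for f, b in zip(forward_probs, backward_probs))

# Made-up probabilities for a 3-token sentence (illustrative values only).
forward_probs = [0.5, 0.25, 0.125]
backward_probs = [0.25, 0.5, 0.5]
objective = bidirectional_log_likelihood(forward_probs, backward_probs)
print(objective)
```

Training maximizes this quantity, i.e. it pushes both directions to assign high probability to the token that actually occurs.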

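The overall structure of the model is shown in the figure below (image from http://jalammar.github.io/illustrated-bert/):

Figure: ELMo model diagram

An input sentence passes through three layers: first a character-based char CNN encoder layer, then two BiLSTM layers in sequence. Many details are involved, so the Char CNN Encoder and the BiLSTM layers are described separately below.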

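(1) Char CNN Encoder

Assume the inputs have shape [b, max_sentence_len, max_word_len], where b is the number of sentences (i.e. the batch dimension), max_sentence_len is the maximum sentence length, and max_word_len is the maximum word length, 50 by default.
[b, max_sentence_len, max_word_len] is then reshaped to [b*max_sentence_len, max_word_len] and passed through an embedding layer, giving [b*max_sentence_len, max_word_len, embedding_dim], where embedding_dim is the dimension of the character embedding, e.g. 16. The weights of this embedding layer can be initialized from a pre-trained embedding or learned during training. In effect, each character id (conceptually a one-hot vector over the character vocabulary) is mapped to a denser vector of length 16.
The output is then transposed to [b*max_sentence_len, embedding_dim, max_word_len] and fed into several convolution layers that differ in kernel size and number of output channels. Each convolution layer outputs [b*max_sentence_len, out_channel, new_h], where new_h follows the usual convolution formula new_h = floor((h - kernel_size + 2p) / stride) + 1 with h = max_word_len. Each [b*max_sentence_len, out_channel, new_h] then goes through a max pooling layer over the new_h dimension, giving [b*max_sentence_len, out_channel]. The outputs of all convolution layers are concatenated along the out_channel dimension. Calling the concatenated size n_filters, we get [b*max_sentence_len, n_filters], with n_filters = 2048 in the official model.
The result then passes through 2 highway layers. In the source code (AllenNLP's Highway module) each layer computes y = g * x + (1 - g) * f(A(x)), where A is a linear transformation, f is an element-wise non-linearity and g = sigmoid(B(x)) is an element-wise gate.
A highway layer does not change the shape of its input, so the output of this step is still [b*max_sentence_len, n_filters].
Finally a linear projection layer outputs [b*max_sentence_len, output_dim], with output_dim = 512, which is reshaped to [b, max_sentence_len, output_dim]. This is the embedding produced by the char CNN encoder.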
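The shape bookkeeping above can be traced with plain arithmetic. The sketch below is only a shape calculator, not a model: the helper names are hypothetical, and the default filter list is the one from the official options file quoted later in this post.

```python
def conv_output_length(h, kernel_size, padding=0, stride=1):
    """new_h = floor((h - kernel_size + 2p) / stride) + 1"""
    return (h - kernel_size + 2 * padding) // stride + 1

def char_cnn_shapes(b, max_sentence_len, max_word_len=50, embedding_dim=16,
                    filters=((1, 32), (2, 32), (3, 64), (4, 128),
                             (5, 256), (6, 512), (7, 1024)),
                    output_dim=512):
    """Return the tensor shape after each step of the char CNN encoder."""
    n = b * max_sentence_len
    shapes = {
        "inputs": (b, max_sentence_len, max_word_len),
        "reshaped": (n, max_word_len),
        "embedded": (n, max_word_len, embedding_dim),
        "transposed": (n, embedding_dim, max_word_len),
    }
    for width, out_channel in filters:
        new_h = conv_output_length(max_word_len, width)
        shapes[f"conv_w{width}"] = (n, out_channel, new_h)
    n_filters = sum(out_channel for _, out_channel in filters)
    shapes["pooled_concat"] = (n, n_filters)   # max pool, then concat
    shapes["highway"] = (n, n_filters)         # shape unchanged
    shapes["projected"] = (b, max_sentence_len, output_dim)
    return shapes

shapes = char_cnn_shapes(b=3, max_sentence_len=11)
for name, shape in shapes.items():
    print(f"{name:>14}: {shape}")
```

With b=3 and max_sentence_len=11 this reproduces the [3*11, 2048] and [3, 11, 512] shapes that appear in the source walkthrough in section 2.2.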

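(2) Bi_LSTM

The char CNN encoder output goes through the forward layer, giving [b, max_sentence_len, hidden_size], with hidden_size = 512.
The char CNN encoder output goes through the backward layer, giving [b, max_sentence_len, hidden_size], with hidden_size = 512.
The forward and backward outputs are concatenated along the hidden_size dimension, giving [b, max_sentence_len, 2*hidden_size].

Note: there are 2 Bi_LSTM layers with skip connections between them, i.e. a residual connection links the two BiLSTM layers: the output of the first layer is not only the input of the second layer but is also added to the output of the second layer. The representations of both layers are returned, so the Bi_LSTM stack returns [num_layers, b, max_sentence_len, 2*hidden_size], where num_layers = 2 is the number of Bi_LSTM layers.

After all of this, each sentence ends up with 3 layers of representation:

The bottom layer is the token embedding produced by the char CNN. It is context-insensitive.
The token embedding from the first BiLSTM layer, which carries more syntactic information (syntax).
The token embedding from the second BiLSTM layer, which carries more word-sense information (word sense).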
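To make the skip connection and the forward/backward concatenation concrete, here is a toy sketch for a single token vector. The "layers" are plain callables standing in for real LSTM layers (hypothetical stand-ins, not AllenNLP classes), so only the wiring is meaningful.

```python
def add(xs, ys):
    return [x + y for x, y in zip(xs, ys)]

def run_bilstm_stack(token_vec, forward_layers, backward_layers):
    """Return one [forward; backward] vector per layer for a single token."""
    fwd, bwd = token_vec, token_vec
    outputs = []
    for i, (f_layer, b_layer) in enumerate(zip(forward_layers, backward_layers)):
        fwd_in, bwd_in = fwd, bwd            # cache inputs for the skip connection
        fwd, bwd = f_layer(fwd_in), b_layer(bwd_in)
        if i != 0:                           # skip connection from the 2nd layer on
            fwd, bwd = add(fwd, fwd_in), add(bwd, bwd_in)
        outputs.append(fwd + bwd)            # list concat = concat on the feature dim
    return outputs

double = lambda v: [2 * x for x in v]        # toy "forward LSTM layer"
negate = lambda v: [-x for x in v]           # toy "backward LSTM layer"
layers = run_bilstm_stack([1.0, 2.0], [double, double], [negate, negate])
print(layers)
```

Each per-layer output is twice as wide as the input, which mirrors the 2*hidden_size dimension in the text.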

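1.2 Stage 2: applying ELMo to downstream NLP tasks

From stage 1 we know that ELMo produces L+1 = 3 representations for each token (L = 2 is the number of BiLSTM layers). That is, for each token $t_k$ we have the following representations:

$$R_k = \{ x_k^{LM}, \overrightarrow{h}_{k,j}^{LM}, \overleftarrow{h}_{k,j}^{LM} \mid j = 1, \ldots, L \} = \{ h_{k,j}^{LM} \mid j = 0, \ldots, L \}$$

where $h_{k,0}^{LM}$ is the token layer and $h_{k,j}^{LM} = [\overrightarrow{h}_{k,j}^{LM}; \overleftarrow{h}_{k,j}^{LM}]$ is the $j$-th BiLSTM layer. To keep the dimensions consistent, the token layer is duplicated, i.e. $h_{k,0}^{LM} = [x_k^{LM}; x_k^{LM}]$.

When these embeddings are used in a downstream NLP task, ELMo collapses all layers in $R_k$ into a single vector, $ELMo_k = E(R_k; \Theta_e)$. The simplest case takes only the top layer, $E(R_k) = h_{k,L}^{LM}$, or simply averages the three layers.

More generally, a set of task-specific weights is learned for a given task, and the layers are combined by a scalar mix according to these weights:

$$ELMo_k^{task} = E(R_k; \Theta^{task}) = \gamma^{task} \sum_{j=0}^{L} s_j^{task} h_{k,j}^{LM}$$

where $s^{task}$ are softmax-normalized weights and the scalar parameter $\gamma^{task}$ allows the task model to scale the entire ELMo vector, which is of practical importance for the optimization process. As reported in the paper, a smaller regularization strength $\lambda$ on the mixing weights usually works better in most cases.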
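The weighted combination can be sketched in a few lines of plain Python for one token. Function and variable names are hypothetical and the layer values are made up; the point is only the formula: softmax over the per-layer scalars, a weighted sum of the layers, then scaling by gamma.

```python
import math

def softmax(scores):
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def scalar_mix(layers, scalar_parameters, gamma=1.0):
    """gamma * sum_j softmax(scalar_parameters)_j * layers[j], for one token."""
    weights = softmax(scalar_parameters)
    dim = len(layers[0])
    mixed = [0.0] * dim
    for w, layer in zip(weights, layers):
        for i in range(dim):
            mixed[i] += w * layer[i]
    return [gamma * x for x in mixed]

# Three layers (token layer, BiLSTM 1, BiLSTM 2) for one token, toy values.
layers = [[3.0, 0.0], [0.0, 3.0], [3.0, 3.0]]
uniform = scalar_mix(layers, [0.0, 0.0, 0.0])                  # plain average
top_only = scalar_mix(layers, [-1e9, -1e9, 0.0], gamma=2.0)    # top layer, scaled
print(uniform, top_only)
```

Equal scalars reduce to the "average the three layers" case, and a dominant last scalar approaches the "top layer only" case.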

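Because the outputs of the biLM layers have different distributions, in some cases, when the distributions differ too much, it helps to apply layer normalization to each biLM layer before weighting.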
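A simplified sketch of why this helps: two layers with the same pattern but very different scales become directly comparable after normalization. This version normalizes a flat list of activations; as far as I understand, the real implementation computes the statistics over the unmasked positions of the whole layer tensor, so treat the details here as an assumption.

```python
import math

def layer_norm(values, eps=1e-12):
    """Normalize one layer's activations to zero mean and unit variance."""
    mean = sum(values) / len(values)
    variance = sum((v - mean) ** 2 for v in values) / len(values)
    return [(v - mean) / math.sqrt(variance + eps) for v in values]

token_layer = [0.1, -0.2, 0.3, -0.2]       # small-scale activations (made up)
lstm_layer = [10.0, -20.0, 30.0, -20.0]    # same pattern, 100x larger scale
normed_token = layer_norm(token_layer)
normed_lstm = layer_norm(lstm_layer)
print(normed_token)
print(normed_lstm)
```

Without normalization the larger-scale layer would dominate the weighted sum regardless of the learned weights.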

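1.3 Advantages of ELMo

A bidirectional language model captures the context around the current word better.
The token embeddings produced by ELMo are computed dynamically from the context of the sentence in which the word appears, instead of being traditional fixed word vectors, so a polysemous word gets different representations in different contexts.
Deeper representations: rather than only the top layer, three layers are available: the context-insensitive token embedding, the first LSTM layer, which captures more syntactic information, and the second LSTM layer, which captures more word-sense information.
Task-specific weighted fusion of the layers plugs easily into downstream tasks: you can extract the ELMo embeddings and use them in place of the downstream model's embedding layer, or add ELMo's scalar mixer to the downstream model and train them together to learn layer weights suited to the downstream model.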

2. Source code analysis

2.1 What ELMo gives you

Before reading the source, let's try using ELMo and see what we get back for input sentences.

Install ELMo: pip3 install allennlp

options.json: https://s3-us-west-2.amazonaws.com/allennlp/models/elmo/2x4096_512_2048cnn_2xhighway/elmo_2x4096_512_2048cnn_2xhighway_options.json
weights.hdf5: https://s3-us-west-2.amazonaws.com/allennlp/models/elmo/2x4096_512_2048cnn_2xhighway/elmo_2x4096_512_2048cnn_2xhighway_weights.hdf5

Test demo:

from allennlp.modules.elmo import Elmo, batch_to_ids

model_dir = 'E:/pretrained_model/elmo_original/'
options_file = model_dir+'elmo_2x4096_512_2048cnn_2xhighway_options.json'
weights_file = model_dir+'elmo_2x4096_512_2048cnn_2xhighway_weights.hdf5'

num_output_representations = 2 # 2 or 1

elmo = Elmo(
    options_file=options_file,
    weight_file=weights_file,
    num_output_representations=num_output_representations,
    dropout=0,
)

sentence_lists = [['I', 'have', 'a', 'dog', ',', 'it', 'is', 'so', 'cute'],
                  ['That', 'is', 'a', 'question'],
                  ['an']]
    
character_ids = batch_to_ids(sentence_lists)  # tokens -> character ids
print('character_ids:', character_ids.shape)  # [3, 9, 50]

res = elmo(character_ids)    
print(len(res['elmo_representations']))  # 2   
print(res['elmo_representations'][0].shape)  # [3, 9, 1024]    
print(res['elmo_representations'][1].shape)  # [3, 9, 1024]  
  
print(res['mask'].shape)  # [3, 9]

Output:

character_ids: torch.Size([3, 9, 50])
2
torch.Size([3, 9, 1024])
torch.Size([3, 9, 1024])
torch.Size([3, 9])

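The model output is already a linear combination of the three layers (character-convnet output, 1st lstm output, 2nd lstm output).

Note: num_output_representations specifies how many differently weighted linear combinations to produce, not which layer to output. In the code num_output_representations can be an arbitrary integer and has nothing to do with the number of LSTM layers. The documentation at https://docs.allennlp.org/v1.2.2/api/modules/elmo/ explains it: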

Typically num_output_representations is 1 or 2. For example, in the case of the SRL model in the above paper, num_output_representations=1 where ELMo was included at the input token representation layer. In the case of the SQuAD model, num_output_representations=2 as ELMo was also included at the GRU output layer.

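That is, the value is usually 1 or 2. For a model like the SRL (Semantic Role Labeling) model, which uses ELMo token embeddings as the model input, the value is 1. For a model like the SQuAD (Stanford Question Answering Dataset) model, where ELMo is also included at the model's output layer, the value is 2. The paper also points out that the effect differs depending on where ELMo is added:

Figure: comparison of adding ELMo at different positions of different models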
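The following sketch illustrates the idea that each requested representation owns an independent set of mixing parameters over the same three layers. It is plain Python with hypothetical names, and it assumes the mixing scalars start at zero with gamma = 1 (which I believe is the ScalarMix default), so the two outputs coincide until task training moves them apart.

```python
import math

def mix(layers, scalars, gamma=1.0):
    """Softmax-weighted sum of the layers for one token, scaled by gamma."""
    exps = [math.exp(s) for s in scalars]
    total = sum(exps)
    weights = [e / total for e in exps]
    return [gamma * sum(w * layer[i] for w, layer in zip(weights, layers))
            for i in range(len(layers[0]))]

layers = [[1.0, 4.0], [2.0, 5.0], [3.0, 6.0]]   # 3 layers for one token (made up)

# One independent set of mixing parameters per requested representation.
mix_params = [
    {"scalars": [0.0, 0.0, 0.0], "gamma": 1.0},   # e.g. used at the model input
    {"scalars": [0.0, 0.0, 0.0], "gamma": 1.0},   # e.g. used at the model output
]
representations = [mix(layers, **p) for p in mix_params]
print(len(representations), representations)

# After task training the two sets would normally diverge, for example:
trained = mix(layers, scalars=[0.0, 0.0, 5.0], gamma=0.5)
print(trained)
```

The number of entries in mix_params plays the role of num_output_representations; the number of layers stays 3 throughout.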

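2.2 ELMo's internal execution flow

Next we walk through the complete flow, from the input sentences to the final linearly combined embedding. I downloaded the official PyTorch code into a local IDE to read, debug and test it.

Official PyTorch code: https://github.com/allenai/allennlp/tree/master/allennlp
Official TensorFlow code: https://github.com/allenai/bilm-tf

Start in allennlp/modules/elmo.py, which contains the core Elmo class. The rough call chain is as follows. batch_to_ids() converts the text sentences to ids, which are then passed to Elmo. Elmo's forward calls self._elmo_lstm(), which is an instance of the _ElmoBiLm class; _ElmoBiLm inherits from nn.Module. The forward of _ElmoBiLm first calls self._token_embedder() and then calls its own self._elmo_lstm() on the result. _token_embedder is an instance of _ElmoCharacterEncoder, which performs the token embedding, CNN convolution, pooling, concat, highway and projection steps, matching the details described in (1) Char CNN Encoder of section 1.1. The _elmo_lstm inside _ElmoBiLm is an instance of ElmoLstm, which implements the BiLSTM and concatenates the forward and backward outputs; the first and second LSTM layers are also linked by skip connections (the output of the first layer is not only the input of the second layer but is also added to the output of the second layer). This matches the details described in (2) Bi_LSTM of section 1.1.

Below we look at each functional module and its inputs and outputs in detail. I wrote my own test script that follows the whole execution flow. First, the basic inputs, i.e. the sentences and options.json: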

_options = {"lstm":
               {"use_skip_connections": True,
                "projection_dim": 512,
                "cell_clip": 3,
                "proj_clip": 3,
                "dim": 4096,
                "n_layers": 2},
            "char_cnn":
               {"activation": "relu",
                "filters": [[1, 32], [2, 32], [3, 64], [4, 128], [5, 256], [6, 512], [7, 1024]],
                "n_highway": 2,
                "embedding": {"dim": 16},
                "n_characters": 262,
                "max_characters_per_token": 50}
    }

sentence_lists = [['I', 'have', 'a', 'dog', ',', 'it', 'is', 'so', 'cute'],
                  ['That', 'is', 'a', 'question'],
                  ['an']]

For easy testing, _options is copied directly from the contents of options.json. First, use batch_to_ids to convert sentence_lists to ids:

sentence_to_ids = batch_to_ids(sentence_lists)
print(sentence_to_ids.shape) # [3, 9, 50]

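The resulting shape is [3, 9, 50], i.e. [batch, max_sentence_len, max_word_len]. max_sentence_len = 9 because the first sentence is the longest, with 9 tokens. The last dimension, max_word_len, is the maximum word length: each token consists of at most 50 chars, including <begin word> and <end word>.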

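(1) Char CNN Encoder

This part corresponds to the Char CNN Encoder details in section 1.1: token embedding, CNN convolution, pooling, concat, highway and projection, all of which happen in the forward of _ElmoCharacterEncoder. The test script below mimics that forward, mainly to make the shape changes and the specific layers clear.

1 Add sentence begin and end markers

First, <begin sentence> and <end sentence> markers are added at the start and end of each sentence, so the shape of inputs changes from [batch, max_sentence_len, max_word_len] to [batch, max_sentence_len+2, max_word_len]: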

import numpy
import torch
from allennlp.data.token_indexers.elmo_indexer import ELMoCharacterMapper
from allennlp.nn.util import add_sentence_boundary_token_ids

inputs = sentence_to_ids
# Add BOS/EOS
mask = (inputs > 0).sum(dim=-1) > 0
print(mask.shape, mask) # [3, 9]
character_ids_with_bos_eos, mask_with_bos_eos = add_sentence_boundary_token_ids(
        inputs,
        mask,
        torch.from_numpy(
            numpy.array(ELMoCharacterMapper.beginning_of_sentence_characters) + 1  # Jane: 256+1
        ),
        torch.from_numpy(
            numpy.array(ELMoCharacterMapper.end_of_sentence_characters) + 1  # Jane: 257+1 (EOS, not BOS)
        )
    )
print(character_ids_with_bos_eos.shape, mask_with_bos_eos.shape) # [3, 11=1+9+1, 50], [3, 11]
print(character_ids_with_bos_eos[-1][:5]) # BOS an EOS
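The behaviour printed above (BOS, an, EOS for the shortest sentence) can be mimicked at the token level without any tensors. The sketch below uses hypothetical names and string markers in place of character-id rows; it only shows that the end marker goes right after the last real token of each sentence, and that every row grows by exactly 2.

```python
BOS, EOS, PAD = "<S>", "</S>", None

def add_sentence_boundaries(padded_batch):
    """Wrap each padded sentence in BOS/EOS; every row grows by exactly 2."""
    max_len = len(padded_batch[0])
    out = []
    for sentence in padded_batch:
        tokens = [t for t in sentence if t is not PAD]
        wrapped = [BOS] + tokens + [EOS]
        out.append(wrapped + [PAD] * (max_len + 2 - len(wrapped)))
    return out

def make_mask(padded_batch):
    """True for real positions (including BOS/EOS), False for padding."""
    return [[t is not PAD for t in sentence] for sentence in padded_batch]

batch = [["That", "is", "a", "question"],
         ["an", PAD, PAD, PAD]]
with_boundaries = add_sentence_boundaries(batch)
mask = make_mask(with_boundaries)
print(with_boundaries)
print(mask)
```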

With the sentence begin and end markers added, character_ids_with_bos_eos has shape [3, 11, 50] and the mask has shape [3, 11]; dim=1 is 2 larger than before. The <begin word>/<end word> and <begin sentence>/<end sentence> markers are defined in allennlp/data/token_indexers/elmo_indexer.py:

max_word_length = 50

# char ids 0-255 come from utf-8 encoding bytes
# assign 256-300 to special chars
beginning_of_sentence_character = 256  # <begin sentence>
end_of_sentence_character = 257  # <end sentence>
beginning_of_word_character = 258  # <begin word>
end_of_word_character = 259  # <end word>
padding_character = 260  # <padding>
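Using these constants, a single token can be turned into its 50 character ids roughly as follows. This is a sketch modeled on my reading of ELMoCharacterMapper, with a hypothetical function name; in particular the final shift by one (so that id 0 remains free for fully padded tokens, which is what the `(inputs > 0)` mask relies on) is how I understand the library to work.

```python
MAX_WORD_LENGTH = 50
BEGINNING_OF_WORD, END_OF_WORD, PADDING = 258, 259, 260

def word_to_char_ids(word, max_word_length=MAX_WORD_LENGTH):
    """Encode one token as max_word_length character ids."""
    # utf-8 bytes give ids 0-255; leave room for <begin word> and <end word>.
    encoded = word.encode("utf-8", "ignore")[: max_word_length - 2]
    char_ids = [PADDING] * max_word_length
    char_ids[0] = BEGINNING_OF_WORD
    for k, byte in enumerate(encoded, start=1):
        char_ids[k] = byte
    char_ids[len(encoded) + 1] = END_OF_WORD
    # Shift by one so that id 0 stays free for masking padded tokens.
    return [c + 1 for c in char_ids]

ids = word_to_char_ids("an")
print(len(ids), ids[:5])
```

For the token "an" the first five ids are <begin word>, "a", "n", <end word> and padding, each shifted by one.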

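2 Character embedding layer

After the markers are added, the ids go into an embedding layer. Its weights can be initialized from pre-trained embeddings or learned during training. The example below uses randomly initialized weights, _char_embedding_weights, which embed 262 characters as vectors of size 16 each: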

# the character id embedding
max_chars_per_token = 50
_char_embedding_weights = torch.rand(262, 16)
# (batch_size * sequence_length, max_chars_per_token, embed_dim)
character_embedding = torch.nn.functional.embedding(
    character_ids_with_bos_eos.view(-1, max_chars_per_token),  # input [3, 11, 50]=>[3*11, 50]
    _char_embedding_weights,  # weight [n_characters=262, char_embedding_dim=16]
)
print(character_embedding.shape) # [3*11, 50, 16]

character_ids_with_bos_eos is first reshaped to [3*11, 50]; after the embedding layer the output has shape [3*11, 50, 16].
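Conceptually, an embedding lookup is the same as multiplying a one-hot vector by the weight matrix; it just skips the multiplication and picks the row directly. A tiny plain-Python sketch with a toy 4 x 2 weight table (hypothetical names and values):

```python
def embed(char_ids, weights):
    """Row lookup: one embedding vector per character id."""
    return [weights[i] for i in char_ids]

def one_hot_matmul(char_ids, weights):
    """Same result via an explicit one-hot vector times the weight matrix."""
    n_rows, dim = len(weights), len(weights[0])
    out = []
    for i in char_ids:
        one_hot = [1.0 if r == i else 0.0 for r in range(n_rows)]
        out.append([sum(one_hot[r] * weights[r][d] for r in range(n_rows))
                    for d in range(dim)])
    return out

weights = [[0.0, 0.0], [0.1, 0.2], [0.3, 0.4], [0.5, 0.6]]  # 4 ids, dim 2 (toy)
char_ids = [2, 3, 1, 0]
looked_up = embed(char_ids, weights)
multiplied = one_hot_matmul(char_ids, weights)
print(looked_up)
```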

3 Multi-size convolution layers

Next come several convolution layers with different filter sizes:

# transpose
character_embedding = torch.transpose(character_embedding, 1, 2)
print(character_embedding.shape) # [3*11, 50, 16]=>[3*11, 16, 50] =>(batch_size * sequence_length, embed_dim, max_chars_per_token)


cnn_options = _options["char_cnn"]
filters = cnn_options["filters"]
char_embed_dim = cnn_options["embedding"]["dim"]

_convolutions = []
for i, (width, num) in enumerate(filters):
    print('filter kernel_size:', width, 'out channels:', num)
    conv = torch.nn.Conv1d(  # [3*11, 16, 50]=>[3*11, num, [50-width]+1]
        in_channels=char_embed_dim, out_channels=num, kernel_size=width, bias=True
    )
    _convolutions.append(conv)



convs = []
for i in range(len(_convolutions)):
    conv = _convolutions[i]
    convolved = conv(character_embedding)
    # print('convolved:', convolved.shape)
    # (batch_size * sequence_length, n_filters for this width)
    """
    [3*11, 16, 50]=>[3*11, num, [50-width]+1]
    "filters":     [[1, 32],       [2, 32],       [3, 64],       [4, 128],      [5, 256],      [6, 512],     [7, 1024]],
    [3*11, 16, 50]=>[3*11, 32, 50],[3*11, 32, 49],[3*11, 64, 48],[3*11, 128, 47],[3*11, 256, 46],[3*11, 512, 45],[3*11, 1024, 44],
    """
    convolved, _ = torch.max(convolved, dim=-1)  # returns (values, indices); keeping values is a max pool over dim=-1 => [3*11, num]
    # print('convolved:', convolved.shape)
    convolved = torch.nn.functional.relu(convolved)  # _options['char_cnn']['activation']='relu'
    convs.append(convolved)

# (batch_size * sequence_length, n_filters)
token_embedding = torch.cat(convs, dim=-1) # =>[3*11, 2048]
# print(token_embedding.shape) # [3*11, 2048]

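The embedding output is first transposed to shape [3*11, 16, 50]. The config file has 7 different convolution kernel configurations. Each convolution is followed by max pooling (implemented with torch.max(convolved, dim=-1)) and relu. The out channels of the kernels are chosen deliberately so that the final concat comes to exactly 2048, i.e. token_embedding ends up with shape [3*11, 2048].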
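For intuition about what one filter does to one token, here is a dependency-free sketch of a width-2 convolution followed by max-over-positions and relu, in the same order as the listing above. The numbers and names are toy values of my own, not taken from the model.

```python
def conv1d_single_filter(x, kernel, bias=0.0):
    """x: [embed_dim][length], kernel: [embed_dim][width].

    Returns a feature map of length (length - width + 1).
    """
    length, width = len(x[0]), len(kernel[0])
    out = []
    for start in range(length - width + 1):
        acc = bias
        for c in range(len(x)):            # sum over embedding channels
            for k in range(width):         # sum over the kernel window
                acc += x[c][start + k] * kernel[c][k]
        out.append(acc)
    return out

def max_then_relu(feature_map):
    """Max pool over all positions, then relu: one number per filter."""
    return max(0.0, max(feature_map))

# One token: 2 embedding channels x 5 character positions (toy numbers).
x = [[1.0, 0.0, 2.0, 0.0, 1.0],
     [0.0, 1.0, 0.0, 3.0, 0.0]]
kernel = [[1.0, -1.0],
          [0.5, 0.5]]                      # width 2
feature_map = conv1d_single_filter(x, kernel)
pooled = max_then_relu(feature_map)
print(feature_map, pooled)
```

Each filter therefore contributes a single scalar per token, and 2048 filters in total give the [3*11, 2048] token embedding.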

4 Highway layers

Next come the highway layers:

# the highway layers have same dimensionality as the number of cnn filters
cnn_options = _options["char_cnn"]
filters = cnn_options["filters"]
n_filters = sum(f[1] for f in filters)
print('n_filters:', n_filters) # 2048
n_highway = cnn_options["n_highway"] # 2

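from allennlp.modules.highway import Highway

# create the layers (randomly initialized in this test script; the real model loads pre-trained weights)
_highways = Highway(n_filters, n_highway, activation=torch.nn.functional.relu)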


# apply the highway layers (batch_size * sequence_length, n_filters)
token_embedding = _highways(token_embedding)
print(token_embedding.shape) # [3*11, 2048]

Highway layers do not change the input shape; the output of the highway layers still has shape [3*11, 2048].
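A single highway layer can be sketched in plain Python as a gated blend between the input and a transformed version of it, y = g * x + (1 - g) * relu(A(x)) with g = sigmoid(B(x)). The helper names and the weights below are hypothetical and chosen only to make the gating visible.

```python
import math

def linear(x, weight, bias):
    return [sum(w * xi for w, xi in zip(row, x)) + b
            for row, b in zip(weight, bias)]

def highway_layer(x, w_a, b_a, w_b, b_b):
    """y = g * x + (1 - g) * relu(A(x)), with gate g = sigmoid(B(x))."""
    transformed = [max(0.0, v) for v in linear(x, w_a, b_a)]
    gate = [1.0 / (1.0 + math.exp(-v)) for v in linear(x, w_b, b_b)]
    return [g * xi + (1.0 - g) * t for g, xi, t in zip(gate, x, transformed)]

x = [1.0, -2.0]
identity = [[1.0, 0.0], [0.0, 1.0]]
zeros = [[0.0, 0.0], [0.0, 0.0]]
# A large positive gate bias carries the input through almost unchanged.
carried = highway_layer(x, identity, [0.0, 0.0], zeros, [20.0, 20.0])
# A zero gate pre-activation (g = 0.5) averages the input with relu(A(x)).
half = highway_layer(x, identity, [0.0, 0.0], zeros, [0.0, 0.0])
print(carried, half)
```

Because the output is an element-wise blend of x and a same-sized transform of x, the shape cannot change.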

5 Projection layer

Finally the projection layer:

cnn_options = _options["char_cnn"]
filters = cnn_options["filters"]
n_filters = sum(f[1] for f in filters) # 2048

output_dim = _options["lstm"]["projection_dim"] # 512
_projection = torch.nn.Linear(n_filters, output_dim, bias=True)

# final projection  (batch_size * sequence_length, embedding_dim)
token_embedding = _projection(token_embedding) # [3*11, 2048]=>[3*11, 512]
print('token_embedding:', token_embedding.shape)

The projection takes [3*11, 2048] to [3*11, 512]. The result is finally packed into a dict, which is the return value of this whole part:

# reshape to (batch_size, sequence_length, embedding_dim)
batch_size, sequence_length, _ = character_ids_with_bos_eos.size() # [3, 11, 50]

res= {
        "mask": mask_with_bos_eos, # [3, 11]
        "token_embedding": token_embedding.view(batch_size, sequence_length, -1), # [3, 11, 512]
     }
print(res['mask'].shape, res['token_embedding'].shape)

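This returned dict is needed by the BiLSTM. Everything so far makes up the forward of _ElmoCharacterEncoder, which is invoked in the forward of _ElmoBiLm via token_embedding = self._token_embedder(inputs); token_embedding there is the dict built above.

(2) Bi_LSTM

This part corresponds to the Bi_LSTM details in section 1.1: the forward and backward outputs are concatenated, and the first and second LSTM layers are linked by skip connections (the output of the first layer is not only the input of the second layer but is also added to the output of the second layer). A projection is applied as well; in AllenNLP it happens inside each LSTM cell, which projects the 4096-dimensional cell state down to 512 dimensions per direction, rather than after the concatenation. All of this is done in the forward of _ElmoBiLm and in the forward and _lstm_forward of ElmoLstm (allennlp/modules/elmo_lstm.py). The test script again mimics these operations, mainly to make the shape changes and the specific layers clear.

from allennlp.modules.elmo_lstm import ElmoLstm

_token_embedding = res
# enter the BiLSTM layers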
mask = _token_embedding["mask"]
type_representation = _token_embedding["token_embedding"]

_elmo_lstm = ElmoLstm(
        input_size=_options["lstm"]["projection_dim"], # 512
        hidden_size=_options["lstm"]["projection_dim"],
        cell_size=_options["lstm"]["dim"], # 4096 lstm cell_size
        num_layers=_options["lstm"]["n_layers"],
        memory_cell_clip_value=_options["lstm"]["cell_clip"],
        state_projection_clip_value=_options["lstm"]["proj_clip"],
        requires_grad=False,
)

lstm_outputs = _elmo_lstm(type_representation, mask)
print('lstm_outputs:', lstm_outputs.shape) # [2, 3, 11, 1024] (num_layers, batch_size, sequence_length, hidden_size)

output_tensors = [
        torch.cat([type_representation, type_representation], dim=-1) * mask.unsqueeze(-1) # [3, 11, 512],[3, 11, 512]=>[3, 11, 1024]*[3,11,1]
]
print('output_tensors:', len(output_tensors), output_tensors[0].shape) # [3,11,1024]

for layer_activations in torch.chunk(lstm_outputs, lstm_outputs.size(0), dim=0):  # chunk splits the tensor into one piece per layer
    print('layer_activations:', layer_activations.shape)  # [1, 3, 11, 1024]
    output_tensors.append(layer_activations.squeeze(0))  # => [3, 11, 1024]

res = {"activations": output_tensors, "mask": mask}
print(len(res['activations'])) # 3
print(res['activations'][0].shape) # [3, 11, 1024] layer 0: context-independent token embedding layer
print(res['activations'][1].shape) # [3, 11, 1024] layer 1: bilstm layer
print(res['activations'][2].shape) # [3, 11, 1024] layer 2: bilstm layer
print(mask.shape) # [3, 11]

# run the biLM finished
_elmo_lstm.num_layers = _options["lstm"]["n_layers"] + 1 # 3

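In the final output, 'activations' holds 3 layers of representation: layer 0 is the context-independent token representation, followed by the output of the first BiLSTM layer and the output of the second BiLSTM layer.

The forward and backward passes and their concatenation, the skip connections and the projection are implemented mainly in _lstm_forward of ElmoLstm (allennlp/modules/elmo_lstm.py). Only the core part of the code is shown below: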

        for layer_index, state in enumerate(hidden_states):

            forward_cache = forward_output_sequence # cached for the skip connections
            backward_cache = backward_output_sequence

            forward_state: Optional[Tuple[Any, Any]] = None
            backward_state: Optional[Tuple[Any, Any]] = None
            if state is not None:
                forward_hidden_state, backward_hidden_state = state[0].split(self.hidden_size, 2)
                forward_memory_state, backward_memory_state = state[1].split(self.cell_size, 2)
                forward_state = (forward_hidden_state, forward_memory_state)
                backward_state = (backward_hidden_state, backward_memory_state)

            forward_output_sequence, forward_state = forward_layer( # Jane =>(batch_size, max_timesteps, hidden_size=512)
                # final state: (1, batch_size, hidden_size=512) and  (1, batch_size, cell_size=4096)
                forward_output_sequence, batch_lengths, forward_state
            )
            backward_output_sequence, backward_state = backward_layer( # Jane =>(batch_size, max_timesteps, hidden_size=512)
                backward_output_sequence, batch_lengths, backward_state
            )
            # Skip connections, just adding the input to the output.
            if layer_index != 0: # Jane: residual link between the two BiLSTM layers; the first layer's output is both the input of the second layer and added to its output
                forward_output_sequence += forward_cache
                backward_output_sequence += backward_cache

            sequence_outputs.append( # concatenate forward and backward
                torch.cat([forward_output_sequence, backward_output_sequence], -1) # Jane =>[b,max_timesteps,2*hidden_size=1024]
            )

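(3) scalar_mix: linear combination of the layers, giving the final representation

num_output_representations specifies how many linear combinations with different weights are needed, usually 1 or 2.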

bilm_output = res
layer_activations = bilm_output["activations"] # [3, 11, 1024],[3, 11, 1024],[3, 11, 1024]
mask_with_bos_eos = bilm_output["mask"] # [3,11]

from allennlp.modules.scalar_mix import ScalarMix

_scalar_mixes = []
num_output_representations = 2

for k in range(num_output_representations):  # Jane: usually 1 or 2
    print('_elmo_lstm.num_layers:', _elmo_lstm.num_layers)
    scalar_mix = ScalarMix(
        _elmo_lstm.num_layers,  # 3
        do_layer_norm=False,
        initial_scalar_parameters=None,
        trainable=False,
    )
    _scalar_mixes.append(scalar_mix)  # inside the loop: one mix per output representation

Finally, using the scalar mixes defined above, we get the final representations:

from allennlp.nn.util import remove_sentence_boundaries

# compute the elmo representations
_keep_sentence_boundaries = False # whether to keep the sentence begin/end markers
representations = []
for i in range(len(_scalar_mixes)):
    scalar_mix = _scalar_mixes[i]
    representation_with_bos_eos = scalar_mix(layer_activations, mask_with_bos_eos)
    print('representation_with_bos_eos:', representation_with_bos_eos.shape) # [3, 11, 1024]
    if _keep_sentence_boundaries:
        processed_representation = representation_with_bos_eos
        processed_mask = mask_with_bos_eos
    else: # False: drop the sentence begin/end markers
        representation_without_bos_eos, mask_without_bos_eos = remove_sentence_boundaries(
                representation_with_bos_eos, mask_with_bos_eos
        )
        processed_representation = representation_without_bos_eos
        processed_mask = mask_without_bos_eos
        print('processed_representation:', processed_representation.shape) # [3, 11-2=9, 1024]
    representations.append(processed_representation) # dropout is omitted here; it does not affect the shape

Finally, let's look at the shape of the final ELMo representations, which matches the result in section 2.1:

mask = processed_mask
elmo_representations = representations

res = {"elmo_representations": elmo_representations, "mask": mask}
print(len(res['elmo_representations'])) # 2
print(res['elmo_representations'][0].shape) # [3, 9, 1024]
print(res['elmo_representations'][1].shape) # [3, 9, 1024]
print(res['mask'].shape) # [3, 9]
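The drop from 11 back to 9 positions comes from stripping the begin/end markers of each sentence. A token-level sketch of that step, with a hypothetical helper name and string markers instead of 1024-dimensional vectors, might look like this:

```python
def strip_boundaries(rows, mask):
    """Drop the BOS/EOS position of each sentence; every row shrinks by 2."""
    new_rows, new_mask = [], []
    for row, row_mask in zip(rows, mask):
        length = sum(row_mask)                 # real tokens + BOS + EOS
        kept = row[1:length - 1]               # strip BOS and EOS
        pad = len(row) - 2 - len(kept)
        new_rows.append(kept + [None] * pad)
        new_mask.append([True] * len(kept) + [False] * pad)
    return new_rows, new_mask

rows = [["<S>", "That", "is", "a", "question", "</S>"],
        ["<S>", "an", "</S>", None, None, None]]
mask = [[True] * 6, [True, True, True, False, False, False]]
stripped, stripped_mask = strip_boundaries(rows, mask)
print(stripped)
print(stripped_mask)
```

The mask is rebuilt at the same time, so downstream layers still know which positions are padding.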

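3. Applying ELMo to text classification

I plan to run a text classification comparison of textcnn (baseline), glove+textcnn and elmo+textcnn to verify the effect of ELMo. I will add the code and the comparison results here once I have time to finish it.

Finally: if you find any mistakes in this post, please do point them out. Thank you!

4. References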

[1] https://allennlp.org/elmo
[2] https://docs.allennlp.org/v1.2.2/api/modules/elmo/
[3] https://github.com/allenai/allennlp/tree/master/allennlp
[4] http://jalammar.github.io/illustrated-bert/
[5]
[6] https://blog.csdn.net/Magical_Bubble/article/details/89160032
[7] https://larid.site/2020/03/01/elmo源码阅读流水账
[8] http://shomy.top/2019/01/01/elmo-1/
