【多模态大模型面经】 BERT 专题面经-人工智能-PHP中文网

✍ 本专题假设读者已经具备一定的深度学习与 transformer 基础，目标是帮助读者系统地复习 bert 模型的核心设计思想与常见面试问法。本专题来源于本人在面试 nlp / llm / 多模态预训练相关岗位时的真实问题与个人总结，本章的重点是为什么 gpt的【mask】设计会导致数据泄露？为什么bert在取代【mask】保留原词的时候就不会导致数据泄露？等比较深入的问题

一、BERT 基本架构

BERT 全称为 Bidirectional Encoder Representations from Transformers，由 Google AI 在 2018 年提出，奠定了后续预训练语言模型（PLM）发展的基石。

?️ 论文地址：arxiv地址

BERT与Transformer不同，和GPT一样沿用了Transformer的基本架构，不同的是，BERT是基于 Transformer Encoder 堆叠的模型，舍弃了 Transformer 的 Decoder，仅保留 Encoder 部分, 是一个堆叠了 $L$ 层 Encoder 的纯 Transformer 编码器模型。作者提出了两个大小的BERT模型：

模型	层数 (L)	隐层维度 (H)	注意力头数 (A)	参数量	备注
BERT Base	12	768	12	~110M	与GPT大小相同
BERT Large	24	1024	16	~340M	更高表达力，训练更慢

面试官很喜欢问大模型架构之间的区别，例如：

? 1. Transformer、GPT、BERT 架构分别是什么，为什么要这么用？

代表模型	方向	任务类型	优势	劣势
BERT, RoBERTa	Encoder-only	理解（分类、匹配、QA）	双向上下文、语义表达强	无法生成、mask训练慢
GPT serious	Decoder-only	生成（文本生成、对话）	自回归生成自然、训练目标简单	单向建模、理解弱
Transformer, T5, BART	Encoder–Decoder	理解+生成（摘要、翻译）	二者兼顾，可用于指令模型	训练/推理复杂度高

BERT对于初学者来说，好像也是一个生成任务，但实际上，BERT是在做一个完形填空的任务。

? 举个例子：

→ BERT 能预测“机器”

输入：我爱学习

→ BERT 无法生成接下来的“因为我有热情”，因为它没有解码器。

因此，BERT 的 MLM 任务确实有“生成”行为，但它并不是自回归意义上的生成模型。它预测被mask的token，而不是像GPT一样输入一个句子，依次生成一个完整的句子。

? 2. 介绍一下BERT的训练过程

BERT 的训练分为两个阶段：

预训练（Pre-training）：在大规模无监督语料上进行语言建模任务学习。下游微调（Fine-tuning）：在下游任务（分类、问答、序列标注等）上，用任务特定的输入格式（加上 [CLS]、[SEP] 标志），再加上一个小的输出层。

二、BERT: Pre-training

在面试过程中，只要你的简历上涉及到了BERT，BERT一定会问你一个问题：

? 3. 预训练任务有哪几个任务

任务	缩写	作用 PHP高级程序设计模式框架与测试(中文高清PDF版) 享有盛誉的PHP高级教程，Zend Framework核心开发人员力作，深入设计模式、PHP标准库和JSON 。　　今天，PHP已经是无可争议的Web开发主流语言。PHP 5以后，它的面向对象特性也足以与Java和C#相抗衡。然而，讲述PHP高级特性的资料一直缺乏，大大影响了PHP语言的深入应用。　　本书填补了这一空白。它专门针对有一定经验的PHP程序员，详细讲解了对他们最为重要的主题 455 查看详情	举例
Masked Language Model	MLM	学习双向上下文	猜被 MASK 的词
Next Sentence Prediction	NSP	学习句子间关系	判断 B 是不是 A 的下一句

2.1 Masked Language Model (MLM)

传统的语言模型（例如 GPT）是条件概率建模，

$$P(x1, x_2, ..., x_n) = \prod{t=1}^{n} P(xt | x_1, ..., x{t-1})$$

也就是逐词预测下一个 token。这乍一看好像也可以实现双向，假设我们想让模型学：

$$P(xt | x_1, ..., x{t-1}, x_{t+1}, ..., x_n)$$

也就是同时看到左右上下文。但问题是：如果模型的多层 self-attention 可以访问所有位置的信息,通过其他 token 的上下文聚合，模型仍然可能在深层“绕回来”获取自己的信息, “间接地”访问到自己的真实词。

? 举个例子：

Layer 1: "love" attends to "I" and "NLP"

Layer 2: "NLP" attends back to "love"

这样“love”通过“间接路径”又拿到了自己的 embedding，

因此，BERT 引入了 Masked Language Modeling (MLM) 预训练目标，让模型能够同时看到左右上下文。在预训练时，BERT会随机遮蔽输入序列中 15% 的 token. （其中有80%被替换为MASK, 10% 替换为随机词, 10% 保留原词）任务是预测被遮蔽的词：

模型通过上下文（左右两侧）推测被 Mask 的词，因此学习到双向语义信息。

? 4. 为什么只有80%被替换为MASK，而不是全部

下游任务的输入从来没有 [MASK]。但如果预训练时几乎所有预测目标都依赖 [MASK] 特征，模型就会学会“见到 [MASK] 才认真预测”，而当 fine-tune 阶段没有 [MASK] 时，它的表现会退化。这样做是为了防止模型过度依赖 [MASK] 符号，增强鲁棒性。

学到这里，可能有些读者就有些疑惑了，为什么这里就可以保留原词了呢？保留原词不就又导致数据泄露了吗？如果你有此疑问，可以继续看下去。

BERT 的整体过程是这样的：

先随机选出 15% 的 token → 这些位置是“潜在的预测目标”。loss 只计算在这 15% 的位置上。

模型虽然“输入里看到 love”，但它并不知道 “love” 是要被预测的位置。所以它的“泄露”是表层信息流的可见，而非训练信号（loss 反传）层面的泄露。

2.2 Next Sentence Prediction（NSP ）

NSP任务是判断 B 是不是 A 的下一句:

输入A	输入B	标签
I went to the store.	I bought some milk.	IsNext
I went to the store.	Penguins live in Antarctica.	NotNext

其中50% 的样本中，B 是 A 的真实下一句；50% 是随机句子；BERT 输入由两句话拼接而成，用 [SEP] 分隔：

模型通过 [CLS] 向量预测是否为“下一句”。

? 5. NSP 的作用是什么？为什么后来的模型（如 RoBERTa）去掉了 NSP？

NSP 想让模型不仅理解句内词之间的关系, 还要理解句与句之间的语义连续性（discourse-level coherence）。

后续实验（尤其是 RoBERTa 和 ALBERT）发现 NSP 的问题主要有两点：

① 任务过于简单：随机拼句 vs. 连续句这个二分类太容易。模型可以仅靠表面统计特征（比如主题词、长度、标点）来判断，而非真正理解上下文。

② 难以泛化到真实句间关系任务：下游任务（如 QA、NLI）要求模型理解逻辑推理（entailment、contradiction、causality），但 NSP 学到的只是“句子 A 和句子 B 是否相邻”

? 6. 后续模型是如何替代 NSP 的？

模型	NSP 是否保留	替代机制
RoBERTa	❌ 去掉 NSP	用更长连续文本训练（512 tokens），依赖 MLM 自行捕获句间依存
ALBERT	✅ 改进	引入 SOP（Sentence Order Prediction）：判断两句是否调换顺序，更关注语义连贯性
ELECTRA	❌ 去掉 NSP	改为 RTD（Replaced Token Detection），更细粒度的预训练信号

三、BERT 手撕代码模块

Self-Attention & Encoder

<code class="python">class BertSelfAttention(nn.Module):    def __init__(self, hidden_dim, num_heads, dropout=0.1):        super().__init__()        assert hidden_dim % num_heads == 0        self.num_heads = num_heads        self.head_dim = hidden_dim // num_heads        self.query = nn.Linear(hidden_dim, hidden_dim)        self.key = nn.Linear(hidden_dim, hidden_dim)        self.value = nn.Linear(hidden_dim, hidden_dim)        self.dropout = nn.Dropout(dropout)    def _transpose_for_scores(self, x):        """        [B, L, H] -> [B, h, L, d]        """        B, L, H = x.size()        x = x.view(B, L, self.num_heads, self.head_dim)        return x.permute(0, 2, 1, 3)  # [B, h, L, d]    def forward(self, hidden_states, attention_mask=None):        """        hidden_states: [B, L, H]        attention_mask: [B, 1, 1, L]  (1 保留, 0 mask)        """        Q = self._transpose_for_scores(self.query(hidden_states))        K = self._transpose_for_scores(self.key(hidden_states))        V = self._transpose_for_scores(self.value(hidden_states))        # [B, h, L, L]        scores = torch.matmul(Q, K.transpose(-1, -2)) / (self.head_dim ** 0.5)        if attention_mask is not None:            # 将 mask 为 0 的位置置为 -inf，softmax 后概率为 0            scores = scores.masked_fill(attention_mask == 0, float("-inf"))        attn_probs = F.softmax(scores, dim=-1)        attn_probs = self.dropout(attn_probs)        # [B, h, L, d]        context = torch.matmul(attn_probs, V)        # -> [B, L, H]        context = context.permute(0, 2, 1, 3).contiguous()        B, L, h, d = context.size()        context = context.view(B, L, h * d)        return context, attn_probsclass BertSelfOutput(nn.Module):    def __init__(self, hidden_dim, dropout=0.1):        super().__init__()        self.dense = nn.Linear(hidden_dim, hidden_dim)        self.layer_norm = nn.LayerNorm(hidden_dim)        self.dropout = nn.Dropout(dropout)    def forward(self, hidden_states, input_tensor):        """        hidden_states: self-attention 输出 [B, L, H]        input_tensor: 残差输入 [B, L, H]        """        x = self.dense(hidden_states)        x = self.dropout(x)        return self.layer_norm(x + input_tensor)class BertIntermediate(nn.Module):    def __init__(self, hidden_dim, intermediate_dim, activation="gelu"):        super().__init__()        self.dense = nn.Linear(hidden_dim, intermediate_dim)        if activation == "relu":            self.act_fn = F.relu        else:  # 默认 GELU            self.act_fn = F.gelu    def forward(self, x):        return self.act_fn(self.dense(x))class BertOutput(nn.Module):    def __init__(self, hidden_dim, intermediate_dim, dropout=0.1):        super().__init__()        self.dense = nn.Linear(intermediate_dim, hidden_dim)        self.layer_norm = nn.LayerNorm(hidden_dim)        self.dropout = nn.Dropout(dropout)    def forward(self, hidden_states, input_tensor):        x = self.dense(hidden_states)        x = self.dropout(x)        return self.layer_norm(x + input_tensor)class BertLayer(nn.Module):    """    一个完整的 BERT Encoder 层：    Self-Attention -> Add & Norm -> FFN -> Add & Norm    """    def __init__(self, hidden_dim, num_heads, intermediate_dim, dropout=0.1):        super().__init__()        self.attention = BertSelfAttention(hidden_dim, num_heads, dropout)        self.attention_output = BertSelfOutput(hidden_dim, dropout)        self.intermediate = BertIntermediate(hidden_dim, intermediate_dim)        self.output = BertOutput(hidden_dim, intermediate_dim, dropout)    def forward(self, hidden_states, attention_mask=None):        # Self-Attention        attn_output, _ = self.attention(hidden_states, attention_mask)        hidden_states = self.attention_output(attn_output, hidden_states)        # FFN        intermediate_output = self.intermediate(hidden_states)        layer_output = self.output(intermediate_output, hidden_states)        return layer_output</code>

登录后复制

Encoder + Pooler + BertModel

<code class="python">class BertEncoder(nn.Module):    def __init__(self, num_layers, hidden_dim, num_heads, intermediate_dim, dropout=0.1):        super().__init__()        self.layers = nn.ModuleList([            BertLayer(hidden_dim, num_heads, intermediate_dim, dropout)            for _ in range(num_layers)        ])    def forward(self, hidden_states, attention_mask=None):        # 简化版：只返回最后一层的输出        for layer in self.layers:            hidden_states = layer(hidden_states, attention_mask)        return hidden_statesclass BertPooler(nn.Module):    """    Pooler：取 [CLS] 位置的向量，过一层全连接 + tanh    """    def __init__(self, hidden_dim):        super().__init__()        self.dense = nn.Linear(hidden_dim, hidden_dim)    def forward(self, hidden_states):        # 假设 input_ids 的第一个 token 是 [CLS]        cls_token = hidden_states[:, 0]  # [B, H]        pooled = torch.tanh(self.dense(cls_token))        return pooledclass BertModel(nn.Module):    """    一个最小可用的 BERT 模型：    Embedding -> Encoder(L 层) -> Pooler    """    def __init__(        self,        vocab_size,        hidden_dim=768,        num_layers=12,        num_heads=12,        intermediate_dim=3072,        max_len=512,        segment_size=2,        dropout=0.1,    ):        super().__init__()        self.embeddings = BertEmbedding(            vocab_size=vocab_size,            hidden_dim=hidden_dim,            max_len=max_len,            segment_size=segment_size,        )        self.encoder = BertEncoder(            num_layers=num_layers,            hidden_dim=hidden_dim,            num_heads=num_heads,            intermediate_dim=intermediate_dim,            dropout=dropout,        )        self.pooler = BertPooler(hidden_dim)    def forward(self, input_ids, token_type_ids, attention_mask=None):        """        input_ids: [B, L]        token_type_ids: [B, L]  (segment id / sentence A/B)        attention_mask: [B, L]  (1 表示真实 token, 0 表示 padding)        """        # [B, L, H]        hidden_states = self.embeddings(input_ids, token_type_ids)        if attention_mask is not None:            # [B, 1, 1, L]，方便广播到 [B, h, L, L]            attention_mask = attention_mask[:, None, None, :]        # Encoder        sequence_output = self.encoder(hidden_states, attention_mask)        # Pooler        pooled_output = self.pooler(sequence_output)        return sequence_output, pooled_output</code>

登录后复制

BertForPreTraining

<code class="python">class MaskedLanguageModel(nn.Module):    """    对被 mask 的位置做 token 分类：    hidden_states: [B, L, H]    masked_positions: [B, M]  (每个样本 M 个 mask 位置，不足用 -1 填充)    """    def __init__(self, vocab_size, hidden_dim):        super().__init__()        self.transform = nn.Linear(hidden_dim, hidden_dim)        self.layer_norm = nn.LayerNorm(hidden_dim)        self.decoder = nn.Linear(hidden_dim, vocab_size, bias=False)    def forward(self, hidden_states, masked_positions):        B, L, H = hidden_states.size()        # masked_positions: [B, M]，其中 -1 表示“无效”        # 构造 batch 维索引        batch_idx = torch.arange(B, device=hidden_states.device).unsqueeze(1)  # [B, 1]        # 为避免 -1 索引越界，先 clamp 到 [0, L-1]，后面通过 label mask 掩掉        pos = masked_positions.clamp(min=0)        # [B, M, H]        masked_hidden = hidden_states[batch_idx, pos]        x = F.gelu(self.transform(masked_hidden))        x = self.layer_norm(x)        logits = self.decoder(x)  # [B, M, vocab_size]        return logitsclass NextSentencePrediction(nn.Module):    def __init__(self, hidden_dim):        super().__init__()        self.linear = nn.Linear(hidden_dim, 2)    def forward(self, cls_vector):        # cls_vector: [B, H]        return self.linear(cls_vector)  # [B, 2]class BertForPreTraining(nn.Module):    """    BERT 预训练模型：BertModel + MLM + NSP    """    def __init__(self, vocab_size, hidden_dim=768, **kwargs):        super().__init__()        self.bert = BertModel(            vocab_size=vocab_size,            hidden_dim=hidden_dim,            **kwargs,        )        self.mlm_head = MaskedLanguageModel(vocab_size, hidden_dim)        self.nsp_head = NextSentencePrediction(hidden_dim)    def forward(        self,        input_ids,        token_type_ids,        attention_mask,        masked_positions,        masked_lm_labels=None,      # [B, M]，无效位置用 -100        next_sentence_labels=None,   # [B]    ):        sequence_output, pooled_output = self.bert(            input_ids=input_ids,            token_type_ids=token_type_ids,            attention_mask=attention_mask,        )        # MLM 预测        prediction_scores = self.mlm_head(sequence_output, masked_positions)        # NSP 预测        seq_relationship_score = self.nsp_head(pooled_output)        outputs = (prediction_scores, seq_relationship_score)        # 如传入 label，则计算 loss        if masked_lm_labels is not None and next_sentence_labels is not None:            # MLM loss            mlm_loss_fct = nn.CrossEntropyLoss(ignore_index=-100)            # [B * M, vocab_size] vs [B * M]            mlm_loss = mlm_loss_fct(                prediction_scores.view(-1, prediction_scores.size(-1)),                masked_lm_labels.view(-1),            )            # NSP loss            nsp_loss_fct = nn.CrossEntropyLoss()            nsp_loss = nsp_loss_fct(                seq_relationship_score.view(-1, 2),                next_sentence_labels.view(-1),            )            total_loss = mlm_loss + nsp_loss            outputs = (total_loss, mlm_loss, nsp_loss) + outputs        return outputs  # (loss, mlm_loss, nsp_loss, prediction_scores, seq_relationship_score)</code>

登录后复制

Mask 采样函数：实现 15% + 80/10/10 规则

<code class="python">def create_masked_lm_labels(    input_ids,    pad_token_id,    cls_token_id,    sep_token_id,    mask_token_id,    vocab_size,    mlm_probability=0.15,    max_masks_per_seq=20,):    """    根据 BERT 规则构造 MLM 输入和标签：    - 15% token 作为预测目标      - 80% -> [MASK]      - 10% -> 随机词      - 10% -> 保留原词    返回:        masked_input_ids: [B, L]        masked_lm_labels: [B, M]，非 mask 位置填 -100        masked_positions: [B, M]，不足部分填 -1    """    device = input_ids.device    B, L = input_ids.size()    # 1. 初始化    masked_input_ids = input_ids.clone()    masked_lm_labels = torch.full(        (B, max_masks_per_seq), -100, dtype=torch.long, device=device    )    masked_positions = torch.full(        (B, max_masks_per_seq), -1, dtype=torch.long, device=device    )    # 预先生成随机数    prob = torch.rand_like(input_ids.float())    # 构建不能被 mask 的位置（特殊符号和 padding）    special_mask = (        (input_ids == pad_token_id)        | (input_ids == cls_token_id)        | (input_ids == sep_token_id)    )    # 选出真正被选为 mask 候选的 token    mask_candidate = (prob < mlm_probability) & (~special_mask)    for b in range(B):        candidate_indices = torch.nonzero(mask_candidate[b], as_tuple=False).view(-1)        # 限制最多 mask 的个数        if len(candidate_indices) > max_masks_per_seq:            chosen = candidate_indices[torch.randperm(len(candidate_indices))[:max_masks_per_seq]]        else:            chosen = candidate_indices        if len(chosen) == 0:            continue        # 记录这些位置的 label        masked_lm_labels[b, : len(chosen)] = input_ids[b, chosen]        masked_positions[b, : len(chosen)] = chosen        # 80% -> [MASK]        num_mask = len(chosen)        num_mask_mask = int(num_mask * 0.8)        num_mask_random = int(num_mask * 0.1)        # 剩余 10% 保留原词        perm = torch.randperm(num_mask)        mask_idx = chosen[perm[:num_mask_mask]]        random_idx = chosen[perm[num_mask_mask : num_mask_mask + num_mask_random]]        # keep_idx = chosen[perm[num_mask_mask + num_mask_random:]]        # 替换为 [MASK]        masked_input_ids[b, mask_idx] = mask_token_id        # 替换为随机词        if len(random_idx) > 0:            random_words = torch.randint(                low=0, high=vocab_size, size=(len(random_idx),), device=device            )            masked_input_ids[b, random_idx] = random_words        # 剩下的 10% 保留原词，不需要改 masked_input_ids    return masked_input_ids, masked_lm_labels, masked_positions</code>

登录后复制

预训练

<code class="python">def pretrain_step(    model: BertForPreTraining,    batch,    optimizer,    pad_token_id,    cls_token_id,    sep_token_id,    mask_token_id,    vocab_size,    device="cuda",):    model.train()    input_ids = batch["input_ids"].to(device)           # [B, L]    token_type_ids = batch["token_type_ids"].to(device) # [B, L]    attention_mask = batch["attention_mask"].to(device) # [B, L]    next_sentence_labels = batch["next_sentence_labels"].to(device)  # [B]    # 构造 MLM 目标    masked_input_ids, masked_lm_labels, masked_positions = create_masked_lm_labels(        input_ids=input_ids,        pad_token_id=pad_token_id,        cls_token_id=cls_token_id,        sep_token_id=sep_token_id,        mask_token_id=mask_token_id,        vocab_size=vocab_size,    )    masked_input_ids = masked_input_ids.to(device)    masked_lm_labels = masked_lm_labels.to(device)    masked_positions = masked_positions.to(device)    outputs = model(        input_ids=masked_input_ids,        token_type_ids=token_type_ids,        attention_mask=attention_mask,        masked_positions=masked_positions,        masked_lm_labels=masked_lm_labels,        next_sentence_labels=next_sentence_labels,    )    total_loss, mlm_loss, nsp_loss = outputs[:3]    optimizer.zero_grad()    total_loss.backward()    optimizer.step()    return {        "loss": total_loss.item(),        "mlm_loss": mlm_loss.item(),        "nsp_loss": nsp_loss.item(),    }</code>

登录后复制

以上就是【多模态大模型面经】 BERT 专题面经的详细内容，更多请关注php中文网其它相关文章！