DeepSeek R1 Generates GPU Kernels Without Explicit Programming, Outperforms Skilled Engineers, and Stuns NVIDIA


心靈之曲

Published: 2025-02-15 21:50:11

Original article


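NVIDIA has used DeepSeek-R1 to automatically generate optimized GPU kernels, sparking heated discussion in the AI community. The study applies inference-time scaling, which lets the DeepSeek-R1 model allocate additional compute during inference, to automatically generate GPU attention kernels that are numerically correct and optimized for different attention variants, without any explicit programming.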


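Some commenters argue that the move could amount to NVIDIA "dismantling its own moat," while others worry about jobs being replaced by AI.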

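As large AI models grow in scale and capability, test-time scaling (TTS), also called inference-time scaling (ITS), is becoming increasingly important. The technique improves model performance by spending more compute during inference to evaluate multiple candidate results and select the best one. This gives AI a rudimentary ability, similar to the way people analyze complex problems, to work through a problem step by step and arrive at a final answer.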
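The "evaluate multiple results and select the best one" idea can be sketched as a best-of-N selection loop. This is only an illustration under stated assumptions, not NVIDIA's actual workflow: `best_of_n`, the validity check, and the toy `(label, passes_check, runtime_ms)` records are all hypothetical stand-ins for sampled model outputs and a real verifier.

```python
def best_of_n(candidates, is_valid, score):
    """Return the valid candidate with the highest score, or None if none pass."""
    best = None
    best_score = None
    for cand in candidates:
        # Discard candidates that fail the correctness check outright.
        if not is_valid(cand):
            continue
        s = score(cand)
        if best_score is None or s > best_score:
            best, best_score = cand, s
    return best


# Hypothetical records standing in for generated kernels:
# (label, passes_check, runtime_ms). The values are made up for illustration.
samples = [
    ("draft-1", False, 0.8),  # fastest, but fails the check
    ("draft-2", True, 1.9),
    ("draft-3", True, 1.2),
]

# Lower runtime is better, so the score is the negated runtime.
chosen = best_of_n(samples, is_valid=lambda c: c[1], score=lambda c: -c[2])
print(chosen[0])
```

Spending more inference-time compute corresponds to enlarging `samples`: more candidates give the selector a better chance of finding one that is both correct and fast.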

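In NVIDIA's experiment, the DeepSeek-R1 model used inference-time scaling to tackle the problem of automatically generating optimized GPU attention kernels. In some cases, the results it generated even surpassed those of experienced engineers.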

The Need for Optimized Attention Kernels and the Challenges Involved

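The attention mechanism is central to LLMs, but its computational complexity grows with the square of the input sequence length. Optimized GPU kernels are therefore needed to improve efficiency and avoid errors. In addition, attention comes in multiple variants, and engineers need to combine them for specific tasks. Multimodal models pose further challenges, such as requiring specialized attention mechanisms to handle spatio-temporal information.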
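To make the quadratic cost concrete, here is a minimal pure-Python sketch of single-head scaled dot-product attention. It is a deliberately naive reference, not an optimized kernel; the names `naive_attention` and `score_count` are illustrative and do not come from the article.

```python
import math


def naive_attention(q, k, v):
    """Single-head scaled dot-product attention over lists of vectors.

    q, k: lists of vectors with the same dimension d; v: one value vector per key.
    Every query is scored against every key, so the work grows as
    len(q) * len(k), which is seq_len ** 2 for self-attention.
    """
    d = len(q[0])
    scale = 1.0 / math.sqrt(d)
    out = []
    for qi in q:
        scores = [scale * sum(a * b for a, b in zip(qi, kj)) for kj in k]
        m = max(scores)  # subtract the max for numerical stability
        weights = [math.exp(s - m) for s in scores]
        total = sum(weights)
        out.append([
            sum(w * vj[c] for w, vj in zip(weights, v)) / total
            for c in range(len(v[0]))
        ])
    return out


def score_count(seq_len):
    """Number of query-key scores computed in self-attention."""
    return seq_len * seq_len


# Identical keys give uniform weights, so the output is the mean of the values.
result = naive_attention([[1.0, 0.0]], [[0.0, 0.0], [0.0, 0.0]], [[1.0, 2.0], [3.0, 4.0]])
print(result)
print(score_count(2048) // score_count(1024))
```

Doubling the sequence length quadruples the number of scores, which is why memory traffic and thread mapping matter so much in a real GPU implementation.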

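Even for experienced engineers, creating an optimized GPU kernel takes considerable time and skill. Although large models such as DeepSeek-R1 show great potential for code generation, their first attempts are often unsatisfactory, so additional strategies are needed at inference time.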

An example prompt follows:

<code>Please write a GPU attention kernel to support relative position encodings. Implement the relative positional encoding on the fly within the kernel. The complete code should be returned, including the necessary modifications.

Use the following function to compute the relative positional encoding:

def relative_positional(score, b, h, q_idx, kv_idx):
    return score + (q_idx - kv_idx)

When implementing the kernel, keep in mind that a constant scaling factor 1.44269504 should be applied to the relative positional encoding due to qk_scale = sm_scale * 1.44269504. The PyTorch reference does not need to scale the relative positional encoding, but in the GPU kernel, use:

qk = qk * qk_scale + rel_pos * 1.44269504

Please provide the complete updated kernel code that incorporates these changes, ensuring that the relative positional encoding is applied efficiently within the kernel operations.</code>
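The constant 1.44269504 in the prompt is approximately log2(e). A plausible reading, which is an assumption here because the prompt does not say so, is that the kernel evaluates the softmax with base-2 exponentials, so every logit has to be multiplied by log2(e) to match a natural-exponential reference. The sketch below checks that equivalence in plain Python. The two `softmax_*` helpers are hypothetical, and applying `sm_scale` before the positional bias in the reference path is also an assumption, chosen to be consistent with the kernel formula quoted in the prompt.

```python
import math

LOG2_E = 1.44269504  # approximately log2(e), the constant quoted in the prompt


def relative_positional(score, b, h, q_idx, kv_idx):
    # Score modification from the prompt; b and h are unused here.
    return score + (q_idx - kv_idx)


def softmax_reference(raw_scores, sm_scale, q_idx):
    """Reference path: scale, add the unscaled bias, softmax with exp."""
    logits = [
        relative_positional(s * sm_scale, 0, 0, q_idx, kv_idx)
        for kv_idx, s in enumerate(raw_scores)
    ]
    m = max(logits)
    w = [math.exp(x - m) for x in logits]
    t = sum(w)
    return [x / t for x in w]


def softmax_kernel_style(raw_scores, sm_scale, q_idx):
    """Kernel-style path: qk = qk * qk_scale + rel_pos * LOG2_E, then base-2 exp."""
    qk_scale = sm_scale * LOG2_E
    logits = [
        s * qk_scale + (q_idx - kv_idx) * LOG2_E
        for kv_idx, s in enumerate(raw_scores)
    ]
    m = max(logits)
    w = [2.0 ** (x - m) for x in logits]
    t = sum(w)
    return [x / t for x in w]


raw = [0.3, -1.2, 2.0, 0.5]  # made-up query-key dot products for one query row
ref = softmax_reference(raw, sm_scale=0.125, q_idx=2)
ker = softmax_kernel_style(raw, sm_scale=0.125, q_idx=2)
max_diff = max(abs(a - b) for a, b in zip(ref, ker))
print(max_diff < 1e-6)
```

The two paths agree up to the rounding of the nine-digit constant, because 2^(x * log2(e)) equals e^x. If the bias were left unscaled in the base-2 path, the positional term would be weighted incorrectly relative to the attention scores.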


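Large models sometimes produce incorrect or inefficient code. Working out the optimal GPU thread mapping is also extremely challenging.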