iFLYTEK Academic Paper Classification Challenge: ERNIE Reaches 0.79 Accuracy

P粉084495128

Published: 2025-07-25 10:25:11


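As artificial intelligence advances, a large number of papers are published every week. Classifying them has become a practical problem that researchers and research institutions face daily. This challenge asks participants to build a paper classification model.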


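Competition task

Participants are asked to assign each paper to its specific category using the information provided for it: paper id, title, and abstract.

Sample record (fields are separated by \t):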

paperid: 9821

title: Calculation of prompt diphoton production cross sections at Tevatron and LHC energies

abstract: A fully differential calculation in perturbative quantum chromodynamics is presented for the production of massive photon pairs at hadron colliders. All next-to-leading order perturbative contributions from quark-antiquark, gluon-(anti)quark, and gluon-gluon subprocesses are included, as well as all-orders resummation of initial-state gluon radiation valid at next-to-next-to-leading logarithmic accuracy.

categories: hep-ph
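To make the record layout concrete, here is a minimal sketch that parses tab-separated text with Python's standard `csv` module. The header row, its column order, and the shortened abstract are assumptions for illustration. The column names match those used by the notebook code in this article, but the real files may be laid out differently.

```python
import csv
import io

# Hypothetical two-line file: an assumed header row plus one sample record
# (the abstract is shortened here for readability).
sample_tsv = (
    "paperid\ttitle\tabstract\tcategories\n"
    "9821\tCalculation of prompt diphoton production cross sections at "
    "Tevatron and LHC energies\t"
    "A fully differential calculation in perturbative quantum "
    "chromodynamics is presented.\thep-ph\n"
)

def read_records(text):
    """Parse tab-separated text with a header row into a list of dicts."""
    reader = csv.DictReader(io.StringIO(text), delimiter="\t")
    return list(reader)

records = read_records(sample_tsv)
print(records[0]["paperid"], records[0]["categories"])
```

Each record comes back as a dictionary keyed by column name, so a field such as the category is read as `records[0]["categories"]`.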


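Data description

The training and test sets are provided as CSV files:

The training set contains 50,000 papers, each with four fields: paper id, title, abstract, and category.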

The test set contains 10,000 papers, each with a paper id, title, and abstract, but no category field.

Evaluation metric

Submissions are scored by accuracy; the maximum score is 1.

For the calculation, see https://scikit-learn.org/stable/modules/generated/sklearn.metrics.accuracy_score.html. Reference evaluation code:

from sklearn.metrics import accuracy_score
y_pred = [0, 2, 1, 3]
y_true = [0, 1, 2, 3]
accuracy_score(y_true, y_pred)  # 0.5: two of the four predictions match
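The accuracy metric is simply the fraction of papers whose predicted category equals the true one. The sketch below computes it in plain Python with no dependencies. The function name `accuracy` is illustrative and not part of the competition tooling.

```python
def accuracy(y_true, y_pred):
    """Fraction of positions where the prediction equals the true label."""
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")
    if not y_true:
        raise ValueError("cannot score an empty prediction list")
    correct = sum(1 for t, p in zip(y_true, y_pred) if t == p)
    return correct / len(y_true)

y_pred = [0, 2, 1, 3]
y_true = [0, 1, 2, 3]
print(accuracy(y_true, y_pred))  # 0.5 (positions 0 and 3 match)
```

Because every paper counts equally, frequent categories dominate the score, which is worth keeping in mind when the class distribution is skewed.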


In [1]

!pip install paddle-ernie > log.log


In [2]

import numpy as np
import paddle as P
# Import the ERNIE model
from ernie.tokenizing_ernie import ErnieTokenizer
from ernie.modeling_ernie import ErnieModel

model = ErnieModel.from_pretrained('ernie-1.0')    # Try to get pretrained model from server, make sure you have network connection
model.eval()
tokenizer = ErnieTokenizer.from_pretrained('ernie-1.0')

ids, _ = tokenizer.encode('hello world')
ids = P.to_tensor(np.expand_dims(ids, 0))  # insert extra `batch` dimension
pooled, encoded = model(ids)                 # eager execution
print(pooled.numpy())


In [9]

import sys
import numpy as np
import pandas as pd
from sklearn.metrics import f1_score
import paddle as P
from ernie.tokenizing_ernie import ErnieTokenizer
from ernie.modeling_ernie import ErnieModelForSequenceClassification


In [10]

train_df = pd.read_csv('train.csv', sep='\t')
train_df['title'] = train_df['title'] + ' ' + train_df['abstract']

train_df = train_df.sample(frac=1.0)
train_df.head()


In [11]

train_df.shape


In [12]

train_df['categories'].nunique()


In [13]

train_df['categories'], lbl_list = pd.factorize(train_df['categories'])
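`pd.factorize` replaces each category string with an integer code and returns the list of distinct labels, which is what later turns predictions back into category names. As a rough illustration of that behavior (first-appearance ordering, as with the default `sort=False`), here is a plain-Python sketch. The function and the sample labels are hypothetical stand-ins, not the pandas implementation.

```python
def factorize(labels):
    """Map each distinct label to an integer code in order of first appearance."""
    codes, uniques, index = [], [], {}
    for label in labels:
        if label not in index:
            index[label] = len(uniques)
            uniques.append(label)
        codes.append(index[label])
    return codes, uniques

categories = ["hep-ph", "cs.CL", "hep-ph", "math.CO"]  # hypothetical labels
codes, lbl_list = factorize(categories)
print(codes)     # [0, 1, 0, 2]
print(lbl_list)  # ['hep-ph', 'cs.CL', 'math.CO']

# Decoding reverses the mapping by indexing into the list of unique labels.
decoded = [lbl_list[c] for c in codes]
```

Because the codes depend on row order, the same `lbl_list` produced during training has to be reused when decoding test predictions.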


In [14]

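# Model hyperparameters
BATCH = 32
MAX_SEQLEN = 300
LR = 5e-5
EPOCH = 10
# Define the ERNIE classification model
# num_labels should equal the category count, train_df['categories'].nunique()
ernie = ErnieModelForSequenceClassification.from_pretrained('ernie-2.0-en', num_labels=39)
optimizer = P.optimizer.Adam(LR, parameters=ernie.parameters())
tokenizer = ErnieTokenizer.from_pretrained('ernie-2.0-en')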


In [15]

train_df.iterrows()


In [16]

# Convert the dataset; the main step is encoding the text into token ids
def make_data(df):
    data = []
    for i, row in enumerate(df.iterrows()):
        text, label = row[1].title, row[1].categories
        text_id, _ = tokenizer.encode(text)  # ErnieTokenizer automatically adds the special tokens ERNIE needs, such as [CLS] and [SEP]
        text_id = text_id[:MAX_SEQLEN]  # truncate long inputs
        text_id = np.pad(text_id, [0, MAX_SEQLEN-len(text_id)], mode='constant')  # right-pad short inputs with 0
        data.append((text_id, label))
    return data

# The rows were shuffled above, so the last 5000 serve as a validation split
train_data = make_data(train_df.iloc[:-5000])
val_data = make_data(train_df.iloc[-5000:])
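The truncate-then-pad step in `make_data` gives every sample the same length so that samples can be stacked into a batch. The sketch below shows the same idea on plain lists. The helper name, the token ids, and the choice of 0 as the pad id are assumptions for illustration.

```python
def pad_or_truncate(token_ids, max_len, pad_id=0):
    """Return a list of exactly max_len ids: cut long inputs, right-pad short ones."""
    clipped = list(token_ids[:max_len])
    return clipped + [pad_id] * (max_len - len(clipped))

short_ids = [1, 7592, 2088, 2]   # hypothetical token ids
long_ids = list(range(1, 11))    # hypothetical 10-token sequence

print(pad_or_truncate(short_ids, 6))  # [1, 7592, 2088, 2, 0, 0]
print(pad_or_truncate(long_ids, 6))   # [1, 2, 3, 4, 5, 6]
```

One consequence of slicing after tokenization is that a truncated sequence loses whatever token was at its end, so an overlong input would no longer finish with the closing special token.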


In [ ]

# Fetch the i-th batch of data
def get_batch_data(data, i):
    d = data[i*BATCH: (i + 1) * BATCH]
    feature, label = zip(*d)
    feature = np.stack(feature)  # stack the BATCH samples into a single numpy.array
    label = np.stack(list(label))
    feature = P.to_tensor(feature)  # convert the numpy.array to a paddle tensor with to_tensor
    label = P.to_tensor(label)
    return feature, label


In [12]

EPOCH = 1
# Train the model
for i in range(EPOCH):
    np.random.shuffle(train_data)  # shuffle the data every epoch for better training results
    ernie.train()
    for j in range(len(train_data) // BATCH):
        feature, label = get_batch_data(train_data, j)
        loss, _ = ernie(feature, labels=label)
        loss.backward()
        optimizer.minimize(loss)
        ernie.clear_gradients()
        if j % 50 == 0:
            print('Train %d: loss %.5f' % (j, loss.numpy()))
        # Validate the model
        if j % 100 == 0:
            all_pred, all_label = [], []
            with P.no_grad():
                ernie.eval()
                for k in range(len(val_data) // BATCH):  # separate index so the training step j is not overwritten
                    feature, label = get_batch_data(val_data, k)
                    loss, logits = ernie(feature, labels=label)

                    all_pred.extend(logits.argmax(-1).numpy())
                    all_label.extend(label.numpy())
                ernie.train()
            acc = (np.array(all_label) == np.array(all_pred)).astype(np.float32).mean()
            print('Val acc %.5f' % acc)


In [13]

test_df = pd.read_csv('test.csv', sep='\t')
test_df['title'] = test_df['title'] + ' ' + test_df['abstract']
test_df['categories'] = 0  # placeholder label so make_data can be reused on the test set
test_data = make_data(test_df.iloc[:])


In [20]

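all_pred, all_label = [], []
# Run prediction
with P.no_grad():
    ernie.eval()
    # Ceiling division covers the final partial batch without requesting an
    # empty one when len(test_data) is an exact multiple of BATCH
    for j in range((len(test_data) + BATCH - 1) // BATCH):
        feature, label = get_batch_data(test_data, j)
        loss, logits = ernie(feature, labels=label)

        all_pred.extend(logits.argmax(-1).numpy())
        all_label.extend(label.numpy())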
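Prediction has to cover every test sample, including a final batch smaller than the batch size, whereas the training loop can afford to drop the remainder. The sketch below counts batches with ceiling division and lists the slice bounds. The helper names are hypothetical, and the comparison with `n // batch_size + 1` shows where that shortcut would ask for an extra, empty batch.

```python
def num_batches(n_samples, batch_size):
    """Number of batches needed to cover every sample (ceiling division)."""
    return (n_samples + batch_size - 1) // batch_size

def batch_slices(n_samples, batch_size):
    """Yield (start, stop) index pairs; the last pair may span fewer samples."""
    for i in range(num_batches(n_samples, batch_size)):
        start = i * batch_size
        yield start, min(start + batch_size, n_samples)

print(num_batches(10000, 32))       # 313: 312 full batches plus one of 16
print(list(batch_slices(70, 32)))   # [(0, 32), (32, 64), (64, 70)]

# With an exact multiple, floor division plus one overshoots by a batch.
print(num_batches(64, 32), 64 // 32 + 1)  # 2 3
```

For the 10,000-paper test set and a batch size of 32 both formulas agree, so the difference only matters for sizes that divide evenly.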


In [21]

pd.DataFrame({
    'paperid': test_df['paperid'],
    'categories': lbl_list[all_pred]
}).to_csv('submit.csv', index=False)
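The submission step pairs each test paper id with its predicted category name, obtained by indexing the label list with the predicted integer code. A dependency-free sketch of the same step follows. The function name, the ids, and the labels are hypothetical, and the comma-separated layout mirrors the `to_csv` default rather than a confirmed competition requirement.

```python
import csv
import io

def write_submission(paper_ids, pred_codes, label_names, out):
    """Write one 'paperid,categories' row per paper, decoding codes to names."""
    writer = csv.writer(out, lineterminator="\n")
    writer.writerow(["paperid", "categories"])
    for pid, code in zip(paper_ids, pred_codes):
        writer.writerow([pid, label_names[code]])

label_names = ["hep-ph", "cs.CL", "math.CO"]  # hypothetical label list
buf = io.StringIO()                            # stands in for an open file
write_submission([9821, 9822], [0, 2], label_names, buf)
print(buf.getvalue())
```

Passing a real file object opened with `newline=""` in place of the `StringIO` buffer would write the same rows to disk.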

