Word Embeddings with Python: doc2vec


聖光之護

Published: 2024-09-20 18:27:56

Reposted



Implementing doc2vec with Python (and gensim)

Note: this code was written with Python 3.6.1 (+ gensim 2.3.0)
Python implementation and application of doc2vec with gensim

import re
import numpy as np

from gensim.models import Doc2Vec
from gensim.models.doc2vec import TaggedDocument
from nltk.corpus import gutenberg
from multiprocessing import Pool
from scipy import spatial


Import the training dataset
Import Shakespeare's Hamlet corpus from the nltk library

sentences = list(gutenberg.sents('shakespeare-hamlet.txt'))   # import the corpus and convert into a list

print('type of corpus: ', type(sentences))
print('length of corpus: ', len(sentences))


type of corpus:  <class 'list'>
length of corpus:  3106

print(sentences[0])    # title, author, and year
print(sentences[1])
print(sentences[10])


['[', 'The', 'Tragedie', 'of', 'Hamlet', 'by', 'William', 'Shakespeare', '1599', ']']
['Actus', 'Primus', '.']
['Fran', '.']

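Preprocess the data

Use the re module to preprocess the data
Convert all letters to lowercase
Remove punctuation, numbers, etc.
For the doc2vec model, the input data should be an iterable of TaggedDocument instances
- Each TaggedDocument instance holds words and tags
- Hence, each document (i.e., a sentence or paragraph) should have a unique, identifiable tag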

for i in range(len(sentences)):
    sentences[i] = [word.lower() for word in sentences[i] if re.match('^[a-zA-Z]+', word)]    # keep tokens that start with a letter, lowercased
print(sentences[0])    # title, author, and year
print(sentences[1])
print(sentences[10])


['the', 'tragedie', 'of', 'hamlet', 'by', 'william', 'shakespeare']
['actus', 'primus']
['fran']

for i in range(len(sentences)):
    sentences[i] = TaggedDocument(words = sentences[i], tags = ['sent{}'.format(i)])    # converting each sentence into a TaggedDocument
sentences[0]


TaggedDocument(words=['the', 'tragedie', 'of', 'hamlet', 'by', 'william', 'shakespeare'], tags=['sent0'])
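
The two preprocessing steps can also be tried on a toy input without downloading the corpus. What follows is only a minimal sketch: `Tagged` is a hypothetical stand-in for gensim's TaggedDocument (used so the snippet runs without gensim installed), and the two sample sentences and the `clean` helper are made up for illustration.

```python
import re
from collections import namedtuple

# Hypothetical stand-in for gensim's TaggedDocument (same two fields: words, tags)
Tagged = namedtuple('Tagged', ['words', 'tags'])

# Made-up tokenized sentences, used only for illustration
raw = [['[', 'The', 'Tragedie', 'of', 'Hamlet', '1599', ']'],
       ['Actus', 'Primus', '.']]

def clean(tokens):
    # keep tokens that start with a letter, converted to lowercase
    return [t.lower() for t in tokens if re.match('^[a-zA-Z]+', t)]

# give every cleaned sentence its own unique tag: sent0, sent1, ...
tagged = [Tagged(words=clean(s), tags=['sent{}'.format(i)]) for i, s in enumerate(raw)]

for doc in tagged:
    print(doc)
```

Brackets, the year, and the full stop are dropped because they do not start with a letter, and each sentence ends up with exactly one tag derived from its position in the list.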


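Create and train the model

Create a doc2vec model and train it on the Hamlet corpus
Key parameters (https://radimrehurek.com/gensim/models/doc2vec.html)
- documents: training data (an iterable of TaggedDocument instances, here the list built above)
- size: dimensionality of the embedding space
- dm: training algorithm; distributed memory (PV-DM) if 1, distributed bag of words (PV-DBOW) if 0
- window: number of context words considered on each side (with a window size of 3, the 3 words to the left and the 3 words to the right are considered)
- min_count: minimum number of occurrences a word needs to be included in the vocabulary
- iter: number of training iterations
- workers: number of worker threads used for training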

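# written for gensim 2.3.0; gensim 4.x renamed size/iter to vector_size/epochs
model = Doc2Vec(documents = sentences, dm = 1, size = 100, min_count = 1, iter = 10, workers = Pool()._processes)

model.init_sims(replace = True)    # keep only the normalized vectors to save memory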
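
To make the meaning of the window and min_count parameters more concrete, here is a small pure-Python sketch. It only illustrates what the two parameters describe and is not how gensim implements training internally; the helper names `context_windows` and `build_vocab` and the toy sentences are assumptions introduced for this example.

```python
from collections import Counter

def context_windows(tokens, window):
    """Yield (center, context) pairs with up to `window` words on each side."""
    for i, center in enumerate(tokens):
        left = tokens[max(0, i - window):i]
        right = tokens[i + 1:i + 1 + window]
        yield center, left + right

def build_vocab(sentences, min_count):
    """Keep the words that occur at least `min_count` times over all sentences."""
    counts = Counter(w for s in sentences for w in s)
    return {w for w, c in counts.items() if c >= min_count}

# Toy tokenized sentences, made up for illustration
toy = [['to', 'be', 'or', 'not', 'to', 'be'],
       ['to', 'die', 'to', 'sleep']]

pairs = list(context_windows(toy[0], window=2))
vocab = build_vocab(toy, min_count=2)

print(pairs[2])        # the word 'or' with its two left and two right neighbours
print(sorted(vocab))   # only words seen at least twice survive
```

With min_count = 1, as in the model above, no word is filtered out, which is a reasonable choice for a corpus as small as a single play.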


Save and load the model

A doc2vec model can be saved to and loaded from local disk
Doing so avoids the time needed to train the model again

model.save('doc2vec_model')
model = Doc2Vec.load('doc2vec_model')


Similarity calculation

Similarity between embedded words (i.e., vectors) can be computed using metrics such as cosine similarity

model.most_similar('hamlet')


[('horatio', 0.9978846311569214),
('queen', 0.9971947073936462),
('laertes', 0.9971820116043091),
('king', 0.9968599081039429),
('mother', 0.9966716170310974),
('where', 0.9966292381286621),
('deere', 0.9965540170669556),
('ophelia', 0.9964221715927124),
('very', 0.9963752627372742),
('oh', 0.9963476657867432)]

v1 = model['king']
v2 = model['queen']

# define a function that computes cosine similarity between two words
def cosine_similarity(v1, v2):
    return 1 - spatial.distance.cosine(v1, v2)

cosine_similarity(v1, v2)


0.99437165260314941
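
For reference, cosine similarity is the dot product of two vectors divided by the product of their lengths, which is the same quantity as one minus the cosine distance. The sketch below spells that formula out in plain Python; the function name `cosine_similarity_plain` and the three short vectors are made up for illustration and are not taken from the trained model.

```python
import math

def cosine_similarity_plain(a, b):
    # dot product of the two vectors
    dot = sum(x * y for x, y in zip(a, b))
    # Euclidean length of each vector
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

# Toy vectors: v points in the same direction as u, w is orthogonal to u
u = [1.0, 2.0, 3.0]
v = [2.0, 4.0, 6.0]
w = [-3.0, 0.0, 1.0]

print(cosine_similarity_plain(u, v))   # close to 1: same direction
print(cosine_similarity_plain(u, w))   # 0: orthogonal vectors
```

Because the measure ignores vector length, two vectors pointing in the same direction score close to 1 regardless of their magnitudes.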
