TF-IDF Project P3-1: Computing TF-IDF Values and Extracting Top-K Keywords

Author: 陈华 • Published: 2022-12-22 • Views: 2527

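The previous lesson introduced the TF-IDF formula. To make the algorithm easier to follow, this lesson first implements it in a procedural style. The project requirements also call for dynamically loading the corpus and maintaining a stop-word list, so the code will still need to be wrapped in a class later on.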

Code example

1. Define the corpus

import math

corpus = [
    'my dog sat on my bed',
    'my cat sat on my knees',
]

2. Tokenization

Split each sentence in the corpus into words, and collect the vocabulary of all distinct words along the way.

all_words = []
vocab = set()
corpus_words = []

for sentence in corpus:
    words = sentence.strip().split(' ')
    corpus_words.append(words)
    all_words += words
    vocab.update(words)

3. Compute TF

tf_list = []
for words in corpus_words:
    tf = {}
    for word in words:
        tf[word] = words.count(word) / len(words)
    tf_list.append(tf)
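
The loop above calls `words.count(word)` once per token, so each document is rescanned repeatedly. As a minimal sketch of the same TF definition (count divided by document length), the version below counts each document in a single pass with `collections.Counter`. The helper name `term_frequency` and the variable `tf_list_counter` are hypothetical names chosen for this illustration; they are not part of the lesson's code.

```python
from collections import Counter

# Same toy corpus as in step 1, already tokenized.
corpus_words = [
    'my dog sat on my bed'.split(' '),
    'my cat sat on my knees'.split(' '),
]

def term_frequency(words):
    """Return {word: count / len(words)} for one tokenized document.

    Hypothetical helper for illustration only.
    """
    counts = Counter(words)   # one pass over the document
    total = len(words)
    return {word: count / total for word, count in counts.items()}

tf_list_counter = [term_frequency(words) for words in corpus_words]
print(tf_list_counter[0])
```

For a two-sentence corpus the difference is negligible; it should only start to matter once documents get long.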

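4. Compute IDF

idf_dict = {}
N = len(corpus_words)  # total number of documents
for word in vocab:
    # num = number of documents that contain this word
    num = sum([1 if word in words else 0 for words in corpus_words])
    # +1 in the denominator guards against division by zero
    idf_dict[word] = math.log(N/(num+1))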

5. Compute TF-IDF

tfidf_list = []
for tf in tf_list:
    tfidf = {}
    for word, tf_val in tf.items():
        tfidf[word] = tf_val * idf_dict[word]
    tfidf_list.append(tfidf)

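6. Extract the top-K keywords

top_k = 3  # number of keywords to keep per document
for tfidf in tfidf_list:
    # sort (value, word) pairs from highest to lowest TF-IDF, keep the first top_k
    tfidf_topk = sorted([(v,k) for k,v in tfidf.items()], reverse=True)[:top_k]
    print([k for v,k in tfidf_topk])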
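
When only a few keywords are needed from a large vocabulary, a full sort is more work than necessary. The sketch below uses `heapq.nlargest` from the standard library as one possible alternative. The function name `top_k_keywords` is hypothetical, and the scores in `demo_scores` are made-up numbers used purely to show the call, not values computed from the corpus. The `(score, word)` key is meant to reproduce the tie-breaking of the tuple sort above.

```python
import heapq

def top_k_keywords(tfidf, k):
    """Return the k words with the highest scores (hypothetical helper).

    Ties are broken by the word itself, in descending order, to match
    sorting (value, word) tuples with reverse=True.
    """
    best = heapq.nlargest(k, tfidf.items(), key=lambda kv: (kv[1], kv[0]))
    return [word for word, score in best]

# Made-up scores, for illustration only.
demo_scores = {'dog': 0.3, 'bed': 0.3, 'sat': 0.1, 'on': 0.1, 'my': 0.05}
print(top_k_keywords(demo_scores, 3))  # ['dog', 'bed', 'sat']
```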

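That completes both pieces: computing the TF-IDF value of every word, and sorting by TF-IDF in descending order to extract the top-K keywords. Look closely at the final output, though, and it does not seem to match the keywords from the earlier analysis. The cause is a bug introduced when the formula was presented: 1 was added to the denominator to avoid division by zero. The issue is fairly involved, so it is left for the next lesson.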
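
One way to see a symptom of that bug is to print the IDF values on their own. The snippet below is a small diagnostic sketch that recomputes IDF for the same two-sentence corpus with the `+1` denominator; it does not attempt the fix, which is deferred to the next lesson. With N = 2, a word that appears in both sentences gets log(2/3), which is negative, and a word that appears in only one sentence gets log(2/2) = 0, so no word ends up with a positive score.

```python
import math

# Same toy corpus as in step 1, already tokenized.
corpus_words = [
    'my dog sat on my bed'.split(' '),
    'my cat sat on my knees'.split(' '),
]
N = len(corpus_words)
vocab = set(word for words in corpus_words for word in words)

idf_dict = {}
for word in vocab:
    # number of documents containing the word
    num = sum(1 for words in corpus_words if word in words)
    idf_dict[word] = math.log(N / (num + 1))

# Print the words alphabetically with their IDF values.
for word in sorted(idf_dict):
    print(f'{word:>6}  {idf_dict[word]: .4f}')
```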

Permalink: http://www.chenhuax.com/edu/note/552
