Text Mining in R: tf-idf, Topic Modeling, Sentiment Analysis, and n-gram Modeling

vlambda
2020-06-10


Original link: http://tecdat.cn/?p=6864

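We will analyze 20,000 messages sent to 20 Usenet bulletin boards in 1993. The boards in this dataset include newsgroups on topics such as politics, religion, cars, sports, and cryptography.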

Preprocessing

We start by reading all of the messages in the 20news-bydate folder, which is organized into sub-folders with one file per message. Files laid out this way can be read with a combination of read_lines(), map(), and unnest().

raw_text
## # A tibble: 511,655 x 3
##   newsgroup   id    text
##   <chr>       <chr> <chr>
## 1 alt.atheism 49960 From: mathew <[email protected]>
## 2 alt.atheism 49960 Subject: Alt.Atheism FAQ: Atheist Resources
## 3 alt.atheism 49960 Summary: Books, addresses, music -- anything related to atheism
## 4 alt.atheism 49960 Keywords: FAQ, atheism, books, music, fiction, addres
## # … with 511,645 more rows
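
The reading step can be sketched as follows. This is a minimal illustration under stated assumptions, not necessarily the code used to build raw_text: the helper read_folder() and the toy corpus written to a temporary directory are hypothetical stand-ins for the real 20news-bydate folder, so that the example runs on its own.

```r
library(dplyr)
library(tidyr)
library(purrr)
library(readr)

# Build a tiny stand-in corpus (hypothetical contents): one sub-folder per
# newsgroup, one file per message, mirroring the layout described above.
root <- file.path(tempdir(), "toy-20news")
unlink(root, recursive = TRUE)
dir.create(file.path(root, "alt.atheism"), recursive = TRUE)
dir.create(file.path(root, "sci.space"), recursive = TRUE)
write_lines(c("Subject: first message", "Some text."),
            file.path(root, "alt.atheism", "1001"))
write_lines("Subject: second message",
            file.path(root, "alt.atheism", "1002"))
write_lines(c("Subject: orbits", "More text.", "Last line."),
            file.path(root, "sci.space", "2001"))

# Hypothetical helper: read every file in one folder, one row per line of text.
read_folder <- function(infolder) {
  tibble(file = dir(infolder, full.names = TRUE)) %>%
    mutate(text = map(file, read_lines)) %>%   # list-column of character vectors
    transmute(id = basename(file), text) %>%
    unnest(cols = text)                        # expand to one row per line
}

# Apply the helper to each newsgroup folder and flatten the result.
raw_text <- tibble(folder = dir(root, full.names = TRUE)) %>%
  mutate(folder_out = map(folder, read_folder)) %>%
  unnest(cols = folder_out) %>%
  transmute(newsgroup = basename(folder), id, text)

raw_text
```

To run it against the real data, point root at the extracted training folder instead of the temporary directory.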

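Note that the newsgroup column records which of the 20 newsgroups each message came from, and the id column identifies the message within that newsgroup.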
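
Because id is only meaningful within a newsgroup, counting messages means counting distinct ids per newsgroup, not distinct ids overall. A small sketch on made-up rows (the values below are illustrative, not taken from the dataset):

```r
library(dplyr)

# Toy rows in the same shape as raw_text; note that id "1001" appears
# in two different newsgroups and refers to two different messages.
raw_text_toy <- tribble(
  ~newsgroup,    ~id,    ~text,
  "alt.atheism", "1001", "line one",
  "alt.atheism", "1001", "line two",
  "alt.atheism", "1002", "line one",
  "sci.space",   "1001", "line one"
)

# One row per newsgroup with the number of distinct messages in it.
messages_per_group <- raw_text_toy %>%
  group_by(newsgroup) %>%
  summarize(messages = n_distinct(id))

messages_per_group
```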

Finding tf-idf within newsgroups

We would expect the newsgroups to differ in topic and content, and therefore in how frequently they use particular words.
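
One way to quantify this is tf-idf, treating each newsgroup as a single document. The sketch below uses bind_tf_idf() from the tidytext package on a made-up table of word counts; the table name words_by_newsgroup_toy and its values are assumptions for illustration only.

```r
library(dplyr)
library(tidytext)

# Toy word counts per newsgroup (illustrative values, not real data).
words_by_newsgroup_toy <- tribble(
  ~newsgroup,    ~word,    ~n,
  "sci.space",   "orbit",  5,
  "sci.space",   "people", 5,
  "rec.autos",   "engine", 4,
  "rec.autos",   "people", 6,
  "alt.atheism", "god",    3,
  "alt.atheism", "people", 7
)

# tf  = share of a newsgroup's words taken by this word
# idf = ln(number of newsgroups / number of newsgroups containing the word)
newsgroup_tf_idf <- words_by_newsgroup_toy %>%
  bind_tf_idf(word, newsgroup, n) %>%
  arrange(desc(tf_idf))

newsgroup_tf_idf
```

A word that occurs in every newsgroup, such as "people" here, has an idf of zero and therefore a tf-idf of zero, while words confined to one newsgroup rise to the top.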


newsgroup_cors
## # A tibble: 380 x 3
##    item1                    item2                    correlation
##    <chr>                    <chr>                          <dbl>
##  1 talk.religion.misc       soc.religion.christian         0.835
##  2 soc.religion.christian   talk.religion.misc             0.835
##  3 alt.atheism              talk.religion.misc             0.779
##  4 talk.religion.misc       alt.atheism                    0.779
##  5 alt.atheism              soc.religion.christian         0.751
##  6 soc.religion.christian   alt.atheism                    0.751
##  7 comp.sys.mac.hardware    comp.sys.ibm.pc.hardware       0.680
##  8 comp.sys.ibm.pc.hardware comp.sys.mac.hardware          0.680
##  9 rec.sport.baseball       rec.sport.hockey               0.577
## 10 rec.sport.hockey         rec.sport.baseball             0.577
## # … with 370 more rows
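
The code that produced newsgroup_cors is not shown here. A table with item1, item2, and correlation columns can be obtained with pairwise_cor() from the widyr package, which correlates word counts between every pair of newsgroups. The sketch below assumes that approach and uses made-up counts with hypothetical group names:

```r
library(dplyr)
library(widyr)

# Toy word counts (illustrative). Words missing from a group count as zero.
word_counts_toy <- tribble(
  ~newsgroup, ~word, ~n,
  "group.a",  "w1",  4,
  "group.a",  "w2",  2,
  "group.b",  "w1",  8,
  "group.b",  "w2",  4,
  "group.c",  "w3",  3,
  "group.c",  "w4",  5
)

# Correlate newsgroups (items) by their counts across words (features).
newsgroup_cors_toy <- word_counts_toy %>%
  pairwise_cor(newsgroup, word, n, sort = TRUE)

newsgroup_cors_toy
```

Each pair appears twice, once in each order, which is why the real output has 380 rows for 20 newsgroups (20 × 19).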