Text Mining | When Is Topic Modeling Useful?

vlambda
2020-07-24


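Topic modeling helps decision makers work through large volumes of text by building a probabilistic model of how often terms occur in documents. The fitted model can then be used to estimate how similar documents are to one another and to a set of keywords. Have you ever wondered how the message in Obama's State of the Union addresses changed over the years? To find out, download the text of the 2010-2015 addresses from https://github.com/datameister66/data.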

Data preprocessing

> library(tm)
> library(wordcloud)
> library(RColorBrewer)
> getwd()  # current working directory
[1] "/Users/apple/Desktop"
> # Note: put the downloaded 2010-2015 .txt files in a folder of their own named "text"
> name <- file.path("/Users/apple/Desktop/text")
> length(dir(name))  # number of .txt files
[1] 6
> dir(name)  # file names
[1] "sou2010.txt" "sou2011.txt" "sou2012.txt" "sou2013.txt" "sou2014.txt" "sou2015.txt"

Building the corpus DOC

> DOC <- Corpus(DirSource(name))
> DOC
<<SimpleCorpus>>
Metadata:  corpus specific: 1, document level (indexed): 0
Content:  documents: 6

Text transformations

> DOC <- tm_map(DOC, tolower)  # convert to lower case
> DOC <- tm_map(DOC, removeNumbers)  # remove numbers
> DOC <- tm_map(DOC, removePunctuation)  # remove punctuation
> DOC <- tm_map(DOC, removeWords, stopwords("english"))  # remove English stop words
> DOC <- tm_map(DOC, stripWhitespace)  # collapse extra whitespace
> DOC <- tm_map(DOC, removeWords, c("applause", "can", "cant", "will", "that", "weve", "dont", "wont", "youll", "youre"))  # remove other unwanted words
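To make the effect of these transformations concrete, here is a minimal base-R sketch that applies the same kind of steps, in the same order, to a single made-up sentence. The sentence and the tiny stop list are assumptions for illustration only; tm's own functions and its full English stop list are not used here.

```r
# Toy sentence (made up for illustration, not taken from the speeches)
x <- "We can't wait: 3 million NEW jobs, and that's a start! (Applause)"

# A tiny hand-picked stop list; stopwords("english") in tm is much longer
stop_list <- c("we", "and", "a", "cant", "thats", "applause")

x <- tolower(x)                     # lower case
x <- gsub("[[:digit:]]+", "", x)    # drop numbers
x <- gsub("[[:punct:]]+", "", x)    # drop punctuation (apostrophes included)

# Splitting on runs of whitespace also collapses the extra blanks
words <- strsplit(x, "[[:space:]]+")[[1]]
words <- words[nzchar(words) & !(words %in% stop_list)]

cleaned <- paste(words, collapse = " ")
print(cleaned)
```

Because punctuation is stripped before stop words are matched, contractions such as "can't" turn into "cant", which is presumably why forms like "cant" and "dont" are removed by hand in the last tm_map() call.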

Building the document-term matrix

> inspect(DocumentTermMatrix(DOC))

> doc <- DocumentTermMatrix(DOC)
> doc
<<DocumentTermMatrix (documents: 6, terms: 4373)>>
Non-/sparse entries: 9470/16768
Sparsity           : 64%
Maximal term length: 16
Weighting          : term frequency (tf)
> dim(doc)

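# The matrix has 6 rows and 4373 columns: the six addresses contain 4373 distinct terms. That is too many to work with comfortably, so we drop the terms whose sparsity exceeds 0.75. Since (1 - 0.75) * 6 = 1.5, any term that appears in fewer than 2 documents is removed.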
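The sparsity rule can be checked by hand on a small example. The sketch below uses an invented 6-document count matrix with four made-up terms, and it assumes that "sparsity" means the share of documents in which a term is absent and that terms are kept when this share is below the threshold.

```r
# Toy document-term counts: 6 documents (rows) x 4 made-up terms (columns).
# matrix() fills column by column, so each line below is one term.
m <- matrix(c(1, 0, 0, 0, 0, 0,   # "alpha": present in 1 document
              2, 1, 0, 0, 0, 0,   # "beta":  present in 2 documents
              1, 1, 1, 0, 0, 0,   # "gamma": present in 3 documents
              3, 2, 1, 1, 2, 4),  # "delta": present in all 6 documents
            nrow = 6,
            dimnames = list(as.character(2010:2015),
                            c("alpha", "beta", "gamma", "delta")))

# Share of documents in which each term does not occur
term_sparsity <- colMeans(m == 0)

# Keep the terms whose sparsity is below the 0.75 threshold
kept <- names(term_sparsity)[term_sparsity < 0.75]

print(round(term_sparsity, 3))
print(kept)
```

"alpha" occurs in only one of the six documents, so its sparsity is 5/6 and it is dropped; "beta" occurs in two documents (sparsity 4/6) and survives, which matches the "fewer than 2 documents" rule.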

> doc1 <- removeSparseTerms(doc, 0.75)
> dim(doc1)
[1]    6 2058

# After filtering, the six addresses are described by 2058 terms.

# Name the rows of the matrix after the years

> rownames(doc1) <- c("2010", "2011", "2012", "2013", "2014", "2015")

Computing and analyzing term frequencies

> freq <- colSums(as.matrix(doc1))  # total count of each term
> ord <- order(-freq)  # indices in decreasing order of frequency
> freq[head(ord)]  # the 6 most frequent terms
    new    jobs america  people     now   thats
    177     155     153     148     142     141
> freq[tail(ord)]  # the 6 least frequent terms
   terror terrorism     treat      undo   unleash      zero
        2         2         2         2         2         2

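The most frequent word is new, followed by jobs, and the president also mentions america very often. The high count for jobs shows how much weight these addresses give to employment.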

# Use findFreqTerms() to list the terms that occur at least 125 times

> findFreqTerms(doc1, 125)
[1] "america"  "american" "jobs"     "new"      "now"      "people"   "thats"    "years"
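The frequency steps can be reproduced on a toy matrix with base R alone. The counts and the threshold of 10 below are invented for illustration; the last step is only a rough stand-in for what findFreqTerms() reports, namely the terms whose total count reaches a lower bound.

```r
# Toy counts: 3 documents x 4 made-up terms (not the real speech data).
# matrix() fills column by column, so each line below is one term.
m <- matrix(c(5, 7, 6,    # "jobs"
              9, 8, 4,    # "new"
              1, 0, 1,    # "zero"
              2, 3, 0),   # "tax"
            nrow = 3,
            dimnames = list(c("d1", "d2", "d3"),
                            c("jobs", "new", "zero", "tax")))

freq <- colSums(m)            # total count of each term over all documents
ord  <- order(-freq)          # column indices, most frequent first
top  <- freq[head(ord, 2)]    # the two most frequent terms

# Terms whose total count is at least 10
frequent <- names(freq)[freq >= 10]

print(top)
print(frequent)
```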

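Once the term frequencies are available they can feed further analyses, such as correlations between terms, word clouds, ggplot2 visualizations, or comparisons of these measures within a single address or between two addresses. This post concentrates on building a topic model.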

Topic modeling

We use the topicmodels package and its LDA() function to fit a model with 3 topics.

> library(topicmodels)
> set.seed(123)
> lda4 <- LDA(doc1, k = 3, method = "Gibbs")
> topics(lda4)
2010 2011 2012 2013 2014 2015
   3    2    1    1    1    1

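The way the topic changes over time is striking: the 2010 and 2011 addresses each fall into a topic of their own, while the 2012-2015 addresses all share the same topic.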
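topics() reports a single most likely topic per document. A minimal sketch of that idea, assuming a hypothetical matrix of per-document topic probabilities (in topicmodels the fitted values should be available through posterior(lda4)$topics; the numbers below are invented and are not the fitted model's):

```r
# Hypothetical per-document topic probabilities; each row sums to 1.
# These values are invented and are NOT taken from the fitted model.
gamma <- matrix(c(0.20, 0.25, 0.55,
                  0.15, 0.60, 0.25,
                  0.70, 0.20, 0.10),
                nrow = 3, byrow = TRUE,
                dimnames = list(c("docA", "docB", "docC"),
                                c("Topic 1", "Topic 2", "Topic 3")))

# Assign each document to the topic with the highest probability
assigned <- apply(gamma, 1, which.max)
print(assigned)
```

A hard assignment like this hides how close the runner-up topic is, so looking at the full probability rows is worthwhile before reading much into a grouping of only six documents.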

# List the 20 highest-ranked terms for each topic

> terms(lda4, 20)
      Topic 1    Topic 2     Topic 3
 [1,] "america"  "jobs"      "thats"
 [2,] "years"    "just"      "year"
 [3,] "new"      "last"      "people"
 [4,] "every"    "energy"    "american"
 [5,] "lets"     "tax"       "businesses"
 [6,] "congress" "now"       "now"
 [7,] "country"  "also"      "economy"
 [8,] "work"     "future"    "time"
 [9,] "get"      "tonight"   "one"
[10,] "make"     "next"      "take"
[11,] "right"    "come"      "americans"
[12,] "help"     "education" "know"
[13,] "world"    "new"       "two"
[14,] "want"     "people"    "families"
[15,] "states"   "support"   "security"
[16,] "job"      "change"    "work"
[17,] "need"     "reform"    "still"
[18,] "like"     "deficit"   "like"
[19,] "home"     "need"      "many"
[20,] "american" "must"      "health"
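A ranking of this kind can be sketched as sorting each topic's term weights and keeping the first few names. The weight matrix below is hypothetical (invented values for two topics and four terms), not the model's estimates.

```r
# Hypothetical topic-term weights (invented values, 2 topics x 4 terms)
beta <- matrix(c(0.40, 0.10, 0.30, 0.20,
                 0.05, 0.50, 0.15, 0.30),
               nrow = 2, byrow = TRUE,
               dimnames = list(c("Topic 1", "Topic 2"),
                               c("jobs", "energy", "work", "tax")))

# For each topic (row), take the names of the 2 highest-weighted terms;
# the result has one column per topic, like the listing above
top_terms <- apply(beta, 1, function(w) names(sort(w, decreasing = TRUE))[1:2])
print(top_terms)
```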

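The top terms suggest that the 2010 State of the Union address (Topic 3) centers on the economy and businesses, while the words that carry the message of the 2011 address (Topic 2) include jobs, energy and deficit. The 2012-2015 addresses (Topic 1) revolve around work and job. Why these four addresses end up in the same topic is an open question that curious readers may want to investigate.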
