R | Word Frequency Counting
In this chapter
- Import the stop words
- Read the data and segment it into words
- Remove the stop words
Importing the stop word list
library(dplyr)
stopwords <- readtext::readtext("data/stopwords.txt") %>%
  as.character() %>%
  stringr::str_split('\n') %>%
  unlist()
# Show the first 50 stop words
head(stopwords, n = 50)
## [1] "?" "、" "。" "“" "”" "《" "》" "!" "!" ","
## [11] "," ":" ":" ";" "?" "-" "(" ")" "(" ")"
## [21] "·" "--" "……" "/" "." "|" "——" "‘" "’" "□"
## [31] "【" "】" "A" "B" "C" "D" "啊" "阿" "哎" "哎呀"
## [41] "哎哟" "唉" "俺" "俺们" "按" "按照" "吧" "吧哒" "把" "罢了"
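As a cross-check, the same list can be read with base R alone. This is a minimal sketch, assuming the file holds one stop word per line in UTF-8. The helper name `read_stopwords` is hypothetical, and the demo writes a small temporary file instead of touching `data/stopwords.txt`.

```r
# Hypothetical helper: read one stop word per line (assumes UTF-8 text).
read_stopwords <- function(path) {
  lines <- readLines(path, encoding = "UTF-8", warn = FALSE)
  lines <- trimws(lines)          # drop stray spaces and "\r" from Windows line endings
  unique(lines[nzchar(lines)])    # drop empty lines and duplicates
}

# Demo on a throwaway file so the sketch runs without the real data.
demo_path <- tempfile(fileext = ".txt")
writeLines(c("的", "了", "", "的", "啊 "), demo_path, useBytes = TRUE)
demo_stopwords <- read_stopwords(demo_path)
length(demo_stopwords)  # 3 distinct, non-empty entries
```

Trimming and de-duplicating are optional, but they keep a blank line or a trailing space from silently becoming a "stop word".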
Reading the data and segmenting it
library(jiebaR)
# Tell the worker where the stop word list is
tokenizer <- worker(stop_word = 'data/stopwords.txt')
# Read 三体.txt into a single string
text <- readtext::readtext("data/三体.txt") %>% as.character()
# Segment the text into words
words <- segment(text, tokenizer)
# Show the first 20 words of the result
head(words, n = 20)
## [1] "第" "1" "章" "科学" "边界" "1" "恋上你" "看书"
## [9] "网" "630" "bookla" "最快" "更新" "三体" "全集" "最新"
## [17] "章节" "汪淼" "觉得" "来"
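Removing the stop words
Learning Python left me with the habit of solving everything with a for loop, but for loops in R are really slow~ Here much of the cost comes from c(new_words, word), which copies the whole vector on every iteration.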
new_words <- c()
for (word in words) {
  if (!word %in% stopwords) {
    new_words <- c(new_words, word)
  }
}
head(new_words)
## [1] "1" "章" "科学" "边界" "1" "恋上你"
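The loop above can also be written as a single vectorized filter, which avoids growing a vector one element at a time. This is a minimal sketch on toy data; `words_demo` and `stopwords_demo` are hypothetical stand-ins for `words` and `stopwords`.

```r
# Toy inputs standing in for `words` and `stopwords` from the text above.
words_demo     <- c("第", "1", "章", "科学", "的", "边界")
stopwords_demo <- c("第", "的")

# Logical indexing: keep every word that is not in the stop word list.
new_words_demo <- words_demo[!words_demo %in% stopwords_demo]
new_words_demo  # "1" "章" "科学" "边界"

# Note: setdiff(words_demo, stopwords_demo) would also drop the stop words,
# but it removes duplicates too, which would ruin the frequency counts.
```

On the real data the equivalent line would be `new_words <- words[!words %in% stopwords]`.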
Counting word frequencies
jiebaR has a freq function that counts how often each word occurs in a character vector. It returns a data.frame.
wordfreqs <- jiebaR::freq(new_words)
wordfreqs
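To see what this kind of count involves, here is a rough base R equivalent built on `table()`. It is a sketch on toy data, not jiebaR's implementation. The column names `char` and `freq` are chosen to mirror what I understand `jiebaR::freq` to return, so check `names(wordfreqs)` on your own result.

```r
# Toy input standing in for `new_words`.
new_words_demo <- c("三体", "汪淼", "三体", "科学", "三体", "汪淼")

# table() counts each distinct word; the counts are then put into a data.frame.
counts <- table(new_words_demo)
wordfreqs_demo <- data.frame(
  char = names(counts),        # the distinct words
  freq = as.integer(counts),   # how often each one occurs
  stringsAsFactors = FALSE
)
wordfreqs_demo
```

The rows come back in `table()`'s own order, not by frequency, so sorting is still a separate step.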
Showing the frequencies in descending order
# Jumping slightly ahead of the syllabus: sort with dplyr
wordfreqs <- dplyr::arrange(wordfreqs, -freq)
wordfreqs
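The same ordering can be done without dplyr by indexing with `order()`. This is a minimal sketch on a toy data.frame; `wordfreqs_demo` is a hypothetical stand-in for `wordfreqs`.

```r
# Toy frequency table standing in for `wordfreqs`.
wordfreqs_demo <- data.frame(
  char = c("科学", "三体", "汪淼"),
  freq = c(1L, 3L, 2L),
  stringsAsFactors = FALSE
)

# order() on the negated counts gives row indices from most to least frequent.
sorted_demo <- wordfreqs_demo[order(-wordfreqs_demo$freq), ]

# Keep only the top rows, e.g. for a quick look at the most common words.
top2_demo <- head(sorted_demo, n = 2)
top2_demo$char  # "三体" "汪淼"
```

Within dplyr, `dplyr::arrange(wordfreqs, dplyr::desc(freq))` expresses the same descending sort and also works on non-numeric columns.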
Saving to Excel
Use write_xlsx from the writexl package.
writexl::write_xlsx(wordfreqs, "output/三体词频统计.xlsx")
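The write is likely to fail if the target folder does not exist yet, so it can help to create it first. This is a minimal sketch, assuming that situation. A temporary directory stands in for `output`, the file name `wordfreqs_demo.xlsx` is hypothetical, and the write is skipped if writexl is not installed.

```r
# A temp dir stands in for the "output" folder used in the text.
out_dir <- file.path(tempdir(), "output")
if (!dir.exists(out_dir)) {
  dir.create(out_dir, recursive = TRUE)  # create the folder before writing into it
}
out_path <- file.path(out_dir, "wordfreqs_demo.xlsx")

# Toy frequency table standing in for `wordfreqs`.
demo <- data.frame(char = c("a", "b"), freq = c(3L, 2L))

# Only write if writexl is installed, so the sketch runs either way.
if (requireNamespace("writexl", quietly = TRUE)) {
  writexl::write_xlsx(demo, out_path)
}
```

For the real run, `dir.create("output", showWarnings = FALSE)` before the `write_xlsx` call does the same job.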