R | Merging, Splitting, Compressing, and Renaming PDFs, and Extracting Text and Tables

vlambda
2020-04-03

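PDF (Portable Document Format) is a file format developed by Adobe Systems for exchanging documents in a way that is independent of the application, operating system, and hardware. It is based on the PostScript imaging model and is designed to reproduce every character, color, and image of the original faithfully, whichever printer is used.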

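PDF is one of the most widely used portable (electronic) document formats. Because the format does not depend on the operating system, the same PDF file can be used on Windows, Unix, and Apple's Mac OS. This makes it well suited to publishing electronic documents and distributing digital information on the Internet, and a growing number of e-books, product manuals, company announcements, online resources, and e-mails use it.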

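In day-to-day office work and research we often need to merge, split, compress, or rename PDF documents, or to extract information (text, tables, images, ...) from them. Plenty of software can do this, but much of it is paid or laborious to use. This post shows how to handle these routine operations with R's pdf* family of packages (pdftools, pdfsearch, pdftables), together with qpdf for merging and splitting. Python has equivalent tools for the same tasks.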

1. Merging, splitting, and compressing PDFs

library(qpdf)

path <- "your pdf files path"
setwd(path)

# Merge PDFs: combine every PDF file in the working directory into one document
pdf_combine(list.files(pattern = "\\.pdf$"))
# [1] "/Users/yong/R projects/pdf/papers/combined.pdf"

# Split PDFs: write pages 1-2 of the document to a new file
pdf_subset("img/test.pdf", pages = 1:2)
# [1] "/Users/yong/R projects/pdf/img/test_output.pdf"
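The merge and split calls can be tried without any existing files. The following is a minimal sketch, assuming the qpdf package is installed: it draws two throwaway PDFs with base graphics, combines them, and checks the page counts with `pdf_length()`. The helper `make_pdf()` and all file names are made up for illustration.

```r
library(qpdf)

# Work in a throwaway directory so nothing in the project is touched
tmp <- tempfile("pdfdemo")
dir.create(tmp)

# Hypothetical helper: draw n_pages simple plots, one per page
make_pdf <- function(path, n_pages) {
  grDevices::pdf(path)
  for (i in seq_len(n_pages)) plot(i, main = paste("page", i))
  grDevices::dev.off()
  path
}

a <- make_pdf(file.path(tmp, "a.pdf"), 2)
b <- make_pdf(file.path(tmp, "b.pdf"), 3)

# Merge both files, then keep only the first two pages of the result
combined <- pdf_combine(c(a, b), output = file.path(tmp, "combined.pdf"))
first_two <- pdf_subset(combined, pages = 1:2,
                        output = file.path(tmp, "first_two.pdf"))

n_combined <- pdf_length(combined)
n_subset <- pdf_length(first_two)
print(c(combined = n_combined, subset = n_subset))
```

Passing `output` explicitly avoids relying on the default output names, which are derived from the input file name.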

# Compress PDFs
# compressPDF() comes from the R.utils package and relies on Ghostscript and qpdf being installed.
# The compression argument sets the image resolution: printer (300 dpi), ebook (150 dpi), screen (72 dpi).
library(R.utils)

compressPDF(filename = "img/test.pdf", outFilename = "out_printer_300dpi.pdf", compression = "gs(printer)+qpdf")
# [1] "compressedPDFs/out_printer_300dpi.pdf"
# attr(,"result")
#   img/test.pdf   compressedPDFs/out_printer_300dpi.pdf
#        4863604                                 1454553

compressPDF(filename = "img/test.pdf", outFilename = "out_ebook_150dpi.pdf", compression = "gs(ebook)+qpdf")
# [1] "compressedPDFs/out_ebook_150dpi.pdf"
# attr(,"result")
#   img/test.pdf   compressedPDFs/out_ebook_150dpi.pdf
#        4863604                                406641

compressPDF(filename = "img/test.pdf", outFilename = "out_screen_72dpi.pdf", compression = "gs(screen)+qpdf")
# [1] "compressedPDFs/out_screen_72dpi.pdf"
# attr(,"result")
#   img/test.pdf   compressedPDFs/out_screen_72dpi.pdf
#        4863604                                237474
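To put the three results side by side, the byte counts reported above can be turned into percentages of the original size. This is plain arithmetic on the numbers from the output; the variable names are illustrative.

```r
# File sizes in bytes, copied from the compressPDF() output above
original <- 4863604
compressed <- c(printer = 1454553, ebook = 406641, screen = 237474)

# Size of each compressed file as a percentage of the original
pct_of_original <- round(100 * compressed / original, 1)
print(pct_of_original)
```

For this particular test file the printer, ebook, and screen presets come out at roughly 30%, 8%, and 5% of the original size. Other documents will behave differently, since the savings depend on how much of the file is image data.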

Extension: compressing PDF files with Ghostscript

# Installation steps for Ghostscript on Linux, Windows, and macOS are easy to find online.
# How to compress a PDF with Ghostscript

# Get Ghostscript with your package manager (aptitude, yum, pacman, yaourt):
apt-get install ghostscript

# Run Ghostscript. Examples:
gs -dSAFER -sDEVICE=pdfwrite -dPDFSETTINGS=/printer -dINTERPOLATE -dNOPAUSE -dQUIET -dBATCH -dNumRenderingThreads=8 -r300 -sOutputFile=output.pdf -c 30000000 setvmthreshold -f input.pdf
gs -sDEVICE=pdfwrite -dCompatibilityLevel=1.4 -dDownsampleColorImages=true -dColorImageResolution=130 -dNOPAUSE -dBATCH -sOutputFile=output.pdf input.pdf

# -dPDFSETTINGS=
#   /screen   - 72 dpi images
#   /ebook    - 150 dpi images
#   /printer  - 300 dpi images
#   /prepress - color preserving, 300 dpi
# -dColorImageResolution  custom image dpi
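The same Ghostscript call can be driven from R. Below is a minimal sketch: `gs_compress_args()` is a hypothetical helper (not part of any package) that only builds the argument vector from the flags listed above, so it can be inspected before anything is executed. Running it assumes the `gs` executable is on the PATH and that `input.pdf` exists.

```r
# Hypothetical helper: build the argument vector for a Ghostscript compression call
gs_compress_args <- function(input, output,
                             preset = c("screen", "ebook", "printer", "prepress")) {
  preset <- match.arg(preset)
  c("-sDEVICE=pdfwrite",
    paste0("-dPDFSETTINGS=/", preset),
    "-dNOPAUSE", "-dQUIET", "-dBATCH",
    # shQuote() protects paths that contain spaces
    paste0("-sOutputFile=", shQuote(output)),
    shQuote(input))
}

args <- gs_compress_args("input.pdf", "output.pdf", preset = "ebook")
print(args)

# To actually run Ghostscript (assumes gs is installed and input.pdf exists):
# system2("gs", args)
```

On Windows the executable usually has a different name (for example `gswin64c`), so the first argument to `system2()` would need adjusting.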

2. Batch-renaming PDFs

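Once we settle on a research topic, we usually start by downloading a pile of related papers to see how far the field has come and how others approached the problem, and to have something to refer to when writing our own paper. With all the papers saved in one folder, a problem appears: the PDF file names follow no common pattern.

# Extract the PDF creation date and the paper title, then rename the files automatically
library(dplyr)
library(pdftools)
library(purrr)
library(lubridate)
library(stringi)

path <- "your pdf files path"
setwd(path)

filename <- list.files(pattern = "\\.pdf$")  # only the PDF files, listed once
line_title <- 4  # the line of the first page that holds the title

# Text of every page of every file, split into lines
x1 <- lapply(filename, function(x) pdftools::pdf_text(x) %>% strsplit(split = "\n"))
# Lines of the first page of each file
x2 <- lapply(seq_along(x1), function(x) stringi::stri_split_lines(x1[[x]][[1]]))
# The title line of each file
x3 <- lapply(seq_along(x2), function(x) stringi::stri_split_lines(x2[[x]][[line_title]]))
x4 <- data.frame(matrix(unlist(x3), byrow = TRUE))
names(x4) <- "title"

# Creation date of each file, as YYYY-MM-DD
year <- purrr::transpose(lapply(filename, pdftools::pdf_info))$created
year <- substring(lubridate::as_datetime(as.numeric(year)), 1, 10)

# New name: creation date, title, then "...", with the .pdf extension
newname <- paste0(year, "-", x4$title, "...", ".pdf")
file.rename(filename, newname)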
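A title line taken straight from the page text can contain leading spaces or characters that are not allowed in file names (such as `/`, `:`, or `?`), in which case `file.rename()` may fail. The following is a minimal sketch of a clean-up step that could be applied to the titles before building the new names. `safe_title()` and the 60-character limit are illustrative choices, not part of the original script, and the sample titles are made up.

```r
# Hypothetical helper: turn a raw title line into a string that is safe in a file name
safe_title <- function(title, max_chars = 60) {
  # Replace characters that are illegal in file names on common systems
  x <- gsub('[\\\\/:*?"<>|]', " ", title, perl = TRUE)
  # Collapse runs of whitespace left over from the PDF layout
  x <- gsub("\\s+", " ", x, perl = TRUE)
  # Trim, truncate, and trim again in case the cut ends on a space
  trimws(substr(trimws(x), 1, max_chars))
}

# Made-up title lines for illustration
titles <- c("  Mixed effects: a review / part 1?  ", "Plain title")
clean <- safe_title(titles)
print(clean)
```

If two papers end up with the same cleaned title, `make.unique()` from base R can be used to disambiguate the names before renaming.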

3. Searching PDFs by keyword

library(pdfsearch)

file <- system.file('pdf', '1501.00450.pdf', package = 'pdfsearch')

# Locate section headings in the text
heading_search(file, headings = c('abstract', 'introduction'), path = TRUE)
# # A tibble: 1 x 5
#   keyword      page_num line_num line_text token_text
#   <chr>           <int>    <int> <list>    <list>
# 1 introduction        4      227 <chr [1]> <list [1]>

# Find where keywords occur in a PDF; returns the page number, line number, and text of the matching line
keyword_search(file, keyword = c('repeated measures', 'mixed effects'), path = TRUE)
# # A tibble: 9 x 5
#   keyword           page_num line_num line_text token_text
#   <chr>                <int>    <int> <list>    <list>
# 1 repeated measures        1        9 <chr [1]> <list [1]>
# 2 repeated measures        1       30 <chr [1]> <list [1]>
# 3 repeated measures        2       57 <chr [1]> <list [1]>
# 4 repeated measures        2       59 <chr [1]> <list [1]>
# 5 repeated measures        2       69 <chr [1]> <list [1]>
# 6 repeated measures        3      165 <chr [1]> <list [1]>
# 7 repeated measures        3      176 <chr [1]> <list [1]>
# 8 repeated measures        3      181 <chr [1]> <list [1]>
# 9 repeated measures        4      308 <chr [1]> <list [1]>

# Batch search: find where keywords occur in every PDF in a directory;
# returns the page number, line number, and text of the matching line
# find directory
directory <- system.file('pdf', package = 'pdfsearch')

# do search over two files
keyword_directory(directory, keyword = c('repeated measures', 'measurement error'), surround_lines = 1, full_names = TRUE)
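To make the shape of these results concrete, here is a minimal base-R sketch of the underlying idea: split each page into lines and record where a keyword matches. It is not how pdfsearch is implemented. `find_keyword()` and the sample pages are made up, and unlike the output above it numbers lines within each page rather than across the whole document.

```r
# Made-up page texts standing in for the output of a PDF text extractor
pages <- c(
  "Repeated measures designs are common.\nWe fit several models.",
  "Mixed effects models handle repeated measures.\nEnd of page."
)

# Hypothetical helper: case-insensitive literal search, one result row per matching line
find_keyword <- function(pages, keyword) {
  per_page <- lapply(seq_along(pages), function(p) {
    lines <- strsplit(pages[p], "\n", fixed = TRUE)[[1]]
    idx <- which(grepl(tolower(keyword), tolower(lines), fixed = TRUE))
    data.frame(
      keyword = rep(keyword, length(idx)),
      page_num = rep(p, length(idx)),
      line_num = idx,          # line number within the page
      line_text = lines[idx],
      stringsAsFactors = FALSE
    )
  })
  do.call(rbind, per_page)
}

hits <- find_keyword(pages, "repeated measures")
print(hits)
```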

Extension: extracting species names from PDF documents

pdfSpeciesScraper package

# The author has not maintained this package for a long time; it can be downloaded and run locally:
https://codeload.github.com/joe-devivo/pdfSpeciesScraper/zip/master

4. Extracting table data from PDFs

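# The pdftables package needs an API key; register at https://pdftables.com to get one
library(pdftables)

# Write a small table to test.csv.
# test.pdf below is assumed to be a PDF version of this table (for example, test.csv printed to PDF).
write.csv(head(iris, 20), file = "test.csv", row.names = FALSE)

# Check the remaining quota on the API key
get_remaining("your api_key")

# The package (and API) supports converting PDFs to .csv, .xml, and .xlsx.
convert_pdf("test.pdf", "test2.csv", api_key = "your api_key")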
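For a simple table whose columns are separated only by whitespace, the text returned by a PDF text extractor can sometimes be parsed locally with base R. The following is a minimal sketch under that assumption. The `page_text` string is made up and stands in for one page of extracted text, and the approach breaks down as soon as a cell contains spaces or the table spans several blocks.

```r
# Made-up page text standing in for one page returned by a PDF text extractor
page_text <- paste(
  "Sepal.Length  Sepal.Width  Species",
  "5.1           3.5          setosa",
  "4.9           3.0          setosa",
  sep = "\n"
)

# read.table() splits on any run of whitespace by default
tbl <- read.table(text = page_text, header = TRUE, stringsAsFactors = FALSE)
print(tbl)
```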

Conclusion

Once you are comfortable with the pdf* family of packages (pdftools, pdfsearch, pdftables), routine PDF work in the office and in research takes far less effort.

References

https://pdftables.com
https://baike.baidu.com/item/pdf/317608?fr=aladdin
https://www.r-bloggers.com/how-to-extract-data-from-a-pdf-file-with-r/
