Web Crawler Tutorial (6): Captcha Recognition

vlambda
2020-04-01


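When a crawler has to get past a captcha, we generally use Tesseract, an open-source OCR engine sponsored by Google, and drive it from Python code through the pytesseract wrapper.

Installing Tesseract

Installation instructions for every operating system and platform can be found on the following site: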

https://tesseract-ocr.github.io/tessdoc/Home.html

Installing pytesseract

pip install pytesseract

Reading the image also requires a third-party library, Pillow (the maintained fork of PIL, still imported under the name PIL).

pip install pillow

Example

The following sample code uses pytesseract to convert the text in an image into a plain string:

import pytesseract
from PIL import Image

# Path to tesseract.exe; only needed when tesseract is not on the system PATH
pytesseract.pytesseract.tesseract_cmd = r'E:\Tesseract-OCR\tesseract.exe'

# Open the captcha image
image = Image.open("验证图片.png")
# Call image_to_string to convert the text in the image into a string
text = pytesseract.image_to_string(image)
print(text)

[Figures: the captcha image and the recognized output]
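The string returned by image_to_string usually carries trailing whitespace and sometimes stray symbols, so it is rarely usable as a captcha answer directly. The sketch below is one possible way to tidy this up. It assumes the captcha is a single line made only of ASCII letters and digits, and the helper names (build_config, clean_ocr_text, recognize_captcha) are hypothetical, not part of pytesseract. The `--psm 7` option asks Tesseract to treat the image as one text line; whether `tessedit_char_whitelist` is honored depends on the Tesseract version and engine, so treat it as a hint rather than a guarantee.

```python
import string

# Characters we ASSUME the captcha can contain; adjust for the target site.
ALLOWED_CHARS = string.ascii_letters + string.digits


def build_config(psm=7, whitelist=ALLOWED_CHARS):
    """Build a Tesseract config string: page segmentation mode plus a character whitelist."""
    return f"--psm {psm} -c tessedit_char_whitelist={whitelist}"


def clean_ocr_text(text, allowed=ALLOWED_CHARS):
    """Drop whitespace and every character that is not in the allowed set."""
    return "".join(ch for ch in text if ch in allowed)


def recognize_captcha(path):
    """Hypothetical helper: OCR a single-line captcha image and return the cleaned text."""
    # Imported lazily so the string helpers above work even without Tesseract installed.
    import pytesseract
    from PIL import Image

    with Image.open(path) as image:
        raw = pytesseract.image_to_string(image, config=build_config())
    return clean_ocr_text(raw)
```

Calling recognize_captcha("验证图片.png") would then return only the letters and digits Tesseract found, ready to be submitted with the form.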

Summary

Tesseract recognizes clean, regular captchas with little effort, but captchas with a lot of noise need to be preprocessed before Tesseract can read them.
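As a rough illustration of such preprocessing, the sketch below uses Pillow to convert the image to grayscale, apply a 3x3 median filter to remove isolated noise pixels, and binarize the result with a fixed threshold. The function name preprocess_captcha and the threshold value 140 are assumptions chosen for illustration; real captchas usually need the threshold tuned, and heavier distortions (lines, warping, overlapping characters) call for more than this.

```python
from PIL import Image, ImageFilter


def preprocess_captcha(image, threshold=140):
    """Return a black-and-white copy of the image with small noise specks reduced."""
    # Grayscale: one channel with values from 0 (black) to 255 (white).
    gray = image.convert("L")
    # A 3x3 median filter replaces each pixel with the median of its neighborhood,
    # which wipes out isolated dark or bright pixels.
    smooth = gray.filter(ImageFilter.MedianFilter(size=3))
    # Binarize: pixels brighter than the threshold become white, the rest black.
    return smooth.point(lambda p: 255 if p > threshold else 0)
```

The cleaned image can be passed to pytesseract.image_to_string in place of the original, for example pytesseract.image_to_string(preprocess_captcha(Image.open("验证图片.png"))).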

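Since version 4, the recognition engine of tesseract-ocr is built on an LSTM neural network; if you are interested, follow the project on GitHub.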
