How can I detect whether an image contains ASCII characters?

2022-04-12 00:00:00 python image-processing ocr tesseract python-tesseract

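Problem description

I have an image dataset, and I want to filter out all images that contain text (ASCII characters). For example, I have this image of a cute dog:

As you can see, there is a piece of text in the lower-right corner (the date May 18, 2003), so this image should be filtered out.

After some research, I found Tesseract OCR. In Python, I have the following code: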

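from PIL import Image
import pytesseract

# Attempt 1: plain OCR with default settings
img = Image.open('n02086240_1681.jpg')
text = pytesseract.image_to_string(img)
print(text)

# Attempt 2: same, then transliterate any non-ASCII output to ASCII
import unidecode
img = Image.open('n02086240_1681.jpg')
text = pytesseract.image_to_string(img)
text = unidecode.unidecode(text)
print(text)

# Attempt 3: restrict recognition with a character whitelist
import string
char_whitelist = string.digits
char_whitelist += string.ascii_lowercase
char_whitelist += string.ascii_uppercase
# Note: char_whitelist is never passed on; the config hard-codes digits only,
# and --psm 10 treats the whole image as a single character.
text = pytesseract.image_to_string(img, lang='eng',
                        config='--psm 10 --oem 3 -c tessedit_char_whitelist=0123456789')
print(text)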

None of these attempts detects the string (they print only whitespace). How can I detect it?

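Solution

You can use inRange thresholding in HSV color space to isolate the text before running OCR.

The result will be:

If you set the psm mode to 6, the output will be: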

<<
‘
' MAY 18 2003

All the digits are captured correctly, but there are some unwanted characters. If we add an "alphanumeric only" condition, the result will be:

['M', 'A', 'Y', '1', '8', '2', '0', '0', '3']
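To turn that character list into the yes/no decision the question asks for, one possible sketch follows. Note that str.isalnum() also accepts non-ASCII letters and digits, so the sketch adds str.isascii() (Python 3.7+). The function names and the min_chars threshold are assumptions for illustration, not part of the original answer.

```python
def ascii_alnum_chars(ocr_text):
    """Return the ASCII letters and digits found in OCR output, in order."""
    return [c for c in ocr_text if c.isascii() and c.isalnum()]


def contains_ascii_text(ocr_text, min_chars=3):
    """Heuristic: report text if OCR yields at least min_chars ASCII
    alphanumeric characters. min_chars is an assumed threshold meant to
    ignore stray single-character detections."""
    return len(ascii_alnum_chars(ocr_text)) >= min_chars


if __name__ == "__main__":
    sample = "<<\n\u2018\n' MAY 18 2003\n"  # the psm 6 output shown above
    print(ascii_alnum_chars(sample))
    print(contains_ascii_text(sample))
```

An image whose OCR output passes contains_ascii_text would then be dropped from the dataset; the threshold likely needs tuning against a few known examples.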

First, I upsampled the image and then applied Tesseract OCR, because the date was too small to read.

Code:

import cv2
import pytesseract
from numpy import array

img = cv2.imread("result.png")  # Load the upsampled image
img = cv2.cvtColor(img, cv2.COLOR_BGR2HSV)  # inRange below works on HSV values
# Keep only pixels whose HSV values fall inside the range (intended to isolate the text)
msk = cv2.inRange(img, array([0, 103, 171]), array([179, 255, 255]))
krn = cv2.getStructuringElement(cv2.MORPH_RECT, (5, 3))
dlt = cv2.dilate(msk, krn, iterations=1)
# Invert so the selected pixels are dark on a light background
thr = 255 - cv2.bitwise_and(dlt, msk)

txt = pytesseract.image_to_string(thr, config='--psm 6')
print([t for t in txt if t.isalnum()])  # keep alphanumeric characters only
cv2.imshow("", thr)
cv2.waitKey(0)

You can set new values for the minimum and maximum of the range:

import numpy as np

min_range = np.array([0, 103, 171])
max_range = np.array([179, 255, 255])
msk = cv2.inRange(img, min_range, max_range)

You can also experiment with different psm parameters:

txt = pytesseract.image_to_string(thr, config='--psm 6')
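Comparing several page segmentation modes side by side can make that experimentation quicker. The sketch below is one way to do it; the helper name try_psm_modes, the choice of modes (6 = single uniform block, 7 = single text line, 11 = sparse text), and the injectable ocr argument are assumptions for illustration.

```python
def try_psm_modes(image, modes=(6, 7, 11), ocr=None):
    """Run OCR once per page segmentation mode and return {mode: text}.

    `ocr` is any callable of the form ocr(image, config=...) -> str.
    By default pytesseract.image_to_string is used.
    """
    if ocr is None:
        # Imported lazily so the helper can be exercised without Tesseract.
        import pytesseract
        ocr = pytesseract.image_to_string
    return {mode: ocr(image, config=f"--psm {mode}") for mode in modes}


# Usage with the thresholded image from the code above:
#     for mode, text in try_psm_modes(thr).items():
#         print(mode, repr(text))
```

Printing with repr makes whitespace-only results, like those in the question, easy to spot.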

For more information, read: Improving the quality of the output
