How to find all image-based PDFs?

2022-01-24 00:00:00 python pdf ocr debian java

I have many PDF documents on my system, and I sometimes notice that a document is image-based and cannot be edited. In that case I run OCR in Foxit PhantomPDF, which can process multiple files at once, so that the documents become searchable. I would like to find all of my PDF documents that are image-based.

I do not understand how a PDF reader recognizes that a document is not textual and still needs OCR. There must be some fields that these readers access, and those fields should be accessible from the terminal too. An answer in the thread "Check if a PDF file is a scanned one" gives open-ended suggestions on how to do it:

Your best bet might be to check to see if it has text and also see if it contains a large page-sized image or lots of tiled images which cover the page. If you also check the metadata, this should cover most options.
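The heuristic in that quote can be sketched as a small decision function. This is only an illustration under stated assumptions: `PageSummary`, `image_coverage`, `classify_page`, and the 0.9 coverage threshold are hypothetical names and values, not part of any PDF tool. Getting the text length and the image rectangles out of a real page needs a PDF library and is not shown here.

```python
from dataclasses import dataclass
from typing import List, Tuple

# (x0, y0, x1, y1) in the same units as the page size. Hypothetical type.
Rect = Tuple[float, float, float, float]


@dataclass
class PageSummary:
    """Numbers assumed to have been extracted from one page beforehand."""
    width: float
    height: float
    text_chars: int          # number of extractable text characters
    image_rects: List[Rect]  # where each image is drawn on the page


def image_coverage(page: PageSummary) -> float:
    """Fraction of the page area covered by images, capped at 1.0.

    Overlapping images are counted twice, so this is an approximation.
    """
    page_area = page.width * page.height
    if page_area <= 0:
        return 0.0
    covered = 0.0
    for x0, y0, x1, y1 in page.image_rects:
        # Clip each rectangle to the page before measuring it.
        w = max(0.0, min(x1, page.width) - max(x0, 0.0))
        h = max(0.0, min(y1, page.height) - max(y0, 0.0))
        covered += w * h
    return min(1.0, covered / page_area)


def classify_page(page: PageSummary, min_coverage: float = 0.9) -> str:
    """Apply the 'has text?' plus 'page-sized or tiled images?' check."""
    if image_coverage(page) >= min_coverage:
        if page.text_chars == 0:
            return "image-only"
        return "image-with-text-layer"  # e.g. a scan that was already OCRed
    return "not-image-based"
```

One full-page image and several tiles that together cover the page get the same result, which matches the "large page-sized image or lots of tiled images" wording. The threshold is a guess and would need tuning on real documents.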

I would like to understand better how to do this effectively. If some metadata field existed for this, it would be easy, but I have not found one. I think the most likely approach is to check whether a page contains a page-sized image with an OCR text layer for searching, because that approach is effective and some PDF readers already use it. However, I do not know how to do it.

In the Hough transform, the parameters are chosen in a hypercube of the parameter space. Its complexity is $O(A^{m-2})$, where $m$ is the number of parameters and $A$ is the size of the image space, so the problem becomes difficult with more than three parameters. Foxit Reader most probably uses 3 parameters in its implementation. Edges are easy to detect, which helps efficiency, and edge detection must be done before the Hough transform. Corrupted pages are simply ignored. The other two parameters are still unknown, but I think they must be nodes and some intersections. How these intersections are computed is unknown, and so is the exact formulation of the problem.

The command works in Debian 8.5, but I could not get it to work initially in Ubuntu 16.04:

masi@masi:~$ find ./ -name "*.pdf" -print0 | xargs -0 -I {} bash -c 'export file="{}"; if [ $(pdffonts "$file" 2> /dev/null | wc -l) -lt 3 ]; then echo "$file"; fi'
./Downloads/596P.pdf
./Downloads/20160406115732.pdf
^C

OS: Debian 8.5 64 bit
Linux kernel: 4.6 from backports
Hardware: Asus Zenbook UX303UA

Recommended answer

Late to the party, but here is a simple solution. It assumes that PDF files which already contain fonts are not purely image-based:

find ./ -name "*.pdf" -print0 | xargs -0 -I {} bash -c 'if [ "$(pdffonts "$1" 2> /dev/null | wc -l)" -lt 3 ]; then echo "$1"; fi' _ {}

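pdffonts lists the fonts used in a PDF file. If the file contains searchable text, it must also contain fonts, so pdffonts will list them. The check for fewer than three lines is there because the pdffonts header takes 2 lines, so any result with fewer than 3 lines has no fonts. AFAIK there should be no false positives, although that is more a question for the pdffonts developers.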
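For anyone who prefers a script, the same check can be sketched in Python. This is a minimal sketch, not the method pmOCR uses: `font_count`, `has_no_fonts`, and `main` are names made up for illustration, and it assumes `pdffonts` is on the PATH. Unlike the shell test, it treats a non-zero pdffonts exit status (for example an unreadable file) as "unknown" instead of reporting the file as image-based.

```python
import subprocess
import sys
from pathlib import Path

HEADER_LINES = 2  # pdffonts prints a column-title line and a dashed rule


def font_count(pdffonts_output: str) -> int:
    """Number of font rows in pdffonts output, ignoring the 2-line header."""
    lines = [ln for ln in pdffonts_output.splitlines() if ln.strip()]
    return max(0, len(lines) - HEADER_LINES)


def has_no_fonts(pdf_path):
    """True if pdffonts lists no fonts, False if it lists some,
    None if pdffonts exited with an error for this file."""
    result = subprocess.run(
        ["pdffonts", str(pdf_path)],  # argument list, so odd filenames are safe
        capture_output=True,
        text=True,
    )
    if result.returncode != 0:
        return None
    return font_count(result.stdout) == 0


def main(root="."):
    # Walk the tree like `find ./ -name "*.pdf"` and print font-less PDFs.
    for pdf in sorted(Path(root).rglob("*.pdf")):
        if has_no_fonts(pdf):
            print(pdf)


if __name__ == "__main__":
    main(sys.argv[1] if len(sys.argv) > 1 else ".")
```

Saved under a hypothetical name such as find_image_pdfs.py, it would be run as `python3 find_image_pdfs.py ~/Downloads`. Note that a scan which has already been OCRed carries a text layer with fonts, so neither this script nor the one-liner will list it.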

One-liner

find ./ -name "*.pdf" -print0 | xargs -0 -I {} bash -c 'if [ "$(pdffonts "$1" 2> /dev/null | wc -l)" -lt 3 ]; then echo "$1"; fi' _ {}

Explanation: pdffonts file.pdf prints more than 2 lines if the PDF contains text. The command outputs the filenames of all PDF files that do not contain text.

My OCR project, which has the same feature, is on GitHub: deajan/pmOCR.
