Programmatically converting/parsing LaTeX code to plain text

2022-01-24 00:00:00 python latex parsing text

Problem description

I have a couple of code projects in C++/Python in which LaTeX-format descriptions and labels are used to generate PDF documentation or graphs made using LaTeX+pstricks. However, we also have some plain-text outputs, such as an HTML version of the documentation (I already have code to write minimal markup for that) and a non-TeX-enabled plot renderer.

For these I would like to eliminate the TeX markup that is necessary for, e.g., representing physical units. This includes non-breaking (thin) spaces, \text, \mathrm, etc. It would also be nice to parse things like \frac{#1}{#2} down into #1/#2 for the plain-text output (and use MathJax for the HTML). Because of the system we have at the moment, I need to be able to do this from Python: ideally I'm looking for a Python package, but a non-Python executable that I can call from Python, capturing its output string, would also be fine.
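As a rough illustration of the reduction described above, here is a minimal regex-only sketch. The helper name `strip_simple_tex` and the choice of which spacing commands to handle are assumptions made for this example, not part of any library. It only copes with flat markup: an argument that itself contains braces is left untouched.

```python
import re

# Hypothetical helper for illustration; handles only non-nested arguments.
_SPACING = re.compile(r"\\[,;:]|~")                        # \, \; \: and the ~ tie
_WRAPPERS = re.compile(r"\\(?:text|mathrm)\{([^{}]*)\}")   # \text{..}, \mathrm{..}
_FRAC = re.compile(r"\\frac\{([^{}]*)\}\{([^{}]*)\}")      # \frac{..}{..}


def strip_simple_tex(s: str) -> str:
    """Reduce a flat TeX label to plain text."""
    s = _SPACING.sub(" ", s)         # spacing commands become ordinary spaces
    s = _WRAPPERS.sub(r"\1", s)      # drop font wrappers, keep their content
    s = _FRAC.sub(r"\1/\2", s)       # \frac{a}{b} becomes a/b
    return s.replace("$", "")        # drop math-mode delimiters


print(strip_simple_tex(r"10~\mathrm{GeV}"))
print(strip_simple_tex(r"$\frac{a}{b}$"))
```

Anything with nested braces, such as a \frac whose numerator contains \mathrm{...}, needs real brace matching rather than a single regex pass.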

I'm aware of the similar question on the TeX StackExchange site, but there weren't any really programmatic solutions there. I've looked at detex, plasTeX and pytex, which all seem a bit dead and don't really do what I need: programmatic conversion of a TeX string to a representative plain-text string.

I could try writing a basic TeX parser using, e.g., pyparsing, but a) that might be pitfall-laden, and help would be appreciated, and b) surely someone has tried that before, or knows of a way to hook into TeX itself to get a better result?

Update: Thanks for all the answers... it does indeed seem to be a bit of an awkward request! I can make do with less than fully general LaTeX parsing, but the reason for considering a parser rather than a load of regexes in a loop is that I want to handle nested macros and multi-argument macros nicely, and get the brace matching to work properly. Then I can, e.g., first reduce macros that are irrelevant to plain text, like \text and \mathrm, and handle the relevant ones, like \frac, last... maybe even with appropriate parentheses! Well, I can dream... for now regexes are not doing such a terrible job.
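The brace-matching idea in the update can be sketched in plain Python without a parser library. This is a minimal sketch under stated assumptions: the names `read_group`, `paren`, `to_plain` and the `HANDLERS` table are invented for this example, every handled macro is assumed to be followed directly by its brace-delimited arguments, and escaped braces (\{ and \}) are not treated specially.

```python
import re

MACRO = re.compile(r"\\([A-Za-z]+)")


def read_group(s, i):
    """Return (content, next_index) for the brace group opening at s[i]."""
    if i >= len(s) or s[i] != "{":
        raise ValueError("expected '{' at position %d" % i)
    depth = 0
    for j in range(i, len(s)):
        if s[j] == "{":
            depth += 1
        elif s[j] == "}":
            depth -= 1
            if depth == 0:
                return s[i + 1:j], j + 1
    raise ValueError("unbalanced braces")


def paren(s):
    """Parenthesise anything that is not a single word-like token."""
    return s if re.fullmatch(r"\w+", s) else "(" + s + ")"


# Hypothetical table: macro name -> (argument count, formatter).
HANDLERS = {
    "text": (1, lambda a: a),
    "mathrm": (1, lambda a: a),
    "frac": (2, lambda n, d: paren(n) + "/" + paren(d)),
}


def to_plain(s):
    """Recursively reduce handled macros; copy everything else unchanged."""
    out, i = [], 0
    while i < len(s):
        m = MACRO.match(s, i)
        if m and m.group(1) in HANDLERS:
            nargs, fmt = HANDLERS[m.group(1)]
            i = m.end()
            args = []
            for _ in range(nargs):
                content, i = read_group(s, i)
                args.append(to_plain(content))  # inner macros are reduced first
            out.append(fmt(*args))
        else:
            out.append(s[i])
            i += 1
    return "".join(out)


print(to_plain(r"\frac{\text{a+b}}{\mathrm{c}}"))
```

Because arguments are converted recursively before the outer formatter runs, wrappers such as \text and \mathrm disappear before \frac decides whether parentheses are needed. Macros missing from the table are passed through verbatim.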


Solution

I understand this is an old post, but since it comes up often in searches on parsing LaTeX with Python (as evidenced by "Extract only body text from arXiv articles formatted as .tex"), I'm leaving this here for folks down the line: here's a LaTeX parser in Python that supports searching and modifying the parse tree, https://github.com/alvinwan/texsoup. Taken from the README, here is some sample text and how you can interact with it via TexSoup.

from TexSoup import TexSoup

# Raw string, so that \b, \t, \s, \e etc. are not read as Python escapes.
soup = TexSoup(r"""
\begin{document}

\section{Hello \textit{world}.}

\subsection{Watermelon}

(n.) A sacred fruit. Also known as:

\begin{itemize}
\item red lemon
\item life
\end{itemize}

Here is the prevalence of each synonym.

\begin{tabular}{c c}
red lemon & uncommon \\
life & common
\end{tabular}

\end{document}
""")

Here's how to navigate the parse tree.

>>> soup.section  # grabs the first `section`
\section{Hello \textit{world}.}
>>> soup.section.name
'section'
>>> soup.section.string
'Hello \textit{world}.'
>>> soup.section.parent.name
'document'
>>> soup.tabular
\begin{tabular}{c c}
red lemon & uncommon \\
life & common
\end{tabular}
>>> soup.tabular.args[0]
'c c'
>>> soup.item
\item red lemon
>>> list(soup.find_all('item'))
[\item red lemon, \item life]

Disclaimer: I wrote this library, though for similar reasons. Regarding the post by Little Bobby Tales (about \def), TexSoup doesn't handle definitions.
