Python:在没有剪贴板的情况下从 Office/Excel 文档访问嵌入式 OLE

2022-01-14 00:00:00 python excel com ms-office ole

问题描述

我想使用 Python 从 Office/Excel 文档中添加和提取文件.到目前为止添加东西很容易,但对于提取我还没有找到一个干净的解决方案.

I want to add and extract files from an Office/Excel document using Python. So far adding things is easy but for extracting I haven't found a clean solution.

为了弄清楚我有什么和没有什么,我写了下面的小例子 test.py 并进一步解释.

To make clear what I've got and what not I've written the small example test.py below and explain further.

test.py

import win32com.client as win32
import os 
from tkinter import messagebox
import win32clipboard

# (0) Setup
dir_path = os.path.dirname(os.path.realpath(__file__))
print(dir_path)
excel = win32.gencache.EnsureDispatch('Excel.Application')
wb = excel.Workbooks.Open(dir_path + "\" + "test_excel.xlsx")
ws = wb.Worksheets.Item(1)
objs = ws.OLEObjects()

# (1) Embed file
f = dir_path + "\" + "test_txt.txt"
name = "test_txt_ole.txt"
objs.Add( Filename=f, IconLabel=name )

# (2) Access embedded file
obj = objs.Item(1) # Get single OLE from OLE list
obj.Copy()
win32clipboard.OpenClipboard()
data = win32clipboard.GetClipboardData(0xC004) # Binary access
win32clipboard.EmptyClipboard()
win32clipboard.CloseClipboard()
messagebox.showinfo(title="test_txt_ole.txt", message=str(data))

# (3) Press don't save here to keep 
# wb.Close() # Will close excel document and leave excel opened.
excel.Application.Quit() # Will close excel with all opened documents

为了准备(步骤 0),它会打开一个给定的 excel 文档,其中包含一个之前通过使用 excel 中的 new document 按钮创建的工作表.

For preparation (step 0) it opens a given excel document with one worksheet that was create before by using new document button in excel.

在步骤 (1) 中,它使用 API 将给定的文本文件嵌入到 excel 文档中.该文本文件是之前使用文本编辑器创建的,内容为TEST123".

In step (1) it uses API to embed a given text file to the excel document. The text file was created before with content "TEST123" using a text editor.

随后在步骤 (2) 中,它尝试使用剪贴板从嵌入的 OLE 读回内容,并打开一个消息框,显示剪贴板中 OLE 的内容.

Afterwards in step (2) it tries to read back content from embedded OLE using clipboard and opens a message box that shows the content from OLE in clipboard.

最后(3)程序关闭打开的文档.要保持不变的设置,请在此处按否".

Finally (3) the program closes the opened document. To keep an unchanged setup press no here.

此解决方案的最大缺点是使用剪贴板会破坏剪贴板中的任何用户内容,这在生产环境中是不好的风格.此外,它使用了一个未记录的剪贴板选项.

The big disadvantage of this solution is the use of clipboard which smashes any user content in clipboard which is bad style in a productive environment. Further it uses an undocumented option for clipboard.

更好的解决方案是将 OLE 或 OLE 嵌入文件安全到 python 数据容器或我选择的文件.在我的示例中,我使用了 TXT 文件来轻松识别文件数据.最后,我将使用 ZIP 作为一体式解决方案,但对于 base64 数据,TXT 文件解决方案就足够了.

A better solution would be to safe OLE or OLE embedded file to a python data container or to a file of my choice. In my example I've used a TXT file to easily identify file data. Finally I'll use ZIP for an all-in-one solution but a TXT file solution would be sufficient for base64 data.

0xC004 = 49156 的来源:https://danny.fyi/embedding-and-accessing-a-file-in-excel-with-vba-and-ole-objects-4d4e7863cfff

Source of 0xC004 = 49156: https://danny.fyi/embedding-and-accessing-a-file-in-excel-with-vba-and-ole-objects-4d4e7863cfff

这个 VBA 示例看起来很有趣,但我对 VBA 一无所知:将嵌入的 OLE 对象(Excel 工作簿)保存到 Excel 2010 中的文件

This VBA example look interesting but I have no clue about VBA: Saving embedded OLE Object (Excel workbook) to file in Excel 2010


解决方案

嗯,我觉得 Parfait 的解决方案有点 hackish(在不好的意义上),因为

Well, I find Parfait's solution a bit hackish (in the bad sense) because

  • 它假定 Excel 会将嵌入保存为临时文件,
  • 它假定这个临时文件的路径始终是用户的默认临时路径,
  • 它假定您将有权在此处打开文件,
  • 它假定您使用命名约定来标识您的对象(例如,test_txt"总是在名称中找到,但您不能插入一个对象account_data"),
  • 它假定这个约定不受操作系统干扰(例如,它不会将其更改为 '~test_tx(1)' 以保存字符长度),
  • 它假定计算机上的所有其他程序都知道并接受此约定(没有其他人会使用包含test_txt"的名称).

所以,我写了一个替代解决方案.其本质如下:

So, I wrote an alternative solution. The essence of this is thef following:

  1. 解压缩 .xlsx 文件(或新的基于 XML 的任何其他 Office 文件)格式,不受密码保护)到一个临时路径.

  1. unzip the .xlsx file (or any other Office file in the new XML-based format, which is not password protected) to a temporary path.

遍历 '/xxx/embeddings' ('xxx' ='xl' 或 'word' 或 'ppt'),并创建一个包含 .bin 的字典文件的临时路径作为键和从返回的字典步骤 3 作为值.

iterate through all .bin files inside the '/xxx/embeddings' ('xxx' = 'xl' or 'word' or 'ppt'), and create a dictionary that contains the .bin files' temporary paths as keys and the dictionaries returned from step 3 as values.

从.bin文件中提取信息根据(不是很有据可查)Ole Packager 格式,并将信息返回为一本字典.(检索原始二进制数据作为内容",不仅从 .txt 但任何文件类型,例如.png)

extract information from the .bin file according to the (not very well documented) Ole Packager format, and return the information as a dictionary. (Retrieves the raw binary data as 'contents', not only from .txt but any file type, e.g. .png)

我还在学习 Python,所以这并不完美(没有错误检查,没有性能优化),但你可以从中得到灵感.我在几个例子上对其进行了测试.这是我的代码:

I'm still learning Python, so this is not perfect (no error checking, no performance optimization) but you can get the idea from it. I tested it on a few examples. Here is my code:

import tempfile
import os
import shutil
import zipfile
import glob
import pythoncom
import win32com.storagecon


def read_zipped_xml_bin_embeddings( path_zipped_xml ):
    temp_dir = tempfile.mkdtemp()

    zip_file = zipfile.ZipFile( path_zipped_xml )
    zip_file.extractall( temp_dir )
    zip_file.close()

    subdir = {
            '.xlsx': 'xl',
            '.xlsm': 'xl',
            '.xltx': 'xl',
            '.xltm': 'xl',
            '.docx': 'word',
            '.dotx': 'word',
            '.docm': 'word',
            '.dotm': 'word',
            '.pptx': 'ppt',
            '.pptm': 'ppt',
            '.potx': 'ppt',
            '.potm': 'ppt',
        }[ os.path.splitext( path_zipped_xml )[ 1 ] ]
    embeddings_dir = temp_dir + '\' + subdir + '\embeddings\*.bin'

    result = {}
    for bin_file in list( glob.glob( embeddings_dir ) ):
        result[ bin_file ] = bin_embedding_to_dictionary( bin_file )

    shutil.rmtree( temp_dir )

    return result


def bin_embedding_to_dictionary( bin_file ):
    storage = pythoncom.StgOpenStorage( bin_file, None, win32com.storagecon.STGM_READ | win32com.storagecon.STGM_SHARE_EXCLUSIVE )
    for stastg in storage.EnumElements():
        if stastg[ 0 ] == '1Ole10Native':
            stream = storage.OpenStream( stastg[ 0 ], None, win32com.storagecon.STGM_READ | win32com.storagecon.STGM_SHARE_EXCLUSIVE )

            result = {}
            result[ 'original_filename' ] = '' # original filename in ANSI starts at byte 7 and is null terminated
            stream.Seek( 6, 0 )
            while True:
                ch = stream.Read( 1 )
                if ch == '':
                    break
                result[ 'original_filename' ] += ch

            result[ 'original_filepath' ] = '' # original filepath in ANSI is next and is null terminated
            while True:
                ch = stream.Read( 1 )
                if ch == '':
                    break
                result[ 'original_filepath' ] += ch

            stream.Seek( 4, 1 ) # next 4 bytes is unused

            temporary_filepath_size = 0 # size of the temporary file path in ANSI in little endian
            temporary_filepath_size |= ord( stream.Read( 1 ) ) << 0
            temporary_filepath_size |= ord( stream.Read( 1 ) ) << 8
            temporary_filepath_size |= ord( stream.Read( 1 ) ) << 16
            temporary_filepath_size |= ord( stream.Read( 1 ) ) << 24

            result[ 'temporary_filepath' ] = stream.Read( temporary_filepath_size ) # temporary file path in ANSI

            result[ 'size' ] = 0 # size of the contents in little endian
            result[ 'size' ] |= ord( stream.Read( 1 ) ) << 0
            result[ 'size' ] |= ord( stream.Read( 1 ) ) << 8
            result[ 'size' ] |= ord( stream.Read( 1 ) ) << 16
            result[ 'size' ] |= ord( stream.Read( 1 ) ) << 24

            result[ 'contents' ] = stream.Read( result[ 'size' ] ) # contents

            return result

你可以这样使用它:

objects = read_zipped_xml_bin_embeddings( dir_path + '\test_excel.xlsx' )
obj = objects.values()[ 0 ] # Get first element, or iterate somehow, the keys are the temporary paths
print( 'Original filename: ' + obj[ 'original_filename' ] )
print( 'Original filepath: ' + obj[ 'original_filepath' ] )
print( 'Original filepath: ' + obj[ 'temporary_filepath' ] )
print( 'Contents: ' + obj[ 'contents' ] )

相关文章