Base64 encoding and decoding a file in chunks with Python 2.7

2022-01-21 00:00:00 python python-2.7 file-io base64

Question

I've read the Python base64 docs and seen examples here on SO and elsewhere, but I'm still having a problem decoding base64 back to the original binary representation.

I'm not getting any exceptions, so I don't think there's a padding or character set issue. I just get a resulting binary file that's smaller than the original binary.

I'm including both the base64 encoding and decoding steps in case there's an issue with either or both steps.

The code must run with Python 2.7.

Below are the scripts that reproduce the problem.


b64_encode.py

#!/usr/bin/env python2.7

#
# b64_encode.py - must run with python 2.7
#               - must process data in chunks to limit memory consumption
#               - base64 data must be JSON compatible, i.e.
#                 use base64 "modern" interface,
#                 not base64.encodestring() which contains linefeeds
#

import sys, base64

def write_base64_file_from_file(src_fname, b64_fname, chunk_size=8192):
    with open(src_fname, 'rb') as fin, open(b64_fname, 'w') as fout:
        while True:
            bin_data = fin.read(chunk_size)
            if not bin_data:
                break
            print 'bin %s data len: %d' % (type(bin_data), len(bin_data))
            b64_data = base64.b64encode(bin_data)
            print 'b64 %s data len: %d' % (type(b64_data), len(b64_data))
            fout.write(b64_data)

if len(sys.argv) != 2:
    print 'usage: %s <bin_fname>' % sys.argv[0]
    sys.exit()

bin_fname = sys.argv[1]
b64_fname = bin_fname + '.b64'

write_base64_file_from_file(bin_fname, b64_fname)


b64_decode.py

#!/usr/bin/env python2.7

#
# b64_decode.py - must run with python 2.7
#               - must process data in chunks to limit memory consumption
#

import os, sys, base64

def write_file_from_base64_file(b64_fname, dst_fname, chunk_size=8192):
    with open(b64_fname, 'r') as fin, open(dst_fname, 'wb') as fout:
        while True:
            b64_data = fin.read(chunk_size)
            if not b64_data:
                break
            print 'b64 %s data len: %d' % (type(b64_data), len(b64_data))
            bin_data = base64.b64decode(b64_data)
            print 'bin %s data len: %d' % (type(bin_data), len(bin_data))
            fout.write(bin_data)

if len(sys.argv) != 2:
    print 'usage: %s <b64_fname>' % sys.argv[0]
    sys.exit()

b64_fname = sys.argv[1]
bin_ext = os.path.splitext(os.path.splitext(b64_fname)[0])[1]
bin_fname = os.path.splitext(b64_fname)[0] + bin_ext

write_file_from_base64_file(b64_fname, bin_fname)


For example, my output for a 19k image file is:

$ ./b64_encode.py img.jpg
bin <type 'str'> data len: 8192
b64 <type 'str'> data len: 10924
bin <type 'str'> data len: 8192
b64 <type 'str'> data len: 10924
bin <type 'str'> data len: 2842
b64 <type 'str'> data len: 3792

$ ./b64_decode.py img.jpg.b64 
b64 <type 'str'> data len: 8192
bin <type 'str'> data len: 6144
b64 <type 'str'> data len: 8192
bin <type 'str'> data len: 2048
b64 <type 'str'> data len: 8192
bin <type 'str'> data len: 4097
b64 <type 'str'> data len: 1064
bin <type 'str'> data len: 796

$ ll
19226 Feb  5 14:24 img.jpg
25640 Mar 29 12:12 img.jpg.b64
13085 Mar 29 12:14 img.jpg.jpg


Solution

You do have a padding problem:

>>> open('pianoavatar.jpg').read(8192).encode('base64')[-5:]
'IIE=\n'

Base64 decoding stops when it encounters the = padding marker. Your second read finds such a marker at the 10924th character of the file.

You need to adjust your chunk size to be divisible by 3 to avoid padding in the middle of your output file. Use a chunk size of 8190, for example.
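
To see why the chunk size matters, here is a minimal sketch that compares chunk-wise encoding with encoding the whole input at once. It is not part of the original scripts: encode_in_chunks is a hypothetical helper and the sample data is made up. It should run on both Python 2.7 and Python 3.

```python
import base64

def encode_in_chunks(data, chunk_size):
    """Base64-encode data one chunk at a time and join the pieces."""
    pieces = []
    for start in range(0, len(data), chunk_size):
        pieces.append(base64.b64encode(data[start:start + chunk_size]))
    return b''.join(pieces)

# 20480 bytes of made-up sample data (works on Python 2.7 and 3)
data = bytes(bytearray(range(256))) * 80
whole = base64.b64encode(data)

aligned = encode_in_chunks(data, 8190)    # 8190 % 3 == 0
unaligned = encode_in_chunks(data, 8192)  # 8192 % 3 == 2

# Chunks that are a multiple of 3 bytes never need padding, so joining
# them gives exactly the encoding of the whole input.
aligned_matches = (aligned == whole)

# With 8192-byte chunks each full chunk ends in '=', so padding shows up
# before the final 4-character group.
padding_mid_stream = b'=' in unaligned[:-4]
```

With the aligned chunk size the joined output is identical to encoding everything in one call, so a decoder never sees = before the end of the file.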

When reading, you need to use a buffer size that's a multiple of 4 to avoid running into alignment issues as well. 8192 would do fine there, but you must ensure this restriction is met in your functions. You'd be better off defaulting to the base64-expanded chunk size for the input chunks: 10920 for an encoding chunk size of 8190 (4 base64 characters for every 3 bytes encoded).
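
The 8190 to 10920 arithmetic can be checked with a small sketch. b64_chunk_size is a hypothetical helper name, not something from the base64 module.

```python
import base64

def b64_chunk_size(bin_chunk_size):
    """Number of base64 characters produced by one aligned binary chunk.

    The binary size is first rounded down to a multiple of 3; every
    3 bytes then become 4 base64 characters.
    """
    aligned = bin_chunk_size - bin_chunk_size % 3
    return aligned // 3 * 4

read_size = b64_chunk_size(8190)

# Cross-check against what b64encode actually produces for 8190 bytes.
encoded_len = len(base64.b64encode(b'\0' * 8190))
```

Here read_size and encoded_len both come out as 10920, and the result is always a multiple of 4, so it is safe to use as the decoder's read size.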

Demo:

>>> write_base64_file_from_file('pianoavatar.jpg', 'test.b64', 8190)
bin <type 'str'> data len: 8190
b64 <type 'str'> data len: 10920
bin <type 'str'> data len: 8190
b64 <type 'str'> data len: 10920
bin <type 'str'> data len: 1976
b64 <type 'str'> data len: 2636

Reading now works just fine, even at your original chunk size of 8192:

>>> write_file_from_base64_file('test.b64', 'test.jpg', 8192)
b64 <type 'str'> data len: 8192
bin <type 'str'> data len: 6144
b64 <type 'str'> data len: 8192
bin <type 'str'> data len: 6144
b64 <type 'str'> data len: 8092
bin <type 'str'> data len: 6068

You can force the buffer size to be aligned in your functions with a simple modulus:

def write_base64_file_from_file(src_fname, b64_fname, chunk_size=8190):
    chunk_size -= chunk_size % 3  # align to multiples of 3
    # ...

def write_file_from_base64_file(b64_fname, dst_fname, chunk_size=10920):
    chunk_size -= chunk_size % 4  # align to multiples of 4
    # ...
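
Filled in, the two functions might look like the sketch below. The loop bodies are taken from the question's scripts with the debug prints dropped. Three things are my own assumptions rather than part of the original: the .b64 file is opened in binary mode so the same code also runs on Python 3, a ValueError guard rejects chunk sizes that would align down to zero, and the round-trip check uses random data in a temporary directory.

```python
import base64
import os
import shutil
import tempfile

def write_base64_file_from_file(src_fname, b64_fname, chunk_size=8190):
    chunk_size -= chunk_size % 3  # align to multiples of 3
    if chunk_size <= 0:
        raise ValueError('chunk_size must be at least 3')
    with open(src_fname, 'rb') as fin, open(b64_fname, 'wb') as fout:
        while True:
            bin_data = fin.read(chunk_size)
            if not bin_data:
                break
            fout.write(base64.b64encode(bin_data))

def write_file_from_base64_file(b64_fname, dst_fname, chunk_size=10920):
    chunk_size -= chunk_size % 4  # align to multiples of 4
    if chunk_size <= 0:
        raise ValueError('chunk_size must be at least 4')
    with open(b64_fname, 'rb') as fin, open(dst_fname, 'wb') as fout:
        while True:
            b64_data = fin.read(chunk_size)
            if not b64_data:
                break
            fout.write(base64.b64decode(b64_data))

# Round-trip check on made-up data the size of the question's image.
tmp_dir = tempfile.mkdtemp()
try:
    src = os.path.join(tmp_dir, 'sample.bin')
    b64 = os.path.join(tmp_dir, 'sample.bin.b64')
    dst = os.path.join(tmp_dir, 'restored.bin')

    original = os.urandom(19226)
    with open(src, 'wb') as f:
        f.write(original)

    # Deliberately pass the unaligned 8192; the modulus rounds it to 8190.
    write_base64_file_from_file(src, b64, 8192)
    write_file_from_base64_file(b64, dst, 8192)

    with open(b64, 'rb') as f:
        b64_text = f.read()
    with open(dst, 'rb') as f:
        restored = f.read()
finally:
    shutil.rmtree(tmp_dir)

round_trip_ok = (restored == original)
```

Because the alignment happens inside the functions, callers can keep passing 8192 and still get a file that decodes back to the original bytes.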
