Stream decoding of Base64 data
I have some large Base64-encoded data (stored in snappy files in the Hadoop filesystem). This data was originally gzipped text data. I need to be able to read chunks of this encoded data, decode them, and then flush them to a GZIPOutputStream.
Any ideas on how I could do this instead of loading the whole Base64 data into an array and calling Base64.decodeBase64(byte[])?
Am I right to read the characters up to each line delimiter and decode the data line by line? For example:
for (int i = 0; i < byteData.length; i++) {
    if (byteData[i] == CARRIAGE_RETURN || byteData[i] == NEWLINE) {
        if (i < byteData.length - 1 && byteData[i + 1] == NEWLINE)
            i += 2;
        else
            i += 1;
        byteBuffer.put(Base64.decodeBase64(record));
        byteCounter = 0;
        record = new byte[8192];
    } else {
        record[byteCounter++] = byteData[i];
    }
}
Sadly, this approach doesn't give any human-readable output. Ideally, I would like to read the data as a stream, decode it, and stream it out.
Right now, I'm trying to wrap the decoded bytes in an InputStream and then copy that to a GZIPOutputStream:
byteBuffer.get(bufferBytes);
InputStream inputStream = new ByteArrayInputStream(bufferBytes);
inputStream = new GZIPInputStream(inputStream);
IOUtils.copy(inputStream , gzipOutputStream);
This gives me a java.io.IOException: Corrupt GZIP trailer
Recommended answer
Let's take it step by step:
You need a GZIPInputStream to read zipped data (that, and not a GZIPOutputStream; the output stream is used to compress data). With this stream you will be able to read the uncompressed, original binary data. It requires an InputStream in its constructor.
You need an input stream capable of reading the Base64-encoded data. I suggest the handy Base64InputStream from apache-commons-codec. With the constructor you can set the line length and the line separator, and set doEncode=false to decode data. This in turn requires another input stream: the raw, Base64-encoded data.
This stream depends on how you get your data; ideally the data is already available as an InputStream, and the problem is solved. If not, you may have to use a ByteArrayInputStream (if binary), a StringBufferInputStream (if string), etc.
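As a small illustration of the "if not" case, here is a minimal sketch of wrapping Base64 text that is already in memory. The class and method names (SourceStreams, fromBytes, fromString) are made up for this example. For the String case it uses a ByteArrayInputStream over the String's bytes rather than StringBufferInputStream, since the latter has been deprecated in the JDK for a long time; Base64 text is plain ASCII, so the conversion should be lossless.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;

// Hypothetical helper class, not part of any library.
class SourceStreams {

    // Wraps Base64 text that is already held as raw bytes.
    static InputStream fromBytes(byte[] base64Bytes) {
        return new ByteArrayInputStream(base64Bytes);
    }

    // Wraps Base64 text held as a String. The Base64 alphabet, padding and
    // line breaks are all ASCII, so US-ASCII encoding loses nothing.
    static InputStream fromString(String base64Text) {
        return new ByteArrayInputStream(base64Text.getBytes(StandardCharsets.US_ASCII));
    }

    public static void main(String[] args) throws IOException {
        try (InputStream in = fromString("aGVsbG8=")) {
            // Number of encoded bytes waiting to be read
            System.out.println(in.available());
        }
    }
}
```

For data of the size described in the question, a real stream from the filesystem is preferable, because both helpers above still keep the whole encoded payload in memory.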
Roughly, the logic is:
InputStream fromHadoop = ...; // 3rd paragraph
Base64InputStream b64is = // 2nd paragraph
    new Base64InputStream(fromHadoop, false, 80, "\n".getBytes("UTF-8"));
GZIPInputStream zis = new GZIPInputStream(b64is); // 1st paragraph
Please pay attention to the arguments of Base64InputStream (line length and end-of-line byte array); you may need to tweak them.
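To make the chain concrete without any external dependency, here is a minimal, self-contained sketch of the same idea. Note the swap: it uses the JDK's own java.util.Base64 MIME decoder (Java 8+) in place of commons-codec's Base64InputStream, on the assumption that the encoded data uses the standard alphabet with optional line breaks. The class and method names (StreamingBase64Gunzip, decodeAndInflate, gzipThenBase64) are invented for the example, and the in-memory round trip in main only stands in for the Hadoop stream.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.OutputStream;
import java.nio.charset.StandardCharsets;
import java.util.Base64;
import java.util.zip.GZIPInputStream;
import java.util.zip.GZIPOutputStream;

// Hypothetical example class, not part of any library.
class StreamingBase64Gunzip {

    // Reads Base64 text from `encoded`, decodes and gunzips it on the fly,
    // and copies the plain bytes to `out` one buffer at a time.
    // Returns the number of uncompressed bytes written.
    // Closing the wrappers also closes `encoded`; `out` is left open.
    static long decodeAndInflate(InputStream encoded, OutputStream out) throws IOException {
        long total = 0;
        // The MIME decoder skips line separators and other non-Base64 characters.
        try (InputStream b64 = Base64.getMimeDecoder().wrap(encoded);
             InputStream gz = new GZIPInputStream(b64)) {
            byte[] buffer = new byte[8192];
            int n;
            while ((n = gz.read(buffer)) != -1) {
                out.write(buffer, 0, n);
                total += n;
            }
        }
        return total;
    }

    // Demo helper: gzips the text, then Base64-encodes it with MIME line
    // breaks, mimicking the kind of data described in the question.
    static byte[] gzipThenBase64(String text) throws IOException {
        ByteArrayOutputStream raw = new ByteArrayOutputStream();
        try (GZIPOutputStream gz = new GZIPOutputStream(raw)) {
            gz.write(text.getBytes(StandardCharsets.UTF_8));
        }
        return Base64.getMimeEncoder().encode(raw.toByteArray());
    }

    public static void main(String[] args) throws IOException {
        byte[] encoded = gzipThenBase64("hello, streamed world\n");
        ByteArrayOutputStream plain = new ByteArrayOutputStream();
        decodeAndInflate(new ByteArrayInputStream(encoded), plain);
        System.out.print(new String(plain.toByteArray(), StandardCharsets.UTF_8));
    }
}
```

Only one 8 KB buffer of plain data is held at a time, so the full payload never has to sit in a single array. With commons-codec on the classpath, the Base64.getMimeDecoder().wrap(encoded) call could presumably be replaced by new Base64InputStream(encoded), whose single-argument constructor decodes by default.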