Stream decoding of Base64 data

2022-01-21 00:00:00 hadoop base64 java gzipinputstream

I have some large Base64-encoded data, stored in Snappy files in the Hadoop filesystem. The data was originally gzipped text. I need to read this encoded data in chunks, decode it, and flush it to a GZIPOutputStream.

Any ideas on how I could do this without loading all of the Base64 data into an array and calling Base64.decodeBase64(byte[])?

Am I right to read characters up to the '\r\n' delimiter and decode line by line? For example:

for (int i = 0; i < byteData.length; i++) {
    if (byteData[i] == CARRIAGE_RETURN || byteData[i] == NEWLINE) {
       if (i < byteData.length - 1 && byteData[i + 1] == NEWLINE)
            i += 2;
       else 
            i += 1;

       byteBuffer.put(Base64.decodeBase64(record));

       byteCounter = 0;
       record = new byte[8192];
    } else {
        record[byteCounter++] = byteData[i];
    }
}

Sadly, this approach doesn't produce any human-readable output. Ideally, I would like to stream the data in, decode it, and stream it out.

Right now, I'm trying to wrap the decoded bytes in an input stream and then copy that to a GZIPOutputStream:

byteBuffer.get(bufferBytes);

InputStream inputStream = new ByteArrayInputStream(bufferBytes);
inputStream = new GZIPInputStream(inputStream);
IOUtils.copy(inputStream , gzipOutputStream);

This gives me a java.io.IOException: Corrupt GZIP trailer.

Recommended answer

Let's take it one step at a time:

  1. You need a GZIPInputStream to read the gzipped data (not a GZIPOutputStream; the output stream is for compressing data). With this stream you can read the uncompressed, original binary data. Its constructor requires an InputStream.

  2. You need an input stream capable of reading the Base64-encoded data. I suggest the handy Base64InputStream from apache-commons-codec. Its constructor lets you set the line length and the line separator, and doEncode=false makes it decode. This in turn requires another input stream: the raw, Base64-encoded data.

  3. This stream depends on how you get your data. Ideally the data is already available as an InputStream, and the problem is solved. If not, you may have to use a ByteArrayInputStream (if binary), a StringBufferInputStream (if a string; note that this class is deprecated, and a ByteArrayInputStream over the string's bytes also works), etc.
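For data that is already in memory, the wrapping step can be sketched as follows. This is only an illustration: the class and method names are hypothetical, and it assumes the Base64 text contains nothing but ASCII characters, which holds for standard Base64.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;

// Hypothetical helper: turns in-memory Base64 data into an InputStream.
final class InMemorySources {

    // Binary case: the Base64 characters are already held as bytes.
    static InputStream fromBytes(byte[] base64Bytes) {
        return new ByteArrayInputStream(base64Bytes);
    }

    // String case: Base64 uses only ASCII characters, so US_ASCII loses nothing.
    static InputStream fromString(String base64Text) {
        return new ByteArrayInputStream(base64Text.getBytes(StandardCharsets.US_ASCII));
    }

    public static void main(String[] args) throws IOException {
        InputStream in = fromString("aGVsbG8=");
        System.out.println((char) in.read()); // first encoded character
    }
}
```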

The logic is roughly:

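import java.io.InputStream;
import java.util.zip.GZIPInputStream;
import org.apache.commons.codec.binary.Base64InputStream;

InputStream fromHadoop = ...;                                  // 3rd paragraph
Base64InputStream b64is =                                      // 2nd paragraph
    new Base64InputStream(fromHadoop, false, 80, "\r\n".getBytes("UTF-8"));
GZIPInputStream zis = new GZIPInputStream(b64is);              // 1st paragraph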

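Please pay attention to the arguments of Base64InputStream (the line length and the end-of-line byte array); you may need to adjust them. As far as the commons-codec Javadoc describes it, both only matter when encoding and are ignored when doEncode is false, so for decoding the exact values should not be critical.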
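As a self-contained illustration of the same chain, here is a minimal sketch that uses only the JDK. It is not the code from the answer: java.util.Base64.getMimeDecoder().wrap(...) stands in for commons-codec's Base64InputStream, and a ByteArrayInputStream stands in for the Hadoop stream. The class and method names are hypothetical, and the sample data is generated inside the sketch.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.OutputStream;
import java.nio.charset.StandardCharsets;
import java.util.Base64;
import java.util.zip.GZIPInputStream;
import java.util.zip.GZIPOutputStream;

// Hypothetical example class; the names are illustrative only.
final class StreamingBase64Gunzip {

    // Decodes Base64 from 'encoded', gunzips it, and copies the plain bytes to 'out'.
    // Returns the number of uncompressed bytes written.
    static long decodeAndInflate(InputStream encoded, OutputStream out) throws IOException {
        // The MIME decoder skips line separators and other non-alphabet characters.
        try (InputStream b64 = Base64.getMimeDecoder().wrap(encoded);
             InputStream gunzipped = new GZIPInputStream(b64)) {
            byte[] buffer = new byte[8192];
            long total = 0;
            int n;
            while ((n = gunzipped.read(buffer)) != -1) {
                out.write(buffer, 0, n);
                total += n;
            }
            return total;
        }
    }

    // Builds sample input: gzip the text, then Base64-encode it with CRLF line breaks.
    static byte[] gzipThenBase64(String text) throws IOException {
        ByteArrayOutputStream compressed = new ByteArrayOutputStream();
        try (GZIPOutputStream gz = new GZIPOutputStream(compressed)) {
            gz.write(text.getBytes(StandardCharsets.UTF_8));
        }
        return Base64.getMimeEncoder().encode(compressed.toByteArray());
    }

    public static void main(String[] args) throws IOException {
        StringBuilder sample = new StringBuilder();
        for (int i = 0; i < 1000; i++) {
            sample.append("hello, streaming world\n");
        }
        byte[] encoded = gzipThenBase64(sample.toString());

        ByteArrayOutputStream plain = new ByteArrayOutputStream();
        long count = decodeAndInflate(new ByteArrayInputStream(encoded), plain);
        System.out.println(count + " bytes decoded");
    }
}
```

Because the copy loop only ever holds an 8 KB buffer, the full Base64 payload never has to sit in memory. In real use, the ByteArrayInputStream would be replaced by whatever stream the Hadoop/Snappy reader provides.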
