存储哈夫曼树的有效方式

2021-12-08 00:00:00 performance c++ huffman-code

我正在编写一个霍夫曼编码/解码工具,并且正在寻找一种有效的方法来存储为存储在输出文件中而创建的霍夫曼树.

I am writing a Huffman encoding/decoding tool and am looking for an efficient way to store the Huffman tree that is created to store inside of the output file.

目前我正在实施两个不同的版本.

Currently there are two different versions I am implementing.

  1. 这个将整个文件逐个字符读入内存,并为整个文件建立一个频率表.这只需要输出一次树,因此效率不是那么大,除非输入文件很小.
  2. 我使用的另一种方法是读取一块大小约为 64 KB 的数据,然后对其运行频率分析,创建一棵树并对其进行编码.但是,在这种情况下,在每个块之前,我需要输出我的频率树,以便解码器能够重新构建其树并正确解码编码文件.这是效率确实发挥作用的地方,因为我想节省尽可能多的空间.

到目前为止,在我的搜索中,我还没有找到在尽可能小的空间内存储树的好方法,我希望 StackOverflow 社区可以帮助我找到一个好的解决方案!

In my searches so far I have not found a good way of storing the tree in as little space as possible, I am hoping the StackOverflow community can help me find a good solution!

推荐答案

由于您已经需要实现代码来处理字节组织流/文件之上的逐位层,因此这是我的建议.

Since you already have to implement code to handle a bit-wise layer on top of your byte-organized stream/file, here's my proposal.

不要存储实际频率,解码不需要它们.但是,您确实需要实际的树.

Do not store the actual frequencies, they're not needed for decoding. You do, however, need the actual tree.

所以对于每个节点,从根开始:

So for each node, starting at root:

  1. 如果叶子节点:输出 1 位 + N 位字符/字节
  2. 如果不是叶子节点,则输出 0 位.然后以相同的方式编码两个子节点(先左后右)

要阅读,请执行以下操作:

To read, do this:

  1. 读取位.如果为 1,则读取 N 位字符/字节,返回其周围没有子节点的新节点
  2. 如果 bit 为 0,以同样的方式解码左右子节点,并返回带有这些子节点的新节点,但没有值

叶节点基本上是任何没有子节点的节点.

A leaf-node is basically any node that doesn't have children.

使用这种方法,您可以在写入之前计算输出的确切大小,以确定收益是否足以证明付出的努力.这假设您有一个包含每个字符出现频率的键/值对字典,其中频率是实际出现的次数.

With this approach, you can calculate the exact size of your output before writing it, to figure out if the gains are enough to justify the effort. This assumes you have a dictionary of key/value pairs that contains the frequency of each character, where frequency is the actual number of occurrences.

计算伪代码:

Tree-size = 10 * NUMBER_OF_CHARACTERS - 1
Encoded-size = Sum(for each char,freq in table: freq * len(PATH(char)))

树大小计算考虑了叶子节点和非叶子节点,内联节点比字符少一个.

The tree-size calculation takes the leaf and non-leaf nodes into account, and there's one less inline node than there are characters.

SIZE_OF_ONE_CHARACTER 将是位数,这两个将为您提供我对树的方法 + 编码数据将占用的总位数.

SIZE_OF_ONE_CHARACTER would be number of bits, and those two would give you the number of bits total that my approach for the tree + the encoded data will occupy.

PATH(c) 是一个函数/表,它会产生从根到树中那个字符的位路径.

PATH(c) is a function/table that would yield the bit-path from root down to that character in the tree.

这是一个看起来像 C# 的伪代码,它假设一个字符只是一个简单的字节.

Here's a C#-looking pseudo-code to do it, which assumes one character is just a simple byte.

void EncodeNode(Node node, BitWriter writer)
{
    if (node.IsLeafNode)
    {
        writer.WriteBit(1);
        writer.WriteByte(node.Value);
    }
    else
    {
        writer.WriteBit(0);
        EncodeNode(node.LeftChild, writer);
        EncodeNode(node.Right, writer);
    }
}

读回:

Node ReadNode(BitReader reader)
{
    if (reader.ReadBit() == 1)
    {
        return new Node(reader.ReadByte(), null, null);
    }
    else
    {
        Node leftChild = ReadNode(reader);
        Node rightChild = ReadNode(reader);
        return new Node(0, leftChild, rightChild);
    }
}

示例(简化、使用属性等)节点实现:

An example (simplified, use properties, etc.) Node implementation:

public class Node
{
    public Byte Value;
    public Node LeftChild;
    public Node RightChild;

    public Node(Byte value, Node leftChild, Node rightChild)
    {
        Value = value;
        LeftChild = leftChild;
        RightChild = rightChild;
    }

    public Boolean IsLeafNode
    {
        get
        {
            return LeftChild == null;
        }
    }
}

<小时>

这是一个特定示例的示例输出.


Here's a sample output from a specific example.

输入:AAAAAABCCCCCCDDEEEEE

Input: AAAAAABCCCCCCDDEEEEE

频率:

  • 答:6
  • 乙:1
  • C:6
  • D:2
  • E:5

每个字符只有 8 位,因此树的大小将为 10 * 5 - 1 = 49 位.

Each character is just 8 bits, so the size of the tree will be 10 * 5 - 1 = 49 bits.

树看起来像这样:

      20
  ----------
  |        8
  |     -------
 12     |     3
-----   |   -----
A   C   E   B   D
6   6   5   1   2

所以每个字符的路径如下(0左,1右):

So the paths to each character is as follows (0 is left, 1 is right):

  • 答:00
  • 乙:110
  • C:01
  • D:111
  • E:10

所以要计算输出大小:

  • A:6 次出现 * 2 位 = 12 位
  • B:1 次出现 * 3 位 = 3 位
  • C:出现 6 次 * 2 位 = 12 位
  • D:出现 2 次 * 3 位 = 6 位
  • E:出现 5 次 * 2 位 = 10 位

编码字节的总和为 12+3+12+6+10 = 43 位

Sum of encoded bytes is 12+3+12+6+10 = 43 bits

将其添加到树中的 49 位,输出将为 92 位或 12 字节.与存储未编码的原始 20 个字符所需的 20 * 8 个字节相比,您将节省 8 个字节.

Add that to the 49 bits from the tree, and the output will be 92 bits, or 12 bytes. Compare that to the 20 * 8 bytes necessary to store the original 20 characters unencoded, you'll save 8 bytes.

最终输出,包括开始的树,如下所示.流 (A-E) 中的每个字符都被编码为 8 位,而 0 和 1 只是一个位.流中的空间只是将树与编码数据分开,不占用最终输出中的任何空间.

The final output, including the tree to begin with, is as follows. Each character in the stream (A-E) is encoded as 8 bits, whereas 0 and 1 is just a single bit. The space in the stream is just to separate the tree from the encoded data and does not take up any space in the final output.

001A1C01E01B1D 0000000000001100101010101011111111010101010

<小时>

对于你在评论中的具体例子,AABCDEF,你会得到这个:


For the concrete example you have in the comments, AABCDEF, you will get this:

输入:AABCDEF

频率:

  • 答:2
  • 乙:1
  • C:1
  • D:1
  • E:1
  • F:1

树:

        7
  -------------
  |           4
  |       ---------
  3       2       2
-----   -----   -----
A   B   C   D   E   F
2   1   1   1   1   1

路径:

  • 答:00
  • 乙:01
  • C:100
  • D:101
  • E:110
  • F:111

树:001A1B001C1D01E1F = 59 位
数据:000001100101110111 = 18位
总和:59 + 18 = 77 位 = 10 字节

Tree: 001A1B001C1D01E1F = 59 bits
Data: 000001100101110111 = 18 bits
Sum: 59 + 18 = 77 bits = 10 bytes

由于原来是 7 个字符的 8 位 = 56,你会因为这么小的数据块而有太多的开销.

Since the original was 7 characters of 8 bits = 56, you will have too much overhead of such small pieces of data.

相关文章