How to store binary data when you only care about speed?

2022-01-09 00:00:00 data-structures binary performance c++ stl

I have N points in D dimensions, where, say, N is 1 million and D is 100. All my points have binary coordinates, i.e. they lie in {0, 1}^D, and I am only interested in speed.

Currently my implementation uses std::vector<int>. I am wondering whether changing my data structure would make execution faster. I only do insertions and searches (I never change the bits).

All the related questions I found mention std::vector<char>, std::vector<bool> and std::bitset, but they all focus on the space benefits of using such structures.

What is the appropriate data structure for binary data in C++ when speed is the main concern?

I intend to populate my data structure with the binary data and then do a lot of contiguous searches (I mean that I don't really care about the i-th coordinate of a point; if I access a point, I will access all of its coordinates in sequence). I will compute the Hamming distances between the points.

Recommended answer

Locality of reference will likely be the driving force. So it's fairly obvious that you represent the D coordinates of a single point as a contiguous bitvector. std::bitset<D> would be a logical choice.
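To make the suggestion concrete, here is a minimal sketch assuming D is known at compile time (100, as in the question). The names Point, PointSet, hamming and make_point are made up for illustration and do not come from the question or the answer. The Hamming distance falls out of an XOR followed by a bit count, both of which std::bitset provides.

```cpp
#include <bitset>
#include <cassert>
#include <cstddef>
#include <string>
#include <vector>

constexpr std::size_t D = 100;        // number of dimensions, as in the question
using Point = std::bitset<D>;         // one point: D bits packed into machine words
using PointSet = std::vector<Point>;  // all points stored contiguously

// Hamming distance: XOR leaves a 1 exactly where the two points differ,
// and count() tallies those bits.
inline std::size_t hamming(const Point& a, const Point& b) {
    return (a ^ b).count();
}

// Hypothetical helper: build a point from a string of '0'/'1' characters.
// The bitset string constructor maps the last character to bit 0 and
// throws std::invalid_argument on any other character.
inline Point make_point(const std::string& bits) {
    assert(bits.size() <= D);
    return Point(bits);
}
```

On common 64-bit implementations a std::bitset<100> occupies 16 bytes, compared with 400 bytes for 100 ints, so far more points fit in each cache line. The exact size is implementation-defined, so check sizeof(Point) on your platform.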

However, the next important thing to realize is that locality benefits easily extend to blocks of up to 4KB. This means that you should not pick a single point and compare it against all other N-1 points. Instead, group the points into blocks of 4KB each, and compare those blocks against each other. Both ways are O(N*N), but the second will be much faster.
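The block-against-block traversal could be sketched as follows. This is an illustration under assumptions, not the answerer's code: the 4KB block size is taken from the text above, and count_close_pairs with its radius parameter is a hypothetical task chosen only so that the loops have something to compute.

```cpp
#include <algorithm>
#include <bitset>
#include <cstddef>
#include <vector>

constexpr std::size_t D = 100;
using Point = std::bitset<D>;

// Assumption: aim for roughly 4 KB of points per block.
constexpr std::size_t kBlockBytes = 4096;
constexpr std::size_t kBlock = kBlockBytes / sizeof(Point);
static_assert(kBlock > 0, "a single point must fit in one block");

// Count unordered pairs (i < j) whose Hamming distance is at most `radius`,
// comparing block against block instead of one point against all others.
std::size_t count_close_pairs(const std::vector<Point>& pts, std::size_t radius) {
    const std::size_t n = pts.size();
    std::size_t close = 0;
    for (std::size_t bi = 0; bi < n; bi += kBlock) {
        const std::size_t bi_end = std::min(bi + kBlock, n);
        for (std::size_t bj = bi; bj < n; bj += kBlock) {
            const std::size_t bj_end = std::min(bj + kBlock, n);
            // Both blocks are small enough to stay in cache while compared.
            for (std::size_t i = bi; i < bi_end; ++i) {
                // In the diagonal block, start past i to avoid double counting.
                const std::size_t j0 = (bj == bi) ? i + 1 : bj;
                for (std::size_t j = j0; j < bj_end; ++j) {
                    if ((pts[i] ^ pts[j]).count() <= radius) ++close;
                }
            }
        }
    }
    return close;
}
```

The number of distance computations is the same as in the naive double loop; only the visiting order changes, so that each block is reused many times while it is still in cache. The best block size depends on the machine and is worth measuring.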

You may be able to beat O(N*N) by using the triangle inequality: Hamming(a,b) + Hamming(b,c) >= Hamming(a,c). I'm just wondering how. It probably depends on how you want your output. The naive output would be an N*N set of distances, and that is unavoidably O(N*N).
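One common way the triangle inequality gets used, which is not necessarily what the answer has in mind, is pivot-based pruning for range queries: store each point's distance to a fixed pivot, and skip any point whose pivot distance differs from the query's by more than the search radius. The sketch below assumes the desired output is "all points within radius r of a query" rather than the full distance matrix; PivotIndex, build_index and range_query are hypothetical names.

```cpp
#include <bitset>
#include <cstddef>
#include <vector>

constexpr std::size_t D = 100;
using Point = std::bitset<D>;

inline std::size_t hamming(const Point& a, const Point& b) {
    return (a ^ b).count();
}

// Hypothetical index: distance from every point to one chosen pivot.
struct PivotIndex {
    Point pivot;
    std::vector<std::size_t> to_pivot;  // to_pivot[i] == hamming(pts[i], pivot)
};

PivotIndex build_index(const std::vector<Point>& pts, const Point& pivot) {
    PivotIndex idx{pivot, {}};
    idx.to_pivot.reserve(pts.size());
    for (const Point& p : pts) idx.to_pivot.push_back(hamming(p, pivot));
    return idx;
}

// Return the indices of all points within `radius` of `q`.
std::vector<std::size_t> range_query(const std::vector<Point>& pts,
                                     const PivotIndex& idx,
                                     const Point& q, std::size_t radius) {
    const std::size_t dq = hamming(q, idx.pivot);
    std::vector<std::size_t> hits;
    for (std::size_t i = 0; i < pts.size(); ++i) {
        const std::size_t dp = idx.to_pivot[i];
        const std::size_t gap = dq > dp ? dq - dp : dp - dq;
        // Triangle inequality: hamming(q, pts[i]) >= |dq - dp|,
        // so a gap above the radius rules this point out without touching it.
        if (gap > radius) continue;
        if (hamming(q, pts[i]) <= radius) hits.push_back(i);
    }
    return hits;
}
```

This still scans the precomputed distances, so the worst case is unchanged. How much work it saves depends on the data, the pivot and the radius, and it does not help at all if every pairwise distance has to be reported.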
