是否可以比使用 hashmap 更快地将字符串映射到 int?

2022-01-08 00:00:00 algorithm performance hashmap c++

我知道我不应该优化我的程序的每一个地方,所以请把这个问题看作是学术性的"

每个字符串最多有 100 个字符串和整数,如下所示:

MSFT 1戴尔 2惠普 4……美国广播公司 58

这个集合是预初始化的,这意味着一旦创建它就永远不会改变.在 set 初始化后,我使用它非常密集,因此可以快速查找.字符串很短,最多 30 个字符.映射的 int 也是有限的,在 1 到 100 之间.

至少知道字符串是预初始化的并且永远不会改变,应该可以找到"导致一个篮子一个项目"映射的哈希函数,但可能还有其他黑客攻击.

我可以想象的一个优化 - 我只能读取第一个符号.例如,如果DELL"是唯一以D"开头的字符串,而我收到了D***"之类的内容,那么我什至不需要读取该字符串!它显然是戴尔".这种查找必须比hashmap 查找"快得多.(这里我假设我们只收到哈希中的符号,但并非总是如此)

对于我的问题,是否有任何现成或易于实施的解决方案?我正在使用 c++ 和 boost.

upd 我检查并发现我的股票交易限制是 12 个符号,而不是上面提到的 30 个.然而,其他交易所可能允许稍长的符号,因此让算法继续处理长达 20 个字符的长代码是很有趣的.

解决方案

哈希表[1]原则上是最快的方法.

你可以编译一个 完美哈希函数您提前了解完整的域.

有了完美的哈希,就不会发生冲突,所以你可以将哈希表存储在一个线性数组中!

通过适当的调整,您就可以

  • 在有限的空间内放置所有哈希元素,使直接寻址成为一种潜在的选择
  • 在 O(1) 中进行反向查找

生成完美哈希函数的老派"工具是 gperf(1).维基百科列出了有关该主题的更多资源.

<块引用>

因为所有的争论我跑了一个演示:

下载 纳斯达克股票代码 并从该集合中获取 100 个随机样本,应用 gperf 如下:

gperf -e '15' -L C++ -7 -C -E -k '*,1,$' -m 100 选择 >perfha??sh.cpp

产生 157 的哈希值 MAX_HASH_VALUE 和包含尽可能多的项目的 direct 字符串查找表.这是只是用于演示目的的哈希函数:

inline unsigned int Perfect_Hash::hash (register const char *str, register unsigned int len) {静态 const unsigned char asso_values[] = {156, 156, 156, 156, 156, 156, 156, 156, 156, 156,156, 156, 156, 156, 156, 156, 156, 156, 156, 156,156, 156, 156, 156, 156, 156, 156, 156, 156, 156,156, 156, 156, 156, 156, 156, 156, 156, 156, 156,156, 156, 156, 156, 156, 156, 156, 156, 156, 156,156, 156, 156, 156, 156, 156, 156, 156, 156, 156,156、156、156、156、156、64、40、1、62、1、41, 18, 47, 0, 1, 11, 10, 57, 21, 7,14, 13, 24, 3, 33, 89, 11, 0, 19, 5,12、0、156、156、156、156、156、156、156、156、156, 156, 156, 156, 156, 156, 156, 156, 156, 156,156, 156, 156, 156, 156, 156, 156, 156, 156, 156,156、156、156、156、156、156、156、156、156};注册 int hval = len;开关(hval){默认值:hval += asso_values[(unsigned char)str[4]];/*跌倒*/案例 4:hval += asso_values[(unsigned char)str[3]];/*跌倒*/案例 3:hval += asso_values[(unsigned char)str[2]+1];/*跌倒*/案例 2:hval += asso_values[(unsigned char)str[1]];/*跌倒*/案例 1:hval += asso_values[(unsigned char)str[0]];休息;}返回hval;}

它确实没有变得更有效率.看看 github 上的完整源代码:https://gist.github.com/sehe/5433535

请注意,这也是一个完美的哈希,所以不会有没有冲突

<小时><块引用>

Q. [...] 显然是DELL".这种查找必须比hashmap 查找"快得多.

答:如果您使用简单的std::map,最终效果是前缀搜索(因为第一个字符不匹配的字典字符串比较快捷方式).排序容器中的二分查找也是如此.

<小时>

[1] 附言.对于 100 个字符串,具有 std::searchstd::lower_bound 的字符串排序数组可能会因为改进的 参考地点.请查阅您的个人资料结果,看看这是否适用.

I understand that I should not optimize every single spot of my program so please consider this question to be "academic"

I have maximum 100 strings and integer number for each of them, something like that:

MSFT 1
DELL 2
HP   4
....
ABC  58

This set is preinitialized that means that once created it never changes. After set is initialized I use it pretty intensive so it nice to have fast lookup. Strings are pretty short, maximum 30 characters. Mapped int is also limited and between 1 and 100.

At least knowing that strings are preinitialized and never change it should be possible to "find" hash-function that results in "one basket-one item" mapping, but probably there are other hacks.

One optimization I can imagine - i can read first symbol only. For example if "DELL" is the only string starting with "D" and I have received something like "D***" than I do not need even to read the string! it's obviosly "DELL". Such lookup must be significantly faster than "hashmap lookup". (well here I assumed that we receive only symbols that in hash, but it is not always the case)

Are there any ready to use or easy to implement solutions for my problem? i'm using c++ and boost.

upd I've check and found that for my exchange limit for ticker is 12 symbols, not 30 as mentioned above. However other exchanges may allow slighty longer symbols so it's interesting to have algorithm that will continue working on up to 20-charachters long tickers.

解决方案

A hashtable[1] is in principle the fastest way.

You could however compile a Perfect Hash Function given the fact that you know the full domain ahead of time.

With a perfect hash, there need not be a collision, so you can store the hash table in a linear array!

With proper tweaking you can then

  • fit all of the hash elements in a limited space, making direct addressing a potential option
  • have a reverse lookup in O(1)

The 'old-school' tool for generating Perfect Hash functions would be gperf(1). The wikipedia lists more resources on the subject.

Because of all the debate I ran a demo:

Downloading NASDAQ ticker symbols and getting 100 random samples from that set, applying gperf as follows:

gperf -e ' 15' -L C++ -7 -C -E -k '*,1,$' -m 100 selection > perfhash.cpp

Results in a hash-value MAX_HASH_VALUE of 157 and a direct string lookup table of as many items. Here's just the hash function for demonstration purposes:

inline unsigned int Perfect_Hash::hash (register const char *str, register unsigned int len) {
  static const unsigned char asso_values[] = {
      156, 156, 156, 156, 156, 156, 156, 156, 156, 156,
      156, 156, 156, 156, 156, 156, 156, 156, 156, 156,
      156, 156, 156, 156, 156, 156, 156, 156, 156, 156,
      156, 156, 156, 156, 156, 156, 156, 156, 156, 156,
      156, 156, 156, 156, 156, 156, 156, 156, 156, 156,
      156, 156, 156, 156, 156, 156, 156, 156, 156, 156,
      156, 156, 156, 156, 156,  64,  40,   1,  62,   1,
       41,  18,  47,   0,   1,  11,  10,  57,  21,   7,
       14,  13,  24,   3,  33,  89,  11,   0,  19,   5,
       12,   0, 156, 156, 156, 156, 156, 156, 156, 156,
      156, 156, 156, 156, 156, 156, 156, 156, 156, 156,
      156, 156, 156, 156, 156, 156, 156, 156, 156, 156,
      156, 156, 156, 156, 156, 156, 156, 156, 156
    };
  register int hval = len;

  switch (hval) {
      default: hval += asso_values[(unsigned char)str[4]];   /*FALLTHROUGH*/
      case 4:  hval += asso_values[(unsigned char)str[3]];   /*FALLTHROUGH*/
      case 3:  hval += asso_values[(unsigned char)str[2]+1]; /*FALLTHROUGH*/
      case 2:  hval += asso_values[(unsigned char)str[1]];   /*FALLTHROUGH*/
      case 1:  hval += asso_values[(unsigned char)str[0]];   break;
  }
  return hval;
}

It really doesn't get much more efficient. Do have a look at the full source at github: https://gist.github.com/sehe/5433535

Mind you, this is a perfect hash, too, so there will be no collisions


Q. [...] it's obviosly "DELL". Such lookup must be significantly faster than "hashmap lookup".

A: If you use a simple std::map the net effect is prefix search (because lexicographical string comparison shortcuts on the first character mismatch). The same thing goes for binary search in a sorted container.


[1] PS. For 100 strings, a sorted array of string with std::search or std::lower_bound would potentially be as fast/faster due to the improved Locality of Reference. Consult your profile results to see whether this applies.

相关文章