C++ string memory management

2022-01-12 00:00:00 memory-management c++ mfc

Last week I wrote a few lines of code in C# to load a large text file (300,000 lines) into a Dictionary. It took ten minutes to write and it executed in less than a second.

Now I'm converting that piece of code to C++ (because I need it in an old C++ COM object). I've spent two days on it so far. :-( Although the productivity difference is shocking on its own, it's the performance that I need some advice on.

It takes seven seconds to load, and even worse: it takes exactly as long to free all the CStringWs afterwards. This is not acceptable, and I must find a way to improve the performance.

Is there any chance that I can allocate this many strings without seeing this horrible performance degradation?

My guess right now is that I'll have to stuff all the text into one large array, let my hash table point to the beginning of each string within that array, and drop the CStringW stuff.
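For illustration, the idea above could be sketched roughly as follows. This is not the code from the question: it assumes standard C++17 (`std::wstring_view`) in place of CStringW, a hypothetical `key<TAB>value` line format, and a made-up class name `FlatDictionary`. The point is only that the text lives in one allocation, so loading makes no per-string allocations and freeing releases one buffer plus the hash table nodes.

```cpp
#include <cstddef>
#include <optional>
#include <string>
#include <string_view>
#include <unordered_map>
#include <utility>

// Hypothetical class (not from the question): owns the whole text in a single
// buffer and indexes it with views that point into that buffer.
class FlatDictionary {
public:
    // Assumption: each line has the form "key<TAB>value", lines end in '\n'.
    explicit FlatDictionary(std::wstring text) : buffer_(std::move(text)) {
        std::wstring_view rest(buffer_);
        while (!rest.empty()) {
            const std::size_t eol = rest.find(L'\n');
            const std::wstring_view line = rest.substr(0, eol);
            rest = (eol == std::wstring_view::npos) ? std::wstring_view{}
                                                    : rest.substr(eol + 1);
            const std::size_t tab = line.find(L'\t');
            if (tab == std::wstring_view::npos) {
                continue;  // skip lines without a separator
            }
            // Both views refer to buffer_; no characters are copied.
            index_.emplace(line.substr(0, tab), line.substr(tab + 1));
        }
    }

    // The views point into buffer_, so copying or moving the object could
    // leave them dangling; forbid both to keep the sketch safe.
    FlatDictionary(const FlatDictionary&) = delete;
    FlatDictionary& operator=(const FlatDictionary&) = delete;

    std::optional<std::wstring_view> find(std::wstring_view key) const {
        const auto it = index_.find(key);
        if (it == index_.end()) {
            return std::nullopt;
        }
        return it->second;
    }

    std::size_t size() const { return index_.size(); }

private:
    std::wstring buffer_;  // declared first so it is initialized before index_
    std::unordered_map<std::wstring_view, std::wstring_view> index_;
};
```

The hash table still allocates one node per entry, so this removes the string allocations rather than all allocations. Calling `index_.reserve(...)` with an estimated line count before the loop would additionally avoid rehashing during the load.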

But before that, any advice from you C++ experts out there?

EDIT: My answer to myself is given below. I realized that it is the fastest route for me, and also a step in what I consider the right direction: towards more managed code.

Recommended answer

You are stepping into the shoes of Raymond Chen. He did the exact same thing, writing a Chinese dictionary in unmanaged C++. Rico Mariani did too, writing it in C#. Mr. Mariani made one version. Mr. Chen wrote 6 versions, trying to match the performance of Mariani's version. He pretty much rewrote significant chunks of the C/C++ runtime library to get there.

Managed code got a lot more respect after that. The GC allocator is impossible to beat. Check this blog post for the links. This blog post might interest you too; it is instructive to see how the STL's value semantics are part of the problem.
