Heap corruption under Win32; how to locate?

2021-12-14 00:00:00 windows multithreading debugging memory c++

I'm working on a multithreaded C++ application that is corrupting the heap. The usual tools to locate this corruption seem to be inapplicable. Old builds (18 months old) of the source code exhibit the same behaviour as the most recent release, so this has been around for a long time and just wasn't noticed; on the downside, source deltas can't be used to identify when the bug was introduced - there are a lot of code changes in the repository.

The trigger for the crashing behaviour is generating throughput in this system: socket transfer of data which is munged into an internal representation. I have a set of test data that will periodically cause the app to throw an exception (various places, various causes, including heap alloc failing, thus: heap corruption).

The behaviour seems related to CPU power or memory bandwidth; the more of each the machine has, the easier it is to crash. Disabling a hyper-threading core or a dual-core core reduces the rate of (but does not eliminate) corruption. This suggests a timing related issue.

Now here's the rub:
When it's run under a lightweight debug environment (say Visual Studio 98, AKA MSVC6) the heap corruption is reasonably easy to reproduce: ten or fifteen minutes pass before something fails horrendously with an exception, such as a failed alloc. When running under a sophisticated debug environment (Rational Purify, VS2008/MSVC9 or even Microsoft Application Verifier) the system becomes memory-speed bound and doesn't crash (memory-bound: CPU is not getting above 50%, disk light is not on, the program's going as fast as it can, box consuming 1.3 GB of 2 GB of RAM). So, I've got a choice between being able to reproduce the problem (but not identify the cause) and being able to identify the cause of a problem I can't reproduce.

My current best guesses as to where to go next are:

  1. Get an insanely grunty box (to replace the current dev box: 2 GB RAM in an E6550 Core2 Duo); this will make it possible to repro the crash-causing misbehaviour when running under a powerful debug environment; or
  2. Rewrite operators new and delete to use VirtualAlloc and VirtualProtect to mark memory as read-only as soon as it's done with. Run under MSVC6 and have the OS catch the bad-guy who's writing to freed memory. Yes, this is a sign of desperation: who the hell rewrites new and delete?! I wonder if this is going to make it as slow as under Purify et al.
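Option 2 could be sketched roughly as follows. This is a minimal illustration under stated assumptions, not a drop-in allocator: the `guarded` namespace, `kHeaderBytes`, and the `Message` type are hypothetical names, freed pages are marked no-access rather than read-only (so stale reads fault as well as stale writes), and the POSIX branch exists only so the sketch can be compiled and tried off Windows. It is written for a modern compiler; MSVC6 would need adjustments.

```cpp
#include <cstddef>
#include <new>
#ifdef _WIN32
#include <windows.h>
#else
#include <sys/mman.h>
#include <unistd.h>
#endif

namespace guarded {  // hypothetical namespace, for illustration only

// Bytes reserved in front of the user block for bookkeeping; 64 keeps the
// user pointer aligned for any fundamental type.
const std::size_t kHeaderBytes = 64;

inline std::size_t page_size() {
#ifdef _WIN32
    SYSTEM_INFO info;
    GetSystemInfo(&info);
    return info.dwPageSize;
#else
    return static_cast<std::size_t>(sysconf(_SC_PAGESIZE));
#endif
}

// Gives every allocation its own page-aligned region and records the
// region size in the header at the start of that region.
inline void* alloc(std::size_t bytes) {
    const std::size_t page = page_size();
    const std::size_t total = (bytes + kHeaderBytes + page - 1) / page * page;
#ifdef _WIN32
    void* base = VirtualAlloc(NULL, total, MEM_RESERVE | MEM_COMMIT, PAGE_READWRITE);
    if (base == NULL) throw std::bad_alloc();
#else
    void* base = mmap(NULL, total, PROT_READ | PROT_WRITE,
                      MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    if (base == MAP_FAILED) throw std::bad_alloc();
#endif
    *static_cast<std::size_t*>(base) = total;
    return static_cast<char*>(base) + kHeaderBytes;
}

// "Frees" a block by revoking all access instead of returning it to the OS,
// so a later read or write through a stale pointer faults immediately.
inline void release(void* p) {
    if (p == NULL) return;
    char* base = static_cast<char*>(p) - kHeaderBytes;
    const std::size_t total = *reinterpret_cast<std::size_t*>(base);
#ifdef _WIN32
    DWORD old_protection;
    VirtualProtect(base, total, PAGE_NOACCESS, &old_protection);
#else
    mprotect(base, total, PROT_NONE);
#endif
}

}  // namespace guarded

// Hypothetical message type that opts in through class-level operators;
// the plan above would forward the global operators in the same way.
struct Message {
    int payload;
    static void* operator new(std::size_t bytes) { return guarded::alloc(bytes); }
    static void operator delete(void* p) { guarded::release(p); }
};
```

Because freed regions are never handed back, every allocation costs at least one page for the life of the process, so a 32-bit address space would likely run out quickly under real throughput. Restricting the scheme to the suspect classes, as `Message` does here, is one way to keep that manageable.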

And, no: Shipping with Purify instrumentation built in is not an option.

A colleague just walked past and asked "Stack Overflow? Are we getting stack overflows now?!?"

And now, the question: How do I locate the heap corruptor?


Update: balancing new[] and delete[] seems to have gone a long way towards solving the problem. Instead of 15 minutes, the app now runs for about two hours before crashing. Not there yet. Any further suggestions? The heap corruption persists.
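For readers unsure what "balancing" means here, the sketch below shows the pairing rule and two ways to make the mismatch impossible to write. The function names are hypothetical, and `std::unique_ptr` assumes a C++11 compiler, which MSVC6 is not.

```cpp
#include <cstddef>
#include <memory>
#include <vector>

// Raw array: the array form of delete must match the array form of new.
// Writing `delete samples;` here instead would be undefined behaviour and
// is one classic source of heap corruption.
inline long sum_raw(std::size_t count) {
    int* samples = new int[count];
    long total = 0;
    for (std::size_t i = 0; i < count; ++i) {
        samples[i] = static_cast<int>(i);
        total += samples[i];
    }
    delete[] samples;
    return total;
}

// std::unique_ptr<int[]> calls delete[] automatically when it goes out of scope.
inline long sum_owned(std::size_t count) {
    std::unique_ptr<int[]> samples(new int[count]);
    long total = 0;
    for (std::size_t i = 0; i < count; ++i) {
        samples[i] = static_cast<int>(i);
        total += samples[i];
    }
    return total;
}

// std::vector owns its storage, so no explicit new or delete appears at all.
inline long sum_vector(std::size_t count) {
    std::vector<int> samples(count);
    long total = 0;
    for (std::size_t i = 0; i < count; ++i) {
        samples[i] = static_cast<int>(i);
        total += samples[i];
    }
    return total;
}
```

All three compute the same result; the difference is only in who is responsible for choosing the right form of delete.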

Update: a release build under Visual Studio 2008 seems dramatically better; current suspicion rests on the STL implementation that ships with VS98.


> 1. Reproduce the problem. Dr Watson will produce a dump that might be helpful in further analysis.

I'll take a note of that, but I'm concerned that Dr Watson will only be tripped up after the fact, not when the heap is getting stomped on.

> Another try might be using WinDbg as a debugging tool, which is quite powerful while also being lightweight.

Got that going at the moment, again: not much help until something goes wrong. I want to catch the vandal in the act.

> Maybe these tools will at least allow you to narrow the problem down to a certain component.

I don't hold much hope, but desperate times call for...

> And are you sure that all the components of the project have correct runtime library settings (C/C++ tab, Code Generation category in VS 6.0 project settings)?

No I'm not, and I'll spend a couple of hours tomorrow going through the workspace (58 projects in it) and checking they're all compiling and linking with the appropriate flags.


Update: This took 30 seconds. Select all projects in the Settings dialog, unselect until you find the project(s) that don't have the right settings (they all had the right settings).

Solution

My first choice would be a dedicated heap tool such as pageheap.exe.
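The idea behind a full page heap is to place each allocation at the end of a page and follow it with an inaccessible page, so a write past the end of the buffer faults at the offending instruction. The sketch below illustrates that idea only; it is not how pageheap.exe itself is implemented. The `tailguard` namespace is a hypothetical name, blocks are never freed, and the POSIX branch is there only so the sketch can be tried off Windows.

```cpp
#include <cstddef>
#include <new>
#ifdef _WIN32
#include <windows.h>
#else
#include <sys/mman.h>
#include <unistd.h>
#endif

namespace tailguard {  // hypothetical namespace, for illustration only

inline std::size_t page_size() {
#ifdef _WIN32
    SYSTEM_INFO info;
    GetSystemInfo(&info);
    return info.dwPageSize;
#else
    return static_cast<std::size_t>(sysconf(_SC_PAGESIZE));
#endif
}

// Returns a block that ends right before a no-access guard page.
// The size is rounded up to 16 bytes to keep the pointer aligned, so
// overruns of fewer than 16 bytes can land in that slack and go unnoticed.
inline void* alloc(std::size_t bytes) {
    const std::size_t page = page_size();
    const std::size_t rounded = (bytes + 15) / 16 * 16;
    const std::size_t data = (rounded + page - 1) / page * page;  // data pages
    const std::size_t total = data + page;                        // plus guard page
#ifdef _WIN32
    char* base = static_cast<char*>(
        VirtualAlloc(NULL, total, MEM_RESERVE | MEM_COMMIT, PAGE_READWRITE));
    if (base == NULL) throw std::bad_alloc();
    DWORD old_protection;
    VirtualProtect(base + data, page, PAGE_NOACCESS, &old_protection);
#else
    void* mapping = mmap(NULL, total, PROT_READ | PROT_WRITE,
                         MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    if (mapping == MAP_FAILED) throw std::bad_alloc();
    char* base = static_cast<char*>(mapping);
    mprotect(base + data, page, PROT_NONE);
#endif
    // Slide the block so that its rounded end touches the guard page.
    return base + data - rounded;
}

}  // namespace tailguard
```

The cost is at least two pages per allocation, which is probably part of why tools built on this approach slow a memory-hungry application down so much.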

Rewriting new and delete might be useful, but that doesn't catch the allocs committed by lower-level code. If this is what you want, better to Detour the low-level alloc APIs using Microsoft Detours.

Also run some sanity checks: verify that your run-time libraries match (release vs. debug, multi-threaded vs. single-threaded, DLL vs. static lib), look for bad deletes (e.g., delete where delete [] should have been used), and make sure you're not mixing and matching your allocs.

Also try selectively turning off threads and see when/if the problem goes away.

What does the call stack etc look like at the time of the first exception?
