Beyond Stack Sampling: C++ Profilers

2021-12-09 00:00:00 qt optimization profiling c++ profiler

The date is 12/02/10. The days before Christmas are dripping away, and I've pretty much hit a major roadblock as a Windows programmer. I've been using AQTime, I've tried Sleepy, Shiny, and Very Sleepy, and as we speak, VTune is installing. I've tried to use the VS2008 profiler, and it's been positively punishing as well as often insensible. I've used the random pause technique. I've examined call trees. I've fired off function traces. But the sad, painful fact of the matter is that the app I'm working with is over a million lines of code, with probably another million lines' worth of third-party apps.

I need better tools. I've read the other topics. I've tried out each profiler listed in each topic. There simply has to be something better than these junky and expensive options, or ludicrous amounts of work for almost no gain. To further complicate matters, our code is heavily threaded and runs a number of Qt event loops, some of which are so fragile that they crash under heavy instrumentation due to timing delays. Don't ask me why we're running multiple event loops. No one can tell me.

Are there any options more along the lines of Valgrind in a Windows environment?
Is there anything better than the long swath of broken tools I've already tried?
Is there anything designed to integrate with Qt, perhaps with a useful display of events in the queue?

A full list of the tools I tried, with the ones that were really useful in italics:

  • AQTime: Rather good! Has some trouble with deep recursion, but the call graph is correct in these cases, and can be used to clear up any confusion you might have. Not a perfect tool, but worth trying out. It might suit your needs, and it certainly was good enough for me most of the time.
  • Random Pause attack in debug mode: Not enough information enough of the time.
    A good tool but not a complete solution.
  • Parallel Studios: The nuclear option. Obtrusive, weird, and crazily powerful. I think you should hit up the 30 day evaluation, and figure out if it's a good fit. It's just darn cool, too.
  • AMD Codeanalyst: Wonderful, easy to use, very crash-prone, but I think that's an environment thing. I'd recommend trying it, as it is free.
  • Luke Stackwalker: Works fine on small projects, but it's a bit trying to get it working on ours. Some good results though, and it definitely replaces Sleepy for my personal tasks.
  • PurifyPlus: No support for Win-x64 environments, most prominently Windows 7. Otherwise excellent. A number of my colleagues in other departments swear by it.
  • VS2008 Profiler: Produces output in the 100+gigs range in function trace mode at the required resolution. On the plus side, produces solid results.
  • GProf: Requires GCC to be even moderately effective.
  • VTune: VTune's W7 support borders on criminal. Otherwise excellent.
  • PIN: I'd need to hack up my own tool, so this is sort of a last resort.
  • Sleepy/Very Sleepy: Useful for smaller apps, but failing me here.
  • EasyProfiler: Not bad if you don't mind a bit of manually injected code to indicate where to instrument.
  • Valgrind: *nix only, but very good when you're in that environment.
  • OProfile: Linux only.
  • Proffy: They shoot wild horses.

Suggested tools that I haven't tried:

  • XPerf:
  • GlowCode:
  • DevPartner:

Notes: Intel environment at the moment. VS2008, Boost libraries. Qt 4+. And the wretched humdinger of them all: Qt/MFC integration via Trolltech.


Now: Almost two weeks later, it looks like my issue is resolved. Thanks to a variety of tools, including almost everything on the list and a couple of my personal tricks, we found the primary bottlenecks. However, I'm going to keep testing, exploring, and trying out new profilers as well as new tech. Why? Because I owe it to you guys, because you guys rock. It does slow the timeline down a little, but I'm still very excited to keep trying out new tools.

Synopsis
Among many other problems, a number of components had recently been switched to the incorrect threading model, causing serious hang-ups due to the fact that the code underneath us was suddenly no longer multithreaded. I can't say more because it violates my NDA, but I can tell you that this would never have been found by casual inspection or even by normal code review. Without profilers, callgraphs, and random pausing in conjunction, we'd still be screaming our fury at the beautiful blue arc of the sky. Thankfully, I work with some of the best hackers I've ever met, and I have access to an amazing 'verse full of great tools and great people.

Gentlefolk, I appreciate this tremendously, and only regret that I don't have enough rep to reward each of you with a bounty. I still think this is an important question to get a better answer to than the ones we've got so far on SO.

As a result, each week for the next three weeks, I'll be putting up the biggest bounty I can afford, and awarding it to the answer with the nicest tool that I think isn't common knowledge. After three weeks, we'll hopefully have accumulated a definitive profile of the profilers, if you'll pardon my punning.

Take-away
Use a profiler. They're good enough for Ritchie, Kernighan, Bentley, and Knuth. I don't care who you think you are. Use a profiler. If the one you've got doesn't work, find another. If you can't find one, code one. If you can't code one, or it's a small hang up, or you're just stuck, use random pausing. If all else fails, hire some grad students to bang out a profiler.


A Longer View
So, I thought it might be nice to write up a bit of a retrospective. I opted to work extensively with Parallel Studios, in part because it is actually built on top of the PIN Tool. Having had academic dealings with some of the researchers involved, I felt that this was probably a mark of some quality. Thankfully, I was right. While the GUI is a bit dreadful, I found IPS to be incredibly useful, though I can't comfortably recommend it for everyone. Critically, there's no obvious way to get line-level hit counts, something that AQT and a number of other profilers provide, and which I've found very useful for examining the rate of branch selection, among other things. On net, I've enjoyed using AQTime as well, and I've found their support to be really responsive. Again, I have to qualify my recommendation: a lot of their features don't work that well, and some of them are downright crash-prone on Win7x64. XPerf also performed admirably, but it is agonizingly slow at the sampling detail required to get good reads on certain kinds of applications.

Right now, I'd have to say that I don't think there's a definitive option for profiling C++ code in a W7x64 environment, but there are certainly options that simply fail to perform any useful service.

Recommended answer

First:

Time sampling profilers are more robust than CPU sampling profilers. I'm not extremely familiar with Windows development tools so I can't say which ones are which. Most profilers are CPU sampling.

A CPU sampling profiler grabs a stack trace every N instructions.
This technique will reveal portions of your code that are CPU bound, which is awesome if that is the bottleneck in your application. It is not so great if your application threads spend most of their time fighting over a mutex.
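To make the mutex case concrete, here is a minimal sketch (not taken from any particular profiler) that times a contended lock two ways: by elapsed wall-clock time and by the CPU time charged to the process. The `Timing` struct and `time_contended_lock` helper are hypothetical names invented for this illustration. It assumes a platform where `std::clock()` reports processor time, as on Linux; MSVC's `std::clock()` reports wall-clock time instead, so `GetProcessTimes` would be needed there.

```cpp
#include <atomic>
#include <cassert>
#include <chrono>
#include <ctime>
#include <mutex>
#include <thread>

// Hypothetical result type for this sketch.
struct Timing {
    double wall_ms;  // elapsed real time spent acquiring the lock
    double cpu_ms;   // processor time charged to the process meanwhile
};

// Hypothetical helper: measure how long it takes to acquire a mutex that
// another thread holds for roughly `hold`.
Timing time_contended_lock(std::chrono::milliseconds hold) {
    assert(hold.count() > 0);

    std::mutex m;
    std::atomic<bool> held{false};

    // The holder grabs the mutex, signals that it has it, then sits on it.
    std::thread holder([&] {
        std::lock_guard<std::mutex> lock(m);
        held.store(true);
        std::this_thread::sleep_for(hold);
    });

    // Wait (outside the timed region) until the mutex is really taken.
    while (!held.load()) {
        std::this_thread::yield();
    }

    const auto wall_start = std::chrono::steady_clock::now();
    const std::clock_t cpu_start = std::clock();
    {
        std::lock_guard<std::mutex> lock(m);  // blocks until the holder is done
    }
    const std::clock_t cpu_end = std::clock();
    const auto wall_end = std::chrono::steady_clock::now();

    holder.join();

    Timing t;
    t.wall_ms = std::chrono::duration<double, std::milli>(wall_end - wall_start).count();
    t.cpu_ms = 1000.0 * static_cast<double>(cpu_end - cpu_start) / CLOCKS_PER_SEC;
    return t;
}
```

Calling `time_contended_lock(std::chrono::milliseconds(200))` should report a wall-clock wait close to the hold time and a CPU time near zero. A profiler that only counts executed instructions has almost nothing to attribute to that wait.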

A time sampling profiler grabs a stack trace every N microseconds.
This technique will zero in on "slow" code, whether the cause is CPU-bound work, blocking I/O, mutex contention, or cache thrashing. In short, whatever piece of code is slowing your application will stand out.

So use a time sampling profiler if at all possible, especially when profiling threaded code.
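The idea of time sampling can be sketched in a few lines. This is a toy, not how any of the tools above work: a real profiler walks the call stack of each thread, whereas this sketch assumes the worker publishes a hand-written region label in a global atomic. `g_region`, `sample_regions`, and `demo_workload` are all hypothetical names made up for the example.

```cpp
#include <atomic>
#include <cassert>
#include <chrono>
#include <map>
#include <string>
#include <thread>

// Label of the region the worker is currently in. This manual instrumentation
// stands in for the stack walk a real sampling profiler would perform.
std::atomic<const char*> g_region{"idle"};

// Hypothetical sampler: run `workload` on the calling thread while a second
// thread wakes up every `period` of wall-clock time and records the label.
std::map<std::string, int> sample_regions(std::chrono::microseconds period,
                                          void (*workload)()) {
    assert(period.count() > 0);

    std::atomic<bool> done{false};
    std::map<std::string, int> hits;  // touched only by the sampler until join()

    std::thread sampler([&] {
        while (!done.load()) {
            ++hits[g_region.load()];
            std::this_thread::sleep_for(period);
        }
    });

    workload();
    done.store(true);
    sampler.join();
    return hits;
}

// Made-up workload: a short burst of arithmetic, then a long blocking wait
// (a sleep standing in for a mutex or I/O wait).
void demo_workload() {
    g_region.store("compute");
    volatile unsigned long sink = 0;
    for (unsigned long i = 0; i < 2000000; ++i) {
        sink = sink + i;
    }

    g_region.store("blocked");
    std::this_thread::sleep_for(std::chrono::milliseconds(200));

    g_region.store("idle");
}
```

Running `sample_regions(std::chrono::microseconds(1000), demo_workload)` should attribute most samples to `"blocked"`, even though that region burns essentially no CPU. That is the property that makes wall-clock sampling useful for threaded code.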

Sampling profilers generate gobs of data. The data is extremely useful, but there is often too much of it to digest easily. A profile data visualizer helps tremendously here. The best tool I've found for profile data visualization is gprof2dot. Don't let the name fool you: it handles all kinds of sampling profiler output (AQTime, Sleepy, XPerf, etc.). Once the visualization has pointed out the offending function(s), jump back to the raw profile data to get better hints about the real cause.

The gprof2dot tool generates a dot graph description that you then feed into a Graphviz tool. The output is basically a call graph with functions color-coded by their impact on the application.

A few hints for getting gprof2dot to generate nice output:

  • I use a --skew of 0.001 on my graphs so I can easily see the hot code paths. Otherwise int main() dominates the graph.
  • If you're doing anything crazy with C++ templates, you'll probably want to add --strip. This is especially true with Boost.
  • I use OProfile to generate my sampling data. To get good output, I need to configure it to load the debug symbols from my third-party and system libraries. Be sure to do the same; otherwise you'll see that the CRT is taking 20% of your application's time when what's really going on is that malloc is trashing the heap and eating up 15%.
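To see why --strip matters for template-heavy code, here is a rough sketch of the kind of name shortening it performs: dropping template argument lists and parameter lists from demangled C++ names so the graph nodes stay readable. This is an illustration only, not gprof2dot's actual implementation. `strip_cpp_name` is a hypothetical helper, it is deliberately naive (it does not special-case names such as `operator<` or `operator()`), and the symbol names in the self-check are made up.

```cpp
#include <cassert>
#include <string>

// Hypothetical helper: remove everything inside <...> and (...) from a
// demangled C++ symbol name, keeping only the qualified function name.
std::string strip_cpp_name(const std::string& name) {
    std::string out;
    int angle = 0;  // nesting depth inside template argument lists
    int paren = 0;  // nesting depth inside parameter lists

    for (char c : name) {
        if (c == '<') { ++angle; continue; }
        if (c == '>') { if (angle > 0) --angle; continue; }
        if (c == '(') { ++paren; continue; }
        if (c == ')') { if (paren > 0) --paren; continue; }
        if (angle == 0 && paren == 0) {
            out += c;  // keep only characters outside any bracketed list
        }
    }
    return out;
}

// Small self-check using made-up symbol names.
void strip_cpp_name_examples() {
    assert(strip_cpp_name("boost::shared_ptr<Foo>::reset(Foo*)") ==
           "boost::shared_ptr::reset");
    assert(strip_cpp_name("std::vector<std::pair<int, int> >::size()") ==
           "std::vector::size");
}
```

With Boost-style names, the unstripped form can run to hundreds of characters per node, which makes the rendered call graph very hard to read.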
