Capturing function exit times with __gnu_mcount_nc
I'm trying to do some performance profiling on a poorly supported prototype embedded platform.
I note that GCC's -pg flag causes thunks to __gnu_mcount_nc to be inserted on entry to every function. No implementation of __gnu_mcount_nc is available (and the vendor is not interested in assisting); however, as it is trivial to write one that simply records the stack frame and current cycle count, I have done so. This works fine and is yielding useful results in terms of caller/callee graphs and most frequently called functions.
I would really like to obtain information about the time spent in function bodies as well. However, I am having difficulty understanding how to approach this with only the entry, but not the exit, of each function being hooked: you can tell exactly when each function is entered, but without hooking the exit points you cannot know how much of the time until the next recorded event to attribute to the callee and how much to its callers.
Nevertheless, the GNU profiling tools are in fact demonstrably able to gather runtime information for functions on many platforms, so presumably the developers have some scheme in mind for achieving this.
I have seen some existing implementations that do things like maintain a shadow callstack and twiddle the return address on entry to __gnu_mcount_nc so that __gnu_mcount_nc will get invoked again when the callee returns; it can then match the caller/callee/sp triplet against the top of the shadow callstack and so distinguish this case from the call on entry, record the exit time and correctly return to the caller.
This approach leaves much to be desired:
- It seems likely to be fragile in the presence of recursion and of libraries compiled without the -pg flag.
- It seems hard to implement with low overhead, or at all, in embedded multithreaded/multicore environments where toolchain TLS support is lacking and the current thread ID may be expensive or complicated to obtain.
Is there some obvious better way to implement a __gnu_mcount_nc so that a -pg build is able to capture function exit as well as entry time that I am missing?
Recommended Answer
gprof does not use that function for timing of entry or exit, but for counting calls from function A to any function B. Rather, it uses the self-time gathered by counting PC samples in each routine, and then uses the function-to-function call counts to estimate how much of that self-time should be charged back to callers.
For example, if A calls C 10 times, and B calls C 20 times, and C has 1000 ms of self-time (i.e. 100 PC samples), then gprof knows C has been called 30 times, and 33 of the samples can be charged to A, while the other 67 can be charged to B. Similarly, sample counts propagate up the call hierarchy.
So you see, it doesn't time function entry and exit. The measurements it does get are very coarse, because it makes no distinction between short calls and long calls. Also, if a PC sample happens during I/O or in a library routine that is not compiled with -pg, it is not counted at all. And, as you noted, it is very brittle in the presence of recursion, and can introduce notable overhead on short functions.
Another approach is stack-sampling, rather than PC-sampling. Granted, it is more expensive to capture a stack sample than a PC-sample, but fewer samples are needed. If, for example, a function, line of code, or any description you want to make, is evident on fraction F out of the total of N samples, then you know that the fraction of time it costs is F, with a standard deviation of sqrt(NF(1-F)). So, for example, if you take 100 samples, and a line of code appears on 50 of them, then you can estimate the line costs 50% of the time, with an uncertainty of sqrt(100*.5*.5) = +/- 5 samples or between 45% and 55%. If you take 100 times as many samples, you can reduce the uncertainty by a factor of 10. (Recursion doesn't matter. If a function or line of code appears 3 times in a single sample, that counts as 1 sample, not 3. Nor does it matter if function calls are short - if they are called enough times to cost a significant fraction, they will be caught.)
Notice, when you're looking for things you can fix to get speedup, the exact percent doesn't matter. The important thing is to find it. (In fact, you only need see a problem twice to know it is big enough to fix.)
That is the essence of the technique.
P.S. Don't get suckered into call-graphs, hot-paths, or hot-spots. Here's a typical call-graph rat's nest. Yellow is the hot-path, and red is the hot-spot.
And this shows how easy it is for a juicy speedup opportunity to be in none of those places:
The most valuable thing to look at is a dozen or so random raw stack samples, and relating them to the source code. (That means bypassing the back-end of the profiler.)
ADDED: Just to show what I mean, I simulated ten stack samples from the call graph above, and here's what I found:
- 3/10 samples are calling class_exists, one for the purpose of getting the class name, and two for the purpose of setting up a local configuration. class_exists calls autoload, which calls requireFile, and two of those call adminpanel. If this could be done more directly, it could save about 30%.
- 2/10 samples are calling determineId, which calls fetch_the_id, which calls getPageAndRootlineWithDomain, which calls three more levels, terminating in sql_fetch_assoc. That seems like a lot of trouble to go through to get an ID, and it's costing about 20% of the time, and that's not counting I/O.
So the stack samples don't just tell you how much inclusive time a function or line of code costs, they tell you why it's being done, and what possible silliness it takes to accomplish it. I often see this - galloping generality - swatting flies with hammers, not intentionally, but just following good modular design.
ADDED: Another thing not to get sucked into is flame graphs. For example, here is a flame graph (rotated right 90 degrees) of the ten simulated stack samples from the call graph above. The routines are all numbered, rather than named, but each routine has its own color. Notice that the problem we identified above, with class_exists (routine 219) being on 30% of the samples, is not at all obvious by looking at the flame graph. More samples and different colors would make the graph look more "flame-like", but would not expose routines which take a lot of time by being called many times from different places.
Here's the same data sorted by function rather than by time.
That helps a little, but doesn't aggregate similar routines called from different places:
Once again, the goal is to find the problems that are hiding from you.
Anyone can find the easy stuff, but the problems that are hiding are the ones that make all the difference.
ADDED: Another kind of eye-candy is this one:
where the black-outlined routines could all be the same, just called from different places.
The diagram doesn't aggregate them for you.
If a routine has high inclusive percent by being called a large number of times from different places, it will not be exposed.