Optimizing member variable order in C++

2021-12-20 00:00:00 performance optimization embedded c++

I was reading a blog post by a game coder for Introversion and he is busily trying to squeeze every CPU tick he can out of the code. One trick he mentions off-hand is to

"re-order the member variables of a class into most used and least used."

I'm not familiar with C++, nor with how it compiles, but I was wondering if

  1. Is this statement accurate?
  2. How/why?
  3. Does it apply to other (compiled/scripting) languages?

I'm aware that the amount of (CPU) time saved by this trick would be minimal, it's not a deal-breaker. But on the other hand, in most functions it would be fairly easy to identify which variables are going to be the most commonly used, and just start coding this way by default.

Recommended answer

Two issues here:

  • Whether and when keeping certain fields together is an optimization.
  • How to actually do it.

The reason that it might help, is that memory is loaded into the CPU cache in chunks called "cache lines". This takes time, and generally speaking the more cache lines loaded for your object, the longer it takes. Also, the more other stuff gets thrown out of the cache to make room, which slows down other code in an unpredictable way.

The size of a cache line depends on the processor. If it is large compared with the size of your objects, then very few objects are going to span a cache line boundary, so the whole optimization is pretty irrelevant. Otherwise, you might get away with sometimes only having part of your object in cache, and the rest in main memory (or L2 cache, perhaps). It's a good thing if your most common operations (the ones which access the commonly-used fields) use as little cache as possible for the object, so grouping those fields together gives you a better chance of this happening.
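As a rough sketch of the grouping idea, here is a made-up `Particle` type with its frequently-used fields declared first. The type, its field names, and the 64-byte cache line are all assumptions for illustration, not anything from the blog post. The check only says the hot fields are small enough to fit in one line; whether they actually land in a single line also depends on where the object starts in memory.

```cpp
#include <cstddef>
#include <cstdint>

// Assumed cache line size. 64 bytes is common, but it is processor-specific.
constexpr std::size_t kAssumedCacheLine = 64;

// Hypothetical game object: fields touched every frame are declared first,
// so they end up next to each other at the start of the object.
struct Particle {
    // hot: read and written every frame
    float x, y, z;
    float vx, vy, vz;
    std::uint32_t flags;
    // cold: only touched occasionally (debugging, saving)
    char name[96];
    std::uint64_t created_at;
};

// Bytes from the start of the object to the end of the last hot field.
constexpr std::size_t kHotBytes =
    offsetof(Particle, flags) + sizeof(std::uint32_t);

// The hot region is small enough to fit in one assumed cache line.
static_assert(kHotBytes <= kAssumedCacheLine,
              "hot fields exceed one assumed cache line");
```

If the cold fields were interleaved with the hot ones, the per-frame code would have to pull in more of the object to do the same work.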

The general principle is called "locality of reference". The closer together the different memory addresses are that your program accesses, the better your chances of getting good cache behaviour. It's often difficult to predict performance in advance: different processor models of the same architecture can behave differently, multi-threading means you often don't know what's going to be in the cache, etc. But it's possible to talk about what's likely to happen, most of the time. If you want to know anything, you generally have to measure it.

Please note that there are some gotchas here. If you are using CPU-based atomic operations (which the atomic types in C++0x generally will), then you may find that the CPU locks the entire cache line in order to lock the field. Then, if you have several atomic fields close together, with different threads running on different cores and operating on different fields at the same time, you will find that all those atomic operations are serialised because they all lock the same memory location even though they're operating on different fields. Had they been operating on different cache lines then they would have worked in parallel, and run faster. In fact, as Glen (via Herb Sutter) points out in his answer, on a coherent-cache architecture this happens even without atomic operations, and can utterly ruin your day. So locality of reference is not necessarily a good thing where multiple cores are involved, even if they share cache. You can expect it to be, on grounds that cache misses usually are a source of lost speed, but be horribly wrong in your particular case.
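One common way to keep independently-updated fields off the same cache line is to over-align them. The following is only a sketch: the struct names are hypothetical and the 64-byte line size is an assumption you would need to confirm for your target.

```cpp
#include <atomic>
#include <cstddef>

// Assumed cache line size. Check your processor's documentation.
constexpr std::size_t kAssumedCacheLine = 64;

// Two counters declared back to back. They will very likely share a cache
// line, so two cores updating a and b separately still contend with each other.
struct PackedCounters {
    std::atomic<long> a;
    std::atomic<long> b;
};

// Each counter is forced to start on its own assumed cache line boundary,
// at the cost of a much larger object.
struct PaddedCounters {
    alignas(kAssumedCacheLine) std::atomic<long> a;
    alignas(kAssumedCacheLine) std::atomic<long> b;
};

static_assert(sizeof(PaddedCounters) >= 2 * kAssumedCacheLine,
              "each counter should occupy its own assumed cache line");
```

This trades memory for less contention, which is the opposite of packing things tightly, so it is only worth doing where measurement shows threads really are fighting over a line.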

Now, quite aside from distinguishing between commonly-used and less-used fields, the smaller an object is, the less memory (and hence less cache) it occupies. This is pretty much good news all around, at least where you don't have heavy contention. The size of an object depends on the fields in it, and on any padding which has to be inserted between fields in order to ensure they are correctly aligned for the architecture. C++ (sometimes) puts constraints on the order in which fields must appear in an object, based on the order they are declared. This is to make low-level programming easier. So, if your object contains:

  • an int (4 bytes, 4-aligned)
  • followed by a char (1 byte, any alignment)
  • followed by an int (4 bytes, 4-aligned)
  • followed by a char (1 byte, any alignment)

then chances are this will occupy 16 bytes in memory. The size and alignment of int isn't the same on every platform, by the way, but 4 is very common and this is just an example.

In this case, the compiler will insert 3 bytes of padding before the second int, to correctly align it, and 3 bytes of padding at the end. An object's size has to be a multiple of its alignment, so that objects of the same type can be placed adjacent in memory. That's all an array is in C/C++, adjacent objects in memory. Had the struct been int, int, char, char, then the same object could have been 12 bytes, because char has no alignment requirement.
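You can ask the compiler what it actually did rather than counting bytes by hand. A small sketch, with struct names made up for the example; the sizes in the comments assume int is 4 bytes and 4-aligned, as in the text above.

```cpp
#include <cstddef>

// int, char, int, char: padding after each char.
struct Interleaved {
    int a;
    char b;
    int c;
    char d;
};

// Same fields, with the ints first and the chars together at the end.
struct Grouped {
    int a;
    int c;
    char b;
    char d;
};

// With 4-byte, 4-aligned int on a typical ABI: sizeof(Interleaved) is 16 and
// sizeof(Grouped) is 12. Other platforms may differ, so only the relation
// between the two is asserted here.
static_assert(sizeof(Grouped) <= sizeof(Interleaved),
              "grouping the chars should not make the struct bigger");

// The second int must start at a multiple of int's alignment, which is why
// padding appears after the first char.
static_assert(offsetof(Interleaved, c) % alignof(int) == 0,
              "int member must be correctly aligned");
```

Printing `sizeof` and `offsetof` for your own structs on your own target is the quickest way to see where the padding went.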

I said that whether int is 4-aligned is platform-dependent: on some ARM cores it absolutely has to be, since an unaligned access raises a hardware exception. On x86 you can access ints unaligned, but it's generally slower and IIRC non-atomic. So compilers usually (always?) 4-align ints on x86.

The rule of thumb when writing code, if you care about packing, is to look at the alignment requirement of each member of the struct. Then order the fields with the biggest-aligned types first, then the next biggest, and so on down to members with no alignment requirement. For example if I'm trying to write portable code I might come up with this:

#include <stdint.h>  // uint64_t, uint32_t, int32_t

struct some_stuff {
    double d;   // I expect double is 64bit IEEE, it might not be
    uint64_t l; // 8 bytes, could be 8-aligned or 4-aligned, I don't know
    uint32_t i; // 4 bytes, usually 4-aligned
    int32_t j;  // same
    short s;    // usually 2 bytes, could be 2-aligned or unaligned, I don't know
    char c[4];  // array of 4 chars, 4 bytes big but "never" needs 4-alignment
    char e;     // 1 byte, any alignment
};

If you don't know the alignment of a field, or you're writing portable code but want to do the best you can without major trickery, then you assume that the alignment requirement is the largest requirement of any fundamental type in the structure, and that the alignment requirement of fundamental types is their size. So, if your struct contains a uint64_t, or a long long, then the best guess is it's 8-aligned. Sometimes you'll be wrong, but you'll be right a lot of the time.
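If your compiler supports C++11, you don't have to rely on the guess alone: `alignof` and `static_assert` let you check it at compile time. A sketch along those lines, using a hypothetical copy of a struct sorted by descending alignment (the name `some_stuff_checked` is made up here):

```cpp
#include <cstddef>
#include <cstdint>

// Hypothetical struct with members ordered from largest to smallest
// expected alignment.
struct some_stuff_checked {
    double d;
    std::uint64_t l;
    std::uint32_t i;
    std::int32_t j;
    short s;
    char c[4];
    char e;
};

// Sum of the member sizes: what the struct would occupy with no padding.
constexpr std::size_t kPayload =
    sizeof(double) + sizeof(std::uint64_t) + sizeof(std::uint32_t) +
    sizeof(std::int32_t) + sizeof(short) + sizeof(char[4]) + sizeof(char);

// Whatever is left over is padding.
constexpr std::size_t kPadding = sizeof(some_stuff_checked) - kPayload;

// With members sorted by descending alignment, the only padding expected is
// at the end, and that is always smaller than the struct's own alignment.
static_assert(kPadding < alignof(some_stuff_checked),
              "unexpected internal padding, check the member order");

// The "alignment is at most the size" guess, checked for one type.
static_assert(alignof(std::uint64_t) <= sizeof(std::uint64_t),
              "alignment should not exceed size");
```

If one of these assertions fails on some platform, you find out at build time on that platform instead of quietly carrying extra padding around.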

Note that games programmers like your blogger often know everything about their processor and hardware, and thus they don't have to guess. They know the cache line size, they know the size and alignment of every type, and they know the struct layout rules used by their compiler (for POD and non-POD types). If they support multiple platforms, then they can special-case for each one if necessary. They also spend a lot of time thinking about which objects in their game will benefit from performance improvements, and using profilers to find out where the real bottlenecks are. But even so, it's not such a bad idea to have a few rules of thumb that you apply whether the object needs it or not. As long as it won't make the code unclear, "put commonly-used fields at the start of the object" and "sort by alignment requirement" are two good rules.
