Sum of overlapping arrays, auto-vectorization, and restrict

2021-12-20 00:00:00 optimization c c++ sse auto-vectorization

Ars Technica recently ran an article, "Why are some programming languages faster than others". It compares Fortran and C and mentions summing arrays. Fortran assumes that arrays do not overlap, which allows further optimization. In C/C++, pointers to the same type may overlap, so this optimization cannot be applied in general. However, in C/C++ one can use the restrict or __restrict keyword to tell the compiler that it may assume the pointers do not overlap. So I started looking into this with regard to auto-vectorization.
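As a minimal sketch of what the restrict promise permits (the function names add_twice and add_twice_restrict are hypothetical, not from the article): without restrict, the compiler has to assume that the store through a may change *c, so *c must be read again for the second statement. With restrict, the caller promises the objects are distinct, so the value may be kept in a register. Calling the restrict version with overlapping pointers is undefined behaviour in C, so the assert below only documents the precondition.

```c
#include <assert.h>

/* a, b and c may alias: *c has to be re-read after the store to *a. */
void add_twice(int *a, int *b, const int *c) {
    *a += *c;
    *b += *c;
}

/* Same body, but the caller promises that a, b and c point to
   distinct objects, so the compiler may load *c only once. */
void add_twice_restrict(int *restrict a, int *restrict b,
                        const int *restrict c) {
    assert(a != b && a != c && b != c);  /* precondition of the promise */
    *a += *c;
    *b += *c;
}
```

With add_twice, passing the same address three times is legal and well defined: starting from 5 the value becomes 10 and then 20.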

The following code vectorizes with both GCC and MSVC:

    void dot_int(int *a, int *b, int *c, int n) {
        for (int i = 0; i < n; i++) {
            c[i] = a[i] + b[i];
        }
    }

I tested this with and without overlapping arrays, and it gets the correct result in both cases. However, the way I would vectorize this loop manually with SSE does not handle overlapping arrays.

    int i = 0;
    for (; i < n - 3; i += 4) {
        __m128i a4 = _mm_loadu_si128((__m128i*)&a[i]);
        __m128i b4 = _mm_loadu_si128((__m128i*)&b[i]);
        __m128i c4 = _mm_add_epi32(a4, b4);
        /* as posted: every iteration stores to c, not &c[i] */
        _mm_storeu_si128((__m128i*)c, c4);
    }
    for (; i < n; i++) {
        c[i] = a[i] + b[i];
    }

Next I tried using __restrict. I expected that, since the compiler could now assume the arrays don't overlap, it would no longer handle overlapping arrays. But both GCC and MSVC still get the correct result for overlapping arrays, even with __restrict.

    void dot_int_restrict(int * __restrict a, int * __restrict b, int * __restrict c, int n) {
        for (int i = 0; i < n; i++) {
            c[i] = a[i] + b[i];
        }
    }

Why does the auto-vectorized code, both with and without __restrict, get the correct result for overlapping arrays?

Here is the full code I used to test this:

    #include <stdio.h>
    #include <immintrin.h>

    void dot_int(int *a, int *b, int *c, int n) {
        for (int i = 0; i < n; i++) {
            c[i] = a[i] + b[i];
        }
        for (int i = 0; i < 8; i++) printf("%d ", c[i]);
        printf("\n");
    }

    void dot_int_restrict(int * __restrict a, int * __restrict b, int * __restrict c, int n) {
        for (int i = 0; i < n; i++) {
            c[i] = a[i] + b[i];
        }
        for (int i = 0; i < 8; i++) printf("%d ", c[i]);
        printf("\n");
    }

    void dot_int_SSE(int *a, int *b, int *c, int n) {
        int i = 0;
        for (; i < n - 3; i += 4) {
            __m128i a4 = _mm_loadu_si128((__m128i*)&a[i]);
            __m128i b4 = _mm_loadu_si128((__m128i*)&b[i]);
            __m128i c4 = _mm_add_epi32(a4, b4);
            /* as posted: every iteration stores to c, not &c[i] */
            _mm_storeu_si128((__m128i*)c, c4);
        }
        for (; i < n; i++) {
            c[i] = a[i] + b[i];
        }
        for (int i = 0; i < 8; i++) printf("%d ", c[i]);
        printf("\n");
    }

    int main() {
        const int n = 100;
        int a[] = {1,1,1,1,1,1,1,1};
        int b1[] = {1,1,1,1,1,1,1,1,1};
        int b2[] = {1,1,1,1,1,1,1,1,1};
        int b3[] = {1,1,1,1,1,1,1,1,1};

        int c[8];
        int *c1 = &b1[1];   /* c1, c2, c3 overlap b1, b2, b3 by one element */
        int *c2 = &b2[1];
        int *c3 = &b3[1];

        dot_int(a, b1, c, 8);
        dot_int_SSE(a, b1, c, 8);

        dot_int(a, b1, c1, 8);
        dot_int_restrict(a, b2, c2, 8);
        dot_int_SSE(a, b3, c3, 8);
    }

The output (from MSVC):

    2 2 2 2 2 2 2 2 //no overlap default
    2 2 2 2 2 2 2 2 //no overlap with manual SSE vector code
    2 3 4 5 6 7 8 9 //overlap default
    2 3 4 5 6 7 8 9 //overlap with restrict
    3 2 2 2 1 1 1 1 //manual SSE vector code
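To make the overlap effect concrete without intrinsics, here is a small portable sketch. The names add_scalar and add_blocked4 are hypothetical, and add_blocked4 is only a stand-in for a 4-wide SIMD loop: it loads four elements of each input, adds them, and then stores four results at once. With c pointing one element into b, as in the test above, the element-by-element loop gives 2 3 4 5 6 7 8 9, because each store is seen by the next load. The blocked loop gives 2 2 2 2 3 2 2 2, because each group of four loads happens before the matching stores. The posted SSE loop additionally stores every block to c rather than &c[i], which is why its row reads 3 2 2 2 1 1 1 1 instead.

```c
#include <assert.h>
#include <string.h>

/* Element-by-element reference: each store is visible to later loads. */
void add_scalar(const int *a, const int *b, int *c, int n) {
    assert(n >= 0);
    for (int i = 0; i < n; i++)
        c[i] = a[i] + b[i];
}

/* Portable stand-in for a 4-wide SIMD loop (an assumption for
   illustration, not compiler output): load 4 + 4, add, store 4. */
void add_blocked4(const int *a, const int *b, int *c, int n) {
    assert(n >= 0);
    int i = 0;
    for (; i + 4 <= n; i += 4) {
        int a4[4], b4[4], c4[4];
        memcpy(a4, &a[i], sizeof a4);   /* "vector load" of a */
        memcpy(b4, &b[i], sizeof b4);   /* "vector load" of b */
        for (int k = 0; k < 4; k++)
            c4[k] = a4[k] + b4[k];
        memcpy(&c[i], c4, sizeof c4);   /* "vector store" to c */
    }
    for (; i < n; i++)                  /* scalar tail */
        c[i] = a[i] + b[i];
}
```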

Here is another interesting version, which produces much simpler code:

    void dot_int(int * __restrict a, int * __restrict b, int * __restrict c, int n) {
        a = (int*)__builtin_assume_aligned(a, 16);
        b = (int*)__builtin_assume_aligned(b, 16);
        c = (int*)__builtin_assume_aligned(c, 16);
        for (int i = 0; i < n; i++) {
            c[i] = a[i] + b[i];
        }
    }

Recommended answer

I don't see what the problem is. Testing on Linux/64-bit with GCC 4.6, -O3, -mtune=native, -msse4.1 (i.e. a very old compiler/system), this code

    void dot_int(int *a, int *b, int *c, int n) {
        for (int i = 0; i < n; ++i) {
            c[i] = a[i] + b[i];
        }
    }

compiles to this inner loop:

    .L4:
        movdqu  (%rdi,%rax), %xmm1
        addl    $1, %r8d
        movdqu  (%rsi,%rax), %xmm0
        paddd   %xmm1, %xmm0
        movdqu  %xmm0, (%rdx,%rax)
        addq    $16, %rax
        cmpl    %r8d, %r10d
        ja      .L4
        cmpl    %r9d, %ecx
        je      .L1

while this code

    void dot_int_restrict(int * __restrict a, int * __restrict b, int * __restrict c, int n) {
        for (int i = 0; i < n; ++i) {
            c[i] = a[i] + b[i];
        }
    }

compiles to:

    .L15:
        movdqu  (%rbx,%rax), %xmm0
        addl    $1, %r8d
        paddd   0(%rbp,%rax), %xmm0
        movdqu  %xmm0, (%r11,%rax)
        addq    $16, %rax
        cmpl    %r10d, %r8d
        jb      .L15
        addl    %r12d, %r9d
        cmpl    %r12d, %r13d
        je      .L10

As you can clearly see, there is one less load. I guess it correctly estimated that there is no need to explicitly load memory before performing the sum, as the result won't overwrite anything.

There is also room for many more optimizations: GCC doesn't know that the parameters are, for instance, 128-bit aligned, hence it must generate a huge preamble to check that there are no alignment issues (YMMV), and a postamble to deal with the extra unaligned parts (or those narrower than 128 bits). This actually happens with both versions above. This is the complete code generated for dot_int:

    dot_int:
    .LFB626:
        .cfi_startproc
        testl   %ecx, %ecx
        pushq   %rbx
        .cfi_def_cfa_offset 16
        .cfi_offset 3, -16
        jle     .L1
        leaq    16(%rdx), %r11
        movl    %ecx, %r10d
        shrl    $2, %r10d
        leal    0(,%r10,4), %r9d
        testl   %r9d, %r9d
        je      .L6
        leaq    16(%rdi), %rax
        cmpl    $6, %ecx
        seta    %r8b
        cmpq    %rax, %rdx
        seta    %al
        cmpq    %r11, %rdi
        seta    %bl
        orl     %ebx, %eax
        andl    %eax, %r8d
        leaq    16(%rsi), %rax
        cmpq    %rax, %rdx
        seta    %al
        cmpq    %r11, %rsi
        seta    %r11b
        orl     %r11d, %eax
        testb   %al, %r8b
        je      .L6
        xorl    %eax, %eax
        xorl    %r8d, %r8d
        .p2align 4,,10
        .p2align 3
    .L4:
        movdqu  (%rdi,%rax), %xmm1
        addl    $1, %r8d
        movdqu  (%rsi,%rax), %xmm0
        paddd   %xmm1, %xmm0
        movdqu  %xmm0, (%rdx,%rax)
        addq    $16, %rax
        cmpl    %r8d, %r10d
        ja      .L4
        cmpl    %r9d, %ecx
        je      .L1
    .L3:
        movslq  %r9d, %r8
        xorl    %eax, %eax
        salq    $2, %r8
        addq    %r8, %rdx
        addq    %r8, %rdi
        addq    %r8, %rsi
        .p2align 4,,10
        .p2align 3
    .L5:
        movl    (%rdi,%rax,4), %r8d
        addl    (%rsi,%rax,4), %r8d
        movl    %r8d, (%rdx,%rax,4)
        addq    $1, %rax
        leal    (%r9,%rax), %r8d
        cmpl    %r8d, %ecx
        jg      .L5
    .L1:
        popq    %rbx
        .cfi_remember_state
        .cfi_def_cfa_offset 8
        ret
    .L6:
        .cfi_restore_state
        xorl    %r9d, %r9d
        jmp     .L3
        .cfi_endproc
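Reading the listing, the cmpq/seta pairs against 16(%rdx), 16(%rdi) and 16(%rsi), together with cmpl $6, %ecx, look like a runtime test that c is more than 16 bytes away from both a and b and that n > 6. When the test fails, control jumps to .L6 and only the scalar loop .L5 runs, which would explain why overlapping input still produces the element-by-element result. The following is a rough C rendering of that structure, not the compiler's actual logic. The names far_apart and dot_int_versioned are hypothetical, and the "vector" body is emulated with memcpy so that the sketch stays portable.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Nonzero when one pointer lies more than 16 bytes above the other,
   mirroring the paired cmpq/seta tests in the listing. */
static int far_apart(const int *p, const int *q) {
    uintptr_t x = (uintptr_t)p, y = (uintptr_t)q;
    return x > y + 16 || y > x + 16;
}

void dot_int_versioned(const int *a, const int *b, int *c, int n) {
    assert(n >= 0);
    int i = 0;
    /* Guard: take the 4-wide path only if c is clear of a and b. */
    if (n > 6 && far_apart(c, a) && far_apart(c, b)) {
        for (; i + 4 <= n; i += 4) {
            int a4[4], b4[4], c4[4];
            memcpy(a4, &a[i], sizeof a4);
            memcpy(b4, &b[i], sizeof b4);
            for (int k = 0; k < 4; k++)
                c4[k] = a4[k] + b4[k];
            memcpy(&c[i], c4, sizeof c4);
        }
    }
    /* Scalar fallback and tail: handles overlap element by element. */
    for (; i < n; i++)
        c[i] = a[i] + b[i];
}
```

With c = &b[1], the pointers are only 4 bytes apart, so the guard fails and the scalar loop produces 2 3 4 5 6 7 8 9 for the all-ones input used in the question.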

Now, in your case the ints are effectively not aligned (as they are on the stack), but if you can make them aligned and tell GCC so, then you can improve code generation:

    typedef int intvec __attribute__((vector_size(16)));

    void dot_int_restrict_alig(intvec * restrict a, intvec * restrict b, intvec * restrict c, unsigned int n) {
        for (unsigned int i = 0; i < n; ++i) {
            c[i] = a[i] + b[i];
        }
    }

This generates the following code, with no preamble:

    dot_int_restrict_alig:
    .LFB628:
        .cfi_startproc
        testl   %ecx, %ecx
        je      .L23
        subl    $1, %ecx
        xorl    %eax, %eax
        addq    $1, %rcx
        salq    $4, %rcx
        .p2align 4,,10
        .p2align 3
    .L25:
        movdqa  (%rdi,%rax), %xmm0
        paddd   (%rsi,%rax), %xmm0
        movdqa  %xmm0, (%rdx,%rax)
        addq    $16, %rax
        cmpq    %rcx, %rax
        jne     .L25
    .L23:
        rep
        ret
        .cfi_endproc

Note the use of the aligned 128-bit load instruction (movdqa, "a" for aligned, versus movdqu, unaligned).
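For completeness, here is a hedged sketch of how such a vector-typed function might be called, assuming GCC or Clang (the vector_size attribute is a compiler extension). The function name add_intvec is hypothetical and just mirrors the signature above. Arrays declared with the vector type get 16-byte alignment from the type itself, and n counts 4-int vectors rather than individual ints.

```c
#include <assert.h>

/* Four ints per vector; GCC/Clang extension. */
typedef int intvec __attribute__((vector_size(16)));

/* The type carries the alignment that movdqa relies on. */
static_assert(_Alignof(intvec) == 16, "intvec should be 16-byte aligned");

/* n is the number of 4-int vectors, not the number of ints. */
void add_intvec(intvec *restrict a, intvec *restrict b,
                intvec *restrict c, unsigned int n) {
    for (unsigned int i = 0; i < n; ++i)
        c[i] = a[i] + b[i];   /* element-wise add of four lanes */
}
```

A caller would declare the data as, for example, `intvec a[2] = {{1,1,1,1},{1,1,1,1}};` so that eight ints occupy two aligned vectors, and would have to handle any remainder that does not fill a whole vector separately.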
