如何使用 SIMD 实现 atoi?

2022-01-06 00:00:00 simd c++ sse x86 atoi

我想尝试使用 SIMD 指令编写 atoi 实现，以包含在 RapidJSON(C++ JSON读/写库).它目前在其他地方有一些 SSE2 和 SSE4.2 优化.

I'd like to try writing an atoi implementation using SIMD instructions, to be included in RapidJSON (a C++ JSON reader/writer library). It currently has some SSE2 and SSE4.2 optimizations in other places.

如果是速度增益，多个atoi结果可以并行完成.字符串最初来自 JSON 数据缓冲区，因此多 atoi 函数必须执行任何所需的 swizzling.

If it's a speed gain, multiple atoi results can be done in parallel. The strings are originally coming from a buffer of JSON data, so a multi-atoi function will have to do any required swizzling.

我想出的算法如下:

我可以用以下方式初始化一个长度为 N 的向量:[10^N..10^1]
我将缓冲区中的每个字符都转换为一个整数，然后将它们放入另一个向量中.
我将有效数字向量中的每个数字乘以数字向量中的匹配数字，然后对结果求和.

我的目标是 x86 和 x86-64 架构.

I'm targeting x86 and x86-64 architectures.

我知道 AVX2 支持三个操作数融合乘加，所以我将能够执行 Sum = Number * Effective Digit + Sum.
这就是我到目前为止的目标.
我的算法正确吗?有没有更好的方法?
是否有使用任何 SIMD 指令集的 atoi 参考实现?

I know that AVX2 supports three operand Fused Multiply-Add so I'll be able to perform Sum = Number * Significant Digit + Sum.
That's where I got so far.
Is my algorithm correct? Is there a better way?
Is there a reference implementation for atoi using any SIMD instructions set?

推荐答案

算法及其实现到此结束.它是完整的并且(适度)测试(更新以减少恒定内存使用量和容忍加号).

The algorithm and its implementation is finished now. It's complete and (moderately) tested (Updated for less constant memory usage and tolerating plus-char).

这段代码的属性如下:

适用于 int 和 uint，从 MIN_INT=-2147483648 到 MAX_INT=2147483647 和从 MIN_UINT=0 到 MAX_UINT=4294967295
前导 '-' 字符表示负数(合理)，前导 '+' 字符被忽略
忽略前导零(带或不带符号字符)
溢出被忽略 - 更大的数字只是环绕
零长度字符串导致值 0 = -0
识别无效字符，转换在第一个无效字符处结束
最后一个前导零之后的至少 16 个字节必须是可访问的，并且在 EOS 之后读取可能的安全隐患留给调用者
只需要 SSE4.2

Works for int and uint, from MIN_INT=-2147483648 to MAX_INT=2147483647 and from MIN_UINT=0 to MAX_UINT=4294967295

A leading '-' char indicates a negative number (as sensible), a leading '+' char is ignored

Leading zeros (with or without sign char) are ignored

Overflow is ignored - bigger numbers just wraparound

Zero length strings result in value 0 = -0

Invalid characters are recognized and the conversion ends at the first invalid char

At least 16 bytes after the last leading zero must be accessible and possible security implications of reading after EOS are left to the caller

Only SSE4.2 is needed

关于这个实现:

此代码示例可以使用 GNU Assembler(as) 在开头使用 .intel_syntax noprefix 运行
常量的数据占用空间为 64 字节(4*128 位 XMM)，相当于一个缓存行.
代码占用空间为 46 条指令，具有 51 个微操作和 64 个周期延迟
一个循环删除前导零，否则除了错误处理外没有跳转，所以...
时间复杂度为 O(1)

This code sample can be run with GNU Assembler(as) using .intel_syntax noprefix at the beginning

Data footprint for constants is 64 bytes (4*128 bit XMM) equalling one cache line.

Code footprint is 46 instructions with 51 micro-Ops and 64 cycles latency

One loop for removal of leading zeros, otherwise no jumps except for error handling, so...

Time complexity is O(1)

算法的方法:

- Pointer to number string is expected in ESI - Check if first char is '-', then indicate if negative number in EDX (**A**) - Check for leading zeros and EOS (**B**) - Check string for valid digits and get strlen() of valid chars (**C**) - Reverse string so that power of 10^0 is always at BYTE 15 10^1 is always at BYTE 14 10^2 is always at BYTE 13 10^3 is always at BYTE 12 10^4 is always at BYTE 11 ... and mask out all following chars (**D**) - Subtract saturated '0' from each of the 16 possible chars (**1**) - Take 16 consecutive byte-values and and split them to WORDs in two XMM-registers (**2**) P O N M L K J I | H G F E D C B A -> H G F E | D C B A (XMM0) P O N M | L K J I (XMM1) - Multiply each WORD by its place-value modulo 10000 (1,10,100,1000) (factors smaller then MAX_WORD, 4 factors per QWORD/halfXMM) (**2**) so we can horizontally combine twice before another multiply. The PMADDWD instruction can do this and the next step: - Horizontally add adjacent WORDs to DWORDs (**3**) H*1000+G*100 F*10+E*1 | D*1000+C*100 B*10+A*1 (XMM0) P*1000+O*100 N*10+M*1 | L*1000+K*100 J*10+I*1 (XMM1) - Horizontally add adjacent DWORDs from XMM0 and XMM1 to XMM0 (**4**) xmmDst[31-0] = xmm0[63-32] + xmm0[31-0] xmmDst[63-32] = xmm0[127-96] + xmm0[95-64] xmmDst[95-64] = xmm1[63-32] + xmm1[31-0] xmmDst[127-96] = xmm1[127-96] + xmm1[95-64] - Values in XMM0 are multiplied with the factors (**5**) P*1000+O*100+N*10+M*1 (DWORD factor 1000000000000 = too big for DWORD, but possibly useful for QWORD number strings) L*1000+K*100+J*10+I*1 (DWORD factor 100000000) H*1000+G*100+F*10+E*1 (DWORD factor 10000) D*1000+C*100+B*10+A*1 (DWORD factor 1) - The last step is adding these four DWORDs together with 2*PHADDD emulated by 2*(PSHUFD+PADDD) - xmm0[31-0] = xmm0[63-32] + xmm0[31-0] (**6**) xmm0[63-32] = xmm0[127-96] + xmm0[95-64] (the upper QWORD contains the same and is ignored) - xmm0[31-0] = xmm0[63-32] + xmm0[31-0] (**7**) - If the number is negative (indicated in EDX by 000...0=pos or 111...1=neg), negate it(**8**)

以及使用英特尔语法在 GNU Assembler 中的示例实现:

.intel_syntax noprefix .data .align 64 ddqDigitRange: .byte '0','9',0,0,0,0,0,0,0,0,0,0,0,0,0,0 ddqShuffleMask:.byte 15,14,13,12,11,10,9,8,7,6,5,4,3,2,1,0 ddqFactor1: .word 1,10,100,1000, 1,10,100,1000 ddqFactor2: .long 1,10000,100000000,0 .text _start: mov esi, lpInputNumberString /* (**A**) indicate negative number in EDX */ mov eax, -1 xor ecx, ecx xor edx, edx mov bl, byte ptr [esi] cmp bl, '-' cmove edx, eax cmp bl, '+' cmove ecx, eax sub esi, edx sub esi, ecx /* (**B**)remove leading zeros */ xor eax,eax /* return value ZERO */ remove_leading_zeros: inc esi cmp byte ptr [esi-1], '0' /* skip leading zeros */ je remove_leading_zeros cmp byte ptr [esi-1], 0 /* catch empty string/number */ je FINISH dec esi /* check for valid digit-chars and invert from front to back */ pxor xmm2, xmm2 movdqa xmm0, xmmword ptr [ddqDigitRange] movdqu xmm1, xmmword ptr [esi] pcmpistri xmm0, xmm1, 0b00010100 /* (**C**) iim8=Unsigned bytes, Ranges, Negative Polarity(-), returns strlen() in ECX */ jo FINISH /* if first char is invalid return 0 - prevent processing empty string - 0 is still in EAX */ mov al , '0' /* value to subtract from chars */ sub ecx, 16 /* len-16=negative to zero for shuffle mask */ movd xmm0, ecx pshufb xmm0, xmm2 /* broadcast CL to all 16 BYTEs */ paddb xmm0, xmmword ptr [ddqShuffleMask] /* Generate permute mask for PSHUFB - all bytes < 0 have highest bit set means place gets zeroed */ pshufb xmm1, xmm0 /* (**D**) permute - now from highest to lowest BYTE are factors 10^0, 10^1, 10^2, ... */ movd xmm0, eax /* AL='0' from above */ pshufb xmm0, xmm2 /* broadcast AL to XMM0 */ psubusb xmm1, xmm0 /* (**1**) */ movdqa xmm0, xmm1 punpcklbw xmm0, xmm2 /* (**2**) */ punpckhbw xmm1, xmm2 pmaddwd xmm0, xmmword ptr [ddqFactor1] /* (**3**) */ pmaddwd xmm1, xmmword ptr [ddqFactor1] phaddd xmm0, xmm1 /* (**4**) */ pmulld xmm0, xmmword ptr [ddqFactor2] /* (**5**) */ pshufd xmm1, xmm0, 0b11101110 /* (**6**) */ paddd xmm0, xmm1 pshufd xmm1, xmm0, 0b01010101 /* (**7**) */ paddd xmm0, xmm1 movd eax, xmm0 /* negate if negative number */ add eax, edx /* (**8**) */ xor eax, edx FINISH: /* EAX is return (u)int value */

Haswell 32 位的 Intel-IACA 吞吐量分析结果:

Throughput Analysis Report -------------------------- Block Throughput: 16.10 Cycles Throughput Bottleneck: InterIteration Port Binding In Cycles Per Iteration: --------------------------------------------------------------------------------------- | Port | 0 - DV | 1 | 2 - D | 3 - D | 4 | 5 | 6 | 7 | --------------------------------------------------------------------------------------- | Cycles | 9.5 0.0 | 10.0 | 4.5 4.5 | 4.5 4.5 | 0.0 | 11.1 | 11.4 | 0.0 | --------------------------------------------------------------------------------------- N - port number or number of cycles resource conflict caused delay, DV - Divider pipe (on port 0) D - Data fetch pipe (on ports 2 and 3), CP - on a critical path F - Macro Fusion with the previous instruction occurred * - instruction micro-ops not bound to a port ^ - Micro Fusion happened # - ESP Tracking sync uop was issued @ - SSE instruction followed an AVX256 instruction, dozens of cycles penalty is expected ! - instruction not supported, was not accounted in Analysis | Num Of | Ports pressure in cycles | | | Uops | 0 - DV | 1 | 2 - D | 3 - D | 4 | 5 | 6 | 7 | | --------------------------------------------------------------------------------- | 0* | | | | | | | | | | xor eax, eax | 0* | | | | | | | | | | xor ecx, ecx | 0* | | | | | | | | | | xor edx, edx | 1 | | 0.1 | | | | | 0.9 | | | dec eax | 1 | | | 0.5 0.5 | 0.5 0.5 | | | | | CP | mov bl, byte ptr [esi] | 1 | | | | | | | 1.0 | | CP | cmp bl, 0x2d | 2 | 0.1 | 0.2 | | | | | 1.8 | | CP | cmovz edx, eax | 1 | 0.1 | 0.5 | | | | | 0.4 | | CP | cmp bl, 0x2b | 2 | 0.5 | 0.2 | | | | | 1.2 | | CP | cmovz ecx, eax | 1 | 0.2 | 0.5 | | | | | 0.2 | | CP | sub esi, edx | 1 | 0.2 | 0.5 | | | | | 0.3 | | CP | sub esi, ecx | 0* | | | | | | | | | | xor eax, eax | 1 | 0.3 | 0.1 | | | | | 0.6 | | CP | inc esi | 2^ | 0.3 | | 0.5 0.5 | 0.5 0.5 | | | 0.6 | | | cmp byte ptr [esi-0x1], 0x30 | 0F | | | | | | | | | | jz 0xfffffffb | 2^ | 0.6 | | 0.5 0.5 | 0.5 0.5 | | | 0.4 | | | cmp byte ptr [esi-0x1], 0x0 | 0F | | | | | | | | | | jz 0x8b | 1 | 0.1 | 0.9 | | | | | | | CP | dec esi | 1 | | | 0.5 0.5 | 0.5 0.5 | | | | | | movdqa xmm0, xmmword ptr [0x80492f0] | 1 | | | 0.5 0.5 | 0.5 0.5 | | | | | CP | movdqu xmm1, xmmword ptr [esi] | 0* | | | | | | | | | | pxor xmm2, xmm2 | 3 | 2.0 | 1.0 | | | | | | | CP | pcmpistri xmm0, xmm1, 0x14 | 1 | | | | | | | 1.0 | | | jo 0x6e | 1 | | 0.4 | | | | 0.1 | 0.5 | | | mov al, 0x30 | 1 | 0.1 | 0.5 | | | | 0.1 | 0.3 | | CP | sub ecx, 0x10 | 1 | | | | | | 1.0 | | | CP | movd xmm0, ecx | 1 | | | | | | 1.0 | | | CP | pshufb xmm0, xmm2 | 2^ | | 1.0 | 0.5 0.5 | 0.5 0.5 | | | | | CP | paddb xmm0, xmmword ptr [0x80492c0] | 1 | | | | | | 1.0 | | | CP | pshufb xmm1, xmm0 | 1 | | | | | | 1.0 | | | | movd xmm0, eax | 1 | | | | | | 1.0 | | | | pshufb xmm0, xmm2 | 1 | | 1.0 | | | | | | | CP | psubusb xmm1, xmm0 | 0* | | | | | | | | | CP | movdqa xmm0, xmm1 | 1 | | | | | | 1.0 | | | CP | punpcklbw xmm0, xmm2 | 1 | | | | | | 1.0 | | | | punpckhbw xmm1, xmm2 | 2^ | 1.0 | | 0.5 0.5 | 0.5 0.5 | | | | | CP | pmaddwd xmm0, xmmword ptr [0x80492d0] | 2^ | 1.0 | | 0.5 0.5 | 0.5 0.5 | | | | | | pmaddwd xmm1, xmmword ptr [0x80492d0] | 3 | | 1.0 | | | | 2.0 | | | CP | phaddd xmm0, xmm1 | 3^ | 2.0 | | 0.5 0.5 | 0.5 0.5 | | | | | CP | pmulld xmm0, xmmword ptr [0x80492e0] | 1 | | | | | | 1.0 | | | CP | pshufd xmm1, xmm0, 0xee | 1 | | 1.0 | | | | | | | CP | paddd xmm0, xmm1 | 1 | | | | | | 1.0 | | | CP | pshufd xmm1, xmm0, 0x55 | 1 | | 1.0 | | | | | | | CP | paddd xmm0, xmm1 | 1 | 1.0 | | | | | | | | CP | movd eax, xmm0 | 1 | | | | | | | 1.0 | | CP | add eax, edx | 1 | | | | | | | 1.0 | | CP | xor eax, edx Total Num Of Uops: 51

针对 Haswell 32 位的 Intel-IACA 延迟分析结果:

Latency Analysis Report --------------------------- Latency: 64 Cycles N - port number or number of cycles resource conflict caused delay, DV - Divider pipe (on port 0) D - Data fetch pipe (on ports 2 and 3), CP - on a critical path F - Macro Fusion with the previous instruction occurred * - instruction micro-ops not bound to a port ^ - Micro Fusion happened # - ESP Tracking sync uop was issued @ - Intel(R) AVX to Intel(R) SSE code switch, dozens of cycles penalty is expected ! - instruction not supported, was not accounted in Analysis The Resource delay is counted since all the sources of the instructions are ready and until the needed resource becomes available | Inst | Resource Delay In Cycles | | | Num | 0 - DV | 1 | 2 - D | 3 - D | 4 | 5 | 6 | 7 | FE | | ------------------------------------------------------------------------- | 0 | | | | | | | | | | | xor eax, eax | 1 | | | | | | | | | | | xor ecx, ecx | 2 | | | | | | | | | | | xor edx, edx | 3 | | | | | | | | | | | dec eax | 4 | | | | | | | | | 1 | CP | mov bl, byte ptr [esi] | 5 | | | | | | | | | | CP | cmp bl, 0x2d | 6 | | | | | | | | | | CP | cmovz edx, eax | 7 | | | | | | | | | | CP | cmp bl, 0x2b | 8 | | | | | | | 1 | | | CP | cmovz ecx, eax | 9 | | | | | | | | | | CP | sub esi, edx | 10 | | | | | | | | | | CP | sub esi, ecx | 11 | | | | | | | | | 3 | | xor eax, eax | 12 | | | | | | | | | | CP | inc esi | 13 | | | | | | | | | | | cmp byte ptr [esi-0x1], 0x30 | 14 | | | | | | | | | | | jz 0xfffffffb | 15 | | | | | | | | | | | cmp byte ptr [esi-0x1], 0x0 | 16 | | | | | | | | | | | jz 0x8b | 17 | | | | | | | | | | CP | dec esi | 18 | | | | | | | | | 4 | | movdqa xmm0, xmmword ptr [0x80492f0] | 19 | | | | | | | | | | CP | movdqu xmm1, xmmword ptr [esi] | 20 | | | | | | | | | 5 | | pxor xmm2, xmm2 | 21 | | | | | | | | | | CP | pcmpistri xmm0, xmm1, 0x14 | 22 | | | | | | | | | | | jo 0x6e | 23 | | | | | | | | | 6 | | mov al, 0x30 | 24 | | | | | | | | | | CP | sub ecx, 0x10 | 25 | | | | | | | | | | CP | movd xmm0, ecx | 26 | | | | | | | | | | CP | pshufb xmm0, xmm2 | 27 | | | | | | | | | 7 | CP | paddb xmm0, xmmword ptr [0x80492c0] | 28 | | | | | | | | | | CP | pshufb xmm1, xmm0 | 29 | | | | | | 1 | | | | | movd xmm0, eax | 30 | | | | | | 1 | | | | | pshufb xmm0, xmm2 | 31 | | | | | | | | | | CP | psubusb xmm1, xmm0 | 32 | | | | | | | | | | CP | movdqa xmm0, xmm1 | 33 | | | | | | | | | | CP | punpcklbw xmm0, xmm2 | 34 | | | | | | | | | | | punpckhbw xmm1, xmm2 | 35 | | | | | | | | | 9 | CP | pmaddwd xmm0, xmmword ptr [0x80492d0] | 36 | | | | | | | | | 9 | | pmaddwd xmm1, xmmword ptr [0x80492d0] | 37 | | | | | | | | | | CP | phaddd xmm0, xmm1 | 38 | | | | | | | | | 10 | CP | pmulld xmm0, xmmword ptr [0x80492e0] | 39 | | | | | | | | | | CP | pshufd xmm1, xmm0, 0xee | 40 | | | | | | | | | | CP | paddd xmm0, xmm1 | 41 | | | | | | | | | | CP | pshufd xmm1, xmm0, 0x55 | 42 | | | | | | | | | | CP | paddd xmm0, xmm1 | 43 | | | | | | | | | | CP | movd eax, xmm0 | 44 | | | | | | | | | | CP | add eax, edx | 45 | | | | | | | | | | CP | xor eax, edx Resource Conflict on Critical Paths: ----------------------------------------------------------------- | Port | 0 - DV | 1 | 2 - D | 3 - D | 4 | 5 | 6 | 7 | ----------------------------------------------------------------- | Cycles | 0 0 | 0 | 0 0 | 0 0 | 0 | 0 | 1 | 0 | ----------------------------------------------------------------- List Of Delays On Critical Paths ------------------------------- 6 --> 8 1 Cycles Delay On Port6

在 Peter Cordes 的评论中建议的替代处理是用 imul 替换最后两个 add+xor 指令.这种操作码的集中可能更好.不幸的是，IACA 不支持该指令并抛出一个！- 指令不受支持，未在 Analysis 注释中说明.尽管如此，虽然我喜欢减少操作码并将其从(2uop，2c 延迟)减少到(1 uop，3c 延迟-延迟更糟，但仍然是 AMD 上的一个 m-op")，但我更愿意将它留给实现者方式来选择.我还没有检查以下代码是否足以解析任何数字.只是为了完整性才提到的，其他部分的代码可能需要修改(尤其是处理正数).

An alternative handling suggested in comments by Peter Cordes is replacing the last two add+xor instructions by an imul. This concentration of OpCodes is likely to be superior. Unfortunately IACA doesn't support that instruction and throws a ! - instruction not supported, was not accounted in Analysis comment. Nevertheless, although I like the reduction of OpCodes and reduction from (2uops, 2c latency) to (1 uop, 3c latency - "worse latency, but still one m-op on AMD"), I prefer to leave it to the implementer which way to choose. I haven't checked if the following code is sufficient for parsing any number. It is just mentioned for completeness and code modifications in other parts may be necessary (especially handling positive numbers).

替代方法可能是将最后两行替换为:

The alternative may be replacing the last two lines with:

... /* negate if negative number */ imul eax, edx FINISH: /* EAX is return (u)int value */

相关文章