How does C++ linking work in practice?
How does C++ linking work in practice? I am looking for a detailed explanation of how linking happens, not of which commands perform it.
There is already a similar question about compilation, but it does not go into much detail: How does the compilation/linking process work?
Recommended answer
EDIT: I have moved this answer to the duplicate: https://stackoverflow.com/a/33690144/895245
This answer focuses on address relocation, which is one of the crucial functions of linking.
A minimal example will be used to clarify the concept.
Summary: relocation edits the .text section of object files to translate:
- object file addresses
- into the final addresses of the executable
This must be done by the linker because the compiler only sees one input file at a time, but we must know about all object files at once to decide how to:
- resolve undefined symbols, such as functions that are declared but not defined
- keep the multiple .text and .data sections of multiple object files from clashing
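As a rough illustration of why all object files have to be visible at once, here is a toy Python sketch of the layout step. The object names, section sizes, and base addresses are invented for the example, and a real linker such as ld also deals with alignment, symbol tables, and much more.

```python
# Toy layout pass: give each input section a non-overlapping final address.
# Object names, sizes and base addresses below are hypothetical.

def layout(objects, bases):
    """Return {(object, section): final_address} by concatenating
    same-named sections from every object, in input order."""
    next_free = dict(bases)  # next free address per output section
    placed = {}
    for obj_name, sections in objects:
        for sec_name, size in sections.items():
            placed[(obj_name, sec_name)] = next_free[sec_name]
            next_free[sec_name] += size
    return placed

objects = [
    ("a.o", {".text": 0x27, ".data": 0x0D}),
    ("b.o", {".text": 0x10, ".data": 0x04}),
]
bases = {".text": 0x400000, ".data": 0x600000}

placed = layout(objects, bases)
for key, addr in placed.items():
    print(key, hex(addr))
```

The address of b.o's .data depends on the size of a.o's .data, which is exactly the information a compiler working on b.o alone does not have.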
Prerequisites: a minimal understanding of:
- x86-64 or IA-32 assembly
- the global structure of an ELF file. I have made a tutorial for that.
Linking has nothing to do with C or C++ specifically: compilers just generate the object files. The linker then takes them as input without ever knowing what language compiled them. It might as well be Fortran.
So, to cut down on the clutter, let's study a NASM x86-64 ELF Linux hello world:
section .data
hello_world db "Hello world!", 10
section .text
global _start
_start:
; sys_write
mov rax, 1
mov rdi, 1
mov rsi, hello_world
mov rdx, 13
syscall
; sys_exit
mov rax, 60
mov rdi, 0
syscall
Assemble and link with:
nasm -felf64 hello_world.asm # creates hello_world.o
ld -o hello_world.out hello_world.o # static ELF executable with no libraries
This was done with NASM 2.10.09.
First we disassemble the .text section of the object file:
objdump -d hello_world.o
which gives:
0000000000000000 <_start>:
0: b8 01 00 00 00 mov $0x1,%eax
5: bf 01 00 00 00 mov $0x1,%edi
a: 48 be 00 00 00 00 00 movabs $0x0,%rsi
11: 00 00 00
14: ba 0d 00 00 00 mov $0xd,%edx
19: 0f 05 syscall
1b: b8 3c 00 00 00 mov $0x3c,%eax
20: bf 00 00 00 00 mov $0x0,%edi
25: 0f 05 syscall
The crucial lines are:
a: 48 be 00 00 00 00 00 movabs $0x0,%rsi
11: 00 00 00
which should move the address of the hello world string into the rsi register, which is passed to the write system call.
But wait! How can the compiler possibly know where "Hello world!" will end up in memory when the program is loaded?
Well, it can't, especially after we link a bunch of .o files together with multiple .data sections.
Only the linker can do that, since only it has all those object files.
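So the compiler just:
- puts a placeholder value 0x0 in the compiled output
- gives the linker some extra information on how to modify the compiled code with the correct addresses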
This "extra information" is contained in the .rela.text section of the object file.
.rela.text stands for "relocation of the .text section".
The word relocation is used because the linker will have to relocate the address from the object into the executable.
We can dump the .rela.text section with:
readelf -r hello_world.o
which contains:
Relocation section '.rela.text' at offset 0x340 contains 1 entries:
Offset Info Type Sym. Value Sym. Name + Addend
00000000000c 000200000001 R_X86_64_64 0000000000000000 .data + 0
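The Info column packs two values into one 64-bit number: a symbol table index in the upper 32 bits and the relocation type in the lower 32 bits. A small Python sketch of that split, mirroring the ELF64_R_SYM and ELF64_R_TYPE macros from the ELF specification (the function name split_r_info is made up for this example):

```python
# Split an Elf64_Rela r_info value into symbol index and relocation type,
# following the ELF64_R_SYM / ELF64_R_TYPE macros of the ELF specification.

R_X86_64_64 = 1  # type number assigned by the AMD64 System V ABI

def split_r_info(r_info):
    sym_index = r_info >> 32          # upper 32 bits: symbol table index
    rel_type = r_info & 0xFFFFFFFF    # lower 32 bits: relocation type
    return sym_index, rel_type

sym_index, rel_type = split_r_info(0x000200000001)
print(sym_index, rel_type == R_X86_64_64)
```

For the entry above this yields symbol index 2, which readelf resolves to .data, and type 1, which readelf prints as R_X86_64_64.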
The format of this section is fixed and documented at: http://www.sco.com/developers/gabi/2003-12-17/ch4.reloc.html
Each entry tells the linker about one address that needs to be relocated; here we have only one, for the string.
Simplifying a bit, for this particular line we have the following information:
Offset = C: the first byte of .text that this entry changes.
If we look back at the disassembled text, it is exactly inside the critical movabs $0x0,%rsi, and those who know x86-64 instruction encoding will notice that this is the 64-bit address part of the instruction.
Name = .data: the address points into the .data section.
Type = R_X86_64_64, which specifies exactly what calculation has to be done to translate the address.
This field is actually processor dependent, and thus documented in the AMD64 System V ABI extension, section 4.4 "Relocation".
That document says that R_X86_64_64 does:
Field = word64: 8 bytes, thus the 00 00 00 00 00 00 00 00 at address 0xC
Calculation = S + A
S is the value of the symbol that the relocation entry refers to, here the .data section symbol. In the object file that value is 0 (the Sym. Value column above); at link time it becomes the final address of .data.
A is the addend, which is 0 here. This is a field of the relocation entry.
So S + A points at offset 0 of .data, and the operand gets relocated to the very first address of the .data section.
Now let's look at the text area of the executable that ld generated for us:
objdump -d hello_world.out
gives:
00000000004000b0 <_start>:
4000b0: b8 01 00 00 00 mov $0x1,%eax
4000b5: bf 01 00 00 00 mov $0x1,%edi
4000ba: 48 be d8 00 60 00 00 movabs $0x6000d8,%rsi
4000c1: 00 00 00
4000c4: ba 0d 00 00 00 mov $0xd,%edx
4000c9: 0f 05 syscall
4000cb: b8 3c 00 00 00 mov $0x3c,%eax
4000d0: bf 00 00 00 00 mov $0x0,%edi
4000d5: 0f 05 syscall
So the only thing that changed from the object file is the critical lines:
4000ba: 48 be d8 00 60 00 00 movabs $0x6000d8,%rsi
4000c1: 00 00 00
which now point to the address 0x6000d8 (d8 00 60 00 00 00 00 00 in little-endian) instead of 0x0.
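The patch that ld performed can be replayed by hand. The Python sketch below starts from the object file bytes shown in the first objdump listing and applies the S + A calculation at offset 0xC. The helper name apply_r_x86_64_64 is invented for this example, and the symbol value 0x6000d8 is simply taken from the linked output rather than computed, so this is only an illustration of the byte edit, not of how ld picks addresses.

```python
import struct

# .text bytes of hello_world.o, copied from the objdump listing above.
obj_text = bytes.fromhex(
    "b801000000"            # 0x00: mov $0x1,%eax
    "bf01000000"            # 0x05: mov $0x1,%edi
    "48be0000000000000000"  # 0x0a: movabs $0x0,%rsi (2 opcode bytes + 8 placeholder bytes)
    "ba0d000000"            # 0x14: mov $0xd,%edx
    "0f05"                  # 0x19: syscall
    "b83c000000"            # 0x1b: mov $0x3c,%eax
    "bf00000000"            # 0x20: mov $0x0,%edi
    "0f05"                  # 0x25: syscall
)

def apply_r_x86_64_64(text, offset, symbol_value, addend):
    """Write S + A as a little-endian 64-bit word at `offset` (hypothetical helper)."""
    patched = bytearray(text)
    patched[offset:offset + 8] = struct.pack("<Q", symbol_value + addend)
    return bytes(patched)

# Assumption: .data ends up at 0x6000d8, the address ld chose in this run.
linked_text = apply_r_x86_64_64(obj_text, offset=0xC,
                                symbol_value=0x6000D8, addend=0)
print(linked_text[0xA:0x14].hex(" "))
```

The printed bytes are those of the relocated movabs instruction, and every other byte of .text is left untouched.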
Is this the right location for the hello_world string?
To decide, we have to check the program headers, which tell Linux where to load each segment, and with it each section, in memory.
We dump them with:
readelf -l hello_world.out
which gives:
Program Headers:
Type Offset VirtAddr PhysAddr
FileSiz MemSiz Flags Align
LOAD 0x0000000000000000 0x0000000000400000 0x0000000000400000
0x00000000000000d7 0x00000000000000d7 R E 200000
LOAD 0x00000000000000d8 0x00000000006000d8 0x00000000006000d8
0x000000000000000d 0x000000000000000d RW 200000
Section to Segment mapping:
Segment Sections...
00 .text
01 .data
This tells us that the .data section, which is the second one, starts at VirtAddr = 0x6000d8.
And the only thing in the data section is our hello world string.
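As a quick sanity check, the numbers from the program headers can be compared against the relocated operand in a few lines of Python. All values are copied from the listings above; the variable names are chosen for this example only.

```python
# Values copied from the `readelf -l` output above (second LOAD segment).
seg_offset, seg_vaddr, seg_filesz, seg_align = 0xD8, 0x6000D8, 0xD, 0x200000

operand = 0x6000D8            # immediate of the relocated movabs
message = b"Hello world!\n"   # contents of .data in hello_world.asm

# The operand must fall inside the loaded .data segment.
in_segment = seg_vaddr <= operand < seg_vaddr + seg_filesz
print(in_segment, operand - seg_vaddr, len(message) == seg_filesz)

# File offset and virtual address are congruent modulo the alignment.
print(seg_offset % seg_align == seg_vaddr % seg_align)
```

The operand lands at offset 0 of the segment, and the segment size 0xd matches the 13 bytes of the string, which is consistent with the string being the only content of .data.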