Why is a segmentation fault not recoverable?

2022-01-12 00:00:00 c exception segmentation-fault c++

Following a previous question of mine, most comments said "just don't, you are in a limbo state, you have to kill everything and start over". There is also a "safeish" workaround.

What I fail to understand is why a segmentation fault is inherently nonrecoverable.

The write to protected memory has to be caught at the moment it happens - otherwise, the SIGSEGV would not be sent.

If the moment of writing to protected memory can be caught, I don't see why - in theory - it can't be reverted, at some low level, and have the SIGSEGV converted to a standard software exception.

Please explain why the program is in an undetermined state after a segmentation fault, since the fault is evidently raised before memory is actually changed (I am probably wrong and don't see why). Had it been raised afterwards, one could write a program that changes protected memory one byte at a time, taking a segmentation fault each time, and eventually reprograms the kernel - a security risk that is not present, as we can see the world still stands.

When exactly does a segmentation fault happen (= when is SIGSEGV sent)?

Why is the process in an undefined behavior state after that point?

Why is it not recoverable?

Why does this solution avoid that unrecoverable state? Does it even?

Recommended answer

When exactly does a segmentation fault happen (= when is SIGSEGV sent)?

When you attempt to access memory you don't have access to, such as accessing an array out of bounds or dereferencing an invalid pointer. The SIGSEGV signal is standardized, but different operating systems might implement it differently. "Segmentation fault" is mainly a term used on *nix systems; Windows calls it an "access violation".
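The timing can be observed with a small experiment. The following is a minimal sketch, assuming a Linux-like POSIX system where a store to a read-only page raises SIGSEGV (some systems report SIGBUS instead). The function name `write_to_readonly_page_faults` is hypothetical. A child process tries to write to a shared read-only page; the parent then checks both how the child died and whether the byte changed.

```c
#define _DEFAULT_SOURCE
#include <signal.h>
#include <stddef.h>
#include <sys/mman.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/* Hypothetical helper, for illustration only.
 * Returns 1 if the child was killed by SIGSEGV and the page is unchanged,
 * 0 if either of those did not hold, and -1 if the setup failed. */
int write_to_readonly_page_faults(void)
{
    long page_size = sysconf(_SC_PAGESIZE);
    if (page_size <= 0)
        return -1;

    /* Shared mapping, so the parent would see any byte the child managed
     * to change. Anonymous mappings start out zero-filled. */
    volatile char *page = mmap(NULL, (size_t)page_size, PROT_READ,
                               MAP_SHARED | MAP_ANONYMOUS, -1, 0);
    if (page == MAP_FAILED)
        return -1;

    pid_t pid = fork();
    if (pid < 0) {
        munmap((void *)page, (size_t)page_size);
        return -1;
    }
    if (pid == 0) {
        page[0] = 'X';  /* the store is refused; the kernel sends SIGSEGV */
        _exit(0);       /* reached only if the store went through */
    }

    int status = 0;
    if (waitpid(pid, &status, 0) != pid) {
        munmap((void *)page, (size_t)page_size);
        return -1;
    }

    int killed_by_segv = WIFSIGNALED(status) && WTERMSIG(status) == SIGSEGV;
    int page_unchanged = (page[0] == 0);

    munmap((void *)page, (size_t)page_size);
    return killed_by_segv && page_unchanged;
}
```

Under those assumptions the faulting store itself never lands. That only covers the one access that was caught, though; it says nothing about earlier out-of-bounds writes that hit memory the process was allowed to modify.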

Why is the process in an undefined behavior state after that point?

Because one or several of the variables in the program didn't behave as expected. Let's say you have an array that is supposed to store a number of values, but you didn't allocate enough room for all of them. Only the values you allocated room for get written correctly, and the rest, written out of bounds of the array, can hold any values. How exactly is the OS to know how critical those out-of-bounds values are for your application to function? It knows nothing of their purpose.

Furthermore, writing outside allowed memory can often corrupt other unrelated variables, which is obviously dangerous and can cause any random behavior. Such bugs are often hard to track down. Stack overflows, for example, are segmentation faults of this kind and are prone to overwriting adjacent variables, unless the error is caught by protection mechanisms.
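Silent corruption of a neighboring variable can be imitated without invoking undefined behavior. This is only a sketch: the `struct account` type and the `unchecked_set_name` function are hypothetical, and the copy is deliberately clamped to the size of the struct so that it stays inside one object. It assumes the common layout in which `balance` directly follows the 8-byte `name` field.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical record: an 8-byte name followed by an unrelated field. */
struct account {
    char name[8];
    int  balance;
};

/* Copies len bytes to the start of the record without checking len against
 * sizeof a->name. The clamp keeps the copy inside the struct, so the
 * behavior is defined, but every byte past the name lands in balance.
 * No fault is raised, because the process owns all of this memory. */
void unchecked_set_name(struct account *a, const char *src, size_t len)
{
    if (len > sizeof *a)
        len = sizeof *a;
    memcpy(a, src, len);
}
```

Calling `unchecked_set_name(&acct, "AAAAAAAAAAAA", 12)` on an account whose balance is 100 leaves `balance` holding bytes from the string. Nothing signals the mistake at the time of the write.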

If we look at the behavior of "bare metal" microcontroller systems without any OS and no virtual memory features, just raw physical memory - they will silently do exactly as told, for example overwriting unrelated variables and carrying on. That in turn could cause disastrous behavior if the application is mission-critical.

Why is it not recoverable?

Because the OS doesn’t know what your program is supposed to be doing.

Though in the "bare metal" scenario above, the system might be smart enough to place itself in a safe mode and keep going. Critical applications such as automotive and med-tech aren’t allowed to just stop or reset, as that in itself might be dangerous. They will rather try to "limp home" with limited functionality.

Why does this solution avoid that unrecoverable state? Does it even?

That solution just ignores the error and keeps going. It doesn't fix the problem that caused it. It's a very dirty patch, and setjmp/longjmp in general are very dangerous functions that should be avoided for any purpose.
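For illustration only, here is a rough sketch of what such a workaround tends to look like, using the POSIX variants `sigsetjmp`/`siglongjmp`. It is not necessarily the code from the linked solution. The names `try_write` and `count_faults_on_readonly_page` are hypothetical, and it assumes a Linux-like system where jumping out of a SIGSEGV handler works in practice and where a store to a read-only page raises SIGSEGV.

```c
#define _DEFAULT_SOURCE
#include <setjmp.h>
#include <signal.h>
#include <stddef.h>
#include <string.h>
#include <sys/mman.h>
#include <unistd.h>

static sigjmp_buf recovery_point;

static void on_segv(int signo)
{
    (void)signo;
    siglongjmp(recovery_point, 1);  /* abandon the faulting code path */
}

/* Returns 1 if the store completed, 0 if it faulted and was abandoned,
 * -1 if the handler could not be installed. */
int try_write(volatile char *target, char value)
{
    struct sigaction sa, old;
    memset(&sa, 0, sizeof sa);
    sa.sa_handler = on_segv;
    sigemptyset(&sa.sa_mask);
    if (sigaction(SIGSEGV, &sa, &old) != 0)
        return -1;

    int completed;
    if (sigsetjmp(recovery_point, 1) == 0) {
        *target = value;   /* may fault */
        completed = 1;
    } else {
        completed = 0;     /* reached via siglongjmp from the handler */
    }

    sigaction(SIGSEGV, &old, NULL);  /* restore the previous handler */
    return completed;
}

/* Retries the same bad store several times and counts how often it faults.
 * Returns -1 if the read-only page could not be set up. */
int count_faults_on_readonly_page(int attempts)
{
    long page_size = sysconf(_SC_PAGESIZE);
    if (page_size <= 0)
        return -1;
    char *page = mmap(NULL, (size_t)page_size, PROT_READ,
                      MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    if (page == MAP_FAILED)
        return -1;

    int faults = 0;
    for (int i = 0; i < attempts; i++)
        if (try_write(page, 'X') == 0)
            faults++;

    munmap(page, (size_t)page_size);
    return faults;
}
```

Control does come back to the caller, but every retry faults again because the pointer is as wrong as it was before. Any locks, allocations, or half-updated data belonging to the abandoned code path are also left in whatever state they were in when the jump happened.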

We have to realize that a segmentation fault is a symptom of a bug, not the cause.
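Treating the cause rather than the symptom usually means validating the access before it happens. A minimal sketch, with a hypothetical `checked_store` helper that refuses an out-of-bounds index instead of relying on the hardware to catch it:

```c
#include <stddef.h>

/* Hypothetical helper: stores value at array[index] only when the index is
 * in range. Returns 0 on success and -1 when the store is rejected. */
int checked_store(int *array, size_t length, size_t index, int value)
{
    if (array == NULL || index >= length)
        return -1;
    array[index] = value;
    return 0;
}
```

Here the caller gets an ordinary error code it can act on, and no memory outside the array is ever touched.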