Application crashes with no explanation
I'd like to apologize in advance, because this is not a very good question.
I have a server application that runs as a service on a dedicated Windows server. Very randomly, this application crashes and leaves no hint as to what caused the crash.
When it crashes, the event logs have an entry stating that the application failed, but the entry gives no clue as to why. It also gives some information on the faulting module, but that doesn't seem very reliable, as the faulting module is usually different on each crash. For example, the latest said it was ntdll, the one before that said it was libmysql, the one before that said it was netsomething, and so on.
Every single thread in the application is wrapped in a try/catch (...) (anything thrown from an exception handler or not specifically caught), a __try/__except (structured exceptions), and a try/catch (specific C++ exceptions). The application is compiled with /EHa, so the catch-all will also catch structured exceptions.
All of these exception handlers do the same thing. First, a crash dump is created. Second, an entry is logged to a new file on disk. Third, an entry is logged in the application logs. In the case of these crashes, none of this is happening. The bottommost exception handler (the try/catch (...)) does nothing; it just terminates the thread. The main application thread is asleep and has no chance of throwing an exception.
The application log files just stop logging. Shortly after, the process that monitors the server notices that it's no longer responding, sends an alert, and starts it again. If the server monitor notices that the server is still running, but just not responding, it takes a dump of the process and reports this, but this isn't happening.
The only other reason for this behavior that I can come up with, aside from uncaught exceptions, is a call to exit or similar. Searching the code brings up no calls to any functions that could terminate the process. I've also made sure that the program isn't terminating normally (i.e. a stop request from the service manager).
We have tried running it with WinDbg attached (no chance to use Visual Studio, the overhead is too high), but it didn't report anything when the crash occurred.
What can cause an application to crash like this? We're beginning to run out of options and are starting to consider that it might be a hardware failure, but that seems a bit unlikely to me.
Recommended answer
If your app is evaporating and not generating a dump file, then it is likely that an exception is being generated which your app doesn't (or can't) handle. This could happen in two instances:
1) A top-level exception is generated and there is no matching catch block for that exception type.
2) You have a matching catch block (such as catch(...)), but you are generating an exception within that handler. When this happens, Windows will rip the bones from your program. Your app will simply cease to exist. No dump will be generated, and virtually nothing will be logged. This is Windows' last-ditch effort to keep a rogue program from taking down the entire system.
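The second case can be sketched in portable C++. This is only an illustration under stated assumptions: WriteCrashLog and RunWorker are hypothetical names, and the sketch shows the standard C++ rule that an exception thrown inside a catch block leaves that block, not the Windows-specific behavior described above.

```cpp
#include <stdexcept>
#include <string>

// Hypothetical logging step that itself fails. It stands in for any
// handler-side work (dump writing, file logging) that can throw.
inline void WriteCrashLog(const std::string& /*what*/) {
    throw std::runtime_error("logging failed inside the handler");
}

// Returns a description of which handler ended up seeing the failure.
inline std::string RunWorker() {
    try {
        try {
            throw std::logic_error("original failure");
        } catch (...) {
            // The handler's own work throws. The original exception is
            // discarded and the new one leaves this catch block.
            WriteCrashLog("worker failed");
            return "inner handler completed";  // never reached
        }
    } catch (const std::exception& e) {
        // Without this outer handler the second exception would be
        // unhandled and the process would be terminated.
        return std::string("outer handler saw: ") + e.what();
    }
}
```

The point of the sketch is that the inner handler never finishes its logging, so a thread whose outermost handler can itself throw has nothing left to record the failure.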
A note about catch(...). This is patently Evil. There should (almost) never be a catch(...) in production code. People who write catch(...) generally argue one of two things:
"My program should never crash. If anything happens, I want to recover from the exception and continue running. This is a server application! ZOMG!"
-or-
"My program might crash, but if it does I want to create a dump file on the way down."
The former is a naive and dangerous attitude because if you do try to handle and recover from every single exception, you are going to do something bad to your operating footprint. Maybe you'll munch the heap, keep resources open that should be closed, create deadlocks or race conditions, who knows. Your program will suffer from a fatal crash eventually. But by that time the call stack will bear no resemblance to what caused the actual problem, and no dump file will ever help you.
The latter is a noble and robust approach, but implementing it is much more difficult than it might seem, and it is fraught with peril. The problem is that you have to avoid generating any further exceptions in your exception handler, and your machine is already in a very wobbly state. Operations which are normally perfectly safe are suddenly hand grenades. new, delete, any CRT function, string formatting, even stack-based allocations as simple as char buf[256] could make your application go >POOF< and be gone. You have to assume the stack and the heap both lie in ruins. No allocation is safe.
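One way to stay within those limits is to build any crash message in storage reserved at program start, using only hand-written loops. The following is a minimal sketch under that assumption; g_crashMessage, AppendText, AppendHex32 and FormatCrashMessage are hypothetical names, and the buffer size is arbitrary.

```cpp
#include <cstddef>
#include <cstdint>

// Storage reserved at program start; nothing here allocates at crash time.
static char g_crashMessage[64];

constexpr char kHexDigits[] = "0123456789ABCDEF";

// Appends src to dst starting at pos without calling any library function.
// Always leaves room for the terminator and returns the new position.
inline std::size_t AppendText(char* dst, std::size_t cap, std::size_t pos,
                              const char* src) {
    if (pos >= cap) return pos;  // nothing can be written safely
    while (*src != '\0' && pos + 1 < cap) {
        dst[pos++] = *src++;
    }
    dst[pos] = '\0';
    return pos;
}

// Appends value as eight upper-case hex digits, truncating if out of room.
inline std::size_t AppendHex32(char* dst, std::size_t cap, std::size_t pos,
                               std::uint32_t value) {
    if (pos >= cap) return pos;
    for (int shift = 28; shift >= 0 && pos + 1 < cap; shift -= 4) {
        dst[pos++] = kHexDigits[(value >> shift) & 0xF];
    }
    dst[pos] = '\0';
    return pos;
}

// Builds "exception code 0x????????" in the static buffer.
inline const char* FormatCrashMessage(std::uint32_t code) {
    std::size_t pos = 0;
    pos = AppendText(g_crashMessage, sizeof(g_crashMessage), pos,
                     "exception code 0x");
    AppendHex32(g_crashMessage, sizeof(g_crashMessage), pos, code);
    return g_crashMessage;
}
```

The helpers still use a few words of stack for their arguments, so this reduces the risk rather than removing it. It does avoid new, delete, sprintf and a local char array.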
Moreover, there are exceptions that a catch block simply can't catch, such as SEH exceptions. For that reason, I always write an unhandled-exception handler and register it with Windows via SetUnhandledExceptionFilter. Within my exception handler, every single byte I need comes from static allocation, reserved before the program even starts up. The best (most robust) thing to do within this handler is to trigger a separate application to start up, which will generate a MiniDump file from outside of your application. However, you can generate the MiniDump from within the handler itself if you are extremely careful not to call any CRT function directly or indirectly. Basically, if it isn't an API function you're calling, it probably isn't safe.
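A rough sketch of the in-process variant might look like the following. It is not the answer author's code: the dump path, the function names and the choice of MiniDumpNormal are assumptions made for illustration, and the more robust route described above (starting a separate process to write the dump) is not shown. The Windows part is guarded so the block still compiles elsewhere as a no-op.

```cpp
#ifdef _WIN32
#include <windows.h>
#include <dbghelp.h>  // MiniDumpWriteDump
#ifdef _MSC_VER
#pragma comment(lib, "dbghelp.lib")
#endif

constexpr bool kCrashHandlerSupported = true;

// Hypothetical dump location, fixed at compile time so the handler does
// not have to build a path at crash time.
static const wchar_t kDumpPath[] = L"C:\\dumps\\server_crash.dmp";

// Reserved statically so the handler needs no heap allocation.
static MINIDUMP_EXCEPTION_INFORMATION g_dumpInfo;

// Called by Windows when no other handler takes the exception.
// Only Windows API functions are used here, no CRT calls.
static LONG WINAPI WriteDumpOnUnhandledException(EXCEPTION_POINTERS* info) {
    HANDLE file = CreateFileW(kDumpPath, GENERIC_WRITE, 0, nullptr,
                              CREATE_ALWAYS, FILE_ATTRIBUTE_NORMAL, nullptr);
    if (file != INVALID_HANDLE_VALUE) {
        g_dumpInfo.ThreadId = GetCurrentThreadId();
        g_dumpInfo.ExceptionPointers = info;
        g_dumpInfo.ClientPointers = FALSE;
        MiniDumpWriteDump(GetCurrentProcess(), GetCurrentProcessId(), file,
                          MiniDumpNormal, &g_dumpInfo, nullptr, nullptr);
        CloseHandle(file);
    }
    return EXCEPTION_EXECUTE_HANDLER;  // let the process terminate
}

// Registers the filter; call once, early in startup.
bool InstallCrashHandler() {
    SetUnhandledExceptionFilter(WriteDumpOnUnhandledException);
    return true;
}
#else
constexpr bool kCrashHandlerSupported = false;

// No-op on other platforms so the sketch stays compilable.
bool InstallCrashHandler() { return false; }
#endif
```

InstallCrashHandler() would be called once near the start of the service's entry point, before any worker threads are created.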