Java VM: reproducible SIGSEGV on both 1.6.0_17 and 1.6.0_18, how to report?

2022-01-12 00:00:00 jvm segmentation-fault crash java crash-reports

EDIT: This reproducible SIGSEGV happens on a Linux machine with more than one processor and more than 2 GB of memory, so Java defaults to -server mode. Interestingly, if I force "-client" the crash no longer occurs. (I'm still not sure what to do with my reproducible SIGSEGV, but it's interesting nonetheless.)
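To confirm which VM flavour a given launch actually ended up with, the VM can be asked to describe itself. The following is only a minimal sketch: the class name VmModeCheck is made up for illustration, and java.vm.info is a property that HotSpot sets but other VMs may leave null.

```java
class VmModeCheck {
    /** Builds a one-line description of the running VM from system properties. */
    static String describeVm() {
        // java.vm.name and java.vm.version are standard properties;
        // java.vm.info is HotSpot-specific and may print as "null" elsewhere.
        return System.getProperty("java.vm.name") + " "
                + System.getProperty("java.vm.version") + " ("
                + System.getProperty("java.vm.info") + ")";
    }

    public static void main(String[] args) {
        System.out.println(describeVm());
    }
}
```

Running it once with "-server" and once with "-client" on the affected machine should show whether the flag was honoured.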

First, note that this is related but not identical to the following question, because in our case only a SIGSEGV occurs, and we can trigger it reliably:

JVM OutOfMemory error "death spiral" (not memory leak)

It's related because it happens when we feed our app a "deluge of data": the data comes from text files and is then number-crunched (yes, financial number crunching in Java).

I can reliably make a JVM crash with a SIGSEGV using only valid Java code.

NOTE: I can invariably crash both JVM 1.6.0_17 and JVM 1.6.0_18, and this question is not about how to work around the issue (for example, playing with VM parameters may fix it, but that's not what I'm after; I want to know what to do with this always-reproducible SIGSEGV).

I've got a workaround, which simply consists of using Java 1.5 when launching our app (while still using Java 1.6 to run IntelliJ IDEA, etc. on the same machine, simultaneously). My question is whether this should be reported and, if so, how to report it given that the log itself (the full hs_err_..._log) contains proprietary information.

A hardware error can be ruled out:

this is happening on a workstation that regularly reaches months of uptime (I only reboot it when critical security patches affecting my trimmed-down and hardened Debian Linux are issued, which really doesn't happen often) and on which applications never crash (making it very unlikely that it's a hardware issue on that machine [more below])

the same application works perfectly on that same machine under a 1.5 JVM under the same load (this is how I'm testing the app: I simply launch it under a 1.5 VM)

the same application works perfectly fine on more than one hundred client machines under the same (gigantic) load (it never crashed once on Windows + JVM 1.5 or 1.6 and never crashed once on OS X + JVM 1.5 or 1.6 [a crash would mean an instant phone call from the client])

other applications on that same machine and the same 1.6.0_17 or 1.6.0_18 JVM never crash (for example, I've got two instances of IntelliJ IDEA running as two different users on that same machine and they don't crash)

the machine is tested with memtest "regularly" (before installing a new OS, which last happened when I installed Debian Lenny, not that long ago)

Here's the reproducible-on-demand SIGSEGV:

... $ uname -a
Linux saturn 2.6.26-2-686 #1 SMP Wed Nov 4 20:45:37 UTC 2009 i686 GNU/Linux
... $ export PATH=/home/wizard/jdk1.6.0_17/bin:$PATH
... $ java -version
java version "1.6.0_17"
Java(TM) SE Runtime Environment (build 1.6.0_17-b04)
Java HotSpot(TM) Server VM (build 14.3-b01, mixed mode)

Launch the app, feed it a "deluge of data", wait a few seconds...

Then, invariably, for 1.6.0_17:

#
# A fatal error has been detected by the Java Runtime Environment:
#
# SIGSEGV (0xb) at pc=0xb76d0080, pid=30793, tid=2514328464
#
# JRE version: 6.0_17-b04
# Java VM: Java HotSpot(TM) Server VM (14.3-b01 mixed mode linux-x86 )
# Problematic frame:
# V [libjvm.so+0x4bc080]
#
# An error report file with more information is saved as:
# /home/wizard/hs_err_pid30793.log
#
# If you would like to submit a bug report, please visit:
# http://java.sun.com/webapps/bugreport/crash.jsp

(note that the "[libjvm.so+0x4bc080]" line is the same for 1.6.0_17 at every SIGSEGV)

or for 1.6.0_18:

#
# A fatal error has been detected by the Java Runtime Environment:
#
# SIGSEGV (0xb) at pc=0xb77468f0, pid=722, tid=2514516880
#
# JRE version: 6.0_18-b07
# Java VM: Java HotSpot(TM) Server VM (16.0-b13 mixed mode linux-x86 )
# Problematic frame:
# V [libjvm.so+0x4d88f0]
#
# An error report file with more information is saved as:
# /home/wizard/hs_err_pid722.log
#
# If you would like to submit a bug report, please visit:
# http://java.sun.com/webapps/bugreport/crash.jsp
#
Aborted

(note that the "[libjvm.so+0x4d88f0]" line is the same for 1.6.0_18 at every SIGSEGV)

The problem is that the log file contains proprietary information that cannot be shared.

Producing a "tiny test case" that reproduces the issue isn't realistic either: as with the issue linked above, this only happens when a "deluge of data" is fed to the app.

Note that the exact same application, on exactly the same hardware, with exactly the same JVM but another version of Linux (I had Debian Etch previously) did NOT trigger that SIGSEGV once.

But this doesn't mean the JVM isn't at fault: it could still be a JVM issue.

Should I report this, and how? (Keep in mind that writing a "reproducible tiny test case" is delusional and that the log contains proprietary information that shouldn't be leaked.) Should I just edit the log and send it?
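If editing the log turns out to be the way to go, the scrubbing could be scripted rather than done by hand. The sketch below is only an illustration under stated assumptions: "com.example" is a hypothetical stand-in for the proprietary package prefix, the class name HsErrRedactor is made up, and it assumes the log contains a "java_command:" line, as HotSpot error logs commonly do. It needs Java 7 or later to run, which is independent of the JVM that crashed.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Pattern;

class HsErrRedactor {
    // Hypothetical proprietary prefix, in dotted (com.example.Foo) or
    // slashed (com/example/Foo) form, plus whatever name follows it.
    static final Pattern PROPRIETARY =
            Pattern.compile("com[./]example[./][\\w./$]*");

    /** Redacts one log line; lines without proprietary names pass through unchanged. */
    static String redactLine(String line) {
        if (line.startsWith("java_command:")) {
            // The launch command may reveal main class and data file names.
            return "java_command: <redacted>";
        }
        return PROPRIETARY.matcher(line).replaceAll("<redacted>");
    }

    /** Redacts every line of a log. */
    static List<String> redact(List<String> lines) {
        List<String> out = new ArrayList<String>();
        for (String line : lines) {
            out.add(redactLine(line));
        }
        return out;
    }

    public static void main(String[] args) throws IOException {
        // args[0] is the path to the hs_err log; the scrubbed log goes to stdout.
        List<String> lines = Files.readAllLines(Paths.get(args[0]), StandardCharsets.UTF_8);
        for (String line : redact(lines)) {
            System.out.println(line);
        }
    }
}
```

Native frames such as "V [libjvm.so+0x4bc080]" are left untouched, since they are presumably what a JVM developer would need most. The scrubbed output should still be reviewed by eye before sending, because a pattern like this cannot know about every sensitive string (environment variables, file paths, host names).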

What's the procedure for reporting such a reproducible SIGSEGV when the log contains proprietary information and a test case reproducing the issue isn't realistically doable?

Has any of you successfully opened such a bug and then seen it fixed in a subsequent Java release?

Do you think it's good "for the Java community" to report such an issue, or should I just not bother because it's not important?

Recommended answer

I had a similar problem when upgrading to JDK 1.6_18, and it seems to be solved by using the following options:

-server -Xms256m -Xmx748m -XX:MaxPermSize=128m -verbose:gc -XX:+PrintGCTimeStamps -Xloggc:/tmp/gc.log -XX:+PrintHeapAtGC -XX:+PrintGCDetails -XX:+HeapDumpOnOutOfMemoryError -XX:HeapDumpPath="/tmp" -XX:+UseParallelGC -XX:-UseGCOverheadLimit
# The following options only enable remote monitoring with jconsole, useful for observing JVM behaviour at runtime
-Dcom.sun.management.jmxremote -Dcom.sun.management.jmxremote.port=12345 -Dcom.sun.management.jmxremote.authenticate=false -Dcom.sun.management.jmxremote.ssl=false -Djava.rmi.server.hostname=MyHost

I haven't double-checked yet (it is a production environment), but I think the error had two causes:

1) Wrong heap and/or permanent generation settings (I think JDK 1.6 needs more heap and permanent space than previous JVM versions) caused an OutOfMemoryError, but

2) in the wrong original settings somebody wrote

-XX:+HeapDumpOnOutOfMemoryError="/tmp"

instead of

-XX:+HeapDumpOnOutOfMemoryError -XX:HeapDumpPath="/tmp"

so the JVM was probably unable to write the heap dump and we only got the SIGSEGV (previous versions wrote the heap dump to the working directory).
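A mistake like this is easy to miss in a long launch script, so it may help to have the application check the flags it was actually started with. This is just a sketch: the class name JvmFlagCheck and the helper isMalformedBooleanFlag are invented for illustration, and the heuristic (a -XX:+Name or -XX:-Name switch should not carry "=value") is an assumption about how boolean HotSpot flags are written, not an official validation rule.

```java
import java.lang.management.ManagementFactory;
import java.util.List;

class JvmFlagCheck {
    /** True for a boolean -XX switch that was (wrongly) given a value. */
    static boolean isMalformedBooleanFlag(String arg) {
        boolean isSwitch = arg.startsWith("-XX:+") || arg.startsWith("-XX:-");
        return isSwitch && arg.indexOf('=') >= 0;
    }

    public static void main(String[] args) {
        // Arguments passed to the JVM itself (not to the application's main method).
        List<String> jvmArgs = ManagementFactory.getRuntimeMXBean().getInputArguments();
        boolean hasDumpPath = false;
        for (String arg : jvmArgs) {
            if (isMalformedBooleanFlag(arg)) {
                System.out.println("Suspicious flag: " + arg);
            }
            if (arg.startsWith("-XX:HeapDumpPath=")) {
                hasDumpPath = true;
            }
        }
        System.out.println("HeapDumpPath set explicitly: " + hasDumpPath);
    }
}
```

The same helper could be applied to the argument list of a launch script before the JVM is started at all.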

Check the -server -XX:+UseParallelGC -XX:-UseGCOverheadLimit options too. I think playing with VM parameters is not a workaround but the right approach, also because the garbage collector (among other things) changed between 1.5 and 1.6.
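To see which collectors a given JVM and flag combination really selects, the standard management API can list them. A minimal sketch, with the class name GcReport chosen only for illustration:

```java
import java.lang.management.GarbageCollectorMXBean;
import java.lang.management.ManagementFactory;
import java.util.ArrayList;
import java.util.List;

class GcReport {
    /** Names of the garbage collectors active in this JVM. */
    static List<String> collectorNames() {
        List<String> names = new ArrayList<String>();
        for (GarbageCollectorMXBean gc : ManagementFactory.getGarbageCollectorMXBeans()) {
            names.add(gc.getName());
        }
        return names;
    }

    public static void main(String[] args) {
        for (GarbageCollectorMXBean gc : ManagementFactory.getGarbageCollectorMXBeans()) {
            // Collection count and time are cumulative since JVM start.
            System.out.println(gc.getName() + ": " + gc.getCollectionCount()
                    + " collections, " + gc.getCollectionTime() + " ms");
        }
    }
}
```

Running it under 1.5 and under 1.6 with the same options would show whether the two versions end up with different collectors by default.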
