Bug 2162365
Summary: | JVM crash in PhaseIdealLoop::spinup with OpenJDK 17
---|---
Product: | Red Hat Enterprise Linux 9
Component: | java-17-openjdk
Version: | unspecified
Hardware: | x86_64
OS: | Linux
Status: | CLOSED CURRENTRELEASE
Severity: | high
Priority: | unspecified
Reporter: | Simeon Andreev <simeon.andreev>
Assignee: | Roland Westrelin <rwestrel>
QA Contact: | OpenJDK QA <java-qa>
CC: | asaji, loskutov, rwestrel
Target Milestone: | rc
Type: | Bug
Last Closed: | 2023-03-02 13:18:38 UTC
Created attachment 1939137 [details]
Replay log from the crash.
Stack trace:

    Stack: [0x00007fffc4427000,0x00007fffc4528000],  sp=0x00007fffc4522c40,  free space=1007k
    Native frames: (J=compiled Java code, j=interpreted, Vv=VM code, C=native code)
    V  [libjvm.so+0xd8ed42]  PhaseIdealLoop::spinup(Node*, Node*, Node*, Node*, Node*, small_cache*) [clone .part.0]+0x52
    V  [libjvm.so+0xd8f3b2]  PhaseIdealLoop::handle_use(Node*, Node*, small_cache*, Node*, Node*, Node*, Node*, Node*)+0x72
    V  [libjvm.so+0xd9046e]  PhaseIdealLoop::do_split_if(Node*)+0xf8e
    V  [libjvm.so+0xad6b4b]  PhaseIdealLoop::split_if_with_blocks(VectorSet&, Node_Stack&)+0x17b
    V  [libjvm.so+0xacdc3a]  PhaseIdealLoop::build_and_optimize()+0x101a
    V  [libjvm.so+0x5d061f]  PhaseIdealLoop::optimize(PhaseIterGVN&, LoopOptsMode)+0x16f
    V  [libjvm.so+0x5ce282]  Compile::Optimize()+0xb92
    V  [libjvm.so+0x5cfe75]  Compile::Compile(ciEnv*, ciMethod*, int, bool, bool, bool, bool, bool, DirectiveSet*)+0xe65
    V  [libjvm.so+0x50ffb9]  C2Compiler::compile_method(ciEnv*, ciMethod*, int, bool, DirectiveSet*)+0xe9
    V  [libjvm.so+0x5d8ea8]  CompileBroker::invoke_compiler_on_method(CompileTask*)+0xf68
    V  [libjvm.so+0x5d9b38]  CompileBroker::compiler_thread_loop()+0x508
    V  [libjvm.so+0xe59204]  JavaThread::thread_main_inner()+0x184
    V  [libjvm.so+0xe5c98e]  Thread::call_run()+0xde
    V  [libjvm.so+0xc162c1]  thread_native_entry(Thread*)+0xe1

Hi Simeon,

Thanks for reporting this. Something is wrong with a key compile graph structure here (the dominator tree). It is not going to be possible to identify the cause for that without a reproducer, I am afraid. In case we do come up with one, I am noting down what I have been able to deduce from the info in the crash log. Don't worry if it does not make any sense to you. If it does, enjoy!

A disassembly of the code leading up to the faulting address indicates this error is happening in the dominator tree search. Here's the relevant source and code disassembly:

    Node *PhaseIdealLoop::spinup( Node *iff_dom, Node *new_false, Node *new_true, Node *use_blk, Node *def, small_cache *cache ) {
      if (use_blk->is_top())      // Handle dead uses
        return use_blk;
      Node *prior_n = (Node*)((intptr_t)0xdeadbeef);
      Node *n = use_blk;          // Get path input
      assert( use_blk != iff_dom, "" );
      // Here's the "spinup" the dominator tree loop.  Do a cache-check
      // along the way, in case we've come this way before.
      while( n != iff_dom ) {     // Found post-dominating point?
        prior_n = n;
        n = idom(n);              // Search higher   <=== error here
        . . .
    0x7ffff6e29cf0: push   %rbp
    0x7ffff6e29cf1: mov    $0xdeadbeef,%r10d      <== r10 == prior_n = 0xdeadbeef
    0x7ffff6e29cf7: mov    %rsp,%rbp
    0x7ffff6e29cfa: push   %r15
    0x7ffff6e29cfc: push   %r14
    0x7ffff6e29cfe: push   %r13
    0x7ffff6e29d00: mov    %r8,%r13               <== move use_blk to r13 == n
    0x7ffff6e29d03: push   %r12
    0x7ffff6e29d05: mov    %rsi,%r12              <== move iff_dom to r12
    0x7ffff6e29d08: push   %rbx
      : various other input arg saves
    0x7ffff6e29d09: mov    %rdi,%rbx
    0x7ffff6e29d0c: sub    $0x38,%rsp
    0x7ffff6e29d10: mov    %rdx,-0x40(%rbp)
    0x7ffff6e29d14: mov    0x10(%rbp),%r14
    0x7ffff6e29d18: mov    %rcx,-0x48(%rbp)
    0x7ffff6e29d1c: mov    %r9,-0x38(%rbp)
    0x7ffff6e29d20: mov    %r8,-0x50(%rbp)
    0x7ffff6e29d24: cmp    %r12,%r13              <== compare n != iff_dom
    0x7ffff6e29d27: je     0x7ffff6e29d94
    0x7ffff6e29d29: nopl   0x0(%rax)
      : inlined code from PhaseIdealLoop.idom()
    0x7ffff6e29d30: mov    0x9f8(%rbx),%rax       <== load _idom array from PhaseIdealLoop (offset 0x9f8 is right)
    0x7ffff6e29d37: mov    0x28(%r13),%edx        <== load node _idx
    0x7ffff6e29d3b: lea    (%rax,%rdx,8),%rdi     <== index node in _idom array
    0x7ffff6e29d3f: mov    (%rdi),%r15            <== load _in array field of node
    0x7ffff6e29d42: mov    0x8(%r15),%rax         <== index entry _in[0]   !!!!! CRASH !!!!!
    0x7ffff6e29d46: cmpq   $0x0,(%rax)
    0x7ffff6e29d4a: jne    0x7ffff6e29d72
    0x7ffff6e29d4c: mov    0x28(%rbx),%ecx
    0x7ffff6e29d4f: nop
    0x7ffff6e29d50: mov    0x28(%r15),%eax
    0x7ffff6e29d54: cmp    %ecx,%eax
    0x7ffff6e29d56: jae    0x7ffff6e2d740

The _idom entry associated with the index for use_blk is null. That indicates that the dominator tree has not been correctly derived: given any block node, there ought to be a dominating node for that block.

N.B. there is no code for the use_blk->is_top() check. However, when spinup is called from handle_use, use_blk is known to be non-null. The only other call is a recursive one internal to spinup, so it seems the top-level check has been pushed down to the point of recursion.

Please provide a reproducer for this bug if you can.

Andrew, thanks for the analysis. Unfortunately we can't reproduce. The crashes are reported during compilation tasks in our automated build, and only very seldom, so we can neither pinpoint whether they are related to compilation of a specific module nor narrow them down otherwise.

One point may be related: I don't see from the crash dump how long it took until the crash. Typically during compilation of our product we don't have a single JVM build daemon like gradle, but start *a lot* of short-lived JVM processes with ant compile tasks (I would guess over 300 per product build).

So just a wild guess: could it be that the JVM runs into the crash right before or during shutdown, so that the code here is running in an "unexpected" JVM state? And we observe the crash only during build time simply because the probability of getting a crash on JVM shutdown is much higher there, with so many short-lived processes?

Also, while we can't reproduce (in a reasonable amount of time), we can add any code you wish or enable diagnostics and run our compile with that. Maybe this can help narrow down the problem...

(In reply to Andrey Loskutov from comment #6)
> Andrew, thanks for the analysis. Unfortunately we can't reproduce. The
> crashes are reported during compilation tasks in our automated build, and
> only very seldom, so we can neither pinpoint whether they are related to
> compilation of a specific module nor narrow them down otherwise.

In that case I am not sure there is anything we can do here.

> One point may be related: I don't see from the crash dump how long it took
> until the crash.
> Typically during compilation of our product we don't have a single JVM
> build daemon like gradle, but start *a lot* of short-lived JVM processes
> with ant compile tasks (I would guess over 300 per product build).
>
> So just a wild guess: could it be that the JVM runs into the crash right
> before or during shutdown, so that the code here is running in an
> "unexpected" JVM state? And we observe the crash only during build time
> simply because the probability of getting a crash on JVM shutdown is much
> higher there, with so many short-lived processes?

I think that is unlikely. The compiler thread is a VM thread operating on VM data. When the JVM shuts down it should get stopped cleanly as part of JVM shutdown. There should be no danger of that overwriting or freeing/unmapping data on which the compiler operates. Yet that is what we are seeing.

What I think is more likely is that the dominator tree is being incorrectly computed, possibly because the underlying graph is not in the expected format. That may well be to do with the use of ecj to produce the bytecode. There is a great deal of room for bytecode compilers to transform the same Java source into different bytecode representations (e.g. ecj and OpenJDK javac 'model' loops as do {...} while (...) vs while (...) {...}). The compiler may be making an unwarranted assumption about the shape of the bytecode which then manifests in an unexpected graph shape. The lack of consistent reproducibility could easily be because it depends on decisions that are timing or execution profile dependent (e.g. what code to inline, argument type profile optimizations etc).

(In reply to Andrew Dinn from comment #8)
> What I think is more likely is that the dominator tree is being
> incorrectly computed, possibly because the underlying graph is not in the
> expected format. That may well be to do with the use of ecj to produce the
> bytecode.

Sure, we use ecj and that definitely produces different class files compared to javac :-)

> The lack
> of consistent reproducibility could easily be because it depends on
> decisions that are timing or execution profile dependent (e.g. what code to
> inline, argument type profile optimizations etc).

Any idea how we could "force" wrong decisions? We fully control our environment, so we can run the JVM with any settings you want. Extra bytecode validation or whatever is needed to better diagnose the issue.

> Any idea how we could "force" wrong decisions? We fully control our
> environment, so we can run the JVM with any settings you want.
> Extra bytecode validation or whatever is needed to better diagnose the issue.
I am not really sure what would help with diagnosis. The only obviously related flags that you could play with are:

1. SplitIfBlocks (product : true)
2. PrintDominators (develop : false)
3. VerifyLoopOptimizations (notproduct : false)
Setting the first option to false will bypass the problem by avoiding the calls to split_if that are blowing up. That might be useful for getting the compile to finish (albeit with lower quality compiled code) where it currently crashes. It would only help to clarify the problem if we still see errors when it is false; that would indicate that the problem is not just in the dominator computation.
Option 2 requires running with a debug build. It will produce a *lot* of output in the scenario you describe where the error manifests. It really needs to be used with a reliable reproducer and, preferably, enabled from the debugger when you are about to compile a method that is known to cause the problem, with only one active compile thread.
Option 3 also requires running with a debug build. It might possibly catch some problems but no guarantees. You can combine it with PrintOpto to get detailed info about the loop transforms, including the SplitIf transformation. However, that will lead to the same information overload outcome as option 2. So, again best used with a reproducer under debug.
You are probably getting all the bytecode verification you need (flag BytecodeVerificationRemote defaults to true).
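For reference, a rough sketch of how those switches might be passed to the build JVMs; the launch command shown is only illustrative, and (as noted above) the last two flags are only accepted by debug (fastdebug/slowdebug) builds:

```sh
# Product build: bypass the split-if transformation that is crashing.
java -XX:-SplitIfBlocks <your usual build/launch arguments>

# Debug build only: dominator/loop diagnostics, optionally combined with PrintOpto.
java -XX:+PrintDominators -XX:+VerifyLoopOptimizations -XX:+PrintOpto <your usual build/launch arguments>
```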
I've discussed this in my team; we don't have the resources to try to reproduce. We are undergoing a RHEL 9 update, with which we'll update to the latest available OpenJDK 17. This will likely take at least a few months. After the move, if we still see the issue when compiling our product (and it's not fixed by the OpenJDK 17 update), we'll look into reproducing the problem.

Due to problems in our IT infrastructure we are not sure how often the issue occurs. If that problem is fixed and we learn that the crash is "too frequent", we might change priority and look into reproducing the crash sooner.

(In reply to Andrew Dinn from comment #8)
> What I think is more likely is that the dominator tree is being
> incorrectly computed, possibly because the underlying graph is not in the
> expected format. That may well be to do with the use of ecj to produce the
> bytecode.

We've got the same crash with the JVM running spotbugs code *compiled by javac*.

    SIGSEGV (0xb) at pc=0x00007ffff6e29d42, pid=9275, tid=9302
    Problematic frame:
    V  [libjvm.so+0xd8ed42]  PhaseIdealLoop::spinup(Node*, Node*, Node*, Node*, Node*, small_cache*) [clone .part.0]+0x52

    Current thread (0x00007ffff01e5770):  JavaThread "C2 CompilerThread0" daemon [_thread_in_native, id=9302, stack(0x00007fffa47fc000,0x00007fffa48fd000)]

    Current CompileTask:
    C2:  17367 10126   !   4  edu.umd.cs.findbugs.detect.FindPuzzlers::sawOpcode (4108 bytes)

    Stack: [0x00007fffa47fc000,0x00007fffa48fd000],  sp=0x00007fffa48f7c40,  free space=1007k
    Native frames: (J=compiled Java code, j=interpreted, Vv=VM code, C=native code)
    V  [libjvm.so+0xd8ed42]  PhaseIdealLoop::spinup(Node*, Node*, Node*, Node*, Node*, small_cache*) [clone .part.0]+0x52
    V  [libjvm.so+0xd8f3b2]  PhaseIdealLoop::handle_use(Node*, Node*, small_cache*, Node*, Node*, Node*, Node*, Node*)+0x72
    V  [libjvm.so+0xd9046e]  PhaseIdealLoop::do_split_if(Node*)+0xf8e
    V  [libjvm.so+0xad6b4b]  PhaseIdealLoop::split_if_with_blocks(VectorSet&, Node_Stack&)+0x17b
    V  [libjvm.so+0xacdc3a]  PhaseIdealLoop::build_and_optimize()+0x101a
    V  [libjvm.so+0x5d061f]  PhaseIdealLoop::optimize(PhaseIterGVN&, LoopOptsMode)+0x16f
    V  [libjvm.so+0x5ce282]  Compile::Optimize()+0xb92
    V  [libjvm.so+0x5cfe75]  Compile::Compile(ciEnv*, ciMethod*, int, bool, bool, bool, bool, bool, DirectiveSet*)+0xe65
    V  [libjvm.so+0x50ffb9]  C2Compiler::compile_method(ciEnv*, ciMethod*, int, bool, DirectiveSet*)+0xe9
    V  [libjvm.so+0x5d8ea8]  CompileBroker::invoke_compiler_on_method(CompileTask*)+0xf68
    V  [libjvm.so+0x5d9b38]  CompileBroker::compiler_thread_loop()+0x508
    V  [libjvm.so+0xe59204]  JavaThread::thread_main_inner()+0x184
    V  [libjvm.so+0xe5c98e]  Thread::call_run()+0xde
    V  [libjvm.so+0xc162c1]  thread_native_entry(Thread*)+0xe1

Beside the identical crash stack, the other common thing here is that the code being optimized is *huge*. Both org.eclipse.jdt.internal.compiler.lookup.BinaryTypeBinding::createMethod (which we saw in the crash reported before) and edu.umd.cs.findbugs.detect.FindPuzzlers::sawOpcode (which we see now) are pretty huge "spaghetti-style" methods. I wonder if that complex method code is contributing to the dominator tree being incorrectly computed.
See:

- https://github.com/spotbugs/spotbugs/blob/30884910e0b72e114a85d98a5b7b17b40d2d684a/spotbugs/src/main/java/edu/umd/cs/findbugs/detect/FindPuzzlers.java#L174
- https://github.com/eclipse-jdt/eclipse.jdt.core/blob/7f8a17fea31dbd5361e54f274d09866c5e7982a0/org.eclipse.jdt.core.compiler.batch/src/org/eclipse/jdt/internal/compiler/lookup/BinaryTypeBinding.java#L928

With the new crash reported in spotbugs code we are working on creating a reproducer. We've managed so far to reproduce it in 15 out of 3000 executions (which takes ~8 hours), which is a kind of "stable" reproducer (I will attach crash logs in a moment). The crash happens while we start spotbugs analysis via ant on the application code in one of our projects. There is no Eclipse/ecj involved, just a pure standalone ant task that is supposed to do a static code analysis via spotbugs and usually completes in ~20 seconds.

An interesting point we observed so far: the crash happens only on a "small" workstation with 64 GB RAM / 12 cores (and the original one was reported on a 32 GB / 4-core VM). It is not reproducible so far on 128+ GB / 16-core workstations.

Note that in the crash cases the JVM used a max heap below the magic 32 GB border, so it was using "compressed pointers". Not sure if that could be one of the important factors contributing to the crash, but so far we haven't received bug reports from "big" workstations.

We plan to continue improving the reproducer code / changing the test environment so we can provide something we can share, or gain insight into what exactly contributes to the crash. If we can instrument something that would help with analysis of this issue, please give us a pointer.

Created attachment 1945451 [details]
crash logs from spotbugs execution
(In reply to Andrey Loskutov from comment #12)
. . .
> We've got same crash with JVM running spotbugs code *compiled by javac*.

Thanks for pursuing this. Good to know that it is not ecj, and even better that you now have a (relatively) reliable reproducer. Are you able to reproduce the failure any more or less consistently using the replay file to rerun the compilation up to FindPuzzlers.sawOpcode?

> Beside the identical crash stack, the other common thing here is that the
> code that is being optimized is *huge*.
>
> Both
> org.eclipse.jdt.internal.compiler.lookup.BinaryTypeBinding::createMethod (we
> saw in the reported crash before) and
> edu.umd.cs.findbugs.detect.FindPuzzlers::sawOpcode (we see now) are pretty
> huge "spaghetti-style" methods.
>
> I wonder if that complex method code is contributing to the dominator tree
> being incorrectly computed.

Either method size or complexity could be the immediate or indirect cause. However, it could also be many other things. The best way to find out would be to reproduce the problem in a debugger.

> Interesting point we observed so far: the crash happens only on a "small"
> workstation with 64 GB RAM / 12 cores (and original one was reported on 32
> GB / 4 core VM). It is not reproducible so far on 128+ GB / 16 core
> workstations.
>
> Note, that in the crash cases JVM used max heap below magic 32 GB border, so
> it was using "compressed pointers". Not sure if that could be one of
> important factors contributing to the crash, but so far we haven't received
> bug reports from "big" workstations.

It may relate to the use of compressed oops. One thing you could maybe usefully try, to check that hypothesis, is to explicitly disable compressed oops while running with a heap below 32 GB and see if the problem still happens. Of course, the test is asymmetrical, given the relatively low failure rate you are currently seeing -- if you don't see a failure, that doesn't guarantee compressed oops is the culprit.

> We plan to continue improving reproducer code / changing test environment so
> we can provide something we can share or have insight what exactly
> contributes to the crash.

Ok, thanks for pursuing it.

> If we can instrument something that would help with analysis of this issue,
> please give us a pointer.

The best thing would be for us to be able to reproduce the bug reliably, preferably in a debug build of OpenJDK, but even the ability to do so in a product release would be a big help. It doesn't really matter whether we achieve that by running your job or by rerunning compiles using the replay file.

(In reply to Andrew Dinn from comment #14)
> Are you able to reproduce the failure any more or less consistently
> using the replay file to rerun the compilation up to FindPuzzlers.sawOpcode?

Please provide instructions on how to do that and I will try. I remember doing that via Stack Overflow a few years ago (see my own answer https://stackoverflow.com/questions/33759206/java-replay-log-diagnosing-out-of-memory-error), but the links I put there are gone.

> The best way to find out
> would be to reproduce the problem in a debugger.

:) I'm working towards a better reproducer...

> It may relate to the use of compressed oops. One thing you could maybe
> usefully try to check that hypothesis

Looks like it is not heap size dependent. I've run tests with smaller heaps with no crashes.
However, using some VM flags that were set by the JVM on the smaller workstation, I was able to reproduce the crash on the 16-core machine, with compressed oops disabled (I believe so, since the heap size was 32 GB). The flags I set on the 16-core machine were taken from a diff of "java -XX:+PrintFlagsFinal -version" output on the 12- vs. 16-core machines:

    -XX:CICompilerCount=4 -XX:NonNMethodCodeHeapSize=5839372 -XX:NonProfiledCodeHeapSize=122909434
    -XX:ProfiledCodeHeapSize=122909434 -XX:G1ConcRefinementThreads=10 -XX:ParallelGCThreads=10
    -XX:AllocatePrefetchInstr=0

> > If we can instrument something that would help with analysis of this issue,
> > please give us a pointer.
>
> The best thing would be for us to be able to reproduce the bug reliably,
> preferably in a debug build of OpenJDK

I will see if I can get a debug build somehow. Simeon is on vacation; he usually manages these JVM builds.

BTW, I've also got a core file - do you want/need it? That is about 200 MB packed, so I guess I can put it somewhere on the web if you need it.
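For reference, the replay mechanism asked about above is normally driven by HotSpot's diagnostic replay flags together with the replay_pid<pid>.log file written next to the hs_err log (DumpReplayDataOnError is on by default). Roughly, and treating this as a sketch rather than exact instructions:

```sh
# Re-run just the failing C2 compilation recorded in the replay file.
# The classpath must match the crashed run so the classes referenced in the
# replay file can be resolved; <entry point> stands for whatever main class
# or jar the original run used.
java -XX:+UnlockDiagnosticVMOptions \
     -XX:+ReplayCompiles \
     -XX:ReplayDataFile=replay_pid<pid>.log \
     -cp <same classpath as the crashed run> <entry point>
```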
> packed, so I guess I can put it somewhere on the web if you need.
Yes, please.
If that doesn't help, we can look at the replay thing next unless you find a reproducer in the meantime.
(In reply to Roland Westrelin from comment #18)
> > BTW, I've got also a core file - do you want/need it? That is about 200 MB
> > packed, so I guess I can put it somewhere on the web if you need.
>
> Yes, please.
> If that doesn't help, we can look at the replay thing next unless you find a
> reproducer in the meantime.

Here it is: https://drive.google.com/file/d/1NrWkj0aOztD8basGtFP-vUuWv-pft0EF/view?usp=sharing

Meanwhile I have ~5+ core files (and growing); if that one is not good enough, I can give you more :-)

> Here it is:
> https://drive.google.com/file/d/1NrWkj0aOztD8basGtFP-vUuWv-pft0EF/view?usp=sharing

Thanks.

> Meanwhile I have ~5+ core files (and growing); if that one is not good enough, I
> can give you more :-)

Could you share one of them from an Eclipse crash?

(In reply to Roland Westrelin from comment #20)
> > Meanwhile I have ~5+ core files (and growing); if that one is not good enough, I
> > can give you more :-)
>
> Could you share one of them from an Eclipse crash?

By "Eclipse" you probably mean ecj compiler task crashes? No, the crashes I can reproduce (and from which I have core files) are all from spotbugs task execution, not from the Eclipse compiler.

I'm close to the state where I can share a reproducer, because I've managed to crash it without using our internal code yesterday. Give me a day more to polish that.

The crash rate is still 1/1000 executions, but with the script & -XX:OnError command one can attach a debugger or do something else.
> crash it without using our internal code yesterday.
> Give me a day more to polish that.
>
> The crash rate is still 1/1000 executions, but with the script & -XX:OnError
> command one can attach debugger or do something else.
Excellent! Thanks for taking the time to put a reproducer together.
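For reference, a minimal sketch of the -XX:OnError hook mentioned above; the gdb attach is just one possible action, and %p is expanded by the JVM to the pid of the crashing process:

```sh
# Attach gdb automatically when the JVM hits a fatal error; any shell command
# (e.g. copying hs_err/core files aside) can be used here instead of gdb.
java -XX:OnError="gdb -p %p" <your usual arguments>
```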
Created attachment 1945931 [details]
reproducer application

I've attached bug_2162365_reproducer.zip, which contains all required binaries except the JVM itself, and a script that will run the loop. Extract it somewhere with enough space for future core files and see the README for more details on how to run the reproducer.

The script just runs the ant "findbugs" task in a loop, and the task just runs spotbugs over the spotbugs & bcel libraries using a single FindPuzzlers bug detector (see https://github.com/spotbugs/spotbugs/blob/30884910e0b72e114a85d98a5b7b17b40d2d684a/spotbugs/src/main/java/edu/umd/cs/findbugs/detect/FindPuzzlers.java#L174).

It crashes 1 to 3 times out of 1000 executions in our environment. I've got it to crash on a 4-core virtual machine running RHEL 9 / Java 17.0.2 from RHEL 9, and on 12- and 16-core bare-metal machines running RHEL 7.9 / Java 17.0.4.

> I've attached bug_2162365_reproducer.zip that contains all required binaries
> except JVM itself and a script that will run the loop.
Thanks. I will try it.
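The driver described above is essentially a retry loop around the ant target; a rough sketch of its shape (the real script and target names are the ones shipped in bug_2162365_reproducer.zip, so this is only illustrative):

```sh
# Run the "findbugs" ant target repeatedly until the forked spotbugs JVM crashes;
# a crash makes the task, and therefore ant, exit with a non-zero status.
for i in $(seq 1 1000); do
    ant findbugs || { echo "failure/crash on iteration $i"; break; }
done
```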
I could reproduce it and analyze it. Thanks again for the reproducer, that was very helpful.

I believe it's a known issue: https://bugs.openjdk.org/browse/JDK-8280696, which was backported to 17.0.5. With the patch for that bug fix, I can't reproduce the crash anymore. Can you confirm you haven't seen it with 17.0.5 or newer?

(In reply to Roland Westrelin from comment #25)
> I could reproduce it and analyze it. Thanks again for the reproducer, that
> was very helpful.
>
> I believe it's a known issue: https://bugs.openjdk.org/browse/JDK-8280696
> that was backported to 17.0.5. With the patch for that bug fix, I can't
> reproduce the crash anymore.

Since you've analyzed that, and bug JDK-8280696 provides zero information about crash preconditions, could you please elaborate on which preconditions need to be met to encounter it, and whether any workarounds are possible to avoid the crash? This would be helpful to decide whether we can "live" with the crash on 17.0.4 until we get a fix for https://bugzilla.redhat.com/show_bug.cgi?id=2138897, or whether we have to plan & evaluate an update to 17.0.6 to get the crash fix ASAP.

> Can you confirm you haven't seen it with 17.0.5
> or newer?

Tests are running on 3 workstations using the reproducer / Java 17.0.6+10; I will give an update after a crash-free day.

> Since you've analyzed that, and bug JDK-8280696 provides zero information
> about crash preconditions, could you please elaborate which preconditions
> need to be met to encounter it and if there are any workarounds possible to
> avoid the crash?

The only reliable way to avoid the crash is to exclude the method from JIT compilation:

    -XX:CompileCommand=exclude,edu.umd.cs.findbugs.detect.FindPuzzlers::sawOpcode

Unless that method is critical for performance, that could be good enough.

I can't answer the question about preconditions. The transformation that triggers the crash is used routinely; disabling it entirely would likely have a significant performance impact. Some code pattern triggers this. Figuring out what it is would take a while (in your crash it shows up after the compiler has extensively transformed the code, and tracing back what happens is complicated). It's also unlikely to help: the sawOpcode method would then need to be modified somehow so that the code pattern is removed from it. The JIT compiler trims the method it compiles, inlines some others, and duplicates parts of the code, so it's quite possible it wouldn't even be possible to locate where that code pattern is in sawOpcode.

> Tests are running on 3 workstations using the reproducer / Java 17.0.6+10, I
> will give an update after a crash free day.

Thanks.

I haven't seen crashes so far, after ~4000x5 executions of the reproducer on 3 different workstations using the Java 17.0.6+10 OpenJDK build from Adoptium, so it looks like this bug can be closed. Thank you for the analysis.

(In reply to Andrey Loskutov from comment #28)
> I haven't seen crashes so far, after ~4000x5 executions of reproducer on 3
> different workstations using Java 17.0.6+10 OpenJDK build from Adoptium, so
> it looks like this bug can be closed. Thank you for analysis.

Thanks for doing the runs.
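For completeness, one way the exclusion directive above might be applied to the ant-forked spotbugs JVMs is via the JAVA_TOOL_OPTIONS environment variable, which every HotSpot JVM started from that shell picks up; this is a sketch, not part of the original thread's instructions:

```sh
# Exclude the problematic method from C2 compilation in all JVMs started from this shell,
# including the ones forked by ant/spotbugs. Each JVM prints a notice that the variable
# was picked up.
export JAVA_TOOL_OPTIONS="-XX:CompileCommand=exclude,edu.umd.cs.findbugs.detect.FindPuzzlers::sawOpcode"
ant findbugs
```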
Created attachment 1939136 [details]
Crash log from OpenJDK.

Description of problem:
We have an OpenJDK crash while compiling our source code with ECJ (the Eclipse Java Compiler).

Version-Release number of selected component (if applicable):

    openjdk version "17.0.4" 2022-07-19
    OpenJDK Runtime Environment Temurin-17.0.4+8 (build 17.0.4+8)
    OpenJDK 64-Bit Server VM Temurin-17.0.4+8 (build 17.0.4+8, mixed mode, sharing)

How reproducible:
We don't have steps to reproduce; so far we have seen the crash twice during builds.

Additional info:
We are still on RHEL 7.9, using the Eclipse Temurin JDK 17 builds for Linux. I'm unable to open an OpenJDK 17 bug for RHEL 7 though, so I'm opening one for RHEL 9. Fixing the crash on RHEL 9 is enough for us, since the fix will be in OpenJDK 17, which we can roll out.