Bug 1862163

Summary: Copying of Byte[] with System.arraycopy and Arrays.copyOf is much slower with JDK 11.0.8 compared to JDK 1.8.0_181
Product: Red Hat Enterprise Linux 7
Reporter: Simeon Andreev <simeon.andreev>
Component: java-11-openjdk
Assignee: Andrew John Hughes <ahughes>
Status: CLOSED WONTFIX
QA Contact: OpenJDK QA <java-qa>
Severity: urgent
Priority: unspecified
Version: 7.4
CC: ahughes, aogburn, jvanek, loskutov, mmillson, saahamad
Target Milestone: rc
Keywords: Reopened
Hardware: x86_64
OS: Linux
Type: Bug
Last Closed: 2023-08-17 20:39:42 UTC
Attachments (all flags: none):
- Mike environment
- JDK options in our environment.
- JDK 8 flags.
- JDK8 -XX:+UnlockDiagnosticVMOptions -XX:+PrintFlagsFinal
- JDK11 -XX:+UnlockDiagnosticVMOptions -XX:+PrintFlagsFinal
- JDK 11 flags.
- Test sources
- Test JAR
- Output with -XX:+PrintFlagsFinal on RHEL 7.4, JDK 1.8 and JDK 11.
- Output with -XX:+PrintAdaptiveSizePolicy on RHEL 7.4, JDK 1.8. On JDK 11, -Xlog:gc+ergo*=debug was used as a replacement.
- GC log from Java 8 181 with the reproduction snippet.
- GC log from Java 11 10 with the reproduction snippet.
- GC logs from original case of performance degradation.
- JDK8 profiling
- JDK11 profiling

Description Simeon Andreev 2020-07-30 15:34:21 UTC
Description of problem:

We are moving our product to run on Java 11 and are noticing performance problems in customer code. We have boiled *one* of the problems down to use of Arrays.copyOf(Byte[], int).

Copying Byte[] arrays is several times slower when done with JDK 11.0.8 (even slower on earlier versions) than with JDK 1.8.0_181.

Version-Release number of selected component (if applicable):

9.0.4
11.0.7
11.0.8
14+36 (early access for some JDK 14 build)

How reproducible:

Run the following snippet with Java 11 (11.0.8/11.0.7) vs. Java 8 (1.8.0_181)

import java.io.IOException;
import java.util.Arrays;

public class TestByteArrayCopy {

	public static void main(String[] args) throws IOException {
		System.out.println("java.version=" + System.getProperty("java.version"));
		long s = System.currentTimeMillis();
		
		long totalElapsed = 0;
		int iterations = 1000;
		int n = 2_000_000;
		
		Byte[] array = new Byte[n];
		for (int i = 0; i < array.length; i++) {
			array[i] = Byte.valueOf((byte) (i % Byte.MAX_VALUE));
		}
		
		for (int iteration = 0; iteration < iterations; iteration++) {
			long start = System.currentTimeMillis();
			
			array = Arrays.copyOf(array, array.length);
			
			// Same regression here:
//			Byte [] array2 = new Byte[array.length];
//			System.arraycopy(array, 0, array2, 0, array.length);
//			array = array2;
			
			long end = System.currentTimeMillis();
			long elapsed = end - start;
			totalElapsed += elapsed;
		}
		
		for (int i = 0; i < array.length; i++) {
			if(!array[i].equals(Byte.valueOf((byte) (i % Byte.MAX_VALUE)))) {
				throw new IllegalStateException();
			}
		}

		long e = System.currentTimeMillis();
		System.out.println("Arrays.copyOf(Byte[], int) repeated " + iterations + " times took: " + totalElapsed + "ms");
		System.out.println("entire snippet took: " + (e - s) + "ms");
	}
}

Actual results:

When run with JDK 1.8.0_181:

java.version=1.8.0_181
Arrays.copyOf(Byte[], int) repeated 1000 times took: 1759ms
entire snippet took: 1782ms

When run with JDK 11.0.8 (better than with JDK 11.0.7 but still very bad):

java.version=11.0.8
Arrays.copyOf(Byte[], int) repeated 1000 times took: 6981ms
entire snippet took: 7022ms

Expected results:

Performance comparable to JDK 1.8.0_181.

Additional info:

We run on RHEL 7.4; CPU details (HP Z640 workstation):
Architecture:          x86_64
CPU op-mode(s):        32-bit, 64-bit
Byte Order:            Little Endian
CPU(s):                16
On-line CPU(s) list:   0-15
Thread(s) per core:    2
Core(s) per socket:    8
Socket(s):             1
NUMA node(s):          1
Vendor ID:             GenuineIntel
CPU family:            6
Model:                 85
Model name:            Intel(R) Xeon(R) W-2145 CPU @ 3.70GHz
Stepping:              4
CPU MHz:               3699.774
CPU max MHz:           3700.0000
CPU min MHz:           1200.0000
BogoMIPS:              7400.00
Virtualization:        VT-x
L1d cache:             32K
L1i cache:             32K
L2 cache:              1024K
L3 cache:              11264K
NUMA node0 CPU(s):     0-15
Flags:                 fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm pbe syscall nx pdpe1gb rdtscp lm constant_tsc art arch_perfmon pebs bts rep_good nopl xtopology nonstop_tsc aperfmperf eagerfpu pni pclmulqdq dtes64 monitor ds_cpl vmx smx est tm2 ssse3 sdbg fma cx16 xtpr pdcm pcid dca sse4_1 sse4_2 x2apic movbe popcnt tsc_deadline_timer aes xsave avx f16c rdrand lahf_lm abm 3dnowprefetch epb cat_l3 cdp_l3 intel_ppin intel_pt mba tpr_shadow vnmi flexpriority ept vpid fsgsbase tsc_adjust bmi1 hle avx2 smep bmi2 erms invpcid rtm cqm mpx rdt_a avx512f avx512dq rdseed adx smap clflushopt clwb avx512cd avx512bw avx512vl xsaveopt xsavec xgetbv1 cqm_llc cqm_occup_llc cqm_mbm_total cqm_mbm_local ibpb ibrs stibp dtherm arat pln pts hwp hwp_act_window hwp_epp hwp_pkg_req spec_ctrl intel_stibp ssbd

Comment 2 Andrey Loskutov 2020-07-30 15:57:53 UTC
Note that the regression is biggest for the Byte array type, with a 4x performance decrease; for other boxed number types like Long, Double and Integer it is "only" 1.5 times slower, the same as for plain Object array types. Copying primitive arrays seems to be unaffected; we don't observe the regression with byte[] etc.
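
For reference, a minimal sketch of the primitive byte[] variant of the reproduction loop (not one of the original attachments), which does not show the regression described above:

```java
import java.util.Arrays;

// byte[] variant of the reproduction snippet above: a flat copy with no
// per-element references, reported unaffected by the regression.
public class TestPrimitiveByteArrayCopy {
    public static void main(String[] args) {
        int iterations = 1000;
        int n = 2_000_000;

        byte[] array = new byte[n];
        for (int i = 0; i < array.length; i++) {
            array[i] = (byte) (i % Byte.MAX_VALUE);
        }

        long totalElapsed = 0;
        for (int iteration = 0; iteration < iterations; iteration++) {
            long start = System.currentTimeMillis();
            array = Arrays.copyOf(array, array.length); // flat memory copy
            totalElapsed += System.currentTimeMillis() - start;
        }

        System.out.println("Arrays.copyOf(byte[], int) repeated " + iterations
                + " times took: " + totalElapsed + "ms");
    }
}
```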

Comment 3 Andrey Loskutov 2020-07-30 21:06:40 UTC
OK, I tried with Java 14.0.2 and openjdk-15-ea+21; there were at least two fixes, and I've found a few interesting facts.

First of all, it looks like the regression is caused by using really *big* arrays and affects G1 GC only.

Example with 

int iterations = 1000;
int n = 2_000_000;

cannot run as fast on Java 9-14 (with the default G1 GC) as on Java 8, regardless of how much heap is given to the JVM.

If I change the example to do basically the same work, copying the same amount of bytes but in smaller chunks:

int iterations = 2000;
int n = 1_000_000;

then we can make it "work" on Java 11 (the same performance as Java 8 on 11.0.8) by giving a -Xmx31g limit.

The fun starts if one tries to give a max heap > 32 GB: performance drops back to 5x slower than Java 8.
-XX:+UseCompressedOops is the key: it works only below the 32 GB limit.
So we have some really inefficient G1 GC algorithm in Java that can't deal with large arrays as fast as the Parallel GC in Java 8, and requires a lot more memory to do the same work.

I can "fix" Java 11 by running it with -XX:+UseParallelGC, at the same speed as Java 8 on the same heap sizes, even on 8 GB.

Now back to G1 GC. With this GC,

Java 14 has "only" a 2.5x regression (down from 4x).
Java 15-ea+21 performs identically fast; the regression is fixed for Byte[].

The question now is: can we identify and backport the G1 GC fixes from Java 14 and Java 15 to Java 11?
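
A sketch of the flag combinations discussed above (the java path and heap sizes follow this thread's examples; exact numbers will vary by machine):

```shell
# Default G1 GC on Java 11 shows the regression:
/usr/lib/jvm/java-11/bin/java -XX:+UseG1GC TestByteArrayCopy

# Workaround: Parallel GC (the Java 8 default) runs at Java 8 speed:
/usr/lib/jvm/java-11/bin/java -XX:+UseParallelGC TestByteArrayCopy

# Compressed oops are only in effect below the ~32 GB heap limit;
# above it, performance drops again:
/usr/lib/jvm/java-11/bin/java -XX:+UseG1GC -Xmx31g TestByteArrayCopy
/usr/lib/jvm/java-11/bin/java -XX:+UseG1GC -Xmx33g TestByteArrayCopy
```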

Comment 5 Mike Millson 2020-08-07 20:02:38 UTC
I'm not able to reproduce this. I'm seeing JDK11 much faster.

java -XX:+PrintGC -Xloggc:gc8.log -XX:+PrintGCDetails -XX:+PrintGCTimeStamps -XX:+PrintGCApplicationStoppedTime -XX:+UseG1GC TestByteArrayCopy

java.version=1.8.0_262
Arrays.copyOf(Byte[], int) repeated 1000 times took: 62509ms
entire snippet took: 62646ms

java -Xlog:gc*,safepoint=info:file=gc11.log:uptime:filecount=4,filesize=50M -XX:+UseG1GC TestByteArrayCopy 

java.version=11.0.8
Arrays.copyOf(Byte[], int) repeated 1000 times took: 25873ms
entire snippet took: 25942ms

Am I missing something?

Comment 6 Andrey Loskutov 2020-08-07 21:56:01 UTC
Mike, which system are you running on (RAM / CPU)?
We run on 16-core Xeons (see specs above) with a minimum of 64 GB RAM (typically 256), where we also measured the slowdown.
So in our environment the JVM probably uses much larger heap sizes / different flags than on yours.
It could also be a specific CPU flag that causes some low-level HotSpot optimization to fail, or not be enabled at all, on our CPUs.

Comment 7 Simeon Andreev 2020-08-08 07:16:24 UTC
java.version=1.8.0_181
Arrays.copyOf(Byte[], int) repeated 1000 times took: 6154ms
entire snippet took: 6186ms

java.version=1.8.0_262
Arrays.copyOf(Byte[], int) repeated 1000 times took: 7529ms
entire snippet took: 7564ms

java.version=11.0.7
Arrays.copyOf(Byte[], int) repeated 1000 times took: 8163ms
entire snippet took: 8208ms

java.version=11.0.8
Arrays.copyOf(Byte[], int) repeated 1000 times took: 9144ms
entire snippet took: 9189ms


The JRE 1.8 runs were done with the argument "-XX:+UseG1GC".

Comment 8 Simeon Andreev 2020-08-08 07:22:45 UTC
Please also see my comment on the RHEL support case: https://access.redhat.com/support/cases/#/case/02716392?commentId=a0a2K00000W94m7QAB

Comment 9 Mike Millson 2020-08-10 14:19:41 UTC
Hi Andrey,

I'm only testing on my laptop, but it's decent hardware.

os: RHEL 8.2
cpu: 4 with 2 cores = 8 cores
ram: 16gb

Full environment information attached as environment.txt.

Comment 10 Mike Millson 2020-08-10 14:20:37 UTC
Created attachment 1710962 [details]
Mike environment

Comment 11 Mike Millson 2020-08-10 14:21:57 UTC
I will test on RHEL7.

Comment 12 Andrey Loskutov 2020-08-10 14:34:28 UTC
I assume the bug can't be reproduced on machines with a low amount of RAM (below 64 GB), because the JVM then uses different flags and a much lower max heap.
I'm on vacation for a week, but I guess Someone can record which default VM settings are used (please check our internal 'tuning jvm' wiki page for how to print the flags auto-configured by the JVM for the current hardware).

My guess is that copying of object arrays is affected by the chosen memory or GC 'optimization' strategies. Maybe it is just wrong use of data structures, or 32/64-bit pointers etc., which especially affects large heaps.

I bet that with enough knowledge about where this is implemented, one could find the commit(s) where it was fixed in Java 15.
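
The flags the JVM auto-configures for the current hardware can be printed without running the application; for example, filtering for the heap-related ones:

```shell
# Print the ergonomically chosen defaults (works on JDK 8 and JDK 11)
# and keep only the heap-size and compressed-oops flags:
java -XX:+PrintFlagsFinal -version | grep -Ei 'heapsize|usecompressedoops'
```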

Comment 13 Andrey Loskutov 2020-08-10 14:36:25 UTC
(In reply to Andrey Loskutov from comment #12)

> I'm on vacation for a week, but I guess Someone can record
Stupid auto-correction. I meant *Simeon* of course.

Comment 14 Simeon Andreev 2020-08-10 14:40:32 UTC
Created attachment 1710967 [details]
JDK options in our environment.

I attached output of "java -XX:+UnlockDiagnosticVMOptions -XX:+PrintFlagsFinal -version" for our installed JDK 8 and JDK 11.

Comment 15 Mike Millson 2020-08-10 14:48:18 UTC
I wasn't able to reproduce it on a RHEL 7.7 lab box.

java.version=1.8.0_242
Arrays.copyOf(Byte[], int) repeated 1000 times took: 15333ms
entire snippet took: 15382ms

java.version=11.0.6
Arrays.copyOf(Byte[], int) repeated 1000 times took: 6618ms
entire snippet took: 6666ms

I updated to the latest Java 8/11 packages and saw the same.

java.version=1.8.0_262
Arrays.copyOf(Byte[], int) repeated 1000 times took: 17618ms
entire snippet took: 17667ms

java.version=11.0.8
Arrays.copyOf(Byte[], int) repeated 1000 times took: 6360ms
entire snippet took: 6404ms

RHEL 7.7
cpu cores = 32
ram 126gb

Comment 17 Mike Millson 2020-08-10 15:33:35 UTC
1) Please test using the same options I used:

JDK8:
java -XX:+PrintGC -Xloggc:gc8.log -XX:+PrintGCDetails -XX:+PrintGCTimeStamps -XX:+PrintGCApplicationStoppedTime -XX:+UseG1GC TestByteArrayCopy

JDK11:
java -Xlog:gc*,safepoint=info:file=gc11.log:uptime:filecount=4,filesize=50M -XX:+UseG1GC TestByteArrayCopy 

2) And attach FlagsFinal output using the same options:

JDK8:
java -XX:+PrintFlagsFinal -XX:+PrintGC -Xloggc:gc8.log -XX:+PrintGCDetails -XX:+PrintGCTimeStamps -XX:+PrintGCApplicationStoppedTime -version > flags.8.txt

JDK11:
java -XX:+PrintFlagsFinal -Xlog:gc*,safepoint=info:file=gc11.log:uptime:filecount=4,filesize=50M -XX:+UseG1GC -version > flags11.txt

That will allow a good baseline comparison.

Comment 18 Andrey Loskutov 2020-08-10 15:35:08 UTC
Simeon, can you please try on the RHEL 8 test machine we have? Maybe some C++ libraries were fixed since 7.4?
Mike, could you also attach the output of "java -XX:+UnlockDiagnosticVMOptions -XX:+PrintFlagsFinal -version"?

Also, it would not be the first time that a CPU-family-specific optimization in the JVM does not work for the Xeons we have, or that RHEL needs some extra patches.

Comment 19 Simeon Andreev 2020-08-10 15:44:08 UTC
Created attachment 1710981 [details]
JDK 8 flags.

I can't run the same for JDK 11:

Unrecognized VM option 'PrintGCTimeStamps'
Error: Could not create the Java Virtual Machine.
Error: A fatal exception has occurred. Program will exit.

> Simeon, can you please try on the RHEL 8 test machine we have? Maybe some C++ libraries were fixed since 7.4?

Tomorrow maybe.

> 1) Please test using the same options I used:

:/tmp$ java -XX:+PrintGC -Xloggc:gc8.log -XX:+PrintGCDetails -XX:+PrintGCTimeStamps -XX:+PrintGCApplicationStoppedTime -XX:+UseG1GC TestByteArrayCopy
java.version=1.8.0_262
Arrays.copyOf(Byte[], int) repeated 1000 times took: 6055ms
entire snippet took: 6089ms

:/tmp$ /usr/lib/jvm/java-11/bin/java -XX:+PrintGC -Xloggc:gc8.log -XX:+PrintGCDetails -XX:+UseG1GC TestByteArrayCopy
[0.000s][warning][gc] -Xloggc is deprecated. Will use -Xlog:gc:gc8.log instead.                                                                                                                                                              
[0.001s][warning][gc] -XX:+PrintGC is deprecated. Will use -Xlog:gc instead.                                                                                                                                                                 
[0.001s][warning][gc] -XX:+PrintGCDetails is deprecated. Will use -Xlog:gc* instead.                                                                                                                                                         
java.version=11.0.8                                                                                                                                                                                                                          
Arrays.copyOf(Byte[], int) repeated 1000 times took: 10296ms                                                                                                                                                                                 
entire snippet took: 10334ms

(compiled with JDK 8)

:/tmp$ /usr/lib/jvm/java-11/bin/java -XX:+PrintGC -Xloggc:gc8.log -XX:+PrintGCDetails -XX:+UseG1GC TestByteArrayCopy
[0.000s][warning][gc] -Xloggc is deprecated. Will use -Xlog:gc:gc8.log instead.                                                                                                                                                              
[0.001s][warning][gc] -XX:+PrintGC is deprecated. Will use -Xlog:gc instead.                                                                                                                                                                 
[0.001s][warning][gc] -XX:+PrintGCDetails is deprecated. Will use -Xlog:gc* instead.                                                                                                                                                         
java.version=11.0.8                                                                                                                                                                                                                          
Arrays.copyOf(Byte[], int) repeated 1000 times took: 9405ms                                                                                                                                                                                  
entire snippet took: 9442ms

(compiled with JDK 11)

:/tmp$ cat TestByteArrayCopy.java
import java.io.IOException;                                                                                                                                                                                                                  
import java.util.Arrays;                                                                                                                                                                                                                     
                                                                                                                                                                                                                                             
public class TestByteArrayCopy {

        public static void main(String[] args) throws IOException {
                System.out.println("java.version=" + System.getProperty("java.version"));
                long s = System.currentTimeMillis();

                long totalElapsed = 0;
                int iterations = 1000;
                int n = 2_000_000;

                Byte[] array = new Byte[n];
                for (int i = 0; i < array.length; i++) {
                        array[i] = Byte.valueOf((byte) (i % Byte.MAX_VALUE));
                }

                for (int iteration = 0; iteration < iterations; iteration++) {
                        long start = System.currentTimeMillis();

                        array = Arrays.copyOf(array, array.length);

                        // Same regression here:
//                      Byte [] array2 = new Byte[array.length];
//                      System.arraycopy(array, 0, array2, 0, array.length);
//                      array = array2;

                        long end = System.currentTimeMillis();
                        long elapsed = end - start;
                        totalElapsed += elapsed;
                }

                for (int i = 0; i < array.length; i++) {
                        if(!array[i].equals(Byte.valueOf((byte) (i % Byte.MAX_VALUE)))) {
                                throw new IllegalStateException();
                        }
                }

                long e = System.currentTimeMillis();
                System.out.println("Arrays.copyOf(Byte[], int) repeated " + iterations + " times took: " + totalElapsed + "ms");
                System.out.println("entire snippet took: " + (e - s) + "ms");
        }
}

Comment 20 Mike Millson 2020-08-10 15:47:41 UTC
Created attachment 1710982 [details]
JDK8 -XX:+UnlockDiagnosticVMOptions -XX:+PrintFlagsFinal

Comment 21 Mike Millson 2020-08-10 15:48:01 UTC
Created attachment 1710983 [details]
JDK11 -XX:+UnlockDiagnosticVMOptions -XX:+PrintFlagsFinal

Comment 22 Mike Millson 2020-08-10 15:50:53 UTC
>I can't run the same for JDK 11:

The JDK8 and 11 options cannot be exactly the same. I used the nearest equivalent for JDK11:
java -XX:+PrintFlagsFinal -Xlog:gc*,safepoint=info:file=gc11.log:uptime:filecount=4,filesize=50M -XX:+UseG1GC -version > flags11.txt

Comment 23 Simeon Andreev 2020-08-10 15:54:27 UTC
Created attachment 1710985 [details]
JDK 11 flags.

JDK 11 flags retrieved by executing:

java -XX:+PrintFlagsFinal -Xlog:gc*,safepoint=info:file=gc11.log:uptime:filecount=4,filesize=50M -XX:+UseG1GC -version

Comment 24 Mike Millson 2020-08-10 16:15:29 UTC
The RHEL 7.7 lab box I have been testing on:
model name	: Intel Core Processor (Haswell, no TSX)

I will try to locate a lab box w/ a Xeon processor to test.

Comment 25 Roman Kennke 2020-08-10 20:00:16 UTC
Hello there,
Heap sizing might be relevant. Which -Xms/-Xmx settings do you use, if any? (If not, try to run the same program with same heap size on JDK8 vs JDK11, because default settings may differ).

In general, it is a very bad idea to use Byte[] over byte[]. The latter is just a chunk of memory treated as N bytes (clearly). The former is an array of objects, which comes with severe disadvantages:
- It takes *much* more memory. A byte is just 8 bits, and you can fit 8 of them in a 64-bit word. A Byte is an object which takes (I guess) 3 64-bit words *plus* the reference in the array itself, which could be 32-bit (when using compressed oops, e.g. with -Xmx < ~32GB) or 64-bit (when not using compressed oops).
- It will be much slower to access, because every access has to resolve the reference first. Also, the actual Byte objects may be scattered over the heap, which will have detrimental effects on caching.
- Perhaps most relevant to this bug, the GC has to do a lot of extra work to deal with those Byte[]: when copying such arrays it needs to employ barriers before and after the copy operation to fix things up, which will slow down the whole operation. I suspect here lies your problem. Further, a single byte[] is much easier for the GC to deal with than a Byte[] plus a ton of scattered Byte objects.
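
A back-of-envelope sketch of the footprint difference for the sizes above. The per-reference and per-object sizes are assumed typical HotSpot values, not measured; note also that Byte.valueOf draws from a 256-entry cache, so the reproduction snippet shares Byte objects rather than allocating one per element:

```java
// Rough memory-footprint arithmetic, using assumed HotSpot sizes:
// 1 byte per primitive element, 4-byte compressed / 8-byte uncompressed
// references, ~16 bytes per boxed Byte object.
public class FootprintSketch {
    static long primitiveBytes(int n) { return n; }          // byte[]: 1 byte/element

    static long boxedRefBytes(int n, boolean compressedOops) {
        return (long) n * (compressedOops ? 4 : 8);          // Byte[]: just the references
    }

    static long worstCaseBoxedObjects(int n) { return (long) n * 16; }

    public static void main(String[] args) {
        int n = 2_000_000;
        System.out.println("byte[]                    : ~" + primitiveBytes(n) / 1_000_000 + " MB");
        System.out.println("Byte[] refs (compressed)  : ~" + boxedRefBytes(n, true) / 1_000_000 + " MB");
        System.out.println("Byte[] refs (uncompressed): ~" + boxedRefBytes(n, false) / 1_000_000 + " MB");
        // Byte.valueOf caches all 256 values, so the test shares objects;
        // without the cache, each element could be its own ~16-byte object:
        System.out.println("worst-case Byte objects   : ~" + worstCaseBoxedObjects(n) / 1_000_000 + " MB");
    }
}
```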

Please let me know your exact command lines (jdk8 and jdk11) with which you invoke the test program, and I will try to reproduce it. Or even better, consider using byte[] instead, if possible.

Best regards,
Roman

Comment 26 Mike Millson 2020-08-10 20:29:51 UTC
Simeon,

Tests on Xeon processor with 4 cpu (1 core/cpu), 8GB RAM, virtual environment.

RHEL 7.4

[root@node-0 ~]# java -XX:+PrintGC -Xloggc:gc8.log -XX:+PrintGCDetails -XX:+PrintGCTimeStamps -XX:+PrintGCApplicationStoppedTime -XX:+UseG1GC TestByteArrayCopy
java.version=1.8.0_262
Arrays.copyOf(Byte[], int) repeated 1000 times took: 15783ms
entire snippet took: 15837ms

[root@node-0 ~]# java -Xlog:gc*,safepoint=info:file=gc11.log:uptime:filecount=4,filesize=50M -XX:+UseG1GC TestByteArrayCopy
java.version=11.0.8
Arrays.copyOf(Byte[], int) repeated 1000 times took: 15978ms
entire snippet took: 16026ms

RHEL 7.8

[root@node-0 ~]# java -XX:+PrintGC -Xloggc:gc8.log -XX:+PrintGCDetails -XX:+PrintGCTimeStamps -XX:+PrintGCApplicationStoppedTime -XX:+UseG1GC TestByteArrayCopy
java.version=1.8.0_262
Arrays.copyOf(Byte[], int) repeated 1000 times took: 20281ms
entire snippet took: 20347ms

[root@node-0 ~]# java -Xlog:gc*,safepoint=info:file=gc11.log:uptime:filecount=4,filesize=50M -XX:+UseG1GC TestByteArrayCopy
java.version=11.0.8
Arrays.copyOf(Byte[], int) repeated 1000 times took: 16045ms
entire snippet took: 16092ms

Perhaps it's RHEL 7.4? Can you test on RHEL 7.8?

Comment 27 Andrey Loskutov 2020-08-10 22:01:44 UTC
(In reply to Roman Kennke from comment #25)
> Hello there,
> Heap sizing might be relevant. Which -Xms/-Xmx settings do you use, if any?

We set -Xmx12g (our minimal configuration has 64 GB RAM). 

> In general, it is a very bad idea to use Byte[] over byte[].

It is not our idea; our customers use that to manage multi-dimensional collections of Bytes. We would never do that, but unfortunately we are not able to change code we do not own.

> Or even better,
> consider using byte[] instead, if possible.

Please note, the regression we observe affects *all* reference-type arrays; Byte arrays are just affected the most. So copying any reference data from one array to another is at least 2x slower.

Comment 28 Roman Kennke 2020-08-10 22:04:56 UTC
(In reply to Andrey Loskutov from comment #27)
> (In reply to Roman Kennke from comment #25)
> > Hello there,
> > Heap sizing might be relevant. Which -Xms/-Xmx settings do you use, if any?
> 
> We set -Xmx12g (our minimal configuration has 64 GB RAM). 
>

Ok thank you. I will try to reproduce the regression tomorrow.

> > In general, it is a very bad idea to use Byte[] over byte[].
> 
> It is not our idea, our customers use that to manage multi-dimensional
> collections of Bytes. We would never do that, unfortunately we are not able
> to change code we do not own.

I almost assumed that. Too bad.
 
> > Or even better,
> > consider using byte[] instead, if possible.
> 
> Please note, regression we observe affects *all* reference type arrays, Byte
> arrays are just affected most. So copying any reference data from one to
> other array is at least 2x slower.

Ok, I guessed so. I suspect it has to do with the GC barriers around object-arraycopy, I will try to reproduce it, that should give us more insights.

Thanks,
Roman

Comment 29 Andrey Loskutov 2020-08-10 22:16:24 UTC
Roman, just in case you've overlooked it - the regression is partly fixed in JDK 14 and fully in 15. So maybe you can spot some recent changes in the related code.

Comment 30 Simeon Andreev 2020-08-11 12:10:24 UTC
:/tmp/jep2240/java8$ cat /etc/redhat-release
Red Hat Enterprise Linux release 8.2 (Ootpa)

:/tmp/jep2240/java8$ java -XX:+PrintGC -Xloggc:gc8.log -XX:+PrintGCDetails -XX:+PrintGCTimeStamps -XX:+PrintGCApplicationStoppedTime -XX:+UseG1GC TestByteArrayCopy
java.version=1.8.0_262
Arrays.copyOf(Byte[], int) repeated 1000 times took: 4169ms
entire snippet took: 4215ms

(Java 8 compiled)

:/tmp/jep2240/java11$ /usr/lib/jvm/java-11-openjdk-11.0.8.10-0.el8_2.x86_64/bin/java -XX:+PrintGC -Xloggc:gc8.log -XX:+PrintGCDetails -XX:+UseG1GC TestByteArrayCopy
[0.000s][warning][gc] -Xloggc is deprecated. Will use -Xlog:gc:gc8.log instead.
[0.001s][warning][gc] -XX:+PrintGC is deprecated. Will use -Xlog:gc instead.
[0.001s][warning][gc] -XX:+PrintGCDetails is deprecated. Will use -Xlog:gc* instead.
java.version=11.0.8
Arrays.copyOf(Byte[], int) repeated 1000 times took: 8698ms
entire snippet took: 8755ms

(Java 11 compiled)

Comment 31 Roman Kennke 2020-08-11 16:22:20 UTC
I used the command lines that you posted. (In reply to Simeon Andreev from comment #30)
> :/tmp/jep2240/java8$ cat /etc/redhat-release
> Red Hat Enterprise Linux release 8.2 (Ootpa)
> 
> :/tmp/jep2240/java8$ java -XX:+PrintGC -Xloggc:gc8.log -XX:+PrintGCDetails
> -XX:+PrintGCTimeStamps -XX:+PrintGCApplicationStoppedTime -XX:+UseG1GC
> TestByteArrayCopy
> java.version=1.8.0_262
> Arrays.copyOf(Byte[], int) repeated 1000 times took: 4169ms
> entire snippet took: 4215ms
> 
> (Java 8 compiled)
> 
> :/tmp/jep2240/java11$
> /usr/lib/jvm/java-11-openjdk-11.0.8.10-0.el8_2.x86_64/bin/java -XX:+PrintGC
> -Xloggc:gc8.log -XX:+PrintGCDetails -XX:+UseG1GC TestByteArrayCopy
> [0.000s][warning][gc] -Xloggc is deprecated. Will use -Xlog:gc:gc8.log
> instead.
> [0.001s][warning][gc] -XX:+PrintGC is deprecated. Will use -Xlog:gc instead.
> [0.001s][warning][gc] -XX:+PrintGCDetails is deprecated. Will use -Xlog:gc*
> instead.
> java.version=11.0.8
> Arrays.copyOf(Byte[], int) repeated 1000 times took: 8698ms
> entire snippet took: 8755ms
> 
> (Java 11 compiled)

The exact same experiment yielded the following regression here:

$ /usr/lib/jvm/java-1.8.0/bin/java -XX:+PrintGC -Xloggc:gc8.log
-XX:+PrintGCDetails -XX:+PrintGCTimeStamps
-XX:+PrintGCApplicationStoppedTime -XX:+UseG1GC TestByteArrayCopy
java.version=1.8.0_262
Arrays.copyOf(Byte[], int) repeated 1000 times took: 7422ms
entire snippet took: 7491ms

$ /usr/lib/jvm/java-11/bin/java -XX:+PrintGC -Xloggc:gc11.log
-XX:+PrintGCDetails -XX:+UseG1GC TestByteArrayCopy
[0.001s][warning][gc] -Xloggc is deprecated. Will use -Xlog:gc:gc11.log
instead.
[0.001s][warning][gc] -XX:+PrintGC is deprecated. Will use -Xlog:gc
instead.
[0.001s][warning][gc] -XX:+PrintGCDetails is deprecated. Will use
-Xlog:gc* instead.
java.version=11.0.8
Arrays.copyOf(Byte[], int) repeated 1000 times took: 8385ms
entire snippet took: 8445ms

It doesn't look as bad as what you got.

The same experiment with fixed heap-size of 12GB looks like this:

$ /usr/lib/jvm/java-1.8.0/bin/java -XX:+PrintGC -Xloggc:gc8.log
-XX:+PrintGCDetails -XX:+PrintGCTimeStamps
-XX:+PrintGCApplicationStoppedTime -XX:+UseG1GC -Xmx12g -Xms12g
TestByteArrayCopy
java.version=1.8.0_262
Arrays.copyOf(Byte[], int) repeated 1000 times took: 24292ms
entire snippet took: 24355ms

$ /usr/lib/jvm/java-11/bin/java -XX:+PrintGC -Xloggc:gc11.log
-XX:+PrintGCDetails -XX:+UseG1GC -Xmx12g -Xms12g TestByteArrayCopy
[0.001s][warning][gc] -Xloggc is deprecated. Will use -Xlog:gc:gc11.log
instead.
[0.001s][warning][gc] -XX:+PrintGC is deprecated. Will use -Xlog:gc
instead.
[0.001s][warning][gc] -XX:+PrintGCDetails is deprecated. Will use
-Xlog:gc* instead.
java.version=11.0.8
Arrays.copyOf(Byte[], int) repeated 1000 times took: 6718ms
entire snippet took: 6785ms

In other words, suddenly JDK8 performs *much* worse, and JDK11 actually looks better than before.

I believe the heap-sizing heuristics changed in JDK11, so for a fair comparison I advise running with the same heap size settings in both experiments. Can you try that?

Thanks,
Roman

Comment 32 Andrey Loskutov 2020-08-11 19:40:17 UTC
Could it be that some new feature/fix after Java 8 depends on a particular CPU flag/family, and that exactly that is not working on our Xeons, or that RHEL doesn't properly support our Xeon model?

Comment 33 Simeon Andreev 2020-08-12 06:13:16 UTC
We observe the problem in our product, where the only difference in flags we specify is adding the following when we run with JDK 9+:

      args.add(toOu("-Dcom.sun.xml.bind.v2.bytecode.ClassTailor.noOptimize=true"));
      args.add(toOu("--add-modules=ALL-SYSTEM"));

Sure, I can compare with more equal flags and probably even come up with a snippet that is closer to the original. But the problem in our product will remain regardless.

Anyway I'll try again with the same heap sizes...

Comment 34 Simeon Andreev 2020-08-12 06:19:57 UTC
:/tmp/jep2240/java11$ cat /etc/redhat-release
Red Hat Enterprise Linux release 8.2 (Ootpa)

:/tmp/jep2240/java8$ java -Xmx12g -XX:+PrintGC -Xloggc:gc8.log -XX:+PrintGCDetails -XX:+PrintGCTimeStamps -XX:+PrintGCApplicationStoppedTime -XX:+UseG1GC TestByteArrayCopy
java.version=1.8.0_262
Arrays.copyOf(Byte[], int) repeated 1000 times took: 3962ms
entire snippet took: 4012ms

:/tmp/jep2240/java8$ java -Xmx12g -Xms12g -XX:+PrintGC -Xloggc:gc8.log -XX:+PrintGCDetails -XX:+PrintGCTimeStamps -XX:+PrintGCApplicationStoppedTime -XX:+UseG1GC TestByteArrayCopy
java.version=1.8.0_262
Arrays.copyOf(Byte[], int) repeated 1000 times took: 9856ms
entire snippet took: 9905ms

:/tmp/jep2240/java11$ /usr/lib/jvm/java-11-openjdk-11.0.8.10-0.el8_2.x86_64/bin/java -Xmx12g -XX:+PrintGC -Xloggc:gc8.log -XX:+PrintGCDetails -XX:+UseG1GC TestByteArrayCopy
[0.000s][warning][gc] -Xloggc is deprecated. Will use -Xlog:gc:gc8.log instead.
[0.001s][warning][gc] -XX:+PrintGC is deprecated. Will use -Xlog:gc instead.
[0.001s][warning][gc] -XX:+PrintGCDetails is deprecated. Will use -Xlog:gc* instead.
java.version=11.0.8
Arrays.copyOf(Byte[], int) repeated 1000 times took: 7597ms
entire snippet took: 7647ms

:/tmp/jep2240/java11$ /usr/lib/jvm/java-11-openjdk-11.0.8.10-0.el8_2.x86_64/bin/java -Xmx12g -Xms12g -XX:+PrintGC -Xloggc:gc8.log -XX:+PrintGCDetails -XX:+UseG1GC TestByteArrayCopy
[0.000s][warning][gc] -Xloggc is deprecated. Will use -Xlog:gc:gc8.log instead.
[0.001s][warning][gc] -XX:+PrintGC is deprecated. Will use -Xlog:gc instead.
[0.001s][warning][gc] -XX:+PrintGCDetails is deprecated. Will use -Xlog:gc* instead.
java.version=11.0.8
Arrays.copyOf(Byte[], int) repeated 1000 times took: 7326ms
entire snippet took: 7381ms

Note that we *do not* specify a -Xms option for the JVM in which we observe the performance degradation.

Comment 35 Simeon Andreev 2020-08-12 06:21:58 UTC
/proc/cpuinfo contents on the RHEL 8 workstation I'm using:

processor	: 0
vendor_id	: GenuineIntel
cpu family	: 6
model		: 79
model name	: Intel(R) Xeon(R) CPU E5-1650 v4 @ 3.60GHz
stepping	: 1
microcode	: 0xb00001b
cpu MHz		: 3576.873
cache size	: 15360 KB
physical id	: 0
siblings	: 12
core id		: 0
cpu cores	: 6
apicid		: 0
initial apicid	: 0
fpu		: yes
fpu_exception	: yes
cpuid level	: 20
wp		: yes
flags		: fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm pbe syscall nx pdpe1gb rdtscp lm constant_tsc arch_perfmon pebs bts rep_good nopl xtopology nonstop_tsc cpuid aperfmperf pni pclmulqdq dtes64 monitor ds_cpl vmx smx est tm2 ssse3 sdbg fma cx16 xtpr pdcm pcid dca sse4_1 sse4_2 x2apic movbe popcnt aes xsave avx f16c rdrand lahf_lm abm 3dnowprefetch cpuid_fault epb cat_l3 cdp_l3 invpcid_single pti intel_ppin tpr_shadow vnmi flexpriority ept vpid fsgsbase tsc_adjust bmi1 hle avx2 smep bmi2 erms invpcid rtm cqm rdt_a rdseed adx smap intel_pt xsaveopt cqm_llc cqm_occup_llc cqm_mbm_total cqm_mbm_local dtherm arat pln pts
bugs		: cpu_meltdown spectre_v1 spectre_v2 spec_store_bypass l1tf mds swapgs taa itlb_multihit
bogomips	: 7183.98
clflush size	: 64
cache_alignment	: 64
address sizes	: 46 bits physical, 48 bits virtual
power management:

[processor entries 1-11 omitted for brevity: identical to processor 0 apart from core id, apicid and minor cpu MHz variation]

Comment 38 Chris Williams 2020-11-11 21:39:58 UTC
Red Hat Enterprise Linux 7 shipped its final minor release on September 29th, 2020. 7.9 was the last minor release scheduled for RHEL 7.
From initial triage it does not appear the remaining Bugzillas meet the inclusion criteria for Maintenance Phase 2 and they will now be closed.

From the RHEL life cycle page:
https://access.redhat.com/support/policy/updates/errata#Maintenance_Support_2_Phase
"During Maintenance Support 2 Phase for Red Hat Enterprise Linux version 7, Red Hat defined Critical and Important impact Security Advisories (RHSAs) and selected (at Red Hat discretion) Urgent Priority Bug Fix Advisories (RHBAs) may be released as they become available."

If this BZ was closed in error and meets the above criteria, please re-open it, flag it for 7.9.z, provide suitable business and technical justifications, and follow the process for Accelerated Fixes:
https://source.redhat.com/groups/public/pnt-cxno/pnt_customer_experience_and_operations_wiki/support_delivery_accelerated_fix_release_handbook  

Feature Requests can be re-opened and moved to RHEL 8 if the desired functionality is not already present in the product.

Please reach out to the applicable Product Experience Engineer[0] if you have any questions or concerns.  

[0] https://bugzilla.redhat.com/page.cgi?id=agile_component_mapping.html&product=Red+Hat+Enterprise+Linux+7

Comment 39 Simeon Andreev 2020-11-12 07:17:42 UTC
This is a bug in OpenJDK; what RHEL version it was found on is irrelevant.

Do you need me to reproduce this on RHEL 8 for a re-open or a clone, or whatever is required to continue working on the bug?

Comment 40 Sarfaraz Ahamad 2020-11-19 07:00:52 UTC
Hi Team,

The customer came back with the given concerns.

----------
This issue we would like fixed is performance degradation in OpenJDK 11.0.7+ *and has nothing to do with the RHEL release version*. Please re-open the RHEL Bugzilla ticket so that an OpenJDK maintainer can continue working on the OpenJDK problem.

If you wish to see the problem reproduced on RHEL 8 with OpenJDK 1.8 (for whatever reasons), the reproduction steps are described here: https://bugzilla.redhat.com/show_bug.cgi?id=1862163#c0
----------

@Team - Could you please look into this?

Comment 41 jiri vanek 2020-11-20 11:06:44 UTC
Hi!
The issue will be fixed for RHEL 7.9.z and 8.3.z (and up). The JDK has never supported older RHELs.

Comment 42 Andrey Loskutov 2020-11-20 12:00:30 UTC
(In reply to jiri vanek from comment #41)
> Hi!
> The issue will be fixed for rhel 7.9.z and 8.3.z (and up) JDK had never
> supported older rhels.

=> can someone with enough rights please *reopen* this bug?

Comment 44 jiri vanek 2020-11-20 16:14:04 UTC
Hi Roman! Can you please state the phase of the fix? Based on that, this bug can be handled.

Andrey, there is not much need to reopen this except for tracking. This is being handled as an upstream issue, and thus it will flow into RHEL with new upstream sources, which will reach RHEL 7 in some (long) time.
The bug would be helpful if there were a need to speed up distribution of the patch, but as EL7 now only receives security fixes, I doubt we will handle the exception.
I have reopened it, but it may happen that PM will close it again, and that is unlikely to affect the lifecycle of the patch. As for Byte[] vs byte[], it may happen that JDK 11 will be unsuitable for any patching at all.
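The Byte[] vs byte[] distinction matters here because Byte[] holds boxed object references while byte[] is a flat primitive array, so Arrays.copyOf does a reference-array copy in the former case. A hedged contrast sketch (class name and sizes are illustrative, not from the attachments):

```java
import java.util.Arrays;

// Illustrative contrast (not from the attachments): Byte[] holds boxed object
// references, while byte[] is a flat primitive array, so Arrays.copyOf behaves
// very differently for the two.
public class BoxedVsPrimitiveCopy {

    // Box a primitive byte[] into a Byte[] element by element.
    static Byte[] box(byte[] src) {
        Byte[] out = new Byte[src.length];
        for (int i = 0; i < src.length; i++) {
            out[i] = src[i]; // autoboxing; every byte value is served from the Byte cache
        }
        return out;
    }

    public static void main(String[] args) {
        int n = 1_000_000; // illustrative size
        byte[] primitive = new byte[n];
        Byte[] boxed = box(primitive);

        long t0 = System.currentTimeMillis();
        Arrays.copyOf(primitive, n); // plain flat memory copy
        long t1 = System.currentTimeMillis();
        Arrays.copyOf(boxed, n);     // reference-array copy, subject to GC write barriers
        long t2 = System.currentTimeMillis();

        System.out.println("byte[] copy: " + (t1 - t0)
                + "ms, Byte[] copy: " + (t2 - t1) + "ms");
    }
}
```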

Comment 45 Andrey Loskutov 2020-11-20 16:30:09 UTC
(In reply to jiri vanek from comment #44)
> Andrey, there is not much need to reopen this except for tracking. This
> is being handled as an upstream issue, and thus it will flow into RHEL
> with new upstream sources, which will reach RHEL 7 in some (long) time.
> The bug would be helpful if there were a need to speed up distribution
> of the patch, but as EL7 now only receives security fixes, I doubt we
> will handle the exception.
> I have reopened it, but it may happen that PM will close it again, and
> that is unlikely to affect the lifecycle of the patch. As for Byte[] vs
> byte[], it may happen that JDK 11 will be unsuitable for any patching at all.

Jiri, we do not really care about RH 7 release stream - this is OpenJDK bug, so if you think RH8 would be a better target, just change it to RH 8.
The main point is to get OpenJDK fixed, not in which particular RH release that will be merged.

Comment 46 Roman Kennke 2020-11-25 16:52:44 UTC
(In reply to Simeon Andreev from comment #34)
> :/tmp/jep2240/java11$ cat /etc/redhat-release
> Red Hat Enterprise Linux release 8.2 (Ootpa)
> 
> :/tmp/jep2240/java8$ java -Xmx12g -XX:+PrintGC -Xloggc:gc8.log
> -XX:+PrintGCDetails -XX:+PrintGCTimeStamps
> -XX:+PrintGCApplicationStoppedTime -XX:+UseG1GC TestByteArrayCopy
> java.version=1.8.0_262
> Arrays.copyOf(Byte[], int) repeated 1000 times took: 3962ms
> entire snippet took: 4012ms
> 
> :/tmp/jep2240/java8$ java -Xmx12g -Xms12g -XX:+PrintGC -Xloggc:gc8.log
> -XX:+PrintGCDetails -XX:+PrintGCTimeStamps
> -XX:+PrintGCApplicationStoppedTime -XX:+UseG1GC TestByteArrayCopy
> java.version=1.8.0_262
> Arrays.copyOf(Byte[], int) repeated 1000 times took: 9856ms
> entire snippet took: 9905ms
> 
> :/tmp/jep2240/java11$
> /usr/lib/jvm/java-11-openjdk-11.0.8.10-0.el8_2.x86_64/bin/java -Xmx12g
> -XX:+PrintGC -Xloggc:gc8.log -XX:+PrintGCDetails -XX:+UseG1GC
> TestByteArrayCopy
> [0.000s][warning][gc] -Xloggc is deprecated. Will use -Xlog:gc:gc8.log
> instead.
> [0.001s][warning][gc] -XX:+PrintGC is deprecated. Will use -Xlog:gc instead.
> [0.001s][warning][gc] -XX:+PrintGCDetails is deprecated. Will use -Xlog:gc*
> instead.
> java.version=11.0.8
> Arrays.copyOf(Byte[], int) repeated 1000 times took: 7597ms
> entire snippet took: 7647ms
> 
> :/tmp/jep2240/java11$
> /usr/lib/jvm/java-11-openjdk-11.0.8.10-0.el8_2.x86_64/bin/java -Xmx12g
> -Xms12g -XX:+PrintGC -Xloggc:gc8.log -XX:+PrintGCDetails -XX:+UseG1GC
> TestByteArrayCopy
> [0.000s][warning][gc] -Xloggc is deprecated. Will use -Xlog:gc:gc8.log
> instead.
> [0.001s][warning][gc] -XX:+PrintGC is deprecated. Will use -Xlog:gc instead.
> [0.001s][warning][gc] -XX:+PrintGCDetails is deprecated. Will use -Xlog:gc*
> instead.
> java.version=11.0.8
> Arrays.copyOf(Byte[], int) repeated 1000 times took: 7326ms
> entire snippet took: 7381ms
> 
> Note that we *do not* specify a -Xms option for the JVM in which we observe
> the performance degradation.

Can you run the experiments that omit -Xms with -XX:+PrintFlagsFinal and look for the value InitialHeapSize? This should show us which -Xms the heuristics come up with.

Comment 47 Andrew John Hughes 2020-11-25 17:09:35 UTC
(In reply to Andrey Loskutov from comment #45)
> (In reply to jiri vanek from comment #44)
> > Andrey, there is not much need to reopen this except for tracking. This
> > is being handled as an upstream issue, and thus it will flow into RHEL
> > with new upstream sources, which will reach RHEL 7 in some (long) time.
> > The bug would be helpful if there were a need to speed up distribution
> > of the patch, but as EL7 now only receives security fixes, I doubt we
> > will handle the exception.
> > I have reopened it, but it may happen that PM will close it again, and
> > that is unlikely to affect the lifecycle of the patch. As for Byte[] vs
> > byte[], it may happen that JDK 11 will be unsuitable for any patching at all.
> 
> Jiri, we do not really care about RH 7 release stream - this is OpenJDK bug,
> so if you think RH8 would be a better target, just change it to RH 8.
> The main point is to get OpenJDK fixed, not in which particular RH release
> that will be merged.

Ok, we can move this to RHEL 8 if/when we have a patch. For the time being, it's better left as it is, until we know which release to target.

Comment 48 Andrey Loskutov 2020-11-25 17:21:42 UTC
(In reply to Roman Kennke from comment #46)
> Can you run the experiments that omit -Xms with -XX:+PrintFlagsFinal and
> look for the value InitialHeapSize? This should show us which -Xms the
> heuristics come up with.

Looks like same value is used:

java -XX:+PrintFlagsFinal -version | grep InitialHeapSize
   size_t InitialHeapSize                          = 2147483648                                {product} {ergonomic}
openjdk version "11.0.8" 2020-07-14 LTS
OpenJDK Runtime Environment 18.9 (build 11.0.8+10-LTS)
OpenJDK 64-Bit Server VM 18.9 (build 11.0.8+10-LTS, mixed mode, sharing)


java -XX:+PrintFlagsFinal -version | grep InitialHeapSize
    uintx InitialHeapSize                          := 2147483648                          {product}
openjdk version "1.8.0_262"
OpenJDK Runtime Environment (build 1.8.0_262-b10)
OpenJDK 64-Bit Server VM (build 25.262-b10, mixed mode)


java -XX:+PrintFlagsFinal -version | grep InitialHeapSize
   size_t InitialHeapSize                          = 2147483648                                {product} {ergonomic}
java version "15" 2020-09-15
Java(TM) SE Runtime Environment (build 15+36-1562)
Java HotSpot(TM) 64-Bit Server VM (build 15+36-1562, mixed mode, sharing)

Comment 49 Roman Kennke 2020-11-25 18:25:17 UTC
(In reply to Andrey Loskutov from comment #48)
> (In reply to Roman Kennke from comment #46)
> > Can you run the experiments that omit -Xms with -XX:+PrintFlagsFinal and
> > look for the value InitialHeapSize? This should show us which -Xms the
> > heuristics come up with.
> 
> Looks like same value is used:
> 
> java -XX:+PrintFlagsFinal -version | grep InitialHeapSize
>    size_t InitialHeapSize                          = 2147483648             
> {product} {ergonomic}
> openjdk version "11.0.8" 2020-07-14 LTS
> OpenJDK Runtime Environment 18.9 (build 11.0.8+10-LTS)
> OpenJDK 64-Bit Server VM 18.9 (build 11.0.8+10-LTS, mixed mode, sharing)
> 
> 
> java -XX:+PrintFlagsFinal -version | grep InitialHeapSize
>     uintx InitialHeapSize                          := 2147483648            
> {product}
> openjdk version "1.8.0_262"
> OpenJDK Runtime Environment (build 1.8.0_262-b10)
> OpenJDK 64-Bit Server VM (build 25.262-b10, mixed mode)
> 
> 
> java -XX:+PrintFlagsFinal -version | grep InitialHeapSize
>    size_t InitialHeapSize                          = 2147483648             
> {product} {ergonomic}
> java version "15" 2020-09-15
> Java(TM) SE Runtime Environment (build 15+36-1562)
> Java HotSpot(TM) 64-Bit Server VM (build 15+36-1562, mixed mode, sharing)

I did a few experiments. The -Xms is indeed the same.

However, the results I am getting vary so much that they don't really allow any conclusion, except that the test is not sufficient. It also doesn't seem to do any GC. It is hard to tell what causes the difference for the actual program.
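One way to make such measurements steadier (a sketch under an assumed workload shape, not the attached test) is to run warm-up rounds so JIT compilation settles, then report the median over several timed rounds; for serious numbers a harness such as JMH is still preferable:

```java
import java.util.Arrays;

// Sketch of a steadier timing loop: warm-up rounds let JIT compilation settle,
// and the median over several timed rounds damps outliers such as GC pauses.
public class SteadyTimer {

    static long timeMillis(Runnable work) {
        long start = System.nanoTime();
        work.run();
        return (System.nanoTime() - start) / 1_000_000;
    }

    static long medianOfRounds(Runnable work, int warmups, int rounds) {
        for (int i = 0; i < warmups; i++) {
            work.run(); // discard warm-up timings
        }
        long[] samples = new long[rounds];
        for (int i = 0; i < rounds; i++) {
            samples[i] = timeMillis(work);
        }
        Arrays.sort(samples);
        return samples[rounds / 2];
    }

    public static void main(String[] args) {
        Byte[] src = new Byte[1_000_000];
        Arrays.fill(src, (byte) 0);
        Runnable copy = () -> {
            for (int i = 0; i < 100; i++) {
                Arrays.copyOf(src, src.length);
            }
        };
        System.out.println("median round: " + medianOfRounds(copy, 5, 9) + "ms");
    }
}
```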

Comment 50 Simeon Andreev 2020-11-26 08:40:27 UTC
The times I see on my workstation are more or less stable:

socbm775:/home/sandreev/tmp/jep2240$ for run in {1..10}; do rm TestByteArrayCopy.class ; /usr/lib/jvm/java-1.8.0/bin/javac TestByteArrayCopy.java && /usr/lib/jvm/java-1.8.0/bin/java -XX:+UseG1GC TestByteArrayCopy; done
java.version=1.8.0_262
Arrays.copyOf(Byte[], int) repeated 1000 times took: 6556ms
entire snippet took: 6592ms
java.version=1.8.0_262
Arrays.copyOf(Byte[], int) repeated 1000 times took: 6324ms
entire snippet took: 6358ms
java.version=1.8.0_262
Arrays.copyOf(Byte[], int) repeated 1000 times took: 6593ms
entire snippet took: 6628ms
java.version=1.8.0_262
Arrays.copyOf(Byte[], int) repeated 1000 times took: 6511ms
entire snippet took: 6548ms
java.version=1.8.0_262
Arrays.copyOf(Byte[], int) repeated 1000 times took: 6602ms
entire snippet took: 6639ms
java.version=1.8.0_262
Arrays.copyOf(Byte[], int) repeated 1000 times took: 6642ms
entire snippet took: 6677ms
java.version=1.8.0_262
Arrays.copyOf(Byte[], int) repeated 1000 times took: 6833ms
entire snippet took: 6867ms
java.version=1.8.0_262
Arrays.copyOf(Byte[], int) repeated 1000 times took: 6557ms
entire snippet took: 6592ms
java.version=1.8.0_262
Arrays.copyOf(Byte[], int) repeated 1000 times took: 6522ms
entire snippet took: 6557ms
java.version=1.8.0_262
Arrays.copyOf(Byte[], int) repeated 1000 times took: 6342ms
entire snippet took: 6376ms
socbm775:/home/sandreev/tmp/jep2240$ for run in {1..10}; do rm TestByteArrayCopy.class ; /usr/lib/jvm/java-11/bin/javac TestByteArrayCopy.java && /usr/lib/jvm/java-11/bin/java -XX:+UseG1GC TestByteArrayCopy; done
java.version=11.0.8
Arrays.copyOf(Byte[], int) repeated 1000 times took: 8698ms
entire snippet took: 8742ms
java.version=11.0.8
Arrays.copyOf(Byte[], int) repeated 1000 times took: 8620ms
entire snippet took: 8659ms
java.version=11.0.8
Arrays.copyOf(Byte[], int) repeated 1000 times took: 8785ms
entire snippet took: 8827ms
java.version=11.0.8
Arrays.copyOf(Byte[], int) repeated 1000 times took: 8450ms
entire snippet took: 8489ms
java.version=11.0.8
Arrays.copyOf(Byte[], int) repeated 1000 times took: 8572ms
entire snippet took: 8610ms                                                                                                                                                                                                                  
java.version=11.0.8                                                                                                                                                                                                                          
Arrays.copyOf(Byte[], int) repeated 1000 times took: 8829ms
entire snippet took: 8872ms
java.version=11.0.8
Arrays.copyOf(Byte[], int) repeated 1000 times took: 8685ms
entire snippet took: 8727ms
java.version=11.0.8
Arrays.copyOf(Byte[], int) repeated 1000 times took: 8442ms
entire snippet took: 8484ms
java.version=11.0.8
Arrays.copyOf(Byte[], int) repeated 1000 times took: 8645ms
entire snippet took: 8688ms
java.version=11.0.8
Arrays.copyOf(Byte[], int) repeated 1000 times took: 8763ms
entire snippet took: 8802ms
socbm775:/home/sandreev/tmp/jep2240$ cat TestByteArrayCopy.java 
import java.io.IOException;                                                                                                                                                                                                                  
import java.util.Arrays;                                                                                                                                                                                                                     
                                                                                                                                                                                                                                             
public class TestByteArrayCopy {

        public static void main(String[] args) throws IOException {
                System.out.println("java.version=" + System.getProperty("java.version"));
                long s = System.currentTimeMillis();

                long totalElapsed = 0;
                int iterations = 1000;
                int n = 2_000_000;

                Byte[] array = new Byte[n];
                for (int i = 0; i < array.length; i++) {
                        array[i] = Byte.valueOf((byte) (i % Byte.MAX_VALUE));
                }

                for (int iteration = 0; iteration < iterations; iteration++) {
                        long start = System.currentTimeMillis();

                        array = Arrays.copyOf(array, array.length);

                        // Same regression here:
//                      Byte [] array2 = new Byte[array.length];
//                      System.arraycopy(array, 0, array2, 0, array.length);
//                      array = array2;

                        long end = System.currentTimeMillis();
                        long elapsed = end - start;
                        totalElapsed += elapsed;
                }

                for (int i = 0; i < array.length; i++) {
                        if(!array[i].equals(Byte.valueOf((byte) (i % Byte.MAX_VALUE)))) {
                                throw new IllegalStateException();
                        }
                }

                long e = System.currentTimeMillis();
                System.out.println("Arrays.copyOf(Byte[], int) repeated " + iterations + " times took: " + totalElapsed + "ms");
                System.out.println("entire snippet took: " + (e - s) + "ms");
        }
}
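For context on what the snippet exercises: Byte[] is a reference array, so Arrays.copyOf copies object references and goes through G1's reference-array handling, whereas a primitive byte[] copy is a plain memory move. A minimal sketch of that distinction (class name and sizes are illustrative, not part of the test above):

```java
import java.util.Arrays;

public class BoxedVsPrimitiveCopy {

    public static void main(String[] args) {
        int n = 1_000;

        // Reference array: each element is a pointer to a Byte object,
        // so copying it involves the GC's reference-array machinery.
        Byte[] boxed = new Byte[n];
        for (int i = 0; i < n; i++) {
            boxed[i] = Byte.valueOf((byte) (i % Byte.MAX_VALUE));
        }
        Byte[] boxedCopy = Arrays.copyOf(boxed, boxed.length);

        // Primitive array: copying is a plain memory move.
        byte[] primitive = new byte[n];
        for (int i = 0; i < n; i++) {
            primitive[i] = (byte) (i % Byte.MAX_VALUE);
        }
        byte[] primitiveCopy = Arrays.copyOf(primitive, primitive.length);

        System.out.println(Arrays.equals(boxed, boxedCopy));         // true
        System.out.println(Arrays.equals(primitive, primitiveCopy)); // true
    }
}
```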

Comment 51 Simeon Andreev 2020-11-26 09:00:05 UTC
Times with the same snippet on RHEL 8:

socbm370:/home/sandreev/tmp/jep2240$ for run in {1..10}; do /usr/lib/jvm/java-1.8.0-openjdk-1.8.0.262.b10-0.el8_2.x86_64/jre/bin/java -XX:+UseG1GC TestByteArrayCopy; done
java.version=1.8.0_262
Arrays.copyOf(Byte[], int) repeated 1000 times took: 4994ms
entire snippet took: 5043ms
java.version=1.8.0_262
Arrays.copyOf(Byte[], int) repeated 1000 times took: 4973ms
entire snippet took: 5028ms
java.version=1.8.0_262
Arrays.copyOf(Byte[], int) repeated 1000 times took: 5425ms
entire snippet took: 5473ms
java.version=1.8.0_262
Arrays.copyOf(Byte[], int) repeated 1000 times took: 5126ms
entire snippet took: 5179ms
java.version=1.8.0_262
Arrays.copyOf(Byte[], int) repeated 1000 times took: 5108ms
entire snippet took: 5158ms
java.version=1.8.0_262
Arrays.copyOf(Byte[], int) repeated 1000 times took: 4836ms
entire snippet took: 4901ms
java.version=1.8.0_262
Arrays.copyOf(Byte[], int) repeated 1000 times took: 4741ms
entire snippet took: 4792ms
java.version=1.8.0_262
Arrays.copyOf(Byte[], int) repeated 1000 times took: 5270ms
entire snippet took: 5341ms
java.version=1.8.0_262
Arrays.copyOf(Byte[], int) repeated 1000 times took: 4939ms
entire snippet took: 4980ms
java.version=1.8.0_262
Arrays.copyOf(Byte[], int) repeated 1000 times took: 5057ms
entire snippet took: 5105ms
socbm370:/home/sandreev/tmp/jep2240$ for run in {1..10}; do /usr/lib/jvm/java-11-openjdk-11.0.8.10-0.el8_2.x86_64/bin/java -XX:+UseG1GC TestByteArrayCopy; done
java.version=11.0.8
Arrays.copyOf(Byte[], int) repeated 1000 times took: 9560ms
entire snippet took: 9638ms
java.version=11.0.8
Arrays.copyOf(Byte[], int) repeated 1000 times took: 9072ms
entire snippet took: 9124ms
java.version=11.0.8
Arrays.copyOf(Byte[], int) repeated 1000 times took: 9621ms
entire snippet took: 9681ms
java.version=11.0.8
Arrays.copyOf(Byte[], int) repeated 1000 times took: 9711ms
entire snippet took: 9766ms
java.version=11.0.8
Arrays.copyOf(Byte[], int) repeated 1000 times took: 9699ms
entire snippet took: 9753ms
java.version=11.0.8
Arrays.copyOf(Byte[], int) repeated 1000 times took: 8572ms
entire snippet took: 8630ms
java.version=11.0.8
Arrays.copyOf(Byte[], int) repeated 1000 times took: 9834ms
entire snippet took: 9880ms
java.version=11.0.8
Arrays.copyOf(Byte[], int) repeated 1000 times took: 9618ms
entire snippet took: 9677ms
java.version=11.0.8
Arrays.copyOf(Byte[], int) repeated 1000 times took: 8838ms
entire snippet took: 8913ms
java.version=11.0.8
Arrays.copyOf(Byte[], int) repeated 1000 times took: 8671ms
entire snippet took: 8740ms
socbm370:/home/sandreev/tmp/jep2240$ cat /etc/redhat-release
Red Hat Enterprise Linux release 8.2 (Ootpa)

Note that we don't have JDKs installed on RHEL 8 (only JREs), so the snippet is compiled with the respective JDKs we have on RHEL 7.4.

Comment 52 Roman Kennke 2020-11-26 14:03:39 UTC
I turned the test into a proper JMH benchmark; this gives us much more precise results. I'm going to attach the sources and the resulting JAR.

Running on JDK8 (from RHEL8):

$ /usr/lib/jvm/java-1.8.0-openjdk/bin/java -jar target/benchmarks.jar TestByteArrayCopy -t 1 -f 1 -jvmArgs "-XX:+UseG1GC -Xmx12g"
# JMH version: 1.26
# VM version: JDK 1.8.0_272, OpenJDK 64-Bit Server VM, 25.272-b10

...

Result "com.redhat.test.TestByteArrayCopy.test":
  448.704 ±(99.9%) 8.022 ops/s [Average]
  (min, avg, max) = (445.784, 448.704, 451.662), stdev = 2.083
  CI (99.9%): [440.683, 456.726] (assumes normal distribution)


Running on JDK11 (same machine):

$ /usr/lib/jvm/java-11-openjdk/bin/java -jar target/benchmarks.jar TestByteArrayCopy -t 1 -f 1 -jvmArgs "-XX:+UseG1GC -Xmx12g"
# JMH version: 1.26
# VM version: JDK 11.0.9, OpenJDK 64-Bit Server VM, 11.0.9+11-LTS

...

Result "com.redhat.test.TestByteArrayCopy.test":
  541.646 ±(99.9%) 26.902 ops/s [Average]
  (min, avg, max) = (535.044, 541.646, 551.901), stdev = 6.986
  CI (99.9%): [514.744, 568.547] (assumes normal distribution)


IOW, this shows a definite improvement from JDK8 to JDK11 on my machine. (It's ops/s, more is better.)

Please test this on your machine and report your results.

Comment 53 Roman Kennke 2020-11-26 14:04:57 UTC
Created attachment 1733753 [details]
Test sources

Comment 54 Roman Kennke 2020-11-26 14:05:37 UTC
Created attachment 1733754 [details]
Test JAR

Comment 55 Simeon Andreev 2020-11-26 14:41:39 UTC
RHEL 7.4:

socbm775:/home/sandreev/tmp/jep2240$ /usr/lib/jvm/java-1.8.0/bin/java -jar bytearycpy/target/benchmarks.jar TestByteArrayCopy -t 1 -f 1 -jvmArgs "-XX:+UseG1GC -Xmx12g"
# JMH version: 1.26
# VM version: JDK 1.8.0_262, OpenJDK 64-Bit Server VM, 25.262-b10
# VM invoker: /usr/lib/jvm/java-1.8.0-openjdk-1.8.0.262.b10-0.el7_8.x86_64/jre/bin/java
# VM options: -XX:+UseG1GC -Xmx12g
# Warmup: 5 iterations, 10 s each
# Measurement: 5 iterations, 10 s each
# Timeout: 10 min per iteration
# Threads: 1 thread, will synchronize iterations
# Benchmark mode: Throughput, ops/time
# Benchmark: com.redhat.test.TestByteArrayCopy.test
# Parameters: (size = 2000000)

# Run progress: 0.00% complete, ETA 00:01:40
# Fork: 1 of 1
# Warmup Iteration   1: 68.090 ops/s
# Warmup Iteration   2: 16.172 ops/s
# Warmup Iteration   3: 238.032 ops/s
# Warmup Iteration   4: 208.194 ops/s
# Warmup Iteration   5: 250.475 ops/s
Iteration   1: 217.955 ops/s
Iteration   2: 247.756 ops/s
Iteration   3: 233.531 ops/s
Iteration   4: 148.529 ops/s
Iteration   5: 46.804 ops/s


Result "com.redhat.test.TestByteArrayCopy.test":
  178.915 ±(99.9%) 319.989 ops/s [Average]
  (min, avg, max) = (46.804, 178.915, 247.756), stdev = 83.100
  CI (99.9%): [≈ 0, 498.904] (assumes normal distribution)


# Run complete. Total time: 00:02:27

REMEMBER: The numbers below are just data. To gain reusable insights, you need to follow up on
why the numbers are the way they are. Use profilers (see -prof, -lprof), design factorial
experiments, perform baseline and negative tests that provide experimental control, make sure
the benchmarking environment is safe on JVM/OS/HW level, ask for reviews from the domain experts.
Do not assume the numbers tell you what you want them to tell.

Benchmark                (size)   Mode  Cnt    Score     Error  Units
TestByteArrayCopy.test  2000000  thrpt    5  178.915 ± 319.989  ops/s
socbm775:/home/sandreev/tmp/jep2240$ /usr/lib/jvm/java-11/bin/java -jar bytearycpy/target/benchmarks.jar TestByteArrayCopy -t 1 -f 1 -jvmArgs "-XX:+UseG1GC -Xmx12g"
WARNING: An illegal reflective access operation has occurred
WARNING: Illegal reflective access by org.openjdk.jmh.util.Utils (file:/home/sandreev/tmp/jep2240/bytearycpy/target/benchmarks.jar) to field java.io.Console.cs
WARNING: Please consider reporting this to the maintainers of org.openjdk.jmh.util.Utils
WARNING: Use --illegal-access=warn to enable warnings of further illegal reflective access operations
WARNING: All illegal access operations will be denied in a future release
# JMH version: 1.26
# VM version: JDK 11.0.8, OpenJDK 64-Bit Server VM, 11.0.8+10-LTS
# VM invoker: /usr/lib/jvm/java-11-openjdk-11.0.8.10-1.el7.x86_64/bin/java
# VM options: -XX:+UseG1GC -Xmx12g
# Warmup: 5 iterations, 10 s each
# Measurement: 5 iterations, 10 s each
# Timeout: 10 min per iteration
# Threads: 1 thread, will synchronize iterations
# Benchmark mode: Throughput, ops/time
# Benchmark: com.redhat.test.TestByteArrayCopy.test
# Parameters: (size = 2000000)

# Run progress: 0.00% complete, ETA 00:01:40
# Fork: 1 of 1
# Warmup Iteration   1: 116.956 ops/s
# Warmup Iteration   2: 121.580 ops/s
# Warmup Iteration   3: 121.083 ops/s
# Warmup Iteration   4: 140.914 ops/s
# Warmup Iteration   5: 374.684 ops/s
Iteration   1: 507.539 ops/s
Iteration   2: 548.375 ops/s
Iteration   3: 582.907 ops/s
Iteration   4: 592.642 ops/s
Iteration   5: 603.223 ops/s


Result "com.redhat.test.TestByteArrayCopy.test":
  566.937 ±(99.9%) 150.436 ops/s [Average]
  (min, avg, max) = (507.539, 566.937, 603.223), stdev = 39.068
  CI (99.9%): [416.501, 717.373] (assumes normal distribution)


# Run complete. Total time: 00:01:42

REMEMBER: The numbers below are just data. To gain reusable insights, you need to follow up on
why the numbers are the way they are. Use profilers (see -prof, -lprof), design factorial
experiments, perform baseline and negative tests that provide experimental control, make sure
the benchmarking environment is safe on JVM/OS/HW level, ask for reviews from the domain experts.
Do not assume the numbers tell you what you want them to tell.

Benchmark                (size)   Mode  Cnt    Score     Error  Units
TestByteArrayCopy.test  2000000  thrpt    5  566.937 ± 150.436  ops/s

RHEL 8:

socbm370:/home/sandreev/tmp/jep2240$ /usr/lib/jvm/java-1.8.0-openjdk-1.8.0.262.b10-0.el8_2.x86_64/jre/bin/java -jar bytearycpy/target/benchmarks.jar TestByteArrayCopy -t 1 -f 1 -jvmArgs "-XX:+UseG1GC -Xmx12g"
# JMH version: 1.26
# VM version: JDK 1.8.0_262, OpenJDK 64-Bit Server VM, 25.262-b10
# VM invoker: /usr/lib/jvm/java-1.8.0-openjdk-1.8.0.262.b10-0.el8_2.x86_64/jre/bin/java
# VM options: -XX:+UseG1GC -Xmx12g
# Warmup: 5 iterations, 10 s each
# Measurement: 5 iterations, 10 s each
# Timeout: 10 min per iteration
# Threads: 1 thread, will synchronize iterations
# Benchmark mode: Throughput, ops/time
# Benchmark: com.redhat.test.TestByteArrayCopy.test
# Parameters: (size = 2000000)

# Run progress: 0.00% complete, ETA 00:01:40
# Fork: 1 of 1
# Warmup Iteration   1: 60.361 ops/s
# Warmup Iteration   2: 244.739 ops/s
# Warmup Iteration   3: 238.716 ops/s
# Warmup Iteration   4: 244.022 ops/s
# Warmup Iteration   5: 244.379 ops/s
Iteration   1: 247.630 ops/s
Iteration   2: 238.719 ops/s
Iteration   3: 239.159 ops/s
Iteration   4: 240.748 ops/s
Iteration   5: 241.115 ops/s


Result "com.redhat.test.TestByteArrayCopy.test":
  241.474 ±(99.9%) 13.817 ops/s [Average]
  (min, avg, max) = (238.719, 241.474, 247.630), stdev = 3.588
  CI (99.9%): [227.657, 255.291] (assumes normal distribution)


# Run complete. Total time: 00:01:57

REMEMBER: The numbers below are just data. To gain reusable insights, you need to follow up on
why the numbers are the way they are. Use profilers (see -prof, -lprof), design factorial
experiments, perform baseline and negative tests that provide experimental control, make sure
the benchmarking environment is safe on JVM/OS/HW level, ask for reviews from the domain experts.
Do not assume the numbers tell you what you want them to tell.

Benchmark                (size)   Mode  Cnt    Score    Error  Units
TestByteArrayCopy.test  2000000  thrpt    5  241.474 ± 13.817  ops/s
socbm370:/home/sandreev/tmp/jep2240$ /usr/lib/jvm/java-11-openjdk-11.0.8.10-0.el8_2.x86_64/bin/java -jar bytearycpy/target/benchmarks.jar TestByteArrayCopy -t 1 -f 1 -jvmArgs "-XX:+UseG1GC -Xmx12g"
WARNING: An illegal reflective access operation has occurred
WARNING: Illegal reflective access by org.openjdk.jmh.util.Utils (file:/home/sandreev/tmp/jep2240/bytearycpy/target/benchmarks.jar) to field java.io.Console.cs
WARNING: Please consider reporting this to the maintainers of org.openjdk.jmh.util.Utils
WARNING: Use --illegal-access=warn to enable warnings of further illegal reflective access operations
WARNING: All illegal access operations will be denied in a future release
# JMH version: 1.26
# VM version: JDK 11.0.8, OpenJDK 64-Bit Server VM, 11.0.8+10-LTS
# VM invoker: /usr/lib/jvm/java-11-openjdk-11.0.8.10-0.el8_2.x86_64/bin/java
# VM options: -XX:+UseG1GC -Xmx12g
# Warmup: 5 iterations, 10 s each
# Measurement: 5 iterations, 10 s each
# Timeout: 10 min per iteration
# Threads: 1 thread, will synchronize iterations
# Benchmark mode: Throughput, ops/time
# Benchmark: com.redhat.test.TestByteArrayCopy.test
# Parameters: (size = 2000000)

# Run progress: 0.00% complete, ETA 00:01:40
# Fork: 1 of 1
# Warmup Iteration   1: 100.788 ops/s
# Warmup Iteration   2: 106.175 ops/s
# Warmup Iteration   3: 107.597 ops/s
# Warmup Iteration   4: 216.793 ops/s
# Warmup Iteration   5: 359.409 ops/s
Iteration   1: 395.729 ops/s
Iteration   2: 443.773 ops/s
Iteration   3: 446.189 ops/s
Iteration   4: 451.510 ops/s
Iteration   5: 451.829 ops/s


Result "com.redhat.test.TestByteArrayCopy.test":
  437.806 ±(99.9%) 91.545 ops/s [Average]
  (min, avg, max) = (395.729, 437.806, 451.829), stdev = 23.774
  CI (99.9%): [346.261, 529.351] (assumes normal distribution)


# Run complete. Total time: 00:01:43

REMEMBER: The numbers below are just data. To gain reusable insights, you need to follow up on
why the numbers are the way they are. Use profilers (see -prof, -lprof), design factorial
experiments, perform baseline and negative tests that provide experimental control, make sure
the benchmarking environment is safe on JVM/OS/HW level, ask for reviews from the domain experts.
Do not assume the numbers tell you what you want them to tell.

Benchmark                (size)   Mode  Cnt    Score    Error  Units
TestByteArrayCopy.test  2000000  thrpt    5  437.806 ± 91.545  ops/s


Giving 12 GB JVM heap to the original snippet doesn't change the numbers I'm seeing.

Comment 56 Simeon Andreev 2020-11-26 14:48:29 UTC
We could try to write a snippet that copies the exact data that the user code in our product copies. That should be closest to the degradation case we want to have fixed.

Comment 57 Andrey Loskutov 2020-11-26 14:52:58 UTC
I also see a similar benchmark result, and it is better for Java 11.
This means the benchmark is doing something different, because in our application, and in the snippet we identified as the performance problem, we see a clear performance degradation on Java 11.
The question is: what does the benchmark do differently from the real application / simple snippet?

Comment 58 Roman Kennke 2020-11-26 15:11:49 UTC
Yeah. What I did was narrow it down to the actual arraycopy on Byte[], which seems to lead to the conclusion that this is not the actual problem.

One issue may be that in the examples you provided, it's barely ever doing any GC. The performance may look totally different when GC kicks in, which means we ought to allocate something and keep the GC busy while doing those arraycopies. I will look into that. If you could extract a testcase that is closer to your product problem, that might be useful.
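A sketch of such a GC-pressure variant (entirely hypothetical, not the attached benchmark; class name, iteration counts and allocation sizes are made up for illustration) could interleave throwaway allocations with the copies, so that some copies run while G1 is concurrently marking and its barriers are active:

```java
import java.util.Arrays;

public class CopyUnderGcPressure {

    public static void main(String[] args) {
        int n = 100_000;
        Byte[] array = new Byte[n];
        for (int i = 0; i < n; i++) {
            array[i] = Byte.valueOf((byte) (i % Byte.MAX_VALUE));
        }

        byte[] garbageSink = null;
        for (int iteration = 0; iteration < 100; iteration++) {
            // Throwaway allocation keeps the GC busy, so some of the
            // copies below should overlap with concurrent GC work.
            garbageSink = new byte[1_000_000];
            array = Arrays.copyOf(array, array.length);
        }

        // Sanity check: the copies must still preserve the contents.
        for (int i = 0; i < array.length; i++) {
            if (array[i].byteValue() != (byte) (i % Byte.MAX_VALUE)) {
                throw new IllegalStateException();
            }
        }
        System.out.println("copies done, last garbage size: " + garbageSink.length);
    }
}
```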

Comment 59 Simeon Andreev 2020-11-26 15:21:40 UTC
RHEL 7.4:

socbm775:/home/sandreev/tmp/jep2240$ for run in {1..1}; do rm TestByteArrayCopy.class ; /usr/lib/jvm/java-1.8.0/bin/javac TestByteArrayCopy.java && /usr/lib/jvm/java-1.8.0/bin/java -XX:+UseG1GC -Xmx12g -verbose:gc TestByteArrayCopy; done
java.version=1.8.0_262
[GC pause (G1 Humongous Allocation) (young) (initial-mark) 856M->855M(2008M), 3.4473936 secs]
[GC concurrent-root-region-scan-start]
[GC concurrent-root-region-scan-end, 0.0031757 secs]
[GC concurrent-mark-start]
[GC concurrent-mark-end, 0.0225507 secs]
[GC remark, 0.0010758 secs]
[GC cleanup 916M->70944K(2008M), 0.0007968 secs]
[GC concurrent-cleanup-start]
[GC concurrent-cleanup-end, 0.0005786 secs]
Arrays.copyOf(Byte[], int) repeated 1000 times took: 6353ms
entire snippet took: 6387ms

socbm775:/home/sandreev/tmp/jep2240$ for run in {1..1}; do rm TestByteArrayCopy.class ; /usr/lib/jvm/java-11/bin/javac TestByteArrayCopy.java && /usr/lib/jvm/java-11/bin/java -XX:+UseG1GC -Xmx12g -verbose:gc TestByteArrayCopy; done
[0.010s][info][gc] Using G1
java.version=11.0.8
[1.174s][info][gc] GC(0) Pause Young (Concurrent Start) (G1 Humongous Allocation) 898M->897M(2012M) 761.009ms
[1.174s][info][gc] GC(1) Concurrent Cycle
[1.194s][info][gc] GC(1) Pause Remark 937M->49M(2012M) 0.721ms
[1.266s][info][gc] GC(1) Pause Cleanup 369M->369M(2012M) 0.221ms
[1.270s][info][gc] GC(1) Concurrent Cycle 95.876ms
[2.115s][info][gc] GC(2) Pause Young (Concurrent Start) (G1 Humongous Allocation) 897M->897M(2012M) 732.343ms
[2.115s][info][gc] GC(3) Concurrent Cycle
[2.128s][info][gc] GC(3) Pause Remark 921M->33M(2012M) 0.659ms
[2.179s][info][gc] GC(3) Pause Cleanup 257M->257M(2012M) 0.210ms
[2.192s][info][gc] GC(3) Concurrent Cycle 76.547ms
[3.017s][info][gc] GC(4) Pause Young (Concurrent Start) (G1 Humongous Allocation) 897M->897M(2012M) 700.434ms
[3.017s][info][gc] GC(5) Concurrent Cycle
[3.037s][info][gc] GC(5) Pause Remark 945M->57M(2012M) 1.239ms
[3.113s][info][gc] GC(5) Pause Cleanup 361M->361M(2012M) 0.184ms
[3.115s][info][gc] GC(5) Concurrent Cycle 97.582ms
[3.965s][info][gc] GC(6) Pause Young (Concurrent Start) (G1 Humongous Allocation) 897M->897M(2012M) 733.181ms
[3.965s][info][gc] GC(7) Concurrent Cycle
[3.981s][info][gc] GC(7) Pause Remark 929M->41M(2012M) 1.188ms
[4.036s][info][gc] GC(7) Pause Cleanup 297M->297M(2012M) 0.202ms
[4.038s][info][gc] GC(7) Concurrent Cycle 72.938ms
[4.868s][info][gc] GC(8) Pause Young (Concurrent Start) (G1 Humongous Allocation) 897M->897M(6036M) 698.219ms
[4.868s][info][gc] GC(9) Concurrent Cycle
[4.884s][info][gc] GC(9) Pause Remark 929M->41M(6036M) 0.680ms
[4.942s][info][gc] GC(9) Pause Cleanup 305M->305M(6036M) 0.444ms
[4.948s][info][gc] GC(9) Concurrent Cycle 80.178ms
[7.709s][info][gc] GC(10) Pause Young (Concurrent Start) (G1 Humongous Allocation) 2713M->2713M(6036M) 2114.295ms
[7.709s][info][gc] GC(11) Concurrent Cycle
[7.723s][info][gc] GC(11) Pause Remark 2745M->41M(6036M) 1.299ms
[7.787s][info][gc] GC(11) Pause Cleanup 329M->329M(6036M) 0.408ms
[7.793s][info][gc] GC(11) Concurrent Cycle 83.762ms
Arrays.copyOf(Byte[], int) repeated 1000 times took: 7824ms
entire snippet took: 7862ms

RHEL 8.2:

socbm370:/home/sandreev/tmp/jep2240$ /usr/lib/jvm/java-1.8.0-openjdk-1.8.0.262.b10-0.el8_2.x86_64/jre/bin/java -XX:+UseG1GC -Xmx12g -verbose:gc TestByteArrayCopy
java.version=1.8.0_262
[GC pause (G1 Humongous Allocation) (young) (initial-mark) 429M->427M(1002M), 0.8016226 secs]
[GC concurrent-root-region-scan-start]
[GC concurrent-root-region-scan-end, 0.0014161 secs]
[GC concurrent-mark-start]
[GC concurrent-mark-end, 0.0588239 secs]
[GC remark, 0.0009294 secs]
[GC cleanup 542M->122M(1002M), 0.0006460 secs]
[GC concurrent-cleanup-start]
[GC concurrent-cleanup-end, 0.0002427 secs]
Arrays.copyOf(Byte[], int) repeated 1000 times took: 4721ms
entire snippet took: 4778ms

socbm370:/home/sandreev/tmp/jep2240$ /usr/lib/jvm/java-11-openjdk-11.0.8.10-0.el8_2.x86_64/bin/java -XX:+UseG1GC -Xmx12g -verbose:gc TestByteArrayCopy
[0.118s][info][gc] Using G1
java.version=11.0.8
[0.903s][info][gc] GC(0) Pause Young (Concurrent Start) (G1 Humongous Allocation) 442M->441M(1006M) 395.497ms
[0.903s][info][gc] GC(1) Concurrent Cycle
[0.980s][info][gc] GC(1) Pause Remark 577M->145M(1006M) 1.576ms
[1.126s][info][gc] GC(1) Pause Cleanup 393M->393M(1006M) 0.206ms
[1.128s][info][gc] GC(1) Concurrent Cycle 224.652ms
[1.626s][info][gc] GC(2) Pause Young (Concurrent Start) (G1 Humongous Allocation) 441M->441M(1006M) 465.625ms
[1.626s][info][gc] GC(3) Concurrent Cycle
[1.670s][info][gc] GC(3) Pause Remark 545M->113M(1006M) 0.581ms
[1.826s][info][gc] GC(3) Pause Cleanup 689M->689M(1006M) 0.142ms
[1.828s][info][gc] GC(3) Concurrent Cycle 201.687ms
[2.469s][info][gc] GC(4) Pause Young (Concurrent Start) (G1 Humongous Allocation) 697M->697M(1006M) 639.730ms
[2.469s][info][gc] GC(5) Concurrent Cycle
[2.512s][info][gc] GC(5) Pause Remark 761M->73M(1006M) 11.285ms
[2.615s][info][gc] GC(5) Pause Cleanup 425M->425M(1006M) 0.167ms
[2.616s][info][gc] GC(5) Concurrent Cycle 147.054ms
[3.065s][info][gc] GC(6) Pause Young (Concurrent Start) (G1 Humongous Allocation) 441M->441M(1006M) 444.077ms
[3.065s][info][gc] GC(7) Concurrent Cycle
[3.082s][info][gc] GC(7) Pause Remark 465M->33M(1006M) 0.552ms
[3.146s][info][gc] GC(7) Pause Cleanup 161M->161M(1006M) 0.163ms
[3.147s][info][gc] GC(7) Concurrent Cycle 82.059ms
[3.668s][info][gc] GC(8) Pause Young (Concurrent Start) (G1 Humongous Allocation) 441M->441M(3018M) 399.169ms
[3.668s][info][gc] GC(9) Concurrent Cycle
[3.692s][info][gc] GC(9) Pause Remark 497M->65M(3018M) 0.734ms
[3.757s][info][gc] GC(9) Pause Cleanup 321M->321M(3018M) 0.284ms
[3.762s][info][gc] GC(9) Concurrent Cycle 93.373ms
[5.327s][info][gc] GC(10) Pause Young (Concurrent Start) (G1 Humongous Allocation) 1353M->1353M(3018M) 1194.457ms
[5.327s][info][gc] GC(11) Concurrent Cycle
[5.342s][info][gc] GC(11) Pause Remark 1385M->41M(3018M) 0.839ms
[5.380s][info][gc] GC(11) Pause Cleanup 169M->169M(3018M) 0.297ms
[5.386s][info][gc] GC(11) Concurrent Cycle 58.916ms
[6.890s][info][gc] GC(12) Pause Young (Concurrent Start) (G1 Humongous Allocation) 1353M->1353M(3018M) 1203.245ms
[6.890s][info][gc] GC(13) Concurrent Cycle
[6.912s][info][gc] GC(13) Pause Remark 1401M->57M(3018M) 0.887ms
[6.962s][info][gc] GC(13) Pause Cleanup 209M->209M(3018M) 0.262ms
[6.964s][info][gc] GC(13) Concurrent Cycle 73.295ms
[8.461s][info][gc] GC(14) Pause Young (Concurrent Start) (G1 Humongous Allocation) 1353M->1353M(3018M) 1145.752ms
[8.461s][info][gc] GC(15) Concurrent Cycle
[8.475s][info][gc] GC(15) Pause Remark 1385M->41M(3018M) 0.853ms
[8.529s][info][gc] GC(15) Pause Cleanup 257M->257M(3018M) 0.346ms
[8.531s][info][gc] GC(15) Concurrent Cycle 69.997ms
[9.959s][info][gc] GC(16) Pause Young (Concurrent Start) (G1 Humongous Allocation) 1353M->1353M(6726M) 1164.921ms
[9.959s][info][gc] GC(17) Concurrent Cycle
[10.014s][info][gc] GC(17) Pause Remark 1417M->73M(6726M) 6.023ms
[10.079s][info][gc] GC(17) Pause Cleanup 218M->218M(6726M) 0.499ms
[10.089s][info][gc] GC(17) Concurrent Cycle 129.518ms
Arrays.copyOf(Byte[], int) repeated 1000 times took: 9726ms
entire snippet took: 9773ms

(again on RHEL 8.2 we have only a JRE installed, so compilation is done with respective JDK version on RHEL 7.4)

Comment 60 Roman Kennke 2020-11-26 15:32:10 UTC
It is notable that JDK 8 does only one GC cycle, triggered by a humongous allocation, while JDK 11 does several regular GC cycles. Can you run both experiments with -XX:+PrintFlagsFinal and check which values you get for G1HeapRegionSize?

Comment 61 Simeon Andreev 2020-11-26 16:02:33 UTC
Created attachment 1733828 [details]
Output with -XX:+PrintFlagsFinal on RHEL 7.4, JDK 1.8 and JDK 11.

Attached the full output of running the snippet with -XX:+PrintFlagsFinal.

Comment 62 Roman Kennke 2020-11-26 17:38:00 UTC
Thank you.

It seems interesting that with JDK8 we only ever see one single GC cycle, while with JDK11 we see 11 cycles. This would explain the decreased performance: during GC, G1 enables barriers which affect the performance of arraycopy on reference arrays.

The question is, why does that happen? I compared the effective flags, and all the relevant ones look the same. I suspect that in JDK8 the heap is expanded more aggressively right after the first cycle, so that future allocations can be served without GCing. The fact that it performs better with -Xms12g supports that theory. But let's see what it is actually doing: can you run with -XX:+PrintAdaptiveSizePolicy? This should tell us when the heap gets expanded and by how much.
 

        java.version=1.8.0_262
        [GC pause (G1 Humongous Allocation) (young) (initial-mark) 856M->855M(2008M), 3.0465101 secs]
        [GC concurrent-root-region-scan-start]
        [GC concurrent-root-region-scan-end, 0.0004176 secs]
        [GC concurrent-mark-start]
        [GC concurrent-mark-end, 0.0419778 secs]
        [GC remark, 0.0015284 secs]
        [GC cleanup 992M->145M(2008M), 0.0009359 secs]
        [GC concurrent-cleanup-start]
        [GC concurrent-cleanup-end, 0.0007592 secs]
        Arrays.copyOf(Byte[], int) repeated 1000 times took: 5803ms
         
        java.version=11.0.8
        [1.376s][info][gc] GC(0) Pause Young (Concurrent Start) (G1 Humongous Allocation) 898M->897M(2012M) 959.827ms
        [1.376s][info][gc] GC(1) Concurrent Cycle
        [1.397s][info][gc] GC(1) Pause Remark 937M->49M(2012M) 0.676ms
        [1.469s][info][gc] GC(1) Pause Cleanup 369M->369M(2012M) 0.224ms
        [1.480s][info][gc] GC(1) Concurrent Cycle 103.824ms
        [2.567s][info][gc] GC(2) Pause Young (Concurrent Start) (G1 Humongous Allocation) 897M->897M(2012M) 977.013ms
        [2.567s][info][gc] GC(3) Concurrent Cycle
        [2.579s][info][gc] GC(3) Pause Remark 921M->33M(2012M) 0.684ms
        [2.626s][info][gc] GC(3) Pause Cleanup 241M->241M(2012M) 0.183ms
        [2.636s][info][gc] GC(3) Concurrent Cycle 69.234ms
        [3.690s][info][gc] GC(4) Pause Young (Concurrent Start) (G1 Humongous Allocation) 897M->897M(2012M) 914.754ms
        [3.690s][info][gc] GC(5) Concurrent Cycle
        [3.711s][info][gc] GC(5) Pause Remark 937M->49M(2012M) 0.636ms
        [3.779s][info][gc] GC(5) Pause Cleanup 345M->345M(2012M) 0.196ms
        [3.781s][info][gc] GC(5) Concurrent Cycle 90.267ms
        [4.827s][info][gc] GC(6) Pause Young (Concurrent Start) (G1 Humongous Allocation) 897M->897M(2012M) 922.191ms
        [4.827s][info][gc] GC(7) Concurrent Cycle
        [4.840s][info][gc] GC(7) Pause Remark 921M->33M(2012M) 1.176ms
        [4.883s][info][gc] GC(7) Pause Cleanup 225M->225M(2012M) 0.198ms
        [4.885s][info][gc] GC(7) Concurrent Cycle 58.246ms
        [5.963s][info][gc] GC(8) Pause Young (Concurrent Start) (G1 Humongous Allocation) 897M->897M(6036M) 927.560ms
        [5.963s][info][gc] GC(9) Concurrent Cycle
        [5.976s][info][gc] GC(9) Pause Remark 921M->33M(6036M) 0.719ms
        [6.026s][info][gc] GC(9) Pause Cleanup 241M->241M(6036M) 0.441ms
        [6.033s][info][gc] GC(9) Concurrent Cycle 70.097ms
        [9.656s][info][gc] GC(10) Pause Young (Concurrent Start) (G1 Humongous Allocation) 2713M->2713M(6036M) 3002.418ms
        [9.656s][info][gc] GC(11) Concurrent Cycle
        [9.671s][info][gc] GC(11) Pause Remark 2745M->41M(6036M) 1.278ms
        [9.742s][info][gc] GC(11) Pause Cleanup 337M->337M(6036M) 0.453ms
        [9.752s][info][gc] GC(11) Concurrent Cycle 95.469ms
        Arrays.copyOf(Byte[], int) repeated 1000 times took: 9772ms
        entire snippet took: 9809ms

Comment 63 Simeon Andreev 2020-11-27 07:53:11 UTC
Created attachment 1734031 [details]
Output with -XX:+PrintAdaptiveSizePolicy on RHEL 7.4, JDK 1.8. On JDK 11 used -Xlog:gc+ergo*=debug as replacement.

"-XX:+PrintAdaptiveSizePolicy" seems to be invalid for JDK 11, so I used "-Xlog:gc+ergo*=debug" instead. Let me know if I should use something else.

Comment 64 Roman Kennke 2020-11-30 17:57:05 UTC
(In reply to Simeon Andreev from comment #63)
> Created attachment 1734031 [details]
> Output with -XX:+PrintAdaptiveSizePolicy on RHEL 7.4, JDK 1.8. On JDK 11 used
> -Xlog:gc+ergo*=debug as replacement.
> 
> "-XX+PrintAdaptiveSizePolicy" seems to be invalid for JDK 11, so I used
> "-Xlog:gc+ergo*=debug" instead. Let me know if I should use something else.

Ok. If I read this correctly, G1 in JDK8 repeatedly and aggressively expands the heap, instead of starting a GC cycle. It gets away with no GC cycles at all, except for the very first one.

JDK11 does the opposite: it first tries to do a few GC cycles to keep heap size small, and only starts expanding after a few cycles. The additional barrier overhead seems to be what makes the JDK11 test go slower for you.

The solution is to give a more reasonable -Xms, e.g. -Xms6g or even just the same as -Xmx (afaik, G1 in JDK11 wouldn't shrink anyway once it expanded the heap to -Xmx, so you can just commit all that memory to begin with and avoid the overhead later).

I don't see that we can easily fix this in the JDK without impacting other workloads.
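A quick way to sanity-check the -Xms suggestion above in a running application is to compare the committed heap with the -Xmx ceiling. This is a minimal sketch (not from the bug report; the class name is illustrative) using only the standard Runtime API:

```java
// Minimal sketch: compare the currently committed heap to the -Xmx ceiling.
// If the committed size keeps climbing toward max during the workload,
// starting with -Xms closer to -Xmx avoids the repeated expansion (and the
// GC ergonomics churn described above).
public class HeapSizeProbe {
    public static void main(String[] args) {
        Runtime rt = Runtime.getRuntime();
        long committedMb = rt.totalMemory() >> 20; // current committed heap, starts near -Xms
        long maxMb = rt.maxMemory() >> 20;         // upper bound, i.e. -Xmx
        System.out.println("committed=" + committedMb + "M max=" + maxMb + "M");
    }
}
```

Running this with -Xms512m -Xmx8g and again with -Xms8g -Xmx8g should show the difference in the starting committed size.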

Comment 65 Andrey Loskutov 2020-11-30 18:06:51 UTC
(In reply to Roman Kennke from comment #64)
> I don't see that we can easily fix this in the JDK without impacting other
> workloads.

See comment 3 - JDK 15 seems to have "fixed" this again. Was the strategy changed back?

Comment 66 Roman Kennke 2020-11-30 19:25:49 UTC
I know. Maybe they changed the strategy back, maybe they improved it overall somehow. I don't know. I do know that there have been many significant changes in G1 in this area; for example, G1 will now also uncommit heap regions that it doesn't need anymore. Even if we could find the specific change that improved the situation again (I will have a look at the list of changes in G1 between 11 and 15), it seems fairly risky to backport such a change blindly. If possible at all, this would require very careful investigation. It seems much less risky and rather simple to run the application with a reasonable -Xms setting. In fact, it is *always* a good plan to pass good -Xms and -Xmx values to the JVM. Is there a problem with passing a reasonably higher -Xms?

Comment 67 Andrey Loskutov 2020-11-30 19:41:12 UTC
(In reply to Roman Kennke from comment #66)
> Is there a problem with passing
> a reasonable higher -Xms ?

Who knows which other regressions we will see with that, *if* we find a reasonable value that fixes the performance issue.
The JVM is not the only process running in our application, so increasing -Xms may "steal" memory from other processes, resulting in a performance issue elsewhere.

We will try to play with -Xms and report back what we find with the snippet & application itself. Thanks for looking into this!

Comment 68 Roman Kennke 2020-11-30 20:07:51 UTC
(In reply to Andrey Loskutov from comment #67)
> (In reply to Roman Kennke from comment #66)
> > Is there a problem with passing
> > a reasonable higher -Xms ?
> 
> Who knows which other regressions we will see with that, *if* we will find a
> reasonable value that fixes the performance issue.
> JVM is not only process running in our application, so increasing Xms may
> "steal" memory from other processes, resulting again in some performance
> issue just in other place.
> 
> We will try to play with Xms and report back what we've found with the
> snippet & application itself. Thanks for looking into this!

G1 is the default GC in JDK11. A change there affects almost *every* user/customer, and it is not clear whether or not this would be an improvement in the general case. And if anything goes wrong with that, we get everybody in trouble. I am not going to make a risky change there if there is a chance that this can be fixed with a reasonable -Xms. Remember: memory usage with G1 in JDK11 only ever grows; if you are worried about the JVM taking memory away from other processes, then you may want to reconsider your -Xmx setting. -Xms is only the initial heap size. The heap is likely to arrive at -Xmx at some point anyway.

Comment 69 Simeon Andreev 2020-12-01 16:01:33 UTC
When specifying different -Xms values, I'm only seeing worse performance in OpenJDK 1.8. I'm not seeing better performance in OpenJDK 11:

socbm775:/home/sandreev/tmp/jep2240$ rm TestByteArrayCopy.class ; /usr/lib/jvm/java-1.8.0/bin/javac TestByteArrayCopy.java && /usr/lib/jvm/java-1.8.0/bin/java -XX:+UseG1GC -Xmx12g -Xms12g TestByteArrayCopy
java.version=1.8.0_262
Arrays.copyOf(Byte[], int) repeated 1000 times took: 22308ms
entire snippet took: 22353ms
socbm775:/home/sandreev/tmp/jep2240$ rm TestByteArrayCopy.class ; /usr/lib/jvm/java-11/bin/javac TestByteArrayCopy.java && /usr/lib/jvm/java-11/bin/java -XX:+UseG1GC -Xmx12g -Xms12g TestByteArrayCopy
java.version=11.0.8
Arrays.copyOf(Byte[], int) repeated 1000 times took: 8709ms
entire snippet took: 8746ms

socbm775:/home/sandreev/tmp/jep2240$ rm TestByteArrayCopy.class ; /usr/lib/jvm/java-1.8.0/bin/javac TestByteArrayCopy.java && /usr/lib/jvm/java-1.8.0/bin/java -XX:+UseG1GC -Xmx12g TestByteArrayCopy
java.version=1.8.0_262
Arrays.copyOf(Byte[], int) repeated 1000 times took: 6342ms
entire snippet took: 6375ms
socbm775:/home/sandreev/tmp/jep2240$ rm TestByteArrayCopy.class ; /usr/lib/jvm/java-11/bin/javac TestByteArrayCopy.java && /usr/lib/jvm/java-11/bin/java -XX:+UseG1GC -Xmx12g TestByteArrayCopy
java.version=11.0.8
Arrays.copyOf(Byte[], int) repeated 1000 times took: 9301ms
entire snippet took: 9340ms

socbm775:/home/sandreev/tmp/jep2240$ rm TestByteArrayCopy.class ; /usr/lib/jvm/java-1.8.0/bin/javac TestByteArrayCopy.java && /usr/lib/jvm/java-1.8.0/bin/java -XX:+UseG1GC -Xmx12g -Xms6g TestByteArrayCopy
java.version=1.8.0_262
Arrays.copyOf(Byte[], int) repeated 1000 times took: 11792ms
entire snippet took: 11827ms
socbm775:/home/sandreev/tmp/jep2240$ rm TestByteArrayCopy.class ; /usr/lib/jvm/java-11/bin/javac TestByteArrayCopy.java && /usr/lib/jvm/java-11/bin/java -XX:+UseG1GC -Xmx12g -Xms6g TestByteArrayCopy
java.version=11.0.8
Arrays.copyOf(Byte[], int) repeated 1000 times took: 8175ms
entire snippet took: 8214ms

Comment 70 Roman Kennke 2020-12-01 19:55:28 UTC
Ok, interesting.

We should not rely on the snippet, though. The way it is written, it is grossly unreliable. It clearly measures something, but it is absolutely not clear what: it could be warmup time, JIT compilation overhead, or anything else. I strongly suggest that we change/extend the JMH test that I've provided so that it shows the problem more clearly, and then we can debug this. JMH takes care of most benchmarking pitfalls quite nicely, and once we see the problem there, JMH lets us drill in and see which code (down to assembly level) is the hot code.

Comment 74 Andrew John Hughes 2021-08-16 12:38:49 UTC
This has been idle all this year so far. What is the current status?

Comment 75 Simeon Andreev 2021-08-16 12:53:56 UTC
We still see the degradation in our application. We don't know how to "better" reduce the code which shows the performance degradation, to comply with comment 70.

Comment 77 Simeon Andreev 2021-10-25 14:49:49 UTC
(In reply to Roman Kennke from comment #70)

As before, no idea how to reduce our application to a few-line benchmark. At most I can check exactly what data is being copied with Arrays.copyOf(), and then use the same data for the snippet/benchmark; performance timers in our code led us to the Arrays.copyOf(Byte[]) calls. I'll check with Andrey whether we will invest time in this.

With the provided benchmark (no changes), I see:


...:/home/sandreev/tmp/jep2240/rhel/bytearycpy$ /usr/lib/jvm/java-11/bin/java -XX:+UseG1GC -Xms512m -Xmx8g -jar target/benchmarks.jar TestByteArrayCopy
WARNING: An illegal reflective access operation has occurred
WARNING: Illegal reflective access by org.openjdk.jmh.util.Utils (file:/home/sandreev/tmp/jep2240/rhel/bytearycpy/target/benchmarks.jar) to field java.io.Console.cs
WARNING: Please consider reporting this to the maintainers of org.openjdk.jmh.util.Utils
WARNING: Use --illegal-access=warn to enable warnings of further illegal reflective access operations
WARNING: All illegal access operations will be denied in a future release
# JMH version: 1.26
# VM version: JDK 11.0.10, OpenJDK 64-Bit Server VM, 11.0.10+9-LTS
# VM invoker: /usr/lib/jvm/java-11-openjdk-11.0.10.0.9-1.el7_9.x86_64/bin/java
# VM options: -XX:+UseG1GC -Xms512m -Xmx8g
# Warmup: 5 iterations, 10 s each
# Measurement: 5 iterations, 10 s each
# Timeout: 10 min per iteration
# Threads: 1 thread, will synchronize iterations
# Benchmark mode: Throughput, ops/time
# Benchmark: com.redhat.test.TestByteArrayCopy.test
# Parameters: (size = 2000000)

# Run progress: 0.00% complete, ETA 00:08:20
# Fork: 1 of 5
# Warmup Iteration   1: 127.782 ops/s
# Warmup Iteration   2: 119.977 ops/s
# Warmup Iteration   3: 371.962 ops/s
# Warmup Iteration   4: 510.211 ops/s
# Warmup Iteration   5: 533.200 ops/s
Iteration   1: 569.155 ops/s
Iteration   2: 569.925 ops/s
Iteration   3: 577.482 ops/s
Iteration   4: 590.540 ops/s
Iteration   5: 589.867 ops/s

# Run progress: 20.00% complete, ETA 00:06:42
# Fork: 2 of 5
# Warmup Iteration   1: 115.482 ops/s
# Warmup Iteration   2: 108.611 ops/s
# Warmup Iteration   3: 481.968 ops/s
# Warmup Iteration   4: 525.948 ops/s
# Warmup Iteration   5: 560.495 ops/s
Iteration   1: 579.424 ops/s
Iteration   2: 580.063 ops/s
Iteration   3: 586.730 ops/s
Iteration   4: 582.806 ops/s
Iteration   5: 571.221 ops/s

# Run progress: 40.00% complete, ETA 00:05:04
# Fork: 3 of 5
# Warmup Iteration   1: 120.873 ops/s
# Warmup Iteration   2: 107.551 ops/s
# Warmup Iteration   3: 474.751 ops/s
# Warmup Iteration   4: 517.245 ops/s
# Warmup Iteration   5: 555.823 ops/s
Iteration   1: 564.218 ops/s
Iteration   2: 578.410 ops/s
Iteration   3: 586.837 ops/s
Iteration   4: 586.264 ops/s
Iteration   5: 574.152 ops/s

# Run progress: 60.00% complete, ETA 00:03:23
# Fork: 4 of 5
# Warmup Iteration   1: 121.829 ops/s
# Warmup Iteration   2: 124.185 ops/s
# Warmup Iteration   3: 472.564 ops/s
# Warmup Iteration   4: 517.524 ops/s
# Warmup Iteration   5: 548.365 ops/s
Iteration   1: 568.103 ops/s
Iteration   2: 592.352 ops/s
Iteration   3: 586.285 ops/s
Iteration   4: 592.231 ops/s
Iteration   5: 588.189 ops/s

# Run progress: 80.00% complete, ETA 00:01:41
# Fork: 5 of 5
# Warmup Iteration   1: 116.719 ops/s
# Warmup Iteration   2: 111.155 ops/s
# Warmup Iteration   3: 478.959 ops/s
# Warmup Iteration   4: 529.781 ops/s
# Warmup Iteration   5: 538.293 ops/s
Iteration   1: 563.849 ops/s
Iteration   2: 574.354 ops/s
Iteration   3: 569.874 ops/s
Iteration   4: 576.075 ops/s
Iteration   5: 579.891 ops/s


Result "com.redhat.test.TestByteArrayCopy.test":
  579.132 ±(99.9%) 6.627 ops/s [Average]
  (min, avg, max) = (563.849, 579.132, 592.352), stdev = 8.847
  CI (99.9%): [572.505, 585.759] (assumes normal distribution)


# Run complete. Total time: 00:08:28

REMEMBER: The numbers below are just data. To gain reusable insights, you need to follow up on
why the numbers are the way they are. Use profilers (see -prof, -lprof), design factorial
experiments, perform baseline and negative tests that provide experimental control, make sure
the benchmarking environment is safe on JVM/OS/HW level, ask for reviews from the domain experts.
Do not assume the numbers tell you what you want them to tell.

Benchmark                (size)   Mode  Cnt    Score   Error  Units
TestByteArrayCopy.test  2000000  thrpt   25  579.132 ± 6.627  ops/s


...:/home/sandreev/tmp/jep2240/rhel/bytearycpy$ /data/tmp/jre-8u181-linux-x64/jre1.8.0_181/bin/java -XX:+UseG1GC -Xms512m -Xmx8g -jar target/benchmarks.jar TestByteArrayCopy
# JMH version: 1.26
# VM version: JDK 1.8.0_181, Java HotSpot(TM) 64-Bit Server VM, 25.181-b13
# VM invoker: /data/tmp/jre-8u181-linux-x64/jre1.8.0_181/bin/java
# VM options: -XX:+UseG1GC -Xms512m -Xmx8g
# Warmup: 5 iterations, 10 s each
# Measurement: 5 iterations, 10 s each
# Timeout: 10 min per iteration
# Threads: 1 thread, will synchronize iterations
# Benchmark mode: Throughput, ops/time
# Benchmark: com.redhat.test.TestByteArrayCopy.test
# Parameters: (size = 2000000)

# Run progress: 0.00% complete, ETA 00:08:20
# Fork: 1 of 5
# Warmup Iteration   1: 38.466 ops/s
# Warmup Iteration   2: 221.146 ops/s
# Warmup Iteration   3: 206.954 ops/s
# Warmup Iteration   4: 221.695 ops/s
# Warmup Iteration   5: 229.257 ops/s
Iteration   1: 226.557 ops/s
Iteration   2: 232.463 ops/s
Iteration   3: 218.153 ops/s
Iteration   4: 233.829 ops/s
Iteration   5: 230.345 ops/s

# Run progress: 20.00% complete, ETA 00:07:59
# Fork: 2 of 5
# Warmup Iteration   1: 38.469 ops/s
# Warmup Iteration   2: 203.775 ops/s
# Warmup Iteration   3: 208.940 ops/s
# Warmup Iteration   4: 236.138 ops/s
# Warmup Iteration   5: 224.775 ops/s
Iteration   1: 230.788 ops/s
Iteration   2: 232.497 ops/s
Iteration   3: 218.175 ops/s
Iteration   4: 227.657 ops/s
Iteration   5: 232.267 ops/s

# Run progress: 40.00% complete, ETA 00:05:59
# Fork: 3 of 5
# Warmup Iteration   1: 41.499 ops/s
# Warmup Iteration   2: 223.229 ops/s
# Warmup Iteration   3: 216.217 ops/s
# Warmup Iteration   4: 232.287 ops/s
# Warmup Iteration   5: 239.693 ops/s
Iteration   1: 236.312 ops/s
Iteration   2: 200.450 ops/s
Iteration   3: 232.421 ops/s
Iteration   4: 224.571 ops/s
Iteration   5: 233.315 ops/s

# Run progress: 60.00% complete, ETA 00:03:58
# Fork: 4 of 5
# Warmup Iteration   1: 36.709 ops/s
# Warmup Iteration   2: 216.162 ops/s
# Warmup Iteration   3: 220.014 ops/s
# Warmup Iteration   4: 233.063 ops/s
# Warmup Iteration   5: 239.657 ops/s
Iteration   1: 216.951 ops/s
Iteration   2: 230.073 ops/s
Iteration   3: 212.168 ops/s
Iteration   4: 223.988 ops/s
Iteration   5: 235.122 ops/s

# Run progress: 80.00% complete, ETA 00:01:59
# Fork: 5 of 5
# Warmup Iteration   1: 34.313 ops/s
# Warmup Iteration   2: 226.345 ops/s
# Warmup Iteration   3: 213.087 ops/s
# Warmup Iteration   4: 231.221 ops/s
# Warmup Iteration   5: 232.686 ops/s
Iteration   1: 227.066 ops/s
Iteration   2: 228.689 ops/s
Iteration   3: 216.182 ops/s
Iteration   4: 233.526 ops/s
Iteration   5: 223.083 ops/s


Result "com.redhat.test.TestByteArrayCopy.test":
  226.266 ±(99.9%) 6.403 ops/s [Average]
  (min, avg, max) = (200.450, 226.266, 236.312), stdev = 8.547
  CI (99.9%): [219.863, 232.669] (assumes normal distribution)


# Run complete. Total time: 00:10:00

REMEMBER: The numbers below are just data. To gain reusable insights, you need to follow up on
why the numbers are the way they are. Use profilers (see -prof, -lprof), design factorial
experiments, perform baseline and negative tests that provide experimental control, make sure
the benchmarking environment is safe on JVM/OS/HW level, ask for reviews from the domain experts.
Do not assume the numbers tell you what you want them to tell.

Benchmark                (size)   Mode  Cnt    Score   Error  Units
TestByteArrayCopy.test  2000000  thrpt   25  226.266 ± 6.403  ops/s


...:/home/sandreev/tmp/jep2240/rhel/bytearycpy$ /usr/lib/jvm/java-1.8.0/bin/java -XX:+UseG1GC -Xms512m -Xmx8g -jar target/benchmarks.jar TestByteArrayCopy
# JMH version: 1.26
# VM version: JDK 1.8.0_282, OpenJDK 64-Bit Server VM, 25.282-b08
# VM invoker: /usr/lib/jvm/java-1.8.0-openjdk-1.8.0.282.b08-1.el7_9.x86_64/jre/bin/java
# VM options: -XX:+UseG1GC -Xms512m -Xmx8g
# Warmup: 5 iterations, 10 s each
# Measurement: 5 iterations, 10 s each
# Timeout: 10 min per iteration
# Threads: 1 thread, will synchronize iterations
# Benchmark mode: Throughput, ops/time
# Benchmark: com.redhat.test.TestByteArrayCopy.test
# Parameters: (size = 2000000)

# Run progress: 0.00% complete, ETA 00:08:20
# Fork: 1 of 5
# Warmup Iteration   1: 37.233 ops/s
# Warmup Iteration   2: 95.486 ops/s
# Warmup Iteration   3: 210.030 ops/s
# Warmup Iteration   4: 229.411 ops/s
# Warmup Iteration   5: 236.842 ops/s
Iteration   1: 234.087 ops/s
Iteration   2: 241.966 ops/s
Iteration   3: 224.214 ops/s
Iteration   4: 229.057 ops/s
Iteration   5: 243.408 ops/s

# Run progress: 20.00% complete, ETA 00:08:02
# Fork: 2 of 5
# Warmup Iteration   1: 41.422 ops/s
# Warmup Iteration   2: 229.687 ops/s
# Warmup Iteration   3: 227.888 ops/s
# Warmup Iteration   4: 235.010 ops/s
# Warmup Iteration   5: 237.072 ops/s
Iteration   1: 220.131 ops/s
Iteration   2: 232.929 ops/s
Iteration   3: 240.381 ops/s
Iteration   4: 238.468 ops/s
Iteration   5: 219.600 ops/s

# Run progress: 40.00% complete, ETA 00:05:57
# Fork: 3 of 5
# Warmup Iteration   1: 36.178 ops/s
# Warmup Iteration   2: 222.853 ops/s
# Warmup Iteration   3: 224.844 ops/s
# Warmup Iteration   4: 237.254 ops/s
# Warmup Iteration   5: 229.619 ops/s
Iteration   1: 233.495 ops/s
Iteration   2: 214.449 ops/s
Iteration   3: 245.675 ops/s
Iteration   4: 232.525 ops/s
Iteration   5: 242.771 ops/s

# Run progress: 60.00% complete, ETA 00:03:59
# Fork: 4 of 5
# Warmup Iteration   1: 35.241 ops/s
# Warmup Iteration   2: 87.385 ops/s
# Warmup Iteration   3: 222.375 ops/s
# Warmup Iteration   4: 208.025 ops/s
# Warmup Iteration   5: 244.371 ops/s
Iteration   1: 246.017 ops/s
Iteration   2: 242.849 ops/s
Iteration   3: 214.375 ops/s
Iteration   4: 249.397 ops/s
Iteration   5: 237.778 ops/s

# Run progress: 80.00% complete, ETA 00:02:00
# Fork: 5 of 5
# Warmup Iteration   1: 40.901 ops/s
# Warmup Iteration   2: 95.482 ops/s
# Warmup Iteration   3: 207.553 ops/s
# Warmup Iteration   4: 226.225 ops/s
# Warmup Iteration   5: 233.268 ops/s
Iteration   1: 236.959 ops/s
Iteration   2: 234.856 ops/s
Iteration   3: 216.235 ops/s
Iteration   4: 240.222 ops/s
Iteration   5: 240.158 ops/s


Result "com.redhat.test.TestByteArrayCopy.test":
  234.080 ±(99.9%) 7.785 ops/s [Average]
  (min, avg, max) = (214.375, 234.080, 249.397), stdev = 10.393
  CI (99.9%): [226.295, 241.865] (assumes normal distribution)


# Run complete. Total time: 00:09:59

REMEMBER: The numbers below are just data. To gain reusable insights, you need to follow up on
why the numbers are the way they are. Use profilers (see -prof, -lprof), design factorial
experiments, perform baseline and negative tests that provide experimental control, make sure
the benchmarking environment is safe on JVM/OS/HW level, ask for reviews from the domain experts.
Do not assume the numbers tell you what you want them to tell.

Benchmark                (size)   Mode  Cnt    Score   Error  Units
TestByteArrayCopy.test  2000000  thrpt   25  234.080 ± 7.785  ops/s

(installed Java 11 is 11.0.10, installed Java 8 is 1.8.0_282)


Here are summaries for more JVM versions with slightly different heap arguments: -Xmx4g and -Xms512m

JVM             GC             (size)      Mode      Cnt    Score       Error       Units

Java 1.8.0_282  ParallelGC     2000000     thrpt     25     668.486     ± 4.254     ops/s
Java 1.8.0_282  G1GC           2000000     thrpt     25     234.794     ± 3.866     ops/s

Java 11.0.10    ParallelGC     2000000     thrpt     25     667.958     ± 3.005     ops/s
Java 11.0.10    G1GC           2000000     thrpt     25     554.807     ± 6.625     ops/s

Java 15.0.1     ParallelGC     2000000     thrpt     25     662.809     ± 4.104     ops/s
Java 15.0.1     G1GC           2000000     thrpt     25     402.874     ± 2.563     ops/s

Java 16.0.1     ParallelGC     2000000     thrpt     25     666.737     ± 4.046     ops/s
Java 16.0.1     G1GC           2000000     thrpt     25     365.033     ± 5.768     ops/s

Java 17 ea 29   ParallelGC     2000000     thrpt     25     665.201     ± 3.865     ops/s
Java 17 ea 29   G1GC           2000000     thrpt     25     340.814     ± 7.928     ops/s

I'm listing those as there seem to be degradations comparing Java 11 with Java 15+. We'll probably open a ticket for this; I will have to discuss it with my colleagues (we are evaluating Java 17 support, to replace our Java 8 support in a year or so).

Comment 78 Simeon Andreev 2021-10-26 13:13:28 UTC
If I add a printout of the time spent in GC, it seems G1GC does next to nothing for the snippet with Java 8, and a lot with Java 11, for a "low" iteration count. For higher iteration counts, G1GC seems to work better in Java 11.

public class TestByteArrayCopy {

	public static void main(String[] args) throws java.io.IOException {
		System.out.println("java.version=" + System.getProperty("java.version"));
		long s = System.currentTimeMillis();

		long totalElapsed = 0;
		int iterations = Integer.valueOf(args[0]);
		int n = 2_000_000;

		Byte[] array = new Byte[n];
		for (int i = 0; i < array.length; i++) {
			array[i] = Byte.valueOf((byte) (i % Byte.MAX_VALUE));
		}

		long start = System.currentTimeMillis();
		for (int iteration = 0; iteration < iterations; iteration++) {
			array = java.util.Arrays.copyOf(array, array.length);
		}
		long end = System.currentTimeMillis();
		long elapsed = end - start;
		totalElapsed += elapsed;

		for (int i = 0; i < array.length; i++) {
			if (!array[i].equals(Byte.valueOf((byte) (i % Byte.MAX_VALUE)))) {
				throw new IllegalStateException();
			}
		}

		long e = System.currentTimeMillis();
		System.out.println("Arrays.copyOf(Byte[], int) repeated " + iterations + " times took: " + totalElapsed + "ms");
		System.out.println("entire snippet took: " + (e - s) + "ms");
		System.out.println("GC took = " + getGarbageCollectionTime() + " ms");
	}

	private static long getGarbageCollectionTime() {
		long collectionTime = 0;
		for (java.lang.management.GarbageCollectorMXBean garbageCollectorMXBean : java.lang.management.ManagementFactory.getGarbageCollectorMXBeans()) {
			collectionTime += garbageCollectorMXBean.getCollectionTime();
		}
		return collectionTime;
	}
}

...:/home/sandreev/tmp/jep2240$ /data/tmp/jre-8u181-linux-x64/jre1.8.0_181/bin/java -Xmx8g -Xms512m -XX:+UseG1GC TestByteArrayCopy 1000
java.version=1.8.0_181
Arrays.copyOf(Byte[], int) repeated 1000 times took: 4336ms
entire snippet took: 4368ms
GC took = 607 ms

...:/home/sandreev/tmp/jep2240$ /usr/lib/jvm/java-11/bin/java -Xmx8g -Xms512m -XX:+UseG1GC TestByteArrayCopy 1000
java.version=11.0.10
Arrays.copyOf(Byte[], int) repeated 1000 times took: 8513ms
entire snippet took: 8550ms
GC took = 6094 ms

...:/home/sandreev/tmp/jep2240$ /data/tmp/jre-8u181-linux-x64/jre1.8.0_181/bin/java -Xmx8g -Xms512m -XX:+UseG1GC TestByteArrayCopy 10000
java.version=1.8.0_181
Arrays.copyOf(Byte[], int) repeated 10000 times took: 64374ms
entire snippet took: 64405ms
GC took = 27248 ms

...:/home/sandreev/tmp/jep2240$ /usr/lib/jvm/java-11/bin/java -Xmx8g -Xms512m -XX:+UseG1GC TestByteArrayCopy 10000
java.version=11.0.10
Arrays.copyOf(Byte[], int) repeated 10000 times took: 35083ms
entire snippet took: 35113ms
GC took = 15051 ms


I tried adjusting whatever shared G1GC settings there are, to match their values to the ones Java 11 uses; I see no difference:

...:/home/sandreev/tmp/jep2240$ /data/tmp/jre-8u181-linux-x64/jre1.8.0_181/bin/java -XX:G1HeapRegionSize=4194304 -XX:G1RSetScanBlockSize=64 -XX:G1ConcRefinementThreads=13 -XX:G1RSetSparseRegionEntries=12 -XX:G1ConcRefinementThresholdStep=2 -Xmx8g -Xms512m -XX:+UseG1GC TestByteArrayCopy 1000
java.version=1.8.0_181
Arrays.copyOf(Byte[], int) repeated 1000 times took: 4477ms
entire snippet took: 4507ms
GC took = 617 ms


Is this enough info on our end, or do you need more?

Comment 79 Simeon Andreev 2021-10-26 13:14:55 UTC
Regarding the Java 15+ degradation (see comment 77), should we open another RHEL bugzilla ticket? The degradation is seen with the JMH benchmark, so it shouldn't be a "warmup problem" (like apparently the one here is).

Comment 80 Roman Kennke 2021-10-26 13:36:18 UTC
(In reply to Simeon Andreev from comment #78)
> If I add print for time spent in GC, it seems G1GC does next to nothing for
> the snippet with Java 8, and a lot with Java 11, for "low" iteration count.
> For higher iteration time, G1GC seems to work better in Java 11.
> 
> public class TestByteArrayCopy {
> 
> 	public static void main(String[] args) throws java.io.IOException {
> 		System.out.println("java.version=" + System.getProperty("java.version"));
> 		long s = System.currentTimeMillis();
> 
> 		long totalElapsed = 0;
> 		int iterations = Integer.valueOf(args[0]);
> 		int n = 2_000_000;
> 
> 		Byte[] array = new Byte[n];
> 		for (int i = 0; i < array.length; i++) {
> 			array[i] = Byte.valueOf((byte) (i % Byte.MAX_VALUE));
> 		}
> 
> 		long start = System.currentTimeMillis();
> 		for (int iteration = 0; iteration < iterations; iteration++) {
> 			array = java.util.Arrays.copyOf(array, array.length);
> 		}
> 		long end = System.currentTimeMillis();
> 		long elapsed = end - start;
> 		totalElapsed += elapsed;
> 
> 		for (int i = 0; i < array.length; i++) {
> 			if (!array[i].equals(Byte.valueOf((byte) (i % Byte.MAX_VALUE)))) {
> 				throw new IllegalStateException();
> 			}
> 		}
> 
> 		long e = System.currentTimeMillis();
> 		System.out.println("Arrays.copyOf(Byte[], int) repeated " + iterations + "
> times took: " + totalElapsed + "ms");
> 		System.out.println("entire snippet took: " + (e - s) + "ms");
> 		System.out.println("GC took = " + getGarbageCollectionTime() + " ms");
> 	}
> 
> 	private static long getGarbageCollectionTime() {
> 		long collectionTime = 0;
> 		for (java.lang.management.GarbageCollectorMXBean garbageCollectorMXBean :
> java.lang.management.ManagementFactory.getGarbageCollectorMXBeans()) {
> 			collectionTime += garbageCollectorMXBean.getCollectionTime();
> 		}
> 		return collectionTime;
> 	}
> }
> 
> ...:/home/sandreev/tmp/jep2240$
> /data/tmp/jre-8u181-linux-x64/jre1.8.0_181/bin/java -Xmx8g -Xms512m
> -XX:+UseG1GC TestByteArrayCopy 1000
> java.version=1.8.0_181
> Arrays.copyOf(Byte[], int) repeated 1000 times took: 4336ms
> entire snippet took: 4368ms
> GC took = 607 ms
> 
> ...:/home/sandreev/tmp/jep2240$ /usr/lib/jvm/java-11/bin/java -Xmx8g
> -Xms512m -XX:+UseG1GC TestByteArrayCopy 1000
> java.version=11.0.10
> Arrays.copyOf(Byte[], int) repeated 1000 times took: 8513ms
> entire snippet took: 8550ms
> GC took = 6094 ms
> 
> ...:/home/sandreev/tmp/jep2240$
> /data/tmp/jre-8u181-linux-x64/jre1.8.0_181/bin/java -Xmx8g -Xms512m
> -XX:+UseG1GC TestByteArrayCopy 10000
> java.version=1.8.0_181
> Arrays.copyOf(Byte[], int) repeated 10000 times took: 64374ms
> entire snippet took: 64405ms
> GC took = 27248 ms
> 
> ...:/home/sandreev/tmp/jep2240$ /usr/lib/jvm/java-11/bin/java -Xmx8g
> -Xms512m -XX:+UseG1GC TestByteArrayCopy 10000
> java.version=11.0.10
> Arrays.copyOf(Byte[], int) repeated 10000 times took: 35083ms
> entire snippet took: 35113ms
> GC took = 15051 ms
> 
> 
> I tried adjusting whatever shared G1GC settings there are, to match their
> values to the ones Java 11 uses, I see no difference:
> 
> ...:/home/sandreev/tmp/jep2240$
> /data/tmp/jre-8u181-linux-x64/jre1.8.0_181/bin/java
> -XX:G1HeapRegionSize=4194304 -XX:G1RSetScanBlockSize=64
> -XX:G1ConcRefinementThreads=13 -XX:G1RSetSparseRegionEntries=12
> -XX:G1ConcRefinementThresholdStep=2 -Xmx8g -Xms512m -XX:+UseG1GC
> TestByteArrayCopy 1000
> java.version=1.8.0_181
> Arrays.copyOf(Byte[], int) repeated 1000 times took: 4477ms
> entire snippet took: 4507ms
> GC took = 617 ms
> 
> 
> Is this enough info on our end, or do you need more?

It might be useful to see GC logs for each run, to see what the GC is actually doing there.

Roman

Comment 81 Simeon Andreev 2021-10-26 13:49:52 UTC
Created attachment 1837207 [details]
GC log from Java 8 181 with the reproduction snippet.

Comment 82 Simeon Andreev 2021-10-26 13:50:51 UTC
Created attachment 1837208 [details]
GC log from Java 11 10 with the reproduction snippet.

...:/home/sandreev/tmp/jep2240$ rm /tmp/java8_181_gclog.txt ; /data/tmp/jre-8u181-linux-x64/jre1.8.0_181/bin/java -XX:+PrintGCDetails -Xloggc:/tmp/java8_181_gclog.txt -Xmx8g -Xms512m -XX:+UseG1GC TestByteArrayCopy 1000
java.version=1.8.0_181
Arrays.copyOf(Byte[], int) repeated 1000 times took: 4587ms
entire snippet took: 4621ms
GC took = 642 ms

...:/home/sandreev/tmp/jep2240$ rm /tmp/java11_10_gclog.txt ; /usr/lib/jvm/java-11/bin/java -XX:+PrintGCDetails -Xloggc:/tmp/java11_10_gclog.txt -Xmx8g -Xms512m -XX:+UseG1GC TestByteArrayCopy 1000
[0.000s][warning][gc] -Xloggc is deprecated. Will use -Xlog:gc:/tmp/java11_10_gclog.txt instead.
[0.001s][warning][gc] -XX:+PrintGCDetails is deprecated. Will use -Xlog:gc* instead.
java.version=11.0.10
Arrays.copyOf(Byte[], int) repeated 1000 times took: 10758ms
entire snippet took: 10811ms
GC took = 8322 ms

Let me know if I should increase the log verbosity (and how to do so).

Comment 83 Roman Kennke 2021-10-28 18:39:05 UTC
It looks to me like in JDK8, the GC starts with a small young gen (512K) but, after the first and only GC cycle, grows it to 4MB. JDK11 doesn't seem to grow it as aggressively, and instead starts a GC cycle whenever copyOf() allocates a new array/humongous object. You could tune the young generation to be larger, see here:

https://www.oracle.com/technical-resources/articles/java/g1gc.html

But in any case, this seems to be a warmup problem. As soon as it reaches the final configuration, it should be ok.

Also note that hitting many humongous allocation failures is not a good sign. The trouble is that humongous objects are only reclaimed in full collections, IIRC. If this is happening in production, you may want to increase the heap region size to more than 2x the size of your large objects. For example, the snippet keeps allocating 2MB byte arrays, but the heap region size is only 2MB. You need 4MB heap regions to avoid humongous allocations.
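G1's documented rule is that an allocation is treated as humongous when it occupies at least half a heap region. The following back-of-the-envelope sketch (not from the bug report; the sizes are illustrative assumptions) shows why the reproduction snippet's Byte[2_000_000] is humongous with small regions:

```java
// Sketch of G1's humongous-allocation rule: an object is humongous when it
// is at least half a heap region in size (see the G1 tuning guide).
public class HumongousCheck {
    static boolean isHumongous(long objectBytes, long regionBytes) {
        return objectBytes >= regionBytes / 2;
    }

    public static void main(String[] args) {
        // A Byte[2_000_000] holds ~8 MB of references alone (assuming 4-byte
        // compressed oops, header ignored); the boxed Byte values themselves
        // come from the Byte.valueOf() cache and are not reallocated.
        long refArrayBytes = 2_000_000L * 4;
        System.out.println(isHumongous(refArrayBytes, 2L << 20));  // 2 MB regions: humongous
        System.out.println(isHumongous(refArrayBytes, 32L << 20)); // 32 MB regions: regular allocation
    }
}
```

With -XX:G1HeapRegionSize large enough that the array is under half a region, the repeated copies would no longer go down the humongous-allocation path.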

Comment 84 Andrey Loskutov 2021-10-28 19:35:06 UTC
(In reply to Roman Kennke from comment #83)
> But in any case, this seems to be a warmup problem. As soon as it reaches
> the final configuration, it should be ok.

Do you have an explanation for the slowdown with JDK 17 (see comment 79 / comment 77)?
Should we open a new bug for that?

> Also note that hitting many humongous alloc failures is not a good sign.
> Trouble is that humongous objects are only reclaimed in full collections,
> IIRC. If this is happening in production, you may want to increase the heap
> region size to be more than 2x large as your large objects.

Thanks for the pointers. We can't know upfront what kind of objects will be created, because we offer a platform where customers are free to run their own code. What is almost certain is that they will create *a lot* of objects (the test data received from hardware), but we can make no assumptions regarding individual object size. Measurement results can be gazillions of bits, bit sets, doubles or byte arrays, and the latter is where we see the slowdown.

Typically we run with 12 GB max heap, sometimes we only need 4GB, sometimes even 18 GB is just enough.
So if we are going to optimize GC with some flags, we should be pretty generic to make everyone happy.

Comment 85 Roman Kennke 2021-10-28 20:05:50 UTC
(In reply to Andrey Loskutov from comment #84)
> (In reply to Roman Kennke from comment #83)
> > But in any case, this seems to be a warmup problem. As soon as it reaches
> > the final configuration, it should be ok.
> 
> Do you have explanation for slowdown with JDK 17, see comment 79 / comment
> 77? 

No, I don't currently have one.

> Should we open a new bug for that?

Not sure. We should better find out what it is.

> > Also note that hitting many humongous alloc failures is not a good sign.
> > Trouble is that humongous objects are only reclaimed in full collections,
> > IIRC. If this is happening in production, you may want to increase the heap
> > region size to be more than 2x large as your large objects.
> 
> Thanks for pointers. We can't know upfront what kind of objects will be
> created, because we offer a platform for customers where they are free to
> run their own code. Almost sure thing is that they will create *a lot* of
> objects (the test data received from hardware), but no assumptions regarding
> the single object size. Measulement results can be gazillions of bits, bit
> sets, doubles or byte arrays, and later is where we see the slowdown.

Notice that it is *not* byte[] but Byte[]. This is a *huge* difference, and I would very strongly suggest *not* using Byte[] but byte[] instead. Byte values are objects, and copying object arrays causes all sorts of interference with the GC, so I am not surprised that we see wildly different results. Please, if you can use byte[] instead, do so. It will show much more stable results.
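The suggested change can be sketched as follows. The loop mirrors the reproduction snippet from comment 78, only with a primitive byte[], so Arrays.copyOf() becomes a flat memory copy with no per-element references for the GC to track (the class name and iteration count are illustrative):

```java
import java.util.Arrays;

// Same shape as the reproduction snippet, but with a primitive byte[]:
// the array holds raw bytes rather than references to boxed Byte objects,
// so Arrays.copyOf() involves no GC reference processing.
public class TestPrimitiveByteArrayCopy {
    public static void main(String[] args) {
        int n = 2_000_000;
        byte[] array = new byte[n];
        for (int i = 0; i < n; i++) {
            array[i] = (byte) (i % Byte.MAX_VALUE);
        }

        long start = System.currentTimeMillis();
        for (int iteration = 0; iteration < 1000; iteration++) {
            array = Arrays.copyOf(array, array.length);
        }
        long end = System.currentTimeMillis();

        System.out.println("byte[] copyOf repeated 1000 times took: " + (end - start) + "ms");
    }
}
```

One would expect this variant to show far smaller differences between JDK 8 and JDK 11, since the copies no longer produce reference arrays for the collector to scan.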

Comment 86 Andrey Loskutov 2021-10-28 20:12:17 UTC
(In reply to Roman Kennke from comment #85)
> Notice that it is *not* byte[] but Byte[]. This is a *huge* difference, and
> I would very strongly suggest using byte[] instead of Byte[]. Byte instances
> are objects, and copying object references causes all sorts of interference
> with the GC, so I am not surprised that we see wildly different results.
> Please, if you can use byte[] instead, do so; it will give much more stable
> results.

Sure, but it's not our code, and it's not only Byte arrays; see comment 27. Arrays of all reference types are affected; Byte[] is just the one affected most.

Comment 87 Roman Kennke 2021-10-29 06:40:55 UTC
(In reply to Andrey Loskutov from comment #86)
> (In reply to Roman Kennke from comment #85)
> > Notice that it is *not* byte[] but Byte[]. This is a *huge* difference, and
> > I would very strongly suggest using byte[] instead of Byte[]. Byte instances
> > are objects, and copying object references causes all sorts of interference
> > with the GC, so I am not surprised that we see wildly different results.
> > Please, if you can use byte[] instead, do so; it will give much more stable
> > results.
> 
> Sure, but it's not our code, and it's not only Byte arrays; see comment 27.
> Arrays of all reference types are affected; Byte[] is just the one affected most.

The trouble is that it could be anything. We have already established that it is most likely not the arraycopy per se that is causing the slowdown; it looks more like an ergonomics problem. The JDK 17 slowdown could be caused by the GC releasing memory back to the OS, from where it would have to be re-allocated later when another copy of the array is made. It would be better to run with -Xms == -Xmx to avoid such ergonomics trouble.
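For example, combining the two suggestions made in this thread into one launch line (the 12g heap and 32m region values below are illustrative placeholders, not tuned recommendations for this workload):

```shell
# Pin the heap so the GC never shrinks and regrows it (-Xms == -Xmx),
# and enlarge G1 regions so large arrays stop being humongous allocations.
# Heap and region sizes here are examples only; tune to the actual workload.
java -Xms12g -Xmx12g -XX:G1HeapRegionSize=32m TestByteArrayCopy 1000
```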

Could we get GC logs from the actual problem from the customer? Because I don't think that the snippets measure what happens there.

Comment 88 Simeon Andreev 2021-10-29 06:55:27 UTC
(In reply to Roman Kennke from comment #87)
> Could we get GC logs from the actual problem from the customer? Because I
> don't think that the snippets measure what happens there.

I'll re-run the code in question on Monday. I'll see if I can still use 1.8_181; it might not be trivial, as we have since updated to 1.8_282.

Comment 89 Roman Kennke 2021-10-29 07:07:23 UTC
(In reply to Simeon Andreev from comment #88)
> (In reply to Roman Kennke from comment #87)
> > Could we get GC logs from the actual problem from the customer? Because I
> > don't think that the snippets measure what happens there.
> 
> I'll re-run the code in question on Monday. I'll see if I can still use
> 1.8_181; it might not be trivial, as we have since updated to 1.8_282.

Why would you want to run it with 181 if you can run it with 282? The problem might well have been fixed already in the later version.

Comment 90 Simeon Andreev 2021-10-29 07:54:41 UTC
(In reply to Roman Kennke from comment #89)
> 
> Why would you want to run it with 181 if you can run it with 282? The
> problem might well have been fixed already in the later version.

The problem we have is with Java 11, not Java 8; at worst, it was also introduced in 1.8_282. We did make performance measurements prior to updating to 1.8_282, but I don't remember whether we observed this problem in 1.8_282 as well. If I see the same degradation when comparing 11 and 1.8_282, I'll attach logs from 1.8_282.

Comment 91 Simeon Andreev 2021-10-29 16:08:27 UTC
Created attachment 1838349 [details]
GC logs from original case of performance degradation.

I still have to check whether the degradation is actually seen; the Java 8 run is with 1.8_282, and the performance numbers I get with GC logging enabled are useless.
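For anyone reproducing the logging setup, the roughly equivalent GC-logging switches on the two JDKs look like this (log file names and the main class are placeholders):

```shell
# JDK 8: legacy GC logging flags
java -Xloggc:gc-jdk8.log -XX:+PrintGCDetails -XX:+PrintGCDateStamps \
     TestByteArrayCopy 1000

# JDK 11: the flags above were removed; unified logging (-Xlog) replaces them
java -Xlog:gc*:file=gc-jdk11.log:time,uptime TestByteArrayCopy 1000
```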

Comment 92 Simeon Andreev 2021-10-29 16:49:44 UTC
OK, I also checked performance: the degradation is seen when comparing 1.8_282 to 11.0.10 as well.

I have not measured time spent in GC in this case, though.

Comment 97 Mike Millson 2021-12-21 22:05:14 UTC
Created attachment 1847239 [details]
JDK8 profiling

java -agentpath:/home/mmillson/opt/async-profiler-2.5.1-linux-x64/build/libasyncProfiler.so=start,event=wall,file=profile-8.html TestByteArrayCopy 1000

Comment 98 Mike Millson 2021-12-21 22:06:32 UTC
Created attachment 1847240 [details]
JDK11 profiling

java -agentpath:/home/mmillson/opt/async-profiler-2.5.1-linux-x64/build/libasyncProfiler.so=start,event=wall,file=profile-11.html TestByteArrayCopy 1000

Comment 102 Mike Millson 2021-12-23 22:12:41 UTC
Comment on attachment 1847239 [details]
JDK8 profiling

><!DOCTYPE html>
><html lang='en'>
><head>
><meta charset='utf-8'>
><style>
>	body {margin: 0; padding: 10px; background-color: #ffffff}
>	h1 {margin: 5px 0 0 0; font-size: 18px; font-weight: normal; text-align: center}
>	header {margin: -24px 0 5px 0; line-height: 24px}
>	button {font: 12px sans-serif; cursor: pointer}
>	p {margin: 5px 0 5px 0}
>	a {color: #0366d6}
>	#hl {position: absolute; display: none; overflow: hidden; white-space: nowrap; pointer-events: none; background-color: #ffffe0; outline: 1px solid #ffc000; height: 15px}
>	#hl span {padding: 0 3px 0 3px}
>	#status {overflow: hidden; white-space: nowrap}
>	#match {overflow: hidden; white-space: nowrap; display: none; float: right; text-align: right}
>	#reset {cursor: pointer}
></style>
></head>
><body style='font: 12px Verdana, sans-serif'>
><h1>Wall clock profile</h1>
><header style='text-align: left'><button id='reverse' title='Reverse'>&#x1f53b;</button>&nbsp;&nbsp;<button id='search' title='Search'>&#x1f50d;</button></header>
><header style='text-align: right'>Produced by <a href='https://github.com/jvm-profiling-tools/async-profiler'>async-profiler</a></header>
><canvas id='canvas' style='width: 100%; height: 288px'></canvas>
><div id='hl'><span></span></div>
><p id='match'>Matched: <span id='matchval'></span> <span id='reset' title='Clear'>&#x274c;</span></p>
><p id='status'>&nbsp;</p>
><script>
>	// Copyright 2020 Andrei Pangin
>	// Licensed under the Apache License, Version 2.0.
>	'use strict';
>	var root, rootLevel, px, pattern;
>	var reverse = false;
>	const levels = Array(18);
>	for (let h = 0; h < levels.length; h++) {
>		levels[h] = [];
>	}
>
>	const canvas = document.getElementById('canvas');
>	const c = canvas.getContext('2d');
>	const hl = document.getElementById('hl');
>	const status = document.getElementById('status');
>
>	const canvasWidth = canvas.offsetWidth;
>	const canvasHeight = canvas.offsetHeight;
>	canvas.style.width = canvasWidth + 'px';
>	canvas.width = canvasWidth * (devicePixelRatio || 1);
>	canvas.height = canvasHeight * (devicePixelRatio || 1);
>	if (devicePixelRatio) c.scale(devicePixelRatio, devicePixelRatio);
>	c.font = document.body.style.font;
>
>	const palette = [
>		[0x50e150, 30, 30, 30],
>		[0x50bebe, 30, 30, 30],
>		[0xe17d00, 30, 30,  0],
>		[0xc8c83c, 30, 30, 10],
>		[0xe15a5a, 30, 40, 40],
>	];
>
>	function getColor(p) {
>		const v = Math.random();
>		return '#' + (p[0] + ((p[1] * v) << 16 | (p[2] * v) << 8 | (p[3] * v))).toString(16);
>	}
>
>	function f(level, left, width, type, title) {
>		levels[level].push({left: left, width: width, color: getColor(palette[type]), title: title});
>	}
>
>	function samples(n) {
>		return n === 1 ? '1 sample' : n.toString().replace(/\B(?=(\d{3})+(?!\d))/g, ',') + ' samples';
>	}
>
>	function pct(a, b) {
>		return a >= b ? '100' : (100 * a / b).toFixed(2);
>	}
>
>	function findFrame(frames, x) {
>		let left = 0;
>		let right = frames.length - 1;
>
>		while (left <= right) {
>			const mid = (left + right) >>> 1;
>			const f = frames[mid];
>
>			if (f.left > x) {
>				right = mid - 1;
>			} else if (f.left + f.width <= x) {
>				left = mid + 1;
>			} else {
>				return f;
>			}
>		}
>
>		if (frames[left] && (frames[left].left - x) * px < 0.5) return frames[left];
>		if (frames[right] && (x - (frames[right].left + frames[right].width)) * px < 0.5) return frames[right];
>
>		return null;
>	}
>
>	function search(r) {
>		if (r && (r = prompt('Enter regexp to search:', '')) === null) {
>			return;
>		}
>
>		pattern = r ? RegExp(r) : undefined;
>		const matched = render(root, rootLevel);
>		document.getElementById('matchval').textContent = pct(matched, root.width) + '%';
>		document.getElementById('match').style.display = r ? 'inherit' : 'none';
>	}
>
>	function render(newRoot, newLevel) {
>		if (root) {
>			c.fillStyle = '#ffffff';
>			c.fillRect(0, 0, canvasWidth, canvasHeight);
>		}
>
>		root = newRoot || levels[0][0];
>		rootLevel = newLevel || 0;
>		px = canvasWidth / root.width;
>
>		const x0 = root.left;
>		const x1 = x0 + root.width;
>		const marked = [];
>
>		function mark(f) {
>			return marked[f.left] >= f.width || (marked[f.left] = f.width);
>		}
>
>		function totalMarked() {
>			let total = 0;
>			let left = 0;
>			Object.keys(marked).sort(function(a, b) { return a - b; }).forEach(function(x) {
>				if (+x >= left) {
>					total += marked[x];
>					left = +x + marked[x];
>				}
>			});
>			return total;
>		}
>
>		function drawFrame(f, y, alpha) {
>			if (f.left < x1 && f.left + f.width > x0) {
>				c.fillStyle = pattern && f.title.match(pattern) && mark(f) ? '#ee00ee' : f.color;
>				c.fillRect((f.left - x0) * px, y, f.width * px, 15);
>
>				if (f.width * px >= 21) {
>					const chars = Math.floor(f.width * px / 7);
>					const title = f.title.length <= chars ? f.title : f.title.substring(0, chars - 2) + '..';
>					c.fillStyle = '#000000';
>					c.fillText(title, Math.max(f.left - x0, 0) * px + 3, y + 12, f.width * px - 6);
>				}
>
>				if (alpha) {
>					c.fillStyle = 'rgba(255, 255, 255, 0.5)';
>					c.fillRect((f.left - x0) * px, y, f.width * px, 15);
>				}
>			}
>		}
>
>		for (let h = 0; h < levels.length; h++) {
>			const y = reverse ? h * 16 : canvasHeight - (h + 1) * 16;
>			const frames = levels[h];
>			for (let i = 0; i < frames.length; i++) {
>				drawFrame(frames[i], y, h < rootLevel);
>			}
>		}
>
>		return totalMarked();
>	}
>
>	canvas.onmousemove = function() {
>		const h = Math.floor((reverse ? event.offsetY : (canvasHeight - event.offsetY)) / 16);
>		if (h >= 0 && h < levels.length) {
>			const f = findFrame(levels[h], event.offsetX / px + root.left);
>			if (f) {
>				hl.style.left = (Math.max(f.left - root.left, 0) * px + canvas.offsetLeft) + 'px';
>				hl.style.width = (Math.min(f.width, root.width) * px) + 'px';
>				hl.style.top = ((reverse ? h * 16 : canvasHeight - (h + 1) * 16) + canvas.offsetTop) + 'px';
>				hl.firstChild.textContent = f.title;
>				hl.style.display = 'block';
>				canvas.title = f.title + '\n(' + samples(f.width) + ', ' + pct(f.width, levels[0][0].width) + '%)';
>				canvas.style.cursor = 'pointer';
>				canvas.onclick = function() {
>					if (f != root) {
>						render(f, h);
>						canvas.onmousemove();
>					}
>				};
>				status.textContent = 'Function: ' + canvas.title;
>				return;
>			}
>		}
>		canvas.onmouseout();
>	}
>
>	canvas.onmouseout = function() {
>		hl.style.display = 'none';
>		status.textContent = '\xa0';
>		canvas.title = '';
>		canvas.style.cursor = '';
>		canvas.onclick = '';
>	}
>
>	document.getElementById('reverse').onclick = function() {
>		reverse = !reverse;
>		render();
>	}
>
>	document.getElementById('search').onclick = function() {
>		search(true);
>	}
>
>	document.getElementById('reset').onclick = function() {
>		search(false);
>	}
>
>	window.onkeydown = function() {
>		if (event.ctrlKey && event.keyCode === 70) {
>			event.preventDefault();
>			search(true);
>		} else if (event.keyCode === 27) {
>			search(false);
>		}
>	}
>f(0,0,2151,4,'all')
>f(1,0,409,4,'.annobin___pthread_cond_timedwait.start')
>f(1,409,41,0,'TestByteArrayCopy.main')
>f(2,409,1,0,'java/lang/System.currentTimeMillis')
>f(3,409,1,4,'__libc_read')
>f(2,410,40,0,'java/util/Arrays.copyOf')
>f(3,410,40,0,'java/util/Arrays.copyOf')
>f(4,410,8,0,'java/lang/System.arraycopy')
>f(5,410,8,4,'JVM_ArrayCopy')
>f(6,410,3,4,'.annobin__ZN13ObjArrayKlass10copy_arrayEP12arrayOopDesciS1_iiP6Thread.start')
>f(7,410,3,4,'.annobin__ZN10BarrierSet15write_ref_arrayEP8HeapWordm.start')
>f(8,410,3,3,'G1SATBCardTableLoggingModRefBS::write_ref_array_work(MemRegion)')
>f(9,410,3,3,'PtrQueue::enqueue_known_active(void*)')
>f(10,410,3,3,'PtrQueueSet::process_or_enqueue_complete_buffer(void**)')
>f(11,410,3,4,'.annobin__ZN17DirtyCardQueueSet18mut_process_bufferEPPv.start')
>f(12,410,3,3,'RefineCardTableEntryClosure::do_card_ptr(signed char*, unsigned int)')
>f(13,410,3,3,'G1RemSet::refine_card(signed char*, unsigned int, bool) [clone .part.78]')
>f(14,410,3,3,'HeapRegion::oops_on_card_seq_iterate_careful(MemRegion, FilterOutOfRegionClosure*, signed char*)')
>f(15,410,3,3,'ObjArrayKlass::oop_oop_iterate_nv_m(oopDesc*, FilterOutOfRegionClosure*, MemRegion)')
>f(16,410,1,4,'.annobin__ZN17OtherRegionsTable13add_referenceEPvi.start')
>f(16,411,2,4,'.annobin__ZN29G1UpdateRSOrPushRefOopClosure6do_oopEPj.start')
>f(17,412,1,4,'.annobin__ZN17OtherRegionsTable13add_referenceEPvi.start')
>f(6,413,5,4,'aci_CopyRight')
>f(4,418,18,0,'java/lang/reflect/Array.newInstance')
>f(5,418,18,0,'java/lang/reflect/Array.newArray')
>f(6,418,11,4,'JVM_NewArray')
>f(7,418,11,3,'InstanceKlass::allocate_objArray(int, int, Thread*)')
>f(8,418,1,3,'CollectedHeap::common_mem_allocate_init(KlassHandle, unsigned long, Thread*)')
>f(9,418,1,3,'G1CollectedHeap::mem_allocate(unsigned long, bool*)')
>f(10,418,1,3,'G1CollectedHeap::attempt_allocation_humongous(unsigned long, unsigned int*, unsigned int*)')
>f(11,418,1,3,'G1CollectedHeap::humongous_obj_allocate(unsigned long, unsigned char)')
>f(12,418,1,3,'HeapRegionManager::find_contiguous(unsigned long, bool)')
>f(8,419,10,4,'__memset_avx2_erms')
>f(6,429,6,4,'__pthread_cond_wait')
>f(6,435,1,4,'sysmalloc')
>f(4,436,14,4,'oop_disjoint_arraycopy')
>f(1,450,9,4,'[not_walkable_Java]')
>f(2,450,9,3,'BarrierSet::static_write_ref_array_post(HeapWord*, unsigned long)')
>f(3,450,9,3,'G1SATBCardTableLoggingModRefBS::write_ref_array_work(MemRegion)')
>f(4,451,8,3,'PtrQueue::enqueue_known_active(void*)')
>f(5,451,8,3,'PtrQueueSet::process_or_enqueue_complete_buffer(void**)')
>f(6,451,8,4,'.annobin__ZN17DirtyCardQueueSet18mut_process_bufferEPPv.start')
>f(7,451,8,3,'RefineCardTableEntryClosure::do_card_ptr(signed char*, unsigned int)')
>f(8,451,8,3,'G1RemSet::refine_card(signed char*, unsigned int, bool) [clone .part.78]')
>f(9,451,8,3,'HeapRegion::oops_on_card_seq_iterate_careful(MemRegion, FilterOutOfRegionClosure*, signed char*)')
>f(10,451,8,3,'ObjArrayKlass::oop_oop_iterate_nv_m(oopDesc*, FilterOutOfRegionClosure*, MemRegion)')
>f(11,451,3,4,'.annobin__ZN17OtherRegionsTable13add_referenceEPvi.start')
>f(11,454,5,4,'.annobin__ZN29G1UpdateRSOrPushRefOopClosure6do_oopEPj.start')
>f(1,459,5,4,'[unknown_Java]')
>f(2,459,5,3,'BarrierSet::static_write_ref_array_post(HeapWord*, unsigned long)')
>f(3,459,5,3,'G1SATBCardTableLoggingModRefBS::write_ref_array_work(MemRegion)')
>f(4,459,5,3,'PtrQueue::enqueue_known_active(void*)')
>f(5,461,3,3,'PtrQueueSet::process_or_enqueue_complete_buffer(void**)')
>f(6,461,3,4,'.annobin__ZN17DirtyCardQueueSet18mut_process_bufferEPPv.start')
>f(7,461,3,3,'RefineCardTableEntryClosure::do_card_ptr(signed char*, unsigned int)')
>f(8,461,3,3,'G1RemSet::refine_card(signed char*, unsigned int, bool) [clone .part.78]')
>f(9,461,1,3,'G1CardCounts::add_card_count(signed char*)')
>f(9,462,2,3,'HeapRegion::oops_on_card_seq_iterate_careful(MemRegion, FilterOutOfRegionClosure*, signed char*)')
>f(10,462,2,3,'ObjArrayKlass::oop_oop_iterate_nv_m(oopDesc*, FilterOutOfRegionClosure*, MemRegion)')
>f(1,464,57,4,'__GI___pthread_timedjoin_ex')
>f(1,521,921,4,'__pthread_cond_wait')
>f(1,1442,56,4,'do_futex_wait.constprop.1')
>f(1,1498,56,0,'java/lang/ref/Finalizer$FinalizerThread.run')
>f(2,1498,56,0,'java/lang/ref/ReferenceQueue.remove')
>f(3,1498,56,0,'java/lang/ref/ReferenceQueue.remove')
>f(4,1498,56,0,'java/lang/Object.wait')
>f(5,1498,56,4,'__pthread_cond_wait')
>f(1,1554,56,0,'java/lang/ref/Reference$ReferenceHandler.run')
>f(2,1554,56,0,'java/lang/ref/Reference.tryHandlePending')
>f(3,1554,56,0,'java/lang/Object.wait')
>f(4,1554,56,0,'java/lang/Object.wait')
>f(5,1554,56,4,'__pthread_cond_wait')
>f(1,1610,9,4,'mprotect')
>f(1,1619,525,4,'start_thread')
>f(2,1619,1,4,'JavaMain')
>f(3,1619,1,4,'JNI_CreateJavaVM')
>f(4,1619,1,3,'Threads::create_vm(JavaVMInitArgs*, bool*)')
>f(5,1619,1,3,'Compiler::Compiler()')
>f(2,1620,524,4,'java_start(Thread*)')
>f(3,1620,480,4,'.annobin__ZN24ConcurrentG1RefineThread3runEv.start')
>f(4,1620,480,3,'DirtyCardQueueSet::apply_closure_to_completed_buffer(CardTableEntryClosure*, unsigned int, int, bool)')
>f(5,1621,1,3,'G1RemSet::refine_card(signed char*, unsigned int, bool)')
>f(5,1622,478,3,'RefineCardTableEntryClosure::do_card_ptr(signed char*, unsigned int)')
>f(6,1622,1,3,'FilterOutOfRegionClosure::FilterOutOfRegionClosure(HeapRegion*, OopClosure*)')
>f(6,1623,475,3,'G1RemSet::refine_card(signed char*, unsigned int, bool) [clone .part.78]')
>f(7,1643,3,3,'G1CardCounts::add_card_count(signed char*)')
>f(7,1646,451,3,'HeapRegion::oops_on_card_seq_iterate_careful(MemRegion, FilterOutOfRegionClosure*, signed char*)')
>f(8,1652,2,4,'.annobin__ZN15objArrayOopDesc11object_sizeEv.start')
>f(8,1654,35,4,'.annobin__ZN29G1UpdateRSOrPushRefOopClosure6do_oopEPj.start')
>f(8,1689,408,3,'ObjArrayKlass::oop_oop_iterate_nv_m(oopDesc*, FilterOutOfRegionClosure*, MemRegion)')
>f(9,1761,138,4,'.annobin__ZN17OtherRegionsTable13add_referenceEPvi.start')
>f(9,1899,198,4,'.annobin__ZN29G1UpdateRSOrPushRefOopClosure6do_oopEPj.start')
>f(10,2070,20,4,'.annobin__ZN17OtherRegionsTable13add_referenceEPvi.start')
>f(10,2090,7,3,'BitMap::par_at_put(unsigned long, bool)')
>f(7,2097,1,3,'MemRegion::intersection(MemRegion) const')
>f(6,2098,2,3,'HeapRegion::oops_on_card_seq_iterate_careful(MemRegion, FilterOutOfRegionClosure*, signed char*)')
>f(3,2100,44,3,'GangWorker::loop()')
>f(4,2100,44,3,'G1ParTask::work(unsigned int)')
>f(5,2100,40,3,'G1ParEvacuateFollowersClosure::do_void()')
>f(6,2101,34,4,'.annobin__ZN20G1ParScanThreadState10trim_queueEv.start')
>f(7,2134,1,4,'.annobin__ZN6Atomic7cmpxchgEaPVaa.start')
>f(6,2135,1,3,'G1SATBCardTableModRefBS::mark_card_deferred(unsigned long)')
>f(6,2136,3,3,'GenericTaskQueue<StarTask, (MemoryType)5, 131072u>::pop_global(StarTask volatile&)')
>f(6,2139,1,3,'GenericTaskQueueSet<OverflowTaskQueue<StarTask, (MemoryType)5, 131072u>, (MemoryType)5>::steal_best_of_2(unsigned int, int*, StarTask&)')
>f(5,2140,4,3,'G1RootProcessor::scan_remembered_sets(G1ParPushHeapRSClosure*, OopClosure*, unsigned int)')
>f(6,2140,4,3,'G1RemSet::oops_into_collection_set_do(G1ParPushHeapRSClosure*, CodeBlobClosure*, unsigned int)')
>f(7,2140,4,3,'G1RemSet::scanRS(G1ParPushHeapRSClosure*, CodeBlobClosure*, unsigned int)')
>f(8,2140,4,3,'G1CollectedHeap::collection_set_iterate_from(HeapRegion*, HeapRegionClosure*)')
>f(9,2140,4,4,'.annobin__ZN13ScanRSClosure12doHeapRegionEP10HeapRegion.start')
>f(10,2140,4,3,'DirtyCardToOopClosure::do_MemRegion(MemRegion)')
>f(11,2140,2,3,'G1BlockOffsetArrayContigSpace::block_start_unsafe(void const*)')
>f(11,2142,2,3,'HeapRegionDCTOC::walk_mem_region(MemRegion, HeapWord*, HeapWord*)')
>f(12,2142,2,3,'ObjArrayKlass::oop_oop_iterate_nv_m(oopDesc*, G1ParPushHeapRSClosure*, MemRegion)')
>f(1,2144,7,4,'sysmalloc')
>render();
></script></body></html>

Comment 103 Mike Millson 2021-12-23 22:13:13 UTC
Comment on attachment 1847240 [details]
JDK11 profiling

><!DOCTYPE html>
><html lang='en'>
><head>
><meta charset='utf-8'>
><style>
>	body {margin: 0; padding: 10px; background-color: #ffffff}
>	h1 {margin: 5px 0 0 0; font-size: 18px; font-weight: normal; text-align: center}
>	header {margin: -24px 0 5px 0; line-height: 24px}
>	button {font: 12px sans-serif; cursor: pointer}
>	p {margin: 5px 0 5px 0}
>	a {color: #0366d6}
>	#hl {position: absolute; display: none; overflow: hidden; white-space: nowrap; pointer-events: none; background-color: #ffffe0; outline: 1px solid #ffc000; height: 15px}
>	#hl span {padding: 0 3px 0 3px}
>	#status {overflow: hidden; white-space: nowrap}
>	#match {overflow: hidden; white-space: nowrap; display: none; float: right; text-align: right}
>	#reset {cursor: pointer}
></style>
></head>
><body style='font: 12px Verdana, sans-serif'>
><h1>Wall clock profile</h1>
><header style='text-align: left'><button id='reverse' title='Reverse'>&#x1f53b;</button>&nbsp;&nbsp;<button id='search' title='Search'>&#x1f50d;</button></header>
><header style='text-align: right'>Produced by <a href='https://github.com/jvm-profiling-tools/async-profiler'>async-profiler</a></header>
><canvas id='canvas' style='width: 100%; height: 224px'></canvas>
><div id='hl'><span></span></div>
><p id='match'>Matched: <span id='matchval'></span> <span id='reset' title='Clear'>&#x274c;</span></p>
><p id='status'>&nbsp;</p>
><script>
>	// Copyright 2020 Andrei Pangin
>	// Licensed under the Apache License, Version 2.0.
>	'use strict';
>	var root, rootLevel, px, pattern;
>	var reverse = false;
>	const levels = Array(14);
>	for (let h = 0; h < levels.length; h++) {
>		levels[h] = [];
>	}
>
>	const canvas = document.getElementById('canvas');
>	const c = canvas.getContext('2d');
>	const hl = document.getElementById('hl');
>	const status = document.getElementById('status');
>
>	const canvasWidth = canvas.offsetWidth;
>	const canvasHeight = canvas.offsetHeight;
>	canvas.style.width = canvasWidth + 'px';
>	canvas.width = canvasWidth * (devicePixelRatio || 1);
>	canvas.height = canvasHeight * (devicePixelRatio || 1);
>	if (devicePixelRatio) c.scale(devicePixelRatio, devicePixelRatio);
>	c.font = document.body.style.font;
>
>	const palette = [
>		[0x50e150, 30, 30, 30],
>		[0x50bebe, 30, 30, 30],
>		[0xe17d00, 30, 30,  0],
>		[0xc8c83c, 30, 30, 10],
>		[0xe15a5a, 30, 40, 40],
>	];
>
>	function getColor(p) {
>		const v = Math.random();
>		return '#' + (p[0] + ((p[1] * v) << 16 | (p[2] * v) << 8 | (p[3] * v))).toString(16);
>	}
>
>	function f(level, left, width, type, title) {
>		levels[level].push({left: left, width: width, color: getColor(palette[type]), title: title});
>	}
>
>	function samples(n) {
>		return n === 1 ? '1 sample' : n.toString().replace(/\B(?=(\d{3})+(?!\d))/g, ',') + ' samples';
>	}
>
>	function pct(a, b) {
>		return a >= b ? '100' : (100 * a / b).toFixed(2);
>	}
>
>	function findFrame(frames, x) {
>		let left = 0;
>		let right = frames.length - 1;
>
>		while (left <= right) {
>			const mid = (left + right) >>> 1;
>			const f = frames[mid];
>
>			if (f.left > x) {
>				right = mid - 1;
>			} else if (f.left + f.width <= x) {
>				left = mid + 1;
>			} else {
>				return f;
>			}
>		}
>
>		if (frames[left] && (frames[left].left - x) * px < 0.5) return frames[left];
>		if (frames[right] && (x - (frames[right].left + frames[right].width)) * px < 0.5) return frames[right];
>
>		return null;
>	}
>
>	function search(r) {
>		if (r && (r = prompt('Enter regexp to search:', '')) === null) {
>			return;
>		}
>
>		pattern = r ? RegExp(r) : undefined;
>		const matched = render(root, rootLevel);
>		document.getElementById('matchval').textContent = pct(matched, root.width) + '%';
>		document.getElementById('match').style.display = r ? 'inherit' : 'none';
>	}
>
>	function render(newRoot, newLevel) {
>		if (root) {
>			c.fillStyle = '#ffffff';
>			c.fillRect(0, 0, canvasWidth, canvasHeight);
>		}
>
>		root = newRoot || levels[0][0];
>		rootLevel = newLevel || 0;
>		px = canvasWidth / root.width;
>
>		const x0 = root.left;
>		const x1 = x0 + root.width;
>		const marked = [];
>
>		function mark(f) {
>			return marked[f.left] >= f.width || (marked[f.left] = f.width);
>		}
>
>		function totalMarked() {
>			let total = 0;
>			let left = 0;
>			Object.keys(marked).sort(function(a, b) { return a - b; }).forEach(function(x) {
>				if (+x >= left) {
>					total += marked[x];
>					left = +x + marked[x];
>				}
>			});
>			return total;
>		}
>
>		function drawFrame(f, y, alpha) {
>			if (f.left < x1 && f.left + f.width > x0) {
>				c.fillStyle = pattern && f.title.match(pattern) && mark(f) ? '#ee00ee' : f.color;
>				c.fillRect((f.left - x0) * px, y, f.width * px, 15);
>
>				if (f.width * px >= 21) {
>					const chars = Math.floor(f.width * px / 7);
>					const title = f.title.length <= chars ? f.title : f.title.substring(0, chars - 2) + '..';
>					c.fillStyle = '#000000';
>					c.fillText(title, Math.max(f.left - x0, 0) * px + 3, y + 12, f.width * px - 6);
>				}
>
>				if (alpha) {
>					c.fillStyle = 'rgba(255, 255, 255, 0.5)';
>					c.fillRect((f.left - x0) * px, y, f.width * px, 15);
>				}
>			}
>		}
>
>		for (let h = 0; h < levels.length; h++) {
>			const y = reverse ? h * 16 : canvasHeight - (h + 1) * 16;
>			const frames = levels[h];
>			for (let i = 0; i < frames.length; i++) {
>				drawFrame(frames[i], y, h < rootLevel);
>			}
>		}
>
>		return totalMarked();
>	}
>
>	canvas.onmousemove = function() {
>		const h = Math.floor((reverse ? event.offsetY : (canvasHeight - event.offsetY)) / 16);
>		if (h >= 0 && h < levels.length) {
>			const f = findFrame(levels[h], event.offsetX / px + root.left);
>			if (f) {
>				hl.style.left = (Math.max(f.left - root.left, 0) * px + canvas.offsetLeft) + 'px';
>				hl.style.width = (Math.min(f.width, root.width) * px) + 'px';
>				hl.style.top = ((reverse ? h * 16 : canvasHeight - (h + 1) * 16) + canvas.offsetTop) + 'px';
>				hl.firstChild.textContent = f.title;
>				hl.style.display = 'block';
>				canvas.title = f.title + '\n(' + samples(f.width) + ', ' + pct(f.width, levels[0][0].width) + '%)';
>				canvas.style.cursor = 'pointer';
>				canvas.onclick = function() {
>					if (f != root) {
>						render(f, h);
>						canvas.onmousemove();
>					}
>				};
>				status.textContent = 'Function: ' + canvas.title;
>				return;
>			}
>		}
>		canvas.onmouseout();
>	}
>
>	canvas.onmouseout = function() {
>		hl.style.display = 'none';
>		status.textContent = '\xa0';
>		canvas.title = '';
>		canvas.style.cursor = '';
>		canvas.onclick = '';
>	}
>
>	document.getElementById('reverse').onclick = function() {
>		reverse = !reverse;
>		render();
>	}
>
>	document.getElementById('search').onclick = function() {
>		search(true);
>	}
>
>	document.getElementById('reset').onclick = function() {
>		search(false);
>	}
>
>	window.onkeydown = function() {
>		if (event.ctrlKey && event.keyCode === 70) {
>			event.preventDefault();
>			search(true);
>		} else if (event.keyCode === 27) {
>			search(false);
>		}
>	}
>f(0,0,4864,4,'all')
>f(1,0,1,4,'.annobin___pthread_cond_signal.start')
>f(1,1,628,4,'.annobin___pthread_cond_timedwait.start')
>f(1,629,129,0,'TestByteArrayCopy.main')
>f(2,629,129,0,'java/util/Arrays.copyOf')
>f(3,629,129,0,'java/util/Arrays.copyOf')
>f(4,629,6,0,'java/lang/System.arraycopy')
>f(5,629,6,4,'JVM_ArrayCopy')
>f(6,629,6,3,'ObjArrayKlass::copy_array(arrayOopDesc*, int, arrayOopDesc*, int, int, Thread*)')
>f(7,629,6,3,'ObjArrayKlass::do_copy(arrayOopDesc*, unsigned long, arrayOopDesc*, unsigned long, int, Thread*) [clone .constprop.124]')
>f(8,629,3,4,'aci_CopyRight')
>f(8,632,3,3,'bool AccessInternal::PostRuntimeDispatch<G1BarrierSet::AccessBarrier<36225142ul, G1BarrierSet>, (AccessInternal::BarrierType)8, 36225142ul>::oop_access_barrier<HeapWord>(arrayOopDesc*, unsigned long, HeapWord*, arrayOopDesc*, unsigned long, HeapWord*, unsigned long)')
>f(9,632,3,3,'G1BarrierSet::write_ref_array_pre(unsigned int*, unsigned long, bool)')
>f(4,635,93,0,'java/lang/reflect/Array.newInstance')
>f(5,635,93,0,'java/lang/reflect/Array.newArray')
>f(6,635,6,4,'JVM_NewArray')
>f(7,635,6,3,'Reflection::reflect_new_array(oopDesc*, int, Thread*)')
>f(8,635,6,3,'InstanceKlass::allocate_objArray(int, int, Thread*)')
>f(9,635,6,3,'CollectedHeap::array_allocate(Klass*, int, int, bool, Thread*)')
>f(10,635,6,3,'MemAllocator::allocate() const')
>f(11,635,6,4,'__memset_avx2_erms')
>f(6,641,87,4,'__pthread_cond_wait')
>f(4,728,30,4,'oop_disjoint_arraycopy')
>f(1,758,1,4,'[not_walkable_Java]')
>f(2,758,1,3,'G1BarrierSet::invalidate(MemRegion)')
>f(3,758,1,3,'PtrQueue::enqueue_known_active(void*)')
>f(1,759,8,4,'[unknown]')
>f(2,759,8,4,'clock_gettime@@GLIBC_2.17')
>f(3,760,7,4,'__vdso_clock_gettime')
>f(1,767,1,4,'[unknown_Java]')
>f(2,767,1,3,'G1BarrierSet::invalidate(MemRegion)')
>f(3,767,1,3,'PtrQueue::enqueue_known_active(void*)')
>f(1,768,132,4,'__GI___pthread_timedjoin_ex')
>f(1,900,1249,4,'__pthread_cond_wait')
>f(1,2149,1028,4,'do_futex_wait.constprop.1')
>f(1,3177,132,0,'java/lang/ref/Finalizer$FinalizerThread.run')
>f(2,3177,132,0,'java/lang/ref/ReferenceQueue.remove')
>f(3,3177,132,0,'java/lang/ref/ReferenceQueue.remove')
>f(4,3177,132,0,'java/lang/Object.wait')
>f(5,3177,132,4,'__pthread_cond_wait')
>f(1,3309,132,0,'java/lang/ref/Reference$ReferenceHandler.run')
>f(2,3309,132,0,'java/lang/ref/Reference.processPendingReferences')
>f(3,3309,132,0,'java/lang/ref/Reference.waitForReferencePendingList')
>f(4,3309,132,4,'__pthread_cond_wait')
>f(1,3441,132,0,'jdk/internal/misc/InnocuousThread.run')
>f(2,3441,132,0,'java/lang/Thread.run')
>f(3,3441,132,0,'jdk/internal/ref/CleanerImpl.run')
>f(4,3441,132,0,'java/lang/ref/ReferenceQueue.remove')
>f(5,3441,132,0,'java/lang/Object.wait')
>f(6,3441,132,4,'.annobin___pthread_cond_timedwait.start')
>f(1,3573,1291,4,'start_thread')
>f(2,3573,1291,4,'thread_native_entry(Thread*)')
>f(3,3573,1291,3,'Thread::call_run()')
>f(4,3573,398,3,'ConcurrentGCThread::run()')
>f(5,3573,398,3,'G1ConcurrentRefineThread::run_service()')
>f(6,3573,398,3,'DirtyCardQueueSet::refine_completed_buffer_concurrently(unsigned int, unsigned long)')
>f(7,3576,395,4,'.annobin__ZN8G1RemSet24refine_card_concurrentlyEPaj.start')
>f(8,3597,371,4,'.annobin__ZN10HeapRegion32oops_on_card_seq_iterate_carefulILb0E28G1ConcurrentRefineOopClosureEEb9MemRegionPT0_.start')
>f(9,3597,48,4,'.annobin__ZN28OopOopIterateBoundedDispatchI28G1ConcurrentRefineOopClosureE5Table23oop_oop_iterate_boundedI13ObjArrayKlassjEEvPS0_P7oopDescP5Klass9MemRegion.start')
>f(10,3616,13,3,'BitMap::par_at_put(unsigned long, bool)')
>f(10,3629,16,3,'OtherRegionsTable::add_reference(void*, unsigned int)')
>f(9,3645,166,3,'OtherRegionsTable::add_reference(void*, unsigned int)')
>f(9,3811,157,3,'void G1ConcurrentRefineOopClosure::do_oop_work<unsigned int>(unsigned int*)')
>f(8,3968,1,4,'.annobin__ZN28OopOopIterateBoundedDispatchI28G1ConcurrentRefineOopClosureE5Table23oop_oop_iterate_boundedI13ObjArrayKlassjEEvPS0_P7oopDescP5Klass9MemRegion.start')
>f(8,3969,1,3,'G1CollectedHeap::is_obj_dead(oopDesc*, HeapRegion const*) const')
>f(8,3970,1,3,'G1HotCardCache::insert(signed char*)')
>f(4,3971,892,3,'GangWorker::loop()')
>f(5,3971,2,4,'.annobin__ZN14G1CMRemarkTask4workEj.start')
>f(6,3971,2,3,'G1CMTask::do_marking_step(double, bool, bool)')
>f(7,3971,2,4,'SpinPause')
>f(5,3973,1,4,'.annobin__ZN21ParallelSPCleanupTask4workEj.start')
>f(6,3973,1,3,'Threads::possibly_parallel_threads_do(bool, ThreadClosure*)')
>f(7,3973,1,3,'JavaThread::nmethods_do(CodeBlobClosure*)')
>f(5,3974,2,3,'G1CMConcurrentMarkingTask::work(unsigned int)')
>f(6,3974,2,3,'G1CMTask::do_marking_step(double, bool, bool)')
>f(7,3974,1,4,'SpinPause')
>f(7,3975,1,4,'__GI_sched_yield')
>f(5,3976,2,3,'G1ClearBitMapTask::work(unsigned int)')
>f(6,3976,2,3,'HeapRegionManager::par_iterate(HeapRegionClosure*, HeapRegionClaimer*, unsigned int) const')
>f(7,3976,2,4,'.annobin__ZN17G1ClearBitMapTask22G1ClearBitmapHRClosure14do_heap_regionEP10HeapRegion.start')
>f(8,3976,2,3,'G1CMBitMap::clear_range(MemRegion)')
>f(9,3976,2,4,'__memset_avx2_erms')
>f(5,3978,856,3,'G1ParTask::work(unsigned int)')
>f(6,3978,1,3,'G1ParEvacuateFollowersClosure::do_void()')
>f(7,3978,1,3,'GenericTaskQueue<StarTask, (MemoryType)5, 131072u>::pop_global(StarTask volatile&)')
>f(6,3979,91,3,'G1RemSet::oops_into_collection_set_do(G1ParScanThreadState*, unsigned int)')
>f(7,3979,91,3,'G1RemSet::update_rem_set(G1ParScanThreadState*, unsigned int)')
>f(8,3979,91,4,'.annobin__ZN15G1CollectedHeap26iterate_dirty_card_closureEP21CardTableEntryClosurej.start')
>f(9,3979,91,3,'DirtyCardQueueSet::apply_closure_to_completed_buffer(CardTableEntryClosure*, unsigned int, unsigned long, bool)')
>f(10,3979,2,3,'G1ParScanThreadState::trim_queue_partially()')
>f(10,3981,89,3,'G1RefineCardClosure::do_card_ptr(signed char*, unsigned int)')
>f(11,3981,1,3,'G1CardTable::mark_card_deferred(unsigned long)')
>f(11,3982,64,3,'G1ParScanThreadState::trim_queue_partially()')
>f(12,4045,1,3,'CompositeElapsedCounterSource::now()')
>f(13,4045,1,3,'os::elapsed_counter()')
>f(11,4046,24,3,'G1RemSet::refine_card_during_gc(signed char*, G1ScanObjsDuringUpdateRSClosure*) [clone .part.136]')
>f(12,4046,24,4,'.annobin__ZN10HeapRegion32oops_on_card_seq_iterate_carefulILb1E31G1ScanObjsDuringUpdateRSClosureEEb9MemRegionPT0_.start')
>f(13,4046,10,3,'OverflowTaskQueue<StarTask, (MemoryType)5, 131072u>::push(StarTask) [clone .constprop.567]')
>f(13,4056,12,3,'void G1ScanObjsDuringUpdateRSClosure::do_oop_work<unsigned int>(unsigned int*)')
>f(13,4068,2,3,'void OopOopIterateBoundedDispatch<G1ScanObjsDuringUpdateRSClosure>::Table::oop_oop_iterate_bounded<ObjArrayKlass, unsigned int>(G1ScanObjsDuringUpdateRSClosure*, oopDesc*, Klass*, MemRegion)')
>f(6,4070,761,3,'G1RemSet::scan_rem_set(G1ParScanThreadState*, unsigned int)')
>f(7,4070,761,3,'G1CollectionSet::iterate_from(HeapRegionClosure*, unsigned int, unsigned int) const')
>f(8,4070,761,3,'G1ScanRSForRegionClosure::do_heap_region(HeapRegion*)')
>f(9,4070,5,3,'G1ParScanThreadState::trim_queue_partially()')
>f(9,4075,2,3,'G1ScanRSForRegionClosure::claim_card(unsigned long, unsigned int)')
>f(9,4077,754,3,'G1ScanRSForRegionClosure::scan_rem_set_roots(HeapRegion*)')
>f(10,4081,1,4,'.annobin__ZN28OopOopIterateBoundedDispatchI29G1ScanObjsDuringScanRSClosureE5Table23oop_oop_iterate_boundedI13ObjArrayKlassjEEvPS0_P7oopDescP5Klass9MemRegion.start')
>f(10,4082,25,3,'G1CardTable::mark_card_deferred(unsigned long)')
>f(10,4107,517,3,'G1ParScanThreadState::trim_queue_partially()')
>f(11,4621,2,3,'CompositeElapsedCounterSource::now()')
>f(12,4622,1,3,'os::elapsed_counter()')
>f(11,4623,1,3,'PtrQueue::enqueue_known_active(void*)')
>f(10,4624,2,3,'HeapRegionRemSetIterator::fine_has_next(unsigned long&)')
>f(10,4626,14,3,'HeapRegionRemSetIterator::has_next(unsigned long&)')
>f(11,4631,9,3,'HeapRegionRemSetIterator::fine_has_next(unsigned long&)')
>f(10,4640,191,3,'bool HeapRegion::oops_on_card_seq_iterate_careful<true, G1ScanObjsDuringScanRSClosure>(MemRegion, G1ScanObjsDuringScanRSClosure*)')
>f(11,4641,13,4,'.annobin__ZN28OopOopIterateBoundedDispatchI29G1ScanObjsDuringScanRSClosureE5Table23oop_oop_iterate_boundedI13ObjArrayKlassjEEvPS0_P7oopDescP5Klass9MemRegion.start')
>f(11,4654,112,4,'.annobin__ZN29G1ScanObjsDuringScanRSClosure11do_oop_workIjEEvPT_.start')
>f(11,4766,65,3,'OverflowTaskQueue<StarTask, (MemoryType)5, 131072u>::push(StarTask) [clone .constprop.567]')
>f(6,4831,3,3,'G1RootProcessor::evacuate_roots(G1ParScanThreadState*, unsigned int)')
>f(7,4831,3,3,'G1RootProcessor::process_java_roots(G1RootClosures*, G1GCPhaseTimes*, unsigned int)')
>f(8,4831,1,3,'ClassLoaderDataGraph::roots_cld_do(CLDClosure*, CLDClosure*)')
>f(9,4831,1,3,'G1CLDScanClosure::do_cld(ClassLoaderData*)')
>f(10,4831,1,3,'ClassLoaderData::ChunkedHandleList::oops_do(OopClosure*)')
>f(11,4831,1,4,'.annobin__ZN16G1ParCopyClosureIL9G1Barrier1EL6G1Mark1EE6do_oopEPP7oopDesc.start')
>f(12,4831,1,4,'__memmove_avx_unaligned_erms')
>f(8,4832,2,3,'Threads::possibly_parallel_oops_do(bool, OopClosure*, CodeBlobClosure*)')
>f(9,4832,1,3,'CodeCacheSweeperThread::oops_do(OopClosure*, CodeBlobClosure*)')
>f(10,4832,1,3,'JavaThread::oops_do(OopClosure*, CodeBlobClosure*)')
>f(11,4832,1,3,'G1ParCopyClosure<(G1Barrier)0, (G1Mark)1>::do_oop(oopDesc**)')
>f(12,4832,1,4,'__memmove_avx_unaligned_erms')
>f(9,4833,1,3,'JavaThread::oops_do(OopClosure*, CodeBlobClosure*)')
>f(10,4833,1,3,'JNIHandleBlock::oops_do(OopClosure*)')
>f(11,4833,1,3,'G1ParCopyClosure<(G1Barrier)0, (G1Mark)1>::do_oop(oopDesc**)')
>f(12,4833,1,3,'G1ParScanThreadState::copy_to_survivor_space(InCSetState, oopDesc*, markOopDesc*)')
>f(5,4834,23,3,'G1RebuildRemSetTask::work(unsigned int)')
>f(6,4834,23,3,'HeapRegionManager::par_iterate(HeapRegionClosure*, HeapRegionClaimer*, unsigned int) const')
>f(7,4834,23,3,'G1RebuildRemSetTask::G1RebuildRemSetHeapRegionClosure::do_heap_region(HeapRegion*)')
>f(8,4834,7,4,'.annobin__ZN22G1RebuildRemSetClosure11do_oop_workIjEEvPT_.start')
>f(8,4841,5,4,'.annobin__ZN28OopOopIterateBoundedDispatchI22G1RebuildRemSetClosureE5Table23oop_oop_iterate_boundedI13ObjArrayKlassjEEvPS0_P7oopDescP5Klass9MemRegion.start')
>f(9,4845,1,3,'BitMap::par_at_put(unsigned long, bool)')
>f(8,4846,11,3,'OtherRegionsTable::add_reference(void*, unsigned int)')
>f(5,4857,6,3,'G1ReclaimEmptyRegionsTask::work(unsigned int)')
>f(6,4857,6,3,'HeapRegionManager::par_iterate(HeapRegionClosure*, HeapRegionClaimer*, unsigned int) const')
>f(7,4862,1,4,'.annobin__ZN25G1ReclaimEmptyRegionsTask28G1ReclaimEmptyRegionsClosure14do_heap_regionEP10HeapRegion.start')
>f(4,4863,1,3,'VMThread::run()')
>f(5,4863,1,3,'VMThread::loop()')
>f(6,4863,1,3,'SafepointSynchronize::begin()')
>render();
></script></body></html>

Comment 104 Mike Millson 2021-12-23 22:20:08 UTC
Hi Simeon,

Attached you will find the profiling I did on JDK8 and JDK11 using the reproducer. It shows the time difference is not in System.arraycopy or Arrays.copyOf itself, but is due to GC.
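For readers without the attachments, a minimal benchmark in the spirit of the reproducer (a sketch only; class name, array size, and iteration count are illustrative and not taken from the attached test JAR) would look like this. The key point is that `Byte[]` holds boxed objects, so copying it only copies references while the boxed `Byte` instances remain live on the heap for the GC to track:

```java
import java.util.Arrays;

// Hypothetical sketch of a Byte[] copy benchmark; not the attached reproducer.
public class ByteArrayCopyBench {
    public static void main(String[] args) {
        int size = 100_000;
        Byte[] src = new Byte[size];
        for (int i = 0; i < size; i++) {
            src[i] = (byte) i; // each element is a boxed object reference
        }

        long start = System.nanoTime();
        Byte[] copy = null;
        for (int iter = 0; iter < 1_000; iter++) {
            // Copies object references only; the boxed Bytes stay on the
            // heap, so allocation and GC behavior dominate the timing.
            copy = Arrays.copyOf(src, size);
        }
        long elapsedMs = (System.nanoTime() - start) / 1_000_000;
        System.out.println("copied " + copy.length
                + " references per iteration, total " + elapsedMs + " ms");
    }
}
```

Profiling such a loop attributes the wall-clock difference to collector activity rather than to the copy intrinsics themselves, which matches the finding above.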

Would you be ok closing out this bugzilla? Or do you still feel there is an issue related to System.arraycopy and/or Arrays.copyOf performance in OpenJDK11 vs OpenJDK8?

Thank you,
Mike

Comment 105 Andrey Loskutov 2021-12-23 22:34:18 UTC
(In reply to Mike Millson from comment #104)
> Hi Simeon,
> 
> Attached you will find the profiling I did on JDK8 and JDK11 using the
> reproducer. It shows the time difference is not in System.arraycopy or
> Arrays.copyOf itself, but is due to GC.

Could you please elaborate on "due to GC"?
Is the GC now:
1) performing its work not fast enough?
2) hindering execution because it does too much?
3) something else?

And does "due to GC" mean that the performance drop caused by GC is well known and there is no solution?

Just trying to understand why we should accept a performance drop "due to GC". Shouldn't the GC be *improved* in 11?

Comment 106 Mike Millson 2021-12-23 23:29:33 UTC
(In reply to Andrey Loskutov from comment #105)
> (In reply to Mike Millson from comment #104)
> > Hi Simeon,
> > 
> > Attached you will find the profiling I did on JDK8 and JDK11 using the
> > reproducer. It shows the time difference is not in System.arraycopy or
> > Arrays.copyOf itself, but is due to GC.
> 
> Could you please elaborate on "due to GC"?
> Is the GC now:
> 1) performing its work not fast enough?
> 2) hindering execution because it does too much?
> 3) something else?
> 
> And does "due to GC" mean that the performance drop caused by GC is well
> known and there is no solution?

Throughput for the reproducer use case does seem to be higher on JDK8 than on JDK11 with the G1 collector, but it is a very limited use case. The difference was not observed after a warmup period, with heap tuning (e.g. a smaller heap), or with other collectors.

The reproducer seems to require some tuning considerations when moving to JDK11. Does it really reflect your production environment? For example, is your production application really short-lived, with startup performance as a key concern?

> 
> Just trying to understand, why should we accept performance drop "due to
> GC", and shouldn't the GC be *improved* in 11?

You shouldn't have to accept a performance drop, but it is expected that some performance testing and tuning will be needed when moving to a new LTS release.
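The comparisons described above can be run with flag variations along these lines (a sketch; `Test.jar` and the heap sizes are placeholders, not values taken from this report):

```
# Baseline G1 (the default collector on JDK 11)
java -Xmx1g -jar Test.jar

# Same workload with a smaller heap, still G1
java -Xmx256m -jar Test.jar

# Parallel collector (the JDK 8 default), for a like-for-like comparison
java -XX:+UseParallelGC -Xmx1g -jar Test.jar

# Unified GC logging on JDK 11, to see where the time goes
java -Xlog:gc*:file=gc.log -jar Test.jar
```

Comparing the Parallel-collector run against the G1 runs isolates whether the regression is in the copy code or in the collector choice and its defaults.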

Comment 108 Andrew John Hughes 2023-02-06 14:28:05 UTC
Can we close this bug? There hasn't been an update in over a year and it seems the regression was difficult to reproduce. I also notice most of your recent issues have been raised against java-17-openjdk, so maybe java-11-openjdk is no longer as much of a concern.

Comment 109 Simeon Andreev 2023-02-06 15:00:10 UTC
I've discussed in my team, you can close the bug.

Comment 110 Andrew John Hughes 2023-08-17 20:39:42 UTC
Thanks Simeon.