2132371 – JDK15+ performance degradation when reading byte-per-byte from BufferedInputStream

RHEL Engineering is moving the tracking of its product development work on RHEL 6 through RHEL 9 to Red Hat Jira (issues.redhat.com). If you're a Red Hat customer, please continue to file support cases via the Red Hat customer portal. If you're not, please head to the "RHEL project" in Red Hat Jira and file new tickets here. Individual Bugzilla bugs in the statuses "NEW", "ASSIGNED", and "POST" are being migrated throughout September 2023. Bugs of Red Hat partners with an assigned Engineering Partner Manager (EPM) are migrated in late September as per pre-agreed dates. Bugs against components "kernel", "kernel-rt", and "kpatch" are only migrated if still in "NEW" or "ASSIGNED". If you cannot log in to RH Jira, please consult article #7032570. That failing, please send an e-mail to the RH Jira admins at rh-issues@redhat.com to troubleshoot your issue as a user management inquiry. The email creates a ServiceNow ticket with Red Hat. Individual Bugzilla bugs that are migrated will be moved to status "CLOSED", resolution "MIGRATED", and set with "MigratedToJIRA" in "Keywords". The link to the successor Jira issue will be found under "Links", have a little "two-footprint" icon next to it, and direct you to the "RHEL project" in Red Hat Jira (issue links are of type "https://issues.redhat.com/browse/RHEL-XXXX", where "X" is a digit). This same link will be available in a blue banner at the top of the page informing you that that bug has been migrated.

Bug 2132371 - JDK15+ performance degradation when reading byte-per-byte from BufferedInputStream

Summary: JDK15+ performance degradation when reading byte-per-byte from BufferedInputS...

Keywords:
Status:	CLOSED WONTFIX
Alias:	None
Product:	Red Hat Enterprise Linux 9
Classification:	Red Hat
Component:	java-17-openjdk
Sub Component:
Version:	9.0
Hardware:	Unspecified
OS:	Unspecified
Priority:	unspecified
Severity:	unspecified
Target Milestone:	rc
Target Release:	---
Assignee:	Aleksey Shipilev
QA Contact:	OpenJDK QA
Docs Contact:
URL:
Whiteboard:
Depends On:
Blocks:
TreeView+	depends on / blocked

Reported:	2022-10-05 12:45 UTC by Simeon Andreev
Modified:	2022-10-25 17:13 UTC (History)
CC List:	5 users (show)
Fixed In Version:
Doc Type:	If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed:	2022-10-25 17:13:39 UTC
Type:	Bug
Target Upstream Version:
Embargoed:
Dependent Products:

Attachments	(Terms of Use)

Links
System	ID	Private	Priority	Status	Summary	Last Updated
Red Hat Issue Tracker	RHELPLAN-135721	0	None	None	None	2022-10-05 13:05:34 UTC

Description Simeon Andreev 2022-10-05 12:45:20 UTC

Description of problem:

We observe a performance degradation in our product based on Eclipse and xtext, when updating from Java 11 to Java 17. 

Version-Release number of selected component (if applicable):

OpenJDK 15+.

Steps to Reproduce:
1. Run the following snippet with OpenJDK 15+ and with OpenJDK 14 or earlier.
public class JEP374PerformanceDegradation {
    public static void main(String[] args) throws IOException {
    	Random r = new Random(42);
        boolean b = false;
        ArrayList<Path> paths = new ArrayList<>();
        int n = 50;
		for (int i = 0; i < n; ++i) {
        	Path p = Files.createTempFile(Paths.get("/tmp/"), "test", "test");
        	paths.add(p);
        	byte[] c = new byte[672369 + i];
        	r.nextBytes(c);
        	Files.write(p, c);
        }
        long s = System.currentTimeMillis();
        for (Path p : paths) {
            File f = p.toFile();
            if (f.isFile()) {
                b = b || hasContentsChanged(f);
            }
            b = false;
        }
        long e = System.currentTimeMillis();
        System.out.println("elapsed: " + (e - s) + "ms");
        for (Path p : paths) {
        	Files.delete(p);
        }
    }

    protected static boolean hasContentsChanged(File file) {
        boolean contentChanged = false;
        BufferedInputStream oldContent = null;
        BufferedInputStream newContent = null;
        try {
            newContent = new BufferedInputStream(new FileInputStream(file));
            oldContent = new BufferedInputStream(new FileInputStream(file));
            int newByte = newContent.read();
            int oldByte = oldContent.read();
            while (newByte != -1 && oldByte != -1 && newByte == oldByte) {
                newByte = newContent.read();
                oldByte = oldContent.read();
            }
            contentChanged = newByte != oldByte;
        } catch (IOException e) {
            contentChanged = true;
        } finally {
            if (oldContent != null) {
                try {
                    oldContent.close();
                } catch (IOException e) {
                    // ignore
                }
            }
            if (newContent != null) {
                try {
                    newContent.close();
                } catch (IOException e) {
                    // ignore
                }
            }
        }
        return contentChanged;
    }
}

2. Observe a performance degradation.
3. The performance degradation is gone, if the deprecated VM flag is used: -XX:+UseBiasedLocking

Output with OpenJDK 14:

$ /data/jdk/openjdk-14.0.2_linux-x64_bin/jdk-14.0.2/bin/java /home/sandreev/JEP374PerformanceDegradation.java 
elapsed: 194ms

Output with OpenJDK 15:

$ /data/jdk/openjdk-15.0.2_linux-x64_bin/jdk-15.0.2/bin/java /home/sandreev/JEP374PerformanceDegradation.java 
elapsed: 1010ms

Output with OpenJDK 15 and flag "-XX:+UseBiasedLocking":

$ /data/jdk/openjdk-15.0.2_linux-x64_bin/jdk-15.0.2/bin/java -XX:+UseBiasedLocking /home/sandreev/JEP374PerformanceDegradation.java 
OpenJDK 64-Bit Server VM warning: Option UseBiasedLocking was deprecated in version 15.0 and will likely be removed in a future release.
elapsed: 185ms

Additional info:

We bisected the degradation to this change:

https://github.com/openjdk/jdk15u/commit/398a2b3c37236c0cd1b5258b517adf9edaf0b044

So this seems to be a regression from: https://openjdk.org/jeps/374

For our product and xtext, we will try to read bytes in an array "in bulk" instead of asking the BufferedInputStream for single bytes at a time. That doesn't seem to result in degraded performance (we are still evaluating though).

Comment 1 Simeon Andreev 2022-10-06 13:23:02 UTC

Roman, just FYI. In https://github.com/openjdk/jdk/pull/10590 I see you mention code which has heavy single-threaded use of synchronized JRE types.

You mention Vector and StringBuffer as examples, maybe this bug gives another example of such code. Here we have use of the sychronized BufferedInputStream.read() to compare 2 input streams, likely written like this since the code is very trivial to write (and due to using buffered streams, the general expectation is that read() would not perform much worse than read(byte[], int, int)).

Comment 2 Andrew Dinn 2022-10-20 10:33:42 UTC

Hi Simeon,

> Here we have use of the sychronized BufferedInputStream.read() to compare 2 input streams, likely written like this since the code is very trivial to write (and due to using buffered streams, the general expectation is that read() would not perform much worse than read(byte[], int, int)).

Yes, that really should be a reasonable expectation of a buffered stream. Unfortunately, the expectation is invalidated because a call to read() implies synchronization. This is a known and unfortunate problem with the way that the stream APIs were designed many years ago. It is also, as you mention, a problem that was to a large degree hidden by biased locking (very much like Vector and StringBuffer) but has now appeared again. The OpenJDK project is aware of this problem and is considering remedies but there is no immediate plan to provide an alternative API or a JVM change to help this specific use case. The best option for clients of BufferedInputStream is to read data in bulk where possible.

Comment 3 Aleksey Shipilev 2022-10-20 18:23:19 UTC

To add to what Andrew Dinn says.

Changing to bulk operations is the proper remedy to this. Fixing `BufferedInputStream` on JDK side would be very hard, if possible at all. The underlying problem with locks in this scenario is atomic CAS on lock acquisition, and that is a bare minimum we have to do. Even if we manage to avoid explicit lock synchronization there, we would still need to use atomics to do safe buffered reads, which suffers from the same problem. Biased locking managed to avoid atomics on lock acquisition with intricate JVM cooperation; attempting to reproduce this in pure Java would likely fail.

Assuming the reproducer is similar to the real code that customer has, there is `Files.mismatch` since JDK 12 that should be good performance-wise. That method already does "bulk" reads and comparisons. If the real code just needs to deal with buffered read for some other purpose, I'd recommend `InputStream.readNBytes` convenience method introduced in JDK 9.

AFAICS, it is techinically possible to side-step this problem by "hacking in" the subclass of `BufferedInputStream` which overrides `read()` with unsynchronized version. This would require extensive verification that all `BufferedInputStream` instances are confined to a single thread. Then something like this would work:

```
class UnsyncBufferedInputStream extends BufferedInputStream {
    public UnsyncBufferedInputStream(InputStream in) {
        super(in);
    }

    @Override
    public int read() throws IOException {
        if (pos >= count) {
            // Let superclass handle buffers
            return super.read();
        }
        return buf[pos++] & 0xFF;
    }
}
```

Then again, if user modifies the source code, it would be more future-proof to just avoid doing per-byte operations in favor of bulk operations.

Some performance data for your amusement:
 https://gist.github.com/shipilev/71fd881f7cbf3f87028bf87385e84889

Comment 4 Simeon Andreev 2022-10-24 06:15:06 UTC

I.e. we are to not expect improvements in the JDK here?

We've already added a workaround in our code, to  prevent the performance degradation. We'll also likely have to advise users to adjust their code, if they too have code that runs into the degradation.

Comment 5 Aleksey Shipilev 2022-10-24 08:43:20 UTC

> I.e. we are to not expect improvements in the JDK here?

Yes, I do not see an easy way out of this degradation at the JDK side. 
Changing the user code is (unfortunately) the pragmatic way to go here.

Comment 6 Simeon Andreev 2022-10-24 09:02:01 UTC

OK, please close this as wont fix then. Thanks!

Comment 7 Andrew Dinn 2022-10-24 09:41:54 UTC

Hi Simeon.

The only JVM change that might affect this is the work Roman Kennke is doing on locking. However, I doubt that is going to do achieve the improvement provided by biased locking.

I raised this upstream on the OpenJDK hotspot-dev list and discussion did suggest the possibility of a different workaround to the one offered by Aleksey.

The difficulty here is that the inner loop is continually locking and unlocking each of the two input streams in turn. That makes it hard to optimize the code as is. However, one thing you might try is synchronizing on each stream outside the loop

        try {
            newContent = new BufferedInputStream(new FileInputStream(file));
            oldContent = new BufferedInputStream(new FileInputStream(file));
            synchronized(oldContent) {
                synchronized(newContent) {
                    int newByte = newContent.read();
                    int oldByte = oldContent.read();
                    while (newByte != -1 && oldByte != -1 && newByte == oldByte) {
                        newByte = newContent.read();
                        oldByte = oldContent.read();
                    }
                }
            }
            contentChanged = newByte != oldByte;
         . . .

The compiler may be able to elide the synchronizations associated with the inner reads when it inlines them because it knows the associated object is already locked in an outer block (this is called lock coarsening). This would be suitable for cases where the code really does need to proceed by reading one byte at a time.

If the compiler cannot perform lock coarsening wit this version it may help to move the initial reads outside the synchronizations.

Comment 8 Simeon Andreev 2022-10-24 10:13:24 UTC

Alright, thanks for the suggested workaround. I'll document it as well, under the list of suggested workarounds for users.

Comment 9 Aleksey Shipilev 2022-10-25 17:13:39 UTC

(In reply to Simeon Andreev from comment #6)
> OK, please close this as wont fix then. Thanks!

Thanks, did so (hopefully correctly).

Note You need to log in before you can comment on or make changes to this bug.