Bug 1978378

Summary: high cpu usage in pthread_cond_timedwait when using very small timeout
Product: Red Hat Enterprise Linux 8
Component: glibc
Version: 8.4
Status: CLOSED NOTABUG
Severity: medium
Priority: unspecified
Reporter: Paulo Andrade <pandrade>
Assignee: glibc team <glibc-bugzilla>
QA Contact: qe-baseos-tools-bugs
CC: aph, ashankar, codonell, dj, fweimer, mmillson, mnewsome, pfrankli, sipoyare
Target Milestone: beta
Hardware: All
OS: Linux
Last Closed: 2021-07-09 13:51:42 UTC
Type: Bug

Description Paulo Andrade 2021-07-01 16:44:00 UTC
Sample reproducer:

# yum install java-1.8.0-openjdk-devel
# cat Spin.java
public class Spin
{
    public static void main(String[] args)
    {
        while (true)
            {
                //java.util.concurrent.locks.LockSupport.parkNanos(50000);
                java.util.concurrent.locks.LockSupport.parkNanos(1);
            }
    }
}

# javac Spin.java
# java Spin & top

  top will show the java process at 100% cpu usage on rhel8, and 20-40% on rhel7.

  If 50000 (50us) is used as the argument instead, rhel7 and rhel8 show roughly the same cpu usage.

  50000 is also the default timer_slack_ns, but adjusting /proc/$pid/timerslack_ns makes no difference on rhel8.

  This is somewhat of a random guess, but the issue appears to be this chunk in glibc-2.28, pthread_cond_wait.c:__pthread_cond_wait_common()
...
		  struct timespec rt;
		  if (__clock_gettime (CLOCK_MONOTONIC, &rt) != 0)
		    __libc_fatal ("clock_gettime does not support "
				  "CLOCK_MONOTONIC\n");
		  /* Convert the absolute timeout value to a relative
		     timeout.  */
		  rt.tv_sec = abstime->tv_sec - rt.tv_sec;
		  rt.tv_nsec = abstime->tv_nsec - rt.tv_nsec;
		  if (rt.tv_nsec < 0)
		    {
		      rt.tv_nsec += 1000000000;
		      --rt.tv_sec;
		    }
		  /* Did we already time out?  */
		  if (__glibc_unlikely (rt.tv_sec < 0))
		    err = ETIMEDOUT;
...

which may be preventing a context switch into the kernel: a 1 nanosecond timeout has normally already expired by the time this code runs, so the call returns immediately with ETIMEDOUT instead of sleeping.

In newer glibc this chunk no longer exists, and the problem does not occur on Fedora 33.
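The difference can also be observed from Java directly, without top. The following is a small sketch (class name, method name, and iteration count are illustrative, not from the report) that measures the average wall-clock cost of a parkNanos(1) call: where the 1 ns timeout is rounded up by kernel timer slack, the average should be tens of microseconds; where glibc returns immediately with ETIMEDOUT, it should be far smaller.

```java
import java.util.concurrent.locks.LockSupport;

public class ParkProbe {
    // Average elapsed nanoseconds per LockSupport.parkNanos(1) call.
    static long avgParkNanos(int iterations) {
        long start = System.nanoTime();
        for (int i = 0; i < iterations; i++)
            LockSupport.parkNanos(1);
        return (System.nanoTime() - start) / iterations;
    }

    public static void main(String[] args) {
        System.out.println("avg ns per parkNanos(1): " + avgParkNanos(100_000));
    }
}
```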

  Adding a counter, for example:
public class Spin
{
    public static void main(String[] args)
    {
        long n = 1000000;
        while (n > 0)
        {
            java.util.concurrent.locks.LockSupport.parkNanos(1);
            n = n - 1;
        }
    }
}

shows the loop completing much faster on rhel8, while pegging a CPU; for example, on similar systems:

rhel7: /bin/time java Spin
1.22user 2.47system 0:55.70elapsed 6%CPU (0avgtext+0avgdata 22848maxresident)k
0inputs+88outputs (0major+5858minor)pagefaults 0swaps

rhel8: /bin/time java Spin
0.75user 0.26system 0:01.01elapsed 100%CPU (0avgtext+0avgdata 26436maxresident)k
0inputs+64outputs (0major+2709minor)pagefaults 0swaps

Comment 1 Florian Weimer 2021-07-02 13:29:54 UTC
Andrew, is this a valid use case of java.util.concurrent.locks.LockSupport? Is parking with a timeout supposed to de-schedule the current thread?

For the POSIX condition variable, I believe both old and new behavior are okay. But maybe Java has different requirements. Maybe it's time to dust off your old patches to use futex directly?

Comment 2 Paulo Andrade 2021-07-02 14:25:23 UTC
  Even if this is invalid usage, I believe there may be Java code in the wild that relies on timeouts smaller than the default timerslack_ns (50us) delaying for at least 50us, allowing other threads to run concurrently.
  That appears to be the issue in the support case. The solution in the Java code would be to ensure parkNanos is always called with an argument of at least 50000, matching the default timerslack_ns.
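A minimal sketch of that workaround (the helper name, class name, and the hard-coded 50_000 constant are illustrative; the actual timer slack is per-process and may differ from the default):

```java
import java.util.concurrent.locks.LockSupport;

public class ParkClamp {
    // Default kernel timer_slack_ns on most Linux systems; assumed here.
    static final long MIN_PARK_NANOS = 50_000;

    // Clamp the requested timeout so we never ask for less than the
    // kernel would round up to anyway.
    static long clamp(long nanos) {
        return Math.max(nanos, MIN_PARK_NANOS);
    }

    static void parkAtLeastTimerslack(long nanos) {
        LockSupport.parkNanos(clamp(nanos));
    }

    public static void main(String[] args) {
        parkAtLeastTimerslack(1); // behaves like parkNanos(50_000)
    }
}
```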

Comment 3 Andrew Haley 2021-07-03 08:45:37 UTC
(In reply to Florian Weimer from comment #1)
> Andrew, is this a valid use case of java.util.concurrent.locks.LockSupport?
> Is parking with a timeout supposed to de-schedule the current thread?
> 
> For the POSIX condition variable, I believe both old and new behavior are
> okay. But maybe Java has different requirements. Maybe it's time to dust off
> your old patches to use futex directly?

There's no guarantee that parkNanos() will do anything at all. It might return immediately.
Returning immediately without blocking in the kernel is correct when the timer has
already timed out.

Comment 4 Florian Weimer 2021-07-03 11:13:13 UTC
(In reply to Andrew Haley from comment #3)
> There's no guarantee that parkNanos() will do anything at all. It might
> return immediately.
>
> Returning immediately without blocking in the kernel is correct when the
> timer has already timed out.

Okay, then the old and new behavior are equally correct for OpenJDK, too, not just as far as POSIX is concerned.

Paulo, I assume this means that there is nothing to fix on the system side, and the application has to be changed to use a synchronization algorithm with more predictable performance.

Comment 5 Andrew Haley 2021-07-03 13:09:40 UTC
(In reply to Florian Weimer from comment #4)
>
> Paulo, I assume this means that there is nothing to fix on the system side,
> and the application has to be changed to use a synchronization algorithm
> with more predictable performance.

Absolutely. If you need to yield, call Thread.yield(), which uses sched_yield().
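A yield-based spin wait along those lines might look like the sketch below (purely illustrative; Thread.yield() is only a scheduling hint and gives no timing guarantee, so the thread still consumes CPU while spinning):

```java
public class YieldSpin {
    // Spin until a deadline, yielding on each iteration so other
    // runnable threads can make progress; no kernel sleep involved.
    static long spinYield(long nanos) {
        long deadline = System.nanoTime() + nanos;
        long iterations = 0;
        while (System.nanoTime() < deadline) {
            Thread.yield();
            iterations++;
        }
        return iterations;
    }

    public static void main(String[] args) {
        System.out.println("yields in 1 ms: " + spinYield(1_000_000));
    }
}
```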

Comment 6 Carlos O'Donell 2021-07-09 13:51:42 UTC
(In reply to Andrew Haley from comment #5)
> (In reply to Florian Weimer from comment #4)
> >
> > Paulo, I assume this means that there is nothing to fix on the system side,
> > and the application has to be changed to use a synchronization algorithm
> > with more predictable performance.
> 
> Absolutely. If you need to yield, call Thread.yield(), which uses
> sched_yield().

Andrew, thank you for your comments and direction.

From our perspective this is a CLOSED NOTABUG, and I'm marking it as such.

Comment 7 Paulo Andrade 2021-07-09 14:24:08 UTC
Thanks for the comments. We are still waiting for feedback from the user,
but the final recommendation was to update the Java code: do not expect
a 1 nanosecond park to sleep for at least 50us, and do not be surprised
when the system does (or attempts to do) exactly what was asked.

  Since it appears the goal was just to put a logger thread to sleep
and poll from time to time, we suggested pseudo code along these lines:

void park(long delay)
{
  long time1 = System.nanoTime();
  Thread.yield();
  long time2 = System.nanoTime();
  if (time2 - time1 < delay)
    LockSupport.parkNanos(delay - (time2 - time1));
  long time3 = System.nanoTime();
  if (time3 - time1 < delay)
    Thread.yield();
}

  Other than that, a very different approach is likely needed, probably
using kernel-rt rather than Java code to implement the delay.