Bug 1781346

Summary: rngd uses 100% CPU while in a yield() loop
Product: [Fedora] Fedora Reporter: Linus Torvalds <torvalds>
Component: rng-tools Assignee: Neil Horman <nhorman>
Status: CLOSED ERRATA QA Contact: Fedora Extras Quality Assurance <extras-qa>
Severity: high Docs Contact:
Priority: unspecified    
Version: 30 CC: jaromir.capik, jgarzik, lewk, nhorman
Target Milestone: ---   
Target Release: ---   
Hardware: x86_64   
OS: Linux   
Whiteboard:
Fixed In Version: rng-tools-6.9-1.fc30 Doc Type: If docs needed, set a value
Doc Text:
Story Points: ---
Clone Of: Environment:
Last Closed: 2020-01-03 20:35:50 UTC Type: Bug
Regression: --- Mount Type: ---
Documentation: --- CRM:
Verified Versions: Category: ---
oVirt Team: --- RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: --- Target Upstream Version:
Embargoed:

Description Linus Torvalds 2019-12-09 20:07:25 UTC
Description of problem:

On my laptop (but not my desktop - related to TPM?) rngd spends 100% of CPU time and keeps the fan running.

Interim solution: kill it.

Version-Release number of selected component (if applicable):

rng-tools-6.7-2.fc30.x86_64

How reproducible:


Steps to Reproduce:
1. Boot up-to-date Fedora 30
2. Wait
3. Profit! If you're the power company, or a fan manufacturer, that is.

Actual results:

A load average of 1.0 when the laptop is otherwise idle, and a noisy spinning fan


Expected results:


Maybe rngd can _occasionally_ spin in a yield loop for jitter entropy or whatever, but doing it enough that the fan stays on for hours at a time isn't great.

Additional info:

As mentioned, this seems to be somewhat hw-specific for unknown reasons. It doesn't happen on my desktop, despite rng-tools being installed there too, and the configuration being pretty much the same (i.e. up-to-date F30 with my -git kernel, of course).

Comment 1 Linus Torvalds 2019-12-09 20:11:58 UTC
Just to clarify: when I did an 'strace -p' on the rngd pid, it seemed to literally just be in an endless sched_yield() loop while the CPU was busy. But I didn't try to debug it any further than looking at that.

Comment 2 Neil Horman 2019-12-10 11:39:43 UTC
Hey Linus-
     Pretty sure I know what this is. I think I hit it upstream recently and fixed it, but it was only happening on shutdown for me.  Can you send me a pstack output while it's happening, just to confirm it's the same issue?  If so, I'll pull the requisite patches into Fedora and get you a new release shortly.  If it is the same issue, and you want to try to avoid it in the interim, you can modify the rngd unit file to pass the '-x jitter' option.  This will disable the jitterentropy randomness source, which was the cause of the problem for me.

Comment 3 Linus Torvalds 2019-12-10 17:06:59 UTC
I'm not sure how useful this pstack is (looks like limited debug info), but here goes:

pstack 1899

Thread 5 (Thread 0x7f12a6f2e700 (LWP 1920)):
#0  0x00007f12a910093f in write () from /lib64/libpthread.so.0
#1  0x000055710ea478fd in ?? ()
#2  0x00007f12a90f74c0 in start_thread () from /lib64/libpthread.so.0
#3  0x00007f12a9025163 in clone () from /lib64/libc.so.6
Thread 4 (Thread 0x7f12a772f700 (LWP 1918)):
#0  0x00007f12a910093f in write () from /lib64/libpthread.so.0
#1  0x000055710ea478fd in ?? ()
#2  0x00007f12a90f74c0 in start_thread () from /lib64/libpthread.so.0
#3  0x00007f12a9025163 in clone () from /lib64/libc.so.6
Thread 3 (Thread 0x7f12a7f30700 (LWP 1917)):
#0  0x00007f12a910093f in write () from /lib64/libpthread.so.0
#1  0x000055710ea478fd in ?? ()
#2  0x00007f12a90f74c0 in start_thread () from /lib64/libpthread.so.0
#3  0x00007f12a9025163 in clone () from /lib64/libc.so.6
Thread 2 (Thread 0x7f12a8731700 (LWP 1916)):
#0  0x00007f12a910093f in write () from /lib64/libpthread.so.0
#1  0x000055710ea478fd in ?? ()
#2  0x00007f12a90f74c0 in start_thread () from /lib64/libpthread.so.0
#3  0x00007f12a9025163 in clone () from /lib64/libc.so.6
Thread 1 (Thread 0x7f12a8732d40 (LWP 1899)):
#0  0x00007f12a900c12b in sched_yield () from /lib64/libc.so.6
#1  0x000055710ea488e5 in ?? ()
#2  0x000055710ea4290c in ?? ()
#3  0x00007f12a8f4cf43 in __libc_start_main () from /lib64/libc.so.6
#4  0x000055710ea4324e in ?? ()

Comment 4 Linus Torvalds 2019-12-10 17:10:50 UTC
Oh, and just as I pressed 'submit', the busy looping stopped.

I assume that this is jitterentropy that just waits for the timing while doing a sched_yield() and me doing the writing and mousing for cut-and-paste made it all happy.

Presumably my strace only saw the yield calls because the timing is done with the TSC (possibly using the vdso).

That may explain why I only see this on my laptop - particularly during the merge window I boot my laptop for testing, but I do all my real work on my desktop, so when I'm home the laptop often sits basically idle, just occasionally getting a new kernel for basic smoke testing.
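
The hypothesis in this comment (a loop that checks a timer and calls sched_yield() in between) can be sketched in C as below. This is only an illustration of the pattern, not code from rngd or jitterentropy; the helper names now_ns and spin_until_elapsed are hypothetical. On x86_64 Linux, clock_gettime() is normally served from the vDSO without entering the kernel, so strace on a loop like this would typically show only the sched_yield() calls. On an otherwise idle machine there is nothing else to yield to, so the loop keeps one CPU fully busy.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <sched.h>
#include <stdint.h>
#include <time.h>

/* Monotonic time in nanoseconds. */
static uint64_t now_ns(void)
{
    struct timespec ts;
    int rc = clock_gettime(CLOCK_MONOTONIC, &ts);

    assert(rc == 0);
    (void)rc;
    return (uint64_t)ts.tv_sec * 1000000000ull + (uint64_t)ts.tv_nsec;
}

/*
 * Hypothetical helper: wait for wait_ns nanoseconds by polling the clock
 * and yielding the CPU on every iteration. Returns the number of
 * sched_yield() calls made, which is the only activity strace would
 * usually report for this loop.
 */
static unsigned long spin_until_elapsed(uint64_t wait_ns)
{
    uint64_t deadline = now_ns() + wait_ns;
    unsigned long yields = 0;

    while (now_ns() < deadline) {
        sched_yield();   /* returns at once if nothing else is runnable */
        yields++;
    }
    return yields;
}
```

Calling spin_until_elapsed(10000000ull) waits about 10 ms while burning CPU for the whole interval; the return value shows how many times the loop yielded in that time.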

Comment 5 Linus Torvalds 2019-12-10 17:17:24 UTC
Or maybe it was pstack that sends a signal or something and perturbs the thing enough to get it out of the busy loop.

Because I've seen it go overnight before, and I just come into my office in the morning and my laptop fan is running..

Comment 6 Neil Horman 2019-12-10 17:33:40 UTC
Copy that
 
I'm working on some improvements that remove the need for sched_yield in rngd.  If you want to try it out, it's here:
https://github.com/nhorman/rng-tools/tree/yield-removal

I'm going to do some more testing with it, and when I feel good about it, I'll make a new release and port it to f30 and rawhide
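
One generic way to get rid of a sched_yield() polling loop is to block in the kernel until the awaited event arrives. The C sketch below shows that idea with poll() on a pipe. It is not taken from the yield-removal branch, whose approach is not described in this report, and the helper names wait_readable and demo_pipe_wakeup are hypothetical.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <errno.h>
#include <poll.h>
#include <unistd.h>

/*
 * Hypothetical helper: sleep until fd has an event pending (data or
 * hangup) or timeout_ms elapses. Returns 1 if an event is pending,
 * 0 on timeout, -1 on error. A negative timeout_ms waits indefinitely.
 */
static int wait_readable(int fd, int timeout_ms)
{
    struct pollfd pfd = { .fd = fd, .events = POLLIN };
    int rc;

    do {
        /* The thread sleeps in the kernel here and uses no CPU. */
        rc = poll(&pfd, 1, timeout_ms);
    } while (rc < 0 && errno == EINTR);  /* a signal restarts the full wait */

    if (rc < 0)
        return -1;
    return rc > 0 ? 1 : 0;
}

/* Demo: an empty pipe times out, then one written byte wakes the waiter. */
static int demo_pipe_wakeup(void)
{
    int fds[2];
    char byte = 'x';
    int rc = pipe(fds);

    assert(rc == 0);
    if (wait_readable(fds[0], 10) != 0)      /* nothing written yet */
        rc = -1;
    else if (write(fds[1], &byte, 1) != 1)
        rc = -1;
    else
        rc = wait_readable(fds[0], 1000);    /* data is now pending */

    close(fds[0]);
    close(fds[1]);
    return rc;
}
```

With this shape, a waiting thread shows up in strace as a single blocked poll() call rather than a stream of sched_yield() calls, and an idle machine stays idle.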

Comment 7 Linus Torvalds 2019-12-10 17:45:10 UTC
Btw, I'm not sure how aware you are, but the kernel these days (as of 5.4, I'm not sure what the stable status is) does a simple jitter-rng on its own, because we got tired of user space locking up or doing things badly.

See kernel commit 50ee7529ec45 ("random: try to actively add entropy rather than passively wait for it").

The kernel only does it enough that getrandom() no longer blocks.
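
For reference, user space can check whether getrandom() would block by passing the GRND_NONBLOCK flag. The sketch below uses the glibc wrapper from <sys/random.h>; the helper name try_getrandom is hypothetical, and the code only probes the state of the pool, it does not show the kernel's jitter mechanism itself.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <sys/random.h>
#include <sys/types.h>

/*
 * Hypothetical helper: fill buf with len random bytes without blocking.
 * Returns 1 if the bytes were delivered, 0 if the kernel's pool is not
 * yet initialized (getrandom() would have blocked), -1 on other errors.
 */
static int try_getrandom(void *buf, size_t len)
{
    ssize_t n;

    /* Per getrandom(2), requests of up to 256 bytes are returned in full
     * once the pool is initialized, so short reads are not handled. */
    assert(len <= 256);

    do {
        n = getrandom(buf, len, GRND_NONBLOCK);
    } while (n < 0 && errno == EINTR);

    if (n == (ssize_t)len)
        return 1;
    if (n < 0 && errno == EAGAIN)
        return 0;    /* pool not ready: a blocking call would wait here */
    return -1;
}
```

A return value of 0 early in boot would indicate the situation that the kernel commit above is meant to shorten.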

Comment 8 Neil Horman 2019-12-11 12:51:03 UTC
Yeah, the jitterentropy work in rngd was done a few months prior to that, and for the same reason: to prevent /dev/random from blocking on low-entropy systems. Did you try the branch from comment 6?

Comment 9 Linus Torvalds 2019-12-11 18:57:53 UTC
So I didn't want to replace /sbin/rngd, and as a result my testing was slightly different from the usual "just boot up and wait". And the problem wasn't something that "boot and see" triggered every single time either (although it did happen more often than not).

But for what it's worth, it _seems_ to work. I can't make it go into the yield loop with those two changes, but see the caveat above about the difference in test environment.

Comment 10 Neil Horman 2019-12-11 20:38:28 UTC
Understood. I'm letting it run here for the next few days; if that fails to find any problem, I'll pull it in under a 'doesn't seem to hurt' policy, and you can let me know if there are any subsequent issues.

Comment 11 Fedora Update System 2019-12-18 16:17:29 UTC
FEDORA-2019-b6158d5147 has been submitted as an update to Fedora 30. https://bodhi.fedoraproject.org/updates/FEDORA-2019-b6158d5147

Comment 12 Fedora Update System 2019-12-19 01:01:02 UTC
rng-tools-6.9-1.fc30 has been pushed to the Fedora 30 testing repository. If problems still persist, please make note of it in this bug report.
See https://fedoraproject.org/wiki/QA:Updates_Testing for instructions on how to install test updates.
You can provide feedback for this update here: https://bodhi.fedoraproject.org/updates/FEDORA-2019-b6158d5147

Comment 13 Fedora Update System 2020-01-03 20:35:50 UTC
rng-tools-6.9-1.fc30 has been pushed to the Fedora 30 stable repository. If problems still persist, please make note of it in this bug report.