Description of problem:

$ journalctl -b-1 -u rngd
-- Logs begin at Mon 2019-03-18 14:30:04 CET, end at Tue 2019-03-19 11:34:55 CET. --
Mar 19 11:29:43 workstation2 systemd[1]: Started Hardware RNG Entropy Gatherer Daemon.
Mar 19 11:29:43 workstation2 rngd[792]: Initalizing available sources
Mar 19 11:29:43 workstation2 rngd[792]: Initalizing entropy source hwrng
Mar 19 11:29:43 workstation2 rngd[792]: Enabling RDSEED rng support
Mar 19 11:29:43 workstation2 rngd[792]: Initalizing entropy source rdrand
Mar 19 11:29:47 workstation2 rngd[792]: Initalizing AES buffer
Mar 19 11:29:47 workstation2 rngd[792]: Enabling JITTER rng support
Mar 19 11:29:47 workstation2 rngd[792]: Initalizing entropy source jitter
Mar 19 11:29:47 workstation2 rngd[792]: PKCS11 Engine /usr/lib64/opensc-pkcs11.so Error: No such file or directory
Mar 19 11:29:47 workstation2 rngd[792]: Failed to init entropy source pkcs11
Mar 19 11:29:59 workstation2 systemd[1]: Stopping Hardware RNG Entropy Gatherer Daemon...
Mar 19 11:31:30 workstation2 systemd[1]: rngd.service: State 'stop-sigterm' timed out. Killing.
Mar 19 11:31:30 workstation2 systemd[1]: rngd.service: Killing process 792 (rngd) with signal SIGKILL.
Mar 19 11:31:30 workstation2 systemd[1]: rngd.service: Main process exited, code=killed, status=9/KILL
Mar 19 11:31:30 workstation2 systemd[1]: rngd.service: Failed with result 'timeout'.
Mar 19 11:31:30 workstation2 systemd[1]: Stopped Hardware RNG Entropy Gatherer Daemon.

Version-Release number of selected component (if applicable):
rng-tools-6.7-1.fc30.x86_64

How reproducible:
Not repeatable, maybe it's some race condition?
Hmm, I would agree it's likely a race condition, possibly blocking in the jitter rng while waiting for one of the pipe writer threads to return for an exit, but I can't immediately see how. If you are able, I suggest you modify the unit file to enable debugging (-d) on the rngd exec line, and if it happens again, send in the full log. That will give us a better clue as to what's going on here.
I haven't seen this since. Since it's most likely a race, and seemingly one hard to hit at that, let's close. If it happens again, I'll reopen with more info.
I'm seeing this on my laptop now too:

$ journalctl -u rngd
Mar 29 16:04:28 krowka systemd[1]: Started Hardware RNG Entropy Gatherer Daemon.
Mar 29 16:04:28 krowka rngd[1189]: Initalizing available sources
Mar 29 16:04:28 krowka rngd[1189]: Failed to init entropy source hwrng
Mar 29 16:04:28 krowka rngd[1189]: Enabling RDSEED rng support
Mar 29 16:04:28 krowka rngd[1189]: Initalizing entropy source rdrand
Mar 29 16:04:32 krowka rngd[1189]: Initalizing AES buffer
Mar 29 16:04:32 krowka rngd[1189]: Enabling JITTER rng support
Mar 29 16:04:32 krowka rngd[1189]: Initalizing entropy source jitter
Mar 29 16:04:32 krowka rngd[1189]: PKCS11 Engine /usr/lib64/opensc-pkcs11.so Error: No such file or directory
Mar 29 16:04:32 krowka rngd[1189]: Failed to init entropy source pkcs11
Apr 02 15:09:07 krowka systemd[1]: Stopping Hardware RNG Entropy Gatherer Daemon...
Apr 02 15:10:37 krowka systemd[1]: rngd.service: State 'stop-sigterm' timed out. Killing.
Apr 02 15:10:37 krowka systemd[1]: rngd.service: Killing process 1189 (rngd) with signal SIGKILL.
Apr 02 15:10:37 krowka systemd[1]: rngd.service: Main process exited, code=killed, status=9/KILL
Apr 02 15:10:37 krowka systemd[1]: rngd.service: Failed with result 'timeout'.
Apr 02 15:10:37 krowka systemd[1]: Stopped Hardware RNG Entropy Gatherer Daemon.
Apr 02 15:10:37 krowka systemd[1]: rngd.service: Consumed 1min 4.897s CPU time.
Apr 02 16:01:07 krowka systemd[1]: Started Hardware RNG Entropy Gatherer Daemon.
Apr 02 16:01:07 krowka rngd[1238]: Initalizing available sources
Apr 02 16:01:07 krowka rngd[1238]: Failed to init entropy source hwrng
Apr 02 16:01:07 krowka rngd[1238]: Enabling RDSEED rng support
Apr 02 16:01:07 krowka rngd[1238]: Initalizing entropy source rdrand
Apr 02 16:01:13 krowka rngd[1238]: Initalizing AES buffer
Apr 02 16:01:13 krowka rngd[1238]: Enabling JITTER rng support
Apr 02 16:01:13 krowka rngd[1238]: Initalizing entropy source jitter
Apr 02 16:01:13 krowka rngd[1238]: PKCS11 Engine /usr/lib64/opensc-pkcs11.so Error: No such file or directory
Apr 02 16:01:13 krowka rngd[1238]: Failed to init entropy source pkcs11
Apr 04 16:24:29 krowka systemd[1]: Stopping Hardware RNG Entropy Gatherer Daemon...
Apr 04 16:25:59 krowka systemd[1]: rngd.service: State 'stop-sigterm' timed out. Killing.
Apr 04 16:25:59 krowka systemd[1]: rngd.service: Killing process 1238 (rngd) with signal SIGKILL.
Apr 04 16:25:59 krowka systemd[1]: rngd.service: Main process exited, code=killed, status=9/KILL
Apr 04 16:25:59 krowka systemd[1]: rngd.service: Failed with result 'timeout'.
Apr 04 16:25:59 krowka systemd[1]: Stopped Hardware RNG Entropy Gatherer Daemon.
Apr 04 16:25:59 krowka systemd[1]: rngd.service: Consumed 3min 52.770s CPU time.

I added -d now, I'll report again if I get more logs.
That would be helpful, please. In fact, if you see it happen again, could you please run this command:

pstack `pidof rngd`

That should give you a backtrace of where rngd is when it's blocked. You may need to install the rngd debuginfo package, but that will confirm what's going on. Thanks
Looks like rngd doesn't catch any signals. So, it's not really designed for a graceful shutdown. Maybe just altering the service file like this would fix the issue:

diff --git a/rngd.service b/rngd.service
index 3d9dcb5..7014179 100644
--- a/rngd.service
+++ b/rngd.service
@@ -3,6 +3,7 @@ Description=Hardware RNG Entropy Gatherer Daemon
 
 [Service]
 ExecStart=/sbin/rngd -f
+KillSignal=SIGKILL
 
 [Install]
 WantedBy=multi-user.target
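For anyone wanting to try this workaround locally without editing the packaged unit file, a systemd drop-in override would be one way to apply it (a sketch; the drop-in file name is my own choice, not something shipped by the package):

```ini
# /etc/systemd/system/rngd.service.d/killsignal.conf  (hypothetical drop-in)
[Service]
KillSignal=SIGKILL
```

After creating it, run `systemctl daemon-reload` so systemd picks up the override. Note that SIGKILL skips any cleanup the daemon would otherwise do, so this masks the hang rather than fixing it.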
(In reply to Steve Grubb from comment #5)
> Looks like rngd doesn't catch any signals. So, its not really designed for a
> graceful shutdown. Maybe just altering the service file like this would fix
> the issue:
>
> diff --git a/rngd.service b/rngd.service
> index 3d9dcb5..7014179 100644
> --- a/rngd.service
> +++ b/rngd.service
> @@ -3,6 +3,7 @@ Description=Hardware RNG Entropy Gatherer Daemon
>
> [Service]
> ExecStart=/sbin/rngd -f
> +KillSignal=SIGKILL
>
> [Install]
> WantedBy=multi-user.target

It should catch SIGTERM, as you can see at https://github.com/nhorman/rng-tools/blob/master/rngd.c#L811
I see this happen sporadically, both on real hardware and in some VMs:

$ journalctl -u rngd -b-7
-- Logs begin at Thu 2019-04-18 23:28:02 CEST, end at Thu 2019-04-25 12:07:35 CEST. --
Apr 21 16:21:05.829170 rngd[29558]: Added 3092/4096 bits entropy
Apr 21 16:21:05.829170 rngd[29558]: Reading entropy from JITTER Entropy generator
Apr 21 16:21:05.829170 rngd[29558]: xread_jitter requests 2500 bytes from pipe
Apr 21 16:21:05.833853 rngd[29558]: xread_jitter gets 2500 bytes
Apr 21 16:21:05.833853 rngd[29558]: Added 3186/4096 bits entropy
Apr 21 16:21:05.833853 rngd[29558]: Added 3271/4096 bits entropy
Apr 21 16:21:05.833853 rngd[29558]: Added 3349/4096 bits entropy
Apr 21 16:21:05.833853 rngd[29558]: Added 3419/4096 bits entropy
Apr 21 16:21:05.833853 rngd[29558]: Added 3482/4096 bits entropy
Apr 21 16:21:05.833853 rngd[29558]: Added 3539/4096 bits entropy
Apr 21 16:21:05.833853 rngd[29558]: Added 3592/4096 bits entropy
Apr 21 16:21:05.833853 rngd[29558]: Added 3639/4096 bits entropy
Apr 21 16:21:05.833853 rngd[29558]: Added 3682/4096 bits entropy
...
Apr 21 16:21:05.833853 rngd[29558]: Pool full at 4096, sleeping!
Apr 21 16:46:14.054552 rngd[29558]: Added 3125/4096 bits entropy
...
Apr 21 16:46:14.064115 rngd[29558]: Reading entropy from Intel RDRAND Instruction RNG
Apr 21 16:46:14.064115 rngd[29558]: Added 3733/4096 bits entropy
...
Apr 21 17:11:23.218949 rngd[29558]: Added 3947/4096 bits entropy
Apr 21 17:11:23.218949 rngd[29558]: Reading entropy from JITTER Entropy generator
Apr 21 17:11:23.218949 rngd[29558]: xread_jitter requests 2500 bytes from pipe
...
Apr 23 00:14:28.446809 rngd[29558]: Pool full at 4096, sleeping!
Apr 23 16:06:33.656160 rngd[29558]: Checking on done for thread 0
Apr 23 16:06:33.656160 rngd[29558]: DONE Writing to pipe with return 4096
Apr 23 16:06:33.656160 rngd[29558]: DONE Writing to pipe with return -1
Apr 23 16:06:33.656160 rngd[29558]: DONE Writing to pipe with return 4096
Apr 23 16:06:33.656160 rngd[29558]: DONE Writing to pipe with return 8192
Apr 23 16:06:33.656160 rngd[29558]: Error on pipe write: Resource temporarily unavailable  <----------------
Apr 23 16:06:33.656160 rngd[29558]: Checking on done for thread 1
Apr 23 16:06:33.633826 systemd[1]: Stopping Hardware RNG Entropy Gatherer Daemon...
Apr 23 16:08:03.788223 systemd[1]: rngd.service: State 'stop-sigterm' timed out. Killing.
Apr 23 16:08:03.788529 systemd[1]: rngd.service: Killing process 29558 (rngd) with signal SIGKILL.
Apr 23 16:08:03.797241 systemd[1]: rngd.service: Main process exited, code=killed, status=9/KILL
Apr 23 16:08:03.797489 systemd[1]: rngd.service: Failed with result 'timeout'.
Apr 23 16:08:03.798149 systemd[1]: Stopped Hardware RNG Entropy Gatherer Daemon.
Apr 23 16:08:03.798405 systemd[1]: rngd.service: Consumed 35min 54.922s CPU time.

The pipe error looks relevant. I haven't had the chance to gather more information though.
I may see the problem:

diff --git a/rngd_jitter.c b/rngd_jitter.c
index f68e2ca..a471253 100644
--- a/rngd_jitter.c
+++ b/rngd_jitter.c
@@ -284,7 +284,7 @@ static void *thread_entropy_task(void *data)
 
 		/* Write to pipe */
 		written = 0;
-		while(written != me->buf_sz) {
+		while(me->active && written != me->buf_sz) {
 			message(LOG_DAEMON|LOG_DEBUG, "Writing to pipe\n");
 			ret = write(me->pipe_fd, &tmpbuf[written], me->buf_sz - written);
 			message(LOG_DAEMON|LOG_DEBUG, "DONE Writing to pipe with return %d\n", ret);

Here's a build with the above patch; if you could try it out please:
https://koji.fedoraproject.org/koji/taskinfo?taskID=34421984
I installed the package. I'll report in a few days if I still see the problem.
copy that, thank you!
This version doesn't seem to be a huge improvement ;) It's spinning on the CPU.

strace:
22:28:24.573853 sched_yield() = 0
22:28:24.573918 sched_yield() = 0
22:28:24.573987 sched_yield() = 0
22:28:24.574053 sched_yield() = 0
22:28:24.574125 sched_yield() = 0
22:28:24.574194 sched_yield() = 0
22:28:24.574261 sched_yield() = 0
22:28:24.575644 sched_yield() = 0
22:28:24.575740 sched_yield() = 0
22:28:24.575863 sched_yield() = 0
22:28:24.575936 sched_yield() = 0
22:28:24.576014 sched_yield() = 0
22:28:24.576181 sched_yield() = 0
22:28:24.576325 sched_yield() = 0
22:28:24.576470 sched_yield() = 0
22:28:24.576573 sched_yield() = 0
22:28:24.576740 sched_yield() = 0
22:28:24.576895 sched_yield() = 0
22:28:24.577048 sched_yield() = 0
22:28:24.577208 sched_yield() = 0
22:28:24.578828 sched_yield() = 0
22:28:24.579778 sched_yield() = 0
22:28:24.581056 sched_yield() = 0
22:28:24.581200 sched_yield() = 0
22:28:24.581286 sched_yield() = 0
22:28:24.581361 sched_yield() = 0
22:28:24.581454 sched_yield() = 0
22:28:24.581554 sched_yield() = 0
22:28:24.581952 sched_yield() = 0
22:28:24.582124 sched_yield() = 0
22:28:24.582239 sched_yield() = 0
22:28:24.582358 sched_yield() = 0
22:28:24.582446 sched_yield() = 0
22:28:24.582518 sched_yield() = 0
22:28:24.582632 sched_yield() = 0
22:28:24.583964 sched_yield() = 0

pstack:
#0  0x00007ffb0b312efb in sched_yield () from /lib64/libc.so.6
#1  0x000055e8913818c5 in init_jitter_entropy_source (ent_src=0x55e891388800 <entropy_sources+480>) at rngd_jitter.c:429
#2  0x000055e89137b8cc in main (argc=<optimized out>, argv=<optimized out>) at rngd.c:752
Oh, shoot, I didn't take care of the initialization case. Apologies. This build should take care of that:

https://koji.fedoraproject.org/koji/taskinfo?taskID=34469597

I needed to add a check for !first in the entropy task function's while loop, so that it could actually start up properly.
No change, -test2 still spins in sched_yield().
Hmm, very odd. I've done some more looking and found something unexpected. I've managed to get this to occur on my system as well (or a similar problem), and the issue doesn't appear to be the interlock on thread activation. Rather, the problem seems to be that the call to jent_read_entropy is never returning (or is exiting somewhere within the library), suggesting that the issue is in the jitterentropy library. Would you please run the following command on your system:

ltrace -f rngd -f

and confirm that you see output similar to this:

[pid 20049] rngd->jent_read_entropy(0x1aec3e0, 0x7fd430000b20, 0x4097, 24JITTER thread on cpu 1 wakes up for refill <unfinished ...>
[pid 20050] rngd->jent_read_entropy(0x1aecda0, 0x7fd428000b20, 0x4097, 24JITTER thread on cpu 2 wakes up for refill <unfinished ...>
[pid 20051] rngd->jent_read_entropy(0x1aed760, 0x7fd424000b20, 0x4097, 24Waiting for jitter tasks to start JITTER thread on cpu 3 wakes up for refill <unfinished ...>
[pid 20052] rngd->jent_read_entropy(0x1aee120, 0x7fd41c000b20, 0x4097, 24 <unfinished ...>
[pid 20050] <... jent_read_entropy resumed> ) = 0x4097
[pid 20049] <... jent_read_entropy resumed> ) = 0x4097
[pid 20050] +++ exited (status 0) +++
[pid 20049] +++ exited (status 0) +++
[pid 20052] <... jent_read_entropy resumed> ) = 0x4097
[pid 20052] +++ exited (status 0) +++
[pid 20051] <... jent_read_entropy resumed> ) = 0x4097
[pid 20051] +++ exited (status 0) +++

I'm not sure why this would be happening now, as the library hasn't been updated as far as I can tell, but we may need to get the maintainer of that package involved here.
Scratch that, it's too early here, I'm being stupid.
Ok, I think I have this sorted:

https://koji.fedoraproject.org/koji/taskinfo?taskID=34527303

Looks like the race was in the shutdown function. We would flag the threads for the jitter entropy source to exit, and the threads would exit before we could get feedback on their completion, leading to us blocking in the read call. By using pthread_kill, I'm able to skip the read operation entirely on exit, which seems to be working. I managed to recreate the problem on a VM, and with the patch in the above build I was able to start and stop the daemon repeatedly for about 2 hours with no subsequent failures. If you could please confirm the above build fixes this for you as well, I can check the fix in upstream and into Fedora.
I just ran into this in the F30 build as well.
Me too on fedora 30: I was rebooting and got the cylon eyeball and a message about a stop job running for Entropy Gatherer Daemon. Had to press the reset button (after getting tired of waiting to see if anything would timeout).
*** Bug 1706432 has been marked as a duplicate of this bug. ***
That's great. Would anyone like to confirm that the build I provided fixes the issue for them?
> thats great, would anyone like to confirm that the build I provided fixes the issue for them?

I use rng-tools-6.7-1.fc30 on F30 and it still hangs 50/50 on reboot after suspend-resume.
I'm testing rng-tools-6.7-1.fc30.x86_64 and so far it seems to behave OK.
I still have this issue even on my VM after 15 min uptime. rng-tools-6.7-1.fc30.x86_64
(In reply to Artem from comment #23)
> I still have this issue even on my VM after 15 min uptime.
> rng-tools-6.7-1.fc30.x86_64

Same here, with that same package: it's reproducible 100% of the time.
Just to confirm: for me on Fedora 30, same version, same problem.
Getting this issue on both of my machines after upgrading to F30.
How many people are going to respond with "I'm seeing this too" before someone takes the time to confirm that the build in comment 16 fixes the issue for them?
> before someone takes the time to confirm that the build in comment 16 fixes the issue for them?

We already use rng-tools-6.7-1 from F30 repos. It still hangs.
Hi Neil Horman, are you saying that the F31 build for 6.7.1 is patched? Where can I grab the rpm? I couldn't see a link to it. Sorry, new to Fedora.
I'm saying that I built a test build in comment 16 and asked the reporter (or anyone experiencing the problem) to test it and confirm that it solved the problem for them, but so far no one has.
Neil, please note that the rpm you provided is built for rawhide, so is not installable for most (all?) reporters here. Unfortunately it has the same version as the existing rpm in F30, which people actually *can* install, which makes things confusing. When I was testing your build in koji, I rebuilt it locally in mock from the srpm to be able to install on F30. I didn't observe the bug, but the bug was always probabilistic, hence this lack of occurrence is not conclusive, so I only reported it as "so far seems ok" in comment #c22. My suggestion would be to submit an update for F30, with a bumped version, and ask people to test this and provide karma.
Zbigniew, please note that the bug is opened against rawhide, so I would expect users to have a rawhide install available to test with, or to open a bug against a new release. Given that some of the reporters here had 100% failure rates, I'll take your testing along with mine to be sufficient to show that it's fixed and check it in.
In OpenBMC we hit an issue where rngd consumes 100% CPU, and we found that it is caused by the patch mentioned here:

```
diff --git a/rngd_jitter.c b/rngd_jitter.c
index f68e2ca..a471253 100644
--- a/rngd_jitter.c
+++ b/rngd_jitter.c
@@ -284,7 +284,7 @@ static void *thread_entropy_task(void *data)
 
 		/* Write to pipe */
 		written = 0;
-		while(written != me->buf_sz) {
+		while(me->active && written != me->buf_sz) {
 			message(LOG_DAEMON|LOG_DEBUG, "Writing to pipe\n");
 			ret = write(me->pipe_fd, &tmpbuf[written], me->buf_sz - written);
 			message(LOG_DAEMON|LOG_DEBUG, "DONE Writing to pipe with return %d\n", ret);
```

It is wrong: the task's `active` state never gets a chance to be set to 1, which leaves init_jitter_entropy_source() spinning while waiting for it to become 1. See details in https://github.com/openbmc/openbmc/issues/3574
This bug appears to have been reported against 'rawhide' during the Fedora 31 development cycle. Changing version to '31'.
I was hitting this bug with a fresh (un-updated) F30 install today. While debugging a plymouth issue I have been rebooting all day and I hit this regularly. So getting annoyed by the extra reboot time, I did some digging and I found this bug. Neil, I assume that the fix from the no longer available scratch-build, which got little feedback, is the same as the one from rng-tools-6.7-2? After upgrading to that I've done like 20 more reboots or so and I've not seen this hit once. So I believe it is safe to call this fixed now (with your fix) and I'm closing this bug.