In our CI jobs we collect reboot time using 'systemd-analyze', but in recent tests it frequently fails with "Bootup is not yet finished" even after retrying for 60s. I see "Error on pipe write" in the journal log; is that the reason boot takes longer than before?

[root@ip-10-116-2-153 ec2-user]# sudo systemd-analyze
Bootup is not yet finished (org.freedesktop.systemd1.Manager.FinishTimestampMonotonic=0).
Please try again later.
Hint: Use 'systemctl list-jobs' to see active jobs

[root@ip-10-116-2-153 ec2-user]# systemctl list-jobs
JOB UNIT         TYPE  STATE
153 rngd.service start waiting

1 jobs listed.

Below is the journal log; the rngd entries run from Nov 26 07:58:15 to Nov 26 07:59:45.

[root@ip-10-116-2-153 ec2-user]# journalctl | grep -i rngd
Nov 26 07:58:10 localhost rngd[242]: Initializing available sources
Nov 26 07:58:10 localhost rngd[242]: Failed to init entropy source hwrng
Nov 26 07:58:13 localhost rngd[242]: Initializing AES buffer
Nov 26 07:58:13 localhost rngd[242]: Enabling JITTER rng support
Nov 26 07:58:13 localhost rngd[242]: Initializing entropy source jitter
Nov 26 07:58:15 localhost rngd[242]: Error on pipe write: Bad file descriptor
Nov 26 07:58:15 localhost rngd[242]: Closing thread 0
Nov 26 07:58:17 ip-10-116-2-153.us-west-2.compute.internal systemd[1]: rngd.service: Current command vanished from the unit file, execution of the command list won't be resumed.
Nov 26 07:58:17 ip-10-116-2-153.us-west-2.compute.internal rngd[242]: Error on pipe write: Bad file descriptor
Nov 26 07:58:18 ip-10-116-2-153.us-west-2.compute.internal rngd[242]: Error on pipe write: Bad file descriptor
Nov 26 07:59:45 ip-10-116-2-153.us-west-2.compute.internal systemd[1]: rngd.service: State 'stop-sigterm' timed out. Killing.
Nov 26 07:59:45 ip-10-116-2-153.us-west-2.compute.internal systemd[1]: rngd.service: Killing process 242 (rngd) with signal SIGKILL.
Nov 26 07:59:45 ip-10-116-2-153.us-west-2.compute.internal systemd[1]: rngd.service: Main process exited, code=killed, status=9/KILL
Nov 26 07:59:45 ip-10-116-2-153.us-west-2.compute.internal systemd[1]: rngd.service: Failed with result 'timeout'.
Nov 26 07:59:45 ip-10-116-2-153.us-west-2.compute.internal rngd[1940]: Initializing available sources
Nov 26 07:59:45 ip-10-116-2-153.us-west-2.compute.internal rngd[1940]: Failed to init entropy source hwrng
Nov 26 07:59:48 ip-10-116-2-153.us-west-2.compute.internal rngd[1940]: Initializing AES buffer
Nov 26 07:59:48 ip-10-116-2-153.us-west-2.compute.internal rngd[1940]: Enabling JITTER rng support
Nov 26 07:59:48 ip-10-116-2-153.us-west-2.compute.internal rngd[1940]: Initializing entropy source jitter
Nov 26 07:59:48 ip-10-116-2-153.us-west-2.compute.internal rngd[1940]: unable to adjust write_wakeup_threshold: Permission denied

[root@ip-10-116-2-153 ec2-user]# rpm -qa | grep rng
rng-tools-6.8-1.el8.aarch64

[root@ip-10-116-2-153 ec2-user]# systemd-analyze
Startup finished in 2.202s (kernel) + 6.728s (initrd) + 1min 29.586s (userspace) = 1min 38.517s

Version-Release number of selected component (if applicable):
RHEL Version: RHEL8.2 (4.18.0-151.el8.aarch64)

How reproducible:
100%

Steps to Reproduce:
1. Start a RHEL8.2 arm instance.
2. Check the journal log and system bootup time.

Actual results:
Bootup is not yet finished

Expected results:
No such failure

Additional info:
- No such failure in RHEL8.1
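A side note on the CI job itself: rather than polling `systemd-analyze` on a fixed 60s budget, the wait can be factored into a retry helper with a configurable budget (on newer systemd, `systemctl is-system-running --wait` does this natively). A minimal sketch; the helper name and the stand-in probes are illustrative, not part of any existing tooling:

```shell
# wait_ready: retry a readiness probe up to N times before giving up.
# In the CI job the probe would be `systemd-analyze time`; it is passed
# as arguments here so the helper can be exercised standalone.
wait_ready() {
    tries=$1; shift
    n=0
    while [ "$n" -lt "$tries" ]; do
        if "$@" >/dev/null 2>&1; then
            return 0
        fi
        n=$((n + 1))
        sleep 1
    done
    return 1
}

wait_ready 3 true  && echo "boot finished"     # probe succeeds at once
wait_ready 2 false || echo "gave up after retries"
```

The point is only that the retry budget becomes an explicit parameter the CI job can raise on slow-entropy instances, instead of a hard-coded 60s.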
This works just fine on a physical system in Beaker installed with RHEL8, and on my local RHEL8 system (both x86_64). Can I get access to the system this is failing on, to see what's going on?
Just got hold of an aarch64 system, and it seems to work fine there too, so I'm not sure what's going on in your environment. Note that there can be a pipe write failure during shutdown of rngd as we are closing the pipe, but the message is harmless, as the daemon is shutting down. It may be that on aarch64 systems, rngd is blocked on shutdown because the daemon is executing inside the jitterentropy library, which takes a long time to compute entropy. I do have an upstream fix for that. Does this problem happen consistently, or only occasionally?
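For anyone reproducing this on other instance types, a quick (illustrative) way to confirm why rngd falls back to jitter: the "Failed to init entropy source hwrng" message in the report lines up with the guest exposing no hardware RNG, which leaves the slow jitter source as the only one. A sketch using standard kernel interfaces:

```shell
# If the guest exposes no hwrng device (e.g. no virtio-rng), this sysfs
# node is absent and rngd's hwrng source fails to initialize:
cat /sys/class/misc/hw_random/rng_available 2>/dev/null ||
    echo "no hwrng device exposed to this guest"

# Kernel entropy pool estimate; on such guests, jitterentropy is the
# main thing feeding it during early boot:
cat /proc/sys/kernel/random/entropy_avail
```

On an a1.xlarge showing the symptoms above, one would expect the first command to report no device.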
https://brewweb.engineering.redhat.com/brew/taskinfo?taskID=24969019 If you could, test this package out please. It incorporates the faster exit patch from upstream. I think what you're seeing is a problem in systemd-analyze, but this will likely help avoid it.
(In reply to Neil Horman from comment #3)
> https://brewweb.engineering.redhat.com/brew/taskinfo?taskID=24969019
>
> If you could, test this package out please. It incorporates the faster exit
> patch from upstream. I think what you're seeing is a problem in
> systemd-analyze, but this will likely help avoid it.

I can always reproduce this issue in an a1.xlarge instance (including with your scratch build), and I sent the access information to you via mail, so you can debug on the live system. Thanks
OK, I've taken a look, and I think what we're seeing is a combination of a few issues:

1) The initramfs copy of rngd is getting shut down (which is normal) but issuing a warning about pipe write failures (which is not). This is due to us checking errno on the failure of a write call, but at a location where an intervening call may already have changed it.

2) The time it takes for jitter to generate entropy on a system without any other source of entropy can be long, leading systemd-analyze to wait a long time for boot to finish.

I can fix these upstream if this bz gets approved, but note: to test it, you will not only need to install the new rng-tools package, you will also need to regenerate all the initramfs images so that the new rngd gets picked up properly.
*** Bug 1776086 has been marked as a duplicate of this bug. ***
https://brewweb.engineering.redhat.com/brew/taskinfo?taskID=25085017

This appears to fix the problem; please confirm.

Note: to test this properly, you will have to not only install the new rpm but also regenerate the initramfs for the running kernel.
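Spelling out the test steps from comment 5 as commands (the rpm filename glob is illustrative; the initramfs image path follows the usual RHEL layout):

```shell
# Install the scratch-build rpm, then regenerate the initramfs for the
# running kernel so the new rngd binary is picked up at early boot:
#
#   dnf install ./rng-tools-*.rpm
#   dracut -f /boot/initramfs-$(uname -r).img $(uname -r)
#   lsinitrd | grep -w rngd      # confirm the new binary is in the image
#
# The image path is derived from the running kernel version:
img="/boot/initramfs-$(uname -r).img"
echo "$img"
```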
(In reply to Neil Horman from comment #7)
> https://brewweb.engineering.redhat.com/brew/taskinfo?taskID=25085017
>
> This appears to fix the problem, please confirm.
>
> Note: To properly test this, you will have to not only install the new rpm,
> but regenerate the initramfs for the running kernel

Thanks Neil, it is working now and I cannot reproduce this issue with your new build. But there is a new error:

Dec 03 05:58:26 ip-10-116-2-232.us-west-2.compute.internal systemd[1]: /usr/lib/systemd/system/rngd.service:11: Failed to parse TimeoutStartSec= parameter, ignoring: "60s"

In /usr/lib/systemd/system/rngd.service, the value should not be wrapped in '"'. The error disappeared after I changed it:

[root@ip-10-116-2-119 ec2-user]# diff -u rngd.service.orig /usr/lib/systemd/system/rngd.service
--- rngd.service.orig	2019-12-03 06:09:11.130429909 +0000
+++ /usr/lib/systemd/system/rngd.service	2019-12-03 06:01:41.713624471 +0000
@@ -8,7 +8,7 @@
 CapabilityBoundingSet=CAP_SYS_ADMIN
 AmbientCapabilities=CAP_SYS_ADMIN
 ExecStart=/sbin/rngd -f --fill-watermark=0
-TimeoutStartSec="60s"
+TimeoutStartSec=60s
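For what it's worth, this class of mistake is easy to lint for before shipping a unit file; a hedged sketch (the sample lines mirror the unit file above, and the pattern only catches values wrapped in double quotes):

```shell
# Flag unit-file settings whose value starts with a double quote, the
# exact mistake in the shipped rngd.service (systemd time spans such as
# TimeoutStartSec= take an unquoted value like 60s):
printf '%s\n' \
    'ExecStart=/sbin/rngd -f --fill-watermark=0' \
    'TimeoutStartSec="60s"' |
grep -n '^[A-Za-z][A-Za-z]*="'
# -> 2:TimeoutStartSec="60s"
```

On a real system, `systemd-analyze verify rngd.service` would also have reported the parse failure without a reboot.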
Copy that, thanks!

Requesting an exception for 8.2. On arm systems, using the jitterentropy source can cause additional boot delays, which can be seen as a regression (as noted here). The fix is self-contained and straightforward.
In response to comment 10:

1) The impact is that systems may not receive as much entropy at bootup, leading to boot-time delays, and potentially shutdown delays when systemd waits for the entire service stop period before killing the rngd process.

2) The risk is low; this fix is self-contained to rngd and will not impact other components.

3) No other details to provide, but I will be on the blocker meeting calls and am happy to answer specific questions regarding this fix.
(In reply to Neil Horman from comment #11)
> In response to comment 10:
> 1) The impact is that systems may not receive as much entropy on bootup
> leading to boot time delays, and potentially shutdown delays, when systemd
> waits for the entire service stop period before killing the rngd process
>
> 2) The risk is low, this fix is self contained to rngd, and will not impact
> other components.
>
> 3) No other details to provide, but will be on the blocker meeting calls and
> am happy to answer specific questions regarding this fix.

Hi Neil,

From the QE perspective, customers often compare boot times across distributions using 'systemd-analyze' (bugs 1446698 and 1446688), which is why we added such cases to our automation. Several related bugs have been fixed in the past, e.g. bug 1625874 in RHEL8.0 and bug 1591542 in RHEL7. So I would prefer to include this in RHEL8.2 rather than ship a buggy product again. Thanks
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA.

For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2020:1762