Bug 1776710

Summary: [RHEL8]Bootup is not yet finished for waiting rngd.service start up
Product: Red Hat Enterprise Linux 8 Reporter: Frank Liang <xiliang>
Component: rng-toolsAssignee: Neil Horman <nhorman>
Status: CLOSED ERRATA QA Contact: Frank Liang <xiliang>
Severity: medium Docs Contact:
Priority: medium    
Version: 8.2CC: core-kernel-mgr, leiwang, linl, ribarry, vkuznets, wshi, ymao
Target Milestone: rcKeywords: Regression
Target Release: 8.2   
Hardware: aarch64   
OS: Linux   
Whiteboard:
Fixed In Version: rng-tools-6.8-2.el8 Doc Type: If docs needed, set a value
Doc Text:
Story Points: ---
Clone Of: Environment:
Last Closed: 2020-04-28 16:06:42 UTC Type: Bug
Regression: --- Mount Type: ---
Documentation: --- CRM:
Verified Versions: Category: ---
oVirt Team: --- RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: --- Target Upstream Version:
Embargoed:

Description Frank Liang 2019-11-26 08:28:44 UTC
In our ci jobs, we are collecting reboot time using 'systemd-analyze', but it is frequently failed after retried 60s in recent test because of "Bootup is not yet finished".
I see "Error on pipe write" in journal log, is it the reason for it takes longer time than before?

[root@ip-10-116-2-153 ec2-user]# sudo systemd-analyze
Bootup is not yet finished (org.freedesktop.systemd1.Manager.FinishTimestampMonotonic=0).
Please try again later.
Hint: Use 'systemctl list-jobs' to see active jobs
[root@ip-10-116-2-153 ec2-user]# systemctl list-jobs
JOB UNIT         TYPE  STATE  
153 rngd.service start waiting

1 jobs listed.

Below is the journal log, the rngd logged time is from Nov 26 07:58:15 to Nov 26 07:59:45.

[root@ip-10-116-2-153 ec2-user]# journalctl |grep -i rngd
Nov 26 07:58:10 localhost rngd[242]: Initializing available sources
Nov 26 07:58:10 localhost rngd[242]: Failed to init entropy source hwrng
Nov 26 07:58:13 localhost rngd[242]: Initializing AES buffer
Nov 26 07:58:13 localhost rngd[242]: Enabling JITTER rng support
Nov 26 07:58:13 localhost rngd[242]: Initializing entropy source jitter
Nov 26 07:58:15 localhost rngd[242]: Error on pipe write: Bad file descriptor
Nov 26 07:58:15 localhost rngd[242]: Closing thread 0
Nov 26 07:58:17 ip-10-116-2-153.us-west-2.compute.internal systemd[1]: rngd.service: Current command vanished from the unit file, execution of the command list won't be resumed.
Nov 26 07:58:17 ip-10-116-2-153.us-west-2.compute.internal rngd[242]: Error on pipe write: Bad file descriptor
Nov 26 07:58:18 ip-10-116-2-153.us-west-2.compute.internal rngd[242]: Error on pipe write: Bad file descriptor
Nov 26 07:59:45 ip-10-116-2-153.us-west-2.compute.internal systemd[1]: rngd.service: State 'stop-sigterm' timed out. Killing.
Nov 26 07:59:45 ip-10-116-2-153.us-west-2.compute.internal systemd[1]: rngd.service: Killing process 242 (rngd) with signal SIGKILL.
Nov 26 07:59:45 ip-10-116-2-153.us-west-2.compute.internal systemd[1]: rngd.service: Main process exited, code=killed, status=9/KILL
Nov 26 07:59:45 ip-10-116-2-153.us-west-2.compute.internal systemd[1]: rngd.service: Failed with result 'timeout'.
Nov 26 07:59:45 ip-10-116-2-153.us-west-2.compute.internal rngd[1940]: Initializing available sources
Nov 26 07:59:45 ip-10-116-2-153.us-west-2.compute.internal rngd[1940]: Failed to init entropy source hwrng
Nov 26 07:59:48 ip-10-116-2-153.us-west-2.compute.internal rngd[1940]: Initializing AES buffer
Nov 26 07:59:48 ip-10-116-2-153.us-west-2.compute.internal rngd[1940]: Enabling JITTER rng support
Nov 26 07:59:48 ip-10-116-2-153.us-west-2.compute.internal rngd[1940]: Initializing entropy source jitter
Nov 26 07:59:48 ip-10-116-2-153.us-west-2.compute.internal rngd[1940]: unable to adjust write_wakeup_threshold: Permission denied

[root@ip-10-116-2-153 ec2-user]# rpm -qa|grep rng
rng-tools-6.8-1.el8.aarch64

[root@ip-10-116-2-153 ec2-user]# systemd-analyze 
Startup finished in 2.202s (kernel) + 6.728s (initrd) + 1min 29.586s (userspace) = 1min 38.517s

Version-Release number of selected component (if applicable):

RHEL Version:
RHEL8.2(4.18.0-151.el8.aarch64)

How reproducible:
100%

Steps to Reproduce:
1. Start a RHEL8.2 arm instance.
2. check journal log and system bootup time.

Actual results:
Bootup is not yet finished

Expected results:
No such failure

Additional info:
- No such failure in RHEL8.1

Comment 1 Neil Horman 2019-11-26 15:37:13 UTC
This works just fine on a physical system in beaker installed with RHEL8, and on my local RHEL8 system (both x86_64).  Can I get access to the system that this is failing on to see whats going on.

Comment 2 Neil Horman 2019-11-26 16:42:51 UTC
Just got hold of an aarch64 system, and it seems to work fine there too, so I'm not sure whats going on in your environment.  Note that there can be a pipe write failure during shutdown of rngd as we are closing the pipe, but the message is harmless, as the daemon is shutting down.

It may be, that on aarch64 systems, rngd is blocked on shutdown as the daemon is executing inside the jitterentropy library, which takes a long time to compute entropy.  I do have an upstream fix for that.  Does this problem happen consistently, or only occasionally?

Comment 3 Neil Horman 2019-11-26 17:08:56 UTC
https://brewweb.engineering.redhat.com/brew/taskinfo?taskID=24969019

If you could, test this package out please.  It incorporates the faster exit patch from upstream.  I think what you're seeing is a problem in systemd-analyze, but this will likely help avoid it.

Comment 4 Frank Liang 2019-12-02 05:53:52 UTC
(In reply to Neil Horman from comment #3)
> https://brewweb.engineering.redhat.com/brew/taskinfo?taskID=24969019
> 
> If you could, test this package out please.  It incorporates the faster exit
> patch from upstream.  I think what you're seeing is a problem in
> systemd-analyze, but this will likely help avoid it.

I can always reproduce this issue in a1.xlarge instance(including your scratch build), and I sent the access information to you via mail, you can debug in online system. Thanks

Comment 5 Neil Horman 2019-12-02 15:51:36 UTC
Ok, I've taken a look, and it I think it appears that what were seeing is a combination of a few issues:

1) The fact that the initramfs copy of rngd is getting shut down (which is normal), and issuing a warning about pipe write failures (which is not).  This is due to us checking errno on the failure of a write call but in a location where it may have changed

2) The time it takes for jitter to generate entropy on a system without any other source of entropy can be long, leading to systemd-analyze to wait a long time for boot to finish

I can fix these  updstream if this bz gets approved, but note, to test it, you will not only need to install the new rng-tools package, but you will need to regenerate all the initramfs's so that the new rngd gets picked up properly

Comment 6 Neil Horman 2019-12-02 16:14:29 UTC
*** Bug 1776086 has been marked as a duplicate of this bug. ***

Comment 7 Neil Horman 2019-12-02 16:49:47 UTC
https://brewweb.engineering.redhat.com/brew/taskinfo?taskID=25085017

This appears to fix the problem, please confirm.

Note:  To properly test this, you will have to not only install the new rpm, but regenerate the initramfs for the running kernel

Comment 8 Frank Liang 2019-12-03 06:11:03 UTC
(In reply to Neil Horman from comment #7)
> https://brewweb.engineering.redhat.com/brew/taskinfo?taskID=25085017
> 
> This appears to fix the problem, please confirm.
> 
> Note:  To properly test this, you will have to not only install the new rpm,
> but regenerate the initramfs for the running kernel

Thanks Neil,

It is working now and cannot reproduce this issue again with your new build.

But there is a new error found:
Dec 03 05:58:26 ip-10-116-2-232.us-west-2.compute.internal systemd[1]: /usr/lib/systemd/system/rngd.service:11: Failed to parse TimeoutStartSec= parameter, ignoring: "60s"

In /usr/lib/systemd/system/rngd.service, seems no need '"' to wrap the value. The error disappeared after I changed it.

[root@ip-10-116-2-119 ec2-user]# diff -u rngd.service.orig /usr/lib/systemd/system/rngd.service
--- rngd.service.orig	2019-12-03 06:09:11.130429909 +0000
+++ /usr/lib/systemd/system/rngd.service	2019-12-03 06:01:41.713624471 +0000
@@ -8,7 +8,7 @@
 CapabilityBoundingSet=CAP_SYS_ADMIN
 AmbientCapabilities=CAP_SYS_ADMIN
 ExecStart=/sbin/rngd -f --fill-watermark=0
-TimeoutStartSec="60s"
+TimeoutStartSec=60s

Comment 9 Neil Horman 2019-12-03 11:43:17 UTC
copy that, thanks!

Requesting exception for 8.2. On arm systems using the jitterentropy source can cause additional boot delays, which can be seen as a regression (as noted here).  The fix is self contained and straightforward.

Comment 11 Neil Horman 2019-12-09 11:36:56 UTC
In response to comment 10:
1) The impact is that systems may not receive as much entropy on bootup leading to boot time delays, and potentially shutdown delays, when systemd waits for the entire service stop period before killing the rngd process


2) The risk is low, this fix is self contained to rngd, and will not impact other components.

3) No other details to provide, but will be on the blocker meeting calls and am happy to answer specific questions regarding this fix.

Comment 12 Frank Liang 2019-12-09 14:31:36 UTC
(In reply to Neil Horman from comment #11)
> In response to comment 10:
> 1) The impact is that systems may not receive as much entropy on bootup
> leading to boot time delays, and potentially shutdown delays, when systemd
> waits for the entire service stop period before killing the rngd process
> 
> 
> 2) The risk is low, this fix is self contained to rngd, and will not impact
> other components.
> 
> 3) No other details to provide, but will be on the blocker meeting calls and
> am happy to answer specific questions regarding this fix.

Hi Neil,

From qe perspective, customer usually comparing different distributions' boot time by 'systemd-analyze'(1446698 and 1446688). So we added such cases in our automation after that.
And there are some bugs got fixed in the past, eg. bug 1625874 in RHEL8.0 and 1591542 in RHEL7. So I prefer to add it to RHEL8.2 to avoid deliver a bug again product. Thanks

Comment 16 errata-xmlrpc 2020-04-28 16:06:42 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2020:1762