Description of problem:
- Installation fails because the default NSS CA certificate database is not present on the system (discovered by running 'curl' by hand afterwards).
- This occurs when enabling 'fips=1' and pointing to a kickstart at an https location.
- The primary curl error is:
  curl: (77) Problem with the SSL CA cert

Version-Release number of selected component (if applicable):
- Red Hat Enterprise Linux 7.2

How reproducible:
- For customer: every time
- For me: setting up testing now

Steps to Reproduce:
1. Host kickstart on a custom https location
2. Boot to installation media
3. Modify kernel command line to include "fips=1 ks=https://location/of/ks.cfg"

Actual results:
- Dracut times out.
- Manually running curl provides:
  * Connected to <IP> port 443 (#)
  * Initializing NSS with certpath: none
  * Unable to initialize NSS
  * Closing connection 0
  curl: (77) Problem with the SSL CA cert

Expected results:
- Installation succeeds

Additional info:
When I host the kickstart on an https share with fips=0, curl responds with:
curl: (60) Peer's certificate issuer has been marked as not trusted by the user.
I can use curl -k (--insecure) to still read the kickstart.

When I include fips=1 on the kernel command line, curl responds with:
curl: (77) Problem with the SSL CA cert (path? access rights?)
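The two exit codes above describe different failures, which is useful for triage. A minimal sketch, assuming a hypothetical helper name `curl_exit_hint`; the meanings follow curl's documented exit codes 60 and 77, and nothing here is part of curl or dracut:

```shell
#!/bin/sh
# Hypothetical helper (not part of curl or dracut): map the two curl exit
# codes seen in this report to a short triage hint.
curl_exit_hint() {
    case "$1" in
        60) echo "peer certificate failed verification" ;;
        77) echo "problem reading the CA cert store" ;;
        *)  echo "other curl exit code: $1" ;;
    esac
}

# Example (URL is the placeholder from this report):
#   curl -sS -o /tmp/ks.cfg "https://location/of/ks.cfg"; curl_exit_hint $?
```

Code 60 means the certificate store was usable but the peer was not trusted, while 77 means curl could not get as far as checking trust at all.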
Also asked the customer to try a kickstart stored on local media (usb, dvd). The intent is to bypass the https issue and see whether "fips=1" causes any other problems with the installation.
I don't think this is a CA Cert problem. I am able to use a non-self-signed https source just fine without fips enabled on the cmdline and the CA bundle is present in the initrd. With fips=1 fetching the kickstart in the initrd fails, but works using curl on the cmdline after switch-root. It looks like there is a problem with fips=1 in the initrd.
The fips kernel modules are listed here: https://github.com/dracutdevs/dracut/blob/RHEL-7/modules.d/01fips/module-setup.sh#L19 Would be interesting to see, which kernel modules are triggered by the curl call. A dump of lsmod and /proc/crypto after the curl would be most interesting.
FTR: I can see this on all supported RHEL7 releases (7.1 EUS, 7.2 Z-stream) and latest RHEL7.3 rel-eng compose.
Created attachment 1191509 [details] /proc/crypto
Created attachment 1191510 [details] lsmod output
(In reply to Harald Hoyer from comment #5)
> The fips kernel modules are listed here:
> https://github.com/dracutdevs/dracut/blob/RHEL-7/modules.d/01fips/module-setup.sh#L19
>
> Would be interesting to see, which kernel modules are triggered by the curl
> call.
>
> A dump of lsmod and /proc/crypto after the curl would be most interesting.

Attached.
BTW: A similar issue was already resolved on RHEL6.
/etc/pki/tls/certs/ca-bundle.crt is now a symbolic link... dracut used inst_simple(), which installs the link but does not include the original file it points to. Thanks for the hint!!
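The failure mode described here (a symlink copied into an image without the file it points to) can be illustrated with a small sketch. `copy_with_target` is a hypothetical helper written for this illustration; it is not dracut's inst_simple() and does not claim to match the actual fix:

```shell
#!/bin/sh
# Hypothetical helper, NOT dracut's inst_simple(): copy a path into an
# image root and, if the path is a symlink, also copy the file it resolves to.
copy_with_target() {
    src=$1 root=$2
    mkdir -p "$root$(dirname "$src")"
    cp -a "$src" "$root$src"            # -a keeps a symlink as a symlink
    if [ -L "$src" ]; then
        target=$(readlink -f "$src")    # fully resolved path of the real file
        mkdir -p "$root$(dirname "$target")"
        cp -a "$target" "$root$target"
    fi
}

# Example (the image root path is an assumption):
#   copy_with_target /etc/pki/tls/certs/ca-bundle.crt /tmp/initrd-root
```

Without the second copy, the link inside the image would dangle and anything that opens the CA bundle through it would fail.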
Tested on RHEL-7.3-20160825.1 with dracut-033-458.el7 using "ks=https://mkovarik.fedorapeople.org/test.ks method=http://nap/os/ fips=1 rd.break rd.shell console=ttyS0". Kickstart was not downloaded successfully:

[ 20.361865] dracut-initqueue[564]: % Total % Received % Xferd Average Speed Time Time Time Current
[ 20.368965] dracut-initqueue[564]: Dload Upload Total Spent Left Speed
[ 20.449990] dracut-initqueue[564]: 0 0 0 0 0 0 0 0 --:--:-- --:--:-- --:--:-- 0curl: (77) Problem with the SSL CA cert (path? access rights?)
[ 20.456967] dracut-initqueue[564]: Warning: failed to fetch kickstart from https://mkovarik.fedorapeople.org/test.ks

I was not able to download the kickstart in the dracut shell either:

switch_root:/# curl https://mkovarik.fedorapeople.org/test.ks
curl: (77) Problem with the SSL CA cert (path? access rights?)
switch_root:/# curl -k https://mkovarik.fedorapeople.org/test.ks
curl: (77) Problem with the SSL CA cert (path? access rights?)
Created attachment 1195345 [details] console output with dracut-033-458.el7
Well, I checked the initrd.img... It wasn't built with dracut fips support. So all the *.chk files are missing in the initrd.
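A quick way to repeat this kind of check is to count the *.chk entries in an initrd file listing. A minimal sketch, assuming the listing comes from dracut's lsinitrd tool; `count_chk` is a hypothetical helper name:

```shell
#!/bin/sh
# Hypothetical helper: count entries ending in ".chk" in a file listing
# read from stdin (for example, the output of dracut's lsinitrd).
count_chk() {
    # grep -c prints the count; it exits 1 on zero matches, so mask that.
    grep -c '\.chk$' || true
}

# Example (image path is an assumption):
#   lsinitrd /path/to/initrd.img | count_chk
```

A result of 0 would match the situation described in this comment: an initrd built without the fips checksum files.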
Created attachment 1195446 [details] Incomplete patch for lorax

This patch needs more work: there is no hmac for the kernel included, and dracut wouldn't know where to look for it if there were.
Lorax patch - https://github.com/rhinstaller/lorax/pull/230
Anaconda patch - https://github.com/rhinstaller/anaconda/pull/1139
I'm not sure if you're aware of this, but enforcing FIPS installation has some irritating consequences. Because the /etc/system-fips file is created, it triggers some blocking code paths in libgcrypt.

From what I managed to determine, there's a custom patch for CentOS that checks for the existence of this file (https://git.centos.org/commitdiff/rpms!libgcrypt.git/f268f105748f455df30c79bb1b21bb94a6575f2d;jsessionid=6i5sotd7c7a78z3ypmmozhk5). If it's present, it eventually triggers the getrandom function, which blocks unless the non-blocking random pool is initialized. The problem is that this takes some time on a bare metal boot (in our case mostly ~60 seconds) and never happens on paravirt virtual machines (as they don't have a good mechanism to generate entropy).

So as an effect, this fix breaks provisioning of virtual machines completely and slows down bare metal.

If you consider the delay to be legitimate behavior then, I guess, it would be good to enforce loading of the virtio-rng module for virtual machines.
To make it clear: the problems start once /init (in this case systemd) is executed. If booted with rdinit=bash we get an operational shell, but certain commands can hang in a similar manner (for example ps). Catting /proc/<pid>/stack shows that it actually hangs on getrandom. On a VM, /proc/sys/kernel/random/entropy_avail is always 0, therefore urandom never gets properly initialized. On bare metal it slowly rises, and right after the kernel prints "random: nonblocking pool is initialized", execution proceeds.

From the getrandom man page:
If the urandom source has not yet been initialized, then getrandom() will block, unless GRND_NONBLOCK is specified in flags.
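The entropy check described above can be scripted with the shell's `read` builtin, so that the check itself does not start another program. A minimal sketch; the function name `entropy_at_least` and the optional path argument are assumptions added for illustration and testing:

```shell
#!/bin/sh
# Hypothetical check: succeed only if the kernel reports at least $1 bits
# of entropy. Uses the shell's read builtin instead of running cat.
# The optional second argument overrides the path (added for testing).
entropy_at_least() {
    want=$1
    f=${2:-/proc/sys/kernel/random/entropy_avail}
    read -r bits < "$f" || return 2     # unreadable or empty file
    [ "$bits" -ge "$want" ]
}

# Example: entropy_at_least 128 && echo "pool looks usable"
```

On the VMs described in this comment, such a check would keep failing because the reported value stays at 0.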
(In reply to Łukasz Siudut from comment #36)
> I'm not sure if you're aware of that, but enforcing fips installation has
> some irritating consequences. Because /etc/system-fips file is created it
> triggers some blocking codepaths in libgcrypt.

Is this present on non-fips installations as well? Is it present on http-only fips installations (that were possible even prior to this bugzilla fix)?

> From what I managed to determine there's a custom patch for CentOS that
> checks for existance of this file
> (https://git.centos.org/commitdiff/rpms!libgcrypt.git/f268f105748f455df30c79bb1b21bb94a6575f2d;jsessionid=6i5sotd7c7a78z3ypmmozhk5) .
> If it's present it eventually
> triggers getrandom function which is blocking unless non blocking random
> pool is initialized. The problem is that it takes some time for bare metal
> boot (in our case it's mostly ~60 seconds) and it never happens for paravirt
> virtual machines (as they don't have good mechanism to generate entropy).

That heavily depends on the machine architecture (and thus available syscalls/hypercalls), physical and virtual hardware and kernel version. For example, my server with 5 rotational HDDs initializes the pool within 9 seconds after kernel start, and I have seen x86-based VMs without virtio-rng that do so within 50 seconds. The problem is much less present on POWER with pseries-rng (QEMU based VM or not, no special device needed, it uses special hypercalls) and much worse on e.g. IBM ZSeries, where a z/VM can remain without entropy for days with no clean and supported solution to the problem.

The thing is: if https is specified for the kickstart, you presumably want a secure TLS connection, which is not exactly possible with an insecure PRNG feeding the handshake, and thus the PRNG needs to be sufficiently initialized first. There are different ideas of "secure", e.g. some people consider CPU branching delays to be a secure source of entropy (and IIRC haveged uses that), but the Linux kernel upstream rejected such a patch.

Anaconda already bundles virtio-rng and uses it by default if you provide the rng device on the VM host ("hypervisor") side.

The problem is not limited to FIPS; the same happens with encrypted drives, though anaconda added a "10 minute hack", which is wrong and broken on so many levels, but at least it continues with insecure encryption after 10 minutes, helping user experience.

> So as a effect this fix breaks provisioning of virtual machines completely
> and slows down bare metal.

If you add a virtio rng device on the VM host side, anaconda will automatically use it (on at least x86 and POWER/QEMU).

> If you consider the delay to be legitimate behavior then, I guess, it would
> be good to enforce loading of virtio-rng module for virtual machines.

The delay is legitimate for anything that requires SSL/TLS (which should presumably be kickstart download even in non-fips mode); however, user preferences vary, and it's pretty safe to assume FIPS users would want a secure connection over superior user experience, with non-fips users wanting easier setup over security.

As another example, current SHA512 password hashes in /etc/shadow, generated from anaconda, are potentially insecure because shadow-utils uses /dev/urandom and anaconda doesn't ensure it's properly initialized, but by the time the installation process gets to it (after writing lots of data to drives), there's a good chance it is. This is unfortunately not the case for kickstart download.

Finally, cat-ing entropy_avail may lower the number it shows, because bash spawns a new process to exec() cat, eating entropy (~120 bits).
(In reply to Jiri Jaburek from comment #38)
> (In reply to Łukasz Siudut from comment #36)
> > I'm not sure if you're aware of that, but enforcing fips installation has
> > some irritating consequences. Because /etc/system-fips file is created it
> > triggers some blocking codepaths in libgcrypt.
>
> Is this present on non-fips installations as well?

Non-fips works just fine.

> Is it present on http-only fips installations (that were possible even prior
> to this bugzilla fix)?

Yes, it's http-only. I'll elaborate in one of the following answers.

> That heavily depends on the machine architecture (and thus available
> syscalls/hypercalls), physical and virtual hardware and kernel version. For
> example, my server with 5 rotational HDDs initializes the pool within 9
> seconds after kernel start and I have seen x86-based VMs without virtio-rng
> that do so within 50 seconds. The problem is much less present on POWER with
> pseries-rng (QEMU based VM or not, no special device needed, it uses special
> hypercalls) and much worse on ie. IBM ZSeries where a z/VM can remain
> without entropy for days with no clean and supported solution to the problem.

Yes, this is why I emphasized that for us it takes about a minute. And single-disk setups are pretty common in datacenters. I have no idea what happens when there's storage which isn't supported by the kernel by default (like fio). In that case it won't even get to the point where the module can be loaded.

> The thing is - if https is specified for the kickstart, you presumably want
> secure TLS connection, which is not exactly possible with insecure PRNG
> feeding the handshake and thus the PRNG needs to be sufficiently initialized
> first. There are different ideas of "secure", ie. some people consider CPU
> branching delays to be a secure source of entropy (and IIRC haveged uses
> that), but the Linux kernel upstream rejected such patch.

No, it's not related to any secure operation. Once again - even if we booted to bash and tried to run certain commands (even a simple `ps`), it was hanging.

> Anaconda already bundles virtio-rng and uses it by default if you provide
> the rng device on the VM host ("hypervisor") side.

Yes, it does bundle it, but it's not loading it. I had to hack /init into a custom script which does that. Normally it's a symlink to systemd. I guess what you mean is that systemd loads this module - the problem is that systemd will never execute because it hangs on the getrandom syscall. In the case of virtual machines it never proceeds because of the lack of entropy.

> If you add virtio rng device on the VM host side, anaconda will
> automatically use it (on at least x86 and POWER/QEMU).

Once again, see my previous answer.
I think what it comes down to is that these slowdowns are the price of extra security. Maybe something could be done to load virtio-rng sooner? The kickstart request in question is happening really early, in the initrd before switch root. If that's needed (I haven't looked closely to see if it is or is not) then that needs to happen in dracut.
It has to be done in the init script. The only alternative would be to have virtio-rng compiled in, which sounds like overkill to me.

Anyway, as mentioned, our hacky solution is to replace /init (which is a symlink to systemd) with a small sh script which preloads virtio-rng before exec'ing the proper init binary.

As of now, provisioning VMs with upstream Anaconda is completely broken though.

I'm thinking about filing another bug regarding this issue, unless you want to follow up on this here?
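A rough sketch of what such a wrapper could look like. This is not the reporter's actual script: the `MODPROBE` override is a hypothetical hook added only so the function can be exercised without loading a module, and the systemd path in the comment is an assumption about where the real init lives in the image:

```shell
#!/bin/sh
# Sketch of the workaround described above, NOT the reporter's actual script.
# MODPROBE is a hypothetical override used only to make this testable.
preload_rng() {
    # Ignore failures so a missing module never blocks the boot.
    "${MODPROBE:-modprobe}" virtio-rng 2>/dev/null || true
}

# A wrapper installed as /init would then end with something like:
#   preload_rng
#   exec /usr/lib/systemd/systemd "$@"   # assumed path of the real init
```

The point of the ordering is that the RNG module is loaded before the real init starts, so its first getrandom call has an entropy source to draw on.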
(In reply to Łukasz Siudut from comment #41)
> It has to be done in init script. The only alternative would be to have
> virtio-rng compiled in what sounds like an overkill to me.

Right, dracut is what builds the initrd. Including something in the fips dracut module may make sense.

> Anyway, as mentioned, our hacky solution is to replace /init (which is
> symlink to systemd) with small sh script which is preloading virtio-rng
> before execing to the proper init binary.
>
> As for now provisioning VMs with upstream Anaconda is completely broken
> though.
>
> I'm thinking about cutting you another bug regarding this issue, unless you
> want to follow on this in here?

One bug per problem please :) I think the lorax part of this is working, so if you are having problems with anaconda, file one against it, and/or against dracut for early loading of virtio-rng.
Can do. The only reason why I started discussion here is that the patch that broke things was applied on Lorax. We also patched dracut to fix that :).
https://github.com/systemd/systemd/pull/6665
(In reply to Łukasz Siudut from comment #41)
> It has to be done in init script. The only alternative would be to have
> virtio-rng compiled in what sounds like an overkill to me.

Dracut and anaconda are two separate images; you load kernel and dracut (as initramfs) via a bootloader, dracut then fetches kickstart and from it (or other kernel cmdline options) figures out where to get anaconda image from, downloads it (http) or mounts it (nfs) and switches root into it.

This, kind of by definition, needs the virtio-rng module in dracut, before the kickstart download (if the blocking wait happens because of TLS).

It could be worth having other -rng modules in dracut as well (instead of in anaconda), mainly just pseries-rng which works on bare metal, LPARs and QEMU, but that can be done later.
This is well understood. It doesn't change the fact that the change introduced in Lorax broke the boot process for certain cases. And it remains broken for now. There's an issue on dracut github. https://github.com/dracutdevs/dracut/issues/273
*.chk files for the secure network communication libraries are now correctly added to the initrd by lorax, so dracut with fips=1 on the kernel cmdline is able to use these libraries to download a kickstart over https from a source with signed certificates (e.g. github.com), and the installation starts without problems.
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHBA-2018:0947