Bug 1341280 - Anaconda can not install with 'fips=1' and 'ks=https://kickstart' in the kernel line, the SSL negotiation fails
Status: CLOSED ERRATA
Alias: None
Product: Red Hat Enterprise Linux 7
Classification: Red Hat
Component: lorax   
Version: 7.2
Hardware: x86_64
OS: Linux
Priority: high
Severity: high
Target Milestone: rc
Target Release: ---
Assignee: Brian Lane
QA Contact: Release Test Team
Docs Contact: Petr Bokoc
URL:
Whiteboard:
Keywords:
Depends On:
Blocks: 1256306 1420851 839624 1478303
Reported: 2016-05-31 16:48 UTC by jcastran
Modified: 2018-04-10 17:39 UTC

Fixed In Version: lorax-19.6.95-1
Doc Type: Bug Fix
Doc Text:
FIPS mode now supports loading files over HTTPS during installation

Previously, installation images did not support FIPS mode (`fips=1`) during installation where a Kickstart file is being loaded from an HTTPS source (`inst.ks=https://<location>/ks.cfg`). This release implements support for this previously missing functionality, and loading files over HTTPS in FIPS mode works as expected.
Story Points: ---
Clone Of:
: 1411298 (view as bug list)
Environment:
Last Closed: 2018-04-10 17:38:04 UTC
Type: Bug
Regression: ---
Mount Type: ---
Documentation: ---
CRM:
Verified Versions:
Category: ---
oVirt Team: ---
RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: ---


Attachments
/proc/crypto (7.80 KB, text/plain)
2016-08-17 09:17 UTC, Ondrej Moriš
lsmod output (3.68 KB, text/plain)
2016-08-17 09:17 UTC, Ondrej Moriš
console output with dracut-033-458.el7 (55.61 KB, text/plain)
2016-08-29 12:05 UTC, Michal Kovarik
Incomplete patch for lorax (1.58 KB, patch)
2016-08-29 17:22 UTC, Brian Lane


External Trackers
Tracker ID Priority Status Summary Last Updated
Red Hat Product Errata RHBA-2018:0947 None None None 2018-04-10 17:39 UTC
Red Hat Bugzilla 1259880 None CLOSED Download of kickstart file over https fails 2019-03-13 03:17 UTC

Internal Trackers: 1259880

Description jcastran 2016-05-31 16:48:07 UTC
Description of problem:
 - Installation fails due to the default NSS CA certificate database not being present on the system (Discovered by running 'curl' by hand afterwards)
 - This occurs when enabling 'fips=1' and pointing to a kickstart on a https location
 - The primary curl error is:
      curl: (77) Problem with the SSL CA cert 

Version-Release number of selected component (if applicable):
 - Red Hat Enterprise Linux 7.2

How reproducible:
 - For customer - Every time
 - For me - setting up testing now

Steps to Reproduce:
 1. Host kickstart on custom https location
 2. Boot to installation media
 3. Modify kernel to include "fips=1 ks=https://location/of/ks.cfg"
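
For anyone scripting a check for the affected boot configuration, the following is a minimal sketch, not part of anaconda or dracut. It parses a kernel command line and reports whether it combines fips=1 with an HTTPS kickstart. The function names are hypothetical, and accepting both the `ks=` and `inst.ks=` spellings is an assumption based on the two forms that appear in this report.

```python
import shlex


def parse_cmdline(cmdline):
    """Split a kernel command line into a dict; bare flags map to None."""
    opts = {}
    for token in shlex.split(cmdline):
        key, sep, value = token.partition("=")
        opts[key] = value if sep else None
    return opts


def fips_https_kickstart(cmdline):
    """Return True when fips=1 is combined with a kickstart fetched over HTTPS."""
    opts = parse_cmdline(cmdline)
    # Hypothetical convenience: accept either spelling of the kickstart option.
    ks = opts.get("inst.ks") or opts.get("ks") or ""
    return opts.get("fips") == "1" and ks.startswith("https://")
```

On an installed or rescue system the input could be read from /proc/cmdline.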

Actual results:
 - Dracut times out.
 - Manually running curl provides:
      * Connected to <IP> port 443 (#)
      * Initializing NSS with certpath: none
      * Unable to initialize NSS
      * Closing connection 0
      curl: (77) Problem with the SSL CA cert 

Expected results:
 - Installation succeeds

Additional info:

Comment 1 jcastran 2016-05-31 17:07:09 UTC
When I host the kickstart on an HTTPS share with fips=0, curl responds with:
   curl: (60) Peer's certificate issuer has been marked as not trusted by the user.

I can still read the kickstart with curl -k (--insecure).

When I include fips=1 on the kernel command line, curl responds with:
   curl: (77) Problem with the SSL CA cert (path? access rights?)

Comment 2 jcastran 2016-05-31 17:16:19 UTC
Also asked the customer to try pointing to the kickstart from a local location (USB, DVD).

The intent is to bypass the HTTPS issue and see whether "fips=1" causes any other issues with the installation.

Comment 3 Brian Lane 2016-05-31 18:43:12 UTC
I don't think this is a CA Cert problem. 

I am able to use a non-self-signed https source just fine without fips enabled on the cmdline and the CA bundle is present in the initrd.

With fips=1 fetching the kickstart in the initrd fails, but works using curl on the cmdline after switch-root.

It looks like there is a problem with fips=1 in the initrd.

Comment 5 Harald Hoyer 2016-07-22 10:28:08 UTC
The fips kernel modules are listed here:
https://github.com/dracutdevs/dracut/blob/RHEL-7/modules.d/01fips/module-setup.sh#L19

Would be interesting to see, which kernel modules are triggered by the curl call.

A dump of lsmod and /proc/crypto after the curl would be most interesting.

Comment 6 Miroslav Vadkerti 2016-07-25 16:57:43 UTC
FTR: I can see this on all supported RHEL7 releases (7.1 EUS, 7.2 Z-stream) and latest RHEL7.3 rel-eng compose.

Comment 9 Ondrej Moriš 2016-08-17 09:17 UTC
Created attachment 1191509 [details]
/proc/crypto

Comment 10 Ondrej Moriš 2016-08-17 09:17 UTC
Created attachment 1191510 [details]
lsmod output

Comment 11 Ondrej Moriš 2016-08-17 09:18:46 UTC
(In reply to Harald Hoyer from comment #5)
> The fips kernel modules are listed here:
> https://github.com/dracutdevs/dracut/blob/RHEL-7/modules.d/01fips/module-setup.sh#L19
> 
> Would be interesting to see, which kernel modules are triggered by the curl
> call.
> 
> A dump of lsmod and /proc/crypto after the curl would be most interesting.

Attached.

Comment 12 Ondrej Moriš 2016-08-17 12:50:16 UTC
BTW: A similar issue was already resolved on RHEL 6.

Comment 13 Harald Hoyer 2016-08-17 14:43:51 UTC
/etc/pki/tls/certs/ca-bundle.crt is now a symbolic link... dracut used inst_simple(), which does not include the original file.

Thanks for the hint!!
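
To illustrate the symlink problem described in this comment, here is a rough sketch in Python. It is not dracut's code; the function name and the staging-directory layout are hypothetical. The idea is that installing a path into an image tree has to recreate every link in the chain and also copy the regular file the chain finally resolves to, otherwise the image ends up with a dangling ca-bundle.crt.

```python
import os
import shutil


def install_with_symlink_targets(src, staging_root):
    """Copy src into staging_root, recreating each symlink in the chain
    and finally the regular file that the chain resolves to."""
    installed = []
    path = src
    while True:
        dest = os.path.join(staging_root, path.lstrip("/"))
        os.makedirs(os.path.dirname(dest), exist_ok=True)
        if os.path.islink(path):
            target = os.readlink(path)
            # Recreate the link itself, then follow it to the next hop.
            os.symlink(target, dest)
            installed.append(dest)
            path = os.path.normpath(
                os.path.join(os.path.dirname(path), target))
        else:
            # End of the chain: copy the real file content.
            shutil.copy2(path, dest)
            installed.append(dest)
            return installed
```

A copy that only handles the first path would stop after the first loop iteration, which matches the failure mode described above.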

Comment 18 Michal Kovarik 2016-08-29 12:04:36 UTC
Tested on RHEL-7.3-20160825.1 with dracut-033-458.el7 using "ks=https://mkovarik.fedorapeople.org/test.ks method=http://nap/os/ fips=1 rd.break rd.shell console=ttyS0".

Kickstart was not downloaded successfully:
[   20.361865] dracut-initqueue[564]: % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
[   20.368965] dracut-initqueue[564]: Dload  Upload   Total   Spent    Left  Speed
[   20.449990] dracut-initqueue[564]: 0     0    0     0    0     0      0      0 --:--:-- --:--:-- --:--:--     0curl: (77) Problem with the SSL CA cert (path? access rights?)
[   20.456967] dracut-initqueue[564]: Warning: failed to fetch kickstart from https://mkovarik.fedorapeople.org/test.ks

I was not able to download kickstart in dracut shell:

switch_root:/# curl https://mkovarik.fedorapeople.org/test.ks
curl: (77) Problem with the SSL CA cert (path? access rights?)
switch_root:/# curl -k https://mkovarik.fedorapeople.org/test.ks
curl: (77) Problem with the SSL CA cert (path? access rights?)

Comment 19 Michal Kovarik 2016-08-29 12:05 UTC
Created attachment 1195345 [details]
console output with dracut-033-458.el7

Comment 20 Harald Hoyer 2016-08-29 14:16:53 UTC
Well, I checked the initrd.img... It wasn't built with dracut fips support. So all the *.chk files are missing in the initrd.
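
A quick way to confirm this on an unpacked initrd could look like the sketch below. It is only an illustration under stated assumptions: the function name is hypothetical, the default library names are examples, and the convention that libfoo.so is accompanied by libfoo.chk in the same directory is assumed rather than taken from this report.

```python
import os


def missing_chk_files(root, libs=("libsoftokn3.so", "libfreebl3.so")):
    """Return the libraries found under root that have no sibling .chk file."""
    missing = []
    for dirpath, _dirs, files in os.walk(root):
        for name in files:
            if name in libs:
                # Assumed naming: libfoo.so is paired with libfoo.chk.
                chk = os.path.splitext(name)[0] + ".chk"
                if chk not in files:
                    missing.append(os.path.join(dirpath, name))
    return sorted(missing)
```

An empty result means every listed library that is present also has its checksum file next to it.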

Comment 23 Brian Lane 2016-08-29 17:22 UTC
Created attachment 1195446 [details]
Incomplete patch for lorax

This patch needs more work: there is no HMAC for the kernel included, and dracut wouldn't know where to look for it if there were.

Comment 35 Brian Lane 2017-08-08 16:41:11 UTC
Lorax patch - https://github.com/rhinstaller/lorax/pull/230
Anaconda patch - https://github.com/rhinstaller/anaconda/pull/1139

Comment 36 Łukasz Siudut 2017-08-22 21:26:41 UTC
I'm not sure if you're aware of that, but enforcing fips installation has some irritating consequences. Because /etc/system-fips file is created it triggers some blocking codepaths in libgcrypt.

From what I managed to determine, there's a custom patch for CentOS that checks for the existence of this file (https://git.centos.org/commitdiff/rpms!libgcrypt.git/f268f105748f455df30c79bb1b21bb94a6575f2d;jsessionid=6i5sotd7c7a78z3ypmmozhk5). If it's present, it eventually triggers the getrandom function, which blocks unless the non-blocking random pool is initialized. The problem is that this takes some time on a bare metal boot (in our case it's mostly ~60 seconds) and it never happens on paravirt virtual machines (as they don't have a good mechanism to generate entropy).

So as an effect, this fix breaks provisioning of virtual machines completely and slows down bare metal.

If you consider the delay to be legitimate behavior then, I guess, it would be good to enforce loading of virtio-rng module for virtual machines.

Comment 37 Łukasz Siudut 2017-08-22 22:20:12 UTC
To make it clear: the problems start once /init (in this case systemd) is executed. If booted with rdinit=bash we get an operational shell, but certain commands (for example ps) can hang it in a similar manner.

Catting /proc/<pid>/stack shows that it actually hangs on getrandom. On a VM, /proc/sys/kernel/random/entropy_avail is always 0, therefore urandom never gets properly initialized. On bare metal it slowly rises, and right after the kernel prints out "random: nonblocking pool is initialized" execution proceeds.

From getrandom man page:

If the urandom source has not yet been initialized, then getrandom() will block, unless GRND_NONBLOCK is specified in flags.
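
The behaviour quoted from the man page can be probed without hanging the caller. The following is a minimal sketch, assuming Python 3.6 or later on Linux (where `os.getrandom` and `os.GRND_NONBLOCK` are available); the function name is hypothetical. With GRND_NONBLOCK set, an uninitialized pool produces an error instead of a blocked process.

```python
import os


def urandom_pool_ready():
    """Return True if the urandom pool is initialized, False if a blocking
    getrandom() call would currently hang, None where unsupported."""
    if not hasattr(os, "getrandom"):
        return None
    try:
        # Non-blocking request: raises instead of waiting for entropy.
        os.getrandom(1, os.GRND_NONBLOCK)
    except BlockingIOError:
        return False
    except OSError:
        # For example, a kernel without the getrandom syscall.
        return None
    return True
```

A False result corresponds to the state in which the commands described above hang.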

Comment 38 Jiri Jaburek 2017-08-23 10:11:18 UTC
(In reply to Łukasz Siudut from comment #36)
> I'm not sure if you're aware of that, but enforcing fips installation has
> some irritating consequences. Because /etc/system-fips file is created it
> triggers some blocking codepaths in libgcrypt.

Is this present on non-fips installations as well?

Is it present on http-only fips installations (that were possible even prior to this bugzilla fix)?

> 
> From what I managed to determine there's a custom patch for CentOS that
> checks for existence of this file
> (https://git.centos.org/commitdiff/rpms!libgcrypt.git/f268f105748f455df30c79bb1b21bb94a6575f2d;jsessionid=6i5sotd7c7a78z3ypmmozhk5).
> If it's present it eventually
> triggers getrandom function which is blocking unless non blocking random
> pool is initialized. The problem is that it takes some time for bare metal
> boot (in our case it's mostly ~60 seconds) and it never happens for paravirt
> virtual machines (as they don't have good mechanism to generate entropy).

That heavily depends on the machine architecture (and thus available syscalls/hypercalls), physical and virtual hardware and kernel version. For example, my server with 5 rotational HDDs initializes the pool within 9 seconds after kernel start and I have seen x86-based VMs without virtio-rng that do so within 50 seconds. The problem is much less present on POWER with pseries-rng (QEMU based VM or not, no special device needed, it uses special hypercalls) and much worse on e.g. IBM ZSeries where a z/VM can remain without entropy for days with no clean and supported solution to the problem.

The thing is - if https is specified for the kickstart, you presumably want a secure TLS connection, which is not exactly possible with an insecure PRNG feeding the handshake and thus the PRNG needs to be sufficiently initialized first. There are different ideas of "secure", e.g. some people consider CPU branching delays to be a secure source of entropy (and IIRC haveged uses that), but the Linux kernel upstream rejected such a patch.

Anaconda already bundles virtio-rng and uses it by default if you provide the rng device on the VM host ("hypervisor") side.

The problem is not limited to FIPS, the same happens with encrypted drives, though anaconda added a "10 minute hack", which is wrong and broken on so many levels, but at least it continues with insecure encryption after 10 minutes, helping user experience.

> 
> So as an effect this fix breaks provisioning of virtual machines completely
> and slows down bare metal.

If you add virtio rng device on the VM host side, anaconda will automatically use it (on at least x86 and POWER/QEMU).

> 
> If you consider the delay to be legitimate behavior then, I guess, it would
> be good to enforce loading of virtio-rng module for virtual machines.

The delay is legitimate for anything that requires SSL/TLS (which should presumably include the kickstart download even in non-fips mode). However, user preferences vary, and it's pretty safe to assume FIPS users would want a secure connection over a superior user experience, with non-fips users wanting easier setup over security.

As another example, current SHA512 password hashes in /etc/shadow, generated from anaconda, are potentially insecure because shadow-utils uses /dev/urandom and anaconda doesn't ensure it's properly initialized, but by the time the installation process gets to it (after writing lots of data to drives), there's a good chance it is. This is unfortunately not the case for kickstart download.

Finally, cat-ing entropy_avail may lower the number it shows, because bash spawns a new process to exec() cat, eating entropy (~120 bits).

Comment 39 Łukasz Siudut 2017-08-23 10:19:42 UTC
(In reply to Jiri Jaburek from comment #38)
> (In reply to Łukasz Siudut from comment #36)
> > I'm not sure if you're aware of that, but enforcing fips installation has
> > some irritating consequences. Because /etc/system-fips file is created it
> > triggers some blocking codepaths in libgcrypt.
> 
> Is this present on non-fips installations as well?
> 

Non-fips works just fine.

> Is it present on http-only fips installations (that were possible even prior
> to this bugzilla fix)?
>

Yes, it's http-only. I'll elaborate in one of the following answers.

> That heavily depends on the machine architecture (and thus available
> syscalls/hypercalls), physical and virtual hardware and kernel version. For
> example, my server with 5 rotational HDDs initializes the pool within 9
> seconds after kernel start and I have seen x86-based VMs without virtio-rng
> that do so within 50 seconds. The problem is much less present on POWER with
> pseries-rng (QEMU based VM or not, no special device needed, it uses special
> hypercalls) and much worse on e.g. IBM ZSeries where a z/VM can remain
> without entropy for days with no clean and supported solution to the problem.
> 

Yes, this is why I emphasized that for us it takes about a minute. And single-disk setups are pretty common in datacenters. I have no idea what happens when there's storage which isn't supported by the kernel by default (like fio). In that case it won't even get to the point where the module can be loaded.

> The thing is - if https is specified for the kickstart, you presumably want
> secure TLS connection, which is not exactly possible with insecure PRNG
> feeding the handshake and thus the PRNG needs to be sufficiently initialized
> first. There are different ideas of "secure", e.g. some people consider CPU
> branching delays to be a secure source of entropy (and IIRC haveged uses
> that), but the Linux kernel upstream rejected such a patch.

No, it's not related to any secure operation. Once again - even if we booted to bash and tried to run certain commands (even simple `ps`) it was hanging.

> Anaconda already bundles virtio-rng and uses it by default if you provide
> the rng device on the VM host ("hypervisor") side.
> 

Yes, it does bundle it, but it's not loading it. I had to hack /init to be a custom script that does that. Normally it's a symlink to systemd. I guess what you mean is that systemd loads this module. The problem is that systemd will never execute because it hangs on the getrandom syscall. In the case of virtual machines it never proceeds because of the lack of entropy.

> If you add virtio rng device on the VM host side, anaconda will
> automatically use it (on at least x86 and POWER/QEMU).
> 

Once again, look at my previous answer.

Comment 40 Brian Lane 2017-08-23 15:30:08 UTC
I think what it comes down to is that these slowdowns are the price of extra security. Maybe something could be done to load virtio-rng sooner? The kickstart request in question is happening really early, in the initrd before switch root.

If that's needed (I haven't looked closely to see if it is or is not) then that needs to happen in dracut.

Comment 41 Łukasz Siudut 2017-08-23 15:36:43 UTC
It has to be done in the init script. The only alternative would be to have virtio-rng compiled in, which sounds like overkill to me.

Anyway, as mentioned, our hacky solution is to replace /init (which is a symlink to systemd) with a small sh script that preloads virtio-rng before exec-ing the proper init binary.

As of now, though, provisioning VMs with upstream Anaconda is completely broken.

I'm thinking about filing another bug for this issue, unless you want to follow up on it here?

Comment 42 Brian Lane 2017-08-23 15:45:06 UTC
(In reply to Łukasz Siudut from comment #41)
> It has to be done in the init script. The only alternative would be to have
> virtio-rng compiled in, which sounds like overkill to me.

Right, dracut is what builds the initrd. Including something in the fips dracut module may make sense.

> 
> Anyway, as mentioned, our hacky solution is to replace /init (which is a
> symlink to systemd) with a small sh script that preloads virtio-rng
> before exec-ing the proper init binary.
> 
> As of now, though, provisioning VMs with upstream Anaconda is completely
> broken.
> 
> I'm thinking about filing another bug for this issue, unless you want to
> follow up on it here?

One bug per problem please :) I think the lorax part of this is working, so if you are having problems with anaconda, file one against it, and/or against dracut for early loading of virtio-rng.

Comment 43 Łukasz Siudut 2017-08-23 15:49:41 UTC
Can do. The only reason why I started discussion here is that the patch that broke things was applied on Lorax. We also patched dracut to fix that :).

Comment 44 Harald Hoyer 2017-08-24 08:07:17 UTC
https://github.com/systemd/systemd/pull/6665

Comment 45 Jiri Jaburek 2017-08-24 15:48:22 UTC
(In reply to Łukasz Siudut from comment #41)
> It has to be done in the init script. The only alternative would be to have
> virtio-rng compiled in, which sounds like overkill to me.

Dracut and anaconda are two separate images; you load kernel and dracut (as initramfs) via a bootloader, dracut then fetches kickstart and from it (or other kernel cmdline options) figures out where to get anaconda image from, downloads it (http) or mounts it (nfs) and switches root into it.

This, kind of by definition, needs the virtio-rng module in dracut, before the kickstart download (if the blocking wait happens because of TLS).

It could be worth having other -rng modules in dracut as well (instead of in anaconda), mainly just pseries-rng which works on bare metal, LPARs and QEMU, but that can be done later.

Comment 46 Łukasz Siudut 2017-08-24 16:09:43 UTC
This is well understood. It doesn't change the fact that the change introduced in Lorax broke the boot process for certain cases. And it remains broken for now.

There's an issue on dracut github.

https://github.com/dracutdevs/dracut/issues/273

Comment 51 Marek Hruscak 2018-01-03 20:54:00 UTC
The *.chk files for the secure network communication libraries are now correctly added to the initrd by lorax. With fips=1 on the kernel cmdline, Dracut is able to use these libraries to download the kickstart over HTTPS from a source with signed certificates (e.g. github.com), and the installation starts without problems.

Comment 54 errata-xmlrpc 2018-04-10 17:38:04 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2018:0947

