Description of problem:
IPI on OSP failed, and the crio service on 2/3 masters failed to serve:

[root@wj46uos803z-g5vjh-master-1 core]# systemctl status crio
Warning: The unit file, source configuration file or drop-ins of crio.service changed on disk. Run 'systemctl daemon-reload' to reload units.
● crio.service - Open Container Initiative Daemon
   Loaded: loaded (/usr/lib/systemd/system/crio.service; disabled; vendor preset: disabled)
  Drop-In: /etc/systemd/system/crio.service.d
           └─10-mco-default-env.conf
   Active: inactive (dead)
     Docs: https://github.com/cri-o/cri-o
Aug 03 08:18:26 wj46uos803z-g5vjh-master-1 systemd[1]: Dependency failed for Open Container Initiative Daemon.
Aug 03 08:18:26 wj46uos803z-g5vjh-master-1 systemd[1]: crio.service: Job crio.service/start failed with result 'dependency'.

Version-Release number of the following components:
registry.svc.ci.openshift.org/ocp/release:4.6.0-0.nightly-2020-08-03-025909
Client Version: 4.6.0-202008011451.p0-9cc7d77
./openshift-install 4.6.0-0.nightly-2020-08-03-025909
built from commit d75bf7ad98124b901ae7e22b5595e0392ed6ea3c
release image registry.svc.ci.openshift.org/ocp/release@sha256:9c975449f07cfd9b8696763363cfe636dfb449f80315bb99d7544f0de2f3b946

How reproducible:
Not sure

Steps to Reproduce:
1. Try to install an OSP cluster
2.
3.

Actual results:
crio is not serving on 2/3 masters

Expected results:
crio should work

Additional info:
Created attachment 1703262 [details] must-gather logs
From comment #3 (does it need to be private?), `machine-config-daemon-firstboot.service` believes the cluster is running and refuses to enable FIPS mode on a running cluster; however, this scenario appears to be a fresh install, per comment #1.

Aug 04 04:37:20 ip-10-0-62-21 machine-config-daemon[2033]: error: can't reconcile config mco-empty-mc with rendered-worker-093a7f46508e47432153a61514b404c5: detected change to FIPS flag. Refusing to modify FIPS on a running cluster: unreconcilable

Additionally, `crio.service` is failing to start because of a failed dependency, though the only dependency that could have failed is `crio-wipe.service`. That unit shows it exited successfully, though there were these INFO-level messages in its logs:

Aug 04 04:37:23 ip-10-0-62-21 crio[2065]: time="2020-08-04 04:37:23.645256947Z" level=info msg="Version file /var/run/crio/version not found: open /var/run/crio/version: no such file or dir>
Aug 04 04:37:23 ip-10-0-62-21 crio[2065]: time="2020-08-04 04:37:23.645359644Z" level=info msg="Version file /var/lib/crio/version not found: open /var/lib/crio/version: no such file or dir>

I'm going to ask Sinny and Jonathan to have a look here, as there may be an issue in the MCO re: FIPS mode.
Hmm right...this may be a regression from https://github.com/openshift/machine-config-operator/pull/1766 Basically firstboot shouldn't try to change FIPS - we may need to gather the FIPS state from the running system when synthesizing the empty MC.
Sending over to MCO for additional triage based on comment #5
So I installed a default AWS 4.6 cluster with FIPS enabled. The master nodes booted fine with FIPS, but both the worker nodes and the bootstrap node were broken: the bootstrap/worker nodes had `fips_enabled` 0 (/proc/sys/crypto/fips_enabled), whereas in 4.5 I can see that everything correctly had `fips_enabled` 1.

On a working master, this is what the logs look like:

Aug 12 19:15:23 ip-10-0-139-165 machine-config-daemon[2171]: I0812 19:15:23.785662    2171 daemon.go:229] Installed Ignition binary version: 2.5.0
Aug 12 19:15:23 ip-10-0-139-165 machine-config-daemon[2171]: I0812 19:15:23.786917    2171 update.go:382] Checking Reconcilable for config mco-empty-mc to rendered-master-2feb73a27e640a176af7258c0b970a4c
Aug 12 19:15:23 ip-10-0-139-165 machine-config-daemon[2171]: I0812 19:15:23.787283    2171 update.go:691] FIPS is configured and enabled
Aug 12 19:15:23 ip-10-0-139-165 machine-config-daemon[2171]: I0812 19:15:23.787535    2171 update.go:1497] Starting update from mco-empty-mc to rendered-master-2feb73a27e640a176af7258c0b970a4c: &{osUpdate:true kargs:false fips:false passwd:false files:false units:false kernelType:false extensions:false}

So FIPS reconciles fine there. In Weiwei's case it seems that the node did not properly set FIPS in the first place. The reconcile from the empty MC is fine, since we check the node's actual FIPS state against the desired config, not the MC. That behaviour did not change with https://github.com/openshift/machine-config-operator/pull/1766/files

On a worker node I can see this:

journalctl -u rhcos-fips.service
-- Logs begin at Wed 2020-08-12 19:27:18 UTC, end at Wed 2020-08-12 19:51:17 UTC. --
Aug 12 19:27:22 ip-10-0-187-160 rhcos-fips[720]: No Ignition config provided.
Aug 12 19:27:22 ip-10-0-187-160 systemd[1]: Starting Check for FIPS mode...
Aug 12 19:27:22 ip-10-0-187-160 systemd[1]: Started Check for FIPS mode.
Aug 12 19:27:30 ip-10-0-187-160 systemd[1]: Stopped Check for FIPS mode.
journalctl -q MESSAGE_ID=57124006b5c94805b77ce473e92a8aeb IGNITION_CONFIG_TYPE=user
Aug 12 19:27:22 ip-10-0-187-160 ignition[671]: fetched user config from "aws"
Aug 12 19:27:22 ip-10-0-187-160 ignition[671]: fetched referenced user config from "/config/worker"

And on the masters I see:

sh-4.4# journalctl -u rhcos-fips.service
-- Logs begin at Wed 2020-08-12 19:14:52 UTC, end at Wed 2020-08-12 19:54:22 UTC. --
Aug 12 19:14:58 ip-10-0-139-165 rhcos-fips[961]: FIPS mode is enabled.
Aug 12 19:14:58 ip-10-0-139-165 systemd[1]: Starting Check for FIPS mode...
Aug 12 19:14:58 ip-10-0-139-165 systemd[1]: Started Check for FIPS mode.
Aug 12 19:15:02 ip-10-0-139-165 systemd[1]: Stopped Check for FIPS mode.
sh-4.4# journalctl -q MESSAGE_ID=57124006b5c94805b77ce473e92a8aeb IGNITION_CONFIG_TYPE=user
Aug 12 19:14:58 ip-10-0-139-165 ignition[912]: fetched user config from "aws"
Aug 12 19:14:58 ip-10-0-139-165 ignition[912]: fetched referenced user config from "/config/master"

Looking at the timestamps, I'm guessing there's a race between the message showing up in the journal and rhcos-fips.service querying it? Passing back to RHCOS to take a look at that.
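For reference, the "FIPS is configured and enabled" check and the 0/1 values quoted above both boil down to reading /proc/sys/crypto/fips_enabled on the node. A minimal standalone sketch of that kernel-state check (hypothetical illustration, not the MCD's actual code):

```shell
#!/bin/bash
# Report whether the running kernel is in FIPS mode, the same state
# the MCD compares against the desired config during reconcile.
fips_enabled() {
    local f=/proc/sys/crypto/fips_enabled
    if [ -r "$f" ]; then
        cat "$f"
    else
        # Sysctl absent: kernel built without FIPS support.
        echo 0
    fi
}

if [ "$(fips_enabled)" = "1" ]; then
    echo "FIPS is configured and enabled"
else
    echo "FIPS is not enabled"
fi
```

On the broken 4.6 bootstrap/worker nodes above this would print "FIPS is not enabled" even though the install-config requested FIPS, which is exactly the mismatch the firstboot reconcile refuses to repair.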
This being the logs in question: https://gitlab.cee.redhat.com/coreos/redhat-coreos/-/blob/master/overlay.d/05rhcos/usr/lib/dracut/modules.d/40rhcos-fips/rhcos-fips.sh#L24
> Looking at the timestamps I'm guessing theres a race between the message showing up in the journal and rhcos-fips.service querying it? That's plausible. Nice debugging on this, we'll look.
> Looking at the timestamps I'm guessing theres a race between the message showing up in the journal and rhcos-fips.service querying it? Hmm, that's odd. `rhcos-fips.service` does have: ``` # we need the fetched Ignition config, but before disks start After=ignition-fetch.service Before=ignition-disks.service ``` So we should be running after the `fetch` stage (which emits these messages). I'm not sure what guarantees (or lack thereof) the journal offers us here wrt querying right after. This might also be related to interactions with fetch-offline. Digging.
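If the fetch-stage journal entry can land after rhcos-fips.service has already run its single query (the race hypothesized above), one generic mitigation is to poll for the entry with a deadline instead of querying once. A minimal sketch of that pattern, using a stand-in command in place of the real journalctl query (this is only an illustration, not what rhcos-fips.sh does today):

```shell
#!/bin/bash
# Poll a command until it produces output or a timeout expires.
# Usage: wait_for_output <timeout_seconds> <command...>
wait_for_output() {
    local timeout=$1; shift
    local deadline=$((SECONDS + timeout))
    local out
    while (( SECONDS < deadline )); do
        out=$("$@" 2>/dev/null)
        if [ -n "$out" ]; then
            echo "$out"
            return 0
        fi
        sleep 1
    done
    return 1
}

# In the real unit this would wrap something like:
#   journalctl -q MESSAGE_ID=57124006b5c94805b77ce473e92a8aeb IGNITION_CONFIG_TYPE=user
# Here we demonstrate with a stand-in command that succeeds immediately.
wait_for_output 5 echo 'fetched user config from "aws"'
```

The proper fix may instead be a stronger systemd ordering guarantee, but a bounded retry like this is a common way to tolerate journal-visibility lag.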
*** Bug 1869654 has been marked as a duplicate of this bug. ***
As an aside, I'm adding a test to check for this situation in the future (Machines created but Nodes never join) https://github.com/openshift/origin/pull/25419
With payload 4.6.0-0.nightly-2020-08-18-165040, the bug reproduced with IPI on AWS, but did not reproduce with IPI on OSP13.

For ipi-aws, all worker machines are stuck in the Provisioned PHASE:

$ oc -n openshift-machine-api get machine
NAME                                        PHASE         TYPE        REGION      ZONE         AGE
xiyuan08203-6cjp5-master-0                  Running       m5.xlarge   us-east-2   us-east-2a   50m
xiyuan08203-6cjp5-master-1                  Running       m5.xlarge   us-east-2   us-east-2b   50m
xiyuan08203-6cjp5-master-2                  Running       m5.xlarge   us-east-2   us-east-2c   50m
xiyuan08203-6cjp5-worker-us-east-2a-kr5vn   Provisioned   m5.large    us-east-2   us-east-2a   35m
xiyuan08203-6cjp5-worker-us-east-2b-gzzgn   Provisioned   m5.large    us-east-2   us-east-2b   35m
xiyuan08203-6cjp5-worker-us-east-2c-2gzdr   Provisioned   m5.large    us-east-2   us-east-2c   35m

Logging in to one of the worker machines, I can still see the same error:

[core@ip-10-0-156-117 ~]$ journalctl -f -u machine-config-daemon-firstboot.service
-- Logs begin at Thu 2020-08-20 03:17:48 UTC. --
Aug 20 03:18:30 ip-10-0-156-117 systemd[1]: Starting Machine Config Daemon Firstboot...
Aug 20 03:18:31 ip-10-0-156-117 machine-config-daemon[1849]: I0820 03:18:31.041477    1849 rpm-ostree.go:261] Running captured: rpm-ostree status --json
Aug 20 03:18:32 ip-10-0-156-117 machine-config-daemon[1849]: I0820 03:18:32.179415    1849 daemon.go:222] Booted osImageURL:  (46.82.202008111140-0)
Aug 20 03:18:33 ip-10-0-156-117 machine-config-daemon[1849]: I0820 03:18:33.327299    1849 daemon.go:229] Installed Ignition binary version: 2.6.0
Aug 20 03:18:33 ip-10-0-156-117 machine-config-daemon[1849]: I0820 03:18:33.329602    1849 update.go:403] Checking Reconcilable for config mco-empty-mc to rendered-worker-c95ae04c95978d75475fcb179a7bb3a9
Aug 20 03:18:33 ip-10-0-156-117 machine-config-daemon[1849]: error: can't reconcile config mco-empty-mc with rendered-worker-c95ae04c95978d75475fcb179a7bb3a9: detected change to FIPS flag; refusing to modify FIPS on a running cluster: unreconcilable
Aug 20 03:18:33 ip-10-0-156-117 systemd[1]: machine-config-daemon-firstboot.service: Main process exited, code=exited, status=1/FAILURE
Aug 20 03:18:33 ip-10-0-156-117 systemd[1]: machine-config-daemon-firstboot.service: Failed with result 'exit-code'.
Aug 20 03:18:33 ip-10-0-156-117 systemd[1]: Failed to start Machine Config Daemon Firstboot.
Aug 20 03:18:33 ip-10-0-156-117 systemd[1]: machine-config-daemon-firstboot.service: Consumed 89ms CPU time
> with payload 4.6.0-0.nightly-2020-08-18-165040, the bug reproduced the bug with IPI on AWS, but didn't reproduce with IPI on OSP13. > ... > Aug 20 03:18:32 ip-10-0-156-117 machine-config-daemon[1849]: I0820 03:18:32.179415 1849 daemon.go:222] Booted osImageURL: (46.82.202008111140-0) Yeah, this needs the bootimage bump at https://github.com/openshift/installer/pull/4066.
Move status to Verified due to https://bugzilla.redhat.com/show_bug.cgi?id=1862957#c21
*** Bug 1870394 has been marked as a duplicate of this bug. ***
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory (OpenShift Container Platform 4.6 GA Images), and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHBA-2020:4196