Description of problem:
IPI on OSP failed, and the crio service on 2/3 masters failed to serve:

[root@wj46uos803z-g5vjh-master-1 core]# systemctl status crio
Warning: The unit file, source configuration file or drop-ins of crio.service changed on disk. Run 'systemctl daemon-reload' to reload units.
● crio.service - Open Container Initiative Daemon
   Loaded: loaded (/usr/lib/systemd/system/crio.service; disabled; vendor preset: disabled)
  Drop-In: /etc/systemd/system/crio.service.d
           └─10-mco-default-env.conf
   Active: inactive (dead)
     Docs: https://github.com/cri-o/cri-o
Aug 03 08:18:26 wj46uos803z-g5vjh-master-1 systemd[1]: Dependency failed for Open Container Initiative Daemon.
Aug 03 08:18:26 wj46uos803z-g5vjh-master-1 systemd[1]: crio.service: Job crio.service/start failed with result 'dependency'.

Version-Release number of the following components:
registry.svc.ci.openshift.org/ocp/release:4.6.0-0.nightly-2020-08-03-025909
Client Version: 4.6.0-202008011451.p0-9cc7d77
./openshift-install 4.6.0-0.nightly-2020-08-03-025909
built from commit d75bf7ad98124b901ae7e22b5595e0392ed6ea3c
release image registry.svc.ci.openshift.org/ocp/release@sha256:9c975449f07cfd9b8696763363cfe636dfb449f80315bb99d7544f0de2f3b946

How reproducible:
Not sure

Steps to Reproduce:
1. Try to install an OSP cluster
2.
3.

Actual results:
crio is not serving on 2/3 masters

Expected results:
crio should work

Additional info:
Created attachment 1703262 [details] must-gather logs
From comment #3 (does it need to be private?), `machine-config-daemon-firstboot.service` believes the cluster is running and refuses to enable FIPS mode on a running cluster; however, this scenario appears to be a fresh install, per comment #1.

Aug 04 04:37:20 ip-10-0-62-21 machine-config-daemon[2033]: error: can't reconcile config mco-empty-mc with rendered-worker-093a7f46508e47432153a61514b404c5: detected change to FIPS flag. Refusing to modify FIPS on a running cluster: unreconcilable

Additionally, `crio.service` is failing to start because of a failed dependency, though the only dependency that could have failed is `crio-wipe.service`. That unit shows it exited successfully, though there were these INFO-level messages in its logs:

Aug 04 04:37:23 ip-10-0-62-21 crio[2065]: time="2020-08-04 04:37:23.645256947Z" level=info msg="Version file /var/run/crio/version not found: open /var/run/crio/version: no such file or dir>
Aug 04 04:37:23 ip-10-0-62-21 crio[2065]: time="2020-08-04 04:37:23.645359644Z" level=info msg="Version file /var/lib/crio/version not found: open /var/lib/crio/version: no such file or dir>

I'm going to ask Sinny and Jonathan to have a look here, as there may be an issue in the MCO re: FIPS mode.
Hmm right...this may be a regression from https://github.com/openshift/machine-config-operator/pull/1766 Basically firstboot shouldn't try to change FIPS - we may need to gather the FIPS state from the running system when synthesizing the empty MC.
Sending over to MCO for additional triage based on comment #5
So I installed a default AWS 4.6 cluster with FIPS enabled. The master nodes booted fine with FIPS, but both the worker nodes and the bootstrap node were broken: the bootstrap/worker nodes had `fips_enabled` 0 (/proc/sys/crypto/fips_enabled), whereas in 4.5 I can see that everything correctly had `fips_enabled` 1.

On a working master, this is what the logs look like:

Aug 12 19:15:23 ip-10-0-139-165 machine-config-daemon[2171]: I0812 19:15:23.785662    2171 daemon.go:229] Installed Ignition binary version: 2.5.0
Aug 12 19:15:23 ip-10-0-139-165 machine-config-daemon[2171]: I0812 19:15:23.786917    2171 update.go:382] Checking Reconcilable for config mco-empty-mc to rendered-master-2feb73a27e640a176af7258c0b970a4c
Aug 12 19:15:23 ip-10-0-139-165 machine-config-daemon[2171]: I0812 19:15:23.787283    2171 update.go:691] FIPS is configured and enabled
Aug 12 19:15:23 ip-10-0-139-165 machine-config-daemon[2171]: I0812 19:15:23.787535    2171 update.go:1497] Starting update from mco-empty-mc to rendered-master-2feb73a27e640a176af7258c0b970a4c: &{osUpdate:true kargs:false fips:false passwd:false files:false units:false kernelType:false extensions:false}

So FIPS reconciles fine there. In Weiwei's case it seems that the node did not properly set FIPS in the first place. The reconcile from the empty MC is fine, since we check the node's actual FIPS state against the desired config, not the MC. That behaviour did not change with https://github.com/openshift/machine-config-operator/pull/1766/files

On a worker node I can see this:

journalctl -u rhcos-fips.service
-- Logs begin at Wed 2020-08-12 19:27:18 UTC, end at Wed 2020-08-12 19:51:17 UTC. --
Aug 12 19:27:22 ip-10-0-187-160 rhcos-fips[720]: No Ignition config provided.
Aug 12 19:27:22 ip-10-0-187-160 systemd[1]: Starting Check for FIPS mode...
Aug 12 19:27:22 ip-10-0-187-160 systemd[1]: Started Check for FIPS mode.
Aug 12 19:27:30 ip-10-0-187-160 systemd[1]: Stopped Check for FIPS mode.
journalctl -q MESSAGE_ID=57124006b5c94805b77ce473e92a8aeb IGNITION_CONFIG_TYPE=user
Aug 12 19:27:22 ip-10-0-187-160 ignition[671]: fetched user config from "aws"
Aug 12 19:27:22 ip-10-0-187-160 ignition[671]: fetched referenced user config from "/config/worker"

And on the masters I see:

sh-4.4# journalctl -u rhcos-fips.service
-- Logs begin at Wed 2020-08-12 19:14:52 UTC, end at Wed 2020-08-12 19:54:22 UTC. --
Aug 12 19:14:58 ip-10-0-139-165 rhcos-fips[961]: FIPS mode is enabled.
Aug 12 19:14:58 ip-10-0-139-165 systemd[1]: Starting Check for FIPS mode...
Aug 12 19:14:58 ip-10-0-139-165 systemd[1]: Started Check for FIPS mode.
Aug 12 19:15:02 ip-10-0-139-165 systemd[1]: Stopped Check for FIPS mode.
sh-4.4# journalctl -q MESSAGE_ID=57124006b5c94805b77ce473e92a8aeb IGNITION_CONFIG_TYPE=user
Aug 12 19:14:58 ip-10-0-139-165 ignition[912]: fetched user config from "aws"
Aug 12 19:14:58 ip-10-0-139-165 ignition[912]: fetched referenced user config from "/config/master"

Looking at the timestamps, I'm guessing there's a race between the message showing up in the journal and rhcos-fips.service querying it? Passing back to RHCOS to take a look at that.
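For reference, the "FIPS is configured and enabled" check and the 0/1 values quoted above both boil down to reading /proc/sys/crypto/fips_enabled on the node. A minimal standalone sketch of that kernel-state check (hypothetical illustration, not the MCD's actual code):

```shell
#!/bin/bash
# Report whether the running kernel is in FIPS mode, the same state
# the MCD compares against the desired config during reconcile.
fips_enabled() {
    local f=/proc/sys/crypto/fips_enabled
    if [ -r "$f" ]; then
        cat "$f"
    else
        # Sysctl absent: kernel built without FIPS support.
        echo 0
    fi
}

if [ "$(fips_enabled)" = "1" ]; then
    echo "FIPS is configured and enabled"
else
    echo "FIPS is not enabled"
fi
```

On the broken 4.6 bootstrap/worker nodes above this would print "FIPS is not enabled" even though the install-config requested FIPS, which is exactly the mismatch the firstboot reconcile refuses to repair.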
This being the logs in question: https://gitlab.cee.redhat.com/coreos/redhat-coreos/-/blob/master/overlay.d/05rhcos/usr/lib/dracut/modules.d/40rhcos-fips/rhcos-fips.sh#L24
> Looking at the timestamps I'm guessing theres a race between the message showing up in the journal and rhcos-fips.service querying it? That's plausible. Nice debugging on this, we'll look.
> Looking at the timestamps I'm guessing theres a race between the message showing up in the journal and rhcos-fips.service querying it? Hmm, that's odd. `rhcos-fips.service` does have: ``` # we need the fetched Ignition config, but before disks start After=ignition-fetch.service Before=ignition-disks.service ``` So we should be running after the `fetch` stage (which emits these messages). I'm not sure what guarantees (or lack thereof) the journal offers us here wrt querying right after. This might also be related to interactions with fetch-offline. Digging.
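If the fetch-stage journal entry can land after rhcos-fips.service has already run its single query (the race hypothesized above), one generic mitigation is to poll for the entry with a deadline instead of querying once. A minimal sketch of that pattern, using a stand-in command in place of the real journalctl query (this is only an illustration, not what rhcos-fips.sh does today):

```shell
#!/bin/bash
# Poll a command until it produces output or a timeout expires.
# Usage: wait_for_output <timeout_seconds> <command...>
wait_for_output() {
    local timeout=$1; shift
    local deadline=$((SECONDS + timeout))
    local out
    while (( SECONDS < deadline )); do
        out=$("$@" 2>/dev/null)
        if [ -n "$out" ]; then
            echo "$out"
            return 0
        fi
        sleep 1
    done
    return 1
}

# In the real unit this would wrap something like:
#   journalctl -q MESSAGE_ID=57124006b5c94805b77ce473e92a8aeb IGNITION_CONFIG_TYPE=user
# Here we demonstrate with a stand-in command that succeeds immediately.
wait_for_output 5 echo 'fetched user config from "aws"'
```

The proper fix may instead be a stronger systemd ordering guarantee, but a bounded retry like this is a common way to tolerate journal-visibility lag.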
*** Bug 1869654 has been marked as a duplicate of this bug. ***
As an aside, I'm adding a test to check for this situation in the future (Machines created but Nodes never join) https://github.com/openshift/origin/pull/25419
With payload 4.6.0-0.nightly-2020-08-18-165040, the bug reproduced with IPI on AWS, but did not reproduce with IPI on OSP13.

For ipi-aws, all worker machines are stuck in the Provisioned PHASE:

$ oc -n openshift-machine-api get machine
NAME                                        PHASE         TYPE        REGION      ZONE         AGE
xiyuan08203-6cjp5-master-0                  Running       m5.xlarge   us-east-2   us-east-2a   50m
xiyuan08203-6cjp5-master-1                  Running       m5.xlarge   us-east-2   us-east-2b   50m
xiyuan08203-6cjp5-master-2                  Running       m5.xlarge   us-east-2   us-east-2c   50m
xiyuan08203-6cjp5-worker-us-east-2a-kr5vn   Provisioned   m5.large    us-east-2   us-east-2a   35m
xiyuan08203-6cjp5-worker-us-east-2b-gzzgn   Provisioned   m5.large    us-east-2   us-east-2b   35m
xiyuan08203-6cjp5-worker-us-east-2c-2gzdr   Provisioned   m5.large    us-east-2   us-east-2c   35m

Logging in to one of the worker machines, I can still see the same error:

[core@ip-10-0-156-117 ~]$ journalctl -f -u machine-config-daemon-firstboot.service
-- Logs begin at Thu 2020-08-20 03:17:48 UTC. --
Aug 20 03:18:30 ip-10-0-156-117 systemd[1]: Starting Machine Config Daemon Firstboot...
Aug 20 03:18:31 ip-10-0-156-117 machine-config-daemon[1849]: I0820 03:18:31.041477    1849 rpm-ostree.go:261] Running captured: rpm-ostree status --json
Aug 20 03:18:32 ip-10-0-156-117 machine-config-daemon[1849]: I0820 03:18:32.179415    1849 daemon.go:222] Booted osImageURL:  (46.82.202008111140-0)
Aug 20 03:18:33 ip-10-0-156-117 machine-config-daemon[1849]: I0820 03:18:33.327299    1849 daemon.go:229] Installed Ignition binary version: 2.6.0
Aug 20 03:18:33 ip-10-0-156-117 machine-config-daemon[1849]: I0820 03:18:33.329602    1849 update.go:403] Checking Reconcilable for config mco-empty-mc to rendered-worker-c95ae04c95978d75475fcb179a7bb3a9
Aug 20 03:18:33 ip-10-0-156-117 machine-config-daemon[1849]: error: can't reconcile config mco-empty-mc with rendered-worker-c95ae04c95978d75475fcb179a7bb3a9: detected change to FIPS flag; refusing to modify FIPS on a running cluster: unreconcilable
Aug 20 03:18:33 ip-10-0-156-117 systemd[1]: machine-config-daemon-firstboot.service: Main process exited, code=exited, status=1/FAILURE
Aug 20 03:18:33 ip-10-0-156-117 systemd[1]: machine-config-daemon-firstboot.service: Failed with result 'exit-code'.
Aug 20 03:18:33 ip-10-0-156-117 systemd[1]: Failed to start Machine Config Daemon Firstboot.
Aug 20 03:18:33 ip-10-0-156-117 systemd[1]: machine-config-daemon-firstboot.service: Consumed 89ms CPU time
> with payload 4.6.0-0.nightly-2020-08-18-165040, the bug reproduced the bug with IPI on AWS, but didn't reproduce with IPI on OSP13. > ... > Aug 20 03:18:32 ip-10-0-156-117 machine-config-daemon[1849]: I0820 03:18:32.179415 1849 daemon.go:222] Booted osImageURL: (46.82.202008111140-0) Yeah, this needs the bootimage bump at https://github.com/openshift/installer/pull/4066.
Move status to Verified due to https://bugzilla.redhat.com/show_bug.cgi?id=1862957#c21
*** Bug 1870394 has been marked as a duplicate of this bug. ***
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory (OpenShift Container Platform 4.6 GA Images), and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHBA-2020:4196