https://prow.ci.openshift.org/?job=*pull-ci-openshift-origin-master-e2e-aws-fips*

This e2e test has been failing consistently for the better part of a week (probably longer).

The logs show that the RHCOS node is booting in FIPS mode successfully.

```
Jul 27 18:00:18.100371 localhost kernel: Kernel command line: BOOT_IMAGE=(hd0,gpt1)/ostree/rhcos-f33933d95ca511a7136eebb678b9b7136a691b9c20baf644bd84b35260fa773b/vmlinuz-4.18.0-211.el8.x86_64 ... fips=1 boot=LABEL=boot
Jul 27 18:00:18.100384 localhost kernel: fips mode: enabled
...
Jul 27 18:00:35.480860 ip-10-0-128-41 sshd[1913]: FIPS mode initialized
```

It was pointed out that the SDN container doesn't seem to be getting scheduled properly:

`"message": "DaemonSet \"openshift-multus/network-metrics-daemon\" is waiting for other operators to become ready\nDaemonSet \"openshift-sdn/sdn-metrics\" is not yet scheduled on any nodes",`
A sampling of failures from the last day or so shows the following tests failing pretty consistently on this job:

[sig-apps] DisruptionController should block an eviction until the PDB is updated to allow it [Suite:openshift/conformance/parallel] [Suite:k8s]
[sig-instrumentation] Prometheus when installed on the cluster should provide named network metrics [Suite:openshift/conformance/parallel]

Searching CI logs across OpenShift doesn't show any failures of these tests in the last 8 days, though.
While it's possible this is (core) RHCOS, I doubt it; we seem to be entering FIPS mode OK. It seems most likely to be something in crio/kubelet or SDN, but I'm unsure right now. Trying to get a live reproducer environment to debug.
At least:

  DisruptionController should block an eviction until the PDB is updated to allow it

does not seem to be FIPS-specific. I bumped into it on a non-FIPS PR preflight [1], and CI search suggests it is fairly widespread:

  $ w3m -dump -cols 200 'https://search.ci.openshift.org/?search=DisruptionController%20should%20block%20an%20eviction%20until%20the%20PDB%20is%20updated%20to%20allow%20it' | grep 'failures match' | sort
  promote-release-openshift-machine-os-content-e2e-aws-4.6 - 157 runs, 100% failed, 2% of failures match
  promote-release-openshift-okd-machine-os-content-e2e-aws-4.6 - 23 runs, 100% failed, 4% of failures match
  pull-ci-cri-o-cri-o-master-e2e-aws - 54 runs, 74% failed, 45% of failures match
  pull-ci-openshift-cluster-api-provider-aws-master-e2e-aws - 4 runs, 75% failed, 33% of failures match
  ...
  pull-ci-operator-framework-operator-registry-master-e2e-aws - 8 runs, 63% failed, 80% of failures match
  rehearse-10454-pull-ci-openshift-cloud-credential-operator-master-e2e-azure - 1 runs, 100% failed, 100% of failures match
  rehearse-10454-pull-ci-openshift-cluster-network-operator-master-e2e-aws-sdn-multi - 1 runs, 100% failed, 100% of failures match
  rehearse-10454-pull-ci-openshift-origin-master-e2e-gcp - 1 runs, 100% failed, 100% of failures match
  release-openshift-ocp-e2e-aws-scaleup-rhel7-4.6 - 8 runs, 100% failed, 63% of failures match
  release-openshift-ocp-installer-e2e-aws-4.6 - 9 runs, 89% failed, 63% of failures match
  release-openshift-ocp-installer-e2e-aws-fips-4.6 - 1 runs, 100% failed, 100% of failures match
  release-openshift-ocp-installer-e2e-aws-mirrors-4.6 - 1 runs, 100% failed, 100% of failures match
  release-openshift-ocp-installer-e2e-azure-4.6 - 17 runs, 100% failed, 24% of failures match
  release-openshift-ocp-installer-e2e-gcp-4.6 - 4 runs, 100% failed, 100% of failures match
  release-openshift-ocp-installer-e2e-gcp-ovn-4.6 - 1 runs, 100% failed, 100% of failures match
  release-openshift-ocp-installer-e2e-openstack-4.6 - 8 runs, 100% failed, 25% of failures match
  release-openshift-ocp-installer-e2e-openstack-ppc64le-4.6 - 2 runs, 100% failed, 50% of failures match
  release-openshift-ocp-installer-e2e-ovirt-4.6 - 9 runs, 100% failed, 44% of failures match
  release-openshift-origin-installer-e2e-aws-ovn-4.6 - 1 runs, 100% failed, 100% of failures match
  release-openshift-origin-installer-e2e-azure-shared-vpc-4.5 - 2 runs, 50% failed, 100% of failures match

[1]: https://prow.ci.openshift.org/view/gs/origin-ci-test/pr-logs/pull/openshift_cluster-version-operator/406/pull-ci-openshift-cluster-version-operator-master-e2e/1287865991518228480
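The percentages in the CI search listing are relative to very different run counts, so they are hard to compare by eye. A rough way to rank jobs is to estimate the number of matching failures as runs × failed% × match%. The sketch below assumes each summary line has the form `<job> - N runs, X% failed, Y% of failures match`, as in the output above; the function names are made up for illustration, and because the percentages are already rounded the result is only an estimate.

```python
import re

# Matches one CI-search summary line, e.g.
# "pull-ci-cri-o-cri-o-master-e2e-aws - 54 runs, 74% failed, 45% of failures match"
SUMMARY_RE = re.compile(
    r"^(?P<job>\S+) - (?P<runs>\d+) runs, (?P<failed>\d+)% failed, "
    r"(?P<match>\d+)% of failures match$"
)


def estimate_matching_failures(line):
    """Return (job, estimated matching failures), or None if the line does not parse."""
    m = SUMMARY_RE.match(line.strip())
    if m is None:
        return None
    runs = int(m.group("runs"))
    failed = int(m.group("failed")) / 100
    match = int(m.group("match")) / 100
    return m.group("job"), runs * failed * match


def rank_jobs(lines):
    """Sort the parseable lines by estimated matching failures, highest first."""
    parsed = [r for r in map(estimate_matching_failures, lines) if r is not None]
    return sorted(parsed, key=lambda r: r[1], reverse=True)
```

By this estimate, `pull-ci-cri-o-cri-o-master-e2e-aws` (54 runs, 74% failed, 45% match) accounts for roughly 18 matching failures, versus roughly 3 for `promote-release-openshift-machine-os-content-e2e-aws-4.6` (157 runs, 100% failed, 2% match).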
I spun out the PDB business into bug 1861189.
The `Prometheus when installed on the cluster should provide named network metrics` failure seems to be widespread, too. Filed bug 1861391.
https://github.com/openshift/release/pull/10488
FIPS tests are completely broken and we cannot test, which means we are in "outage" w.r.t. FIPS. Setting to urgent. I need e2e-aws-fips passing ASAP, and at least one verification check in CI code or an e2e suite that says "if FIPS is on, FIPS is actually enabled on nodes". Where will that check go?
Transcribing some updates here:

- Colin proposed a new FIPS test: https://github.com/openshift/origin/pull/25362
- Discussion in Slack seemed to conclude that the FIPS test was expecting a certain number of nodes to be available, and that assuming a fixed node count was the wrong thing to do
- Mrunal proposed a quick patch to the test: https://github.com/openshift/release/pull/10653
- Currently watching the first fixed version of the test here: https://prow.ci.openshift.org/view/gs/origin-ci-test/pr-logs/pull/25370/pull-ci-openshift-origin-master-e2e-aws-fips/1290696271396343808

While this is currently assigned to the RHCOS component, we don't believe there is any fault in the RHCOS handling of FIPS mode.
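One way to avoid baking a node count into such a check is to evaluate whatever nodes the cluster reports, and fail only if the list is empty or some node is not in FIPS mode. The sketch below is not the test from the PRs above. It assumes the caller has already collected, per node, the contents of `/proc/sys/crypto/fips_enabled` (which the kernel reports as `1` in FIPS mode); the function name is hypothetical.

```python
def nodes_not_in_fips(fips_enabled_by_node):
    """Return the sorted names of nodes whose fips_enabled value is not "1".

    fips_enabled_by_node maps node name -> raw contents of
    /proc/sys/crypto/fips_enabled on that node (collected by the caller).
    Raises ValueError on an empty mapping so that "no nodes checked"
    is never mistaken for "all nodes passed".
    """
    if not fips_enabled_by_node:
        raise ValueError("no nodes were checked")
    return sorted(
        name
        for name, raw in fips_enabled_by_node.items()
        if raw.strip() != "1"
    )
```

An empty result means every node that was checked reports FIPS mode, regardless of how many masters or workers the job happened to provision.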
OK, FIPS is still pretty red, and it's not clear to me why. I don't see anything wrong with RHCOS/MCO. Offhand it looks like the SDN pods are failing to start because configmaps aren't being mounted. Tossing to Node.
https://github.com/openshift/machine-config-operator/pull/1990 is related to this
Ryan is on leave
MCO has the fix. Sending to them.
Verified the change to the MCO is included in latest 4.6 nightlies (4.6.0-0.nightly-2020-08-12-140703). Unfortunately the `*pull-ci-openshift-origin-master-e2e-aws-fips` jobs are still pretty red, though for reasons unrelated to FIPS. The new log message is visible in the journals of nodes on both passing and failing jobs.

```
$ oc get clusterversion
NAME      VERSION                             AVAILABLE   PROGRESSING   SINCE   STATUS
version   4.6.0-0.nightly-2020-08-12-140703   True        False         8m58s   Cluster version is 4.6.0-0.nightly-2020-08-12-140703

$ oc get nodes
NAME                                       STATUS   ROLES    AGE   VERSION
ci-ln-w7rgd3t-f76d1-gt2ps-master-0         Ready    master   36m   v1.19.0-rc.2+edbf229-dirty
ci-ln-w7rgd3t-f76d1-gt2ps-master-1         Ready    master   36m   v1.19.0-rc.2+edbf229-dirty
ci-ln-w7rgd3t-f76d1-gt2ps-master-2         Ready    master   36m   v1.19.0-rc.2+edbf229-dirty
ci-ln-w7rgd3t-f76d1-gt2ps-worker-b-v249f   Ready    worker   23m   v1.19.0-rc.2+edbf229-dirty
ci-ln-w7rgd3t-f76d1-gt2ps-worker-c-8mnx2   Ready    worker   23m   v1.19.0-rc.2+edbf229-dirty
ci-ln-w7rgd3t-f76d1-gt2ps-worker-d-4h2ld   Ready    worker   23m   v1.19.0-rc.2+edbf229-dirty

$ oc debug node/ci-ln-w7rgd3t-f76d1-gt2ps-master-0
Starting pod/ci-ln-w7rgd3t-f76d1-gt2ps-master-0-debug ...
To use host binaries, run `chroot /host`
Pod IP: 10.0.0.3
If you don't see a command prompt, try pressing enter.
sh-4.2# chroot /host
sh-4.4# cat /proc/cmdline
BOOT_IMAGE=(hd0,gpt1)/ostree/rhcos-87d3beadaaabc3da4a1e18e14d8670e865b6b5a191b16b80bad40167ca43532d/vmlinuz-4.18.0-211.el8.x86_64 rhcos.root=crypt_rootfs random.trust_cpu=on console=tty0 console=ttyS0,115200n8 rd.luks.options=discard ostree=/ostree/boot.0/rhcos/87d3beadaaabc3da4a1e18e14d8670e865b6b5a191b16b80bad40167ca43532d/0 ignition.platform.id=gcp fips=1 boot=LABEL=boot
sh-4.4# fips-mode-setup --check
FIPS mode is enabled.
sh-4.4# journalctl | grep FIPS
Aug 12 18:16:45 localhost systemd[1]: Starting Check for FIPS mode...
Aug 12 18:16:45 localhost rhcos-fips[966]: FIPS mode is enabled.
Aug 12 18:16:45 localhost systemd[1]: Started Check for FIPS mode.
Aug 12 18:16:48 localhost systemd[1]: Starting Finish FIPS mode setup...
Aug 12 18:16:50 localhost rhcos-fips[1142]: Setting system policy to FIPS
Aug 12 18:16:50 localhost rhcos-fips[1142]: FIPS mode will be enabled.
Aug 12 18:16:50 localhost systemd[1]: Started Finish FIPS mode setup.
Aug 12 18:16:50 localhost systemd[1]: Stopped Finish FIPS mode setup.
Aug 12 18:16:50 localhost systemd[1]: Stopped Check for FIPS mode.
Aug 12 18:16:53 localhost systemd[1]: Starting Finish FIPS mode setup...
Aug 12 18:16:53 localhost rhcos-fips[1142]: Setting system policy to FIPS
Aug 12 18:16:53 localhost rhcos-fips[1142]: FIPS mode will be enabled.
Aug 12 18:16:53 localhost systemd[1]: Started Finish FIPS mode setup.
Aug 12 18:16:53 localhost systemd[1]: Stopped Finish FIPS mode setup.
Aug 12 18:16:53 localhost systemd[1]: Stopped Check for FIPS mode.
Aug 12 18:16:59 ci-ln-w7rgd3t-f76d1-gt2ps-master-0 sshd[2012]: FIPS mode initialized
Aug 12 18:17:13 ci-ln-w7rgd3t-f76d1-gt2ps-master-0 machine-config-daemon[2326]: I0812 18:17:13.396937 2326 update.go:691] FIPS is configured and enabled
Aug 12 18:18:20 ci-ln-w7rgd3t-f76d1-gt2ps-master-0 sshd[1623]: FIPS mode initialized
Aug 12 18:20:19 ci-ln-w7rgd3t-f76d1-gt2ps-master-0 hyperkube[1748]: # SKIPPED DUE TO FIPS COPY.
Aug 12 18:20:19 ci-ln-w7rgd3t-f76d1-gt2ps-master-0 hyperkube[1748]: # SKIPPED DUE TO FIPS COPY.
```
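The first step of the manual verification above, looking for `fips=1` in `/proc/cmdline`, is easy to script. As a minimal sketch, the hypothetical function below inspects a kernel command line string and nothing else. It is an illustration, not what the MCO or `rhcos-fips` actually does, and it only treats the literal value `1` as enabled. A real check should also confirm the runtime state, for example with `fips-mode-setup --check` as shown above.

```python
def cmdline_requests_fips(cmdline):
    """Return True if a kernel command line string asks for FIPS mode via fips=1.

    Assumption for this sketch: if fips= appears more than once,
    the last occurrence is the one that counts.
    """
    value = None
    for arg in cmdline.split():
        if arg.startswith("fips="):
            value = arg[len("fips="):]
    return value == "1"
```

For the command line captured from master-0 above, this returns `True`; for a command line with no `fips=` argument it returns `False`.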
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory (OpenShift Container Platform 4.6 GA Images), and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHBA-2020:4196