Note: This bug is displayed in read-only format because the product is no longer active in Red Hat Bugzilla.

Bug 1861095

Summary:	e2e-aws-fips tests failure
Product:	OpenShift Container Platform
Reporter:	Micah Abbott <miabbott>
Component:	Machine Config Operator
Assignee:	Colin Walters <walters>
Status:	CLOSED ERRATA
QA Contact:	Micah Abbott <miabbott>
Severity:	urgent
Docs Contact:
Priority:	medium
Version:	4.6
CC:	aos-bugs, bbreard, ccoleman, imcleod, jligon, jokerman, nstielau, walters, wking
Target Milestone:	---
Target Release:	4.6.0
Hardware:	Unspecified
OS:	Unspecified
Whiteboard:	non-multi-arch
Fixed In Version:
Doc Type:	If docs needed, set a value
Doc Text:
Story Points:	---
Clone Of:
Environment:
Last Closed:	2020-10-27 16:17:21 UTC
Type:	Bug
Regression:	---
Mount Type:	---
Documentation:	---
CRM:
Verified Versions:
Category:	---
oVirt Team:	---
RHEL 7.3 requirements from Atomic Host:
Cloudforms Team:	---
Target Upstream Version:
Embargoed:

Description Micah Abbott 2020-07-27 19:42:07 UTC

https://prow.ci.openshift.org/?job=*pull-ci-openshift-origin-master-e2e-aws-fips*

This e2e test has been failing consistently for the better part of a week (probably longer).

The logs show that the RHCOS node is booting in FIPS mode successfully.

```
Jul 27 18:00:18.100371 localhost kernel: Kernel command line: BOOT_IMAGE=(hd0,gpt1)/ostree/rhcos-f33933d95ca511a7136eebb678b9b7136a691b9c20baf644bd84b35260fa773b/vmlinuz-4.18.0-211.el8.x86_64 ... fips=1 boot=LABEL=boot
Jul 27 18:00:18.100384 localhost kernel: fips mode: enabled
...
Jul 27 18:00:35.480860 ip-10-0-128-41 sshd[1913]: FIPS mode initialized
```

It was pointed out that the SDN container doesn't seem to be getting scheduled properly:

`"message": "DaemonSet \"openshift-multus/network-metrics-daemon\" is waiting for other operators to become ready\nDaemonSet \"openshift-sdn/sdn-metrics\" is not yet scheduled on any nodes",`

Comment 1 Micah Abbott 2020-07-27 20:28:13 UTC

A sampling of failures from the last day or so shows the following tests failing pretty consistently on this job:

[sig-apps] DisruptionController should block an eviction until the PDB is updated to allow it [Suite:openshift/conformance/parallel] [Suite:k8s]
[sig-instrumentation] Prometheus when installed on the cluster should provide named network metrics [Suite:openshift/conformance/parallel]


Searching CI logs across OpenShift doesn't show any failures of these tests in the last 8 days, though.

Comment 2 Colin Walters 2020-07-27 22:16:48 UTC

While it's possible this is (core) RHCOS I doubt it; we seem to be entering FIPS mode OK.
It seems most likely to be something in crio/kubelet or SDN, but I'm unsure right now.  Trying to
get a live reproducer environment to debug.

Comment 3 W. Trevor King 2020-07-27 23:25:37 UTC

At least:

  DisruptionController should block an eviction until the PDB is updated to allow it

does not seem to be FIPS-specific.  I bumped into it on a non-FIPS PR preflight [1], and CI search suggests it is fairly widespread:

$ w3m -dump -cols 200 'https://search.ci.openshift.org/?search=DisruptionController%20should%20block%20an%20eviction%20until%20the%20PDB%20is%20updated%20to%20allow%20it' | grep 'failures match' | sort
promote-release-openshift-machine-os-content-e2e-aws-4.6 - 157 runs, 100% failed, 2% of failures match
promote-release-openshift-okd-machine-os-content-e2e-aws-4.6 - 23 runs, 100% failed, 4% of failures match
pull-ci-cri-o-cri-o-master-e2e-aws - 54 runs, 74% failed, 45% of failures match
pull-ci-openshift-cluster-api-provider-aws-master-e2e-aws - 4 runs, 75% failed, 33% of failures match
...
pull-ci-operator-framework-operator-registry-master-e2e-aws - 8 runs, 63% failed, 80% of failures match
rehearse-10454-pull-ci-openshift-cloud-credential-operator-master-e2e-azure - 1 runs, 100% failed, 100% of failures match
rehearse-10454-pull-ci-openshift-cluster-network-operator-master-e2e-aws-sdn-multi - 1 runs, 100% failed, 100% of failures match
rehearse-10454-pull-ci-openshift-origin-master-e2e-gcp - 1 runs, 100% failed, 100% of failures match
release-openshift-ocp-e2e-aws-scaleup-rhel7-4.6 - 8 runs, 100% failed, 63% of failures match
release-openshift-ocp-installer-e2e-aws-4.6 - 9 runs, 89% failed, 63% of failures match
release-openshift-ocp-installer-e2e-aws-fips-4.6 - 1 runs, 100% failed, 100% of failures match
release-openshift-ocp-installer-e2e-aws-mirrors-4.6 - 1 runs, 100% failed, 100% of failures match
release-openshift-ocp-installer-e2e-azure-4.6 - 17 runs, 100% failed, 24% of failures match
release-openshift-ocp-installer-e2e-gcp-4.6 - 4 runs, 100% failed, 100% of failures match
release-openshift-ocp-installer-e2e-gcp-ovn-4.6 - 1 runs, 100% failed, 100% of failures match
release-openshift-ocp-installer-e2e-openstack-4.6 - 8 runs, 100% failed, 25% of failures match
release-openshift-ocp-installer-e2e-openstack-ppc64le-4.6 - 2 runs, 100% failed, 50% of failures match
release-openshift-ocp-installer-e2e-ovirt-4.6 - 9 runs, 100% failed, 44% of failures match
release-openshift-origin-installer-e2e-aws-ovn-4.6 - 1 runs, 100% failed, 100% of failures match
release-openshift-origin-installer-e2e-azure-shared-vpc-4.5 - 2 runs, 50% failed, 100% of failures match

[1]: https://prow.ci.openshift.org/view/gs/origin-ci-test/pr-logs/pull/openshift_cluster-version-operator/406/pull-ci-openshift-cluster-version-operator-master-e2e/1287865991518228480
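
For a rough sense of scale, the two percentages on each CI search line can be multiplied out into an approximate count of runs that hit the test. This is only a sketch: the function name is made up for illustration, and since the search page already rounds its percentages, the result is an estimate rather than an exact count.

```python
import re

# Matches CI search summary lines such as:
#   pull-ci-cri-o-cri-o-master-e2e-aws - 54 runs, 74% failed, 45% of failures match
LINE_RE = re.compile(
    r"^(?P<job>\S+) - (?P<runs>\d+) runs, (?P<failed>\d+)% failed, "
    r"(?P<match>\d+)% of failures match$"
)


def estimate_matching_runs(line):
    """Return (job name, estimated number of runs that hit the searched failure).

    Hypothetical helper: the estimate is runs * failed% * match%, rounded.
    """
    m = LINE_RE.match(line.strip())
    if m is None:
        raise ValueError(f"unrecognized summary line: {line!r}")
    runs = int(m["runs"])
    failed = int(m["failed"]) / 100
    match = int(m["match"]) / 100
    return m["job"], round(runs * failed * match)
```

Feeding it the `pull-ci-cri-o-cri-o-master-e2e-aws` line above gives an estimate of about 18 affected runs out of 54.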

Comment 4 W. Trevor King 2020-07-28 02:57:51 UTC

I spun out the PDB business into bug 1861189.

Comment 5 Micah Abbott 2020-07-28 13:44:04 UTC

The `Prometheus when installed on the cluster should provide named network metrics` failure seems to be widespread, too.  Filed bug 1861391.

Comment 6 Colin Walters 2020-07-28 14:10:53 UTC

https://github.com/openshift/release/pull/10488

Comment 7 Clayton Coleman 2020-07-31 20:09:47 UTC

FIPS tests are completely broken and we cannot test, which means we are in "outage" w.r.t. FIPS.  Setting to urgent.  I need e2e-aws-fips passing ASAP, and at least one verification check in CI code or an e2e suite that says "if FIPS is on, FIPS is actually enabled on nodes".  Where will that check go?
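
A minimal sketch of what such a check could compare, assuming the caller already knows whether the cluster was meant to have FIPS on and has read `/proc/cmdline` and `/proc/sys/crypto/fips_enabled` from a node. The function names are hypothetical, and this is not the test that eventually landed.

```python
def kernel_requests_fips(cmdline):
    """True if the kernel command line contains the fips=1 argument."""
    return "fips=1" in cmdline.split()


def fips_state_matches(expected, cmdline, fips_enabled):
    """Compare the intended FIPS setting with what a node reports.

    expected     -- whether the cluster was configured with FIPS on (assumed input)
    cmdline      -- contents of /proc/cmdline on the node
    fips_enabled -- contents of /proc/sys/crypto/fips_enabled on the node
    """
    # Treat the node as "FIPS on" only if both the boot argument and the
    # kernel's own flag agree.
    actual = kernel_requests_fips(cmdline) and fips_enabled.strip() == "1"
    return actual == expected
```

The point of passing `expected` in is that the check fails in both directions: FIPS requested but not active, and FIPS active when it was not requested.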

Comment 8 Micah Abbott 2020-08-04 17:54:57 UTC

Transcribing some updates here:

- Colin proposed a new FIPS test:  https://github.com/openshift/origin/pull/25362
- Discussion in Slack seemed to conclude that the FIPS test was expecting a certain number of nodes to be available, but assuming a fixed node count was the wrong thing to do
- Mrunal proposed a quick patch to the test:  https://github.com/openshift/release/pull/10653
- Currently watching the first fixed version of the test here:  https://prow.ci.openshift.org/view/gs/origin-ci-test/pr-logs/pull/25370/pull-ci-openshift-origin-master-e2e-aws-fips/1290696271396343808
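
To illustrate the node-count point from the Slack discussion: a check can evaluate whatever nodes the cluster reports instead of comparing against a hard-coded count. The sketch below is not taken from either pull request; `nodes_missing_fips` and its input shape are assumptions made for illustration.

```python
def nodes_missing_fips(node_results):
    """Return the sorted names of nodes whose FIPS check failed.

    node_results maps node name -> bool (True if the node reported FIPS on).
    The mapping holds however many nodes the cluster actually has; no fixed
    count is assumed.
    """
    if not node_results:
        # An empty result would otherwise look like success.
        raise ValueError("no nodes were checked")
    return sorted(name for name, ok in node_results.items() if not ok)
```

An empty return value means every discovered node passed, whether the cluster has three nodes or six.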

While this is currently assigned to the RHCOS component, we don't believe there is any fault in the RHCOS handling of FIPS mode.

Comment 9 Colin Walters 2020-08-06 14:59:18 UTC

OK, FIPS is still pretty red, and it's not clear to me why.  I don't see anything wrong with RHCOS/MCO - offhand it looks like the sdn pods are failing to start because configmaps aren't being mounted.  Tossing to Node.

Comment 10 Colin Walters 2020-08-07 17:34:59 UTC

https://github.com/openshift/machine-config-operator/pull/1990 is related to this

Comment 11 Seth Jennings 2020-08-10 16:06:35 UTC

Ryan is on leave

Comment 12 Seth Jennings 2020-08-10 18:14:47 UTC

MCO has the fix. Sending to them.

Comment 15 Micah Abbott 2020-08-12 20:24:54 UTC

Verified the change to the MCO is included in the latest 4.6 nightlies (4.6.0-0.nightly-2020-08-12-140703).  Unfortunately the `*pull-ci-openshift-origin-master-e2e-aws-fips*` jobs are still pretty red, though for reasons unrelated to FIPS.  The new log message is visible in the journals of nodes on both passing and failing jobs.

```
$ oc get clusterversion                                                                                                                                  
NAME      VERSION                             AVAILABLE   PROGRESSING   SINCE   STATUS                                                                                                                               
version   4.6.0-0.nightly-2020-08-12-140703   True        False         8m58s   Cluster version is 4.6.0-0.nightly-2020-08-12-140703                                                                            

$ oc get nodes                                                                                                                                           
NAME                                       STATUS   ROLES    AGE   VERSION                                                                                                                                           
ci-ln-w7rgd3t-f76d1-gt2ps-master-0         Ready    master   36m   v1.19.0-rc.2+edbf229-dirty                                                                                                                        
ci-ln-w7rgd3t-f76d1-gt2ps-master-1         Ready    master   36m   v1.19.0-rc.2+edbf229-dirty             
ci-ln-w7rgd3t-f76d1-gt2ps-master-2         Ready    master   36m   v1.19.0-rc.2+edbf229-dirty             
ci-ln-w7rgd3t-f76d1-gt2ps-worker-b-v249f   Ready    worker   23m   v1.19.0-rc.2+edbf229-dirty             
ci-ln-w7rgd3t-f76d1-gt2ps-worker-c-8mnx2   Ready    worker   23m   v1.19.0-rc.2+edbf229-dirty
ci-ln-w7rgd3t-f76d1-gt2ps-worker-d-4h2ld   Ready    worker   23m   v1.19.0-rc.2+edbf229-dirty 

$ oc debug node/ci-ln-w7rgd3t-f76d1-gt2ps-master-0                                                                                                       
Starting pod/ci-ln-w7rgd3t-f76d1-gt2ps-master-0-debug ...                                                 
To use host binaries, run `chroot /host`                                                                  
Pod IP: 10.0.0.3                                                                                                                                                                                                     
If you don't see a command prompt, try pressing enter.                                                                                                                                                               
sh-4.2# chroot /host
sh-4.4# cat /proc/cmdline                                                                                 
BOOT_IMAGE=(hd0,gpt1)/ostree/rhcos-87d3beadaaabc3da4a1e18e14d8670e865b6b5a191b16b80bad40167ca43532d/vmlinuz-4.18.0-211.el8.x86_64 rhcos.root=crypt_rootfs random.trust_cpu=on console=tty0 console=ttyS0,115200n8 rd.luks.options=discard ostree=/ostree/boot.0/rhcos/87d3beadaaabc3da4a1e18e14d8670e865b6b5a191b16b80bad40167ca43532d/0 ignition.platform.id=gcp fips=1 boot=LABEL=boot
sh-4.4# fips-mode-setup --check                                                                           
FIPS mode is enabled.
sh-4.4# journalctl | grep FIPS                                                                            
Aug 12 18:16:45 localhost systemd[1]: Starting Check for FIPS mode...                     
Aug 12 18:16:45 localhost rhcos-fips[966]: FIPS mode is enabled.                                 
Aug 12 18:16:45 localhost systemd[1]: Started Check for FIPS mode.
Aug 12 18:16:48 localhost systemd[1]: Starting Finish FIPS mode setup...
Aug 12 18:16:50 localhost rhcos-fips[1142]: Setting system policy to FIPS
Aug 12 18:16:50 localhost rhcos-fips[1142]: FIPS mode will be enabled.
Aug 12 18:16:50 localhost systemd[1]: Started Finish FIPS mode setup.
Aug 12 18:16:50 localhost systemd[1]: Stopped Finish FIPS mode setup.
Aug 12 18:16:50 localhost systemd[1]: Stopped Check for FIPS mode.
Aug 12 18:16:53 localhost systemd[1]: Starting Finish FIPS mode setup...
Aug 12 18:16:53 localhost rhcos-fips[1142]: Setting system policy to FIPS
Aug 12 18:16:53 localhost rhcos-fips[1142]: FIPS mode will be enabled.
Aug 12 18:16:53 localhost systemd[1]: Started Finish FIPS mode setup.
Aug 12 18:16:53 localhost systemd[1]: Stopped Finish FIPS mode setup.
Aug 12 18:16:53 localhost systemd[1]: Stopped Check for FIPS mode.
Aug 12 18:16:59 ci-ln-w7rgd3t-f76d1-gt2ps-master-0 sshd[2012]: FIPS mode initialized
Aug 12 18:17:13 ci-ln-w7rgd3t-f76d1-gt2ps-master-0 machine-config-daemon[2326]: I0812 18:17:13.396937    2326 update.go:691] FIPS is configured and enabled
Aug 12 18:18:20 ci-ln-w7rgd3t-f76d1-gt2ps-master-0 sshd[1623]: FIPS mode initialized
Aug 12 18:20:19 ci-ln-w7rgd3t-f76d1-gt2ps-master-0 hyperkube[1748]: # SKIPPED DUE TO FIPS COPY.
Aug 12 18:20:19 ci-ln-w7rgd3t-f76d1-gt2ps-master-0 hyperkube[1748]: # SKIPPED DUE TO FIPS COPY.
```
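
The manual `journalctl | grep FIPS` step above could be automated along these lines. This is a sketch that assumes the journal has already been captured as a list of text lines; the helper name is hypothetical, and the message string is copied from the machine-config-daemon line in the transcript.

```python
# Message text as it appears in the machine-config-daemon journal line above.
MCD_FIPS_MESSAGE = "FIPS is configured and enabled"


def mcd_confirmed_fips(journal_lines):
    """True if any machine-config-daemon journal line carries the FIPS message."""
    return any(
        "machine-config-daemon[" in line and MCD_FIPS_MESSAGE in line
        for line in journal_lines
    )
```

Matching on the daemon name as well as the message avoids counting unrelated lines such as the `sshd` "FIPS mode initialized" entries.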

Comment 17 errata-xmlrpc 2020-10-27 16:17:21 UTC

Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (OpenShift Container Platform 4.6 GA Images), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2020:4196
