Bug 1861095 - e2e-aws-fips tests failure
Summary: e2e-aws-fips tests failure
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: OpenShift Container Platform
Classification: Red Hat
Component: Machine Config Operator
Version: 4.6
Hardware: Unspecified
OS: Unspecified
medium
urgent
Target Milestone: ---
Target Release: 4.6.0
Assignee: Colin Walters
QA Contact: Micah Abbott
URL:
Whiteboard: non-multi-arch
Depends On:
Blocks:
 
Reported: 2020-07-27 19:42 UTC by Micah Abbott
Modified: 2020-10-27 16:20 UTC

Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed: 2020-10-27 16:17:21 UTC
Target Upstream Version:
Embargoed:




Links
System ID Private Priority Status Summary Last Updated
Github openshift machine-config-operator pull 1990 0 None closed Bug 1861095: daemon: Log when we validate that FIPS is on 2020-09-14 09:00:02 UTC
Red Hat Product Errata RHBA-2020:4196 0 None None None 2020-10-27 16:20:44 UTC

Description Micah Abbott 2020-07-27 19:42:07 UTC
https://prow.ci.openshift.org/?job=*pull-ci-openshift-origin-master-e2e-aws-fips*

This e2e test has been failing consistently for the better part of a week (probably longer).

The logs show that the RHCOS node is booting in FIPS mode successfully.

```
Jul 27 18:00:18.100371 localhost kernel: Kernel command line: BOOT_IMAGE=(hd0,gpt1)/ostree/rhcos-f33933d95ca511a7136eebb678b9b7136a691b9c20baf644bd84b35260fa773b/vmlinuz-4.18.0-211.el8.x86_64 ... fips=1 boot=LABEL=boot
Jul 27 18:00:18.100384 localhost kernel: fips mode: enabled
...
Jul 27 18:00:35.480860 ip-10-0-128-41 sshd[1913]: FIPS mode initialized
```
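A minimal sketch of the same check done by hand, assuming you have a shell on the node or a saved copy of its kernel command line. The helper name `cmdline_has_fips` is hypothetical; `/proc/cmdline` and `/proc/sys/crypto/fips_enabled` are the standard kernel interfaces.

```shell
# Succeed if a kernel command line string contains fips=1 as a whole argument.
cmdline_has_fips() {
    # Pad with spaces so "fips=1" cannot match inside a longer argument.
    case " $1 " in
        *" fips=1 "*) return 0 ;;
        *) return 1 ;;
    esac
}

# On a live node (assumption: run on the host, not inside a container):
#   cmdline_has_fips "$(cat /proc/cmdline)" && cat /proc/sys/crypto/fips_enabled
```

The kernel argument only shows what was requested at boot; the `fips_enabled` file shows what the running kernel reports.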

It was pointed out that the SDN container doesn't seem to be getting scheduled properly:

`"message": "DaemonSet \"openshift-multus/network-metrics-daemon\" is waiting for other operators to become ready\nDaemonSet \"openshift-sdn/sdn-metrics\" is not yet scheduled on any nodes",`

Comment 1 Micah Abbott 2020-07-27 20:28:13 UTC
A sampling of failures from the last day or so shows the following tests failing pretty consistently on this job:

[sig-apps] DisruptionController should block an eviction until the PDB is updated to allow it [Suite:openshift/conformance/parallel] [Suite:k8s]
[sig-instrumentation] Prometheus when installed on the cluster should provide named network metrics [Suite:openshift/conformance/parallel]


Searching CI logs across OpenShift doesn't show any failures of these tests in the last 8 days, though.

Comment 2 Colin Walters 2020-07-27 22:16:48 UTC
While it's possible this is (core) RHCOS I doubt it; we seem to be entering FIPS mode OK.
Seems most likely to be something crio/kubelet or SDN, but unsure right now.  Trying to
get a live reproducer environment to debug.

Comment 3 W. Trevor King 2020-07-27 23:25:37 UTC
At least:

  DisruptionController should block an eviction until the PDB is updated to allow it

does not seem to be FIPS-specific.  I bumped into it on a non-FIPS PR preflight [1], and CI search suggests it is fairly widespread:

$ w3m -dump -cols 200 'https://search.ci.openshift.org/?search=DisruptionController%20should%20block%20an%20eviction%20until%20the%20PDB%20is%20updated%20to%20allow%20it' | grep 'failures match' | sort
promote-release-openshift-machine-os-content-e2e-aws-4.6 - 157 runs, 100% failed, 2% of failures match
promote-release-openshift-okd-machine-os-content-e2e-aws-4.6 - 23 runs, 100% failed, 4% of failures match
pull-ci-cri-o-cri-o-master-e2e-aws - 54 runs, 74% failed, 45% of failures match
pull-ci-openshift-cluster-api-provider-aws-master-e2e-aws - 4 runs, 75% failed, 33% of failures match
...
pull-ci-operator-framework-operator-registry-master-e2e-aws - 8 runs, 63% failed, 80% of failures match
rehearse-10454-pull-ci-openshift-cloud-credential-operator-master-e2e-azure - 1 runs, 100% failed, 100% of failures match
rehearse-10454-pull-ci-openshift-cluster-network-operator-master-e2e-aws-sdn-multi - 1 runs, 100% failed, 100% of failures match
rehearse-10454-pull-ci-openshift-origin-master-e2e-gcp - 1 runs, 100% failed, 100% of failures match
release-openshift-ocp-e2e-aws-scaleup-rhel7-4.6 - 8 runs, 100% failed, 63% of failures match
release-openshift-ocp-installer-e2e-aws-4.6 - 9 runs, 89% failed, 63% of failures match
release-openshift-ocp-installer-e2e-aws-fips-4.6 - 1 runs, 100% failed, 100% of failures match
release-openshift-ocp-installer-e2e-aws-mirrors-4.6 - 1 runs, 100% failed, 100% of failures match
release-openshift-ocp-installer-e2e-azure-4.6 - 17 runs, 100% failed, 24% of failures match
release-openshift-ocp-installer-e2e-gcp-4.6 - 4 runs, 100% failed, 100% of failures match
release-openshift-ocp-installer-e2e-gcp-ovn-4.6 - 1 runs, 100% failed, 100% of failures match
release-openshift-ocp-installer-e2e-openstack-4.6 - 8 runs, 100% failed, 25% of failures match
release-openshift-ocp-installer-e2e-openstack-ppc64le-4.6 - 2 runs, 100% failed, 50% of failures match
release-openshift-ocp-installer-e2e-ovirt-4.6 - 9 runs, 100% failed, 44% of failures match
release-openshift-origin-installer-e2e-aws-ovn-4.6 - 1 runs, 100% failed, 100% of failures match
release-openshift-origin-installer-e2e-azure-shared-vpc-4.5 - 2 runs, 50% failed, 100% of failures match
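
To see which jobs are hit hardest, the summary lines above can be ranked by their match rate. This is a rough sketch that assumes the exact `<job> - N runs, X% failed, Y% of failures match` layout shown in the output; the function name `rank_by_match` is hypothetical.

```shell
# Read CI-search summary lines on stdin and print "<match%> <job>",
# highest match rate first. Lines in any other layout are dropped.
rank_by_match() {
    sed -n 's/^\([^ ]*\) - .* \([0-9]*\)% of failures match$/\2 \1/p' | sort -rn
}
```

It can be appended to the pipeline above in place of the final `sort`.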

[1]: https://prow.ci.openshift.org/view/gs/origin-ci-test/pr-logs/pull/openshift_cluster-version-operator/406/pull-ci-openshift-cluster-version-operator-master-e2e/1287865991518228480

Comment 4 W. Trevor King 2020-07-28 02:57:51 UTC
I spun out the PDB business into bug 1861189.

Comment 5 Micah Abbott 2020-07-28 13:44:04 UTC
The `Prometheus when installed on the cluster should provide named network metrics` failure seems to be widespread, too.  Filed bug 1861391.

Comment 6 Colin Walters 2020-07-28 14:10:53 UTC
https://github.com/openshift/release/pull/10488

Comment 7 Clayton Coleman 2020-07-31 20:09:47 UTC
FIPS tests are completely broken and we cannot test, which means we are in "outage" w.r.t. FIPS.  Setting to urgent.  I need e2e-aws-fips passing ASAP, and at least one verification check in CI code or an e2e suite that says "if fips is on, fips is actually enabled on nodes".  Where will that check go?
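
A sketch of what such a check could look like, reduced to its core comparison. This is an illustration only, not the check that was eventually merged; the function name `fips_state_matches` is hypothetical, and where the configured value comes from is an assumption.

```shell
# Compare the configured FIPS setting with what a node reports.
#   $1: configured value, "true" or "false" (assumption: read from the cluster's install-time configuration)
#   $2: content of /proc/sys/crypto/fips_enabled on the node, "1" or "0"
fips_state_matches() {
    case "$1:$2" in
        true:1|false:0) return 0 ;;
        *) return 1 ;;
    esac
}

# Possible use against a live node (assumption: oc debug access is available):
#   node_state="$(oc debug node/<name> -- chroot /host cat /proc/sys/crypto/fips_enabled)"
#   fips_state_matches true "$node_state" || echo "FIPS mismatch on <name>"
```

Treating any unexpected value as a mismatch keeps the check from passing silently when the node state could not be read.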

Comment 8 Micah Abbott 2020-08-04 17:54:57 UTC
Transcribing some updates here:

- Colin proposed a new FIPS test:  https://github.com/openshift/origin/pull/25362
- Discussion in Slack seemed to conclude that the FIPS test was expecting a certain number of nodes to be available, and that assuming a fixed node count was the wrong thing to do
- Mrunal proposed a quick patch to the test:  https://github.com/openshift/release/pull/10653
- Currently watching the first fixed version of the test here:  https://prow.ci.openshift.org/view/gs/origin-ci-test/pr-logs/pull/25370/pull-ci-openshift-origin-master-e2e-aws-fips/1290696271396343808
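
For illustration, one way to avoid hard-coding a node count is to require only that every node the cluster reports is Ready. This is not the patch from the pull requests linked above; the function name `all_nodes_ready` is hypothetical, and it assumes the default column layout of `oc get nodes`.

```shell
# Read `oc get nodes` output on stdin; succeed only if there is at least
# one node and every node's STATUS column is exactly "Ready".
all_nodes_ready() {
    awk 'NR > 1 { total++; if ($2 == "Ready") ready++ }
         END { exit !(total > 0 && ready == total) }'
}

# Possible use (assumption: oc is logged in to the cluster):
#   oc get nodes | all_nodes_ready && echo "all nodes Ready"
```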

While this is currently assigned to the RHCOS component, we don't believe there is any fault in the RHCOS handling of FIPS mode.

Comment 9 Colin Walters 2020-08-06 14:59:18 UTC
OK, FIPS is still pretty red, and it's not clear to me why.  I don't see anything wrong with RHCOS/MCO; offhand it looks like the sdn pods are failing to start because configmaps aren't being mounted.  Tossing to Node.

Comment 10 Colin Walters 2020-08-07 17:34:59 UTC
https://github.com/openshift/machine-config-operator/pull/1990 is related to this

Comment 11 Seth Jennings 2020-08-10 16:06:35 UTC
Ryan is on leave

Comment 12 Seth Jennings 2020-08-10 18:14:47 UTC
MCO has the fix. Sending to them.

Comment 15 Micah Abbott 2020-08-12 20:24:54 UTC
Verified the change to the MCO is included in the latest 4.6 nightlies (4.6.0-0.nightly-2020-08-12-140703).  Unfortunately the `*pull-ci-openshift-origin-master-e2e-aws-fips` jobs are still pretty red, though for reasons unrelated to FIPS.  The new log message is visible in the journals of nodes on both passing and failing jobs.

```
$ oc get clusterversion                                                                                                                                  
NAME      VERSION                             AVAILABLE   PROGRESSING   SINCE   STATUS                                                                                                                               
version   4.6.0-0.nightly-2020-08-12-140703   True        False         8m58s   Cluster version is 4.6.0-0.nightly-2020-08-12-140703                                                                            

$ oc get nodes                                                                                                                                           
NAME                                       STATUS   ROLES    AGE   VERSION                                                                                                                                           
ci-ln-w7rgd3t-f76d1-gt2ps-master-0         Ready    master   36m   v1.19.0-rc.2+edbf229-dirty                                                                                                                        
ci-ln-w7rgd3t-f76d1-gt2ps-master-1         Ready    master   36m   v1.19.0-rc.2+edbf229-dirty             
ci-ln-w7rgd3t-f76d1-gt2ps-master-2         Ready    master   36m   v1.19.0-rc.2+edbf229-dirty             
ci-ln-w7rgd3t-f76d1-gt2ps-worker-b-v249f   Ready    worker   23m   v1.19.0-rc.2+edbf229-dirty             
ci-ln-w7rgd3t-f76d1-gt2ps-worker-c-8mnx2   Ready    worker   23m   v1.19.0-rc.2+edbf229-dirty
ci-ln-w7rgd3t-f76d1-gt2ps-worker-d-4h2ld   Ready    worker   23m   v1.19.0-rc.2+edbf229-dirty 

$ oc debug node/ci-ln-w7rgd3t-f76d1-gt2ps-master-0                                                                                                       
Starting pod/ci-ln-w7rgd3t-f76d1-gt2ps-master-0-debug ...                                                 
To use host binaries, run `chroot /host`                                                                  
Pod IP: 10.0.0.3                                                                                                                                                                                                     
If you don't see a command prompt, try pressing enter.                                                                                                                                                               
sh-4.2# chroot /host
sh-4.4# cat /proc/cmdline                                                                                 
BOOT_IMAGE=(hd0,gpt1)/ostree/rhcos-87d3beadaaabc3da4a1e18e14d8670e865b6b5a191b16b80bad40167ca43532d/vmlinuz-4.18.0-211.el8.x86_64 rhcos.root=crypt_rootfs random.trust_cpu=on console=tty0 console=ttyS0,115200n8 rd.luks.options=discard ostree=/ostree/boot.0/rhcos/87d3beadaaabc3da4a1e18e14d8670e865b6b5a191b16b80bad40167ca43532d/0 ignition.platform.id=gcp fips=1 boot=LABEL=boot
sh-4.4# fips-mode-setup --check                                                                           
FIPS mode is enabled.
sh-4.4# journalctl | grep FIPS                                                                            
Aug 12 18:16:45 localhost systemd[1]: Starting Check for FIPS mode...                     
Aug 12 18:16:45 localhost rhcos-fips[966]: FIPS mode is enabled.                                 
Aug 12 18:16:45 localhost systemd[1]: Started Check for FIPS mode.
Aug 12 18:16:48 localhost systemd[1]: Starting Finish FIPS mode setup...
Aug 12 18:16:50 localhost rhcos-fips[1142]: Setting system policy to FIPS
Aug 12 18:16:50 localhost rhcos-fips[1142]: FIPS mode will be enabled.
Aug 12 18:16:50 localhost systemd[1]: Started Finish FIPS mode setup.
Aug 12 18:16:50 localhost systemd[1]: Stopped Finish FIPS mode setup.
Aug 12 18:16:50 localhost systemd[1]: Stopped Check for FIPS mode.
Aug 12 18:16:53 localhost systemd[1]: Starting Finish FIPS mode setup...
Aug 12 18:16:53 localhost rhcos-fips[1142]: Setting system policy to FIPS
Aug 12 18:16:53 localhost rhcos-fips[1142]: FIPS mode will be enabled.
Aug 12 18:16:53 localhost systemd[1]: Started Finish FIPS mode setup.
Aug 12 18:16:53 localhost systemd[1]: Stopped Finish FIPS mode setup.
Aug 12 18:16:53 localhost systemd[1]: Stopped Check for FIPS mode.
Aug 12 18:16:59 ci-ln-w7rgd3t-f76d1-gt2ps-master-0 sshd[2012]: FIPS mode initialized
Aug 12 18:17:13 ci-ln-w7rgd3t-f76d1-gt2ps-master-0 machine-config-daemon[2326]: I0812 18:17:13.396937    2326 update.go:691] FIPS is configured and enabled
Aug 12 18:18:20 ci-ln-w7rgd3t-f76d1-gt2ps-master-0 sshd[1623]: FIPS mode initialized
Aug 12 18:20:19 ci-ln-w7rgd3t-f76d1-gt2ps-master-0 hyperkube[1748]: # SKIPPED DUE TO FIPS COPY.
Aug 12 18:20:19 ci-ln-w7rgd3t-f76d1-gt2ps-master-0 hyperkube[1748]: # SKIPPED DUE TO FIPS COPY.
```
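
The journal check above can be narrowed to the new machine-config-daemon message alone. This is a small sketch that assumes the message text stays exactly as in the transcript; the function name `journal_has_mcd_fips_check` is hypothetical.

```shell
# Succeed if journal text on stdin contains the machine-config-daemon
# message confirming that FIPS was validated.
journal_has_mcd_fips_check() {
    grep -q 'machine-config-daemon\[[0-9]*]: .*FIPS is configured and enabled'
}

# Possible use on a node, after `chroot /host`:
#   journalctl | journal_has_mcd_fips_check && echo "MCD validated FIPS"
```

Matching on the process name as well as the text avoids false positives from the `sshd` and `rhcos-fips` lines, which also mention FIPS.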

Comment 17 errata-xmlrpc 2020-10-27 16:17:21 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (OpenShift Container Platform 4.6 GA Images), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2020:4196

Comment 18 errata-xmlrpc 2020-10-27 16:20:41 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (OpenShift Container Platform 4.6 GA Images), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2020:4196

