https://prow.svc.ci.openshift.org/view/gcs/origin-ci-test/logs/release-openshift-origin-installer-e2e-aws-upgrade/23264#1:build-log.txt%3A12093

[Disruptive] Cluster upgrade should maintain a functioning cluster [Feature:ClusterUpgrade] [Suite:openshift] [Serial]

Writing JUnit report to /tmp/artifacts/junit/junit_e2e_20200327-025325.xml
error: 1 fail, 0 pass, 0 skip (51m15s)
2020/03/27 02:53:26 Container test in pod e2e-aws-upgrade failed, exit code 1, reason Error
2020/03/27 03:02:03 Copied 177.71MB of artifacts from e2e-aws-upgrade to /logs/artifacts/e2e-aws-upgrade
2020/03/27 03:02:03 Releasing lease for "aws-quota-slice"
2020/03/27 03:02:03 No custom metadata found and prow metadata already exists. Not updating the metadata.
2020/03/27 03:02:04 Ran for 1h33m33s
error: could not run steps: step e2e-aws-upgrade failed: template pod "e2e-aws-upgrade" failed: the pod ci-op-j80hjybn/e2e-aws-upgrade failed after 1h30m5s (failed containers: test): ContainerFailed one or more containers exited
Container test exited with code 1, reason Error
---
Back-off restarting failed container (11 times)
Mar 27 02:50:14.206 W ns/kube-system route/console on reused connections
Mar 27 02:50:14.297 W ns/kube-system route/oauth-openshift on new connections
Mar 27 02:50:15.983 W clusteroperator/dns changed Progressing to False: AsExpected: Desired and available number of DNS DaemonSets are equal
Mar 27 02:50:32.083 I ns/openshift-ingress service/router-default Updated load balancer with new hosts (2 times)
Mar 27 02:50:51.663 W ns/openshift-machine-config-operator pod/machine-config-daemon-vtvp6 node/ip-10-0-140-224.us-west-2.compute.internal container=oauth-proxy container restarted
Mar 27 02:52:44.920 I ns/openshift-machine-api machine/ci-op-j80hjybn-77109-4sltm-master-2 Updated machine ci-op-j80hjybn-77109-4sltm-master-2 (5 times)
Mar 27 02:52:45.042 I ns/openshift-machine-api machine/ci-op-j80hjybn-77109-4sltm-worker-us-west-2a-6pmg5 Updated machine ci-op-j80hjybn-77109-4sltm-worker-us-west-2a-6pmg5 (3 times)
Mar 27 02:52:45.170 I ns/openshift-machine-api machine/ci-op-j80hjybn-77109-4sltm-worker-us-west-2a-cvl9n Updated machine ci-op-j80hjybn-77109-4sltm-worker-us-west-2a-cvl9n (5 times)
Mar 27 02:52:45.286 I ns/openshift-machine-api machine/ci-op-j80hjybn-77109-4sltm-worker-us-west-2b-f6dtp Updated machine ci-op-j80hjybn-77109-4sltm-worker-us-west-2b-f6dtp (3 times)
Mar 27 02:52:46.196 I ns/openshift-machine-api machine/ci-op-j80hjybn-77109-4sltm-master-0 Updated machine ci-op-j80hjybn-77109-4sltm-master-0 (3 times)
Mar 27 02:52:47.138 I ns/openshift-machine-api machine/ci-op-j80hjybn-77109-4sltm-master-1 Updated machine ci-op-j80hjybn-77109-4sltm-master-1 (3 times)
Mar 27 02:53:25.797 I test="[Disruptive] Cluster upgrade should maintain a functioning cluster [Feature:ClusterUpgrade] [Suite:openshift] [Serial]" failed

Failing tests:

[Disruptive] Cluster upgrade should maintain a functioning cluster [Feature:ClusterUpgrade] [Suite:openshift] [Serial]
Actual failure for [1] was:

fail [github.com/openshift/origin/test/extended/util/disruption/disruption.go:226]: Mar 27 02:50:23.048: API was unreachable during disruption for at least 4m21s of 48m9s (9%):

Not sure if this 4.2.26 -> 4.3.0-0.nightly-2020-03-27-012404 failure is ingress/routing or the API server itself. I guessed ingress/routing for bug 1818104, so going with the API server here.

[1]: https://prow.svc.ci.openshift.org/view/gcs/origin-ci-test/logs/release-openshift-origin-installer-e2e-aws-upgrade/23264
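As a quick sanity check on the percentage quoted in the failure message, the disruption window works out as follows (a sketch only; the actual thresholds applied by disruption.go are not reproduced here):

```python
from datetime import timedelta

# Figures quoted in the failure message: API unreachable for 4m21s
# of a 48m9s observation window.
unreachable = timedelta(minutes=4, seconds=21)
window = timedelta(minutes=48, seconds=9)

fraction = unreachable.total_seconds() / window.total_seconds()
print(f"{fraction:.1%}")  # prints "9.0%", matching the reported 9%
```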
Might also be an SDN issue like bug 1793635.
This seems to be showing up as "180 (14% of all failures): API was unreachable during disruption" over the last two days of CI runs.
I checked all the `clusteroperator` objects; all reported OK except for `kube-apiserver`:

curl -s https://storage.googleapis.com/origin-ci-test/logs/release-openshift-origin-installer-e2e-aws-upgrade/23264/artifacts/e2e-aws-upgrade/clusteroperators.json | jq '.items | .[] | select(.metadata.name == "kube-apiserver") | .status.conditions[] | select(.type == "Upgradeable")'
{
  "lastTransitionTime": "2020-03-27T02:07:36Z",
  "message": "DefaultSecurityContextConstraintsUpgradeable: Default SecurityContextConstraints object(s) have mutated [anyuid hostmount-anyuid privileged]",
  "reason": "DefaultSecurityContextConstraints_Mutated",
  "status": "False",
  "type": "Upgradeable"
}

This is a known issue: the e2e test suite is changing the default SCCs, and in 4.3 any mutation of the default SCCs will prevent upgrade. The resolution is to delete the default SCC object(s) that have been mutated and then delete any of the `openshift-apiserver` pods in the `openshift-apiserver` namespace.

The api/auth team had a conversation with Ben Parees about this on Slack: https://coreos.slack.com/archives/CB48XQ4KZ/p1585580675154600

Basically, what's happening here is that the e2e test suite is adding `system:serviceaccount:e2e-test-s2i-build-root-4qr5v:builder` to the `users` list of the default SCC:

users:
- system:admin
- system:serviceaccount:openshift-infra:build-controller
- system:serviceaccount:e2e-test-s2i-build-root-4qr5v:builder

The default SCC that ships with the cluster does not have system:serviceaccount:e2e-test-s2i-build-root-4qr5v:builder.
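The jq query above checks a single operator; the same scan can be run across the whole clusteroperators.json artifact. A minimal Python sketch, assuming the JSON layout shown above (an inline sample stands in for the real artifact, and the condition values here are illustrative):

```python
import json

# Inline sample mimicking the clusteroperators.json structure shown above;
# the real file comes from the CI artifacts URL.
doc = json.loads("""
{"items": [
  {"metadata": {"name": "dns"},
   "status": {"conditions": [{"type": "Upgradeable", "status": "True"}]}},
  {"metadata": {"name": "kube-apiserver"},
   "status": {"conditions": [
     {"type": "Upgradeable", "status": "False",
      "reason": "DefaultSecurityContextConstraints_Mutated"}]}}
]}
""")

# Equivalent of the jq filter, applied to every operator: collect any
# operator whose Upgradeable condition is False, with its reason.
blocked = [
    (op["metadata"]["name"], cond.get("reason", ""))
    for op in doc["items"]
    for cond in op.get("status", {}).get("conditions", [])
    if cond.get("type") == "Upgradeable" and cond.get("status") == "False"
]
print(blocked)  # [('kube-apiserver', 'DefaultSecurityContextConstraints_Mutated')]
```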
Assigning it to the infrastructure team for now so that they can validate this.
Hi eparis, we verified this, please see my comment above - https://bugzilla.redhat.com/show_bug.cgi?id=1818106#c4
Gabe, this was a symptom of the SCC mutation e2e issue you fixed recently. If you've already got a bug for it, just dupe this against that.
Gabe, not sure which branches you put the e2e change into, but it sounds like we probably need it at least back to 4.3 to unblock upgrade jobs.
Ben, https://github.com/openshift/origin/pull/24821 is awaiting cherry-pick approval for 4.3, and https://github.com/openshift/origin/pull/24822 for 4.4 is in the same boat. The 4.5 bug that merged is 1819276; the 4.3.z bug is 1820266 ... I'll use that for the dupe.

*** This bug has been marked as a duplicate of bug 1820266 ***