1818106 – [upgrade] API was unreachable during disruption for at least...

Keywords:
Status:	CLOSED DUPLICATE of bug 1820266
Alias:	None
Product:	OpenShift Container Platform
Classification:	Red Hat
Component:	Build
Sub Component:
Version:	4.3.z
Hardware:	Unspecified
OS:	Unspecified
Priority:	high
Severity:	high
Target Milestone:	---
Target Release:	---
Assignee:	Gabe Montero
QA Contact:	wewang
Docs Contact:
URL:
Whiteboard:	buildcop
Depends On:
Blocks:

Reported:	2020-03-27 17:27 UTC by Hongkai Liu
Modified:	2020-04-03 20:15 UTC
CC List:	9 users
Fixed In Version:
Doc Type:	If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed:	2020-04-03 20:15:38 UTC
Target Upstream Version:
Embargoed:

Description Hongkai Liu 2020-03-27 17:27:44 UTC

https://prow.svc.ci.openshift.org/view/gcs/origin-ci-test/logs/release-openshift-origin-installer-e2e-aws-upgrade/23264#1:build-log.txt%3A12093

[Disruptive] Cluster upgrade should maintain a functioning cluster [Feature:ClusterUpgrade] [Suite:openshift] [Serial]
Writing JUnit report to /tmp/artifacts/junit/junit_e2e_20200327-025325.xml
error: 1 fail, 0 pass, 0 skip (51m15s)
2020/03/27 02:53:26 Container test in pod e2e-aws-upgrade failed, exit code 1, reason Error
2020/03/27 03:02:03 Copied 177.71MB of artifacts from e2e-aws-upgrade to /logs/artifacts/e2e-aws-upgrade
2020/03/27 03:02:03 Releasing lease for "aws-quota-slice"
2020/03/27 03:02:03 No custom metadata found and prow metadata already exists. Not updating the metadata.
2020/03/27 03:02:04 Ran for 1h33m33s
error: could not run steps: step e2e-aws-upgrade failed: template pod "e2e-aws-upgrade" failed: the pod ci-op-j80hjybn/e2e-aws-upgrade failed after 1h30m5s (failed containers: test): ContainerFailed one or more containers exited
Container test exited with code 1, reason Error
---
ack-off restarting failed container (11 times)
Mar 27 02:50:14.206 W ns/kube-system route/console on reused connections
Mar 27 02:50:14.297 W ns/kube-system route/oauth-openshift on new connections
Mar 27 02:50:15.983 W clusteroperator/dns changed Progressing to False: AsExpected: Desired and available number of DNS DaemonSets are equal
Mar 27 02:50:32.083 I ns/openshift-ingress service/router-default Updated load balancer with new hosts (2 times)
Mar 27 02:50:51.663 W ns/openshift-machine-config-operator pod/machine-config-daemon-vtvp6 node/ip-10-0-140-224.us-west-2.compute.internal container=oauth-proxy container restarted
Mar 27 02:52:44.920 I ns/openshift-machine-api machine/ci-op-j80hjybn-77109-4sltm-master-2 Updated machine ci-op-j80hjybn-77109-4sltm-master-2 (5 times)
Mar 27 02:52:45.042 I ns/openshift-machine-api machine/ci-op-j80hjybn-77109-4sltm-worker-us-west-2a-6pmg5 Updated machine ci-op-j80hjybn-77109-4sltm-worker-us-west-2a-6pmg5 (3 times)
Mar 27 02:52:45.170 I ns/openshift-machine-api machine/ci-op-j80hjybn-77109-4sltm-worker-us-west-2a-cvl9n Updated machine ci-op-j80hjybn-77109-4sltm-worker-us-west-2a-cvl9n (5 times)
Mar 27 02:52:45.286 I ns/openshift-machine-api machine/ci-op-j80hjybn-77109-4sltm-worker-us-west-2b-f6dtp Updated machine ci-op-j80hjybn-77109-4sltm-worker-us-west-2b-f6dtp (3 times)
Mar 27 02:52:46.196 I ns/openshift-machine-api machine/ci-op-j80hjybn-77109-4sltm-master-0 Updated machine ci-op-j80hjybn-77109-4sltm-master-0 (3 times)
Mar 27 02:52:47.138 I ns/openshift-machine-api machine/ci-op-j80hjybn-77109-4sltm-master-1 Updated machine ci-op-j80hjybn-77109-4sltm-master-1 (3 times)
Mar 27 02:53:25.797 I test="[Disruptive] Cluster upgrade should maintain a functioning cluster [Feature:ClusterUpgrade] [Suite:openshift] [Serial]" failed
Failing tests:
[Disruptive] Cluster upgrade should maintain a functioning cluster [Feature:ClusterUpgrade] [Suite:openshift] [Serial]

Comment 1 W. Trevor King 2020-03-31 04:42:19 UTC

Actual failure for [1] was:

  fail [github.com/openshift/origin/test/extended/util/disruption/disruption.go:226]: Mar 27 02:50:23.048: API was unreachable during disruption for at least 4m21s of 48m9s (9%):

Not sure if this 4.2.26 -> 4.3.0-0.nightly-2020-03-27-012404 failure is ingress/routing or the API server itself.  I guessed ingress/routing for bug 1818104, so going with the API server here.

[1]: https://prow.svc.ci.openshift.org/view/gcs/origin-ci-test/logs/release-openshift-origin-installer-e2e-aws-upgrade/23264
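As a quick sanity check on the figures in the failure message, the sketch below converts the two durations to seconds and recomputes the percentage. The helper name `parse_duration` is hypothetical and is not part of the origin test suite; it only handles simple `h`/`m`/`s` durations like the ones quoted above.

```python
import re


def parse_duration(text: str) -> int:
    """Convert a duration such as '4m21s' or '1h33m33s' to whole seconds.

    Hypothetical helper for illustration; only h/m/s components are handled.
    """
    match = re.fullmatch(r"(?:(\d+)h)?(?:(\d+)m)?(?:(\d+)s)?", text)
    if not text or match is None:
        raise ValueError(f"unsupported duration: {text!r}")
    hours, minutes, seconds = (int(group or 0) for group in match.groups())
    return hours * 3600 + minutes * 60 + seconds


# Values taken from the failure message quoted above.
unreachable = parse_duration("4m21s")
total = parse_duration("48m9s")
percent = 100 * unreachable / total

print(f"{unreachable}s of {total}s = {percent:.1f}%")
```

The result rounds to the 9% reported by the test.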

Comment 2 W. Trevor King 2020-03-31 04:49:53 UTC

Might also be an SDN issue like bug 1793635.

Comment 3 Lalatendu Mohanty 2020-03-31 11:20:32 UTC

This failure shows up as "180 (14% of all failures) API was unreachable during disruption" in the last two days of CI runs.

Comment 4 Abu Kashem 2020-04-03 18:29:18 UTC

I checked all the `clusteroperator` objects; all reported OK except for `kube-apiserver`:

curl -s https://storage.googleapis.com/origin-ci-test/logs/release-openshift-origin-installer-e2e-aws-upgrade/23264/artifacts/e2e-aws-upgrade/clusteroperators.json | jq '.items | .[] | select(.metadata.name == "kube-apiserver") | .status.conditions[] | select(.type == "Upgradeable")'
{
  "lastTransitionTime": "2020-03-27T02:07:36Z",
  "message": "DefaultSecurityContextConstraintsUpgradeable: Default SecurityContextConstraints object(s) have mutated [anyuid hostmount-anyuid privileged]",
  "reason": "DefaultSecurityContextConstraints_Mutated",
  "status": "False",
  "type": "Upgradeable"
}
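The jq filter above inspects a single operator. A minimal sketch of the same check across every operator in a `clusteroperators.json` dump could look like the following, assuming the file has the `items[].status.conditions[]` layout that the jq expression implies. The function name `not_upgradeable` and the `example-operator` entry in the sample are hypothetical; the `kube-apiserver` condition is copied from the output above.

```python
import json


def not_upgradeable(cluster_operators: dict) -> dict:
    """Return {operator name: reason} for operators whose Upgradeable condition is False."""
    blocked = {}
    for item in cluster_operators.get("items", []):
        for cond in item.get("status", {}).get("conditions", []):
            if cond.get("type") == "Upgradeable" and cond.get("status") == "False":
                blocked[item["metadata"]["name"]] = cond.get("reason", "")
    return blocked


# Inline sample standing in for the downloaded artifact.
sample = json.loads("""
{
  "items": [
    {
      "metadata": {"name": "kube-apiserver"},
      "status": {"conditions": [
        {"type": "Upgradeable", "status": "False",
         "reason": "DefaultSecurityContextConstraints_Mutated"}
      ]}
    },
    {
      "metadata": {"name": "example-operator"},
      "status": {"conditions": [
        {"type": "Upgradeable", "status": "True", "reason": "AsExpected"}
      ]}
    }
  ]
}
""")

print(not_upgradeable(sample))
```

Against the real artifact, the same function could be fed the parsed output of the curl command instead of the inline sample.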

This is a known issue: the e2e test suite is changing the default SCC. In 4.3, any mutation of the default SCC will prevent upgrade. The resolution is to delete the default SCC object(s) that have been mutated and then delete any one of the `openshift-apiserver` Pods in the `openshift-apiserver` namespace.

The api/auth team had a conversation with Ben Parees about this on Slack: https://coreos.slack.com/archives/CB48XQ4KZ/p1585580675154600

Basically, what's happening here is that the e2e test suite is changing the default SCC: it is adding `system:serviceaccount:e2e-test-s2i-build-root-4qr5v:builder` to the `users` of the default SCC.

- users:
- system:admin
- system:serviceaccount:openshift-infra:build-controller
- system:serviceaccount:e2e-test-s2i-build-root-4qr5v:builder

The default SCC that ships with the cluster does not have `system:serviceaccount:e2e-test-s2i-build-root-4qr5v:builder` in its `users` list.
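To make the mutation explicit, here is a small sketch that compares a live `users` list against the shipped one and reports what was added. It assumes the shipped list is the listing above minus the e2e service account; the function name `added_users` is hypothetical, and no cluster API is called.

```python
def added_users(shipped: list, live: list) -> list:
    """Return the entries present in the live SCC `users` list but not in the shipped one."""
    shipped_set = set(shipped)
    return [user for user in live if user not in shipped_set]


# Assumption: the shipped default is the list above without the e2e builder account.
shipped_users = [
    "system:admin",
    "system:serviceaccount:openshift-infra:build-controller",
]
live_users = shipped_users + [
    "system:serviceaccount:e2e-test-s2i-build-root-4qr5v:builder",
]

extra = added_users(shipped_users, live_users)
print(extra)
```

Any non-empty result means the SCC no longer matches the shipped default, which is the condition the `Upgradeable` message above reports.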

Comment 5 Abu Kashem 2020-04-03 18:48:29 UTC

Assigning it to infrastructure team for now so that they can validate this.

Comment 7 Abu Kashem 2020-04-03 19:42:05 UTC

Hi eparis, we verified this; please see my comment above: https://bugzilla.redhat.com/show_bug.cgi?id=1818106#c4

Comment 8 Ben Parees 2020-04-03 19:57:43 UTC

Gabe, this was a symptom of the SCC mutation e2e you fixed recently.  If you've already got a bug for it, just dupe this against that.

Comment 9 Ben Parees 2020-04-03 20:00:40 UTC

Gabe, not sure which branches you put the e2e change into, but it sounds like we probably need it at least back to 4.3 to unblock upgrade jobs.

Comment 10 Gabe Montero 2020-04-03 20:15:38 UTC

Ben, https://github.com/openshift/origin/pull/24821 is awaiting cherry-pick approval for 4.3, and https://github.com/openshift/origin/pull/24822 for 4.4 is in the same boat.

The 4.5 bug that merged is 1819276.

The 4.3.z bug is 1820266; I'll use that for the dupe.

*** This bug has been marked as a duplicate of bug 1820266 ***
