Created attachment 1682325 [details]
openshift-apiserver-operator.events.txt

During an upgrade of a cluster in the CI build farm, we saw a sequence of alerts and failure messages from clusterversion.

From: 4.3.17
To: 4.4.0-rc.12

Eventually the upgrade completed successfully (which is nice), but those alerts and messages were frightening. I would like to create a bug for each of them and feel better about the next upgrade.

https://coreos.slack.com/archives/CHY2E1BL4/p1588028269133600

  NAME      VERSION   AVAILABLE   PROGRESSING   SINCE   STATUS
  version   4.3.17    True        True          54m     Unable to apply 4.4.0-rc.12: the cluster operator openshift-apiserver is degraded
Setting target release to current development version (4.5) for investigation. Where fixes (if any) are required/requested for prior versions, cloned BZs will be created when appropriate.
Please add must-gather output.
Telemetry suggests the Degraded condition was APIServerDeployment_UnavailablePod, which is probably not giving the control plane machines long enough for a healthy reboot (or it is giving them long enough, but reboots aren't healthy?).
I'm adding UpcomingSprint because I lack the information to properly root-cause this bug. I will revisit it when the information is available. If you have further information on the current state of the bug, please update it and remove the "LifecycleStale" keyword; otherwise this bug will be automatically closed in 7 days. The information can be, for example, that the problem still occurs, that you still want the feature, that more information is needed, or that the bug is (for whatever reason) no longer relevant.
I see no LifecycleStale keyword? But (luckily?) this issue is not rare. Checking in CI search [1]:

  release-openshift-origin-installer-e2e-aws-upgrade - 224 runs, 25% failed, 193% of failures match

including [2], which passed 4.4.4 -> 4.5.0-0.ci-2020-05-21-034945 and has:

  $ curl -s https://storage.googleapis.com/origin-ci-test/logs/release-openshift-origin-installer-e2e-aws-upgrade/29703/artifacts/e2e-aws-upgrade/container-logs/test.log | grep 'openshift-apiserver changed Degraded'
  May 21 05:01:35.785 E clusteroperator/openshift-apiserver changed Degraded to True: APIServerDeployment_UnavailablePod: APIServerDeploymentDegraded: 1 of 3 requested instances are unavailable for apiserver.openshift-apiserver
  May 21 05:15:31.821 W clusteroperator/openshift-apiserver changed Degraded to False

[1]: https://search.apps.build01.ci.devcluster.openshift.com/?search=APIServerDeployment_UnavailablePod&maxAge=48h&context=1&type=bug%2Bjunit&name=%5Erelease.*upgrade&maxMatches=5&maxBytes=20971520&groupBy=job
[2]: https://deck-ci.apps.ci.l2s4.p1.openshiftapps.com/view/gcs/origin-ci-test/logs/release-openshift-origin-installer-e2e-aws-upgrade/29703
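The two log lines above bracket the Degraded window for that run. A rough sketch of extracting the duration from a saved test.log (the sample lines are inlined so the snippet is self-contained; the log omits the year, so one is pinned, and GNU date is assumed):

```shell
#!/bin/sh
# Sample lines copied from the test.log excerpt above.
cat > /tmp/degraded.log <<'EOF'
May 21 05:01:35.785 E clusteroperator/openshift-apiserver changed Degraded to True: APIServerDeployment_UnavailablePod: APIServerDeploymentDegraded: 1 of 3 requested instances are unavailable for apiserver.openshift-apiserver
May 21 05:15:31.821 W clusteroperator/openshift-apiserver changed Degraded to False
EOF

# Grab the HH:MM:SS of each transition, dropping the fractional seconds,
# and pin a year so GNU date can parse it.
start=$(grep 'Degraded to True' /tmp/degraded.log | awk '{print substr($3, 1, 8)}')
end=$(grep 'Degraded to False' /tmp/degraded.log | awk '{print substr($3, 1, 8)}')
start_s=$(date -u -d "2020-05-21 $start" +%s)
end_s=$(date -u -d "2020-05-21 $end" +%s)
echo "Degraded window: $((end_s - start_s))s"  # -> Degraded window: 836s
```

So this particular run spent roughly 14 minutes Degraded, which is well past the sort of brief blip we might be willing to wave off.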
> Not clear to me. Could be must-gather limiting event depth, could be me looking in the wrong place, or could be API-server operator not actually publishing events about these messages.

Events have a TTL, 24h AFAIK.
From Hongkai's new must-gather:

  $ yaml2json <namespaces/openshift-apiserver-operator/core/events.yaml | jq -r '[.items[] | .timePrefix = if .firstTimestamp == null or .firstTimestamp == "null" then .eventTime else .firstTimestamp + " - " + .lastTimestamp + " (" + (.count | tostring) + ")" end] | sort_by(.timePrefix)[] | .timePrefix + " " + .metadata.namespace + " " + .message' | grep Degraded
  2020-06-25T19:50:08Z - 2020-06-25T19:50:08Z (1) openshift-apiserver-operator Status for clusteroperator/openshift-apiserver changed: Degraded message changed from "" to "WorkloadDegraded: \"deployments\": invalid dependency reference: \"etcdserver: leader changed\"\nWorkloadDegraded: "
  2020-06-25T19:50:08Z - 2020-06-25T19:50:08Z (1) openshift-apiserver-operator Status for clusteroperator/openshift-apiserver changed: Degraded message changed from "WorkloadDegraded: \"deployments\": invalid dependency reference: \"etcdserver: leader changed\"\nWorkloadDegraded: " to "WorkloadDegraded: \"deployments\": invalid dependency reference: \"etcdserver: leader changed\"\nWorkloadDegraded: \nAPIServerDeploymentDegraded: deployment/apiserver.openshift-apiserver: could not be retrieved"
  2020-06-25T19:50:21Z - 2020-06-25T19:50:21Z (1) openshift-apiserver-operator Status for clusteroperator/openshift-apiserver changed: Degraded message changed from "WorkloadDegraded: \"deployments\": invalid dependency reference: \"etcdserver: leader changed\"\nWorkloadDegraded: \nAPIServerDeploymentDegraded: deployment/apiserver.openshift-apiserver: could not be retrieved" to "APIServerDeploymentDegraded: deployment/apiserver.openshift-apiserver: could not be retrieved"
  2020-06-25T19:50:21Z - 2020-06-25T19:50:21Z (1) openshift-apiserver-operator Status for clusteroperator/openshift-apiserver changed: Degraded message changed from "APIServerDeploymentDegraded: deployment/apiserver.openshift-apiserver: could not be retrieved" to ""
  2020-06-25T19:52:55Z - 2020-06-25T20:10:44Z (2) openshift-apiserver-operator Status for clusteroperator/openshift-apiserver changed: Degraded message changed from "" to "APIServerDeploymentDegraded: 1 of 3 requested instances are unavailable"
  2020-06-25T19:54:09Z - 2020-06-25T19:54:09Z (1) openshift-apiserver-operator Status for clusteroperator/openshift-apiserver changed: {"conditions":[{"type":"Degraded","status":"False","lastTransitionTime":"2020-06-23T22:04:52Z","reason":"AsExpected","message":"APIServerDeploymentDegraded: 1 of 3 requested instances are unavailable"},{"type":"Progressing","status":"False","lastTransitionTime":"2020-06-25T19:53:04Z","reason":"AsExpected"},{"type":"Available","status":"True","lastTransitionTime":"2020-06-25T19:52:40Z","reason":"AsExpected"},{"type":"Upgradeable","status":"True","lastTransitionTime":"2020-05-21T19:35:57Z","reason":"AsExpected"}],"versions":[{"name":"operator","version":"4.4.10"},{"name":"openshift-apiserver","version":"4.4.
  2020-06-25T19:54:09Z - 2020-06-25T19:54:09Z (1) openshift-apiserver-operator Status for clusteroperator/openshift-apiserver changed: Degraded message changed from "APIServerDeploymentDegraded: 1 of 3 requested instances are unavailable" to ""
  2020-06-25T19:54:09Z - 2020-06-25T19:54:09Z (1) openshift-apiserver-operator (combined from similar events): Status for clusteroperator/openshift-apiserver changed: Degraded message changed from "APIServerDeploymentDegraded: 1 of 3 requested instances are unavailable" to ""
  2020-06-25T20:12:58Z - 2020-06-25T20:12:58Z (1) openshift-apiserver-operator Status for clusteroperator/openshift-apiserver changed: Degraded changed from False to True ("APIServerDeploymentDegraded: 1 of 3 requested instances are unavailable")
  2020-06-25T20:28:07Z - 2020-06-25T20:28:07Z (1) openshift-apiserver-operator Status for clusteroperator/openshift-apiserver changed: Degraded message changed from "APIServerDeploymentDegraded: 1 of 3 requested instances are unavailable" to "APIServerDeploymentDegraded: 2 of 3 requested instances are unavailable"
  2020-06-25T20:28:14Z - 2020-06-25T20:28:14Z (1) openshift-apiserver-operator Status for clusteroperator/openshift-apiserver changed: Degraded message changed from "APIServerDeploymentDegraded: 2 of 3 requested instances are unavailable" to "APIServerDeploymentDegraded: 1 of 3 requested instances are unavailable"
  2020-06-25T20:36:54Z - 2020-06-25T20:36:54Z (1) openshift-apiserver-operator Status for clusteroperator/openshift-apiserver changed: Degraded changed from True to False ("")

So some WorkloadDegraded for etcd leader changes, and also some APIServerDeploymentDegraded (possibly secondary to the etcd elections). Moving to the etcd folks to see if they're comfortable with the etcd handoffs. If they are, I expect they'll kick it back to the API-server folks to be more gracious about the occasional etcd leader election.
Ah, and I don't have to punt to the etcd folks to say that the timestamps on those degradations show etcd leader election outages bracketed between 19:50:08Z and 19:50:21Z. It's possible that the API server was sitting on the etcd outage for a bit before going degraded, but if not the outage was only 13s long. That's still not great. I'm not sure what the usual election time is. Also, it looks like etcd attempts to queue writes during elections. And I guess the API-server could internally retry writes if it lost them by writing to the dying member. Is there a contract between the API-server folks and the etcd folks about what sort of queue times are acceptable? Or about how the API-server on a machine tries to exit the LB and gracefully drain out before the local etcd leader abdicates?
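For the record, the 13s above is just the gap between the two bracketing event timestamps; a quick sketch of the arithmetic (timestamps copied from the must-gather events, GNU date assumed):

```shell
#!/bin/sh
# Timestamps copied from the event dump above.
leader_change=$(date -u -d "2020-06-25T19:50:08Z" +%s)  # WorkloadDegraded appears
cleared=$(date -u -d "2020-06-25T19:50:21Z" +%s)        # Degraded message cleared
echo "bracketed outage: $((cleared - leader_change))s"  # -> bracketed outage: 13s
```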
I'm not clear what etcd bug is perceived to exist here. Etcd is reporting healthy. The election window reported in the last must-gather (19:50:08Z and 19:50:21Z) corresponds with the first creation of the etcd cluster pods during cluster pivot. Isn't that normal? The only other time the etcd containers are re-created is during the upgrade. I don't have any prometheus dump from these clusters, but the etcd-operator reports "2.5 elections in the last 5m" (the rounding issue has since been fixed) around the time of the upgrade, which seems expected. As to the duration of those outages, it would come down to the MCO, the nodes rebooting, etc. As to inducing election on reboot, that topic is being addressed in a variety of ways elsewhere and this bug adds nothing new to that story. Absent any evidence of a serious etcd bug, I'm going to kick it over to apiserver for a last look, but unless we have some specific data to measure impact and a fix, I'm not really sure what to do with this report. If I got something grievously wrong, please let me know.
This needs further triage. Labelling with UpcomingSprint.
This bug hasn't had any activity in the last 30 days. Maybe the problem got resolved, was a duplicate of something else, or became less pressing for some reason - or maybe it's still relevant but just hasn't been looked at yet. As such, we're marking this bug as "LifecycleStale" and decreasing the severity/priority. If you have further information on the current state of the bug, please update it, otherwise this bug can be closed in about 7 days. The information can be, for example, that the problem still occurs, that you still want the feature, that more information is needed, or that the bug is (for whatever reason) no longer relevant. Additionally, you can add LifecycleFrozen into Keywords if you think this bug should never be marked as stale. Please consult with bug assignee before you do that.
Refreshing the search from comment 10, there are still lots of these in CI:

  $ w3m -dump -cols 200 'https://search.ci.openshift.org/?search=APIServerDeployment_UnavailablePod&maxAge=48h&context=1&type=bug%2Bjunit&name=^release.*upgrade' | grep 'failures match'
  release-openshift-origin-installer-e2e-aws-upgrade - 73 runs, 16% failed, 400% of failures match
  release-openshift-origin-installer-e2e-gcp-upgrade - 21 runs, 24% failed, 340% of failures match
  release-openshift-okd-installer-e2e-aws-upgrade - 12 runs, 17% failed, 550% of failures match
  release-openshift-origin-installer-e2e-azure-upgrade-4.5-stable-to-4.6-ci - 8 runs, 50% failed, 150% of failures match
  release-openshift-origin-installer-e2e-azure-upgrade-4.4 - 7 runs, 29% failed, 300% of failures match
  release-openshift-origin-installer-e2e-azure-upgrade-4.5 - 8 runs, 38% failed, 200% of failures match
  release-openshift-origin-installer-e2e-gcp-upgrade-4.4-stable-to-4.5-ci - 7 runs, 14% failed, 600% of failures match
  release-openshift-origin-installer-e2e-gcp-upgrade-4.6 - 8 runs, 13% failed, 700% of failures match
  release-openshift-origin-installer-e2e-azure-upgrade-4.6 - 8 runs, 38% failed, 200% of failures match
  release-openshift-origin-installer-e2e-azure-upgrade-4.4-stable-to-4.5-ci - 8 runs, 50% failed, 150% of failures match
  release-openshift-origin-installer-e2e-aws-upgrade-4.2-to-4.3-to-4.4-to-4.5-ci - 2 runs, 50% failed, 200% of failures match
  release-openshift-origin-installer-e2e-aws-upgrade-4.1-to-4.2-to-4.3-to-4.4-nightly - 2 runs, 100% failed, 100% of failures match
  release-openshift-origin-installer-e2e-aws-upgrade-4.3-to-4.4-to-4.5-to-4.6-ci - 2 runs, 100% failed, 100% of failures match
  release-openshift-ocp-installer-e2e-ovirt-upgrade-4.5-stable-to-4.6-ci - 9 runs, 100% failed, 11% of failures match
  release-openshift-origin-installer-e2e-aws-upgrade-rollback-4.4-to-4.5 - 2 runs, 100% failed, 100% of failures match
  release-openshift-ocp-installer-e2e-ovirt-upgrade-4.4-stable-to-4.5-ci - 1 runs, 100% failed, 100% of failures match

Picking one [1]:

  $ curl -s https://gcsweb-ci.apps.ci.l2s4.p1.openshiftapps.com/gcs/origin-ci-test/logs/release-openshift-origin-installer-e2e-gcp-upgrade-4.6/1308037922410205184/artifacts/e2e-gcp-upgrade/container-logs/test.log | grep 'openshift-apiserver changed Degraded'
  Sep 21 14:48:22.693 E clusteroperator/openshift-apiserver changed Degraded to True: APIServerDeployment_UnavailablePod: APIServerDeploymentDegraded: 1 of 3 requested instances are unavailable for apiserver.openshift-apiserver
  Sep 21 14:54:38.353 W clusteroperator/openshift-apiserver changed Degraded to False

If all of the remaining matches are the same, this may be addressed via bug 1817588.

[1]: https://prow.ci.openshift.org/view/gs/origin-ci-test/logs/release-openshift-origin-installer-e2e-gcp-upgrade-4.6/1308037922410205184
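As an aside, the "% of failures match" figures can exceed 100% because the search counts matching runs whether or not the run failed overall. A sketch of the arithmetic for the gcp-upgrade-4.6 line above (variable names are mine; integer shell arithmetic, so the counts are approximate):

```shell
#!/bin/sh
# Figures copied from the CI-search output above:
# release-openshift-origin-installer-e2e-gcp-upgrade-4.6 - 8 runs, 13% failed, 700% of failures match
runs=8
failed_pct=13
match_pct=700

failures=$((runs * failed_pct / 100))    # 13% of 8 runs ~= 1 failed run
matches=$((failures * match_pct / 100))  # 700% of 1 failure = 7 matching runs
echo "$matches of $runs runs matched, but only $failures failed"
```

In other words, the symptom showed up in roughly 7 of 8 runs even though only 1 run actually failed, which is why the degradation reads as noisy rather than fatal.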
Other topics had higher prio. Adding UpcomingSprint.
Repeating comment 21; it would be nice if the API-server folks could self-serve this:

  $ w3m -dump -cols 200 'https://search.ci.openshift.org/?search=APIServerDeployment_UnavailablePod&maxAge=48h&context=1&type=bug%2Bjunit&name=^release.*upgrade' | grep 'failures match'
  release-openshift-origin-installer-e2e-aws-upgrade - 73 runs, 16% failed, 400% of failures match
  release-openshift-origin-installer-e2e-aws-upgrade-4.5-stable-to-4.6-ci - 23 runs, 22% failed, 360% of failures match
  release-openshift-origin-installer-e2e-aws-upgrade - 120 runs, 23% failed, 307% of failures match
  release-openshift-origin-installer-e2e-azure-upgrade-4.6-stable-to-4.7-ci - 9 runs, 44% failed, 200% of failures match
  release-openshift-origin-installer-e2e-azure-upgrade-4.7 - 8 runs, 50% failed, 200% of failures match
  release-openshift-origin-installer-e2e-gcp-upgrade-4.4-stable-to-4.5-ci - 8 runs, 25% failed, 300% of failures match
  release-openshift-origin-installer-e2e-gcp-ovn-upgrade-4.5-stable-to-4.6-ci - 8 runs, 100% failed, 75% of failures match
  release-openshift-origin-installer-e2e-gcp-upgrade-4.6-stable-to-4.7-ci - 8 runs, 13% failed, 700% of failures match
  release-openshift-origin-installer-e2e-aws-ovn-upgrade-4.5-stable-to-4.6-ci - 16 runs, 100% failed, 94% of failures match
  release-openshift-origin-installer-e2e-aws-upgrade-4.6-stable-to-4.7-ci - 8 runs, 25% failed, 350% of failures match
  release-openshift-okd-installer-e2e-aws-upgrade - 6 runs, 17% failed, 500% of failures match
  release-openshift-ocp-installer-e2e-ovirt-upgrade-4.4-stable-to-4.5-ci - 1 runs, 100% failed, 100% of failures match
  release-openshift-origin-installer-e2e-azure-ovn-upgrade-4.5-stable-to-4.6-ci - 8 runs, 100% failed, 100% of failures match
  release-openshift-origin-installer-e2e-azure-upgrade-4.6 - 8 runs, 50% failed, 200% of failures match
  release-openshift-origin-installer-e2e-gcp-upgrade - 5 runs, 40% failed, 200% of failures match
  release-openshift-origin-installer-e2e-azure-upgrade-4.5-stable-to-4.6-ci - 8 runs, 75% failed, 83% of failures match
  release-openshift-origin-installer-e2e-gcp-upgrade-4.5-stable-to-4.6-ci - 8 runs, 13% failed, 700% of failures match
  release-openshift-origin-installer-e2e-azure-upgrade-4.4 - 8 runs, 50% failed, 150% of failures match
  release-openshift-origin-installer-e2e-azure-upgrade-4.5 - 8 runs, 75% failed, 67% of failures match
  release-openshift-origin-installer-e2e-aws-upgrade-4.1-to-4.2-to-4.3-to-4.4-nightly - 2 runs, 100% failed, 100% of failures match
  release-openshift-origin-installer-e2e-gcp-ovn-upgrade-4.4-stable-to-4.5-ci - 8 runs, 25% failed, 300% of failures match
  release-openshift-origin-installer-e2e-azure-ovn-upgrade-4.4-stable-to-4.5-ci - 8 runs, 75% failed, 67% of failures match
  release-openshift-origin-installer-e2e-aws-upgrade-rollback-4.4-to-4.5 - 2 runs, 100% failed, 100% of failures match
  release-openshift-origin-installer-e2e-azure-upgrade-4.4-stable-to-4.5-ci - 8 runs, 50% failed, 125% of failures match

Picking one [1] (this time it's on the auth ClusterOperator and not the API-server ClusterOperator):

  $ curl -s https://storage.googleapis.com/origin-ci-test/logs/release-openshift-origin-installer-e2e-gcp-ovn-upgrade-4.5-stable-to-4.6-ci/1319114585923391488/build-log.txt | grep 'authentication changed Degraded'
  Oct 22 04:18:07.692 E clusteroperator/authentication changed Degraded to True: APIServerDeployment_UnavailablePod: APIServerDeploymentDegraded: 1 of 3 requested instances are unavailable for apiserver.openshift-oauth-apiserver (container is crashed in apiserver-55d946cb79-8n88l pod)
  Oct 22 04:18:16.881 W clusteroperator/authentication changed Degraded to False

If all of the remaining matches are the same, this may be addressed via bug 1817588.

[1]: https://prow.ci.openshift.org/view/gs/origin-ci-test/logs/release-openshift-origin-installer-e2e-gcp-ovn-upgrade-4.5-stable-to-4.6-ci/1319114585923391488
Out of support, closing.