Bug 1828631 - [4.3 upgrade][clusterversion] the cluster operator openshift-apiserver is degraded [NEEDINFO]
Summary: [4.3 upgrade][clusterversion] the cluster operator openshift-apiserver is degraded
Keywords:
Status: CLOSED WONTFIX
Alias: None
Product: OpenShift Container Platform
Classification: Red Hat
Component: openshift-apiserver
Version: 4.3.0
Hardware: Unspecified
OS: Unspecified
Priority: low
Severity: low
Target Milestone: ---
Target Release: ---
Assignee: Stefan Schimanski
QA Contact: Xingxing Xia
URL:
Whiteboard: LifecycleStale
Depends On:
Blocks:
 
Reported: 2020-04-28 00:50 UTC by Hongkai Liu
Modified: 2021-02-08 11:06 UTC
CC: 7 users

Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed: 2021-02-08 11:06:45 UTC
Target Upstream Version:
Embargoed:
mfojtik: needinfo?


Attachments
openshift-apiserver-operator.events.txt (54.09 KB, text/plain)
2020-04-28 00:50 UTC, Hongkai Liu

Description Hongkai Liu 2020-04-28 00:50:18 UTC
Created attachment 1682325 [details]
openshift-apiserver-operator.events.txt

During an upgrade of a cluster in the CI build farm, we saw a sequence of alerts and failure messages from clusterversion.

From: 4.3.17
To: 4.4.0-rc.12

Eventually the upgrade completed successfully (which is nice), but those alerts and messages are quite alarming.

I would like to create a bug for each of them so I can feel better about the next upgrade.

https://coreos.slack.com/archives/CHY2E1BL4/p1588028269133600
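
(The status snippet below looks like the output of "oc get clusterversion"; the command itself is an assumption added for context and was not part of the original report.)

$ oc get clusterversion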


NAME      VERSION   AVAILABLE   PROGRESSING   SINCE   STATUS
version   4.3.17    True        True          54m     Unable to apply 4.4.0-rc.12: the cluster operator openshift-apiserver is degraded

Comment 1 Stephen Cuppett 2020-04-28 13:45:27 UTC
Setting target release to current development version (4.5) for investigation. Where fixes (if any) are required/requested for prior versions, cloned BZs will be created when appropriate.

Comment 3 Stefan Schimanski 2020-04-28 14:08:14 UTC
Please add must-gather output.

Comment 6 W. Trevor King 2020-04-29 00:02:16 UTC
Telemetry suggests the Degraded condition was APIServerDeployment_UnavailablePod, which is probably not giving the control plane machines long enough for a healthy reboot (or it is giving them long enough, but reboots aren't healthy?).

Comment 9 Michal Fojtik 2020-05-20 10:08:30 UTC
I'm adding UpcomingSprint because I lack the information to properly root cause the bug. I will revisit this bug when the information is available. If you have further information on the current state of the bug, please update it and remove the "LifecycleStale" keyword, otherwise this bug will be automatically closed in 7 days. The information can be, for example, that the problem still occurs, that you still want the feature, that more information is needed, or that the bug is (for whatever reason) no longer relevant.

Comment 10 W. Trevor King 2020-05-21 05:56:08 UTC
I see no LifecycleStale keyword?  But (luckily?) this issue is not rare.  Checking in CI search [1]:

  release-openshift-origin-installer-e2e-aws-upgrade - 224 runs, 25% failed, 193% of failures match

including [2] which passed 4.4.4 -> 4.5.0-0.ci-2020-05-21-034945 and has:

  $ curl -s https://storage.googleapis.com/origin-ci-test/logs/release-openshift-origin-installer-e2e-aws-upgrade/29703/artifacts/e2e-aws-upgrade/container-logs/test.log | grep 'openshift-apiserver changed Degraded'
  May 21 05:01:35.785 E clusteroperator/openshift-apiserver changed Degraded to True: APIServerDeployment_UnavailablePod: APIServerDeploymentDegraded: 1 of 3 requested instances are unavailable for apiserver.openshift-apiserver
  May 21 05:15:31.821 W clusteroperator/openshift-apiserver changed Degraded to False

[1]: https://search.apps.build01.ci.devcluster.openshift.com/?search=APIServerDeployment_UnavailablePod&maxAge=48h&context=1&type=bug%2Bjunit&name=%5Erelease.*upgrade&maxMatches=5&maxBytes=20971520&groupBy=job
[2]: https://deck-ci.apps.ci.l2s4.p1.openshiftapps.com/view/gcs/origin-ci-test/logs/release-openshift-origin-installer-e2e-aws-upgrade/29703

Comment 11 Stefan Schimanski 2020-06-18 10:18:00 UTC
> Not clear to me.  Could be must-gather limiting event depth, could be me looking in the wrong place, or could be API-server operator not actually publishing events about these messages.

Events have a TTL, 24h AFAIK.

Comment 13 W. Trevor King 2020-06-26 03:57:32 UTC
From Hongkai's new must-gather:

$ yaml2json <namespaces/openshift-apiserver-operator/core/events.yaml | jq -r '[.items[] | .timePrefix = if .firstTimestamp == null or .firstTimestamp == "null" then .eventTime else .firstTimestamp + " - " + .lastTimestamp + " (" + (.count | tostring) + ")" end] | sort_by(.timePrefix)[] | .timePrefix + " " + .metadata.namespace + " " + .message' | grep Degraded
2020-06-25T19:50:08Z - 2020-06-25T19:50:08Z (1) openshift-apiserver-operator Status for clusteroperator/openshift-apiserver changed: Degraded message changed from "" to "WorkloadDegraded: \"deployments\": invalid dependency reference: \"etcdserver: leader changed\"\nWorkloadDegraded: "
2020-06-25T19:50:08Z - 2020-06-25T19:50:08Z (1) openshift-apiserver-operator Status for clusteroperator/openshift-apiserver changed: Degraded message changed from "WorkloadDegraded: \"deployments\": invalid dependency reference: \"etcdserver: leader changed\"\nWorkloadDegraded: " to "WorkloadDegraded: \"deployments\": invalid dependency reference: \"etcdserver: leader changed\"\nWorkloadDegraded: \nAPIServerDeploymentDegraded: deployment/apiserver.openshift-apiserver: could not be retrieved"
2020-06-25T19:50:21Z - 2020-06-25T19:50:21Z (1) openshift-apiserver-operator Status for clusteroperator/openshift-apiserver changed: Degraded message changed from "WorkloadDegraded: \"deployments\": invalid dependency reference: \"etcdserver: leader changed\"\nWorkloadDegraded: \nAPIServerDeploymentDegraded: deployment/apiserver.openshift-apiserver: could not be retrieved" to "APIServerDeploymentDegraded: deployment/apiserver.openshift-apiserver: could not be retrieved"
2020-06-25T19:50:21Z - 2020-06-25T19:50:21Z (1) openshift-apiserver-operator Status for clusteroperator/openshift-apiserver changed: Degraded message changed from "APIServerDeploymentDegraded: deployment/apiserver.openshift-apiserver: could not be retrieved" to ""
2020-06-25T19:52:55Z - 2020-06-25T20:10:44Z (2) openshift-apiserver-operator Status for clusteroperator/openshift-apiserver changed: Degraded message changed from "" to "APIServerDeploymentDegraded: 1 of 3 requested instances are unavailable"
2020-06-25T19:54:09Z - 2020-06-25T19:54:09Z (1) openshift-apiserver-operator Status for clusteroperator/openshift-apiserver changed: {"conditions":[{"type":"Degraded","status":"False","lastTransitionTime":"2020-06-23T22:04:52Z","reason":"AsExpected","message":"APIServerDeploymentDegraded: 1 of 3 requested instances are unavailable"},{"type":"Progressing","status":"False","lastTransitionTime":"2020-06-25T19:53:04Z","reason":"AsExpected"},{"type":"Available","status":"True","lastTransitionTime":"2020-06-25T19:52:40Z","reason":"AsExpected"},{"type":"Upgradeable","status":"True","lastTransitionTime":"2020-05-21T19:35:57Z","reason":"AsExpected"}],"versions":[{"name":"operator","version":"4.4.10"},{"name":"openshift-apiserver","version":"4.4.
2020-06-25T19:54:09Z - 2020-06-25T19:54:09Z (1) openshift-apiserver-operator Status for clusteroperator/openshift-apiserver changed: Degraded message changed from "APIServerDeploymentDegraded: 1 of 3 requested instances are unavailable" to ""
2020-06-25T19:54:09Z - 2020-06-25T19:54:09Z (1) openshift-apiserver-operator (combined from similar events): Status for clusteroperator/openshift-apiserver changed: Degraded message changed from "APIServerDeploymentDegraded: 1 of 3 requested instances are unavailable" to ""
2020-06-25T20:12:58Z - 2020-06-25T20:12:58Z (1) openshift-apiserver-operator Status for clusteroperator/openshift-apiserver changed: Degraded changed from False to True ("APIServerDeploymentDegraded: 1 of 3 requested instances are unavailable")
2020-06-25T20:28:07Z - 2020-06-25T20:28:07Z (1) openshift-apiserver-operator Status for clusteroperator/openshift-apiserver changed: Degraded message changed from "APIServerDeploymentDegraded: 1 of 3 requested instances are unavailable" to "APIServerDeploymentDegraded: 2 of 3 requested instances are unavailable"
2020-06-25T20:28:14Z - 2020-06-25T20:28:14Z (1) openshift-apiserver-operator Status for clusteroperator/openshift-apiserver changed: Degraded message changed from "APIServerDeploymentDegraded: 2 of 3 requested instances are unavailable" to "APIServerDeploymentDegraded: 1 of 3 requested instances are unavailable"
2020-06-25T20:36:54Z - 2020-06-25T20:36:54Z (1) openshift-apiserver-operator Status for clusteroperator/openshift-apiserver changed: Degraded changed from True to False ("")

So some WorkloadDegraded for etcd leader changes, and also some APIServerDeploymentDegraded (possibly secondary to the etcd elections).  Moving this to the etcd folks to see if they're comfortable with the etcd handoffs.  If they are, I expect they'll kick it back to the API-server folks to be more graceful about the occasional etcd leader election.
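
A narrower variant of the pipeline above that keeps only the actual Degraded condition flips, rather than every message that mentions Degraded (a sketch added for reference; it assumes the same must-gather layout and the same yaml2json helper used above and was not part of the original comment):

$ yaml2json <namespaces/openshift-apiserver-operator/core/events.yaml \
    | jq -r '.items[]
             | select(.message | test("Degraded changed from (False to True|True to False)"))
             | (.firstTimestamp // .eventTime) + " " + .message'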

Comment 14 W. Trevor King 2020-06-26 04:06:27 UTC
Ah, and I don't have to punt to the etcd folks to say that the timestamps on those degradations show etcd leader election outages bracketed between 19:50:08Z and 19:50:21Z.  It's possible that the API server was sitting on the etcd outage for a bit before going degraded, but if not the outage was only 13s long.  That's still not great.  I'm not sure what the usual election time is.  Also, it looks like etcd attempts to queue writes during elections.  And I guess the API-server could internally retry writes if it lost them by writing to the dying member.  Is there a contract between the API-server folks and the etcd folks about what sort of queue times are acceptable?  Or about how the API-server on a machine tries to exit the LB and gracefully drain out before the local etcd leader abdicates?
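
For what it's worth, the 13s figure can be checked directly from the event timestamps in comment 13 (a minimal sketch, assuming GNU date; not part of the original comment):

$ start=$(date -u -d '2020-06-25T19:50:08Z' +%s)
$ end=$(date -u -d '2020-06-25T19:50:21Z' +%s)
$ echo "$((end - start))s"
13s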

Comment 18 Dan Mace 2020-08-18 14:24:07 UTC
I'm not clear what etcd bug is perceived to exist here. Etcd is reporting healthy. The election window reported in the last must-gather (19:50:08Z and 19:50:21Z) corresponds with the first creation of the etcd cluster pods during cluster pivot. Isn't that normal?

The only other time the etcd containers are re-created is during the upgrade. I don't have any prometheus dump from these clusters, but the etcd-operator reports "2.5 elections in the last 5m" (the rounding issue has since been fixed) around the time of the upgrade, which seems expected. As to the duration of those outages, it would come down to the MCO, the nodes rebooting, etc. As to inducing election on reboot, that topic is being addressed in a variety of ways elsewhere and this bug adds nothing new to that story.
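
(An "elections in the last 5m" number like that presumably comes from an increase() over etcd's leader-change counter. One way to check it on a live cluster is sketched below; etcd_server_leader_changes_seen_total is standard etcd instrumentation, but the route name and token handling here are assumptions, not something taken from this bug.)

$ TOKEN=$(oc whoami -t)
$ PROM=$(oc -n openshift-monitoring get route prometheus-k8s -o jsonpath='{.spec.host}')
$ curl -sk -H "Authorization: Bearer $TOKEN" \
    --data-urlencode 'query=increase(etcd_server_leader_changes_seen_total[5m])' \
    "https://$PROM/api/v1/query"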

Absent any evidence of a serious etcd bug, I'm going to kick it over to apiserver for a last look, but unless we have some specific data to measure impact and a fix, I'm not really sure what to do with this report.

If I got something grievously wrong, please let me know.

Comment 19 Stefan Schimanski 2020-09-11 14:45:42 UTC
This needs further triage. Labelling with UpcomingSprint.

Comment 20 Michal Fojtik 2020-09-17 14:49:36 UTC
This bug hasn't had any activity in the last 30 days. Maybe the problem got resolved, was a duplicate of something else, or became less pressing for some reason - or maybe it's still relevant but just hasn't been looked at yet. As such, we're marking this bug as "LifecycleStale" and decreasing the severity/priority. If you have further information on the current state of the bug, please update it, otherwise this bug can be closed in about 7 days. The information can be, for example, that the problem still occurs, that you still want the feature, that more information is needed, or that the bug is (for whatever reason) no longer relevant. Additionally, you can add LifecycleFrozen into Keywords if you think this bug should never be marked as stale. Please consult with bug assignee before you do that.

Comment 21 W. Trevor King 2020-09-21 23:33:00 UTC
Refreshing the search from comment 10, there are still lots of these in CI:

$ w3m -dump -cols 200 'https://search.ci.openshift.org/?search=APIServerDeployment_UnavailablePod&maxAge=48h&context=1&type=bug%2Bjunit&name=^release.*upgrade' | grep 'failures match'
release-openshift-origin-installer-e2e-aws-upgrade - 73 runs, 16% failed, 400% of failures match
release-openshift-origin-installer-e2e-gcp-upgrade - 21 runs, 24% failed, 340% of failures match
release-openshift-okd-installer-e2e-aws-upgrade - 12 runs, 17% failed, 550% of failures match
release-openshift-origin-installer-e2e-azure-upgrade-4.5-stable-to-4.6-ci - 8 runs, 50% failed, 150% of failures match
release-openshift-origin-installer-e2e-azure-upgrade-4.4 - 7 runs, 29% failed, 300% of failures match
release-openshift-origin-installer-e2e-azure-upgrade-4.5 - 8 runs, 38% failed, 200% of failures match
release-openshift-origin-installer-e2e-gcp-upgrade-4.4-stable-to-4.5-ci - 7 runs, 14% failed, 600% of failures match
release-openshift-origin-installer-e2e-gcp-upgrade-4.6 - 8 runs, 13% failed, 700% of failures match
release-openshift-origin-installer-e2e-azure-upgrade-4.6 - 8 runs, 38% failed, 200% of failures match
release-openshift-origin-installer-e2e-azure-upgrade-4.4-stable-to-4.5-ci - 8 runs, 50% failed, 150% of failures match
release-openshift-origin-installer-e2e-aws-upgrade-4.2-to-4.3-to-4.4-to-4.5-ci - 2 runs, 50% failed, 200% of failures match
release-openshift-origin-installer-e2e-aws-upgrade-4.1-to-4.2-to-4.3-to-4.4-nightly - 2 runs, 100% failed, 100% of failures match
release-openshift-origin-installer-e2e-aws-upgrade-4.3-to-4.4-to-4.5-to-4.6-ci - 2 runs, 100% failed, 100% of failures match
release-openshift-ocp-installer-e2e-ovirt-upgrade-4.5-stable-to-4.6-ci - 9 runs, 100% failed, 11% of failures match
release-openshift-origin-installer-e2e-aws-upgrade-rollback-4.4-to-4.5 - 2 runs, 100% failed, 100% of failures match
release-openshift-ocp-installer-e2e-ovirt-upgrade-4.4-stable-to-4.5-ci - 1 runs, 100% failed, 100% of failures match

Picking one [1]:

$ curl -s https://gcsweb-ci.apps.ci.l2s4.p1.openshiftapps.com/gcs/origin-ci-test/logs/release-openshift-origin-installer-e2e-gcp-upgrade-4.6/1308037922410205184/artifacts/e2e-gcp-upgrade/container-logs/test.log | grep 'openshift-apiserver changed Degraded'
Sep 21 14:48:22.693 E clusteroperator/openshift-apiserver changed Degraded to True: APIServerDeployment_UnavailablePod: APIServerDeploymentDegraded: 1 of 3 requested instances are unavailable for apiserver.openshift-apiserver
Sep 21 14:54:38.353 W clusteroperator/openshift-apiserver changed Degraded to False

If all of the remaining matches are the same, this may be addressed via bug 1817588.

[1]: https://prow.ci.openshift.org/view/gs/origin-ci-test/logs/release-openshift-origin-installer-e2e-gcp-upgrade-4.6/1308037922410205184

Comment 22 Stefan Schimanski 2020-10-02 09:25:26 UTC
Other topics had higher prio. Adding UpcomingSprint.

Comment 23 Michal Fojtik 2020-10-22 00:12:06 UTC
This bug hasn't had any activity in the last 30 days. Maybe the problem got resolved, was a duplicate of something else, or became less pressing for some reason - or maybe it's still relevant but just hasn't been looked at yet. As such, we're marking this bug as "LifecycleStale" and decreasing the severity/priority. If you have further information on the current state of the bug, please update it, otherwise this bug can be closed in about 7 days. The information can be, for example, that the problem still occurs, that you still want the feature, that more information is needed, or that the bug is (for whatever reason) no longer relevant. Additionally, you can add LifecycleFrozen into Keywords if you think this bug should never be marked as stale. Please consult with bug assignee before you do that.

Comment 24 W. Trevor King 2020-10-22 19:34:22 UTC
Repeating comment 21; it would be nice if the API-server folks could self-serve this:

$ w3m -dump -cols 200 'https://search.ci.openshift.org/?search=APIServerDeployment_UnavailablePod&maxAge=48h&context=1&type=bug%2Bjunit&name=^release.*upgrade' | grep 'failures match'
release-openshift-origin-installer-e2e-aws-upgrade - 73 runs, 16% failed, 400% of failures match
release-openshift-origin-installer-e2e-aws-upgrade-4.5-stable-to-4.6-ci - 23 runs, 22% failed, 360% of failures match
release-openshift-origin-installer-e2e-aws-upgrade - 120 runs, 23% failed, 307% of failures match
release-openshift-origin-installer-e2e-azure-upgrade-4.6-stable-to-4.7-ci - 9 runs, 44% failed, 200% of failures match
release-openshift-origin-installer-e2e-azure-upgrade-4.7 - 8 runs, 50% failed, 200% of failures match
release-openshift-origin-installer-e2e-gcp-upgrade-4.4-stable-to-4.5-ci - 8 runs, 25% failed, 300% of failures match
release-openshift-origin-installer-e2e-gcp-ovn-upgrade-4.5-stable-to-4.6-ci - 8 runs, 100% failed, 75% of failures match
release-openshift-origin-installer-e2e-gcp-upgrade-4.6-stable-to-4.7-ci - 8 runs, 13% failed, 700% of failures match
release-openshift-origin-installer-e2e-aws-ovn-upgrade-4.5-stable-to-4.6-ci - 16 runs, 100% failed, 94% of failures match
release-openshift-origin-installer-e2e-aws-upgrade-4.6-stable-to-4.7-ci - 8 runs, 25% failed, 350% of failures match
release-openshift-okd-installer-e2e-aws-upgrade - 6 runs, 17% failed, 500% of failures match
release-openshift-ocp-installer-e2e-ovirt-upgrade-4.4-stable-to-4.5-ci - 1 runs, 100% failed, 100% of failures match
release-openshift-origin-installer-e2e-azure-ovn-upgrade-4.5-stable-to-4.6-ci - 8 runs, 100% failed, 100% of failures match
release-openshift-origin-installer-e2e-azure-upgrade-4.6 - 8 runs, 50% failed, 200% of failures match
release-openshift-origin-installer-e2e-gcp-upgrade - 5 runs, 40% failed, 200% of failures match
release-openshift-origin-installer-e2e-azure-upgrade-4.5-stable-to-4.6-ci - 8 runs, 75% failed, 83% of failures match
release-openshift-origin-installer-e2e-gcp-upgrade-4.5-stable-to-4.6-ci - 8 runs, 13% failed, 700% of failures match
release-openshift-origin-installer-e2e-azure-upgrade-4.4 - 8 runs, 50% failed, 150% of failures match
release-openshift-origin-installer-e2e-azure-upgrade-4.5 - 8 runs, 75% failed, 67% of failures match
release-openshift-origin-installer-e2e-aws-upgrade-4.1-to-4.2-to-4.3-to-4.4-nightly - 2 runs, 100% failed, 100% of failures match
release-openshift-origin-installer-e2e-gcp-ovn-upgrade-4.4-stable-to-4.5-ci - 8 runs, 25% failed, 300% of failures match
release-openshift-origin-installer-e2e-azure-ovn-upgrade-4.4-stable-to-4.5-ci - 8 runs, 75% failed, 67% of failures match
release-openshift-origin-installer-e2e-aws-upgrade-rollback-4.4-to-4.5 - 2 runs, 100% failed, 100% of failures match
release-openshift-origin-installer-e2e-azure-upgrade-4.4-stable-to-4.5-ci - 8 runs, 50% failed, 125% of failures match

Picking one [1] (this time it's on the auth ClusterOperator and not the API-server ClusterOperator):

$ curl -s https://storage.googleapis.com/origin-ci-test/logs/release-openshift-origin-installer-e2e-gcp-ovn-upgrade-4.5-stable-to-4.6-ci/1319114585923391488/build-log.txt | grep 'authentication changed Degraded'
Oct 22 04:18:07.692 E clusteroperator/authentication changed Degraded to True: APIServerDeployment_UnavailablePod: APIServerDeploymentDegraded: 1 of 3 requested instances are unavailable for apiserver.openshift-oauth-apiserver (container is crashed in apiserver-55d946cb79-8n88l pod)
Oct 22 04:18:16.881 W clusteroperator/authentication changed Degraded to False

If all of the remaining matches are the same, this may be addressed via bug 1817588.

[1]: https://prow.ci.openshift.org/view/gs/origin-ci-test/logs/release-openshift-origin-installer-e2e-gcp-ovn-upgrade-4.5-stable-to-4.6-ci/1319114585923391488

Comment 25 Stefan Schimanski 2020-10-25 18:31:01 UTC
Other topics had higher prio. Adding UpcomingSprint.

Comment 26 Michal Fojtik 2020-11-21 20:12:06 UTC
This bug hasn't had any activity in the last 30 days. Maybe the problem got resolved, was a duplicate of something else, or became less pressing for some reason - or maybe it's still relevant but just hasn't been looked at yet. As such, we're marking this bug as "LifecycleStale" and decreasing the severity/priority. If you have further information on the current state of the bug, please update it, otherwise this bug can be closed in about 7 days. The information can be, for example, that the problem still occurs, that you still want the feature, that more information is needed, or that the bug is (for whatever reason) no longer relevant. Additionally, you can add LifecycleFrozen into Keywords if you think this bug should never be marked as stale. Please consult with bug assignee before you do that.

Comment 27 Stefan Schimanski 2021-02-08 11:06:45 UTC
Out of support, closing.

