Bug 1929922 - AuthenticationOperator crashes / is degraded during install [NEEDINFO]
Summary: AuthenticationOperator crashes / is degraded during install
Keywords:
Status: CLOSED DEFERRED
Alias: None
Product: OpenShift Container Platform
Classification: Red Hat
Component: apiserver-auth
Version: 4.7
Hardware: Unspecified
OS: Unspecified
Priority: low
Severity: medium
Target Milestone: ---
Assignee: Standa Laznicka
QA Contact:
URL:
Whiteboard: tag-ci LifecycleStale
Depends On:
Blocks:
 
Reported: 2021-02-17 23:04 UTC by Clayton Coleman
Modified: 2022-01-19 11:25 UTC
CC: 7 users

Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed: 2022-01-19 11:25:04 UTC
Target Upstream Version:
Embargoed:
mfojtik: needinfo?



Description Clayton Coleman 2021-02-17 23:04:34 UTC
The authentication operator reports down during a 4.6 to 4.7 upgrade, which means the pod is crashing, failing, not ready, or not visible to metrics during the run.

https://prow.ci.openshift.org/view/gcs/origin-ci-test/logs/release-openshift-origin-installer-e2e-aws-upgrade-4.6-stable-to-4.7-ci/1362089565350793216

ALERTS{alertname="ClusterOperatorDown", alertstate="firing", endpoint="metrics", instance="10.0.179.43:9099", job="cluster-version-operator", name="authentication", namespace="openshift-cluster-version", pod="cluster-version-operator-7775745d8f-6fdfc", service="cluster-version-operator", severity="critical", version="4.6.18"}

is firing at 18:00:19Z during this run, which means one of the instances was not responding to metrics collection for at least 10 minutes. That is bad.

Needs investigation: an operator being out of commission for 10m during an upgrade is a sign something is seriously wrong.

I will accept deferral out of 4.7.0 to 4.7.z if it can be shown this is due to another serious error that we have tracked (as opposed to the authentication operator simply crashing and dying for 10m at a time, which would be a blocker).
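For context on the "at least 10m" claim above: ClusterOperatorDown is a Prometheus alert with a pending window, so it only fires once the operator has been continuously down for the full duration. A minimal sketch of that "for"-clause behavior, assuming a 10-minute window and hypothetical scrape samples (the real alert is evaluated by Prometheus against `cluster_operator_up`):

```python
from datetime import datetime, timedelta

FOR_DURATION = timedelta(minutes=10)  # assumed "for: 10m" on ClusterOperatorDown

def alert_fires(down_samples, now):
    """Return True if the operator has been continuously down for FOR_DURATION.

    down_samples: chronologically sorted list of (timestamp, is_down) scrape results.
    """
    pending_since = None
    for ts, is_down in down_samples:
        if is_down:
            if pending_since is None:
                pending_since = ts  # alert enters "pending" state
        else:
            pending_since = None  # condition cleared; pending window resets
    return pending_since is not None and now - pending_since >= FOR_DURATION

# Hypothetical scrape series: operator down from 17:50, one sample per minute.
t0 = datetime(2021, 2, 16, 17, 50)
samples = [(t0 + timedelta(minutes=m), True) for m in range(11)]
print(alert_fires(samples, datetime(2021, 2, 16, 18, 0, 19)))  # True
```

This is why a firing alert implies a sustained outage rather than a single missed scrape: any healthy sample in between resets the pending window.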

Comment 1 Clayton Coleman 2021-02-18 01:46:35 UTC
This happens right at the end of install (and just before install). Because of that, I'm moving it to 4.7.z (operators should not be degraded during install). I will be adding a test condition that fails an install if we catch a degraded operator during install; for the alert to have fired, we would have been failing for much longer.

Comment 3 Standa Laznicka 2021-02-18 10:24:02 UTC
I couldn't find anything interesting happening in the openshift-authentication-operator namespace at around 18:00Z, when the old 4.6 pod was still in place.

I browsed the events of the openshift-authentication-operator NS but they only show that the operator was alive and doing its job. There was no sign as to why it would not respond to /metrics requests.

I am, however, not sure why this BZ should block 4.7, as the observed behavior did not occur during the upgrade, only during the installation. The events of the cluster-authentication-operator prove that the operator was alive and well, and went degraded=false, available=true at 17:59:48Z.

I can see that comment 1 mentions degraded conditions during install. If that's what's really causing ALERTS{alertname="ClusterOperatorDown"} to fire, then this is new behavior and I wouldn't expect it to be fixed earlier than 4.8.

Comment 5 Stefan Schimanski 2021-03-16 16:31:35 UTC
Not a blocker as already present in 4.6/4.7 and hence no regression.

Comment 7 Michal Fojtik 2021-04-18 12:00:17 UTC
This bug hasn't had any activity in the last 30 days. Maybe the problem got resolved, was a duplicate of something else, or became less pressing for some reason - or maybe it's still relevant but just hasn't been looked at yet. As such, we're marking this bug as "LifecycleStale" and decreasing the severity/priority. If you have further information on the current state of the bug, please update it, otherwise this bug can be closed in about 7 days. The information can be, for example, that the problem still occurs, that you still want the feature, that more information is needed, or that the bug is (for whatever reason) no longer relevant. Additionally, you can add LifecycleFrozen into Keywords if you think this bug should never be marked as stale. Please consult with bug assignee before you do that.

Comment 8 W. Trevor King 2021-04-18 20:36:45 UTC
Still very popular to have authentication go Degraded=True or Available=False during CI runs:

$ w3m -dump -cols 200 'https://search.ci.openshift.org/?search=clusteroperator%2Fauthentication+.*%28Degraded%7CAvailable%29&maxAge=24h&type=junit' | grep 'failures match' | sort
periodic-ci-openshift-cluster-api-provider-kubevirt-release-4.9-sanity-ovn (all) - 1 runs, 100% failed, 100% of failures match = 100% impact
periodic-ci-openshift-release-master-ci-4.6-e2e-aws-upgrade-rollback (all) - 1 runs, 100% failed, 100% of failures match = 100% impact
periodic-ci-openshift-release-master-ci-4.6-e2e-azure-upgrade (all) - 4 runs, 50% failed, 200% of failures match = 100% impact
periodic-ci-openshift-release-master-ci-4.6-e2e-gcp-upgrade (all) - 4 runs, 25% failed, 400% of failures match = 100% impact
periodic-ci-openshift-release-master-ci-4.6-upgrade-from-stable-4.5-e2e-azure-ovn-upgrade (all) - 4 runs, 100% failed, 75% of failures match = 75% impact
periodic-ci-openshift-release-master-ci-4.7-upgrade-from-stable-4.6-e2e-aws-ovn-upgrade (all) - 9 runs, 11% failed, 900% of failures match = 100% impact
periodic-ci-openshift-release-master-ci-4.7-upgrade-from-stable-4.6-e2e-aws-upgrade (all) - 4 runs, 50% failed, 200% of failures match = 100% impact
periodic-ci-openshift-release-master-ci-4.7-upgrade-from-stable-4.6-e2e-gcp-ovn-upgrade (all) - 4 runs, 50% failed, 200% of failures match = 100% impact
periodic-ci-openshift-release-master-ci-4.7-upgrade-from-stable-4.6-e2e-gcp-upgrade (all) - 4 runs, 25% failed, 400% of failures match = 100% impact
periodic-ci-openshift-release-master-ci-4.7-upgrade-from-stable-4.6-e2e-ovirt-upgrade (all) - 2 runs, 100% failed, 100% of failures match = 100% impact
periodic-ci-openshift-release-master-ci-4.7-upgrade-from-stable-4.6-e2e-vsphere-upgrade (all) - 1 runs, 100% failed, 100% of failures match = 100% impact
periodic-ci-openshift-release-master-ci-4.8-e2e-aws-serial (all) - 12 runs, 42% failed, 100% of failures match = 42% impact
periodic-ci-openshift-release-master-ci-4.8-e2e-azure-upgrade (all) - 1 runs, 100% failed, 100% of failures match = 100% impact
periodic-ci-openshift-release-master-ci-4.8-e2e-gcp-compact-serial (all) - 1 runs, 100% failed, 100% of failures match = 100% impact
periodic-ci-openshift-release-master-ci-4.8-e2e-gcp-upgrade (all) - 12 runs, 50% failed, 200% of failures match = 100% impact
periodic-ci-openshift-release-master-ci-4.8-upgrade-from-stable-4.7-e2e-aws-ovn-upgrade (all) - 16 runs, 100% failed, 94% of failures match = 94% impact
periodic-ci-openshift-release-master-ci-4.8-upgrade-from-stable-4.7-e2e-aws-upgrade (all) - 11 runs, 82% failed, 122% of failures match = 100% impact
periodic-ci-openshift-release-master-ci-4.8-upgrade-from-stable-4.7-e2e-azure-ovn-upgrade (all) - 4 runs, 100% failed, 100% of failures match = 100% impact
periodic-ci-openshift-release-master-ci-4.8-upgrade-from-stable-4.7-e2e-azure-upgrade (all) - 1 runs, 100% failed, 100% of failures match = 100% impact
periodic-ci-openshift-release-master-ci-4.8-upgrade-from-stable-4.7-e2e-gcp-ovn-upgrade (all) - 4 runs, 100% failed, 100% of failures match = 100% impact
periodic-ci-openshift-release-master-ci-4.8-upgrade-from-stable-4.7-e2e-gcp-upgrade (all) - 1 runs, 100% failed, 100% of failures match = 100% impact
periodic-ci-openshift-release-master-ci-4.8-upgrade-from-stable-4.7-e2e-openstack-upgrade (all) - 1 runs, 100% failed, 100% of failures match = 100% impact
periodic-ci-openshift-release-master-ci-4.8-upgrade-from-stable-4.7-e2e-ovirt-upgrade (all) - 2 runs, 100% failed, 50% of failures match = 50% impact
periodic-ci-openshift-release-master-ci-4.8-upgrade-from-stable-4.7-e2e-vsphere-upgrade (all) - 1 runs, 100% failed, 100% of failures match = 100% impact
periodic-ci-openshift-release-master-ci-4.9-e2e-gcp-upgrade (all) - 10 runs, 90% failed, 111% of failures match = 100% impact
periodic-ci-openshift-release-master-nightly-4.6-upgrade-from-stable-4.5-e2e-metal-ipi-upgrade (all) - 1 runs, 100% failed, 100% of failures match = 100% impact
periodic-ci-openshift-release-master-nightly-4.8-e2e-aws (all) - 16 runs, 50% failed, 38% of failures match = 19% impact
periodic-ci-openshift-release-master-nightly-4.8-e2e-aws-serial (all) - 9 runs, 22% failed, 100% of failures match = 22% impact
periodic-ci-openshift-release-master-nightly-4.8-e2e-aws-upgrade (all) - 7 runs, 57% failed, 25% of failures match = 14% impact
periodic-ci-openshift-release-master-nightly-4.8-e2e-aws-workers-rhel7 (all) - 4 runs, 25% failed, 100% of failures match = 25% impact
periodic-ci-openshift-release-master-nightly-4.8-e2e-metal-assisted (all) - 7 runs, 29% failed, 150% of failures match = 43% impact
periodic-ci-openshift-release-master-nightly-4.8-e2e-metal-ipi-upgrade (all) - 7 runs, 71% failed, 20% of failures match = 14% impact
periodic-ci-openshift-release-master-nightly-4.8-e2e-vsphere-serial (all) - 8 runs, 100% failed, 13% of failures match = 13% impact
periodic-ci-openshift-release-master-nightly-4.8-e2e-vsphere-upi-serial (all) - 8 runs, 25% failed, 50% of failures match = 13% impact
periodic-ci-openshift-release-master-nightly-4.8-upgrade-from-stable-4.7-e2e-aws-upgrade (all) - 7 runs, 100% failed, 100% of failures match = 100% impact
periodic-ci-openshift-release-master-okd-4.8-e2e-aws (all) - 6 runs, 50% failed, 33% of failures match = 17% impact
pull-ci-openshift-cluster-kube-apiserver-operator-master-e2e-upgrade (all) - 2 runs, 50% failed, 200% of failures match = 100% impact
...
pull-ci-openshift-ovn-kubernetes-master-e2e-openstack-ovn (all) - 11 runs, 100% failed, 18% of failures match = 18% impact
release-openshift-ocp-installer-e2e-gcp-ovn-4.8 (all) - 7 runs, 71% failed, 80% of failures match = 57% impact
release-openshift-ocp-installer-e2e-openstack-4.4 (all) - 2 runs, 100% failed, 50% of failures match = 50% impact
release-openshift-ocp-installer-upgrade-remote-libvirt-s390x-4.7-to-4.8 (all) - 2 runs, 100% failed, 50% of failures match = 50% impact
release-openshift-okd-installer-e2e-aws-upgrade (all) - 12 runs, 83% failed, 120% of failures match = 100% impact
release-openshift-origin-installer-e2e-aws-disruptive-4.6 (all) - 1 runs, 100% failed, 100% of failures match = 100% impact
release-openshift-origin-installer-e2e-aws-disruptive-4.7 (all) - 1 runs, 100% failed, 100% of failures match = 100% impact
release-openshift-origin-installer-e2e-aws-upgrade-4.1-stable-to-4.2-ci (all) - 6 runs, 17% failed, 500% of failures match = 83% impact
release-openshift-origin-installer-e2e-aws-upgrade-4.1-to-4.2-to-4.3-nightly (all) - 1 runs, 100% failed, 100% of failures match = 100% impact
release-openshift-origin-installer-e2e-aws-upgrade-4.1-to-4.2-to-4.3-to-4.4-nightly (all) - 1 runs, 100% failed, 100% of failures match = 100% impact
release-openshift-origin-installer-e2e-aws-upgrade-4.2-to-4.3-to-4.4-to-4.5-ci (all) - 1 runs, 100% failed, 100% of failures match = 100% impact
release-openshift-origin-installer-e2e-aws-upgrade-rollback-4.1-to-4.2 (all) - 2 runs, 100% failed, 100% of failures match = 100% impact
release-openshift-origin-installer-e2e-aws-upgrade-rollback-4.2-to-4.3 (all) - 1 runs, 100% failed, 100% of failures match = 100% impact
release-openshift-origin-installer-e2e-azure-upgrade-4.3 (all) - 2 runs, 100% failed, 50% of failures match = 50% impact
release-openshift-origin-installer-e2e-gcp-ovn-upgrade-4.5-stable-to-4.6-ci (all) - 4 runs, 50% failed, 100% of failures match = 50% impact
release-openshift-origin-installer-launch-gcp (all) - 23 runs, 48% failed, 82% of failures match = 39% impact
release-openshift-origin-installer-old-rhcos-e2e-aws-4.2 (all) - 2 runs, 100% failed, 100% of failures match = 100% impact
release-openshift-origin-installer-old-rhcos-e2e-aws-4.7 (all) - 1 runs, 100% failed, 100% of failures match = 100% impact

Comment 9 Michal Fojtik 2021-04-18 21:00:25 UTC
The LifecycleStale keyword was removed because the bug got commented on recently.
The bug assignee was notified.

Comment 10 Clayton Coleman 2021-04-26 16:43:38 UTC
Roughly 10-20% of our CI runs fail because the authentication operator is either degraded or unavailable during normal e2e runs.

I'm bumping priority and severity.  This is one of the top CI blockers and is impacting the ability of teams to merge.

https://search.ci.openshift.org/?search=ClusterOperatorDown.*authentication&maxAge=48h&context=1&type=bug%2Bjunit&name=&excludeName=&maxMatches=5&maxBytes=20971520&groupBy=job

Operators may not go degraded during normal use. If the operator is degraded when the install completes, it must heal within a few minutes (we will be tightening that constraint).
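The Degraded=True / Available=False states being discussed are condition entries on the `authentication` ClusterOperator object. A minimal sketch of reading them, using a hypothetical sample of the JSON that `oc get clusteroperator authentication -o json` returns (the real status carries more conditions and fields):

```python
import json

# Hypothetical sample of a ClusterOperator status; trimmed for illustration.
sample = json.loads("""
{
  "status": {
    "conditions": [
      {"type": "Degraded", "status": "False", "lastTransitionTime": "2021-02-16T17:59:48Z"},
      {"type": "Available", "status": "True", "lastTransitionTime": "2021-02-16T17:59:48Z"}
    ]
  }
}
""")

def condition(operator, cond_type):
    """Return the status string ("True"/"False") of a named condition."""
    for c in operator["status"]["conditions"]:
        if c["type"] == cond_type:
            return c["status"]
    return "Unknown"

# The healthy state the CVO expects: not degraded, and available.
healthy = condition(sample, "Degraded") == "False" and condition(sample, "Available") == "True"
print(healthy)  # True
```

A CI check along these lines is what the comment above describes: sample the conditions during install/e2e and fail the run if the operator leaves this healthy state.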

Comment 11 Clayton Coleman 2021-04-26 16:46:29 UTC
This is reproducible in so many CI runs that I'm marking the needinfo as provided. Since we had no data before, I'm repurposing this as "auth operator going degraded too easily".

Comment 12 Clayton Coleman 2021-04-26 16:54:29 UTC
I'm going to also mark the operator alert as a known failure in tests on this bug.

Comment 13 Clayton Coleman 2021-04-26 16:56:58 UTC
Actually, https://bugzilla.redhat.com/show_bug.cgi?id=1939580 covers this - so we can potentially close this for now.

Comment 14 Clayton Coleman 2021-04-26 17:01:05 UTC
I broadened the test bypass for https://bugzilla.redhat.com/show_bug.cgi?id=1939580 to cover normal e2e runs in https://github.com/openshift/origin/pull/26103 which will reduce churn on PR jobs for now.

Comment 16 Standa Laznicka 2021-05-20 12:45:03 UTC
I can see that the impact has gone down since the fixes to other BZs merged. There are still a couple of cases where the CAO reports "Down" during installs, so it might still be worth checking those runs out; but since the numbers are lower now, I'm dropping the severity/priority to medium.

Comment 17 Sergiusz Urbaniak 2021-06-03 13:12:00 UTC
Unsetting the 4.8 target as we have done work (see https://bugzilla.redhat.com/show_bug.cgi?id=1939580) to fix things here. As mentioned in https://bugzilla.redhat.com/show_bug.cgi?id=1929922#c16, we're keeping this open to address the remaining issues.

Comment 18 Michal Fojtik 2021-07-03 13:13:23 UTC
This bug hasn't had any activity in the last 30 days, so we're marking it as "LifecycleStale" and decreasing the severity/priority. If you have further information on the current state of the bug, please update it; otherwise this bug can be closed in about 7 days.

Comment 19 Sergiusz Urbaniak 2021-08-16 12:51:42 UTC
Sprint review: the final bits are still being worked on, as commented in https://bugzilla.redhat.com/show_bug.cgi?id=1929922#c16.

Comment 20 Michal Fojtik 2021-08-16 12:53:35 UTC
The LifecycleStale keyword was removed because the bug got commented on recently.
The bug assignee was notified.

Comment 21 Standa Laznicka 2021-09-03 12:50:00 UTC
Sprint review: I can see that there is still a high number of CAO failures. I wonder whether that's because the CAO actually fails, or because it fails along with all the other operators, since it depends heavily on many other components.

Comment 22 Michal Fojtik 2021-10-03 13:30:07 UTC
This bug hasn't had any activity in the last 30 days, so we're marking it as "LifecycleStale" and decreasing the severity/priority. If you have further information on the current state of the bug, please update it; otherwise this bug can be closed in about 7 days.

Comment 23 Sergiusz Urbaniak 2021-11-08 06:56:01 UTC
Reviewed in sprint: not enough capacity to work on this Bugzilla.

Comment 25 Sergiusz Urbaniak 2021-11-26 07:23:15 UTC
I’m adding UpcomingSprint, because I was occupied by fixing bugs with higher priority/severity, developing new features with higher priority, or developing new features to improve stability at a macro level. I will revisit this bug next sprint.

Comment 26 Standa Laznicka 2022-01-19 11:25:04 UTC
Let's close this BZ, as it hasn't seen much activity and not much work is currently planned with regard to improving the CAO. If you see a specific CAO issue, please open a BZ specific to it.

