Authentication operator reports down during a 4.6 to 4.7 upgrade, which means the pod is crashing, failing, not ready, or not visible to metrics during the run: https://prow.ci.openshift.org/view/gcs/origin-ci-test/logs/release-openshift-origin-installer-e2e-aws-upgrade-4.6-stable-to-4.7-ci/1362089565350793216

ALERTS{alertname="ClusterOperatorDown", alertstate="firing", endpoint="metrics", instance="10.0.179.43:9099", job="cluster-version-operator", name="authentication", namespace="openshift-cluster-version", pod="cluster-version-operator-7775745d8f-6fdfc", service="cluster-version-operator", severity="critical", version="4.6.18"}

is firing at 18:00:19Z during this run, which means one of the instances was not responding to metrics collection for at least 10m, which is bad. This needs investigation: an operator being out of commission for 10m during an upgrade is a sign something is seriously wrong. I will accept deferral out of 4.7.0 to 4.7.z if it can be shown this is due to another serious error that we have tracked (vs. the authentication operator just crashing and dying for 10m at a time, which would be a blocker).
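For reference, one way to spot-check the series behind this alert on a live cluster (a sketch on my side: it assumes curl is available in the prometheus container; the query itself is just the ALERTS series quoted above):

$ oc -n openshift-monitoring exec prometheus-k8s-0 -c prometheus -- \
    curl -s 'http://localhost:9090/api/v1/query' \
    --data-urlencode 'query=ALERTS{alertname="ClusterOperatorDown",name="authentication"}'
# Per the note above, the alert only fires after ~10m, so any result here means the condition held at least that long.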
This happens right at the end of install (and just before install). Because of that, I'm moving it to 4.7.z (operators should not be degraded during install). I will be adding a test condition that fails an install if we catch a degraded operator during install; for the alert to have fired, the operator would have had to be failing for much longer than that. A rough sketch of that kind of gate is below.
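To illustrate the kind of check I have in mind (just a shell sketch, not the actual origin test condition; the jq filter is my own):

$ degraded=$(oc get clusteroperators -o json \
    | jq -r '.items[] | select(any(.status.conditions[]?; .type=="Degraded" and .status=="True")) | .metadata.name')
# Fail the install run if any operator is still Degraded=True once the install has settled.
$ if [ -n "$degraded" ]; then echo "Degraded operators after install: $degraded"; exit 1; fi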
I couldn't find anything interesting happening in the openshift-authentication-operator namespace around 18:00Z, where the old 4.6 pod was still in place. I browsed the events of the openshift-authentication-operator namespace, but they only show that the operator was alive and doing its job; there was no sign as to why it would not respond to /metrics requests. I am, however, not sure why this BZ should block 4.7, as the observed behavior did not occur during the upgrade, but only during the installation. The events of the cluster-authentication-operator show that the operator was alive and well, and went Degraded=False, Available=True at 17:59:48Z. I can see that comment 1 mentions degraded conditions during install. If that's what's really causing ALERTS{alertname="ClusterOperatorDown"} to fire, then this is a new thing and I wouldn't expect it to be fixed earlier than 4.8.
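For anyone repeating this triage against a live cluster rather than the CI artifacts, the equivalent checks are roughly (standard oc invocations, not literally what I ran against the gathered artifacts):

$ oc get events -n openshift-authentication-operator --sort-by=.lastTimestamp
$ oc get clusteroperator authentication -o jsonpath='{range .status.conditions[*]}{.type}={.status} ({.lastTransitionTime}){"\n"}{end}'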
Not a blocker, as this is already present in 4.6/4.7 and hence not a regression.
This bug hasn't had any activity in the last 30 days. Maybe the problem got resolved, was a duplicate of something else, or became less pressing for some reason - or maybe it's still relevant but just hasn't been looked at yet. As such, we're marking this bug as "LifecycleStale" and decreasing the severity/priority. If you have further information on the current state of the bug, please update it, otherwise this bug can be closed in about 7 days. The information can be, for example, that the problem still occurs, that you still want the feature, that more information is needed, or that the bug is (for whatever reason) no longer relevant. Additionally, you can add LifecycleFrozen into Keywords if you think this bug should never be marked as stale. Please consult with bug assignee before you do that.
Still very popular to have authentication go Degraded=True or Available=False during CI runs:

$ w3m -dump -cols 200 'https://search.ci.openshift.org/?search=clusteroperator%2Fauthentication+.*%28Degraded%7CAvailable%29&maxAge=24h&type=junit' | grep ' failures match' | sort
periodic-ci-openshift-cluster-api-provider-kubevirt-release-4.9-sanity-ovn (all) - 1 runs, 100% failed, 100% of failures match = 100% impact
periodic-ci-openshift-release-master-ci-4.6-e2e-aws-upgrade-rollback (all) - 1 runs, 100% failed, 100% of failures match = 100% impact
periodic-ci-openshift-release-master-ci-4.6-e2e-azure-upgrade (all) - 4 runs, 50% failed, 200% of failures match = 100% impact
periodic-ci-openshift-release-master-ci-4.6-e2e-gcp-upgrade (all) - 4 runs, 25% failed, 400% of failures match = 100% impact
periodic-ci-openshift-release-master-ci-4.6-upgrade-from-stable-4.5-e2e-azure-ovn-upgrade (all) - 4 runs, 100% failed, 75% of failures match = 75% impact
periodic-ci-openshift-release-master-ci-4.7-upgrade-from-stable-4.6-e2e-aws-ovn-upgrade (all) - 9 runs, 11% failed, 900% of failures match = 100% impact
periodic-ci-openshift-release-master-ci-4.7-upgrade-from-stable-4.6-e2e-aws-upgrade (all) - 4 runs, 50% failed, 200% of failures match = 100% impact
periodic-ci-openshift-release-master-ci-4.7-upgrade-from-stable-4.6-e2e-gcp-ovn-upgrade (all) - 4 runs, 50% failed, 200% of failures match = 100% impact
periodic-ci-openshift-release-master-ci-4.7-upgrade-from-stable-4.6-e2e-gcp-upgrade (all) - 4 runs, 25% failed, 400% of failures match = 100% impact
periodic-ci-openshift-release-master-ci-4.7-upgrade-from-stable-4.6-e2e-ovirt-upgrade (all) - 2 runs, 100% failed, 100% of failures match = 100% impact
periodic-ci-openshift-release-master-ci-4.7-upgrade-from-stable-4.6-e2e-vsphere-upgrade (all) - 1 runs, 100% failed, 100% of failures match = 100% impact
periodic-ci-openshift-release-master-ci-4.8-e2e-aws-serial (all) - 12 runs, 42% failed, 100% of failures match = 42% impact
periodic-ci-openshift-release-master-ci-4.8-e2e-azure-upgrade (all) - 1 runs, 100% failed, 100% of failures match = 100% impact
periodic-ci-openshift-release-master-ci-4.8-e2e-gcp-compact-serial (all) - 1 runs, 100% failed, 100% of failures match = 100% impact
periodic-ci-openshift-release-master-ci-4.8-e2e-gcp-upgrade (all) - 12 runs, 50% failed, 200% of failures match = 100% impact
periodic-ci-openshift-release-master-ci-4.8-upgrade-from-stable-4.7-e2e-aws-ovn-upgrade (all) - 16 runs, 100% failed, 94% of failures match = 94% impact
periodic-ci-openshift-release-master-ci-4.8-upgrade-from-stable-4.7-e2e-aws-upgrade (all) - 11 runs, 82% failed, 122% of failures match = 100% impact
periodic-ci-openshift-release-master-ci-4.8-upgrade-from-stable-4.7-e2e-azure-ovn-upgrade (all) - 4 runs, 100% failed, 100% of failures match = 100% impact
periodic-ci-openshift-release-master-ci-4.8-upgrade-from-stable-4.7-e2e-azure-upgrade (all) - 1 runs, 100% failed, 100% of failures match = 100% impact
periodic-ci-openshift-release-master-ci-4.8-upgrade-from-stable-4.7-e2e-gcp-ovn-upgrade (all) - 4 runs, 100% failed, 100% of failures match = 100% impact
periodic-ci-openshift-release-master-ci-4.8-upgrade-from-stable-4.7-e2e-gcp-upgrade (all) - 1 runs, 100% failed, 100% of failures match = 100% impact
periodic-ci-openshift-release-master-ci-4.8-upgrade-from-stable-4.7-e2e-openstack-upgrade (all) - 1 runs, 100% failed, 100% of failures match = 100% impact
periodic-ci-openshift-release-master-ci-4.8-upgrade-from-stable-4.7-e2e-ovirt-upgrade (all) - 2 runs, 100% failed, 50% of failures match = 50% impact
periodic-ci-openshift-release-master-ci-4.8-upgrade-from-stable-4.7-e2e-vsphere-upgrade (all) - 1 runs, 100% failed, 100% of failures match = 100% impact
periodic-ci-openshift-release-master-ci-4.9-e2e-gcp-upgrade (all) - 10 runs, 90% failed, 111% of failures match = 100% impact
periodic-ci-openshift-release-master-nightly-4.6-upgrade-from-stable-4.5-e2e-metal-ipi-upgrade (all) - 1 runs, 100% failed, 100% of failures match = 100% impact
periodic-ci-openshift-release-master-nightly-4.8-e2e-aws (all) - 16 runs, 50% failed, 38% of failures match = 19% impact
periodic-ci-openshift-release-master-nightly-4.8-e2e-aws-serial (all) - 9 runs, 22% failed, 100% of failures match = 22% impact
periodic-ci-openshift-release-master-nightly-4.8-e2e-aws-upgrade (all) - 7 runs, 57% failed, 25% of failures match = 14% impact
periodic-ci-openshift-release-master-nightly-4.8-e2e-aws-workers-rhel7 (all) - 4 runs, 25% failed, 100% of failures match = 25% impact
periodic-ci-openshift-release-master-nightly-4.8-e2e-metal-assisted (all) - 7 runs, 29% failed, 150% of failures match = 43% impact
periodic-ci-openshift-release-master-nightly-4.8-e2e-metal-ipi-upgrade (all) - 7 runs, 71% failed, 20% of failures match = 14% impact
periodic-ci-openshift-release-master-nightly-4.8-e2e-vsphere-serial (all) - 8 runs, 100% failed, 13% of failures match = 13% impact
periodic-ci-openshift-release-master-nightly-4.8-e2e-vsphere-upi-serial (all) - 8 runs, 25% failed, 50% of failures match = 13% impact
periodic-ci-openshift-release-master-nightly-4.8-upgrade-from-stable-4.7-e2e-aws-upgrade (all) - 7 runs, 100% failed, 100% of failures match = 100% impact
periodic-ci-openshift-release-master-okd-4.8-e2e-aws (all) - 6 runs, 50% failed, 33% of failures match = 17% impact
pull-ci-openshift-cluster-kube-apiserver-operator-master-e2e-upgrade (all) - 2 runs, 50% failed, 200% of failures match = 100% impact
...
pull-ci-openshift-ovn-kubernetes-master-e2e-openstack-ovn (all) - 11 runs, 100% failed, 18% of failures match = 18% impact
release-openshift-ocp-installer-e2e-gcp-ovn-4.8 (all) - 7 runs, 71% failed, 80% of failures match = 57% impact
release-openshift-ocp-installer-e2e-openstack-4.4 (all) - 2 runs, 100% failed, 50% of failures match = 50% impact
release-openshift-ocp-installer-upgrade-remote-libvirt-s390x-4.7-to-4.8 (all) - 2 runs, 100% failed, 50% of failures match = 50% impact
release-openshift-okd-installer-e2e-aws-upgrade (all) - 12 runs, 83% failed, 120% of failures match = 100% impact
release-openshift-origin-installer-e2e-aws-disruptive-4.6 (all) - 1 runs, 100% failed, 100% of failures match = 100% impact
release-openshift-origin-installer-e2e-aws-disruptive-4.7 (all) - 1 runs, 100% failed, 100% of failures match = 100% impact
release-openshift-origin-installer-e2e-aws-upgrade-4.1-stable-to-4.2-ci (all) - 6 runs, 17% failed, 500% of failures match = 83% impact
release-openshift-origin-installer-e2e-aws-upgrade-4.1-to-4.2-to-4.3-nightly (all) - 1 runs, 100% failed, 100% of failures match = 100% impact
release-openshift-origin-installer-e2e-aws-upgrade-4.1-to-4.2-to-4.3-to-4.4-nightly (all) - 1 runs, 100% failed, 100% of failures match = 100% impact
release-openshift-origin-installer-e2e-aws-upgrade-4.2-to-4.3-to-4.4-to-4.5-ci (all) - 1 runs, 100% failed, 100% of failures match = 100% impact
release-openshift-origin-installer-e2e-aws-upgrade-rollback-4.1-to-4.2 (all) - 2 runs, 100% failed, 100% of failures match = 100% impact
release-openshift-origin-installer-e2e-aws-upgrade-rollback-4.2-to-4.3 (all) - 1 runs, 100% failed, 100% of failures match = 100% impact
release-openshift-origin-installer-e2e-azure-upgrade-4.3 (all) - 2 runs, 100% failed, 50% of failures match = 50% impact
release-openshift-origin-installer-e2e-gcp-ovn-upgrade-4.5-stable-to-4.6-ci (all) - 4 runs, 50% failed, 100% of failures match = 50% impact
release-openshift-origin-installer-launch-gcp (all) - 23 runs, 48% failed, 82% of failures match = 39% impact
release-openshift-origin-installer-old-rhcos-e2e-aws-4.2 (all) - 2 runs, 100% failed, 100% of failures match = 100% impact
release-openshift-origin-installer-old-rhcos-e2e-aws-4.7 (all) - 1 runs, 100% failed, 100% of failures match = 100% impact
The LifecycleStale keyword was removed because the bug got commented on recently. The bug assignee was notified.
Roughly 10-20% of our CI runs fail because the authentication operator is either degraded or unavailable during normal e2e runs. I'm bumping priority and severity; this is one of the top CI blockers and is impacting teams' ability to merge.

https://search.ci.openshift.org/?search=ClusterOperatorDown.*authentication&maxAge=48h&context=1&type=bug%2Bjunit&name=&excludeName=&maxMatches=5&maxBytes=20971520&groupBy=job

Operators may not go degraded during normal use. If the operator is degraded when the install completes, it must heal within a few minutes (we will be tightening that constraint).
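Following the same pattern as the earlier w3m dump, the per-job breakdown for this search term can be pulled with something like the following (same search string as the link above, narrowed to junit results):

$ w3m -dump -cols 200 'https://search.ci.openshift.org/?search=ClusterOperatorDown.*authentication&maxAge=48h&type=junit' | grep ' failures match' | sort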
This is reproducible in so many CI runs that I'm marking the needinfo as provided. Since we had no data before, I'm repurposing this as "auth operator going degraded too easily".
I'm going to also mark the operator alert as a known failure in tests on this bug.
Actually, https://bugzilla.redhat.com/show_bug.cgi?id=1939580 covers this - so we can potentially close this for now.
I broadened the test bypass for https://bugzilla.redhat.com/show_bug.cgi?id=1939580 to cover normal e2e runs in https://github.com/openshift/origin/pull/26103 which will reduce churn on PR jobs for now.
I can see that the impact has gone down since the fixes for the other BZs merged. There are still a couple of cases where the CAO reports "Down" during installs, so it might still be worth checking those runs out, but since the numbers are lower now, I'm dropping the severity/priority to medium.
Unsetting from 4.8 as we have done work (see https://bugzilla.redhat.com/show_bug.cgi?id=1939580) to fix things here. As mentioned in https://bugzilla.redhat.com/show_bug.cgi?id=1929922#c16, we're keeping this open to address the remaining issues.
sprint review: the final bits are still being worked on, as noted in https://bugzilla.redhat.com/show_bug.cgi?id=1929922#c16
sprint review: I can see that there is still a high number of CAO failures. I wonder whether that's because the CAO itself actually fails, or because it fails along with all the other operators, since it depends heavily on many other components.
This bug hasn't had any activity in the last 30 days. Maybe the problem got resolved, was a duplicate of something else, or became less pressing for some reason - or maybe it's still relevant but just hasn't been looked at yet. As such, we're marking this bug as "LifecycleStale" and decreasing the severity/priority. If you have further information on the current state of the bug, please update it, otherwise this bug can be closed in about 7 days. The information can be, for example, that the problem still occurs, that you still want the feature, that more information is needed, or that the bug is (for whatever reason) no longer relevant. Additionally, you can add LifecycleFrozen into Whiteboard if you think this bug should never be marked as stale. Please consult with bug assignee before you do that.
reviewed-in-sprint: not enough capacity to work on this bugzilla.
I'm adding UpcomingSprint, because I was occupied by fixing bugs with higher priority/severity, developing new features with higher priority, or developing new features to improve stability at a macro level. I will revisit this bug next sprint.
Let's close this BZ as it hasn't seen much activity and not much is currently planned with regard to improving the CAO. If you see a specific CAO issue, please open a BZ specific to it.