test: [sig-arch] Check if alerts are firing during or after upgrade success is failing frequently in CI, see search results:

https://search.ci.openshift.org/?maxAge=168h&context=1&type=bug%2Bjunit&name=&maxMatches=5&maxBytes=20971520&groupBy=job&search=%5C%5Bsig-arch%5C%5D+Check+if+alerts+are+firing+during+or+after+upgrade+success

https://prow.ci.openshift.org/view/gs/origin-ci-test/pr-logs/pull/openshift_etcd/73/pull-ci-openshift-etcd-openshift-4.8-e2e-aws-upgrade/1372884621489868800

https://prow.ci.openshift.org/view/gs/origin-ci-test/logs/periodic-ci-openshift-release-master-ci-4.8-upgrade-from-stable-4.7-e2e-aws-ovn-upgrade/1372874143489331200

Mar 19 13:31:45.881: Unexpected alerts fired or pending during the upgrade:

alert AggregatedAPIDown fired for 180 seconds with labels: {name="v1beta1.metrics.k8s.io", namespace="default", severity="warning"}

github.com/openshift/origin/test/extended/util/disruption.(*chaosMonkeyAdapter).Test(0xc001f7caf0, 0xc000de02e0)
	github.com/openshift/origin/test/extended/util/disruption/disruption.go:175 +0x3be
k8s.io/kubernetes/test/e2e/chaosmonkey.(*Chaosmonkey).Do.func1(0xc000de02e0, 0xc000891570)
	k8s.io/kubernetes.0/test/e2e/chaosmonkey/chaosmonkey.go:90 +0x6d
created by k8s.io/kubernetes/test/e2e/chaosmonkey.(*Chaosmonkey).Do
	k8s.io/kubernetes.0/test/e2e/chaosmonkey/chaosmonkey.go:87 +0xc9
I added AggregatedAPIDown to the subject, because this test-case can fail for many reasons, they won't all have the same root cause or assigned team, and AggregatedAPIDown is mentioned in comment 0. But looking into [1], also linked from comment 0, I see:

disruption_tests: [sig-arch] Check if alerts are firing during or after upgrade success
Run #0: Failed 1h5m9s
Unexpected alert behavior during upgrade: alert ClusterMonitoringOperatorReconciliationErrors pending for 1 seconds with labels: {__name__="ALERTS", container="kube-rbac-proxy", endpoint="https", instance="10.128.0.27:8443", job="cluster-monitoring-operator", namespace="openshift-monitoring", pod="cluster-monitoring-operator-6466d67f66-nz94k", service="cluster-monitoring-operator", severity="warning"} (open bug: https://bugzilla.redhat.com/show_bug.cgi?id=1932624)

So that sounds like this bug should be closed as a duplicate of bug 1932624. I'm going to go ahead and do that, and bump that bug so Sippy can find it more easily, but feel free to roll back my changes if I'm misunderstanding.

[1]: https://prow.ci.openshift.org/view/gs/origin-ci-test/pr-logs/pull/openshift_etcd/73/pull-ci-openshift-etcd-openshift-4.8-e2e-aws-upgrade/1372884621489868800

*** This bug has been marked as a duplicate of bug 1932624 ***
Created attachment 1765157 [details]
aggregator_unavailable_apiservice metrics

From the logs of the other failed job [1], I see that the issue isn't related to the ClusterMonitoringOperatorReconciliationErrors alert. This time, the alert is AggregatedAPIDown for the v1beta1.metrics.k8s.io API (which is prometheus-adapter). Looking at the Prometheus adapter logs [2][3], we can see network hiccups that are also reported by the API servers [4].

The alert expression is "(1 - max by(name, namespace)(avg_over_time(aggregator_unavailable_apiservice[10m]))) * 100 < 85" with a "for" clause of 5 minutes. Looking at the raw metrics (see attachment), the alert might be too sensitive: in particular, the avg_over_time() function over a 10-minute window combined with the 85% threshold increases the time it takes for the alert to resolve once the API is back.

[1] https://prow.ci.openshift.org/view/gs/origin-ci-test/logs/periodic-ci-openshift-release-master-ci-4.8-upgrade-from-stable-4.7-e2e-aws-ovn-upgrade/1372874143489331200
[2] https://gcsweb-ci.apps.ci.l2s4.p1.openshiftapps.com/gcs/origin-ci-test/logs/periodic-ci-openshift-release-master-ci-4.8-upgrade-from-stable-4.7-e2e-aws-ovn-upgrade/1372874143489331200/artifacts/e2e-aws-ovn-upgrade/gather-extra/artifacts/pods/openshift-monitoring_prometheus-adapter-7f6546cf79-djcwv_prometheus-adapter.log
[3] https://gcsweb-ci.apps.ci.l2s4.p1.openshiftapps.com/gcs/origin-ci-test/logs/periodic-ci-openshift-release-master-ci-4.8-upgrade-from-stable-4.7-e2e-aws-ovn-upgrade/1372874143489331200/artifacts/e2e-aws-ovn-upgrade/gather-extra/artifacts/pods/openshift-monitoring_prometheus-adapter-7f6546cf79-xtr22_prometheus-adapter.log
[4] https://gcsweb-ci.apps.ci.l2s4.p1.openshiftapps.com/gcs/origin-ci-test/logs/periodic-ci-openshift-release-master-ci-4.8-upgrade-from-stable-4.7-e2e-aws-ovn-upgrade/1372874143489331200/artifacts/e2e-aws-ovn-upgrade/gather-extra/artifacts/pods/openshift-kube-apiserver_kube-apiserver-ip-10-0-148-8.us-east-2.compute.internal_kube-apiserver.log
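For reference, here is a rough sketch of how that rule reads when laid out as a Prometheus alerting rule. Only the expression, the "for" clause and the warning severity come from the observations above; the surrounding layout is an assumption about how the rule is shipped:

# Sketch of the AggregatedAPIDown rule described above (rule layout is illustrative).
- alert: AggregatedAPIDown
  # Fires when an aggregated API averaged less than 85% availability over the last 10 minutes,
  # sustained for 5 minutes. The 10m average is also why the alert is slow to resolve after a blip.
  expr: (1 - max by (name, namespace) (avg_over_time(aggregator_unavailable_apiservice[10m]))) * 100 < 85
  for: 5m
  labels:
    severity: warning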
@spasquie this bug is closed as DUP, please follow up in https://bugzilla.redhat.com/show_bug.cgi?id=1932624
This bug isn't a duplicate of bug 1932624: that bug is about the ClusterMonitoringOperatorReconciliationErrors alert, while this bug report is about AggregatedAPIDown. Reopening the bug and assigning to Damien.
Raising to high based on:

$ w3m -dump -cols 200 'https://search.ci.openshift.org/?maxAge=24h&type=junit&name=4.8.*from.*4.7&search=AggregatedAPIDown+fired+for' | grep 'failures match' | sort
periodic-ci-openshift-release-master-ci-4.8-upgrade-from-stable-4.7-e2e-aws-ovn-upgrade (all) - 18 runs, 100% failed, 94% of failures match = 94% impact
periodic-ci-openshift-release-master-ci-4.8-upgrade-from-stable-4.7-e2e-aws-upgrade (all) - 17 runs, 100% failed, 100% of failures match = 100% impact
periodic-ci-openshift-release-master-ci-4.8-upgrade-from-stable-4.7-e2e-azure-ovn-upgrade (all) - 5 runs, 100% failed, 60% of failures match = 60% impact
periodic-ci-openshift-release-master-ci-4.8-upgrade-from-stable-4.7-e2e-azure-upgrade (all) - 1 runs, 100% failed, 100% of failures match = 100% impact
periodic-ci-openshift-release-master-ci-4.8-upgrade-from-stable-4.7-e2e-gcp-ovn-upgrade (all) - 4 runs, 100% failed, 100% of failures match = 100% impact
periodic-ci-openshift-release-master-ci-4.8-upgrade-from-stable-4.7-e2e-gcp-upgrade (all) - 1 runs, 100% failed, 100% of failures match = 100% impact
periodic-ci-openshift-release-master-ci-4.8-upgrade-from-stable-4.7-e2e-ovirt-upgrade (all) - 3 runs, 100% failed, 67% of failures match = 67% impact
periodic-ci-openshift-release-master-nightly-4.8-upgrade-from-stable-4.7-e2e-aws-upgrade (all) - 10 runs, 100% failed, 80% of failures match = 80% impact
periodic-ci-openshift-release-master-nightly-4.8-upgrade-from-stable-4.7-e2e-metal-ipi-upgrade (all) - 10 runs, 100% failed, 10% of failures match = 10% impact
rehearse-17604-periodic-ci-openshift-release-master-ci-4.8-upgrade-from-stable-4.7-e2e-aws-ovn-upgrade (all) - 2 runs, 50% failed, 100% of failures match = 50% impact
rehearse-17604-periodic-ci-openshift-release-master-ci-4.8-upgrade-from-stable-4.7-e2e-gcp-ovn-upgrade (all) - 2 runs, 50% failed, 100% of failures match = 50% impact
rehearse-17604-periodic-ci-openshift-release-master-ci-4.8-upgrade-from-stable-4.7-e2e-ovirt-upgrade (all) - 2 runs, 100% failed, 50% of failures match = 50% impact
rehearse-17604-periodic-ci-openshift-release-master-ci-4.8-upgrade-from-stable-4.7-e2e-vsphere-upgrade (all) - 2 runs, 50% failed, 100% of failures match = 50% impact

If we get this fixed, and get bug 1948066 fixed, it looks like we'll be pretty close to having green 4.7->4.8 update CI [1].

[1]: https://testgrid.k8s.io/redhat-openshift-ocp-release-4.8-informing#periodic-ci-openshift-release-master-nightly-4.8-upgrade-from-stable-4.7-e2e-aws-upgrade
Damien, I see you set blocker-. But 4.7 -> 4.8 CI is dead on this error (and the presumably unrelated bug 1927264). I don't think we can GA without green 4.7 -> 4.8 CI. If we're ok with things like [1]:

disruption_tests: [sig-arch] Check if alerts are firing during or after upgrade success 1h8m7s
Apr 11 11:35:31.848: Unexpected alerts fired or pending during the upgrade:

alert AggregatedAPIDown fired for 90 seconds with labels: {name="v1beta1.metrics.k8s.io", namespace="default", severity="warning"}
alert ClusterMonitoringOperatorReconciliationErrors fired for 300 seconds with labels: {severity="warning"}

going off, can you add carve-outs to openshift/origin so those alerts are non-fatal? If we're not ok with them going off, can we fix them before 4.8 GAs (and set blocker+ to commit to that)?

[1]: https://prow.ci.openshift.org/view/gcs/origin-ci-test/logs/periodic-ci-openshift-release-master-nightly-4.8-upgrade-from-stable-4.7-e2e-aws-upgrade/1381182524864073728
Adding the v1beta1.metrics.k8s.io description to the subject, to distinguish from the new bug 1948771 about AggregatedAPIDown for v1.packages.operators.coreos.com.
Happy to reevaluate it as a blocker that we have to fix before 4.8 GA, thank you for raising the concern again. On a side note, if bug 1948771 is also caused by network blips, we will most likely fix it with this Bugzilla, as the issue seems to come from the alert being too sensitive rather than from the aggregated APIs themselves.
FWIW, I noticed that the `AggregatedAPIDown` alert only fires when an aggregated API is reported unavailable because it has missing endpoints. It is the only unavailability reason that seems to have an impact on the availability target that we've set for aggregated APIs, and the `MissingEndpoints` reason seems to only be reported for `v1beta1.metrics.k8s.io` and `v1.packages.operators.coreos.com`, which aligns with what we are seeing in CI. Thus, my hunch would be that the real reason behind this issue isn't that the alert is too sensitive, but that the Endpoints of these apiservices are degraded for some reason. If this is correct, the root cause behind this BZ and bug 1948771 should be the same.
Created attachment 1771652 [details]
aggregator_unavailable_apiservice_total missing-endpoints
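For anyone who wants to reproduce the breakdown shown in the attachment straight from Prometheus, a query along these lines should work, assuming the counter carries a "reason" label as the attached graph suggests:

# Hypothetical query: unavailability reports per apiservice and reason over the last hour.
sum by (name, reason) (increase(aggregator_unavailable_apiservice_total[1h]))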
In the end, it seems to be caused by the apiservice taking too much time to claim the aggregated API. I will increase the 'for' clause of the alert to take that into account. From what I can see in the CI failures, 15 minutes should be enough for most of the CI jobs, but SNO tests seem to require 30 minutes.
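To make the shape of that change concrete, it would look roughly like the sketch below; the exact "for" value isn't settled in this comment, and the 30m only mirrors the SNO observation above:

# Illustrative only: same expression as before, with a longer "for" clause.
- alert: AggregatedAPIDown
  expr: (1 - max by (name, namespace) (avg_over_time(aggregator_unavailable_apiservice[10m]))) * 100 < 85
  for: 30m
  labels:
    severity: warning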
Is this only something that happens during install? Because during updates, I would be surprised if the API-server drops one of these aggregated APIs and then takes multiple minutes to pick it back up. That seems like an actual availability bug, not an overly-strict alert. Maybe we need the e2e suite to have uptime monitors for these APIs like we already have for the Kube and OpenShift API servers, so we can more easily distinguish between false-positive and true-positive alert firings.
Yes, this is happening during installation. My explanation was a bit confusing: what I meant was that the apiserver seems to be correctly picking up the apiservice, since it is able to tell that there are no active endpoints attached to it. However, the error seems to indicate that the aggregated API server, prometheus-adapter in our case, takes some time to become available. But after having a second look, in that case we would have seen `FailedDiscoveryCheck` rather than the `MissingEndpoints` errors.

While I was at it, I also gave another look at both the metrics and the Kubernetes internals and have a little more insight. Actually, the `MissingEndpoints` and `ServiceNotFound` errors occur only for the two apiservices for which the alert is firing, namely `v1.packages.operators.coreos.com` and `v1beta1.metrics.k8s.io`. The reasons behind these errors are that the apiserver is either unable to find the service targeted by the apiservice or the service port is inactive. So there might be something going wrong with the services in front of both aggregated API servers. What I fail to explain, though, is why only these two apiservices are impacted and not the others.
Bug 1948771 was verified fixed after the packageserver deployment was moved back to having 'replicas: 2', reverting a brief attempt at 'replicas: 1'. So in that case, the issue really was "the service has availability disrupted because it wasn't as HA as it needed to be". I suspect the issue with v1beta1.metrics.k8s.io might also be "the service is not HA enough to get through the update jobs mentioned in comment 6". Do we have PDBs on the relevant metrics deployment yet to protect it from excessive disruption?
Oh! Good idea, I forgot to check that and it seems that we don't have a PDB for prometheus-adapter even though we are running it in HA mode with 2 replicas. I'll add one and revert the changes made to the alert. Thank you for the suggestion.
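For what it's worth, a minimal sketch of what such a PodDisruptionBudget could look like for a 2-replica prometheus-adapter; the name, namespace and selector labels are assumptions rather than the operator's actual manifest (clusters older than Kubernetes 1.21 would need policy/v1beta1):

# Minimal PDB sketch: keep at least 1 prometheus-adapter pod available during voluntary
# disruptions such as node drains on upgrade. Metadata and selector are assumptions.
apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: prometheus-adapter
  namespace: openshift-monitoring
spec:
  minAvailable: 1
  selector:
    matchLabels:
      app.kubernetes.io/name: prometheus-adapter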
Also some new-ish advice in [1,2], if you want to audit against all of those recommendations as part of this. [1]: https://github.com/openshift/enhancements/blob/ec375985d143e9b091d14dfabe5fe18258875d5f/CONVENTIONS.md#high-availability [2]: https://github.com/openshift/enhancements/blob/ec375985d143e9b091d14dfabe5fe18258875d5f/CONVENTIONS.md#upgrade-and-reconfiguration
Moving back to ASSIGNED to give Damien time to do the comment 17 stuff before QE takes a look.
We already have bug 1948711 tracking the implementation of the newest HA conventions for prometheus-adapter.
The current bug can be verified now, since the dependent bug https://bugzilla.redhat.com/show_bug.cgi?id=1948711 is closed as VERIFIED. Checked the result of a recent CI run of periodic-ci-openshift-release-master-nightly-4.8-upgrade-from-stable-4.7-e2e-aws-upgrade and saw no AggregatedAPIDown alert fired with labels: {name="v1beta1.metrics.k8s.io", namespace="default", severity="warning"}.
Also did an upgrade from 4.7.0-0.nightly-2021-04-22-173152 to 4.8.0-0.nightly-2021-04-23-034722 and didn't see the alert.
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory (Moderate: OpenShift Container Platform 4.8.2 bug fix and security update), and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHSA-2021:2438