1940933 – [sig-arch] Check if alerts are firing during or after upgrade success: AggregatedAPIDown on v1beta1.metrics.k8s.io

Keywords:
Status:	CLOSED ERRATA
Alias:	None
Product:	OpenShift Container Platform
Classification:	Red Hat
Component:	Monitoring
Sub Component:
Version:	4.8
Hardware:	Unspecified
OS:	Unspecified
Priority:	unspecified
Severity:	high
Target Milestone:	---
Target Release:	4.8.0
Assignee:	Damien Grisonnet
QA Contact:	hongyan li
Docs Contact:
URL:
Whiteboard:
Depends On:	1948711
Blocks:

Reported:	2021-03-19 15:07 UTC by Hongkai Liu
Modified:	2021-07-27 22:55 UTC
CC List:	10 users
Fixed In Version:
Doc Type:	No Doc Update
Doc Text:
Clone Of:
Environment:	[sig-arch] Check if alerts are firing during or after upgrade success
Last Closed:	2021-07-27 22:54:33 UTC
Target Upstream Version:
Embargoed:

Attachments
aggregator_unavailable_apiservice metrics (106.66 KB, image/png) 2021-03-22 08:40 UTC, Simon Pasquier
aggregator_unavailable_apiservice_total missing-endpoints (573.45 KB, image/png) 2021-04-13 15:37 UTC, Damien Grisonnet

Links
System	ID	Priority	Status	Summary	Last Updated
Github	openshift cluster-monitoring-operator pull 1117	None	open	WIP: Bug 1940933: jsonnet: make AggregatedAPIDown more resilient	2021-04-13 14:08:03 UTC
Github	openshift cluster-monitoring-operator pull 1120	None	open	Bug 1940933: Revert "jsonnet: make AggregatedAPIDown more resilient"	2021-04-15 13:31:33 UTC
Red Hat Product Errata	RHSA-2021:2438	None	None	None	2021-07-27 22:55:01 UTC

Description Hongkai Liu 2021-03-19 15:07:55 UTC

test:
[sig-arch] Check if alerts are firing during or after upgrade success 

is failing frequently in CI, see search results:
https://search.ci.openshift.org/?maxAge=168h&context=1&type=bug%2Bjunit&name=&maxMatches=5&maxBytes=20971520&groupBy=job&search=%5C%5Bsig-arch%5C%5D+Check+if+alerts+are+firing+during+or+after+upgrade+success


https://prow.ci.openshift.org/view/gs/origin-ci-test/pr-logs/pull/openshift_etcd/73/pull-ci-openshift-etcd-openshift-4.8-e2e-aws-upgrade/1372884621489868800

https://prow.ci.openshift.org/view/gs/origin-ci-test/logs/periodic-ci-openshift-release-master-ci-4.8-upgrade-from-stable-4.7-e2e-aws-ovn-upgrade/1372874143489331200


Mar 19 13:31:45.881: Unexpected alerts fired or pending during the upgrade:

alert AggregatedAPIDown fired for 180 seconds with labels: {name="v1beta1.metrics.k8s.io", namespace="default", severity="warning"}

github.com/openshift/origin/test/extended/util/disruption.(*chaosMonkeyAdapter).Test(0xc001f7caf0, 0xc000de02e0)
	github.com/openshift/origin/test/extended/util/disruption/disruption.go:175 +0x3be
k8s.io/kubernetes/test/e2e/chaosmonkey.(*Chaosmonkey).Do.func1(0xc000de02e0, 0xc000891570)
	k8s.io/kubernetes.0/test/e2e/chaosmonkey/chaosmonkey.go:90 +0x6d
created by k8s.io/kubernetes/test/e2e/chaosmonkey.(*Chaosmonkey).Do
	k8s.io/kubernetes.0/test/e2e/chaosmonkey/chaosmonkey.go:87 +0xc9

Comment 2 W. Trevor King 2021-03-19 23:07:04 UTC

I added AggregatedAPIDown to the subject, because this test-case can fail for many reasons, and they won't all have the same root cause or assigned team, and AggregatedAPIDown is mentioned in comment 0.  But looking into [1], also linked from comment 0, I see:

  disruption_tests: [sig-arch] Check if alerts are firing during or after upgrade success
  Run #0: Failed	1h5m9s
    Unexpected alert behavior during upgrade:

    alert ClusterMonitoringOperatorReconciliationErrors pending for 1 seconds with labels: {__name__="ALERTS", container="kube-rbac-proxy", endpoint="https", instance="10.128.0.27:8443", job="cluster-monitoring-operator", namespace="openshift-monitoring", pod="cluster-monitoring-operator-6466d67f66-nz94k", service="cluster-monitoring-operator", severity="warning"} (open bug: https://bugzilla.redhat.com/show_bug.cgi?id=1932624)

So that sounds like this bug should be closed as a dup of bug 1932624.  I'm going to go ahead and do that, and bump that bug so Sippy can find it more easily, but feel free to roll back my changes if I'm misunderstanding. 

[1]: https://prow.ci.openshift.org/view/gs/origin-ci-test/pr-logs/pull/openshift_etcd/73/pull-ci-openshift-etcd-openshift-4.8-e2e-aws-upgrade/1372884621489868800

*** This bug has been marked as a duplicate of bug 1932624 ***

Comment 3 Simon Pasquier 2021-03-22 08:40:27 UTC

Created attachment 1765157 [details]
aggregator_unavailable_apiservice metrics

From the logs of the other failed job [1], I see that the issue isn't related to the ClusterMonitoringOperatorReconciliationErrors alert. This time, the alert is AggregatedAPIDown for the v1beta1.metrics.k8s.io API (which is prometheus-adapter).

Looking at the Prometheus adapter logs [2][3], we can see network hiccups that are also reported by the API servers [4]. The alert expression is "(1 - max by(name, namespace)(avg_over_time(aggregator_unavailable_apiservice[10m]))) * 100 < 85" with a for clause of 5 minutes. Looking at the raw metrics (see attachment), the alert might be too sensitive: in particular, the avg_over_time() function over 10 minutes combined with the 85 threshold increases the time it takes for the alert to resolve.

[1] https://prow.ci.openshift.org/view/gs/origin-ci-test/logs/periodic-ci-openshift-release-master-ci-4.8-upgrade-from-stable-4.7-e2e-aws-ovn-upgrade/1372874143489331200
[2] https://gcsweb-ci.apps.ci.l2s4.p1.openshiftapps.com/gcs/origin-ci-test/logs/periodic-ci-openshift-release-master-ci-4.8-upgrade-from-stable-4.7-e2e-aws-ovn-upgrade/1372874143489331200/artifacts/e2e-aws-ovn-upgrade/gather-extra/artifacts/pods/openshift-monitoring_prometheus-adapter-7f6546cf79-djcwv_prometheus-adapter.log
[3] https://gcsweb-ci.apps.ci.l2s4.p1.openshiftapps.com/gcs/origin-ci-test/logs/periodic-ci-openshift-release-master-ci-4.8-upgrade-from-stable-4.7-e2e-aws-ovn-upgrade/1372874143489331200/artifacts/e2e-aws-ovn-upgrade/gather-extra/artifacts/pods/openshift-monitoring_prometheus-adapter-7f6546cf79-xtr22_prometheus-adapter.log
[4] https://gcsweb-ci.apps.ci.l2s4.p1.openshiftapps.com/gcs/origin-ci-test/logs/periodic-ci-openshift-release-master-ci-4.8-upgrade-from-stable-4.7-e2e-aws-ovn-upgrade/1372874143489331200/artifacts/e2e-aws-ovn-upgrade/gather-extra/artifacts/pods/openshift-kube-apiserver_kube-apiserver-ip-10-0-148-8.us-east-2.compute.internal_kube-apiserver.log
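
To illustrate the sensitivity concern raised in comment 3, here is a minimal Python sketch that replays the alert expression over a synthetic 0/1 series. It is only a model under stated assumptions: one sample and one rule evaluation every 30 seconds (the interval is not given in this bug), aggregator_unavailable_apiservice treated as a gauge that is 1 while the API is unavailable, and the max by(name, namespace) aggregation ignored. The function names are hypothetical.

```python
from typing import List

STEP_SECONDS = 30   # assumed scrape/evaluation interval (not stated in the bug)
WINDOW_STEPS = 20   # 10m range from the alert expression, in 30s steps
FOR_STEPS = 10      # 5m "for" clause, in 30s steps
THRESHOLD = 85      # availability threshold in percent


def availability_percent(samples: List[int], window: int) -> List[float]:
    """Trailing-window availability, equivalent to (1 - avg_over_time(x[window])) * 100."""
    result = []
    for i in range(len(samples)):
        chunk = samples[max(0, i - window + 1): i + 1]
        result.append(100 * (len(chunk) - sum(chunk)) / len(chunk))
    return result


def alert_states(samples: List[int]) -> List[str]:
    """Per-evaluation alert state: 'inactive', 'pending', or 'firing'."""
    states = []
    active_steps = 0  # consecutive evaluations with availability below the threshold
    for avail in availability_percent(samples, WINDOW_STEPS):
        active_steps = active_steps + 1 if avail < THRESHOLD else 0
        if active_steps == 0:
            states.append("inactive")
        elif active_steps > FOR_STEPS:
            # The condition has now held for at least the full "for" duration.
            states.append("firing")
        else:
            states.append("pending")
    return states


# 10 healthy minutes, a 2-minute outage, then 20 healthy minutes.
series = [0] * 20 + [1] * 4 + [0] * 40
states = alert_states(series)
print("pending for", states.count("pending") * STEP_SECONDS, "seconds")
print("firing for", states.count("firing") * STEP_SECONDS, "seconds")
```

Under these assumptions, a 2-minute outage keeps the expression true for 8.5 minutes (300 seconds pending, then 210 seconds firing), because the outage samples stay inside the 10-minute window long after the API has recovered.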

Comment 4 Kir Kolyshkin 2021-03-26 21:06:10 UTC

@spasquie this bug is closed as DUP, please follow up in https://bugzilla.redhat.com/show_bug.cgi?id=1932624

Comment 5 Simon Pasquier 2021-03-29 09:07:54 UTC

The original description isn't a duplicate of bug 1932624: that bug is about the ClusterMonitoringOperatorReconciliationErrors alert while this bug report is about AggregatedAPIDown. Reopening the bug and assigning to Damien.

Comment 6 W. Trevor King 2021-04-09 22:39:33 UTC

Raising to high based on:

$ w3m -dump -cols 200 'https://search.ci.openshift.org/?maxAge=24h&type=junit&name=4.8.*from.*4.7&search=AggregatedAPIDown+fired+for' | grep 'failures match' | sort
periodic-ci-openshift-release-master-ci-4.8-upgrade-from-stable-4.7-e2e-aws-ovn-upgrade (all) - 18 runs, 100% failed, 94% of failures match = 94% impact
periodic-ci-openshift-release-master-ci-4.8-upgrade-from-stable-4.7-e2e-aws-upgrade (all) - 17 runs, 100% failed, 100% of failures match = 100% impact
periodic-ci-openshift-release-master-ci-4.8-upgrade-from-stable-4.7-e2e-azure-ovn-upgrade (all) - 5 runs, 100% failed, 60% of failures match = 60% impact
periodic-ci-openshift-release-master-ci-4.8-upgrade-from-stable-4.7-e2e-azure-upgrade (all) - 1 runs, 100% failed, 100% of failures match = 100% impact
periodic-ci-openshift-release-master-ci-4.8-upgrade-from-stable-4.7-e2e-gcp-ovn-upgrade (all) - 4 runs, 100% failed, 100% of failures match = 100% impact
periodic-ci-openshift-release-master-ci-4.8-upgrade-from-stable-4.7-e2e-gcp-upgrade (all) - 1 runs, 100% failed, 100% of failures match = 100% impact
periodic-ci-openshift-release-master-ci-4.8-upgrade-from-stable-4.7-e2e-ovirt-upgrade (all) - 3 runs, 100% failed, 67% of failures match = 67% impact
periodic-ci-openshift-release-master-nightly-4.8-upgrade-from-stable-4.7-e2e-aws-upgrade (all) - 10 runs, 100% failed, 80% of failures match = 80% impact
periodic-ci-openshift-release-master-nightly-4.8-upgrade-from-stable-4.7-e2e-metal-ipi-upgrade (all) - 10 runs, 100% failed, 10% of failures match = 10% impact
rehearse-17604-periodic-ci-openshift-release-master-ci-4.8-upgrade-from-stable-4.7-e2e-aws-ovn-upgrade (all) - 2 runs, 50% failed, 100% of failures match = 50% impact
rehearse-17604-periodic-ci-openshift-release-master-ci-4.8-upgrade-from-stable-4.7-e2e-gcp-ovn-upgrade (all) - 2 runs, 50% failed, 100% of failures match = 50% impact
rehearse-17604-periodic-ci-openshift-release-master-ci-4.8-upgrade-from-stable-4.7-e2e-ovirt-upgrade (all) - 2 runs, 100% failed, 50% of failures match = 50% impact
rehearse-17604-periodic-ci-openshift-release-master-ci-4.8-upgrade-from-stable-4.7-e2e-vsphere-upgrade (all) - 2 runs, 50% failed, 100% of failures match = 50% impact


If we get this fixed, and get bug 1948066 fixed, it looks like we'll be pretty close to having green 4.7->4.8 update CI [1].

[1]: https://testgrid.k8s.io/redhat-openshift-ocp-release-4.8-informing#periodic-ci-openshift-release-master-nightly-4.8-upgrade-from-stable-4.7-e2e-aws-upgrade
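
For readers who want to post-process the search output in comment 6, here is a small Python sketch that parses one 'failures match' line and recomputes the impact figure. It assumes the impact is the failure rate multiplied by the match rate; that is consistent with every line listed in comment 6 but is not documented in this bug. The regular expression and function names are hypothetical.

```python
import re

# Matches lines such as:
#   <job> (all) - 18 runs, 100% failed, 94% of failures match = 94% impact
LINE_RE = re.compile(
    r"^(?P<job>\S+) \(all\) - (?P<runs>\d+) runs, "
    r"(?P<failed>\d+)% failed, (?P<match>\d+)% of failures match = "
    r"(?P<impact>\d+)% impact$"
)


def parse_line(line: str) -> dict:
    """Split one search-result line into its job name and integer fields."""
    m = LINE_RE.match(line.strip())
    if m is None:
        raise ValueError(f"unrecognized line: {line!r}")
    fields = {key: int(m[key]) for key in ("runs", "failed", "match", "impact")}
    fields["job"] = m["job"]
    return fields


def computed_impact(failed: int, match: int) -> int:
    """Assumed relation: impact% = failed% * match% / 100, rounded."""
    return round(failed * match / 100)


sample = (
    "rehearse-17604-periodic-ci-openshift-release-master-ci-4.8-upgrade-"
    "from-stable-4.7-e2e-aws-ovn-upgrade (all) - 2 runs, 50% failed, "
    "100% of failures match = 50% impact"
)
row = parse_line(sample)
print(row["job"], computed_impact(row["failed"], row["match"]), row["impact"])
```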

Comment 7 W. Trevor King 2021-04-12 21:26:11 UTC

Damien, I see you set blocker-.  But 4.7 -> 4.8 CI is dead on this error (and the presumably unrelated bug 1927264).  I don't think we can GA without green 4.7 -> 4.8 CI.  If we're ok with things like [1]:

  disruption_tests: [sig-arch] Check if alerts are firing during or after upgrade success	1h8m7s
    Apr 11 11:35:31.848: Unexpected alerts fired or pending during the upgrade:

    alert AggregatedAPIDown fired for 90 seconds with labels: {name="v1beta1.metrics.k8s.io", namespace="default", severity="warning"}
    alert ClusterMonitoringOperatorReconciliationErrors fired for 300 seconds with labels: {severity="warning"}

going off, can you add carve outs to openshift/origin so those alerts are non-fatal?  If we're not ok with them going off, can we fix them before 4.8 GAs (and set blocker+ to commit to that)?

[1]: https://prow.ci.openshift.org/view/gcs/origin-ci-test/logs/periodic-ci-openshift-release-master-nightly-4.8-upgrade-from-stable-4.7-e2e-aws-upgrade/1381182524864073728

Comment 8 W. Trevor King 2021-04-12 21:37:30 UTC

Adding the v1beta1.metrics.k8s.io description to the subject, to distinguish from the new bug 1948771 about AggregatedAPIDown for v1.packages.operators.coreos.com.

Comment 9 Damien Grisonnet 2021-04-13 07:44:06 UTC

Happy to reevaluate it as a blocker that we have to fix before 4.8 GA, thank you for raising concerns again.

On a side note, if bug 1948771 is also caused by network blips, we will most likely fix it with this Bugzilla, as the issue seems to be coming from the alert being too sensitive rather than from the aggregated APIs themselves.

Comment 11 Damien Grisonnet 2021-04-13 15:36:33 UTC

FWIW, I noticed that the `AggregatedAPIDown` alert only fires when an aggregated API is reported unavailable because it has missing endpoints. It is the only unavailability reason that seems to have an impact on the target availability that we've set for aggregated APIs, and it seems that the `MissingEndpoints` reason is only reported for `v1beta1.metrics.k8s.io` and `v1.packages.operators.coreos.com`, which aligns with what we are seeing in CI.

Thus, my hunch would be that the real reason behind this issue isn't that the alert is too sensitive but that the Endpoints of these apiservices are degraded for some reason. If this is correct, the reasons behind this BZ and bug 1948771 should be the same.

Comment 12 Damien Grisonnet 2021-04-13 15:37:47 UTC

Created attachment 1771652 [details]
aggregator_unavailable_apiservice_total missing-endpoints

Comment 13 Damien Grisonnet 2021-04-13 16:16:37 UTC

In the end, it seems to be caused by the apiservice taking too much time to claim the aggregated API. I will increase the 'for' clause of the alert to take that into account. From what I can see with the CI failures, 15 minutes should be enough for most of the CI jobs, but SNO tests seem to require 30 minutes.

Comment 14 W. Trevor King 2021-04-14 04:43:37 UTC

Is this only something that happens during install?  Because during updates, I would be surprised if the API-server drops one of these aggregated APIs and then takes multiple minutes to pick it back up.  That seems like an actual availability bug, not an overly-strict alert.  Maybe we need the e2e suite to have uptime monitors for these APIs like we already have for the Kube and OpenShift API servers, so we can more easily distinguish between false-positive and true-positive alert firings.

Comment 15 Damien Grisonnet 2021-04-14 17:24:02 UTC

Yes, this is happening during the installation.

My explanation was a bit confusing. What I meant was that the apiserver seems to be picking up the apiservice correctly, since it is able to tell that there are no active endpoints attached to it. However, the error seems to imply that the aggregated API server (prometheus-adapter in our case) takes some time to become available. But after a second look, in that case we would have seen `FailedDiscoveryCheck` rather than `MissingEndpoints` errors.

While I was at it, I also took another look at both the metrics and the Kubernetes internals and have a bit more insight. Actually, the `MissingEndpoints` and `ServiceNotFound` errors occur only for the two apiservices for which the alert is firing, namely `v1.packages.operators.coreos.com` and `v1beta1.metrics.k8s.io`. The reason behind these errors is that the apiserver is either unable to find the service targeted by the apiservice or the service port is inactive. So there might be something going wrong with the service in front of both aggregated API servers. What I fail to explain, though, is why only these two apiservices are impacted and not the others.
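
As a rough aid for following the reasoning in comment 15, here is a simplified Python model of how the three reasons named there could be told apart. It is not the kube-aggregator implementation: the `Service` type, the `unavailability_reason` function, and the order of the checks are assumptions made for illustration only.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class Service:
    """Hypothetical stand-in for the Service backing an apiservice."""
    name: str
    ready_addresses: List[str] = field(default_factory=list)


def unavailability_reason(service: Optional[Service], discovery_ok: bool) -> Optional[str]:
    """Return the reason an apiservice is unavailable, or None if it is available."""
    if service is None:
        # The Service referenced by the apiservice does not exist.
        return "ServiceNotFound"
    if not service.ready_addresses:
        # The Service exists, but no active endpoint serves its port.
        return "MissingEndpoints"
    if not discovery_ok:
        # Endpoints exist, but the aggregated API server does not answer discovery.
        return "FailedDiscoveryCheck"
    return None


adapter = Service("prometheus-adapter")
print(unavailability_reason(adapter, discovery_ok=True))   # no ready pods behind the Service
adapter.ready_addresses.append("10.128.0.27")
print(unavailability_reason(adapter, discovery_ok=False))  # pod reachable but still starting
```

In this model, a slow-starting prometheus-adapter would surface as `FailedDiscoveryCheck`, while `MissingEndpoints` points at a Service with no ready backends, which matches the distinction drawn in the comment.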

Comment 16 W. Trevor King 2021-04-14 17:33:05 UTC

Bug 1948771 was verified fixed after the packageserver deployment was moved back to having 'replicas: 2', reverting a brief attempt at 'replicas: 1'.  So in that case, the issue really was "the service has availability disrupted because it wasn't as HA as it needed to be".  I suspect the issue with v1beta1.metrics.k8s.io might also be "service is not HA enough to get through the update jobs mentioned in comment 6".  Do we have PDBs on the relevant metrics deployment yet to protect it from excessive disruption?

Comment 17 Damien Grisonnet 2021-04-14 18:10:15 UTC

Oh! Good idea, I forgot to check that and it seems that we don't have a PDB for prometheus-adapter even though we are running it in HA mode with 2 replicas. I'll add one and revert the changes made to the alert. Thank you for the suggestion.
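
For illustration, here is a hedged Python sketch of what a PodDisruptionBudget for a 2-replica prometheus-adapter could look like, together with the arithmetic for how many voluntary evictions it would allow. The `policy/v1` API version, the `minAvailable: 1` setting, and the label selector are assumptions, not the manifest that was actually merged; the helper names are hypothetical.

```python
import json


def pod_disruption_budget(name: str, namespace: str, match_labels: dict, min_available: int) -> dict:
    """Build a PodDisruptionBudget manifest as a plain dict."""
    return {
        "apiVersion": "policy/v1",  # assumption: older clusters may need policy/v1beta1
        "kind": "PodDisruptionBudget",
        "metadata": {"name": name, "namespace": namespace},
        "spec": {
            "minAvailable": min_available,
            "selector": {"matchLabels": match_labels},
        },
    }


def allowed_disruptions(healthy_pods: int, min_available: int) -> int:
    """Voluntary evictions permitted while keeping min_available pods healthy."""
    return max(0, healthy_pods - min_available)


pdb = pod_disruption_budget(
    name="prometheus-adapter",
    namespace="openshift-monitoring",
    match_labels={"app.kubernetes.io/name": "prometheus-adapter"},  # assumed label
    min_available=1,
)
print(json.dumps(pdb, indent=2))
print("evictions allowed with 2 healthy replicas:", allowed_disruptions(2, 1))
print("evictions allowed with 1 healthy replica:", allowed_disruptions(1, 1))
```

With such a budget, a node drain during an update can evict one replica but has to wait for it to come back before evicting the other, so the Service should keep at least one ready endpoint.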

Comment 18 W. Trevor King 2021-04-14 18:53:07 UTC

Also some new-ish advice in [1,2], if you want to audit against all of those recommendations as part of this.

[1]: https://github.com/openshift/enhancements/blob/ec375985d143e9b091d14dfabe5fe18258875d5f/CONVENTIONS.md#high-availability
[2]: https://github.com/openshift/enhancements/blob/ec375985d143e9b091d14dfabe5fe18258875d5f/CONVENTIONS.md#upgrade-and-reconfiguration

Comment 20 W. Trevor King 2021-04-14 23:40:27 UTC

Moving back to ASSIGNED to give Damien time to do the comment 17 stuff before QE takes a look.

Comment 21 Simon Pasquier 2021-04-15 07:12:19 UTC

We already have bug 1948711 tracking the implementation of the newest HA conventions for prometheus-adapter.

Comment 22 hongyan li 2021-04-23 07:05:23 UTC

The current bug can be verified now, since the dependent bug https://bugzilla.redhat.com/show_bug.cgi?id=1948711 is closed as verified.
Checked the result of a recent CI run of periodic-ci-openshift-release-master-nightly-4.8-upgrade-from-stable-4.7-e2e-aws-upgrade and saw no AggregatedAPIDown alert fired with labels: {name="v1beta1.metrics.k8s.io", namespace="default", severity="warning"}

Comment 23 hongyan li 2021-04-23 08:54:48 UTC

Also did an upgrade from 4.7.0-0.nightly-2021-04-22-173152 to 4.8.0-0.nightly-2021-04-23-034722, and didn't see the alert.

Comment 26 errata-xmlrpc 2021-07-27 22:54:33 UTC

Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (Moderate: OpenShift Container Platform 4.8.2 bug fix and security update), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2021:2438
