Bug 1970624
| Summary: | 4.7->4.8 updates: AggregatedAPIDown for v1beta1.metrics.k8s.io | | |
|---|---|---|---|
| Product: | OpenShift Container Platform | Reporter: | ravig <rgudimet> |
| Component: | Monitoring | Assignee: | Damien Grisonnet <dgrisonn> |
| Status: | CLOSED ERRATA | QA Contact: | Junqi Zhao <juzhao> |
| Severity: | urgent | Docs Contact: | |
| Priority: | urgent | | |
| Version: | 4.8 | CC: | alegrand, anpicker, aos-bugs, erooth, kakkoyun, mfojtik, pkrupa, surbania, wking, xxia |
| Target Milestone: | --- | Keywords: | Upgrades |
| Target Release: | 4.8.0 | | |
| Hardware: | Unspecified | | |
| OS: | Unspecified | | |
| Whiteboard: | tag-ci | | |
| Fixed In Version: | | Doc Type: | No Doc Update |
| Last Closed: | 2021-07-27 23:12:40 UTC | Type: | Bug |
Description ravig 2021-06-10 20:04:41 UTC
Same AggregatedAPIDown on v1beta1.metrics.k8s.io is mentioned in bug 1940933 too. That's now VERIFIED, and I'm not sure about the timing of the alert in that bug, but cross-linking just in case. Also linking the job-detail page for the job mentioned in comment 0 [1]. And for Sippy:

```
: [sig-arch][Feature:ClusterUpgrade] Cluster should remain functional during upgrade [Disruptive] [Serial] 1h10m34s
fail [github.com/openshift/origin/test/extended/util/disruption/disruption.go:190]: Jun 10 06:13:57.329: Unexpected alerts fired or pending during the upgrade:

alert AggregatedAPIDown fired for 210 seconds with labels: {name="v1beta1.metrics.k8s.io", namespace="default", severity="warning"}

disruption_tests: [sig-arch] Check if alerts are firing during or after upgrade success 1h10m25s
Jun 10 06:13:57.329: Unexpected alerts fired or pending during the upgrade:

alert AggregatedAPIDown fired for 210 seconds with labels: {name="v1beta1.metrics.k8s.io", namespace="default", severity="warning"}
```

[1]: https://prow.ci.openshift.org/view/gs/origin-ci-test/logs/periodic-ci-openshift-release-master-ci-4.8-upgrade-from-stable-4.7-e2e-gcp-upgrade/1402845184974131200

It's just 4.7->4.8 updates, and they're getting hit pretty hard for at least the past few days:

```
$ w3m -dump -cols 200 'https://search.ci.openshift.org/?maxAge=96h&type=junit&search=AggregatedAPIDown+fired+for.*v1beta1.metrics.k8s.io' | grep 'failures match' | sort
periodic-ci-openshift-release-master-ci-4.8-upgrade-from-from-stable-4.7-from-stable-4.6-e2e-aws-upgrade (all) - 2 runs, 100% failed, 100% of failures match = 100% impact
periodic-ci-openshift-release-master-ci-4.8-upgrade-from-stable-4.7-e2e-aws-ovn-upgrade (all) - 66 runs, 100% failed, 67% of failures match = 67% impact
periodic-ci-openshift-release-master-ci-4.8-upgrade-from-stable-4.7-e2e-aws-upgrade (all) - 57 runs, 95% failed, 65% of failures match = 61% impact
periodic-ci-openshift-release-master-ci-4.8-upgrade-from-stable-4.7-e2e-azure-ovn-upgrade (all) - 17 runs, 100% failed, 47% of failures match = 47% impact
periodic-ci-openshift-release-master-ci-4.8-upgrade-from-stable-4.7-e2e-azure-upgrade (all) - 4 runs, 100% failed, 75% of failures match = 75% impact
periodic-ci-openshift-release-master-ci-4.8-upgrade-from-stable-4.7-e2e-gcp-ovn-upgrade (all) - 16 runs, 100% failed, 50% of failures match = 50% impact
periodic-ci-openshift-release-master-ci-4.8-upgrade-from-stable-4.7-e2e-gcp-upgrade (all) - 4 runs, 100% failed, 50% of failures match = 50% impact
periodic-ci-openshift-release-master-ci-4.8-upgrade-from-stable-4.7-e2e-vsphere-upgrade (all) - 8 runs, 88% failed, 57% of failures match = 50% impact
periodic-ci-openshift-release-master-nightly-4.8-upgrade-from-stable-4.7-e2e-aws-upgrade (all) - 28 runs, 96% failed, 52% of failures match = 50% impact
periodic-ci-openshift-release-master-nightly-4.8-upgrade-from-stable-4.7-e2e-metal-ipi-upgrade (all) - 28 runs, 100% failed, 4% of failures match = 4% impact
pull-ci-openshift-ovn-kubernetes-master-4.8-upgrade-from-stable-4.7-e2e-aws-ovn-upgrade (all) - 26 runs, 92% failed, 50% of failures match = 46% impact
rehearse-18785-periodic-ci-openshift-release-master-ci-4.8-upgrade-from-stable-4.7-e2e-ovirt-upgrade (all) - 5 runs, 100% failed, 20% of failures match = 20% impact
rehearse-18877-periodic-ci-openshift-release-master-okd-4.8-upgrade-from-4.7-e2e-upgrade-gcp (all) - 1 runs, 100% failed, 100% of failures match = 100% impact
```

Suspicious, from Loki [1], the query:

```
{invoker="openshift-internal-ci/periodic-ci-openshift-release-master-ci-4.8-upgrade-from-stable-4.7-e2e-gcp-upgrade/1402845184974131200"} | unpack |= "v1beta1.metrics.k8s.io"
```

turns up:

```
2021-06-09 21:52:40 I0610 04:52:39.995169       1 apiservice.go:62] updating apiservice v1beta1.metrics.k8s.io with the service signing CA bundle
```

[1]: https://grafana-loki.ci.openshift.org/explore?orgId=1&left=%5B%22now-24h%22,%22now%22,%22Grafana%20Cloud%22,%7B%22expr%22:%22%7Binvoker%3D%5C%22openshift-internal-ci%2Fperiodic-ci-openshift-release-master-ci-4.8-upgrade-from-stable-4.7-e2e-gcp-upgrade%2F1402845184974131200%5C%22%7D%20%7C%20unpack%20%7C%20%5C%22v1beta1.metrics.k8s.io%5C%22%22%7D%5D

Ah, also not mentioned above: per [1], AggregatedAPIDown fired from 5:44 to 5:46Z, so that is from 2m after the signing CA update to 4m after the signing CA update. I'm going to move this to the API-server folks, because they'll know more about how aggregation calls fit in with CA rotation.

[1]: https://promecieus.dptools.openshift.org/?search=https://prow.ci.openshift.org/view/gs/origin-ci-test/logs/periodic-ci-openshift-release-master-ci-4.8-upgrade-from-stable-4.7-e2e-gcp-upgrade/1402845184974131200

Created attachment 1790109 [details] metrics

The revert in https://github.com/openshift/cluster-monitoring-operator/pull/1120 bubbled up the alert again. I think we have two options:

1. extending the `for` clause to 15m again, or
2. raising the threshold of the whole expression.

I suggest option 2: currently, the expression `(1 - max by(name, namespace)(avg_over_time(aggregator_unavailable_apiservice[10m]))) * 100 < 85` yields true if API aggregation is unreachable for more than 1.5 minutes of the 10-minute window. A 1.5-minute budget is very tight when facing CA rotation. I suggest we raise it to 3 minutes, i.e. `(1 - max by(name, namespace)(avg_over_time(aggregator_unavailable_apiservice[10m]))) * 100 < 70`. This will keep the expression from triggering in the first place and is easier to reason about.
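To make the arithmetic behind those thresholds explicit (a sketch, assuming `aggregator_unavailable_apiservice` is sampled as 1 while the APIService is unavailable and 0 otherwise): with t minutes of downtime inside the 10-minute window, `avg_over_time(...[10m])` is roughly t/10, so the `< 85` form fires once t exceeds 1.5 minutes and the `< 70` form once t exceeds 3 minutes. Option 2's expression, annotated:

```promql
# Availability (%) of each aggregated APIService over the trailing 10 minutes.
# With t minutes of downtime in the window, avg_over_time is ~ t/10, so this
# reads (1 - t/10) * 100 < 70: it only goes true once the APIService has been
# unavailable for more than 3 of the last 10 minutes, whereas the previous
# threshold of 85 already fired after 1.5 minutes (e.g. during CA rotation).
(1 - max by (name, namespace) (
  avg_over_time(aggregator_unavailable_apiservice[10m])
)) * 100 < 70
```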
Have not seen such an error within the last 2 days:

https://search.ci.openshift.org/?search=AggregatedAPIDown+fired+for.*v1beta1.metrics.k8s.io&maxAge=48h&context=1&type=junit&name=&excludeName=&maxMatches=5&maxBytes=20971520&groupBy=job

Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory (Moderate: OpenShift Container Platform 4.8.2 bug fix and security update), and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2021:2438