Description of problem:

While debugging upgrade failures, we noticed the following error:

fail [github.com/openshift/origin/test/extended/util/disruption/disruption.go:190]: Jun 10 06:13:57.329: Unexpected alerts fired or pending during the upgrade:

At the time the alert fired, the DNS deployment was in the middle of a rollout. We expect a DNS rollout not to have any impact on apiserver availability.

From Trevor:

```
curl -s https://storage.googleapis.com/origin-ci-test/logs/periodic-ci-openshift-release-master-ci-4.8-upgrade-from-stable-4.7-e2e-gcp-upgrade/1402845184974131200/build-log.txt | grep 'clusteroperator/.*versions.*->'
...
INFO[2021-06-10T06:14:13Z] Jun 10 05:39:50.773 I clusteroperator/cluster-autoscaler versions: operator 4.7.15 -> 4.8.0-0.ci-2021-06-10-001152
INFO[2021-06-10T06:14:13Z] Jun 10 05:41:53.343 I clusteroperator/authentication versions: oauth-apiserver 4.7.15 -> 4.8.0-0.ci-2021-06-10-001152
INFO[2021-06-10T06:14:13Z] Jun 10 05:44:13.091 I clusteroperator/network versions: operator 4.7.15 -> 4.8.0-0.ci-2021-06-10-001152
INFO[2021-06-10T06:14:13Z] Jun 10 05:44:53.444 I clusteroperator/dns versions: openshift-cli quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:7c6d0a0fed7ddb95550623aa23c434446fb99abef18e6d57b8b12add606efde8 -> registry.ci.openshift.org/ocp/4.8-2021-06-10-001152@sha256:52e918a9678d195e867e7b91c02e0be33922938d295172a554501b42f13574c9
INFO[2021-06-10T06:14:13Z] Jun 10 05:47:16.260 I clusteroperator/dns versions: kube-rbac-proxy quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:37ee4cf8184666792caa983611ab8d58dfd533c7cc7abe9f81a22a81876d9cd2 -> registry.ci.openshift.org/ocp/4.8-2021-06-10-001152@sha256:3869910c1e208b125bdecd4ac2d8b2cae42efe221c704491b86aa9b18ce95a65, operator 4.7.15 -> 4.8.0-0.ci-2021-06-10-001152, coredns quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:ad08b23717af078a89f93a097f32abe9262daf9e32d124f8b1c6437efddb82e7 -> registry.ci.openshift.org/ocp/4.8-2021-06-10-001152@sha256:bcdefdbcee8af1e634e68a850c52fe1e9cb31364525e30f5b20ee4eacb93c3e8
INFO[2021-06-10T06:14:14Z] Jun 10 06:11:18.866 I clusteroperator/machine-config versions: operator 4.7.15 -> 4.8.0-0.ci-2021-06-10-001152
```

OpenShift release version:

Cluster Platform:

How reproducible:

Steps to Reproduce (in detail):
1.
2.
3.

Actual results:

Expected results:

Impact of the problem:

Additional info:

** Please do not disregard the report template; filling the template out as much as possible will allow us to help you. Please consider attaching a must-gather archive (via `oc adm must-gather`). Please review must-gather contents for sensitive information before attaching any must-gathers to a bugzilla report. You may also mark the bug private if you wish.
The same AggregatedAPIDown on v1beta1.metrics.k8s.io was mentioned in bug 1940933 too. That bug is now VERIFIED, and I'm not sure about the timing of the alert there, but cross-linking just in case.
Also, linking the job-detail page from the job mentioned in comment 0: [1]. And for Sippy:

[sig-arch][Feature:ClusterUpgrade] Cluster should remain functional during upgrade [Disruptive] [Serial]	1h10m34s
fail [github.com/openshift/origin/test/extended/util/disruption/disruption.go:190]: Jun 10 06:13:57.329: Unexpected alerts fired or pending during the upgrade:

alert AggregatedAPIDown fired for 210 seconds with labels: {name="v1beta1.metrics.k8s.io", namespace="default", severity="warning"}

disruption_tests: [sig-arch] Check if alerts are firing during or after upgrade	success	1h10m25s
Jun 10 06:13:57.329: Unexpected alerts fired or pending during the upgrade:

alert AggregatedAPIDown fired for 210 seconds with labels: {name="v1beta1.metrics.k8s.io", namespace="default", severity="warning"}

[1]: https://prow.ci.openshift.org/view/gs/origin-ci-test/logs/periodic-ci-openshift-release-master-ci-4.8-upgrade-from-stable-4.7-e2e-gcp-upgrade/1402845184974131200
It's just 4.7->4.8 updates, and they're getting hit pretty hard for at least the past few days:

$ w3m -dump -cols 200 'https://search.ci.openshift.org/?maxAge=96h&type=junit&search=AggregatedAPIDown+fired+for.*v1beta1.metrics.k8s.io' | grep 'failures match' | sort
periodic-ci-openshift-release-master-ci-4.8-upgrade-from-from-stable-4.7-from-stable-4.6-e2e-aws-upgrade (all) - 2 runs, 100% failed, 100% of failures match = 100% impact
periodic-ci-openshift-release-master-ci-4.8-upgrade-from-stable-4.7-e2e-aws-ovn-upgrade (all) - 66 runs, 100% failed, 67% of failures match = 67% impact
periodic-ci-openshift-release-master-ci-4.8-upgrade-from-stable-4.7-e2e-aws-upgrade (all) - 57 runs, 95% failed, 65% of failures match = 61% impact
periodic-ci-openshift-release-master-ci-4.8-upgrade-from-stable-4.7-e2e-azure-ovn-upgrade (all) - 17 runs, 100% failed, 47% of failures match = 47% impact
periodic-ci-openshift-release-master-ci-4.8-upgrade-from-stable-4.7-e2e-azure-upgrade (all) - 4 runs, 100% failed, 75% of failures match = 75% impact
periodic-ci-openshift-release-master-ci-4.8-upgrade-from-stable-4.7-e2e-gcp-ovn-upgrade (all) - 16 runs, 100% failed, 50% of failures match = 50% impact
periodic-ci-openshift-release-master-ci-4.8-upgrade-from-stable-4.7-e2e-gcp-upgrade (all) - 4 runs, 100% failed, 50% of failures match = 50% impact
periodic-ci-openshift-release-master-ci-4.8-upgrade-from-stable-4.7-e2e-vsphere-upgrade (all) - 8 runs, 88% failed, 57% of failures match = 50% impact
periodic-ci-openshift-release-master-nightly-4.8-upgrade-from-stable-4.7-e2e-aws-upgrade (all) - 28 runs, 96% failed, 52% of failures match = 50% impact
periodic-ci-openshift-release-master-nightly-4.8-upgrade-from-stable-4.7-e2e-metal-ipi-upgrade (all) - 28 runs, 100% failed, 4% of failures match = 4% impact
pull-ci-openshift-ovn-kubernetes-master-4.8-upgrade-from-stable-4.7-e2e-aws-ovn-upgrade (all) - 26 runs, 92% failed, 50% of failures match = 46% impact
rehearse-18785-periodic-ci-openshift-release-master-ci-4.8-upgrade-from-stable-4.7-e2e-ovirt-upgrade (all) - 5 runs, 100% failed, 20% of failures match = 20% impact
rehearse-18877-periodic-ci-openshift-release-master-okd-4.8-upgrade-from-4.7-e2e-upgrade-gcp (all) - 1 runs, 100% failed, 100% of failures match = 100% impact
Suspicious, from Loki [1]:

{invoker="openshift-internal-ci/periodic-ci-openshift-release-master-ci-4.8-upgrade-from-stable-4.7-e2e-gcp-upgrade/1402845184974131200"} | unpack |= "v1beta1.metrics.k8s.io"

turns up:

2021-06-09 21:52:40	I0610 04:52:39.995169 1 apiservice.go:62] updating apiservice v1beta1.metrics.k8s.io with the service signing CA bundle

[1]: https://grafana-loki.ci.openshift.org/explore?orgId=1&left=%5B%22now-24h%22,%22now%22,%22Grafana%20Cloud%22,%7B%22expr%22:%22%7Binvoker%3D%5C%22openshift-internal-ci%2Fperiodic-ci-openshift-release-master-ci-4.8-upgrade-from-stable-4.7-e2e-gcp-upgrade%2F1402845184974131200%5C%22%7D%20%7C%20unpack%20%7C%20%5C%22v1beta1.metrics.k8s.io%5C%22%22%7D%5D
Ah, also not mentioned above: from [1], AggregatedAPIDown fired from 5:44 to 5:46Z, so that is from 2m after the signing CA update to 4m after the signing CA update. I'm going to move this to the API-server folks because they'll know more about how aggregation calls fit in with CA rotation.

[1]: https://promecieus.dptools.openshift.org/?search=https://prow.ci.openshift.org/view/gs/origin-ci-test/logs/periodic-ci-openshift-release-master-ci-4.8-upgrade-from-stable-4.7-e2e-gcp-upgrade/1402845184974131200
Created attachment 1790109 [details]
metrics

The revert in https://github.com/openshift/cluster-monitoring-operator/pull/1120 bubbled up the alert again. I think we have two options:

1. extend the `for` clause to 15m again, or
2. raise the threshold of the whole expression.

I suggest option 2: currently, the expression `(1 - max by(name, namespace)(avg_over_time(aggregator_unavailable_apiservice[10m]))) * 100 < 85` yields true once API aggregation has been unreachable for more than 1.5 minutes of the 10-minute lookback window (an average unavailability above 15% of 10m). A 1.5-minute budget is very tight when facing CA rotation. I suggest we raise it to 3 minutes, i.e. `(1 - max by(name, namespace)(avg_over_time(aggregator_unavailable_apiservice[10m]))) * 100 < 70`. This keeps the expression from triggering in the first place and is easier to reason about.
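For concreteness, here is a minimal sketch of how option 2 could be expressed as a PrometheusRule manifest. Only the alert name, the metric, and the proposed `< 70` threshold come from this comment; the object name, namespace, group name, `for` duration, and annotation text are illustrative assumptions, not the actual rule shipped by cluster-monitoring-operator:

```yaml
# Sketch only: the expr implements option 2 above; everything else is an
# illustrative assumption about where such a rule would live.
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: aggregated-api-availability   # hypothetical object name
  namespace: openshift-monitoring     # assumed namespace
spec:
  groups:
  - name: aggregated-apiservers.rules # assumed group name
    rules:
    - alert: AggregatedAPIDown
      # Fires only after the apiservice averaged >30% unavailability over the
      # trailing 10 minutes, i.e. roughly 3 minutes of downtime instead of 1.5.
      expr: |
        (1 - max by (name, namespace) (avg_over_time(aggregator_unavailable_apiservice[10m]))) * 100 < 70
      for: 5m                          # unchanged here; option 1 would raise this to 15m instead
      labels:
        severity: warning
      annotations:
        description: Aggregated API {{ $labels.name }}/{{ $labels.namespace }} has been less than 70% available over the last 10m.
```

Compared with option 1, this keeps the alert's `for` latency unchanged and instead widens the unavailability tolerated inside the 10m window from roughly 1.5 to 3 minutes, which matches the reasoning above.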
Have not seen this error within the last 2 days:

https://search.ci.openshift.org/?search=AggregatedAPIDown+fired+for.*v1beta1.metrics.k8s.io&maxAge=48h&context=1&type=junit&name=&excludeName=&maxMatches=5&maxBytes=20971520&groupBy=job
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory (Moderate: OpenShift Container Platform 4.8.2 bug fix and security update), and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2021:2438