Bug 1970624 - 4.7->4.8 updates: AggregatedAPIDown for v1beta1.metrics.k8s.io
Summary: 4.7->4.8 updates: AggregatedAPIDown for v1beta1.metrics.k8s.io
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: OpenShift Container Platform
Classification: Red Hat
Component: Monitoring
Version: 4.8
Hardware: Unspecified
OS: Unspecified
Priority: urgent
Severity: urgent
Target Milestone: ---
Target Release: 4.8.0
Assignee: Damien Grisonnet
QA Contact: Junqi Zhao
URL:
Whiteboard: tag-ci
Depends On:
Blocks:
 
Reported: 2021-06-10 20:04 UTC by ravig
Modified: 2021-11-01 10:07 UTC (History)
10 users

Fixed In Version:
Doc Type: No Doc Update
Doc Text:
Clone Of:
Environment:
Last Closed: 2021-07-27 23:12:40 UTC
Target Upstream Version:
Embargoed:


Attachments (Terms of Use)
metrics (756.81 KB, image/png)
2021-06-11 06:30 UTC, Sergiusz Urbaniak


Links
System ID Private Priority Status Summary Last Updated
Github openshift cluster-monitoring-operator pull 1211 0 None open Bug 1970624: jsonnet: reduce threshold of AggregatedAPIDown 2021-06-11 08:03:56 UTC
Red Hat Product Errata RHSA-2021:2438 0 None None None 2021-07-27 23:12:57 UTC

Description ravig 2021-06-10 20:04:41 UTC
Description of problem:

While debugging upgrade failures, we noticed the following error:

fail [github.com/openshift/origin/test/extended/util/disruption/disruption.go:190]: Jun 10 06:13:57.329: Unexpected alerts fired or pending during the upgrade:

During the time when the alert fired, the DNS deployment was in the process of a rollout. We expect a DNS rollout not to have any impact on apiserver availability.


From Trevor:

curl -s https://storage.googleapis.com/origin-ci-test/logs/periodic-ci-openshift-release-master-ci-4.8-upgrade-from-stable-4.7-e2e-gcp-upgrade/1402845184974131200/build-log.txt | grep 'clusteroperator/.*versions.*->'
...
INFO[2021-06-10T06:14:13Z] Jun 10 05:39:50.773 I clusteroperator/cluster-autoscaler versions: operator 4.7.15 -> 4.8.0-0.ci-2021-06-10-001152 
INFO[2021-06-10T06:14:13Z] Jun 10 05:41:53.343 I clusteroperator/authentication versions: oauth-apiserver 4.7.15 -> 4.8.0-0.ci-2021-06-10-001152 
INFO[2021-06-10T06:14:13Z] Jun 10 05:44:13.091 I clusteroperator/network versions: operator 4.7.15 -> 4.8.0-0.ci-2021-06-10-001152 
INFO[2021-06-10T06:14:13Z] Jun 10 05:44:53.444 I clusteroperator/dns versions: openshift-cli quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:7c6d0a0fed7ddb95550623aa23c434446fb99abef18e6d57b8b12add606efde8 -> registry.ci.openshift.org/ocp/4.8-2021-06-10-001152@sha256:52e918a9678d195e867e7b91c02e0be33922938d295172a554501b42f13574c9 
INFO[2021-06-10T06:14:13Z] Jun 10 05:47:16.260 I clusteroperator/dns versions: kube-rbac-proxy quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:37ee4cf8184666792caa983611ab8d58dfd533c7cc7abe9f81a22a81876d9cd2 -> registry.ci.openshift.org/ocp/4.8-2021-06-10-001152@sha256:3869910c1e208b125bdecd4ac2d8b2cae42efe221c704491b86aa9b18ce95a65, operator 4.7.15 -> 4.8.0-0.ci-2021-06-10-001152, coredns quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:ad08b23717af078a89f93a097f32abe9262daf9e32d124f8b1c6437efddb82e7 -> registry.ci.openshift.org/ocp/4.8-2021-06-10-001152@sha256:bcdefdbcee8af1e634e68a850c52fe1e9cb31364525e30f5b20ee4eacb93c3e8 
INFO[2021-06-10T06:14:14Z] Jun 10 06:11:18.866 I clusteroperator/machine-config versions: operator 4.7.15 -> 4.8.0-0.ci-2021-06-10-001152 

OpenShift release version:


Cluster Platform:


How reproducible:


Steps to Reproduce (in detail):
1.
2.
3.


Actual results:


Expected results:


Impact of the problem:


Additional info:



** Please do not disregard the report template; filling the template out as much as possible will allow us to help you. Please consider attaching a must-gather archive (via `oc adm must-gather`). Please review must-gather contents for sensitive information before attaching any must-gathers to a bugzilla report.  You may also mark the bug private if you wish.

Comment 1 W. Trevor King 2021-06-10 20:42:02 UTC
Same AggregatedAPIDown on v1beta1.metrics.k8s.io mentioned in bug 1940933 too.  That's now VERIFIED, and I'm not sure on the timing of the alert in that bug, but cross-linking just in case.

Comment 2 W. Trevor King 2021-06-10 20:44:10 UTC
Also, linking the job-detail page from the job mentioned in comment 0: [1].  And for Sippy:

  : [sig-arch][Feature:ClusterUpgrade] Cluster should remain functional during upgrade [Disruptive] [Serial]	1h10m34s
    fail [github.com/openshift/origin/test/extended/util/disruption/disruption.go:190]: Jun 10 06:13:57.329: Unexpected alerts fired or pending during the upgrade:

    alert AggregatedAPIDown fired for 210 seconds with labels: {name="v1beta1.metrics.k8s.io", namespace="default", severity="warning"}

  disruption_tests: [sig-arch] Check if alerts are firing during or after upgrade success	1h10m25s
    Jun 10 06:13:57.329: Unexpected alerts fired or pending during the upgrade:

    alert AggregatedAPIDown fired for 210 seconds with labels: {name="v1beta1.metrics.k8s.io", namespace="default", severity="warning"}

[1]: https://prow.ci.openshift.org/view/gs/origin-ci-test/logs/periodic-ci-openshift-release-master-ci-4.8-upgrade-from-stable-4.7-e2e-gcp-upgrade/1402845184974131200

Comment 3 W. Trevor King 2021-06-10 20:46:49 UTC
It's just 4.7->4.8 updates, and they're getting hit pretty hard for at least the past few days:

$ w3m -dump -cols 200 'https://search.ci.openshift.org/?maxAge=96h&type=junit&search=AggregatedAPIDown+fired+for.*v1beta1.metrics.k8s.io' | grep 'failures match' | sort
periodic-ci-openshift-release-master-ci-4.8-upgrade-from-from-stable-4.7-from-stable-4.6-e2e-aws-upgrade (all) - 2 runs, 100% failed, 100% of failures match = 100% impact
periodic-ci-openshift-release-master-ci-4.8-upgrade-from-stable-4.7-e2e-aws-ovn-upgrade (all) - 66 runs, 100% failed, 67% of failures match = 67% impact
periodic-ci-openshift-release-master-ci-4.8-upgrade-from-stable-4.7-e2e-aws-upgrade (all) - 57 runs, 95% failed, 65% of failures match = 61% impact
periodic-ci-openshift-release-master-ci-4.8-upgrade-from-stable-4.7-e2e-azure-ovn-upgrade (all) - 17 runs, 100% failed, 47% of failures match = 47% impact
periodic-ci-openshift-release-master-ci-4.8-upgrade-from-stable-4.7-e2e-azure-upgrade (all) - 4 runs, 100% failed, 75% of failures match = 75% impact
periodic-ci-openshift-release-master-ci-4.8-upgrade-from-stable-4.7-e2e-gcp-ovn-upgrade (all) - 16 runs, 100% failed, 50% of failures match = 50% impact
periodic-ci-openshift-release-master-ci-4.8-upgrade-from-stable-4.7-e2e-gcp-upgrade (all) - 4 runs, 100% failed, 50% of failures match = 50% impact
periodic-ci-openshift-release-master-ci-4.8-upgrade-from-stable-4.7-e2e-vsphere-upgrade (all) - 8 runs, 88% failed, 57% of failures match = 50% impact
periodic-ci-openshift-release-master-nightly-4.8-upgrade-from-stable-4.7-e2e-aws-upgrade (all) - 28 runs, 96% failed, 52% of failures match = 50% impact
periodic-ci-openshift-release-master-nightly-4.8-upgrade-from-stable-4.7-e2e-metal-ipi-upgrade (all) - 28 runs, 100% failed, 4% of failures match = 4% impact
pull-ci-openshift-ovn-kubernetes-master-4.8-upgrade-from-stable-4.7-e2e-aws-ovn-upgrade (all) - 26 runs, 92% failed, 50% of failures match = 46% impact
rehearse-18785-periodic-ci-openshift-release-master-ci-4.8-upgrade-from-stable-4.7-e2e-ovirt-upgrade (all) - 5 runs, 100% failed, 20% of failures match = 20% impact
rehearse-18877-periodic-ci-openshift-release-master-okd-4.8-upgrade-from-4.7-e2e-upgrade-gcp (all) - 1 runs, 100% failed, 100% of failures match = 100% impact

Comment 4 W. Trevor King 2021-06-10 21:20:51 UTC
Suspicious, from Loki [1]:

  {invoker="openshift-internal-ci/periodic-ci-openshift-release-master-ci-4.8-upgrade-from-stable-4.7-e2e-gcp-upgrade/1402845184974131200"} | unpack |= "v1beta1.metrics.k8s.io"

turns up:

  2021-06-09 21:52:40	I0610 04:52:39.995169       1 apiservice.go:62] updating apiservice v1beta1.metrics.k8s.io with the service signing CA bundle

[1]: https://grafana-loki.ci.openshift.org/explore?orgId=1&left=%5B%22now-24h%22,%22now%22,%22Grafana%20Cloud%22,%7B%22expr%22:%22%7Binvoker%3D%5C%22openshift-internal-ci%2Fperiodic-ci-openshift-release-master-ci-4.8-upgrade-from-stable-4.7-e2e-gcp-upgrade%2F1402845184974131200%5C%22%7D%20%7C%20unpack%20%7C%20%5C%22v1beta1.metrics.k8s.io%5C%22%22%7D%5D

Comment 5 W. Trevor King 2021-06-10 22:50:06 UTC
Ah, also not mentioned above: per [1], AggregatedAPIDown fired from 5:44 to 5:46Z, i.e. from 2m after the signing CA update until 4m after it.  I'm going to move this to the API-server folks because they'll know more about how aggregation calls fit in with CA rotation.

[1]: https://promecieus.dptools.openshift.org/?search=https://prow.ci.openshift.org/view/gs/origin-ci-test/logs/periodic-ci-openshift-release-master-ci-4.8-upgrade-from-stable-4.7-e2e-gcp-upgrade/1402845184974131200

Comment 6 Sergiusz Urbaniak 2021-06-11 06:30:04 UTC
Created attachment 1790109 [details]
metrics

The revert in https://github.com/openshift/cluster-monitoring-operator/pull/1120 bubbled up the alert again. I think we have two options:

1. extending the `for` clause to 15m again.
2. raising the threshold of the whole expression.

I suggest option 2.: currently, the expression `(1 - max by(name, namespace)(avg_over_time(aggregator_unavailable_apiservice[10m]))) * 100 < 85` yields true if API aggregation is unreachable for more than 1.5 minutes within the 10-minute window.

A 1.5 minute tolerance is very tight when facing CA rotation. I suggest we raise it to 3 minutes, i.e. `(1 - max by(name, namespace)(avg_over_time(aggregator_unavailable_apiservice[10m]))) * 100 < 70`. This will cause the expression not to trigger in the first place and is easier to reason about.
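To illustrate the arithmetic behind the two thresholds: the expression averages the 0/1 metric `aggregator_unavailable_apiservice` over a 10-minute window, so an availability threshold of X% tolerates roughly (1 - X/100) x 10 minutes of continuous unavailability before the expression becomes true. A minimal sketch (the function name is illustrative, not part of the actual alert rule):

```python
# Sketch: how the AggregatedAPIDown availability threshold maps to the
# amount of continuous downtime tolerated within the evaluation window.
# avg_over_time(aggregator_unavailable_apiservice[10m]) is the fraction
# of the window the apiservice was unavailable, so availability drops
# below X% once downtime exceeds (1 - X/100) * 10 minutes.

def tolerated_downtime_minutes(threshold_pct: float, window_minutes: float = 10.0) -> float:
    """Return how many minutes the apiservice can be fully unavailable
    within the window before availability falls below threshold_pct."""
    return (1.0 - threshold_pct / 100.0) * window_minutes

# Current threshold of 85% tolerates about 1.5 minutes of downtime;
# the proposed 70% threshold tolerates about 3 minutes.
print(tolerated_downtime_minutes(85))
print(tolerated_downtime_minutes(70))
```

This is why lowering the comparison value from 85 to 70 doubles the downtime the alert absorbs, which comfortably covers the ~2 minute gap observed after the signing CA rotation in comment 5.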

Comment 11 errata-xmlrpc 2021-07-27 23:12:40 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (Moderate: OpenShift Container Platform 4.8.2 bug fix and security update), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2021:2438

