Bug 1942740 - [sig-arch] Check if alerts are firing during or after upgrade success
Summary: [sig-arch] Check if alerts are firing during or after upgrade success
Keywords:
Status: CLOSED DUPLICATE of bug 1928946
Alias: None
Product: OpenShift Container Platform
Classification: Red Hat
Component: kube-apiserver
Version: 4.8
Hardware: Unspecified
OS: Unspecified
Priority: unspecified
Severity: unspecified
Target Milestone: ---
Target Release: ---
Assignee: Stefan Schimanski
QA Contact: Ke Wang
URL:
Whiteboard:
Depends On:
Blocks:
 
Reported: 2021-03-24 20:02 UTC by Michael Gugino
Modified: 2021-03-25 08:47 UTC
CC: 4 users

Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Environment:
[sig-arch] Check if alerts are firing during or after upgrade success
Last Closed: 2021-03-25 08:47:04 UTC
Target Upstream Version:
Embargoed:



Description Michael Gugino 2021-03-24 20:02:28 UTC
test:
[sig-arch] Check if alerts are firing during or after upgrade success 

is failing frequently in CI, see search results:
https://search.ci.openshift.org/?maxAge=168h&context=1&type=bug%2Bjunit&name=&maxMatches=5&maxBytes=20971520&groupBy=job&search=%5C%5Bsig-arch%5C%5D+Check+if+alerts+are+firing+during+or+after+upgrade+success

https://prow.ci.openshift.org/view/gs/origin-ci-test/logs/periodic-ci-openshift-release-master-ci-4.8-upgrade-from-stable-4.7-e2e-aws-ovn-upgrade/1374760429821104128


fail [github.com/openshift/origin/test/extended/util/disruption/controlplane/controlplane.go:118]: Mar 24 18:27:40.333: API "oauth-api-available-new-connections" was unreachable during disruption for at least 22s of 1h17m37s (0%):

Mar 24 17:19:22.233 E oauth-apiserver-new-connection oauth-apiserver-new-connection started failing: Get "https://api.**************-8d118.origin-ci-int-aws.dev.rhcloud.com:6443/apis/oauth.openshift.io/v1/oauthclients": dial tcp 3.216.225.134:6443: connect: connection refused
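
For context, the failing check at controlplane.go:118 is a disruption monitor: it repeatedly opens new connections to an API endpoint and sums up the intervals in which those probes fail. Below is a minimal sketch of that pattern; it is not the actual openshift/origin monitor, and the target URL, one-second interval, and one-minute run are illustrative assumptions only.

  // Minimal sketch (not the actual origin monitor): probe the endpoint with
  // a fresh connection each second and accumulate the disrupted duration.
  package main

  import (
      "crypto/tls"
      "fmt"
      "net/http"
      "time"
  )

  func main() {
      // Hypothetical endpoint; the real check targets the cluster's own API.
      target := "https://example-api:6443/apis/oauth.openshift.io/v1/oauthclients"
      interval := time.Second
      var disrupted time.Duration

      client := &http.Client{
          Timeout: 5 * time.Second,
          Transport: &http.Transport{
              // New connection per probe, mirroring the "new-connections"
              // flavor of the availability check.
              DisableKeepAlives: true,
              TLSClientConfig:   &tls.Config{InsecureSkipVerify: true}, // CI would use the cluster CA
          },
      }

      for i := 0; i < 60; i++ { // one minute here; the real monitor runs for the whole upgrade
          resp, err := client.Get(target)
          if err != nil {
              // e.g. "dial tcp ...: connect: connection refused" while an
              // apiserver restarts non-gracefully.
              disrupted += interval
          } else {
              resp.Body.Close()
          }
          time.Sleep(interval)
      }
      fmt.Printf("API unreachable for %s of the run\n", disrupted)
  }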

Comment 1 W. Trevor King 2021-03-24 20:13:06 UTC
In the linked job [1], the relevant API-server alert was:

  alert AggregatedAPIDown fired for 180 seconds with labels: {name="v1beta1.metrics.k8s.io", namespace="default", severity="warning"}

I'm a bit fuzzy on the details, but this might be a dup of bug 1928946.  If so, it's probably worth mentioning the AggregatedAPIDown alert and the:

  [sig-arch] Check if alerts are firing during or after upgrade success

test case in that bug, to help Sippy find it.

[1]: https://prow.ci.openshift.org/view/gs/origin-ci-test/logs/periodic-ci-openshift-release-master-ci-4.8-upgrade-from-stable-4.7-e2e-aws-ovn-upgrade/1374760429821104128
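
For reference, this test works by asking the in-cluster monitoring stack which alerts fired. A simplified sketch of that kind of check, using the Prometheus Go client, follows; the querier address and the Watchdog-only exclusion are illustrative assumptions, not the exact values openshift/origin configures.

  // Simplified sketch of an "are alerts firing?" check using the Prometheus
  // Go client against the cluster's querier.
  package main

  import (
      "context"
      "fmt"
      "time"

      "github.com/prometheus/client_golang/api"
      promv1 "github.com/prometheus/client_golang/api/prometheus/v1"
      "github.com/prometheus/common/model"
  )

  func main() {
      client, err := api.NewClient(api.Config{
          Address: "https://thanos-querier.example:9091", // hypothetical in-cluster querier route
      })
      if err != nil {
          panic(err)
      }
      ctx, cancel := context.WithTimeout(context.Background(), 30*time.Second)
      defer cancel()

      // Watchdog fires by design, so exclude it; info-level alerts are noise here.
      query := `ALERTS{alertstate="firing",alertname!="Watchdog",severity!="info"}`
      result, warnings, err := promv1.NewAPI(client).Query(ctx, query, time.Now())
      if err != nil {
          panic(err)
      }
      if len(warnings) > 0 {
          fmt.Println("warnings:", warnings)
      }
      if vec, ok := result.(model.Vector); ok {
          for _, sample := range vec {
              // e.g. AggregatedAPIDown{name="v1beta1.metrics.k8s.io", namespace="default"}
              fmt.Printf("firing: %s\n", sample.Metric)
          }
      }
  }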

Comment 2 Stefan Schimanski 2021-03-25 08:47:04 UTC
In the same run:

kube-apiserver reports a non-graceful termination 4 times.  Kubelet or CRI-O is probably not giving it enough time to shut down cleanly. This can lead to connection refused and network I/O timeout errors in other components.

ns/openshift-kube-apiserver pod/kube-apiserver-ip-10-0-219-80.ec2.internal node/ip-10-0-219-80 - reason/NonGracefulTermination Previous pod kube-apiserver-ip-10-0-219-80.ec2.internal started at 2021-03-24 17:47:35.168007668 +0000 UTC did not terminate gracefully
ns/openshift-kube-apiserver pod/kube-apiserver-ip-10-0-129-73.ec2.internal node/ip-10-0-129-73 - reason/NonGracefulTermination Previous pod kube-apiserver-ip-10-0-129-73.ec2.internal started at 2021-03-24 17:51:42.007655855 +0000 UTC did not terminate gracefully
ns/openshift-kube-apiserver pod/kube-apiserver-ip-10-0-129-73.ec2.internal node/ip-10-0-129-73 - reason/NonGracefulTermination Previous pod kube-apiserver-ip-10-0-129-73.ec2.internal started at 2021-03-24 17:51:42.007655855 +0000 UTC did not terminate gracefully
ns/openshift-kube-apiserver pod/kube-apiserver-ip-10-0-145-108.ec2.internal node/ip-10-0-145-108 - reason/NonGracefulTermination Previous pod kube-apiserver-ip-10-0-145-108.ec2.internal started at 2021-03-24 17:56:45.743304935 +0000 UTC did not terminate gracefully
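
These events are what graceful termination is supposed to prevent: on SIGTERM the old apiserver should stop accepting new connections and drain in-flight requests before the process exits; if it is killed before that completes, new connections get "connection refused", as seen in the description. A generic Go sketch of the shutdown pattern follows; the port and the 70-second drain window are illustrative, not kube-apiserver's actual configuration.

  // Generic graceful-shutdown pattern: on SIGTERM, stop accepting new
  // connections and drain in-flight requests before exiting.
  package main

  import (
      "context"
      "log"
      "net/http"
      "os"
      "os/signal"
      "syscall"
      "time"
  )

  func main() {
      http.HandleFunc("/healthz", func(w http.ResponseWriter, r *http.Request) {
          w.Write([]byte("ok"))
      })
      srv := &http.Server{Addr: ":8443"}

      go func() {
          if err := srv.ListenAndServe(); err != nil && err != http.ErrServerClosed {
              log.Fatal(err)
          }
      }()

      // Wait for the termination signal sent by kubelet/CRI-O.
      sig := make(chan os.Signal, 1)
      signal.Notify(sig, syscall.SIGTERM)
      <-sig

      // If the process is killed before Shutdown returns, clients dialing
      // the old endpoint see "connection refused", the symptom above.
      ctx, cancel := context.WithTimeout(context.Background(), 70*time.Second)
      defer cancel()
      if err := srv.Shutdown(ctx); err != nil {
          log.Printf("drain did not complete: %v", err)
      }
  }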

*** This bug has been marked as a duplicate of bug 1928946 ***

