Bug 1942740

Summary: [sig-arch] Check if alerts are firing during or after upgrade success
Product: OpenShift Container Platform
Component: kube-apiserver
Version: 4.8
Reporter: Michael Gugino <mgugino>
Assignee: Stefan Schimanski <sttts>
QA Contact: Ke Wang <kewang>
CC: aos-bugs, mfojtik, wking, xxia
Status: CLOSED DUPLICATE
Severity: unspecified
Priority: unspecified
Target Milestone: ---
Target Release: ---
Hardware: Unspecified
OS: Unspecified
Doc Type: If docs needed, set a value
Environment: [sig-arch] Check if alerts are firing during or after upgrade success
Type: Bug
Last Closed: 2021-03-25 08:47:04 UTC

Description Michael Gugino 2021-03-24 20:02:28 UTC
test:
[sig-arch] Check if alerts are firing during or after upgrade success 

is failing frequently in CI, see search results:
https://search.ci.openshift.org/?maxAge=168h&context=1&type=bug%2Bjunit&name=&maxMatches=5&maxBytes=20971520&groupBy=job&search=%5C%5Bsig-arch%5C%5D+Check+if+alerts+are+firing+during+or+after+upgrade+success

https://prow.ci.openshift.org/view/gs/origin-ci-test/logs/periodic-ci-openshift-release-master-ci-4.8-upgrade-from-stable-4.7-e2e-aws-ovn-upgrade/1374760429821104128


fail [github.com/openshift/origin/test/extended/util/disruption/controlplane/controlplane.go:118]: Mar 24 18:27:40.333: API "oauth-api-available-new-connections" was unreachable during disruption for at least 22s of 1h17m37s (0%):

Mar 24 17:19:22.233 E oauth-apiserver-new-connection oauth-apiserver-new-connection started failing: Get "https://api.**************-8d118.origin-ci-int-aws.dev.rhcloud.com:6443/apis/oauth.openshift.io/v1/oauthclients": dial tcp 3.216.225.134:6443: connect: connection refused
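
For illustration, here is a minimal Go sketch of how a disruption figure like "unreachable for at least 22s of 1h17m37s" could be derived by polling the endpoint. This is not the origin test's actual implementation (that lives under test/extended/util/disruption/controlplane); the target URL, interval, and duration below are placeholders.

// disruption_probe.go: a minimal sketch (not the origin test's code) of
// measuring how long an API endpoint is unreachable during an upgrade.
package main

import (
	"fmt"
	"net/http"
	"time"
)

func main() {
	// Hypothetical endpoint; the real test hits the cluster's oauth-apiserver.
	const target = "https://api.example-cluster.example.com:6443/apis/oauth.openshift.io/v1/oauthclients"
	const interval = time.Second
	const duration = 5 * time.Minute // the CI run above observed ~1h17m37s

	client := &http.Client{Timeout: 3 * time.Second}
	var unavailable time.Duration

	start := time.Now()
	for time.Since(start) < duration {
		// Any transport error (e.g. "connection refused") counts as disruption.
		resp, err := client.Get(target)
		if err != nil {
			unavailable += interval
		} else {
			resp.Body.Close()
		}
		time.Sleep(interval)
	}

	fmt.Printf("unreachable for %s of %s (%.0f%%)\n",
		unavailable, duration, 100*unavailable.Seconds()/duration.Seconds())
}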

Comment 1 W. Trevor King 2021-03-24 20:13:06 UTC
In the linked job [1], the relevant API-server alert was:

  alert AggregatedAPIDown fired for 180 seconds with labels: {name="v1beta1.metrics.k8s.io", namespace="default", severity="warning"}

I'm a bit fuzzy on the details, but this might be a dup of bug 1928946.  If so, we should probably mention the AggregatedAPIDown alert and the:

  [sig-arch] Check if alerts are firing during or after upgrade success

test case in that bug, to help Sippy find it.

[1]: https://prow.ci.openshift.org/view/gs/origin-ci-test/logs/periodic-ci-openshift-release-master-ci-4.8-upgrade-from-stable-4.7-e2e-aws-ovn-upgrade/1374760429821104128
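
For context, a minimal Go sketch of roughly what the "[sig-arch] Check if alerts are firing during or after upgrade success" test asserts on: querying the Prometheus ALERTS series for alerts that fired during the run. This is an illustration only, not the test's actual code; the Prometheus URL is a placeholder, and the Watchdog exclusion stands in for the test's real allow-list.

// firing_alerts.go: a sketch, assuming access to a Prometheus HTTP API,
// of listing alerts firing during an upgrade window.
package main

import (
	"encoding/json"
	"fmt"
	"net/http"
	"net/url"
)

type promResponse struct {
	Data struct {
		Result []struct {
			Metric map[string]string `json:"metric"`
		} `json:"result"`
	} `json:"data"`
}

func main() {
	// Hypothetical Prometheus endpoint.
	base := "http://prometheus.example.com:9090"
	// ALERTS is a built-in Prometheus series; exclude alerts expected to fire
	// during normal operation, such as Watchdog.
	query := `ALERTS{alertstate="firing",alertname!="Watchdog"}`

	resp, err := http.Get(base + "/api/v1/query?query=" + url.QueryEscape(query))
	if err != nil {
		panic(err)
	}
	defer resp.Body.Close()

	var pr promResponse
	if err := json.NewDecoder(resp.Body).Decode(&pr); err != nil {
		panic(err)
	}
	for _, r := range pr.Data.Result {
		fmt.Printf("alert %s firing with labels %v\n", r.Metric["alertname"], r.Metric)
	}
}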

Comment 2 Stefan Schimanski 2021-03-25 08:47:04 UTC
In the same run:

kube-apiserver reports a non-graceful termination 4 times.  Probably kubelet or CRI-O is not giving it the time to shut down cleanly. This can lead to connection refused and network I/O timeout errors in other components.

ns/openshift-kube-apiserver pod/kube-apiserver-ip-10-0-219-80.ec2.internal node/ip-10-0-219-80 - reason/NonGracefulTermination Previous pod kube-apiserver-ip-10-0-219-80.ec2.internal started at 2021-03-24 17:47:35.168007668 +0000 UTC did not terminate gracefully
ns/openshift-kube-apiserver pod/kube-apiserver-ip-10-0-129-73.ec2.internal node/ip-10-0-129-73 - reason/NonGracefulTermination Previous pod kube-apiserver-ip-10-0-129-73.ec2.internal started at 2021-03-24 17:51:42.007655855 +0000 UTC did not terminate gracefully
ns/openshift-kube-apiserver pod/kube-apiserver-ip-10-0-129-73.ec2.internal node/ip-10-0-129-73 - reason/NonGracefulTermination Previous pod kube-apiserver-ip-10-0-129-73.ec2.internal started at 2021-03-24 17:51:42.007655855 +0000 UTC did not terminate gracefully
ns/openshift-kube-apiserver pod/kube-apiserver-ip-10-0-145-108.ec2.internal node/ip-10-0-145-108 - reason/NonGracefulTermination Previous pod kube-apiserver-ip-10-0-145-108.ec2.internal started at 2021-03-24 17:56:45.743304935 +0000 UTC did not terminate gracefully
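
For reference, a minimal Go sketch of the graceful-termination pattern the comment refers to: a server drains connections on SIGTERM within a grace period, and if the kubelet or CRI-O kills the process before the drain finishes, in-flight clients see "connection refused" instead of a clean close. This is not kube-apiserver's actual shutdown code; the port and timeout are illustrative.

// graceful_shutdown.go: a sketch of draining a server on SIGTERM within a
// grace period, illustrating why a process needs time to terminate cleanly.
package main

import (
	"context"
	"log"
	"net/http"
	"os"
	"os/signal"
	"syscall"
	"time"
)

func main() {
	srv := &http.Server{Addr: ":8443"}

	go func() {
		if err := srv.ListenAndServe(); err != http.ErrServerClosed {
			log.Fatalf("server error: %v", err)
		}
	}()

	// Wait for SIGTERM, which the kubelet sends at the start of pod deletion.
	stop := make(chan os.Signal, 1)
	signal.Notify(stop, syscall.SIGTERM, os.Interrupt)
	<-stop

	// Drain existing connections; the grace period must fit inside the pod's
	// terminationGracePeriodSeconds, or the runtime will SIGKILL the process.
	ctx, cancel := context.WithTimeout(context.Background(), 60*time.Second)
	defer cancel()
	if err := srv.Shutdown(ctx); err != nil {
		log.Printf("shutdown did not complete cleanly: %v", err)
	}
}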

*** This bug has been marked as a duplicate of bug 1928946 ***