Bug 1942740

Summary: [sig-arch] Check if alerts are firing during or after upgrade success
Product: OpenShift Container Platform
Component: kube-apiserver
Version: 4.8
Reporter: Michael Gugino <mgugino>
Assignee: Stefan Schimanski <sttts>
QA Contact: Ke Wang <kewang>
CC: aos-bugs, mfojtik, wking, xxia
Status: CLOSED DUPLICATE
Severity: unspecified
Priority: unspecified
Target Milestone: ---
Target Release: ---
Hardware: Unspecified
OS: Unspecified
Doc Type: If docs needed, set a value
Environment: [sig-arch] Check if alerts are firing during or after upgrade success
Type: Bug
Last Closed: 2021-03-25 08:47:04 UTC

Description Michael Gugino 2021-03-24 20:02:28 UTC
test:
[sig-arch] Check if alerts are firing during or after upgrade success 

is failing frequently in CI, see search results:
https://search.ci.openshift.org/?maxAge=168h&context=1&type=bug%2Bjunit&name=&maxMatches=5&maxBytes=20971520&groupBy=job&search=%5C%5Bsig-arch%5C%5D+Check+if+alerts+are+firing+during+or+after+upgrade+success

https://prow.ci.openshift.org/view/gs/origin-ci-test/logs/periodic-ci-openshift-release-master-ci-4.8-upgrade-from-stable-4.7-e2e-aws-ovn-upgrade/1374760429821104128


fail [github.com/openshift/origin/test/extended/util/disruption/controlplane/controlplane.go:118]: Mar 24 18:27:40.333: API "oauth-api-available-new-connections" was unreachable during disruption for at least 22s of 1h17m37s (0%):

Mar 24 17:19:22.233 E oauth-apiserver-new-connection oauth-apiserver-new-connection started failing: Get "https://api.**************-8d118.origin-ci-int-aws.dev.rhcloud.com:6443/apis/oauth.openshift.io/v1/oauthclients": dial tcp 3.216.225.134:6443: connect: connection refused
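
For illustration, here is a minimal Go sketch of how a disruption figure like "unreachable for at least 22s of 1h17m37s" could be derived by polling the endpoint. This is not the origin test's actual implementation (that lives under test/extended/util/disruption/controlplane); the target URL, interval, and duration below are placeholders.

// disruption_probe.go: a minimal sketch (not the origin test's code) of
// measuring how long an API endpoint is unreachable during an upgrade.
package main

import (
	"fmt"
	"net/http"
	"time"
)

func main() {
	// Hypothetical endpoint; the real test hits the cluster's oauth-apiserver.
	const target = "https://api.example-cluster.example.com:6443/apis/oauth.openshift.io/v1/oauthclients"
	const interval = time.Second
	const duration = 5 * time.Minute // the CI run above observed ~1h17m37s

	client := &http.Client{Timeout: 3 * time.Second}
	var unavailable time.Duration

	start := time.Now()
	for time.Since(start) < duration {
		// Any transport error (e.g. "connection refused") counts as disruption.
		resp, err := client.Get(target)
		if err != nil {
			unavailable += interval
		} else {
			resp.Body.Close()
		}
		time.Sleep(interval)
	}

	fmt.Printf("unreachable for %s of %s (%.0f%%)\n",
		unavailable, duration, 100*unavailable.Seconds()/duration.Seconds())
}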

Comment 1 W. Trevor King 2021-03-24 20:13:06 UTC
In the linked job [1], the relevant API-server alert was:

  alert AggregatedAPIDown fired for 180 seconds with labels: {name="v1beta1.metrics.k8s.io", namespace="default", severity="warning"}

I'm a bit fuzzy on the details, but this might be a dup of bug 1928946.  If so, we should probably mention the AggregatedAPIDown alert and the:

  [sig-arch] Check if alerts are firing during or after upgrade success

test case in that bug, to help Sippy find it.

[1]: https://prow.ci.openshift.org/view/gs/origin-ci-test/logs/periodic-ci-openshift-release-master-ci-4.8-upgrade-from-stable-4.7-e2e-aws-ovn-upgrade/1374760429821104128
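
For context, a minimal Go sketch of roughly what the "[sig-arch] Check if alerts are firing during or after upgrade success" test asserts on: querying the Prometheus ALERTS series for alerts that fired during the run. This is an illustration only, not the test's actual code; the Prometheus URL is a placeholder, and the Watchdog exclusion stands in for the test's real allow-list.

// firing_alerts.go: a sketch, assuming access to a Prometheus HTTP API,
// of listing alerts firing during an upgrade window.
package main

import (
	"encoding/json"
	"fmt"
	"net/http"
	"net/url"
)

type promResponse struct {
	Data struct {
		Result []struct {
			Metric map[string]string `json:"metric"`
		} `json:"result"`
	} `json:"data"`
}

func main() {
	// Hypothetical Prometheus endpoint.
	base := "http://prometheus.example.com:9090"
	// ALERTS is a built-in Prometheus series; exclude alerts expected to fire
	// during normal operation, such as Watchdog.
	query := `ALERTS{alertstate="firing",alertname!="Watchdog"}`

	resp, err := http.Get(base + "/api/v1/query?query=" + url.QueryEscape(query))
	if err != nil {
		panic(err)
	}
	defer resp.Body.Close()

	var pr promResponse
	if err := json.NewDecoder(resp.Body).Decode(&pr); err != nil {
		panic(err)
	}
	for _, r := range pr.Data.Result {
		fmt.Printf("alert %s firing with labels %v\n", r.Metric["alertname"], r.Metric)
	}
}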

Comment 2 Stefan Schimanski 2021-03-25 08:47:04 UTC
In the same run:

kube-apiserver reports a non-graceful termination 4 times.  Probably kubelet or CRI-O is not giving it the time to shut down cleanly. This can lead to connection refused and network I/O timeout errors in other components.

ns/openshift-kube-apiserver pod/kube-apiserver-ip-10-0-219-80.ec2.internal node/ip-10-0-219-80 - reason/NonGracefulTermination Previous pod kube-apiserver-ip-10-0-219-80.ec2.internal started at 2021-03-24 17:47:35.168007668 +0000 UTC did not terminate gracefully
ns/openshift-kube-apiserver pod/kube-apiserver-ip-10-0-129-73.ec2.internal node/ip-10-0-129-73 - reason/NonGracefulTermination Previous pod kube-apiserver-ip-10-0-129-73.ec2.internal started at 2021-03-24 17:51:42.007655855 +0000 UTC did not terminate gracefully
ns/openshift-kube-apiserver pod/kube-apiserver-ip-10-0-129-73.ec2.internal node/ip-10-0-129-73 - reason/NonGracefulTermination Previous pod kube-apiserver-ip-10-0-129-73.ec2.internal started at 2021-03-24 17:51:42.007655855 +0000 UTC did not terminate gracefully
ns/openshift-kube-apiserver pod/kube-apiserver-ip-10-0-145-108.ec2.internal node/ip-10-0-145-108 - reason/NonGracefulTermination Previous pod kube-apiserver-ip-10-0-145-108.ec2.internal started at 2021-03-24 17:56:45.743304935 +0000 UTC did not terminate gracefully
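
For reference, a minimal Go sketch of the graceful-termination pattern the comment refers to: a server drains connections on SIGTERM within a grace period, and if the kubelet or CRI-O kills the process before the drain finishes, in-flight clients see "connection refused" instead of a clean close. This is not kube-apiserver's actual shutdown code; the port and timeout are illustrative.

// graceful_shutdown.go: a sketch of draining a server on SIGTERM within a
// grace period, illustrating why a process needs time to terminate cleanly.
package main

import (
	"context"
	"log"
	"net/http"
	"os"
	"os/signal"
	"syscall"
	"time"
)

func main() {
	srv := &http.Server{Addr: ":8443"}

	go func() {
		if err := srv.ListenAndServe(); err != http.ErrServerClosed {
			log.Fatalf("server error: %v", err)
		}
	}()

	// Wait for SIGTERM, which the kubelet sends at the start of pod deletion.
	stop := make(chan os.Signal, 1)
	signal.Notify(stop, syscall.SIGTERM, os.Interrupt)
	<-stop

	// Drain existing connections; the grace period must fit inside the pod's
	// terminationGracePeriodSeconds, or the runtime will SIGKILL the process.
	ctx, cancel := context.WithTimeout(context.Background(), 60*time.Second)
	defer cancel()
	if err := srv.Shutdown(ctx); err != nil {
		log.Printf("shutdown did not complete cleanly: %v", err)
	}
}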

*** This bug has been marked as a duplicate of bug 1928946 ***