Bug 1886563

Summary: [sig-api-machinery][Feature:APIServer][Late] kubelet terminates kube-apiserver gracefully
Product: OpenShift Container Platform
Component: Node
Version: 4.6
Hardware: Unspecified
OS: Unspecified
Status: CLOSED DUPLICATE
Severity: unspecified
Priority: unspecified
Reporter: Benjamin Gilbert <bgilbert>
Assignee: Stefan Schimanski <sttts>
QA Contact: Ke Wang <kewang>
CC: aos-bugs, deads, jokerman, mfojtik, xxia
Environment: [sig-api-machinery][Feature:APIServer][Late] kubelet terminates kube-apiserver gracefully
Type: Bug
Last Closed: 2020-10-08 19:24:12 UTC

Description Benjamin Gilbert 2020-10-08 18:43:46 UTC
test:
[sig-api-machinery][Feature:APIServer][Late] kubelet terminates kube-apiserver gracefully

is failing frequently in CI; see the search results:
https://search.ci.openshift.org/?maxAge=168h&context=1&type=bug%2Bjunit&name=&maxMatches=5&maxBytes=20971520&groupBy=job&search=%5C%5Bsig-api-machinery%5C%5D%5C%5BFeature%3AAPIServer%5C%5D%5C%5BLate%5C%5D+kubelet+terminates+kube-apiserver+gracefully

https://prow.ci.openshift.org/view/gs/origin-ci-test/logs/release-openshift-ocp-installer-e2e-metal-compact-4.6/1314250018076495872

fail [github.com/onsi/ginkgo.0-origin.1+incompatible/internal/leafnodes/runner.go:64]: kube-apiserver reports a non-graceful termination: v1.Event{TypeMeta:v1.TypeMeta{Kind:"", APIVersion:""}, ObjectMeta:v1.ObjectMeta{Name:"kube-apiserver-master-1.ci-op-jwq0wjcr-35904.origin-ci-int-aws.dev.rhcloud.com.163c1542b8f9d61a", GenerateName:"", Namespace:"openshift-kube-apiserver", SelfLink:"/api/v1/namespaces/openshift-kube-apiserver/events/kube-apiserver-master-1.ci-op-jwq0wjcr-35904.origin-ci-int-aws.dev.rhcloud.com.163c1542b8f9d61a", UID:"a55424d2-daf0-4ce1-8de1-1d414b5233da", ResourceVersion:"21297", Generation:0, CreationTimestamp:v1.Time{Time:time.Time{wall:0x0, ext:63737775743, loc:(*time.Location)(0x9003460)}}, DeletionTimestamp:(*v1.Time)(nil), DeletionGracePeriodSeconds:(*int64)(nil), Labels:map[string]string(nil), Annotations:map[string]string(nil), OwnerReferences:[]v1.OwnerReference(nil), Finalizers:[]string(nil), ClusterName:"", ManagedFields:[]v1.ManagedFieldsEntry{v1.ManagedFieldsEntry{Manager:"watch-termination", Operation:"Update", APIVersion:"v1", Time:(*v1.Time)(0xc001ebea00), FieldsType:"FieldsV1", FieldsV1:(*v1.FieldsV1)(0xc001ebea20)}}}, InvolvedObject:v1.ObjectReference{Kind:"Pod", Namespace:"openshift-kube-apiserver", Name:"kube-apiserver-master-1.ci-op-jwq0wjcr-35904.origin-ci-int-aws.dev.rhcloud.com", UID:"", APIVersion:"v1", ResourceVersion:"", FieldPath:""}, Reason:"NonGracefulTermination", Message:"Previous pod kube-apiserver-master-1.ci-op-jwq0wjcr-35904.origin-ci-int-aws.dev.rhcloud.com started at 2020-10-08 17:33:31.267794933 +0000 UTC did not terminate gracefully", Source:v1.EventSource{Component:"apiserver", Host:"master-1.ci-op-jwq0wjcr-35904.origin-ci-int-aws.dev.rhcloud.com"}, FirstTimestamp:v1.Time{Time:time.Time{wall:0x0, ext:63737775743, loc:(*time.Location)(0x9003460)}}, LastTimestamp:v1.Time{Time:time.Time{wall:0x0, ext:63737775743, loc:(*time.Location)(0x9003460)}}, Count:1, Type:"Warning", EventTime:v1.MicroTime{Time:time.Time{wall:0x0, ext:0, loc:(*time.Location)(nil)}}, Series:(*v1.EventSeries)(nil), Action:"", Related:(*v1.ObjectReference)(nil), ReportingController:"", ReportingInstance:""}. Probably kubelet or CRI-O is not giving the time to cleanly shut down. This can lead to connection refused and network I/O timeout errors in other components.
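
The event above is recorded by the "watch-termination" wrapper (visible as the manager in the event's ManagedFields), which flags at startup that the previous kube-apiserver pod on the node did not shut down cleanly. To look for the same signal the test asserts on in a live cluster, one can filter events by the reason string from the failure. A sketch, assuming an oc client with access to the affected cluster:

    oc get events -n openshift-kube-apiserver --field-selector reason=NonGracefulTermination

Events are only retained for a limited TTL, so an empty result just means no non-graceful termination was recorded recently.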

Comment 1 David Eads 2020-10-08 19:20:58 UTC
What's noteworthy is less the failure itself and more the sudden shift in failure frequency for some jobs:

release-openshift-ocp-installer-e2e-azure-4.6                     88.00% (0.00%) (25 runs)    100.00% (0.00%) (39 runs)
release-openshift-origin-installer-e2e-remote-libvirt-s390x-4.6   90.00% (0.00%) (10 runs)    100.00% (0.00%) (6 runs)
release-openshift-ocp-installer-e2e-aws-4.6                       92.00% (0.00%) (50 runs)     97.06% (0.00%) (68 runs)
periodic-ci-openshift-release-master-ocp-4.6-e2e-vsphere          93.94% (0.00%) (33 runs)    100.00% (0.00%) (35 runs)

@sjenning: this suddenly got a lot more severe and would explain the 10% increase in availability downtime we see during upgrades.
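
For reference, "terminates gracefully" here means the apiserver gets to drain in-flight work after SIGTERM arrives and before the kubelet's grace period expires and SIGKILL follows. A minimal sketch of that drain-on-SIGTERM pattern, using a plain net/http server rather than the actual kube-apiserver code:

    package main

    import (
        "context"
        "log"
        "net/http"
        "os/signal"
        "syscall"
        "time"
    )

    func main() {
        srv := &http.Server{Addr: ":8443"}

        // ctx is cancelled when the kubelet (via the container runtime)
        // delivers SIGTERM to the container's main process.
        ctx, stop := signal.NotifyContext(context.Background(), syscall.SIGTERM)
        defer stop()

        go func() {
            if err := srv.ListenAndServe(); err != nil && err != http.ErrServerClosed {
                log.Fatal(err)
            }
        }()

        <-ctx.Done()

        // Drain must finish within the pod's terminationGracePeriodSeconds;
        // the 60s here is illustrative. If the kubelet or CRI-O kills the
        // process before Shutdown returns, termination is non-graceful.
        shutdownCtx, cancel := context.WithTimeout(context.Background(), 60*time.Second)
        defer cancel()
        if err := srv.Shutdown(shutdownCtx); err != nil {
            log.Printf("killed before drain completed: %v", err)
        }
    }

If the process is killed before the drain completes, in-flight clients see exactly the connection refused and network I/O timeout errors the test message warns about.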

Comment 2 David Eads 2020-10-08 19:24:12 UTC

*** This bug has been marked as a duplicate of bug 1882750 ***