Bug 1886563

Summary: [sig-api-machinery][Feature:APIServer][Late] kubelet terminates kube-apiserver gracefully
Product: OpenShift Container Platform
Component: Node
Version: 4.6
Hardware: Unspecified
OS: Unspecified
Status: CLOSED DUPLICATE
Severity: unspecified
Priority: unspecified
Reporter: Benjamin Gilbert <bgilbert>
Assignee: Stefan Schimanski <sttts>
QA Contact: Ke Wang <kewang>
CC: aos-bugs, deads, jokerman, mfojtik, xxia
Environment: [sig-api-machinery][Feature:APIServer][Late] kubelet terminates kube-apiserver gracefully
Type: Bug
Last Closed: 2020-10-08 19:24:12 UTC

Description Benjamin Gilbert 2020-10-08 18:43:46 UTC
test:
[sig-api-machinery][Feature:APIServer][Late] kubelet terminates kube-apiserver gracefully

is failing frequently in CI; see the search results:
https://search.ci.openshift.org/?maxAge=168h&context=1&type=bug%2Bjunit&name=&maxMatches=5&maxBytes=20971520&groupBy=job&search=%5C%5Bsig-api-machinery%5C%5D%5C%5BFeature%3AAPIServer%5C%5D%5C%5BLate%5C%5D+kubelet+terminates+kube-apiserver+gracefully

https://prow.ci.openshift.org/view/gs/origin-ci-test/logs/release-openshift-ocp-installer-e2e-metal-compact-4.6/1314250018076495872

fail [github.com/onsi/ginkgo.0-origin.1+incompatible/internal/leafnodes/runner.go:64]: kube-apiserver reports a non-graceful termination: v1.Event{TypeMeta:v1.TypeMeta{Kind:"", APIVersion:""}, ObjectMeta:v1.ObjectMeta{Name:"kube-apiserver-master-1.ci-op-jwq0wjcr-35904.origin-ci-int-aws.dev.rhcloud.com.163c1542b8f9d61a", GenerateName:"", Namespace:"openshift-kube-apiserver", SelfLink:"/api/v1/namespaces/openshift-kube-apiserver/events/kube-apiserver-master-1.ci-op-jwq0wjcr-35904.origin-ci-int-aws.dev.rhcloud.com.163c1542b8f9d61a", UID:"a55424d2-daf0-4ce1-8de1-1d414b5233da", ResourceVersion:"21297", Generation:0, CreationTimestamp:v1.Time{Time:time.Time{wall:0x0, ext:63737775743, loc:(*time.Location)(0x9003460)}}, DeletionTimestamp:(*v1.Time)(nil), DeletionGracePeriodSeconds:(*int64)(nil), Labels:map[string]string(nil), Annotations:map[string]string(nil), OwnerReferences:[]v1.OwnerReference(nil), Finalizers:[]string(nil), ClusterName:"", ManagedFields:[]v1.ManagedFieldsEntry{v1.ManagedFieldsEntry{Manager:"watch-termination", Operation:"Update", APIVersion:"v1", Time:(*v1.Time)(0xc001ebea00), FieldsType:"FieldsV1", FieldsV1:(*v1.FieldsV1)(0xc001ebea20)}}}, InvolvedObject:v1.ObjectReference{Kind:"Pod", Namespace:"openshift-kube-apiserver", Name:"kube-apiserver-master-1.ci-op-jwq0wjcr-35904.origin-ci-int-aws.dev.rhcloud.com", UID:"", APIVersion:"v1", ResourceVersion:"", FieldPath:""}, Reason:"NonGracefulTermination", Message:"Previous pod kube-apiserver-master-1.ci-op-jwq0wjcr-35904.origin-ci-int-aws.dev.rhcloud.com started at 2020-10-08 17:33:31.267794933 +0000 UTC did not terminate gracefully", Source:v1.EventSource{Component:"apiserver", Host:"master-1.ci-op-jwq0wjcr-35904.origin-ci-int-aws.dev.rhcloud.com"}, FirstTimestamp:v1.Time{Time:time.Time{wall:0x0, ext:63737775743, loc:(*time.Location)(0x9003460)}}, LastTimestamp:v1.Time{Time:time.Time{wall:0x0, ext:63737775743, loc:(*time.Location)(0x9003460)}}, Count:1, Type:"Warning", EventTime:v1.MicroTime{Time:time.Time{wall:0x0, ext:0, loc:(*time.Location)(nil)}}, Series:(*v1.EventSeries)(nil), Action:"", Related:(*v1.ObjectReference)(nil), ReportingController:"", ReportingInstance:""}. Probably kubelet or CRI-O is not giving the time to cleanly shut down. This can lead to connection refused and network I/O timeout errors in other components.
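
The event above is recorded by the "watch-termination" wrapper (visible as the manager in the event's ManagedFields), which flags at startup that the previous kube-apiserver pod on the node did not shut down cleanly. To look for the same signal the test asserts on in a live cluster, one can filter events by the reason string from the failure. A sketch, assuming an oc client with access to the affected cluster:

    oc get events -n openshift-kube-apiserver --field-selector reason=NonGracefulTermination

Events are only retained for a limited TTL, so an empty result just means no non-graceful termination was recorded recently.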

Comment 1 David Eads 2020-10-08 19:20:58 UTC
What's noteworthy is less the failure itself and more the sudden shift in failure frequency for some jobs:

release-openshift-ocp-installer-e2e-azure-4.6                     88.00% (0.00%) (25 runs)    100.00% (0.00%) (39 runs)
release-openshift-origin-installer-e2e-remote-libvirt-s390x-4.6   90.00% (0.00%) (10 runs)    100.00% (0.00%) (6 runs)
release-openshift-ocp-installer-e2e-aws-4.6                       92.00% (0.00%) (50 runs)     97.06% (0.00%) (68 runs)
periodic-ci-openshift-release-master-ocp-4.6-e2e-vsphere          93.94% (0.00%) (33 runs)    100.00% (0.00%) (35 runs)

@sjenning: this suddenly got a lot more severe and would explain the 10% increase in availability downtime we see during upgrades.
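
For reference, "terminates gracefully" here means the apiserver gets to drain in-flight work after SIGTERM arrives and before the kubelet's grace period expires and SIGKILL follows. A minimal sketch of that drain-on-SIGTERM pattern, using a plain net/http server rather than the actual kube-apiserver code:

    package main

    import (
        "context"
        "log"
        "net/http"
        "os/signal"
        "syscall"
        "time"
    )

    func main() {
        srv := &http.Server{Addr: ":8443"}

        // ctx is cancelled when the kubelet (via the container runtime)
        // delivers SIGTERM to the container's main process.
        ctx, stop := signal.NotifyContext(context.Background(), syscall.SIGTERM)
        defer stop()

        go func() {
            if err := srv.ListenAndServe(); err != nil && err != http.ErrServerClosed {
                log.Fatal(err)
            }
        }()

        <-ctx.Done()

        // Drain must finish within the pod's terminationGracePeriodSeconds;
        // the 60s here is illustrative. If the kubelet or CRI-O kills the
        // process before Shutdown returns, termination is non-graceful.
        shutdownCtx, cancel := context.WithTimeout(context.Background(), 60*time.Second)
        defer cancel()
        if err := srv.Shutdown(shutdownCtx); err != nil {
            log.Printf("killed before drain completed: %v", err)
        }
    }

If the process is killed before the drain completes, in-flight clients see exactly the connection refused and network I/O timeout errors the test message warns about.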

Comment 2 David Eads 2020-10-08 19:24:12 UTC

*** This bug has been marked as a duplicate of bug 1882750 ***