Bug 1955669

Summary: release-openshift-origin-installer-old-rhcos-e2e-aws-4.7 is permfailing
Product: OpenShift Container Platform Reporter: Ryan Phillips <rphillips>
Component: Node Assignee: Ryan Phillips <rphillips>
Node sub component: Kubelet QA Contact: Sunil Choudhary <schoudha>
Status: CLOSED ERRATA Docs Contact:
Severity: high    
Priority: unspecified CC: aos-bugs, bparees, schoudha
Version: 4.7   
Target Milestone: ---   
Target Release: 4.7.z   
Hardware: Unspecified   
OS: Unspecified   
Whiteboard:
Fixed In Version: Doc Type: If docs needed, set a value
Doc Text:
Story Points: ---
Clone Of: 1955610 Environment:
job=release-openshift-origin-installer-old-rhcos-e2e-aws-4.7=all
Last Closed: 2021-05-19 15:17:01 UTC Type: ---
Regression: --- Mount Type: ---
Documentation: --- CRM:
Verified Versions: Category: ---
oVirt Team: --- RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: --- Target Upstream Version:
Bug Depends On: 1955610    
Bug Blocks:    

Description Ryan Phillips 2021-04-30 16:01:59 UTC
+++ This bug was initially created as a clone of Bug #1955610 +++

job:
release-openshift-origin-installer-old-rhcos-e2e-aws-4.7 

is always failing in CI, see testgrid results:
https://testgrid.k8s.io/redhat-openshift-ocp-release-4.7-informing#release-openshift-origin-installer-old-rhcos-e2e-aws-4.7


Note: this job attempts to run the current 4.7 codebase on top of the previous rhcos image (so older crio/kubelet).

The most concerning error is that the apiserver is being terminated non-gracefully, which can cause failures in other tests (they either cannot reach the apiserver or lose their connection to it).

see:
https://prow.ci.openshift.org/view/gs/origin-ci-test/logs/release-openshift-origin-installer-old-rhcos-e2e-aws-4.7/1387145769860993024

fail [github.com/onsi/ginkgo@v4.5.0-origin.1+incompatible/internal/leafnodes/runner.go:64]: kube-apiserver reports a non-graceful termination: v1.Event{TypeMeta:v1.TypeMeta{Kind:"", APIVersion:""}, ObjectMeta:v1.ObjectMeta{Name:"kube-apiserver-ip-10-0-194-195.ec2.internal.1679d369bf517c63", GenerateName:"", Namespace:"openshift-kube-apiserver", SelfLink:"/api/v1/namespaces/openshift-kube-apiserver/events/kube-apiserver-ip-10-0-194-195.ec2.internal.1679d369bf517c63", UID:"589e86a5-28cb-4ff4-991a-9977b27b0e73", ResourceVersion:"22443", Generation:0, CreationTimestamp:v1.Time{Time:time.Time{wall:0x0, ext:63755154792, loc:(*time.Location)(0x9068880)}}, DeletionTimestamp:(*v1.Time)(nil), DeletionGracePeriodSeconds:(*int64)(nil), Labels:map[string]string(nil), Annotations:map[string]string(nil), OwnerReferences:[]v1.OwnerReference(nil), Finalizers:[]string(nil), ClusterName:"", ManagedFields:[]v1.ManagedFieldsEntry{v1.ManagedFieldsEntry{Manager:"watch-termination", Operation:"Update", APIVersion:"v1", Time:(*v1.Time)(0xc0004a46c0), FieldsType:"FieldsV1", FieldsV1:(*v1.FieldsV1)(0xc0004a46e0)}}}, InvolvedObject:v1.ObjectReference{Kind:"Pod", Namespace:"openshift-kube-apiserver", Name:"kube-apiserver-ip-10-0-194-195.ec2.internal", UID:"", APIVersion:"v1", ResourceVersion:"", FieldPath:""}, Reason:"NonGracefulTermination", Message:"Previous pod kube-apiserver-ip-10-0-194-195.ec2.internal started at 2021-04-27 21:11:52.712098115 +0000 UTC did not terminate gracefully", Source:v1.EventSource{Component:"apiserver", Host:"ip-10-0-194-195"}, FirstTimestamp:v1.Time{Time:time.Time{wall:0x0, ext:63755154792, loc:(*time.Location)(0x9068880)}}, LastTimestamp:v1.Time{Time:time.Time{wall:0x0, ext:63755154792, loc:(*time.Location)(0x9068880)}}, Count:1, Type:"Warning", EventTime:v1.MicroTime{Time:time.Time{wall:0x0, ext:0, loc:(*time.Location)(nil)}}, Series:(*v1.EventSeries)(nil), Action:"", Related:(*v1.ObjectReference)(nil), ReportingController:"", ReportingInstance:""}. 
Probably kubelet or CRI-O is not giving the time to cleanly shut down. This can lead to connection refused and network I/O timeout errors in other components.
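
For triage, the Go-formatted v1.Event dump above is hard to scan by eye. The following is a minimal sketch, not part of the CI tooling, that pulls the reason, message, host, and involved pod out of such a dump with regular expressions. The function names (summarize_event, is_non_graceful) and the SAMPLE string are hypothetical; SAMPLE is an abbreviated excerpt of the event printed in the failure above.

```python
import re

# Field patterns taken from the keys visible in the dumped v1.Event.
# Assumption: each of these keys appears only once in the dump, as it does above.
FIELD_RE = {
    "reason": re.compile(r'Reason:"([^"]*)"'),
    "message": re.compile(r'Message:"([^"]*)"'),
    "host": re.compile(r'Host:"([^"]*)"'),
}

# Captures Kind, Namespace, and Name of the object the event refers to.
INVOLVED_RE = re.compile(
    r'InvolvedObject:v1\.ObjectReference\{Kind:"([^"]*)", '
    r'Namespace:"([^"]*)", Name:"([^"]*)"'
)

# Abbreviated excerpt of the event from the failing run (fields omitted for brevity).
SAMPLE = (
    'InvolvedObject:v1.ObjectReference{Kind:"Pod", Namespace:"openshift-kube-apiserver", '
    'Name:"kube-apiserver-ip-10-0-194-195.ec2.internal", UID:"", APIVersion:"v1"}, '
    'Reason:"NonGracefulTermination", '
    'Message:"Previous pod kube-apiserver-ip-10-0-194-195.ec2.internal started at '
    '2021-04-27 21:11:52.712098115 +0000 UTC did not terminate gracefully", '
    'Source:v1.EventSource{Component:"apiserver", Host:"ip-10-0-194-195"}'
)


def summarize_event(dump: str) -> dict:
    """Hypothetical helper: return selected fields, or None where a field is absent."""
    summary = {}
    for key, pattern in FIELD_RE.items():
        match = pattern.search(dump)
        summary[key] = match.group(1) if match else None
    match = INVOLVED_RE.search(dump)
    # Report the pod as "namespace/name" when the involved object is present.
    summary["pod"] = f"{match.group(2)}/{match.group(3)}" if match else None
    return summary


def is_non_graceful(dump: str) -> bool:
    """Hypothetical helper: True when the event reason is NonGracefulTermination."""
    return summarize_event(dump)["reason"] == "NonGracefulTermination"


if __name__ == "__main__":
    for key, value in summarize_event(SAMPLE).items():
        print(f"{key}: {value}")
```

The regular expressions depend on the exact Go %#v-style formatting shown in the failure message, so they would need adjusting if the test output format changes.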


So I am starting this with the kubelet.

--- Additional comment from Ryan Phillips on 2021-04-30 16:01:30 UTC ---

Self verifying... patch has merged into 4.8 to flake the test. Going to backport it to 4.7.

Comment 3 Sunil Choudhary 2021-05-05 04:39:32 UTC
Patch has merged into 4.7 to flake the test.

Comment 4 Siddharth Sharma 2021-05-10 17:59:54 UTC
This bug will be shipped as part of the next z-stream release, 4.7.11, on May 19th, as 4.7.10 was dropped due to a blocker: https://bugzilla.redhat.com/show_bug.cgi?id=1958518.

Comment 8 errata-xmlrpc 2021-05-19 15:17:01 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (OpenShift Container Platform 4.7.11 bug fix update), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2021:1550