Bug 1886055

Summary: After OCP upgrade, two of three openshift-apiserver pods in a CrashLoopBackOff mode
Product: OpenShift Container Platform
Reporter: baiesi
Component: openshift-apiserver
Assignee: Stefan Schimanski <sttts>
Status: CLOSED CURRENTRELEASE
QA Contact: Xingxing Xia <xxia>
Severity: high
Priority: low
Version: 4.3.z
CC: annair, aos-bugs, baiesi, eparis, jokerman, mfojtik, milei, nstielau, wking, wlewis
Flags: mfojtik: needinfo?
Target Release: 4.7.0
Hardware: Unspecified
OS: Unspecified
Whiteboard: LifecycleReset
Doc Type: No Doc Update
Last Closed: 2021-01-21 09:07:20 UTC
Type: Bug
Bug Blocks: 1869362

Description baiesi 2020-10-07 14:49:14 UTC
Description of problem:
Issue:
After OCP upgrade, two of three openshift-apiserver pods are in a CrashLoopBackOff state.

Environment:
UPI initial install of OCP 4.3 on bare metal
Upgraded via the UI from OCP 4.3.15 to OCP 4.3.35
Load balancers: LB-(master0, master1, master2), LB-(worker0, worker1, worker2)
The bootstrap node and cluster nodes are all on the private network; public NICs are disabled.
1 infra node has dual NICs to access both the public and private networks.
3 master nodes
3 worker nodes

Steps to Reproduce:
- Deployed OCP 4.3.15 on bare metal successfully, no issues
- Upgraded via the UI to OCP 4.3.35 successfully, no issues, but:
- Noticed in the UI that 2 pods were in an error state, on the master0 and master2 nodes:
  Pod: apiserver-99xl5   Namespace: openshift-apiserver   DaemonSet: apiserver   Node: master2   CrashLoopBackOff: ContainersNotReady
  Pod: apiserver-lfpgj   Namespace: openshift-apiserver   DaemonSet: apiserver   Node: master0   CrashLoopBackOff: ContainersNotReady

***Need a solution so we may continue with a healthy cluster and upgrade to 4.5.x

Actual results:
On the master0 and master2 nodes:
Pod: apiserver-99xl5   Namespace: openshift-apiserver   DaemonSet: apiserver   Node: master2   CrashLoopBackOff: ContainersNotReady
Pod: apiserver-lfpgj   Namespace: openshift-apiserver   DaemonSet: apiserver   Node: master0   CrashLoopBackOff: ContainersNotReady

[baiesi@laptop1 keys]$ oc get pods -n openshift-apiserver
NAME              READY   STATUS             RESTARTS   AGE
apiserver-99xl5   0/1     CrashLoopBackOff   166        11h
apiserver-d7f8b   1/1     Running            6          11h
apiserver-lfpgj   0/1     CrashLoopBackOff   131        11h
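
A minimal sketch of follow-up commands to capture why the containers keep restarting (assuming cluster-admin access; the pod names are the ones listed above):

$ oc -n openshift-apiserver logs apiserver-99xl5 --previous   # logs from the last crashed container
$ oc -n openshift-apiserver describe pod apiserver-99xl5      # restart reason, probe failures, recent events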

Expected Results:
Expected the pods on these nodes not to be in CrashLoopBackOff after the upgrade, so we may continue with a healthy cluster and upgrade to 4.5.x.

Additional info:
Also noticed in the UI console that the dashboard indicates 2 failing pods under the cluster inventory.

[baiesi@laptop1 keys]$ oc get pods -n openshift-apiserver
NAME              READY   STATUS             RESTARTS   AGE
apiserver-99xl5   0/1     CrashLoopBackOff   166        11h
apiserver-d7f8b   1/1     Running            6          11h
apiserver-lfpgj   0/1     CrashLoopBackOff   131        11h

[baiesi@laptop1 keys]$ oc get pods -n openshift-controller-manager
NAME                       READY   STATUS    RESTARTS   AGE
controller-manager-98pbb   1/1     Running   1          12h
controller-manager-skc87   1/1     Running   1          12h
controller-manager-xpgfv   1/1     Running   1          12h

[baiesi@laptop1 keys]$  oc get nodes
NAME      STATUS   ROLES           AGE   VERSION
master0   Ready    master,worker   13h   v1.16.2+7279a4a
master1   Ready    master,worker   13h   v1.16.2+7279a4a
master2   Ready    master,worker   13h   v1.16.2+7279a4a
worker0   Ready    worker          13h   v1.16.2+7279a4a
worker1   Ready    worker          13h   v1.16.2+7279a4a
worker2   Ready    worker          13h   v1.16.2+7279a4a

[baiesi@laptop1 keys]$ oc adm top nodes
NAME      CPU(cores)   CPU%   MEMORY(bytes)   MEMORY%   
master0   592m         3%     5815Mi          1%        
master1   727m         4%     7116Mi          1%        
master2   957m         6%     4952Mi          1%        
worker0   505m         3%     6284Mi          1%        
worker1   324m         2%     4713Mi          1%        
worker2   111m         0%     3446Mi          0%

UI Event Messages: RED
*Deployment: openshift-apiserver-operator   Namespace: openshift-apiserver-operator   (less than a minute ago; generated from openshift-apiserver-operator-workload-controller, 15 times in the last 12 hours)
"authorization.openshift.io.v1" failed with HTTP status code 503 (the server is currently unable to handle the request)

*Pod: apiserver-99xl5   Namespace: openshift-apiserver   (a minute ago; generated from kubelet on master2, 1383 times in the last 12 hours)
Readiness probe failed: Get https://10.131.0.5:8443/healthz: net/http: request canceled while waiting for connection (Client.Timeout exceeded while awaiting headers)
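
The probe target is the pod IP, so it can help to confirm which pod owns 10.131.0.5 and review the recent probe events (a sketch, not part of the original report):

$ oc -n openshift-apiserver get pods -o wide                      # maps pod names to pod IPs and nodes
$ oc -n openshift-apiserver get events --sort-by=.lastTimestamp   # recent probe failures and restarts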

*ClusterServiceVersion: packageserver   Namespace: openshift-operator-lifecycle-manager   (2 minutes ago; generated from operator-lifecycle-manager, 123 times in the last 12 hours)
APIServices not installed
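
The packageserver CSV reports this when its aggregated APIService is not available, which lines up with the 503s above. A sketch of how to list the affected APIService objects (the packageserver APIService name below is an assumption):

$ oc get apiservices                                    # look for entries with AVAILABLE=False
$ oc get apiservice v1.packages.operators.coreos.com    # packageserver's aggregated API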

Attached file:
The oc adm must-gather CLI command collects the information from your cluster that is most likely needed for debugging issues:
http://10.8.32.38/str/ocpdebug/must-gather-apiservercrash.tar.gz
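
For reference, a typical way to regenerate such an archive (a sketch; the destination directory name is arbitrary):

$ oc adm must-gather --dest-dir=./must-gather-apiservercrash
$ tar czf must-gather-apiservercrash.tar.gz must-gather-apiservercrash/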

Comment 1 milei 2020-10-07 14:52:48 UTC

apiserver-99xl5-openshift-apiserver.log:

Copying system trust bundle
I1007 11:50:30.551927       1 audit.go:368] Using audit backend: ignoreErrors<log>
I1007 11:50:30.557010       1 plugins.go:84] Registered admission plugin "NamespaceLifecycle"
I1007 11:50:30.557041       1 plugins.go:84] Registered admission plugin "ValidatingAdmissionWebhook"
I1007 11:50:30.557051       1 plugins.go:84] Registered admission plugin "MutatingAdmissionWebhook"
I1007 11:50:30.559354       1 admission.go:48] Admission plugin "project.openshift.io/ProjectRequestLimit" is not configured so it will be disabled.
I1007 11:50:30.560653       1 plugins.go:158] Loaded 5 mutating admission controller(s) successfully in the following order: NamespaceLifecycle,build.openshift.io/BuildConfigSecretInjector,image.openshift.io/ImageLimitRange,image.openshift.io/ImagePolicy,MutatingAdmissionWebhook.
I1007 11:50:30.560687       1 plugins.go:161] Loaded 8 validating admission controller(s) successfully in the following order: OwnerReferencesPermissionEnforcement,build.openshift.io/BuildConfigSecretInjector,build.openshift.io/BuildByStrategy,image.openshift.io/ImageLimitRange,image.openshift.io/ImagePolicy,quota.openshift.io/ClusterResourceQuota,ValidatingAdmissionWebhook,ResourceQuota.
I1007 11:50:30.572544       1 client.go:361] parsed scheme: "endpoint"
I1007 11:50:30.572628       1 endpoint.go:68] ccResolverWrapper: sending new addresses to cc: [{https://etcd.openshift-etcd.svc:2379 0  <nil>}]
I1007 11:50:35.591672       1 client.go:361] parsed scheme: "endpoint"
I1007 11:50:35.591730       1 endpoint.go:68] ccResolverWrapper: sending new addresses to cc: [{https://etcd.openshift-etcd.svc:2379 0  <nil>}]
F1007 11:50:55.591913       1 openshift_apiserver.go:420] context deadline exceeded
W1007 11:50:55.592702       1 clientconn.go:1120] grpc: addrConn.createTransport failed to connect to {https://etcd.openshift-etcd.svc:2379 0  <nil>}. Err :connection error: desc = "transport: Error while dialing dial tcp: operation was canceled". Reconnecting...
I1007 11:50:55.591981       1 controlbuf.go:508] transport: loopyWriter.run returning. connection error: desc = "transport is closing"
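
The fatal "context deadline exceeded" follows two failed attempts to reach etcd through the etcd.openshift-etcd.svc service, so the next question is whether that service resolves to healthy endpoints. A sketch of checks (assuming cluster-admin access; resource names inferred from the service URL in the log):

$ oc -n openshift-etcd get endpoints etcd      # should list the master IPs on port 2379
$ oc -n openshift-etcd get pods -o wide        # etcd member pods (names vary by release)
$ oc -n openshift-sdn get pods -o wide         # if the cluster uses the default SDN, rules out a node-level network problem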

Comment 3 W. Trevor King 2020-10-07 16:32:58 UTC
Updates team is responsible for the updates framework and cluster-version operator manifest application.  If a component pod is crashlooping, that's the responsibility of that component's team.  In this case, maybe the pod needs to be more robust in the face of etcd connection issues, or maybe etcd is down, or maybe there is some networking issue between the pod and etcd.  But updates folks are not maintaining any of those components, so it's hard for us to know where the issue is.

Comment 5 Stefan Schimanski 2020-10-25 18:31:45 UTC
Other topics had higher prio. Adding UpcomingSprint.

Comment 6 Michal Fojtik 2020-11-06 18:12:06 UTC
This bug hasn't had any activity in the last 30 days. Maybe the problem got resolved, was a duplicate of something else, or became less pressing for some reason - or maybe it's still relevant but just hasn't been looked at yet. As such, we're marking this bug as "LifecycleStale" and decreasing the severity/priority. If you have further information on the current state of the bug, please update it, otherwise this bug can be closed in about 7 days. The information can be, for example, that the problem still occurs, that you still want the feature, that more information is needed, or that the bug is (for whatever reason) no longer relevant. Additionally, you can add LifecycleFrozen into Keywords if you think this bug should never be marked as stale. Please consult with bug assignee before you do that.

Comment 7 Nick Stielau 2021-01-20 18:37:54 UTC
This doesn't seem like a 4.7 blocker, but does seem high severity.

Comment 8 Michal Fojtik 2021-01-20 19:20:30 UTC
The LifecycleStale keyword was removed because the bug got commented on recently.
The bug assignee was notified.

Comment 9 Stefan Schimanski 2021-01-21 09:07:20 UTC
4.3 is EOL. Closing.

Please reopen if you see this with a more current version.