Bug 1914432 - OCP 4.7.0-fc.1 Cluster Fails to Restart Gracefully
Summary: OCP 4.7.0-fc.1 Cluster Fails to Restart Gracefully
Keywords:
Status: CLOSED DUPLICATE of bug 1899941
Alias: None
Product: OpenShift Container Platform
Classification: Red Hat
Component: Networking
Version: 4.7
Hardware: x86_64
OS: Linux
Priority: unspecified
Severity: high
Target Milestone: ---
Target Release: 4.7.0
Assignee: aos-network-edge-staff
QA Contact: Hongan Li
URL:
Whiteboard:
Depends On:
Blocks:
 
Reported: 2021-01-08 20:16 UTC by Gurney Buchanan
Modified: 2022-08-04 22:30 UTC
CC: 5 users

Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed: 2021-01-13 17:08:24 UTC
Target Upstream Version:
Embargoed:


Attachments (Terms of Use)
must-gather from 4.7.0-fc.1 cluster after shutdown and resume (12.27 MB, application/gzip)
2021-01-08 20:41 UTC, Gurney Buchanan

Description Gurney Buchanan 2021-01-08 20:16:00 UTC
Description of problem:
An OpenShift 4.7.0-fc.1 cluster was provisioned on AWS (3 m4.xlarge masters, 3 m4.2xlarge workers), left idle for approximately 30 minutes, and then all cluster nodes were stopped from the AWS console.  After 10 minutes, the instances were started, and we observed the following once all 3 masters signaled Ready:

```
gbuchana-mac:bootstrap gurnben$ oc get csr
NAME        AGE   SIGNERNAME                                    REQUESTOR                                                                   CONDITION
csr-22r86   93m   kubernetes.io/kube-apiserver-client-kubelet   system:serviceaccount:openshift-machine-config-operator:node-bootstrapper   Approved,Issued
csr-26rd4   85m   kubernetes.io/kubelet-serving                 system:node:ip-10-0-158-164.ec2.internal                                    Approved,Issued
csr-8r7qw   92m   kubernetes.io/kubelet-serving                 system:node:ip-10-0-155-63.ec2.internal                                     Approved,Issued
csr-dh9t5   92m   kubernetes.io/kubelet-serving                 system:node:ip-10-0-136-156.ec2.internal                                    Approved,Issued
csr-drs44   82m   kubernetes.io/kubelet-serving                 system:node:ip-10-0-172-113.ec2.internal                                    Approved,Issued
csr-frvjr   82m   kubernetes.io/kube-apiserver-client-kubelet   system:serviceaccount:openshift-machine-config-operator:node-bootstrapper   Approved,Issued
csr-h4b8b   92m   kubernetes.io/kubelet-serving                 system:node:ip-10-0-161-97.ec2.internal                                     Approved,Issued
csr-hbrxb   85m   kubernetes.io/kube-apiserver-client-kubelet   system:serviceaccount:openshift-machine-config-operator:node-bootstrapper   Approved,Issued
csr-mdths   92m   kubernetes.io/kube-apiserver-client-kubelet   system:serviceaccount:openshift-machine-config-operator:node-bootstrapper   Approved,Issued
csr-mxtbl   85m   kubernetes.io/kube-apiserver-client-kubelet   system:serviceaccount:openshift-machine-config-operator:node-bootstrapper   Approved,Issued
csr-ts2kz   85m   kubernetes.io/kubelet-serving                 system:node:ip-10-0-138-77.ec2.internal                                     Approved,Issued
csr-wd7rp   92m   kubernetes.io/kube-apiserver-client-kubelet   system:serviceaccount:openshift-machine-config-operator:node-bootstrapper   Approved,Issued
gbuchana-mac:bootstrap gurnben$ oc get nodes -l node-role.kubernetes.io/master
NAME                           STATUS   ROLES    AGE   VERSION
ip-10-0-136-156.ec2.internal   Ready    master   93m   v1.20.0+87544c5
ip-10-0-155-63.ec2.internal    Ready    master   93m   v1.20.0+87544c5
ip-10-0-161-97.ec2.internal    Ready    master   92m   v1.20.0+87544c5
gbuchana-mac:bootstrap gurnben$ oc get po --all-namespaces | grep "0/1     Running "
openshift-console                                  console-855fc4fc67-8pjr9                                  0/1     Running     4          64m
openshift-console                                  console-855fc4fc67-wxgw9                                  0/1     Running     4          64m
openshift-ingress                                  router-default-7b65d7b64-jtzt7                            0/1     Running     0          70m
openshift-ingress                                  router-default-7b65d7b64-xfdpr                            0/1     Running     0          70m

```

The cluster was unreachable via the web UI and `oc login`.

Note: this issue was originally encountered when using a Hive ClusterPool with hibernation, but it was reproduced outside of Hive to rule out Hive components and interactions before this BZ was opened.  All info in this issue is from a bare openshift-install provisioned cluster.


Version-Release number of selected component (if applicable): 4.7.0-fc.1


How reproducible:


Steps to Reproduce:
1. Provision an OCP 4.7.0-fc.1 cluster (presumably all platforms; reproduced on AWS)
2. Shut down all instances via the cloud platform console (as documented in https://docs.openshift.com/container-platform/4.6/backup_and_restore/graceful-cluster-shutdown.html)
3. Start all instances and follow steps as necessary from https://docs.openshift.com/container-platform/4.6/backup_and_restore/graceful-cluster-restart.html
4. Attempt `oc login` / web console access
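
The post-restart checks in step 3 can be sketched as a small script. This is a hedged sketch, not the documented procedure verbatim: the CSR-approval one-liner follows the graceful-cluster-restart doc linked above, the `not_ready` helper name is my own, and the `oc` calls are guarded so the script is a no-op where no cluster login is available.

```shell
#!/bin/sh
# Filter `oc get pods --all-namespaces` output down to pods whose READY
# column (e.g. 0/1) shows fewer ready containers than total.
# Helper name is illustrative, not from the report.
not_ready() {
  awk 'NR > 1 { split($3, a, "/"); if (a[1] != a[2]) print }'
}

# Only talk to a cluster if oc is actually present.
if command -v oc >/dev/null 2>&1; then
  # Approve any node CSRs still Pending after the instances come back up
  # (per the graceful-cluster-restart doc referenced in step 3).
  oc get csr -o go-template='{{range .items}}{{if not .status}}{{.metadata.name}}{{"\n"}}{{end}}{{end}}' \
    | xargs -r oc adm certificate approve

  # Then confirm all nodes report Ready and list any pods stuck not-ready.
  oc get nodes
  oc get pods --all-namespaces | not_ready
fi
```

In this failure the CSRs above were all Approved,Issued, so the interesting output is the `not_ready` list, which matches the reporter's `grep "0/1     Running "` above.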

Actual results:
Unreachable cluster web console and auth endpoint

Expected results:
Cluster successfully resumed from shutdown


Additional info:
must-gather output will be attached once available (waiting for `oc adm must-gather` to complete)

Comment 1 Gurney Buchanan 2021-01-08 20:41:34 UTC
Created attachment 1745696 [details]
must-gather from 4.7.0-fc.1 cluster after shutdown and resume

Comment 2 Greg Sheremeta 2021-01-08 20:43:51 UTC
going to guess Networking (ingress not coming up preventing console from coming up?)
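
One way to check that guess is to look at the cluster operators after the restart. A hedged sketch: the `degraded_operators` helper is an illustrative name, the column layout assumed is the standard `oc get clusteroperators` output (NAME VERSION AVAILABLE PROGRESSING DEGRADED SINCE), and the `oc` call is guarded.

```shell
#!/bin/sh
# Keep only operators that are not Available or are Degraded.
# Helper name is illustrative, not from the report.
degraded_operators() {
  awk 'NR > 1 && ($3 != "True" || $5 == "True")'
}

if command -v oc >/dev/null 2>&1; then
  oc get clusteroperators | degraded_operators
fi
```

If the ingress operator shows up here while the network operator does not, that would line up with the "ingress, not CNO" guess.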

Comment 3 Gurney Buchanan 2021-01-12 17:52:18 UTC
I successfully reproduced this issue on 4.7.0-fc.2 as well!

Comment 4 Jacob Tanenbaum 2021-01-12 19:45:40 UTC
Looks like an OpenShift ingress issue: the CNO reports that the network is fine, but the router logs have errors, this one being the first.

2021-01-08T18:32:47.847199722Z E0108 18:32:47.847161       1 haproxy.go:418] can't scrape HAProxy: dial unix /var/lib/haproxy/run/haproxy.sock: connect: no such file or directory
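
A quick way to confirm this on a live reproduction is to pull the same error out of the router logs and check whether the stats socket exists inside the pod. A hedged sketch: the pod name is the one from the report's `oc get po` output (substitute your own), the `haproxy_errors` helper is illustrative, and the `oc` calls are guarded.

```shell
#!/bin/sh
# Pick out haproxy.sock complaints from router log text.
# Helper name is illustrative, not from the report.
haproxy_errors() {
  grep -F '/var/lib/haproxy/run/haproxy.sock'
}

if command -v oc >/dev/null 2>&1; then
  pod=router-default-7b65d7b64-jtzt7   # from the report; substitute yours
  oc -n openshift-ingress logs "$pod" | haproxy_errors
  # See whether the stats socket was ever created inside the pod.
  oc -n openshift-ingress exec "$pod" -- ls -l /var/lib/haproxy/run/
fi
```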

Comment 5 Andrew McDermott 2021-01-13 17:08:24 UTC

*** This bug has been marked as a duplicate of bug 1899941 ***

