Bug 1797897

Summary:	After masters stopped and restarted, cluster is dead
Product:	OpenShift Container Platform	Reporter:	Xingxing Xia <xxia>
Component:	Etcd Operator	Assignee:	Sam Batschelet <sbatsche>
Status:	CLOSED ERRATA	QA Contact:	Xingxing Xia <xxia>
Severity:	urgent	Docs Contact:
Priority:	urgent
Version:	4.4	CC:	aos-bugs, geliu, jialiu, mfojtik, tnozicka
Target Milestone:	---	Keywords:	TestBlocker
Target Release:	4.4.0
Hardware:	Unspecified
OS:	Unspecified
Whiteboard:
Fixed In Version:		Doc Type:	If docs needed, set a value
Doc Text:		Story Points:	---
Clone Of:		Environment:
Last Closed:	2020-05-04 11:33:06 UTC	Type:	Bug

Description Xingxing Xia 2020-02-04 07:49:04 UTC

Description of problem:
After all masters are stopped and restarted, the cluster is dead.

Version-Release number of selected component (if applicable):
4.4.0-0.nightly-2020-02-04-002939

How reproducible:
Always

Steps to Reproduce:
1. Install a fresh 4.4 environment
2. In the cloud platform console, stop all masters
3. Start the stopped hosts again
4. Run any oc command, e.g. `oc get no --v 6`

Actual results:
4. Cannot access the cluster. `oc get no --v 6` shows:
...
I0204 02:00:13.163815   28273 helpers.go:221] Connection error: Get https://...:6443/api?timeout=32s: dial tcp ...:6443: i/o timeout
F0204 02:00:13.163837   28273 helpers.go:114] Unable to connect to the server: dial tcp ...:6443: i/o timeout
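
The `i/o timeout` above means the TCP connection to port 6443 never completed. A minimal sketch of checking that independently of `oc` could look like the following; the function name and the default timeout are illustrative assumptions, and the host must be supplied by the caller since the real address is elided in the log.

```python
import socket


def api_port_reachable(host: str, port: int = 6443, timeout: float = 5.0) -> bool:
    """Return True if a TCP connection to host:port succeeds within `timeout` seconds.

    Hypothetical helper for illustration only; it checks raw TCP reachability
    and does not speak TLS or call the Kubernetes API.
    """
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # OSError covers timeouts, refused connections, and DNS failures alike.
        return False
```

A `False` result here is consistent with the apiserver not listening at all, which matches the empty container list found on the master below.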

SSH to a master and check:
# podman ps -a # nothing
CONTAINER ID  IMAGE  COMMAND  CREATED  STATUS  PORTS  NAMES
# journalctl -e -f -u kubelet.service
...
Feb 04 03:47:51 xxia04-xgnnk-m-0.c.openshift-qe.internal hyperkube[6814]: E0204 03:47:51.760259    6814 kubelet.go:2271] node "xxia04-xgnnk-m-0.c.openshift-qe.internal" not found
Feb 04 03:47:51 xxia04-xgnnk-m-0.c.openshift-qe.internal hyperkube[6814]: E0204 03:47:51.768141    6814 remote_runtime.go:261] RemoveContainer "5e660a0a6dffdd60e3778c438353765a57b0e9aa8dbea65147348f2bff7e2593" from runtime service failed: rpc error: code = Unknown desc = failed to delete storage for container 5e660a0a6dffdd60e3778c438353765a57b0e9aa8dbea65147348f2bff7e2593: container not known
Feb 04 03:47:51 xxia04-xgnnk-m-0.c.openshift-qe.internal hyperkube[6814]: E0204 03:47:51.768243    6814 kuberuntime_container.go:671] failed to remove pod init container "wait-for-kube": rpc error: code = Unknown desc = failed to delete storage for container 5e660a0a6dffdd60e3778c438353765a57b0e9aa8dbea65147348f2bff7e2593: container not known; Skipping pod "etcd-member-xxia04-xgnnk-m-0.c.openshift-qe.internal_openshift-etcd(b3c3d89708b507e16125b60366555d50)"
Feb 04 03:47:51 xxia04-xgnnk-m-0.c.openshift-qe.internal hyperkube[6814]: I0204 03:47:51.768372    6814 kuberuntime_manager.go:856] checking backoff for container "wait-for-kube" in pod "etcd-member-xxia04-xgnnk-m-0.c.openshift-qe.internal_openshift-etcd(b3c3d89708b507e16125b60366555d50)"
Feb 04 03:47:51 xxia04-xgnnk-m-0.c.openshift-qe.internal hyperkube[6814]: I0204 03:47:51.768506    6814 kuberuntime_manager.go:866] back-off 10s restarting failed container=wait-for-kube pod=etcd-member-xxia04-xgnnk-m-0.c.openshift-qe.internal_openshift-etcd(b3c3d89708b507e16125b60366555d50)
Feb 04 03:47:51 xxia04-xgnnk-m-0.c.openshift-qe.internal hyperkube[6814]: E0204 03:47:51.768557    6814 pod_workers.go:191] Error syncing pod b3c3d89708b507e16125b60366555d50 ("etcd-member-xxia04-xgnnk-m-0.c.openshift-qe.internal_openshift-etcd(b3c3d89708b507e16125b60366555d50)"), skipping: failed to "StartContainer" for "wait-for-kube" with CrashLoopBackOff: "back-off 10s restarting failed container=wait-for-kube pod=etcd-member-xxia04-xgnnk-m-0.c.openshift-qe.internal_openshift-etcd(b3c3d89708b507e16125b60366555d50)"
Feb 04 03:47:51 xxia04-xgnnk-m-0.c.openshift-qe.internal hyperkube[6814]: I0204 03:47:51.768580    6814 event.go:281] Event(v1.ObjectReference{Kind:"Pod", Namespace:"openshift-etcd", Name:"etcd-member-xxia04-xgnnk-m-0.c.openshift-qe.internal", UID:"b3c3d89708b507e16125b60366555d50", APIVersion:"v1", ResourceVersion:"", FieldPath:"spec.initContainers{wait-for-kube}"}): type: 'Warning' reason: 'BackOff' Back-off restarting failed container
Feb 04 03:47:51 xxia04-xgnnk-m-0.c.openshift-qe.internal hyperkube[6814]: I0204 03:47:51.852688    6814 prober.go:129] Liveness probe for "openshift-kube-scheduler-xxia04-xgnnk-m-0.c.openshift-qe.internal_openshift-kube-scheduler(3d06aa3b353778af12908af306f1190a):scheduler" succeeded
Feb 04 03:47:51 xxia04-xgnnk-m-0.c.openshift-qe.internal hyperkube[6814]: E0204 03:47:51.860488    6814 kubelet.go:2271] node "xxia04-xgnnk-m-0.c.openshift-qe.internal" not found
Feb 04 03:47:51 xxia04-xgnnk-m-0.c.openshift-qe.internal hyperkube[6814]: E0204 03:47:51.960707    6814 kubelet.go:2271] node "xxia04-xgnnk-m-0.c.openshift-qe.internal" not found
...snipped...
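
To pick the failing container out of a long kubelet journal, the `CrashLoopBackOff` lines can be filtered with a regular expression. This is a rough sketch matched against the message format shown in the log above; the pattern and function name are assumptions and may need adjusting for other kubelet versions.

```python
import re
from typing import Dict, Optional

# Pattern written against the pod_workers.go message format seen in this report.
CRASHLOOP_RE = re.compile(
    r'failed to "StartContainer" for "(?P<container>[^"]+)" with CrashLoopBackOff: '
    r'"back-off (?P<backoff>\S+) restarting failed container=(?P=container) '
    r'pod=(?P<pod>[^_\s]+)_(?P<namespace>[^(\s]+)\('
)


def parse_crashloop(line: str) -> Optional[Dict[str, str]]:
    """Return container, backoff, pod, and namespace from a CrashLoopBackOff line, else None."""
    match = CRASHLOOP_RE.search(line)
    return match.groupdict() if match else None
```

Fed the `pod_workers.go:191` line above, this identifies the `wait-for-kube` init container of the `etcd-member` pod in `openshift-etcd`.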


Expected results:
4. The cluster should be accessible.

Additional info:
First found while trying DR on UPI on GCP, then confirmed on IPI on AWS; both hit the issue.
Tried the two cases below; both hit the issue:
In one env, stop all masters, then restart them.
In another env, stop all masters and workers, then restart them.

If only one master is stopped and restarted, cluster pods, nodes, and cluster operators (co) all return to normal.

Comment 2 Tomáš Nožička 2020-02-04 08:43:55 UTC

etcd is not starting, and I am frightened to think that the static etcd pod's init container waits for kube:

```
[root@xxia04-6q4xb-m-0 core]# crictl ps -a | grep wait-for-kube
61d65b5ac1d93       c5fb513ba6473e74dfe8606378886fa8402c24df3c96ff2203db4593ed35a9fa                                                         50 seconds ago      Exited              wait-for-kube                                    59                  b274aa7d51007
[root@xxia04-6q4xb-m-0 core]# crictl logs -f 61d65b5ac1d93
F0204 08:38:02.018434       1 waitforkube.go:36] kube env not populated
```

Feb 04 07:55:14 xxia04-6q4xb-m-0.c.openshift-qe.internal hyperkube[10302]: E0204 07:55:14.607865   10302 pod_workers.go:191] Error syncing pod 3bb792a6006f0085d193d2c3c95dccf0 ("etcd-member-xxia04-6q4xb-m-0.c.openshift-qe.internal_openshift-etcd(3bb792a6006f0085d193d2c3c95dccf0)"), skipping: failed to "StartContainer" for "wait-for-kube" with CrashLoopBackOff: "back-off 5m0s restarting failed container=wait-for-kube pod=etcd-member-xxia04-6q4xb-m-0.c.openshift-qe.internal_openshift-etcd(3bb792a6006f0085d193d2c3c95dccf0)"
Feb 04 07:55:14 xxia04-6q4xb-m-0.c.openshift-qe.internal hyperkube[10302]: I0204 07:55:14.607870   10302 event.go:281] Event(v1.ObjectReference{Kind:"Pod", Namespace:"openshift-etcd", Name:"etcd-member-xxia04-6q4xb-m-0.c.openshift-qe.internal", UID:"3bb792a6006f0085d193d2c3c95dccf0", APIVersion:"v1", ResourceVersion:"", FieldPath:"spec.initContainers{wait-for-kube}"}): type: 'Warning' reason: 'BackOff' Back-off restarting failed container
Feb 04 07:55:18 xxia04-6q4xb-m-0.c.openshift-qe.internal hyperkube[10302]: I0204 07:55:18.972697   10302 worker.go:215] Non-running container probed: etcd-member-xxia04-6q4xb-m-0.c.openshift-qe.internal_openshift-etcd(3bb792a6006f0085d193d2c3c95dccf0) - etcd-member
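
The back-off growing from `10s` in the original report to `5m0s` here is consistent with a restart delay that doubles on each failure up to a cap. A small sketch of that schedule, assuming a 10s initial delay and a 300s cap (values inferred from the two log excerpts, not confirmed in this report):

```python
from typing import List


def backoff_schedule(initial_s: int = 10, cap_s: int = 300, restarts: int = 8) -> List[int]:
    """Return the delay in seconds before each of the first `restarts` restarts.

    Assumes the delay doubles after every failure and is clamped to `cap_s`.
    """
    delays = []
    delay = initial_s
    for _ in range(restarts):
        delays.append(delay)
        delay = min(delay * 2, cap_s)
    return delays


print(backoff_schedule())  # [10, 20, 40, 80, 160, 300, 300, 300]
```

Under these assumptions the cap is reached after five failures, so a container with 59 restarts, as in the `crictl` output above, would be retried at most once every five minutes.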

Comment 3 Xingxing Xia 2020-02-05 10:16:33 UTC

Adding the TestBlocker keyword since this blocks DR scenario testing and verification of bug 1771410.

Comment 5 Johnny Liu 2020-02-12 06:37:18 UTC

I also hit the same issue on a UPI bare-metal install with 4.4.0-0.nightly-2020-02-11-035407.

Comment 7 Xingxing Xia 2020-02-14 08:18:02 UTC

Since the issue is not cloud-specific, I tried IPI on AWS with 4.4.0-0.nightly-2020-02-13-212616. The above issue is fixed, so moving to VERIFIED, but I hit another issue: bug 1802944

Comment 9 errata-xmlrpc 2020-05-04 11:33:06 UTC

Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2020:0581