Bug 1797897 - After masters stopped and restarted, cluster is dead
Summary: After masters stopped and restarted, cluster is dead
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: OpenShift Container Platform
Classification: Red Hat
Component: Etcd Operator
Version: 4.4
Hardware: Unspecified
OS: Unspecified
Priority: urgent
Severity: urgent
Target Milestone: ---
Target Release: 4.4.0
Assignee: Sam Batschelet
QA Contact: Xingxing Xia
URL:
Whiteboard:
Depends On:
Blocks:
 
Reported: 2020-02-04 07:49 UTC by Xingxing Xia
Modified: 2020-05-04 11:33 UTC
CC: 5 users

Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed: 2020-05-04 11:33:06 UTC
Target Upstream Version:
Embargoed:




Links
System ID Private Priority Status Summary Last Updated
Github openshift cluster-etcd-operator pull 73 0 None closed Bug 1797897: pkg/cmd/waitforkube: don't wait if etcd has previously been initialed as member 2020-09-26 10:49:41 UTC
Github openshift machine-config-operator pull 1457 0 None closed Bug 1797897: etcd-member: do not wait-for-kube or validate membership for existing members 2020-09-26 10:49:41 UTC
Red Hat Product Errata RHBA-2020:0581 0 None None None 2020-05-04 11:33:36 UTC

Description Xingxing Xia 2020-02-04 07:49:04 UTC
Description of problem:
After masters stopped and restarted, cluster is dead

Version-Release number of selected component (if applicable):
4.4.0-0.nightly-2020-02-04-002939

How reproducible:
Always

Steps to Reproduce:
1. Install a fresh 4.4 env
2. Go to the cloud platform console and stop all masters
3. Then start the stopped hosts (a hedged CLI sketch for steps 2-3 follows below)
4. Run any oc command, e.g. `oc get no --v 6`
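
For reference, a minimal CLI sketch of steps 2-3 for an IPI-on-AWS env, as an alternative to the console. The Name tag filter is an assumption and must be adapted to the cluster's master instances:

```
# Assumption: masters are EC2 instances whose Name tag contains "master";
# adjust the filter (and region/profile) to match your cluster.
MASTERS=$(aws ec2 describe-instances \
  --filters "Name=tag:Name,Values=*master*" "Name=instance-state-name,Values=running" \
  --query 'Reservations[].Instances[].InstanceId' --output text)

# Step 2: stop all masters and wait until they are fully stopped
aws ec2 stop-instances --instance-ids $MASTERS
aws ec2 wait instance-stopped --instance-ids $MASTERS

# Step 3: start the stopped hosts again
aws ec2 start-instances --instance-ids $MASTERS
aws ec2 wait instance-running --instance-ids $MASTERS
```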

Actual results:
4. Cannot access the cluster. `oc get no --v 6` shows:
...
I0204 02:00:13.163815   28273 helpers.go:221] Connection error: Get https://...:6443/api?timeout=32s: dial tcp ...:6443: i/o timeout
F0204 02:00:13.163837   28273 helpers.go:114] Unable to connect to the server: dial tcp ...:6443: i/o timeout

ssh to the master, check:
# podman ps -a # nothing
CONTAINER ID  IMAGE  COMMAND  CREATED  STATUS  PORTS  NAMES
# journalctl -e -f -u kubelet.service
...
Feb 04 03:47:51 xxia04-xgnnk-m-0.c.openshift-qe.internal hyperkube[6814]: E0204 03:47:51.760259    6814 kubelet.go:2271] node "xxia04-xgnnk-m-0.c.openshift-qe.internal" not found
Feb 04 03:47:51 xxia04-xgnnk-m-0.c.openshift-qe.internal hyperkube[6814]: E0204 03:47:51.768141    6814 remote_runtime.go:261] RemoveContainer "5e660a0a6dffdd60e3778c438353765a57b0e9aa8dbea65147348f2bff7e2593" from runtime service failed: rpc error: code = Unknown desc = failed to delete storage for container 5e660a0a6dffdd60e3778c438353765a57b0e9aa8dbea65147348f2bff7e2593: container not known
Feb 04 03:47:51 xxia04-xgnnk-m-0.c.openshift-qe.internal hyperkube[6814]: E0204 03:47:51.768243    6814 kuberuntime_container.go:671] failed to remove pod init container "wait-for-kube": rpc error: code = Unknown desc = failed to delete storage for container 5e660a0a6dffdd60e3778c438353765a57b0e9aa8dbea65147348f2bff7e2593: container not known; Skipping pod "etcd-member-xxia04-xgnnk-m-0.c.openshift-qe.internal_openshift-etcd(b3c3d89708b507e16125b60366555d50)"
Feb 04 03:47:51 xxia04-xgnnk-m-0.c.openshift-qe.internal hyperkube[6814]: I0204 03:47:51.768372    6814 kuberuntime_manager.go:856] checking backoff for container "wait-for-kube" in pod "etcd-member-xxia04-xgnnk-m-0.c.openshift-qe.internal_openshift-etcd(b3c3d89708b507e16125b60366555d50)"
Feb 04 03:47:51 xxia04-xgnnk-m-0.c.openshift-qe.internal hyperkube[6814]: I0204 03:47:51.768506    6814 kuberuntime_manager.go:866] back-off 10s restarting failed container=wait-for-kube pod=etcd-member-xxia04-xgnnk-m-0.c.openshift-qe.internal_openshift-etcd(b3c3d89708b507e16125b60366555d50)
Feb 04 03:47:51 xxia04-xgnnk-m-0.c.openshift-qe.internal hyperkube[6814]: E0204 03:47:51.768557    6814 pod_workers.go:191] Error syncing pod b3c3d89708b507e16125b60366555d50 ("etcd-member-xxia04-xgnnk-m-0.c.openshift-qe.internal_openshift-etcd(b3c3d89708b507e16125b60366555d50)"), skipping: failed to "StartContainer" for "wait-for-kube" with CrashLoopBackOff: "back-off 10s restarting failed container=wait-for-kube pod=etcd-member-xxia04-xgnnk-m-0.c.openshift-qe.internal_openshift-etcd(b3c3d89708b507e16125b60366555d50)"
Feb 04 03:47:51 xxia04-xgnnk-m-0.c.openshift-qe.internal hyperkube[6814]: I0204 03:47:51.768580    6814 event.go:281] Event(v1.ObjectReference{Kind:"Pod", Namespace:"openshift-etcd", Name:"etcd-member-xxia04-xgnnk-m-0.c.openshift-qe.internal", UID:"b3c3d89708b507e16125b60366555d50", APIVersion:"v1", ResourceVersion:"", FieldPath:"spec.initContainers{wait-for-kube}"}): type: 'Warning' reason: 'BackOff' Back-off restarting failed container
Feb 04 03:47:51 xxia04-xgnnk-m-0.c.openshift-qe.internal hyperkube[6814]: I0204 03:47:51.852688    6814 prober.go:129] Liveness probe for "openshift-kube-scheduler-xxia04-xgnnk-m-0.c.openshift-qe.internal_openshift-kube-scheduler(3d06aa3b353778af12908af306f1190a):scheduler" succeeded
Feb 04 03:47:51 xxia04-xgnnk-m-0.c.openshift-qe.internal hyperkube[6814]: E0204 03:47:51.860488    6814 kubelet.go:2271] node "xxia04-xgnnk-m-0.c.openshift-qe.internal" not found
Feb 04 03:47:51 xxia04-xgnnk-m-0.c.openshift-qe.internal hyperkube[6814]: E0204 03:47:51.960707    6814 kubelet.go:2271] node "xxia04-xgnnk-m-0.c.openshift-qe.internal" not found
...snipped...


Expected results:
4. The cluster should be accessible and the oc command should succeed

Additional info:
First found while trying the UPI on GCP DR scenario, then confirmed with IPI on AWS; both hit the issue.
Tried the cases below; both hit it:
In one env, stop all masters, then restart them.
In another env, stop all masters and workers, then restart them.

If only one master is stopped and restarted, the cluster pods, nodes, and cluster operators all come back to normal.

Comment 2 Tomáš Nožička 2020-02-04 08:43:55 UTC
etcd is not starting, and I am frightened to think that the static etcd pod's init container waits for kube

```
[root@xxia04-6q4xb-m-0 core]# crictl ps -a | grep wait-for-kube
61d65b5ac1d93       c5fb513ba6473e74dfe8606378886fa8402c24df3c96ff2203db4593ed35a9fa                                                         50 seconds ago      Exited              wait-for-kube                                    59                  b274aa7d51007
[root@xxia04-6q4xb-m-0 core]# crictl logs -f 61d65b5ac1d93
F0204 08:38:02.018434       1 waitforkube.go:36] kube env not populated
```

Feb 04 07:55:14 xxia04-6q4xb-m-0.c.openshift-qe.internal hyperkube[10302]: E0204 07:55:14.607865   10302 pod_workers.go:191] Error syncing pod 3bb792a6006f0085d193d2c3c95dccf0 ("etcd-member-xxia04-6q4xb-m-0.c.openshift-qe.internal_openshift-etcd(3bb792a6006f0085d193d2c3c95dccf0)"), skipping: failed to "StartContainer" for "wait-for-kube" with CrashLoopBackOff: "back-off 5m0s restarting failed container=wait-for-kube pod=etcd-member-xxia04-6q4xb-m-0.c.openshift-qe.internal_openshift-etcd(3bb792a6006f0085d193d2c3c95dccf0)"
Feb 04 07:55:14 xxia04-6q4xb-m-0.c.openshift-qe.internal hyperkube[10302]: I0204 07:55:14.607870   10302 event.go:281] Event(v1.ObjectReference{Kind:"Pod", Namespace:"openshift-etcd", Name:"etcd-member-xxia04-6q4xb-m-0.c.openshift-qe.internal", UID:"3bb792a6006f0085d193d2c3c95dccf0", APIVersion:"v1", ResourceVersion:"", FieldPath:"spec.initContainers{wait-for-kube}"}): type: 'Warning' reason: 'BackOff' Back-off restarting failed container
Feb 04 07:55:18 xxia04-6q4xb-m-0.c.openshift-qe.internal hyperkube[10302]: I0204 07:55:18.972697   10302 worker.go:215] Non-running container probed: etcd-member-xxia04-6q4xb-m-0.c.openshift-qe.internal_openshift-etcd(3bb792a6006f0085d193d2c3c95dccf0) - etcd-member
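
For completeness, a minimal sketch of checking an affected master for the symptoms above and for a previously initialized etcd member, which is the condition the linked PRs use to skip wait-for-kube. The /var/lib/etcd/member path is an assumption about the member data directory, and the container ID is a placeholder:

```
# Run on an affected master (ssh core@<master>; sudo -i).
# Inspect the crash-looping init container, as in the session above.
crictl ps -a | grep wait-for-kube
crictl logs <wait-for-kube-container-id>   # expect "kube env not populated"

# Assumption: prior initialization is detectable via the member data directory.
if test -d /var/lib/etcd/member; then
  echo "etcd previously initialized as a member - wait-for-kube should not block startup"
else
  echo "no member data dir - fresh member, wait-for-kube is expected to run"
fi
```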

Comment 3 Xingxing Xia 2020-02-05 10:16:33 UTC
Adding the TestBlocker keyword since this blocks DR scenario testing and bug 1771410 verification

Comment 5 Johnny Liu 2020-02-12 06:37:18 UTC
I also hit the same issue in a UPI on bare metal install with 4.4.0-0.nightly-2020-02-11-035407

Comment 7 Xingxing Xia 2020-02-14 08:18:02 UTC
Since the issue is not specific to a cloud platform, tried IPI on AWS with 4.4.0-0.nightly-2020-02-13-212616. The above issue is fixed, thus moving to VERIFIED, but hit another issue: bug 1802944

Comment 9 errata-xmlrpc 2020-05-04 11:33:06 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2020:0581

