Description of problem:
After the masters are stopped and restarted, the cluster is dead.

Version-Release number of selected component (if applicable):
4.4.0-0.nightly-2020-02-04-002939

How reproducible:
Always

Steps to Reproduce:
1. Install a fresh 4.4 env.
2. Go to the cloud platform console and stop all masters.
3. Start the stopped hosts.
4. Run any oc command, e.g. `oc get no --v 6`.

Actual results:
4. Cannot access the cluster. `oc get no --v 6` shows:
...
I0204 02:00:13.163815 28273 helpers.go:221] Connection error: Get https://...:6443/api?timeout=32s: dial tcp ...:6443: i/o timeout
F0204 02:00:13.163837 28273 helpers.go:114] Unable to connect to the server: dial tcp ...:6443: i/o timeout

ssh to a master and check:
# podman ps -a    # nothing
CONTAINER ID  IMAGE  COMMAND  CREATED  STATUS  PORTS  NAMES

# journalctl -e -f -u kubelet.service
...
Feb 04 03:47:51 xxia04-xgnnk-m-0.c.openshift-qe.internal hyperkube[6814]: E0204 03:47:51.760259 6814 kubelet.go:2271] node "xxia04-xgnnk-m-0.c.openshift-qe.internal" not found
Feb 04 03:47:51 xxia04-xgnnk-m-0.c.openshift-qe.internal hyperkube[6814]: E0204 03:47:51.768141 6814 remote_runtime.go:261] RemoveContainer "5e660a0a6dffdd60e3778c438353765a57b0e9aa8dbea65147348f2bff7e2593" from runtime service failed: rpc error: code = Unknown desc = failed to delete storage for container 5e660a0a6dffdd60e3778c438353765a57b0e9aa8dbea65147348f2bff7e2593: container not known
Feb 04 03:47:51 xxia04-xgnnk-m-0.c.openshift-qe.internal hyperkube[6814]: E0204 03:47:51.768243 6814 kuberuntime_container.go:671] failed to remove pod init container "wait-for-kube": rpc error: code = Unknown desc = failed to delete storage for container 5e660a0a6dffdd60e3778c438353765a57b0e9aa8dbea65147348f2bff7e2593: container not known; Skipping pod "etcd-member-xxia04-xgnnk-m-0.c.openshift-qe.internal_openshift-etcd(b3c3d89708b507e16125b60366555d50)"
Feb 04 03:47:51 xxia04-xgnnk-m-0.c.openshift-qe.internal hyperkube[6814]: I0204 03:47:51.768372 6814 kuberuntime_manager.go:856] checking backoff for container "wait-for-kube" in pod "etcd-member-xxia04-xgnnk-m-0.c.openshift-qe.internal_openshift-etcd(b3c3d89708b507e16125b60366555d50)"
Feb 04 03:47:51 xxia04-xgnnk-m-0.c.openshift-qe.internal hyperkube[6814]: I0204 03:47:51.768506 6814 kuberuntime_manager.go:866] back-off 10s restarting failed container=wait-for-kube pod=etcd-member-xxia04-xgnnk-m-0.c.openshift-qe.internal_openshift-etcd(b3c3d89708b507e16125b60366555d50)
Feb 04 03:47:51 xxia04-xgnnk-m-0.c.openshift-qe.internal hyperkube[6814]: E0204 03:47:51.768557 6814 pod_workers.go:191] Error syncing pod b3c3d89708b507e16125b60366555d50 ("etcd-member-xxia04-xgnnk-m-0.c.openshift-qe.internal_openshift-etcd(b3c3d89708b507e16125b60366555d50)"), skipping: failed to "StartContainer" for "wait-for-kube" with CrashLoopBackOff: "back-off 10s restarting failed container=wait-for-kube pod=etcd-member-xxia04-xgnnk-m-0.c.openshift-qe.internal_openshift-etcd(b3c3d89708b507e16125b60366555d50)"
Feb 04 03:47:51 xxia04-xgnnk-m-0.c.openshift-qe.internal hyperkube[6814]: I0204 03:47:51.768580 6814 event.go:281] Event(v1.ObjectReference{Kind:"Pod", Namespace:"openshift-etcd", Name:"etcd-member-xxia04-xgnnk-m-0.c.openshift-qe.internal", UID:"b3c3d89708b507e16125b60366555d50", APIVersion:"v1", ResourceVersion:"", FieldPath:"spec.initContainers{wait-for-kube}"}): type: 'Warning' reason: 'BackOff' Back-off restarting failed container
Feb 04 03:47:51 xxia04-xgnnk-m-0.c.openshift-qe.internal hyperkube[6814]: I0204 03:47:51.852688 6814 prober.go:129] Liveness probe for "openshift-kube-scheduler-xxia04-xgnnk-m-0.c.openshift-qe.internal_openshift-kube-scheduler(3d06aa3b353778af12908af306f1190a):scheduler" succeeded
Feb 04 03:47:51 xxia04-xgnnk-m-0.c.openshift-qe.internal hyperkube[6814]: E0204 03:47:51.860488 6814 kubelet.go:2271] node "xxia04-xgnnk-m-0.c.openshift-qe.internal" not found
Feb 04 03:47:51 xxia04-xgnnk-m-0.c.openshift-qe.internal hyperkube[6814]: E0204 03:47:51.960707 6814 kubelet.go:2271] node "xxia04-xgnnk-m-0.c.openshift-qe.internal" not found
...snipped...

Expected results:
4. The cluster should be accessible again.

Additional info:
First found while testing UPI on GCP DR, then confirmed in IPI on AWS; both hit the issue. Tried the cases below; both hit it:
- In one env, stop all masters, then restart them.
- In another env, stop all masters and workers, then restart them.
If only one master is stopped and restarted, cluster pods, nodes, and cluster operators all come back to normal.
etcd is not starting, and it is alarming that the static etcd pod's init container waits for kube:

```
[root@xxia04-6q4xb-m-0 core]# crictl ps -a | grep wait-for-kube
61d65b5ac1d93  c5fb513ba6473e74dfe8606378886fa8402c24df3c96ff2203db4593ed35a9fa  50 seconds ago  Exited  wait-for-kube  59  b274aa7d51007
[root@xxia04-6q4xb-m-0 core]# crictl logs -f 61d65b5ac1d93
F0204 08:38:02.018434       1 waitforkube.go:36] kube env not populated
```

Feb 04 07:55:14 xxia04-6q4xb-m-0.c.openshift-qe.internal hyperkube[10302]: E0204 07:55:14.607865 10302 pod_workers.go:191] Error syncing pod 3bb792a6006f0085d193d2c3c95dccf0 ("etcd-member-xxia04-6q4xb-m-0.c.openshift-qe.internal_openshift-etcd(3bb792a6006f0085d193d2c3c95dccf0)"), skipping: failed to "StartContainer" for "wait-for-kube" with CrashLoopBackOff: "back-off 5m0s restarting failed container=wait-for-kube pod=etcd-member-xxia04-6q4xb-m-0.c.openshift-qe.internal_openshift-etcd(3bb792a6006f0085d193d2c3c95dccf0)"
Feb 04 07:55:14 xxia04-6q4xb-m-0.c.openshift-qe.internal hyperkube[10302]: I0204 07:55:14.607870 10302 event.go:281] Event(v1.ObjectReference{Kind:"Pod", Namespace:"openshift-etcd", Name:"etcd-member-xxia04-6q4xb-m-0.c.openshift-qe.internal", UID:"3bb792a6006f0085d193d2c3c95dccf0", APIVersion:"v1", ResourceVersion:"", FieldPath:"spec.initContainers{wait-for-kube}"}): type: 'Warning' reason: 'BackOff' Back-off restarting failed container
Feb 04 07:55:18 xxia04-6q4xb-m-0.c.openshift-qe.internal hyperkube[10302]: I0204 07:55:18.972697 10302 worker.go:215] Non-running container probed: etcd-member-xxia04-6q4xb-m-0.c.openshift-qe.internal_openshift-etcd(3bb792a6006f0085d193d2c3c95dccf0) - etcd-member
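The fatal `kube env not populated` message suggests the wait-for-kube init container simply blocks until kube-injected environment variables appear, which cannot happen while the apiserver (which itself needs etcd) is down. A rough Go sketch of that kind of gate follows; the variable names, timeout, and retry loop are assumptions for illustration, not the actual waitforkube.go implementation.

```go
// Hypothetical sketch of a "wait for kube" init-container gate: block until
// kube-injected environment variables are present, otherwise exit fatally so
// the kubelet restarts the container (leading to the CrashLoopBackOff above).
// The variable names and timeout are assumptions, not the real implementation.
package main

import (
	"log"
	"os"
	"time"
)

func main() {
	required := []string{"KUBERNETES_SERVICE_HOST", "KUBERNETES_SERVICE_PORT"}
	deadline := time.Now().Add(30 * time.Second)

	for {
		missing := ""
		for _, name := range required {
			if os.Getenv(name) == "" {
				missing = name
				break
			}
		}
		if missing == "" {
			log.Print("kube env populated, handing off to etcd")
			return
		}
		if time.Now().After(deadline) {
			// Matches the observed failure mode: a fatal exit that the
			// kubelet retries with back-off.
			log.Fatalf("kube env not populated: %s missing", missing)
		}
		time.Sleep(time.Second)
	}
}
```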
Adding the TestBlocker keyword since this is blocking DR scenario testing and verification of bug 1771410.
I also hit the same issue in a UPI on bare metal install with 4.4.0-0.nightly-2020-02-11-035407.
Since the issue is not specific to a cloud platform, tried IPI on AWS with 4.4.0-0.nightly-2020-02-13-212616. The above issue is fixed, so moving to VERIFIED, but hit another issue: bug 1802944.
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2020:0581