Description of problem:

Director-deployed OCP 3.11: all infra and worker nodes in the cluster go down when one of the master nodes becomes unavailable.

Version-Release number of selected component (if applicable):

How reproducible:
100%

Steps to Reproduce:
1. Deploy OCP overcloud with 3 x master + 3 x worker + 3 x infra nodes with CNS
2. Power off the master node that holds the external VIP (see the sketch under Additional info below for one way to find it)
3. Wait a couple of minutes
4. Run oc get nodes on one of the remaining master nodes

Actual results:

[root@openshift-master-0 heat-admin]# oc get nodes
NAME                 STATUS     ROLES     AGE       VERSION
openshift-infra-0    NotReady   infra     1h        v1.11.0+d4cacc0
openshift-infra-1    NotReady   infra     1h        v1.11.0+d4cacc0
openshift-infra-2    NotReady   infra     1h        v1.11.0+d4cacc0
openshift-master-0   Ready      master    1h        v1.11.0+d4cacc0
openshift-master-1   NotReady   master    1h        v1.11.0+d4cacc0
openshift-master-2   Ready      master    1h        v1.11.0+d4cacc0
openshift-worker-0   NotReady   compute   1h        v1.11.0+d4cacc0
openshift-worker-1   NotReady   compute   1h        v1.11.0+d4cacc0
openshift-worker-2   NotReady   compute   1h        v1.11.0+d4cacc0

Expected results:
The cluster remains operational when 1 of the 3 master nodes goes down.

Additional info:
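For step 2 of the reproducer, a quick way to identify which master currently holds the external VIP; a minimal sketch, where 10.0.0.100 is a hypothetical VIP address, so substitute the one from your deployment:

# Run from the undercloud; the master that prints a matching line holds the VIP.
for h in openshift-master-0 openshift-master-1 openshift-master-2; do
    echo "== $h =="
    ssh heat-admin@$h "ip -o addr show | grep -F 10.0.0.100" || true
done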
Could it be related to https://bugzilla.redhat.com/show_bug.cgi?id=1598362 ? That report describes the same issue, traced to a 15-minute default timeout.
I tried to reproduce this issue but wasn't able to. I'm closing this bug for now and will reopen it if it shows up again.
I am reopening this one since I was able to reproduce it:

[root@openshift-master-2 heat-admin]# oc get nodes -o wide
NAME                 STATUS     ROLES     AGE       VERSION           INTERNAL-IP   EXTERNAL-IP   OS-IMAGE                                      KERNEL-VERSION          CONTAINER-RUNTIME
openshift-infra-0    NotReady   infra     2h        v1.11.0+d4cacc0   172.17.1.29   <none>        Red Hat Enterprise Linux Server 7.6 (Maipo)   3.10.0-957.el7.x86_64   docker://1.13.1
openshift-infra-1    NotReady   infra     2h        v1.11.0+d4cacc0   172.17.1.24   <none>        Red Hat Enterprise Linux Server 7.6 (Maipo)   3.10.0-957.el7.x86_64   docker://1.13.1
openshift-infra-2    NotReady   infra     2h        v1.11.0+d4cacc0   172.17.1.41   <none>        Red Hat Enterprise Linux Server 7.6 (Maipo)   3.10.0-957.el7.x86_64   docker://1.13.1
openshift-master-0   NotReady   master    2h        v1.11.0+d4cacc0   172.17.1.11   <none>        Red Hat Enterprise Linux Server 7.6 (Maipo)   3.10.0-957.el7.x86_64   docker://1.13.1
openshift-master-1   Ready      master    2h        v1.11.0+d4cacc0   172.17.1.10   <none>        Red Hat Enterprise Linux Server 7.6 (Maipo)   3.10.0-957.el7.x86_64   docker://1.13.1
openshift-master-2   Ready      master    2h        v1.11.0+d4cacc0   172.17.1.16   <none>        Red Hat Enterprise Linux Server 7.6 (Maipo)   3.10.0-957.el7.x86_64   docker://1.13.1
openshift-worker-0   NotReady   compute   2h        v1.11.0+d4cacc0   172.17.1.17   <none>        Red Hat Enterprise Linux Server 7.6 (Maipo)   3.10.0-957.el7.x86_64   docker://1.13.1
openshift-worker-1   NotReady   compute   2h        v1.11.0+d4cacc0   172.17.1.14   <none>        Red Hat Enterprise Linux Server 7.6 (Maipo)   3.10.0-957.el7.x86_64   docker://1.13.1
openshift-worker-2   NotReady   compute   2h        v1.11.0+d4cacc0   172.17.1.25   <none>        Red Hat Enterprise Linux Server 7.6 (Maipo)   3.10.0-957.el7.x86_64   docker://1.13.1
The issue persists even after waiting for more than 15 minutes. It appears that on the non-master nodes the atomic-openshift-node.service unit goes into the activating state:

[root@openshift-worker-1 ~]# systemctl status atomic-openshift-node.service
● atomic-openshift-node.service - OpenShift Node
   Loaded: loaded (/etc/systemd/system/atomic-openshift-node.service; enabled; vendor preset: disabled)
   Active: activating (start) since Wed 2018-11-28 15:09:14 EST; 39s ago
     Docs: https://github.com/openshift/origin
 Main PID: 76880 (hyperkube)
    Tasks: 11
   Memory: 21.6M
   CGroup: /system.slice/atomic-openshift-node.service
           └─76880 /usr/bin/hyperkube kubelet --v=0 --address=0.0.0.0 --allow-privileged=true --anonymous-auth=true --authentication-token-webhook=true --authentication-token-webhook-cache-ttl=5m --authorization-mode=Webhook --authoriz...

Nov 28 15:09:14 openshift-worker-1 atomic-openshift-node[76880]: Flag --tls-cipher-suites has been deprecated, This parameter should be set via the config file specified by the Kubelet's --config flag. See https://kubern...re information.
Nov 28 15:09:14 openshift-worker-1 atomic-openshift-node[76880]: Flag --tls-cipher-suites has been deprecated, This parameter should be set via the config file specified by the Kubelet's --config flag. See https://kubern...re information.
Nov 28 15:09:14 openshift-worker-1 atomic-openshift-node[76880]: Flag --tls-min-version has been deprecated, This parameter should be set via the config file specified by the Kubelet's --config flag. See https://kubernet...re information.
Nov 28 15:09:14 openshift-worker-1 atomic-openshift-node[76880]: Flag --tls-private-key-file has been deprecated, This parameter should be set via the config file specified by the Kubelet's --config flag. See https://kub...re information.
Nov 28 15:09:14 openshift-worker-1 atomic-openshift-node[76880]: I1128 15:09:14.825115   76880 server.go:418] Version: v1.11.0+d4cacc0
Nov 28 15:09:14 openshift-worker-1 atomic-openshift-node[76880]: I1128 15:09:14.825320   76880 plugins.go:97] No cloud provider specified.
Nov 28 15:09:14 openshift-worker-1 atomic-openshift-node[76880]: E1128 15:09:14.849969   76880 bootstrap.go:195] Part of the existing bootstrap client certificate is expired: 2018-11-28 18:55:00 +0000 UTC
Nov 28 15:09:14 openshift-worker-1 atomic-openshift-node[76880]: I1128 15:09:14.850887   76880 certificate_store.go:131] Loading cert/key pair from "/etc/origin/node/certificates/kubelet-client-current.pem".
Nov 28 15:09:14 openshift-worker-1 atomic-openshift-node[76880]: I1128 15:09:14.878922   76880 csr.go:105] csr for this node already exists, reusing
Nov 28 15:09:14 openshift-worker-1 atomic-openshift-node[76880]: I1128 15:09:14.882771   76880 csr.go:113] csr for this node is still valid
Hint: Some lines were ellipsized, use -l to show in full.

Attaching the journal for atomic-openshift-node.service.
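To confirm the unit is genuinely stuck rather than briefly flapping and recovering, a simple poll can be left running on an affected node; a minimal sketch:

# Print the unit's state every 10 seconds; on the affected workers this keeps
# printing "activating" (or "failed") and never settles on "active".
while true; do
    date '+%H:%M:%S'
    systemctl is-active atomic-openshift-node.service
    sleep 10
done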
Created attachment 1509623 [details] atomic-openshift-node.service journal
Checking the journal log, I can see that atomic-openshift-node.service entered the failed state after the following errors showed up:

E1128 13:55:06.928399   12813 transport.go:108] The currently active client certificate has expired, but the server is not responsive. A restart may be necessary to retrieve new initial credentials.
E1128 13:55:09.289059   12813 kubelet_node_status.go:391] Error updating node status, will retry: error getting node "openshift-worker-1": Unauthorized
E1128 13:55:09.291313   12813 kubelet_node_status.go:391] Error updating node status, will retry: error getting node "openshift-worker-1": Unauthorized
E1128 13:55:09.301042   12813 reflector.go:253] k8s.io/kubernetes/pkg/kubelet/kubelet.go:464: Failed to watch *v1.Node: the server has asked for the client to provide credentials (get nodes)
E1128 13:55:09.302132   12813 reflector.go:253] k8s.io/kubernetes/pkg/kubelet/kubelet.go:455: Failed to watch *v1.Service: the server has asked for the client to provide credentials (get services)
W1128 13:55:09.305199   12813 reflector.go:272] k8s.io/kubernetes/pkg/kubelet/config/apiserver.go:47: watch of *v1.Pod ended with: very short watch: k8s.io/kubernetes/pkg/kubelet/config/apiserver.go:47: Unexpected watch close - watch lasted less than a second and no items received
E1128 13:55:09.305537   12813 kubelet_node_status.go:391] Error updating node status, will retry: error getting node "openshift-worker-1": Unauthorized
E1128 13:55:09.320834   12813 kubelet_node_status.go:391] Error updating node status, will retry: error getting node "openshift-worker-1": Unauthorized
E1128 13:55:09.332595   12813 kubelet_node_status.go:391] Error updating node status, will retry: error getting node "openshift-worker-1": Unauthorized
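The expiry reported above can be confirmed directly on the node; a minimal check using the certificate path from the kubelet log (openssl is available on RHEL by default):

# Show when the kubelet client certificate expires and whether it has already expired:
openssl x509 -in /etc/origin/node/certificates/kubelet-client-current.pem \
    -noout -enddate -checkend 0 \
    && echo "certificate still valid" || echo "certificate expired"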
(In reply to Marius Cornea from comment #6)
> Checking the journal log, I can see that atomic-openshift-node.service
> entered the failed state after the following errors showed up:
>
> E1128 13:55:06.928399 12813 transport.go:108] The currently active client
> certificate has expired, but the server is not responsive. A restart may be
> necessary to retrieve new initial credentials.

What happens if you restart atomic-openshift-node.service? Will the cluster eventually recover?
(In reply to Martin André from comment #7)
> (In reply to Marius Cornea from comment #6)
> > Checking the journal log, I can see that atomic-openshift-node.service
> > entered the failed state after the following errors showed up:
> >
> > E1128 13:55:06.928399 12813 transport.go:108] The currently active client
> > certificate has expired, but the server is not responsive. A restart may be
> > necessary to retrieve new initial credentials.
>
> What happens if you restart atomic-openshift-node.service? Will the cluster
> eventually recover?

Restarting atomic-openshift-node.service doesn't work: the service remains stuck in the activating state.
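For reference, the restart attempt looked roughly like this, run on an affected worker:

# Note: a plain "systemctl restart" may block until the unit's start timeout,
# since the unit never signals readiness; --no-block avoids waiting on it.
systemctl --no-block restart atomic-openshift-node.service
sleep 60
systemctl is-active atomic-openshift-node.service   # still prints "activating"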
I just noticed we're setting experimental-cluster-signing-duration to 20m [1]. This parameter controls the lifetime of the certificates signed by the controller manager and, according to [2], defaults to 8760h (?!?). If we change back to the default, this *greatly* reduces the risk of a certificate renewal falling within the window when the master is unreachable.

[1] https://github.com/openstack/tripleo-heat-templates/blob/master/extraconfig/services/openshift-master.yaml#L198
[2] https://kubernetes.io/docs/reference/command-line-tools-reference/kube-controller-manager/
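To double-check what a given deployment is actually running with, the flag shows up on the controller manager's command line; a hedged sketch (process matching may differ depending on how the controllers are containerized):

# Look for the signing-duration override on the running controller manager.
# With the 20m override in place this prints "...=20m"; once the override is
# removed, no output means the 8760h (one year) default is in effect.
ps auxww | grep -o 'experimental-cluster-signing-duration=[^ ]*' | sort -u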
Proposed a possible fix upstream at https://review.openstack.org/622440
No doc text required.
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHEA-2019:0045