Bug 1648493 - Director deployed OCP 3.11: all infra and worker nodes in the cluster go down when one of the master nodes becomes unavailable
Summary: Director deployed OCP 3.11: all infra and worker nodes in the cluster go down when one of the master nodes becomes unavailable
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: Red Hat OpenStack
Classification: Red Hat
Component: openstack-tripleo-heat-templates
Version: 14.0 (Rocky)
Hardware: Unspecified
OS: Unspecified
Priority: urgent
Severity: urgent
Target Milestone: rc
Target Release: 14.0 (Rocky)
Assignee: Martin André
QA Contact: Marius Cornea
URL:
Whiteboard:
Depends On:
Blocks:
 
Reported: 2018-11-09 22:40 UTC by Marius Cornea
Modified: 2019-04-25 07:48 UTC
CC: 8 users

Fixed In Version: openstack-tripleo-heat-templates-9.0.1-0.20181013060900.el7ost
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed: 2019-01-11 11:54:40 UTC
Target Upstream Version:
Embargoed:


Attachments
atomic-openshift-node.service journal (690.29 KB, text/plain)
2018-11-28 20:12 UTC, Marius Cornea


Links
OpenStack gerrit 622440: MERGED - Remove openshift-ansible customization (last updated 2021-02-14 16:40:20 UTC)
Red Hat Product Errata RHEA-2019:0045 (last updated 2019-01-11 11:54:53 UTC)

Description Marius Cornea 2018-11-09 22:40:41 UTC
Description of problem:
Director deployed OCP 3.11: all infra and worker nodes in the cluster go down when one of the master nodes becomes unavailable.

Version-Release number of selected component (if applicable):


How reproducible:
100%

Steps to Reproduce:
1. Deploy OCP overcloud with 3 x masters + 3 x worker + 3 x infra nodes with CNS

2. Power off the master node that currently holds the external VIP (one way to identify it is sketched after step 4)

3. Wait a couple of minutes

4. Run oc get nodes on one of the master nodes
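
For step 2, one way to identify which master currently holds the external VIP is to check the addresses on each master node (a minimal sketch, not taken from this report; <external_vip> is a placeholder for the deployment's external VIP address):

# run on each master; the node that prints a match holds the VIP
ip -o addr show | grep <external_vip>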

Actual results:

[root@openshift-master-0 heat-admin]# oc get nodes
NAME                 STATUS     ROLES     AGE       VERSION
openshift-infra-0    NotReady   infra     1h        v1.11.0+d4cacc0
openshift-infra-1    NotReady   infra     1h        v1.11.0+d4cacc0
openshift-infra-2    NotReady   infra     1h        v1.11.0+d4cacc0
openshift-master-0   Ready      master    1h        v1.11.0+d4cacc0
openshift-master-1   NotReady   master    1h        v1.11.0+d4cacc0
openshift-master-2   Ready      master    1h        v1.11.0+d4cacc0
openshift-worker-0   NotReady   compute   1h        v1.11.0+d4cacc0
openshift-worker-1   NotReady   compute   1h        v1.11.0+d4cacc0
openshift-worker-2   NotReady   compute   1h        v1.11.0+d4cacc0


Expected results:
The cluster remains operational when one of the three master nodes goes down.

Additional info:

Comment 1 Mike Fedosin 2018-11-20 17:04:38 UTC
Could it be related to https://bugzilla.redhat.com/show_bug.cgi?id=1598362 ?
They have the same issue and it's related to the 15-minute default timeout.

Comment 2 Marius Cornea 2018-11-20 18:50:59 UTC
I tried reproducing this issue but I wasn't able to. I'm closing this bug for now and I'll reopen if it shows up again.

Comment 3 Marius Cornea 2018-11-27 20:45:44 UTC
I am reopening this one since I was able to reproduce it:

[root@openshift-master-2 heat-admin]# oc get nodes -o wide
NAME                 STATUS     ROLES     AGE       VERSION           INTERNAL-IP   EXTERNAL-IP   OS-IMAGE                                      KERNEL-VERSION          CONTAINER-RUNTIME
openshift-infra-0    NotReady   infra     2h        v1.11.0+d4cacc0   172.17.1.29   <none>        Red Hat Enterprise Linux Server 7.6 (Maipo)   3.10.0-957.el7.x86_64   docker://1.13.1
openshift-infra-1    NotReady   infra     2h        v1.11.0+d4cacc0   172.17.1.24   <none>        Red Hat Enterprise Linux Server 7.6 (Maipo)   3.10.0-957.el7.x86_64   docker://1.13.1
openshift-infra-2    NotReady   infra     2h        v1.11.0+d4cacc0   172.17.1.41   <none>        Red Hat Enterprise Linux Server 7.6 (Maipo)   3.10.0-957.el7.x86_64   docker://1.13.1
openshift-master-0   NotReady   master    2h        v1.11.0+d4cacc0   172.17.1.11   <none>        Red Hat Enterprise Linux Server 7.6 (Maipo)   3.10.0-957.el7.x86_64   docker://1.13.1
openshift-master-1   Ready      master    2h        v1.11.0+d4cacc0   172.17.1.10   <none>        Red Hat Enterprise Linux Server 7.6 (Maipo)   3.10.0-957.el7.x86_64   docker://1.13.1
openshift-master-2   Ready      master    2h        v1.11.0+d4cacc0   172.17.1.16   <none>        Red Hat Enterprise Linux Server 7.6 (Maipo)   3.10.0-957.el7.x86_64   docker://1.13.1
openshift-worker-0   NotReady   compute   2h        v1.11.0+d4cacc0   172.17.1.17   <none>        Red Hat Enterprise Linux Server 7.6 (Maipo)   3.10.0-957.el7.x86_64   docker://1.13.1
openshift-worker-1   NotReady   compute   2h        v1.11.0+d4cacc0   172.17.1.14   <none>        Red Hat Enterprise Linux Server 7.6 (Maipo)   3.10.0-957.el7.x86_64   docker://1.13.1
openshift-worker-2   NotReady   compute   2h        v1.11.0+d4cacc0   172.17.1.25   <none>        Red Hat Enterprise Linux Server 7.6 (Maipo)   3.10.0-957.el7.x86_64   docker://1.13.1

Comment 4 Marius Cornea 2018-11-28 20:12:06 UTC
The issue persists even after waiting for more than 15 minutes. It appears that on the non-master nodes the atomic-openshift-node.service unit stays stuck in the activating state:

[root@openshift-worker-1 ~]# systemctl status atomic-openshift-node.service
● atomic-openshift-node.service - OpenShift Node
   Loaded: loaded (/etc/systemd/system/atomic-openshift-node.service; enabled; vendor preset: disabled)
   Active: activating (start) since Wed 2018-11-28 15:09:14 EST; 39s ago
     Docs: https://github.com/openshift/origin
 Main PID: 76880 (hyperkube)
    Tasks: 11
   Memory: 21.6M
   CGroup: /system.slice/atomic-openshift-node.service
           └─76880 /usr/bin/hyperkube kubelet --v=0 --address=0.0.0.0 --allow-privileged=true --anonymous-auth=true --authentication-token-webhook=true --authentication-token-webhook-cache-ttl=5m --authorization-mode=Webhook --authoriz...

Nov 28 15:09:14 openshift-worker-1 atomic-openshift-node[76880]: Flag --tls-cipher-suites has been deprecated, This parameter should be set via the config file specified by the Kubelet's --config flag. See https://kubern...re information.
Nov 28 15:09:14 openshift-worker-1 atomic-openshift-node[76880]: Flag --tls-cipher-suites has been deprecated, This parameter should be set via the config file specified by the Kubelet's --config flag. See https://kubern...re information.
Nov 28 15:09:14 openshift-worker-1 atomic-openshift-node[76880]: Flag --tls-min-version has been deprecated, This parameter should be set via the config file specified by the Kubelet's --config flag. See https://kubernet...re information.
Nov 28 15:09:14 openshift-worker-1 atomic-openshift-node[76880]: Flag --tls-private-key-file has been deprecated, This parameter should be set via the config file specified by the Kubelet's --config flag. See https://kub...re information.
Nov 28 15:09:14 openshift-worker-1 atomic-openshift-node[76880]: I1128 15:09:14.825115   76880 server.go:418] Version: v1.11.0+d4cacc0
Nov 28 15:09:14 openshift-worker-1 atomic-openshift-node[76880]: I1128 15:09:14.825320   76880 plugins.go:97] No cloud provider specified.
Nov 28 15:09:14 openshift-worker-1 atomic-openshift-node[76880]: E1128 15:09:14.849969   76880 bootstrap.go:195] Part of the existing bootstrap client certificate is expired: 2018-11-28 18:55:00 +0000 UTC
Nov 28 15:09:14 openshift-worker-1 atomic-openshift-node[76880]: I1128 15:09:14.850887   76880 certificate_store.go:131] Loading cert/key pair from "/etc/origin/node/certificates/kubelet-client-current.pem".
Nov 28 15:09:14 openshift-worker-1 atomic-openshift-node[76880]: I1128 15:09:14.878922   76880 csr.go:105] csr for this node already exists, reusing
Nov 28 15:09:14 openshift-worker-1 atomic-openshift-node[76880]: I1128 15:09:14.882771   76880 csr.go:113] csr for this node is still valid
Hint: Some lines were ellipsized, use -l to show in full.


Attaching the journal for atomic-openshift-node.service
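
For reference, a journal like the attached one can be captured on the affected node along these lines (a sketch; the output path is arbitrary):

journalctl -u atomic-openshift-node.service --no-pager > /tmp/atomic-openshift-node.journal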

Comment 5 Marius Cornea 2018-11-28 20:12:35 UTC
Created attachment 1509623 [details]
atomic-openshift-node.service journal

Comment 6 Marius Cornea 2018-11-28 20:58:24 UTC
Checking the journal log, I can see that atomic-openshift-node.service entered the failed state after the following errors showed up:

E1128 13:55:06.928399   12813 transport.go:108] The currently active client certificate has expired, but the server is not responsive. A restart may be necessary to retrieve new initial credentials.

E1128 13:55:09.289059   12813 kubelet_node_status.go:391] Error updating node status, will retry: error getting node "openshift-worker-1": Unauthorized
E1128 13:55:09.291313   12813 kubelet_node_status.go:391] Error updating node status, will retry: error getting node "openshift-worker-1": Unauthorized
E1128 13:55:09.301042   12813 reflector.go:253] k8s.io/kubernetes/pkg/kubelet/kubelet.go:464: Failed to watch *v1.Node: the server has asked for the client to provide credentials (get nodes)
E1128 13:55:09.302132   12813 reflector.go:253] k8s.io/kubernetes/pkg/kubelet/kubelet.go:455: Failed to watch *v1.Service: the server has asked for the client to provide credentials (get services)
W1128 13:55:09.305199   12813 reflector.go:272] k8s.io/kubernetes/pkg/kubelet/config/apiserver.go:47: watch of *v1.Pod ended with: very short watch: k8s.io/kubernetes/pkg/kubelet/config/apiserver.go:47: Unexpected watch close - watch lasted less than a second and no items received
E1128 13:55:09.305537   12813 kubelet_node_status.go:391] Error updating node status, will retry: error getting node "openshift-worker-1": Unauthorized
E1128 13:55:09.320834   12813 kubelet_node_status.go:391] Error updating node status, will retry: error getting node "openshift-worker-1": Unauthorized
E1128 13:55:09.332595   12813 kubelet_node_status.go:391] Error updating node status, will retry: error getting node "openshift-worker-1": Unauthorized
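
The expiry of the client certificate mentioned above can be checked directly on the node, using the certificate path that appears in the kubelet logs (a minimal sketch):

openssl x509 -noout -enddate -in /etc/origin/node/certificates/kubelet-client-current.pem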

Comment 7 Martin André 2018-11-29 08:53:04 UTC
(In reply to Marius Cornea from comment #6)
> Checking the journal log I can see that the atomic-openshift-node.service
> entered failed state after the following error showed up:
> 
> E1128 13:55:06.928399   12813 transport.go:108] The currently active client
> certificate has expired, but the server is not responsive. A restart may be
> necessary to retrieve new initial credentials.

What happens if you restart atomic-openshift-node.service? will the cluster eventually recover?

Comment 8 Marius Cornea 2018-11-29 20:04:29 UTC
(In reply to Martin André from comment #7)
> (In reply to Marius Cornea from comment #6)
> > Checking the journal log I can see that the atomic-openshift-node.service
> > entered failed state after the following error showed up:
> > 
> > E1128 13:55:06.928399   12813 transport.go:108] The currently active client
> > certificate has expired, but the server is not responsive. A restart may be
> > necessary to retrieve new initial credentials.
> 
> What happens if you restart atomic-openshift-node.service? will the cluster
> eventually recover?

Restarting atomic-openshift-node.service doesn't work - the service remains stuck in the activating state.
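
One possible manual recovery path, assuming the nodes' renewal CSRs are merely pending approval (a sketch run from a healthy master; <pending-csr-name> is a placeholder, and this was not verified here):

oc get csr
oc adm certificate approve <pending-csr-name>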

Comment 9 Martin André 2018-12-04 08:53:49 UTC
I just noticed we're setting experimental-cluster-signing-duration to 20m [1]. This parameter controls the duration of the certificates and according to [2] defaults to 8760h (?!?). If we change back to the default, this *greatly* reduces the risk of a certificate renewal falling within the time the master is unreachable.

[1] https://github.com/openstack/tripleo-heat-templates/blob/master/extraconfig/services/openshift-master.yaml#L198
[2] https://kubernetes.io/docs/reference/command-line-tools-reference/kube-controller-manager/
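
For reference, the override in question ends up as a kube-controller-manager argument. On a deployed master it can be checked with something like the following (a sketch, assuming the OCP 3.x layout where controller arguments are rendered into /etc/origin/master/master-config.yaml):

grep -A2 experimental-cluster-signing-duration /etc/origin/master/master-config.yaml

With the customization removed, the 20m value should no longer appear and the 8760h default applies.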

Comment 10 Martin André 2018-12-04 17:16:32 UTC
Proposed possible fix upstream at https://review.openstack.org/622440

Comment 24 Martin André 2019-01-10 10:18:25 UTC
No doc text required.

Comment 25 errata-xmlrpc 2019-01-11 11:54:40 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHEA-2019:0045

