Bug 1648493 - Director deployed OCP 3.11: all infra and worker nodes in the cluster go down when one of the master nodes becomes unavailable
Summary: Director deployed OCP 3.11: all infra and worker nodes in the cluster go down when one of the master nodes becomes unavailable
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: Red Hat OpenStack
Classification: Red Hat
Component: openstack-tripleo-heat-templates
Version: 14.0 (Rocky)
Hardware: Unspecified
OS: Unspecified
Priority: urgent
Severity: urgent
Target Milestone: rc
Target Release: 14.0 (Rocky)
Assignee: Martin André
QA Contact: Marius Cornea
URL:
Whiteboard:
Depends On:
Blocks:
 
Reported: 2018-11-09 22:40 UTC by Marius Cornea
Modified: 2019-04-25 07:48 UTC
CC: 8 users

Fixed In Version: openstack-tripleo-heat-templates-9.0.1-0.20181013060900.el7ost
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed: 2019-01-11 11:54:40 UTC
Target Upstream Version:
Embargoed:


Attachments
atomic-openshift-node.service journal (690.29 KB, text/plain)
2018-11-28 20:12 UTC, Marius Cornea


Links
OpenStack gerrit 622440: MERGED - Remove openshift-ansible customization (last updated 2021-02-14 16:40:20 UTC)
Red Hat Product Errata RHEA-2019:0045 (last updated 2019-01-11 11:54:53 UTC)

Description Marius Cornea 2018-11-09 22:40:41 UTC
Description of problem:
Director deployed OCP 3.11: all infra and worker nodes in the cluster go down when one of the master nodes becomes unavailable.

Version-Release number of selected component (if applicable):


How reproducible:
100%

Steps to Reproduce:
1. Deploy OCP overcloud with 3 x masters + 3 x worker + 3 x infra nodes with CNS

2. Power off the master node that currently holds the external VIP (one way to identify it is sketched after step 4)

3. Wait a couple of minutes

4. Run oc get nodes on one of the master nodes
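
For step 2, one way to identify which master currently holds the external VIP is to check the addresses on each master node (a minimal sketch, not taken from this report; <external_vip> is a placeholder for the deployment's external VIP address):

# run on each master; the node that prints a match holds the VIP
ip -o addr show | grep <external_vip>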

Actual results:

[root@openshift-master-0 heat-admin]# oc get nodes
NAME                 STATUS     ROLES     AGE       VERSION
openshift-infra-0    NotReady   infra     1h        v1.11.0+d4cacc0
openshift-infra-1    NotReady   infra     1h        v1.11.0+d4cacc0
openshift-infra-2    NotReady   infra     1h        v1.11.0+d4cacc0
openshift-master-0   Ready      master    1h        v1.11.0+d4cacc0
openshift-master-1   NotReady   master    1h        v1.11.0+d4cacc0
openshift-master-2   Ready      master    1h        v1.11.0+d4cacc0
openshift-worker-0   NotReady   compute   1h        v1.11.0+d4cacc0
openshift-worker-1   NotReady   compute   1h        v1.11.0+d4cacc0
openshift-worker-2   NotReady   compute   1h        v1.11.0+d4cacc0


Expected results:
The cluster remains operational when one of the three master nodes goes down.

Additional info:

Comment 1 Mike Fedosin 2018-11-20 17:04:38 UTC
Could it be related to https://bugzilla.redhat.com/show_bug.cgi?id=1598362 ?
They have the same issue and it's related to the 15-minute default timeout.

Comment 2 Marius Cornea 2018-11-20 18:50:59 UTC
I tried reproducing this issue but I wasn't able to. I'm closing this bug for now and I'll reopen if it shows up again.

Comment 3 Marius Cornea 2018-11-27 20:45:44 UTC
I am reopening this one since I was able to reproduce it:

[root@openshift-master-2 heat-admin]# oc get nodes -o wide
NAME                 STATUS     ROLES     AGE       VERSION           INTERNAL-IP   EXTERNAL-IP   OS-IMAGE                                      KERNEL-VERSION          CONTAINER-RUNTIME
openshift-infra-0    NotReady   infra     2h        v1.11.0+d4cacc0   172.17.1.29   <none>        Red Hat Enterprise Linux Server 7.6 (Maipo)   3.10.0-957.el7.x86_64   docker://1.13.1
openshift-infra-1    NotReady   infra     2h        v1.11.0+d4cacc0   172.17.1.24   <none>        Red Hat Enterprise Linux Server 7.6 (Maipo)   3.10.0-957.el7.x86_64   docker://1.13.1
openshift-infra-2    NotReady   infra     2h        v1.11.0+d4cacc0   172.17.1.41   <none>        Red Hat Enterprise Linux Server 7.6 (Maipo)   3.10.0-957.el7.x86_64   docker://1.13.1
openshift-master-0   NotReady   master    2h        v1.11.0+d4cacc0   172.17.1.11   <none>        Red Hat Enterprise Linux Server 7.6 (Maipo)   3.10.0-957.el7.x86_64   docker://1.13.1
openshift-master-1   Ready      master    2h        v1.11.0+d4cacc0   172.17.1.10   <none>        Red Hat Enterprise Linux Server 7.6 (Maipo)   3.10.0-957.el7.x86_64   docker://1.13.1
openshift-master-2   Ready      master    2h        v1.11.0+d4cacc0   172.17.1.16   <none>        Red Hat Enterprise Linux Server 7.6 (Maipo)   3.10.0-957.el7.x86_64   docker://1.13.1
openshift-worker-0   NotReady   compute   2h        v1.11.0+d4cacc0   172.17.1.17   <none>        Red Hat Enterprise Linux Server 7.6 (Maipo)   3.10.0-957.el7.x86_64   docker://1.13.1
openshift-worker-1   NotReady   compute   2h        v1.11.0+d4cacc0   172.17.1.14   <none>        Red Hat Enterprise Linux Server 7.6 (Maipo)   3.10.0-957.el7.x86_64   docker://1.13.1
openshift-worker-2   NotReady   compute   2h        v1.11.0+d4cacc0   172.17.1.25   <none>        Red Hat Enterprise Linux Server 7.6 (Maipo)   3.10.0-957.el7.x86_64   docker://1.13.1

Comment 4 Marius Cornea 2018-11-28 20:12:06 UTC
The issue persists even after waiting for more than 15 minutes. It appears that on the non-master nodes the atomic-openshift-node.service unit stays stuck in the activating state:

[root@openshift-worker-1 ~]# systemctl status atomic-openshift-node.service
● atomic-openshift-node.service - OpenShift Node
   Loaded: loaded (/etc/systemd/system/atomic-openshift-node.service; enabled; vendor preset: disabled)
   Active: activating (start) since Wed 2018-11-28 15:09:14 EST; 39s ago
     Docs: https://github.com/openshift/origin
 Main PID: 76880 (hyperkube)
    Tasks: 11
   Memory: 21.6M
   CGroup: /system.slice/atomic-openshift-node.service
           └─76880 /usr/bin/hyperkube kubelet --v=0 --address=0.0.0.0 --allow-privileged=true --anonymous-auth=true --authentication-token-webhook=true --authentication-token-webhook-cache-ttl=5m --authorization-mode=Webhook --authoriz...

Nov 28 15:09:14 openshift-worker-1 atomic-openshift-node[76880]: Flag --tls-cipher-suites has been deprecated, This parameter should be set via the config file specified by the Kubelet's --config flag. See https://kubern...re information.
Nov 28 15:09:14 openshift-worker-1 atomic-openshift-node[76880]: Flag --tls-cipher-suites has been deprecated, This parameter should be set via the config file specified by the Kubelet's --config flag. See https://kubern...re information.
Nov 28 15:09:14 openshift-worker-1 atomic-openshift-node[76880]: Flag --tls-min-version has been deprecated, This parameter should be set via the config file specified by the Kubelet's --config flag. See https://kubernet...re information.
Nov 28 15:09:14 openshift-worker-1 atomic-openshift-node[76880]: Flag --tls-private-key-file has been deprecated, This parameter should be set via the config file specified by the Kubelet's --config flag. See https://kub...re information.
Nov 28 15:09:14 openshift-worker-1 atomic-openshift-node[76880]: I1128 15:09:14.825115   76880 server.go:418] Version: v1.11.0+d4cacc0
Nov 28 15:09:14 openshift-worker-1 atomic-openshift-node[76880]: I1128 15:09:14.825320   76880 plugins.go:97] No cloud provider specified.
Nov 28 15:09:14 openshift-worker-1 atomic-openshift-node[76880]: E1128 15:09:14.849969   76880 bootstrap.go:195] Part of the existing bootstrap client certificate is expired: 2018-11-28 18:55:00 +0000 UTC
Nov 28 15:09:14 openshift-worker-1 atomic-openshift-node[76880]: I1128 15:09:14.850887   76880 certificate_store.go:131] Loading cert/key pair from "/etc/origin/node/certificates/kubelet-client-current.pem".
Nov 28 15:09:14 openshift-worker-1 atomic-openshift-node[76880]: I1128 15:09:14.878922   76880 csr.go:105] csr for this node already exists, reusing
Nov 28 15:09:14 openshift-worker-1 atomic-openshift-node[76880]: I1128 15:09:14.882771   76880 csr.go:113] csr for this node is still valid
Hint: Some lines were ellipsized, use -l to show in full.


Attaching the journal for atomic-openshift-node.service
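
For reference, a journal like the attached one can be captured on the affected node along these lines (a sketch; the output path is arbitrary):

journalctl -u atomic-openshift-node.service --no-pager > /tmp/atomic-openshift-node.journal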

Comment 5 Marius Cornea 2018-11-28 20:12:35 UTC
Created attachment 1509623 [details]
atomic-openshift-node.service journal

Comment 6 Marius Cornea 2018-11-28 20:58:24 UTC
Checking the journal log, I can see that atomic-openshift-node.service entered the failed state after the following errors showed up:

E1128 13:55:06.928399   12813 transport.go:108] The currently active client certificate has expired, but the server is not responsive. A restart may be necessary to retrieve new initial credentials.

E1128 13:55:09.289059   12813 kubelet_node_status.go:391] Error updating node status, will retry: error getting node "openshift-worker-1": Unauthorized
E1128 13:55:09.291313   12813 kubelet_node_status.go:391] Error updating node status, will retry: error getting node "openshift-worker-1": Unauthorized
E1128 13:55:09.301042   12813 reflector.go:253] k8s.io/kubernetes/pkg/kubelet/kubelet.go:464: Failed to watch *v1.Node: the server has asked for the client to provide credentials (get nodes)
E1128 13:55:09.302132   12813 reflector.go:253] k8s.io/kubernetes/pkg/kubelet/kubelet.go:455: Failed to watch *v1.Service: the server has asked for the client to provide credentials (get services)
W1128 13:55:09.305199   12813 reflector.go:272] k8s.io/kubernetes/pkg/kubelet/config/apiserver.go:47: watch of *v1.Pod ended with: very short watch: k8s.io/kubernetes/pkg/kubelet/config/apiserver.go:47: Unexpected watch close - watch lasted less than a second and no items received
E1128 13:55:09.305537   12813 kubelet_node_status.go:391] Error updating node status, will retry: error getting node "openshift-worker-1": Unauthorized
E1128 13:55:09.320834   12813 kubelet_node_status.go:391] Error updating node status, will retry: error getting node "openshift-worker-1": Unauthorized
E1128 13:55:09.332595   12813 kubelet_node_status.go:391] Error updating node status, will retry: error getting node "openshift-worker-1": Unauthorized
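
The expiry of the client certificate mentioned above can be checked directly on the node, using the certificate path that appears in the kubelet logs (a minimal sketch):

openssl x509 -noout -enddate -in /etc/origin/node/certificates/kubelet-client-current.pem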

Comment 7 Martin André 2018-11-29 08:53:04 UTC
(In reply to Marius Cornea from comment #6)
> Checking the journal log I can see that the atomic-openshift-node.service
> entered failed state after the following error showed up:
> 
> E1128 13:55:06.928399   12813 transport.go:108] The currently active client
> certificate has expired, but the server is not responsive. A restart may be
> necessary to retrieve new initial credentials.

What happens if you restart atomic-openshift-node.service? will the cluster eventually recover?

Comment 8 Marius Cornea 2018-11-29 20:04:29 UTC
(In reply to Martin André from comment #7)
> (In reply to Marius Cornea from comment #6)
> > Checking the journal log I can see that the atomic-openshift-node.service
> > entered failed state after the following error showed up:
> > 
> > E1128 13:55:06.928399   12813 transport.go:108] The currently active client
> > certificate has expired, but the server is not responsive. A restart may be
> > necessary to retrieve new initial credentials.
> 
> What happens if you restart atomic-openshift-node.service? will the cluster
> eventually recover?

Restarting atomic-openshift-node.service doesn't work - the service remains stuck in the activating state.
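
One possible manual recovery path, assuming the nodes' renewal CSRs are merely pending approval (a sketch run from a healthy master; <pending-csr-name> is a placeholder, and this was not verified here):

oc get csr
oc adm certificate approve <pending-csr-name>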

Comment 9 Martin André 2018-12-04 08:53:49 UTC
I just noticed we're setting experimental-cluster-signing-duration to 20m [1]. This parameter controls the duration of the certificates and according to [2] defaults to 8760h (?!?). If we change back to the default, this *greatly* reduces the risk of a certificate renewal falling within the time the master is unreachable.

[1] https://github.com/openstack/tripleo-heat-templates/blob/master/extraconfig/services/openshift-master.yaml#L198
[2] https://kubernetes.io/docs/reference/command-line-tools-reference/kube-controller-manager/
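
For reference, the override in question ends up as a kube-controller-manager argument. On a deployed master it can be checked with something like the following (a sketch, assuming the OCP 3.x layout where controller arguments are rendered into /etc/origin/master/master-config.yaml):

grep -A2 experimental-cluster-signing-duration /etc/origin/master/master-config.yaml

With the customization removed, the 20m value should no longer appear and the 8760h default applies.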

Comment 10 Martin André 2018-12-04 17:16:32 UTC
Proposed possible fix upstream at https://review.openstack.org/622440

Comment 24 Martin André 2019-01-10 10:18:25 UTC
No doc text required.

Comment 25 errata-xmlrpc 2019-01-11 11:54:40 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHEA-2019:0045

