Bug 1575756

Summary: Stopping/Restarting active master controller instance makes the entire cluster unstable
Product: OpenShift Container Platform
Reporter: Vikas Laad <vlaad>
Component: Master
Assignee: Jordan Liggitt <jliggitt>
Status: CLOSED ERRATA
QA Contact: Vikas Laad <vlaad>
Severity: high
Docs Contact:
Priority: unspecified
Version: 3.10.0
CC: aos-bugs, hongkliu, jokerman, mifiedle, mmccomas, vlaad
Target Milestone: ---
Target Release: 3.10.0
Hardware: Unspecified
OS: Unspecified
Fixed In Version:
Doc Type: No Doc Update
Doc Text:
Story Points: ---
Clone Of:
Environment:
Last Closed: 2018-07-30 19:14:38 UTC
Type: Bug
Regression: ---
Mount Type: ---
Documentation: ---
CRM:
Verified Versions:
Category: ---
oVirt Team: ---
RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: ---
Target Upstream Version:
Attachments:
master api exited container logs
journalctl logs from ip-172-31-22-84.us-west-2.compute.internal
desc node when its NotReady
desc pod when api pod is in CrashLoopBackOff

Description Vikas Laad 2018-05-07 19:52:38 UTC
Description of problem:
Stopping one of the masters in an HA cluster makes the cluster unhealthy: the API and controllers pods on the other two masters restart multiple times. In the following listing, I stopped ip-172-31-7-3.us-west-2.compute.internal for some time and then started it again (using the AWS console).

root@ip-172-31-2-226: ~ # oc get pods -n kube-system                                                                          
NAME                                                             READY     STATUS    RESTARTS   AGE
master-api-ip-172-31-22-84.us-west-2.compute.internal            1/1       Running   8          2h
master-api-ip-172-31-32-140.us-west-2.compute.internal           1/1       Running   7          2h
master-api-ip-172-31-7-3.us-west-2.compute.internal              1/1       Running   1          9m
master-controllers-ip-172-31-22-84.us-west-2.compute.internal    1/1       Running   1          2h
master-controllers-ip-172-31-32-140.us-west-2.compute.internal   1/1       Running   2          2h
master-controllers-ip-172-31-7-3.us-west-2.compute.internal      1/1       Running   1          9m
master-etcd-ip-172-31-22-84.us-west-2.compute.internal           1/1       Running   0          2h
master-etcd-ip-172-31-32-140.us-west-2.compute.internal          1/1       Running   0          2h
master-etcd-ip-172-31-7-3.us-west-2.compute.internal             1/1       Running   1          9m

The exited containers on one of the masters that was not stopped:
root@ip-172-31-22-84: ~ # crictl ps -a | grep EXITED
W0507 19:47:19.870182   23176 util_unix.go:75] Using "/var/run/crio/crio.sock" as endpoint is deprecated, please consider using full url format "unix:///var/run/crio/crio.sock".
e2c13d3c6ab64       35c82099d3142075a2ddfb56815ff67f334460b91dd892eb37161d298d2b3528                                                                                 21 minutes ago      CONTAINER_EXITED    api                 7
49ed5dfa80a09       registry.reg-aws.openshift.com:443/openshift3/ose-node@sha256:c740e60f4f098c80289842a7f49f31d873ba176d83f716e66e03d4e23167862e                   2 hours ago         CONTAINER_EXITED    sync                0
ae4531789381c       registry.reg-aws.openshift.com:443/openshift3/ose-control-plane@sha256:7d5395addf13b47e75e65609fde5d7639487f695f86beb5fd64bc035bb819a63          2 hours ago         CONTAINER_EXITED    controllers         0
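
For anyone triaging similar output, here is a minimal sketch of pulling the exited containers and their attempt counts out of the "crictl ps -a" text. It assumes the column layout shown above, where STATE, NAME and ATTEMPT are the last three whitespace-separated columns; other crictl versions may append further columns. The function name exited_containers is hypothetical.

```python
def exited_containers(crictl_output):
    """Return (name, attempt) pairs for containers in CONTAINER_EXITED state."""
    exited = []
    for line in crictl_output.splitlines():
        fields = line.split()
        # Assumption: STATE, NAME, ATTEMPT are the last three columns,
        # as in the listing in this report. Warning lines and headers
        # do not match and are skipped.
        if len(fields) >= 3 and fields[-3] == "CONTAINER_EXITED":
            exited.append((fields[-2], int(fields[-1])))
    return exited


if __name__ == "__main__":
    sample = (
        "e2c13d3c6ab64  35c82099d314  21 minutes ago  CONTAINER_EXITED  api  7\n"
        "49ed5dfa80a09  ose-node@sha256:c740  2 hours ago  CONTAINER_EXITED  sync  0\n"
    )
    for name, attempt in exited_containers(sample):
        print(name, attempt)
```

A high attempt count on the api container, as seen here, is what points at repeated restarts rather than a single exit.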

Version-Release number of selected component (if applicable):
openshift v3.10.0-0.32.0
kubernetes v1.10.0+b81c8f8
etcd 3.2.16

Steps to Reproduce:
1. Create an HA cluster with 3 masters.
2. While doing some activity on the cluster, stop one of the masters (I used the AWS console to stop that node).
3. Check "oc get pods -n kube-system" after some time.
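
The check in step 3 can be made less manual by comparing two "oc get pods -n kube-system" listings taken before and after stopping the master. The following is only a sketch, assuming the NAME/READY/STATUS/RESTARTS/AGE column order shown in this report; parse_restarts and restart_deltas are hypothetical helper names.

```python
def parse_restarts(oc_get_pods_output):
    """Map pod name -> restart count from 'oc get pods' text output."""
    restarts = {}
    for line in oc_get_pods_output.splitlines():
        fields = line.split()
        # Skip blank lines and the header row.
        if len(fields) < 5 or fields[0] == "NAME":
            continue
        # Assumption: columns are NAME READY STATUS RESTARTS AGE.
        restarts[fields[0]] = int(fields[3])
    return restarts


def restart_deltas(before, after):
    """Return pods present in both listings whose restart count increased."""
    b = parse_restarts(before)
    a = parse_restarts(after)
    return {name: a[name] - b[name]
            for name in a
            if name in b and a[name] > b[name]}
```

Any pod reported by restart_deltas on a master that was not stopped is an instance of the unexpected restarts described in this bug.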

Actual results:
The API and controllers pods on the other masters, which were not stopped, restart.

Expected results:
The API and controllers pods on the other masters should not restart.

Additional info:
Attaching master logs and exited container logs from a master that was not restarted.

Comment 1 Vikas Laad 2018-05-07 19:56:06 UTC
Created attachment 1432805 [details]
master api exited container logs

Comment 2 Vikas Laad 2018-05-07 19:56:39 UTC
Created attachment 1432806 [details]
journalctl logs from ip-172-31-22-84.us-west-2.compute.internal

Comment 3 Vikas Laad 2018-05-08 14:43:13 UTC
I was able to reproduce it; it happens when the active master controller instance is stopped/restarted.

Here is the state before stopping the active master controller node, which in this case was ip-172-31-7-3.us-west-2.compute.internal.
Master pod status at the start:
master-api-ip-172-31-22-84.us-west-2.compute.internal            1/1       Running   9          13m
master-api-ip-172-31-32-140.us-west-2.compute.internal           1/1       Running   8          19h
master-api-ip-172-31-7-3.us-west-2.compute.internal              1/1       Running   2          18h
master-controllers-ip-172-31-22-84.us-west-2.compute.internal    1/1       Running   2          12m
master-controllers-ip-172-31-32-140.us-west-2.compute.internal   1/1       Running   4          19h
master-controllers-ip-172-31-7-3.us-west-2.compute.internal      1/1       Running   2          18h
master-etcd-ip-172-31-22-84.us-west-2.compute.internal           1/1       Running   1          12m
master-etcd-ip-172-31-32-140.us-west-2.compute.internal          1/1       Running   1          19h
master-etcd-ip-172-31-7-3.us-west-2.compute.internal             1/1       Running   2          18h

After stopping that instance (ip-172-31-7-3.us-west-2.compute.internal), I see both of the other masters and the infra node in NotReady state. Even trying to get logs from any of the master API or controllers pods fails with “Unable to connect to the server: unexpected EOF”.
NAME                                          STATUS     ROLES           AGE       VERSION
ip-172-31-1-0.us-west-2.compute.internal      NotReady   compute,infra   20h       v1.10.0+b81c8f8
ip-172-31-16-217.us-west-2.compute.internal   Ready      compute         20h       v1.10.0+b81c8f8
ip-172-31-22-84.us-west-2.compute.internal    NotReady   master          41m       v1.10.0+b81c8f8
ip-172-31-32-140.us-west-2.compute.internal   NotReady   master          20h       v1.10.0+b81c8f8
ip-172-31-6-179.us-west-2.compute.internal    Ready      compute         19h       v1.10.0+b81c8f8
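
To make the affected nodes easier to spot in a listing like the one above, here is a rough sketch that extracts the NotReady nodes and their roles. It assumes the text comes from a node listing with NAME, STATUS and ROLES as the first three columns (presumably "oc get nodes", although the report does not name the command); not_ready_nodes is a hypothetical helper.

```python
def not_ready_nodes(node_listing):
    """Return (name, roles) for every node whose STATUS includes NotReady."""
    result = []
    for line in node_listing.splitlines():
        fields = line.split()
        # Skip blank lines and the header row.
        if len(fields) < 3 or fields[0] == "NAME":
            continue
        name, status, roles = fields[:3]
        # STATUS can be a comma-separated list, e.g. "Ready,SchedulingDisabled".
        if "NotReady" in status.split(","):
            result.append((name, roles))
    return result
```

Run against the listing above, this would return the infra node and the two remaining masters, which matches the observation that only the plain compute nodes stayed Ready.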

The registry and router also get re-created after the infra node becomes Ready.

After around 15-16 minutes the nodes become Ready, but the master API and controllers pods keep going into CrashLoopBackOff and restart many times.
master-api-ip-172-31-22-84.us-west-2.compute.internal            1/1       CrashLoopBackOff   13         2m
master-api-ip-172-31-32-140.us-west-2.compute.internal           1/1       Running            14         20h
master-controllers-ip-172-31-22-84.us-west-2.compute.internal    1/1       Running            4          13s
master-controllers-ip-172-31-32-140.us-west-2.compute.internal   1/1       Running            5          20h
master-etcd-ip-172-31-22-84.us-west-2.compute.internal           1/1       Running            2          14s
master-etcd-ip-172-31-32-140.us-west-2.compute.internal          1/1       Running            1          20h

After almost 20-30 minutes the cluster is still unstable:
master-api-ip-172-31-22-84.us-west-2.compute.internal            0/1       CrashLoopBackOff   19         25m
master-api-ip-172-31-32-140.us-west-2.compute.internal           1/1       Running            17         21h
master-controllers-ip-172-31-22-84.us-west-2.compute.internal    1/1       Running            6          23m
master-controllers-ip-172-31-32-140.us-west-2.compute.internal   1/1       Running            6          21h
master-etcd-ip-172-31-22-84.us-west-2.compute.internal           1/1       Running            2          23m
master-etcd-ip-172-31-32-140.us-west-2.compute.internal          1/1       Running            1          21h

Comment 4 Vikas Laad 2018-05-08 14:46:15 UTC
Created attachment 1433253 [details]
desc node when its NotReady

Comment 5 Vikas Laad 2018-05-08 14:46:49 UTC
Created attachment 1433254 [details]
desc pod when api pod is in CrashLoopBackOff

Comment 6 Jordan Liggitt 2018-05-12 05:20:19 UTC
Does the CRIO in the title indicate that this is only seen in CRIO installations?

Comment 7 Vikas Laad 2018-05-14 02:14:13 UTC
No, I first saw this on a cluster using the CRIO runtime, but I was able to reproduce it on Docker too. Removed CRIO from the title.

Comment 8 Jordan Liggitt 2018-05-18 19:10:31 UTC
Is this still reproducible with https://github.com/openshift/origin/pull/19638 in?

Comment 10 Vikas Laad 2018-05-22 17:54:54 UTC
Verified on the following version; no unnecessary restarts occurred even after stopping the active master multiple times.

openshift v3.10.0-0.50.0

Comment 12 errata-xmlrpc 2018-07-30 19:14:38 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.