Bug 1575756 - Stopping/Restarting active master controller instance makes the entire cluster unstable
Summary: Stopping/Restarting active master controller instance makes the entire cluster unstable
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: OpenShift Container Platform
Classification: Red Hat
Component: Master
Version: 3.10.0
Hardware: Unspecified
OS: Unspecified
Priority: unspecified
Severity: high
Target Milestone: ---
Target Release: 3.10.0
Assignee: Jordan Liggitt
QA Contact: Vikas Laad
URL:
Whiteboard:
Depends On:
Blocks:
 
Reported: 2018-05-07 19:52 UTC by Vikas Laad
Modified: 2018-07-30 19:14 UTC
CC: 6 users

Fixed In Version:
Doc Type: No Doc Update
Doc Text:
Clone Of:
Environment:
Last Closed: 2018-07-30 19:14:38 UTC
Target Upstream Version:


Attachments (Terms of Use)
master api exited container logs (78.42 KB, text/plain)
2018-05-07 19:56 UTC, Vikas Laad
no flags Details
journalctl logs from ip-172-31-22-84.us-west-2.compute.internal (2.69 MB, text/x-vhdl)
2018-05-07 19:56 UTC, Vikas Laad
no flags Details
desc node when its NotReady (9.47 KB, text/plain)
2018-05-08 14:46 UTC, Vikas Laad
no flags Details
desc pod when api pod is in CrashLoopBackOff (2.67 KB, text/plain)
2018-05-08 14:46 UTC, Vikas Laad
no flags Details


Links
System ID Priority Status Summary Last Updated
Red Hat Product Errata RHBA-2018:1816 None None None 2018-07-30 19:14:59 UTC

Description Vikas Laad 2018-05-07 19:52:38 UTC
Description of problem:
Stopping one of the masters in an HA cluster makes the cluster unhealthy; the API and controller pods on the other two masters restart multiple times. In the following list, I stopped ip-172-31-7-3.us-west-2.compute.internal for some time and then started it back (using the AWS console).

root@ip-172-31-2-226: ~ # oc get pods -n kube-system                                                                          
NAME                                                             READY     STATUS    RESTARTS   AGE
master-api-ip-172-31-22-84.us-west-2.compute.internal            1/1       Running   8          2h
master-api-ip-172-31-32-140.us-west-2.compute.internal           1/1       Running   7          2h
master-api-ip-172-31-7-3.us-west-2.compute.internal              1/1       Running   1          9m
master-controllers-ip-172-31-22-84.us-west-2.compute.internal    1/1       Running   1          2h
master-controllers-ip-172-31-32-140.us-west-2.compute.internal   1/1       Running   2          2h
master-controllers-ip-172-31-7-3.us-west-2.compute.internal      1/1       Running   1          9m
master-etcd-ip-172-31-22-84.us-west-2.compute.internal           1/1       Running   0          2h
master-etcd-ip-172-31-32-140.us-west-2.compute.internal          1/1       Running   0          2h
master-etcd-ip-172-31-7-3.us-west-2.compute.internal             1/1       Running   1          9m

The following shows the exited containers:
root@ip-172-31-22-84: ~ # crictl ps -a | grep EXITED
W0507 19:47:19.870182   23176 util_unix.go:75] Using "/var/run/crio/crio.sock" as endpoint is deprecated, please consider using full url format "unix:///var/run/crio/crio.sock".
e2c13d3c6ab64       35c82099d3142075a2ddfb56815ff67f334460b91dd892eb37161d298d2b3528                                                                                 21 minutes ago      CONTAINER_EXITED    api                 7
49ed5dfa80a09       registry.reg-aws.openshift.com:443/openshift3/ose-node@sha256:c740e60f4f098c80289842a7f49f31d873ba176d83f716e66e03d4e23167862e                   2 hours ago         CONTAINER_EXITED    sync                0
ae4531789381c       registry.reg-aws.openshift.com:443/openshift3/ose-control-plane@sha256:7d5395addf13b47e75e65609fde5d7639487f695f86beb5fd64bc035bb819a63          2 hours ago         CONTAINER_EXITED    controllers         0

Version-Release number of selected component (if applicable):
openshift v3.10.0-0.32.0
kubernetes v1.10.0+b81c8f8
etcd 3.2.16

Steps to Reproduce:
1. Create an HA cluster with 3 masters.
2. While doing some activity, stop one of the masters (I used the AWS console to stop that node).
3. Check "oc get pods -n kube-system" after some time.
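The steps above can be sketched in shell. This is a sketch only: the instance ID is a hypothetical placeholder, and it assumes the AWS CLI and `oc` are already configured against the cluster. The `master_restarts` helper just sums the RESTARTS column (column 4) of the standard `oc get pods` output for master pods.

```shell
# Sketch: stop the active master via the AWS CLI instead of the console.
# The instance ID below is a placeholder, not one from this bug report.
stop_active_master() {
  aws ec2 stop-instances --instance-ids "i-0123456789abcdef0"
}

# Sum the RESTARTS column for master-* pods from `oc get pods -n kube-system`
# output piped in on stdin (columns: NAME READY STATUS RESTARTS AGE).
master_restarts() {
  awk '/^master-/ { total += $4 } END { print total + 0 }'
}

# After waiting a few minutes, check for unexpected restarts, e.g.:
#   oc get pods -n kube-system | master_restarts
```

A rising total from `master_restarts` on the masters that were *not* stopped is the symptom described in this bug.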

Actual results:
The API and controller pods on the other masters, which were not stopped, restart.

Expected results:
The API and controller pods on the other masters should not restart.

Additional info:
Attaching master logs and exited-container logs from a master that was not restarted.

Comment 1 Vikas Laad 2018-05-07 19:56:06 UTC
Created attachment 1432805 [details]
master api exited container logs

Comment 2 Vikas Laad 2018-05-07 19:56:39 UTC
Created attachment 1432806 [details]
journalctl logs from ip-172-31-22-84.us-west-2.compute.internal

Comment 3 Vikas Laad 2018-05-08 14:43:13 UTC
I was able to reproduce it; it happens when the active master instance is stopped/restarted.

Here is the state before stopping the active master controller node, in this case ip-172-31-7-3.us-west-2.compute.internal.
Master pod restart stats:
master-api-ip-172-31-22-84.us-west-2.compute.internal            1/1       Running   9          13m
master-api-ip-172-31-32-140.us-west-2.compute.internal           1/1       Running   8          19h
master-api-ip-172-31-7-3.us-west-2.compute.internal              1/1       Running   2          18h
master-controllers-ip-172-31-22-84.us-west-2.compute.internal    1/1       Running   2          12m
master-controllers-ip-172-31-32-140.us-west-2.compute.internal   1/1       Running   4          19h
master-controllers-ip-172-31-7-3.us-west-2.compute.internal      1/1       Running   2          18h
master-etcd-ip-172-31-22-84.us-west-2.compute.internal           1/1       Running   1          12m
master-etcd-ip-172-31-32-140.us-west-2.compute.internal          1/1       Running   1          19h
master-etcd-ip-172-31-7-3.us-west-2.compute.internal             1/1       Running   2          18h

After stopping that instance (ip-172-31-7-3.us-west-2.compute.internal), both the other masters and the infra node go into NotReady state; even trying to get logs from any of the master api or controllers pods fails with "Unable to connect to the server: unexpected EOF".
NAME                                          STATUS     ROLES           AGE       VERSION
ip-172-31-1-0.us-west-2.compute.internal      NotReady   compute,infra   20h       v1.10.0+b81c8f8
ip-172-31-16-217.us-west-2.compute.internal   Ready      compute         20h       v1.10.0+b81c8f8
ip-172-31-22-84.us-west-2.compute.internal    NotReady   master          41m       v1.10.0+b81c8f8
ip-172-31-32-140.us-west-2.compute.internal   NotReady   master          20h       v1.10.0+b81c8f8
ip-172-31-6-179.us-west-2.compute.internal    Ready      compute         19h       v1.10.0+b81c8f8
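The NotReady nodes in a listing like the one above can be pulled out with a short filter. A minimal sketch, assuming the standard (non-wide) `oc get nodes` column layout, where column 2 is STATUS:

```shell
# Print the names of NotReady nodes from `oc get nodes` output on stdin.
# NR > 1 skips the header row; $2 is the STATUS column.
unready_nodes() {
  awk 'NR > 1 && $2 == "NotReady" { print $1 }'
}

# Usage:
#   oc get nodes | unready_nodes
```

This makes it easy to watch when the surviving masters and the infra node flap between Ready and NotReady during the test.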

The registry and router also get re-created after the infra node becomes Ready.

After around 15-16 minutes the nodes become Ready, but the master API and controllers pods keep going into CrashLoopBackOff and restarting many times.
master-api-ip-172-31-22-84.us-west-2.compute.internal            1/1       CrashLoopBackOff   13         2m
master-api-ip-172-31-32-140.us-west-2.compute.internal           1/1       Running   14         20h
master-controllers-ip-172-31-22-84.us-west-2.compute.internal    1/1       Running   4          13s
master-controllers-ip-172-31-32-140.us-west-2.compute.internal   1/1       Running   5          20h
master-etcd-ip-172-31-22-84.us-west-2.compute.internal           1/1       Running   2          14s
master-etcd-ip-172-31-32-140.us-west-2.compute.internal          1/1       Running   1          20h

After almost 20-30 minutes the cluster is still unstable:
master-api-ip-172-31-22-84.us-west-2.compute.internal            0/1       CrashLoopBackOff   19         25m
master-api-ip-172-31-32-140.us-west-2.compute.internal           1/1       Running            17         21h
master-controllers-ip-172-31-22-84.us-west-2.compute.internal    1/1       Running            6          23m
master-controllers-ip-172-31-32-140.us-west-2.compute.internal   1/1       Running            6          21h
master-etcd-ip-172-31-22-84.us-west-2.compute.internal           1/1       Running            2          23m
master-etcd-ip-172-31-32-140.us-west-2.compute.internal          1/1       Running            1          21h

Comment 4 Vikas Laad 2018-05-08 14:46:15 UTC
Created attachment 1433253 [details]
desc node when its NotReady

Comment 5 Vikas Laad 2018-05-08 14:46:49 UTC
Created attachment 1433254 [details]
desc pod when api pod is in CrashLoopBackOff

Comment 6 Jordan Liggitt 2018-05-12 05:20:19 UTC
Does the CRIO in the title indicate that this is only seen in CRIO installations?

Comment 7 Vikas Laad 2018-05-14 02:14:13 UTC
No, I first saw this on a CRIO-runtime cluster, but I was able to reproduce it on Docker too. Removed CRIO from the title.

Comment 8 Jordan Liggitt 2018-05-18 19:10:31 UTC
is this still reproducible with https://github.com/openshift/origin/pull/19638 in?

Comment 10 Vikas Laad 2018-05-22 17:54:54 UTC
Verified on the following version; no unnecessary restarts occurred even after stopping the active master multiple times.

openshift v3.10.0-0.50.0

Comment 12 errata-xmlrpc 2018-07-30 19:14:38 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2018:1816

