Description of problem:
Stopping one of the masters on an HA cluster makes the cluster unhealthy; the API and controller pods on the other 2 masters restart multiple times. In the following list I stopped ip-172-31-7-3.us-west-2.compute.internal for some time and then started it back (using the AWS console).

root@ip-172-31-2-226: ~ # oc get pods -n kube-system
NAME                                                             READY   STATUS    RESTARTS   AGE
master-api-ip-172-31-22-84.us-west-2.compute.internal            1/1     Running   8          2h
master-api-ip-172-31-32-140.us-west-2.compute.internal           1/1     Running   7          2h
master-api-ip-172-31-7-3.us-west-2.compute.internal              1/1     Running   1          9m
master-controllers-ip-172-31-22-84.us-west-2.compute.internal    1/1     Running   1          2h
master-controllers-ip-172-31-32-140.us-west-2.compute.internal   1/1     Running   2          2h
master-controllers-ip-172-31-7-3.us-west-2.compute.internal      1/1     Running   1          9m
master-etcd-ip-172-31-22-84.us-west-2.compute.internal           1/1     Running   0          2h
master-etcd-ip-172-31-32-140.us-west-2.compute.internal          1/1     Running   0          2h
master-etcd-ip-172-31-7-3.us-west-2.compute.internal             1/1     Running   1          9m

Here it shows the exited containers:

root@ip-172-31-22-84: ~ # crictl ps -a | grep EXITED
W0507 19:47:19.870182   23176 util_unix.go:75] Using "/var/run/crio/crio.sock" as endpoint is deprecated, please consider using full url format "unix:///var/run/crio/crio.sock".
e2c13d3c6ab64   35c82099d3142075a2ddfb56815ff67f334460b91dd892eb37161d298d2b3528                                                                             21 minutes ago   CONTAINER_EXITED   api           7
49ed5dfa80a09   registry.reg-aws.openshift.com:443/openshift3/ose-node@sha256:c740e60f4f098c80289842a7f49f31d873ba176d83f716e66e03d4e23167862e                2 hours ago      CONTAINER_EXITED   sync          0
ae4531789381c   registry.reg-aws.openshift.com:443/openshift3/ose-control-plane@sha256:7d5395addf13b47e75e65609fde5d7639487f695f86beb5fd64bc035bb819a63       2 hours ago      CONTAINER_EXITED   controllers   0

Version-Release number of selected component (if applicable):
openshift v3.10.0-0.32.0
kubernetes v1.10.0+b81c8f8
etcd 3.2.16

Steps to Reproduce:
1. Create an HA cluster with 3 masters.
2. While doing some activity, stop one of the masters (I used the AWS console to stop that node).
3. Check "oc get pods -n kube-system" after some time.

Actual results:
The API and controllers pods on the other masters, which were not stopped, restart.

Expected results:
The API and controllers pods on the other masters should not restart.

Additional info:
Attaching master logs and exited-container logs from a master which was not restarted.
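For completeness, a rough CLI equivalent of the reproduction steps (a sketch only; I actually used the AWS console, and the instance ID below is a placeholder):

  # Stop one of the three master instances (placeholder instance ID)
  aws ec2 stop-instances --instance-ids i-0123456789abcdef0

  # Watch the control-plane static pods and their restart counts from another master
  oc get pods -n kube-system -w

  # List exited containers on a surviving master; passing the full unix:// URL avoids the deprecation warning shown above
  crictl --runtime-endpoint unix:///var/run/crio/crio.sock ps -a | grep CONTAINER_EXITED

  # Start the stopped master again after a while
  aws ec2 start-instances --instance-ids i-0123456789abcdef0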
Created attachment 1432805 [details] master api exited container logs
Created attachment 1432806 [details] journalctl logs from ip-172-31-22-84.us-west-2.compute.internal
I was able to reproduce it; it happens when the active master instance is stopped/restarted.

Here is the state before stopping the active controller node, in this case ip-172-31-7-3.us-west-2.compute.internal.

Master pod restart counts:

master-api-ip-172-31-22-84.us-west-2.compute.internal            1/1   Running   9    13m
master-api-ip-172-31-32-140.us-west-2.compute.internal           1/1   Running   8    19h
master-api-ip-172-31-7-3.us-west-2.compute.internal              1/1   Running   2    18h
master-controllers-ip-172-31-22-84.us-west-2.compute.internal    1/1   Running   2    12m
master-controllers-ip-172-31-32-140.us-west-2.compute.internal   1/1   Running   4    19h
master-controllers-ip-172-31-7-3.us-west-2.compute.internal      1/1   Running   2    18h
master-etcd-ip-172-31-22-84.us-west-2.compute.internal           1/1   Running   1    12m
master-etcd-ip-172-31-32-140.us-west-2.compute.internal          1/1   Running   1    19h
master-etcd-ip-172-31-7-3.us-west-2.compute.internal             1/1   Running   2    18h

After stopping that instance (ip-172-31-7-3.us-west-2.compute.internal), I see both of the other masters and the infra node in NotReady state; even trying to get logs from any of the master api or controllers pods fails with "Unable to connect to the server: unexpected EOF".

NAME                                          STATUS     ROLES           AGE   VERSION
ip-172-31-1-0.us-west-2.compute.internal      NotReady   compute,infra   20h   v1.10.0+b81c8f8
ip-172-31-16-217.us-west-2.compute.internal   Ready      compute         20h   v1.10.0+b81c8f8
ip-172-31-22-84.us-west-2.compute.internal    NotReady   master          41m   v1.10.0+b81c8f8
ip-172-31-32-140.us-west-2.compute.internal   NotReady   master          20h   v1.10.0+b81c8f8
ip-172-31-6-179.us-west-2.compute.internal    Ready      compute         19h   v1.10.0+b81c8f8

The registry and router also get re-created after the infra node becomes Ready.

After around 15-16 mins the nodes become Ready, but the master API and controllers pods keep going into CrashLoopBackOff and restart many times:

master-api-ip-172-31-22-84.us-west-2.compute.internal            1/1   CrashLoopBackOff   13   2m
master-api-ip-172-31-32-140.us-west-2.compute.internal           1/1   Running            14   20h
master-controllers-ip-172-31-22-84.us-west-2.compute.internal    1/1   Running            4    13s
master-controllers-ip-172-31-32-140.us-west-2.compute.internal   1/1   Running            5    20h
master-etcd-ip-172-31-22-84.us-west-2.compute.internal           1/1   Running            2    14s
master-etcd-ip-172-31-32-140.us-west-2.compute.internal          1/1   Running            1    20h

After almost 20-30 mins the cluster is still unstable:

master-api-ip-172-31-22-84.us-west-2.compute.internal            0/1   CrashLoopBackOff   19   25m
master-api-ip-172-31-32-140.us-west-2.compute.internal           1/1   Running            17   21h
master-controllers-ip-172-31-22-84.us-west-2.compute.internal    1/1   Running            6    23m
master-controllers-ip-172-31-32-140.us-west-2.compute.internal   1/1   Running            6    21h
master-etcd-ip-172-31-22-84.us-west-2.compute.internal           1/1   Running            2    23m
master-etcd-ip-172-31-32-140.us-west-2.compute.internal          1/1   Running            1    21h
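Roughly the commands used to capture the state above (a sketch; the journalctl unit name is an assumption based on a standard 3.10 RPM install and may differ on other installs):

  # Node readiness while the active master is stopped
  oc get nodes

  # Control-plane static pod status and restart counts
  oc get pods -n kube-system -o wide

  # Node-side logs on a surviving master while the API returns "unexpected EOF"
  journalctl -u atomic-openshift-node --since "1 hour ago"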
Created attachment 1433253 [details] desc node when its NotReady
Created attachment 1433254 [details] desc pod when api pod is in CrashLoopBackOff
Is the CRIO in the title indicating that this is only seen in CRIO installations?
No, I first saw this on a CRIO runtime cluster, but I was able to reproduce it on Docker too. Removed CRIO from the title.
Is this still reproducible with https://github.com/openshift/origin/pull/19638 in?
Verified on the following version; no unnecessary restarts occurred even after stopping the active master multiple times:

openshift v3.10.0-0.50.0
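For reference, a sketch of how this check can be scripted (placeholder instance ID; not necessarily the exact procedure used during verification):

  # Stop and start the active master a few times, then confirm the RESTARTS
  # column stays flat on the surviving masters
  for i in 1 2 3; do
    aws ec2 stop-instances --instance-ids i-0123456789abcdef0
    sleep 300
    aws ec2 start-instances --instance-ids i-0123456789abcdef0
    sleep 300
    oc get pods -n kube-system
  done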
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHBA-2018:1816