Description of problem:
---
- When the non-active master-controllers and the non-leader etcd members are stopped, the OpenShift master-controllers stop working correctly, e.g. Node status is reported wrongly.

Version-Release number of selected component (if applicable):
---
- OCP v3.3.1.7

How reproducible:
---
- Stop the non-active controllers and the non-leader etcd members.

Steps to Reproduce:
---
1. Set up a multi-master environment (e.g. I tested with 3 masters).
2. Stop atomic-openshift-master-controllers and etcd on the non-active master / non-leader etcd hosts:
   # systemctl stop atomic-openshift-master-controllers etcd
3. Try to stop one of the Node hosts.

Actual results:
- In step-3, the Node remains "Ready" (see the example at the end of this comment).
- When this issue happens, the master-controllers start logging "etcd cluster is unavailable or misconfigured" and "unable to check lease".
~~~
Jan 30 14:04:44 knakayam-ose33-master1.example.com atomic-openshift-master-controllers[2627]: E0130 14:04:44.465167 2627 leaderlease.go:146] unable to release openshift.io/leases/controllers: 501: All the given peers are not reachable (failed to propose on members [https://knakayam-ose33-master1.example.com:2379 https://knakayam-ose33-master3.example.com:2379 https://knakayam-ose33-master2.example.com:2379] twice [last error: Unexpected HTTP status code]) [0]
Jan 30 14:02:04 knakayam-ose33-master1.example.com atomic-openshift-master-controllers[2627]: E0130 14:02:04.514779 2627 nodecontroller.go:574] Unable to mark all pods NotReady on node knakayam-ose33-master2.example.com: client: etcd cluster is unavailable or misconfigured; client: etcd cluster is unavailable or misconfigured
Jan 30 14:02:14 knakayam-ose33-master1.example.com atomic-openshift-master-controllers[2627]: E0130 14:02:14.522233 2627 nodecontroller.go:830] Error updating node knakayam-ose33-master3.example.com: client: etcd cluster is unavailable or misconfigured
Jan 30 14:02:24 knakayam-ose33-master1.example.com atomic-openshift-master-controllers[2627]: E0130 14:02:24.532264 2627 nodecontroller.go:830] Error updating node knakayam-ose33-master3.example.com: client: etcd cluster is unavailable or misconfigured
... snip ...
Jan 30 14:13:05 knakayam-ose33-master1.example.com atomic-openshift-master-controllers[8138]: I0130 14:13:05.051309 8138 master.go:271] Started health checks at 0.0.0.0:8444
Jan 30 14:13:05 knakayam-ose33-master1.example.com atomic-openshift-master-controllers[8138]: I0130 14:13:05.052849 8138 master_config.go:549] Attempting to acquire controller lease as master-7d1tk96l, renewing every 30 seconds
Jan 30 14:13:46 knakayam-ose33-master1.example.com atomic-openshift-master-controllers[8138]: E0130 14:13:46.651637 8138 leaderlease.go:69] unable to check lease openshift.io/leases/controllers: 501: All the given peers are not reachable (failed to propose on members [https://knakayam-ose33-master2.example.com:2379 https://knakayam-ose33-master3.example.com:2379 https://knakayam-ose33-master1.example.com:2379] twice [last error: Unexpected HTTP status code]) [0]
~~~

Expected results:
- The master-controllers service keeps working fine.
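For reference, a minimal way to observe the wrong status in step-3 is to run "oc get nodes" on the active master. The output below is only an illustrative sketch; the node hostname and AGE column are hypothetical and not taken from the attached inventory:
~~~
# oc get nodes
NAME                               STATUS    AGE
knakayam-ose33-node1.example.com   Ready     10d    <-- host was stopped in step-3 but still reports "Ready"
~~~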
Could you please prioritize this issue, since it impacts the service level of their clusters?
Are the etcd servers running remotely from the masters or locally on them? Could you please provide the master configs?
Created attachment 1246722 [details]
master-config.yaml

Master and etcd are running on the same host.
Could you also please provide the ansible inventory file used to configure the cluster?
Created attachment 1246993 [details]
ansible inventory file

I have attached the ansible inventory file.
Here are the exact steps to reproduce the issue.

step-1. Check which controllers are Active (master-controllers) and which etcd member is the Leader.
===
# export `cat /etc/etcd/etcd.conf | grep ETCD_LISTEN_CLIENT_URLS`
# etcdctl -C ${ETCD_LISTEN_CLIENT_URLS} --ca-file=/etc/etcd/ca.crt --cert-file=/etc/etcd/peer.crt --key-file=/etc/etcd/peer.key member list
2e1416469cd02549: name=knakayam-ose33-master2.example.com peerURLs=https://10.64.221.127:2380 clientURLs=https://10.64.221.127:2379 isLeader=false
5362484380f8bba6: name=knakayam-ose33-master1.example.com peerURLs=https://10.64.220.218:2380 clientURLs=https://10.64.220.218:2379 isLeader=true
dee5f8da0297a2a1: name=knakayam-ose33-master3.example.com peerURLs=https://10.64.221.141:2380 clientURLs=https://10.64.221.141:2379 isLeader=false

# ansible masters -m shell -a "journalctl --no-pager -u atomic-openshift-master-controllers.service | tail -1"
knakayam-ose33-master1.example.com | SUCCESS | rc=0 >>
Feb 02 15:59:29 knakayam-ose33-master1.example.com atomic-openshift-master-controllers[2679]: W0202 15:59:29.282726 2679 reflector.go:330] github.com/openshift/origin/pkg/build/controller/factory/factory.go:130: watch of *api.Build ended with: too old resource version: 713512 (1238236)

knakayam-ose33-master2.example.com | SUCCESS | rc=0 >>
Feb 02 15:52:21 knakayam-ose33-master2.example.com atomic-openshift-master-controllers[2720]: I0202 15:52:21.969443 2720 master_config.go:549] Attempting to acquire controller lease as master-cpfx1f8y, renewing every 30 seconds

knakayam-ose33-master3.example.com | SUCCESS | rc=0 >>
Feb 02 15:52:05 knakayam-ose33-master3.example.com atomic-openshift-master-controllers[2619]: I0202 15:52:05.145745 2619 master_config.go:549] Attempting to acquire controller lease as master-foxigv1n, renewing every 30 seconds

step-2. Stop the non-leader and non-active master-controllers (in the case above, master2 and master3).
===
(non-active controllers 1) # systemctl stop etcd atomic-openshift-master-controllers
(non-active controllers 2) # systemctl stop etcd atomic-openshift-master-controllers

step-3. Check the atomic-openshift-master-controllers logs on the active master.
===
# journalctl --no-pager -u atomic-openshift-master-controllers.service -f
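As a side note, step-2 can also be run in one shot from the host that holds the ansible inventory. This is only a sketch assuming the same inventory attached above, with master2 and master3 being the non-active / non-leader members as in step-1:
~~~
# ansible "knakayam-ose33-master2.example.com:knakayam-ose33-master3.example.com" -m shell \
    -a "systemctl stop etcd atomic-openshift-master-controllers"
~~~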
Hi Engineering team, which is the case: were you unable to reproduce the issue, or have you not yet tried to reproduce it?
From comment 8, step-2 -- the problem is that they've taken a three-node etcd cluster down to one node, which makes the cluster read-only. You must have a majority of etcd hosts online to continue operations.
Could you please verify by taking down only 1 non-leader etcd member? A minimum of 2 members is required to change state in a 3-node cluster, and a minimum of 3 is required in a 5-node cluster.
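One way to confirm the quorum state after each stop is to run "etcdctl cluster-health" from a surviving member. The sketch below reuses the etcdctl connection flags from the member list command in the reproduction steps; the exact output wording may differ between etcd versions:
~~~
# export `cat /etc/etcd/etcd.conf | grep ETCD_LISTEN_CLIENT_URLS`
# etcdctl -C ${ETCD_LISTEN_CLIENT_URLS} --ca-file=/etc/etcd/ca.crt \
    --cert-file=/etc/etcd/peer.crt --key-file=/etc/etcd/peer.key cluster-health
# With 2 of 3 members running the cluster should still report healthy (quorum = 2),
# so the master-controllers keep working. With only 1 of 3 members running the cluster
# loses quorum, becomes read-only, and the errors shown in the description start to appear.
~~~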
I see. Thank you. Then it is not a bug.
Yes, I will confirm it.
I confirmed that the cluster keeps working fine with 2 of the 3 masters/etcd members up. I should have read the docs more carefully, but the following sentence is really easy to misunderstand (in that situation there actually is downtime, since a single surviving member has lost quorum)...
https://docs.openshift.com/container-platform/3.3/admin_guide/backup_restore.html#backup-restore-adding-etcd-hosts
"In cases where etcd hosts have failed, but you have at least one host still running, you can use the one surviving host to recover etcd hosts without downtime."