Bug 1417524 - Multi-master HA failed when non-active controllers and non-leader etcd were stopped
Summary: Multi-master HA failed when non-active controllers and non-leader etcd were stopped
Keywords:
Status: CLOSED NOTABUG
Alias: None
Product: OpenShift Container Platform
Classification: Red Hat
Component: Node
Version: 3.3.1
Hardware: x86_64
OS: Linux
Priority: unspecified
Severity: high
Target Milestone: ---
Target Release: ---
Assignee: Timothy St. Clair
QA Contact: DeShuai Ma
URL:
Whiteboard:
Depends On:
Blocks:
 
Reported: 2017-01-30 05:17 UTC by Kenjiro Nakayama
Modified: 2020-03-11 15:40 UTC
CC List: 7 users

Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed: 2017-02-03 18:28:08 UTC
Target Upstream Version:
Embargoed:


Attachments (Terms of Use)
master-config.yaml (4.46 KB, text/plain)
2017-02-01 16:17 UTC, Kenjiro Nakayama
no flags Details
ansible inventory file (2.65 KB, text/plain)
2017-02-02 06:18 UTC, Kenjiro Nakayama
no flags Details


Links
System ID Private Priority Status Summary Last Updated
Red Hat Knowledge Base (Solution) 2918511 0 None None None 2017-02-10 07:16:10 UTC

Description Kenjiro Nakayama 2017-01-30 05:17:32 UTC
Description of problem:
---
- When the non-active master controllers and the non-leader etcd members were stopped, the OpenShift master-controllers stopped working correctly, e.g. node status was reported incorrectly.

Version-Release number of selected component (if applicable):
---
- OCP v3.3.1.7

How reproducible:
---
- Stop the non-active controllers and the non-leader etcd members.

Steps to Reproduce:
---
1. Set up a multi-master environment (e.g. I tested with 3 masters).
2. Stop atomic-openshift-master-controllers and etcd on the non-active master and non-leader etcd hosts.

   # systemctl stop atomic-openshift-master-controllers etcd

3. Stop one of the node hosts.
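
   A minimal sketch of how step 3 can be verified (assuming the node service is named atomic-openshift-node, as in OCP 3.3, and that oc is run from a master with cluster-admin credentials):

   (on the node host: stop the node service, or shut the host down)
   # systemctl stop atomic-openshift-node

   (on a master: watch the node status; with a healthy etcd quorum the node should eventually go from Ready to NotReady)
   # oc get nodes -w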

Actual results:
- In step 3, the node remains "Ready".
- When this issue occurred, the master controllers started logging "etcd cluster is unavailable or misconfigured" and "unable to check lease" errors.

~~~
Jan 30 14:04:44 knakayam-ose33-master1.example.com atomic-openshift-master-controllers[2627]: E0130 14:04:44.465167    2627 leaderlease.go:146] unable to release openshift.io/leases/controllers: 501: All the given peers are not reachable (failed to propose on members [https://knakayam-ose33-master1.example.com:2379 https://knakayam-ose33-master3.example.com:2379 https://knakayam-ose33-master2.example.com:2379] twice [last error: Unexpected HTTP status code]) [0]
Jan 30 14:02:04 knakayam-ose33-master1.example.com atomic-openshift-master-controllers[2627]: E0130 14:02:04.514779    2627 nodecontroller.go:574] Unable to mark all pods NotReady on node knakayam-ose33-master2.example.com: client: etcd cluster is unavailable or misconfigured; client: etcd cluster is unavailable or misconfigured
Jan 30 14:02:14 knakayam-ose33-master1.example.com atomic-openshift-master-controllers[2627]: E0130 14:02:14.522233    2627 nodecontroller.go:830] Error updating node knakayam-ose33-master3.example.com: client: etcd cluster is unavailable or misconfigured
Jan 30 14:02:24 knakayam-ose33-master1.example.com atomic-openshift-master-controllers[2627]: E0130 14:02:24.532264    2627 nodecontroller.go:830] Error updating node knakayam-ose33-master3.example.com: client: etcd cluster is unavailable or misconfigured
... snip ...
Jan 30 14:13:05 knakayam-ose33-master1.example.com atomic-openshift-master-controllers[8138]: I0130 14:13:05.051309    8138 master.go:271] Started health checks at 0.0.0.0:8444
Jan 30 14:13:05 knakayam-ose33-master1.example.com atomic-openshift-master-controllers[8138]: I0130 14:13:05.052849    8138 master_config.go:549] Attempting to acquire controller lease as master-7d1tk96l, renewing every 30 seconds
Jan 30 14:13:46 knakayam-ose33-master1.example.com atomic-openshift-master-controllers[8138]: E0130 14:13:46.651637    8138 leaderlease.go:69] unable to check lease openshift.io/leases/controllers: 501: All the given peers are not reachable (failed to propose on members [https://knakayam-ose33-master2.example.com:2379 https://knakayam-ose33-master3.example.com:2379 https://knakayam-ose33-master1.example.com:2379] twice [last error: Unexpected HTTP status code]) [0]
~~~
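
When these errors appear, one quick check is to ask etcd directly whether the cluster still has quorum (a sketch, assuming the v2 etcdctl and the peer certificates under /etc/etcd/):

   # export `cat /etc/etcd/etcd.conf | grep ETCD_LISTEN_CLIENT_URLS`
   # etcdctl -C ${ETCD_LISTEN_CLIENT_URLS} --ca-file=/etc/etcd/ca.crt --cert-file=/etc/etcd/peer.crt --key-file=/etc/etcd/peer.key cluster-health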

Expected results:
- The master-controllers service keeps working normally.

Comment 1 Kenjiro Nakayama 2017-01-31 01:37:39 UTC
Could you please prioritize this issue, since it impacts the service level of the customer's clusters?

Comment 2 Timothy St. Clair 2017-02-01 15:23:46 UTC
Are the etcd servers running remotely or locally to the masters?

Could you please provide the master configs?

Comment 3 Kenjiro Nakayama 2017-02-01 16:17:35 UTC
Created attachment 1246722 [details]
master-config.yaml

Master and etcd are running on the same host.

Comment 4 Timothy St. Clair 2017-02-01 20:55:26 UTC
Could you also please provide the Ansible inventory file used to configure the cluster?

Comment 6 Kenjiro Nakayama 2017-02-02 06:18:51 UTC
Created attachment 1246993 [details]
ansible inventory file

I have attached the Ansible inventory file.

Comment 8 Kenjiro Nakayama 2017-02-02 07:09:26 UTC
Here are the exact steps to reproduce the issue
---

step-1. Check which master-controllers instance is active and which etcd member is the leader (a lease-key sketch follows the output below)
===
  # export `cat /etc/etcd/etcd.conf |grep ETCD_LISTEN_CLIENT_URLS`
  # etcdctl -C ${ETCD_LISTEN_CLIENT_URLS} --ca-file=/etc/etcd/ca.crt     --cert-file=/etc/etcd/peer.crt         --key-file=/etc/etcd/peer.key member list
    2e1416469cd02549: name=knakayam-ose33-master2.example.com peerURLs=https://10.64.221.127:2380 clientURLs=https://10.64.221.127:2379 isLeader=false
    5362484380f8bba6: name=knakayam-ose33-master1.example.com peerURLs=https://10.64.220.218:2380 clientURLs=https://10.64.220.218:2379 isLeader=true
    dee5f8da0297a2a1: name=knakayam-ose33-master3.example.com peerURLs=https://10.64.221.141:2380 clientURLs=https://10.64.221.141:2379 isLeader=false

  # ansible masters -m shell -a "journalctl --no-pager -u atomic-openshift-master-controllers.service | tail -1"
  knakayam-ose33-master1.example.com | SUCCESS | rc=0 >>
  Feb 02 15:59:29 knakayam-ose33-master1.example.com atomic-openshift-master-controllers[2679]: W0202 15:59:29.282726    2679 reflector.go:330] github.com/openshift/origin/pkg/build/controller/factory/factory.go:130: watch of *api.Build ended with: too old resource version: 713512 (1238236)

  knakayam-ose33-master2.example.com | SUCCESS | rc=0 >>
  Feb 02 15:52:21 knakayam-ose33-master2.example.com atomic-openshift-master-controllers[2720]: I0202 15:52:21.969443    2720 master_config.go:549] Attempting to acquire controller lease as master-cpfx1f8y, renewing every 30 seconds

  knakayam-ose33-master3.example.com | SUCCESS | rc=0 >>
  Feb 02 15:52:05 knakayam-ose33-master3.example.com atomic-openshift-master-controllers[2619]: I0202 15:52:05.145745    2619 master_config.go:549] Attempting to acquire controller lease as master-foxigv1n, renewing every 30 seconds
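
  The holder of the controllers lease can also be read directly from etcd (a sketch; the key path openshift.io/leases/controllers is taken from the log messages above, and the exact value format may vary):

  # etcdctl -C ${ETCD_LISTEN_CLIENT_URLS} --ca-file=/etc/etcd/ca.crt --cert-file=/etc/etcd/peer.crt --key-file=/etc/etcd/peer.key get /openshift.io/leases/controllers

  The returned value (a master-... identifier like those in the logs above) should correspond to the active master.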


step-2. Stop the non-leader etcd members and the non-active master controllers (in the case above, master2 and master3)
===
  (non-active controllers 1)
  # systemctl stop etcd atomic-openshift-master-controllers

  (non-active controllers 2)
  # systemctl stop etcd atomic-openshift-master-controllers


step-3. Check the atomic-openshift-master-controllers logs on the active master
===
  # journalctl --no-pager -u atomic-openshift-master-controllers.service -f

Comment 11 Kenjiro Nakayama 2017-02-03 06:29:17 UTC
Hi engineering team, which is the case: you could not reproduce the issue, or you have not yet tried to reproduce it?

Comment 12 Scott Dodson 2017-02-03 14:49:00 UTC
From comment 8, step-2 --

The problem is that they've taken a three-node etcd cluster down to one node, which makes the cluster read-only. A majority of etcd hosts must be online to continue operations.

Comment 13 Timothy St. Clair 2017-02-03 14:58:11 UTC
Could you please verify by taking down only one non-leader etcd member?

A minimum of 2 members is required to change state in a 3-node cluster.
A minimum of 3 members is required to change state in a 5-node cluster.
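
Put differently, a cluster of N members needs a quorum of N/2 + 1 (integer division) to accept writes, so it tolerates N - quorum failed members. A quick sketch of the arithmetic:

  # N=3; Q=$(( N / 2 + 1 )); echo "quorum=$Q, tolerates $(( N - Q )) failure(s)"
  quorum=2, tolerates 1 failure(s)
  # N=5; Q=$(( N / 2 + 1 )); echo "quorum=$Q, tolerates $(( N - Q )) failure(s)"
  quorum=3, tolerates 2 failure(s)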

Comment 14 Kenjiro Nakayama 2017-02-03 15:07:38 UTC
I see. Thank you. Then it is not a bug.

Comment 15 Kenjiro Nakayama 2017-02-03 15:11:03 UTC
Yes, I will confirm it.

Comment 17 Kenjiro Nakayama 2017-02-04 06:22:20 UTC
I confirmed that the cluster works fine with 2 of 3 masters/etcd members running.

I should have read the docs more carefully, but the following sentence is easy to misread (the situation described above is in fact expected to cause downtime)...

https://docs.openshift.com/container-platform/3.3/admin_guide/backup_restore.html#backup-restore-adding-etcd-hosts
"In cases where etcd hosts have failed, but you have at least one host still running, you can use the one surviving host to recover etcd hosts without downtime.
"

