Bug 1417524 - Multi-master HA failed when non-active controllers and non-leader etcd were stopped
Summary: Multi-master HA failed when non-active controllers and non-leader etcd were stopped
Keywords:
Status: CLOSED NOTABUG
Alias: None
Product: OpenShift Container Platform
Classification: Red Hat
Component: Node
Version: 3.3.1
Hardware: x86_64
OS: Linux
Priority: unspecified
Severity: high
Target Milestone: ---
Target Release: ---
Assignee: Timothy St. Clair
QA Contact: DeShuai Ma
URL:
Whiteboard:
Depends On:
Blocks:
 
Reported: 2017-01-30 05:17 UTC by Kenjiro Nakayama
Modified: 2020-03-11 15:40 UTC
CC List: 7 users

Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed: 2017-02-03 18:28:08 UTC
Target Upstream Version:
Embargoed:


Attachments (Terms of Use)
master-config.yaml (4.46 KB, text/plain)
2017-02-01 16:17 UTC, Kenjiro Nakayama
no flags Details
ansible inventory file (2.65 KB, text/plain)
2017-02-02 06:18 UTC, Kenjiro Nakayama
no flags Details


Links
System ID Private Priority Status Summary Last Updated
Red Hat Knowledge Base (Solution) 2918511 0 None None None 2017-02-10 07:16:10 UTC

Description Kenjiro Nakayama 2017-01-30 05:17:32 UTC
Description of problem:
---
- When the non-active master controllers and the non-leader etcd members were stopped, the OpenShift master-controllers stopped working correctly, e.g. node status was reported incorrectly.

Version-Release number of selected component (if applicable):
---
- OCP v3.3.1.7

How reproducible:
---
- Stop the non-active controllers and the non-leader etcd members.

Steps to Reproduce:
---
1. Set up a multi-master environment (e.g. I tested with 3 masters).
2. Stop atomic-openshift-master-controllers and etcd on the non-active master and non-leader etcd hosts.

   # systemctl stop atomic-openshift-master-controllers etcd

3. Stop one of the node hosts.
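
   A minimal sketch of how step 3 can be verified (assuming the node service is named atomic-openshift-node, as in OCP 3.3, and that oc is run from a master with cluster-admin credentials):

   (on the node host: stop the node service, or shut the host down)
   # systemctl stop atomic-openshift-node

   (on a master: watch the node status; with a healthy etcd quorum the node should eventually go from Ready to NotReady)
   # oc get nodes -w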

Actual results:
- In step 3, the node remains "Ready".
- When this issue occurred, the master controllers started logging "etcd cluster is unavailable or misconfigured" and "unable to check lease" errors.

~~~
Jan 30 14:04:44 knakayam-ose33-master1.example.com atomic-openshift-master-controllers[2627]: E0130 14:04:44.465167    2627 leaderlease.go:146] unable to release openshift.io/leases/controllers: 501: All the given peers are not reachable (failed to propose on members [https://knakayam-ose33-master1.example.com:2379 https://knakayam-ose33-master3.example.com:2379 https://knakayam-ose33-master2.example.com:2379] twice [last error: Unexpected HTTP status code]) [0]
Jan 30 14:02:04 knakayam-ose33-master1.example.com atomic-openshift-master-controllers[2627]: E0130 14:02:04.514779    2627 nodecontroller.go:574] Unable to mark all pods NotReady on node knakayam-ose33-master2.example.com: client: etcd cluster is unavailable or misconfigured; client: etcd cluster is unavailable or misconfigured
Jan 30 14:02:14 knakayam-ose33-master1.example.com atomic-openshift-master-controllers[2627]: E0130 14:02:14.522233    2627 nodecontroller.go:830] Error updating node knakayam-ose33-master3.example.com: client: etcd cluster is unavailable or misconfigured
Jan 30 14:02:24 knakayam-ose33-master1.example.com atomic-openshift-master-controllers[2627]: E0130 14:02:24.532264    2627 nodecontroller.go:830] Error updating node knakayam-ose33-master3.example.com: client: etcd cluster is unavailable or misconfigured
... snip ...
Jan 30 14:13:05 knakayam-ose33-master1.example.com atomic-openshift-master-controllers[8138]: I0130 14:13:05.051309    8138 master.go:271] Started health checks at 0.0.0.0:8444
Jan 30 14:13:05 knakayam-ose33-master1.example.com atomic-openshift-master-controllers[8138]: I0130 14:13:05.052849    8138 master_config.go:549] Attempting to acquire controller lease as master-7d1tk96l, renewing every 30 seconds
Jan 30 14:13:46 knakayam-ose33-master1.example.com atomic-openshift-master-controllers[8138]: E0130 14:13:46.651637    8138 leaderlease.go:69] unable to check lease openshift.io/leases/controllers: 501: All the given peers are not reachable (failed to propose on members [https://knakayam-ose33-master2.example.com:2379 https://knakayam-ose33-master3.example.com:2379 https://knakayam-ose33-master1.example.com:2379] twice [last error: Unexpected HTTP status code]) [0]
~~~
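
When these errors appear, one quick check is to ask etcd directly whether the cluster still has quorum (a sketch, assuming the v2 etcdctl and the peer certificates under /etc/etcd/):

   # export `cat /etc/etcd/etcd.conf | grep ETCD_LISTEN_CLIENT_URLS`
   # etcdctl -C ${ETCD_LISTEN_CLIENT_URLS} --ca-file=/etc/etcd/ca.crt --cert-file=/etc/etcd/peer.crt --key-file=/etc/etcd/peer.key cluster-health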

Expected results:
- The master-controllers service keeps working normally.

Comment 1 Kenjiro Nakayama 2017-01-31 01:37:39 UTC
Could you please prioritize this issue, since it impacts the service level of the customer's clusters?

Comment 2 Timothy St. Clair 2017-02-01 15:23:46 UTC
Are the etcd servers running remotely or locally to the masters?

Could you please provide the master configs?

Comment 3 Kenjiro Nakayama 2017-02-01 16:17:35 UTC
Created attachment 1246722 [details]
master-config.yaml

Master and etcd are running on the same host.

Comment 4 Timothy St. Clair 2017-02-01 20:55:26 UTC
Could you also please provide the Ansible inventory file used to configure the cluster?

Comment 6 Kenjiro Nakayama 2017-02-02 06:18:51 UTC
Created attachment 1246993 [details]
ansible inventory file

I have attached the Ansible inventory file.

Comment 8 Kenjiro Nakayama 2017-02-02 07:09:26 UTC
Here are the exact steps to reproduce the issue
---

step-1. Check which master-controllers instance is active and which etcd member is the leader (a lease-key sketch follows the output below)
===
  # export `cat /etc/etcd/etcd.conf |grep ETCD_LISTEN_CLIENT_URLS`
  # etcdctl -C ${ETCD_LISTEN_CLIENT_URLS} --ca-file=/etc/etcd/ca.crt     --cert-file=/etc/etcd/peer.crt         --key-file=/etc/etcd/peer.key member list
    2e1416469cd02549: name=knakayam-ose33-master2.example.com peerURLs=https://10.64.221.127:2380 clientURLs=https://10.64.221.127:2379 isLeader=false
    5362484380f8bba6: name=knakayam-ose33-master1.example.com peerURLs=https://10.64.220.218:2380 clientURLs=https://10.64.220.218:2379 isLeader=true
    dee5f8da0297a2a1: name=knakayam-ose33-master3.example.com peerURLs=https://10.64.221.141:2380 clientURLs=https://10.64.221.141:2379 isLeader=false

  # ansible masters -m shell -a "journalctl --no-pager -u atomic-openshift-master-controllers.service | tail -1"
  knakayam-ose33-master1.example.com | SUCCESS | rc=0 >>
  Feb 02 15:59:29 knakayam-ose33-master1.example.com atomic-openshift-master-controllers[2679]: W0202 15:59:29.282726    2679 reflector.go:330] github.com/openshift/origin/pkg/build/controller/factory/factory.go:130: watch of *api.Build ended with: too old resource version: 713512 (1238236)

  knakayam-ose33-master2.example.com | SUCCESS | rc=0 >>
  Feb 02 15:52:21 knakayam-ose33-master2.example.com atomic-openshift-master-controllers[2720]: I0202 15:52:21.969443    2720 master_config.go:549] Attempting to acquire controller lease as master-cpfx1f8y, renewing every 30 seconds

  knakayam-ose33-master3.example.com | SUCCESS | rc=0 >>
  Feb 02 15:52:05 knakayam-ose33-master3.example.com atomic-openshift-master-controllers[2619]: I0202 15:52:05.145745    2619 master_config.go:549] Attempting to acquire controller lease as master-foxigv1n, renewing every 30 seconds
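
  The holder of the controllers lease can also be read directly from etcd (a sketch; the key path openshift.io/leases/controllers is taken from the log messages above, and the exact value format may vary):

  # etcdctl -C ${ETCD_LISTEN_CLIENT_URLS} --ca-file=/etc/etcd/ca.crt --cert-file=/etc/etcd/peer.crt --key-file=/etc/etcd/peer.key get /openshift.io/leases/controllers

  The returned value (a master-... identifier like those in the logs above) should correspond to the active master.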


step-2. Stop the non-leader etcd members and the non-active master controllers (in the case above, master2 and master3)
===
  (non-active controllers 1)
  # systemctl stop etcd atomic-openshift-master-controllers

  (non-active controllers 2)
  # systemctl stop etcd atomic-openshift-master-controllers


step-3. Check the atomic-openshift-master-controllers logs on the active master
===
  # journalctl --no-pager -u atomic-openshift-master-controllers.service -f

Comment 11 Kenjiro Nakayama 2017-02-03 06:29:17 UTC
Hi engineering team, which is the case: you could not reproduce the issue, or you have not yet tried to reproduce it?

Comment 12 Scott Dodson 2017-02-03 14:49:00 UTC
From comment 8, step-2 --

The problem is that they've taken a three-node etcd cluster down to one node, which makes the cluster read-only. A majority of etcd hosts must be online to continue operations.

Comment 13 Timothy St. Clair 2017-02-03 14:58:11 UTC
Could you please verify by taking down only one non-leader etcd member?

A minimum of 2 members is required to change state in a 3-node cluster.
A minimum of 3 members is required to change state in a 5-node cluster.
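
Put differently, a cluster of N members needs a quorum of N/2 + 1 (integer division) to accept writes, so it tolerates N - quorum failed members. A quick sketch of the arithmetic:

  # N=3; Q=$(( N / 2 + 1 )); echo "quorum=$Q, tolerates $(( N - Q )) failure(s)"
  quorum=2, tolerates 1 failure(s)
  # N=5; Q=$(( N / 2 + 1 )); echo "quorum=$Q, tolerates $(( N - Q )) failure(s)"
  quorum=3, tolerates 2 failure(s)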

Comment 14 Kenjiro Nakayama 2017-02-03 15:07:38 UTC
I see. Thank you. Then it is not a bug.

Comment 15 Kenjiro Nakayama 2017-02-03 15:11:03 UTC
Yes, I will confirm it.

Comment 17 Kenjiro Nakayama 2017-02-04 06:22:20 UTC
I confirmed that the cluster works fine with 2 of 3 masters/etcd members running.

I should have read the docs more carefully, but the following sentence is easy to misread (the situation described above is in fact expected to cause downtime)...

https://docs.openshift.com/container-platform/3.3/admin_guide/backup_restore.html#backup-restore-adding-etcd-hosts
"In cases where etcd hosts have failed, but you have at least one host still running, you can use the one surviving host to recover etcd hosts without downtime.
"

