Bug 1426183 - [3.6]master ha failover sometimes does not work when one member in etcd cluster is stopped.
Summary: [3.6]master ha failover sometimes does not work when one member in etcd cluster is stopped.
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: OpenShift Container Platform
Classification: Red Hat
Component: Node
Version: 3.5.0
Hardware: Unspecified
OS: Unspecified
Priority: medium
Severity: high
Target Milestone: ---
Assignee: Jordan Liggitt
QA Contact: Johnny Liu
URL:
Whiteboard:
Depends On:
Blocks: 1426733
 
Reported: 2017-02-23 11:49 UTC by Johnny Liu
Modified: 2017-10-06 02:25 UTC (History)

Fixed In Version:
Doc Type: Bug Fix
Doc Text:
Cause: When attempting to connect to etcd to acquire a leader lease, the master controllers process only tries to reach a single etcd cluster member, even if multiple are specified.
Consequence: If the selected etcd cluster member is unavailable, the master controllers process is not able to acquire the leader lease, which means it will not start up and run properly.
Fix: Attempt to connect to all of the specified etcd cluster members until a successful connection is made.
Result: The master controllers process can acquire the leader lease and start up properly.
Clone Of:
: 1426733 (view as bug list)
Environment:
Last Closed: 2017-08-10 05:18:47 UTC
Target Upstream Version:
Embargoed:




Links
System ID Private Priority Status Summary Last Updated
Origin (Github) 13082 0 None None None 2017-02-23 18:50:53 UTC
Red Hat Product Errata RHEA-2017:1716 0 normal SHIPPED_LIVE Red Hat OpenShift Container Platform 3.6 RPM Release Advisory 2017-08-10 09:02:50 UTC

Description Johnny Liu 2017-02-23 11:49:59 UTC
Description of problem:
See the details in the steps to reproduce below.

Version-Release number of selected component (if applicable):
openshift v3.5.0.32-1+4f84c83
kubernetes v1.5.2+43a9be4


How reproducible:
Sometimes (about 60%)

Steps to Reproduce:
1. Set up an env with 3 masters + 3 etcd members + 4 nodes (etcd is collocated with the master on the same host).
master config file:
etcdClientInfo:
  ca: master.etcd-ca.crt
  certFile: master.etcd-client.crt
  keyFile: master.etcd-client.key
  urls:
    - https://jialiu1-share-master-etcd-zone1-1:2379
    - https://jialiu1-share-master-etcd-zone2-1:2379
    - https://jialiu1-share-master-etcd-zone2-2:2379

2. Stop the first etcd service (jialiu1-share-master-etcd-zone1-1) in the etcd cluster; the etcd cluster as a whole remains healthy.

# etcdctl --ca-file "${ca_file}" --cert-file "${cert_file}" --key-file "${key_file}" -C ${url} cluster-health
2017-02-23 06:41:16.641822 I | warning: ignoring ServerName for user-provided CA for backwards compatibility is deprecated
2017-02-23 06:41:16.642596 I | warning: ignoring ServerName for user-provided CA for backwards compatibility is deprecated
failed to check the health of member 5164ca6823df9a17 on https://10.240.0.34:2379: Get https://10.240.0.34:2379/health: dial tcp 10.240.0.34:2379: getsockopt: connection refused
member 5164ca6823df9a17 is unreachable: [https://10.240.0.34:2379] are all unreachable
member 6945dc12bf7fe38b is healthy: got healthy result from https://10.240.0.35:2379
member 9581692f618f8136 is healthy: got healthy result from https://10.240.0.36:2379
cluster is healthy

3. Stop all the master controllers services.

4. Start the master controllers service one by one in the master HA cluster.

5. If all the controllers start successfully after step 4, repeat steps 3 and 4 until the following error is seen.

Actual results:
# journalctl -f -u atomic-openshift-master-controllers
-- Logs begin at Thu 2017-02-23 02:36:50 EST. --
Feb 23 06:32:43 jialiu1-share-master-etcd-zone2-2 atomic-openshift-master-controllers[18640]: E0223 11:32:43.460875       1 leaderlease.go:71] unable to check lease openshift.io/leases/controllers: dial tcp 10.240.0.34:2379: getsockopt: connection refused
Feb 23 06:32:44 jialiu1-share-master-etcd-zone2-2 atomic-openshift-master-controllers[18640]: E0223 11:32:44.466057       1 leaderle

In my test env, the controllers service on master 1 and master 3 failed to start.

Expected results:
Once the etcd member it is connected to becomes unavailable, the controllers service should reconnect to the next etcd member.

Additional info:

Comment 2 Andy Goldstein 2017-02-23 17:07:50 UTC
This is happening because when we perform the etcd "set" operation, we set the PrevExist option to PrevNoExist [1], which makes the etcd client treat the request as a one-shot operation [2]: as soon as it hits an error (such as a failed connection to the etcd member that is no longer running), it returns the error instead of retrying against the other members of the cluster [3].

[1] https://github.com/openshift/origin/blob/7d4c2dd14106040b085ef025447da4360874d047/pkg/util/leaderlease/leaderlease.go#L105

[2] https://github.com/openshift/origin/blob/e22997b52ee30bb3ba4575bf0e991b14a5c6c94c/vendor/github.com/coreos/etcd/client/keys.go#L350-L351

[3] https://github.com/openshift/origin/blob/e22997b52ee30bb3ba4575bf0e991b14a5c6c94c/vendor/github.com/coreos/etcd/client/client.go#L369-L370

Comment 3 Andy Goldstein 2017-02-24 17:05:07 UTC
PR is in the merge queue.

Comment 4 Johnny Liu 2017-02-27 10:38:01 UTC
Retested this bug with atomic-openshift-3.5.0.34-1.git.0.9bd77cf.el7.x86_64; it is still reproducible.

Comparing https://github.com/openshift/ose/blob/v3.5.0.34-1/pkg/util/leaderlease/leaderlease.go with the fix PR https://github.com/openshift/origin/pull/13082, it seems the PR has not been merged into OCP.

Comment 5 Andy Goldstein 2017-02-27 11:40:16 UTC
Sorry, I was thinking this was the bug for origin master. This bug will be fixed for 3.6. The other bug will be fixed for 3.5 (bug 1426733).

Comment 6 Troy Dawson 2017-04-11 21:01:31 UTC
This has been merged into OCP and is in OCP v3.6.27 or newer.

Comment 8 Johnny Liu 2017-04-13 10:38:57 UTC
Verified this bug with atomic-openshift-3.6.27-1.git.0.86f238a.el7.x86_64, and it PASSED.

After stopping one etcd service in the cluster and starting the controllers service one by one, the failover action can be seen in the controller logs:

Apr 13 06:33:42 qe-jialiu-master-etcd-3 atomic-openshift-master-controllers[51467]: I0413 06:33:42.830731   51467 master.go:399] Started health checks at 0.0.0.0:8444
Apr 13 06:33:42 qe-jialiu-master-etcd-3 systemd[1]: Started Atomic OpenShift Master Controllers.
Apr 13 06:33:42 qe-jialiu-master-etcd-3 atomic-openshift-master-controllers[51467]: I0413 06:33:42.831414   51467 master_config.go:585] Attempting to acquire controller lease as master-1whfs8dl, renewing every 10 seconds
Apr 13 06:33:42 qe-jialiu-master-etcd-3 atomic-openshift-master-controllers[51467]: E0413 06:33:42.832113   51467 leaderlease.go:95] unable to check lease openshift.io/leases/controllers: dial tcp 10.240.0.32:2379: getsockopt: connection refused
Apr 13 06:33:43 qe-jialiu-master-etcd-3 atomic-openshift-master-controllers[51467]: I0413 06:33:43.848159   51467 leaderlease.go:154] Lease openshift.io/leases/controllers owned by master-21j2tzf7 at 14557 ttl 6 seconds, waiting for expiration
Apr 13 06:33:43 qe-jialiu-master-etcd-3 atomic-openshift-master-controllers[51467]: I0413 06:33:43.848198   51467 leaderlease.go:290] watching for expiration of lease openshift.io/leases/controllers from 14558
Apr 13 06:33:46 qe-jialiu-master-etcd-3 atomic-openshift-master-controllers[51467]: I0413 06:33:46.534238   51467 leaderlease.go:290] watching for expiration of lease openshift.io/leases/controllers from 14573

Comment 10 errata-xmlrpc 2017-08-10 05:18:47 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHEA-2017:1716

