Description of problem:
See the following detail.

Version-Release number of selected component (if applicable):
openshift v3.5.0.32-1+4f84c83
kubernetes v1.5.2+43a9be4

How reproducible:
Sometimes (about 60%)

Steps to Reproduce:
1. Set up an env with 3 masters + 3 etcd + 4 nodes (etcd collocated with the masters on the same hosts).

master config file:
etcdClientInfo:
  ca: master.etcd-ca.crt
  certFile: master.etcd-client.crt
  keyFile: master.etcd-client.key
  urls:
    - https://jialiu1-share-master-etcd-zone1-1:2379
    - https://jialiu1-share-master-etcd-zone2-1:2379
    - https://jialiu1-share-master-etcd-zone2-2:2379

2. Stop the first etcd service (jialiu1-share-master-etcd-zone1-1) in the etcd cluster; the etcd cluster as a whole remains healthy.

# etcdctl --ca-file "${ca_file}" --cert-file "${cert_file}" --key-file "${key_file}" -C ${url} cluster-health
2017-02-23 06:41:16.641822 I | warning: ignoring ServerName for user-provided CA for backwards compatibility is deprecated
2017-02-23 06:41:16.642596 I | warning: ignoring ServerName for user-provided CA for backwards compatibility is deprecated
failed to check the health of member 5164ca6823df9a17 on https://10.240.0.34:2379: Get https://10.240.0.34:2379/health: dial tcp 10.240.0.34:2379: getsockopt: connection refused
member 5164ca6823df9a17 is unreachable: [https://10.240.0.34:2379] are all unreachable
member 6945dc12bf7fe38b is healthy: got healthy result from https://10.240.0.35:2379
member 9581692f618f8136 is healthy: got healthy result from https://10.240.0.36:2379
cluster is healthy

3. Stop all the master controller services.
4. Start the master controller services one by one in the master HA cluster.
5. If all the controllers start successfully after step 4, repeat steps 3 and 4 until the following error is seen.

Actual results:
# journalctl -f -u atomic-openshift-master-controllers
-- Logs begin at Thu 2017-02-23 02:36:50 EST. --
Feb 23 06:32:43 jialiu1-share-master-etcd-zone2-2 atomic-openshift-master-controllers[18640]: E0223 11:32:43.460875 1 leaderlease.go:71] unable to check lease openshift.io/leases/controllers: dial tcp 10.240.0.34:2379: getsockopt: connection refused
Feb 23 06:32:44 jialiu1-share-master-etcd-zone2-2 atomic-openshift-master-controllers[18640]: E0223 11:32:44.466057 1 leaderle

In my test env, the controller services on master 1 and master 3 failed to start.

Expected results:
Once the currently connected etcd member becomes unavailable, the controller service should reconnect to the next etcd member (see the sketch after this report).

Additional info:
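For reference, the failover behavior described under Expected results can be sketched with the vendored etcd v2 client (github.com/coreos/etcd/client). This is a minimal sketch and not OpenShift code: TLS configuration is omitted, and the key is the lease key from the logs above.

package main

import (
	"context"
	"log"
	"time"

	"github.com/coreos/etcd/client"
)

func main() {
	c, err := client.New(client.Config{
		// The same three member URLs the master's etcdClientInfo lists.
		Endpoints: []string{
			"https://jialiu1-share-master-etcd-zone1-1:2379",
			"https://jialiu1-share-master-etcd-zone2-1:2379",
			"https://jialiu1-share-master-etcd-zone2-2:2379",
		},
		HeaderTimeoutPerRequest: time.Second,
	})
	if err != nil {
		log.Fatal(err)
	}
	kapi := client.NewKeysAPI(c)

	// For an ordinary request such as Get, a refused connection to one member
	// is not fatal: the client records the error and retries the request
	// against the next configured endpoint, so stopping a single etcd member
	// should be transparent to the caller.
	if _, err := kapi.Get(context.Background(), "/openshift.io/leases/controllers", nil); err != nil {
		log.Printf("request failed on all members: %v", err)
	}
}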
This is happening because when we perform the etcd "set" operation, we set the PrevExist option to PrevNoExist [1], which makes the etcd client treat the request as a one-shot operation [2]. As a result, as soon as the client hits an error (such as a failed connection to the etcd member that is no longer running), it returns the error instead of retrying against the other members in the cluster [3].

[1] https://github.com/openshift/origin/blob/7d4c2dd14106040b085ef025447da4360874d047/pkg/util/leaderlease/leaderlease.go#L105
[2] https://github.com/openshift/origin/blob/e22997b52ee30bb3ba4575bf0e991b14a5c6c94c/vendor/github.com/coreos/etcd/client/keys.go#L350-L351
[3] https://github.com/openshift/origin/blob/e22997b52ee30bb3ba4575bf0e991b14a5c6c94c/vendor/github.com/coreos/etcd/client/client.go#L369-L370
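To illustrate, here is a minimal sketch of the problematic pattern using the vendored etcd v2 client API. The TTL and value are illustrative stand-ins and TLS setup is omitted; this is not the exact code from leaderlease.go.

package main

import (
	"context"
	"log"
	"time"

	"github.com/coreos/etcd/client"
)

func main() {
	c, err := client.New(client.Config{
		// All three members are configured, so failover is possible in principle.
		Endpoints: []string{
			"https://10.240.0.34:2379",
			"https://10.240.0.35:2379",
			"https://10.240.0.36:2379",
		},
		HeaderTimeoutPerRequest: time.Second,
	})
	if err != nil {
		log.Fatal(err)
	}
	kapi := client.NewKeysAPI(c)

	// PrevExist: PrevNoExist causes keys.go to mark the request's context as
	// one-shot [2]. If the client happens to try the stopped member first
	// (here 10.240.0.34), client.Do returns the connection error immediately
	// instead of continuing with the remaining endpoints [3] -- the
	// "connection refused" failure seen in the controller logs.
	_, err = kapi.Set(context.Background(), "/openshift.io/leases/controllers",
		"master-id", &client.SetOptions{
			TTL:       10 * time.Second,
			PrevExist: client.PrevNoExist,
		})
	if err != nil {
		log.Printf("unable to acquire lease: %v", err)
	}
}

The dependence on which endpoint the client tries first would also explain why the bug only reproduces some of the time (about 60%, per the report).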
PR is in the merge queue.
Retested this bug with atomic-openshift-3.5.0.34-1.git.0.9bd77cf.el7.x86_64; it is still reproducible. Comparing https://github.com/openshift/ose/blob/v3.5.0.34-1/pkg/util/leaderlease/leaderlease.go with the fix PR https://github.com/openshift/origin/pull/13082, it seems the PR has not been merged into OCP.
Sorry, I was thinking this was the bug for origin master. This bug will be fixed for 3.6. The other bug will be fixed for 3.5 (bug 1426733).
This has been merged into OCP and is in OCP v3.6.27 or newer.
Verified this bug with atomic-openshift-3.6.27-1.git.0.86f238a.el7.x86_64, and it PASSES. After stopping one etcd service in the cluster and starting the controller services one by one, the failover can be seen in the controller logs:

Apr 13 06:33:42 qe-jialiu-master-etcd-3 atomic-openshift-master-controllers[51467]: I0413 06:33:42.830731 51467 master.go:399] Started health checks at 0.0.0.0:8444
Apr 13 06:33:42 qe-jialiu-master-etcd-3 systemd[1]: Started Atomic OpenShift Master Controllers.
Apr 13 06:33:42 qe-jialiu-master-etcd-3 atomic-openshift-master-controllers[51467]: I0413 06:33:42.831414 51467 master_config.go:585] Attempting to acquire controller lease as master-1whfs8dl, renewing every 10 seconds
Apr 13 06:33:42 qe-jialiu-master-etcd-3 atomic-openshift-master-controllers[51467]: E0413 06:33:42.832113 51467 leaderlease.go:95] unable to check lease openshift.io/leases/controllers: dial tcp 10.240.0.32:2379: getsockopt: connection refused
Apr 13 06:33:43 qe-jialiu-master-etcd-3 atomic-openshift-master-controllers[51467]: I0413 06:33:43.848159 51467 leaderlease.go:154] Lease openshift.io/leases/controllers owned by master-21j2tzf7 at 14557 ttl 6 seconds, waiting for expiration
Apr 13 06:33:43 qe-jialiu-master-etcd-3 atomic-openshift-master-controllers[51467]: I0413 06:33:43.848198 51467 leaderlease.go:290] watching for expiration of lease openshift.io/leases/controllers from 14558
Apr 13 06:33:46 qe-jialiu-master-etcd-3 atomic-openshift-master-controllers[51467]: I0413 06:33:46.534238 51467 leaderlease.go:290] watching for expiration of lease openshift.io/leases/controllers from 14573
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHEA-2017:1716