Red Hat Bugzilla – Bug 1459505
atomic-openshift-master-controllers reports etcd cluster is unavailable or misconfigured; error #0: Forbidden
Last modified: 2017-06-29 06:11:47 EDT
Description of problem: After upgrading to the latest version, the master controller logs are filled with the errors below:
Jun 05 07:13:53 master.example.com atomic-openshift-master-controllers: E0605 05:35:39.674320 38205 leaderlease.go:87] client: etcd cluster is unavailable or misconfigured; error #0: F
Jun 05 07:13:53 master.example.com atomic-openshift-master-controllers: E0605 07:13:53.218280 27312 leaderlease.go:87] client: etcd cluster is unavailable or misconfigured; error #0: Forbidden
Jun 05 07:13:53 master.example.com atomic-openshift-master-controllers: ; error #1: Forbidden
Jun 05 07:13:53 master.example.com atomic-openshift-master-controllers: ; error #2: Forbidden
The etcd cluster health check and member list both report an up-and-running status.
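For reference, the healthy state can be confirmed from an etcd host with commands along these lines (the endpoint and certificate paths are examples for this environment, not exact values from the report):

```shell
# flags and paths are illustrative -- adjust to your installation
FLAGS="--ca-file /etc/etcd/ca.crt --cert-file /etc/etcd/peer.crt \
       --key-file /etc/etcd/peer.key --endpoints https://etcd1.example.com:4001"
etcdctl $FLAGS cluster-health
etcdctl $FLAGS member list
```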
Actual results: The master-controller logs show errors saying etcd is unhealthy, or that fetching the controller lease from etcd is forbidden.
Expected results: There should be no issues contacting etcd after upgrading when etcd is already in a healthy state.
The environment was working; the only change made was an upgrade from 3.4 to 3.5.
After some more poking on our side we think we've found what's going on.
Jaspreet earlier made comments about having seen this when proxies started getting in the way. We looked at the configs but the hostnames for our etcd servers were in the no_proxy configs so we expected everything to behave.
I ran an strace on the master-controllers process and noticed that it was connecting to the proxy servers rather than to the etcd servers. On a hunch I tried adding the etcd IP addresses to the no_proxy lists and this seems to have cleared the error. So it would appear that for some reason it was connecting to the etcd servers by IP rather than hostname, thus ignoring the no_proxy setting. Additionally no_proxy doesn't handle CIDRs so having 10.X.Y.Z/24 in there didn't help.
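The mismatch described above can be illustrated with a small Python sketch of suffix-style no_proxy matching (the matcher below is a simplified stand-in for illustration, not the actual Go implementation): hostname entries never match a connection made by IP, and a CIDR entry is compared as a literal string, so it never matches either.

```python
def bypasses_proxy(host, no_proxy):
    """Suffix-match semantics similar to classic no_proxy handling
    (a sketch; real implementations vary). Entries are compared as
    literal strings, so CIDR notation gets no special treatment."""
    for entry in (e.strip() for e in no_proxy.split(",")):
        if not entry:
            continue
        if host == entry or host.endswith("." + entry.lstrip(".")):
            return True
    return False

no_proxy = "master.example.com,etcd1.example.com,10.42.10.0/24"

# Dialing by hostname bypasses the proxy:
bypasses_proxy("etcd1.example.com", no_proxy)  # True
# Dialing the same member by IP does not -- no hostname entry matches,
# and the CIDR entry is just a literal string, so it doesn't help:
bypasses_proxy("10.42.10.204", no_proxy)  # False
```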
As an educated guess the list of etcd cluster-members is now being pulled from etcd after making the initial connection, and then it's connecting by IP which is how etcd seems to store cluster members internally.
Not sure what I'd consider the correct fix here, but the change in behaviour from hostname -> IP address will break previously running clusters.
In the master-config.yaml, is the etcd url specified with hostname or IP? Is the config for 3.4 specified with hostname and 3.5 with IP?
- https://10.42.10.204:4001 <-- here
Also, did you use openshift-ansible (the atomic-openshift-installer) to do the upgrade? If so, there could have been a change in there that changed the etcd url from hostname to IP during the 3.5 upgrade.
The master configs, both before and after the upgrade, list the hostnames and not the IP addresses.
We did use the openshift-ansible playbooks to perform the upgrades, specifically "playbooks/byo/openshift-cluster/upgrades/v3_5/upgrade_control_plane.yml"
I will attempt to find what caused this change in behavior between 3.4 and 3.5. It seems there is a workaround, however (adding the etcd IPs to NO_PROXY).
Ok I think I know what happened here. The etcd server changed how it stores peer URL endpoints underneath us.
In the etcd (v2) client, the endpoints are overwritten with the peer URL list from the server on the first Sync(). At some point, etcd moved from storing these URLs as hostnames to IP addresses. That change cascades down to the client, overwriting the user-provided list of endpoints.
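A minimal sketch of that cascade, using hypothetical names (EtcdV2Client, sync) as stand-ins for the real v2 client: whatever client URLs the cluster advertises replace the endpoints the user configured, so hostname-based no_proxy entries stop matching.

```python
class EtcdV2Client:
    """Sketch of the v2 client's endpoint sync (hypothetical names).
    The user-provided endpoints are replaced wholesale by whatever
    the cluster advertises in its member list."""
    def __init__(self, endpoints):
        self.endpoints = list(endpoints)

    def sync(self, member_list):
        # member_list is what the server reports as clientURLs; after
        # the change described above these are IP-based URLs.
        advertised = [url for m in member_list for url in m["clientURLs"]]
        self.endpoints = advertised  # user-supplied hostnames are discarded

client = EtcdV2Client(["https://etcd1.example.com:4001"])
client.sync([{"clientURLs": ["https://10.42.10.204:4001"]}])
# client.endpoints is now IP-based; no_proxy hostname entries no longer apply
```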
Upstream issue for openshift-ansible:
ETCD_ADVERTISE_CLIENT_URLS was switched to using IP instead of hostname
etcd prefers using IPs so I doubt this change will be rolled back. There is work upstream to add IPs to the NO_PROXY as part of the installer. The workaround is to either 1) change ETCD_ADVERTISE_CLIENT_URLS on the etcd members to hostnames or 2) add the IPs to NO_PROXY.
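Concretely, the two workarounds look roughly like this; the file paths, hostnames, and IPs below are placeholders for this environment, not exact values from the report:

```shell
# Workaround 1: advertise hostnames from each etcd member
# (in /etc/etcd/etcd.conf; hostname and port are examples)
ETCD_ADVERTISE_CLIENT_URLS=https://etcd1.example.com:4001

# Workaround 2: add the etcd member IPs to NO_PROXY on each master
# (in /etc/sysconfig/atomic-openshift-master-controllers; values are examples)
NO_PROXY=.example.com,10.42.10.204,10.42.10.205,10.42.10.206
```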
I'm duping this to the documentation bug 1458660 for this issue which is also tracking the installer change.
*** This bug has been marked as a duplicate of bug 1458660 ***