Bug 1459505 - atomic-openshift-master-controllers reports etcd cluster is unavailable or misconfigured; error #0: Forbidden
Status: CLOSED DUPLICATE of bug 1458660
Product: OpenShift Container Platform
Classification: Red Hat
Component: Kubernetes
Version: 3.5.1
Hardware: Unspecified
OS: Unspecified
Priority: urgent
Severity: urgent
Target Milestone: ---
Target Release: ---
Assigned To: Seth Jennings
QA Contact: DeShuai Ma
Docs Contact:
Depends On:
Blocks:
 
Reported: 2017-06-07 06:15 EDT by Jaspreet Kaur
Modified: 2017-06-29 06:11 EDT (History)
4 users (show)

See Also:
Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Story Points: ---
Clone Of:
Environment:
Last Closed: 2017-06-19 13:56:39 EDT
Type: Bug
Regression: ---
Mount Type: ---
Documentation: ---
CRM:
Verified Versions:
Category: ---
oVirt Team: ---
RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: ---


Attachments: None
Description Jaspreet Kaur 2017-06-07 06:15:28 EDT
Description of problem: After upgrading to the latest version, the master controller logs are filled with the errors below:

Jun 05 07:13:53 master.example.com atomic-openshift-master-controllers[38205]: E0605 05:35:39.674320   38205 leaderlease.go:87] client: etcd cluster is unavailable or misconfigured; error #0: F
Jun 05 07:13:53 master.example.com atomic-openshift-master-controllers[27312]: E0605 07:13:53.218280   27312 leaderlease.go:87] client: etcd cluster is unavailable or misconfigured; error #0: Forbidden
Jun 05 07:13:53 master.example.com atomic-openshift-master-controllers[27312]: ; error #1: Forbidden
Jun 05 07:13:53 master.example.com atomic-openshift-master-controllers[27312]: ; error #2: Forbidden

etcd cluster health and member list show the cluster up and running.


Version-Release number of selected component (if applicable):


How reproducible:


Steps to Reproduce:
1.
2.
3.

Actual results: The master-controllers logs show errors claiming etcd is unavailable or misconfigured, and requests for the controller lease are rejected by etcd as Forbidden.

Expected results: There should not be any issues contacting etcd after upgrading
when the etcd cluster is already in a healthy state.


Additional info:
Comment 4 Mark Chappell 2017-06-07 12:48:40 EDT
The environment was working, the change made was an upgrade from 3.4 to 3.5


After some more poking on our side we think we've found what's going on.

Jaspreet earlier made comments about having seen this when proxies started getting in the way.  We looked at the configs but the hostnames for our etcd servers were in the no_proxy configs so we expected everything to behave.

BUT...

I ran an strace on the master-controllers process and noticed that it was connecting to the proxy servers rather than to the etcd servers.  On a hunch I tried adding the etcd IP addresses to the no_proxy lists and this seems to have cleared the error.  So it would appear that for some reason it was connecting to the etcd servers by IP rather than hostname, thus ignoring the no_proxy setting.  Additionally no_proxy doesn't handle CIDRs so having 10.X.Y.Z/24 in there didn't help.

As an educated guess the list of etcd cluster-members is now being pulled from etcd after making the initial connection, and then it's connecting by IP which is how etcd seems to store cluster members internally.

Not sure what I'd consider the correct fix here, but the change in behaviour from hostname -> IP address will break previously running clusters.
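The hostname-vs-IP distinction matters because classic no_proxy matching is a plain string/suffix comparison against each listed entry. The following is a minimal, hypothetical sketch of that style of matching (not the actual OpenShift or etcd code); it shows why a hostname entry, or a CIDR entry like 10.X.Y.Z/24, never exempts a connection made to a raw IP:

```go
package main

import (
	"fmt"
	"strings"
)

// useProxy mimics simple NO_PROXY handling: each comma-separated entry
// is compared against the target host by exact match or domain suffix.
// There is no CIDR parsing, so an entry like "10.42.10.0/24" is just an
// opaque string that never matches the host "10.42.10.204".
func useProxy(host, noProxy string) bool {
	for _, entry := range strings.Split(noProxy, ",") {
		entry = strings.TrimSpace(entry)
		if entry == "" {
			continue
		}
		if host == entry ||
			strings.HasSuffix(host, "."+strings.TrimPrefix(entry, ".")) {
			return false // bypass the proxy for this host
		}
	}
	return true // no entry matched; go through the proxy
}

func main() {
	noProxy := "etcd1.example.com,10.42.10.0/24"
	fmt.Println(useProxy("etcd1.example.com", noProxy)) // false: hostname listed
	fmt.Println(useProxy("10.42.10.204", noProxy))      // true: IP not listed, CIDR ignored
}
```

So once the client started dialing etcd by IP, the hostname-based no_proxy entries simply stopped applying.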
Comment 5 Seth Jennings 2017-06-07 13:04:48 EDT
Mark,

In the master-config.yaml, is the etcd url specified with hostname or IP? Is the config for 3.4 specified with hostname and 3.5 with IP?

Example:

etcdClientInfo:
  ca: ca-bundle.crt
  certFile: master.etcd-client.crt
  keyFile: master.etcd-client.key
  urls:
  - https://10.42.10.204:4001 <-- here

Also, did you use openshift-ansible (the atomic-openshift-installer) to do the upgrade?  If so, there could have been a change in there that changed the etcd url from hostname to IP during the 3.5 upgrade.
Comment 6 Mark Chappell 2017-06-08 03:01:23 EDT
Seth,

The master configs, both before and after the upgrade, list the hostnames and not the IP addresses

We did use the openshift-ansible playbooks to perform the upgrades, specifically "playbooks/byo/openshift-cluster/upgrades/v3_5/upgrade_control_plane.yml"
Comment 10 Seth Jennings 2017-06-14 22:06:22 EDT
I will attempt to find what caused this change in behavior between 3.4 and 3.5.  It seems there is a workaround however (adding the etcd IPs to NO_PROXY).
Comment 11 Seth Jennings 2017-06-15 19:50:51 EDT
OK, I think I know what happened here.  The etcd server changed how it stores peer URL endpoints underneath us.

https://github.com/openshift/origin/blob/master/vendor/github.com/coreos/etcd/client/client.go#L416-L468

In the etcd (v2) client, the endpoints are overwritten with the peer URL list from the server on the first Sync().  etcd, at some point, moved from storing these URLs as hostnames to IP addresses.  That change cascades down to the client, overwriting the user-provided list of endpoints.
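In other words, the endpoints the admin configured survive only until the first sync. A stripped-down sketch of that behavior (hypothetical, not the real coreos/etcd client linked above):

```go
package main

import "fmt"

// clusterClient is a toy stand-in for the etcd v2 client: it holds the
// endpoint list used for all subsequent requests.
type clusterClient struct {
	endpoints []string
}

// sync models what the real client's Sync() does: it discards whatever
// endpoints the caller configured and replaces them with the member
// URLs the server advertises.
func (c *clusterClient) sync(advertised []string) {
	c.endpoints = advertised
}

func main() {
	// The admin configured a hostname, which is covered by no_proxy...
	c := &clusterClient{endpoints: []string{"https://etcd1.example.com:2379"}}
	// ...but after the upgrade the server advertises a raw IP
	// (ETCD_ADVERTISE_CLIENT_URLS), so the hostname endpoint is gone.
	c.sync([]string{"https://10.42.10.204:2379"})
	fmt.Println(c.endpoints)
}
```

This is why the master-controllers process was observed (via strace) talking to the proxy: every request after the first sync targeted IPs that no_proxy did not cover.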
Comment 12 Seth Jennings 2017-06-19 13:56:39 EDT
Upstream issue for openshift-ansible:
https://github.com/openshift/openshift-ansible/issues/4490

ETCD_ADVERTISE_CLIENT_URLS was switched to using IP instead of hostname
https://github.com/openshift/openshift-ansible/pull/1754

etcd prefers using IPs, so I doubt this change will be rolled back.  There is work upstream to add the etcd IPs to NO_PROXY as part of the installer.  The workaround is to either 1) change ETCD_ADVERTISE_CLIENT_URLS on the etcd members back to hostnames or 2) add the etcd IPs to NO_PROXY.
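Concretely, the two workarounds look something like the fragment below. The file paths and port are the usual ones on RHEL-based installs and may differ in your environment; the hostnames and IPs are placeholders:

```shell
# Workaround 1: advertise hostnames again.
# On each etcd member (typically /etc/etcd/etcd.conf), then restart etcd:
ETCD_ADVERTISE_CLIENT_URLS=https://etcd1.example.com:2379

# Workaround 2: add the etcd member IPs to NO_PROXY.
# On each master (typically /etc/sysconfig/atomic-openshift-master-controllers),
# then restart the controllers:
NO_PROXY=etcd1.example.com,etcd2.example.com,10.42.10.204,10.42.10.205
```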

I'm duping this to the documentation bug 1458660 for this issue which is also tracking the installer change.

*** This bug has been marked as a duplicate of bug 1458660 ***
