Bug 1459505

Summary: atomic-openshift-master-controllers reports etcd cluster is unavailable or misconfigured; error #0: Forbidden
Product: OpenShift Container Platform
Reporter: Jaspreet Kaur <jkaur>
Component: Node
Assignee: Seth Jennings <sjenning>
Status: CLOSED DUPLICATE
QA Contact: DeShuai Ma <dma>
Severity: urgent
Docs Contact:
Priority: urgent
Version: 3.5.1
CC: aos-bugs, jokerman, mchappel, mmccomas
Target Milestone: ---
Target Release: ---
Hardware: Unspecified
OS: Unspecified
Whiteboard:
Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Story Points: ---
Clone Of:
Environment:
Last Closed: 2017-06-19 17:56:39 UTC
Type: Bug
Regression: ---
Mount Type: ---
Documentation: ---
CRM:
Verified Versions:
Category: ---
oVirt Team: ---
RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: ---
Target Upstream Version:
Embargoed:

Description Jaspreet Kaur 2017-06-07 10:15:28 UTC
Description of problem: After upgrading to the latest version, the master controller logs are filled with the errors below:



Jun 05 07:13:53 master.example.com atomic-openshift-master-controllers[38205]: E0605 05:35:39.674320   38205 leaderlease.go:87] client: etcd cluster is unavailable or misconfigured; error #0: F
Jun 05 07:13:53 master.example.com atomic-openshift-master-controllers[27312]: E0605 07:13:53.218280   27312 leaderlease.go:87] client: etcd cluster is unavailable or misconfigured; error #0: Forbidden
Jun 05 07:13:53 master.example.com atomic-openshift-master-controllers[27312]: ; error #1: Forbidden
Jun 05 07:13:53 master.example.com atomic-openshift-master-controllers[27312]: ; error #2: Forbidden

The etcd cluster health check and member list both show the cluster as up and running.


Version-Release number of selected component (if applicable):


How reproducible:


Steps to Reproduce:
1.
2.
3.

Actual results: The master-controller logs show errors that etcd is not healthy, or that retrieving the controller lease from etcd is forbidden.

Expected results: There shouldn't be any issues contacting etcd after upgrading when the etcd cluster is already in a healthy state.


Additional info:

Comment 4 Mark Chappell 2017-06-07 16:48:40 UTC
The environment was working; the change made was an upgrade from 3.4 to 3.5.


After some more poking on our side we think we've found what's going on.

Jaspreet earlier made comments about having seen this when proxies started getting in the way.  We looked at the configs but the hostnames for our etcd servers were in the no_proxy configs so we expected everything to behave.

BUT...

I ran an strace on the master-controllers process and noticed that it was connecting to the proxy servers rather than to the etcd servers. On a hunch I tried adding the etcd IP addresses to the no_proxy lists, and this seems to have cleared the error. So it would appear that for some reason it was connecting to the etcd servers by IP rather than by hostname, thus ignoring the no_proxy setting. Additionally, no_proxy doesn't handle CIDRs, so having 10.X.Y.Z/24 in there didn't help.
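
For reference, the workaround amounted to something like the following in the master's proxy configuration (hostnames, IPs, and the exact sysconfig file are placeholders for illustration, not our real values):

# e.g. /etc/sysconfig/atomic-openshift-master-controllers
HTTP_PROXY=http://proxy.example.com:3128
HTTPS_PROXY=http://proxy.example.com:3128
# List the etcd member IPs explicitly; CIDR ranges such as 10.X.Y.0/24 are not honoured
NO_PROXY=.example.com,etcd1.example.com,etcd2.example.com,etcd3.example.com,10.42.10.204,10.42.10.205,10.42.10.206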

As an educated guess, the list of etcd cluster members is now being pulled from etcd after making the initial connection, and the client then connects by IP, which is how etcd seems to store cluster members internally.

Not sure what I'd consider the correct fix here, but the change in behaviour from hostname -> IP address will break previously running clusters.

Comment 5 Seth Jennings 2017-06-07 17:04:48 UTC
Mark,

In the master-config.yaml, is the etcd url specified with hostname or IP? Is the config for 3.4 specified with hostname and 3.5 with IP?

Example:

etcdClientInfo:
  ca: ca-bundle.crt
  certFile: master.etcd-client.crt
  keyFile: master.etcd-client.key
  urls:
  - https://10.42.10.204:4001 <-- here

Also, did you use openshift-ansible (the atomic-openshift-installer) to do the upgrade?  If so, there could have been a change in there that changed the etcd url from hostname to IP during the 3.5 upgrade.

Comment 6 Mark Chappell 2017-06-08 07:01:23 UTC
Seth,

The master configs, both before and after the upgrade, list the hostnames and not the IP addresses

We did use the openshift-ansible playbooks to perform the upgrades, specifically "playbooks/byo/openshift-cluster/upgrades/v3_5/upgrade_control_plane.yml"
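
(For reference, a typical invocation looks roughly like the following; the inventory path is just a placeholder.)

ansible-playbook -i /etc/ansible/hosts \
    playbooks/byo/openshift-cluster/upgrades/v3_5/upgrade_control_plane.yml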

Comment 10 Seth Jennings 2017-06-15 02:06:22 UTC
I will attempt to find what caused this change in behavior between 3.4 and 3.5. It seems there is a workaround, however (adding the etcd IPs to NO_PROXY).

Comment 11 Seth Jennings 2017-06-15 23:50:51 UTC
Ok, I think I know what happened here. The etcd server changed how it stores peer URL endpoints underneath us.

https://github.com/openshift/origin/blob/master/vendor/github.com/coreos/etcd/client/client.go#L416-L468

In the etcd (v2) client, the endpoints are overwritten with the peer URL list from the server on the first Sync(). At some point, etcd moved from storing these URLs as hostnames to IP addresses. This change cascades down to the client, overwriting the user-provided list of endpoints.
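
A minimal sketch of that client behavior, assuming the coreos/etcd v2 client and placeholder endpoints (illustrative only, not the actual origin code path):

package main

import (
	"context"
	"fmt"
	"log"
	"time"

	"github.com/coreos/etcd/client"
)

func main() {
	// The user-provided endpoints, e.g. the hostnames from master-config.yaml.
	cfg := client.Config{
		Endpoints: []string{"https://etcd1.example.com:2379"}, // placeholder hostname
		// DefaultTransport uses http.ProxyFromEnvironment, which is why NO_PROXY matters.
		Transport: client.DefaultTransport,
	}
	c, err := client.New(cfg)
	if err != nil {
		log.Fatal(err)
	}

	fmt.Println("before Sync:", c.Endpoints()) // the hostnames we configured

	// Sync() fetches the member list from the cluster and replaces the configured
	// endpoints with the client URLs the members advertise. If the members advertise
	// IPs, the client now dials IPs, and a hostname-only no_proxy no longer matches.
	ctx, cancel := context.WithTimeout(context.Background(), 5*time.Second)
	defer cancel()
	if err := c.Sync(ctx); err != nil {
		log.Fatal(err)
	}

	fmt.Println("after Sync:", c.Endpoints()) // e.g. https://10.42.10.204:2379
}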

Comment 12 Seth Jennings 2017-06-19 17:56:39 UTC
Upstream issue for openshift-ansible:
https://github.com/openshift/openshift-ansible/issues/4490

ETCD_ADVERTISE_CLIENT_URLS was switched to using IP instead of hostname
https://github.com/openshift/openshift-ansible/pull/1754

etcd prefers using IPs, so I doubt this change will be rolled back. There is work upstream to add the IPs to NO_PROXY as part of the installer. The workaround is to either 1) change ETCD_ADVERTISE_CLIENT_URLS on the etcd members back to hostnames, or 2) add the IPs to NO_PROXY.
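
Roughly, with placeholder hostnames/IPs (your client port may be 4001, as in the master-config example above, rather than 2379):

# Option 1: advertise hostnames again on each etcd member (/etc/etcd/etcd.conf)
ETCD_ADVERTISE_CLIENT_URLS=https://etcd1.example.com:2379

# Option 2: add the etcd member IPs to NO_PROXY on each master,
# e.g. in /etc/sysconfig/atomic-openshift-master-controllers
NO_PROXY=etcd1.example.com,etcd2.example.com,etcd3.example.com,10.42.10.204,10.42.10.205,10.42.10.206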

I'm duping this to the documentation bug 1458660 for this issue which is also tracking the installer change.

*** This bug has been marked as a duplicate of bug 1458660 ***