Bug 1653348
| Summary: | Director deployed OCP 3.11: scaling out a master node fails with "The field 'vars' has an invalid value, which includes an undefined variable. The error was: 'openshift_master_etcd_urls' is undefined" | | |
|---|---|---|---|
| Product: | Red Hat OpenStack | Reporter: | Marius Cornea <mcornea> |
| Component: | openstack-tripleo-heat-templates | Assignee: | Martin André <m.andre> |
| Status: | CLOSED ERRATA | QA Contact: | Gurenko Alex <agurenko> |
| Severity: | urgent | Docs Contact: | |
| Priority: | urgent | | |
| Version: | 14.0 (Rocky) | CC: | dbecker, lmarsh, ltomasbo, m.andre, mburns, morazi |
| Target Milestone: | z2 | Keywords: | Triaged, ZStream |
| Target Release: | 14.0 (Rocky) | | |
| Hardware: | Unspecified | | |
| OS: | Unspecified | | |
| Whiteboard: | | | |
| Fixed In Version: | openstack-tripleo-heat-templates-9.2.1-0.20190119154866.el7ost | Doc Type: | Known Issue |
| Doc Text: | Scaling out a director-deployed OpenShift environment with an additional OpenShift master node fails with a message similar to: "The field 'vars' has an invalid value, which includes an undefined variable. The error was: 'openshift_master_etcd_urls' is undefined…" | | |
| Story Points: | --- | | |
| Clone Of: | | Environment: | |
| Last Closed: | 2019-04-30 17:51:14 UTC | Type: | Bug |
| Regression: | --- | Mount Type: | --- |
| Documentation: | --- | CRM: | |
| Verified Versions: | | Category: | --- |
| oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: | |
| Cloudforms Team: | --- | Target Upstream Version: | |
| Embargoed: | | | |
| Bug Depends On: | | | |
| Bug Blocks: | 1670513 | | |
Description (Marius Cornea, 2018-11-26 15:37:50 UTC)
With the latest changes the issue occurs when scaling out master nodes (with both CNS and local storage):

```
TASK [openshift_service_catalog : template] ************************************
fatal: [openshift-master-2]: FAILED! => {"msg": "The field 'vars' has an invalid value, which includes an undefined variable. The error was: 'openshift_master_etcd_urls' is undefined\n\nThe error appears to have been in '/usr/share/ansible/openshift-ansible/roles/openshift_service_catalog/tasks/install.yml': line 102, column 3, but may\nbe elsewhere in the file depending on the exact syntax problem.\n\nThe offending line appears to be:\n\n## api server\n- template:\n  ^ here\n"}

PLAY RECAP *********************************************************************
localhost          : ok=36   changed=0    unreachable=0  failed=0
openshift-infra-0  : ok=41   changed=3    unreachable=0  failed=0
openshift-infra-1  : ok=41   changed=3    unreachable=0  failed=0
openshift-infra-2  : ok=41   changed=3    unreachable=0  failed=0
openshift-master-1 : ok=72   changed=8    unreachable=0  failed=0
openshift-master-2 : ok=690  changed=155  unreachable=0  failed=1
openshift-master-3 : ok=569  changed=184  unreachable=0  failed=0
openshift-worker-0 : ok=41   changed=3    unreachable=0  failed=0
openshift-worker-1 : ok=41   changed=3    unreachable=0  failed=0
openshift-worker-2 : ok=41   changed=3    unreachable=0  failed=0

INSTALLER STATUS ***************************************************************
Load Balancer Install       : Complete (0:00:00)
Initialization              : Complete (0:03:00)
Health Check                : Complete (0:00:18)
Node Bootstrap Preparation  : Complete (0:01:31)
etcd Install                : Complete (0:01:04)
Master Install              : Complete (0:04:20)
Master Additional Install   : Complete (0:00:57)
Node Join                   : Complete (0:00:15)
GlusterFS Install           : Complete (0:02:15)
Hosted Install              : Complete (0:01:37)
Cluster Monitoring Operator : Complete (0:00:16)
Web Console Install         : Complete (0:00:46)
Console Install             : Complete (0:00:16)
metrics-server Install      : Complete (0:00:01)
Service Catalog Install     : In Progress (0:00:38)

This phase can be restarted by running: playbooks/openshift-service-catalog/config.yml

Failure summary:

  1. Hosts:   openshift-master-2
     Play:    Service Catalog
     Task:    openshift_service_catalog : template
     Message: The field 'vars' has an invalid value, which includes an
              undefined variable. The error was: 'openshift_master_etcd_urls'
              is undefined

              The error appears to have been in
              '/usr/share/ansible/openshift-ansible/roles/openshift_service_catalog/tasks/install.yml':
              line 102, column 3, but may be elsewhere in the file depending
              on the exact syntax problem.

              The offending line appears to be:

              ## api server
              - template:
                ^ here
```

Martin André (comment #7):

Found the issue is caused by TripleO re-running the deploy playbook (to apply potential changes to the cluster configuration) a bit too early. The fix delays the execution of the deploy playbook and uses an inventory that represents the final state of the cluster.

Posted the following patches upstream to fix the issue:

https://review.openstack.org/632638/
https://review.openstack.org/632639/
https://review.openstack.org/632640/

Marius Cornea (comment #9):

(In reply to Martin André from comment #7)
> Found the issue is caused by TripleO re-running the deploy playbook (to
> apply potential changes to the cluster configuration) a bit too early. The
> fix delays the execution of the deploy playbook and uses an inventory that
> represents the final state of the cluster.
>
> Posted the following patches upstream to fix the issue:
>
> https://review.openstack.org/632638/
> https://review.openstack.org/632639/
> https://review.openstack.org/632640/

I applied the patches above on my env and now the initial failure came back. Do you want to keep track of this issue in a separate bug report?

```
TASK [etcd : Add new etcd members to cluster] **********************************
FAILED - RETRYING: Add new etcd members to cluster (3 retries left).
FAILED - RETRYING: Add new etcd members to cluster (2 retries left).
FAILED - RETRYING: Add new etcd members to cluster (1 retries left).
fatal: [openshift-master-3 -> 192.168.24.12]: FAILED! => {"attempts": 3,
"changed": true, "cmd": ["/usr/local/bin/master-exec", "etcd", "etcd",
"etcdctl", "--cert-file", "/etc/etcd/peer.crt", "--key-file",
"/etc/etcd/peer.key", "--ca-file", "/etc/etcd/ca.crt", "--endpoints",
"https://openshift-master-2:2379", "member", "add", "openshift-master-3",
"https://172.17.1.27:2380"], "delta": "0:00:01.833627", "end": "2019-01-23
16:01:12.168434", "msg": "non-zero return code", "rc": 1, "start": "2019-01-23
16:01:10.334807", "stderr": "client: etcd cluster is unavailable or
misconfigured; error #0: client: etcd member https://172.17.1.19:2379 has no
leader\n; error #1: dial tcp 172.17.1.14:2379: getsockopt: no route to
host\n; error #2: client: etcd member https://172.17.1.10:2379 has no
leader", "stderr_lines": ["client: etcd cluster is unavailable or
misconfigured; error #0: client: etcd member https://172.17.1.19:2379 has no
leader", "; error #1: dial tcp 172.17.1.14:2379: getsockopt: no route to
host", "; error #2: client: etcd member https://172.17.1.10:2379 has no
leader"], "stdout": "", "stdout_lines": []}
```

(In reply to Marius Cornea from comment #9)
> I applied the patches above on my env and now the initial failure came back.
> Do you want to keep track of this issue in a separate bug report?

FWIW the error above is caused by the node removed during scale down still being an etcd member. It can be worked around by removing the node from etcd manually after the scale down:

```
/usr/local/bin/master-exec etcd etcd etcdctl \
    --cert-file /etc/etcd/peer.crt \
    --key-file /etc/etcd/peer.key \
    --ca-file /etc/etcd/ca.crt \
    --endpoints https://openshift-master-2:2379 \
    member remove $node_id
```

Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2019:0878
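The `$node_id` that the manual workaround passes to `member remove` comes from `etcdctl member list`. As a sketch of how it could be obtained (the `get_member_id` helper and the sample output format below are illustrative, assuming the etcd v2 `member list` output style, not part of this bug report):

```shell
# Listing members goes through the same containerized etcdctl wrapper used in
# the workaround above, e.g. (not run here, shown for context):
#   /usr/local/bin/master-exec etcd etcd etcdctl \
#       --cert-file /etc/etcd/peer.crt --key-file /etc/etcd/peer.key \
#       --ca-file /etc/etcd/ca.crt --endpoints https://openshift-master-2:2379 \
#       member list

# Hypothetical helper: extract the member ID for a given peer IP from
# etcd v2 "member list" output, where each line looks like
#   <id>: name=<host> peerURLs=https://<ip>:2380 clientURLs=https://<ip>:2379 isLeader=false
get_member_id() {
    members_output=$1
    peer_ip=$2
    # The member ID is the field before the first colon on the matching line.
    printf '%s\n' "$members_output" | grep "peerURLs=https://$peer_ip:" | cut -d: -f1
}
```

The extracted ID is then what `member remove` expects, e.g. `member remove "$(get_member_id "$members_output" 172.17.1.14)"` for the unreachable member seen in the etcd error.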