Description of problem:

Director deployed OCP 3.11: replacing a master node fails during TASK [etcd : Add new etcd members to cluster]:

TASK [etcd : Add new etcd members to cluster] **********************************
FAILED - RETRYING: Add new etcd members to cluster (3 retries left).
FAILED - RETRYING: Add new etcd members to cluster (2 retries left).
FAILED - RETRYING: Add new etcd members to cluster (1 retries left).
fatal: [openshift-master-3 -> 192.168.24.23]: FAILED! => {"attempts": 3, "changed": true, "cmd": ["/usr/local/bin/master-exec", "etcd", "etcd", "etcdctl", "--cert-file", "/etc/etcd/peer.crt", "--key-file", "/etc/etcd/peer.key", "--ca-file", "/etc/etcd/ca.crt", "--endpoints", "https://openshift-master-2:2379", "member", "add", "openshift-master-3", "https://172.17.1.23:2380"], "delta": "0:00:01.506743", "end": "2018-11-26 00:54:47.504738", "msg": "non-zero return code", "rc": 1, "start": "2018-11-26 00:54:45.997995", "stderr": "client: etcd cluster is unavailable or misconfigured; error #0: dial tcp 172.17.1.25:2379: getsockopt: no route to host\n; error #1: client: etcd member https://172.17.1.14:2379 has no leader\n; error #2: client: etcd member https://172.17.1.12:2379 has no leader", "stderr_lines": ["client: etcd cluster is unavailable or misconfigured; error #0: dial tcp 172.17.1.25:2379: getsockopt: no route to host", "; error #1: client: etcd member https://172.17.1.14:2379 has no leader", "; error #2: client: etcd member https://172.17.1.12:2379 has no leader"], "stdout": "", "stdout_lines": []}

Note: 172.17.1.25 is the removed master

Version-Release number of selected component (if applicable):
openstack-tripleo-heat-templates-9.0.1-0.20181013060891.el7ost.noarch

How reproducible:
100%

Steps to Reproduce:
1. Deploy overcloud with 3 masters
2. Remove one of the masters with openstack overcloud node delete
3. Re-run the overcloud deploy command to re-add the master node back to the deployment

Actual results:
openshift-ansible fails in playbook-etcd.log

Expected results:
No failures.

Additional info:
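A quick way to confirm which etcd member is the stale one: the endpoint that fails with "no route to host" in the task's stderr is the deleted master. A minimal sketch, with the stderr from the log above inlined as a sample (on a live environment you would feed in the actual playbook-etcd.log output instead):

```shell
# Sample stderr from the failed "Add new etcd members to cluster" task,
# copied from the log above and inlined here for illustration
stderr='client: etcd cluster is unavailable or misconfigured; error #0: dial tcp 172.17.1.25:2379: getsockopt: no route to host
; error #1: client: etcd member https://172.17.1.14:2379 has no leader
; error #2: client: etcd member https://172.17.1.12:2379 has no leader'

# The endpoint failing with "no route to host" is the removed master
echo "$stderr" | sed -n 's/.*dial tcp \([0-9.]*\):2379.*no route to host.*/\1/p'
# prints 172.17.1.25
```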
With the latest changes the issue occurs when scaling out master nodes (with both CNS and local storage):

TASK [openshift_service_catalog : template] ************************************
fatal: [openshift-master-2]: FAILED! => {"msg": "The field 'vars' has an invalid value, which includes an undefined variable. The error was: 'openshift_master_etcd_urls' is undefined\n\nThe error appears to have been in '/usr/share/ansible/openshift-ansible/roles/openshift_service_catalog/tasks/install.yml': line 102, column 3, but may\nbe elsewhere in the file depending on the exact syntax problem.\n\nThe offending line appears to be:\n\n## api server\n- template:\n ^ here\n"}

PLAY RECAP *********************************************************************
localhost          : ok=36    changed=0    unreachable=0    failed=0
openshift-infra-0  : ok=41    changed=3    unreachable=0    failed=0
openshift-infra-1  : ok=41    changed=3    unreachable=0    failed=0
openshift-infra-2  : ok=41    changed=3    unreachable=0    failed=0
openshift-master-1 : ok=72    changed=8    unreachable=0    failed=0
openshift-master-2 : ok=690   changed=155  unreachable=0    failed=1
openshift-master-3 : ok=569   changed=184  unreachable=0    failed=0
openshift-worker-0 : ok=41    changed=3    unreachable=0    failed=0
openshift-worker-1 : ok=41    changed=3    unreachable=0    failed=0
openshift-worker-2 : ok=41    changed=3    unreachable=0    failed=0

INSTALLER STATUS ***************************************************************
Load Balancer Install        : Complete (0:00:00)
Initialization               : Complete (0:03:00)
Health Check                 : Complete (0:00:18)
Node Bootstrap Preparation   : Complete (0:01:31)
etcd Install                 : Complete (0:01:04)
Master Install               : Complete (0:04:20)
Master Additional Install    : Complete (0:00:57)
Node Join                    : Complete (0:00:15)
GlusterFS Install            : Complete (0:02:15)
Hosted Install               : Complete (0:01:37)
Cluster Monitoring Operator  : Complete (0:00:16)
Web Console Install          : Complete (0:00:46)
Console Install              : Complete (0:00:16)
metrics-server Install       : Complete (0:00:01)
Service Catalog Install      : In Progress (0:00:38)
        This phase can be restarted by running: playbooks/openshift-service-catalog/config.yml

Failure summary:

  1. Hosts:    openshift-master-2
     Play:     Service Catalog
     Task:     openshift_service_catalog : template
     Message:  The field 'vars' has an invalid value, which includes an undefined variable. The error was: 'openshift_master_etcd_urls' is undefined

               The error appears to have been in '/usr/share/ansible/openshift-ansible/roles/openshift_service_catalog/tasks/install.yml': line 102, column 3, but may
               be elsewhere in the file depending on the exact syntax problem.

               The offending line appears to be:

               ## api server
               - template:
                 ^ here
Found that the issue is caused by TripleO re-running the deploy playbook (to apply potential changes to the cluster configuration) a bit too early. The fix delays the execution of the deploy playbook and uses an inventory that represents the final state of the cluster.

Posted the following patches upstream to fix the issue:

https://review.openstack.org/632638/
https://review.openstack.org/632639/
https://review.openstack.org/632640/
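For illustration, "an inventory that represents the final state of the cluster" means the generated Ansible inventory should already list the replacement master in its target groups before the deploy playbook runs. A hypothetical minimal sketch, using the host names from the logs above and the scale-up group names openshift-ansible 3.11 expects (new_masters/new_etcd); this is not the actual inventory TripleO generates:

```ini
[masters]
openshift-master-1
openshift-master-2
openshift-master-3

[etcd]
openshift-master-1
openshift-master-2
openshift-master-3

[new_masters]
openshift-master-3

[new_etcd]
openshift-master-3
```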
(In reply to Martin André from comment #7)
> Found the issue is caused by TripleO re-running the deploy playbook (to
> apply potential changes to the cluster configuration) a bit too early. The
> fix delays the execution of the deploy playbook and uses an inventory that
> represents the final state of the cluster.
>
> Posted the following patches upstream to fix the issue:
>
> https://review.openstack.org/632638/
> https://review.openstack.org/632639/
> https://review.openstack.org/632640/

I applied the patches above on my env and now the initial failure came back. Do you want to keep track of this issue in a separate bug report?

TASK [etcd : Add new etcd members to cluster] **********************************
FAILED - RETRYING: Add new etcd members to cluster (3 retries left).
FAILED - RETRYING: Add new etcd members to cluster (2 retries left).
FAILED - RETRYING: Add new etcd members to cluster (1 retries left).
fatal: [openshift-master-3 -> 192.168.24.12]: FAILED! => {"attempts": 3, "changed": true, "cmd": ["/usr/local/bin/master-exec", "etcd", "etcd", "etcdctl", "--cert-file", "/etc/etcd/peer.crt", "--key-file", "/etc/etcd/peer.key", "--ca-file", "/etc/etcd/ca.crt", "--endpoints", "https://openshift-master-2:2379", "member", "add", "openshift-master-3", "https://172.17.1.27:2380"], "delta": "0:00:01.833627", "end": "2019-01-23 16:01:12.168434", "msg": "non-zero return code", "rc": 1, "start": "2019-01-23 16:01:10.334807", "stderr": "client: etcd cluster is unavailable or misconfigured; error #0: client: etcd member https://172.17.1.19:2379 has no leader\n; error #1: dial tcp 172.17.1.14:2379: getsockopt: no route to host\n; error #2: client: etcd member https://172.17.1.10:2379 has no leader", "stderr_lines": ["client: etcd cluster is unavailable or misconfigured; error #0: client: etcd member https://172.17.1.19:2379 has no leader", "; error #1: dial tcp 172.17.1.14:2379: getsockopt: no route to host", "; error #2: client: etcd member https://172.17.1.10:2379 has no leader"], "stdout": "", "stdout_lines": []}
(In reply to Marius Cornea from comment #9)
> I applied the patches above on my env and now the initial failure came back.
> Do you want to keep track of this issue in a separate bug report?
>
> TASK [etcd : Add new etcd members to cluster] **********************************
> fatal: [openshift-master-3 -> 192.168.24.12]: FAILED!

FWIW the ^ error is caused by the node removed during scale down still being an etcd member. It can be worked around by removing the node from etcd manually after the scale down:

/usr/local/bin/master-exec etcd etcd etcdctl \
    --cert-file /etc/etcd/peer.crt \
    --key-file /etc/etcd/peer.key \
    --ca-file /etc/etcd/ca.crt \
    --endpoints https://openshift-master-2:2379 \
    member remove $node_id
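To find $node_id, list the members and take the ID of the stale entry. A sketch assuming the etcdctl v2 "member list" output format; the member IDs and the sample output below are made up for illustration, and on the real deployment the list command would be run through /usr/local/bin/master-exec with the same cert/endpoint flags as the remove command above:

```shell
# Hypothetical `etcdctl member list` output (etcd v2 client format)
member_list='4e12ae02cc6f88d: name=openshift-master-1 peerURLs=https://172.17.1.14:2380 clientURLs=https://172.17.1.14:2379 isLeader=true
8a9f2c31b7d4e05: name=openshift-master-2 peerURLs=https://172.17.1.25:2380 clientURLs=https://172.17.1.25:2379 isLeader=false'

# The member ID is the hex string before the first colon; pick the line
# whose peer URL points at the deleted node (172.17.1.25 in this example)
node_id=$(echo "$member_list" | awk -F': ' '/172\.17\.1\.25:2380/ {print $1}')
echo "$node_id"
# prints 8a9f2c31b7d4e05
```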
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA.

For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2019:0878