Bug 1653348
| Summary: | Director deployed OCP 3.11: scaling out a master node fails with "The field 'vars' has an invalid value, which includes an undefined variable. The error was: 'openshift_master_etcd_urls' is undefined" | | |
|---|---|---|---|
| Product: | Red Hat OpenStack | Reporter: | Marius Cornea <mcornea> |
| Component: | openstack-tripleo-heat-templates | Assignee: | Martin André <m.andre> |
| Status: | CLOSED ERRATA | QA Contact: | Gurenko Alex <agurenko> |
| Severity: | urgent | Docs Contact: | |
| Priority: | urgent | | |
| Version: | 14.0 (Rocky) | CC: | dbecker, lmarsh, ltomasbo, m.andre, mburns, morazi |
| Target Milestone: | z2 | Keywords: | Triaged, ZStream |
| Target Release: | 14.0 (Rocky) | | |
| Hardware: | Unspecified | | |
| OS: | Unspecified | | |
| Whiteboard: | | | |
| Fixed In Version: | openstack-tripleo-heat-templates-9.2.1-0.20190119154866.el7ost | Doc Type: | Known Issue |
| Doc Text: | Scaling out a director-deployed OpenShift environment with an additional OpenShift master node fails with a message similar to: "The field 'vars' has an invalid value, which includes an undefined variable. The error was: 'openshift_master_etcd_urls' is undefined…" | | |
| Story Points: | --- | | |
| Clone Of: | | Environment: | |
| Last Closed: | 2019-04-30 17:51:14 UTC | Type: | Bug |
| Regression: | --- | Mount Type: | --- |
| Documentation: | --- | CRM: | |
| Verified Versions: | | Category: | --- |
| oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: | |
| Cloudforms Team: | --- | Target Upstream Version: | |
| Embargoed: | | | |
| Bug Depends On: | | | |
| Bug Blocks: | 1670513 | | |
Description (Marius Cornea, 2018-11-26 15:37:50 UTC)
With the latest changes the issue occurs when scaling out master nodes (with both CNS and local storage):

```
TASK [openshift_service_catalog : template] ************************************
fatal: [openshift-master-2]: FAILED! => {"msg": "The field 'vars' has an invalid value, which includes an undefined variable. The error was: 'openshift_master_etcd_urls' is undefined\n\nThe error appears to have been in '/usr/share/ansible/openshift-ansible/roles/openshift_service_catalog/tasks/install.yml': line 102, column 3, but may\nbe elsewhere in the file depending on the exact syntax problem.\n\nThe offending line appears to be:\n\n## api server\n- template:\n  ^ here\n"}

PLAY RECAP *********************************************************************
localhost          : ok=36   changed=0    unreachable=0  failed=0
openshift-infra-0  : ok=41   changed=3    unreachable=0  failed=0
openshift-infra-1  : ok=41   changed=3    unreachable=0  failed=0
openshift-infra-2  : ok=41   changed=3    unreachable=0  failed=0
openshift-master-1 : ok=72   changed=8    unreachable=0  failed=0
openshift-master-2 : ok=690  changed=155  unreachable=0  failed=1
openshift-master-3 : ok=569  changed=184  unreachable=0  failed=0
openshift-worker-0 : ok=41   changed=3    unreachable=0  failed=0
openshift-worker-1 : ok=41   changed=3    unreachable=0  failed=0
openshift-worker-2 : ok=41   changed=3    unreachable=0  failed=0

INSTALLER STATUS ***************************************************************
Load Balancer Install       : Complete (0:00:00)
Initialization              : Complete (0:03:00)
Health Check                : Complete (0:00:18)
Node Bootstrap Preparation  : Complete (0:01:31)
etcd Install                : Complete (0:01:04)
Master Install              : Complete (0:04:20)
Master Additional Install   : Complete (0:00:57)
Node Join                   : Complete (0:00:15)
GlusterFS Install           : Complete (0:02:15)
Hosted Install              : Complete (0:01:37)
Cluster Monitoring Operator : Complete (0:00:16)
Web Console Install         : Complete (0:00:46)
Console Install             : Complete (0:00:16)
metrics-server Install      : Complete (0:00:01)
Service Catalog Install     : In Progress (0:00:38)

This phase can be restarted by running: playbooks/openshift-service-catalog/config.yml

Failure summary:

  1. Hosts:   openshift-master-2
     Play:    Service Catalog
     Task:    openshift_service_catalog : template
     Message: The field 'vars' has an invalid value, which includes an
              undefined variable. The error was: 'openshift_master_etcd_urls'
              is undefined

              The error appears to have been in
              '/usr/share/ansible/openshift-ansible/roles/openshift_service_catalog/tasks/install.yml':
              line 102, column 3, but may be elsewhere in the file depending
              on the exact syntax problem.

              The offending line appears to be:

              ## api server
              - template:
                ^ here
```

Martin André (comment #7):

Found the issue is caused by TripleO re-running the deploy playbook (to apply potential changes to the cluster configuration) a bit too early. The fix delays the execution of the deploy playbook and uses an inventory that represents the final state of the cluster.

Posted the following patches upstream to fix the issue:

https://review.openstack.org/632638/
https://review.openstack.org/632639/
https://review.openstack.org/632640/

Marius Cornea (comment #9):

(In reply to Martin André from comment #7)
> Found the issue is caused by TripleO re-running the deploy playbook (to
> apply potential changes to the cluster configuration) a bit too early. The
> fix delays the execution of the deploy playbook and uses an inventory that
> represents the final state of the cluster.
>
> Posted the following patches upstream to fix the issue:
>
> https://review.openstack.org/632638/
> https://review.openstack.org/632639/
> https://review.openstack.org/632640/

I applied the patches above on my env and now the initial failure came back. Do you want to keep track of this issue in a separate bug report?

```
TASK [etcd : Add new etcd members to cluster] **********************************
FAILED - RETRYING: Add new etcd members to cluster (3 retries left).
FAILED - RETRYING: Add new etcd members to cluster (2 retries left).
FAILED - RETRYING: Add new etcd members to cluster (1 retries left).
fatal: [openshift-master-3 -> 192.168.24.12]: FAILED! => {"attempts": 3,
"changed": true, "cmd": ["/usr/local/bin/master-exec", "etcd", "etcd",
"etcdctl", "--cert-file", "/etc/etcd/peer.crt", "--key-file",
"/etc/etcd/peer.key", "--ca-file", "/etc/etcd/ca.crt", "--endpoints",
"https://openshift-master-2:2379", "member", "add", "openshift-master-3",
"https://172.17.1.27:2380"], "delta": "0:00:01.833627", "end": "2019-01-23
16:01:12.168434", "msg": "non-zero return code", "rc": 1, "start": "2019-01-23
16:01:10.334807", "stderr": "client: etcd cluster is unavailable or
misconfigured; error #0: client: etcd member https://172.17.1.19:2379 has no
leader\n; error #1: dial tcp 172.17.1.14:2379: getsockopt: no route to
host\n; error #2: client: etcd member https://172.17.1.10:2379 has no
leader", "stderr_lines": ["client: etcd cluster is unavailable or
misconfigured; error #0: client: etcd member https://172.17.1.19:2379 has no
leader", "; error #1: dial tcp 172.17.1.14:2379: getsockopt: no route to
host", "; error #2: client: etcd member https://172.17.1.10:2379 has no
leader"], "stdout": "", "stdout_lines": []}
```

(In reply to Marius Cornea from comment #9)
> I applied the patches above on my env and now the initial failure came back.
> Do you want to keep track of this issue in a separate bug report?

FWIW the error above is caused by the node removed during scale down still being an etcd member. It can be worked around by removing the node from etcd manually after the scale down:

```
/usr/local/bin/master-exec etcd etcd etcdctl \
    --cert-file /etc/etcd/peer.crt \
    --key-file /etc/etcd/peer.key \
    --ca-file /etc/etcd/ca.crt \
    --endpoints https://openshift-master-2:2379 \
    member remove $node_id
```

Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2019:0878
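The `$node_id` that the manual workaround passes to `member remove` comes from `etcdctl member list`. As a sketch of how it could be obtained (the `get_member_id` helper and the sample output format below are illustrative, assuming the etcd v2 `member list` output style, not part of this bug report):

```shell
# Listing members goes through the same containerized etcdctl wrapper used in
# the workaround above, e.g. (not run here, shown for context):
#   /usr/local/bin/master-exec etcd etcd etcdctl \
#       --cert-file /etc/etcd/peer.crt --key-file /etc/etcd/peer.key \
#       --ca-file /etc/etcd/ca.crt --endpoints https://openshift-master-2:2379 \
#       member list

# Hypothetical helper: extract the member ID for a given peer IP from
# etcd v2 "member list" output, where each line looks like
#   <id>: name=<host> peerURLs=https://<ip>:2380 clientURLs=https://<ip>:2379 isLeader=false
get_member_id() {
    members_output=$1
    peer_ip=$2
    # The member ID is the field before the first colon on the matching line.
    printf '%s\n' "$members_output" | grep "peerURLs=https://$peer_ip:" | cut -d: -f1
}
```

The extracted ID is then what `member remove` expects, e.g. `member remove "$(get_member_id "$members_output" 172.17.1.14)"` for the unreachable member seen in the etcd error.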