Description of problem:

Director deployed OCP 3.11: replacing a master node fails during TASK [etcd : Add new etcd members to cluster]:

TASK [etcd : Add new etcd members to cluster] **********************************
FAILED - RETRYING: Add new etcd members to cluster (3 retries left).
FAILED - RETRYING: Add new etcd members to cluster (2 retries left).
FAILED - RETRYING: Add new etcd members to cluster (1 retries left).
fatal: [openshift-master-3 -> 192.168.24.23]: FAILED! => {"attempts": 3, "changed": true, "cmd": ["/usr/local/bin/master-exec", "etcd", "etcd", "etcdctl", "--cert-file", "/etc/etcd/peer.crt", "--key-file", "/etc/etcd/peer.key", "--ca-file", "/etc/etcd/ca.crt", "--endpoints", "https://openshift-master-2:2379", "member", "add", "openshift-master-3", "https://172.17.1.23:2380"], "delta": "0:00:01.506743", "end": "2018-11-26 00:54:47.504738", "msg": "non-zero return code", "rc": 1, "start": "2018-11-26 00:54:45.997995", "stderr": "client: etcd cluster is unavailable or misconfigured; error #0: dial tcp 172.17.1.25:2379: getsockopt: no route to host\n; error #1: client: etcd member https://172.17.1.14:2379 has no leader\n; error #2: client: etcd member https://172.17.1.12:2379 has no leader", "stderr_lines": ["client: etcd cluster is unavailable or misconfigured; error #0: dial tcp 172.17.1.25:2379: getsockopt: no route to host", "; error #1: client: etcd member https://172.17.1.14:2379 has no leader", "; error #2: client: etcd member https://172.17.1.12:2379 has no leader"], "stdout": "", "stdout_lines": []}

Note: 172.17.1.25 is the removed master

Version-Release number of selected component (if applicable):
openstack-tripleo-heat-templates-9.0.1-0.20181013060891.el7ost.noarch

How reproducible:
100%

Steps to Reproduce:
1. Deploy overcloud with 3 masters
2. Remove one of the masters with openstack overcloud node delete
3. Re-run the overcloud deploy command to re-add the master node back to the deployment

Actual results:
openshift-ansible fails in playbook-etcd.log

Expected results:
No failures.

Additional info:
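A quick way to confirm which etcd member is the stale one: the endpoint that fails with "no route to host" in the task's stderr is the deleted master. A minimal sketch, with the stderr from the log above inlined as a sample (on a live environment you would feed in the actual playbook-etcd.log output instead):

```shell
# Sample stderr from the failed "Add new etcd members to cluster" task,
# copied from the log above and inlined here for illustration
stderr='client: etcd cluster is unavailable or misconfigured; error #0: dial tcp 172.17.1.25:2379: getsockopt: no route to host
; error #1: client: etcd member https://172.17.1.14:2379 has no leader
; error #2: client: etcd member https://172.17.1.12:2379 has no leader'

# The endpoint failing with "no route to host" is the removed master
echo "$stderr" | sed -n 's/.*dial tcp \([0-9.]*\):2379.*no route to host.*/\1/p'
# prints 172.17.1.25
```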
With the latest changes the issue occurs when scaling out master nodes (with both CNS and local storage):

TASK [openshift_service_catalog : template] ************************************
fatal: [openshift-master-2]: FAILED! => {"msg": "The field 'vars' has an invalid value, which includes an undefined variable. The error was: 'openshift_master_etcd_urls' is undefined\n\nThe error appears to have been in '/usr/share/ansible/openshift-ansible/roles/openshift_service_catalog/tasks/install.yml': line 102, column 3, but may\nbe elsewhere in the file depending on the exact syntax problem.\n\nThe offending line appears to be:\n\n## api server\n- template:\n ^ here\n"}

PLAY RECAP *********************************************************************
localhost          : ok=36    changed=0    unreachable=0    failed=0
openshift-infra-0  : ok=41    changed=3    unreachable=0    failed=0
openshift-infra-1  : ok=41    changed=3    unreachable=0    failed=0
openshift-infra-2  : ok=41    changed=3    unreachable=0    failed=0
openshift-master-1 : ok=72    changed=8    unreachable=0    failed=0
openshift-master-2 : ok=690   changed=155  unreachable=0    failed=1
openshift-master-3 : ok=569   changed=184  unreachable=0    failed=0
openshift-worker-0 : ok=41    changed=3    unreachable=0    failed=0
openshift-worker-1 : ok=41    changed=3    unreachable=0    failed=0
openshift-worker-2 : ok=41    changed=3    unreachable=0    failed=0

INSTALLER STATUS ***************************************************************
Load Balancer Install        : Complete (0:00:00)
Initialization               : Complete (0:03:00)
Health Check                 : Complete (0:00:18)
Node Bootstrap Preparation   : Complete (0:01:31)
etcd Install                 : Complete (0:01:04)
Master Install               : Complete (0:04:20)
Master Additional Install    : Complete (0:00:57)
Node Join                    : Complete (0:00:15)
GlusterFS Install            : Complete (0:02:15)
Hosted Install               : Complete (0:01:37)
Cluster Monitoring Operator  : Complete (0:00:16)
Web Console Install          : Complete (0:00:46)
Console Install              : Complete (0:00:16)
metrics-server Install       : Complete (0:00:01)
Service Catalog Install      : In Progress (0:00:38)
        This phase can be restarted by running: playbooks/openshift-service-catalog/config.yml

Failure summary:

  1. Hosts:    openshift-master-2
     Play:     Service Catalog
     Task:     openshift_service_catalog : template
     Message:  The field 'vars' has an invalid value, which includes an undefined variable. The error was: 'openshift_master_etcd_urls' is undefined

               The error appears to have been in '/usr/share/ansible/openshift-ansible/roles/openshift_service_catalog/tasks/install.yml': line 102, column 3, but may
               be elsewhere in the file depending on the exact syntax problem.

               The offending line appears to be:

               ## api server
               - template:
                 ^ here
Found that the issue is caused by TripleO re-running the deploy playbook (to apply potential changes to the cluster configuration) a bit too early. The fix delays the execution of the deploy playbook and uses an inventory that represents the final state of the cluster.

Posted the following patches upstream to fix the issue:

https://review.openstack.org/632638/
https://review.openstack.org/632639/
https://review.openstack.org/632640/
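For illustration, "an inventory that represents the final state of the cluster" means the generated Ansible inventory should already list the replacement master in its target groups before the deploy playbook runs. A hypothetical minimal sketch, using the host names from the logs above and the scale-up group names openshift-ansible 3.11 expects (new_masters/new_etcd); this is not the actual inventory TripleO generates:

```ini
[masters]
openshift-master-1
openshift-master-2
openshift-master-3

[etcd]
openshift-master-1
openshift-master-2
openshift-master-3

[new_masters]
openshift-master-3

[new_etcd]
openshift-master-3
```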
(In reply to Martin André from comment #7)
> Found the issue is caused by TripleO re-running the deploy playbook (to
> apply potential changes to the cluster configuration) a bit too early. The
> fix delays the execution of the deploy playbook and uses an inventory that
> represents the final state of the cluster.
>
> Posted the following patches upstream to fix the issue:
>
> https://review.openstack.org/632638/
> https://review.openstack.org/632639/
> https://review.openstack.org/632640/

I applied the patches above on my env and now the initial failure came back. Do you want to keep track of this issue in a separate bug report?

TASK [etcd : Add new etcd members to cluster] **********************************
FAILED - RETRYING: Add new etcd members to cluster (3 retries left).
FAILED - RETRYING: Add new etcd members to cluster (2 retries left).
FAILED - RETRYING: Add new etcd members to cluster (1 retries left).
fatal: [openshift-master-3 -> 192.168.24.12]: FAILED! => {"attempts": 3, "changed": true, "cmd": ["/usr/local/bin/master-exec", "etcd", "etcd", "etcdctl", "--cert-file", "/etc/etcd/peer.crt", "--key-file", "/etc/etcd/peer.key", "--ca-file", "/etc/etcd/ca.crt", "--endpoints", "https://openshift-master-2:2379", "member", "add", "openshift-master-3", "https://172.17.1.27:2380"], "delta": "0:00:01.833627", "end": "2019-01-23 16:01:12.168434", "msg": "non-zero return code", "rc": 1, "start": "2019-01-23 16:01:10.334807", "stderr": "client: etcd cluster is unavailable or misconfigured; error #0: client: etcd member https://172.17.1.19:2379 has no leader\n; error #1: dial tcp 172.17.1.14:2379: getsockopt: no route to host\n; error #2: client: etcd member https://172.17.1.10:2379 has no leader", "stderr_lines": ["client: etcd cluster is unavailable or misconfigured; error #0: client: etcd member https://172.17.1.19:2379 has no leader", "; error #1: dial tcp 172.17.1.14:2379: getsockopt: no route to host", "; error #2: client: etcd member https://172.17.1.10:2379 has no leader"], "stdout": "", "stdout_lines": []}
(In reply to Marius Cornea from comment #9)
> I applied the patches above on my env and now the initial failure came back.
> Do you want to keep track of this issue in a separate bug report?
>
> TASK [etcd : Add new etcd members to cluster] **********************************
> fatal: [openshift-master-3 -> 192.168.24.12]: FAILED!

FWIW the ^ error is caused by the node removed during scale down still being an etcd member. It can be worked around by removing the node from etcd manually after the scale down:

/usr/local/bin/master-exec etcd etcd etcdctl \
    --cert-file /etc/etcd/peer.crt \
    --key-file /etc/etcd/peer.key \
    --ca-file /etc/etcd/ca.crt \
    --endpoints https://openshift-master-2:2379 \
    member remove $node_id
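To find $node_id, list the members and take the ID of the stale entry. A sketch assuming the etcdctl v2 "member list" output format; the member IDs and the sample output below are made up for illustration, and on the real deployment the list command would be run through /usr/local/bin/master-exec with the same cert/endpoint flags as the remove command above:

```shell
# Hypothetical `etcdctl member list` output (etcd v2 client format)
member_list='4e12ae02cc6f88d: name=openshift-master-1 peerURLs=https://172.17.1.14:2380 clientURLs=https://172.17.1.14:2379 isLeader=true
8a9f2c31b7d4e05: name=openshift-master-2 peerURLs=https://172.17.1.25:2380 clientURLs=https://172.17.1.25:2379 isLeader=false'

# The member ID is the hex string before the first colon; pick the line
# whose peer URL points at the deleted node (172.17.1.25 in this example)
node_id=$(echo "$member_list" | awk -F': ' '/172\.17\.1\.25:2380/ {print $1}')
echo "$node_id"
# prints 8a9f2c31b7d4e05
```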
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA.

For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2019:0878