Description of problem:
Director-deployed OCP 3.11: replacing a master node fails during TASK [etcd : Add new etcd members to cluster]:

TASK [etcd : Add new etcd members to cluster] **********************************
FAILED - RETRYING: Add new etcd members to cluster (3 retries left).
FAILED - RETRYING: Add new etcd members to cluster (2 retries left).
FAILED - RETRYING: Add new etcd members to cluster (1 retries left).
fatal: [openshift-master-3 -> 192.168.24.23]: FAILED! => {"attempts": 3, "changed": true, "cmd": ["/usr/local/bin/master-exec", "etcd", "etcd", "etcdctl", "--cert-file", "/etc/etcd/peer.crt", "--key-file", "/etc/etcd/peer.key", "--ca-file", "/etc/etcd/ca.crt", "--endpoints", "https://openshift-master-2:2379", "member", "add", "openshift-master-3", "https://172.17.1.23:2380"], "delta": "0:00:01.506743", "end": "2018-11-26 00:54:47.504738", "msg": "non-zero return code", "rc": 1, "start": "2018-11-26 00:54:45.997995", "stderr": "client: etcd cluster is unavailable or misconfigured; error #0: dial tcp 172.17.1.25:2379: getsockopt: no route to host\n; error #1: client: etcd member https://172.17.1.14:2379 has no leader\n; error #2: client: etcd member https://172.17.1.12:2379 has no leader", "stderr_lines": ["client: etcd cluster is unavailable or misconfigured; error #0: dial tcp 172.17.1.25:2379: getsockopt: no route to host", "; error #1: client: etcd member https://172.17.1.14:2379 has no leader", "; error #2: client: etcd member https://172.17.1.12:2379 has no leader"], "stdout": "", "stdout_lines": []}

Note: 172.17.1.25 is the removed master.

Version-Release number of selected component (if applicable):
openstack-tripleo-heat-templates-9.0.1-0.20181013060891.el7ost.noarch

How reproducible:
100%

Steps to Reproduce:
1. Deploy the overcloud with 3 master nodes.
2. Remove one of the masters with openstack overcloud node delete.
3. Re-run the overcloud deploy command to add the master node back to the deployment.

Actual results:
openshift-ansible fails (see playbook-etcd.log).

Expected results:
No failures.

Additional info:
The error above is caused by the node removed during the scale-down still being registered as an etcd cluster member. It can be worked around by removing the node from etcd manually after the scale-down:

/usr/local/bin/master-exec etcd etcd etcdctl \
  --cert-file /etc/etcd/peer.crt \
  --key-file /etc/etcd/peer.key \
  --ca-file /etc/etcd/ca.crt \
  --endpoints https://openshift-master-2:2379 \
  member remove $node_id
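For reference, the stale member's ID (the $node_id above) can be read from the member list on a surviving master. A minimal sketch, assuming the same master-exec wrapper and certificate paths as the failing task; the output line shown is illustrative, not taken from this environment:

# List current members from a surviving master; the stale entry is the one
# whose peerURLs points at the removed node (172.17.1.25 in this report)
/usr/local/bin/master-exec etcd etcd etcdctl \
  --cert-file /etc/etcd/peer.crt \
  --key-file /etc/etcd/peer.key \
  --ca-file /etc/etcd/ca.crt \
  --endpoints https://openshift-master-2:2379 \
  member list

# Illustrative output line; the leading hex ID is the $node_id to remove:
# 2a4f5c1d8e9b0f3a: name=openshift-master-1 peerURLs=https://172.17.1.25:2380 clientURLs=https://172.17.1.25:2379 isLeader=false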
We need to document the workaround for now, but we could resolve this in TripleO (with some significant rework of our templates there). Workaround to remove the node from etcd after the scale-down:

/usr/local/bin/master-exec etcd etcd etcdctl \
  --cert-file /etc/etcd/peer.crt \
  --key-file /etc/etcd/peer.key \
  --ca-file /etc/etcd/ca.crt \
  --endpoints https://openshift-master-2:2379 \
  member remove $node_id
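Once the stale member has been removed, the cluster state can be checked before retrying the scale-up. A sketch, assuming the same wrapper and endpoint as above:

# Confirm the stale member is gone and the remaining members are healthy
/usr/local/bin/master-exec etcd etcd etcdctl \
  --cert-file /etc/etcd/peer.crt \
  --key-file /etc/etcd/peer.key \
  --ca-file /etc/etcd/ca.crt \
  --endpoints https://openshift-master-2:2379 \
  cluster-health

# Then re-run the original openstack overcloud deploy command (step 3 in the
# reproducer) to add the replacement master back to the deployment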
Moving to the docs team.
New section has been published here: https://access.redhat.com/documentation/en-us/red_hat_openstack_platform/14/html-single/installing_openshift_container_platform_on_bare_metal_using_director/index#replacing_a_master_node