Bug 1670513 - [docs][OSasInfra] - Add procedure for `Replacing a master node`.
Summary: [docs][OSasInfra] - Add procedure for `Replacing a master node`.
Keywords:
Status: CLOSED CURRENTRELEASE
Alias: None
Product: Red Hat OpenStack
Classification: Red Hat
Component: documentation
Version: 15.0 (Stein)
Hardware: Unspecified
OS: Unspecified
Priority: medium
Severity: urgent
Target Milestone: beta
Target Release: 15.0 (Stein)
Assignee: Martin Lopes
QA Contact: RHOS Documentation Team
URL:
Whiteboard:
Depends On: 1653348
Blocks:
Reported: 2019-01-29 17:43 UTC by Marius Cornea
Modified: 2019-03-19 09:24 UTC (History)

Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed: 2019-03-19 09:24:49 UTC
Target Upstream Version:
Embargoed:



Description Marius Cornea 2019-01-29 17:43:29 UTC
Description of problem:

With a director-deployed OCP 3.11 overcloud, replacing a master node fails during TASK [etcd : Add new etcd members to cluster]:


TASK [etcd : Add new etcd members to cluster] **********************************
FAILED - RETRYING: Add new etcd members to cluster (3 retries left).
FAILED - RETRYING: Add new etcd members to cluster (2 retries left).
FAILED - RETRYING: Add new etcd members to cluster (1 retries left).
fatal: [openshift-master-3 -> 192.168.24.23]: FAILED! => {"attempts": 3, "changed": true, "cmd": ["/usr/local/bin/master-exec", "etcd", "etcd", "etcdctl", "--cert-file", "/etc/etcd/peer.crt", "--key-file", "/etc/etcd/peer.key", "--ca-file", "/etc/etcd/ca.crt", "--endpoints", "https://openshift-master-2:2379", "member", "add", "openshift-master-3", "https://172.17.1.23:2380"], "delta": "0:00:01.506743", "end": "2018-11-26 00:54:47.504738", "msg": "non-zero return code", "rc": 1, "start": "2018-11-26 00:54:45.997995", "stderr": "client: etcd cluster is unavailable or misconfigured; error #0: dial tcp 172.17.1.25:2379: getsockopt: no route to host\n; error #1: client: etcd member https://172.17.1.14:2379 has no leader\n; error #2: client: etcd member https://172.17.1.12:2379 has no leader", "stderr_lines": ["client: etcd cluster is unavailable or misconfigured; error #0: dial tcp 172.17.1.25:2379: getsockopt: no route to host", "; error #1: client: etcd member https://172.17.1.14:2379 has no leader", "; error #2: client: etcd member https://172.17.1.12:2379 has no leader"], "stdout": "", "stdout_lines": []}

Note: 172.17.1.25 is the address of the removed master.

Version-Release number of selected component (if applicable):
openstack-tripleo-heat-templates-9.0.1-0.20181013060891.el7ost.noarch

How reproducible:
100%

Steps to Reproduce:
1. Deploy an overcloud with 3 masters.
2. Remove one of the masters with openstack overcloud node delete.
3. Re-run the overcloud deploy command to add a master node back to the deployment.

Actual results:
openshift-ansible fails; the error appears in playbook-etcd.log.

Expected results:
No failures.

Additional info:

FWIW, the error above occurs because the node removed during the scale down is still registered as an etcd member. As a workaround, remove the node from etcd manually after the scale down:
/usr/local/bin/master-exec etcd etcd etcdctl --cert-file /etc/etcd/peer.crt --key-file /etc/etcd/peer.key --ca-file /etc/etcd/ca.crt --endpoints https://openshift-master-2:2379  member remove $node_id
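
The command above needs a value for $node_id. A minimal sketch of one way to look it up, assuming the etcdctl v2 "member list" output format (one line per member, starting with the member ID, followed by name=, peerURLs=, and clientURLs= fields). The helper names etcdctl_cmd and member_id_for_ip are hypothetical and not part of any shipped tooling; the certificate paths and endpoint are copied from the command above.

```shell
#!/usr/bin/env bash

# Hypothetical wrapper around the etcdctl invocation used in this report.
# Paths and endpoint are taken from the workaround command; adjust the
# endpoint to any master that is still part of the cluster.
etcdctl_cmd() {
    /usr/local/bin/master-exec etcd etcd etcdctl \
        --cert-file /etc/etcd/peer.crt \
        --key-file /etc/etcd/peer.key \
        --ca-file /etc/etcd/ca.crt \
        --endpoints https://openshift-master-2:2379 "$@"
}

# Hypothetical helper: read "member list" output on stdin and print the ID
# of the member whose peerURLs field contains the given IP address.
# Assumes lines shaped like:
#   <hex-id>: name=<name> peerURLs=https://<ip>:2380 clientURLs=... isLeader=false
member_id_for_ip() {
    awk -v ip="$1" '
        index($0, "peerURLs=https://" ip ":") {
            id = $1
            # Keep only the leading hex ID; drop ":" or "[unstarted]:".
            sub(/[^0-9a-f].*/, "", id)
            print id
        }'
}
```

On a remaining master, this could be used as follows (172.17.1.25 being the removed master in this report):

node_id=$(etcdctl_cmd member list | member_id_for_ip 172.17.1.25)
etcdctl_cmd member remove "$node_id"

It is worth checking that $node_id is non-empty before running the remove step, since an empty value means no member matched the address.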

Comment 1 John Trowbridge 2019-02-11 14:45:13 UTC
We need to document the workaround for now, but we could resolve this in TripleO (with some significant rework of our templates there).

Workaround to remove node from etcd after scale down:
/usr/local/bin/master-exec etcd etcd etcdctl --cert-file /etc/etcd/peer.crt --key-file /etc/etcd/peer.key --ca-file /etc/etcd/ca.crt --endpoints https://openshift-master-2:2379  member remove $node_id

Comment 2 Martin André 2019-03-13 14:35:05 UTC
Moving to the docs team.

