Description of problem:
Director-deployed OCP 3.11: replacing a master node fails during TASK [etcd : Add new etcd members to cluster]:

TASK [etcd : Add new etcd members to cluster] **********************************
FAILED - RETRYING: Add new etcd members to cluster (3 retries left).
FAILED - RETRYING: Add new etcd members to cluster (2 retries left).
FAILED - RETRYING: Add new etcd members to cluster (1 retries left).
fatal: [openshift-master-3 -> 192.168.24.23]: FAILED! => {"attempts": 3, "changed": true, "cmd": ["/usr/local/bin/master-exec", "etcd", "etcd", "etcdctl", "--cert-file", "/etc/etcd/peer.crt", "--key-file", "/etc/etcd/peer.key", "--ca-file", "/etc/etcd/ca.crt", "--endpoints", "https://openshift-master-2:2379", "member", "add", "openshift-master-3", "https://172.17.1.23:2380"], "delta": "0:00:01.506743", "end": "2018-11-26 00:54:47.504738", "msg": "non-zero return code", "rc": 1, "start": "2018-11-26 00:54:45.997995", "stderr": "client: etcd cluster is unavailable or misconfigured; error #0: dial tcp 172.17.1.25:2379: getsockopt: no route to host\n; error #1: client: etcd member https://172.17.1.14:2379 has no leader\n; error #2: client: etcd member https://172.17.1.12:2379 has no leader", "stderr_lines": ["client: etcd cluster is unavailable or misconfigured; error #0: dial tcp 172.17.1.25:2379: getsockopt: no route to host", "; error #1: client: etcd member https://172.17.1.14:2379 has no leader", "; error #2: client: etcd member https://172.17.1.12:2379 has no leader"], "stdout": "", "stdout_lines": []}

Note: 172.17.1.25 is the removed master.

Version-Release number of selected component (if applicable):
openstack-tripleo-heat-templates-9.0.1-0.20181013060891.el7ost.noarch

How reproducible:
100%

Steps to Reproduce:
1. Deploy the overcloud with 3 master nodes.
2. Remove one of the masters with openstack overcloud node delete.
3. Re-run the overcloud deploy command to add the master node back to the deployment.

Actual results:
openshift-ansible fails (see playbook-etcd.log).

Expected results:
No failures.

Additional info:
The error above is caused by the node removed during the scale-down still being registered as an etcd cluster member. It can be worked around by removing the node from etcd manually after the scale-down:

/usr/local/bin/master-exec etcd etcd etcdctl \
  --cert-file /etc/etcd/peer.crt \
  --key-file /etc/etcd/peer.key \
  --ca-file /etc/etcd/ca.crt \
  --endpoints https://openshift-master-2:2379 \
  member remove $node_id
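For reference, the stale member's ID (the $node_id above) can be read from the member list on a surviving master. A minimal sketch, assuming the same master-exec wrapper and certificate paths as the failing task; the output line shown is illustrative, not taken from this environment:

# List current members from a surviving master; the stale entry is the one
# whose peerURLs points at the removed node (172.17.1.25 in this report)
/usr/local/bin/master-exec etcd etcd etcdctl \
  --cert-file /etc/etcd/peer.crt \
  --key-file /etc/etcd/peer.key \
  --ca-file /etc/etcd/ca.crt \
  --endpoints https://openshift-master-2:2379 \
  member list

# Illustrative output line; the leading hex ID is the $node_id to remove:
# 2a4f5c1d8e9b0f3a: name=openshift-master-1 peerURLs=https://172.17.1.25:2380 clientURLs=https://172.17.1.25:2379 isLeader=false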
We need to document the workaround for now, but we could resolve this in TripleO (with some significant rework of our templates there). Workaround to remove the node from etcd after the scale-down:

/usr/local/bin/master-exec etcd etcd etcdctl \
  --cert-file /etc/etcd/peer.crt \
  --key-file /etc/etcd/peer.key \
  --ca-file /etc/etcd/ca.crt \
  --endpoints https://openshift-master-2:2379 \
  member remove $node_id
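Once the stale member has been removed, the cluster state can be checked before retrying the scale-up. A sketch, assuming the same wrapper and endpoint as above:

# Confirm the stale member is gone and the remaining members are healthy
/usr/local/bin/master-exec etcd etcd etcdctl \
  --cert-file /etc/etcd/peer.crt \
  --key-file /etc/etcd/peer.key \
  --ca-file /etc/etcd/ca.crt \
  --endpoints https://openshift-master-2:2379 \
  cluster-health

# Then re-run the original openstack overcloud deploy command (step 3 in the
# reproducer) to add the replacement master back to the deployment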
Moving to the docs team.
New section has been published here: https://access.redhat.com/documentation/en-us/red_hat_openstack_platform/14/html-single/installing_openshift_container_platform_on_bare_metal_using_director/index#replacing_a_master_node