Description of problem:

etcd membership management has an issue during node replacement, as seen below:

2022-02-16T20:32:01.096292313+09:00 stderr F 2022-02-16 20:32:01.096196 I | etcdmain: Loading server configuration from "/etc/etcd/etcd.yml"
2022-02-16T20:32:01.097168190+09:00 stderr F 2022-02-16 20:32:01.097134 I | etcdmain: etcd Version: 3.3.23
2022-02-16T20:32:01.097168190+09:00 stderr F 2022-02-16 20:32:01.097149 I | etcdmain: Git SHA: Not provided (use ./build instead of go build)
2022-02-16T20:32:01.097168190+09:00 stderr F 2022-02-16 20:32:01.097154 I | etcdmain: Go Version: go1.15.14
2022-02-16T20:32:01.097168190+09:00 stderr F 2022-02-16 20:32:01.097157 I | etcdmain: Go OS/Arch: linux/amd64
2022-02-16T20:32:01.097168190+09:00 stderr F 2022-02-16 20:32:01.097160 I | etcdmain: setting maximum number of CPUs to 8, total number of available CPUs is 8
2022-02-16T20:32:01.097223755+09:00 stderr F 2022-02-16 20:32:01.097205 N | etcdmain: the server is already initialized as member before, starting as etcd member...
2022-02-16T20:32:01.097244473+09:00 stderr F 2022-02-16 20:32:01.097232 I | embed: peerTLS: cert = /etc/pki/tls/certs/etcd.crt, key = /etc/pki/tls/private/etcd.key, ca = , trusted-ca = /etc/ipa/ca.crt, client-cert-auth = true, crl-file =
2022-02-16T20:32:01.097739006+09:00 stderr F 2022-02-16 20:32:01.097717 I | embed: listening for peers on https://[[newly added or replaced node ip address]]:2380
2022-02-16T20:32:01.097769463+09:00 stderr F 2022-02-16 20:32:01.097754 I | embed: listening for client requests on [[newly added or replaced node ip address]]:2379
2022-02-16T20:32:01.098630828+09:00 stderr F 2022-02-16 20:32:01.098601 I | pkg/netutil: resolving [[newly added or replaced node hostname]]:2380 to [[newly added or replaced node ip address]]:2380
2022-02-16T20:32:01.117897651+09:00 stderr F 2022-02-16 20:32:01.117840 C | etcdmain: member 6b9c10eb341bdd59 has already been bootstrapped

Version-Release number of selected component (if applicable):

How reproducible:

Steps to Reproduce:
1. Prepare replacement of a DistributedComputeHCI node using the manual Ceph scale-down procedure in this document:
https://access.redhat.com/documentation/en-us/red_hat_openstack_platform/16.2/html-single/deploying_an_overcloud_with_containerized_red_hat_ceph/index#Replacing_Ceph_Storage_Nodes
2. Run the scale-down using this document:
https://access.redhat.com/documentation/en-us/red_hat_openstack_platform/16.2/html-single/director_installation_and_usage/index#removing-compute-nodes
3. During steps 1 and 2, include any commands or troubleshooting necessary and finish successfully.
4. Add the new DistributedComputeHCI node using the openstack overcloud deploy command.
5. Go to one of the existing DistributedComputeHCI nodes that was not deleted, and check the etcd member list by running "etcdctl member list" inside the etcd container (see the sketch after these steps).
6. Observe that the new DistributedComputeHCI node cannot join the etcd cluster, since the member record of the deleted node was not removed and still persists.
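For reference, a minimal sketch of the check in step 5, assuming the endpoint and certificate paths that appear in the sample commands later in this report (adjust for your environment). The container name "etcd" matches the "podman restart etcd" command below; the sketch runs from the host via podman exec:

podman exec etcd sh -c 'ETCDCTL_API=3 etcdctl --endpoints=https://192.168.111.141:2379 --cacert=/etc/ipa/ca.crt --cert=/etc/pki/tls/certs/etcd.crt --key=/etc/pki/tls/private/etcd.key member list'

The stale member (6b9c10eb341bdd59 in this report) still appears in the output, which is what blocks the replacement node from joining.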
Actual results:
The old member record is not deleted from the etcd cluster on the DistributedComputeHCI nodes during node replacement.

Expected results:
- The old member record of the etcd cluster should be deleted by using the etcdctl command line.
- The new replacement DistributedComputeHCI node should be added to the etcd cluster by using the etcdctl command line.
- The new DistributedComputeHCI node should start its etcd container with the initial-cluster-state option set to "existing" in its etcd.conf and etcd.yml files.

Additional info:
Two possible options for implementation:
- Add a new block of scale tasks to the cinder puppet YAML, like TripleO's cinder-volume-container-puppet.yaml.
- Or do the same not in the cinder puppet template but in nova-compute-container-puppet.yaml, using Ansible so that it runs only if the role is DistributedComputeHCI.

Below are sample commands of what is necessary for the implementation:

[root@d0-hci-1 /]# ETCDCTL_API=3 etcdctl --endpoints=https://192.168.111.142:2379 --cacert=/etc/ipa/ca.crt --cert=/etc/pki/tls/certs/etcd.crt --key=/etc/pki/tls/private/etcd.key member remove 6b9c10eb341bdd59

[root@d0-hci-1 /]# ETCDCTL_API=3 etcdctl --endpoints=https://192.168.111.141:2379,https://192.168.111.142:2379 --cacert=/etc/ipa/ca.crt --cert=/etc/pki/tls/certs/etcd.crt --key=/etc/pki/tls/private/etcd.key member add d0-hci-3.internalapi.opqr.my.opqr --peer-urls=https://d0-hci-3.internalapi.opqr.my.opqr:2380

>> change config on the new node

vi /var/lib/config-data/puppet-generated/etcd/etc/etcd/etcd.yml
#initial-cluster-state: "new"
initial-cluster-state: "existing"

vi /var/lib/config-data/puppet-generated/etcd/etc/etcd/etcd.conf
#ETCD_INITIAL_CLUSTER_STATE="new"
ETCD_INITIAL_CLUSTER_STATE="existing"

podman restart etcd
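As a follow-up verification (a sketch, not part of the original report's commands), one could confirm from an existing node that the replacement member joined and is healthy after the restart, reusing the same endpoints and certificates as above:

[root@d0-hci-1 /]# ETCDCTL_API=3 etcdctl --endpoints=https://192.168.111.141:2379,https://192.168.111.142:2379 --cacert=/etc/ipa/ca.crt --cert=/etc/pki/tls/certs/etcd.crt --key=/etc/pki/tls/private/etcd.key member list
[root@d0-hci-1 /]# ETCDCTL_API=3 etcdctl --endpoints=https://192.168.111.141:2379,https://192.168.111.142:2379 --cacert=/etc/ipa/ca.crt --cert=/etc/pki/tls/certs/etcd.crt --key=/etc/pki/tls/private/etcd.key endpoint health

The new member (d0-hci-3 in the example above) should appear as started in the member list, and all endpoints should report healthy.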
Setting priority to high, as this is a DCN requirement from KT.
Hi Alan,

An RHOSP Doc team script extracts the contents of the "Doc Text" field in this BZ for use in the RHOSP release notes. I have edited the "Doc Text" contents to conform to Red Hat style guidelines. (See below.)

Please add a comment to this BZ indicating whether my doc text edits:
1. are OK, or
2. require some changes.

If condition (2) is true, then please provide the required changes.

Thanks for your help with this,
--Greg

PROPOSED DOC TEXT EDIT
----------------------
Before this update, the etcd service on the replacement node failed to start, which caused the cinder-volume service on that node to fail. This failure was caused by the replacement for a DCN node attempting to start the etcd service as if it were bootstrapping a new etcd cluster, instead of joining the existing etcd cluster. With this update, a new parameter has been added, `EtcdInitialClusterState`. When `EtcdInitialClusterState` is set to `existing`, the DCN node starts etcd properly, which then causes the cinder-volume service to run successfully.
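For context, a minimal sketch of how the new parameter might be applied, assuming the standard TripleO mechanism of passing a custom environment file to the deploy command; the file name etcd-existing.yaml is hypothetical:

cat > /home/stack/etcd-existing.yaml <<'EOF'
parameter_defaults:
  EtcdInitialClusterState: existing
EOF

The file would then be included via the -e option, alongside the usual arguments of the openstack overcloud deploy command, when deploying the replacement DCN node.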
Hi Greg,

The replacement text is technically accurate, but I want to note that the bug (and this fix) is only relevant to replacing DCN nodes. I mention this because the opening sentence of your text mentions "the replacement node" without the DCN context (although "DCN" is mentioned later). I'll leave it to you to decide whether you want to refine the first sentence. Otherwise, the replacement text looks good.
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA.

For information on the advisory (Release of components for Red Hat OpenStack Platform 16.2.3 (Train)), and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2022:4793