Bug 2055409

Summary: Cinder etcd membership management issue when replacing a DistributedComputeHCI node
Product: Red Hat OpenStack Reporter: Donghwi Cha <dcha>
Component: openstack-tripleo-heat-templates    Assignee: Alan Bishop <abishop>
Status: CLOSED ERRATA QA Contact: Alfredo <alfrgarc>
Severity: medium Docs Contact:
Priority: high    
Version: 16.2 (Train)    CC: abishop, gcharot, gregraka, mburns
Target Milestone: z3    Keywords: Triaged, ZStream
Target Release: 16.2 (Train on RHEL 8.4)   
Hardware: Unspecified   
OS: Unspecified   
Whiteboard:
Fixed In Version: openstack-tripleo-heat-templates-11.6.1-2.20220409014848.7c89b16.el8ost Doc Type: Bug Fix
Doc Text:
Before this update, during the replacement of a DCN node, the etcd service on the replacement node failed to start and caused the cinder-volume service on that node to fail. This failure was caused by the replacement for a DCN node attempting to start the etcd service as if it were bootstrapping a new etcd cluster, instead of joining the existing etcd cluster. With this update, a new parameter has been added, `EtcdInitialClusterState`. When `EtcdInitialClusterState` is set to `existing`, the DCN node starts etcd correctly, which causes the cinder-volume service to run successfully.
Story Points: ---
Clone Of: Environment:
Last Closed: 2022-06-22 16:04:16 UTC Type: Bug
Regression: --- Mount Type: ---
Documentation: --- CRM:
Verified Versions: Category: ---
oVirt Team: --- RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: --- Target Upstream Version:
Embargoed:
Bug Depends On:    
Bug Blocks: 2053595    

Description Donghwi Cha 2022-02-16 20:39:08 UTC
Description of problem:

etcd membership management has an issue during node replacement, as seen below:

2022-02-16T20:32:01.096292313+09:00 stderr F 2022-02-16 20:32:01.096196 I | etcdmain: Loading server configuration from "/etc/etcd/etcd.yml"
2022-02-16T20:32:01.097168190+09:00 stderr F 2022-02-16 20:32:01.097134 I | etcdmain: etcd Version: 3.3.23
2022-02-16T20:32:01.097168190+09:00 stderr F 2022-02-16 20:32:01.097149 I | etcdmain: Git SHA: Not provided (use ./build instead of go build)
2022-02-16T20:32:01.097168190+09:00 stderr F 2022-02-16 20:32:01.097154 I | etcdmain: Go Version: go1.15.14
2022-02-16T20:32:01.097168190+09:00 stderr F 2022-02-16 20:32:01.097157 I | etcdmain: Go OS/Arch: linux/amd64
2022-02-16T20:32:01.097168190+09:00 stderr F 2022-02-16 20:32:01.097160 I | etcdmain: setting maximum number of CPUs to 8, total number of available CPUs is 8
2022-02-16T20:32:01.097223755+09:00 stderr F 2022-02-16 20:32:01.097205 N | etcdmain: the server is already initialized as member before, starting as etcd member...
2022-02-16T20:32:01.097244473+09:00 stderr F 2022-02-16 20:32:01.097232 I | embed: peerTLS: cert = /etc/pki/tls/certs/etcd.crt, key = /etc/pki/tls/private/etcd.key, ca = , trusted-ca = /etc/ipa/ca.crt, client-cert-auth = true, crl-file =
2022-02-16T20:32:01.097739006+09:00 stderr F 2022-02-16 20:32:01.097717 I | embed: listening for peers on https://[[newly added or replaced node ip address]]:2380
2022-02-16T20:32:01.097769463+09:00 stderr F 2022-02-16 20:32:01.097754 I | embed: listening for client requests on [[newly added or replaced node ip address]]:2379
2022-02-16T20:32:01.098630828+09:00 stderr F 2022-02-16 20:32:01.098601 I | pkg/netutil: resolving [[newly added or replaced node hostname]]:2380 to [[newly added or replaced node ip address]]:2380
2022-02-16T20:32:01.117897651+09:00 stderr F 2022-02-16 20:32:01.117840 C | etcdmain: member 6b9c10eb341bdd59 has already been bootstrapped
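
The last line shows why etcd fails: per the fix description, the replacement node starts etcd as if it were bootstrapping a new cluster instead of joining the existing one. A quick way to confirm this on the replacement node is to check the puppet-generated etcd configuration referenced in the workaround below (a sketch; the prompt is illustrative):

[root@d0-hci-3 ~]# grep initial-cluster-state /var/lib/config-data/puppet-generated/etcd/etc/etcd/etcd.yml
initial-cluster-state: "new"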

Version-Release number of selected component (if applicable):


How reproducible:


Steps to Reproduce:
1. Prepare replacement of a DistributedComputeHCI node using the manual Ceph scale-down procedure in this document:
https://access.redhat.com/documentation/en-us/red_hat_openstack_platform/16.2/html-single/deploying_an_overcloud_with_containerized_red_hat_ceph/index#Replacing_Ceph_Storage_Nodes
2. Run the scale-down using the document below:
https://access.redhat.com/documentation/en-us/red_hat_openstack_platform/16.2/html-single/director_installation_and_usage/index#removing-compute-nodes
3. During steps 1 and 2, run any commands or troubleshooting necessary and finish successfully.
4. Add a new DistributedComputeHCI node using the openstack overcloud deploy command.
5. Go to one of the existing DistributedComputeHCI nodes that was not deleted, and check the etcd member list by running "etcdctl member list" inside the etcd container (see the example after this list).
6. Observe that the new DistributedComputeHCI node cannot join the etcd cluster because the member record of the deleted node was never removed and still persists.
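
Example of the check in step 5, run on a surviving DistributedComputeHCI node (a sketch: the container name "etcd", the endpoint IP, and the certificate paths are taken from the workaround commands later in this report):

[root@d0-hci-1 ~]# podman exec -e ETCDCTL_API=3 etcd etcdctl \
    --endpoints=https://192.168.111.141:2379 \
    --cacert=/etc/ipa/ca.crt \
    --cert=/etc/pki/tls/certs/etcd.crt \
    --key=/etc/pki/tls/private/etcd.key \
    member list

The output still includes the stale member record of the deleted node (6b9c10eb341bdd59 in this report), which is why the replacement node cannot join.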

Actual results:
The old member record is not deleted from the etcd cluster of the DistributedComputeHCI nodes during node replacement.

Expected results:
- The old member record should be removed from the etcd cluster by using the etcdctl command line.
- The replacement DistributedComputeHCI node should be added to the etcd cluster by using the etcdctl command line.
- The replacement DistributedComputeHCI node should start its etcd container with the option initial-cluster-state set to "existing" in its conf and yml files.

Additional info:

Two possible options for implementation:
- Add a new block of scale tasks to the cinder puppet template, tripleo-heat-templates cinder-volume-container-puppet.yaml.
- Or do the same not in the cinder template but in nova-compute-container-puppet.yaml, using ansible so that it runs only when the role is DistributedComputeHCI.
- Below are sample commands showing what is necessary for the implementation (a rough ansible sketch follows the workaround steps).


[root@d0-hci-1 /]#  ETCDCTL_API=3 etcdctl --endpoints=https://192.168.111.142:2379 --cacert=/etc/ipa/ca.crt --cert=/etc/pki/tls/certs/etcd.crt --key=/etc/pki/tls/private/etcd.key member remove 6b9c10eb341bdd59

[root@d0-hci-1 /]#  ETCDCTL_API=3 etcdctl --endpoints=https://192.168.111.141:2379,https://192.168.111.142:2379 --cacert=/etc/ipa/ca.crt --cert=/etc/pki/tls/certs/etcd.crt --key=/etc/pki/tls/private/etcd.key member add d0-hci-3.internalapi.opqr.my.opqr --peer-urls=https://d0-hci-3.internalapi.opqr.my.opqr:2380 


>> Change the config on the new node:
vi /var/lib/config-data/puppet-generated/etcd/etc/etcd/etcd.yml
#initial-cluster-state: "new"
initial-cluster-state: "existing"

vi /var/lib/config-data/puppet-generated/etcd/etc/etcd/etcd.conf
#ETCD_INITIAL_CLUSTER_STATE="new"
ETCD_INITIAL_CLUSTER_STATE="existing"

podman restart etcd
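
A rough sketch of how the second implementation option could automate the config change, gated on the role (the task layout, variable names, and the exact tripleo-heat-templates hook are assumptions; only the paths, values, and restart command come from the workaround above):

- name: Make etcd on a replacement DistributedComputeHCI node join the existing cluster
  # run only for the DistributedComputeHCI role; the variable name is assumed
  when: tripleo_role_name == 'DistributedComputeHCI'
  block:
    - name: Set initial-cluster-state to existing in etcd.yml
      lineinfile:
        path: /var/lib/config-data/puppet-generated/etcd/etc/etcd/etcd.yml
        regexp: '^#?initial-cluster-state:'
        line: 'initial-cluster-state: "existing"'
    - name: Set ETCD_INITIAL_CLUSTER_STATE to existing in etcd.conf
      lineinfile:
        path: /var/lib/config-data/puppet-generated/etcd/etc/etcd/etcd.conf
        regexp: '^#?ETCD_INITIAL_CLUSTER_STATE='
        line: 'ETCD_INITIAL_CLUSTER_STATE="existing"'
    - name: Restart the etcd container
      command: podman restart etcd

The etcdctl member remove/add steps above would be handled the same way on one of the surviving nodes.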

Comment 1 Gregory Charot 2022-04-28 09:55:44 UTC
setting priority to high as this is a DCN requirement from KT

Comment 5 Greg Rakauskas 2022-05-17 18:20:05 UTC
Hi Alan,

An RHOSP Doc team script extracts the contents of the "Doc Text" field in this
BZ for use in the RHOSP release notes.

I have edited the "Doc Text" contents to conform to Red Hat style guidelines.
(See below).

Please add a comment to this BZ indicating whether my doc text edits:

   1. are OK, or
   2. require some changes.

If condition (2) is true, then please provide the required changes.

Thanks for your help with this,
--Greg


PROPOSED DOC TEXT EDIT
----------------------
Before this update, the etcd service on the replacement node failed to start,
which caused the cinder-volume service on that node to fail. This failure was
caused by the replacement for a DCN node attempting to start the etcd service as
if it were bootstrapping a new etcd cluster, instead of joining the existing
etcd cluster.

With this update, a new parameter has been added, `EtcdInitialClusterState`.
When `EtcdInitialClusterState` is set to `existing`, the DCN node starts etcd
properly, which then causes the cinder-volume service to run successfully.
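
For reference, a minimal sketch of how the new parameter might be set in a deployment environment file (the file name is only an example; parameter_defaults is the standard tripleo-heat-templates environment file section):

# dcn-node-replacement.yaml (example name)
parameter_defaults:
  EtcdInitialClusterState: existing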

Comment 6 Alan Bishop 2022-05-17 20:36:00 UTC
Hi Greg,

The replacement text is technically accurate, but I want to note the bug (and this fix) is only relevant to replacing DCN nodes. I mention this because the opening sentence of your text mentions "the replacement node" without the DCN context (although "DCN" is mentioned later). I'll leave to you to decide whether you want to refine the first sentence. Otherwise, the replacement text looks good.

Comment 12 errata-xmlrpc 2022-06-22 16:04:16 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (Release of components for Red Hat OpenStack Platform 16.2.3 (Train)), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2022:4793