Description of problem:

etcd membership management has an issue during node replacement, as seen below:

2022-02-16T20:32:01.096292313+09:00 stderr F 2022-02-16 20:32:01.096196 I | etcdmain: Loading server configuration from "/etc/etcd/etcd.yml"
2022-02-16T20:32:01.097168190+09:00 stderr F 2022-02-16 20:32:01.097134 I | etcdmain: etcd Version: 3.3.23
2022-02-16T20:32:01.097168190+09:00 stderr F 2022-02-16 20:32:01.097149 I | etcdmain: Git SHA: Not provided (use ./build instead of go build)
2022-02-16T20:32:01.097168190+09:00 stderr F 2022-02-16 20:32:01.097154 I | etcdmain: Go Version: go1.15.14
2022-02-16T20:32:01.097168190+09:00 stderr F 2022-02-16 20:32:01.097157 I | etcdmain: Go OS/Arch: linux/amd64
2022-02-16T20:32:01.097168190+09:00 stderr F 2022-02-16 20:32:01.097160 I | etcdmain: setting maximum number of CPUs to 8, total number of available CPUs is 8
2022-02-16T20:32:01.097223755+09:00 stderr F 2022-02-16 20:32:01.097205 N | etcdmain: the server is already initialized as member before, starting as etcd member...
2022-02-16T20:32:01.097244473+09:00 stderr F 2022-02-16 20:32:01.097232 I | embed: peerTLS: cert = /etc/pki/tls/certs/etcd.crt, key = /etc/pki/tls/private/etcd.key, ca = , trusted-ca = /etc/ipa/ca.crt, client-cert-auth = true, crl-file =
2022-02-16T20:32:01.097739006+09:00 stderr F 2022-02-16 20:32:01.097717 I | embed: listening for peers on https://[[newly added or replaced node ip address]]:2380
2022-02-16T20:32:01.097769463+09:00 stderr F 2022-02-16 20:32:01.097754 I | embed: listening for client requests on [[newly added or replaced node ip address]]:2379
2022-02-16T20:32:01.098630828+09:00 stderr F 2022-02-16 20:32:01.098601 I | pkg/netutil: resolving [[newly added or replaced node hostname]]:2380 to [[newly added or replaced node ip address]]:2380
2022-02-16T20:32:01.117897651+09:00 stderr F 2022-02-16 20:32:01.117840 C | etcdmain: member 6b9c10eb341bdd59 has already been bootstrapped

Version-Release number of selected component (if applicable):

How reproducible:

Steps to Reproduce:
1. Prepare replacement of a DistributedComputeHCI node using the manual Ceph scale-down procedure in this document:
https://access.redhat.com/documentation/en-us/red_hat_openstack_platform/16.2/html-single/deploying_an_overcloud_with_containerized_red_hat_ceph/index#Replacing_Ceph_Storage_Nodes
2. Run the scale-down using this document:
https://access.redhat.com/documentation/en-us/red_hat_openstack_platform/16.2/html-single/director_installation_and_usage/index#removing-compute-nodes
3. During steps 1 and 2, include any commands or troubleshooting necessary and finish successfully.
4. Add the new DistributedComputeHCI node using the openstack overcloud deploy command.
5. Go to one of the existing DistributedComputeHCI nodes that was not deleted, and check the etcd member list by running "etcdctl member list" inside the etcd container (see the sketch after these steps).
6. Observe that the new DistributedComputeHCI node cannot join the etcd cluster, since the member record of the deleted node was not removed and still persists.
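For reference, a minimal sketch of the check in step 5, assuming the endpoint and certificate paths that appear in the sample commands later in this report (adjust for your environment). The container name "etcd" matches the "podman restart etcd" command below; the sketch runs from the host via podman exec:

podman exec etcd sh -c 'ETCDCTL_API=3 etcdctl --endpoints=https://192.168.111.141:2379 --cacert=/etc/ipa/ca.crt --cert=/etc/pki/tls/certs/etcd.crt --key=/etc/pki/tls/private/etcd.key member list'

The stale member (6b9c10eb341bdd59 in this report) still appears in the output, which is what blocks the replacement node from joining.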
Actual results:
The old member record is not deleted from the etcd cluster on the DistributedComputeHCI nodes during node replacement.

Expected results:
- The old member record of the etcd cluster should be deleted by using the etcdctl command line.
- The new replacement DistributedComputeHCI node should be added to the etcd cluster by using the etcdctl command line.
- The new DistributedComputeHCI node should start its etcd container with the initial-cluster-state option set to "existing" in its etcd.conf and etcd.yml files.

Additional info:
Two possible options for implementation:
- Add a new block of scale tasks to the cinder puppet YAML, like TripleO's cinder-volume-container-puppet.yaml.
- Or do the same not in the cinder puppet template but in nova-compute-container-puppet.yaml, using Ansible so that it runs only if the role is DistributedComputeHCI.

Below are sample commands of what is necessary for the implementation:

[root@d0-hci-1 /]# ETCDCTL_API=3 etcdctl --endpoints=https://192.168.111.142:2379 --cacert=/etc/ipa/ca.crt --cert=/etc/pki/tls/certs/etcd.crt --key=/etc/pki/tls/private/etcd.key member remove 6b9c10eb341bdd59

[root@d0-hci-1 /]# ETCDCTL_API=3 etcdctl --endpoints=https://192.168.111.141:2379,https://192.168.111.142:2379 --cacert=/etc/ipa/ca.crt --cert=/etc/pki/tls/certs/etcd.crt --key=/etc/pki/tls/private/etcd.key member add d0-hci-3.internalapi.opqr.my.opqr --peer-urls=https://d0-hci-3.internalapi.opqr.my.opqr:2380

>> change config on the new node

vi /var/lib/config-data/puppet-generated/etcd/etc/etcd/etcd.yml
#initial-cluster-state: "new"
initial-cluster-state: "existing"

vi /var/lib/config-data/puppet-generated/etcd/etc/etcd/etcd.conf
#ETCD_INITIAL_CLUSTER_STATE="new"
ETCD_INITIAL_CLUSTER_STATE="existing"

podman restart etcd
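As a follow-up verification (a sketch, not part of the original report's commands), one could confirm from an existing node that the replacement member joined and is healthy after the restart, reusing the same endpoints and certificates as above:

[root@d0-hci-1 /]# ETCDCTL_API=3 etcdctl --endpoints=https://192.168.111.141:2379,https://192.168.111.142:2379 --cacert=/etc/ipa/ca.crt --cert=/etc/pki/tls/certs/etcd.crt --key=/etc/pki/tls/private/etcd.key member list
[root@d0-hci-1 /]# ETCDCTL_API=3 etcdctl --endpoints=https://192.168.111.141:2379,https://192.168.111.142:2379 --cacert=/etc/ipa/ca.crt --cert=/etc/pki/tls/certs/etcd.crt --key=/etc/pki/tls/private/etcd.key endpoint health

The new member (d0-hci-3 in the example above) should appear as started in the member list, and all endpoints should report healthy.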
Setting priority to high, as this is a DCN requirement from KT.
Hi Alan,

An RHOSP Doc team script extracts the contents of the "Doc Text" field in this BZ for use in the RHOSP release notes. I have edited the "Doc Text" contents to conform to Red Hat style guidelines. (See below.)

Please add a comment to this BZ indicating whether my doc text edits:
1. are OK, or
2. require some changes.

If condition (2) is true, then please provide the required changes.

Thanks for your help with this,
--Greg

PROPOSED DOC TEXT EDIT
----------------------
Before this update, the etcd service on the replacement node failed to start, which caused the cinder-volume service on that node to fail. This failure was caused by the replacement for a DCN node attempting to start the etcd service as if it were bootstrapping a new etcd cluster, instead of joining the existing etcd cluster. With this update, a new parameter has been added, `EtcdInitialClusterState`. When `EtcdInitialClusterState` is set to `existing`, the DCN node starts etcd properly, which then causes the cinder-volume service to run successfully.
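For context, a minimal sketch of how the new parameter might be applied, assuming the standard TripleO mechanism of passing a custom environment file to the deploy command; the file name etcd-existing.yaml is hypothetical:

cat > /home/stack/etcd-existing.yaml <<'EOF'
parameter_defaults:
  EtcdInitialClusterState: existing
EOF

The file would then be included via the -e option, alongside the usual arguments of the openstack overcloud deploy command, when deploying the replacement DCN node.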
Hi Greg,

The replacement text is technically accurate, but I want to note that the bug (and this fix) is only relevant to replacing DCN nodes. I mention this because the opening sentence of your text mentions "the replacement node" without the DCN context (although "DCN" is mentioned later). I'll leave it to you to decide whether you want to refine the first sentence. Otherwise, the replacement text looks good.
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA.

For information on the advisory (Release of components for Red Hat OpenStack Platform 16.2.3 (Train)), and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2022:4793