Description of problem:
-----------------------
Node replacement is a tedious and error prone process. Provide an Ansible role to perform the node replacement.
The role has been provided and integrated on the rhhi-engine side, and is embedded in the reinstall flow from rhhi-engine. The role requires 3 parameters, and two conditions must be satisfied to run the playbook: gluster must be supported in the cluster, and that gluster-supported cluster must have a minimum of 3 hosts.

(i) oldNode: the node that needs to be reinstalled
(ii) clusterNode_1: the maintenance node
(iii) clusterNode_2: the second maintenance node

As of now this reconfigure-gluster role is triggered if the cluster is gluster supported. It is included in the reinstall flow, i.e. every time a node is reinstalled from the rhhi-engine side, the node's gluster configuration gets reconfigured and the node is added back to the gluster peer network. (A front-end checkbox is under WIP, which enables customers to choose whether the selected node should have its gluster reconfigured while reinstalling; reconfigureGluster is only called if the customer checks the checkbox, else the node is reinstalled as usual.)

What the role does:
The role deletes the existing gluster directory and removes the affected node from the gluster peer network. It then gets the peer info and other details of the corrupted node from the other two nodes, reconfigures the node, and adds it back to the gluster peer network, after which self-heal happens and all the nodes end up back in sync.
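For clarity, a minimal sketch of how the role might be applied manually with the three parameters described above; the inventory group name and the example FQDNs are assumptions for illustration, not part of the actual rhhi-engine flow:

# Sketch only: apply the gluster-replace-peers role outside the engine's
# reinstall flow. Variable names follow the parameters described above;
# the group name "gluster_cluster" and the FQDNs are placeholders.
- hosts: gluster_cluster
  become: true
  vars:
    oldNode: replaced-node.example.com        # node being reinstalled
    clusterNode_1: healthy-node-1.example.com # first maintenance node
    clusterNode_2: healthy-node-2.example.com # second maintenance node
  roles:
    - gluster-replace-peers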
Tested with RHV 4.4.0-33

Used the following steps:
1. Created a 3 node RHHI-V cluster
2. Randomly powered off one of the nodes (not the primary volfile server node)
3. Formatted all the disks on that node, and reinstalled the node freshly with the new RHVH 4.4 ISO
4. Prepared the node with the node_prep_inventory.yml and node_prep.yml playbooks (as in the RHHI-V maintenance guide)
   a. This playbook configures the node, creates the bricks and prepares the node to become part of the existing RHHI-V cluster
5. The RHV Manager UI still shows that the host is 'inactive'; moved that host to maintenance
6. Copied the ovirt-engine SSH public key and added it to the ~/.ssh/authorized_keys file
7. Selected the 'Reinstall' option after selecting that host, with 'Reconfigure Gluster' enabled

When trying to replace the host from the RHV Manager UI, the process fails.

<snip>
2020-05-06 09:39:33 UTC - TASK [gluster-replace-peers : Copy the details of the old node] ****************

2020-05-06 09:39:39 UTC - fatal: [rhsqa-grafton7-nic2.lab.eng.blr.redhat.com -> rhsqa-grafton7-nic2.lab.eng.blr.redhat.com]: FAILED! => {"changed": false, "file": "/var/lib/glusterd/peers/", "msg": "remote file is a directory, fetch cannot work on directories"}

2020-05-06 09:39:39 UTC - { "status" : "OK", "msg" : "", "data" : { "uuid" : "0fb9e7d6-5f9c-41da-b5c0-5838e82f166f", "counter" : 26, "stdout" : "fatal: [rhsqa-grafton7-nic2.lab.eng.blr.redhat.com -> rhsqa-grafton7-nic2.lab.eng.blr.redhat.com]: FAILED! => {\"changed\": false, \"file\": \"/var/lib/glusterd/peers/\", \"msg\": \"remote file is a directory, fetch cannot work on directories\"}", "start_line" : 25, "end_line" : 26, "runner_ident" : "6de8fd8a-8f7d-11ea-8f1c-004755204901", "event" : "runner_on_failed", "pid" : 115000, "created" : "2020-05-06T09:39:36.812789", "parent_uuid" : "00475520-4901-8085-eb1a-000000000016", "event_data" : { "playbook" : "replace-gluster.yml", "playbook_uuid" : "4c94465f-6dda-4f21-b845-35b09f617834", "play" : "all", "play_uuid" : "00475520-4901-8085-eb1a-000000000008", "play_pattern" : "all", "task" : "Copy the details of the old node", "task_uuid" : "00475520-4901-8085-eb1a-000000000016", "task_action" : "fetch", "task_args" : "", "task_path" : "/usr/share/ovirt-engine/ansible-runner-service-project/project/roles/gluster-replace-peers/tasks/peers.yml:42", "role" : "gluster-replace-peers", "host" : "rhsqa-grafton7-nic2.lab.eng.blr.redhat.com", "remote_addr" : "rhsqa-grafton7-nic2.lab.eng.blr.redhat.com", "res" : { "file" : "/var/lib/glusterd/peers/", "msg" : "remote file is a directory, fetch cannot work on directories", "_ansible_no_log" : false, "changed" : false, "_ansible_delegated_vars" : { "ansible_host" : "rhsqa-grafton7-nic2.lab.eng.blr.redhat.com" } }, "start" : "2020-05-06T09:39:33.044182", "end" : "2020-05-06T09:39:36.812541", "duration" : 3.768359, "ignore_errors" : null, "event_loop" : null, "uuid" : "0fb9e7d6-5f9c-41da-b5c0-5838e82f166f" } } }
</snip>
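For context, the failure above comes from an ansible.builtin.fetch task pointed at the directory /var/lib/glusterd/peers/ rather than at individual files; fetch only handles single files. A minimal sketch of how the peer files could be collected instead (task names and the local staging path are illustrative, not the actual fix that was shipped):

# Sketch only: enumerate the peer files on a healthy node, then fetch each one.
- name: Find the peer files on the healthy node
  ansible.builtin.find:
    paths: /var/lib/glusterd/peers/
    file_type: file
  register: peer_files

- name: Copy the details of the old node
  ansible.builtin.fetch:
    src: "{{ item.path }}"
    dest: /tmp/peers/        # illustrative local staging directory
    flat: true
  loop: "{{ peer_files.files }}"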
Created attachment 1685621 [details] host deploy log file
Prajith and I were able to figure out the problem here.

When 'Reconfigure Gluster' is enabled while opting for 'Reinstall Host', the replace host procedure works with the front-end FQDN, but the gluster volumes and peers are configured with the back-end FQDN. The use of two network FQDNs was not considered in the initial requirement. A reset-brick implementation is also required to restore the volume configuration.

Apart from this, as of now 'Reconfigure Gluster' is always checked when trying to 'Reinstall Host'. This should not be the case, as there are other use cases where the user may prefer a plain 'Reinstall Host'.

To summarize, the new changes should include:
1. Make 'Reconfigure Gluster' disabled by default
2. Code flexibility so that both single-interface nodes and dedicated network interfaces for gluster and ovirtmgmt work
3. Implement reset-brick as well (a sketch of the reset-brick command sequence follows below)
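For reference, a minimal sketch of the reset-brick sequence mentioned in item 3, written as Ansible command tasks; volname, backend_fqdn and brick_path are placeholder variables and this is not necessarily how the role will implement it:

# Sketch only: gluster reset-brick sequence for one brick of one volume.
- name: Start reset-brick for the affected brick
  ansible.builtin.command: >
    gluster volume reset-brick {{ volname }}
    {{ backend_fqdn }}:{{ brick_path }} start

- name: Commit reset-brick once the replacement brick is ready
  ansible.builtin.command: >
    gluster volume reset-brick {{ volname }}
    {{ backend_fqdn }}:{{ brick_path }}
    {{ backend_fqdn }}:{{ brick_path }} commit force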
This feature failed to consider the requirement of 2 networks - ovirtmgmt and gluster_network. To incorporate that into the flow, a small change is required in ovirt-engine that translates the front-end FQDN/IP into the back-end FQDN/IP. That involves more work, so this bug is now moving to RHV 4.4.z.

The partially implemented feature in the RHV Manager UI will be rolled back and tracked with this bug - https://bugzilla.redhat.com/show_bug.cgi?id=1840083
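As an illustration of what the translation might involve at the playbook level (not the actual ovirt-engine change): if the name of the gluster-dedicated interface is known, the back-end address can be read from the gathered facts of the front-end host. The gluster_iface variable and the peer-probe step below are assumptions for the sketch.

# Sketch only: derive the back-end (gluster network) address of the replaced
# node from its gathered facts, given the interface name in gluster_iface.
- name: Resolve the back-end address of the replaced node
  ansible.builtin.set_fact:
    backend_addr: "{{ hostvars[oldNode]['ansible_' + gluster_iface]['ipv4']['address'] }}"

- name: Probe the node on its back-end address
  ansible.builtin.command: "gluster peer probe {{ backend_addr }}"
  delegate_to: "{{ clusterNode_1 }}"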
Tested with RHV 4.4.3 (4.4.3.12-0.1.el8ev), but the replace host procedure failed.

Tested with the following steps:
1. Created a 3 node RHHI-V deployment
2. Created a separate gluster network for gluster traffic and attached it to the HC hosts
3. Simulated a failure of one of the nodes by abruptly powering it off
4. Reinstalled the OS on that node, without formatting the other disks
5. Copied the authorized_keys from the other hosts to the newly installed host
6. Removed the LVM filter from /etc/lvm/lvm.conf
7. Performed the replace host procedure from the RHV admin portal, and that failed

The relevant logs will be attached soon. As this bug is not a blocker for the release, retargeting this bug for RHV 4.4.4.

The replace host playbook still works well for replacing the same host.
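For reference, the kind of checks run from a healthy node to confirm that a replaced node rejoined the peer network and healing completed; expressed as Ansible tasks for consistency with the role, with volname as a placeholder:

# Sketch only: post-replacement verification from a healthy node.
- name: Check that the replaced node is back in the peer network
  ansible.builtin.command: gluster peer status
  register: peer_status
  changed_when: false

- name: Check self-heal progress on a volume (volname is a placeholder)
  ansible.builtin.command: "gluster volume heal {{ volname }} info summary"
  register: heal_summary
  changed_when: false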
Created attachment 1731305 [details] host-deploy logfile
Created attachment 1731309 [details] engine.log
Created attachment 1731313 [details] vdsm.log
This RFE has no customer cases, was filed >2 years ago and was not implemented - I suggest closing-wontfix (or deferred).
(In reply to Yaniv Kaul from comment #21)
> This RFE has no customer cases, was filed >2 years ago and was not
> implemented - I suggest closing-wontfix (or deferred).

Thanks Yaniv. QE is in favor of the proposal. Ack to close this bug as CLOSED-WONTFIX.
We should also make sure to remove the changes made so far on the engine side.

@Gobinda, could you take care of the required items in this context?
(In reply to SATHEESARAN from comment #22)
> (In reply to Yaniv Kaul from comment #21)
> > This RFE has no customer cases, was filed >2 years ago and was not
> > implemented - I suggest closing-wontfix (or deferred).
>
> Thanks Yaniv. QE is in favor of the proposal. Ack to close this bug as CLOSED-WONTFIX.
> We should also make sure to remove the changes made so far on the engine side.
>
> @Gobinda, could you take care of the required items in this context?

For now, moving this bug out of RHHI-V 1.8.3.
Based on Comment #21 and Comment #22, I am closing this bug.