Bug 1633126

Summary: [RFE] Replacing the host with same host post reprovisioning from RHV Manager UI
Product: [Red Hat Storage] Red Hat Gluster Storage Reporter: SATHEESARAN <sasundar>
Component: rhhi    Assignee: Prajith <pkesavap>
Status: CLOSED DEFERRED QA Contact: SATHEESARAN <sasundar>
Severity: high Docs Contact:
Priority: high    
Version: rhhiv-1.8    CC: godas, rcyriac, rhs-bugs, sabose, sanandpa, sasundar
Target Milestone: ---   
Target Release: ---   
Hardware: x86_64   
OS: Linux   
Whiteboard:
Fixed In Version: Doc Type: No Doc Update
Doc Text:
Story Points: ---
Clone Of: 1632158
: 1898430 (view as bug list) Environment:
Last Closed: 2021-01-11 05:24:04 UTC Type: Bug
Regression: --- Mount Type: ---
Documentation: --- CRM:
Verified Versions: Category: ---
oVirt Team: --- RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: --- Target Upstream Version:
Embargoed:
Bug Depends On: 1438386, 1641431, 1826282    
Bug Blocks: 1898430    
Attachments:
Description            Flags
host deploy log file   none
host-deploy logfile    none
engine.log             none
vdsm.log               none

Description SATHEESARAN 2018-09-26 09:02:54 UTC
Description of problem:
-----------------------
Node replacement is a tedious and error-prone process.
Provide an Ansible role to perform the node replacement.

Comment 10 Prajith 2020-03-11 12:01:05 UTC
The role has been provided and integrated on the rhhi-engine side; it is embedded in the reinstall flow from rhhi-engine.

The role requires three parameters, and two conditions must be satisfied before the playbook runs:
the cluster must be gluster-supported, and that cluster must contain a minimum of 3 hosts.
The parameters are (a minimal invocation sketch follows the list):

(i)   oldNode: the node that needs to be reinstalled
(ii)  clusterNode_1: the first maintenance (healthy) node
(iii) clusterNode_2: the second maintenance (healthy) node
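
A minimal sketch of how those parameters might be passed for a manual run of the playbook (replace-gluster.yml is the playbook name seen later in the logs; the variable names below simply mirror the list above, and the exact names/format the engine uses internally are an assumption):

<snip>
# replace-gluster-vars.yml -- hypothetical extra-vars file for a manual run.
# Invoked roughly as:
#   ansible-playbook replace-gluster.yml -e @replace-gluster-vars.yml
oldNode: host3.example.com        # node being reprovisioned / reinstalled
clusterNode_1: host1.example.com  # first healthy (maintenance) node
clusterNode_2: host2.example.com  # second healthy (maintenance) node
</snip>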


As of now, this reconfigure-gluster role is triggered whenever the cluster is gluster-supported. It is included in the reinstall flow, i.e. every time a node is reinstalled from the rhhi-engine side, that node's gluster configuration is rebuilt and the node is added back to the gluster peer network. (A front-end checkbox is under development that lets customers choose whether the selected node should have gluster reconfigured during reinstall: reconfigureGluster is only called if the customer ticks the checkbox; otherwise the node is reinstalled as usual.)

What the role does:

The role deletes the existing gluster configuration directory and removes the affected node from the gluster peer network. It then gets the peer info and other details of the corrupted node from the other two nodes, reconfigures it, and adds it back to the peer network, after which self-heal runs and all the nodes come back in sync. A rough sketch of this flow is shown below.
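
A standalone sketch of that flow as Ansible tasks (the variables old_node, healthy_node and volname, and the task breakdown itself, are illustrative assumptions; the real role also restores peer info and restarts services, which is omitted here):

<snip>
- name: Remove the stale gluster configuration on the reprovisioned node
  ansible.builtin.file:
    path: /var/lib/glusterd
    state: absent
  delegate_to: "{{ old_node }}"

- name: Detach the affected node from the trusted storage pool
  ansible.builtin.command: gluster peer detach {{ old_node }} force
  delegate_to: "{{ healthy_node }}"

- name: Probe the node again so it rejoins the gluster peer network
  ansible.builtin.command: gluster peer probe {{ old_node }}
  delegate_to: "{{ healthy_node }}"

- name: Trigger a full self-heal so all nodes get back in sync
  ansible.builtin.command: gluster volume heal {{ volname }} full
  delegate_to: "{{ healthy_node }}"
</snip>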

Comment 11 SATHEESARAN 2020-05-06 09:52:34 UTC
Tested with RHV 4.4.0-33

Used the following steps:
1. Created a 3-node RHHI-V cluster
2. Randomly powered off one of the nodes (not the primary volfile server node)
3. Formatted all the disks on that node and reinstalled it fresh with the new RHVH 4.4 ISO
4. Prepared the node with the node_prep_inventory.yml inventory and node_prep.yml playbook (as in the RHHI-V maintenance guide)
      a. This playbook configures the node, creates the bricks and prepares the node to become part of the existing RHHI-V cluster
5. The RHV Manager UI still shows the host as 'inactive'; moved that host to maintenance
6. Copied the ovirt-engine SSH public key and added it to the host's ~/.ssh/authorized_keys file (see the sketch after this list)
7. Selected the 'Reinstall' option for that host with 'Reconfigure Gluster' enabled
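
For step 6, a sketch of how the Manager's public SSH key can be authorized on the freshly installed host; the pki-resource URL is the documented way to fetch the engine's SSH public key, while engine_fqdn and new_host are placeholder variables:

<snip>
- name: Fetch the Manager's public SSH key from the engine
  ansible.builtin.uri:
    url: "https://{{ engine_fqdn }}/ovirt-engine/services/pki-resource?resource=engine-certificate&format=OPENSSH-PUBKEY"
    return_content: true
    validate_certs: false   # lab shortcut; keep certificate validation on in production
  register: engine_pubkey
  delegate_to: localhost

- name: Add the key to root's authorized_keys on the reinstalled host
  ansible.posix.authorized_key:
    user: root
    state: present
    key: "{{ engine_pubkey.content }}"
  delegate_to: "{{ new_host }}"
</snip>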


When trying to replace the host from RHV Manager UI, the process fails.

<snip>
2020-05-06 09:39:33 UTC - TASK [gluster-replace-peers : Copy the details of the old node] ****************
2020-05-06 09:39:39 UTC - fatal: [rhsqa-grafton7-nic2.lab.eng.blr.redhat.com -> rhsqa-grafton7-nic2.lab.eng.blr.redhat.com]: FAILED! => {"changed": false, "file": "/var/lib/glusterd/peers/", "msg": "remote file is a directory, fetch cannot work on directories"}
2020-05-06 09:39:39 UTC - {
  "status" : "OK",
  "msg" : "",
  "data" : {
    "uuid" : "0fb9e7d6-5f9c-41da-b5c0-5838e82f166f",
    "counter" : 26,
    "stdout" : "fatal: [rhsqa-grafton7-nic2.lab.eng.blr.redhat.com -> rhsqa-grafton7-nic2.lab.eng.blr.redhat.com]: FAILED! => {\"changed\": false, \"file\": \"/var/lib/glusterd/peers/\", \"msg\": \"remote file is a directory, fetch cannot work on directories\"}",
    "start_line" : 25,
    "end_line" : 26,
    "runner_ident" : "6de8fd8a-8f7d-11ea-8f1c-004755204901",
    "event" : "runner_on_failed",
    "pid" : 115000,
    "created" : "2020-05-06T09:39:36.812789",
    "parent_uuid" : "00475520-4901-8085-eb1a-000000000016",
    "event_data" : {
      "playbook" : "replace-gluster.yml",
      "playbook_uuid" : "4c94465f-6dda-4f21-b845-35b09f617834",
      "play" : "all",
      "play_uuid" : "00475520-4901-8085-eb1a-000000000008",
      "play_pattern" : "all",
      "task" : "Copy the details of the old node",
      "task_uuid" : "00475520-4901-8085-eb1a-000000000016",
      "task_action" : "fetch",
      "task_args" : "",
      "task_path" : "/usr/share/ovirt-engine/ansible-runner-service-project/project/roles/gluster-replace-peers/tasks/peers.yml:42",
      "role" : "gluster-replace-peers",
      "host" : "rhsqa-grafton7-nic2.lab.eng.blr.redhat.com",
      "remote_addr" : "rhsqa-grafton7-nic2.lab.eng.blr.redhat.com",
      "res" : {
        "file" : "/var/lib/glusterd/peers/",
        "msg" : "remote file is a directory, fetch cannot work on directories",
        "_ansible_no_log" : false,
        "changed" : false,
        "_ansible_delegated_vars" : {
          "ansible_host" : "rhsqa-grafton7-nic2.lab.eng.blr.redhat.com"
        }
      },
      "start" : "2020-05-06T09:39:33.044182",
      "end" : "2020-05-06T09:39:36.812541",
      "duration" : 3.768359,
      "ignore_errors" : null,
      "event_loop" : null,
      "uuid" : "0fb9e7d6-5f9c-41da-b5c0-5838e82f166f"
    }
  }
}
</snip>
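
The failure itself is an Ansible limitation: the fetch module only works on single files, so pointing it at the /var/lib/glusterd/peers/ directory fails. One generic way around it, shown purely as an illustration and not necessarily the fix that was eventually applied, is to enumerate the files first and fetch them one by one:

<snip>
- name: List the peer files on the healthy node
  ansible.builtin.find:
    paths: /var/lib/glusterd/peers
  register: peer_files
  delegate_to: "{{ healthy_node }}"

- name: Fetch each peer file individually
  ansible.builtin.fetch:
    src: "{{ item.path }}"
    dest: /tmp/old_node_peers/   # trailing slash + flat keeps just the file names
    flat: true
  loop: "{{ peer_files.files }}"
  delegate_to: "{{ healthy_node }}"
</snip>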

Comment 12 SATHEESARAN 2020-05-06 09:54:15 UTC
Created attachment 1685621 [details]
host deploy log file

Comment 13 SATHEESARAN 2020-05-06 13:33:05 UTC
Prajith and I were able to figure out the problem here.

When 'Reconfigure Gluster' is enabled while opting for 'Reinstall Host', the replace-host
procedure works with the front-end FQDN, but the gluster volumes and peers
are configured with the back-end FQDN.

The use of two network FQDNs was not initially considered in the requirement.
A reset-brick implementation is also required to restore the volume configuration.

Apart from this, as of now 'Reconfigure Gluster' is always checked when trying to 'Reinstall Host'.
This should not be the case, as there are other use cases where the user may prefer a plain 'Reinstall Host'.

To summarize, the new changes should include:
1. Make 'Reconfigure Gluster' disabled by default
2. Code flexibility so that both single-interface nodes and nodes with dedicated network interfaces for gluster and ovirtmgmt work
3. Implement reset-brick as well (a sketch of the reset-brick sequence follows this list)
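
For item 3, the standard gluster reset-brick CLI sequence sketched as tasks; volname, backend_fqdn and brick_path are placeholders, and the back-end FQDN is used because that is what gluster knows the brick by:

<snip>
- name: Take the affected brick offline for reset
  ansible.builtin.command: >
    gluster volume reset-brick {{ volname }}
    {{ backend_fqdn }}:{{ brick_path }} start

- name: Commit the same brick back so it is rebuilt from the healthy replicas
  ansible.builtin.command: >
    gluster volume reset-brick {{ volname }}
    {{ backend_fqdn }}:{{ brick_path }}
    {{ backend_fqdn }}:{{ brick_path }} commit force
</snip>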

Comment 14 SATHEESARAN 2020-05-27 04:24:22 UTC
This feature missed considering the requirement of two networks - ovirtmgmt and gluster_network.

To incorporate that into the flow, a small change is required in ovirt-engine to help
translate the front-end FQDN/IP to the back-end FQDN/IP. That involves more work, so
this bug is now moving to RHV 4.4.z.

The partially implemented feature in the RHV Manager UI will be rolled back and tracked with this
bug - https://bugzilla.redhat.com/show_bug.cgi?id=1840083

Comment 17 SATHEESARAN 2020-11-20 15:38:04 UTC
Tested with RHV 4.4.3 (4.4.3.12-0.1.el8ev), but the replace-host procedure failed.

Tested with the following steps:

1. Created a 3-node RHHI-V deployment
2. Created a separate logical network for gluster traffic and attached it to the HC hosts
3. Simulated a failure of one of the nodes by abruptly powering it off
4. Reinstalled the OS on that node, without formatting the other disks
5. Copied the authorized_keys from the other hosts to the newly installed host
6. Removed the LVM filter from /etc/lvm/lvm.conf (see the sketch after this list)
7. Performed the replace-host procedure from the RHV Administration Portal, and it failed
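
Step 6 above can be done by hand or with a small task like this sketch; the regexp simply drops any active filter line from lvm.conf:

<snip>
- name: Remove the LVM filter line from lvm.conf on the reinstalled host
  ansible.builtin.lineinfile:
    path: /etc/lvm/lvm.conf
    regexp: '^\s*filter\s*='
    state: absent
</snip>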

The relevant logs will be attached soon.
As this bug is not a blocker for the release, retargeting it to RHV 4.4.4.
The replace-host playbook itself still works well for replacing the same host.

Comment 18 SATHEESARAN 2020-11-20 15:47:27 UTC
Created attachment 1731305 [details]
host-deploy logfile

Comment 19 SATHEESARAN 2020-11-20 15:50:22 UTC
Created attachment 1731309 [details]
engine.log

Comment 20 SATHEESARAN 2020-11-20 15:53:21 UTC
Created attachment 1731313 [details]
vdsm.log

Comment 21 Yaniv Kaul 2020-12-16 13:36:36 UTC
This RFE has no customer cases, was filed more than 2 years ago, and was not implemented - I suggest closing as WONTFIX (or DEFERRED).

Comment 22 SATHEESARAN 2020-12-16 13:49:37 UTC
(In reply to Yaniv Kaul from comment #21)
> This RFE has no customer cases, was filed more than 2 years ago, and was not
> implemented - I suggest closing as WONTFIX (or DEFERRED).

Thanks Yaniv. QE is in favor of the proposal. Ack to close this bug as CLOSED-WONTFIX.
We should also make sure to remove the changes made so far on the engine side.

@Gobinda, could you take care of the required items in this context?

Comment 23 SATHEESARAN 2020-12-16 13:50:46 UTC
(In reply to SATHEESARAN from comment #22)
> (In reply to Yaniv Kaul from comment #21)
> > This RFE has no customer cases, was filed more than 2 years ago, and was not
> > implemented - I suggest closing as WONTFIX (or DEFERRED).
> 
> Thanks Yaniv. QE is in favor of the proposal. Ack to close this bug as
> CLOSED-WONTFIX.
> We should also make sure to remove the changes made so far on the engine side.
> 
> @Gobinda, could you take care of the required items in this context?

For now moving this bug out of RHHI-V 1.8.3

Comment 24 Gobinda Das 2021-01-11 05:24:04 UTC
Based on Comment#21 and Comment#22, I am closing this bug.