Description of problem:
-----------------------
Node replacement is a tedious and error prone process. Provide an Ansible role to perform the node replacement.
The role has been provided and integrated on the rhhi-engine side, and is embedded in the reinstall flow from rhhi-engine. The role requires 3 parameters, and two conditions must be satisfied to run the playbook: gluster must be supported in the cluster, and that gluster-supported cluster must have a minimum of 3 hosts.

(i) oldNode: the node that needs to be reinstalled
(ii) clusterNode_1: the maintenance node
(iii) clusterNode_2: the second maintenance node

As of now this reconfigure-gluster role is triggered if the cluster is gluster supported. It is included in the reinstall flow, i.e. every time a node is reinstalled from the rhhi-engine side, the node's gluster configuration gets reconfigured and the node is added back to the gluster peer network. (A front-end checkbox is under WIP, which enables customers to choose whether the selected node should have its gluster reconfigured while reinstalling; reconfigureGluster is only called if the customer checks the checkbox, else the node is reinstalled as usual.)

What the role does:
The role deletes the existing gluster directory and removes the affected node from the gluster peer network. It then gets the peer info and other details of the corrupted node from the other two nodes, reconfigures the node, and adds it back to the gluster peer network, after which self-heal happens and all the nodes end up back in sync.
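For clarity, a minimal sketch of how the role might be applied manually with the three parameters described above; the inventory group name and the example FQDNs are assumptions for illustration, not part of the actual rhhi-engine flow:

# Sketch only: apply the gluster-replace-peers role outside the engine's
# reinstall flow. Variable names follow the parameters described above;
# the group name "gluster_cluster" and the FQDNs are placeholders.
- hosts: gluster_cluster
  become: true
  vars:
    oldNode: replaced-node.example.com        # node being reinstalled
    clusterNode_1: healthy-node-1.example.com # first maintenance node
    clusterNode_2: healthy-node-2.example.com # second maintenance node
  roles:
    - gluster-replace-peers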
Tested with RHV 4.4.0-33

Used the following steps:
1. Created a 3 node RHHI-V cluster
2. Randomly powered off one of the nodes (not the primary volfile server node)
3. Formatted all the disks on that node, and reinstalled the node freshly with the new RHVH 4.4 ISO
4. Prepared the node with the node_prep_inventory.yml and node_prep.yml playbooks (as in the RHHI-V maintenance guide)
   a. This playbook configures the node, creates the bricks and prepares the node to become part of the existing RHHI-V cluster
5. The RHV Manager UI still shows that the host is 'inactive'; moved that host to maintenance
6. Copied the ovirt-engine SSH public key and added it to the ~/.ssh/authorized_keys file
7. Selected the 'Reinstall' option after selecting that host, with 'Reconfigure Gluster' enabled

When trying to replace the host from the RHV Manager UI, the process fails.

<snip>
2020-05-06 09:39:33 UTC - TASK [gluster-replace-peers : Copy the details of the old node] ****************

2020-05-06 09:39:39 UTC - fatal: [rhsqa-grafton7-nic2.lab.eng.blr.redhat.com -> rhsqa-grafton7-nic2.lab.eng.blr.redhat.com]: FAILED! => {"changed": false, "file": "/var/lib/glusterd/peers/", "msg": "remote file is a directory, fetch cannot work on directories"}

2020-05-06 09:39:39 UTC - { "status" : "OK", "msg" : "", "data" : { "uuid" : "0fb9e7d6-5f9c-41da-b5c0-5838e82f166f", "counter" : 26, "stdout" : "fatal: [rhsqa-grafton7-nic2.lab.eng.blr.redhat.com -> rhsqa-grafton7-nic2.lab.eng.blr.redhat.com]: FAILED! => {\"changed\": false, \"file\": \"/var/lib/glusterd/peers/\", \"msg\": \"remote file is a directory, fetch cannot work on directories\"}", "start_line" : 25, "end_line" : 26, "runner_ident" : "6de8fd8a-8f7d-11ea-8f1c-004755204901", "event" : "runner_on_failed", "pid" : 115000, "created" : "2020-05-06T09:39:36.812789", "parent_uuid" : "00475520-4901-8085-eb1a-000000000016", "event_data" : { "playbook" : "replace-gluster.yml", "playbook_uuid" : "4c94465f-6dda-4f21-b845-35b09f617834", "play" : "all", "play_uuid" : "00475520-4901-8085-eb1a-000000000008", "play_pattern" : "all", "task" : "Copy the details of the old node", "task_uuid" : "00475520-4901-8085-eb1a-000000000016", "task_action" : "fetch", "task_args" : "", "task_path" : "/usr/share/ovirt-engine/ansible-runner-service-project/project/roles/gluster-replace-peers/tasks/peers.yml:42", "role" : "gluster-replace-peers", "host" : "rhsqa-grafton7-nic2.lab.eng.blr.redhat.com", "remote_addr" : "rhsqa-grafton7-nic2.lab.eng.blr.redhat.com", "res" : { "file" : "/var/lib/glusterd/peers/", "msg" : "remote file is a directory, fetch cannot work on directories", "_ansible_no_log" : false, "changed" : false, "_ansible_delegated_vars" : { "ansible_host" : "rhsqa-grafton7-nic2.lab.eng.blr.redhat.com" } }, "start" : "2020-05-06T09:39:33.044182", "end" : "2020-05-06T09:39:36.812541", "duration" : 3.768359, "ignore_errors" : null, "event_loop" : null, "uuid" : "0fb9e7d6-5f9c-41da-b5c0-5838e82f166f" } } }
</snip>
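For context, the failure above comes from an ansible.builtin.fetch task pointed at the directory /var/lib/glusterd/peers/ rather than at individual files; fetch only handles single files. A minimal sketch of how the peer files could be collected instead (task names and the local staging path are illustrative, not the actual fix that was shipped):

# Sketch only: enumerate the peer files on a healthy node, then fetch each one.
- name: Find the peer files on the healthy node
  ansible.builtin.find:
    paths: /var/lib/glusterd/peers/
    file_type: file
  register: peer_files

- name: Copy the details of the old node
  ansible.builtin.fetch:
    src: "{{ item.path }}"
    dest: /tmp/peers/        # illustrative local staging directory
    flat: true
  loop: "{{ peer_files.files }}"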
Created attachment 1685621 [details] host deploy log file
Prajith and I were able to figure out the problem here.

When 'Reconfigure Gluster' is enabled while opting for 'Reinstall Host', the replace host procedure works with the front-end FQDN, but the gluster volumes and peers are configured with the back-end FQDN. The use of two network FQDNs was not considered in the initial requirement. A reset-brick implementation is also required to restore the volume configuration.

Apart from this, as of now 'Reconfigure Gluster' is always checked when trying to 'Reinstall Host'. This should not be the case, as there are other use cases where the user may prefer a plain 'Reinstall Host'.

To summarize, the new changes should include:
1. Make 'Reconfigure Gluster' disabled by default
2. Code flexibility so that both single-interface nodes and dedicated network interfaces for gluster and ovirtmgmt work
3. Implement reset-brick as well (a sketch of the reset-brick command sequence follows below)
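For reference, a minimal sketch of the reset-brick sequence mentioned in item 3, written as Ansible command tasks; volname, backend_fqdn and brick_path are placeholder variables and this is not necessarily how the role will implement it:

# Sketch only: gluster reset-brick sequence for one brick of one volume.
- name: Start reset-brick for the affected brick
  ansible.builtin.command: >
    gluster volume reset-brick {{ volname }}
    {{ backend_fqdn }}:{{ brick_path }} start

- name: Commit reset-brick once the replacement brick is ready
  ansible.builtin.command: >
    gluster volume reset-brick {{ volname }}
    {{ backend_fqdn }}:{{ brick_path }}
    {{ backend_fqdn }}:{{ brick_path }} commit force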
This feature failed to consider the requirement of 2 networks - ovirtmgmt and gluster_network. To incorporate that into the flow, a small change is required in ovirt-engine that translates the front-end FQDN/IP into the back-end FQDN/IP. That involves more work, so this bug is now moving to RHV 4.4.z.

The partially implemented feature in the RHV Manager UI will be rolled back and tracked with this bug - https://bugzilla.redhat.com/show_bug.cgi?id=1840083
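As an illustration of what the translation might involve at the playbook level (not the actual ovirt-engine change): if the name of the gluster-dedicated interface is known, the back-end address can be read from the gathered facts of the front-end host. The gluster_iface variable and the peer-probe step below are assumptions for the sketch.

# Sketch only: derive the back-end (gluster network) address of the replaced
# node from its gathered facts, given the interface name in gluster_iface.
- name: Resolve the back-end address of the replaced node
  ansible.builtin.set_fact:
    backend_addr: "{{ hostvars[oldNode]['ansible_' + gluster_iface]['ipv4']['address'] }}"

- name: Probe the node on its back-end address
  ansible.builtin.command: "gluster peer probe {{ backend_addr }}"
  delegate_to: "{{ clusterNode_1 }}"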
Tested with RHV 4.4.3 (4.4.3.12-0.1.el8ev), but the replace host procedure failed.

Tested with the following steps:
1. Created a 3 node RHHI-V deployment
2. Created a separate gluster network for gluster traffic and attached it to the HC hosts
3. Simulated a failure of one of the nodes by abruptly powering it off
4. Reinstalled the OS on that node, without formatting the other disks
5. Copied the authorized_keys from the other hosts to the newly installed host
6. Removed the LVM filter from /etc/lvm/lvm.conf
7. Performed the replace host procedure from the RHV admin portal, and that failed

The relevant logs will be attached soon. As this bug is not a blocker for the release, retargeting this bug for RHV 4.4.4.

The replace host playbook still works well for replacing the same host.
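For reference, the kind of checks run from a healthy node to confirm that a replaced node rejoined the peer network and healing completed; expressed as Ansible tasks for consistency with the role, with volname as a placeholder:

# Sketch only: post-replacement verification from a healthy node.
- name: Check that the replaced node is back in the peer network
  ansible.builtin.command: gluster peer status
  register: peer_status
  changed_when: false

- name: Check self-heal progress on a volume (volname is a placeholder)
  ansible.builtin.command: "gluster volume heal {{ volname }} info summary"
  register: heal_summary
  changed_when: false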
Created attachment 1731305 [details] host-deploy logfile
Created attachment 1731309 [details] engine.log
Created attachment 1731313 [details] vdsm.log
This RFE has no customer cases, was filed >2 years ago and was not implemented - I suggest closing-wontfix (or deferred).
(In reply to Yaniv Kaul from comment #21)
> This RFE has no customer cases, was filed >2 years ago and was not
> implemented - I suggest closing-wontfix (or deferred).

Thanks Yaniv. QE is in favor of the proposal. Ack to close this bug as CLOSED-WONTFIX.
We should also make sure to remove the changes made so far on the engine side.

@Gobinda, could you take care of the required items in this context?
(In reply to SATHEESARAN from comment #22)
> (In reply to Yaniv Kaul from comment #21)
> > This RFE has no customer cases, was filed >2 years ago and was not
> > implemented - I suggest closing-wontfix (or deferred).
>
> Thanks Yaniv. QE is in favor of the proposal. Ack to close this bug as CLOSED-WONTFIX.
> We should also make sure to remove the changes made so far on the engine side.
>
> @Gobinda, could you take care of the required items in this context?

For now, moving this bug out of RHHI-V 1.8.3.
Based on Comment #21 and Comment #22, I am closing this bug.