Bug 1825836

Summary: [DOC] document manual cleanup of brick volumes after failed deployment
Product: [Red Hat Storage] Red Hat Gluster Storage
Reporter: Andreas Bleischwitz <ableisch>
Component: doc-Deploying_RHHI
Assignee: Laura Bailey <lbailey>
Status: CLOSED CURRENTRELEASE
QA Contact: SATHEESARAN <sasundar>
Severity: high
Priority: unspecified
Version: rhhiv-1.7
CC: asriram, lbailey, mmuench, rcyriac, rhs-bugs, sasundar
Target Release: RHHI-V 1.7.z Async Update
Hardware: All
OS: All
Fixed In Version: RHHI-V 1.7.z
Doc Type: If docs needed, set a value
Clones: 1832083 (view as bug list)
Last Closed: 2020-06-05 08:06:19 UTC
Type: Bug
Bug Blocks: 1832083

Description Andreas Bleischwitz 2020-04-20 11:17:44 UTC
Description of problem:
After a failed gluster-storage deployment, the documentation lacks advice on the manual cleanup of the bricks involved. This leads to failures when the deployment is retried.

Version-Release number of selected component (if applicable):
https://access.redhat.com/documentation/en-us/red_hat_hyperconverged_infrastructure_for_virtualization/1.7/html/deploying_red_hat_hyperconverged_infrastructure_for_virtualization/tshoot-deploy-error#failed_to_deploy_storage

How reproducible:
Always

Steps to Reproduce:
1. Deploy storage (and fail for some reason)
2. Follow chapter 13 to clean up
3. Re-run deployment

Actual results:
The second deployment fails due to existing VDO signatures on the brick devices. These need to be cleaned up manually.
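
A quick way to confirm the leftover signature on a suspected brick device (a minimal sketch; </dev/sdx> is a placeholder for the affected disk):
 # wipefs </dev/sdx>      # lists the signatures still present on the device without modifying it
 # blkid </dev/sdx>       # shows which signature type, if any, blkid detects on the device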

Expected results:
The documentation explains that this manual cleanup is required.

Additional info:

Comment 4 SATHEESARAN 2020-04-22 06:44:30 UTC
Hello Andreas,

I had a look at the cleanup log files from the sosreport (sosreport-cfinffm1cp999-02630953-2020-04-15-cfgjhmu.tar.xz),
which are located at '/sosreport-cfinffm1cp999-02630953-2020-04-15-cfgjhmu/var/log/cockpit/ovirt-dashboard/gluster-deployment_cleanup*log'.

Previous cleanup log files are rotated when a new cleanup log file is generated, and I see the following rotated cleanup log files:
[root@localhost ovirt-dashboard]$ ls gluster-deployment_cleanup* -1
gluster-deployment_cleanup-1586953088971.log
gluster-deployment_cleanup-1586953541116.log
gluster-deployment_cleanup.log

All of these cleanup logs indicate a failure to reach the hosts.
For example:
<snip>
gluster-deployment_cleanup-1586953541116.log:failed: [cfinffm1cp999.cloud.internal] (item={u'volname': u'vmstore', u'brick': u'/gluster_bricks/vmstore/vmstore', u'arbiter': 0}) => {"ansible_loop_var": "item", "item": {"arbiter": 0, "brick": "/gluster_bricks/vmstore/vmstore", "volname": "vmstore"}, "msg": "Failed to connect to the host via ssh: Host key verification failed.", "unreachable": true}
</snip>

So the 'cleanup' playbook failed because of 'Host key verification failed'.
As a result, the VDO volume was never removed, and the VDO signature was left on the disk even after reprovisioning.
This caused the subsequent deployment attempt to fail.
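
One possible way to restore SSH host key trust before retrying the cleanup (a sketch, assuming the stale entry is in the deployment host's known_hosts; <host> is a placeholder for each HC node):
 # ssh-keygen -R <host>      # remove the stale host key entry
 # ssh root@<host>           # reconnect once, verify the fingerprint and accept the new host key
Once the hosts are reachable again, the cleanup can be re-run from the web console so that the VDO volumes and signatures are removed as intended.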

The RHHI-V cleanup playbook does clean up the signatures on the disks.
However, the deployment playbook does *not* wipe the disks, to avoid the risk of overwriting data
in the case of erroneous input from the user. The intention is to FAIL in those cases and leave it
to the user to figure out the problem.

But I agree that this should be documented in the troubleshooting section: when such a VDO signature exists,
validate it and then remove the signature from the disk.
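
A minimal validation sketch with standard tools (the device path is a placeholder):
 # vdo list --all         # lists the VDO volumes configured on this host, started or not
 # wipefs </dev/sdx>      # shows whether a stale signature is still left on the disk itself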

Comment 7 SATHEESARAN 2020-04-22 07:07:45 UTC
@Laura, can we add a case to the troubleshooting section along these lines: if the deployment fails because of a VDO signature on the disk,
first make sure that the disk is the intended one and is not being overwritten by accident, and only then, after confirmation,
remove the signature from the disk.

Symptom:
--------
<error_in_ansible_playbook_execution>

TASK [gluster.infra/roles/backend_setup : Create VDO with specified size] ******
task path: /etc/ansible/roles/gluster.infra/roles/backend_setup/tasks/vdo_create.yml:9
failed: [host1.example.com] (item={u'writepolicy': u'auto', u'name': u'vdo_sdb', u'readcachesize': u'20M', u'readcache': u'enabled', u'emulate512': u'off', u'logicalsize': u'11000G', u'device': u'/dev/sdb', u'slabsize': u'32G', u'blockmapcachesize': u'128M'}) => {"ansible_loop_var": "item", "changed": false, "err": "vdo: ERROR - vdo signature detected on /dev/sdb at offset 0; use --force to override\n", "item": {"blockmapcachesize": "128M", "device": "/dev/sdb", "emulate512": "off", "logicalsize": "11000G", "name": "vdo_sdb", "readcache": "enabled", "readcachesize": "20M", "slabsize": "32G", "writepolicy": "auto"}, "msg": "Creating VDO vdo_sdb failed.", "rc": 5}

</error_in_ansible_playbook_execution>

Highlight
-----------
>>> vdo signature detected on /dev/sdb at offset 0

Users should be made aware of the following:
----------------------
The above error occurs in two situations:
      Situation 1 - A VDO device created with this disk (/dev/sdb) still exists
      Situation 2 - Only a VDO signature is present on the disk, with no corresponding VDO device

Situation 1 will not happen if the user performs a successful cleanup from the Cockpit UI after the
deployment failure. It is therefore highly recommended to run the 'cleanup' option from the web console.

Situation 2 can happen when the RHVH/HC nodes are reprovisioned without a proper cleanup; in this
case a VDO signature may be left on the device with no corresponding VDO device.
In this case, wipe the VDO signature manually. Make sure the signatures are wiped only from the
intended disks; wiping the signature from an incorrect disk may lead to data loss.
 # blkid </dev/sdx>
 # wipefs -a </dev/sdx>
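
Putting the checks together, a possible sequence on each affected host (a sketch; confirm the device path against your inventory before wiping anything):
 # vdo list --all         # confirm that no VDO volume is still defined on this host
 # wipefs </dev/sdx>      # review the signatures that would be removed
 # wipefs -a </dev/sdx>   # wipe them only after confirming this is the intended brick device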

Comment 11 SATHEESARAN 2020-04-30 14:35:00 UTC
@Laura, can we target this fix for RHHI-V 1.7? I will create a separate bug for RHHI-V 1.8.

Comment 12 SATHEESARAN 2020-04-30 14:35:54 UTC
(In reply to SATHEESARAN from comment #11)
> @Laura, can we target this fix for RHHI-V 1.7? I will create a separate bug
> for RHHI-V 1.8.

As this bug already has all the acks, I can clone it for the RHHI-V 1.7 docs. Can we also fix it in the RHHI-V 1.7 docs?

Comment 18 SATHEESARAN 2020-05-16 14:20:18 UTC
Thanks, Laura, for your quick effort to address this requirement in both the 1.8 and 1.7 guides.

Verified the content against the 1.7 documentation.