Bug 1825836
Summary: | [DOC] document manual cleanup of brick volumes after failed deployment | |
---|---|---|---
Product: | [Red Hat Storage] Red Hat Gluster Storage | Reporter: | Andreas Bleischwitz <ableisch>
Component: | doc-Deploying_RHHI | Assignee: | Laura Bailey <lbailey>
Status: | CLOSED CURRENTRELEASE | QA Contact: | SATHEESARAN <sasundar>
Severity: | high | Docs Contact: |
Priority: | unspecified | |
Version: | rhhiv-1.7 | CC: | asriram, lbailey, mmuench, rcyriac, rhs-bugs, sasundar
Target Milestone: | --- | |
Target Release: | RHHI-V 1.7.z Async Update | |
Hardware: | All | |
OS: | All | |
Whiteboard: | | |
Fixed In Version: | RHHI-V 1.7.z | Doc Type: | If docs needed, set a value
Doc Text: | | Story Points: | ---
Clone Of: | | |
: | 1832083 (view as bug list) | Environment: |
Last Closed: | 2020-06-05 08:06:19 UTC | Type: | Bug
Regression: | --- | Mount Type: | ---
Documentation: | --- | CRM: |
Verified Versions: | | Category: | ---
oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: |
Cloudforms Team: | --- | Target Upstream Version: |
Embargoed: | | |
Bug Depends On: | | |
Bug Blocks: | 1832083 | |
Description
Andreas Bleischwitz
2020-04-20 11:17:44 UTC
Hello Andreas,

I had a look at the cleanup log files from the sosreport (sosreport-cfinffm1cp999-02630953-2020-04-15-cfgjhmu.tar.xz), located at '/sosreport-cfinffm1cp999-02630953-2020-04-15-cfgjhmu/var/log/cockpit/ovirt-dashboard/gluster-deployment_cleanup*log'. Previous cleanup log files are rotated whenever a new cleanup log file is generated, and I see the following cleanup log files:

[root@localhost ovirt-dashboard]$ ls gluster-deployment_cleanup* -1
gluster-deployment_cleanup-1586953088971.log
gluster-deployment_cleanup-1586953541116.log
gluster-deployment_cleanup.log

All of these cleanup logs indicate a failure to reach the hosts. For example:

<snip>
gluster-deployment_cleanup-1586953541116.log:failed: [cfinffm1cp999.cloud.internal] (item={u'volname': u'vmstore', u'brick': u'/gluster_bricks/vmstore/vmstore', u'arbiter': 0}) => {"ansible_loop_var": "item", "item": {"arbiter": 0, "brick": "/gluster_bricks/vmstore/vmstore", "volname": "vmstore"}, "msg": "Failed to connect to the host via ssh: Host key verification failed.", "unreachable": true}
</snip>

So the 'cleanup' playbook failed because of 'Host key verification failed'. Because of that, removal of the VDO volume never happened, and the VDO signature was left on the disk even after reprovisioning. This is what caused the subsequent deployment attempt to fail.

The RHHI-V cleanup playbook does clean up the signature on the disk. The deployment playbook, however, does *not* wipe the disk, to avoid the risk of overwriting data in the case of erroneous input from the user. The intention is to FAIL in those cases and leave it to the user to figure out the problem. But I agree that this should be documented in the troubleshooting section: when such a VDO signature exists, validate the disk and then remove the signature.

@Laura, can we add a case to the troubleshooting section: if a deployment failure occurs because of a VDO signature on the disk, first make sure that the disk value refers to the right disk and that it is not being overwritten by accident, and only after that confirmation remove the signature from the disk.
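For reference, a minimal sketch of how the host key problem could be confirmed and fixed before re-running the cleanup. This is only my suggestion, not part of the documented procedure; the hostname is a placeholder, and it assumes passwordless root ssh between the nodes is already set up (as required for deployment).

Reproduce the failure non-interactively, the same way ansible connects:
# ssh -o BatchMode=yes root@host1.example.com true
Accept the host key and verify that the connection now succeeds:
# ssh-keyscan -H host1.example.com >> /root/.ssh/known_hosts
# ssh -o BatchMode=yes root@host1.example.com true

Then re-run the 'cleanup' option from the web console so that the playbook can reach all hosts.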
Symptom:
--------
<error_in_ansible_playbook_execution>
TASK [gluster.infra/roles/backend_setup : Create VDO with specified size] ******
task path: /etc/ansible/roles/gluster.infra/roles/backend_setup/tasks/vdo_create.yml:9
failed: [host1.example.com] (item={u'writepolicy': u'auto', u'name': u'vdo_sdb', u'readcachesize': u'20M', u'readcache': u'enabled', u'emulate512': u'off', u'logicalsize': u'11000G', u'device': u'/dev/sdb', u'slabsize': u'32G', u'blockmapcachesize': u'128M'}) => {"ansible_loop_var": "item", "changed": false, "err": "vdo: ERROR - vdo signature detected on /dev/sdb at offset 0; use --force to override\n", "item": {"blockmapcachesize": "128M", "device": "/dev/sdb", "emulate512": "off", "logicalsize": "11000G", "name": "vdo_sdb", "readcache": "enabled", "readcachesize": "20M", "slabsize": "32G", "writepolicy": "auto"}, "msg": "Creating VDO vdo_sdb failed.", "rc": 5}
</error_in_ansible_playbook_execution>
Highlight
-----------
>>> vdo signature detected on /dev/sdb at offset 0
The user needs to be made aware of:
------------------------------------
The above error occurs in two situations:
Situation 1 - A VDO device created with this disk (/dev/sdb) still exists.
Situation 2 - Only a VDO signature is present on the disk, but no VDO device exists.
Situation 1 does not occur if the user performs a successful cleanup from the cockpit UI after the deployment failure. It is therefore highly recommended to run the 'cleanup' option from the web console.
Situation 2 can occur when the RHVH/HC nodes are reprovisioned without a proper cleanup; in that case a VDO signature may be left on the device, with no corresponding VDO device. A way to tell the two situations apart is sketched below.
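Only as a sketch (it assumes the 'vdo' management tool shipped with RHEL 7, and /dev/sdb is just the disk from the example above), the two situations can be told apart like this:

List all configured VDO volumes and check what is layered on the disk:
# vdo list --all
# lsblk /dev/sdb

If a VDO volume still uses the disk (situation 1), remove it via the cleanup playbook, or with 'vdo remove --name=<vdo_name>', rather than wiping the disk directly. If no VDO volume uses the disk but blkid still reports a VDO signature, you are in situation 2.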
In situation 2, wipe the VDO signature manually. Make sure the signature is wiped from the intended disk only; wiping the signature from the wrong disk may lead to data loss.
# blkid </dev/sdx>
# wipefs -a </dev/sdx>
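Expanded slightly, and only as a sketch (the device name is a placeholder, and the 'wipefs -n' dry run is my own suggestion rather than part of the documented steps):

Confirm the signature and make sure this really is the intended disk:
# blkid /dev/sdX
# lsblk /dev/sdX
Preview what wipefs would erase, without writing anything:
# wipefs -n /dev/sdX
Erase the stale VDO signature:
# wipefs -a /dev/sdX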
@Laura, can we target this fix for RHHI-V 1.7? I will create a separate bug for RHHI-V 1.8.

(In reply to SATHEESARAN from comment #11)
> @Laura, can we target this fix for RHHI-V 1.7? I will create a separate bug
> for RHHI-V 1.8.

As this bug already has all the acks, I can clone it for the RHHI-V 1.7 docs; can we also fix it in the RHHI-V 1.7 docs?

Thanks Laura for your quick effort to address the requirement in the 1.8 as well as the 1.7 guides.

Verified the content against the 1.7 doc content.