Bug 1933983 - [CEPH ISCSI GW] HE VM disk is corrupted after qemu-img convert to iSCSI CEPH volume
Keywords:
Status: CLOSED CURRENTRELEASE
Alias: None
Product: ovirt-hosted-engine-setup
Classification: oVirt
Component: General
Version: 2.4.9
Hardware: Unspecified
OS: Unspecified
Priority: unspecified
Severity: urgent
Target Milestone: ovirt-4.4.7
Target Release: ---
Assignee: Eyal Shenitzky
QA Contact: Petr Kubica
URL:
Whiteboard:
Depends On: 1934092
Blocks:
 
Reported: 2021-03-02 08:11 UTC by Petr Kubica
Modified: 2021-07-06 07:28 UTC (History)

Fixed In Version:
Clone Of:
Environment:
Last Closed: 2021-07-06 07:28:18 UTC
oVirt Team: Storage
Embargoed:
pm-rhel: ovirt-4.4+
lsvaty: blocker-



Description Petr Kubica 2021-03-02 08:11:30 UTC
Description of problem:
This is a tracker bug for a failure of RHV SHE (self-hosted engine) deployment on Ceph iSCSI.

During HE deployment, the Ansible playbook copies the local HE VM disk to the destination storage, and then the HE VM is powered up.

The problem is in this step:
# qemu-img convert -f qcow2 -O raw -t none -T none /var/tmp/localvm_ee80jvz/images/edf95fe8-29cc-4f85-85ff-de4ad2dfa4f6/c4143adf-1131-4318-8c31-71c299ef085b /rhev/data-center/mnt/blockSD/5588aa11-c93b-4071-bdf4-4aff26956933/images/7bb6ac1d-df41-475d-9c23-b30f9c7713af/bea84fe3-e376-46e5-91d5-f04082435a64

When I then compare the two images with "qemu-img compare", they differ:
[root@oncilla04 ~]# qemu-img compare /var/tmp/localvm_ee80jvz/images/edf95fe8-29cc-4f85-85ff-de4ad2dfa4f6/c4143adf-1131-4318-8c31-71c299ef085b /rhev/data-center/mnt/blockSD/5588aa11-c93b-4071-bdf4-4aff26956933/images//7bb6ac1d-df41-475d-9c23-b30f9c7713af/bea84fe3-e376-46e5-91d5-f04082435a64
Content mismatch at offset 541908992!
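
For correlating the reported offset with block-layer tooling, it can help to express it in sector and block units. The sketch below is plain shell arithmetic on the offset from the output above; the helper name describe_offset is hypothetical and not part of qemu-img, and the result says nothing by itself about the root cause.

```shell
#!/bin/sh
# Break a byte offset reported by "qemu-img compare" into common block units.
# describe_offset is a hypothetical helper name used only for this sketch.
describe_offset() {
    off=$1
    echo "512-byte sector: $((off / 512)) (remainder $((off % 512)))"
    echo "4 KiB block: $((off / 4096)) (remainder $((off % 4096)))"
    echo "MiB: $((off / 1048576))"
}

# Offset taken from the mismatch message in this report.
describe_offset 541908992
```

For this offset the remainders are both 0, so the first mismatching byte sits on a 4 KiB boundary, a little past 516 MiB into the image.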

It looks like the HE VM disk is corrupted while it is being copied to the Ceph iSCSI destination.
We need to find out why this happens and open a bug against the responsible component.

Version-Release number of selected component (if applicable):
ovirt-hosted-engine-setup-2.4.9-4.el8ev.noarch
Ceph 4.2.8-115.el8cp
qemu-img-5.1.0-20.module+el8.3.1+9918+230f5c26.x86_64

How reproducible:
always

Steps to Reproduce:
1. Deploy Hosted Engine with a Ceph iSCSI backend.
- The playbook copies the image to the Ceph iSCSI volume.
2. The deployment fails on the liveliness check of the VM: the VM is up but has crashed into maintenance mode.

or

(0. After the failure, set the cluster to global maintenance and power off the HE VM.)
1. qemu-img convert -f qcow2 -O raw -t none -T none <HE local temporary disk> <Ceph iSCSI destination>
2. qemu-img compare <HE local temporary disk> <Ceph iSCSI destination>
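
The two manual steps above can be wrapped in a small script that also interprets the exit status of "qemu-img compare" (per the qemu-img man page: 0 means identical, 1 means the images differ, higher values are errors). This is a minimal sketch, not the deployment playbook's logic; the function names and the QEMU_IMG override variable are hypothetical, and the source and destination paths must be supplied by the caller.

```shell
#!/bin/sh
# Sketch: copy a qcow2 disk to a raw destination and verify the result.
# QEMU_IMG is a hypothetical override hook; it defaults to the real binary.
QEMU_IMG=${QEMU_IMG:-qemu-img}

# Map the exit status of "qemu-img compare" to a short message.
describe_compare_rc() {
    case "$1" in
        0) echo "images are identical" ;;
        1) echo "images differ" ;;
        *) echo "compare failed with an error (status $1)" ;;
    esac
}

# copy_and_verify SRC DST
# Uses the same convert flags as in this report (-t none -T none bypasses
# the host page cache for destination and source).
copy_and_verify() {
    src=$1
    dst=$2
    "$QEMU_IMG" convert -f qcow2 -O raw -t none -T none "$src" "$dst" || return 2
    "$QEMU_IMG" compare "$src" "$dst"
    rc=$?
    describe_compare_rc "$rc"
    return "$rc"
}
```

Usage would be along the lines of: copy_and_verify <HE local temporary disk> <Ceph iSCSI destination>. A non-zero return then distinguishes a failed copy or compare from a clean run.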

Actual results:
The HE disk is probably corrupted; further investigation is needed to find out why.

Expected results:
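The copy should succeed: the destination image should match the source, and the HE VM should pass the liveliness check.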

Additional info:
The same procedure works with a NetApp iSCSI backend.

Comment 2 Avihai 2021-06-16 08:03:43 UTC
A fix in Ceph exists and has been verified [1]; moving this bug to ON_QA.
We can now verify this in RHV 4.4.7 as well.


[1] https://bugzilla.redhat.com/show_bug.cgi?id=1934092

Comment 3 Petr Kubica 2021-06-23 11:13:36 UTC
Verified with HE 4.4.7 deployed.

What was tested:
- Compared the source and destination images with "qemu-img compare": the images were identical.
- Punching holes with fallocate and writing data worked as expected (details of how this was tested are in https://bugzilla.redhat.com/show_bug.cgi?id=1934092).
- Checked for LUN corruption with dd and fallocate: works as expected.
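
As an illustration of the kind of punch-hole check mentioned above, here is a minimal sketch. It is NOT the procedure from bug 1934092: it runs against a scratch file created with mktemp rather than a LUN, assumes util-linux fallocate and a filesystem that supports hole punching, and the function names are hypothetical.

```shell
#!/bin/sh
# Sketch only: operates on a temporary scratch file, never on a real LUN.

# Count the non-NUL bytes in 4 KiB block number $2 of file $1.
nonzero_bytes() {
    n=$(dd if="$1" bs=4096 skip="$2" count=1 2>/dev/null | LC_ALL=C tr -d '\0' | wc -c)
    echo $((n))
}

punch_hole_check() {
    f=$(mktemp) || return 2
    # Fill three 4 KiB blocks with 0xff bytes.
    dd if=/dev/zero bs=4096 count=3 2>/dev/null | LC_ALL=C tr '\0' '\377' > "$f"
    # Punch a hole over the middle block only.
    if ! fallocate --punch-hole --offset 4096 --length 4096 "$f" 2>/dev/null; then
        rm -f "$f"
        echo "punch-hole unsupported"
        return 3
    fi
    b0=$(nonzero_bytes "$f" 0)
    b1=$(nonzero_bytes "$f" 1)
    b2=$(nonzero_bytes "$f" 2)
    rm -f "$f"
    # The punched block must read back as zeros; its neighbours must be intact.
    if [ "$b0" -eq 4096 ] && [ "$b1" -eq 0 ] && [ "$b2" -eq 4096 ]; then
        echo "ok"
    else
        echo "unexpected data: $b0 $b1 $b2"
        return 1
    fi
}
```

The design choice is to check both the punched range (must read as zeros) and the adjacent blocks (must be untouched), since corruption could show up on either side of the hole.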

# ceph -v
ceph version 14.2.11-179.el8cp (29de9ae52bcc20e38eb86cb8e4163bff2d1719c8) nautilus (stable)

engine:
ovirt-engine-4.4.7.4-0.9.el8ev.noarch

Comment 4 Sandro Bonazzola 2021-07-06 07:28:18 UTC
This bugzilla is included in oVirt 4.4.7 release, published on July 6th 2021.

Since the problem described in this bug report should be resolved in oVirt 4.4.7 release, it has been closed with a resolution of CURRENT RELEASE.

If the solution does not work for you, please open a new bug report.

