Bug 1701736 - files within a qemu vm on glusterfs are randomly overwritten by Zero Bytes
Summary: files within a qemu vm on glusterfs are randomly overwritten by Zero Bytes
Keywords:
Status: CLOSED CURRENTRELEASE
Alias: None
Product: ovirt-engine
Classification: oVirt
Component: BLL.Storage
Version: 4.3.0
Hardware: x86_64
OS: Linux
Priority: urgent
Severity: urgent
Target Milestone: ovirt-4.3.5
Target Release: 4.3.5.3
Assignee: Sahina Bose
QA Contact: Avihai
URL:
Whiteboard:
Depends On:
Blocks: 1745214
 
Reported: 2019-04-21 11:29 UTC by zem
Modified: 2019-08-29 18:32 UTC
CC List: 11 users

Fixed In Version: ovirt-engine-4.3.5.3
Doc Type: Bug Fix
Doc Text:
Cause: Disks on a gluster storage domain were mounted using the aio=native qemu option. Consequence: Disk corruption was observed, depending on the guest running in the VM (see Bug 1723530). Fix: Reverted to aio=threads for disks on gluster storage domains. Result: Works as expected.
Clone Of:
Environment:
Last Closed: 2019-07-30 14:08:23 UTC
oVirt Team: Gluster
Embargoed:
sbonazzo: ovirt-4.3?
sbonazzo: blocker?
sbonazzo: planning_ack?
pm-rhel: devel_ack+
pm-rhel: testing_ack+


Attachments
glusterfs client logs as requested (228.98 KB, application/gzip)
2019-04-23 13:20 UTC, zem
no flags Details


Links
System ID Private Priority Status Summary Last Updated
Launchpad 1793904 0 None None None 2019-04-21 11:29:06 UTC
Red Hat Knowledge Base (Solution) 4373791 0 None None None 2019-08-27 04:57:16 UTC
oVirt gerrit 101281 0 master MERGED gluster: Default to aio=threads for gluster storage 2020-10-26 00:45:14 UTC
oVirt gerrit 101406 0 ovirt-engine-4.3 MERGED gluster: Default to aio=threads for gluster storage 2020-10-26 00:45:28 UTC

Description zem 2019-04-21 11:29:07 UTC
Description of problem:

Last year in September I discovered a really odd behaviour when I ran qemu against qcow2 images on glusterfs, as described in 

https://bugs.launchpad.net/qemu/+bug/1793904 

When using qcow2 images and the direct qemu-gluster interface, the image gets corrupted. 

However, it went away when I mounted my images via the fuse filesystem, so I ignored the issue, also because I planned to reinstall the whole thing with oVirt at that point. 

However, it turns out that qcow images are getting corrupted on a fresh CentOS-based cloud, too, and I am surprised to see that happening, as ps shows that it is using the filesystem (fuse) and not gluster://

I find this behaviour really serious, because I can hardly think of any reason why it would affect only my datacenter.

I am opening the bug here as well before I spend more time on research.



Version-Release number of selected component (if applicable): 4.3 (centos 7)

How reproducible:


Steps to Reproduce:
1. Set up a glusterfs storage with replica 2 arbiter 1
2. Set up a Virtual machine 
3. Do some work in the guest, e.g. upgrades and other disk I/O 

Actual results:

Some files' contents are randomly zeroed out 

Expected results:

Should not happen under any circumstances 

Additional info:

https://bugs.launchpad.net/qemu/+bug/1793904

Comment 1 zem 2019-04-23 05:46:13 UTC
This bug could be a problem in glusterfs. Having spoken to some friends who are using similar setups (qemu/kvm/gluster), the only difference I can spot so far is that I decided to use ext4 as the underlying filesystem instead of XFS. We are running on an emergency nfs domain for now.

Comment 2 Sahina Bose 2019-04-23 07:31:53 UTC
Are you encountering this bug when you're running with oVirt?
Can you provide output of "gluster volume info" and also the fuse mount logs for gluster volume?

Comment 3 zem 2019-04-23 13:01:06 UTC
Here is the gluster volume info, the other info follows. 

[root@arbiter0 ~]# gluster volume info
 
Volume Name: ovirt_engine
Type: Distributed-Replicate
Volume ID: ed509cda-c236-49bf-a6e5-ff57855d0558
Status: Started
Snapshot Count: 0
Number of Bricks: 4 x (2 + 1) = 12
Transport-type: tcp
Bricks:
Brick1: rack1storage1.wg.csph.cloud:/bricks/sda/ovirt_engine
Brick2: rack2storage1.wg.csph.cloud:/bricks/sda/ovirt_engine
Brick3: arbiter0.wg.csph.cloud:/bricks/sda/ovirt_engine (arbiter)
Brick4: rack1storage1.wg.csph.cloud:/bricks/sdb/ovirt_engine
Brick5: rack2storage1.wg.csph.cloud:/bricks/sdb/ovirt_engine
Brick6: arbiter0.wg.csph.cloud:/bricks/sdb/ovirt_engine (arbiter)
Brick7: rack1storage1.wg.csph.cloud:/bricks/sdc/ovirt_engine
Brick8: rack2storage1.wg.csph.cloud:/bricks/sdc/ovirt_engine
Brick9: arbiter0.wg.csph.cloud:/bricks/sdc/ovirt_engine (arbiter)
Brick10: rack1storage1.wg.csph.cloud:/bricks/sdd/ovirt_engine
Brick11: rack2storage1.wg.csph.cloud:/bricks/sdd/ovirt_engine
Brick12: arbiter0.wg.csph.cloud:/bricks/sdd/ovirt_engine (arbiter)
Options Reconfigured:
performance.client-io-threads: off
nfs.disable: on
transport.address-family: inet
auth.allow: 10.253.0.10
features.quota: on
features.inode-quota: on
features.quota-deem-statfs: on
 
Volume Name: ovirt_export
Type: Distributed-Replicate
Volume ID: cd114fe8-aae5-42e9-a2e5-3e02319cb6d7
Status: Started
Snapshot Count: 0
Number of Bricks: 4 x (2 + 1) = 12
Transport-type: tcp
Bricks:
Brick1: rack1storage1.wg.csph.cloud:/bricks/sda/ovirt_export
Brick2: rack2storage1.wg.csph.cloud:/bricks/sda/ovirt_export
Brick3: arbiter0.wg.csph.cloud:/bricks/sda/ovirt_export (arbiter)
Brick4: rack1storage1.wg.csph.cloud:/bricks/sdb/ovirt_export
Brick5: rack2storage1.wg.csph.cloud:/bricks/sdb/ovirt_export
Brick6: arbiter0.wg.csph.cloud:/bricks/sdb/ovirt_export (arbiter)
Brick7: rack1storage1.wg.csph.cloud:/bricks/sdc/ovirt_export
Brick8: rack2storage1.wg.csph.cloud:/bricks/sdc/ovirt_export
Brick9: arbiter0.wg.csph.cloud:/bricks/sdc/ovirt_export (arbiter)
Brick10: rack1storage1.wg.csph.cloud:/bricks/sdd/ovirt_export
Brick11: rack2storage1.wg.csph.cloud:/bricks/sdd/ovirt_export
Brick12: arbiter0.wg.csph.cloud:/bricks/sdd/ovirt_export (arbiter)
Options Reconfigured:
performance.client-io-threads: off
nfs.disable: on
transport.address-family: inet
auth.allow: 10.253.1.*,10.253.2.*
 
Volume Name: ovirt_images
Type: Distributed-Replicate
Volume ID: dd865c50-49f1-4402-8894-c0cd07853d10
Status: Started
Snapshot Count: 0
Number of Bricks: 4 x (2 + 1) = 12
Transport-type: tcp
Bricks:
Brick1: rack1storage1.wg.csph.cloud:/bricks/sda/ovirt_images
Brick2: rack2storage1.wg.csph.cloud:/bricks/sda/ovirt_images
Brick3: arbiter0.wg.csph.cloud:/bricks/sda/ovirt_images (arbiter)
Brick4: rack1storage1.wg.csph.cloud:/bricks/sdb/ovirt_images
Brick5: rack2storage1.wg.csph.cloud:/bricks/sdb/ovirt_images
Brick6: arbiter0.wg.csph.cloud:/bricks/sdb/ovirt_images (arbiter)
Brick7: rack1storage1.wg.csph.cloud:/bricks/sdc/ovirt_images
Brick8: rack2storage1.wg.csph.cloud:/bricks/sdc/ovirt_images
Brick9: arbiter0.wg.csph.cloud:/bricks/sdc/ovirt_images (arbiter)
Brick10: rack1storage1.wg.csph.cloud:/bricks/sdd/ovirt_images
Brick11: rack2storage1.wg.csph.cloud:/bricks/sdd/ovirt_images
Brick12: arbiter0.wg.csph.cloud:/bricks/sdd/ovirt_images (arbiter)
Options Reconfigured:
performance.client-io-threads: off
nfs.disable: on
transport.address-family: inet
auth.allow: 10.253.1.*,10.253.2.*
 
Volume Name: ovirt_iso
Type: Distributed-Replicate
Volume ID: 82adda67-e958-4b19-b94e-bafb0f3ee9f7
Status: Started
Snapshot Count: 0
Number of Bricks: 4 x (2 + 1) = 12
Transport-type: tcp
Bricks:
Brick1: rack1storage1.wg.csph.cloud:/bricks/sda/ovirt_iso
Brick2: rack2storage1.wg.csph.cloud:/bricks/sda/ovirt_iso
Brick3: arbiter0.wg.csph.cloud:/bricks/sda/ovirt_iso (arbiter)
Brick4: rack1storage1.wg.csph.cloud:/bricks/sdb/ovirt_iso
Brick5: rack2storage1.wg.csph.cloud:/bricks/sdb/ovirt_iso
Brick6: arbiter0.wg.csph.cloud:/bricks/sdb/ovirt_iso (arbiter)
Brick7: rack1storage1.wg.csph.cloud:/bricks/sdc/ovirt_iso
Brick8: rack2storage1.wg.csph.cloud:/bricks/sdc/ovirt_iso
Brick9: arbiter0.wg.csph.cloud:/bricks/sdc/ovirt_iso (arbiter)
Brick10: rack1storage1.wg.csph.cloud:/bricks/sdd/ovirt_iso
Brick11: rack2storage1.wg.csph.cloud:/bricks/sdd/ovirt_iso
Brick12: arbiter0.wg.csph.cloud:/bricks/sdd/ovirt_iso (arbiter)
Options Reconfigured:
auth.allow: 10.253.1.*,10.253.2.*
transport.address-family: inet
nfs.disable: on
performance.client-io-threads: off



Comment 4 zem 2019-04-23 13:20:46 UTC
Created attachment 1557669 [details]
glusterfs client logs as requested

logfiles

Comment 5 zem 2019-04-23 13:47:15 UTC
> "Are you encountering this bug when you're running with oVirt?"
Yes I am encountering the bug when running oVirt, which I set up 2 weeks ago. 

I recently installed 4.3. 

Those glusterfs bricks are currently running with ext4, not xfs; as I talked with some friends 
who are running gluster, this is the one main difference between my setup and theirs. I am already planning 
a test where I move those bricks over to xfs and see if the bug is still there.

Comment 6 Sahina Bose 2019-04-23 13:56:03 UTC
(In reply to zem from comment #3)
> Here is the gluster volume info, the other info follows. 
> 
> [root@arbiter0 ~]# gluster volume info
>  
> Volume Name: ovirt_engine
> Type: Distributed-Replicate
> Volume ID: ed509cda-c236-49bf-a6e5-ff57855d0558
> Status: Started
> Snapshot Count: 0
> Number of Bricks: 4 x (2 + 1) = 12
> Transport-type: tcp
> Bricks:
> Brick1: rack1storage1.wg.csph.cloud:/bricks/sda/ovirt_engine
> Brick2: rack2storage1.wg.csph.cloud:/bricks/sda/ovirt_engine
> Brick3: arbiter0.wg.csph.cloud:/bricks/sda/ovirt_engine (arbiter)
> Brick4: rack1storage1.wg.csph.cloud:/bricks/sdb/ovirt_engine
> Brick5: rack2storage1.wg.csph.cloud:/bricks/sdb/ovirt_engine
> Brick6: arbiter0.wg.csph.cloud:/bricks/sdb/ovirt_engine (arbiter)
> Brick7: rack1storage1.wg.csph.cloud:/bricks/sdc/ovirt_engine
> Brick8: rack2storage1.wg.csph.cloud:/bricks/sdc/ovirt_engine
> Brick9: arbiter0.wg.csph.cloud:/bricks/sdc/ovirt_engine (arbiter)
> Brick10: rack1storage1.wg.csph.cloud:/bricks/sdd/ovirt_engine
> Brick11: rack2storage1.wg.csph.cloud:/bricks/sdd/ovirt_engine
> Brick12: arbiter0.wg.csph.cloud:/bricks/sdd/ovirt_engine (arbiter)
> Options Reconfigured:
> performance.client-io-threads: off
> nfs.disable: on
> transport.address-family: inet
> auth.allow: 10.253.0.10
> features.quota: on
> features.inode-quota: on
> features.quota-deem-statfs: on

Can you disable quota?

>  
> [volume info for ovirt_export, ovirt_images and ovirt_iso snipped; same layout and options as quoted above from comment 3]

From the volume info it looks like none of the recommended options are set for use as a storage domain in oVirt. Please take a look at https://github.com/gluster/glusterfs/blob/master/extras/group-virt.example. You can set these options using "gluster volume set <volumename> group virt"

** NOTE ** Setting group virt adds the sharding option to the volume. If your volume already has data, this will cause issues, so try this with newly created volumes.

You would also need to set the permissions on the volume so that the images are accessible by qemu:kvm (how did it work without these settings, I wonder?):

gluster volume set <volumename> storage.owner-uid 36
gluster volume set <volumename> storage.owner-gid 36

Can you ensure your volume has these settings and check again?
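
For reference, a minimal sketch of the whole sequence on a hypothetical, freshly created (and still empty) volume named vmstore; the volume name is just a placeholder:

gluster volume set vmstore group virt              # applies the group-virt.example options, including sharding
gluster volume set vmstore storage.owner-uid 36    # vdsm user
gluster volume set vmstore storage.owner-gid 36    # kvm group
gluster volume info vmstore                        # confirm the options and ownership took effect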

Comment 7 zem 2019-04-24 07:24:37 UTC
I did set the storage.owner-uid and gid manually on the mounted filesystem, as this was suggested in the first Q&A that I could find via Google. The documentation on ovirt.org, as good as it is, seems a bit incomplete regarding that topic. 

I will incorporate those flags into my storage tests later this week. I remember that I avoided any sort of striping because it somehow did not work reliably, but I am not sure if sharding was already a thing back then.

Comment 8 zem 2019-04-25 09:46:24 UTC
Test description: 

You can use the following script:
 
-----find_zeros.sh----------------------------------
#!/bin/bash
# Check whether the first 16 bytes of each file under /opt and /usr are all
# zero, which is a strong hint that the file's contents were zeroed out.
cd /
find opt/ usr/ -type f | (
        EXPOUT="00000000: 0000 0000 0000 0000 0000 0000 0000 0000  ................"
        while read -r f
        do
                TESTOUT=$(xxd -l 16 "${f}")
                if [[ "${TESTOUT}" == "${EXPOUT}" ]]
                then
                        echo "${f}"
                fi
        done
)
------------------------------------------------

Run it to check whether your VM is affected. I know that this is not an absolute test, but in my experience the results are good enough. 
I run the test as follows: 

1. Launch a fresh Debian stretch qcow2 or raw image. (I did try with raw, as mentioned.) 
2. run find_zeros.sh --> should find no output
3. vim /etc/apt/sources.list --> s/stretch/buster/ (see the non-interactive equivalent below)
4. apt update && apt dist-upgrade 
5. reboot the VM to flush all caches 
6. run find_zeros.sh --> should list corrupted files if the image is broken. 
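
For step 3, the same edit can also be done non-interactively; a small sketch, assuming the stock stretch sources.list:

sed -i 's/stretch/buster/g' /etc/apt/sources.list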


So far I have run the following tests: 

- nfs-ext4: good
- gluster-ext4-without-group-virt: bad
- gluster-ext4-without-group-virt-thick-provisioned: good
- gluster-ext4-with-group-virt: TBD 
- gluster-xfs-without-group-virt: TBD
- gluster-xfs-with-group-virt: TBD

I plan to run those tests on my cloud this Saturday. Switching to xfs is sort of a point-of-no-return operation because I don't have enough disks left.

Comment 9 Sahina Bose 2019-04-26 07:45:46 UTC
Thanks, setting needinfo for results with group virt profile

Comment 10 zem 2019-04-28 02:31:20 UTC
Sad news: neither the filesystem nor the group virt settings make any difference. 

Tests: 

- gluster-ext4-with-group-virt: bad 
- gluster-xfs-without-group-virt: bad
- gluster-xfs-with-group-virt: bad


------------------------------------------------------------------------------
Here is the output of volume info for the new volume on ext4 with group=virt:


Volume Name: ovirt_images_virt
Type: Distributed-Replicate
Volume ID: 9391c6fb-c97b-4dcb-90e0-ea696915cb2b
Status: Started
Snapshot Count: 0
Number of Bricks: 4 x (2 + 1) = 12
Transport-type: tcp
Bricks:
Brick1: rack1storage1.wg.csph.cloud:/bricks/sda/ovirt_images_virt
Brick2: rack2storage1.wg.csph.cloud:/bricks/sda/ovirt_images_virt
Brick3: arbiter0.wg.csph.cloud:/bricks/sda/ovirt_images_virt (arbiter)
Brick4: rack1storage1.wg.csph.cloud:/bricks/sdb/ovirt_images_virt
Brick5: rack2storage1.wg.csph.cloud:/bricks/sdb/ovirt_images_virt
Brick6: arbiter0.wg.csph.cloud:/bricks/sdb/ovirt_images_virt (arbiter)
Brick7: rack1storage1.wg.csph.cloud:/bricks/sdc/ovirt_images_virt
Brick8: rack2storage1.wg.csph.cloud:/bricks/sdc/ovirt_images_virt
Brick9: arbiter0.wg.csph.cloud:/bricks/sdc/ovirt_images_virt (arbiter)
Brick10: rack1storage1.wg.csph.cloud:/bricks/sdd/ovirt_images_virt
Brick11: rack2storage1.wg.csph.cloud:/bricks/sdd/ovirt_images_virt
Brick12: arbiter0.wg.csph.cloud:/bricks/sdd/ovirt_images_virt (arbiter)
Options Reconfigured:
storage.owner-gid: 36
storage.owner-uid: 36
cluster.choose-local: off
user.cifs: off
features.shard: on
cluster.shd-wait-qlength: 10000
cluster.shd-max-threads: 8
cluster.locking-scheme: granular
cluster.data-self-heal-algorithm: full
cluster.server-quorum-type: server
cluster.quorum-type: auto
cluster.eager-lock: enable
network.remote-dio: enable
performance.low-prio-threads: 32
performance.io-cache: off
performance.read-ahead: off
performance.quick-read: off
transport.address-family: inet
nfs.disable: on
performance.client-io-threads: off


----------------------------------------------------------------------------
xfs + group=virt 

 
Volume Name: ovirt_images
Type: Distributed-Replicate
Volume ID: d8398afd-ad3e-4ef4-9753-251f625f1c0b
Status: Started
Snapshot Count: 0
Number of Bricks: 4 x (2 + 1) = 12
Transport-type: tcp
Bricks:
Brick1: rack1storage1.wg.csph.cloud:/bricks/sda/ovirt_images
Brick2: rack2storage1.wg.csph.cloud:/bricks/sda/ovirt_images
Brick3: arbiter0.wg.csph.cloud:/bricks/sda/ovirt_images (arbiter)
Brick4: rack1storage1.wg.csph.cloud:/bricks/sdb/ovirt_images
Brick5: rack2storage1.wg.csph.cloud:/bricks/sdb/ovirt_images
Brick6: arbiter0.wg.csph.cloud:/bricks/sdb/ovirt_images (arbiter)
Brick7: rack1storage1.wg.csph.cloud:/bricks/sdc/ovirt_images
Brick8: rack2storage1.wg.csph.cloud:/bricks/sdc/ovirt_images
Brick9: arbiter0.wg.csph.cloud:/bricks/sdc/ovirt_images (arbiter)
Brick10: rack1storage1.wg.csph.cloud:/bricks/sdd/ovirt_images
Brick11: rack2storage1.wg.csph.cloud:/bricks/sdd/ovirt_images
Brick12: arbiter0.wg.csph.cloud:/bricks/sdd/ovirt_images (arbiter)
Options Reconfigured:
storage.owner-gid: 36
storage.owner-uid: 36
cluster.choose-local: off
user.cifs: off
features.shard: on
cluster.shd-wait-qlength: 10000
cluster.shd-max-threads: 8
cluster.locking-scheme: granular
cluster.data-self-heal-algorithm: full
cluster.server-quorum-type: server
cluster.quorum-type: auto
cluster.eager-lock: enable
network.remote-dio: enable
performance.low-prio-threads: 32
performance.io-cache: off
performance.read-ahead: off
performance.quick-read: off
transport.address-family: inet
nfs.disable: on
performance.client-io-threads: off


---------------------------------------------------------------------
xfs without group=virt 

 
Volume Name: ovirt_images_virt
Type: Distributed-Replicate
Volume ID: ab29a92f-876e-4b82-9963-f2aa75a938ee
Status: Started
Snapshot Count: 0
Number of Bricks: 4 x (2 + 1) = 12
Transport-type: tcp
Bricks:
Brick1: rack1storage1.wg.csph.cloud:/bricks/sda/ovirt_images_virt
Brick2: rack2storage1.wg.csph.cloud:/bricks/sda/ovirt_images_virt
Brick3: arbiter0.wg.csph.cloud:/bricks/sda/ovirt_images_virt (arbiter)
Brick4: rack1storage1.wg.csph.cloud:/bricks/sdb/ovirt_images_virt
Brick5: rack2storage1.wg.csph.cloud:/bricks/sdb/ovirt_images_virt
Brick6: arbiter0.wg.csph.cloud:/bricks/sdb/ovirt_images_virt (arbiter)
Brick7: rack1storage1.wg.csph.cloud:/bricks/sdc/ovirt_images_virt
Brick8: rack2storage1.wg.csph.cloud:/bricks/sdc/ovirt_images_virt
Brick9: arbiter0.wg.csph.cloud:/bricks/sdc/ovirt_images_virt (arbiter)
Brick10: rack1storage1.wg.csph.cloud:/bricks/sdd/ovirt_images_virt
Brick11: rack2storage1.wg.csph.cloud:/bricks/sdd/ovirt_images_virt
Brick12: arbiter0.wg.csph.cloud:/bricks/sdd/ovirt_images_virt (arbiter)
Options Reconfigured:
storage.owner-gid: 36
storage.owner-uid: 36
transport.address-family: inet
nfs.disable: on
performance.client-io-threads: off

-----------------------------------------------------------------------------------------------------

Comment 11 zem 2019-04-29 18:32:26 UTC
I set up 6 virtual machines on my Fedora 29 desktop with glusterfs 5 and glusterfs 6 to provide some reproducible results (also a failure at first), but that led me to an important fact about my setup that I had completely overlooked until now: 

My ovirt-engine is hosted as a libvirt/virt-manager VM on a separate host, but its storage is also located on the glusterfs storage, in the volume ovirt_engine. 
There is no recognizable image corruption on that engine VM or on VMs that are running via virt-manager and a fuse mount, so I started digging: 

----------This is how ovirt starts the vm----------------------------------------
        -drive
                file=/rhev/data-center/mnt/glusterSD/gluster:_ovirt__images__virt/c0e6c246-4c2f-4b65-95fe-8495a07689e8/images/a14c6441-84c0-4a98-9483-85f74ce0a80c/bbaa47a9-e802-4a84-9d64-b3734a574aeb,
                format=qcow2,
                if=none,
                id=drive-ua-a14c6441-84c0-4a98-9483-85f74ce0a80c,
                serial=a14c6441-84c0-4a98-9483-85f74ce0a80c,
                werror=stop,
                rerror=stop,
                cache=none,
                aio=native
        -device
                virtio-blk-pci,
                iothread=iothread1,
                scsi=off,
                bus=pci.0,addr=0x6,drive=drive-ua-a14c6441-84c0-4a98-9483-85f74ce0a80c,
                id=ua-a14c6441-84c0-4a98-9483-85f74ce0a80c,
                bootindex=1,
                write-cache=on
--------------------------------------------------------------------------------------


At the moment I could narrow the possible flags down to: 

 - aio=native (EA-Mode: native)
 - cache=none (Buffer mode: none)
 - write-cache=on (Buffer mode: none)

After adding aio=native I had my breakthrough. 

I now have a Fedora workstation here with 3 VMs providing glusterfs, and one VM running with aio=native that shows the issue. 
I am not sure yet if this was the issue last September, as I still have a few gfapi results that do not add up. But it is the problem now. 

How can I change those performance behaviors in oVirt?
Why am I the only one having that problem when running oVirt? (rhetorical question)
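
For reference, a minimal sketch of the comparison with plain qemu on a fuse-mounted gluster volume (the image path, memory size and mount point are placeholders, not my exact command line):

# corrupts the guest in my setup:
qemu-system-x86_64 -enable-kvm -m 2048 \
    -drive file=/mnt/gluster/test.qcow2,format=qcow2,if=virtio,cache=none,aio=native

# behaves correctly (qemu's default aio mode):
qemu-system-x86_64 -enable-kvm -m 2048 \
    -drive file=/mnt/gluster/test.qcow2,format=qcow2,if=virtio,cache=none,aio=threads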

Comment 12 zem 2019-04-29 19:24:02 UTC
got it!

https://github.com/oVirt/ovirt-engine/commit/df07e633d3cdd2c1e0d21dd90e441fee94c452aa#diff-d6f7100af881feb7d909f23faeda326f
https://bugzilla.redhat.com/show_bug.cgi?id=1630744

If I read that change correctly, all installations of oVirt 4.3 made after 2nd November 2018 using glusterfs are probably affected. 
I am not sure how the config settings database is upgraded when you upgrade to the new release. I am trying to deactivate aio=native using 

engine-config -s UseNativeIOForGluster=false

Comment 13 zem 2019-04-29 19:39:28 UTC
I may need help setting this property.

Comment 14 zem 2019-04-30 08:18:53 UTC
[root@engine ~]# engine-config -g UseNativeIOForGluster
Error fetching UseNativeIOForGluster value: no such entry. Please verify key name and property file support.

which is a bit annoying as the option should be there, and it should be true

Comment 15 zem 2019-04-30 09:12:44 UTC
Accidentally unset the needinfo flag.

Comment 16 Krutika Dhananjay 2019-04-30 09:30:23 UTC
(In reply to zem from comment #12)
> got it!
> 
> https://github.com/oVirt/ovirt-engine/commit/df07e633d3cdd2c1e0d21dd90e441fee94c452aa#diff-d6f7100af881feb7d909f23faeda326f
> https://bugzilla.redhat.com/show_bug.cgi?id=1630744
> 
> If I read that change correctly, all installations of oVirt 4.3 made after
> 2nd November 2018 using glusterfs are probably affected. 
> I am not sure how the config settings database is upgraded when you upgrade
> to the new release. I am trying to deactivate aio=native using 
> 
> engine-config -s UseNativeIOForGluster=false

Not sure I understand. Are you saying that aio=native is the culprit here?
And you're NOT seeing the issue with aio=threads?

As far as gluster version is concerned, you're using 4.3? Can you confirm that?

-Krutika

Comment 17 zem 2019-04-30 12:22:05 UTC
That is exactly what my test shows. 
It is oVirt 4.3 and gluster 5.5 or 5.6, a recent installation made one month ago. 

More Important: 
I can reproduce the behaviour on my Fedora with virt-manager as described, and a fully Virtualized glusterfs. 

aio=native -> fail 
aio=threads (hypervisor default) -> success

I could not switch my oVirt instance to aio=threads yet (I don't know how), so I could not run the test on oVirt itself.

Comment 18 zem 2019-04-30 16:36:35 UTC
I figured out how to do the Hotfix: 

-------------------------------------------------------------------------------------------------

1. Patch /etc/ovirt-engine/engine-config/engine-config.properties to make the missing UseNativeIOForGluster parameter available to engine-config 

[root@engine engine-config]# diff -u engine-config.properties.orig engine-config.properties
--- engine-config.properties.orig       2019-04-30 17:30:44.408000000 +0200
+++ engine-config.properties    2019-04-30 17:32:57.738000000 +0200
@@ -513,3 +513,5 @@
 CinderlibCommandTimeoutInMinutes.descritpion=The cinderlib command timeout in minutes
 CinderlibCommandTimeoutInMinutes.type=Integer
 CinderlibCommandTimeoutInMinutes.validValues=0..3000
+UseNativeIOForGluster.description=Access volumes on glusterfs with aio=native instead of aio=threads
+UseNativeIOForGluster.type=Boolean
[root@engine engine-config]#


2. set UseNativeIOForGluster to false

[root@engine ovirt-engine]# engine-config -s UseNativeIOForGluster=False


3. restart engine 

[root@engine ovirt-engine]# systemctl restart ovirt-engine

4. restart/start affected virtual machines

5. run the test cycle to see if the change has taken effect (a quick check is sketched below)

-----------------------------------------------------------------------------------
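
One quick way to confirm that the change has taken effect is to look for the aio option on the running qemu processes; a small sketch, run on the host after restarting the VM:

ps -ef | grep [q]emu-kvm | grep -o 'aio=[a-z]*'
# should now report aio=threads for disks on the gluster storage domain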

Possible side effects: 
   The fix may reintroduce the behaviours that bug 1630744 intended to fix in the first place. (I can live with that) 

This solution is a workaround to prevent my data from being destroyed; the originating bug is located somewhere in 
the interface between qemu and glusterfs and should be fixed there. A bug report for qemu has been open for a while now, and 
the link can be found here.

Comment 19 Yaniv Kaul 2019-05-02 08:50:13 UTC
Which version of Gluster are you using, btw?

Comment 20 Sahina Bose 2019-05-02 09:54:53 UTC
Also adding needinfo on Sas to try out the test case in Comment 8

Comment 21 zem 2019-05-02 14:41:58 UTC
@Yaniv: as I wrote, gluster 5; I think it is still 5.5 on the server and 5.6 in my virtualized test.

Comment 22 SATHEESARAN 2019-05-03 13:24:21 UTC
(In reply to Sahina Bose from comment #20)
> Also adding needinfo on Sas to try out the test case in Comment 8

I have tried out the same scenario with RHV 4.3.3 & RHGS 3.4.4 (glusterfs-3.12.2-47.el7rhgs).
I used a 2x(2+1) distributed arbitrated replicated volume for this testing.

The issue is not seen: the script returned no output, and the VMs are using aio=native.

Comment 23 zem 2019-05-03 15:55:06 UTC
@satheesaran: 

- Is Gluster 3.12 the latest RHGS gluster version? I was using glusterfs 5 (5.5 or 5.6) in my tests. 
- Can you check whether the qcow2 image that you used was stored sparse (thin provisioned) or preallocated?
- Should I prepare a nested qcow2 showing the issue and send it over? 


The qemu version might also be of interest 

This is the CentOS Version of my Storage System: 

[root@rack1storage1 ~]# uname -a 
Linux rack1storage1.boot.csph.cloud 3.10.0-957.10.1.el7.x86_64 #1 SMP Mon Mar 18 15:06:45 UTC 2019 x86_64 x86_64 x86_64 GNU/Linux

[root@rack1storage1 ~]# cat /proc/version 
Linux version 3.10.0-957.10.1.el7.x86_64 (mockbuild.centos.org) (gcc version 4.8.5 20150623 (Red Hat 4.8.5-36) (GCC) ) #1 SMP Mon Mar 18 15:06:45 UTC 2019

[root@rack1storage1 ~]# rpm -qa | grep gluster
glusterfs-5.5-1.el7.x86_64
glusterfs-server-5.5-1.el7.x86_64
glusterfs-client-xlators-5.5-1.el7.x86_64
glusterfs-api-5.5-1.el7.x86_64
glusterfs-cli-5.5-1.el7.x86_64
centos-release-gluster5-1.0-1.el7.centos.noarch
glusterfs-libs-5.5-1.el7.x86_64
glusterfs-fuse-5.5-1.el7.x86_64

[root@rack1server2 ~]#  rpm -qa | grep qemu
ipxe-roms-qemu-20170123-1.git4e85b27.el7_4.1.noarch
libvirt-daemon-driver-qemu-4.5.0-10.el7_6.6.x86_64
qemu-kvm-ev-2.12.0-18.el7_6.3.1.x86_64
qemu-img-ev-2.12.0-18.el7_6.3.1.x86_64
qemu-kvm-common-ev-2.12.0-18.el7_6.3.1.x86_64

Comment 24 SATHEESARAN 2019-05-04 02:48:04 UTC
(In reply to zem from comment #23)
> @satheesaran: 
> 
> - Is Gluster 3.12 the latest RHGS gluster version? As I was using glusterfs
> 5 (5.5 or 5.6) in my tests.

Thanks Zem. Gluster 6 is the latest, though 5.6 is the latest in the Gluster 5 stream.
My answer to Sahina's question was based on the downstream product 'Red Hat Gluster Storage'.

> - Can you check if your qcow2 Image that you have uses was stored in sparse
> mode or in qcow2 mode?

I used a 'Preallocated' raw image.
Have you thin-provisioned the image file with qcow2?

> - should I prepare a nested qcow2 showing that issue and send it over? 
I can try it on our systems.

> 
> 
> The qemu version might also be of interest 
> 
> [kernel, gluster and qemu package versions snipped; see comment 23]

It's the downstream-specific qemu that I used:
qemu-kvm-rhev-2.12.0-18.el7_6.4.x86_64

Just guide me with the image creation.
1. Type of image: raw or qcow2?
2. Resource allocation: is the image preallocated or thinly provisioned?
3. What's the size of the disk?

Comment 25 zem 2019-05-04 18:24:35 UTC
(In reply to SATHEESARAN from comment #24)
> (In reply to zem from comment #23)
> > @satheesaran: 

> > - Can you check whether the qcow2 image that you used was stored sparse (thin provisioned) or preallocated?
> 
> I have used 'Preallocated' raw image.
> Have you thin allocated the image file with qcow2 ?

As I pointed out in comment 8 (in the test results) the problem does not occur with 
thick provisioned (preallocated) images.

As far as my testing showed, it has to be a sparse raw or a sparse qcow2.  


> > - should I prepare a nested qcow2 showing that issue and send it over? 
> I can try in our systems.

OK, fine. :) It should be reproducible. 


> Its downstream specific qemu that I used.
> qemu-kvm-rhev-2.12.0-18.el7_6.4.x86_64
> 
> Just guide me with the image creation.
> 1. type of image: raw or qcow2 ?

The error shows with both formats as long as they are "thin provisioned" by use of sparse mode 
(you can't create thin-provisioned raw files within oVirt as far as I am aware). 

I do the following check on my images: 
If ls -l shows me the 30 GB size of the image and du shows me 1.6 GB, the image is thin provisioned (sparse), as in the example below. 
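
For example (the path is just a placeholder):

ls -lh /path/to/disk.img    # apparent size, e.g. 30G
du -h /path/to/disk.img     # space actually allocated on disk, e.g. 1.6G

A large gap between the two means the file is sparse (thin provisioned).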


> 2. Resource allocation:
> Is the image preallocated or thinly provisioned ?

thinly provisioned!


> 3. What's the size of the disk ?

I have used 20-30 GB. My default template has 30GB thin provisioned and 1-1.6 GB in use.  
My experience is that the virtual size does not matter.
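
For a standalone reproduction outside oVirt, a thin image can be created with qemu-img; a small sketch (the name and size are just examples):

qemu-img create -f qcow2 test.qcow2 30G    # created sparse: large apparent size, little space allocated

Running the ls/du check above on the resulting file shows the same gap.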

Comment 28 Krutika Dhananjay 2019-05-13 06:32:55 UTC
Can you share the gluster fuse mount and brick logs for a run where you hit this issue? What I mean by "gluster fuse mount log" is the log of the fuse mount on which the VM that gets these extra zeroes is stored.

Also, do you see this issue even when you install your VM with the same qemu parameters but hosted directly on an xfs file system? In other words, when there's no gluster (and no fuse kernel module) in the picture, do you see this bug? I need to know this to isolate the layer which is causing the issue.

-Krutika

Comment 29 zem 2019-05-30 11:31:44 UTC
Hi, 

Sorry for the late answer, I have had quite busy weeks. As I already offered, I can prepare a qcow2 image with a test setup instead of back-and-forth guesswork, and that one will contain all the needed logs. 

Spoiler: I think we should not use AIO=native by default!

-------------------------------------------------------------------------
@Krutika: To answer your question first: 

> Also, do you see this issue even when you install your vm with the same qemu parameters but hosted directly on an xfs file system?

No, but I would not bet on it; the reason why can be found here: https://access.redhat.com/articles/41313 (thanks for asking!)
-------------------------------------------------------------------------

You should also read and probably re-evaluate the following bugs: 

https://bugzilla.redhat.com/show_bug.cgi?id=1305886
https://bugzilla.redhat.com/show_bug.cgi?id=1305886#c4

and the one that Yaniv mentioned: https://bugzilla.redhat.com/show_bug.cgi?id=1630744#c8 

Especially: 

I have a very hard time finding any reports of the mentioned "extra tests": what was tested and whether those tests were successful. I don't need a full test protocol in a change, that's overkill, but I could not even find an "I tested something, no issues!" line, which means I am either so biased that I can't see properly or it has not been properly tested!

I also have a very hard time understanding why, with so many cluebat hits, aio=native was made the default. AFAIR the original request was to have the option of aio=native to deal with some rare performance issues, not to have it as a default, so where did the idea to make aio=native the default even come from?

What I could find was that the patch from bug 1630744 seems to have been sort of "pushed" through gerrit (probably to meet a deadline or so), and it was even incomplete, as you can see in Comment 18 of this bug.


All in all, my suggestion for the least invasive solution to this is:

 - go back to aio=threads by default (which is also qemu's default) 
 - make nichawla, as reporter of https://bugzilla.redhat.com/show_bug.cgi?id=1616270, aware that we do so,
 - add my patch from Comment 18 (this bug) to make the setting configurable. 

4 lines to change, if I did not miscount, and it can easily be backported to oVirt 4.2, as this release is affected, too. 

regards 
     Hans

Comment 31 Sahina Bose 2019-06-25 06:17:48 UTC
Did you try this with preallocated disks?

Comment 32 zem 2019-06-25 07:11:22 UTC
(In reply to Sahina Bose from comment #31)
> To sasundar: Did you try this with preallocated disks?

Let me rephrase that question, as you already wrote the answer in https://bugzilla.redhat.com/show_bug.cgi?id=1701736#c24

_Have you already tried with thin provisioned disks?_

Comment 33 Avihai 2019-07-07 12:04:59 UTC
Verified at 4.3.5.3-0.1.el7.

The same tests as before were run, meaning:

1) Tier 1 test cases that previously showed multiple failures (seen only on gluster with this issue, after starting VMs) did not show those failures here.
2) I ran the same manual tests that reproduced the issue on the previous engine build (create 8-12 VMs from a RHEL 8 template and run them), and none of the VMs got stuck on "XFS: Metadata corruption detected at xfs_buf_ioend+0x58/0x1e0 [xfs], xfs_inode block 0x240 xfs_inode_bug_verify".

Comment 34 SATHEESARAN 2019-07-19 08:32:01 UTC
(In reply to Sahina Bose from comment #31)
> Did you try this with preallocated disks?

Yes, I tried with preallocated disks; I also tried with aio=native and am not facing any issues.
But in all my cases I used RHEL 7 guests.

Not sure whether that has anything to do with this issue.
But anyhow, aio=threads is now the default with oVirt 4.3.5, and let's see whether that solves the problems
without affecting the performance.

Comment 35 Sandro Bonazzola 2019-07-30 14:08:23 UTC
This bugzilla is included in oVirt 4.3.5 release, published on July 30th 2019.

Since the problem described in this bug report should be
resolved in oVirt 4.3.5 release, it has been closed with a resolution of CURRENT RELEASE.

If the solution does not work for you, please open a new bug report.

