Bug 2213556

Summary: qcow2 image corruptions when live migrating several VMs at the same time
Product: Red Hat Enterprise Linux 8
Reporter: Juan Orti <jortialc>
Component: glusterfs
Assignee: Sunil Kumar Acharya <sheggodu>
Status: CLOSED WORKSFORME
QA Contact:
Severity: urgent
Docs Contact:
Priority: high
Version: 8.6
CC: aliang, bkunal, coli, hreitz, jahernan, jinzhao, juzhang, kwolf, mgokhool, moagrawa, pcfe, sheggodu, timao, usurse, vgoyal, virt-maint, zhguo
Target Milestone: rc
Target Release: ---
Hardware: x86_64
OS: Linux
Whiteboard:
Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Story Points: ---
Clone Of:
Environment:
Last Closed: 2023-08-03 11:20:52 UTC
Type: Bug
Regression: ---
Mount Type: ---
Documentation: ---
CRM:
Verified Versions:
Category: ---
oVirt Team: ---
RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: ---
Target Upstream Version:
Embargoed:

Description Juan Orti 2023-06-08 14:15:27 UTC
Description of problem:
In a RHHI-V 1.8 cluster, when we live-migrate several VMs at the same time, there is a high probability that some of the VMs will end up with a corrupted qcow2 disk after running on the destination host for a while (5-15 minutes).

All qcow2 images are stored in Gluster volumes. Gluster itself appears to be healthy and we haven't seen any errors.

We have seen these kinds of corruption:

qcow2: Marking image as corrupt: Preventing invalid write on metadata (overlaps with refcount block); further corruption events will be suppressed
qcow2: Marking image as corrupt: Preventing invalid write on metadata (overlaps with qcow2_header); further corruption events will be suppressed
qcow2: Marking image as corrupt: Preventing invalid allocation of refcount block at offset 0; further corruption events will be suppressed

Version-Release number of selected component (if applicable):
qemu-kvm-6.2.0-11.module+el8.6.0+14707+5aa4b42d.x86_64
libvirt-8.0.0-5.module+el8.6.0+14480+c0a3aa0f.x86_64
kernel-4.18.0-372.16.1.el8_6.x86_64
redhat-release-virtualization-host-4.5.1-1.el8ev.x86_64

How reproducible:
Easily reproducible in the customer's environment in several independent clusters.
Not reproduced locally.

Steps to Reproduce:
1. The setup consists of a RHHI-V 1.8 cluster:
    - 3 hosts RHVH 4.5.1.1-0.20220717.0+1
    - RHV-M: ovirt-engine-4.5.3.5-1.el8ev.noarch
    - Gluster storage in the same hosts
    - Live-migrations and Gluster share the same network interface: a bridge using a LACP bond of 2x100 Gbps

2. Live migrate 8 VMs from one host to another
3. After the VMs have been migrated, wait ~15 minutes for any corruption event

Actual results:
There's a high chance of a corruption like this:

~~~
2023-05-25 16:52:36.504+0000: 3852944: debug : qemuMonitorJSONIOProcessLine:222 : Line [{"timestamp": {"seconds": 1685033556, "microseconds": 504261}, "event": "BLOCK_IMAGE_CORRUPTED", "data": {"device": "", "msg": "Preventing invalid write on metadata (overlaps with refcount block)", "offset": 8590327808, "node-name": "libvirt-2-format", "fatal": true, "size": 4096}}]
2023-05-25 16:52:36.504+0000: 3852944: info : qemuMonitorJSONIOProcessLine:237 : QEMU_MONITOR_RECV_EVENT: mon=0x7f9dfc1f95d0 event={"timestamp": {"seconds": 1685033556, "microseconds": 504261}, "event": "BLOCK_IMAGE_CORRUPTED", "data": {"device": "", "msg": "Preventing invalid write on metadata (overlaps with refcount block)", "offset": 8590327808, "node-name": "libvirt-2-format", "fatal": true, "size": 4096}}
2023-05-25 16:52:36.504+0000: 3852944: debug : qemuMonitorJSONIOProcessEvent:185 : mon=0x7f9dfc1f95d0 obj=0x7f9d9c00b190
2023-05-25 16:52:36.504+0000: 3852944: debug : qemuMonitorEmitEvent:1122 : mon=0x7f9dfc1f95d0 event=BLOCK_IMAGE_CORRUPTED
2023-05-25 16:52:36.504+0000: 3852944: debug : qemuProcessHandleEvent:549 : vm=0x7f9de0818400
2023-05-25 16:52:36.504+0000: 3852944: debug : virObjectEventNew:621 : obj=0x7f9da80456e0
2023-05-25 16:52:36.529+0000: 3852944: debug : qemuMonitorJSONIOProcessLine:222 : Line [{"timestamp": {"seconds": 1685033556, "microseconds": 529133}, "event": "BLOCK_IO_ERROR", "data": {"device": "", "nospace": false, "node-name": "libvirt-2-format", "reason": "Input/output error", "operation": "write", "action": "stop"}}]
2023-05-25 16:52:36.529+0000: 3852944: info : qemuMonitorJSONIOProcessLine:237 : QEMU_MONITOR_RECV_EVENT: mon=0x7f9dfc1f95d0 event={"timestamp": {"seconds": 1685033556, "microseconds": 529133}, "event": "BLOCK_IO_ERROR", "data": {"device": "", "nospace": false, "node-name": "libvirt-2-format", "reason": "Input/output error", "operation": "write", "action": "stop"}}
2023-05-25 16:52:36.529+0000: 3852944: debug : qemuMonitorJSONIOProcessEvent:185 : mon=0x7f9dfc1f95d0 obj=0x7f9d9c017ff0
2023-05-25 16:52:36.529+0000: 3852944: debug : qemuMonitorEmitEvent:1122 : mon=0x7f9dfc1f95d0 event=BLOCK_IO_ERROR
2023-05-25 16:52:36.529+0000: 3852944: debug : qemuProcessHandleEvent:549 : vm=0x7f9de0818400
2023-05-25 16:52:36.529+0000: 3852944: debug : virObjectEventNew:621 : obj=0x7f9d9c02e050
2023-05-25 16:52:36.529+0000: 3852944: debug : qemuMonitorJSONIOProcessEvent:209 : handle BLOCK_IO_ERROR handler=0x7f9e16438db0 data=0x7f9d9c055220
2023-05-25 16:52:36.529+0000: 3852944: debug : qemuMonitorEmitIOError:1199 : mon=0x7f9dfc1f95d0
2023-05-25 16:52:36.529+0000: 3852944: debug : virObjectEventNew:621 : obj=0x7f9d9c02e0e0
2023-05-25 16:52:36.529+0000: 3852944: debug : virObjectEventNew:621 : obj=0x7f9d9c02e170
2023-05-25 16:52:36.529+0000: 3852944: debug : qemuProcessHandleIOError:861 : Transitioned guest test1 to paused state due to IO error
2023-05-25 16:52:36.529+0000: 3852944: debug : virObjectEventNew:621 : obj=0x7f9da8045e50
2023-05-25 16:52:36.530+0000: 3852944: debug : qemuProcessHandleIOError:874 : Preserving lock state '<null>'
~~~

qemu command line of this particular VM:

~~~
2023-05-25 16:43:41.816+0000: starting up libvirt version: 8.0.0, package: 5.module+el8.6.0+14480+c0a3aa0f (Red Hat, Inc. <http://bugzilla.redhat.com/bugzilla>, 2022-03-15-19:57:04, ), qemu version: 6.2.0qemu-kvm-6.2.0-11.module+el8.6.0+14707+5aa4b42d, kernel: 4.18.0-372.16.1.el8_6.x86_64, hostname: host1
LC_ALL=C \
PATH=/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin \
HOME=/var/lib/libvirt/qemu/domain-68-test1 \
XDG_DATA_HOME=/var/lib/libvirt/qemu/domain-68-test1/.local/share \
XDG_CACHE_HOME=/var/lib/libvirt/qemu/domain-68-test1/.cache \
XDG_CONFIG_HOME=/var/lib/libvirt/qemu/domain-68-test1/.config \
/usr/libexec/qemu-kvm \
-name guest=test1itis.lan,debug-threads=on \
-S \
-object '{"qom-type":"secret","id":"masterKey0","format":"raw","file":"/var/lib/libvirt/qemu/domain-68-test1/master-key.aes"}' \
-blockdev '{"driver":"file","filename":"/usr/share/OVMF/OVMF_CODE.secboot.fd","node-name":"libvirt-pflash0-storage","auto-read-only":true,"discard":"unmap"}' \
-blockdev '{"node-name":"libvirt-pflash0-format","read-only":true,"driver":"raw","file":"libvirt-pflash0-storage"}' \
-blockdev '{"driver":"file","filename":"/var/lib/libvirt/qemu/nvram/5e7bcb91-0163-4e78-a615-dcf99ee72828.fd","node-name":"libvirt-pflash1-storage","auto-read-only":true,"discard":"unmap"}' \
-blockdev '{"node-name":"libvirt-pflash1-format","read-only":false,"driver":"raw","file":"libvirt-pflash1-storage"}' \
-machine pc-q35-rhel8.6.0,usb=off,smm=on,dump-guest-core=off,pflash0=libvirt-pflash0-format,pflash1=libvirt-pflash1-format \
-accel kvm \
-global mch.extended-tseg-mbytes=24 \
-cpu EPYC,ibpb=on,virt-ssbd=on,monitor=off,x2apic=on,hypervisor=on,svm=off,topoext=on \
-global driver=cfi.pflash01,property=secure,value=on \
-m size=8388608k,slots=16,maxmem=33554432k \
-overcommit mem-lock=off \
-smp 4,maxcpus=64,sockets=16,dies=1,cores=4,threads=1 \
-object '{"qom-type":"iothread","id":"iothread1"}' \
-object '{"qom-type":"memory-backend-ram","id":"ram-node0","size":8589934592}' \
-numa node,nodeid=0,cpus=0-63,memdev=ram-node0 \
-uuid 5e7bcb91-0163-4e78-a615-dcf99ee72828 \
-smbios 'type=1,manufacturer=Red Hat,product=RHEL,version=8.6-1.el8ev,serial=fe1b8db3-4c1b-ea11-9fc6-00000000003c,uuid=5e7bcb91-0163-4e78-a615-dcf99ee72828,sku=8.6.0,family=RHV' \
-smbios 'type=2,manufacturer=Red Hat,product=RHEL-AV' \
-no-user-config \
-nodefaults \
-chardev socket,id=charmonitor,fd=83,server=on,wait=off \
-mon chardev=charmonitor,id=monitor,mode=control \
-rtc base=2023-05-25T16:43:40,driftfix=slew \
-global kvm-pit.lost_tick_policy=delay \
-no-hpet \
-no-shutdown \
-global ICH9-LPC.disable_s3=1 \
-global ICH9-LPC.disable_s4=1 \
-boot menu=on,splash-time=30000,strict=on \
-device pcie-root-port,port=16,chassis=1,id=pci.1,bus=pcie.0,multifunction=on,addr=0x2 \
-device pcie-root-port,port=17,chassis=2,id=pci.2,bus=pcie.0,addr=0x2.0x1 \
-device pcie-root-port,port=18,chassis=3,id=pci.3,bus=pcie.0,addr=0x2.0x2 \
-device pcie-root-port,port=19,chassis=4,id=pci.4,bus=pcie.0,addr=0x2.0x3 \
-device pcie-root-port,port=20,chassis=5,id=pci.5,bus=pcie.0,addr=0x2.0x4 \
-device pcie-root-port,port=21,chassis=6,id=pci.6,bus=pcie.0,addr=0x2.0x5 \
-device pcie-root-port,port=22,chassis=7,id=pci.7,bus=pcie.0,addr=0x2.0x6 \
-device pcie-root-port,port=23,chassis=8,id=pci.8,bus=pcie.0,addr=0x2.0x7 \
-device pcie-root-port,port=24,chassis=9,id=pci.9,bus=pcie.0,multifunction=on,addr=0x3 \
-device pcie-root-port,port=25,chassis=10,id=pci.10,bus=pcie.0,addr=0x3.0x1 \
-device pcie-root-port,port=26,chassis=11,id=pci.11,bus=pcie.0,addr=0x3.0x2 \
-device pcie-root-port,port=27,chassis=12,id=pci.12,bus=pcie.0,addr=0x3.0x3 \
-device pcie-root-port,port=28,chassis=13,id=pci.13,bus=pcie.0,addr=0x3.0x4 \
-device pcie-root-port,port=29,chassis=14,id=pci.14,bus=pcie.0,addr=0x3.0x5 \
-device pcie-root-port,port=30,chassis=15,id=pci.15,bus=pcie.0,addr=0x3.0x6 \
-device pcie-root-port,port=31,chassis=16,id=pci.16,bus=pcie.0,addr=0x3.0x7 \
-device qemu-xhci,p2=8,p3=8,id=ua-868e398a-2ca8-445c-8d03-7d1dc2197fd8,bus=pci.3,addr=0x0 \
-device virtio-scsi-pci,iothread=iothread1,id=ua-80e32062-9b28-489a-82e3-ce4cd33ddd8c,bus=pci.2,addr=0x0 \
-device virtio-serial-pci,id=ua-dbe4f0f4-be14-43fd-a4a1-d6474e2395d1,max_ports=16,bus=pci.4,addr=0x0 \
-device ide-cd,bus=ide.2,id=ua-ca4f4620-c019-4b42-8050-5da43e1b28af,werror=report,rerror=report \
-blockdev '{"driver":"file","filename":"/rhev/data-center/mnt/glusterSD/host1-example.com:_data-storage-01/ebd88800-b2ef-4475-8400-93f1af83b7ab/images/0dada7a0-94d2-48a9-83cf-a968f90f33b6/a1ba2184-5a1b-43e4-8112-0305bc90c7ce","aio":"threads","node-name":"libvirt-2-storage","cache":{"direct":true,"no-flush":false},"auto-read-only":true,"discard":"unmap"}' \
-blockdev '{"node-name":"libvirt-2-format","read-only":false,"cache":{"direct":true,"no-flush":false},"driver":"qcow2","file":"libvirt-2-storage","backing":null}' \
-device scsi-hd,bus=ua-80e32062-9b28-489a-82e3-ce4cd33ddd8c.0,channel=0,scsi-id=0,lun=0,device_id=0dada7a0-94d2-48a9-83cf-a968f90f33b6,drive=libvirt-2-format,id=ua-0dada7a0-94d2-48a9-83cf-a968f90f33b6,bootindex=1,write-cache=on,serial=0dada7a0-94d2-48a9-83cf-a968f90f33b6,werror=stop,rerror=stop \
-blockdev '{"driver":"file","filename":"/rhev/data-center/mnt/glusterSD/host1-example.com:_data-storage-01/ebd88800-b2ef-4475-8400-93f1af83b7ab/images/41acd2f1-a292-476e-9e41-0b54aaa3c05a/55822524-824c-4dfb-b686-0762529f32af","aio":"threads","node-name":"libvirt-1-storage","cache":{"direct":true,"no-flush":false},"auto-read-only":true,"discard":"unmap"}' \
-blockdev '{"node-name":"libvirt-1-format","read-only":false,"cache":{"direct":true,"no-flush":false},"driver":"qcow2","file":"libvirt-1-storage","backing":null}' \
-device scsi-hd,bus=ua-80e32062-9b28-489a-82e3-ce4cd33ddd8c.0,channel=0,scsi-id=0,lun=1,device_id=41acd2f1-a292-476e-9e41-0b54aaa3c05a,drive=libvirt-1-format,id=ua-41acd2f1-a292-476e-9e41-0b54aaa3c05a,write-cache=on,serial=41acd2f1-a292-476e-9e41-0b54aaa3c05a,werror=stop,rerror=stop \
-netdev tap,fds=84:86:87:88,id=hostua-51b344b8-a525-4e9d-9a55-c65c5451ae93,vhost=on,vhostfds=89:90:91:92 \
-device virtio-net-pci,mq=on,vectors=10,host_mtu=1500,netdev=hostua-51b344b8-a525-4e9d-9a55-c65c5451ae93,id=ua-51b344b8-a525-4e9d-9a55-c65c5451ae93,mac=56:6f:6a:39:00:6b,bootindex=2,bus=pci.1,addr=0x0 \
-chardev socket,id=charchannel0,fd=81,server=on,wait=off \
-device virtserialport,bus=ua-dbe4f0f4-be14-43fd-a4a1-d6474e2395d1.0,nr=1,chardev=charchannel0,id=channel0,name=org.qemu.guest_agent.0 \
-chardev spicevmc,id=charchannel1,name=vdagent \
-device virtserialport,bus=ua-dbe4f0f4-be14-43fd-a4a1-d6474e2395d1.0,nr=2,chardev=charchannel1,id=channel1,name=com.redhat.spice.0 \
-audiodev '{"id":"audio1","driver":"spice"}' \
-spice port=5924,tls-port=5925,addr=192.168.1.1,x509-dir=/etc/pki/vdsm/libvirt-spice,tls-channel=main,tls-channel=display,tls-channel=inputs,tls-channel=cursor,tls-channel=playback,tls-channel=record,tls-channel=smartcard,tls-channel=usbredir,seamless-migration=on \
-device qxl-vga,id=ua-4fd19b52-d5cc-40a9-8954-86048eb37596,ram_size=67108864,vram_size=33554432,vram64_size_mb=0,vgamem_mb=16,max_outputs=1,bus=pcie.0,addr=0x1 \
-incoming defer \
-device virtio-balloon-pci,id=ua-87a286d9-3059-4d66-89ec-beb8fa0c18f4,bus=pci.5,addr=0x0 \
-object '{"qom-type":"rng-random","id":"objua-c60516a5-d513-4722-983e-5b4d737ef918","filename":"/dev/urandom"}' \
-device virtio-rng-pci,rng=objua-c60516a5-d513-4722-983e-5b4d737ef918,id=ua-c60516a5-d513-4722-983e-5b4d737ef918,bus=pci.6,addr=0x0 \
-device vmcoreinfo \
-sandbox on,obsolete=deny,elevateprivileges=deny,spawn=deny,resourcecontrol=deny \
-msg timestamp=on
2023-05-25 16:49:05.494+0000: Domain id=68 is tainted: custom-ga-command
qcow2: Marking image as corrupt: Preventing invalid write on metadata (overlaps with refcount block); further corruption events will be suppressed
~~~

Expected results:
No corruptions

Additional info:

Comment 3 Hanna Czenczek 2023-06-09 11:29:06 UTC
Hi Juan,

Do you have the command lines for all VMs involved, particularly the source VMs before migration?  It would also be good to see them grouped by each pairing of source and destination.

How does the storage migration work in this case?  I presume the images are not copied anywhere, but source and destination use the very same images on the gluster storage, right?

qcow2 corruption can be caused by concurrent writes to an image while it is in use by a VM, but this should be prevented by qemu’s file locks.  I don’t know anything about gluster, so I’ll just have to ask: does this particular configuration support OFD file locks (fcntl() with F_OFD_SETLK)?

(Such concurrent access would be a misconfiguration, so is unlikely regardless of whether locking works or not, but is still something that would be good to be able to rule out.)
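
As an illustration (not part of the original report), a minimal probe for this question, assuming a Linux/glibc host and any file on the Gluster FUSE mount, would simply try to take an OFD write lock the way QEMU's byte-range image locking does. Note that, as later comments in this bug show, a successful return on a FUSE mount is not conclusive, because the call may be translated to a regular POSIX lock on the server side.

~~~
/* Hypothetical probe: does the filesystem holding argv[1] accept
 * OFD locks (fcntl with F_OFD_SETLK)? */
#define _GNU_SOURCE
#include <errno.h>
#include <fcntl.h>
#include <stdio.h>
#include <string.h>
#include <unistd.h>

int main(int argc, char **argv)
{
    if (argc != 2) {
        fprintf(stderr, "usage: %s <file-on-gluster-mount>\n", argv[0]);
        return 2;
    }

    int fd = open(argv[1], O_RDWR);
    if (fd < 0) {
        perror("open");
        return 2;
    }

    /* OFD locks belong to the open file description, not the process;
     * l_pid must be 0 for F_OFD_SETLK. */
    struct flock fl = {
        .l_type   = F_WRLCK,
        .l_whence = SEEK_SET,
        .l_start  = 0,
        .l_len    = 1,   /* one byte, similar in spirit to QEMU's byte-range locks */
        .l_pid    = 0,
    };

    if (fcntl(fd, F_OFD_SETLK, &fl) == -1) {
        fprintf(stderr, "F_OFD_SETLK failed: %s\n", strerror(errno));
        close(fd);
        return 1;
    }

    printf("F_OFD_SETLK succeeded on %s\n", argv[1]);
    close(fd);
    return 0;
}
~~~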

So far, I don’t have much of an idea.  Seeing the full `qemu-img check` log would be good.

What I find most interesting so far is that

> qcow2: Marking image as corrupt: Preventing invalid write on metadata (overlaps with qcow2_header); further corruption events will be suppressed
> qcow2: Marking image as corrupt: Preventing invalid allocation of refcount block at offset 0; further corruption events will be suppressed

Both indicate attempted writes to offset 0.  I think this can only be explained if the cached refcount information on the destination is completely wrong, because offset 0 (the image header) can never be available for allocation.
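
For context (again, not from the report): the qcow2 header itself occupies the first bytes of the file, with the L1 and refcount table offsets stored a few bytes in, which is why any attempted allocation at offset 0 implies the in-memory refcount state is wrong. A rough header dump, assuming a Linux/glibc host:

~~~
/* Illustrative only: print the qcow2 header fields relevant to the
 * "write at offset 0" corruption messages. The header lives at offset 0,
 * so no data or refcount cluster can legitimately be allocated there. */
#include <endian.h>
#include <fcntl.h>
#include <inttypes.h>
#include <stdint.h>
#include <stdio.h>
#include <string.h>
#include <unistd.h>

int main(int argc, char **argv)
{
    if (argc != 2) {
        fprintf(stderr, "usage: %s <image.qcow2>\n", argv[0]);
        return 2;
    }

    int fd = open(argv[1], O_RDONLY);
    if (fd < 0) {
        perror("open");
        return 2;
    }

    uint8_t hdr[64];
    if (pread(fd, hdr, sizeof(hdr), 0) != (ssize_t)sizeof(hdr)) {
        perror("pread");
        close(fd);
        return 2;
    }

    uint32_t magic;
    uint64_t l1_table_offset, refcount_table_offset;
    memcpy(&magic, hdr + 0, sizeof(magic));                              /* bytes 0-3   */
    memcpy(&l1_table_offset, hdr + 40, sizeof(l1_table_offset));         /* bytes 40-47 */
    memcpy(&refcount_table_offset, hdr + 48, sizeof(refcount_table_offset)); /* bytes 48-55 */

    printf("magic:                 0x%08" PRIx32 " (expected 0x514649fb)\n", be32toh(magic));
    printf("l1_table_offset:       0x%" PRIx64 "\n", be64toh(l1_table_offset));
    printf("refcount_table_offset: 0x%" PRIx64 "\n", be64toh(refcount_table_offset));

    close(fd);
    return 0;
}
~~~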

Comment 5 Kevin Wolf 2023-06-09 12:18:56 UTC
The first immediate thought with corruption after migration with shared storage is what cache coherence guarantees the filesystem makes. I understand that we're running on a glusterfs FUSE filesystem here (i.e. not the built-in gluster driver in QEMU).

Specifically, during migration we have an image file opened on two different hosts with O_DIRECT. At first, the destination host uses it read-only and only the source host writes to the image; the source then calls fdatasync() and stops writing to the image. Then the destination host re-reads anything from the image that could have changed and starts writing to it. It is important that the destination host can see everything the source wrote up to its fdatasync(), i.e. the destination host must not read stale data from its local cache.

It would be good if someone who knows gluster could confirm that gluster supports this. If it doesn't, we can't do live migration with shared storage with it.
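
To make the requirement concrete, here is a rough two-host check (an illustration, not a reproducer from this bug, and far weaker than what QEMU does since it only touches one 4 KiB block). Assuming the same Gluster FUSE mount path on both hosts, run it with "write" on the source and then immediately with "read" on the destination; the destination must observe the pattern, not stale cached data.

~~~
/* Sketch: write a pattern with O_DIRECT and fdatasync() on one host,
 * then verify another host sees it through the shared mount. */
#define _GNU_SOURCE
#include <fcntl.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <unistd.h>

#define BLOCK 4096   /* O_DIRECT needs aligned buffers and sizes */

int main(int argc, char **argv)
{
    if (argc != 3 || (strcmp(argv[1], "write") && strcmp(argv[1], "read"))) {
        fprintf(stderr, "usage: %s write|read <file-on-gluster-mount>\n", argv[0]);
        return 2;
    }

    void *buf;
    if (posix_memalign(&buf, BLOCK, BLOCK) != 0) {
        fprintf(stderr, "posix_memalign failed\n");
        return 2;
    }

    /* O_DIRECT, like QEMU's cache.direct=true, bypasses the local page cache. */
    int fd = open(argv[2], O_RDWR | O_CREAT | O_DIRECT, 0644);
    if (fd < 0) {
        perror("open");
        return 2;
    }

    if (strcmp(argv[1], "write") == 0) {
        memset(buf, 0xA5, BLOCK);               /* recognizable pattern */
        if (pwrite(fd, buf, BLOCK, 0) != BLOCK) { perror("pwrite"); return 1; }
        if (fdatasync(fd) != 0)                 { perror("fdatasync"); return 1; }
        printf("pattern written, fdatasync() returned; now read on the other host\n");
    } else {
        if (pread(fd, buf, BLOCK, 0) != BLOCK)  { perror("pread"); return 1; }
        printf("first byte seen on this host: 0x%02x (expect 0xa5)\n",
               ((unsigned char *)buf)[0]);
    }

    close(fd);
    free(buf);
    return 0;
}
~~~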

Sunil, I'm not sure if you're the right person to answer this. If not, can you please forward the request to the appropriate person?

Comment 8 Juan Orti 2023-06-09 13:53:40 UTC
(In reply to Hanna Czenczek from comment #3)
> Hi Juan,
> 
> Do you have the command lines for all VMs involved, particularly the source
> VMs before migration?  It would also be good to see them grouped by each
> pairing of source and destination.

The sosreports we have are all generated after the corruptions happened. Let me review if I can correlate some of the events with the source VM using previous sosreports. I expect the source and destination VMs to be identical, but I'll confirm.

> How does the storage migration work in this case?  I presume the images are
> not copied anywhere, but source and destination use the very same images on
> the gluster storage, right?

That's right, the qcow2 image file is stored in the gluster volume which is mounted by all 3 hosts under the same path. So the image is not copied/migrated anywhere.

> qcow2 corruption can be caused by concurrent writes to an image while it is
> in use by a VM, but this should be prevented by qemu’s file locks.  I don’t
> know anything about gluster, so I’ll just have to ask: Does this particular
> configuration in this case support OFD file locks (fcntl() with F_OFD_SETLK)?

Gluster uses FUSE and does not support OFD file locks (it only supports F_GETLK, F_SETLK and F_SETLKW).

> So far, I don’t have much of an idea.  Seeing the full `qemu-img check` log
> would be good.

I'll try to get that info.

> What I find most interesting so far is that
> 
> > qcow2: Marking image as corrupt: Preventing invalid write on metadata (overlaps with qcow2_header); further corruption events will be suppressed
> > qcow2: Marking image as corrupt: Preventing invalid allocation of refcount block at offset 0; further corruption events will be suppressed
> 
> Both indicate attempted writes to offset 0.  I think this can only be
> explained if the cached refcount information on the destination is
> completely wrong, because offset 0 (the image header) can never be available
> for allocation.

The Gluster team has suggested turning off some performance optimizations that are enabled by default. We are now waiting for the results of testing this.

  performance.open-behind
  performance.flush-behind
  performance.write-behind 

https://access.redhat.com/documentation/en-us/red_hat_gluster_storage/3.5/html/administration_guide/volume_option_table

  "The performance.flush-behind will tell system that flush is done when this is still being done in the background, so this could explain possible corruption of the new VM process on the other node start to read before flush has been completed...this is just a theory. BUT if you detect the corruption 5 mins later after VM was moved I would expect all data being flush by that, so not sure if this will help."

Comment 11 Juan Orti 2023-06-12 08:51:24 UTC
Additional example of corrupted image:

# su vdsm -s /bin/sh -c "qemu-img info -U --backing-chain /rhev/data-center/mnt/glusterSD/host1.example.com:_data-storage-01/ebd88800-b2ef-4475-8400-93f1af83b7ab/images/85d45178-10b9-452d-83cf-e30bb277c34e/1e9894e9-5760-4c6b-a32a-ab54d8e96741"
image: /rhev/data-center/mnt/glusterSD/host1.example.com:_data-storage-01/ebd88800-b2ef-4475-8400-93f1af83b7ab/images/85d45178-10b9-452d-83cf-e30bb277c34e/1e9894e9-5760-4c6b-a32a-ab54d8e96741
file format: qcow2
virtual size: 50 GiB (53687091200 bytes)
disk size: 8.23 GiB
cluster_size: 65536
Format specific information:
    compat: 1.1
    compression type: zlib
    lazy refcounts: false
    refcount bits: 16
    corrupt: true
    extended l2: false


# su vdsm -s /bin/sh -c "qemu-img check -r all /rhev/data-center/mnt/glusterSD/host1.example.com:_data-storage-01/ebd88800-b2ef-4475-8400-93f1af83b7ab/images/85d45178-10b9-452d-83cf-e30bb277c34e/1e9894e9-5760-4c6b-a32a-ab54d8e96741"
Repairing OFLAG_COPIED data cluster: l2_entry=80000000 refcount=1
Repairing OFLAG_COPIED data cluster: l2_entry=80010000 refcount=1
The following inconsistencies were found and repaired:

    0 leaked clusters
    2 corruptions

Double checking the fixed image now...
No errors were found on the image.
134358/819200 = 16.40% allocated, 9.49% fragmented, 0.00% compressed clusters
Image end offset: 8820621312

Comment 12 Sunil Kumar Acharya 2023-06-12 09:46:08 UTC
(In reply to Kevin Wolf from comment #5)
> The first immediate thought with corruption after migration with shared
> storage is what cache coherence guarantees the filesystem makes. I
> understand that we're running on a glusterfs FUSE filesystem here (i.e. not
> the built-in gluster driver in QEMU).
> 
> Specifically, during migration we have an image file opened on two different
> hosts with O_DIRECT. At first, the destination host uses it read-only and
> only source host writes to the image, then calls fdatasync() and stops
> writing to the image. Then the destination host re-reads anything from the
> image that could have changed and starts writing to it. It is important that
> the destination host can see everything the source wrote up to its
> fdatasync(), i.e. the destination host must not read stale data from its
> local cache.
> 
> It would be good if someone who knows gluster could confirm that gluster
> supports this. If it doesn't, we can't do live migration with shared storage
> with it.
> 
> Sunil, I'm not sure if you're the right person to answer this. If not, can
> you please forward the request to the appropriate person?

This has been already sorted out via https://bugzilla.redhat.com/show_bug.cgi?id=2213809

Comment 13 Juan Orti 2023-06-12 15:05:11 UTC
The customer has tested with the following Gluster volume options:

  performance.open-behind off
  performance.flush-behind off

but there was no difference. Fifteen minutes after migrating 10 VMs, they had 1 VM paused due to corruption.

Comment 15 Juan Orti 2023-06-13 15:02:40 UTC
(In reply to Sunil Kumar Acharya from comment #12)

> This has been already sorted out via
> https://bugzilla.redhat.com/show_bug.cgi?id=2213809

That BZ is about the Gluster OFD locks, and we have seen that the fcntl calls to acquire the locks succeed, even though Gluster doesn't support them. They are translated to regular locks, see:

https://bugzilla.redhat.com/show_bug.cgi?id=2213809#c6

However, that doesn't answer the question of whether, once the source host finishes the fdatasync() call, the destination host can immediately read the synced data. Can we get confirmation on this point?
Thank you.

Comment 16 Mohit Agrawal 2023-06-14 06:00:28 UTC
(In reply to Juan Orti from comment #15)
> (In reply to Sunil Kumar Acharya from comment #12)
> 
> > This has been already sorted out via
> > https://bugzilla.redhat.com/show_bug.cgi?id=2213809
> 
> That BZ is about the Gluster OFD locks, and we have seen that the fcntl
> calls to acquire the locks succeed, even if Gluster doesn't support them.
> They are translated to regular locks, see:
> 
> https://bugzilla.redhat.com/show_bug.cgi?id=2213809#c6
> 
> However that doesn't answer the question if when the source host finishes
> the fdatasync() call, the destination host can immediately read the synced
> data. Can we get confirm this point?
> Thank you.

Can you please share the gluster configuration and the parameters passed to the client when mounting the volume?
Ideally, gluster should access fresh data if cache invalidation is enabled; otherwise it might
access stale data.