Bug 1547095
Summary: QEMU image locking on NFSv3 prevents VMs from getting restarted on different hosts upon a host crash, seen on RHEL 7.5

| Field | Value |
|---|---|
| Product | Red Hat Enterprise Linux 7 |
| Component | qemu-kvm-rhev |
| Status | CLOSED WONTFIX |
| Severity | urgent |
| Priority | unspecified |
| Version | 7.5 |
| Reporter | Simone Tiraboschi <stirabos> |
| Assignee | Fam Zheng <famz> |
| QA Contact | Ping Li <pingl> |
| Docs Contact | |
| CC | ahino, akrejcir, aliang, alukiano, areis, berrange, chayang, coli, cshao, dfediuck, famz, fromani, juzhang, kgoldbla, knoel, kwolf, lsurette, michal.skrivanek, michen, mkalinin, msivak, mtessun, ngu, nsoffer, pingl, pnguyen, qzhang, rbalakri, rjones, srevivo, ssigwald, stirabos, timao, virt-maint, xuwei, ycui, yhong, yisun, ykulkarn, ylavi, yzhao |
| Target Milestone | pre-dev-freeze |
| Target Release | --- |
| Hardware | Unspecified |
| OS | Unspecified |
| Whiteboard | |
| Fixed In Version | |
| Doc Type | If docs needed, set a value |
| Doc Text | |
| Story Points | --- |
| Clone Of | |
| | 1553154 (view as bug list) |
| Environment | |
| Last Closed | 2018-05-30 18:45:43 UTC |
| Type | Bug |
| Regression | --- |
| Mount Type | --- |
| Documentation | --- |
| CRM | |
| Verified Versions | |
| Category | --- |
| oVirt Team | Virt |
| RHEL 7.3 requirements from Atomic Host | |
| Cloudforms Team | --- |
| Target Upstream Version | |
| Embargoed | |
| Bug Depends On | |
| Bug Blocks | 1547033, 1550016, 1553154, 1556957 |
| Attachments | |
Description (Simone Tiraboschi, 2018-02-20 13:53:14 UTC)
A direct qemu invocation fails as well.

[root@rose05 ~]# LC_ALL=C PATH=/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin QEMU_AUDIO_DRV=none /usr/libexec/qemu-kvm -nographic -drive file=/var/run/vdsm/storage/2a7334e7-c1d3-41fd-9552-2aacbfa4f9af/e62bf4a4-6132-4c14-8aba-f292febdc4f9/976ecba8-712b-4b9c-b3d3-9d6fe9d7e618,format=raw,if=none,id=drive-virtio-disk0,serial=e62bf4a4-6132-4c14-8aba-f292febdc4f9,cache=none,werror=stop,rerror=stop,aio=threads -device virtio-blk-pci,scsi=off,bus=pci.0,addr=0x6,drive=drive-virtio-disk0,id=virtio-disk0,bootindex=1
qemu-kvm: -drive file=/var/run/vdsm/storage/2a7334e7-c1d3-41fd-9552-2aacbfa4f9af/e62bf4a4-6132-4c14-8aba-f292febdc4f9/976ecba8-712b-4b9c-b3d3-9d6fe9d7e618,format=raw,if=none,id=drive-virtio-disk0,serial=e62bf4a4-6132-4c14-8aba-f292febdc4f9,cache=none,werror=stop,rerror=stop,aio=threads: 'serial' is deprecated, please use the corresponding option of '-device' instead
qemu-kvm: -device virtio-blk-pci,scsi=off,bus=pci.0,addr=0x6,drive=drive-virtio-disk0,id=virtio-disk0,bootindex=1: Failed to get "write" lock
Is another process using the image?

Attaching strace output.

Created attachment 1398270 [details]
qemu strace
So it works in 7.5 with that shareable element? Can you attach the actual OVF? Does it have the ovf:shareable element? If not (so it's the same as the xml), then your VM definition doesn't have the disk set as shareable.

Attaching 28f35d31-d0fc-4902-a44c-9d2251f09e21.ovf as extracted from the OVF_STORE. No shareable there.

Created attachment 1398567 [details]
ovf from the OVF_STORE
It's set to ovf:shareable="false", so it's not there. Though I do not quite understand what the desired state is: do you intend to have <shareable> in the libvirt xml definition, or something else?

We do not want the disk to be shared. We use the lock to ensure exclusive access.

(In reply to Martin Sivák from comment #10)
> We do not want the disk to be shared. We use the lock to ensure exclusive
> access.

At the vdsm level, shared can be: none, exclusive, shared, transient. We need exclusive, but it seems that at the engine level we can only set share true/false. I'm trying to understand if this is conflicting with shared: exclusive.

Ok, that makes sense now. That is supported and should work if you set it in the metadata section. That is missing, so the question is how to get it there on the engine side.

Can you paste how exactly it differs between 4.1 and 4.2? It generates a lease for the drive, right?

(In reply to Michal Skrivanek from comment #13)
> can you paste how exactly it differs between 4.1 and 4.2? It generates a
> lease for the drive, right?

No, we were using 'shared: exclusive' also before (since 3.4, I think) having VM leases support on the engine side, and nothing has changed in that area on the ovirt-ha-agent side.

(In reply to Simone Tiraboschi from comment #14)
> No, we were using 'shared: exclusive' also before (since 3.4 I think) having
> VM leases support on engine side and nothing is changed on that area on
> ovirt-ha-agent side.

In the resulting libvirt xml, I mean. IIUC, the current code is supposed to generate a lease.

I tried reverting https://gerrit.ovirt.org/#/c/86435/ and reproducing without that, and the issue is still there.
I don't think that the issue is related to https://bugzilla.redhat.com/show_bug.cgi?id=1504606

Thanks Simone, interesting. Then this is likely broken for a longer time, and only showed up because of the more strict 7.5 qemu locking (similar to bug 1395941). There are things to fix within virt (adding "shared" to metadata), but I'm afraid this needs storage involvement anyway.

Allon, can anyone take a look at the HE lease mechanism? It doesn't seem to be related to the gap in the vm xml.

(In reply to Michal Skrivanek from comment #20)
> Thanks Simone, interesting, then this is likely broken for a longer time,
> and only showed up because of the more strict 7.5 qemu locking (similar to
> bug 1395941)
> There are things to fix within virt (adding "shared" to metadata), but I'm
> afraid this needs storage involvement anyway.
>
> Allon, can anyone take a look at the HE lease mechanism? Doesn't seem to be
> related to the gap in vmxml

Sure. Nir, Ala, can one of you take a look please?

HE uses shared:exclusive, which acquires the volume lease for this drive. The libvirt xml should contain a lease element with the lease path and offset of the active volume lease.

Simone: please attach the vm xml to the bug.

The error we see comes from qemu; it looks like libvirt local image locking conflicts with qemu image locking.

Daniel: how do you suggest to debug this in libvirt/qemu?

Nir, as per comment #19 Simone reproduced the same behavior when using the legacy vm conf (reverted msivak's change to use vm xml), so I assume (and Simone, please confirm) that there it used shared=exclusive; there was no change on the HE side regarding that part.
This led me to the thought that it is not related to the vm xml, but to RHEL 7.5 and/or some other vdsm refactoring.

The engine-generated libvirt XML for sure doesn't contain shared=exclusive, and so we also have: https://bugzilla.redhat.com/1547479

But I reproduced it also reverting https://gerrit.ovirt.org/#/c/86435/ , and in that case we use a json vm configuration that for sure contained shared=exclusive, and the issue is there also in that case.

(In reply to Michal Skrivanek from comment #23)

I think that "shared=exclusive" is a vdsm thing; you will not find it in the vm xml. We use "shared=shared" to add the "sharable" disk attribute (not related to volume leases).

When using "shared=exclusive", we add a volume lease to the xml here:

2297                 for dev_objs in self._devices.values():
2298                     for dev in dev_objs:
2299                         for elem in dev.get_extra_xmls():
2300                             domxml._devices.appendChild(element=elem)

The vm xml must have a <lease> element with the details of the volume lease:

<lease>
    <key>volume-uuid</key>
    <lockspace>sd-uuid</lockspace>
    <target offset="123" path=".../leases" />
</lease>

Francesco is maintaining this area.

(In reply to Nir Soffer from comment #25)
> I think that "shared=exclusive" is a vdsm thing, you will not find it in the
> vm xml.

Yes, that is understood. It's a gap currently. But that's not the point here when considering comment #24 - the same behavior happens without vm xml now in RHEL 7.5.

> When using "shared=exclusive", we add a volume lease to the xml here:
> ...
> The vm xml must have a <lease> element with the details of the volume lease.

Understood. This gap needs to be closed regardless. But first it needs to work without vm xml.

Adding back needinfo for Daniel, see comment 22.
Simone, can you confirm that you have the lease element in the vm xml when using vm conf? See comment 25. (In reply to Nir Soffer from comment #28) > Simone, can you confirm that you have the lease element in the vm xml when > using > vm conf? See comment 25. Skipping the libvirt XML generated by the engine (as in 4.1), we have shared=exclusive in the json sent to vdsm: 2018-02-21 23:08:26,916+0200 INFO (jsonrpc/5) [api.virt] START create(vmParams={u'emulatedMachine': u'pc-i440fx-rhel7.5.0', u'vmId': u'28f35d31-d0fc-4902-a44c-9d2251f09e21', u'devices': [{u'index': u'0', u'iface': u'virtio', u'format': u'raw', u'bootOrder': u'1', u'address': {u'slot': u'0x06', u'bus': u'0x00', u'domain': u'0x0000', u'type': u'pci', u'function': u'0x0'}, u'volumeID': u'976ecba8-712b-4b9c-b3d3-9d6fe9d7e618', u'imageID': u'e62bf4a4-6132-4c14-8aba-f292febdc4f9', u'readonly': u'false', u'domainID': u'2a7334e7-c1d3-41fd-9552-2aacbfa4f9af', u'deviceId': u'e62bf4a4-6132-4c14-8aba-f292febdc4f9', u'poolID': u'00000000-0000-0000-0000-000000000000', u'device': u'disk', u'shared': u'exclusive', u'propagateErrors': u'off', u'type': u'disk'}, {u'nicModel': u'pv', u'macAddr': u'00:1a:4a:16:10:9f', u'linkActive': u'true', u'network': u'ovirtmgmt', u'deviceId': u'a029a17b-c0d2-4045-9e89-e9b7a0e23b80', u'address': {u'slot': u'0x03', u'bus': u'0x00', u'domain': u'0x0000', u'type': u'pci', u'function': u'0x0'}, u'device': u'bridge', u'type': u'interface'}, {u'device': u'vnc', u'type': u'graphics', u'deviceId': u'30f608bc-8161-4db3-bd8d-c1c567f7ad75', u'address': u'None'}, {u'index': u'2', u'iface': u'ide', u'readonly': u'true', u'deviceId': u'8c3179ac-b322-4f5c-9449-c52e3665e0ae', u'address': {u'bus': u'1', u'controller': u'0', u'type': u'drive', u'target': u'0', u'unit': u'0'}, u'device': u'cdrom', u'shared': u'false', u'path': u'', u'type': u'disk'}, {u'device': u'usb', u'specParams': {u'index': u'0', u'model': u'piix3-uhci'}, u'type': u'controller', u'deviceId': 
u'b30ade5c-5394-421f-85d7-c499341c0027', u'address': {u'slot': u'0x01', u'bus': u'0x00', u'domain': u'0x0000', u'type': u'pci', u'function': u'0x2'}}, {u'specParams': {u'index': u'0', u'model': u'virtio-scsi'}, u'deviceId': u'bc5e64f4-98b6-482e-8223-03fc525ae522', u'address': {u'slot': u'0x04', u'bus': u'0x00', u'domain': u'0x0000', u'type': u'pci', u'function': u'0x0'}, u'device': u'scsi', u'model': u'virtio-scsi', u'type': u'controller'}, {u'device': u'ide', u'specParams': {u'index': u'0'}, u'type': u'controller', u'deviceId': u'f565c69a-2b0f-4d4c-b004-3da303c43da5', u'address': {u'slot': u'0x01', u'bus': u'0x00', u'domain': u'0x0000', u'type': u'pci', u'function': u'0x1'}}, {u'device': u'virtio-serial', u'specParams': {u'index': u'0'}, u'type': u'controller', u'deviceId': u'48a86d50-518a-4f8c-8d93-a81c868ca022', u'address': {u'slot': u'0x05', u'bus': u'0x00', u'domain': u'0x0000', u'type': u'pci', u'function': u'0x0'}}, {u'device': u'console', u'type': u'console', u'deviceId': u'4af63e2a-1590-41fc-9a31-11d19ec2ada8', u'address': u'None'}, {u'device': u'virtio', u'specParams': {u'source': u'urandom'}, u'model': u'virtio', u'type': u'rng'}], u'smp': u'2', u'memSize': u'8192', u'cpuType': u'Conroe', u'spiceSecureChannels': u'smain,sdisplay,sinputs,scursor,splayback,srecord,ssmartcard,susbredir', u'vmName': u'HostedEngine', u'display': u'vnc', u'maxVCpus': u'16'}) from=::1,46156 (api:46) and so the lease element in the XML sent to libvirt: <lease> <key>976ecba8-712b-4b9c-b3d3-9d6fe9d7e618</key> <lockspace>2a7334e7-c1d3-41fd-9552-2aacbfa4f9af</lockspace> <target offset="0" path="/rhev/data-center/mnt/yellow-vdsb.qa.lab.tlv.redhat.com:_Compute__NFS_alukiano_compute-ge-he-1/2a7334e7-c1d3-41fd-9552-2aacbfa4f9af/images/e62bf4a4-6132-4c14-8aba-f292febdc4f9/976ecba8-712b-4b9c-b3d3-9d6fe9d7e618.lease"/> </lease> And the lease volume is there and for sure we can read: [root@alma07 ~]# ls -l 
/rhev/data-center/mnt/yellow-vdsb.qa.lab.tlv.redhat.com:_Compute__NFS_alukiano_compute-ge-he-1/2a7334e7-c1d3-41fd-9552-2aacbfa4f9af/images/e62bf4a4-6132-4c14-8aba-f292febdc4f9/976ecba8-712b-4b9c-b3d3-9d6fe9d7e618.lease
-rw-rw----. 1 vdsm kvm 1048576 21 feb 14.11 /rhev/data-center/mnt/yellow-vdsb.qa.lab.tlv.redhat.com:_Compute__NFS_alukiano_compute-ge-he-1/2a7334e7-c1d3-41fd-9552-2aacbfa4f9af/images/e62bf4a4-6132-4c14-8aba-f292febdc4f9/976ecba8-712b-4b9c-b3d3-9d6fe9d7e618.lease

[root@alma07 ~]# dd if=/rhev/data-center/mnt/yellow-vdsb.qa.lab.tlv.redhat.com:_Compute__NFS_alukiano_compute-ge-he-1/2a7334e7-c1d3-41fd-9552-2aacbfa4f9af/images/e62bf4a4-6132-4c14-8aba-f292febdc4f9/976ecba8-712b-4b9c-b3d3-9d6fe9d7e618.lease of=/dev/null bs=4k count=1
1+0 records in
1+0 records out
4096 bytes (4,1 kB) copied, 0,000121077 s, 33,8 MB/s

but then the VM fails to start:

2018-02-21 23:08:28,173+0200 ERROR (vm/28f35d31) [virt.vm] (vmId='28f35d31-d0fc-4902-a44c-9d2251f09e21') The vm start process failed (vm:939)
Traceback (most recent call last):
  File "/usr/lib/python2.7/site-packages/vdsm/virt/vm.py", line 868, in _startUnderlyingVm
    self._run()
  File "/usr/lib/python2.7/site-packages/vdsm/virt/vm.py", line 2774, in _run
    dom.createWithFlags(flags)
  File "/usr/lib/python2.7/site-packages/vdsm/common/libvirtconnection.py", line 130, in wrapper
    ret = f(*args, **kwargs)
  File "/usr/lib/python2.7/site-packages/vdsm/common/function.py", line 92, in wrapper
    return func(inst, *args, **kwargs)
  File "/usr/lib64/python2.7/site-packages/libvirt.py", line 1099, in createWithFlags
    if ret == -1: raise libvirtError ('virDomainCreateWithFlags() failed', dom=self)
libvirtError: internal error: qemu unexpectedly closed the monitor: 2018-02-21T21:08:27.917516Z qemu-kvm: -drive file=/var/run/vdsm/storage/2a7334e7-c1d3-41fd-9552-2aacbfa4f9af/e62bf4a4-6132-4c14-8aba-f292febdc4f9/976ecba8-712b-4b9c-b3d3-9d6fe9d7e618,format=raw,if=none,id=drive-virtio-disk0,serial=e62bf4a4-6132-4c14-8aba-f292febdc4f9,cache=none,werror=stop,rerror=stop,aio=threads: 'serial' is deprecated, please use the corresponding option of '-device' instead
2018-02-21T21:08:27.950047Z qemu-kvm: -device virtio-blk-pci,scsi=off,bus=pci.0,addr=0x6,drive=drive-virtio-disk0,id=virtio-disk0,bootindex=1: Failed to get "write" lock
Is another process using the image?

SELinux is clean:

[root@alma07 ~]# ausearch -m avc
<no matches>

Attaching the whole vdsm log file.

Created attachment 1399037 [details]
vdsm logs
[root@alma07 ~]# vdsm-client Volume getInfo volumeID=976ecba8-712b-4b9c-b3d3-9d6fe9d7e618 imageID=e62bf4a4-6132-4c14-8aba-f292febdc4f9 storagepoolID=00000000-0000-0000-0000-000000000000 storagedomainID=2a7334e7-c1d3-41fd-9552-2aacbfa4f9af
{
    "status": "OK",
    "lease": {
        "owners": [ 1 ],
        "version": 6
    },
    "domain": "2a7334e7-c1d3-41fd-9552-2aacbfa4f9af",
    "capacity": "53687091200",
    "voltype": "LEAF",
    "description": "Hosted Engine Image",
    "parent": "00000000-0000-0000-0000-000000000000",
    "format": "RAW",
    "generation": 0,
    "image": "e62bf4a4-6132-4c14-8aba-f292febdc4f9",
    "uuid": "976ecba8-712b-4b9c-b3d3-9d6fe9d7e618",
    "disktype": "2",
    "legality": "LEGAL",
    "mtime": "0",
    "apparentsize": "53687091200",
    "truesize": "5334265856",
    "type": "SPARSE",
    "children": [],
    "pool": "",
    "ctime": "1519059771"
}

(In reply to Nir Soffer from comment #22)
> The error we see come from qemu, looks like libvirt local image locking
> conflicts with qemu image locking.

I don't see any evidence that libvirt is doing locking on this image. Libvirt's fcntl based locking is disabled by default and OVirt has presumably enabled sanlock instead. Even if libvirt's fcntl locks were enabled, libvirt locks at a different byte offset to QEMU so they can co-exist.

The error message:

qemu-kvm: -device virtio-blk-pci,scsi=off,bus=pci.0,addr=0x6,drive=drive-virtio-disk0,id=virtio-disk0,bootindex=1: Failed to get "write" lock
Is another process using the image?
Is referring to the disk:

<disk snapshot="no" type="file" device="disk">
    <target dev="vda" bus="virtio" />
    <source file="/rhev/data-center/00000000-0000-0000-0000-000000000000/2a7334e7-c1d3-41fd-9552-2aacbfa4f9af/images/e62bf4a4-6132-4c14-8aba-f292febdc4f9/976ecba8-712b-4b9c-b3d3-9d6fe9d7e618" />
    <driver name="qemu" io="threads" type="raw" error_policy="stop" cache="none" />
    <address bus="0x00" domain="0x0000" function="0x0" slot="0x06" type="pci" />
    <serial>e62bf4a4-6132-4c14-8aba-f292febdc4f9</serial>
</disk>

So it simply appears that 2 processes both have

/rhev/data-center/00000000-0000-0000-0000-000000000000/2a7334e7-c1d3-41fd-9552-2aacbfa4f9af/images/e62bf4a4-6132-4c14-8aba-f292febdc4f9/976ecba8-712b-4b9c-b3d3-9d6fe9d7e618

open at the same time. If you don't have 2 QEMUs running with it at once, perhaps you have a qemu-img process with it open, or qemu-nbd.

(In reply to Daniel Berrange from comment #32)

Vdsm is not accessing the image when starting a vm with qemu-img or qemu-nbd. I think this bug should move to qemu to investigate why locking the image failed.

(In reply to Simone Tiraboschi from comment #31)
> [root@alma07 ~]# vdsm-client Volume getInfo
> ...
> "lease": {
>     "owners": [ 1 ],
>     "version": 6
> },

This shows that the volume lease xml was generated correctly and libvirt acquired the lease.

(In reply to Nir Soffer from comment #33)
> (In reply to Daniel Berrange from comment #32)
> ...
> > So it simply appears that 2 processes both have
> >
> > /rhev/data-center/00000000-0000-0000-0000-000000000000/2a7334e7-c1d3-41fd-9552-2aacbfa4f9af/images/e62bf4a4-6132-4c14-8aba-f292febdc4f9/976ecba8-712b-4b9c-b3d3-9d6fe9d7e618
> >
> > open at the same time. If you don't have 2 QEMU's running with it at once,
> > perhaps you have a qemu-img process with it open, or qemu-nbd.
>
> Vdsm is not accessing the image when starting a vm with qemu-img or qemu-nbd.
> I think this bug should move to qemu to investigate why locking the image
> failed.

Something must be accessing it to get this error - try running 'lslocks' on the server in question when the error happens to see what other processes have locks open on that file.

Is the same image, pointed to as /rhev/data-center/00000000-0000-0000-0000-000000000000/2a7334e7-c1d3-41fd-9552-2aacbfa4f9af/images/e62bf4a4-6132-4c14-8aba-f292febdc4f9/976ecba8-712b-4b9c-b3d3-9d6fe9d7e618, already open by a QEMU process running on another host (since the $subject says "on a different host")? If so, at the libvirt level, "<shareable />" must be used for this setup to work, because from QEMU's point of view, this image _is_ shared.

The VM was running on host1 and so that lease was open there; then we forcefully shut down host1 with 'poweroff -f', and we are not able, even after many hours, to restart that VM on host2.

Presumably this image is on NFS. Forcibly shutting down an NFS client does *not* release any fcntl() locks it held.
IIRC, the locks will only get released when that NFS client boots up and comes back online and flushes stale state on the NFS server.

We cannot use file-based locking which is not released when qemu is killed. I think the qemu locking is not compatible with oVirt file-based storage, and must be disabled in this case. We should use it only for localfs storage. To use qemu locking, qemu must use a local resource (e.g. a semaphore or a local file) for locking.

(In reply to Nir Soffer from comment #42)
> (In reply to Daniel Berrange from comment #40)
> > Presumably this image is on NFS. Forcibly shutting down an NFS client does
> > *not* release any fcntl() locks it held.
>
> We cannot use file-based locking which is not released when qemu is killed.

The locks *are* released when QEMU is killed. The problem you've hit here is when the *host* is killed and then never powered back on.

(In reply to Daniel Berrange from comment #43)
> The locks *are* released when QEMU is killed. The problem you've hit here is
> when the *host* is killed and then never powered back on.

Exactly. As soon as the host is powered on again, we are able to restart the VM on any other host.

The share is on NFS v3!!!
The share is under /Compute_NFS, but we have a lot of locks there on the storage server side:

lslocks -o COMMAND,PID,TYPE,SIZE,MODE,M,START,END,PATH,BLOCKER
COMMAND PID TYPE SIZE MODE M START END PATH BLOCKER
libvirtd 1234 POSIX 4B WRITE 0 0 0 /run/libvirtd.pid
nfsd 2442 LEASE 0B READ 0 0 0 /Storage_NFS
nfsd 2442 LEASE 0B READ 0 0 0 /Storage_NFS
nfsd 2437 LEASE 0B READ 0 0 0 /Storage_NFS
nfsd 2442 LEASE 0B READ 0 0 0 /Compute_NFS
nfsd 2440 LEASE 0B READ 0 0 0 /Compute_NFS
nfsd 2442 LEASE 0B READ 0 0 0 /Compute_NFS
nfsd 2439 LEASE 0B READ 0 0 0 /Compute_NFS
nfsd 2441 LEASE 0B READ 0 0 0 /RHV_NFS
lockd 2432 POSIX 0B READ 0 201 201 /Compute_NFS
nfsd 2440 POSIX 0B READ 0 100 101 /Compute_NFS
nfsd 2440 POSIX 0B READ 0 103 103 /Compute_NFS
nfsd 2440 POSIX 0B READ 0 201 201 /Compute_NFS
nfsd 2440 POSIX 0B READ 0 203 203 /Compute_NFS
nfsd 2440 POSIX 0B READ 0 100 101 /Compute_NFS
nfsd 2440 POSIX 0B READ 0 103 103 /Compute_NFS
nfsd 2440 POSIX 0B READ 0 201 201 /Compute_NFS
nfsd 2440 POSIX 0B READ 0 203 203 /Compute_NFS
(unknown) 1306 FLOCK 0B WRITE 0 0 0 /run
nfsd 2442 LEASE 0B READ 0 0 0 /Compute_NFS
nfsd 2442 LEASE 0B READ 0 0 0 /Storage_NFS
nfsd 2442 LEASE 0B READ 0 0 0 /Storage_NFS
nfsd 2442 LEASE 0B READ 0 0 0 /Storage_NFS
nfsd 2438 LEASE 0B READ 0 0 0 /Storage_NFS
nfsd 2442 LEASE 0B READ 0 0 0 /Storage_NFS
nfsd 2441 LEASE 0B READ 0 0 0 /Storage_NFS
nfsd 2441 LEASE 0B READ 0 0 0 /Storage_NFS
lockd 2432 POSIX 0B READ 0 100 101 /Compute_NFS
nfsd 2439 LEASE 0B READ 0 0 0 /
lvmetad 471 POSIX 4B WRITE 0 0 0 /run/lvmetad.pid
abrtd 695 POSIX 4B WRITE 0 0 0 /run/abrt/abrtd.pid
rhsmcertd 1258 FLOCK 0B WRITE 0 0 0 /run/lock/subsys/rhsmcertd
nfsd 2441 LEASE 0B READ 0 0 0 /Compute_NFS
nfsd 2441 LEASE 0B READ 0 0 0 /Storage_NFS
nfsd 2441 LEASE 0B READ 0 0 0 /RHV_NFS
nfsd 2439 LEASE 0B READ 0 0 0 /Storage_NFS
nfsd 2442 LEASE 0B READ 0 0 0 /Compute_NFS
lockd 2432 POSIX 0B READ 0 100 101 /Compute_NFS
nfsd 2442 LEASE 0B READ 0 0 0 /Compute_NFS
nfsd 2440 POSIX 0B READ 0 100 101 /RHV_NFS
iscsid 1312 POSIX 5B WRITE 0 0 0 /run/iscsid.pid
nfsd 2441 LEASE 0B READ 0 0 0 /Storage_NFS
nfsd 2435 LEASE 0B READ 0 0 0 /Compute_NFS
nfsd 2442 LEASE 0B READ 0 0 0 /Storage_NFS
nfsd 2441 LEASE 0B READ 0 0 0 /Compute_NFS
nfsd 2442 LEASE 0B READ 0 0 0 /Compute_NFS
nfsd 2441 LEASE 0B READ 0 0 0 /Compute_NFS
nfsd 2439 POSIX 0B READ 0 201 201 /RHV_NFS
nfsd 2442 LEASE 0B READ 0 0 0 /Compute_NFS
nfsd 2442 POSIX 0B READ 0 100 101 /Compute_NFS
nfsd 2442 POSIX 0B READ 0 201 201 /Compute_NFS
multipathd 500 POSIX 3B WRITE 0 0 0 /run/multipathd/multipathd.pid
crond 1192 FLOCK 5B WRITE 0 0 0 /run/crond.pid
nfsd 2442 LEASE 0B READ 0 0 0 /Storage_NFS
nfsd 2440 LEASE 0B READ 0 0 0 /Storage_NFS
nfsd 2440 LEASE 0B READ 0 0 0 /Storage_NFS
nfsd 2441 LEASE 0B READ 0 0 0 /Storage_NFS
nfsd 2442 LEASE 0B READ 0 0 0 /Storage_NFS
nfsd 2442 LEASE 0B READ 0 0 0 /Storage_NFS
nfsd 2442 LEASE 0B READ 0 0 0 /Storage_NFS
nfsd 2441 LEASE 0B READ 0 0 0 /Storage_NFS
nfsd 2442 LEASE 0B READ 0 0 0 /Storage_NFS
nfsd 2442 LEASE 0B READ 0 0 0 /Storage_NFS
nfsd 2442 LEASE 0B READ 0 0 0 /Storage_NFS
nfsd 2442 LEASE 0B READ 0 0 0 /Storage_NFS
nfsd 2442 LEASE 0B READ 0 0 0 /Storage_NFS
nfsd 2442 LEASE 0B READ 0 0 0 /Storage_NFS
nfsd 2441 LEASE 0B READ 0 0 0 /Storage_NFS
nfsd 2442 LEASE 0B READ 0 0 0 /Storage_NFS
nfsd 2442 LEASE 0B READ 0 0 0 /Storage_NFS
nfsd 2442 LEASE 0B READ 0 0 0 /Storage_NFS
nfsd 2441 LEASE 0B READ 0 0 0 /Storage_NFS
nfsd 2442 LEASE 0B READ 0 0 0 /Storage_NFS
nfsd 2442 LEASE 0B READ 0 0 0 /Storage_NFS
nfsd 2441 LEASE 0B READ 0 0 0 /Storage_NFS
nfsd 2442 LEASE 0B READ 0 0 0 /Storage_NFS
nfsd 2442 LEASE 0B READ 0 0 0 /RHV_NFS
nfsd 2442 LEASE 0B READ 0 0 0 /RHV_NFS
nfsd 2442 LEASE 0B READ 0 0 0 /Storage_NFS
nfsd 2442 LEASE 0B READ 0 0 0 /Storage_NFS
nfsd 2442 LEASE 0B READ 0 0 0 /Storage_NFS
nfsd 2442 LEASE 0B READ 0 0 0 /Storage_NFS
nfsd 2442 POSIX 0B READ 0 100 101 /Compute_NFS
nfsd 2442 POSIX 0B READ 0 103 103 /Compute_NFS
lockd 2432 POSIX 0B READ 0 100 101 /Compute_NFS
lockd 2432 POSIX 0B READ 0 103 103 /Compute_NFS
lockd 2432 POSIX 0B READ 0 201 201 /Compute_NFS
lockd 2432 POSIX 0B READ 0 203 203 /Compute_NFS
lockd 2432 POSIX 0B READ 0 100 100 /Compute_NFS
lockd 2432 POSIX 0B READ 0 201 201 /Compute_NFS
lockd 2432 POSIX 0B READ 0 203 203 /Compute_NFS
nfsd 2442 LEASE 0B READ 0 0 0 /Storage_NFS
nfsd 2442 POSIX 0B READ 0 100 101 /Compute_NFS
nfsd 2441 POSIX 0B READ 0 201 201 /Compute_NFS
nfsd 2442 LEASE 0B READ 0 0 0 /Storage_NFS
nfsd 2442 LEASE 0B READ 0 0 0 /Compute_NFS
nfsd 2435 LEASE 0B READ 0 0 0 /Storage_NFS
nfsd 2442 LEASE 0B READ 0 0 0 /Storage_NFS
nfsd 2442 LEASE 0B READ 0 0 0 /Compute_NFS
nfsd 2441 LEASE 0B READ 0 0 0 /Compute_NFS
nfsd 2441 POSIX 0B READ 0 201 201 /Compute_NFS
nfsd 2441 POSIX 0B READ 0 203 203 /Compute_NFS
nfsd 2441 POSIX 0B READ 0 100 101 /RHV_NFS
nfsd 2442 POSIX 0B READ 0 201 201 /RHV_NFS
atd 1194 POSIX 5B WRITE 0 0 0 /run/atd.pid
master 1686 FLOCK 33B WRITE 0 0 0 /var/spool/postfix/pid/master.pid
master 1686 FLOCK 33B WRITE 0 0 0 /var/lib/postfix/master.lock
nfsd 2437 LEASE 0B READ 0 0 0 /Compute_NFS
nfsd 2435 LEASE 0B READ 0 0 0 /Compute_NFS
nfsd 2441 LEASE 0B READ 0 0 0 /Storage_NFS
nfsd 2439 LEASE 0B READ 0 0 0 /Compute_NFS
nfsd 2440 LEASE 0B READ 0 0 0 /Storage_NFS
nfsd 2436 LEASE 0B READ 0 0 0 /Storage_NFS
nfsd 2440 LEASE 0B READ 0 0 0 /Compute_NFS
nfsd 2442 LEASE 0B READ 0 0 0 /Compute_NFS
nfsd 2441 LEASE 0B READ 0 0 0 /Storage_NFS
nfsd 2442 LEASE 0B READ 0 0 0 /Storage_NFS
nfsd 2440 LEASE 0B READ 0 0 0 /Compute_NFS
nfsd 2442 LEASE 0B READ 0 0 0 /Compute_NFS
lockd 2432 POSIX 0B READ 0 100 101 /RHV_NFS
lockd 2432 POSIX 0B READ 0 103 103 /RHV_NFS
lockd 2432 POSIX 0B READ 0 201 201 /RHV_NFS
lockd 2432 POSIX 0B READ 0 203 203 /RHV_NFS
lockd 2432 POSIX 0B READ 0 100 100 /RHV_NFS
lockd 2432 POSIX 0B READ 0 201 201 /RHV_NFS
lockd 2432 POSIX 0B READ 0 203 203 /RHV_NFS
nfsd 2441 LEASE 0B READ 0 0 0 /Compute_NFS
lockd 2432 POSIX 0B READ 0 100 101 /QE_images
lockd 2432 POSIX 0B READ 0 103 103 /QE_images
lockd 2432 POSIX 0B READ 0 201 201 /QE_images
lockd 2432 POSIX 0B READ 0 203 203 /QE_images
lockd 2432 POSIX 0B READ 0 100 100 /QE_images
lockd 2432 POSIX 0B READ 0 201 201 /QE_images
lockd 2432 POSIX 0B READ 0 203 203 /QE_images

(In reply to Simone Tiraboschi from comment #44)
> (In reply to Daniel Berrange from comment #43)
> > The locks *are* released when QEMU is killed. The problem you've hit here is
> > when the *host* is killed and then never powered back on.
>
> Exactly.
> As soon as the host got power on again, we are able to restart the VM on any
> other host.
>
> The share is on NFS v3!!!

FYI, I get the impression locking works better on NFS v4, as it's a standard part of the protocol rather than an out-of-band side service. This might mean dead client detection is better, but I have no test env available to test this. Regardless, NFSv4 is generally a better choice than v3 no matter what.

(In reply to Daniel Berrange from comment #43)
> The locks *are* released when QEMU is killed. The problem you've hit here is
> when the *host* is killed and then never powered back on.

This is the same issue from our point of view. We cannot use locking that requires the host to be up again. If a host loses power, we must be able to start a VM on another host.

We are using a sanlock lease to make this safe, and it supports this use case.

How do we disable locking in the libvirt xml?

(In reply to Nir Soffer from comment #46)
> This is the same issue from our point of view. We cannot use locking that
> requires the host to be up again. If a host loses power, we must be able to
> start a VM on another host.
>
> We are using a sanlock lease to make this safe, and it supports this use case.
>
> How do we disable locking in the libvirt xml?
There's no support for controlling QEMU file locking in libvirt at this time - it was turned on unconditionally in QEMU with no interaction from libvirt.

Fam, this looks like another backward-incompatible change in qemu that may be good for some users but is not compatible with the RHV use case. Can we disable locking in qemu-rhev until we have a better solution?

This now seems like a result of a rare host crash combined with the odd NFSv3 behavior. I'm not sure it is worth reverting QEMU image locking, as we lose all the protection just for that.

(In reply to Fam Zheng from comment #49)
> This now seems like a result of a rare host crash combined with the odd
> NFSv3 behavior. I'm not sure it is worth reverting QEMU image locking as we
> lose all the protection just for that.

HA is there just to restart VMs on host failures: this can break HA capabilities. On oVirt we also have host fencing via IPMI and via sanlock for network-unresponsive hosts, so a sudden host reboot is not that rare an event.

Moving to qemu since the issue seems to be there.

(In reply to Fam Zheng from comment #49)
> This now seems like a result of a rare host crash combined with the odd
> NFSv3 behavior. I'm not sure it is worth reverting QEMU image locking as we
> lose all the protection just for that.

Sadly, it's what RHV supports and customers rely on. QEMU's new locking is useless in RHV, hence the request to be able to either control that via libvirt or disable it unconditionally in qemu-kvm-rhev.

I just got an idea we could use in RHV, maybe. Mounting the NFS storage with "-o nolock" will disable locking, and so the other nodes will never learn about it. I am not sure if we do not use the lock for something else as well, though. Nir? What do you think? Daniel, Fam? How will qemu react to a FS with no locking support?

Btw Fam: a rare host crash (or lost connectivity) is exactly what all the distributed systems like RHV and OpenStack need to handle. And this will probably affect OpenStack as well.
(In reply to Martin Sivák from comment #53)
> Daniel, Fam? How will qemu react to FS with no locking support?

As long as the OFD lock API (fcntl(fd, F_OFD_SETLK, ..)) doesn't work, QEMU will disable image locking automatically.

> Btw Fam: rare host crash (or lost connectivity) is exactly what all the
> distributed systems like RHV and OpenStack need to handle. And this will
> probably affect OpenStack as well.

OK, thanks for explaining.

We already mount with local_lock=none; let's see if nolock works. Which is exactly what configures distributed locking:

local_lock: "If this option is not specified, or if none is specified, the client assumes that the locks are not local."

Disabling locking might actually do what we want, unless we are limited by something else:

nolock: "When using the nolock option, applications can lock files, but such locks provide exclusion only against other applications running on the same client. Remote applications are not affected by these locks."

(In reply to Martin Sivák from comment #56)
> nolock:
>
> "When using the nolock option, applications
> can lock files, but such locks provide exclusion only
> against other applications running on the same client.
> Remote applications are not affected by these locks."

I confirm that the issue is not reproducible mounting the NFS share with the nolock option. Not sure if this can introduce any side effect somewhere else.

FYI, I'm told by a storage maintainer that this is only really a problem with NFSv3. With NFSv4, locks use an active lease mechanism, with the client having to refresh the lease periodically for it to remain valid. So if you are using NFSv4 and the client dies with locks held, they should be revoked by the server after the lease renewal timeout is reached, allowing another host to acquire them.
Some more info about NFSv4 locking here:

https://www.ibm.com/support/knowledgecenter/en/SSLTBW_2.1.0/com.ibm.zos.v2r1.idan400/lockv4.htm

Given that NFSv3 is a legacy protocol, I don't think it justifies disabling locking on the QEMU side. The nolock mount option seems like a reasonable workaround for v3, if the sites in question really can't use v4.

On the other hand, we might not want to depend on two separate locking mechanisms at the same time, especially when one would be cluster-wide only on NFS. That would be a support nightmare. What we have now is "battle tested" and works even when we use LVM on top of iSCSI/FC or with Gluster as the backing storage tech. But we really need an answer from our storage folks here.

Btw, someone should tell the OpenStack team so they can check whether this affects them as well.

(In reply to Martin Sivák from comment #59)
> On the other hand, we might not want to depend on two separate locking
> mechanisms at the same time, especially when one would be cluster-wide only
> on NFS. That would be a support nightmare.

I agree, we don't want to depend on two locking solutions. We have a locking solution that works with *any* storage supported by RHV. We don't want to use a second locking solution that may work (we never tested it) on NFSv4.

I think the basic issue is (again) changing the default behavior of QEMU in a backward-incompatible way. This does not work for RHV and probably other solutions built on QEMU. This change will break existing RHV 4.1 installations, which must work with RHEL 7.5 - so we must have a solution for 7.5.
We cannot fix this using the NFS "nolock" option, since:
- RHV 4.1 does not support this option
- This option works only for NFS - we need a solution for GlusterFS, CephFS, or any other POSIX-like file system that RHV can use today

What we need is:
- 7.5: disable locking or make locking optional
- 7.6: if locking is made the default, add an option to disable it

We need these changes upstream as well - the changes break oVirt on Fedora 27.

(In reply to Nir Soffer from comment #61)
> (In reply to Martin Sivák from comment #59)
> > On the other hand, we might not want to depend on two separate locking
> > mechanisms at the same time. Especially when one would be cluster wide
> > only on NFS. That would be a support nightmare.
>
> I agree, we don't want to depend on 2 locking solutions. We have a locking
> solution that works with *any* storage supported by RHV. We don't want to
> use a second locking solution that may work (never tested it) on NFS 4.

The sanlock locking mechanism doesn't provide the same level of protection against data corruption that QEMU's built-in locking does, because it relies on everything being done via the RHV management app. If any application or administrator runs qemu-img / QEMU themselves, they're still at risk, which is what QEMU's locking protects against.

> I think the basic issue is (again), changing the default behavior of qemu
> in a backward incompatible way. This does not work for RHV and probably
> other solutions built on qemu.

From the OpenStack POV, the QEMU locking is welcome, as it adds protection against data corruption of images.

> We cannot fix this using the NFS "nolock" option, since:
> - RHV 4.1 does not support this option
> - This option works only for NFS - we need a solution for GlusterFS,
>   CephFS, or any other POSIX-like file system that RHV can use today

There's no evidence that any other filesystem besides obsolete NFSv3 has a problem that needs fixing, so it doesn't matter that "nolock" doesn't work with them.
> What we need is:
> - 7.5: disable locking or make locking optional
> - 7.6: if locking is made the default, add an option to disable it
>
> We need these changes upstream as well - the changes break oVirt on
> Fedora 27.

From the upstream / Fedora POV, RHV can be made to use the "nolock" option for NFSv3.

(In reply to Nir Soffer from comment #61)
> I agree, we don't want to depend on 2 locking solutions. We have a locking
> solution that works with *any* storage supported by RHV. We don't want to
> use a second locking solution that may work (never tested it) on NFS 4.

For some values of "working". The image locking in QEMU is made specifically for cases where users manually modify images (e.g. with qemu-img) while a VM is using them. If your locking were able to prevent this, we would have had quite a few fewer hard-to-debug bug reports that turned out not to be a corruption bug in the QEMU code, but simply a user error. This means the two locking solutions aren't protecting against the same thing, so neither of them is redundant.

> - This option works only for NFS - we need a solution for GlusterFS,
>   CephFS, or any other POSIX-like file system that RHV can use today

So we established that file locking is broken in NFSv3, which is hardly a QEMU bug, but an NFS one (and apparently one that is fixed in more recent NFS versions). Did you find out that all of Gluster, Ceph and whatever else you're using are broken, too? Nobody has mentioned this so far, and I would certainly hope that it's not the case.

*** Bug 1547033 has been marked as a duplicate of this bug. ***

(In reply to Daniel Berrange from comment #58)
> FYI, I'm told by a storage maintainer that this is only really a problem
> with NFSv3. With NFSv4, locks use an active lease mechanism with the client
> having to refresh the lease periodically for it to remain valid.
> So if you are using NFSv4 and the client dies with locks held, they should
> be revoked by the server after the lease renewal timeout is reached,
> allowing another host to acquire them.
>
> Some more info about NFSv4 locking here:
>
> https://www.ibm.com/support/knowledgecenter/en/SSLTBW_2.1.0/com.ibm.zos.v2r1.idan400/lockv4.htm
>
> Given that NFSv3 is a legacy protocol, I don't think it justifies disabling
> locking on the QEMU side. The nolock mount option seems like a reasonable
> workaround for v3, if the sites in question really can't use v4.

Can the RHV team test it with NFSv4 to confirm the behavior? What is the lease timeout there?

BTW, NFSv4 is the default in RHEL-7.

(In reply to Ademar Reis from comment #76)
> (In reply to Daniel Berrange from comment #58)
> > FYI, I'm told by a storage maintainer that this is only really a problem
> > with NFSv3. With NFSv4, locks use an active lease mechanism with the
> > client having to refresh the lease periodically for it to remain valid.
> > So if you are using NFSv4 and the client dies with locks held, they
> > should be revoked by the server after the lease renewal timeout is
> > reached, allowing another host to acquire them.
> >
> > Given that NFSv3 is a legacy protocol, I don't think it justifies
> > disabling locking on the QEMU side. The nolock mount option seems like a
> > reasonable workaround for v3, if the sites in question really can't use
> > v4.
>
> Can the RHV team test it with NFSv4 to confirm the behavior? What is the
> lease timeout there?
>
> BTW, NFSv4 is the default in RHEL-7.

Also needinfo(QE) for some exploratory testing. Ping Li: can you please reproduce it directly without RHV, using NFSv4, and see what kind of lease timeouts are involved? Thanks.
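As a small aid for that exploratory testing: on a Linux NFS server (knfsd), the NFSv4 lease duration that bounds how long a crashed client's locks linger is exposed via `/proc/fs/nfsd/nfsv4leasetime`. The helper below is an illustrative sketch (the function name is made up, and the `/proc` path only exists on a host actually running knfsd):

```python
import tempfile

def nfsv4_lease_seconds(path="/proc/fs/nfsd/nfsv4leasetime"):
    """Read the NFSv4 lease duration (in seconds) on a Linux NFS server.
    Locks held by a crashed NFSv4 client should be released roughly this
    long after its last lease renewal. Returns None when this host isn't
    running knfsd (the /proc file is absent)."""
    try:
        with open(path) as f:
            return int(f.read().strip())
    except (OSError, ValueError):
        return None

# Exercise against a stand-in file, since most hosts don't run knfsd:
with tempfile.NamedTemporaryFile(mode="w+") as f:
    f.write("90\n")
    f.flush()
    demo = nfsv4_lease_seconds(f.name)
```

Calling `nfsv4_lease_seconds()` with no argument on the actual NFS server would report the real lease timeout in effect there.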
(In reply to Ademar Reis from comment #76)
> Can the RHV team test it with NFSv4 to confirm the behavior? What is the
> lease timeout there?

We already tested that it's not reproducible on NFSv4: see https://bugzilla.redhat.com/show_bug.cgi?id=1547033#c4

> BTW, NFSv4 is the default in RHEL-7.

We also have to handle upgrades from systems deployed in the past, when NFSv3 was the default (at least on the RHV side).

IIUC, since the NFS server relies on the client notifying it when it comes back online to release the locks, there should be a way to fake that notification. I.e. once RHV has fenced the node to guarantee it is offline, RHV could issue a notification to the NFS server to force-release the dead node's locks. This is something that tools like Cluster Suite probably already know how to do, since HA deployments used NFSv3 for a long time before NFSv4 fixed the locking problems.

Looks like there's a similar case with Gluster, although the test case seems to be different (simply blocking the connection instead of a crash): https://bugzilla.redhat.com/show_bug.cgi?id=1550016

(In reply to Daniel Berrange from comment #79)
> IIUC, since the NFS server relies on the client notifying it when it comes
> back online to release the locks, there should be a way to fake that
> notification. I.e. once RHV has fenced the node to guarantee it is offline,
> RHV could issue a notification to the NFS server to force-release the dead
> node's locks.

We can do a lot of things, but we are nearing a release as well, and this is not something we can address without prior notice and planning.
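To find hosts upgraded from the NFSv3-default days that are still exposed, one could scan `/proc/mounts` for NFSv3 mounts that still use remote NLM locking. A hypothetical helper, not part of RHV - note the kernel reports a `nolock` mount in `/proc/mounts` as `local_lock=all` rather than the literal word `nolock`, so the check covers both spellings:

```python
def nfs3_mount_at_risk(mount_line):
    """Given one line in /proc/mounts format, return True for an NFSv3
    mount still using remote NLM locks (neither 'nolock' nor
    'local_lock=all' among its options) - the configuration that hits
    this bug after a host crash."""
    fields = mount_line.split()
    if len(fields) < 4:
        return False
    fstype, options = fields[2], fields[3].split(",")
    if fstype != "nfs" or "vers=3" not in options:
        return False  # NFSv4 (fstype "nfs4") and non-NFS mounts are fine
    return "nolock" not in options and "local_lock=all" not in options

# Sample lines as they might appear on an RHV host (illustrative only):
risky = ("srv:/export /rhev/data-center/mnt/srv:_export "
         "nfs rw,vers=3,addr=192.0.2.10 0 0")
safe = ("srv:/export /rhev/data-center/mnt/srv:_export "
        "nfs rw,vers=3,local_lock=all,addr=192.0.2.10 0 0")
```

In real use one would feed the helper each line of `open("/proc/mounts")` and flag the risky mounts for remount.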
(In reply to Daniel Berrange from comment #79)
> IIUC, since the NFS server relies on the client notifying it when it comes
> back online to release the locks, there should be a way to fake that
> notification. I.e. once RHV has fenced the node to guarantee it is offline,
> RHV could issue a notification to the NFS server to force-release the dead
> node's locks.

We can do this only as a manual override, when the user confirms that the host is not available. But this is not needed, since we are going to disable the NLM locks with NFSv3 by default. This gives the same protection as block storage - locks are local.

This has been worked around in RHV (NFSv3 mounts now use 'nolock'), and in Cinder they're defaulting to NFSv4 or documenting the limitations (see Bug 1556957). Hence I'm closing this BZ. Currently there are no plans to disable QEMU image locking.

*** Bug 1592582 has been marked as a duplicate of this bug. ***
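For reference, the exclusion QEMU's locking provides - and which survives on the local client even with 'nolock', per the nfs(5) text quoted earlier - can be reproduced with plain OFD locks: two independent opens of the same image conflict, which is exactly the "Failed to get "write" lock / Is another process using the image?" failure shown in this bug. A sketch assuming Linux (the `F_OFD_SETLK` value 37 and the `struct flock` packing are Linux-specific), not QEMU's actual code:

```python
import errno
import fcntl
import os
import struct
import tempfile

F_OFD_SETLK = getattr(fcntl, "F_OFD_SETLK", 37)  # Linux-specific value

def try_write_lock(fd):
    """Try a non-blocking whole-file OFD write lock; True on success."""
    # struct flock: l_type=F_WRLCK, everything else zero (whole file;
    # l_pid must be 0 for OFD locks).
    flock = struct.pack("hhllhh", fcntl.F_WRLCK, 0, 0, 0, 0, 0)
    try:
        fcntl.fcntl(fd, F_OFD_SETLK, flock)
        return True
    except OSError as e:
        if e.errno in (errno.EAGAIN, errno.EACCES):
            return False  # lock held by another open file description
        raise

# Two independent open()s of the same image conflict, even within one
# process, because OFD locks belong to the open file description:
with tempfile.NamedTemporaryFile() as img:
    fd1 = os.open(img.name, os.O_RDWR)
    fd2 = os.open(img.name, os.O_RDWR)
    first = try_write_lock(fd1)   # acquired
    second = try_write_lock(fd2)  # refused while fd1's lock is held
    os.close(fd1)
    os.close(fd2)
```

With 'nolock' on an NFSv3 mount this conflict is still enforced between processes on the same client, but never reaches the server - which is precisely why it no longer blocks restarting the VM on a different host after a crash.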