Bug 1740069
| Summary: | NovaEvacuate: InstanceHA evacuation fails with "Failed to get "write" lock Is another process using the image?" when using NFSv4 | | | |
|---|---|---|---|---|
| Product: | Red Hat Enterprise Linux 7 | Reporter: | Yadnesh Kulkarni <ykulkarn> | |
| Component: | resource-agents | Assignee: | Oyvind Albrigtsen <oalbrigt> | |
| Status: | CLOSED ERRATA | QA Contact: | cluster-qe <cluster-qe> | |
| Severity: | high | Docs Contact: | | |
| Priority: | high | | | |
| Version: | 7.7 | CC: | agk, aherr, cfeist, cluster-maint, dasmith, eglynn, fdinitto, jhakimra, jjoyce, jschluet, kchamart, kmehta, lmiccini, lyarwood, mschuppe, phagara, pkomarov, sbauza, sbradley, sgordon, slinaber, tvignaud, vromanso | |
| Target Milestone: | rc | Keywords: | Reopened, Triaged, ZStream | |
| Target Release: | 7.9 | | | |
| Hardware: | x86_64 | | | |
| OS: | Linux | | | |
| Whiteboard: | | | | |
| Fixed In Version: | resource-agents-4.1.1-40.el7 | Doc Type: | If docs needed, set a value | |
| Doc Text: | | Story Points: | --- | |
| Clone Of: | | | | |
| : | 1755760 1756262 1775587 (view as bug list) | Environment: | | |
| Last Closed: | 2020-03-31 19:47:12 UTC | Type: | Bug | |
| Regression: | --- | Mount Type: | --- | |
| Documentation: | --- | CRM: | | |
| Verified Versions: | | Category: | --- | |
| oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: | | |
| Cloudforms Team: | --- | Target Upstream Version: | | |
| Embargoed: | | | | |
| Bug Depends On: | | | | |
| Bug Blocks: | 1755760, 1756262 | | | |
In my test environment, on all the compute nodes I remounted /var/lib/nova/instances with NFS version 3 instead of 4, and also used the "nolock" option.

See the configuration below from my compute node:

~~~
[root@overcloud-novacomputeiha-0 ~]# cat /etc/fstab
LABEL=img-rootfs / xfs defaults 0 1
192.168.122.1:/home/nova /var/lib/nova/instances nfs _netdev,bg,nolock,context=system_u:object_r:nfs_t:s0,vers=3,nfsvers=3 0 0
~~~

With this workaround I was able to evacuate instances.

(In reply to Yadnesh Kulkarni from comment #0)
> 2019-08-05 09:57:39.121 1 ERROR nova.compute.manager DiskNotFound: No disk
> at /var/lib/nova/instances/739e7248-6fd4-476c-8cf7-c833ae322ee4/disk

That's a different disk for a different instance; we should ignore it in the context of this evacuation bug.

(In reply to Yadnesh Kulkarni from comment #2)
> In my test environment, on all the compute nodes I remounted
> /var/lib/nova/instances with NFS version 3 instead of 4, and also used
> the "nolock" option.
>
> See the configuration below from my compute node:
> ~~~
> [root@overcloud-novacomputeiha-0 ~]# cat /etc/fstab
> LABEL=img-rootfs / xfs defaults 0 1
> 192.168.122.1:/home/nova /var/lib/nova/instances nfs
> _netdev,bg,nolock,context=system_u:object_r:nfs_t:s0,vers=3,nfsvers=3 0 0
> ~~~
>
> With this workaround I was able to evacuate instances.

NACK, we actually want the locking provided by NFSv4 here to ensure the destination instance only starts when the source is really dead and no longer accessing the disk(s).

Which NFS backend is being used here, and what is the currently configured lease timeout? I assume the instance is being evacuated before the lease held by the original compute has timed out.

Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA.

For information on the advisory, and where to find the updated files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2020:1067
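For reference, a minimal shell sketch tying together the workaround and the lease-timeout question above. The mount point and export are taken from the fstab entry in comment #2; switching NFS versions cannot be done with a live remount, so the export has to be unmounted first:

~~~
# Hypothetical reapplication of the NFSv3/nolock workaround on a compute node.
# The mount must be idle (no running instances) before it can be unmounted:
umount /var/lib/nova/instances
mount -t nfs -o _netdev,bg,nolock,vers=3 192.168.122.1:/home/nova /var/lib/nova/instances
~~~

On the lease-timeout question: with a Linux knfsd server, the NFSv4 lease time that determines how long a crashed client's locks survive can be read from procfs (this path will not exist on appliance NFS backends). The fixed resource-agents build is understood to add a delay option to the NovaEvacuate agent for the same purpose; the parameter and resource names below are assumptions, so verify them with `pcs resource describe NovaEvacuate` first:

~~~
# Server side (Linux knfsd): current NFSv4 lease time in seconds
cat /proc/fs/nfsd/nfsv4leasetime

# Cluster side: delay evacuation until the source host's lease has expired.
# "nova-evacuate" is the usual resource name in OSP Instance HA deployments
# and "evacuate_delay" is assumed to be the new parameter; check both first.
pcs resource update nova-evacuate evacuate_delay=120
~~~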
Description of problem:

When a compute node crashes, instance evacuation fails with the errors below (on the destination node):

~~~
2019-08-07 20:32:32.176 1 ERROR nova.compute.manager [instance: 41916b40-6556-4f86-8ddb-3ec24c7990a5]   File "/usr/lib64/python2.7/site-packages/libvirt.py", line 1110, in createWithFlags
2019-08-07 20:32:32.176 1 ERROR nova.compute.manager [instance: 41916b40-6556-4f86-8ddb-3ec24c7990a5]     if ret == -1: raise libvirtError ('virDomainCreateWithFlags() failed', dom=self)
2019-08-07 20:32:32.176 1 ERROR nova.compute.manager [instance: 41916b40-6556-4f86-8ddb-3ec24c7990a5] libvirtError: internal error: qemu unexpectedly closed the monitor: 2019-08-07T18:32:31.158796Z qemu-kvm: -drive file=/var/lib/nova/instances/41916b40-6556-4f86-8ddb-3ec24c7990a5/disk,format=qcow2,if=none,id=drive-virtio-disk0,cache=none: Failed to get "write" lock
2019-08-07 20:32:32.176 1 ERROR nova.compute.manager [instance: 41916b40-6556-4f86-8ddb-3ec24c7990a5] Is another process using the image?
2019-08-07 20:32:32.176 1 ERROR nova.compute.manager [instance: 41916b40-6556-4f86-8ddb-3ec24c7990a5]
2019-08-07 20:32:32.588 1 INFO nova.compute.manager [req-9a9143dc-3361-4aa8-9337-a49b83d2ce9c 8fc652fb844344f89a58d8ee77c3b918 c9ca1d32e3754527801c00a473ce0a6b - default default] [instance: 41916b40-6556-4f86-8ddb-3ec24c7990a5] Successfully reverted task state from rebuild_spawning on failure for instance.
2019-08-07 20:32:32.594 1 ERROR oslo_messaging.rpc.server [req-9a9143dc-3361-4aa8-9337-a49b83d2ce9c 8fc652fb844344f89a58d8ee77c3b918 c9ca1d32e3754527801c00a473ce0a6b - default default] Exception during message handling: libvirtError: internal error: qemu unexpectedly closed the monitor: 2019-08-07T18:32:31.158796Z qemu-kvm: -drive file=/var/lib/nova/instances/41916b40-6556-4f86-8ddb-3ec24c7990a5/disk,format=qcow2,if=none,id=drive-virtio-disk0,cache=none: Failed to get "write" lock
Is another process using the image?
~~~

Once the source compute comes back, the following logs were observed in nova-compute:

~~~
2019-08-05 09:57:39.121 1 ERROR nova.compute.manager [req-5ff7be6c-7ea5-4969-9675-bbb48b8e072c - - - - -] Error updating resources for node vos13psctn103.corpnet.pl.: DiskNotFound: No disk at /var/lib/nova/instances/739e7248-6fd4-476c-8cf7-c833ae322ee4/disk
2019-08-05 09:57:39.121 1 ERROR nova.compute.manager Traceback (most recent call last):
2019-08-05 09:57:39.121 1 ERROR nova.compute.manager   File "/usr/lib/python2.7/site-packages/nova/compute/manager.py", line 7447, in update_available_resource_for_node
2019-08-05 09:57:39.121 1 ERROR nova.compute.manager     rt.update_available_resource(context, nodename)
2019-08-05 09:57:39.121 1 ERROR nova.compute.manager   File "/usr/lib/python2.7/site-packages/nova/compute/resource_tracker.py", line 690, in update_available_resource
2019-08-05 09:57:39.121 1 ERROR nova.compute.manager     resources = self.driver.get_available_resource(nodename)
2019-08-05 09:57:39.121 1 ERROR nova.compute.manager   File "/usr/lib/python2.7/site-packages/nova/virt/libvirt/driver.py", line 6485, in get_available_resource
2019-08-05 09:57:39.121 1 ERROR nova.compute.manager     disk_over_committed = self._get_disk_over_committed_size_total()
2019-08-05 09:57:39.121 1 ERROR nova.compute.manager   File "/usr/lib/python2.7/site-packages/nova/virt/libvirt/driver.py", line 8048, in _get_disk_over_committed_size_total
2019-08-05 09:57:39.121 1 ERROR nova.compute.manager     err_ctxt.reraise = False
2019-08-05 09:57:39.121 1 ERROR nova.compute.manager   File "/usr/lib/python2.7/site-packages/oslo_utils/excutils.py", line 220, in __exit__
2019-08-05 09:57:39.121 1 ERROR nova.compute.manager     self.force_reraise()
2019-08-05 09:57:39.121 1 ERROR nova.compute.manager   File "/usr/lib/python2.7/site-packages/oslo_utils/excutils.py", line 196, in force_reraise
2019-08-05 09:57:39.121 1 ERROR nova.compute.manager     six.reraise(self.type_, self.value, self.tb)
2019-08-05 09:57:39.121 1 ERROR nova.compute.manager   File "/usr/lib/python2.7/site-packages/nova/virt/libvirt/driver.py", line 7989, in _get_disk_over_committed_size_total
2019-08-05 09:57:39.121 1 ERROR nova.compute.manager     config, block_device_info)
2019-08-05 09:57:39.121 1 ERROR nova.compute.manager   File "/usr/lib/python2.7/site-packages/nova/virt/libvirt/driver.py", line 7891, in _get_instance_disk_info_from_config
2019-08-05 09:57:39.121 1 ERROR nova.compute.manager     qemu_img_info = disk_api.get_disk_info(path)
2019-08-05 09:57:39.121 1 ERROR nova.compute.manager   File "/usr/lib/python2.7/site-packages/nova/virt/disk/api.py", line 136, in get_disk_info
2019-08-05 09:57:39.121 1 ERROR nova.compute.manager     return images.qemu_img_info(path)
2019-08-05 09:57:39.121 1 ERROR nova.compute.manager   File "/usr/lib/python2.7/site-packages/nova/virt/images.py", line 57, in qemu_img_info
2019-08-05 09:57:39.121 1 ERROR nova.compute.manager     raise exception.DiskNotFound(location=path)
2019-08-05 09:57:39.121 1 ERROR nova.compute.manager DiskNotFound: No disk at /var/lib/nova/instances/739e7248-6fd4-476c-8cf7-c833ae322ee4/disk
~~~

Environment:
RHOSP 13.0.7 with Instance HA and an NFSv4 backend for Nova
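The `Failed to get "write" lock` message comes from QEMU's image-file locking: the qemu-kvm process on the crashed source host still holds an exclusive lock on the qcow2 file, and over NFSv4 that lock persists as a lease until the server expires it. Below is a minimal sketch of observing this from the destination host while the stale lease is alive, using the instance UUID from the log above (`-U`/`--force-share` is the qemu-img flag that skips lock acquisition):

~~~
# A normal probe takes a shared lock and fails while the dead source's
# exclusive write lock is still leased, with the same message as the bug:
qemu-img info /var/lib/nova/instances/41916b40-6556-4f86-8ddb-3ec24c7990a5/disk

# A read-only probe that bypasses image locking should still succeed:
qemu-img info -U /var/lib/nova/instances/41916b40-6556-4f86-8ddb-3ec24c7990a5/disk
~~~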