Bug 1740069

Summary: NovaEvacuate: InstanceHA evacuation fails with "Failed to get "write" lock Is another process using the image?" when using NFSv4
Product: Red Hat Enterprise Linux 7
Reporter: Yadnesh Kulkarni <ykulkarn>
Component: resource-agents
Assignee: Oyvind Albrigtsen <oalbrigt>
Status: CLOSED ERRATA
QA Contact: cluster-qe <cluster-qe>
Severity: high
Docs Contact:
Priority: high
Version: 7.7
CC: agk, aherr, cfeist, cluster-maint, dasmith, eglynn, fdinitto, jhakimra, jjoyce, jschluet, kchamart, kmehta, lmiccini, lyarwood, mschuppe, phagara, pkomarov, sbauza, sbradley, sgordon, slinaber, tvignaud, vromanso
Target Milestone: rc
Keywords: Reopened, Triaged, ZStream
Target Release: 7.9
Hardware: x86_64
OS: Linux
Whiteboard:
Fixed In Version: resource-agents-4.1.1-40.el7
Doc Type: If docs needed, set a value
Doc Text:
Story Points: ---
Clone Of:
: 1755760 1756262 1775587 (view as bug list)
Environment:
Last Closed: 2020-03-31 19:47:12 UTC
Type: Bug
Bug Blocks: 1755760, 1756262    

Description Yadnesh Kulkarni 2019-08-12 08:55:33 UTC
Description of problem:

When a compute node crashes, instance evacuation fails with the errors below on the destination node:
~~~
2019-08-07 20:32:32.176 1 ERROR nova.compute.manager [instance: 41916b40-6556-4f86-8ddb-3ec24c7990a5]   File "/usr/lib64/python2.7/site-packages/libvirt.py", line 1110, in createWithFlags
2019-08-07 20:32:32.176 1 ERROR nova.compute.manager [instance: 41916b40-6556-4f86-8ddb-3ec24c7990a5]     if ret == -1: raise libvirtError ('virDomainCreateWithFlags() failed', dom=self)
2019-08-07 20:32:32.176 1 ERROR nova.compute.manager [instance: 41916b40-6556-4f86-8ddb-3ec24c7990a5] libvirtError: internal error: qemu unexpectedly closed the monitor: 2019-08-07T18:32:31.158796Z qemu-kvm: -drive file=/var/lib/nova/instances/41916b40-6556-4f86-8ddb-3ec24c7990a5/disk,format=qcow2,if=none,id=drive-virtio-disk0,cache=none: Failed to get "write" lock
2019-08-07 20:32:32.176 1 ERROR nova.compute.manager [instance: 41916b40-6556-4f86-8ddb-3ec24c7990a5] Is another process using the image?
2019-08-07 20:32:32.176 1 ERROR nova.compute.manager [instance: 41916b40-6556-4f86-8ddb-3ec24c7990a5]
2019-08-07 20:32:32.588 1 INFO nova.compute.manager [req-9a9143dc-3361-4aa8-9337-a49b83d2ce9c 8fc652fb844344f89a58d8ee77c3b918 c9ca1d32e3754527801c00a473ce0a6b - default default] [instance: 41916b40-6556-4f86-8ddb-3ec24c7990a5] Successfully reverted task state from rebuild_spawning on failure for instance.
2019-08-07 20:32:32.594 1 ERROR oslo_messaging.rpc.server [req-9a9143dc-3361-4aa8-9337-a49b83d2ce9c 8fc652fb844344f89a58d8ee77c3b918 c9ca1d32e3754527801c00a473ce0a6b - default default] Exception during message handling: libvirtError: internal error: qemu unexpectedly closed the monitor: 2019-08-07T18:32:31.158796Z qemu-kvm: -drive file=/var/lib/nova/instances/41916b40-6556-4f86-8ddb-3ec24c7990a5/disk,format=qcow2,if=none,id=drive-virtio-disk0,cache=none: Failed to get "write" lock
Is another process using the image?
~~~
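
For reference, one way to confirm from the destination node that the image is still locked is to run "qemu-img info" against the disk: qemu-img takes a shared lock itself and fails with the same message for as long as the dead source host's NFSv4 lock/lease on the file is still held. The commands below are illustrative and not part of the original report:
~~~
# On the destination compute node; path taken from the log above.
# Fails with: Failed to get shared "write" lock / Is another process using the image?
qemu-img info /var/lib/nova/instances/41916b40-6556-4f86-8ddb-3ec24c7990a5/disk

# -U / --force-share skips image locking and only confirms the qcow2 itself is readable.
qemu-img info -U /var/lib/nova/instances/41916b40-6556-4f86-8ddb-3ec24c7990a5/disk
~~~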

Once the source compute node came back up, the following errors were observed in nova-compute:
~~~
2019-08-05 09:57:39.121 1 ERROR nova.compute.manager [req-5ff7be6c-7ea5-4969-9675-bbb48b8e072c - - - - -] Error updating resources for node vos13psctn103.corpnet.pl.: DiskNotFound: No disk at /var/lib/nova/instances/739e7248-6fd4-476c-8cf7-c833ae322ee4/disk
2019-08-05 09:57:39.121 1 ERROR nova.compute.manager Traceback (most recent call last):
2019-08-05 09:57:39.121 1 ERROR nova.compute.manager   File "/usr/lib/python2.7/site-packages/nova/compute/manager.py", line 7447, in update_available_resource_for_node
2019-08-05 09:57:39.121 1 ERROR nova.compute.manager     rt.update_available_resource(context, nodename)
2019-08-05 09:57:39.121 1 ERROR nova.compute.manager   File "/usr/lib/python2.7/site-packages/nova/compute/resource_tracker.py", line 690, in update_available_resource
2019-08-05 09:57:39.121 1 ERROR nova.compute.manager     resources = self.driver.get_available_resource(nodename)
2019-08-05 09:57:39.121 1 ERROR nova.compute.manager   File "/usr/lib/python2.7/site-packages/nova/virt/libvirt/driver.py", line 6485, in get_available_resource
2019-08-05 09:57:39.121 1 ERROR nova.compute.manager     disk_over_committed = self._get_disk_over_committed_size_total()
2019-08-05 09:57:39.121 1 ERROR nova.compute.manager   File "/usr/lib/python2.7/site-packages/nova/virt/libvirt/driver.py", line 8048, in _get_disk_over_committed_size_total
2019-08-05 09:57:39.121 1 ERROR nova.compute.manager     err_ctxt.reraise = False
2019-08-05 09:57:39.121 1 ERROR nova.compute.manager   File "/usr/lib/python2.7/site-packages/oslo_utils/excutils.py", line 220, in __exit__
2019-08-05 09:57:39.121 1 ERROR nova.compute.manager     self.force_reraise()
2019-08-05 09:57:39.121 1 ERROR nova.compute.manager   File "/usr/lib/python2.7/site-packages/oslo_utils/excutils.py", line 196, in force_reraise
2019-08-05 09:57:39.121 1 ERROR nova.compute.manager     six.reraise(self.type_, self.value, self.tb)
2019-08-05 09:57:39.121 1 ERROR nova.compute.manager   File "/usr/lib/python2.7/site-packages/nova/virt/libvirt/driver.py", line 7989, in _get_disk_over_committed_size_total
2019-08-05 09:57:39.121 1 ERROR nova.compute.manager     config, block_device_info)
2019-08-05 09:57:39.121 1 ERROR nova.compute.manager   File "/usr/lib/python2.7/site-packages/nova/virt/libvirt/driver.py", line 7891, in _get_instance_disk_info_from_config
2019-08-05 09:57:39.121 1 ERROR nova.compute.manager     qemu_img_info = disk_api.get_disk_info(path)
2019-08-05 09:57:39.121 1 ERROR nova.compute.manager   File "/usr/lib/python2.7/site-packages/nova/virt/disk/api.py", line 136, in get_disk_info
2019-08-05 09:57:39.121 1 ERROR nova.compute.manager     return images.qemu_img_info(path)
2019-08-05 09:57:39.121 1 ERROR nova.compute.manager   File "/usr/lib/python2.7/site-packages/nova/virt/images.py", line 57, in qemu_img_info
2019-08-05 09:57:39.121 1 ERROR nova.compute.manager     raise exception.DiskNotFound(location=path)
2019-08-05 09:57:39.121 1 ERROR nova.compute.manager DiskNotFound: No disk at /var/lib/nova/instances/739e7248-6fd4-476c-8cf7-c833ae322ee4/disk
~~~

Environment:
RHOSP 13.0.7 with Instance HA and NFSv4 backend for Nova

Comment 2 Yadnesh Kulkarni 2019-08-14 12:06:43 UTC
In my test environment I remounted /var/lib/nova/instances on all compute nodes with NFS version 3 instead of 4, and additionally used the "nolock" mount option.

See the configuration below from one of my compute nodes:
~~~
[root@overcloud-novacomputeiha-0 ~]# cat /etc/fstab 
LABEL=img-rootfs / xfs defaults 0 1
192.168.122.1:/home/nova /var/lib/nova/instances nfs _netdev,bg,nolock,context=system_u:object_r:nfs_t:s0,vers=3,nfsvers=3 0 0
~~~

With this workaround I was able to evacuate instances.
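
For completeness, the same change could be applied to a running compute node roughly as follows (illustrative commands, assuming the share can be unmounted cleanly while the services using it are stopped):
~~~
# Stop nova_compute / libvirt on the node first, then remount with NFSv3 and nolock.
umount /var/lib/nova/instances
mount -t nfs -o _netdev,bg,nolock,vers=3,context=system_u:object_r:nfs_t:s0 \
    192.168.122.1:/home/nova /var/lib/nova/instances
~~~
Note that this workaround is NACKed in comment 3 below, since it drops the NFSv4 locking that protects the disk during evacuation.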

Comment 3 Lee Yarwood 2019-08-15 11:46:48 UTC
(In reply to Yadnesh Kulkarni from comment #0)
> 2019-08-05 09:57:39.121 1 ERROR nova.compute.manager DiskNotFound: No disk
> at /var/lib/nova/instances/739e7248-6fd4-476c-8cf7-c833ae322ee4/disk

That's a different disk for a different instance; we should ignore it in the context of this evacuation bug.

(In reply to Yadnesh Kulkarni from comment #2)
> In my test environment I remounted /var/lib/nova/instances on all compute
> nodes with NFS version 3 instead of 4, and additionally used the "nolock"
> mount option.
> 
> See the configuration below from one of my compute nodes:
> ~~~
> [root@overcloud-novacomputeiha-0 ~]# cat /etc/fstab 
> LABEL=img-rootfs / xfs defaults 0 1
> 192.168.122.1:/home/nova /var/lib/nova/instances nfs
> _netdev,bg,nolock,context=system_u:object_r:nfs_t:s0,vers=3,nfsvers=3 0 0
> ~~~
> 
> With this workaround I was able to evacuate instances.

NACK; we actually want the locking provided by NFSv4 here, to ensure the destination instance only starts once the source is really dead and no longer accessing the disk(s).

Which NFS backend is being used here, and what is the currently configured lease timeout? I assume the instance is being evacuated before the lease held by the original compute host has timed out.
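
For readers hitting this: the server-side knob in question is the NFSv4 lease time, after which locks held by a dead client are released. The sketch below assumes a Linux (knfsd) NFS server; appliance backends (e.g. NetApp) configure the lease timeout differently, so treat these paths as an assumption rather than part of the original report:
~~~
# Current NFSv4 lease time, in seconds; locks held by the failed compute node
# are released once this lease expires.
cat /proc/fs/nfsd/nfsv4leasetime

# With newer nfs-utils the value can be set persistently in /etc/nfs.conf
# ([nfsd] section, lease-time= and grace-time=) and the NFS server restarted.
~~~
The point of the NACK above is not to drop this locking, but to make sure the evacuation only proceeds after the source host's lease has expired.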

Comment 23 errata-xmlrpc 2020-03-31 19:47:12 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2020:1067