Bug 1740069 - NovaEvacuate: InstanceHA evacuation fails with "Failed to get "write" lock Is another process using the image?" when using NFSv4
Summary: NovaEvacuate: InstanceHA evacuation fails with "Failed to get "write" lock Is another process using the image?" when using NFSv4
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: Red Hat Enterprise Linux 7
Classification: Red Hat
Component: resource-agents
Version: 7.7
Hardware: x86_64
OS: Linux
Priority: high
Severity: high
Target Milestone: rc
Target Release: 7.9
Assignee: Oyvind Albrigtsen
QA Contact: cluster-qe@redhat.com
URL:
Whiteboard:
Depends On:
Blocks: 1755760 1756262
 
Reported: 2019-08-12 08:55 UTC by Yadnesh Kulkarni
Modified: 2021-01-18 14:40 UTC (History)
CC List: 23 users

Fixed In Version: resource-agents-4.1.1-40.el7
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Clones: 1755760 1756262 1775587
Environment:
Last Closed: 2020-03-31 19:47:12 UTC
Target Upstream Version:
Embargoed:


Attachments


Links
System ID Private Priority Status Summary Last Updated
Red Hat Knowledge Base (Solution) 4456351 0 None None None 2019-10-02 19:55:00 UTC
Red Hat Product Errata RHBA-2020:1067 0 None None None 2020-03-31 19:48:00 UTC

Description Yadnesh Kulkarni 2019-08-12 08:55:33 UTC
Description of problem:

When a compute node crashes, instance evacuation fails with the errors below (seen on the destination node):
~~~
2019-08-07 20:32:32.176 1 ERROR nova.compute.manager [instance: 41916b40-6556-4f86-8ddb-3ec24c7990a5]   File "/usr/lib64/python2.7/site-packages/libvirt.py", line 1110, in createWithFlags
2019-08-07 20:32:32.176 1 ERROR nova.compute.manager [instance: 41916b40-6556-4f86-8ddb-3ec24c7990a5]     if ret == -1: raise libvirtError ('virDomainCreateWithFlags() failed', dom=self)
2019-08-07 20:32:32.176 1 ERROR nova.compute.manager [instance: 41916b40-6556-4f86-8ddb-3ec24c7990a5] libvirtError: internal error: qemu unexpectedly closed the monitor: 2019-08-07T18:32:31.158796Z qemu-kvm: -drive file=/var/lib/nova/instances/41916b40-6556-4f86-8ddb-3ec24c7990a5/disk,format=qcow2,if=none,id=drive-virtio-disk0,cache=none: Failed to get "write" lock
2019-08-07 20:32:32.176 1 ERROR nova.compute.manager [instance: 41916b40-6556-4f86-8ddb-3ec24c7990a5] Is another process using the image?
2019-08-07 20:32:32.176 1 ERROR nova.compute.manager [instance: 41916b40-6556-4f86-8ddb-3ec24c7990a5]
2019-08-07 20:32:32.588 1 INFO nova.compute.manager [req-9a9143dc-3361-4aa8-9337-a49b83d2ce9c 8fc652fb844344f89a58d8ee77c3b918 c9ca1d32e3754527801c00a473ce0a6b - default default] [instance: 41916b40-6556-4f86-8ddb-3ec24c7990a5] Successfully reverted task state from rebuild_spawning on failure for instance.
2019-08-07 20:32:32.594 1 ERROR oslo_messaging.rpc.server [req-9a9143dc-3361-4aa8-9337-a49b83d2ce9c 8fc652fb844344f89a58d8ee77c3b918 c9ca1d32e3754527801c00a473ce0a6b - default default] Exception during message handling: libvirtError: internal error: qemu unexpectedly closed the monitor: 2019-08-07T18:32:31.158796Z qemu-kvm: -drive file=/var/lib/nova/instances/41916b40-6556-4f86-8ddb-3ec24c7990a5/disk,format=qcow2,if=none,id=drive-virtio-disk0,cache=none: Failed to get "write" lock
Is another process using the image?
~~~
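A quick way to confirm that the image lock is still held is to query the disk from the destination node with qemu-img. This is a hedged diagnostic sketch, assuming the instance UUID from the log above and a qemu-img new enough (2.10+) to support image locking and --force-share:

~~~
# Fails with a lock error (e.g. 'Failed to get shared "write" lock') while
# the source qemu process (or its NFSv4 lock) still holds the image:
qemu-img info /var/lib/nova/instances/41916b40-6556-4f86-8ddb-3ec24c7990a5/disk

# Reads the image metadata without taking a lock (diagnostic use only):
qemu-img info --force-share /var/lib/nova/instances/41916b40-6556-4f86-8ddb-3ec24c7990a5/disk
~~~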

Once the source compute node comes back, the following logs are observed in nova-compute:
~~~
2019-08-05 09:57:39.121 1 ERROR nova.compute.manager [req-5ff7be6c-7ea5-4969-9675-bbb48b8e072c - - - - -] Error updating resources for node vos13psctn103.corpnet.pl.: DiskNotFound: No disk at /var/lib/nova/instances/739e7248-6fd4-476c-8cf7-c833ae322ee4/disk
2019-08-05 09:57:39.121 1 ERROR nova.compute.manager Traceback (most recent call last):
2019-08-05 09:57:39.121 1 ERROR nova.compute.manager   File "/usr/lib/python2.7/site-packages/nova/compute/manager.py", line 7447, in update_available_resource_for_node
2019-08-05 09:57:39.121 1 ERROR nova.compute.manager     rt.update_available_resource(context, nodename)
2019-08-05 09:57:39.121 1 ERROR nova.compute.manager   File "/usr/lib/python2.7/site-packages/nova/compute/resource_tracker.py", line 690, in update_available_resource
2019-08-05 09:57:39.121 1 ERROR nova.compute.manager     resources = self.driver.get_available_resource(nodename)
2019-08-05 09:57:39.121 1 ERROR nova.compute.manager   File "/usr/lib/python2.7/site-packages/nova/virt/libvirt/driver.py", line 6485, in get_available_resource
2019-08-05 09:57:39.121 1 ERROR nova.compute.manager     disk_over_committed = self._get_disk_over_committed_size_total()
2019-08-05 09:57:39.121 1 ERROR nova.compute.manager   File "/usr/lib/python2.7/site-packages/nova/virt/libvirt/driver.py", line 8048, in _get_disk_over_committed_size_total
2019-08-05 09:57:39.121 1 ERROR nova.compute.manager     err_ctxt.reraise = False
2019-08-05 09:57:39.121 1 ERROR nova.compute.manager   File "/usr/lib/python2.7/site-packages/oslo_utils/excutils.py", line 220, in __exit__
2019-08-05 09:57:39.121 1 ERROR nova.compute.manager     self.force_reraise()
2019-08-05 09:57:39.121 1 ERROR nova.compute.manager   File "/usr/lib/python2.7/site-packages/oslo_utils/excutils.py", line 196, in force_reraise
2019-08-05 09:57:39.121 1 ERROR nova.compute.manager     six.reraise(self.type_, self.value, self.tb)
2019-08-05 09:57:39.121 1 ERROR nova.compute.manager   File "/usr/lib/python2.7/site-packages/nova/virt/libvirt/driver.py", line 7989, in _get_disk_over_committed_size_total
2019-08-05 09:57:39.121 1 ERROR nova.compute.manager     config, block_device_info)
2019-08-05 09:57:39.121 1 ERROR nova.compute.manager   File "/usr/lib/python2.7/site-packages/nova/virt/libvirt/driver.py", line 7891, in _get_instance_disk_info_from_config
2019-08-05 09:57:39.121 1 ERROR nova.compute.manager     qemu_img_info = disk_api.get_disk_info(path)
2019-08-05 09:57:39.121 1 ERROR nova.compute.manager   File "/usr/lib/python2.7/site-packages/nova/virt/disk/api.py", line 136, in get_disk_info
2019-08-05 09:57:39.121 1 ERROR nova.compute.manager     return images.qemu_img_info(path)
2019-08-05 09:57:39.121 1 ERROR nova.compute.manager   File "/usr/lib/python2.7/site-packages/nova/virt/images.py", line 57, in qemu_img_info
2019-08-05 09:57:39.121 1 ERROR nova.compute.manager     raise exception.DiskNotFound(location=path)
2019-08-05 09:57:39.121 1 ERROR nova.compute.manager DiskNotFound: No disk at /var/lib/nova/instances/739e7248-6fd4-476c-8cf7-c833ae322ee4/disk
~~~

Environment:
RHOSP 13.0.7 with Instance HA and NFSv4 backend for Nova

Comment 2 Yadnesh Kulkarni 2019-08-14 12:06:43 UTC
In my test environment, I remounted /var/lib/nova/instances on all the compute nodes with NFS version 3 instead of 4, and also used the "nolock" option.

See the configuration below from my compute node:
~~~
[root@overcloud-novacomputeiha-0 ~]# cat /etc/fstab 
LABEL=img-rootfs / xfs defaults 0 1
192.168.122.1:/home/nova /var/lib/nova/instances nfs _netdev,bg,nolock,context=system_u:object_r:nfs_t:s0,vers=3,nfsvers=3 0 0
~~~

With this workaround I was able to evacuate instances.
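For reference, a minimal sketch of how the same remount could be applied to a running compute node (assuming the export and mount point from the fstab line above, and that nothing is currently using the mount; "vers=3" and "nfsvers=3" are equivalent, so one of them is sufficient):

~~~
# Unmount, then remount the instances directory with NFSv3 and nolock:
umount /var/lib/nova/instances
mount -t nfs -o _netdev,bg,nolock,context=system_u:object_r:nfs_t:s0,vers=3 \
    192.168.122.1:/home/nova /var/lib/nova/instances
~~~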

Comment 3 Lee Yarwood 2019-08-15 11:46:48 UTC
(In reply to Yadnesh Kulkarni from comment #0)
> 2019-08-05 09:57:39.121 1 ERROR nova.compute.manager DiskNotFound: No disk
> at /var/lib/nova/instances/739e7248-6fd4-476c-8cf7-c833ae322ee4/disk

That's a different disk for a different instance; we should ignore it in the context of this evacuation bug.

(In reply to Yadnesh Kulkarni from comment #2)
> In my test environment, on all the compute nodes I remounted
> /var/lib/nova/instances with NFS version to 3 instead of 4, along with that
> I have used "nolock" option.
> 
> Check below configuration from my compute node
> ~~~
> [root@overcloud-novacomputeiha-0 ~]# cat /etc/fstab 
> LABEL=img-rootfs / xfs defaults 0 1
> 192.168.122.1:/home/nova /var/lib/nova/instances nfs
> _netdev,bg,nolock,context=system_u:object_r:nfs_t:s0,vers=3,nfsvers=3 0 0
> ~~~
> 
> With this workaround I was able to evacuate instances.

NACK, we actually want the locking provided by NFSv4 here to ensure the destination instance only starts when the source is really dead and no longer accessing the disk(s).

Which NFS backend is being used here and what is the currently configured lease timeout? I assume the instance is being evacuated before the lease held by the original compute has timed out.
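For reference, on a Linux kernel NFS server the currently configured lease and grace periods can usually be read as follows (a hedged sketch; these paths apply to knfsd only and will differ for appliance-based or Ganesha-based NFS backends):

~~~
# NFSv4 lease and grace times, in seconds:
cat /proc/fs/nfsd/nfsv4leasetime
cat /proc/fs/nfsd/nfsv4gracetime
~~~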

Comment 23 errata-xmlrpc 2020-03-31 19:47:12 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2020:1067
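Once the advisory has been applied, the presence of the fix can be confirmed by comparing the installed package with the "Fixed In Version" field above (a minimal sketch):

~~~
# Should report resource-agents-4.1.1-40.el7 or later:
rpm -q resource-agents
~~~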

