Bug 1784343 - nova migration never finished due to: Timed out during operation: cannot acquire state change lock (held by monitor=remoteDispatchDomainGetJobStats)
Summary: nova migration never finished due to: Timed out during operation: cannot acquire state change lock (held by monitor=remoteDispatchDomainGetJobStats)
Keywords:
Status: CLOSED NOTABUG
Alias: None
Product: Red Hat OpenStack
Classification: Red Hat
Component: openstack-nova
Version: unspecified
Hardware: Unspecified
OS: Unspecified
Priority: unspecified
Severity: unspecified
Target Milestone: ---
Target Release: ---
Assignee: OSP DFG:Compute
QA Contact: OSP DFG:Compute
URL:
Whiteboard:
Depends On:
Blocks:
 
Reported: 2019-12-17 09:43 UTC by Andreas Karis
Modified: 2023-03-24 16:29 UTC
CC List: 7 users

Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed: 2019-12-18 14:52:35 UTC
Target Upstream Version:
Embargoed:



Description Andreas Karis 2019-12-17 09:43:08 UTC
Description of problem:

We see head-of-line blocking when migrating all instances from compute-2 to compute-3: `max_concurrent_live_migrations` is left at its default value of `1`, so only one live migration can run at a time, and the migration of instance-000005b1 (UUID 3e891943-d71e-4244-bf51-b5d6b4d79df4) has been running for 3 days, blocking all the others:
~~~
| 301 | de342ef8-a5e4-4ff7-9674-c5929d1e15a7 | overcloud-compute-2.localdomain | overcloud-compute-3.localdomain | overcloud-compute-2.localdomain | overcloud-compute-3.localdomain | -           | running   | 3e891943-d71e-4244-bf51-b5d6b4d79df4 | 71         | 71         | 2019-12-13T14:30:28.000000 | 2019-12-13T14:31:31.000000 | live-migration |
~~~
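For reference, a quick way to confirm the concurrency limit and to see the migrations queued behind the stuck one. This is a hedged sketch: the `nova_compute` container name and the `--host`/`--status` filters are assumptions based on a default containerized OSP 13 deployment.
~~~
# On the source compute node: confirm the effective concurrency limit
docker exec nova_compute grep max_concurrent_live_migrations /etc/nova/nova.conf

# From a shell with overcloudrc sourced: list migrations still running or
# queued for the source host
nova migration-list --host overcloud-compute-2.localdomain --status running
nova migration-list --host overcloud-compute-2.localdomain --status queued
~~~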

However, we cannot abort this migration:
~~~
(overcloud) [stack@director-beta ~]$ nova live-migration-abort 3e891943-d71e-4244-bf51-b5d6b4d79df4 301
(overcloud) [stack@director-beta ~]$
(overcloud) [stack@director-beta ~]$ nova server-migration-show 3e891943-d71e-4244-bf51-b5d6b4d79df4 301
+------------------------+--------------------------------------+
| Property               | Value                                |
+------------------------+--------------------------------------+
| created_at             | 2019-12-13T14:30:28.000000           |
| dest_compute           | overcloud-compute-3.localdomain      |
| dest_host              | -                                    |
| dest_node              | overcloud-compute-3.localdomain      |
| disk_processed_bytes   | 0                                    |
| disk_remaining_bytes   | 0                                    |
| disk_total_bytes       | 0                                    |
| id                     | 301                                  |
| memory_processed_bytes | 0                                    |
| memory_remaining_bytes | 0                                    |
| memory_total_bytes     | 0                                    |
| server_uuid            | 3e891943-d71e-4244-bf51-b5d6b4d79df4 |
| source_compute         | overcloud-compute-2.localdomain      |
| source_node            | overcloud-compute-2.localdomain      |
| status                 | running                              |
| updated_at             | 2019-12-13T14:31:31.000000           |
| uuid                   | de342ef8-a5e4-4ff7-9674-c5929d1e15a7 |
+------------------------+--------------------------------------+
[root@overcloud-compute-2 desccb]# tail -f /var/log/containers/nova/nova-compute.log
2019-12-16 13:40:46.174 8 WARNING nova.compute.resource_tracker [req-3ccf9ab8-fb45-4f7a-a8c4-de55e2e85aa4 - - - - -] [instance: e810ff94-6d83-4ab6-8871-245ea7202f1a] Instance not resizing, skipping migration.
2019-12-16 13:40:46.174 8 WARNING nova.compute.resource_tracker [req-3ccf9ab8-fb45-4f7a-a8c4-de55e2e85aa4 - - - - -] [instance: 3e891943-d71e-4244-bf51-b5d6b4d79df4] Instance not resizing, skipping migration.
2019-12-16 13:40:47.005 8 INFO nova.compute.resource_tracker [req-3ccf9ab8-fb45-4f7a-a8c4-de55e2e85aa4 - - - - -] Final resource view: name=overcloud-compute-2.localdomain phys_ram=195229MB used_ram=95232MB phys_disk=96571GB used_disk=60GB total_vcpus=112 used_vcpus=65 pci_stats=[]
2019-12-16 13:41:07.198 8 ERROR nova.virt.libvirt.driver [req-8dfd81b8-c788-426d-9d1f-38bfcd807010 5172322575324a0a824341de449766ab a2e7b2ae4cef4c7e8d6397502132d575 - default default] [instance: 3e891943-d71e-4244-bf51-b5d6b4d79df4] Failed to cancel migration Timed out during operation: cannot acquire state change lock (held by monitor=remoteDispatchDomainGetJobStats): libvirtError: Timed out during operation: cannot acquire state change lock (held by monitor=remoteDispatchDomainGetJobStats)
2019-12-16 13:41:07.238 8 ERROR oslo_messaging.rpc.server [req-8dfd81b8-c788-426d-9d1f-38bfcd807010 5172322575324a0a824341de449766ab a2e7b2ae4cef4c7e8d6397502132d575 - default default] Exception during message handling: libvirtError: Timed out during operation: cannot acquire state change lock (held by monitor=remoteDispatchDomainGetJobStats)
2019-12-16 13:41:07.238 8 ERROR oslo_messaging.rpc.server Traceback (most recent call last):
2019-12-16 13:41:07.238 8 ERROR oslo_messaging.rpc.server   File "/usr/lib/python2.7/site-packages/oslo_messaging/rpc/server.py", line 166, in _process_incoming
2019-12-16 13:41:07.238 8 ERROR oslo_messaging.rpc.server     res = self.dispatcher.dispatch(message)
2019-12-16 13:41:07.238 8 ERROR oslo_messaging.rpc.server   File "/usr/lib/python2.7/site-packages/oslo_messaging/rpc/dispatcher.py", line 220, in dispatch
2019-12-16 13:41:07.238 8 ERROR oslo_messaging.rpc.server     return self._do_dispatch(endpoint, method, ctxt, args)
2019-12-16 13:41:07.238 8 ERROR oslo_messaging.rpc.server   File "/usr/lib/python2.7/site-packages/oslo_messaging/rpc/dispatcher.py", line 190, in _do_dispatch
2019-12-16 13:41:07.238 8 ERROR oslo_messaging.rpc.server     result = func(ctxt, **new_args)
2019-12-16 13:41:07.238 8 ERROR oslo_messaging.rpc.server   File "/usr/lib/python2.7/site-packages/nova/exception_wrapper.py", line 76, in wrapped
2019-12-16 13:41:07.238 8 ERROR oslo_messaging.rpc.server     function_name, call_dict, binary)
2019-12-16 13:41:07.238 8 ERROR oslo_messaging.rpc.server   File "/usr/lib/python2.7/site-packages/oslo_utils/excutils.py", line 220, in __exit__
2019-12-16 13:41:07.238 8 ERROR oslo_messaging.rpc.server     self.force_reraise()
2019-12-16 13:41:07.238 8 ERROR oslo_messaging.rpc.server   File "/usr/lib/python2.7/site-packages/oslo_utils/excutils.py", line 196, in force_reraise
2019-12-16 13:41:07.238 8 ERROR oslo_messaging.rpc.server     six.reraise(self.type_, self.value, self.tb)
2019-12-16 13:41:07.238 8 ERROR oslo_messaging.rpc.server   File "/usr/lib/python2.7/site-packages/nova/exception_wrapper.py", line 67, in wrapped
2019-12-16 13:41:07.238 8 ERROR oslo_messaging.rpc.server     return f(self, context, *args, **kw)
2019-12-16 13:41:07.238 8 ERROR oslo_messaging.rpc.server   File "/usr/lib/python2.7/site-packages/nova/compute/utils.py", line 1021, in decorated_function
2019-12-16 13:41:07.238 8 ERROR oslo_messaging.rpc.server     return function(self, context, *args, **kwargs)
2019-12-16 13:41:07.238 8 ERROR oslo_messaging.rpc.server   File "/usr/lib/python2.7/site-packages/nova/compute/manager.py", line 215, in decorated_function
2019-12-16 13:41:07.238 8 ERROR oslo_messaging.rpc.server     kwargs['instance'], e, sys.exc_info())
2019-12-16 13:41:07.238 8 ERROR oslo_messaging.rpc.server   File "/usr/lib/python2.7/site-packages/oslo_utils/excutils.py", line 220, in __exit__
2019-12-16 13:41:07.238 8 ERROR oslo_messaging.rpc.server     self.force_reraise()
2019-12-16 13:41:07.238 8 ERROR oslo_messaging.rpc.server   File "/usr/lib/python2.7/site-packages/oslo_utils/excutils.py", line 196, in force_reraise
2019-12-16 13:41:07.238 8 ERROR oslo_messaging.rpc.server     six.reraise(self.type_, self.value, self.tb)
2019-12-16 13:41:07.238 8 ERROR oslo_messaging.rpc.server   File "/usr/lib/python2.7/site-packages/nova/compute/manager.py", line 203, in decorated_function
2019-12-16 13:41:07.238 8 ERROR oslo_messaging.rpc.server     return function(self, context, *args, **kwargs)
2019-12-16 13:41:07.238 8 ERROR oslo_messaging.rpc.server   File "/usr/lib/python2.7/site-packages/nova/compute/manager.py", line 6378, in live_migration_abort
2019-12-16 13:41:07.238 8 ERROR oslo_messaging.rpc.server     self.driver.live_migration_abort(instance)
2019-12-16 13:41:07.238 8 ERROR oslo_messaging.rpc.server   File "/usr/lib/python2.7/site-packages/nova/virt/libvirt/driver.py", line 6962, in live_migration_abort
2019-12-16 13:41:07.238 8 ERROR oslo_messaging.rpc.server     dom.abortJob()
2019-12-16 13:41:07.238 8 ERROR oslo_messaging.rpc.server   File "/usr/lib/python2.7/site-packages/eventlet/tpool.py", line 186, in doit
2019-12-16 13:41:07.238 8 ERROR oslo_messaging.rpc.server     result = proxy_call(self._autowrap, f, *args, **kwargs)
2019-12-16 13:41:07.238 8 ERROR oslo_messaging.rpc.server   File "/usr/lib/python2.7/site-packages/eventlet/tpool.py", line 144, in proxy_call
2019-12-16 13:41:07.238 8 ERROR oslo_messaging.rpc.server     rv = execute(f, *args, **kwargs)
2019-12-16 13:41:07.238 8 ERROR oslo_messaging.rpc.server   File "/usr/lib/python2.7/site-packages/eventlet/tpool.py", line 125, in execute
2019-12-16 13:41:07.238 8 ERROR oslo_messaging.rpc.server     six.reraise(c, e, tb)
2019-12-16 13:41:07.238 8 ERROR oslo_messaging.rpc.server   File "/usr/lib/python2.7/site-packages/eventlet/tpool.py", line 83, in tworker
2019-12-16 13:41:07.238 8 ERROR oslo_messaging.rpc.server     rv = meth(*args, **kwargs)
2019-12-16 13:41:07.238 8 ERROR oslo_messaging.rpc.server   File "/usr/lib64/python2.7/site-packages/libvirt.py", line 544, in abortJob
2019-12-16 13:41:07.238 8 ERROR oslo_messaging.rpc.server     if ret == -1: raise libvirtError ('virDomainAbortJob() failed', dom=self)
2019-12-16 13:41:07.238 8 ERROR oslo_messaging.rpc.server libvirtError: Timed out during operation: cannot acquire state change lock (held by monitor=remoteDispatchDomainGetJobStats)
2019-12-16 13:41:07.238 8 ERROR oslo_messaging.rpc.server
~~~
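The same lock failure can be reproduced from the compute node with plain virsh, bypassing nova entirely, which points at libvirtd/QEMU rather than the nova abort path. A hedged sketch for this containerized deployment, run via the nova_libvirt container:
~~~
# Try to abort the migration job directly in libvirt; this is expected to fail
# with the same "cannot acquire state change lock" error if libvirtd is wedged
docker exec -it nova_libvirt virsh domjobabort instance-000005b1
~~~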

virsh cannot query migration job details for the domain, either:
~~~
[root@overcloud-compute-2 qemu]# virsh domjobinfo instance-000005b1
error: Timed out during operation: cannot acquire state change lock (held by monitor=remoteDispatchDomainGetJobStats)
~~~
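To narrow down which libvirtd worker is holding the state change lock, one option is to check whether the QEMU monitor still responds and to capture libvirtd thread stacks. A hedged diagnostic sketch: gstack/gdb and pgrep may not be installed in the container, and a hung qemu-monitor-command itself already indicates the monitor is wedged.
~~~
# If this hangs or times out, the QEMU monitor for the domain is not
# responding, which is what keeps remoteDispatchDomainGetJobStats holding
# the state change lock
docker exec -it nova_libvirt virsh qemu-monitor-command instance-000005b1 \
    --pretty '{"execute":"query-status"}'

# Dump libvirtd thread stacks to find the worker blocked waiting on the
# monitor while handling the job-stats RPC
docker exec -it nova_libvirt sh -c 'gstack $(pgrep libvirtd)'
~~~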


Version-Release number of selected component (if applicable):


[akaris@supportshell overcloud-compute-2]$ cat etc/redhat-release 
Red Hat Enterprise Linux Server release 7.7 (Maipo)

[akaris@supportshell overcloud-compute-2]$ cat etc/rhosp-release 
Red Hat OpenStack Platform release 13.0.9 (Queens)

[akaris@supportshell overcloud-compute-2]$ egrep 'libvirt|qemu|kernel' installed-rpms 
erlang-kernel-18.3.4.11-2.el7ost.x86_64                     Tue Aug 27 16:23:00 2019
ipxe-roms-qemu-20180825-2.git133f4c.el7.noarch              Tue Aug 27 16:23:30 2019
kernel-3.10.0-1062.4.1.el7.x86_64                           Wed Oct 30 11:29:50 2019
kernel-3.10.0-1062.4.3.el7.x86_64                           Wed Nov 13 18:58:11 2019
kernel-3.10.0-1062.el7.x86_64                               Tue Aug 27 15:46:11 2019
kernel-tools-3.10.0-1062.4.3.el7.x86_64                     Wed Nov 13 18:58:03 2019
kernel-tools-libs-3.10.0-1062.4.3.el7.x86_64                Wed Nov 13 18:58:03 2019
libvirt-4.5.0-23.el7_7.1.x86_64                             Wed Oct 30 11:29:55 2019
libvirt-bash-completion-4.5.0-23.el7_7.1.x86_64             Wed Oct 30 11:29:47 2019
libvirt-client-4.5.0-23.el7_7.1.x86_64                      Wed Oct 30 11:29:47 2019
libvirt-daemon-4.5.0-23.el7_7.1.x86_64                      Wed Oct 30 11:29:41 2019
libvirt-daemon-config-network-4.5.0-23.el7_7.1.x86_64       Wed Oct 30 11:29:42 2019
libvirt-daemon-config-nwfilter-4.5.0-23.el7_7.1.x86_64      Wed Oct 30 11:29:42 2019
libvirt-daemon-driver-interface-4.5.0-23.el7_7.1.x86_64     Wed Oct 30 11:29:41 2019
libvirt-daemon-driver-lxc-4.5.0-23.el7_7.1.x86_64           Wed Oct 30 11:29:42 2019
libvirt-daemon-driver-network-4.5.0-23.el7_7.1.x86_64       Wed Oct 30 11:29:42 2019
libvirt-daemon-driver-nodedev-4.5.0-23.el7_7.1.x86_64       Wed Oct 30 11:29:41 2019
libvirt-daemon-driver-nwfilter-4.5.0-23.el7_7.1.x86_64      Wed Oct 30 11:29:41 2019
libvirt-daemon-driver-qemu-4.5.0-23.el7_7.1.x86_64          Wed Oct 30 11:29:42 2019
libvirt-daemon-driver-secret-4.5.0-23.el7_7.1.x86_64        Wed Oct 30 11:29:41 2019
libvirt-daemon-driver-storage-4.5.0-23.el7_7.1.x86_64       Wed Oct 30 11:29:46 2019
libvirt-daemon-driver-storage-core-4.5.0-23.el7_7.1.x86_64  Wed Oct 30 11:29:41 2019
libvirt-daemon-driver-storage-disk-4.5.0-23.el7_7.1.x86_64  Wed Oct 30 11:29:42 2019
libvirt-daemon-driver-storage-gluster-4.5.0-23.el7_7.1.x86_64 Wed Oct 30 11:29:42 2019
libvirt-daemon-driver-storage-iscsi-4.5.0-23.el7_7.1.x86_64 Wed Oct 30 11:29:42 2019
libvirt-daemon-driver-storage-logical-4.5.0-23.el7_7.1.x86_64 Wed Oct 30 11:29:44 2019
libvirt-daemon-driver-storage-mpath-4.5.0-23.el7_7.1.x86_64 Wed Oct 30 11:29:42 2019
libvirt-daemon-driver-storage-rbd-4.5.0-23.el7_7.1.x86_64   Wed Oct 30 11:29:46 2019
libvirt-daemon-driver-storage-scsi-4.5.0-23.el7_7.1.x86_64  Wed Oct 30 11:29:42 2019
libvirt-daemon-kvm-4.5.0-23.el7_7.1.x86_64                  Wed Oct 30 11:29:46 2019
libvirt-libs-4.5.0-23.el7_7.1.x86_64                        Wed Oct 30 11:29:39 2019
libvirt-python-4.5.0-1.el7.x86_64                           Tue Aug 27 16:24:03 2019
qemu-guest-agent-2.12.0-3.el7.x86_64                        Tue Aug 27 15:46:18 2019
qemu-img-rhev-2.12.0-33.el7_7.4.x86_64                      Wed Nov 13 18:57:40 2019
qemu-kvm-common-rhev-2.12.0-33.el7_7.4.x86_64               Wed Nov 13 18:57:53 2019
qemu-kvm-rhev-2.12.0-33.el7_7.4.x86_64                      Wed Nov 13 18:58:05 2019

[akaris@supportshell overcloud-compute-2]$ cat uname 
Linux overcloud-compute-2 3.10.0-1062.4.3.el7.x86_64 #1 SMP Tue Nov 12 10:42:40 EST 2019 x86_64 x86_64 x86_64 GNU/Linux

[akaris@supportshell overcloud-compute-2]$ grep libvirt ./sos_commands/docker/docker_ps
3e1229985236        10.16.20.2:8787/rhosp13/openstack-nova-libvirt:13.0-119                "dumb-init --singl..."   4 weeks ago         Up 58 minutes                             nova_libvirt
4e5a23db26cb        10.16.20.2:8787/rhosp13/openstack-nova-libvirt:13.0-119                "dumb-init --singl..."   4 weeks ago         Up 9 days                                 nova_virtlogd



How reproducible:


Steps to Reproduce:
1.
2.
3.

Actual results:


Expected results:


Additional info:

Comment 3 Andreas Karis 2019-12-18 14:54:32 UTC
It turns out that a volume was attached twice to the same instance, 3e891943-d71e-4244-bf51-b5d6b4d79df4:

(overcloud) [stack@director-beta ~]$ openstack volume show 12128eb5-9af9-47e2-b10c-1eca0cec5d2c

| attachments | [{u'server_id': u'3e891943-d71e-4244-bf51-b5d6b4d79df4', u'attachment_id': u'99f3e933-a373-4d94-8bc0-32e0895a5df9', u'attached_at': u'2019-12-09T:59:44.000000', u'host_name': u'overcloud-compute-2.localdomain', u'volume_id': u'12128eb5-9af9-47e2-b10c-1eca0cec5d2c', u'device': u'/dev/vdb', u'id': u'12128eb5-9af9-47e2-b10c-1eca0cec5d2c'}, {u'server_id': u'3e891943-d71e-4244-bf51-b5d6b4d79df4', u'attachment_id': u'd0ed9602-d965-4c0e-9e1f-ea758da064e1', u'attached_at': u'2019-12-13T14:31:26.000000', u'host_name': u'overcloud-compute-3.localdomain', u'volume_id': u'12128eb5-9af9-47e2-b10c-1eca0cec5d2c', u'device': u'/dev/vdb', u'id': u'12128eb5-9af9-47e2-b10c-1eca0cec5d2c'}] |


After detaching this volume, the live migration completed:

cinder reset-state --attach-status detached 12128eb5-9af9-47e2-b10c-1eca0cec5d2c
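For anyone hitting the same symptom, a hedged sketch of the check and cleanup applied here; the explicit volume-detach step is an assumption, only the reset-state command above is confirmed from this case:
~~~
# Spot the duplicate: the same volume shows two attachment records for the
# instance, one per compute host
openstack volume show 12128eb5-9af9-47e2-b10c-1eca0cec5d2c -c attachments

# Detach the volume from the instance, then clear the stale attach status
nova volume-detach 3e891943-d71e-4244-bf51-b5d6b4d79df4 12128eb5-9af9-47e2-b10c-1eca0cec5d2c
cinder reset-state --attach-status detached 12128eb5-9af9-47e2-b10c-1eca0cec5d2c
~~~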

