Bug 1566723 - Instance live migration times out after <x> seconds with nova - doesn't copy instance memory
Summary: Instance live migration times out after <x> seconds with nova - doesn't copy instance memory
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: Red Hat Ceph Storage
Classification: Red Hat Storage
Component: RBD
Version: 1.3.4
Hardware: Unspecified
OS: Unspecified
Priority: high
Severity: high
Target Milestone: z3
Target Release: 2.5
Assignee: Jason Dillaman
QA Contact: ceph-qe-bugs
Docs Contact: Bara Ancincova
URL:
Whiteboard:
Depends On:
Blocks: 1622697
 
Reported: 2018-04-12 21:01 UTC by Andreas Karis
Modified: 2022-03-13 15:30 UTC
CC List: 31 users

Fixed In Version: RHEL: ceph-10.2.10-43.el7cp Ubuntu: ceph_10.2.10-37redhat1
Doc Type: Bug Fix
Doc Text:
Restarting OSD daemons, for example for rolling updates, could result in an inconsistent internal state within librbd clients with the exclusive lock feature enabled. As a consequence, live migration of virtual machines (VMs) using RBD images could time out because the source VM would refuse to release its exclusive lock on the RBD image. This bug has been fixed, and the live migration proceeds as expected.
Clone Of:
: 1622697 (view as bug list)
Environment:
Last Closed: 2018-11-27 21:15:40 UTC
Embargoed:


Attachments


Links
System ID Private Priority Status Summary Last Updated
Ceph Project Bug Tracker 19957 0 None None None 2018-05-03 17:47:22 UTC
Github ceph ceph pull 23760 0 None closed jewel: librbd: fix refuse to release lock when cookie is the same at rewatch 2020-05-19 11:25:05 UTC
Red Hat Issue Tracker RHCEPH-3720 0 None None None 2022-03-13 15:30:00 UTC
Red Hat Knowledge Base (Solution) 3664091 0 None None None 2018-10-24 00:32:36 UTC
Red Hat Product Errata RHBA-2018:3689 0 None None None 2018-11-27 21:16:15 UTC

Description Andreas Karis 2018-04-12 21:01:37 UTC
Description of problem:
Instance live migration times out after 1600 seconds with nova - doesn't copy instance memory

We observed a strange pattern for all instances on a specific compute node. The instances are on compute-0 and are migrated to compute-2 (a CLI sketch of these steps follows the list):
* Live-migrating -> timeout -> fail
* nova stop instance / nova start instance
* migrate to compute 2 again -> fail (this may be due to a bug where the XML file was already created on the other compute and thus nova migration fails) 
* migrate to compute 2 -> success
* migrate to compute 0 -> success
* migrate to compute 2 -> success
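
For reference, a rough shell sketch of the commands behind these steps, using the instance UUID mentioned below; the exact client syntax may vary with the installed python-novaclient:
~~~
# live-migrate from compute-0 to compute-2 -- this is the step that times out
nova live-migration 026c91c9-f01b-47f8-b073-7e6a39057800 compute-2

# workaround: stop and start the instance
nova stop 026c91c9-f01b-47f8-b073-7e6a39057800
nova start 026c91c9-f01b-47f8-b073-7e6a39057800

# after the stop/start, migrations back and forth succeed
nova live-migration 026c91c9-f01b-47f8-b073-7e6a39057800 compute-2
nova live-migration 026c91c9-f01b-47f8-b073-7e6a39057800 compute-0

# list recent migrations and their states
nova migration-list
~~~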

What is striking here is that after the initial failure(s), once a migration has worked, the instance can be migrated back and forth without issues.

One of the instances that showed this pattern was the instance ending in 57800 (026c91c9-f01b-47f8-b073-7e6a39057800). See private comments below for more details.

According to the customer, migration worked for some time. Then, 2 weeks ago, it broke for a specific compute node, possibly for a few compute nodes. They restarted all affected instances and thought that this might just be a one-time issue, but it happened again 2 weeks later. If an instance is migrated away after a restart, migration works again on subsequent attempts and the instance can be migrated in and out repeatedly (until the issue randomly reappears).

The issue likely occurs not only on this hypervisor, but also on others.

Spawning a new VM on compute-0 and migrating it off immediately works without issues.

There are still test VMs on compute-0, so we could test with those, but obviously after each test we "lose" a VM for testing, because migration works again after an instance stop/start.

This looks suspicious: bytes processed=0, remaining=0, total=0
~~~
2018-04-11 16:42:03.864 669431 DEBUG nova.virt.libvirt.driver [req-f31770c1-901b-4a91-bf48-5ea98f3bcd85 13515e05a63e48e0b9adb991b250c5f7 8bc196f362b442a4891986517b60389c - - -] [instance: 559176a0-393e-4f43-ab38-8d064790bb65] Migration running for 105 secs, memory 100% remaining; (bytes processed=0, remaining=0, total=0) _live_migration_monitor /usr/lib/python2.7/site-packages/nova/virt/libvirt/driver.py:6428
~~~

These counters remain at 0 for the entire process. A migration that works, by contrast, clearly shows progress:
~~~
2018-04-11 17:56:24.512 669431 INFO nova.virt.libvirt.driver [req-37599fe2-e62a-4d21-9473-78907f18e382 13515e05a63e48e0b9adb991b250c5f7 8bc196f362b442a4891986517b60389c - - -] [instance: 026c91c9-f01b-47f8-b073-7e6a39057800] Migration running for 0 secs, memory 100% remaining; (bytes processed=0, remaining=0, total=0)
2018-04-11 17:56:29.624 669431 DEBUG nova.virt.libvirt.driver [req-37599fe2-e62a-4d21-9473-78907f18e382 13515e05a63e48e0b9adb991b250c5f7 8bc196f362b442a4891986517b60389c - - -] [instance: 026c91c9-f01b-47f8-b073-7e6a39057800] Migration running for 5 secs, memory 1% remaining; (bytes processed=1513034396, remaining=95154176, total=8607571968) _live_migration_monitor /usr/lib/python2.7/site-packages/nova/virt/libvirt/driver.py:6428
~~~
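
A quick way to spot stuck migrations of this kind on a compute node is to grep the nova-compute log for these monitor lines and look for entries whose counters stay at "bytes processed=0, remaining=0, total=0" while the elapsed time keeps growing. A minimal sketch; /var/log/nova/nova-compute.log is the usual log location on the compute nodes, adjust the path and UUID as needed:
~~~
# all per-migration progress lines; a stuck migration keeps printing
# "memory 100% remaining; (bytes processed=0, remaining=0, total=0)"
grep '_live_migration_monitor' /var/log/nova/nova-compute.log | grep 'Migration running for'

# the same, narrowed down to a single instance UUID
grep '026c91c9-f01b-47f8-b073-7e6a39057800' /var/log/nova/nova-compute.log | grep 'Migration running for'
~~~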

Version-Release number of selected component (if applicable):
[akaris@fubar sosreport-hpawlowski-ctr.02073714-20180409164123]$ egrep 'nova|libvirt|qemu' installed-rpms | awk '{print $1}' | tr '\n' ' '
ipxe-roms-qemu-20170123-1.git4e85b27.el7_4.1.noarch libvirt-3.2.0-14.el7_4.9.x86_64 libvirt-client-3.2.0-14.el7_4.9.x86_64 libvirt-daemon-3.2.0-14.el7_4.9.x86_64 libvirt-daemon-config-network-3.2.0-14.el7_4.9.x86_64 libvirt-daemon-config-nwfilter-3.2.0-14.el7_4.9.x86_64 libvirt-daemon-driver-interface-3.2.0-14.el7_4.9.x86_64 libvirt-daemon-driver-lxc-3.2.0-14.el7_4.9.x86_64 libvirt-daemon-driver-network-3.2.0-14.el7_4.9.x86_64 libvirt-daemon-driver-nodedev-3.2.0-14.el7_4.9.x86_64 libvirt-daemon-driver-nwfilter-3.2.0-14.el7_4.9.x86_64 libvirt-daemon-driver-qemu-3.2.0-14.el7_4.9.x86_64 libvirt-daemon-driver-secret-3.2.0-14.el7_4.9.x86_64 libvirt-daemon-driver-storage-3.2.0-14.el7_4.9.x86_64 libvirt-daemon-driver-storage-core-3.2.0-14.el7_4.9.x86_64 libvirt-daemon-driver-storage-disk-3.2.0-14.el7_4.9.x86_64 libvirt-daemon-driver-storage-gluster-3.2.0-14.el7_4.9.x86_64 libvirt-daemon-driver-storage-iscsi-3.2.0-14.el7_4.9.x86_64 libvirt-daemon-driver-storage-logical-3.2.0-14.el7_4.9.x86_64 libvirt-daemon-driver-storage-mpath-3.2.0-14.el7_4.9.x86_64 libvirt-daemon-driver-storage-rbd-3.2.0-14.el7_4.9.x86_64 libvirt-daemon-driver-storage-scsi-3.2.0-14.el7_4.9.x86_64 libvirt-daemon-kvm-3.2.0-14.el7_4.9.x86_64 libvirt-libs-3.2.0-14.el7_4.9.x86_64 libvirt-python-3.2.0-3.el7_4.1.x86_64 openstack-nova-api-14.1.0-3.el7ost.noarch openstack-nova-cert-14.1.0-3.el7ost.noarch openstack-nova-common-14.1.0-3.el7ost.noarch openstack-nova-compute-14.1.0-3.el7ost.noarch openstack-nova-conductor-14.1.0-3.el7ost.noarch openstack-nova-console-14.1.0-3.el7ost.noarch openstack-nova-migration-14.1.0-3.el7ost.noarch openstack-nova-novncproxy-14.1.0-3.el7ost.noarch openstack-nova-scheduler-14.1.0-3.el7ost.noarch puppet-nova-9.6.0-3.el7ost.noarch python-nova-14.1.0-3.el7ost.noarch python-novaclient-6.0.2-2.el7ost.noarch qemu-guest-agent-2.8.0-2.el7.x86_64 qemu-img-rhev-2.9.0-16.el7_4.13.x86_64 qemu-kvm-common-rhev-2.9.0-16.el7_4.13.x86_64 qemu-kvm-rhev-2.9.0-16.el7_4.13.x86_64

Comment 6 Andreas Karis 2018-04-12 21:09:01 UTC
Note that we received a similar case for OSP 12 at the same time. This may or may not be related, I'm just bringing it up: https://bugzilla.redhat.com/show_bug.cgi?id=1566699

Comment 7 Andreas Karis 2018-04-12 21:12:34 UTC
This is Ceph, instance storage is boot from volume.
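
Given the doc text above (the source VM refusing to release its exclusive lock on the RBD image), the lock owner and the watchers of the boot volume's image can be inspected with the rbd CLI. A sketch with a hypothetical pool/image name; substitute the Cinder pool and the volume-<uuid> image backing the affected instance:
~~~
# list watchers of the image (the source and destination qemu processes show up here)
rbd status volumes/volume-<cinder-volume-uuid>

# list the exclusive-lock owner; with the exclusive-lock feature enabled the
# lock id typically shows up as "auto <gid>"
rbd lock list volumes/volume-<cinder-volume-uuid>
~~~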

Comment 56 Jason Dillaman 2018-05-03 17:47:23 UTC
This issue should have been addressed in the ceph-10.2.10-17.el7cp release of Ceph. Is it possible to update the librbd1 RPM on the compute hosts to verify?
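
One way to verify this on a compute host is to compare the installed librbd1 package with what the running qemu-kvm processes actually have mapped; a "(deleted)" suffix in /proc/<pid>/maps means the process is still running with the old, pre-update copy of the library. A sketch, assuming the guests link librbd1 dynamically:
~~~
# installed package version on the host
rpm -q librbd1

# per qemu-kvm process: which librbd file is mapped, and whether it has been
# replaced on disk since the process started ("(deleted)")
for pid in $(pidof qemu-kvm); do
    echo "== pid $pid =="
    grep librbd /proc/$pid/maps | awk '{print $6, $7}' | sort -u
done
~~~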

Comment 62 Jason Dillaman 2018-06-08 18:41:26 UTC
The issue was fixed in the ceph-10.2.10-17.el7cp release.

Comment 63 Andreas Karis 2018-08-17 15:41:37 UTC
Customer updated to latest RPMs and this issue persists:
[root@njsmain-compute0 ~]# rpm -qa | grep ceph | sort 
ceph-base-10.2.10-28.el7cp.x86_64
ceph-common-10.2.10-28.el7cp.x86_64
ceph-radosgw-10.2.10-28.el7cp.x86_64
ceph-selinux-10.2.10-28.el7cp.x86_64
libcephfs1-10.2.10-28.el7cp.x86_64
puppet-ceph-2.4.2-1.el7ost.noarch
python-cephfs-10.2.10-28.el7cp.x86_64

We'll add further details today and on Monday

Comment 64 Vikhyat Umrao 2018-08-17 22:18:15 UTC
(In reply to Andreas Karis from comment #63)
> Customer updated to latest RPMs and this issue persists:
> [root@njsmain-compute0 ~]# rpm -qa | grep ceph | sort 
> ceph-base-10.2.10-28.el7cp.x86_64
> ceph-common-10.2.10-28.el7cp.x86_64
> ceph-radosgw-10.2.10-28.el7cp.x86_64
> ceph-selinux-10.2.10-28.el7cp.x86_64
> libcephfs1-10.2.10-28.el7cp.x86_64
> puppet-ceph-2.4.2-1.el7ost.noarch
> python-cephfs-10.2.10-28.el7cp.x86_64
> 
> We'll add further details today and on Monday

Thanks Andreas. With qemu-kvm instances, updating the packages alone is not enough for the fix to take effect: the instance has to be stopped and started, because a plain instance reboot does not kill the existing qemu-kvm PID, so the process image in memory does not change and the code containing the fix is never loaded.

OR

If the instance cannot be stopped and started (i.e. no downtime is allowed), it has to be live-migrated to a different compute node and then brought back to the updated compute node; only then is a new process created on the updated node, and that process will have the fixed code in memory.

So to me it looks like the packages were upgraded but the instances were not live-migrated and brought back.
  + I know that live migration is the issue here and that an instance stop/start fixes it.
  + If the instances have the admin socket enabled, check the in-memory version; I am sure it is still the old one:
   ceph --admin-daemon /var/run/ceph/<name>.asok version
  + This will tell us whether the in-memory version is 10.2.10-28.el7cp or still the old one.
  + If it is still old, we need to live-migrate or stop/start the instance and note down the instance name; I am sure the issue will not come back on an instance that has been live-migrated or stopped/started.

  + If the admin socket is not enabled, enable it, then either stop/start or live-migrate the instance, note the instance name, and proceed as above.

In Summary:
==============

- We need to make sure that the in-memory version is 10.2.10-28.el7cp and that the issue still occurs with it; if that is the case, we need to debug further. If not, please live-migrate or stop/start all instances.
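
A minimal sketch of what enabling the client admin socket could look like on a compute node (the socket path pattern is only an example; note that a qemu process only creates the socket after it has been stopped/started or live-migrated, i.e. after it re-opens librbd with the new configuration):
~~~
# /etc/ceph/ceph.conf on the compute node (example pattern; the directory must
# be writable by the qemu process)
[client]
admin socket = /var/run/ceph/$cluster-$type.$id.$pid.$cctid.asok
~~~

Once a socket exists for the qemu process in question, the in-memory version can be queried as above:
~~~
ceph --admin-daemon /var/run/ceph/<name>.asok version
~~~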

Comment 65 Andreas Karis 2018-08-20 15:12:05 UTC
We don't have the admin socket enabled, and the instances were migrated after the patch was applied. Does the customer also need to patch the Ceph cluster, or only the client libraries?

Thanks,

Andreas

Comment 69 Andreas Karis 2018-08-20 16:49:46 UTC
With regard to the instance in https://bugzilla.redhat.com/show_bug.cgi?id=1566723#c66, we took another backtrace on the destination instance: https://bugzilla.redhat.com/show_bug.cgi?id=1566723#c67

We also took an 18 GB gcore of the source instance, which the customer is going to attach (I'll let you know once that's done).
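
For reference, such a core can be captured with gdb's gcore against the running qemu-kvm process (a sketch; the domain name in the pgrep pattern is hypothetical, and the resulting dump is roughly the size of the guest's RAM):
~~~
# find the qemu-kvm PID for the libvirt domain and dump its core
pid=$(pgrep -f 'qemu-kvm.*guest=instance-000001a2')   # hypothetical domain name
gcore -o /tmp/qemu-core "$pid"                        # writes /tmp/qemu-core.<pid>
~~~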

Comment 114 errata-xmlrpc 2018-11-27 21:15:40 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2018:3689

