Description of problem:
Instance live migration times out after 1600 seconds with nova - it doesn't copy the instance memory.

We observed a strange pattern for all instances on a specific compute node. The instances are on compute-0, migrating to compute-2:
* Live-migrate -> timeout -> fail
* nova stop instance / nova start instance
* Migrate to compute-2 again -> fail (this may be due to a bug where the XML file was already created on the other compute node, so the nova migration fails)
* Migrate to compute-2 -> success
* Migrate to compute-0 -> success
* Migrate to compute-2 -> success

What stands out is that after the initial failure(s), once a migration has worked, the instance can be migrated back and forth without issues. One of the instances that showed this pattern was the instance ending in 57800 (026c91c9-f01b-47f8-b073-7e6a39057800). See private comments below for more details.

According to the customer, migration worked for some time. Then, 2 weeks ago, it broke for a specific compute node, possibly for a few compute nodes. They restarted all affected instances and thought that this was maybe just a one-time issue, but it happened again 2 weeks later. If an instance is migrated away after a restart, migration works again on subsequent attempts and the instance can be migrated in and out all of the time (until the issue randomly happens again). The issue likely happens not only on this hypervisor, but also on others.

Spawning a new VM on compute-0 and migrating it off immediately works without issues. There are still test VMs on compute-0, so we could test with those, but obviously after each test we "lose" a VM for testing, because migration works after an instance stop/start.

This looks suspicious: bytes processed=0, remaining=0, total=0
~~~
2018-04-11 16:42:03.864 669431 DEBUG nova.virt.libvirt.driver [req-f31770c1-901b-4a91-bf48-5ea98f3bcd85 13515e05a63e48e0b9adb991b250c5f7 8bc196f362b442a4891986517b60389c - - -] [instance: 559176a0-393e-4f43-ab38-8d064790bb65] Migration running for 105 secs, memory 100% remaining; (bytes processed=0, remaining=0, total=0) _live_migration_monitor /usr/lib/python2.7/site-packages/nova/virt/libvirt/driver.py:6428
~~~
The above remains at 0 for the entire process.
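As a side note, the counters nova prints here ("bytes processed/remaining/total") come from libvirt's migration job statistics. Below is a minimal sketch, for reference only, of watching those counters directly on the source compute with libvirt-python while a migration runs. It is not nova's monitor code; the UUID is simply the instance from the log above, and the percentage calculation is an assumption about how "memory 100% remaining" can show up even though every counter is zero.
~~~
#!/usr/bin/env python
# Illustrative only (not nova code): poll libvirt's migration job statistics
# for a domain during a live migration. These are the same counters that
# _live_migration_monitor logs as "bytes processed/remaining/total".
import time

import libvirt

# Instance from the failing migration log above.
DOMAIN_UUID = '559176a0-393e-4f43-ab38-8d064790bb65'

conn = libvirt.open('qemu:///system')
dom = conn.lookupByUUIDString(DOMAIN_UUID)

for _ in range(10):
    stats = dom.jobStats()  # memory counters are absent when no job is active
    processed = stats.get('memory_processed', 0)
    remaining = stats.get('memory_remaining', 0)
    total = stats.get('memory_total', 0)
    # Assumption: with total=0 there is no meaningful ratio to compute, so a
    # monitor would keep reporting the initial 100% - which matches the
    # "memory 100% remaining" lines in the failing migration.
    pct = 100 if total == 0 else remaining * 100 // total
    print('processed=%d remaining=%d total=%d (~%d%% remaining)'
          % (processed, remaining, total, pct))
    time.sleep(5)

conn.close()
~~~
In the failing case these counters never move off zero, which suggests the memory transfer never actually starts at the libvirt/QEMU level, rather than a problem in nova's monitoring.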
A migration that works, instead, clearly shows progress:
~~~
2018-04-11 17:56:24.512 669431 INFO nova.virt.libvirt.driver [req-37599fe2-e62a-4d21-9473-78907f18e382 13515e05a63e48e0b9adb991b250c5f7 8bc196f362b442a4891986517b60389c - - -] [instance: 026c91c9-f01b-47f8-b073-7e6a39057800] Migration running for 0 secs, memory 100% remaining; (bytes processed=0, remaining=0, total=0)
2018-04-11 17:56:29.624 669431 DEBUG nova.virt.libvirt.driver [req-37599fe2-e62a-4d21-9473-78907f18e382 13515e05a63e48e0b9adb991b250c5f7 8bc196f362b442a4891986517b60389c - - -] [instance: 026c91c9-f01b-47f8-b073-7e6a39057800] Migration running for 5 secs, memory 1% remaining; (bytes processed=1513034396, remaining=95154176, total=8607571968) _live_migration_monitor /usr/lib/python2.7/site-packages/nova/virt/libvirt/driver.py:6428
~~~

Version-Release number of selected component (if applicable):
~~~
[akaris@fubar sosreport-hpawlowski-ctr.02073714-20180409164123]$ egrep 'nova|libvirt|qemu' installed-rpms | awk '{print $1}' | tr '\n' ' '
ipxe-roms-qemu-20170123-1.git4e85b27.el7_4.1.noarch libvirt-3.2.0-14.el7_4.9.x86_64 libvirt-client-3.2.0-14.el7_4.9.x86_64 libvirt-daemon-3.2.0-14.el7_4.9.x86_64 libvirt-daemon-config-network-3.2.0-14.el7_4.9.x86_64 libvirt-daemon-config-nwfilter-3.2.0-14.el7_4.9.x86_64 libvirt-daemon-driver-interface-3.2.0-14.el7_4.9.x86_64 libvirt-daemon-driver-lxc-3.2.0-14.el7_4.9.x86_64 libvirt-daemon-driver-network-3.2.0-14.el7_4.9.x86_64 libvirt-daemon-driver-nodedev-3.2.0-14.el7_4.9.x86_64 libvirt-daemon-driver-nwfilter-3.2.0-14.el7_4.9.x86_64 libvirt-daemon-driver-qemu-3.2.0-14.el7_4.9.x86_64 libvirt-daemon-driver-secret-3.2.0-14.el7_4.9.x86_64 libvirt-daemon-driver-storage-3.2.0-14.el7_4.9.x86_64 libvirt-daemon-driver-storage-core-3.2.0-14.el7_4.9.x86_64 libvirt-daemon-driver-storage-disk-3.2.0-14.el7_4.9.x86_64 libvirt-daemon-driver-storage-gluster-3.2.0-14.el7_4.9.x86_64 libvirt-daemon-driver-storage-iscsi-3.2.0-14.el7_4.9.x86_64 libvirt-daemon-driver-storage-logical-3.2.0-14.el7_4.9.x86_64 libvirt-daemon-driver-storage-mpath-3.2.0-14.el7_4.9.x86_64 libvirt-daemon-driver-storage-rbd-3.2.0-14.el7_4.9.x86_64 libvirt-daemon-driver-storage-scsi-3.2.0-14.el7_4.9.x86_64 libvirt-daemon-kvm-3.2.0-14.el7_4.9.x86_64 libvirt-libs-3.2.0-14.el7_4.9.x86_64 libvirt-python-3.2.0-3.el7_4.1.x86_64 openstack-nova-api-14.1.0-3.el7ost.noarch openstack-nova-cert-14.1.0-3.el7ost.noarch openstack-nova-common-14.1.0-3.el7ost.noarch openstack-nova-compute-14.1.0-3.el7ost.noarch openstack-nova-conductor-14.1.0-3.el7ost.noarch openstack-nova-console-14.1.0-3.el7ost.noarch openstack-nova-migration-14.1.0-3.el7ost.noarch openstack-nova-novncproxy-14.1.0-3.el7ost.noarch openstack-nova-scheduler-14.1.0-3.el7ost.noarch puppet-nova-9.6.0-3.el7ost.noarch python-nova-14.1.0-3.el7ost.noarch python-novaclient-6.0.2-2.el7ost.noarch qemu-guest-agent-2.8.0-2.el7.x86_64 qemu-img-rhev-2.9.0-16.el7_4.13.x86_64 qemu-kvm-common-rhev-2.9.0-16.el7_4.13.x86_64 qemu-kvm-rhev-2.9.0-16.el7_4.13.x86_64
~~~
Note that we received a similar case for OSP 12 at the same time. This may or may not be related; I'm just bringing it up: https://bugzilla.redhat.com/show_bug.cgi?id=1566699
This is Ceph; instance storage is boot-from-volume.
This issue should have been addressed in the ceph-10.2.10-17.el7cp release of Ceph. Is it possible to update the librbd1 RPM on the compute hosts to verify?
The issue was fixed in the ceph-10.2.10-17.el7cp release.
Customer updated to the latest RPMs and this issue persists:
~~~
[root@njsmain-compute0 ~]# rpm -qa | grep ceph | sort
ceph-base-10.2.10-28.el7cp.x86_64
ceph-common-10.2.10-28.el7cp.x86_64
ceph-radosgw-10.2.10-28.el7cp.x86_64
ceph-selinux-10.2.10-28.el7cp.x86_64
libcephfs1-10.2.10-28.el7cp.x86_64
puppet-ceph-2.4.2-1.el7ost.noarch
python-cephfs-10.2.10-28.el7cp.x86_64
~~~
We'll add further details today and on Monday.
(In reply to Andreas Karis from comment #63)
> Customer updated to latest RPMs and this issue persists:
> [root@njsmain-compute0 ~]# rpm -qa | grep ceph | sort
> ceph-base-10.2.10-28.el7cp.x86_64
> ceph-common-10.2.10-28.el7cp.x86_64
> ceph-radosgw-10.2.10-28.el7cp.x86_64
> ceph-selinux-10.2.10-28.el7cp.x86_64
> libcephfs1-10.2.10-28.el7cp.x86_64
> puppet-ceph-2.4.2-1.el7ost.noarch
> python-cephfs-10.2.10-28.el7cp.x86_64
>
> We'll add further details today and on Monday

Thanks Andreas. For qemu-kvm instances, updating the packages alone is not enough for the fix to take effect: the running qemu-kvm process keeps the old code mapped in memory, and restarting the guest OS inside the instance does not replace that process. Either the instance has to be stopped and started (so a new qemu-kvm process is created with the fixed library in memory), or, if the instance cannot take downtime, it has to be live-migrated to another compute node and then brought back to the updated node - only then is a new process with the fixed code created on the updated compute node.

So to me it looks like the packages were upgraded, but the instances were not live-migrated away and back (or stopped and started). I know that live migration is the issue here and that an instance stop/start fixes it.

+ If the instances have the admin socket enabled, check the in-memory version - I am sure it would still be the old one:
  ceph --admin-daemon /var/run/ceph/<name>.asock version
+ This will help us understand whether the in-memory version is 10.2.10-28.el7cp or still the old one.
+ If it is still the old one, we need to live-migrate or stop/start the instance and note down the instance name; I am sure the issue should not come back on an instance that has been live-migrated or stopped/started.
+ If the admin socket is not enabled, enable it, then either stop/start or live-migrate the instance, note the instance name, and proceed as above.

In Summary:
==============
We need to make sure the in-memory version is 10.2.10-28.el7cp. If it is and the issue still occurs, we need to debug further; if it is not, please live-migrate or stop/start all instances.
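If the admin socket is not available (as noted in the next comment), another way to check whether a running qemu-kvm process still has the pre-upgrade librbd mapped is to look at /proc/<pid>/maps: after an RPM upgrade, a mapping that ends in "(deleted)" means the process is still executing the old copy of the library. A rough sketch follows, assuming qemu-kvm runs directly on the compute host; the helper is illustrative only and not part of nova or ceph.
~~~
#!/usr/bin/env python
# Illustrative helper (not part of nova or ceph): list which librbd/librados
# shared objects each running qemu-kvm process has mapped. A mapping marked
# "(deleted)" means the process is still executing the pre-upgrade library
# and needs a stop/start or a live migration round-trip to pick up the fix.
import os


def mapped_ceph_libs(pid):
    """Return the librbd/librados file mappings of the given PID."""
    libs = set()
    try:
        with open('/proc/%d/maps' % pid) as maps:
            for line in maps:
                if ('librbd' in line or 'librados' in line) and '/' in line:
                    # The path starts at the first '/' and may end in " (deleted)".
                    libs.add(line[line.index('/'):].strip())
    except IOError:
        pass  # process exited between listing and reading
    return libs


def qemu_pids():
    """Yield PIDs of processes whose command name looks like qemu."""
    for entry in os.listdir('/proc'):
        if not entry.isdigit():
            continue
        try:
            with open('/proc/%s/comm' % entry) as comm:
                if comm.read().strip() in ('qemu-kvm', 'qemu-system-x86_64'):
                    yield int(entry)
        except IOError:
            continue


if __name__ == '__main__':
    for pid in qemu_pids():
        for lib in sorted(mapped_ceph_libs(pid)):
            stale = '   <-- old copy still in memory' if '(deleted)' in lib else ''
            print('%d %s%s' % (pid, lib, stale))
~~~
Run as root on the compute node; any instance whose librbd mapping is flagged there still needs a stop/start or a live migration away and back before it is running the fixed code.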
We don't have the admin socket and instances were migrated after the patch. Does the customer also need to patch the Ceph cluster, or only the client libraries? Thanks, Andreas
WRT the instance in https://bugzilla.redhat.com/show_bug.cgi?id=1566723#c66, we took another backtrace on the destination instance: https://bugzilla.redhat.com/show_bug.cgi?id=1566723#c67. We also took an 18 GB gcore of the source instance, which the customer is going to attach (I'll let you know once that's done).
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHBA-2018:3689