Description of problem:

When a live migration is performed using shared storage and the destination host is the same host the instance was previously evacuated from, the live migration fails because the instance definition files are still present on that host.

Version-Release number of selected component (if applicable):
* OSP6
* python-nova-2014.2.3-48.el7ost.noarch

How reproducible:
100%

Steps to Reproduce:
Shared storage, e.g. ceph, is a requirement.
1. evacuate instance from host A
2. instance is now on host B
3. when host A is back, perform live migration of instance from B -> A

Actual results:
Live migration fails because the instance definition files, e.g. /var/lib/nova/instances/bba371dd-4e6a-4cba-b066-241bba2d7ccd, are still there.

Expected results:
Live migration works as expected.

Additional info:
Some more details:

* destroy_after_evacuate is enabled:

2016-02-17 07:58:02.308 2588 DEBUG nova.openstack.common.service [-] workarounds.destroy_after_evacuate = True log_opt_values /usr/lib/python2.7/site-packages/oslo/config/cfg.py:2004

* The log shows the instance being deleted since its host is not equal to the local host:

2016-02-17 07:58:03.346 2588 INFO nova.compute.manager [-] [instance: bba371dd-4e6a-4cba-b066-241bba2d7ccd] Deleting instance as its host (osp6-compute2) is not equal to our host (osp6-compute1).
2016-02-17 07:58:03.346 2588 DEBUG nova.objects.instance [-] Lazy-loading `system_metadata' on Instance uuid bba371dd-4e6a-4cba-b066-241bba2d7ccd obj_load_attr /usr/lib/python2.7/site-packages/nova/objects/instance.py:579
2016-02-17 07:58:03.400 2588 DEBUG nova.openstack.common.lockutils [-] Created new semaphore "refresh_cache-bba371dd-4e6a-4cba-b066-241bba2d7ccd" internal_lock /usr/lib/python2.7/site-packages/nova/openstack/common/lockutils.py:206
2016-02-17 07:58:03.400 2588 DEBUG nova.openstack.common.lockutils [-] Acquired semaphore "refresh_cache-bba371dd-4e6a-4cba-b066-241bba2d7ccd" lock /usr/lib/python2.7/site-packages/nova/openstack/common/lockutils.py:229
2016-02-17 07:58:03.400 2588 DEBUG nova.network.neutronv2.api [-] [instance: bba371dd-4e6a-4cba-b066-241bba2d7ccd] get_instance_nw_info() _get_instance_nw_info /usr/lib/python2.7/site-packages/nova/network/neutronv2/api.py:611

* Trigger live migration:

# nova live-migration bba371dd-4e6a-4cba-b066-241bba2d7ccd

* Live migration fails because /var/lib/nova/instances/bba371dd-4e6a-4cba-b066-241bba2d7ccd is still present:

2016-02-17 08:29:18.694 2588 DEBUG nova.openstack.common.lockutils [req-e8d01cdd-a02c-4094-be76-02b9da62cd21 ] Releasing semaphore "refresh_cache-bba371dd-4e6a-4cba-b066-241bba2d7ccd" lock /usr/lib/python2.7/site-packages/nova/openstack/common/lockutils.py:238
2016-02-17 08:29:18.718 2588 ERROR oslo.messaging.rpc.dispatcher [req-e8d01cdd-a02c-4094-be76-02b9da62cd21 ] Exception
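For reference, the option logged above ("workarounds.destroy_after_evacuate = True") corresponds to the following setting in nova.conf; this is a reconstruction from the log line, not copied from the reporter's config file:

```ini
[workarounds]
# Destroy instances locally when their host record points at another
# compute node, e.g. after an evacuation moved them elsewhere.
destroy_after_evacuate = True
```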
during message handling: The supplied disk path (/var/lib/nova/instances/bba371dd-4e6a-4cba-b066-241bba2d7ccd) already exists, it is expected not to exist.
2016-02-17 08:29:18.718 2588 TRACE oslo.messaging.rpc.dispatcher Traceback (most recent call last):
2016-02-17 08:29:18.718 2588 TRACE oslo.messaging.rpc.dispatcher   File "/usr/lib/python2.7/site-packages/oslo/messaging/rpc/dispatcher.py", line 134, in _dispatch_and_reply
2016-02-17 08:29:18.718 2588 TRACE oslo.messaging.rpc.dispatcher     incoming.message))
2016-02-17 08:29:18.718 2588 TRACE oslo.messaging.rpc.dispatcher   File "/usr/lib/python2.7/site-packages/oslo/messaging/rpc/dispatcher.py", line 177, in _dispatch
2016-02-17 08:29:18.718 2588 TRACE oslo.messaging.rpc.dispatcher     return self._do_dispatch(endpoint, method, ctxt, args)
2016-02-17 08:29:18.718 2588 TRACE oslo.messaging.rpc.dispatcher   File "/usr/lib/python2.7/site-packages/oslo/messaging/rpc/dispatcher.py", line 123, in _do_dispatch
2016-02-17 08:29:18.718 2588 TRACE oslo.messaging.rpc.dispatcher     result = getattr(endpoint, method)(ctxt, **new_args)
2016-02-17 08:29:18.718 2588 TRACE oslo.messaging.rpc.dispatcher   File "/usr/lib/python2.7/site-packages/nova/compute/manager.py", line 435, in decorated_function
2016-02-17 08:29:18.718 2588 TRACE oslo.messaging.rpc.dispatcher     return function(self, context, *args, **kwargs)
2016-02-17 08:29:18.718 2588 TRACE oslo.messaging.rpc.dispatcher   File "/usr/lib/python2.7/site-packages/nova/exception.py", line 88, in wrapped
2016-02-17 08:29:18.718 2588 TRACE oslo.messaging.rpc.dispatcher     payload)
2016-02-17 08:29:18.718 2588 TRACE oslo.messaging.rpc.dispatcher   File "/usr/lib/python2.7/site-packages/nova/openstack/common/excutils.py", line 82, in __exit__
2016-02-17 08:29:18.718 2588 TRACE oslo.messaging.rpc.dispatcher     six.reraise(self.type_, self.value, self.tb)
2016-02-17 08:29:18.718 2588 TRACE oslo.messaging.rpc.dispatcher   File "/usr/lib/python2.7/site-packages/nova/exception.py", line 71, in wrapped
2016-02-17 08:29:18.718 2588 TRACE oslo.messaging.rpc.dispatcher     return f(self, context, *args, **kw)
2016-02-17 08:29:18.718 2588 TRACE oslo.messaging.rpc.dispatcher   File "/usr/lib/python2.7/site-packages/nova/compute/manager.py", line 343, in decorated_function
2016-02-17 08:29:18.718 2588 TRACE oslo.messaging.rpc.dispatcher     kwargs['instance'], e, sys.exc_info())
2016-02-17 08:29:18.718 2588 TRACE oslo.messaging.rpc.dispatcher   File "/usr/lib/python2.7/site-packages/nova/openstack/common/excutils.py", line 82, in __exit__
2016-02-17 08:29:18.718 2588 TRACE oslo.messaging.rpc.dispatcher     six.reraise(self.type_, self.value, self.tb)
2016-02-17 08:29:18.718 2588 TRACE oslo.messaging.rpc.dispatcher   File "/usr/lib/python2.7/site-packages/nova/compute/manager.py", line 331, in decorated_function
2016-02-17 08:29:18.718 2588 TRACE oslo.messaging.rpc.dispatcher     return function(self, context, *args, **kwargs)
2016-02-17 08:29:18.718 2588 TRACE oslo.messaging.rpc.dispatcher   File "/usr/lib/python2.7/site-packages/nova/compute/manager.py", line 5005, in pre_live_migration
2016-02-17 08:29:18.718 2588 TRACE oslo.messaging.rpc.dispatcher     migrate_data)
2016-02-17 08:29:18.718 2588 TRACE oslo.messaging.rpc.dispatcher   File "/usr/lib/python2.7/site-packages/nova/virt/libvirt/driver.py", line 6052, in pre_live_migration
2016-02-17 08:29:18.718 2588 TRACE oslo.messaging.rpc.dispatcher     raise exception.DestinationDiskExists(path=instance_dir)
2016-02-17 08:29:18.718 2588 TRACE oslo.messaging.rpc.dispatcher DestinationDiskExists: The supplied disk path (/var/lib/nova/instances/bba371dd-4e6a-4cba-b066-241bba2d7ccd) already exists, it is expected not to exist.
2016-02-17 08:29:18.718 2588 TRACE oslo.messaging.rpc.dispatcher
2016-02-17 08:29:18.719 2588 ERROR oslo.messaging._drivers.common [req-e8d01cdd-a02c-4094-be76-02b9da62cd21 ] Returning exception The supplied disk path (/var/lib/nova/instances/bba371dd-4e6a-4cba-b066-241bba2d7ccd) already exists, it is expected not to exist.
to caller 2016-02-17 08:29:18.719 2588 ERROR oslo.messaging._drivers.common [req-e8d01cdd-a02c-4094-be76-02b9da62cd21 ] ['Traceback (most recent call last):\n', ' File "/usr/lib/python2.7/site-packages/oslo/messaging/rpc/dispatcher.py", line 134, in _dispatch_and_reply\n incoming.message))\n', ' File "/usr/lib/python2.7/site-packages/oslo/messaging/rpc/dispatcher.py", line 177, in _dispatch\n return self._do_dispatch(endpoint, method, ctxt, args)\n', ' File "/usr/lib/python2.7/site-packages/oslo/messaging/rpc/dispatcher.py", line 123, in _do_dispatch\n result = getattr(endpoint, method)(ctxt, **new_args)\n', ' File "/usr/lib/python2.7/site-packages/nova/compute/manager.py", line 435, in decorated_function\n return function(self, context, *args, **kwargs)\n', ' File "/usr/lib/python2.7/site-packages/nova/exception.py", line 88, in wrapped\n payload)\n', ' File "/usr/lib/python2.7/site-packages/nova/openstack/common/excutils.py", line 82, in __exit__\n six.reraise(self.type_, self.value, self.tb)\n', ' File "/usr/lib/python2.7/site-packages/nova/exception.py", line 71, in wrapped\n return f(self, context, *args, **kw)\n', ' File "/usr/lib/python2.7/site-packages/nova/compute/manager.py", line 343, in decorated_function\n kwargs[\'instance\'], e, sys.exc_info())\n', ' File "/usr/lib/python2.7/site-packages/nova/openstack/common/excutils.py", line 82, in __exit__\n six.reraise(self.type_, self.value, self.tb)\n', ' File "/usr/lib/python2.7/site-packages/nova/compute/manager.py", line 331, in decorated_function\n return function(self, context, *args, **kwargs)\n', ' File "/usr/lib/python2.7/site-packages/nova/compute/manager.py", line 5005, in pre_live_migration\n migrate_data)\n', ' File "/usr/lib/python2.7/site-packages/nova/virt/libvirt/driver.py", line 6052, in pre_live_migration\n raise exception.DestinationDiskExists(path=instance_dir)\n', 'DestinationDiskExists: The supplied disk path (/var/lib/nova/instances/bba371dd-4e6a-4cba-b066-241bba2d7ccd) already exists, 
it is expected not to exist.\n']
...
2016-02-17 08:29:19.490 2588 WARNING nova.virt.libvirt.driver [-] [instance: bba371dd-4e6a-4cba-b066-241bba2d7ccd] During wait destroy, instance disappeared.
2016-02-17 08:29:19.664 2588 INFO nova.virt.libvirt.driver [req-e8d01cdd-a02c-4094-be76-02b9da62cd21 None] [instance: bba371dd-4e6a-4cba-b066-241bba2d7ccd] Deleting instance files /var/lib/nova/instances/bba371dd-4e6a-4cba-b066-241bba2d7ccd_del
2016-02-17 08:29:19.665 2588 INFO nova.virt.libvirt.driver [req-e8d01cdd-a02c-4094-be76-02b9da62cd21 None] [instance: bba371dd-4e6a-4cba-b066-241bba2d7ccd] Deletion of /var/lib/nova/instances/bba371dd-4e6a-4cba-b066-241bba2d7ccd_del complete

* After the above cleanup, /var/lib/nova/instances/bba371dd-4e6a-4cba-b066-241bba2d7ccd is gone:

# ll /var/lib/nova/instances/bba371dd-4e6a-4cba-b066-241bba2d7ccd
ls: cannot access /var/lib/nova/instances/bba371dd-4e6a-4cba-b066-241bba2d7ccd: No such file or directory

From the OSP libvirt driver we can see that the instance files only get removed when we are NOT on shared disk (/usr/lib/python2.7/site-packages/nova/virt/libvirt/driver.py):

     1207         if destroy_disks:
     1208             # NOTE(haomai): destroy volumes if needed
     1209             if CONF.libvirt.images_type == 'lvm':
     1210                 self._cleanup_lvm(instance)
     1211             if CONF.libvirt.images_type == 'rbd':
     1212                 self._cleanup_rbd(instance)
     1213
     1214         if destroy_disks or (
     1215                 migrate_data and migrate_data.get('is_shared_block_storage',
     1216                                                   False)):
---> 1217             self._delete_instance_files(instance)

In case of evacuation we want to remove the instance files even if we are on shared storage.
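To make the desired behaviour concrete, here is a minimal sketch of what an extended cleanup condition could look like. This is purely illustrative, not the actual Nova patch: the `evacuated` flag is a hypothetical parameter standing in for "this destroy runs as part of destroy_after_evacuate handling".

```python
def should_delete_instance_files(destroy_disks, migrate_data, evacuated):
    """Sketch of an extended cleanup condition (hypothetical).

    destroy_disks: True when the instance disks are local to this host.
    migrate_data:  dict-like migration metadata, may be None.
    evacuated:     hypothetical flag, True when cleaning up after an
                   evacuation (destroy_after_evacuate).
    """
    if destroy_disks:
        return True
    if migrate_data and migrate_data.get('is_shared_block_storage', False):
        return True
    # Proposed extension: an evacuated instance leaves stale definition
    # files under /var/lib/nova/instances/<uuid> that later block a
    # live migration back to this host, so remove them here as well.
    return evacuated
```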
(In reply to Martin Schuppert from comment #0)
> Steps to Reproduce:
>
> Shared storage e.g. ceph is a requirement
>
> 1. evacuate instance from host A
> 2. instance is now on host B
> 3. when host A is back, perform live migration of instance from B -> A

Thanks as ever for the detailed report Martin!

Just to be clear, in this instance we are using rbd as the images_type with the following configurables set (obviously excluding my example values) in nova.conf on both the source and destination compute nodes:

[libvirt]
images_type = rbd
images_rbd_pool = vms
images_rbd_ceph_conf = /etc/ceph/ceph.conf
rbd_user = cinder
rbd_secret_uuid = 457eb676-33da-42ec-9a8c-9293d545c337

From the traces in c#1 we can see that DestinationDiskExists is thrown because the instance path exists while is_shared_instance_path is False, and _delete_instance_files is then called as part of the cleanup because is_shared_block_storage is also False.

AFAIK is_shared_instance_path is expected to be False with an images_type of rbd, as the temp file written by check_can_live_migrate_destination to /var/lib/nova/instances/ on the destination host would not be visible from the source.

is_shared_block_storage, however, should be True unless images_type isn't set to rbd on the destination host. Can we confirm what this is set to on both hosts? Maybe generate a Guru report as well to confirm that it has also been applied?
(In reply to Lee Yarwood from comment #2) > (In reply to Martin Schuppert from comment #0) > > Steps to Reproduce: > > > > Shared storage e.g. ceph is a requirement > > > > 1. evacuate instance from host A > > 2. instance is now on host B > > 3. when host A is back, perform live migration of instance from B -> A > > Thanks as ever for the detailed report Martin! Thanks :) > > Just to be clear, in this instance we are using rbd as the images_type with > the following configurables set (obviously excluding my example values) in > nova.conf on both the source and destination compute nodes : > > [libvirt] > images_type = rbd > images_rbd_pool = vms > images_rbd_ceph_conf = /etc/ceph/ceph.conf > rbd_user = cinder > rbd_secret_uuid = 457eb676-33da-42ec-9a8c-9293d545c337 > > > From the traces in c#1 we can see that DestinationDiskExists is thrown as > the instance path exists while is_shared_instance_path is False and > _delete_instance_files is then called as part of the cleanup because > is_shared_block_storage is also False. > > AFAIK is_shared_instance_path is expected to be False with an image_type of > rbd as the temp file written by check_can_live_migrate_destination to > /var/lib/nova/instances/ on the destination host would not be visible from > the source. > > is_shared_block_storage however should be True unless the image_type isn't > set to rbd on the destination host. Can we confirm what this is set to on > both hosts? Maybe generate a Guru report as well to confirm that it has also > been applied? yes, rbd is set on both computes where I reproduced it. 
After running the Guru report:

[root@osp6-compute1 ~]# kill -SIGUSR1 `pgrep nova`
[root@osp6-compute1 ~]# grep images_type /var/log/messages
Feb 17 14:13:03 localhost nova-compute: images_type = rbd

[root@osp6-compute2 ~]# kill -SIGUSR1 `pgrep nova`
[root@osp6-compute2 ~]# grep images_type /var/log/messages
Feb 17 14:13:51 localhost nova-compute: images_type = rbd

From my understanding this is what happens, please correct me if I am wrong:

* When compute A comes back, destroy_after_evacuate is performed, which we can see from the logs. From my understanding this should clean up the information from the evacuated instances.

* With rbd we have destroy_disks = False, since the instance disks are not local (/usr/lib/python2.7/site-packages/nova/compute/manager.py):

 786         try:
 787             network_info = self._get_instance_nw_info(context,
 788                                                       instance)
 789             bdi = self._get_instance_block_device_info(context,
 790                                                        instance)
 791             destroy_disks = not (self._is_instance_storage_shared(
 792                 context, instance))

* This calls destroy in ...

 801             self.driver.destroy(context, instance,
 802                                 network_info,
 803                                 bdi, destroy_disks)

* In /usr/lib/python2.7/site-packages/nova/virt/libvirt/driver.py, cleanup is performed at the end of destroy:

1093     def destroy(self, context, instance, network_info, block_device_info=None,
1094                 destroy_disks=True, migrate_data=None):
1095         self._destroy(instance)
1096         self.cleanup(context, instance, network_info, block_device_info,
1097                      destroy_disks, migrate_data)

* Since destroy_disks is False and migrate_data is None, we do not remove the instance files in the case of rbd during destroy_after_evacuate:

1214         if destroy_disks or (
1215                 migrate_data and migrate_data.get('is_shared_block_storage',
1216                                                   False)):
1217             self._delete_instance_files(instance)

I have an environment up where I can reproduce this if you want to have a look.
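The steps above can be condensed into a small model of the two quoted code paths (illustrative only, using the same flag names as the excerpts): with shared rbd storage, destroy_disks is False, and destroy_after_evacuate calls destroy() without migrate_data, so neither branch of the cleanup condition fires and the stale instance directory survives.

```python
def instance_files_deleted(storage_is_shared, migrate_data=None):
    """Model of whether _delete_instance_files would run during cleanup."""
    # manager.py: disks are only "destroyed" locally when storage is not
    # shared between the compute hosts.
    destroy_disks = not storage_is_shared
    # driver.py cleanup condition from the excerpt above.
    return bool(destroy_disks or
                (migrate_data and
                 migrate_data.get('is_shared_block_storage', False)))

# rbd (shared) + destroy_after_evacuate (migrate_data is None):
# the stale /var/lib/nova/instances/<uuid> directory is left behind.
```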
This bug was accidentally moved from POST to MODIFIED via an error in automation, please see mmccune with any questions
Nokia has stated that this issue is a blocker for their next release, due next week.
Lee and I have looked into this again. Unfortunately this bug is more of a design issue. Any tightly focussed fix would involve heuristics, and therefore come with a high risk of regressions. Lee has pointed out, though, that there's a simple workaround. When the migration fails, the cleanup code will automatically delete the offending directory on the destination. Consequently, simply trying the migration again will succeed. As it's a difficult fix with a high risk of regressions, and there's a relatively simple workaround, we've decided that the safest course of action is not to fix it in RHOS 6. We'll continue to work on it upstream, though.
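The workaround amounts to retrying the same operation once, since the failed attempt's cleanup removes the offending directory. A minimal sketch of that retry logic (names are illustrative, not the Nova or novaclient API):

```python
class DestinationDiskExists(Exception):
    """Stand-in for the error raised when the stale directory exists."""


def migrate_with_retry(migrate, attempts=2):
    """Call migrate(); retry if the stale-directory error occurs.

    The first failure triggers cleanup of the leftover instance
    directory on the destination, so the second attempt can succeed.
    """
    last_error = None
    for _ in range(attempts):
        try:
            return migrate()
        except DestinationDiskExists as exc:
            last_error = exc
    raise last_error
```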
upstream track bug https://bugs.launchpad.net/nova/+bug/1414895