Bug 1309398 - Instance live migration fails to DST host when instance got evacuated previously from DST host
Status: CLOSED WONTFIX
Product: Red Hat OpenStack
Classification: Red Hat
Component: openstack-nova
Version: 6.0 (Juno)
Hardware: x86_64 Linux
Priority: high  Severity: medium
Target Milestone: async
Target Release: 6.0 (Juno)
Assigned To: Matthew Booth
QA Contact: nlevinki
Keywords: ZStream
Blocks: 1325106 1456718
Reported: 2016-02-17 11:39 EST by Martin Schuppert
Modified: 2017-08-29 09:08 EDT

Doc Type: Bug Fix
Cloned to: 1325106 1456718 (view as bug list)
Last Closed: 2016-04-05 09:41:06 EDT
Type: Bug

External Trackers
Tracker ID Priority Status Summary Last Updated
Launchpad 1414895 None None None 2016-02-18 07:54 EST
OpenStack gerrit 281913 None None None 2016-02-21 09:53 EST

Description Martin Schuppert 2016-02-17 11:39:18 EST
Description of problem:

When a live migration is performed using shared storage and the destination (DST) host is the same host the instance was previously evacuated from, the live migration fails because the instance definition files are still present on that host.

Version-Release number of selected component (if applicable):

* OSP6
* python-nova-2014.2.3-48.el7ost.noarch

How reproducible:
100%

Steps to Reproduce:

Shared storage (e.g. Ceph) is a requirement.

1. evacuate instance from host A
2. instance is now on host B
3. when host A is back, perform live migration of instance from B -> A

Actual results:
Live migration fails because the instance definition files (e.g. /var/lib/nova/instances/bba371dd-4e6a-4cba-b066-241bba2d7ccd) are still present on the destination host.

Expected results:
Live migration succeeds.

Additional info:
Comment 1 Martin Schuppert 2016-02-17 11:45:15 EST
Some more details:

* destroy_after_evacuate is enabled:

2016-02-17 07:58:02.308 2588 DEBUG nova.openstack.common.service [-] workarounds.destroy_after_evacuate = True log_opt_values /usr/lib/python2.7/site-packages/oslo/config/cfg.py:2004

* log shows the instance to be deleted since the host is not equal to the local host:

2016-02-17 07:58:03.346 2588 INFO nova.compute.manager [-] [instance: bba371dd-4e6a-4cba-b066-241bba2d7ccd] Deleting instance as its host (osp6-compute2) is not equal to our host (osp6-compute1).
2016-02-17 07:58:03.346 2588 DEBUG nova.objects.instance [-] Lazy-loading `system_metadata' on Instance uuid bba371dd-4e6a-4cba-b066-241bba2d7ccd obj_load_attr /usr/lib/python2.7/site-packages/nova/objects/instance.py:579
2016-02-17 07:58:03.400 2588 DEBUG nova.openstack.common.lockutils [-] Created new semaphore "refresh_cache-bba371dd-4e6a-4cba-b066-241bba2d7ccd" internal_lock /usr/lib/python2.7/site-packages/nova/openstack/common/lockutils.py:206
2016-02-17 07:58:03.400 2588 DEBUG nova.openstack.common.lockutils [-] Acquired semaphore "refresh_cache-bba371dd-4e6a-4cba-b066-241bba2d7ccd" lock /usr/lib/python2.7/site-packages/nova/openstack/common/lockutils.py:229
2016-02-17 07:58:03.400 2588 DEBUG nova.network.neutronv2.api [-] [instance: bba371dd-4e6a-4cba-b066-241bba2d7ccd] get_instance_nw_info() _get_instance_nw_info /usr/lib/python2.7/site-packages/nova/network/neutronv2/api.py:611

* Trigger live migration:

# nova live-migration bba371dd-4e6a-4cba-b066-241bba2d7ccd

* Live migration fails because /var/lib/nova/instances/bba371dd-4e6a-4cba-b066-241bba2d7ccd is still present:

2016-02-17 08:29:18.694 2588 DEBUG nova.openstack.common.lockutils [req-e8d01cdd-a02c-4094-be76-02b9da62cd21 ] Releasing semaphore "refresh_cache-bba371dd-4e6a-4cba-b066-241bba2d7ccd" lock /usr/lib/python2.7/site-packages/nova/openstack/common/lockutils.py:238
2016-02-17 08:29:18.718 2588 ERROR oslo.messaging.rpc.dispatcher [req-e8d01cdd-a02c-4094-be76-02b9da62cd21 ] Exception during message handling: The supplied disk path (/var/lib/nova/instances/bba371dd-4e6a-4cba-b066-241bba2d7ccd) already exists, it is expected not to exist.
2016-02-17 08:29:18.718 2588 TRACE oslo.messaging.rpc.dispatcher Traceback (most recent call last):
2016-02-17 08:29:18.718 2588 TRACE oslo.messaging.rpc.dispatcher   File "/usr/lib/python2.7/site-packages/oslo/messaging/rpc/dispatcher.py", line 134, in _dispatch_and_reply
2016-02-17 08:29:18.718 2588 TRACE oslo.messaging.rpc.dispatcher     incoming.message))
2016-02-17 08:29:18.718 2588 TRACE oslo.messaging.rpc.dispatcher   File "/usr/lib/python2.7/site-packages/oslo/messaging/rpc/dispatcher.py", line 177, in _dispatch
2016-02-17 08:29:18.718 2588 TRACE oslo.messaging.rpc.dispatcher     return self._do_dispatch(endpoint, method, ctxt, args)
2016-02-17 08:29:18.718 2588 TRACE oslo.messaging.rpc.dispatcher   File "/usr/lib/python2.7/site-packages/oslo/messaging/rpc/dispatcher.py", line 123, in _do_dispatch
2016-02-17 08:29:18.718 2588 TRACE oslo.messaging.rpc.dispatcher     result = getattr(endpoint, method)(ctxt, **new_args)
2016-02-17 08:29:18.718 2588 TRACE oslo.messaging.rpc.dispatcher   File "/usr/lib/python2.7/site-packages/nova/compute/manager.py", line 435, in decorated_function
2016-02-17 08:29:18.718 2588 TRACE oslo.messaging.rpc.dispatcher     return function(self, context, *args, **kwargs)
2016-02-17 08:29:18.718 2588 TRACE oslo.messaging.rpc.dispatcher   File "/usr/lib/python2.7/site-packages/nova/exception.py", line 88, in wrapped
2016-02-17 08:29:18.718 2588 TRACE oslo.messaging.rpc.dispatcher     payload)
2016-02-17 08:29:18.718 2588 TRACE oslo.messaging.rpc.dispatcher   File "/usr/lib/python2.7/site-packages/nova/openstack/common/excutils.py", line 82, in __exit__
2016-02-17 08:29:18.718 2588 TRACE oslo.messaging.rpc.dispatcher     six.reraise(self.type_, self.value, self.tb)
2016-02-17 08:29:18.718 2588 TRACE oslo.messaging.rpc.dispatcher   File "/usr/lib/python2.7/site-packages/nova/exception.py", line 71, in wrapped
2016-02-17 08:29:18.718 2588 TRACE oslo.messaging.rpc.dispatcher     return f(self, context, *args, **kw)
2016-02-17 08:29:18.718 2588 TRACE oslo.messaging.rpc.dispatcher   File "/usr/lib/python2.7/site-packages/nova/compute/manager.py", line 343, in decorated_function
2016-02-17 08:29:18.718 2588 TRACE oslo.messaging.rpc.dispatcher     kwargs['instance'], e, sys.exc_info())
2016-02-17 08:29:18.718 2588 TRACE oslo.messaging.rpc.dispatcher   File "/usr/lib/python2.7/site-packages/nova/openstack/common/excutils.py", line 82, in __exit__
2016-02-17 08:29:18.718 2588 TRACE oslo.messaging.rpc.dispatcher     six.reraise(self.type_, self.value, self.tb)
2016-02-17 08:29:18.718 2588 TRACE oslo.messaging.rpc.dispatcher   File "/usr/lib/python2.7/site-packages/nova/compute/manager.py", line 331, in decorated_function
2016-02-17 08:29:18.718 2588 TRACE oslo.messaging.rpc.dispatcher     return function(self, context, *args, **kwargs)
2016-02-17 08:29:18.718 2588 TRACE oslo.messaging.rpc.dispatcher   File "/usr/lib/python2.7/site-packages/nova/compute/manager.py", line 5005, in pre_live_migration
2016-02-17 08:29:18.718 2588 TRACE oslo.messaging.rpc.dispatcher     migrate_data)
2016-02-17 08:29:18.718 2588 TRACE oslo.messaging.rpc.dispatcher   File "/usr/lib/python2.7/site-packages/nova/virt/libvirt/driver.py", line 6052, in pre_live_migration
2016-02-17 08:29:18.718 2588 TRACE oslo.messaging.rpc.dispatcher     raise exception.DestinationDiskExists(path=instance_dir)
2016-02-17 08:29:18.718 2588 TRACE oslo.messaging.rpc.dispatcher DestinationDiskExists: The supplied disk path (/var/lib/nova/instances/bba371dd-4e6a-4cba-b066-241bba2d7ccd) already exists, it is expected not to exist.
2016-02-17 08:29:18.718 2588 TRACE oslo.messaging.rpc.dispatcher 
2016-02-17 08:29:18.719 2588 ERROR oslo.messaging._drivers.common [req-e8d01cdd-a02c-4094-be76-02b9da62cd21 ] Returning exception The supplied disk path (/var/lib/nova/instances/bba371dd-4e6a-4cba-b066-241bba2d7ccd) already exists, it is expected not to exist. to caller
2016-02-17 08:29:18.719 2588 ERROR oslo.messaging._drivers.common [req-e8d01cdd-a02c-4094-be76-02b9da62cd21 ] ['Traceback (most recent call last):\n', '  File "/usr/lib/python2.7/site-packages/oslo/messaging/rpc/dispatcher.py", line 134, in _dispatch_and_reply\n    incoming.message))\n', '  File "/usr/lib/python2.7/site-packages/oslo/messaging/rpc/dispatcher.py", line 177, in _dispatch\n    return self._do_dispatch(endpoint, method, ctxt, args)\n', '  File "/usr/lib/python2.7/site-packages/oslo/messaging/rpc/dispatcher.py", line 123, in _do_dispatch\n    result = getattr(endpoint, method)(ctxt, **new_args)\n', '  File "/usr/lib/python2.7/site-packages/nova/compute/manager.py", line 435, in decorated_function\n    return function(self, context, *args, **kwargs)\n', '  File "/usr/lib/python2.7/site-packages/nova/exception.py", line 88, in wrapped\n    payload)\n', '  File "/usr/lib/python2.7/site-packages/nova/openstack/common/excutils.py", line 82, in __exit__\n    six.reraise(self.type_, self.value, self.tb)\n', '  File "/usr/lib/python2.7/site-packages/nova/exception.py", line 71, in wrapped\n    return f(self, context, *args, **kw)\n', '  File "/usr/lib/python2.7/site-packages/nova/compute/manager.py", line 343, in decorated_function\n    kwargs[\'instance\'], e, sys.exc_info())\n', '  File "/usr/lib/python2.7/site-packages/nova/openstack/common/excutils.py", line 82, in __exit__\n    six.reraise(self.type_, self.value, self.tb)\n', '  File "/usr/lib/python2.7/site-packages/nova/compute/manager.py", line 331, in decorated_function\n    return function(self, context, *args, **kwargs)\n', '  File "/usr/lib/python2.7/site-packages/nova/compute/manager.py", line 5005, in pre_live_migration\n    migrate_data)\n', '  File "/usr/lib/python2.7/site-packages/nova/virt/libvirt/driver.py", line 6052, in pre_live_migration\n    raise exception.DestinationDiskExists(path=instance_dir)\n', 'DestinationDiskExists: The supplied disk path 
(/var/lib/nova/instances/bba371dd-4e6a-4cba-b066-241bba2d7ccd) already exists, it is expected not to exist.\n']

...

2016-02-17 08:29:19.490 2588 WARNING nova.virt.libvirt.driver [-] [instance: bba371dd-4e6a-4cba-b066-241bba2d7ccd] During wait destroy, instance disappeared.
2016-02-17 08:29:19.664 2588 INFO nova.virt.libvirt.driver [req-e8d01cdd-a02c-4094-be76-02b9da62cd21 None] [instance: bba371dd-4e6a-4cba-b066-241bba2d7ccd] Deleting instance files /var/lib/nova/instances/bba371dd-4e6a-4cba-b066-241bba2d7ccd_del
2016-02-17 08:29:19.665 2588 INFO nova.virt.libvirt.driver [req-e8d01cdd-a02c-4094-be76-02b9da62cd21 None] [instance: bba371dd-4e6a-4cba-b066-241bba2d7ccd] Deletion of /var/lib/nova/instances/bba371dd-4e6a-4cba-b066-241bba2d7ccd_del complete

* after the above cleanup, /var/lib/nova/instances/bba371dd-4e6a-4cba-b066-241bba2d7ccd is gone:

# ll /var/lib/nova/instances/bba371dd-4e6a-4cba-b066-241bba2d7ccd
ls: cannot access /var/lib/nova/instances/bba371dd-4e6a-4cba-b066-241bba2d7ccd: No such file or directory

In the OSP libvirt driver we can see that the instance files only get removed when we are NOT on shared storage ( /usr/lib/python2.7/site-packages/nova/virt/libvirt/driver.py ):

   1207         if destroy_disks:
   1208             # NOTE(haomai): destroy volumes if needed
   1209             if CONF.libvirt.images_type == 'lvm':
   1210                 self._cleanup_lvm(instance)
   1211             if CONF.libvirt.images_type == 'rbd':
   1212                 self._cleanup_rbd(instance)
   1213 
   1214         if destroy_disks or (
   1215                 migrate_data and migrate_data.get('is_shared_block_storage',
   1216                                                   False)):
--->   1217             self._delete_instance_files(instance)

In the case of evacuation we want to remove the instance files even if we are on shared storage.
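A hypothetical sketch of the kind of change this implies (not the actual Nova patch; the function name and the `evacuated` flag are purely illustrative): the cleanup guard would treat evacuation as a reason to delete the local definition files even when destroy_disks is false.

```python
# Hypothetical sketch, NOT the actual Nova fix: the cleanup guard from
# driver.py, extended so an evacuation cleanup also removes the local
# instance definition files when the disks live on shared storage (rbd).
def should_delete_instance_files(destroy_disks, migrate_data, evacuated=False):
    if destroy_disks:
        # disks are local, so the whole instance directory goes
        return True
    if migrate_data and migrate_data.get('is_shared_block_storage', False):
        # disks are shared, but the definition files are still local
        return True
    # evacuation leaves stale definition files behind even on shared storage
    return evacuated

# Current behaviour during destroy_after_evacuate on rbd:
# destroy_disks=False, migrate_data=None -> files are kept (this bug)
```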
Comment 2 Lee Yarwood 2016-02-17 13:58:00 EST
(In reply to Martin Schuppert from comment #0)
> Steps to Reproduce:
> 
> Shared storage e.g. ceph is a requirement 
> 
> 1. evacuate instance from host A
> 2. instance is now on host B
> 3. when host A is back, perform live migration of instance from B -> A

Thanks as ever for the detailed report Martin!

Just to be clear, in this instance we are using rbd as the images_type with the following configurables set (obviously excluding my example values) in nova.conf on both the source and destination compute nodes :

[libvirt]
images_type = rbd
images_rbd_pool = vms
images_rbd_ceph_conf = /etc/ceph/ceph.conf
rbd_user = cinder
rbd_secret_uuid = 457eb676-33da-42ec-9a8c-9293d545c337


From the traces in c#1 we can see that DestinationDiskExists is thrown as the instance path exists while is_shared_instance_path is False and _delete_instance_files is then called as part of the cleanup because is_shared_block_storage is also False.

AFAIK is_shared_instance_path is expected to be False with an image_type of rbd as the temp file written by check_can_live_migrate_destination to /var/lib/nova/instances/ on the destination host would not be visible from the source.

is_shared_block_storage however should be True unless the image_type isn't set to rbd on the destination host. Can we confirm what this is set to on both hosts? Maybe generate a Guru report as well to confirm that it has also been applied?
Comment 3 Martin Schuppert 2016-02-18 03:36:25 EST
(In reply to Lee Yarwood from comment #2)
> (In reply to Martin Schuppert from comment #0)
> > Steps to Reproduce:
> > 
> > Shared storage e.g. ceph is a requirement 
> > 
> > 1. evacuate instance from host A
> > 2. instance is now on host B
> > 3. when host A is back, perform live migration of instance from B -> A
> 
> Thanks as ever for the detailed report Martin!

Thanks :)

> 
> Just to be clear, in this instance we are using rbd as the images_type with
> the following configurables set (obviously excluding my example values) in
> nova.conf on both the source and destination compute nodes :
> 
> [libvirt]
> images_type = rbd
> images_rbd_pool = vms
> images_rbd_ceph_conf = /etc/ceph/ceph.conf
> rbd_user = cinder
> rbd_secret_uuid = 457eb676-33da-42ec-9a8c-9293d545c337
> 
> 
> From the traces in c#1 we can see that DestinationDiskExists is thrown as
> the instance path exists while is_shared_instance_path is False and
> _delete_instance_files is then called as part of the cleanup because
> is_shared_block_storage is also False.
> 
> AFAIK is_shared_instance_path is expected to be False with an image_type of
> rbd as the temp file written by check_can_live_migrate_destination to
> /var/lib/nova/instances/ on the destination host would not be visible from
> the source.
> 
> is_shared_block_storage however should be True unless the image_type isn't
> set to rbd on the destination host. Can we confirm what this is set to on
> both hosts? Maybe generate a Guru report as well to confirm that it has also
> been applied?

Yes, rbd is set on both computes where I reproduced it. After triggering a Guru report:

[root@osp6-compute1 ~]# kill -SIGUSR1 `pgrep nova`
[root@osp6-compute1 ~]# grep images_type  /var/log/messages
Feb 17 14:13:03 localhost nova-compute: images_type = rbd

[root@osp6-compute2 ~]# kill -SIGUSR1 `pgrep nova`
[root@osp6-compute2 ~]# grep images_type  /var/log/messages
Feb 17 14:13:51 localhost nova-compute: images_type = rbd

From my understanding this is what happens; please correct me if I am wrong:

* when compute A comes back
* destroy_after_evacuate is being performed, which we can see from the logs. From my understanding this should clean up the information from the evacuated instances
* with rbd we have destroy_disks = false, since the instance disks are not local ( /usr/lib/python2.7/site-packages/nova/compute/manager.py )


    786                 try:
    787                     network_info = self._get_instance_nw_info(context,
    788                                                               instance)
    789                     bdi = self._get_instance_block_device_info(context,
    790                                                                instance)
    791                     destroy_disks = not (self._is_instance_storage_shared(
    792                                                             context, instance))

* this then calls driver.destroy() further down:
...
    801                 self.driver.destroy(context, instance,
    802                                     network_info,
    803                                     bdi, destroy_disks)

* in /usr/lib/python2.7/site-packages/nova/virt/libvirt/driver.py, destroy() then performs the cleanup:

   1093     def destroy(self, context, instance, network_info, block_device_info=None,
   1094                 destroy_disks=True, migrate_data=None):
   1095         self._destroy(instance)
   1096         self.cleanup(context, instance, network_info, block_device_info,
   1097                      destroy_disks, migrate_data)

* since destroy_disks is false and migrate_data is None, we do not remove the instance files in the RBD case during destroy_after_evacuate:

   1214         if destroy_disks or (
   1215                 migrate_data and migrate_data.get('is_shared_block_storage',
   1216                                                   False)):
   1217             self._delete_instance_files(instance)
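The walkthrough above can be condensed by evaluating that guard with the values seen during destroy_after_evacuate on an rbd deployment (a minimal reproduction of the condition only, not Nova code):

```python
# Values observed in this scenario (see the logs and code excerpts above):
destroy_disks = False   # rbd: the instance disks are not local
migrate_data = None     # destroy() after evacuation carries no migrate_data

# The guard from driver.py:
delete_files = destroy_disks or (
    migrate_data and migrate_data.get('is_shared_block_storage', False))

# delete_files is falsy, so _delete_instance_files() is skipped and the
# stale instance directory survives on the old host.
```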

I have an environment up where I can reproduce this if you want to have a look.
Comment 5 Mike McCune 2016-03-28 19:00:31 EDT
This bug was accidentally moved from POST to MODIFIED via an error in automation, please see mmccune@redhat.com with any questions
Comment 6 GE Scott Knauss 2016-03-30 07:59:54 EDT
Nokia has stated that this issue is a blocker for their next release, due next week.
Comment 8 Matthew Booth 2016-04-05 09:41:06 EDT
Lee and I have looked into this again. Unfortunately this bug is more of a design issue. Any tightly focussed fix would involve heuristics, and therefore come with a high risk of regressions.

Lee has pointed out, though, that there's a simple workaround. When the migration fails, the cleanup code will automatically delete the offending directory on the destination. Consequently, simply trying the migration again will succeed.

As it's a difficult fix with a high risk of regressions, and there's a relatively simple workaround, we've decided that the safest course of action is not to fix it in RHOS 6. We'll continue to work on it upstream, though.
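A minimal sketch of that retry workaround (illustrative only; `migrate` stands in for whatever triggers the live migration, e.g. a python-novaclient call, and the exception handling is deliberately broad):

```python
def live_migrate_with_retry(migrate, attempts=2):
    """Retry a live migration once: the first failure's cleanup path
    removes the stale destination directory, so the retry can succeed."""
    last_exc = None
    for _ in range(attempts):
        try:
            migrate()
            return True
        except Exception as exc:  # e.g. DestinationDiskExists via RPC
            last_exc = exc
    raise last_exc
```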
Comment 9 Jeffrey Zhang 2017-06-27 02:51:43 EDT
Upstream tracking bug: https://bugs.launchpad.net/nova/+bug/1414895
