Red Hat Bugzilla – Bug 915274
Attempting to 'nova live-migrate' to a non-existent host fails, and the instance remains in a perpetual MIGRATING state
Last modified: 2016-04-22 01:01:56 EDT
Upstream bz link: https://bugs.launchpad.net/nova/+bug/1100462

Description of problem:

Attempting to 'nova live-migrate' to a non-existent host fails, and the state of the instance remains 'MIGRATING'.

Version-Release number of selected component (if applicable):

-> This is on RHEL 6.4, 2.1 nightly repositories
#=================================#
[tuser1@interceptor ~(keystone_admin)]$ rpm -q openstack-nova ; cat /etc/redhat-release
openstack-nova-2012.2.3-1.el6ost.noarch
Red Hat Enterprise Linux Server release 6.4 (Santiago)
[tuser1@interceptor ~(keystone_admin)]$
#=================================#

How reproducible: Always

Steps to Reproduce:

-> Ensure the right fix is in:
#=================================#
[root@interceptor ~(keystone_admin)]# grep -i -A3 -B3 "Live Migration failure" /usr/lib/python2.6/site-packages/nova/virt/libvirt/driver.py
        except Exception as e:
            with excutils.save_and_reraise_exception():
                LOG.error(_("Live Migration failure: %(e)s") % locals(),
                          instance=instance_ref)
                recover_method(ctxt, instance_ref, dest, block_migration)
[root@interceptor ~(keystone_admin)]#
#=================================#

-> List the currently running nova instances:
#=================================#
[tuser1@interceptor ~(keystone_user1)]$ nova list
+--------------------------------------+-----------+--------+-------------------+
| ID                                   | Name      | Status | Networks          |
+--------------------------------------+-----------+--------+-------------------+
| 08d616a9-87a1-4c0d-b986-7d6aa5ed6780 | fedora-t1 | ACTIVE | net1=10.65.207.50 |
| 3e487977-37e8-4f26-9443-d65ecbdf83c9 | fedora-t2 | ACTIVE | net1=10.65.207.51 |
| 48d9e518-a91f-48db-9d9b-965b243e7113 | fedora-t4 | ACTIVE | net1=10.65.207.52 |
+--------------------------------------+-----------+--------+-------------------+
#=================================#

-> Let's try to live migrate using the 'name' of the guest. I'm trying to migrate to a host which *doesn't* have a Compute node running/configured:
#=================================#
[tuser1@interceptor ~(keystone_user1)]$ nova live-migration fedora-t4 maelstrom.lab.eng.pnq.redhat.com
ERROR: Policy doesn't allow compute_extension:admin_actions:migrateLive to be performed. (HTTP 403) (Request-ID: req-ce093ff9-3c4d-48b3-82c9-d68a7f08ecf6)
[tuser1@interceptor ~(keystone_user1)]$
#=================================#
Note that it says policy doesn't allow it.

-> Source the keystone_admin rc file and try to migrate using the 'Name'. It says the name 'fedora-t4' doesn't exist, although the 'nova list' output above clearly shows it does (most likely because that listing was made as a different tenant):
#=================================#
[tuser1@interceptor ~(keystone_user1)]$ . ~/keystonerc_admin
[tuser1@interceptor ~(keystone_admin)]$ nova live-migration fedora-t4 maelstrom.lab.eng.pnq.redhat.com
ERROR: No server with a name or ID of 'fedora-t4' exists.
#=================================#
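For reference, the driver.py snippet verified by the grep above is the oslo-incubator "save and re-raise" idiom: log the failure, run a recovery callback, then let the original exception propagate. Below is a minimal, self-contained sketch of that idiom; the context manager is a simplified stand-in for nova.openstack.common.excutils, and do_migration plus the recovery callback are made-up placeholders, not nova code:
#=================================#
import logging

logging.basicConfig(level=logging.INFO)
LOG = logging.getLogger(__name__)


class save_and_reraise_exception(object):
    """Simplified stand-in for the excutils context manager: run the
    block body (logging, recovery), then re-raise the original
    exception so callers still see the failure."""

    def __enter__(self):
        return self

    def __exit__(self, exc_type, exc_value, tb):
        # Returning False tells Python to re-raise the in-flight
        # exception once the block body has finished.
        return False


def do_migration(instance, dest):
    # Made-up placeholder for the real libvirt migration call.
    raise RuntimeError("cannot reach destination host %s" % dest)


def live_migrate(instance, dest, recover_method):
    try:
        do_migration(instance, dest)
    except Exception as e:
        with save_and_reraise_exception():
            LOG.error("Live Migration failure: %s", e)
            recover_method(instance, dest)  # e.g. reset task_state


try:
    live_migrate("fedora-t4", "maelstrom.lab.eng.pnq.redhat.com",
                 lambda inst, dest: LOG.info("recovery ran for %s", inst))
except RuntimeError:
    print("original exception still propagated, as expected")
#=================================#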
-> OK, so let's try to migrate using the ID of the guest:
#=================================#
[tuser1@interceptor ~(keystone_admin)]$ nova live-migration 48d9e518-a91f-48db-9d9b-965b243e7113 maelstrom.lab.eng.pnq.redhat.com
ERROR: Live migration of instance 48d9e518-a91f-48db-9d9b-965b243e7113 to host maelstrom.lab.eng.pnq.redhat.com failed (HTTP 400) (Request-ID: req-b41ad3bb-2b0c-4aea-b15e-3513106c3bf9)
[tuser1@interceptor ~(keystone_admin)]$
#=================================#

-> Let's list the status of the current instances:
#=================================#
[tuser1@interceptor ~(keystone_admin)]$ nova list --all-tenants
+--------------------------------------+-----------+-----------+-------------------+
| ID                                   | Name      | Status    | Networks          |
+--------------------------------------+-----------+-----------+-------------------+
| 08d616a9-87a1-4c0d-b986-7d6aa5ed6780 | fedora-t1 | ACTIVE    | net1=10.65.207.50 |
| 3e487977-37e8-4f26-9443-d65ecbdf83c9 | fedora-t2 | ACTIVE    | net1=10.65.207.51 |
| 48d9e518-a91f-48db-9d9b-965b243e7113 | fedora-t4 | MIGRATING | net1=10.65.207.52 |
+--------------------------------------+-----------+-----------+-------------------+
[tuser1@interceptor ~(keystone_admin)]$
#=================================#
Interesting: the above says 'MIGRATING', a dubious status, since the instance has been in that same state for more than a couple of hours.

-> Let's grep for the string 'migrate' in the nova logs:
#=================================#
[tuser1@interceptor glance(keystone_admin)]$ sudo grep -i migrate /var/log/nova/*.log
/var/log/nova/api.log:2013-02-25 13:04:19 66223 TRACE nova.api.openstack.compute.contrib.admin_actions   File "/usr/lib/python2.6/site-packages/nova/api/openstack/compute/contrib/admin_actions.py", line 288, in _migrate_live
/var/log/nova/api.log:2013-02-25 13:04:19 66223 TRACE nova.api.openstack.compute.contrib.admin_actions   File "/usr/lib/python2.6/site-packages/nova/compute/api.py", line 2040, in live_migrate
[tuser1@interceptor glance(keystone_admin)]$ date
Mon Feb 25 13:45:05 IST 2013
[tuser1@interceptor glance(keystone_admin)]$
#=================================#

-> nova's api.log indicates the failure of the migration, with "compute host not found":
#=================================#
. . . in service_get_all_compute_by_host
2013-02-25 13:04:19 66223 TRACE nova.api.openstack.compute.contrib.admin_actions     return IMPL.service_get_all_compute_by_host(context, host)
2013-02-25 13:04:19 66223 TRACE nova.api.openstack.compute.contrib.admin_actions
2013-02-25 13:04:19 66223 TRACE nova.api.openstack.compute.contrib.admin_actions   File "/usr/lib/python2.6/site-packages/nova/db/sqlalchemy/api.py", line 111, in wrapper
2013-02-25 13:04:19 66223 TRACE nova.api.openstack.compute.contrib.admin_actions     return f(*args, **kwargs)
2013-02-25 13:04:19 66223 TRACE nova.api.openstack.compute.contrib.admin_actions
2013-02-25 13:04:19 66223 TRACE nova.api.openstack.compute.contrib.admin_actions   File "/usr/lib/python2.6/site-packages/nova/db/sqlalchemy/api.py", line 394, in service_get_all_compute_by_host
2013-02-25 13:04:19 66223 TRACE nova.api.openstack.compute.contrib.admin_actions     raise exception.ComputeHostNotFound(host=host)
2013-02-25 13:04:19 66223 TRACE nova.api.openstack.compute.contrib.admin_actions
2013-02-25 13:04:19 66223 TRACE nova.api.openstack.compute.contrib.admin_actions ComputeHostNotFound: Compute host maelstrom.lab.eng.pnq.redhat.com could not be found.
. . .
2013-02-25 13:04:19 INFO nova.api.openstack.wsgi [req-b41ad3bb-2b0c-4aea-b15e-3513106c3bf9 e34da0f70aaa4b86b97857299d66155f 5aaa100a372248dd9c658f8b7775784c] HTTP exception thrown: Live migration of instance 48d9e518-a91f-48db-9d9b-965b243e7113 to host maelstrom.lab.eng.pnq.redhat.com failed
2013-02-25 13:04:19 INFO nova.api.openstack.wsgi [req-b41ad3bb-2b0c-4aea-b15e-3513106c3bf9 e34da0f70aaa4b86b97857299d66155f 5aaa100a372248dd9c658f8b7775784c] http://127.0.0.1:8774/v1.1/5aaa100a372248dd9c658f8b7775784c/servers/48d9e518-a91f-48db-9d9b-965b243e7113/action returned with HTTP 400
2013-02-25 13:04:19 INFO nova.osapi_compute.wsgi.server [req-b41ad3bb-2b0c-4aea-b15e-3513106c3bf9 e34da0f70aaa4b86b97857299d66155f 5aaa100a372248dd9c658f8b7775784c] 127.0.0.1 - - [25/Feb/2013 13:04:19] "POST /v1.1/5aaa100a372248dd9c658f8b7775784c/servers/48d9e518-a91f-48db-9d9b-965b243e7113/action HTTP/1.1" 400 352 0.370040
. . . .
#=================================#
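To make the failure mode concrete: the instance's task_state is flipped to 'migrating' before the destination host is validated, so the ComputeHostNotFound raised above leaves it stuck. A schematic illustration of that ordering (deliberately simplified, not nova's actual code):
#=================================#
class ComputeHostNotFound(Exception):
    pass


KNOWN_COMPUTE_HOSTS = {"interceptor.lab.eng.pnq.redhat.com"}


def live_migrate(instance, dest):
    instance["task_state"] = "migrating"  # state changed up front
    if dest not in KNOWN_COMPUTE_HOSTS:
        # Propagates back to the API as an HTTP 400, but nothing on
        # this error path restores task_state.
        raise ComputeHostNotFound(
            "Compute host %s could not be found." % dest)
    # ... the actual migration would run here ...
    instance["task_state"] = None


instance = {"name": "fedora-t4", "task_state": None}
try:
    live_migrate(instance, "maelstrom.lab.eng.pnq.redhat.com")
except ComputeHostNotFound as exc:
    print(exc)
print("task_state: %s" % instance["task_state"])  # -> migrating, indefinitely
#=================================#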
-> nova stop fails:
#=================================#
[tuser1@interceptor ~(keystone_admin)]$ nova stop 48d9e518-a91f-48db-9d9b-965b243e7113
ERROR: Instance 48d9e518-a91f-48db-9d9b-965b243e7113 in task_state migrating. Cannot stop while the instance is in this state. (HTTP 400) (Request-ID: req-7fac4a5d-6510-4aaf-8aed-cfcc5a8c5087)
#=================================#

Actual results:

Migration fails, and the running instance is stuck in a perpetual MIGRATING state.

Expected results:

Once nova knows the target host doesn't exist or is invalid, it should return the instance to the 'running' state.
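The "Cannot stop while the instance is in this state" error comes from nova's task-state guard on instance actions. An illustrative sketch in the spirit of the check_instance_state decorator in nova/compute/api.py (not the real implementation):
#=================================#
import functools


class InstanceInvalidState(Exception):
    pass


def reject_while_migrating(fn):
    """Illustrative guard: refuse the action if the instance's
    task_state says a migration is (still) in flight."""
    @functools.wraps(fn)
    def wrapper(instance, *args, **kwargs):
        if instance.get("task_state") == "migrating":
            raise InstanceInvalidState(
                "Instance %s in task_state migrating. Cannot %s while "
                "the instance is in this state."
                % (instance["uuid"], fn.__name__))
        return fn(instance, *args, **kwargs)
    return wrapper


@reject_while_migrating
def stop(instance):
    instance["power_state"] = "shutdown"


stuck = {"uuid": "48d9e518-a91f-48db-9d9b-965b243e7113",
         "task_state": "migrating"}
try:
    stop(stuck)
except InstanceInvalidState as exc:
    print(exc)  # mirrors the HTTP 400 seen above
#=================================#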
Additional info:

-> Info about the target host, showing that none of the 'nova' services are running:
#=================================#
[root@maelstrom ~]# hostname
maelstrom.lab.eng.pnq.redhat.com
#=================================#
[root@maelstrom ~]# openstack-status
== Nova services ==
openstack-nova-api:       inactive (disabled on boot)
openstack-nova-cert:      inactive (disabled on boot)
openstack-nova-compute:   inactive (disabled on boot)
openstack-nova-network:   inactive (disabled on boot)
openstack-nova-scheduler: inactive (disabled on boot)
openstack-nova-volume:    inactive (disabled on boot)
== Support services ==
libvirtd:                 active
Warning novarc not sourced
[root@maelstrom ~]#
#=================================#
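For context on why the destination should have been rejected: the scheduler can tell whether a compute service is alive by comparing its last periodic DB heartbeat against the service_down_time option. A rough sketch of that liveness test; the 60-second default and the updated_at/created_at field names are assumptions based on Folsom-era nova, not verified against this build:
#=================================#
import datetime

# Assumed default for nova's service_down_time option; check nova.conf
# on the deployment in question.
SERVICE_DOWN_TIME = datetime.timedelta(seconds=60)


def service_is_up(service, now=None):
    """A compute service counts as 'up' only if its last DB heartbeat
    (updated_at) is recent enough."""
    now = now or datetime.datetime.utcnow()
    last_heartbeat = service.get("updated_at") or service.get("created_at")
    if last_heartbeat is None:
        return False
    return (now - last_heartbeat) <= SERVICE_DOWN_TIME


# With every nova service on maelstrom disabled, any service record
# (if one even exists) is stale, so the check fails:
stale = {"updated_at":
         datetime.datetime.utcnow() - datetime.timedelta(hours=2)}
print(service_is_up(stale))  # -> False
#=================================#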
Some more investigation:

-> List the running instances:
#=================================#
[tuser1@interceptor ~(keystone_admin)]$ nova list --all-tenants
+--------------------------------------+-----------+-----------+-------------------+
| ID                                   | Name      | Status    | Networks          |
+--------------------------------------+-----------+-----------+-------------------+
| 08d616a9-87a1-4c0d-b986-7d6aa5ed6780 | fedora-t1 | ACTIVE    | net1=10.65.207.50 |
| 3e487977-37e8-4f26-9443-d65ecbdf83c9 | fedora-t2 | MIGRATING | net1=10.65.207.51 |
| 48d9e518-a91f-48db-9d9b-965b243e7113 | fedora-t4 | ACTIVE    | net1=10.65.207.52 |
+--------------------------------------+-----------+-----------+-------------------+
#=================================#

-> Let's try to 'delete' the MIGRATING instance:
#=================================#
[tuser1@interceptor ~(keystone_admin)]$ nova delete 3e487977-37e8-4f26-9443-d65ecbdf83c9
#=================================#

-> List all the running instances. Note that it shows the just-deleted instance as 'ACTIVE' (a dubious status):
#=================================#
[tuser1@interceptor ~(keystone_admin)]$ nova list --all-tenants
+--------------------------------------+-----------+--------+-------------------+
| ID                                   | Name      | Status | Networks          |
+--------------------------------------+-----------+--------+-------------------+
| 08d616a9-87a1-4c0d-b986-7d6aa5ed6780 | fedora-t1 | ACTIVE | net1=10.65.207.50 |
| 3e487977-37e8-4f26-9443-d65ecbdf83c9 | fedora-t2 | ACTIVE | net1=10.65.207.51 |
| 48d9e518-a91f-48db-9d9b-965b243e7113 | fedora-t4 | ACTIVE | net1=10.65.207.52 |
+--------------------------------------+-----------+--------+-------------------+
[tuser1@interceptor ~(keystone_admin)]$
#=================================#

-> Let's try to ssh into the just-'active' instance:
#=================================#
[tuser1@interceptor ~(keystone_admin)]$ ssh -i oskey.priv root@10.65.207.51
ssh: connect to host 10.65.207.51 port 22: No route to host
[tuser1@interceptor ~(keystone_admin)]$
#=================================#

Since it says "No route to host", let's use 'virt-cat' to find out if there's a new IP address for the guest:
#=================================#
[tuser1@interceptor ~(keystone_user1)]$ sudo virt-cat instance-0000000d /var/log/messages | grep 'dhclient.*bound to' | tail -5
Feb 25 07:21:18 localhost dhclient[636]: bound to 10.65.207.51 -- renewal in 55 seconds.
Feb 25 07:22:14 localhost dhclient[636]: bound to 10.65.207.51 -- renewal in 47 seconds.
Feb 25 07:23:01 localhost dhclient[636]: bound to 10.65.207.51 -- renewal in 43 seconds.
Feb 25 07:23:44 localhost dhclient[636]: bound to 10.65.207.51 -- renewal in 54 seconds.
Feb 25 07:24:38 localhost dhclient[636]: bound to 10.65.207.221 -- renewal in 33524 seconds.
[tuser1@interceptor ~(keystone_user1)]$
#=================================#

So it turns out there's a new IP address (note the last line of the above output), but it isn't reflected in the "Networks" column when listing the running instances.
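As an aside, the "latest lease" extraction done above with virt-cat + grep + tail can be expressed as a small, purely illustrative Python helper run over the dumped /var/log/messages text:
#=================================#
import re


def latest_lease(messages_text):
    """Return the most recent 'bound to <ip>' address from a guest's
    /var/log/messages, e.g. as dumped by virt-cat above."""
    ips = re.findall(
        r"dhclient\[\d+\]: bound to (\d+\.\d+\.\d+\.\d+)", messages_text)
    return ips[-1] if ips else None


sample = """\
Feb 25 07:23:44 localhost dhclient[636]: bound to 10.65.207.51 -- renewal in 54 seconds.
Feb 25 07:24:38 localhost dhclient[636]: bound to 10.65.207.221 -- renewal in 33524 seconds.
"""
print(latest_lease(sample))  # -> 10.65.207.221
#=================================#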
-> Let's see if we can ssh into the new IP address:
#=================================#
[tuser1@interceptor ~(keystone_user1)]$ ssh -i oskey.priv root@10.65.207.221
ssh: connect to host 10.65.207.221 port 22: No route to host
#=================================#

#=================================#
[tuser1@interceptor ~(keystone_user1)]$ ping 10.65.207.52
PING 10.65.207.52 (10.65.207.52) 56(84) bytes of data.
From 10.65.207.49 icmp_seq=1 Destination Host Unreachable
From 10.65.207.49 icmp_seq=2 Destination Host Unreachable
From 10.65.207.49 icmp_seq=3 Destination Host Unreachable
^C
--- 10.65.207.52 ping statistics ---
4 packets transmitted, 0 received, +3 errors, 100% packet loss, time 3844ms
pipe 3
[tuser1@interceptor ~(keystone_user1)]$
#=================================#

So, from the above, the guest is essentially hosed, while 'nova list' dubiously reports it as ACTIVE.
Fix proposed upstream on the stable branch: https://review.openstack.org/#/c/22873/. It will be backported once accepted there.
Short: I still don't see the fix taking effect. The string "Compute service of <hostname> is unavailable at this time" does not show up in the logs (even after enabling verbose/debug and restarting all nova services).

Long:
====
Verification info:

1] Ensure the right version is installed:
#-------------#
$ rpm -q openstack-nova ; uname -r ; arch
openstack-nova-2012.2.3-7.el6ost.noarch
2.6.32-358.el6.x86_64
x86_64
#-------------#

2] Ensure the fix is in:
#-------------#
$ rpm -q openstack-nova --changelog | grep 915274
- Handle unavailable host for live migration #915274
#-------------#

3] Restart all running openstack-nova services:
#-------------#
$ for j in `for i in $(ls -1 /etc/init.d/openstack-nova-*) ; do $i status | grep running ; done | awk '{print $1}'` ; do service $j restart ; done
#-------------#

4] Try live-migration to an invalid host:
#-------------#
[(keystone_user1)]$ nova live-migration f16-t4 maelstrom.lab.eng.pnq.redhat.com
ERROR: Policy doesn't allow compute_extension:admin_actions:migrateLive to be performed. (HTTP 403) (Request-ID: req-b054d839-2f6a-4af0-840c-17b522818f16)
#-------------#

5] Set verbose/debug to "true" in nova.conf
5.1] Again restart all nova services (see step 3] above)

6] Attempt to live migrate to an invalid host (as keystone_admin):
#-------------#
$ nova live-migration 639b3bf0-cb97-466c-9f8e-3cf369077e1f maelstrom.lab.eng.pnq.redhat.com
ERROR: Live migration of instance 639b3bf0-cb97-466c-9f8e-3cf369077e1f to host maelstrom.lab.eng.pnq.redhat.com failed (HTTP 400) (Request-ID: req-4f28ef34-3d82-4336-bbe7-db14d949f5cc)
#-------------#

Observations:
=============

a] The instance is still in the migrating state:
#-------------#
$ nova list --all-tenants | grep f16-t4
| 639b3bf0-cb97-466c-9f8e-3cf369077e1f | f16-t4 | MIGRATING | net1=10.65.207.56 |
#-------------#

b] grep for the string "unavailable", referred from here -- https://review.openstack.org/#/c/22873/1/nova/exception.py:
#-------------#
# tail -1000000 /var/log/nova/compute.log| grep -i "unavailable at this time"
[root@interceptor python-swiftc(keystone_admin)]#
#-------------#

Nothing is observed in the logs, and the instance is still in the "MIGRATING" state. Am I missing anything trivial here?
Somehow, the fix doesn't seem to take effect on the previous machine. VERIFIED on a new machine.

Version info:
#============================#
$ rpm -q openstack-nova-compute-2012.2.3-7.el6ost.noarch --changelog | grep -i 915274
- Handle unavailable host for live migration #915274
#============================#

Now, list and then try to migrate using the UUID:
#============================#
[root@node-01 ~(keystone_admin)]$ nova list
+--------------------------------------+--------+--------+--------------------------+
| ID                                   | Name   | Status | Networks                 |
+--------------------------------------+--------+--------+--------------------------+
| 6fdcafa3-b86b-479e-b1cf-5e6859912e65 | server | ACTIVE | novanetwork=192.168.32.2 |
+--------------------------------------+--------+--------+--------------------------+
#============================#

[root@node-01 ~(keystone_admin)]$ nova live-migration 6fdcafa3-b86b-479e-b1cf-5e6859912e65 maelstrom.lab.eng.pnq.redhat.com
ERROR: Compute service of maelstrom.lab.eng.pnq.redhat.com is unavailable at this time.

Traceback (most recent call last):
  File "/usr/lib/python2.6/site-packages/nova/openstack/common/rpc/amqp.py", line 276, in _process_data
    rval = self.proxy.dispatch(ctxt, version, method, **args)
  File "/usr/lib/python2.6/site-packages/nova/openstack/common/rpc/dispatcher.py", line 145, in dispatch
    return getattr(proxyobj, method)(ctxt, **kwargs)
  File "/usr/lib/python2.6/site-packages/nova/scheduler/manager.py", line 101, in live_migration
    context, ex, request_spec)
  File "/usr/lib64/python2.6/contextlib.py", line 23, in __exit__
    self.gen.next()
  File "/usr/lib/python2.6/site-packages/nova/scheduler/manager.py", line 91, in live_migration
    block_migration, disk_over_commit)
  File "/usr/lib/python2.6/site-packages/nova/scheduler/driver.py", line 232, in schedule_live_migration
    self._live_migration_dest_check(context, instance, dest)
  File "/usr/lib/python2.6/site-packages/nova/scheduler/driver.py", line 281, in _live_migration_dest_check
    raise exception.ComputeServiceUnavailable(host=dest)
ComputeServiceUnavailable: Compute service of maelstrom.lab.eng.pnq.redhat.com is unavailable at this time.
#============================#

-> Migrating by name fails the same way:
#============================#
$ nova live-migration server maelstrom.lab.eng.pnq.redhat.com
ERROR: Compute service of maelstrom.lab.eng.pnq.redhat.com is unavailable at this time.
#============================#

-> Enumerating nova instances shows the server as ACTIVE, as expected:
#============================#
$ nova list
+--------------------------------------+--------+--------+--------------------------+
| ID                                   | Name   | Status | Networks                 |
+--------------------------------------+--------+--------+--------------------------+
| 6fdcafa3-b86b-479e-b1cf-5e6859912e65 | server | ACTIVE | novanetwork=192.168.32.2 |
+--------------------------------------+--------+--------+--------------------------+
#============================#

The original bug is fixed. For the above stack trace, filed a different bug -- https://bugzilla.redhat.com/show_bug.cgi?id=927167
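The traceback above runs through _live_migration_dest_check in nova/scheduler/driver.py, which raises ComputeServiceUnavailable for a dead or missing destination; the scheduler then catches that failure and resets the instance's task_state, which is why the guest stays ACTIVE below. A condensed sketch of the shape of that destination check (simplified and illustrative, not nova's actual code; the real check also validates things like available memory on the destination):
#============================#
class ComputeServiceUnavailable(Exception):
    def __init__(self, host):
        msg = "Compute service of %s is unavailable at this time." % host
        super(ComputeServiceUnavailable, self).__init__(msg)


def live_migration_dest_check(compute_services, dest):
    """compute_services maps hostname -> whether nova-compute on that
    host is up (illustrative data structure, not nova's DB model)."""
    if not compute_services.get(dest, False):
        # Raised before any migration work starts, so there is no
        # dangling 'migrating' task_state left behind.
        raise ComputeServiceUnavailable(dest)


try:
    live_migration_dest_check({"node-01": True},
                              "maelstrom.lab.eng.pnq.redhat.com")
except ComputeServiceUnavailable as exc:
    print(exc)
#============================#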
Summary of the above comment: I tried live migration with both the UUID and the name of the nova instance. It now throws a valid exception that the remote host is not available, and the instance state stays ACTIVE.
====
$ ping 192.168.32.2
PING 192.168.32.2 (192.168.32.2) 56(84) bytes of data.
64 bytes from 192.168.32.2: icmp_seq=1 ttl=64 time=0.167 ms
64 bytes from 192.168.32.2: icmp_seq=2 ttl=64 time=1.15 ms
====
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA.

For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report.

http://rhn.redhat.com/errata/RHSA-2013-0709.html