Bug 1227308 - Can't delete instance stuck in deleting task
Summary: Can't delete instance stuck in deleting task
Keywords:
Status: CLOSED WORKSFORME
Alias: None
Product: Red Hat OpenStack
Classification: Red Hat
Component: openstack-nova
Version: 6.0 (Juno)
Hardware: x86_64
OS: Linux
Priority: low
Severity: low
Target Milestone: ---
Target Release: 7.0 (Kilo)
Assignee: Rajesh Tailor
QA Contact: nlevinki
URL:
Whiteboard:
Depends On:
Blocks: 1394092 1394094
 
Reported: 2015-06-02 11:33 UTC by Tzach Shefi
Modified: 2019-09-09 13:41 UTC
CC List: 11 users

Fixed In Version:
Doc Type: Bug Fix
Doc Text:
Clone Of:
Clones: 1394092 1394094
Environment:
Last Closed: 2018-02-05 06:30:24 UTC
Target Upstream Version:
Embargoed:


Attachments
logs (776.68 KB, application/x-gzip) - 2015-06-02 11:33 UTC, Tzach Shefi
Nova and messages (249.47 KB, application/x-gzip) - 2015-07-08 09:07 UTC, Tzach Shefi
Nova logs (1.83 MB, text/plain) - 2015-07-28 15:02 UTC, Tzach Shefi


Links
System            ID       Status  Summary                         Last Updated
Launchpad         1461055  None    None                            Never
Launchpad         1543511  None    None                            2016-02-09 10:28:32 UTC
Launchpad         1641523  None    None                            2018-01-09 09:27:44 UTC
OpenStack gerrit  294074   MERGED  Always set all socket timeouts  2021-01-18 15:09:33 UTC

Description Tzach Shefi 2015-06-02 11:33:51 UTC
Created attachment 1033720 [details]
logs

Description of problem: On a RHOS 6 A2 HA deployment with nova over shared NFS storage, I deleted an instance and, according to the compute log, it was "deleted":

2015-06-02 11:57:36.273 3505 INFO nova.virt.libvirt.driver [req-4cc54412-a449-4c7a-bbe1-b21d202bcfe7 None] [instance: 7b6c8ad5-7633-4d53-9f84-93b12a701cd3] Deletion of /var/lib/nova/instances/7b6c8ad5-7633-4d53-9f84-93b12a701cd3_del complete

The instance was also not found with virsh list --all.
Yet nova list and Horizon both still show the instance stuck in the deleting task, and more than two hours have passed since I deleted it.

Version-Release number of selected component (if applicable):
rhel 7.1
python-nova-2014.2.2-19.el7ost.noarch
openstack-nova-compute-2014.2.2-19.el7ost.noarch
python-novaclient-2.20.0-1.el7ost.noarch
openstack-nova-common-2014.2.2-19.el7ost.noarch

How reproducible:
Unsure; it doesn't happen with every instance deletion, but it has happened more than this one time.

Steps to Reproduce:
1. Boot an instance
2. Delete instance 
3. The instance is stuck in the deleting task in nova/Horizon.

Actual results:
Stuck with a phantom "deleting" instance which, from virsh's point of view, is already gone.

Expected results:
The instance should be fully deleted, including from nova list/Horizon.

Additional info:

Workaround: running an openstack-service restart for nova on the compute node fixed my problem. The instance is now completely gone from nova/Horizon.

Instance virsh ID: instance-00000d4d.log
Instance ID: 7b6c8ad5-7633-4d53-9f84-93b12a701cd3

| OS-EXT-STS:power_state               | 1                                                                                                  |
| OS-EXT-STS:task_state                | deleting                                                                                           |
| OS-EXT-STS:vm_state                  | deleted                                                                                            |
| OS-SRV-USG:launched_at               | 2015-05-28T11:06:33.000000                                                                         |
| OS-SRV-USG:terminated_at             | 2015-06-02T08:57:37.000000                                                                         |
......                                 |
| status                               | DELETED

Attached are the nova logs from the compute and controller nodes.

Comment 3 Tzach Shefi 2015-06-30 07:09:49 UTC
Stephen, I'd reconsider the priority/severity.

This just happened again; one more clue to add.
My two compute nodes host a total of 66 instances.
4 instances are stuck in task_state deleting.
6 have a status of deleted, yet remain listed in the instance table with task_state "-".
All 10 of them reside on the same compute node.

Restarting the nova service on the compute node removed all of the phantom instances.

Comment 4 Tzach Shefi 2015-07-08 09:07:10 UTC
Created attachment 1049751 [details]
Nova and messages

Comment 6 Tzach Shefi 2015-07-28 14:25:35 UTC
This happened yet again; I had to use nova reset-state --active and nova force-delete, followed by a nova service restart on the compute node, to get rid of these 4 instances (see the sketch at the end of this comment).

[root@mac848f69fbc607 ~(openstack_blueadmin)]# nova force-delete 799ac283-1c10-4adf-b546-5036af202ffa
[root@mac848f69fbc607 ~(openstack_blueadmin)]# nova list --all-tenants | grep boo
| bee66192-6b8c-4022-ade0-71221799d3ae | boo-compute1                                                  | ACTIVE  | deleting   | Running     | Default-mgmt=10.35.187.147; Nested-bridge=10.35.184.118; Private-network=192.168.50.53 |
| 799ac283-1c10-4adf-b546-5036af202ffa | boo-compute2                                                  | ACTIVE  | deleting   | Running     | Default-mgmt=10.35.187.149; Nested-bridge=10.35.184.120; Private-network=192.168.50.55 |
| 5a3399d3-79d6-496d-9260-0bcdfc6a4497 | boo-controller                                                | DELETED | -          | Running     | Default-mgmt=10.35.187.148; Nested-bridge=10.35.184.119; Private-network=192.168.50.54 |
| 958c2392-e9dd-4898-8f9e-033359c269a3 | boo-network                                                   | DELETED | -          | Running     | Default-mgmt=10.35.187.15; Nested-bridge=10.35.184.121; Private-network=192.168.50.56  |

Attaching nova logs from controller and both compute nodes. 

I'm not 100% sure, but I suspect that when this happens these deleted instances are not subtracted from the tenant quota either.
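
(For reference, a minimal sketch of the same manual cleanup done through python-novaclient instead of the CLI. The credentials, auth URL, and client-constructor form are placeholders and vary by novaclient version; the reset_state/force_delete calls mirror the nova reset-state --active and nova force-delete commands above.)

# Hedged sketch: programmatic equivalent of the manual cleanup in this comment.
# All credential/endpoint values are hypothetical placeholders.
from novaclient import client

nova = client.Client('2',                          # compute API version
                     'admin',                      # placeholder username
                     'secret',                     # placeholder password
                     'admin',                      # placeholder tenant/project
                     'http://keystone.example:5000/v2.0')  # placeholder auth URL

stuck_ids = [
    'bee66192-6b8c-4022-ade0-71221799d3ae',
    '799ac283-1c10-4adf-b546-5036af202ffa',
]

for server_id in stuck_ids:
    nova.servers.reset_state(server_id, 'active')  # nova reset-state --active <id>
    nova.servers.force_delete(server_id)           # nova force-delete <id>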

Comment 7 Tzach Shefi 2015-07-28 15:02:09 UTC
Created attachment 1057052 [details]
Nova logs

Comment 8 Radomir Dopieralski 2016-02-03 12:39:16 UTC
There is a periodic task on the compute node called "update_available_resource", which runs every "update_resources_interval" seconds and will clean up any instances that are actually deleted but are still in the database. Of course, a service restart runs that task immediately, which explains why a restart helped here.
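
(For illustration only, and note that comment 10 below corrects where the cleanup actually happens: a minimal sketch of how a compute-manager periodic task is declared with a configurable spacing, using oslo.service's periodic_task helpers. The class name, the hard-coded 60-second spacing, and the empty method body are placeholders, not nova's actual code.)

# Hedged sketch: simplified declaration of a periodic task, not nova's real manager.
from oslo_config import cfg
from oslo_service import periodic_task

CONF = cfg.CONF

class ComputeManagerSketch(periodic_task.PeriodicTasks):
    def __init__(self):
        super(ComputeManagerSketch, self).__init__(CONF)

    # In nova the spacing would come from a config option such as
    # update_resources_interval; 60 seconds here is just a placeholder.
    @periodic_task.periodic_task(spacing=60)
    def update_available_resource(self, context):
        # Reconcile the hypervisor's view of instances with the database here.
        pass

# The service's main loop would then invoke these tasks regularly, e.g.
# ComputeManagerSketch().run_periodic_tasks(context=None)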

Comment 9 Radomir Dopieralski 2016-02-03 12:43:22 UTC
Now, a more interesting question is why those instances are not getting updated in the nova database when the deletion finishes...

Comment 10 Radomir Dopieralski 2016-02-03 14:31:59 UTC
Actually, I was wrong in Comment 8; the cleanup is done in _init_instance on startup. This log message confirms it:

2015-07-28 15:45:39.035 16159 INFO nova.compute.manager [-] [instance: 5a3399d3-79d6-496d-9260-0bcdfc6a4497] Service started deleting the instance during the previous run, but did not finish. Restarting the deletion now.
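
(A minimal paraphrase of the startup path described above, assuming a simplified instance dict and injected helpers; this is not the actual nova.compute.manager._init_instance code, just the control flow it implies.)

# Hedged sketch of the startup cleanup: on nova-compute start, an instance
# still in the 'deleting' task state has its deletion re-driven.
DELETING = 'deleting'

def init_instance_sketch(instance, delete_instance, log):
    """Re-drive an interrupted delete when nova-compute restarts."""
    if instance.get('task_state') == DELETING:
        log.info('Service started deleting the instance during the previous '
                 'run, but did not finish. Restarting the deletion now.')
        delete_instance(instance)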

Comment 11 Radomir Dopieralski 2016-02-08 08:25:26 UTC
So the mystery of the database not getting updated after a delete is most likely explained by these errors from the log:

2015-07-14 18:33:55.175 10903 INFO oslo.messaging._drivers.impl_rabbit [-] Delaying reconnect for 1.0 seconds...
2015-07-14 18:33:56.176 10903 INFO oslo.messaging._drivers.impl_rabbit [-] Connecting to AMQP server on 192.168.0.3:5672
2015-07-14 18:33:56.204 10903 INFO oslo.messaging._drivers.impl_rabbit [-] Connected to AMQP server on 192.168.0.3:5672
2015-07-14 18:34:02.056 10903 ERROR oslo.messaging._drivers.impl_rabbit [-] Failed to consume message from queue: Socket closed
2015-07-14 18:34:02.056 10903 TRACE oslo.messaging._drivers.impl_rabbit Traceback (most recent call last):
2015-07-14 18:34:02.056 10903 TRACE oslo.messaging._drivers.impl_rabbit   File "/usr/lib/python2.7/site-packages/oslo/messaging/_drivers/impl_rabbit.py", line 655, in ensure
2015-07-14 18:34:02.056 10903 TRACE oslo.messaging._drivers.impl_rabbit     return method()
2015-07-14 18:34:02.056 10903 TRACE oslo.messaging._drivers.impl_rabbit   File "/usr/lib/python2.7/site-packages/oslo/messaging/_drivers/impl_rabbit.py", line 735, in _consume
2015-07-14 18:34:02.056 10903 TRACE oslo.messaging._drivers.impl_rabbit     return self.connection.drain_events(timeout=timeout)
2015-07-14 18:34:02.056 10903 TRACE oslo.messaging._drivers.impl_rabbit   File "/usr/lib/python2.7/site-packages/kombu/connection.py", line 275, in drain_events
2015-07-14 18:34:02.056 10903 TRACE oslo.messaging._drivers.impl_rabbit     return self.transport.drain_events(self.connection, **kwargs)
2015-07-14 18:34:02.056 10903 TRACE oslo.messaging._drivers.impl_rabbit   File "/usr/lib/python2.7/site-packages/kombu/transport/pyamqp.py", line 91, in drain_events
2015-07-14 18:34:02.056 10903 TRACE oslo.messaging._drivers.impl_rabbit     return connection.drain_events(**kwargs)
2015-07-14 18:34:02.056 10903 TRACE oslo.messaging._drivers.impl_rabbit   File "/usr/lib/python2.7/site-packages/amqp/connection.py", line 302, in drain_events
2015-07-14 18:34:02.056 10903 TRACE oslo.messaging._drivers.impl_rabbit     chanmap, None, timeout=timeout,
2015-07-14 18:34:02.056 10903 TRACE oslo.messaging._drivers.impl_rabbit   File "/usr/lib/python2.7/site-packages/amqp/connection.py", line 365, in _wait_multiple
2015-07-14 18:34:02.056 10903 TRACE oslo.messaging._drivers.impl_rabbit     channel, method_sig, args, content = read_timeout(timeout)
2015-07-14 18:34:02.056 10903 TRACE oslo.messaging._drivers.impl_rabbit   File "/usr/lib/python2.7/site-packages/amqp/connection.py", line 336, in read_timeout
2015-07-14 18:34:02.056 10903 TRACE oslo.messaging._drivers.impl_rabbit     return self.method_reader.read_method()
2015-07-14 18:34:02.056 10903 TRACE oslo.messaging._drivers.impl_rabbit   File "/usr/lib/python2.7/site-packages/amqp/method_framing.py", line 189, in read_method
2015-07-14 18:34:02.056 10903 TRACE oslo.messaging._drivers.impl_rabbit     raise m
2015-07-14 18:34:02.056 10903 TRACE oslo.messaging._drivers.impl_rabbit IOError: Socket closed

and

2015-07-28 13:49:18.187 10903 ERROR nova.servicegroup.drivers.db [-] model server went away
2015-07-28 13:49:18.187 10903 TRACE nova.servicegroup.drivers.db Traceback (most recent call last):
2015-07-28 13:49:18.187 10903 TRACE nova.servicegroup.drivers.db   File "/usr/lib/python2.7/site-packages/nova/servicegroup/drivers/db.py", line 99, in _report_state
2015-07-28 13:49:18.187 10903 TRACE nova.servicegroup.drivers.db     service.service_ref, state_catalog)
2015-07-28 13:49:18.187 10903 TRACE nova.servicegroup.drivers.db   File "/usr/lib/python2.7/site-packages/nova/conductor/api.py", line 180, in service_update
2015-07-28 13:49:18.187 10903 TRACE nova.servicegroup.drivers.db     return self._manager.service_update(context, service, values)
2015-07-28 13:49:18.187 10903 TRACE nova.servicegroup.drivers.db   File "/usr/lib/python2.7/site-packages/nova/conductor/rpcapi.py", line 286, in service_update
2015-07-28 13:49:18.187 10903 TRACE nova.servicegroup.drivers.db     service=service_p, values=values)
2015-07-28 13:49:18.187 10903 TRACE nova.servicegroup.drivers.db   File "/usr/lib/python2.7/site-packages/oslo/messaging/rpc/client.py", line 152, in call
2015-07-28 13:49:18.187 10903 TRACE nova.servicegroup.drivers.db     retry=self.retry)
2015-07-28 13:49:18.187 10903 TRACE nova.servicegroup.drivers.db   File "/usr/lib/python2.7/site-packages/oslo/messaging/transport.py", line 90, in _send
2015-07-28 13:49:18.187 10903 TRACE nova.servicegroup.drivers.db     timeout=timeout, retry=retry)
2015-07-28 13:49:18.187 10903 TRACE nova.servicegroup.drivers.db   File "/usr/lib/python2.7/site-packages/oslo/messaging/_drivers/amqpdriver.py", line 408, in send
2015-07-28 13:49:18.187 10903 TRACE nova.servicegroup.drivers.db     retry=retry)
2015-07-28 13:49:18.187 10903 TRACE nova.servicegroup.drivers.db   File "/usr/lib/python2.7/site-packages/oslo/messaging/_drivers/amqpdriver.py", line 397, in _send
2015-07-28 13:49:18.187 10903 TRACE nova.servicegroup.drivers.db     result = self._waiter.wait(msg_id, timeout)
2015-07-28 13:49:18.187 10903 TRACE nova.servicegroup.drivers.db   File "/usr/lib/python2.7/site-packages/oslo/messaging/_drivers/amqpdriver.py", line 298, in wait
2015-07-28 13:49:18.187 10903 TRACE nova.servicegroup.drivers.db     reply, ending, trylock = self._poll_queue(msg_id, timeout)
2015-07-28 13:49:18.187 10903 TRACE nova.servicegroup.drivers.db   File "/usr/lib/python2.7/site-packages/oslo/messaging/_drivers/amqpdriver.py", line 238, in _poll_queue
2015-07-28 13:49:18.187 10903 TRACE nova.servicegroup.drivers.db     message = self.waiters.get(msg_id, timeout)
2015-07-28 13:49:18.187 10903 TRACE nova.servicegroup.drivers.db   File "/usr/lib/python2.7/site-packages/oslo/messaging/_drivers/amqpdriver.py", line 144, in get
2015-07-28 13:49:18.187 10903 TRACE nova.servicegroup.drivers.db     'to message ID %s' % msg_id)
2015-07-28 13:49:18.187 10903 TRACE nova.servicegroup.drivers.db MessagingTimeout: Timed out waiting for a reply to message ID a53e9c1f638f41bb827ebc90b79c8dfd
2015-07-28 13:49:18.187 10903 TRACE nova.servicegroup.drivers.db 
2015-07-28 13:49:18.190 10903 WARNING nova.openstack.common.loopingcall [-] task <bound method DbDriver._report_state of <nova.servicegroup.drivers.db.DbDriver object at 0x3676950>> run outlasted interval by 345.34 sec
2015-07-28 13:49:18.198 10903 ERROR nova.openstack.common.periodic_task [-] Error during ComputeManager._heal_instance_info_cache: Timed out waiting for a reply to message ID 03a0a028ab45473cb7b18ffb71a3e81f

Comment 12 Peter Lemenkov 2016-11-04 15:32:53 UTC
Just FYI, "Failed to consume message from queue: Socket closed" indicates a known issue in python-oslo-messaging which was fixed in the following builds:

* python-oslo-messaging-2.5.0-7.el7ost (OSP8)
* python-oslo-messaging-4.5.0-2.el7ost (OSP9)
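
(For context, the "Always set all socket timeouts" change linked in the tracker table above targets consumers hanging on half-dead AMQP connections. A minimal illustration of the underlying idea, not the actual oslo.messaging patch: give the broker socket a read timeout so a dead peer surfaces as an error instead of blocking the consumer until a service restart.)

# Hedged illustration of the socket-timeout idea; not the oslo.messaging patch itself.
import socket

def open_broker_socket(host, port, timeout=30.0):
    """Connect to a broker and make sure reads cannot block indefinitely."""
    sock = socket.create_connection((host, port), timeout=timeout)
    # Without a timeout, recv() on a silently dropped connection can hang
    # forever, which matches the stuck consumers seen in the logs above.
    sock.settimeout(timeout)
    return sock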

Comment 13 Rajesh Tailor 2018-01-09 08:47:31 UTC
The issue is partially fixed by BZ#1217956, which handles the quota-update scenario for an instance stuck in the deleting task state that is cleaned up on restarting the nova-compute service.

The issue still persists on current master; I have reproduced it.

Comment 14 Rajesh Tailor 2018-02-05 06:30:24 UTC
The issue is fixed by [1] in oslo.messaging. There is no fix provided in nova.

It seems that the issue is no longer present on master: when I try to reproduce it with the above-mentioned steps, the instance is deleted successfully every time.

I also created multiple instances (<=100) and deleted them all at once, and even then I could not get any instance stuck in the deleting task state.

If you encounter this issue, please reopen the bug and provide proper reproduction steps.

[1] https://review.openstack.org/#/c/294074/

Comment 15 Rajesh Tailor 2018-02-28 07:09:52 UTC
As the bug was reported against RHOS 6, I have also tried to reproduce it in a RHOS 6 environment.

I created multiple instances (<100) and deleted them all at once, and no instance got stuck in the 'deleting' task state, so it seems that the issue is fixed by [1] in oslo.messaging.

[1] https://review.openstack.org/#/c/294074/

If you encounter this issue, please reopen the bug and provide proper reproduction steps.

