Created attachment 1033720 [details]
logs

Description of problem:
On a RHOS6 A2 HA deployment (nova over shared NFS storage), when I deleted an instance it was reported as deleted:

2015-06-02 11:57:36.273 3505 INFO nova.virt.libvirt.driver [req-4cc54412-a449-4c7a-bbe1-b21d202bcfe7 None] [instance: 7b6c8ad5-7633-4d53-9f84-93b12a701cd3] Deletion of /var/lib/nova/instances/7b6c8ad5-7633-4d53-9f84-93b12a701cd3_del complete

The instance was also no longer listed by "virsh list --all". Yet nova list and Horizon both still show this instance stuck in the "deleting" task state, and more than two hours have passed since I deleted it.

Version-Release number of selected component (if applicable):
rhel 7.1
python-nova-2014.2.2-19.el7ost.noarch
openstack-nova-compute-2014.2.2-19.el7ost.noarch
python-novaclient-2.20.0-1.el7ost.noarch
openstack-nova-common-2014.2.2-19.el7ost.noarch

How reproducible:
Unsure; it doesn't happen with every instance deletion, but it has happened more than this one time.

Steps to Reproduce:
1. Boot an instance
2. Delete the instance
3. The instance is stuck in the "deleting" task state in nova/Horizon

Actual results:
Stuck with a phantom "deleting" instance, which is already gone from virsh's point of view.

Expected results:
The instance should be fully deleted, including from nova list/Horizon.

Additional info:
Workaround: running "openstack-service restart" for nova on the compute node fixed the problem; the instance is completely gone from nova/Horizon.

instance virsh id: instance-00000d4d.log
instance ID: 7b6c8ad5-7633-4d53-9f84-93b12a701cd3

| OS-EXT-STS:power_state   | 1                          |
| OS-EXT-STS:task_state    | deleting                   |
| OS-EXT-STS:vm_state      | deleted                    |
| OS-SRV-USG:launched_at   | 2015-05-28T11:06:33.000000 |
| OS-SRV-USG:terminated_at | 2015-06-02T08:57:37.000000 |
......
| status                   | DELETED                    |

Attached nova log from compute and controller.
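For anyone hitting the same thing, a quick way to compare the hypervisor view with nova's view is a sketch like the one below (the virsh domain name is taken from the instance-00000d4d.log file name above, and the UUID is the one reported in this bug):

virsh list --all | grep instance-00000d4d
nova show 7b6c8ad5-7633-4d53-9f84-93b12a701cd3 | grep -E "task_state|vm_state|status"

If virsh no longer knows the domain but nova still reports task_state "deleting", you are in the situation described here.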
Stephen, I'd reconsider priority/severity. This just happened again; one more clue to add.

Out of the 66 instances on my two compute nodes:
- 4 instances are stuck in task_state "deleting"
- 6 have a status of DELETED, yet remain listed in the instance table with task_state "-"

All 10 of them reside on the same compute node. Restarting the nova service on that compute node removed all of the phantom instances.
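A quick way to spot these phantom rows (a sketch, assuming admin credentials are sourced; it just greps the normal nova list output for the two problem states):

nova list --all-tenants | grep -E "deleting|DELETED"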
Created attachment 1049751 [details] Nova and messages
Happened yet again. I had to use "nova reset-state --active" and "nova force-delete", followed by a nova service restart on the compute node, to get rid of these 4 instances.

[root@mac848f69fbc607 ~(openstack_blueadmin)]# nova force-delete 799ac283-1c10-4adf-b546-5036af202ffa
[root@mac848f69fbc607 ~(openstack_blueadmin)]# nova list --all-tenants | grep boo
| bee66192-6b8c-4022-ade0-71221799d3ae | boo-compute1   | ACTIVE  | deleting | Running | Default-mgmt=10.35.187.147; Nested-bridge=10.35.184.118; Private-network=192.168.50.53 |
| 799ac283-1c10-4adf-b546-5036af202ffa | boo-compute2   | ACTIVE  | deleting | Running | Default-mgmt=10.35.187.149; Nested-bridge=10.35.184.120; Private-network=192.168.50.55 |
| 5a3399d3-79d6-496d-9260-0bcdfc6a4497 | boo-controller | DELETED | -        | Running | Default-mgmt=10.35.187.148; Nested-bridge=10.35.184.119; Private-network=192.168.50.54 |
| 958c2392-e9dd-4898-8f9e-033359c269a3 | boo-network    | DELETED | -        | Running | Default-mgmt=10.35.187.15; Nested-bridge=10.35.184.121; Private-network=192.168.50.56  |

Attaching nova logs from the controller and both compute nodes.

I'm not 100% sure, but I suspect that when this happens the deleted instances are not subtracted from the tenant's quota usage either.
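For reference, the manual cleanup described above written out as a sketch (the UUIDs are the two stuck "deleting" instances from the listing above; run with admin credentials, then restart nova on the compute node that hosts the phantoms):

for uuid in bee66192-6b8c-4022-ade0-71221799d3ae 799ac283-1c10-4adf-b546-5036af202ffa; do
    nova reset-state --active "$uuid"
    nova force-delete "$uuid"
done
# on the affected compute node
openstack-service restart nova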
Created attachment 1057052 [details] Nova logs
There is a periodic task on the compute node called "update_available_resource", running every "update_resources_interval", that will clean up any instances that are actually deleted but still present in the database. Of course, doing a service restart runs that task immediately, which explains why a restart helped here.
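To see whether that periodic task is actually firing on the affected compute node, something like the sketch below can be used; it assumes the default packaged log location and that debug logging is enabled, since periodic task runs are normally only logged at DEBUG level:

grep "update_available_resource" /var/log/nova/nova-compute.log | tail -5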
Now, a more interesting question is why those instances are not getting updated in the nova database when the delete finishes...
Actually, I was wrong in Comment 8: the cleanup is done in _init_instance on startup. This log message confirms it:

2015-07-28 15:45:39.035 16159 INFO nova.compute.manager [-] [instance: 5a3399d3-79d6-496d-9260-0bcdfc6a4497] Service started deleting the instance during the previous run, but did not finish. Restarting the deletion now.
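A quick way to confirm whether this startup-cleanup path ran on a given compute node is to grep for that message (a sketch; the log path assumes the default packaged layout):

grep "Restarting the deletion now" /var/log/nova/nova-compute.log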
So the mystery of the database not getting updated after a delete is most likely explained by these errors from the log:

2015-07-14 18:33:55.175 10903 INFO oslo.messaging._drivers.impl_rabbit [-] Delaying reconnect for 1.0 seconds...
2015-07-14 18:33:56.176 10903 INFO oslo.messaging._drivers.impl_rabbit [-] Connecting to AMQP server on 192.168.0.3:5672
2015-07-14 18:33:56.204 10903 INFO oslo.messaging._drivers.impl_rabbit [-] Connected to AMQP server on 192.168.0.3:5672
2015-07-14 18:34:02.056 10903 ERROR oslo.messaging._drivers.impl_rabbit [-] Failed to consume message from queue: Socket closed
2015-07-14 18:34:02.056 10903 TRACE oslo.messaging._drivers.impl_rabbit Traceback (most recent call last):
2015-07-14 18:34:02.056 10903 TRACE oslo.messaging._drivers.impl_rabbit   File "/usr/lib/python2.7/site-packages/oslo/messaging/_drivers/impl_rabbit.py", line 655, in ensure
2015-07-14 18:34:02.056 10903 TRACE oslo.messaging._drivers.impl_rabbit     return method()
2015-07-14 18:34:02.056 10903 TRACE oslo.messaging._drivers.impl_rabbit   File "/usr/lib/python2.7/site-packages/oslo/messaging/_drivers/impl_rabbit.py", line 735, in _consume
2015-07-14 18:34:02.056 10903 TRACE oslo.messaging._drivers.impl_rabbit     return self.connection.drain_events(timeout=timeout)
2015-07-14 18:34:02.056 10903 TRACE oslo.messaging._drivers.impl_rabbit   File "/usr/lib/python2.7/site-packages/kombu/connection.py", line 275, in drain_events
2015-07-14 18:34:02.056 10903 TRACE oslo.messaging._drivers.impl_rabbit     return self.transport.drain_events(self.connection, **kwargs)
2015-07-14 18:34:02.056 10903 TRACE oslo.messaging._drivers.impl_rabbit   File "/usr/lib/python2.7/site-packages/kombu/transport/pyamqp.py", line 91, in drain_events
2015-07-14 18:34:02.056 10903 TRACE oslo.messaging._drivers.impl_rabbit     return connection.drain_events(**kwargs)
2015-07-14 18:34:02.056 10903 TRACE oslo.messaging._drivers.impl_rabbit   File "/usr/lib/python2.7/site-packages/amqp/connection.py", line 302, in drain_events
2015-07-14 18:34:02.056 10903 TRACE oslo.messaging._drivers.impl_rabbit     chanmap, None, timeout=timeout,
2015-07-14 18:34:02.056 10903 TRACE oslo.messaging._drivers.impl_rabbit   File "/usr/lib/python2.7/site-packages/amqp/connection.py", line 365, in _wait_multiple
2015-07-14 18:34:02.056 10903 TRACE oslo.messaging._drivers.impl_rabbit     channel, method_sig, args, content = read_timeout(timeout)
2015-07-14 18:34:02.056 10903 TRACE oslo.messaging._drivers.impl_rabbit   File "/usr/lib/python2.7/site-packages/amqp/connection.py", line 336, in read_timeout
2015-07-14 18:34:02.056 10903 TRACE oslo.messaging._drivers.impl_rabbit     return self.method_reader.read_method()
2015-07-14 18:34:02.056 10903 TRACE oslo.messaging._drivers.impl_rabbit   File "/usr/lib/python2.7/site-packages/amqp/method_framing.py", line 189, in read_method
2015-07-14 18:34:02.056 10903 TRACE oslo.messaging._drivers.impl_rabbit     raise m
2015-07-14 18:34:02.056 10903 TRACE oslo.messaging._drivers.impl_rabbit IOError: Socket closed

and

2015-07-28 13:49:18.187 10903 ERROR nova.servicegroup.drivers.db [-] model server went away
2015-07-28 13:49:18.187 10903 TRACE nova.servicegroup.drivers.db Traceback (most recent call last):
2015-07-28 13:49:18.187 10903 TRACE nova.servicegroup.drivers.db   File "/usr/lib/python2.7/site-packages/nova/servicegroup/drivers/db.py", line 99, in _report_state
2015-07-28 13:49:18.187 10903 TRACE nova.servicegroup.drivers.db     service.service_ref, state_catalog)
2015-07-28 13:49:18.187 10903 TRACE nova.servicegroup.drivers.db   File "/usr/lib/python2.7/site-packages/nova/conductor/api.py", line 180, in service_update
2015-07-28 13:49:18.187 10903 TRACE nova.servicegroup.drivers.db     return self._manager.service_update(context, service, values)
2015-07-28 13:49:18.187 10903 TRACE nova.servicegroup.drivers.db   File "/usr/lib/python2.7/site-packages/nova/conductor/rpcapi.py", line 286, in service_update
2015-07-28 13:49:18.187 10903 TRACE nova.servicegroup.drivers.db     service=service_p, values=values)
2015-07-28 13:49:18.187 10903 TRACE nova.servicegroup.drivers.db   File "/usr/lib/python2.7/site-packages/oslo/messaging/rpc/client.py", line 152, in call
2015-07-28 13:49:18.187 10903 TRACE nova.servicegroup.drivers.db     retry=self.retry)
2015-07-28 13:49:18.187 10903 TRACE nova.servicegroup.drivers.db   File "/usr/lib/python2.7/site-packages/oslo/messaging/transport.py", line 90, in _send
2015-07-28 13:49:18.187 10903 TRACE nova.servicegroup.drivers.db     timeout=timeout, retry=retry)
2015-07-28 13:49:18.187 10903 TRACE nova.servicegroup.drivers.db   File "/usr/lib/python2.7/site-packages/oslo/messaging/_drivers/amqpdriver.py", line 408, in send
2015-07-28 13:49:18.187 10903 TRACE nova.servicegroup.drivers.db     retry=retry)
2015-07-28 13:49:18.187 10903 TRACE nova.servicegroup.drivers.db   File "/usr/lib/python2.7/site-packages/oslo/messaging/_drivers/amqpdriver.py", line 397, in _send
2015-07-28 13:49:18.187 10903 TRACE nova.servicegroup.drivers.db     result = self._waiter.wait(msg_id, timeout)
2015-07-28 13:49:18.187 10903 TRACE nova.servicegroup.drivers.db   File "/usr/lib/python2.7/site-packages/oslo/messaging/_drivers/amqpdriver.py", line 298, in wait
2015-07-28 13:49:18.187 10903 TRACE nova.servicegroup.drivers.db     reply, ending, trylock = self._poll_queue(msg_id, timeout)
2015-07-28 13:49:18.187 10903 TRACE nova.servicegroup.drivers.db   File "/usr/lib/python2.7/site-packages/oslo/messaging/_drivers/amqpdriver.py", line 238, in _poll_queue
2015-07-28 13:49:18.187 10903 TRACE nova.servicegroup.drivers.db     message = self.waiters.get(msg_id, timeout)
2015-07-28 13:49:18.187 10903 TRACE nova.servicegroup.drivers.db   File "/usr/lib/python2.7/site-packages/oslo/messaging/_drivers/amqpdriver.py", line 144, in get
2015-07-28 13:49:18.187 10903 TRACE nova.servicegroup.drivers.db     'to message ID %s' % msg_id)
2015-07-28 13:49:18.187 10903 TRACE nova.servicegroup.drivers.db MessagingTimeout: Timed out waiting for a reply to message ID a53e9c1f638f41bb827ebc90b79c8dfd
2015-07-28 13:49:18.187 10903 TRACE nova.servicegroup.drivers.db
2015-07-28 13:49:18.190 10903 WARNING nova.openstack.common.loopingcall [-] task <bound method DbDriver._report_state of <nova.servicegroup.drivers.db.DbDriver object at 0x3676950>> run outlasted interval by 345.34 sec
2015-07-28 13:49:18.198 10903 ERROR nova.openstack.common.periodic_task [-] Error during ComputeManager._heal_instance_info_cache: Timed out waiting for a reply to message ID 03a0a028ab45473cb7b18ffb71a3e81f
Just FYI, "Failed to consume message from queue: Socket closed" indicates a known issue in python-oslo-messaging which was fixed in the following builds:
* python-oslo-messaging-2.5.0-7.el7ost (OSP8)
* python-oslo-messaging-4.5.0-2.el7ost (OSP9)
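To check which build is installed on an affected node and compare it against the fixed versions above:

rpm -q python-oslo-messaging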
The issue is partially fixed by BZ#1217956, which handles the quota update for an instance that is stuck in the "deleting" task state and gets cleaned up when the nova-compute service is restarted. The issue itself still persists in current master; I have reproduced it.
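One way to check whether the quota usage is actually released for such instances (a sketch, run as the affected tenant; if the deleted instances are still being counted, totalInstancesUsed will stay high even after the rows disappear from "nova list"):

nova absolute-limits | grep -i instances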
The issue is fixed by [1] in oslo.messaging; there is no fix provided in nova. It seems the issue is no longer present in master: when I try to reproduce it with the above mentioned steps, the instance is deleted successfully every time. I also tried creating multiple instances (<=100) and deleting them all at once, but I still could not get any instance stuck in the "deleting" task state.

If you encounter this issue, please reopen the bug and provide proper reproduction steps.

[1] https://review.openstack.org/#/c/294074/
As the bug was reported against RHOS6, I have also tried to reproduce this on a RHOS6 environment. I created multiple instances (<100) and deleted them all at once, and no instance got stuck in the 'deleting' task state, so it seems that the issue is fixed by [1] in oslo.messaging.

[1] https://review.openstack.org/#/c/294074/

If you encounter this issue, please reopen the bug and provide proper reproduction steps.
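For anyone retrying the reproduction later, this is roughly what was attempted, as a sketch only (the image, flavor and network values are placeholders, not values from this bug):

nova boot --image <image> --flavor <flavor> --nic net-id=<net-id> --min-count 100 repro-test
nova list | grep repro-test | awk '{print $2}' | xargs -r -n1 nova delete
# after a short wait, any stuck instances would show up here
nova list | grep -c deleting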