Bug 1084833
| Summary: | Guests show "Starting instance..." but claim does not happen; VM never starts and is stuck in BUILDING phase | | |
|---|---|---|---|
| Product: | Red Hat OpenStack | Reporter: | Ian Wienand <iwienand> |
| Component: | openstack-nova | Assignee: | Nikola Dipanov <ndipanov> |
| Status: | CLOSED INSUFFICIENT_DATA | QA Contact: | Ami Jeain <ajeain> |
| Severity: | medium | Docs Contact: | |
| Priority: | high | | |
| Version: | 4.0 | CC: | aortega, eglynn, gdubreui, iwienand, ndipanov, sgordon, wfoster, yeylon |
| Target Milestone: | async | Keywords: | ZStream |
| Target Release: | 4.0 | | |
| Hardware: | Unspecified | | |
| OS: | Unspecified | | |
| Whiteboard: | | | |
| Fixed In Version: | | Doc Type: | Bug Fix |
| Doc Text: | | Story Points: | --- |
| Clone Of: | | Environment: | |
| Last Closed: | 2015-04-24 21:22:49 UTC | Type: | Bug |
| Regression: | --- | Mount Type: | --- |
| Documentation: | --- | CRM: | |
| Verified Versions: | | Category: | --- |
| oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: | |
| Cloudforms Team: | --- | Target Upstream Version: | |
| Embargoed: | | | |
Description
Ian Wienand
2014-04-06 23:44:05 UTC
To confirm: there is nothing of interest in any of the other logs (in particular nothing in the libvirtd log, where I had hoped to see a useful failure).
I ran strace on the parent process; it is in an epoll() loop sending RPC heartbeat messages and not much else.
All the child nova-compute processes are sitting in sleep(2) loops:
---
[root@host07 nova]# strace -p 23336
Process 23336 attached - interrupt to quit
restart_syscall(<... resuming interrupted call ...>) = 0
kill(23334, SIG_0) = 0
kill(8288, SIG_0) = 0
rt_sigprocmask(SIG_BLOCK, [CHLD], [], 8) = 0
rt_sigaction(SIGCHLD, NULL, {SIG_DFL, [], SA_RESTORER, 0x330da0f710}, 8) = 0
rt_sigprocmask(SIG_SETMASK, [], NULL, 8) = 0
nanosleep({2, 0}, 0x7f99c7ffd9e0) = 0
kill(23334, SIG_0) = 0
kill(8288, SIG_0) = 0
rt_sigprocmask(SIG_BLOCK, [CHLD], [], 8) = 0
rt_sigaction(SIGCHLD, NULL, {SIG_DFL, [], SA_RESTORER, 0x330da0f710}, 8) = 0
rt_sigprocmask(SIG_SETMASK, [], NULL, 8) = 0
nanosleep({2, 0}, 0x7f99c7ffd9e0) = 0
---
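The kill(pid, SIG_0)/nanosleep pattern in the strace above is a standard child-liveness poll: signal 0 delivers nothing, but the syscall fails with ESRCH if the pid no longer exists. A minimal sketch of that loop (the function names here are illustrative, not nova's actual code):

```python
import errno
import os
import time

def child_alive(pid):
    """Check whether pid exists without delivering a signal:
    kill(pid, 0) performs only the existence/permission checks."""
    try:
        os.kill(pid, 0)
        return True
    except OSError as e:
        # EPERM means the pid exists but belongs to another user.
        return e.errno == errno.EPERM

def supervise(pids, interval=2):
    """Parent loop matching the strace: probe each child with
    kill(pid, 0), then nanosleep({2, 0}); return the first dead pid."""
    while True:
        for pid in pids:
            if not child_alive(pid):
                return pid  # caller would restart the dead worker
        time.sleep(interval)
```

This explains why the parent looks idle under strace even when instances are stuck: the supervision loop only notices a child that has *exited*, not one whose greenthreads are deadlocked.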
So there is nothing obviously (to me) blocking.
There are actually two places in the same file that can log "Starting instance..."; I suggested a change for that [1] and a larger one to add the function name to log and audit messages [2].
[1] https://review.openstack.org/#/c/85639/
[2] https://review.openstack.org/#/c/85638/
This has started happening again. The issue seems to be that RPC calls have failed. A tail of the logfile is below:
---
2014-04-10 00:50:47.474 8288 AUDIT nova.compute.manager [req-4e27bfeb-a7bc-4c5e-92f7-9b5be7a7dd78 6c84f979672f44039145387a0e0fb864 bd6159c7f8aa4baba9a33f0b2725e6d1] [instance: c18d4568-ba99-4284-9db3-4a3b6485ea20] Starting instance...
2014-04-10 01:09:35.923 8288 ERROR nova.openstack.common.rpc.impl_qpid [-] Failed to consume message from queue: connection aborted
2014-04-10 01:09:35.923 8288 TRACE nova.openstack.common.rpc.impl_qpid Traceback (most recent call last):
2014-04-10 01:09:35.923 8288 TRACE nova.openstack.common.rpc.impl_qpid   File "/usr/lib/python2.6/site-packages/nova/openstack/common/rpc/impl_qpid.py", line 540, in ensure
2014-04-10 01:09:35.923 8288 TRACE nova.openstack.common.rpc.impl_qpid     return method(*args, **kwargs)
2014-04-10 01:09:35.923 8288 TRACE nova.openstack.common.rpc.impl_qpid   File "/usr/lib/python2.6/site-packages/nova/openstack/common/rpc/impl_qpid.py", line 597, in _consume
2014-04-10 01:09:35.923 8288 TRACE nova.openstack.common.rpc.impl_qpid     nxt_receiver = self.session.next_receiver(timeout=timeout)
2014-04-10 01:09:35.923 8288 TRACE nova.openstack.common.rpc.impl_qpid   File "<string>", line 6, in next_receiver
2014-04-10 01:09:35.923 8288 TRACE nova.openstack.common.rpc.impl_qpid   File "/usr/lib/python2.6/site-packages/qpid/messaging/endpoints.py", line 665, in next_receiver
2014-04-10 01:09:35.923 8288 TRACE nova.openstack.common.rpc.impl_qpid     if self._ecwait(lambda: self.incoming, timeout):
2014-04-10 01:09:35.923 8288 TRACE nova.openstack.common.rpc.impl_qpid   File "/usr/lib/python2.6/site-packages/qpid/messaging/endpoints.py", line 50, in _ecwait
2014-04-10 01:09:35.923 8288 TRACE nova.openstack.common.rpc.impl_qpid     result = self._ewait(lambda: self.closed or predicate(), timeout)
2014-04-10 01:09:35.923 8288 TRACE nova.openstack.common.rpc.impl_qpid   File "/usr/lib/python2.6/site-packages/qpid/messaging/endpoints.py", line 571, in _ewait
2014-04-10 01:09:35.923 8288 TRACE nova.openstack.common.rpc.impl_qpid     result = self.connection._ewait(lambda: self.error or predicate(), timeout)
2014-04-10 01:09:35.923 8288 TRACE nova.openstack.common.rpc.impl_qpid   File "/usr/lib/python2.6/site-packages/qpid/messaging/endpoints.py", line 214, in _ewait
2014-04-10 01:09:35.923 8288 TRACE nova.openstack.common.rpc.impl_qpid     self.check_error()
2014-04-10 01:09:35.923 8288 TRACE nova.openstack.common.rpc.impl_qpid   File "/usr/lib/python2.6/site-packages/qpid/messaging/endpoints.py", line 207, in check_error
2014-04-10 01:09:35.923 8288 TRACE nova.openstack.common.rpc.impl_qpid     raise self.error
2014-04-10 01:09:35.923 8288 TRACE nova.openstack.common.rpc.impl_qpid ConnectionError: connection aborted
2014-04-10 01:09:35.923 8288 TRACE nova.openstack.common.rpc.impl_qpid
2014-04-10 01:09:35.940 8288 ERROR nova.openstack.common.rpc.impl_qpid [-] Failed to consume message from queue: connection aborted
2014-04-10 01:09:35.940 8288 TRACE nova.openstack.common.rpc.impl_qpid Traceback (most recent call last):
2014-04-10 01:09:35.940 8288 TRACE nova.openstack.common.rpc.impl_qpid   File "/usr/lib/python2.6/site-packages/nova/openstack/common/rpc/impl_qpid.py", line 540, in ensure
2014-04-10 01:09:35.940 8288 TRACE nova.openstack.common.rpc.impl_qpid     return method(*args, **kwargs)
2014-04-10 01:09:35.940 8288 TRACE nova.openstack.common.rpc.impl_qpid   File "/usr/lib/python2.6/site-packages/nova/openstack/common/rpc/impl_qpid.py", line 597, in _consume
2014-04-10 01:09:35.940 8288 TRACE nova.openstack.common.rpc.impl_qpid     nxt_receiver = self.session.next_receiver(timeout=timeout)
2014-04-10 01:09:35.940 8288 TRACE nova.openstack.common.rpc.impl_qpid   File "<string>", line 6, in next_receiver
2014-04-10 01:09:35.940 8288 TRACE nova.openstack.common.rpc.impl_qpid   File "/usr/lib/python2.6/site-packages/qpid/messaging/endpoints.py", line 665, in next_receiver
2014-04-10 01:09:35.940 8288 TRACE nova.openstack.common.rpc.impl_qpid     if self._ecwait(lambda: self.incoming, timeout):
2014-04-10 01:09:35.940 8288 TRACE nova.openstack.common.rpc.impl_qpid   File "/usr/lib/python2.6/site-packages/qpid/messaging/endpoints.py", line 50, in _ecwait
2014-04-10 01:09:35.940 8288 TRACE nova.openstack.common.rpc.impl_qpid     result = self._ewait(lambda: self.closed or predicate(), timeout)
2014-04-10 01:09:35.940 8288 TRACE nova.openstack.common.rpc.impl_qpid   File "/usr/lib/python2.6/site-packages/qpid/messaging/endpoints.py", line 571, in _ewait
2014-04-10 01:09:35.940 8288 TRACE nova.openstack.common.rpc.impl_qpid     result = self.connection._ewait(lambda: self.error or predicate(), timeout)
2014-04-10 01:09:35.940 8288 TRACE nova.openstack.common.rpc.impl_qpid   File "/usr/lib/python2.6/site-packages/qpid/messaging/endpoints.py", line 214, in _ewait
2014-04-10 01:09:35.940 8288 TRACE nova.openstack.common.rpc.impl_qpid     self.check_error()
2014-04-10 01:09:35.940 8288 TRACE nova.openstack.common.rpc.impl_qpid   File "/usr/lib/python2.6/site-packages/qpid/messaging/endpoints.py", line 207, in check_error
2014-04-10 01:09:35.940 8288 TRACE nova.openstack.common.rpc.impl_qpid     raise self.error
2014-04-10 01:09:35.940 8288 TRACE nova.openstack.common.rpc.impl_qpid ConnectionError: connection aborted
2014-04-10 01:09:35.940 8288 TRACE nova.openstack.common.rpc.impl_qpid
2014-04-10 01:09:38.940 8288 ERROR nova.openstack.common.rpc.impl_qpid [-] Unable to connect to AMQP server: [Errno 111] ECONNREFUSED. Sleeping 1 seconds
2014-04-10 01:09:38.942 8288 ERROR nova.openstack.common.rpc.impl_qpid [-] Unable to connect to AMQP server: [Errno 111] ECONNREFUSED. Sleeping 1 seconds
2014-04-10 01:09:39.942 8288 ERROR nova.openstack.common.rpc.impl_qpid [-] Unable to connect to AMQP server: [Errno 111] ECONNREFUSED. Sleeping 2 seconds
2014-04-10 01:09:39.945 8288 ERROR nova.openstack.common.rpc.impl_qpid [-] Unable to connect to AMQP server: [Errno 111] ECONNREFUSED. Sleeping 2 seconds
2014-04-10 01:09:41.955 8288 INFO nova.openstack.common.rpc.impl_qpid [-] Connected to AMQP server on 192.168.0.222:5672
2014-04-10 01:09:41.958 8288 INFO nova.openstack.common.rpc.impl_qpid [-] Connected to AMQP server on 192.168.0.222:5672
2014-04-10 01:09:44.138 8288 ERROR nova.openstack.common.rpc.impl_qpid [-] Failed to publish message to topic 'conductor': connection aborted
2014-04-10 01:09:44.138 8288 TRACE nova.openstack.common.rpc.impl_qpid Traceback (most recent call last):
2014-04-10 01:09:44.138 8288 TRACE nova.openstack.common.rpc.impl_qpid   File "/usr/lib/python2.6/site-packages/nova/openstack/common/rpc/impl_qpid.py", line 540, in ensure
2014-04-10 01:09:44.138 8288 TRACE nova.openstack.common.rpc.impl_qpid     return method(*args, **kwargs)
2014-04-10 01:09:44.138 8288 TRACE nova.openstack.common.rpc.impl_qpid   File "/usr/lib/python2.6/site-packages/nova/openstack/common/rpc/impl_qpid.py", line 632, in _publisher_send
2014-04-10 01:09:44.138 8288 TRACE nova.openstack.common.rpc.impl_qpid     publisher = cls(self.conf, self.session, topic)
2014-04-10 01:09:44.138 8288 TRACE nova.openstack.common.rpc.impl_qpid   File "/usr/lib/python2.6/site-packages/nova/openstack/common/rpc/impl_qpid.py", line 398, in __init__
2014-04-10 01:09:44.138 8288 TRACE nova.openstack.common.rpc.impl_qpid     super(TopicPublisher, self).__init__(conf, session, node_name)
2014-04-10 01:09:44.138 8288 TRACE nova.openstack.common.rpc.impl_qpid   File "/usr/lib/python2.6/site-packages/nova/openstack/common/rpc/impl_qpid.py", line 328, in __init__
2014-04-10 01:09:44.138 8288 TRACE nova.openstack.common.rpc.impl_qpid     self.reconnect(session)
2014-04-10 01:09:44.138 8288 TRACE nova.openstack.common.rpc.impl_qpid   File "/usr/lib/python2.6/site-packages/nova/openstack/common/rpc/impl_qpid.py", line 332, in reconnect
2014-04-10 01:09:44.138 8288 TRACE nova.openstack.common.rpc.impl_qpid     self.sender = session.sender(self.address)
2014-04-10 01:09:44.138 8288 TRACE nova.openstack.common.rpc.impl_qpid   File "<string>", line 6, in sender
2014-04-10 01:09:44.138 8288 TRACE nova.openstack.common.rpc.impl_qpid   File "/usr/lib/python2.6/site-packages/qpid/messaging/endpoints.py", line 597, in sender
2014-04-10 01:09:44.138 8288 TRACE nova.openstack.common.rpc.impl_qpid     sender._ewait(lambda: sender.linked)
2014-04-10 01:09:44.138 8288 TRACE nova.openstack.common.rpc.impl_qpid   File "/usr/lib/python2.6/site-packages/qpid/messaging/endpoints.py", line 804, in _ewait
2014-04-10 01:09:44.138 8288 TRACE nova.openstack.common.rpc.impl_qpid     result = self.session._ewait(lambda: self.error or predicate(), timeout)
2014-04-10 01:09:44.138 8288 TRACE nova.openstack.common.rpc.impl_qpid   File "/usr/lib/python2.6/site-packages/qpid/messaging/endpoints.py", line 571, in _ewait
2014-04-10 01:09:44.138 8288 TRACE nova.openstack.common.rpc.impl_qpid     result = self.connection._ewait(lambda: self.error or predicate(), timeout)
2014-04-10 01:09:44.138 8288 TRACE nova.openstack.common.rpc.impl_qpid   File "/usr/lib/python2.6/site-packages/qpid/messaging/endpoints.py", line 214, in _ewait
2014-04-10 01:09:44.138 8288 TRACE nova.openstack.common.rpc.impl_qpid     self.check_error()
2014-04-10 01:09:44.138 8288 TRACE nova.openstack.common.rpc.impl_qpid   File "/usr/lib/python2.6/site-packages/qpid/messaging/endpoints.py", line 207, in check_error
2014-04-10 01:09:44.138 8288 TRACE nova.openstack.common.rpc.impl_qpid     raise self.error
2014-04-10 01:09:44.138 8288 TRACE nova.openstack.common.rpc.impl_qpid ConnectionError: connection aborted
2014-04-10 01:09:44.138 8288 TRACE nova.openstack.common.rpc.impl_qpid
2014-04-10 01:09:44.146 8288 INFO nova.openstack.common.rpc.impl_qpid [-] Connected to AMQP server on 192.168.0.222:5672
2014-04-10 03:01:25.399 8288 WARNING nova.openstack.common.loopingcall [-] task run outlasted interval by 1.020018 sec
2014-04-10 19:11:55.220 8288 AUDIT nova.compute.manager [req-93a3c4ba-ea07-4d2d-8735-4a8415081f1e 6c84f979672f44039145387a0e0fb864 bd6159c7f8aa4baba9a33f0b2725e6d1] [instance: 3356c492-d599-4579-85c5-0eff3d9a9e6f] Starting instance...
2014-04-11 00:56:28.869 8288 AUDIT nova.compute.manager [req-b7527bcb-8223-4810-ac27-60625f389af0 6c84f979672f44039145387a0e0fb864 bd6159c7f8aa4baba9a33f0b2725e6d1] [instance: 6546372e-514f-485b-a47a-992445170050] Starting instance...
2014-04-11 01:18:14.779 8288 AUDIT nova.compute.manager [req-df2d62c5-dd1f-4395-9878-85dd84bea1c3 13090770bacc46ccb8fb7f5e13e5de98 f8e6ba11caa94ea98d24ec819eb746fd] [instance: 1a186694-96ba-40b8-be02-9c98a36d8472] Starting instance...
---
You can see at least 3 cases of "Starting instance..." that have never progressed.

This is a really old bug, so I am taking a shot in the dark here. Inspecting the code for RHEL-OSP 4.0, it seems that a ConnectionError when attempting to send a message to conductor would properly cause the instance to go into an error state instead of getting stuck. This leads me to believe that one possibility is a greenthread deadlocking somewhere in the oslo.messaging library code in the presence of certain error conditions.
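The "Sleeping 1 seconds ... Sleeping 2 seconds" lines above come from the driver's reconnect loop, which grows its delay on each failed attempt (each value appears twice here because two connections are reconnecting in parallel). A minimal sketch of that kind of capped doubling backoff schedule (the parameter names are assumptions modeled on the qpid driver's interval options, not its exact code):

```python
def reconnect_delays(interval_start=1, backoff_factor=2, interval_max=30):
    """Yield the sleep interval (in seconds) before each successive
    reconnect attempt: start at interval_start, multiply by
    backoff_factor after every failure, and cap at interval_max."""
    delay = interval_start
    while True:
        yield delay
        delay = min(delay * backoff_factor, interval_max)

# For a single connection the schedule is 1, 2, 4, 8, ... capped at 30.
```

Note that the backoff only governs how quickly the AMQP connection comes back; it says nothing about whether in-flight RPC calls that failed during the outage are retried, which is where the stuck "Starting instance..." requests appear to be lost.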
RHOS 4.0 comes with the Guru Meditation report, which would give us the state of all the greenthreads in the nova compute service. If you could provide that for one of the compute services that has stuck instances, it could be a start (see more on this here: https://access.redhat.com/documentation/en-US/Red_Hat_Enterprise_Linux_OpenStack_Platform/5/html/Cloud_Administrator_Guide/section_compute-troubleshooting.html).

The code that does this in RHEL-OSP 6 is significantly different (both oslo.messaging and nova compute), so this investigation is unlikely to point us in the right direction for fixing it there, if it is even an issue anymore. The fix (if any) will likely be RHEL-OSP 4 only, but leaving this targeted at RHOS 6 pending further info for now.

Removing needinfo.
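A Guru Meditation report is produced by a signal handler that the service registers at startup; on releases of this era the report is triggered by sending the process a signal (SIGUSR1, to my recollection, though check the linked guide for the release you are on). A self-contained sketch of the underlying idea, a handler that dumps a stack trace for every OS thread (this is illustrative, not nova's actual implementation, which also covers greenthreads and config state):

```python
import signal
import sys
import traceback

def dump_report(signum, frame):
    # Print a stack trace for every thread in the process, similar in
    # spirit to the "threads" section of a Guru Meditation report.
    for thread_id, stack in sys._current_frames().items():
        print("Thread %s:" % thread_id)
        traceback.print_stack(stack, file=sys.stdout)

# Register the dump; trigger it externally with: kill -USR1 <pid>
signal.signal(signal.SIGUSR1, dump_report)
```

Because the handler runs inside the stuck process, the resulting stacks show exactly where each thread is blocked, which is what would confirm (or rule out) the suspected oslo.messaging deadlock here.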