Note: This bug is displayed in read-only format because the product is no longer active in Red Hat Bugzilla.

Bug 1084833

Summary: Guests show "Starting instance..." but the claim never happens; the VM never starts and is stuck in the BUILDING phase
Product: Red Hat OpenStack
Reporter: Ian Wienand <iwienand>
Component: openstack-nova
Assignee: Nikola Dipanov <ndipanov>
Status: CLOSED INSUFFICIENT_DATA
QA Contact: Ami Jeain <ajeain>
Severity: medium
Docs Contact:
Priority: high
Version: 4.0
CC: aortega, eglynn, gdubreui, iwienand, ndipanov, sgordon, wfoster, yeylon
Target Milestone: async
Keywords: ZStream
Target Release: 4.0
Hardware: Unspecified
OS: Unspecified
Whiteboard:
Fixed In Version:
Doc Type: Bug Fix
Doc Text:
Story Points: ---
Clone Of:
Environment:
Last Closed: 2015-04-24 21:22:49 UTC
Type: Bug
Regression: ---
Mount Type: ---
Documentation: ---
CRM:
Verified Versions: Category: ---
oVirt Team: --- RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: --- Target Upstream Version:
Embargoed:

Description Ian Wienand 2014-04-06 23:44:05 UTC
Description of problem:

I found two VMs stuck in the BUILDING phase.

---
01672c50-03ce-4372-ac43-8fcebe599195
7f7605e2-1737-4da4-99ee-63238e48a058
---

I can see from the scheduler that these were both sent to host07:

---
scheduler.log:2014-04-06 19:51:00.395 7802 INFO nova.scheduler.filter_scheduler [req-eb84d97e-c52b-4ea2-a5a6-82042c967c0d 6c84f979672f44039145387a0e0fb864 bd6159c7f8aa4baba9a33f0b2725e6d1] Attempting to build 1 instance(s) uuids: [u'01672c50-03ce-4372-ac43-8fcebe599195']
scheduler.log:2014-04-06 19:51:01.055 7802 INFO nova.scheduler.filter_scheduler [req-eb84d97e-c52b-4ea2-a5a6-82042c967c0d 6c84f979672f44039145387a0e0fb864 bd6159c7f8aa4baba9a33f0b2725e6d1] Choosing host WeighedHost [host: host07.oslab.priv, weight: 11153.0] for instance 01672c50-03ce-4372-ac43-8fcebe599195
2014-04-04 09:36:55.907 7802 INFO nova.scheduler.filter_scheduler [req-a0824316-a202-4617-a6a9-edbf786b8bdc 6c84f979672f44039145387a0e0fb864 bd6159c7f8aa4baba9a33f0b2725e6d1] Attempting to build 1 instance(s) uuids: [u'7f7605e2-1737-4da4-99ee-63238e48a058']
2014-04-04 09:36:56.627 7802 INFO nova.scheduler.filter_scheduler [req-a0824316-a202-4617-a6a9-edbf786b8bdc 6c84f979672f44039145387a0e0fb864 bd6159c7f8aa4baba9a33f0b2725e6d1] Choosing host WeighedHost [host: host07.oslab.priv, weight: 15249.0] for instance 7f7605e2-1737-4da4-99ee-63238e48a058
---

On host07, I can see that the messages to build the VMs were received:

---
[root@host07 nova]# tail -n2 compute.log
2014-04-04 09:36:56.864 8288 AUDIT nova.compute.manager [req-a0824316-a202-4617-a6a9-edbf786b8bdc 6c84f979672f44039145387a0e0fb864 bd6159c7f8aa4baba9a33f0b2725e6d1] [instance: 7f7605e2-1737-4da4-99ee-63238e48a058] Starting instance...
2014-04-06 19:51:01.871 8288 AUDIT nova.compute.manager [req-eb84d97e-c52b-4ea2-a5a6-82042c967c0d 6c84f979672f44039145387a0e0fb864 bd6159c7f8aa4baba9a33f0b2725e6d1] [instance: 01672c50-03ce-4372-ac43-8fcebe599195] Starting instance...
---

but the VMs never seemed to progress any further.

Looking at another VM that starts correctly, I would expect to see something like:

---
2014-04-01 11:31:33.174 8288 AUDIT nova.compute.manager [req-151863cf-b9d8-4cd7-acef-30e044e7b443 6c84f979672f44039145387a0e0fb864 bd6159c7f8aa4baba9a33f0b2725e6d1] [instance: b5f1710b-c4d2-46f5-a9cd-5e29323199ef] Starting instance...
2014-04-01 11:31:34.608 8288 AUDIT nova.compute.claims [req-151863cf-b9d8-4cd7-acef-30e044e7b443 6c84f979672f44039145387a0e0fb864 bd6159c7f8aa4baba9a33f0b2725e6d1] [instance: b5f1710b-c4d2-46f5-a9cd-5e29323199ef] Attempting claim: memory 4096 MB, disk 40 GB, VCPUs 2
[...]
2014-04-01 11:31:34.613 8288 AUDIT nova.compute.claims [req-151863cf-b9d8-4cd7-acef-30e044e7b443 6c84f979672f44039145387a0e0fb864 bd6159c7f8aa4baba9a33f0b2725e6d1] [instance: b5f1710b-c4d2-46f5-a9cd-5e29323199ef] Claim successful
---

So it would seem that something has gone wrong with the claiming step.
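For orientation, the claim step implied by the audit log above ("Attempting claim: memory 4096 MB, disk 40 GB, VCPUs 2" followed by "Claim successful") boils down to checking each requested resource against the host's free capacity. This is an illustrative sketch only, not nova's actual resource-tracker code; the function and dictionary shapes are invented for the example.

```python
# Hypothetical sketch of a resource claim check -- NOT nova's actual code.
# Mirrors the audit line "Attempting claim: memory 4096 MB, disk 40 GB,
# VCPUs 2": each requested resource must fit in the host's free capacity.
def attempt_claim(requested, free):
    """Return True if every requested resource fits in the free capacity."""
    for resource, amount in requested.items():
        if amount > free.get(resource, 0):
            return False  # a real tracker would also log which check failed
    return True

requested = {"memory_mb": 4096, "disk_gb": 40, "vcpus": 2}
free = {"memory_mb": 16384, "disk_gb": 200, "vcpus": 8}
print(attempt_claim(requested, free))  # True: the claim would succeed
```

The point of the bug is that for the stuck instances this step is never even attempted: the "Attempting claim" line never appears after "Starting instance...".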

Comment 2 Ian Wienand 2014-04-07 06:27:42 UTC
To confirm: there is nothing of interest in any of the other logs (in particular the libvirtd log, where I had hoped to see a useful failure).

I ran strace on the parent process, and it's in an epoll() loop sending RPC heartbeat messages and not much else.

All the child nova-compute processes are sitting in sleep(2) loops:

---
[root@host07 nova]# strace -p 23336
Process 23336 attached - interrupt to quit
restart_syscall(<... resuming interrupted call ...>) = 0
kill(23334, SIG_0)                      = 0
kill(8288, SIG_0)                       = 0
rt_sigprocmask(SIG_BLOCK, [CHLD], [], 8) = 0
rt_sigaction(SIGCHLD, NULL, {SIG_DFL, [], SA_RESTORER, 0x330da0f710}, 8) = 0
rt_sigprocmask(SIG_SETMASK, [], NULL, 8) = 0
nanosleep({2, 0}, 0x7f99c7ffd9e0)       = 0
kill(23334, SIG_0)                      = 0
kill(8288, SIG_0)                       = 0
rt_sigprocmask(SIG_BLOCK, [CHLD], [], 8) = 0
rt_sigaction(SIGCHLD, NULL, {SIG_DFL, [], SA_RESTORER, 0x330da0f710}, 8) = 0
rt_sigprocmask(SIG_SETMASK, [], NULL, 8) = 0
nanosleep({2, 0}, 0x7f99c7ffd9e0)       = 0
---

So there's nothing that is obviously (to me) blocking.
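The strace output above is consistent with a supervisor liveness loop rather than a hang: probe each child PID with signal 0 (an existence check that delivers no signal), then nanosleep for two seconds. A minimal Python sketch of that pattern, for illustration only:

```python
import os

# Sketch of what the straced parent appears to be doing: kill(pid, 0)
# existence checks on child PIDs followed by a 2-second sleep. Signal 0
# delivers nothing; it only reports whether the process can be signalled.
def children_alive(pids):
    """Return True if every PID in `pids` still exists."""
    for pid in pids:
        try:
            os.kill(pid, 0)  # signal 0: existence check, no signal sent
        except OSError:
            return False  # note: EPERM also lands here in this sketch
    return True

print(children_alive([os.getpid()]))  # True: our own PID always exists
```

In other words, the parent is healthy and merely watching its children, which matches the conclusion that nothing is obviously blocked at the process level.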


There are actually two places in the same file that can log "Starting instance..."; I suggested a change for that [1] and a larger one to add the function to log and audit messages [2].

[1] https://review.openstack.org/#/c/85639/
[2] https://review.openstack.org/#/c/85638/

Comment 3 Ian Wienand 2014-04-11 06:14:13 UTC
This has started happening again.  The issue seems to be that RPC calls have failed.  A tail of the log file is below:

---
2014-04-10 00:50:47.474 8288 AUDIT nova.compute.manager [req-4e27bfeb-a7bc-4c5e-92f7-9b5be7a7dd78 6c84f979672f44039145387a0e0fb864 bd6159c7f8aa4baba9a33f0b2725e6d1] [instance: c18d4568-ba99-4284-9db3-4a3b6485ea20] Starting instance...
2014-04-10 01:09:35.923 8288 ERROR nova.openstack.common.rpc.impl_qpid [-] Failed to consume message from queue: connection aborted
2014-04-10 01:09:35.923 8288 TRACE nova.openstack.common.rpc.impl_qpid Traceback (most recent call last):
2014-04-10 01:09:35.923 8288 TRACE nova.openstack.common.rpc.impl_qpid   File "/usr/lib/python2.6/site-packages/nova/openstack/common/rpc/impl_qpid.py", line 540, in ensure
2014-04-10 01:09:35.923 8288 TRACE nova.openstack.common.rpc.impl_qpid     return method(*args, **kwargs)
2014-04-10 01:09:35.923 8288 TRACE nova.openstack.common.rpc.impl_qpid   File "/usr/lib/python2.6/site-packages/nova/openstack/common/rpc/impl_qpid.py", line 597, in _consume
2014-04-10 01:09:35.923 8288 TRACE nova.openstack.common.rpc.impl_qpid     nxt_receiver = self.session.next_receiver(timeout=timeout)
2014-04-10 01:09:35.923 8288 TRACE nova.openstack.common.rpc.impl_qpid   File "<string>", line 6, in next_receiver
2014-04-10 01:09:35.923 8288 TRACE nova.openstack.common.rpc.impl_qpid   File "/usr/lib/python2.6/site-packages/qpid/messaging/endpoints.py", line 665, in next_receiver
2014-04-10 01:09:35.923 8288 TRACE nova.openstack.common.rpc.impl_qpid     if self._ecwait(lambda: self.incoming, timeout):
2014-04-10 01:09:35.923 8288 TRACE nova.openstack.common.rpc.impl_qpid   File "/usr/lib/python2.6/site-packages/qpid/messaging/endpoints.py", line 50, in _ecwait
2014-04-10 01:09:35.923 8288 TRACE nova.openstack.common.rpc.impl_qpid     result = self._ewait(lambda: self.closed or predicate(), timeout)
2014-04-10 01:09:35.923 8288 TRACE nova.openstack.common.rpc.impl_qpid   File "/usr/lib/python2.6/site-packages/qpid/messaging/endpoints.py", line 571, in _ewait
2014-04-10 01:09:35.923 8288 TRACE nova.openstack.common.rpc.impl_qpid     result = self.connection._ewait(lambda: self.error or predicate(), timeout)
2014-04-10 01:09:35.923 8288 TRACE nova.openstack.common.rpc.impl_qpid   File "/usr/lib/python2.6/site-packages/qpid/messaging/endpoints.py", line 214, in _ewait
2014-04-10 01:09:35.923 8288 TRACE nova.openstack.common.rpc.impl_qpid     self.check_error()
2014-04-10 01:09:35.923 8288 TRACE nova.openstack.common.rpc.impl_qpid   File "/usr/lib/python2.6/site-packages/qpid/messaging/endpoints.py", line 207, in check_error
2014-04-10 01:09:35.923 8288 TRACE nova.openstack.common.rpc.impl_qpid     raise self.error
2014-04-10 01:09:35.923 8288 TRACE nova.openstack.common.rpc.impl_qpid ConnectionError: connection aborted
2014-04-10 01:09:35.923 8288 TRACE nova.openstack.common.rpc.impl_qpid 
2014-04-10 01:09:35.940 8288 ERROR nova.openstack.common.rpc.impl_qpid [-] Failed to consume message from queue: connection aborted
2014-04-10 01:09:35.940 8288 TRACE nova.openstack.common.rpc.impl_qpid Traceback (most recent call last):
2014-04-10 01:09:35.940 8288 TRACE nova.openstack.common.rpc.impl_qpid   File "/usr/lib/python2.6/site-packages/nova/openstack/common/rpc/impl_qpid.py", line 540, in ensure
2014-04-10 01:09:35.940 8288 TRACE nova.openstack.common.rpc.impl_qpid     return method(*args, **kwargs)
2014-04-10 01:09:35.940 8288 TRACE nova.openstack.common.rpc.impl_qpid   File "/usr/lib/python2.6/site-packages/nova/openstack/common/rpc/impl_qpid.py", line 597, in _consume
2014-04-10 01:09:35.940 8288 TRACE nova.openstack.common.rpc.impl_qpid     nxt_receiver = self.session.next_receiver(timeout=timeout)
2014-04-10 01:09:35.940 8288 TRACE nova.openstack.common.rpc.impl_qpid   File "<string>", line 6, in next_receiver
2014-04-10 01:09:35.940 8288 TRACE nova.openstack.common.rpc.impl_qpid   File "/usr/lib/python2.6/site-packages/qpid/messaging/endpoints.py", line 665, in next_receiver
2014-04-10 01:09:35.940 8288 TRACE nova.openstack.common.rpc.impl_qpid     if self._ecwait(lambda: self.incoming, timeout):
2014-04-10 01:09:35.940 8288 TRACE nova.openstack.common.rpc.impl_qpid   File "/usr/lib/python2.6/site-packages/qpid/messaging/endpoints.py", line 50, in _ecwait
2014-04-10 01:09:35.940 8288 TRACE nova.openstack.common.rpc.impl_qpid     result = self._ewait(lambda: self.closed or predicate(), timeout)
2014-04-10 01:09:35.940 8288 TRACE nova.openstack.common.rpc.impl_qpid   File "/usr/lib/python2.6/site-packages/qpid/messaging/endpoints.py", line 571, in _ewait
2014-04-10 01:09:35.940 8288 TRACE nova.openstack.common.rpc.impl_qpid     result = self.connection._ewait(lambda: self.error or predicate(), timeout)
2014-04-10 01:09:35.940 8288 TRACE nova.openstack.common.rpc.impl_qpid   File "/usr/lib/python2.6/site-packages/qpid/messaging/endpoints.py", line 214, in _ewait
2014-04-10 01:09:35.940 8288 TRACE nova.openstack.common.rpc.impl_qpid     self.check_error()
2014-04-10 01:09:35.940 8288 TRACE nova.openstack.common.rpc.impl_qpid   File "/usr/lib/python2.6/site-packages/qpid/messaging/endpoints.py", line 207, in check_error
2014-04-10 01:09:35.940 8288 TRACE nova.openstack.common.rpc.impl_qpid     raise self.error
2014-04-10 01:09:35.940 8288 TRACE nova.openstack.common.rpc.impl_qpid ConnectionError: connection aborted
2014-04-10 01:09:35.940 8288 TRACE nova.openstack.common.rpc.impl_qpid 
2014-04-10 01:09:38.940 8288 ERROR nova.openstack.common.rpc.impl_qpid [-] Unable to connect to AMQP server: [Errno 111] ECONNREFUSED. Sleeping 1 seconds
2014-04-10 01:09:38.942 8288 ERROR nova.openstack.common.rpc.impl_qpid [-] Unable to connect to AMQP server: [Errno 111] ECONNREFUSED. Sleeping 1 seconds
2014-04-10 01:09:39.942 8288 ERROR nova.openstack.common.rpc.impl_qpid [-] Unable to connect to AMQP server: [Errno 111] ECONNREFUSED. Sleeping 2 seconds
2014-04-10 01:09:39.945 8288 ERROR nova.openstack.common.rpc.impl_qpid [-] Unable to connect to AMQP server: [Errno 111] ECONNREFUSED. Sleeping 2 seconds
2014-04-10 01:09:41.955 8288 INFO nova.openstack.common.rpc.impl_qpid [-] Connected to AMQP server on 192.168.0.222:5672
2014-04-10 01:09:41.958 8288 INFO nova.openstack.common.rpc.impl_qpid [-] Connected to AMQP server on 192.168.0.222:5672
2014-04-10 01:09:44.138 8288 ERROR nova.openstack.common.rpc.impl_qpid [-] Failed to publish message to topic 'conductor': connection aborted
2014-04-10 01:09:44.138 8288 TRACE nova.openstack.common.rpc.impl_qpid Traceback (most recent call last):
2014-04-10 01:09:44.138 8288 TRACE nova.openstack.common.rpc.impl_qpid   File "/usr/lib/python2.6/site-packages/nova/openstack/common/rpc/impl_qpid.py", line 540, in ensure
2014-04-10 01:09:44.138 8288 TRACE nova.openstack.common.rpc.impl_qpid     return method(*args, **kwargs)
2014-04-10 01:09:44.138 8288 TRACE nova.openstack.common.rpc.impl_qpid   File "/usr/lib/python2.6/site-packages/nova/openstack/common/rpc/impl_qpid.py", line 632, in _publisher_send
2014-04-10 01:09:44.138 8288 TRACE nova.openstack.common.rpc.impl_qpid     publisher = cls(self.conf, self.session, topic)
2014-04-10 01:09:44.138 8288 TRACE nova.openstack.common.rpc.impl_qpid   File "/usr/lib/python2.6/site-packages/nova/openstack/common/rpc/impl_qpid.py", line 398, in __init__
2014-04-10 01:09:44.138 8288 TRACE nova.openstack.common.rpc.impl_qpid     super(TopicPublisher, self).__init__(conf, session, node_name)
2014-04-10 01:09:44.138 8288 TRACE nova.openstack.common.rpc.impl_qpid   File "/usr/lib/python2.6/site-packages/nova/openstack/common/rpc/impl_qpid.py", line 328, in __init__
2014-04-10 01:09:44.138 8288 TRACE nova.openstack.common.rpc.impl_qpid     self.reconnect(session)
2014-04-10 01:09:44.138 8288 TRACE nova.openstack.common.rpc.impl_qpid   File "/usr/lib/python2.6/site-packages/nova/openstack/common/rpc/impl_qpid.py", line 332, in reconnect
2014-04-10 01:09:44.138 8288 TRACE nova.openstack.common.rpc.impl_qpid     self.sender = session.sender(self.address)
2014-04-10 01:09:44.138 8288 TRACE nova.openstack.common.rpc.impl_qpid   File "<string>", line 6, in sender
2014-04-10 01:09:44.138 8288 TRACE nova.openstack.common.rpc.impl_qpid   File "/usr/lib/python2.6/site-packages/qpid/messaging/endpoints.py", line 597, in sender
2014-04-10 01:09:44.138 8288 TRACE nova.openstack.common.rpc.impl_qpid     sender._ewait(lambda: sender.linked)
2014-04-10 01:09:44.138 8288 TRACE nova.openstack.common.rpc.impl_qpid   File "/usr/lib/python2.6/site-packages/qpid/messaging/endpoints.py", line 804, in _ewait
2014-04-10 01:09:44.138 8288 TRACE nova.openstack.common.rpc.impl_qpid     result = self.session._ewait(lambda: self.error or predicate(), timeout)
2014-04-10 01:09:44.138 8288 TRACE nova.openstack.common.rpc.impl_qpid   File "/usr/lib/python2.6/site-packages/qpid/messaging/endpoints.py", line 571, in _ewait
2014-04-10 01:09:44.138 8288 TRACE nova.openstack.common.rpc.impl_qpid     result = self.connection._ewait(lambda: self.error or predicate(), timeout)
2014-04-10 01:09:44.138 8288 TRACE nova.openstack.common.rpc.impl_qpid   File "/usr/lib/python2.6/site-packages/qpid/messaging/endpoints.py", line 214, in _ewait
2014-04-10 01:09:44.138 8288 TRACE nova.openstack.common.rpc.impl_qpid     self.check_error()
2014-04-10 01:09:44.138 8288 TRACE nova.openstack.common.rpc.impl_qpid   File "/usr/lib/python2.6/site-packages/qpid/messaging/endpoints.py", line 207, in check_error
2014-04-10 01:09:44.138 8288 TRACE nova.openstack.common.rpc.impl_qpid     raise self.error
2014-04-10 01:09:44.138 8288 TRACE nova.openstack.common.rpc.impl_qpid ConnectionError: connection aborted
2014-04-10 01:09:44.138 8288 TRACE nova.openstack.common.rpc.impl_qpid 
2014-04-10 01:09:44.146 8288 INFO nova.openstack.common.rpc.impl_qpid [-] Connected to AMQP server on 192.168.0.222:5672
2014-04-10 03:01:25.399 8288 WARNING nova.openstack.common.loopingcall [-] task run outlasted interval by 1.020018 sec
2014-04-10 19:11:55.220 8288 AUDIT nova.compute.manager [req-93a3c4ba-ea07-4d2d-8735-4a8415081f1e 6c84f979672f44039145387a0e0fb864 bd6159c7f8aa4baba9a33f0b2725e6d1] [instance: 3356c492-d599-4579-85c5-0eff3d9a9e6f] Starting instance...
2014-04-11 00:56:28.869 8288 AUDIT nova.compute.manager [req-b7527bcb-8223-4810-ac27-60625f389af0 6c84f979672f44039145387a0e0fb864 bd6159c7f8aa4baba9a33f0b2725e6d1] [instance: 6546372e-514f-485b-a47a-992445170050] Starting instance...
2014-04-11 01:18:14.779 8288 AUDIT nova.compute.manager [req-df2d62c5-dd1f-4395-9878-85dd84bea1c3 13090770bacc46ccb8fb7f5e13e5de98 f8e6ba11caa94ea98d24ec819eb746fd] [instance: 1a186694-96ba-40b8-be02-9c98a36d8472] Starting instance...
---

You can see at least three cases of "Starting instance..." that never progressed.
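The reconnect lines in the log ("Sleeping 1 seconds", "Sleeping 1 seconds", "Sleeping 2 seconds", "Sleeping 2 seconds") look like a doubling backoff, with each delay appearing twice presumably because two connections are retrying independently. A hedged sketch of that backoff pattern, not the impl_qpid code itself:

```python
# Illustrative sketch of the exponential-backoff reconnect pattern seen
# in the log -- not the actual nova.openstack.common.rpc.impl_qpid code.
# Each failed attempt doubles the sleep, up to a cap.
def backoff_delays(first=1, cap=60, attempts=5):
    """Return the sequence of sleep intervals for `attempts` retries."""
    delays = []
    delay = first
    for _ in range(attempts):
        delays.append(delay)
        delay = min(delay * 2, cap)
    return delays

print(backoff_delays())  # [1, 2, 4, 8, 16]
```

The log shows the reconnect itself eventually succeeds ("Connected to AMQP server on 192.168.0.222:5672"); what never recovers is the in-flight build that was waiting on a reply when the connection dropped.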

Comment 5 Nikola Dipanov 2015-03-03 20:29:41 UTC
This is a really old bug so I am taking a shot in the dark here.

Inspecting the code for RHEL-OSP 4.0 it seems that a ConnectionError when attempting to send a message to conductor would properly cause the instance to go into an error state instead of getting stuck.

This leads me to believe that one possibility is that a greenthread deadlocks somewhere in the oslo.messaging library code in the presence of certain error conditions.
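To illustrate the hypothesised failure mode in miniature (using plain threads rather than eventlet greenthreads, and invented names throughout): a waiter blocks for a reply that a consumer was supposed to deliver, but the consumer dies on a connection error before signalling, so the waiter is stranded.

```python
import threading

# Minimal illustration of the suspected deadlock shape -- plain threads,
# not eventlet greenthreads, and not oslo.messaging code. The "spawn"
# side waits on a reply event; the consumer hits a connection error and
# never sets it, so the waiter blocks until its timeout (or forever,
# if it has none).
reply_ready = threading.Event()

def consume_message():
    raise ConnectionError("connection aborted")  # simulated AMQP failure

def consumer():
    try:
        consume_message()
        reply_ready.set()  # never reached: the waiter is stranded
    except ConnectionError:
        pass  # error swallowed; nobody wakes the waiter

t = threading.Thread(target=consumer)
t.start()
t.join()
got_reply = reply_ready.wait(timeout=0.1)
print(got_reply)  # False: the reply never arrives
```

An instance stuck after "Starting instance..." with no "Attempting claim" line would be consistent with this shape: the build greenthread waiting indefinitely on an RPC reply that was lost when the connection aborted.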

RHOS 4.0 ships with the Guru Meditation report, which gives us the state of all the greenthreads in the nova-compute service. If you could provide one for a compute service that has stuck instances, that would be a start (see https://access.redhat.com/documentation/en-US/Red_Hat_Enterprise_Linux_OpenStack_Platform/5/html/Cloud_Administrator_Guide/section_compute-troubleshooting.html for details).

The code that does this in RHEL-OSP 6 (both oslo.messaging and nova-compute) is so significantly different that this investigation is unlikely to point us in the right direction for fixing it there, if it is even still an issue; so the fix (if any) will likely be RHEL-OSP 4 only, but I am leaving this targeted for RHOS 6 pending further info for now.

Comment 7 Ian Wienand 2023-05-03 01:40:45 UTC
Removing needinfo.