Description:

1. The main problem is that, after a random amount of time, we lose the ability to start new VMs. The new VMs end up in ERROR state.

2. At that time nova service-list shows both computes down:

| 7 | nova-compute | compute-0-0.local | zone0 | enabled | down | 2015-06-04T07:29:49.000000 | - |
| 8 | nova-compute | compute-0-1.local | zone1 | enabled | down | 2015-06-04T09:46:30.000000 | - |

3. Later on, without restarting any service, one compute came back online:

| 7 | nova-compute | compute-0-0.local | zone0 | enabled | down | 2015-06-04T07:29:49.000000 | - |
| 8 | nova-compute | compute-0-1.local | zone1 | enabled | up   | 2015-06-04T11:16:29.000000 | - |

This appears to be what blocks creation of new VMs.

Looking around:

1. qpid looks active, and all the other services are active and working at the same time.

2. On qpid we see connections from all the other services, but the connection from the compute that is down is missing. This is how I check it:

qpid-stat -c | grep 10.1.255.251
qpid-stat -c | grep 10.1.255.252

3. On the affected compute I still see a TCP connection in CLOSE_WAIT state:

compute-0-1.local:49136->10.1.20.151:amqps (CLOSE_WAIT)

CLOSE_WAIT means that qpid closed the connection (not the compute), but the compute hasn't handled it yet. It is probably qpid that saw the compute as down and closed the connection.

4. The CLOSE_WAITs eventually got closed on c1, and we can create VMs on c1 again. But now we have the CLOSE_WAITs on the second compute and we cannot start VMs there. Most likely these CLOSE_WAITs stay around for ~1 hour, and once they get closed we will eventually be able to start VMs again.
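For reference, here is a minimal sketch of how the CLOSE_WAIT check on a compute node could be scripted. It assumes the broker listens on the standard amqps port 5671 (adjust for other deployments); the helper name and approach are illustrative only and not part of any OpenStack tooling. Note that ss prints the state as CLOSE-WAIT, whereas lsof-style output (as quoted above) prints CLOSE_WAIT.

#!/usr/bin/env python
# Minimal sketch: list TCP connections to the qpid broker that are stuck
# in CLOSE-WAIT on a compute node. Assumes the broker listens on the
# standard amqps port (5671); adjust BROKER_PORT for other deployments.
import subprocess

BROKER_PORT = 5671  # amqps


def stale_broker_connections(port=BROKER_PORT):
    """Return 'ss' output lines for CLOSE-WAIT connections to the broker."""
    # 'ss -tan' lists all TCP sockets with numeric addresses; a socket in
    # CLOSE-WAIT means the peer (the broker) closed its side but the local
    # process has not yet noticed and closed its own end.
    out = subprocess.check_output(['ss', '-tan']).decode()
    return [line for line in out.splitlines()
            if 'CLOSE-WAIT' in line and ':%d' % port in line]


if __name__ == '__main__':
    stale = stale_broker_connections()
    if stale:
        print('%d CLOSE-WAIT connection(s) to port %d:' % (len(stale), BROKER_PORT))
        print('\n'.join(stale))
    else:
        print('No CLOSE-WAIT connections to port %d.' % BROKER_PORT)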
@kgiusti, can you take a look at this bug?
By looking at the bug and the patch, it seems that this might be a duplicate of this bug: https://bugzilla.redhat.com/show_bug.cgi?id=1188304 That bug was fixed upstream and the patch was backported for RHOS6. Would it be possible to test this patch and see if it fixes the issue reported here?
Flavio, I am asking the customer to test python-oslo-messaging-1.4.1-4.el7ost and setting needinfo on me for reporting back. Thanks, Pablo
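As a side note, a quick way to confirm on each node that the candidate build is actually installed before retesting could look like the sketch below (the rpm query is standard; the version string checked is just the build named above):

# Minimal sketch: confirm the installed python-oslo-messaging build on a node
# before retesting, e.g. to check it matches 1.4.1-4.el7ost.
import subprocess

installed = subprocess.check_output(
    ['rpm', '-q', 'python-oslo-messaging']).decode().strip()
print(installed)
if '1.4.1-4.el7ost' not in installed:
    print('Candidate build not installed yet.')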
Do we have feedback from the customer on whether python-oslo-messaging-1.4.1-4.el7ost helped mitigate the problem?
Yes, Flavio. The customer said that the problem was solved by the latest oslo.messaging package. Thanks, Pablo
*** This bug has been marked as a duplicate of bug 1188304 ***