Bug 1230134 - Nova computes agents lost AMQP connections
Keywords:
Status: CLOSED DUPLICATE of bug 1188304
Alias: None
Product: Red Hat OpenStack
Classification: Red Hat
Component: python-oslo-messaging
Version: 6.0 (Juno)
Hardware: x86_64
OS: Linux
Priority: unspecified
Severity: high
Target Milestone: ---
Target Release: 7.0 (Kilo)
Assignee: Flavio Percoco
QA Contact: nlevinki
URL:
Whiteboard:
Depends On:
Blocks: 1214764
 
Reported: 2015-06-10 10:33 UTC by Pablo Iranzo Gómez
Modified: 2023-02-22 23:02 UTC
7 users

Fixed In Version:
Doc Type: Bug Fix
Doc Text:
Clone Of:
Environment:
Last Closed: 2015-08-19 08:32:14 UTC
Target Upstream Version:
Embargoed:



Description Pablo Iranzo Gómez 2015-06-10 10:33:14 UTC
Description:

1. The main problem is that after a random amount of time we lose the ability to start new VMs. The new VMs end up in ERROR state.

2. At this point nova service-list shows both computes down:
| 7  | nova-compute     | compute-0-0.local | zone0    | enabled | down  | 2015-06-04T07:29:49.000000 | -               |
| 8  | nova-compute     | compute-0-1.local | zone1    | enabled | down  | 2015-06-04T09:46:30.000000 | -               |

3. Later on, without restarting any service, one compute came back online:
| 7  | nova-compute     | compute-0-0.local | zone0    | enabled | down  | 2015-06-04T07:29:49.000000 | -               |
| 8  | nova-compute     | compute-0-1.local | zone1    | enabled | up    | 2015-06-04T11:16:29.000000 | -               |

Looks like this blocks creation of new VMs. 
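For context on why the computes show up as "down": Nova marks a service down when its last periodic report is older than the service_down_time threshold (60 seconds by default). A minimal sketch of that check, with hypothetical names (this is not Nova's actual code, just an illustration of the liveness rule):

```python
from datetime import datetime, timedelta

# Nova's default service_down_time is 60 seconds; a service whose last
# heartbeat is older than this is reported as "down" by nova service-list.
SERVICE_DOWN_TIME = timedelta(seconds=60)

def is_service_up(last_heartbeat, now):
    """Return True if the service reported within the allowed window."""
    return (now - last_heartbeat) <= SERVICE_DOWN_TIME

# Using the timestamp shown for compute-0-0 above:
last = datetime(2015, 6, 4, 7, 29, 49)
print(is_service_up(last, now=datetime(2015, 6, 4, 9, 46, 30)))  # False
```

This is why the "Updated At" column stays frozen while the compute's AMQP connection is broken: the periodic report can no longer reach the conductor, so the heartbeat timestamp ages past the threshold.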

Looking around: 
1. qpid looks active and all the other services are active and working at the same time
2. on qpid we see connections from all the other services, but the connection from the compute that is down is missing.
     This is how I checked it:
                qpid-stat -c | grep 10.1.255.251
                qpid-stat -c | grep 10.1.255.252 
3. on the affected compute I still see a TCP connection in CLOSE_WAIT state
     compute-0-1.local:49136->10.1.20.151:amqps (CLOSE_WAIT)

     CLOSE_WAIT means that qpid closed the connection (not the compute), but the compute hasn't handled it yet;
      it is probably qpid that saw the compute as down and closed the connection.
4. The CLOSE_WAITs eventually got closed on c1, and we can create VMs on c1. But now we have the CLOSE_WAITs on the second compute and we cannot start VMs there. Most likely these CLOSE_WAITs stay around for ~1 hour, and after they get closed we will eventually be able to start VMs again.
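The CLOSE_WAIT check in steps 3 and 4 can be scripted. A minimal sketch (a hypothetical helper, parsing ss -tan style output rather than the exact format of any one tool; the broker address and port 5671/amqps come from this report):

```python
AMQPS_PORT = "5671"  # amqps, the port the computes use to reach the qpid broker

def close_wait_amqp_sockets(ss_output):
    """Return peer addresses of AMQP connections stuck in CLOSE_WAIT.

    Expects lines in `ss -tan` style:
    state recv-q send-q local-addr:port peer-addr:port
    """
    stuck = []
    for line in ss_output.splitlines():
        fields = line.split()
        if (len(fields) >= 5 and fields[0] == "CLOSE-WAIT"
                and fields[4].endswith(":" + AMQPS_PORT)):
            stuck.append(fields[4])
    return stuck

# Sample line mirroring the connection seen on compute-0-1 above:
sample = "CLOSE-WAIT 1 0 10.1.255.252:49136 10.1.20.151:5671"
print(close_wait_amqp_sockets(sample))  # ['10.1.20.151:5671']
```

A non-empty result on a compute, combined with a missing connection on the qpid side (the qpid-stat check above), matches the half-closed state described here.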

Comment 5 Perry Myers 2015-06-10 11:10:35 UTC
@kgiusti, can you take a look at this bug?

Comment 6 Flavio Percoco 2015-06-10 11:37:58 UTC
Looking at the bug and patch, it seems this might be a duplicate of:

https://bugzilla.redhat.com/show_bug.cgi?id=1188304

The above bug was fixed upstream and the patch backported for RHOS6. Would it be possible to test this patch and see if it fixes the issue in this bug?

Comment 7 Pablo Iranzo Gómez 2015-06-10 12:08:55 UTC
Flavio,
Asking the customer to test python-oslo-messaging-1.4.1-4.el7ost, and setting needinfo on me to report back.

Thanks
Pablo

Comment 13 Flavio Percoco 2015-08-19 07:46:59 UTC
Do we have feedback from the customer on whether python-oslo-messaging-1.4.1-4.el7ost helped mitigate the problem?

Comment 14 Pablo Iranzo Gómez 2015-08-19 08:03:13 UTC
Yes Flavio,
The customer said the problem was solved by the latest oslo.messaging.

Thanks,
Pablo

Comment 15 Flavio Percoco 2015-08-19 08:32:14 UTC

*** This bug has been marked as a duplicate of bug 1188304 ***

