Note: This bug is displayed in read-only format because the product is no longer active in Red Hat Bugzilla.

Bug 1230134

Summary: Nova computes agents lost AMQP connections
Product: Red Hat OpenStack
Reporter: Pablo Iranzo Gómez <pablo.iranzo>
Component: python-oslo-messaging
Assignee: Flavio Percoco <fpercoco>
Status: CLOSED DUPLICATE
QA Contact: nlevinki <nlevinki>
Severity: high
Docs Contact:
Priority: unspecified
Version: 6.0 (Juno)
CC: apevec, fpercoco, kgiusti, lhh, nyechiel, pablo.iranzo, yeylon
Target Milestone: ---
Keywords: ZStream
Target Release: 7.0 (Kilo)
Hardware: x86_64
OS: Linux
Whiteboard:
Fixed In Version:
Doc Type: Bug Fix
Doc Text:
Story Points: ---
Clone Of:
Environment:
Last Closed: 2015-08-19 08:32:14 UTC
Type: Bug
Regression: ---
Mount Type: ---
Documentation: ---
CRM:
Verified Versions:
Category: ---
oVirt Team: ---
RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: ---
Target Upstream Version:
Embargoed:
Bug Depends On:
Bug Blocks: 1214764

Description Pablo Iranzo Gómez 2015-06-10 10:33:14 UTC
Description:

1. The main problem is that after a random amount of time we lose the ability to start new VMs. The new VMs end up in ERROR state.

2. at this time nova service-list shows both computes down:
| 7  | nova-compute     | compute-0-0.local | zone0    | enabled | down  | 2015-06-04T07:29:49.000000 | -               |
| 8  | nova-compute     | compute-0-1.local | zone1    | enabled | down  | 2015-06-04T09:46:30.000000 | -               |

3. later on, without restarting any service, one compute came back online:
| 7  | nova-compute     | compute-0-0.local | zone0    | enabled | down  | 2015-06-04T07:29:49.000000 | -               |
| 8  | nova-compute     | compute-0-1.local | zone1    | enabled | up    | 2015-06-04T11:16:29.000000 | -               |

Looks like this blocks creation of new VMs. 
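The service state checks above can be scripted. The following is an illustrative sketch, not part of the original report: a hypothetical helper that parses `nova service-list`-style table rows and flags compute services reported as down, assuming the column layout shown in the output above.

```python
# Sketch (assumption, not from the report): flag down nova-compute services
# given rows in the table format printed by `nova service-list`.

def down_computes(table_rows):
    """Return hostnames of nova-compute services whose State column is 'down'."""
    hosts = []
    for row in table_rows:
        # Strip the outer pipes, then split into columns:
        # Id, Binary, Host, Zone, Status, State, Updated_at, Reason
        cols = [c.strip() for c in row.strip().strip("|").split("|")]
        if len(cols) >= 6 and cols[1] == "nova-compute" and cols[5] == "down":
            hosts.append(cols[2])
    return hosts

rows = [
    "| 7  | nova-compute     | compute-0-0.local | zone0    | enabled | down  | 2015-06-04T07:29:49.000000 | -               |",
    "| 8  | nova-compute     | compute-0-1.local | zone1    | enabled | up    | 2015-06-04T11:16:29.000000 | -               |",
]
print(down_computes(rows))  # ['compute-0-0.local']
```

A monitoring job built on this idea would alert as soon as a compute drops out, rather than waiting for VM creation to fail.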

Looking around: 
1. qpid looks active, and all the other services are active and working at the same time
2. on qpid we see connections from all the other services, but the connection from the compute that is down is missing;
     this is how I check this:
                qpid-stat -c | grep 10.1.255.251
                qpid-stat -c | grep 10.1.255.252 
3. on the affected compute I still see a TCP connection in CLOSE_WAIT state:
     compute-0-1.local:49136->10.1.20.151:amqps (CLOSE_WAIT)

     CLOSE_WAIT means that qpid closed the connection (not the compute), but the compute hasn't handled it yet;
     it's probably qpid that saw the compute as down and closed the connection.
4. The CLOSE_WAITs eventually got closed on compute-0-1, and we can create VMs there. But now we have the CLOSE_WAITs on the second compute and we cannot start VMs there. Most likely these CLOSE_WAITs stay there for roughly an hour; after they are closed, we will again be able to start VMs.
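The CLOSE_WAIT check in steps 3 and 4 above can be sketched as a small filter over connection listings (e.g. from `lsof -i` or `ss`). This is an illustrative sketch, not part of the original report; the function name and the AMQPS service label are assumptions based on the output quoted above.

```python
# Sketch (assumption, not from the report): find connections to the message
# broker that are stuck in CLOSE_WAIT, i.e. the peer (qpidd) closed its end
# but the local process (nova-compute) has not yet closed the socket.

def stale_amqp_connections(conn_lines, broker_service="amqps"):
    """Return lines describing broker connections left in CLOSE_WAIT."""
    return [
        line for line in conn_lines
        if "CLOSE_WAIT" in line and broker_service in line
    ]

# Lines similar to those quoted in the report:
sample = [
    "compute-0-1.local:49136->10.1.20.151:amqps (CLOSE_WAIT)",
    "compute-0-1.local:51200->10.1.20.151:amqps (ESTABLISHED)",
]
print(stale_amqp_connections(sample))
```

A non-empty result on a compute node would match the symptom described here: the broker has dropped the connection, but the agent still holds the dead socket and misses AMQP traffic until it is cleaned up.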

Comment 5 Perry Myers 2015-06-10 11:10:35 UTC
@kgiusti, can you take a look at this bug?

Comment 6 Flavio Percoco 2015-06-10 11:37:58 UTC
Looking at the bug and the patch, it seems this might be a duplicate of this bug:

https://bugzilla.redhat.com/show_bug.cgi?id=1188304

The above bug was fixed upstream and the patch was backported to RHOS 6. Would it be possible to test this patch and see whether it fixes the issue reported here?

Comment 7 Pablo Iranzo Gómez 2015-06-10 12:08:55 UTC
Flavio,
Asking customer to test python-oslo-messaging-1.4.1-4.el7ost, and setting needinfo on me for reporting back.

Thanks
Pablo

Comment 13 Flavio Percoco 2015-08-19 07:46:59 UTC
Do we have feedback from the customer on whether python-oslo-messaging-1.4.1-4.el7ost helped mitigate the problem?

Comment 14 Pablo Iranzo Gómez 2015-08-19 08:03:13 UTC
Yes Flavio,
The customer said the problem was solved by the latest oslo.messaging build.

Thanks,
Pablo

Comment 15 Flavio Percoco 2015-08-19 08:32:14 UTC

*** This bug has been marked as a duplicate of bug 1188304 ***