Bug 1230134
| Summary: | Nova compute agents lost AMQP connections | | |
|---|---|---|---|
| Product: | Red Hat OpenStack | Reporter: | Pablo Iranzo Gómez <pablo.iranzo> |
| Component: | python-oslo-messaging | Assignee: | Flavio Percoco <fpercoco> |
| Status: | CLOSED DUPLICATE | QA Contact: | nlevinki <nlevinki> |
| Severity: | high | Docs Contact: | |
| Priority: | unspecified | | |
| Version: | 6.0 (Juno) | CC: | apevec, fpercoco, kgiusti, lhh, nyechiel, pablo.iranzo, yeylon |
| Target Milestone: | --- | Keywords: | ZStream |
| Target Release: | 7.0 (Kilo) | | |
| Hardware: | x86_64 | | |
| OS: | Linux | | |
| Whiteboard: | | | |
| Fixed In Version: | | Doc Type: | Bug Fix |
| Doc Text: | | Story Points: | --- |
| Clone Of: | | Environment: | |
| Last Closed: | 2015-08-19 08:32:14 UTC | Type: | Bug |
| Regression: | --- | Mount Type: | --- |
| Documentation: | --- | CRM: | |
| Verified Versions: | | Category: | --- |
| oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: | |
| Cloudforms Team: | --- | Target Upstream Version: | |
| Embargoed: | | | |
| Bug Depends On: | | | |
| Bug Blocks: | 1214764 | | |
@kgiusti, can you take a look at this bug? Looking at the bug and the patch, it seems that it might be a duplicate of this bug: https://bugzilla.redhat.com/show_bug.cgi?id=1188304 The above bug was fixed upstream and the patch was backported for RHOS 6. Would it be possible to test this patch and see if it fixes the issue in this bug?

Flavio, I am asking the customer to test python-oslo-messaging-1.4.1-4.el7ost and setting needinfo on me for reporting back. Thanks, Pablo

Do we have feedback from the customer on whether python-oslo-messaging-1.4.1-4.el7ost helped mitigate the problem?

Yes Flavio, the customer said that the problem was solved by the latest oslo.messaging. Thanks, Pablo

*** This bug has been marked as a duplicate of bug 1188304 ***
Description:

1. The main problem is that after a random amount of time we lose the ability to start new VMs. The new VMs end up in ERROR state.

2. At this time `nova service-list` shows both computes down:

   ```
   | 7 | nova-compute | compute-0-0.local | zone0 | enabled | down | 2015-06-04T07:29:49.000000 | - |
   | 8 | nova-compute | compute-0-1.local | zone1 | enabled | down | 2015-06-04T09:46:30.000000 | - |
   ```

3. Later on, without restarting any service, one compute came back online:

   ```
   | 7 | nova-compute | compute-0-0.local | zone0 | enabled | down | 2015-06-04T07:29:49.000000 | - |
   | 8 | nova-compute | compute-0-1.local | zone1 | enabled | up   | 2015-06-04T11:16:29.000000 | - |
   ```

This appears to block the creation of new VMs.

Looking around:

1. qpid looks active, and all the other services are active and working at the same time.

2. On qpid we see connections from all the other services, but the connections from the compute that is down are missing. This is how I check it:

   ```
   qpid-stat -c | grep 10.1.255.251
   qpid-stat -c | grep 10.1.255.252
   ```

3. On the affected compute I still see a TCP connection in CLOSE_WAIT state:

   ```
   compute-0-1.local:49136->10.1.20.151:amqps (CLOSE_WAIT)
   ```

   CLOSE_WAIT means that qpid closed the connection (not the compute) but the compute hasn't handled it yet; it is probably qpid that saw the compute as down and closed the connection.

4. The CLOSE_WAITs eventually got closed on C1, and we can create VMs on C1. But now we have the CLOSE_WAITs on the second compute and we cannot start VMs there. Most likely these CLOSE_WAITs stay there for ~1 hour; after they get closed, we will eventually be able to start VMs again.
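For reference, the CLOSE_WAIT check from step 3 can be scripted rather than eyeballed. The sketch below is a hypothetical helper (it is not part of nova, oslo.messaging, or the qpid tooling): it parses the output of `ss -tan` and reports connections to the AMQP ports (5672/amqp, 5671/amqps) that are sitting in CLOSE-WAIT, which is the symptom seen on the affected compute nodes.

```python
# Hypothetical diagnostic helper: find AMQP connections stuck in
# CLOSE-WAIT in the output of `ss -tan` on a compute node.

AMQP_PORTS = {"5672", "5671"}  # amqp / amqps


def find_close_wait(ss_output, ports=AMQP_PORTS):
    """Return (local, peer) address pairs for CLOSE-WAIT AMQP connections.

    `ss_output` is the text produced by `ss -tan`, whose columns are:
    State, Recv-Q, Send-Q, Local Address:Port, Peer Address:Port.
    """
    stuck = []
    for line in ss_output.splitlines():
        fields = line.split()
        if len(fields) < 5 or fields[0] != "CLOSE-WAIT":
            continue
        local, peer = fields[3], fields[4]
        # Match on the peer (broker-side) port, since the broker is
        # the end that closed the connection.
        if peer.rsplit(":", 1)[-1] in ports:
            stuck.append((local, peer))
    return stuck


# Sample `ss -tan` output modelled on the addresses from this report:
sample = """\
State       Recv-Q Send-Q Local Address:Port    Peer Address:Port
ESTAB       0      0      10.1.255.252:22       10.1.20.1:51044
CLOSE-WAIT  1      0      10.1.255.252:49136    10.1.20.151:5671
"""
print(find_close_wait(sample))
# -> [('10.1.255.252:49136', '10.1.20.151:5671')]
```

On a live node you would feed it the real output, e.g. `find_close_wait(subprocess.run(["ss", "-tan"], capture_output=True, text=True).stdout)`. Note that `ss` prints the state as `CLOSE-WAIT` with a hyphen, while `lsof` (as in the listing above) prints `CLOSE_WAIT`.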