Bug 1753264
| Summary: | [OSP10] Rabbitmq operation impacted with: error operation basic.publish caused a channel exception not_found: "no exchange". | | |
|---|---|---|---|
| Product: | Red Hat OpenStack | Reporter: | ggrimaux |
| Component: | rabbitmq-server | Assignee: | John Eckersberg <jeckersb> |
| Status: | CLOSED NOTABUG | QA Contact: | pkomarov |
| Severity: | urgent | Docs Contact: | |
| Priority: | urgent | | |
| Version: | 10.0 (Newton) | CC: | apevec, dhill, dvd, jeckersb, lhh, lmiccini, sandyada |
| Target Milestone: | --- | Keywords: | Triaged, ZStream |
| Target Release: | --- | | |
| Hardware: | x86_64 | | |
| OS: | Linux | | |
| Whiteboard: | | | |
| Fixed In Version: | | Doc Type: | If docs needed, set a value |
| Doc Text: | | Story Points: | --- |
| Clone Of: | | Environment: | |
| Last Closed: | 2020-02-18 16:05:16 UTC | Type: | Bug |
| Regression: | --- | Mount Type: | --- |
| Documentation: | --- | CRM: | |
| Verified Versions: | | Category: | --- |
| oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: | |
| Cloudforms Team: | --- | Target Upstream Version: | |
| Embargoed: | | | |
Description

ggrimaux 2019-09-18 13:49:07 UTC

The problem is on the client side; restarting the RabbitMQ nodes is unlikely to help and may make things worse if more client connections transition into the same broken state. Two recent bugs come to mind that this problem description resembles:

- https://bugzilla.redhat.com/show_bug.cgi?id=1740681 : python-amqp does not handle socket timeouts correctly when SSL is in use. If SSL is used between the OpenStack services and RabbitMQ, this is possibly the solution.
- https://bugzilla.redhat.com/show_bug.cgi?id=1733930 : nova-compute can miss its periodic check-in because the event loop blocks while communicating with libvirt. If this is the case, there should be evidence in the nova-compute logs of periodic jobs taking abnormally long.

I just hit this issue in another case, and restarting rabbitmq-clone solved the problem. From the RabbitMQ logs, I can see that controller-0 saw controller-1 die and come back, controller-1 saw controller-0 and controller-2 die and come back, and controller-2 saw controller-1 die and come back. Right after that, we can see the missing-exchange errors:

```
=ERROR REPORT==== 4-Oct-2019::10:57:22 ===
Channel error on connection <0.7719.741> (10.111.92.34:42380 -> 10.111.92.34:5672, vhost: '/', user: 'guest'), channel 1:
operation basic.publish caused a channel exception not_found: "no exchange 'reply_efe6e2342dbb4322aeb6623f89dc265e' in vhost '/'"
```

Various services also complain about AMQP timeouts:

```
2019-10-04 13:15:08.395 4783 ERROR heat.common.wsgi MessagingTimeout: Timed out waiting for a reply to message ID f3456c0dd42c4c0c823aade12ee0465d
```

Closing out old bugs. IIUC from the case notes, this was happening because an extra compute node was powered on that should not have been, and the configuration state of that node may not have been correct, which led to this behavior.
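When triaging this class of failure, it helps to pull the exchange name out of the `not_found` channel-exception lines so it can be correlated with the client (the `reply_<uuid>` exchange belongs to the RPC caller whose reply queue disappeared). A minimal sketch of such a log scanner, using the error line quoted above; the regex and the `parse_channel_error` helper are illustrative choices of mine, not part of any RabbitMQ tooling:

```python
import re

# Matches the "no exchange" channel-exception text as it appears in the
# RabbitMQ log excerpt above. Group names are my own, chosen for clarity.
CHANNEL_ERROR_RE = re.compile(
    r'operation (?P<operation>\S+) caused a channel exception '
    r'(?P<error>\w+): "no exchange \'(?P<exchange>[^\']+)\' '
    r'in vhost \'(?P<vhost>[^\']+)\'"'
)

def parse_channel_error(line):
    """Return operation, error code, exchange, and vhost from a log line,
    or None if the line is not a missing-exchange channel error."""
    m = CHANNEL_ERROR_RE.search(line)
    return m.groupdict() if m else None

line = ('operation basic.publish caused a channel exception not_found: '
        '"no exchange \'reply_efe6e2342dbb4322aeb6623f89dc265e\' '
        'in vhost \'/\'"')
info = parse_channel_error(line)
print(info["exchange"])  # reply_efe6e2342dbb4322aeb6623f89dc265e
```

Running this over the broker logs gives the set of orphaned reply exchanges, which can then be matched against the services reporting `MessagingTimeout` for the corresponding message IDs.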