Description of problem:

Client is experiencing issues with rabbitmq. Looking at the logs we get a lot of:

=ERROR REPORT==== 18-Sep-2019::09:59:52 ===
Channel error on connection <0.31231.0> (172.16.64.62:42990 -> 172.16.64.62:5672, vhost: '/', user: 'guest'), channel 1:
operation basic.publish caused a channel exception not_found: "no exchange 'reply_3c353a3f9435434984cc955e238b8445' in vhost '/'"

=ERROR REPORT==== 18-Sep-2019::09:59:52 ===
Channel error on connection <0.19397.3> (172.16.64.53:42166 -> 172.16.64.62:5672, vhost: '/', user: 'guest'), channel 1:
operation basic.publish caused a channel exception not_found: "no exchange 'reply_1f34795e604545a6be30d0230345a379' in vhost '/'"

=ERROR REPORT==== 18-Sep-2019::09:59:52 ===
Channel error on connection <0.120.2> (172.16.64.62:43880 -> 172.16.64.62:5672, vhost: '/', user: 'guest'), channel 1:
operation basic.publish caused a channel exception not_found: "no exchange 'reply_65d2683e16e84dd594d1c3ed72595bf1' in vhost '/'"

=ERROR REPORT==== 18-Sep-2019::09:59:52 ===
Channel error on connection <0.31810.1> (172.16.64.62:44254 -> 172.16.64.62:5672, vhost: '/', user: 'guest'), channel 1:
operation basic.publish caused a channel exception not_found: "no exchange 'reply_5018bc75c0bb491d89fd2ad42177d5c6' in vhost '/'"

=ERROR REPORT==== 18-Sep-2019::09:59:52 ===
Channel error on connection <0.28142.1> (172.16.64.62:43092 -> 172.16.64.62:5672, vhost: '/', user: 'guest'), channel 1:
operation basic.publish caused a channel exception not_found: "no exchange 'reply_b893f57294814f73b20035b9075d2fbb' in vhost '/'"

=ERROR REPORT==== 18-Sep-2019::09:59:52 ===
Channel error on connection <0.18497.3> (172.16.64.61:34430 -> 172.16.64.62:5672, vhost: '/', user: 'guest'), channel 1:
operation basic.publish caused a channel exception not_found: "no exchange 'reply_c85d8a4057274383a93b1149adcdce34' in vhost '/'"

We see this over and over. RabbitMQ was restarted, Mnesia was cleared, and even all 3 controllers were restarted; the errors are still happening. We need engineering assistance with this.

Version-Release number of selected component (if applicable):
rabbitmq-server-3.6.3-6.el7ost.noarch

How reproducible:
N/A

Steps to Reproduce:
1. N/A

Actual results:
RabbitMQ performance issues due to the errors.

Expected results:
The source of these errors is found.

Additional info:
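For context on what the error itself means: the reply_<uuid> names are the per-call reply exchanges/queues that the oslo.messaging RPC caller declares for itself. They go away when the caller's consumer/connection goes away, so an RPC server publishing a (late) reply afterwards hits exactly this 404 not_found channel exception. Below is a minimal sketch of that failure mode, using pika directly rather than the kombu driver the services actually use; the host, credentials and the 'reply_deadbeef' name are placeholders:

import pika

params = pika.ConnectionParameters(
    host='172.16.64.62',
    credentials=pika.PlainCredentials('guest', 'guest'))

# "RPC caller" side: declare a private reply exchange/queue, then disconnect.
# (exclusive=True guarantees removal on disconnect for this illustration;
# oslo.messaging uses auto-delete reply queues, with the same end result.)
caller = pika.BlockingConnection(params)
ch = caller.channel()
ch.exchange_declare(exchange='reply_deadbeef', exchange_type='direct',
                    auto_delete=True)
ch.queue_declare(queue='reply_deadbeef', exclusive=True)
ch.queue_bind(queue='reply_deadbeef', exchange='reply_deadbeef',
              routing_key='reply_deadbeef')
caller.close()   # queue and its binding go away; the auto-delete exchange follows

# "RPC server" side: a late reply to the now-missing exchange triggers the
# same channel exception seen in the rabbitmq logs.
server = pika.BlockingConnection(params)
sch = server.channel()
sch.confirm_delivery()   # make the failure surface synchronously
try:
    sch.basic_publish(exchange='reply_deadbeef',
                      routing_key='reply_deadbeef', body=b'late reply')
except pika.exceptions.ChannelClosedByBroker as exc:
    # e.g. (404, "NOT_FOUND - no exchange 'reply_deadbeef' in vhost '/'")
    print(exc)

The open question for this bug is why the callers' reply queues keep disappearing while the servers still believe the calls are in flight.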
The problem is on the client side; restarting the rabbitmq nodes is unlikely to help and may make things worse if more client connections transition into the same broken state.

Two recent bugs off the top of my head that this problem description reminds me of:

https://bugzilla.redhat.com/show_bug.cgi?id=1740681 - python-amqp does not handle socket timeouts correctly when SSL is in use. If SSL is used between the OpenStack services and rabbitmq, this is possibly the solution.

https://bugzilla.redhat.com/show_bug.cgi?id=1733930 - nova-compute can miss its periodic check-in because its event loop is blocked while communicating with libvirt. If this is the case, there should be evidence in the nova-compute logs that periodic jobs are taking an abnormally long time.
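If the second bug is the suspect, a quick way to get that evidence out of nova-compute.log is to look for abnormally long gaps between consecutive periodic-task log lines. A rough sketch, assuming the standard OpenStack log timestamp format and that either DEBUG 'Running periodic task' lines or the oslo.service 'run outlasted interval by' warning are present; the path, marker strings and threshold are assumptions to adjust per environment:

import re
from datetime import datetime

LOG = '/var/log/nova/nova-compute.log'          # placeholder path
MARKER = re.compile(r'Running periodic task')   # assumed DEBUG marker
TS = re.compile(r'^(\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2})\.\d+')
THRESHOLD = 60.0                                # seconds between runs; tune as needed

prev = None
with open(LOG) as fh:
    for line in fh:
        if 'run outlasted interval by' in line:
            print(line.rstrip())                # oslo.service already flagged a slow loop
        if not MARKER.search(line):
            continue
        m = TS.match(line)
        if not m:
            continue
        ts = datetime.strptime(m.group(1), '%Y-%m-%d %H:%M:%S')
        if prev and (ts - prev).total_seconds() > THRESHOLD:
            print('%.0fs gap before: %s' % ((ts - prev).total_seconds(), line.rstrip()))
        prev = ts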
I just hit this issue in another case and restarting rabbitmq-clone solved the problem.

From the rabbitmq logs I can see controller-0 saw controller-1 die and come back, controller-1 saw controller-0 and controller-2 die and come back, and controller-2 saw controller-1 die and come back. Right after that, we can see the missing exchange errors:

=ERROR REPORT==== 4-Oct-2019::10:57:22 ===
Channel error on connection <0.7719.741> (10.111.92.34:42380 -> 10.111.92.34:5672, vhost: '/', user: 'guest'), channel 1:
operation basic.publish caused a channel exception not_found: "no exchange 'reply_efe6e2342dbb4322aeb6623f89dc265e' in vhost '/'"

and various services complaining about AMQP timeout issues:

2019-10-04 13:15:08.395 4783 ERROR heat.common.wsgi MessagingTimeout: Timed out waiting for a reply to message ID f3456c0dd42c4c0c823aade12ee0465d
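When the nodes see each other die and come back like that, it is also worth confirming whether the cluster ended up partitioned before the restart cleared things up. A small sketch against the RabbitMQ management HTTP API (rabbitmqctl cluster_status gives the same information from the CLI), assuming the management plugin is enabled; host, port and the guest credentials are placeholders:

import requests   # python-requests, assumed available on the host

# Placeholders: management port 15672 and the default guest credentials.
url = 'http://10.111.92.34:15672/api/nodes'
for node in requests.get(url, auth=('guest', 'guest')).json():
    print('%s running=%s partitions=%s'
          % (node['name'], node.get('running'), node.get('partitions')))

# A non-empty "partitions" list means that node still considers itself
# partitioned from the listed peers; an empty list on every node means any
# partition has healed and only the orphaned reply_* state remained.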
Closing out old bugs. IIUC from the case notes, this was happening because an extra compute node was powered on that should not have been, and the configuration state of that node may not have been correct, which led to this behavior.