Note: This bug is displayed in read-only format because the product is no longer active in Red Hat Bugzilla.

Bug 1753264

Summary: [OSP10] Rabbitmq operation impacted with: error operation basic.publish caused a channel exception not_found: "no exchange".
Product: Red Hat OpenStack
Reporter: ggrimaux
Component: rabbitmq-server
Assignee: John Eckersberg <jeckersb>
Status: CLOSED NOTABUG
QA Contact: pkomarov
Severity: urgent
Docs Contact:
Priority: urgent
Version: 10.0 (Newton)
CC: apevec, dhill, dvd, jeckersb, lhh, lmiccini, sandyada
Target Milestone: ---
Keywords: Triaged, ZStream
Target Release: ---
Hardware: x86_64
OS: Linux
Whiteboard:
Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Story Points: ---
Clone Of:
Environment:
Last Closed: 2020-02-18 16:05:16 UTC
Type: Bug
Regression: ---
Mount Type: ---
Documentation: ---
CRM:
Verified Versions:
Category: ---
oVirt Team: ---
RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: ---
Target Upstream Version:
Embargoed:

Description ggrimaux 2019-09-18 13:49:07 UTC
Description of problem:
The client is experiencing issues with RabbitMQ.

Looking at the logs, we see many errors like:

=ERROR REPORT==== 18-Sep-2019::09:59:52 ===
Channel error on connection <0.31231.0> (172.16.64.62:42990 -> 172.16.64.62:5672, vhost: '/', user: 'guest'), channel 1:
operation basic.publish caused a channel exception not_found: "no exchange 'reply_3c353a3f9435434984cc955e238b8445' in vhost '/'"

=ERROR REPORT==== 18-Sep-2019::09:59:52 ===
Channel error on connection <0.19397.3> (172.16.64.53:42166 -> 172.16.64.62:5672, vhost: '/', user: 'guest'), channel 1:
operation basic.publish caused a channel exception not_found: "no exchange 'reply_1f34795e604545a6be30d0230345a379' in vhost '/'"

=ERROR REPORT==== 18-Sep-2019::09:59:52 ===
Channel error on connection <0.120.2> (172.16.64.62:43880 -> 172.16.64.62:5672, vhost: '/', user: 'guest'), channel 1:
operation basic.publish caused a channel exception not_found: "no exchange 'reply_65d2683e16e84dd594d1c3ed72595bf1' in vhost '/'"

=ERROR REPORT==== 18-Sep-2019::09:59:52 ===
Channel error on connection <0.31810.1> (172.16.64.62:44254 -> 172.16.64.62:5672, vhost: '/', user: 'guest'), channel 1:
operation basic.publish caused a channel exception not_found: "no exchange 'reply_5018bc75c0bb491d89fd2ad42177d5c6' in vhost '/'"

=ERROR REPORT==== 18-Sep-2019::09:59:52 ===
Channel error on connection <0.28142.1> (172.16.64.62:43092 -> 172.16.64.62:5672, vhost: '/', user: 'guest'), channel 1:
operation basic.publish caused a channel exception not_found: "no exchange 'reply_b893f57294814f73b20035b9075d2fbb' in vhost '/'"

=ERROR REPORT==== 18-Sep-2019::09:59:52 ===
Channel error on connection <0.18497.3> (172.16.64.61:34430 -> 172.16.64.62:5672, vhost: '/', user: 'guest'), channel 1:
operation basic.publish caused a channel exception not_found: "no exchange 'reply_c85d8a4057274383a93b1149adcdce34' in vhost '/'"

We see this over and over.

RabbitMQ was restarted, the Mnesia database was cleared, and even all three controllers were restarted. The errors are still happening.

We need engineering assistance with this.
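
As a quick way to quantify the flood, the distinct missing reply exchanges can be aggregated from a log excerpt. This is a hypothetical helper (not part of the case), shown here against sample lines in the same shape as the errors above:

```python
import re
from collections import Counter

# Hypothetical log excerpt in the same shape as the errors above;
# in practice this would be read from the rabbitmq log file.
log = """
operation basic.publish caused a channel exception not_found: "no exchange 'reply_3c353a3f9435434984cc955e238b8445' in vhost '/'"
operation basic.publish caused a channel exception not_found: "no exchange 'reply_1f34795e604545a6be30d0230345a379' in vhost '/'"
operation basic.publish caused a channel exception not_found: "no exchange 'reply_3c353a3f9435434984cc955e238b8445' in vhost '/'"
"""

# Each oslo.messaging RPC client gets its own reply_<uuid> exchange,
# so the number of distinct names indicates how many clients are affected.
counts = Counter(re.findall(r"reply_[0-9a-f]{32}", log))
for name, n in counts.most_common():
    print(name, n)
```

Running this over the real log would show whether the errors come from a handful of stuck clients or from many.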

Version-Release number of selected component (if applicable):
rabbitmq-server-3.6.3-6.el7ost.noarch

How reproducible:
N/A

Steps to Reproduce:
1. N/A

Actual results:
RabbitMQ performance is impacted by the constant stream of these errors.

Expected results:
Find the source of those errors.

Additional info:

Comment 4 John Eckersberg 2019-09-18 20:32:11 UTC
The problem is on the client side; restarting the RabbitMQ nodes is unlikely to help and may make things worse if more client connections transition into the same broken state.

There are two recent bugs off the top of my head that this problem description reminds me of:

https://bugzilla.redhat.com/show_bug.cgi?id=1740681 - python-amqp does not handle socket timeouts correctly when SSL is in use. If SSL is used between the OpenStack services and RabbitMQ, this is possibly the solution.

https://bugzilla.redhat.com/show_bug.cgi?id=1733930 - nova-compute can miss its periodic check-in due to a blocked event loop while communicating with libvirt. If this is the case, there should be evidence in the nova-compute logs that periodic jobs are taking abnormally long to run.
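
For context, the mechanism behind these errors can be sketched with a toy broker model (purely illustrative, not RabbitMQ internals): each oslo.messaging RPC client declares its own auto-delete reply exchange; when the client's connection drops, the broker removes that exchange, and any server still trying to publish a reply gets the not_found channel exception:

```python
class ChannelError(Exception):
    pass

class ToyBroker:
    """Minimal stand-in for a broker's exchange table (illustrative only)."""

    def __init__(self):
        self.exchanges = {}

    def declare_exchange(self, name):
        # Done by the RPC client when it sets up its reply path.
        self.exchanges[name] = []

    def connection_lost(self, name):
        # Auto-delete: the reply exchange disappears with its declaring client.
        self.exchanges.pop(name, None)

    def publish(self, exchange, msg):
        if exchange not in self.exchanges:
            raise ChannelError(
                f"operation basic.publish caused a channel exception "
                f"not_found: \"no exchange '{exchange}' in vhost '/'\"")
        self.exchanges[exchange].append(msg)

broker = ToyBroker()
broker.declare_exchange("reply_3c353a3f9435434984cc955e238b8445")  # RPC client
broker.connection_lost("reply_3c353a3f9435434984cc955e238b8445")   # client drops
try:
    broker.publish("reply_3c353a3f9435434984cc955e238b8445", "rpc reply")
except ChannelError as e:
    print(e)  # same shape as the log lines above
```

This is why the fix has to be on the client side: until the broken client reconnects and re-declares its reply exchange, every reply the servers send it will fail this way.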

Comment 7 David Hill 2019-10-09 15:55:10 UTC
I just hit this issue in another case, and restarting rabbitmq-clone solved the problem. From the RabbitMQ logs, I can see that controller-0 saw controller-1 die and come back, controller-1 saw controller-0 and controller-2 die and come back, and controller-2 saw controller-1 die and come back.

Right after that, we can see the missing exchange errors:

=ERROR REPORT==== 4-Oct-2019::10:57:22 ===
Channel error on connection <0.7719.741> (10.111.92.34:42380 -> 10.111.92.34:5672, vhost: '/', user: 'guest'), channel 1:
operation basic.publish caused a channel exception not_found: "no exchange 'reply_efe6e2342dbb4322aeb6623f89dc265e' in vhost '/'"

and various services complaining about AMQP timeouts:

2019-10-04 13:15:08.395 4783 ERROR heat.common.wsgi MessagingTimeout: Timed out waiting for a reply to message ID f3456c0dd42c4c0c823aade12ee0465d
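
The timeout is the flip side of the same failure: the server's reply publish hit "no exchange" and was dropped, so the caller waits on a reply that never arrives. A minimal sketch of that wait (hypothetical, not the actual heat/oslo.messaging code):

```python
import queue

def wait_for_reply(reply_queue, msg_id, timeout=0.05):
    # The reply was never delivered (its publish failed with not_found),
    # so nothing ever lands on the reply queue and the caller gives up.
    try:
        return reply_queue.get(timeout=timeout)
    except queue.Empty:
        raise TimeoutError(
            f"Timed out waiting for a reply to message ID {msg_id}")

try:
    wait_for_reply(queue.Queue(), "f3456c0dd42c4c0c823aade12ee0465d")
except TimeoutError as e:
    print(e)
```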

Comment 8 John Eckersberg 2020-02-18 16:05:16 UTC
Closing out old bugs. IIUC from the case notes, this was happening because an extra compute node was powered on that should not have been, and the configuration state of that node may not have been correct, which led to this behavior.