Bug 1399237
| Summary: | Channel error on connection: operation basic.publish caused a channel exception not_found | | |
|---|---|---|---|
| Product: | Red Hat OpenStack | Reporter: | Marian Krcmarik <mkrcmari> |
| Component: | rabbitmq-server | Assignee: | John Eckersberg <jeckersb> |
| Status: | CLOSED INSUFFICIENT_DATA | QA Contact: | Udi Shkalim <ushkalim> |
| Severity: | urgent | Docs Contact: | |
| Priority: | urgent | | |
| Version: | 10.0 (Newton) | CC: | apevec, cylopez, dhill, fdinitto, igortiunov, ipetrova, jeckersb, lhh, lmiccini, mkrcmari, oblaut, sbandyop, srevivo |
| Target Milestone: | --- | Keywords: | Reopened, Triaged, ZStream |
| Target Release: | 10.0 (Newton) | | |
| Hardware: | Unspecified | | |
| OS: | Unspecified | | |
| Whiteboard: | | | |
| Fixed In Version: | | Doc Type: | If docs needed, set a value |
| Doc Text: | | Story Points: | --- |
| Clone Of: | | Environment: | |
| Last Closed: | 2019-07-24 06:18:29 UTC | Type: | Bug |
| Regression: | --- | Mount Type: | --- |
| Documentation: | --- | CRM: | |
| Verified Versions: | | Category: | --- |
| oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: | |
| Cloudforms Team: | --- | Target Upstream Version: | |
| Embargoed: | | | |
Description
Marian Krcmarik
2016-11-28 15:47:48 UTC
I can easily reproduce error messages of this type in the rabbitmq log:

=ERROR REPORT==== 2-Dec-2016::17:12:49 ===
Channel error on connection <0.845.0> (10.35.169.18:60994 -> 10.35.169.13:5672, vhost: '/', user: 'guest'), channel 1:
operation basic.publish caused a channel exception not_found: "no exchange 'reply_fc5a7174b2fe4c2089c6f1fdf28826b8' in vhost '/'"

=ERROR REPORT==== 2-Dec-2016::17:12:49 ===
Channel error on connection <0.908.0> (10.35.169.18:32778 -> 10.35.169.13:5672, vhost: '/', user: 'guest'), channel 1:
operation basic.publish caused a channel exception not_found: "no exchange 'reply_c243451013434ab49080ba877d386f7a' in vhost '/'"

=ERROR REPORT==== 2-Dec-2016::17:12:49 ===
Channel error on connection <0.845.0> (10.35.169.18:60994 -> 10.35.169.13:5672, vhost: '/', user: 'guest'), channel 1:
operation basic.publish caused a channel exception not_found: "no exchange 'reply_fc5a7174b2fe4c2089c6f1fdf28826b8' in vhost '/'"

I do not know how severe these messages are, or whether they are a consequence of some other problem, but I can see many of them on multiple setups (both IPv6- and IPv4-based deployments). They appear even though no failover actions are being performed. If needed I can provide a deployment for debugging.

(In reply to Marian Krcmarik from comment #2)
> If needed I can provide a deployment for debugging.

Please! If you have one handy that reproduces, it would be a big help.

I tried to reproduce the bug on the latest builds and was not as "successful" as two months ago, and I am not sure why. I can still see messages such as:

operation basic.publish caused a channel exception not_found: "no exchange 'reply_fc5a7174b2fe4c2089c6f1fdf28826b8' in vhost '/'"

but only very rarely, and I cannot see any rabbitmq disconnections as before. I am going to close the bug for now and reopen it if the described behaviour becomes reproducible again.

Hi guys,

We are currently hitting this issue. I updated python-oslo-messaging and rabbitmq-server to the latest RHOSP 10 versions and we are still hitting it. Do you know which packages/processes caused this issue?

Thank you very much,
David Hill

And this might be fixed by this commit:

[1] https://review.openstack.org/#/c/315700/8/oslo_messaging/_drivers/impl_rabbit.py

Testing this patch [1]:
--- impl_rabbit.py.orig 2016-11-22 11:49:09.000000000 +0000
+++ impl_rabbit.py      2017-09-15 23:29:44.716985459 +0000
@@ -1067,11 +1067,21 @@
             if not self.connection.connected:
                 raise self.connection.recoverable_connection_errors[0]

-            while self._new_tags:
+            consume_max_retries = 2
+            while self._new_tags and consume_max_retries:
                 for consumer, tag in self._consumers.items():
                     if tag in self._new_tags:
-                        consumer.consume(self, tag=tag)
-                        self._new_tags.remove(tag)
+                        try:
+                            consumer.consume(tag=tag)
+                            self._new_tags.remove(tag)
+                        except self.connection.channel_errors as exc:
+                            if exc.code == 404 and consume_max_retries:
+                                consumer.declare(self)
+                                self._new_tags = set(self._consumers.values())
+                                consume_max_retries -= 1
+                                break
+                            else:
+                                raise

             poll_timeout = (self._poll_timeout if timeout is None
                             else min(timeout, self._poll_timeout))
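For clarity about what the hunk above is doing: it wraps the consume call so that a 404 (not_found) channel error leads to re-declaring the consumers and retrying a bounded number of times, instead of failing outright. The following self-contained sketch mirrors that control flow with stand-in classes; the names (`ChannelError`, `FakeConsumer`, `start_consumers`) are illustrative only and are not the oslo.messaging or kombu API.

```python
# Illustrative stand-ins only -- not the oslo.messaging/kombu API.
class ChannelError(Exception):
    def __init__(self, code):
        super().__init__(f"channel error {code}")
        self.code = code


class FakeConsumer:
    """Pretends the broker deleted our queue/exchange until re-declared."""
    def __init__(self):
        self.declared = False

    def declare(self):
        self.declared = True  # re-create queue/exchange on the broker

    def consume(self, tag):
        if not self.declared:
            raise ChannelError(404)  # not_found, as in the bug report
        print(f"consuming with tag {tag}")


def start_consumers(consumers, new_tags):
    """Same control flow as the patch: on a 404, re-declare and retry."""
    consume_max_retries = 2
    while new_tags and consume_max_retries:
        for consumer, tag in consumers.items():
            if tag in new_tags:
                try:
                    consumer.consume(tag=tag)
                    new_tags.remove(tag)
                except ChannelError as exc:
                    if exc.code == 404 and consume_max_retries:
                        consumer.declare()
                        new_tags = set(consumers.values())
                        consume_max_retries -= 1
                        break
                    else:
                        raise


if __name__ == "__main__":
    c = FakeConsumer()
    start_consumers({c: "tag-1"}, {"tag-1"})
```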
The patch doesn't seem to change anything in this case.

This error is not totally unexpected under normal conditions. Here is roughly how it happens:

- Service A creates an exchange/queue pair reply_XYZ for receiving RPC responses. These are created with auto_delete=true.
- A makes an RPC call to service B.
- B receives the RPC request from A and does some work.
- A stops or is disconnected from RabbitMQ for $reasons.
- RabbitMQ notices there are zero consumers on reply_XYZ and auto-deletes the queue. Because the exchange is no longer bound to any queue, it is also auto-deleted.
- B finishes its work and attempts to publish to exchange reply_XYZ, but it is gone.
- B retries for some time (offhand, I believe once per second for 60 seconds), hoping that A reconnects and re-declares the queue/exchange. During this time, you will get this error repeatedly.
- Eventually either (a) A comes back, re-declares reply_XYZ, and B is able to send the reply, or (b) B gives up after the timeout and stops trying to send the reply.

So it is not necessarily indicative of a problem; it is supposed to work that way in the general case. However, it *could* be indicative of a problem, probably related to a partition, either ongoing or at some point in the past. There is no hard rule to say for sure; you have to examine the context on a case-by-case basis.

Hopefully that helps clarify. This may also be the same thing as bug 1484543, if it persists for extended periods of time (more than a minute).

Hi,

Currently auto_delete is not enabled on queues (fanout/reply_) by default. Check this: https://bugs.launchpad.net/oslo.messaging/+bug/1495568

The RH Configuration Reference for RHOSP 10 does not describe this option, but for RHOSP 11 it is described:

https://access.redhat.com/documentation/en-us/red_hat_openstack_platform/11/html-single/configuration_reference/

rabbit_transient_queues_ttl = 1800 (30 mins by default)

So the error occurs when messages/queues are deleted by their time to live (TTL).

Hello,

On newton-eol upstream, we have this feature:
https://github.com/openstack/oslo.messaging/blob/newton-eol/oslo_messaging/_drivers/impl_rabbit.py#L468

I guess we got it in OSP 10.

(In reply to ITD27M01 from comment #17)
> Currently auto_delete is not enabled on queues (fanout/reply_) by default. [...]

(In reply to Cyril Lopez from comment #18)
> On newton-eol upstream, we have this feature: [...]
> I guess we got it in OSP 10.

Yes, but this is not covered by the documentation, although it is used by default in OSP 10.
We hit this issue when maintenance was done on a network switch and bonding didn't fail over to the other NIC. All controllers got isolated (split brain on each host), and rabbitmq then started exhibiting this issue:

=ERROR REPORT==== 28-Apr-2018::01:30:48 ===
Channel error on connection <0.10376.0> (192.168.5.7:49702 -> 192.168.5.7:5672, vhost: '/', user: 'guest'), channel 1:
operation basic.publish caused a channel exception not_found: "no exchange 'reply_8122e6beca9543f5b4860ad80e3619fe' in vhost '/'"

Restarting rabbitmq-clone solved the problem by forcing the three rabbitmq cluster members to restart, which triggered all stuck connections, broken services, etc. to reconnect to the AMQP service.

vhost errors (among others) also appear in case #02135077:

Channel error on connection <0.17098.7> (...:57542 -> ...:5672, vhost: '/', user: 'guest'), channel 1:
Channel error on connection <0.17098.7> (...:57542 -> ...:5672, vhost: '/', user: 'guest'), channel 1:
Channel error on connection <0.22766.7> (...:36624 -> ...:5672, vhost: '/', user: 'guest'), channel 1:
Channel error on connection <0.15751.7> (...:57274 -> ...:5672, vhost: '/', user: 'guest'), channel 1:
Channel error on connection <0.22766.7> (...:36624 -> ...:5672, vhost: '/', user: 'guest'), channel 1:
Channel error on connection <0.15751.7> (...:57274 -> ...:5672, vhost: '/', user: 'guest'), channel 1:
Channel error on connection <0.22766.7> (...:36624 -> ...:5672, vhost: '/', user: 'guest'), channel 1:
Channel error on connection <0.15751.7> (...:57274 -> ...:5672, vhost: '/', user: 'guest'), channel 1:
Channel error on connection <0.22766.7> (...:36624 -> ...:5672, vhost: '/', user: 'guest'), channel 1:
Channel error on connection <0.15751.7> (...:57274 -> ...:5672, vhost: '/', user: 'guest'), channel 1:
Channel error on connection <0.22766.7> (...:36624 -> ...:5672, vhost: '/', user: 'guest'), channel 1:
Channel error on connection <0.23111.7> (...:58592 -> ...:5672, vhost: '/', user: 'guest'), channel 1:
Channel error on connection <0.22766.7> (...:36624 -> ...:5672, vhost: '/', user: 'guest'), channel 1:
Channel error on connection <0.23111.7> (...:58592 -> ...:5672, vhost: '/', user: 'guest'), channel 1:
Channel error on connection <0.23111.7> (...:58592 -> ...:5672, vhost: '/', user: 'guest'), channel 1:
Channel error on connection <0.23148.7> (...:58804 -> ...:5672, vhost: '/', user: 'guest'), channel 1:
Channel error on connection <0.23189.7> (...:59240 -> ...:5672, vhost: '/', user: 'guest'), channel 1:
Channel error on connection <0.23189.7> (...:59240 -> ...:5672, vhost: '/', user: 'guest'), channel 1:
Channel error on connection <0.23189.7> (...:59240 -> ...:5672, vhost: '/', user: 'guest'), channel 1:
Channel error on connection <0.349.2> (...:41178 -> ...:5672, vhost: '/', user: 'guest'), channel 1:
Channel error on connection <0.19336.3> (...:44944 -> ...:5672, vhost: '/', user: 'guest'), channel 1:
Channel error on connection <0.19336.3> (...:44944 -> ...:5672, vhost: '/', user: 'guest'), channel 1:
Channel error on connection <0.24749.2> (...:43976 -> ...:5672, vhost: '/', user: 'guest'), channel 1:
Channel error on connection <0.24749.2> (...:43976 -> ...:5672, vhost: '/', user: 'guest'), channel 1:
Channel error on connection <0.24749.2> (...:43976 -> ...:5672, vhost: '/', user: 'guest'), channel 1:
Channel error on connection <0.24749.2> (...:43976 -> ...:5672, vhost: '/', user: 'guest'), channel 1:
Channel error on connection <0.20645.6> (...:55270 -> ...:5672, vhost: '/', user: 'guest'), channel 1:

This bug seems to have taken a bit of a tangent from the original issue that Marian reported. The key bit is that Marian was seeing this issue without any network disruption or failover testing. Meanwhile, the later comments seem to be more related to recovery during a network disruption / partition scenario. That is a perfectly valid concern, but we should isolate any such cases and deal with them separately; otherwise we are going to mix up subtle differences.

Also, these are somewhat old reports (I'm trying to clean up stale bugs) that are missing the data (sosreports, etc.) needed to diagnose them. So with that said, I'm going to close this one. If we still have outstanding case(s) that show a similar problem, feel free to open a new bug and include as much data as possible so we can hopefully get to the bottom of it.

Hello,

We are hitting this again, exactly like https://bugs.launchpad.net/mos/+bug/1609741

Regards,
Cyril

(In reply to Cyril Lopez from comment #23)
> We are hitting this again, exactly like
> https://bugs.launchpad.net/mos/+bug/1609741

Hey Cyril, I am going to close this one - would you mind opening a new BZ if your customer hits this again (as per eck's suggestion)?

Thanks,
Luca