Bug 1399237
| Summary: | Channel error on connection: operation basic.publish caused a channel exception not_found | | |
|---|---|---|---|
| Product: | Red Hat OpenStack | Reporter: | Marian Krcmarik <mkrcmari> |
| Component: | rabbitmq-server | Assignee: | John Eckersberg <jeckersb> |
| Status: | CLOSED INSUFFICIENT_DATA | QA Contact: | Udi Shkalim <ushkalim> |
| Severity: | urgent | Docs Contact: | |
| Priority: | urgent | | |
| Version: | 10.0 (Newton) | CC: | apevec, cylopez, dhill, fdinitto, igortiunov, ipetrova, jeckersb, lhh, lmiccini, mkrcmari, oblaut, sbandyop, srevivo |
| Target Milestone: | --- | Keywords: | Reopened, Triaged, ZStream |
| Target Release: | 10.0 (Newton) | | |
| Hardware: | Unspecified | | |
| OS: | Unspecified | | |
| Whiteboard: | | | |
| Fixed In Version: | | Doc Type: | If docs needed, set a value |
| Doc Text: | | Story Points: | --- |
| Clone Of: | | Environment: | |
| Last Closed: | 2019-07-24 06:18:29 UTC | Type: | Bug |
| Regression: | --- | Mount Type: | --- |
| Documentation: | --- | CRM: | |
| Verified Versions: | | Category: | --- |
| oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: | |
| Cloudforms Team: | --- | Target Upstream Version: | |
| Embargoed: | | | |
Description
Marian Krcmarik
2016-11-28 15:47:48 UTC
I can easily reproduce error messages of this type in the rabbitmq log:

=ERROR REPORT==== 2-Dec-2016::17:12:49 ===
Channel error on connection <0.845.0> (10.35.169.18:60994 -> 10.35.169.13:5672, vhost: '/', user: 'guest'), channel 1:
operation basic.publish caused a channel exception not_found: "no exchange 'reply_fc5a7174b2fe4c2089c6f1fdf28826b8' in vhost '/'"

=ERROR REPORT==== 2-Dec-2016::17:12:49 ===
Channel error on connection <0.908.0> (10.35.169.18:32778 -> 10.35.169.13:5672, vhost: '/', user: 'guest'), channel 1:
operation basic.publish caused a channel exception not_found: "no exchange 'reply_c243451013434ab49080ba877d386f7a' in vhost '/'"

=ERROR REPORT==== 2-Dec-2016::17:12:49 ===
Channel error on connection <0.845.0> (10.35.169.18:60994 -> 10.35.169.13:5672, vhost: '/', user: 'guest'), channel 1:
operation basic.publish caused a channel exception not_found: "no exchange 'reply_fc5a7174b2fe4c2089c6f1fdf28826b8' in vhost '/'"

I do not know how severe these messages are, or whether they are a consequence of some other problem, but I can see many of them on multiple setups (both IPv6- and IPv4-based deployments). They appear even though no failover actions are being performed. If needed I can provide a deployment for debugging.

(In reply to Marian Krcmarik from comment #2)
> If needed I can provide a deployment for debugging.

Please! If you have one handy that reproduces, it would be a big help.

I tried to reproduce the bug on the latest builds and was not as "successful" as two months ago, and I am not sure why. I can still see messages such as:

operation basic.publish caused a channel exception not_found: "no exchange 'reply_fc5a7174b2fe4c2089c6f1fdf28826b8' in vhost '/'"

but only very rarely, and I cannot see any rabbitmq disconnections as before. I am going to close the bug for now and reopen it if the described behaviour becomes reproducible again.

Hi guys,

We are currently hitting this issue. I updated python-oslo-messaging and rabbitmq-server to the latest RHOSP 10 versions and we are still hitting it. Do you know which packages/processes caused this issue?

Thank you very much,
David Hill

And this might be fixed by this commit:

[1] https://review.openstack.org/#/c/315700/8/oslo_messaging/_drivers/impl_rabbit.py

Testing this patch [1]:
--- impl_rabbit.py.orig 2016-11-22 11:49:09.000000000 +0000
+++ impl_rabbit.py      2017-09-15 23:29:44.716985459 +0000
@@ -1067,11 +1067,21 @@
             if not self.connection.connected:
                 raise self.connection.recoverable_connection_errors[0]

-            while self._new_tags:
+            consume_max_retries = 2
+            while self._new_tags and consume_max_retries:
                 for consumer, tag in self._consumers.items():
                     if tag in self._new_tags:
-                        consumer.consume(self, tag=tag)
-                        self._new_tags.remove(tag)
+                        try:
+                            consumer.consume(tag=tag)
+                            self._new_tags.remove(tag)
+                        except self.connection.channel_errors as exc:
+                            if exc.code == 404 and consume_max_retries:
+                                consumer.declare(self)
+                                self._new_tags = set(self._consumers.values())
+                                consume_max_retries -= 1
+                                break
+                            else:
+                                raise

             poll_timeout = (self._poll_timeout if timeout is None
                             else min(timeout, self._poll_timeout))
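For clarity about what the hunk above is doing: it wraps the consume call so that a 404 (not_found) channel error leads to re-declaring the consumers and retrying a bounded number of times, instead of failing outright. The following self-contained sketch mirrors that control flow with stand-in classes; the names (`ChannelError`, `FakeConsumer`, `start_consumers`) are illustrative only and are not the oslo.messaging or kombu API.

```python
# Illustrative stand-ins only -- not the oslo.messaging/kombu API.
class ChannelError(Exception):
    def __init__(self, code):
        super().__init__(f"channel error {code}")
        self.code = code


class FakeConsumer:
    """Pretends the broker deleted our queue/exchange until re-declared."""
    def __init__(self):
        self.declared = False

    def declare(self):
        self.declared = True  # re-create queue/exchange on the broker

    def consume(self, tag):
        if not self.declared:
            raise ChannelError(404)  # not_found, as in the bug report
        print(f"consuming with tag {tag}")


def start_consumers(consumers, new_tags):
    """Same control flow as the patch: on a 404, re-declare and retry."""
    consume_max_retries = 2
    while new_tags and consume_max_retries:
        for consumer, tag in consumers.items():
            if tag in new_tags:
                try:
                    consumer.consume(tag=tag)
                    new_tags.remove(tag)
                except ChannelError as exc:
                    if exc.code == 404 and consume_max_retries:
                        consumer.declare()
                        new_tags = set(consumers.values())
                        consume_max_retries -= 1
                        break
                    else:
                        raise


if __name__ == "__main__":
    c = FakeConsumer()
    start_consumers({c: "tag-1"}, {"tag-1"})
```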
The patch doesn't seem to change anything in this case.

This error is not totally unexpected under normal conditions. Here is roughly how it happens:

- Service A creates an exchange/queue pair reply_XYZ for receiving RPC responses. These are created with auto_delete=true.
- A makes an RPC call to service B.
- B receives the RPC request from A and does some work.
- A stops or is disconnected from RabbitMQ for $reasons.
- RabbitMQ notices there are zero consumers on reply_XYZ and auto-deletes the queue. Because the exchange is no longer bound to any queue, it is also auto-deleted.
- B finishes its work and attempts to publish to exchange reply_XYZ, but it is gone.
- B retries for some time (offhand, I believe once per second for 60 seconds), hoping that A reconnects and re-declares the queue/exchange. During this time, you will get this error repeatedly.
- Eventually either (a) A comes back, re-declares reply_XYZ, and B is able to send the reply, or (b) B gives up after the timeout and stops trying to send the reply.

So it is not necessarily indicative of a problem; it is supposed to work that way in the general case. However, it *could* be indicative of a problem, probably related to a partition, either ongoing or at some point in the past. There is no hard rule to say for sure; you have to examine the context on a case-by-case basis.

Hopefully that helps clarify. This may also be the same thing as bug 1484543, if it persists for extended periods of time (more than a minute).

Hi,

Currently auto_delete is not enabled on queues (fanout/reply_) by default. Check this: https://bugs.launchpad.net/oslo.messaging/+bug/1495568

The RH Configuration Reference for RHOSP 10 does not describe this option, but for RHOSP 11 it is described:

https://access.redhat.com/documentation/en-us/red_hat_openstack_platform/11/html-single/configuration_reference/

rabbit_transient_queues_ttl = 1800 (30 mins by default)

So the error occurs when messages/queues are deleted by their time to live (TTL).

Hello,

On newton-eol upstream, we have this feature:
https://github.com/openstack/oslo.messaging/blob/newton-eol/oslo_messaging/_drivers/impl_rabbit.py#L468

I guess we got it in OSP 10.

(In reply to ITD27M01 from comment #17)
> Currently auto_delete is not enabled on queues (fanout/reply_) by default. [...]

(In reply to Cyril Lopez from comment #18)
> On newton-eol upstream, we have this feature: [...]
> I guess we got it in OSP 10.

Yes, but this is not covered by the documentation, although it is used by default in OSP 10.
We hit this issue when maintenance was done on a network switch and bonding didn't fail over to the other NIC. All controllers got isolated (split brain on each host), and rabbitmq then started exhibiting this issue:

=ERROR REPORT==== 28-Apr-2018::01:30:48 ===
Channel error on connection <0.10376.0> (192.168.5.7:49702 -> 192.168.5.7:5672, vhost: '/', user: 'guest'), channel 1:
operation basic.publish caused a channel exception not_found: "no exchange 'reply_8122e6beca9543f5b4860ad80e3619fe' in vhost '/'"

Restarting rabbitmq-clone solved the problem by forcing the three rabbitmq cluster members to restart, which triggered all stuck connections, broken services, etc. to reconnect to the AMQP service.

vhost errors (among others) also appear in case #02135077:

Channel error on connection <0.17098.7> (...:57542 -> ...:5672, vhost: '/', user: 'guest'), channel 1:
Channel error on connection <0.17098.7> (...:57542 -> ...:5672, vhost: '/', user: 'guest'), channel 1:
Channel error on connection <0.22766.7> (...:36624 -> ...:5672, vhost: '/', user: 'guest'), channel 1:
Channel error on connection <0.15751.7> (...:57274 -> ...:5672, vhost: '/', user: 'guest'), channel 1:
Channel error on connection <0.22766.7> (...:36624 -> ...:5672, vhost: '/', user: 'guest'), channel 1:
Channel error on connection <0.15751.7> (...:57274 -> ...:5672, vhost: '/', user: 'guest'), channel 1:
Channel error on connection <0.22766.7> (...:36624 -> ...:5672, vhost: '/', user: 'guest'), channel 1:
Channel error on connection <0.15751.7> (...:57274 -> ...:5672, vhost: '/', user: 'guest'), channel 1:
Channel error on connection <0.22766.7> (...:36624 -> ...:5672, vhost: '/', user: 'guest'), channel 1:
Channel error on connection <0.15751.7> (...:57274 -> ...:5672, vhost: '/', user: 'guest'), channel 1:
Channel error on connection <0.22766.7> (...:36624 -> ...:5672, vhost: '/', user: 'guest'), channel 1:
Channel error on connection <0.23111.7> (...:58592 -> ...:5672, vhost: '/', user: 'guest'), channel 1:
Channel error on connection <0.22766.7> (...:36624 -> ...:5672, vhost: '/', user: 'guest'), channel 1:
Channel error on connection <0.23111.7> (...:58592 -> ...:5672, vhost: '/', user: 'guest'), channel 1:
Channel error on connection <0.23111.7> (...:58592 -> ...:5672, vhost: '/', user: 'guest'), channel 1:
Channel error on connection <0.23148.7> (...:58804 -> ...:5672, vhost: '/', user: 'guest'), channel 1:
Channel error on connection <0.23189.7> (...:59240 -> ...:5672, vhost: '/', user: 'guest'), channel 1:
Channel error on connection <0.23189.7> (...:59240 -> ...:5672, vhost: '/', user: 'guest'), channel 1:
Channel error on connection <0.23189.7> (...:59240 -> ...:5672, vhost: '/', user: 'guest'), channel 1:
Channel error on connection <0.349.2> (...:41178 -> ...:5672, vhost: '/', user: 'guest'), channel 1:
Channel error on connection <0.19336.3> (...:44944 -> ...:5672, vhost: '/', user: 'guest'), channel 1:
Channel error on connection <0.19336.3> (...:44944 -> ...:5672, vhost: '/', user: 'guest'), channel 1:
Channel error on connection <0.24749.2> (...:43976 -> ...:5672, vhost: '/', user: 'guest'), channel 1:
Channel error on connection <0.24749.2> (...:43976 -> ...:5672, vhost: '/', user: 'guest'), channel 1:
Channel error on connection <0.24749.2> (...:43976 -> ...:5672, vhost: '/', user: 'guest'), channel 1:
Channel error on connection <0.24749.2> (...:43976 -> ...:5672, vhost: '/', user: 'guest'), channel 1:
Channel error on connection <0.20645.6> (...:55270 -> ...:5672, vhost: '/', user: 'guest'), channel 1:

This bug seems to have taken a bit of a tangent from the original issue that Marian reported. The key bit is that Marian was seeing this issue without any network disruption or failover testing. Meanwhile, the later comments seem to be more related to recovery during a network disruption / partition scenario. That is a perfectly valid concern, but we should isolate any such cases and deal with them separately; otherwise we are going to mix up subtle differences.

Also, these are somewhat old reports (I'm trying to clean up stale bugs) that are missing the data (sosreports, etc.) needed to diagnose them. So with that said, I'm going to close this one. If we still have outstanding case(s) that show a similar problem, feel free to open a new bug and include as much data as possible so we can hopefully get to the bottom of it.

Hello,

We are hitting this again, exactly like https://bugs.launchpad.net/mos/+bug/1609741

Regards,
Cyril

(In reply to Cyril Lopez from comment #23)
> We are hitting this again, exactly like
> https://bugs.launchpad.net/mos/+bug/1609741

Hey Cyril, I am going to close this one - would you mind opening a new BZ if your customer hits this again (as per eck's suggestion)?

Thanks,
Luca