Bug 1423593
| Field | Value |
|---|---|
| Summary | RabbitMQ: Bad address when CONFIG_AMQP_ENABLE_SSL=y |
| Product | Red Hat OpenStack |
| Component | python-oslo-messaging |
| Version | 10.0 (Newton) |
| Target Release | 10.0 (Newton) |
| Target Milestone | z3 |
| Hardware | Unspecified |
| OS | Unspecified |
| Status | CLOSED ERRATA |
| Reporter | Sachin <sacpatil> |
| Assignee | Victor Stinner <vstinner> |
| QA Contact | Asaf Hirshberg <ahirshbe> |
| Severity | high |
| Priority | high |
| Keywords | Triaged, ZStream |
| CC | apevec, eglynn, fdinitto, fpercoco, jeckersb, jruzicka, knakai, lhh, lruzicka, pablo.iranzo, plemenko, pmannidi, rscarazz, sacpatil, srevivo, ushkalim, vstinner |
| Fixed In Version | python-oslo-messaging-5.10.1-2.el7ost |
| Doc Type | Bug Fix |
| Doc Text | A race condition on RPC message acknowledgement in Oslo Messaging caused random RPC timeouts in Ceilometer. The problem has been fixed: message acknowledgement is now thread-safe and no longer causes Ceilometer-related timeouts. |
| Cloned To | 1427792 (view as bug list) |
| Bug Blocks | 1427792 |
| Type | Bug |
| Last Closed | 2017-06-28 15:27:21 UTC |
Description (Sachin, 2017-02-17 14:00:19 UTC)
Log excerpt from /var/log/rabbitmq/rabbit:
~~~
=INFO REPORT==== 17-Feb-2017::08:17:15 ===
accepting AMQP connection <0.22380.0> (192.168.100.222:35502 -> 192.168.100.222:5671)
=ERROR REPORT==== 17-Feb-2017::08:17:16 ===
SSL: connection: ssl_cipher.erl:297:Fatal error: bad record mac
=WARNING REPORT==== 17-Feb-2017::08:17:16 ===
closing AMQP connection <0.22380.0> (192.168.100.222:35502 -> 192.168.100.222:5671):
client unexpectedly closed TCP connection
=INFO REPORT==== 17-Feb-2017::08:17:17 ===
accepting AMQP connection <0.22502.0> (192.168.100.222:35504 -> 192.168.100.222:5671)
=ERROR REPORT==== 17-Feb-2017::08:27:13 ===
SSL: connection: ssl_cipher.erl:297:Fatal error: bad record mac
=ERROR REPORT==== 17-Feb-2017::08:27:13 ===
closing AMQP connection <0.22502.0> (192.168.100.222:35504 -> 192.168.100.222:5671):
{inet_error,{tls_alert,"bad record mac"}}
=INFO REPORT==== 17-Feb-2017::08:27:14 ===
accepting AMQP connection <0.22886.0> (192.168.100.222:35686 -> 192.168.100.222:5671)
=INFO REPORT==== 17-Feb-2017::08:37:13 ===
closing AMQP connection <0.20785.0> (192.168.100.222:34896 -> 192.168.100.222:5671)
=INFO REPORT==== 17-Feb-2017::08:37:13 ===
closing AMQP connection <0.20749.0> (192.168.100.222:34880 -> 192.168.100.222:5671)
~~~
According to the RabbitMQ docs, this could be an issue on the client's side: https://www.rabbitmq.com/troubleshooting-ssl.html

> Entries containing {tls_alert,"bad record mac"}
> The server has tried verifying the integrity of a piece of data it received and the check failed. This can be due to problematic network equipment, unintentional socket sharing in the client (e.g. due to the use of fork(2)) or a bug in the client implementation of TLS.

Is it possible that oslo.messaging is trying to send unencrypted data over the TLS connection? Say, it passes the FD to another thread, where it is used as a plain TCP socket?

On the client side, what are the versions of the following packages?

* openssl
* Python
* eventlet
* python-oslo-messaging

Also, what are your OS and OS version?

Sachin was able to reproduce the "Bad address" error on a simple OSP 10 infra, with everything running on the same host. The bug is specific to Ceilometer: Keystone, Nova and Neutron work fine. The bug seems to come from Oslo Messaging being used with SSL and the threading driver: since OSP 10, Ceilometer and Gnocchi no longer use eventlet or oslo.service, but native threads.

* openssl = 1.0.1e (Revision: 1.0.1e)
* Python = 2.7.5 (48.el7)
* No eventlet
* python-oslo-messaging = 5.10.0 (Release: 5.el7ost)

OS info:

~~~
cat /etc/os-release
NAME="Red Hat Enterprise Linux Server"
VERSION="7.3 (Maipo)"
ID="rhel"
ID_LIKE="fedora"
VERSION_ID="7.3"
PRETTY_NAME="Red Hat Enterprise Linux Server 7.3 (Maipo)"
ANSI_COLOR="0;31"
CPE_NAME="cpe:/o:redhat:enterprise_linux:7.3:GA:server"
HOME_URL="https://www.redhat.com/"
BUG_REPORT_URL="https://bugzilla.redhat.com/"
REDHAT_BUGZILLA_PRODUCT="Red Hat Enterprise Linux 7"
REDHAT_BUGZILLA_PRODUCT_VERSION=7.3
REDHAT_SUPPORT_PRODUCT="Red Hat Enterprise Linux"
REDHAT_SUPPORT_PRODUCT_VERSION="7.3"
~~~

Additional info:

* python-amqp: 1.4.6 (1.el7ost)
* python-kombu: 3.0.32 (2.el7ost)

Ceilometer calls oslo.messaging from different threads. The rabbitmq driver of Oslo Messaging uses kombu and amqp, but amqp is not thread-safe: https://github.com/celery/py-amqp/issues/16

Mehdi Abaakouk is working on a patch to make Oslo Messaging thread-safe again: use a queue to make sure that an AMQP connection is always used in the same thread (a sketch of this pattern follows the reproduction commands below). https://review.openstack.org/436958

Mehdi fixed the race condition in Oslo Messaging to make it thread-safe: https://review.openstack.org/#/c/436958/ His patch LGTM now and can be backported, but Fabio needs to check the rhos-10.0.z flag.

I was able to reproduce the bug by restarting the Ceilometer services:
~~~
systemctl stop openstack-ceilometer-collector.service
systemctl stop openstack-ceilometer-notification.service
systemctl stop openstack-ceilometer-central.service
systemctl start openstack-ceilometer-central.service
systemctl start openstack-ceilometer-notification.service
systemctl start openstack-ceilometer-collector.service
grep 'Bad address' /var/log/ceilometer/collector.log | tail
~~~
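To make the race and the fix concrete, here is a minimal sketch of the queue-based pattern Mehdi describes. It is an illustration only, not the actual oslo.messaging implementation: `ConnectionWorker`, `execute()` and the `conn.basic_ack(...)` call are hypothetical names. Before the fix, several Ceilometer threads could drive the same AMQP connection at once, so two writes could interleave inside a single TLS record, which the broker rejects with "bad record mac". The idea of the fix is to give the connection a single owner thread and funnel all operations through a queue:

~~~
# Sketch: a single owner thread performs ALL I/O on the
# (non-thread-safe) AMQP connection, so writes to the underlying SSL
# socket can never interleave. Names are hypothetical, not the
# oslo.messaging API.
import threading

try:
    import queue           # Python 3
except ImportError:
    import Queue as queue  # Python 2.7, as on RHEL 7.3


class ConnectionWorker(object):
    """Serializes every use of a non-thread-safe connection object."""

    def __init__(self, connection):
        self._connection = connection  # e.g. a kombu/py-amqp connection
        self._tasks = queue.Queue()
        self._thread = threading.Thread(target=self._run)
        self._thread.daemon = True
        self._thread.start()

    def _run(self):
        # Only this thread ever touches self._connection.
        while True:
            func, args, done = self._tasks.get()
            if func is None:           # shutdown sentinel
                break
            try:
                func(self._connection, *args)
            finally:
                done.set()

    def execute(self, func, *args):
        """Run func(connection, *args) in the owner thread and wait."""
        done = threading.Event()
        self._tasks.put((func, args, done))
        done.wait()

    def stop(self):
        self._tasks.put((None, (), threading.Event()))
        self._thread.join()


# Any thread may now request a message acknowledgement; the actual
# socket write always happens in the single owner thread:
#
#   worker = ConnectionWorker(amqp_connection)
#   worker.execute(lambda conn, tag: conn.basic_ack(tag), delivery_tag)
~~~

This is the upstream fix (https://review.openstack.org/436958) reduced to its core: acknowledgements become thread-safe because every connection operation is serialized through one queue and one consumer thread.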
Mehdi modified /etc/ceilometer/pipeline.yaml to run the pipeline every 30 seconds, instead of running it every 10 minutes:
~~~
--- pipeline.yaml.orig 2017-02-22 22:42:16.224649361 +0100
+++ pipeline.yaml 2017-02-22 22:42:04.314370871 +0100
@@ -1,17 +1,20 @@
---
sources:
- name: meter_source
+ interval: 30
meters:
- "*"
sinks:
- meter_sink
- name: cpu_source
+ interval: 30
meters:
- "cpu"
sinks:
- cpu_sink
- cpu_delta_sink
- name: disk_source
+ interval: 30
meters:
- "disk.read.bytes"
- "disk.read.requests"
@@ -24,6 +27,7 @@
sinks:
- disk_sink
- name: network_source
+ interval: 30
meters:
- "network.incoming.bytes"
- "network.incoming.packets"
~~~
=> See attached pipeline.yaml
I built python-oslo-messaging-5.10.1-2.el7ost, which includes the https://review.openstack.org/#/c/436958/ fix (patch set 18). With this package, I can no longer reproduce the bug.
I also tested that keystone, nova, neutron, ceilometer and gnocchi still respond (display data) after a reboot, using these commands:
~~~
openstack user list                  # keystone
nova list                            # nova
neutron subnet-list                  # neutron
ceilometer sample-list -m cpu_util   # ceilometer
gnocchi resource list                # gnocchi
~~~
Created attachment 1256731: modified Ceilometer config (pipeline.yaml) to reproduce the bug more quickly.
Code Verified.

Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA.

For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2017:1587