Bug 1423593 - RabbitMQ: Bad address when CONFIG_AMQP_ENABLE_SSL=y
Summary: RabbitMQ: Bad address when CONFIG_AMQP_ENABLE_SSL=y
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: Red Hat OpenStack
Classification: Red Hat
Component: python-oslo-messaging
Version: 10.0 (Newton)
Hardware: Unspecified
OS: Unspecified
high
high
Target Milestone: z3
: 10.0 (Newton)
Assignee: Victor Stinner
QA Contact: Asaf Hirshberg
URL:
Whiteboard:
Depends On:
Blocks: 1427792
TreeView+ depends on / blocked
 
Reported: 2017-02-17 14:00 UTC by Sachin
Modified: 2020-04-15 15:23 UTC (History)
17 users (show)

Fixed In Version: python-oslo-messaging-5.10.1-2.el7ost
Doc Type: Bug Fix
Doc Text:
A race condition on RPC message acknowledgement in Oslo Messaging caused random RPC timeout in Ceilometer. The problem has been fixed. Now, the message acknowledgement is thread-safe and causes no Ceilometer related timeouts.
Clone Of:
: 1427792 (view as bug list)
Environment:
Last Closed: 2017-06-28 15:27:21 UTC
Target Upstream Version:
Embargoed:


Attachments (Terms of Use)
Modified Ceilometer config to reproduce the bug more quickly (2.50 KB, text/x-vhdl)
2017-02-22 21:45 UTC, Victor Stinner
no flags Details


Links
System ID Private Priority Status Summary Last Updated
OpenStack gerrit 436958 0 'None' MERGED rabbit: make ack/requeue thread-safe 2020-03-06 03:20:10 UTC
Red Hat Product Errata RHBA-2017:1587 0 normal SHIPPED_LIVE Red Hat OpenStack Platform 10 Bug Fix and Enhancement Advisory 2017-06-28 19:11:42 UTC

Description Sachin 2017-02-17 14:00:19 UTC
Description of problem:

/var/log/ceilometer/collector.log,
~~~
2017-02-17 05:57:13.771 2445 ERROR oslo.messaging._drivers.impl_rabbit [-] [7c8bc191-060f-45b2-913e-d5101d2e79cd] AMQP server on 192.168.100.222:5671 is unreachable: <AMQPError: unknown error>. Trying again in 1
 seconds. Client port: 60708
2017-02-17 05:57:13.774 2445 ERROR oslo_messaging.notify.listener [-] Fail to ack/requeue message.
2017-02-17 05:57:13.774 2445 ERROR oslo_messaging.notify.listener Traceback (most recent call last):
2017-02-17 05:57:13.774 2445 ERROR oslo_messaging.notify.listener   File "/usr/lib/python2.7/site-packages/oslo_messaging/notify/listener.py", line 179, in _process_incoming
2017-02-17 05:57:13.774 2445 ERROR oslo_messaging.notify.listener     m.acknowledge()
2017-02-17 05:57:13.774 2445 ERROR oslo_messaging.notify.listener   File "/usr/lib/python2.7/site-packages/oslo_messaging/_drivers/amqpdriver.py", line 119, in acknowledge
2017-02-17 05:57:13.774 2445 ERROR oslo_messaging.notify.listener     self.message.acknowledge()
2017-02-17 05:57:13.774 2445 ERROR oslo_messaging.notify.listener   File "/usr/lib/python2.7/site-packages/oslo_messaging/_drivers/impl_rabbit.py", line 251, in acknowledge
2017-02-17 05:57:13.774 2445 ERROR oslo_messaging.notify.listener     self._raw_message.ack()
2017-02-17 05:57:13.774 2445 ERROR oslo_messaging.notify.listener   File "/usr/lib/python2.7/site-packages/kombu/message.py", line 88, in ack
2017-02-17 05:57:13.774 2445 ERROR oslo_messaging.notify.listener     self.channel.basic_ack(self.delivery_tag)
2017-02-17 05:57:13.774 2445 ERROR oslo_messaging.notify.listener   File "/usr/lib/python2.7/site-packages/amqp/channel.py", line 1583, in basic_ack
2017-02-17 05:57:13.774 2445 ERROR oslo_messaging.notify.listener     self._send_method((60, 80), args)
2017-02-17 05:57:13.774 2445 ERROR oslo_messaging.notify.listener   File "/usr/lib/python2.7/site-packages/amqp/abstract_channel.py", line 56, in _send_method
2017-02-17 05:57:13.774 2445 ERROR oslo_messaging.notify.listener     self.channel_id, method_sig, args, content,
2017-02-17 05:57:13.774 2445 ERROR oslo_messaging.notify.listener   File "/usr/lib/python2.7/site-packages/amqp/method_framing.py", line 221, in write_method
2017-02-17 05:57:13.774 2445 ERROR oslo_messaging.notify.listener     write_frame(1, channel, payload)
2017-02-17 05:57:13.774 2445 ERROR oslo_messaging.notify.listener   File "/usr/lib/python2.7/site-packages/amqp/transport.py", line 182, in write_frame
2017-02-17 05:57:13.774 2445 ERROR oslo_messaging.notify.listener     frame_type, channel, size, payload, 0xce,
2017-02-17 05:57:13.774 2445 ERROR oslo_messaging.notify.listener   File "/usr/lib/python2.7/site-packages/amqp/transport.py", line 254, in _write
2017-02-17 05:57:13.774 2445 ERROR oslo_messaging.notify.listener     n = write(s)
2017-02-17 05:57:13.774 2445 ERROR oslo_messaging.notify.listener   File "/usr/lib64/python2.7/ssl.py", line 671, in write
2017-02-17 05:57:13.774 2445 ERROR oslo_messaging.notify.listener     return self._sslobj.write(data)
2017-02-17 05:57:13.774 2445 ERROR oslo_messaging.notify.listener error: [Errno 14] Bad address
~~~


Similarly I do not see any output in gnocchi command
~~~
# gnocchi metric list
<EMPTY OUTPUT>
~~~

But, I see ceilometer samples for cpu_util & memory
~~~
# ceilometer meter-list -l 10
+---------------------+-------+-------+--------------------------------------+----------------------------------+----------------------------------+
| Name                | Type  | Unit  | Resource ID                          | User ID                          | Project ID                       |
+---------------------+-------+-------+--------------------------------------+----------------------------------+----------------------------------+
| disk.ephemeral.size | gauge | GB    | 66e2e05a-0804-4487-84f7-b4e40c1c1801 | ca5b529ed0ff4be580d5bbcda59fcc93 | 6738642fac97418ea3bc41e76952bfb1 |
| disk.root.size      | gauge | GB    | 66e2e05a-0804-4487-84f7-b4e40c1c1801 | ca5b529ed0ff4be580d5bbcda59fcc93 | 6738642fac97418ea3bc41e76952bfb1 |
| image               | gauge | image | c1df4f4f-09b8-4221-9ff1-337a0b634892 | None                             | 156762f2978e4a7bbc31cf840f615cb1 |
| image.download      | delta | B     | c1df4f4f-09b8-4221-9ff1-337a0b634892 | None                             | 156762f2978e4a7bbc31cf840f615cb1 |
| image.serve         | delta | B     | c1df4f4f-09b8-4221-9ff1-337a0b634892 | None                             | 156762f2978e4a7bbc31cf840f615cb1 |
| image.size          | gauge | B     | c1df4f4f-09b8-4221-9ff1-337a0b634892 | None                             | 156762f2978e4a7bbc31cf840f615cb1 |
| memory              | gauge | MB    | 66e2e05a-0804-4487-84f7-b4e40c1c1801 | ca5b529ed0ff4be580d5bbcda59fcc93 | 6738642fac97418ea3bc41e76952bfb1 |
| memory              | gauge | MB    | e560e649-41fd-46a2-a3d2-5f4750ba2bb4 | ca5b529ed0ff4be580d5bbcda59fcc93 | 6738642fac97418ea3bc41e76952bfb1 |
| vcpus               | gauge | vcpu  | 66e2e05a-0804-4487-84f7-b4e40c1c1801 | ca5b529ed0ff4be580d5bbcda59fcc93 | 6738642fac97418ea3bc41e76952bfb1 |
| vcpus               | gauge | vcpu  | e560e649-41fd-46a2-a3d2-5f4750ba2bb4 | ca5b529ed0ff4be580d5bbcda59fcc93 | 6738642fac97418ea3bc41e76952bfb1 |
+---------------------+-------+-------+--------------------------------------+----------------------------------+----------------------------------+

# openstack server list
+--------------------------------------+------+--------+---------------------+------------+
| ID                                   | Name | Status | Networks            | Image Name |
+--------------------------------------+------+--------+---------------------+------------+
| e560e649-41fd-46a2-a3d2-5f4750ba2bb4 | t1   | ACTIVE | public=172.24.4.231 | cirros     |
+--------------------------------------+------+--------+---------------------+------------+
[root@localhost rabbitmq(keystone_admin)]# ceilometer sample-list -m cpu_util
+--------------------------------------+----------+-------+---------------+------+----------------------------+
| Resource ID                          | Name     | Type  | Volume        | Unit | Timestamp                  |
+--------------------------------------+----------+-------+---------------+------+----------------------------+
| e560e649-41fd-46a2-a3d2-5f4750ba2bb4 | cpu_util | gauge | 3.13665403118 | %    | 2017-02-17T06:37:13.571000 |
| e560e649-41fd-46a2-a3d2-5f4750ba2bb4 | cpu_util | gauge | 2.71087971235 | %    | 2017-02-17T06:27:13.568000 |
+--------------------------------------+----------+-------+---------------+------+----------------------------+



# ceilometer sample-list -m memory -l 5
+--------------------------------------+--------+-------+--------+------+----------------------------+
| Resource ID                          | Name   | Type  | Volume | Unit | Timestamp                  |
+--------------------------------------+--------+-------+--------+------+----------------------------+
| e560e649-41fd-46a2-a3d2-5f4750ba2bb4 | memory | gauge | 512.0  | MB   | 2017-02-17T06:00:21.001000 |
| e560e649-41fd-46a2-a3d2-5f4750ba2bb4 | memory | gauge | 512.0  | MB   | 2017-02-17T05:59:28.582000 |
| e560e649-41fd-46a2-a3d2-5f4750ba2bb4 | memory | gauge | 512.0  | MB   | 2017-02-17T05:59:28.564000 |
| e560e649-41fd-46a2-a3d2-5f4750ba2bb4 | memory | gauge | 512.0  | MB   | 2017-02-17T05:59:21.642000 |
| e560e649-41fd-46a2-a3d2-5f4750ba2bb4 | memory | gauge | 512.0  | MB   | 2017-02-17T05:59:21.556000 |
+--------------------------------------+--------+-------+--------+------+----------------------------+
~~~


Version-Release number of selected component (if applicable):
rabbitmq-server version: 3.6.3(6.el7ost)

How reproducible:
Always reproducible 


Steps to Reproduce:
1. Enable SSL


Actual results:
~~~
2017-02-17 05:57:13.774 2445 ERROR oslo_messaging.notify.listener   File "/usr/lib64/python2.7/ssl.py", line 671, in write
2017-02-17 05:57:13.774 2445 ERROR oslo_messaging.notify.listener     return self._sslobj.write(data)
2017-02-17 05:57:13.774 2445 ERROR oslo_messaging.notify.listener error: [Errno 14] Bad address
~~~

Expected results:
Should not see error

Additional info:
- Customer as secure environment. Can't ask for log. But was reproduced internally on single-node enviroment(packstack). Customer encounters exact error.
- Customer can't see ceilometer samples where they are visible in test environment.
- Generated Sosreport

Comment 1 Sachin 2017-02-17 14:02:59 UTC
/var/log/rabbitmq/rabbit
~~~
=INFO REPORT==== 17-Feb-2017::08:17:15 ===
accepting AMQP connection <0.22380.0> (192.168.100.222:35502 -> 192.168.100.222:5671)

=ERROR REPORT==== 17-Feb-2017::08:17:16 ===
SSL: connection: ssl_cipher.erl:297:Fatal error: bad record mac

=WARNING REPORT==== 17-Feb-2017::08:17:16 ===
closing AMQP connection <0.22380.0> (192.168.100.222:35502 -> 192.168.100.222:5671):
client unexpectedly closed TCP connection

=INFO REPORT==== 17-Feb-2017::08:17:17 ===
accepting AMQP connection <0.22502.0> (192.168.100.222:35504 -> 192.168.100.222:5671)

=ERROR REPORT==== 17-Feb-2017::08:27:13 ===
SSL: connection: ssl_cipher.erl:297:Fatal error: bad record mac

=ERROR REPORT==== 17-Feb-2017::08:27:13 ===
closing AMQP connection <0.22502.0> (192.168.100.222:35504 -> 192.168.100.222:5671):
{inet_error,{tls_alert,"bad record mac"}}

=INFO REPORT==== 17-Feb-2017::08:27:14 ===
accepting AMQP connection <0.22886.0> (192.168.100.222:35686 -> 192.168.100.222:5671)

=INFO REPORT==== 17-Feb-2017::08:37:13 ===
closing AMQP connection <0.20785.0> (192.168.100.222:34896 -> 192.168.100.222:5671)

=INFO REPORT==== 17-Feb-2017::08:37:13 ===
closing AMQP connection <0.20749.0> (192.168.100.222:34880 -> 192.168.100.222:5671)
~~~

Comment 3 Peter Lemenkov 2017-02-22 12:37:41 UTC
According to RabbitMQ docs this could be an issue on a client's side:

https://www.rabbitmq.com/troubleshooting-ssl.html
================================
Entries containing {tls_alert,"bad record mac"}

    The server has tried verifying integrity of a piece of data it received and the check failed. This can be due to problematic network equipment, unintentional socket sharing in the client (e.g. due to the use of fork(2)) or a bug in the client implementation of TLS.
================================

Is it possible that oslo.messaging is trying to send unencrypted data over TLS-connection? Let's say it passes FD to another thread where it uses as a plain TCP socket?

Comment 4 Victor Stinner 2017-02-22 12:46:17 UTC
On the client side, what are the versions of the following packages?

* openssl
* Python
* eventlet
* python-oslo-messaging

Moreover, what is your OS and OS version?

Comment 5 Victor Stinner 2017-02-22 14:26:41 UTC
Sachin was able to reproduce the "Bad address" error on a simple OSP 10 infra: everything running on the same host. The bug is specific to ceilometer: keystone, nova and neutron work fine.

It seems like the bug comes from Oslo Messaging using SSL and the threading driver: since OSP 10, ceilometer and gnocchi don't use eventlet nor oslo.service anymore, but native threads.

Comment 6 Sachin 2017-02-22 14:29:23 UTC
openssl = 1.0.1e(Revision: 1.0.1e)
Python = 2.7.5(48.el7)
No eventlet
python-oslo-messaging = 5.10.0(Release: 5.el7ost)

OS info
~~~
cat /etc/os-release 
NAME="Red Hat Enterprise Linux Server"
VERSION="7.3 (Maipo)"
ID="rhel"
ID_LIKE="fedora"
VERSION_ID="7.3"
PRETTY_NAME="Red Hat Enterprise Linux Server 7.3 (Maipo)"
ANSI_COLOR="0;31"
CPE_NAME="cpe:/o:redhat:enterprise_linux:7.3:GA:server"
HOME_URL="https://www.redhat.com/"
BUG_REPORT_URL="https://bugzilla.redhat.com/"

REDHAT_BUGZILLA_PRODUCT="Red Hat Enterprise Linux 7"
REDHAT_BUGZILLA_PRODUCT_VERSION=7.3
REDHAT_SUPPORT_PRODUCT="Red Hat Enterprise Linux"
REDHAT_SUPPORT_PRODUCT_VERSION="7.3"
~~~

Comment 7 Sachin 2017-02-22 14:41:08 UTC
/ additional info /

python-amqp: 1.4.6(1.el7ost)
python-kombu: 3.0.32(2.el7ost)

Comment 8 Victor Stinner 2017-02-22 15:13:30 UTC
Ceilometer calls oslo.messaging from different threads.

The rabbitmq driver of Oslo Messaging uses kombu and amqp, but amqp is not thread-safe: https://github.com/celery/py-amqp/issues/16

Mehdi Abaakouk is working on a patch to make Oslo Messaging again thread-safe: use a queue to make sure that an AMQP connection is always used in the same thread.
https://review.openstack.org/436958

Comment 9 Victor Stinner 2017-02-22 18:14:54 UTC
Mehdi fixed a race condition in Oslo Messaging to make it thread-safe: https://review.openstack.org/#/c/436958/

His patch now LGTM and can be backported, but Fabio needs to check the rhos-10.0.z flag.

Comment 10 Victor Stinner 2017-02-22 21:44:04 UTC
I was able to reproduce the bug by restarting Ceilometer services:
---
systemctl stop openstack-ceilometer-collector.service
systemctl stop openstack-ceilometer-notification.service
systemctl stop openstack-ceilometer-central.service
systemctl start openstack-ceilometer-central.service
systemctl start openstack-ceilometer-notification.service
systemctl start openstack-ceilometer-collector.service
grep 'Bad address' /var/log/ceilometer/collector.log |tail
---

Mehdi modified /etc/ceilometer/pipeline.yaml to run the pipeline every 30 seconds, instead of running it every 10 minutes:
==========================
--- pipeline.yaml.orig	2017-02-22 22:42:16.224649361 +0100
+++ pipeline.yaml	2017-02-22 22:42:04.314370871 +0100
@@ -1,17 +1,20 @@
 ---
 sources:
     - name: meter_source
+      interval: 30
       meters:
           - "*"
       sinks:
           - meter_sink
     - name: cpu_source
+      interval: 30
       meters:
           - "cpu"
       sinks:
           - cpu_sink
           - cpu_delta_sink
     - name: disk_source
+      interval: 30
       meters:
           - "disk.read.bytes"
           - "disk.read.requests"
@@ -24,6 +27,7 @@
       sinks:
           - disk_sink
     - name: network_source
+      interval: 30
       meters:
           - "network.incoming.bytes"
           - "network.incoming.packets"
==========================
=> See attached pipeline.yaml

I built python-oslo-messaging-5.10.1-2.el7ost which includes https://review.openstack.org/#/c/436958/ fix (patch set 18). With this package, I'm unable to reproduce the bug anymore.

I also tested that keystone, nova, neutron, ceilometer and gnocchi still answer (display data) after a reboot with these commands:
---
openstack user list # keystone
nova list
neutron subnet-list
ceilometer sample-list -m cpu_util
gnocchi resource list
---

Comment 11 Victor Stinner 2017-02-22 21:45:20 UTC
Created attachment 1256731 [details]
Modified Ceilometer config to reproduce the bug more quickly

Comment 20 Asaf Hirshberg 2017-03-27 05:51:02 UTC
Code Verified.

Comment 25 errata-xmlrpc 2017-06-28 15:27:21 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2017:1587


Note You need to log in before you can comment on or make changes to this bug.