Bug 1423593
| Field | Value |
|---|---|
| Summary | RabbitMQ: Bad address when CONFIG_AMQP_ENABLE_SSL=y |
| Product | Red Hat OpenStack |
| Component | python-oslo-messaging |
| Version | 10.0 (Newton) |
| Target Release | 10.0 (Newton) |
| Target Milestone | z3 |
| Hardware | Unspecified |
| OS | Unspecified |
| Status | CLOSED ERRATA |
| Reporter | Sachin <sacpatil> |
| Assignee | Victor Stinner <vstinner> |
| QA Contact | Asaf Hirshberg <ahirshbe> |
| Severity | high |
| Priority | high |
| Keywords | Triaged, ZStream |
| CC | apevec, eglynn, fdinitto, fpercoco, jeckersb, jruzicka, knakai, lhh, lruzicka, pablo.iranzo, plemenko, pmannidi, rscarazz, sacpatil, srevivo, ushkalim, vstinner |
| Fixed In Version | python-oslo-messaging-5.10.1-2.el7ost |
| Doc Type | Bug Fix |
| Doc Text | A race condition on RPC message acknowledgement in Oslo Messaging caused random RPC timeouts in Ceilometer. The problem has been fixed: message acknowledgement is now thread-safe and no longer causes Ceilometer-related timeouts. |
| Cloned To | 1427792 (view as bug list) |
| Bug Blocks | 1427792 |
| Type | Bug |
| Last Closed | 2017-06-28 15:27:21 UTC |
Description (Sachin, 2017-02-17 14:00:19 UTC)
Log excerpt from /var/log/rabbitmq/rabbit:
~~~
=INFO REPORT==== 17-Feb-2017::08:17:15 ===
accepting AMQP connection <0.22380.0> (192.168.100.222:35502 -> 192.168.100.222:5671)
=ERROR REPORT==== 17-Feb-2017::08:17:16 ===
SSL: connection: ssl_cipher.erl:297:Fatal error: bad record mac
=WARNING REPORT==== 17-Feb-2017::08:17:16 ===
closing AMQP connection <0.22380.0> (192.168.100.222:35502 -> 192.168.100.222:5671):
client unexpectedly closed TCP connection
=INFO REPORT==== 17-Feb-2017::08:17:17 ===
accepting AMQP connection <0.22502.0> (192.168.100.222:35504 -> 192.168.100.222:5671)
=ERROR REPORT==== 17-Feb-2017::08:27:13 ===
SSL: connection: ssl_cipher.erl:297:Fatal error: bad record mac
=ERROR REPORT==== 17-Feb-2017::08:27:13 ===
closing AMQP connection <0.22502.0> (192.168.100.222:35504 -> 192.168.100.222:5671):
{inet_error,{tls_alert,"bad record mac"}}
=INFO REPORT==== 17-Feb-2017::08:27:14 ===
accepting AMQP connection <0.22886.0> (192.168.100.222:35686 -> 192.168.100.222:5671)
=INFO REPORT==== 17-Feb-2017::08:37:13 ===
closing AMQP connection <0.20785.0> (192.168.100.222:34896 -> 192.168.100.222:5671)
=INFO REPORT==== 17-Feb-2017::08:37:13 ===
closing AMQP connection <0.20749.0> (192.168.100.222:34880 -> 192.168.100.222:5671)
~~~
According to the RabbitMQ docs, this could be an issue on the client's side: https://www.rabbitmq.com/troubleshooting-ssl.html

> Entries containing {tls_alert,"bad record mac"}
> The server has tried verifying the integrity of a piece of data it received and the check failed. This can be due to problematic network equipment, unintentional socket sharing in the client (e.g. due to the use of fork(2)) or a bug in the client implementation of TLS.

Is it possible that oslo.messaging is trying to send unencrypted data over the TLS connection? Say, it passes the FD to another thread, where it is used as a plain TCP socket?

On the client side, what are the versions of the following packages?

* openssl
* Python
* eventlet
* python-oslo-messaging

Also, what are your OS and OS version?

Sachin was able to reproduce the "Bad address" error on a simple OSP 10 infra, with everything running on the same host. The bug is specific to Ceilometer: Keystone, Nova and Neutron work fine. The bug seems to come from Oslo Messaging being used with SSL and the threading driver: since OSP 10, Ceilometer and Gnocchi no longer use eventlet or oslo.service, but native threads.

* openssl = 1.0.1e (Revision: 1.0.1e)
* Python = 2.7.5 (48.el7)
* No eventlet
* python-oslo-messaging = 5.10.0 (Release: 5.el7ost)

OS info:

~~~
cat /etc/os-release
NAME="Red Hat Enterprise Linux Server"
VERSION="7.3 (Maipo)"
ID="rhel"
ID_LIKE="fedora"
VERSION_ID="7.3"
PRETTY_NAME="Red Hat Enterprise Linux Server 7.3 (Maipo)"
ANSI_COLOR="0;31"
CPE_NAME="cpe:/o:redhat:enterprise_linux:7.3:GA:server"
HOME_URL="https://www.redhat.com/"
BUG_REPORT_URL="https://bugzilla.redhat.com/"
REDHAT_BUGZILLA_PRODUCT="Red Hat Enterprise Linux 7"
REDHAT_BUGZILLA_PRODUCT_VERSION=7.3
REDHAT_SUPPORT_PRODUCT="Red Hat Enterprise Linux"
REDHAT_SUPPORT_PRODUCT_VERSION="7.3"
~~~

Additional info:

* python-amqp: 1.4.6 (1.el7ost)
* python-kombu: 3.0.32 (2.el7ost)

Ceilometer calls oslo.messaging from different threads. The rabbitmq driver of Oslo Messaging uses kombu and amqp, but amqp is not thread-safe: https://github.com/celery/py-amqp/issues/16

Mehdi Abaakouk is working on a patch to make Oslo Messaging thread-safe again: use a queue to make sure that an AMQP connection is always used in the same thread (a sketch of this pattern follows the reproduction commands below). https://review.openstack.org/436958

Mehdi fixed the race condition in Oslo Messaging to make it thread-safe: https://review.openstack.org/#/c/436958/ His patch LGTM now and can be backported, but Fabio needs to check the rhos-10.0.z flag.

I was able to reproduce the bug by restarting the Ceilometer services:
~~~
systemctl stop openstack-ceilometer-collector.service
systemctl stop openstack-ceilometer-notification.service
systemctl stop openstack-ceilometer-central.service
systemctl start openstack-ceilometer-central.service
systemctl start openstack-ceilometer-notification.service
systemctl start openstack-ceilometer-collector.service
grep 'Bad address' /var/log/ceilometer/collector.log | tail
~~~
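To make the race and the fix concrete, here is a minimal sketch of the queue-based pattern Mehdi describes. It is an illustration only, not the actual oslo.messaging implementation: `ConnectionWorker`, `execute()` and the `conn.basic_ack(...)` call are hypothetical names. Before the fix, several Ceilometer threads could drive the same AMQP connection at once, so two writes could interleave inside a single TLS record, which the broker rejects with "bad record mac". The idea of the fix is to give the connection a single owner thread and funnel all operations through a queue:

~~~
# Sketch: a single owner thread performs ALL I/O on the
# (non-thread-safe) AMQP connection, so writes to the underlying SSL
# socket can never interleave. Names are hypothetical, not the
# oslo.messaging API.
import threading

try:
    import queue           # Python 3
except ImportError:
    import Queue as queue  # Python 2.7, as on RHEL 7.3


class ConnectionWorker(object):
    """Serializes every use of a non-thread-safe connection object."""

    def __init__(self, connection):
        self._connection = connection  # e.g. a kombu/py-amqp connection
        self._tasks = queue.Queue()
        self._thread = threading.Thread(target=self._run)
        self._thread.daemon = True
        self._thread.start()

    def _run(self):
        # Only this thread ever touches self._connection.
        while True:
            func, args, done = self._tasks.get()
            if func is None:           # shutdown sentinel
                break
            try:
                func(self._connection, *args)
            finally:
                done.set()

    def execute(self, func, *args):
        """Run func(connection, *args) in the owner thread and wait."""
        done = threading.Event()
        self._tasks.put((func, args, done))
        done.wait()

    def stop(self):
        self._tasks.put((None, (), threading.Event()))
        self._thread.join()


# Any thread may now request a message acknowledgement; the actual
# socket write always happens in the single owner thread:
#
#   worker = ConnectionWorker(amqp_connection)
#   worker.execute(lambda conn, tag: conn.basic_ack(tag), delivery_tag)
~~~

This is the upstream fix (https://review.openstack.org/436958) reduced to its core: acknowledgements become thread-safe because every connection operation is serialized through one queue and one consumer thread.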
Mehdi modified /etc/ceilometer/pipeline.yaml to run the pipeline every 30 seconds, instead of running it every 10 minutes:
~~~
--- pipeline.yaml.orig 2017-02-22 22:42:16.224649361 +0100
+++ pipeline.yaml 2017-02-22 22:42:04.314370871 +0100
@@ -1,17 +1,20 @@
---
sources:
- name: meter_source
+ interval: 30
meters:
- "*"
sinks:
- meter_sink
- name: cpu_source
+ interval: 30
meters:
- "cpu"
sinks:
- cpu_sink
- cpu_delta_sink
- name: disk_source
+ interval: 30
meters:
- "disk.read.bytes"
- "disk.read.requests"
@@ -24,6 +27,7 @@
sinks:
- disk_sink
- name: network_source
+ interval: 30
meters:
- "network.incoming.bytes"
- "network.incoming.packets"
~~~
=> See attached pipeline.yaml
I built python-oslo-messaging-5.10.1-2.el7ost, which includes the https://review.openstack.org/#/c/436958/ fix (patch set 18). With this package, I can no longer reproduce the bug.
I also tested that keystone, nova, neutron, ceilometer and gnocchi still respond (display data) after a reboot, using these commands:
~~~
openstack user list                  # keystone
nova list                            # nova
neutron subnet-list                  # neutron
ceilometer sample-list -m cpu_util   # ceilometer
gnocchi resource list                # gnocchi
~~~
Created attachment 1256731: modified Ceilometer config (pipeline.yaml) to reproduce the bug more quickly.
Code Verified.

Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA.

For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2017:1587