Description of problem: /var/log/ceilometer/collector.log, ~~~ 2017-02-17 05:57:13.771 2445 ERROR oslo.messaging._drivers.impl_rabbit [-] [7c8bc191-060f-45b2-913e-d5101d2e79cd] AMQP server on 192.168.100.222:5671 is unreachable: <AMQPError: unknown error>. Trying again in 1 seconds. Client port: 60708 2017-02-17 05:57:13.774 2445 ERROR oslo_messaging.notify.listener [-] Fail to ack/requeue message. 2017-02-17 05:57:13.774 2445 ERROR oslo_messaging.notify.listener Traceback (most recent call last): 2017-02-17 05:57:13.774 2445 ERROR oslo_messaging.notify.listener File "/usr/lib/python2.7/site-packages/oslo_messaging/notify/listener.py", line 179, in _process_incoming 2017-02-17 05:57:13.774 2445 ERROR oslo_messaging.notify.listener m.acknowledge() 2017-02-17 05:57:13.774 2445 ERROR oslo_messaging.notify.listener File "/usr/lib/python2.7/site-packages/oslo_messaging/_drivers/amqpdriver.py", line 119, in acknowledge 2017-02-17 05:57:13.774 2445 ERROR oslo_messaging.notify.listener self.message.acknowledge() 2017-02-17 05:57:13.774 2445 ERROR oslo_messaging.notify.listener File "/usr/lib/python2.7/site-packages/oslo_messaging/_drivers/impl_rabbit.py", line 251, in acknowledge 2017-02-17 05:57:13.774 2445 ERROR oslo_messaging.notify.listener self._raw_message.ack() 2017-02-17 05:57:13.774 2445 ERROR oslo_messaging.notify.listener File "/usr/lib/python2.7/site-packages/kombu/message.py", line 88, in ack 2017-02-17 05:57:13.774 2445 ERROR oslo_messaging.notify.listener self.channel.basic_ack(self.delivery_tag) 2017-02-17 05:57:13.774 2445 ERROR oslo_messaging.notify.listener File "/usr/lib/python2.7/site-packages/amqp/channel.py", line 1583, in basic_ack 2017-02-17 05:57:13.774 2445 ERROR oslo_messaging.notify.listener self._send_method((60, 80), args) 2017-02-17 05:57:13.774 2445 ERROR oslo_messaging.notify.listener File "/usr/lib/python2.7/site-packages/amqp/abstract_channel.py", line 56, in _send_method 2017-02-17 05:57:13.774 2445 ERROR oslo_messaging.notify.listener self.channel_id, method_sig, args, content, 2017-02-17 05:57:13.774 2445 ERROR oslo_messaging.notify.listener File "/usr/lib/python2.7/site-packages/amqp/method_framing.py", line 221, in write_method 2017-02-17 05:57:13.774 2445 ERROR oslo_messaging.notify.listener write_frame(1, channel, payload) 2017-02-17 05:57:13.774 2445 ERROR oslo_messaging.notify.listener File "/usr/lib/python2.7/site-packages/amqp/transport.py", line 182, in write_frame 2017-02-17 05:57:13.774 2445 ERROR oslo_messaging.notify.listener frame_type, channel, size, payload, 0xce, 2017-02-17 05:57:13.774 2445 ERROR oslo_messaging.notify.listener File "/usr/lib/python2.7/site-packages/amqp/transport.py", line 254, in _write 2017-02-17 05:57:13.774 2445 ERROR oslo_messaging.notify.listener n = write(s) 2017-02-17 05:57:13.774 2445 ERROR oslo_messaging.notify.listener File "/usr/lib64/python2.7/ssl.py", line 671, in write 2017-02-17 05:57:13.774 2445 ERROR oslo_messaging.notify.listener return self._sslobj.write(data) 2017-02-17 05:57:13.774 2445 ERROR oslo_messaging.notify.listener error: [Errno 14] Bad address ~~~ Similarly I do not see any output in gnocchi command ~~~ # gnocchi metric list <EMPTY OUTPUT> ~~~ But, I see ceilometer samples for cpu_util & memory ~~~ # ceilometer meter-list -l 10 +---------------------+-------+-------+--------------------------------------+----------------------------------+----------------------------------+ | Name | Type | Unit | Resource ID | User ID | Project ID | +---------------------+-------+-------+--------------------------------------+----------------------------------+----------------------------------+ | disk.ephemeral.size | gauge | GB | 66e2e05a-0804-4487-84f7-b4e40c1c1801 | ca5b529ed0ff4be580d5bbcda59fcc93 | 6738642fac97418ea3bc41e76952bfb1 | | disk.root.size | gauge | GB | 66e2e05a-0804-4487-84f7-b4e40c1c1801 | ca5b529ed0ff4be580d5bbcda59fcc93 | 6738642fac97418ea3bc41e76952bfb1 | | image | gauge | image | c1df4f4f-09b8-4221-9ff1-337a0b634892 | None | 156762f2978e4a7bbc31cf840f615cb1 | | image.download | delta | B | c1df4f4f-09b8-4221-9ff1-337a0b634892 | None | 156762f2978e4a7bbc31cf840f615cb1 | | image.serve | delta | B | c1df4f4f-09b8-4221-9ff1-337a0b634892 | None | 156762f2978e4a7bbc31cf840f615cb1 | | image.size | gauge | B | c1df4f4f-09b8-4221-9ff1-337a0b634892 | None | 156762f2978e4a7bbc31cf840f615cb1 | | memory | gauge | MB | 66e2e05a-0804-4487-84f7-b4e40c1c1801 | ca5b529ed0ff4be580d5bbcda59fcc93 | 6738642fac97418ea3bc41e76952bfb1 | | memory | gauge | MB | e560e649-41fd-46a2-a3d2-5f4750ba2bb4 | ca5b529ed0ff4be580d5bbcda59fcc93 | 6738642fac97418ea3bc41e76952bfb1 | | vcpus | gauge | vcpu | 66e2e05a-0804-4487-84f7-b4e40c1c1801 | ca5b529ed0ff4be580d5bbcda59fcc93 | 6738642fac97418ea3bc41e76952bfb1 | | vcpus | gauge | vcpu | e560e649-41fd-46a2-a3d2-5f4750ba2bb4 | ca5b529ed0ff4be580d5bbcda59fcc93 | 6738642fac97418ea3bc41e76952bfb1 | +---------------------+-------+-------+--------------------------------------+----------------------------------+----------------------------------+ # openstack server list +--------------------------------------+------+--------+---------------------+------------+ | ID | Name | Status | Networks | Image Name | +--------------------------------------+------+--------+---------------------+------------+ | e560e649-41fd-46a2-a3d2-5f4750ba2bb4 | t1 | ACTIVE | public=172.24.4.231 | cirros | +--------------------------------------+------+--------+---------------------+------------+ [root@localhost rabbitmq(keystone_admin)]# ceilometer sample-list -m cpu_util +--------------------------------------+----------+-------+---------------+------+----------------------------+ | Resource ID | Name | Type | Volume | Unit | Timestamp | +--------------------------------------+----------+-------+---------------+------+----------------------------+ | e560e649-41fd-46a2-a3d2-5f4750ba2bb4 | cpu_util | gauge | 3.13665403118 | % | 2017-02-17T06:37:13.571000 | | e560e649-41fd-46a2-a3d2-5f4750ba2bb4 | cpu_util | gauge | 2.71087971235 | % | 2017-02-17T06:27:13.568000 | +--------------------------------------+----------+-------+---------------+------+----------------------------+ # ceilometer sample-list -m memory -l 5 +--------------------------------------+--------+-------+--------+------+----------------------------+ | Resource ID | Name | Type | Volume | Unit | Timestamp | +--------------------------------------+--------+-------+--------+------+----------------------------+ | e560e649-41fd-46a2-a3d2-5f4750ba2bb4 | memory | gauge | 512.0 | MB | 2017-02-17T06:00:21.001000 | | e560e649-41fd-46a2-a3d2-5f4750ba2bb4 | memory | gauge | 512.0 | MB | 2017-02-17T05:59:28.582000 | | e560e649-41fd-46a2-a3d2-5f4750ba2bb4 | memory | gauge | 512.0 | MB | 2017-02-17T05:59:28.564000 | | e560e649-41fd-46a2-a3d2-5f4750ba2bb4 | memory | gauge | 512.0 | MB | 2017-02-17T05:59:21.642000 | | e560e649-41fd-46a2-a3d2-5f4750ba2bb4 | memory | gauge | 512.0 | MB | 2017-02-17T05:59:21.556000 | +--------------------------------------+--------+-------+--------+------+----------------------------+ ~~~ Version-Release number of selected component (if applicable): rabbitmq-server version: 3.6.3(6.el7ost) How reproducible: Always reproducible Steps to Reproduce: 1. Enable SSL Actual results: ~~~ 2017-02-17 05:57:13.774 2445 ERROR oslo_messaging.notify.listener File "/usr/lib64/python2.7/ssl.py", line 671, in write 2017-02-17 05:57:13.774 2445 ERROR oslo_messaging.notify.listener return self._sslobj.write(data) 2017-02-17 05:57:13.774 2445 ERROR oslo_messaging.notify.listener error: [Errno 14] Bad address ~~~ Expected results: Should not see error Additional info: - Customer as secure environment. Can't ask for log. But was reproduced internally on single-node enviroment(packstack). Customer encounters exact error. - Customer can't see ceilometer samples where they are visible in test environment. - Generated Sosreport
/var/log/rabbitmq/rabbit ~~~ =INFO REPORT==== 17-Feb-2017::08:17:15 === accepting AMQP connection <0.22380.0> (192.168.100.222:35502 -> 192.168.100.222:5671) =ERROR REPORT==== 17-Feb-2017::08:17:16 === SSL: connection: ssl_cipher.erl:297:Fatal error: bad record mac =WARNING REPORT==== 17-Feb-2017::08:17:16 === closing AMQP connection <0.22380.0> (192.168.100.222:35502 -> 192.168.100.222:5671): client unexpectedly closed TCP connection =INFO REPORT==== 17-Feb-2017::08:17:17 === accepting AMQP connection <0.22502.0> (192.168.100.222:35504 -> 192.168.100.222:5671) =ERROR REPORT==== 17-Feb-2017::08:27:13 === SSL: connection: ssl_cipher.erl:297:Fatal error: bad record mac =ERROR REPORT==== 17-Feb-2017::08:27:13 === closing AMQP connection <0.22502.0> (192.168.100.222:35504 -> 192.168.100.222:5671): {inet_error,{tls_alert,"bad record mac"}} =INFO REPORT==== 17-Feb-2017::08:27:14 === accepting AMQP connection <0.22886.0> (192.168.100.222:35686 -> 192.168.100.222:5671) =INFO REPORT==== 17-Feb-2017::08:37:13 === closing AMQP connection <0.20785.0> (192.168.100.222:34896 -> 192.168.100.222:5671) =INFO REPORT==== 17-Feb-2017::08:37:13 === closing AMQP connection <0.20749.0> (192.168.100.222:34880 -> 192.168.100.222:5671) ~~~
According to RabbitMQ docs this could be an issue on a client's side: https://www.rabbitmq.com/troubleshooting-ssl.html ================================ Entries containing {tls_alert,"bad record mac"} The server has tried verifying integrity of a piece of data it received and the check failed. This can be due to problematic network equipment, unintentional socket sharing in the client (e.g. due to the use of fork(2)) or a bug in the client implementation of TLS. ================================ Is it possible that oslo.messaging is trying to send unencrypted data over TLS-connection? Let's say it passes FD to another thread where it uses as a plain TCP socket?
On the client side, what are the versions of the following packages? * openssl * Python * eventlet * python-oslo-messaging Moreover, what is your OS and OS version?
Sachin was able to reproduce the "Bad address" error on a simple OSP 10 infra: everything running on the same host. The bug is specific to ceilometer: keystone, nova and neutron work fine. It seems like the bug comes from Oslo Messaging using SSL and the threading driver: since OSP 10, ceilometer and gnocchi don't use eventlet nor oslo.service anymore, but native threads.
openssl = 1.0.1e(Revision: 1.0.1e) Python = 2.7.5(48.el7) No eventlet python-oslo-messaging = 5.10.0(Release: 5.el7ost) OS info ~~~ cat /etc/os-release NAME="Red Hat Enterprise Linux Server" VERSION="7.3 (Maipo)" ID="rhel" ID_LIKE="fedora" VERSION_ID="7.3" PRETTY_NAME="Red Hat Enterprise Linux Server 7.3 (Maipo)" ANSI_COLOR="0;31" CPE_NAME="cpe:/o:redhat:enterprise_linux:7.3:GA:server" HOME_URL="https://www.redhat.com/" BUG_REPORT_URL="https://bugzilla.redhat.com/" REDHAT_BUGZILLA_PRODUCT="Red Hat Enterprise Linux 7" REDHAT_BUGZILLA_PRODUCT_VERSION=7.3 REDHAT_SUPPORT_PRODUCT="Red Hat Enterprise Linux" REDHAT_SUPPORT_PRODUCT_VERSION="7.3" ~~~
/ additional info / python-amqp: 1.4.6(1.el7ost) python-kombu: 3.0.32(2.el7ost)
Ceilometer calls oslo.messaging from different threads. The rabbitmq driver of Oslo Messaging uses kombu and amqp, but amqp is not thread-safe: https://github.com/celery/py-amqp/issues/16 Mehdi Abaakouk is working on a patch to make Oslo Messaging again thread-safe: use a queue to make sure that an AMQP connection is always used in the same thread. https://review.openstack.org/436958
Mehdi fixed a race condition in Oslo Messaging to make it thread-safe: https://review.openstack.org/#/c/436958/ His patch now LGTM and can be backported, but Fabio needs to check the rhos-10.0.z flag.
I was able to reproduce the bug by restarting Ceilometer services: --- systemctl stop openstack-ceilometer-collector.service systemctl stop openstack-ceilometer-notification.service systemctl stop openstack-ceilometer-central.service systemctl start openstack-ceilometer-central.service systemctl start openstack-ceilometer-notification.service systemctl start openstack-ceilometer-collector.service grep 'Bad address' /var/log/ceilometer/collector.log |tail --- Mehdi modified /etc/ceilometer/pipeline.yaml to run the pipeline every 30 seconds, instead of running it every 10 minutes: ========================== --- pipeline.yaml.orig 2017-02-22 22:42:16.224649361 +0100 +++ pipeline.yaml 2017-02-22 22:42:04.314370871 +0100 @@ -1,17 +1,20 @@ --- sources: - name: meter_source + interval: 30 meters: - "*" sinks: - meter_sink - name: cpu_source + interval: 30 meters: - "cpu" sinks: - cpu_sink - cpu_delta_sink - name: disk_source + interval: 30 meters: - "disk.read.bytes" - "disk.read.requests" @@ -24,6 +27,7 @@ sinks: - disk_sink - name: network_source + interval: 30 meters: - "network.incoming.bytes" - "network.incoming.packets" ========================== => See attached pipeline.yaml I built python-oslo-messaging-5.10.1-2.el7ost which includes https://review.openstack.org/#/c/436958/ fix (patch set 18). With this package, I'm unable to reproduce the bug anymore. I also tested that keystone, nova, neutron, ceilometer and gnocchi still answer (display data) after a reboot with these commands: --- openstack user list # keystone nova list neutron subnet-list ceilometer sample-list -m cpu_util gnocchi resource list ---
Created attachment 1256731 [details] Modified Ceilometer config to reproduce the bug more quickly
Code Verified.
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHBA-2017:1587