Description of problem:

On a RHEL8 undercloud, when you run an overcloud deployment, rabbitmq connections from various services start being disconnected due to missed heartbeats from the client. However, if you look at the output of netstat, you'll see connections with Recv-Q > 0. I've seen it affect the nova, ironic and mistral services. This originally manifested itself as deployment failures where mistral was missing heartbeats; see Bug 1693834. However, we've seen it even since increasing the timeouts, so it's still a problem. It seems to affect more loaded or slower systems.

[root@undercloud-0 rabbitmq]# netstat -napt | grep 5672
...snip...
tcp   8   0 10.109.1.2:50264  10.109.1.2:5672   ESTABLISHED 43881/python3
tcp   0   0 10.109.1.2:5672   10.109.1.2:50130  ESTABLISHED 22878/beam.smp
tcp   0   0 10.109.1.2:49950  10.109.1.2:5672   ESTABLISHED 38870/python3
tcp   0   0 10.109.1.2:5672   10.109.1.2:50060  ESTABLISHED 22878/beam.smp
tcp   8   0 10.109.1.2:54160  10.109.1.2:5672   ESTABLISHED 38874/python3
tcp   0   0 10.109.1.2:50018  10.109.1.2:5672   ESTABLISHED 43336/l3-agent.log
tcp   8   0 10.109.1.2:50268  10.109.1.2:5672   ESTABLISHED 38804/python3
tcp   8   0 10.109.1.2:55910  10.109.1.2:5672   ESTABLISHED 38870/python3
...snip...
tcp   0   0 10.109.1.2:5672   10.109.1.2:50056  ESTABLISHED 22878/beam.smp
tcp   8   0 10.109.1.2:33440  10.109.1.2:5672   ESTABLISHED 42281/nova_api_wsgi
tcp   0   0 10.109.1.2:5672   10.109.1.2:49984  ESTABLISHED 22878/beam.smp
tcp   8   0 10.109.1.2:56252  10.109.1.2:5672   ESTABLISHED 38871/python3
tcp   0   0 10.109.1.2:49994  10.109.1.2:5672   ESTABLISHED 40413/python3
tcp   0   0 10.109.1.2:5672   10.109.1.2:56244  ESTABLISHED 22878/beam.smp
tcp   32  0 10.109.1.2:57656  10.109.1.2:5672   ESTABLISHED 42280/nova_api_wsgi
...snip...

In the rabbit logs you'll see missed heartbeats from the client...

2019-04-15 16:27:16.677 [error] <0.3388.0> closing AMQP connection <0.3388.0> (10.109.1.2:57610 -> 10.109.1.2:5672 - mod_wsgi:20:544ec298-0b37-4123-87e0-2362416c7183): missed heartbeats from client, timeout: 60s
2019-04-15 16:27:18.856 [error] <0.3420.0> closing AMQP connection <0.3420.0> (10.109.1.2:57642 -> 10.109.1.2:5672 - mod_wsgi:20:f78bdc50-0734-4823-b91b-7c9ac4227fd0): missed heartbeats from client, timeout: 60s
2019-04-15 16:27:25.754 [info] <0.3565.0> accepting AMQP connection <0.3565.0> (10.109.1.2:33356 -> 10.109.1.2:5672)
2019-04-15 16:27:25.756 [info] <0.3565.0> Connection <0.3565.0> (10.109.1.2:33356 -> 10.109.1.2:5672) has a client-provided name: mod_wsgi:20:f78bdc50-0734-4823-b91b-7c9ac4227fd0
2019-04-15 16:27:25.757 [info] <0.3565.0> connection <0.3565.0> (10.109.1.2:33356 -> 10.109.1.2:5672 - mod_wsgi:20:f78bdc50-0734-4823-b91b-7c9ac4227fd0): user 'guest' authenticated and granted access to vhost '/'
2019-04-15 16:28:25.182 [info] <0.3588.0> accepting AMQP connection <0.3588.0> (10.109.1.2:33440 -> 10.109.1.2:5672)
2019-04-15 16:28:25.185 [info] <0.3588.0> Connection <0.3588.0> (10.109.1.2:33440 -> 10.109.1.2:5672) has a client-provided name: mod_wsgi:20:544ec298-0b37-4123-87e0-2362416c7183
2019-04-15 16:28:25.189 [info] <0.3588.0> connection <0.3588.0> (10.109.1.2:33440 -> 10.109.1.2:5672 - mod_wsgi:20:544ec298-0b37-4123-87e0-2362416c7183): user 'guest' authenticated and granted access to vhost '/'
2019-04-15 16:29:11.796 [error] <0.3251.0> closing AMQP connection <0.3251.0> (10.109.1.2:57528 -> 10.109.1.2:5672 - mod_wsgi:19:fde20251-b463-47ff-a1e9-4c43017dc4a1): missed heartbeats from client, timeout: 60s
2019-04-15 16:29:21.203 [error] <0.3432.0> closing AMQP connection <0.3432.0> (10.109.1.2:57656 -> 10.109.1.2:5672 - mod_wsgi:19:564a2fe0-6ccd-4398-a160-eb096ba3d283): missed heartbeats from client, timeout: 60s
2019-04-15 16:30:28.408 [info] <0.3614.0> accepting AMQP connection <0.3614.0> (10.109.1.2:33712 -> 10.109.1.2:5672)
2019-04-15 16:30:28.421 [info] <0.3614.0> Connection <0.3614.0> (10.109.1.2:33712 -> 10.109.1.2:5672) has a client-provided name: mod_wsgi:19:564a2fe0-6ccd-4398-a160-eb096ba3d283
2019-04-15 16:30:28.423 [info] <0.3614.0> connection <0.3614.0> (10.109.1.2:33712 -> 10.109.1.2:5672 - mod_wsgi:19:564a2fe0-6ccd-4398-a160-eb096ba3d283): user 'guest' authenticated and granted access to vhost '/'
2019-04-15 16:31:25.191 [error] <0.3588.0> closing AMQP connection <0.3588.0> (10.109.1.2:33440 -> 10.109.1.2:5672 - mod_wsgi:20:544ec298-0b37-4123-87e0-2362416c7183): missed heartbeats from client, timeout: 60s
2019-04-15 16:31:25.760 [error] <0.3565.0> closing AMQP connection <0.3565.0> (10.109.1.2:33356 -> 10.109.1.2:5672 - mod_wsgi:20:f78bdc50-0734-4823-b91b-7c9ac4227fd0): missed heartbeats from client, timeout: 60s
2019-04-15 16:31:27.054 [info] <0.3640.0> accepting AMQP connection <0.3640.0> (10.109.1.2:33784 -> 10.109.1.2:5672)
2019-04-15 16:31:27.066 [info] <0.3640.0> Connection <0.3640.0> (10.109.1.2:33784 -> 10.109.1.2:5672) has a client-provided name: mod_wsgi:19:fde20251-b463-47ff-a1e9-4c43017dc4a1
2019-04-15 16:31:27.067 [info] <0.3640.0> connection <0.3640.0> (10.109.1.2:33784 -> 10.109.1.2:5672 - mod_wsgi:19:fde20251-b463-47ff-a1e9-4c43017dc4a1): user 'guest' authenticated and granted access to vhost '/'
2019-04-15 16:32:57.777 [error] <0.2891.0> closing AMQP connection <0.2891.0> (10.109.1.2:48034 -> 10.109.1.2:5672 - mistral-server:7:8eafa1b3-63bd-4e57-b0a7-9107043200bb): missed heartbeats from client, timeout: 60s

In the OpenStack service logs you'll see "Server unexpectedly closed connection"...

[root@undercloud-0 containers]# grep oslo.messaging * -r | grep -E ' (ERROR|WARNING) '
mistral/executor.log:2019-04-15 16:33:00.471 7 WARNING oslo.messaging._drivers.impl_rabbit [-] Unexpected error during heartbeart thread processing, retrying...: OSError: Server unexpectedly closed connection
mistral/executor.log:2019-04-15 16:34:26.005 7 ERROR oslo.messaging._drivers.impl_rabbit [-] [7b83de9e-9a0d-4f11-8b5d-afe8ed97e22e] AMQP server on undercloud-0.ctlplane.localdomain:5672 is unreachable: [Errno 104] Connection reset by peer. Trying again in 1 seconds.: ConnectionResetError: [Errno 104] Connection reset by peer
mistral/executor.log:2019-04-15 16:34:26.007 7 WARNING oslo.messaging._drivers.impl_rabbit [-] Unexpected error during heartbeart thread processing, retrying...: ConnectionResetError: [Errno 104] Connection reset by peer
nova/nova-api.log:2019-04-15 16:02:07.857 20 WARNING oslo.messaging._drivers.impl_rabbit [-] Unexpected error during heartbeart thread processing, retrying...: OSError: Server unexpectedly closed connection
nova/nova-api.log:2019-04-15 16:16:01.207 20 WARNING oslo.messaging._drivers.impl_rabbit [-] Unexpected error during heartbeart thread processing, retrying...: ConnectionResetError: [Errno 104] Connection reset by peer
nova/nova-api.log:2019-04-15 16:27:24.364 20 WARNING oslo.messaging._drivers.impl_rabbit [-] Unexpected error during heartbeart thread processing, retrying...: OSError: Server unexpectedly closed connection
nova/nova-api.log:2019-04-15 16:27:24.365 20 WARNING oslo.messaging._drivers.impl_rabbit [-] Unexpected error during heartbeart thread processing, retrying...: OSError: Server unexpectedly closed connection
nova/nova-api.log:2019-04-15 16:27:24.740 20 ERROR oslo.messaging._drivers.impl_rabbit [req-82a5fdc7-5fe8-4314-a31c-394eabdc418d b36fa1be80ae4275bb4511bb74d78eff 89d60bdd36ab49c5b0720eb43fa7d1dd - default default] [f78bdc50-0734-4823-b91b-7c9ac4227fd0] AMQP server on undercloud-0.ctlplane.localdomain:5672 is unreachable: Server unexpectedly closed connection. Trying again in 1 seconds.: OSError: Server unexpectedly closed connection
nova/nova-api.log:2019-04-15 16:30:27.358 19 WARNING oslo.messaging._drivers.impl_rabbit [-] Unexpected error during heartbeart thread processing, retrying...: OSError: Server unexpectedly closed connection
nova/nova-api.log:2019-04-15 16:30:27.359 19 WARNING oslo.messaging._drivers.impl_rabbit [-] Unexpected error during heartbeart thread processing, retrying...: OSError: Server unexpectedly closed connection
nova/nova-api.log:2019-04-15 16:30:27.393 19 ERROR oslo.messaging._drivers.impl_rabbit [req-6ffb7d2d-e22b-4d2a-b7f0-f433cb38578d b36fa1be80ae4275bb4511bb74d78eff 89d60bdd36ab49c5b0720eb43fa7d1dd - default default] [564a2fe0-6ccd-4398-a160-eb096ba3d283] AMQP server on undercloud-0.ctlplane.localdomain:5672 is unreachable: Server unexpectedly closed connection. Trying again in 1 seconds.: OSError: Server unexpectedly closed connection
nova/nova-api.log:2019-04-15 16:33:30.183 20 WARNING oslo.messaging._drivers.impl_rabbit [-] Unexpected error during heartbeart thread processing, retrying...: OSError: Server unexpectedly closed connection
nova/nova-api.log:2019-04-15 16:33:30.184 20 WARNING oslo.messaging._drivers.impl_rabbit [-] Unexpected error during heartbeart thread processing, retrying...: ConnectionResetError: [Errno 104] Connection reset by peer
nova/nova-api.log:2019-04-15 16:33:30.227 20 ERROR oslo.messaging._drivers.impl_rabbit [req-af8979e1-654f-4082-831f-20a5d2bbf654 b36fa1be80ae4275bb4511bb74d78eff 89d60bdd36ab49c5b0720eb43fa7d1dd - default default] [f78bdc50-0734-4823-b91b-7c9ac4227fd0] AMQP server on undercloud-0.ctlplane.localdomain:5672 is unreachable: Server unexpectedly closed connection. Trying again in 1 seconds.: OSError: Server unexpectedly closed connection

Version-Release number of selected component (if applicable):

[root@undercloud-0 containers]# rpm -qa | grep tripleo
openstack-tripleo-common-containers-10.6.2-0.20190412150355.0ec6518.el8ost.noarch
python3-tripleoclient-heat-installer-11.4.1-0.20190411190358.0ca816d.el8ost.noarch
python3-tripleo-common-10.6.2-0.20190412150355.0ec6518.el8ost.noarch
openstack-tripleo-validations-10.3.1-0.20190412103227.b4c2270.el8ost.noarch
openstack-tripleo-common-10.6.2-0.20190412150355.0ec6518.el8ost.noarch
openstack-tripleo-puppet-elements-10.2.1-0.20190410200345.47aba45.el8ost.noarch
openstack-tripleo-heat-templates-10.4.1-0.20190412000410.b934fdd.el8ost.noarch
puppet-tripleo-10.3.1-0.20190409110339.541e9ea.el8ost.noarch
openstack-tripleo-image-elements-10.3.1-0.20190325204940.253fe88.el8ost.noarch
ansible-tripleo-ipsec-9.0.1-0.20190220162047.f60ad6c.el8ost.noarch
ansible-role-tripleo-modify-image-1.0.1-0.20190402220346.012209a.el8ost.noarch
python3-tripleoclient-11.4.1-0.20190411190358.0ca816d.el8ost.noarch

[root@undercloud-0 containers]# rpm -qa | grep rabbitmq
puppet-rabbitmq-9.0.1-0.20190321232600.5477e69.el8ost.noarch

[root@undercloud-0 containers]# podman exec -it -u root rabbitmq rpm -qa | grep -E "(rabbit|erlang)"
erlang-kernel-20.3.8.18-2.el8ost.x86_64
erlang-syntax_tools-20.3.8.18-2.el8ost.x86_64
erlang-asn1-20.3.8.18-2.el8ost.x86_64
erlang-tools-20.3.8.18-2.el8ost.x86_64
erlang-os_mon-20.3.8.18-2.el8ost.x86_64
erlang-erts-20.3.8.18-2.el8ost.x86_64
erlang-stdlib-20.3.8.18-2.el8ost.x86_64
erlang-compiler-20.3.8.18-2.el8ost.x86_64
erlang-mnesia-20.3.8.18-2.el8ost.x86_64
erlang-snmp-20.3.8.18-2.el8ost.x86_64
erlang-public_key-20.3.8.18-2.el8ost.x86_64
erlang-ssl-20.3.8.18-2.el8ost.x86_64
erlang-sasl-20.3.8.18-2.el8ost.x86_64
erlang-otp_mibs-20.3.8.18-2.el8ost.x86_64
erlang-parsetools-20.3.8.18-2.el8ost.x86_64
erlang-xmerl-20.3.8.18-2.el8ost.x86_64
puppet-rabbitmq-9.0.1-0.20190321232600.5477e69.el8ost.noarch
erlang-crypto-20.3.8.18-2.el8ost.x86_64
erlang-hipe-20.3.8.18-2.el8ost.x86_64
erlang-runtime_tools-20.3.8.18-2.el8ost.x86_64
erlang-inets-20.3.8.18-2.el8ost.x86_64
erlang-eldap-20.3.8.18-2.el8ost.x86_64
erlang-sd_notify-1.0-5.el8ost.x86_64
rabbitmq-server-3.7.14-1.el8ost.noarch

How reproducible:
Every deploy.

Steps to Reproduce:
1. Deploy a RHEL8 undercloud
2. Run an overcloud deployment
3. Wait
4. Check the logs

Actual results:
rabbitmq connection problems

Expected results:
No connection problems

Additional info:
I think these lost heartbeats, and the erratic exchange of AMQP heartbeats in general between rabbitmq and OpenStack services, are due to a subtle change in how we run the containers' healthchecks with systemd and podman. Some context and reminder information below:

1) When an OpenStack service is connected to rabbitmq, they both exchange AMQP heartbeat packets when there has been no AMQP traffic for a long time, to check whether the other side is still alive. rabbitmq sends a heartbeat packet every 30s and will forcibly close a connection if two consecutive heartbeats are missed, i.e. no answer from the client after 60s.

2) OpenStack services rely on eventlet to schedule an asynchronous, recurring greenlet that sends AMQP heartbeats to keep the rabbitmq connection alive while the AMQP socket is idle.

3) In TripleO, OpenStack services run under Apache mod_wsgi, which is a native executable that embeds its own Python interpreter:
. mod_wsgi calls the embedded Python interpreter whenever there's an OpenStack API call to process.
. While the interpreter is running, eventlet will check whether the timer for executing the AMQP heartbeat greenlet has been reached or exceeded, and run the greenlet as necessary.
. Once the API call is processed, the Python interpreter returns and stays idle until mod_wsgi receives another request to serve.

4) In TripleO, all OpenStack services are containerized and they are monitored by means of a healthcheck script that checks whether the mod_wsgi process is running and responding to a simple HTTP request.

Now as to why we experience erratic AMQP heartbeats:

Currently the AMQP heartbeat greenlet is allowed to run every 15 seconds [1], in case heartbeats must be sent. But for that timer to be watched by eventlet, the embedded Python interpreter must be running inside mod_wsgi, which only happens when mod_wsgi is processing some traffic.

Thanks to container healthchecks, we have a means of forcing mod_wsgi to wake up at regular intervals and resume its embedded Python interpreter even when there's no other traffic to the API service. Up to OSP 14, the container's healthcheck script is run by docker every 30 seconds, which guarantees that eventlet will wake up and send an AMQP heartbeat packet if there was no AMQP traffic in the last 15s. It also guarantees that any incoming AMQP heartbeat packet from rabbitmq will be processed in at most 30s.

In OSP 15, container healthchecks are now implemented via systemd timers, and the current setting [2] allows systemd to start the healthcheck every 60s with a randomized 45s delay on top, which is too long to guarantee that mod_wsgi will always wake up in time to send/receive AMQP heartbeats to/from rabbitmq.

Tweaking the healthcheck to match the check_interval from OSP 14 and before should be enough to fix the issue.

[1] https://github.com/openstack/oslo.messaging/blob/f276f0853afcda949a665d2fa12d712e96d7115f/oslo_messaging/_drivers/impl_rabbit.py#L524
[2] https://github.com/openstack/paunch/blob/95ad4705f12f0da30c91adf7c5e218a8a81a5bd0/paunch/utils/systemd.py#L205
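To make the timing dependency concrete, here is a minimal Python sketch of what that heartbeat greenlet at [1] boils down to. This is a paraphrase, not the actual oslo.messaging code: the real implementation classifies recoverable errors and reconnects, and the 15s figure assumes the default heartbeat_timeout_threshold=60 and heartbeat_rate=2. The point is that the wait only fires on schedule if eventlet (and therefore the embedded interpreter inside mod_wsgi) actually gets to run:

import threading

import kombu


def heartbeat_loop(connection, exit_event,
                   heartbeat_timeout_threshold=60.0, heartbeat_rate=2.0):
    """Simplified stand-in for the oslo.messaging AMQP heartbeat greenlet."""
    # 60 / 2 / 2 = 15s between wake-ups with the default settings.
    wait_timeout = heartbeat_timeout_threshold / heartbeat_rate / 2.0
    while not exit_event.is_set():
        try:
            # Send our own heartbeat and verify the broker's heartbeats.
            connection.heartbeat_check(rate=heartbeat_rate)
            # Read any pending frames (e.g. heartbeats sent by rabbitmq).
            connection.drain_events(timeout=0.001)
        except Exception:
            # The real code only swallows specific recoverable errors and
            # re-establishes the connection; ignoring them keeps this short.
            pass
        # Under mod_wsgi this wait only returns on time if the embedded
        # Python interpreter is resumed often enough for eventlet to run.
        exit_event.wait(timeout=wait_timeout)


# Illustrative wiring only (oslo.messaging spawns the loop itself); the
# broker URL is a generic placeholder.
if __name__ == "__main__":
    conn = kombu.Connection("amqp://guest:guest@localhost:5672//", heartbeat=60)
    conn.connect()
    heartbeat_loop(conn, threading.Event())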
(In reply to Damien Ciabrini from comment #4)
> I think these lost heartbeats and the erratic exchange of AMQP
> heartbeats in general between rabbitmq and openstack services are due
> to a subtle change in how we run the containers' healthchecks with
> systemd and podman.

Actually, I'm taking that comment back... the healthchecks are probably not involved.

In my deployment I can see that the heat_api service, which runs under mod_wsgi as well, is immune to that loss of AMQP heartbeats; the AMQP heartbeat greenlet seems to run every 15s as expected.

2019-04-26 08:41:44.732 18 DEBUG amqp [-] heartbeat_tick : for connection 5882801ce77d4e7eb7a8c24ec4acbe1c heartbeat_tick /usr/lib/python3.6/site-packages/amqp/connection.py:700
2019-04-26 08:41:44.732 18 DEBUG amqp [-] heartbeat_tick : Prev sent/recv: 58/134, now - 58/135, monotonic - 95201.275375187, last_heartbeat_sent - 95186.273410011, heartbeat int. - 60 for connection 5882801ce77d4e7eb7a8c24ec4acbe1c heartbeat_tick /usr/lib/python3.6/site-packages/amqp/connection.py:722
2019-04-26 08:41:59.734 18 DEBUG amqp [-] heartbeat_tick : for connection 5882801ce77d4e7eb7a8c24ec4acbe1c heartbeat_tick /usr/lib/python3.6/site-packages/amqp/connection.py:700
2019-04-26 08:41:59.734 18 DEBUG amqp [-] heartbeat_tick : Prev sent/recv: 58/135, now - 58/135, monotonic - 95216.277477823, last_heartbeat_sent - 95186.273410011, heartbeat int. - 60 for connection 5882801ce77d4e7eb7a8c24ec4acbe1c heartbeat_tick /usr/lib/python3.6/site-packages/amqp/connection.py:722
2019-04-26 08:42:14.308 18 DEBUG amqp [-] heartbeat_tick : for connection 44ec1ada6cd94026a721146c3e3a3962 heartbeat_tick /usr/lib/python3.6/site-packages/amqp/connection.py:700
2019-04-26 08:42:14.309 18 DEBUG amqp [-] heartbeat_tick : Prev sent/recv: 48/136, now - 48/137, monotonic - 95230.851789552, last_heartbeat_sent - 95170.858355861, heartbeat int. - 60 for connection 44ec1ada6cd94026a721146c3e3a3962 heartbeat_tick /usr/lib/python3.6/site-packages/amqp/connection.py:722
2019-04-26 08:42:14.736 18 DEBUG amqp [-] heartbeat_tick : for connection 5882801ce77d4e7eb7a8c24ec4acbe1c heartbeat_tick /usr/lib/python3.6/site-packages/amqp/connection.py:700
2019-04-26 08:42:14.736 18 DEBUG amqp [-] heartbeat_tick : Prev sent/recv: 58/135, now - 58/136, monotonic - 95231.279408946, last_heartbeat_sent - 95186.273410011, heartbeat int. - 60 for connection 5882801ce77d4e7eb7a8c24ec4acbe1c heartbeat_tick /usr/lib/python3.6/site-packages/amqp/connection.py:722
2019-04-26 08:42:29.738 18 DEBUG amqp [-] heartbeat_tick : for connection 5882801ce77d4e7eb7a8c24ec4acbe1c heartbeat_tick /usr/lib/python3.6/site-packages/amqp/connection.py:700
2019-04-26 08:42:29.738 18 DEBUG amqp [-] heartbeat_tick : Prev sent/recv: 58/136, now - 58/136, monotonic - 95246.281495191, last_heartbeat_sent - 95186.273410011, heartbeat int. - 60 for connection 5882801ce77d4e7eb7a8c24ec4acbe1c heartbeat_tick /usr/lib/python3.6/site-packages/amqp/connection.py:722

I can see that heat_api_wsgi creates AMQP connections when I use Heat (e.g. with 'heat service-list'):

[root@undercloud-0 ~]# ps -ef | grep heat_api_wsgi
42418     222454  222443  0 07:29 ?        00:00:04 heat_api_wsgi -DFOREGROUND

[root@undercloud-0 ~]# ss -tnp | grep 5672 | grep httpd
ESTAB 0 0 192.168.24.1:58900 192.168.24.1:5672 users:(("httpd",pid=222454,fd=18))
ESTAB 0 0 192.168.24.1:58898 192.168.24.1:5672 users:(("httpd",pid=222454,fd=17))

When inspecting the process' call stack, I see that the WSGI thread is idle, waiting for an incoming OpenStack API call to process:

Thread 4 (Thread 0x7f13b2b57700 (LWP 222465)):
#0 0x00007f13bd0ffd97 in epoll_wait () from target:/lib64/libc.so.6
#1 0x00007f13bd81272a in impl_pollset_poll () from target:/lib64/libapr-1.so.0
#2 0x00007f13b4e2d5f2 in wsgi_daemon_thread () from target:/etc/httpd/modules/mod_wsgi_python3.so
#3 0x00007f13bd5d32de in start_thread () from target:/lib64/libpthread.so.0
#4 0x00007f13bd0ffa63 in clone () from target:/lib64/libc.so.6

But I also see two additional threads which run the Python interpreter:

Thread 6 (Thread 0x7f139ee04700 (LWP 238346)):
#0 0x00007f13bd5dc032 in do_futex_wait () from target:/lib64/libpthread.so.0
#1 0x00007f13bd5dc143 in __new_sem_wait_slow () from target:/lib64/libpthread.so.0
#2 0x00007f13b49a385e in PyThread_acquire_lock_timed () from target:/lib64/libpython3.6m.so.1.0
#3 0x00007f13b4a2109b in lock_PyThread_acquire_lock () from target:/lib64/libpython3.6m.so.1.0
[...]

Thread 5 (Thread 0x7f139f605700 (LWP 238345)):
#0 0x00007f13bd0f47e1 in poll () from target:/lib64/libc.so.6
#1 0x00007f13ab1a75df in sock_recv () from target:/usr/lib64/python3.6/lib-dynload/_socket.cpython-36m-x86_64-linux-gnu.so
#2 0x00007f13b4a23d02 in _PyCFunction_FastCallDict () from target:/lib64/libpython3.6m.so.1.0
#3 0x00007f13b4a31b8d in call_function () from target:/lib64/libpython3.6m.so.1.0
[...]

At this point I think eventlet manages its timer-based greenlets in a dedicated pthread, which is woken up independently of mod_wsgi's traffic, which is good.

The question now is why we see those threads in heat_api_wsgi and not in nova_api_wsgi. As hinted in [1], this could be due to eventlet monkey-patching; I'm trying to verify that theory now.

[1] http://lists.openstack.org/pipermail/openstack-discuss/2019-April/005413.html
OK, I found the discrepancy between the heat_api container (which works as expected in my environment) and the nova_api container (which processes AMQP heartbeats erratically):

. When I instrument the AMQP heartbeat thread [1] in the heat_api container, I see that it checks the content of the AMQP socket every 15s by waiting on self._heartbeat_exit_event, which is a threading.Event.

. In the case of nova_api, however, the variable self._heartbeat_exit_event is of type oslo_utils.eventletutils.EventletEvent, and instead of waiting 15s as requested, execution sometimes resumes 60s later, sometimes even later than that.

I don't know if that object type is expected (it looks like it isn't). I'm going to check how nova configures eventlet and whether changing the location of eventlet.monkey_patch() fixes this behaviour, or whether something else must be fixed.

PS: so far, the monkey-patching fix mentioned in the upstream mailing list discussion [2] doesn't seem to fix the issue for me. I will double-check that I applied it properly.

[1] https://github.com/openstack/oslo.messaging/blob/433d34ed4149ad2dcbb860b07754dea25cb43407/oslo_messaging/_drivers/impl_rabbit.py#L947
[2] http://lists.openstack.org/pipermail/openstack-discuss/2019-April/005413.html
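For reference, the general constraint being tested here is that eventlet.monkey_patch() has to run before the service code imports threading/socket; modules imported earlier keep references to the unpatched primitives, which can change what helpers like oslo_utils.eventletutils end up returning. A minimal WSGI entry-point sketch of that ordering follows (the application import path is just a placeholder, not the actual nova wsgi script):

import eventlet

# Monkey-patch the stdlib before anything else is imported, so that every
# module (threading, socket, time, ...) seen by the rest of the service is
# the eventlet-aware version.
eventlet.monkey_patch()

# Only import the service code after patching. The module path below is a
# placeholder for whichever WSGI application factory the service uses.
from myservice.wsgi import init_application  # hypothetical module

application = init_application()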
That would certainly explain it. It seems like heat_api isn't using eventlet at all under mod_wsgi?

The eventletutils.Event function chooses which implementation to use based on whether or not the thread module has been monkey-patched:

https://github.com/openstack/oslo.utils/blob/master/oslo_utils/eventletutils.py#L183

So if you're seeing the threading.Event implementation in heat_api, then it hasn't patched the thread module. (Or the logic to check whether the module has been patched is broken somehow.)
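For reference, a paraphrased sketch of that selection logic (the real code lives at the link above; the helper names used here, eventlet.patcher.is_monkey_patched and oslo_utils.eventletutils.EventletEvent, reflect my reading of it, so treat this as an approximation rather than the exact implementation):

import threading

from eventlet import patcher
from oslo_utils import eventletutils


def make_event():
    # If eventlet has monkey-patched the stdlib 'thread' module, a plain
    # threading.Event would block the whole OS thread instead of yielding
    # to other greenlets, so a greenlet-aware event is returned instead.
    if patcher.is_monkey_patched('thread'):
        return eventletutils.EventletEvent()
    return threading.Event()

So whether the heartbeat thread ends up with a threading.Event (heat_api) or an EventletEvent (nova_api) comes down to whether the thread module was monkey-patched by the time the event was created.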
FYI, also take care about this: oslo.messaging watches for a defined set of exceptions [1] to ensure the connection with the rabbit cluster nodes, but kombu redefines some of the possible stdlib exceptions:

- socket.timeout [2], possibly raised by amqp, is re-exported as kombu.exceptions.TimeoutError, so oslo.messaging can't catch the exception properly because of the namespace of the raised exception in that case (kombu.exceptions.TimeoutError),
- some other amqp exceptions [3] are also rebranded by kombu and not properly expected on the oslo.messaging side; I guess we need to also catch kombu.exceptions.ConnectionError and amqp.ConnectionError in that scenario (kombu.exceptions.ConnectionError = amqp.ConnectionError).

My analysis comes from BZ [4], where a lot of log entries appear when we power off one node of a 3-node controller cluster. That BZ is against OSP 13, but the exception handling is the same in the latest version. The heartbeat doesn't work properly because of these redefined exceptions: we never call the ensure_connection function, and the connection destination is never switched.

It's not the same problem as yours, but depending on your tests you may fall into a similar scenario, so just be aware of it :) I need to do more tests on my side to check exactly which exceptions we need to catch, and I'll submit a patch soon to fix my BZ and the behaviour described above.

[1] https://github.com/openstack/oslo.messaging/blob/433d34ed4149ad2dcbb860b07754dea25cb43407/oslo_messaging/_drivers/impl_rabbit.py#L937
[2] https://github.com/celery/kombu/blob/master/kombu/exceptions.py#L4
[3] https://github.com/celery/kombu/blob/master/kombu/exceptions.py#L6
[4] https://bugzilla.redhat.com/show_bug.cgi?id=1665399
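To illustrate the point, here is a sketch of the kind of handling being argued for; it is not the actual oslo.messaging patch, just a minimal example that lists both the kombu aliases and the underlying exception types explicitly so the heartbeat check recovers regardless of which layer raised the error:

import socket

import amqp
import kombu.exceptions

# kombu re-exports some exceptions under its own namespace:
# kombu.exceptions.TimeoutError aliases socket.timeout and
# kombu.exceptions.ConnectionError aliases amqp.ConnectionError.
RECOVERABLE_ERRORS = (
    socket.timeout,
    kombu.exceptions.TimeoutError,     # alias of socket.timeout
    amqp.ConnectionError,
    kombu.exceptions.ConnectionError,  # alias of amqp.ConnectionError
)


def check_heartbeat(connection):
    # 'connection' is assumed to be a kombu.Connection with heartbeats enabled.
    try:
        connection.heartbeat_check()
    except RECOVERABLE_ERRORS:
        # Re-establish the connection (possibly to another cluster node)
        # instead of letting the heartbeat thread silently stop working.
        connection.ensure_connection()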
*** Bug 1705730 has been marked as a duplicate of this bug. ***
*** Bug 1705694 has been marked as a duplicate of this bug. ***
*** Bug 1693834 has been marked as a duplicate of this bug. ***
[stack@undercloud-0 ~]$ rhos-release -L
Installed repositories (rhel-8.0):
  15
  ceph-4
  ceph-osd-4
  rhel-8.0

[stack@undercloud-0 ~]$ cat core_puddle_version
RHOS_TRUNK-15.0-RHEL-8-20190523.n.1

[stack@undercloud-0 ~]$ rpm -qa|grep openstack-tripleo-common-10.7.1-0.20190522180807.438b9fb.el8ost
openstack-tripleo-common-10.7.1-0.20190522180807.438b9fb.el8ost.noarch

[stack@undercloud-0 ~]$ ansible overcloud -b -mshell -a"echo 'Current Date:';date;grep -ri 'Too many heartbeats missed' /var/log/"|grep -v grep
[WARNING]: Found both group and host with same name: undercloud
[DEPRECATION WARNING]: Distribution redhat 8.0 on host compute-0 should use /usr/libexec/platform-python, but is using /usr/bin/python for backward compatibility with prior Ansible releases. A future Ansible release will default to using the discovered platform python for this host. See https://docs.ansible.com/ansible/2.8/reference_appendices/interpreter_discovery.html for more information. This feature will be removed in version 2.12. Deprecation warnings can be disabled by setting deprecation_warnings=False in ansible.cfg.
compute-0 | CHANGED | rc=0 >>
Current Date:
Tue Jun 4 18:33:30 UTC 2019
Binary file /var/log/journal/07264b1c2c0b42c0b1db6070c41fee41/system.journal matches
[DEPRECATION WARNING]: Distribution redhat 8.0 on host compute-1 should use /usr/libexec/platform-python, but is using /usr/bin/python for backward compatibility with prior Ansible releases. A future Ansible release will default to using the discovered platform python for this host. See https://docs.ansible.com/ansible/2.8/reference_appendices/interpreter_discovery.html for more information. This feature will be removed in version 2.12. Deprecation warnings can be disabled by setting deprecation_warnings=False in ansible.cfg.
compute-1 | CHANGED | rc=0 >>
Current Date:
Tue Jun 4 18:33:30 UTC 2019
Binary file /var/log/journal/971a7ca9e4d6490eb234d2a301401bd7/system.journal matches
[DEPRECATION WARNING]: Distribution redhat 8.0 on host controller-0 should use /usr/libexec/platform-python, but is using /usr/bin/python for backward compatibility with prior Ansible releases. A future Ansible release will default to using the discovered platform python for this host. See https://docs.ansible.com/ansible/2.8/reference_appendices/interpreter_discovery.html for more information. This feature will be removed in version 2.12. Deprecation warnings can be disabled by setting deprecation_warnings=False in ansible.cfg.
controller-0 | CHANGED | rc=0 >>
Current Date:
Tue Jun 4 18:33:31 UTC 2019
Binary file /var/log/journal/393cd21b194e4ea29d9f5aa6634819f1/system.journal matches
[DEPRECATION WARNING]: Distribution redhat 8.0 on host controller-1 should use /usr/libexec/platform-python, but is using /usr/bin/python for backward compatibility with prior Ansible releases. A future Ansible release will default to using the discovered platform python for this host. See https://docs.ansible.com/ansible/2.8/reference_appendices/interpreter_discovery.html for more information. This feature will be removed in version 2.12. Deprecation warnings can be disabled by setting deprecation_warnings=False in ansible.cfg.
controller-1 | CHANGED | rc=0 >>
Current Date:
Tue Jun 4 18:33:31 UTC 2019
Binary file /var/log/journal/e4e7e4ed4a4e41d5b750dffb24d3e85c/system.journal matches
[DEPRECATION WARNING]: Distribution redhat 8.0 on host controller-2 should use /usr/libexec/platform-python, but is using /usr/bin/python for backward compatibility with prior Ansible releases. A future Ansible release will default to using the discovered platform python for this host. See https://docs.ansible.com/ansible/2.8/reference_appendices/interpreter_discovery.html for more information. This feature will be removed in version 2.12. Deprecation warnings can be disabled by setting deprecation_warnings=False in ansible.cfg.
controller-2 | CHANGED | rc=0 >>
Current Date:
Tue Jun 4 18:33:31 UTC 2019
Binary file /var/log/journal/fac0596f05a048439fbd62707a549126/system.journal matches
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHEA-2019:2811