Bug 1700044
| Summary: | [osp15] rabbitmq connections fail due to missed heartbeats | | |
|---|---|---|---|
| Product: | Red Hat OpenStack | Reporter: | Alex Schultz <aschultz> |
| Component: | openstack-tripleo-common | Assignee: | Adriano Petrich <apetrich> |
| Status: | CLOSED ERRATA | QA Contact: | pkomarov |
| Severity: | high | Docs Contact: | |
| Priority: | high | | |
| Version: | 15.0 (Stein) | CC: | ahrechan, apetrich, atonner, dciabrin, hberaud, jeckersb, lmiccini, mburns, mcornea, michele, pkomarov, slinaber, smalleni |
| Target Milestone: | beta | Keywords: | Triaged |
| Target Release: | 15.0 (Stein) | | |
| Hardware: | Unspecified | | |
| OS: | Unspecified | | |
| Whiteboard: | | | |
| Fixed In Version: | openstack-tripleo-common-10.7.1-0.20190522180807.438b9fb.el8ost | Doc Type: | No Doc Update |
| Doc Text: | | Story Points: | --- |
| Clone Of: | | Environment: | |
| Last Closed: | 2019-09-21 11:21:11 UTC | Type: | Bug |
| Regression: | --- | Mount Type: | --- |
| Documentation: | --- | CRM: | |
| Verified Versions: | | Category: | --- |
| oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: | |
| Cloudforms Team: | --- | Target Upstream Version: | |
| Embargoed: | | | |
Description
Alex Schultz
2019-04-15 16:39:24 UTC
I think these lost heartbeats, and the erratic exchange of AMQP heartbeats in general between rabbitmq and openstack services, are due to a subtle change in how we run the containers' healthchecks with systemd and podman. Some context and reminder information below:

1) When an OpenStack service is connected to rabbitmq, they both exchange AMQP heartbeat packets when there has been no AMQP traffic for a while, to check whether the other side is still alive. rabbitmq sends a heartbeat packet every 30s and will forcibly close a connection after two consecutive missed heartbeats, i.e. no answer from the client after 60s.

2) OpenStack services rely on eventlet to schedule an asynchronous, recurring greenlet that sends AMQP heartbeats to keep the rabbitmq connection alive while the AMQP socket is idle.

3) In TripleO, OpenStack services run under Apache mod_wsgi, which is a native executable that embeds its own Python interpreter:
. mod_wsgi calls the embedded Python interpreter whenever there is an OpenStack API call to process.
. While the interpreter is running, eventlet checks whether the timer for executing the AMQP heartbeat greenlet has been reached or exceeded, and runs the greenlet as necessary.
. Once the API call is processed, the Python interpreter returns and stays idle until mod_wsgi receives another request to serve.

4) In TripleO, all OpenStack services are containerized and are monitored by a healthcheck script that checks whether the mod_wsgi process is running and responding to a simple HTTP request.

Now as to why we experience erratic AMQP heartbeats: currently the AMQP heartbeat greenlet is allowed to run every 15 seconds [1], in case heartbeats must be sent. But for that timer to be watched by eventlet, the embedded Python interpreter must be running inside mod_wsgi, which only happens when mod_wsgi is processing some traffic. Thanks to container healthchecks, we have a means of forcing mod_wsgi to wake up at regular intervals and resume its embedded Python interpreter even when there is no other traffic to the API service.

Up to OSP 14, the container's healthcheck script is run by docker every 30 seconds, which guarantees that eventlet will wake up and send an AMQP heartbeat packet if there was no AMQP traffic in the last 15s. It also guarantees that any incoming AMQP heartbeat packet from rabbitmq will be processed in at most 30s.

In OSP 15, container healthchecks are now implemented via systemd timers, and the current setting [2] allows systemd to start the healthcheck every 60s with a randomized 45s delay on top, which is too long to guarantee that mod_wsgi will always wake up in time to send/receive AMQP heartbeats to/from rabbitmq.

Tweaking the healthcheck to match the check_interval from OSP 14 and before should be enough to fix the issue.

[1] https://github.com/openstack/oslo.messaging/blob/f276f0853afcda949a665d2fa12d712e96d7115f/oslo_messaging/_drivers/impl_rabbit.py#L524
[2] https://github.com/openstack/paunch/blob/95ad4705f12f0da30c91adf7c5e218a8a81a5bd0/paunch/utils/systemd.py#L205
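To make the timing argument above concrete, here is a minimal back-of-the-envelope check in Python. It only restates the figures quoted in this comment (a 60s rabbitmq disconnect window, a 30s docker healthcheck interval, a 60s systemd interval with up to 45s of randomized delay); nothing in it is measured from a deployment or taken from the tripleo code.

```python
# Back-of-the-envelope check of the timing argument above. The figures are
# the ones quoted in this comment, not measurements from a deployment.
RABBITMQ_DISCONNECT_AFTER = 60  # rabbitmq closes the connection after 2 missed 30s heartbeats


def worst_case_wakeup_gap(check_interval, randomized_delay=0):
    """Longest time mod_wsgi's embedded interpreter may stay idle between
    two healthcheck-triggered wakeups."""
    return check_interval + randomized_delay


# OSP 14 and earlier: docker runs the healthcheck every 30s.
print(worst_case_wakeup_gap(30) < RABBITMQ_DISCONNECT_AFTER)      # True: heartbeats are sent in time
# OSP 15: systemd timer fires every 60s, with up to 45s of randomized delay on top.
print(worst_case_wakeup_gap(60, 45) < RABBITMQ_DISCONNECT_AFTER)  # False: the 60s window can be exceeded
```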
(In reply to Damien Ciabrini from comment #4)
> I think these lost heartbeats and the erratic exchange of AMQP
> heartbeats in general between rabbitmq and openstack services are due
> to a subtle change in how we run the containers' healthchecks with
> systemd and podman.

Actually, I'm taking that comment back... healthchecks are probably not involved.

In my deployment I can see that the heat_api service, which runs under mod_wsgi as well, is immune to that loss of AMQP heartbeats; the AMQP heartbeat greenlet seems to run every 15s as expected.

2019-04-26 08:41:44.732 18 DEBUG amqp [-] heartbeat_tick : for connection 5882801ce77d4e7eb7a8c24ec4acbe1c heartbeat_tick /usr/lib/python3.6/site-packages/amqp/connection.py:700
2019-04-26 08:41:44.732 18 DEBUG amqp [-] heartbeat_tick : Prev sent/recv: 58/134, now - 58/135, monotonic - 95201.275375187, last_heartbeat_sent - 95186.273410011, heartbeat int. - 60 for connection 5882801ce77d4e7eb7a8c24ec4acbe1c heartbeat_tick /usr/lib/python3.6/site-packages/amqp/connection.py:722
2019-04-26 08:41:59.734 18 DEBUG amqp [-] heartbeat_tick : for connection 5882801ce77d4e7eb7a8c24ec4acbe1c heartbeat_tick /usr/lib/python3.6/site-packages/amqp/connection.py:700
2019-04-26 08:41:59.734 18 DEBUG amqp [-] heartbeat_tick : Prev sent/recv: 58/135, now - 58/135, monotonic - 95216.277477823, last_heartbeat_sent - 95186.273410011, heartbeat int. - 60 for connection 5882801ce77d4e7eb7a8c24ec4acbe1c heartbeat_tick /usr/lib/python3.6/site-packages/amqp/connection.py:722
2019-04-26 08:42:14.308 18 DEBUG amqp [-] heartbeat_tick : for connection 44ec1ada6cd94026a721146c3e3a3962 heartbeat_tick /usr/lib/python3.6/site-packages/amqp/connection.py:700
2019-04-26 08:42:14.309 18 DEBUG amqp [-] heartbeat_tick : Prev sent/recv: 48/136, now - 48/137, monotonic - 95230.851789552, last_heartbeat_sent - 95170.858355861, heartbeat int. - 60 for connection 44ec1ada6cd94026a721146c3e3a3962 heartbeat_tick /usr/lib/python3.6/site-packages/amqp/connection.py:722
2019-04-26 08:42:14.736 18 DEBUG amqp [-] heartbeat_tick : for connection 5882801ce77d4e7eb7a8c24ec4acbe1c heartbeat_tick /usr/lib/python3.6/site-packages/amqp/connection.py:700
2019-04-26 08:42:14.736 18 DEBUG amqp [-] heartbeat_tick : Prev sent/recv: 58/135, now - 58/136, monotonic - 95231.279408946, last_heartbeat_sent - 95186.273410011, heartbeat int. - 60 for connection 5882801ce77d4e7eb7a8c24ec4acbe1c heartbeat_tick /usr/lib/python3.6/site-packages/amqp/connection.py:722
2019-04-26 08:42:29.738 18 DEBUG amqp [-] heartbeat_tick : for connection 5882801ce77d4e7eb7a8c24ec4acbe1c heartbeat_tick /usr/lib/python3.6/site-packages/amqp/connection.py:700
2019-04-26 08:42:29.738 18 DEBUG amqp [-] heartbeat_tick : Prev sent/recv: 58/136, now - 58/136, monotonic - 95246.281495191, last_heartbeat_sent - 95186.273410011, heartbeat int. - 60 for connection 5882801ce77d4e7eb7a8c24ec4acbe1c heartbeat_tick /usr/lib/python3.6/site-packages/amqp/connection.py:722

I can see that heat_api_wsgi creates AMQP connections when I use Heat (e.g. with 'heat service-list'):

[root@undercloud-0 ~]# ps -ef | grep heat_api_wsgi
42418 222454 222443 0 07:29 ? 00:00:04 heat_api_wsgi -DFOREGROUND

[root@undercloud-0 ~]# ss -tnp | grep 5672 | grep httpd
ESTAB 0 0 192.168.24.1:58900 192.168.24.1:5672 users:(("httpd",pid=222454,fd=18))
ESTAB 0 0 192.168.24.1:58898 192.168.24.1:5672 users:(("httpd",pid=222454,fd=17))

When inspecting the process' callstack, I see that the WSGI thread is idle, waiting for an incoming OpenStack API call to process:

Thread 4 (Thread 0x7f13b2b57700 (LWP 222465)):
#0 0x00007f13bd0ffd97 in epoll_wait () from target:/lib64/libc.so.6
#1 0x00007f13bd81272a in impl_pollset_poll () from target:/lib64/libapr-1.so.0
#2 0x00007f13b4e2d5f2 in wsgi_daemon_thread () from target:/etc/httpd/modules/mod_wsgi_python3.so
#3 0x00007f13bd5d32de in start_thread () from target:/lib64/libpthread.so.0
#4 0x00007f13bd0ffa63 in clone () from target:/lib64/libc.so.6

But I also see two additional threads which run the Python interpreter:

Thread 6 (Thread 0x7f139ee04700 (LWP 238346)):
#0 0x00007f13bd5dc032 in do_futex_wait () from target:/lib64/libpthread.so.0
#1 0x00007f13bd5dc143 in __new_sem_wait_slow () from target:/lib64/libpthread.so.0
#2 0x00007f13b49a385e in PyThread_acquire_lock_timed () from target:/lib64/libpython3.6m.so.1.0
#3 0x00007f13b4a2109b in lock_PyThread_acquire_lock () from target:/lib64/libpython3.6m.so.1.0
[...]

Thread 5 (Thread 0x7f139f605700 (LWP 238345)):
#0 0x00007f13bd0f47e1 in poll () from target:/lib64/libc.so.6
#1 0x00007f13ab1a75df in sock_recv () from target:/usr/lib64/python3.6/lib-dynload/_socket.cpython-36m-x86_64-linux-gnu.so
#2 0x00007f13b4a23d02 in _PyCFunction_FastCallDict () from target:/lib64/libpython3.6m.so.1.0
#3 0x00007f13b4a31b8d in call_function () from target:/lib64/libpython3.6m.so.1.0
[...]

At this point I think eventlet manages its timer-based greenlets in a dedicated pthread, which is woken up independently from mod_wsgi's traffic, which is good. The question now is why we see those threads in heat_api_wsgi and not in nova_api_wsgi. As hinted in [1], this could be due to eventlet monkey_patching; I'm trying to verify that theory now.

[1] http://lists.openstack.org/pipermail/openstack-discuss/2019-April/005413.html

OK, I found the discrepancy between the heat_api container (which works as expected on my environment) and the nova_api container (which processes AMQP heartbeats erratically):

. When I instrument the AMQP heartbeat thread [1] in the heat_api container, I see that it checks the content of the AMQP socket every 15s by waiting on self._heartbeat_exit_event, which is a threading.Event.
. In the nova_api container, the variable self._heartbeat_exit_event is instead of type oslo_utils.eventletutils.EventletEvent, and rather than waiting 15s as requested, execution sometimes resumes 60s later, sometimes even later than that.

I don't know if that object type is expected (it looks like it isn't). I'm going to check how nova configures eventlet and whether changing the location of eventlet.monkey_patch() fixes this behaviour, or whether something else must be fixed.

PS: so far, the monkey patching fix mentioned in the mailing list discussion [2] upstream doesn't seem to fix the issue for me. I will double check that I applied it properly.

[1] https://github.com/openstack/oslo.messaging/blob/433d34ed4149ad2dcbb860b07754dea25cb43407/oslo_messaging/_drivers/impl_rabbit.py#L947
[2] http://lists.openstack.org/pipermail/openstack-discuss/2019-April/005413.html

That would certainly explain it. It seems like heat_api isn't using eventlet at all under mod_wsgi?
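The Event-type discrepancy described above can be illustrated outside of the services with a short snippet. This is a hypothetical diagnostic, not code from nova, heat or oslo.messaging, and it assumes that oslo_utils.eventletutils.Event() decides which implementation to return at call time based on the monkey-patch state of the thread module, as the next comment explains.

```python
# Hypothetical diagnostic, assuming oslo.utils and eventlet are installed.
# It only illustrates the Event-type discrepancy described above.
from oslo_utils import eventletutils

# Before any monkey patching, the factory falls back to the stdlib event,
# which is what the heat_api container shows (threading.Event).
print(type(eventletutils.Event()))

# After eventlet monkey-patches the thread module, the same call is expected
# to return an EventletEvent, which is what the nova_api container shows.
import eventlet
eventlet.monkey_patch(thread=True)
print(type(eventletutils.Event()))
```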
The eventletutils.Event function chooses which implementation to use based on whether or not the thread module has been monkey-patched: https://github.com/openstack/oslo.utils/blob/master/oslo_utils/eventletutils.py#L183

So if you're seeing the threading.Event implementation in heat_api, then it hasn't patched the thread module (or the logic that checks whether the module has been patched is broken somehow).

FYI, also take care about this: oslo.messaging waits for some defined exceptions [1] to ensure the connection with the rabbit cluster nodes, but kombu redefines some possible stdlib exceptions, such as:
- socket.timeout (rebranded as kombu.exceptions.TimeoutError [2]), possibly raised by amqp, so oslo.messaging can't catch the exception properly because of the namespace of the raised exception in that case (kombu.exceptions.TimeoutError);
- some other amqp exceptions [3] are also rebranded by kombu and not properly expected on the oslo.messaging side; I guess we also need to catch kombu.exceptions.ConnectionError and amqp.ConnectionError in that scenario (kombu.exceptions.ConnectionError = amqp.ConnectionError).

My analysis comes from this BZ [4], where a lot of log entries appear when we power off one node of a 3-node controller cluster. That BZ is against OSP13, but the exception management is the same in the latest version. The heartbeat doesn't work properly because of these redefined exceptions: we never call the ensure_connection function, and the connection is never switched to another destination. It's not the same problem as yours, but depending on your tests you could fall into a similar scenario, so just be aware of it :) I need to do more tests on my side to check exactly which exceptions we need to catch, and I'll submit a patch soon to fix my BZ and the behaviour described here.

[1] https://github.com/openstack/oslo.messaging/blob/433d34ed4149ad2dcbb860b07754dea25cb43407/oslo_messaging/_drivers/impl_rabbit.py#L937
[2] https://github.com/celery/kombu/blob/master/kombu/exceptions.py#L4
[3] https://github.com/celery/kombu/blob/master/kombu/exceptions.py#L6
[4] https://bugzilla.redhat.com/show_bug.cgi?id=1665399
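As a rough sketch of the broader exception handling argued for above (not the actual oslo.messaging code nor the eventual patch), the snippet below catches kombu's rebranded exceptions alongside the stdlib and amqp ones, so that a rebranded error no longer prevents ensure_connection from being called. The connection argument is assumed to be a kombu Connection, and the helper name run_heartbeat_check is made up for illustration.

```python
# Hypothetical sketch only: illustrates catching kombu's rebranded exceptions
# alongside the stdlib/amqp ones, as suggested in the comment above. This is
# not the actual oslo.messaging implementation.
import socket

import amqp
import kombu.exceptions

RECOVERABLE_ERRORS = (
    socket.timeout,                    # stdlib timeout raised on the socket
    kombu.exceptions.TimeoutError,     # kombu's rebranding of the socket timeout [2]
    amqp.ConnectionError,              # raised by py-amqp on broken connections
    kombu.exceptions.ConnectionError,  # kombu's alias of amqp.ConnectionError
)


def run_heartbeat_check(connection):
    """Run a heartbeat check and re-establish the connection on failure."""
    try:
        connection.heartbeat_check()
    except RECOVERABLE_ERRORS:
        # If kombu's aliases were missing from the tuple above, a rebranded
        # exception would escape this handler and ensure_connection() would
        # never be called, so the client would never fail over to another node.
        connection.ensure_connection()
```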
*** Bug 1705730 has been marked as a duplicate of this bug. ***

*** Bug 1705694 has been marked as a duplicate of this bug. ***

*** Bug 1693834 has been marked as a duplicate of this bug. ***

[stack@undercloud-0 ~]$ rhos-release -L
Installed repositories (rhel-8.0):
  15
  ceph-4
  ceph-osd-4
  rhel-8.0

[stack@undercloud-0 ~]$ cat core_puddle_version
RHOS_TRUNK-15.0-RHEL-8-20190523.n.1

[stack@undercloud-0 ~]$ rpm -qa|grep openstack-tripleo-common-10.7.1-0.20190522180807.438b9fb.el8ost
openstack-tripleo-common-10.7.1-0.20190522180807.438b9fb.el8ost.noarch

[stack@undercloud-0 ~]$ ansible overcloud -b -mshell -a"echo 'Current Date:';date;grep -ri 'Too many heartbeats missed' /var/log/"|grep -v grep
[WARNING]: Found both group and host with same name: undercloud

[DEPRECATION WARNING]: Distribution redhat 8.0 on host compute-0 should use /usr/libexec/platform-python, but is using /usr/bin/python for backward compatibility with prior Ansible releases. A future Ansible release will default to using the discovered platform python for this host. See https://docs.ansible.com/ansible/2.8/reference_appendices/interpreter_discovery.html for more information. This feature will be removed in version 2.12. Deprecation warnings can be disabled by setting deprecation_warnings=False in ansible.cfg.
compute-0 | CHANGED | rc=0 >>
Current Date:
Tue Jun 4 18:33:30 UTC 2019
Binary file /var/log/journal/07264b1c2c0b42c0b1db6070c41fee41/system.journal matches

[DEPRECATION WARNING]: Distribution redhat 8.0 on host compute-1 should use /usr/libexec/platform-python, but is using /usr/bin/python for backward compatibility with prior Ansible releases. A future Ansible release will default to using the discovered platform python for this host. See https://docs.ansible.com/ansible/2.8/reference_appendices/interpreter_discovery.html for more information. This feature will be removed in version 2.12. Deprecation warnings can be disabled by setting deprecation_warnings=False in ansible.cfg.
compute-1 | CHANGED | rc=0 >>
Current Date:
Tue Jun 4 18:33:30 UTC 2019
Binary file /var/log/journal/971a7ca9e4d6490eb234d2a301401bd7/system.journal matches

[DEPRECATION WARNING]: Distribution redhat 8.0 on host controller-0 should use /usr/libexec/platform-python, but is using /usr/bin/python for backward compatibility with prior Ansible releases. A future Ansible release will default to using the discovered platform python for this host. See https://docs.ansible.com/ansible/2.8/reference_appendices/interpreter_discovery.html for more information. This feature will be removed in version 2.12. Deprecation warnings can be disabled by setting deprecation_warnings=False in ansible.cfg.
controller-0 | CHANGED | rc=0 >>
Current Date:
Tue Jun 4 18:33:31 UTC 2019
Binary file /var/log/journal/393cd21b194e4ea29d9f5aa6634819f1/system.journal matches

[DEPRECATION WARNING]: Distribution redhat 8.0 on host controller-1 should use /usr/libexec/platform-python, but is using /usr/bin/python for backward compatibility with prior Ansible releases. A future Ansible release will default to using the discovered platform python for this host. See https://docs.ansible.com/ansible/2.8/reference_appendices/interpreter_discovery.html for more information. This feature will be removed in version 2.12. Deprecation warnings can be disabled by setting deprecation_warnings=False in ansible.cfg.
controller-1 | CHANGED | rc=0 >>
Current Date:
Tue Jun 4 18:33:31 UTC 2019
Binary file /var/log/journal/e4e7e4ed4a4e41d5b750dffb24d3e85c/system.journal matches

[DEPRECATION WARNING]: Distribution redhat 8.0 on host controller-2 should use /usr/libexec/platform-python, but is using /usr/bin/python for backward compatibility with prior Ansible releases. A future Ansible release will default to using the discovered platform python for this host. See https://docs.ansible.com/ansible/2.8/reference_appendices/interpreter_discovery.html for more information. This feature will be removed in version 2.12. Deprecation warnings can be disabled by setting deprecation_warnings=False in ansible.cfg.
controller-2 | CHANGED | rc=0 >>
Current Date:
Tue Jun 4 18:33:31 UTC 2019
Binary file /var/log/journal/fac0596f05a048439fbd62707a549126/system.journal matches

Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA.

For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHEA-2019:2811