Bug 1700044
| Summary: | [osp15] rabbitmq connections fail due to missed heartbeats | | |
|---|---|---|---|
| Product: | Red Hat OpenStack | Reporter: | Alex Schultz <aschultz> |
| Component: | openstack-tripleo-common | Assignee: | Adriano Petrich <apetrich> |
| Status: | CLOSED ERRATA | QA Contact: | pkomarov |
| Severity: | high | Docs Contact: | |
| Priority: | high | | |
| Version: | 15.0 (Stein) | CC: | ahrechan, apetrich, atonner, dciabrin, hberaud, jeckersb, lmiccini, mburns, mcornea, michele, pkomarov, slinaber, smalleni |
| Target Milestone: | beta | Keywords: | Triaged |
| Target Release: | 15.0 (Stein) | | |
| Hardware: | Unspecified | | |
| OS: | Unspecified | | |
| Whiteboard: | | | |
| Fixed In Version: | openstack-tripleo-common-10.7.1-0.20190522180807.438b9fb.el8ost | Doc Type: | No Doc Update |
| Doc Text: | | Story Points: | --- |
| Clone Of: | | Environment: | |
| Last Closed: | 2019-09-21 11:21:11 UTC | Type: | Bug |
| Regression: | --- | Mount Type: | --- |
| Documentation: | --- | CRM: | |
| Verified Versions: | | Category: | --- |
| oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: | |
| Cloudforms Team: | --- | Target Upstream Version: | |
| Embargoed: | | | |
Description
Alex Schultz
2019-04-15 16:39:24 UTC
I think these lost heartbeats, and the erratic exchange of AMQP heartbeats in general between rabbitmq and openstack services, are due to a subtle change in how we run the containers' healthchecks with systemd and podman. Some context and reminder information below:

1) When an OpenStack service is connected to rabbitmq, they both exchange AMQP heartbeat packets when there has been no AMQP traffic for a while, to check whether the other side is still alive. rabbitmq sends a heartbeat packet every 30s and will forcibly close a connection after two consecutive missed heartbeats, i.e. no answer from the client after 60s.

2) OpenStack services rely on eventlet to schedule an asynchronous, recurring greenlet that sends AMQP heartbeats to keep the rabbitmq connection alive while the AMQP socket is idle.

3) In TripleO, OpenStack services run under Apache mod_wsgi, which is a native executable that embeds its own Python interpreter:
. mod_wsgi calls the embedded Python interpreter whenever there is an OpenStack API call to process.
. While the interpreter is running, eventlet checks whether the timer for executing the AMQP heartbeat greenlet has been reached or exceeded, and runs the greenlet as necessary.
. Once the API call is processed, the Python interpreter returns and stays idle until mod_wsgi receives another request to serve.

4) In TripleO, all OpenStack services are containerized and are monitored by a healthcheck script that checks whether the mod_wsgi process is running and responding to a simple HTTP request.

Now as to why we experience erratic AMQP heartbeats: currently the AMQP heartbeat greenlet is allowed to run every 15 seconds [1], in case heartbeats must be sent. But for that timer to be watched by eventlet, the embedded Python interpreter must be running inside mod_wsgi, which only happens when mod_wsgi is processing some traffic. Thanks to container healthchecks, we have a means of forcing mod_wsgi to wake up at regular intervals and resume its embedded Python interpreter even when there is no other traffic to the API service.

Up to OSP 14, the container's healthcheck script is run by docker every 30 seconds, which guarantees that eventlet will wake up and send an AMQP heartbeat packet if there was no AMQP traffic in the last 15s. It also guarantees that any incoming AMQP heartbeat packet from rabbitmq will be processed in at most 30s.

In OSP 15, container healthchecks are now implemented via systemd timers, and the current setting [2] allows systemd to start the healthcheck every 60s with a randomized 45s delay on top, which is too long to guarantee that mod_wsgi will always wake up in time to send/receive AMQP heartbeats to/from rabbitmq.

Tweaking the healthcheck to match the check_interval from OSP 14 and before should be enough to fix the issue.

[1] https://github.com/openstack/oslo.messaging/blob/f276f0853afcda949a665d2fa12d712e96d7115f/oslo_messaging/_drivers/impl_rabbit.py#L524
[2] https://github.com/openstack/paunch/blob/95ad4705f12f0da30c91adf7c5e218a8a81a5bd0/paunch/utils/systemd.py#L205
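To make the timing argument above concrete, here is a minimal back-of-the-envelope check in Python. It only restates the figures quoted in this comment (a 60s rabbitmq disconnect window, a 30s docker healthcheck interval, a 60s systemd interval with up to 45s of randomized delay); nothing in it is measured from a deployment or taken from the tripleo code.

```python
# Back-of-the-envelope check of the timing argument above. The figures are
# the ones quoted in this comment, not measurements from a deployment.
RABBITMQ_DISCONNECT_AFTER = 60  # rabbitmq closes the connection after 2 missed 30s heartbeats


def worst_case_wakeup_gap(check_interval, randomized_delay=0):
    """Longest time mod_wsgi's embedded interpreter may stay idle between
    two healthcheck-triggered wakeups."""
    return check_interval + randomized_delay


# OSP 14 and earlier: docker runs the healthcheck every 30s.
print(worst_case_wakeup_gap(30) < RABBITMQ_DISCONNECT_AFTER)      # True: heartbeats are sent in time
# OSP 15: systemd timer fires every 60s, with up to 45s of randomized delay on top.
print(worst_case_wakeup_gap(60, 45) < RABBITMQ_DISCONNECT_AFTER)  # False: the 60s window can be exceeded
```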
(In reply to Damien Ciabrini from comment #4)
> I think these lost heartbeats and the erratic exchange of AMQP
> heartbeats in general between rabbitmq and openstack services are due
> to a subtle change in how we run the containers' healthchecks with
> systemd and podman.

Actually, I'm taking that comment back... healthchecks are probably not involved.

In my deployment I can see that the heat_api service, which runs under mod_wsgi as well, is immune to that loss of AMQP heartbeats; the AMQP heartbeat greenlet seems to run every 15s as expected.

2019-04-26 08:41:44.732 18 DEBUG amqp [-] heartbeat_tick : for connection 5882801ce77d4e7eb7a8c24ec4acbe1c heartbeat_tick /usr/lib/python3.6/site-packages/amqp/connection.py:700
2019-04-26 08:41:44.732 18 DEBUG amqp [-] heartbeat_tick : Prev sent/recv: 58/134, now - 58/135, monotonic - 95201.275375187, last_heartbeat_sent - 95186.273410011, heartbeat int. - 60 for connection 5882801ce77d4e7eb7a8c24ec4acbe1c heartbeat_tick /usr/lib/python3.6/site-packages/amqp/connection.py:722
2019-04-26 08:41:59.734 18 DEBUG amqp [-] heartbeat_tick : for connection 5882801ce77d4e7eb7a8c24ec4acbe1c heartbeat_tick /usr/lib/python3.6/site-packages/amqp/connection.py:700
2019-04-26 08:41:59.734 18 DEBUG amqp [-] heartbeat_tick : Prev sent/recv: 58/135, now - 58/135, monotonic - 95216.277477823, last_heartbeat_sent - 95186.273410011, heartbeat int. - 60 for connection 5882801ce77d4e7eb7a8c24ec4acbe1c heartbeat_tick /usr/lib/python3.6/site-packages/amqp/connection.py:722
2019-04-26 08:42:14.308 18 DEBUG amqp [-] heartbeat_tick : for connection 44ec1ada6cd94026a721146c3e3a3962 heartbeat_tick /usr/lib/python3.6/site-packages/amqp/connection.py:700
2019-04-26 08:42:14.309 18 DEBUG amqp [-] heartbeat_tick : Prev sent/recv: 48/136, now - 48/137, monotonic - 95230.851789552, last_heartbeat_sent - 95170.858355861, heartbeat int. - 60 for connection 44ec1ada6cd94026a721146c3e3a3962 heartbeat_tick /usr/lib/python3.6/site-packages/amqp/connection.py:722
2019-04-26 08:42:14.736 18 DEBUG amqp [-] heartbeat_tick : for connection 5882801ce77d4e7eb7a8c24ec4acbe1c heartbeat_tick /usr/lib/python3.6/site-packages/amqp/connection.py:700
2019-04-26 08:42:14.736 18 DEBUG amqp [-] heartbeat_tick : Prev sent/recv: 58/135, now - 58/136, monotonic - 95231.279408946, last_heartbeat_sent - 95186.273410011, heartbeat int. - 60 for connection 5882801ce77d4e7eb7a8c24ec4acbe1c heartbeat_tick /usr/lib/python3.6/site-packages/amqp/connection.py:722
2019-04-26 08:42:29.738 18 DEBUG amqp [-] heartbeat_tick : for connection 5882801ce77d4e7eb7a8c24ec4acbe1c heartbeat_tick /usr/lib/python3.6/site-packages/amqp/connection.py:700
2019-04-26 08:42:29.738 18 DEBUG amqp [-] heartbeat_tick : Prev sent/recv: 58/136, now - 58/136, monotonic - 95246.281495191, last_heartbeat_sent - 95186.273410011, heartbeat int. - 60 for connection 5882801ce77d4e7eb7a8c24ec4acbe1c heartbeat_tick /usr/lib/python3.6/site-packages/amqp/connection.py:722

I can see that heat_api_wsgi creates AMQP connections when I use Heat (e.g. with 'heat service-list'):

[root@undercloud-0 ~]# ps -ef | grep heat_api_wsgi
42418 222454 222443 0 07:29 ? 00:00:04 heat_api_wsgi -DFOREGROUND

[root@undercloud-0 ~]# ss -tnp | grep 5672 | grep httpd
ESTAB 0 0 192.168.24.1:58900 192.168.24.1:5672 users:(("httpd",pid=222454,fd=18))
ESTAB 0 0 192.168.24.1:58898 192.168.24.1:5672 users:(("httpd",pid=222454,fd=17))

When inspecting the process' callstack, I see that the WSGI thread is idle, waiting for an incoming OpenStack API call to process:

Thread 4 (Thread 0x7f13b2b57700 (LWP 222465)):
#0 0x00007f13bd0ffd97 in epoll_wait () from target:/lib64/libc.so.6
#1 0x00007f13bd81272a in impl_pollset_poll () from target:/lib64/libapr-1.so.0
#2 0x00007f13b4e2d5f2 in wsgi_daemon_thread () from target:/etc/httpd/modules/mod_wsgi_python3.so
#3 0x00007f13bd5d32de in start_thread () from target:/lib64/libpthread.so.0
#4 0x00007f13bd0ffa63 in clone () from target:/lib64/libc.so.6

But I also see two additional threads which run the Python interpreter:

Thread 6 (Thread 0x7f139ee04700 (LWP 238346)):
#0 0x00007f13bd5dc032 in do_futex_wait () from target:/lib64/libpthread.so.0
#1 0x00007f13bd5dc143 in __new_sem_wait_slow () from target:/lib64/libpthread.so.0
#2 0x00007f13b49a385e in PyThread_acquire_lock_timed () from target:/lib64/libpython3.6m.so.1.0
#3 0x00007f13b4a2109b in lock_PyThread_acquire_lock () from target:/lib64/libpython3.6m.so.1.0
[...]

Thread 5 (Thread 0x7f139f605700 (LWP 238345)):
#0 0x00007f13bd0f47e1 in poll () from target:/lib64/libc.so.6
#1 0x00007f13ab1a75df in sock_recv () from target:/usr/lib64/python3.6/lib-dynload/_socket.cpython-36m-x86_64-linux-gnu.so
#2 0x00007f13b4a23d02 in _PyCFunction_FastCallDict () from target:/lib64/libpython3.6m.so.1.0
#3 0x00007f13b4a31b8d in call_function () from target:/lib64/libpython3.6m.so.1.0
[...]

At this point I think eventlet manages its timer-based greenlets in a dedicated pthread, which is woken up independently from mod_wsgi's traffic, which is good. The question now is why we see those threads in heat_api_wsgi and not in nova_api_wsgi. As hinted in [1], this could be due to eventlet monkey_patching; I'm trying to verify that theory now.

[1] http://lists.openstack.org/pipermail/openstack-discuss/2019-April/005413.html

OK, I found the discrepancy between the heat_api container (which works as expected on my environment) and the nova_api container (which processes AMQP heartbeats erratically):

. When I instrument the AMQP heartbeat thread [1] in the heat_api container, I see that it checks the content of the AMQP socket every 15s by waiting on self._heartbeat_exit_event, which is a threading.Event.
. In the nova_api container, the variable self._heartbeat_exit_event is instead of type oslo_utils.eventletutils.EventletEvent, and rather than waiting 15s as requested, execution sometimes resumes 60s later, sometimes even later than that.

I don't know if that object type is expected (it looks like it isn't). I'm going to check how nova configures eventlet and whether changing the location of eventlet.monkey_patch() fixes this behaviour, or whether something else must be fixed.

PS: so far, the monkey patching fix mentioned in the mailing list discussion [2] upstream doesn't seem to fix the issue for me. I will double check that I applied it properly.

[1] https://github.com/openstack/oslo.messaging/blob/433d34ed4149ad2dcbb860b07754dea25cb43407/oslo_messaging/_drivers/impl_rabbit.py#L947
[2] http://lists.openstack.org/pipermail/openstack-discuss/2019-April/005413.html

That would certainly explain it. It seems like heat_api isn't using eventlet at all under mod_wsgi?
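The Event-type discrepancy described above can be illustrated outside of the services with a short snippet. This is a hypothetical diagnostic, not code from nova, heat or oslo.messaging, and it assumes that oslo_utils.eventletutils.Event() decides which implementation to return at call time based on the monkey-patch state of the thread module, as the next comment explains.

```python
# Hypothetical diagnostic, assuming oslo.utils and eventlet are installed.
# It only illustrates the Event-type discrepancy described above.
from oslo_utils import eventletutils

# Before any monkey patching, the factory falls back to the stdlib event,
# which is what the heat_api container shows (threading.Event).
print(type(eventletutils.Event()))

# After eventlet monkey-patches the thread module, the same call is expected
# to return an EventletEvent, which is what the nova_api container shows.
import eventlet
eventlet.monkey_patch(thread=True)
print(type(eventletutils.Event()))
```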
The eventletutils.Event function chooses which implementation to use based on whether or not the thread module has been monkey-patched: https://github.com/openstack/oslo.utils/blob/master/oslo_utils/eventletutils.py#L183

So if you're seeing the threading.Event implementation in heat_api, then it hasn't patched the thread module (or the logic that checks whether the module has been patched is broken somehow).

FYI, also take care about this: oslo.messaging waits for some defined exceptions [1] to ensure the connection with the rabbit cluster nodes, but kombu redefines some possible stdlib exceptions, such as:
- socket.timeout (rebranded as kombu.exceptions.TimeoutError [2]), possibly raised by amqp, so oslo.messaging can't catch the exception properly because of the namespace of the raised exception in that case (kombu.exceptions.TimeoutError);
- some other amqp exceptions [3] are also rebranded by kombu and not properly expected on the oslo.messaging side; I guess we also need to catch kombu.exceptions.ConnectionError and amqp.ConnectionError in that scenario (kombu.exceptions.ConnectionError = amqp.ConnectionError).

My analysis comes from this BZ [4], where a lot of log entries appear when we power off one node of a 3-node controller cluster. That BZ is against OSP13, but the exception management is the same in the latest version. The heartbeat doesn't work properly because of these redefined exceptions: we never call the ensure_connection function, and the connection is never switched to another destination. It's not the same problem as yours, but depending on your tests you could fall into a similar scenario, so just be aware of it :) I need to do more tests on my side to check exactly which exceptions we need to catch, and I'll submit a patch soon to fix my BZ and the behaviour described here.

[1] https://github.com/openstack/oslo.messaging/blob/433d34ed4149ad2dcbb860b07754dea25cb43407/oslo_messaging/_drivers/impl_rabbit.py#L937
[2] https://github.com/celery/kombu/blob/master/kombu/exceptions.py#L4
[3] https://github.com/celery/kombu/blob/master/kombu/exceptions.py#L6
[4] https://bugzilla.redhat.com/show_bug.cgi?id=1665399
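As a rough sketch of the broader exception handling argued for above (not the actual oslo.messaging code nor the eventual patch), the snippet below catches kombu's rebranded exceptions alongside the stdlib and amqp ones, so that a rebranded error no longer prevents ensure_connection from being called. The connection argument is assumed to be a kombu Connection, and the helper name run_heartbeat_check is made up for illustration.

```python
# Hypothetical sketch only: illustrates catching kombu's rebranded exceptions
# alongside the stdlib/amqp ones, as suggested in the comment above. This is
# not the actual oslo.messaging implementation.
import socket

import amqp
import kombu.exceptions

RECOVERABLE_ERRORS = (
    socket.timeout,                    # stdlib timeout raised on the socket
    kombu.exceptions.TimeoutError,     # kombu's rebranding of the socket timeout [2]
    amqp.ConnectionError,              # raised by py-amqp on broken connections
    kombu.exceptions.ConnectionError,  # kombu's alias of amqp.ConnectionError
)


def run_heartbeat_check(connection):
    """Run a heartbeat check and re-establish the connection on failure."""
    try:
        connection.heartbeat_check()
    except RECOVERABLE_ERRORS:
        # If kombu's aliases were missing from the tuple above, a rebranded
        # exception would escape this handler and ensure_connection() would
        # never be called, so the client would never fail over to another node.
        connection.ensure_connection()
```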
*** Bug 1705730 has been marked as a duplicate of this bug. ***

*** Bug 1705694 has been marked as a duplicate of this bug. ***

*** Bug 1693834 has been marked as a duplicate of this bug. ***

[stack@undercloud-0 ~]$ rhos-release -L
Installed repositories (rhel-8.0):
  15
  ceph-4
  ceph-osd-4
  rhel-8.0

[stack@undercloud-0 ~]$ cat core_puddle_version
RHOS_TRUNK-15.0-RHEL-8-20190523.n.1

[stack@undercloud-0 ~]$ rpm -qa|grep openstack-tripleo-common-10.7.1-0.20190522180807.438b9fb.el8ost
openstack-tripleo-common-10.7.1-0.20190522180807.438b9fb.el8ost.noarch

[stack@undercloud-0 ~]$ ansible overcloud -b -mshell -a"echo 'Current Date:';date;grep -ri 'Too many heartbeats missed' /var/log/"|grep -v grep
[WARNING]: Found both group and host with same name: undercloud

[DEPRECATION WARNING]: Distribution redhat 8.0 on host compute-0 should use /usr/libexec/platform-python, but is using /usr/bin/python for backward compatibility with prior Ansible releases. A future Ansible release will default to using the discovered platform python for this host. See https://docs.ansible.com/ansible/2.8/reference_appendices/interpreter_discovery.html for more information. This feature will be removed in version 2.12. Deprecation warnings can be disabled by setting deprecation_warnings=False in ansible.cfg.
compute-0 | CHANGED | rc=0 >>
Current Date:
Tue Jun 4 18:33:30 UTC 2019
Binary file /var/log/journal/07264b1c2c0b42c0b1db6070c41fee41/system.journal matches

[DEPRECATION WARNING]: Distribution redhat 8.0 on host compute-1 should use /usr/libexec/platform-python, but is using /usr/bin/python for backward compatibility with prior Ansible releases. A future Ansible release will default to using the discovered platform python for this host. See https://docs.ansible.com/ansible/2.8/reference_appendices/interpreter_discovery.html for more information. This feature will be removed in version 2.12. Deprecation warnings can be disabled by setting deprecation_warnings=False in ansible.cfg.
compute-1 | CHANGED | rc=0 >>
Current Date:
Tue Jun 4 18:33:30 UTC 2019
Binary file /var/log/journal/971a7ca9e4d6490eb234d2a301401bd7/system.journal matches

[DEPRECATION WARNING]: Distribution redhat 8.0 on host controller-0 should use /usr/libexec/platform-python, but is using /usr/bin/python for backward compatibility with prior Ansible releases. A future Ansible release will default to using the discovered platform python for this host. See https://docs.ansible.com/ansible/2.8/reference_appendices/interpreter_discovery.html for more information. This feature will be removed in version 2.12. Deprecation warnings can be disabled by setting deprecation_warnings=False in ansible.cfg.
controller-0 | CHANGED | rc=0 >>
Current Date:
Tue Jun 4 18:33:31 UTC 2019
Binary file /var/log/journal/393cd21b194e4ea29d9f5aa6634819f1/system.journal matches

[DEPRECATION WARNING]: Distribution redhat 8.0 on host controller-1 should use /usr/libexec/platform-python, but is using /usr/bin/python for backward compatibility with prior Ansible releases. A future Ansible release will default to using the discovered platform python for this host. See https://docs.ansible.com/ansible/2.8/reference_appendices/interpreter_discovery.html for more information. This feature will be removed in version 2.12. Deprecation warnings can be disabled by setting deprecation_warnings=False in ansible.cfg.
controller-1 | CHANGED | rc=0 >>
Current Date:
Tue Jun 4 18:33:31 UTC 2019
Binary file /var/log/journal/e4e7e4ed4a4e41d5b750dffb24d3e85c/system.journal matches

[DEPRECATION WARNING]: Distribution redhat 8.0 on host controller-2 should use /usr/libexec/platform-python, but is using /usr/bin/python for backward compatibility with prior Ansible releases. A future Ansible release will default to using the discovered platform python for this host. See https://docs.ansible.com/ansible/2.8/reference_appendices/interpreter_discovery.html for more information. This feature will be removed in version 2.12. Deprecation warnings can be disabled by setting deprecation_warnings=False in ansible.cfg.
controller-2 | CHANGED | rc=0 >>
Current Date:
Tue Jun 4 18:33:31 UTC 2019
Binary file /var/log/journal/fac0596f05a048439fbd62707a549126/system.journal matches

Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA.

For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHEA-2019:2811