Bug 1711794 - [OSP15][deployment] AMQP heartbeat thread missing heartbeats when running under nova_api
Summary: [OSP15][deployment] AMQP heartbeat thread missing heartbeats when running und...
Keywords:
Status: CLOSED NOTABUG
Alias: None
Product: Red Hat OpenStack
Classification: Red Hat
Component: openstack-tripleo-heat-templates
Version: 15.0 (Stein)
Hardware: Unspecified
OS: Unspecified
unspecified
high
Target Milestone: ---
: ---
Assignee: Martin Schuppert
QA Contact: Joe H. Rahme
URL:
Whiteboard:
Depends On: 1706456
Blocks:
TreeView+ depends on / blocked
 
Reported: 2019-05-20 06:48 UTC by Martin Schuppert
Modified: 2019-05-20 12:46 UTC (History)
15 users (show)

Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Clone Of: 1706456
Environment:
Last Closed: 2019-05-20 12:46:27 UTC


Attachments (Terms of Use)

Description Martin Schuppert 2019-05-20 06:48:54 UTC
+++ This bug was initially created as a clone of Bug #1706456 +++

Description of problem:
in OSP15, nova_api currently uses eventlet.monkey_patch() to enable some asynchronous
tasks like listing instances across multiple cells.

This has an adverse effect on oslo_message and in particular the AMQP heartbeat thread.
This thread is supposed to be running every 15seconds and checking whether some
AMQP heartbeat packets must be sent to rabbit in order to keep the AMQP connection
alive. Howwever, we see many disconnection in the rabbitmq logs which are linked
to nova_api mod_wsgi processes:

2019-04-15 16:30:28.408 [info] <0.3614.0> accepting AMQP connection <0.3614.0> (10.109.1.2:33712 -> 10.109.1.2:5672)
2019-04-15 16:30:28.421 [info] <0.3614.0> Connection <0.3614.0> (10.109.1.2:33712 -> 10.109.1.2:5672) has a client-provided name: mod_wsgi:19:564a2fe0-6ccd-4398-a160-eb096ba3d283
2019-04-15 16:30:28.423 [info] <0.3614.0> connection <0.3614.0> (10.109.1.2:33712 -> 10.109.1.2:5672 - mod_wsgi:19:564a2fe0-6ccd-4398-a160-eb096ba3d283): user 'guest' authenticated and granted access to vhost '/'
2019-04-15 16:31:25.191 [error] <0.3588.0> closing AMQP connection <0.3588.0> (10.109.1.2:33440 -> 10.109.1.2:5672 - mod_wsgi:20:544ec298-0b37-4123-87e0-2362416c7183):
missed heartbeats from client, timeout: 60s
2019-04-15 16:31:25.760 [error] <0.3565.0> closing AMQP connection <0.3565.0> (10.109.1.2:33356 -> 10.109.1.2:5672 - mod_wsgi:20:f78bdc50-0734-4823-b91b-7c9ac4227fd0):
missed heartbeats from client, timeout: 60s
2019-04-15 16:31:27.054 [info] <0.3640.0> accepting AMQP connection <0.3640.0> (10.109.1.2:33784 -> 10.109.1.2:5672)
2019-04-15 16:31:27.066 [info] <0.3640.0> Connection <0.3640.0> (10.109.1.2:33784 -> 10.109.1.2:5672) has a client-provided name: mod_wsgi:19:fde20251-b463-47ff-a1e9-4c43017dc4a1
2019-04-15 16:31:27.067 [info] <0.3640.0> connection <0.3640.0> (10.109.1.2:33784 -> 10.109.1.2:5672 - mod_wsgi:19:fde20251-b463-47ff-a1e9-4c43017dc4a1): user 'guest' authenticated and granted access to vhost '/'
2019-04-15 16:32:57.777 [error] <0.2891.0> closing AMQP connection <0.2891.0> (10.109.1.2:48034 -> 10.109.1.2:5672 - mistral-server:7:8eafa1b3-63bd-4e57-b0a7-9107043200bb):
missed heartbeats from client, timeout: 60s

Those missed heartbeats make nova_api logs warning/errors and force it to recreate connection to rabbitmq:

nova/nova-api.log:2019-04-15 16:02:07.857 20 WARNING oslo.messaging._drivers.impl_rabbit [-] Unexpected error during heartbeart thread processing, retrying...: OSError: Server unexpectedly closed connection
nova/nova-api.log:2019-04-15 16:16:01.207 20 WARNING oslo.messaging._drivers.impl_rabbit [-] Unexpected error during heartbeart thread processing, retrying...: ConnectionResetError: [Errno 104] Connection reset by peer
nova/nova-api.log:2019-04-15 16:27:24.364 20 WARNING oslo.messaging._drivers.impl_rabbit [-] Unexpected error during heartbeart thread processing, retrying...: OSError: Server unexpectedly closed connection
nova/nova-api.log:2019-04-15 16:27:24.365 20 WARNING oslo.messaging._drivers.impl_rabbit [-] Unexpected error during heartbeart thread processing, retrying...: OSError: Server unexpectedly closed connection
nova/nova-api.log:2019-04-15 16:27:24.740 20 ERROR oslo.messaging._drivers.impl_rabbit [req-82a5fdc7-5fe8-4314-a31c-394eabdc418d b36fa1be80ae4275bb4511bb74d78eff 89d60bdd36ab49c5b0720eb43fa7d1dd - default default] [f78bdc50-0734-4823-b91b-7c9ac4227fd0] AMQP server on undercloud-0.ctlplane.localdomain:5672 is unreachable: Server unexpectedly closed connection. Trying again in 1 seconds.: OSError: Server unexpectedly closed connection
nova/nova-api.log:2019-04-15 16:30:27.358 19 WARNING oslo.messaging._drivers.impl_rabbit [-] Unexpected error during heartbeart thread processing, retrying...: OSError: Server unexpectedly closed connection
nova/nova-api.log:2019-04-15 16:30:27.359 19 WARNING oslo.messaging._drivers.impl_rabbit [-] Unexpected error during heartbeart thread processing, retrying...: OSError: Server unexpectedly closed connection

Running the AMQP heartbeat thread under mod_wsgi doesn't work as expected when the threading library is monkey_patched, because the thread waits on a data structure [1] that has been monkey patched [2], which makes it yield its execution instead of sleeping for 15s.

Now, because mod_wsgi stops the execution of its embedded interpreter as soon as it finished processing a request, the AMQP heartbeat thread can't be resumed until there's a new request to be processed in the mod_wsgi queue (which would resume the python interpreter and make eventlet resume the thread).

In practice, on an undercloud, there's not a lot of traffic coming through nova_api. This causes the event loop to be stopped for long periods of times (sometimes more than 60s) and make the scheduling of the AMQP heartbeat thread erratic, ultimately causing disconnections and reconnections to rabbitmq.

[1] https://github.com/openstack/oslo.messaging/blob/master/oslo_messaging/_drivers/impl_rabbit.py#L904
[2] https://github.com/openstack/oslo.utils/blob/master/oslo_utils/eventletutils.py#L182

Version-Release number of selected component (if applicable):
openstack-nova-api-19.0.1-0.20190425040353.e6ca383.el8ost.noarch

How reproducible:
Always

Steps to Reproduce:
1. Deploy an undercloud
2. Make sure that nova_api has some AMQP connections opened to rabbitmq by calling e.g. 'nova service-list'
3. Wait for 60s and obvserve some disconnection in rabbitmq logs

Actual results:
rabbitmq closes connections that have been idle for more than 60s, and cause warning/errors in nova logs.

Expected results:
nova_api should send regular AMQP heartbeat to keep the connection to rabbit opened when it is idle.

Additional info:
In OSP15, this only affect nova_api as other api services like e.g. heat_api do not use eventlet.monkey_patch() 
The issue is tracked upstream in https://bugs.launchpad.net/nova/+bug/1825584
A workaround consisting in disabling monkey-patching is discussed in https://bugs.launchpad.net/tripleo/+bug/1827744

--- Additional comment from  on 2019-05-17 17:17:26 UTC ---

this is an issue that is currently be investigated upstream however it does not functionally break
the nova api as the heartbeats are not required for it to function correctly. The current consensus
is that it would be incorrect to try and circumvent the wsgi servers thread management to try and 
force the server to keep the heartbeat thread alive.

while the heartbeat can stop this does not break the underlying way that oslo.messaging
works in that it will automatically reconnect to rabbitmq when a new api request is received.

a workaround has been found for deployment which run the nova-api under uwsgi that we believe
will also apply to mod_wsgi. when the nova-api is run under wsgi the wsgi server should be
configured to run the wsgi app with 1 thread per interpreter process.

to address this upstream we are planning to take two approaches.

first document the requirement to use only 1 thread per interpreter process
when running the nova-api under mod_wsgi or uwsgi.

for mod_wsgi this can be done by setting thread=1 as per 
https://modwsgi.readthedocs.io/en/develop/configuration-directives/WSGIDaemonProcess.html
and similarly for uwsgi https://uwsgi-docs.readthedocs.io/en/latest/Options.html#threads

in addition to this the log level of the connection closed message will
be reduced to info or debug as this is a recoverable issue that is already handled
by how oslo.messaging is designed. disconnection is not an error and should not be
reported as such in the logs.

from a deployment perspective https://github.com/openstack/tripleo-heat-templates/blob/master/deployment/nova/nova-api-container-puppet.yaml#L209-L244
shoudl be updated to ensure only 1 thread is used and that api parallelism is managed at the process level instead.

--- Additional comment from  on 2019-05-17 17:18:59 UTC ---

note this bug shoudl be cloned to cover the deployment aspect and the initial bug shoudl be re targeted to the oslo.messaging component to
track adjusting the log level.

Comment 1 Martin Schuppert 2019-05-20 08:04:17 UTC
Cloned for deployment, but to note, we already have threads=1 (if not other specified by an
operator) as this is the default in puppet-nova [1] - 10-nova_api_wsgi.conf :

# ************************************
# Vhost template in module puppetlabs-apache
# Managed by Puppet
# ************************************
#
<VirtualHost 192.168.24.1:8774>
  ServerName undercloud-0.ctlplane.localdomain

  ## Vhost docroot
  DocumentRoot "/var/www/cgi-bin/nova"

  ## Directories, there should at least be a declaration for /var/www/cgi-bin/nova

  <Directory "/var/www/cgi-bin/nova">
    Options Indexes FollowSymLinks MultiViews
    AllowOverride None
    Require all granted
  </Directory>

  ## Logging
  ErrorLog "/var/log/httpd/nova_api_wsgi_error.log"
  ServerSignature Off
  CustomLog "/var/log/httpd/nova_api_wsgi_access.log" combined
  SetEnvIf X-Forwarded-Proto https HTTPS=1

  ## WSGI configuration
  WSGIApplicationGroup %{GLOBAL}
  WSGIDaemonProcess nova-api display-name=nova_api_wsgi group=nova processes=4 threads=1 user=nova
  WSGIProcessGroup nova-api
  WSGIScriptAlias / "/var/www/cgi-bin/nova/nova-api"
</VirtualHost>


[1] https://github.com/openstack/puppet-nova/blob/stable/stein/manifests/wsgi/apache_api.pp#L55-L57

Comment 2 Martin Schuppert 2019-05-20 12:46:27 UTC
as mentioned in comment1 the default is 1. Therefore closing this BZ.


Note You need to log in before you can comment on or make changes to this bug.