+++ This bug was initially created as a clone of Bug #1706456 +++

Description of problem:

In OSP15, nova_api currently uses eventlet.monkey_patch() to enable some asynchronous tasks, like listing instances across multiple cells. This has an adverse effect on oslo.messaging, and in particular on the AMQP heartbeat thread. This thread is supposed to run every 15 seconds and check whether AMQP heartbeat packets must be sent to rabbit in order to keep the AMQP connection alive. However, we see many disconnections in the rabbitmq logs which are linked to nova_api mod_wsgi processes:

2019-04-15 16:30:28.408 [info] <0.3614.0> accepting AMQP connection <0.3614.0> (10.109.1.2:33712 -> 10.109.1.2:5672)
2019-04-15 16:30:28.421 [info] <0.3614.0> Connection <0.3614.0> (10.109.1.2:33712 -> 10.109.1.2:5672) has a client-provided name: mod_wsgi:19:564a2fe0-6ccd-4398-a160-eb096ba3d283
2019-04-15 16:30:28.423 [info] <0.3614.0> connection <0.3614.0> (10.109.1.2:33712 -> 10.109.1.2:5672 - mod_wsgi:19:564a2fe0-6ccd-4398-a160-eb096ba3d283): user 'guest' authenticated and granted access to vhost '/'
2019-04-15 16:31:25.191 [error] <0.3588.0> closing AMQP connection <0.3588.0> (10.109.1.2:33440 -> 10.109.1.2:5672 - mod_wsgi:20:544ec298-0b37-4123-87e0-2362416c7183): missed heartbeats from client, timeout: 60s
2019-04-15 16:31:25.760 [error] <0.3565.0> closing AMQP connection <0.3565.0> (10.109.1.2:33356 -> 10.109.1.2:5672 - mod_wsgi:20:f78bdc50-0734-4823-b91b-7c9ac4227fd0): missed heartbeats from client, timeout: 60s
2019-04-15 16:31:27.054 [info] <0.3640.0> accepting AMQP connection <0.3640.0> (10.109.1.2:33784 -> 10.109.1.2:5672)
2019-04-15 16:31:27.066 [info] <0.3640.0> Connection <0.3640.0> (10.109.1.2:33784 -> 10.109.1.2:5672) has a client-provided name: mod_wsgi:19:fde20251-b463-47ff-a1e9-4c43017dc4a1
2019-04-15 16:31:27.067 [info] <0.3640.0> connection <0.3640.0> (10.109.1.2:33784 -> 10.109.1.2:5672 - mod_wsgi:19:fde20251-b463-47ff-a1e9-4c43017dc4a1): user 'guest'
authenticated and granted access to vhost '/'
2019-04-15 16:32:57.777 [error] <0.2891.0> closing AMQP connection <0.2891.0> (10.109.1.2:48034 -> 10.109.1.2:5672 - mistral-server:7:8eafa1b3-63bd-4e57-b0a7-9107043200bb): missed heartbeats from client, timeout: 60s

Those missed heartbeats make nova_api log warnings/errors and force it to recreate its connection to rabbitmq:

nova/nova-api.log:2019-04-15 16:02:07.857 20 WARNING oslo.messaging._drivers.impl_rabbit [-] Unexpected error during heartbeart thread processing, retrying...: OSError: Server unexpectedly closed connection
nova/nova-api.log:2019-04-15 16:16:01.207 20 WARNING oslo.messaging._drivers.impl_rabbit [-] Unexpected error during heartbeart thread processing, retrying...: ConnectionResetError: [Errno 104] Connection reset by peer
nova/nova-api.log:2019-04-15 16:27:24.364 20 WARNING oslo.messaging._drivers.impl_rabbit [-] Unexpected error during heartbeart thread processing, retrying...: OSError: Server unexpectedly closed connection
nova/nova-api.log:2019-04-15 16:27:24.365 20 WARNING oslo.messaging._drivers.impl_rabbit [-] Unexpected error during heartbeart thread processing, retrying...: OSError: Server unexpectedly closed connection
nova/nova-api.log:2019-04-15 16:27:24.740 20 ERROR oslo.messaging._drivers.impl_rabbit [req-82a5fdc7-5fe8-4314-a31c-394eabdc418d b36fa1be80ae4275bb4511bb74d78eff 89d60bdd36ab49c5b0720eb43fa7d1dd - default default] [f78bdc50-0734-4823-b91b-7c9ac4227fd0] AMQP server on undercloud-0.ctlplane.localdomain:5672 is unreachable: Server unexpectedly closed connection.
Trying again in 1 seconds.: OSError: Server unexpectedly closed connection
nova/nova-api.log:2019-04-15 16:30:27.358 19 WARNING oslo.messaging._drivers.impl_rabbit [-] Unexpected error during heartbeart thread processing, retrying...: OSError: Server unexpectedly closed connection
nova/nova-api.log:2019-04-15 16:30:27.359 19 WARNING oslo.messaging._drivers.impl_rabbit [-] Unexpected error during heartbeart thread processing, retrying...: OSError: Server unexpectedly closed connection

Running the AMQP heartbeat thread under mod_wsgi doesn't work as expected when the threading library is monkey-patched, because the thread waits on a data structure [1] that has been monkey-patched [2], which makes it yield its execution instead of sleeping for 15s. Because mod_wsgi stops the execution of its embedded interpreter as soon as it has finished processing a request, the AMQP heartbeat thread can't be resumed until there's a new request to be processed in the mod_wsgi queue (which resumes the Python interpreter and lets eventlet resume the thread). In practice, on an undercloud, there's not a lot of traffic going through nova_api, so the event loop can stay stopped for long periods of time (sometimes more than 60s), which makes the scheduling of the AMQP heartbeat thread erratic and ultimately causes disconnections from and reconnections to rabbitmq.

[1] https://github.com/openstack/oslo.messaging/blob/master/oslo_messaging/_drivers/impl_rabbit.py#L904
[2] https://github.com/openstack/oslo.utils/blob/master/oslo_utils/eventletutils.py#L182

Version-Release number of selected component (if applicable):
openstack-nova-api-19.0.1-0.20190425040353.e6ca383.el8ost.noarch

How reproducible:
Always

Steps to Reproduce:
1. Deploy an undercloud
2. Make sure that nova_api has some AMQP connections opened to rabbitmq by calling e.g. 'nova service-list'
3.
Wait for 60s and observe some disconnections in the rabbitmq logs

Actual results:
rabbitmq closes connections that have been idle for more than 60s, which causes warnings/errors in the nova logs.

Expected results:
nova_api should send regular AMQP heartbeats to keep its connection to rabbit open while it is idle.

Additional info:
In OSP15 this only affects nova_api, as other API services (e.g. heat_api) do not use eventlet.monkey_patch(). The issue is tracked upstream in https://bugs.launchpad.net/nova/+bug/1825584 A workaround consisting in disabling monkey-patching is discussed in https://bugs.launchpad.net/tripleo/+bug/1827744

--- Additional comment from on 2019-05-17 17:17:26 UTC ---

This is an issue that is currently being investigated upstream; however, it does not functionally break the nova API, as the heartbeats are not required for it to work correctly. The current consensus is that it would be incorrect to try to circumvent the WSGI server's thread management in order to force the server to keep the heartbeat thread alive. While the heartbeat can stop, this does not break the underlying way that oslo.messaging works: it will automatically reconnect to rabbitmq when a new API request is received.

A workaround has been found for deployments which run nova-api under uwsgi, and we believe it also applies to mod_wsgi: when nova-api is run under WSGI, the WSGI server should be configured to run the WSGI app with 1 thread per interpreter process.

To address this upstream we are planning to take two approaches. First, document the requirement to use only 1 thread per interpreter process when running nova-api under mod_wsgi or uwsgi.
For mod_wsgi this can be done by setting threads=1 as per https://modwsgi.readthedocs.io/en/develop/configuration-directives/WSGIDaemonProcess.html and similarly for uwsgi https://uwsgi-docs.readthedocs.io/en/latest/Options.html#threads

In addition, the log level of the connection-closed message will be reduced to info or debug, as this is a recoverable condition that is already handled by how oslo.messaging is designed. A disconnection is not an error and should not be reported as such in the logs. From a deployment perspective, https://github.com/openstack/tripleo-heat-templates/blob/master/deployment/nova/nova-api-container-puppet.yaml#L209-L244 should be updated to ensure only 1 thread is used and that API parallelism is managed at the process level instead.

--- Additional comment from on 2019-05-17 17:18:59 UTC ---

Note: this bug should be cloned to cover the deployment aspect, and the initial bug should be retargeted to the oslo.messaging component to track adjusting the log level.
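For illustration, the heartbeat thread described in the problem description is essentially a loop around an Event.wait() with a timeout. The sketch below is a simplified, stdlib-only approximation of that shape; the class and method names are made up for this example, not oslo.messaging's actual API. Under eventlet.monkey_patch(), the wait() call becomes a green wait that only makes progress while the interpreter is being scheduled, which is why mod_wsgi suspending the interpreter between requests starves the loop.

```python
import threading
import time

class HeartbeatLoop:
    """Illustrative sketch of a heartbeat loop driven by Event.wait().

    With plain CPython threading, wait(timeout) blocks in the OS and the
    loop fires reliably every interval.  Once eventlet monkey-patches
    threading, this same wait() yields to the eventlet hub instead, so
    it only wakes up when something else pumps the interpreter.
    """

    def __init__(self, interval, send):
        self._interval = interval
        self._send = send              # callable that sends one heartbeat
        self._exit = threading.Event()

    def run(self):
        # wait() returns False on timeout -> time to send a heartbeat;
        # it returns True once stop() sets the event, ending the loop.
        while not self._exit.wait(timeout=self._interval):
            self._send()

    def stop(self):
        self._exit.set()

sent = []
loop = HeartbeatLoop(0.05, lambda: sent.append(time.monotonic()))
worker = threading.Thread(target=loop.run)
worker.start()
time.sleep(0.3)                        # let a few heartbeats fire
loop.stop()
worker.join()
print(len(sent) >= 3)
```

With unpatched threading this prints True, because the 0.05s waits expire on schedule regardless of what the main thread does; the bug arises precisely because the monkey-patched equivalent of this loop cannot count on being scheduled.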
Cloned for deployment, but note that we already have threads=1 (unless otherwise specified by an operator), as this is the default in puppet-nova [1] - 10-nova_api_wsgi.conf:

# ************************************
# Vhost template in module puppetlabs-apache
# Managed by Puppet
# ************************************
#
<VirtualHost 192.168.24.1:8774>
  ServerName undercloud-0.ctlplane.localdomain

  ## Vhost docroot
  DocumentRoot "/var/www/cgi-bin/nova"

  ## Directories, there should at least be a declaration for /var/www/cgi-bin/nova
  <Directory "/var/www/cgi-bin/nova">
    Options Indexes FollowSymLinks MultiViews
    AllowOverride None
    Require all granted
  </Directory>

  ## Logging
  ErrorLog "/var/log/httpd/nova_api_wsgi_error.log"
  ServerSignature Off
  CustomLog "/var/log/httpd/nova_api_wsgi_access.log" combined
  SetEnvIf X-Forwarded-Proto https HTTPS=1

  ## WSGI configuration
  WSGIApplicationGroup %{GLOBAL}
  WSGIDaemonProcess nova-api display-name=nova_api_wsgi group=nova processes=4 threads=1 user=nova
  WSGIProcessGroup nova-api
  WSGIScriptAlias / "/var/www/cgi-bin/nova/nova-api"
</VirtualHost>

[1] https://github.com/openstack/puppet-nova/blob/stable/stein/manifests/wsgi/apache_api.pp#L55-L57
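For deployments that run nova-api under uwsgi rather than mod_wsgi, the equivalent of the WSGIDaemonProcess threads=1 setting above is the uwsgi threads option. A minimal illustrative fragment follows; the processes value is a placeholder for whatever process-level parallelism the operator wants, and only threads = 1 is the setting relevant to this bug:

```
[uwsgi]
; parallelism handled at the process level, per the upstream guidance
processes = 4
; one thread per interpreter process so the heartbeat thread is not starved
threads = 1
```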
As mentioned in comment 1, the default is already threads=1, therefore closing this BZ.