Bug 1706456 - Reduce the log level in the nova-api for oslo messaging warnings
Summary: Reduce the log level in the nova-api for oslo messaging warnings
Keywords:
Status: CLOSED EOL
Alias: None
Product: Red Hat OpenStack
Classification: Red Hat
Component: openstack-nova
Version: 15.0 (Stein)
Hardware: Unspecified
OS: Unspecified
Priority: high
Severity: high
Target Milestone: ---
Assignee: smooney
QA Contact: OSP DFG:Compute
URL:
Whiteboard:
Depends On:
Blocks: 1711794 1913177 2036377
 
Reported: 2019-05-04 23:26 UTC by Damien Ciabrini
Modified: 2023-09-07 19:59 UTC
CC: 22 users

Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Clones: 1711794 1913177 2036377
Environment:
Last Closed: 2020-09-30 20:04:20 UTC
Target Upstream Version:
Embargoed:


Attachments: none


Links
System ID Private Priority Status Summary Last Updated
Launchpad 1825584 0 None None None 2019-05-04 23:26:25 UTC
OpenStack gerrit 661343 0 None ABANDONED WIP disable nova_api rabbitmq heartbeats 2021-02-09 23:09:51 UTC
OpenStack gerrit 724188 0 None MERGED Silence amqp heartbeat warning 2021-02-09 23:09:51 UTC
OpenStack gerrit 728057 0 None MERGED Silence amqp heartbeat warning 2021-02-09 23:09:52 UTC
OpenStack gerrit 728059 0 None NEW Silence amqp heartbeat warning 2021-02-09 23:09:52 UTC
Red Hat Issue Tracker OSP-11974 0 None None None 2021-12-31 15:38:28 UTC
Red Hat Knowledge Base (Solution) 5096301 0 None None None 2020-05-21 11:44:31 UTC

Description Damien Ciabrini 2019-05-04 23:26:26 UTC
Description of problem:
in OSP15, nova_api currently uses eventlet.monkey_patch() to enable some asynchronous
tasks like listing instances across multiple cells.

This has an adverse effect on oslo.messaging, in particular on the AMQP heartbeat thread.
This thread is supposed to run every 15 seconds and check whether AMQP heartbeat
packets must be sent to rabbit in order to keep the AMQP connection alive.
However, we see many disconnections in the rabbitmq logs which are linked
to nova_api mod_wsgi processes:

2019-04-15 16:30:28.408 [info] <0.3614.0> accepting AMQP connection <0.3614.0> (10.109.1.2:33712 -> 10.109.1.2:5672)
2019-04-15 16:30:28.421 [info] <0.3614.0> Connection <0.3614.0> (10.109.1.2:33712 -> 10.109.1.2:5672) has a client-provided name: mod_wsgi:19:564a2fe0-6ccd-4398-a160-eb096ba3d283
2019-04-15 16:30:28.423 [info] <0.3614.0> connection <0.3614.0> (10.109.1.2:33712 -> 10.109.1.2:5672 - mod_wsgi:19:564a2fe0-6ccd-4398-a160-eb096ba3d283): user 'guest' authenticated and granted access to vhost '/'
2019-04-15 16:31:25.191 [error] <0.3588.0> closing AMQP connection <0.3588.0> (10.109.1.2:33440 -> 10.109.1.2:5672 - mod_wsgi:20:544ec298-0b37-4123-87e0-2362416c7183):
missed heartbeats from client, timeout: 60s
2019-04-15 16:31:25.760 [error] <0.3565.0> closing AMQP connection <0.3565.0> (10.109.1.2:33356 -> 10.109.1.2:5672 - mod_wsgi:20:f78bdc50-0734-4823-b91b-7c9ac4227fd0):
missed heartbeats from client, timeout: 60s
2019-04-15 16:31:27.054 [info] <0.3640.0> accepting AMQP connection <0.3640.0> (10.109.1.2:33784 -> 10.109.1.2:5672)
2019-04-15 16:31:27.066 [info] <0.3640.0> Connection <0.3640.0> (10.109.1.2:33784 -> 10.109.1.2:5672) has a client-provided name: mod_wsgi:19:fde20251-b463-47ff-a1e9-4c43017dc4a1
2019-04-15 16:31:27.067 [info] <0.3640.0> connection <0.3640.0> (10.109.1.2:33784 -> 10.109.1.2:5672 - mod_wsgi:19:fde20251-b463-47ff-a1e9-4c43017dc4a1): user 'guest' authenticated and granted access to vhost '/'
2019-04-15 16:32:57.777 [error] <0.2891.0> closing AMQP connection <0.2891.0> (10.109.1.2:48034 -> 10.109.1.2:5672 - mistral-server:7:8eafa1b3-63bd-4e57-b0a7-9107043200bb):
missed heartbeats from client, timeout: 60s

Those missed heartbeats cause nova_api to log warnings/errors and force it to recreate its connection to rabbitmq:

nova/nova-api.log:2019-04-15 16:02:07.857 20 WARNING oslo.messaging._drivers.impl_rabbit [-] Unexpected error during heartbeart thread processing, retrying...: OSError: Server unexpectedly closed connection
nova/nova-api.log:2019-04-15 16:16:01.207 20 WARNING oslo.messaging._drivers.impl_rabbit [-] Unexpected error during heartbeart thread processing, retrying...: ConnectionResetError: [Errno 104] Connection reset by peer
nova/nova-api.log:2019-04-15 16:27:24.364 20 WARNING oslo.messaging._drivers.impl_rabbit [-] Unexpected error during heartbeart thread processing, retrying...: OSError: Server unexpectedly closed connection
nova/nova-api.log:2019-04-15 16:27:24.365 20 WARNING oslo.messaging._drivers.impl_rabbit [-] Unexpected error during heartbeart thread processing, retrying...: OSError: Server unexpectedly closed connection
nova/nova-api.log:2019-04-15 16:27:24.740 20 ERROR oslo.messaging._drivers.impl_rabbit [req-82a5fdc7-5fe8-4314-a31c-394eabdc418d b36fa1be80ae4275bb4511bb74d78eff 89d60bdd36ab49c5b0720eb43fa7d1dd - default default] [f78bdc50-0734-4823-b91b-7c9ac4227fd0] AMQP server on undercloud-0.ctlplane.localdomain:5672 is unreachable: Server unexpectedly closed connection. Trying again in 1 seconds.: OSError: Server unexpectedly closed connection
nova/nova-api.log:2019-04-15 16:30:27.358 19 WARNING oslo.messaging._drivers.impl_rabbit [-] Unexpected error during heartbeart thread processing, retrying...: OSError: Server unexpectedly closed connection
nova/nova-api.log:2019-04-15 16:30:27.359 19 WARNING oslo.messaging._drivers.impl_rabbit [-] Unexpected error during heartbeart thread processing, retrying...: OSError: Server unexpectedly closed connection

Running the AMQP heartbeat thread under mod_wsgi doesn't work as expected when the threading library is monkey_patched, because the thread waits on a data structure [1] that has been monkey patched [2], which makes it yield its execution instead of sleeping for 15s.

Now, because mod_wsgi stops the execution of its embedded interpreter as soon as it has finished processing a request, the AMQP heartbeat thread can't be resumed until there's a new request to process in the mod_wsgi queue (which resumes the python interpreter and lets eventlet resume the thread).

In practice, on an undercloud, there's not a lot of traffic coming through nova_api. This causes the event loop to be stopped for long periods of time (sometimes more than 60s) and makes the scheduling of the AMQP heartbeat thread erratic, ultimately causing disconnections and reconnections to rabbitmq.

[1] https://github.com/openstack/oslo.messaging/blob/master/oslo_messaging/_drivers/impl_rabbit.py#L904
[2] https://github.com/openstack/oslo.utils/blob/master/oslo_utils/eventletutils.py#L182
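
To illustrate, here is a minimal standalone sketch (not the oslo.messaging code itself, just the same shape): after eventlet.monkey_patch(), the heartbeat "thread" is a greenthread whose wait() yields to the eventlet hub, so it only makes progress while the hub is being driven, i.e. while mod_wsgi is executing a request.

import eventlet
eventlet.monkey_patch()  # nova_api does this early, as described above

import threading
import time

def heartbeat_loop(stop_event, interval=15):
    # Roughly the shape of the AMQP heartbeat loop: wait, then send.
    while not stop_event.is_set():
        # After monkey_patch() this is a green wait: it yields to the
        # eventlet hub instead of blocking an OS thread, so it is only
        # rescheduled while the interpreter/hub is actually running.
        stop_event.wait(interval)
        print("would send AMQP heartbeat at %s" % time.time())

stop = threading.Event()
t = threading.Thread(target=heartbeat_loop, args=(stop,))
t.daemon = True
t.start()

# Keep the main greenthread busy so the sketch prints a few heartbeats
# when run standalone; under mod_wsgi nothing drives the hub between
# requests, which is exactly the problem described above.
time.sleep(46)
stop.set()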

Version-Release number of selected component (if applicable):
openstack-nova-api-19.0.1-0.20190425040353.e6ca383.el8ost.noarch

How reproducible:
Always

Steps to Reproduce:
1. Deploy an undercloud
2. Make sure that nova_api has some AMQP connections opened to rabbitmq by calling e.g. 'nova service-list'
3. Wait for 60s and observe disconnections in the rabbitmq logs

Actual results:
rabbitmq closes connections that have been idle for more than 60s, causing warnings/errors in the nova logs.

Expected results:
nova_api should send regular AMQP heartbeats to keep the connection to rabbit open while it is idle.

Additional info:
In OSP15, this only affects nova_api, as other API services (e.g. heat_api) do not use eventlet.monkey_patch().
The issue is tracked upstream in https://bugs.launchpad.net/nova/+bug/1825584
A workaround consisting of disabling monkey-patching is discussed in https://bugs.launchpad.net/tripleo/+bug/1827744

Comment 1 smooney 2019-05-17 17:17:26 UTC
this is an issue that is currently being investigated upstream, however it does not functionally break
the nova api as the heartbeats are not required for it to function correctly. the current consensus
is that it would be incorrect to try and circumvent the wsgi server's thread management to
force the server to keep the heartbeat thread alive.

while the heartbeat thread can stop, this does not break the underlying way that oslo.messaging
works, in that it will automatically reconnect to rabbitmq when a new api request is received.

a workaround has been found for deployments which run the nova-api under uwsgi that we believe
will also apply to mod_wsgi. when the nova-api is run under wsgi, the wsgi server should be
configured to run the wsgi app with 1 thread per interpreter process.

to address this upstream we are planning to take two approaches.

first document the requirement to use only 1 thread per interpreter process
when running the nova-api under mod_wsgi or uwsgi.

for mod_wsgi this can be done by setting threads=1 as per
https://modwsgi.readthedocs.io/en/develop/configuration-directives/WSGIDaemonProcess.html
and similarly for uwsgi https://uwsgi-docs.readthedocs.io/en/latest/Options.html#threads
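
for example (an illustrative sketch only; process counts, user/group and file layout are deployment specific), the relevant knobs look like:

# mod_wsgi: one thread per daemon process, scale with processes instead
WSGIDaemonProcess nova-api processes=4 threads=1 user=nova group=nova

# uwsgi equivalent in the application's ini file
[uwsgi]
processes = 4
threads = 1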

in addition to this, the log level of the connection closed message will
be reduced to info or debug, as this is a recoverable condition that is already handled
by the way oslo.messaging is designed. disconnection is not an error and should not be
reported as such in the logs.
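
as an interim operator-side mitigation (a sketch using oslo.log's default_log_levels option, not the upstream code change; note this option replaces the built-in default list, so in practice you would append to the existing defaults), the rabbit driver's verbosity can be lowered in nova.conf:

[DEFAULT]
# suppress the recoverable heartbeat warnings from the rabbit driver
default_log_levels = amqp=WARN,amqplib=WARN,oslo.messaging=ERROR,oslo_messaging=ERROR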

from a deployment perspective https://github.com/openstack/tripleo-heat-templates/blob/master/deployment/nova/nova-api-container-puppet.yaml#L209-L244
should be updated to ensure only 1 thread is used and that api parallelism is managed at the process level instead.

Comment 3 Martin Schuppert 2019-05-20 07:26:43 UTC
(In reply to smooney from comment #1)
> this is an issue that is currently be investigated upstream however it does
> not functionally break
> the nova api as the heartbeats are not required for it to function
> correctly. The current consensus
> is that it would be incorrect to try and circumvent the wsgi servers thread
> management to try and 
> force the server to keep the heartbeat thread alive.
> 
> while the heartbeat can stop this does not break the underlying way that
> oslo.messaging
> works in that it will automatically reconnect to rabbitmq when a new api
> request is received.
> 
> a workaround has been found for deployment which run the nova-api under
> uwsgi that we believe
> will also apply to mod_wsgi. when the nova-api is run under wsgi the wsgi
> server should be
> configured to run the wsgi app with 1 thread per interpreter process.
> 
> to address this upstream we are planning to take two approaches.
> 
> first document the requirement to use only 1 thread per interpreter process
> when running the nova-api under mod_wsgi or uwsgi.
> 
> for mod_wsgi this can be done by setting thread=1 as per 
> https://modwsgi.readthedocs.io/en/develop/configuration-directives/
> WSGIDaemonProcess.html
> and similarly for uwsgi
> https://uwsgi-docs.readthedocs.io/en/latest/Options.html#threads
> 
> in addition to this the log level of the connection closed message will
> be reduced to info or debug as this is a recoverable issue that is already
> handled
> by how oslo.messaging is designed. disconnection is not an error and should
> not be
> reported as such in the logs.
> 
> from a deployment perspective
> https://github.com/openstack/tripleo-heat-templates/blob/master/deployment/
> nova/nova-api-container-puppet.yaml#L209-L244
> shoudl be updated to ensure only 1 thread is used and that api parallelism
> is managed at the process level instead.

Cloned [1] for deployment, but to note, we already have threads=1 (if not otherwise specified by an
operator) as this is the default in puppet-nova [2] - 10-nova_api_wsgi.conf:

# ************************************
# Vhost template in module puppetlabs-apache
# Managed by Puppet
# ************************************
#
<VirtualHost 192.168.24.1:8774>
  ServerName undercloud-0.ctlplane.localdomain

  ## Vhost docroot
  DocumentRoot "/var/www/cgi-bin/nova"

  ## Directories, there should at least be a declaration for /var/www/cgi-bin/nova

  <Directory "/var/www/cgi-bin/nova">
    Options Indexes FollowSymLinks MultiViews
    AllowOverride None
    Require all granted
  </Directory>

  ## Logging
  ErrorLog "/var/log/httpd/nova_api_wsgi_error.log"
  ServerSignature Off
  CustomLog "/var/log/httpd/nova_api_wsgi_access.log" combined
  SetEnvIf X-Forwarded-Proto https HTTPS=1

  ## WSGI configuration
  WSGIApplicationGroup %{GLOBAL}
  WSGIDaemonProcess nova-api display-name=nova_api_wsgi group=nova processes=4 threads=1 user=nova
  WSGIProcessGroup nova-api
  WSGIScriptAlias / "/var/www/cgi-bin/nova/nova-api"
</VirtualHost>


[1] https://bugzilla.redhat.com/show_bug.cgi?id=1711794
[2] https://github.com/openstack/puppet-nova/blob/stable/stein/manifests/wsgi/apache_api.pp#L55-L57

Comment 7 Matthew Booth 2019-05-24 15:40:55 UTC
After some discussion, the decision here was to disable oslo.messaging heartbeats entirely in nova.conf for the n-api service. This means there is no need to alter the severity of the log message, as it won't be produced. We believe our existing tcp keepalive settings will effectively serve the same purpose in any case. Disabling heartbeats in oslo.messaging is currently an experimental feature, so it's not clear that upstream would take this solution yet.
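
A sketch of what that looks like in nova.conf, assuming the oslo.messaging rabbit driver option heartbeat_timeout_threshold (where 0 disables the heartbeat):

[oslo_messaging_rabbit]
# Disable AMQP heartbeats entirely for the API service; rely on the
# existing TCP keepalive settings to detect dead connections.
heartbeat_timeout_threshold = 0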

We need to test this. We believe it should be covered by standard 'destructive' testing, but it would be good both to confirm this and to give them a heads up to look out for it. Joe, do you know which tests cover downed controllers or the rabbit service, and who is responsible for running them?

Comment 22 smooney 2020-07-08 16:11:20 UTC
note this has been fixed on master and backports are in flight upstream

Comment 23 stchen 2020-09-30 20:04:20 UTC
Closing EOL, OSP 15 has been retired as of Sept 19, 2020

