Bug 1913177 - Reduce the log level in the nova-api for oslo messaging warnings
Summary: Reduce the log level in the nova-api for oslo messaging warnings
Keywords:
Status: CLOSED CURRENTRELEASE
Alias: None
Product: Red Hat OpenStack
Classification: Red Hat
Component: openstack-nova
Version: 16.1 (Train)
Hardware: All
OS: All
Priority: high
Severity: low
Target Milestone: ---
Assignee: OSP DFG:Compute
QA Contact: OSP DFG:Compute
URL:
Whiteboard:
Depends On: 1706456
Blocks: 1711794
 
Reported: 2021-01-06 08:14 UTC by Michele Valsecchi
Modified: 2024-03-25 17:44 UTC
CC: 27 users

Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Clone Of: 1706456
Environment:
Last Closed: 2021-03-01 14:32:34 UTC
Target Upstream Version:
Embargoed:




Links
System ID Private Priority Status Summary Last Updated
Launchpad 1825584 0 None None None 2021-01-06 08:16:42 UTC
OpenStack gerrit 724188 0 None MERGED Silence amqp heartbeat warning 2021-02-18 23:22:46 UTC
OpenStack gerrit 728057 0 None MERGED Silence amqp heartbeat warning 2021-02-18 23:22:47 UTC
OpenStack gerrit 728059 0 None NEW Silence amqp heartbeat warning 2021-02-18 23:22:47 UTC
Red Hat Issue Tracker OSP-1840 0 None None None 2021-11-19 19:02:35 UTC
Red Hat Knowledge Base (Solution) 5096301 0 None None None 2021-01-06 08:16:42 UTC

Description Michele Valsecchi 2021-01-06 08:14:59 UTC
+++ This bug was initially created as a clone of Bug #1706456 +++
(Note: the patch for Train is already merged https://review.opendev.org/c/openstack/nova/+/728057 https://github.com/openstack/nova/blob/stable/train/nova/config.py#L35-L50)
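For reference, the merged change silences the warning by adjusting how those oslo.messaging log records are handled in nova's config setup. A minimal sketch of one way to do that with a stdlib logging filter follows; the helper name and the matched message text here are illustrative, not the exact code in nova/config.py:

~~~
import logging


def _amqp_heartbeat_filter(record):
    # Downgrade the recoverable "heartbeat" warning emitted by
    # oslo.messaging's rabbit driver to DEBUG; everything else passes
    # through unchanged.
    if ('heartbeat' in record.getMessage().lower()
            and record.levelno == logging.WARNING):
        record.levelno = logging.DEBUG
        record.levelname = 'DEBUG'
    return True


logging.getLogger(
    'oslo.messaging._drivers.impl_rabbit').addFilter(_amqp_heartbeat_filter)
~~~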

Description of problem:
In OSP16.1, nova_api currently uses eventlet.monkey_patch() to enable some asynchronous
tasks like listing instances across multiple cells.

This has an adverse effect on oslo.messaging and in particular the AMQP heartbeat thread.
This thread is supposed to run every 15 seconds and check whether some
AMQP heartbeat packets must be sent to rabbit in order to keep the AMQP connection
alive. However, we see many disconnections in the rabbitmq logs which are linked
to nova_api mod_wsgi processes:

2019-04-15 16:30:28.408 [info] <0.3614.0> accepting AMQP connection <0.3614.0> (10.X.X.X:33712 -> 10.X.X.X:5672)
2019-04-15 16:30:28.421 [info] <0.3614.0> Connection <0.3614.0> (10.X.X.X:33712 -> 10.X.X.X:5672) has a client-provided name: mod_wsgi:19:564a2fe0-6ccd-4398-a160-eb096ba3d283
2019-04-15 16:30:28.423 [info] <0.3614.0> connection <0.3614.0> (10.X.X.X:33712 -> 10.X.X.X:5672 - mod_wsgi:19:564a2fe0-6ccd-4398-a160-eb096ba3d283): user 'guest' authenticated and granted access to vhost '/'
2019-04-15 16:31:25.191 [error] <0.3588.0> closing AMQP connection <0.3588.0> (10.X.X.X:33440 -> 10.X.X.X:5672 - mod_wsgi:20:544ec298-0b37-4123-87e0-2362416c7183):
missed heartbeats from client, timeout: 60s
2019-04-15 16:31:25.760 [error] <0.3565.0> closing AMQP connection <0.3565.0> (10.X.X.X:33356 -> 10.X.X.X:5672 - mod_wsgi:20:f78bdc50-0734-4823-b91b-7c9ac4227fd0):
missed heartbeats from client, timeout: 60s
2019-04-15 16:31:27.054 [info] <0.3640.0> accepting AMQP connection <0.3640.0> (10.X.X.X:33784 -> 10.X.X.X:5672)
2019-04-15 16:31:27.066 [info] <0.3640.0> Connection <0.3640.0> (10.X.X.X:33784 -> 10.X.X.X:5672) has a client-provided name: mod_wsgi:19:fde20251-b463-47ff-a1e9-4c43017dc4a1
2019-04-15 16:31:27.067 [info] <0.3640.0> connection <0.3640.0> (10.X.X.X:33784 -> 10.X.X.X:5672 - mod_wsgi:19:fde20251-b463-47ff-a1e9-4c43017dc4a1): user 'guest' authenticated and granted access to vhost '/'
2019-04-15 16:32:57.777 [error] <0.2891.0> closing AMQP connection <0.2891.0> (10.X.X.X:48034 -> 10.X.X.X:5672 - mistral-server:7:8eafa1b3-63bd-4e57-b0a7-9107043200bb):
missed heartbeats from client, timeout: 60s

Those missed heartbeats make nova_api log warnings/errors and force it to recreate its connection to rabbitmq:

nova/nova-api.log:2019-04-15 16:02:07.857 20 WARNING oslo.messaging._drivers.impl_rabbit [-] Unexpected error during heartbeart thread processing, retrying...: OSError: Server unexpectedly closed connection
nova/nova-api.log:2019-04-15 16:16:01.207 20 WARNING oslo.messaging._drivers.impl_rabbit [-] Unexpected error during heartbeart thread processing, retrying...: ConnectionResetError: [Errno 104] Connection reset by peer
nova/nova-api.log:2019-04-15 16:27:24.364 20 WARNING oslo.messaging._drivers.impl_rabbit [-] Unexpected error during heartbeart thread processing, retrying...: OSError: Server unexpectedly closed connection
nova/nova-api.log:2019-04-15 16:27:24.365 20 WARNING oslo.messaging._drivers.impl_rabbit [-] Unexpected error during heartbeart thread processing, retrying...: OSError: Server unexpectedly closed connection
nova/nova-api.log:2019-04-15 16:27:24.740 20 ERROR oslo.messaging._drivers.impl_rabbit [req-82a5fdc7-5fe8-4314-a31c-394eabdc418d b36fa1be80ae4275bb4511bb74d78eff 89d60bdd36ab49c5b0720eb43fa7d1dd - default default] [f78bdc50-0734-4823-b91b-7c9ac4227fd0] AMQP server on undercloud-0.ctlplane.localdomain:5672 is unreachable: Server unexpectedly closed connection. Trying again in 1 seconds.: OSError: Server unexpectedly closed connection
nova/nova-api.log:2019-04-15 16:30:27.358 19 WARNING oslo.messaging._drivers.impl_rabbit [-] Unexpected error during heartbeart thread processing, retrying...: OSError: Server unexpectedly closed connection
nova/nova-api.log:2019-04-15 16:30:27.359 19 WARNING oslo.messaging._drivers.impl_rabbit [-] Unexpected error during heartbeart thread processing, retrying...: OSError: Server unexpectedly closed connection

Running the AMQP heartbeat thread under mod_wsgi doesn't work as expected when the threading library is monkey_patched, because the thread waits on a data structure [1] that has been monkey patched [2], which makes it yield its execution instead of sleeping for 15s.

Now, because mod_wsgi stops the execution of its embedded interpreter as soon as it has finished processing a request, the AMQP heartbeat thread can't be resumed until there's a new request to be processed in the mod_wsgi queue (which would resume the python interpreter and make eventlet resume the thread).

In practice, on an undercloud, there's not a lot of traffic coming through nova_api. This causes the event loop to be stopped for long periods of time (sometimes more than 60s) and makes the scheduling of the AMQP heartbeat thread erratic, ultimately causing disconnections and reconnections to rabbitmq.

[1] https://github.com/openstack/oslo.messaging/blob/master/oslo_messaging/_drivers/impl_rabbit.py#L904
[2] https://github.com/openstack/oslo.utils/blob/master/oslo_utils/eventletutils.py#L182
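A standalone sketch of the effect described above (not nova or oslo.messaging code): once eventlet.monkey_patch() is applied, the timed wait in [1] becomes a cooperative green-thread yield, so the loop can only wake up while the interpreter is actually executing Python, which mod_wsgi does not guarantee between requests.

~~~
import eventlet
eventlet.monkey_patch()  # what nova-api does at import time

import threading
import time

stop = threading.Event()  # backed by eventlet primitives after patching


def heartbeat_loop():
    # oslo.messaging's heartbeat thread waits roughly like this; after the
    # monkey patch the wait yields to the eventlet hub instead of blocking a
    # native thread, so it can only resume while the interpreter is running
    # Python (e.g. while a request is being served).
    while not stop.wait(timeout=15):
        print('would send AMQP heartbeat at', time.time())


threading.Thread(target=heartbeat_loop, daemon=True).start()

# In this standalone script the main thread keeps yielding via sleep(), so
# heartbeats fire on time; under mod_wsgi the interpreter sits idle between
# requests, so they do not.
time.sleep(35)
stop.set()
~~~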

Version-Release number of selected component (if applicable):
I've verified this happens in both

- openstack-nova-api-20.3.1-0.20200626213433.38ee1f3.el8ost.noarch 
(as per https://catalog.redhat.com/software/containers/rhosp-rhel8/openstack-nova-api/5de6bdfe5a13461646f8fe2a?tag=16.1-47&push_date=1596018676000&container-tabs=packages)

- openstack-nova-api-20.3.1-0.20200626213436.38ee1f3.el8ost.noarch
(as per https://catalog.redhat.com/software/containers/rhosp-rhel8/openstack-nova-api/5de6bdfe5a13461646f8fe2a?tag=16.1-55&push_date=1600264861000&container-tabs=packages)

How reproducible:
Always

Steps to Reproduce:
1. Deploy an undercloud
2. Make sure that nova_api has some AMQP connections opened to rabbitmq by calling e.g. 'nova service-list'
3. Wait for 60s and observe some disconnections in the rabbitmq logs

Actual results:
rabbitmq closes connections that have been idle for more than 60s, which causes warnings/errors in the nova logs.

Expected results:
nova_api should send regular AMQP heartbeats to keep the connection to rabbit open when it is idle.

Additional info:
- In OSP16.1, this only affects nova_api, as other api services (e.g. heat_api) do not use eventlet.monkey_patch()
- The issue is tracked upstream in https://bugs.launchpad.net/nova/+bug/1825584

- The patch for Train is already merged https://review.opendev.org/c/openstack/nova/+/728057


--- Additional comment from  on 2019-05-17 17:17:26 UTC ---

This is an issue that is currently being investigated upstream; however, it does not
functionally break the nova api, as the heartbeats are not required for it to function
correctly. The current consensus is that it would be incorrect to try and circumvent
the wsgi server's thread management to force the server to keep the heartbeat thread alive.

While the heartbeat can stop, this does not break the underlying way that oslo.messaging
works, in that it will automatically reconnect to rabbitmq when a new api request is received.

A workaround has been found for deployments which run the nova-api under uwsgi that we
believe will also apply to mod_wsgi: when the nova-api is run under wsgi, the wsgi server
should be configured to run the wsgi app with 1 thread per interpreter process.

To address this upstream we are planning to take two approaches.

First, document the requirement to use only 1 thread per interpreter process
when running the nova-api under mod_wsgi or uwsgi.

For mod_wsgi this can be done by setting threads=1 as per
https://modwsgi.readthedocs.io/en/develop/configuration-directives/WSGIDaemonProcess.html
and similarly for uwsgi https://uwsgi-docs.readthedocs.io/en/latest/Options.html#threads

In addition to this, the log level of the connection closed message will be reduced to
info or debug, as this is a recoverable issue that is already handled by how
oslo.messaging is designed. Disconnection is not an error and should not be
reported as such in the logs.

From a deployment perspective,
https://github.com/openstack/tripleo-heat-templates/blob/master/deployment/nova/nova-api-container-puppet.yaml#L209-L244
should be updated to ensure only 1 thread is used and that api parallelism is managed at the process level instead.

--- Additional comment from  on 2019-05-17 17:18:59 UTC ---

Note: this bug should be cloned to cover the deployment aspect, and the initial bug should be retargeted to the oslo.messaging component to
track adjusting the log level.

--- Additional comment from Martin Schuppert on 2019-05-20 07:26:43 UTC ---

(In reply to smooney from comment #1)
> This is an issue that is currently being investigated upstream; however, it
> does not functionally break the nova api, as the heartbeats are not required
> for it to function correctly. The current consensus is that it would be
> incorrect to try and circumvent the wsgi server's thread management to force
> the server to keep the heartbeat thread alive.
> 
> While the heartbeat can stop, this does not break the underlying way that
> oslo.messaging works, in that it will automatically reconnect to rabbitmq
> when a new api request is received.
> 
> A workaround has been found for deployments which run the nova-api under
> uwsgi that we believe will also apply to mod_wsgi: when the nova-api is run
> under wsgi, the wsgi server should be configured to run the wsgi app with
> 1 thread per interpreter process.
> 
> To address this upstream we are planning to take two approaches.
> 
> First, document the requirement to use only 1 thread per interpreter process
> when running the nova-api under mod_wsgi or uwsgi.
> 
> For mod_wsgi this can be done by setting threads=1 as per
> https://modwsgi.readthedocs.io/en/develop/configuration-directives/WSGIDaemonProcess.html
> and similarly for uwsgi
> https://uwsgi-docs.readthedocs.io/en/latest/Options.html#threads
> 
> In addition to this, the log level of the connection closed message will be
> reduced to info or debug, as this is a recoverable issue that is already
> handled by how oslo.messaging is designed. Disconnection is not an error and
> should not be reported as such in the logs.
> 
> From a deployment perspective,
> https://github.com/openstack/tripleo-heat-templates/blob/master/deployment/nova/nova-api-container-puppet.yaml#L209-L244
> should be updated to ensure only 1 thread is used and that api parallelism
> is managed at the process level instead.

Cloned [1] for deployment, but note that we already have threads=1 (if not otherwise specified by an
operator), as this is the default in puppet-nova [2] - 10-nova_api_wsgi.conf:

# ************************************
# Vhost template in module puppetlabs-apache
# Managed by Puppet
# ************************************
#
<VirtualHost 192.168.24.1:8774>
  ServerName undercloud-0.ctlplane.localdomain

  ## Vhost docroot
  DocumentRoot "/var/www/cgi-bin/nova"

  ## Directories, there should at least be a declaration for /var/www/cgi-bin/nova

  <Directory "/var/www/cgi-bin/nova">
    Options Indexes FollowSymLinks MultiViews
    AllowOverride None
    Require all granted
  </Directory>

  ## Logging
  ErrorLog "/var/log/httpd/nova_api_wsgi_error.log"
  ServerSignature Off
  CustomLog "/var/log/httpd/nova_api_wsgi_access.log" combined
  SetEnvIf X-Forwarded-Proto https HTTPS=1

  ## WSGI configuration
  WSGIApplicationGroup %{GLOBAL}
  WSGIDaemonProcess nova-api display-name=nova_api_wsgi group=nova processes=4 threads=1 user=nova
  WSGIProcessGroup nova-api
  WSGIScriptAlias / "/var/www/cgi-bin/nova/nova-api"
</VirtualHost>


[1] https://bugzilla.redhat.com/show_bug.cgi?id=1711794
[2] https://github.com/openstack/puppet-nova/blob/stable/stein/manifests/wsgi/apache_api.pp#L55-L57

--- Additional comment from Sylvain Bauza on 2019-05-21 09:49:59 UTC ---

(In reply to smooney from comment #1)
> in addition to this the log level of the connection closed message will
> be reduced to info or debug as this is a recoverable issue that is already
> handled
> by how oslo.messaging is designed. disconnection is not an error and should
> not be
> reported as such in the logs.

Since BZ #1711794 covers the deployment aspects of this issue, I'll move this bug to the PID1 DFG to ask them to change the log level of oslo.messaging down to INFO.

Thanks Sean,
-Sylvain

--- Additional comment from Michele Baldessari on 2019-05-21 11:04:38 UTC ---

I don't think moving this oslo.messaging message to INFO level improves the situation in any way.

1) https://bugzilla.redhat.com/show_bug.cgi?id=1711794 has been closed because we already use a single thread, so that workaround is gone already.

2) If we had moved this heartbeat log message to INFO instead of ERROR we would never have root-caused this problem in mistral. So having this info that the thread is unable to be scheduled is extremely worthwhile and changing it to INFO would have masked the problem.

3) The fact that nova manages to recover from this problem is great, but I think in the end it still needs a proper fix, because if eventlet does not get to schedule some threads often enough, there is nothing preventing the eventlet code (cells stuff?) from also not being scheduled often enough, masking an actual problem.

This bug is specifically about nova having problematic interactions with mod_wsgi and eventlet. Whether https://review.opendev.org/657750 fixes things is still to be checked, but we do not think that papering over the log level of this message solves much.

--- Additional comment from  on 2019-05-21 12:02:20 UTC ---

From a nova perspective we do not believe that it is correct to circumvent
the wsgi server's native execution model by spawning a threadpool or process pool
within the wsgi application.

As the initial proponent of the pool idea I find myself coming to agree with that
viewpoint. As such it is unlikely that we will proceed in the direction outlined in
https://review.opendev.org/657750; our current recommendation is to change the log level,
as it is not an error that the operator can act on to remediate.

oslo.messaging correctly handles reconnection of the amqp connection internally and
retransmits pending messages. As such, the fact that the heartbeat is non-functional has no
impact on the nova-api.

We do not run other background tasks in the nova-api and no other nova component uses both wsgi and
eventlet; until Stein the nova api did not use eventlet at all. As such the nova api is the only
component with this issue, so there is no issue with other "cells stuff" not being scheduled.

Since the thread pool proposed in https://review.opendev.org/657750 would still be tied to the python
interpreter created by the wsgi server, if that interpreter was stopped by the wsgi server the thread pool
would also be stopped. As such, even in its current form it is not clear that it would actually solve the problem,
although since we did not have this issue before we started using eventlet, it is likely the heartbeat thread would
have prevented the wsgi server from suspending the wsgi application when there were no requests. That fixes the heartbeat
but prevents the wsgi server from suspending to save power and reduce resource usage when the application is not serving
a request.

--- Additional comment from Matthew Booth on 2019-05-24 15:40:55 UTC ---

After some discussion, the decision here was to disable oslo.messaging heartbeats entirely in nova.conf for the n-api service. This means there is no need to alter the severity of the log message, as it won't be produced. We believe our existing tcp keepalive settings will effectively serve the same purpose in any case. Disabling heartbeats in oslo.messaging is currently an experimental feature, so it's not clear that upstream would take this solution yet.

We need to test this. We believe that it should be covered by standard 'destructive' testing, but it would be good to both confirm this, and give them a heads up to look out for it. Joe, do you know which tests would cover down controllers or rabbit service, and who is responsible for running them?

--- Additional comment from RHEL Program Management on 2019-05-27 07:13:29 UTC ---

This bugzilla has been removed from the release since it is missing the Triaged keyword or does not have an acked release flag. For details, see https://mojo.redhat.com/docs/DOC-1144661#jive_content_id_OSP_Release_Planning.

--- Additional comment from Bogdan Dobrelya on 2019-05-27 14:14:19 UTC ---

(In reply to Matthew Booth from comment #7)
> After some discussion, the decision here was to disable oslo.messaging
> heartbeats entirely in nova.conf for the n-api service. This means there is
> no need to alter the severity of the log message, as it won't be produced.
> We believe our existing tcp keepalive settings will effectively serve the
> same purpose in any case. Disabling heartbeats in oslo.messaging is
> currently an experimental feature, so it's not clear that upstream would
> take this solution yet.
> 
> We need to test this. We believe that it should be covered by standard
> 'destructive' testing, but it would be good to both confirm this, and give
> them a heads up to look out for it. Joe, do you know which tests would cover
> down controllers or rabbit service, and who is responsible for running them?

IIRC, the original cause for introducing heartbeats was https://bugs.launchpad.net/nova/+bug/856764.

Firstly, I would *not* recommend disabling heartbeats in favour of TCP-KA, mostly because of high expectations to see "Timed out waiting for a reply" returns (https://bugs.launchpad.net/oslo.messaging/+bug/1338732) and inefficient TCP KA defaults for Linux (see https://bugs.launchpad.net/fuel/+bug/1289200 and https://review.opendev.org/93815 for details).

But if we do that, destructive testing is needed, indeed. And to measure properly whether disabling heartbeats "just works" for a TCP-KA setup as well, the acceptance criteria should take into account the following:

a) The destructive test should verify a forced power-off of a controller holding most (better, all) of the VIPs for the openstack APIs;
b) No rising counts across all openstack services of the "Timed out waiting for a reply" message, where the services' logging ranges match the failover-recover period defined for the destructive test (e.g. 120 sec) plus the stabilization time (e.g. 60 sec).
c) Openstack services' rpc_response_timeout settings correspond to the used sysctl TCP-KA values (either defaults or the tuned values as we have for it in OSP) and also correspond to the appropriate heartbeat_timeout_threshold setting. See also https://medium.com/@george.shuklin/rabbit-heartbeat-timeouts-in-openstack-fa5875e0309a (go straight to "Conclusion") for how to evaluate the latter. See the example I provided below.


So for example:
- we want to test the full failover-recover window of 180 sec (120 sec for the failure-recover window + 60 sec to cover all of the post-stabilization dances for safe testing),
- we expect all the "useful RPC work" to be done in no longer than 60 sec (meaning, for some specific service under test, like Nova, to spawn an instance e2e),
- count all "Timed out waiting for a reply" messages across all involved service logs for that 180 sec period.

Then, the corresponding settings may be defined like that:

* oslo's heartbeats enabled:
rpc_response_timeout = 120
heartbeat_timeout_threshold = (rpc_response_timeout - work_time) / 2 - 1, which gives us heartbeat_timeout_threshold = (120 - 60)/2 - 1 = 29 (see the sketch at the end of this comment)

* TCP KA alternative
oslo:
heartbeat_timeout_threshold = 0
sysctls:
net.ipv4.tcp_keepalive_time = 29
net.ipv4.tcp_keepalive_probes = 3
net.ipv4.tcp_keepalive_intvl = 31

Otherwise, I'm afraid we would not get any useful testing outcomes if all those subtle timings are left unaligned, ending up with inevitable HA regressions.
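
For concreteness, a small helper applying the rule of thumb above (the function name is illustrative, not from any OpenStack library):

~~~
def heartbeat_timeout_threshold(rpc_response_timeout, work_time):
    # Rule of thumb from this comment: (rpc_response_timeout - work_time)/2 - 1,
    # floored at 0 (0 means oslo.messaging heartbeats are disabled entirely).
    return max(0, (rpc_response_timeout - work_time) // 2 - 1)


# Worked example from above: 120 sec rpc_response_timeout, 60 sec of
# "useful RPC work" gives a threshold of 29.
assert heartbeat_timeout_threshold(120, 60) == 29
~~~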

--- Additional comment from Bogdan Dobrelya on 2019-05-27 14:22:15 UTC ---

restored the needinfo I removed accidentally

--- Additional comment from Damien Ciabrini on 2019-05-27 14:49:13 UTC ---

(In reply to Bogdan Dobrelya from comment #9)
> Firstly, I would *not* recommend disabling heartbeats for the preference of
> TCP-KA, mostly because of high expectations to see "Timed out waiting for a
> reply" returns (https://bugs.launchpad.net/oslo.messaging/+bug/1338732) and
> inefficient TCP KA defaults for Linux (see
> https://bugs.launchpad.net/fuel/+bug/1289200 and
> https://review.opendev.org/93815 for details).
> 

Note that in parallel to that we're also investigating ways to restore the
original heartbeat monitoring behaviour in oslo.messaging, but the idea with
TCP-KA is that it could be a shorter-term fix, so we'd like to see if it fits
the needs.

> But if we would do that, the destructive testing is needed, indeed. And to
> measure it properly if disabling heartbeats "just works" for TCP-ka setup as
> well, the acceptance criteria should take into account the following:

Agreed. 

> * TCP KA alternative
> oslo:
> heartbeat_timeout_threshold = 0
> sysctls:
> net.ipv4.tcp_keepalive_time = 29
> net.ipv4.tcp_keepalive_probes = 3
> net.ipv4.tcp_keepalive_intvl = 31

The settings for TCP-KA are rather aggressive in TripleO:

/proc/sys/net/ipv4/tcp_keepalive_intvl = 1
/proc/sys/net/ipv4/tcp_keepalive_probes = 5
/proc/sys/net/ipv4/tcp_keepalive_time = 5

So this should be good enough to notice a rabbit disconnection in a timely manner
(with these values a dead peer is detected after roughly tcp_keepalive_time +
tcp_keepalive_probes * tcp_keepalive_intvl = 5 + 5*1 = 10 seconds).
But again, our destructive tests will ultimately validate that short-term
workaround, and give us time to come up with a cleaner fix in oslo.messaging.

--- Additional comment from  on 2019-05-27 16:24:40 UTC ---

I set comment 11 to private since it is a reply to a private message.

After our meeting on Friday we spoke about this briefly with Dan Smith,
and he expressed his preference upstream to leave keepalive enabled and
reduce the log level in the nova-api for oslo messaging warnings.

I have cleared the blocker flag, as while the current log message is a little
scary the nova api is functional: we deploy with only 1 wsgi thread per process,
so we recover fine.

If we can run those destructive tests and use that as input into our decision in
a short timespan, we should likely do that.

Either option, disabling the heartbeat or suppressing the log message, will effectively
fall back to the tcp keepalive anyway when the api is not under load.

The wsgi server will suspend the nova-api application due to inactivity, so the real question is:
is that good enough in a failure scenario, e.g. rabbitmq got rebooted?

When the api is under load the wsgi app will
not get suspended for being idle and the heartbeat will keep the rabbitmq connection
alive, so not disabling the heartbeat is preferable in that scenario, as api requests that don't
need rabbit will keep the heartbeat alive for api requests that do.

The problem we have with that is the UX of the oslo messaging error message, which is not
actionable by the operator that sees it.

--- Additional comment from Martin Schuppert on 2019-07-18 09:15:50 UTC ---

Joe, could you answer the question in comment 7?

~~~
We need to test this. We believe that it should be covered by standard 'destructive' testing, but it would be good to both confirm this, and give them a heads up to look out for it. Joe, do you know which tests would cover down controllers or rabbit service, and who is responsible for running them?
~~~

--- Additional comment from Scott Lewis on 2019-09-24 19:39:35 UTC ---

Updating flag and TM post 15.GA; DFGs need to update the TM when committing BZ to a release.

--- Additional comment from Martin Schuppert on 2019-10-01 10:40:40 UTC ---

Removing the triage keyword to discuss this again in the next triage call, to see if there is something we need to do or if we are good.

Note: As I'll be out this Friday, reverting the assignee to default.

--- Additional comment from Martin Schuppert on 2019-10-02 15:20:08 UTC ---

moving back to nova component to track/discuss the log handling

--- Additional comment from Martin Schuppert on 2019-10-06 13:56:27 UTC ---

(In reply to Martin Schuppert from comment #16)
> moving back to nova component to track/discuss the log handling

Missed adding the triage keyword; added now.

--- Additional comment from Matthew Booth on 2019-10-23 13:38:38 UTC ---

Martin, please could you summarise the remaining task and update the bug title accordingly?

--- Additional comment from Martin Schuppert on 2019-10-23 13:39:38 UTC ---

(In reply to Martin Schuppert from comment #16)
> moving back to nova component to track/discuss the log handling

To make it a bit more clear: this BZ is to implement the change in nova regarding the log message handling which
was summarized by Sean in comment 12:

~~~
After our meeting on Friday we spoke about this briefly with Dan Smith,
and he expressed his preference upstream to leave keepalive enabled and
reduce the log level in the nova-api for oslo messaging warnings.
~~~

--- Additional comment from  on 2020-04-23 15:16:32 UTC ---

Hi everyone.

I am handling a case with this exact (?) issue, but on OSP 16.0.1, and only in the nova-api logs.

Example of error:
2020-04-22 10:01:51.075 22 ERROR oslo.messaging._drivers.impl_rabbit [-] Connection failed: [Errno 104] Connection reset by peer (retrying in 0 seconds): ConnectionResetError: [Errno 104] Connection reset by peer
2020-04-22 10:04:26.967 26 ERROR oslo.messaging._drivers.impl_rabbit [-] Connection failed: [Errno 104] Connection reset by peer (retrying in 0 seconds): ConnectionResetError: [Errno 104] Connection reset by peer
2020-04-22 10:04:26.976 26 ERROR oslo.messaging._drivers.impl_rabbit [-] Connection failed: [Errno 104] Connection reset by peer (retrying in 0 seconds): ConnectionResetError: [Errno 104] Connection reset by peer
2020-04-22 10:09:29.231 31 INFO oslo.messaging._drivers.impl_rabbit [-] A recoverable connection/channel error occurred, trying to reconnect: [Errno 104] Connection reset by peer
2020-04-22 10:09:29.241 31 INFO oslo.messaging._drivers.impl_rabbit [-] A recoverable connection/channel error occurred, trying to reconnect: [Errno 104] Connection reset by peer
2020-04-22 10:15:09.192 26 INFO oslo.messaging._drivers.impl_rabbit [-] A recoverable connection/channel error occurred, trying to reconnect: Server unexpectedly closed connection 
2020-04-22 10:15:09.200 26 INFO oslo.messaging._drivers.impl_rabbit [-] A recoverable connection/channel error occurred, trying to reconnect: Server unexpectedly closed connection
2020-04-22 10:16:10.439 23 INFO oslo.messaging._drivers.impl_rabbit [-] A recoverable connection/channel error occurred, trying to reconnect: [Errno 104] Connection reset by peer
2020-04-22 10:16:16.622 31 INFO oslo.messaging._drivers.impl_rabbit [-] A recoverable connection/channel error occurred, trying to reconnect: Server unexpectedly closed connection
2020-04-22 10:16:16.631 31 INFO oslo.messaging._drivers.impl_rabbit [-] A recoverable connection/channel error occurred, trying to reconnect: Server unexpectedly closed connection
2020-04-22 10:16:16.636 31 INFO oslo.messaging._drivers.impl_rabbit [-] A recoverable connection/channel error occurred, trying to reconnect: Server unexpectedly closed connection
2020-04-22 10:28:12.478 32 INFO oslo.messaging._drivers.impl_rabbit [-] A recoverable connection/channel error occurred, trying to reconnect: Server unexpectedly closed connection 
2020-04-22 10:28:12.486 32 INFO oslo.messaging._drivers.impl_rabbit [-] A recoverable connection/channel error occurred, trying to reconnect: Server unexpectedly closed connection
2020-04-22 10:28:26.717 29 ERROR oslo.messaging._drivers.impl_rabbit [-] Connection failed: [Errno 104] Connection reset by peer (retrying in 0 seconds): ConnectionResetError: [Errno 104] Connection reset by peer
2020-04-22 10:28:26.727 29 ERROR oslo.messaging._drivers.impl_rabbit [-] Connection failed: [Errno 104] Connection reset by peer (retrying in 0 seconds): ConnectionResetError: [Errno 104] Connection reset by peer
2020-04-22 10:58:56.526 32 INFO oslo.messaging._drivers.impl_rabbit [-] A recoverable connection/channel error occurred, trying to reconnect: Server unexpectedly closed connection 
2020-04-22 10:58:56.535 32 INFO oslo.messaging._drivers.impl_rabbit [-] A recoverable connection/channel error occurred, trying to reconnect: Server unexpectedly closed connection
2020-04-22 10:59:27.198 26 ERROR oslo.messaging._drivers.impl_rabbit [-] Connection failed: [Errno 104] ECONNRESET (retrying in 0 seconds): ConnectionResetError: [Errno 104] ECONNRESET
2020-04-22 10:59:27.206 26 ERROR oslo.messaging._drivers.impl_rabbit [-] Connection failed: timed out (retrying in 0 seconds): socket.timeout: timed out
2020-04-22 11:33:12.274 22 INFO oslo.messaging._drivers.impl_rabbit [-] A recoverable connection/channel error occurred, trying to reconnect: Server unexpectedly closed connection 
2020-04-22 11:33:12.283 22 INFO oslo.messaging._drivers.impl_rabbit [-] A recoverable connection/channel error occurred, trying to reconnect: Server unexpectedly closed connection
2020-04-22 11:36:51.309 32 INFO oslo.messaging._drivers.impl_rabbit [-] A recoverable connection/channel error occurred, trying to reconnect: Server unexpectedly closed connection 
2020-04-22 11:36:51.318 32 INFO oslo.messaging._drivers.impl_rabbit [-] A recoverable connection/channel error occurred, trying to reconnect: Server unexpectedly closed connection
2020-04-22 11:37:03.316 23 INFO oslo.messaging._drivers.impl_rabbit [-] A recoverable connection/channel error occurred, trying to reconnect: Server unexpectedly closed connection 
2020-04-22 11:42:12.942 28 INFO oslo.messaging._drivers.impl_rabbit [-] A recoverable connection/channel error occurred, trying to reconnect: Server unexpectedly closed connection
2020-04-22 11:42:15.605 31 ERROR oslo.messaging._drivers.impl_rabbit [-] Connection failed: [Errno 104] Connection reset by peer (retrying in 0 seconds): ConnectionResetError: [Errno 104] Connection reset by peer
2020-04-22 11:42:15.615 31 ERROR oslo.messaging._drivers.impl_rabbit [-] Connection failed: [Errno 104] Connection reset by peer (retrying in 0 seconds): ConnectionResetError: [Errno 104] Connection reset by peer
2020-04-22 11:42:15.621 31 ERROR oslo.messaging._drivers.impl_rabbit [-] Connection failed: [Errno 104] Connection reset by peer (retrying in 0 seconds): ConnectionResetError: [Errno 104] Connection reset by peer

There is no visible impact from the customer's point of view, BUT now they are afraid of putting this cluster in production.

I already told them it looks harmless and shouldn't be a blocker.

If there's anything you want to try, please let me know.

--- Additional comment from  on 2020-04-23 15:25:10 UTC ---

Not really; this is expected if you are running the api using Apache mod_wsgi or uwsgi.

When the api runs in a webserver's wsgi server, the lifetime of its threads is managed by the webserver, and the reason this happens is that the heartbeat thread goes to sleep after a request is serviced.

So the heartbeat is missed and you get the error in the log.

We still have not changed the log level upstream, so there is no way to get rid of that unless you don't run the api in an external wsgi server.

Does the customer have a support exception for 16.0.1?
Note that that is a short-life release whose support will end in August,
so if they don't have a support exception they should upgrade to 16.1, which comes out in July, or they will have issues when it is EOL in August.

--- Additional comment from  on 2020-07-08 16:11:20 UTC ---

Note: this has been fixed on master and backports are in flight upstream.

--- Additional comment from  on 2020-09-30 20:04:20 UTC ---

Closing EOL, OSP 15 has been retired as of Sept 19, 2020

