Bug 2036377 - Silence amqp heartbeat warning
Summary: Silence amqp heartbeat warning
Keywords:
Status: CLOSED CURRENTRELEASE
Alias: None
Product: Red Hat OpenStack
Classification: Red Hat
Component: openstack-nova
Version: 16.2 (Train)
Hardware: x86_64
OS: Linux
Priority: medium
Severity: medium
Target Milestone: z6
Target Release: 16.2 (Train on RHEL 8.4)
Assignee: smooney
QA Contact: OSP DFG:Compute
URL:
Whiteboard:
Depends On: 1706456
Blocks:
 
Reported: 2021-12-31 15:31 UTC by Luigi Tamagnone
Modified: 2023-06-19 16:28 UTC
CC: 28 users

Fixed In Version: openstack-nova-20.6.2-2.20230509165016.6ed726b.el8osttrunk
Doc Type: If docs needed, set a value
Doc Text:
Clone Of: 1706456
Environment:
Last Closed: 2023-06-19 16:28:28 UTC
Target Upstream Version:
Embargoed:




Links
System ID Private Priority Status Summary Last Updated
OpenStack gerrit 728057 0 None MERGED Silence amqp heartbeat warning 2023-06-19 15:27:49 UTC
Red Hat Issue Tracker OSP-11975 0 None None None 2021-12-31 15:38:30 UTC

Description Luigi Tamagnone 2021-12-31 15:31:13 UTC
This bug is similar to Bug #1706456, which should already be solved by:
https://review.opendev.org/c/openstack/nova/+/728057/

But from the logs, the issue seems to still be present on RHOSP 16.2:
~~~
controller-prod-/var/log/containers/nova/nova-api.log.1:
2021-12-27 07:59:27.033 15 ERROR oslo.messaging._drivers.impl_rabbit [-] [4d9b9734-5d3c-4115-bed6-98be55f9750f] AMQP server on controller3-prod.internalapi.it:5672 is unreachable: Server unexpectedly closed connection. Trying again in 1 seconds.: OSError: Server unexpectedly closed connection
2021-12-27 07:59:30.570 15 INFO oslo.messaging._drivers.impl_rabbit [-] [4d9b9734-5d3c-4115-bed6-98be55f9750f] Reconnected to AMQP server on controller3-prod.internalapi.it:5672 via [amqp] client with port 38674.

controller3-prod/var/log/containers/rabbitmq/rabbit:
2021-12-27 07:59:25.682 [error] <0.32736.676> closing AMQP connection <0.32736.676> (10.0.0.4:58198 -> 10.0.0.132:5672 - mod_wsgi:15:4d9b9734-5d3c-4115-bed6-98be55f9750f):
missed heartbeats from client, timeout: 60s
~~~

Comment 1 melanie witt 2022-01-05 01:16:12 UTC
In the nova logs you show, I don't see any log message mentioning "heartbeat". As you mentioned, a fix [1] in OSP16.2 was done to suppress heartbeat warning log messages like the following:

"Unexpected error during heartbeat thread processing"

The above ^ is the only log message that is covered in the fix [1]. If there are additional log messages that should be suppressed, it will be a new bug that needs a new patch. I can see that you show a correlation between the "missed heartbeats from client" error message in the rabbitmq log and the "AMQP server ... is unreachable: Server unexpectedly closed connection" and "Reconnected to AMQP server" messages in the nova log.

I will add this rhbz to our bug triage call agenda for discussion and add a comment afterward.

[1] https://review.opendev.org/c/openstack/nova/+/728057

Comment 2 melanie witt 2022-01-05 22:06:42 UTC
We discussed this rhbz during our team bug call today and the tl;dr is that suppressing the "Server unexpectedly closed connection" and "Reconnected to AMQP server" log messages is not a reasonable option because these log messages are not exclusively about missed heartbeats. The same log messages are emitted when there are connection issues with rabbitmq such as during a network partition, for example. In such a case, the log messages are helpful for debugging connection issues with rabbitmq in the deployment. If we suppress them, we would lose the ability to debug in the latter situation.

That said, there is a way to avoid the aforementioned log messages that are due to missed heartbeats. There is an experimental feature in oslo.messaging [1] to run the heartbeat check in a native python thread, which will isolate the message queue heartbeat mechanism from the rest of the eventlet-managed process in nova. This means the heartbeat mechanism will not be suspended and resumed by eventlet, so it will send the heartbeats periodically without any interruption, avoiding missed heartbeats due to suspend/resume of nova-api.

To try the feature, set the following config option in the nova_api container nova.conf:

[oslo_messaging_rabbit]
heartbeat_in_pthread = True

It can also be set in the nova puppet hiera data, for example:

  "nova::rabbit_heartbeat_in_pthread": true


Starting in OSP17, the above ^ will be the default setting and will no longer need to be configured unless the customer wants to turn it off.


[1] https://opendev.org/openstack/oslo.messaging/commit/22f240b82fffbd62be8568a7d0d3369134596ace

Comment 5 Luigi Tamagnone 2022-01-20 08:53:58 UTC
Hi,

after the advised change (heartbeat_in_pthread = True), the log errors have decreased, but there are still a couple of them.
From the logs:
~~~
var/log/containers/nova/nova-api.log:
2022-01-17 14:25:43.306 14 ERROR oslo.messaging._drivers.impl_rabbit [-] [1a196056-c5e4-4b67-b0a5-c183c8f0e64b] AMQP server on controller1:5672 is unreachable: [Errno 104] Connection reset by peer. Trying again in 1 seconds.: ConnectionResetError: [Errno 104] Connection reset by peer
var/log/containers/nova/nova-api.log.5.gz:
2022-01-14 14:47:31.972 13 ERROR oslo.messaging._drivers.impl_rabbit [-] [05f96b2b-1ed7-4b28-b58b-7b5fefa5cd12] AMQP server on controller2:5672 is unreachable: [Errno 104] Connection reset by peer. Trying again in 1 seconds.: ConnectionResetError: [Errno 104] Connection reset by peer
2022-01-14 14:51:59.814 13 ERROR oslo.messaging._drivers.impl_rabbit [-] [05f96b2b-1ed7-4b28-b58b-7b5fefa5cd12] AMQP server on controller2:5672 is unreachable: Server unexpectedly closed connection. Trying again in 1 seconds.: OSError: Server unexpectedly closed connection
~~~
It's probably harmless. Do you think this ERROR should disappear with the new feature [1]? Do you think further analysis could be useful?


[1] https://opendev.org/openstack/oslo.messaging/commit/22f240b82fffbd62be8568a7d0d3369134596ace

Comment 6 melanie witt 2022-01-21 03:26:08 UTC
(In reply to Luigi Tamagnone from comment #5)
> Hi,
> 
> after the advised change (heartbeat_in_pthread = True), the logs errors are
> decreased but there is still a couple of them
> From the logs:
> ~~~
> var/log/containers/nova/nova-api.log:
> 2022-01-17 14:25:43.306 14 ERROR oslo.messaging._drivers.impl_rabbit [-]
> [1a196056-c5e4-4b67-b0a5-c183c8f0e64b] AMQP server on controller1:5672 is
> unreachable: [Errno 104] Connection reset by peer. Trying again in 1
> seconds.: ConnectionResetError: [Errno 104] Connection reset by peer
> var/log/containers/nova/nova-api.log.5.gz:
> 2022-01-14 14:47:31.972 13 ERROR oslo.messaging._drivers.impl_rabbit [-]
> [05f96b2b-1ed7-4b28-b58b-7b5fefa5cd12] AMQP server on controller2:5672 is
> unreachable: [Errno 104] Connection reset by peer. Trying again in 1
> seconds.: ConnectionResetError: [Errno 104] Connection reset by peer
> 2022-01-14 14:51:59.814 13 ERROR oslo.messaging._drivers.impl_rabbit [-]
> [05f96b2b-1ed7-4b28-b58b-7b5fefa5cd12] AMQP server on controller2:5672 is
> unreachable: Server unexpectedly closed connection. Trying again in 1
> seconds.: OSError: Server unexpectedly closed connection
> ~~~
> Probably it's harmless, Do you think this ERROR should disappear with the
> new feature[1]? Do you think further analysis could be useful?
> 
> 
> [1]
> https://opendev.org/openstack/oslo.messaging/commit/
> 22f240b82fffbd62be8568a7d0d3369134596ace

Setting 'heartbeat_in_pthread = True' enables the feature [1] entirely. So if you are still seeing occasional messages, it could mean (a) there is a legitimate connection or load issue between nova-api and rabbitmq in the deployment, or (b) there is a configuration issue, or (c) there is a bug in the oslo.messaging feature.

When you see the remaining log error messages, do you find them correlating with missed heartbeats in the rabbitmq logs?

Is nova-api running under mod_wsgi? If so, is the WSGI app configured with threads=1 [2]?

Is nova-api experiencing high load of requests when the connection drops and reconnects? If so, it's possible that nova-api or rabbitmq configs could be adjusted to improve performance. This is an example of a connection reset problem some larger scale operators had with neutron in the past:

http://lists.openstack.org/pipermail/openstack-discuss/2019-June/006936.html

to give an idea of what kind of configuration adjustments are possible.

[2] https://docs.openstack.org/releasenotes/nova/stein.html#known-issues
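For reference, the relevant mod_wsgi setting is the threads count on the WSGI daemon process. A minimal sketch of the directive (the process name and processes count are illustrative; in OSP this file is generated by the deployment tooling inside the nova_api container):

  # threads=1 is the part that matters for [2]; other values are illustrative
  WSGIDaemonProcess nova-api processes=4 threads=1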

Comment 7 Luigi Tamagnone 2022-02-09 17:17:08 UTC
Hi,

(In reply to melanie witt from comment #6)

> Setting 'heartbeat_in_pthread = True' is enabling the feature [1] entirely.
> So if you are still seeing occasional messages, it could mean (a) there is a
> legitimate connection or load issue between nova-api and rabbitmq in the
> deployment or (b) there is a configuration issue or (c) there is bug in the
> oslo.messaging feature.
> 
> When you see the remaining log error messages, do you find them correlating
> with missed heartbeats in the rabbitmq logs?

No, I don't find much for nova. There are two connection resets that I think are not tied to the missed heartbeats reported afterward:
~~~
controller2/var/log/containers/nova/nova-api.log.4.gz:
2022-01-25 09:37:47.952 14 ERROR oslo.messaging._drivers.impl_rabbit [-] [2b7a9c0f-e166-4043-af06-542ce12414f4] AMQP server on controller1.internalapi.sdc.itidm.elisa.fi:5672 is unreachable: [Errno 104] Connection reset by peer. Trying again in 1 seconds.: ConnectionResetError: [Errno 104] Connection reset by peer

controller2/var/log/containers/rabbitmq/rabbit\@controller2.log.2.gz:
2022-01-25 09:37:48.989 [info] <0.9034.35> Mirrored queue 'reply_31ea97a000b74eee9586ac477053afe1' in vhost '/': Adding mirror on node 'rabbit@controller1': <46560.8977.285>

controller1/var/log/containers/rabbitmq/rabbit.2.gz:
2022-01-25 09:40:48.972 [error] <0.8960.285> closing AMQP connection <0.8960.285> (10.x.x.68:60820 -> 10.x.x.4:5672 - mod_wsgi:14:2b7a9c0f-e166-4043-af06-542ce12414f4):
missed heartbeats from client, timeout: 60s
-----
controller3/var/log/containers/nova/nova-api.log.14.gz:
2022-01-20 12:47:03.877 13 ERROR oslo.messaging._drivers.impl_rabbit [-] [05f96b2b-1ed7-4b28-b58b-7b5fefa5cd12] AMQP server on controller2:5672 is unreachable: [Errno 104] Connection reset by peer. Trying again in 1 seconds.: ConnectionResetError: [Errno 104] Connection reset by peer

controller2/var/log/containers/rabbitmq/rabbit\@controller2.log.4.gz
2022-01-20 12:47:03.739 [info] <0.14906.9> Mirrored queue 'reply_3b74c684067c4a22b64db9fbd24e7b13' in vhost '/': Adding mirror on node 'rabbit@espsti-controller3v-test1': <46559.23138.150>

controller3/var/log/containers/rabbitmq/rabbit.5.gz:
2022-01-20 12:50:03.716 [error] <0.23121.150> closing AMQP connection <0.23121.150> (10.x.x.4:54492 -> 10.x.x.132:5672 - mod_wsgi:15:97d6d7c1-728c-4f90-b8eb-edce51ffe03a):
missed heartbeats from client, timeout: 60s

~~~

> Is nova-api running under mod_wsgi? If so, is the WSGI app configured with
> threads=1 [2]?
yes

Comment 24 smooney 2023-06-19 16:28:28 UTC
This was addressed in 16.1 and is not required for 16.2.x.

