Bug 2115383 - Services can get stuck after trying to reconnect to rabbitmq
Summary: Services can get stuck after trying to reconnect to rabbitmq
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: Red Hat OpenStack
Classification: Red Hat
Component: openstack-tripleo-heat-templates
Version: 17.0 (Wallaby)
Hardware: x86_64
OS: Linux
Priority: medium
Severity: high
Target Milestone: ga
Target Release: 17.0
Assignee: Takashi Kajinami
QA Contact: dabarzil
URL:
Whiteboard:
Duplicates: 2080952, 2112909
Depends On:
Blocks:
 
Reported: 2022-08-04 14:51 UTC by Slawek Kaplonski
Modified: 2022-09-21 12:25 UTC (History)
CC: 22 users

Fixed In Version: openstack-tripleo-heat-templates-14.3.1-0.20220719171720.feca772.el9ost puppet-swift-18.5.1-0.20220307041424.1d4dab1.el9ost
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Clones: 2116311
Environment:
Last Closed: 2022-09-21 12:24:42 UTC
Target Upstream Version:
Embargoed:




Links
System ID Private Priority Status Summary Last Updated
Launchpad 1984076 0 None None None 2022-08-16 15:09:39 UTC
OpenStack gerrit 852528 0 None MERGED Support more [oslo_messaging_rabbit] parameters 2022-08-22 00:29:46 UTC
OpenStack gerrit 852713 0 None MERGED Disable [oslo_messaging_rabbit] heartbeat_in_pthread 2022-08-22 00:29:37 UTC
Red Hat Issue Tracker OSP-18030 0 None None None 2022-08-04 14:58:36 UTC
Red Hat Product Errata RHEA-2022:6543 0 None None None 2022-09-21 12:25:14 UTC

Description Slawek Kaplonski 2022-08-04 14:51:54 UTC
This was initially found by Fiorella Yanac (fyanac) and reported in https://bugzilla.redhat.com/show_bug.cgi?id=2112909, but since that bug was filed against neutron and focused on the result of the issue rather than the real problem, which could be confusing later, I decided to open a new bug for it.


Description of problem:
We noticed recently that nova-compute and the neutron agents can hang and do nothing after connectivity to rabbitmq is broken and later restored.


Version-Release number of selected component (if applicable):
OSP-17.0
python3-oslo-messaging-12.7.3-0.20220430102742.5d6fd1a.el9ost.noarch


How reproducible:
Very often. Almost every time I restarted the rabbitmq cluster, some services were stuck.


Steps to Reproduce:
1. Stop the rabbitmq cluster and wait a few seconds, so that nova-compute and the neutron agents log that they can't connect to rabbitmq.
2. Start the rabbitmq cluster.
3. Wait some time and check whether any agents are DOWN in the API. If some service is DOWN, you can check on its node that it is not logging anything at all.
When you run strace on such a process, it will be stuck on something like:

strace: Process 411673 attached
futex(0x7f86d000a7e0, FUTEX_WAIT_BITSET_PRIVATE|FUTEX_CLOCK_REALTIME, 0, NULL, FUTEX_BITSET_MATCH_ANY

However, the container will keep running and will be reported as "healthy" by podman.
Restarting the container fixes the problem.
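
For reference, a minimal sketch of the reproduction on a controller, assuming rabbitmq runs as the usual pacemaker-managed bundle (the resource name and exact commands are illustrative):

# Sketch only -- assumes the pacemaker-managed rabbitmq bundle
pcs resource disable rabbitmq-bundle
sleep 30    # let nova-compute and the neutron agents log connection errors
pcs resource enable rabbitmq-bundle

# After a while, look for agents/services reported DOWN in the API:
openstack network agent list
openstack compute service list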

Comment 1 Slawek Kaplonski 2022-08-04 14:55:13 UTC
*** Bug 2112909 has been marked as a duplicate of this bug. ***

Comment 2 John Eckersberg 2022-08-04 16:03:52 UTC
Being stuck on a futex is normal; there are multiple worker threads, and I'm guessing what you are seeing is the main thread blocking until the threads synchronize at service shutdown. At the very least, use '-f' with strace so you can see all of the threads; you should then see other activity happening.
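
For example, using the PID from the report above:

strace -f -p 411673    # -f follows all threads, not just the main one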

Most helpful would be to gather a guru meditation report (https://docs.openstack.org/nova/yoga/reference/gmr.html) which will provide stack traces for all of the running greenthreads.  It's been a while since I've done this but I believe the GMR output will end up either in the container output ("podman logs") or in the journal somewhere.  It will not be in the nova/neutron log file(s).
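
A rough sketch of capturing one, assuming the agent runs as PID 1 in its container, SIGUSR2 triggers the report (as in the linked GMR docs), and these container/unit names match your deployment:

podman exec neutron_ovs_agent kill -USR2 1      # trigger the GMR
podman logs neutron_ovs_agent                   # the report may show up here
journalctl -u tripleo_neutron_ovs_agent         # ...or in the journal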

Comment 6 Luca Miccini 2022-08-05 06:21:14 UTC
I'm far from being an expert in these things, but here is my shot at trying to understand what happened.

RabbitMQ was last restarted around 12:XX and these logs are from 13:00 onward, so I'd exclude an issue with the rabbitmq server itself.

In the rabbitmq logs on all three controllers I see the same pattern (I've taken openvswitch-agent as a reference).

1. connection to neutron-openvswitch-agent gets cut from the server side because of "{inet_error,etimedout}" and immediately recovers:

2022-08-03 13:00:40.635748+00:00 [error] <0.4138.0> closing AMQP connection <0.4138.0> (172.17.1.143:35958 -> 172.17.1.109:5672 - neutron-openvswitch-agent:26781:e79ddbde-b7cd-41db-98d0-b43e49864df2):
2022-08-03 13:00:40.635748+00:00 [error] <0.4138.0> {inet_error,etimedout}
2022-08-03 13:00:40.892350+00:00 [error] <0.6911.0> closing AMQP connection <0.6911.0> (172.17.1.143:44250 -> 172.17.1.109:5672 - neutron-l3-agent:26452:86ba24a9-f5da-48b9-a302-2ae6ef0852fe):
2022-08-03 13:00:40.892350+00:00 [error] <0.6911.0> {inet_error,etimedout}
2022-08-03 13:00:41.403332+00:00 [error] <0.20517.0> closing AMQP connection <0.20517.0> (172.17.1.143:48842 -> 172.17.1.109:5672 - neutron-metadata-agent:26706:a3f34642-4b49-4d02-b9b1-3f8ccfdc3922):
2022-08-03 13:00:41.403332+00:00 [error] <0.20517.0> {inet_error,etimedout}
2022-08-03 13:00:41.658890+00:00 [info] <0.23334.0> Connection <0.23334.0> (172.17.1.143:42594 -> 172.17.1.109:5672) has a client-provided name: neutron-openvswitch-agent:26781:e79ddbde-b7cd-41db-98d0-b43e49864df2
2022-08-03 13:00:41.660419+00:00 [info] <0.23334.0> connection <0.23334.0> (172.17.1.143:42594 -> 172.17.1.109:5672 - neutron-openvswitch-agent:26781:e79ddbde-b7cd-41db-98d0-b43e49864df2): user 'guest' authenticated and granted access to vhost '/'
...
2022-08-03 13:26:48.892258+00:00 [error] <0.27201.0> {inet_error,etimedout}
2022-08-03 13:26:49.918467+00:00 [info] <0.2547.1> Connection <0.2547.1> (172.17.1.90:52052 -> 172.17.1.109:5672) has a client-provided name: neutron-openvswitch-agent:26647:fe8ef903-f9ed-46d5-8de6-cae93d88c331
2022-08-03 13:26:49.919779+00:00 [info] <0.2547.1> connection <0.2547.1> (172.17.1.90:52052 -> 172.17.1.109:5672 - neutron-openvswitch-agent:26647:fe8ef903-f9ed-46d5-8de6-cae93d88c331): user 'guest' authenticated and granted access to vhost '/'
2022-08-03 13:26:50.876213+00:00 [error] <0.18935.0> closing AMQP connection <0.18935.0> (172.17.1.90:33558 -> 172.17.1.109:5672 - neutron-openvswitch-agent:26647:a1dd2c98-d532-4604-9ae7-4b4b88de4dae):
2022-08-03 13:26:50.876213+00:00 [error] <0.18935.0> {inet_error,etimedout}
2022-08-03 13:26:51.042504+00:00 [info] <0.2578.1> Connection <0.2578.1> (172.17.1.90:52056 -> 172.17.1.109:5672) has a client-provided name: neutron-openvswitch-agent:26647:a1dd2c98-d532-4604-9ae7-4b4b88de4dae
2022-08-03 13:26:51.043781+00:00 [info] <0.2578.1> connection <0.2578.1> (172.17.1.90:52056 -> 172.17.1.109:5672 - neutron-openvswitch-agent:26647:a1dd2c98-d532-4604-9ae7-4b4b88de4dae): user 'guest' authenticated and granted access to vhost '/'
2022-08-03 13:26:51.451337+00:00 [error] <0.27173.0> closing AMQP connection <0.27173.0> (172.17.1.90:35914 -> 172.17.1.109:5672 - neutron-openvswitch-agent:26647:64849965-617a-4f21-bddf-7cee0cec5b10):
2022-08-03 13:26:51.451337+00:00 [error] <0.27173.0> {inet_error,etimedout}
2022-08-03 13:26:52.466259+00:00 [info] <0.2605.1> Connection <0.2605.1> (172.17.1.90:48756 -> 172.17.1.109:5672) has a client-provided name: neutron-openvswitch-agent:26647:64849965-617a-4f21-bddf-7cee0cec5b10
2022-08-03 13:26:52.469542+00:00 [info] <0.2605.1> connection <0.2605.1> (172.17.1.90:48756 -> 172.17.1.109:5672 - neutron-openvswitch-agent:26647:64849965-617a-4f21-bddf-7cee0cec5b10): user 'guest' authenticated and granted access to vhost '/'
2022-08-03 13:28:01.596297+00:00 [error] <0.8143.0> closing AMQP connection <0.8143.0> (172.17.1.143:52420 -> 172.17.1.109:5672 - nova-compute:2:7d6b1e9f-628d-47a9-ad57-18bb675acd75):
2022-08-03 13:28:01.596297+00:00 [error] <0.8143.0> {inet_error,etimedout}
2022-08-03 13:28:02.108287+00:00 [error] <0.30826.0> closing AMQP connection <0.30826.0> (172.17.1.143:47760 -> 172.17.1.109:5672 - neutron-metadata-agent:26461:7c418c56-7476-403b-b99a-d99fef2e868c):
2022-08-03 13:28:02.108287+00:00 [error] <0.30826.0> {inet_error,etimedout}
2022-08-03 13:28:02.491343+00:00 [error] <0.23254.0> closing AMQP connection <0.23254.0> (172.17.1.143:43612 -> 172.17.1.109:5672 - neutron-metadata-agent:26708:bb6c9d43-4cb0-475b-9667-c8c002b432bf):
2022-08-03 13:28:02.491343+00:00 [error] <0.23254.0> {inet_error,etimedout}

>>> one example of connection closed:

2022-08-03 13:28:02.619572+00:00 [error] <0.30888.0> closing AMQP connection <0.30888.0> (172.17.1.143:47790 -> 172.17.1.109:5672 - neutron-openvswitch-agent:26781:71c53821-66b8-4a7d-896e-95b67b07380f):
2022-08-03 13:28:02.619572+00:00 [error] <0.30888.0> {inet_error,etimedout}
2022-08-03 13:28:03.131383+00:00 [error] <0.24262.0> closing AMQP connection <0.24262.0> (172.17.1.143:32794 -> 172.17.1.109:5672 - neutron-l3-agent:26452:263c9cbc-ac0e-4f8e-acf6-622fa977b967):
2022-08-03 13:28:03.131383+00:00 [error] <0.24262.0> {inet_error,etimedout}

>>> immediate reconnection:

2022-08-03 13:28:03.634413+00:00 [info] <0.3467.1> Connection <0.3467.1> (172.17.1.143:35472 -> 172.17.1.109:5672) has a client-provided name: neutron-openvswitch-agent:26781:71c53821-66b8-4a7d-896e-95b67b07380f
2022-08-03 13:28:03.635569+00:00 [info] <0.3467.1> connection <0.3467.1> (172.17.1.143:35472 -> 172.17.1.109:5672 - neutron-openvswitch-agent:26781:71c53821-66b8-4a7d-896e-95b67b07380f): user 'guest' authenticated and granted access to vhost '/'


2022-08-03 13:28:03.708322+00:00 [error] <0.30857.0> closing AMQP connection <0.30857.0> (172.17.1.143:47776 -> 172.17.1.109:5672 - neutron-l3-agent:26452:86ba24a9-f5da-48b9-a302-2ae6ef0852fe):
2022-08-03 13:28:03.708322+00:00 [error] <0.30857.0> {inet_error,etimedout}
2022-08-03 13:28:03.708322+00:00 [error] <0.30854.0> closing AMQP connection <0.30854.0> (172.17.1.143:47772 -> 172.17.1.109:5672 - neutron-openvswitch-agent:26781:e79ddbde-b7cd-41db-98d0-b43e49864df2):
2022-08-03 13:28:03.708322+00:00 [error] <0.30854.0> {inet_error,etimedout}
2022-08-03 13:28:03.900330+00:00 [error] <0.30932.0> closing AMQP connection <0.30932.0> (172.17.1.143:47812 -> 172.17.1.109:5672 - neutron-metadata-agent:26706:a3f34642-4b49-4d02-b9b1-3f8ccfdc3922):
2022-08-03 13:28:03.900330+00:00 [error] <0.30932.0> {inet_error,etimedout}
2022-08-03 13:28:04.603328+00:00 [error] <0.24210.0> closing AMQP connection <0.24210.0> (172.17.1.143:57096 -> 172.17.1.109:5672 - nova-compute:2:44256c4d-6bfa-4205-a5e3-851d336a021b):
2022-08-03 13:28:04.603328+00:00 [error] <0.24210.0> {inet_error,etimedout}
2022-08-03 13:28:04.723969+00:00 [info] <0.3512.1> Connection <0.3512.1> (172.17.1.143:35494 -> 172.17.1.109:5672) has a client-provided name: neutron-openvswitch-agent:26781:e79ddbde-b7cd-41db-98d0-b43e49864df2
2022-08-03 13:28:04.725574+00:00 [info] <0.3512.1> connection <0.3512.1> (172.17.1.143:35494 -> 172.17.1.109:5672 - neutron-openvswitch-agent:26781:e79ddbde-b7cd-41db-98d0-b43e49864df2): user 'guest' authenticated and granted access to vhost '/'
2022-08-03 13:28:52.033672+00:00 [info] <0.3705.1> Connection <0.3705.1> (172.17.1.143:36604 -> 172.17.1.109:5672) has a client-provided name: neutron-openvswitch-agent:26781:2db6dc2a-d61d-414f-ac49-254ae513f6fa
2022-08-03 13:28:52.035379+00:00 [info] <0.3705.1> connection <0.3705.1> (172.17.1.143:36604 -> 172.17.1.109:5672 - neutron-openvswitch-agent:26781:2db6dc2a-d61d-414f-ac49-254ae513f6fa): user 'guest' authenticated and granted access to vhost '/'
2022-08-03 13:30:21.049079+00:00 [info] <0.2578.1> closing AMQP connection <0.2578.1> (172.17.1.90:52056 -> 172.17.1.109:5672 - neutron-openvswitch-agent:26647:a1dd2c98-d532-4604-9ae7-4b4b88de4dae, vhost: '/', user: 'guest')
2022-08-03 13:30:21.071887+00:00 [info] <0.28080.0> closing AMQP connection <0.28080.0> (172.17.1.90:35964 -> 172.17.1.109:5672 - neutron-openvswitch-agent:26647:17b07c8d-fb33-4160-8cca-2bcc7137e056, vhost: '/', user: 'guest')
2022-08-03 13:40:07.355276+00:00 [error] <0.3509.1> closing AMQP connection <0.3509.1> (172.17.1.143:35488 -> 172.17.1.109:5672 - neutron-l3-agent:26452:86ba24a9-f5da-48b9-a302-2ae6ef0852fe):
2022-08-03 13:40:07.355276+00:00 [error] <0.3509.1> {inet_error,etimedout}
2022-08-03 13:40:07.355354+00:00 [error] <0.3512.1> closing AMQP connection <0.3512.1> (172.17.1.143:35494 -> 172.17.1.109:5672 - neutron-openvswitch-agent:26781:e79ddbde-b7cd-41db-98d0-b43e49864df2):
2022-08-03 13:40:07.355354+00:00 [error] <0.3512.1> {inet_error,etimedout}
2022-08-03 13:40:07.547348+00:00 [error] <0.3705.1> closing AMQP connection <0.3705.1> (172.17.1.143:36604 -> 172.17.1.109:5672 - neutron-openvswitch-agent:26781:2db6dc2a-d61d-414f-ac49-254ae513f6fa):
2022-08-03 13:40:07.547348+00:00 [error] <0.3705.1> {inet_error,etimedout}
2022-08-03 13:40:08.373961+00:00 [info] <0.8963.1> Connection <0.8963.1> (172.17.1.143:34654 -> 172.17.1.109:5672) has a client-provided name: neutron-openvswitch-agent:26781:e79ddbde-b7cd-41db-98d0-b43e49864df2
2022-08-03 13:40:08.375336+00:00 [info] <0.8963.1> connection <0.8963.1> (172.17.1.143:34654 -> 172.17.1.109:5672 - neutron-openvswitch-agent:26781:e79ddbde-b7cd-41db-98d0-b43e49864df2): user 'guest' authenticated and granted access to vhost '/'
2022-08-03 13:40:08.637879+00:00 [error] <0.3467.1> closing AMQP connection <0.3467.1> (172.17.1.143:35472 -> 172.17.1.109:5672 - neutron-openvswitch-agent:26781:71c53821-66b8-4a7d-896e-95b67b07380f):
2022-08-03 13:40:08.637879+00:00 [error] <0.3467.1> {inet_error,etimedout}
2022-08-03 13:40:09.595324+00:00 [error] <0.8520.1> closing AMQP connection <0.8520.1> (172.17.1.143:52994 -> 172.17.1.109:5672 - neutron-l3-agent:26452:8615d59d-7e1f-4e71-8409-0504185e2a17):
2022-08-03 13:40:09.595324+00:00 [error] <0.8520.1> {inet_error,etimedout}
2022-08-03 13:40:09.660887+00:00 [info] <0.9005.1> Connection <0.9005.1> (172.17.1.143:34670 -> 172.17.1.109:5672) has a client-provided name: neutron-openvswitch-agent:26781:71c53821-66b8-4a7d-896e-95b67b07380f
2022-08-03 13:40:09.663140+00:00 [info] <0.9005.1> connection <0.9005.1> (172.17.1.143:34670 -> 172.17.1.109:5672 - neutron-openvswitch-agent:26781:71c53821-66b8-4a7d-896e-95b67b07380f): user 'guest' authenticated and granted access to vhost '/'
2022-08-03 13:40:09.979223+00:00 [error] <0.23379.0> closing AMQP connection <0.23379.0> (172.17.1.143:42634 -> 172.17.1.109:5672 - neutron-metadata-agent:26708:e209d173-89da-47fd-8e73-19f9e046191d):
2022-08-03 13:40:09.979223+00:00 [error] <0.23379.0> {inet_error,etimedout}
2022-08-03 13:40:10.556271+00:00 [error] <0.30765.0> closing AMQP connection <0.30765.0> (172.17.1.143:60448 -> 172.17.1.109:5672 - neutron-metadata-agent:26705:48458e3b-8781-47af-839c-fe21de45e46a):
2022-08-03 13:40:10.556271+00:00 [error] <0.30765.0> {inet_error,etimedout}
2022-08-03 13:40:22.157335+00:00 [info] <0.9087.1> Connection <0.9087.1> (172.17.1.143:41520 -> 172.17.1.109:5672) has a client-provided name: neutron-openvswitch-agent:26781:2db6dc2a-d61d-414f-ac49-254ae513f6fa
2022-08-03 13:40:22.159331+00:00 [info] <0.9087.1> connection <0.9087.1> (172.17.1.143:41520 -> 172.17.1.109:5672 - neutron-openvswitch-agent:26781:2db6dc2a-d61d-414f-ac49-254ae513f6fa): user 'guest' authenticated and granted access to vhost '/'


2. after a while, around 13:54, we start seeing "missed heartbeats from client, timeout: 60s" and the agent doesn't recover:


2022-08-03 13:54:08.392048+00:00 [error] <0.8963.1> closing AMQP connection <0.8963.1> (172.17.1.143:34654 -> 172.17.1.109:5672 - neutron-openvswitch-agent:26781:e79ddbde-b7cd-41db-98d0-b43e49864df2):
2022-08-03 13:54:08.392048+00:00 [error] <0.8963.1> missed heartbeats from client, timeout: 60s
2022-08-03 13:54:51.940011+00:00 [error] <0.30823.0> closing AMQP connection <0.30823.0> (172.17.1.143:47758 -> 172.17.1.109:5672 - neutron-openvswitch-agent:26781:9413bb77-2f50-42ea-83b5-a0d443faa59f):
2022-08-03 13:54:51.940011+00:00 [error] <0.30823.0> missed heartbeats from client, timeout: 60s
2022-08-03 13:54:52.646134+00:00 [error] <0.30904.0> closing AMQP connection <0.30904.0> (172.17.1.143:47796 -> 172.17.1.109:5672 - neutron-openvswitch-agent:26781:a934b4df-6751-4260-a84b-c400c78f200b):
2022-08-03 13:54:52.646134+00:00 [error] <0.30904.0> missed heartbeats from client, timeout: 60s
2022-08-03 13:55:09.680409+00:00 [error] <0.9005.1> closing AMQP connection <0.9005.1> (172.17.1.143:34670 -> 172.17.1.109:5672 - neutron-openvswitch-agent:26781:71c53821-66b8-4a7d-896e-95b67b07380f):


WRT Rodolfo's comment -> "heartbeat_in_pthread = true" is the new default in 17, so these seem to be legitimate.
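
For reference, the option in question as it would appear in a service's config; the t-h-t fix linked above pins it back to false:

[oslo_messaging_rabbit]
# true became the default in 17; the linked fix disables it again
heartbeat_in_pthread = false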


I think we need to follow eck's suggestion and either strace the whole thing, capture a GMR, and/or set up tcpdumps if we manage to reproduce it.

Comment 30 Slawek Kaplonski 2022-08-18 07:17:34 UTC
*** Bug 2080952 has been marked as a duplicate of this bug. ***

Comment 45 Luca Miccini 2022-08-30 11:35:28 UTC
John, Hervé and myself spent quite some time on this yesterday and today.

We were able to reproduce the issue by simply running "tempest run neutron_plugin.tests.scenario.test_multicast.MulticastTestIPv4Common.test_flooding_when_special_groups" and consistently ended up with one VM out of three in ERROR state because it couldn't be deleted in time.
The error was always a timeout waiting for a specific RPC message that was eventually delivered after more than 60s.

Today we tried setting "rpc_response_timeout=120" on the compute nodes, and this seems to be a valid workaround, as we weren't able to get VMs into ERROR state anymore.
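
A sketch of the workaround as applied on the computes (the path is illustrative for a containerized OSP-17 node; the same option goes into the neutron agents' configs):

# e.g. /var/lib/config-data/puppet-generated/nova_libvirt/etc/nova/nova.conf
[DEFAULT]
rpc_response_timeout = 120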

In the logs we've found the following timestamps indicating that nova/neutron needed quite some time to delete the VMs:

#2022-08-30 09:25:21.467 2 WARNING oslo.service.loopingcall [-] Function 'nova.servicegroup.drivers.db.DbDriver._report_state' run outlasted interval by 94.51 sec
#2022-08-30 09:25:21.482 2 INFO nova.compute.manager [-] [instance: 1c60b122-78bb-42e1-afea-6c86224d7a83] Took 108.64 seconds to deallocate network for instance.


Out of curiosity we tried reproducing outside of tempest, just by creating and deleting 4 VMs at a time in a loop, but we were not able to reproduce the issue this way.
We also tried to stop rabbitmq, wait 3 minutes, start rabbitmq, and repeat the create/delete process multiple times, but this did not lead to any VM in ERROR state either.

We think it is fair to conclude that the reported issue, where services could not reconnect to rabbitmq and got stuck, is solved by setting heartbeat_in_pthread=false, so we can probably move this bz to verified.
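
A quick way to sanity-check the deployed value (path is illustrative):

crudini --get \
  /var/lib/config-data/puppet-generated/neutron/etc/neutron/neutron.conf \
  oslo_messaging_rabbit heartbeat_in_pthread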

Regardless, we should keep investigating to understand what this specific tempest test is doing. We should probably have a dedicated BZ for this.

Comment 50 errata-xmlrpc 2022-09-21 12:24:42 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (Release of components for Red Hat OpenStack Platform 17.0 (Wallaby)), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHEA-2022:6543

