Bug 2115383

Summary: Services can be stucked after trying to reconnect to rabbitmq
Product: Red Hat OpenStack Reporter: Slawek Kaplonski <skaplons>
Component: openstack-tripleo-heat-templatesAssignee: Takashi Kajinami <tkajinam>
Status: CLOSED ERRATA QA Contact: dabarzil
Severity: high Docs Contact:
Priority: medium    
Version: 17.0 (Wallaby)CC: apevec, athomas, bcafarel, bdobreli, bmv, chjones, ekuris, fyanac, jeckersb, jpretori, jschluet, kmehta, lhh, lmiccini, mbollo, mburns, mtomaska, pgrist, ralonsoh, spower, tfreger, tkajinam
Target Milestone: gaKeywords: Triaged
Target Release: 17.0   
Hardware: x86_64   
OS: Linux   
Whiteboard:
Fixed In Version: openstack-tripleo-heat-templates-14.3.1-0.20220719171720.feca772.el9ost puppet-swift-18.5.1-0.20220307041424.1d4dab1.el9ost Doc Type: If docs needed, set a value
Doc Text:
Story Points: ---
Clone Of:
: 2116311 (view as bug list) Environment:
Last Closed: 2022-09-21 12:24:42 UTC Type: Bug
Regression: --- Mount Type: ---
Documentation: --- CRM:
Verified Versions: Category: ---
oVirt Team: --- RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: --- Target Upstream Version:
Embargoed:

Description Slawek Kaplonski 2022-08-04 14:51:54 UTC
This was initially found by Fiorella Yanac (fyanac) and reported in  https://bugzilla.redhat.com/show_bug.cgi?id=2112909 but as it was against neutron and was focused on the result of the issue, not the real problem and could be confusing later I decided to open new bug for it.


Description of problem:
We noticed recently that nova-compute and neutron agents can hangs and do nothing when connectivity to rabbitmq will be broken and later restored.


Version-Release number of selected component (if applicable):
OSP-17.0
python3-oslo-messaging-12.7.3-0.20220430102742.5d6fd1a.el9ost.noarch


How reproducible:
Very often. Almost every time when I restarted rabbitmq cluster, some services were stucked.


Steps to Reproduce:
1. Stop rabbitmq cluster and wait few seconds so nova-compute and neutron agents have in logs info that they can't connect to rabbitmq
2. Start rabbitmq cluster,
3. Wait some time and check if there are any agents DOWN in the API. If there is some service DOWN, You can check that on the node that it is not logging anythin at all.
When You run strace on such process it will be stucked on something like:

strace: Process 411673 attached
futex(0x7f86d000a7e0, FUTEX_WAIT_BITSET_PRIVATE|FUTEX_CLOCK_REALTIME, 0, NULL, FUTEX_BITSET_MATCH_ANY

However container will be running and will be seen as "healthy" in podman.
Restart of the container fixes the problem.

Comment 1 Slawek Kaplonski 2022-08-04 14:55:13 UTC
*** Bug 2112909 has been marked as a duplicate of this bug. ***

Comment 2 John Eckersberg 2022-08-04 16:03:52 UTC
Being stuck on futex is normal, there are multiple worker threads and I'm guessing what you are seeing is the main thread blocking until the threads synchronize at service shutdown.  At the very least, use '-f' with strace so you can see all of the threads and you should see other stuff happening.

Most helpful would be to gather a guru meditation report (https://docs.openstack.org/nova/yoga/reference/gmr.html) which will provide stack traces for all of the running greenthreads.  It's been a while since I've done this but I believe the GMR output will end up either in the container output ("podman logs") or in the journal somewhere.  It will not be in the nova/neutron log file(s).

Comment 6 Luca Miccini 2022-08-05 06:21:14 UTC
Far from being an expert in these things but here is my shot at trying to understand what happened.

Rabbtimq was last restarted around 12:XX and these logs are from 13:00 onward, so I'd exclude an issue with the rabbitmq server itself.

On all three controllers rabbitmq logs I see the same pattern (I've taken openvswitch-agent as a reference).

1. connection to neutron-openvswitch-agent gets cut from the server side because of "{inet_error,etimedout}" and immediately recovers:

2022-08-03 13:00:40.635748+00:00 [error] <0.4138.0> closing AMQP connection <0.4138.0> (172.17.1.143:35958 -> 172.17.1.109:5672 - neutron-openvswitch-agent:26781:e79ddbde-b7cd-41db-98d0-b43e49864df2):
2022-08-03 13:00:40.635748+00:00 [error] <0.4138.0> {inet_error,etimedout}
2022-08-03 13:00:40.892350+00:00 [error] <0.6911.0> closing AMQP connection <0.6911.0> (172.17.1.143:44250 -> 172.17.1.109:5672 - neutron-l3-agent:26452:86ba24a9-f5da-48b9-a302-2ae6ef0852fe):
2022-08-03 13:00:40.892350+00:00 [error] <0.6911.0> {inet_error,etimedout}
2022-08-03 13:00:41.403332+00:00 [error] <0.20517.0> closing AMQP connection <0.20517.0> (172.17.1.143:48842 -> 172.17.1.109:5672 - neutron-metadata-agent:26706:a3f34642-4b49-4d02-b9b1-3f8ccfdc3922):
2022-08-03 13:00:41.403332+00:00 [error] <0.20517.0> {inet_error,etimedout}
2022-08-03 13:00:41.658890+00:00 [info] <0.23334.0> Connection <0.23334.0> (172.17.1.143:42594 -> 172.17.1.109:5672) has a client-provided name: neutron-openvswitch-agent:26781:e79ddbde-b7cd-41db-98d0-b43e49864df2
2022-08-03 13:00:41.660419+00:00 [info] <0.23334.0> connection <0.23334.0> (172.17.1.143:42594 -> 172.17.1.109:5672 - neutron-openvswitch-agent:26781:e79ddbde-b7cd-41db-98d0-b43e49864df2): user 'guest' authenticated and granted access to vhost '/'
...
2022-08-03 13:26:48.892258+00:00 [error] <0.27201.0> {inet_error,etimedout}
2022-08-03 13:26:49.918467+00:00 [info] <0.2547.1> Connection <0.2547.1> (172.17.1.90:52052 -> 172.17.1.109:5672) has a client-provided name: neutron-openvswitch-agent:26647:fe8ef903-f9ed-46d5-8de6-cae93d88c331
2022-08-03 13:26:49.919779+00:00 [info] <0.2547.1> connection <0.2547.1> (172.17.1.90:52052 -> 172.17.1.109:5672 - neutron-openvswitch-agent:26647:fe8ef903-f9ed-46d5-8de6-cae93d88c331): user 'guest' authenticated and granted access to vhost '/'
2022-08-03 13:26:50.876213+00:00 [error] <0.18935.0> closing AMQP connection <0.18935.0> (172.17.1.90:33558 -> 172.17.1.109:5672 - neutron-openvswitch-agent:26647:a1dd2c98-d532-4604-9ae7-4b4b88de4dae):
2022-08-03 13:26:50.876213+00:00 [error] <0.18935.0> {inet_error,etimedout}
2022-08-03 13:26:51.042504+00:00 [info] <0.2578.1> Connection <0.2578.1> (172.17.1.90:52056 -> 172.17.1.109:5672) has a client-provided name: neutron-openvswitch-agent:26647:a1dd2c98-d532-4604-9ae7-4b4b88de4dae
2022-08-03 13:26:51.043781+00:00 [info] <0.2578.1> connection <0.2578.1> (172.17.1.90:52056 -> 172.17.1.109:5672 - neutron-openvswitch-agent:26647:a1dd2c98-d532-4604-9ae7-4b4b88de4dae): user 'guest' authenticated and granted access to vhost '/'
2022-08-03 13:26:51.451337+00:00 [error] <0.27173.0> closing AMQP connection <0.27173.0> (172.17.1.90:35914 -> 172.17.1.109:5672 - neutron-openvswitch-agent:26647:64849965-617a-4f21-bddf-7cee0cec5b10):
2022-08-03 13:26:51.451337+00:00 [error] <0.27173.0> {inet_error,etimedout}
2022-08-03 13:26:52.466259+00:00 [info] <0.2605.1> Connection <0.2605.1> (172.17.1.90:48756 -> 172.17.1.109:5672) has a client-provided name: neutron-openvswitch-agent:26647:64849965-617a-4f21-bddf-7cee0cec5b10
2022-08-03 13:26:52.469542+00:00 [info] <0.2605.1> connection <0.2605.1> (172.17.1.90:48756 -> 172.17.1.109:5672 - neutron-openvswitch-agent:26647:64849965-617a-4f21-bddf-7cee0cec5b10): user 'guest' authenticated and granted access to vhost '/'
2022-08-03 13:28:01.596297+00:00 [error] <0.8143.0> closing AMQP connection <0.8143.0> (172.17.1.143:52420 -> 172.17.1.109:5672 - nova-compute:2:7d6b1e9f-628d-47a9-ad57-18bb675acd75):
2022-08-03 13:28:01.596297+00:00 [error] <0.8143.0> {inet_error,etimedout}
2022-08-03 13:28:02.108287+00:00 [error] <0.30826.0> closing AMQP connection <0.30826.0> (172.17.1.143:47760 -> 172.17.1.109:5672 - neutron-metadata-agent:26461:7c418c56-7476-403b-b99a-d99fef2e868c):
2022-08-03 13:28:02.108287+00:00 [error] <0.30826.0> {inet_error,etimedout}
2022-08-03 13:28:02.491343+00:00 [error] <0.23254.0> closing AMQP connection <0.23254.0> (172.17.1.143:43612 -> 172.17.1.109:5672 - neutron-metadata-agent:26708:bb6c9d43-4cb0-475b-9667-c8c002b432bf):
2022-08-03 13:28:02.491343+00:00 [error] <0.23254.0> {inet_error,etimedout}

>>> one example of connection closed:

2022-08-03 13:28:02.619572+00:00 [error] <0.30888.0> closing AMQP connection <0.30888.0> (172.17.1.143:47790 -> 172.17.1.109:5672 - neutron-openvswitch-agent:26781:71c53821-66b8-4a7d-896e-95b67b07380f):
2022-08-03 13:28:02.619572+00:00 [error] <0.30888.0> {inet_error,etimedout}
2022-08-03 13:28:03.131383+00:00 [error] <0.24262.0> closing AMQP connection <0.24262.0> (172.17.1.143:32794 -> 172.17.1.109:5672 - neutron-l3-agent:26452:263c9cbc-ac0e-4f8e-acf6-622fa977b967):
2022-08-03 13:28:03.131383+00:00 [error] <0.24262.0> {inet_error,etimedout}

>>> immediate reconnection:

2022-08-03 13:28:03.634413+00:00 [info] <0.3467.1> Connection <0.3467.1> (172.17.1.143:35472 -> 172.17.1.109:5672) has a client-provided name: neutron-openvswitch-agent:26781:71c53821-66b8-4a7d-896e-95b67b07380f
2022-08-03 13:28:03.635569+00:00 [info] <0.3467.1> connection <0.3467.1> (172.17.1.143:35472 -> 172.17.1.109:5672 - neutron-openvswitch-agent:26781:71c53821-66b8-4a7d-896e-95b67b07380f): user 'guest' authenticated and granted access to vhost '/'


2022-08-03 13:28:03.708322+00:00 [error] <0.30857.0> closing AMQP connection <0.30857.0> (172.17.1.143:47776 -> 172.17.1.109:5672 - neutron-l3-agent:26452:86ba24a9-f5da-48b9-a302-2ae6ef0852fe):
2022-08-03 13:28:03.708322+00:00 [error] <0.30857.0> {inet_error,etimedout}
2022-08-03 13:28:03.708322+00:00 [error] <0.30854.0> closing AMQP connection <0.30854.0> (172.17.1.143:47772 -> 172.17.1.109:5672 - neutron-openvswitch-agent:26781:e79ddbde-b7cd-41db-98d0-b43e49864df2):
2022-08-03 13:28:03.708322+00:00 [error] <0.30854.0> {inet_error,etimedout}
2022-08-03 13:28:03.900330+00:00 [error] <0.30932.0> closing AMQP connection <0.30932.0> (172.17.1.143:47812 -> 172.17.1.109:5672 - neutron-metadata-agent:26706:a3f34642-4b49-4d02-b9b1-3f8ccfdc3922):
2022-08-03 13:28:03.900330+00:00 [error] <0.30932.0> {inet_error,etimedout}
2022-08-03 13:28:04.603328+00:00 [error] <0.24210.0> closing AMQP connection <0.24210.0> (172.17.1.143:57096 -> 172.17.1.109:5672 - nova-compute:2:44256c4d-6bfa-4205-a5e3-851d336a021b):
2022-08-03 13:28:04.603328+00:00 [error] <0.24210.0> {inet_error,etimedout}
2022-08-03 13:28:04.723969+00:00 [info] <0.3512.1> Connection <0.3512.1> (172.17.1.143:35494 -> 172.17.1.109:5672) has a client-provided name: neutron-openvswitch-agent:26781:e79ddbde-b7cd-41db-98d0-b43e49864df2
2022-08-03 13:28:04.725574+00:00 [info] <0.3512.1> connection <0.3512.1> (172.17.1.143:35494 -> 172.17.1.109:5672 - neutron-openvswitch-agent:26781:e79ddbde-b7cd-41db-98d0-b43e49864df2): user 'guest' authenticated and granted access to vhost '/'
2022-08-03 13:28:52.033672+00:00 [info] <0.3705.1> Connection <0.3705.1> (172.17.1.143:36604 -> 172.17.1.109:5672) has a client-provided name: neutron-openvswitch-agent:26781:2db6dc2a-d61d-414f-ac49-254ae513f6fa
2022-08-03 13:28:52.035379+00:00 [info] <0.3705.1> connection <0.3705.1> (172.17.1.143:36604 -> 172.17.1.109:5672 - neutron-openvswitch-agent:26781:2db6dc2a-d61d-414f-ac49-254ae513f6fa): user 'guest' authenticated and granted access to vhost '/'
2022-08-03 13:30:21.049079+00:00 [info] <0.2578.1> closing AMQP connection <0.2578.1> (172.17.1.90:52056 -> 172.17.1.109:5672 - neutron-openvswitch-agent:26647:a1dd2c98-d532-4604-9ae7-4b4b88de4dae, vhost: '/', user: 'guest')
2022-08-03 13:30:21.071887+00:00 [info] <0.28080.0> closing AMQP connection <0.28080.0> (172.17.1.90:35964 -> 172.17.1.109:5672 - neutron-openvswitch-agent:26647:17b07c8d-fb33-4160-8cca-2bcc7137e056, vhost: '/', user: 'guest')
2022-08-03 13:40:07.355276+00:00 [error] <0.3509.1> closing AMQP connection <0.3509.1> (172.17.1.143:35488 -> 172.17.1.109:5672 - neutron-l3-agent:26452:86ba24a9-f5da-48b9-a302-2ae6ef0852fe):
2022-08-03 13:40:07.355276+00:00 [error] <0.3509.1> {inet_error,etimedout}
2022-08-03 13:40:07.355354+00:00 [error] <0.3512.1> closing AMQP connection <0.3512.1> (172.17.1.143:35494 -> 172.17.1.109:5672 - neutron-openvswitch-agent:26781:e79ddbde-b7cd-41db-98d0-b43e49864df2):
2022-08-03 13:40:07.355354+00:00 [error] <0.3512.1> {inet_error,etimedout}
2022-08-03 13:40:07.547348+00:00 [error] <0.3705.1> closing AMQP connection <0.3705.1> (172.17.1.143:36604 -> 172.17.1.109:5672 - neutron-openvswitch-agent:26781:2db6dc2a-d61d-414f-ac49-254ae513f6fa):
2022-08-03 13:40:07.547348+00:00 [error] <0.3705.1> {inet_error,etimedout}
2022-08-03 13:40:08.373961+00:00 [info] <0.8963.1> Connection <0.8963.1> (172.17.1.143:34654 -> 172.17.1.109:5672) has a client-provided name: neutron-openvswitch-agent:26781:e79ddbde-b7cd-41db-98d0-b43e49864df2
2022-08-03 13:40:08.375336+00:00 [info] <0.8963.1> connection <0.8963.1> (172.17.1.143:34654 -> 172.17.1.109:5672 - neutron-openvswitch-agent:26781:e79ddbde-b7cd-41db-98d0-b43e49864df2): user 'guest' authenticated and granted access to vhost '/'
2022-08-03 13:40:08.637879+00:00 [error] <0.3467.1> closing AMQP connection <0.3467.1> (172.17.1.143:35472 -> 172.17.1.109:5672 - neutron-openvswitch-agent:26781:71c53821-66b8-4a7d-896e-95b67b07380f):
2022-08-03 13:40:08.637879+00:00 [error] <0.3467.1> {inet_error,etimedout}
2022-08-03 13:40:09.595324+00:00 [error] <0.8520.1> closing AMQP connection <0.8520.1> (172.17.1.143:52994 -> 172.17.1.109:5672 - neutron-l3-agent:26452:8615d59d-7e1f-4e71-8409-0504185e2a17):
2022-08-03 13:40:09.595324+00:00 [error] <0.8520.1> {inet_error,etimedout}
2022-08-03 13:40:09.660887+00:00 [info] <0.9005.1> Connection <0.9005.1> (172.17.1.143:34670 -> 172.17.1.109:5672) has a client-provided name: neutron-openvswitch-agent:26781:71c53821-66b8-4a7d-896e-95b67b07380f
2022-08-03 13:40:09.663140+00:00 [info] <0.9005.1> connection <0.9005.1> (172.17.1.143:34670 -> 172.17.1.109:5672 - neutron-openvswitch-agent:26781:71c53821-66b8-4a7d-896e-95b67b07380f): user 'guest' authenticated and granted access to vhost '/'
2022-08-03 13:40:09.979223+00:00 [error] <0.23379.0> closing AMQP connection <0.23379.0> (172.17.1.143:42634 -> 172.17.1.109:5672 - neutron-metadata-agent:26708:e209d173-89da-47fd-8e73-19f9e046191d):
2022-08-03 13:40:09.979223+00:00 [error] <0.23379.0> {inet_error,etimedout}
2022-08-03 13:40:10.556271+00:00 [error] <0.30765.0> closing AMQP connection <0.30765.0> (172.17.1.143:60448 -> 172.17.1.109:5672 - neutron-metadata-agent:26705:48458e3b-8781-47af-839c-fe21de45e46a):
2022-08-03 13:40:10.556271+00:00 [error] <0.30765.0> {inet_error,etimedout}
2022-08-03 13:40:22.157335+00:00 [info] <0.9087.1> Connection <0.9087.1> (172.17.1.143:41520 -> 172.17.1.109:5672) has a client-provided name: neutron-openvswitch-agent:26781:2db6dc2a-d61d-414f-ac49-254ae513f6fa
2022-08-03 13:40:22.159331+00:00 [info] <0.9087.1> connection <0.9087.1> (172.17.1.143:41520 -> 172.17.1.109:5672 - neutron-openvswitch-agent:26781:2db6dc2a-d61d-414f-ac49-254ae513f6fa): user 'guest' authenticated and granted access to vhost '/'


2. after a while, around 13:54 we start seeing "missed heartbeats from client, timeout: 60s" and the agent doesn't recover:


2022-08-03 13:54:08.392048+00:00 [error] <0.8963.1> closing AMQP connection <0.8963.1> (172.17.1.143:34654 -> 172.17.1.109:5672 - neutron-openvswitch-agent:26781:e79ddbde-b7cd-41db-98d0-b43e49864df2):
2022-08-03 13:54:08.392048+00:00 [error] <0.8963.1> missed heartbeats from client, timeout: 60s
2022-08-03 13:54:51.940011+00:00 [error] <0.30823.0> closing AMQP connection <0.30823.0> (172.17.1.143:47758 -> 172.17.1.109:5672 - neutron-openvswitch-agent:26781:9413bb77-2f50-42ea-83b5-a0d443faa59f):
2022-08-03 13:54:51.940011+00:00 [error] <0.30823.0> missed heartbeats from client, timeout: 60s
2022-08-03 13:54:52.646134+00:00 [error] <0.30904.0> closing AMQP connection <0.30904.0> (172.17.1.143:47796 -> 172.17.1.109:5672 - neutron-openvswitch-agent:26781:a934b4df-6751-4260-a84b-c400c78f200b):
2022-08-03 13:54:52.646134+00:00 [error] <0.30904.0> missed heartbeats from client, timeout: 60s
2022-08-03 13:55:09.680409+00:00 [error] <0.9005.1> closing AMQP connection <0.9005.1> (172.17.1.143:34670 -> 172.17.1.109:5672 - neutron-openvswitch-agent:26781:71c53821-66b8-4a7d-896e-95b67b07380f):


WRT Rodolfo's comment ->  #heartbeat_in_pthread = true this is the new default in 17 so these seem to be legitimate.


I think we need to follow eck' suggestion and either strace the whole thing, capture a GMR and/or set up tcpdumps if we manage to reproduce it.

Comment 30 Slawek Kaplonski 2022-08-18 07:17:34 UTC
*** Bug 2080952 has been marked as a duplicate of this bug. ***

Comment 45 Luca Miccini 2022-08-30 11:35:28 UTC
John, Hervé and myself spent quite some time on this yesterday and today.

We were able to reproduce the issue by simply running "tempest run neutron_plugin.tests.scenario.test_multicast.MulticastTestIPv4Common.test_flooding_when_special_groups" and consistently end up with one vm out of three in ERROR state because it couldn't be deleted in time. 
The error was always a timeout waiting for a specific rpc message that was delivered eventually after more than 60s.

Today we've tried setting "rpc_response_timeout=120" on the compute nodes and this seems to be a valid workaround as we weren't able to get VMs in error state anymore.

In the logs we've found the following timestamps indicating that nova/neutron needed quite some time to delete the VMs:

#2022-08-30 09:25:21.467 2 WARNING oslo.service.loopingcall [-] Function 'nova.servicegroup.drivers.db.DbDriver._report_state' run outlasted interval by 94.51 sec
#2022-08-30 09:25:21.482 2 INFO nova.compute.manager [-] [instance: 1c60b122-78bb-42e1-afea-6c86224d7a83] Took 108.64 seconds to deallocate network for instance.


Out of curiosity we've tried reproducing outside of tempest just by creating and deleting 4 vms at the time in a loop but we were not able to reproduce the issue this way.
We've also tried to stop rabbitmq, wait 3 minutes, start rabbitmq and repeat the create/delete process multiple times but this did not lead to any VM in error state either.

We think it is fair to conclude that the reported issue for which "service could not reconnect to rabbitmq / get stuck" is solved by setting heartbeat_in_pthread=false so we can probably this bz to verified.

Regardless, we should keep investigating to understand what this specific tempest test is doing. We should probably have a dedicated BZ for this.

Comment 50 errata-xmlrpc 2022-09-21 12:24:42 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (Release of components for Red Hat OpenStack Platform 17.0 (Wallaby)), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHEA-2022:6543