Bug 2118728

Summary: Services can be stuck after trying to reconnect to rabbitmq
Product: Red Hat OpenStack
Reporter: Takashi Kajinami <tkajinam>
Component: python-oslo-messaging
Assignee: Hervé Beraud <hberaud>
Status: CLOSED CURRENTRELEASE
QA Contact: pkomarov
Severity: high
Priority: medium
Version: 17.1 (Wallaby)
CC: apevec, astupnik, athomas, bcafarel, fyanac, hberaud, jeckersb, lhh, lmiccini, mburns, mtomaska, pkomarov, ralonsoh, skaplons, tkajinam
Target Milestone: ga
Keywords: TestOnly, Triaged
Target Release: 17.1
Hardware: x86_64
OS: Linux
Fixed In Version: python-oslo-messaging-12.7.3-1.20221212170855.5d6fd1a.el9ost
Doc Type: No Doc Update
Clone Of: 2116311
Last Closed: 2023-08-16 13:40:37 UTC

Description Takashi Kajinami 2022-08-16 14:48:46 UTC
+++ This bug was initially created as a clone of Bug #2116311 +++

+++ This bug was initially created as a clone of Bug #2115383 +++

This was initially found by Fiorella Yanac (fyanac) and reported in https://bugzilla.redhat.com/show_bug.cgi?id=2112909, but because that bug was filed against neutron and focused on the result of the issue rather than the real problem, which could be confusing later, I decided to open a new bug for it.


Description of problem:
We noticed recently that nova-compute and the neutron agents can hang and do nothing after connectivity to rabbitmq is broken and later restored.
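For context, a minimal sketch of the kind of oslo.messaging RPC consumer these services run; the transport URL, topic, server and endpoint names below are made up for illustration. Connectivity to rabbitmq is handled entirely inside oslo.messaging, so when its reconnect path gets stuck the service keeps running but silently stops consuming messages:

    import time

    from oslo_config import cfg
    import oslo_messaging


    class DemoEndpoint(object):
        """Trivial RPC endpoint so the server has something to dispatch."""

        def ping(self, ctxt, arg):
            return {'pong': arg}


    conf = cfg.CONF
    conf([])  # no CLI/config-file options needed for this sketch

    # hypothetical transport URL and target, just for illustration
    transport = oslo_messaging.get_rpc_transport(
        conf, url='rabbit://guest:guest@controller:5672/')
    target = oslo_messaging.Target(topic='demo_topic', server='demo_host')
    server = oslo_messaging.get_rpc_server(transport, target, [DemoEndpoint()])

    server.start()
    try:
        while True:
            time.sleep(1)
    except KeyboardInterrupt:
        server.stop()
        server.wait()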


Version-Release number of selected component (if applicable):
OSP-17.0
python3-oslo-messaging-12.7.3-0.20220430102742.5d6fd1a.el9ost.noarch


How reproducible:
Very often. Almost every time I restarted the rabbitmq cluster, some services were stuck.


Steps to Reproduce:
1. Stop the rabbitmq cluster and wait a few seconds until nova-compute and the neutron agents log that they can't connect to rabbitmq.
2. Start the rabbitmq cluster.
3. Wait some time and check whether any agents are reported DOWN in the API. If a service is DOWN, you can see on its node that it is not logging anything at all.
When you run strace on such a process, it is stuck on something like:

strace: Process 411673 attached
futex(0x7f86d000a7e0, FUTEX_WAIT_BITSET_PRIVATE|FUTEX_CLOCK_REALTIME, 0, NULL, FUTEX_BITSET_MATCH_ANY

However, the container will still be running and will be reported as "healthy" by podman.
Restarting the container fixes the problem.
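That futex wait is what a blocked CPython lock acquisition looks like from the outside, which suggests the process is waiting on an internal lock rather than crashed. A toy example (not oslo.messaging code) that produces a very similar strace signature:

    import os
    import threading

    lock = threading.Lock()
    lock.acquire()   # held and never released

    print("PID %d blocked forever; attach with: strace -p %d"
          % (os.getpid(), os.getpid()))
    lock.acquire()   # second acquire blocks on a futex, much like the stuck agent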

Comment 7 Lon Hohberger 2023-08-16 10:34:31 UTC
According to our records, this should be resolved by python-oslo-messaging-12.7.3-1.20221212170855.5d6fd1a.el9ost.  This build is available now.