Bug 2118728 - Services can get stuck after trying to reconnect to rabbitmq
Summary: Services can get stuck after trying to reconnect to rabbitmq
Keywords:
Status: CLOSED CURRENTRELEASE
Alias: None
Product: Red Hat OpenStack
Classification: Red Hat
Component: python-oslo-messaging
Version: 17.1 (Wallaby)
Hardware: x86_64
OS: Linux
Priority: medium
Severity: high
Target Milestone: ga
Target Release: 17.1
Assignee: Hervé Beraud
QA Contact: pkomarov
URL:
Whiteboard:
Depends On:
Blocks:
 
Reported: 2022-08-16 14:48 UTC by Takashi Kajinami
Modified: 2023-08-16 13:40 UTC
CC List: 15 users

Fixed In Version: python-oslo-messaging-12.7.3-1.20221212170855.5d6fd1a.el9ost
Doc Type: No Doc Update
Doc Text:
Clone Of: 2116311
Environment:
Last Closed: 2023-08-16 13:40:37 UTC
Target Upstream Version:
Embargoed:


Attachments: none


Links
System                 ID         Private  Priority  Status  Summary                                                   Last Updated
Launchpad              1961402    0        None      None    None                                                      2022-08-16 14:52:52 UTC
OpenStack gerrit       852251     0        None      MERGED  Change default value of "heartbeat_in_pthread" to False  2023-08-10 13:35:49 UTC
Red Hat Issue Tracker  OSP-18219  0        None      None    None                                                      2022-08-16 14:59:50 UTC

Description Takashi Kajinami 2022-08-16 14:48:46 UTC
+++ This bug was initially created as a clone of Bug #2116311 +++

+++ This bug was initially created as a clone of Bug #2115383 +++

This was initially found by Fiorella Yanac (fyanac) and reported in https://bugzilla.redhat.com/show_bug.cgi?id=2112909, but since that bug was filed against neutron and focused on the result of the issue rather than the underlying problem, which could be confusing later, I decided to open a new bug for it.


Description of problem:
We noticed recently that nova-compute and the neutron agents can hang and do nothing after connectivity to rabbitmq is broken and later restored.


Version-Release number of selected component (if applicable):
OSP-17.0
python3-oslo-messaging-12.7.3-0.20220430102742.5d6fd1a.el9ost.noarch


How reproducible:
Very often. Almost every time I restarted the rabbitmq cluster, some services got stuck.


Steps to Reproduce:
1. Stop the rabbitmq cluster and wait a few seconds until nova-compute and the neutron agents log that they cannot connect to rabbitmq.
2. Start the rabbitmq cluster.
3. Wait some time and check whether any agents are reported DOWN in the API. If a service is DOWN, you can confirm on the node that it is not logging anything at all (a minimal sketch of this check follows below).
When you run strace on such a process, it is stuck on something like:

strace: Process 411673 attached
futex(0x7f86d000a7e0, FUTEX_WAIT_BITSET_PRIVATE|FUTEX_CLOCK_REALTIME, 0, NULL, FUTEX_BITSET_MATCH_ANY

However, the container keeps running and is reported as "healthy" by podman.
Restarting the container fixes the problem.
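
For illustration, a minimal Python sketch of the step-3 check; the log paths and the staleness threshold below are illustrative assumptions for a containerized compute node, not part of the report, so adjust them to the deployment:

import os
import time

# Assumed log locations on a containerized compute node; adjust per deployment.
LOG_FILES = {
    'nova-compute': '/var/log/containers/nova/nova-compute.log',
    'neutron-openvswitch-agent': '/var/log/containers/neutron/openvswitch-agent.log',
}
STALE_AFTER = 600  # seconds without new log output before suspecting a wedged service

def stale_services(now=None):
    # Return (service, seconds since last log write) for services that stopped logging.
    now = now or time.time()
    stale = []
    for name, path in LOG_FILES.items():
        try:
            mtime = os.stat(path).st_mtime
        except FileNotFoundError:
            continue
        if now - mtime > STALE_AFTER:
            stale.append((name, int(now - mtime)))
    return stale

if __name__ == '__main__':
    for name, age in stale_services():
        print('%s: no log output for %ds, worth checking with strace' % (name, age))

A process flagged this way can then be confirmed with strace as shown above.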

Comment 7 Lon Hohberger 2023-08-16 10:34:31 UTC
According to our records, this should be resolved by python-oslo-messaging-12.7.3-1.20221212170855.5d6fd1a.el9ost.  This build is available now.
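
The linked gerrit change (852251) flips the default of heartbeat_in_pthread to False. As a rough sketch of how one might verify the effective value in a service's configuration, the option is registered locally here instead of loading the oslo.messaging rabbit driver, and the nova.conf path is only an example:

from oslo_config import cfg

# Local copy of the option so it can be read without importing the rabbit
# driver; the group, name and False default mirror the fixed upstream option.
opts = [cfg.BoolOpt('heartbeat_in_pthread', default=False)]

conf = cfg.ConfigOpts()
conf.register_opts(opts, group='oslo_messaging_rabbit')
conf(['--config-file', '/etc/nova/nova.conf'])

print('heartbeat_in_pthread =', conf.oslo_messaging_rabbit.heartbeat_in_pthread)

With the fixed package the default should already be False, so an explicit setting is only needed where a deployment had overridden it.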

