+++ This bug was initially created as a clone of Bug #2116311 +++
+++ This bug was initially created as a clone of Bug #2115383 +++

This was initially found by Fiorella Yanac (fyanac) and reported in https://bugzilla.redhat.com/show_bug.cgi?id=2112909, but as that bug was filed against neutron and focused on the symptom rather than the real problem, and could be confusing later, I decided to open a new bug for it.

Description of problem:
We noticed recently that nova-compute and the neutron agents can hang and do nothing when connectivity to rabbitmq is broken and later restored.

Version-Release number of selected component (if applicable):
OSP-17.0
python3-oslo-messaging-12.7.3-0.20220430102742.5d6fd1a.el9ost.noarch

How reproducible:
Very often. Almost every time the rabbitmq cluster was restarted, some services were stuck.

Steps to Reproduce:
1. Stop the rabbitmq cluster and wait a few seconds so that nova-compute and the neutron agents log that they can't connect to rabbitmq.
2. Start the rabbitmq cluster.
3. Wait some time and check whether any agents are reported DOWN in the API.

If a service is DOWN, you can verify on the node that it is not logging anything at all. When you run strace on such a process, it is stuck on something like:

strace: Process 411673 attached
futex(0x7f86d000a7e0, FUTEX_WAIT_BITSET_PRIVATE|FUTEX_CLOCK_REALTIME, 0, NULL, FUTEX_BITSET_MATCH_ANY

However, the container keeps running and is reported as "healthy" by podman. Restarting the container fixes the problem.
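For illustration only, below is a minimal diagnostic sketch (not part of the fix) that checks the same thing the strace output above shows, but without attaching to the process: it reads /proc/<pid>/wchan, which names the kernel function the process is sleeping in. The exact function name varies by kernel version (e.g. futex_wait_queue_me vs futex_wait_queue), so it only matches on "futex". Note that a futex wait alone is not proof of a hang (idle threads wait on futexes too); it is only suspicious together with the "not logging anything at all" symptom described above. The script name and helper names are hypothetical.

#!/usr/bin/env python3
# check_futex_hang.py - hypothetical helper, assumes a Linux host with /proc mounted.
import sys
from pathlib import Path

def wchan(pid: int) -> str:
    """Return the kernel function the process is currently sleeping in ("0" or "" if running)."""
    return Path(f"/proc/{pid}/wchan").read_text().strip()

def looks_stuck_on_futex(pid: int) -> bool:
    # Matches on the substring "futex" because the exact wchan symbol differs between kernels.
    return "futex" in wchan(pid)

if __name__ == "__main__":
    pid = int(sys.argv[1])
    if looks_stuck_on_futex(pid):
        print(f"PID {pid} is blocked in {wchan(pid)} - matches the reported hang symptom")
    else:
        print(f"PID {pid} is sleeping in '{wchan(pid)}' (or running)")

Usage would be something like "python3 check_futex_hang.py 411673", run on the compute node against the nova-compute or neutron agent PID after the rabbitmq cluster has been restarted.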
According to our records, this should be resolved by python-oslo-messaging-12.7.3-1.20221212170855.5d6fd1a.el9ost. This build is available now.