Bug 1880416

Summary: rabbitmq network partition
Product: Red Hat OpenStack Reporter: Priscila <pveiga>
Component: rabbitmq-server Assignee: John Eckersberg <jeckersb>
Status: CLOSED DUPLICATE QA Contact: pkomarov
Severity: medium Docs Contact:
Priority: unspecified    
Version: 13.0 (Queens) CC: apevec, jeckersb, lhh
Target Milestone: ---   
Target Release: ---   
Hardware: All   
OS: All   
Whiteboard:
Fixed In Version: Doc Type: If docs needed, set a value
Doc Text:
Story Points: ---
Clone Of: Environment:
Last Closed: 2020-09-18 14:49:59 UTC Type: Bug
Regression: --- Mount Type: ---
Documentation: --- CRM:
Verified Versions: Category: ---
oVirt Team: --- RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: --- Target Upstream Version:
Embargoed:

Description Priscila 2020-09-18 13:25:27 UTC
Description of problem:

RabbitMQ went into a network partition twice [1], at different timestamps. We don't see any dropped packets or error packets on the interface used by the rabbit cluster.

Even after the z11 upgrade, RabbitMQ still goes into network partitions at times.

Partitions seen after the z11 upgrade:

1. Network partition - Aug 18 05:03:33 UTC
2. Network partition - Sep 8 05:51 UTC
 
During the network partition on Sep 8 05:51 UTC, we snapshotted the actual state of RabbitMQ by generating core files with gcore, which does not stop or crash the rabbitmq/Erlang process.
See the core files attached to that case.
We did not send the USR1 signal, as it is not effective (most probably due to the -Bi switch specified on the rabbit-server command line [1]).
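
For reference, the gcore capture described above could be scripted roughly as follows. This is only a sketch under assumptions, not the exact procedure used on the customer system: the function names, the /var/tmp output directory, and the "beam.smp" match pattern are all illustrative, and it assumes gcore (from gdb) and pgrep are installed and that it runs as root on the host that owns the Erlang VM process.

```python
import shlex
import subprocess
from datetime import datetime, timezone


def build_gcore_command(pid, out_dir="/var/tmp"):
    """Return the gcore argv for one PID (hypothetical helper).

    The output directory and file prefix are assumptions for illustration.
    """
    stamp = datetime.now(timezone.utc).strftime("%Y%m%dT%H%M%SZ")
    prefix = f"{out_dir}/rabbit-core-{stamp}"
    # gcore appends ".<pid>" to the prefix passed with -o.
    return ["gcore", "-o", prefix, str(int(pid))]


def find_beam_pids():
    """List PIDs whose command line matches beam.smp (the Erlang VM).

    Assumes the RabbitMQ Erlang VM is the only beam.smp process of interest.
    """
    result = subprocess.run(
        ["pgrep", "-f", "beam.smp"],
        capture_output=True,
        text=True,
        check=False,  # pgrep exits non-zero when nothing matches
    )
    return [int(line) for line in result.stdout.split()]


if __name__ == "__main__":
    for pid in find_beam_pids():
        cmd = build_gcore_command(pid)
        print("running:", " ".join(shlex.quote(c) for c in cmd))
        # gcore attaches, writes the core file, and detaches;
        # the target process keeps running afterwards.
        subprocess.run(cmd, check=True)
```

The UTC timestamp in the file name is there so each core file can be matched against the partition time in the rabbit logs.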

The customer wants to provide more detailed logs and debugging data for this RabbitMQ issue, beyond the logs and sosreports sent previously, which did not really help us move forward.
They are asking for recommendations on what else to record before restarting RabbitMQ.


[1] https://github.com/rabbitmq/rabbitmq-server/issues/1231