Red Hat Bugzilla – Bug 1304423
RabbitMQ fails to start on an IPv6 deployment
Last modified: 2016-04-18 03:01:53 EDT
Description of problem:
# pcs status |grep -A2 rabbit
Clone Set: rabbitmq-clone [rabbitmq]
rabbitmq (ocf::heartbeat:rabbitmq-cluster): FAILED overcloud-controller-0 (unmanaged)
Clone Set: memcached-clone [memcached]
Started: [ overcloud-controller-0 ]
* rabbitmq_stop_0 on overcloud-controller-0 'unknown error' (1): call=882, status=Timed Out, exitreason='none',
last-rc-change='Wed Feb 3 09:38:10 2016', queued=0ms, exec=90004ms
Version-Release number of selected component (if applicable):
director poodle: 2016-01-27.1
core poodle: 2016-02-02.2
How reproducible:
100%, both with 1 controller and with 3 controllers
Steps to Reproduce:
1. Try installing an IPv6 overcloud
This seems to be different from https://bugzilla.redhat.com/show_bug.cgi?id=1301404 as the SELinux error that caused that issue was fixed.
[root@overcloud-controller-0 ~]# ls -ltrhZ /etc/machine-id
-r--r--r--. root root unconfined_u:object_r:machineid_t:s0 /etc/machine-id
I have two dev boxes (one HA and one minimal) that have this error and can be used for triage.
I think I've seen this. In my case, Rabbit actually started correctly during the deployment, and ran long enough for the post-deployment to complete. Shortly after that, Rabbit failed on all 3 controllers, and wouldn't start again on any of them. I was not able to triage the root cause, and on a redeploy I was not getting the same behavior.
*** Bug 1304422 has been marked as a duplicate of this bug. ***
Ok, this is certainly a new, unknown issue, which wasn't addressed in the recent builds. Looking into this.
This is what I've got so far.
* Networking seems slow. Even ssh login takes several seconds.
* I've tried running rabbitmq-server with systemd (just in case, to check that everything was set up correctly). It does work.
* Starting rabbitmq with systemd takes a lot of time:
[root@overcloud-controller-2 ~]# time systemctl start rabbitmq-server
24 seconds is too much. This might cause a timeout in Pacemaker.
* Getting current status with rabbitmqctl takes exactly the same 24 seconds:
[root@overcloud-controller-2 ~]# time rabbitmqctl report
* From rabbitmq logs it seems that opening a tcp6 connection takes it almost 18 seconds:
[root@overcloud-controller-2 ~]# cat /email@example.com
=WARNING REPORT==== 3-Feb-2016::04:30:57 ===
msg_store_persistent: rebuilding indices from scratch
=INFO REPORT==== 3-Feb-2016::04:31:13 ===
started TCP Listener on [FD00:FD00:FD00:2000::15]:5672
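To make gaps like this easier to spot in a longer log, here is a minimal Python sketch that prints the time elapsed between consecutive report headers. It assumes the header format shown in the excerpt above and an English locale for the month abbreviation; `report_gaps` and `SAMPLE` are illustrative names, not part of RabbitMQ.

```python
import re
from datetime import datetime

# Matches headers such as "=INFO REPORT==== 3-Feb-2016::04:31:13 ===".
# Assumption: the log uses exactly the header layout quoted in this bug.
REPORT_RE = re.compile(
    r"^=(\w+) REPORT==== (\d{1,2}-\w{3}-\d{4}::\d{2}:\d{2}:\d{2}) ===$"
)


def report_gaps(log_text):
    """Yield (seconds_since_previous_header, level, timestamp) per header."""
    previous = None
    for line in log_text.splitlines():
        match = REPORT_RE.match(line.strip())
        if not match:
            continue  # message body lines carry no timestamp
        # %b is locale dependent; "Feb" parses under the C/English locale.
        stamp = datetime.strptime(match.group(2), "%d-%b-%Y::%H:%M:%S")
        if previous is not None:
            yield (stamp - previous).total_seconds(), match.group(1), stamp
        previous = stamp


# The two entries quoted in this comment, used as sample input.
SAMPLE = """\
=WARNING REPORT==== 3-Feb-2016::04:30:57 ===
msg_store_persistent: rebuilding indices from scratch
=INFO REPORT==== 3-Feb-2016::04:31:13 ===
started TCP Listener on [FD00:FD00:FD00:2000::15]:5672
"""

for seconds, level, stamp in report_gaps(SAMPLE):
    print("%6.0fs before %s report at %s" % (seconds, level, stamp))
```

The excerpt here is abridged, so the interval between these two headers need not match the figure measured on the full log.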
After Peter's debugging and Dan's suggestion, the issue turned out to be invalid DNS settings.
I had been deploying with 18.104.22.168/22.214.171.124 since the beginning of the IPv6 testing, and they caused no problems until recently. It seems that they are filtered out on the internal network, so name resolution on the overcloud nodes was broken, which resulted in very slow binding for every service.
Since this was the source of a weird, hard-to-debug issue, we should consider adding a validation step that checks for working DNS before we deploy RabbitMQ on the overcloud. I'll open a bug for that.
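As a rough idea of what such a validation step could look like, here is a minimal Python sketch that times a single name lookup and fails when it errors out or is too slow. The function name `check_dns`, the 2 second threshold, and the injectable `resolver` parameter are assumptions made for illustration. This is not the validation that was eventually proposed for the director.

```python
import socket
import time


def check_dns(hostname, max_seconds=2.0, resolver=socket.getaddrinfo):
    """Return (ok, elapsed_seconds, error) for one lookup of hostname.

    The lookup itself is blocking and is not interrupted; the threshold
    is only applied to the measured duration afterwards.
    """
    start = time.monotonic()
    try:
        resolver(hostname, None)
    except socket.gaierror as exc:
        # Resolution failed outright (e.g. unreachable or filtered DNS).
        return False, time.monotonic() - start, str(exc)
    elapsed = time.monotonic() - start
    if elapsed > max_seconds:
        # Resolution worked but was slow enough to delay service startup.
        return False, elapsed, "lookup took longer than %.1fs" % max_seconds
    return True, elapsed, None
```

On a controller one might call `check_dns(socket.gethostname())` before starting the cluster resources and abort the deployment if the first element of the result is `False`. The resolver is passed in as a parameter so the check can be exercised without network access.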
Closing this bug. Thanks for the help with debugging.