Description of problem:

# pcs status |grep -A2 rabbit
 Clone Set: rabbitmq-clone [rabbitmq]
     rabbitmq (ocf::heartbeat:rabbitmq-cluster): FAILED overcloud-controller-0 (unmanaged)
 Clone Set: memcached-clone [memcached]
     Started: [ overcloud-controller-0 ]
--
* rabbitmq_stop_0 on overcloud-controller-0 'unknown error' (1): call=882, status=Timed Out, exitreason='none',
    last-rc-change='Wed Feb 3 09:38:10 2016', queued=0ms, exec=90004ms

Version-Release number of selected component (if applicable):
director poodle: 2016-01-27.1
core poodle: 2016-02-02.2

How reproducible:
100%, both on 1 controller and 3 controllers

Steps to Reproduce:
1. Try installing IPv6 overcloud

Additional info:
This seems to be different from https://bugzilla.redhat.com/show_bug.cgi?id=1301404 as the SELinux error that caused that issue was fixed.

[root@overcloud-controller-0 ~]# ls -ltrhZ /etc/machine-id
-r--r--r--. root root unconfined_u:object_r:machineid_t:s0 /etc/machine-id

I have two dev boxes (1 HA and 1 minimal) that have this error and can be triaged.
I think I've seen this. In my case, Rabbit actually started correctly during the deployment, and ran long enough for the post-deployment to complete. Shortly after that, Rabbit failed on all 3 controllers, and wouldn't start again on any of them. I was not able to triage the root cause, and on a redeploy I was not getting the same behavior.
*** Bug 1304422 has been marked as a duplicate of this bug. ***
Ok, this is certainly a new, unknown issue, which wasn't addressed in the recent builds. Looking into this.
This is what I've got so far.

* Networking seems slow. Even ssh login takes several seconds.

* I've tried running rabbitmq-server with systemd (just in case, to check that everything was set up correctly). It does work.

* Starting rabbitmq with systemd takes a lot of time:

[root@overcloud-controller-2 ~]# time systemctl start rabbitmq-server

real    0m24.656s
user    0m0.003s
sys     0m0.006s
[root@overcloud-controller-2 ~]#

24 seconds is too much. This might cause a timeout in pacemaker.

* Getting the current status with rabbitmqctl takes roughly the same 24 seconds:

[root@overcloud-controller-2 ~]# time rabbitmqctl report
...
real    0m24.129s
user    0m0.067s
sys     0m0.030s
[root@overcloud-controller-2 ~]#

* From the rabbitmq logs it seems that opening a tcp6 connection takes almost 18 seconds:

[root@overcloud-controller-2 ~]# cat /var/log/rabbitmq/rabbit ...

=WARNING REPORT==== 3-Feb-2016::04:30:57 ===
msg_store_persistent: rebuilding indices from scratch

=INFO REPORT==== 3-Feb-2016::04:31:13 ===
started TCP Listener on [FD00:FD00:FD00:2000::15]:5672
...
After Peter's debugging and Dan's suggestion, the issue turned out to be invalid DNS settings. I had been deploying with 8.8.8.8/8.8.4.4 since the beginning of the IPv6 testing and it caused no problems until recently. It seems that those servers are now filtered out on the internal network, so name resolution on the overcloud nodes was broken, which resulted in very slow binding for every service.

Since this turned out to be the source of a weird and hard-to-debug issue, we should consider adding a validation step that checks for working DNS before we deploy rabbit on the overcloud. I'll open a bug for that.

Closing this bug. Thanks for the help with debugging.
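
For whoever picks up the validation bug, here is a rough sketch of what such a pre-deployment DNS check could look like. It is only an illustration, not what the installer does: the function name `check_dns`, the 2-second threshold, and the injectable `resolver` and `clock` parameters are all assumptions made for this example. It measures how long each lookup takes after the fact and does not enforce a timeout itself, so a hanging resolver still blocks until the system resolver gives up.

```python
import socket
import time
from typing import Callable, Iterable, List, Tuple


def check_dns(
    hostnames: Iterable[str],
    max_seconds: float = 2.0,  # hypothetical threshold, tune per environment
    resolver: Callable[[str], object] = lambda name: socket.getaddrinfo(name, None),
    clock: Callable[[], float] = time.monotonic,
) -> List[Tuple[str, bool, float, str]]:
    """Resolve each hostname and report (name, ok, elapsed_seconds, error).

    A lookup counts as ok only if it succeeds and finishes within
    max_seconds. The resolver and clock are injectable so the check can
    be exercised without touching the network.
    """
    results = []
    for name in hostnames:
        start = clock()
        error = ""
        try:
            resolver(name)
        except OSError as exc:  # socket.gaierror is a subclass of OSError
            error = str(exc) or exc.__class__.__name__
        elapsed = clock() - start
        ok = not error and elapsed <= max_seconds
        results.append((name, ok, elapsed, error))
    return results
```

Running something like `check_dns([socket.gethostname()])` on a controller before rabbit is started should flag either a failed lookup or a lookup that succeeds only after a long delay, which is the situation seen in this bug.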