Bug 1304423 - RabbitMQ fails to start on an IPv6 deployment
Product: Red Hat OpenStack
Classification: Red Hat
Component: rhosp-director
Version: 7.0 (Kilo)
Hardware: Unspecified
OS: Unspecified
Priority: unspecified
Severity: unspecified
Target Release: 7.0 (Kilo)
Assigned To: Marios Andreou
Keywords: AutomationBlocker
Duplicates: 1304422
Reported: 2016-02-03 09:55 EST by Attila Darazs
Modified: 2016-04-18 03:01 EDT

Doc Type: Bug Fix
Last Closed: 2016-02-04 07:42:24 EST
Type: Bug
Description Attila Darazs 2016-02-03 09:55:43 EST
Description of problem:

# pcs status |grep -A2 rabbit
 Clone Set: rabbitmq-clone [rabbitmq]
     rabbitmq	(ocf::heartbeat:rabbitmq-cluster):	FAILED overcloud-controller-0 (unmanaged)
 Clone Set: memcached-clone [memcached]
     Started: [ overcloud-controller-0 ]
* rabbitmq_stop_0 on overcloud-controller-0 'unknown error' (1): call=882, status=Timed Out, exitreason='none',
    last-rc-change='Wed Feb  3 09:38:10 2016', queued=0ms, exec=90004ms

Version-Release number of selected component (if applicable):
director poodle: 2016-01-27.1
core poodle: 2016-02-02.2

How reproducible:
100%, both on 1 controller and 3 controllers

Steps to Reproduce:
1. Try installing an IPv6 overcloud.

Additional info:
This seems to be different from https://bugzilla.redhat.com/show_bug.cgi?id=1301404 as the SELinux error that caused that issue was fixed.

[root@overcloud-controller-0 ~]# ls -ltrhZ /etc/machine-id 
-r--r--r--. root root unconfined_u:object_r:machineid_t:s0 /etc/machine-id

I have two dev boxes (one HA and one minimal) that have this error and can be triaged.
Comment 4 Dan Sneddon 2016-02-03 10:52:29 EST
I think I've seen this. In my case, Rabbit actually started correctly during the deployment, and ran long enough for the post-deployment to complete. Shortly after that, Rabbit failed on all 3 controllers, and wouldn't start again on any of them. I was not able to triage the root cause, and on a redeploy I was not getting the same behavior.
Comment 5 Mike Burns 2016-02-03 11:46:10 EST
*** Bug 1304422 has been marked as a duplicate of this bug. ***
Comment 6 Peter Lemenkov 2016-02-04 06:08:18 EST
Ok, this is certainly a new, unknown issue, which wasn't addressed in the recent builds. Looking into this.
Comment 7 Peter Lemenkov 2016-02-04 07:07:55 EST
This is what I've got so far.

* Networking seems slow. Even ssh login takes several seconds.
* I've tried running rabbitmq-server with systemd (just in case, to check that everything was set up correctly). It does work.
* Starting rabbitmq with systemd takes a lot of time:

[root@overcloud-controller-2 ~]# time systemctl start rabbitmq-server

real	0m24.656s
user	0m0.003s
sys	0m0.006s
[root@overcloud-controller-2 ~]#

24 seconds is too long. This might cause a timeout in Pacemaker.

* Getting current status with rabbitmqctl takes exactly the same 24 seconds:

[root@overcloud-controller-2 ~]# time rabbitmqctl report

real	0m24.129s
user	0m0.067s
sys	0m0.030s
[root@overcloud-controller-2 ~]# 

* From the rabbitmq logs it seems that opening a tcp6 listener takes about 16 seconds:

[root@overcloud-controller-2 ~]# cat /var/log/rabbitmq/rabbit@overcloud-controller-2.log

=WARNING REPORT==== 3-Feb-2016::04:30:57 ===
msg_store_persistent: rebuilding indices from scratch

=INFO REPORT==== 3-Feb-2016::04:31:13 ===
started TCP Listener on [FD00:FD00:FD00:2000::15]:5672
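
The gap between the two log entries above can be computed rather than read off by eye. The following is a minimal sketch, assuming GNU `date` and the `D-Mon-YYYY::HH:MM:SS` timestamp format shown in the log; the function names `log_ts_to_epoch` and `log_gap_seconds` are hypothetical helpers, not part of RabbitMQ:

```shell
#!/bin/sh
# Hypothetical helper: convert a RabbitMQ log timestamp such as
# "3-Feb-2016::04:30:57" to seconds since the epoch.
# Assumes GNU date (the -d option). Timestamps are treated as UTC, so only
# the difference between two of them is meaningful.
log_ts_to_epoch() {
    # "3-Feb-2016::04:30:57" -> "3 Feb 2016 04:30:57"
    normalized=$(printf '%s\n' "$1" | sed -e 's/::/ /' -e 's/-/ /g')
    date -u -d "$normalized" +%s
}

# Hypothetical helper: print the number of seconds between two timestamps.
log_gap_seconds() {
    start=$(log_ts_to_epoch "$1") || return 1
    end=$(log_ts_to_epoch "$2") || return 1
    echo $((end - start))
}
```

For the two entries quoted above, `log_gap_seconds '3-Feb-2016::04:30:57' '3-Feb-2016::04:31:13'` prints 16.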
Comment 8 Attila Darazs 2016-02-04 07:42:24 EST
After Peter's debugging and Dan's suggestion, the issue turned out to be invalid DNS settings.

I had been deploying with the same DNS settings since the beginning of the IPv6 testing, and they caused no problems until recently. It seems that the configured DNS server is filtered out on the internal network, so name resolution on the overcloud nodes was broken, which resulted in very slow binding for every service.

Since this was the source of a weird, hard-to-debug issue, we should consider adding a validation step that checks for working DNS before we deploy rabbit on the overcloud. I'll open a bug for that.

Closing this bug. Thanks for the help with debugging.
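
The validation step proposed above could start as a simple pre-deployment check that name resolution answers within a time limit. This is only a minimal sketch, assuming `getent` and the coreutils `timeout` command are available on the node; the function name `check_name_resolution` and the 5-second default are hypothetical and not part of the director tooling:

```shell
#!/bin/sh
# Hypothetical pre-deployment check: succeed only if NAME resolves through
# the system resolver (NSS, so /etc/hosts as well as DNS) within LIMIT seconds.
check_name_resolution() {
    name=$1
    limit=${2:-5}   # assumed default; tune for the environment
    # timeout exits with 124 when the lookup takes too long;
    # getent exits non-zero when the name cannot be resolved.
    timeout "$limit" getent hosts "$name" >/dev/null 2>&1
}
```

It could be run on each overcloud node before the messaging service is started, for example `check_name_resolution "$(hostname)" 5 || echo "name resolution is broken or slow" >&2`.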
