1304423 – RabbitMQ fails to start on an IPv6 deployment

Keywords:
Status:	CLOSED NOTABUG
Alias:	None
Product:	Red Hat OpenStack
Classification:	Red Hat
Component:	rhosp-director
Sub Component:
Version:	7.0 (Kilo)
Hardware:	Unspecified
OS:	Unspecified
Priority:	unspecified
Severity:	unspecified
Target Milestone:	---
Target Release:	7.0 (Kilo)
Assignee:	Marios Andreou
QA Contact:	yeylon@redhat.com
Docs Contact:
URL:
Whiteboard:
Duplicates (1):	1304422
Depends On:
Blocks:

Reported:	2016-02-03 14:55 UTC by Attila Darazs
Modified:	2016-04-18 07:01 UTC
CC List:	10 users
Fixed In Version:
Doc Type:	Bug Fix
Doc Text:
Clone Of:
Environment:
Last Closed:	2016-02-04 12:42:24 UTC
Target Upstream Version:
Embargoed:


Description Attila Darazs 2016-02-03 14:55:43 UTC

Description of problem:

# pcs status |grep -A2 rabbit
 Clone Set: rabbitmq-clone [rabbitmq]
     rabbitmq	(ocf::heartbeat:rabbitmq-cluster):	FAILED overcloud-controller-0 (unmanaged)
 Clone Set: memcached-clone [memcached]
     Started: [ overcloud-controller-0 ]
--
* rabbitmq_stop_0 on overcloud-controller-0 'unknown error' (1): call=882, status=Timed Out, exitreason='none',
    last-rc-change='Wed Feb  3 09:38:10 2016', queued=0ms, exec=90004ms

Version-Release number of selected component (if applicable):
director poodle: 2016-01-27.1
core poodle: 2016-02-02.2

How reproducible:
100%, both on 1 controller and 3 controllers

Steps to Reproduce:
1. Try installing an IPv6 overcloud

Additional info:
This seems to be different from https://bugzilla.redhat.com/show_bug.cgi?id=1301404 as the SELinux error that caused that issue was fixed.

[root@overcloud-controller-0 ~]# ls -ltrhZ /etc/machine-id 
-r--r--r--. root root unconfined_u:object_r:machineid_t:s0 /etc/machine-id

I have two dev boxes (1 HA and 1 minimal that has this error and can be triaged).

Comment 4 Dan Sneddon 2016-02-03 15:52:29 UTC

I think I've seen this. In my case, Rabbit actually started correctly during the deployment, and ran long enough for the post-deployment to complete. Shortly after that, Rabbit failed on all 3 controllers, and wouldn't start again on any of them. I was not able to find the root cause, and the behavior did not recur on a redeploy.

Comment 5 Mike Burns 2016-02-03 16:46:10 UTC

*** Bug 1304422 has been marked as a duplicate of this bug. ***

Comment 6 Peter Lemenkov 2016-02-04 11:08:18 UTC

Ok, this is certainly a new, unknown issue, which wasn't addressed in the recent builds. Looking into this.

Comment 7 Peter Lemenkov 2016-02-04 12:07:55 UTC

This is what I've got so far.

* Networking seems slow. Even ssh login takes several seconds.
* I've tried running rabbitmq-server with systemd (just in case, to check that everything was set up correctly). It does work.
* Starting rabbitmq with systemd takes a lot of time:

[root@overcloud-controller-2 ~]# time systemctl start rabbitmq-server

real	0m24.656s
user	0m0.003s
sys	0m0.006s
[root@overcloud-controller-2 ~]#

24 seconds is too long. This might cause a timeout in Pacemaker.
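
To put a number on delays like this, a command can be timed against a time budget. The following is only a minimal sketch: the helper name `time_command` and the budget value are hypothetical, and the real limit would be whatever operation timeout is configured for the resource.

```python
import subprocess
import sys
import time


def time_command(argv, budget_seconds):
    """Run argv and return (elapsed_seconds, finished_within_budget)."""
    start = time.monotonic()
    # check=False: only the duration is measured here, not the exit status.
    subprocess.run(
        argv,
        stdout=subprocess.DEVNULL,
        stderr=subprocess.DEVNULL,
        check=False,
    )
    elapsed = time.monotonic() - start
    return elapsed, elapsed <= budget_seconds


# Harmless demonstration that times the Python interpreter itself.
# On a controller, argv could instead be ["systemctl", "start", "rabbitmq-server"].
elapsed, within_budget = time_command([sys.executable, "-c", "pass"], budget_seconds=60.0)
print(f"elapsed={elapsed:.3f}s within_budget={within_budget}")
```

A monotonic clock is used so that the measurement is not affected by system clock adjustments while the command runs.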

* Getting the current status with rabbitmqctl takes about the same 24 seconds:

[root@overcloud-controller-2 ~]# time rabbitmqctl report
...

real	0m24.129s
user	0m0.067s
sys	0m0.030s
[root@overcloud-controller-2 ~]# 

* From the RabbitMQ logs, it seems that opening the tcp6 listener takes roughly 16 seconds (04:30:57 to 04:31:13):

[root@overcloud-controller-2 ~]# cat /var/log/rabbitmq/rabbit

...
=WARNING REPORT==== 3-Feb-2016::04:30:57 ===
msg_store_persistent: rebuilding indices from scratch

=INFO REPORT==== 3-Feb-2016::04:31:13 ===
started TCP Listener on [FD00:FD00:FD00:2000::15]:5672
...
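
The delay between the two report headers in the excerpt above can be computed rather than read off by eye. This is a rough sketch that assumes the header layout shown in the excerpt; the names `HEADER`, `report_times` and `gaps` are made up for illustration.

```python
import re
from datetime import datetime

# Matches report headers such as "=INFO REPORT==== 3-Feb-2016::04:31:13 ===".
HEADER = re.compile(
    r"^=(\w+) REPORT==== (\d{1,2}-\w{3}-\d{4}::\d{2}:\d{2}:\d{2}) ===$"
)


def report_times(lines):
    """Yield (level, timestamp) for every report header found in lines."""
    for line in lines:
        match = HEADER.match(line.strip())
        if match:
            # %b expects English month abbreviations (default C locale).
            stamp = datetime.strptime(match.group(2), "%d-%b-%Y::%H:%M:%S")
            yield match.group(1), stamp


def gaps(lines):
    """Return the seconds elapsed between consecutive report headers."""
    times = [stamp for _, stamp in report_times(lines)]
    return [(later - earlier).total_seconds() for earlier, later in zip(times, times[1:])]


excerpt = """\
=WARNING REPORT==== 3-Feb-2016::04:30:57 ===
msg_store_persistent: rebuilding indices from scratch

=INFO REPORT==== 3-Feb-2016::04:31:13 ===
started TCP Listener on [FD00:FD00:FD00:2000::15]:5672
"""

print(gaps(excerpt.splitlines()))  # [16.0]
```

Only the two entries quoted in the excerpt are considered here; log lines elided by "..." are not accounted for.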

Comment 8 Attila Darazs 2016-02-04 12:42:24 UTC

After Peter's debugging and Dan's suggestion, the issue turned out to be invalid DNS settings.

I had been deploying with 8.8.8.8/8.8.4.4 as DNS servers since the beginning of the IPv6 testing, and this caused no problems until recently. It seems that those servers are now filtered on the internal network, so name resolution on the overcloud nodes was broken, which made binding very slow for every service.

Since this was the source of a weird, hard-to-debug issue, we should consider adding a validation step that checks for working DNS before we deploy RabbitMQ on the overcloud. I'll open a bug for that.
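
A validation step of this kind might look roughly like the sketch below. It is not the director's actual validation: the function name `dns_works`, the default timeout and the injectable `resolver` parameter are all assumptions made for illustration. The idea is to attempt a lookup in a background thread and treat both an error and a slow answer as failure.

```python
import socket
import threading


def dns_works(hostname, timeout_seconds=5.0, resolver=socket.getaddrinfo):
    """Return True if hostname resolves to at least one address in time."""
    result = {}

    def worker():
        try:
            result["addresses"] = resolver(hostname, None)
        except OSError as error:
            # socket.gaierror is a subclass of OSError.
            result["error"] = error

    # Daemon thread: a resolver that hangs cannot keep the process alive.
    thread = threading.Thread(target=worker, daemon=True)
    thread.start()
    thread.join(timeout_seconds)
    # Still running, failed, or returned nothing: all count as "not working".
    return bool(result.get("addresses"))


# The default resolver uses the system DNS configuration. "localhost" is only
# a placeholder; a real check would use a name the overcloud nodes must resolve.
print(dns_works("localhost"))
```

Running the lookup in a separate thread with a deadline matters here because the failure in this bug was slowness rather than an immediate error, so a plain lookup call could itself block for a long time.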

Closing this bug. Thanks for the help with debugging.
