Bug 1643598
| Summary: | Rabbitmq fails to start during OSP10/RHEL-7.6 Overcloud ipv6 deployment | | |
| --- | --- | --- | --- |
| Product: | Red Hat OpenStack | Reporter: | Pavel Sedlák <psedlak> |
| Component: | rabbitmq-server | Assignee: | Peter Lemenkov <plemenko> |
| Status: | CLOSED DUPLICATE | QA Contact: | Udi Shkalim <ushkalim> |
| Severity: | urgent | Docs Contact: | |
| Priority: | urgent | | |
| Version: | 10.0 (Newton) | CC: | apevec, chjones, jeckersb, lhh, mburns, plemenko, psedlak, slinaber |
| Target Milestone: | --- | Keywords: | Triaged, ZStream |
| Target Release: | 10.0 (Newton) | | |
| Hardware: | Unspecified | | |
| OS: | Unspecified | | |
| Whiteboard: | | | |
| Fixed In Version: | | Doc Type: | If docs needed, set a value |
| Doc Text: | | Story Points: | --- |
| Clone Of: | | | |
| : | 1644710 1644711 (view as bug list) | Environment: | |
| Last Closed: | 2018-11-06 20:52:33 UTC | Type: | Bug |
| Regression: | --- | Mount Type: | --- |
| Documentation: | --- | CRM: | |
| Verified Versions: | | Category: | --- |
| oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: | |
| Cloudforms Team: | --- | Target Upstream Version: | |
| Embargoed: | | | |
| Bug Depends On: | | | |
| Bug Blocks: | 1644710, 1644711 | | |
Description Pavel Sedlák 2018-10-26 16:23:35 UTC
Do we have these jobs running on 7.5 as well? Curious if it still works ok there.

Also, has this ever worked previously on 7.6? Or has it always been broken there? I only see the four failures in CI; not sure if there are other 7.6 jobs somewhere else.

(In reply to John Eckersberg from comment #2)
> Do we have these jobs running on 7.5 as well? Curious if it still works ok
> there.

I'm not sure how/why we would compare 7.5 results to 7.6 results. If you think it would be useful we could retrigger the 7.5 ipv6 jobs, but they are using the same content that passed CI previously.

> Also, has this ever worked previously on 7.6? Or has it always been broken
> there? I only see the four failures in CI, not sure if there's other 7.6
> jobs somewhere else.

This is the first time we have run the ipv6 jobs against the 7.6 content. It has not passed before.

Why do you use systemd? I don't fully understand the architecture of this setup - we actually don't use systemd, but rely on pacemaker's scripts for RabbitMQ clustering/etc.

Sorry about the confusion with systemd, that was just due to a lack of understanding on my part and assuming systemd-managed services. It's not based on the setup or its architecture, just my confusing debug attempt, sorry about that. With pacemaker it worked after a restart (on a new setup, not the one with my systemd mess), but there is still not much info in the logs about the failure. How can I provide more logs/info, please?

# pcs status
> Cluster name: tripleo_cluster
> Stack: corosync
> Current DC: controller-1 (version 1.1.19-8.el7-c3c624ea3d) - partition with quorum
> Last updated: Mon Oct 29 16:55:12 2018
> Last change: Fri Oct 26 17:25:14 2018 by root via crm_attribute on controller-1
>
> 3 nodes configured
> 18 resources configured
>
> Online: [ controller-0 controller-1 controller-2 ]
>
> Full list of resources:
>
> ip-2620.52.0.13b8.5054.ff.fe3e.6 (ocf::heartbeat:IPaddr2): Started controller-0
> ip-fd00.fd00.fd00.2000..13 (ocf::heartbeat:IPaddr2): Started controller-1
> ip-192.168.24.8 (ocf::heartbeat:IPaddr2): Started controller-2
> Clone Set: haproxy-clone [haproxy]
>     Started: [ controller-0 controller-1 controller-2 ]
> Master/Slave Set: galera-master [galera]
>     Masters: [ controller-0 controller-1 controller-2 ]
> ip-fd00.fd00.fd00.3000..11 (ocf::heartbeat:IPaddr2): Started controller-0
> Clone Set: rabbitmq-clone [rabbitmq]
>     Stopped: [ controller-0 controller-1 controller-2 ]
> ip-fd00.fd00.fd00.4000..16 (ocf::heartbeat:IPaddr2): Started controller-1
> Master/Slave Set: redis-master [redis]
>     Masters: [ controller-1 ]
>     Slaves: [ controller-0 controller-2 ]
> ip-fd00.fd00.fd00.2000..10 (ocf::heartbeat:IPaddr2): Started controller-1
>
> Failed Actions:
> * rabbitmq_start_0 on controller-1 'unknown error' (1): call=48, status=complete, exitreason='',
>     last-rc-change='Fri Oct 26 17:24:45 2018', queued=0ms, exec=6432ms
> * rabbitmq_start_0 on controller-0 'unknown error' (1): call=62, status=complete, exitreason='',
>     last-rc-change='Fri Oct 26 17:25:14 2018', queued=0ms, exec=8064ms
> * rabbitmq_start_0 on controller-2 'unknown error' (1): call=53, status=complete, exitreason='',
>     last-rc-change='Fri Oct 26 17:24:59 2018', queued=0ms, exec=8021ms

# pcs resource debug-start rabbitmq
> Operation start for rabbitmq:0 (ocf:heartbeat:rabbitmq-cluster) returned: 'ok' (0)
> > stdout: Waiting for 'rabbit@controller-0' ...
> > stdout: pid is 354998 ...
> > stderr: Oct 29 16:56:31 INFO: RabbitMQ server is not running
> > stderr: Call cib_query failed (-6): No such device or address
> > stderr: Call cib_query failed (-6): No such device or address
> > stderr: Oct 29 16:56:31 INFO: Bootstrapping rabbitmq cluster
> > stderr: Oct 29 16:56:31 INFO: Waiting for server to start
> > stderr: Oct 29 16:56:36 INFO: cluster bootstrapped
> > stderr: Oct 29 16:56:37 INFO: Policy set: ha-all ^(?!amq\.).* {"ha-mode":"all"}

# pcs resource failcount show rabbitmq
> Failcounts for resource 'rabbitmq'
>  controller-0: INFINITY
>  controller-1: INFINITY
>  controller-2: INFINITY

# pcs resource failcount reset rabbitmq
> Cleaned up rabbitmq:0 on controller-2
> Cleaned up rabbitmq:0 on controller-1
> Cleaned up rabbitmq:0 on controller-0
> Cleaned up rabbitmq:1 on controller-2
> Cleaned up rabbitmq:1 on controller-1
> Cleaned up rabbitmq:1 on controller-0
> Cleaned up rabbitmq:2 on controller-2
> Cleaned up rabbitmq:2 on controller-1
> Cleaned up rabbitmq:2 on controller-0
> Waiting for 3 replies from the CRMd... OK

# pcs status
> Cluster name: tripleo_cluster
> Stack: corosync
> Current DC: controller-1 (version 1.1.19-8.el7-c3c624ea3d) - partition with quorum
> Last updated: Mon Oct 29 17:02:09 2018
> Last change: Mon Oct 29 17:00:30 2018 by root via crm_attribute on controller-2
>
> 3 nodes configured
> 18 resources configured
>
> Online: [ controller-0 controller-1 controller-2 ]
>
> Full list of resources:
>
> ip-2620.52.0.13b8.5054.ff.fe3e.6 (ocf::heartbeat:IPaddr2): Started controller-0
> ip-fd00.fd00.fd00.2000..13 (ocf::heartbeat:IPaddr2): Started controller-1
> ip-192.168.24.8 (ocf::heartbeat:IPaddr2): Started controller-2
> Clone Set: haproxy-clone [haproxy]
>     Started: [ controller-0 controller-1 controller-2 ]
> Master/Slave Set: galera-master [galera]
>     Masters: [ controller-0 controller-1 controller-2 ]
> ip-fd00.fd00.fd00.3000..11 (ocf::heartbeat:IPaddr2): Started controller-0
> Clone Set: rabbitmq-clone [rabbitmq]
>     Started: [ controller-0 controller-1 controller-2 ]
> ip-fd00.fd00.fd00.4000..16 (ocf::heartbeat:IPaddr2): Started controller-1
> Master/Slave Set: redis-master [redis]
>     Masters: [ controller-1 ]
>     Slaves: [ controller-0 controller-2 ]
> ip-fd00.fd00.fd00.2000..10 (ocf::heartbeat:IPaddr2): Started controller-1
>
> Daemon Status:
>     corosync: active/enabled
>     pacemaker: active/enabled
>     pcsd: active/enabled

# rabbitmqctl status
> Cluster status of node 'rabbit@controller-0' ...
> [{nodes,[{disc,['rabbit@controller-0','rabbit@controller-1',
>                 'rabbit@controller-2']}]},
>  {running_nodes,['rabbit@controller-2','rabbit@controller-1',
>                  'rabbit@controller-0']},
>  {cluster_name,<<"rabbit@controller-0">>},
>  {partitions,[]},
>  {alarms,[{'rabbit@controller-2',[]},
>           {'rabbit@controller-1',[]},
>           {'rabbit@controller-0',[]}]}]

(In reply to Pavel Sedlák from comment #0)
> /var/log/rabbitmq/startup_log contains only one line:
> > ERROR: epmd error for host controller-0: address (cannot connect to host/port)

Apparently RabbitMQ failed to start epmd and then failed to register itself with it. Later attempts to start RabbitMQ seem to have succeeded. Unfortunately we don't have any logs from epmd - it's a very silent application. My guess is that it takes a long time to start for some unknown reason (improper / unset hostname?), and the second attempt succeeds because the application was already started during the first attempt. So this is an IPv6 resolving issue.
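For anyone triaging the same symptom, here is a minimal, illustrative check of the two suspects named above (epmd registration and hostname resolution). These commands were not run as part of this report; they are just the obvious way to confirm the diagnosis on a controller node:

# epmd -names
# getent ahostsv4 controller-0
# getent ahostsv6 controller-0

If epmd reports no registered 'rabbit' name and only the ahostsv6 lookup returns an address, that matches the "epmd error for host controller-0" line from startup_log and points to the IPv6-only resolution described below.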
In order to work, RabbitMQ (Erlang, actually) needs to resolve its own node name, which is usually set this way:

/etc/rabbitmq/rabbitmq-env.conf
...
RABBITMQ_NODENAME=rabbit@controller-0
...

So controller-0 must be resolvable. We have a patch which tells RabbitMQ whether to use IPv6 or IPv4 when resolving names: https://github.com/erlang/otp/commit/5d7dcfc2e0de7e93b29d01f07a2f970720d62f9d

In our case controller-0 cannot be resolved to IPv4, only to IPv6. Thus we need a special /etc/rabbitmq/inetrc file containing the "{inet6, true}." configuration option. Unfortunately this file doesn't exist on the controller nodes, so RabbitMQ cannot resolve the hostname and fails to start.

I am 99% sure this is the same thing as bug 1557522, which has had a downstream review for the last 8 months and just got lost in the shuffle. I just rebased the patch so it should be ready for merge now. I am going to mark this bug as a duplicate of that one to keep everything tidy, since the existing review already references the other RHBZ#.

*** This bug has been marked as a duplicate of bug 1557522 ***
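For reference, a minimal sketch of the missing file described above. The "{inet6, true}." line is the one named in this comment; the way it gets wired into RabbitMQ through rabbitmq-env.conf is an assumption based on the generic Erlang "-kernel inetrc" mechanism and is shown only for illustration - the actual fix is the one tracked in bug 1557522.

/etc/rabbitmq/inetrc
%% Erlang inet configuration: resolve hostnames via IPv6 (AAAA records)
{inet6, true}.

/etc/rabbitmq/rabbitmq-env.conf (assumed wiring, illustrative only)
RABBITMQ_NODENAME=rabbit@controller-0
# point the Erlang VM at the inetrc file so name lookups are done over IPv6
RABBITMQ_SERVER_ADDITIONAL_ERL_ARGS="-kernel inetrc '/etc/rabbitmq/inetrc'"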