Bug 1643598

Summary: Rabbitmq fails to start during OSP10/RHEL-7.6 Overcloud ipv6 deployment
Product: Red Hat OpenStack Reporter: Pavel Sedlák <psedlak>
Component: rabbitmq-server    Assignee: Peter Lemenkov <plemenko>
Status: CLOSED DUPLICATE QA Contact: Udi Shkalim <ushkalim>
Severity: urgent Docs Contact:
Priority: urgent    
Version: 10.0 (Newton)    CC: apevec, chjones, jeckersb, lhh, mburns, plemenko, psedlak, slinaber
Target Milestone: ---    Keywords: Triaged, ZStream
Target Release: 10.0 (Newton)   
Hardware: Unspecified   
OS: Unspecified   
Whiteboard:
Fixed In Version: Doc Type: If docs needed, set a value
Doc Text:
Story Points: ---
Clone Of:
: 1644710 1644711    Environment:
Last Closed: 2018-11-06 20:52:33 UTC Type: Bug
Regression: --- Mount Type: ---
Documentation: --- CRM:
Verified Versions: Category: ---
oVirt Team: --- RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: --- Target Upstream Version:
Embargoed:
Bug Depends On:    
Bug Blocks: 1644710, 1644711    

Description Pavel Sedlák 2018-10-26 16:23:35 UTC
When running overcloud deploy of OSP10 on RHEL-7.6, with templates for ipv6 setup,
overcloud deploy fails on puppet steps related to keystone on controller-0.
That is due to keystone hanging - unable to connect to AMQP,
as the rabbitmq-server service is dead, with the only info being:
> ERROR: epmd error for host controller-0: address (cannot connect to host/port)

This issue has so far not been observed on IPv4 deployments.
It also was not reproduced on a single-controller setup, only on three-controller ones.

on controller-0:
> puppet-rabbitmq-5.6.0-2.el7ost.noarch
> rabbitmq-server-3.6.3-10.el7ost.noarch
> cloud-init-18.2-1.el7.x86_64
on undercloud-0:
> openstack-tripleo-image-elements-5.3.3-1.el7ost.noarch
> openstack-tripleo-0.0.8-0.3.4de13b3git.el7ost.noarch
> openstack-tripleo-puppet-elements-5.3.3-4.el7ost.noarch
> python-tripleoclient-5.4.6-1.el7ost.noarch
> puppet-tripleo-5.6.8-16.el7ost.noarch
> openstack-tripleo-ui-1.2.1-1.el7ost.noarch
> openstack-tripleo-common-5.4.7-6.el7ost.noarch
> openstack-tripleo-heat-templates-5.3.10-17.el7ost.noarch
> openstack-tripleo-validations-5.1.4-3.el7ost.noarch


---

deployment list:
> | f9cfcf0d-fe18-4cf7-8c61-79d60891d0e3 | 8c8bdb81-a5e0-4a9f-81e9-b9157d8221d5 | 033c2c68-8636-4da0-854d-ed6e25c35b03 | CREATE | FAILED   | 2018-10-26T13:15:13Z | deploy_status_code : Deployment exited with non-zero status code: 6 |
nova show for 8c8bdb81 node:
| id                                   | 033c2c68-8636-4da0-854d-ed6e25c35b03                     |
| image                                | overcloud-full (af033990-4b59-48bd-8a7e-80dd197347cd)    |
| key_name                             | default                                                  |
| name                                 | controller-0                                             |

openstack software deployment output show f9cfcf0d-fe18-4cf7-8c61-79d60891d0e3 --all --long
> deploy_stderr: |
>   "keystone", "--description", "OpenStack Identity Service"]]' has been running for more than 170 seconds
>   Warning: /Stage[main]/Keystone::Endpoint/Keystone::Resource::Service_identity[keystone]/Keystone_endpoint[regionOne/keystone::identity]: Skipping because of failed dependencies
> deploy_status_code: 6

Even a few minutes after the whole deployment failed,
there is still a hanging 'openstack service create' call on controller-0:
> [root@controller-0 heat-admin]# ps -ef|grep keystone
> keystone   63634   63627  0 13:23 ?        00:00:07 keystone-admin  -DFOREGROUND
> ...
> root       86987       1  0 13:35 ?        00:00:00 /usr/bin/python2 /usr/bin/openstack service create --format shell identity --name keystone --description OpenStack Identity Service
> root      364313  363779  0 15:36 pts/0    00:00:00 grep --color=auto keystone
Stracing this command shows it blocked in recvfrom(3, ...), and it is connected to keystone (under httpd):
> [root@controller-0 heat-admin]# ss -np |grep 86987
> tcp    ESTAB      0      0      192.168.24.7:49710              192.168.24.7:35357               users:(("openstack",pid=86987,fd=3))
> [root@controller-0 heat-admin]# ss -lnp|grep 5357
> tcp    LISTEN     0      128    192.168.24.7:35357                 *:*                   users:(("httpd",pid=87016,fd=3),("httpd",pid=81243,fd=3),("httpd",pid=63657,fd=3),("httpd",pid=63656,fd=3),("httpd",pid=63655,fd=3),("httpd",pid=63654,fd=3),("httpd",pid=63653,fd=3),("httpd",pid=63652,fd=3),("httpd",pid=63651,fd=3),("httpd",pid=63650,fd=3),("httpd",pid=63627,fd=3))
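
For reference, a sketch of the strace invocation used for that check (the PID is the one from the ps output above; this is how the blocked recvfrom() on fd 3 was observed):

# strace -f -p 86987 -e trace=network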

/var/log/keystone/keystone.log is full of:
> 2018-10-26 15:54:11.164 63634 ERROR oslo.messaging._drivers.impl_rabbit [req-344920f5-4ea8-48ab-8ca4-a4c79d84d12d - - - - -] [901c5e61-dc86-480a-adfe-771e4e3fead9] AMQP server on fd00:fd00:fd00:2000::17:5672 is unreachable: [Errno 111] Connection refused. Trying again in 1 seconds. Client port: None
> 2018-10-26 15:54:12.173 63634 ERROR oslo.messaging._drivers.impl_rabbit [req-344920f5-4ea8-48ab-8ca4-a4c79d84d12d - - - - -] [901c5e61-dc86-480a-adfe-771e4e3fead9] AMQP server on fd00:fd00:fd00:2000::18:5672 is unreachable: [Errno 111] Connection refused. Trying again in 32 seconds. Client port: None

rabbitmq-server.service is dead:
> [root@controller-0 heat-admin]# systemctl status rabbitmq-server
> ● rabbitmq-server.service - RabbitMQ broker
>    Loaded: loaded (/usr/lib/systemd/system/rabbitmq-server.service; disabled; vendor preset: disabled)
>   Drop-In: /etc/systemd/system/rabbitmq-server.service.d
>            └─limits.conf
>    Active: inactive (dead)

journalctl has no entries for rabbitmq-server and /var/log/rabbitmq/startup_err is empty.
/var/log/rabbitmq/startup_log contains only one line:
> ERROR: epmd error for host controller-0: address (cannot connect to host/port)
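
To confirm whether epmd itself is up and reachable on the node, a couple of generic checks can be run (a sketch, not output from this report; epmd listens on TCP port 4369 by default):

# epmd -names
# ss -lnt | grep 4369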

When I attempted to start it manually it worked; keystone reconnected, and the hanging 'openstack service create' call also disappeared.

> [root@controller-0 heat-admin]# systemctl start rabbitmq-server
> [root@controller-0 heat-admin]# systemctl status rabbitmq-server
> ● rabbitmq-server.service - RabbitMQ broker
>    Loaded: loaded (/usr/lib/systemd/system/rabbitmq-server.service; disabled; vendor preset: disabled)
>   Drop-In: /etc/systemd/system/rabbitmq-server.service.d
>            └─limits.conf
>    Active: active (running) since Fri 2018-10-26 15:54:41 UTC; 4s ago
>  Main PID: 400973 (beam.smp)
>    Status: "Initialized"
>     Tasks: 143
>    Memory: 71.8M
>    CGroup: /system.slice/rabbitmq-server.service
>            ├─400973 /usr/lib64/erlang/erts-7.3.1.4/bin/beam.smp -W w -A 128 -K true -P 1048576 -K true -- -root /usr/lib64/erlang -progname erl -- -home /var/lib/rabbitmq -- -pa /usr/lib/rabbitmq/lib/rabbitmq_server-3.6.3/ebin -noshell ...
>            ├─401296 inet_gethost 4
>            └─401297 inet_gethost 4
> 
> Oct 26 15:54:39 controller-0 systemd[1]: Starting RabbitMQ broker...
> Oct 26 15:54:40 controller-0 rabbitmq-server[400973]: RabbitMQ 3.6.3. Copyright (C) 2007-2016 Pivotal Software, Inc.
> Oct 26 15:54:40 controller-0 rabbitmq-server[400973]: ##  ##      Licensed under the MPL.  See http://www.rabbitmq.com/
> Oct 26 15:54:40 controller-0 rabbitmq-server[400973]: ##  ##
> Oct 26 15:54:40 controller-0 rabbitmq-server[400973]: ##########  Logs: /var/log/rabbitmq/rabbit
> Oct 26 15:54:40 controller-0 rabbitmq-server[400973]: ######  ##        /var/log/rabbitmq/rabbit
> Oct 26 15:54:40 controller-0 rabbitmq-server[400973]: ##########
> Oct 26 15:54:40 controller-0 rabbitmq-server[400973]: Starting broker...
> Oct 26 15:54:41 controller-0 rabbitmq-server[400973]: completed with 0 plugins.
> Oct 26 15:54:41 controller-0 systemd[1]: Started RabbitMQ broker.

Comment 2 John Eckersberg 2018-10-26 18:08:43 UTC
Do we have these jobs running on 7.5 as well?  Curious if it still works ok there.

Also, has this ever worked previously on 7.6?  Or has it always been broken there?  I only see the four failures in CI, not sure if there's other 7.6 jobs somewhere else.

Comment 3 Steve Linabery 2018-10-26 19:45:41 UTC
(In reply to John Eckersberg from comment #2)
> Do we have these jobs running on 7.5 as well?  Curious if it still works ok
> there.

I'm not sure how/why we would compare 7.5 results to 7.6 results. If you think it would be useful we could retrigger the 7.5 ipv6 jobs, but they are using the same content that passed CI previously.

> 
> Also, has this ever worked previously on 7.6?  Or has it always been broken
> there?  I only see the four failures in CI, not sure if there's other 7.6
> jobs somewhere else.

This is the first time we have run the ipv6 jobs against the 7.6 content. It has not passed before.

Comment 4 Peter Lemenkov 2018-10-29 15:13:51 UTC
Why do you use systemd? I don't fully understand the architecture of this setup - we actually don't use systemd, but rely on pacemaker's resource agent scripts for RabbitMQ clustering etc.

Comment 6 Pavel Sedlák 2018-10-29 17:07:06 UTC
Sorry about the confusion with systemd; that was just due to a lack of understanding on my part and assuming systemd-managed services.
It's not based on the setup or its architecture, just my confusing debug attempt, sorry about that.


With pacemaker it worked after a restart (on a new setup, not the one with my systemd mess). There is still not much info in the logs about the failure - how can I provide more logs/info, please?

# pcs status
> Cluster name: tripleo_cluster
> Stack: corosync
> Current DC: controller-1 (version 1.1.19-8.el7-c3c624ea3d) - partition with quorum
> Last updated: Mon Oct 29 16:55:12 2018
> Last change: Fri Oct 26 17:25:14 2018 by root via crm_attribute on controller-1
> 
> 3 nodes configured
> 18 resources configured
> 
> Online: [ controller-0 controller-1 controller-2 ]
> 
> Full list of resources:
> 
>  ip-2620.52.0.13b8.5054.ff.fe3e.6       (ocf::heartbeat:IPaddr2):       Started controller-0
>  ip-fd00.fd00.fd00.2000..13     (ocf::heartbeat:IPaddr2):       Started controller-1
>  ip-192.168.24.8        (ocf::heartbeat:IPaddr2):       Started controller-2
>  Clone Set: haproxy-clone [haproxy]
>      Started: [ controller-0 controller-1 controller-2 ]
>  Master/Slave Set: galera-master [galera]
>      Masters: [ controller-0 controller-1 controller-2 ]
>  ip-fd00.fd00.fd00.3000..11     (ocf::heartbeat:IPaddr2):       Started controller-0
>  Clone Set: rabbitmq-clone [rabbitmq]
>      Stopped: [ controller-0 controller-1 controller-2 ]
>  ip-fd00.fd00.fd00.4000..16     (ocf::heartbeat:IPaddr2):       Started controller-1
>  Master/Slave Set: redis-master [redis]
>      Masters: [ controller-1 ]
>      Slaves: [ controller-0 controller-2 ]
>  ip-fd00.fd00.fd00.2000..10     (ocf::heartbeat:IPaddr2):       Started controller-1
> 
> Failed Actions:
> * rabbitmq_start_0 on controller-1 'unknown error' (1): call=48, status=complete, exitreason='',
>     last-rc-change='Fri Oct 26 17:24:45 2018', queued=0ms, exec=6432ms
> * rabbitmq_start_0 on controller-0 'unknown error' (1): call=62, status=complete, exitreason='',
>     last-rc-change='Fri Oct 26 17:25:14 2018', queued=0ms, exec=8064ms
> * rabbitmq_start_0 on controller-2 'unknown error' (1): call=53, status=complete, exitreason='',
>     last-rc-change='Fri Oct 26 17:24:59 2018', queued=0ms, exec=8021ms

# pcs resource debug-start rabbitmq
> Operation start for rabbitmq:0 (ocf:heartbeat:rabbitmq-cluster) returned: 'ok' (0)
>  >  stdout: Waiting for 'rabbit@controller-0' ...
>  >  stdout: pid is 354998 ...
>  >  stderr: Oct 29 16:56:31 INFO: RabbitMQ server is not running
>  >  stderr: Call cib_query failed (-6): No such device or address
>  >  stderr: Call cib_query failed (-6): No such device or address
>  >  stderr: Oct 29 16:56:31 INFO: Bootstrapping rabbitmq cluster
>  >  stderr: Oct 29 16:56:31 INFO: Waiting for server to start
>  >  stderr: Oct 29 16:56:36 INFO: cluster bootstrapped
>  >  stderr: Oct 29 16:56:37 INFO: Policy set: ha-all ^(?!amq\.).* {"ha-mode":"all"}

# pcs resource failcount show rabbitmq
> Failcounts for resource 'rabbitmq'
>   controller-0: INFINITY
>   controller-1: INFINITY
>   controller-2: INFINITY

# pcs resource failcount reset rabbitmq
> Cleaned up rabbitmq:0 on controller-2
> Cleaned up rabbitmq:0 on controller-1
> Cleaned up rabbitmq:0 on controller-0
> Cleaned up rabbitmq:1 on controller-2
> Cleaned up rabbitmq:1 on controller-1
> Cleaned up rabbitmq:1 on controller-0
> Cleaned up rabbitmq:2 on controller-2
> Cleaned up rabbitmq:2 on controller-1
> Cleaned up rabbitmq:2 on controller-0
> Waiting for 3 replies from the CRMd... OK

# pcs status
> Cluster name: tripleo_cluster
> Stack: corosync
> Current DC: controller-1 (version 1.1.19-8.el7-c3c624ea3d) - partition with quorum
> Last updated: Mon Oct 29 17:02:09 2018
> Last change: Mon Oct 29 17:00:30 2018 by root via crm_attribute on controller-2
> 
> 3 nodes configured
> 18 resources configured
> 
> Online: [ controller-0 controller-1 controller-2 ]
> 
> Full list of resources:
> 
>  ip-2620.52.0.13b8.5054.ff.fe3e.6       (ocf::heartbeat:IPaddr2):       Started controller-0
>  ip-fd00.fd00.fd00.2000..13     (ocf::heartbeat:IPaddr2):       Started controller-1
>  ip-192.168.24.8        (ocf::heartbeat:IPaddr2):       Started controller-2
>  Clone Set: haproxy-clone [haproxy]
>      Started: [ controller-0 controller-1 controller-2 ]
>  Master/Slave Set: galera-master [galera]
>      Masters: [ controller-0 controller-1 controller-2 ]
>  ip-fd00.fd00.fd00.3000..11     (ocf::heartbeat:IPaddr2):       Started controller-0
>  Clone Set: rabbitmq-clone [rabbitmq]
>      Started: [ controller-0 controller-1 controller-2 ]
>  ip-fd00.fd00.fd00.4000..16     (ocf::heartbeat:IPaddr2):       Started controller-1
>  Master/Slave Set: redis-master [redis]
>      Masters: [ controller-1 ]
>      Slaves: [ controller-0 controller-2 ]
>  ip-fd00.fd00.fd00.2000..10     (ocf::heartbeat:IPaddr2):       Started controller-1
> 
> Daemon Status:
>   corosync: active/enabled
>   pacemaker: active/enabled
>   pcsd: active/enabled

# rabbitmqctl status
> Cluster status of node 'rabbit@controller-0' ...
> [{nodes,[{disc,['rabbit@controller-0','rabbit@controller-1',
>                 'rabbit@controller-2']}]},
>  {running_nodes,['rabbit@controller-2','rabbit@controller-1',
>                  'rabbit@controller-0']},
>  {cluster_name,<<"rabbit@controller-0">>},
>  {partitions,[]},
>  {alarms,[{'rabbit@controller-2',[]},
>           {'rabbit@controller-1',[]},
>           {'rabbit@controller-0',[]}]}]

Comment 9 Peter Lemenkov 2018-10-29 18:39:25 UTC
(In reply to Pavel Sedlák from comment #0)

> /var/log/rabbitmq/startup_log contains only one line:
> > ERROR: epmd error for host controller-0: address (cannot connect to host/port)

Apparently RabbitMQ failed to start epmd and later failed to register itself with it. It seems that later attempts to start RabbitMQ were successful.

Unfortunately we don't have any logs from epmd - it's a very silent application. My guess is that it takes a long time to start for some unknown reason (improper / unset hostname?), and the second attempt succeeds because epmd was already started by the first one.
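
A quick way to test that hypothesis on an affected node would be something like the following (a sketch of generic checks, not output taken from this report):

# hostname
# getent ahostsv4 controller-0
# getent ahostsv6 controller-0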

Comment 15 Peter Lemenkov 2018-11-01 16:16:26 UTC
So this is an IPv6 name-resolution issue. In order to work, RabbitMQ (Erlang, actually) needs to resolve its host name, which is usually set in this way:

/etc/rabbitmq/rabbitmq-env.conf

...
RABBITMQ_NODENAME=rabbit@controller-0
...

So controller-0 must be resolvable. We have a patch which lets RabbitMQ (Erlang) be told whether to use IPv6 or IPv4 for DNS resolution:

https://github.com/erlang/otp/commit/5d7dcfc2e0de7e93b29d01f07a2f970720d62f9d

And in our case controller-0 cannot be resolved to an IPv4 address, only to an IPv6 one. Thus we need a special /etc/rabbitmq/inetrc file containing the "{inet6, true}." configuration option.

Unfortunately this file does not exist on the controller nodes, so RabbitMQ cannot resolve the hostname and fails to start.
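
For illustration, a minimal sketch of such a configuration (these are the standard RabbitMQ/Erlang mechanisms; the exact paths and variables generated by the TripleO templates may differ):

/etc/rabbitmq/inetrc

{inet6, true}.

and, so that both the server and rabbitmqctl pick the file up, /etc/rabbitmq/rabbitmq-env.conf would contain something like:

...
RABBITMQ_SERVER_ADDITIONAL_ERL_ARGS="-kernel inetrc '/etc/rabbitmq/inetrc'"
RABBITMQ_CTL_ERL_ARGS="-kernel inetrc '/etc/rabbitmq/inetrc'"
...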

Comment 17 John Eckersberg 2018-11-06 20:52:33 UTC
I am 99% sure this is the same thing as bug 1557522, which has had a downstream review for the last 8 months and just got lost in the shuffle.  I just rebased the patch so it should be ready for merge now.

I am going to mark this bug as a duplicate of that one to keep everything tidy, since the existing review already references the other RHBZ#.

*** This bug has been marked as a duplicate of bug 1557522 ***