Bug 1476811 - OSP11 -> OSP12 upgrade: major-upgrade-composable-steps-docker.yaml fails in IPv6 environment with ERROR: epmd error for host controller-1: address (cannot connect to host/port)
Status: CLOSED ERRATA
Alias: None
Product: Red Hat OpenStack
Classification: Red Hat
Component: puppet-tripleo
Version: 12.0 (Pike)
Hardware: Unspecified
OS: Unspecified
Priority: high
Severity: urgent
Target Milestone: rc
Target Release: 12.0 (Pike)
Assignee: RHOS Maint
QA Contact: Marius Cornea
Blocks: 1399762
Reported: 2017-07-31 14:07 UTC by Marius Cornea
Modified: 2023-02-22 23:02 UTC (History)

Fixed In Version: puppet-tripleo-7.4.1-0.20170928143423.67e1e60.el7ost erlang-18.3.4.5-4.el7ost
Last Closed: 2017-12-13 21:45:43 UTC

Links
System | ID | Private | Priority | Status | Summary | Last Updated
Launchpad | 1708154 | 0 | None | None | None | 2017-08-02 11:55:39 UTC
OpenStack gerrit | 475457 | 0 | None | MERGED | Use rabbitmq ipv6 flag | 2020-09-25 16:40:51 UTC
RDO | 9203 | 0 | None | None | None | 2017-09-01 11:04:44 UTC
Red Hat Product Errata | RHEA-2017:3462 | 0 | normal | SHIPPED_LIVE | Red Hat OpenStack Platform 12.0 Enhancement Advisory | 2018-02-16 01:43:25 UTC

Description Marius Cornea 2017-07-31 14:07:33 UTC
Description of problem:
OSP11 -> OSP12 upgrade: major-upgrade-converge-docker.yaml fails in IPv6 environment with ERROR: epmd error for host controller-1: address (cannot connect to host/port)

Version-Release number of selected component (if applicable):
openstack-tripleo-heat-templates-7.0.0-0.20170721174554.el7ost.noarch

How reproducible:
100%

Steps to Reproduce:
1. Deploy IPv6 OSP11 with 3 controllers + 1 compute
2. Upgrade to OSP12


Actual results:
Upgrade gets stuck during major-upgrade-converge-docker.yaml step.

Expected results:
major-upgrade-converge-docker.yaml step finishes successfully.

Additional info:

[root@controller-1 heat-admin]# pcs status
Cluster name: tripleo_cluster
Stack: corosync
Current DC: controller-2 (version 1.1.16-12.el7-94ff4df) - partition with quorum
Last updated: Mon Jul 31 14:00:10 2017
Last change: Mon Jul 31 13:07:58 2017 by root via cibadmin on controller-0

12 nodes configured
37 resources configured

Online: [ controller-0 controller-1 controller-2 ]
GuestOnline: [ galera-bundle-0@controller-1 galera-bundle-1@controller-2 galera-bundle-2@controller-0 rabbitmq-bundle-0@controller-1 rabbitmq-bundle-1@controller-2 rabbitmq-bundle-2@controller-0 redis-bundle-0@controller-1 redis-bundle-1@controller-2 redis-bundle-2@controller-0 ]

Full list of resources:

 ip-192.168.24.15	(ocf::heartbeat:IPaddr2):	Started controller-0
 ip-2620.52.0.13b8.5054.ff.fe3e.3	(ocf::heartbeat:IPaddr2):	Started controller-1
 ip-fd00.fd00.fd00.2000..15	(ocf::heartbeat:IPaddr2):	Started controller-2
 ip-fd00.fd00.fd00.2000..16	(ocf::heartbeat:IPaddr2):	Started controller-0
 ip-fd00.fd00.fd00.3000..12	(ocf::heartbeat:IPaddr2):	Started controller-1
 ip-fd00.fd00.fd00.4000..11	(ocf::heartbeat:IPaddr2):	Started controller-2
 openstack-cinder-volume	(systemd:openstack-cinder-volume):	Stopped
 Docker container set: rabbitmq-bundle [192.168.24.1:8787/rhosp12/openstack-rabbitmq-docker:2017-07-26.10]
   rabbitmq-bundle-0	(ocf::heartbeat:rabbitmq-cluster):	Stopped controller-1
   rabbitmq-bundle-1	(ocf::heartbeat:rabbitmq-cluster):	Stopped controller-2
   rabbitmq-bundle-2	(ocf::heartbeat:rabbitmq-cluster):	Stopped controller-0
 Docker container set: galera-bundle [192.168.24.1:8787/rhosp12/openstack-mariadb-docker:2017-07-26.10]
   galera-bundle-0	(ocf::heartbeat:galera):	Master controller-1
   galera-bundle-1	(ocf::heartbeat:galera):	Master controller-2
   galera-bundle-2	(ocf::heartbeat:galera):	Master controller-0
 Docker container set: redis-bundle [192.168.24.1:8787/rhosp12/openstack-redis-docker:2017-07-26.10]
   redis-bundle-0	(ocf::heartbeat:redis):	Slave controller-1
   redis-bundle-1	(ocf::heartbeat:redis):	Slave controller-2
   redis-bundle-2	(ocf::heartbeat:redis):	Slave controller-0
 Docker container set: haproxy-bundle [192.168.24.1:8787/rhosp12/openstack-haproxy-docker:2017-07-26.10]
   haproxy-bundle-docker-0	(ocf::heartbeat:docker):	Started controller-1
   haproxy-bundle-docker-1	(ocf::heartbeat:docker):	Started controller-2
   haproxy-bundle-docker-2	(ocf::heartbeat:docker):	Started controller-0

Failed Actions:
* openstack-cinder-volume_start_0 on controller-2 'not running' (7): call=105, status=complete, exitreason='none',
    last-rc-change='Mon Jul 31 13:08:08 2017', queued=0ms, exec=2086ms
* openstack-cinder-volume_start_0 on controller-1 'not running' (7): call=105, status=complete, exitreason='none',
    last-rc-change='Mon Jul 31 13:08:04 2017', queued=0ms, exec=2088ms
* openstack-cinder-volume_start_0 on controller-0 'not running' (7): call=109, status=complete, exitreason='none',
    last-rc-change='Mon Jul 31 13:07:57 2017', queued=0ms, exec=2062ms
* rabbitmq_start_0 on rabbitmq-bundle-0 'unknown error' (1): call=11757, status=complete, exitreason='none',
    last-rc-change='Mon Jul 31 14:00:00 2017', queued=0ms, exec=4700ms


[root@controller-1 heat-admin]# docker exec -it rabbitmq-bundle-docker-0 bash -c 'cat /var/log/rabbitmq/startup_log'
ERROR: epmd error for host controller-1: address (cannot connect to host/port)

[root@controller-1 heat-admin]# docker exec -it rabbitmq-bundle-docker-0 bash -c 'getent hosts controller-1'
fd00:fd00:fd00:2000::1b controller-1.localdomain controller-1

Note: I am not able to reproduce this issue on IPv4 environments.
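
The `getent hosts` output above shows that, inside the container, controller-1 resolves to an IPv6 address, so any client that reaches epmd by hostname has to connect over IPv6. As a rough illustration (not part of the original report), the following Python sketch classifies a `getent hosts` line by address family and probes a TCP port. The function names are hypothetical, and 4369 is assumed here as the default epmd port.

```python
import ipaddress
import socket


def parse_getent_line(line: str):
    """Split one `getent hosts` line into (address, [names])."""
    fields = line.split()
    if not fields:
        raise ValueError("empty getent line")
    return ipaddress.ip_address(fields[0]), fields[1:]


def address_family(line: str) -> str:
    """Return 'IPv6' or 'IPv4' for the address on a `getent hosts` line."""
    address, _names = parse_getent_line(line)
    return f"IPv{address.version}"


def can_connect(host: str, port: int, timeout: float = 3.0) -> bool:
    """Try a TCP connection; create_connection resolves IPv4 and IPv6 names."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts and name-resolution failures.
        return False


if __name__ == "__main__":
    line = "fd00:fd00:fd00:2000::1b controller-1.localdomain controller-1"
    print(address_family(line))
    # Probing epmd itself only makes sense on a node of the affected cluster:
    # print(can_connect("controller-1", 4369))
```

A successful TCP connection only shows that something listens on the port; it does not prove that the Erlang distribution itself is configured for IPv6.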

Comment 2 Marius Cornea 2017-07-31 15:17:39 UTC
Adjusting the title to match the correct step in the upgrade process where it fails: major-upgrade-composable-steps-docker.yaml

Comment 3 Marios Andreou 2017-08-02 11:55:40 UTC
o/ just spent some time looking here. My first question/unknown at present: is this upgrades-related, or is it an issue with deploying HA/containers with IPv6? I.e., do we have jobs that already deploy like that and succeed, so we can rule that out?

Looking in the attached sos report /var/log/messages, I can't quickly see an error around the upgrade_tasks. I see the rabbit container starting here:

    Jul 31 13:05:05 controller-1 docker(rabbitmq-bundle-docker-0)[386299]: INFO: running container rabbitmq-bundle-docker-0 for the first time

And then a few seconds later this error about cib_query:
    Jul 31 13:05:13 controller-1 rabbitmq-cluster(rabbitmq)[387079]: INFO: failed to bootstrap cluster. Check SELINUX policy
    Jul 31 13:05:13 controller-1 rabbitmq-cluster(rabbitmq)[387083]: DEBUG: rabbitmq start : 1
    Jul 31 13:05:13 controller-1 pacemaker_remoted[386425]:   notice: rabbitmq_start_0:31:stderr [ Call cib_query failed (-6): No such device or address ]
    Jul 31 13:05:13 controller-1 pacemaker_remoted[386425]:   notice: rabbitmq_start_0:31:stderr [ Call cib_query failed (-6): No such device or address ]
...
    Jul 31 13:05:17 controller-1 pacemaker_remoted[386425]:   notice: rabbitmq_stop_0:514:stderr [   * unable to connect to epmd (port 4369) on controller-1: address (cannot connect to host/port) ]

A little before that, this caught my eye:

    Jul 31 13:00:37 controller-1 ansible-command[361402]: Invoked with warn=True executable=None _uses_shell=True _raw_params=echo 'export ERL_EPMD_ADDRESS=127.0.0.1' > /etc/rabbitmq/rabbitmq-env.conf
        echo 'export ERL_EPMD_PORT=4370' >> /etc/rabbitmq/rabbitmq-env.conf
        for pid in $(pgrep epmd); do if [ "$(lsns -o NS -p $pid)" == "$(lsns -o NS -p 1)" ]; then kill $pid; break; fi; done removes=None creates=None chdir=None

and so am wondering if that hardcoded localhost address is an issue here (traced to https://review.openstack.org/#/c/452889/19/docker/services/pacemaker/rabbitmq.yaml@153 ). 

I think folks from DFG:Containers may need to check here (and possibly PIDONE) too. Marking triaged and filed upstream LP placeholder (in trackers).

Comment 4 Marius Cornea 2017-08-03 08:09:17 UTC
(In reply to marios from comment #3)
> o/ just spent some time looking here. My first question/unknown at present is
> this upgrades related, or is it an issue with deploying HA/container with
> IPv6 i.e. do we have jobs that deploy like that already and which are
> successful so we can rule that out?
> 

It looks like fresh container environments with IPv6 can be deployed fine and the error reported in this bug doesn't show up.

Comment 5 Marios Andreou 2017-08-03 15:24:22 UTC
Hi Rasca - please can we get someone from the team to do a first pass here? The error we see is from pacemaker_remoted (see comment #3). I will leave it as DFG:Upgrades for now, unless you need me to assign it to PIDONE to get it checked, in which case please go ahead and take it.

Comment 6 Michele Baldessari 2017-08-04 08:20:02 UTC
(In reply to marios from comment #3)
> o/ just spent some time looking here. My first question/unknown at present is
> this upgrades related, or is it an issue with deploying HA/container with
> IPv6 i.e. do we have jobs that deploy like that already and which are
> successful so we can rule that out?

So I am totally surprised by Marius' comment nr 4 where he says he can deploy containers with ipv6 and it works out of the box. Unless the tests were done on a single controller it should not work.
In order to have rabbitmq working with ipv6 three things need to happen:
1) erlang-18.3.4.5-4
2) https://github.com/puppetlabs/puppetlabs-rabbitmq/pull/552 unmerged as of today :/
3) https://review.openstack.org/#/c/475457/ (which depends on 2 being merged)

Without these three patches we get exactly what Marius observed in the upgrade, but we see it on deploy as well.

> 
> Looking in the attached sos report /var/log/messages, I can't quickly see an
> error around the upgrade_tasks. I see the rabbit container starting here:
> 
>     Jul 31 13:05:05 controller-1 docker(rabbitmq-bundle-docker-0)[386299]: INFO: running container rabbitmq-bundle-docker-0 for the first time
> 
> And then a few seconds later this error about cib_query
>     Jul 31 13:05:13 controller-1 rabbitmq-cluster(rabbitmq)[387079]: INFO: failed to bootstrap cluster. Check SELINUX policy
>     Jul 31 13:05:13 controller-1 rabbitmq-cluster(rabbitmq)[387083]: DEBUG: rabbitmq start : 1
>     Jul 31 13:05:13 controller-1 pacemaker_remoted[386425]:   notice: rabbitmq_start_0:31:stderr [ Call cib_query failed (-6): No such device or address ]
>     Jul 31 13:05:13 controller-1 pacemaker_remoted[386425]:   notice: rabbitmq_start_0:31:stderr [ Call cib_query failed (-6): No such device or address ]
> ...
>     Jul 31 13:05:17 controller-1 pacemaker_remoted[386425]:   notice: rabbitmq_stop_0:514:stderr [   * unable to connect to epmd (port 4369) on controller-1: address (cannot connect to host/port) ]
> 

That is the error that eck's patches fixed; we get it at deploy time as well.

> A little before that, this caught my eye:
> 
>     Jul 31 13:00:37 controller-1 ansible-command[361402]: Invoked with warn=True executable=None _uses_shell=True _raw_params=echo 'export ERL_EPMD_ADDRESS=127.0.0.1' > /etc/rabbitmq/rabbitmq-env.conf
>         echo 'export ERL_EPMD_PORT=4370' >> /etc/rabbitmq/rabbitmq-env.conf
>         for pid in $(pgrep epmd); do if [ "$(lsns -o NS -p $pid)" == "$(lsns -o NS -p 1)" ]; then kill $pid; break; fi; done removes=None creates=None chdir=None
> 
> and so am wondering if that hardcoded localhost address is an issue here
> (traced to
> https://review.openstack.org/#/c/452889/19/docker/services/pacemaker/
> rabbitmq.yaml@153 ). 

So the "if lsns" check is basically the only way we can avoid epmd being started on the host for no reason. The super short version is "even running facter on the host will start up epmd on the host"; the longer description of this problem is here:
http://rhel-ha.etherpad.corp.redhat.com/epmd-container-port-issue
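
The `lsns` comparison in the quoted Ansible task kills an epmd process only when its namespaces match those of PID 1, that is, when it runs on the host rather than inside a container. A minimal Python sketch of the same idea follows. It is a simplification and not the code TripleO uses: it compares only the PID namespace through `/proc/<pid>/ns/pid` instead of the full `lsns -o NS` output, and the function names are made up for illustration.

```python
import os
from typing import Callable, Iterable, List


def read_pid_namespace(pid: int) -> str:
    """Return the PID-namespace identifier of a process, e.g. 'pid:[4026531836]'.

    Reading the link for PID 1 normally requires root privileges.
    """
    return os.readlink(f"/proc/{pid}/ns/pid")


def host_namespace_pids(
    candidate_pids: Iterable[int],
    read_ns: Callable[[int], str] = read_pid_namespace,
    init_pid: int = 1,
) -> List[int]:
    """Return the candidates that share a PID namespace with init_pid."""
    host_ns = read_ns(init_pid)
    matches = []
    for pid in candidate_pids:
        try:
            if read_ns(pid) == host_ns:
                matches.append(pid)
        except OSError:
            # The process exited or is not readable; skip it.
            continue
    return matches
```

The `read_ns` parameter exists only so the comparison can be exercised without root access or a real epmd process; on a controller one would pass the PIDs reported by `pgrep epmd`.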

Comment 8 Michele Baldessari 2017-08-15 07:09:41 UTC
(In reply to Michele Baldessari from comment #6)
> (In reply to marios from comment #3)
> > o/ just spent some time looking here. My first question/unknown at present is
> > this upgrades related, or is it an issue with deploying HA/container with
> > IPv6 i.e. do we have jobs that deploy like that already and which are
> > successful so we can rule that out?
> 
> So I am totally surprised by Marius' comment nr 4 where he says he can
> deploy containers with ipv6 and it works out of the box.

Thinking a bit more about this, if you use one node for rabbit then it might just work even without the below patches.

> Unless the tests
> were done on a single controller it should not work.
> In order to have rabbitmq working with ipv6 three things need to happen:
> 1) erlang-18.3.4.5-4
> 2) https://github.com/puppetlabs/puppetlabs-rabbitmq/pull/552 unmerged as of
> today :/
> 3) https://review.openstack.org/#/c/475457/ (which depends on 2 being merged)

All three patches (1, 2, 3) are now merged upstream, so if they are all present in a puddle, it's worth retrying to see if there are any other hiccups on the upgrade path. If you need any help verifying patches and whatnot, just let me know.

Comment 9 Marius Cornea 2017-08-23 10:13:06 UTC
(In reply to Michele Baldessari from comment #8)
> > Unless the tests
> > were done on a single controller it should not work.
> > In order to have rabbitmq working with ipv6 three things need to happen:
> > 1) erlang-18.3.4.5-4
> > 2) https://github.com/puppetlabs/puppetlabs-rabbitmq/pull/552 unmerged as of
> > today :/
> > 3) https://review.openstack.org/#/c/475457/ (which depends on 2 being merged)
> 
> So now all (1,2,3) patches merged upstream, so if they are all present in a
> puddle. It's worth retrying to see if there are any other hiccups on the
> upgrade path. If you need any help verifying patches and what not, just let
> me know.

Checking the latest OSP12 puddle:

1 - present: erlang-runtime_tools-18.3.4.5-4.el7ost.x86_64

2 - patch not present: installed version is puppet-rabbitmq-5.6.1-0.20170710225057.62fed8d.el7ost.noarch

3 - patch not present: puppet-tripleo-7.2.1-0.20170807233007.4600842.el7ost.noarch

Comment 10 Marius Cornea 2017-08-23 20:38:38 UTC
(In reply to Marius Cornea from comment #9)
> (In reply to Michele Baldessari from comment #8)
> > > Unless the tests
> > > were done on a single controller it should not work.
> > > In order to have rabbitmq working with ipv6 three things need to happen:
> > > 1) erlang-18.3.4.5-4
> > > 2) https://github.com/puppetlabs/puppetlabs-rabbitmq/pull/552 unmerged as of
> > > today :/
> > > 3) https://review.openstack.org/#/c/475457/ (which depends on 2 being merged)
> > 
> > So now all (1,2,3) patches merged upstream, so if they are all present in a
> > puddle. It's worth retrying to see if there are any other hiccups on the
> > upgrade path. If you need any help verifying patches and what not, just let
> > me know.
> 
> Checking the latest OSP12 puddle:
> 
> 1 - present: erlang-runtime_tools-18.3.4.5-4.el7ost.x86_64
> 
> 2 - patch not present: installed version is
> puppet-rabbitmq-5.6.1-0.20170710225057.62fed8d.el7ost.noarch
> 
> 3 - patch not present:
> puppet-tripleo-7.2.1-0.20170807233007.4600842.el7ost.noarch

I managed to get the upgrade successfully passing after manually applying patches in 2) and 3) so I think we should keep this BZ open to keep track of those changes landing in the downstream puddle.

Comment 11 Marios Andreou 2017-08-28 11:08:32 UTC
(In reply to Marius Cornea from comment #10)
> (In reply to Marius Cornea from comment #9)
> > (In reply to Michele Baldessari from comment #8)

...

> > 1 - present: erlang-runtime_tools-18.3.4.5-4.el7ost.x86_64
> > 
> > 2 - patch not present: installed version is
> > puppet-rabbitmq-5.6.1-0.20170710225057.62fed8d.el7ost.noarch
> > 
> > 3 - patch not present:
> > puppet-tripleo-7.2.1-0.20170807233007.4600842.el7ost.noarch
> 
> I managed to get the upgrade successfully passing after manually applying
> patches in 2) and 3) so I think we should keep this BZ open to keep track of
> those changes landing in the downstream puddle.

so moving to POST (we aren't tracking the changes here, I assume we have a BZ for each of those individually already?) - or mcornea, should this go straight to ON_QA? wdyt

Comment 12 Marius Cornea 2017-08-28 11:31:15 UTC
(In reply to marios from comment #11)
> (In reply to Marius Cornea from comment #10)
> > (In reply to Marius Cornea from comment #9)
> > > (In reply to Michele Baldessari from comment #8)
> 
> ...
> 
> > > 1 - present: erlang-runtime_tools-18.3.4.5-4.el7ost.x86_64
> > > 
> > > 2 - patch not present: installed version is
> > > puppet-rabbitmq-5.6.1-0.20170710225057.62fed8d.el7ost.noarch
> > > 
> > > 3 - patch not present:
> > > puppet-tripleo-7.2.1-0.20170807233007.4600842.el7ost.noarch
> > 
> > I managed to get the upgrade successfully passing after manually applying
> > patches in 2) and 3) so I think we should keep this BZ open to keep track of
> > those changes landing in the downstream puddle.
> 
> so moving to POST (we aren't tracking the changes here, I assume we have BZ
> for each of those individually already?)  - or mcornea should this go
> straight to ON_QA? wdyt

I think we should leave it on POST for now as the changes have merged upstream but have not yet landed in a downstream build.

Comment 13 Alan Pevec 2017-08-29 11:23:17 UTC
> Checking the latest OSP12 puddle:
> 
> 1 - present: erlang-runtime_tools-18.3.4.5-4.el7ost.x86_64

NOT present in RDO deps repo where we have old Fedora rebuild 18.3.4.4
https://github.com/rdo-common/erlang/commits/common-rdo

There is a newer Fedora rebuild candidate, erlang-19.3.6.1-1.el7 (http://cbs.centos.org/koji/buildinfo?buildID=17627). Is the fix included, and is it safe to push to Common (for all RDO releases) or to Pike only?
Please propose an appropriate update to rdoinfo/deps.yml;
for an example see https://review.rdoproject.org/r/#/c/8045/4/deps.yml

Comment 14 Michele Baldessari 2017-08-29 11:29:01 UTC
My preferred option would be the following:
1) Use erlang-runtime_tools-18.3.4.5-4.el7ost.x86_64 for anything up to pike
2) For master instead we need to decide which version we want. If we plan a rebase downstream for Queens then sure let's update master upstream. Otherwise let's stay on 1) for master as well.

Peter, thoughts?

Comment 15 Peter Lemenkov 2017-08-29 12:24:17 UTC
(In reply to Michele Baldessari from comment #14)
> My preferred option would be the following:
> 1) Use erlang-runtime_tools-18.3.4.5-4.el7ost.x86_64 for anything up to pike
> 2) For master instead we need to decide which version we want. If we plan a
> rebase downstream for Queens then sure let's update master upstream.
> Otherwise let's stay on 1) for master as well.
> 
> Peter, thoughts?

Hello All!

It's a strange situation when we have multiple Erlang builds floating around. Ours (and by "ours" I mean RHOS) has most if not all IPv6-related fixes applied. Even upstream's version 20 doesn't contain all the necessary pieces. So I strongly advise everyone to switch to our most recent build, which is erlang-18.3.4.5-4.el7ost at this moment. At least you'll have someone to blame :)

Regarding the switch to ver. 19 - we will build it soon (for RHOS 12 or likely 13).

Right now I believe that any recent build from 18.x.y.z series "should be enough for everyone" (c)

Comment 16 Michele Baldessari 2017-08-31 13:45:56 UTC
1) puppet-rabbitmq
Not sure I can link a github PR, but I can confirm that the needed changes are in puppetlabs-rabbitmq package (http://download-node-02.eng.bos.redhat.com/rcm-guest/puddles/OpenStack/12.0-RHEL-7/2017-08-31.1/RH7-RHOS-12.0/source/puppet-rabbitmq-5.6.1-0.20170808005755.5ac45de.el7ost.src.rpm). The commit was https://github.com/voxpupuli/puppet-rabbitmq/commit/5ac45dedd9b409c9efac654724bc74867cb9233b

puppet-rabbitmq-5.6.1-0.20170808005755.5ac45de has the needed fix.

2) puppet-tripleo
Also puppet-tripleo-7.3.0-0.20170821114704.el7ost.src.rpm has the fix for puppet-tripleo (https://review.openstack.org/#/c/475457/)

3) erlang
Downstream we do have the proper erlang build http://download-node-02.eng.bos.redhat.com/rcm-guest/puddles/OpenStack/12.0-RHEL-7/2017-08-31.1/RH7-RHOS-12.0/source/erlang-18.3.4.5-4.el7ost.src.rpm as well.

The only thing missing here AFAICT is the erlang build in RDO/upstream, which should not block things on this BZ
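
The three checks above amount to comparing installed builds against the first builds known to carry the fixes. The following Python sketch illustrates that comparison using version strings quoted in this bug (comment 9 for the installed builds, comment 16 for the fixed ones). It is a deliberately simplified ordering and not RPM's real version-comparison algorithm: it looks only at the dotted version plus the leading numeric release fields, and the function and variable names are hypothetical.

```python
def build_key(version_release: str) -> tuple:
    """Turn '7.2.1-0.20170807233007.4600842.el7ost' into (7, 2, 1, 0, 20170807233007).

    Only the dotted version and up to two leading numeric release fields
    (release number and snapshot timestamp) are used; the git hash and
    the dist tag are ignored.
    """
    version, _, release = version_release.partition("-")
    key = [int(part) for part in version.split(".")]
    for field in release.split(".")[:2]:
        if not field.isdigit():
            break
        key.append(int(field))
    return tuple(key)


def missing_fixes(installed: dict, fixed_in: dict) -> list:
    """Return the package names whose installed build sorts before the fixed build."""
    return sorted(
        name
        for name, fixed in fixed_in.items()
        if build_key(installed[name]) < build_key(fixed)
    )


# Builds reported in comment 9 (installed) and comment 16 (containing the fix).
INSTALLED = {
    "erlang": "18.3.4.5-4.el7ost",
    "puppet-rabbitmq": "5.6.1-0.20170710225057.62fed8d.el7ost",
    "puppet-tripleo": "7.2.1-0.20170807233007.4600842.el7ost",
}
FIXED_IN = {
    "erlang": "18.3.4.5-4.el7ost",
    "puppet-rabbitmq": "5.6.1-0.20170808005755.5ac45de.el7ost",
    "puppet-tripleo": "7.3.0-0.20170821114704.el7ost",
}

if __name__ == "__main__":
    print(missing_fixes(INSTALLED, FIXED_IN))
```

For anything beyond a quick sanity check, a real RPM comparison (for example via `rpm` itself) would be the safer choice, since this sketch assumes purely numeric version components.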

Comment 17 Raoul Scarazzini 2017-08-31 13:56:13 UTC
Peter, can you please take care of the RDO/upstream thing? If you feel we need to involve someone else, feel free to include them here.

Comment 18 Peter Lemenkov 2017-09-01 11:31:28 UTC
(In reply to Raoul Scarazzini from comment #17)
> Peter can you please take care about the RDO/upstream thing? If you feel
> like we need to involve someone else feel free to include him here.

In progress.

Comment 23 errata-xmlrpc 2017-12-13 21:45:43 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHEA-2017:3462

