Description of problem:
OSP11 -> OSP12 upgrade: major-upgrade-converge-docker.yaml fails in IPv6 environment with ERROR: epmd error for host controller-1: address (cannot connect to host/port)

Version-Release number of selected component (if applicable):
openstack-tripleo-heat-templates-7.0.0-0.20170721174554.el7ost.noarch

How reproducible:
100%

Steps to Reproduce:
1. Deploy IPv6 OSP11 with 3 controllers + 1 compute
2. Upgrade to OSP12

Actual results:
Upgrade gets stuck during the major-upgrade-converge-docker.yaml step.

Expected results:
major-upgrade-converge-docker.yaml step finishes successfully.

Additional info:

[root@controller-1 heat-admin]# pcs status
Cluster name: tripleo_cluster
Stack: corosync
Current DC: controller-2 (version 1.1.16-12.el7-94ff4df) - partition with quorum
Last updated: Mon Jul 31 14:00:10 2017
Last change: Mon Jul 31 13:07:58 2017 by root via cibadmin on controller-0

12 nodes configured
37 resources configured

Online: [ controller-0 controller-1 controller-2 ]
GuestOnline: [ galera-bundle-0@controller-1 galera-bundle-1@controller-2 galera-bundle-2@controller-0 rabbitmq-bundle-0@controller-1 rabbitmq-bundle-1@controller-2 rabbitmq-bundle-2@controller-0 redis-bundle-0@controller-1 redis-bundle-1@controller-2 redis-bundle-2@controller-0 ]

Full list of resources:

 ip-192.168.24.15	(ocf::heartbeat:IPaddr2):	Started controller-0
 ip-2620.52.0.13b8.5054.ff.fe3e.3	(ocf::heartbeat:IPaddr2):	Started controller-1
 ip-fd00.fd00.fd00.2000..15	(ocf::heartbeat:IPaddr2):	Started controller-2
 ip-fd00.fd00.fd00.2000..16	(ocf::heartbeat:IPaddr2):	Started controller-0
 ip-fd00.fd00.fd00.3000..12	(ocf::heartbeat:IPaddr2):	Started controller-1
 ip-fd00.fd00.fd00.4000..11	(ocf::heartbeat:IPaddr2):	Started controller-2
 openstack-cinder-volume	(systemd:openstack-cinder-volume):	Stopped
 Docker container set: rabbitmq-bundle [192.168.24.1:8787/rhosp12/openstack-rabbitmq-docker:2017-07-26.10]
   rabbitmq-bundle-0	(ocf::heartbeat:rabbitmq-cluster):	Stopped controller-1
   rabbitmq-bundle-1	(ocf::heartbeat:rabbitmq-cluster):	Stopped controller-2
   rabbitmq-bundle-2	(ocf::heartbeat:rabbitmq-cluster):	Stopped controller-0
 Docker container set: galera-bundle [192.168.24.1:8787/rhosp12/openstack-mariadb-docker:2017-07-26.10]
   galera-bundle-0	(ocf::heartbeat:galera):	Master controller-1
   galera-bundle-1	(ocf::heartbeat:galera):	Master controller-2
   galera-bundle-2	(ocf::heartbeat:galera):	Master controller-0
 Docker container set: redis-bundle [192.168.24.1:8787/rhosp12/openstack-redis-docker:2017-07-26.10]
   redis-bundle-0	(ocf::heartbeat:redis):	Slave controller-1
   redis-bundle-1	(ocf::heartbeat:redis):	Slave controller-2
   redis-bundle-2	(ocf::heartbeat:redis):	Slave controller-0
 Docker container set: haproxy-bundle [192.168.24.1:8787/rhosp12/openstack-haproxy-docker:2017-07-26.10]
   haproxy-bundle-docker-0	(ocf::heartbeat:docker):	Started controller-1
   haproxy-bundle-docker-1	(ocf::heartbeat:docker):	Started controller-2
   haproxy-bundle-docker-2	(ocf::heartbeat:docker):	Started controller-0

Failed Actions:
* openstack-cinder-volume_start_0 on controller-2 'not running' (7): call=105, status=complete, exitreason='none',
    last-rc-change='Mon Jul 31 13:08:08 2017', queued=0ms, exec=2086ms
* openstack-cinder-volume_start_0 on controller-1 'not running' (7): call=105, status=complete, exitreason='none',
    last-rc-change='Mon Jul 31 13:08:04 2017', queued=0ms, exec=2088ms
* openstack-cinder-volume_start_0 on controller-0 'not running' (7): call=109, status=complete, exitreason='none',
    last-rc-change='Mon Jul 31 13:07:57 2017', queued=0ms, exec=2062ms
* rabbitmq_start_0 on rabbitmq-bundle-0 'unknown error' (1): call=11757, status=complete, exitreason='none',
    last-rc-change='Mon Jul 31 14:00:00 2017', queued=0ms, exec=4700ms

[root@controller-1 heat-admin]# docker exec -it rabbitmq-bundle-docker-0 bash -c 'cat /var/log/rabbitmq/startup_log'
ERROR: epmd error for host controller-1: address (cannot connect to host/port)

[root@controller-1 heat-admin]# docker exec -it rabbitmq-bundle-docker-0 bash -c 'getent hosts controller-1'
fd00:fd00:fd00:2000::1b controller-1.localdomain controller-1

Note: I am not able to reproduce this issue on IPv4 environments.
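For reference, a minimal way to confirm the epmd mismatch from inside the container (a sketch, assuming the ss and epmd utilities are available in the image):

[root@controller-1 heat-admin]# docker exec -it rabbitmq-bundle-docker-0 bash -c 'ss -tln | grep 4369'
# if epmd was pinned to ERL_EPMD_ADDRESS=127.0.0.1 it listens only on
# 127.0.0.1:4369, so the IPv6 address that controller-1 resolves to
# (fd00:fd00:fd00:2000::1b) cannot reach it - matching the startup_log error
[root@controller-1 heat-admin]# docker exec -it rabbitmq-bundle-docker-0 bash -c 'epmd -names'
# lists names registered with the reachable epmd, or fails the same way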
Adjusting the title to match the correct step in the upgrade process where it fails: major-upgrade-composable-steps-docker.yaml
o/ just spent some time looking here. My first question/unknown at present is: is this upgrades related, or is it an issue with deploying HA/containers with IPv6? i.e. do we have jobs that already deploy like that and which are successful, so we can rule that out?

Looking in the attached sos report /var/log/messages, I can't quickly see an error around the upgrade_tasks. I see the rabbit container starting here:

Jul 31 13:05:05 controller-1 docker(rabbitmq-bundle-docker-0)[386299]: INFO: running container rabbitmq-bundle-docker-0 for the first time

And then a few seconds later this error about cib_query:

Jul 31 13:05:13 controller-1 rabbitmq-cluster(rabbitmq)[387079]: INFO: failed to bootstrap cluster. Check SELINUX policy
Jul 31 13:05:13 controller-1 rabbitmq-cluster(rabbitmq)[387083]: DEBUG: rabbitmq start : 1
Jul 31 13:05:13 controller-1 pacemaker_remoted[386425]: notice: rabbitmq_start_0:31:stderr [ Call cib_query failed (-6): No such device or address ]
Jul 31 13:05:13 controller-1 pacemaker_remoted[386425]: notice: rabbitmq_start_0:31:stderr [ Call cib_query failed (-6): No such device or address ]
...
Jul 31 13:05:17 controller-1 pacemaker_remoted[386425]: notice: rabbitmq_stop_0:514:stderr [ * unable to connect to epmd (port 4369) on controller-1: address (cannot connect to host/port) ]

A little before that, this caught my eye:

Jul 31 13:00:37 controller-1 ansible-command[361402]: Invoked with warn=True executable=None _uses_shell=True _raw_params=echo 'export ERL_EPMD_ADDRESS=127.0.0.1' > /etc/rabbitmq/rabbitmq-env.conf
echo 'export ERL_EPMD_PORT=4370' >> /etc/rabbitmq/rabbitmq-env.conf
for pid in $(pgrep epmd); do if [ "$(lsns -o NS -p $pid)" == "$(lsns -o NS -p 1)" ]; then kill $pid; break; fi; done removes=None creates=None chdir=None

and so I am wondering if that hardcoded localhost address is an issue here (traced to https://review.openstack.org/#/c/452889/19/docker/services/pacemaker/rabbitmq.yaml@153 ). I think folks from DFG:Containers may need to check here (and possibly PIDONE too). Marking triaged and filed upstream LP placeholder (in trackers).
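To make the quoted task easier to read, here it is reconstructed as a plain script (my reconstruction from the log line above, not the verbatim template): it pins epmd on the host to localhost and an alternate port, then kills only an epmd running in the host network namespace (i.e. the same namespace as PID 1), leaving any container-scoped epmd alone.

echo 'export ERL_EPMD_ADDRESS=127.0.0.1' >  /etc/rabbitmq/rabbitmq-env.conf
echo 'export ERL_EPMD_PORT=4370'         >> /etc/rabbitmq/rabbitmq-env.conf
# kill a stray host-side epmd, but skip epmd processes inside containers
for pid in $(pgrep epmd); do
  if [ "$(lsns -o NS -p $pid)" == "$(lsns -o NS -p 1)" ]; then
    kill $pid
    break
  fi
done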
(In reply to marios from comment #3)
> o/ just spent some time looking here. My first question/unknown at present
> is: is this upgrades related, or is it an issue with deploying HA/containers
> with IPv6? i.e. do we have jobs that already deploy like that and which are
> successful, so we can rule that out?

It looks like fresh container environments with IPv6 can be deployed fine and the error reported in this bug doesn't show up.
Hi Rasca - can we please get someone from the team to do a first pass here? The error we see is from pacemaker_remoted (see comment #3). I will leave it as DFG:Upgrades for now, unless you need me to assign it to PIDONE in order to get it checked, in which case please go ahead and take it.
(In reply to marios from comment #3)
> o/ just spent some time looking here. My first question/unknown at present
> is: is this upgrades related, or is it an issue with deploying HA/containers
> with IPv6? i.e. do we have jobs that already deploy like that and which are
> successful, so we can rule that out?

So I am totally surprised by Marius' comment nr 4 where he says he can deploy containers with ipv6 and it works out of the box. Unless the tests were done on a single controller it should not work. In order to have rabbitmq working with ipv6, three things need to happen:
1) erlang-18.3.4.5-4
2) https://github.com/puppetlabs/puppetlabs-rabbitmq/pull/552 (unmerged as of today :/)
3) https://review.openstack.org/#/c/475457/ (which depends on 2 being merged)

Without these three patches we get exactly what Marius observed in the upgrade, but we see it at deploy time as well.

> Looking in the attached sos report /var/log/messages, I can't quickly see an
> error around the upgrade_tasks. I see the rabbit container starting here:
>
> Jul 31 13:05:05 controller-1 docker(rabbitmq-bundle-docker-0)[386299]:
> INFO: running container rabbitmq-bundle-docker-0 for the first time
>
> And then a few seconds later this error about cib_query:
>
> Jul 31 13:05:13 controller-1 rabbitmq-cluster(rabbitmq)[387079]: INFO:
> failed to bootstrap cluster. Check SELINUX policy
> ...
> Jul 31 13:05:17 controller-1 pacemaker_remoted[386425]: notice:
> rabbitmq_stop_0:514:stderr [ * unable to connect to epmd (port
> 4369) on controller-1: address (cannot connect to host/port) ]

That is the error which got fixed by eck's patches, and which we get at deploy time as well.

> A little before that, this caught my eye:
>
> Jul 31 13:00:37 controller-1 ansible-command[361402]: Invoked with
> warn=True executable=None _uses_shell=True _raw_params=echo 'export
> ERL_EPMD_ADDRESS=127.0.0.1' > /etc/rabbitmq/rabbitmq-env.conf
> echo 'export ERL_EPMD_PORT=4370' >> /etc/rabbitmq/rabbitmq-env.conf
> for pid in $(pgrep epmd); do if [ "$(lsns -o NS -p $pid)" == "$(lsns -o NS
> -p 1)" ]; then kill $pid; break; fi; done removes=None creates=None
> chdir=None
>
> and so am wondering if that hardcoded localhost address is an issue here
> (traced to
> https://review.openstack.org/#/c/452889/19/docker/services/pacemaker/rabbitmq.yaml@153 ).

So the "if lsns" is basically the only way we can avoid epmd being started on the host for no reason. The super short version is "even running facter on the host will start up epmd on the host"; the longer description of this problem is here:
http://rhel-ha.etherpad.corp.redhat.com/epmd-container-port-issue
(In reply to Michele Baldessari from comment #6)
> (In reply to marios from comment #3)
> ...
> So I am totally surprised by Marius' comment nr 4 where he says he can
> deploy containers with ipv6 and it works out of the box.

Thinking a bit more about this: if you use one node for rabbit then it might just work even without the below patches.

> Unless the tests were done on a single controller it should not work.
> In order to have rabbitmq working with ipv6, three things need to happen:
> 1) erlang-18.3.4.5-4
> 2) https://github.com/puppetlabs/puppetlabs-rabbitmq/pull/552 (unmerged as
> of today :/)
> 3) https://review.openstack.org/#/c/475457/ (which depends on 2 being merged)

All three patches (1, 2, 3) have now merged upstream, so if they are all present in a puddle it's worth retrying to see if there are any other hiccups on the upgrade path. If you need any help verifying patches and whatnot, just let me know.
(In reply to Michele Baldessari from comment #8)
> In order to have rabbitmq working with ipv6, three things need to happen:
> 1) erlang-18.3.4.5-4
> 2) https://github.com/puppetlabs/puppetlabs-rabbitmq/pull/552
> 3) https://review.openstack.org/#/c/475457/ (which depends on 2 being merged)
>
> All three patches (1, 2, 3) have now merged upstream, so if they are all
> present in a puddle it's worth retrying to see if there are any other
> hiccups on the upgrade path. If you need any help verifying patches and
> whatnot, just let me know.

Checking the latest OSP12 puddle:

1 - present: erlang-runtime_tools-18.3.4.5-4.el7ost.x86_64

2 - patch not present: installed version is puppet-rabbitmq-5.6.1-0.20170710225057.62fed8d.el7ost.noarch

3 - patch not present: puppet-tripleo-7.2.1-0.20170807233007.4600842.el7ost.noarch
(In reply to Marius Cornea from comment #9)
> (In reply to Michele Baldessari from comment #8)
> ...
> Checking the latest OSP12 puddle:
>
> 1 - present: erlang-runtime_tools-18.3.4.5-4.el7ost.x86_64
>
> 2 - patch not present: installed version is
> puppet-rabbitmq-5.6.1-0.20170710225057.62fed8d.el7ost.noarch
>
> 3 - patch not present:
> puppet-tripleo-7.2.1-0.20170807233007.4600842.el7ost.noarch

I managed to get the upgrade passing successfully after manually applying the patches in 2) and 3), so I think we should keep this BZ open to track those changes landing in the downstream puddle.
(In reply to Marius Cornea from comment #10)
> (In reply to Marius Cornea from comment #9)
> ...
> I managed to get the upgrade passing successfully after manually applying
> the patches in 2) and 3), so I think we should keep this BZ open to track
> those changes landing in the downstream puddle.

So moving to POST (we aren't tracking the changes here; I assume we have a BZ for each of those individually already?) - or mcornea, should this go straight to ON_QA? wdyt
(In reply to marios from comment #11)
> so moving to POST (we aren't tracking the changes here; I assume we have a
> BZ for each of those individually already?) - or mcornea, should this go
> straight to ON_QA? wdyt

I think we should leave it on POST for now, as the changes have merged upstream but have not yet landed in a downstream build.
> Checking the latest OSP12 puddle:
>
> 1 - present: erlang-runtime_tools-18.3.4.5-4.el7ost.x86_64

NOT present in the RDO deps repo, where we have an old Fedora rebuild, 18.3.4.4:
https://github.com/rdo-common/erlang/commits/common-rdo

There is a newer erlang-19.3.6.1-1.el7 Fedora rebuild candidate build:
http://cbs.centos.org/koji/buildinfo?buildID=17627
Is the fix included, and is it safe to push to Common (for all RDO releases) or to Pike only?

Please propose an appropriate update to rdoinfo/deps.yml; for an example see https://review.rdoproject.org/r/#/c/8045/4/deps.yml
My preferred option would be the following:
1) Use erlang-runtime_tools-18.3.4.5-4.el7ost.x86_64 for anything up to Pike.
2) For master instead we need to decide which version we want. If we plan a rebase downstream for Queens then sure, let's update master upstream. Otherwise let's stay on 1) for master as well.

Peter, thoughts?
(In reply to Michele Baldessari from comment #14)
> My preferred option would be the following:
> 1) Use erlang-runtime_tools-18.3.4.5-4.el7ost.x86_64 for anything up to Pike.
> 2) For master instead we need to decide which version we want. If we plan a
> rebase downstream for Queens then sure, let's update master upstream.
> Otherwise let's stay on 1) for master as well.
>
> Peter, thoughts?

Hello All!

It's a strange situation when we have multiple Erlang builds floating around. Ours (and by "ours" I mean RHOS) has most if not all IPv6-related fixes applied. Even upstream's version 20 doesn't contain all the necessary pieces. So I strongly advise everyone to switch to our most recent build, which is erlang-18.3.4.5-4.el7ost at this moment. At least you'll have someone to blame :)

Regarding the switch to ver. 19 - we will build it soon (for RHOS 12 or, more likely, 13). Right now I believe that any recent build from the 18.x.y.z series "should be enough for everyone" (c)
1) puppet-rabbitmq

Not sure I can link a github PR, but I can confirm that the needed changes are in the puppetlabs-rabbitmq package (http://download-node-02.eng.bos.redhat.com/rcm-guest/puddles/OpenStack/12.0-RHEL-7/2017-08-31.1/RH7-RHOS-12.0/source/puppet-rabbitmq-5.6.1-0.20170808005755.5ac45de.el7ost.src.rpm). The commit was https://github.com/voxpupuli/puppet-rabbitmq/commit/5ac45dedd9b409c9efac654724bc74867cb9233b, so puppet-rabbitmq-5.6.1-0.20170808005755.5ac45de has the needed fix.

2) puppet-tripleo

puppet-tripleo-7.3.0-0.20170821114704.el7ost.src.rpm also has the fix for puppet-tripleo (https://review.openstack.org/#/c/475457/).

3) erlang

Downstream we do have the proper erlang build as well: http://download-node-02.eng.bos.redhat.com/rcm-guest/puddles/OpenStack/12.0-RHEL-7/2017-08-31.1/RH7-RHOS-12.0/source/erlang-18.3.4.5-4.el7ost.src.rpm

The only thing missing here AFAICT is the erlang build in RDO/upstream, which should not block things on this BZ.
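(A quick way to see this without linking the PR: the DLRN NVR embeds the short commit hash, so the package suffix can be matched directly against the upstream commit:)

$ rpm -qp --qf '%{version}-%{release}\n' puppet-rabbitmq-5.6.1-0.20170808005755.5ac45de.el7ost.src.rpm
5.6.1-0.20170808005755.5ac45de.el7ost
# the ".5ac45de" component is the short hash of voxpupuli/puppet-rabbitmq
# commit 5ac45ded..., i.e. the fix is included in this build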
Peter, can you please take care of the RDO/upstream thing? If you feel we need to involve someone else, feel free to include them here.
(In reply to Raoul Scarazzini from comment #17)
> Peter, can you please take care of the RDO/upstream thing? If you feel we
> need to involve someone else, feel free to include them here.

In progress.
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA.

For information on the advisory, and where to find the updated files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHEA-2017:3462