Bug 1571946 - FFU: openstack overcloud upgrade run --roles Controller --skip-tags validation gets stuck during TASK [Run puppet host configuration for step 5]
Summary: FFU: openstack overcloud upgrade run --roles Controller --skip-tags validation gets stuck during TASK [Run puppet host configuration for step 5]
Keywords:
Status: CLOSED NOTABUG
Alias: None
Product: Red Hat OpenStack
Classification: Red Hat
Component: rhosp-director
Version: 13.0 (Queens)
Hardware: Unspecified
OS: Unspecified
Priority: high
Severity: urgent
Target Milestone: ---
Target Release: ---
Assignee: Slawek Kaplonski
QA Contact: Amit Ugol
URL:
Whiteboard:
Depends On:
Blocks:
 
Reported: 2018-04-25 19:00 UTC by Marius Cornea
Modified: 2019-01-30 21:39 UTC
CC List: 7 users

Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed: 2018-05-04 16:55:51 UTC
Target Upstream Version:
Embargoed:


Attachments

Description Marius Cornea 2018-04-25 19:00:35 UTC
Description of problem:
FFU: openstack overcloud upgrade run --roles Controller --skip-tags validation gets stuck during TASK [Run puppet host configuration for step 5].

Investigation reveals that the controller nodes cannot reach the other controllers in the cluster over the internal_api address, hence the entire cluster goes down:

[root@controller-1 heat-admin]# pcs status
Cluster name: tripleo_cluster
Stack: corosync
Current DC: controller-1 (version 1.1.18-11.el7_5.2-2b07d5c5a9) - partition WITHOUT quorum
Last updated: Wed Apr 25 18:57:16 2018
Last change: Wed Apr 25 18:39:36 2018 by root via cibadmin on controller-0

12 nodes configured
36 resources configured

Online: [ controller-1 ]
OFFLINE: [ controller-0 controller-2 ]

Full list of resources:

 ip-172.17.1.15	(ocf::heartbeat:IPaddr2):	Stopped
 ip-172.17.4.12	(ocf::heartbeat:IPaddr2):	Stopped
 ip-172.17.3.11	(ocf::heartbeat:IPaddr2):	Stopped
 ip-10.0.0.104	(ocf::heartbeat:IPaddr2):	Stopped
 ip-192.168.24.15	(ocf::heartbeat:IPaddr2):	Stopped
 ip-172.17.1.17	(ocf::heartbeat:IPaddr2):	Stopped
 Docker container set: rabbitmq-bundle [rhos-qe-mirror-rdu2.usersys.redhat.com:5000/rhosp13/openstack-rabbitmq:pcmklatest]
   rabbitmq-bundle-0	(ocf::heartbeat:rabbitmq-cluster):	Stopped
   rabbitmq-bundle-1	(ocf::heartbeat:rabbitmq-cluster):	Stopped
   rabbitmq-bundle-2	(ocf::heartbeat:rabbitmq-cluster):	Stopped
 Docker container set: galera-bundle [rhos-qe-mirror-rdu2.usersys.redhat.com:5000/rhosp13/openstack-mariadb:pcmklatest]
   galera-bundle-0	(ocf::heartbeat:galera):	Stopped
   galera-bundle-1	(ocf::heartbeat:galera):	Stopped
   galera-bundle-2	(ocf::heartbeat:galera):	Stopped
 Docker container set: redis-bundle [rhos-qe-mirror-rdu2.usersys.redhat.com:5000/rhosp13/openstack-redis:pcmklatest]
   redis-bundle-0	(ocf::heartbeat:redis):	Stopped
   redis-bundle-1	(ocf::heartbeat:redis):	Stopped
   redis-bundle-2	(ocf::heartbeat:redis):	Stopped
 Docker container set: haproxy-bundle [rhos-qe-mirror-rdu2.usersys.redhat.com:5000/rhosp13/openstack-haproxy:pcmklatest]
   haproxy-bundle-docker-0	(ocf::heartbeat:docker):	Stopped
   haproxy-bundle-docker-1	(ocf::heartbeat:docker):	Stopped
   haproxy-bundle-docker-2	(ocf::heartbeat:docker):	Stopped

Daemon Status:
  corosync: active/enabled
  pacemaker: active/enabled
  pcsd: active/enabled
[root@controller-1 heat-admin]# 
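Note: the "partition WITHOUT quorum" state in the pcs output can be cross-checked directly with corosync (a sketch; exact output varies per environment):

[root@controller-1 heat-admin]# corosync-quorumtool -s
# With controller-0 and controller-2 unreachable, a 3-node cluster
# is expected to report "Quorate: No".
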
[root@controller-1 heat-admin]# ping -c1 controller-0
PING controller-0.localdomain (172.17.1.21) 56(84) bytes of data.
From controller-1.localdomain (172.17.1.24) icmp_seq=1 Destination Host Unreachable

--- controller-0.localdomain ping statistics ---
1 packets transmitted, 0 received, +1 errors, 100% packet loss, time 0ms

[root@controller-1 heat-admin]# ping -c1 controller-2
PING controller-2.localdomain (172.17.1.23) 56(84) bytes of data.
From controller-1.localdomain (172.17.1.24) icmp_seq=1 Destination Host Unreachable

--- controller-2.localdomain ping statistics ---
1 packets transmitted, 0 received, +1 errors, 100% packet loss, time 0ms
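
Note: to narrow this down, the internal_api addressing and the OVS port carrying it can be inspected like this (a sketch; the VLAN port name below is a hypothetical, it depends on the nic-config templates):

[root@controller-1 heat-admin]# ip -4 addr show | grep 172.17.1.
[root@controller-1 heat-admin]# ovs-vsctl list-ports br-isolated
[root@controller-1 heat-admin]# ovs-vsctl get Port vlan20 tag   # "vlan20" is an assumed port name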


Version-Release number of selected component (if applicable):
openstack-tripleo-heat-templates-8.0.2-0.20180416194362.29a5ad5.el7ost.noarch

How reproducible:
100%

Steps to Reproduce:
1. Deploy OSP10 with 3 controllers + 2 computes + 3 ceph osd nodes
2. Upgrade undercloud to OSP11/12/13
3. FFU prepare:
openstack overcloud ffwd-upgrade prepare \
--timeout 100 \
--templates /usr/share/openstack-tripleo-heat-templates \
--stack overcloud \
--libvirt-type kvm \
--ntp-server clock.redhat.com \
--control-scale 3 \
--control-flavor controller \
--compute-scale 2 \
--compute-flavor compute \
--ceph-storage-scale 3 \
--ceph-storage-flavor ceph \
-e /usr/share/openstack-tripleo-heat-templates/environments/storage-environment.yaml \
-e /home/stack/virt/internal.yaml \
-e /usr/share/openstack-tripleo-heat-templates/environments/network-isolation.yaml \
-e /home/stack/virt/network/network-environment.yaml \
-e /home/stack/virt/hostnames.yml \
-e /home/stack/virt/debug.yaml \
-e /home/stack/ffu_repos.yaml \
--container-registry-file docker-images.yaml


4. FFU run
openstack overcloud ffwd-upgrade run


5. FFU Controllers
openstack overcloud upgrade run --roles Controller --skip-tags validation


Actual results:
The upgrade gets stuck while running:

2018-04-25 14:45:13,500 p=6040 u=mistral |  TASK [Run puppet host configuration for step 5] ********************************

Controller nodes cannot reach the other controllers in the cluster via the internal_api network.
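
Note: the stuck task can also be followed from the undercloud in the config-download ansible log (assuming the default OSP13 location under /var/lib/mistral/<stack>/):

(undercloud) [stack@undercloud ~]$ sudo tail -f /var/lib/mistral/overcloud/ansible.log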

Expected results:
Connectivity doesn't break.

Additional info:
Attaching sosreports.

Comment 2 Marius Cornea 2018-04-25 21:57:41 UTC
I suspect this could be related to the creation of the Neutron containers. From what I can tell, things broke right after the neutron_ovs_agent container was created:

In /var/log/openvswitch/ovs-vswitchd.log:

2018-04-25T18:45:03.348Z|03024|rconn|WARN|br-ex<->tcp:127.0.0.1:6633: connection failed (Connection refused)
2018-04-25T18:45:03.348Z|03025|rconn|WARN|br-int<->tcp:127.0.0.1:6633: connection failed (Connection refused)
2018-04-25T18:45:03.348Z|03026|rconn|WARN|br-isolated<->tcp:127.0.0.1:6633: connection failed (Connection refused)
2018-04-25T18:45:04.842Z|03027|bridge|INFO|bridge br-int: deleted interface int-br-ex on port 1
2018-04-25T18:45:04.864Z|03028|bridge|INFO|bridge br-ex: deleted interface phy-br-ex on port 2
2018-04-25T18:45:04.890Z|03029|bridge|INFO|bridge br-int: deleted interface int-br-isolated on port 2
2018-04-25T18:45:04.901Z|03030|bridge|INFO|bridge br-isolated: deleted interface phy-br-isolated on port 6
2018-04-25T18:45:05.416Z|03031|fail_open|INFO|Still in fail-open mode after 5505 seconds disconnected from controller
2018-04-25T18:45:05.416Z|03032|fail_open|INFO|Still in fail-open mode after 5505 seconds disconnected from controller
2018-04-25T18:45:11.621Z|03033|rconn|INFO|br-tun<->tcp:127.0.0.1:6633: connected
2018-04-25T18:45:11.622Z|03034|rconn|INFO|br-ex<->tcp:127.0.0.1:6633: connected
2018-04-25T18:45:11.622Z|03035|rconn|INFO|br-int<->tcp:127.0.0.1:6633: connected
2018-04-25T18:45:11.622Z|03036|rconn|INFO|br-isolated<->tcp:127.0.0.1:6633: connected
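
Note: the "fail-open" messages mean the bridges were forwarding with standalone L2 behaviour while disconnected from the OpenFlow controller; once the agent reconnects and reprograms flows, traffic on ports under those bridges can be disrupted. The configured mode can be checked per bridge (a sketch):

[root@controller-0 ~]# ovs-vsctl get-fail-mode br-isolated
# Empty output or "standalone" means fail-open; "secure" means drop
# traffic until the controller installs flows.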


[root@controller-0 ~]# docker inspect neutron_ovs_agent | grep StartedAt
            "StartedAt": "2018-04-25T18:44:59.2953052Z",

In /var/log/messages:

Apr 25 18:45:05 controller-0 journal: + /usr/bin/neutron-openvswitch-agent --config-file /usr/share/neutron/neutron-dist.conf --config-file /etc/neutron/neutron.conf --config-file /etc/neutron/plugins/ml2/openvswitch_agent.ini --config-file /etc/neutron/plugins/ml2/ml2_conf.ini --config-dir /etc/neutron/conf.d/common --log-file=/var/log/neutron/openvswitch-agent.log


Note that the internal_api network is configured on an OVS VLAN interface under the br-isolated bridge, which is also used in the OVS bridge mappings:

/etc/neutron/plugins/ml2/openvswitch_agent.ini:bridge_mappings =datacentre:br-ex,tenant:br-isolated
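
Note: because br-isolated appears in bridge_mappings, the (now containerized) agent manages its flow table; what it programmed on the bridge carrying internal_api can be dumped like this (a sketch; flow contents are environment-specific):

[root@controller-0 ~]# ovs-vsctl list-ports br-isolated
[root@controller-0 ~]# ovs-ofctl dump-flows br-isolated
# Check whether the internal_api VLAN traffic still hits a NORMAL
# action or is dropped after the agent restart.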

Comment 3 Jose Luis Franco 2018-04-30 14:44:40 UTC
@Sofer, this BZ has been assigned to you during the triage duty call. Please feel free to reassign.

Comment 4 Marius Cornea 2018-05-04 16:55:51 UTC
I haven't been able to reproduce this issue in my latest upgrade attempt, so I'm closing this as NOTABUG until it can be reproduced again.

