Description of problem:
FFU: openstack overcloud upgrade run --roles Controller --skip-tags validation gets stuck during TASK [Run puppet host configuration for step 5]. Investigation reveals that the controller nodes cannot reach the other controllers in the cluster over the internal_api address, hence the entire cluster goes down:

[root@controller-1 heat-admin]# pcs status
Cluster name: tripleo_cluster
Stack: corosync
Current DC: controller-1 (version 1.1.18-11.el7_5.2-2b07d5c5a9) - partition WITHOUT quorum
Last updated: Wed Apr 25 18:57:16 2018
Last change: Wed Apr 25 18:39:36 2018 by root via cibadmin on controller-0

12 nodes configured
36 resources configured

Online: [ controller-1 ]
OFFLINE: [ controller-0 controller-2 ]

Full list of resources:

 ip-172.17.1.15	(ocf::heartbeat:IPaddr2):	Stopped
 ip-172.17.4.12	(ocf::heartbeat:IPaddr2):	Stopped
 ip-172.17.3.11	(ocf::heartbeat:IPaddr2):	Stopped
 ip-10.0.0.104	(ocf::heartbeat:IPaddr2):	Stopped
 ip-192.168.24.15	(ocf::heartbeat:IPaddr2):	Stopped
 ip-172.17.1.17	(ocf::heartbeat:IPaddr2):	Stopped
 Docker container set: rabbitmq-bundle [rhos-qe-mirror-rdu2.usersys.redhat.com:5000/rhosp13/openstack-rabbitmq:pcmklatest]
   rabbitmq-bundle-0	(ocf::heartbeat:rabbitmq-cluster):	Stopped
   rabbitmq-bundle-1	(ocf::heartbeat:rabbitmq-cluster):	Stopped
   rabbitmq-bundle-2	(ocf::heartbeat:rabbitmq-cluster):	Stopped
 Docker container set: galera-bundle [rhos-qe-mirror-rdu2.usersys.redhat.com:5000/rhosp13/openstack-mariadb:pcmklatest]
   galera-bundle-0	(ocf::heartbeat:galera):	Stopped
   galera-bundle-1	(ocf::heartbeat:galera):	Stopped
   galera-bundle-2	(ocf::heartbeat:galera):	Stopped
 Docker container set: redis-bundle [rhos-qe-mirror-rdu2.usersys.redhat.com:5000/rhosp13/openstack-redis:pcmklatest]
   redis-bundle-0	(ocf::heartbeat:redis):	Stopped
   redis-bundle-1	(ocf::heartbeat:redis):	Stopped
   redis-bundle-2	(ocf::heartbeat:redis):	Stopped
 Docker container set: haproxy-bundle [rhos-qe-mirror-rdu2.usersys.redhat.com:5000/rhosp13/openstack-haproxy:pcmklatest]
   haproxy-bundle-docker-0	(ocf::heartbeat:docker):	Stopped
   haproxy-bundle-docker-1	(ocf::heartbeat:docker):	Stopped
   haproxy-bundle-docker-2	(ocf::heartbeat:docker):	Stopped

Daemon Status:
  corosync: active/enabled
  pacemaker: active/enabled
  pcsd: active/enabled

[root@controller-1 heat-admin]# ping -c1 controller-0
PING controller-0.localdomain (172.17.1.21) 56(84) bytes of data.
From controller-1.localdomain (172.17.1.24) icmp_seq=1 Destination Host Unreachable

--- controller-0.localdomain ping statistics ---
1 packets transmitted, 0 received, +1 errors, 100% packet loss, time 0ms

[root@controller-1 heat-admin]# ping -c1 controller-2
PING controller-2.localdomain (172.17.1.23) 56(84) bytes of data.
From controller-1.localdomain (172.17.1.24) icmp_seq=1 Destination Host Unreachable

--- controller-2.localdomain ping statistics ---
1 packets transmitted, 0 received, +1 errors, 100% packet loss, time 0ms

Version-Release number of selected component (if applicable):
openstack-tripleo-heat-templates-8.0.2-0.20180416194362.29a5ad5.el7ost.noarch

How reproducible:
100%

Steps to Reproduce:
1. Deploy OSP10 with 3 controllers + 2 computes + 3 ceph osd nodes
2. Upgrade undercloud to OSP11/12/13
3. FFU prepare:
openstack overcloud ffwd-upgrade prepare \
 --timeout 100 \
 --templates /usr/share/openstack-tripleo-heat-templates \
 --stack overcloud \
 --libvirt-type kvm \
 --ntp-server clock.redhat.com \
 --control-scale 3 \
 --control-flavor controller \
 --compute-scale 2 \
 --compute-flavor compute \
 --ceph-storage-scale 3 \
 --ceph-storage-flavor ceph \
 -e /usr/share/openstack-tripleo-heat-templates/environments/storage-environment.yaml \
 -e /home/stack/virt/internal.yaml \
 -e /usr/share/openstack-tripleo-heat-templates/environments/network-isolation.yaml \
 -e /home/stack/virt/network/network-environment.yaml \
 -e /home/stack/virt/hostnames.yml \
 -e /home/stack/virt/debug.yaml \
 -e /home/stack/ffu_repos.yaml \
 --container-registry-file docker-images.yaml \

4. FFU run:
openstack overcloud ffwd-upgrade run

5. FFU Controllers:
openstack overcloud upgrade run --roles Controller --skip-tags validation

Actual results:
Gets stuck while running:

2018-04-25 14:45:13,500 p=6040 u=mistral |  TASK [Run puppet host configuration for step 5] ********************************

Controller nodes cannot reach the other controllers in the cluster via the internal_api network.

Expected results:
Connectivity doesn't break.

Additional info:
Attaching sosreports.
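
For reference, a minimal sketch of how the loss of internal_api connectivity can be confirmed on an affected controller. The interface name vlan20 is an assumption based on a typical TripleO network-isolation layout (not taken from this environment); the peer address 172.17.1.21 is controller-0's internal_api IP from the ping output above.

[root@controller-1 heat-admin]# ip -o addr | grep 172.17.1.     # locate the internal_api interface (vlan20 assumed below)
[root@controller-1 heat-admin]# ip addr show vlan20             # internal_api address should still be assigned
[root@controller-1 heat-admin]# ip neigh show dev vlan20        # ARP entries for the peer controllers
[root@controller-1 heat-admin]# ping -c1 -I vlan20 172.17.1.21  # ping controller-0's internal_api address directly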
I suspect this could be related to the creation of the Neutron containers; from what I can tell, things broke after the neutron_ovs_agent container was created.

In /var/log/openvswitch/ovs-vswitchd.log:

2018-04-25T18:45:03.348Z|03024|rconn|WARN|br-ex<->tcp:127.0.0.1:6633: connection failed (Connection refused)
2018-04-25T18:45:03.348Z|03025|rconn|WARN|br-int<->tcp:127.0.0.1:6633: connection failed (Connection refused)
2018-04-25T18:45:03.348Z|03026|rconn|WARN|br-isolated<->tcp:127.0.0.1:6633: connection failed (Connection refused)
2018-04-25T18:45:04.842Z|03027|bridge|INFO|bridge br-int: deleted interface int-br-ex on port 1
2018-04-25T18:45:04.864Z|03028|bridge|INFO|bridge br-ex: deleted interface phy-br-ex on port 2
2018-04-25T18:45:04.890Z|03029|bridge|INFO|bridge br-int: deleted interface int-br-isolated on port 2
2018-04-25T18:45:04.901Z|03030|bridge|INFO|bridge br-isolated: deleted interface phy-br-isolated on port 6
2018-04-25T18:45:05.416Z|03031|fail_open|INFO|Still in fail-open mode after 5505 seconds disconnected from controller
2018-04-25T18:45:05.416Z|03032|fail_open|INFO|Still in fail-open mode after 5505 seconds disconnected from controller
2018-04-25T18:45:11.621Z|03033|rconn|INFO|br-tun<->tcp:127.0.0.1:6633: connected
2018-04-25T18:45:11.622Z|03034|rconn|INFO|br-ex<->tcp:127.0.0.1:6633: connected
2018-04-25T18:45:11.622Z|03035|rconn|INFO|br-int<->tcp:127.0.0.1:6633: connected
2018-04-25T18:45:11.622Z|03036|rconn|INFO|br-isolated<->tcp:127.0.0.1:6633: connected

[root@controller-0 ~]# docker inspect neutron_ovs_agent | grep StartedAt
            "StartedAt": "2018-04-25T18:44:59.2953052Z",

In /var/log/messages:

Apr 25 18:45:05 controller-0 journal: + /usr/bin/neutron-openvswitch-agent --config-file /usr/share/neutron/neutron-dist.conf --config-file /etc/neutron/neutron.conf --config-file /etc/neutron/plugins/ml2/openvswitch_agent.ini --config-file /etc/neutron/plugins/ml2/ml2_conf.ini --config-dir /etc/neutron/conf.d/common --log-file=/var/log/neutron/openvswitch-agent.log

Note that the internal_api network is configured on an OVS VLAN interface under the br-isolated bridge, which is used in the OVS bridge mappings:

/etc/neutron/plugins/ml2/openvswitch_agent.ini:bridge_mappings = datacentre:br-ex,tenant:br-isolated
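
To check that suspicion, this is a minimal sketch of the correlation and OVS state checks on an affected controller; these are generic docker/openvswitch commands, not output captured from this environment:

[root@controller-0 ~]# docker inspect --format '{{.State.StartedAt}}' neutron_ovs_agent   # agent container start time
[root@controller-0 ~]# grep 'deleted interface' /var/log/openvswitch/ovs-vswitchd.log     # bridge changes around that time
[root@controller-0 ~]# ovs-vsctl list-ports br-isolated   # the internal_api VLAN port should still be attached to the bridge
[root@controller-0 ~]# ovs-ofctl dump-flows br-isolated   # a flushed table (no NORMAL entry) would explain the host losing internal_api reachability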
@Sofer, this BZ has been assigned to you during the triage duty call. Please feel free to reassign.
I haven't been able to reproduce this issue in my latest upgrade attempt, so I'm closing this as not a bug until we have a further reproducer.