Description of problem:

FFU: network-related Pacemaker failed actions show up while running the Ceph upgrade. The following output can be seen in pcs status:

Failed Actions:
* ip-172.17.1.10_monitor_10000 on controller-0 'unknown error' (1): call=97, status=complete, exitreason='Unable to find nic or netmask.', last-rc-change='Wed May 2 22:37:38 2018', queued=0ms, exec=0ms
* ip-172.17.3.15_monitor_10000 on controller-0 'unknown error' (1): call=98, status=complete, exitreason='Unable to find nic or netmask.', last-rc-change='Wed May 2 22:37:38 2018', queued=0ms, exec=0ms
* ip-10.0.0.105_monitor_10000 on controller-2 'not running' (7): call=96, status=complete, exitreason='', last-rc-change='Wed May 2 22:37:22 2018', queued=0ms, exec=0ms
* ip-172.17.4.12_monitor_10000 on controller-2 'not running' (7): call=95, status=complete, exitreason='', last-rc-change='Wed May 2 22:37:42 2018', queued=0ms, exec=0ms
* rabbitmq_monitor_10000 on rabbitmq-bundle-2 'not running' (7): call=54, status=complete, exitreason='', last-rc-change='Wed May 2 22:37:34 2018', queued=0ms, exec=0ms
* galera_monitor_10000 on galera-bundle-2 'unknown error' (1): call=187, status=complete, exitreason='local node <controller-2> is started, but not in primary mode. Unknown state.', last-rc-change='Wed May 2 22:37:22 2018', queued=0ms, exec=0ms
* galera_monitor_0 on galera-bundle-1 'unknown error' (1): call=468, status=complete, exitreason='local node <controller-1> is started, but not in primary mode. Unknown state.', last-rc-change='Wed May 2 22:38:07 2018', queued=0ms, exec=838ms

After checking the haproxy backends we can see that all the mysql servers are DOWN:

()[root@controller-0 /]# echo "show stat" | socat /var/lib/haproxy/stats stdio | grep DOWN
mysql,controller-0.internalapi.localdomain,0,0,0,1,,3,0,0,,0,,0,0,3,0,DOWN,1,0,1,1,1,670,670,,1,12,1,,3,,2,0,,2,L4CON,,0,,,,,,,0,,,,0,0,,,,,670,Connection refused,,0,0,0,0,
mysql,controller-1.internalapi.localdomain,0,0,0,0,,0,0,0,,0,,0,0,0,0,DOWN,1,0,1,1,1,670,670,,1,12,2,,0,,2,0,,0,L4CON,,0,,,,,,,0,,,,0,0,,,,,-1,Connection refused,,0,0,0,0,
mysql,controller-2.internalapi.localdomain,0,0,0,0,,0,0,0,,0,,0,0,0,0,DOWN,1,0,1,1,1,670,670,,1,12,3,,0,,2,0,,0,L4CON,,0,,,,,,,0,,,,0,0,,,,,-1,Connection refused,,0,0,0,0,
mysql,BACKEND,0,0,0,3,410,14639,0,0,0,0,,14636,0,3,0,DOWN,0,0,0,,1,670,670,,1,12,0,,3,,1,33,,86,,,,,,,,,,,,,,0,0,0,0,0,0,670,,,0,0,0,0,
redis,controller-0.internalapi.localdomain,0,0,0,0,,0,0,0,,0,,0,0,0,0,DOWN,1,1,0,1,1,669,669,,1,19,1,,0,,2,0,,0,L4CON,,0,,,,,,,0,,,,0,0,,,,,-1,Connection refused at step 2 of tcp-check (send),,0,0,0,0,
redis,controller-1.internalapi.localdomain,0,0,0,0,,0,0,0,,0,,0,0,0,0,DOWN,1,1,0,1,1,659,659,,1,19,2,,0,,2,0,,0,L7TOUT,,10001,,,,,,,0,,,,0,0,,,,,-1,,,0,0,0,0,

Version-Release number of selected component (if applicable):
openstack-tripleo-heat-templates-8.0.2-9.el7ost.noarch

How reproducible:
100%

Steps to Reproduce:
1. Deploy OSP10 with 3 controllers + 2 computes + 3 ceph OSD nodes
2. openstack overcloud ffwd-upgrade prepare
3. openstack overcloud ffwd-upgrade run
4. openstack overcloud upgrade run --roles Controller --skip-tags validation
5. openstack overcloud upgrade run --roles Compute --skip-tags validation
6. openstack overcloud ffwd-upgrade converge
7. openstack overcloud upgrade run --roles CephStorage --skip-tags validation
8. openstack overcloud ceph-upgrade run
9. Wait for the process to finish
10. Check pcs status

Actual results:
Failed actions are reported and the mysql backends show as DOWN in haproxy.
Expected results:
The Pacemaker cluster status is not affected by the Ceph upgrade.

Additional info:
Attaching sosreports.
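A minimal sketch for reading just the backend state from the same haproxy stats socket used above, printing only proxy, server and status (the awk field position assumes the standard haproxy CSV layout, where the status field is column 18):

()[root@controller-0 /]# echo "show stat" | socat /var/lib/haproxy/stats stdio | awk -F, 'NR==1 || $18=="DOWN" {print $1, $2, $18}'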
Some notes from the connectivity troubleshooting. The haproxy config:

listen mysql
  bind 172.17.1.10:3306 transparent
  option tcpka
  option httpchk
  option tcplog
  stick on dst
  stick-table type ip size 1000
  timeout client 90m
  timeout server 90m
  server controller-0.internalapi.localdomain 172.17.1.22:3306 backup check inter 1s on-marked-down shutdown-sessions port 9200
  server controller-1.internalapi.localdomain 172.17.1.15:3306 backup check inter 1s on-marked-down shutdown-sessions port 9200
  server controller-2.internalapi.localdomain 172.17.1.13:3306 backup check inter 1s on-marked-down shutdown-sessions port 9200

()[root@controller-0 /]# curl 172.17.1.22:3306
5.5.5-10.1.20-MariaDB�)YdjjVOi�?�BM;~/iDZt\OWmysql_native_password!��#08S01Got packets out of order
curl (HTTP://172.17.1.22:3306/): response: 000, time: 0.001, size: 125
()[root@controller-0 /]#
()[root@controller-0 /]# curl 172.17.1.22:9200
curl (HTTP://172.17.1.22:9200/): response: 000, time: 0.000, size: 0
curl: (7) Failed connect to 172.17.1.22:9200; Connection refused

In other words, mysqld itself still answers on 3306, but the health-check listener on port 9200 that haproxy polls for these backends (clustercheck) refuses connections, so all mysql backends are marked DOWN.
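A hedged follow-up check (not from the original notes, and assuming root client credentials are available inside the galera bundle container): the "not in primary mode" exitreason in pcs status and the refused check port both point at the wsrep state, which can be read directly:

[root@controller-0 ~]# docker exec $(docker ps -q -f name=galera-bundle) mysql -e "SHOW STATUS LIKE 'wsrep_cluster_status'; SHOW STATUS LIKE 'wsrep_local_state_comment';"

wsrep_cluster_status should report Primary on a healthy node; anything else matches the errors above.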
So, a couple of thoughts around this. We observe two problems: failed actions around the VIPs (A) and galera down (B). They may or may not have the same root cause.

Let's look at the reason for (A), the VIPs, first. /var/log/messages on controller-0 shows the problem:

May 2 22:37:38 controller-0 IPaddr2(ip-172.17.1.10)[97808]: ERROR: Unable to find nic or netmask.
May 2 22:37:38 controller-0 IPaddr2(ip-172.17.3.15)[97809]: ERROR: Unable to find nic or netmask.
May 2 22:37:38 controller-0 IPaddr2(ip-172.17.1.10)[97808]: ERROR: [findif] failed
May 2 22:37:38 controller-0 IPaddr2(ip-172.17.3.15)[97809]: ERROR: [findif] failed
May 2 22:37:38 controller-0 lrmd[501095]: notice: ip-172.17.3.15_monitor_10000:97809:stderr [ ocf-exit-reason:Unable to find nic or netmask. ]
May 2 22:37:38 controller-0 lrmd[501095]: notice: ip-172.17.1.10_monitor_10000:97808:stderr [ ocf-exit-reason:Unable to find nic or netmask. ]

The IPaddr2 RA is just a script that brings up the VIPs; when it fails with that error message, it is because the NIC was pulled out from under it. In fact, a bit earlier in messages we see this:

May 2 22:37:37 controller-0 ntpd[984729]: Deleting interface #27 vlan30, 172.17.3.15#123, interface stats: received=0, sent=0, dropped=0, active_time=8867 secs
May 2 22:37:37 controller-0 ntpd[984729]: Deleting interface #26 vlan20, 172.17.1.10#123, interface stats: received=0, sent=0, dropped=0, active_time=8867 secs
<snip>
May 2 22:37:37 controller-0 ntpd[984729]: Deleting interface #5 vlan40, 172.17.4.10#123, interface stats: received=0, sent=0, dropped=0, active_time=15025 secs
May 2 22:37:37 controller-0 ntpd[984729]: Deleting interface #4 br-ex, 10.0.0.103#123, interface stats: received=72, sent=72, dropped=0, active_time=15025 secs

So this (for whatever reason) is due to 20-os-net-config bringing down the whole networking plane:

May 2 22:37:31 controller-0 os-collect-config: dib-run-parts Wed May 2 22:37:31 UTC 2018 20-os-apply-config completed
May 2 22:37:31 controller-0 os-collect-config: dib-run-parts Wed May 2 22:37:31 UTC 2018 Running /usr/libexec/os-refresh-config/configure.d/20-os-net-config
May 2 22:37:31 controller-0 os-collect-config: ++ os-apply-config --key os_net_config --type raw --key-default ''
May 2 22:37:31 controller-0 kernel: IPv4: martian source 172.17.1.22 from 10.0.0.101, on dev vlan20
May 2 22:37:31 controller-0 kernel: ll header: 00000000: 9a 9c f6 bf ac 66 72 c9 2c 7d f8 c7 08 00 .....fr.,}....
May 2 22:37:31 controller-0 os-collect-config: + NET_CONFIG='{"network_config": [{"dns_servers": ["10.0.0.1"], "addresses": [{"ip_netmask": "192.168.24.12/24"}], "routes": [{"default": true, "ip_netmask": "0.0.0.0/0", "next_hop": "192.168.24.1"}, {"ip_netmask": "169.254.169.254/32", "next_hop": "192.168.24.1"}], "use_dhcp": false, "type": "interface", "name": "nic1"}, {"use_dhcp": false, "type": "ovs_bridge", "name": "br-isolated", "members": [{"type": "interface", "name": "nic2", "primary": true}, {"type": "vlan", "addresses": [{"ip_netmask": "172.17.1.22/24"}], "vlan_id": 20}, {"type": "vlan", "addresses": [{"ip_netmask": "172.17.3.17/24"}], "vlan_id": 30}, {"type": "vlan", "addresses": [{"ip_netmask": "172.17.4.10/24"}], "vlan_id": 40}, {"type": "vlan", "addresses": [{"ip_netmask": "172.17.2.15/24"}], "vlan_id": 50}]}, {"addresses": [{"ip_netmask": "10.0.0.103/24"}], "members": [{"type": "interface", "name": "nic3", "primary": true}], "routes": [{"ip_netmask": "0.0.0.0/0", "next_hop": "10.0.0.1"}], "use_dhcp": false, "type": "ovs_bridge", "name": "br-ex"}]}'
May 2 22:37:31 controller-0 os-collect-config: + '[' -n '{"network_config": [{"dns_servers": ["10.0.0.1"], "addresses": [{"ip_netmask": "192.168.24.12/24"}], "routes": [{"default": true, "ip_netmask": "0.0.0.0/0", "next_hop": "192.168.24.1"}, {"ip_netmask": "169.254.169.254/32", "next_hop": "192.168.24.1"}], "use_dhcp": false, "type": "interface", "name": "nic1"}, {"use_dhcp": false, "type": "ovs_bridge", "name": "br-isolated", "members": [{"type": "interface", "name": "nic2", "primary": true}, {"type": "vlan", "addresses": [{"ip_netmask": "172.17.1.22/24"}], "vlan_id": 20}, {"type": "vlan", "addresses": [{"ip_netmask": "172.17.3.17/24"}], "vlan_id": 30}, {"type": "vlan", "addresses": [{"ip_netmask": "172.17.4.10/24"}], "vlan_id": 40}, {"type": "vlan", "addresses": [{"ip_netmask": "172.17.2.15/24"}], "vlan_id": 50}]}, {"addresses": [{"ip_netmask": "10.0.0.103/24"}], "members": [{"type": "interface", "name": "nic3", "primary": true}], "routes": [{"ip_netmask": "0.0.0.0/0", "next_hop": "10.0.0.1"}], "use_dhcp": false, "type": "ovs_bridge", "name": "br-ex"}]}' ']'
May 2 22:37:31 controller-0 os-collect-config: + trap configure_safe_defaults EXIT
May 2 22:37:31 controller-0 os-collect-config: + os-net-config -c /etc/os-net-config/config.json -v --detailed-exit-codes
May 2 22:37:31 controller-0 kernel: IPv4: martian source 172.17.1.22 from 10.0.0.101, on dev vlan20
May 2 22:37:31 controller-0 kernel: ll header: 00000000: 9a 9c f6 bf ac 66 72 c9 2c 7d f8 c7 08 00 .....fr.,}....
May 2 22:37:31 controller-0 os-collect-config: [2018/05/02 10:37:31 PM] [INFO] Using config file at: /etc/os-net-config/config.json
May 2 22:37:31 controller-0 os-collect-config: [2018/05/02 10:37:31 PM] [INFO] Ifcfg net config provider created.
May 2 22:37:31 controller-0 os-collect-config: [2018/05/02 10:37:31 PM] [INFO] Not using any mapping file.
May 2 22:37:32 controller-0 os-collect-config: [2018/05/02 10:37:32 PM] [INFO] Finding active nics
May 2 22:37:32 controller-0 os-collect-config: [2018/05/02 10:37:32 PM] [INFO] vxlan_sys_4789 is not an active nic
May 2 22:37:32 controller-0 os-collect-config: [2018/05/02 10:37:32 PM] [INFO] br-isolated is not an active nic
May 2 22:37:32 controller-0 os-collect-config: [2018/05/02 10:37:32 PM] [INFO] vlan40 is not an active nic
May 2 22:37:32 controller-0 os-collect-config: [2018/05/02 10:37:32 PM] [INFO] vlan50 is not an active nic
May 2 22:37:32 controller-0 os-collect-config: [2018/05/02 10:37:32 PM] [INFO] vlan20 is not an active nic
May 2 22:37:32 controller-0 os-collect-config: [2018/05/02 10:37:32 PM] [INFO] vlan30 is not an active nic
May 2 22:37:32 controller-0 os-collect-config: [2018/05/02 10:37:32 PM] [INFO] br-int is not an active nic
May 2 22:37:32 controller-0 os-collect-config: [2018/05/02 10:37:32 PM] [INFO] br-tun is not an active nic
May 2 22:37:32 controller-0 os-collect-config: [2018/05/02 10:37:32 PM] [INFO] ovs-system is not an active nic
May 2 22:37:32 controller-0 os-collect-config: [2018/05/02 10:37:32 PM] [INFO] docker0 is not an active nic
May 2 22:37:32 controller-0 os-collect-config: [2018/05/02 10:37:32 PM] [INFO] br-ex is not an active nic
May 2 22:37:32 controller-0 os-collect-config: [2018/05/02 10:37:32 PM] [INFO] lo is not an active nic
May 2 22:37:32 controller-0 os-collect-config: [2018/05/02 10:37:32 PM] [INFO] eth2 is an embedded active nic
May 2 22:37:32 controller-0 os-collect-config: [2018/05/02 10:37:32 PM] [INFO] eth0 is an embedded active nic
May 2 22:37:32 controller-0 os-collect-config: [2018/05/02 10:37:32 PM] [INFO] eth1 is an embedded active nic
May 2 22:37:32 controller-0 os-collect-config: [2018/05/02 10:37:32 PM] [INFO] No DPDK mapping available in path (/var/lib/os-net-config/dpdk_mapping.yaml)
May 2 22:37:32 controller-0 os-collect-config: [2018/05/02 10:37:32 PM] [INFO] Active nics are ['eth0', 'eth1', 'eth2']
May 2 22:37:32 controller-0 os-collect-config: [2018/05/02 10:37:32 PM] [INFO] nic3 mapped to: eth2
May 2 22:37:32 controller-0 os-collect-config: [2018/05/02 10:37:32 PM] [INFO] nic2 mapped to: eth1
May 2 22:37:32 controller-0 os-collect-config: [2018/05/02 10:37:32 PM] [INFO] nic1 mapped to: eth0
May 2 22:37:32 controller-0 os-collect-config: [2018/05/02 10:37:32 PM] [INFO] adding interface: eth0
May 2 22:37:32 controller-0 os-collect-config: [2018/05/02 10:37:32 PM] [INFO] adding custom route for interface: eth0
May 2 22:37:32 controller-0 os-collect-config: [2018/05/02 10:37:32 PM] [INFO] adding bridge: br-isolated
May 2 22:37:32 controller-0 os-collect-config: [2018/05/02 10:37:32 PM] [INFO] adding interface: eth1
May 2 22:37:32 controller-0 os-collect-config: [2018/05/02 10:37:32 PM] [INFO] adding vlan: vlan20
May 2 22:37:32 controller-0 os-collect-config: [2018/05/02 10:37:32 PM] [INFO] adding vlan: vlan30
May 2 22:37:32 controller-0 os-collect-config: [2018/05/02 10:37:32 PM] [INFO] adding vlan: vlan40
May 2 22:37:32 controller-0 os-collect-config: [2018/05/02 10:37:32 PM] [INFO] adding vlan: vlan50
May 2 22:37:32 controller-0 os-collect-config: [2018/05/02 10:37:32 PM] [INFO] adding bridge: br-ex
May 2 22:37:32 controller-0 os-collect-config: [2018/05/02 10:37:32 PM] [INFO] adding custom route for interface: br-ex
May 2 22:37:32 controller-0 os-collect-config: [2018/05/02 10:37:32 PM] [INFO] adding interface: eth2
May 2 22:37:32 controller-0 os-collect-config: [2018/05/02 10:37:32 PM] [INFO] applying network configs...
May 2 22:37:32 controller-0 os-collect-config: [2018/05/02 10:37:32 PM] [INFO] running ifdown on interface: vlan20

(B) Galera is likely due to the same networking root cause. We see the following:

May 2 22:37:20 controller-0 corosync[500931]: [TOTEM ] A processor failed, forming new configuration.

This means the nodes cannot see each other due to network issues.

So the focus here should be on why os-net-config brings the network down. I am not sure why that is, though, and I am not too familiar with os-net-config. I do see that it was updated beforehand (May 02 18:07:10 Updated: os-net-config-8.4.1-1.el7ost.noarch). As a matter of fact, apart from a couple of ansible tasks that check whether yum is going to update os-net-config, I see no other mention of os-net-config between the time of the upgrade and the run that disrupts the networking. So, unless I got something wrong, it seems to me that the update of os-net-config is what is causing this.
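A hedged way to double-check that suspicion on one of the controllers (not part of the original analysis): rpm shows when os-net-config was updated, and its --noop mode prints the interface configuration it would write without applying anything, which shows whether the refreshed config would touch the vlan/bridge interfaces:

[root@controller-0 ~]# rpm -q --last os-net-config
[root@controller-0 ~]# os-net-config -c /etc/os-net-config/config.json -v --noop --detailed-exit-codes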
Thanks Michele. So this appears to be related to bug 1561255. I'm not able to reproduce this bug when applying the workaround for bug 1561255 (rm /usr/libexec/os-apply-config/templates/etc/os-net-config/config.json) before starting the upgrade process.
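A hedged sketch of applying that workaround on all controllers from the undercloud before the ffwd-upgrade steps; the heat-admin user is the usual overcloud login and the ctlplane addresses below are example values, only the rm path comes from the comment above:

# example controller ctlplane addresses; adjust to the environment
for ip in 192.168.24.12 192.168.24.13 192.168.24.15; do
    ssh heat-admin@"$ip" 'sudo rm -f /usr/libexec/os-apply-config/templates/etc/os-net-config/config.json'
done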
*** This bug has been marked as a duplicate of bug 1561255 ***