Bug 1574258 - FFU: network related Pacemaker failed actions show up while running the Ceph upgrade
Status: CLOSED DUPLICATE of bug 1561255
Alias: None
Product: Red Hat OpenStack
Classification: Red Hat
Component: rhosp-director
Version: 13.0 (Queens)
Hardware: Unspecified
OS: Unspecified
Priority: unspecified
Severity: urgent
Target Milestone: ---
Target Release: ---
Assignee: RHOS Maint
QA Contact: Amit Ugol
 
Reported: 2018-05-02 22:53 UTC by Marius Cornea
Modified: 2018-05-04 15:56 UTC

Last Closed: 2018-05-04 15:56:26 UTC



Description Marius Cornea 2018-05-02 22:53:44 UTC
Description of problem:

FFU: network-related Pacemaker failed actions show up while running the Ceph upgrade. The following output appears in pcs status:

Failed Actions:
* ip-172.17.1.10_monitor_10000 on controller-0 'unknown error' (1): call=97, status=complete, exitreason='Unable to find nic or netmask.',
    last-rc-change='Wed May  2 22:37:38 2018', queued=0ms, exec=0ms
* ip-172.17.3.15_monitor_10000 on controller-0 'unknown error' (1): call=98, status=complete, exitreason='Unable to find nic or netmask.',
    last-rc-change='Wed May  2 22:37:38 2018', queued=0ms, exec=0ms
* ip-10.0.0.105_monitor_10000 on controller-2 'not running' (7): call=96, status=complete, exitreason='',
    last-rc-change='Wed May  2 22:37:22 2018', queued=0ms, exec=0ms
* ip-172.17.4.12_monitor_10000 on controller-2 'not running' (7): call=95, status=complete, exitreason='',
    last-rc-change='Wed May  2 22:37:42 2018', queued=0ms, exec=0ms
* rabbitmq_monitor_10000 on rabbitmq-bundle-2 'not running' (7): call=54, status=complete, exitreason='',
    last-rc-change='Wed May  2 22:37:34 2018', queued=0ms, exec=0ms
* galera_monitor_10000 on galera-bundle-2 'unknown error' (1): call=187, status=complete, exitreason='local node <controller-2> is started, but not in primary mode. Unknown state.',
    last-rc-change='Wed May  2 22:37:22 2018', queued=0ms, exec=0ms
* galera_monitor_0 on galera-bundle-1 'unknown error' (1): call=468, status=complete, exitreason='local node <controller-1> is started, but not in primary mode. Unknown state.',
    last-rc-change='Wed May  2 22:38:07 2018', queued=0ms, exec=838ms


After checking the haproxy backends, we can see that all the mysql servers are DOWN:

()[root@controller-0 /]# echo "show stat" |  socat /var/lib/haproxy/stats stdio | grep DOWN
mysql,controller-0.internalapi.localdomain,0,0,0,1,,3,0,0,,0,,0,0,3,0,DOWN,1,0,1,1,1,670,670,,1,12,1,,3,,2,0,,2,L4CON,,0,,,,,,,0,,,,0,0,,,,,670,Connection refused,,0,0,0,0,
mysql,controller-1.internalapi.localdomain,0,0,0,0,,0,0,0,,0,,0,0,0,0,DOWN,1,0,1,1,1,670,670,,1,12,2,,0,,2,0,,0,L4CON,,0,,,,,,,0,,,,0,0,,,,,-1,Connection refused,,0,0,0,0,
mysql,controller-2.internalapi.localdomain,0,0,0,0,,0,0,0,,0,,0,0,0,0,DOWN,1,0,1,1,1,670,670,,1,12,3,,0,,2,0,,0,L4CON,,0,,,,,,,0,,,,0,0,,,,,-1,Connection refused,,0,0,0,0,
mysql,BACKEND,0,0,0,3,410,14639,0,0,0,0,,14636,0,3,0,DOWN,0,0,0,,1,670,670,,1,12,0,,3,,1,33,,86,,,,,,,,,,,,,,0,0,0,0,0,0,670,,,0,0,0,0,
redis,controller-0.internalapi.localdomain,0,0,0,0,,0,0,0,,0,,0,0,0,0,DOWN,1,1,0,1,1,669,669,,1,19,1,,0,,2,0,,0,L4CON,,0,,,,,,,0,,,,0,0,,,,,-1,Connection refused at step 2 of tcp-check (send),,0,0,0,0,
redis,controller-1.internalapi.localdomain,0,0,0,0,,0,0,0,,0,,0,0,0,0,DOWN,1,1,0,1,1,659,659,,1,19,2,,0,,2,0,,0,L7TOUT,,10001,,,,,,,0,,,,0,0,,,,,-1,,,0,0,0,0,
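
For a quicker read of just the backend states, the same CSV can be trimmed to the proxy, server, and status columns (a sketch; it assumes the standard HAProxy CSV stat layout, where status is the 18th field):

echo "show stat" | socat /var/lib/haproxy/stats stdio | cut -d, -f1,2,18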

Version-Release number of selected component (if applicable):
openstack-tripleo-heat-templates-8.0.2-9.el7ost.noarch

How reproducible:
100%

Steps to Reproduce:

1. Deploy OSP10 with 3 controllers + 2 computes + 3 ceph OSD nodes
2. openstack overcloud ffwd-upgrade prepare
3. openstack overcloud ffwd-upgrade run
4. openstack overcloud upgrade run --roles Controller --skip-tags validation
5. openstack overcloud upgrade run --roles Compute --skip-tags validation
6. openstack overcloud ffwd-upgrade converge
7. openstack overcloud upgrade run --roles CephStorage --skip-tags validation
8. openstack overcloud ceph-upgrade run
9. wait for the process to finish
10. check pcs status
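
For step 10, a quick non-interactive check (a sketch; the grep pattern assumes the default pcs status output) is:

pcs status | grep -A 20 'Failed Actions'
# once the root cause is addressed, the failure history can be cleared with:
pcs resource cleanup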

Actual results:
Failed actions are reported and mysql backends show as down.

Expected results:
The Pacemaker cluster status is not affected by the ceph upgrade.

Additional info:
Attaching sosreports.

Comment 1 Marius Cornea 2018-05-02 23:02:01 UTC
Some notes on the connectivity troubleshooting:

haproxy config:

listen mysql
  bind 172.17.1.10:3306 transparent
  option tcpka
  option httpchk
  option tcplog
  stick on dst
  stick-table type ip size 1000
  timeout client 90m
  timeout server 90m
  server controller-0.internalapi.localdomain 172.17.1.22:3306 backup check inter 1s on-marked-down shutdown-sessions port 9200
  server controller-1.internalapi.localdomain 172.17.1.15:3306 backup check inter 1s on-marked-down shutdown-sessions port 9200
  server controller-2.internalapi.localdomain 172.17.1.13:3306 backup check inter 1s on-marked-down shutdown-sessions port 9200


()[root@controller-0 /]# curl 172.17.1.22:3306
5.5.5-10.1.20-MariaDB�)YdjjVOi�?�BM;~/iDZt\OWmysql_native_password!��#08S01Got packets out of ordercurl (HTTP://172.17.1.22:3306/): response: 000, time: 0.001, size: 125
()[root@controller-0 /]# 
()[root@controller-0 /]# curl 172.17.1.22:9200
curl (HTTP://172.17.1.22:9200/): response: 000, time: 0.000, size: 0
curl: (7) Failed connect to 172.17.1.22:9200; Connection refused
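
Since port 9200 is the galera health-check endpoint that haproxy polls (per the 'port 9200' option in the config above), the refused connection matches the L4CON/Connection refused entries in the stats. The cluster state can also be queried directly from mysqld on a controller (a sketch; in OSP13 mysqld runs inside the galera bundle container and credentials/paths are environment-specific):

mysql -e "SHOW GLOBAL STATUS LIKE 'wsrep_cluster_status';"
mysql -e "SHOW GLOBAL STATUS LIKE 'wsrep_ready';"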

Comment 2 Michele Baldessari 2018-05-03 09:32:05 UTC
So, a couple of thoughts around this. We observe two problems: (A) failed actions around the VIPs and (B) galera being down. They might or might not have the same root cause.

Let's look at the reason for (A), the VIPs, first:
From /var/log/messages on controller-0 we see the problem:
May  2 22:37:38 controller-0 IPaddr2(ip-172.17.1.10)[97808]: ERROR: Unable to find nic or netmask.
May  2 22:37:38 controller-0 IPaddr2(ip-172.17.3.15)[97809]: ERROR: Unable to find nic or netmask.
May  2 22:37:38 controller-0 IPaddr2(ip-172.17.1.10)[97808]: ERROR: [findif] failed
May  2 22:37:38 controller-0 IPaddr2(ip-172.17.3.15)[97809]: ERROR: [findif] failed
May  2 22:37:38 controller-0 lrmd[501095]:  notice: ip-172.17.3.15_monitor_10000:97809:stderr [ ocf-exit-reason:Unable to find nic or netmask. ]
May  2 22:37:38 controller-0 lrmd[501095]:  notice: ip-172.17.1.10_monitor_10000:97808:stderr [ ocf-exit-reason:Unable to find nic or netmask. ]

The IPaddr2 RA is just a script that brings up the VIPs; if it fails with that error message, it is because the nic was pulled out from under us. In fact, a bit earlier in messages we see this:
May  2 22:37:37 controller-0 ntpd[984729]: Deleting interface #27 vlan30, 172.17.3.15#123, interface stats: received=0, sent=0, dropped=0, active_time=8867 secs
May  2 22:37:37 controller-0 ntpd[984729]: Deleting interface #26 vlan20, 172.17.1.10#123, interface stats: received=0, sent=0, dropped=0, active_time=8867 secs
<snip>
May  2 22:37:37 controller-0 ntpd[984729]: Deleting interface #5 vlan40, 172.17.4.10#123, interface stats: received=0, sent=0, dropped=0, active_time=15025 secs
May  2 22:37:37 controller-0 ntpd[984729]: Deleting interface #4 br-ex, 10.0.0.103#123, interface stats: received=72, sent=72, dropped=0, active_time=15025 secs
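
The 'Unable to find nic or netmask' exit reason comes from findif, which simply means no local interface or route covered the VIP at monitor time; that can be confirmed on the node with standard ip(8) queries (a sketch, using one of the affected VIPs):

ip -o addr show | grep 172.17.1.
ip route get 172.17.1.10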

So this (for whatever reason) is due to 20-os-net-config bringing down the whole networking plane:
May  2 22:37:31 controller-0 os-collect-config: dib-run-parts Wed May  2 22:37:31 UTC 2018 20-os-apply-config completed
May  2 22:37:31 controller-0 os-collect-config: dib-run-parts Wed May  2 22:37:31 UTC 2018 Running /usr/libexec/os-refresh-config/configure.d/20-os-net-config
May  2 22:37:31 controller-0 os-collect-config: ++ os-apply-config --key os_net_config --type raw --key-default ''
May  2 22:37:31 controller-0 kernel: IPv4: martian source 172.17.1.22 from 10.0.0.101, on dev vlan20
May  2 22:37:31 controller-0 kernel: ll header: 00000000: 9a 9c f6 bf ac 66 72 c9 2c 7d f8 c7 08 00        .....fr.,}....
May  2 22:37:31 controller-0 os-collect-config: + NET_CONFIG='{"network_config": [{"dns_servers": ["10.0.0.1"], "addresses": [{"ip_netmask": "192.168.24.12/24"}], "routes": [{"default": true, "ip_netmask": "0.0.0.0/0", "next_hop": "192.168.24.1"}, {"ip_netmask": "169.254.169.254/32", "next_hop": "192.168.24.1"}], "use_dhcp": false, "type": "interface", "name": "nic1"}, {"use_dhcp": false, "type": "ovs_bridge", "name": "br-isolated", "members": [{"type": "interface", "name": "nic2", "primary": true}, {"type": "vlan", "addresses": [{"ip_netmask": "172.17.1.22/24"}], "vlan_id": 20}, {"type": "vlan", "addresses": [{"ip_netmask": "172.17.3.17/24"}], "vlan_id": 30}, {"type": "vlan", "addresses": [{"ip_netmask": "172.17.4.10/24"}], "vlan_id": 40}, {"type": "vlan", "addresses": [{"ip_netmask": "172.17.2.15/24"}], "vlan_id": 50}]}, {"addresses": [{"ip_netmask": "10.0.0.103/24"}], "members": [{"type": "interface", "name": "nic3", "primary": true}], "routes": [{"ip_netmask": "0.0.0.0/0", "next_hop": "10.0.0.1"}], "use_dhcp": false, "type": "ovs_bridge", "name": "br-ex"}]}'
May  2 22:37:31 controller-0 os-collect-config: + '[' -n '{"network_config": [{"dns_servers": ["10.0.0.1"], "addresses": [{"ip_netmask": "192.168.24.12/24"}], "routes": [{"default": true, "ip_netmask": "0.0.0.0/0", "next_hop": "192.168.24.1"}, {"ip_netmask": "169.254.169.254/32", "next_hop": "192.168.24.1"}], "use_dhcp": false, "type": "interface", "name": "nic1"}, {"use_dhcp": false, "type": "ovs_bridge", "name": "br-isolated", "members": [{"type": "interface", "name": "nic2", "primary": true}, {"type": "vlan", "addresses": [{"ip_netmask": "172.17.1.22/24"}], "vlan_id": 20}, {"type": "vlan", "addresses": [{"ip_netmask": "172.17.3.17/24"}], "vlan_id": 30}, {"type": "vlan", "addresses": [{"ip_netmask": "172.17.4.10/24"}], "vlan_id": 40}, {"type": "vlan", "addresses": [{"ip_netmask": "172.17.2.15/24"}], "vlan_id": 50}]}, {"addresses": [{"ip_netmask": "10.0.0.103/24"}], "members": [{"type": "interface", "name": "nic3", "primary": true}], "routes": [{"ip_netmask": "0.0.0.0/0", "next_hop": "10.0.0.1"}], "use_dhcp": false, "type": "ovs_bridge", "name": "br-ex"}]}' ']'
May  2 22:37:31 controller-0 os-collect-config: + trap configure_safe_defaults EXIT
May  2 22:37:31 controller-0 os-collect-config: + os-net-config -c /etc/os-net-config/config.json -v --detailed-exit-codes
May  2 22:37:31 controller-0 kernel: IPv4: martian source 172.17.1.22 from 10.0.0.101, on dev vlan20
May  2 22:37:31 controller-0 kernel: ll header: 00000000: 9a 9c f6 bf ac 66 72 c9 2c 7d f8 c7 08 00        .....fr.,}....
May  2 22:37:31 controller-0 os-collect-config: [2018/05/02 10:37:31 PM] [INFO] Using config file at: /etc/os-net-config/config.json
May  2 22:37:31 controller-0 os-collect-config: [2018/05/02 10:37:31 PM] [INFO] Ifcfg net config provider created.
May  2 22:37:31 controller-0 os-collect-config: [2018/05/02 10:37:31 PM] [INFO] Not using any mapping file.
May  2 22:37:32 controller-0 os-collect-config: [2018/05/02 10:37:32 PM] [INFO] Finding active nics
May  2 22:37:32 controller-0 os-collect-config: [2018/05/02 10:37:32 PM] [INFO] vxlan_sys_4789 is not an active nic
May  2 22:37:32 controller-0 os-collect-config: [2018/05/02 10:37:32 PM] [INFO] br-isolated is not an active nic
May  2 22:37:32 controller-0 os-collect-config: [2018/05/02 10:37:32 PM] [INFO] vlan40 is not an active nic
May  2 22:37:32 controller-0 os-collect-config: [2018/05/02 10:37:32 PM] [INFO] vlan50 is not an active nic
May  2 22:37:32 controller-0 os-collect-config: [2018/05/02 10:37:32 PM] [INFO] vlan20 is not an active nic
May  2 22:37:32 controller-0 os-collect-config: [2018/05/02 10:37:32 PM] [INFO] vlan30 is not an active nic
May  2 22:37:32 controller-0 os-collect-config: [2018/05/02 10:37:32 PM] [INFO] br-int is not an active nic
May  2 22:37:32 controller-0 os-collect-config: [2018/05/02 10:37:32 PM] [INFO] br-tun is not an active nic
May  2 22:37:32 controller-0 os-collect-config: [2018/05/02 10:37:32 PM] [INFO] ovs-system is not an active nic
May  2 22:37:32 controller-0 os-collect-config: [2018/05/02 10:37:32 PM] [INFO] docker0 is not an active nic
May  2 22:37:32 controller-0 os-collect-config: [2018/05/02 10:37:32 PM] [INFO] br-ex is not an active nic
May  2 22:37:32 controller-0 os-collect-config: [2018/05/02 10:37:32 PM] [INFO] lo is not an active nic
May  2 22:37:32 controller-0 os-collect-config: [2018/05/02 10:37:32 PM] [INFO] eth2 is an embedded active nic
May  2 22:37:32 controller-0 os-collect-config: [2018/05/02 10:37:32 PM] [INFO] eth0 is an embedded active nic
May  2 22:37:32 controller-0 os-collect-config: [2018/05/02 10:37:32 PM] [INFO] eth1 is an embedded active nic
May  2 22:37:32 controller-0 os-collect-config: [2018/05/02 10:37:32 PM] [INFO] No DPDK mapping available in path (/var/lib/os-net-config/dpdk_mapping.yaml)
May  2 22:37:32 controller-0 os-collect-config: [2018/05/02 10:37:32 PM] [INFO] Active nics are ['eth0', 'eth1', 'eth2']
May  2 22:37:32 controller-0 os-collect-config: [2018/05/02 10:37:32 PM] [INFO] nic3 mapped to: eth2
May  2 22:37:32 controller-0 os-collect-config: [2018/05/02 10:37:32 PM] [INFO] nic2 mapped to: eth1
May  2 22:37:32 controller-0 os-collect-config: [2018/05/02 10:37:32 PM] [INFO] nic1 mapped to: eth0
May  2 22:37:32 controller-0 os-collect-config: [2018/05/02 10:37:32 PM] [INFO] adding interface: eth0
May  2 22:37:32 controller-0 os-collect-config: [2018/05/02 10:37:32 PM] [INFO] adding custom route for interface: eth0
May  2 22:37:32 controller-0 os-collect-config: [2018/05/02 10:37:32 PM] [INFO] adding bridge: br-isolated
May  2 22:37:32 controller-0 os-collect-config: [2018/05/02 10:37:32 PM] [INFO] adding interface: eth1
May  2 22:37:32 controller-0 os-collect-config: [2018/05/02 10:37:32 PM] [INFO] adding vlan: vlan20
May  2 22:37:32 controller-0 os-collect-config: [2018/05/02 10:37:32 PM] [INFO] adding vlan: vlan30
May  2 22:37:32 controller-0 os-collect-config: [2018/05/02 10:37:32 PM] [INFO] adding vlan: vlan40
May  2 22:37:32 controller-0 os-collect-config: [2018/05/02 10:37:32 PM] [INFO] adding vlan: vlan50
May  2 22:37:32 controller-0 os-collect-config: [2018/05/02 10:37:32 PM] [INFO] adding bridge: br-ex
May  2 22:37:32 controller-0 os-collect-config: [2018/05/02 10:37:32 PM] [INFO] adding custom route for interface: br-ex
May  2 22:37:32 controller-0 os-collect-config: [2018/05/02 10:37:32 PM] [INFO] adding interface: eth2
May  2 22:37:32 controller-0 os-collect-config: [2018/05/02 10:37:32 PM] [INFO] applying network configs...
May  2 22:37:32 controller-0 os-collect-config: [2018/05/02 10:37:32 PM] [INFO] running ifdown on interface: vlan20
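
For reference, os-net-config can report what it would change without touching the interfaces, which would help confirm whether the restart was expected (a sketch; --noop prints the intended changes instead of applying them):

os-net-config -c /etc/os-net-config/config.json -v --noop --detailed-exit-codes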

(B) Galera is likely due to the same networking root cause. We see the following:
May  2 22:37:20 controller-0 corosync[500931]: [TOTEM ] A processor failed, forming new configuration.
This means the nodes cannot see each other due to network issues.
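
The corosync side of this can be checked from any controller with the standard tooling (a sketch):

corosync-quorumtool -s
pcs status corosync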

So the focus here should be on why os-net-config brings down the network.
I am not sure why that is, though, and I am not too familiar with os-net-config. I do see that it was updated beforehand (May 02 18:07:10 Updated: os-net-config-8.4.1-1.el7ost.noarch). As a matter of fact, I see no other mention of os-net-config (except a couple of ansible tasks that check whether yum is going to update os-net-config) between the time of the upgrade and the run that disrupts the networking.

So, unless I got something wrong, it seems to me that the upgrade of os-net-config is what is causing this.
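
The update time can be cross-checked against the package database on the node (a sketch):

rpm -q --last os-net-config
yum history list os-net-config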

Comment 3 Marius Cornea 2018-05-04 01:18:02 UTC
Thanks Michele. So this appears to be related to bug 1561255. I'm not able to reproduce this bug when applying the workaround for bug 1561255 (rm /usr/libexec/os-apply-config/templates/etc/os-net-config/config.json) before starting the upgrade process.
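
In case it helps anyone hitting this before the fix lands, the workaround can be applied to all overcloud nodes from the undercloud with a simple loop (a sketch; it assumes the default heat-admin user and the ctlplane addresses reported by 'openstack server list'):

source ~/stackrc
for ip in $(openstack server list -f value -c Networks | sed 's/ctlplane=//'); do
  ssh heat-admin@$ip "sudo rm -f /usr/libexec/os-apply-config/templates/etc/os-net-config/config.json"
done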

Comment 4 Marius Cornea 2018-05-04 15:56:26 UTC

*** This bug has been marked as a duplicate of bug 1561255 ***

