rhel-osp-director: After a minor update (includes rhel7.2->rhel7.3), rebooted the overcloud nodes. The controllers aren't reachable now.

Environment:
instack-undercloud-4.0.0-14.el7ost.noarch
openstack-tripleo-heat-templates-liberty-2.0.0-34.el7ost.noarch
openstack-tripleo-heat-templates-2.0.0-34.el7ost.noarch
openstack-puppet-modules-8.1.8-2.el7ost.noarch

Steps to reproduce:
1. Deploy the overcloud with:
openstack overcloud deploy --templates --control-scale 3 --compute-scale 1 --swift-storage-scale 0 --block-storage-scale 0 --neutron-network-type vxlan --neutron-tunnel-types vxlan --ntp-server clock.redhat.com --timeout 90 -e /usr/share/openstack-tripleo-heat-templates/environments/storage-environment.yaml -e /usr/share/openstack-tripleo-heat-templates/environments/network-isolation.yaml -e network-environment.yaml --ceph-storage-scale 1
2. Update the undercloud and reboot it.
3. Update the overcloud.
4. Reboot the overcloud nodes.
5. Try to reach any controller.

Result:
+--------------------------------------+-------------------------+--------+------------+-------------+---------------------+
| ID                                   | Name                    | Status | Task State | Power State | Networks            |
+--------------------------------------+-------------------------+--------+------------+-------------+---------------------+
| 6fb3b347-78ef-4ded-bd13-987d4bd174bc | overcloud-cephstorage-0 | ACTIVE | -          | Running     | ctlplane=192.0.2.7  |
| c4bdb5ef-b186-42f0-9f8e-a58c159200fc | overcloud-compute-0     | ACTIVE | -          | Running     | ctlplane=192.0.2.8  |
| e920d1bf-d25a-48b5-9d81-4be27809403d | overcloud-controller-0  | ACTIVE | -          | Running     | ctlplane=192.0.2.9  |
| 074dc00c-d79d-4408-a141-f4a2b43a77a9 | overcloud-controller-1  | ACTIVE | -          | Running     | ctlplane=192.0.2.10 |
| 6798ddd4-80a1-44e5-a17c-087865786fdf | overcloud-controller-2  | ACTIVE | -          | Running     | ctlplane=192.0.2.11 |
+--------------------------------------+-------------------------+--------+------------+-------------+---------------------+

The nodes appear to be running:
[root@seal42 ~]# virsh list
 Id    Name                           State
----------------------------------------------------
 21    instack                        running
 22    baremetalbrbm_0                running
 23    baremetalbrbm_1                running
 24    baremetalbrbm_2                running
 25    baremetalbrbm_5                running
 26    baremetalbrbm_7                running

[stack@instack ~]$ ironic node-list
+--------------------------------------+------+--------------------------------------+-------------+--------------------+-------------+
| UUID                                 | Name | Instance UUID                        | Power State | Provisioning State | Maintenance |
+--------------------------------------+------+--------------------------------------+-------------+--------------------+-------------+
| 8129c1bb-811d-4a58-9e25-3eb4993e7415 | None | 6fb3b347-78ef-4ded-bd13-987d4bd174bc | power on    | active             | False       |
| 757e91c1-c939-42d5-b610-1087845d8e47 | None | c4bdb5ef-b186-42f0-9f8e-a58c159200fc | power on    | active             | False       |
| 7879d19c-9ba9-4953-b1e4-5c9866a58038 | None | 6798ddd4-80a1-44e5-a17c-087865786fdf | power on    | active             | False       |
| 35588513-8014-4340-84ba-a47b53dc815b | None | None                                 | power off   | available          | False       |
| f8d34ad8-616e-45a0-881b-3956cc54c4fd | None | None                                 | power off   | available          | False       |
| d8d3dd17-a708-405f-a70d-617ccbc8cce7 | None | e920d1bf-d25a-48b5-9d81-4be27809403d | power on    | active             | False       |
| 53361161-4090-455e-9986-d1b7b3b30591 | None | None                                 | power off   | available          | False       |
| b0c652f3-e1b0-4c64-924f-2b6180cc4358 | None | 074dc00c-d79d-4408-a141-f4a2b43a77a9 | power on    | active             | False       |
+--------------------------------------+------+--------------------------------------+-------------+--------------------+-------------+

[stack@instack ~]$ ping -c1 -W1 192.0.2.7
PING 192.0.2.7 (192.0.2.7) 56(84) bytes of data.
64 bytes from 192.0.2.7: icmp_seq=1 ttl=64 time=0.233 ms
--- 192.0.2.7 ping statistics ---
1 packets transmitted, 1 received, 0% packet loss, time 0ms
rtt min/avg/max/mdev = 0.233/0.233/0.233/0.000 ms

[stack@instack ~]$ ping -c1 -W1 192.0.2.8
PING 192.0.2.8 (192.0.2.8) 56(84) bytes of data.
64 bytes from 192.0.2.8: icmp_seq=1 ttl=64 time=0.201 ms
--- 192.0.2.8 ping statistics ---
1 packets transmitted, 1 received, 0% packet loss, time 0ms
rtt min/avg/max/mdev = 0.201/0.201/0.201/0.000 ms

[stack@instack ~]$ ping -c1 -W1 192.0.2.9
PING 192.0.2.9 (192.0.2.9) 56(84) bytes of data.
--- 192.0.2.9 ping statistics ---
1 packets transmitted, 0 received, 100% packet loss, time 0ms

[stack@instack ~]$ ping -c1 -W1 192.0.2.10
PING 192.0.2.10 (192.0.2.10) 56(84) bytes of data.
--- 192.0.2.10 ping statistics ---
1 packets transmitted, 0 received, 100% packet loss, time 0ms

[stack@instack ~]$ ping -c1 -W1 192.0.2.11
PING 192.0.2.11 (192.0.2.11) 56(84) bytes of data.
--- 192.0.2.11 ping statistics ---
1 packets transmitted, 0 received, 100% packet loss, time 0ms

The controllers aren't reachable.
Assigning to the lifecycle team, at least for initial triage. @Ryan, assigning to you since it is EOD in the EU right now. (Omri will be around; he is anxious to test the update workflow with the reboot and is blocked, so please sync up with him if you get a chance to look at this.) At this point we are mostly interested in finding out what the specific issue is. Depending on the problem, we may need to reassign to a different DFG for further investigation (e.g. is it something we introduced and can fix ourselves, or is it related to a service that the appropriate DFG needs to look at?).
When I restarted the nodes a second time, instead of hanging, the ping test returned:

From 192.0.2.1 icmp_seq=1 Destination Host Unreachable

and tcpdump showed unanswered ARP requests:

ARP, Request who-has 192.0.2.10 tell instack.localdomain, length 28

I was able to connect to the controller consoles with vncviewer just fine, but no passwords are set on the nodes, so I can't log in. From what I gathered, the issue lies on the nodes themselves, which I can't inspect unless this test is repeated with a password set on the controller nodes. Can you run this again with a password set on the controllers so we can look around inside those nodes?
reassigning properly this time :) (I said Ryan but added Lucas sorry)
Just thinking: this could be related to the OVS issue (if OVS 2.5 was delivered with the minor update, we could be seeing the same as https://bugzilla.redhat.com/show_bug.cgi?id=1371840).
On a BM setup, I'm able to reach the controllers after reboot via ctlplane, but they're unable to reach the FW on the external network:

[stack@undercloud72 ~]$ nova list
+--------------------------------------+-------------------------+--------+------------+-------------+-----------------------+
| ID                                   | Name                    | Status | Task State | Power State | Networks              |
+--------------------------------------+-------------------------+--------+------------+-------------+-----------------------+
| b7f466ad-5cb9-4c8a-9804-7625d998a0c4 | overcloud-cephstorage-0 | ACTIVE | -          | Running     | ctlplane=192.168.0.8  |
| 47a8ff4d-47ae-48e1-8f96-1f59fc47ae8f | overcloud-cephstorage-1 | ACTIVE | -          | Running     | ctlplane=192.168.0.7  |
| a13e9271-b02e-4a90-9e0b-456e423b96e8 | overcloud-compute-0     | ACTIVE | -          | Running     | ctlplane=192.168.0.9  |
| 35c8fed9-fabe-4e63-a210-80b47a2dc18b | overcloud-controller-0  | ACTIVE | -          | Running     | ctlplane=192.168.0.12 |
| 66edde5e-abdc-4ec9-9f90-a1419eecd4e7 | overcloud-controller-1  | ACTIVE | -          | Running     | ctlplane=192.168.0.10 |
| 4abf7103-ff17-4d35-9a5b-021a1fcb85d5 | overcloud-controller-2  | ACTIVE | -          | Running     | ctlplane=192.168.0.11 |
+--------------------------------------+-------------------------+--------+------------+-------------+-----------------------+

[stack@undercloud72 ~]$ ssh heat-admin@192.168.0.10
[heat-admin@overcloud-controller-1 ~]$ ping 8.8.8.8
PING 8.8.8.8 (8.8.8.8) 56(84) bytes of data.
^C
--- 8.8.8.8 ping statistics ---
1 packets transmitted, 0 received, 100% packet loss, time 0ms

[heat-admin@overcloud-controller-1 ~]$ sudo ip route
default via 10.19.184.254 dev br-ex
10.19.94.0/24 dev br-nic2 proto kernel scope link src 10.19.94.12
10.19.95.0/24 dev vlan183 proto kernel scope link src 10.19.95.11
10.19.184.0/24 dev br-ex proto kernel scope link src 10.19.184.181
169.254.169.254 via 192.168.0.1 dev p2p1
192.168.0.0/24 dev p2p1 proto kernel scope link src 192.168.0.10
192.168.150.0/24 dev br-nic4 proto kernel scope link src 192.168.150.10
192.168.200.0/24 dev vlan103 proto kernel scope link src 192.168.200.11

[heat-admin@overcloud-controller-1 ~]$ ping 10.19.184.254
PING 10.19.184.254 (10.19.184.254) 56(84) bytes of data.
From 10.19.184.181 icmp_seq=1 Destination Host Unreachable
From 10.19.184.181 icmp_seq=2 Destination Host Unreachable
From 10.19.184.181 icmp_seq=3 Destination Host Unreachable
From 10.19.184.181 icmp_seq=4 Destination Host Unreachable

The non-controllers are routed through the undercloud, and they are able to ping the external world.
After the update, the openvswitch versions on the OC look like:

openstack-neutron-openvswitch-8.1.2-5.el7ost.noarch
python-openvswitch-2.4.0-1.el7.noarch
openvswitch-2.4.0-1.el7.x86_64
So apparently the underlying NIC isn't brought UP upon reboot:

[root@overcloud-controller-0 ~]# ovs-vsctl show
c978e8e1-7ab1-4942-9167-2205c3edb82b
    Bridge "br-nic4"
        Port "vlan103"
            tag: 103
            Interface "vlan103"
                type: internal
        Port "br-nic4"
            Interface "br-nic4"
                type: internal
        Port "p1p2"
            Interface "p1p2"
    Bridge "br-nic2"
        Port "em2"
            Interface "em2"
        Port "vlan183"
            tag: 183
            Interface "vlan183"
                type: internal
        Port "br-nic2"
            Interface "br-nic2"
                type: internal
    Bridge br-tun
        fail_mode: secure
        Port "vxlan-c0a8960a"
            Interface "vxlan-c0a8960a"
                type: vxlan
                options: {df_default="true", in_key=flow, local_ip="192.168.150.12", out_key=flow, remote_ip="192.168.150.10"}
        Port br-tun
            Interface br-tun
                type: internal
        Port patch-int
            Interface patch-int
                type: patch
                options: {peer=patch-tun}
        Port "vxlan-c0a8960d"
            Interface "vxlan-c0a8960d"
                type: vxlan
                options: {df_default="true", in_key=flow, local_ip="192.168.150.12", out_key=flow, remote_ip="192.168.150.13"}
        Port "vxlan-c0a8960b"
            Interface "vxlan-c0a8960b"
                type: vxlan
                options: {df_default="true", in_key=flow, local_ip="192.168.150.12", out_key=flow, remote_ip="192.168.150.11"}
    Bridge br-int
        fail_mode: secure
        Port br-int
            Interface br-int
                type: internal
        Port int-br-ex
            Interface int-br-ex
                type: patch
                options: {peer=phy-br-ex}
        Port patch-tun
            Interface patch-tun
                type: patch
                options: {peer=patch-int}
    Bridge br-ex
        Port br-ex
            Interface br-ex
                type: internal
    ovs_version: "2.4.0"

[root@overcloud-controller-0 ~]# ip a
1: lo: <LOOPBACK,UP,LOWER_UP> mtu 65536 qdisc noqueue state UNKNOWN qlen 1
    link/loopback 00:00:00:00:00:00 brd 00:00:00:00:00:00
    inet 127.0.0.1/8 scope host lo
       valid_lft forever preferred_lft forever
    inet6 ::1/128 scope host
       valid_lft forever preferred_lft forever
2: em1: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc mq state UP qlen 1000
    link/ether 44:a8:42:3b:2e:61 brd ff:ff:ff:ff:ff:ff
    inet6 2620:52:0:13b8:46a8:42ff:fe3b:2e61/64 scope global mngtmpaddr dynamic
       valid_lft 2591682sec preferred_lft 604482sec
    inet6 fe80::46a8:42ff:fe3b:2e61/64 scope link
       valid_lft forever preferred_lft forever
3: em2: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc mq master ovs-system state UP qlen 1000
    link/ether 44:a8:42:3b:2e:62 brd ff:ff:ff:ff:ff:ff
    inet6 fe80::46a8:42ff:fe3b:2e62/64 scope link
       valid_lft forever preferred_lft forever
4: p1p1: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc mq state UP qlen 1000
    link/ether 00:0a:f7:7f:24:88 brd ff:ff:ff:ff:ff:ff
    inet 192.168.0.12/24 brd 192.168.0.255 scope global p1p1
       valid_lft forever preferred_lft forever
    inet 192.168.0.6/32 brd 192.168.0.255 scope global p1p1
       valid_lft forever preferred_lft forever
    inet6 fe80::20a:f7ff:fe7f:2488/64 scope link
       valid_lft forever preferred_lft forever
5: p1p2: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc mq master ovs-system state UP qlen 1000
    link/ether 00:0a:f7:7f:24:89 brd ff:ff:ff:ff:ff:ff
    inet6 fe80::20a:f7ff:fe7f:2489/64 scope link
       valid_lft forever preferred_lft forever
6: ovs-system: <BROADCAST,MULTICAST> mtu 1500 qdisc noop state DOWN qlen 1000
    link/ether 4e:cb:c2:92:c4:b3 brd ff:ff:ff:ff:ff:ff
7: br-int: <BROADCAST,MULTICAST> mtu 1500 qdisc noop state DOWN qlen 1000
    link/ether 5a:6d:49:d1:7b:4c brd ff:ff:ff:ff:ff:ff
8: br-tun: <BROADCAST,MULTICAST> mtu 1500 qdisc noop state DOWN qlen 1000
    link/ether de:b7:9b:96:cf:47 brd ff:ff:ff:ff:ff:ff
9: vxlan_sys_4789: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 65470 qdisc noqueue master ovs-system state UNKNOWN qlen 1000
    link/ether 22:4e:22:76:0b:da brd ff:ff:ff:ff:ff:ff
11: vlan103: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc noqueue state UNKNOWN qlen 1000
    link/ether fa:06:80:c3:27:54 brd ff:ff:ff:ff:ff:ff
    inet 192.168.200.14/24 brd 192.168.200.255 scope global vlan103
       valid_lft forever preferred_lft forever
    inet 192.168.200.10/32 brd 192.168.200.255 scope global vlan103
       valid_lft forever preferred_lft forever
    inet6 fe80::f806:80ff:fec3:2754/64 scope link
       valid_lft forever preferred_lft forever
12: br-nic4: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc noqueue state UNKNOWN qlen 1000
    link/ether 00:0a:f7:7f:24:89 brd ff:ff:ff:ff:ff:ff
    inet 192.168.150.12/24 brd 192.168.150.255 scope global br-nic4
       valid_lft forever preferred_lft forever
    inet6 fe80::20a:f7ff:fe7f:2489/64 scope link
       valid_lft forever preferred_lft forever
13: br-nic2: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc noqueue state UNKNOWN qlen 1000
    link/ether 44:a8:42:3b:2e:62 brd ff:ff:ff:ff:ff:ff
    inet 10.19.94.15/24 brd 10.19.94.255 scope global br-nic2
       valid_lft forever preferred_lft forever
    inet6 2620:52:0:135e:46a8:42ff:fe3b:2e62/64 scope global mngtmpaddr dynamic
       valid_lft 2591793sec preferred_lft 604593sec
    inet6 fe80::46a8:42ff:fe3b:2e62/64 scope link
       valid_lft forever preferred_lft forever
14: vlan183: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc noqueue state UNKNOWN qlen 1000
    link/ether 9e:7a:72:fe:7a:74 brd ff:ff:ff:ff:ff:ff
    inet 10.19.95.15/24 brd 10.19.95.255 scope global vlan183
       valid_lft forever preferred_lft forever
    inet6 2620:52:0:135f:9c7a:72ff:fefe:7a74/64 scope global mngtmpaddr dynamic
       valid_lft 2591849sec preferred_lft 604649sec
    inet6 fe80::9c7a:72ff:fefe:7a74/64 scope link
       valid_lft forever preferred_lft forever
15: br-ex: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc noqueue state UNKNOWN qlen 1000
    link/ether 44:a8:42:3b:2e:61 brd ff:ff:ff:ff:ff:ff
    inet 10.19.184.182/24 brd 10.19.184.255 scope global br-ex
       valid_lft forever preferred_lft forever
    inet6 fe80::46a8:42ff:fe3b:2e61/64 scope link
       valid_lft forever preferred_lft forever

[root@overcloud-controller-0 ~]# ping 10.19.184.254 -c1
PING 10.19.184.254 (10.19.184.254) 56(84) bytes of data.
From 10.19.184.182 icmp_seq=1 Destination Host Unreachable
--- 10.19.184.254 ping statistics ---
1 packets transmitted, 0 received, +1 errors, 100% packet loss, time 0ms

[root@overcloud-controller-0 ~]# ifup em1
[root@overcloud-controller-0 ~]# ping 10.19.184.254 -c1
PING 10.19.184.254 (10.19.184.254) 56(84) bytes of data.
64 bytes from 10.19.184.254: icmp_seq=1 ttl=64 time=3.15 ms
--- 10.19.184.254 ping statistics ---
1 packets transmitted, 1 received, 0% packet loss, time 0ms
rtt min/avg/max/mdev = 3.159/3.159/3.159/0.000 ms
/etc/sysconfig/network-scripts/ifcfg-em1:

# This file is autogenerated by os-net-config
DEVICE=em1
ONBOOT=yes
HOTPLUG=no
NM_CONTROLLED=no
PEERDNS=no
DEVICETYPE=ovs
TYPE=OVSPort
OVS_BRIDGE=br-ex
BOOTPROTO=none
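Since the ifcfg file above ties em1 to br-ex via OVS_BRIDGE, a quick way to see which member NICs each bridge depends on is to scan the network-scripts directory for OVSPort entries. This is just a diagnostic sketch (the helper name and output format are mine, not part of os-net-config):

```shell
# List every OVSPort device and the bridge it belongs to, given a
# network-scripts directory. Hypothetical helper, not part of os-net-config.
list_ovs_ports() {
    scripts="$1"
    for f in "$scripts"/ifcfg-*; do
        [ -f "$f" ] || continue
        # only interfaces declared as OVS ports
        grep -q '^TYPE=OVSPort' "$f" || continue
        dev=$(sed -n 's/^DEVICE=//p' "$f")
        br=$(sed -n 's/^OVS_BRIDGE=//p' "$f")
        echo "$br <- $dev"
    done
}

# On a controller this would be:
#   list_ovs_ports /etc/sysconfig/network-scripts
```

On the affected controller this should report `br-ex <- em1` (among others), which is why `ifup br-ex` alone is not enough: the member NIC has to come up too for the bridge to pass traffic.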
More findings from that setup: I rebooted the controller where I hadn't brought connectivity up, to see whether the issue reproduces after one more reboot, and it did:

[heat-admin@overcloud-controller-2 ~]$ ping 10.19.184.254
PING 10.19.184.254 (10.19.184.254) 56(84) bytes of data.
^C
--- 10.19.184.254 ping statistics ---
3 packets transmitted, 0 received, 100% packet loss, time 2000ms

[heat-admin@overcloud-controller-2 ~]$ sudo ifup em1
[heat-admin@overcloud-controller-2 ~]$ ping 10.19.184.254
PING 10.19.184.254 (10.19.184.254) 56(84) bytes of data.
From 10.19.184.183 icmp_seq=1 Destination Host Unreachable
From 10.19.184.183 icmp_seq=2 Destination Host Unreachable
From 10.19.184.183 icmp_seq=3 Destination Host Unreachable
From 10.19.184.183 icmp_seq=4 Destination Host Unreachable
^C
--- 10.19.184.254 ping statistics ---
4 packets transmitted, 0 received, +4 errors, 100% packet loss, time 2999ms
pipe 4

[heat-admin@overcloud-controller-2 ~]$ sudo ifup br-ex
[heat-admin@overcloud-controller-2 ~]$ ping 10.19.184.254
PING 10.19.184.254 (10.19.184.254) 56(84) bytes of data.
From 10.19.184.183 icmp_seq=1 Destination Host Unreachable
From 10.19.184.183 icmp_seq=2 Destination Host Unreachable
From 10.19.184.183 icmp_seq=3 Destination Host Unreachable
From 10.19.184.183 icmp_seq=4 Destination Host Unreachable
^C
--- 10.19.184.254 ping statistics ---
4 packets transmitted, 0 received, +4 errors, 100% packet loss, time 2999ms
pipe 4

[heat-admin@overcloud-controller-2 ~]$ sudo ifdown br-ex
[heat-admin@overcloud-controller-2 ~]$ sudo ifup br-ex
[heat-admin@overcloud-controller-2 ~]$ ping 10.19.184.254
PING 10.19.184.254 (10.19.184.254) 56(84) bytes of data.
From 10.19.184.183 icmp_seq=1 Destination Host Unreachable
From 10.19.184.183 icmp_seq=2 Destination Host Unreachable
From 10.19.184.183 icmp_seq=3 Destination Host Unreachable
^C
--- 10.19.184.254 ping statistics ---
4 packets transmitted, 0 received, +3 errors, 100% packet loss, time 2998ms
pipe 3

[heat-admin@overcloud-controller-2 ~]$ sudo ifup em1
[heat-admin@overcloud-controller-2 ~]$ ping 10.19.184.254
PING 10.19.184.254 (10.19.184.254) 56(84) bytes of data.
64 bytes from 10.19.184.254: icmp_seq=1 ttl=64 time=58.9 ms
64 bytes from 10.19.184.254: icmp_seq=2 ttl=64 time=1.13 ms
64 bytes from 10.19.184.254: icmp_seq=3 ttl=64 time=0.978 ms
64 bytes from 10.19.184.254: icmp_seq=4 ttl=64 time=1.07 ms
^C
--- 10.19.184.254 ping statistics ---
4 packets transmitted, 4 received, 0% packet loss, time 3002ms
rtt min/avg/max/mdev = 0.978/15.531/58.942/25.063 ms

If I reboot that node now that the connectivity works, the connectivity remains working after the reboot.
So in order to be able to communicate on the external network, I had to:

sudo ifdown br-ex
sudo ifup br-ex
sudo ifup em1

After that, the communication kept working across subsequent reboots.
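For a larger number of nodes, the same three commands can be pushed over ssh from the undercloud. A rough sketch, assuming heat-admin access and the node IPs from `nova list` (the `bounce_bridge` helper and the DRY_RUN switch are my own, and the bridge/NIC names will differ per role and environment):

```shell
# Bounce the OVS bridge and bring the member NIC up on one node.
# Hypothetical helper; set DRY_RUN=1 to print commands instead of running them.
bounce_bridge() {
    node="$1" bridge="$2" nic="$3"
    cmd="sudo ifdown $bridge; sudo ifup $bridge; sudo ifup $nic"
    if [ "${DRY_RUN:-0}" = "1" ]; then
        echo "heat-admin@$node: $cmd"
    else
        ssh -o StrictHostKeyChecking=no "heat-admin@$node" "$cmd"
    fi
}

# Dry run over the controller ctlplane IPs from the report above.
DRY_RUN=1
for node in 192.0.2.9 192.0.2.10 192.0.2.11; do
    bounce_bridge "$node" br-ex em1
done
```

Of course this still assumes ssh to the nodes works at all, which is exactly what is broken here when the workaround is applied too late.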
what if a user can't ssh into their nodes? Also, what if you have 100 nodes? Bit of an inconvenience.
I am seeing something very similar. In my case all networks are down, but they use OVS bridges. I had to do:

ifdown em1 through em4
ifdown bond0
ifdown bond1
ifup bond0
ifup bond1

and the networks come up. Testing to see if reboot works.
I am having to use the root password or the heat-admin account. So far this is happening only to the 3 controller nodes.

(In reply to Ryan Hallisey from comment #11)
> what if a user can't ssh into their nodes? Also, what if you have 100 nodes?
> Bit of an inconvenience.
One other item: I am still on RHEL 7.2.
Assigned for root cause analysis.
Seems like the issue is intermittent. I didn't reproduce it on the last 2 attempts at deploying on baremetal setups.
Is this after you updated and upgraded, or before? Also, the last kernel update is dated 10/10/2016; is this before or after that date? If it is before, we lock the kernel version.
I can't find anything obvious in the comments for the root cause of those interfaces not being brought up. A sosreport from the controller nodes attached to the first bug report would help a lot. Could we get sosreports if this is reproduced again?
The sosreports for my install are attached to the Bugzilla: https://bugzilla.redhat.com/show_bug.cgi?id=1385143 The three controllers all experience this on every reboot.
I believe I have reproduced this error in a virt environment, although I still haven't identified a root cause.

A potential workaround seems to be bouncing the affected bridge interface (br-ex in my case) after the update but before rebooting:

sudo -i
ifdown eth0 ; ifup eth0 ; ifdown br-ex ; ifup br-ex ; ifup vlan10

After running the above command, I did have my ssh connection cut off, but the connectivity returned after a couple of seconds and I was able to reboot the controller without incident.

So it looks like the bridge gets in a bad way, and running ifdown/ifup on the bridge seems to fix the issue, but that doesn't explain why the bridge doesn't come up properly after a reboot when the workaround is not run.
(In reply to Dan Sneddon from comment #20)
> I believe I have reproduced this error in a virt environment, although I
> still haven't identified a root cause.
>
> A potential workaround seems to be bouncing the affected bridge interface
> (br-ex in my case) after the update but before rebooting.
>
> sudo -i
> ifdown eth0 ; ifup eth0 ; ifdown br-ex ; ifup br-ex ; ifup vlan10
>
> After running the above command, I did have my ssh connection cut off, but
> the connectivity returned after a couple of seconds and I was able to reboot
> the controller without incident.
>
> So it looks like the bridge gets in a bad way, and running ifdown/ifup on
> the bridge seems to fix the issue, but that doesn't explain why the bridge
> doesn't come up properly after a reboot when the workaround is not run.

Agreed that this works, but it assumes you have alternate SSH or console access, which means you need to ensure your images have a known user/password with sudo privileges in them.
Seems like I was able to reproduce the issue simply by deploying OSP9 with rhel7.3 images and rebooting the OC nodes.
I have just installed OSP 8 and manually updated only the kernel. Reboots on the controllers still work. I am now going to run a full minor update on the cluster.
Created attachment 1214246 [details] version lock after update
Created attachment 1214247 [details] versionlock before update
Can confirm: the node where I updated the kernel and then ran the update script will not cleanly reboot; the network needs to be restarted after reboot. Before the update it rebooted fine.
what is the version of OVS before and after upgrade?
(In reply to Randy Perryman from comment #26)
> Can confirm: Node I updated kernel and the ran update script will not
> cleanly reboot. Needs to have network restarted after reboot.
>
> Before update worked fine.

Thanks for performing this test. That rules out kernel bugs, and points squarely at the upgraded packages included in the update, perhaps neutron-openvswitch.
(In reply to arkady kanevsky from comment #27)
> what is the version of OVS before and after upgrade?

Randy, please confirm. In my testing, I found that the OVS package was not updated during the update, but the neutron-openvswitch package was updated.
This is the only difference between the two with a grep for vswitch:

s1:openstack-neutron-openvswitch-7.1.1-7.el7ost.*
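For reference, this kind of before/after comparison can be scripted: snapshot the rpm list before and after the minor update, then grep the diff. The file names and helper functions below are my own sketch, not part of any product tooling:

```shell
# Snapshot the installed package list (run once before and once after update).
snapshot_pkgs() {
    rpm -qa | sort > "$1"
}

# Show which packages matching a pattern changed between two snapshots.
# diff marks removed versions with '<' and added versions with '>'.
diff_pkgs() {
    diff "$1" "$2" | grep -i "$3"
}

# Example usage:
#   snapshot_pkgs /tmp/pkgs-before.txt
#   ... run the minor update ...
#   snapshot_pkgs /tmp/pkgs-after.txt
#   diff_pkgs /tmp/pkgs-before.txt /tmp/pkgs-after.txt vswitch
```

Run against the package sets reported in this bug, the diff should show only the neutron-openvswitch NVR changing while the openvswitch package itself stays put.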
By the way, this configuration is:

RHEL 7.2, OSP 8, with versionlock files - reboots work.
Unlock the versionlock, run the OpenStack update on the cluster - reboot fails.

So this affects OSP 8 and OSP 9.
I'll defer to DFG-networking to look at it. The issue occurs when rebooting OC nodes after performing a clean deployment of OSP on top of rhel7.3-based nodes, so the issue is unrelated to the upgrade/update procedure. That said, when updating nodes from rhel7.2 to rhel7.3 and rebooting, the issue also reproduces, and it therefore blocks some of our testing scenarios.
The scenario is: deploy OSP9 on rhel7.2 -> run the minor update (which takes the OS to rhel7.3) -> then, due to the kernel upgrade, we reboot and hit the issue.
Is this possibly related to: https://bugzilla.redhat.com/show_bug.cgi?id=1388286 ?
Can you try the following workaround? https://bugzilla.redhat.com/show_bug.cgi?id=1385096#c4
(In reply to Franck Baudin from comment #37)
> Can you try the following workaround?
>
> https://bugzilla.redhat.com/show_bug.cgi?id=1385096#c4

Looks like it's not related to that: em1, em2, em3, em4 have no assigned MAC addresses.

bond0 uses em1 & em3, taking MAC address xx:xx:xx:xx:71:61 for both interfaces.
bond1 uses em2 & em4, taking MAC address xx:xx:xx:xx:71:63 for both interfaces.

I see a little dance with em3 and em4 going up and down several times after being added to their respective bonds, which makes me suspect the switch. See:

[   26.368384] bond0: Adding slave em1
[   26.368397] i40e 0000:01:00.0 em1: already using mac address 14:9e:cf:2c:71:61
[   26.382889] i40e 0000:01:00.0 em1: NIC Link is Up 10 Gbps Full Duplex, Flow Control: None
[   26.383570] bond0: Enslaving em1 as a backup interface with an up link
[   26.660414] bond0: Adding slave em3
[   26.660426] i40e 0000:01:00.2 em3: set new mac address 14:9e:cf:2c:71:61
[   26.678255] i40e 0000:01:00.2 em3: NIC Link is Up 10 Gbps Full Duplex, Flow Control: None
[   26.678874] bond0: Enslaving em3 as a backup interface with an up link
[   26.684491] i40e 0000:01:00.2 em3: NIC Link is Down
[   26.781366] bond0: link status definitely down for interface em3, disabling it
[   27.057027] i40e 0000:01:00.2 em3: NIC Link is Up 10 Gbps Full Duplex, Flow Control: None
[   27.081064] bond0: link status definitely up for interface em3, 10000 Mbps full duplex
[   27.564049] device em1 entered promiscuous mode
[   27.564160] device em3 entered promiscuous mode
[   27.690440] bond0: link status up again after 0 ms for interface em1
[   27.818593] i40e 0000:01:00.2 em3: NIC Link is Down
[   27.823315] bond0: link status definitely down for interface em3, disabling it
[   28.198131] i40e 0000:01:00.2 em3: NIC Link is Up 10 Gbps Full Duplex, Flow Control: None
[   28.222943] bond0: link status definitely up for interface em3, 10000 Mbps full duplex

and:

[   28.844636] bond1: Adding slave em2
[   28.844648] i40e 0000:01:00.1 em2: already using mac address 14:9e:cf:2c:71:63
[   28.858938] i40e 0000:01:00.1 em2: NIC Link is Up 10 Gbps Full Duplex, Flow Control: None
[   28.859502] bond1: Enslaving em2 as a backup interface with an up link
[   29.215957] bond1: Adding slave em4
[   29.215968] i40e 0000:01:00.3 em4: set new mac address 14:9e:cf:2c:71:63
[   29.234158] i40e 0000:01:00.3 em4: NIC Link is Up 10 Gbps Full Duplex, Flow Control: None
[   29.234770] bond1: Enslaving em4 as a backup interface with an up link
[   29.241013] i40e 0000:01:00.3 em4: NIC Link is Down
[   29.336846] bond1: link status definitely down for interface em4, disabling it
[   29.756017] i40e 0000:01:00.3 em4: NIC Link is Up 10 Gbps Full Duplex, Flow Control: None
[   29.836358] bond1: link status definitely up for interface em4, 10000 Mbps full duplex
[   29.915285] device em2 entered promiscuous mode
[   29.915404] device em4 entered promiscuous mode
[   30.043148] bond1: link status up again after 0 ms for interface em2
[   30.173821] i40e 0000:01:00.3 em4: NIC Link is Down
[   30.176019] bond1: link status definitely down for interface em4, disabling it
[   30.583925] i40e 0000:01:00.3 em4: NIC Link is Up 10 Gbps Full Duplex, Flow Control: None
[   30.675528] bond1: link status definitely up for interface em4, 10000 Mbps full duplex

Is the switch properly configured for LACP (802.3ad) with ports em1+em3 and em2+em4? Maybe we had changes in the RHEL 7.3 kernel regarding how 802.3ad is handled?
After all the dancing, bond0/bond1 seem to be up, but it doesn't work. I'm moving this to the openvswitch component so they can have an eye on it. Here's a trace of the boot of one of the bonds:

Oct 18 21:15:30 overcloud-controller-1.localdomain kernel: Ethernet Channel Bonding Driver: v3.7.1 (April 27, 2011)
Oct 18 21:15:30 overcloud-controller-1.localdomain NetworkManager[1704]: <info>  (bond0): new Bond device (carrier: OFF, driver: 'bonding', ifindex: 17)
Oct 18 21:15:30 overcloud-controller-1.localdomain kernel: bond0: Setting MII monitoring interval to 100
Oct 18 21:15:30 overcloud-controller-1.localdomain kernel: bond0: Setting MII monitoring interval to 100
Oct 18 21:15:30 overcloud-controller-1.localdomain kernel: IPv6: ADDRCONF(NETDEV_UP): bond0: link is not ready
Oct 18 21:15:30 overcloud-controller-1.localdomain kernel: bond0: Setting MII monitoring interval to 100
Oct 18 21:15:30 overcloud-controller-1.localdomain kernel: bond0: Adding slave em1
Oct 18 21:15:30 overcloud-controller-1.localdomain kernel: i40e 0000:01:00.0 em1: already using mac address 14:9e:cf:2c:71:61
Oct 18 21:15:30 overcloud-controller-1.localdomain kernel: i40e 0000:01:00.0 em1: NIC Link is Up 10 Gbps Full Duplex, Flow Control: None
Oct 18 21:15:30 overcloud-controller-1.localdomain kernel: bond0: Enslaving em1 as a backup interface with an up link
Oct 18 21:15:30 overcloud-controller-1.localdomain NetworkManager[1704]: <info>  (bond0): bond slave em1 was enslaved
Oct 18 21:15:30 overcloud-controller-1.localdomain NetworkManager[1704]: <info>  (em1): enslaved to bond0
Oct 18 21:15:30 overcloud-controller-1.localdomain NetworkManager[1704]: <info>  (em1): link connected
Oct 18 21:15:30 overcloud-controller-1.localdomain kernel: bond0: Adding slave em3
Oct 18 21:15:30 overcloud-controller-1.localdomain kernel: i40e 0000:01:00.2 em3: set new mac address 14:9e:cf:2c:71:61
Oct 18 21:15:30 overcloud-controller-1.localdomain kernel: i40e 0000:01:00.2 em3: NIC Link is Up 10 Gbps Full Duplex, Flow Control: None
Oct 18 21:15:30 overcloud-controller-1.localdomain kernel: bond0: Enslaving em3 as a backup interface with an up link
Oct 18 21:15:30 overcloud-controller-1.localdomain kernel: IPv6: ADDRCONF(NETDEV_UP): bond0: link is not ready
Oct 18 21:15:30 overcloud-controller-1.localdomain kernel: IPv6: ADDRCONF(NETDEV_CHANGE): bond0: link becomes ready
Oct 18 21:15:30 overcloud-controller-1.localdomain NetworkManager[1704]: <info>  (bond0): bond slave em3 was enslaved
Oct 18 21:15:30 overcloud-controller-1.localdomain NetworkManager[1704]: <info>  (em3): enslaved to bond0
Oct 18 21:15:30 overcloud-controller-1.localdomain NetworkManager[1704]: <info>  (em3): link connected
Oct 18 21:15:30 overcloud-controller-1.localdomain NetworkManager[1704]: <info>  (bond0): link connected
Oct 18 21:15:30 overcloud-controller-1.localdomain kernel: i40e 0000:01:00.2 em3: NIC Link is Down
Oct 18 21:15:31 overcloud-controller-1.localdomain NetworkManager[1704]: <info>  (em3): link disconnected
Oct 18 21:15:31 overcloud-controller-1.localdomain kernel: bond0: link status definitely down for interface em3, disabling it
Oct 18 21:15:31 overcloud-controller-1.localdomain kernel: i40e 0000:01:00.2 em3: NIC Link is Up 10 Gbps Full Duplex, Flow Control: None
Oct 18 21:15:31 overcloud-controller-1.localdomain NetworkManager[1704]: <info>  (em3): link connected
Oct 18 21:15:31 overcloud-controller-1.localdomain kernel: bond0: link status definitely up for interface em3, 10000 Mbps full duplex
Oct 18 21:15:31 overcloud-controller-1.localdomain kernel: bond0: first active interface up!
Oct 18 21:15:31 overcloud-controller-1.localdomain ovs-vsctl[2233]: ovs|00001|vsctl|INFO|Called as ovs-vsctl -t 10 -- --if-exists del-port br-tenant bond0 -- add-port br-tenant bond0
Oct 18 21:15:31 overcloud-controller-1.localdomain kernel: device bond0 entered promiscuous mode
Oct 18 21:15:31 overcloud-controller-1.localdomain kernel: device em1 entered promiscuous mode
Oct 18 21:15:31 overcloud-controller-1.localdomain kernel: device em3 entered promiscuous mode
Oct 18 21:15:31 overcloud-controller-1.localdomain NetworkManager[1704]: <info>  (bond0): enslaved to non-master-type device ovs-system; ignoring
Oct 18 21:15:31 overcloud-controller-1.localdomain kernel: bond0: link status up again after 0 ms for interface em1
Oct 18 21:15:32 overcloud-controller-1.localdomain kernel: i40e 0000:01:00.2 em3: NIC Link is Down
Oct 18 21:15:32 overcloud-controller-1.localdomain NetworkManager[1704]: <info>  (em3): link disconnected
Oct 18 21:15:32 overcloud-controller-1.localdomain kernel: bond0: link status definitely down for interface em3, disabling it
Oct 18 21:15:32 overcloud-controller-1.localdomain ovs-vsctl[2263]: ovs|00001|vsctl|INFO|Called as ovs-vsctl -t 10 -- --may-exist add-br br-tenant
Oct 18 21:15:32 overcloud-controller-1.localdomain network[1761]: Bringing up interface bond0:  [  OK  ]
Oct 18 21:15:32 overcloud-controller-1.localdomain cloud-init[1689]: Cloud-init v. 0.7.6 running 'init-local' at Tue, 18 Oct 2016 21:15

And here's the config for the bonds:

ajo@mbp-ajo:~/Downloads/sosreport/ctl1$ tail etc/sysconfig/network-scripts/ifcfg-em*

==> etc/sysconfig/network-scripts/ifcfg-em1 <==
# This file is autogenerated by os-net-config
DEVICE=em1
ONBOOT=yes
HOTPLUG=no
NM_CONTROLLED=no
PEERDNS=no
MASTER=bond0
SLAVE=yes
BOOTPROTO=none

==> etc/sysconfig/network-scripts/ifcfg-em2 <==
# This file is autogenerated by os-net-config
DEVICE=em2
ONBOOT=yes
HOTPLUG=no
NM_CONTROLLED=no
PEERDNS=no
MASTER=bond1
SLAVE=yes
BOOTPROTO=none

==> etc/sysconfig/network-scripts/ifcfg-em3 <==
# This file is autogenerated by os-net-config
DEVICE=em3
ONBOOT=yes
HOTPLUG=no
NM_CONTROLLED=no
PEERDNS=no
MASTER=bond0
SLAVE=yes
BOOTPROTO=none

==> etc/sysconfig/network-scripts/ifcfg-em4 <==
# This file is autogenerated by os-net-config
DEVICE=em4
ONBOOT=yes
HOTPLUG=no
NM_CONTROLLED=no
PEERDNS=no
MASTER=bond1
SLAVE=yes
BOOTPROTO=none

ajo@mbp-ajo:~/Downloads/sosreport/ctl1$ tail etc/sysconfig/network-scripts/ifcfg-bond*

==> etc/sysconfig/network-scripts/ifcfg-bond0 <==
DEVICE=bond0
ONBOOT=yes
HOTPLUG=no
NM_CONTROLLED=no
PEERDNS=no
DEVICETYPE=ovs
TYPE=OVSPort
OVS_BRIDGE=br-tenant
MACADDR="14:9e:cf:2c:71:61"
BONDING_OPTS="mode=802.3ad miimon=100"

==> etc/sysconfig/network-scripts/ifcfg-bond1 <==
DEVICE=bond1
ONBOOT=yes
HOTPLUG=no
NM_CONTROLLED=no
PEERDNS=no
DEVICETYPE=ovs
TYPE=OVSPort
OVS_BRIDGE=br-ex
MACADDR="14:9e:cf:2c:71:63"
BONDING_OPTS="mode=802.3ad miimon=100"
Switch is interesting, but we are seeing this on multiple installs with and without bonds.
(In reply to Randy Perryman from comment #40)
> Switch is interesting, but we are seeing this on multiple installs with and
> without bonds.

Without bonds? Do we have any sosreport of this reproduced without the bonds to simplify the diagnostics?
I believe Dan did this in VMs.
(In reply to Dan Sneddon from comment #20)
> I believe I have reproduced this error in a virt environment, although I
> still haven't identified a root cause.
>
> A potential workaround seems to be bouncing the affected bridge interface
> (br-ex in my case) after the update but before rebooting.
>
> sudo -i
> ifdown eth0 ; ifup eth0 ; ifdown br-ex ; ifup br-ex ; ifup vlan10
>
> After running the above command, I did have my ssh connection cut off, but
> the connectivity returned after a couple of seconds and I was able to reboot
> the controller without incident.
>
> So it looks like the bridge gets in a bad way, and running ifdown/ifup on
> the bridge seems to fix the issue, but that doesn't explain why the bridge
> doesn't come up properly after a reboot when the workaround is not run.
By looking at the packages, we found that the final RHEL state after the upgrade is not RHEL 7.3 but RHEL 7.2.z:

$ cat uname
Linux overcloud-controller-1.localdomain 3.10.0-327.36.2.el7.x86_64 #1 SMP Tue Sep 27 16:01:21 EDT 2016 x86_64 x86_64 x86_64 GNU/Linux

and openvswitch-2.4.0-2.el7_2.x86_64, while RHEL 7.3 has kernel-3.10.0-514.el7.x86_64.rpm (in RC3 at least).

Comment 44 also indicates this is happening with other OSP versions.
As per @cascardo's comments on IRC, could you please try adding:

OVS_EXTRA="set bridge ovsbr fail_mode=standalone"

to /etc/sysconfig/network-scripts/ifcfg-br-tenant?

He found that the culprit could be that, for some reason, the bridge is not removed and re-created on reboot (it was supposed to be destroyed when the network is stopped; can we verify this?). Since we now [1] set the bridge to secure mode, no default "NORMAL" switching rule is installed on the bridge at boot-up (among other things), so all traffic arriving at br-tenant from the bonds (or external interfaces) is dropped.

[1] https://review.openstack.org/#/c/355315/

We should consider changing this bug back to rhel-osp-director to make sure we install that OVS_EXTRA setting by default and avoid something like this in the future.
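For reference, a minimal sketch of what the resulting ifcfg file could look like (the bridge name br-tenant and the other settings are assumptions based on the bond configs above; the actual file varies per deployment and is rewritten by os-net-config on its next run):

```ini
# /etc/sysconfig/network-scripts/ifcfg-br-tenant (hypothetical sketch)
DEVICE=br-tenant
ONBOOT=yes
HOTPLUG=no
NM_CONTROLLED=no
DEVICETYPE=ovs
TYPE=OVSBridge
# Workaround: fall back to standalone mode so the default NORMAL flow
# forwards traffic at boot, before the OVS agent reprograms the bridge.
OVS_EXTRA="set bridge br-tenant fail_mode=standalone"
```

The current mode can be checked on a running node with `ovs-vsctl get-fail-mode br-tenant` (empty output means the default standalone behavior).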
Please add the OVS_EXTRA="set bridge ovsbr fail_mode=standalone" line to every OVS bridge ifcfg script (as per @cascardo's comments on IRC, it seems there are VLAN interfaces attached to br-ex).
Initial testing is looking promising. All controllers rebooted with networking coming back; this is the first time that has happened. Looking at the network config files to add this to the deployment going forward.
Should mention this is OSP 8 on RHEL 7.2.
This points out that the Gate needs to be updated to test for reboot failures.
Hi, I need help in figuring out how to code this to be in my nic-configs.
(In reply to Manabu Ori from comment #53)
> I tried to
> - delete "systemctl restart network" in rc.local
> - add OVS_EXTRA="set bridge br-ex fail_mode=standalone" to ifcfg-br-ex
> and the symptom disappeared after reboot.

It seems that only the OVS_EXTRA config would be enough, so no need for restarting the network. Could someone please confirm?
Also, OVS is by default in 'in-band' mode [ovs-vswitchd.conf.db(5)]:

  in-band
    In this mode, this controller's OpenFlow traffic travels over the bridge associated with the controller. With this setting, Open vSwitch allows traffic to and from the controller regardless of the contents of the OpenFlow flow table. (Otherwise, Open vSwitch would never be able to connect to the controller, because it did not have a flow to enable it.) This is the most common connection mode because it is not necessary to maintain two independent networks.

So, even in 'secure' mode, OVS should allow talking with the controller.
(In reply to Flavio Leitner from comment #54)
> (In reply to Manabu Ori from comment #53)
> > I tried to
> > - delete "systemctl restart network" in rc.local
> > - add OVS_EXTRA="set bridge br-ex fail_mode=standalone" to ifcfg-br-ex
> > and the symptom disappeared after reboot.
>
> It seems that only the OVS_EXTRA config would be enough, so no need for
> restarting the network.
>
> Could someone please confirm?

Sorry for the confusion: only the OVS_EXTRA config was needed. At first, "systemctl restart network" seemed to be a workaround and I wrote it in rc.local. After reading this bz, I removed the "systemctl restart network" from rc.local and tried the OVS_EXTRA config alone, which resulted in success.
FWIW: To implement this workaround in the templates, you can set OVS_EXTRA on deployment in the network environment templates where br-ex is defined. For example, if using the environments/net-multiple-nics.yaml environment file that pulls in network/config/multiple-nics/controller.yaml, you would change the part of network/config/multiple-nics/controller.yaml that looks like:

  - type: ovs_bridge
    name: {get_input: bridge_name}
    dns_servers: {get_param: DnsServers}
    use_dhcp: false
    addresses:
      - ip_netmask: {get_param: ExternalIpSubnet}

to make it:

  - type: ovs_bridge
    name: {get_input: bridge_name}
    dns_servers: {get_param: DnsServers}
    use_dhcp: false
    ovs_extra:
      str_replace:
        template: "set bridge BRIDGE fail_mode=standalone"
        params:
          BRIDGE: {get_input: bridge_name}
    addresses:
      - ip_netmask: {get_param: ExternalIpSubnet}

(I think the syntax is correct...)
(In reply to Flavio Leitner from comment #55)
> So, even in 'secure' mode OVS should allow talking with the controller.

AFAIK OpenStack/Neutron does not implement an OpenFlow controller, thus this doesn't really apply.
(In reply to Jiri Benc from comment #58)
> (In reply to Flavio Leitner from comment #55)
> > So, even in 'secure' mode OVS should allow talking with the controller.
>
> AFAIK OpenStack/Neutron does not implement an OpenFlow controller, thus this
> doesn't really apply.

It does now :) We have a mode where we set ourselves as a local controller (neutron-openvswitch-agent).
@dsneddon: I checked with Miguel about whether or not it is safe to set the OVS_EXTRA info everywhere and it seems to be okay. With that in mind we might be better off making the change in os-net-config so it covers the situation where people are using customized network configuration templates. Thoughts?
(In reply to Brent Eagles from comment #60)
> @dsneddon: I checked with Miguel about whether or not it is safe to set the
> OVS_EXTRA info everywhere and it seems to be okay. With that in mind we
> might be better off making the change in os-net-config so it covers the
> situation where people are using customized network configuration templates.
> Thoughts?

The problem with that is that in existing deployments, if the network config changes (and adding that line to the ifcfg file counts as a change), then the network will be redeployed. What actually happens is that os-net-config notices that the ifcfg file is different, so it issues ifdown/ifup on the interface after writing the new configuration. This may have additional impact during upgrades, so it should be tested.
Having taken a step back, I'm convinced that modifying os-net-config to do this is a bad idea. Injecting a default workaround into code in a manner that "hides it" is bad practice in general. Considering the fallout of future changes in openvswitch, neutron, etc. and even other possible uses of os-net-config, this just screams "DON'T". In the interim, I think our best bet is to add ovs_extra data to the templates and document errata.
(In reply to Brent Eagles from comment #57)
> type: ovs_bridge
> name: {get_input: bridge_name}
> dns_servers: {get_param: DnsServers}
> use_dhcp: false
> ovs_extra:
>   str_replace:
>     template: "set bridge BRIDGE fail_mode=standalone"
>     params:
>       BRIDGE: {get_input: bridge_name}
> addresses:
>   - ip_netmask: {get_param: ExternalIpSubnet}
>
> (I think the syntax is correct...)

I tried it with OSP8, but no luck...

<nic-configs/controller.yaml>
(snip)
    - type: ovs_bridge
      name: {get_input: bridge_name}
      dns_servers: {get_param: DnsServers}
      ovs_extra:
        str_replace:
          template: "set bridge BRIDGE fail_mode=standalone"
          params:
            BRIDGE: {get_input: bridge_name}
      members:
(snip)

<output of openstack overcloud deploy>
2016-11-04 03:05:27 [overcloud-Controller-chea4pnwnc2q-2-gcqoq7xsfd7i]: CREATE_FAILED Resource CREATE failed: resources.NetworkConfig: Property error: resources.OsNetConfigImpl.properties.config: "str_replace" params must be strings or numbers
2016-11-04 03:05:28 [2]: CREATE_FAILED resources.NetworkConfig: resources[2].Property error: resources.OsNetConfigImpl.properties.config: "str_replace" params must be strings or numbers
I just discovered this u/s bz https://bugs.launchpad.net/heat/+bug/1344284 (actually indicated in one of the networking templates ...) that indicates that this particular method won't work as described. Working on alternatives.
I suspect it could happen on older (RHEL 7.2) systems too, since we backported the patch that sets the bridge in secure mode to OSP 8; once that's applied and we reboot any controller, this can manifest itself. I also experienced this issue yesterday in a packstack AIO deployment.

That doesn't mean it's only a neutron bug: both packstack and director need to make sure the secure mode is cleared from the bridges at boot, or use a separate bridge (independent of the neutron bridge) for node connectivity.
I have created the corresponding packstack bug too: https://bugzilla.redhat.com/show_bug.cgi?id=1392800
What it affects:
• OSP 8
• OSP 9
• OSP 10

Is there a permanent workaround? No.

Is there anything special that needs to happen? Yes, an interactive account with sudo privileges needs to be created.

Is there a workaround? Only a one-time rework of the ifcfg files; os-net-config will reset them at its first opportunity.

What happens if a user reboots a controller and there is no interactive account or ifcfg fix? ??

Does the criticality of this bug need to be updated?
(In reply to Randy Perryman from comment #77)
> What it affects:
> • OSP 8
> • OSP 9
> • OSP 10
>
> Is there a permanent work around? No
>
> Is there anything special that needs to happen? Yes an interactive account
> needs to be created that has sudo privileges
>
> Is there a workaround? Only a one rework of the ifcfg files, OS-Net-Config
> will reset them at it's first opportunity
>
> What happens if a user reboots a controller, and there is no interactive
> account or ifcfg fix? ??
>
> Does the critically of this bug need to be updated?

It's already marked as urgent/urgent and as a blocker. We already have people working on this around the clock.
I examined this on a recent packaging of OSPd 10 and was able to reproduce what appears to be the same behavior. My evaluation environment is a standard virt setup, so a single NIC in a VM bridged to br-ex is used for the control plane as well as the external network, etc. In my virtual environment, the br-ex interface obtains its address via DHCP, and with the OVS bridge unable to move traffic, the address wasn't obtained, rendering the node completely unreachable via the network. I was able to log in via a console to verify.

Without any additional changes, running "systemctl restart network" does configure the interface, and it seems to function properly afterwards. Simply cycling the interfaces does not appear to work unless "systemctl restart network" is performed first. If I restart the VM at this point, it will again be unreachable on the next reboot. For what it is worth, even though cycling the interface to obtain the IP address works, many of the OpenStack services had already failed to start by that point.

Modifying ifcfg-br-ex to change the fail_mode to standalone on boot seems to allow the IP address to be obtained on boot. It's worth noting that neutron changes the fail_mode back to secure at some point afterwards. Due to the timing, I wasn't able to determine with certainty whether fail_mode actually stays in standalone for any appreciable period of time during startup before neutron has a chance to set it back, but this is probably an unimportant detail. Controller and compute nodes are both unreachable with this network configuration. I'm not sure why restarting the network would work unless it opens a very small window where the br-ex bridge has the default fail_mode instead of secure.

Please note that firing off a post-boot network restart does not seem a workable option. A lot of services are in a bad state and may not "come back to life" once network connectivity is restored. It pretty much has to happen at the usual time during network configuration.
For comparison, I performed the same experiment using CentOS and upstream TripleO and got the same results the first time I tried, but further attempts to get br-ex into a bad state failed. So on CentOS at least it may be timing dependent. The host system was under pretty significant load at the time of the first trial, so that may have been a factor. While multiple trials were performed on RHEL with other instances turned off, the network configuration failed consistently.

The OSP-d setup was:

Linux overcloud-controller-0.localdomain 3.10.0-493.el7.x86_64 #1 SMP Tue Aug 16 11:45:26 EDT 2016 x86_64 x86_64 x86_64 GNU/Linux
ovs-vsctl (Open vSwitch) 2.5.0
Compiled Jul 21 2016 10:24:02
DB Schema 7.12.1
NetworkManager

The CentOS setup was:

Linux overcloud-controller-0.localdomain 3.10.0-327.36.3.el7.x86_64 #1 SMP Mon Oct 24 16:09:20 UTC 2016 x86_64 x86_64 x86_64 GNU/Linux
ovs-vsctl (Open vSwitch) 2.5.0
Compiled Mar 18 2016 15:00:11
DB Schema 7.12.1
No NetworkManager

As CentOS seems to work at least most of the time, it seems that fail_mode alone isn't the root cause, but rather some combination of fail_mode and some other behavior; perhaps a race condition around load order on startup?

Since one of the premises is that the secure fail_mode is the issue, the question that follows is: what is the openvswitch agent doing, and why isn't it doing some flow configuration? When using the ovs-ofctl OpenFlow driver, ryu chokes because the IP it is expecting to be configured for the host hasn't been assigned. The native driver seems to be equally unhappy, so possibly the change from ovs-ofctl to native isn't to blame. It would be difficult to say for certain without some other tests; in any case, the old configuration works no better, so it isn't a workaround.

I also checked whether the changes to the loading of the bridge and bridge netfilter modules mattered, by modifying my initramfs to load bridge and br_netfilter at boot and set the sysctl parameters, etc. No effect; it was a "hail Mary" attempt anyway.
So my conclusion at this time is that the secure mode *is* the culprit on RHEL. It is quite possibly only a problem because the control plane and management networks are all connected via a bridge that neutron in the overcloud "knows about". It's not clear to me that this would be an issue in environments where these networks are configured with bridges not managed by neutron in the overcloud; at the very least, neutron wouldn't be altering the fail_mode there.

At the moment, we don't have a clear way to work around this by temporarily configuring the br-ex bridge with the standalone fail_mode in the heat templates, because of a long-standing issue with intrinsic methods and values obtained by "get_input" (i.e. not a parameter to that heat template). I'll continue investigating how best to do that, but considering all of the variables we should consider alternate solutions: possibly reverting the secure mode patch from neutron for the time being, or altering os-net-config to insert the fail_mode information by default.
I figured out a way to do this. It involves patching os-net-config with a format operation to do a string replace on a template (which gets around the heat issue) and modifying the network configuration templates. Patches made directly to overcloud nodes proved the os-net-config side; a full test involving heat template deployment is in progress.
Brent, so what is the patch? What is timeline to get into OSP9 and OSP10? Arkady
@Arkady, see the external tracker links for OpenStack gerrit. I don't have an ETA as these are very fresh, tested in my environment only for a small subset of configurations (so far 2) and as yet there haven't been any other eyes on them.
Thanks Brent. See it now. Simple fix in too many places.
Do we have a patch of the file /usr/lib/python2.7/site-packages/os_net_config/objects.py for the images used in OSP 8?

rhosp-director-images-8.0-20160603.2.el7ost.noarch
rhosp-director-images-8.0-20160415.1.el7ost.noarch
This will need to be back ported all the way to Liberty.
Created attachment 1219472 [details]
Liberty objects.py

This is my first pass at the objects.py for the Liberty release. I updated this file, then copied it into the image:

virt-copy-in -a overcloud-full.qcow2 ./objects.py /usr/lib/python2.7/site-packages/os_net_config/objects.py

Updated the qcow image for deployment and updated the nic-configs.
So my fix is not working quite right. This is an example of br-ex (I did update my objects.py):

# This file is autogenerated by os-net-config
DEVICE=br-ex
ONBOOT=yes
HOTPLUG=no
NM_CONTROLLED=no
DEVICETYPE=ovs
TYPE=OVSBridge
OVS_EXTRA="s -- e -- t -- -- b -- r -- i -- d -- g -- e -- -- b -- r -- - -- e -- x -- -- f -- a -- i -- l -- _ -- m -- o -- d -- e -- = -- s -- t -- a -- n -- d -- a -- l -- o -- n -- e"
DNS1=8.8.8.8
DNS2=8.8.4.4
Hi Randy, while it looks like your heat template data isn't quite right, this could be because it's liberty. There could be incompatibilities in the interacting parts.
Created attachment 1219551 [details] The Controller file that I used
Okay, in files impl_eni.py and impl_ifcfg.py there is the following:

impl_ifcfg.py:        ovs_extra = []
impl_ifcfg.py:            ovs_extra.append("set bridge %s other-config:hwaddr=%s" %
impl_ifcfg.py:            ovs_extra.extend(base_opt.ovs_extra)
impl_ifcfg.py:            ovs_extra.extend(base_opt.ovs_extra)
impl_ifcfg.py:        if ovs_extra:
impl_ifcfg.py:            data += "OVS_EXTRA=\"%s\"\n" % " -- ".join(ovs_extra)

where it is adding " -- ". Ideas on how to work around this?
(In reply to Randy Perryman from comment #94)
> Okay, in files impl_eni.py and impl_ifcfg.py there is the following:
>
> impl_ifcfg.py:        ovs_extra = []
> impl_ifcfg.py:            ovs_extra.append("set bridge %s other-config:hwaddr=%s" %
> impl_ifcfg.py:            ovs_extra.extend(base_opt.ovs_extra)
> impl_ifcfg.py:            ovs_extra.extend(base_opt.ovs_extra)
> impl_ifcfg.py:        if ovs_extra:
> impl_ifcfg.py:            data += "OVS_EXTRA=\"%s\"\n" % " -- ".join(ovs_extra)
>
> where it is adding " -- ". Ideas on how to work around this?

So the issue was my network config. I had the following line:

ovs_extra: set bridge br-ex fail_mode=standalone"

which the above code treated character by character (each letter, space, and """ as a separate string). Changing the config to:

ovs_extra:
  - "set bridge br-ex fail_mode=standalone"

plus updating the objects.py file (inserting the modification as discussed on Gerrit, but only as needed for Liberty), my controllers now have the correct information in the OVS_EXTRA line.
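For anyone else puzzled by the character-by-character OVS_EXTRA output above, the underlying Python behavior is easy to reproduce. This is a standalone sketch (not the actual os-net-config source): str.join() iterates its argument, and iterating a string yields single characters, so a YAML scalar breaks while a YAML list works.

```python
# Sketch of the impl_ifcfg.py pattern quoted above:
#   data += "OVS_EXTRA=\"%s\"\n" % " -- ".join(ovs_extra)
def format_ovs_extra(ovs_extra):
    # join() accepts any iterable of strings; a bare string is an
    # iterable of its characters, a list is an iterable of commands.
    return 'OVS_EXTRA="%s"' % " -- ".join(ovs_extra)

# Wrong: a YAML scalar arrives as one string, so every character
# (letters, spaces, quotes) is joined with " -- ".
broken = format_ovs_extra("set bridge br-ex fail_mode=standalone")

# Right: a YAML list arrives as a list with one command string.
fixed = format_ovs_extra(["set bridge br-ex fail_mode=standalone"])

print(fixed)  # OVS_EXTRA="set bridge br-ex fail_mode=standalone"
```

This matches the fix above: quoting the command and making it a YAML list item keeps it as a single string through the join.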
On a new deployment of OSP 8.0, with the files locked to a specific version, changing the config to:

ovs_extra:
  - "set bridge br-ex fail_mode=standalone"

for the bridges on the computes and controllers, the proper line was inserted into the ifcfg files. No other changes were needed. I don't know whether OSP 9.0 will do the same.
New Question: In a deployment that is already in place, how do you update the config files?
(In reply to Randy Perryman from comment #98)
> New Question:
>
> In a deployment that is already in place, how do you update the config files?

You can temporarily set NetworkDeploymentActions: ['CREATE', 'UPDATE'] in the parameter_defaults: section of an environment file to update the ifcfg files during a stack update. Since this issue only comes up when doing an initial install or update, it shouldn't be necessary to modify the ifcfg files of a running system. Instead, the ifcfg files can be updated as part of the update process, which should ensure that the networking works after a reboot when the update is complete.
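To make that concrete, a minimal environment-file sketch (the file name is hypothetical; pass it with -e on the `openstack overcloud deploy` command line):

```yaml
# network-update.yaml (hypothetical name)
parameter_defaults:
  # Run os-net-config on stack UPDATE as well as CREATE so the
  # regenerated ifcfg files reach already-deployed nodes. Remove
  # 'UPDATE' afterwards to restore the default behavior.
  NetworkDeploymentActions: ['CREATE', 'UPDATE']
```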
Adding a note for reviewers, as there was movement on this last night. The approach proposed by Brent with https://review.openstack.org/#/c/396285/ (tripleo-heat-templates), dependent on https://review.openstack.org/#/c/395795/ (os-net-config), is being abandoned in favor of the new fix, also from Brent, at https://review.openstack.org/#/c/397405/ (os-net-config). This needs to go to master then newton asap. Updating external trackers; needinfo beagles, please sanity check.
The external tracker update looks good.
Is it possible to backport this latest fix to liberty and mitaka? I understand there have been significant changes to os_net_config/objects.py.
(In reply to Ronelle Landy from comment #103)
> Is it possible to backport this latest fix to liberty and mitaka?
> I understand there have been significant changes to os_net_config/objects.py.

That is the plan, yes. Look for upgrades on this RHBZ and the OSP 8 clone. The 8 and 9 fixes will not involve changes to os-net-config/ifcfg files, but will instead revert the OVS agent change that put bridges in secure mode.
(In reply to Assaf Muller from comment #104)
> (In reply to Ronelle Landy from comment #103)
> > Is it possible to backport this latest fix to liberty and mitaka?
> > I understand there have been significant changes to os_net_config/objects.py.
>
> That is the plan, yes. Look for upgrades on this RHBZ and the OSP 8 clone.
> The 8 and 9 fixes will not involve changes to os-net-config/ifcfg files,
> rather revert the OVS agent change that put bridges in secure mode.

Updates, that is, not upgrades.
Couple of questions: Will the proposed patches affect network performance, e.g. for VLANs on a bridge, an IP on that bridge, etc.? For existing deployments going from 9 to 10, will there be an upgrade path? I understand the resolution for 8/9 will be to not turn on secure mode, so we will not need to add the ovs_extra to any file. Thank you.
The backports for this into OSP 8 and 9 have been completed and will be available in the next puddle and in the RC due out Dec. 1
Agreed on closing this bug in favor of BZ 1394890. Will let somebody from the Neutron DFG take this action since they own this BZ.
*** This bug has been marked as a duplicate of bug 1394890 ***