Bug 1386299 - Compute and controller nodes are not reachable after reboot when OVS bridges are set to secure fail mode
Keywords:
Status: CLOSED DUPLICATE of bug 1394890
Alias: None
Product: Red Hat OpenStack
Classification: Red Hat
Component: os-net-config
Version: 9.0 (Mitaka)
Hardware: Unspecified
OS: Unspecified
Severity: urgent
Priority: urgent
Target Milestone: async
Target Release: 9.0 (Mitaka)
Assignee: Brent Eagles
QA Contact: Shai Revivo
URL:
Whiteboard:
Depends On:
Blocks: 1391031 1394890 1394894
 
Reported: 2016-10-18 15:20 UTC by Alexander Chuzhoy
Modified: 2021-03-11 14:45 UTC (History)
42 users

Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Clones: 1391031 1394890 (view as bug list)
Environment:
Last Closed: 2016-11-21 20:50:37 UTC
Target Upstream Version:
Embargoed:


Attachments (Terms of Use)
version lock after update (41.79 KB, text/plain)
2016-10-26 10:35 UTC, Randy Perryman
no flags Details
versionlock before update (33.98 KB, text/plain)
2016-10-26 10:35 UTC, Randy Perryman
no flags Details
Liberty objects.py (24.45 KB, text/plain)
2016-11-10 18:21 UTC, Randy Perryman
no flags Details
The Controller file that I used (7.79 KB, text/plain)
2016-11-10 21:19 UTC, Randy Perryman
no flags Details


Links
System ID Private Priority Status Summary Last Updated
Launchpad 1640812 0 None None None 2016-11-10 14:26:43 UTC
Red Hat Knowledge Base (Solution) 2754311 0 None None None 2016-11-07 11:32:57 UTC

Description Alexander Chuzhoy 2016-10-18 15:20:38 UTC
rhel-osp-director: After a minor update (including rhel7.2->rhel7.3), I rebooted the overcloud nodes. The controllers aren't reachable now.
Environment:
instack-undercloud-4.0.0-14.el7ost.noarch
openstack-tripleo-heat-templates-liberty-2.0.0-34.el7ost.noarch
openstack-tripleo-heat-templates-2.0.0-34.el7ost.noarch
openstack-puppet-modules-8.1.8-2.el7ost.noarch


Steps to reproduce:
1. Deploy overcloud with:
openstack overcloud deploy --templates --control-scale 3 --compute-scale 1 --swift-storage-scale 0 --block-storage-scale 0 --neutron-network-type vxlan --neutron-tunnel-types vxlan --ntp-server clock.redhat.com --timeout 90 -e /usr/share/openstack-tripleo-heat-templates/environments/storage-environment.yaml -e /usr/share/openstack-tripleo-heat-templates/environments/network-isolation.yaml -e network-environment.yaml --ceph-storage-scale 1

2. Update the undercloud + reboot it
3. Update the overcloud.
4. Reboot the overcloud nodes.

5. Try to reach any controller.

Result:
+--------------------------------------+-------------------------+--------+------------+-------------+---------------------+
| ID                                   | Name                    | Status | Task State | Power State | Networks            |
+--------------------------------------+-------------------------+--------+------------+-------------+---------------------+
| 6fb3b347-78ef-4ded-bd13-987d4bd174bc | overcloud-cephstorage-0 | ACTIVE | -          | Running     | ctlplane=192.0.2.7  |
| c4bdb5ef-b186-42f0-9f8e-a58c159200fc | overcloud-compute-0     | ACTIVE | -          | Running     | ctlplane=192.0.2.8  |
| e920d1bf-d25a-48b5-9d81-4be27809403d | overcloud-controller-0  | ACTIVE | -          | Running     | ctlplane=192.0.2.9  |
| 074dc00c-d79d-4408-a141-f4a2b43a77a9 | overcloud-controller-1  | ACTIVE | -          | Running     | ctlplane=192.0.2.10 |
| 6798ddd4-80a1-44e5-a17c-087865786fdf | overcloud-controller-2  | ACTIVE | -          | Running     | ctlplane=192.0.2.11 |
+--------------------------------------+-------------------------+--------+------------+-------------+---------------------+




The nodes appear to be running:
[root@seal42 ~]# virsh list
 Id    Name                           State
----------------------------------------------------
 21    instack                        running
 22    baremetalbrbm_0                running
 23    baremetalbrbm_1                running
 24    baremetalbrbm_2                running
 25    baremetalbrbm_5                running
 26    baremetalbrbm_7                running



[stack@instack ~]$ ironic node-list
+--------------------------------------+------+--------------------------------------+-------------+--------------------+-------------+
| UUID                                 | Name | Instance UUID                        | Power State | Provisioning State | Maintenance |
+--------------------------------------+------+--------------------------------------+-------------+--------------------+-------------+
| 8129c1bb-811d-4a58-9e25-3eb4993e7415 | None | 6fb3b347-78ef-4ded-bd13-987d4bd174bc | power on    | active             | False       |
| 757e91c1-c939-42d5-b610-1087845d8e47 | None | c4bdb5ef-b186-42f0-9f8e-a58c159200fc | power on    | active             | False       |
| 7879d19c-9ba9-4953-b1e4-5c9866a58038 | None | 6798ddd4-80a1-44e5-a17c-087865786fdf | power on    | active             | False       |
| 35588513-8014-4340-84ba-a47b53dc815b | None | None                                 | power off   | available          | False       |
| f8d34ad8-616e-45a0-881b-3956cc54c4fd | None | None                                 | power off   | available          | False       |
| d8d3dd17-a708-405f-a70d-617ccbc8cce7 | None | e920d1bf-d25a-48b5-9d81-4be27809403d | power on    | active             | False       |
| 53361161-4090-455e-9986-d1b7b3b30591 | None | None                                 | power off   | available          | False       |
| b0c652f3-e1b0-4c64-924f-2b6180cc4358 | None | 074dc00c-d79d-4408-a141-f4a2b43a77a9 | power on    | active             | False       |
+--------------------------------------+------+--------------------------------------+-------------+--------------------+-------------+




[stack@instack ~]$ ping -c1 -W1 192.0.2.7
PING 192.0.2.7 (192.0.2.7) 56(84) bytes of data.
64 bytes from 192.0.2.7: icmp_seq=1 ttl=64 time=0.233 ms

--- 192.0.2.7 ping statistics ---
1 packets transmitted, 1 received, 0% packet loss, time 0ms
rtt min/avg/max/mdev = 0.233/0.233/0.233/0.000 ms
[stack@instack ~]$ ping -c1 -W1 192.0.2.8
PING 192.0.2.8 (192.0.2.8) 56(84) bytes of data.
64 bytes from 192.0.2.8: icmp_seq=1 ttl=64 time=0.201 ms

--- 192.0.2.8 ping statistics ---
1 packets transmitted, 1 received, 0% packet loss, time 0ms
rtt min/avg/max/mdev = 0.201/0.201/0.201/0.000 ms
[stack@instack ~]$ ping -c1 -W1 192.0.2.9
PING 192.0.2.9 (192.0.2.9) 56(84) bytes of data.

--- 192.0.2.9 ping statistics ---
1 packets transmitted, 0 received, 100% packet loss, time 0ms

[stack@instack ~]$ ping -c1 -W1 192.0.2.10
PING 192.0.2.10 (192.0.2.10) 56(84) bytes of data.

--- 192.0.2.10 ping statistics ---
1 packets transmitted, 0 received, 100% packet loss, time 0ms

[stack@instack ~]$ ping -c1 -W1 192.0.2.11
PING 192.0.2.11 (192.0.2.11) 56(84) bytes of data.

--- 192.0.2.11 ping statistics ---
1 packets transmitted, 0 received, 100% packet loss, time 0ms


The controllers aren't reachable.

Comment 1 Marios Andreou 2016-10-18 16:38:07 UTC
Assigning to the lifecycle team for now, at least for initial triage... @Ryan, assigning to you as it is EOD in the EU right now (Omri will be around and is anxious to test the update workflow with the reboot, and is blocked; please sync up with him if you get a chance to have a look here).

At this point I'm mostly interested in finding out more about what the specific issue is, for initial triage. It could be that we need to assign this to a different DFG for further investigation, depending on what the problem is (e.g. is it something we introduced and can fix, or is it something related to the service that the appropriate DFG needs to look at?).

Comment 2 Ryan Hallisey 2016-10-18 20:11:53 UTC
When I restarted the nodes a second time, instead of hanging on a ping test it returns:
  From 192.0.2.1 icmp_seq=1 Destination Host Unreachable

and from tcpdump:
  ARP, Request who-has 192.0.2.10 tell instack.localdomain, length 28

I used vncviewer to connect to the controller nodes' consoles just fine, but they don't have passwords set, so I can't log in.

From what I gathered, I think the issue lies on the nodes themselves, which I can't inspect unless this test is rerun with a password set on the controller nodes.
Can you run this again with a password set on the controller nodes so we can look around inside them?

Comment 3 Marios Andreou 2016-10-19 06:54:17 UTC
reassigning properly this time :) (I said Ryan but added Lucas sorry)

Comment 4 Marios Andreou 2016-10-19 16:16:12 UTC
Just thinking this could be related to the OVS issue (if OVS 2.5 was delivered with the minor update, we could be seeing the same as https://bugzilla.redhat.com/show_bug.cgi?id=1371840).

Comment 5 Alexander Chuzhoy 2016-10-20 14:33:52 UTC
On a BM setup, I'm able to reach the controllers after reboot via ctlplane, but they're unable to reach the FW on the external network:
[stack@undercloud72 ~]$ nova list
+--------------------------------------+-------------------------+--------+------------+-------------+-----------------------+
| ID                                   | Name                    | Status | Task State | Power State | Networks              |
+--------------------------------------+-------------------------+--------+------------+-------------+-----------------------+
| b7f466ad-5cb9-4c8a-9804-7625d998a0c4 | overcloud-cephstorage-0 | ACTIVE | -          | Running     | ctlplane=192.168.0.8  |
| 47a8ff4d-47ae-48e1-8f96-1f59fc47ae8f | overcloud-cephstorage-1 | ACTIVE | -          | Running     | ctlplane=192.168.0.7  |
| a13e9271-b02e-4a90-9e0b-456e423b96e8 | overcloud-compute-0     | ACTIVE | -          | Running     | ctlplane=192.168.0.9  |
| 35c8fed9-fabe-4e63-a210-80b47a2dc18b | overcloud-controller-0  | ACTIVE | -          | Running     | ctlplane=192.168.0.12 |
| 66edde5e-abdc-4ec9-9f90-a1419eecd4e7 | overcloud-controller-1  | ACTIVE | -          | Running     | ctlplane=192.168.0.10 |
| 4abf7103-ff17-4d35-9a5b-021a1fcb85d5 | overcloud-controller-2  | ACTIVE | -          | Running     | ctlplane=192.168.0.11 |
+--------------------------------------+-------------------------+--------+------------+-------------+-----------------------+
[stack@undercloud72 ~]$ ss^C
[stack@undercloud72 ~]$ ssh heat-admin@192.168.0.10
[heat-admin@overcloud-controller-1 ~]$ sudo -i^C
[heat-admin@overcloud-controller-1 ~]$ ping 8.8.8.8
PING 8.8.8.8 (8.8.8.8) 56(84) bytes of data.
^C
--- 8.8.8.8 ping statistics ---
1 packets transmitted, 0 received, 100% packet loss, time 0ms

[heat-admin@overcloud-controller-1 ~]$ sudo ip route
default via 10.19.184.254 dev br-ex
10.19.94.0/24 dev br-nic2  proto kernel  scope link  src 10.19.94.12
10.19.95.0/24 dev vlan183  proto kernel  scope link  src 10.19.95.11
10.19.184.0/24 dev br-ex  proto kernel  scope link  src 10.19.184.181
169.254.169.254 via 192.168.0.1 dev p2p1
192.168.0.0/24 dev p2p1  proto kernel  scope link  src 192.168.0.10
192.168.150.0/24 dev br-nic4  proto kernel  scope link  src 192.168.150.10
192.168.200.0/24 dev vlan103  proto kernel  scope link  src 192.168.200.11

[heat-admin@overcloud-controller-1 ~]$ ping 10.19.184.254
PING 10.19.184.254 (10.19.184.254) 56(84) bytes of data.
From 10.19.184.181 icmp_seq=1 Destination Host Unreachable
From 10.19.184.181 icmp_seq=2 Destination Host Unreachable
From 10.19.184.181 icmp_seq=3 Destination Host Unreachable
From 10.19.184.181 icmp_seq=4 Destination Host Unreachable


The non-controllers are routed through undercloud and they're able to ping the external world.

Comment 6 Alexander Chuzhoy 2016-10-20 15:29:56 UTC
After the update the openvswitch version on OC looks like:
openstack-neutron-openvswitch-8.1.2-5.el7ost.noarch
python-openvswitch-2.4.0-1.el7.noarch
openvswitch-2.4.0-1.el7.x86_64

Comment 7 Alexander Chuzhoy 2016-10-20 16:31:03 UTC
So apparently the underlying NIC isn't brought UP upon reboot:


[root@overcloud-controller-0 ~]# ovs-vsctl show
c978e8e1-7ab1-4942-9167-2205c3edb82b           
    Bridge "br-nic4"                           
        Port "vlan103"                         
            tag: 103                           
            Interface "vlan103"                
                type: internal                 
        Port "br-nic4"                         
            Interface "br-nic4"                
                type: internal                 
        Port "p1p2"                            
            Interface "p1p2"                   
    Bridge "br-nic2"                           
        Port "em2"                             
            Interface "em2"                    
        Port "vlan183"                         
            tag: 183                           
            Interface "vlan183"                
                type: internal                 
        Port "br-nic2"                         
            Interface "br-nic2"                
                type: internal                 
    Bridge br-tun                              
        fail_mode: secure                      
        Port "vxlan-c0a8960a"                  
            Interface "vxlan-c0a8960a"         
                type: vxlan                    
                options: {df_default="true", in_key=flow, local_ip="192.168.150.12", out_key=flow, remote_ip="192.168.150.10"}
        Port br-tun                                                                                                           
            Interface br-tun                                                                                                  
                type: internal                                                                                                
        Port patch-int                                                                                                        
            Interface patch-int                                                                                               
                type: patch                                                                                                   
                options: {peer=patch-tun}                                                                                     
        Port "vxlan-c0a8960d"                                                                                                 
            Interface "vxlan-c0a8960d"                                                                                        
                type: vxlan                                                                                                   
                options: {df_default="true", in_key=flow, local_ip="192.168.150.12", out_key=flow, remote_ip="192.168.150.13"}
        Port "vxlan-c0a8960b"                                                                                                 
            Interface "vxlan-c0a8960b"                                                                                        
                type: vxlan                                                                                                   
                options: {df_default="true", in_key=flow, local_ip="192.168.150.12", out_key=flow, remote_ip="192.168.150.11"}
    Bridge br-int                                                                                                             
        fail_mode: secure                                                                                                     
        Port br-int                                                                                                           
            Interface br-int                                                                                                  
                type: internal                                                                                                
        Port int-br-ex                                                                                                        
            Interface int-br-ex                                                                                               
                type: patch                                                                                                   
                options: {peer=phy-br-ex}                                                                                     
        Port patch-tun                                                                                                        
            Interface patch-tun                                                                                               
                type: patch                                                                                                   
                options: {peer=patch-int}                                                                                     
    Bridge br-ex                                                                                                              
        Port br-ex                                                                                                            
            Interface br-ex                                                                                                   
                type: internal                                                                                                
    ovs_version: "2.4.0"                                                                                                      
[root@overcloud-controller-0 ~]#                                                                                              
[root@overcloud-controller-0 ~]# ip a                                                                                         
1: lo: <LOOPBACK,UP,LOWER_UP> mtu 65536 qdisc noqueue state UNKNOWN qlen 1                                                    
    link/loopback 00:00:00:00:00:00 brd 00:00:00:00:00:00                                                                     
    inet 127.0.0.1/8 scope host lo                                                                                            
       valid_lft forever preferred_lft forever                                                                                
    inet6 ::1/128 scope host                                                                                                  
       valid_lft forever preferred_lft forever                                                                                
2: em1: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc mq state UP qlen 1000                                                
    link/ether 44:a8:42:3b:2e:61 brd ff:ff:ff:ff:ff:ff                                                                        
    inet6 2620:52:0:13b8:46a8:42ff:fe3b:2e61/64 scope global mngtmpaddr dynamic                                               
       valid_lft 2591682sec preferred_lft 604482sec                                                                           
    inet6 fe80::46a8:42ff:fe3b:2e61/64 scope link                                                                             
       valid_lft forever preferred_lft forever                                                                                
3: em2: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc mq master ovs-system state UP qlen 1000                              
    link/ether 44:a8:42:3b:2e:62 brd ff:ff:ff:ff:ff:ff                                                                        
    inet6 fe80::46a8:42ff:fe3b:2e62/64 scope link                                                                             
       valid_lft forever preferred_lft forever                                                                                
4: p1p1: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc mq state UP qlen 1000                                               
    link/ether 00:0a:f7:7f:24:88 brd ff:ff:ff:ff:ff:ff                                                                        
    inet 192.168.0.12/24 brd 192.168.0.255 scope global p1p1                                                                  
       valid_lft forever preferred_lft forever                                                                                
    inet 192.168.0.6/32 brd 192.168.0.255 scope global p1p1                                                                   
       valid_lft forever preferred_lft forever                                                                                
    inet6 fe80::20a:f7ff:fe7f:2488/64 scope link                                                                              
       valid_lft forever preferred_lft forever                                                                                
5: p1p2: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc mq master ovs-system state UP qlen 1000                             
    link/ether 00:0a:f7:7f:24:89 brd ff:ff:ff:ff:ff:ff                                                                        
    inet6 fe80::20a:f7ff:fe7f:2489/64 scope link                                                                              
       valid_lft forever preferred_lft forever                                                                                
6: ovs-system: <BROADCAST,MULTICAST> mtu 1500 qdisc noop state DOWN qlen 1000                                                 
    link/ether 4e:cb:c2:92:c4:b3 brd ff:ff:ff:ff:ff:ff                                                                        
7: br-int: <BROADCAST,MULTICAST> mtu 1500 qdisc noop state DOWN qlen 1000
    link/ether 5a:6d:49:d1:7b:4c brd ff:ff:ff:ff:ff:ff
8: br-tun: <BROADCAST,MULTICAST> mtu 1500 qdisc noop state DOWN qlen 1000
    link/ether de:b7:9b:96:cf:47 brd ff:ff:ff:ff:ff:ff
9: vxlan_sys_4789: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 65470 qdisc noqueue master ovs-system state UNKNOWN qlen 1000
    link/ether 22:4e:22:76:0b:da brd ff:ff:ff:ff:ff:ff
11: vlan103: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc noqueue state UNKNOWN qlen 1000
    link/ether fa:06:80:c3:27:54 brd ff:ff:ff:ff:ff:ff
    inet 192.168.200.14/24 brd 192.168.200.255 scope global vlan103
       valid_lft forever preferred_lft forever
    inet 192.168.200.10/32 brd 192.168.200.255 scope global vlan103
       valid_lft forever preferred_lft forever
    inet6 fe80::f806:80ff:fec3:2754/64 scope link
       valid_lft forever preferred_lft forever
12: br-nic4: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc noqueue state UNKNOWN qlen 1000
    link/ether 00:0a:f7:7f:24:89 brd ff:ff:ff:ff:ff:ff
    inet 192.168.150.12/24 brd 192.168.150.255 scope global br-nic4
       valid_lft forever preferred_lft forever
    inet6 fe80::20a:f7ff:fe7f:2489/64 scope link
       valid_lft forever preferred_lft forever
13: br-nic2: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc noqueue state UNKNOWN qlen 1000
    link/ether 44:a8:42:3b:2e:62 brd ff:ff:ff:ff:ff:ff
    inet 10.19.94.15/24 brd 10.19.94.255 scope global br-nic2
       valid_lft forever preferred_lft forever
    inet6 2620:52:0:135e:46a8:42ff:fe3b:2e62/64 scope global mngtmpaddr dynamic
       valid_lft 2591793sec preferred_lft 604593sec
    inet6 fe80::46a8:42ff:fe3b:2e62/64 scope link
       valid_lft forever preferred_lft forever
14: vlan183: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc noqueue state UNKNOWN qlen 1000
    link/ether 9e:7a:72:fe:7a:74 brd ff:ff:ff:ff:ff:ff
    inet 10.19.95.15/24 brd 10.19.95.255 scope global vlan183
       valid_lft forever preferred_lft forever
    inet6 2620:52:0:135f:9c7a:72ff:fefe:7a74/64 scope global mngtmpaddr dynamic
       valid_lft 2591849sec preferred_lft 604649sec
    inet6 fe80::9c7a:72ff:fefe:7a74/64 scope link
       valid_lft forever preferred_lft forever
15: br-ex: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc noqueue state UNKNOWN qlen 1000
    link/ether 44:a8:42:3b:2e:61 brd ff:ff:ff:ff:ff:ff
    inet 10.19.184.182/24 brd 10.19.184.255 scope global br-ex
       valid_lft forever preferred_lft forever
    inet6 fe80::46a8:42ff:fe3b:2e61/64 scope link
       valid_lft forever preferred_lft forever
[root@overcloud-controller-0 ~]# ping 10.19.184.254 -c1
PING 10.19.184.254 (10.19.184.254) 56(84) bytes of data.
From 10.19.184.182 icmp_seq=1 Destination Host Unreachable

--- 10.19.184.254 ping statistics ---
1 packets transmitted, 0 received, +1 errors, 100% packet loss, time 0ms

[root@overcloud-controller-0 ~]# ifup em1
[root@overcloud-controller-0 ~]# ping 10.19.184.254 -c1
PING 10.19.184.254 (10.19.184.254) 56(84) bytes of data.
64 bytes from 10.19.184.254: icmp_seq=1 ttl=64 time=3.15 ms

--- 10.19.184.254 ping statistics ---
1 packets transmitted, 1 received, 0% packet loss, time 0ms
rtt min/avg/max/mdev = 3.159/3.159/3.159/0.000 ms
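Given the symptoms above (bridges present but no traffic until the member port is brought up), the relevant state can be checked in a few commands. A hedged diagnostic sketch, written as a dry run (it prints the commands via `echo` instead of executing them, so it is safe anywhere); the bridge and port names come from this report and may differ on other nodes:

```shell
#!/bin/sh
# Dry-run diagnostic sketch for the "bridge up, no forwarding" state above.
# run=echo prints each command instead of executing it; drop it to run for real.
diagnose() {
  run=echo
  $run ip link show em1                # is the underlying NIC administratively UP?
  $run ovs-vsctl get-fail-mode br-ex   # empty/standalone forwards on its own; "secure" waits for flows
  $run ovs-vsctl list-ports br-ex      # confirm em1 is actually attached to the bridge
  $run ovs-ofctl dump-flows br-ex      # in secure mode with no flows installed, nothing is forwarded
}
diagnose
```

On an affected node, running these for real (without `run=echo`) should show whether the port is missing from the bridge or the bridge is simply sitting in secure fail mode with no flows.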

Comment 8 Alexander Chuzhoy 2016-10-20 16:34:09 UTC
/etc/sysconfig/network-scripts/ifcfg-em1:
# This file is autogenerated by os-net-config
DEVICE=em1
ONBOOT=yes
HOTPLUG=no
NM_CONTROLLED=no
PEERDNS=no
DEVICETYPE=ovs
TYPE=OVSPort
OVS_BRIDGE=br-ex
BOOTPROTO=none
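For context, an OVSPort ifcfg like the one above is normally paired with a bridge-side file. The following is a hypothetical reconstruction of what os-net-config generates for br-ex (illustration only, not taken from the affected node; the address matches the `ip a` output earlier in this report):

```
# Hypothetical companion file: /etc/sysconfig/network-scripts/ifcfg-br-ex
# (reconstructed for illustration; keys on an affected node may differ)
DEVICE=br-ex
ONBOOT=yes
HOTPLUG=no
NM_CONTROLLED=no
DEVICETYPE=ovs
TYPE=OVSBridge
BOOTPROTO=static
IPADDR=10.19.184.182
NETMASK=255.255.255.0
```

The initscripts bring the OVSPort up as part of bringing the OVSBridge up, so ordering between these two files at boot is what the ifdown/ifup workarounds in later comments are exercising.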

Comment 9 Alexander Chuzhoy 2016-10-20 17:02:01 UTC
More findings from that setup:

I rebooted again the controller where I hadn't brought up connectivity, to see whether the issue reproduces after one more reboot, and it did:

[heat-admin@overcloud-controller-2 ~]$ ping 10.19.184.254
PING 10.19.184.254 (10.19.184.254) 56(84) bytes of data. 
^C                                                       
--- 10.19.184.254 ping statistics ---
3 packets transmitted, 0 received, 100% packet loss, time 2000ms

[heat-admin@overcloud-controller-2 ~]$ sudo ifup em1
[heat-admin@overcloud-controller-2 ~]$ ping 10.19.184.254
PING 10.19.184.254 (10.19.184.254) 56(84) bytes of data.
From 10.19.184.183 icmp_seq=1 Destination Host Unreachable
From 10.19.184.183 icmp_seq=2 Destination Host Unreachable
From 10.19.184.183 icmp_seq=3 Destination Host Unreachable
From 10.19.184.183 icmp_seq=4 Destination Host Unreachable
^C
--- 10.19.184.254 ping statistics ---
4 packets transmitted, 0 received, +4 errors, 100% packet loss, time 2999ms
pipe 4


[heat-admin@overcloud-controller-2 ~]$ sudo ifup br-ex
[heat-admin@overcloud-controller-2 ~]$ ping 10.19.184.254
PING 10.19.184.254 (10.19.184.254) 56(84) bytes of data.
From 10.19.184.183 icmp_seq=1 Destination Host Unreachable
From 10.19.184.183 icmp_seq=2 Destination Host Unreachable
From 10.19.184.183 icmp_seq=3 Destination Host Unreachable
From 10.19.184.183 icmp_seq=4 Destination Host Unreachable
^C
--- 10.19.184.254 ping statistics ---
4 packets transmitted, 0 received, +4 errors, 100% packet loss, time 2999ms
pipe 4
[heat-admin@overcloud-controller-2 ~]$ sudo ifdown br-ex
[heat-admin@overcloud-controller-2 ~]$ sudo ifup br-ex
[heat-admin@overcloud-controller-2 ~]$ ping 10.19.184.254
PING 10.19.184.254 (10.19.184.254) 56(84) bytes of data.
From 10.19.184.183 icmp_seq=1 Destination Host Unreachable
From 10.19.184.183 icmp_seq=2 Destination Host Unreachable
From 10.19.184.183 icmp_seq=3 Destination Host Unreachable
^C
--- 10.19.184.254 ping statistics ---
4 packets transmitted, 0 received, +3 errors, 100% packet loss, time 2998ms
pipe 3
[heat-admin@overcloud-controller-2 ~]$ sudo ifup em1
[heat-admin@overcloud-controller-2 ~]$ ping 10.19.184.254
PING 10.19.184.254 (10.19.184.254) 56(84) bytes of data.
64 bytes from 10.19.184.254: icmp_seq=1 ttl=64 time=58.9 ms
64 bytes from 10.19.184.254: icmp_seq=2 ttl=64 time=1.13 ms
64 bytes from 10.19.184.254: icmp_seq=3 ttl=64 time=0.978 ms
64 bytes from 10.19.184.254: icmp_seq=4 ttl=64 time=1.07 ms
^C
--- 10.19.184.254 ping statistics ---
4 packets transmitted, 4 received, 0% packet loss, time 3002ms
rtt min/avg/max/mdev = 0.978/15.531/58.942/25.063 ms



If I reboot that node, now that the connectivity works, the connectivity remains working after the reboot.

Comment 10 Alexander Chuzhoy 2016-10-20 18:24:35 UTC
So, in order to be able to communicate on the external network[1], I had to:


sudo ifdown br-ex
sudo ifup br-ex
sudo ifup em1

Then the communication worked after subsequent reboot.
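Applied by hand, this is tedious across many nodes. A hedged sketch of scripting the same three commands over ssh, written as a dry run via `echo`; the controller IPs are the ctlplane addresses from the `nova list` output in the description, and the ssh options are placeholder assumptions:

```shell
#!/bin/sh
# Dry-run sketch: apply the br-ex bounce workaround to each controller.
# run=echo prints the commands; drop it to actually execute them.
apply_workaround() {
  run=echo
  for node in 192.0.2.9 192.0.2.10 192.0.2.11; do
    $run ssh -o StrictHostKeyChecking=no heat-admin@"$node" \
      "sudo ifdown br-ex; sudo ifup br-ex; sudo ifup em1"
  done
}
apply_workaround
```

This still assumes the nodes are reachable over the ctlplane network, which (per comment 5) they were on the baremetal setup even when the external network was down.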

Comment 11 Ryan Hallisey 2016-10-20 18:37:35 UTC
What if a user can't ssh into their nodes? Also, what if you have 100 nodes? It's a bit of an inconvenience.

Comment 12 Randy Perryman 2016-10-20 19:13:40 UTC
I am seeing something very similar; in my case all networks are down, but they use OVS bridges.
I have to do:
ifdown em1 (through em4)
ifdown bond0
ifdown bond1
ifup bond0
ifup bond1

and the networks come up.

Testing to see if reboot works.

Comment 13 Randy Perryman 2016-10-20 19:14:49 UTC
I am having to use the root password or the heat-admin account. So far this is happening only on the 3 controller nodes.

(In reply to Ryan Hallisey from comment #11)
> what if a user can't ssh into their nodes? Also, what if you have 100 nodes?
> Bit of an inconvenience.

Comment 14 Randy Perryman 2016-10-20 19:18:17 UTC
One other item: I am still on RHEL 7.2.

Comment 15 Assaf Muller 2016-10-20 19:21:52 UTC
Assigned for root cause analysis.

Comment 16 Alexander Chuzhoy 2016-10-21 14:15:58 UTC
Seems like the issue is intermittent. I didn't reproduce it on the last 2 attempts deploying on baremetal setups.

Comment 17 Randy Perryman 2016-10-21 14:21:20 UTC
Is this after you updated and upgraded, or before? Also, the date of the last kernel update is 10/10/2016; was this before or after that date?

If it is before, we lock the kernel version.

Comment 18 Miguel Angel Ajo 2016-10-21 15:24:04 UTC
I can't find anything obvious in the comments (for the root cause of those interfaces not being brought up).

A sosreport from the controller nodes attached to the first bug report would have helped a lot.

Could we get sosreports if we get this reproduced again?

Comment 19 Randy Perryman 2016-10-21 15:27:27 UTC
The sosreports for my install are attached to the Bugzilla:
https://bugzilla.redhat.com/show_bug.cgi?id=1385143

All three controllers experience this on every reboot.

Comment 20 Dan Sneddon 2016-10-21 15:49:21 UTC
I believe I have reproduced this error in a virt environment, although I still haven't identified a root cause.

A potential workaround seems to be bouncing the affected bridge interface (br-ex in my case) after the update but before rebooting.

  sudo -i
  ifdown eth0 ; ifup eth0 ; ifdown br-ex ; ifup br-ex ; ifup vlan10

After running the above command, I did have my ssh connection cut off, but the connectivity returned after a couple of seconds and I was able to reboot the controller without incident.

So it looks like the bridge gets in a bad way, and running ifdown/ifup on the bridge seems to fix the issue, but that doesn't explain why the bridge doesn't come up properly after a reboot when the workaround is not run.

Comment 21 Randy Perryman 2016-10-21 15:53:12 UTC
(In reply to Dan Sneddon from comment #20)
> I believe I have reproduced this error in a virt environment, although I
> still haven't identified a root cause.
> 
> A potential workaround seems to be bouncing the affected bridge interface
> (br-ex in my case) after the update but before rebooting.
> 
>   sudo -i
>   ifdown eth0 ; ifup eth0 ; ifdown br-ex ; ifup br-ex ; ifup vlan10
> 
> After running the above command, I did have my ssh connection cut off, but
> the connectivity returned after a couple of seconds and I was able to reboot
> the controller without incident.
> 
> So it looks like the bridge gets in a bad way, and running ifdown/ifup on
> the bridge seems to fix the issue, but that doesn't explain why the bridge
> doesn't come up properly after a reboot when the workaround is not run.

Agreed that this works, but it assumes you have alternate SSH or console access, which means you need to ensure your images have a known user/password in them with sudo privileges.

Comment 22 Alexander Chuzhoy 2016-10-25 19:38:45 UTC
Seems like I was able to reproduce the issue simply by deploying OSP9 with rhel7.3 images and rebooting the OC nodes.

Comment 23 Randy Perryman 2016-10-26 08:49:13 UTC
I have just installed OSP 8 on my setup and manually updated only the kernel. Reboots on the controller still work. I am now going to complete a full minor update on the cluster.

Comment 24 Randy Perryman 2016-10-26 10:35:11 UTC
Created attachment 1214246 [details]
version lock after update

Comment 25 Randy Perryman 2016-10-26 10:35:52 UTC
Created attachment 1214247 [details]
versionlock before update

Comment 26 Randy Perryman 2016-10-26 10:37:01 UTC
Can confirm: the node where I updated the kernel and then ran the update script will not cleanly reboot. It needs to have the network restarted after reboot.

Before the update it worked fine.

Comment 27 arkady kanevsky 2016-10-26 10:39:51 UTC
What is the version of OVS before and after the upgrade?

Comment 28 Dan Sneddon 2016-10-26 11:31:58 UTC
(In reply to Randy Perryman from comment #26)
> Can confirm:  Node I updated kernel and the ran update script will not
> cleanly reboot. Needs to have network restarted after reboot.  
> 
> Before update worked fine.

Thanks for performing this test. That rules out kernel bugs, and points squarely at the upgraded packages included in the update, perhaps neutron-openvswitch.

Comment 29 Dan Sneddon 2016-10-26 11:37:17 UTC
(In reply to arkady kanevsky from comment #27)
> what is the version of OVS before and after upgrade?

Randy, please confirm. In my testing, I found that the OVS package was not updated during the update, but neutron-openvswitch package was updated.

Comment 30 Randy Perryman 2016-10-26 12:24:33 UTC
This is the only difference between the two with a grep for vswitch.

s1:openstack-neutron-openvswitch-7.1.1-7.el7ost.*

Comment 31 Randy Perryman 2016-10-26 12:27:29 UTC
By the way, this configuration is:

RHEL 7.2
OSP 8 with versionlock files - reboots work.
Unlock the versionlock.
Run the openstack update on the cluster.
Reboot fails.

So this affects both
OSP 8 and OSP 9.

Comment 32 Omri Hochman 2016-10-26 15:28:45 UTC
I'll defer to DFG-networking to look at it - the issue occurs when rebooting OC nodes after performing a clean deployment of OSP on top of rhel7.3-based nodes.

The issue is unrelated to the upgrade/update procedure, although when updating nodes from rhel7.2 to rhel7.3 and rebooting, the issue reproduces and therefore blocks some of our testing scenarios.

Comment 35 Omri Hochman 2016-11-01 16:17:40 UTC
The scenario is:

Deploy OSP 9 on rhel7.2 -> run the minor update (which takes the OS to rhel7.3) -> then, due to the kernel upgrade, we reboot and hit the issue.

Comment 36 Mike Orazi 2016-11-01 16:52:24 UTC
Is this possibly related to:  https://bugzilla.redhat.com/show_bug.cgi?id=1388286 ?

Comment 37 Franck Baudin 2016-11-02 08:20:59 UTC
Can you try the following workaround? 

https://bugzilla.redhat.com/show_bug.cgi?id=1385096#c4

Comment 38 Miguel Angel Ajo 2016-11-02 10:57:19 UTC
(In reply to Franck Baudin from comment #37)
> Can you try the following workaround? 
> 
> https://bugzilla.redhat.com/show_bug.cgi?id=1385096#c4

Looks like it's not related to that, em1, em2, em3, em4 have no assigned MAC addresses.

Bond0 uses em1 & em3 , taking mac address xx:xx:xx:xx:71:61 for both interfaces
Bond1 uses em2 & em4 , taking mac address xx:xx:xx:xx:71:63 for both interfaces



I see a little dance with em3 and em4 going up and down several times after being added to their respective bonds, which makes me suspect the switch.

See:
[   26.368384] bond0: Adding slave em1
[   26.368397] i40e 0000:01:00.0 em1: already using mac address 14:9e:cf:2c:71:61
[   26.382889] i40e 0000:01:00.0 em1: NIC Link is Up 10 Gbps Full Duplex, Flow Control: None
[   26.383570] bond0: Enslaving em1 as a backup interface with an up link
[   26.660414] bond0: Adding slave em3
[   26.660426] i40e 0000:01:00.2 em3: set new mac address 14:9e:cf:2c:71:61
[   26.678255] i40e 0000:01:00.2 em3: NIC Link is Up 10 Gbps Full Duplex, Flow Control: None
[   26.678874] bond0: Enslaving em3 as a backup interface with an up link
[   26.684491] i40e 0000:01:00.2 em3: NIC Link is Down
[   26.781366] bond0: link status definitely down for interface em3, disabling it
[   27.057027] i40e 0000:01:00.2 em3: NIC Link is Up 10 Gbps Full Duplex, Flow Control: None
[   27.081064] bond0: link status definitely up for interface em3, 10000 Mbps full duplex
[   27.564049] device em1 entered promiscuous mode
[   27.564160] device em3 entered promiscuous mode
[   27.690440] bond0: link status up again after 0 ms for interface em1
[   27.818593] i40e 0000:01:00.2 em3: NIC Link is Down
[   27.823315] bond0: link status definitely down for interface em3, disabling it
[   28.198131] i40e 0000:01:00.2 em3: NIC Link is Up 10 Gbps Full Duplex, Flow Control: None
[   28.222943] bond0: link status definitely up for interface em3, 10000 Mbps full duplex


and:


[   28.844636] bond1: Adding slave em2
[   28.844648] i40e 0000:01:00.1 em2: already using mac address 14:9e:cf:2c:71:63
[   28.858938] i40e 0000:01:00.1 em2: NIC Link is Up 10 Gbps Full Duplex, Flow Control: None
[   28.859502] bond1: Enslaving em2 as a backup interface with an up link
[   29.215957] bond1: Adding slave em4
[   29.215968] i40e 0000:01:00.3 em4: set new mac address 14:9e:cf:2c:71:63
[   29.234158] i40e 0000:01:00.3 em4: NIC Link is Up 10 Gbps Full Duplex, Flow Control: None
[   29.234770] bond1: Enslaving em4 as a backup interface with an up link
[   29.241013] i40e 0000:01:00.3 em4: NIC Link is Down
[   29.336846] bond1: link status definitely down for interface em4, disabling it
[   29.756017] i40e 0000:01:00.3 em4: NIC Link is Up 10 Gbps Full Duplex, Flow Control: None
[   29.836358] bond1: link status definitely up for interface em4, 10000 Mbps full duplex
[   29.915285] device em2 entered promiscuous mode
[   29.915404] device em4 entered promiscuous mode
[   30.043148] bond1: link status up again after 0 ms for interface em2
[   30.173821] i40e 0000:01:00.3 em4: NIC Link is Down
[   30.176019] bond1: link status definitely down for interface em4, disabling it
[   30.583925] i40e 0000:01:00.3 em4: NIC Link is Up 10 Gbps Full Duplex, Flow Control: None
[   30.675528] bond1: link status definitely up for interface em4, 10000 Mbps full duplex




Is the switch properly configured for LACP (802.3ad) with ports em1+em3  and em2+em4 ? 


Maybe there were changes in the RHEL 7.3 kernel regarding how 802.3ad is handled?

Comment 39 Miguel Angel Ajo 2016-11-02 11:05:28 UTC
After all the dancing, bond0/bond1 seem to be up, but the network still doesn't work. I'm moving this to the openvswitch component for them to have an eye on it.

Here's a trace of the boot of one of the bonds:

Oct 18 21:15:30 overcloud-controller-1.localdomain kernel: Ethernet Channel Bonding Driver: v3.7.1 (April 27, 2011)
Oct 18 21:15:30 overcloud-controller-1.localdomain NetworkManager[1704]: <info>  (bond0): new Bond device (carrier: OFF, driver: 'bonding', ifindex: 17)
Oct 18 21:15:30 overcloud-controller-1.localdomain kernel: bond0: Setting MII monitoring interval to 100
Oct 18 21:15:30 overcloud-controller-1.localdomain kernel: bond0: Setting MII monitoring interval to 100
Oct 18 21:15:30 overcloud-controller-1.localdomain kernel: IPv6: ADDRCONF(NETDEV_UP): bond0: link is not ready
Oct 18 21:15:30 overcloud-controller-1.localdomain kernel: bond0: Setting MII monitoring interval to 100
Oct 18 21:15:30 overcloud-controller-1.localdomain kernel: bond0: Adding slave em1
Oct 18 21:15:30 overcloud-controller-1.localdomain kernel: i40e 0000:01:00.0 em1: already using mac address 14:9e:cf:2c:71:61
Oct 18 21:15:30 overcloud-controller-1.localdomain kernel: i40e 0000:01:00.0 em1: NIC Link is Up 10 Gbps Full Duplex, Flow Control: None
Oct 18 21:15:30 overcloud-controller-1.localdomain kernel: bond0: Enslaving em1 as a backup interface with an up link
Oct 18 21:15:30 overcloud-controller-1.localdomain NetworkManager[1704]: <info>  (bond0): bond slave em1 was enslaved
Oct 18 21:15:30 overcloud-controller-1.localdomain NetworkManager[1704]: <info>  (em1): enslaved to bond0
Oct 18 21:15:30 overcloud-controller-1.localdomain NetworkManager[1704]: <info>  (em1): link connected
Oct 18 21:15:30 overcloud-controller-1.localdomain kernel: bond0: Adding slave em3
Oct 18 21:15:30 overcloud-controller-1.localdomain kernel: i40e 0000:01:00.2 em3: set new mac address 14:9e:cf:2c:71:61
Oct 18 21:15:30 overcloud-controller-1.localdomain kernel: i40e 0000:01:00.2 em3: NIC Link is Up 10 Gbps Full Duplex, Flow Control: None
Oct 18 21:15:30 overcloud-controller-1.localdomain kernel: bond0: Enslaving em3 as a backup interface with an up link
Oct 18 21:15:30 overcloud-controller-1.localdomain kernel: IPv6: ADDRCONF(NETDEV_UP): bond0: link is not ready
Oct 18 21:15:30 overcloud-controller-1.localdomain kernel: IPv6: ADDRCONF(NETDEV_CHANGE): bond0: link becomes ready
Oct 18 21:15:30 overcloud-controller-1.localdomain NetworkManager[1704]: <info>  (bond0): bond slave em3 was enslaved
Oct 18 21:15:30 overcloud-controller-1.localdomain NetworkManager[1704]: <info>  (em3): enslaved to bond0
Oct 18 21:15:30 overcloud-controller-1.localdomain NetworkManager[1704]: <info>  (em3): link connected
Oct 18 21:15:30 overcloud-controller-1.localdomain NetworkManager[1704]: <info>  (bond0): link connected
Oct 18 21:15:30 overcloud-controller-1.localdomain kernel: i40e 0000:01:00.2 em3: NIC Link is Down
Oct 18 21:15:31 overcloud-controller-1.localdomain NetworkManager[1704]: <info>  (em3): link disconnected
Oct 18 21:15:31 overcloud-controller-1.localdomain kernel: bond0: link status definitely down for interface em3, disabling it
Oct 18 21:15:31 overcloud-controller-1.localdomain kernel: i40e 0000:01:00.2 em3: NIC Link is Up 10 Gbps Full Duplex, Flow Control: None
Oct 18 21:15:31 overcloud-controller-1.localdomain NetworkManager[1704]: <info>  (em3): link connected
Oct 18 21:15:31 overcloud-controller-1.localdomain kernel: bond0: link status definitely up for interface em3, 10000 Mbps full duplex
Oct 18 21:15:31 overcloud-controller-1.localdomain kernel: bond0: first active interface up!
Oct 18 21:15:31 overcloud-controller-1.localdomain ovs-vsctl[2233]: ovs|00001|vsctl|INFO|Called as ovs-vsctl -t 10 -- --if-exists del-port br-tenant bond0 -- add-port br-tenant bond0
Oct 18 21:15:31 overcloud-controller-1.localdomain kernel: device bond0 entered promiscuous mode
Oct 18 21:15:31 overcloud-controller-1.localdomain kernel: device em1 entered promiscuous mode
Oct 18 21:15:31 overcloud-controller-1.localdomain kernel: device em3 entered promiscuous mode
Oct 18 21:15:31 overcloud-controller-1.localdomain NetworkManager[1704]: <info>  (bond0): enslaved to non-master-type device ovs-system; ignoring
Oct 18 21:15:31 overcloud-controller-1.localdomain kernel: bond0: link status up again after 0 ms for interface em1
Oct 18 21:15:32 overcloud-controller-1.localdomain kernel: i40e 0000:01:00.2 em3: NIC Link is Down
Oct 18 21:15:32 overcloud-controller-1.localdomain NetworkManager[1704]: <info>  (em3): link disconnected
Oct 18 21:15:32 overcloud-controller-1.localdomain kernel: bond0: link status definitely down for interface em3, disabling it
Oct 18 21:15:32 overcloud-controller-1.localdomain ovs-vsctl[2263]: ovs|00001|vsctl|INFO|Called as ovs-vsctl -t 10 -- --may-exist add-br br-tenant
Oct 18 21:15:32 overcloud-controller-1.localdomain network[1761]: Bringing up interface bond0:  [  OK  ]
Oct 18 21:15:32 overcloud-controller-1.localdomain cloud-init[1689]: Cloud-init v. 0.7.6 running 'init-local' at Tue, 18 Oct 2016 21:15


And here's the config for the bonds:

ajo@mbp-ajo:~/Downloads/sosreport/ctl1$ tail etc/sysconfig/network-scripts/ifcfg-em*
==> etc/sysconfig/network-scripts/ifcfg-em1 <==
# This file is autogenerated by os-net-config
DEVICE=em1
ONBOOT=yes
HOTPLUG=no
NM_CONTROLLED=no
PEERDNS=no
MASTER=bond0
SLAVE=yes
BOOTPROTO=none

==> etc/sysconfig/network-scripts/ifcfg-em2 <==
# This file is autogenerated by os-net-config
DEVICE=em2
ONBOOT=yes
HOTPLUG=no
NM_CONTROLLED=no
PEERDNS=no
MASTER=bond1
SLAVE=yes
BOOTPROTO=none

==> etc/sysconfig/network-scripts/ifcfg-em3 <==
# This file is autogenerated by os-net-config
DEVICE=em3
ONBOOT=yes
HOTPLUG=no
NM_CONTROLLED=no
PEERDNS=no
MASTER=bond0
SLAVE=yes
BOOTPROTO=none

==> etc/sysconfig/network-scripts/ifcfg-em4 <==
# This file is autogenerated by os-net-config
DEVICE=em4
ONBOOT=yes
HOTPLUG=no
NM_CONTROLLED=no
PEERDNS=no
MASTER=bond1
SLAVE=yes
BOOTPROTO=none
ajo@mbp-ajo:~/Downloads/sosreport/ctl1$ tail etc/sysconfig/network-scripts/ifcfg-bond*
==> etc/sysconfig/network-scripts/ifcfg-bond0 <==
DEVICE=bond0
ONBOOT=yes
HOTPLUG=no
NM_CONTROLLED=no
PEERDNS=no
DEVICETYPE=ovs
TYPE=OVSPort
OVS_BRIDGE=br-tenant
MACADDR="14:9e:cf:2c:71:61"
BONDING_OPTS="mode=802.3ad miimon=100"

==> etc/sysconfig/network-scripts/ifcfg-bond1 <==
DEVICE=bond1
ONBOOT=yes
HOTPLUG=no
NM_CONTROLLED=no
PEERDNS=no
DEVICETYPE=ovs
TYPE=OVSPort
OVS_BRIDGE=br-ex
MACADDR="14:9e:cf:2c:71:63"
BONDING_OPTS="mode=802.3ad miimon=100"

Comment 40 Randy Perryman 2016-11-02 11:30:38 UTC
Switch is interesting, but we are seeing this on multiple installs with and without bonds.

Comment 41 Miguel Angel Ajo 2016-11-02 13:46:02 UTC
(In reply to Randy Perryman from comment #40)
> Switch is interesting, but we are seeing this on multiple installs with and
> without bonds.

Without bonds? 

Do we have any sosreport of this reproduced without the bonds to simplify the diagnostics?

Comment 42 Randy Perryman 2016-11-02 14:11:13 UTC
I believe Dan did this in VM's.

Comment 43 Randy Perryman 2016-11-02 14:12:12 UTC
(In reply to Dan Sneddon from comment #20)
> I believe I have reproduced this error in a virt environment, although I
> still haven't identified a root cause.
> 
> A potential workaround seems to be bouncing the affected bridge interface
> (br-ex in my case) after the update but before rebooting.
> 
>   sudo -i
>   ifdown eth0 ; ifup eth0 ; ifdown br-ex ; ifup br-ex ; ifup vlan10
> 
> After running the above command, I did have my ssh connection cut off, but
> the connectivity returned after a couple of seconds and I was able to reboot
> the controller without incident.
> 
> So it looks like the bridge gets in a bad way, and running ifdown/ifup on
> the bridge seems to fix the issue, but that doesn't explain why the bridge
> doesn't come up properly after a reboot when the workaround is not run.

Comment 45 Miguel Angel Ajo 2016-11-02 14:49:04 UTC
By looking at the packages, we found that the final RHEL status after the upgrade is not RHEL 7.3, but RHEL 7.2.z:

$ cat uname
Linux overcloud-controller-1.localdomain 3.10.0-327.36.2.el7.x86_64 #1 SMP Tue Sep 27 16:01:21 EDT 2016 x86_64 x86_64 x86_64 GNU/Linux


and 

openvswitch-2.4.0-2.el7_2.x86_64


while RHEL 7.3 (RC3 at least) has kernel-3.10.0-514.el7.x86_64.

Comment 44 also indicates this is happening with other OSP versions.

Comment 46 Miguel Angel Ajo 2016-11-02 16:49:56 UTC
As per @cascardo comments on IRC, could you please try adding:

OVS_EXTRA="set bridge ovsbr fail_mode=standalone"

to /etc/sysconfig/network-scripts/ifcfg-br-tenant ? 


He found that the culprit could be that, for some reason, the bridge is not removed and re-created on reboot (it was supposed to be destroyed when the network is stopped - can we verify this?),

and since we now [1] set the bridge to secure mode, no default "NORMAL" switching rule is installed in the bridge at boot-up (among other things), so all traffic arriving at br-tenant from the bonds (or external interfaces) is dropped.


[1] https://review.openstack.org/#/c/355315/

We should consider changing this bug back to rhel-osp-director to make sure we install that OVS_EXTRA setting by default to avoid something like this in the future.
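
Applied to the autogenerated ifcfg file, the workaround would look roughly like the following (an illustrative sketch, not a verified config: the bridge name and the other keys mirror the os-net-config-generated files in this environment, and only the OVS_EXTRA line is new):

```
# /etc/sysconfig/network-scripts/ifcfg-br-tenant (illustrative)
DEVICE=br-tenant
ONBOOT=yes
HOTPLUG=no
NM_CONTROLLED=no
DEVICETYPE=ovs
TYPE=OVSBridge
# Restores the default "NORMAL" switching behavior at boot if the bridge
# survives the reboot with fail_mode=secure still set.
OVS_EXTRA="set bridge br-tenant fail_mode=standalone"
```

Note that os-net-config will rewrite this file the next time it runs, so editing it by hand is a temporary workaround only.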

Comment 47 Miguel Angel Ajo 2016-11-02 16:52:16 UTC
Please add OVS_EXTRA="set bridge ovsbr fail_mode=standalone" to any OVS bridge ifcfg script (as per @cascardo's comments on IRC, it seems there are VLAN interfaces attached to br-ex).

Comment 48 Randy Perryman 2016-11-02 17:31:48 UTC
Initial testing is looking promising. All controllers rebooted with the network coming back - the first time this has happened. Looking at the network config files to add it to the deployment going forward.

Comment 49 Randy Perryman 2016-11-02 17:32:29 UTC
Should mention this is OSP 8 RHEL 7.2

Comment 50 Randy Perryman 2016-11-02 17:33:32 UTC
This points out that the gate needs to be updated to test for reboot failures.

Comment 52 Randy Perryman 2016-11-02 19:27:16 UTC
Hi,

I need help figuring out how to put this in my nic-configs.

Comment 54 Flavio Leitner 2016-11-03 12:49:18 UTC
(In reply to Manabu Ori from comment #53)
> I tried to
>   - delete "systemctl restart network" in rc.local
>   - add OVS_EXTRA="set bridge br-ex fail_mode=standalone" to ifcfg-br-ex
> and the symptom disappeared after reboot.

It seems that the OVS_EXTRA config alone would be enough, so there is no need to restart the network.

Could someone please confirm?

Comment 55 Flavio Leitner 2016-11-03 12:53:22 UTC
Also, the OVS by default is in 'in-band' mode:

[ovs-vswitchd.conf.db(5)]
 in-band
    In this mode, this controller’s OpenFlow traffic travels over the bridge associated with the controller.   With  this  setting,  Open vSwitch  allows  traffic  to and from the controller regardless of the contents of the OpenFlow flow table.  (Otherwise, Open vSwitch would never be able to connect to the controller, because it did not have a flow to enable it.)  This is the most  common  connection mode because it is not necessary to maintain two independent networks.

So, even in 'secure' mode OVS should allow talking with the controller.

Comment 56 Manabu Ori 2016-11-03 12:57:13 UTC
(In reply to Flavio Leitner from comment #54)
> (In reply to Manabu Ori from comment #53)
> > I tried to
> >   - delete "systemctl restart network" in rc.local
> >   - add OVS_EXTRA="set bridge br-ex fail_mode=standalone" to ifcfg-br-ex
> > and the symptom disappeared after reboot.
> 
> It seems that only the OVS_EXTRA config would be enough, so no need for
> restarting the network.
> 
> Could someone please confim?

Sorry for the confusion - only the OVS_EXTRA config was needed, and it worked well.

At first, "systemctl restart network" seemed to be a workaround and I wrote it in rc.local.
After that, I read this bz, removed the "systemctl restart network" from rc.local, and tried the OVS_EXTRA config, which succeeded.

Comment 57 Brent Eagles 2016-11-03 13:13:37 UTC
FWIW: To implement this workaround in the templates, you can set OVS_EXTRA on deployment in the network environment templates where br-ex is defined. For example, if using the environments/net-multiple-nics.yaml environment file, which pulls in network/config/multiple-nics/controller.yaml, you would change the part of network/config/multiple-nics/controller.yaml that looks like:
            -
              type: ovs_bridge
              name: {get_input: bridge_name}
              dns_servers: {get_param: DnsServers}
              use_dhcp: false
              addresses:
                -
                  ip_netmask: {get_param: ExternalIpSubnet}


to make it:

              type: ovs_bridge
              name: {get_input: bridge_name}
              dns_servers: {get_param: DnsServers}
              use_dhcp: false
              ovs_extra:
                  str_replace:
                      template: "set bridge BRIDGE fail_mode=standalone"
                      params:
                          BRIDGE: {get_input: bridge_name}
              addresses:
                -
                  ip_netmask: {get_param: ExternalIpSubnet}

(I think the syntax is correct...)

Comment 58 Jiri Benc 2016-11-03 13:40:13 UTC
(In reply to Flavio Leitner from comment #55)
> So, even in 'secure' mode OVS should allow talking with the controller.

AFAIK OpenStack/Neutron does not implement an OpenFlow controller, thus this doesn't really apply.

Comment 59 Miguel Angel Ajo 2016-11-03 17:20:08 UTC
(In reply to Jiri Benc from comment #58)
> (In reply to Flavio Leitner from comment #55)
> > So, even in 'secure' mode OVS should allow talking with the controller.
> 
> AFAIK OpenStack/Neutron does not implement OpenFlow controller thus this
> doesn't really apply.

It does now :) - we have a mode where neutron-openvswitch-agent sets itself as a local controller.

Comment 60 Brent Eagles 2016-11-03 17:38:41 UTC
@dsneddon: I checked with Miguel about whether or not it is safe to set the OVS_EXTRA info everywhere and it seems to be okay. With that in mind we might be better off making the change in os-net-config so it covers the situation where people are using customized network configuration templates. Thoughts?

Comment 61 Dan Sneddon 2016-11-03 18:44:01 UTC
(In reply to Brent Eagles from comment #60)
> @dsneddon: I checked with Miguel about whether or not it is safe to set the
> OVS_EXTRA info everywhere and it seems to be okay. With that in mind we
> might be better off making the change in os-net-config so it covers the
> situation where people are using customized network configuration templates.
> Thoughts?

The problem with that is that in existing deployments, if the network config changes (and adding that line to the ifcfg file counts as a change), then the network will be redeployed. What actually happens is that os-net-config notices that the ifcfg file is different, so it issues ifdown/ifup on the interface after writing the new configuration.

This may have additional impact during upgrades, so should be tested.

Comment 62 Brent Eagles 2016-11-03 19:12:27 UTC
Having taken a step back, I'm convinced that modifying os-net-config to do this is a bad idea. Injecting a default workaround into code in a manner that "hides it" is bad practice in general. Considering the fallout of future changes in openvswitch, neutron, etc. and even other possible uses of os-net-config, this just screams "DON'T".

In the interim, I think our best bet is to add ovs_extra data to the templates and document errata.

Comment 63 Manabu Ori 2016-11-04 03:27:27 UTC
(In reply to Brent Eagles from comment #57)
>               type: ovs_bridge
>               name: {get_input: bridge_name}
>               dns_servers: {get_param: DnsServers}
>               use_dhcp: false
>               ovs_extra:
>                   str_replace:
>                       template: "set bridge BRIDGE fail_mode=standalone"
>                       params:
>                           BRIDGE: {get_input: bridge_name}
>               addresses:
>                 -
>                   ip_netmask: {get_param: ExternalIpSubnet}
> 
> (I think the syntax is correct...)

I tried it with OSP8, but no luck...

<nic-configs/controller.yaml>
(snip)
              type: ovs_bridge
              name: {get_input: bridge_name}
              dns_servers: {get_param: DnsServers}
              ovs_extra:
                str_replace:
                  template: "set bridge BRIDGE fail_mode=standalone"
                  params:
                    BRIDGE: {get_input: bridge_name}
              members:
(snip)

<output of openstack overcloud deploy>
2016-11-04 03:05:27 [overcloud-Controller-chea4pnwnc2q-2-gcqoq7xsfd7i]: CREATE_FAILED  Resource CREATE failed: resources.NetworkConfig: Property error: resources.OsNetConfigImpl.properties.config: "str_replace" params must be strings or numbers
2016-11-04 03:05:28 [2]: CREATE_FAILED  resources.NetworkConfig: resources[2].Property error: resources.OsNetConfigImpl.properties.config: "str_replace" params must be strings or numbers

Comment 66 Brent Eagles 2016-11-04 18:15:16 UTC
I just discovered this u/s bz https://bugs.launchpad.net/heat/+bug/1344284 (actually indicated in one of the networking templates ...) that indicates that this particular method won't work as described. Working on alternatives.

Comment 75 Miguel Angel Ajo 2016-11-08 09:34:37 UTC
I suspect it could happen on older (RHEL 7.2) systems too, since we backported the patch that sets the bridge to secure mode to OSP 8; once that's applied and any controller is rebooted, this could manifest itself.

I've also experienced this issue yesterday in a packstack AIO deployment.

It doesn't mean that it's a neutron bug; it now affects both packstack and director - both need to make sure the secure mode is cleared from the bridges at boot, or use a separate bridge (independent of the neutron bridge) for node connectivity.

Comment 76 Miguel Angel Ajo 2016-11-08 09:45:02 UTC
I have created the corresponding packstack bug too:
https://bugzilla.redhat.com/show_bug.cgi?id=1392800

Comment 77 Randy Perryman 2016-11-08 12:11:46 UTC
What it affects:
•	OSP 8
•	OSP 9
•	OSP 10


Is there a permanent workaround?  No.

Is there anything special that needs to happen?  Yes, an interactive account with sudo privileges needs to be created.

Is there a workaround?  Only a one-time rework of the ifcfg files; os-net-config will reset them at its first opportunity.


What happens if a user reboots a controller and there is no interactive account or ifcfg fix?  ??

Does the criticality of this bug need to be updated?

Comment 78 Assaf Muller 2016-11-08 12:26:21 UTC
(In reply to Randy Perryman from comment #77)
> What it affects:
> •	OSP 8
> •	OSP 9
> •	OSP 10
> 
> 
> Is there a permanent work around?  No
> 
> Is there anything special that needs to happen?  Yes an interactive account
> needs to be created that has sudo privileges
> 
> Is there a workaround?  Only a one rework of the ifcfg files, OS-Net-Config
> will reset them at it’s first opportunity
> 
> 
> What happens if a user reboots a controller, and there is no interactive
> account or ifcfg fix?   ?? 
> 
> Does the critically of this bug need to be updated?

It's already marked as urgent/urgent and as a blocker. We already have people working on this around the clock.

Comment 81 Brent Eagles 2016-11-09 18:33:00 UTC
I examined this on a recent packaging of OSPd 10 and was able to reproduce what appears to be the same behavior. My evaluation environment is a standard virt setup, so a single NIC in a VM bridged to br-ex is used for the control plane as well as the external network, etc. In my virtual environment, the br-ex interface obtains its address via DHCP, and with the OVS bridge unable to move traffic, the address wasn't obtained, rendering the node completely unreachable via the network. I was able to log in via a console to verify. Without any additional changes, running "systemctl restart network" does configure the interface, and it seems to function properly afterwards. Simply cycling the interfaces does not appear to work unless "systemctl restart network" is performed first. If I restart the VM at this point, it will again be unreachable on the next reboot. For what it is worth, even though cycling the interface to obtain the IP address works, many of the OpenStack services had already failed to start by then. Modifying ifcfg-br-ex to change the fail_mode to standalone on boot seems to allow the IP address to be obtained on boot. It's worth noting that neutron changes the fail_mode back to secure at some point afterwards. Due to the timing, I was not able to determine with certainty whether fail_mode actually stays standalone for any appreciable period of time during startup before neutron has a chance to set it back - but this is probably an unimportant detail.

Controller and compute nodes are both unreachable with this network configuration. I'm not sure why restarting the network works, unless it opens a very small window where the br-ex bridge has the default fail_mode instead of secure. Please note that firing off a post-boot network restart does not seem to be a workable option: a lot of services are in a bad state and may not "come back to life" once network connectivity is restored. It pretty much has to happen at the usual time during network configuration.

For comparison, I performed the same experiment using CentOS and upstream TripleO and got the same results the first time I tried, but further attempts to get br-ex into a bad state failed. So on CentOS at least it may be timing dependent. The host system was under pretty significant load at the time of the first trial, so that may have been a factor. While multiple trials were performed with RHEL with other instances turned off, the network configuration failed consistently.

The OSP-d setup was:
Linux overcloud-controller-0.localdomain 3.10.0-493.el7.x86_64 #1 SMP Tue Aug 16 11:45:26 EDT 2016 x86_64 x86_64 x86_64 GNU/Linux
ovs-vsctl (Open vSwitch) 2.5.0
Compiled Jul 21 2016 10:24:02
DB Schema 7.12.1
NetworkManager

The CentOS setup was:
Linux overcloud-controller-0.localdomain 3.10.0-327.36.3.el7.x86_64 #1 SMP Mon Oct 24 16:09:20 UTC 2016 x86_64 x86_64 x86_64 GNU/Linux
ovs-vsctl (Open vSwitch) 2.5.0
Compiled Mar 18 2016 15:00:11
DB Schema 7.12.1
No NetworkManager

As CentOS seems to work at least most of the time, it seems that the fail_mode alone isn't the root cause, but some combination of fail_mode and some other behavior - perhaps a race condition around load order on startup?

Given the premise that the secure fail_mode is an issue, the question that follows is: what is the openvswitch agent doing, and why isn't it doing some flow configuration? When using the ovs-ofctl OpenFlow driver, ryu chokes because the IP it expects to be configured on the host hasn't been assigned. The native driver seems to be equally unhappy, so possibly the change from ovs-ofctl to native isn't to blame. It would be difficult to say for certain without some other tests; in any case, the old configuration works no better, so it isn't a workaround.

I checked whether the changes to the loading of the bridge and br_netfilter modules were a factor by modifying my initramfs to load bridge and br_netfilter at boot, setting the sysctl parameters, etc. No effect; it was a "hail Mary" attempt anyway.

So my conclusion at this time is that the secure fail_mode *is* the culprit on RHEL. It is quite possibly only a problem because the control plane and management networks are all connected via a bridge that neutron in the overcloud "knows about". It's not clear to me that this would be an issue in environments where these networks are configured with bridges not managed by neutron in the overcloud; at the very least, neutron wouldn't be altering the fail_mode. At the moment, we don't have a clear way to work around this by temporarily configuring the br-ex bridge with the standalone fail_mode in the heat templates, because of a long-standing issue with intrinsic functions and values obtained via "get_input" (i.e. not a parameter to that heat template). I'll continue investigating how best to do that, but considering all of the variables, we should consider alternate solutions: possibly reverting the secure mode patch from neutron for the time being, or altering os-net-config to insert the fail_mode information by default.

Comment 82 Brent Eagles 2016-11-09 20:21:46 UTC
I figured out a way to do this. It involves patching os-net-config with a format operation that does a string replace on a template (which gets around the heat issue) and modifying the network configuration templates. Patches made directly to overcloud nodes proved the os-net-config side; a full test involving heat template deployment is in progress.
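
A minimal sketch of that approach, under the assumption that os-net-config expands a placeholder in each ovs_extra entry with the bridge name before the entries are joined into the OVS_EXTRA ifcfg line (the function and placeholder names here are illustrative, not the actual patch):

```python
# Hypothetical sketch: expand a bridge-name placeholder in ovs_extra
# entries on the os-net-config side, so heat templates don't need
# str_replace against get_input values.
def expand_ovs_extra(ovs_extra, bridge_name):
    # Each entry is a full "set bridge ..." command with a placeholder.
    return [entry.format(name=bridge_name) for entry in ovs_extra]

entries = expand_ovs_extra(["set bridge {name} fail_mode=standalone"], "br-ex")
# os-net-config joins multiple ovs-vsctl commands with " -- ".
line = 'OVS_EXTRA="%s"' % " -- ".join(entries)
print(line)  # OVS_EXTRA="set bridge br-ex fail_mode=standalone"
```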

Comment 83 arkady kanevsky 2016-11-09 20:26:49 UTC
Brent,
so what is the patch?
What is timeline to get into OSP9 and OSP10?
Arkady

Comment 84 Brent Eagles 2016-11-09 22:38:26 UTC
@Arkady, see the external tracker links for OpenStack gerrit. I don't have an ETA as these are very fresh, tested in my environment only for a small subset of configurations (2 so far), and as yet there haven't been any other eyes on them.

Comment 85 arkady kanevsky 2016-11-09 23:02:51 UTC
Thanks Brent.
See it now. 
Simple fix in too many places.

Comment 87 Randy Perryman 2016-11-10 12:37:02 UTC
Do we have a patch for the file /usr/lib/python2.7/site-packages/os_net_config/objects.py

for the images used in OSP 8?
rhosp-director-images-8.0-20160603.2.el7ost.noarch
rhosp-director-images-8.0-20160415.1.el7ost.noarch

Comment 89 Randy Perryman 2016-11-10 14:34:28 UTC
This will need to be back ported all the way to Liberty.

Comment 90 Randy Perryman 2016-11-10 18:21:33 UTC
Created attachment 1219472 [details]
Liberty objects.py

This is my first pass at the objects.py for the liberty release.

I have updated this file with:
virt-copy-in  -a overcloud-full.qcow2 ./objects.py /usr/lib/python2.7/site-packages/os_net_config/objects.py
Updated the qcow image for deployment
Updated the nic-configs

Comment 91 Randy Perryman 2016-11-10 19:34:22 UTC
So my fix is not working quite right. Here is an example of the generated br-ex ifcfg file (I did update my objects.py):


# This file is autogenerated by os-net-config
DEVICE=br-ex
ONBOOT=yes
HOTPLUG=no
NM_CONTROLLED=no
DEVICETYPE=ovs
TYPE=OVSBridge
OVS_EXTRA="s -- e -- t --   -- b -- r -- i -- d -- g -- e --   -- b -- r -- - -- e -- x --   -- f -- a -- i -- l -- _ -- m -- o -- d -- e -- = -- s -- t -- a -- n -- d -- a -- l -- o -- n -- e"
DNS1=8.8.8.8
DNS2=8.8.4.4

Comment 92 Brent Eagles 2016-11-10 20:05:12 UTC
Hi Randy, while it looks like your heat template data isn't quite right, this could be because it's Liberty. There could be incompatibilities in the interacting parts.

Comment 93 Randy Perryman 2016-11-10 21:19:48 UTC
Created attachment 1219551 [details]
The Controller file that I used

Comment 94 Randy Perryman 2016-11-10 21:39:21 UTC
Okay, in the files impl_eni.py and impl_ifcfg.py there is the following:


impl_ifcfg.py:        ovs_extra = []
impl_ifcfg.py:                ovs_extra.append("set bridge %s other-config:hwaddr=%s" %
impl_ifcfg.py:            ovs_extra.extend(base_opt.ovs_extra)
impl_ifcfg.py:            ovs_extra.extend(base_opt.ovs_extra)
impl_ifcfg.py:        if ovs_extra:
impl_ifcfg.py:            data += "OVS_EXTRA=\"%s\"\n" % " -- ".join(ovs_extra)



where it is joining the entries with " -- ".

Any ideas on how to work around this?

Comment 95 Randy Perryman 2016-11-10 22:10:47 UTC
(In reply to Randy Perryman from comment #94)
> Okay in files impl_ine.py and impl_ifcfg.py there is the following:
> 
> 
> impl_ifcfg.py:        ovs_extra = []
> impl_ifcfg.py:                ovs_extra.append("set bridge %s
> other-config:hwaddr=%s" %
> impl_ifcfg.py:            ovs_extra.extend(base_opt.ovs_extra)
> impl_ifcfg.py:            ovs_extra.extend(base_opt.ovs_extra)
> impl_ifcfg.py:        if ovs_extra:
> impl_ifcfg.py:            data += "OVS_EXTRA=\"%s\"\n" % " --
> ".join(ovs_extra)
> 
> 
> 
> where it is adding " -- ".   
> 
> Ideas on how to work around this.

----------------
So the issue was my network config; I had the following line:

ovs_extra: set bridge br-ex fail_mode=standalone"

which the above code treated as a single string, so each letter, space, and quote character became a separate entry.

Changing the config to be: 
ovs_extra:
                - "set bridge br-ex fail_mode=standalone"
Plus updating the file objects.py, inserting the modification as discussed on Gerrit, but only as needed for Liberty.
My controllers now have the correct information in the OVS_EXTRA line.
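The mangled OVS_EXTRA line is ordinary Python str.join behavior; this is a minimal sketch of what the code quoted in comment 94 ends up doing in both cases (the " -- " join format is taken from that snippet):

```python
# impl_ifcfg.py effectively does this before writing the ifcfg file:
#   data += 'OVS_EXTRA="%s"\n' % " -- ".join(ovs_extra)

# If ovs_extra is mistakenly a plain string (the original nic-config),
# str.join iterates over its characters, producing the garbage seen in
# the br-ex ifcfg file above:
bad = "set bridge br-ex fail_mode=standalone"
print(" -- ".join(bad))   # s -- e -- t --   -- b -- r -- i -- d ...

# If ovs_extra is a list of command strings (the corrected YAML form),
# the join leaves each command intact:
good = ["set bridge br-ex fail_mode=standalone"]
print(" -- ".join(good))  # set bridge br-ex fail_mode=standalone
```

This is why the one-line YAML fix works: the block-sequence form ("- \"set bridge ...\"") parses as a list of strings, while the inline scalar form parses as a single string that join then iterates character by character.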

Comment 97 Randy Perryman 2016-11-11 18:33:08 UTC
On a new deployment of OSP 8.0, with the files locked to a specific version, changing the config to be:
ovs_extra:
                - "set bridge br-ex fail_mode=standalone"

for the bridges on the computes and controllers, the proper line was inserted into the ifcfg files.  

No other changes were needed. I do not know if OSP 9.0 will behave the same.

Comment 98 Randy Perryman 2016-11-11 18:35:42 UTC
New Question:

In a deployment that is already in place, how do you update the config files?

Comment 99 Dan Sneddon 2016-11-11 20:30:05 UTC
(In reply to Randy Perryman from comment #98)
> New Question:
> 
> In a deployment that is already in place, how do you update the config files?

You can temporarily set NetworkDeploymentActions: ['CREATE', 'UPDATE'] in the parameter_defaults: section of an environment file to update the ifcfg files during a stack update.

Since this issue only comes up when doing an initial install or update, it shouldn't be necessary to modify the ifcfg files of a running system. Instead, the ifcfg files can be updated as part of the update process, which should ensure that the networking works after a reboot when the update is complete.
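As a sketch, the environment file fragment Dan describes might look like the following (the file name is illustrative; NetworkDeploymentActions defaults to ['CREATE'], which is why ifcfg files are normally only written at initial deployment):

```yaml
# update-net-config.yaml (illustrative name) -- pass with "-e" on the
# "openstack overcloud deploy" command used for the stack update.
parameter_defaults:
  # Adding 'UPDATE' makes os-net-config rewrite the ifcfg files during
  # the stack update as well; remove it again afterwards to restore the
  # default behavior.
  NetworkDeploymentActions: ['CREATE', 'UPDATE']
```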

Comment 101 Marios Andreou 2016-11-15 12:11:06 UTC
Adding a note for reviewers, as there was movement on this last night. The approach proposed by Brent with https://review.openstack.org/#/c/396285/ (tripleo-heat-templates), dependent on https://review.openstack.org/#/c/395795/ (os-net-config), is being abandoned in favor of the new fix, also from Brent, at https://review.openstack.org/#/c/397405/ (os-net-config). This needs to go to master and then newton ASAP.

Updating external trackers - needinfo beagles please sanity check.

Comment 102 Brent Eagles 2016-11-16 10:34:39 UTC
The external tracker update looks good.

Comment 103 Ronelle Landy 2016-11-16 14:47:17 UTC
Is it possible to backport this latest fix to liberty and mitaka?
I understand there have been significant changes to os_net_config/objects.py.

Comment 104 Assaf Muller 2016-11-16 14:49:49 UTC
(In reply to Ronelle Landy from comment #103)
> Is it possible to backport this latest fix to liberty and mitaka?
> I understand there have been significant changes to os_net_config/objects.py.

That is the plan, yes. Look for upgrades on this RHBZ and the OSP 8 clone. The 8 and 9 fixes will not involve changes to os-net-config/ifcfg files; rather, they revert the OVS agent change that put bridges in secure mode.

Comment 105 Assaf Muller 2016-11-16 14:50:26 UTC
(In reply to Assaf Muller from comment #104)
> (In reply to Ronelle Landy from comment #103)
> > Is it possible to backport this latest fix to liberty and mitaka?
> > I understand there have been significant changes to os_net_config/objects.py.
> 
> That is the plan, yes. Look for upgrades on this RHBZ and the OSP 8 clone.
> The 8 and 9 fixes will not involve changes to os-net-config/ifcfg files,
> rather revert the OVS agent change that put bridges in secure mode.

updates, that is, not upgrades.

Comment 106 Randy Perryman 2016-11-16 15:26:49 UTC
A couple of questions: will the proposed patches affect network performance, e.g. for VLANs on a bridge, an IP on that bridge, etc.?

For existing deployments going from 9 to 10, will there be an upgrade path?

I understand the resolution for 8/9 will be to not turn on secure mode, so we will not need to add the ovs_extra to any file.

Thank you

Comment 109 Sean Merrow 2016-11-18 13:49:12 UTC
The backports for this into OSP 8 and 9 have been completed and will be available in the next puddle and in the RC due out Dec. 1

Comment 113 Jaromir Coufal 2016-11-21 20:24:14 UTC
Agreed on closing this bug in favor of BZ 1394890. I will let somebody from the Neutron DFG perform this action since they own that BZ.

Comment 114 Assaf Muller 2016-11-21 20:50:37 UTC

*** This bug has been marked as a duplicate of bug 1394890 ***

