Bug 1523661

Summary: When adding the host over an existing bond-VLAN, it loses network connection after reboot.
Product: [oVirt] vdsm
Component: General
Version: 4.20.9
Reporter: Bernhard Seidl <info>
Assignee: Edward Haas <edwardh>
QA Contact: Michael Burman <mburman>
CC: bugs, danken, info, mburman
Status: CLOSED CURRENTRELEASE
Severity: medium
Priority: high
Docs Contact:
Target Milestone: ovirt-4.2.1
Target Release: ---
Flags: rule-engine: ovirt-4.2+
Hardware: Unspecified
OS: Unspecified
Whiteboard:
Fixed In Version: upstream 4.2.1
Doc Type: If docs needed, set a value
Doc Text:
Story Points: ---
Clone Of:
Environment:
Last Closed: 2018-02-12 11:56:14 UTC
Type: Bug
Regression: ---
Mount Type: ---
Documentation: ---
CRM:
Verified Versions:
Category: ---
oVirt Team: Network
RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: ---
Target Upstream Version:
Embargoed:
Attachments:
Logfiles (flags: none)

Description Bernhard Seidl 2017-12-08 14:51:20 UTC
Description of problem:
The cluster member loses network connection after reboot. The network is set up using bonding and VLANs.

Version-Release number of selected component (if applicable):
ovirt-node-ng-installer-master-2017120709.iso
4.2.1-0.0.master.20171206161426.git88e9120.el7.centos
Might also apply to the 4.2.0 branch.

How reproducible:


Steps to Reproduce:
1. Set up oVirt Node with bonding and one VLAN on top for management, on three nodes
2. Set up the self-hosted engine with Gluster
3. Complete cluster setup by adding hosts and storage domains
4. Set one of the hosts in maintenance mode
5. Reboot this host

Actual results:
Reboot completes but the host is not accessible via the network.

Expected results:
Reboot completes and the host is available again.

Additional info:
Node network:
em3 + em4 => bond0 => bond0.78 => ovirtmgmt
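
For orientation, the pre-deployment initscripts configuration for such a topology would look roughly like the files below (a sketch only; the actual files from the failing node are not in this report, and the bonding options and BOOTPROTO values are assumptions):

# cat /etc/sysconfig/network-scripts/ifcfg-bond0
DEVICE=bond0
TYPE=Bond
BONDING_MASTER=yes
BONDING_OPTS="mode=active-backup miimon=100"
ONBOOT=yes
BOOTPROTO=none

# cat /etc/sysconfig/network-scripts/ifcfg-em3   (ifcfg-em4 analogous)
DEVICE=em3
MASTER=bond0
SLAVE=yes
ONBOOT=yes
BOOTPROTO=none

# cat /etc/sysconfig/network-scripts/ifcfg-bond0.78
DEVICE=bond0.78
VLAN=yes
ONBOOT=yes
BOOTPROTO=dhcp

Once the host is added, vdsm rewrites these files and puts the VLAN under the management bridge (compare the vdsm-generated files shown in comment 6).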

Comment 1 Dan Kenigsberg 2017-12-08 15:08:53 UTC
Please attach supervdsm and vdsm logs so we can look at the problem

Comment 2 Bernhard Seidl 2017-12-08 16:18:06 UTC
Created attachment 1364959 [details]
Logfiles

Comment 3 Edward Haas 2017-12-10 15:19:44 UTC
It seems that we are not correctly acquiring an external bond when it has a VLAN on top of it.
I'm not exactly sure why this causes a disconnection in the presented scenario, but fixing the acquisition should result in a more predictable setup after reboot.
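
A quick way to see what vdsm believes it has acquired and persisted for the bond (a sketch; the path is the same one referenced later in comment 6, and its layout may differ between versions):

# ls /var/lib/vdsm/persistence/netconf/bonds/
# cat /var/lib/vdsm/persistence/netconf/bonds/bond0

If bond0 is missing there, the external bond was not acquired.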

Comment 4 Bernhard Seidl 2017-12-11 14:03:23 UTC
I investigated a bit further: it seems that bond0 had been set down. After executing the following command, the connection works again:

# ip link set dev bond0 up
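
For anyone hitting the same state, the bond and VLAN status after reboot can be checked along these lines (generic commands, not from the original report):

# ip -d link show bond0
# cat /proc/net/bonding/bond0
# ip -4 addr show bond0.78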

Comment 5 Bernhard Seidl 2018-01-02 13:30:24 UTC
Just tested.

Version used: ovirt-node-ng-installer-master-2018010109

Result: Works, no errors.

Comment 6 Michael Burman 2018-01-14 10:49:11 UTC
Verified on:
vdsm-4.20.13-1.el7ev.x86_64
4.2.1.1-0.1.el7
cockpit-155-1.el7.x86_64

The Scenario is - 

1) Create bond1 with an IP and bond1.162 on top of it, also with an IP (both created via cockpit; a rough nmcli equivalent is sketched after the ping output below).
Both hosts have a default route before the host is added -
[root@silver-vdsb yum.repos.d]# route -n
Kernel IP routing table
Destination     Gateway         Genmask         Flags Metric Ref    Use Iface
0.0.0.0         10.35.1x8.254   0.0.0.0         UG    300    0        0 bond1
0.0.0.0         10.35.1x9.254   0.0.0.0         UG    400    0        0 bond1.162
10.35.1x8.0     0.0.0.0         255.255.255.0   U     300    0        0 bond1
10.35.1x9.0     0.0.0.0         255.255.255.0   U     400    0        0 bond1.162

[root@silver-vdsb yum.repos.d]# ip -4 a
1: lo: <LOOPBACK,UP,LOWER_UP> mtu 65536 qdisc noqueue state UNKNOWN qlen 1
    inet 127.0.0.1/8 scope host lo
       valid_lft forever preferred_lft forever
6: bond1: <BROADCAST,MULTICAST,MASTER,UP,LOWER_UP> mtu 1500 qdisc noqueue state UP qlen 1000
    inet 10.35.1x8.x/24 brd 10.35.1x8.255 scope global dynamic bond1
       valid_lft 42379sec preferred_lft 42379sec
7: bond1.162@bond1: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc noqueue state UP qlen 1000
    inet 10.35.1x9.x/24 brd 10.35.1x9.255 scope global dynamic bond1.162
       valid_lft 42409sec preferred_lft 42409sec

[root@silver-vdsb yum.repos.d]# ping -I bond1 8.8.8.8
PING 8.8.8.8 (8.8.8.8) from 10.35.1x8.x bond1: 56(84) bytes of data.
64 bytes from 8.8.8.8: icmp_seq=1 ttl=48 time=62.5 ms

[root@silver-vdsb yum.repos.d]# ping -I bond1.162 8.8.8.8
PING 8.8.8.8 (8.8.8.8) from 10.35.1x9.x bond1.162: 56(84) bytes of data.
64 bytes from 8.8.8.8: icmp_seq=1 ttl=48 time=128 ms
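
For reference, an approximate command-line equivalent of the cockpit setup in step 1 would be something like the following nmcli calls (a sketch; the exact cockpit-generated connection settings are not shown in this report, while mode 1 = active-backup, primary=eno1 and DHCP on both interfaces are taken from the generated files below):

# nmcli con add type bond con-name bond1 ifname bond1 bond.options "mode=active-backup,miimon=100,primary=eno1" ipv4.method auto
# nmcli con add type ethernet con-name bond1-eno1 ifname eno1 master bond1
# nmcli con add type ethernet con-name bond1-eno2 ifname eno2 master bond1
# nmcli con add type vlan con-name bond1.162 dev bond1 id 162 ipv4.method auto
# nmcli con up bond1
# nmcli con up bond1.162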

2) Add the host to RHV on top of the VLAN bond1.162
3) The ovirtmgmt network is configured on top of bond1.162 -
[root@silver-vdsb ~]# brctl show
bridge name     bridge id               STP enabled     interfaces
;vdsmdummy;             8000.000000000000       no
ovirtmgmt             8000.001d096871c1       no              bond1.162

4) vdsm takes ownership of bond1 -
[root@silver-vdsb ~]# cat /etc/sysconfig/network-scripts/ifcfg-bond1
# Generated by VDSM version 4.20.13-1.el7ev
DEVICE=bond1
BONDING_OPTS='mode=1 miimon=100 primary=eno1'
MACADDR=00:1d:09:68:71:c1
ONBOOT=yes
BOOTPROTO=dhcp
MTU=1500
DEFROUTE=yes
NM_CONTROLLED=no
IPV6INIT=yes
IPV6_AUTOCONF=yes
[root@silver-vdsb ~]# cat /etc/sysconfig/network-scripts/ifcfg-bond1.162 
# Generated by VDSM version 4.20.13-1.el7ev
DEVICE=bond1.162
VLAN=yes
BRIDGE=new-default
ONBOOT=yes
MTU=1500
DEFROUTE=no
NM_CONTROLLED=no
IPV6INIT=no

5) and persists the MACADDR in /var/lib/vdsm/persistence/netconf/bonds/bond1 -
{
    "hwaddr": "00:1d:09:68:71:c1", 
    "nics": [
        "eno1", 
        "eno2"
    ], 
    "switch": "legacy", 
    "options": "mode=1 miimon=100 primary=eno1"

6) After adding the host, the default route goes only via bond1.162, on which the management network is configured.
bond1 keeps its IP, but has no default route after the host is added (we set the default route for the management network). A quick check is sketched after this list.

7) Survives reboot
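
To confirm that only the management VLAN holds the default route (not part of the original verification output; command only, expected output not reproduced here):

# ip route show default

After the host is added, this should list a single default route, via dev bond1.162.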

Comment 7 Sandro Bonazzola 2018-02-12 11:56:14 UTC
This bug is included in the oVirt 4.2.1 release, published on Feb 12th 2018.

Since the problem described in this bug report should be
resolved in the oVirt 4.2.1 release, it has been closed with a resolution of CURRENT RELEASE.

If the solution does not work for you, please open a new bug report.