Bug 1576628

Summary: default route validation fails when swapping default route networks
Product: Red Hat Enterprise Virtualization Manager
Component: vdsm
Version: 4.1.11
Hardware: x86_64
OS: Linux
Status: CLOSED CURRENTRELEASE
Severity: medium
Priority: unspecified
Target Milestone: ---
Target Release: ---
Reporter: Germano Veit Michel <gveitmic>
Assignee: Edward Haas <edwardh>
QA Contact: Meni Yakove <myakove>
Docs Contact:
CC: danken, edwardh, lsurette, mburman, srevivo, ycui, ykaul
Whiteboard:
Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Story Points: ---
Clone Of:
Environment:
Last Closed: 2018-06-24 07:23:20 UTC
Type: Bug
Regression: ---
Mount Type: ---
Documentation: ---
CRM:
Verified Versions:
Category: ---
oVirt Team: Network
RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: ---
Target Upstream Version:
Embargoed:

Description Germano Veit Michel 2018-05-10 01:21:44 UTC
Description of problem:

This part of the code is quite different in 4.20, and I couldn't find any related BZ. Maybe it is already fixed there, but I found it important to report:

If the same setupNetworks command removes one default route network and adds a new one, the validation fails with: 'Only a singe default route network is allowed'.

New network: OCE02-NP-MGMT0  (being added)
Old network: on3462ee86c69c4 (being removed)

Looks like _validate_default_route (which was replaced in 4.2) doesn't look at networks being removed.
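
A minimal sketch of the fix this implies (illustrative only; the function arguments, the placeholder ConfigNetworkError class and the way the running default-route set is obtained are assumptions, not the actual vdsm 4.2 code): networks that the same request marks for removal should be excluded before enforcing the single-default-route rule.

class ConfigNetworkError(Exception):
    # Stand-in for vdsm's ConfigNetworkError, for this sketch only.
    def __init__(self, errcode, message):
        super(ConfigNetworkError, self).__init__(errcode, message)

def _validate_default_route(requested, running_default_route_nets):
    # requested: the networks dict passed to setupNetworks.
    # running_default_route_nets: set of network names that currently hold
    # the default route (e.g. taken from the persisted config).
    removed = {name for name, attrs in requested.items()
               if attrs.get('remove')}
    added_default = {name for name, attrs in requested.items()
                     if attrs.get('defaultRoute')}
    # A network being removed in this very request must not count as
    # still holding the default route.
    effective = (running_default_route_nets - removed) | added_default
    if len(effective) > 1:
        raise ConfigNetworkError(
            21, 'Only a single default route network is allowed.')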

MainProcess|jsonrpc/3::DEBUG::2018-05-08 18:15:40,903::api::204::root::(setupNetworks) Setting up network according to configuration: networks:{u'OCE02-NP-MGMT01': {u'ipv6autoconf': True, u'vlan': u'52', u'ipaddr': u'10.182.52.184', u'switch': u'legacy', u'mtu': 1500, u'bonding': u'bond0', u'dhcpv6': False, u'STP': u'no', u'bridged': u'true', u'netmask': u'255.255.252.0', u'gateway': u'10.182.52.1', u'defaultRoute': True}, u'on3462ee86c69c4': {u'remove': u'true'}}, bondings:{}, options:{u'connectivityCheck': u'true', u'connectivityTimeout': 120}

MainProcess|jsonrpc/3::ERROR::2018-05-08 18:15:40,909::supervdsmServer::94::SuperVdsm.ServerCallback::(wrapper) Error in setupNetworks
Traceback (most recent call last):
  File "/usr/share/vdsm/supervdsmServer", line 92, in wrapper
    res = func(*args, **kwargs)
  File "/usr/lib/python2.7/site-packages/vdsm/network/api.py", line 210, in setupNetworks
    ipvalidator.validate(networks)
  File "/usr/lib/python2.7/site-packages/vdsm/network/ip/validator.py", line 43, in validate
    _validate_default_route(default_route_nets, no_default_route_nets)
  File "/usr/lib/python2.7/site-packages/vdsm/network/ip/validator.py", line 53, in _validate_default_route
    'Only a singe default route network is allowed.')
ConfigNetworkError: (21, 'Only a singe default route network is allowed.')

var/lib/vdsm/persistence/netconf/nets/on3462ee86c69c4:    "defaultRoute": true

Version-Release number of selected component (if applicable):
vdsm-4.19.50-1.el7ev.x86_64

This is what the customer did:
1. Move host from cluster A to B (A and B have different default route networks)
2. Attempt to setup networks on host, swapping the default route networks.

Comment 1 Germano Veit Michel 2018-05-10 01:26:24 UTC
on3462ee86c69c4 was unmanaged at that point. That's why I added BZ1515880.

Comment 2 Meni Yakove 2018-05-14 08:41:38 UTC
Edy, danken thinks that in 4.2 vdsm would remove defaultRoute=True from the unmanaged network when adding it to another network, right?

Comment 3 Dan Kenigsberg 2018-05-16 07:42:22 UTC
Michael, could you try to reproduce this in 4.2? I think we have fixed it in Vdsm already.

Comment 4 Edward Haas 2018-05-16 07:56:14 UTC
I think this is the scenario raised here: https://bugzilla.redhat.com/show_bug.cgi?id=1522971

Comment 5 Michael Burman 2018-05-16 14:43:53 UTC
(In reply to Dan Kenigsberg from comment #3)
> Michael, could you try to reproduce this in 4.2? I think we have fixed it in
> Vdsm already.

The error no longer happens on 4.2, BUT it only partially works:
This is what the customer did, and it is what I did as well ->
1. Move host from cluster A to B (A and B have different default route networks)
2. Attempt to setup networks on host, swapping the default route networks.

Result - swapping the networks seems fine from the engine side (after setup networks and refresh caps the network is in sync), there are no vdsm errors about a duplicate default route (as Germano reported in the bug), and the icon moved to the correct network, BUT -
The default route was not actually updated properly on the vdsm side (both networks now have true), and the route is still via the unmanaged network.

Example - df1 network is the default route in cluster A
df11 is the default route in cluster B

This is how it looks after the host moves from cluster A to B:

from caps:
 "df1": {
            "ipv6autoconf": false, 
            "addr": "10.35.x.x", 
            "ipv4defaultroute": true, 

}, 
        "df11": {
            "ipv6autoconf": false, 
            "addr": "10.35.x.x", 
            "ipv4defaultroute": true,

Both networks have "ipv4defaultroute": true.

The actual route is still via df1 (unmanaged) and not df11:

[root@orchid-vds1 ~]# ping -I df1 8.8.8.8
PING 8.8.8.8 (8.8.8.8) from 10.35.x.x df1: 56(84) bytes of data.
64 bytes from 8.8.8.8: icmp_seq=1 ttl=57 time=58.2 ms
64 bytes from 8.8.8.8: icmp_seq=2 ttl=57 time=58.1 ms
^C
--- 8.8.8.8 ping statistics ---
2 packets transmitted, 2 received, 0% packet loss, time 1001ms
rtt min/avg/max/mdev = 58.152/58.177/58.202/0.025 ms

[root@orchid-vds1 ~]# ping -I df11 8.8.8.8
PING 8.8.8.8 (8.8.8.8) from 10.35.x.x df11: 56(84) bytes of data.
^C
--- 8.8.8.8 ping statistics ---
7 packets transmitted, 0 received, 100% packet loss, time 5999ms

Destination     Gateway         Genmask         Flags Metric Ref    Use Iface
0.0.0.0         10.35.x.x   0.0.0.0         UG    0      0        0 df1

* So, bottom line: it does not work, but the original error does not reproduce.
It gets much worse when trying to remove the unmanaged network: if the unmanaged network is removed from this state, df11 is reported as out-of-sync again, and it only becomes synced a few minutes later, after syncing it manually.
The unmanaged scenario is not handled properly at all.
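
A quick way to cross-check what vdsm reports against what the kernel actually uses (assuming a 4.2 host where the vdsm-client tool is available; the grep filter is just for readability):

# Kernel's view of the default route:
ip -4 route show default

# vdsm's view, as reported in caps:
vdsm-client Host getCapabilities | grep -B 3 ipv4defaultroute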

Comment 6 Michael Burman 2018-05-21 09:07:44 UTC
When the networks are on different subnets, as you requested, it only works if manually syncing the network on cluster B.

df1 vlan 162 in clusterA
df2 vlan 163 in clusterB

After the host moved to cluster B, df1 is unmanaged and df2 is out-of-sync (default route true/false).

The route is still via df1
}, 
        "df1": {
            "ipv6autoconf": false, 
            "addr": "10.35.129.161", 
            "ipv4defaultroute": true, 

"df2": {
            "ipv6autoconf": false, 
            "addr": "10.35.130.61", 
            "ipv4defaultroute": false, 

[root@orchid-vds1 ~]# ping 8.8.8.8 -I df1
PING 8.8.8.8 (8.8.8.8) from 10.35.129.161 df1: 56(84) bytes of data.
64 bytes from 8.8.8.8: icmp_seq=1 ttl=55 time=66.8 ms

- After manually syncing all networks, df1 is unmanaged, and df2 remains out-of-sync for a few minutes and then becomes synced.

[root@orchid-vds1 ~]# ping 8.8.8.8 -I df1
PING 8.8.8.8 (8.8.8.8) from 10.35.129.161 df1: 56(84) bytes of data.
^C
--- 8.8.8.8 ping statistics ---
2 packets transmitted, 0 received, 100% packet loss, time 999ms

[root@orchid-vds1 ~]# ping 8.8.8.8 -I df2
PING 8.8.8.8 (8.8.8.8) from 10.35.130.61 df2: 56(84) bytes of data.
64 bytes from 8.8.8.8: icmp_seq=1 ttl=55 time=66.8 ms
^C
--- 8.8.8.8 ping statistics ---
1 packets transmitted, 1 received, 0% packet loss, time 0ms
rtt min/avg/max/mdev = 66.875/66.875/66.875/0.000 ms

 "df1": {
            "ipv6autoconf": false, 
            "addr": "10.35.129.161", 
            "ipv4defaultroute": false, 

}, 
        "df2": {
            "ipv6autoconf": false, 
            "addr": "10.35.130.61", 
            "ipv4defaultroute": true, 

BUT this only works when syncing manually; the engine does not take care of it and does not sync the network with the default route property.

Comment 7 Michael Burman 2018-05-22 07:33:59 UTC
Summary:

- The original report no longer reproduces on 4.2

- I did find an issue in the described scenario with the handling of an 'unmanaged' network that used to be the default route network in cluster A. We keep reporting default route=true for this network unless it is removed manually.

Cluster A - net1 is the default route and doesn't exist in cluster B
Cluster B - net2 is the default route and was attached to the host (with bootproto) prior to the host move.

The correct way to work around this issue is:
1) Once the host has moved from cluster A (where net1 is the default route and does not exist in cluster B) to cluster B, remove the 'unmanaged' network net1 from the host (see the sketch after this list).
2) net2 is out-of-sync; sync all networks only after the 'unmanaged' network has been removed from the host.
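
For reference, a minimal sketch of what the removal in step 1 amounts to at the vdsm level, based on the request format shown in the setupNetworks log above (the network name net1 is taken from this example; normally the engine builds and sends this request when the network is removed in the Setup Networks dialog):

setupNetworks(networks={'net1': {'remove': True}},
              bondings={},
              options={'connectivityCheck': True, 'connectivityTimeout': 120})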

- I don't think we are going to handle this any time soon, as I understand from our dev guys.
As Edy requested, I believe this report can be closed. If a user/admin performs such operations (moving a host to a cluster with a different default route network), they should be aware that manual intervention will be required on their part.

Comment 8 Dan Kenigsberg 2018-06-24 07:23:20 UTC
To avoid the reported issue, please upgrade your hypervisor to 4.2

Comment 9 Franta Kust 2019-05-16 13:09:47 UTC
BZ<2>Jira Resync