Bug 1416805 - [RFE] notify Engine on ad_partner_mac change
Summary: [RFE] notify Engine on ad_partner_mac change
Keywords:
Status: CLOSED WONTFIX
Alias: None
Product: vdsm
Classification: oVirt
Component: General
Version: 4.19.6
Hardware: Unspecified
OS: Unspecified
Priority: low
Severity: medium
Target Milestone: ---
Target Release: ---
Assignee: Edward Haas
QA Contact: Meni Yakove
URL:
Whiteboard:
Depends On:
Blocks:
 
Reported: 2017-01-26 13:35 UTC by Mor
Modified: 2019-11-13 08:08 UTC
CC List: 5 users

Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed: 2018-06-20 07:23:47 UTC
oVirt Team: Network
rule-engine: ovirt-4.4?
rule-engine: planning_ack?
rule-engine: devel_ack?
rule-engine: testing_ack?


Attachments
ip monitor command log during bond setup (11.67 KB, text/plain)
2017-02-26 13:54 UTC, Mor

Description Mor 2017-01-26 13:35:06 UTC
Description of problem:
When configuring an invalid LACP bond (a "bad bond": a bond with one or more interfaces connected to non-LACP switch ports), the engine's REST API sometimes reports a zero MAC address for ad_partner_mac and sometimes does not report the value at all.

Version-Release number of selected component (if applicable):
Red Hat Virtualization Manager Version: 4.1.0.2-0.2.el7

How reproducible:
100%

Steps to Reproduce:
1. Create a bond of two VDS interfaces that are connected to non-LACP ports.
2. Try to get the ad_partner_mac over REST from: http://<server>/ovirt-engine/api/hosts/<host_id>/nics (a query sketch follows below).
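
A minimal sketch of such a query (not part of the original report; the engine URL, credentials and host id are placeholders/assumptions), using Python's requests and ElementTree:

import requests
import xml.etree.ElementTree as ET

ENGINE = "https://<server>/ovirt-engine/api"
HOST_ID = "<host_id>"

resp = requests.get(
    "%s/hosts/%s/nics" % (ENGINE, HOST_ID),
    auth=("admin@internal", "password"),   # assumed credentials
    headers={"Accept": "application/xml"},
    verify=False,                          # assuming a self-signed engine CA
)
resp.raise_for_status()

for nic in ET.fromstring(resp.content).findall("host_nic"):
    bonding = nic.find("bonding")
    if bonding is None:
        continue
    partner = bonding.find("ad_partner_mac/address")
    # A "bad bond" shows up either as 00:00:00:00:00:00 or with no element at all.
    print(nic.findtext("name"),
          partner.text if partner is not None else "<not reported>")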

Actual results:

Here is the query output for the bad bond, named bond1. There is no ad_partner_mac value.

<host_nic href="/ovirt-engine/api/hosts/582651bb-c45a-4ee2-921c-cbbbee75ac9b/nics/35eb7078-c817-4749-87c7-bada1b242df9" id="35eb7078-c817-4749-87c7-bada1b242df9">
<actions/>
<name>bond1</name>
<link href="/ovirt-engine/api/hosts/582651bb-c45a-4ee2-921c-cbbbee75ac9b/nics/35eb7078-c817-4749-87c7-bada1b242df9/networklabels" rel="networklabels"/>
<link href="/ovirt-engine/api/hosts/582651bb-c45a-4ee2-921c-cbbbee75ac9b/nics/35eb7078-c817-4749-87c7-bada1b242df9/networkattachments" rel="networkattachments"/>
<link href="/ovirt-engine/api/hosts/582651bb-c45a-4ee2-921c-cbbbee75ac9b/nics/35eb7078-c817-4749-87c7-bada1b242df9/statistics" rel="statistics"/>
<bonding>
<options>
<option>
<name>mode</name>
<type>Dynamic link aggregation (802.3ad)</type>
<value>4</value>
</option>
<option>
<name>miimon</name>
<value>100</value>
</option>
<option>
<name>xmit_hash_policy</name>
<value>2</value>
</option>
</options>
<slaves>
<host_nic href="/ovirt-engine/api/hosts/582651bb-c45a-4ee2-921c-cbbbee75ac9b/nics/cfe447b8-2dd0-4615-a128-bdf5b2c9b1c3" id="cfe447b8-2dd0-4615-a128-bdf5b2c9b1c3"/>
<host_nic href="/ovirt-engine/api/hosts/582651bb-c45a-4ee2-921c-cbbbee75ac9b/nics/947e380d-782c-4a7e-a4e1-fb32020a81be" id="947e380d-782c-4a7e-a4e1-fb32020a81be"/>
</slaves>
</bonding>
<boot_protocol>none</boot_protocol>
<bridged>false</bridged>
<ip>
<address/>
<netmask/>
<version>v4</version>
</ip>
<ipv6_boot_protocol>none</ipv6_boot_protocol>
<mac>
<address>00:e0:ed:30:27:4c</address>
</mac>
<mtu>1500</mtu>
<status>up</status>
<host href="/ovirt-engine/api/hosts/582651bb-c45a-4ee2-921c-cbbbee75ac9b" id="582651bb-c45a-4ee2-921c-cbbbee75ac9b"/>
</host_nic>
<host_nic href="/ovirt-engine/api/hosts/582651bb-c45a-4ee2-921c-cbbbee75ac9b/nics/f249139a-b8ae-47b8-8240-edc92c1ff7e1" id="f249139a-b8ae-47b8-8240-edc92c1ff7e1">
<actions/>
<name>enp1s0f0</name>
<link href="/ovirt-engine/api/hosts/582651bb-c45a-4ee2-921c-cbbbee75ac9b/nics/f249139a-b8ae-47b8-8240-edc92c1ff7e1/networklabels" rel="networklabels"/>
<link href="/ovirt-engine/api/hosts/582651bb-c45a-4ee2-921c-cbbbee75ac9b/nics/f249139a-b8ae-47b8-8240-edc92c1ff7e1/networkattachments" rel="networkattachments"/>
<link href="/ovirt-engine/api/hosts/582651bb-c45a-4ee2-921c-cbbbee75ac9b/nics/f249139a-b8ae-47b8-8240-edc92c1ff7e1/statistics" rel="statistics"/>
<boot_protocol>dhcp</boot_protocol>
<bridged>true</bridged>
<custom_configuration>false</custom_configuration>
<ip>
<address>10.35.128.27</address>
<gateway>10.35.128.254</gateway>
<netmask>255.255.255.0</netmask>
<version>v4</version>
</ip>
<ipv6>
<address>2620:52:0:2380:225:90ff:fec6:4052</address>
<gateway>fe80:52:0:2380::fe</gateway>
<netmask>64</netmask>
<version>v6</version>
</ipv6>
<ipv6_boot_protocol>autoconf</ipv6_boot_protocol>
<mac>
<address>00:25:90:c6:40:52</address>
</mac>
<mtu>1500</mtu>
<properties/>
<speed>1000000000</speed>
<status>up</status>
<host href="/ovirt-engine/api/hosts/582651bb-c45a-4ee2-921c-cbbbee75ac9b" id="582651bb-c45a-4ee2-921c-cbbbee75ac9b"/>
<network href="/ovirt-engine/api/networks/e041a045-a06f-491d-bd2b-a18be89fb85e" id="e041a045-a06f-491d-bd2b-a18be89fb85e"/>
</host_nic>


Expected results:


Additional info:

Comment 1 Dan Kenigsberg 2017-01-26 14:07:35 UTC
(In reply to Mor from comment #0)

The following two statements contradict each other.

> ... engine in REST sometimes
> reports a zero MAC address and sometimes it doesn't report a value at all.


> How reproducible:
> 100%


Please attach engine/vdsm logs for the expected case (partner mac is reported as 00:00:00:00:00:00) as well as for the buggy case (partner mac not reported at all).

Most interesting is the output of getVdsCaps in both cases - I suspect the problem lies in Vdsm, but I would like to confirm that.

Comment 2 Mor 2017-01-26 14:34:30 UTC
(In reply to Dan Kenigsberg from comment #1)
> Please attach engine/vdsm logs for the expected case (partner mac is
> reported as 00:00:00:00:00:00) as well as the buggy case (partner mac not
> reported at all)
> 
> Most interesting is the output of getVdsCaps on both cases - I suspect the
> problem lies in Vdsm, but I would like to confirm that.

Sure, I will upload them.

Comment 5 Marcin Mirecki 2017-02-15 09:03:24 UTC
The problem is visible right after creating a bond, but seems to disappear when a get_caps is called on the host.
It looks like the first get_caps done as part of SetupNetworks does not pick up the partner_mac, so it is not reported. During attempts to reproduce, each time doing "refresh capabilities" on the host retrieved the proper partner_mac.

It was only possible to reproduce this using SR-IOV devices (this is probably not relevant, but I'm mentioning it just in case).

Comment 6 Dan Kenigsberg 2017-02-20 10:16:46 UTC
Can you confirm comment 5, Mor?

If this is the case, Vdsm would need to generate an event when ad_partner_mac changes. And until we do that, we'd request users to click the "refresh capabilities" button.
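
A minimal sketch of the kind of monitor such an event would require (this is not vdsm code; the sysfs polling interval and the notify() callback are assumptions):

import time

def read_partner_mac(bond):
    # 802.3ad bonds expose the partner MAC in sysfs; the read may fail or
    # return zeros while LACP is still negotiating.
    try:
        with open("/sys/class/net/%s/bonding/ad_partner_mac" % bond) as f:
            return f.read().strip()
    except IOError:
        return None

def monitor(bond, notify, interval=5):
    last = read_partner_mac(bond)
    while True:
        time.sleep(interval)
        current = read_partner_mac(bond)
        if current != last:
            # notify() stands in for whatever mechanism would push the
            # event up to Engine.
            notify({"bond": bond, "ad_partner_mac": current})
            last = current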

Comment 7 Mor 2017-02-20 12:06:31 UTC
(In reply to Dan Kenigsberg from comment #6)
> Can you confirm comment 5, Mor?
> 
> If this is the case, Vdsm would need to generate an event when
> ad_partner_mac changes. And until we do that, we'd request users to click
> the "refresh capabilities" button.

Yes, I can confirm that we saw that happening. I just need to check in the automation whether it really resolves the issue, until you add an event to handle it.

Comment 8 Mor 2017-02-20 12:59:32 UTC
(In reply to Dan Kenigsberg from comment #6)
> Can you confirm comment 5, Mor?
> 
> If this is the case, Vdsm would need to generate an event when
> ad_partner_mac changes. And until we do that, we'd request users to click
> the "refresh capabilities" button.

I tried to run it several times with refresh capabilities on the affected hosts and it seems to be working correctly.

Comment 9 Mor 2017-02-26 13:54:12 UTC
Created attachment 1257825 [details]
ip monitor command log during bond setup

bond1 is not reporting ad_partner_mac in REST.

# cat /proc/net/bonding/bond1
Ethernet Channel Bonding Driver: v3.7.1 (April 27, 2011)

Bonding Mode: IEEE 802.3ad Dynamic link aggregation
Transmit Hash Policy: layer2+3 (2)
MII Status: up
MII Polling Interval (ms): 100
Up Delay (ms): 0
Down Delay (ms): 0

802.3ad info
LACP rate: slow
Min links: 0
Aggregator selection policy (ad_select): stable
System priority: 65535
System MAC address: 00:e0:ed:30:27:4e
Active Aggregator Info:
	Aggregator ID: 1
	Number of ports: 1
	Actor Key: 9
	Partner Key: 1
	Partner Mac Address: 00:00:00:00:00:00

Slave Interface: enp2s0f2
MII Status: up
Speed: 1000 Mbps
Duplex: full
Link Failure Count: 0
Permanent HW addr: 00:e0:ed:30:27:4e
Slave queue ID: 0
Aggregator ID: 1
Actor Churn State: none
Partner Churn State: churned
Actor Churned Count: 0
Partner Churned Count: 1
details actor lacp pdu:
    system priority: 65535
    system mac address: 00:e0:ed:30:27:4e
    port key: 9
    port priority: 255
    port number: 1
    port state: 77
details partner lacp pdu:
    system priority: 65535
    system mac address: 00:00:00:00:00:00
    oper key: 1
    port priority: 255
    port number: 1
    port state: 1

Slave Interface: enp2s0f3
MII Status: up
Speed: 1000 Mbps
Duplex: full
Link Failure Count: 0
Permanent HW addr: 00:e0:ed:30:27:4f
Slave queue ID: 0
Aggregator ID: 2
Actor Churn State: churned
Partner Churn State: churned
Actor Churned Count: 1
Partner Churned Count: 1
details actor lacp pdu:
    system priority: 65535
    system mac address: 00:e0:ed:30:27:4e
    port key: 9
    port priority: 255
    port number: 2
    port state: 69
details partner lacp pdu:
    system priority: 65535
    system mac address: 00:00:00:00:00:00
    oper key: 1
    port priority: 255
    port number: 1
    port state: 1
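
For reference, a small helper (an assumption, not part of vdsm) that reads the same "Partner Mac Address" field shown above and distinguishes a failed LACP negotiation from a missing value:

import re

def partner_mac(bond):
    with open("/proc/net/bonding/%s" % bond) as f:
        m = re.search(r"Partner Mac Address: (\S+)", f.read())
    return m.group(1) if m else None

mac = partner_mac("bond1")
if mac is None:
    print("ad_partner_mac not reported")
elif mac == "00:00:00:00:00:00":
    print("LACP negotiation failed (no partner)")
else:
    print("partner MAC:", mac)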

Comment 10 Edward Haas 2017-04-24 13:17:40 UTC
This makes no sense to me; monitoring and sending events on status changes of connections to peer devices is closer to a routing distribution algorithm than to a management system.

I think that the parameters are not reported if the LACP negotiation has failed.
So it is up to the management system to mark the bond as "not fully connected" or something like that and visualize it to the user.
The user can then choose to ignore it as a known issue or fix it.

Going back to the original problem with the initial creation of bonds, it makes sense to me for the management system to re-query the host after N secs to confirm the data before declaring that something is wrong.

Comment 11 Dan Kenigsberg 2017-04-25 08:16:43 UTC
Engine cannot guess if, due to a switch-side reconfiguration, a formerly-working bond is now out of sync. Only the host, which participates in the LACP protocol, knows that, and it should inform management. The same logic applies to the bad-to-good transition.

Comment 12 Edward Haas 2017-04-25 09:06:53 UTC
(In reply to Dan Kenigsberg from comment #11)
> Engine cannot guess if due to a switch-side reconfiguration a
> formerly-working bond is now out of sync. Only the host, which participated
> in LACP protocol, knows that, and should inform management. Same logic apply
> to the bad-to-good transition.

By that you assume that the controller/management has defined the desired state, e.g. to what VLAN or to what exact peer (MAC) the NIC should be connected.
If that is the requirement, then I would propose something like:
- Send the agent the desired state.
- Agent monitors the current state periodically and in case of a diff, report it as an event back to the controller.
This reduces the load on the controller/management from receiving events about things it may not even care about.

We need something that scales, is easy to manage and simple to extend. RFE? :)
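
A rough sketch of that desired-state approach (all names here are hypothetical, just to illustrate the proposal): the controller pushes the desired state once, the agent samples the current state periodically and reports only the diff as an event:

def diff(desired, current):
    # keys whose current value deviates from the desired one
    return {k: current.get(k) for k, v in desired.items() if current.get(k) != v}

def check_and_report(desired, read_current_state, send_event):
    current = read_current_state()           # e.g. {"bond1/ad_partner_mac": "..."}
    changes = diff(desired, current)
    if changes:
        send_event({"deviations": changes})  # only what the controller asked about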

Comment 13 Mor 2017-05-28 08:21:19 UTC
Should I open an RFE?

Comment 14 Dan Kenigsberg 2017-05-28 08:48:14 UTC
No, Edy was referring to a mythical management system that does not have a proper design yet. What it needs is a request for definition, not a request for extension.

Comment 15 Yaniv Kaul 2017-06-06 21:45:16 UTC
(In reply to Dan Kenigsberg from comment #11)
> Engine cannot guess if due to a switch-side reconfiguration a
> formerly-working bond is now out of sync. Only the host, which participated
> in LACP protocol, knows that, and should inform management. Same logic apply
> to the bad-to-good transition.

You could do a ping... There's an RFE to test end-to-end networks.

Comment 16 Yaniv Kaul 2018-02-28 16:17:54 UTC
(In reply to Dan Kenigsberg from comment #14)
> No, Edy was referring to to a mythical management system, that does not have
> a proper design yet. What it needs is a request for definition, not request
> for extension.

Can you defer this to 4.4 for the time being?

Comment 17 Yaniv Lavi 2018-06-20 07:23:47 UTC
Closing this RFE.
Please reopen if still needed.
Patches are welcome.

