Bug 1281666
| Summary: | [RFE] Engine should warn admin about bad 802.3ad status | | |
|---|---|---|---|
| Product: | Red Hat Enterprise Virtualization Manager | Reporter: | Germano Veit Michel <gveitmic> |
| Component: | RFEs | Assignee: | Marcin Mirecki <mmirecki> |
| Status: | CLOSED ERRATA | QA Contact: | Mor <mkalfon> |
| Severity: | medium | Docs Contact: | |
| Priority: | medium | | |
| Version: | unspecified | CC: | alkaplan, danken, gklein, inetkach, lsurette, mburman, mkalinin, mmirecki, myakove, penguin.wrangler, rbalakri, sigbjorn, srevivo, ykaul, ylavi |
| Target Milestone: | ovirt-4.0.2 | Keywords: | FutureFeature |
| Target Release: | --- | Flags: | mburman: testing_plan_complete+ |
| Hardware: | All | | |
| OS: | Linux | | |
| Whiteboard: | | | |
| Fixed In Version: | | Doc Type: | Enhancement |
| Doc Text: | | Story Points: | --- |
| Clone Of: | | | |
| : | 1317457 (view as bug list) | Environment: | |
| Last Closed: | 2016-08-23 20:30:30 UTC | Type: | Bug |
| Regression: | --- | Mount Type: | --- |
| Documentation: | --- | CRM: | |
| Verified Versions: | | Category: | --- |
| oVirt Team: | Network | RHEL 7.3 requirements from Atomic Host: | |
| Cloudforms Team: | --- | Target Upstream Version: | |
| Embargoed: | | | |
| Bug Depends On: | 1240719, 1317457 | | |
| Bug Blocks: | 902971 | | |
| Attachments: | Bond tooltip (attachment 1186700) | | |
Description (Germano Veit Michel, 2015-11-13 04:28:53 UTC)
In addition:
1. Prefer NOT to schedule new VMs on that host (those VMs that use that network).
2. Have a policy to migrate VMs (that use that network) to other hosts.
Both are SLA-related. Same for iSCSI multipathing...

Indeed, good ideas! But if the migration network is related to that bond, trying to migrate VMs will probably be just a waste of resources. Also, if the only way to storage is through a bad bond, shouldn't we set the host to non-operational?

(In reply to Germano Veit Michel from comment #4)
> Also, if only way to storage is through a bad bond, shouldn't we set the
> host to non-operational?

If this bond is indeed non-operational, then yes; but if it can still talk to the storage, only slowly, I think we should not.

Fair points. Anyway, if the bond is causing communication issues with the SD, the host will likely transition to Non-Operational at some point.

Note: /proc/net/* is generally considered deprecated/legacy; you might be better off looking at sysfs, particularly /sys/devices/virtual/net/bond*/, where a number of things are exposed, such as the following from a test system I've got up and running right now:

# cat bonding/ad_partner_mac
00:00:00:00:00:00
# cat lower_p5p1/bonding_slave/ad_aggregator_id
1
# cat lower_p5p2/bonding_slave/ad_aggregator_id
2

(bond0 contains p5p1 and p5p2, in mode 4, no switch-side configuration done)

What needs to be set up for testing: a badly configured switch with a different 'Aggregator ID' for each nic; an alert should then be present in the engine.

This bug was accidentally moved from POST to MODIFIED via an error in automation; please see mmccune with any questions.

Folks, I would like to see a mechanism to "override" or disable any bond configuration-related warnings, especially if a warning state will cause any side effect (like avoiding scheduling VMs on the host, etc.). Specifically, this is because I run my hypervisors with a bond configuration that will almost certainly run afoul of these proposed "goodness" checks. I run an active-backup bond across 4 nics, where 2 nics link to one switch and the other 2 nics link to another switch. Thus two of the nics will show AggregatorID X and the other two will show AggregatorID Y. Only one of the pairs will operate as the current active-backup pair at any one time, but if a whole switch fails, then the 2nd set of links will become the new active-backup pair. Thoughts?

Disregard my last. I mixed up/forgot that I had switched from 802.3ad mode to active-backup mode, which doesn't care about AggregatorIDs one bit.

[root@orchid-vds2 ~]# vdsClient -s 0 getVdsCaps |grep aggreg
'ad_aggregator_id': '3',
'ad_aggregator_id': '1',
nics = {'dummy_0': {'ad_aggregator_id': '3',
'dummy_1': {'ad_aggregator_id': '4',
'dummy_3': {'ad_aggregator_id': '1',
'dummy_4': {'ad_aggregator_id': '2',
[root@orchid-vds2 ~]# vdsClient -s 0 getVdsCaps |grep partner
'ad_partner_mac': '00:00:00:00:00:00',
'ad_partner_mac': '00:00:00:00:00:00',
- a partner MAC of all zeros should be considered a bad bond status.
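To make these criteria concrete, here is a minimal, illustrative sketch (not the actual VDSM or engine code; the helper and the bond0 name are only examples) that reads the sysfs attributes shown in the earlier comment and flags a bond whose partner MAC is all zeros or whose slaves report an aggregator id different from the bond's own:

```python
#!/usr/bin/env python
# Illustrative sketch only: roughly the checks discussed in this bug, reading
# the sysfs attributes quoted in the comments above. Not VDSM/engine code.
import os

SYSFS_NET = '/sys/class/net'  # same files as /sys/devices/virtual/net


def _read(path):
    try:
        with open(path) as f:
            return f.read().strip()
    except IOError:
        return None  # attribute missing (e.g. not an 802.3ad bond)


def bond_ad_problems(bond):
    """Return a list of problems found for an 802.3ad bond, [] if none."""
    bonding = os.path.join(SYSFS_NET, bond, 'bonding')
    problems = []

    mode = _read(os.path.join(bonding, 'mode')) or ''
    if '802.3ad' not in mode:
        return problems  # these checks only apply to LACP bonds

    # Criterion 1: an all-zero partner MAC means no LACP partner answered.
    partner_mac = _read(os.path.join(bonding, 'ad_partner_mac'))
    if partner_mac == '00:00:00:00:00:00':
        problems.append('no LACP partner (ad_partner_mac is all zeros)')

    # Criterion 2: every slave should join the bond's active aggregator.
    bond_agg = _read(os.path.join(bonding, 'ad_aggregator'))
    slaves = (_read(os.path.join(bonding, 'slaves')) or '').split()
    for slave in slaves:
        slave_agg = _read(os.path.join(
            SYSFS_NET, slave, 'bonding_slave', 'ad_aggregator_id'))
        if bond_agg is not None and slave_agg not in (None, bond_agg):
            problems.append('%s is in aggregator %s, bond uses %s'
                            % (slave, slave_agg, bond_agg))
    return problems


if __name__ == '__main__':
    for bond in ('bond0',):  # example bond name
        for problem in bond_ad_problems(bond):
            print('WARNING: %s: %s' % (bond, problem))
```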
Should this RFE really be ON_QA? I see that patch 59062 was uploaded after the RFE changed its status. Patches 59062 and 58697 were added as a response to comment 15, so this bug can't be ON_QA on the current build. Please check whether your latest patches are in our latest QE build, 4.0.0.4-0.1.el7ev; if not, move it back to POST/MODIFIED. Thank you.

This was tagged as:
Tags ovirt-engine-4.0.0.5
Should be ON QA

(In reply to Marcin Mirecki from comment #20)
> This was tagged as:
> Tags ovirt-engine-4.0.0.5
>
> Should be ON QA

Hi Marcin, in that case the target milestone should be changed, and please add the target release as well. Thanks.

Verified on 4.0.0.5-0.1.el7ev.

I'm sorry, it was verified too soon. It seems some scenarios fail and further investigation is needed: a bond of two slaves that are not configured for LACP on the switch is not reported as a bad bond in the engine.

Please attach `vdsClient -s getVdsCaps` and /proc/net/bonding/* so we can understand the failure.

[root@vega05 ~]# vdsClient -s 0 getVdsCaps | grep aggreg
'ad_aggregator_id': '2',
'ad_aggregator_id': '1',
'enp1s0f1': {'ad_aggregator_id': '1',
'enp2s0f0': {'ad_aggregator_id': '2',
'enp2s0f1': {'ad_aggregator_id': '1',
'enp2s0f2': {'ad_aggregator_id': '2',
[root@vega05 ~]# vdsClient -s 0 getVdsCaps | grep partner
'ad_partner_mac': '18:ef:63:a1:75:00',
'ad_partner_mac': '18:ef:63:a1:75:00',
These bonds should be reported as bad bonds, since their aggregator_ids differ (they are not configured for LACP in the switch), but the engine is not reporting them as bad bonds.
[root@vega05 ~]# cat /proc/net/bonding/*
Ethernet Channel Bonding Driver: v3.7.1 (April 27, 2011)
Bonding Mode: IEEE 802.3ad Dynamic link aggregation
Transmit Hash Policy: layer2 (0)
MII Status: up
MII Polling Interval (ms): 100
Up Delay (ms): 0
Down Delay (ms): 0
802.3ad info
LACP rate: slow
Min links: 0
Aggregator selection policy (ad_select): stable
Active Aggregator Info:
Aggregator ID: 2
Number of ports: 1
Actor Key: 9
Partner Key: 37
Partner Mac Address: 18:ef:63:a1:75:00
Slave Interface: enp1s0f1
MII Status: up
Speed: 1000 Mbps
Duplex: full
Link Failure Count: 0
Permanent HW addr: 00:25:90:c6:3d:59
Slave queue ID: 0
Aggregator ID: 1
Actor Churn State: none
Partner Churn State: churned
Actor Churned Count: 0
Partner Churned Count: 1
details actor lacp pdu:
system priority: 65535
port key: 9
port priority: 255
port number: 1
port state: 77
details partner lacp pdu:
system priority: 65535
oper key: 1
port priority: 255
port number: 1
port state: 1
Slave Interface: enp2s0f0
MII Status: up
Speed: 1000 Mbps
Duplex: full
Link Failure Count: 0
Permanent HW addr: 00:e0:ed:40:40:28
Slave queue ID: 0
Aggregator ID: 2
Actor Churn State: none
Partner Churn State: none
Actor Churned Count: 0
Partner Churned Count: 0
details actor lacp pdu:
system priority: 65535
port key: 9
port priority: 255
port number: 2
port state: 61
details partner lacp pdu:
system priority: 32768
oper key: 37
port priority: 32768
port number: 27
port state: 61
Ethernet Channel Bonding Driver: v3.7.1 (April 27, 2011)
Bonding Mode: IEEE 802.3ad Dynamic link aggregation
Transmit Hash Policy: layer2 (0)
MII Status: up
MII Polling Interval (ms): 100
Up Delay (ms): 0
Down Delay (ms): 0
802.3ad info
LACP rate: slow
Min links: 0
Aggregator selection policy (ad_select): stable
Active Aggregator Info:
Aggregator ID: 1
Number of ports: 1
Actor Key: 9
Partner Key: 37
Partner Mac Address: 18:ef:63:a1:75:00
Slave Interface: enp2s0f1
MII Status: up
Speed: 1000 Mbps
Duplex: full
Link Failure Count: 0
Permanent HW addr: 00:e0:ed:40:40:29
Slave queue ID: 0
Aggregator ID: 1
Actor Churn State: none
Partner Churn State: churned
Actor Churned Count: 0
Partner Churned Count: 1
details actor lacp pdu:
system priority: 65535
port key: 9
port priority: 255
port number: 1
port state: 13
details partner lacp pdu:
system priority: 32768
oper key: 37
port priority: 32768
port number: 28
port state: 133
Slave Interface: enp2s0f2
MII Status: up
Speed: 1000 Mbps
Duplex: full
Link Failure Count: 0
Permanent HW addr: 00:e0:ed:40:40:2a
Slave queue ID: 0
Aggregator ID: 2
Actor Churn State: churned
Partner Churn State: churned
Actor Churned Count: 1
Partner Churned Count: 1
details actor lacp pdu:
system priority: 65535
port key: 9
port priority: 255
port number: 2
port state: 69
details partner lacp pdu:
system priority: 65535
oper key: 1
port priority: 255
port number: 1
port state: 1
The problem is that we assumed a bond not using all of its slaves would not be operational (it would have no assigned MAC and would be DOWN). It turns out, however, that the bond goes UP even if some of the slaves are not used (the ones with a different aggregator id). We need to check not only the MAC but also the aggregator ids: vdsm must collect 'ad_aggregator_id' not only for bonds but also for the nics, and on the engine side we must add a check that the bond and slave 'ad_aggregator_id' values are all the same (an illustrative sketch of this check appears further below, after the next comments).

Time to move this to 4.1?

Previously, I ran with a 4-port AD bond across 2 switches, but switched that bond to the simpler active-backup mode when the switches were showing some unreliability. Now that my switch issue is resolved (firmware upgrade), I've moved back to an AD bond to gain some additional bandwidth while retaining pairs of links to two different switches for reliability.
My AD bond correctly shows two different Aggregator IDs ("1" for ports 1 and 2, "2" for ports 3 and 4).
This is perfectly valid and correct, given that the AD partners are on two different switches.
If one link fails, the other partner (with 2 remaining good links) will become the Active Aggregator because of ad_select=1 (bandwidth). If a whole switch fails, the other will become the active partner, etc. This is exactly how it's supposed to work, is exactly what I want, and IS a valid configuration.
I reinstate my earlier request that any "warning" issued regarding an "invalid" bond state be overridable (aka "Don't warn me about this again").
My present situation, with engine 4.0.0.6-1.el7.centos and vdsm-4.18.4.1-0.fc23.x86_64 on the nodes, is that I get a warning of "Bond is in link aggregation mode (mode 4) but no partner mac has been reported for it", which is definitely wrong.
# cat /proc/net/bonding/bond0 |grep -i mac
System MAC address: 00:25:90:f5:24:66
Partner Mac Address: e8:de:27:c6:c0:2d
system mac address: 00:25:90:f5:24:66
system mac address: e8:de:27:c6:c0:2d
system mac address: 00:25:90:f5:24:66
system mac address: e8:de:27:c6:c0:2d
system mac address: 00:25:90:f5:24:66
system mac address: e8:de:27:d8:2e:2e
system mac address: 00:25:90:f5:24:66
system mac address: e8:de:27:d8:2e:2e
# vdsClient -s 0 getVdsCaps | grep aggr
nics = {'eth0': {'ad_aggregator_id': '1',
'eth1': {'ad_aggregator_id': '1',
'eth2': {'ad_aggregator_id': '2',
'eth3': {'ad_aggregator_id': '2',
# vdsClient -s 0 getVdsCaps | grep partner
(no output)
# cat /sys/devices/virtual/net/bond0/bonding/ad_partner_mac
e8:de:27:c6:c0:2d
Since a warning is just a warning, I have no big problem with the resolution of this issue being pushed to 4.1, as long as any future update doesn't start refusing to use the bond just because of a perceived invalid state that isn't.
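For reference, here is a minimal sketch of the engine-side consistency check described a few comments above (the bond and every slave must report the same 'ad_aggregator_id'). It is written in Python against getVdsCaps-style data purely for illustration; the real ovirt-engine code is Java, and the function name is invented.

```python
# Illustrative sketch only: the aggregator-id consistency rule described
# above, applied to getVdsCaps-style data as quoted earlier in this bug.
def bond_aggregator_problems(bond_caps, nic_caps, slave_names):
    """Return reasons why an 802.3ad bond should be flagged, [] if none."""
    reasons = []
    bond_agg = bond_caps.get('ad_aggregator_id')

    slave_aggs = {}
    for name in slave_names:
        agg = nic_caps.get(name, {}).get('ad_aggregator_id')
        if agg is not None:
            slave_aggs[name] = agg

    # Every slave must sit in the same aggregator as the bond itself;
    # a slave in a different aggregator is not carrying bond traffic.
    mismatched = {n: a for n, a in slave_aggs.items() if a != bond_agg}
    if bond_agg is not None and mismatched:
        reasons.append('slaves %s do not use the bond aggregator id %s'
                       % (sorted(mismatched), bond_agg))
    return reasons


# With the vega05 data quoted above (bond aggregator id 2, slaves reporting
# ids 1 and 2), the bond is flagged even though ad_partner_mac is non-zero.
nics = {'enp1s0f1': {'ad_aggregator_id': '1'},
        'enp2s0f0': {'ad_aggregator_id': '2'}}
print(bond_aggregator_problems({'ad_aggregator_id': '2'},
                               nics, ['enp1s0f1', 'enp2s0f0']))
```

Note that such a rule would also flag a split-across-two-switches layout like Ian Morgan's, which is why he asks for the warning to be overridable.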
(In reply to Yaniv Kaul from comment #27)
> Time to move this to 4.1?

I'm afraid that the current state of Marcin's recent patches forces us to do so. The feature is partially in: we warn if no ad_partner_mac is reported by the partner switch, but we currently do not warn on a mismatching agg_id (which for Ian Morgan is a good thing).

(In reply to Ian Morgan from comment #28)
Could you share your whole /proc/net/bonding/bond0 and in particular your `cat /sys/devices/virtual/net/bond0/bonding/ad_aggregator`? Vdsm intentionally does not report ad_partner_mac if the aggregator is missing.

Dan,
Here's the info you requested. ad_aggregator is 1, which matches the proc output.
# cat /proc/net/bonding/bond0
Ethernet Channel Bonding Driver: v3.7.1 (April 27, 2011)
Bonding Mode: IEEE 802.3ad Dynamic link aggregation
Transmit Hash Policy: layer3+4 (1)
MII Status: up
MII Polling Interval (ms): 100
Up Delay (ms): 0
Down Delay (ms): 0
802.3ad info
LACP rate: slow
Min links: 0
Aggregator selection policy (ad_select): bandwidth
System priority: 65535
System MAC address: 00:25:90:f5:24:66
Active Aggregator Info:
Aggregator ID: 1
Number of ports: 2
Actor Key: 9
Partner Key: 2404
Partner Mac Address: e8:de:27:c6:c0:2d
Slave Interface: eth0
MII Status: up
Speed: 1000 Mbps
Duplex: full
Link Failure Count: 0
Permanent HW addr: 00:25:90:f5:24:66
Slave queue ID: 0
Aggregator ID: 1
Actor Churn State: none
Partner Churn State: none
Actor Churned Count: 0
Partner Churned Count: 0
details actor lacp pdu:
system priority: 65535
system mac address: 00:25:90:f5:24:66
port key: 9
port priority: 255
port number: 1
port state: 61
details partner lacp pdu:
system priority: 32768
system mac address: e8:de:27:c6:c0:2d
oper key: 2404
port priority: 32768
port number: 2
port state: 60
Slave Interface: eth1
MII Status: up
Speed: 1000 Mbps
Duplex: full
Link Failure Count: 0
Permanent HW addr: 00:25:90:f5:24:67
Slave queue ID: 0
Aggregator ID: 1
Actor Churn State: none
Partner Churn State: none
Actor Churned Count: 0
Partner Churned Count: 0
details actor lacp pdu:
system priority: 65535
system mac address: 00:25:90:f5:24:66
port key: 9
port priority: 255
port number: 2
port state: 61
details partner lacp pdu:
system priority: 32768
system mac address: e8:de:27:c6:c0:2d
oper key: 2404
port priority: 32768
port number: 4
port state: 60
Slave Interface: eth2
MII Status: up
Speed: 1000 Mbps
Duplex: full
Link Failure Count: 0
Permanent HW addr: 00:25:90:f5:24:68
Slave queue ID: 0
Aggregator ID: 2
Actor Churn State: churned
Partner Churn State: none
Actor Churned Count: 1
Partner Churned Count: 0
details actor lacp pdu:
system priority: 65535
system mac address: 00:25:90:f5:24:66
port key: 9
port priority: 255
port number: 3
port state: 5
details partner lacp pdu:
system priority: 32768
system mac address: e8:de:27:d8:2e:2e
oper key: 210
port priority: 32768
port number: 2
port state: 12
Slave Interface: eth3
MII Status: up
Speed: 1000 Mbps
Duplex: full
Link Failure Count: 0
Permanent HW addr: 00:25:90:f5:24:69
Slave queue ID: 0
Aggregator ID: 2
Actor Churn State: churned
Partner Churn State: none
Actor Churned Count: 1
Partner Churned Count: 0
details actor lacp pdu:
system priority: 65535
system mac address: 00:25:90:f5:24:66
port key: 9
port priority: 255
port number: 4
port state: 5
details partner lacp pdu:
system priority: 32768
system mac address: e8:de:27:d8:2e:2e
oper key: 210
port priority: 32768
port number: 4
port state: 12
# cat /sys/devices/virtual/net/bond0/bonding/ad_aggregator
1
# grep '' /sys/devices/virtual/net/bond0/bonding/*
ad_actor_key:9
ad_actor_sys_prio:65535
ad_actor_system:00:00:00:00:00:00
ad_aggregator:1
ad_num_ports:2
ad_partner_key:2404
ad_partner_mac:e8:de:27:c6:c0:2d
ad_select:bandwidth 1
ad_user_port_key:0
all_slaves_active:0
arp_all_targets:any 0
arp_interval:0
arp_validate:none 0
downdelay:0
fail_over_mac:none 0
lacp_rate:slow 0
lp_interval:1
miimon:100
mii_status:up
min_links:0
mode:802.3ad 4
num_grat_arp:1
num_unsol_na:1
packets_per_slave:1
primary_reselect:always 0
queue_id:eth0:0 eth1:0 eth2:0 eth3:0
resend_igmp:1
slaves:eth0 eth1 eth2 eth3
tlb_dynamic_lb:1
updelay:0
use_carrier:1
xmit_hash_policy:layer3+4 1
Created attachment 1186700 [details]
Bond tooltip
To show the state of the bond properties without raising any alerts, I added a tooltip to the bond icon.
This will look more or less like the attached image.
For a bond in DOWN status, only the status will be displayed.
For UP: ad_partner_mac, ad_aggregator_id, and the slaves' ad_aggregator_ids.
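For completeness, a rough sketch of how a tooltip string along these lines could be assembled from the reported properties; this is only an illustration in Python (function name invented), not the engine's actual GWT/Java UI code.

```python
# Illustrative sketch only: assembling the tooltip text described above.
def bond_tooltip(status, ad_partner_mac=None, ad_aggregator_id=None,
                 slave_aggregator_ids=None):
    lines = ['Bond status: %s' % status]
    if status.upper() != 'UP':
        # For a bond that is DOWN, only the status is shown.
        return '\n'.join(lines)
    lines.append('ad_partner_mac: %s' % ad_partner_mac)
    lines.append('ad_aggregator_id: %s' % ad_aggregator_id)
    for slave, agg_id in sorted((slave_aggregator_ids or {}).items()):
        lines.append('%s ad_aggregator_id: %s' % (slave, agg_id))
    return '\n'.join(lines)


print(bond_tooltip('UP', '18:ef:63:a1:75:00', '2',
                   {'enp2s0f0': '2', 'enp1s0f1': '1'}))
```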
Hi guys, I verified the RFE on a bond which is connected to a Cisco switch without PortChannel configuration. Engine version: 4.0.2.3-0.1.el7ev.

Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report.

https://rhn.redhat.com/errata/RHEA-2016-1743.html