Bug 1281666
| Summary: | [RFE] Engine should warn admin about bad 802.3ad status | | |
|---|---|---|---|
| Product: | Red Hat Enterprise Virtualization Manager | Reporter: | Germano Veit Michel <gveitmic> |
| Component: | RFEs | Assignee: | Marcin Mirecki <mmirecki> |
| Status: | CLOSED ERRATA | QA Contact: | Mor <mkalfon> |
| Severity: | medium | Docs Contact: | |
| Priority: | medium | | |
| Version: | unspecified | CC: | alkaplan, danken, gklein, inetkach, lsurette, mburman, mkalinin, mmirecki, myakove, penguin.wrangler, rbalakri, sigbjorn, srevivo, ykaul, ylavi |
| Target Milestone: | ovirt-4.0.2 | Keywords: | FutureFeature |
| Target Release: | --- | Flags: | mburman: testing_plan_complete+ |
| Hardware: | All | | |
| OS: | Linux | | |
| Whiteboard: | | | |
| Fixed In Version: | | Doc Type: | Enhancement |
| Doc Text: | | Story Points: | --- |
| Clone Of: | | | |
| : | 1317457 (view as bug list) | Environment: | |
| Last Closed: | 2016-08-23 20:30:30 UTC | Type: | Bug |
| Regression: | --- | Mount Type: | --- |
| Documentation: | --- | CRM: | |
| Verified Versions: | | Category: | --- |
| oVirt Team: | Network | RHEL 7.3 requirements from Atomic Host: | |
| Cloudforms Team: | --- | Target Upstream Version: | |
| Embargoed: | | | |
| Bug Depends On: | 1240719, 1317457 | | |
| Bug Blocks: | 902971 | | |
| Attachments: | Bond tooltip (attachment 1186700) | | |
Description (Germano Veit Michel, 2015-11-13 04:28:53 UTC)
In addition:
1. Prefer NOT to schedule new VMs on that host (those VMs that use that network).
2. Have a policy to migrate VMs (that use that network) to other hosts.
Both are SLA-related. Same for iSCSI multipathing...

Indeed, good ideas! But if the migration network is related to that bond, trying to migrate VMs will probably be just a waste of resources. Also, if the only way to storage is through a bad bond, shouldn't we set the host to non-operational?

(In reply to Germano Veit Michel from comment #4)
> Also, if only way to storage is through a bad bond, shouldn't we set the
> host to non-operational?

If this bond is indeed non-operational, then yes; but if it can still talk to the storage, only slowly, I think we should not.

Fair points. Anyway, if the bond is causing communication issues with the SD, the host will likely transition to Non-Operational at some point.

Note: /proc/net/* is generally considered deprecated/legacy; you might be better off looking at sysfs, particularly /sys/devices/virtual/net/bond*/, where a number of things are exposed, such as the following from a test system I've got up and running right now:

# cat bonding/ad_partner_mac
00:00:00:00:00:00
# cat lower_p5p1/bonding_slave/ad_aggregator_id
1
# cat lower_p5p2/bonding_slave/ad_aggregator_id
2

(bond0 contains p5p1 and p5p2, in mode 4, no switch-side configuration done)

What needs to be set up for testing: a badly configured switch with a different 'Aggregator ID' for each nic; an alert should then be present in the engine.

This bug was accidentally moved from POST to MODIFIED via an error in automation; please see mmccune with any questions.

Folks, I would like to see a mechanism to "override" or disable any bond configuration-related warnings, especially if a warning state will cause any side effect (like avoiding scheduling VMs on the host, etc.). Specifically, this is because I run my hypervisors with a bond configuration that will almost certainly run afoul of these proposed "goodness" checks. I run an active-backup bond across 4 nics, where 2 nics link to one switch and the other 2 nics link to another switch. Thus two of the nics will show AggregatorID X and the other two will show AggregatorID Y. Only one of the pairs will operate as the current active-backup pair at any one time, but if a whole switch fails, then the 2nd set of links will become the new active-backup pair. Thoughts?

Disregard my last. I mixed up/forgot that I had switched from 802.3ad mode to active-backup mode, which doesn't care about AggregatorIDs one bit.

[root@orchid-vds2 ~]# vdsClient -s 0 getVdsCaps |grep aggreg
'ad_aggregator_id': '3',
'ad_aggregator_id': '1',
nics = {'dummy_0': {'ad_aggregator_id': '3',
'dummy_1': {'ad_aggregator_id': '4',
'dummy_3': {'ad_aggregator_id': '1',
'dummy_4': {'ad_aggregator_id': '2',
[root@orchid-vds2 ~]# vdsClient -s 0 getVdsCaps |grep partner
'ad_partner_mac': '00:00:00:00:00:00',
'ad_partner_mac': '00:00:00:00:00:00',
- a partner MAC of all zeros should be considered a bad bond status.
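To make these criteria concrete, here is a minimal, illustrative sketch (not the actual VDSM or engine code; the helper and the bond0 name are only examples) that reads the sysfs attributes shown in the earlier comment and flags a bond whose partner MAC is all zeros or whose slaves report an aggregator id different from the bond's own:

```python
#!/usr/bin/env python
# Illustrative sketch only: roughly the checks discussed in this bug, reading
# the sysfs attributes quoted in the comments above. Not VDSM/engine code.
import os

SYSFS_NET = '/sys/class/net'  # same files as /sys/devices/virtual/net


def _read(path):
    try:
        with open(path) as f:
            return f.read().strip()
    except IOError:
        return None  # attribute missing (e.g. not an 802.3ad bond)


def bond_ad_problems(bond):
    """Return a list of problems found for an 802.3ad bond, [] if none."""
    bonding = os.path.join(SYSFS_NET, bond, 'bonding')
    problems = []

    mode = _read(os.path.join(bonding, 'mode')) or ''
    if '802.3ad' not in mode:
        return problems  # these checks only apply to LACP bonds

    # Criterion 1: an all-zero partner MAC means no LACP partner answered.
    partner_mac = _read(os.path.join(bonding, 'ad_partner_mac'))
    if partner_mac == '00:00:00:00:00:00':
        problems.append('no LACP partner (ad_partner_mac is all zeros)')

    # Criterion 2: every slave should join the bond's active aggregator.
    bond_agg = _read(os.path.join(bonding, 'ad_aggregator'))
    slaves = (_read(os.path.join(bonding, 'slaves')) or '').split()
    for slave in slaves:
        slave_agg = _read(os.path.join(
            SYSFS_NET, slave, 'bonding_slave', 'ad_aggregator_id'))
        if bond_agg is not None and slave_agg not in (None, bond_agg):
            problems.append('%s is in aggregator %s, bond uses %s'
                            % (slave, slave_agg, bond_agg))
    return problems


if __name__ == '__main__':
    for bond in ('bond0',):  # example bond name
        for problem in bond_ad_problems(bond):
            print('WARNING: %s: %s' % (bond, problem))
```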
Should this RFE really be ON_QA? I see that patch 59062 was uploaded after the RFE changed its status. Patches 59062 and 58697 were added as a response to comment 15, so this bug can't be ON_QA on the current build. Please check whether your latest patches are in our latest QE build, 4.0.0.4-0.1.el7ev; if not, move it back to POST/MODIFIED. Thank you.

This was tagged as:
Tags ovirt-engine-4.0.0.5
Should be ON QA

(In reply to Marcin Mirecki from comment #20)
> This was tagged as:
> Tags ovirt-engine-4.0.0.5
>
> Should be ON QA

Hi Marcin, in that case the target milestone should be changed, and please add the target release as well. Thanks.

Verified on 4.0.0.5-0.1.el7ev.

I'm sorry, it was verified too soon. It seems some scenarios fail and further investigation is needed: a bond of two slaves that are not configured for LACP on the switch is not reported as a bad bond in the engine.

Please attach `vdsClient -s getVdsCaps` and /proc/net/bonding/* so we can understand the failure.

[root@vega05 ~]# vdsClient -s 0 getVdsCaps | grep aggreg
'ad_aggregator_id': '2',
'ad_aggregator_id': '1',
'enp1s0f1': {'ad_aggregator_id': '1',
'enp2s0f0': {'ad_aggregator_id': '2',
'enp2s0f1': {'ad_aggregator_id': '1',
'enp2s0f2': {'ad_aggregator_id': '2',
[root@vega05 ~]# vdsClient -s 0 getVdsCaps | grep partner
'ad_partner_mac': '18:ef:63:a1:75:00',
'ad_partner_mac': '18:ef:63:a1:75:00',
These bonds should be reported as bad bonds, since their aggregator_ids differ (they are not configured for LACP in the switch), but the engine is not reporting them as bad bonds.
[root@vega05 ~]# cat /proc/net/bonding/*
Ethernet Channel Bonding Driver: v3.7.1 (April 27, 2011)
Bonding Mode: IEEE 802.3ad Dynamic link aggregation
Transmit Hash Policy: layer2 (0)
MII Status: up
MII Polling Interval (ms): 100
Up Delay (ms): 0
Down Delay (ms): 0
802.3ad info
LACP rate: slow
Min links: 0
Aggregator selection policy (ad_select): stable
Active Aggregator Info:
Aggregator ID: 2
Number of ports: 1
Actor Key: 9
Partner Key: 37
Partner Mac Address: 18:ef:63:a1:75:00
Slave Interface: enp1s0f1
MII Status: up
Speed: 1000 Mbps
Duplex: full
Link Failure Count: 0
Permanent HW addr: 00:25:90:c6:3d:59
Slave queue ID: 0
Aggregator ID: 1
Actor Churn State: none
Partner Churn State: churned
Actor Churned Count: 0
Partner Churned Count: 1
details actor lacp pdu:
system priority: 65535
port key: 9
port priority: 255
port number: 1
port state: 77
details partner lacp pdu:
system priority: 65535
oper key: 1
port priority: 255
port number: 1
port state: 1
Slave Interface: enp2s0f0
MII Status: up
Speed: 1000 Mbps
Duplex: full
Link Failure Count: 0
Permanent HW addr: 00:e0:ed:40:40:28
Slave queue ID: 0
Aggregator ID: 2
Actor Churn State: none
Partner Churn State: none
Actor Churned Count: 0
Partner Churned Count: 0
details actor lacp pdu:
system priority: 65535
port key: 9
port priority: 255
port number: 2
port state: 61
details partner lacp pdu:
system priority: 32768
oper key: 37
port priority: 32768
port number: 27
port state: 61
Ethernet Channel Bonding Driver: v3.7.1 (April 27, 2011)
Bonding Mode: IEEE 802.3ad Dynamic link aggregation
Transmit Hash Policy: layer2 (0)
MII Status: up
MII Polling Interval (ms): 100
Up Delay (ms): 0
Down Delay (ms): 0
802.3ad info
LACP rate: slow
Min links: 0
Aggregator selection policy (ad_select): stable
Active Aggregator Info:
Aggregator ID: 1
Number of ports: 1
Actor Key: 9
Partner Key: 37
Partner Mac Address: 18:ef:63:a1:75:00
Slave Interface: enp2s0f1
MII Status: up
Speed: 1000 Mbps
Duplex: full
Link Failure Count: 0
Permanent HW addr: 00:e0:ed:40:40:29
Slave queue ID: 0
Aggregator ID: 1
Actor Churn State: none
Partner Churn State: churned
Actor Churned Count: 0
Partner Churned Count: 1
details actor lacp pdu:
system priority: 65535
port key: 9
port priority: 255
port number: 1
port state: 13
details partner lacp pdu:
system priority: 32768
oper key: 37
port priority: 32768
port number: 28
port state: 133
Slave Interface: enp2s0f2
MII Status: up
Speed: 1000 Mbps
Duplex: full
Link Failure Count: 0
Permanent HW addr: 00:e0:ed:40:40:2a
Slave queue ID: 0
Aggregator ID: 2
Actor Churn State: churned
Partner Churn State: churned
Actor Churned Count: 1
Partner Churned Count: 1
details actor lacp pdu:
system priority: 65535
port key: 9
port priority: 255
port number: 2
port state: 69
details partner lacp pdu:
system priority: 65535
oper key: 1
port priority: 255
port number: 1
port state: 1
The problem is that we assumed a bond not using all of its slaves would not be operational (it would have no assigned MAC and would be DOWN). It turns out, however, that the bond goes UP even if some of the slaves are not used (the ones with a different aggregator id). We need to check not only the MAC but also the aggregator ids: vdsm must collect 'ad_aggregator_id' not only for bonds but also for the nics, and on the engine side we must add a check that the bond and slave 'ad_aggregator_id' values are all the same (an illustrative sketch of this check appears further below, after the next comments).

Time to move this to 4.1?

Previously, I ran with a 4-port AD bond across 2 switches, but switched that bond to the simpler active-backup mode when the switches were showing some unreliability. Now that my switch issue is resolved (firmware upgrade), I've moved back to an AD bond to gain some additional bandwidth while retaining pairs of links to two different switches for reliability.
My AD bond correctly shows two different Aggregator IDs ("1" for ports 1 and 2, "2" for ports 3 and 4).
This is perfectly valid and correct, given that the AD partners are on two different switches.
If one link fails, the other partner (with 2 remaining good links) will become the Active Aggregator because of ad_select=1 (bandwidth). If a whole switch fails, the other will become the active partner, etc. This is exactly how it's supposed to work, is exactly what I want, and IS a valid configuration.
I reinstate my earlier request that any "warning" issued regarding an "invalid" bond state be overridable (aka "Don't warn me about this again").
My present situation, with engine 4.0.0.6-1.el7.centos and vdsm-4.18.4.1-0.fc23.x86_64 on the nodes, is that I get a warning of "Bond is in link aggregation mode (mode 4) but no partner mac has been reported for it", which is definitely wrong.
# cat /proc/net/bonding/bond0 |grep -i mac
System MAC address: 00:25:90:f5:24:66
Partner Mac Address: e8:de:27:c6:c0:2d
system mac address: 00:25:90:f5:24:66
system mac address: e8:de:27:c6:c0:2d
system mac address: 00:25:90:f5:24:66
system mac address: e8:de:27:c6:c0:2d
system mac address: 00:25:90:f5:24:66
system mac address: e8:de:27:d8:2e:2e
system mac address: 00:25:90:f5:24:66
system mac address: e8:de:27:d8:2e:2e
# vdsClient -s 0 getVdsCaps | grep aggr
nics = {'eth0': {'ad_aggregator_id': '1',
'eth1': {'ad_aggregator_id': '1',
'eth2': {'ad_aggregator_id': '2',
'eth3': {'ad_aggregator_id': '2',
# vdsClient -s 0 getVdsCaps | grep partner
(no output)
# cat /sys/devices/virtual/net/bond0/bonding/ad_partner_mac
e8:de:27:c6:c0:2d
Since a warning is just a warning, I have no big problem with the resolution of this issue being pushed to 4.1, as long as any future update doesn't start refusing to use the bond just because of a perceived invalid state that isn't.
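For reference, here is a minimal sketch of the engine-side consistency check described a few comments above (the bond and every slave must report the same 'ad_aggregator_id'). It is written in Python against getVdsCaps-style data purely for illustration; the real ovirt-engine code is Java, and the function name is invented.

```python
# Illustrative sketch only: the aggregator-id consistency rule described
# above, applied to getVdsCaps-style data as quoted earlier in this bug.
def bond_aggregator_problems(bond_caps, nic_caps, slave_names):
    """Return reasons why an 802.3ad bond should be flagged, [] if none."""
    reasons = []
    bond_agg = bond_caps.get('ad_aggregator_id')

    slave_aggs = {}
    for name in slave_names:
        agg = nic_caps.get(name, {}).get('ad_aggregator_id')
        if agg is not None:
            slave_aggs[name] = agg

    # Every slave must sit in the same aggregator as the bond itself;
    # a slave in a different aggregator is not carrying bond traffic.
    mismatched = {n: a for n, a in slave_aggs.items() if a != bond_agg}
    if bond_agg is not None and mismatched:
        reasons.append('slaves %s do not use the bond aggregator id %s'
                       % (sorted(mismatched), bond_agg))
    return reasons


# With the vega05 data quoted above (bond aggregator id 2, slaves reporting
# ids 1 and 2), the bond is flagged even though ad_partner_mac is non-zero.
nics = {'enp1s0f1': {'ad_aggregator_id': '1'},
        'enp2s0f0': {'ad_aggregator_id': '2'}}
print(bond_aggregator_problems({'ad_aggregator_id': '2'},
                               nics, ['enp1s0f1', 'enp2s0f0']))
```

Note that such a rule would also flag a split-across-two-switches layout like Ian Morgan's, which is why he asks for the warning to be overridable.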
(In reply to Yaniv Kaul from comment #27)
> Time to move this to 4.1?

I'm afraid that the current state of Marcin's recent patches forces us to do so. The feature is partially in: we warn if no ad_partner_mac is reported by the partner switch, but we currently do not warn on a mismatching agg_id (which for Ian Morgan is a good thing).

(In reply to Ian Morgan from comment #28)
Could you share your whole /proc/net/bonding/bond0 and in particular your `cat /sys/devices/virtual/net/bond0/bonding/ad_aggregator`? Vdsm intentionally does not report ad_partner_mac if the aggregator is missing.

Dan,
Here's the info you requested. ad_aggregator is 1, which matches the proc output.
# cat /proc/net/bonding/bond0
Ethernet Channel Bonding Driver: v3.7.1 (April 27, 2011)
Bonding Mode: IEEE 802.3ad Dynamic link aggregation
Transmit Hash Policy: layer3+4 (1)
MII Status: up
MII Polling Interval (ms): 100
Up Delay (ms): 0
Down Delay (ms): 0
802.3ad info
LACP rate: slow
Min links: 0
Aggregator selection policy (ad_select): bandwidth
System priority: 65535
System MAC address: 00:25:90:f5:24:66
Active Aggregator Info:
Aggregator ID: 1
Number of ports: 2
Actor Key: 9
Partner Key: 2404
Partner Mac Address: e8:de:27:c6:c0:2d
Slave Interface: eth0
MII Status: up
Speed: 1000 Mbps
Duplex: full
Link Failure Count: 0
Permanent HW addr: 00:25:90:f5:24:66
Slave queue ID: 0
Aggregator ID: 1
Actor Churn State: none
Partner Churn State: none
Actor Churned Count: 0
Partner Churned Count: 0
details actor lacp pdu:
system priority: 65535
system mac address: 00:25:90:f5:24:66
port key: 9
port priority: 255
port number: 1
port state: 61
details partner lacp pdu:
system priority: 32768
system mac address: e8:de:27:c6:c0:2d
oper key: 2404
port priority: 32768
port number: 2
port state: 60
Slave Interface: eth1
MII Status: up
Speed: 1000 Mbps
Duplex: full
Link Failure Count: 0
Permanent HW addr: 00:25:90:f5:24:67
Slave queue ID: 0
Aggregator ID: 1
Actor Churn State: none
Partner Churn State: none
Actor Churned Count: 0
Partner Churned Count: 0
details actor lacp pdu:
system priority: 65535
system mac address: 00:25:90:f5:24:66
port key: 9
port priority: 255
port number: 2
port state: 61
details partner lacp pdu:
system priority: 32768
system mac address: e8:de:27:c6:c0:2d
oper key: 2404
port priority: 32768
port number: 4
port state: 60
Slave Interface: eth2
MII Status: up
Speed: 1000 Mbps
Duplex: full
Link Failure Count: 0
Permanent HW addr: 00:25:90:f5:24:68
Slave queue ID: 0
Aggregator ID: 2
Actor Churn State: churned
Partner Churn State: none
Actor Churned Count: 1
Partner Churned Count: 0
details actor lacp pdu:
system priority: 65535
system mac address: 00:25:90:f5:24:66
port key: 9
port priority: 255
port number: 3
port state: 5
details partner lacp pdu:
system priority: 32768
system mac address: e8:de:27:d8:2e:2e
oper key: 210
port priority: 32768
port number: 2
port state: 12
Slave Interface: eth3
MII Status: up
Speed: 1000 Mbps
Duplex: full
Link Failure Count: 0
Permanent HW addr: 00:25:90:f5:24:69
Slave queue ID: 0
Aggregator ID: 2
Actor Churn State: churned
Partner Churn State: none
Actor Churned Count: 1
Partner Churned Count: 0
details actor lacp pdu:
system priority: 65535
system mac address: 00:25:90:f5:24:66
port key: 9
port priority: 255
port number: 4
port state: 5
details partner lacp pdu:
system priority: 32768
system mac address: e8:de:27:d8:2e:2e
oper key: 210
port priority: 32768
port number: 4
port state: 12
# cat /sys/devices/virtual/net/bond0/bonding/ad_aggregator
1
# grep '' /sys/devices/virtual/net/bond0/bonding/*
ad_actor_key:9
ad_actor_sys_prio:65535
ad_actor_system:00:00:00:00:00:00
ad_aggregator:1
ad_num_ports:2
ad_partner_key:2404
ad_partner_mac:e8:de:27:c6:c0:2d
ad_select:bandwidth 1
ad_user_port_key:0
all_slaves_active:0
arp_all_targets:any 0
arp_interval:0
arp_validate:none 0
downdelay:0
fail_over_mac:none 0
lacp_rate:slow 0
lp_interval:1
miimon:100
mii_status:up
min_links:0
mode:802.3ad 4
num_grat_arp:1
num_unsol_na:1
packets_per_slave:1
primary_reselect:always 0
queue_id:eth0:0 eth1:0 eth2:0 eth3:0
resend_igmp:1
slaves:eth0 eth1 eth2 eth3
tlb_dynamic_lb:1
updelay:0
use_carrier:1
xmit_hash_policy:layer3+4 1
Created attachment 1186700 [details]
Bond tooltip
To show the state of the bond properties without raising any alerts, I added a tooltip to the bond icon.
This will look more or less like the attached image.
For a bond in DOWN status, only the status will be displayed.
For UP: ad_partner_mac, ad_aggregator_id, and the slaves' ad_aggregator_ids.
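For completeness, a rough sketch of how a tooltip string along these lines could be assembled from the reported properties; this is only an illustration in Python (function name invented), not the engine's actual GWT/Java UI code.

```python
# Illustrative sketch only: assembling the tooltip text described above.
def bond_tooltip(status, ad_partner_mac=None, ad_aggregator_id=None,
                 slave_aggregator_ids=None):
    lines = ['Bond status: %s' % status]
    if status.upper() != 'UP':
        # For a bond that is DOWN, only the status is shown.
        return '\n'.join(lines)
    lines.append('ad_partner_mac: %s' % ad_partner_mac)
    lines.append('ad_aggregator_id: %s' % ad_aggregator_id)
    for slave, agg_id in sorted((slave_aggregator_ids or {}).items()):
        lines.append('%s ad_aggregator_id: %s' % (slave, agg_id))
    return '\n'.join(lines)


print(bond_tooltip('UP', '18:ef:63:a1:75:00', '2',
                   {'enp2s0f0': '2', 'enp1s0f1': '1'}))
```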
Hi guys, I verified the RFE on a bond which is connected to a Cisco switch without PortChannel configuration. Engine version: 4.0.2.3-0.1.el7ev.

Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report.

https://rhn.redhat.com/errata/RHEA-2016-1743.html