Bug 1124800
| Field | Value |
|---|---|
| Summary | network on top of bond considered operational even if all its nics are down |
| Product | [Retired] oVirt |
| Component | ovirt-engine-core |
| Status | CLOSED CURRENTRELEASE |
| Severity | medium |
| Priority | high |
| Version | 3.5 |
| Target Milestone | --- |
| Target Release | 3.5.1 |
| Hardware | x86_64 |
| OS | Linux |
| Whiteboard | network |
| Keywords | Regression, Reopened, Triaged |
| Reporter | Martin Pavlik <mpavlik> |
| Assignee | Alona Kaplan <alkaplan> |
| QA Contact | Martin Pavlik <mpavlik> |
| CC | alkaplan, bazulay, bugs, danken, ecohen, gklein, iheim, lvernia, mburman, mgoldboi, mpavlik, myakove, peterm, rbalakri, s.kieske, yeylon |
| Doc Type | Bug Fix |
| Type | Bug |
| oVirt Team | Network |
| Last Closed | 2015-01-21 16:05:22 UTC |
What does `ip link show bond0` have to say in this case?

```
[root@dell-r210ii-08 ~]# ip link show bond0
2: bond0: <BROADCAST,MULTICAST,MASTER,UP,LOWER_UP> mtu 1500 qdisc noqueue state UP mode DEFAULT group default
    link/ether 90:e2:ba:04:28:b8 brd ff:ff:ff:ff:ff:ff
```

So Vdsm reports the true operational state of the bond device. I must admit that it is not telling you much, since the bond device is connected to nothing. Still, I do not think that Vdsm should override Linux's state with some computation of its own.

Since we cannot trust Linux's operstate here, we should add our own logic. The only question is where this should take place: in Vdsm or in Engine.

This seems to be a recent regression due to the NIC fault notification feature. If I understand correctly, patch 27720 removed (among other things) logic that checked whether slave interfaces were down and, if so, marked the bond as down; this logic is probably needed after all (a sketch of such a check appears after the verification notes below).

Veaceslav, is it intentional bonding module behavior to have all slaves in LOWER_DOWN state while the master bond stays up? kernel-3.14.13-100.fc19.x86_64 shows that (slaves are igb), while kernel-2.6.32-431.5.1.el6.x86_64 does not (slaves are dummy).

It seems to work with RHEL 6.5 and the oVirt RC version vdsm-4.16.1-6.gita4a4614.el6.x86_64:

```
[root@dell-r210ii-07 ~]# cat /etc/redhat-release
Red Hat Enterprise Linux Server release 6.5 (Santiago)
[root@dell-r210ii-07 ~]# ip link set dev p1p1 down
[root@dell-r210ii-07 ~]# ip link set dev p1p2 down
[root@dell-r210ii-07 ~]# vdsClient -s 0 getVdsStats
'bond0': {'name': 'bond0', 'rxDropped': '0', 'rxErrors': '0',
          'rxRate': '0.0', 'speed': '1000', 'state': 'down',   <-----
          'txDropped': '0', 'txErrors': '0', 'txRate': '0.0'},
'p1p1':  {'name': 'p1p1', 'rxDropped': '0', 'rxErrors': '0',
          'rxRate': '0.0', 'speed': '1000', 'state': 'down',   <-----
          'txDropped': '0', 'txErrors': '0', 'txRate': '0.0'},
'p1p2':  {'name': 'p1p2', 'rxDropped': '0', 'rxErrors': '0',
          'rxRate': '0.0', 'speed': '1000', 'state': 'down',   <-----
          'txDropped': '0', 'txErrors': '0', 'txRate': '0.0'}}
[root@dell-r210ii-07 ~]# ip link show bond0
9: bond0: <NO-CARRIER,BROADCAST,MULTICAST,MASTER,UP> mtu 1500 qdisc noqueue state DOWN
    link/ether 90:e2:ba:04:29:88 brd ff:ff:ff:ff:ff:ff
```

Kernel version used in comment 7:

```
[root@dell-r210ii-07 ~]# uname -r
2.6.32-431.el6.x86_64
```

It seems to work with RHEL 7 (kernel 3.10.0-123.el7.x86_64) as well:

```
'bond0':    {'name': 'bond0', 'rxDropped': '666', 'rxErrors': '0',
             'rxRate': '0.0', 'speed': '1000', 'state': 'down',
             'txDropped': '0', 'txErrors': '0', 'txRate': '0.0'},
'enp6s0f0': {'name': 'enp6s0f0', 'rxDropped': '333', 'rxErrors': '0',
             'rxRate': '0.0', 'speed': '1000', 'state': 'down',
             'txDropped': '0', 'txErrors': '0', 'txRate': '0.0'},
'enp6s0f1': {'name': 'enp6s0f1', 'rxDropped': '333', 'rxErrors': '0',
             'rxRate': '0.0', 'speed': '1000', 'state': 'down',
             'txDropped': '0', 'txErrors': '0', 'txRate': '0.0'},
```

Guys, where do we stand with this? On el6 and el7 a bond is marked as down, and on f20 it's marked as up? Can we verify this again on a different f20 machine?

I just tested this on Fedora 20 and got the expected behavior: the bond is reported as down when its slaves are down. Martin, please check your deployment again and try to understand what might be special about it.

The host is F19.

Verified on vdsm-4.16.2-1.gite8cba75.el6.x86_64.
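For reference, the slave-state check discussed above (the kind of logic that patch 27720 removed) amounts to a few lines reading sysfs directly. The following is a minimal sketch under the standard Linux sysfs layout; it is not the actual Vdsm or Engine code:

```python
def bond_effectively_down(bond):
    """Return True when every slave of `bond` reports operstate 'down'.

    Illustrative helper only (not vdsm code): it reads the standard
    sysfs attributes /sys/class/net/<bond>/bonding/slaves and
    /sys/class/net/<slave>/operstate.
    """
    with open('/sys/class/net/%s/bonding/slaves' % bond) as f:
        slaves = f.read().split()
    if not slaves:
        return True  # a bond with no slaves cannot carry traffic
    return all(_operstate(nic) == 'down' for nic in slaves)


def _operstate(nic):
    with open('/sys/class/net/%s/operstate' % nic) as f:
        return f.read().strip()
```

Run against the reproduction in this bug, such a check returns True for bond0 even while the kernel's own operstate for bond0 still says 'up'.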
It seems that this BZ can be reproduced by using custom mode for the bond (custom bond mode is also used if the bond is created over the REST API in any bond mode except mode 4). Steps (a scripted version follows the output below):

1. Create a bridged network (e.g. net_bond).
2. Create a bond with custom mode on the host:
   1. Hosts -> your host -> Network Interfaces -> Setup Host Networks.
   2. Take one host NIC and drag it over another.
   3. In the Create new bond dialog, set Bonding mode: custom and Custom mode: mode=1.
   4. Attach the network created in step 1 to the new bond.
   5. Confirm the Setup Host Networks dialog by clicking OK.
3. SSH to the host and set all bond slaves down: `ip link set dev <your NIC> down`.

Result (bond0 = p1p1 + p1p2):

```
[root@dell-r210ii-05 ~]# ip a l p1p1
2: p1p1: <BROADCAST,MULTICAST,SLAVE> mtu 1500 qdisc mq master bond0 state DOWN qlen 1000
    link/ether 90:e2:ba:04:2d:74 brd ff:ff:ff:ff:ff:ff
[root@dell-r210ii-05 ~]# ip a l p1p2
3: p1p2: <BROADCAST,MULTICAST,SLAVE> mtu 1500 qdisc mq master bond0 state DOWN qlen 1000
    link/ether 90:e2:ba:04:2d:74 brd ff:ff:ff:ff:ff:ff
[root@dell-r210ii-05 ~]# ip a l bond0
33: bond0: <BROADCAST,MULTICAST,MASTER,UP,LOWER_UP> mtu 1500 qdisc noqueue state UP
    link/ether 90:e2:ba:04:2d:74 brd ff:ff:ff:ff:ff:ff
    inet6 fe80::92e2:baff:fe04:2d74/64 scope link
       valid_lft forever preferred_lft forever
[root@dell-r210ii-05 ~]# cat /sys/class/net/bond0/operstate
up
[root@dell-r210ii-05 ~]# cat /etc/sysconfig/network-scripts/ifcfg-bond0
# Generated by VDSM version 4.16.3-3.el6ev.beta
DEVICE=bond0
BONDING_OPTS=mode=1
BRIDGE=net_bond
ONBOOT=no
MTU=1500
NM_CONTROLLED=no
HOTPLUG=no
```

Used versions:

```
[root@dell-r210ii-05 ~]# rpm -q vdsm
vdsm-4.16.3-3.el6ev.beta.x86_64
Red Hat Enterprise Virtualization Manager Version: 3.5.0-0.12.beta.el6ev
[root@dell-r210ii-05 ~]# uname -r
2.6.32-431.23.3.el6.x86_64
[root@dell-r210ii-05 ~]# cat /etc/redhat-release
Red Hat Enterprise Linux Server release 6.5 (Santiago)
```

engine.log:

```
2014-09-18 14:33:43,338 WARN [org.ovirt.engine.core.dal.dbbroker.auditloghandling.AuditLogDirector] (DefaultQuartzScheduler_Worker-37) Correlation ID: null, Call Stack: null, Custom Event ID: -1, Message: Slave p1p2 of bond bond0 on host dell-05, changed state to down
2014-09-18 14:33:43,347 WARN [org.ovirt.engine.core.dal.dbbroker.auditloghandling.AuditLogDirector] (DefaultQuartzScheduler_Worker-37) Correlation ID: null, Call Stack: null, Custom Event ID: -1, Message: Slave p1p1 of bond bond0 on host dell-05, changed state to down
```

Created attachment 938906 [details]: log_collector with custom bond mode
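The manual slave-down step can be scripted; the following is a hypothetical repro helper (not an oVirt tool) that must run as root on the host:

```python
import subprocess


def down_slaves_and_check(bond, slaves):
    """Administratively down every slave (step 3 above), then print the
    bond's operstate exactly as the kernel reports it via sysfs.
    Hypothetical repro helper; requires root on the host."""
    for nic in slaves:
        subprocess.check_call(['ip', 'link', 'set', 'dev', nic, 'down'])
    with open('/sys/class/net/%s/operstate' % bond) as f:
        print('%s operstate: %s' % (bond, f.read().strip()))


# With BONDING_OPTS=mode=1 (no miimon) this prints 'up' -- the bug;
# with miimon set, the kernel notices the dead slaves and prints 'down'.
down_slaves_and_check('bond0', ['p1p1', 'p1p2'])
```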
Moving to 3.5.1, as this is not important enough to block 3.5.0. I can repeat my question to vfalico: isn't it a kernel bug? Martin, which bonding modes (other than 1) report 'up' despite all slaves being down?

Playing with bonds a bit more, I've found that the issue seems to be tied to custom bond mode 1 and the miimon parameter.

If the custom bond is set to "mode=1", it produces the following ifcfg entry, which does not work:

```
BONDING_OPTS=mode=1
```

If the custom bond is set to "mode=1 miimon=100", it produces the following ifcfg entry, which is the same as if you use the drop-down menu to select mode 1 (note that BONDING_OPTS is now quoted):

```
BONDING_OPTS='mode=1 miimon=100'
```

Other custom bond modes (2, 4, 5) work OK.

SUMMARY: bond mode 1 requires miimon=100 to work properly.

Time to close this bug again... We know all too well that without setting miimon, the bonding module does not sense downed interfaces.

```
commit 5a9f424a9fe07e935820493e5d8fcf5d1626adf2
Author: Lior Vernia <lvernia>
Date:   Wed Aug 7 11:45:36 2013 +0300

    webadmin: Set miimon=100 for preset bonding options

    So far it hasn't been set by default, which would apparently cause
    the kernel to not poll for the status of the bonded NICs.

    Change-Id: If4a6070639d6566f9ebb5e30adf16d63128eb820
    Signed-off-by: Lior Vernia <lvernia>
```

http://gerrit.ovirt.org/17821/

oVirt 3.5.1 has been released. If problems still persist, please make note of it in this bug report.
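The commit above covers the preset bonding modes; custom option strings are still passed through as the user typed them, which is how "mode=1" without miimon slipped in here. A helper in the spirit of that fix might look like the following hypothetical sketch (not the actual webadmin code):

```python
def with_default_miimon(bonding_opts, miimon=100):
    """Append a default miimon to a BONDING_OPTS string when the user
    did not specify one. Hypothetical helper illustrating the commit
    above; the real webadmin change applies to preset modes only."""
    opts = dict(pair.split('=', 1) for pair in bonding_opts.split())
    if 'miimon' not in opts:
        opts['miimon'] = str(miimon)
    return ' '.join('%s=%s' % (k, v) for k, v in sorted(opts.items()))


assert with_default_miimon('mode=1') == 'miimon=100 mode=1'
assert with_default_miimon('mode=1 miimon=50') == 'miimon=50 mode=1'
```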
Created attachment 922487 [details]: log_collector

Description of problem:

If all bond slaves are taken down by `ip link set dev XXX down`, vdsm reports the bond as up, which is not correct. If the `ifdown` command is used instead, it works correctly (since the physical link is shut down).

bond0 = p1p1 + p1p2

```
[root@dell-r210ii-08 ~]# ip link set dev p1p1 down
[root@dell-r210ii-08 ~]# ip link set dev p1p2 down
[root@dell-r210ii-08 ~]# vdsClient -s 0 getVdsStats
'bond0': {'name': 'bond0', 'rxDropped': '6104', 'rxErrors': '1413',
          'rxRate': '0.0', 'speed': '1000', 'state': 'up',     <------------------
          'txDropped': '0', 'txErrors': '0', 'txRate': '0.0'},
'p1p1':  {'name': 'p1p1', 'rxDropped': '8', 'rxErrors': '1413',
          'rxRate': '0.0', 'speed': '1000', 'state': 'down',   <----------------
          'txDropped': '0', 'txErrors': '0', 'txRate': '0.0'},
'p1p2':  {'name': 'p1p2', 'rxDropped': '0', 'rxErrors': '0',
          'rxRate': '0.0', 'speed': '1000', 'state': 'down',   <-----------------
          'txDropped': '0', 'txErrors': '0', 'txRate': '0.0'},
```

Version-Release number of selected component (if applicable):

```
[root@dell-r210ii-08 ~]# rpm -q vdsm
vdsm-4.16.1-0.gita4d9abf.fc19.x86_64
```

How reproducible: 100%

Steps to Reproduce:
1. Create a bond.
2. Shut down all bond slaves using `ip link set dev XXX down`.
3. Run `vdsClient -s 0 getVdsStats`.

Actual results: the bond is reported as UP despite the fact that all its slaves are down.

Expected results: the bond is reported as DOWN.

Additional info:
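As background on why `ip link set ... down` and `ifdown` behave differently here: the kernel exposes two per-device signals, operstate and carrier, and vdsm's report reflects the kernel's operstate. A small illustrative reader of both signals (standard sysfs paths; not part of vdsm):

```python
def kernel_link_view(dev):
    """Return (operstate, carrier) for a device from standard sysfs
    paths. Illustrative helper, not vdsm code. Reading 'carrier' on an
    administratively downed device fails with EINVAL; that case is
    reported as 'unknown'."""
    base = '/sys/class/net/%s/' % dev
    with open(base + 'operstate') as f:
        operstate = f.read().strip()
    try:
        with open(base + 'carrier') as f:
            carrier = f.read().strip()
    except (IOError, OSError):
        carrier = 'unknown'
    return operstate, carrier


# After `ip link set dev p1p1 down` the slave shows ('down', 'unknown'),
# while without miimon the bond master may still show ('up', '1') --
# exactly the mismatch reported in this bug.
```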