Bug 654600
Summary: Kernel panic on vlan with bonding in balance-alb mode
Product: Red Hat Enterprise Linux 5
Component: kernel
Version: 5.6
Hardware: Unspecified
OS: Linux
Status: CLOSED DUPLICATE
Severity: high
Priority: high
Keywords: Regression
Target Milestone: rc
Reporter: Liang Zheng <lzheng>
Assignee: Neil Horman <nhorman>
QA Contact: Red Hat Kernel QE team <kernel-qe>
CC: agospoda, hjia, kzhang, peterm
Doc Type: Bug Fix
Last Closed: 2010-12-06 16:02:17 UTC
Description
Liang Zheng, 2010-11-18 11:22:27 UTC
This appears to be a side-effect of the code added to enable netconsole over bonded interfaces, as the BUG halt is from block_netpoll_tx. Assigning to Neil since that was his feature.

Created attachment 463769 [details]
convert per cpu mask to a counter
The more I look at it, the more I think using a per-cpu flag was a mistake. The only time we query the flag is when we recurse through a netpoll path, and the IFF_IN_NETPOLL flag already gates us to same-CPU access. Worse, the possibility of sleeping in those paths means we could end up clearing a different CPU's flag than the one we set, which is bad. There won't be any performance hit if we just make this a counter instead, and that will allow us to sleep while holding the netpoll_block_tx flag. It does mean that all netpoll clients will block at the same time, even if some could make forward progress, but no one actually runs more than one netpoll client at a time, so that's not really relevant.
Please test this patch and let me know if it solves the problem. If it does, I'll push it upstream and into RHEL 5 and 6.
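To illustrate the approach described above, here is a minimal user-space sketch of converting a "we are in netpoll" per-cpu flag into a shared counter. The names (`block_netpoll_tx`, `unblock_netpoll_tx`, `is_netpoll_tx_blocked`) are illustrative only, not the actual symbols in the attached patch; the point is that a counter has no per-CPU identity, so the holder can sleep and resume on a different CPU without clearing state it never set.

```c
/* Hypothetical sketch of the per-cpu-flag -> counter conversion.
 * With a per-cpu flag, a task that sets the flag on CPU 0, sleeps,
 * and wakes on CPU 1 would clear the wrong flag.  A single shared
 * counter has no CPU affinity, so that hazard disappears; the cost
 * is that all netpoll clients block together while any holder is
 * active. */
#include <stdatomic.h>

static atomic_int netpoll_block_count;

/* Enter a region during which netpoll transmits must be deferred.
 * Nested or concurrent holders simply stack up on the counter. */
void block_netpoll_tx(void)
{
    atomic_fetch_add(&netpoll_block_count, 1);
}

/* Leave the region; the last holder re-enables netpoll transmits.
 * In the kernel, a WARN_ON if this ever goes negative would catch
 * unbalanced calls at driver teardown. */
void unblock_netpoll_tx(void)
{
    atomic_fetch_sub(&netpoll_block_count, 1);
}

/* Queried on the netpoll transmit path: any nonzero count blocks,
 * regardless of which CPU took the reference. */
int is_netpoll_tx_blocked(void)
{
    return atomic_load(&netpoll_block_count) != 0;
}
```

The trade-off Neil mentions is visible here: the counter cannot distinguish holders, so every netpoll client waits while any one of them holds a reference, which is acceptable when at most one netpoll client is active at a time.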
Created attachment 463770 [details]
updated patch to convert per-cpu flag to counter

Sorry, updated patch. I misnamed an original file, so it didn't get picked up in the original patch.
(In reply to comment #2)
> The more I look at it the more I think using a per-cpu flag was a mistake. [...]
> Please test this patch out and let me know if it solves the problem. If it
> does I'll push it upstream and into RHEL5/6

This is probably a better solution than the CPU mask, as it avoids the case where the locks are taken and released on different CPUs (though it seems those cases were removed). Should some care also be taken to make sure the refcount is zero by the time the driver is removed?

We can certainly add a WARN_ON, but we should never be waiting on that refcount to decrement. The use paths of the counter are all within the driver, and the onus of making sure that all the tasks/workqueues/etc. which use this code are flushed already falls on the driver exit routine.

Something like a warning was all I was thinking about.

(In reply to comment #3)
> sorry, updated patch. Misnamed an origional file and so it didn't get picked
> up in the origional patch.

This bug reproduces intermittently.
After running `service network restart` about 20 times, the kernel panicked:

[root@ibm-ls21-03 network-scripts]# service network restart
Shutting down interface bond0.10:  Removed VLAN -:bond0.10:-  [  OK  ]
Shutting down interface bond0:  bonding: bond0: Warning: the permanent HWaddr of eth0 - 00:14:5E:6D:1C:B8 - is still in use by bond0. Set the HWaddr of eth0 to a different address to avoid conflicts.
----------- [cut here ] --------- [please bite here ] ---------
Kernel BUG at drivers/net/bonding/bonding.h:135
invalid opcode: 0000 [1] SMP
last sysfs file: /class/net/bond0/bonding/slaves
CPU 0
Modules linked in: bonding 8021q autofs4 hidp rfcomm l2cap bluetooth lockd sunrpc ipv6 xfrm_nalgo crypto_api loop dm_multipath scsi_dh video backlight sbs power_meter i2c_ec dell_wmi wmi button battery asus_acpi acpi_memhotplug ac parport_pc lp parport sg i2c_piix4 tpm_tis k8temp i2c_core k8_edac bnx2 tpm hwmon edac_mc serio_raw tpm_bios pcspkr dm_raid45 dm_message dm_region_hash dm_mem_cache dm_snapshot dm_zero dm_mirror dm_log dm_mod shpchp mptsas mptscsih mptbase scsi_transport_sas sd_mod scsi_mod ext3 jbd uhci_hcd ohci_hcd ehci_hcd
Pid: 12674, comm: ifdown-eth Not tainted 2.6.18-232.el5 #1
RIP: 0010:[<ffffffff884c2c0b>]  [<ffffffff884c2c0b>] :bonding:bond_release+0x62/0x4f1
RSP: 0018:ffff810127609e28  EFLAGS: 00010286
RAX: 00000000ffffffff RBX: 00000000000005dc RCX: ffffffff80318f28
RDX: ffffffff80318f28 RSI: ffff81022c488000 RDI: ffff8101281f2530
RBP: ffff8101281f2500 R08: ffffffff80318f28 R09: 0000000000000001
R10: 0000000000000000 R11: 0000000000000080 R12: ffff8101281f2000
R13: 0000000000000006 R14: ffff81022c488000 R15: ffff81012ea50ac0
FS:  00002addc8667f50(0000) GS:ffffffff80424000(0000) knlGS:0000000000000000
CS:  0010 DS: 0000 ES: 0000 CR0: 000000008005003b
CR2: 000000388ca69220 CR3: 000000012743a000 CR4: 00000000000006e0
Process ifdown-eth (pid: 12674, threadinfo ffff810127608000, task ffff810128bc57a0)
Stack:  00000000000080d0 ffffffff8006456b ffff810128bc57a0 00000000000005dc
 ffff81022c488000 ffff8101281f2500 0000000000000006 0000000000000006
 ffff81012ea50ac0 ffffffff884cbb54 000000316874652d 0000000000000000
Call Trace:
 [<ffffffff8006456b>] __down_write_nested+0x12/0x92
 [<ffffffff884cbb54>] :bonding:bonding_store_slaves+0x25c/0x2f7
 [<ffffffff8010fdb5>] sysfs_write_file+0xb9/0xe8
 [<ffffffff80016af0>] vfs_write+0xce/0x174
 [<ffffffff800173a8>] sys_write+0x45/0x6e
 [<ffffffff8005d28d>] tracesys+0xd5/0xe0
Code: 0f 0b 68 aa d4 4c 88 c2 87 00 4c 8b 6d 08 31 c0 eb 0c 4d 39
RIP  [<ffffffff884c2c0b>] :bonding:bond_release+0x62/0x4f1
 RSP <ffff810127609e28>
<0>Kernel panic - not syncing: Fatal exception

It looks like the issue is not the same. I tested on kernel 2.6.18-194.el5 and it does not reproduce there, so I think this is a regression.

Ok, I'm going to post this upstream and backport it then.

*** This bug has been marked as a duplicate of bug 659594 ***