Bug 438629
Summary: | multiple concurrent brctl addif cause kernel panic
---|---
Product: | Red Hat Enterprise Linux 5
Reporter: | Dan Kenigsberg <danken>
Component: | kernel
Assignee: | Neil Horman <nhorman>
Status: | CLOSED DUPLICATE
QA Contact: | Martin Jenner <mjenner>
Severity: | high
Priority: | low
Version: | 5.1
CC: | davem, herbert.xu, tgraf
Target Milestone: | rc
Hardware: | x86_64
OS: | Linux
Doc Type: | Bug Fix
Last Closed: | 2008-05-14 18:14:30 UTC

Description
Dan Kenigsberg
2008-03-23 15:37:39 UTC
Created attachment 298862
reproduce kernel panic with multiple concurrent brctl/tunctl
Created attachment 298863
tunctl source code

Dan, can you cut-and-paste the output of the panic here? Thanks, P.

Sure. Note that this specific run is on an unpatched 2.6.18-8 kernel, but the same thing happens with the 5.1 kernel.

device 0x0 entered promiscuous mode
device 0x0 left promiscuous mode
sw0: port 2(0x0) entering disabled state
device 1x0 entered promiscuous mode
device 16x0 entered promiscuous mode
device 16x0 left promiscuous mode
sw0: port 3(16x0) entering disabled state
device 9x0 entered promiscuous mode
device 9x0 left promiscuous mode
sw0: port 3(9x0) entering disabled state
device 6x0 entered promiscuous mode
device 6x0 left promiscuous mode
sw0: port 3(6x0) entering disabled state
device 11x0 entered promiscuous mode
device 2x0 entered promiscuous mode
device 3x0 entered promiscuous mode
device 4x0 entered promiscuous mode
sw0: port 2(1x0) entering learning state
device 5x0 entered promiscuous mode
device 5x0 left promiscuous mode
sw0: port 7(5x0) entering disabled state
device 1x0 left promiscuous mode
sw0: port 2(1x0) entering disabled state
device 18x0 entered promiscuous mode
device 18x0 left promiscuous mode
sw0: port 2(18x0) entering disabled state
device 25x0 entered promiscuous mode
device 2x0 left promiscuous mode
sw0: port 4(2x0) entering disabled state
device 2x1 entered promiscuous mode
device 4x0 left promiscuous mode
sw0: port 6(4x0) entering disabled state
device 4x1 entered promiscuous mode
device 28x0 entered promiscuous mode
device 4x1 left promiscuous mode
sw0: port 6(4x1) entering disabled state
sw0: port 3(11x0) entering learning state
device 7x0 entered promiscuous mode
device 26x0 entered promiscuous mode
device 27x0 entered promiscuous mode
device 10x0 entered promiscuous mode
sw0: port 7(28x0) entering learning state
device 0x1 entered promiscuous mode
device 17x0 entered promiscuous mode
Unable to handle kernel NULL pointer dereference at 0000000000000009 RIP:
 [<ffffffff8004b284>] run_workqueue+0x64/0xe5
PGD 0
Oops: 0002 [1] SMP
last sysfs file: /class/net/lo/type
CPU 0
Modules linked in: netconsole tun ksm_mem(U) kvm_intel(U) kvm(U) autofs4 hidp nfs lockd fscache nfs_acl rfcomm l2cap bluetooth sunrpc bridge ipv6 cpufreq_ondemand video sbs i2c_ec button battery asus_acpi acpi_memhotplug ac parport_pc lp parport sg i2c_i801 ide_cd i2c_core cdrom serio_raw shpchp e1000 pcspkr dm_snapshot dm_zero dm_mirror dm_mod ata_piix libata sd_mod scsi_mod ext3 jbd ehci_hcd ohci_hcd uhci_hcd
Pid: 8, comm: events/0 Tainted: GF 2.6.18-8.el5 #1
RIP: 0010:[<ffffffff8004b284>] [<ffffffff8004b284>] run_workqueue+0x64/0xe5
RSP: 0018:ffff810037dabe40 EFLAGS: 00010006
RAX: 000000046474e550 RBX: ffff810071071740 RCX: 0000000000000000
RDX: 0000000000000001 RSI: 0000000000000296 RDI: ffff810037d09740
RBP: ffff810071071748 R08: ffff810037d09788 R09: ffffffff800617b6
R10: ffff81005784f588 R11: ffff810058780e48 R12: ffff810037d09740
device 17x0 left promiscuous mode
sw0: port 12(17x0) entering disabled state
R13: 0000000000000296 R14: 00000000004056f0 R15: 00000000000056f0
FS: 0000000000000000(0000) GS:ffffffff8038a000(0000) knlGS:0000000000000000
CS: 0010 DS: 0018 ES: 0018 CR0: 000000008005003b
CR2: 0000000000000009 CR3: 0000000059463000 CR4: 00000000000026e0
Process events/0 (pid: 8, threadinfo ffff810037daa000, task ffff810037fef7a0)
Stack: ffff810037dabe80 ffff810037d09740 ffffffff80047c13 ffff81007fe31d10
 ffff81007fee9720 ffffffff80280001 0000000000000000 ffffffff80047d03
 0000000000000000 ffff810037fef7a0 ffffffff80086c5f 0000000000100100
Call Trace:
 [<ffffffff80047c13>] worker_thread+0x0/0x122
 [<ffffffff80047d03>] worker_thread+0xf0/0x122
 [<ffffffff80086c5f>] default_wake_function+0x0/0xe
 [<ffffffff8003216e>] kthread+0xfe/0x132
 [<ffffffff8005bfe5>] child_rip+0xa/0x11
 [<ffffffff80032070>] kthread+0x0/0x132
 [<ffffffff8005bfdb>] child_rip+0x0/0x11
Code: 48 89 42 08 48 89 10 48 89 6d 08 48 89 6d 00 e8 c0 73 01 00
RIP [<ffffffff8004b284>] run_workqueue+0x64/0xe5
 RSP <ffff810037dabe40>
CR2: 0000000000000009
 <0>Kernel panic - not syncing: Fatal exception

Could you please test with the latest RHEL 5.2 kernel (-92.el5 I think is the latest)? This looks like a duplicate of bz 408791. Thanks!

Pardon my ignorance, but where can I get one of those latest RHEL 5.2 kernels? Please note that bug 408791 is not viewable by me, so I cannot judge if it's the same. (If it is, maybe this bug, too, has to be embargoed.)

You can get them from the RHN beta channel for RHEL5 server. It's release 84.el5, rather than 92, but it should still have the fix in place. I've cc'd you on the other bug. It's not sensitive, just inaccessible with the group set that you're in. You should be able to see it now.

Created attachment 305372
nasty script makes network unworkable for an hour
With -92.el5 the panic is gone. However, my nasty script makes the server
unresponsive for at least an hour (/var/log/messages attached). This does not
happen on my Fedora 2.6.24.3-12.fc8 kernel.

Yeah, fork bombs do that too ;) Your script effectively creates 1000 processes all trying to manipulate some of the same data structures. The patch for the panic you reported works by serializing the removal of your tun/tap interfaces behind the completion of the port_carrier_check work that the bridge has to do on the tun/tap interface after it's added to the bridge. The result is that every one of those 1000 processes has to wait in line for the bridge interface to process its corresponding carrier check work.

As for F-8 not having this problem, it looks like between the time this patch went upstream and the release of 2.6.24, there were significant changes made to the bridge code, which among other things took this carrier check operation out of the blocking path for the operations you are trying to perform. We could look into moving that code back to RHEL5 if you like, but I suspect it will be an ABI breaker if we do, and given that the script you provide here is more an academic exercise than a practical function, I'd say it's probably best to just leave this fixed as it is.

*** This bug has been marked as a duplicate of 408791 ***

Thanks for your help. FYI, I wrote that script in order to reproduce a real-world error (brctl Abort'ing without core dump, while other tunctl/brctl were running). Instead, I produced this kernel panic. This was enough to prove that we cannot trust the bridge code, and must protect ourselves with a userspace semaphore.

No worries. I figured that you were trying to reproduce a real-world error, I just didn't figure that the real-world problem involved actually trying to create and delete 1000 tap interfaces on a bridge at once. As you can see from the fix though, that serialization happens in the kernel now, so you shouldn't need any additional synchronization in userspace.

Regards
Neil
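
For anyone stuck on a kernel without the fix, the userspace semaphore mentioned in this thread could be approximated with an advisory file lock held around each bridge command. The following is only a minimal Python sketch under stated assumptions, not the reporter's actual code: the helper name run_serialized and the lock path /tmp/bridge.lock are hypothetical, and every process that touches the bridge would have to go through the same lock file for it to help.

```python
import fcntl
import subprocess


def run_serialized(cmd, lock_path="/tmp/bridge.lock"):
    """Run cmd while holding an exclusive advisory lock on lock_path.

    run_serialized and the default lock path are illustrative names only.
    Concurrent callers, even in separate processes, block in flock()
    until the current holder releases the lock, so the wrapped commands
    execute one at a time.
    """
    # Append mode creates the lock file if it does not exist yet
    # and never truncates it.
    with open(lock_path, "a") as lock_file:
        fcntl.flock(lock_file, fcntl.LOCK_EX)  # blocks until the lock is free
        try:
            # Return the command's exit status to the caller.
            return subprocess.run(cmd, check=False).returncode
        finally:
            fcntl.flock(lock_file, fcntl.LOCK_UN)
```

A caller would wrap each tunctl and brctl invocation, for example run_serialized(["brctl", "addif", "sw0", "tap0"]), where tap0 is a placeholder interface name and sw0 is the bridge seen in the log above. As the last comment points out, kernels carrying the fix for bug 408791 already serialize this work in the kernel, so a wrapper like this should only matter on unpatched kernels.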