Bug 216895
Summary: BUG: bringing up balanced-alb mode bond network interface
Product: Red Hat Enterprise Linux 4
Component: kernel
Version: 4.4
Hardware: All
OS: Linux
Status: CLOSED ERRATA
Severity: urgent
Priority: medium
Reporter: jordan hargrave <jordan_hargrave>
Assignee: Andy Gospodarek <agospoda>
QA Contact: Brian Brock <bbrock>
CC: jbaron, jfeeney, jordan_hargrave, linville, peterm, wwlinuxengineering
Keywords: Regression
Fixed In Version: RHBA-2007-0304
Doc Type: Bug Fix
Last Closed: 2007-05-08 04:14:55 UTC
Bug Depends On: 210577
Bug Blocks: 200936
Description
jordan hargrave
2006-11-22 16:13:52 UTC
Note: the same issue occurs with the RHEL4.4 kernel as with the RHEL5 kernel. This issue is for tracking RHEL4.4. This is a High severity issue for Dell. Request Urgent action on this one.
Can you please be more specific about what you are seeing on rhel4.4? There are actually 2 errors in BZ 210577. Ones like these:
BUG: sleeping function called from invalid context at mm/slab.c:2948
in_atomic():1, irqs_disabled():0
[<c04051ed>] show_trace_log_lvl+0x58/0x16a
[<c04057fa>] show_trace+0xd/0x10
[<c0405913>] dump_stack+0x19/0x1b
[<c041db63>] __might_sleep+0x8d/0x95
[<c0470446>] kmem_cache_alloc+0x28/0xb5
[<c05b5015>] __alloc_skb+0x2c/0xfa
[<c05c04c7>] rtmsg_ifinfo+0x21/0x6e
[<c05c054c>] rtnetlink_event+0x38/0x3c
[<c0615a2d>] notifier_call_chain+0x20/0x31
[<c0430a84>] raw_notifier_call_chain+0x8/0xa
[<c05b8011>] dev_set_mac_address+0x48/0x4e
[<f8d06f8c>] alb_set_slave_mac_addr+0x5d/0x83 [bonding]
[<f8d07725>] bond_alb_handle_active_change+0xb3/0xc9 [bonding]
[<f8d02b0f>] bond_change_active_slave+0x1a7/0x298 [bonding]
[<f8d0357b>] bond_select_active_slave+0x99/0xce [bonding]
[<f8d04886>] bond_mii_monitor+0x364/0x3ab [bonding]
[<c042da4b>] run_timer_softirq+0x108/0x167
[<c04290b3>] __do_softirq+0x78/0xf2
[<c0406683>] do_softirq+0x5a/0xbe
[<c0428f5c>] irq_exit+0x3d/0x3f
[<c04179cf>] smp_apic_timer_interrupt+0x73/0x78
[<c0404b12>] apic_timer_interrupt+0x2a/0x30
And ones like these:
RTNL: assertion failed at net/core/fib_rules.c (388)
[<c04051ed>] show_trace_log_lvl+0x58/0x16a
[<c04057fa>] show_trace+0xd/0x10
[<c0405913>] dump_stack+0x19/0x1b
[<c05c553d>] fib_rules_event+0x34/0xeb
[<c0615a2d>] notifier_call_chain+0x20/0x31
[<c0430a84>] raw_notifier_call_chain+0x8/0xa
[<c05b8011>] dev_set_mac_address+0x48/0x4e
[<f8d06f8c>] alb_set_slave_mac_addr+0x5d/0x83 [bonding]
[<f8d07725>] bond_alb_handle_active_change+0xb3/0xc9 [bonding]
[<f8d02b0f>] bond_change_active_slave+0x1a7/0x298 [bonding]
[<f8d0357b>] bond_select_active_slave+0x99/0xce [bonding]
[<f8d04886>] bond_mii_monitor+0x364/0x3ab [bonding]
[<c042da4b>] run_timer_softirq+0x108/0x167
[<c04290b3>] __do_softirq+0x78/0xf2
[<c0406683>] do_softirq+0x5a/0xbe
[<c0428f5c>] irq_exit+0x3d/0x3f
[<c04179cf>] smp_apic_timer_interrupt+0x73/0x78
[<c0404b12>] apic_timer_interrupt+0x2a/0x30
I suspect it is the latter, but I'd like to be sure.
We are seeing both messages; they are caused when dev_set_mac_address is called from the bond_mii_monitor thread.
You are seeing this message?
BUG: sleeping function called from invalid context at mm/slab.c:2948
in_atomic():1, irqs_disabled():0
Please just post the exact messages you are seeing. I don't doubt that you might be seeing something similar to the rhel5 issue, but the bonding code for rhel4 isn't the same as the code for rhel5, so there should be a different backtrace. Thanks.
We will post the messages shortly.
Andy, I got to reproduce the hang. It is not the rtnl lock though. It's simply because we are scheduling in atomic context (sounds familiar, right?). The culprit this time is new code that is not in the native R4.4 tg3 driver but is in the version 3.70 dkms tg3 driver that I am testing the R4.4 bonding drivers with. The new code is the full chip reset which is done if ASF is enabled so that the driver's MAC address is not overwritten by the ASF MAC address. The new code is present in R5, so we can easily reproduce it there, but it is not present in the native R4.4 code, so we are not able to reproduce the hang with the native tg3 driver. The trace for your reference:
ip_tables: (C) 2000-2002 Netfilter core team
ip_tables: (C) 2000-2002 Netfilter core team
bad: scheduling while atomic!
Call Trace:<IRQ>
<ffffffff80359a9f>{schedule+72}
<ffffffff8013adad>{printk+141}
<ffffffff8035b62f>{schedule_timeout+410}
<ffffffff8014545f>{process_timeout+0}
<ffffffffa00dafba>{:tg3:tg3_restart_hw+12}
<ffffffffa00db5c6>{:tg3:tg3_set_mac_addr+208}
<ffffffffa00c59f5>{:bonding:alb_set_slave_mac_addr+68}
<ffffffffa00d1000>{:tg3:tg3_write32+0}
<ffffffffa00c5aba>{:bonding:alb_swap_mac_addr+161}
<ffffffffa00bef81>{:bonding:bond_change_active_slave+520}
<ffffffffa00bf81d>{:bonding:bond_mii_monitor+944}
<ffffffff80144d1d>{run_timer_softirq+591}
<ffffffff80140164>{__do_softirq+76}
<ffffffff801401eb>{do_softirq+49}
<ffffffff80113d8f>{do_IRQ+664}
<ffffffff8011105b>{ret_from_intr+0}
<EOI>
<ffffffff8010e886>{mwait_idle+85}
<ffffffff8010e817>{cpu_idle+26}
<ffffffff8054b6f8>{start_kernel+637}
<ffffffff8054b1ab>{_sinittext+427}
bad: scheduling from the idle thread!
Call Trace:<IRQ>
<ffffffff80359ae9>{schedule+146}
<ffffffff8013adad>{printk+141}
<ffffffff8035b62f>{schedule_timeout+410}
<ffffffff8014545f>{process_timeout+0}
<ffffffffa00dafba>{:tg3:tg3_restart_hw+12}
<ffffffffa00db5c6>{:tg3:tg3_set_mac_addr+208}
<ffffffffa00c59f5>{:bonding:alb_set_slave_mac_addr+68}
<ffffffffa00d1000>{:tg3:tg3_write32+0}
<ffffffffa00c5aba>{:bonding:alb_swap_mac_addr+161}
<ffffffffa00bef81>{:bonding:bond_change_active_slave+520}
<ffffffffa00bf81d>{:bonding:bond_mii_monitor+944}
<ffffffff80144d1d>{run_timer_softirq+591}
<ffffffff80140164>{__do_softirq+76}
<ffffffff801401eb>{do_softirq+49}
<ffffffff80113d8f>{do_IRQ+664}
<ffffffff8011105b>{ret_from_intr+0}
<EOI>
<ffffffff8010e886>{mwait_idle+85}
<ffffffff8010e817>{cpu_idle+26}
<ffffffff8054b6f8>{start_kernel+637}
<ffffffff8054b1ab>{_sinittext+427}
bad: scheduling while atomic!
Call Trace:<IRQ>
<ffffffff80359a9f>{schedule+72}
<ffffffff8035b62f>{schedule_timeout+410}
<ffffffff8014545f>{process_timeout+0}
<ffffffffa00dafba>{:tg3:tg3_restart_hw+12}
<ffffffffa00db5c6>{:tg3:tg3_set_mac_addr+208}
<ffffffffa00c59f5>{:bonding:alb_set_slave_mac_addr+68}
<ffffffffa00d1000>{:tg3:tg3_write32+0}
<ffffffffa00c5aba>{:bonding:alb_swap_mac_addr+161}
<ffffffffa00bef81>{:bonding:bond_change_active_slave+520}
<ffffffffa00bf81d>{:bonding:bond_mii_monitor+944}
<ffffffff80144d1d>{run_timer_softirq+591}
<ffffffff80140164>{__do_softirq+76}
<ffffffff801401eb>{do_softirq+49}
<ffffffff80113d8f>{do_IRQ+664}
<ffffffff8011105b>{ret_from_intr+0}
<EOI>
<ffffffff8010e886>{mwait_idle+85}
<ffffffff8010e817>{cpu_idle+26}
<ffffffff8054b6f8>{start_kernel+637}
<ffffffff8054b1ab>{_sinittext+427}
bad: scheduling from the idle thread!
Call Trace:<IRQ>
<ffffffff80359ae9>{schedule+146}
<ffffffff8035b62f>{schedule_timeout+410}
<ffffffff8014545f>{process_timeout+0}
<ffffffffa00dafba>{:tg3:tg3_restart_hw+12}
<ffffffffa00db5c6>{:tg3:tg3_set_mac_addr+208}
<ffffffffa00c59f5>{:bonding:alb_set_slave_mac_addr+68}
<ffffffffa00d1000>{:tg3:tg3_write32+0}
<ffffffffa00c5aba>{:bonding:alb_swap_mac_addr+161}
<ffffffffa00bef81>{:bonding:bond_change_active_slave+520}
<ffffffffa00bf81d>{:bonding:bond_mii_monitor+944}
<ffffffff80144d1d>{run_timer_softirq+591}
<ffffffff80140164>{__do_softirq+76}
<ffffffff801401eb>{do_softirq+49}
<ffffffff80113d8f>{do_IRQ+664}
<ffffffff8011105b>{ret_from_intr+0}
<EOI>
<ffffffff8010e886>{mwait_idle+85}
<ffffffff8010e817>{cpu_idle+26}
<ffffffff8054b6f8>{start_kernel+637}
<ffffffff8054b1ab>{_sinittext+427}
Unable to handle kernel paging request at 0000000000200200 RIP:
<ffffffff801341a4>{dequeue_task+18}
PML4 31ec1067 PGD 31736067 PMD 0
Oops: 0002 [1]
CPU 0
Modules linked in: md5(U) ipv6(U) parport_pc(U) lp(U) parport(U) autofs4(U) i2c_dev(U) i2c_core(U) sunrpc(U) ds(U) yenta_socket(U) pcmcia_core(U) dm_multipath(U) button(U) battery(U) ac(U) usb_storage(U) uhci_hcd(U) ehci_hcd(U) tg3(U) bonding(U) dm_snapshot(U) dm_zero(U) dm_mirror(U) ext3(U) jbd(U) dm_mod(U) ata_piix(U) libata(U) aacraid(U) sd_mod(U) scsi_mod(U)
Pid: 0, comm: swapper Not tainted 2.6.9-prep
RIP: 0010:[<ffffffff801341a4>] <ffffffff801341a4>{dequeue_task+18}
RSP: 0018:ffffffff804adc20 EFLAGS: 00010097
RAX: 0000000000100100 RBX: ffffffff80427880 RCX: ffffffff804278b8
RDX: 0000000000200200 RSI: 0000000000000000 RDI: ffffffff80427880
RBP: ffffffff804adc20 R08: 0000000000000005 R09: ffffffff8039241e
R10: 0000ffff80454060 R11: 0000ffff80454060 R12: 00000000ffff7e18
R13: 0000000000000000 R14: ffffffff804bca20 R15: 0000005895d994b3
FS: 0000000000000000(0000) GS:ffffffff80544d00(0000) knlGS:0000000000000000
CS: 0010 DS: 0018 ES: 0018 CR0: 000000008005003b
CR2: 0000000000200200 CR3: 0000000000101000 CR4: 00000000000006e0
Process swapper (pid: 0, threadinfo ffffffff80548000, task ffffffff80427880)
Stack: ffffffff804adc38 ffffffff801343ec 00000000ffff7e18 ffffffff804adcb8 ffffffff80359cd1 00000000080303fe 0000000000000246 ffffffff80427880 000000003b9aca00 ffffffff80427880
Call Trace:<IRQ>
<ffffffff801343ec>{deactivate_task+37}
<ffffffff80359cd1>{schedule+634}
<ffffffff8035b62f>{schedule_timeout+410}
<ffffffff8014545f>{process_timeout+0}
<ffffffffa00dafba>{:tg3:tg3_restart_hw+12}
<ffffffffa00db5c6>{:tg3:tg3_set_mac_addr+208}
<ffffffffa00c59f5>{:bonding:alb_set_slave_mac_addr+68}
<ffffffffa00d1000>{:tg3:tg3_write32+0}
<ffffffffa00c5aba>{:bonding:alb_swap_mac_addr+161}
<ffffffffa00bef81>{:bonding:bond_change_active_slave+520}
<ffffffffa00bf81d>{:bonding:bond_mii_monitor+944}
<ffffffff80144d1d>{run_timer_softirq+591}
<ffffffff80140164>{__do_softirq+76}
<ffffffff801401eb>{do_softirq+49}
<ffffffff80113d8f>{do_IRQ+664}
<ffffffff8011105b>{ret_from_intr+0}
<EOI>
<ffffffff8010e886>{mwait_idle+85}
<ffffffff8010e817>{cpu_idle+26}
<ffffffff8054b6f8>{start_kernel+637}
<ffffffff8054b1ab>{_sinittext+427}
Code: 48 89 02 48 89 50 08 48 c7 41 08 00 02 20 00 8b 4f 2c 48 c7
RIP <ffffffff801341a4>{dequeue_task+18} RSP <ffffffff804adc20>
CR2: 0000000000200200
<0>Kernel panic - not syncing: Oops
Created attachment 148312 [details]
diff output
This is the diff output of the 3.71 tg3 driver from broadcom and the native tg3
driver in R4.4
Shyam, Thanks for posting that info. The tg3 problem is known upstream and a bugzilla has been opened there: http://bugzilla.kernel.org/show_bug.cgi?id=7974
Per conversation with Charles Rose at Dell on 2/22/2007, Dell wants this bugzilla to be a blocker of RHEL4-5 because it is a regression. In addition, when bringing up a bonded interface in balanced-alb mode, the user sees "a ton of BUG spew." This is seen as a significant call generator for Dell. This bug is analogous to bz210577, which was deemed a blocker as well as a Day 0 Errata for the release it was associated with, which should indicate the level of attention it has received. The patch is most likely in the kernel. I am working on getting the other appropriate flags set.
Does this happen with the current 4.5 beta kernel? I don't see how this is a regression if it happens with 4.4 + dkms tg3 driver, but doesn't happen with the 4.4 tg3 driver.
Just as I feared after looking at the tg3 code in the rhel4.5 beta kernel, I do get the hang in the rhel4.5 bond setup now. So, as I said in comment 8, it is the ASF code that has been added in the newer tg3 drivers that is aggravating the bonding design problem. Attaching the serial trace.
Created attachment 149677 [details]
bonding hang rhel4.5 beta1 kernel.
kernel version = 2.6.18-8.el5
Sorry, the kernel version in comment no. 15 is 2.6.9-48.EL.x86_64.
Shyam has provided the required info.
Agreed. Good progress is being made upstream on this issue so I hope to have something that we can use for both rhel4 and rhel5 soon. The best way for you to help is to continue to test kernels for rhel5 so we can find as many problems as possible.
Yes. We but dell is shipping a lot of servers with rhel4 and with the current bug that is being introduced this needs equal attention as well.
(In reply to comment #19)
> Yes. We but dell is shipping a lot of servers with rhel4 and with the current
> bug that is being introduced this needs equal attention as well.
What version of the tg3 driver are you now giving to your customers to use on rhel4.4 and earlier systems that creates this problem?
Version 3.71
In response to comment no. 12: since this was not seen on the r4.4 kernel and is being seen in r4.5 (native drivers only), this is a regression. This problem can be recreated easily with Broadcom NICs.
In this RHEL4 issue, the server experiences a kernel panic if the link goes down. This is far more serious than what happens in RHEL5. Hence Dell is seeking an exception for this case and has requested us to initiate this for 4.5. Per communication with engineering, the fix to this problem is still under discussion, and as soon as we have a working fix we will provide it to Dell for testing. When such a fix is confirmed to resolve the problem, engineering will evaluate the best possible mechanism to release it (hotfix, kernel errata, etc.).
-------- Original Message --------
Subject: Bug 216895: BUG: bringing up balanced-alb mode bond network interface
Date: Thu, 22 Mar 2007 17:14:10 +0530
From: <Charles_Rose>
To: <sbenjamin>, <ltroan>, <jfeeney>, <rhentosh>
CC: <Shyam_Iyer>, <Jordan_Hargrave>
Sammy,
This issue is a blocker for Dell. We cannot have bonding broken in a way that can cause a kernel panic on a live OS. Request you to get this issue the right attention inside Red Hat. I agree that this is being worked on in the community, but we would like to see a fix in RHEL 4.5.
Thanks,
Charles Rose
Linux Engineering
Dell Product Group
Bangalore Development Center, INDIA
I'm not a big fan of this patch, but I would like you to test kernels here: http://people.redhat.com/agospoda/#rhel4
What I've done is back out the fix that resets any chips that have ASF enabled only when changing the MAC address. This is not the ideal solution, but it is one that is possible if we can't get everything squared away with the bonding driver in time. I don't like the idea of causing problems with chips that possibly have ASF enabled, but I'm not sure how many systems have this. Please verify this patch as soon as possible.
Since Shyam Iyer has been leading the charge on this recently, I am changing the needinfo to him. Please respond with the test results as quickly as possible. The deadline is approaching fast. Thank you.
Created attachment 151555 [details]
bonding driver workqueue based.
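The traces earlier in this report show tg3_set_mac_addr sleeping while it is called from bond_mii_monitor in timer softirq context. The general idea behind a workqueue conversion is to let the timer path only request the failover and have a context that is allowed to sleep carry it out. The following is a minimal user-space sketch of that idea using pthreads. It is not the attached driver; every name in it (failover_work, timer_tick, set_mac_addr_may_sleep, run_deferred_failover_demo) is hypothetical.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <pthread.h>
#include <stdbool.h>
#include <time.h>

/* Hypothetical one-slot work queue: the "timer" only posts a request,
 * and a worker thread, which is allowed to block, does the slow part. */
struct failover_work {
    pthread_mutex_t lock;
    pthread_cond_t  wake;
    bool pending;   /* a failover has been requested */
    bool stop;      /* the worker should exit once idle */
    int  handled;   /* number of failovers completed */
};

/* Stand-in for the timer callback: it flags the work and returns.
 * It never calls anything slow itself. */
static void timer_tick(struct failover_work *w, bool link_down)
{
    assert(w != NULL);
    if (!link_down)
        return;
    pthread_mutex_lock(&w->lock);
    w->pending = true;
    pthread_cond_signal(&w->wake);
    pthread_mutex_unlock(&w->lock);
}

/* Stand-in for a driver call that may sleep (here: 1 ms). */
static void set_mac_addr_may_sleep(void)
{
    struct timespec ts = { 0, 1000000L };
    nanosleep(&ts, NULL);
}

static void *worker(void *arg)
{
    struct failover_work *w = arg;

    pthread_mutex_lock(&w->lock);
    for (;;) {
        while (!w->pending && !w->stop)
            pthread_cond_wait(&w->wake, &w->lock);
        if (!w->pending)            /* stop requested, nothing queued */
            break;
        w->pending = false;
        pthread_mutex_unlock(&w->lock);
        set_mac_addr_may_sleep();   /* blocking is fine in this context */
        pthread_mutex_lock(&w->lock);
        w->handled++;
    }
    pthread_mutex_unlock(&w->lock);
    return NULL;
}

/* Sends one link-up tick and one link-down tick through the queue and
 * returns how many failovers the worker performed. */
static int run_deferred_failover_demo(void)
{
    struct failover_work w = { .pending = false, .stop = false, .handled = 0 };
    pthread_t tid;

    pthread_mutex_init(&w.lock, NULL);
    pthread_cond_init(&w.wake, NULL);
    if (pthread_create(&tid, NULL, worker, &w) != 0)
        return -1;

    timer_tick(&w, false);  /* link up: nothing is queued */
    timer_tick(&w, true);   /* link down: failover is queued */

    pthread_mutex_lock(&w.lock);
    w.stop = true;
    pthread_cond_signal(&w.wake);
    pthread_mutex_unlock(&w.lock);
    pthread_join(tid, NULL);

    pthread_cond_destroy(&w.wake);
    pthread_mutex_destroy(&w.lock);
    return w.handled;
}
```

Calling run_deferred_failover_demo() from a small main and building with -pthread is enough to try it. The sketch leaves out everything that makes the real conversion hard, in particular the interaction between the rtnl lock and bond->lock discussed in the comments that follow.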
Andy, I could not get a chance to test the tg3 driver modification. The attachment above is an update-agnostic bonding driver that uses a workqueue. Tomorrow I will update you with 2 things. 1) How the tg3 patch is working. 2) The exact patch for the bonding driver for rhel4.5.
Shyam, You can plan to ship this code if you like, but I can guarantee that you will see deadlocks from some customers. Taking the rtnl_lock after read-locking the bond-lock in mii_monitor will be problematic since there are other times (like when calling bond_set_multicast_list) where rtnl is already held when you enter a bonding function and you will wait to write-lock the bond-lock. I'm also not sure you can use _bh locks whenever doing a failover (in the case of miimon at least) because you cannot sleep in softirq context (the current problem that we have with the tg3 driver panicking the box). I have heard that other distros are shipping a workqueue conversion patch much like this (but without the extra rtnl-locks in the monitoring functions) and reports are already coming in that their users are seeing deadlocks, so we need to be careful. -andy
>I'm also not sure you can use _bh locks whenever doing a failover (in the case
>of miimon at least) because you cannot sleep in softirq context (the current
>problem that we have with the tg3 driver panicking the box).
Well, I can safely remove the bh write locks during failover. I have confirmed
that from my testing. It was a last minute inclusion because of a lockup from
the read_lock(&bond->curr_slave_lock).
So does it now boil down to one problem? (rtnl_lock and multicast_list)
I will also try reproducing this issue. Did you get a chance to get a trace of
rtnl_lock and the bond_set_multicast_list lockup?
Andy,
Looking back at the code, I see what you are saying about the lockup. I
made a subtle change that avoids the deadlock.
->bond_mii_mon calls read_lock(&bond->lock).
->bond_set_multicast is called with rtnl_lock held and tries a write lock -
fails.
->bond_mii_mon calls rtnl_lock for failover() - fails.
Deadlock.
Now, my solution.
In mii_mon.
1) Call rtnl_lock first.
2) Then call read_lock.
Only if bond_set_multicast_list gets rtnl_lock will it get the write_lock(&bond->lock).
At most, rtnl_lock assertions will sometimes fail, which is better than
deadlocks.
Something like this:
diff -Naru bonding-3.2.1dell/bond_main.c bonding-3.2.2dell/bond_main.c
--- bonding-3.2.1dell/bond_main.c 2007-04-03 16:03:06.000000000 +0530
+++ bonding-3.2.2dell/bond_main.c 2007-04-04 11:59:34.972060432 +0530
@@ -2572,7 +2572,11 @@
int do_failover = 0;
int delta_in_ticks;
int i;
-
+
+ /* Grab the rtnl lock here so that if you have the read_lock(&bond->lock)
+ * and anyone else like the bond_set_multicast don't deadlock while
+ * holding the rtnl_lock and trying to get the write_lock(&bond->lock) */
+ rtnl_lock();
read_lock(&bond->lock);
delta_in_ticks = (bond->params.miimon * HZ) / 1000;
@@ -2591,10 +2595,9 @@
* program could monitor the link itself if needed.
*/
- /* balance-rr lockup without bh lock. The bond_xmit_roundrobin fails to get a write lock
- * if read lock is held and bond_mii_mon gets scheduled. Remember we are in process
- * context
- */
+ /* balance-rr locksup without this bh lock. The bond_xmit_roundrobin fails to get
+ * a write lock if read lock is held and bond_mii_mon gets scheduled. Remember we
+ * are in process context */
read_lock_bh(&bond->curr_slave_lock);
oldcurrent = bond->curr_active_slave;
read_unlock_bh(&bond->curr_slave_lock);
@@ -2788,12 +2791,9 @@
} /* end of for */
if (do_failover) {
- rtnl_lock();
- write_lock_bh(&bond->curr_slave_lock); /* balance-rr lockup problem */
+ write_lock(&bond->curr_slave_lock); /* Bh locks not necessary here*/
bond_select_active_slave(bond);
- write_unlock_bh(&bond->curr_slave_lock);
- rtnl_unlock();
-
+ write_unlock(&bond->curr_slave_lock);
if (oldcurrent && !bond->curr_active_slave) {
printk(KERN_INFO DRV_NAME
": %s: now running without any active "
@@ -2812,6 +2812,7 @@
}
out:
read_unlock(&bond->lock);
+ rtnl_unlock();
}
static void bond_arp_send_all(struct bonding *bond, struct slave *slave)
diff -Naru bonding-3.2.1dell/dkms.conf bonding-3.2.2dell/dkms.conf
--- bonding-3.2.1dell/dkms.conf 2007-04-03 16:16:08.000000000 +0530
+++ bonding-3.2.2dell/dkms.conf 2007-04-04 07:59:09.000000000 +0530
@@ -1,4 +1,4 @@
-PACKAGE_VERSION="3.2.1dell"
+PACKAGE_VERSION="3.2.2dell"
# Items below here should not have to change with each driver version
PACKAGE_NAME="bonding"
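The deadlock described in this comment is a lock-order inversion: the monitor holds bond->lock and then wants rtnl, while the multicast path holds rtnl and then wants bond->lock. As an illustration only, the toy checker below records "held, then wanted" pairs and reports when two paths disagree. It is a hypothetical sketch written for this explanation, it is not kernel code, and it is far simpler than any real lock validator.

```c
#include <assert.h>
#include <stdbool.h>

enum lock_id { RTNL, BOND_LOCK, NUM_LOCKS };

/* before[a][b] becomes true once some path takes lock b while holding a. */
struct lock_order {
    bool before[NUM_LOCKS][NUM_LOCKS];
};

/* Record that a path holds `held` and then asks for `wanted`.
 * Returns false if the opposite order was already recorded, which is
 * the classic AB/BA pattern that can deadlock. */
static bool record_order(struct lock_order *o,
                         enum lock_id held, enum lock_id wanted)
{
    o->before[held][wanted] = true;
    return !o->before[wanted][held];
}

/* Ordering before the change: the monitor takes bond->lock and then
 * rtnl for failover; the multicast path enters with rtnl held and then
 * asks for bond->lock. */
static bool original_scheme_is_consistent(void)
{
    struct lock_order o = {0};
    bool ok = true;

    ok = record_order(&o, BOND_LOCK, RTNL) && ok;  /* bond_mii_monitor */
    ok = record_order(&o, RTNL, BOND_LOCK) && ok;  /* bond_set_multicast_list */
    return ok;
}

/* Ordering after the change: both paths take rtnl first. */
static bool reordered_scheme_is_consistent(void)
{
    struct lock_order o = {0};
    bool ok = true;

    ok = record_order(&o, RTNL, BOND_LOCK) && ok;  /* bond_mii_monitor */
    ok = record_order(&o, RTNL, BOND_LOCK) && ok;  /* bond_set_multicast_list */
    return ok;
}
```

The checker only says whether the two paths agree on an order. It says nothing about the cost of holding rtnl for every monitoring pass.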
Shyam, Taking the rtnl_lock during monitoring is WAAAAAAY too expensive. It's only needed when you are actually doing a failover. You are correct, though, that the rtnl lock needs to be taken before bond->lock, so the proposal currently is to do something like this:
if failover is needed
    drop bond read-lock
    take rtnl-lock
    take bond read/write-lock
    do necessary work (maybe even slave-lock)
    drop rtnl
    drop bond-lock possibly later
Created attachment 151687 [details]
proposed patch
This is a rhel4 backport of the following proposed upstream patch.
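The failover sequence proposed in the comment before this attachment (drop the bond read lock, take rtnl, retake the bond lock, do the work, drop rtnl, then drop the bond lock) can be sketched in user space with pthreads. This is only an illustration of the ordering under stated assumptions: it is not the attached patch, struct bond and mii_monitor_pass are hypothetical stand-ins, and the re-check of the link state after retaking the lock is an addition of this sketch rather than something stated in the thread.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <pthread.h>
#include <stdbool.h>

/* Hypothetical stand-in for the bonding state and its two locks. */
struct bond {
    pthread_mutex_t  rtnl;      /* plays the role of the rtnl lock */
    pthread_rwlock_t lock;      /* plays the role of bond->lock */
    bool active_link_up;
    int  failovers;
};

/* One monitoring pass. Returns true if a failover was performed. */
static bool mii_monitor_pass(struct bond *b)
{
    bool need_failover;
    bool done = false;

    pthread_rwlock_rdlock(&b->lock);
    need_failover = !b->active_link_up;
    pthread_rwlock_unlock(&b->lock);    /* drop bond read-lock */

    if (!need_failover)
        return false;                   /* common case: rtnl is never taken */

    pthread_mutex_lock(&b->rtnl);       /* take rtnl first ... */
    pthread_rwlock_wrlock(&b->lock);    /* ... then the bond lock */

    /* Assumption of this sketch: state may have changed while no lock
     * was held, so look again before acting. */
    if (!b->active_link_up) {
        b->active_link_up = true;       /* pretend a healthy slave took over */
        b->failovers++;
        done = true;
    }

    pthread_mutex_unlock(&b->rtnl);     /* drop rtnl */
    pthread_rwlock_unlock(&b->lock);    /* drop bond lock afterwards */
    return done;
}

/* Runs three passes around a simulated link loss and returns the
 * number of failovers that happened. */
static int failover_demo(void)
{
    struct bond b = { .active_link_up = true, .failovers = 0 };
    bool first, second, third;

    pthread_mutex_init(&b.rtnl, NULL);
    pthread_rwlock_init(&b.lock, NULL);

    first = mii_monitor_pass(&b);       /* link up: no failover */
    b.active_link_up = false;           /* simulate link loss */
    second = mii_monitor_pass(&b);      /* failover happens here */
    third = mii_monitor_pass(&b);       /* already repaired */
    assert(!first && second && !third);

    pthread_rwlock_destroy(&b.lock);
    pthread_mutex_destroy(&b.rtnl);
    return b.failovers;
}
```

The point of the shape is that the expensive global lock is touched only on the rare failover path, and that it is always acquired before the bond lock, which matches the ordering other callers already use.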
This request was evaluated by Red Hat Product Management for inclusion in a Red Hat Enterprise Linux maintenance release. Product Management has requested further review of this request by Red Hat Engineering, for potential inclusion in a Red Hat Enterprise Linux Update release for currently deployed products. This request is not yet committed for inclusion in an Update release.
QE ack for 4.5. Dell is really hot on this bug and since the patch came through we can get Dell to do the majority of the validation.
Test kernels available here: http://people.redhat.com/agospoda/#rhel4
Andy, The test kernel caused a kernel panic. Please see the trace below.
Setting network parameters: [ OK ]
Bringing up loopback interface: [ OK ]
Bringing up interface bond0: Unable to handle kernel NULL pointer dereference at 00000000000002e8 RIP:
<ffffffff802f965b>{netif_receive_skb+25}
PML4 12a401067 PGD 12aac1067 PMD 0
Oops: 0000 [1]
CPU 0
Modules linked in: ds yenta_socket pcmcia_core joydev button battery ac uhci_hcd ehci_hcd hw_random tg3 e100 mii bonding(U) floppy ata_piix libata sg dm_snapshot dm_zero dm_mirror ext3 jbd dm_mod megaraid_mbox megaraid_mm sd_mod scsi_mod
Pid: 0, comm: swapper Not tainted 2.6.9-52.EL.gtest.15
RIP: 0010:[<ffffffff802f965b>] <ffffffff802f965b>{netif_receive_skb+25}
RSP: 0018:ffffffff804b24b8 EFLAGS: 00010296
RAX: 0000000000008e88 RBX: 000001012a5b7500 RCX: 0000000000008e88
RDX: 0000000000000000 RSI: 000001012e4f50a8 RDI: 000001012a5b7500
RBP: 000001012e4f5440 R08: 000001012abd0c80 R09: 000001012fd39870
R10: 000001012a5b7500 R11: 00000000000000e4 R12: 0000000000000001
R13: 0000000000000000 R14: 000000000000003c R15: 000001012abe0000
FS: 0000000000000000(0000) GS:ffffffff8054f200(0000) knlGS:0000000000000000
CS: 0010 DS: 0018 ES: 0018 CR0: 000000008005003b
CR2: 00000000000002e8 CR3: 0000000000101000 CR4: 00000000000006e0
Process swapper (pid: 0, threadinfo ffffffff80552000, task ffffffff8042af80)
Stack: 0000000000000220 000001012a5b7500 000001012e4f5440 000001012a5b7500 000001012e4f5440 000001012a5b7500 0000000000000000 ffffffffa0111ae1 0000000000000002 000001012e4f5534
Call Trace:<IRQ>
<ffffffffa0111ae1>{:tg3:tg3_poll+2448}
<ffffffff802f9c9c>{net_rx_action+305}
<ffffffff80140494>{__do_softirq+76}
<ffffffff8014051b>{do_softirq+49}
<ffffffff80113e5b>{do_IRQ+664}
<ffffffff8011105b>{ret_from_intr+0}
<EOI>
<ffffffff8010e86d>{mwait_idle+60}
<ffffffff8010e817>{cpu_idle+26}
<ffffffff805556f8>{start_kernel+637}
<ffffffff805551ab>{_sinittext+427}
Code: 48 83 ba e8 02 00 00 00 0f 84 53 01 00 00 31 c0 f6 82 8c 00
RIP <ffffffff802f965b>{netif_receive_skb+25} RSP <ffffffff804b24b8>
CR2: 00000000000002e8
<0>Kernel panic - not syncing: Oops
If I use the patch that michael chan has sent over the rhel4.5 kernels then the issue is not reproducible. Is this a build problem?
(In reply to comment #41)
>
> Modules linked in: ds yenta_socket pcmcia_core joydev button battery ac
> uhci_hcd ehci_hcd hw_random tg3 e100 mii bonding(U) floppy ata_piix libata sg
> dm_snapshot dm_zero dm_mirror ext3 jbd dm_mod megaraid_mbox megaraid_mm sd_mod
>
Shyam, Do you have one of your dkms bonding modules currently installed? It would seem to me that is why 'bonding(U)' appears in the list instead of 'bonding.' Can you remove that and try again?
Well no. The dkms drivers were loaded the first time it booted but then I removed them by booting to single user mode. This was the first doubt that I had and I made sure I had removed all dkms rpms. I guess since the .ko files were removed and placed again with the original .ko by uninstalling the dkms rpm module the flag might come as U. I am attaching the modinfo for the bonding driver and the tg3 driver. Please check the versions. I have booted the system about 4 times with the same version of bonding and tg3 driver.
Created attachment 151755 [details]
bonding module's modinfo output
Created attachment 151756 [details]
tg3 module's modinfo output
My latest test kernel also contains this patch: http://people.redhat.com/agospoda/rhel4/gtest/tg3-update-3_73.patch
Which has a backport of a later tg3 driver included. The problem with this patch is right here:
@@ -3116,7 +3175,6 @@ static int tg3_alloc_rx_skb(struct tg3 *
if (skb == NULL)
return -ENOMEM;
- skb->dev = tp->dev;
skb_reserve(skb, tp->rx_offset);
mapping = pci_map_single(tp->pdev, skb->data,
In newer upstream kernels netdev_alloc_skb() sets skb->dev = tp->dev, but in RHEL4's tg3_compat.h we've just got netdev_alloc_skb looking like this:
#define netdev_alloc_skb(dev, len) dev_alloc_skb(len)
I'll make a change and put out some new test kernels ASAP.
2.6.9-53 will show up soon here: http://people.redhat.com/jbaron/rhel4/RPMS.kernel/
Let me know if you don't see it later today.
Sorry I wasn't clear on this -- it will contain the suggested tg3 patch without any of my experimental updates. Dell, please test this ASAP and report results here.
committed in stream U5 build 53. A test kernel with this patch is available from http://people.redhat.com/~jbaron/rhel4/
Fix verified. Results are positive. Thanks for the speedy turnaround.
A fix for this issue has been included in RHEL4.5. Please test the Release Candidate of RHEL4.5, which was released today to Partners, and let us know if the problem is resolved. The Release Candidate can be downloaded from here: ftp://partners.redhat.com/af38ac4316ba20df2dec5f990913396d
Internal Status set to 'Waiting on Customer'
Status set to: Waiting on Client
Resolution set to: 'RHEL 4.5'
This event sent from IssueTracker by gcase issue 108058
An advisory has been issued which should help the problem described in this bug report. This report is therefore being closed with a resolution of ERRATA. For more information on the solution and/or where to find the updated files, please follow the link below. You may reopen this bug report if the solution does not work for you. http://rhn.redhat.com/errata/RHBA-2007-0304.html
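For readers following the netdev_alloc_skb discussion earlier in this thread: once the backported driver stops assigning skb->dev itself and the compat macro discards its dev argument, nothing sets skb->dev, which would explain the NULL pointer dereference in netif_receive_skb reported against the test kernel. The mock below reproduces that effect in user space. The structures are hypothetical stand-ins for the kernel's, and netdev_alloc_skb_fixed is only one possible shape of a fix; the thread does not say what change was actually made.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdlib.h>

/* Minimal mock types; the real kernel structures are far larger. */
struct net_device { int ifindex; };
struct sk_buff {
    struct net_device *dev;
    unsigned int size;      /* requested buffer size (mock only) */
};

/* Mock allocator: like the real dev_alloc_skb, it knows nothing about
 * the device, so skb->dev starts out NULL. */
static struct sk_buff *dev_alloc_skb(unsigned int len)
{
    struct sk_buff *skb = calloc(1, sizeof(*skb));

    if (skb)
        skb->size = len;
    return skb;
}

/* Compat macro in the shape quoted in the thread: `dev` is dropped. */
#define netdev_alloc_skb_compat(dev, len) dev_alloc_skb(len)

/* Hypothetical replacement that keeps the upstream behaviour of
 * recording the device in the buffer. */
static struct sk_buff *netdev_alloc_skb_fixed(struct net_device *dev,
                                              unsigned int len)
{
    struct sk_buff *skb = dev_alloc_skb(len);

    if (skb)
        skb->dev = dev;
    return skb;
}

/* True if a buffer from the compat macro has its dev pointer set. */
static bool compat_macro_sets_dev(void)
{
    struct net_device nd = { 1 };
    struct sk_buff *skb = netdev_alloc_skb_compat(&nd, 64);
    bool set = skb && skb->dev == &nd;

    free(skb);
    return set;
}

/* True if a buffer from the replacement helper has its dev pointer set. */
static bool fixed_helper_sets_dev(void)
{
    struct net_device nd = { 1 };
    struct sk_buff *skb = netdev_alloc_skb_fixed(&nd, 64);
    bool set = skb && skb->dev == &nd;

    free(skb);
    return set;
}
```

A function-like helper is used for the replacement because a macro that expands to dev_alloc_skb(len) has no place to perform the assignment.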