Note: This bug is displayed in read-only format because the product is no longer active in Red Hat Bugzilla.

For bugs related to Red Hat Enterprise Linux 5 product line. The current stable release is 5.10. For Red Hat Enterprise Linux 6 and above, please visit Red Hat JIRA https://issues.redhat.com/secure/CreateIssue!default.jspa?pid=12332745 to report new issues.

Bug 558809

Summary:

e1000 & e1000e: Memory corruption/paging error when tx hang occurs

Product:

Red Hat Enterprise Linux 5

Reporter:

Prarit Bhargava <prarit>

Component:

kernel

Assignee:

Neil Horman <nhorman>

Status:

CLOSED ERRATA

QA Contact:

Red Hat Kernel QE team <kernel-qe>

Severity:

medium

Docs Contact:

Priority:

high

Version:

5.4

CC:

agospoda, bruce.w.allan, dhoward, emcnabb, ikent, jburke, jesse.brandeburg, jlayton, john.ronciak, jpirko, nhorman, rwheeler, sardella, steved

Target Milestone:

Keywords:

ZStream

Target Release:

---

Hardware:

All

OS:

Linux

Whiteboard:

Fixed In Version:

Doc Type:

Bug Fix

Doc Text:

Story Points:

---

Clone Of:

Clones:

572350 (view as bug list)

Environment:

Last Closed:

2010-03-30 07:23:15 UTC

Type:

---

Regression:

---

Mount Type:

---

Documentation:

---

CRM:

Verified Versions:

Category:

---

oVirt Team:

---

RHEL 7.3 requirements from Atomic Host:

Cloudforms Team:

---

Target Upstream Version:

Embargoed:

Bug Depends On:

Bug Blocks:

569797, 572350

Attachments:

Description	Flags
make sure rx unit is stopped before reset	none
e1000e dump code	none
panic with Jesse's patches	none
tests.init file configuration	none
dmidecode output from piketon-02	none
untested first draft of my slab overrun trapper	none
new version of trap patch	none
cleaner version of slab trap	none
enhancement to the slab trp patch	none
patch to write protect all data pages for all skbs	none
client side nfs capture w/ tcpdump	none
interesting-skb-data-alignment.patch	none
debug patch 2 -- add routine to check redzone integrity in slabcache	none
debug patch 3 -- more comprehensive patch -- scan and dump all corrupt objects	none
my current debugging patch	none
todays update	none
patch to e1000e for debug	none
2nd debug patch with proposed workaround	none
e1000e-nfs-packet-split-disable2.patch	none
updated e1000e patch	none
alternate code location for change	none

Description Prarit Bhargava 2010-01-26 13:51:02 UTC

Description of problem: A memory corruption or a paging error occurs during the NFS connectathon test is run


Version-Release number of selected component (if applicable): 2.6.18-185.el5


How reproducible: 100%


Steps to Reproduce:
1. Install the NFS connectathon testsuite
2. Edit the testsuite to only run against the Solaris 10 NFS server
3. Run the test suite
  d 
[NOTE: on the intel-piketon-02.lab.bos.redhat.com system, this issue is reproducible within _minutes_ on 185.el5 and self built 185.el5 kernels]

Actual results:

The transmitter hangs,

SELinux: initialized (dev 0:1b, type nfs), uses genfs_contexts
eth0: Detected Tx Unit Hang:
  TDH                  <d1>
  TDT                  <d2>
  next_to_use          <d2>
  next_to_clean        <d0>
buffer_info[next_to_clean]:
  time_stamp           <fffe3619>
  next_to_watch        <d1>
  jiffies              <fffe41ae>
  next_to_watch.status <0>
nfs: server sol10-nfs not responding, still trying
eth0: Detected Tx Unit Hang:
  TDH                  <d1>
  TDT                  <d2>
  next_to_use          <d2>
  next_to_clean        <d0>
buffer_info[next_to_clean]:
  time_stamp           <fffe3619>
  next_to_watch        <d1>
  jiffies              <fffe4959>
  next_to_watch.status <0>
eth0: Detected Tx Unit Hang:
  TDH                  <d1>
  TDT                  <d2>
  next_to_use          <d2>
  next_to_clean        <d0>
buffer_info[next_to_clean]:
  time_stamp           <fffe3619>
  next_to_watch        <d1>
  jiffies              <fffe5129>
  next_to_watch.status <0>
eth0: Detected Tx Unit Hang:
  TDH                  <d1>
  TDT                  <d2>
  next_to_use          <d2>
  next_to_clean        <d0>
buffer_info[next_to_clean]:
  time_stamp           <fffe3619>
  next_to_watch        <d1>
  jiffies              <fffe58fa>
  next_to_watch.status <0>
NETDEV WATCHDOG: eth0: transmit timed out

and the card is reset.  During the down, we hit this panic:

NETDEV WATCHDOG: eth0: transmit timed out
general protection fault: 0000 [1] SMP 
last sysfs file: /devices/pci0000:00/0000:00:00.0/resource
CPU 0 
Modules linked in: autofs4 hidp nfs fscache nfs_acl rfcomm l2cap bluetooth lockd sunrpc ip_conntrack_netbios_ns ipt_REJECT xt_state ip_conntrack nfnetlink iptable_filter ip_tables ip6t_REJECT xt_tcpudp ip6table_filter ip6_tables x_tables ipv6 xfrm_nalgo crypto_api cpufreq_ondemand acpi_cpufreq freq_table dm_multipath scsi_dh video backlight sbs power_meter hwmon i2c_ec dell_wmi wmi button battery asus_acpi acpi_memhotplug ac lp joydev snd_hda_intel snd_seq_dummy snd_seq_oss snd_seq_midi_event snd_seq snd_seq_device snd_pcm_oss snd_mixer_oss snd_pcm snd_timer snd_page_alloc sr_mod snd_hwdep shpchp cdrom i2c_i801 e1000e parport_serial parport_pc parport sg snd i2c_core pcspkr soundcore dm_raid45 dm_message dm_region_hash dm_mem_cache dm_snapshot dm_zero dm_mirror dm_log dm_mod ahci libata sd_mod scsi_mod ext3 jbd uhci_hcd ohci_hcd ehci_hcd
Pid: 14, comm: events/0 Not tainted 2.6.18-185.el5debug #1
RIP: 0010:[<ffffffff8000d257>]  [<ffffffff8000d257>] put_page+0x0/0x2e
RSP: 0018:ffff810137899db8  EFLAGS: 00010246
RAX: 0000000000000000 RBX: ffff8101361514b0 RCX: 0000000000000002
RDX: ffff810129782848 RSI: 0000000129783000 RDI: 000880484fdb0b00
RBP: 0000000000000001 R08: ffff810129784000 R09: 0000000000000000
R10: ffff810137a270b8 R11: 00000000000000f8 R12: ffff810134156e88
R13: ffff810137a270b8 R14: ffff810136758680 R15: 0000000000000000
FS:  0000000000000000(0000) GS:ffffffff8043e000(0000) knlGS:0000000000000000
CS:  0010 DS: 0018 ES: 0018 CR0: 000000008005003b
CR2: 00002aac08939000 CR3: 0000000120e33000 CR4: 00000000000006e0
Process events/0 (pid: 14, threadinfo ffff810137898000, task ffff810137896440)
Stack:  ffffffff80239dc4 ffff810132b95560 ffff8101361514b0 ffffc2000011a078
 ffffffff8002a2e7 ffff8101338fb668 ffffffff8821c52e 0000000000000003
 ffff810136758680 ffff810136758718 ffff8101076ba188 0000000000000282
Call Trace:
 [<ffffffff80239dc4>] skb_release_data+0x5f/0x99
 [<ffffffff8002a2e7>] __kfree_skb+0x11/0x1a
 [<ffffffff8821c52e>] :e1000e:e1000_clean_rx_ring+0xde/0x1b7
 [<ffffffff8821e566>] :e1000e:e1000_reset_task+0x0/0xc
 [<ffffffff8821c96e>] :e1000e:e1000e_reinit_locked+0x3d/0x50
 [<ffffffff80050582>] run_workqueue+0x9a/0xf4
 [<ffffffff8004cd33>] worker_thread+0x0/0x122
 [<ffffffff8004ce23>] worker_thread+0xf0/0x122
 [<ffffffff80091598>] default_wake_function+0x0/0xe
 [<ffffffff80034d73>] kthread+0xfe/0x132
 [<ffffffff80067fea>] trace_hardirqs_on_thunk+0x35/0x37
 [<ffffffff80061079>] child_rip+0xa/0x11
 [<ffffffff800688ed>] _spin_unlock_irq+0x24/0x27
 [<ffffffff800606a8>] restore_args+0x0/0x30
 [<ffffffff80034c75>] kthread+0x0/0x132
 [<ffffffff8006106f>] child_rip+0x0/0x11


Code: 8b 07 f6 c4 40 74 05 e9 15 34 02 00 8b 47 08 85 c0 75 0a 0f 
RIP  [<ffffffff8000d257>] put_page+0x0/0x2e
 RSP <ffff810137899db8>
 <0>Kernel panic - not syncing: Fatal exception

Expected results: No panic

Additional info:  We've seen different versions of this panic.  It *usually* is preceded by a TX hang, however, in some cases that doesn't appear to have happened.

The key point is, however, that the panics appear to be the same -- the e1000e device is being reset and during the cleanup of the rx buffers a panic occurs due to the slab debug or an invalid paging request.

This has been recently reported upstream against the e1000 (NOT _e_) in the 2.6.33-rc5 kernel here:

http://marc.info/?l=linux-netdev&m=126426149306083&w=4

The panic is *very* similar to that of the e1000e.

Of interest in this panic is

a) The card is being reset and the rx buffers are being cleaned:

[<c1276970>] ? e1000_clean_rx_ring+0x80/0x150
 [<c127c743>] ? e1000_down+0x1b3/0x1d0
 [<c127cf60>] ? e1000_reset_task+0x0/0x10
 [<c127cd3b>] ? e1000_reinit_locked+0x4b/0x70

and b) this is reported against a UP kernel -- so the possibility of a concurrency issue is not possible.

P.

Comment 1 Prarit Bhargava 2010-01-26 13:52:41 UTC

An example in which the slab corruption detector kicked in:

<transmit hang info snipped>

NETDEV WATCHDOG: eth0: transmit timed out
slab error in verify_redzone_free(): cache `size-2048': memory outside object was overwritten

Call Trace:
 [<ffffffff80033633>] cache_free_debugcheck+0x106/0x217
 [<ffffffff8000b94c>] kfree+0xcc/0x25d
 [<ffffffff8002a2e7>] __kfree_skb+0x11/0x1a
 [<ffffffff881f652e>] :e1000e:e1000_clean_rx_ring+0xde/0x1b7
 [<ffffffff881f8566>] :e1000e:e1000_reset_task+0x0/0xc
 [<ffffffff881f696e>] :e1000e:e1000e_reinit_locked+0x3d/0x50
 [<ffffffff80050582>] run_workqueue+0x9a/0xf4
 [<ffffffff8004cd33>] worker_thread+0x0/0x122
 [<ffffffff8004ce23>] worker_thread+0xf0/0x122
 [<ffffffff80090ab8>] default_wake_function+0x0/0xe
 [<ffffffff80034d73>] kthread+0xfe/0x132
 [<ffffffff80067fea>] trace_hardirqs_on_thunk+0x35/0x37
 [<ffffffff80061079>] child_rip+0xa/0x11
 [<ffffffff800688ed>] _spin_unlock_irq+0x24/0x27
 [<ffffffff800606a8>] restore_args+0x0/0x30
 [<ffffffff80034c75>] kthread+0x0/0x132
 [<ffffffff8006106f>] child_rip+0x0/0x11

ffff81012fff7530: redzone 1:0x170f208e, redzone 2:0x170fc2a5.
Unable to handle kernel NULL pointer dereference at 0000000000000000 RIP: 
 [<ffffffff8000d257>] put_page+0x0/0x2e
PGD 121130067 PUD 121193067 PMD 0 
Oops: 0000 [1] SMP 
last sysfs file: /devices/pci0000:00/0000:00:19.0/irq
CPU 0 
Modules linked in: autofs4(U) hidp(U) nfs(U) fscache(U) nfs_acl(U) rfcomm(U) l2cap(U) bluetooth(U) lockd(U) sunrpc(U) ip_conntrack_netbios_ns(U) ipt_REJECT(U) xt_state(U) ip_conntrack(U) nfnetlink(U) iptable_filter(U) ip_tables(U) ip6t_REJECT(U) xt_tcpudp(U) ip6table_filter(U) ip6_tables(U) x_tables(U) ipv6(U) xfrm_nalgo(U) crypto_api(U) cpufreq_ondemand(U) acpi_cpufreq(U) freq_table(U) dm_multipath(U) scsi_dh(U) video(U) backlight(U) sbs(U) power_meter(U) hwmon(U) i2c_ec(U) dell_wmi(U) wmi(U) button(U) battery(U) asus_acpi(U) acpi_memhotplug(U) ac(U) lp(U) snd_hda_intel(U) snd_seq_dummy(U) snd_seq_oss(U) snd_seq_midi_event(U) snd_seq(U) joydev(U) snd_seq_device(U) snd_pcm_oss(U) snd_mixer_oss(U) snd_pcm(U) snd_timer(U) sr_mod(U) snd_page_alloc(U) cdrom(U) parport_serial(U) snd_hwdep(U) parport_pc(U) i2c_i801(U) snd(U) e1000e(U) parport(U) shpchp(U) i2c_core(U) sg(U) soundcore(U) pcspkr(U) dm_raid45(U) dm_message(U) dm_region_hash(U) dm_mem_cache(U) dm_snapshot(U) dm_zero(U) dm_mirror(U) dm_log(U) dm_mod(U) ahci(U) libata(U) sd_mod(U) scsi_mod(U) ext3(U) jbd(U) uhci_hcd(U) ohci_hcd(U) ehci_hcd(U)
Pid: 14, comm: events/0 Tainted: G      2.6.18.4 #8
RIP: 0010:[<ffffffff8000d257>]  [<ffffffff8000d257>] put_page+0x0/0x2e
RSP: 0018:ffff810137899db8  EFLAGS: 00010246
RAX: 0000000000000000 RBX: ffff810115e32a00 RCX: 0000000000000002
RDX: ffff8101317198f8 RSI: 000000013171a000 RDI: 0000000000000000
RBP: 0000000000000001 R08: ffff81013171b000 R09: 0000000000000000
R10: ffff810137a0f0b8 R11: 00000000000000f8 R12: ffff810131cc8988
R13: ffff810137a0f0b8 R14: ffff810133f7c680 R15: 0000000000000000
FS:  0000000000000000(0000) GS:ffffffff8043c000(0000) knlGS:0000000000000000
CS:  0010 DS: 0018 ES: 0018 CR0: 000000008005003b
CR2: 0000000000000000 CR3: 000000012113e000 CR4: 00000000000006e0
Process events/0 (pid: 14, threadinfo ffff810137898000, task ffff810137896440)
Process events/0 (pid: 14, threadinfo ffff810137898000, task ffff810137896440)
Stack:  ffffffff802392d5 ffff8101310fa458 ffff810115e32a00 ffffc2000011a938
 ffffffff8002a2e7 ffff8101310fafb0 ffffffff881f652e 000000000000003b
 ffff810133f7c680 ffff810133f7c718 ffff8101375a4930 0000000000000282
Call Trace:
 [<ffffffff802392d5>] skb_release_data+0x5f/0x99
 [<ffffffff8002a2e7>] __kfree_skb+0x11/0x1a
 [<ffffffff881f652e>] :e1000e:e1000_clean_rx_ring+0xde/0x1b7

<snip>

P.

Comment 2 Prarit Bhargava 2010-01-26 13:54:26 UTC

Boot with "time" argument and Paging request panic:

[ 1286.693799]   time_stamp           <1000e2218>
[ 1286.693800]   next_to_watch        <67>
[ 1286.693802]   jiffies              <1000e3f49>
[ 1286.693803]   next_to_watch.status <0>
[ 1287.291723] NETDEV WATCHDOG: eth0: transmit timed out
[ 1287.378043] Unable to handle kernel paging request at ffffffffffffd786 RIP: 
[ 1287.382793]  [<ffffffff8023925f>] skb_drop_list+0xb/0x22
[ 1287.390783] PGD 203067 PUD 1075d8067 PMD 0 
[ 1287.395119] Oops: 0000 [1] SMP 
[ 1287.398373] last sysfs file: /devices/pci0000:00/0000:00:00.0/irq
[ 1287.404641] CPU 0 
[ 1287.406720] Modules linked in: autofs4(U) hidp(U) nfs(U) fscache(U) nfs_acl(U) rfcomm(U) l2cap(U) bluetooth(U) lockd(U) sunrpc(U) ip_conntrack_netbios_ns(U) ipt_REJECT(U) xt_state(U) ip_conntrack(U) nfnetlink(U) iptable_filter(U) ip_tables(U) ip6t_REJECT(U) xt_tcpudp(U) ip6table_filter(U) ip6_tables(U) x_tables(U) ipv6(U) xfrm_nalgo(U) crypto_api(U) cpufreq_ondemand(U) acpi_cpufreq(U) freq_table(U) dm_multipath(U) scsi_dh(U) video(U) backlight(U) sbs(U) power_meter(U) hwmon(U) i2c_ec(U) dell_wmi(U) wmi(U) button(U) battery(U) asus_acpi(U) acpi_memhotplug(U) ac(U) lp(U) joydev(U) snd_hda_intel(U) snd_seq_dummy(U) snd_seq_oss(U) snd_seq_midi_event(U) snd_seq(U) snd_seq_device(U) snd_pcm_oss(U) snd_mixer_oss(U) snd_pcm(U) shpchp(U) snd_timer(U) snd_page_alloc(U) parport_serial(U) sr_mod(U) snd_hwdep(U) i2c_i801(U) i2c_core(U) e1000e(U) parport_pc(U) parport(U) cdrom(U) snd(U) soundcore(U) sg(U) pcspkr(U) dm_raid45(U) dm_message(U) dm_region_hash(U) dm_mem_cache(U) dm_snapshot(U) dm_zero(U) dm_mirror(U) dm_log(U) dm_mod(U) ahci(U) libata(U) sd_mod(U) scsi_mod(U) ext3(U) jbd(U) uhci_hcd(U) ohci_hcd(U) ehci_hcd(U)
[ 1287.508755] Pid: 14, comm: events/0 Tainted: G      2.6.18.4 #8
[ 1287.514843] RIP: 0010:[<ffffffff8023925f>]  [<ffffffff8023925f>] skb_drop_list+0xb/0x22
[ 1287.523087] RSP: 0018:ffff810137899db0  EFLAGS: 00010282
[ 1287.528550] RAX: ffffffffffffd786 RBX: ffff810107788710 RCX: 0000000000000002
[ 1287.535887] RDX: ffff8101332ce500 RSI: 0000000115caf000 RDI: ffff810115caeff0
[ 1287.543225] RBP: 0000000000000000 R08: ffff810115cb0000 R09: 0000000000000000
[ 1287.550560] R10: ffff810137a270b8 R11: 00000000000000f8 R12: ffff8101340f3d58
[ 1287.557899] R13: ffff810137a270b8 R14: ffff81013674c680 R15: 0000000000000000
[ 1287.565236] FS:  0000000000000000(0000) GS:ffffffff8043c000(0000) knlGS:0000000000000000
[ 1287.573555] CS:  0010 DS: 0018 ES: 0018 CR0: 000000008005003b
[ 1287.579465] CR2: ffffffffffffd786 CR3: 00000001236ac000 CR4: 00000000000006e0
[ 1287.586804] Process events/0 (pid: 14, threadinfo ffff810137898000, task ffff810137896440)
[ 1287.595302] Stack:  ffff810107788710 ffffffff802392fb ffff8101332ce5b8 ffff810107788710
[ 1287.603597]  ffffc2000011a190 ffffffff8002a2e7 ffff8101332ce140 ffffffff8821052e
[ 1287.611249]  000000000000000a ffff81013674c680 ffff81013674c718 ffff8101375a4930
[ 1287.618717] Call Trace:
[ 1287.621424]  [<ffffffff802392fb>] skb_release_data+0x85/0x99
[ 1287.627250]  [<ffffffff8002a2e7>] __kfree_skb+0x11/0x1a
[ 1287.632640]  [<ffffffff8821052e>] :e1000e:e1000_clean_rx_ring+0xde/0x1b7
[ 1287.639550]  [<ffffffff88212566>] :e1000e:e1000_reset_task+0x0/0xc
[ 1287.645918]  [<ffffffff8821096e>] :e1000e:e1000e_reinit_locked+0x3d/0x50
[ 1287.652811]  [<ffffffff80050582>] run_workqueue+0x9a/0xf4
[ 1287.658366]  [<ffffffff8004cd33>] worker_thread+0x0/0x122
[ 1287.663922]  [<ffffffff8004ce23>] worker_thread+0xf0/0x122
[ 1287.669569]  [<ffffffff80090ab8>] default_wake_function+0x0/0xe
[ 1287.675673]  [<ffffffff80034d73>] kthread+0xfe/0x132
[ 1287.680782]  [<ffffffff80067fea>] trace_hardirqs_on_thunk+0x35/0x37
[ 1287.687239]  [<ffffffff80061079>] child_rip+0xa/0x11
[ 1287.692349]  [<ffffffff800688ed>] _spin_unlock_irq+0x24/0x27
[ 1287.698170]  [<ffffffff800606a8>] restore_args+0x0/0x30
[ 1287.703555]  [<ffffffff80034c75>] kthread+0x0/0x132
[ 1287.708576]  [<ffffffff8006106f>] child_rip+0x0/0x11
[ 1287.713687] 
[ 1287.715219] 
[ 1287.715220] Code: 48 8b 18 48 89 c7 e8 65 ff ff ff 48 85 db 74 05 48 89 d8 eb 
[ 1287.724430] RIP  [<ffffffff8023925f>] skb_drop_list+0xb/0x22
[ 1287.730262]  RSP <ffff810137899db0>
[ 1287.733848] CR2: ffffffffffffd786
[ 1287.737257]  <0>Kernel panic - not syncing: Fatal exception
[ 1287.742996]  

P.

Comment 3 Prarit Bhargava 2010-01-26 13:55:30 UTC

Panic reported by gospo, not very reproducible on e1000:

Failure on 2.6.31-rc4:

log moved to: '/tmp/tmp.W17147'
BUG: unable to handle kernel NULL pointer dereference at (null)
IP: [<ffffffff8108eaa0>] put_page+0x4/0xca
PGD 0 
Oops: 0000 [#1] SMP 
last sysfs file: /sys/devices/pci0000:ff/0000:ff:02.3/irq
CPU 1 
Modules linked in: nfs nfs_acl auth_rpcgss autofs4 hidp rfcomm l2cap bluetooth
rfkill lockd sunrpc ipv6 cpufreq_ondemand acpi_cpufreq dm_multipath sbs sbshc
battery acpi_memhotplug ac lp snd_hda_codec_intelhdmi snd_hda_codec_realtek
snd_hda_intel sg snd_hda_codec snd_seq_dummy snd_seq_oss snd_seq_midi_event
snd_seq snd_seq_device sr_mod snd_pcm_oss cdrom video snd_mixer_oss rtc_cmos
output rtc_core snd_pcm rtc_lib snd_timer button snd parport_serial parport_pc
parport i2c_i801 e1000e soundcore i2c_core snd_page_alloc shpchp pcspkr
dm_snapshot dm_zero dm_mirror dm_region_hash dm_log dm_mod ahci libata sd_mod
scsi_mod ext3 jbd uhci_hcd ohci_hcd ehci_hcd [last unloaded: microcode]
Pid: 0, comm: swapper Not tainted 2.6.31-rc4 #6 To be filled by O.E.M.
RIP: 0010:[<ffffffff8108eaa0>]  [<ffffffff8108eaa0>] put_page+0x4/0xca
RSP: 0018:ffff880028069d60  EFLAGS: 00010246
RAX: 0000000000000000 RBX: ffff88012a83e480 RCX: 0000000000000011
RDX: ffff8800af130640 RSI: 00000000640000e0 RDI: 0000000000000000
RBP: 0000000000000001 R08: ffffffff81b3ce00 R09: 0000000000000002
R10: 0000000000000000 R11: ffffffff81272de9 R12: ffffc90014717af0
R13: ffff88012a83e480 R14: ffff88012f916460 R15: 0000000000000000
FS:  0000000000000000(0000) GS:ffff880028066000(0000) knlGS:0000000000000000
CS:  0010 DS: 0018 ES: 0018 CR0: 000000008005003b
CR2: 0000000000000000 CR3: 0000000001001000 CR4: 00000000000006e0
DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
Process swapper (pid: 0, threadinfo ffff880137b1a000, task ffff880137b19560)
Stack:
 ffff88012a83e480 0000000000000001 ffffc90014717af0 ffffffff81249720
<0> ffff88012a83e480 ffff88012a83e480 ffff88012a83e480 ffffffff81249474
<0> ffff8800af130030 ffffffff81272ad4 0000000000000000 000000000000018f
Call Trace:
 <IRQ> 
 [<ffffffff81249720>] ? skb_release_data+0x65/0xaa
 [<ffffffff81249474>] ? __kfree_skb+0x9/0x6f
 [<ffffffff81272ad4>] ? ip_rcv_finish+0x3a0/0x3b0
 [<ffffffffa0152ab3>] ? e1000_clean_rx_irq+0x22e/0x2cd [e1000e]
 [<ffffffffa01518cf>] ? e1000_clean+0x6e/0x21d [e1000e]
 [<ffffffff81253778>] ? net_rx_action+0xa9/0x17d
 [<ffffffff8104153e>] ? __do_softirq+0xc5/0x182
 [<ffffffff8100ca3c>] ? call_softirq+0x1c/0x28
 [<ffffffff8100ddd2>] ? do_softirq+0x2c/0x68
 [<ffffffff8100d452>] ? do_IRQ+0xa0/0xb6
 [<ffffffff8100c2d3>] ? ret_from_intr+0x0/0xa
 <EOI> 
 [<ffffffff8100c42e>] ? apic_timer_interrupt+0xe/0x20
 [<ffffffff811a852f>] ? acpi_safe_halt+0x27/0x39
 [<ffffffff811a861f>] ? acpi_idle_enter_c1+0x6f/0xc7
 [<ffffffff81231396>] ? ladder_select_state+0x2b/0x135
 [<ffffffff812309e5>] ? cpuidle_idle_call+0x7f/0xbe
 [<ffffffff8100aa1d>] ? cpu_idle+0x40/0x5e

Comment 4 John Ronciak 2010-01-26 17:11:08 UTC

Is this problem seen with our stand-alone versions of the e1000 and e1000e drivers?  Maybe a patch didn't get picked up during backport efforts.  This would narrow that down.   We also don't have other reports of this happening either upstream or from our stand-alone drivers.

Comment 6 Prarit Bhargava 2010-01-26 17:59:37 UTC

(In reply to comment #4)
> Is this problem seen with our stand-alone versions of the e1000 and e1000e
> drivers?  Maybe a patch didn't get picked up during backport efforts.  This
> would narrow that down.   We also don't have other reports of this happening
> either upstream or from our stand-alone drivers.    

John,

This does happen upstream on the e1000.

http://marc.info/?l=linux-netdev&m=126426149306083&w=4

(Of course that is a UP kernel ... but the panic is the same)

P.

Comment 8 Andy Gospodarek 2010-01-26 19:12:42 UTC

NFS: (0:18/6248) revalidation complete
NFS: (0:18/6248) data cache invalidated
NFS: readdir_search_pagecache() searching for cookie 1536
NFS: find_dirent_page: searching page 0 for target 1536
NFS: nfs_readdir_filler: reading cookie 0 into page 0
NFS: find_dirent: examining cookie 12
NFS: find_dirent: examining cookie 512
NFS: find_dirent_page: returns -523
NFS: readdir_search_pagecache: returns -523
NFS: uncached_readdir() searching for cookie 1536
NFS: uncached_readdir: returns -523

<e1000e tx unit hang messages>

slab error in verify_redzone_free(): cache `size-2048': memory outside object was overwritten

Call Trace:
 [<ffffffff80033633>] cache_free_debugcheck+0x106/0x217
 [<ffffffff8000b94c>] kfree+0xcc/0x25d
 [<ffffffff8002a2e7>] __kfree_skb+0x11/0x1a
 [<ffffffff881d052e>] :e1000e:e1000_clean_rx_ring+0xde/0x1b7
 [<ffffffff881d2566>] :e1000e:e1000_reset_task+0x0/0xc
 [<ffffffff881d096e>] :e1000e:e1000e_reinit_locked+0x3d/0x50
 [<ffffffff80050582>] run_workqueue+0x9a/0xf4
 [<ffffffff8004cd33>] worker_thread+0x0/0x122
 [<ffffffff8004ce23>] worker_thread+0xf0/0x122
 [<ffffffff80091598>] default_wake_function+0x0/0xe
 [<ffffffff80034d73>] kthread+0xfe/0x132
 [<ffffffff80067fea>] trace_hardirqs_on_thunk+0x35/0x37
 [<ffffffff80061079>] child_rip+0xa/0x11
 [<ffffffff800688ed>] _spin_unlock_irq+0x24/0x27
 [<ffffffff800606a8>] restore_args+0x0/0x30
 [<ffffffff80034c75>] kthread+0x0/0x132
 [<ffffffff8006106f>] child_rip+0x0/0x11

ffff81010fcb0e18: redzone 1:0x170f0000, redzone 2:0x170fc2a5.
slab error in verify_redzone_free(): cache `size-2048': memory outside object was overwritten

Call Trace:
 [<ffffffff80033633>] cache_free_debugcheck+0x106/0x217
 [<ffffffff8000b94c>] kfree+0xcc/0x25d
 [<ffffffff8002a2e7>] __kfree_skb+0x11/0x1a
 [<ffffffff881d052e>] :e1000e:e1000_clean_rx_ring+0xde/0x1b7
 [<ffffffff881d2566>] :e1000e:e1000_reset_task+0x0/0xc
 [<ffffffff881d096e>] :e1000e:e1000e_reinit_locked+0x3d/0x50
 [<ffffffff80050582>] run_workqueue+0x9a/0xf4
 [<ffffffff8004cd33>] worker_thread+0x0/0x122
 [<ffffffff8004ce23>] worker_thread+0xf0/0x122
 [<ffffffff80091598>] default_wake_function+0x0/0xe
 [<ffffffff80034d73>] kthread+0xfe/0x132
 [<ffffffff80067fea>] trace_hardirqs_on_thunk+0x35/0x37
 [<ffffffff80061079>] child_rip+0xa/0x11
 [<ffffffff800688ed>] _spin_unlock_irq+0x24/0x27
 [<ffffffff800606a8>] restore_args+0x0/0x30
 [<ffffffff80034c75>] kthread+0x0/0x132
 [<ffffffff8006106f>] child_rip+0x0/0x11

ffff810116f72e18: redzone 1:0x170f0000, redzone 2:0x170fc2a5.
general protection fault: 0000 [1] SMP 
last sysfs file: /devices/pci0000:00/0000:00:00.0/irq
CPU 0 
Modules linked in: nfs fscache nfs_acl autofs4 hidp rfcomm l2cap bluetooth lockd sunrpc ipv6 xfrm_nalgo crypto_api cpufreq_ondemand acpi_cpufreq freq_table dm_multipath scsi_dh video backlight sbs power_meter hwmon i2c_ec dell_wmi wmi button battery asus_acpi acpi_memhotplug ac lp joydev snd_hda_intel snd_seq_dummy snd_seq_oss snd_seq_midi_event snd_seq snd_seq_device snd_pcm_oss snd_mixer_oss snd_pcm snd_timer snd_page_alloc sr_mod parport_serial cdrom snd_hwdep snd i2c_i801 parport_pc sg shpchp e1000e i2c_core parport soundcore pcspkr dm_raid45 dm_message dm_region_hash dm_mem_cache dm_snapshot dm_zero dm_mirror dm_log dm_mod ahci libata sd_mod scsi_mod ext3 jbd uhci_hcd ohci_hcd ehci_hcd
Pid: 14, comm: events/0 Not tainted 2.6.18-185.el5debug #1
RIP: 0010:[<ffffffff80239d4e>]  [<ffffffff80239d4e>] skb_drop_list+0xb/0x22
RSP: 0018:ffff810137897db0  EFLAGS: 00010206
RAX: 000e0000003c0000 RBX: ffff8101191de710 RCX: 0000000000000002
RDX: ffff81013061be00 RSI: 000000012d253000 RDI: ffff81012d252858
RBP: 0000000000000000 R08: ffff81012d254000 R09: 0000000000000000
R10: ffff810137a0f0b8 R11: 00000000000000f8 R12: ffff81012f01c8f0
R13: ffff810137a0f0b8 R14: ffff8101325e0680 R15: 0000000000000000
FS:  0000000000000000(0000) GS:ffffffff8043e000(0000) knlGS:0000000000000000
CS:  0010 DS: 0018 ES: 0018 CR0: 000000008005003b
CR2: 0000000000439520 CR3: 0000000125d41000 CR4: 00000000000006e0
Process events/0 (pid: 14, threadinfo ffff810137896000, task ffff810137894440)
Stack:  ffff8101191de710 ffffffff80239dea ffff81013061be50 ffff8101191de710
 ffffc200000eda00 ffffffff8002a2e7 ffff81013061b9d8 ffffffff881d052e
 0000000000000040 ffff8101325e0680 ffff8101325e0718 ffff8101375a8b60
Call Trace:
 [<ffffffff80239dea>] skb_release_data+0x85/0x99
 [<ffffffff8002a2e7>] __kfree_skb+0x11/0x1a
 [<ffffffff881d052e>] :e1000e:e1000_clean_rx_ring+0xde/0x1b7
 [<ffffffff881d2566>] :e1000e:e1000_reset_task+0x0/0xc
 [<ffffffff881d096e>] :e1000e:e1000e_reinit_locked+0x3d/0x50
 [<ffffffff80050582>] run_workqueue+0x9a/0xf4
 [<ffffffff8004cd33>] worker_thread+0x0/0x122
 [<ffffffff8004ce23>] worker_thread+0xf0/0x122
 [<ffffffff80091598>] default_wake_function+0x0/0xe
 [<ffffffff80034d73>] kthread+0xfe/0x132
 [<ffffffff80067fea>] trace_hardirqs_on_thunk+0x35/0x37
 [<ffffffff80061079>] child_rip+0xa/0x11
 [<ffffffff800688ed>] _spin_unlock_irq+0x24/0x27
 [<ffffffff800606a8>] restore_args+0x0/0x30
 [<ffffffff80034c75>] kthread+0x0/0x132
 [<ffffffff8006106f>] child_rip+0x0/0x11


Code: 48 8b 18 48 89 c7 e8 65 ff ff ff 48 85 db 74 05 48 89 d8 eb 
RIP  [<ffffffff80239d4e>] skb_drop_list+0xb/0x22
 RSP <ffff810137897db0>
 <0>Kernel panic - not syncing: Fatal exception

Comment 9 Andy Gospodarek 2010-01-26 19:31:10 UTC

In the middle of the

<e1000e tx unit hang messages>

I did notice one of these:

nfs: server sol10-nfs not responding, still trying

Comment 10 Jeff Layton 2010-01-26 20:13:13 UTC

Hard to understand why the NFS version would matter much here...

A question: are you running the NFSv3 runs over UDP as well?

Comment 12 Andy Gospodarek 2010-01-26 21:12:54 UTC

(In reply to comment #10)
> Hard to understand why the NFS version would matter much here...

If it was easy, we would have already figured it out. :)

> A question: are you running the NFSv3 runs over UDP as well?    

Yes.

Comment 21 Jeff Layton 2010-01-27 14:59:10 UTC

Ok, I got on the box and reproduced the panic:

ffff810132749330: redzone 1:0x170f0000, redzone 2:0x170fc2a5.
ffff81011f6b6b18: redzone 1:0x170fc2a5, redzone 2:0x82a0e000046208e.
ffff810133eb0700: redzone 1:0x170fc2a5, redzone 2:0x208e0e0000aa208e.

In the latter two cases, rz2 was clobbered so it doesn't seem to be *just* rz1.

Comment 26 Jesse Brandeburg 2010-01-27 19:45:51 UTC

I think we need to look at why the rx unit is still receiving into buffers that the driver had freed.

The tx hang debug we also need to work on, but I think the rx cleanup panic can be fixed, probably by being a lot more paranoid about making sure RX is stopped before reset continues.

Comment 29 Jesse Brandeburg 2010-01-28 00:35:42 UTC

Created attachment 387208 [details]
make sure rx unit is stopped before reset

patch applies against 2.6.33-rc5 and was load/unload tested.  We don't have a repro here, so please give us feedback if this stops the panic.

will also attach the dump code.

Comment 30 Jesse Brandeburg 2010-01-28 00:37:23 UTC

Created attachment 387209 [details]
e1000e dump code

this dump code will dump registers and descriptor rings upon a tx hang.  If possible please consider running with only 80 rx and tx descriptors.

ethtool -G eth0 rx 80 tx 80

patch is against 2.6.33-rc5

Comment 33 Prarit Bhargava 2010-01-28 14:49:38 UTC

(In reply to comment #29)
> Created an attachment (id=387208) [details]
> make sure rx unit is stopped before reset
> 
> patch applies against 2.6.33-rc5 and was load/unload tested.  We don't have a
> repro here, so please give us feedback if this stops the panic.
> 
> will also attach the dump code.    

Jesse,

I tried both patches on a system here and I still get the same panic.

FYI,

P.

Comment 35 Jesse Brandeburg 2010-01-28 17:22:18 UTC

did you capture the panic? were there any messages from the driver before the crash?

Comment 39 Prarit Bhargava 2010-01-28 19:29:07 UTC

(In reply to comment #35)
> did you capture the panic? were there any messages from the driver before the
> crash?    

I have the panic, and will attach it shortly.  Panic done with sysctl -w kernel.printk=8 ...

There are no messages from the driver, and the panic is pretty much the same as the previous panics reported in this BZ.

P.

Comment 40 Prarit Bhargava 2010-01-28 19:30:01 UTC

Created attachment 387411 [details]
panic with Jesse's patches

Comment 43 Andy Gospodarek 2010-01-28 20:08:22 UTC

For fun we tried the e1000e-1.1.2 driver from sourceforge and the system fails just like it does when using the backported e1000e driver.

Comment 44 Jesse Brandeburg 2010-01-28 21:57:13 UTC

I'm trying to reproduce.  Do you have any non piketon (like ich9 or ich10R) systems that you can reproduce on too?  My immediately accessible systems are ich9/10.  Please be absolutely sure your piketon has the latest bios, they were changing the bios every day on those machines, right up to production.

I'm in connectathon build hell, no libsocket, I'll probably figure it out just after sending this.

Comment 47 Jesse Brandeburg 2010-01-29 01:32:47 UTC

please attach dmidecode output, and any other information regarding the system you're running on.  Is it a production piketon?

I've got a piketon CRB in my office we used for development, installing 5.5 pre now.

Comment 48 Jesse Brandeburg 2010-01-29 02:20:12 UTC

attached is my tests.init
I've started two cthon04 runs, two machines connected via the e1000e ports to a switch.

each machine is serving nfs, one is configured sync in exports, the other is not.
ich9:
00:19.0 Ethernet controller: Intel Corporation 82566DC-2 Gigabit Network Connection (rev 02)
ich10r:
00:19.0 Ethernet controller: Intel Corporation 82567LM-2 Gigabit Network Connection

I started cthon like so:
./server -N 1000

apparently there are some comments in this bug that I cannot see, so if you want to fill me in, feel free.

I'll run with the piketon tomorrow.

Comment 49 Jesse Brandeburg 2010-01-29 02:26:17 UTC

Created attachment 387475 [details]
tests.init file configuration

Comment 51 Prarit Bhargava 2010-01-29 13:32:38 UTC

Created attachment 387564 [details]
dmidecode output from piketon-02

(In reply to comment #47)
> please attach dmidecode output, and any other information regarding the system
> you're running on.  Is it a production piketon?
> 

Hi Jesse -- here is the dmidecode output.  

AFAIK, this piketon box is a pre-production box, however, we've seen this error crop up on a few other production systems (HP z400 for example) but with much less reproducibility.

The panic on those other systems is *identical* (I cannot stress that enough -- the panics are *exactly the same*) as the panic in this BZ.

More information on reproducibility to follow shortly,

P.

Comment 52 Prarit Bhargava 2010-01-29 13:33:53 UTC

Jesse, given the nature of this BZ I'm going to send you the reproducer details in a private email and cc a few others as well.

P.

Comment 61 Neil Horman 2010-01-29 21:12:37 UTC

Created attachment 387644 [details]
untested first draft of my slab overrun trapper

ITs untested and only works on x86_64, but this is a first draft of my slab trap patch.  It appends a page to the 2048 slab, aligns the actual object to the end of the first page, and marks the second page read-only.  This should, when working properly trap the corruption of the 2048 slab when it occurs, in the context of the corruptor.  I need to validate its functionality still, but this is pretty close.

Comment 72 Neil Horman 2010-01-30 15:48:52 UTC

Created attachment 387745 [details]
new version of trap patch

Here you go, I've still got to verify it, but this will let you boot.  I'm going to try do a version 2 (I think I can clean it up significantly), but we should try rolling with this for a bit to see if we catch anything

Comment 73 Neil Horman 2010-01-30 15:50:00 UTC

in response to commetn #71, thats odd, its booting fine for me on hp-windsor.  Try the above patch and see if anything happens

Comment 74 Jeff Layton 2010-01-30 22:16:47 UTC

Hmmm...still not working correctly. It now boots a bit further, but eventually hangs after trying to start up udev:

--------------------[snip]------------------------
Switching to new root and running init.
unmounting old /dev
unmounting old /proc
unmounting old /sys
type=1404 audit(1264871459.791:2): enforcing=1 old_enforcing=0 auid=4294967295 ses=4294967295
BUG: warning at mm/slab.c:2333/check_irq_off() (Not tainted)

Call Trace:
 [<ffffffff80017ff0>] cache_grow+0x463/0x4e2
 [<ffffffff8005d12e>] cache_alloc_refill+0x184/0x1d4
 [<ffffffff800dbdcb>] __kmalloc+0xbd/0xd7
 [<ffffffff800cc14e>] __kzalloc+0x9/0x21
 [<ffffffff8013861e>] security_get_bools+0x74/0x1b5
 [<ffffffff80130a22>] sel_write_load+0x130/0x324
 [<ffffffff80016da1>] vfs_write+0xce/0x174
 [<ffffffff8001766e>] sys_write+0x45/0x6e
 [<ffffffff8005e116>] system_call+0x7e/0x83

BUG: warning at mm/slab.c:2333/check_irq_off() (Not tainted)

Call Trace:
 [<ffffffff8005d074>] cache_alloc_refill+0xca/0x1d4
 [<ffffffff800dbdcb>] __kmalloc+0xbd/0xd7
 [<ffffffff800cc14e>] __kzalloc+0x9/0x21
 [<ffffffff8013861e>] security_get_bools+0x74/0x1b5
 [<ffffffff80130a22>] sel_write_load+0x130/0x324
 [<ffffffff80016da1>] vfs_write+0xce/0x174
 [<ffffffff8001766e>] sys_write+0x45/0x6e
 [<ffffffff8005e116>] system_call+0x7e/0x83

type=1403 audit(1264871460.166:3): policy loaded auid=4294967295 ses=4294967295
INIT: version 2.86 booting
		Welcome to Red Hat Enterprise Linux Server
		Press 'I' to enter interactive startup.
Setting clock  (localtime): Sat Jan 30 17:11:02 EST 2010 [  OK  ]
Starting udev: 
--------------------[snip]------------------------

...it hung in that spot for >5 mins and the console stopped responding.

Sysrq doesn't seem to work correctly through the serial console here, so I can't tell exactly what's wrong.

Comment 75 Jeff Layton 2010-01-30 22:33:42 UTC

Got it to boot after a few attempts and started up the test. It took a while and never quite panic'ed in the way we expected:

BUG: scheduling while atomic: swapper/0x10000100/0

Call Trace:
 <IRQ>  [<ffffffff8006343d>] __sched_text_start+0x7d/0xbd6
 [<ffffffff8008d6a7>] enqueue_task+0x41/0x56
 [<ffffffff8008d712>] __activate_task+0x56/0x6d
 [<ffffffff8008f56b>] __cond_resched+0x1c/0x44
 [<ffffffff800640cb>] cond_resched+0x37/0x42
 [<ffffffff800a3794>] down_write+0x9/0x12
 [<ffffffff8008363a>] change_page_attr_addr+0x43/0xf2
 [<ffffffff800dc2d3>] slab_destroy+0x101/0x19f
 [<ffffffff800dc50c>] free_block+0x19b/0x1db
 [<ffffffff800dc697>] cache_flusharray+0x8b/0xba
 [<ffffffff8000b82c>] kfree+0x1fa/0x215
 [<ffffffff80029187>] __kfree_skb+0x11/0x1a
 [<ffffffff8022c86c>] skb_free_datagram+0xc/0x31
 [<ffffffff885425b2>] :sunrpc:xs_udp_data_ready+0x17c/0x193
 [<ffffffff802297ca>] sock_queue_rcv_skb+0x15f/0x16b
 [<ffffffff8025dcc5>] __udp_queue_rcv_skb+0x9/0x5d
 [<ffffffff80053f69>] udp_queue_rcv_skb+0x283/0x2d6
 [<ffffffff80053b9e>] udp_rcv+0x3e5/0x52d
 [<ffffffff8003540b>] ip_local_deliver+0x19d/0x263
 [<ffffffff80036579>] ip_rcv+0x539/0x57c
 [<ffffffff80020faa>] netif_receive_skb+0x470/0x49f
 [<ffffffff881b139c>] :e1000e:e1000_receive_skb+0x1b5/0x1d6
 [<ffffffff881b5af7>] :e1000e:e1000_clean_rx_irq+0x283/0x322
 [<ffffffff881b3b8c>] :e1000e:e1000_clean+0x7c/0x29a
 [<ffffffff8000cbfb>] net_rx_action+0xac/0x1e0
 [<ffffffff881b3a1c>] :e1000e:e1000_intr_msi+0xd6/0xe0
 [<ffffffff80012762>] __do_softirq+0x89/0x133
 [<ffffffff8005f2fc>] call_softirq+0x1c/0x28
 [<ffffffff8006dba8>] do_softirq+0x2c/0x85
 [<ffffffff8006da30>] do_IRQ+0xec/0xf5
 [<ffffffff8005e615>] ret_from_intr+0x0/0xa
 <EOI>  [<ffffffff8019d5aa>] acpi_processor_idle_simple+0x17d/0x30e
 [<ffffffff8019cce3>] acpi_safe_halt+0x25/0x36
 [<ffffffff8019d50d>] acpi_processor_idle_simple+0xe0/0x30e
 [<ffffffff8019d42d>] acpi_processor_idle_simple+0x0/0x30e
 [<ffffffff8004a19e>] cpu_idle+0x95/0xb8
 [<ffffffff800786bc>] start_secondary+0x495/0x4a4

NMI Watchdog detected LOCKUP on CPU 0
CPU 0
Modules linked in: nfs fscache nfs_acl autofs4 hidp rfcomm l2cap bluetooth lockd
 sunrpc ipv6 xfrm_nalgo crypto_api cpufreq_ondemand acpi_cpufreq freq_table dm_m
ultipath scsi_dh video backlight sbs power_meter hwmon i2c_ec dell_wmi wmi butto
n battery asus_acpi acpi_memhotplug ac lp joydev snd_hda_intel snd_seq_dummy snd
_seq_oss snd_seq_midi_event snd_seq snd_seq_device snd_pcm_oss snd_mixer_oss snd
_pcm i2c_i801 snd_timer snd_page_alloc snd_hwdep i2c_core snd parport_serial par
port_pc parport shpchp soundcore sr_mod cdrom e1000e pcspkr sg dm_raid45 dm_mess
age dm_region_hash dm_mem_cache dm_snapshot dm_zero dm_mirror dm_log dm_mod ahci
 libata sd_mod scsi_mod ext3 jbd uhci_hcd ohci_hcd ehci_hcd
Pid: 2817, comm: automount Not tainted 2.6.18-185.el5.slabtrap.1 #4
RIP: 0010:[<ffffffff80065bfc>]  [<ffffffff80065bfc>] .text.lock.spinlock+0x2/0x3
0
RSP: 0018:ffff810125dedd90  EFLAGS: 00000086
RAX: 0000000000000000 RBX: 0000000000000246 RCX: 0000000000000000
RDX: ffff81010448ef40 RSI: 00000000000000d0 RDI: ffff81010448e680
RBP: 00000000000000d0 R08: ffff810125dec000 R09: 00000000000000d0
R10: 0000000000000000 R11: ffff810133a33860 R12: ffff81010448ef40
R13: ffff81010448e640 R14: 0000000000000004 R15: ffff810104499380
FS:  00000000415c9940(0063) GS:ffffffff803ca000(0000) knlGS:0000000000000000
CS:  0010 DS: 0000 ES: 0000 CR0: 000000008005003b
CR2: 0000003a9989a3a0 CR3: 00000001287c4000 CR4: 00000000000006e0
Process automount (pid: 2817, threadinfo ffff810125dec000, task ffff810133a33860
)
Stack:  ffffffff8005d01d 000000d000000010 0000000000000246 00000000000000d0
 ffff810104499380 ffffffff8004a7ee 00000000424a49d0 0000000000010800
 ffffffff8000ae35 ffff8101349d79d8 ffffffff80316e70 0000000000000002
Call Trace:
 [<ffffffff8005d01d>] cache_alloc_refill+0x73/0x1d4
 [<ffffffff8004a7ee>] audit_alloc+0x70/0x123
 [<ffffffff8000ae35>] kmem_cache_alloc+0x86/0xa3
 [<ffffffff8004a7ee>] audit_alloc+0x70/0x123
 [<ffffffff8012dbbf>] selinux_task_alloc_security+0x1e/0x55
 [<ffffffff8001fab1>] copy_process+0x562/0x15eb
 [<ffffffff8009f459>] alloc_pid+0x1f6/0x292
 [<ffffffff80031c00>] do_fork+0x69/0x1c1
 [<ffffffff8005e28d>] tracesys+0xd5/0xe0
 [<ffffffff8005e427>] ptregscall_common+0x67/0xac


Code: 83 3f 00 7e f9 e9 7f fe ff ff f3 90 83 3f 00 7e f9 e9 f9 fe
Kernel panic - not syncing: nmi watchdog
 BUG: warning at kernel/panic.c:137/panic() (Not tainted)

Call Trace:
 <NMI>  [<ffffffff80092913>] panic+0x1da/0x1eb
 [<ffffffff8006caef>] _show_stack+0xdb/0xea
 [<ffffffff8006cbe2>] show_registers+0xe4/0x100
 [<ffffffff800662c5>] die_nmi+0x66/0xa3
 [<ffffffff80066a0b>] nmi_watchdog_tick+0x157/0x1d3
 [<ffffffff80066629>] default_do_nmi+0x81/0x225
 [<ffffffff80066896>] do_nmi+0x43/0x61
 [<ffffffff80065eef>] nmi+0x7f/0x88
 [<ffffffff80065bfc>] .text.lock.spinlock+0x2/0x30
 <<EOE>>  [<ffffffff8005d01d>] cache_alloc_refill+0x73/0x1d4
 [<ffffffff8004a7ee>] audit_alloc+0x70/0x123
 [<ffffffff8000ae35>] kmem_cache_alloc+0x86/0xa3
 [<ffffffff8004a7ee>] audit_alloc+0x70/0x123
 [<ffffffff8012dbbf>] selinux_task_alloc_security+0x1e/0x55
 [<ffffffff8001fab1>] copy_process+0x562/0x15eb
 [<ffffffff8009f459>] alloc_pid+0x1f6/0x292
 [<ffffffff80031c00>] do_fork+0x69/0x1c1
 [<ffffffff8005e28d>] tracesys+0xd5/0xe0
 [<ffffffff8005e427>] ptregscall_common+0x67/0xac

Comment 76 Neil Horman 2010-01-31 01:07:19 UTC

Yeah, this is confusing to me.  I'm actually growing a bit concerned about this check_irq_off bug.  I started from scratch and made a better implementation of the guard page patch, and despite it being significantly different, I still get the same check irq warning.  I'm starting to wonder if we have a unbalance local_irq_enable call somewhere that allowing a double dequeue of an object somewhere thats leading to this sort of corruption.  Since I can trip the check_irq_off warning here on my development machine, I'm further instrumenting to track hardirq disable/enables and report who the last enabler was.  If I am tripping over another problem that should reveal it.  I'll post here when I have something.

Comment 77 Neil Horman 2010-01-31 19:10:36 UTC

Created attachment 387872 [details]
cleaner version of slab trap

Ok, heres a cleaned up implementation of the slab trap patch.  Turns out the page protection altering code did some locking that seemed to be messing up the slab debug checking, thats fixed  now

Some notes:
1) This version of the patch requires that CONFIG_SLAB_DEBUG be enabled
2) It explicitly disables Red Zone checking for the slabs.  With this patch,ideally that shouldn't matter, and it solves a number of problem with alignment when adding a guard page


I've tested this and verified that it catches overwrites beyond 2048 bytes (the object is set 1/2 page into the first of two pages, so any overwrite is trapped.  Jeff, not sure if you're still working on this over the weekend now, but if you have time to run it trough your test that would be great.  Otherwise I'll take it over monday.

Comment 79 Neil Horman 2010-02-01 12:37:48 UTC

Created attachment 388024 [details]
enhancement to the slab trp patch

same patch as before, but enhanced.  It writes 0xcc to the first 128 bytes of the guard page before write protecting it.  This should catch any dma overruns on it.

Comment 85 Neil Horman 2010-02-02 01:30:02 UTC

Created attachment 388175 [details]
patch to write protect all data pages for all skbs

Comment 88 Neil Horman 2010-02-02 14:40:39 UTC

Created attachment 388317 [details]
client side nfs capture w/ tcpdump

So, with my most recent slab trap patch, I'm not getting crashes, but part of the problem is certainly reproducing, in that the NIC hangs up and needs to be ifdowned/ifuped to restore function.  So since we're seeing part of the problem I did a client side tcpdump capture during the reproducer, attached here.  Not sure what to make of it yet, but even during the NIC hang, we seem to receive frames.  This kind of suggests that the card is having a problem with transmisson rather than receive (which might explain why my write protection changes on the receive side aren't catching anything, I'm really not sure).  Regardless however, I note that the sol-10 server seems to be sending _lots_ of duplicate replies to frames we received many many frames ago.  That shouldn't be problematic, but its definately something to look at and explain.

Comment 89 Jesse Brandeburg 2010-02-02 17:58:34 UTC

I'm now running against a Solaris 10 x86 nfs server, and I ran through 941 iterations of ./server -N 10000 before getting the message:
=====
check for lost reply on non-idempotent requests
100 tries

test exclusive create.

test negative seek, you should get: read: Invalid argument
or lseek: Invalid argument
lseek: Invalid argument

test rename

test truncate
truncate succeeded

test holey file support
Holey file test ok

second check for lost reply on non-idempotent requests
testing 50 idempotencies in directory "testdir"
rmdir 1: Directory not empty
special tests failed
Tests failed, leaving /mnt/sol-dell1.jf.intel.com mounted
=====

as it says above the directory is full of files.

I still have not seen any slab corruption reported, could the network that this machine is plugged into have jumbo frame traffic on it?  I'm wondering if this is somehow related to the LPE (long packet enable) problems we've had at redhat on some of the other piketon systems (that we cannot reproduce)

Neil, will you make sure that your e1000e driver is running in non-jumbo mode with this patch applied?

commit 6a9a79019524ed1fd4e2036e947b93077b2d7aff
Author: Jesse Brandeburg <jesse.brandeburg>
Date:   Tue Feb 2 09:42:49 2010 -0800

    e1000e: don't use LPE to limit buffer size

    Signed-off-by: Jesse Brandeburg <jesse.brandeburg>

diff --git a/drivers/net/e1000e/netdev.c b/drivers/net/e1000e/netdev.c
index a042aa0..61a511a 100644
--- a/drivers/net/e1000e/netdev.c
+++ b/drivers/net/e1000e/netdev.c
@@ -4639,12 +4639,6 @@ static int e1000_change_mtu(struct net_device *netdev, int new_mtu)
        else
                adapter->rx_buffer_len = 4096;

-       /* adjust allocation if LPE protects us, and we aren't using SBP */
-       if ((max_frame == ETH_FRAME_LEN + ETH_FCS_LEN) ||
-            (max_frame == ETH_FRAME_LEN + VLAN_HLEN + ETH_FCS_LEN))
-               adapter->rx_buffer_len = ETH_FRAME_LEN + VLAN_HLEN
-                                        + ETH_FCS_LEN;
-
        if (netif_running(netdev))
                e1000e_up(adapter);
        else

Comment 90 Andy Gospodarek 2010-02-02 18:19:33 UTC

(In reply to comment #89)
> 
> Neil, will you make sure that your e1000e driver is running in non-jumbo mode
> with this patch applied?
> 
> commit 6a9a79019524ed1fd4e2036e947b93077b2d7aff
> Author: Jesse Brandeburg <jesse.brandeburg>
> Date:   Tue Feb 2 09:42:49 2010 -0800
> 
>     e1000e: don't use LPE to limit buffer size
> 
>     Signed-off-by: Jesse Brandeburg <jesse.brandeburg>
> 
> diff --git a/drivers/net/e1000e/netdev.c b/drivers/net/e1000e/netdev.c
> index a042aa0..61a511a 100644
> --- a/drivers/net/e1000e/netdev.c
> +++ b/drivers/net/e1000e/netdev.c
> @@ -4639,12 +4639,6 @@ static int e1000_change_mtu(struct net_device *netdev,
> int new_mtu)
>         else
>                 adapter->rx_buffer_len = 4096;
> 
> -       /* adjust allocation if LPE protects us, and we aren't using SBP */
> -       if ((max_frame == ETH_FRAME_LEN + ETH_FCS_LEN) ||
> -            (max_frame == ETH_FRAME_LEN + VLAN_HLEN + ETH_FCS_LEN))
> -               adapter->rx_buffer_len = ETH_FRAME_LEN + VLAN_HLEN
> -                                        + ETH_FCS_LEN;
> -
>         if (netif_running(netdev))
>                 e1000e_up(adapter);
>         else    

I can't say for sure if there are any jumbo frames on the network, but we are not running that patch.  The bits are still in place.

Comment 91 Prarit Bhargava 2010-02-02 18:51:15 UTC

(In reply to comment #90)
> (In reply to comment #89)
> > 
> > Neil, will you make sure that your e1000e driver is running in non-jumbo mode
> > with this patch applied?
> > 
> > commit 6a9a79019524ed1fd4e2036e947b93077b2d7aff
> > Author: Jesse Brandeburg <jesse.brandeburg>
> > Date:   Tue Feb 2 09:42:49 2010 -0800
> > 
> >     e1000e: don't use LPE to limit buffer size
> > 
> >     Signed-off-by: Jesse Brandeburg <jesse.brandeburg>
> > 
> > diff --git a/drivers/net/e1000e/netdev.c b/drivers/net/e1000e/netdev.c
> > index a042aa0..61a511a 100644
> > --- a/drivers/net/e1000e/netdev.c
> > +++ b/drivers/net/e1000e/netdev.c

<snip>

> I can't say for sure if there are any jumbo frames on the network, but we are
> not running that patch.  The bits are still in place.    

Removing it now and testing ... fyi.

P.

Comment 92 Prarit Bhargava 2010-02-02 18:54:28 UTC

> <snip>
> 
> > I can't say for sure if there are any jumbo frames on the network, but we are
> > not running that patch.  The bits are still in place.    
> 
> Removing it now and testing ... fyi.
> 
> P.    

Kernel still panics (in exactly the same way as previously noted).

P.

Comment 93 Neil Horman 2010-02-02 21:26:06 UTC

Did some more testing today.  Tried some page protection strategies on the actual data buffer for the rx skb, and got the server not responding messages, but never traped any corruption.  Tried poisoning some slack space in the data buffer to catch scribbing by a dma transfer and never detected any of that either.  Gospo had an interesting theory in which it was just the skb_shared info that was getting corrupted previously which would have fit with previous observations as well (given the corruption of rz2).  So we're trying some approaches with just that area getting moved away.

Comment 94 Andy Gospodarek 2010-02-02 22:06:04 UTC

Created attachment 388391 [details]
interesting-skb-data-alignment.patch

While the patch Neil referenced sounds like a great idea -- something about it doesn't work.  I'll have to look at it tomorrow.  It is attached for those that are curious.

Comment 95 Neil Horman 2010-02-03 01:50:37 UTC

To clairfy, the origional patch I posted in comment 85 works fine, it properly write protects the data page and places the skb_shared_info on the writeable second page, it just doesn't catch any corruption. Its only the variant above that doesn't boot.  I _think_ its because we're adding a guard page between the data buffer and the skb_shared_info struct.  A 3 page allocation will require that we alloc from the order 2 slab, which is 16k of contiguous data.  Thats pretty hard to get in normal conditions, and if netlink or some other early use protocol is allocating skbs during boot, we could run out of available 16k chunks quick.

Given that my patch in comment 85 works, but doesn't trap any writes or detect any corruption via dma (if I use poisoning on the slack space), I think we can fix this by reducing the allocation to 2 pages, and pushing the data buffer size such that the end pointer of the skb is PAGE_SIZE-sizeof(skb_shared_info) from the end of the page.  That way if someone is only corrupting skb_shared_info (as the patch in comment 94 hypothesizes), we can use the second of 2 pages as a guard page here and achieve the same goal.

Comment 96 Neil Horman 2010-02-03 03:32:16 UTC

so, I just gave the above a shot, and got simmilar results.  I observed tx-side hangs, and an eventual complete hang of the NIC, which resulted in loss of connectivity.  an ifdown/ifup of the interface brought it back to life.  Given that I'm seeing dumps of the tx ring during these operations, I strongly feel we need to start investigating the tx path in the driver.  I'm specifically wondering if perhaps the nfs code is sending down a packet with a data field that is causing the tx logic in the nic to fall over and create unpredictable scenarios.  I'm going to further instrument the tx path tomorrow.

Comment 97 Neil Horman 2010-02-04 03:01:59 UTC

i tried to do some isolated redzone checking today, I repoduced the issue on a stock -185 kernel, then added a private redzone variable at the end of the skb_shared_info strucutre, then I checked it after we get a tx complete interrupt.  As soon as I added the redzone variable, I stopped oopsing the system.  still got the tx non-responsiveness, but no oops.  Need to investigate this further, checking the private redzone on free.

Comment 100 Jeff Layton 2010-02-05 15:45:18 UTC

Created attachment 389102 [details]
debug patch 2 -- add routine to check redzone integrity in slabcache

This is a later iteration of the patch I was using to check the validity of the slabcaches in the RPC code. I just sprinkles some of these calls around the RPC code:

    objp = verify_kmalloc_cache(2048, GFP_KERNEL);

...and did some other printk's if that returned a non-NULL, to indicate when the corruption was detected.

This version also does a hexdump of 256 bytes around rz2 (which is where the corruption was almost always detected).

This patch bails out when it detects the first corrupted object, but it might be interesting too to keep on scanning and see whether there might be contiguous corrupted objects.

Comment 101 Jeff Layton 2010-02-05 16:00:44 UTC

Created attachment 389110 [details]
debug patch 3 -- more comprehensive patch -- scan and dump all corrupt objects

The other patch had a bug, if it detected problems around rz2, it didn't dump_mem the data around that area, but rather at the head of the allocation. This should fix that, and also change the patch to dump out info on any corrupt redzone in the slabcache.

Note: not even compile tested. If it's broken, let me know (or send me a patch ;).

Comment 102 Neil Horman 2010-02-06 20:28:15 UTC

so, I decided to take a more layton-esque approach this weekend.  given that our page protection approaches, while functional, modified the system enough to prevent these oopses from occuring.  Instead i did some lightweight instrumentation that validated the affected redzone in multiple places.  I've trapped the violation in netif_receive_skb, which is very early in the receive path, so thats good.  the odd part is, i did some checking earlier in the driver, that did not trigger, but I also put in some printks that should always trigger, but do not, even though the function their in appears in the backtrace when we do catch the violation, its causing me question my understanding of the rx path.

Comment 103 Neil Horman 2010-02-07 03:20:52 UTC

Interesting results today.  Based on my last comments, I sprinkled BUG() traps throughout the code in the e1000e drivers rx path to ensure i would hit one of them, but when i tested I didn't.   Yet, I still managed to hit a BUG() trap in netif_receive_skb().  So I tried booting with eth0 set to ONBOOT=no, I _still_managed to trap on the BUG() halt (with e1000e functions on the stack, perhaps erroneously, despite its being down).  I'm beginning to if maybe that our common code is the loopback driver, or something that passes data through the loopback driver.  I need to test with lo disabled to see if that avoids the problem.  I hope to have results tomorrow.  If thats the case, I'll continue by checking SKBs there in the loopback xmit path.

Comment 104 Neil Horman 2010-02-07 19:40:53 UTC

Yeah, so I configured eth0 to not come up during boot, and added some redzone checking the loopback tx/rx routine, loopback_xmit.  I trapped several bad frames during boot there, that originated from rpc.statd (not sure if the app has any relevance).  I validated that the e1000e driver wasn't loaded at all.  looks like there might be some sort of overrun on frames going through loopback.  I'm going to instrument back up the stack that I get from these.

Comment 105 Neil Horman 2010-02-09 20:35:29 UTC

Sigh, bad news.   AFter digging through the redzone code, I finally found that my check was off bya  byte, hence the xmit failures....going back to instrumenting the rx path...

Comment 106 Neil Horman 2010-02-10 02:24:07 UTC

Created attachment 389887 [details]
my current debugging patch

Ok, so good news, bad news.

Bad news is that the bug in my patch was causing the erroneous results above. I was using the wrong offset when checking redzones.  With that bug fixed, we're back to the origional behavior.

the good news is that this patch is working properly now.  With it I managed to get this backtrace:

Kernel BUG at drivers/net/e1000e/netdev.c:219
invalid opcode: 0000 [1] SMP
last sysfs file: /devices/pci0000:00/0000:00:00.0/irq
CPU 3
Modules linked in: nfs(U) fscache(U) nfs_acl(U) autofs4(U) hidp(U) rfcomm(U)
l2cap(U) bluetooth(U) lockd(U) sunrpc(U) ipv6(U) xfrm_nalgo(U) crypto_api(U)
cpufreq_ondemand(U) acpi_cpufreq(U) freq_table(U) dm_multipath(U) scsi_dh(U)
video(U) backlight(U) sbs(U) power_meter(U) hwmon(U) i2c_ec(U) dell_wmi(U)
wmi(U) button(U) battery(U) asus_acpi(U) acpi_memhotplug(U) ac(U) lp(U)
joydev(U) snd_hda_intel(U) snd_seq_dummy(U) snd_seq_oss(U) snd_seq_midi_event(U)
snd_seq(U) snd_seq_device(U) snd_pcm_oss(U) snd_mixer_oss(U) snd_pcm(U)
snd_timer(U) snd_page_alloc(U) snd_hwdep(U) parport_serial(U) parport_pc(U)
snd(U) soundcore(U) parport(U) sr_mod(U) cdrom(U) e1000e(U) i2c_i801(U)
i2c_core(U) shpchp(U) sg(U) pcspkr(U) dm_raid45(U) dm_message(U)
dm_region_hash(U) dm_mem_cache(U) dm_snapshot(U) dm_zero(U) dm_mirror(U)
dm_log(U) dm_mod(U) ahci(U) libata(U) sd_mod(U) scsi_mod(U) ext3(U) jbd(U)
uhci_hcd(U) ohci_hcd(U) ehci_hcd(U)
Pid: 0, comm: swapper Tainted: G      2.6.18-prep #2
RIP: 0010:[<ffffffff881e6745>]  [<ffffffff881e6745>]
:e1000e:e1000_alloc_rx_buffers+0x20e/0x278
RSP: 0018:ffff81010777fdc0  EFLAGS: 00010287
RAX: 0000000000000000 RBX: ffffc200000eee50 RCX: 0000000000000026
RDX: ffff8101072c7da0 RSI: ffff81013214f1f0 RDI: ffff8101075e8340
RBP: ffff810130844c20 R08: ffff810132150000 R09: 0000000000000001
R10: 0000000000000046 R11: ffffffff800c5604 R12: 00000000000000c2
R13: ffff810136040680 R14: ffff810133fcd1d0 R15: ffff810137a650b8
FS:  0000000000000000(0000) GS:ffff8101075f7c28(0000) knlGS:0000000000000000
CS:  0010 DS: 0018 ES: 0018 CR0: 000000008005003b
CR2: 0000003a9987a7d0 CR3: 0000000124cc4000 CR4: 00000000000006e0
Process swapper (pid: 0, threadinfo ffff81010777a000, task ffff810107778400)
Stack:  000000018008fcbd ffff810136040000 000005f40777fee8 a210724b06c10800
 00000000170f9b0f 0000000000000001 ffff810136040680 ffff810130844a00
 ffffc200000eeea0 ffff8101315f4c40 ffffffff80468340 ffffffff881e6c7d
Call Trace:
 <IRQ>  [<ffffffff881e6c7d>] :e1000e:e1000_clean_rx_irq+0x2e8/0x322
 [<ffffffff881e4bfb>] :e1000e:e1000_clean+0x7c/0x2b3
 [<ffffffff8000d111>] net_rx_action+0xb6/0x1fc
 [<ffffffff80012eb6>] __do_softirq+0x94/0x152
 [<ffffffff80070c49>] do_softirq+0x35/0xa0
 [<ffffffff80070c0b>] do_IRQ+0xfb/0x104
 [<ffffffff80060652>] ret_from_intr+0x0/0xf
 <EOI>  [<ffffffff80065ff3>] __sched_text_start+0xc03/0xc3e
[<ffffffff801aaa1e>] acpi_processor_idle_simple+0x193/0x324
 [<ffffffff801aaa14>] acpi_processor_idle_simple+0x189/0x324
 [<ffffffff801aa88b>] acpi_processor_idle_simple+0x0/0x324
 [<ffffffff801aa88b>] acpi_processor_idle_simple+0x0/0x324
 [<ffffffff8004c06a>] cpu_idle+0x9a/0xbd
 [<ffffffff8007b8e9>] start_secondary+0x498/0x4a7

I've also seen a failed check on receive in the driver which I didn't record.  that tells me that the error is independent of weather or not the skb is on the rx ring, which in turn supports the theory that a cpu based access is causing this corruption.  Based on this I have an idea: Given that, with redzoning, size-2048 slabs occupy 1 page per object, I can simply write protect that single page in the following conditions

1) between the times the skb is on the rx ring and the time it is removed by a napi poll

2) when its passed to kfree, until its recalled with a kmalloc, or a slab_destory_object

Since the two conditions above constitue periods when the cpu should not ever write to an object, we should be safe.  If the corruptor doesn't move due to timing (it won't move due to size changes, as we observed before, since we're not changing the object size).  With any luck we can trap the corruptor when it happens.  I'll start work in the morning.

Comment 107 Neil Horman 2010-02-10 21:24:28 UTC

Created attachment 390115 [details]
todays update

Update notes for today are in this text file.  I drew some ascii art that doesn't display well in firefox, so I figured I'd just upload it here.  interesting stuff....

Comment 108 Andy Gospodarek 2010-02-10 21:33:36 UTC

(In reply to comment #107)
> Created an attachment (id=390115) [details]
> todays update
> 
> Update notes for today are in this text file.  I drew some ascii art that
> doesn't display well in firefox, so I figured I'd just upload it here. 
> interesting stuff....    

Interesting, Neil.  Have we seen tx-timeouts when not using the debug kernel and our standard reproducer?

Comment 109 Neil Horman 2010-02-11 01:07:46 UTC

I'm not sure, I'll try that.  Whats also interesting is that when I added the 'dispose page-spanning skbs' bits to the e1000e tx path, it began corrupting stuff almost immediately in exactly the same way (and _very_ quickly).  Not sure if I did something wrong, but I'm going to follow up on that too.

Comment 110 Neil Horman 2010-02-12 11:53:19 UTC

Notes: Last night I realized that this box didn't have an AMD iommu in it (not sure why I ever thought it did).  But I added some tracing code and found that it always used the swiotlb library for dma transfers.  Looking at that libraries map_single method, I found that we were falling into a clause in which we just tried direct memory-memory dmas, rather than bounce buffering, because the e1000e cards dma_mask was set to 64 bits.  Just for curiosities sake, I set the mask to 32 bits, and the system has been running all night (the card has stopped sending/receiving data after a while, but no hangs as of yet, after running all night).  Now, it makes perfectly good sense for this card to have a 64 bit dma mask, so I don't see why this should do anything to prevent the problem, but thought it was worth noting.

Comment 113 Jesse Brandeburg 2010-02-18 00:56:19 UTC

does this system end up with memory addresses > 4GB ?  we have seen the occasional issue with previous e1000 hardware in client systems with > 4GB ram.

I wonder if something in the bios or the swiotlb code is off.  Sometimes the system has 4GB ram, but the BIOS has to do strange things to remap some of the memory in use by video adapters and io ranges.

also, does the driver you're running have the skb_dma_map stuff still in it?  That code was removed upstream because it was being buggy (multiple mappings of a single tx buffer did bad things)

Also, what happens if you turn on the hardware MMU (VT-d)/DMAR?  do you get a different fault?

PLEASE GET A RECENT BIOS!! from the dmidecode:
+ Release Date: 08/13/2009

it should probably be November or newer.

Comment 114 Neil Horman 2010-02-18 02:02:17 UTC

In response to your comments questions, Jesse:

Yes, the system has m> 4GB of ram and many of the skbs we allocate wind up above the 4G limit. In fact I had noticed that if I had set the dma_mask of the adapter to 32 bits, the memory corruption problem goes away.  Tomorrow on my list of things to try is booting with mem=4G to.  I had also thought that the swiotlb library might have a problem, and limiting ram to 4GB should effectively make it a pass through operation.

We no longer have the skb_dma_map bits in e1000 or e1000e (I will double check this though, as I only looked very quickly into this question).

I had not thought of disabling the DMAR.  I assume you can do that in the BIOS?  I will look into that at the same time that I look into disabling the BMC


Regarding the DMIdecode information, we should have a later bios than that, but IIRC it was dated october, not november, we'll look into a later bios.

I know your out the rest of the week, but I'll update with results of the outstanding questions here as soon as I'm able, in case you're keeping tabs while your out.

Comment 115 Neil Horman 2010-02-18 02:54:16 UTC

quick note, just tried booting with mem=4G, still failed.

Comment 116 Neil Horman 2010-02-18 16:05:46 UTC

updated bios to 10.29.09 version, disabled ME, failures still occur.

Comment 117 Jesse Brandeburg 2010-02-18 18:31:28 UTC

Please try mem = 3G too

Comment 118 Neil Horman 2010-02-18 20:27:37 UTC

I've gone all the way down to mem=2G today, with no change in behavior, unfortunately.  It did occur to me however that the bios might be mapping ram oddly in such a way that low virtual addresses are still located in physical locations above 4GB, so I'm currently hacking up the network stack so that, in addition to setting the device dma_mask to be 32 bits (for transmit bounce buffering), calls to netdev_alloc_skb pass GFP_DMA as a allocation flag, which should guarantee low memory allocations (I know I should use GFP_DMA32, but that flag was very new with the 2.6.18 kernel and doesn't work quite right with the slab allocation code).  GFP_DMA should be overkill, and ok for the narrow reproducer we have.  I'll post results shortly.

Comment 119 Neil Horman 2010-02-23 17:29:43 UTC

Got some time on Jesse B. piketon box today.  Set up the reproducer there and I think I was able to make the problem reproduce.  Asking Jesse to look at it more closely to confirm.

Comment 120 Jesse Brandeburg 2010-02-24 02:41:33 UTC

I made some good progress on this bug today, I made a driver that is showing that the driver is processing !EOP packets (should never happen on a network with this MTU) with length 0x800, and with the copybreak code disabled too.

this explains the memory corruption of the skb_shinfo(skb) area, I think.

Next I will prepare a driver that prints the bad skbs, and allocates a much larger area to DMA into.

Neil, this sounds much like the other problem we had on the Z200 system where you were getting longer than expected DMA (aka LPE bit didn't work right)

I'm suspecting that the NIC hardware is getting confused by something in the packets coming in from the NFS server.  The fact that we are getting 0x800 length packets without EOP set means there is likely a hardware bug.

Comment 121 Neil Horman 2010-02-24 03:29:21 UTC

Jesse, I agree completely, what you describe is definately simmilar to what we saw on the z200 systems, and explains why bumping up the rx_buf_size in the driver avoided the problem at hand.

Just so that I can be sure I understand what you're seeing packets arrive at the driver with a length of 2048 bytes, but no EOP bit set, yes?  Can you send me the patch that you're currently working with.  I'd like to put that in place here so see if I can get you any additional info that might be helpful in diagnosing the problem.  If you have any suggestions/requests for stuff that you'd be interested in seeing, I'll get on putting it together.

Comment 122 Jesse Brandeburg 2010-02-24 08:45:02 UTC

(In reply to comment #121)
> Just so that I can be sure I understand what you're seeing packets arrive at
> the driver with a length of 2048 bytes, but no EOP bit set, yes?  Can you send

yes, it appears that simultaneously with the nfs server losing connection, we see corrupt (long) packets make their way to memory, the redzone logic kicks in, and the driver continues to receive long packets.

The next step is to hexdump the packet data, and possibly capture with tcpdump the data that is being sent from the sol-10 server (for comparison)

> me the patch that you're currently working with.  I'd like to put that in place
> here so see if I can get you any additional info that might be helpful in
> diagnosing the problem.  If you have any suggestions/requests for stuff that
> you'd be interested in seeing, I'll get on putting it together.    

I'll attach the patch to my driver that I'm using for debug code.  It makes the driver last much longer before a panic occurs.  I guess since we are getting kdumps that we can probably print a few of the ring variables (buffer_info start, tx and rx descriptor ring start addresses, and such at open time, and be able to quickly browse memory in the crash tool.

Comment 123 Jesse Brandeburg 2010-02-24 08:46:58 UTC

Created attachment 395964 [details]
patch to e1000e for debug

this patch is to our out of tree driver, and as such may not apply cleanly to your source.  the changes are pretty simple however, and I feel it is actually good that the issue reproduces on all drivers we've tried.

Comment 124 Neil Horman 2010-02-24 18:20:53 UTC

hey, jesse, thanks for the debug patch.  Just as a quick update, I was talking with gospo about this, and he noted some bits in the rfctl register, which he though might be germaine.  Spercifically he asked me to test with the NFSW, NFSR and ISCSI filtering bits set to disable those filters.  Unfortunately, even with those bits set, we still encountered the same failure as before.  Don't know if thats meaningful to you, but I figured it would be worth sharing, just in case.

I'm going to set up your debug patch this afternoon.

Comment 125 Jesse Brandeburg 2010-02-24 18:53:05 UTC

I have an updated debug patch, and (last night before bed) I set the RFCTL bits DIS_NFSR and DIS_NFSW, and your failing test ran all night with no issues, and no !EOP packets.

So, at this point I believe we have a solid workaround and I am working with the hardware team to figure out exactly why it fixes the issue and what the root cause is.

Comment 126 Jesse Brandeburg 2010-02-24 18:57:22 UTC

Created attachment 396127 [details]
2nd debug patch with proposed workaround

this patch sets 
        RFCTL          000000c0

which for me allowed the failing test to run for > 10 hours so far.

Comment 127 Andy Gospodarek 2010-02-24 19:07:41 UTC

I was so close.  The patch in comment #126 is almost exactly what I did after reading the programmers guide this morning.

It looks like I just put those bits in the wrong spot initially since the hardware Neil was testing doesn't appear to evaluate to true for this statement:

        if (!(adapter->flags & FLAG_IS_ICH) && (pages <= 3) &&
            (PAGE_SIZE <= 16384) && (rctl & E1000_RCTL_LPE))

This was the patch Neil tested.

diff --git a/drivers/net/e1000e/defines.h b/drivers/net/e1000e/defines.h
index db05ec3..e301e26 100644
--- a/drivers/net/e1000e/defines.h
+++ b/drivers/net/e1000e/defines.h
@@ -320,6 +320,8 @@
 #define E1000_RXCSUM_IPPCSE    0x00001000   /* IP payload checksum enable */
 
 /* Header split receive */
+#define E1000_RFCTL_NFSW_DIS            0x00000040
+#define E1000_RFCTL_NFSR_DIS            0x00000080
 #define E1000_RFCTL_ACK_DIS             0x00001000
 #define E1000_RFCTL_EXTEN               0x00008000
 #define E1000_RFCTL_IPV6_EX_DIS         0x00010000
diff --git a/drivers/net/e1000e/netdev.c b/drivers/net/e1000e/netdev.c
index 88d54d3..d6eba1c 100644
--- a/drivers/net/e1000e/netdev.c
+++ b/drivers/net/e1000e/netdev.c
@@ -2398,6 +2398,12 @@ static void e1000_setup_rctl(struct e1000_adapter *adapter)
                 */
                rfctl |= (E1000_RFCTL_IPV6_EX_DIS |
                          E1000_RFCTL_NEW_IPV6_EXT_DIS);
+               /*
+                * disable packet split for nfs write request headers and
+                * nfs read reply headers
+                */
+               rfctl |= (E1000_RFCTL_NFSW_DIS |
+                         E1000_RFCTL_NFSR_DIS);
 
                ew32(RFCTL, rfctl);

Comment 128 Neil Horman 2010-02-24 19:54:19 UTC

yeah, thats looking _way_ better!  I'm running the updated patch on piketon-02, and its working smoothly.  I'm going to let this run over night to be sure, but its definately looking good.

So, looking forward, jesse, how shall we handle this?  do you see disabling the NFS filters universally in e1000e as the right solution, or can it be narrowed down to certain revisions?  Or will this not be the final fix at all (opting instead for a eeprom/firmware change to modify the behavior of whatever it is the silicon is doing when these filters are on)?  This all assumes of course we don't hit more problems, but this is looking good so far.

Thanks for all your hard work on this!

Comment 129 Andy Gospodarek 2010-02-24 20:01:42 UTC

Created attachment 396144 [details]
e1000e-nfs-packet-split-disable2.patch

Jesse, this is the patch Neil has been running successfully for a while.  It's like yours, but compiles. ;-)

Anyway do you think this will go upstream or are you likely to try and fix the actual problem quickly rather than posting this in the interim?

We would like to ship something that works so we are inclined to ship this patch since we literally need to decided *today* what will be done for 5.5.

Comment 130 Jesse Brandeburg 2010-02-24 21:45:29 UTC

To be conservative, I would limit it to FLAG_IS_ICH parts to set those bits.

it *should* be safe to have them set for all e1000e parts, but its a lot fewer to test if we just do the change within a FLAG_IS_ICH section.

otherwise andy your patch looks fine, and thanks for fixing it up. :-)

@Neil: The driver fix is the only way to go at this point.

Comment 131 Neil Horman 2010-02-25 01:59:35 UTC

@Andy, if its ok with you, I'll make & test jesse's requested modification, by adding a adapter->flag & FLAG_IS_ICH conditional around your rfctl setting tonight, so we're ready to do something with this in time for snap4 tomorrow.


@Jesse, how do you want to handle upstream, given that the driver fix is our way forward?  Shall Andy or I just post it, or do you want to handle upstream after you roll it around internally for a bit?

Comment 132 Neil Horman 2010-02-25 02:58:41 UTC

Created attachment 396206 [details]
updated e1000e patch

heres the updated patch I'm running.  Its Andy's patch, wrapped in a conditional check for FLAG_IS_ICH.  I've had it running well for an hour under our reproducer.  I'm going to leave it running overnight.

Comment 136 Jesse Brandeburg 2010-02-26 18:51:12 UTC

I'm going to propose a slightly different patch with the same functionality, which we'll maintain copyright on, and modifies ich8lan.c.  We'll post and test it via our kernel patch process and jeff kirsher.

also, it is worth mentioning the details that you sent me in email to reproduce this.

solaris-10 nfs server
RHEL5 client

on RHEL5 client,
mount -t nfs -o udp,nfsvers=2,rw 192.168.10.10:/export/home/connectathon /mnt

in that nfs directory, put these two files.
[jbrandeb@jbrandeb-piketon nhorman]$ cat fail.sh
#!/bin/sh -x

while true;
do
        date
        ./make_dir
        sleep 2;
        rm -rf rm-test/
done
<EOF>

[jbrandeb@jbrandeb-piketon nhorman]$ cat make_dir
#!/bin/sh

echo create
mkdir rm-test
cd rm-test
touch `seq 1 50`
cd ..
<EOF>

Comment 137 Jesse Brandeburg 2010-02-26 20:00:22 UTC

Created attachment 396650 [details]
alternate code location for change

here is the patch similar to the one we'll push upstream.

Comment 138 Andy Gospodarek 2010-02-26 20:03:19 UTC

Thanks, Jesse.  That seems reasonable to me.

Comment 140 Jesse Brandeburg 2010-03-01 17:58:14 UTC

I confirmed on Friday that I could not reproduce this issue at all on ICH10.

Comment 141 Andy Gospodarek 2010-03-01 18:07:11 UTC

Jesse, I cannot reproduce on ICH10 either.

I tried a McCreary with 82567LM-3 (8086:10de) and it ran for around an hour with no problems.

Comment 142 Andy Gospodarek 2010-03-01 19:29:58 UTC

Jesse, I also tried another PCH system with a X3440 (Lynnfield) processor while running a debug kernel and cannot get it to fail there either.

Comment 144 Andy Gospodarek 2010-03-01 23:26:27 UTC

Jesse, have you had any luck recreating this (it works best with a -debug kernel) on any system other than the piketon system?

Comment 145 Jesse Brandeburg 2010-03-01 23:31:33 UTC

(In reply to comment #144)
> Jesse, have you had any luck recreating this (it works best with a -debug
> kernel) on any system other than the piketon system?    

short answer: no.

That's a good thing, and we can move forward with the workaround I attached as it won't hurt any of the other systems (esp since they don't use packet split)

Comment 146 Neil Horman 2010-03-02 11:38:21 UTC

*** Bug 541424 has been marked as a duplicate of this bug. ***

Comment 149 Jarod Wilson 2010-03-03 15:44:16 UTC

in kernel-2.6.18-191.el5
You can download this test kernel from http://people.redhat.com/jwilson/el5

Please update the appropriate value in the Verified field
(cf_verified) to indicate this fix has been successfully
verified. Include a comment with verification details.

Comment 153 errata-xmlrpc 2010-03-30 07:23:15 UTC

An advisory has been issued which should help the problem
described in this bug report. This report is therefore being
closed with a resolution of ERRATA. For more information
on therefore solution and/or where to find the updated files,
please follow the link below. You may reopen this bug report
if the solution does not work for you.

http://rhn.redhat.com/errata/RHSA-2010-0178.html