541424 – [RHEL5.4-z]kernel-debug, slab error in cache_alloc_debugcheck_after()

Summary: [RHEL5.4-z]kernel-debug, slab error in cache_alloc_debugcheck_after()

Keywords:
Status:	CLOSED DUPLICATE of bug 558809
Alias:	None
Product:	Red Hat Enterprise Linux 5
Classification:	Red Hat
Component:	kernel
Sub Component:
Version:	5.4.z
Hardware:	All
OS:	Linux
Priority:	medium
Severity:	medium
Target Milestone:	rc
Target Release:	---
Assignee:	Neil Horman
QA Contact:	Red Hat Kernel QE team
Docs Contact:
URL:	http://rhts.redhat.com/cgi-bin/rhts/t...
Whiteboard:
Duplicates (1):	545074
Depends On:
Blocks:	545074 545863

Reported:	2009-11-25 21:21 UTC by Jeff Burke
Modified:	2010-03-02 11:38 UTC
CC List:	13 users
Fixed In Version:
Doc Type:	Bug Fix
Doc Text:
Clone Of:
Clones:	545074
Environment:
Last Closed:	2010-03-02 11:38:21 UTC
Target Upstream Version:
Embargoed:
Dependent Products:

Attachments
first try at a patch which places the caller in 2 places in a slab (4.42 KB, patch) 2009-11-30 22:06 UTC, Eric Paris	no flags
UNTESTED!!! tripwire memory allocator (2.79 KB, patch) 2009-12-03 05:21 UTC, Eric Paris	no flags
targeted debug patch (1.56 KB, patch) 2009-12-03 20:34 UTC, Neil Horman	no flags
test fix for our corruptor problem. (1.75 KB, patch) 2009-12-04 16:13 UTC, Neil Horman	no flags
updated version of fix under test (1.84 KB, patch) 2009-12-04 20:16 UTC, Neil Horman	no flags
new patch (1.96 KB, patch) 2009-12-04 21:37 UTC, Neil Horman	no flags
final patch for e1000, e1000e and ixgb (4.85 KB, patch) 2009-12-07 15:23 UTC, Neil Horman	no flags
final final patch :) (3.91 KB, patch) 2009-12-07 16:15 UTC, Neil Horman	no flags
my debug patch (1.78 KB, patch) 2009-12-14 19:06 UTC, Neil Horman	no flags
dmidecode from system (21.26 KB, text/plain) 2009-12-22 14:25 UTC, Jeff Burke	no flags

Description Jeff Burke 2009-11-25 21:21:35 UTC

Description of problem:
While running the kerneltier1 tests, the x86_64 kernel-debug kernel oopses/panics.
The first sign of the issue is the following message:
slab error in cache_alloc_debugcheck_after(): cache `size-2048': double free, or memory outside object was overwritten

Version-Release number of selected component (if applicable):
2.6.18-164.8.1.el5

How reproducible:
Always

Steps to Reproduce:
1. Use the attached XML in RHTS environment.
  
Actual results:
slab error in cache_alloc_debugcheck_after(): cache `size-2048': double free, or memory outside object was overwritten

Call Trace:
 [<ffffffff8000ce81>] cache_alloc_debugcheck_after+0xa2/0x1c1
 [<ffffffff8000b065>] kmem_cache_alloc+0xdf/0xeb
 [<ffffffff8004c2eb>] audit_alloc+0x70/0x123
 [<ffffffff8013911f>] selinux_task_alloc_security+0x1e/0x55
 [<ffffffff800206c3>] copy_process+0x6e6/0x17a1
 [<ffffffff800688bd>] _spin_unlock_irq+0x24/0x27
 [<ffffffff800a2749>] alloc_pid+0x26e/0x294
 [<ffffffff80032fca>] do_fork+0x68/0x1c0
 [<ffffffff800c135a>] audit_syscall_entry+0x180/0x1b3
 [<ffffffff800602a6>] tracesys+0xd5/0xdf
 [<ffffffff8006045f>] ptregscall_common+0x67/0xac

ffff81004d934ad8: redzone 1:0x5a2c0500, redzone 2:0x5a2cf071

slab error in cache_alloc_debugcheck_after(): cache `size-2048': double free, or memory outside object was overwritten

Call Trace:
 [<ffffffff8000ce81>] cache_alloc_debugcheck_after+0xa2/0x1c1
 [<ffffffff800e5e1b>] __kmalloc+0x11d/0x129
 [<ffffffff8003272d>] expand_files+0x124/0x2a7
 [<ffffffff800237f3>] dup_fd+0x153/0x2a8
 [<ffffffff8004bd71>] copy_files+0x47/0x63
 [<ffffffff800206f1>] copy_process+0x714/0x17a1
 [<ffffffff800688bd>] _spin_unlock_irq+0x24/0x27
 [<ffffffff800a2749>] alloc_pid+0x26e/0x294
 [<ffffffff80032fca>] do_fork+0x68/0x1c0
 [<ffffffff800233fa>] fd_install+0x2e/0x68
 [<ffffffff800c135a>] audit_syscall_entry+0x180/0x1b3
 [<ffffffff800602a6>] tracesys+0xd5/0xdf
 [<ffffffff8006045f>] ptregscall_common+0x67/0xac

ffff810058f896f0: redzone 1:0x5a2cb422, redzone 2:0x5a2cf071
general protection fault: 0000 [1] SMP 
last sysfs file: /devices/pci0000:00/0000:00:19.0/irq
CPU 3 
Modules linked in: nfs fscache nfs_acl autofs4 hidp rfcomm l2cap bluetooth lockd sunrpc ipv6 xfrm_nalgo crypto_api cpufreq_ondemand acpi_cpufreq freq_table dm_multipath scsi_dh video hwmon backlight sbs i2c_ec i2c_core button battery asus_acpi acpi_memhotplug ac parport_pc lp parport floppy snd_hda_intel snd_seq_dummy snd_seq_oss snd_seq_midi_event snd_seq snd_seq_device snd_pcm_oss shpchp snd_mixer_oss snd_pcm e1000e sr_mod serio_raw snd_timer snd_page_alloc snd_hwdep snd soundcore cdrom sg pcspkr dm_raid45 dm_message dm_region_hash dm_mem_cache dm_snapshot dm_zero dm_mirror dm_log dm_mod ahci libata sd_mod scsi_mod ext3 jbd uhci_hcd ohci_hcd ehci_hcd
Pid: 0, comm: swapper Not tainted 2.6.18-164.8.1.el5debug #1
RIP: 0010:[<ffffffff8000d241>]  [<ffffffff8000d241>] put_page+0x0/0x2e
RSP: 0018:ffff810037c6fcc8  EFLAGS: 00010246
RAX: 0000000000000000 RBX: ffff810021943d30 RCX: 0000000000004300
RDX: ffff81004d934948 RSI: 6200000000000000 RDI: 5b67103a64d12400
RBP: 0000000000000001 R08: 0000000000000001 R09: 0000000000000000
R10: ffffffff800555c6 R11: 0000000000000000 R12: 0000000000430043
R13: ffff81006b74c770 R14: ffff81004d9342fc R15: 00000000ffffffff
FS:  0000000000000000(0000) GS:ffff810037cb4cc0(0000) knlGS:0000000000000000
CS:  0010 DS: 0018 ES: 0018 CR0: 000000008005003b
CR2: 0000000019672b28 CR3: 00000000592d5000 CR4: 00000000000006e0
Process swapper (pid: 0, threadinfo ffff810037c68000, task ffff810037c66400)
Stack:  ffffffff80234d83 ffff810021943d30 ffff810021943d30 ffff810021943d30
 ffffffff8002a077 0000000000000008 ffffffff800557d0 0000000000000000
 0000000000000000 4400810000000002 ffff810021943d30 ffffffff8038e020
Call Trace:
 <IRQ>  [<ffffffff80234d83>] skb_release_data+0x5f/0x99
 [<ffffffff8002a077>] __kfree_skb+0x11/0x1a
 [<ffffffff800557d0>] udp_rcv+0x385/0x542
 [<ffffffff8003698d>] ip_local_deliver+0x19d/0x263
 [<ffffffff80037b0a>] ip_rcv+0x539/0x57c
 [<ffffffff80021b4c>] netif_receive_skb+0x3ce/0x406
 [<ffffffff8822bb63>] :e1000e:e1000_receive_skb+0x177/0x198
 [<ffffffff8823023b>] :e1000e:e1000_clean_rx_irq+0x250/0x2f9
 [<ffffffff8822e2eb>] :e1000e:e1000_clean+0x7c/0x2b1
 [<ffffffff8000d0a5>] net_rx_action+0x60/0x1fc
 [<ffffffff8000d0fb>] net_rx_action+0xb6/0x1fc
 [<ffffffff80012e24>] __do_softirq+0x54/0x152
 [<ffffffff80012e64>] __do_softirq+0x94/0x152
 [<ffffffff800613d0>] call_softirq+0x1c/0x28
 [<ffffffff80070bc1>] do_softirq+0x35/0xa0
 [<ffffffff80070b83>] do_IRQ+0xfb/0x104
 [<ffffffff801a5ec8>] acpi_processor_idle_simple+0x189/0x324
 [<ffffffff80060652>] ret_from_intr+0x0/0xf
 <EOI>  [<ffffffff80065fc3>] __sched_text_start+0xc03/0xc3e
 [<ffffffff801a5ec8>] acpi_processor_idle_simple+0x189/0x324
 [<ffffffff801a5ec8>] acpi_processor_idle_simple+0x189/0x324
 [<ffffffff801a5ed2>] acpi_processor_idle_simple+0x193/0x324
 [<ffffffff801a5ec8>] acpi_processor_idle_simple+0x189/0x324
 [<ffffffff801a5d3f>] acpi_processor_idle_simple+0x0/0x324
 [<ffffffff801a5d3f>] acpi_processor_idle_simple+0x0/0x324
 [<ffffffff8004bc67>] cpu_idle+0x9a/0xbd
 [<ffffffff8007b27f>] start_secondary+0x498/0x4a7

Code: 8b 07 f6 c4 40 74 05 e9 80 31 02 00 8b 47 08 85 c0 75 0a 0f 
RIP  [<ffffffff8000d241>] put_page+0x0/0x2e
 RSP <ffff810037c6fcc8>

Expected results:
Test should run to completion

Additional info:
This issue may be related to the following BZ:
https://bugzilla.redhat.com/show_bug.cgi?id=530619

Comment 3 Neil Horman 2009-11-25 21:47:47 UTC

Jeff, do you have a system set up to look at this vmcore by any chance?  I'd like to take a look at it if I can

Comment 5 Neil Horman 2009-11-26 01:18:51 UTC

Thanks don, I actually just reserved a machine.  I'll look at the core asap.

Comment 6 Neil Horman 2009-11-26 01:45:15 UTC

quick note to self: I'm not 100% sure that this is strictly an e1000e problem (I'm not sure that it's not either, mind).  But looking at the core in comment #2, it appears that the first logged slab corruption comes right after a rash of messages about selinux initializing nfs services and eth0 entering and leaving promiscuous mode again and again.  Clearly e1000e has some involvement here, but I'm not sure if it's a problem all on its own.  It might be interesting to see if the problem is reproducible with selinux disabled.

Comment 7 Neil Horman 2009-11-28 01:53:57 UTC

note to self: in looking through the additional patches for 5.5 that aren't in 5.4.z, I ran across bz 516216. It's an selinux bug that shows crashes at various points in the network stack.  It doesn't indicate slab corruption, but they were using a production kernel, rather than the debug kernel, so I'm not overly surprised about that.  I think it's worth testing a kernel with the patch from 516216 included in zstream.  I'm putting the kernel together now, and will run an rhts test workflow as soon as it's ready.

Comment 8 Neil Horman 2009-11-29 01:21:21 UTC

arrggh.  Scrap my thought here.  It appears this patch is already included in zstream.  Back to the drawing board.

Comment 9 Andy Gospodarek 2009-11-30 13:57:16 UTC

Have you tried running this test with 'iommu=soft' (presuming the system has an IOMMU) on the command line?

This smells like it could be similar to bug 530619.

Comment 10 Jeff Burke 2009-11-30 14:02:53 UTC

Andy,
   The issue still occurs even with iommu=soft

Comment 11 Neil Horman 2009-11-30 15:01:55 UTC

I've been looking over the logs of the vmcore that we have, and the first thing I notice is that the oops in the network driver isn't the first slab corruption we log.  The first one is in selinux code, in audit_alloc, and it comes on the heels of several SELinux nfs initializations coupled with entry to/exit from promiscuous mode on the NIC.  These are the log messages and the first corruption report:
SELinux: initialized (dev 0:18, type nfs), uses genfs_contexts
device eth0 left promiscuous mode
device eth0 entered promiscuous mode
SELinux: initialized (dev 0:18, type nfs), uses genfs_contexts
slab error in cache_alloc_debugcheck_after(): cache `size-2048': double free, or memory outside object was overwritten

Call Trace:
 [<ffffffff8000ce81>] cache_alloc_debugcheck_after+0xa2/0x1c1
 [<ffffffff8000b065>] kmem_cache_alloc+0xdf/0xeb
 [<ffffffff8004c2eb>] audit_alloc+0x70/0x123
 [<ffffffff8013911f>] selinux_task_alloc_security+0x1e/0x55
 [<ffffffff800206c3>] copy_process+0x6e6/0x17a1
 [<ffffffff800688bd>] _spin_unlock_irq+0x24/0x27
 [<ffffffff800a2749>] alloc_pid+0x26e/0x294
 [<ffffffff80032fca>] do_fork+0x68/0x1c0
 [<ffffffff800c135a>] audit_syscall_entry+0x180/0x1b3
 [<ffffffff800602a6>] tracesys+0xd5/0xdf
 [<ffffffff8006045f>] ptregscall_common+0x67/0xac

ffff81004d934ad8: redzone 1:0x5a2c0500, redzone 2:0x5a2cf071

As a test, I've submitted this job:
http://rhts.redhat.com/cgi-bin/rhts/jobs.cgi?id=104688&type=Single
To run the same test, but with selinux disabled to see if the error reproduces.

Comment 12 Neil Horman 2009-11-30 15:49:28 UTC

scrap that test, it was malformed, this is the right one:
http://rhts.redhat.com/cgi-bin/rhts/jobs.cgi?id=104696

Comment 14 Neil Horman 2009-11-30 17:18:54 UTC

Ok, that last job with selinux disabled passed all the connectathon tests.  I'm resubmitting the job to see if the results are repeatable.  If they are, I'll get eparis to look at this with me

Comment 15 Neil Horman 2009-11-30 17:21:20 UTC

new test run:
http://rhts.redhat.com/cgi-bin/rhts/jobs.cgi?id=104719

Comment 16 Neil Horman 2009-11-30 18:42:30 UTC

Ok, two rhts test runs have passed connectathon now with selinux disabled, whereas with selinux enabled it seems to fail every time.  eparis, could you please take a look at the core and offer your opinion about what might be going on here?  My first thought was that perhaps we're overflowing an skbuff with xattrs to transport selinux labels on the mount, which leads to various slab object overruns that eventually corrupt the send path, but that's just a complete wild guess

Thanks!

Comment 17 Neil Horman 2009-11-30 19:44:39 UTC

note to self: Looking at the core, on all the slab corruptions, it's only redzone1 that's corrupt, and its first halfword is still that of RED_ACTIVE (i.e. it's only the second halfword that's corrupt).  This I think strongly suggests that we're seeing an overrun of the object that immediately precedes the corrupted one in the size-2048 slab cache.  I'm digging through the core trying to find the affected object.
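
The halfword observation can be checked mechanically. The following is a minimal sketch, assuming that the intact redzone 2 value printed in the reports above (0x5a2cf071) is the reference pattern both redzones were expected to hold at that point. The helper names are hypothetical and are not taken from slab.c.

```c
#include <assert.h>
#include <stdint.h>
#include <stdio.h>

/* Result of comparing one 32-bit redzone word against a reference pattern. */
struct halfword_diff {
    int upper_intact; /* bits 31..16 match the reference */
    int lower_intact; /* bits 15..0 match the reference */
};

/* Compare the two 16-bit halves of an observed redzone word separately. */
struct halfword_diff compare_redzone(uint32_t observed, uint32_t expected)
{
    struct halfword_diff d;

    d.upper_intact = (observed >> 16) == (expected >> 16);
    d.lower_intact = (observed & 0xffffu) == (expected & 0xffffu);
    return d;
}

/* Print a one-line summary for a named redzone (hypothetical helper). */
void report_redzone(const char *name, uint32_t observed, uint32_t expected)
{
    struct halfword_diff d;

    assert(name != NULL);
    d = compare_redzone(observed, expected);
    printf("%s: upper half %s, lower half %s\n", name,
           d.upper_intact ? "intact" : "corrupt",
           d.lower_intact ? "intact" : "corrupt");
}
```

Feeding it the redzone 1 values from the description (0x5a2c0500 and 0x5a2cb422) with 0x5a2cf071 as the reference flags only the low 16 bits as damaged in both cases.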

Comment 18 Neil Horman 2009-11-30 20:37:20 UTC

note to self: While the first corruptions were long since recycled or otherwise overwritten again, a scan of the vmcore for size-2048 objects that were corrupt reveals 2:
object at ffff81006c41ee58
object at ffff81006c8355f0

The objects immediately preceding these are, respectively:
ffff81006c41e640
ffff81006c834dd8
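
The relationship between each corrupt object and its predecessor is plain address arithmetic. The sketch below assumes a per-object stride of 2072 bytes, which is inferred only from the difference between the two address pairs listed above (2048 bytes of data plus three 8-byte debug words would give the same number, but that breakdown is an assumption). The function name is hypothetical.

```c
#include <assert.h>
#include <stdint.h>

/* Assumed stride between consecutive objects in the debug size-2048 cache,
 * inferred from the addresses quoted in this comment. */
#define ASSUMED_STRIDE 2072u

/* Address of the object that sits immediately before obj in the same slab. */
uint64_t preceding_object(uint64_t obj, uint64_t stride)
{
    assert(obj >= stride);
    return obj - stride;
}
```

This does not check that the predecessor lies in the same slab; for the first object in a slab there is no predecessor to blame.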


In each case, the corrupt object itself is filled with 0x6b, which might be an fscache poisoning technique, but I can't be sure.  The preceding objects both appear to be allocated, but I don't recognize any obvious patterns in the data they hold yet.


cc-ing jlayton to chime in on the 6b pattern.  Jeff, does that pattern have anything to do with fscache that you're aware of?  I see fscache does some cookie checking using the 0x6b6b value, but I can't see it ever get set anywhere.  Do you have any thoughts?

Comment 19 Neil Horman 2009-11-30 20:42:31 UTC

scratch the fscache thing, it's POISON_FREE that's marking the slab object.  So after this object is freed, the preceding object overwrites it in such a way that the object itself isn't touched, but the redzone has a corrupt halfword.  This kind of feels like an off by one error in an array or something
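
One way to confirm that a freed object body is untouched is to count how far the poison byte runs. A minimal sketch, assuming the 0x6b fill byte mentioned above; the function name is hypothetical and the check is a user-space illustration, not the allocator's own code.

```c
#include <assert.h>
#include <stddef.h>

/* Fill byte observed in the freed objects, per the comments above. */
#define ASSUMED_POISON_FREE 0x6b

/* Number of leading bytes in buf that still hold the poison value. */
size_t poison_prefix_len(const unsigned char *buf, size_t len,
                         unsigned char poison)
{
    size_t i;

    assert(buf != NULL || len == 0);
    for (i = 0; i < len && buf[i] == poison; i++)
        ;
    return i;
}
```

A result shorter than the object's data size gives the offset of the first byte written after the free.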

Comment 20 Neil Horman 2009-11-30 21:03:51 UTC

note to self: The elements preceding the corrupt elements I noted have no valid redzone2 value, which indicates that they are themselves corrupt, but it's not been detected yet, since they've not gone back to the slab allocator yet.  This I think lends more support to the theory that we had a slab overrun which killed us.  I still unfortunately can't identify the contents of the preceding objects (listed in comment 18).  It's not an skb I don't think (at least I can't find the ip address of any of the nfs servers used in the connectathon suite).  I'll keep looking....

Comment 21 Neil Horman 2009-11-30 21:18:05 UTC

note to self:

1) eparis asked that I run a test with the audit service disabled to see if that had any effect on the outcome.  This is that job:
http://rhts.redhat.com/cgi-bin/rhts/jobs.cgi?id=104743

2) Given that the preceding slabs appear corrupt from overruns, the userword was corrupted as well.  I think I'm going to hack together a debug kernel that places the dbg_userword for slab debugging at the start of the slab, rather than the end.  This way, in the case of an overrun, we can hopefully salvage the slabs and determine who they belong to.  I'll get started on that shortly.

Comment 22 Eric Paris 2009-11-30 22:06:14 UTC

Created attachment 374893 [details]
first try at a patch which places the caller in 2 places in a slab

currently slabs look like so

REDZONE1
DATA
REDZONE2
ALLOCATING CALLER

since we think data is destroying redzone2, the allocating caller and the next redzone1, this patch lays it out like so

REDZONE1
ALLOCATING CALLER
DATA
REDZONE2
ALLOCATING CALLER

Haven't tried it yet....
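
The two layouts can be written down as byte offsets. This is only a sketch: it assumes every debug field is one 8-byte word and that there is no alignment padding, neither of which is confirmed by the patch itself. The struct and function names are hypothetical.

```c
#include <assert.h>
#include <stddef.h>

#define DEBUG_WORD 8u              /* assumed size of each debug field */
#define NO_FIELD   ((size_t)-1)    /* marks a field the layout does not have */

/* Byte offsets of each region from the start of one slab object. */
struct slab_layout {
    size_t redzone1;
    size_t caller_front;
    size_t data;
    size_t redzone2;
    size_t caller_back;
    size_t total;
};

/* Layout before the patch: caller stored only after redzone2. */
struct slab_layout layout_current(size_t data_size)
{
    struct slab_layout l;

    assert(data_size > 0);
    l.redzone1 = 0;
    l.caller_front = NO_FIELD;
    l.data = l.redzone1 + DEBUG_WORD;
    l.redzone2 = l.data + data_size;
    l.caller_back = l.redzone2 + DEBUG_WORD;
    l.total = l.caller_back + DEBUG_WORD;
    return l;
}

/* Layout with the patch: a second copy of the caller right after redzone1. */
struct slab_layout layout_patched(size_t data_size)
{
    struct slab_layout l;

    assert(data_size > 0);
    l.redzone1 = 0;
    l.caller_front = l.redzone1 + DEBUG_WORD;
    l.data = l.caller_front + DEBUG_WORD;
    l.redzone2 = l.data + data_size;
    l.caller_back = l.redzone2 + DEBUG_WORD;
    l.total = l.caller_back + DEBUG_WORD;
    return l;
}
```

Under these assumptions a 2048-byte object grows from 2072 to 2080 bytes, and an overrun that runs off the end of DATA can no longer reach the front copy of the caller in the same object.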

Comment 24 Neil Horman 2009-12-01 15:26:04 UTC

http://brewweb.devel.redhat.com/brew/taskinfo?taskID=2112222

Here's a brew build with Eric's patch included.  I'll build a yum repo when it's done and run the rhts suite on it.  With luck the core it generates will result in us knowing the owner of the slab that corrupted its neighbor, and give us a lead on the root cause of this bug

Comment 25 Neil Horman 2009-12-01 18:31:24 UTC

http://rhts.redhat.com/cgi-bin/rhts/jobs.cgi?id=105094

Looks like our last rhts test failed due to transient network errors in rhts.  However, I've finished building Eric's debug patch and deployed it to a repo on my people page. I've submitted the above rhts test to install that debug kernel and re-run connectathon.  If rhts works properly and we get a core, it should point us at the owner of the slab that's overrunning its boundaries

Comment 26 Neil Horman 2009-12-01 19:42:04 UTC

dang, I forgot to reenable audit/selinux in the above test.  Here's the resubmission:
http://rhts.redhat.com/cgi-bin/rhts/jobs.cgi?id=105097

Comment 27 Neil Horman 2009-12-01 21:12:01 UTC

grr, even with the fixups the kernel still passed, not sure why.  We're using the debug variant of the kernel and it's got selinux and audit enabled, so we should see a failure.  I'm running the test once again just in case:
http://rhts.redhat.com/cgi-bin/rhts/jobs.cgi?id=105107

I'll also look at the patch to see if anything in there might be preventing a failure

Comment 28 Neil Horman 2009-12-01 21:34:36 UTC

Gah, jburke just gave me an education on RHTS!  Apparently the system_rebooted message is a giveaway to a panic condition which i misunderstood to be a result of a problem during the release of the system reservation.  When this next run from the above comment completes, we should have a vmcore, I just need to check for it rather than just assuming all the pass indicators on the rhts page mean that all is well.  Thanks for the clarification jeff!

Comment 29 Neil Horman 2009-12-01 23:00:57 UTC

gahh!  looks like in the most recent test the system just hard reset in the middle of a test (i.e. hard reset, no panic occurred, so kdump never triggered).  Since we got a crash previously, I'm re-submitting the test to see if we can get a panic this time:
http://rhts.redhat.com/cgi-bin/rhts/jobs.cgi?id=105113

Comment 30 Neil Horman 2009-12-02 13:50:19 UTC

sigh, so jburke and I looked at the latest failure in rhts, and found that we were in fact saving a core successfully with kdump, but after the core is saved, we go to reboot and the ehci_hcd encounters a fatal error which appears to prevent a successful sync on the drives, and as such, the saved core is lost.  I'm trying once again without iommu=soft, just to see if we get lucky and save a core.  Also, I'm building a scratch kexec-tools package that we can use to just drop to a shell when a panic occurs in kdump.  This will allow us to save a vmcore manually to several places of our choosing, so that we can analyze it.

The good news is that, from looking at the panic, Eric's redzone/userword changes to slabdebug appear to be working.  We don't print the correct userword on panic, but that's just because it's not the corrupt slab we want to look at but rather the preceding one.  Assuming it's not itself corrupted, the vmcore should have the info we're after.

Comment 32 Neil Horman 2009-12-02 20:32:46 UTC

ok, glory!  We had one of the systems dump and I got dropped to a shell in kdump. I've manually saved the vmcore to the local drive successfully and am in the process of transferring a copy to nec-em21.rhts.redhat.com.  We'll start looking at it as soon as it's there.  Note this core was produced by the slbdebug kernel, and the debuginfo for that kernel is installed on nec-em21

Comment 33 Neil Horman 2009-12-02 21:14:21 UTC

ok, we have three new cores saved on nec-em21.rhts.bos.redhat.com now, all from the slbdbg kernel.  eparis, jlayton and I will start reviewing them shortly.

Comment 34 Neil Horman 2009-12-02 21:26:52 UTC

note to self, looking at one of our new cores, it looks like eparis' patch worked well (kudos eric).  A quick poke about the core shows the userword at the front of the object that overran its bound and corrupted a following object has a value of:
0xffffffff80235558
disassembling that places the owner of that corrupting object in:
__netdev_alloc_skb

Looks like gospo might need to take a peek at this after all
Andy, any initial thoughts?

Comment 35 Andy Gospodarek 2009-12-02 22:19:01 UTC

I seem to be seeing the same thing as Neil:

crash> dmesg | tail -20

Call Trace:
 [<ffffffff8000ce98>] cache_alloc_debugcheck_after+0xb9/0x1e6
 [<ffffffff800e617a>] __kmalloc+0x11d/0x129
 [<ffffffff800327be>] expand_files+0x124/0x2a7
 [<ffffffff80023833>] dup_fd+0x153/0x2a8
 [<ffffffff8004be20>] copy_files+0x47/0x63
 [<ffffffff80020731>] copy_process+0x714/0x17a1
 [<ffffffff800688bd>] _spin_unlock_irq+0x24/0x27
 [<ffffffff800a2749>] alloc_pid+0x26e/0x294
 [<ffffffff8003305b>] do_fork+0x68/0x1c0
 [<ffffffff8002343a>] fd_install+0x2e/0x68
 [<ffffffff800c135a>] audit_syscall_entry+0x180/0x1b3
 [<ffffffff800602a6>] tracesys+0xd5/0xdf
 [<ffffffff8006045f>] ptregscall_common+0x67/0xac

ffff81003518ece0: redzone 1:0x5a2c0c00, redzone 2:0x5a2cf071
Kernel panic - not syncing: Slab error!!

 
crash> kmem ffff81003518ece0
CACHE            NAME                 OBJSIZE  ALLOCATED     TOTAL  SLABS  SSIZE
ffff81000c0f5340 size-2048               2080        789       978    326     8k
SLAB              MEMORY            TOTAL  ALLOCATED  FREE
ffff81003518e480  ffff81003518e4c0      3          2     1
FREE / [ALLOCATED]
  [ffff81003518ece0]

      PAGE       PHYSICAL      MAPPING       INDEX CNT FLAGS
ffff81000a0ea540 3518e000                0     4200  1 30080000000080
crash> gdb x /20xg 0xffff81003518e4c0
0xffff81003518e4c0:     0x00000000170fc2a5      0xffffffff80235558
0xffff81003518e4d0:     0x5a5a5a5a5a5a5a5a      0x5a5a5a5a5a5a5a5a
0xffff81003518e4e0:     0xffffffffffff5a5a      0x06080074279d1f00
0xffff81003518e4f0:     0x0100040600080100      0x100a0074279d1f00
0xffff81003518e500:     0x000000000000fc2f      0x00000000192a100a
0xffff81003518e510:     0x0000000000000000      0x68ca000000000000
0xffff81003518e520:     0x68ca164bcfd40400      0x0e00008acfd40400
0xffff81003518e530:     0x208e0e0000aa208e      0x033f000000000000
0xffff81003518e540:     0x9a690c4b1a00cb58      0x0008e8baae200008
0xffff81003518e550:     0x00403f039c000045      0xda2a100afb0c11ff
crash> dis 0xffffffff80235558
0xffffffff80235558 <__netdev_alloc_skb+18>:     test   %rax,%rax

I think that is being done correctly since you need to use the reported memory location from the output of kmem <objp> and it appears from Eric's patch that the user will be stored in the 2nd 64-bit word.
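
The crash session above amounts to two small calculations, sketched here under the assumption that objects in the slab are laid out back to back from the MEMORY address with the OBJSIZE of 2080 that kmem reports. The helper names are hypothetical.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* First two 64-bit words of an object on the patched debug kernel:
 * redzone 1, then the allocating caller. */
struct object_header {
    uint64_t redzone1;
    uint64_t caller;
};

/* Zero-based position of obj within a slab whose first object is at slab_mem. */
uint64_t object_index(uint64_t slab_mem, uint64_t obj, uint64_t stride)
{
    assert(stride != 0 && obj >= slab_mem);
    return (obj - slab_mem) / stride;
}

/* Interpret the first two words of a raw dump as the object header. */
struct object_header read_header(const uint64_t *words)
{
    struct object_header h;

    assert(words != NULL);
    h.redzone1 = words[0];
    h.caller = words[1];
    return h;
}
```

With the values from the session, the object named in the slab error is at index 1, so the object at the slab's MEMORY address (index 0) is its predecessor, and the second word of that predecessor is the address that disassembles into __netdev_alloc_skb.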

Comment 37 Eric Paris 2009-12-03 05:21:15 UTC

Created attachment 375672 [details]
UNTESTED!!!  tripwire memory allocator

Patch ?might? implement an skb data allocator with a tripwire in case of allocation overwrite.  It also might eat your first born.  I don't know....

Comment 38 Neil Horman 2009-12-03 15:36:49 UTC

thanks for the patch eric, but I think given what we know about the bug at the moment, we need something more targeted. Given that we know the corrupting slab is part of an skb allocated in the receive path by the e1000e driver, we can determine that there are 4 places where this skb might have been allocated:

1) e1000_alloc_rx_buffers
2) e1000_alloc_rx_buffers_ps
3) e1000_alloc_jumbo_rx_buffers
4) e1000_clean_rx_irq

I think we can discount (3) since jumbo frames would require the allocation of objects from slabs larger than size-2048 (which is where the corruptor came from). We can also discount (2), since we only use packet splitting for NICs that don't have the FLAG_IS_ICH flag set (which this NIC does). That leaves (1) and (4). (4) is the call location I mentioned in comment #36. (1) is tricky as it allocates skbs which will go on the nic rx ring for dma operations, which as you mentioned to me before will circumvent your write protection. I think what we need to do here is the following:

A) augment (1) so that we validate how much space we have in the skb prior to doing the subsequent memcpy in the copybreak path, expanding the skb if needed, or BUG halt if we can't.

B) validate that the rctl register in the hardware appropriately reflects the size of the skbs we are allocating for refill operations. If it doesn't, BUG halt, or increase the skb size. This should handle the case in which the rctl register is set for frames of size X and we allocate frames of size Y where X > Y. If that's the case, we'll need to find where the rctl register gets out of sync with our buffer allocation size value.

I'll have a patch ready shortly.
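
The two checks proposed in (A) and (B) reduce to simple size comparisons. The sketch below is a user-space illustration only; the function names are hypothetical and it does not reproduce the actual debug patch.

```c
#include <assert.h>
#include <stddef.h>

/* (A) Before copying a received frame, confirm the destination has room. */
int copy_fits(size_t dst_room, size_t frame_len)
{
    return frame_len <= dst_room;
}

/* (B) The buffer size programmed into the hardware (X) must not exceed
 * the size actually allocated for each receive buffer (Y). */
int hw_size_matches_alloc(unsigned int hw_buf_size, unsigned int skb_alloc_size)
{
    assert(hw_buf_size != 0);
    return hw_buf_size <= skb_alloc_size;
}
```

In a driver, a failed check would be the point at which to halt or to grow the buffer, as the comment describes.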

Comment 39 Neil Horman 2009-12-03 18:04:42 UTC

I'm still working on the patch, but it occurred to me we can exclude case (4) above without any code change:
http://rhts.redhat.com/cgi-bin/rhts/jobs.cgi?id=105714
I've modified this test to give me a window in which I can modify the system during the test.  I'm going to go in and set the e1000e's copybreak module parameter to 0.  By doing this we'll skip the path that causes the netdev_alloc_skb in (4) above.  If the problem fails to reproduce, then we can be reasonably sure that location was a problem.  If it continues to reproduce, we can eliminate that call site as an issue, leaving us only with 1 point to investigate.

Comment 40 Neil Horman 2009-12-03 19:41:46 UTC

Ok, so I ran the test with copybreak=0 for the e1000e module, and I still got the panic, so from that we can conclude that the copybreak path has no bearing on this corruption.  That only leaves one place to investigate.  I'm finishing up the patch for that now, and will post it here with a build link shortly.

Comment 41 Neil Horman 2009-12-03 20:34:27 UTC

Created attachment 375906 [details]
targeted debug patch

Ok, it's horrible and ugly, but this patch should give us some visibility into the status of the frames as they come off of the hardware.  I've copied the internal slab debug code from slab.c into the e1000e driver, and used it to build a function that should, with luck, validate that the skbs as they come off the hardware are intact.  Specifically it validates that the redzones at the front and end of the allocated data segment are set to RED_ACTIVE.  If they're not, we panic the system.  We perform this validation immediately after we get the frame off the hardware, and again right before we pass it up to the network stack.  I've built it, but not run it yet, so it may still be a bit buggy, we'll see.  The tag in cvs is kernel-2_6_18-164_8_1_el5_slbdbg_v2 and the brew build is right here:
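
The kind of validation described here can be illustrated outside the kernel. The sketch below assumes a simplified object of the form [redzone 1][data][redzone 2] with 8-byte redzones, and assumes RED_ACTIVE is 0x170fc2a5, the value seen at the head of the allocated object in the crash dump earlier in this bug. The function names are hypothetical and this is not the attached patch.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define ASSUMED_RED_ACTIVE 0x170fc2a5ULL   /* assumed redzone pattern */
#define REDZONE_BYTES      sizeof(uint64_t)

/* Write the active pattern on both sides of a data area of data_size bytes. */
void arm_redzones(unsigned char *obj, size_t data_size)
{
    uint64_t pattern = ASSUMED_RED_ACTIVE;

    assert(obj != NULL);
    memcpy(obj, &pattern, REDZONE_BYTES);
    memcpy(obj + REDZONE_BYTES + data_size, &pattern, REDZONE_BYTES);
}

/* Return 1 if both redzones still hold the active pattern, 0 otherwise. */
int redzones_intact(const unsigned char *obj, size_t data_size)
{
    uint64_t front, back;

    assert(obj != NULL);
    memcpy(&front, obj, REDZONE_BYTES);
    memcpy(&back, obj + REDZONE_BYTES + data_size, REDZONE_BYTES);
    return front == ASSUMED_RED_ACTIVE && back == ASSUMED_RED_ACTIVE;
}
```

Running the check once when the frame comes off the ring and again before handing it up the stack narrows down when the trailing redzone was damaged.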
http://brewweb.devel.redhat.com/brew/taskinfo?taskID=2123575
I'll run it through rhts as soon as its done.

Comment 42 Neil Horman 2009-12-04 14:33:35 UTC

Think I might see the problem.  It's in e1000_setup_rctl, and if this is in fact the issue, the problem potentially exists in all RHEL releases, and upstream for all intel drivers that pass buffer size information to the NIC in this way.  I was testing the above patch and ran into a boot up panic indicating that rx_buffer_len was smaller than the configured size for rctl.

e1000_setup_rctl is used to configure the device receive control register.  One of the fields in this register informs the NIC hardware how large the buffers that we place in the receive ring are.  There are discrete values for this setting, and so the available buffer size choices are 256, 512, 1024, 2048, 4096, 8192 and 16384 bytes.  The code looks like this:
======================================================
switch (adapter->rx_buffer_len) {
        case 256:
                rctl |= E1000_RCTL_SZ_256;
                rctl &= ~E1000_RCTL_BSEX;
                break;
        case 512:
                rctl |= E1000_RCTL_SZ_512;
                rctl &= ~E1000_RCTL_BSEX;
                break;
        case 1024:
                rctl |= E1000_RCTL_SZ_1024;
                rctl &= ~E1000_RCTL_BSEX;
                break;
        case 2048:
        default:
                rctl |= E1000_RCTL_SZ_2048;
                rctl &= ~E1000_RCTL_BSEX;
                break;
        case 4096:
                rctl |= E1000_RCTL_SZ_4096;
                break;
        case 8192:
                rctl |= E1000_RCTL_SZ_8192;
                break;
        case 16384:
                rctl |= E1000_RCTL_SZ_16384;
                break;
        }
=========================================

Where rx_buffer_len is the size we pass to netdev_alloc_skb to allocate an skb.  The problem is that there is no guarantee that rx_buffer_len will be any one of the cased values in this switch statement.  In fact it quite likely won't be, since rx_buffer_len is commonly set to be the mtu of the interface, or roughly 1500 bytes.  That's going to send us down the default path of this switch, which tells the hardware that each buffer is 2048 bytes long.  While that's normally ok (-ish), since an skb of 1500 bytes will grab a data buffer from the size-2048 slab, it's still technically an overrun, and if we reserve too much data at the head of the slab (such as we might do if we enabled slab debug perhaps :) ), then we could easily overrun into the next object in that slab.

The solution to this problem is to modify this setup to be a floor function.  The rctl size field should be set to the largest setting possible that is not larger than rx_buffer_len.  I'll have a patch building shortly.
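
The floor function proposed here can be sketched as follows. It uses the seven buffer sizes listed above; the function name is hypothetical, and returning 0 for inputs below 256 is an assumption made for the sketch, not something the driver is known to do.

```c
#include <assert.h>
#include <stddef.h>

/* Buffer sizes the receive control register can express, per the text above. */
static const unsigned int rctl_sizes[] = {
    256, 512, 1024, 2048, 4096, 8192, 16384
};

/* Largest supported size that is not larger than rx_buffer_len.
 * Returns 0 when rx_buffer_len is below the smallest supported size. */
unsigned int rctl_size_floor(unsigned int rx_buffer_len)
{
    unsigned int best = 0;
    size_t i;

    for (i = 0; i < sizeof(rctl_sizes) / sizeof(rctl_sizes[0]); i++)
        if (rctl_sizes[i] <= rx_buffer_len)
            best = rctl_sizes[i];

    assert(best <= rx_buffer_len);
    return best;
}
```

For a buffer length of roughly 1500 bytes this yields 1024, so the hardware would never be told it may write more than was allocated.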

Comment 43 Neil Horman 2009-12-04 16:13:05 UTC

Created attachment 376112 [details]
test fix for our corruptor problem.

I still need to build it, but here's a first pass at a correction for the problem I described previously.

Comment 44 Neil Horman 2009-12-04 17:27:39 UTC

http://brewweb.devel.redhat.com/brew/taskinfo?taskID=2129029

build with test fix cooking now

Comment 45 Neil Horman 2009-12-04 18:56:40 UTC

build done, I've smoke tested it on my nec-em21 system, and it seems to be working well.  Building a repo to test in rhts on the reproducer systems now.

Comment 46 Neil Horman 2009-12-04 19:08:34 UTC

http://rhts.redhat.com/cgi-bin/rhts/jobs.cgi?id=106140

rhts job id running my test fix.  We should know if it helps soon.

Comment 47 Neil Horman 2009-12-04 20:16:26 UTC

Created attachment 376190 [details]
updated version of fix under test

grr, had to resubmit the rhts job (last one failed due to network connection issues):
http://rhts.redhat.com/cgi-bin/rhts/jobs.cgi?id=106155

Also, attached is an updated version of the patch with the various typos and other errors corrected.

Comment 48 Neil Horman 2009-12-04 20:20:20 UTC

http://rhts.redhat.com/cgi-bin/rhts/jobs.cgi?id=106159
http://rhts.redhat.com/cgi-bin/rhts/jobs.cgi?id=106158
http://rhts.redhat.com/cgi-bin/rhts/jobs.cgi?id=106157
http://rhts.redhat.com/cgi-bin/rhts/jobs.cgi?id=106155

I've actually taken the shotgun approach.  I've queued the same job with the v4 test kernel to  all the available z200 systems we have, in the hopes of getting better test results.

Comment 49 Neil Horman 2009-12-04 21:22:33 UTC

gahh, my patch seems to be somewhat broken.  By normalizing the rx_buffer_len value and setting the rctl register to the largest value smaller than that, we prevent ourselves from receiving frames that are sized larger than the dma limit in rctl but smaller than the rx_buffer_len value.  What I think we actually need to do here is normalize the input, but round up, rather than down, then set the rx_buffer_len value to be the same as that value.  That way we will be able to receive frames that are the maximum mtu value, and we'll have the space to store all that data.

Comment 50 Neil Horman 2009-12-04 21:37:27 UTC

Created attachment 376211 [details]
new patch

sorry, that was stupid of me.  We can't round down because we can't get full mtu sized frames in then (we'll drop them), leading to odd network behavior. We instead need to round up, and then bump up rx_buffer_len for the adapter so that we then allocate enough space to handle the dma, like this patch does.  Building a test here:


Also, I meant to note that our dma doesn't have to traverse the whole buffer to cause problems, even if we have a 2k buffer from the slab allocator, we also append skb_shared_info to the end of the data block, beyond the allocation size.  So, if we allocate a 1500 byte skb, even though we have a 2k slab, at 1500 bytes into the data block, we have a live data structure that an overlong dma will trample into, causing other problems down the road.
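
The round-up approach and the overrun it avoids can be sketched in a few lines. The size table comes from the earlier comment on e1000_setup_rctl; the function names are hypothetical, and returning 0 for lengths above 16384 is an assumption of the sketch rather than driver behavior.

```c
#include <assert.h>
#include <stddef.h>

/* Buffer sizes the receive control register can express. */
static const unsigned int rctl_buf_sizes[] = {
    256, 512, 1024, 2048, 4096, 8192, 16384
};

/* Smallest supported size that is at least rx_buffer_len.
 * Returns 0 when rx_buffer_len exceeds the largest supported size. */
unsigned int rctl_size_ceil(unsigned int rx_buffer_len)
{
    size_t i;

    for (i = 0; i < sizeof(rctl_buf_sizes) / sizeof(rctl_buf_sizes[0]); i++)
        if (rctl_buf_sizes[i] >= rx_buffer_len)
            return rctl_buf_sizes[i];
    return 0;
}

/* Bytes the hardware may write past the requested allocation length, which
 * is where skb_shared_info begins according to the comment above. */
unsigned int dma_overrun_bytes(unsigned int hw_buf_size, unsigned int alloc_len)
{
    assert(hw_buf_size != 0);
    return hw_buf_size > alloc_len ? hw_buf_size - alloc_len : 0;
}
```

With a 1500-byte allocation and the hardware told 2048, up to 548 bytes can land beyond the requested length; once rx_buffer_len is bumped to the rounded-up value, the two sizes agree and that window closes.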

Comment 51 Neil Horman 2009-12-04 21:39:05 UTC

http://brewweb.devel.redhat.com/brew/taskinfo?taskID=2130213


helps if I include a link to the build. :)

Comment 52 Neil Horman 2009-12-05 12:40:55 UTC

http://rhts.redhat.com/cgi-bin/rhts/jobs.cgi?id=106201
Eureka!  The corrected patch seems to pass the connectathon suite without error.  I'm going to kick off several more jobs to ensure that it's consistent.  If so, I think we have a fix.

Comment 53 Neil Horman 2009-12-05 12:48:19 UTC

http://rhts.redhat.com/cgi-bin/rhts/jobs.cgi?id=106268
http://rhts.redhat.com/cgi-bin/rhts/jobs.cgi?id=106267
http://rhts.redhat.com/cgi-bin/rhts/jobs.cgi?id=106266
http://rhts.redhat.com/cgi-bin/rhts/jobs.cgi?id=106265
http://rhts.redhat.com/cgi-bin/rhts/jobs.cgi?id=106264
http://rhts.redhat.com/cgi-bin/rhts/jobs.cgi?id=106263
http://rhts.redhat.com/cgi-bin/rhts/jobs.cgi?id=106261
http://rhts.redhat.com/cgi-bin/rhts/jobs.cgi?id=106259

Set of patch validation jobs to run over the weekend

Comment 54 Neil Horman 2009-12-05 17:28:07 UTC

So far so good.  It looks like the patch I have in comment #50 is a winner.  I'm getting some warnings during kdump setup, but they don't appear to be at all related to this problem.  All the connectathon tests in each of the above jobs are completing without an oops or slab error warning.  Looking at the upstream kernel, at least e1000 and e1000e are affected by this bug, as well as possibly ixgb and ixgbe.  Upstream needs to be patched, as well as RHEL5 and likely RHEL4.  I'll get started as soon as I can.

Comment 55 Neil Horman 2009-12-07 01:28:50 UTC

Ok, of the jobs in comment #53, we had three failures: 106268, 106263 and 106261.  106268 and 106261 were failures during the initial kernel install (the system just hung up during install until the external watchdog triggered). 106263 was a hang during one of the connectathon tests, and appears, based on the console log, to be the result of the NFS server not responding for a period that exceeded the external watchdog timeout.  So I think these failures can be discounted, as they either happened before the test started or were caused by elements external to the test.  The remaining jobs all passed all of connectathon, so I think we're good here.  I've got the patch set ready to go upstream and will post monday morning, after which I will begin posting for 5.5 and 5.4.z.

Comment 56 Neil Horman 2009-12-07 15:23:48 UTC

Created attachment 376703 [details]
final patch for e1000, e1000e and ixgb

This is the combined version of the patch for e1000, e1000e and ixgb that I've sent upstream:
http://marc.info/?l=linux-netdev&m=126019719608959&w=2

Comment 57 Neil Horman 2009-12-07 16:15:12 UTC

Created attachment 376718 [details]
final final patch :)

sorry, a few comments from upstream made this patch much more concise, so I'm updating it here

Comment 58 RHEL Program Management 2009-12-07 21:47:28 UTC

This request was evaluated by Red Hat Product Management for inclusion in a Red
Hat Enterprise Linux maintenance release.  Product Management has requested
further review of this request by Red Hat Engineering, for potential
inclusion in a Red Hat Enterprise Linux Update release for currently deployed
products.  This request is not yet committed for inclusion in an Update
release.

Comment 60 Don Howard 2009-12-09 19:14:23 UTC

*** Bug 545074 has been marked as a duplicate of this bug. ***

Comment 61 Neil Horman 2009-12-11 12:05:41 UTC

Jesse, this is the original bug in which we tracked down that corruptor:
http://marc.info/?l=linux-netdev&m=126019719608959&w=2
If you could give us thumbs up/down on it asap that would be good

Comment 62 Jesse Brandeburg 2009-12-11 17:12:38 UTC

Neil, a couple of things, I don't think your patch is good for several reasons.

The original reason for shortening the rx_buffer_len in e1000/e1000e (only in the 1500 MTU case) to *1522!* is because the silly stack charges the socket for whatever size the driver allocates for the receive buffer (truesize) regardless of whether it is used. So this quickly becomes a performance issue with your change back to the way it was (you could probably find the patch that introduces the 1522 rx_buffer_len, and just revert it). The code was meant to create a special case for the default mtu (1500) that performed at the optimal rate.

a quick primer on our hardware's operation, and why the original should work correctly, unless there were backport issues:

1) this change should only be made on hardware which supports the LPE (Long Packet Enable) bit. The LPE bit when 0 makes the hardware drop all frames longer than 1522 bytes. BTW it just occurred to me that if the patch to enable vlan all the time was enabled this might mess with the length of packet that LPE drops, not sure... If LPE==0 doesn't drop all frames for some reason (like store bad packets is set) then this optimization must be disabled. for ixgb, we had a MFS (max frame size) register which makes the hardware drop any frame longer than that value, so it should allow the change in the same way.

2) all the intel hardware works by chaining multiple descriptors together when receiving a frame longer than the descriptor length set in receive control. Packet split or not changes the behavior slightly, but the idea is the same.

3) when jumbo frames are enabled we should reallocate rx buffers with a full rx_buffer_len (ideally of 4kb or PAGE_SIZE), and the receive path will use more than one rx descriptor to contain the whole frame from the wire.
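
The three points above can be summarized in a short sketch. It assumes only what the primer states: with LPE clear, frames longer than 1522 bytes are dropped, and longer frames are spread over as many descriptors as needed. The function names and the constant name are made up for the example.

```c
#include <stdint.h>

/* Longest frame accepted when LPE is 0, per the primer above. */
#define STD_MAX_FRAME_LEN 1522u

/* Returns 1 if the hardware is expected to accept the frame, 0 if it drops it. */
int frame_accepted(uint32_t frame_len, int lpe_enabled)
{
    return lpe_enabled || frame_len <= STD_MAX_FRAME_LEN;
}

/*
 * Number of rx descriptors a frame occupies when each descriptor's buffer
 * holds buf_len bytes (the length set in receive control).
 * Returns 0 if buf_len is 0.
 */
uint32_t rx_descriptors_needed(uint32_t frame_len, uint32_t buf_len)
{
    if (buf_len == 0)
        return 0;

    /* Ceiling division written to avoid overflow on large lengths. */
    return frame_len / buf_len + (frame_len % buf_len != 0);
}
```

Under these assumptions a 1522 byte frame fits in one 2048 byte descriptor, while a 9018 byte jumbo frame would be chained across five of them, and should never arrive at all while LPE is 0.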

Comment 63 Neil Horman 2009-12-11 18:24:21 UTC

Jesse, I appreciate the desire to optimize your allocation size, but if it's causing a data corruptor, then it's not really an optimization. You can say the stack is silly for accounting truesize if you like, but even if you don't use all the data, your allocation has restricted the rest of the kernel from using it (even the part the driver doesn't know it has), so the driver needs to take the charge for that, lest we just allow every driver as much ram as they want. Regardless, debating accounting strategy is really a side note here, given the data above pretty clearly shows that a 2k slab that was the datablock for an skb in the e1000e driver's rx ring overran the bounds of the skb and eventually trampled on the redzone at the end of that slab.

Emil already gave me your primer on intel hardware behavior. It makes sense to me how it should work; my assertion is that for the NICs we have in the z-200's here, it's _not_ working. Emil suggested that we update the eeprom code in these systems to see if that corrected the problem, and if I have time I will, but regardless, if firmware is in the field that might violate the setting of the LPE bit, that needs to be handled in the driver. It might be reasonable to restrict this size minimization such that it's not done on NICs on which this problem is identified (if that's possible).

I agree with you regarding the vlan possibility. I had thought that a vlan tagged frame might be responsible, as that would just spill over a bit on your 1522 byte boundary, leading to a corruption of the skb_shared_info structure for the skb. I'm not sure how we can determine which NICs are affected by such an issue though, so that we can selectively disable the size optimization. If you have any thoughts, that would be appreciated.

As for the use of multiple buffers, I don't think that's particularly relevant here, as the e1000/e1000e drivers have code to check for spanned packets and drop them. The hardware supports that, but the driver does not.

We never tested jumbo frames here, as we were able to reproduce the problem using a normal mtu. Although that does beg the question as to what behavior we could expect if the LPE bit issue you described above were coupled with a misconfigured peer switch that allowed jumbo frames, i.e. what would happen if we didn't discard frames over 1522 bytes in the hardware, and our network locally here sent a jumbo frame to that NIC. Clearly that's a network misconfiguration, and should be fixed, but regardless, I would imagine that the NIC erroneously not discarding the frame would lead to multiple 2048 byte dma's (since that's what the default rx_buffer_len value sets the rctl dma size to), possibly spanning multiple buffers as the hw documentation indicates it would. Actually, thinking about it, this makes a good deal of sense. Have a look at the rx path in e1000e:
e1000_clean_rx_irq
...
        /* !EOP means multiple descriptors were used to store a single
         * packet, also make sure the frame isn't just CRC only */
        if (!(status & E1000_RXD_STAT_EOP) || (length <= 4)) {
                /* All receives must fit into a single buffer */
                e_dbg("Receive packet consumed multiple buffers\n");
                /* recycle */
                buffer_info->skb = skb;
                goto next_desc;
        }
...
This is the code gospo added to prevent short frames from getting into the network stack a few months back. If the NIC ignored the LPE bit on the e1000e nics in the z-200 systems we have, and dma'ed that frame into our ring buffer, we'd get several 2048 byte dmas. Since netdev_alloc_skb automatically adds a few bytes to the head of an allocation and reserves them, we'd spill over our allocated 2048 byte slab by a few bytes, violating our redzone. Since these frames just get recycled in the driver, though, we don't free them and the redzone doesn't get checked, but we notice some slab poisoning going on in other objects, as these dmas spilled onto them. Only the last fragment of the frame gets through (if the length is greater than 4) and makes it into the stack, causing various and sundry problems up the stack.
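
The recycle behaviour walked through above can be modelled in userspace to see which descriptors would reach the stack. This is a simplified sketch and not the driver code: struct rx_desc_model, the RXD_STAT_EOP stand-in value and count_passed_up() are assumptions made for the example.

```c
#include <stddef.h>
#include <stdint.h>

/* Stand-in for E1000_RXD_STAT_EOP; the value here is illustrative. */
#define RXD_STAT_EOP 0x02

/* Hypothetical, stripped-down view of one rx descriptor. */
struct rx_desc_model {
    uint8_t  status;  /* status bits reported for this descriptor */
    uint16_t length;  /* bytes the hardware says it wrote */
};

/*
 * Count descriptors that would be handed to the stack. Anything without
 * EOP set, or with a length of 4 bytes or less, is recycled instead,
 * mirroring the check quoted above.
 */
size_t count_passed_up(const struct rx_desc_model *ring, size_t n)
{
    size_t passed = 0;

    for (size_t i = 0; i < n; i++) {
        if (!(ring[i].status & RXD_STAT_EOP) || ring[i].length <= 4)
            continue; /* recycled, never freed, so no redzone check */
        passed++;
    }
    return passed;
}
```

For a 9018 byte frame spread over 2048 byte buffers, the first four descriptors carry no EOP bit and are recycled, and only the final 826 byte fragment is passed up, matching the "only the last one gets through" behaviour described above.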

Hmm, I guess it all comes back to the LPE bit. If this hardware is indeed ignoring it, what's a good way to check for the existence of that problem so that we can adjust the rx_buffer_len appropriately without compromising the optimization on NICs that honor that bit properly?

Comment 64 Emil Tantilov 2009-12-11 20:40:56 UTC

Neil,

Ideally the LPE bit should not be ignored, and if it is then we have a HW issue that we need to deal with. That is why I pointed to the version of the FW you're using. The z-200 is still under development and, like any other platform, has gone through different FW revisions and reworks, each with their own problems. It's not unusual for issues like this to be fixed in later versions. I don't know if a problem in the FW can cause the LPE to be ignored - probably not, but there may be other factors like BIOS, ME settings etc... that can cause issues. That is why I am trying to get a local repro so we can get to the bottom of it.

So far I have not been able to reproduce the issue on the systems we have in the lab using the test you suggested which leads me to believe that there may be something else other than just DEBUG_SLAB and default settings that is significant. 

I will try another test using the RH 5.4 kernel (or if you have a link to a kernel you used that would be even better). Hopefully this will get me closer to your setup. Do you know if your test was done with default BIOS settings? If not - is ME enabled? How about VTd? Can you maybe try and reproduce after loading the BIOS default settings?

If there are any other details in your setup that you believe to be significant please let me know as this will help me to recreate your setup as close as possible and hopefully be able to get a repro.

Comment 65 Neil Horman 2009-12-11 21:00:03 UTC

Hey Emil-

I understand what you're saying, in that the LPE bit shouldn't be ignored, and that needs to be a hardware fix/firmware fix if it is.  My concern is that if such firmware is in the field, it might behoove us to detect that in the driver and prevent it, although I'm not sure how you go about detecting this particular failure.

As far as the reproducer goes, you might want to check your logs.  There were consistent redzone violations that were non fatal prior to a crash, and the crash didn't always occur.  The only kernel where a crash did occur was our latest 5.4.z kernel, which I think I can provide to you if you like.  But all the kernels I tested showed redzone violation in our test setup here, so you may want to look at that.

As far as bios settings go, I honestly don't know what they are here; one of our lab guys may be able to give you details (they should be cc'ed on this bug).

regarding other test setups, I'm not sure if the theory has legs or not, but when jesse and I were talking above, I wondered if maybe we were seeing two problems here: (1) that the NIC was ignoring the LPE bit, and (2) the network segment was erroneously seeing jumbo frames on the wire.  It's all in comment 63.  I'll see if I can tcpdump a relevant segment of our network in case such frames exist here.

Comment 66 Emil Tantilov 2009-12-11 21:19:33 UTC

checking logs is a fairly basic operation and I do it all the time (even automated for certain errors), but there is nothing in dmesg during and after the test. I spent the last couple of days trying all kinds of tests but could not get past the LPE protection. Do you see the issue in only one connectathon run? How long does it usually take?

Link to a kernel would be nice - either RH or upstream.

Comment 67 Jesse Brandeburg 2009-12-12 00:42:30 UTC

can you run the ethregs (register dump) tool (available from e1000.sf.net) and get your registers (not sure if ethregs needs to be updated to support PCH) on one of the failing systems (before it fails is fine, as long as we aren't running your fixed driver)?

That we can't repro this is a warning signal we have to pay attention to.  We need to work towards a local repro so we have a chance to figure out the real root cause (hardware bug or other)

Comment 68 Neil Horman 2009-12-12 17:30:35 UTC

sure I can run ethregs if you like, all you had to do is ask.

Emil, as for the kernel, I'm just verifying that we don't have anything embargoed in the 5.4.z stream here, then I'll post a link to the kernel for you.  I'll also rebuild the config I used on the upstream kernel and give you a link to that as well.

Just so that you know, the upstream reproducer was rare, the 5.4.z stream was consistent, so I really focused on that kernel.

I understand that you need to focus on getting a local reproduction.  please understand that this bug is high priority for us, so we're looking to move forward.  If it helps any, I have a patch handy here which verifies that, with slab debug enabled, redzones aren't violated immediately when we remove skbs from the rx ring in the napi poll routine (which I used to point me to dma problems in the NIC, and was the reason I focused on the buffer optimization we've been discussing as the problem).  If you like I can send that to you as well.

Comment 69 Jesse Brandeburg 2009-12-14 18:03:27 UTC

Neil, with any of your patches, did you have one that reported the length of the packet that was received from the hardware rx descriptor?  I'm curious if the hardware was actually reporting that it DMA'd more bytes than rx_buffer_len.

Comment 70 Jesse Brandeburg 2009-12-14 18:05:37 UTC

And, just to make it clear, have you ever reproduced this on any other e1000e hardware or any of the other drivers?  We still strongly suspect your pre-production hardware (this particular hardware, PCH, was really buggy for a long time, with lots of LAN and BIOS firmware changes late into the qualification phase)

Comment 71 Neil Horman 2009-12-14 19:06:04 UTC

Created attachment 378311 [details]
my debug patch

Jesse, to answer your first question, no I'm afraid that I didn't check the length that was returned from the hardware at all, but it can be added to the above debug patch pretty easily if you like.

To answer your other questions, no we never observed this problem on other hardware.  All the reproduction was on the z-200 systems, and only consistently with RHEL5.4.z.

I'm entirely willing to believe that this problem is restricted to pre-production hardware, and if that's the case, so be it.  It still makes me nervous to tell the hardware we have more dma space than is truly available in the skb, even if the lpe bit is supposed to guard against overlong frame reception, but if you can fix this in the hardware, then I suppose the optimization is worth it.

Comment 72 Jesse Brandeburg 2009-12-21 22:25:53 UTC

the patch only patches netconsole.c and netpoll.c, is that the correct patch?

We trust the hardware to work predictably, and so have enabled the LPE bit.  You are correct things can blow up handily if the length DMA'd is longer than the buffer.

maybe it would be worth a runtime check?

BUG_ON(rx_desc->length > adapter->rx_buffer_len);

or something like that, for non-packet split of course.
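
A userspace sketch of the kind of length check suggested above, for the non-packet-split case only. It assumes the descriptor length is stored as a 16-bit little-endian field, so it is decoded before comparing; le16_decode() is a hand-written stand-in for a kernel byte-order helper, and rx_length_fits() is a hypothetical name.

```c
#include <stdint.h>

/* Decode a 16-bit little-endian value from its two raw bytes. */
static uint16_t le16_decode(const uint8_t raw[2])
{
    return (uint16_t)(raw[0] | (raw[1] << 8));
}

/*
 * Returns 1 when the length reported by the hardware fits in the receive
 * buffer, 0 when a runtime check like the one above would trip.
 */
int rx_length_fits(const uint8_t raw_len[2], uint32_t rx_buffer_len)
{
    return le16_decode(raw_len) <= rx_buffer_len;
}
```

Logging the decoded length whenever this returns 0, rather than halting outright, would also answer the earlier question of whether the hardware itself reports having written more than rx_buffer_len bytes.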

Comment 73 Jesse Brandeburg 2009-12-21 22:40:44 UTC

also, are all your z200 systems identical?  If the LPE trick isn't working on those systems, we need to know, and why.  Since we can't reproduce the issue on the z200's here, we need you to give the info on yours (dmidecode, bios version, motherboard model number/sticker (if it has revision info)).

All those systems should also be thoroughly vetted to make sure they are running production BIOSes and motherboards/components.

Yes, it's nice to have an SDP, but not once a model has gone production, unless you can get production bits in your SDP.

Comment 74 Jeff Burke 2009-12-22 14:25:12 UTC

Created attachment 379832 [details]
dmidecode from system

dmidecode requested

Comment 76 Prarit Bhargava 2010-01-23 12:43:17 UTC

Hi all -- I'm seeing this issue on 3 different platforms, both AMD and Intel systems.

After applying nhorman's patch for this BZ the problem appears to resolve itself.  However, I'm left with other SLAB corruption issues ...

nhorman's patch fixed *something* -- likely a HW issue?

P.

Comment 77 Jesse Brandeburg 2010-02-18 00:49:23 UTC

AMD with what ethernet hardware exactly?

also, bios on z200 needs to be up to date.  Ours is newer than yours, and these bioses have been changing VERY frequently.


 BIOS Information
 	Vendor: Hewlett-Packard
-	Version: 786H3 v00.50
-	Release Date: 09/03/2009
+	Version: 786H3 v00.55
+	Release Date: 11/19/2009
 	Address: 0xE0000
 	Runtime Size: 128 kB
 	ROM Size: 8192 kB

Comment 78 Neil Horman 2010-02-18 01:54:53 UTC

jesse, IIRC our z200's have a variety of ethernet kit on them.  Some have e1000(e) (not sure which exactly), some have tg3 and some have bnx2 IIRC.

Comment 80 Neil Horman 2010-03-02 11:38:21 UTC

discussed with jiri.  given that this was opened against 5.4.z, and we just fixed this in 5.5, I'm closing this as a duplicate of bz 558809, and jiri will clone 558809 to handle the z-stream backport.

*** This bug has been marked as a duplicate of bug 558809 ***
