249511 – DMA broken with atl1 network adapter and >4GB of memory

Summary: DMA broken with atl1 network adapter and >4GB of memory

Keywords:
Status:	CLOSED CURRENTRELEASE
Alias:	None
Product:	Fedora
Classification:	Fedora
Component:	kernel
Sub Component:
Version:	7
Hardware:	x86_64
OS:	Linux
Priority:	low
Severity:	high
Target Milestone:	---
Assignee:	Kernel Maintainer List
QA Contact:	Fedora Extras Quality Assurance
Docs Contact:
URL:	http://smolt.fedoraproject.org/show?U...
Whiteboard:
Depends On:
Blocks:

Reported:	2007-07-25 10:11 UTC by Pedro Morais
Modified:	2007-11-30 22:12 UTC
CC List:	5 users
Fixed In Version:	2.6.22.5-76.fc7
Clone Of:
Environment:
Last Closed:	2007-09-20 15:20:55 UTC
Type:	---
Embargoed:
Dependent Products:

Attachments
stacktrace for the lockup (4.31 KB, text/plain) 2007-07-25 10:11 UTC, Pedro Morais
Requested stacktrace (79.17 KB, image/jpeg) 2007-07-25 16:35 UTC, Pedro Morais
Another stack trace (493.28 KB, image/jpeg) 2007-07-25 18:20 UTC, Pedro Morais
backtrace (49.63 KB, text/plain) 2007-08-03 23:15 UTC, Andy Lawrence

Description Pedro Morais 2007-07-25 10:11:17 UTC

Description of problem:
The kernel locks up when copying a large amount of data (using scp) over a
100 Mbit connection.

Version-Release number of selected component (if applicable):
All kernel releases for Fedora 7, up to the latest, kernel-2.6.22.1-27.fc7

How reproducible:
Always - just copy a large amount of data

Steps to Reproduce:
1. On server 1, start copying a large amount of data (something like 10 GB)
to server 2 using scp
2. After a while, server 2 (running F7) will lock up
  
Actual results:
After a while, server 2 (running F7) locks up

Expected results:
No lockup

Additional info:

We ran memtest on this machine's memory over several days; no problems were
reported. The machine is not overclocked or "tuned" in any way.
It's a quad-core Intel processor (Q6600) with a RAID 5 array of four 500 GB SATA disks.

The smolt profile is available at
http://smolt.fedoraproject.org/show?UUID=e783e932-c71f-4f79-b39a-2d2f0923797d

The attached stacktrace is from the only time the lockup was recorded in the
logs; usually nothing is logged at all. This specific stacktrace is from
kernel.x86_64 2.6.21-1.3228.fc7, but the same occurs in kernel-2.6.22.1-27.fc7.

Comment 1 Pedro Morais 2007-07-25 10:11:17 UTC

Created attachment 159912
stacktrace for the lockup

Comment 2 Pedro Morais 2007-07-25 11:08:08 UTC

Ran bonnie++ several times and I haven't been able to trigger the problem.
Also, ran some rather large "cp" commands and no lockup either.
Seems I can only reproduce it in the scp case.

Comment 3 Chuck Ebbert 2007-07-25 14:22:07 UTC

Can you try kernel-2.6.22.1-33.fc7 from the updates-testing repo?

Comment 4 Pedro Morais 2007-07-25 15:47:24 UTC

[pedro.morais@inovasrv3 ~]$ uname -a
Linux inovasrv3 2.6.22.1-33.fc7 #1 SMP Mon Jul 23 16:59:15 EDT 2007 x86_64
x86_64 x86_64 GNU/Linux

Same problem.

Any boot parameters that I might try? I've tried googling for the problem but
found nothing relevant.

Comment 5 Chuck Ebbert 2007-07-25 15:56:01 UTC

Please post the entire oops message from the 2.6.22.1 kernel.
We need to see the exact addresses where that one fails.

Comment 6 Pedro Morais 2007-07-25 16:35:58 UTC

Created attachment 159955
Requested stacktrace

A photo of the stacktrace is attached.

I didn't copy it by hand because I would probably make some mistake :-( and I
don't have access to a serial console.

I see a "generic_unplug_device" in this stacktrace... does the SATA layer think
we are physically unplugging a disk? (we are not, the disks are not hot swap).

Thanks

Comment 7 Chuck Ebbert 2007-07-25 16:43:03 UTC

We really need the entire message, including register contents etc.
Booting in 50-line mode may work:

  - add "vga=ask" to kernel command line then select 50-line mode at boot
  - may need to override default font -- easy way is to
      (temporarily) rename /lib/kbd/consolefonts

Comment 8 Pedro Morais 2007-07-25 18:20:14 UTC

Created attachment 159968
Another stack trace

Another oops, captured using your instructions... unfortunately, it's
different this time.
I've tried it several times and every time I get a different error, so I think
something must be corrupting memory.

Since the memory passed memtest and the disks showed no problems after a
rather violent bonnie++ test run, I'm starting to think something must be wrong
with the network card (driver?)...

Tomorrow I'll try to connect a different network card.

Thanks!

Comment 9 Chris Snook 2007-07-25 21:39:02 UTC

Can the problem be reproduced with the upstream 2.6.23-rc1 kernel, and the atl1
card?  We fixed a bug with DMA setup that may be relevant.

Comment 10 Chuck Ebbert 2007-07-25 21:59:12 UTC

The DMA patches from 2.6.23 will be in the next Fedora 7 kernel.
Hopefully they will fix this problem.

Comment 11 Pedro Morais 2007-07-26 16:38:40 UTC

Ok, I connected an old rtl8139 card, disabled the onboard NIC, and the problem
went away.
I'll wait for the next kernel to hit updates-testing to test if it fixes the issue.
Thanks.

Comment 12 Jay Cliburn 2007-07-27 22:34:48 UTC

You should be able to work around the problem by booting with mem=3900M on the
kernel command line.
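
As a rough illustration of why a limit of this size should help: assuming the value is given in megabytes (the kernel's mem= option takes a K, M or G suffix) and that on x86 the option caps the highest physical address the kernel uses, every DMA address then fits in 32 bits. The helper name below is hypothetical and is only a sketch of the arithmetic, not kernel code.

```c
#include <stdbool.h>
#include <stdint.h>

/*
 * Hypothetical helper: returns true when a memory limit given in MiB
 * keeps all usable physical memory at or below the 4 GiB boundary.
 */
static bool mem_limit_below_4gb(uint64_t limit_mib)
{
    const uint64_t four_gib = 1ULL << 32;

    /* 4 GiB expressed in MiB is 4096; compare in MiB to avoid overflow. */
    return limit_mib <= (four_gib >> 20);
}
```

With a limit of 3900 MiB the check passes; with anything above 4096 MiB it does not.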

Comment 13 Pedro Morais 2007-08-01 20:43:12 UTC

Just installed the -41 kernel; unfortunately I can't reboot the machine right
now, but I will try to see if it fixes the problem as soon as possible.
Thanks

Comment 14 Andy Lawrence 2007-08-03 23:15:42 UTC

Created attachment 160680
backtrace

Comment 15 Andy Lawrence 2007-08-03 23:16:56 UTC

Sorry for the above; the -41 kernel is no better. Backtrace attached.

Comment 16 Chris Snook 2007-08-27 16:26:42 UTC

Fixed upstream in 2.6.23-rc3-mm1, still unclear when the patch will hit mainline.

Comment 17 Chuck Ebbert 2007-08-28 20:08:48 UTC

Will put the patch from -mm in F7.

Comment 18 Chuck Ebbert 2007-09-07 13:27:23 UTC

kernel-2.6.22.5-76.fc7 is in the updates-testing repository with a fix for this bug.

Comment 19 Oliver Paukstadt 2007-09-09 05:26:46 UTC

I had a similar problem.
ASUS P5B-Plus, Quad Core, 4GB of memory.
I hit three different (unusual) kernel bugs during the first 8 hours of
operation (updating, rsyncing old box).

Looks like kernel-2.6.22.5-76.fc7 fixes this as the system did not hit a bug
last night (9 hours) and still rsyncing.

Comment 20 Christopher Brown 2007-09-20 12:21:03 UTC

Hello,

I'm reviewing this bug as part of the kernel bug triage project, an attempt to
isolate current bugs in the Fedora kernel.

http://fedoraproject.org/wiki/KernelBugTriage

I am CC'ing myself to this bug and will try to assist you in resolving it if I can.

There hasn't been much activity on this bug for a while. Could you tell me if
you are still having problems with the latest kernel? Comment #18 indicates
that it resolves the issue.

If the problem no longer exists then please close this bug, or I'll do so in a
few days if no additional information is lodged.

Cheers
Chris

Comment 21 Chuck Ebbert 2007-09-20 15:20:55 UTC

(In reply to comment #20)
> 
> I'm reviewing this bug as part of the kernel bug triage project, an attempt to
> isolate current bugs in the fedora kernel.
> 

Please don't change status when it's MODIFIED, as that means a patch for the
problem is in testing.

Comment 22 Christopher Brown 2007-09-20 15:55:05 UTC

(In reply to comment #21)

> Please don't change status when it's MODIFIED, as that means a patch for the
> problem is in testing.

Apologies - I did not realise that. I have updated the wiki to reflect this.

Cheers
Chris

Comment 23 Chris Snook 2007-09-20 17:44:35 UTC

Patch is merged in 2.6.23-rc7 with a "we don't know why, but this helps"
comment.  I figured out the problem last night and sent a patch to Jeff Garzik
to update the comment with an explanation.  Basically, the atl1 chip can do DMA
to 64-bit addresses, but all of the rings have to be within a single 4GB-aligned
block.  In other words, it's just 32-bit DMA with an option to work around
BIOSes that reserve a huge amount of low physical address space.  The clean fix
is the one we're already using, which is 32-bit DMA.
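
The constraint described above can be sketched as a simple address check. This is only an illustration under stated assumptions: the struct and function names are hypothetical and are not taken from the atl1 driver. The idea is that a set of DMA regions is acceptable only if every byte of every region has the same upper 32 address bits.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical description of one DMA ring: bus address and length in bytes. */
struct dma_region {
    uint64_t addr;
    uint64_t len;
};

/* True if [start, start + len) does not cross a 4GB-aligned boundary. */
static bool in_single_4gb_block(uint64_t start, uint64_t len)
{
    if (len == 0)
        return true;

    uint64_t last = start + len - 1;
    if (last < start)               /* address arithmetic wrapped around */
        return false;

    /* Same upper 32 bits means same 4GB-aligned block. */
    return (start >> 32) == (last >> 32);
}

/* True if all regions lie inside the same 4GB-aligned block as the first. */
static bool regions_share_4gb_block(const struct dma_region *r, size_t n)
{
    for (size_t i = 0; i < n; i++) {
        if (!in_single_4gb_block(r[i].addr, r[i].len))
            return false;
        if ((r[i].addr >> 32) != (r[0].addr >> 32))
            return false;
    }
    return true;
}
```

Restricting the device to 32-bit DMA satisfies this check trivially, because every address then has its upper 32 bits equal to zero.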

Comment 24 Pedro Morais 2007-09-21 07:50:34 UTC

Hi,

Unfortunately I'll only be able to test this at the end of this month; that's
when I'll have access to this machine.

Thanks,
Pedro Morais

Comment 25 Pedro Morais 2007-09-27 11:56:34 UTC

Just checked it; did some stress tests (copied large amounts of data) that
previously would always kill the server.
Everything is OK; we can close this one.
