Bug 249511

Summary: DMA broken with atl1 network adapter and >4GB of memory
Product: [Fedora] Fedora Reporter: Pedro Morais <pedro.morais>
Component: kernel Assignee: Kernel Maintainer List <kernel-maint>
Status: CLOSED CURRENTRELEASE QA Contact: Fedora Extras Quality Assurance <extras-qa>
Severity: high Docs Contact:
Priority: low    
Version: 7 CC: chris.brown, csnook, dr.diesel, jcliburn, pstadt
Target Milestone: ---   
Target Release: ---   
Hardware: x86_64   
OS: Linux   
URL: http://smolt.fedoraproject.org/show?UUID=e783e932-c71f-4f79-b39a-2d2f0923797d
Whiteboard:
Fixed In Version: 2.6.22.5-76.fc7 Doc Type: Bug Fix
Doc Text:
Story Points: ---
Clone Of: Environment:
Last Closed: 2007-09-20 15:20:55 UTC Type: ---
Regression: --- Mount Type: ---
Documentation: --- CRM:
Verified Versions: Category: ---
oVirt Team: --- RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: --- Target Upstream Version:
Embargoed:
Attachments:
Description Flags
stacktrace for the lockup none
Requested stacktrace none
Another stack trace none
backtrace none

Description Pedro Morais 2007-07-25 10:11:17 UTC
Description of problem:
The kernel locks up when copying a large amount of data (using scp) over a
100 Mbit connection.

Version-Release number of selected component (if applicable):
All kernel releases for Fedora 7, up to the latest kernel-2.6.22.1-27.fc7

How reproducible:
Always - just use a large amount of data

Steps to Reproduce:
1. On server 1, start copying a large amount of data (something like 10 GB)
to server 2, using scp
2. After a while, server 2 (running F7) will lock up
  
Actual results:
After a while, server 2 (running F7) locks up

Expected results:
No lockup

Additional info:

We used memtest to test the memory of this machine, over several days; no
problems reported. The machine is not overclocked or "tuned" in any way.
It's a quad core Intel processor (Q6600) with a RAID 5 array of four 500 GB SATA disks.

The smolt profile is available in
http://smolt.fedoraproject.org/show?UUID=e783e932-c71f-4f79-b39a-2d2f0923797d

The attached stacktrace was from the only time it got registered in the logs;
usually it's not even registered. This specific stacktrace is from kernel.x86_64
2.6.21-1.3228.fc7, but the same occurs in kernel-2.6.22.1-27.fc7.

Comment 1 Pedro Morais 2007-07-25 10:11:17 UTC
Created attachment 159912 [details]
stacktrace for the lockup

Comment 2 Pedro Morais 2007-07-25 11:08:08 UTC
Ran bonnie++ several times and I haven't been able to trigger the problem.
Also, ran some rather large "cp" commands and no lockup either.
Seems I can only reproduce it in the scp case.

Comment 3 Chuck Ebbert 2007-07-25 14:22:07 UTC
Can you try kernel-2.6.22.1-33.fc7 from the updates-testing repo?

Comment 4 Pedro Morais 2007-07-25 15:47:24 UTC
[pedro.morais@inovasrv3 ~]$ uname -a
Linux inovasrv3 2.6.22.1-33.fc7 #1 SMP Mon Jul 23 16:59:15 EDT 2007 x86_64
x86_64 x86_64 GNU/Linux

Same problem.

Any boot parameters that I might try? I've tried googling for the problem but
found nothing relevant.

Comment 5 Chuck Ebbert 2007-07-25 15:56:01 UTC
Please post the entire oops message from the 2.6.22.1 kernel.
We need to see the exact addresses where that one fails.

Comment 6 Pedro Morais 2007-07-25 16:35:58 UTC
Created attachment 159955 [details]
Requested stacktrace

A photo of the stacktrace is attached.

I didn't copy it by hand because I would probably make some mistake :-( and I
don't have access to a serial console.

I see a "generic_unplug_device" in this stacktrace... does the SATA layer think
we are physically unplugging a disk? (we are not, the disks are not hot swap).

Thanks

Comment 7 Chuck Ebbert 2007-07-25 16:43:03 UTC
We really need the entire message, including register contents etc.
Booting in 50-line mode may work:

  - add "vga=ask" to kernel command line then select 50-line mode at boot
  - may need to override default font -- easy way is to
      (temporarily) rename /lib/kbd/consolefonts


Comment 8 Pedro Morais 2007-07-25 18:20:14 UTC
Created attachment 159968 [details]
Another stack trace

Another oops, using your instructions... unfortunately, it's different this
time.
I've tried it several times and every time I get a different error, so I think
something must be corrupting memory.

Since the memory passed the memtest and the disks showed no problems after a
rather violent bonnie++ test run, I'm starting to think something must be wrong
with the network card (driver?)...

Tomorrow I'll try to connect a different network card.

Thanks!

Comment 9 Chris Snook 2007-07-25 21:39:02 UTC
Can the problem be reproduced with the upstream 2.6.23-rc1 kernel, and the atl1
card?  We fixed a bug with DMA setup that may be relevant.

Comment 10 Chuck Ebbert 2007-07-25 21:59:12 UTC
The DMA patches from 2.6.23 will be in the next Fedora 7 kernel.
Hopefully they will fix this problem.

Comment 11 Pedro Morais 2007-07-26 16:38:40 UTC
Ok, I connected an old rtl8139 card, disabled the onboard NIC, and the problem
went away.
I'll wait for the next kernel to hit updates-testing to test if it fixes the issue.
Thanks.

Comment 12 Jay Cliburn 2007-07-27 22:34:48 UTC
You should be able to work around the problem by booting with mem=3900M on the
kernel command line.
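
The arithmetic behind this workaround can be sketched as follows. This is only an illustration, assuming the limit is meant as 3900 megabytes and that capping memory this way keeps every physical address the kernel uses below the cap; the helper names highest_addr_for_limit_mb and fits_32bit_dma are hypothetical and are not part of the kernel or the atl1 driver.

```c
#include <stdbool.h>
#include <stdint.h>

/* First address that no longer fits in 32 bits (4 GB). */
#define FOUR_GB (1ULL << 32)

/*
 * Highest byte address that can be used when memory is capped at
 * limit_mb megabytes (1 MB = 1024 * 1024 bytes here).
 * Assumes limit_mb > 0.
 */
uint64_t highest_addr_for_limit_mb(uint64_t limit_mb)
{
    return limit_mb * 1024ULL * 1024ULL - 1;
}

/* True if addr can be expressed as a 32-bit DMA address. */
bool fits_32bit_dma(uint64_t addr)
{
    return addr < FOUR_GB;
}
```

With a cap of 3900 MB the highest address is 0xF3BFFFFF, which is below the 4 GB line, so under these assumptions no buffer handed to the card would need more than 32 address bits.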

Comment 13 Pedro Morais 2007-08-01 20:43:12 UTC
Just installed the -41 kernel; unfortunately I can't reboot the machine right
now, but I will try to see if it fixes the problem as soon as possible.
Thanks

Comment 14 Andy Lawrence 2007-08-03 23:15:42 UTC
Created attachment 160680 [details]
backtrace

Comment 15 Andy Lawrence 2007-08-03 23:16:56 UTC
Sorry for the above; -41 is no better, backtrace attached.

Comment 16 Chris Snook 2007-08-27 16:26:42 UTC
Fixed upstream in 2.6.23-rc3-mm1, still unclear when the patch will hit mainline.

Comment 17 Chuck Ebbert 2007-08-28 20:08:48 UTC
Will put the patch from -mm in F7.

Comment 18 Chuck Ebbert 2007-09-07 13:27:23 UTC
kernel-2.6.22.5-76.fc7 is in the updates-testing repository with a fix for this bug.

Comment 19 Oliver Paukstadt 2007-09-09 05:26:46 UTC
I had a similar problem.
ASUS P5B-Plus, Quad Core, 4GB of memory.
I hit three different (unusual) kernel bugs during the first 8 hours of
operation (updating, rsyncing old box).

Looks like kernel-2.6.22.5-76.fc7 fixes this as the system did not hit a bug
last night (9 hours) and is still rsyncing.

Comment 20 Christopher Brown 2007-09-20 12:21:03 UTC
Hello,

I'm reviewing this bug as part of the kernel bug triage project, an attempt to
isolate current bugs in the fedora kernel.

http://fedoraproject.org/wiki/KernelBugTriage

I am CC'ing myself to this bug and will try and assist you in resolving it if I can.

There hasn't been much activity on this bug for a while. Could you tell me if
you are still having problems with the latest kernel? Comment #18 indicates
that it resolves the issue.

If the problem no longer exists then please close this bug or I'll do so in a
few days if there is no additional information lodged.

Cheers
Chris

Comment 21 Chuck Ebbert 2007-09-20 15:20:55 UTC
(In reply to comment #20)
> 
> I'm reviewing this bug as part of the kernel bug triage project, an attempt to
> isolate current bugs in the fedora kernel.
> 

Please don't change status when it's MODIFIED, as that means a patch for the
problem is in testing.

Comment 22 Christopher Brown 2007-09-20 15:55:05 UTC
(In reply to comment #21)

> Please don't change status when it's MODIFIED, as that means a patch for the
> problem is in testing.

Apologies - I did not realise that. I have updated the wiki to reflect this.

Cheers
Chris


Comment 23 Chris Snook 2007-09-20 17:44:35 UTC
Patch is merged in 2.6.23-rc7 with a "we don't know why, but this helps"
comment.  I figured out the problem last night and sent a patch to Jeff Garzik
to update the comment with an explanation.  Basically, the atl1 chip can do DMA
to 64-bit addresses, but all of the rings have to be within a single 4GB-aligned
block.  In other words, it's just 32-bit DMA with an option to work around
BIOSes that reserve a huge amount of low physical address space.  The clean fix
is the one we're already using, which is 32-bit DMA.
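
The constraint described above can be illustrated with a small user-space check. This is a minimal sketch, not code from the atl1 driver: struct ring_region, block_4g and rings_share_4g_block are hypothetical names, and the sketch assumes that no region wraps past the end of the 64-bit address space.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical description of one DMA ring: start address and size in bytes. */
struct ring_region {
    uint64_t dma_addr;
    uint64_t len;
};

/* Index of the 4GB-aligned block that contains addr (the upper 32 bits). */
uint64_t block_4g(uint64_t addr)
{
    return addr >> 32;
}

/*
 * True if every byte of every ring lies inside the same 4GB-aligned block.
 * An empty list trivially satisfies the constraint.
 */
bool rings_share_4g_block(const struct ring_region *rings, size_t n)
{
    if (n == 0)
        return true;

    uint64_t want = block_4g(rings[0].dma_addr);

    for (size_t i = 0; i < n; i++) {
        uint64_t first = rings[i].dma_addr;
        /* Last byte of the ring; a zero-length ring is checked at its start. */
        uint64_t last = rings[i].len ? first + rings[i].len - 1 : first;

        if (block_4g(first) != want || block_4g(last) != want)
            return false;
    }
    return true;
}
```

Two rings at 0x1F0000000 and 0x1FFFFF000 (4 KB each) pass the check because both sit in block 1, while a ring that starts at 0x0FFFFF800 and runs for 4 KB fails because it straddles the 4 GB line. Restricting the device to 32-bit DMA places everything in block 0, which satisfies the check by construction.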

Comment 24 Pedro Morais 2007-09-21 07:50:34 UTC
Hi,

Unfortunately I'll only be able to test this at the end of this month, when
I'll have access to this machine.

Thanks,
Pedro Morais

Comment 25 Pedro Morais 2007-09-27 11:56:34 UTC
Just checked it; did some stress tests (copied large amounts of data) that
previously would always kill the server.
Everything is ok, we can close this one.