Red Hat Bugzilla – Bug 249511
DMA broken with atl1 network adapter and >4GB of memory
Last modified: 2007-11-30 17:12:11 EST
Description of problem:
Kernel locks up when copying a large amount of data (using scp) over a 100 Mbit
network.
Version-Release number of selected component (if applicable):
All kernel releases for Fedora 7, up to the latest kernel-188.8.131.52-27.fc7
How reproducible:
Always - just use a large amount of data
Steps to Reproduce:
1. On server 1, start copying a large amount of data (something
like 10 GB) to server 2 using scp
2. After a while, server 2 (running F7) will lock up
Actual results:
After a while, server 2 (running F7) locks up.
We used memtest to test the memory of this machine, over several days; no
problems reported. The machine is not overclocked or "tuned" in any way.
It's a quad-core Intel processor (Q6600) with a RAID 5 array of four 500 GB SATA disks.
The smolt profile is available in
The attached stacktrace is from the only time it got registered in the logs;
usually it's not even registered. This specific stacktrace is from kernel.x86_64
2.6.21-1.3228.fc7, but the same occurs in kernel-184.108.40.206-27.fc7.
Created attachment 159912 [details]
stacktrace for the lockup
Ran bonnie++ several times and I haven't been able to trigger the problem.
Also, ran some rather large "cp" commands and no lockup either.
Seems I can only reproduce it in the scp case.
Can you try kernel-220.127.116.11-33.fc7 from the updates-testing repo?
[pedro.morais@inovasrv3 ~]$ uname -a
Linux inovasrv3 18.104.22.168-33.fc7 #1 SMP Mon Jul 23 16:59:15 EDT 2007 x86_64
x86_64 x86_64 GNU/Linux
Any boot parameters that I might try? I've tried googling for the problem but
found nothing relevant.
Please post the entire oops message from the 22.214.171.124 kernel.
We need to see the exact addresses where that one fails.
Created attachment 159955 [details]
A photo of the stacktrace is attached.
I didn't copy it by hand because I would probably make some mistake :-( and I
don't have access to a serial console.
I see a "generic_unplug_device" in this stacktrace... does the SATA layer think
we are physically unplugging a disk? (we are not, the disks are not hot swap).
We really need the entire message, including register contents etc.
Booting in 50-line mode may work:
- add "vga=ask" to kernel command line then select 50-line mode at boot
- may need to override default font -- easy way is to
(temporarily) rename /lib/kbd/consolefonts
Created attachment 159968 [details]
Another stack trace
Another oops, using your instructions... unfortunately, it's different this time.
I've tried it several times and every time I get a different error, so I think
something must be corrupting memory.
Since the memory passed memtest and the disks showed no problems after a
rather violent bonnie++ test run, I'm starting to think something must be wrong
with the network card (driver?)...
Tomorrow I'll try to connect a different network card.
Can the problem be reproduced with the upstream 2.6.23-rc1 kernel, and the atl1
card? We fixed a bug with DMA setup that may be relevant.
The DMA patches from 2.6.23 will be in the next Fedora 7 kernel.
Hopefully they will fix this problem.
Ok, I connected an old rtl8139 card, disabled the onboard NIC, and the problem
went away.
I'll wait for the next kernel to hit updates-testing to test if it fixes the issue.
You should be able to work around the problem by booting with mem=3900 on the
kernel command line.
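A rough sketch of what this workaround amounts to, in Python rather than anything the kernel actually runs: capping usable RAM below 4 GB means no buffer can sit above the 32-bit boundary. The helper names and the sample command line are hypothetical. The parser assumes the `mem=nn[KMG]` suffix convention from the kernel's parameter documentation and, as a simplification, lets the last `mem=` token win. If that convention applies, a bare `mem=3900` would be read as 3900 bytes, so the suggestion was presumably meant as `mem=3900M`.

```python
import re

FOUR_GB = 1 << 32  # first address that no longer fits in 32 bits

# Multipliers for the nn[KMG] suffix convention (assumed; a bare number is bytes).
_SUFFIX = {"": 1, "K": 1 << 10, "M": 1 << 20, "G": 1 << 30}


def parse_mem_param(cmdline):
    """Return the byte limit given by a mem=nn[KMG] token, or None if absent."""
    limit = None
    for token in cmdline.split():
        match = re.fullmatch(r"mem=(\d+)([KMGkmg]?)", token)
        if match:
            # Simplifying assumption: a later mem= token overrides an earlier one.
            limit = int(match.group(1)) * _SUFFIX[match.group(2).upper()]
    return limit


def caps_below_4gb(cmdline):
    """True when the command line limits usable RAM to less than 4 GB."""
    limit = parse_mem_param(cmdline)
    return limit is not None and limit < FOUR_GB


# Hypothetical command line; only the mem= token matters here.
example = "ro root=/dev/sda1 mem=3900M quiet"
print(parse_mem_param(example), caps_below_4gb(example))
```

On a running system the string to check would come from /proc/cmdline.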
Just installed the -41 kernel; unfortunately I can't reboot the machine right
now, but I will try to see if it fixes the problem as soon as possible.
Created attachment 160680 [details]
Sorry about the above: -41 is no better; backtrace attached.
Fixed upstream in 2.6.23-rc3-mm1; it is still unclear when the patch will hit mainline.
Will put the patch from -mm in F7.
kernel-126.96.36.199-76.fc7 is in the updates-testing repository with a fix for this bug.
I had a similar problem.
ASUS P5B-Plus, Quad Core, 4GB of memory.
I hit three different (unusual) kernel bugs during the first 8 hours of
operation (updating, rsyncing old box).
Looks like kernel-188.8.131.52-76.fc7 fixes this, as the system did not hit a bug
last night (9 hours) and is still rsyncing.
I'm reviewing this bug as part of the kernel bug triage project, an attempt to
isolate current bugs in the fedora kernel.
I am CC'ing myself to this bug and will try and assist you in resolving it if I can.
There hasn't been much activity on this bug for a while. Could you tell me if
you are still having problems with the latest kernel? Comment #18 indicates
that it resolves the issue.
If the problem no longer exists, please close this bug; otherwise I'll do so in a
few days if no additional information is lodged.
(In reply to comment #20)
> I'm reviewing this bug as part of the kernel bug triage project, an attempt to
> isolate current bugs in the fedora kernel.
Please don't change status when it's MODIFIED, as that means a patch for the
problem is in testing.
(In reply to comment #21)
> Please don't change status when it's MODIFIED, as that means a patch for the
> problem is in testing.
Apologies - I did not realise that. I have updated the wiki to reflect this.
Patch is merged in 2.6.23-rc7 with a "we don't know why, but this helps"
comment. I figured out the problem last night and sent a patch to Jeff Garzik
to update the comment with an explanation. Basically, the atl1 chip can do DMA
to 64-bit addresses, but all of the rings have to be within a single 4GB-aligned
block. In other words, it's just 32-bit DMA with an option to work around
BIOSes that reserve a huge amount of low physical address space. The clean fix
is the one we're already using, which is 32-bit DMA.
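To make the constraint concrete, here is a minimal Python sketch (an illustration, not driver code) of the check the explanation above implies: every ring must start and end inside the same 4GB-aligned block, which means the upper 32 bits of all ring addresses must match. The function names and the sample addresses are invented for the example. It also shows why plain 32-bit DMA is the clean fix: anything below 4 GB is in block 0 by construction.

```python
FOUR_GB = 1 << 32  # size of one 4GB-aligned block of bus address space


def block_index(addr):
    """Return which 4GB-aligned block an address falls in (its upper 32 bits)."""
    return addr >> 32


def rings_share_block(rings):
    """rings: iterable of (start_address, length_in_bytes) pairs.

    True when every byte of every ring lies inside one 4GB-aligned block.
    """
    blocks = set()
    for start, length in rings:
        if length <= 0:
            raise ValueError("ring length must be positive")
        last = start + length - 1  # address of the ring's final byte
        blocks.add(block_index(start))
        blocks.add(block_index(last))
    return len(blocks) == 1


def fits_32bit_dma(rings):
    """True when every ring ends below 4 GB, i.e. is reachable with 32-bit DMA."""
    return all(start + length - 1 < FOUR_GB for start, length in rings)


# Hypothetical placements, chosen only to exercise each case.
low_rings = [(0x0000_0000_7FF0_0000, 0x4000), (0x0000_0000_7FF8_0000, 0x2000)]
split_rings = [(0x0000_0000_FFFF_F000, 0x1000), (0x0000_0001_0000_0000, 0x1000)]
straddling = [(0x0000_0000_FFFF_F000, 0x2000)]

for name, rings in [("low", low_rings), ("split", split_rings), ("straddling", straddling)]:
    print(name, rings_share_block(rings), fits_32bit_dma(rings))
```

With more than 4 GB of RAM, nothing stops separately allocated rings from landing in different blocks (the "split" case), which would fit the seemingly random memory corruption reported earlier in this bug.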
Unfortunately I'll only be able to test this at the end of this month, when
I'll have access to this machine.
Just checked it; did some stress tests (copied large amounts of data) that
previously would always kill the server.
Everything is OK; we can close this one.