Bug 249511
Summary: DMA broken with atl1 network adapter and >4GB of memory
Product: [Fedora] Fedora
Component: kernel
Version: 7
Hardware: x86_64
OS: Linux
Status: CLOSED CURRENTRELEASE
Severity: high
Priority: low
Reporter: Pedro Morais <pedro.morais>
Assignee: Kernel Maintainer List <kernel-maint>
QA Contact: Fedora Extras Quality Assurance <extras-qa>
CC: chris.brown, csnook, dr.diesel, jcliburn, pstadt
URL: http://smolt.fedoraproject.org/show?UUID=e783e932-c71f-4f79-b39a-2d2f0923797d
Fixed In Version: 2.6.22.5-76.fc7
Doc Type: Bug Fix
Last Closed: 2007-09-20 15:20:55 UTC

Description
Pedro Morais
2007-07-25 10:11:17 UTC
Created attachment 159912
stacktrace for the lockup

Ran bonnie++ several times and I haven't been able to trigger the problem. Also, ran some rather large "cp" commands and no lockup either. Seems I can only reproduce it in the scp case.

Can you try kernel-2.6.22.1-33.fc7 from the updates-testing repo?

[pedro.morais@inovasrv3 ~]$ uname -a
Linux inovasrv3 2.6.22.1-33.fc7 #1 SMP Mon Jul 23 16:59:15 EDT 2007 x86_64 x86_64 x86_64 GNU/Linux

Same problem. Any boot parameters that I might try? I've tried googling for the problem but found nothing relevant.

Please post the entire oops message from the 2.6.22.1 kernel. We need to see the exact addresses where that one fails.

Created attachment 159955
Requested stacktrace
A photo of the stacktrace is attached.
I didn't copy it by hand because I would probably make some mistake :-( and I
don't have access to a serial console.
I see a "generic_unplug_device" in this stacktrace... does the SATA layer think
we are physically unplugging a disk? (we are not, the disks are not hot swap).
Thanks

We really need the entire message, including register contents etc. Booting in 50-line mode may work:
- add "vga=ask" to kernel command line then select 50-line mode at boot
- may need to override default font -- easy way is to (temporarily) rename /lib/kbd/consolefonts

Created attachment 159968
Another stack trace
Another oops, using your instructions... unfortunately, it's different this
time.
I've tried it several times and every time I get a different error, so I think
something must be corrupting memory.
Since the memory passed the memtest, the disks showed no problems after a
rather violent bonnie++ test run, I'm starting to think something must be wrong
with the network card (driver?)...
Tomorrow I'll try to connect a different network card.
Thanks!

Can the problem be reproduced with the upstream 2.6.23-rc1 kernel, and the atl1 card? We fixed a bug with DMA setup that may be relevant.

The DMA patches from 2.6.23 will be in the next Fedora 7 kernel. Hopefully they will fix this problem.

Ok, I connected an old rtl8139 card, disabled the onboard NIC, and the problem went away. I'll wait for the next kernel to hit updates-testing to test if it fixes the issue. Thanks.

You should be able to work around the problem by booting with mem=3900 on the kernel command line.

Just installed the -41 kernel; unfortunately I can't reboot the machine right now, but I will try to see if it fixes the problem as soon as possible. Thanks

Created attachment 160680
backtrace

Sorry for the above; .41 is no better, backtrace attached.

Fixed upstream in 2.6.23-rc3-mm1, still unclear when the patch will hit mainline. Will put the patch from -mm in F7.

kernel-2.6.22.5-76.fc7 is in the updates-testing repository with a fix for this bug.

I had a similar problem. ASUS P5B-Plus, Quad Core, 4GB of memory. I hit three different (unusual) kernel bugs during the first 8 hours of operation (updating, rsyncing old box). Looks like kernel-2.6.22.5-76.fc7 fixes this, as the system did not hit a bug last night (9 hours) and is still rsyncing.

Hello, I'm reviewing this bug as part of the kernel bug triage project, an attempt to isolate current bugs in the fedora kernel.
http://fedoraproject.org/wiki/KernelBugTriage
I am CC'ing myself to this bug and will try and assist you in resolving it if I can. There hasn't been much activity on this bug for a while. Could you tell me if you are still having problems with the latest kernel? Comment #18 indicates that it resolves the issue. If the problem no longer exists then please close this bug, or I'll do so in a few days if there is no additional information lodged.
Cheers
Chris

(In reply to comment #20)
> I'm reviewing this bug as part of the kernel bug triage project, an attempt to
> isolate current bugs in the fedora kernel.

Please don't change status when it's MODIFIED, as that means a patch for the problem is in testing.

(In reply to comment #21)
> Please don't change status when it's MODIFIED, as that means a patch for the
> problem is in testing.

Apologies - I did not realise that. I have updated the wiki to reflect this.
Cheers
Chris

Patch is merged in 2.6.23-rc7 with a "we don't know why, but this helps" comment. I figured out the problem last night and sent a patch to Jeff Garzik to update the comment with an explanation. Basically, the atl1 chip can do DMA to 64-bit addresses, but all of the rings have to be within a single 4GB-aligned block. In other words, it's just 32-bit DMA with an option to work around BIOSes that reserve a huge amount of low physical address space. The clean fix is the one we're already using, which is 32-bit DMA.

Hi,
Unfortunately I'll only be able to test this at the end of this month; that's when I'll have access to this machine.
Thanks,
Pedro Morais

Just checked it; did some stress tests (copied large amounts of data) that previously would always kill the server. Everything is ok, we can close this one.
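
For anyone reading this later, the ring constraint described above (every ring inside one 4GB-aligned block) can be sketched as a plain address check. This is only an illustration, not code from the atl1 driver: `struct ring` and `rings_share_4gb_block` are hypothetical names, and the addresses are ordinary 64-bit integers rather than real DMA handles.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical description of one descriptor ring: bus address and size in bytes. */
struct ring {
    uint64_t addr;
    uint64_t len;
};

/*
 * Hypothetical helper (not from the atl1 driver): returns true when every
 * ring lies entirely inside the same 4GB-aligned block, i.e. the upper
 * 32 bits of the first and last byte of every ring are identical.
 */
static bool rings_share_4gb_block(const struct ring *rings, size_t n)
{
    if (n == 0)
        return true;                      /* nothing to violate the rule */

    uint64_t block = rings[0].addr >> 32; /* block number of the first ring */

    for (size_t i = 0; i < n; i++) {
        assert(rings[i].len > 0);
        /* guard against address wrap-around before computing the last byte */
        assert(rings[i].addr <= UINT64_MAX - (rings[i].len - 1));

        uint64_t last = rings[i].addr + rings[i].len - 1;
        if ((rings[i].addr >> 32) != block || (last >> 32) != block)
            return false;                 /* ring starts or ends in another block */
    }
    return true;
}
```

With more than 4GB of memory, an allocation can land above the 4GB line or straddle it, which is the case this check rejects. Restricting the device to 32-bit DMA keeps every ring in block 0, so the condition holds trivially.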