Bug 213937
Summary: | 2.6.18-1.2200.fc5 kernel crashes when copying files over NFS | | |
---|---|---|---|
Product: | [Fedora] Fedora | Reporter: | H.J. Lu <hongjiu.lu> |
Component: | kernel | Assignee: | Dave Jones <davej> |
Status: | CLOSED ERRATA | QA Contact: | Brian Brock <bbrock> |
Severity: | medium | Priority: | medium |
Version: | 5 | CC: | esandeen, jarod, pfrields, steved, wtogami |
Hardware: | x86_64 | OS: | Linux |
Doc Type: | Bug Fix | Last Closed: | 2006-11-12 05:48:06 UTC |
Description
H.J. Lu
2006-11-03 20:51:24 UTC
Not sure if NFS is really at fault, or some other bit in the stack. Most of the trace there looks ext3-related, but that may just be where we happened to be when something else blew up. I've got a few boxes running the same kernel that transfer several gigabytes of data (lots of large video files) to and fro via nfs-atop-ext3 every day without a problem. Eric? Steve?

Whatever is at fault, it appears to be the same issue as bug 213901.

Did this bug show up under 2.6.18-1.2798.fc6 (bug 213901) on the same hardware or different hardware?

hrmph, dying in ext3's walk_page_buffers... the other was in journal_commit_transaction, either way looks like it could be ext-related. How reproducible is it? Can you get a dump?

Looking at the disassembly, I -think- this looks like a page with a corrupted buffer. We wind up trying to follow a bh->b_this_page where bh is not an address (ff0f375aff0f375a). One other thing that bugs me is I think %rdx should contain the "from" variable passed in, and it's never changed, only tested.... but it contains 00000000ff0f375a (half of the above bad bh...) but the calling function only ever passes in 0. But, I could still learn a thing or two about x86_64 assy so I could be missing something. A dump would help here, I think.

This bug and bug 213937 are on the same hardware. Kernel 2.6.18-1.2739.el5 from RHEL 5 beta survived the same operation.

My bugs may be related to http://people.redhat.com/esandeen/traces/ Have those ext3 bugs been fixed?

Yes, it could certainly be related... I didn't realize that the kernels you were testing didn't yet have that fix in them. :) The ext3 traces were from bug 209647, and the fix for that should be in RHEL5 beta as of kernel-2.6.18-1.2739.el5 and later, so if it survived when you tested it, that's good news. The fix is -not- in 2.6.18-1.2200.fc5, but should be in the next update. I'm not certain when it will show up in a released FC6 kernel, but probably the next update if not already.
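The observation about the bad bh value can be checked with a little arithmetic. The sketch below is illustrative only: it assumes the 48-bit canonical virtual addressing used by x86_64 processors of that period, and the helper names `is_canonical` and `halves` are made up for this example. Under that assumption the value ff0f375aff0f375a is not a canonical address, so dereferencing it would fault, and its low 32 bits equal the value seen in %rdx.

```python
# Sketch only: models 48-bit canonical addressing on x86_64. The 48-bit
# width is an assumption about the reporter's hardware, not taken from the bug.

BAD_BH = 0xFF0F375AFF0F375A  # bad buffer_head pointer quoted in the thread
RDX = 0x00000000FF0F375A     # %rdx value quoted in the thread


def is_canonical(addr, va_bits=48):
    """Return True if bits 63..(va_bits - 1) are all copies of one bit."""
    top = addr >> (va_bits - 1)               # bits 63..47 when va_bits == 48
    all_ones = (1 << (64 - va_bits + 1)) - 1  # 17 one-bits when va_bits == 48
    return top == 0 or top == all_ones


def halves(value):
    """Split a 64-bit value into (high 32 bits, low 32 bits)."""
    return (value >> 32) & 0xFFFFFFFF, value & 0xFFFFFFFF


if __name__ == "__main__":
    high, low = halves(BAD_BH)
    print("canonical address:", is_canonical(BAD_BH))
    print("high half: %08x  low half: %08x" % (high, low))
    print("%rdx equals low half:", RDX == low)
```

Both halves of the pointer carry the same 32-bit pattern, which fits the reading that the memory was overwritten with repeated data rather than holding a slightly damaged pointer.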
How reliably could you reproduce it on the older kernels?

It happens every time when I copy a directory tree of more than 10GB with "cp -af ..." under an FC5/FC6 kernel.

Good to know, thanks. If it's 100% with those kernels, and 0% with the latest RHEL5 (and soon-to-be-released FC5/FC6 updates...) then it most likely is the same problem. Since you've seen it pass on RHEL5, can you test with later FC6 kernels too? 1.2798 and beyond in FC6 should have the fix too, see http://download.fedora.redhat.com/pub/fedora/linux/core/development/x86_64/os/Fedora/RPMS/ for example... Thanks, -Eric

See comment #2. 1.2798 in FC6 has a similar problem.

ah, sorry, missed that. So you can hit this 100% of the time on 2.6.18-1.2798.fc6 but 0% of the time on kernel-2.6.18-1.2739.el5? odd... they both should have the previously mentioned ext3 fix in them. This bug and Bug #213901 do have slightly different signatures, they may not be the same thing... I'll see if I can hit this here, if so maybe I'll just do a search through the kernels, first, to see when it showed up. Thanks, -Eric

I am quite certain that both 2.6.18-1.2798.fc6 and 2.6.18-1.2200.fc5 panic when I copy a big tree while kernel-2.6.18-1.2739.el5 is OK. Looking at the kernel changelog, it doesn't look like kernel-2.6.18-1.2798.fc6 has the fix in kernel-2.6.18-1.2739.el5. Does 1.2814 FC6 kernel in devel have the fix in ELF?

    [root@gnu-2 yum.repos.d]# rpm -q --changelog kernel-2.6.18-1.2798.fc6 | head -10
    * Mon Oct 16 2006 Dave Jones <davej>
    - Silence another noisy boot-time printk. (#210810)
    - Remove broken VIA quirk that prevented booting on some EPIAs (#210817)
    - Fix JBD crash with 1K blocksize filesystems. (#209005)

    [root@gnu-4 export]# rpm -q --changelog kernel-2.6.18-1.2739.el5 | head -10
    * Thu Oct 26 2006 Don Zickus <dzickus> [2.6.18-1.2739.el5]
    - SHPCHP driver doesn't work (Keiichiro Tokunaga) [210478]
    - ext3/jbd panic (Eric Sandeen) [209647]
    - Oops in nfs_cancel_commit_list (Jeff Layton) [210679]
    - kernel Soft lockup detected on corrupted ext3 filesystem (Eric Sandeen) [212053]
    - CIFS doesn't work (Steve Dickson) [211070]

The changelogs don't quite look the same, but:

    - Fix JBD crash with 1K blocksize filesystems. (#209005)

and

    - ext3/jbd panic (Eric Sandeen) [209647]

are actually the same issue; bug 209005 and bug 209647 are clones, one for fc6 and one for rhel5. I'm not sure what you mean by "the fix in ELF?" Oh... EL5? Yes, it does:

    [esandeen@host esandeen]$ rpm -qp --changelog kernel-2.6.18-1.2814.fc6.src.rpm | grep "209005\|209647"
    - Fix JBD crash with 1K blocksize filesystems. (#209005)

*** Bug 213901 has been marked as a duplicate of this bug. ***

2.6.18-1.2837 from FC6 fixes the problem for me.

Good news, thanks. Out of curiosity, does the filesystem being served have a block size < page size?

No, I am using the default 4K block size on Intel64.

Should be fixed in 2.6.18-1.2239.fc5, now in updates.
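The changelog comparison done by hand in the thread could be scripted along the following lines. This is a minimal sketch, not a tool used by anyone in the report: the function name `fix_entries` is hypothetical, and the sample text is limited to the changelog excerpts quoted above. It looks for either clone bug number (209005 for FC6, 209647 for RHEL5) in the two bracket styles those changelogs use.

```python
import re

# Clone bugs tracking the same ext3/JBD fix, per the thread.
FIX_BUG_IDS = {"209005", "209647"}

# Changelog excerpts exactly as quoted in the thread.
FC6_2798 = """\
* Mon Oct 16 2006 Dave Jones <davej>
- Silence another noisy boot-time printk. (#210810)
- Remove broken VIA quirk that prevented booting on some EPIAs (#210817)
- Fix JBD crash with 1K blocksize filesystems. (#209005)
"""

EL5_2739 = """\
* Thu Oct 26 2006 Don Zickus <dzickus> [2.6.18-1.2739.el5]
- SHPCHP driver doesn't work (Keiichiro Tokunaga) [210478]
- ext3/jbd panic (Eric Sandeen) [209647]
- Oops in nfs_cancel_commit_list (Jeff Layton) [210679]
- kernel Soft lockup detected on corrupted ext3 filesystem (Eric Sandeen) [212053]
- CIFS doesn't work (Steve Dickson) [211070]
"""

# Matches "(#209005)" (Fedora style) and "[209647]" (RHEL style).
BUG_REF = re.compile(r"[#\[](\d{6})[)\]]")


def fix_entries(changelog, bug_ids=FIX_BUG_IDS):
    """Return the changelog lines that reference any of the given bug IDs."""
    hits = []
    for line in changelog.splitlines():
        if set(BUG_REF.findall(line)) & set(bug_ids):
            hits.append(line.strip())
    return hits


if __name__ == "__main__":
    for name, text in (("2798.fc6", FC6_2798), ("2739.el5", EL5_2739)):
        print(name, "->", fix_entries(text))
```

A match only shows that the patch is listed in the changelog. As the thread itself illustrates, 2.6.18-1.2798.fc6 lists the fix and still crashed for the reporter, so a listing is not proof that a given crash is resolved.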