Bug 449111
| Summary: | [RHEL5.2] makedumpfile corrupts vmcore on ia64: crash's bt fails to unwind | | |
|---|---|---|---|
| Product: | Red Hat Enterprise Linux 5 | Reporter: | Kiyoshi Ueda <kueda> |
| Component: | kexec-tools | Assignee: | Neil Horman <nhorman> |
| Status: | CLOSED ERRATA | QA Contact: | |
| Severity: | high | Docs Contact: | |
| Priority: | urgent | | |
| Version: | 5.2 | CC: | anderson, coughlan, cward, ishida-sxc, junichi.nomura, kueda, m-ikeda, mikeda, nhorman, oomichi, qcai, tachibana, tao, tatsu-ab1 |
| Target Milestone: | rc | Keywords: | ZStream |
| Target Release: | --- | | |
| Hardware: | ia64 | | |
| OS: | Linux | | |
| Whiteboard: | | | |
| Fixed In Version: | | Doc Type: | Bug Fix |
| Doc Text: | | Story Points: | --- |
| Clone Of: | | Environment: | |
| Last Closed: | 2009-01-20 20:58:21 UTC | Type: | --- |
| Regression: | --- | Mount Type: | --- |
| Documentation: | --- | CRM: | |
| Verified Versions: | | Category: | --- |
| oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: | |
| Cloudforms Team: | --- | Target Upstream Version: | |
| Embargoed: | | | |
| Bug Depends On: | | | |
| Bug Blocks: | 457233 | | |
| Attachments: | | | |
Description
Kiyoshi Ueda
2008-05-30 14:03:42 UTC

Have you tried this without makedumpfile? I'd like to be sure that it only happens with makedumpfile, and is not a more systemic problem with kexec. Thanks!

No problem without makedumpfile, so I think it is a makedumpfile problem.

Has it been ascertained that it's due to corruption -- or could it be caused by a missing page?

+1 to Dave's comment. The use of makedumpfile without options is particularly interesting, given that that should make makedumpfile act effectively as a no-op. Can you take the vmcore you created in comment #2 and run it through makedumpfile by hand without any -c, -d, or -E option? Compare the sizes of the two; I would expect them to be identical. If not, that would seem to suggest that makedumpfile is removing something that it shouldn't.

I forget (or rather never knew) -- what is the difference between: (1) makedumpfile (2) makedumpfile -E (i.e., without any other arguments in both cases)? And in the case of the "corrupt" vmcore, does "bt" fail to show the register set for *all* tasks? For just the active tasks? Or just the panic task?

Created attachment 307237 [details]
Command results of filesize comparison and crash 'bt -a'

Re: Comment#3
Probably it's due to corruption, but not sure yet. It's still under investigation in NEC.

Re: Comment#4 and Comment#5
o The sizes of the following vmcores are different.
    1) vmcore without using makedumpfile        : 7965380204 bytes
    2) vmcore filtered from (1) by 'makedumpfile'    : 7970694360 bytes
    3) vmcore filtered from (1) by 'makedumpfile -E' : 7965392492 bytes
o "bt" fails to show the register set for *all* tasks when the vmcore is corrupted. Please see the attached file for the actual results.

Created attachment 308179 [details]
Add the check of overlapping load segment.
Hi, I investigated this problem, and I found that makedumpfile cannot output valid data around an overlapping PT_LOAD area. The following data is from my problematic /proc/vmcore; paddr [0x4000000 - 0x4638ce0] is overlapping:

    PT_LOAD(1): Paddr [0x4000000 - 0x4638ce0]
    PT_LOAD(2): Paddr [0x4000000 - 0x4db3000]

The crash utility (4.0-6.3) gets an invalid UNW_LENGTH(hdr) (== 0) at build_script(), and unw_decode() is not executed. In my test environment, the invalid UNW_LENGTH(hdr) is read at physical address 0x463b058. This address is contained in PT_LOAD(2), but makedumpfile outputs the dump data around it from PT_LOAD(1) to the dumpfile. makedumpfile outputs dump data page by page without checking that consecutive pages belong to the same PT_LOAD. The attached patch adds the check logic and fixes this problem.

> The following data is my problematic /proc/vmcore.
> Paddr [0x4000000 - 0x4638ce0] is overlapping.
> PT_LOAD(1): Paddr [0x4000000 - 0x4638ce0]
> PT_LOAD(2): Paddr [0x4000000 - 0x4db3000
Just to clarify, when you say, "problematic /proc/vmcore" you mean
the problematic vmcore that makedumpfile created, correct?
In other words, the original /proc/vmcore did not have any
overlapping PT_LOAD segments, correct?
The patch looks good to me.

Dave, judging by the patch, the answer to your question is 'yes'. The patch changes the segments that are written to disk, rather than the segments that are read in from /proc/vmcore itself.

This bugzilla has Keywords: Regression. Since no regressions are allowed between releases, it is also being proposed as a blocker for this release. Please resolve ASAP.

> The patch looks good to me. Dave, judging by the patch, the answer to your
> question is 'yes'. The patch makes changes to the segments that are written
> to disk, rather than the segments that are read in from /proc/vmcore itself.
Right, I understand, and I'm not disputing the patch.
But given that the "makedumpfile with no options" should pretty much emulate
the same PT_LOAD segments that are registered in the original /proc/vmcore,
I'm wondering how the overlap/confusion would arise to begin with?
Could it have something to do with a p_memsz/p_filesz in the original
vmcore's PT_LOAD segments not being page-aligned values? Where does the
0x4638ce0 come from? (more specifically, the "ce0")
Created attachment 308264 [details]
results of 'objdump -x' against the original vmcore before filtering by makedumpfile

Re: Comment#9
Although Ken'ichi may give a better answer, I think the answer is 'no'. The original vmcore before filtering by makedumpfile has the overlapping segments. See the attached file for details.

I'm sorry, but I don't see any overlapping segments in that output (but I may just be glossing over them). Can you point them out to me? Thanks

Kiyoshi -- now I understand. The ia64 vmcore contains a separate PT_LOAD segment for the mapped kernel text and static data in region 5:

    vaddr: 0xa000000100000000  paddr: 0x0000000004000000  memsz: 0x0000000000638ce0

and there's also this overlapping unity-mapped segment here in region 7:

    vaddr: 0xe000000004000000  paddr: 0x0000000004000000  memsz: 0x0000000000db3000

I forgot about that. And that's also where the "ce0" comes into play, as it is associated with the "end" symbol in the mapped kernel segment. It seems like that region 5 section should probably be rounded up to a page boundary, but there may be some compelling reason that it's not.

Dave pointed out the overlapping segments in Comment#15, answering the question in Comment#14. Back to ASSIGNED.

yep, I see them now, thanks for that! I'll check this in as soon as pm acks it.

fixed in 1.102pre-25.el5.

I tested 1.102pre-29.el5 and confirmed this problem is solved.
On ia64 RHEL5.2GA, subcommand 'bt' works fine, like the following:

    crash> bt
    PID: 5744  TASK: e00000017b080000  CPU: 0  COMMAND: "bash"
     #0 [BSP:e00000017b0813e8] machine_kexec at a000000100059760
     #1 [BSP:e00000017b0813c8] crash_kexec at a0000001000ca690
     #2 [BSP:e00000017b0813a0] sysrq_handle_crashdump at a0000001003b3900
     #3 [BSP:e00000017b081350] __handle_sysrq at a0000001003b3140
     #4 [BSP:e00000017b081320] write_sysrq_trigger at a0000001001f2850
     #5 [BSP:e00000017b0812d0] vfs_write at a0000001001644e0
     #6 [BSP:e00000017b081258] sys_write at a000000100165030
     #7 [BSP:e00000017b081258] __ia64_trace_syscall at a00000010000bdb0
      EFRAME: e00000017b087e40
          B0: 20000000001564a0      CR_IIP: a000000000010620
     CR_IPSR: 00001213085a6010      CR_IFS: 0000000000000008
      AR_PFS: c000000000000008      AR_RSC: 000000000000000f
     AR_UNAT: 0000000000000000     AR_RNAT: 0000000000000000
      AR_CCV: 0000000000000000     AR_FPSR: 0009804c8a70033f
      LOADRS: 0000000001b80000 AR_BSPSTORE: 600007ffffa542d0
          B6: 200000000021e780          B7: a000000000010640
          PR: 0000000000590a41          R1: 2000000000290238
          R2: 60000fffffa4f9a0          R3: 60000fffffa4f9b0
          R8: 0000000000000001          R9: 0000000000000004
         R10: 0000000000000000         R11: c000000000000512
         R12: 60000fffffa4f9b0         R13: 2000000000314e00
         R14: 00000000000001f5         R15: 0000000000000403
         R16: 60000fffffa4f890         R17: 400000000000ec76
         R18: 400000000000ebb0         R19: 20000000003103f8
         R20: 6000000000021260         R21: 0000000000000030
         R22: 2000000000310410         R23: 6000000000021238
         R24: 0000000000000000         R25: 0000000000000000
         R26: c000000000000004         R27: 000000000000000f
         R28: a000000000010620         R29: 00001213085a6010
         R30: 0000000000000004         R31: 00000000005a0a41
          F6: 000000000000000000000    F7: 000000000000000000000
          F8: 000000000000000000000    F9: 000000000000000000000
         F10: 000000000000000000000   F11: 000000000000000000000
     #8 [BSP:e00000017b081258] __kernel_syscall_via_break at a000000000010620
    crash>

Thank you for merging the patch.

~~~ Attention Partners! ~~~

Please test this URGENT / HIGH priority bug at your earliest convenience to ensure it makes it into the upcoming RHEL 5.3 release. The fix should be present in Partner Snapshot #2 (kernel*-122), available NOW at ftp://partners.redhat.com. As we are approaching the end of the RHEL 5.3 test cycle, it is critical that you report back testing results as soon as possible. If you have VERIFIED the fix, please add PartnerVerified to the Bugzilla Keywords field to indicate this. If you find that this issue has not been properly fixed, set the bug status to ASSIGNED with a comment describing the issues you encountered. All NEW issues encountered (not part of this bug fix) should have a new bug created with the proper keywords and flags set to trigger a review for their inclusion in the upcoming RHEL 5.3 or another future release. Post a link in this bugzilla pointing to the new issue to ensure it is not overlooked. For any additional questions, speak with your Partner Manager.

~~ Snapshot 3 is now available ~~

Snapshot 3 is now available for Partner Testing and should contain a fix that resolves this bug. ISOs are available as usual at ftp://partners.redhat.com. Your testing feedback is vital! Please let us know if you encounter any NEW issues (file a new bug) or if you have VERIFIED the fix is present and functioning as expected (add the PartnerVerified keyword). Ping your Partner Manager with any additional questions. Thanks!

NEC confirmed that this problem was resolved on RHEL5.3 Snapshot2.

An advisory has been issued which should help the problem described in this bug report. This report is therefore being closed with a resolution of ERRATA. For more information on the solution and/or where to find the updated files, please follow the link below. You may reopen this bug report if the solution does not work for you.

http://rhn.redhat.com/errata/RHBA-2009-0105.html