Description of problem:
When I run crash on RHEL5 with a RHEL4U2 kernel namelist and dumpfile, it fails to start with the following error:

# crash ./usr/lib/debug/lib/modules/2.6.9-22.0.1.ELsmp/vmlinux vmcore
...
crash: read error: kernel virtual address: 1020385a004 type: "tss_struct ist array"

The vmcore file is from an 8-way server. I'm running crash on a ProLiant DL360 G4p that also has 8 CPUs. This looks like BZ154566, but that was resolved in crash 3.10-13.10.

The vmcore file I got from my customer is incomplete, and I am not sure whether that is causing the problem. I have requested a full kernel crash dump, but it looks like that may take a while. Will you be able to verify?

Version-Release number of selected component (if applicable):
crash-4.0-3.14

How reproducible:
Always

Steps to Reproduce:
1. Run crash on x86_64 RHEL5 with x86_64 kernel namelist and kernel dumpfile.

Actual results:
crash fails during initialization with a "tss_struct ist array" read error.

Expected results:
crash session should come up normally.

Additional info:
# cat /etc/redhat-release
Red Hat Enterprise Linux Server release 5 (Tikanga)
# uname -a
Linux dl360g4p.gsslab.rdu.redhat.com 2.6.18-8.1.1.el5xen #1 SMP Mon Feb 26 20:51:53 EST 2007 x86_64 x86_64 x86_64 GNU/Linux
# file vmcore
vmcore: ELF 64-bit LSB core file AMD x86-64, version 1 (SYSV), SVR4-style, from 'nux'

I have attached a crash -d7 log as well.
Created attachment 150868
crash -d7 ./usr/lib/debug/lib/modules/2.6.9-22.0.1.ELsmp/vmlinux vmcore
Thanks for the "-d7" log -- that's usually my first request...

If we strip out just the dumpfile memory accesses, we see this:

<readmem: ffffffff804d51d0, KVADDR, "xtime", 16, (FOE), 9ef570>
<readmem: ffffffff803cc1a0, KVADDR, "system_utsname", 390, (ROE), 9efb5c>
<readmem: ffffffff803cc180, KVADDR, "linux_banner", 8, (FOE), 7fff58fe6c48>
<readmem: ffffffff80315dc2, KVADDR, "accessible check", 8, (ROE|Q), 7fff58fe68c8>
<readmem: ffffffff80315dc2, KVADDR, "readstring characters", 574, (ROE|Q), 7fff58fe58b0>
<readmem: ffffffff804d3080, KVADDR, "cpu_pda entry", 128, (FOE), a20540>
<readmem: ffffffff804d3100, KVADDR, "cpu_pda entry", 128, (FOE), a20540>
<readmem: ffffffff804d3180, KVADDR, "cpu_pda entry", 128, (FOE), a20540>
<readmem: ffffffff804d3200, KVADDR, "cpu_pda entry", 128, (FOE), a20540>
<readmem: ffffffff804d3280, KVADDR, "cpu_pda entry", 128, (FOE), a20540>
<readmem: ffffffff804d3300, KVADDR, "cpu_pda entry", 128, (FOE), a20540>
<readmem: ffffffff804d3380, KVADDR, "cpu_pda entry", 128, (FOE), a20540>
<readmem: ffffffff804d3400, KVADDR, "cpu_pda entry", 128, (FOE), a20540>
<readmem: 10010000084, KVADDR, "tss_struct ist array", 56, (FOE), 9fb090>
<readmem: 1020385a004, KVADDR, "tss_struct ist array", 56, (FOE), 9fb0c8>
crash: read error: kernel virtual address: 1020385a004 type: "tss_struct ist array"

The last kernel virtual address access, at 1020385a004, failed.

The x86_64 has two "unity-mapped" virtual address spaces, one beginning at ffffffff80000000 (__START_KERNEL_map) and the second one beginning at 10000000000. The first one maps the kernel's static text and data, and the second one maps all physical memory into virtual memory. In both cases, the base address of the mapping can be stripped off, and that leaves the physical memory address. So the largest kernel text/data virtual address read was at ffffffff804d51d0 ("xtime"), or 4d51d0 physical.
The last two reads were generic virtual address accesses: the first one, at 10010000084 (10000084 physical), was read successfully, while the second one, at 1020385a004 (20385a004 physical), failed.

The netdump format is as simple as it gets -- it contains a page-sized ELF header, followed by the contents of physical memory. So the size of the dumpfile should equal the size of physical memory plus a page for the ELF header data. Since the last, fatal read attempt was at 20385a004 physical, the dumpfile would have to be over 8GB (0x200000000) in length for that read to succeed. The other addresses shown in the log for the "level4_pgt" page tables are all in the 15GB region, so I'm guessing that this system has ~16GB.

So I'm presuming that the vmcore-incomplete is too small -- just do an "ls -l" on it.
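
As a rough illustration of the arithmetic above, the following Python sketch pulls the readmem entries out of a "-d7" log, strips the mapping base to get the physical address, and reports which reads a file of a given size could not satisfy. It is not part of crash: the function names are hypothetical, the two base addresses are assumed values for this 2.6.9-era x86_64 kernel rather than something read from the vmlinux, and only the two unity-mapped ranges discussed here are handled (vmalloc and module addresses are not).

```python
import re

# Assumed layout constants for a 2.6.9-era x86_64 kernel (not read from vmlinux).
START_KERNEL_MAP = 0xFFFFFFFF80000000  # base of the kernel text/data mapping
PAGE_OFFSET = 0x10000000000            # base of the mapping of all physical memory
PAGE_SIZE = 4096                       # netdump: one page of ELF header, then memory

# Matches the address, name, and decimal size fields of a "-d7" readmem line.
READMEM_RE = re.compile(r'<readmem: ([0-9a-f]+), KVADDR, "([^"]+)", (\d+),')


def virt_to_phys(vaddr):
    """Strip the mapping base from a unity-mapped kernel virtual address."""
    if vaddr >= START_KERNEL_MAP:
        return vaddr - START_KERNEL_MAP
    if vaddr >= PAGE_OFFSET:
        return vaddr - PAGE_OFFSET
    raise ValueError("not a unity-mapped kernel address: %#x" % vaddr)


def required_file_size(vaddr, length):
    """Smallest netdump file size that could satisfy a read of `length` bytes."""
    return PAGE_SIZE + virt_to_phys(vaddr) + length


def unreadable_reads(log_text, file_size):
    """Yield (name, vaddr, paddr) for each logged read the file is too short for."""
    for m in READMEM_RE.finditer(log_text):
        vaddr = int(m.group(1), 16)
        length = int(m.group(3))
        if required_file_size(vaddr, length) > file_size:
            yield m.group(2), vaddr, virt_to_phys(vaddr)


if __name__ == "__main__":
    sample = (
        '<readmem: 10010000084, KVADDR, "tss_struct ist array", 56, (FOE), 9fb090>\n'
        '<readmem: 1020385a004, KVADDR, "tss_struct ist array", 56, (FOE), 9fb0c8>\n'
    )
    hypothetical_size = 5 * 1024 ** 3  # a made-up 5GB truncated file
    for name, vaddr, paddr in unreadable_reads(sample, hypothetical_size):
        print("%s: virtual %x, physical %x" % (name, vaddr, paddr))
```

With any file size under 8GB, only the second "tss_struct ist array" read is reported, which matches the failure seen in the log. In practice the size would come from "ls -l" on the vmcore rather than a made-up number.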
Thanks for the analysis. I learnt a lot. Yes, the incomplete vmcore is only 4.8GB and I was expecting a 16GB vmcore.
Yep, that's unfortunate...

Even if the crash code were hacked to skip the "ist" (interrupt stack) initialization, it's doubtful that it would get much farther than that, given that it's only got about a quarter of the physical memory.

For 32-bit x86 systems, you can often analyze vmcore-incomplete files as long as they contain all of "lowmem", i.e., at least 896MB. You wouldn't be able to access module data, since that typically gets vmalloc'd out of highmem, but the crash session will initialize, and, since all kernel stacks are in lowmem, you could get backtraces for all tasks. In fact, most commands work just fine, since kernel static data, slab memory, etc. come out of lowmem. Highmem will only contain user memory and vmalloc'd kernel memory (mostly for modules).

But for 64-bit systems, stuff gets allocated from all over the physical memory map, and despite this failure just being "ist" related, crash would invariably bump into another piece of critical data if that were ignored.
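
The 32-bit rule of thumb above can be written as a quick size check. This is only a sketch: the function name is hypothetical, and it assumes the same netdump layout described earlier in this bug (one page of ELF header followed by physical memory) with a 4096-byte page.

```python
PAGE_SIZE = 4096                  # assumed size of the netdump ELF header page
LOWMEM_BYTES = 896 * 1024 * 1024  # lowmem figure quoted above for 32-bit x86


def covers_lowmem(file_size):
    """True if a truncated 32-bit x86 netdump file still holds all of lowmem."""
    return file_size >= PAGE_SIZE + LOWMEM_BYTES
```

A file that passes this check is a candidate for analysis on 32-bit x86; as noted above, no comparable threshold applies to x86_64.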