Description of problem:
With a 2.6.18-92.1.6.el5PAE vmlinux and a 2.6.18-92.el5PAE vmcore, crash ran into an infinite loop and eventually ran out of memory:

# crash /usr/lib/debug/lib/modules/2.6.18-92.1.6.el5PAE/vmlinux /var/crash/2008-07-31-03\:56/vmcore
...
      KERNEL: /usr/lib/debug/lib/modules/2.6.18-92.1.6.el5PAE/vmlinux
    DUMPFILE: /var/crash/2008-07-31-03:56/vmcore
        CPUS: 2
        DATE: Thu Jul 31 03:56:29 2008
      KERNEL: /usr/lib/debug/lib/modules/2.6.18-92.1.6.el5PAE/vmlinux
    DUMPFILE: /var/crash/2008-07-31-03:56/vmcore
        CPUS: 2
        DATE: Thu Jul 31 03:56:29 2008
      KERNEL: /usr/lib/debug/lib/modules/2.6.18-92.1.6.el5PAE/vmlinux
    DUMPFILE: /var/crash/2008-07-31-03:56/vmcore
        CPUS: 2
        DATE: Thu Jul 31 03:56:29 2008
      KERNEL: /usr/lib/debug/lib/modules/2.6.18-92.1.6.el5PAE/vmlinux
    DUMPFILE: /var/crash/2008-07-31-03:56/vmcore
        CPUS: 2
        DATE: Thu Jul 31 03:56:29 2008
...

It would be nice if crash were more robust in this situation.

Version-Release number of selected component (if applicable):
crash-4.0-5.0.3

How reproducible:
always
I have the machine (dell-pe830-02.rhts.bos.redhat.com) and the vmcore reserved for the next two days, in case you want to have a look.
I have been on vacation for the last couple of weeks. It appears that dell-pe830-02.rhts.bos.redhat.com has been re-assigned. Can you recreate a vmcore and vmlinux file and make them available please?
Although you mention "vmlinux and vmcore Mismatch" in the BZ subject line, you have provided no evidence of that fact. In the future, can you please post the complete output of the crash command? Also, when debugging initialization issues, it's typically helpful if you also re-run the command with the -d debug flag, i.e., as in "crash -d3 ...".
Created attachment 313528 [details]
full output from crash -d3

This is the full output generated by:

crash -d3 /usr/lib/debug/lib/modules/2.6.18-92.1.6.el5PAE/vmlinux /var/crash/2008-08-06-08\:07/vmcore >output 2>&1

By "vmlinux and vmcore mismatch", I mean that the above vmcore was generated from the RHEL-5.2 GA kernel, kernel-PAE-2.6.18.el5, while the vmlinux was from a RHEL-5.2.z kernel, kernel-PAE-debuginfo-2.6.18-92.1.6.el5. You can find both files at:

http://porkchop.devel.redhat.com/qa/qa/qcai/bz/457371
If the above link does not work, you can fetch them over SSH from:

porkchop.devel.redhat.com:/mnt/redhat/qa/qa/qcai/bz/457371
OK, thanks for the vmlinux/vmcore pair. This one was truly bizarre.

It is legitimate to use a non-matching vmlinux with a vmcore, although when that is done, the relevant System.map file is usually added to the crash command line in order to get the "correct" symbol values to match the vmcore.

In this case, the symbol values from the vmlinux (2.6.18-92.1.6.el5PAE) are only *slightly* different from those of the vmcore (2.6.18-92.el5PAE), but one of the symbols that changed was "cfq_slice_async", which is used to calculate the kernel's "HZ" value. Because the wrong address was used, an incorrect value of 0 was read for cfq_slice_async, and as a result the saved "HZ" value used by crash was calculated to be 0.

Later, when the UPTIME: display did its calculation, it divided the kernel's "jiffies_64" value by the pre-calculated "hz" value, causing a SIGFPE exception. That in turn caused a setjmp() operation, and it ended up going into an endless loop doing the same thing.

In any case, if the default HZ value is kept in place, then the problem goes away, and a crash session can be initiated:

$ ./crash vmlinux vmcore

crash 4.0-6.3a
Copyright (C) 2002, 2003, 2004, 2005, 2006, 2007, 2008 Red Hat, Inc.
Copyright (C) 2004, 2005, 2006 IBM Corporation
Copyright (C) 1999-2006 Hewlett-Packard Co
Copyright (C) 2005, 2006 Fujitsu Limited
Copyright (C) 2006, 2007 VA Linux Systems Japan K.K.
Copyright (C) 2005 NEC Corporation
Copyright (C) 1999, 2002, 2007 Silicon Graphics, Inc.
Copyright (C) 1999, 2000, 2001, 2002 Mission Critical Linux, Inc.
This program is free software, covered by the GNU General Public License,
and you are welcome to change it and/or distribute copies of it under
certain conditions. Enter "help copying" to see the conditions.
This program has absolutely no warranty. Enter "help warranty" for details.

GNU gdb 6.1
Copyright 2004 Free Software Foundation, Inc.
GDB is free software, covered by the GNU General Public License, and you are
welcome to change it and/or distribute copies of it under certain conditions.
Type "show copying" to see the conditions.
There is absolutely no warranty for GDB. Type "show warranty" for details.
This GDB was configured as "i686-pc-linux-gnu"...

      KERNEL: vmlinux
    DUMPFILE: vmcore
        CPUS: 2
        DATE: Wed Aug 6 08:07:17 2008
      UPTIME: 00:06:48
LOAD AVERAGE: 0.36, 0.36, 0.15
       TASKS: 95
    NODENAME: dell-pe830-01.rhts.bos.redhat.com
     RELEASE: 2.6.18-92.el5PAE
     VERSION: #1 SMP Tue Apr 29 13:31:02 EDT 2008
     MACHINE: i686 (3200 Mhz)
      MEMORY: 0
       PANIC: "SysRq : Trigger a crashdump"
         PID: 2713
     COMMAND: "bash"
        TASK: f7180000 [THREAD_INFO: f603c000]
         CPU: 1
       STATE: TASK_RUNNING (SYSRQ)

crash>

Other commands may fail, but at least the session can be brought up and potentially debugged using the "wrong" vmlinux file.

This patch addresses the problem:

Index: kernel.c
===================================================================
RCS file: /nfs/projects/cvs/crash/kernel.c,v
retrieving revision 1.185
diff -u -r1.185 kernel.c
--- kernel.c 11 Apr 2008 15:21:22 -0000 1.185
+++ kernel.c 6 Aug 2008 20:12:21 -0000
@@ -933,6 +933,10 @@
         }
 
         if (CRASHDEBUG(1)) {
+                error(WARNING,
+                    "\ncannot find matching kernel version in %s file:\n\n",
+                    namelist);
+
                 fprintf(fp, "verify_namelist:\n");
                 fprintf(fp, "/proc/version:\n%s\n", kt->proc_version);
                 fprintf(fp, "utsname version: %s\n", kt->utsname.version);
Index: task.c
===================================================================
RCS file: /nfs/projects/cvs/crash/task.c,v
retrieving revision 1.143
diff -u -r1.143 task.c
--- task.c 4 Apr 2008 20:26:33 -0000 1.143
+++ task.c 6 Aug 2008 20:23:01 -0000
@@ -306,12 +306,15 @@
                 get_symbol_data("cfq_slice_async", sizeof(int),
                         &cfq_slice_async);
-                machdep->hz = cfq_slice_async * 25;
 
-                if (CRASHDEBUG(2))
-                        fprintf(fp,
-                            "cfq_slice_async exitsts: setting hz to %d\n",
-                            machdep->hz);
+                if (cfq_slice_async) {
+                        machdep->hz = cfq_slice_async * 25;
+
+                        if (CRASHDEBUG(2))
+                                fprintf(fp,
+                                    "cfq_slice_async exists: setting hz to %d\n",
+                                    machdep->hz);
+                }
         }
 
         if (VALID_MEMBER(runqueue_arrays))
Index: tools.c
===================================================================
RCS file: /nfs/projects/cvs/crash/tools.c,v
retrieving revision 1.61
diff -u -r1.61 tools.c
--- tools.c 14 May 2008 17:53:15 -0000 1.61
+++ tools.c 6 Aug 2008 20:17:36 -0000
@@ -4436,6 +4436,11 @@
         if (CRASHDEBUG(2))
                 error(INFO, "convert_time: %lld (%llx)\n", count, count);
 
+        if (!machdep->hz) {
+                sprintf(buf, "(cannot calculate: unknown HZ value)");
+                return buf;
+        }
+
         total = (count)/(ulonglong)machdep->hz;
 
         days = total / SEC_DAYS;
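
For anyone reading along, the two guards in the patch can be tried out in isolation. The following is only a minimal standalone sketch, not the actual crash source: struct machdep_sketch, derive_hz(), format_uptime() and the DEFAULT_HZ value of 1000 are hypothetical names and numbers chosen for illustration. It keeps the default HZ when cfq_slice_async reads back as 0, and refuses to divide when HZ is 0.

```c
#include <assert.h>
#include <stddef.h>
#include <stdio.h>

/* Hypothetical default, for illustration only; not taken from crash. */
#define DEFAULT_HZ 1000

/* Hypothetical stand-in for the one machdep field used here. */
struct machdep_sketch {
    unsigned int hz;
};

/*
 * Mirror of the task.c guard: only override HZ when the value read
 * from the dump is non-zero. A zero read (for example, from a wrong
 * symbol address) leaves the existing value in place.
 */
static void derive_hz(struct machdep_sketch *m, int cfq_slice_async)
{
    if (cfq_slice_async)
        m->hz = (unsigned int)cfq_slice_async * 25;
}

/*
 * Mirror of the tools.c guard: convert a jiffies count to a
 * [days, ]HH:MM:SS string, returning a message instead of dividing
 * by zero when HZ is unknown.
 */
static const char *format_uptime(const struct machdep_sketch *m,
                                 unsigned long long jiffies,
                                 char *buf, size_t len)
{
    unsigned long long total, days, hours, mins, secs;

    assert(buf != NULL && len > 0);

    if (!m->hz) {
        snprintf(buf, len, "(cannot calculate: unknown HZ value)");
        return buf;
    }

    total = jiffies / m->hz;          /* whole seconds */
    days  = total / 86400;
    total %= 86400;
    hours = total / 3600;
    total %= 3600;
    mins  = total / 60;
    secs  = total % 60;

    if (days)
        snprintf(buf, len, "%llu days, %02llu:%02llu:%02llu",
                 days, hours, mins, secs);
    else
        snprintf(buf, len, "%02llu:%02llu:%02llu", hours, mins, secs);
    return buf;
}
```

With m.hz initialized to DEFAULT_HZ, calling derive_hz(&m, 0) leaves HZ untouched, so a later format_uptime() call still has a usable divisor; if HZ does end up 0 by some other path, the caller gets the "unknown HZ value" string rather than a SIGFPE.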
An advisory has been issued which should help the problem described in this bug report. This report is therefore being closed with a resolution of ERRATA. For more information on the solution and/or where to find the updated files, please follow the link below. You may reopen this bug report if the solution does not work for you. http://rhn.redhat.com/errata/RHBA-2009-0240.html