Bug 457371 - Infinite Loop when vmlinux and vmcore Mismatch
Summary: Infinite Loop when vmlinux and vmcore Mismatch
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: Red Hat Enterprise Linux 5
Classification: Red Hat
Component: crash
Version: 5.2
Hardware: All
OS: Linux
Priority: low
Severity: low
Target Milestone: rc
Assignee: Dave Anderson
QA Contact:
URL:
Whiteboard:
Depends On:
Blocks:
 
Reported: 2008-07-31 08:38 UTC by Qian Cai
Modified: 2009-01-20 22:13 UTC (History)

Fixed In Version:
Doc Type: Bug Fix
Doc Text:
Clone Of:
Environment:
Last Closed: 2009-01-20 22:13:52 UTC
Target Upstream Version:
Embargoed:


Attachments
full output from crash -d3 (1.43 MB, text/plain)
2008-08-06 06:21 UTC, Qian Cai


Links
System ID Private Priority Status Summary Last Updated
Red Hat Product Errata RHBA-2009:0240 0 normal SHIPPED_LIVE crash bug fix update 2009-01-20 16:06:42 UTC

Description Qian Cai 2008-07-31 08:38:37 UTC
Description of problem:
With a 2.6.18-92.1.6.el5PAE vmlinux and a 2.6.18-92.el5PAE vmcore, crash ran
into an infinite loop and eventually ran out of memory:

# crash /usr/lib/debug/lib/modules/2.6.18-92.1.6.el5PAE/vmlinux /var/crash/2008-07-31-03\:56/vmcore

...
      KERNEL: /usr/lib/debug/lib/modules/2.6.18-92.1.6.el5PAE/vmlinux
    DUMPFILE: /var/crash/2008-07-31-03:56/vmcore
        CPUS: 2
        DATE: Thu Jul 31 03:56:29 2008
      KERNEL: /usr/lib/debug/lib/modules/2.6.18-92.1.6.el5PAE/vmlinux
    DUMPFILE: /var/crash/2008-07-31-03:56/vmcore
        CPUS: 2
        DATE: Thu Jul 31 03:56:29 2008
      KERNEL: /usr/lib/debug/lib/modules/2.6.18-92.1.6.el5PAE/vmlinux
    DUMPFILE: /var/crash/2008-07-31-03:56/vmcore
        CPUS: 2
        DATE: Thu Jul 31 03:56:29 2008
      KERNEL: /usr/lib/debug/lib/modules/2.6.18-92.1.6.el5PAE/vmlinux
    DUMPFILE: /var/crash/2008-07-31-03:56/vmcore
        CPUS: 2
        DATE: Thu Jul 31 03:56:29 2008
...

It would be nice to be more robust.

Version-Release number of selected component (if applicable):
crash-4.0-5.0.3

How reproducible:
always

Comment 1 Qian Cai 2008-07-31 08:49:13 UTC
I have the machine (dell-pe830-02.rhts.bos.redhat.com) and the vmcore reserved
for the next two days, in case you want to have a look.

Comment 2 Dave Anderson 2008-08-05 13:49:27 UTC
I have been on vacation for the last couple of weeks.
It appears that dell-pe830-02.rhts.bos.redhat.com has
been re-assigned.

Can you recreate a vmcore and vmlinux file and make them
available please?

Comment 3 Dave Anderson 2008-08-05 13:57:33 UTC
Although you mention "vmlinux and vmcore Mismatch" in the BZ subject line,
you have provided no evidence of that fact.
  
In the future, can you please post the complete output of the crash command?
Also, when debugging initialization issues, it's typically helpful if you
also re-run the command with the -d debug flag, i.e., as in "crash -d3 ...".

Comment 4 Qian Cai 2008-08-06 06:21:56 UTC
Created attachment 313528 [details]
full output from crash -d3

This is the full output generated by:

crash -d3 /usr/lib/debug/lib/modules/2.6.18-92.1.6.el5PAE/vmlinux /var/crash/2008-08-06-08\:07/vmcore >output 2>&1

By "vmlinux and vmcore mismatch", I mean that the above vmcore was generated by the RHEL-5.2 GA kernel (kernel-PAE-2.6.18-92.el5), while the vmlinux came from a RHEL-5.2.z kernel (kernel-PAE-debuginfo-2.6.18-92.1.6.el5).

You can find both files at:
http://porkchop.devel.redhat.com/qa/qa/qcai/bz/457371

Comment 5 Qian Cai 2008-08-06 07:05:40 UTC
If the above link does not work, you can fetch the files over SSH:

porkchop.devel.redhat.com:/mnt/redhat/qa/qa/qcai/bz/457371

Comment 6 Dave Anderson 2008-08-06 20:36:51 UTC
OK, thanks for the vmlinux/vmcore pair.

This one was truly bizarre.  It is legitimate to use a non-matching vmlinux
with a vmcore, although when done, it's usually accompanied by adding the
relevant System.map file to the crash command line in order to get the
"correct" symbol values to match the vmcore.  In this case, the symbol
values from vmlinux (2.6.18-92.1.6.el5PAE) are only *slightly* different
from the vmcore (2.6.18-92.el5PAE), but one of the symbols that changed was
"cfq_slice_async", which is used to calculate the kernel's "HZ" value
correctly.  In this case, however, an incorrect value of 0 was
read for cfq_slice_async (because the wrong address was used), and as
a result, the saved "HZ" value used by crash was calculated to be 0.
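
The derivation described above can be sketched in isolation. This is only
an illustration, not the crash source: derive_hz() and its current_hz
parameter are hypothetical names, and the sketch relies on the kernel
defining cfq_slice_async as HZ / 25, which is why multiplying by 25
recovers HZ. It rejects any non-positive reading, which is slightly
stricter than a plain non-zero check.

```c
/*
 * Hypothetical helper (not part of crash): recover HZ from the
 * cfq_slice_async value read out of the dump.
 *
 * The kernel sets cfq_slice_async to HZ / 25, so value * 25 gives HZ.
 * A non-positive reading suggests the symbol address came from a
 * mismatched vmlinux, so the caller's current HZ is kept instead.
 */
int derive_hz(int cfq_slice_async, int current_hz)
{
    if (cfq_slice_async <= 0)
        return current_hz;          /* unusable reading: keep what we had */

    return cfq_slice_async * 25;
}
```

With a reading of 40 this yields an HZ of 1000; with the bogus reading of 0
seen here, the caller's existing value is returned unchanged.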

Then later, when the UPTIME: display did its calculation, it
divided the kernel's "jiffies_64" value by the pre-calculated "hz"
value, causing a SIGFPE exception.  That in turn caused a setjmp()
operation, and it ended up going into an endless loop doing the
same thing.
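
The restart behaviour can be illustrated with a small standalone sketch.
It is not taken from crash: restart_point, on_sigfpe() and
run_with_restart() are hypothetical names, the fault is simulated with
raise(SIGFPE) rather than a real division by zero, and the max_attempts
cap exists only so that the example terminates. Without the cap, the same
step would fault and restart forever, because nothing changes between
attempts.

```c
#define _POSIX_C_SOURCE 200809L
#include <setjmp.h>
#include <signal.h>
#include <string.h>

/* Hypothetical stand-ins for a tool's restart point and retry counter. */
static sigjmp_buf restart_point;
static volatile sig_atomic_t attempts;

/* On SIGFPE, jump back to the restart point instead of terminating. */
static void on_sigfpe(int signo)
{
    (void)signo;
    siglongjmp(restart_point, 1);
}

/*
 * Re-runs the same faulting step after every SIGFPE, up to max_attempts
 * times. Returns the number of attempts made, or -1 if the handler could
 * not be installed.
 */
int run_with_restart(int max_attempts)
{
    struct sigaction sa, old;

    memset(&sa, 0, sizeof sa);
    sa.sa_handler = on_sigfpe;
    sigemptyset(&sa.sa_mask);
    if (sigaction(SIGFPE, &sa, &old) != 0)
        return -1;

    attempts = 0;
    /* savemask = 1, so SIGFPE is unblocked again after each jump back */
    sigsetjmp(restart_point, 1);

    if (attempts < max_attempts) {
        attempts++;
        raise(SIGFPE);              /* simulated fault; control resumes above */
    }

    sigaction(SIGFPE, &old, NULL);  /* restore the previous disposition */
    return attempts;
}
```

Every attempt ends in the handler and resumes at the restart point, so the
only way out is the counter; that is the kind of guard the endless loop
described above lacked.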

In any case, if the default HZ value is kept in place, then the problem
goes away, and a crash session can be initiated:

$ ./crash vmlinux vmcore

crash 4.0-6.3a
Copyright (C) 2002, 2003, 2004, 2005, 2006, 2007, 2008  Red Hat, Inc.
Copyright (C) 2004, 2005, 2006  IBM Corporation
Copyright (C) 1999-2006  Hewlett-Packard Co
Copyright (C) 2005, 2006  Fujitsu Limited
Copyright (C) 2006, 2007  VA Linux Systems Japan K.K.
Copyright (C) 2005  NEC Corporation
Copyright (C) 1999, 2002, 2007  Silicon Graphics, Inc.
Copyright (C) 1999, 2000, 2001, 2002  Mission Critical Linux, Inc.
This program is free software, covered by the GNU General Public License,
and you are welcome to change it and/or distribute copies of it under
certain conditions.  Enter "help copying" to see the conditions.
This program has absolutely no warranty.  Enter "help warranty" for details.
 
GNU gdb 6.1
Copyright 2004 Free Software Foundation, Inc.
GDB is free software, covered by the GNU General Public License, and you are
welcome to change it and/or distribute copies of it under certain conditions.
Type "show copying" to see the conditions.
There is absolutely no warranty for GDB.  Type "show warranty" for details.
This GDB was configured as "i686-pc-linux-gnu"...

      KERNEL: vmlinux                           
    DUMPFILE: vmcore
        CPUS: 2
        DATE: Wed Aug  6 08:07:17 2008
      UPTIME: 00:06:48
LOAD AVERAGE: 0.36, 0.36, 0.15
       TASKS: 95
    NODENAME: dell-pe830-01.rhts.bos.redhat.com
     RELEASE: 2.6.18-92.el5PAE
     VERSION: #1 SMP Tue Apr 29 13:31:02 EDT 2008
     MACHINE: i686  (3200 Mhz)
      MEMORY: 0
       PANIC: "SysRq : Trigger a crashdump"
         PID: 2713
     COMMAND: "bash"
        TASK: f7180000  [THREAD_INFO: f603c000]
         CPU: 1
       STATE: TASK_RUNNING (SYSRQ)

crash> 

Other commands may fail, but at least the session can be brought up
and potentially debugged using the "wrong" vmlinux file.

This patch addresses the problem:

Index: kernel.c
===================================================================
RCS file: /nfs/projects/cvs/crash/kernel.c,v
retrieving revision 1.185
diff -u -r1.185 kernel.c
--- kernel.c    11 Apr 2008 15:21:22 -0000      1.185
+++ kernel.c    6 Aug 2008 20:12:21 -0000
@@ -933,6 +933,10 @@
        }
 
         if (CRASHDEBUG(1)) {
+               error(WARNING, 
+                   "\ncannot find matching kernel version in %s file:\n\n",
+                       namelist);
+
                        fprintf(fp, "verify_namelist:\n");
                 fprintf(fp, "/proc/version:\n%s\n", kt->proc_version);
                 fprintf(fp, "utsname version: %s\n", kt->utsname.version);
Index: task.c
===================================================================
RCS file: /nfs/projects/cvs/crash/task.c,v
retrieving revision 1.143
diff -u -r1.143 task.c
--- task.c      4 Apr 2008 20:26:33 -0000       1.143
+++ task.c      6 Aug 2008 20:23:01 -0000
@@ -306,12 +306,15 @@
 
                get_symbol_data("cfq_slice_async", sizeof(int), 
                        &cfq_slice_async);
-               machdep->hz = cfq_slice_async * 25; 
 
-               if (CRASHDEBUG(2))
-                       fprintf(fp, 
-                           "cfq_slice_async exitsts: setting hz to %d\n", 
-                               machdep->hz);
+               if (cfq_slice_async) {
+                       machdep->hz = cfq_slice_async * 25; 
+
+                       if (CRASHDEBUG(2))
+                               fprintf(fp, 
+                                   "cfq_slice_async exists: setting hz to %d\n", 
+                                       machdep->hz);
+               }
        }
 
        if (VALID_MEMBER(runqueue_arrays)) 
Index: tools.c
===================================================================
RCS file: /nfs/projects/cvs/crash/tools.c,v
retrieving revision 1.61
diff -u -r1.61 tools.c
--- tools.c     14 May 2008 17:53:15 -0000      1.61
+++ tools.c     6 Aug 2008 20:17:36 -0000
@@ -4436,6 +4436,11 @@
        if (CRASHDEBUG(2))
                error(INFO, "convert_time: %lld (%llx)\n", count, count);
 
+       if (!machdep->hz) {
+               sprintf(buf, "(cannot calculate: unknown HZ value)");
+               return buf;
+       }
+
         total = (count)/(ulonglong)machdep->hz;
 
         days = total / SEC_DAYS;
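
For reference, the guarded uptime conversion can be tried on its own with
a sketch like the following. It is an approximation under stated
assumptions, not the convert_time() code from tools.c: format_uptime(), its
parameters, and the exact output format are hypothetical, and any initial
jiffies offset the kernel applies is ignored.

```c
#include <stdio.h>

#define SEC_MINUTES 60ULL
#define SEC_HOURS   (60ULL * SEC_MINUTES)
#define SEC_DAYS    (24ULL * SEC_HOURS)

/*
 * Hypothetical helper: format a jiffies count as an uptime string.
 * The hz check comes first so the division below can never fault.
 */
char *format_uptime(unsigned long long jiffies, unsigned int hz,
                    char *buf, size_t len)
{
    unsigned long long total, days, hours, mins, secs;

    if (hz == 0) {
        snprintf(buf, len, "(cannot calculate: unknown HZ value)");
        return buf;
    }

    total = jiffies / hz;           /* whole seconds of uptime */
    days  = total / SEC_DAYS;
    total %= SEC_DAYS;
    hours = total / SEC_HOURS;
    total %= SEC_HOURS;
    mins  = total / SEC_MINUTES;
    secs  = total % SEC_MINUTES;

    if (days)
        snprintf(buf, len, "%llu days, %02llu:%02llu:%02llu",
                 days, hours, mins, secs);
    else
        snprintf(buf, len, "%02llu:%02llu:%02llu", hours, mins, secs);

    return buf;
}
```

For example, 408000 jiffies at an HZ of 1000 is 408 seconds, which formats
as 00:06:48, while an HZ of 0 produces the "cannot calculate" message
instead of a SIGFPE.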

Comment 12 errata-xmlrpc 2009-01-20 22:13:52 UTC
An advisory has been issued which should help the problem
described in this bug report. This report is therefore being
closed with a resolution of ERRATA. For more information
on the solution and/or where to find the updated files,
please follow the link below. You may reopen this bug report
if the solution does not work for you.

http://rhn.redhat.com/errata/RHBA-2009-0240.html

