Bug 649053

Summary: KVM dumpfiles: fix x86/x86_64 cpu count determination for crashed guests
Product: Red Hat Enterprise Linux 6
Reporter: Dave Anderson <anderson>
Component: crash
Assignee: Dave Anderson <anderson>
Status: CLOSED ERRATA
QA Contact: Kernel Dump QE <kernel-dump-qe>
Severity: high
Docs Contact:
Priority: low
Version: 6.0
CC: phan, qcai
Target Milestone: rc   
Target Release: ---   
Hardware: All   
OS: Linux   
Whiteboard:
Fixed In Version: crash-5.1.1-1.el6
Doc Type: Bug Fix
Doc Text:
When creating a KVM dump file, the "virsh dump" operation marks all non-crashing CPUs as offline. Due to an incorrect use of the "cpu_online_map" mask to determine the CPU count, previous versions of the crash utility may have reported a wrong number of CPUs when analyzing dumps created by the "virsh dump" command on x86 guest systems. With this update, the underlying source code has been adapted to use the "cpu_present_map" mask instead, so that the crash utility reports the correct number of CPUs.
Last Closed: 2011-05-19 13:04:12 UTC
Bug Depends On: 649070    
Bug Blocks:    

Description Dave Anderson 2010-11-02 20:30:11 UTC
Description of problem:

When KVM guest kernels crash, the panic path calls smp_send_stop()
for the active non-crashing cpus, which offlines them (unlike in kdump,
diskdump and netdump operations).  Therefore, the kernel's
cpu_online_map cannot be used for determining how many cpus
were running when the system crashed, and the crash utility
requires a fix for correctly determining the cpu count in crashed
KVM dumpfiles.

It has been fixed upstream for x86 in crash version 5.0.9:

 - Fix for the cpu count determination in crashed x86 KVM dumpfiles, 
   where the non-crashing cpus are marked offline in the kernel's
   cpu_online_mask by smp_stop_cpu().  Depending upon the cpu number
   of the crashing task, the cpu count may be set to a value that is
   less than the number of present cpus.
   (anderson)

and inadvertently fixed for x86_64 in crash version 5.0.8:

 - Change to the manner in which the cpu count is determined for x86_64
   kernels.  SLES11 2.6.32 kernels delay the call to crash_kexec() until
   after smp_send_stop() is called by panic(), and so the cpu_online_map
   cannot be used for determining the cpu count.  With the patch, the
   cpu_present_map is used.
   (Jeffrey.Hagen)
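
The effect of the two masks can be modelled with a short sketch. This is not code from the crash utility. It assumes that the old count was derived as "highest online cpu number + 1", which is inferred only from the outputs in this report (CPUS: 1 for a crash on cpu 0, CPUS: 2 for a crash on cpu 1). The function names are hypothetical.

```python
def cpu_count_from_mask(mask: int) -> int:
    """Return (highest set bit + 1), or 0 for an empty mask.

    Assumption: this mirrors how the cpu count was derived from a cpu mask.
    """
    return mask.bit_length()


def online_mask_after_panic(present_mask: int, crashing_cpu: int) -> int:
    """Model smp_send_stop(): only the crashing cpu stays marked online."""
    return present_mask & (1 << crashing_cpu)


PRESENT_MASK = 0b1111  # a 4-cpu guest, as in the examples in this report

for crashing_cpu in (0, 1):
    online = online_mask_after_panic(PRESENT_MASK, crashing_cpu)
    print(
        f"crash on cpu {crashing_cpu}: "
        f"online-based count = {cpu_count_from_mask(online)}, "
        f"present-based count = {cpu_count_from_mask(PRESENT_MASK)}"
    )
```

Under that assumption the online-based count follows the crashing cpu number, while the present-based count stays at 4.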

Version-Release number of selected component (if applicable):

crash-5.0.0-23.el6

How reproducible:

Always -- but it depends upon which cpu has crashed.

Steps to Reproduce:
1. Forcibly crash an x86 or x86_64 guest system that has several
   cpus, and then do a "virsh dump" on the guest.
2. Run crash on the dumpfile, and note the "CPUS:" count, which will
   be based upon which cpu number has crashed, and not the actual
   number of cpus.
  
Actual results:

This "virsh dump" example shows a 4-cpu x86 guest that has crashed,
but only 1 cpu is found, because the crashing task was running
on cpu 0:

  # crash vmlinux guest32-crash

  crash 5.0.0-23.el6
  ... [ cut ] ...
        KERNEL: vmlinux        
      DUMPFILE: guest32-crash
          CPUS: 1
          DATE: Thu Oct 21 11:06:25 2010
        UPTIME: 1 days, 00:05:47
  LOAD AVERAGE: 8.87, 3.11, 1.11
         TASKS: 291
      NODENAME: localhost.localdomain
       RELEASE: 2.6.32-71.el6.i686
       VERSION: #1 SMP Wed Sep 1 01:26:34 EDT 2010
       MACHINE: i686  (2666 Mhz)
        MEMORY: 1.5 GB
         PANIC: "Oops: 0002 [#1] SMP " (check log for details)
           PID: 25708
       COMMAND: "bash"
          TASK: e9e9a560  [THREAD_INFO: f263a000]
           CPU: 0
         STATE: TASK_RUNNING (PANIC)
  
  crash> 

Expected results:

With the fix applied, all 4 cpus are found: 
  
  # crash vmlinux guest32-crash
  
  crash 5.0.9
  ... [ cut ] ...
        KERNEL: vmlinux        
      DUMPFILE: guest32-crash
          CPUS: 4
          DATE: Thu Oct 21 11:06:25 2010
        UPTIME: 1 days, 00:05:47
  LOAD AVERAGE: 8.87, 3.11, 1.11
         TASKS: 294
      NODENAME: localhost.localdomain
       RELEASE: 2.6.32-71.el6.i686
       VERSION: #1 SMP Wed Sep 1 01:26:34 EDT 2010
       MACHINE: i686  (2666 Mhz)
        MEMORY: 1.5 GB
         PANIC: "Oops: 0002 [#1] SMP " (check log for details)
           PID: 25708
       COMMAND: "bash"
          TASK: e9e9a560  [THREAD_INFO: f263a000]
           CPU: 0
         STATE: TASK_RUNNING (PANIC)
  
  crash>

Actual results:

The problem also applies to x86_64.  In this example, only the
first 2 cpus are recognized in a 4-cpu system, because the
crashing task was running on cpu 1:

  # crash vmlinux guest64-crash2
  
  crash 5.0.0-23.el6
  ... [ cut ]...
        KERNEL: vmlinux      
      DUMPFILE: guest64-crash2
          CPUS: 2
          DATE: Thu Oct 21 15:49:45 2010
        UPTIME: 04:44:53
  LOAD AVERAGE: 1.85, 0.46, 0.15
         TASKS: 299
      NODENAME: localhost.localdomain
       RELEASE: 2.6.32-71.el6.x86_64
       VERSION: #1 SMP Wed Sep 1 01:33:01 EDT 2010
       MACHINE: x86_64  (2666 Mhz)
        MEMORY: 1.5 GB
         PANIC: "Oops: 0002 [#1] SMP " (check log for details)
           PID: 7879
       COMMAND: "bash"
          TASK: ffff880036ac2080  [THREAD_INFO: ffff88003ee64000]
           CPU: 1
         STATE: TASK_RUNNING (PANIC)
  
  crash>

Expected results:

With the fixes applied, all 4 cpus are recognized:
  
  # crash vmlinux guest64-crash2
  
  crash 5.0.9
  ... [ cut ] ...
        KERNEL: vmlinux      
      DUMPFILE: guest64-crash2
          CPUS: 4
          DATE: Thu Oct 21 15:49:45 2010
        UPTIME: 04:44:53
  LOAD AVERAGE: 1.85, 0.46, 0.15
         TASKS: 301
      NODENAME: localhost.localdomain
       RELEASE: 2.6.32-71.el6.x86_64
       VERSION: #1 SMP Wed Sep 1 01:33:01 EDT 2010
       MACHINE: x86_64  (2666 Mhz)
        MEMORY: 1.5 GB
         PANIC: "Oops: 0002 [#1] SMP " (check log for details)
           PID: 7879
       COMMAND: "bash"
          TASK: ffff880036ac2080  [THREAD_INFO: ffff88003ee64000]
           CPU: 1
         STATE: TASK_RUNNING (SYSRQ)
  
  crash> 
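
When checking many dumpfiles, the "CPUS:" line of the crash banner can be compared against the known cpu count of the guest. The following is a minimal sketch that assumes the banner text has already been captured into a string. The helper name and the sample text (abbreviated from the x86_64 output above) are illustrative only.

```python
import re
from typing import Optional

# Matches the "CPUS: <n>" banner line, but not the later "CPU: <n>" line.
CPUS_LINE = re.compile(r"^\s*CPUS:\s*(\d+)\s*$", re.MULTILINE)


def reported_cpu_count(banner: str) -> Optional[int]:
    """Return the cpu count from a crash banner, or None if it is absent."""
    match = CPUS_LINE.search(banner)
    return int(match.group(1)) if match else None


sample_banner = """\
      KERNEL: vmlinux
    DUMPFILE: guest64-crash2
        CPUS: 2
        DATE: Thu Oct 21 15:49:45 2010
         CPU: 1
"""

expected_cpus = 4  # known configuration of the guest
found = reported_cpu_count(sample_banner)
if found != expected_cpus:
    print(f"cpu count mismatch: crash reported {found}, guest has {expected_cpus}")
```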

Additional info:

This can be worked around by using the "--cpus <count>"
command line option, as in:

  # crash --cpus 4 vmlinux vmcore
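
For scripted analysis, the workaround command line can be built programmatically. This is a minimal sketch. It uses only the "--cpus <count>" option shown above, and the helper name and file names are hypothetical.

```python
import shlex
from typing import List


def crash_command(cpus: int, namelist: str, dumpfile: str) -> List[str]:
    """Build a crash invocation that forces the cpu count with --cpus."""
    if cpus < 1:
        raise ValueError("cpu count must be at least 1")
    return ["crash", "--cpus", str(cpus), namelist, dumpfile]


# Print the command instead of running it; crash may not be installed here.
print(shlex.join(crash_command(4, "vmlinux", "guest32-crash")))
```

The returned list can be passed to subprocess.run() on a system where crash and the dumpfile are available.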

Comment 5 Jaromir Hradilek 2011-04-27 19:20:23 UTC
    Technical note added. If any revisions are required, please edit the "Technical Notes" field
    accordingly. All revisions will be proofread by the Engineering Content Services team.
    
    New Contents:
When creating a KVM dump file, the "virsh dump" operation marks all non-crashing CPUs as offline. Due to an incorrect use of the "cpu_online_map" mask to determine the CPU count, previous versions of the crash utility may have reported a wrong number of CPUs when analyzing dumps created by the "virsh dump" command on x86 guest systems. With this update, the underlying source code has been adapted to use the "cpu_present_map" mask instead, so that the crash utility reports the correct number of CPUs.

Comment 6 errata-xmlrpc 2011-05-19 13:04:12 UTC
An advisory has been issued which should help the problem
described in this bug report. This report is therefore being
closed with a resolution of ERRATA. For more information
on the solution and/or where to find the updated files,
please follow the link below. You may reopen this bug report
if the solution does not work for you.

http://rhn.redhat.com/errata/RHBA-2011-0561.html