649053 – KVM dumpfiles: fix x86/x86_64 cpu count determination for crashed guests


Summary: KVM dumpfiles: fix x86/x86_64 cpu count determination for crashed guests

Keywords:
Status:	CLOSED ERRATA
Alias:	None
Product:	Red Hat Enterprise Linux 6
Classification:	Red Hat
Component:	crash
Sub Component:
Version:	6.0
Hardware:	All
OS:	Linux
Priority:	low
Severity:	high
Target Milestone:	rc
Target Release:	---
Assignee:	Dave Anderson
QA Contact:	Kernel Dump QE
Docs Contact:
URL:
Whiteboard:
Depends On:	649070
Blocks:

Reported:	2010-11-02 20:30 UTC by Dave Anderson
Modified:	2011-05-19 13:04 UTC
CC List:	2 users
Fixed In Version:	crash-5.1.1-1.el6
Doc Type:	Bug Fix
Doc Text:	When creating a KVM dump file, the "virsh dump" operation marks all non-crashing CPUs as offline. Because the "cpu_online_map" mask was incorrectly used to determine the CPU count, previous versions of the crash utility may have reported a wrong number of CPUs when analyzing dumps created by the "virsh dump" command on x86 guest systems. With this update, the underlying source code has been adapted to use the "cpu_present_map" mask instead, so that the crash utility reports the correct number of CPUs.
Clone Of:
Environment:
Last Closed:	2011-05-19 13:04:12 UTC
Target Upstream Version:
Embargoed:

Links
System	ID	Private	Priority	Status	Summary	Last Updated
Red Hat Product Errata	RHBA-2011:0561	0	normal	SHIPPED_LIVE	crash bug fix and enhancement update	2011-05-18 17:57:16 UTC

Description Dave Anderson 2010-11-02 20:30:11 UTC

Description of problem:

When KVM guest kernels crash, the panic path will call smp_send_stop()
for the active non-crashing cpus, which offlines them (unlike kdump,
but as in diskdump and netdump operations).  Therefore, the kernel's
cpu_online_map cannot be used for determining how many cpus
were running when the system crashed, and the crash utility
requires a fix for correctly determining the cpu count in crashed
KVM dumpfiles.

It has been fixed upstream for x86 in crash version 5.0.9:

 - Fix for the cpu count determination in crashed x86 KVM dumpfiles, 
   where the non-crashing cpus are marked offline in the kernel's
   cpu_online_mask by smp_stop_cpu().  Depending upon the cpu number
   of the crashing task, the cpu count may be set to a value that is
   less than the number of present cpus.
   (anderson)

and inadvertently fixed for x86_64 in crash version 5.0.8:

 - Change to the manner in which the cpu count is determined for x86_64
   kernels.  SLES11 2.6.32 kernels delay the call to crash_kexec() until
   after smp_send_stop() is called by panic(), and so the cpu_online_map
   cannot be used for determining the cpu count.  With the patch, the
   cpu_present_map is used.
   (Jeffrey.Hagen)
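
To illustrate why the reported count depends on the crashing cpu, here is a
minimal sketch that models the two masks as plain integers.  It assumes (this
is not taken from the crash sources) that the cpu count is derived as the
highest set bit in the mask plus one, which is consistent with the "CPUS:"
values shown in this report.  The function names and the 4-cpu mask are
hypothetical.

```python
def cpu_count_from_mask(mask: int) -> int:
    """Return the index of the highest set bit plus one (0 for an empty mask)."""
    return mask.bit_length()


def online_mask_after_panic(present_mask: int, crashing_cpu: int) -> int:
    """Model the effect of smp_send_stop(): only the crashing cpu stays online."""
    return present_mask & (1 << crashing_cpu)


# Hypothetical 4-cpu guest: cpus 0-3 are present.
PRESENT_MASK = 0b1111

for crashing_cpu in (0, 1, 3):
    online = online_mask_after_panic(PRESENT_MASK, crashing_cpu)
    print(
        f"crashing cpu {crashing_cpu}: "
        f"count from online mask = {cpu_count_from_mask(online)}, "
        f"count from present mask = {cpu_count_from_mask(PRESENT_MASK)}"
    )
```

Under this assumption, a crash on cpu 0 yields a count of 1 and a crash on
cpu 1 yields a count of 2 when the online mask is used, while the present
mask yields 4 in every case.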

Version-Release number of selected component (if applicable):

crash-5.0.0-23.el6

How reproducible:

Always -- but it depends upon which cpu has crashed.

Steps to Reproduce:
1. Forcibly crash an x86 or x86_64 guest system that has several
   cpus, and then do a "virsh dump" on the guest.
2. Run crash on the dumpfile, and note the "CPUS:" count, which will
   be based upon which cpu number has crashed, and not the actual
   number of cpus.
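
The check in step 2 can also be done mechanically.  The following is only a
sketch: it assumes the crash banner has already been captured as text (how
that is done is left open), and the helper name and sample text are
hypothetical, with the sample abbreviated from the banner format shown in
this report.

```python
import re


def parse_cpus(banner: str):
    """Return the integer from the banner's "CPUS:" line, or None if absent."""
    match = re.search(r"^\s*CPUS:\s*(\d+)\s*$", banner, re.MULTILINE)
    return int(match.group(1)) if match else None


# Abbreviated banner text in the format printed by crash at startup.
SAMPLE_BANNER = """\
      KERNEL: vmlinux
    DUMPFILE: guest32-crash
        CPUS: 1
        DATE: Thu Oct 21 11:06:25 2010
         CPU: 0
"""

EXPECTED_CPUS = 4  # number of cpus the guest was configured with

reported = parse_cpus(SAMPLE_BANNER)
if reported != EXPECTED_CPUS:
    print(f"cpu count mismatch: banner reports {reported}, expected {EXPECTED_CPUS}")
```

The pattern matches only the "CPUS:" line, so the later "CPU:" line, which
names the cpu of the panic task, is not picked up by mistake.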
  
Actual results:

This "virsh dump" example shows a 4-cpu x86 guest that has crashed,
but only 1 cpu is found, because the crashing task was running
on cpu 0:

  # crash vmlinux guest32-crash

  crash 5.0.0-23.el6
  ... [ cut ] ...
        KERNEL: vmlinux        
      DUMPFILE: guest32-crash
          CPUS: 1
          DATE: Thu Oct 21 11:06:25 2010
        UPTIME: 1 days, 00:05:47
  LOAD AVERAGE: 8.87, 3.11, 1.11
         TASKS: 291
      NODENAME: localhost.localdomain
       RELEASE: 2.6.32-71.el6.i686
       VERSION: #1 SMP Wed Sep 1 01:26:34 EDT 2010
       MACHINE: i686  (2666 Mhz)
        MEMORY: 1.5 GB
         PANIC: "Oops: 0002 [#1] SMP " (check log for details)
           PID: 25708
       COMMAND: "bash"
          TASK: e9e9a560  [THREAD_INFO: f263a000]
           CPU: 0
         STATE: TASK_RUNNING (PANIC)
  
  crash> 

Expected results:

With the fix applied, all 4 cpus are found: 
  
  # crash vmlinux guest32-crash
  
  crash 5.0.9
  ... [ cut ] ...
        KERNEL: vmlinux        
      DUMPFILE: guest32-crash
          CPUS: 4
          DATE: Thu Oct 21 11:06:25 2010
        UPTIME: 1 days, 00:05:47
  LOAD AVERAGE: 8.87, 3.11, 1.11
         TASKS: 294
      NODENAME: localhost.localdomain
       RELEASE: 2.6.32-71.el6.i686
       VERSION: #1 SMP Wed Sep 1 01:26:34 EDT 2010
       MACHINE: i686  (2666 Mhz)
        MEMORY: 1.5 GB
         PANIC: "Oops: 0002 [#1] SMP " (check log for details)
           PID: 25708
       COMMAND: "bash"
          TASK: e9e9a560  [THREAD_INFO: f263a000]
           CPU: 0
         STATE: TASK_RUNNING (PANIC)
  
  crash>

Actual results:

This is also applicable to x86_64, where in this example,
only the first 2 cpus are recognized in a 4-cpu system,
because the crashing task was running on cpu 1:

  # crash vmlinux guest64-crash2
  
  crash 5.0.0-23.el6
  ... [ cut ]...
        KERNEL: vmlinux      
      DUMPFILE: guest64-crash2
          CPUS: 2
          DATE: Thu Oct 21 15:49:45 2010
        UPTIME: 04:44:53
  LOAD AVERAGE: 1.85, 0.46, 0.15
         TASKS: 299
      NODENAME: localhost.localdomain
       RELEASE: 2.6.32-71.el6.x86_64
       VERSION: #1 SMP Wed Sep 1 01:33:01 EDT 2010
       MACHINE: x86_64  (2666 Mhz)
        MEMORY: 1.5 GB
         PANIC: "Oops: 0002 [#1] SMP " (check log for details)
           PID: 7879
       COMMAND: "bash"
          TASK: ffff880036ac2080  [THREAD_INFO: ffff88003ee64000]
           CPU: 1
         STATE: TASK_RUNNING (PANIC)
  
  crash>

Expected results:

With the fixes applied, all 4 cpus are recognized:
  
  # crash vmlinux guest64-crash2
  
  crash 5.0.9
  ... [ cut ] ...
        KERNEL: vmlinux      
      DUMPFILE: guest64-crash2
          CPUS: 4
          DATE: Thu Oct 21 15:49:45 2010
        UPTIME: 04:44:53
  LOAD AVERAGE: 1.85, 0.46, 0.15
         TASKS: 301
      NODENAME: localhost.localdomain
       RELEASE: 2.6.32-71.el6.x86_64
       VERSION: #1 SMP Wed Sep 1 01:33:01 EDT 2010
       MACHINE: x86_64  (2666 Mhz)
        MEMORY: 1.5 GB
         PANIC: "Oops: 0002 [#1] SMP " (check log for details)
           PID: 7879
       COMMAND: "bash"
          TASK: ffff880036ac2080  [THREAD_INFO: ffff88003ee64000]
           CPU: 1
         STATE: TASK_RUNNING (SYSRQ)
  
  crash> 

Additional info:

It should be noted that this can be worked around by using the
"--cpus <count>" command line option, as in:

  # crash --cpus 4 vmlinux vmcore

Comment 5 Jaromir Hradilek 2011-04-27 19:20:23 UTC

    Technical note added. If any revisions are required, please edit the "Technical Notes" field
    accordingly. All revisions will be proofread by the Engineering Content Services team.
    
    New Contents:
When creating a KVM dump file, the "virsh dump" operation marks all non-crashing CPUs as offline. Because the "cpu_online_map" mask was incorrectly used to determine the CPU count, previous versions of the crash utility may have reported a wrong number of CPUs when analyzing dumps created by the "virsh dump" command on x86 guest systems. With this update, the underlying source code has been adapted to use the "cpu_present_map" mask instead, so that the crash utility reports the correct number of CPUs.

Comment 6 errata-xmlrpc 2011-05-19 13:04:12 UTC

An advisory has been issued which should help resolve the problem
described in this bug report. This report is therefore being
closed with a resolution of ERRATA. For more information
on the solution and/or where to find the updated files,
please follow the link below. You may reopen this bug report
if the solution does not work for you.

http://rhn.redhat.com/errata/RHBA-2011-0561.html
