309081 – i386 compressed diskdump header contains incorrect panic cpu

Keywords:
Status:	CLOSED ERRATA
Alias:	None
Product:	Red Hat Enterprise Linux 4
Classification:	Red Hat
Component:	kernel
Sub Component:
Version:	4.5
Hardware:	i386
OS:	Linux
Priority:	medium
Severity:	medium
Target Milestone:	---
Target Release:	---
Assignee:	Dave Anderson
QA Contact:	Martin Jenner
Docs Contact:
URL:
Whiteboard:
Depends On:
Blocks:	430698

Reported:	2007-09-27 14:11 UTC by Dave Anderson
Modified:	2008-07-24 19:17 UTC
CC List:	5 users
Fixed In Version:	RHSA-2008-0665
Doc Type:	Bug Fix
Doc Text:
Clone Of:
Environment:
Last Closed:	2008-07-24 19:17:22 UTC
Target Upstream Version:
Embargoed:

Attachments
Patch to store cpu in diskdump stack's thread_info (521 bytes, patch) 2007-10-03 19:44 UTC, Dave Anderson

Links
System	ID	Private	Priority	Status	Summary	Last Updated
Red Hat Product Errata	RHSA-2008:0665	0	normal	SHIPPED_LIVE	Moderate: Updated kernel packages for Red Hat Enterprise Linux 4.7	2008-07-24 16:41:06 UTC

Description Dave Anderson 2007-09-27 14:11:39 UTC

Description of problem:

i386 compressed diskdump header contains incorrect panic cpu.

Version-Release number of selected component (if applicable):

2.6.9-55.9.EL

How reproducible:

Requires that the crash occur on a cpu other than cpu 0.

Steps to Reproduce:
1. On i386, force a compressed diskdump crash on a cpu other than cpu 0
2. Run the crash utility on the dumpfile
3. Note that the wrong task is selected as the panic task.
  
Actual results:

Wrong task is selected.

Expected results:

Correct panic task is selected.


Additional info:

Looking at a RHEL4 compressed i686 diskdump (2.6.9-55.9.ELhugemem), 
from this bugzilla:

  BZ #291641: uswxornasp01 crashed while running vxdisksetup
  https://bugzilla.redhat.com/show_bug.cgi?id=291641

It has this information in its header:

  crash> help -n
  ...
           current_cpu: 0
               nr_cpus: 4
        tasks[nr_cpus]: 231ca80
                        7aa10b0
                        cfae52f0
                        7aa05b0
  ...
  crash>

indicating that the panicking task would be 231ca80,
because .current_cpu is 0.  And therefore crash comes
up with this:

  crash> set
      PID: 0
  COMMAND: "swapper"
     TASK: 231ca80  (1 of 4)  [THREAD_INFO: 238e000]
      CPU: 0
    STATE: TASK_RUNNING (PANIC)
  crash>

But it is not the panic task:

  crash> bt
  PID: 0      TASK: 231ca80   CPU: 0   COMMAND: "swapper"
   #0 [ 238efa8] smp_call_function_interrupt at 2116c69
   #1 [ 238efb0] call_function_interrupt at fffecd6a
      EAX: 00000000  EBX: 0238e000  ECX: 02104018  EDX: 0238e000  EBP: 00493007
      DS:  007b      ESI: 00000000  ES:  007b      EDI: 023c7120
      CS:  0060      EIP: 02104041  ERR: fffffffb  EFLAGS: 00000246
   #2 [ 238efe4] default_idle at 2104041
   #3 [ 238efe8] cpu_idle at 210409e
  crash>

The actual panic task, and the diskdump operation, ran on cpu 2:

  crash> set cfae52f0
      PID: 2114
  COMMAND: "vxedpart"
     TASK: cfae52f0  [THREAD_INFO: c2f3c000]
      CPU: 2
    STATE: TASK_RUNNING (ACTIVE)
  crash> bt
  PID: 2114   TASK: cfae52f0  CPU: 2   COMMAND: "vxedpart"
   #0 [c2f3ca34] disk_dump at f892f1a2
   #1 [c2f3ca38] printk at 212292e
   #2 [c2f3ca44] freeze_other_cpus at f892eef5
   #3 [c2f3ca54] start_disk_dump at f892efa0
   #4 [c2f3ca64] try_crashdump at 213386e
   #5 [c2f3ca6c] die at 2106335
   #6 [c2f3caa0] do_page_fault at 211b249
   #7 [c2f3cb80] error_code (via page_fault) at fffecede
      EAX: 00000002  EBX: d11aa000  ECX: 07bfb802  EDX: cec2c800  EBP: cec2c880
      DS:  007b      ESI: 0241e4bf  ES:  007b      EDI: 00000000
      CS:  0060      EIP: 02285f4a  ERR: ffffffff  EFLAGS: 00010086
   #8 [c2f3cbbc] netpoll_send_skb at 2285f4a
   #9 [c2f3cbd8] write_msg at f8a7c151
  #10 [c2f3cc04] __call_console_drivers at 2122714
  #11 [c2f3cc18] release_console_sem at 2122b1d
  #12 [c2f3cc28] vprintk at 2122a67
  #13 [c2f3cc38] printk at 212292e
  #14 [c2f3cc44] do_page_fault at 211b1e7
  #15 [c2f3cd0c] error_code (via page_fault) at fffecede
      EAX: 00000002  EBX: d11aa000  ECX: 39e07a02  EDX: cec2c600  EBP: cec2c640
      DS:  007b      ESI: 0241e48c  ES:  007b      EDI: 00000000
      CS:  0060      EIP: 02285f4a  ERR: ffffffff  EFLAGS: 00010086
  #16 [c2f3cd48] netpoll_send_skb at 2285f4a
  #17 [c2f3cd64] write_msg at f8a7c151
  #18 [c2f3cd90] __call_console_drivers at 2122714
  #19 [c2f3cda4] call_console_drivers at 2122829
  #20 [c2f3cdb8] release_console_sem at 2122b1d
  #21 [c2f3cdc8] vprintk at 2122a67
  #22 [c2f3cdd8] printk at 212292e
  #23 [c2f3cde4] sd_read_capacity at f882443c
  #24 [c2f3ce68] sd_revalidate_disk at f88248f3
  #25 [c2f3ce88] rescan_partitions at 218b8a4
  #26 [c2f3ceac] blkdev_reread_part at 2223ecf
  #27 [c2f3cebc] blkdev_ioctl at 222419a
  #28 [c2f3cee0] block_ioctl at 2161b34
  #29 [c2f3cee8] dmp_ioctl_by_bdev at f8b1f07f
  #30 [c2f3cf0c] vxdmp_blkrrpart at f8b1f3ca
  #31 [c2f3cf30] dmp_revalidate at f8b35dd7
  #32 [c2f3cf34] rescan_partitions at 218b8a4
  #33 [c2f3cf58] blkdev_reread_part at 2223ecf
  #34 [c2f3cf68] blkdev_ioctl at 222419a
  #35 [c2f3cf8c] block_ioctl at 2161b34
  #36 [c2f3cf94] sys_ioctl at 216a943
  #37 [c2f3cfc0] system_call at fffec219
      EAX: 00000036  EBX: 00000006  ECX: 0000125f  EDX: 00000200
      DS:  007b      ESI: 0837a000  ES:  007b      EDI: 00000006
      SS:  007b      ESP: fee52014  EBP: fee52258
      CS:  0073      EIP: 003d67a2  ERR: 00000036  EFLAGS: 00000246
  crash>

Given that this is an i686 kernel built with CONFIG_4KSTACKS, the diskdump
operation switches stacks in platform_start_crashdump():
  
  static inline void platform_start_crashdump(void *stackptr,
                                             crashdump_func_t dumpfunc,
                                             struct pt_regs *regs)
  {
  #ifdef CONFIG_4KSTACKS
          u32 *dsp;
          union irq_ctx * curctx;
          union irq_ctx * dumpctx;
  
          if (!stackptr)
                  dumpfunc(regs, NULL);
          else {
                  curctx = (union irq_ctx *) current_thread_info();
                  dumpctx = (union irq_ctx *) stackptr;
  
                  /* build the stack frame on the IRQ stack */
                  dsp = (u32*) ((char*)dumpctx + sizeof(*dumpctx));
                  dumpctx->tinfo.task = curctx->tinfo.task;
                  dumpctx->tinfo.real_stack = curctx->tinfo.real_stack;
                  dumpctx->tinfo.virtual_stack = curctx->tinfo.virtual_stack;
                  dumpctx->tinfo.previous_esp = current_stack_pointer();
  
                  *--dsp = (u32) NULL;
                  *--dsp = (u32) regs;
  
                  asm volatile(
                          "       xchgl   %%ebx,%%esp     \n"
                          "       call    *%%eax          \n"
                          "       xchgl   %%ebx,%%esp     \n"
                          : : "a"(dumpfunc), "b"(dsp)
                          : "memory", "cc", "edx", "ecx"
                  );
          }
  #else
          dumpfunc(regs, NULL);
  #endif
  }

When setting up the "fake" thread_info in the dumpctx stack, the function
initializes only the .task, .real_stack, .virtual_stack and
.previous_esp fields.  The .cpu field is left alone with
a value of zero (given that the passed-in stack was cleared when
allocated).

So later on when the diskdump header is set up in the disk_dump()
function, it does this:

         dump_header.current_cpu      = smp_processor_id();

which will pick up the zeroed-out cpu value in the thread_info
of the switched stack.

  #define smp_processor_id() (current_thread_info()->cpu)

Seems like we would have seen this before?  Anyway, it only
affects i686, since the only other architecture that
switches stacks is the x86_64, but it gets its cpu from the
per-cpu PDA structure.

Comment 1 Dave Anderson 2007-10-03 19:44:30 UTC

Created attachment 214961
Patch to store cpu in diskdump stack's thread_info

Comment 2 Dave Anderson 2007-10-03 19:49:42 UTC

Attached patch tested by forcing a crash on a cpu other than cpu 0,
and verified with the crash utility:

# crash vmlinux vmcore
...
crash> bt
PID: 6830   TASK: f72a23b0  CPU: 3   COMMAND: "sh"
 #0 [f7006dd4] disk_dump at f8c771a8
 #1 [f7006dd8] printk at c01228eb
 #2 [f7006de4] freeze_other_cpus at f8c76ef5
 #3 [f7006df4] start_disk_dump at f8c76fa0
 #4 [f7006e04] try_crashdump at c0134b1e
 #5 [f7006e0c] die at c0106045
 #6 [f7006e40] do_page_fault at c011b260
 #7 [f7006f20] error_code (via page_fault) at c02d8f05
    EAX: f7006000  EBX: c0343674  ECX: 00000000  EDX: 00000000  EBP: 00000000
    DS:  007b      ESI: 00000063  ES:  007b      EDI: 00000000
    CS:  0060      EIP: c020ff55  ERR: ffffffff  EFLAGS: 00010206
 #8 [f7006f5c] sysrq_handle_crash at c020ff55
 #9 [f7006f60] __handle_sysrq at c02100e1
#10 [f7006f80] write_sysrq_trigger at c018d127
#11 [f7006f88] vfs_write at c015bbb4
#12 [f7006fa4] sys_write at c015bc7c
#13 [f7006fc0] system_call at c02d8408
    EAX: 00000004  EBX: 00000001  ECX: b7ce7000  EDX: 00000002
    DS:  007b      ESI: 00000002  ES:  007b      EDI: b7ce7000
    SS:  007b      ESP: bfe18cf0  EBP: bfe18d10
    CS:  0073      EIP: 005047a2  ERR: 00000004  EFLAGS: 00000246
crash>

And as verified by the contents of the diskdump header:

crash> help -n
...
         current_cpu: 3
             nr_cpus: 4
      tasks[nr_cpus]: f72a3430
                      f6f4c870
                      f72ec3f0
                      f72a23b0
...
crash>

Comment 3 Dave Anderson 2007-12-05 15:43:58 UTC

Patch posted to RHKL:

http://post-office.corp.redhat.com/archives/rhkernel-list/2007-December/msg00117.html

Comment 4 RHEL Program Management 2007-12-19 22:16:14 UTC

This request was evaluated by Red Hat Product Management for inclusion in a Red
Hat Enterprise Linux maintenance release.  Product Management has requested
further review of this request by Red Hat Engineering, for potential
inclusion in a Red Hat Enterprise Linux Update release for currently deployed
products.  This request is not yet committed for inclusion in an Update
release.

Comment 6 Jason Baron 2007-12-20 18:43:13 UTC

Committed in stream U7 build 68.4. A test kernel with this patch is available
from http://people.redhat.com/~jbaron/rhel4/

Comment 10 errata-xmlrpc 2008-07-24 19:17:22 UTC

An advisory has been issued which should help the problem
described in this bug report. This report is therefore being
closed with a resolution of ERRATA. For more information
on the solution and/or where to find the updated files,
please follow the link below. You may reopen this bug report
if the solution does not work for you.

http://rhn.redhat.com/errata/RHSA-2008-0665.html
