Bug 1208557 - crash-7.1.0-1.el6 spins at 'please wait... (gathering task table data)' when loading rhel6.4.z vmcore
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: Red Hat Enterprise Linux 6
Classification: Red Hat
Component: crash
Version: 6.7
Hardware: Unspecified
OS: Linux
Priority: low
Severity: high
Target Milestone: rc
Target Release: ---
Assignee: Dave Anderson
QA Contact: Qiao Zhao
Reported: 2015-04-02 14:09 UTC by Dave Wysochanski
Modified: 2015-07-22 06:27 UTC (History)

Fixed In Version: crash-7.1.0-3.el6
Doc Type: Bug Fix
Doc Text:
Attempting to run the crash utility with the vmcore and vmlinux files previously caused crash to enter an infinite loop and become unresponsive. With this update, the handling of errors when gathering tasks from pid_hash[] chains during session initialization has been enhanced. Now, if a pid_hash[] chain has been corrupted, the patch prevents the initialization sequence from entering an infinite loop. This prevents the described failure of the crash utility from occurring. In addition, the error messages associated with corrupt/invalid pid_hash[] chains have been updated to report the pid_hash[] index number.
Last Closed: 2015-07-22 06:27:20 UTC




Links
System                 | ID             | Priority | Status       | Summary                              | Last Updated
Red Hat Product Errata | RHBA-2015:1309 | normal   | SHIPPED_LIVE | crash bug fix and enhancement update | 2015-07-20 17:53:40 UTC

Description Dave Wysochanski 2015-04-02 14:09:08 UTC
crash gets stuck at "gathering task table data" when loading vmcore

$ ./usr/bin/crash .../vmcore .../2.6.32-358.23.2.el6.x86_64/vmlinux

crash 7.1.0-2.el6
Copyright (C) 2002-2014  Red Hat, Inc.
Copyright (C) 2004, 2005, 2006, 2010  IBM Corporation
Copyright (C) 1999-2006  Hewlett-Packard Co
Copyright (C) 2005, 2006, 2011, 2012  Fujitsu Limited
Copyright (C) 2006, 2007  VA Linux Systems Japan K.K.
Copyright (C) 2005, 2011  NEC Corporation
Copyright (C) 1999, 2002, 2007  Silicon Graphics, Inc.
Copyright (C) 1999, 2000, 2001, 2002  Mission Critical Linux, Inc.
This program is free software, covered by the GNU General Public License,
and you are welcome to change it and/or distribute copies of it under
certain conditions.  Enter "help copying" to see the conditions.
This program has absolutely no warranty.  Enter "help warranty" for details.
 
GNU gdb (GDB) 7.6
Copyright (C) 2013 Free Software Foundation, Inc.
License GPLv3+: GNU GPL version 3 or later <http://gnu.org/licenses/gpl.html>
This is free software: you are free to change and redistribute it.
There is NO WARRANTY, to the extent permitted by law.  Type "show copying"
and "show warranty" for details.
This GDB was configured as "x86_64-unknown-linux-gnu"...

please wait... (gathering task table data)      



Version-Release number of selected component (if applicable):
crash-7.1.0-1.el6

*NOTE* We originally saw this on crash-7.0.9-1.el6, which was the upstream crash-7.0.9 srpm rebuilt for our rhel6 vmcore system.  However, I reproduced it on the latest crash built for rhel6.7, crash-7.1.0-1.el6.



How reproducible:
Every time on this vmcore.


Steps to Reproduce:
Run crash with vmcore and vmlinux for this one specific vmcore.


Actual results:
crash spins at 100% cpu for over 48 hours


Additional info:
Will post location of vmcore and command to reproduce.  Might be a damaged vmcore or legitimate one - not sure.

Here's the backtrace

Program received signal SIGINT, Interrupt.
read_diskdump (fd=-1, bufptr=0xf4a9e0, cnt=32, addr=<value optimized out>, paddr=26794098928) at diskdump.c:1113
1113            if (KDUMP_SPLIT()) {
(gdb) bt
#0  read_diskdump (fd=-1, bufptr=0xf4a9e0, cnt=32, addr=<value optimized out>, paddr=26794098928) at diskdump.c:1113
#1  0x000000000047840a in readmem (addr=<value optimized out>, memtype=1, buffer=0xf4a9e0, size=32, type=0x895958 "pid_hash upid",
    error_handle=6) at memory.c:2177
#2  0x00000000004bdf95 in refresh_hlist_task_table_v3 () at task.c:2018
#3  0x00000000004c6664 in task_init () at task.c:475
#4  0x000000000046525b in main_loop () at main.c:727
#5  0x0000000000689dc3 in captured_command_loop (data=<value optimized out>) at main.c:258
#6  0x00000000006886ab in catch_errors (func=0x689db0 <captured_command_loop>, func_args=0x0, errstring=0x8ce986 "", mask=6)
    at exceptions.c:557
#7  0x000000000068ac96 in captured_main (data=<value optimized out>) at main.c:1064
#8  0x00000000006886ab in catch_errors (func=0x689ed0 <captured_main>, func_args=0x7fffffffe320, errstring=0x8ce986 "", mask=6)
    at exceptions.c:557
#9  0x0000000000689bc4 in gdb_main (args=<value optimized out>) at main.c:1079
#10 0x0000000000689bfe in gdb_main_entry (argc=<value optimized out>, argv=<value optimized out>) at main.c:1099
#11 0x0000000000466060 in main (argc=3, argv=0x7fffffffe498) at main.c:677

Comment 3 Dave Anderson 2015-04-02 14:34:07 UTC
Thanks Dave, I'll take a look.

Note that "crash --log vmcore" works, and it shows a serious memory
corruption:

$ crash --log vmcore
... [ cut ] ...
<1>BUG: unable to handle kernel paging request at 00000000ad6fdfe0
<1>IP: [<ffffffff81056b14>] update_curr+0x144/0x1f0
<4>PGD 413df1067 PUD 0 
<0>Thread overran stack, or stack corrupted
<4>Oops: 0000 [#1] SMP 
<4>last sysfs file: /sys/module/ipv6/initstate
<4>CPU 5 
<4>Modules linked in: mvfs(U) cdr(P)(U) nfsd exportfs nfs lockd fscache auth_rpcgss nfs_acl sunrpc ipv6 ppdev parport_pc parport sg vmware_balloon microcode vmxnet3 i2c_piix4 i2c_core shpchp ext4 jbd2 mbcache sd_mod crc_t10dif sr_mod cdrom vmw_pvscsi pata_acpi ata_generic ata_piix dm_mirror dm_region_hash dm_log dm_mod [last unloaded: scsi_wait_scan]
<4>
<4>Pid: 30737, comm: export_mvfs Tainted: P           ---------------    2.6.32-358.23.2.el6.x86_64 #1 VMware, Inc. VMware Virtual Platform/440BX Desktop Reference Platform
<4>RIP: 0010:[<ffffffff81056b14>]  [<ffffffff81056b14>] update_curr+0x144/0x1f0
<4>RSP: 0018:ffff880247403db8  EFLAGS: 00010082
<4>RAX: ffff880425fc2aa0 RBX: 0000000025760028 RCX: ffff88063d0de440
<4>RDX: 00000000000192d8 RSI: 0000000000000000 RDI: ffff880425fc2ad8
<4>RBP: ffff880247403de8 R08: ffffffff8160bb65 R09: 0000000000000000
<4>R10: 0000000000000010 R11: 0000000000000000 R12: ffff880247416768
<4>R13: 00000000000f4435 R14: 000000d543a6c0cf R15: ffff880425fc2aa0
<4>FS:  0000000000000000(0000) GS:ffff880247400000(0063) knlGS:00000000f77a06c0
<4>CS:  0010 DS: 002b ES: 002b CR0: 0000000080050033
<4>CR2: 00000000ad6fdfe0 CR3: 0000000425fb2000 CR4: 00000000000407e0
<4>DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
<4>DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
<4>Process export_mvfs (pid: 30737, threadinfo ffff880425760000, task ffff880425fc2aa0)
<4>Stack:
<4> ffff880247403dc8 ffffffff81013643 ffff880425fc2ad8 ffff880247416768
<4><d> 0000000000000000 0000000000000000 ffff880247403e18 ffffffff810570cb
<4><d> ffff880247416700 0000000000000005 0000000000016700 0000000000000005
<4>Call Trace:
<4> <IRQ> 
<4> [<ffffffff81013643>] ? native_sched_clock+0x13/0x80
<4> [<ffffffff810570cb>] task_tick_fair+0xdb/0x160
<4> [<ffffffff8105af11>] scheduler_tick+0xc1/0x260
<4> [<ffffffff810a8060>] ? tick_sched_timer+0x0/0xc0
<4> [<ffffffff810812fe>] update_process_times+0x6e/0x90
<4> [<ffffffff810a80c6>] tick_sched_timer+0x66/0xc0
<4> [<ffffffff8109b4ae>] __run_hrtimer+0x8e/0x1a0
<4> [<ffffffff810a219f>] ? ktime_get_update_offsets+0x4f/0xd0
<4> [<ffffffff8107710f>] ? __do_softirq+0x11f/0x1e0
<4> [<ffffffff8109b816>] hrtimer_interrupt+0xe6/0x260
<4> [<ffffffff8151785b>] smp_apic_timer_interrupt+0x6b/0x9b
<4> [<ffffffff8100bb93>] apic_timer_interrupt+0x13/0x20
<4> <EOI> 
<4>Code: 00 8b 15 04 2b a4 00 85 d2 74 34 48 8b 50 08 8b 5a 18 48 8b 90 10 09 00 00 48 8b 4a 50 48 85 c9 74 1d 48 63 db 66 90 48 8b 51 20 <48> 03 14 dd a0 de bf 81 4c 01 2a 48 8b 49 78 48 85 c9 75 e8 48 
<1>RIP  [<ffffffff81056b14>] update_curr+0x144/0x1f0
<4> RSP <ffff880247403db8>
<4>CR2: 00000000ad6fdfe0
$

Anyway, crash ends up re-reading the same location endlessly, so I'm 
guessing that there's some kind of corruption in the pid_hash area.

Comment 4 Dave Anderson 2015-04-02 14:44:08 UTC
> Note that "crash --log vmcore" works, and it shows a serious memory
> corruption:

and crash --minimal also works.

Comment 5 Dave Anderson 2015-04-02 15:53:38 UTC
This behavior was introduced by a crash-7.0.9 patch, which fixed a 
problem where tasks in a chain could get skipped:

   - Fix for the one-time (dumpfile), or as-required (live system),
     gathering of tasks from the kernel pid_hash[] in 2.6.24 and later 
     kernels.  Without the patch, if an entry in a pid_hash[] chain is
     not related to the "init_pid_ns" pid_namespace structure, any 
     remaining entries in the hlist chain are skipped. 
     (vvs@parallels.com)

Unfortunately, it has the side effect seen in this case when a
pid_hash[] chain has been corrupted, presumably due to prior
corruption of a kernel stack.

Running with an older rhel6 version (crash-6.1.0-5.el6) does initialize
like so: 

$ /tmp/crash vm*

crash 6.1.0-5.el6
Copyright (C) 2002-2012  Red Hat, Inc.
Copyright (C) 2004, 2005, 2006, 2010  IBM Corporation
Copyright (C) 1999-2006  Hewlett-Packard Co
Copyright (C) 2005, 2006, 2011, 2012  Fujitsu Limited
Copyright (C) 2006, 2007  VA Linux Systems Japan K.K.
Copyright (C) 2005, 2011  NEC Corporation
Copyright (C) 1999, 2002, 2007  Silicon Graphics, Inc.
Copyright (C) 1999, 2000, 2001, 2002  Mission Critical Linux, Inc.
This program is free software, covered by the GNU General Public License,
and you are welcome to change it and/or distribute copies of it under
certain conditions.  Enter "help copying" to see the conditions.
This program has absolutely no warranty.  Enter "help warranty" for details.
 
GNU gdb (GDB) 7.3.1
Copyright (C) 2011 Free Software Foundation, Inc.
License GPLv3+: GNU GPL version 3 or later <http://gnu.org/licenses/gpl.html>
This is free software: you are free to change and redistribute it.
There is NO WARRANTY, to the extent permitted by law.  Type "show copying"
and "show warranty" for details.
This GDB was configured as "x86_64-unknown-linux-gnu"...

please wait... (gathering task table data)      
crash: duplicate task in pid_hash: ffff88043cda23b0

crash: invalid task address: ffff88043cda23b0
please wait... (determining panic task)    
WARNING: active task ffff880425fc2aa0 on cpu 5 not found in PID hash


WARNING: active task ffff880425fc2aa0 on cpu 5: corrupt cpu value: 628490280

      KERNEL: vmlinux-2.6.32-358.23.2.el6
    DUMPFILE: vmcore  [PARTIAL DUMP]
        CPUS: 20
        DATE: Sun Mar 29 07:03:01 2015
      UPTIME: 00:15:15
LOAD AVERAGE: 2.62, 1.95, 1.11
       TASKS: 227
    NODENAME: pv0il0244.cc0.mercadona.es
     RELEASE: 2.6.32-358.23.2.el6.x86_64
     VERSION: #1 SMP Sat Sep 14 05:32:37 EDT 2013
     MACHINE: x86_64  (2900 Mhz)
      MEMORY: 32 GB
       PANIC: "Oops: 0000 [#1] SMP " (check log for details)
         PID: 30737
     COMMAND: "export_mvfs"
        TASK: ffff880425fc2aa0  [THREAD_INFO: ffff880425760000]
         CPU: 5
       STATE: TASK_RUNNING (PANIC)

crash> 

I'll look into seeing how/if this can be recognized and handled, although
I worry that a "fix" may only address this specific instance of corruption.

Comment 6 Dave Wysochanski 2015-04-02 18:56:33 UTC
Thanks Dave A!  

If it's a damaged task_struct / vmcore that causes crash to go bonkers here, it may be 'low' priority, but I'm leaving it at 'medium' for now and setting 'regression', though that may make it sound too important.  We've only seen it one time, but we've not been running crash-7.0.9 for long: it was installed Jan 20, 2015, so only a little over 2 months.  If stack overflows trigger this bug, then those do happen more on rhel6 from what I've seen, but it's probably only a couple percent of vmcores.

Comment 9 Dave Anderson 2015-04-02 19:59:46 UTC
There's a corruption associated with the pid_hash[62] chain that causes
the crash utility to go into an infinite loop.

The pid_hash[62] hlist_head structure points to an embedded hlist_node
in the first upid structure in the chain:
  
  crash> p pid_hash[62]
  $2 = {
    first = 0xffff8804255f5c80
  }
  crash> struct upid -l upid.pid_chain 0xffff8804255f5c80
  struct upid {
    nr = -1607242614, 
    ns = 0xffffffff81aa31a0 <init_pid_ns>, 
    pid_chain = {
      next = 0xffff880425761e48, 
      pprev = 0xffff88043cda28c0
    }
  }
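
The "-l upid.pid_chain" option tells crash that the address given is that of the embedded pid_chain member rather than the start of the structure, so the structure base is the address minus the member offset. A minimal Python sketch of that arithmetic follows. The offset of 16 bytes is an assumption about the x86_64 layout of struct upid (4-byte nr plus padding, 8-byte ns pointer, then pid_chain); it is not stated in this report.

```python
# Assumed x86_64 layout of struct upid (an assumption, not taken from the report):
#   offset  0: int nr (4 bytes, followed by 4 bytes of padding)
#   offset  8: struct pid_namespace *ns
#   offset 16: struct hlist_node pid_chain { next, pprev }
PID_CHAIN_OFFSET = 16


def upid_base(pid_chain_addr: int, member_offset: int = PID_CHAIN_OFFSET) -> int:
    """Return the address of the enclosing upid, given the address of its pid_chain member."""
    return pid_chain_addr - member_offset


first = 0xffff8804255f5c80  # pid_hash[62].first, as printed above
print(hex(upid_base(first)))
```

If the assumed offset is right, the corrupted upid itself starts 16 bytes before the address stored in pid_hash[62].first.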
  
The PID "nr" of -1607242614 is obviously not correct:
  
  crash> eval -1607242614
  hexadecimal: ffffffffa0336c8a  
      decimal: 18446744072102309002  (-1607242614)
        octal: 1777777777764014666212
       binary: 1111111111111111111111111111111110100000001100110110110010001010
  crash> sym ffffffffa0336c8a
  ffffffffa0336c8a (t) fsf_ops_lookup+58 [cdr] 
  crash> mod -t
  NAME  TAINTS
  cdr   P(U)
  mvfs  (U)
  crash> 
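
The "eval" and "sym" output above can be cross-checked with a few lines of Python. This is only a sketch: it assumes that "nr" is a 4-byte signed int on a little-endian machine, so that an 8-byte return address written over the start of the upid would show up as its low 32 bits. The helper names are made up for illustration.

```python
import struct

RETURN_ADDR = 0xffffffffa0336c8a  # fsf_ops_lookup+58 [cdr], per the "sym" output


def low32_as_int(value: int) -> int:
    """Interpret the low 4 bytes of a little-endian 64-bit word as a signed int."""
    raw = struct.pack("<Q", value)
    return struct.unpack("<i", raw[:4])[0]


def as_u64(value: int) -> int:
    """Sign-extend a negative Python int into its unsigned 64-bit representation."""
    return value & 0xFFFFFFFFFFFFFFFF


print(low32_as_int(RETURN_ADDR))  # the value crash would print for "nr"
print(hex(as_u64(-1607242614)))   # the hexadecimal form shown by "eval"
```

Under that assumption, the bogus PID is simply the low half of a module text address that was spilled onto the upid.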
  
So it appears that the proprietary "cdr" module, along with the 
unsigned "mvfs" module, are involved in the stack overrun of the
"export_mvfs" task:
  
  crash> set
      PID: 30737
  COMMAND: "export_mvfs"
     TASK: ffff880425fc2aa0  [THREAD_INFO: ffff880425760000]
      CPU: 5
    STATE: TASK_RUNNING (PANIC)
  crash> 
  
Dump the overrun stack contents from the bottom:
  
  crash> bt -T
  PID: 30737  TASK: ffff880425fc2aa0  CPU: 5   COMMAND: "export_mvfs"
    [ffff880425760070] fsf_ops_lookup at ffffffffa0336c8a [cdr]
    [ffff8804257600a0] vnlayer_hijacked_lookup at ffffffffa039749d [mvfs]
    [ffff8804257600b0] fsf_ops_lookup at ffffffffa0336c8a [cdr]
    [ffff8804257600e0] vnlayer_hijacked_lookup at ffffffffa039749d [mvfs]
    [ffff8804257600f0] fsf_ops_lookup at ffffffffa0336c8a [cdr]
    [ffff880425760120] vnlayer_hijacked_lookup at ffffffffa039749d [mvfs]
    [ffff880425760130] fsf_ops_lookup at ffffffffa0336c8a [cdr]
    [ffff880425760160] vnlayer_hijacked_lookup at ffffffffa039749d [mvfs]
    [ffff880425760170] fsf_ops_lookup at ffffffffa0336c8a [cdr]
    [ffff8804257601a0] vnlayer_hijacked_lookup at ffffffffa039749d [mvfs]
    [ffff8804257601b0] fsf_ops_lookup at ffffffffa0336c8a [cdr]
    [ffff8804257601e0] vnlayer_hijacked_lookup at ffffffffa039749d [mvfs]
    [ffff8804257601f0] fsf_ops_lookup at ffffffffa0336c8a [cdr]
    [ffff880425760220] vnlayer_hijacked_lookup at ffffffffa039749d [mvfs]
    [ffff880425760230] fsf_ops_lookup at ffffffffa0336c8a [cdr]
    [ffff880425760260] vnlayer_hijacked_lookup at ffffffffa039749d [mvfs]
    [ffff880425760270] fsf_ops_lookup at ffffffffa0336c8a [cdr]
    [ffff8804257602a0] vnlayer_hijacked_lookup at ffffffffa039749d [mvfs]
    [ffff8804257602b0] fsf_ops_lookup at ffffffffa0336c8a [cdr]
    [ffff8804257602e0] vnlayer_hijacked_lookup at ffffffffa039749d [mvfs]
    [ffff8804257602f0] fsf_ops_lookup at ffffffffa0336c8a [cdr]
    [ffff880425760320] vnlayer_hijacked_lookup at ffffffffa039749d [mvfs]
    [ffff880425760330] fsf_ops_lookup at ffffffffa0336c8a [cdr]
    [ffff880425760360] vnlayer_hijacked_lookup at ffffffffa039749d [mvfs]
    [ffff880425760370] fsf_ops_lookup at ffffffffa0336c8a [cdr]
    [ffff8804257603a0] vnlayer_hijacked_lookup at ffffffffa039749d [mvfs]
    [ffff8804257603b0] fsf_ops_lookup at ffffffffa0336c8a [cdr]
    [ffff8804257603e0] vnlayer_hijacked_lookup at ffffffffa039749d [mvfs]
    [ffff8804257603f0] fsf_ops_lookup at ffffffffa0336c8a [cdr]
    [ffff880425760420] vnlayer_hijacked_lookup at ffffffffa039749d [mvfs]
    [ffff880425760430] fsf_ops_lookup at ffffffffa0336c8a [cdr]
    [ffff880425760460] vnlayer_hijacked_lookup at ffffffffa039749d [mvfs]
    [ffff880425760470] fsf_ops_lookup at ffffffffa0336c8a [cdr]
    [ffff8804257604a0] vnlayer_hijacked_lookup at ffffffffa039749d [mvfs]
    [ffff8804257604b0] fsf_ops_lookup at ffffffffa0336c8a [cdr]
    [ffff8804257604e0] vnlayer_hijacked_lookup at ffffffffa039749d [mvfs]
    [ffff8804257604f0] fsf_ops_lookup at ffffffffa0336c8a [cdr]
    [ffff880425760520] vnlayer_hijacked_lookup at ffffffffa039749d [mvfs]
    [ffff880425760530] fsf_ops_lookup at ffffffffa0336c8a [cdr]
    [ffff880425760560] vnlayer_hijacked_lookup at ffffffffa039749d [mvfs]
    [ffff880425760570] fsf_ops_lookup at ffffffffa0336c8a [cdr]
    [ffff8804257605a0] vnlayer_hijacked_lookup at ffffffffa039749d [mvfs]
    [ffff8804257605b0] fsf_ops_lookup at ffffffffa0336c8a [cdr]
    [ffff8804257605e0] vnlayer_hijacked_lookup at ffffffffa039749d [mvfs]
    [ffff8804257605f0] fsf_ops_lookup at ffffffffa0336c8a [cdr]
    [ffff880425760620] vnlayer_hijacked_lookup at ffffffffa039749d [mvfs]
    [ffff880425760630] fsf_ops_lookup at ffffffffa0336c8a [cdr]
    [ffff880425760660] vnlayer_hijacked_lookup at ffffffffa039749d [mvfs]
    [ffff880425760670] fsf_ops_lookup at ffffffffa0336c8a [cdr]
    [ffff8804257606a0] vnlayer_hijacked_lookup at ffffffffa039749d [mvfs]
    [ffff8804257606b0] fsf_ops_lookup at ffffffffa0336c8a [cdr]
    [ffff8804257606e0] vnlayer_hijacked_lookup at ffffffffa039749d [mvfs]
    [ffff8804257606f0] fsf_ops_lookup at ffffffffa0336c8a [cdr]
    [ffff880425760720] vnlayer_hijacked_lookup at ffffffffa039749d [mvfs]
    [ffff880425760730] fsf_ops_lookup at ffffffffa0336c8a [cdr]
    [ffff880425760760] vnlayer_hijacked_lookup at ffffffffa039749d [mvfs]
    [ffff880425760770] fsf_ops_lookup at ffffffffa0336c8a [cdr]
    [ffff8804257607a0] vnlayer_hijacked_lookup at ffffffffa039749d [mvfs]
    [ffff8804257607b0] fsf_ops_lookup at ffffffffa0336c8a [cdr]
    [ffff8804257607e0] vnlayer_hijacked_lookup at ffffffffa039749d [mvfs]
    [ffff8804257607f0] fsf_ops_lookup at ffffffffa0336c8a [cdr]
    [ffff880425760820] vnlayer_hijacked_lookup at ffffffffa039749d [mvfs]
    [ffff880425760830] fsf_ops_lookup at ffffffffa0336c8a [cdr]
    [ffff880425760860] vnlayer_hijacked_lookup at ffffffffa039749d [mvfs]
    [ffff880425760870] fsf_ops_lookup at ffffffffa0336c8a [cdr]
    [ffff8804257608a0] vnlayer_hijacked_lookup at ffffffffa039749d [mvfs]
    [ffff8804257608b0] fsf_ops_lookup at ffffffffa0336c8a [cdr]
    [ffff8804257608e0] vnlayer_hijacked_lookup at ffffffffa039749d [mvfs]
    [ffff8804257608f0] fsf_ops_lookup at ffffffffa0336c8a [cdr]
    [ffff880425760920] vnlayer_hijacked_lookup at ffffffffa039749d [mvfs]
    [ffff880425760930] fsf_ops_lookup at ffffffffa0336c8a [cdr]
    [ffff880425760960] vnlayer_hijacked_lookup at ffffffffa039749d [mvfs]
    [ffff880425760970] fsf_ops_lookup at ffffffffa0336c8a [cdr]
    [ffff8804257609a0] vnlayer_hijacked_lookup at ffffffffa039749d [mvfs]
    [ffff8804257609b0] fsf_ops_lookup at ffffffffa0336c8a [cdr]
    [ffff8804257609e0] vnlayer_hijacked_lookup at ffffffffa039749d [mvfs]
    [ffff8804257609f0] fsf_ops_lookup at ffffffffa0336c8a [cdr]
    [ffff880425760a20] vnlayer_hijacked_lookup at ffffffffa039749d [mvfs]
    [ffff880425760a30] fsf_ops_lookup at ffffffffa0336c8a [cdr]
    [ffff880425760a60] vnlayer_hijacked_lookup at ffffffffa039749d [mvfs]
    [ffff880425760a70] fsf_ops_lookup at ffffffffa0336c8a [cdr]
    [ffff880425760aa0] vnlayer_hijacked_lookup at ffffffffa039749d [mvfs]
    [ffff880425760ab0] fsf_ops_lookup at ffffffffa0336c8a [cdr]
    [ffff880425760ae0] vnlayer_hijacked_lookup at ffffffffa039749d [mvfs]
    [ffff880425760af0] fsf_ops_lookup at ffffffffa0336c8a [cdr]
    [ffff880425760b20] vnlayer_hijacked_lookup at ffffffffa039749d [mvfs]
    [ffff880425760b30] fsf_ops_lookup at ffffffffa0336c8a [cdr]
    [ffff880425760b60] vnlayer_hijacked_lookup at ffffffffa039749d [mvfs]
    [ffff880425760b70] fsf_ops_lookup at ffffffffa0336c8a [cdr]
    [ffff880425760ba0] vnlayer_hijacked_lookup at ffffffffa039749d [mvfs]
    [ffff880425760bb0] fsf_ops_lookup at ffffffffa0336c8a [cdr]
    [ffff880425760be0] vnlayer_hijacked_lookup at ffffffffa039749d [mvfs]
    [ffff880425760bf0] fsf_ops_lookup at ffffffffa0336c8a [cdr]
    [ffff880425760c20] vnlayer_hijacked_lookup at ffffffffa039749d [mvfs]
    [ffff880425760c30] fsf_ops_lookup at ffffffffa0336c8a [cdr]
    [ffff880425760c60] vnlayer_hijacked_lookup at ffffffffa039749d [mvfs]
    [ffff880425760c70] fsf_ops_lookup at ffffffffa0336c8a [cdr]
    [ffff880425760ca0] vnlayer_hijacked_lookup at ffffffffa039749d [mvfs]
    [ffff880425760cb0] fsf_ops_lookup at ffffffffa0336c8a [cdr]
    [ffff880425760ce0] vnlayer_hijacked_lookup at ffffffffa039749d [mvfs]
    [ffff880425760cf0] fsf_ops_lookup at ffffffffa0336c8a [cdr]
    [ffff880425760d20] vnlayer_hijacked_lookup at ffffffffa039749d [mvfs]
    [ffff880425760d30] fsf_ops_lookup at ffffffffa0336c8a [cdr]
    [ffff880425760d60] vnlayer_hijacked_lookup at ffffffffa039749d [mvfs]
    [ffff880425760d70] fsf_ops_lookup at ffffffffa0336c8a [cdr]
    [ffff880425760da0] vnlayer_hijacked_lookup at ffffffffa039749d [mvfs]
    [ffff880425760db0] fsf_ops_lookup at ffffffffa0336c8a [cdr]
    [ffff880425760de0] vnlayer_hijacked_lookup at ffffffffa039749d [mvfs]
    [ffff880425760df0] fsf_ops_lookup at ffffffffa0336c8a [cdr]
    [ffff880425760e20] vnlayer_hijacked_lookup at ffffffffa039749d [mvfs]
    [ffff880425760e30] fsf_ops_lookup at ffffffffa0336c8a [cdr]
    [ffff880425760e60] vnlayer_hijacked_lookup at ffffffffa039749d [mvfs]
    [ffff880425760e70] fsf_ops_lookup at ffffffffa0336c8a [cdr]
    [ffff880425760ea0] vnlayer_hijacked_lookup at ffffffffa039749d [mvfs]
    [ffff880425760eb0] fsf_ops_lookup at ffffffffa0336c8a [cdr]
    [ffff880425760ee0] vnlayer_hijacked_lookup at ffffffffa039749d [mvfs]
    [ffff880425760ef0] fsf_ops_lookup at ffffffffa0336c8a [cdr]
    [ffff880425760f20] vnlayer_hijacked_lookup at ffffffffa039749d [mvfs]
    [ffff880425760f30] fsf_ops_lookup at ffffffffa0336c8a [cdr]
    [ffff880425760f60] vnlayer_hijacked_lookup at ffffffffa039749d [mvfs]
    [ffff880425760f70] fsf_ops_lookup at ffffffffa0336c8a [cdr]
    [ffff880425760fa0] vnlayer_hijacked_lookup at ffffffffa039749d [mvfs]
    [ffff880425760fb0] fsf_ops_lookup at ffffffffa0336c8a [cdr]
    [ffff880425760fe0] vnlayer_hijacked_lookup at ffffffffa039749d [mvfs]
    [ffff880425760ff0] fsf_ops_lookup at ffffffffa0336c8a [cdr]
    [ffff880425761020] vnlayer_hijacked_lookup at ffffffffa039749d [mvfs]
    [ffff880425761030] fsf_ops_lookup at ffffffffa0336c8a [cdr]
    [ffff880425761060] vnlayer_hijacked_lookup at ffffffffa039749d [mvfs]
    [ffff880425761070] fsf_ops_lookup at ffffffffa0336c8a [cdr]
    [ffff8804257610a0] vnlayer_hijacked_lookup at ffffffffa039749d [mvfs]
    [ffff8804257610b0] fsf_ops_lookup at ffffffffa0336c8a [cdr]
    [ffff8804257610e0] vnlayer_hijacked_lookup at ffffffffa039749d [mvfs]
    [ffff8804257610f0] fsf_ops_lookup at ffffffffa0336c8a [cdr]
    [ffff880425761120] vnlayer_hijacked_lookup at ffffffffa039749d [mvfs]
    [ffff880425761130] fsf_ops_lookup at ffffffffa0336c8a [cdr]
    [ffff880425761160] vnlayer_hijacked_lookup at ffffffffa039749d [mvfs]
    [ffff880425761170] fsf_ops_lookup at ffffffffa0336c8a [cdr]
    [ffff8804257611a0] vnlayer_hijacked_lookup at ffffffffa039749d [mvfs]
    [ffff8804257611b0] fsf_ops_lookup at ffffffffa0336c8a [cdr]
    [ffff8804257611e0] vnlayer_hijacked_lookup at ffffffffa039749d [mvfs]
    [ffff8804257611f0] fsf_ops_lookup at ffffffffa0336c8a [cdr]
    [ffff880425761220] vnlayer_hijacked_lookup at ffffffffa039749d [mvfs]
    [ffff880425761230] fsf_ops_lookup at ffffffffa0336c8a [cdr]
    [ffff880425761260] vnlayer_hijacked_lookup at ffffffffa039749d [mvfs]
    [ffff880425761270] fsf_ops_lookup at ffffffffa0336c8a [cdr]
    [ffff8804257612a0] vnlayer_hijacked_lookup at ffffffffa039749d [mvfs]
    [ffff8804257612b0] fsf_ops_lookup at ffffffffa0336c8a [cdr]
    [ffff8804257612e0] vnlayer_hijacked_lookup at ffffffffa039749d [mvfs]
    [ffff8804257612f0] fsf_ops_lookup at ffffffffa0336c8a [cdr]
    [ffff880425761320] vnlayer_hijacked_lookup at ffffffffa039749d [mvfs]
    [ffff880425761330] fsf_ops_lookup at ffffffffa0336c8a [cdr]
    [ffff880425761360] vnlayer_hijacked_lookup at ffffffffa039749d [mvfs]
    [ffff880425761370] fsf_ops_lookup at ffffffffa0336c8a [cdr]
    [ffff8804257613a0] vnlayer_hijacked_lookup at ffffffffa039749d [mvfs]
    [ffff8804257613b0] fsf_ops_lookup at ffffffffa0336c8a [cdr]
    [ffff8804257613e0] vnlayer_hijacked_lookup at ffffffffa039749d [mvfs]
    [ffff8804257613f0] fsf_ops_lookup at ffffffffa0336c8a [cdr]
    [ffff880425761420] vnlayer_hijacked_lookup at ffffffffa039749d [mvfs]
    [ffff880425761430] fsf_ops_lookup at ffffffffa0336c8a [cdr]
    [ffff880425761460] vnlayer_hijacked_lookup at ffffffffa039749d [mvfs]
    [ffff880425761470] fsf_ops_lookup at ffffffffa0336c8a [cdr]
    [ffff8804257614a0] vnlayer_hijacked_lookup at ffffffffa039749d [mvfs]
    [ffff8804257614b0] fsf_ops_lookup at ffffffffa0336c8a [cdr]
    [ffff8804257614e0] vnlayer_hijacked_lookup at ffffffffa039749d [mvfs]
    [ffff8804257614f0] fsf_ops_lookup at ffffffffa0336c8a [cdr]
    [ffff880425761520] vnlayer_hijacked_lookup at ffffffffa039749d [mvfs]
    [ffff880425761530] fsf_ops_lookup at ffffffffa0336c8a [cdr]
    [ffff880425761560] vnlayer_hijacked_lookup at ffffffffa039749d [mvfs]
    [ffff880425761570] fsf_ops_lookup at ffffffffa0336c8a [cdr]
    [ffff8804257615a0] vnlayer_hijacked_lookup at ffffffffa039749d [mvfs]
    [ffff8804257615b0] fsf_ops_lookup at ffffffffa0336c8a [cdr]
    [ffff8804257615e0] vnlayer_hijacked_lookup at ffffffffa039749d [mvfs]
    [ffff8804257615f0] fsf_ops_lookup at ffffffffa0336c8a [cdr]
    [ffff880425761620] vnlayer_hijacked_lookup at ffffffffa039749d [mvfs]
    [ffff880425761630] fsf_ops_lookup at ffffffffa0336c8a [cdr]
    [ffff880425761660] vnlayer_hijacked_lookup at ffffffffa039749d [mvfs]
    [ffff880425761670] fsf_ops_lookup at ffffffffa0336c8a [cdr]
    [ffff8804257616a0] vnlayer_hijacked_lookup at ffffffffa039749d [mvfs]
    [ffff8804257616b0] fsf_ops_lookup at ffffffffa0336c8a [cdr]
    [ffff8804257616e0] vnlayer_hijacked_lookup at ffffffffa039749d [mvfs]
    [ffff8804257616f0] fsf_ops_lookup at ffffffffa0336c8a [cdr]
    [ffff880425761720] vnlayer_hijacked_lookup at ffffffffa039749d [mvfs]
    [ffff880425761730] fsf_ops_lookup at ffffffffa0336c8a [cdr]
    [ffff880425761760] vnlayer_hijacked_lookup at ffffffffa039749d [mvfs]
    [ffff880425761770] fsf_ops_lookup at ffffffffa0336c8a [cdr]
    [ffff8804257617a0] vnlayer_hijacked_lookup at ffffffffa039749d [mvfs]
    [ffff8804257617b0] fsf_ops_lookup at ffffffffa0336c8a [cdr]
    [ffff8804257617e0] vnlayer_hijacked_lookup at ffffffffa039749d [mvfs]
    [ffff8804257617f0] fsf_ops_lookup at ffffffffa0336c8a [cdr]
    [ffff880425761820] vnlayer_hijacked_lookup at ffffffffa039749d [mvfs]
    [ffff880425761830] fsf_ops_lookup at ffffffffa0336c8a [cdr]
    [ffff880425761860] vnlayer_hijacked_lookup at ffffffffa039749d [mvfs]
    [ffff880425761870] fsf_ops_lookup at ffffffffa0336c8a [cdr]
    [ffff8804257618a0] vnlayer_hijacked_lookup at ffffffffa039749d [mvfs]
    [ffff8804257618b0] fsf_ops_lookup at ffffffffa0336c8a [cdr]
    [ffff8804257618e0] vnlayer_hijacked_lookup at ffffffffa039749d [mvfs]
    [ffff8804257618f0] fsf_ops_lookup at ffffffffa0336c8a [cdr]
    [ffff880425761920] vnlayer_hijacked_lookup at ffffffffa039749d [mvfs]
    [ffff880425761930] fsf_ops_lookup at ffffffffa0336c8a [cdr]
    [ffff880425761938] zone_statistics at ffffffff8113b579
    [ffff880425761960] vnlayer_hijacked_lookup at ffffffffa039749d [mvfs]
    [ffff880425761970] fsf_ops_lookup at ffffffffa0336c8a [cdr]
    [ffff8804257619a0] vnlayer_hijacked_lookup at ffffffffa039749d [mvfs]
    [ffff8804257619b0] fsf_ops_lookup at ffffffffa0336c8a [cdr]
    [ffff8804257619e0] vnlayer_hijacked_lookup at ffffffffa039749d [mvfs]
    [ffff8804257619f0] fsf_ops_lookup at ffffffffa0336c8a [cdr]
    [ffff880425761a20] vnlayer_hijacked_lookup at ffffffffa039749d [mvfs]
    [ffff880425761a30] fsf_ops_lookup at ffffffffa0336c8a [cdr]
    [ffff880425761a60] vnlayer_hijacked_lookup at ffffffffa039749d [mvfs]
    [ffff880425761a70] fsf_ops_lookup at ffffffffa0336c8a [cdr]
    [ffff880425761aa0] vnlayer_hijacked_lookup at ffffffffa039749d [mvfs]
    [ffff880425761ab0] fsf_ops_lookup at ffffffffa0336c8a [cdr]
    [ffff880425761ae0] vnlayer_hijacked_lookup at ffffffffa039749d [mvfs]
    [ffff880425761af0] fsf_ops_lookup at ffffffffa0336c8a [cdr]
    [ffff880425761b20] vnlayer_hijacked_lookup at ffffffffa039749d [mvfs]
    [ffff880425761b30] fsf_ops_lookup at ffffffffa0336c8a [cdr]
    [ffff880425761b60] vnlayer_hijacked_lookup at ffffffffa039749d [mvfs]
    [ffff880425761b70] fsf_ops_lookup at ffffffffa0336c8a [cdr]
    [ffff880425761ba0] vnlayer_hijacked_lookup at ffffffffa039749d [mvfs]
    [ffff880425761bb0] fsf_ops_lookup at ffffffffa0336c8a [cdr]
    [ffff880425761be0] vnlayer_hijacked_lookup at ffffffffa039749d [mvfs]
    [ffff880425761bf0] fsf_ops_lookup at ffffffffa0336c8a [cdr]
    [ffff880425761c20] vnlayer_hijacked_lookup at ffffffffa039749d [mvfs]
    [ffff880425761c30] fsf_ops_lookup at ffffffffa0336c8a [cdr]
    [ffff880425761c60] vnlayer_hijacked_lookup at ffffffffa039749d [mvfs]
    [ffff880425761c70] fsf_ops_lookup at ffffffffa0336c8a [cdr]
    [ffff880425761ca0] vnlayer_hijacked_lookup at ffffffffa039749d [mvfs]
    [ffff880425761cb0] do_lookup at ffffffff81190865
    [ffff880425761d10] __link_path_walk at ffffffff81191024
    [ffff880425761d20] page_remove_rmap at ffffffff8114cf34
    [ffff880425761d90] handle_mm_fault at ffffffff8114452a
    [ffff880425761dd0] path_walk at ffffffff81191baa
    [ffff880425761e10] do_path_lookup at ffffffff81191d7b
    [ffff880425761e40] user_path_at at ffffffff81192a07
    [ffff880425761ee0] security_prepare_creds at ffffffff8121c0e6
    [ffff880425761f10] sys_faccessat at ffffffff8117f130
    [ffff880425761f70] sys_access at ffffffff8117f248
    [ffff880425761f80] sysenter_dispatch at ffffffff8104d830
      RIP: 0000000000734430  RSP: 00000000fffaee0c  RFLAGS: 00000296
      RAX: 0000000000000021  RBX: ffffffff8104d830  RCX: 0000000000000000
      RDX: 00000000008d3490  RSI: 00000000fffafe4c  RDI: 0000000000000001
      RBP: 00000000fffb03c8   R8: 0000000000000000   R9: 0000000000000000
      R10: 0000000000000000  R11: 0000000000000000  R12: ffffffff8117f248
      R13: ffff880425761f78  R14: 0000000000000000  R15: 0000000000000000
      ORIG_RAX: 0000000000000021  CS: 0023  SS: 002b
  crash>  
  
Anyway, following the corrupt upid chain leads to the crash utility spin.
I have a couple of things I can add to prevent that from happening,
although in all probability it will never be seen again.
  
I guess you could call it a regression, but again, if a vmcore is corrupted
to the point where some of the basic requirements for the crash session
to come up are compromised, well, then shit like this can happen.

Comment 10 Dave Anderson 2015-04-02 20:02:12 UTC
By "if a vmcore is corrupted", I mean "if the crashed system's memory was
corrupted".  The vmcore is fine.

Comment 11 Dave Anderson 2015-04-02 20:16:46 UTC
The search command shows that the stack overrun continued
downwards for 529 pages, or over 2MB worth of memory corruption.

That's pretty impressive corruption right there...  ;-)
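
For reference, the "over 2MB" figure follows directly from the page count. A quick Python sketch, assuming the usual 4 KiB page size on x86_64 (the page size is not stated in this report):

```python
PAGE_SIZE = 4096  # assumption: 4 KiB pages on x86_64


def overrun_bytes(pages: int, page_size: int = PAGE_SIZE) -> int:
    """Total number of bytes covered by the given number of pages."""
    return pages * page_size


total = overrun_bytes(529)
print(total, "bytes =", round(total / 2**20, 2), "MiB")
```

That works out to 2,166,784 bytes, or a little under 2.07 MiB.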

Comment 12 Dave Anderson 2015-04-02 20:37:57 UTC
And the corruption overwrote the memory containing that first upid 
structure in the pid_hash[62] chain:

  crash> struct upid -l upid.pid_chain 0xffff8804255f5c80
  struct upid {
    nr = -1607242614, 
    ns = 0xffffffff81aa31a0 <init_pid_ns>, 
    pid_chain = {
      next = 0xffff880425761e48, 
      pprev = 0xffff88043cda28c0
    }
  }
  crash> rd -S 0xffff8804255f5c80 100
  ffff8804255f5c80:  ffff880425761e48 [ext4_inode_cache] 
  ffff8804255f5c90:  [dentry]         [pid]            
  ffff8804255f5ca0:  vnlayer_hijacked_lookup+44 [pid]            
  ffff8804255f5cb0:  fsf_ops_lookup+58 0000000000000000 
  ffff8804255f5cc0:  ffff880425761e48 [ext4_inode_cache] 
  ffff8804255f5cd0:  [dentry]         [pid]            
  ffff8804255f5ce0:  vnlayer_hijacked_lookup+44 [pid]            
  ffff8804255f5cf0:  fsf_ops_lookup+58 init_pid_ns      
  ffff8804255f5d00:  ffff880425761e48 [ext4_inode_cache] 
  ffff8804255f5d10:  [dentry]         [pid]            
  ffff8804255f5d20:  vnlayer_hijacked_lookup+44 [pid]            
  ffff8804255f5d30:  fsf_ops_lookup+58 0000000000000000 
  ... [ repeat ] ...
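
The addresses above can be related to the panic task's stack with simple arithmetic. The Python sketch below takes the THREAD_INFO, pid_hash[62].first and pid_chain.next values from this report; the 8 KiB kernel stack size and 4 KiB page size are assumptions about this x86_64 kernel, and the helper names are hypothetical.

```python
THREAD_INFO = 0xffff880425760000  # THREAD_INFO of export_mvfs, from the report
UPID_CHAIN = 0xffff8804255f5c80   # pid_hash[62].first, from the report
CHAIN_NEXT = 0xffff880425761e48   # upid.pid_chain.next, from the report
THREAD_SIZE = 8192                # assumption: 8 KiB kernel stacks on this kernel
PAGE_SIZE = 4096                  # assumption: 4 KiB pages on x86_64


def in_stack(addr: int, base: int = THREAD_INFO, size: int = THREAD_SIZE) -> bool:
    """True if addr lies inside the thread_info/stack area that starts at base."""
    return base <= addr < base + size


def bytes_below_stack(addr: int, base: int = THREAD_INFO) -> int:
    """Distance in bytes from addr up to the bottom of the stack area."""
    return base - addr


gap = bytes_below_stack(UPID_CHAIN)
print("next points into the stack:", in_stack(CHAIN_NEXT))
print("upid is", gap, "bytes (", round(gap / PAGE_SIZE, 1), "pages ) below the stack base")
```

Under those assumptions the overwritten upid lies about 1.4 MiB (roughly 362 pages) below the stack base, which is inside the 529-page overrun, and its "next" pointer is a value that points back into the export_mvfs stack rather than at another upid.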

Comment 13 Dave Anderson 2015-04-09 15:39:17 UTC
A fix for handling this type of kernel memory corruption has been
applied upstream:

  https://github.com/crash-utility/crash/commit/39fffdc78c13c8a3464b373beac99a89c25456bc

  Fortified the error handling of task gathering from the pid_hash[]
  chains during session initialization.  If a chain has been corrupted,
  the patch prevents the sequence from entering an infinite loop, and
  the error messages associated with corrupt/invalid chains have been
  updated to report the pid_hash[] index number.
  (anderson@redhat.com)
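
The general idea behind this kind of hardening can be sketched in a few lines of Python. This is not the code from the upstream commit (crash is written in C, and the patch itself is not reproduced in this report); it is a hypothetical model in which memory is a dict mapping a node address to its "next" pointer, and the function name, messages and node limit are invented for illustration.

```python
def walk_chain(index, first, next_of, max_nodes=1 << 16):
    """Follow one hash chain without ever looping forever.

    index     -- hash bucket number, used only for error reporting
    first     -- address of the first node (0 means an empty chain)
    next_of   -- dict modelling readable memory: node address -> next address
    Returns (nodes, error); error is None when the chain ended cleanly.
    """
    seen = set()
    nodes = []
    node = first
    while node:
        if node in seen:
            return nodes, f"pid_hash[{index}]: duplicate node {node:#x}, chain abandoned"
        if node not in next_of:
            return nodes, f"pid_hash[{index}]: unreadable node {node:#x}, chain abandoned"
        if len(nodes) >= max_nodes:
            return nodes, f"pid_hash[{index}]: more than {max_nodes} nodes, chain abandoned"
        seen.add(node)
        nodes.append(node)
        node = next_of[node]
    return nodes, None


# Hypothetical corrupt chain: the second node points back at itself.
memory = {0xffff8804255f5c80: 0xffff880425761e48,
          0xffff880425761e48: 0xffff880425761e48}
nodes, error = walk_chain(62, 0xffff8804255f5c80, memory)
print(len(nodes), error)
```

Remembering the nodes already visited bounds the walk by the number of distinct nodes, and including the bucket index in the message makes the bad chain easy to locate afterwards.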

Comment 15 Dave Anderson 2015-04-21 14:23:00 UTC
Information for build crash-7.1.0-3.el6:
https://brewweb.devel.redhat.com/buildinfo?buildID=430557

Comment 24 errata-xmlrpc 2015-07-22 06:27:20 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://rhn.redhat.com/errata/RHBA-2015-1309.html

