We have a possible regression reported by a user. The kernel panics when the stap command of systemtap is executed multiple times concurrently. It does not occur in RHEL 5.3 (kernel-2.6.18-128.7.1.el5). The issue has been confirmed on the RHEL 5.4 beta, on Snapshot 2, and on the GA kernel. The Oops message accompanying the crash is:

crash> bt
PID: 31620  TASK: e0000040d7b20000  CPU: 0  COMMAND: "staprun"
 #0 [BSP:e0000040d7b21608] machine_kexec at a00000010005a380
 #1 [BSP:e0000040d7b215e0] crash_kexec at a0000001000cef90
 #2 [BSP:e0000040d7b21578] panic at a00000010007bbd0
 #3 [BSP:e0000040d7b21530] die at a000000100037ef0
 #4 [BSP:e0000040d7b214e0] ia64_do_page_fault at a00000010066dd90
 #5 [BSP:e0000040d7b214e0] __ia64_leave_kernel at a00000010000bfe0
  EFRAME: e0000040d7b27ba0
  B0: a0000001001a1dc0 CR_IIP: a0000001001a1e40 CR_IPSR: 00001210085a6010 CR_IFS: 8000000000000610
  AR_PFS: 0000000000000610 AR_RSC: 0000000000000003 AR_UNAT: 0000000000000000 AR_RNAT: e0000040d7b27d30
  AR_CCV: 0000000000000000 AR_FPSR: 0009804c0270033f LOADRS: 0000000000000000 AR_BSPSTORE: e0000040d7b20000
  B6: a0000001002dcc40 B7: a00000010026e720 PR: 0000000000556a99
  R1: a000000100c36560 R2: e0000040d7b20328 R3: e0000040d7b2032c
  R8: 0000000000000000 R9: e0000001308f9a80 R10: 9e37fffffffc0001 R11: a000000100818318
  R12: e0000040d7b27d60 R13: e0000040d7b20000 R14: 2031363439656c69 R15: 000000000000006c
  R16: 00fc7000357651f1 R17: e0000040d7b27d88 R18: 000000003797ae1e R19: 00000000003fffff
  R20: 0000000000000016 R21: a000000100a4a988 R22: 00fc6ffffddea3d3 R23: 000000003797ae1e
  R24: a000000100818310 R25: a000000100a4a978 R26: 7e37fffeef51e9d9 R27: 0000000000000000
  R28: a000000100a4a980 R29: 0000000000000001 R30: e0000001308f99e0 R31: e000000110adea60
  F6: 000000000000000000000 F7: 000000000000000000000 F8: 000000000000000000000
  F9: 000000000000000000000 F10: 000000000000000000000 F11: 000000000000000000000
 #6 [BSP:e0000040d7b21460] __link_path_walk at a0000001001a1e40
 #7 [BSP:e0000040d7b21420] link_path_walk at a0000001001a2c90
 #8 [BSP:e0000040d7b213d8] do_path_lookup at a0000001001a3750
 #9 [BSP:e0000040d7b21380] __path_lookup_intent_open at a0000001001a4ea0
#10 [BSP:e0000040d7b21340] path_lookup_open at a0000001001a5020
#11 [BSP:e0000040d7b212d0] open_namei at a0000001001a5330
#12 [BSP:e0000040d7b21298] do_filp_open at a0000001001734b0
#13 [BSP:e0000040d7b21248] do_sys_open at a0000001001735a0
#14 [BSP:e0000040d7b211e8] sys_open at a000000100173790
#15 [BSP:e0000040d7b211e8] __ia64_trace_syscall at a00000010000bd70
  EFRAME: e0000040d7b27e40
  B0: 200000080014a4f0 CR_IIP: a000000000010620 CR_IPSR: 00001213085a6010 CR_IFS: 0000000000000008
  AR_PFS: c000000000000008 AR_RSC: 000000000000000f AR_UNAT: 0000000000000000 AR_RNAT: 0000000000000000
  AR_CCV: 0000000000000000 AR_FPSR: 0009804c8a70033f LOADRS: 0000000001480000 AR_BSPSTORE: 600007fffffb8130
  B6: 2000000800210f80 B7: a000000000010640 PR: 0000000000551a41
  R1: 2000000800280238 R2: e0000040d7b27ee0 R3: e0000040d7b27ef8
  R8: 0000000000000001 R9: 2000000800303c88 R10: 0000000000000000 R11: c000000000000593
  R12: 60000ffffffb2fe0 R13: 200000080030c6d0 R14: a000000000010640 R15: 0000000000000404
  R16: e0000040d7b210a0 R17: 0000000000000710 R18: 0000000000000160 R19: 0000000001400000
  R20: 0009804c8a70033f R21: 2000000800008600 R22: 0000000000000000 R23: 600007fffffb8130
  R24: 0000000000000000 R25: 0000000000000000 R26: c000000000000004 R27: 000000000000000f
  R28: a000000000010620 R29: 00001213085a6010 R30: 0000000000000004 R31: 0000000000551a41
  F6: 000000000000000000000 F7: 000000000000000000000 F8: 000000000000000000000
  F9: 000000000000000000000 F10: 000000000000000000000 F11: 000000000000000000000
#16 [BSP:e0000040d7b211e8] __kernel_syscall_via_break at a000000000010620

Version-Release number of selected component:
Red Hat Enterprise Linux Version Number: 5.4
Release Number: Snapshot 2
Architecture: ia64
Kernel Version: 2.6.18-157.el5
Related Package Version: systemtap-0.9.7-5.el5
Related Middleware / Application: none
Could I get a pointer to the vmlinux/vmcore pair?
> And our developer and investigator use as one of debugging work when the
> trouble occurs. They can not use SystemTap command by this issue.
> It is influence that the shipment of RHEL5.4 and relevant middleware might be
> delayed.

Can we agree that this test is insane? And to state that the failure implies that "They can not use SystemTap command by this issue." is a bit hard to swallow.

I'm not much of a systemtap user, but as I understand it, all of these scripts and their resultant kernel modules are run simultaneously:

  disktop.stp: places retprobes in vfs_write(), vfs_read(), plus 7 other (timer-related?) locations
  para-callgraph2.stp: places 3277 kprobes in all functions in the kernel's "fs" directory tree.
  test1.stp: places 512 kprobes and kretprobes -- in all system calls
  traceio.stp: places retprobes in vfs_write(), vfs_read() and a timer
  kmalloc2.stp: kprobes __kmalloc()
  syscalls_by_proc.stp: places 266 kprobes (all system calls)
  top.stp: places 266 system call probes (all system calls)

I've created my own kprobes for debugging purposes, but I was not aware that it was even "legal" to have simultaneous kprobes on the same function? Perhaps that's OK to do -- maybe they get "chained" somehow if a kprobe is placed at the same location more than one time?

The dumps are interesting, but they don't point to anything that is an obvious bug; rather, both seem to bump into a section of kmalloc'd memory that has been corrupted with ASCII data.

Anyway, in the first core, "staprun" is trying to open the file "/sys/module/ohci_hcd/sections/.data.rel.1", and it ran into what appears to be inode corruption in the __link_path_walk() function:

crash> bt
PID: 31620 TASK: e0000040d7b20000 CPU: 0 COMMAND: "staprun" #0 [BSP:e0000040d7b21608] machine_kexec at a00000010005a380 #1 [BSP:e0000040d7b215e0] crash_kexec at a0000001000cef90 #2 [BSP:e0000040d7b21578] panic at a00000010007bbd0 #3 [BSP:e0000040d7b21530] die at a000000100037ef0 #4 [BSP:e0000040d7b214e0] ia64_do_page_fault at a00000010066dd90 #5 [BSP:e0000040d7b214e0] __ia64_leave_kernel at a00000010000bfe0 EFRAME: e0000040d7b27ba0 B0: a0000001001a1dc0 CR_IIP: a0000001001a1e40 CR_IPSR: 00001210085a6010 CR_IFS: 8000000000000610 AR_PFS: 0000000000000610 AR_RSC: 0000000000000003 AR_UNAT: 0000000000000000 AR_RNAT: e0000040d7b27d30 AR_CCV: 0000000000000000 AR_FPSR: 0009804c0270033f LOADRS: 0000000000000000 AR_BSPSTORE: e0000040d7b20000 B6: a0000001002dcc40 B7: a00000010026e720 PR: 0000000000556a99 R1: a000000100c36560 R2: e0000040d7b20328 R3: e0000040d7b2032c R8: 0000000000000000 R9: e0000001308f9a80 R10: 9e37fffffffc0001 R11: a000000100818318 R12: e0000040d7b27d60 R13: e0000040d7b20000 R14: 2031363439656c69 R15: 000000000000006c R16: 00fc7000357651f1 R17: e0000040d7b27d88 R18: 000000003797ae1e R19: 00000000003fffff R20: 0000000000000016 R21: a000000100a4a988 R22: 00fc6ffffddea3d3 R23: 000000003797ae1e R24: a000000100818310 R25: a000000100a4a978 R26: 7e37fffeef51e9d9 R27: 0000000000000000 R28: a000000100a4a980 R29: 0000000000000001 R30: e0000001308f99e0 R31: e000000110adea60 F6: 000000000000000000000 F7: 000000000000000000000 F8: 000000000000000000000 F9: 000000000000000000000 F10: 000000000000000000000 F11: 000000000000000000000 #6 [BSP:e0000040d7b21460] __link_path_walk at a0000001001a1e40 #7 [BSP:e0000040d7b21420] link_path_walk at a0000001001a2c90 #8 [BSP:e0000040d7b213d8] do_path_lookup at a0000001001a3750 #9 [BSP:e0000040d7b21380] __path_lookup_intent_open at a0000001001a4ea0 #10 [BSP:e0000040d7b21340] path_lookup_open at a0000001001a5020 #11
[BSP:e0000040d7b212d0] open_namei at a0000001001a5330 #12 [BSP:e0000040d7b21298] do_filp_open at a0000001001734b0 #13 [BSP:e0000040d7b21248] do_sys_open at a0000001001735a0 #14 [BSP:e0000040d7b211e8] sys_open at a000000100173790 #15 [BSP:e0000040d7b211e8] __ia64_trace_syscall at a00000010000bd70 EFRAME: e0000040d7b27e40 B0: 200000080014a4f0 CR_IIP: a000000000010620 CR_IPSR: 00001213085a6010 CR_IFS: 0000000000000008 AR_PFS: c000000000000008 AR_RSC: 000000000000000f AR_UNAT: 0000000000000000 AR_RNAT: 0000000000000000 AR_CCV: 0000000000000000 AR_FPSR: 0009804c8a70033f LOADRS: 0000000001480000 AR_BSPSTORE: 600007fffffb8130 B6: 2000000800210f80 B7: a000000000010640 PR: 0000000000551a41 R1: 2000000800280238 R2: e0000040d7b27ee0 R3: e0000040d7b27ef8 R8: 0000000000000001 R9: 2000000800303c88 R10: 0000000000000000 R11: c000000000000593 R12: 60000ffffffb2fe0 R13: 200000080030c6d0 R14: a000000000010640 R15: 0000000000000404 R16: e0000040d7b210a0 R17: 0000000000000710 R18: 0000000000000160 R19: 0000000001400000 R20: 0009804c8a70033f R21: 2000000800008600 R22: 0000000000000000 R23: 600007fffffb8130 R24: 0000000000000000 R25: 0000000000000000 R26: c000000000000004 R27: 000000000000000f R28: a000000000010620 R29: 00001213085a6010 R30: 0000000000000004 R31: 0000000000551a41 F6: 000000000000000000000 F7: 000000000000000000000 F8: 000000000000000000000 F9: 000000000000000000000 crash> The address it tried to use was 2031363439656cb9: crash> log | grep Unable Unable to handle kernel paging request at virtual address 2031363439656cb9 crash> which is ASCII data for the most part (except for the b9 at the end): crash> ascii 2031363439656cb9 2031363439656cb9: <b9>le9461 crash> And as it turns out there is corruption in the inode slab cache: crash> kmem -s ... [ snip ] ... e00000010a5a1800 bdev_cache 896 31 36 2 16k e00000010a561780 sysfs_dir_cache 88 6451 6726 38 16k e00000010a559700 mnt_cache 256 33 124 2 16k kmem: inode_cache: full list: slab: e000000130884000 bad next pointer: 6c69662f30726964 kmem: inode_cache: full list: slab: e000000130884000 bad prev pointer: 6964203837333865 kmem: inode_cache: full list: slab: e000000130884000 bad inuse counter: 1818846767 kmem: inode_cache: full list: slab: e000000130884000 bad inuse counter: 1818846767 kmem: inode_cache: full list: slab: e000000130884000 bad s_mem pointer: 3072696420393733 e000000109559680 inode_cache 568 749 980 35 16k e000000107529600 dentry_cache 216 56855 57498 777 16k ... 
The slab structure at the bottom of an inode cache slab (at e000000130884000) contains repetetive ASCII data -- but the ASCII data starts several pages below that at e00000013087c000: crash> rd e00000013087c000 6934 e00000013087c000: 2037333036656c69 6c69662f30726964 ile6037 dir0/fil e00000013087c010: 6964203833303665 36656c69662f3072 e6038 dir0/file6 e00000013087c020: 3072696420393330 343036656c69662f 039 dir0/file604 e00000013087c030: 662f307269642030 2031343036656c69 0 dir0/file6041 e00000013087c040: 6c69662f30726964 6964203234303665 dir0/file6042 di e00000013087c050: 36656c69662f3072 3072696420333430 r0/file6043 dir0 e00000013087c060: 343036656c69662f 662f307269642034 /file6044 dir0/f e00000013087c070: 2035343036656c69 6c69662f30726964 ile6045 dir0/fil e00000013087c080: 6964203634303665 36656c69662f3072 e6046 dir0/file6 e00000013087c090: 3072696420373430 343036656c69662f 047 dir0/file604 e00000013087c0a0: 662f307269642038 2039343036656c69 8 dir0/file6049 e00000013087c0b0: 6c69662f30726964 6964203035303665 dir0/file6050 di e00000013087c0c0: 36656c69662f3072 3072696420313530 r0/file6051 dir0 e00000013087c0d0: 353036656c69662f 662f307269642032 /file6052 dir0/f e00000013087c0e0: 2033353036656c69 6c69662f30726964 ile6053 dir0/fil e00000013087c0f0: 6964203435303665 36656c69662f3072 e6054 dir0/file6 e00000013087c100: 3072696420353530 353036656c69662f 055 dir0/file605 e00000013087c110: 662f307269642036 2037353036656c69 6 dir0/file6057 e00000013087c120: 6c69662f30726964 6964203835303665 dir0/file6058 di e00000013087c130: 36656c69662f3072 3072696420393530 r0/file6059 dir0 e00000013087c140: 363036656c69662f 662f307269642030 /file6060 dir0/f e00000013087c150: 2031363036656c69 6c69662f30726964 ile6061 dir0/fil e00000013087c160: 6964203236303665 36656c69662f3072 e6062 dir0/file6 e00000013087c170: 3072696420333630 363036656c69662f 063 dir0/file606 e00000013087c180: 662f307269642034 2035363036656c69 4 dir0/file6065 ... and it continues for several pages (55472 bytes) until here: ... e000000130889870: 39656c69662f3072 3072696420353939 r0/file9995 dir0 e000000130889880: 393939656c69662f 662f307269642036 /file9996 dir0/f e000000130889890: 2037393939656c69 6c69662f30726964 ile9997 dir0/fil e0000001308898a0: 6964203839393965 39656c69662f3072 e9998 dir0/file9 e0000001308898b0: 0000000000393939 0000000000000000 999............. e0000001308898c0: 0000000000000000 0000000000000000 ................ e0000001308898d0: 0000000000000000 0000000000000000 ................ e0000001308898e0: 0000000000000000 0000000000000000 ................ e0000001308898f0: 0000000000000000 0000000000000000 ................ If I strip off the "b9" from the 2031363439656cb9, which was probably created by adding a structure member offset, and look for it in the memory containing the ASCII data, I see this: crash> rd e00000013087c000 6934 | grep 2031363439656c e000000130887b40: 2031363439656c69 6c69662f30726964 ile9461 dir0/fil crash> And if I search the crashing task's stack, I see that the address above has been used: crash> rd e0000040d7b20000 4096 | grep e000000130887b40 e0000040d7b214a0: e000000130887b40 a0000001001a2c90 @{.0.....,...... crash> That's as far as I have gone with that vmcore. In core2, slab corruption also exists: crash> kmem -s ... [ snip ] ... 
e000000107388480 size-1024 1024 562 660 44 16k e000000107380400 size-512(DMA) 512 0 0 0 16k kmem: size-512: full list: slab: e00000408339c080 bad next pointer: 6964203634303665 kmem: size-512: full list: slab: e00000408339c080 bad prev pointer: 36656c69662f3072 kmem: size-512: full list: slab: e00000408339c080 bad inuse counter: 1768169528 kmem: size-512: full list: slab: e00000408339c080 bad inuse counter: 1768169528 kmem: size-512: full list: slab: e00000408339c080 bad s_mem pointer: 343036656c69662f e000000107378380 size-512 512 5790 6262 202 16k e000000107370300 size-256(DMA) 256 0 0 0 16k e000000107368280 size-256 256 2201 3286 53 16k e000000107360200 size-128(DMA) 128 0 0 0 16k e000000107350180 size-64(DMA) 64 0 0 0 16k e000000107348100 size-128 128 1801 2706 22 16k kmem: size-64: full list: slab: e0000040833ac000 bad next pointer: 6420333031392d72 kmem: size-64: full list: slab: e0000040833ac000 bad prev pointer: 20343031392d7269 kmem: size-64: full list: slab: e0000040833ac000 bad inuse counter: 1768169526 kmem: size-64: full list: slab: e0000040833ac000 bad inuse counter: 1768169526 kmem: size-64: full list: slab: e0000040833ac000 bad s_mem pointer: 3031392d72696420 e000000107340080 size-64 64 733752 748320 3118 16k a000000100a0c3e8 kmem_cache 16512 149 149 149 32k crash> Like the other vmcore, there's a multiple-page chunk of repetetive ASCII data. In this dump it starts at e000004083398980 and ends at e0000040833ae280. It contains similar "dir#/file###" entries, a small chunk of what looks like systemtap output data, and then more file stuff: crash> rd e000004083398980 11040 e000004083398980: 662f30006f2e6e74 2031343035656c69 tn.o.0/file5041 e000004083398990: 6c69662f30726964 6964203234303565 dir0/file5042 di e0000040833989a0: 35656c69662f3072 3072696420333430 r0/file5043 dir0 e0000040833989b0: 343035656c69662f 662f307269642034 /file5044 dir0/f e0000040833989c0: 2035343035656c69 6c69662f30726964 ile5045 dir0/fil e0000040833989d0: 6964203634303565 35656c69662f3072 e5046 dir0/file5 e0000040833989e0: 3072696420373430 343035656c69662f 047 dir0/file504 e0000040833989f0: 662f307269642038 2039343035656c69 8 dir0/file5049 e000004083398a00: 6c69662f30726964 6964203035303565 dir0/file5050 di e000004083398a10: 35656c69662f3072 3072696420313530 r0/file5051 dir0 ... [ snip ] ... e000004083398dc0: 6964203831313565 35656c69662f3072 e5118 dir0/file5 e000004083398dd0: 3072696420393131 323135656c69662f 119 dir0/file512 e000004083398de0: 662f307269642030 0000000000000000 0 dir0/f........ e000004083398df0: 2020202020202020 2064687373202020 sshd e000004083398e00: 3637322020203a72 2042694b20323338 r: 276832 KiB e000004083398e10: 3937322020203a77 0a42694b20373232 w: 279227 KiB. e000004083398e20: 2020202020202020 206f697061747320 stapio e000004083398e30: 3633322020203a72 2042694b20333136 r: 236613 KiB e000004083398e40: 3633322020203a77 0a42694b20323036 w: 236602 KiB. e000004083398e50: 2020202020202020 2068736172632020 crash e000004083398e60: 3831362020203a72 2042694b20333137 r: 618713 KiB e000004083398e70: 35656c69662f3072 3072696420313331 r0/file5131 dir0 e000004083398e80: 333135656c69662f 662f307269642032 /file5132 dir0/f e000004083398e90: 2033333135656c69 6c69662f30726964 ile5133 dir0/fil ... [ snip ] ... 
e0000040833a9850: 662f307269642032 2033393939656c69 2 dir0/file9993 e0000040833a9860: 6c69662f30726964 6964203439393965 dir0/file9994 di e0000040833a9870: 39656c69662f3072 3072696420353939 r0/file9995 dir0 e0000040833a9880: 393939656c69662f 662f307269642036 /file9996 dir0/f e0000040833a9890: 2037393939656c69 6c69662f30726964 ile9997 dir0/fil e0000040833a98a0: 6964203839393965 39656c69662f3072 e9998 dir0/file9 e0000040833a98b0: 3830382d00393939 30382d7269642035 999.-8085 dir-80 e0000040833a98c0: 382d726964203638 2d72696420373830 86 dir-8087 dir- e0000040833a98d0: 7269642038383038 696420393830382d 8088 dir-8089 di e0000040833a98e0: 6964203930382d72 6420303930382d72 r-809 dir-8090 d e0000040833a98f0: 20313930382d7269 323930382d726964 ir-8091 dir-8092 e0000040833a9900: 3930382d72696420 30382d7269642033 dir-8093 dir-80 ... [ snip ] ... The address being dereferenced in this dump is 2d72696420303439: crash> log | grep Unable Unable to handle kernel paging request at virtual address 2d72696420303439 crash> ascii 2d72696420303439 2d72696420303439: 940 dir- crash> crash> bt PID: 4287 TASK: e00000409b170000 CPU: 1 COMMAND: "stapio" #0 [BSP:e00000409b1714a0] machine_kexec at a00000010005a380 #1 [BSP:e00000409b171480] crash_kexec at a0000001000cef90 #2 [BSP:e00000409b171418] panic at a00000010007bbd0 #3 [BSP:e00000409b1713c8] die at a000000100037ef0 #4 [BSP:e00000409b171378] ia64_do_page_fault at a00000010066dd90 #5 [BSP:e00000409b171378] __ia64_leave_kernel at a00000010000bfe0 EFRAME: e00000409b177bb0 B0: a00000010066f230 CR_IIP: a00000010066f1f0 CR_IPSR: 00001210085a6010 CR_IFS: 8000000000000389 AR_PFS: 0000000000000389 AR_RSC: 0000000000000003 AR_UNAT: 0000000000000000 AR_RNAT: 0000000000551559 AR_CCV: 0000000000000000 AR_FPSR: 0009804c8a70033f LOADRS: 0000000000000000 AR_BSPSTORE: a0000001001691c0 B6: a0000001006690e0 B7: a00000010000b7e0 PR: 0000000000556659 R1: a000000100c36560 R2: 0000000000000389 R3: a0000001001caa60 R8: e0000040833ae040 R9: 0000000000000054 R10: a000000209f336b8 R11: e000004089508028 R12: e00000409b177d70 R13: e00000409b170000 R14: 392d726964203439 R15: e0000040833ae048 R16: 0000000000000000 R17: 0000000000200200 R18: 0000000000200200 R19: 392d726964203441 R20: 2d72696420303439 R21: 392d726964203439 R22: 0000000000000063 R23: 0000000000000062 R24: 0000000000000080 R25: 0000000000024000 R26: 0000000000000000 R27: 0000000000004000 R28: 0000000000004000 R29: 00000000070f04ce R30: 0000000000000000 R31: a000000100a4f708 F6: 1003e00000000000000a5 F7: 0ffdd8000000000000000 F8: 10006a4ffd6c000000000 F9: 100058000000000000000 F10: 10014b23947fff4dc6b80 F11: 1003e00000000002c8e52 #6 [BSP:e00000409b171330] unregister_kretprobes at a00000010066f1f0 #7 [BSP:e00000409b1712f8] _stp_cleanup_and_exit at a000000209e185d0 #8 [BSP:e00000409b171270] _stp_ctl_write_cmd at a000000209e19170 #9 [BSP:e00000409b171220] vfs_write at a000000100177ea0 #10 [BSP:e00000409b1711a0] kretprobe_trampoline at a000000100059f80 #11 [BSP:e00000409b171128] unregister_kretprobes at a00000010066f230 crash> And if I look in the corrupted memory section for it, I can see a couple references: crash> rd e000004083398980 11040 | grep 2d72696420303439 e0000040833acb70: 2d72696420303439 7269642030303439 940 dir-9400 dir e0000040833ae040: 392d726964203439 2d72696420303439 94 dir-9940 dir- crash> And looking into the kernel stack of the crashing task, the second reference is in the exception frame: crash> bt | grep e0000040833ae040 R8: e0000040833ae040 R9: 0000000000000054 crash> But at this point, I don't know where to go from 
there. I'm guessing that the corruption took place long before either of the two instances that crashed ran into it. And it's highly likely that the problem is in the systemtap-generated kernel modules themselves and not in the kernel proper.
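Regarding the question above about whether it is "legal" to have simultaneous kprobes on the same function: it is, and they do get chained -- the kprobes layer converts the probe point into an aggregate probe and calls each registered handler in turn (that is the aggr_pre_handler() that shows up in one of the later backtraces). As a minimal sketch of what that looks like from a module's point of view (my own illustration, not code from systemtap; the probed symbol is arbitrary):

/*
 * Sketch only: two kprobes registered at the same function.  The kprobes
 * layer accepts the second registration and chains both pre_handlers
 * behind an aggregate probe.
 */
#include <linux/module.h>
#include <linux/kprobes.h>

static int handler_a(struct kprobe *p, struct pt_regs *regs)
{
	printk(KERN_INFO "probe A hit at %p\n", p->addr);
	return 0;
}

static int handler_b(struct kprobe *p, struct pt_regs *regs)
{
	printk(KERN_INFO "probe B hit at %p\n", p->addr);
	return 0;
}

static struct kprobe kp_a = { .symbol_name = "sys_shutdown", .pre_handler = handler_a };
static struct kprobe kp_b = { .symbol_name = "sys_shutdown", .pre_handler = handler_b };

static int __init chain_init(void)
{
	int ret = register_kprobe(&kp_a);

	if (ret)
		return ret;
	ret = register_kprobe(&kp_b);	/* second probe at the same address gets chained */
	if (ret)
		unregister_kprobe(&kp_a);
	return ret;
}

static void __exit chain_exit(void)
{
	unregister_kprobe(&kp_b);
	unregister_kprobe(&kp_a);
}

module_init(chain_init);
module_exit(chain_exit);
MODULE_LICENSE("GPL");

So several stap-generated modules probing vfs_read()/vfs_write() at the same time should be fine as far as the kprobes API itself is concerned.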
Actually it looks like it's a buffer larger than 64K; the second one is at least 88K bytes in length.
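For reference, those sizes can be checked against the addresses quoted in the analysis (my arithmetic, not taken from the vmcores directly):

  first core:  e0000001308898b0 - e00000013087c000 = 0xd8b0  = 55,472 bytes (the 55472-byte figure above)
  second core: e0000040833ae280 - e000004083398980 = 0x15900 = 88,320 bytes, i.e. "at least 88K"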
I was unsuccessful in reproducing this on an ia64 using the supplied instructions, although they don't make a lot of sense. I ran the "all_exec.sh" script to kick off the set of stap scripts, and then started up the "io-load-test.sh" script. But then I didn't understand what was supposed to be done after the above:

> 3. Do one or some of the followings:

Huh? "one or some of the followings"?

> a: Execute a stap script again.
>    For example, execution of "stap -v -k disktop.stp"

OK, I did the example above in the background, and found the PID of the "stap -v -k disktop.stp" command's child "stapio" process.

> b: Send SIGTERM to a process of stapio.

OK, I did that -- again presuming they're talking about the child "stapio" of the parent "stap -v -k disktop.stp" command. But when the "stapio" child is killed, it also causes the parent "stap" command to die as well.

> c: Run gdb.

Run gdb on what? Just enter "gdb" alone?

> d: Type Ctrl-C to terminate a process of stap.

It's already dead as a result of the SIGTERM I sent to the child "stapio" process.

So after giving up, I started looking at the two supplied vmcore files on the same system, and later on decided to "scp" them to another machine for storage. Only *then* did the system crash! Anyway, here's what I see in the new vmcore: an "ssh" instance was shutting down a socket, and bumped into a kprobe at the first instruction in sys_shutdown():

crash> bt
PID: 24734 TASK: e0000003a1b20000 CPU: 0 COMMAND: "ssh" #0 [BSP:e0000003a1b213d8] machine_kexec at a00000010005a380 #1 [BSP:e0000003a1b213b8] crash_kexec at a0000001000cefb0 #2 [BSP:e0000003a1b21350] panic at a00000010007bc70 #3 [BSP:e0000003a1b21308] die at a000000100037ef0 #4 [BSP:e0000003a1b212b8] ia64_do_page_fault at a00000010066dd90 #5 [BSP:e0000003a1b212b8] __ia64_leave_kernel at a00000010000bfe0 EFRAME: e0000003a1b279f0 B0: a00000010066cb90 CR_IIP: a000000100670500 CR_IPSR: 0000121008022038 CR_IFS: 8000000000000001 AR_PFS: 0000000000000286 AR_RSC: 0000000000000003 AR_UNAT: 0000000000000000 AR_RNAT: 0000000500030002 AR_CCV: 0000000000000000 AR_FPSR: 0009804c0270033f LOADRS: 0000000000000000 AR_BSPSTORE: 0000000000000024 B6: a00000010066ff40 B7: a000000100670640 PR: 00000000005619a5 R1: a000000100c36560 R2: 0000000000000389 R3: 0000000000100100 R8: 0000000000200200 R9: 373038656c696637 R10: 662f307269642036 R11: 373038656c69662f R12: e0000003a1b27bb0 R13: e0000003a1b20000 R14: a000000100059f80 R15: e0000003a496afb0 R16: a0000002050c314c R17: a000000100d28b88 R18: ffffffffffff0210 R19: e0000003a496af88 R20: e0000003a496af90 R21: e0000003a496af98 R22: e0000003a496afa0 R23: 0004014dc5800000 R24: a000000100537160 R25: e0000003a84011b0 R26: e0000040f10ee780 R27: 0000000000004000 R28: e0000001075d4838 R29: e0000001075d4808 R30: 0000000000000000 R31: a000000100a4f750 F6: 000000000000000000000 F7: 000000000000000000000 F8: 000000000000000000000 F9: 000000000000000000000 F10: 0fffef13d31b9b66f9000 F11: 0ffdcc3f0000000000000 #6 [BSP:e0000003a1b212b0] add_rp_inst at a000000100670500 #7 [BSP:e0000003a1b21288] arch_prepare_kretprobe at a00000010066cb90 #8 [BSP:e0000003a1b21250] pre_handler_kretprobe at a00000010066ff80 #9 [BSP:e0000003a1b21220] aggr_pre_handler at a0000001006706d0 #10 [BSP:e0000003a1b211b8] kprobe_exceptions_notify at a00000010066b670 #11 [BSP:e0000003a1b21180] notifier_call_chain at a00000010066df10 #12 [BSP:e0000003a1b21150] atomic_notifier_call_chain at a00000010066dfb0 #13 [BSP:e0000003a1b21128] ia64_bad_break at a00000010066ac80 #14 [BSP:e0000003a1b21128] __ia64_leave_kernel at
a00000010000bfe0 EFRAME: e0000003a1b27c70 B0: a000000100059f80 CR_IIP: a000000100537160 CR_IPSR: 0000101008526030 CR_IFS: 8000000000000008 AR_PFS: 0000000000000008 AR_RSC: 0000000000000003 AR_UNAT: 0000000000000000 AR_RNAT: 0000000000551559 AR_CCV: 0000000000000000 AR_FPSR: 0009804c0270033f LOADRS: 0000000000000000 AR_BSPSTORE: a000000100059f80 B6: a000000100537160 B7: a0000001000102b0 PR: 0000000000569559 R1: a000000100c36560 R2: 0000000000000000 R3: 0000000000000118 R8: 000000004abcdba0 R9: 00000000381f5ee0 R10: e0000003a1b20e98 R11: 0000000000000010 R12: e0000003a1b27e30 R13: e0000003a1b20000 R14: a000000100a4d9e8 R15: 00000000000000b2 R16: a0000001006f14b8 R17: e0000003a1b27ff0 R18: 00000000381f5ee0 R19: a000000100879300 R20: a000000100537160 R21: a000000100d21088 R22: 0000000000000001 R23: a000000100a4f720 R24: a000000100a4f720 R25: 0000000000000000 R26: a000000100a4f73c R27: e0000001075d4840 R28: e0000001075d4838 R29: e0000001075d4808 R30: e0000001075d4824 R31: e0000001075d4830 F6: 000000000000000000000 F7: 000000000000000000000 F8: 000000000000000000000 F9: 000000000000000000000 F10: 0fffef13d31b9b66f9000 F11: 0ffdcc3f0000000000000 #15 [BSP:e0000003a1b210e8] sys_shutdown at a000000100537160 #16 [BSP:e0000003a1b210e8] kretprobe_trampoline at a000000100059f80 bt: unwind: ip, sp, bsp remain unchanged; stopping here (ip=0xa000000100059f80) crash> The kprobe handling ended up calling add_rp_inst() with a kretprobe_instance pointer argument of e0000003a496af80: crash> bt -f ... #6 [BSP:e0000003a1b212b0] add_rp_inst at a000000100670500 (e0000003a496af80) ... crash> void __kprobes add_rp_inst(struct kretprobe_instance *ri) The data structure contains "familiar" garbage (ASCII data generated from the supplied "io-load-test.sh" program) in its uflist and hlist members: crash> kretprobe_instance e0000003a496af80 struct kretprobe_instance { uflist = { next = 0x373038656c69662f, pprev = 0x662f307269642036 }, hlist = { next = 0x2037373038656c69, pprev = 0x6c69662f30726964 }, rp = 0xa0000002050c30a0, ret_addr = 0xa00000010000bd70, task = 0xe0000003a1b20000 } crash> kretprobe_instance -r e0000003a496af80 e0000003a496af80: 373038656c69662f 662f307269642036 /file8076 dir0/f e0000003a496af90: 2037373038656c69 6c69662f30726964 ile8077 dir0/fil e0000003a496afa0: a0000002050c30a0 a00000010000bd70 .0......p....... e0000003a496afb0: e0000003a1b20000 ........ 
crash> The size-64 slab cache has been corrupted, and the slab page containing the kretprobe_instance structure above e0000003a496af80 is one of two contiguous corrupted slab pages, at e0000003a4968000 and e0000003a496c000: crash> kmem -s e0000003a496af80 kmem: size-64: full list: slab: e0000003a496c000 bad next pointer: 6c69662f30726964 kmem: size-64: full list: slab: e0000003a496c000 bad prev pointer: 6964203837333865 kmem: size-64: full list: slab: e0000003a496c000 bad inuse counter: 1818846767 kmem: size-64: full list: slab: e0000003a496c000 bad inuse counter: 1818846767 kmem: size-64: full list: slab: e0000003a496c000 bad s_mem pointer: 3072696420393733 kmem: size-64: partial list: slab: e0000003a4968000 bad next pointer: 3072696420373032 kmem: size-64: partial list: slab: e0000003a4968000 bad inuse counter: 812804452 kmem: size-64: partial list: slab: e0000003a4968000 bad s_mem pointer: 2039303237656c69 kmem: size-64: full list: slab: e0000003a4968000 bad next pointer: 3072696420373032 kmem: size-64: full list: slab: e0000003a4968000 bad inuse counter: 812804452 kmem: size-64: full list: slab: e0000003a4968000 bad s_mem pointer: 2039303237656c69 kmem: size-64: free list: slab: e0000003a4968000 bad next pointer: 3072696420373032 kmem: size-64: free list: slab: e0000003a4968000 bad inuse counter: 812804452 kmem: size-64: free list: slab: e0000003a4968000 bad s_mem pointer: 2039303237656c69 kmem: size-64: address not found in cache: e0000003a496af80 crash> So in this case, it doesn't appear that the kprobe code is using a stale kretprobe_instance pointer, but rather the running of "io-load-test.sh" script is leading to the corruption of memory. It is possible to follow the "io-load-test.sh" ASCII pattern across page boundaries by simply looking at the "file###" values. In this case the pattern starts somewhere around here: ... e0000003a49607b0: 0000000000000032 0000000000000001 2............... e0000003a49607c0: 004c465445535f46 5c2b375c305c305c F_SETFL.\0\0\7+\ e0000003a49607d0: 765c335c47373731 325c315c345c305c 177G\3\v\0\4\1\2 e0000003a49607e0: 315c30345c72365c 2e225c5c305c335c \6r\40\1\3\0\\". e0000003a49607f0: 313035656c002e2e 662f307269642032 ...le5012 dir0/f e0000003a4960800: 2033313035656c69 6c69662f30726964 ile5013 dir0/fil e0000003a4960810: 6964203431303565 35656c69662f3072 e5014 dir0/file5 e0000003a4960820: 3072696420353130 313035656c69662f 015 dir0/file501 e0000003a4960830: 662f307269642036 000000000000001c 6 dir0/f........ e0000003a4960840: 7c474e41484f4e57 45434152544e5557 WNOHANG|WUNTRACE e0000003a4960850: 49544e4f43577c44 3072007c4445554e D|WCONTINUED|.r0 e0000003a4960860: 323035656c69662f 662f307269642030 /file5020 dir0/f e0000003a4960870: 2031323035656c69 6c69662f30726964 ile5021 dir0/fil e0000003a4960880: 6964203232303565 35656c69662f3072 e5022 dir0/file5 e0000003a4960890: 3072696420333230 323035656c69662f 023 dir0/file502 e0000003a49608a0: 662f307269642034 2035323035656c69 4 dir0/file5025 e0000003a49608b0: 6c69662f30726964 6964203632303565 dir0/file5026 di e0000003a49608c0: 7c474e41484f4e57 45434152544e5557 WNOHANG|WUNTRACE e0000003a49608d0: 49544e4f43577c44 662f30004445554e D|WCONTINUED.0/f e0000003a49608e0: 2039323035656c69 6c69662f30726964 ile5029 dir0/fil e0000003a49608f0: 6964203033303565 35656c69662f3072 e5030 dir0/file5 e0000003a4960900: 3072696420313330 333035656c69662f 031 dir0/file503 e0000003a4960910: 662f307269642032 2033333035656c69 2 dir0/file5033 ... In the page above there are a few unrelated ASCII parts like the "WNHANG.." stuff. 
But from e0000003a4960e70 in the page above, ASCII data containing only the "dir#/file###" pattern continues uninterrupted, and which can be followed across page boundaries like this: ... e0000003a4963fd0: 6c69662f30726964 6964203433303665 dir0/file6034 di e0000003a4963fe0: 36656c69662f3072 3072696420353330 r0/file6035 dir0 e0000003a4963ff0: 333036656c69662f 662f307269642036 /file6036 dir0/f e0000003a4964000: 2037333036656c69 6c69662f30726964 ile6037 dir0/fil e0000003a4964010: 6964203833303665 36656c69662f3072 e6038 dir0/file6 e0000003a4964020: 3072696420393330 343036656c69662f 039 dir0/file604 ... until it seemingly ends here at e0000003a4971970: ... e0000003a4971860: 6c69662f30726964 6964203439393965 dir0/file9994 di e0000003a4971870: 39656c69662f3072 3072696420353939 r0/file9995 dir0 e0000003a4971880: 393939656c69662f 662f307269642036 /file9996 dir0/f e0000003a4971890: 2037393939656c69 6c69662f30726964 ile9997 dir0/fil e0000003a49718a0: 6964203839393965 39656c69662f3072 e9998 dir0/file9 e0000003a49718b0: 3830382d00393939 30382d7269642035 999.-8085 dir-80 e0000003a49718c0: 382d726964203638 2d72696420373830 86 dir-8087 dir- e0000003a49718d0: 7269642038383038 696420393830382d 8088 dir-8089 di e0000003a49718e0: 6964203930382d72 6420303930382d72 r-809 dir-8090 d e0000003a49718f0: 20313930382d7269 323930382d726964 ir-8091 dir-8092 e0000003a4971900: 3930382d72696420 30382d7269642033 dir-8093 dir-80 e0000003a4971910: 382d726964203439 2d72696420353930 94 dir-8095 dir- e0000003a4971920: 7269642036393038 696420373930382d 8096 dir-8097 di e0000003a4971930: 6420383930382d72 20393930382d7269 r-8098 dir-8099 e0000003a4971940: 642031382d726964 64203031382d7269 dir-81 dir-810 d e0000003a4971950: 20303031382d7269 313031382d726964 ir-8100 dir-8101 e0000003a4971960: 3031382d72696420 31382d7269642032 dir-8102 dir-81 e0000003a4971970: 382d726964203330 2d72696420343031 03 dir-8104 dir- e0000003a4971980: 00000000000081a4 4ab9336c4ab93231 ........12.Jl3.J e0000003a4971990: 4ab9336c4ab9336c 0000000000000000 l3.Jl3.J........ e0000003a49719a0: 0000000000000000 0000000000000000 ................ e0000003a49719b0: 0000000000000000 0000000000000000 ................ ... So the pages containing this span of continuous 64K+ ASCII stream are these: (1) e0000003a4960000 -- allocated "size-16384" slab data page (2) e0000003a4964000 -- not allocated in slab subsystem (3) e0000003a4968000 -- "size-64" slab page (4) e0000003a496c000 -- "size-64" slab page (5) e0000003a4970000 -- not allocated in slab subsystem Clearly the pages (3) and (4) being used by the "size-64" slab cache have been corrupted. Page (1) is a "size-16384" slab data page that is currently allocated -- it appears to be corrupted because its ASCII data stream flows into the next page above it: e0000003a4963fe0: 36656c69662f3072 3072696420353330 r0/file6035 dir0 e0000003a4963ff0: 333036656c69662f 662f307269642036 /file6036 dir0/f e0000003a4964000: 2037333036656c69 6c69662f30726964 ile6037 dir0/fil e0000003a4964010: 6964203833303665 36656c69662f3072 e6038 dir0/file6 e0000003a4964020: 3072696420393330 343036656c69662f 039 dir0/file604 so it goes beyond its 16K limit, and in fact the uninterrupted ASCII stream continues unabated into page (5). In any case, I'm going to go back and look at the first two vmcores again. Perhaps the systemtap operations -- which generate many thousands "size-64" slab cache requests -- are innocent bystanders that get trashed by something generated as a result of the "io-load-test.sh" operation?
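One way to read the evidence above: an ASCII stream that begins inside a currently-allocated "size-16384" object (page (1) at e0000003a4960000) and then runs uninterrupted across whatever pages happen to follow it is the classic signature of a buffer being written past the end of its allocation. Purely as an illustration of that failure mode -- this is not code from systemtap, the kernel, or io-load-test.sh, and the names are made up -- the bug class looks like this:

/*
 * Illustration only.  A 16K kmalloc'd buffer (a "size-16384" slab object)
 * filled from some ASCII source without clamping the length: everything
 * past the 16K mark lands in whatever pages follow the allocation, e.g.
 * size-64 slab pages holding kretprobe_instance structures.
 */
#include <linux/slab.h>
#include <linux/string.h>

#define LISTING_BUF_SIZE 16384	/* hypothetical buffer size */

static char *collect_listing(const char *src, size_t src_len)
{
	char *buf = kmalloc(LISTING_BUF_SIZE, GFP_KERNEL);

	if (!buf)
		return NULL;
	/* BUG: src_len is not clamped to LISTING_BUF_SIZE */
	memcpy(buf, src, src_len);
	return buf;
}

If that is what is happening, identifying the owner of the size-16384 object at the start of the stream would be the next step.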
Revisiting one of the two original vmcores, the symptoms of corruption are remarkably similar to the crash that I was able to reproduce. In this case, an "stapio" process was trying to unregister a set of kretprobes:

crash> bt
PID: 4287 TASK: e00000409b170000 CPU: 1 COMMAND: "stapio" #0 [BSP:e00000409b1714a0] machine_kexec at a00000010005a380 #1 [BSP:e00000409b171480] crash_kexec at a0000001000cef90 #2 [BSP:e00000409b171418] panic at a00000010007bbd0 #3 [BSP:e00000409b1713c8] die at a000000100037ef0 #4 [BSP:e00000409b171378] ia64_do_page_fault at a00000010066dd90 #5 [BSP:e00000409b171378] __ia64_leave_kernel at a00000010000bfe0 EFRAME: e00000409b177bb0 B0: a00000010066f230 CR_IIP: a00000010066f1f0 CR_IPSR: 00001210085a6010 CR_IFS: 8000000000000389 AR_PFS: 0000000000000389 AR_RSC: 0000000000000003 AR_UNAT: 0000000000000000 AR_RNAT: 0000000000551559 AR_CCV: 0000000000000000 AR_FPSR: 0009804c8a70033f LOADRS: 0000000000000000 AR_BSPSTORE: a0000001001691c0 B6: a0000001006690e0 B7: a00000010000b7e0 PR: 0000000000556659 R1: a000000100c36560 R2: 0000000000000389 R3: a0000001001caa60 R8: e0000040833ae040 R9: 0000000000000054 R10: a000000209f336b8 R11: e000004089508028 R12: e00000409b177d70 R13: e00000409b170000 R14: 392d726964203439 R15: e0000040833ae048 R16: 0000000000000000 R17: 0000000000200200 R18: 0000000000200200 R19: 392d726964203441 R20: 2d72696420303439 R21: 392d726964203439 R22: 0000000000000063 R23: 0000000000000062 R24: 0000000000000080 R25: 0000000000024000 R26: 0000000000000000 R27: 0000000000004000 R28: 0000000000004000 R29: 00000000070f04ce R30: 0000000000000000 R31: a000000100a4f708 F6: 1003e00000000000000a5 F7: 0ffdd8000000000000000 F8: 10006a4ffd6c000000000 F9: 100058000000000000000 F10: 10014b23947fff4dc6b80 F11: 1003e00000000002c8e52 #6 [BSP:e00000409b171330] unregister_kretprobes at a00000010066f1f0 #7 [BSP:e00000409b1712f8] _stp_cleanup_and_exit at a000000209e185d0 #8 [BSP:e00000409b171270] _stp_ctl_write_cmd at a000000209e19170 #9 [BSP:e00000409b171220] vfs_write at a000000100177ea0 #10 [BSP:e00000409b1711a0] kretprobe_trampoline at a000000100059f80 #11 [BSP:e00000409b171128] unregister_kretprobes at a00000010066f230
crash>

It crashed in the loop where cleanup_rp_inst() is called:

void __kprobes unregister_kretprobes(struct kretprobe **rps, int num)
{
	int i;

	if (num <= 0)
		return;
	mutex_lock(&kprobe_mutex);
	for (i = 0; i < num; i++)
		if (__unregister_kprobe_top(&rps[i]->kp) < 0)
			rps[i]->kp.addr = NULL;
	mutex_unlock(&kprobe_mutex);

	synchronize_sched();
	for (i = 0; i < num; i++) {
		if (rps[i]->kp.addr) {
			__unregister_kprobe_bottom(&rps[i]->kp);
			cleanup_rp_inst(rps[i]);
		}
	}
}

The "rps" and "num" arguments were:

 #6 [BSP:e00000409b171330] unregister_kretprobes at a00000010066f1f0 (a000000209f14710, 106)

While cycling through the array of 262 (0x106) kretprobe pointers, it was working on the one at index 84 (0x54) located at address a000000209f149b0:

crash> rd a000000209f14710 262
a000000209f14710:  a000000209f2c280 a000000209f2c3e0   ................
a000000209f14720:  a000000209f2c540 a000000209f2c6a0   @...............
a000000209f14730:  a000000209f2c800 a000000209f2c960   ........`.......
... [ snip ] ...
a000000209f149b0:  a000000209f33600 a000000209f33760   .6......`7......
... [ snip ] ...
The kretprobe pointer there (a000000209f33600) points to this:

crash> kretprobe a000000209f33600
struct kretprobe {
  kp = {
    hlist = {
      next = 0x0,
      pprev = 0x0
    },
    list = {
      next = 0xa000000209f336d0,
      prev = 0x200200
    },
    mod_refcounted = 0,
    nmissed = 0,
    addr = 0xa0000001001caa60,
    symbol_name = 0x0,
    offset = 0,
    pre_handler = @0xa00000010077fde0: 0xa00000010066ff40 <pre_handler_kretprobe>,
    post_handler = 0,
    fault_handler = 0,
    break_handler = 0,
    opcode = {
      bundle = {
        quad0 = {
          template = 8,
          slot0 = 33816576,
          slot1_p0 = 896
        },
        quad1 = {
          slot1_p1 = 4341815,
          slot2 = 712462303360
        }
      }
    },
    ainsn = {
      insn = 0xa000000207b3a080,
      inst_flag = 8,
      target_br_reg = 0
    }
  },
  handler = 0xa000000209ed9668,
  maxactive = 4096,
  nmissed = 0,
  free_instances = {
    first = 0xe0000040833ae040
  },
  used_instances = {
    first = 0x0
  }
}

And the code is in the act of dealing with the "free_instances" hlist_head, which points to an hlist_node at 0xe0000040833ae040,

crash> hlist_node 0xe0000040833ae040
struct hlist_node {
  next = 0x392d726964203439,
  pprev = 0x2d72696420303439
}

which is embedded in a kretprobe_instance structure:

struct kretprobe_instance {
	struct hlist_node uflist; /* either on free list or used list */
	struct hlist_node hlist;
	struct kretprobe *rp;
	kprobe_opcode_t *ret_addr;
	struct task_struct *task;
};

And that's where the crash occurs:

crash> log | grep 2d72696420303439
Unable to handle kernel paging request at virtual address 2d72696420303439
r20 : 2d72696420303439 r21 : 392d726964203439 r22 : 0000000000000063
crash>

The kretprobe_instance structure is 56 bytes long, and is allocated from the "size-64" slab cache. Because of its small size, it will fit on a slab page where the slab structure sits at the bottom of the page, followed by as many 64-byte objects as fit in the rest of the page. And the whole slab page containing the structure is corrupted:

crash> kmem -s 0xe0000040833ae040
kmem: size-64: full list: slab: e0000040833ac000 bad next pointer: 6420333031392d72
kmem: size-64: full list: slab: e0000040833ac000 bad prev pointer: 20343031392d7269
kmem: size-64: full list: slab: e0000040833ac000 bad inuse counter: 1768169526
kmem: size-64: full list: slab: e0000040833ac000 bad inuse counter: 1768169526
kmem: size-64: full list: slab: e0000040833ac000 bad s_mem pointer: 3031392d72696420
kmem: size-64: partial list: slab: e0000040833ac000 bad next pointer: 6420333031392d72
kmem: size-64: partial list: slab: e0000040833ac000 bad inuse counter: 1768169526
kmem: size-64: partial list: slab: e0000040833ac000 bad s_mem pointer: 3031392d72696420
kmem: size-64: full list: slab: e0000040833ac000 bad next pointer: 6420333031392d72
kmem: size-64: full list: slab: e0000040833ac000 bad inuse counter: 1768169526
kmem: size-64: full list: slab: e0000040833ac000 bad s_mem pointer: 3031392d72696420
kmem: size-64: free list: slab: e0000040833ac000 bad next pointer: 6420333031392d72
kmem: size-64: free list: slab: e0000040833ac000 bad inuse counter: 1768169526
kmem: size-64: free list: slab: e0000040833ac000 bad s_mem pointer: 3031392d72696420
kmem: size-64: address not found in cache: e0000040833ae040
crash>

The slab page contains ASCII data generated from the "io-load-test.sh" script, from the beginning of the page (corrupting the slab structure) for about 8832 bytes of the 16k page:

crash> rd e0000040833ac000 2048
e0000040833ac000:  6420333031392d72 20343031392d7269   r-9103 dir-9104
e0000040833ac010:  353031392d726964 3031392d72696420   dir-9105 dir-910
e0000040833ac020:  31392d7269642036 392d726964203730   6 dir-9107 dir-9
e0000040833ac030: 2d72696420383031 7269642039303139 108 dir-9109 dir e0000040833ac040: 726964203131392d 696420303131392d -911 dir-9110 di e0000040833ac050: 6420313131392d72 20323131392d7269 r-9111 dir-9112 e0000040833ac060: 333131392d726964 3131392d72696420 dir-9113 dir-911 e0000040833ac070: 31392d7269642034 392d726964203531 4 dir-9115 dir-9 e0000040833ac080: 2d72696420363131 7269642037313139 116 dir-9117 dir e0000040833ac090: 696420383131392d 6420393131392d72 -9118 dir-9119 d e0000040833ac0a0: 64203231392d7269 20303231392d7269 ir-912 dir-9120 ... [ snip repeating pattern ] ... e0000040833ae1d0: 2d72696420303839 7269642031383939 980 dir-9981 dir e0000040833ae1e0: 696420323839392d 6420333839392d72 -9982 dir-9983 d e0000040833ae1f0: 20343839392d7269 353839392d726964 ir-9984 dir-9985 e0000040833ae200: 3839392d72696420 39392d7269642036 dir-9986 dir-99 e0000040833ae210: 392d726964203738 2d72696420383839 87 dir-9988 dir- e0000040833ae220: 7269642039383939 726964203939392d 9989 dir-999 dir e0000040833ae230: 696420303939392d 6420313939392d72 -9990 dir-9991 d e0000040833ae240: 20323939392d7269 333939392d726964 ir-9992 dir-9993 e0000040833ae250: 3939392d72696420 39392d7269642034 dir-9994 dir-99 e0000040833ae260: 392d726964203539 2d72696420363939 95 dir-9996 dir- e0000040833ae270: 7269642037393939 696420383939392d 9997 dir-9998 di e0000040833ae280: e000393939392d72 e0000040833ae240 r-9999..@.:.@... e0000040833ae290: 000000000000dac4 0000000000000000 ................ e0000040833ae2a0: 2000000006f3fa00 2000000003b5cbd0 ....... ....... e0000040833ae2b0: 0000000000001408 0000000000000040 ........@....... e0000040833ae2c0: e0000040833ae300 e0000040833ae280 ..:.@.....:.@... ... The actual beginning of the ASCII pattern starts in the middle of a data page much lower in memory, which is from the "size-16384" slab cache, and extends to the location shown above: crash> kmem -s e000004083398000 CACHE NAME OBJSIZE ALLOCATED TOTAL SLABS SSIZE e0000001073c8880 size-16384 16384 26 33 33 16k SLAB MEMORY TOTAL ALLOCATED FREE e0000040c9ef3380 e000004083398000 1 1 0 FREE / [ALLOCATED] [e000004083398000] crash> crash> rd e000004083398000 2048 ... [ snip ] ... e0000040833988f0: 636c2d2064656465 2d206363676c2d20 eded -lc -lgcc - e000004083398900: 6465656e2d73612d 6363676c2d206465 -as-needed -lgcc e000004083398910: 2d6f6e2d2d20735f 656465656e2d7361 _s --no-as-neede e000004083398920: 6c2f7273752f2064 692f6363672f6269 d /usr/lib/gcc/i e000004083398930: 686465722d343661 78756e696c2d7461 a64-redhat-linux e000004083398940: 632f322e312e342f 206f2e646e657472 /4.1.2/crtend.o e000004083398950: 62696c2f7273752f 3661692f6363672f /usr/lib/gcc/ia6 e000004083398960: 7461686465722d34 342f78756e696c2d 4-redhat-linux/4 e000004083398970: 2f2e2e2f322e312e 72632f2e2e2f2e2e .1.2/../../../cr e000004083398980: 662f30006f2e6e74 2031343035656c69 tn.o.0/file5041 e000004083398990: 6c69662f30726964 6964203234303565 dir0/file5042 di e0000040833989a0: 35656c69662f3072 3072696420333430 r0/file5043 dir0 e0000040833989b0: 343035656c69662f 662f307269642034 /file5044 dir0/f e0000040833989c0: 2035343035656c69 6c69662f30726964 ile5045 dir0/fil e0000040833989d0: 6964203634303565 35656c69662f3072 e5046 dir0/file5 e0000040833989e0: 3072696420373430 343035656c69662f 047 dir0/file504 e0000040833989f0: 662f307269642038 2039343035656c69 8 dir0/file5049 e000004083398a00: 6c69662f30726964 6964203035303565 dir0/file5050 di ... [ snip ] ... 
e000004083398d40: 2039303135656c69 6c69662f30726964 ile5109 dir0/fil e000004083398d50: 6964203031313565 35656c69662f3072 e5110 dir0/file5 e000004083398d60: 3072696420313131 2000000801842878 111 dir0x(..... e000004083398d70: 6e776f6e6b6e753c 49544e4f4357003e <unknown>.WCONTI e000004083398d80: 6c69007c4445554e 6964203431313565 NUED|.ile5114 di e000004083398d90: 35656c69662f3072 3072696420353131 r0/file5115 dir0 e000004083398da0: 313135656c69662f 662f307269642036 /file5116 dir0/f e000004083398db0: 2037313135656c69 6c69662f30726964 ile5117 dir0/fil e000004083398dc0: 6964203831313565 35656c69662f3072 e5118 dir0/file5 e000004083398dd0: 3072696420393131 323135656c69662f 119 dir0/file512 e000004083398de0: 662f307269642030 0000000000000000 0 dir0/f........ e000004083398df0: 2020202020202020 2064687373202020 sshd e000004083398e00: 3637322020203a72 2042694b20323338 r: 276832 KiB e000004083398e10: 3937322020203a77 0a42694b20373232 w: 279227 KiB. e000004083398e20: 2020202020202020 206f697061747320 stapio e000004083398e30: 3633322020203a72 2042694b20333136 r: 236613 KiB e000004083398e40: 3633322020203a77 0a42694b20323036 w: 236602 KiB. e000004083398e50: 2020202020202020 2068736172632020 crash e000004083398e60: 3831362020203a72 2042694b20333137 r: 618713 KiB e000004083398e70: 35656c69662f3072 3072696420313331 r0/file5131 dir0 e000004083398e80: 333135656c69662f 662f307269642032 /file5132 dir0/f ... From there the ASCII data continues on into the next page in an uninterrupted (corrupting) fashion: e00000408339bfd0: 6c69662f30726964 6964203433303665 dir0/file6034 di e00000408339bfe0: 36656c69662f3072 3072696420353330 r0/file6035 dir0 e00000408339bff0: 333036656c69662f 662f307269642036 /file6036 dir0/f e00000408339c000: 2037333036656c69 6c69662f30726964 ile6037 dir0/fil e00000408339c010: 6964203833303665 36656c69662f3072 e6038 dir0/file6 e00000408339c020: 3072696420393330 343036656c69662f 039 dir0/file604 The next page at e00000408339c000 above is a "size-512" slab page, and so the slab structure at its base is corrupted: crash> kmem -s e00000408339c000 kmem: size-512: full list: slab: e00000408339c080 bad next pointer: 6964203634303665 kmem: size-512: full list: slab: e00000408339c080 bad prev pointer: 36656c69662f3072 kmem: size-512: full list: slab: e00000408339c080 bad inuse counter: 1768169528 kmem: size-512: full list: slab: e00000408339c080 bad inuse counter: 1768169528 kmem: size-512: full list: slab: e00000408339c080 bad s_mem pointer: 343036656c69662f kmem: size-512: partial list: slab: e00000408339c080 bad next pointer: 6964203634303665 kmem: size-512: partial list: slab: e00000408339c080 bad inuse counter: 1768169528 kmem: size-512: partial list: slab: e00000408339c080 bad s_mem pointer: 343036656c69662f kmem: size-512: full list: slab: e00000408339c080 bad next pointer: 6964203634303665 kmem: size-512: full list: slab: e00000408339c080 bad inuse counter: 1768169528 kmem: size-512: full list: slab: e00000408339c080 bad s_mem pointer: 343036656c69662f kmem: size-512: free list: slab: e00000408339c080 bad next pointer: 6964203634303665 kmem: size-512: free list: slab: e00000408339c080 bad inuse counter: 1768169528 kmem: size-512: free list: slab: e00000408339c080 bad s_mem pointer: 343036656c69662f kmem: size-512: address not found in cache: e00000408339c000 crash> The corruption continues through the next two pages, which are not used by the slab subsystem, and then finally bumps into the "size-64" slab cache page that lead to eventual kernel crash: crash> kmem -s e0000040833a0000 kmem: address is 
not allocated in slab subsystem: e0000040833a0000
crash> kmem -s e0000040833a4000
kmem: address is not allocated in slab subsystem: e0000040833a4000
crash> kmem -s e0000040833ac000
kmem: size-64: full list: slab: e0000040833ac000 bad next pointer: 6420333031392d72
kmem: size-64: full list: slab: e0000040833ac000 bad prev pointer: 20343031392d7269
kmem: size-64: full list: slab: e0000040833ac000 bad inuse counter: 1768169526
kmem: size-64: full list: slab: e0000040833ac000 bad inuse counter: 1768169526
kmem: size-64: full list: slab: e0000040833ac000 bad s_mem pointer: 3031392d72696420
kmem: size-64: partial list: slab: e0000040833ac000 bad next pointer: 6420333031392d72
kmem: size-64: partial list: slab: e0000040833ac000 bad inuse counter: 1768169526
kmem: size-64: partial list: slab: e0000040833ac000 bad s_mem pointer: 3031392d72696420
kmem: size-64: full list: slab: e0000040833ac000 bad next pointer: 6420333031392d72
kmem: size-64: full list: slab: e0000040833ac000 bad inuse counter: 1768169526
kmem: size-64: full list: slab: e0000040833ac000 bad s_mem pointer: 3031392d72696420
kmem: size-64: free list: slab: e0000040833ac000 bad next pointer: 6420333031392d72
kmem: size-64: free list: slab: e0000040833ac000 bad inuse counter: 1768169526
kmem: size-64: free list: slab: e0000040833ac000 bad s_mem pointer: 3031392d72696420
kmem: size-64: address not found in cache: e0000040833ac000
crash>

So like the crash that I was able to reproduce, the kprobe code seems to be referencing legitimate slab cache addresses which have been corrupted due to the running of the "io-load-test.sh" script. So I guess the question is:

  (1) Are the systemtap/kprobe operations an innocent victim of something that the "io-load-test.sh" causes to happen?

Or:

  (2) Do the systemtap/kprobe operations interfere with the kernel code that runs as a result of the "io-load-test.sh"?

And I have no idea which is the case. Perhaps testing should be done that takes one of them out of the picture, i.e., testing that:

  (1) uses the systemtap scripts -- but NOT "io-load-test.sh"
  (2) uses "io-load-test.sh" -- but NOT the systemtap scripts

And in the two cases above, run something else to get a heavy system load going. I'm curious whether Fujitsu has ever seen the problem occur without both things (systemtap and io-load-test.sh) going on simultaneously?

Dave
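To make the failing operation a bit more concrete: tearing down a kretprobe means walking its free_instances/used_instances hlists and freeing each kretprobe_instance. Below is a simplified sketch of that kind of walk (my own sketch based on the kretprobe_instance definition quoted earlier, not the literal RHEL-5 cleanup_rp_inst(), which also takes the kretprobe lock and handles the used list):

#include <linux/kprobes.h>
#include <linux/list.h>
#include <linux/slab.h>

static void free_rp_instances(struct kretprobe *rp)
{
	struct kretprobe_instance *ri;
	struct hlist_node *pos, *next;

	hlist_for_each_entry_safe(ri, pos, next, &rp->free_instances, uflist) {
		/*
		 * With free_instances.first = 0xe0000040833ae040 pointing into
		 * the corrupted size-64 slab page, ri->uflist.next and .pprev
		 * are ASCII garbage; dereferencing the bogus pprev
		 * (2d72696420303439) faults -- that is the address in the
		 * panic message.
		 */
		hlist_del(&ri->uflist);
		kfree(ri);
	}
}

In other words, the kprobes code appears to be doing something perfectly reasonable with a list whose contents have already been destroyed.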
> As a result, I guess that this issue might only occur with the combination
> of systemtap scripts and "io-load-test.sh".
>
> I will test a little more in other kinds of high load situations.

OK. Also, it would be helpful if we could at least try to narrow down *which* systemtap script (or which combination of scripts) seems to be the causal factor. Just running all of the scripts and then running the io-load-test.sh on top of them makes it almost impossible to debug. All we really have to work with is that the directories and files created by the io-load-test.sh script seem to be listed in an ASCII data buffer that gets scribbled across kernel memory, corrupting whatever was there.
Just an update... I have only been able to reproduce this 3 more times, and so far it has required:

 1. all of the systemtap tests provided to be running
 2. the io-load-test.sh script to be running
 3. additional test load that bumps the load average very high and consumes all available memory
    (using the usex exerciser running a combination of disk I/O, virtual memory tests, and some hand-created network activity)

However, the vmcore files all show the same symptom -- a slab page in the size-64 slab cache (and perhaps another one as well) gets corrupted with recognizable ASCII data from the io-load-test.sh script, and eventually some entity ends up accessing memory from a corrupted page. When running the test, the size-64 slab cache typically has several hundred thousand objects allocated, so that could be why it tends to be the slab cache most likely to be corrupted. (?) The corruption is always several contiguous pages in length.

In any case, given that the kernel crashes occur well after the corruption has taken place, there's no smoking gun to work with, just the results of the corruption. I did try to narrow it down to one or more of the systemtap scripts, but was unsuccessful. I'll keep trying, but at this point, I've pretty much gotten nowhere...
Another update: I was able to reproduce the crash with just 3 of the systemtap scripts running:

  disktop.stp
  para-callgraph2.stp sys_read '*@fs/*.c'
  test1.stp

And again, I ran the io-load-test.sh script, plus a heavy load of tests from a "usex" framework that created both a heavy load and a low-memory scenario.

The crash was familiar -- an "stapio" task associated with the para-callgraph2.stp script was doing an unregister_kretprobes() of a kretprobe_instance, and bumped into the same type of ASCII data corruption of the whole "size-64" slab page containing the structure. The unregister_kretprobes() call was being done because that particular script always fails within a few seconds, like so:

# stap -v -k -o /tmp/result/paracall.result para-callgraph2.stp sys_read '*@fs/*.c'
Pass 1: parsed user script and 51 library script(s) in 343usr/6sys/429real ms.
Pass 2: analyzed script: 3277 probe(s), 12 function(s), 2 embed(s), 3 global(s) in 956usr/323sys/8312real ms.
Pass 3: translated to C into "/tmp/stapbuTHw6/stap_25967.c" in 446usr/150sys/604real ms.
Pass 4: compiled C into "stap_25967.ko" in 8264usr/407sys/8585real ms.
Pass 5: starting run.
ERROR: probe overhead exceeded threshold
WARNING: Number of errors: 1, skipped probes: 0
Pass 5: run completed in 17usr/4795sys/4951real ms.
Keeping temporary directory "/tmp/stapbuTHw6"
# dmesg
... [ snip ] ...
stap_25967: systemtap: 0.9.7/0.141, base: a000000202340000, memory: 2903064+73024+58944+14000 data+text+ctx+net, probes: 3277
#

Because the script always fails, my tests had been set up to repeatedly re-run it after each failure. In any case, because of para-callgraph2.stp's "failure mode", and the fact that it is so kernel-intrusive (placing 3277 probes in all of the functions in the kernel's fs directory), it seems highly likely that it is a prime suspect.
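For context on what the module is doing at that point: each registered kretprobe pre-allocates maxactive kretprobe_instance structures (the 56-byte, size-64 slab objects seen corrupted in the vmcores) onto its free_instances list, and unregistering frees them again. A stripped-down sketch of a single such probe is below -- my own illustration, not the stap-generated code; the symbol and maxactive value are placeholders. para-callgraph2.stp effectively does this for 3277 functions at once, and then tears them all down as soon as the "probe overhead exceeded threshold" error fires.

#include <linux/module.h>
#include <linux/kprobes.h>

static int ret_handler(struct kretprobe_instance *ri, struct pt_regs *regs)
{
	/* ri was taken off rp->free_instances when the probed function was entered */
	return 0;
}

static struct kretprobe rp = {
	.kp.symbol_name	= "vfs_read",	/* placeholder symbol */
	.handler	= ret_handler,
	.maxactive	= 16,		/* number of kretprobe_instance objects pre-allocated */
};

static int __init rp_init(void)
{
	return register_kretprobe(&rp);
}

static void __exit rp_exit(void)
{
	unregister_kretprobe(&rp);
}

module_init(rp_init);
module_exit(rp_exit);
MODULE_LICENSE("GPL");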
I have a new crash which seems to pinpoint the para-callgraph.stp kernel module. After a reboot, all I did was to start a shell script that simply ran the para-callgraph.stp script in a loop -- because as I mentioned above, it always fails in a matter of a few seconds. After the script had run only 2 (or 3?) times, I entered Ctrl-C and the system crashed. Neither the io-load-test.sh script, the usex test suite, nor any other systemtap script was running or had been run -- again, it was just after a fresh system reboot. Unfortunately the vmcore that was created was corrupt, and caused the crash utility to generate a segmentation violation while trying to parse through the NT_PRSTATUS segments. The "readelf" command confirms the corruption immediately, and you can see the bogus Elf Note sections at the end:

$ readelf -a vmcore readelf: Warning: corrupt note found at offset 12cc into core notes readelf: Warning: type: 7fe07c78, namesize: 00a94db0, descsize: a0000001 ELF Header: Magic: 7f 45 4c 46 02 01 01 00 00 00 00 00 00 00 00 00 Class: ELF64 Data: 2's complement, little endian OS/ABI: UNIX - System V ABI Version: 0 Type: CORE (Core file) Machine: Intel IA-64 Version: 0x1 Entry point address: 0x0 Start of program headers: 64 (bytes into file) Start of section headers: 0 (bytes into file) Flags: 0x0 Size of this header: 64 (bytes) Size of program headers: 56 (bytes) Number of program headers: 34 Size of section headers: 0 (bytes) Number of section headers: 0 Section header string table index: 0 There are no sections in this file. There are no sections in this file. Program Headers: Type Offset VirtAddr PhysAddr FileSiz MemSiz Flags Align NOTE 0x00000000000007b0 0x0000000000000000 0x0000000000000000 0x0000000000002064 0x0000000000002064 0 LOAD 0x0000000000002814 0xa000000100000000 0x0000000004000000 0x00000000006709a0 0x00000000006709a0 RWE 0 LOAD 0x00000000006731b4 0xe000000000000000 0x0000000000000000 0x0000000000001000 0x0000000000001000 RWE 0 LOAD 0x00000000006741b4 0xe000000000001000 0x0000000000001000 0x000000000008a000 0x000000000008a000 RWE 0 LOAD 0x00000000006fe1b4 0xe00000000008b000 0x000000000008b000 0x0000000000015000 0x0000000000015000 RWE 0 LOAD 0x00000000007131b4 0xe000000000100000 0x0000000000100000 0x0000000003f00000 0x0000000003f00000 RWE 0 LOAD 0x00000000046131b4 0xe000000004000000 0x0000000004000000 0x0000000000e0d000 0x0000000000e0d000 RWE 0 LOAD 0x00000000054201b4 0xe000000004e0d000 0x0000000004e0d000 0x00000000031f3000 0x00000000031f3000 RWE 0 LOAD 0x00000000086131b4 0xe000000028000000 0x0000000028000000 0x0000000050000000 0x0000000050000000 RWE 0 LOAD 0x00000000586131b4 0xe00000007c000000 0x000000007c000000 0x000000000295a000 0x000000000295a000 RWE 0 LOAD 0x000000005af6d1b4 0xe00000007f300000 0x000000007f300000 0x0000000000a8c000 0x0000000000a8c000 RWE 0 LOAD 0x000000005b9f91b4 0xe00000007fd8c000 0x000000007fd8c000 0x000000000000c000 0x000000000000c000 RWE 0 LOAD 0x000000005ba051b4 0xe00000007fda0000 0x000000007fda0000 0x0000000000004000 0x0000000000004000 RWE 0 LOAD 0x000000005ba091b4 0xe000000100000000 0x0000000100000000 0x0000000376e3a000 0x0000000376e3a000 RWE 0 LOAD 0x00000003d28431b4 0xe000000476e3a000 0x0000000476e3a000 0x0000000000558000 0x0000000000558000 RWE 0 LOAD 0x00000003d2d9b1b4 0xe000000477392000 0x0000000477392000 0x000000000005e000 0x000000000005e000 RWE 0 LOAD 0x00000003d2df91b4 0xe0000004773f0000 0x00000004773f0000 0x0000000000002000 0x0000000000002000 RWE 0 LOAD 0x00000003d2dfb1b4 0xe0000004773f2000 0x00000004773f2000 0x0000000000050000 0x0000000000050000 RWE 0 LOAD 0x00000003d2e4b1b4 0xe000000477442000 0x0000000477442000 0x000000000000c000 0x000000000000c000 RWE 0 LOAD 0x00000003d2e571b4 0xe00000047744e000 0x000000047744e000 0x00000000043b0000 0x00000000043b0000 RWE 0 LOAD 0x00000003d72071b4 0xe00000047f800000 0x000000047f800000 0x0000000000001000 0x0000000000001000 RWE 0 LOAD 0x00000003d72081b4 0xe00000047f801000 0x000000047f801000 0x00000000000c5000 0x00000000000c5000 RWE 0 LOAD 0x00000003d72cd1b4 0xe00000047f8c6000 0x000000047f8c6000 0x0000000000002000 0x0000000000002000 RWE 0 LOAD 0x00000003d72cf1b4 0xe00000047f8c8000 0x000000047f8c8000 0x000000000004e000 0x000000000004e000 RWE 0 LOAD 0x00000003d731d1b4 0xe00000047f916000 0x000000047f916000 0x0000000000002000 0x0000000000002000 RWE 0 LOAD 0x00000003d731f1b4 0xe00000047f918000 0x000000047f918000 0x0000000000002000 0x0000000000002000 RWE 0 LOAD 0x00000003d73211b4 0xe00000047f91a000 0x000000047f91a000 0x0000000000002000 0x0000000000002000 RWE 0 LOAD 0x00000003d73231b4 0xe00000047f91c000 0x000000047f91c000 0x0000000000008000 0x0000000000008000 RWE 0 LOAD 0x00000003d732b1b4 0xe00000047f924000 0x000000047f924000 0x00000000000d6000 0x00000000000d6000 RWE 0 LOAD 0x00000003d74011b4 0xe00000047f9fa000 0x000000047f9fa000 0x0000000000001000 0x0000000000001000 RWE 0 LOAD 0x00000003d74021b4 0xe00000047f9fb000 0x000000047f9fb000 0x0000000000003000 0x0000000000003000 RWE 0 LOAD 0x00000003d74051b4 0xe00000047f9fe000 0x000000047f9fe000 0x0000000000358000 0x0000000000358000 RWE 0 LOAD 0x00000003d775d1b4 0xe00000047fd56000 0x000000047fd56000 0x000000000006c000 0x000000000006c000 RWE 0 LOAD 0x00000003d77c91b4 0xe00000047fe7e000 0x000000047fe7e000 0x0000000000124000 0x0000000000124000 RWE 0 There is no dynamic section in this file. There are no relocations in this file. There are no unwind sections in this file. No version information found in this file. Notes at offset 0x000007b0 with length 0x00002064: Owner Data size Description CORE 0x00000478 NT_PRSTATUS (prstatus structure) CORE 0x00000478 NT_PRSTATUS (prstatus structure) CORE 0x00000478 NT_PRSTATUS (prstatus structure) CORE 0x00000478 NT_PRSTATUS (prstatus structure) (NONE) 0x00000000 Unknown note type: (0x00000000) (NONE) 0x00000000 Unknown note type: (0x00000000) (NONE) 0x00000000 Unknown note type: (0x00000000) (NONE) 0x00000000 Unknown note type: (0x00000000) (NONE) 0x00000000 Unknown note type: (0x00000000) (NONE) 0x00000000 Unknown note type: (0x00000000) (NONE) 0x00000000 Unknown note type: (0x00000000) (NONE) 0x00000000 Unknown note type: (0x00000000) (NONE) 0x00000000 Unknown note type: (0x00000000) (NONE) 0x00000000 Unknown note type: (0x00000000) (NONE) 0x00000000 Unknown note type: (0x00000000) (NONE) 0x00000000 Unknown note type: (0x00000000) (NONE) 0x00000000 Unknown note type: (0x00000000) (NONE) 0x00000000 Unknown note type: (0x00000000) (NONE) 0x00000000 Unknown note type: (0x00000000) (NONE) 0x00000000 Unknown note type: (0x00000000) (NONE) 0x00000000 Unknown note type: (0xffffffff) $

There should be an NT_PRSTATUS note for each processor, and for whatever reason, the kdump process failed. However, I was able to create a patched version of the crash utility that would ignore the bogus notes section and continue on.
Here is the backtrace, with the -f option adding function arguments: crash> bt -f PID: 5515 TASK: e00000016cd78000 CPU: 7 COMMAND: "staprun" #0 [BSP:e00000016cd794e8] machine_kexec at a00000010005a380 (e000000104a2b280) #1 [BSP:e00000016cd794c8] crash_kexec at a0000001000cefb0 (a000000100a4f688) #2 [BSP:e00000016cd79460] panic at a00000010007bc70 (0) #3 [BSP:e00000016cd79418] die at a000000100037ef0 (ffffffffffffffff, a000000100a46cc8, 80400000000) #4 [BSP:e00000016cd793c0] ia64_do_page_fault at a00000010066dd90 (800000004, 80400000000, e00000016cd7fa30) #5 [BSP:e00000016cd793c0] __ia64_leave_kernel at a00000010000bfe0 EFRAME: e00000016cd7fa30 B0: a0000001000956b0 CR_IIP: a0000001000958a0 CR_IPSR: 0000101008026018 CR_IFS: 800000000000030a AR_PFS: 000000000000030a AR_RSC: 0000000000000003 AR_UNAT: 0000000000000000 AR_RNAT: e000000100040000 AR_CCV: 0000000000000000 AR_FPSR: 0009804c8a70033f LOADRS: 0000000000000000 AR_BSPSTORE: ffffffffffff0028 B6: a000000100011940 B7: a000000100095620 PR: 000000000056a5a5 R1: a000000100c36560 R2: e00000016cd7fc08 R3: e00000016cd7fc00 R8: e000000185460008 R9: e000000111d715c0 R10: e000000111d715b8 R11: e000000105f6fe00 R12: e00000016cd7fbf0 R13: e00000016cd78000 R14: 0000000000200200 R15: e00000016cd7fc00 R16: e00000016cd7fc00 R17: 0000000000000000 R18: e000000111d715a0 R19: a000000100a36bc8 R20: e000000185460fc8 R21: e000000111d715a0 R22: e00000016cd7fc00 R23: e000000185460fd0 R24: e00000016cd7fc08 R25: e000000111d715a8 R26: e000000111d715a0 R27: e000000185460010 R28: 00000001033dd2fc R29: e000000185460fb0 R30: 00000001033dd2fb R31: 00000000000000fb F6: 1003e000000000000062e F7: 1003e0044b82fa09b5a53 F8: 1003e00000000001707d8 F9: 100018000000000000000 F10: 10009ac17fffff53e8000 F11: 1003e0000000000000560 #6 [BSP:e00000016cd79370] run_timer_softirq at a0000001000958a0 (e000000185460000) #7 [BSP:e00000016cd792f0] __do_softirq at a00000010008b3e0 (void) #8 [BSP:e00000016cd79290] do_softirq at a00000010008b650 (void) #9 [BSP:e00000016cd79278] irq_exit at a00000010008b720 (void) #10 [BSP:e00000016cd79248] ia64_handle_irq at a000000100011840 (f, e00000016cd7fc10) #11 [BSP:e00000016cd79248] __ia64_leave_kernel at a00000010000bfe0 EFRAME: e00000016cd7fc10 B0: a0000001001488b0 CR_IIP: a00000010011ae20 CR_IPSR: 00001010085a6010 CR_IFS: 8000000000000288 AR_PFS: 000000000000030b AR_RSC: 0000000000000003 AR_UNAT: 0000000000000000 AR_RNAT: 0000000000565699 AR_CCV: 0000000000000000 AR_FPSR: 0009804c8a70033f LOADRS: 0000000000000000 AR_BSPSTORE: a000000100059f80 B6: a000000100061a40 B7: a0000001001b9dc0 PR: 0000000000566959 R1: a000000100c36560 R2: e00000002bfc4700 R3: e00000002bfc4700 R8: 0000000000000001 R9: e00000010ab24ce0 R10: 000000000000009c R11: e000000120c10d18 R12: e00000016cd7fdd0 R13: e00000016cd78000 R14: 0000000000000001 R15: e000000100bf8400 R16: 000000000000009c R17: 00000000000000f5 R18: 00000000000000b6 R19: e000000120c10d20 R20: 000000000128a532 R21: 0000000000024000 R22: 0000000000000000 R23: 0000000000000000 R24: e00000016cd7fdd0 R25: e00000016cd7fdd0 R26: 0000000000000000 R27: 000000000000002a R28: 0000000000000007 R29: 0000000000004000 R30: 0000000000004000 R31: e000000100f6fa00 F6: 1003e0000000000000000 F7: 1003e0000000000000000 F8: 1003e0000000000000000 F9: 1003e0000000000000001 F10: 1003e0000000029a9e4c0 F11: 1003e431bde82d7b634db #12 [BSP:e00000016cd79208] __free_pages at a00000010011ae20 (e000000100bf83f8, 0) #13 [BSP:e00000016cd791d0] __vunmap at a0000001001488b0 (9d, 1) #14 [BSP:e00000016cd791b0] vfree at a000000100148bd0 (a000000202328000) 
#15 [BSP:e00000016cd79180] module_free at a000000100054d10 (a0000002024be500, a000000202328000) #16 [BSP:e00000016cd79150] free_module at a0000001000c2590 (a0000002024be500) #17 [BSP:e00000016cd790e0] sys_delete_module at a0000001000c6190 (20000008003080b0, 0) #18 [BSP:e00000016cd790e0] __ia64_trace_syscall at a00000010000bd70 EFRAME: e00000016cd7fe40 B0: 20000008000042e0 CR_IIP: a000000000010620 CR_IPSR: 00001213085a6010 CR_IFS: 0000000000000004 AR_PFS: c000000000000004 AR_RSC: 000000000000000f AR_UNAT: 0000000000000000 AR_RNAT: 0000000000000000 AR_CCV: 0000000000000000 AR_FPSR: 0009804c0270033f LOADRS: 0000000000400000 AR_BSPSTORE: 600007ffff8c0100 B6: 2000000800223760 B7: a000000000010640 PR: 0000000000561a41 R1: 2000000800270238 R2: c00000000000048c R3: e00000016cd7fef8 R8: 0000000000000001 R9: 20000008002f3e88 R10: 0000000000000000 R11: c00000000000040c R12: 60000fffff8bae50 R13: 20000008002fc6d0 R14: 200000080001d0d0 R15: 000000000000046e R16: 2000000800200840 R17: 0000000000000710 R18: 0000000000000000 R19: 0000000000043220 R20: 0000000000000002 R21: 0000000000004449 R22: 3a474e494e524157 R23: 200000080000ae48 R24: 0000000000000020 R25: 0000000000000000 R26: c000000000000006 R27: 000000000000000f R28: a000000000010620 R29: 00001213085a6010 R30: 0000000000000008 R31: 0000000000561a41 F6: 000000000000000000000 F7: 000000000000000000000 F8: 000000000000000000000 F9: 000000000000000000000 F10: 000000000000000000000 F11: 000000000000000000000 #19 [BSP:e00000016cd790e0] __kernel_syscall_via_break at a000000000010620 crash>

The "staprun" command was running on behalf of the para-callgraph.stp script, and was shutting down the systemtap kernel module -- either because of the Ctrl-C or because of the failure that always occurs when running that script:

crash> set PID: 5515 COMMAND: "staprun" TASK: e00000016cd78000 [THREAD_INFO: e00000016cd79040] CPU: 7 STATE: TASK_RUNNING (PANIC) crash>

So it had proceeded through the module deletion phase all the way to the point where it was vfree'ing the contiguous chunk of vmalloc'd pages that had made up the systemtap kernel module. Note that the module_free() function receives the vmalloc address of the module structure itself, which is contained within the whole vmalloc'd region pointed to by "module_region":

void module_free (struct module *mod, void *module_region)
{
    if (mod && mod->arch.init_unw_table && module_region == mod->module_init) {
        unw_remove_unwind_table(mod->arch.init_unw_table);
        mod->arch.init_unw_table = NULL;
    }
    vfree(module_region);
}

The backtrace shows that the module data structure was at vmalloc address a0000002024be500, which was itself contained within the whole vmalloc region that starts at a000000202328000:

#15 [BSP:e00000016cd79180] module_free at a000000100054d10 (a0000002024be500, a000000202328000)

We don't know the actual size and therefore the end of the kernel module, but we do know that at a minimum it consisted of the vmalloc address space starting at a000000202328000, and going some amount of space beyond a0000002024be500. Now that contiguous vmalloc region has already been vfree'd:

crash> kmem -v VM_STRUCT ADDRESS RANGE SIZE e000000185c05940 a000000200000000 - a000000200404000 4210688 e000000185c05900 a000000200404000 - a000000200608000 2113536 e000000184e3f780 a000000200608000 - a00000020070c000 1064960 e000000101f2f380 a00000020070c000 - a000000200f10000 8404992 ... [ snip ] ...
e00000018083c400 a000000201e1c000 - a000000201f44000 1212416 e00000010b66dcc0 a000000201f44000 - a000000201fa4000 393216 e0000001069e9440 a000000201fa4000 - a000000201ff4000 327680 e0000001044ca6c0 a000000201ff4000 - a000000202050000 376832 crash>

So we can't even look at the formerly-vmalloc'd address space, because its underlying physical pages have been given back by __free_pages():

crash> vtop a000000202328000 VIRTUAL PHYSICAL a000000202328000 (not mapped) PAGE DIRECTORY: a000000100808000 PGD: a000000100808000 => 10136c000 PMD: e00000010136c808 => 1076ac000 PTE: e0000001076ac650 => 0 crash>

Anyway, while doing the __free_pages() operations, a timer interrupt kicked off, and after handling it, irq_exit() saw that there were softirqs pending, so it called do_softirq(). do_softirq() found a queued (per-cpu) function and called it, which is what caused the crash. As it turns out, the function that was attempted was at address a0000002023ea718, which is solidly within the vmalloc address space formerly utilized by the systemtap script being dismantled: a000000202328000 < a0000002023ea718 < a0000002024be500

In fact you can see that the function at a0000002023ea718 is still queued on the cpu 0 and cpu 6 timer queues:

crash> timer TVEC_BASES[0]: a000000100d21100 JIFFIES 4349350651 EXPIRES TIMER_LIST FUNCTION 4349350652 e0000001057c0b20 a0000002023ea718 4349350684 e000000102f59938 a000000100776440 <rh_timer_func> 4349351143 e000000104647dd0 a000000100777c10 <process_timeout> 4349352000 e0000001082f4500 a000000201806de0 <e1000_watchdog> 4349352414 a0000001009eade8 a000000100778590 <neigh_periodic_timer> 4349352441 e000000028007090 a00000010077bd20 <delayed_work_timer_fn> ... [ snip ] ... TVEC_BASES[6]: e000000185450000 JIFFIES 4349350651 EXPIRES TIMER_LIST FUNCTION 4349350651 e000000111d718a0 a0000002023ea718 4349351144 a000000201eb6968 a000000100778590 <neigh_periodic_timer> 4349351188 e00000010f7afab0 a000000100777c10 <process_timeout> 4349351424 e000000102efb960 a00000010077bd20 <delayed_work_timer_fn> ... [ snip ] ... crash>

And it looks like it was queued on the other cpus as well, which must have tried to execute it at the same time:

crash> log | grep a0000002023ea718 Unable to handle kernel paging request at virtual address a0000002023ea718 Unable to handle kernel paging request at virtual address a0000002023ea718 Unable to handle kernel paging request at virtual address a0000002023ea718 Unable to handle kernel paging request at virtual address a0000002023ea718 Unable to handle kernel paging request at virtual address a0000002023ea718 Unable to handle kernel paging request at virtual address a0000002023ea718 crash>

But only one of the cpus ended up attempting the kdump. In any case, it would seem that the function at a0000002023ea718 should have been removed from the timer queues before the module was allowed to be vfree'd. Also, I don't know how (or if) this is related to the other crashes where memory corruption was encountered.

Dave
Looking back at the other crash from comment #16 (which crashed due to ASCII memory corruption in the size-64 slab cache), I can see that an instance of the kernel module associated with the para-callgraph.stp script is running, and that it has the same function that caused the crash in comment #17 above on each cpu's timer queue. It's a different vmalloc address, but it's at the same offset from the beginning of the module: crash> timer | grep a0000002026f6718 4326367115 e0000001477f0aa0 a0000002026f6718 <__stp_time_timer_callback> 4326367114 e0000001477f0420 a0000002026f6718 <__stp_time_timer_callback> 4326367114 e0000001477f1320 a0000002026f6718 <__stp_time_timer_callback> 4326367114 e0000001477f1920 a0000002026f6718 <__stp_time_timer_callback> 4326367114 e0000001086d1aa0 a0000002026f6718 <__stp_time_timer_callback> 4326367115 e0000001086d32a0 a0000002026f6718 <__stp_time_timer_callback> 4326367114 e0000001086d3120 a0000002026f6718 <__stp_time_timer_callback> 4326367115 e0000001086d23a0 a0000002026f6718 <__stp_time_timer_callback> crash> In this case, it can be translated by crash as "__stp_time_timer_callback" because it still exists as a legitimate kernel symbol.
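Since the same callback sits on every CPU's timer wheel, whatever arms that timer per CPU also has to cancel it per CPU before the module text can go away. For reference, here is a minimal sketch of that pattern; the names are illustrative, it is not the systemtap-generated code, and it assumes the timer is armed per CPU with add_timer_on():

#include <linux/timer.h>
#include <linux/percpu.h>
#include <linux/cpumask.h>
#include <linux/jiffies.h>

static DEFINE_PER_CPU(struct timer_list, example_timer);   /* illustrative name */

static void example_callback(unsigned long data)
{
    /* runs in softirq context on the CPU whose wheel fired; re-arm there */
    mod_timer(&__get_cpu_var(example_timer), jiffies + 1);
}

static void example_start(void)
{
    int cpu;

    for_each_online_cpu(cpu) {
        struct timer_list *t = &per_cpu(example_timer, cpu);

        setup_timer(t, example_callback, 0);
        t->expires = jiffies + 1;
        add_timer_on(t, cpu);           /* queue on that CPU's timer wheel */
    }
}

static void example_stop(void)
{
    int cpu;

    /* every per-cpu instance must be gone before the module text is freed */
    for_each_online_cpu(cpu)
        del_timer_sync(&per_cpu(example_timer, cpu));
}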
A del_timer_sync() call should have been made by the systemtap kernel module during the shutdown operation. But trying to understand the kernel module code generated by systemtap is mind-boggling.
Given the situation above, shouldn't this BZ be re-assigned to the "systemtap" component? It's certainly not the job of the kernel proper (or the kprobe subsystem within it) to clean up timer tasks queued by systemtap.
Dave, Thanks for your analysis of the problem. The analysis points to bugs in the system-tap component. Re-assigning to the system-tap component. Sachin Prabhu
Dave, thank you for your ongoing analysis of the problem. While reading systemtap-generated C code is not easy, "stap -p4 -k ..." is a start. The temp-directory + /usr/share/systemtap/runtime directories should give the complete picture. Systemtap does (only) use del_timer_sync() to cancel timers, and mod_timer() within the callbacks to reschedule them. Perhaps the kernel is not robust with respect to a combination of these operations from modules that may be removed. Has anyone tried reproducing this problem from an x86 or x86-64 machine?
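For what it's worth, the usual caveat with that combination is that del_timer_sync() expects the caller to prevent the timer from being re-armed; a callback that unconditionally mod_timer()s itself is normally paired with a stop flag that is set before the cancel. A minimal sketch of that general pattern, purely for illustration -- it is not the systemtap-generated code and not a claim about where the bug is:

#include <linux/timer.h>
#include <linux/jiffies.h>

static struct timer_list poll_timer;       /* illustrative names */
static int poll_stopping;

static void poll_callback(unsigned long data)
{
    /* ... periodic work ... */

    if (!poll_stopping)                    /* don't re-arm once shutdown has begun */
        mod_timer(&poll_timer, jiffies + 1);
}

static void poll_start(void)
{
    setup_timer(&poll_timer, poll_callback, 0);
    mod_timer(&poll_timer, jiffies + 1);
}

static void poll_stop(void)
{
    poll_stopping = 1;                     /* forbid re-arming ...            */
    smp_wmb();                             /* ... before the cancel below     */
    del_timer_sync(&poll_timer);           /* not queued and not running on
                                              any CPU when this returns       */
}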
> Dave, thank you for your ongoing analysis of the problem. While
> reading systemtap-generated C code is not easy, "stap -p4 -k ..."
> is a start. The temp-directory + /usr/share/systemtap/runtime
> directories should give the complete picture.

Right -- I only looked at the C files in the /tmp/stapXXX directories and couldn't see where the timers were being activated/deactivated, but could see that del_timer_sync was part of each .ko module.

> Perhaps the kernel is not robust with respect to a combination of
> these operations from modules that may be removed.

The last crash, however, was *only* running a single module at the time, which was the para-callgraph.stp-generated module that I was running in a loop from a bash script, and it crashed either due to my entry of Ctrl-C or simply because the module always fails quickly by itself in Pass 5 like this:

...
Pass 5: starting run.
ERROR: probe overhead exceeded threshold
WARNING: Number of errors: 1, skipped probes: 0

It seems that the del_timer_sync call should be made as part of the mod->exit() callback into the module, here in sys_delete_module():

asmlinkage long sys_delete_module(const char __user *name_user, unsigned int flags)
... [ snip ] ...
        /* Final destruction now noone is using it. */
        if (mod->exit != NULL) {
                mutex_unlock(&module_mutex);
                mod->exit();
                mutex_lock(&module_mutex);
        }
        free_module(mod);
 out:
        mutex_unlock(&module_mutex);
        return ret;
}

But it's gone beyond that and has called free_module(). And all but one of the other cpus are also working on the same stale callback that exists on their per-cpu timer queue.

As to other machine types, they claim it's only been an ia64 issue, but I have to believe that it's really unknown at this point because the original reproducer test procedure was so convoluted. So anyway, this is what I did to narrow it down:

while true
do
        stap -v -o /dev/null para-callgraph2.stp sys_read '*@fs/*.c'
done

and on the 2nd or 3rd attempt, I hit Ctrl-C.
Were you also running the io-load-test to reproduce the problem with just para-callgraph2.stp?
(In reply to comment #24) > Were you also running the io-load-test to reproduce the problem with just > para-callgraph2.stp? No -- I had just rebooted the system, and then I did the above. BTW, I just got a fresh x86_64 system that I'm going to try to reproduce it on as well.
Created attachment 365759 [details]
dmesg log

The same loop caused a crash on my x86-64 rhel5.4 VM. This was with a bunch of ^C's and maybe a ^Z in there, too, trying to disrupt it.
Hmmm, I'm not able to get it to reproduce on the x86_64. I just wait until it says "Pass 5: starting run." and then hit CTRL-C. But it seems to clean up as expected... On the other hand, I tried it again on the ia64 machine I've got, and it only took 3 repetitions before I got it to crash again. In this new case, cpus 0, 3, 4, 5, 6 and 7 are all in the idle "swapper" task running run_timer_softirq() when they come upon the same queued function to call at a000000202402718, which is no longer mapped in vmalloc space. Cpus 1 and 2 just haven't gotten around to handling it yet, but the function is still sitting on their per-cpu timer queue: crash> timer ... [snip] ... TVEC_BASES[1]: e000000185194000 JIFFIES 4483518118 EXPIRES TIMER_LIST FUNCTION 4483518118 e000000105dabba0 a000000202402718 4483520108 e000000028017090 a00000010077bd20 <delayed_work_timer_fn> 4488172090 e000000110c08900 a00000010077c600 <tcp_keepalive_timer> 4490094598 e000000110c09700 a00000010077c600 <tcp_keepalive_timer> TVEC_BASES[2]: e000000185234000 JIFFIES 4483518118 EXPIRES TIMER_LIST FUNCTION 4483518118 e000000105da82a0 a000000202402718 4483518352 e00000010305edb8 a000000100776440 <rh_timer_func> 4483518752 e000000028027090 a00000010077bd20 <delayed_work_timer_fn> 4483519116 e00000010c26fab0 a000000100777c10 <process_timeout> 4483530474 e000000107a05218 a00000010077b540 <neigh_timer_handler> 4515112326 e0000001073e7a60 a000000100777c10 <process_timeout> ... So in this case, the systemtap module has already been deleted, but with timer functions still referenced: crash> mod | grep stap crash>
Some experiments show evidence of a possible race condition between abnormal shutdown triggered at the same time as normal startup. We are formulating a cunning plan to solve it.
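To illustrate the class of race being described -- this is a hypothetical sketch, not the actual systemtap code or the eventual fix -- if the abnormal-shutdown path can run while startup is still completing, startup may arm the timer after del_timer_sync() has already been called, leaving a live callback behind when the module is unloaded. Serializing the "shutting down" state with the arm/cancel operations closes that window:

#include <linux/timer.h>
#include <linux/spinlock.h>
#include <linux/jiffies.h>

static DEFINE_SPINLOCK(state_lock);        /* illustrative names throughout */
static int shutting_down;
static struct timer_list sample_timer;

static void sample_callback(unsigned long data)
{
    /* periodic work; re-arming omitted in this sketch */
}

/* called late in normal startup */
static int startup_arm_timer(void)
{
    int armed = 0;

    setup_timer(&sample_timer, sample_callback, 0);

    spin_lock(&state_lock);
    if (!shutting_down) {                  /* a concurrent shutdown may have won */
        mod_timer(&sample_timer, jiffies + 1);
        armed = 1;
    }
    spin_unlock(&state_lock);
    return armed;
}

/* called from the error/Ctrl-C path, possibly while startup is still running */
static void abnormal_shutdown(void)
{
    spin_lock(&state_lock);
    shutting_down = 1;                     /* nothing may arm the timer after this */
    spin_unlock(&state_lock);

    del_timer_sync(&sample_timer);         /* safe: nothing can re-arm it now */
}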
We believe we have fixed this in upstream systemtap. This will be included in future RHEL releases. http://sourceware.org/bugzilla/show_bug.cgi?id=10854
I believe we need PM to authorize 5.4.Z first.
patch is ready for backporting into rhel5.4.Z; rhel5.5 will get it as a part of a near-future rebase
An advisory has been issued which should help the problem described in this bug report. This report is therefore being closed with a resolution of ERRATA. For more information on the solution and/or where to find the updated files, please follow the link below. You may reopen this bug report if the solution does not work for you.

http://rhn.redhat.com/errata/RHBA-2010-0308.html