We have a possible regression reported by a user. The kernel panics when the stap command of systemtap is executed multiple times concurrently. It does not occur in RHEL 5.3 (kernel-2.6.18-128.7.1.el5). The issue has been confirmed on the RHEL 5.4 beta, on Snapshot 2, and on the GA kernel. The Oops message accompanying the crash is:

crash> bt
PID: 31620  TASK: e0000040d7b20000  CPU: 0  COMMAND: "staprun"
 #0 [BSP:e0000040d7b21608] machine_kexec at a00000010005a380
 #1 [BSP:e0000040d7b215e0] crash_kexec at a0000001000cef90
 #2 [BSP:e0000040d7b21578] panic at a00000010007bbd0
 #3 [BSP:e0000040d7b21530] die at a000000100037ef0
 #4 [BSP:e0000040d7b214e0] ia64_do_page_fault at a00000010066dd90
 #5 [BSP:e0000040d7b214e0] __ia64_leave_kernel at a00000010000bfe0
  EFRAME: e0000040d7b27ba0
  B0: a0000001001a1dc0 CR_IIP: a0000001001a1e40 CR_IPSR: 00001210085a6010 CR_IFS: 8000000000000610
  AR_PFS: 0000000000000610 AR_RSC: 0000000000000003 AR_UNAT: 0000000000000000 AR_RNAT: e0000040d7b27d30
  AR_CCV: 0000000000000000 AR_FPSR: 0009804c0270033f LOADRS: 0000000000000000 AR_BSPSTORE: e0000040d7b20000
  B6: a0000001002dcc40 B7: a00000010026e720 PR: 0000000000556a99
  R1: a000000100c36560 R2: e0000040d7b20328 R3: e0000040d7b2032c
  R8: 0000000000000000 R9: e0000001308f9a80 R10: 9e37fffffffc0001 R11: a000000100818318
  R12: e0000040d7b27d60 R13: e0000040d7b20000 R14: 2031363439656c69 R15: 000000000000006c
  R16: 00fc7000357651f1 R17: e0000040d7b27d88 R18: 000000003797ae1e R19: 00000000003fffff
  R20: 0000000000000016 R21: a000000100a4a988 R22: 00fc6ffffddea3d3 R23: 000000003797ae1e
  R24: a000000100818310 R25: a000000100a4a978 R26: 7e37fffeef51e9d9 R27: 0000000000000000
  R28: a000000100a4a980 R29: 0000000000000001 R30: e0000001308f99e0 R31: e000000110adea60
  F6: 000000000000000000000 F7: 000000000000000000000 F8: 000000000000000000000
  F9: 000000000000000000000 F10: 000000000000000000000 F11: 000000000000000000000
 #6 [BSP:e0000040d7b21460] __link_path_walk at a0000001001a1e40
 #7 [BSP:e0000040d7b21420] link_path_walk at a0000001001a2c90
 #8 [BSP:e0000040d7b213d8] do_path_lookup at a0000001001a3750
 #9 [BSP:e0000040d7b21380] __path_lookup_intent_open at a0000001001a4ea0
#10 [BSP:e0000040d7b21340] path_lookup_open at a0000001001a5020
#11 [BSP:e0000040d7b212d0] open_namei at a0000001001a5330
#12 [BSP:e0000040d7b21298] do_filp_open at a0000001001734b0
#13 [BSP:e0000040d7b21248] do_sys_open at a0000001001735a0
#14 [BSP:e0000040d7b211e8] sys_open at a000000100173790
#15 [BSP:e0000040d7b211e8] __ia64_trace_syscall at a00000010000bd70
  EFRAME: e0000040d7b27e40
  B0: 200000080014a4f0 CR_IIP: a000000000010620 CR_IPSR: 00001213085a6010 CR_IFS: 0000000000000008
  AR_PFS: c000000000000008 AR_RSC: 000000000000000f AR_UNAT: 0000000000000000 AR_RNAT: 0000000000000000
  AR_CCV: 0000000000000000 AR_FPSR: 0009804c8a70033f LOADRS: 0000000001480000 AR_BSPSTORE: 600007fffffb8130
  B6: 2000000800210f80 B7: a000000000010640 PR: 0000000000551a41
  R1: 2000000800280238 R2: e0000040d7b27ee0 R3: e0000040d7b27ef8
  R8: 0000000000000001 R9: 2000000800303c88 R10: 0000000000000000 R11: c000000000000593
  R12: 60000ffffffb2fe0 R13: 200000080030c6d0 R14: a000000000010640 R15: 0000000000000404
  R16: e0000040d7b210a0 R17: 0000000000000710 R18: 0000000000000160 R19: 0000000001400000
  R20: 0009804c8a70033f R21: 2000000800008600 R22: 0000000000000000 R23: 600007fffffb8130
  R24: 0000000000000000 R25: 0000000000000000 R26: c000000000000004 R27: 000000000000000f
  R28: a000000000010620 R29: 00001213085a6010 R30: 0000000000000004 R31: 0000000000551a41
  F6: 000000000000000000000 F7: 000000000000000000000 F8: 000000000000000000000
  F9: 000000000000000000000 F10: 000000000000000000000 F11: 000000000000000000000
#16 [BSP:e0000040d7b211e8] __kernel_syscall_via_break at a000000000010620

Version-Release number of selected component:
Red Hat Enterprise Linux Version Number: 5.4
Release Number: Snapshot 2
Architecture: ia64
Kernel Version: 2.6.18-157.el5
Related Package Version: systemtap-0.9.7-5.el5
Related Middleware / Application: none
Could I get a pointer to the vmlinux/vmcore pair?
> And our developer and investigator use as one of debugging work when the
> trouble occurs. They can not use SystemTap command by this issue.
> It is influence that the shipment of RHEL5.4 and relevant middleware might be
> delayed.

Can we agree that this test is insane? And to state that the failure implies that "They can not use SystemTap command by this issue." is a bit hard to swallow.

I'm not much of a systemtap user, but as I understand it, all of these scripts and their resultant kernel modules are run simultaneously:

  disktop.stp: places retprobes in vfs_write(), vfs_read(), plus 7 other (timer-related?) locations
  para-callgraph2.stp: places 3277 kprobes in all functions in the kernel's "fs" directory tree.
  test1.stp: places 512 kprobes and kretprobes -- in all system calls
  traceio.stp: places retprobes in vfs_write(), vfs_read() and a timer
  kmalloc2.stp: kprobes __kmalloc()
  syscalls_by_proc.stp: places 266 kprobes (all system calls)
  top.stp: places 266 system call probes (all system calls)

I've created my own kprobes for debugging purposes, but I was not aware that it was even "legal" to have simultaneous kprobes on the same function? Perhaps that's OK to do -- maybe they get "chained" somehow if a kprobe is placed at the same location more than one time?

The dumps are interesting, but they don't point to anything that is an obvious bug; rather, both seem to bump into a section of kmalloc'd memory that has been corrupted with ASCII data.

Anyway, in the first core, "staprun" is trying to open the file "/sys/module/ohci_hcd/sections/.data.rel.1", and it ran into what appears to be inode corruption in the __link_path_walk() function:

crash> bt
PID: 31620 TASK: e0000040d7b20000 CPU: 0 COMMAND: "staprun" #0 [BSP:e0000040d7b21608] machine_kexec at a00000010005a380 #1 [BSP:e0000040d7b215e0] crash_kexec at a0000001000cef90 #2 [BSP:e0000040d7b21578] panic at a00000010007bbd0 #3 [BSP:e0000040d7b21530] die at a000000100037ef0 #4 [BSP:e0000040d7b214e0] ia64_do_page_fault at a00000010066dd90 #5 [BSP:e0000040d7b214e0] __ia64_leave_kernel at a00000010000bfe0 EFRAME: e0000040d7b27ba0 B0: a0000001001a1dc0 CR_IIP: a0000001001a1e40 CR_IPSR: 00001210085a6010 CR_IFS: 8000000000000610 AR_PFS: 0000000000000610 AR_RSC: 0000000000000003 AR_UNAT: 0000000000000000 AR_RNAT: e0000040d7b27d30 AR_CCV: 0000000000000000 AR_FPSR: 0009804c0270033f LOADRS: 0000000000000000 AR_BSPSTORE: e0000040d7b20000 B6: a0000001002dcc40 B7: a00000010026e720 PR: 0000000000556a99 R1: a000000100c36560 R2: e0000040d7b20328 R3: e0000040d7b2032c R8: 0000000000000000 R9: e0000001308f9a80 R10: 9e37fffffffc0001 R11: a000000100818318 R12: e0000040d7b27d60 R13: e0000040d7b20000 R14: 2031363439656c69 R15: 000000000000006c R16: 00fc7000357651f1 R17: e0000040d7b27d88 R18: 000000003797ae1e R19: 00000000003fffff R20: 0000000000000016 R21: a000000100a4a988 R22: 00fc6ffffddea3d3 R23: 000000003797ae1e R24: a000000100818310 R25: a000000100a4a978 R26: 7e37fffeef51e9d9 R27: 0000000000000000 R28: a000000100a4a980 R29: 0000000000000001 R30: e0000001308f99e0 R31: e000000110adea60 F6: 000000000000000000000 F7: 000000000000000000000 F8: 000000000000000000000 F9: 000000000000000000000 F10: 000000000000000000000 F11: 000000000000000000000 #6 [BSP:e0000040d7b21460] __link_path_walk at a0000001001a1e40 #7 [BSP:e0000040d7b21420] link_path_walk at a0000001001a2c90 #8 [BSP:e0000040d7b213d8] do_path_lookup at a0000001001a3750 #9 [BSP:e0000040d7b21380] __path_lookup_intent_open at a0000001001a4ea0 #10 [BSP:e0000040d7b21340] path_lookup_open at a0000001001a5020 #11
[BSP:e0000040d7b212d0] open_namei at a0000001001a5330 #12 [BSP:e0000040d7b21298] do_filp_open at a0000001001734b0 #13 [BSP:e0000040d7b21248] do_sys_open at a0000001001735a0 #14 [BSP:e0000040d7b211e8] sys_open at a000000100173790 #15 [BSP:e0000040d7b211e8] __ia64_trace_syscall at a00000010000bd70 EFRAME: e0000040d7b27e40 B0: 200000080014a4f0 CR_IIP: a000000000010620 CR_IPSR: 00001213085a6010 CR_IFS: 0000000000000008 AR_PFS: c000000000000008 AR_RSC: 000000000000000f AR_UNAT: 0000000000000000 AR_RNAT: 0000000000000000 AR_CCV: 0000000000000000 AR_FPSR: 0009804c8a70033f LOADRS: 0000000001480000 AR_BSPSTORE: 600007fffffb8130 B6: 2000000800210f80 B7: a000000000010640 PR: 0000000000551a41 R1: 2000000800280238 R2: e0000040d7b27ee0 R3: e0000040d7b27ef8 R8: 0000000000000001 R9: 2000000800303c88 R10: 0000000000000000 R11: c000000000000593 R12: 60000ffffffb2fe0 R13: 200000080030c6d0 R14: a000000000010640 R15: 0000000000000404 R16: e0000040d7b210a0 R17: 0000000000000710 R18: 0000000000000160 R19: 0000000001400000 R20: 0009804c8a70033f R21: 2000000800008600 R22: 0000000000000000 R23: 600007fffffb8130 R24: 0000000000000000 R25: 0000000000000000 R26: c000000000000004 R27: 000000000000000f R28: a000000000010620 R29: 00001213085a6010 R30: 0000000000000004 R31: 0000000000551a41 F6: 000000000000000000000 F7: 000000000000000000000 F8: 000000000000000000000 F9: 000000000000000000000 crash> The address it tried to use was 2031363439656cb9: crash> log | grep Unable Unable to handle kernel paging request at virtual address 2031363439656cb9 crash> which is ASCII data for the most part (except for the b9 at the end): crash> ascii 2031363439656cb9 2031363439656cb9: <b9>le9461 crash> And as it turns out there is corruption in the inode slab cache: crash> kmem -s ... [ snip ] ... e00000010a5a1800 bdev_cache 896 31 36 2 16k e00000010a561780 sysfs_dir_cache 88 6451 6726 38 16k e00000010a559700 mnt_cache 256 33 124 2 16k kmem: inode_cache: full list: slab: e000000130884000 bad next pointer: 6c69662f30726964 kmem: inode_cache: full list: slab: e000000130884000 bad prev pointer: 6964203837333865 kmem: inode_cache: full list: slab: e000000130884000 bad inuse counter: 1818846767 kmem: inode_cache: full list: slab: e000000130884000 bad inuse counter: 1818846767 kmem: inode_cache: full list: slab: e000000130884000 bad s_mem pointer: 3072696420393733 e000000109559680 inode_cache 568 749 980 35 16k e000000107529600 dentry_cache 216 56855 57498 777 16k ... 
The slab structure at the bottom of an inode cache slab (at e000000130884000) contains repetetive ASCII data -- but the ASCII data starts several pages below that at e00000013087c000: crash> rd e00000013087c000 6934 e00000013087c000: 2037333036656c69 6c69662f30726964 ile6037 dir0/fil e00000013087c010: 6964203833303665 36656c69662f3072 e6038 dir0/file6 e00000013087c020: 3072696420393330 343036656c69662f 039 dir0/file604 e00000013087c030: 662f307269642030 2031343036656c69 0 dir0/file6041 e00000013087c040: 6c69662f30726964 6964203234303665 dir0/file6042 di e00000013087c050: 36656c69662f3072 3072696420333430 r0/file6043 dir0 e00000013087c060: 343036656c69662f 662f307269642034 /file6044 dir0/f e00000013087c070: 2035343036656c69 6c69662f30726964 ile6045 dir0/fil e00000013087c080: 6964203634303665 36656c69662f3072 e6046 dir0/file6 e00000013087c090: 3072696420373430 343036656c69662f 047 dir0/file604 e00000013087c0a0: 662f307269642038 2039343036656c69 8 dir0/file6049 e00000013087c0b0: 6c69662f30726964 6964203035303665 dir0/file6050 di e00000013087c0c0: 36656c69662f3072 3072696420313530 r0/file6051 dir0 e00000013087c0d0: 353036656c69662f 662f307269642032 /file6052 dir0/f e00000013087c0e0: 2033353036656c69 6c69662f30726964 ile6053 dir0/fil e00000013087c0f0: 6964203435303665 36656c69662f3072 e6054 dir0/file6 e00000013087c100: 3072696420353530 353036656c69662f 055 dir0/file605 e00000013087c110: 662f307269642036 2037353036656c69 6 dir0/file6057 e00000013087c120: 6c69662f30726964 6964203835303665 dir0/file6058 di e00000013087c130: 36656c69662f3072 3072696420393530 r0/file6059 dir0 e00000013087c140: 363036656c69662f 662f307269642030 /file6060 dir0/f e00000013087c150: 2031363036656c69 6c69662f30726964 ile6061 dir0/fil e00000013087c160: 6964203236303665 36656c69662f3072 e6062 dir0/file6 e00000013087c170: 3072696420333630 363036656c69662f 063 dir0/file606 e00000013087c180: 662f307269642034 2035363036656c69 4 dir0/file6065 ... and it continues for several pages (55472 bytes) until here: ... e000000130889870: 39656c69662f3072 3072696420353939 r0/file9995 dir0 e000000130889880: 393939656c69662f 662f307269642036 /file9996 dir0/f e000000130889890: 2037393939656c69 6c69662f30726964 ile9997 dir0/fil e0000001308898a0: 6964203839393965 39656c69662f3072 e9998 dir0/file9 e0000001308898b0: 0000000000393939 0000000000000000 999............. e0000001308898c0: 0000000000000000 0000000000000000 ................ e0000001308898d0: 0000000000000000 0000000000000000 ................ e0000001308898e0: 0000000000000000 0000000000000000 ................ e0000001308898f0: 0000000000000000 0000000000000000 ................ If I strip off the "b9" from the 2031363439656cb9, which was probably created by adding a structure member offset, and look for it in the memory containing the ASCII data, I see this: crash> rd e00000013087c000 6934 | grep 2031363439656c e000000130887b40: 2031363439656c69 6c69662f30726964 ile9461 dir0/fil crash> And if I search the crashing task's stack, I see that the address above has been used: crash> rd e0000040d7b20000 4096 | grep e000000130887b40 e0000040d7b214a0: e000000130887b40 a0000001001a2c90 @{.0.....,...... crash> That's as far as I have gone with that vmcore. In core2, slab corruption also exists: crash> kmem -s ... [ snip ] ... 
e000000107388480 size-1024 1024 562 660 44 16k e000000107380400 size-512(DMA) 512 0 0 0 16k kmem: size-512: full list: slab: e00000408339c080 bad next pointer: 6964203634303665 kmem: size-512: full list: slab: e00000408339c080 bad prev pointer: 36656c69662f3072 kmem: size-512: full list: slab: e00000408339c080 bad inuse counter: 1768169528 kmem: size-512: full list: slab: e00000408339c080 bad inuse counter: 1768169528 kmem: size-512: full list: slab: e00000408339c080 bad s_mem pointer: 343036656c69662f e000000107378380 size-512 512 5790 6262 202 16k e000000107370300 size-256(DMA) 256 0 0 0 16k e000000107368280 size-256 256 2201 3286 53 16k e000000107360200 size-128(DMA) 128 0 0 0 16k e000000107350180 size-64(DMA) 64 0 0 0 16k e000000107348100 size-128 128 1801 2706 22 16k kmem: size-64: full list: slab: e0000040833ac000 bad next pointer: 6420333031392d72 kmem: size-64: full list: slab: e0000040833ac000 bad prev pointer: 20343031392d7269 kmem: size-64: full list: slab: e0000040833ac000 bad inuse counter: 1768169526 kmem: size-64: full list: slab: e0000040833ac000 bad inuse counter: 1768169526 kmem: size-64: full list: slab: e0000040833ac000 bad s_mem pointer: 3031392d72696420 e000000107340080 size-64 64 733752 748320 3118 16k a000000100a0c3e8 kmem_cache 16512 149 149 149 32k crash> Like the other vmcore, there's a multiple-page chunk of repetetive ASCII data. In this dump it starts at e000004083398980 and ends at e0000040833ae280. It contains similar "dir#/file###" entries, a small chunk of what looks like systemtap output data, and then more file stuff: crash> rd e000004083398980 11040 e000004083398980: 662f30006f2e6e74 2031343035656c69 tn.o.0/file5041 e000004083398990: 6c69662f30726964 6964203234303565 dir0/file5042 di e0000040833989a0: 35656c69662f3072 3072696420333430 r0/file5043 dir0 e0000040833989b0: 343035656c69662f 662f307269642034 /file5044 dir0/f e0000040833989c0: 2035343035656c69 6c69662f30726964 ile5045 dir0/fil e0000040833989d0: 6964203634303565 35656c69662f3072 e5046 dir0/file5 e0000040833989e0: 3072696420373430 343035656c69662f 047 dir0/file504 e0000040833989f0: 662f307269642038 2039343035656c69 8 dir0/file5049 e000004083398a00: 6c69662f30726964 6964203035303565 dir0/file5050 di e000004083398a10: 35656c69662f3072 3072696420313530 r0/file5051 dir0 ... [ snip ] ... e000004083398dc0: 6964203831313565 35656c69662f3072 e5118 dir0/file5 e000004083398dd0: 3072696420393131 323135656c69662f 119 dir0/file512 e000004083398de0: 662f307269642030 0000000000000000 0 dir0/f........ e000004083398df0: 2020202020202020 2064687373202020 sshd e000004083398e00: 3637322020203a72 2042694b20323338 r: 276832 KiB e000004083398e10: 3937322020203a77 0a42694b20373232 w: 279227 KiB. e000004083398e20: 2020202020202020 206f697061747320 stapio e000004083398e30: 3633322020203a72 2042694b20333136 r: 236613 KiB e000004083398e40: 3633322020203a77 0a42694b20323036 w: 236602 KiB. e000004083398e50: 2020202020202020 2068736172632020 crash e000004083398e60: 3831362020203a72 2042694b20333137 r: 618713 KiB e000004083398e70: 35656c69662f3072 3072696420313331 r0/file5131 dir0 e000004083398e80: 333135656c69662f 662f307269642032 /file5132 dir0/f e000004083398e90: 2033333135656c69 6c69662f30726964 ile5133 dir0/fil ... [ snip ] ... 
e0000040833a9850: 662f307269642032 2033393939656c69 2 dir0/file9993 e0000040833a9860: 6c69662f30726964 6964203439393965 dir0/file9994 di e0000040833a9870: 39656c69662f3072 3072696420353939 r0/file9995 dir0 e0000040833a9880: 393939656c69662f 662f307269642036 /file9996 dir0/f e0000040833a9890: 2037393939656c69 6c69662f30726964 ile9997 dir0/fil e0000040833a98a0: 6964203839393965 39656c69662f3072 e9998 dir0/file9 e0000040833a98b0: 3830382d00393939 30382d7269642035 999.-8085 dir-80 e0000040833a98c0: 382d726964203638 2d72696420373830 86 dir-8087 dir- e0000040833a98d0: 7269642038383038 696420393830382d 8088 dir-8089 di e0000040833a98e0: 6964203930382d72 6420303930382d72 r-809 dir-8090 d e0000040833a98f0: 20313930382d7269 323930382d726964 ir-8091 dir-8092 e0000040833a9900: 3930382d72696420 30382d7269642033 dir-8093 dir-80 ... [ snip ] ... The address being dereferenced in this dump is 2d72696420303439: crash> log | grep Unable Unable to handle kernel paging request at virtual address 2d72696420303439 crash> ascii 2d72696420303439 2d72696420303439: 940 dir- crash> crash> bt PID: 4287 TASK: e00000409b170000 CPU: 1 COMMAND: "stapio" #0 [BSP:e00000409b1714a0] machine_kexec at a00000010005a380 #1 [BSP:e00000409b171480] crash_kexec at a0000001000cef90 #2 [BSP:e00000409b171418] panic at a00000010007bbd0 #3 [BSP:e00000409b1713c8] die at a000000100037ef0 #4 [BSP:e00000409b171378] ia64_do_page_fault at a00000010066dd90 #5 [BSP:e00000409b171378] __ia64_leave_kernel at a00000010000bfe0 EFRAME: e00000409b177bb0 B0: a00000010066f230 CR_IIP: a00000010066f1f0 CR_IPSR: 00001210085a6010 CR_IFS: 8000000000000389 AR_PFS: 0000000000000389 AR_RSC: 0000000000000003 AR_UNAT: 0000000000000000 AR_RNAT: 0000000000551559 AR_CCV: 0000000000000000 AR_FPSR: 0009804c8a70033f LOADRS: 0000000000000000 AR_BSPSTORE: a0000001001691c0 B6: a0000001006690e0 B7: a00000010000b7e0 PR: 0000000000556659 R1: a000000100c36560 R2: 0000000000000389 R3: a0000001001caa60 R8: e0000040833ae040 R9: 0000000000000054 R10: a000000209f336b8 R11: e000004089508028 R12: e00000409b177d70 R13: e00000409b170000 R14: 392d726964203439 R15: e0000040833ae048 R16: 0000000000000000 R17: 0000000000200200 R18: 0000000000200200 R19: 392d726964203441 R20: 2d72696420303439 R21: 392d726964203439 R22: 0000000000000063 R23: 0000000000000062 R24: 0000000000000080 R25: 0000000000024000 R26: 0000000000000000 R27: 0000000000004000 R28: 0000000000004000 R29: 00000000070f04ce R30: 0000000000000000 R31: a000000100a4f708 F6: 1003e00000000000000a5 F7: 0ffdd8000000000000000 F8: 10006a4ffd6c000000000 F9: 100058000000000000000 F10: 10014b23947fff4dc6b80 F11: 1003e00000000002c8e52 #6 [BSP:e00000409b171330] unregister_kretprobes at a00000010066f1f0 #7 [BSP:e00000409b1712f8] _stp_cleanup_and_exit at a000000209e185d0 #8 [BSP:e00000409b171270] _stp_ctl_write_cmd at a000000209e19170 #9 [BSP:e00000409b171220] vfs_write at a000000100177ea0 #10 [BSP:e00000409b1711a0] kretprobe_trampoline at a000000100059f80 #11 [BSP:e00000409b171128] unregister_kretprobes at a00000010066f230 crash> And if I look in the corrupted memory section for it, I can see a couple references: crash> rd e000004083398980 11040 | grep 2d72696420303439 e0000040833acb70: 2d72696420303439 7269642030303439 940 dir-9400 dir e0000040833ae040: 392d726964203439 2d72696420303439 94 dir-9940 dir- crash> And looking into the kernel stack of the crashing task, the second reference is in the exception frame: crash> bt | grep e0000040833ae040 R8: e0000040833ae040 R9: 0000000000000054 crash> But at this point, I don't know where to go from 
there. I'm guessing that the corruption took place long before either of the two instances that crashed ran into it. And it's highly likely that the problem is in the systemtap-generated kernel modules themselves and not in the kernel proper.
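Regarding the question above about whether it is "legal" to have simultaneous kprobes on the same function: it is, and they do get chained -- the kprobes layer converts the probe point into an aggregate probe and calls each registered handler in turn (that is the aggr_pre_handler() that shows up in one of the later backtraces). As a minimal sketch of what that looks like from a module's point of view (my own illustration, not code from systemtap; the probed symbol is arbitrary):

/*
 * Sketch only: two kprobes registered at the same function.  The kprobes
 * layer accepts the second registration and chains both pre_handlers
 * behind an aggregate probe.
 */
#include <linux/module.h>
#include <linux/kprobes.h>

static int handler_a(struct kprobe *p, struct pt_regs *regs)
{
	printk(KERN_INFO "probe A hit at %p\n", p->addr);
	return 0;
}

static int handler_b(struct kprobe *p, struct pt_regs *regs)
{
	printk(KERN_INFO "probe B hit at %p\n", p->addr);
	return 0;
}

static struct kprobe kp_a = { .symbol_name = "sys_shutdown", .pre_handler = handler_a };
static struct kprobe kp_b = { .symbol_name = "sys_shutdown", .pre_handler = handler_b };

static int __init chain_init(void)
{
	int ret = register_kprobe(&kp_a);

	if (ret)
		return ret;
	ret = register_kprobe(&kp_b);	/* second probe at the same address gets chained */
	if (ret)
		unregister_kprobe(&kp_a);
	return ret;
}

static void __exit chain_exit(void)
{
	unregister_kprobe(&kp_b);
	unregister_kprobe(&kp_a);
}

module_init(chain_init);
module_exit(chain_exit);
MODULE_LICENSE("GPL");

So several stap-generated modules probing vfs_read()/vfs_write() at the same time should be fine as far as the kprobes API itself is concerned.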
Actually it looks like it's a buffer larger than 64K; the second one is at least 88K bytes in length.
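For reference, those sizes can be checked against the addresses quoted in the analysis (my arithmetic, not taken from the vmcores directly):

  first core:  e0000001308898b0 - e00000013087c000 = 0xd8b0  = 55,472 bytes (the 55472-byte figure above)
  second core: e0000040833ae280 - e000004083398980 = 0x15900 = 88,320 bytes, i.e. "at least 88K"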
I was unsuccessful in reproducing this on an ia64 using the supplied instructions, although they don't make a lot of sense. I ran the "all_exec.sh" script to kick off the set of stap scripts, and then started up the "io-load-test.sh" script. But then I didn't understand what was supposed to be done after the above:

> 3. Do one or some of the followings:

Huh? "one or some of the followings"?

> a: Execute a stap script again.
>    For example, execution of "stap -v -k disktop.stp"

OK, I did the example above in the background, and found the PID of the "stap -v -k disktop.stp" command's child "stapio" process.

> b: Send SIGTERM to a process of stapio.

OK, I did that -- again presuming they're talking about the child "stapio" of the parent "stap -v -k disktop.stp" command. But when the "stapio" child is killed, it also causes the parent "stap" command to die as well.

> c: Run gdb.

Run gdb on what? Just enter "gdb" alone?

> d: Type Ctrl-C to terminate a process of stap.

It's already dead as a result of the SIGTERM I sent to the child "stapio" process.

So after giving up, I started looking at the two supplied vmcore files on the same system, and later on decided to "scp" them to another machine for storage. Only *then* did the system crash! Anyway, here's what I see in the new vmcore: an "ssh" instance was shutting down a socket, and bumped into a kprobe at the first instruction in sys_shutdown():

crash> bt
PID: 24734 TASK: e0000003a1b20000 CPU: 0 COMMAND: "ssh" #0 [BSP:e0000003a1b213d8] machine_kexec at a00000010005a380 #1 [BSP:e0000003a1b213b8] crash_kexec at a0000001000cefb0 #2 [BSP:e0000003a1b21350] panic at a00000010007bc70 #3 [BSP:e0000003a1b21308] die at a000000100037ef0 #4 [BSP:e0000003a1b212b8] ia64_do_page_fault at a00000010066dd90 #5 [BSP:e0000003a1b212b8] __ia64_leave_kernel at a00000010000bfe0 EFRAME: e0000003a1b279f0 B0: a00000010066cb90 CR_IIP: a000000100670500 CR_IPSR: 0000121008022038 CR_IFS: 8000000000000001 AR_PFS: 0000000000000286 AR_RSC: 0000000000000003 AR_UNAT: 0000000000000000 AR_RNAT: 0000000500030002 AR_CCV: 0000000000000000 AR_FPSR: 0009804c0270033f LOADRS: 0000000000000000 AR_BSPSTORE: 0000000000000024 B6: a00000010066ff40 B7: a000000100670640 PR: 00000000005619a5 R1: a000000100c36560 R2: 0000000000000389 R3: 0000000000100100 R8: 0000000000200200 R9: 373038656c696637 R10: 662f307269642036 R11: 373038656c69662f R12: e0000003a1b27bb0 R13: e0000003a1b20000 R14: a000000100059f80 R15: e0000003a496afb0 R16: a0000002050c314c R17: a000000100d28b88 R18: ffffffffffff0210 R19: e0000003a496af88 R20: e0000003a496af90 R21: e0000003a496af98 R22: e0000003a496afa0 R23: 0004014dc5800000 R24: a000000100537160 R25: e0000003a84011b0 R26: e0000040f10ee780 R27: 0000000000004000 R28: e0000001075d4838 R29: e0000001075d4808 R30: 0000000000000000 R31: a000000100a4f750 F6: 000000000000000000000 F7: 000000000000000000000 F8: 000000000000000000000 F9: 000000000000000000000 F10: 0fffef13d31b9b66f9000 F11: 0ffdcc3f0000000000000 #6 [BSP:e0000003a1b212b0] add_rp_inst at a000000100670500 #7 [BSP:e0000003a1b21288] arch_prepare_kretprobe at a00000010066cb90 #8 [BSP:e0000003a1b21250] pre_handler_kretprobe at a00000010066ff80 #9 [BSP:e0000003a1b21220] aggr_pre_handler at a0000001006706d0 #10 [BSP:e0000003a1b211b8] kprobe_exceptions_notify at a00000010066b670 #11 [BSP:e0000003a1b21180] notifier_call_chain at a00000010066df10 #12 [BSP:e0000003a1b21150] atomic_notifier_call_chain at a00000010066dfb0 #13 [BSP:e0000003a1b21128] ia64_bad_break at a00000010066ac80 #14 [BSP:e0000003a1b21128] __ia64_leave_kernel at
a00000010000bfe0 EFRAME: e0000003a1b27c70 B0: a000000100059f80 CR_IIP: a000000100537160 CR_IPSR: 0000101008526030 CR_IFS: 8000000000000008 AR_PFS: 0000000000000008 AR_RSC: 0000000000000003 AR_UNAT: 0000000000000000 AR_RNAT: 0000000000551559 AR_CCV: 0000000000000000 AR_FPSR: 0009804c0270033f LOADRS: 0000000000000000 AR_BSPSTORE: a000000100059f80 B6: a000000100537160 B7: a0000001000102b0 PR: 0000000000569559 R1: a000000100c36560 R2: 0000000000000000 R3: 0000000000000118 R8: 000000004abcdba0 R9: 00000000381f5ee0 R10: e0000003a1b20e98 R11: 0000000000000010 R12: e0000003a1b27e30 R13: e0000003a1b20000 R14: a000000100a4d9e8 R15: 00000000000000b2 R16: a0000001006f14b8 R17: e0000003a1b27ff0 R18: 00000000381f5ee0 R19: a000000100879300 R20: a000000100537160 R21: a000000100d21088 R22: 0000000000000001 R23: a000000100a4f720 R24: a000000100a4f720 R25: 0000000000000000 R26: a000000100a4f73c R27: e0000001075d4840 R28: e0000001075d4838 R29: e0000001075d4808 R30: e0000001075d4824 R31: e0000001075d4830 F6: 000000000000000000000 F7: 000000000000000000000 F8: 000000000000000000000 F9: 000000000000000000000 F10: 0fffef13d31b9b66f9000 F11: 0ffdcc3f0000000000000 #15 [BSP:e0000003a1b210e8] sys_shutdown at a000000100537160 #16 [BSP:e0000003a1b210e8] kretprobe_trampoline at a000000100059f80 bt: unwind: ip, sp, bsp remain unchanged; stopping here (ip=0xa000000100059f80) crash> The kprobe handling ended up calling add_rp_inst() with a kretprobe_instance pointer argument of e0000003a496af80: crash> bt -f ... #6 [BSP:e0000003a1b212b0] add_rp_inst at a000000100670500 (e0000003a496af80) ... crash> void __kprobes add_rp_inst(struct kretprobe_instance *ri) The data structure contains "familiar" garbage (ASCII data generated from the supplied "io-load-test.sh" program) in its uflist and hlist members: crash> kretprobe_instance e0000003a496af80 struct kretprobe_instance { uflist = { next = 0x373038656c69662f, pprev = 0x662f307269642036 }, hlist = { next = 0x2037373038656c69, pprev = 0x6c69662f30726964 }, rp = 0xa0000002050c30a0, ret_addr = 0xa00000010000bd70, task = 0xe0000003a1b20000 } crash> kretprobe_instance -r e0000003a496af80 e0000003a496af80: 373038656c69662f 662f307269642036 /file8076 dir0/f e0000003a496af90: 2037373038656c69 6c69662f30726964 ile8077 dir0/fil e0000003a496afa0: a0000002050c30a0 a00000010000bd70 .0......p....... e0000003a496afb0: e0000003a1b20000 ........ 
crash> The size-64 slab cache has been corrupted, and the slab page containing the kretprobe_instance structure above e0000003a496af80 is one of two contiguous corrupted slab pages, at e0000003a4968000 and e0000003a496c000: crash> kmem -s e0000003a496af80 kmem: size-64: full list: slab: e0000003a496c000 bad next pointer: 6c69662f30726964 kmem: size-64: full list: slab: e0000003a496c000 bad prev pointer: 6964203837333865 kmem: size-64: full list: slab: e0000003a496c000 bad inuse counter: 1818846767 kmem: size-64: full list: slab: e0000003a496c000 bad inuse counter: 1818846767 kmem: size-64: full list: slab: e0000003a496c000 bad s_mem pointer: 3072696420393733 kmem: size-64: partial list: slab: e0000003a4968000 bad next pointer: 3072696420373032 kmem: size-64: partial list: slab: e0000003a4968000 bad inuse counter: 812804452 kmem: size-64: partial list: slab: e0000003a4968000 bad s_mem pointer: 2039303237656c69 kmem: size-64: full list: slab: e0000003a4968000 bad next pointer: 3072696420373032 kmem: size-64: full list: slab: e0000003a4968000 bad inuse counter: 812804452 kmem: size-64: full list: slab: e0000003a4968000 bad s_mem pointer: 2039303237656c69 kmem: size-64: free list: slab: e0000003a4968000 bad next pointer: 3072696420373032 kmem: size-64: free list: slab: e0000003a4968000 bad inuse counter: 812804452 kmem: size-64: free list: slab: e0000003a4968000 bad s_mem pointer: 2039303237656c69 kmem: size-64: address not found in cache: e0000003a496af80 crash> So in this case, it doesn't appear that the kprobe code is using a stale kretprobe_instance pointer, but rather the running of "io-load-test.sh" script is leading to the corruption of memory. It is possible to follow the "io-load-test.sh" ASCII pattern across page boundaries by simply looking at the "file###" values. In this case the pattern starts somewhere around here: ... e0000003a49607b0: 0000000000000032 0000000000000001 2............... e0000003a49607c0: 004c465445535f46 5c2b375c305c305c F_SETFL.\0\0\7+\ e0000003a49607d0: 765c335c47373731 325c315c345c305c 177G\3\v\0\4\1\2 e0000003a49607e0: 315c30345c72365c 2e225c5c305c335c \6r\40\1\3\0\\". e0000003a49607f0: 313035656c002e2e 662f307269642032 ...le5012 dir0/f e0000003a4960800: 2033313035656c69 6c69662f30726964 ile5013 dir0/fil e0000003a4960810: 6964203431303565 35656c69662f3072 e5014 dir0/file5 e0000003a4960820: 3072696420353130 313035656c69662f 015 dir0/file501 e0000003a4960830: 662f307269642036 000000000000001c 6 dir0/f........ e0000003a4960840: 7c474e41484f4e57 45434152544e5557 WNOHANG|WUNTRACE e0000003a4960850: 49544e4f43577c44 3072007c4445554e D|WCONTINUED|.r0 e0000003a4960860: 323035656c69662f 662f307269642030 /file5020 dir0/f e0000003a4960870: 2031323035656c69 6c69662f30726964 ile5021 dir0/fil e0000003a4960880: 6964203232303565 35656c69662f3072 e5022 dir0/file5 e0000003a4960890: 3072696420333230 323035656c69662f 023 dir0/file502 e0000003a49608a0: 662f307269642034 2035323035656c69 4 dir0/file5025 e0000003a49608b0: 6c69662f30726964 6964203632303565 dir0/file5026 di e0000003a49608c0: 7c474e41484f4e57 45434152544e5557 WNOHANG|WUNTRACE e0000003a49608d0: 49544e4f43577c44 662f30004445554e D|WCONTINUED.0/f e0000003a49608e0: 2039323035656c69 6c69662f30726964 ile5029 dir0/fil e0000003a49608f0: 6964203033303565 35656c69662f3072 e5030 dir0/file5 e0000003a4960900: 3072696420313330 333035656c69662f 031 dir0/file503 e0000003a4960910: 662f307269642032 2033333035656c69 2 dir0/file5033 ... In the page above there are a few unrelated ASCII parts like the "WNHANG.." stuff. 
But from e0000003a4960e70 in the page above, ASCII data containing only the "dir#/file###" pattern continues uninterrupted, and which can be followed across page boundaries like this: ... e0000003a4963fd0: 6c69662f30726964 6964203433303665 dir0/file6034 di e0000003a4963fe0: 36656c69662f3072 3072696420353330 r0/file6035 dir0 e0000003a4963ff0: 333036656c69662f 662f307269642036 /file6036 dir0/f e0000003a4964000: 2037333036656c69 6c69662f30726964 ile6037 dir0/fil e0000003a4964010: 6964203833303665 36656c69662f3072 e6038 dir0/file6 e0000003a4964020: 3072696420393330 343036656c69662f 039 dir0/file604 ... until it seemingly ends here at e0000003a4971970: ... e0000003a4971860: 6c69662f30726964 6964203439393965 dir0/file9994 di e0000003a4971870: 39656c69662f3072 3072696420353939 r0/file9995 dir0 e0000003a4971880: 393939656c69662f 662f307269642036 /file9996 dir0/f e0000003a4971890: 2037393939656c69 6c69662f30726964 ile9997 dir0/fil e0000003a49718a0: 6964203839393965 39656c69662f3072 e9998 dir0/file9 e0000003a49718b0: 3830382d00393939 30382d7269642035 999.-8085 dir-80 e0000003a49718c0: 382d726964203638 2d72696420373830 86 dir-8087 dir- e0000003a49718d0: 7269642038383038 696420393830382d 8088 dir-8089 di e0000003a49718e0: 6964203930382d72 6420303930382d72 r-809 dir-8090 d e0000003a49718f0: 20313930382d7269 323930382d726964 ir-8091 dir-8092 e0000003a4971900: 3930382d72696420 30382d7269642033 dir-8093 dir-80 e0000003a4971910: 382d726964203439 2d72696420353930 94 dir-8095 dir- e0000003a4971920: 7269642036393038 696420373930382d 8096 dir-8097 di e0000003a4971930: 6420383930382d72 20393930382d7269 r-8098 dir-8099 e0000003a4971940: 642031382d726964 64203031382d7269 dir-81 dir-810 d e0000003a4971950: 20303031382d7269 313031382d726964 ir-8100 dir-8101 e0000003a4971960: 3031382d72696420 31382d7269642032 dir-8102 dir-81 e0000003a4971970: 382d726964203330 2d72696420343031 03 dir-8104 dir- e0000003a4971980: 00000000000081a4 4ab9336c4ab93231 ........12.Jl3.J e0000003a4971990: 4ab9336c4ab9336c 0000000000000000 l3.Jl3.J........ e0000003a49719a0: 0000000000000000 0000000000000000 ................ e0000003a49719b0: 0000000000000000 0000000000000000 ................ ... So the pages containing this span of continuous 64K+ ASCII stream are these: (1) e0000003a4960000 -- allocated "size-16384" slab data page (2) e0000003a4964000 -- not allocated in slab subsystem (3) e0000003a4968000 -- "size-64" slab page (4) e0000003a496c000 -- "size-64" slab page (5) e0000003a4970000 -- not allocated in slab subsystem Clearly the pages (3) and (4) being used by the "size-64" slab cache have been corrupted. Page (1) is a "size-16384" slab data page that is currently allocated -- it appears to be corrupted because its ASCII data stream flows into the next page above it: e0000003a4963fe0: 36656c69662f3072 3072696420353330 r0/file6035 dir0 e0000003a4963ff0: 333036656c69662f 662f307269642036 /file6036 dir0/f e0000003a4964000: 2037333036656c69 6c69662f30726964 ile6037 dir0/fil e0000003a4964010: 6964203833303665 36656c69662f3072 e6038 dir0/file6 e0000003a4964020: 3072696420393330 343036656c69662f 039 dir0/file604 so it goes beyond its 16K limit, and in fact the uninterrupted ASCII stream continues unabated into page (5). In any case, I'm going to go back and look at the first two vmcores again. Perhaps the systemtap operations -- which generate many thousands "size-64" slab cache requests -- are innocent bystanders that get trashed by something generated as a result of the "io-load-test.sh" operation?
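One way to read the evidence above: an ASCII stream that begins inside a currently-allocated "size-16384" object (page (1) at e0000003a4960000) and then runs uninterrupted across whatever pages happen to follow it is the classic signature of a buffer being written past the end of its allocation. Purely as an illustration of that failure mode -- this is not code from systemtap, the kernel, or io-load-test.sh, and the names are made up -- the bug class looks like this:

/*
 * Illustration only.  A 16K kmalloc'd buffer (a "size-16384" slab object)
 * filled from some ASCII source without clamping the length: everything
 * past the 16K mark lands in whatever pages follow the allocation, e.g.
 * size-64 slab pages holding kretprobe_instance structures.
 */
#include <linux/slab.h>
#include <linux/string.h>

#define LISTING_BUF_SIZE 16384	/* hypothetical buffer size */

static char *collect_listing(const char *src, size_t src_len)
{
	char *buf = kmalloc(LISTING_BUF_SIZE, GFP_KERNEL);

	if (!buf)
		return NULL;
	/* BUG: src_len is not clamped to LISTING_BUF_SIZE */
	memcpy(buf, src, src_len);
	return buf;
}

If that is what is happening, identifying the owner of the size-16384 object at the start of the stream would be the next step.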
Revisiting one of the two original vmcores, the symptoms of corruption are remarkably similar to the crash that I was able to reproduce. In this case, an "stapio" process was trying to unregister a set of kretprobes:

crash> bt
PID: 4287 TASK: e00000409b170000 CPU: 1 COMMAND: "stapio" #0 [BSP:e00000409b1714a0] machine_kexec at a00000010005a380 #1 [BSP:e00000409b171480] crash_kexec at a0000001000cef90 #2 [BSP:e00000409b171418] panic at a00000010007bbd0 #3 [BSP:e00000409b1713c8] die at a000000100037ef0 #4 [BSP:e00000409b171378] ia64_do_page_fault at a00000010066dd90 #5 [BSP:e00000409b171378] __ia64_leave_kernel at a00000010000bfe0 EFRAME: e00000409b177bb0 B0: a00000010066f230 CR_IIP: a00000010066f1f0 CR_IPSR: 00001210085a6010 CR_IFS: 8000000000000389 AR_PFS: 0000000000000389 AR_RSC: 0000000000000003 AR_UNAT: 0000000000000000 AR_RNAT: 0000000000551559 AR_CCV: 0000000000000000 AR_FPSR: 0009804c8a70033f LOADRS: 0000000000000000 AR_BSPSTORE: a0000001001691c0 B6: a0000001006690e0 B7: a00000010000b7e0 PR: 0000000000556659 R1: a000000100c36560 R2: 0000000000000389 R3: a0000001001caa60 R8: e0000040833ae040 R9: 0000000000000054 R10: a000000209f336b8 R11: e000004089508028 R12: e00000409b177d70 R13: e00000409b170000 R14: 392d726964203439 R15: e0000040833ae048 R16: 0000000000000000 R17: 0000000000200200 R18: 0000000000200200 R19: 392d726964203441 R20: 2d72696420303439 R21: 392d726964203439 R22: 0000000000000063 R23: 0000000000000062 R24: 0000000000000080 R25: 0000000000024000 R26: 0000000000000000 R27: 0000000000004000 R28: 0000000000004000 R29: 00000000070f04ce R30: 0000000000000000 R31: a000000100a4f708 F6: 1003e00000000000000a5 F7: 0ffdd8000000000000000 F8: 10006a4ffd6c000000000 F9: 100058000000000000000 F10: 10014b23947fff4dc6b80 F11: 1003e00000000002c8e52 #6 [BSP:e00000409b171330] unregister_kretprobes at a00000010066f1f0 #7 [BSP:e00000409b1712f8] _stp_cleanup_and_exit at a000000209e185d0 #8 [BSP:e00000409b171270] _stp_ctl_write_cmd at a000000209e19170 #9 [BSP:e00000409b171220] vfs_write at a000000100177ea0 #10 [BSP:e00000409b1711a0] kretprobe_trampoline at a000000100059f80 #11 [BSP:e00000409b171128] unregister_kretprobes at a00000010066f230
crash>

It crashed in the loop where cleanup_rp_inst() is called:

void __kprobes unregister_kretprobes(struct kretprobe **rps, int num)
{
	int i;

	if (num <= 0)
		return;
	mutex_lock(&kprobe_mutex);
	for (i = 0; i < num; i++)
		if (__unregister_kprobe_top(&rps[i]->kp) < 0)
			rps[i]->kp.addr = NULL;
	mutex_unlock(&kprobe_mutex);

	synchronize_sched();
	for (i = 0; i < num; i++) {
		if (rps[i]->kp.addr) {
			__unregister_kprobe_bottom(&rps[i]->kp);
			cleanup_rp_inst(rps[i]);
		}
	}
}

The "rps" and "num" arguments were:

 #6 [BSP:e00000409b171330] unregister_kretprobes at a00000010066f1f0 (a000000209f14710, 106)

While cycling through the array of 262 (0x106) kretprobe pointers, it was working on the one at index 84 (0x54) located at address a000000209f149b0:

crash> rd a000000209f14710 262
a000000209f14710:  a000000209f2c280 a000000209f2c3e0   ................
a000000209f14720:  a000000209f2c540 a000000209f2c6a0   @...............
a000000209f14730:  a000000209f2c800 a000000209f2c960   ........`.......
... [ snip ] ...
a000000209f149b0:  a000000209f33600 a000000209f33760   .6......`7......
... [ snip ] ...
The kretprobe pointer there (a000000209f33600) points to this:

crash> kretprobe a000000209f33600
struct kretprobe {
  kp = {
    hlist = {
      next = 0x0,
      pprev = 0x0
    },
    list = {
      next = 0xa000000209f336d0,
      prev = 0x200200
    },
    mod_refcounted = 0,
    nmissed = 0,
    addr = 0xa0000001001caa60,
    symbol_name = 0x0,
    offset = 0,
    pre_handler = @0xa00000010077fde0: 0xa00000010066ff40 <pre_handler_kretprobe>,
    post_handler = 0,
    fault_handler = 0,
    break_handler = 0,
    opcode = {
      bundle = {
        quad0 = {
          template = 8,
          slot0 = 33816576,
          slot1_p0 = 896
        },
        quad1 = {
          slot1_p1 = 4341815,
          slot2 = 712462303360
        }
      }
    },
    ainsn = {
      insn = 0xa000000207b3a080,
      inst_flag = 8,
      target_br_reg = 0
    }
  },
  handler = 0xa000000209ed9668,
  maxactive = 4096,
  nmissed = 0,
  free_instances = {
    first = 0xe0000040833ae040
  },
  used_instances = {
    first = 0x0
  }
}

And the code is in the act of dealing with the "free_instances" hlist_head, which points to an hlist_node at 0xe0000040833ae040,

crash> hlist_node 0xe0000040833ae040
struct hlist_node {
  next = 0x392d726964203439,
  pprev = 0x2d72696420303439
}

which is embedded in a kretprobe_instance structure:

struct kretprobe_instance {
	struct hlist_node uflist; /* either on free list or used list */
	struct hlist_node hlist;
	struct kretprobe *rp;
	kprobe_opcode_t *ret_addr;
	struct task_struct *task;
};

And that's where the crash occurs:

crash> log | grep 2d72696420303439
Unable to handle kernel paging request at virtual address 2d72696420303439
r20 : 2d72696420303439 r21 : 392d726964203439 r22 : 0000000000000063
crash>

The kretprobe_instance structure is 56 bytes long, and is allocated from the "size-64" slab cache. Because of its small size, it will fit on a slab page where the slab structure sits at the bottom of the page, followed by as many 64-byte objects as fit in the rest of the page. And the whole slab page containing the structure is corrupted:

crash> kmem -s 0xe0000040833ae040
kmem: size-64: full list: slab: e0000040833ac000 bad next pointer: 6420333031392d72
kmem: size-64: full list: slab: e0000040833ac000 bad prev pointer: 20343031392d7269
kmem: size-64: full list: slab: e0000040833ac000 bad inuse counter: 1768169526
kmem: size-64: full list: slab: e0000040833ac000 bad inuse counter: 1768169526
kmem: size-64: full list: slab: e0000040833ac000 bad s_mem pointer: 3031392d72696420
kmem: size-64: partial list: slab: e0000040833ac000 bad next pointer: 6420333031392d72
kmem: size-64: partial list: slab: e0000040833ac000 bad inuse counter: 1768169526
kmem: size-64: partial list: slab: e0000040833ac000 bad s_mem pointer: 3031392d72696420
kmem: size-64: full list: slab: e0000040833ac000 bad next pointer: 6420333031392d72
kmem: size-64: full list: slab: e0000040833ac000 bad inuse counter: 1768169526
kmem: size-64: full list: slab: e0000040833ac000 bad s_mem pointer: 3031392d72696420
kmem: size-64: free list: slab: e0000040833ac000 bad next pointer: 6420333031392d72
kmem: size-64: free list: slab: e0000040833ac000 bad inuse counter: 1768169526
kmem: size-64: free list: slab: e0000040833ac000 bad s_mem pointer: 3031392d72696420
kmem: size-64: address not found in cache: e0000040833ae040
crash>

The slab page contains ASCII data generated from the "io-load-test.sh" script, from the beginning of the page (corrupting the slab structure) for about 8832 bytes of the 16k page:

crash> rd e0000040833ac000 2048
e0000040833ac000:  6420333031392d72 20343031392d7269   r-9103 dir-9104
e0000040833ac010:  353031392d726964 3031392d72696420   dir-9105 dir-910
e0000040833ac020:  31392d7269642036 392d726964203730   6 dir-9107 dir-9
e0000040833ac030: 2d72696420383031 7269642039303139 108 dir-9109 dir e0000040833ac040: 726964203131392d 696420303131392d -911 dir-9110 di e0000040833ac050: 6420313131392d72 20323131392d7269 r-9111 dir-9112 e0000040833ac060: 333131392d726964 3131392d72696420 dir-9113 dir-911 e0000040833ac070: 31392d7269642034 392d726964203531 4 dir-9115 dir-9 e0000040833ac080: 2d72696420363131 7269642037313139 116 dir-9117 dir e0000040833ac090: 696420383131392d 6420393131392d72 -9118 dir-9119 d e0000040833ac0a0: 64203231392d7269 20303231392d7269 ir-912 dir-9120 ... [ snip repeating pattern ] ... e0000040833ae1d0: 2d72696420303839 7269642031383939 980 dir-9981 dir e0000040833ae1e0: 696420323839392d 6420333839392d72 -9982 dir-9983 d e0000040833ae1f0: 20343839392d7269 353839392d726964 ir-9984 dir-9985 e0000040833ae200: 3839392d72696420 39392d7269642036 dir-9986 dir-99 e0000040833ae210: 392d726964203738 2d72696420383839 87 dir-9988 dir- e0000040833ae220: 7269642039383939 726964203939392d 9989 dir-999 dir e0000040833ae230: 696420303939392d 6420313939392d72 -9990 dir-9991 d e0000040833ae240: 20323939392d7269 333939392d726964 ir-9992 dir-9993 e0000040833ae250: 3939392d72696420 39392d7269642034 dir-9994 dir-99 e0000040833ae260: 392d726964203539 2d72696420363939 95 dir-9996 dir- e0000040833ae270: 7269642037393939 696420383939392d 9997 dir-9998 di e0000040833ae280: e000393939392d72 e0000040833ae240 r-9999..@.:.@... e0000040833ae290: 000000000000dac4 0000000000000000 ................ e0000040833ae2a0: 2000000006f3fa00 2000000003b5cbd0 ....... ....... e0000040833ae2b0: 0000000000001408 0000000000000040 ........@....... e0000040833ae2c0: e0000040833ae300 e0000040833ae280 ..:.@.....:.@... ... The actual beginning of the ASCII pattern starts in the middle of a data page much lower in memory, which is from the "size-16384" slab cache, and extends to the location shown above: crash> kmem -s e000004083398000 CACHE NAME OBJSIZE ALLOCATED TOTAL SLABS SSIZE e0000001073c8880 size-16384 16384 26 33 33 16k SLAB MEMORY TOTAL ALLOCATED FREE e0000040c9ef3380 e000004083398000 1 1 0 FREE / [ALLOCATED] [e000004083398000] crash> crash> rd e000004083398000 2048 ... [ snip ] ... e0000040833988f0: 636c2d2064656465 2d206363676c2d20 eded -lc -lgcc - e000004083398900: 6465656e2d73612d 6363676c2d206465 -as-needed -lgcc e000004083398910: 2d6f6e2d2d20735f 656465656e2d7361 _s --no-as-neede e000004083398920: 6c2f7273752f2064 692f6363672f6269 d /usr/lib/gcc/i e000004083398930: 686465722d343661 78756e696c2d7461 a64-redhat-linux e000004083398940: 632f322e312e342f 206f2e646e657472 /4.1.2/crtend.o e000004083398950: 62696c2f7273752f 3661692f6363672f /usr/lib/gcc/ia6 e000004083398960: 7461686465722d34 342f78756e696c2d 4-redhat-linux/4 e000004083398970: 2f2e2e2f322e312e 72632f2e2e2f2e2e .1.2/../../../cr e000004083398980: 662f30006f2e6e74 2031343035656c69 tn.o.0/file5041 e000004083398990: 6c69662f30726964 6964203234303565 dir0/file5042 di e0000040833989a0: 35656c69662f3072 3072696420333430 r0/file5043 dir0 e0000040833989b0: 343035656c69662f 662f307269642034 /file5044 dir0/f e0000040833989c0: 2035343035656c69 6c69662f30726964 ile5045 dir0/fil e0000040833989d0: 6964203634303565 35656c69662f3072 e5046 dir0/file5 e0000040833989e0: 3072696420373430 343035656c69662f 047 dir0/file504 e0000040833989f0: 662f307269642038 2039343035656c69 8 dir0/file5049 e000004083398a00: 6c69662f30726964 6964203035303565 dir0/file5050 di ... [ snip ] ... 
e000004083398d40: 2039303135656c69 6c69662f30726964 ile5109 dir0/fil e000004083398d50: 6964203031313565 35656c69662f3072 e5110 dir0/file5 e000004083398d60: 3072696420313131 2000000801842878 111 dir0x(..... e000004083398d70: 6e776f6e6b6e753c 49544e4f4357003e <unknown>.WCONTI e000004083398d80: 6c69007c4445554e 6964203431313565 NUED|.ile5114 di e000004083398d90: 35656c69662f3072 3072696420353131 r0/file5115 dir0 e000004083398da0: 313135656c69662f 662f307269642036 /file5116 dir0/f e000004083398db0: 2037313135656c69 6c69662f30726964 ile5117 dir0/fil e000004083398dc0: 6964203831313565 35656c69662f3072 e5118 dir0/file5 e000004083398dd0: 3072696420393131 323135656c69662f 119 dir0/file512 e000004083398de0: 662f307269642030 0000000000000000 0 dir0/f........ e000004083398df0: 2020202020202020 2064687373202020 sshd e000004083398e00: 3637322020203a72 2042694b20323338 r: 276832 KiB e000004083398e10: 3937322020203a77 0a42694b20373232 w: 279227 KiB. e000004083398e20: 2020202020202020 206f697061747320 stapio e000004083398e30: 3633322020203a72 2042694b20333136 r: 236613 KiB e000004083398e40: 3633322020203a77 0a42694b20323036 w: 236602 KiB. e000004083398e50: 2020202020202020 2068736172632020 crash e000004083398e60: 3831362020203a72 2042694b20333137 r: 618713 KiB e000004083398e70: 35656c69662f3072 3072696420313331 r0/file5131 dir0 e000004083398e80: 333135656c69662f 662f307269642032 /file5132 dir0/f ... From there the ASCII data continues on into the next page in an uninterrupted (corrupting) fashion: e00000408339bfd0: 6c69662f30726964 6964203433303665 dir0/file6034 di e00000408339bfe0: 36656c69662f3072 3072696420353330 r0/file6035 dir0 e00000408339bff0: 333036656c69662f 662f307269642036 /file6036 dir0/f e00000408339c000: 2037333036656c69 6c69662f30726964 ile6037 dir0/fil e00000408339c010: 6964203833303665 36656c69662f3072 e6038 dir0/file6 e00000408339c020: 3072696420393330 343036656c69662f 039 dir0/file604 The next page at e00000408339c000 above is a "size-512" slab page, and so the slab structure at its base is corrupted: crash> kmem -s e00000408339c000 kmem: size-512: full list: slab: e00000408339c080 bad next pointer: 6964203634303665 kmem: size-512: full list: slab: e00000408339c080 bad prev pointer: 36656c69662f3072 kmem: size-512: full list: slab: e00000408339c080 bad inuse counter: 1768169528 kmem: size-512: full list: slab: e00000408339c080 bad inuse counter: 1768169528 kmem: size-512: full list: slab: e00000408339c080 bad s_mem pointer: 343036656c69662f kmem: size-512: partial list: slab: e00000408339c080 bad next pointer: 6964203634303665 kmem: size-512: partial list: slab: e00000408339c080 bad inuse counter: 1768169528 kmem: size-512: partial list: slab: e00000408339c080 bad s_mem pointer: 343036656c69662f kmem: size-512: full list: slab: e00000408339c080 bad next pointer: 6964203634303665 kmem: size-512: full list: slab: e00000408339c080 bad inuse counter: 1768169528 kmem: size-512: full list: slab: e00000408339c080 bad s_mem pointer: 343036656c69662f kmem: size-512: free list: slab: e00000408339c080 bad next pointer: 6964203634303665 kmem: size-512: free list: slab: e00000408339c080 bad inuse counter: 1768169528 kmem: size-512: free list: slab: e00000408339c080 bad s_mem pointer: 343036656c69662f kmem: size-512: address not found in cache: e00000408339c000 crash> The corruption continues through the next two pages, which are not used by the slab subsystem, and then finally bumps into the "size-64" slab cache page that lead to eventual kernel crash: crash> kmem -s e0000040833a0000 kmem: address is 
not allocated in slab subsystem: e0000040833a0000
crash> kmem -s e0000040833a4000
kmem: address is not allocated in slab subsystem: e0000040833a4000
crash> kmem -s e0000040833ac000
kmem: size-64: full list: slab: e0000040833ac000 bad next pointer: 6420333031392d72
kmem: size-64: full list: slab: e0000040833ac000 bad prev pointer: 20343031392d7269
kmem: size-64: full list: slab: e0000040833ac000 bad inuse counter: 1768169526
kmem: size-64: full list: slab: e0000040833ac000 bad inuse counter: 1768169526
kmem: size-64: full list: slab: e0000040833ac000 bad s_mem pointer: 3031392d72696420
kmem: size-64: partial list: slab: e0000040833ac000 bad next pointer: 6420333031392d72
kmem: size-64: partial list: slab: e0000040833ac000 bad inuse counter: 1768169526
kmem: size-64: partial list: slab: e0000040833ac000 bad s_mem pointer: 3031392d72696420
kmem: size-64: full list: slab: e0000040833ac000 bad next pointer: 6420333031392d72
kmem: size-64: full list: slab: e0000040833ac000 bad inuse counter: 1768169526
kmem: size-64: full list: slab: e0000040833ac000 bad s_mem pointer: 3031392d72696420
kmem: size-64: free list: slab: e0000040833ac000 bad next pointer: 6420333031392d72
kmem: size-64: free list: slab: e0000040833ac000 bad inuse counter: 1768169526
kmem: size-64: free list: slab: e0000040833ac000 bad s_mem pointer: 3031392d72696420
kmem: size-64: address not found in cache: e0000040833ac000
crash>

So like the crash that I was able to reproduce, the kprobe code seems to be referencing legitimate slab cache addresses which have been corrupted due to the running of the "io-load-test.sh" script. So I guess the question is:

  (1) Are the systemtap/kprobe operations an innocent victim of something that the "io-load-test.sh" causes to happen?

Or:

  (2) Do the systemtap/kprobe operations interfere with the kernel code that runs as a result of the "io-load-test.sh"?

And I have no idea which is the case. Perhaps testing should be done that takes one of them out of the picture, i.e., testing that:

  (1) uses the systemtap scripts -- but NOT "io-load-test.sh"
  (2) uses "io-load-test.sh" -- but NOT the systemtap scripts

And in the two cases above, run something else to get a heavy system load going. I'm curious whether Fujitsu has ever seen the problem occur without both things (systemtap and io-load-test.sh) going on simultaneously?

Dave
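To make the failing operation a bit more concrete: tearing down a kretprobe means walking its free_instances/used_instances hlists and freeing each kretprobe_instance. Below is a simplified sketch of that kind of walk (my own sketch based on the kretprobe_instance definition quoted earlier, not the literal RHEL-5 cleanup_rp_inst(), which also takes the kretprobe lock and handles the used list):

#include <linux/kprobes.h>
#include <linux/list.h>
#include <linux/slab.h>

static void free_rp_instances(struct kretprobe *rp)
{
	struct kretprobe_instance *ri;
	struct hlist_node *pos, *next;

	hlist_for_each_entry_safe(ri, pos, next, &rp->free_instances, uflist) {
		/*
		 * With free_instances.first = 0xe0000040833ae040 pointing into
		 * the corrupted size-64 slab page, ri->uflist.next and .pprev
		 * are ASCII garbage; dereferencing the bogus pprev
		 * (2d72696420303439) faults -- that is the address in the
		 * panic message.
		 */
		hlist_del(&ri->uflist);
		kfree(ri);
	}
}

In other words, the kprobes code appears to be doing something perfectly reasonable with a list whose contents have already been destroyed.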
> As a result, I guess that this issue might only occur with the combination
> of systemtap scripts and "io-load-test.sh".
>
> I will test a little more in other kinds of high load situations.

OK. Also, it would be helpful if we could at least try to narrow down *which* systemtap script (or which combination of scripts) seems to be the causal factor. Just running all of the scripts and then running the io-load-test.sh on top of them makes it almost impossible to debug. All we really have to work with is that the directories and files created by the io-load-test.sh script seem to be listed in an ASCII data buffer that gets scribbled across kernel memory, corrupting whatever was there.
Just an update... I have only been able to reproduce this 3 more times, and so far it has required:

 1. all of the systemtap tests provided to be running
 2. the io-load-test.sh script to be running
 3. additional test load that bumps the load average very high and consumes all available memory
    (using the usex exerciser running a combination of disk I/O, virtual memory tests, and some hand-created network activity)

However, the vmcore files all show the same symptom -- a slab page in the size-64 slab cache (and perhaps another one as well) gets corrupted with recognizable ASCII data from the io-load-test.sh script, and eventually some entity ends up accessing memory from a corrupted page. When running the test, the size-64 slab cache typically has several hundred thousand objects allocated, so that could be why it tends to be the slab cache most likely to be corrupted. (?) The corruption is always several contiguous pages in length.

In any case, given that the kernel crashes occur well after the corruption has taken place, there's no smoking gun to work with, just the results of the corruption. I did try to narrow it down to one or more of the systemtap scripts, but was unsuccessful. I'll keep trying, but at this point, I've pretty much gotten nowhere...
Another update: I was able to reproduce the crash with just 3 of the systemtap scripts running:

  disktop.stp
  para-callgraph2.stp sys_read '*@fs/*.c'
  test1.stp

And again, I ran the io-load-test.sh script, plus a heavy load of tests from a "usex" framework that created both a heavy load and a low-memory scenario.

The crash was familiar -- an "stapio" task associated with the para-callgraph2.stp script was doing an unregister_kretprobes() of a kretprobe_instance, and bumped into the same type of ASCII data corruption of the whole "size-64" slab page containing the structure. The unregister_kretprobes() call was being done because that particular script always fails within a few seconds, like so:

# stap -v -k -o /tmp/result/paracall.result para-callgraph2.stp sys_read '*@fs/*.c'
Pass 1: parsed user script and 51 library script(s) in 343usr/6sys/429real ms.
Pass 2: analyzed script: 3277 probe(s), 12 function(s), 2 embed(s), 3 global(s) in 956usr/323sys/8312real ms.
Pass 3: translated to C into "/tmp/stapbuTHw6/stap_25967.c" in 446usr/150sys/604real ms.
Pass 4: compiled C into "stap_25967.ko" in 8264usr/407sys/8585real ms.
Pass 5: starting run.
ERROR: probe overhead exceeded threshold
WARNING: Number of errors: 1, skipped probes: 0
Pass 5: run completed in 17usr/4795sys/4951real ms.
Keeping temporary directory "/tmp/stapbuTHw6"
# dmesg
... [ snip ] ...
stap_25967: systemtap: 0.9.7/0.141, base: a000000202340000, memory: 2903064+73024+58944+14000 data+text+ctx+net, probes: 3277
#

Because the script always fails, my tests had been set up to repeatedly re-run it after each failure. In any case, because of para-callgraph2.stp's "failure mode", and the fact that it is so kernel-intrusive (placing 3277 probes in all of the functions in the kernel's fs directory), it seems highly likely that it is a prime suspect.
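For context on what the module is doing at that point: each registered kretprobe pre-allocates maxactive kretprobe_instance structures (the 56-byte, size-64 slab objects seen corrupted in the vmcores) onto its free_instances list, and unregistering frees them again. A stripped-down sketch of a single such probe is below -- my own illustration, not the stap-generated code; the symbol and maxactive value are placeholders. para-callgraph2.stp effectively does this for 3277 functions at once, and then tears them all down as soon as the "probe overhead exceeded threshold" error fires.

#include <linux/module.h>
#include <linux/kprobes.h>

static int ret_handler(struct kretprobe_instance *ri, struct pt_regs *regs)
{
	/* ri was taken off rp->free_instances when the probed function was entered */
	return 0;
}

static struct kretprobe rp = {
	.kp.symbol_name	= "vfs_read",	/* placeholder symbol */
	.handler	= ret_handler,
	.maxactive	= 16,		/* number of kretprobe_instance objects pre-allocated */
};

static int __init rp_init(void)
{
	return register_kretprobe(&rp);
}

static void __exit rp_exit(void)
{
	unregister_kretprobe(&rp);
}

module_init(rp_init);
module_exit(rp_exit);
MODULE_LICENSE("GPL");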
I have a new crash which seems to pinpoint the para-callgraph.stp kernel module. After a reboot, all I did was to start a shell script that simply ran the para-callgraph.stp script in a loop -- because as I mentioned above, it always fails in a matter of a few seconds. After the script had run only 2 (or 3?) times, I entered Ctrl-C and the system crashed. Neither the io-load-test.sh script, the usex test suite, nor any other systemtap script was running or had been run -- again, it was just after a fresh system reboot. Unfortunately the vmcore that was created was corrupt, and caused the crash utility to generate a segmentation violation while trying to parse through the NT_PRSTATUS segments. The "readelf" command confirms the corruption immediately, and you can see the bogus Elf Note sections at the end:

$ readelf -a vmcore readelf: Warning: corrupt note found at offset 12cc into core notes readelf: Warning: type: 7fe07c78, namesize: 00a94db0, descsize: a0000001 ELF Header: Magic: 7f 45 4c 46 02 01 01 00 00 00 00 00 00 00 00 00 Class: ELF64 Data: 2's complement, little endian OS/ABI: UNIX - System V ABI Version: 0 Type: CORE (Core file) Machine: Intel IA-64 Version: 0x1 Entry point address: 0x0 Start of program headers: 64 (bytes into file) Start of section headers: 0 (bytes into file) Flags: 0x0 Size of this header: 64 (bytes) Size of program headers: 56 (bytes) Number of program headers: 34 Size of section headers: 0 (bytes) Number of section headers: 0 Section header string table index: 0 There are no sections in this file. There are no sections in this file. Program Headers: Type Offset VirtAddr PhysAddr FileSiz MemSiz Flags Align NOTE 0x00000000000007b0 0x0000000000000000 0x0000000000000000 0x0000000000002064 0x0000000000002064 0 LOAD 0x0000000000002814 0xa000000100000000 0x0000000004000000 0x00000000006709a0 0x00000000006709a0 RWE 0 LOAD 0x00000000006731b4 0xe000000000000000 0x0000000000000000 0x0000000000001000 0x0000000000001000 RWE 0 LOAD 0x00000000006741b4 0xe000000000001000 0x0000000000001000 0x000000000008a000 0x000000000008a000 RWE 0 LOAD 0x00000000006fe1b4 0xe00000000008b000 0x000000000008b000 0x0000000000015000 0x0000000000015000 RWE 0 LOAD 0x00000000007131b4 0xe000000000100000 0x0000000000100000 0x0000000003f00000 0x0000000003f00000 RWE 0 LOAD 0x00000000046131b4 0xe000000004000000 0x0000000004000000 0x0000000000e0d000 0x0000000000e0d000 RWE 0 LOAD 0x00000000054201b4 0xe000000004e0d000 0x0000000004e0d000 0x00000000031f3000 0x00000000031f3000 RWE 0 LOAD 0x00000000086131b4 0xe000000028000000 0x0000000028000000 0x0000000050000000 0x0000000050000000 RWE 0 LOAD 0x00000000586131b4 0xe00000007c000000 0x000000007c000000 0x000000000295a000 0x000000000295a000 RWE 0 LOAD 0x000000005af6d1b4 0xe00000007f300000 0x000000007f300000 0x0000000000a8c000 0x0000000000a8c000 RWE 0 LOAD 0x000000005b9f91b4 0xe00000007fd8c000 0x000000007fd8c000 0x000000000000c000 0x000000000000c000 RWE 0 LOAD 0x000000005ba051b4 0xe00000007fda0000 0x000000007fda0000 0x0000000000004000 0x0000000000004000 RWE 0 LOAD 0x000000005ba091b4 0xe000000100000000 0x0000000100000000 0x0000000376e3a000 0x0000000376e3a000 RWE 0 LOAD 0x00000003d28431b4 0xe000000476e3a000 0x0000000476e3a000 0x0000000000558000 0x0000000000558000 RWE 0 LOAD 0x00000003d2d9b1b4 0xe000000477392000 0x0000000477392000 0x000000000005e000 0x000000000005e000 RWE 0 LOAD 0x00000003d2df91b4 0xe0000004773f0000 0x00000004773f0000 0x0000000000002000 0x0000000000002000 RWE 0 LOAD 0x00000003d2dfb1b4 0xe0000004773f2000 0x00000004773f2000 0x0000000000050000 0x0000000000050000 RWE 0 LOAD 0x00000003d2e4b1b4 0xe000000477442000 0x0000000477442000 0x000000000000c000 0x000000000000c000 RWE 0 LOAD 0x00000003d2e571b4 0xe00000047744e000 0x000000047744e000 0x00000000043b0000 0x00000000043b0000 RWE 0 LOAD 0x00000003d72071b4 0xe00000047f800000 0x000000047f800000 0x0000000000001000 0x0000000000001000 RWE 0 LOAD 0x00000003d72081b4 0xe00000047f801000 0x000000047f801000 0x00000000000c5000 0x00000000000c5000 RWE 0 LOAD 0x00000003d72cd1b4 0xe00000047f8c6000 0x000000047f8c6000 0x0000000000002000 0x0000000000002000 RWE 0 LOAD 0x00000003d72cf1b4 0xe00000047f8c8000 0x000000047f8c8000 0x000000000004e000 0x000000000004e000 RWE 0 LOAD 0x00000003d731d1b4 0xe00000047f916000 0x000000047f916000 0x0000000000002000 0x0000000000002000 RWE 0 LOAD 0x00000003d731f1b4 0xe00000047f918000 0x000000047f918000 0x0000000000002000 0x0000000000002000 RWE 0 LOAD 0x00000003d73211b4 0xe00000047f91a000 0x000000047f91a000 0x0000000000002000 0x0000000000002000 RWE 0 LOAD 0x00000003d73231b4 0xe00000047f91c000 0x000000047f91c000 0x0000000000008000 0x0000000000008000 RWE 0 LOAD 0x00000003d732b1b4 0xe00000047f924000 0x000000047f924000 0x00000000000d6000 0x00000000000d6000 RWE 0 LOAD 0x00000003d74011b4 0xe00000047f9fa000 0x000000047f9fa000 0x0000000000001000 0x0000000000001000 RWE 0 LOAD 0x00000003d74021b4 0xe00000047f9fb000 0x000000047f9fb000 0x0000000000003000 0x0000000000003000 RWE 0 LOAD 0x00000003d74051b4 0xe00000047f9fe000 0x000000047f9fe000 0x0000000000358000 0x0000000000358000 RWE 0 LOAD 0x00000003d775d1b4 0xe00000047fd56000 0x000000047fd56000 0x000000000006c000 0x000000000006c000 RWE 0 LOAD 0x00000003d77c91b4 0xe00000047fe7e000 0x000000047fe7e000 0x0000000000124000 0x0000000000124000 RWE 0 There is no dynamic section in this file. There are no relocations in this file. There are no unwind sections in this file. No version information found in this file. Notes at offset 0x000007b0 with length 0x00002064: Owner Data size Description CORE 0x00000478 NT_PRSTATUS (prstatus structure) CORE 0x00000478 NT_PRSTATUS (prstatus structure) CORE 0x00000478 NT_PRSTATUS (prstatus structure) CORE 0x00000478 NT_PRSTATUS (prstatus structure) (NONE) 0x00000000 Unknown note type: (0x00000000) (NONE) 0x00000000 Unknown note type: (0x00000000) (NONE) 0x00000000 Unknown note type: (0x00000000) (NONE) 0x00000000 Unknown note type: (0x00000000) (NONE) 0x00000000 Unknown note type: (0x00000000) (NONE) 0x00000000 Unknown note type: (0x00000000) (NONE) 0x00000000 Unknown note type: (0x00000000) (NONE) 0x00000000 Unknown note type: (0x00000000) (NONE) 0x00000000 Unknown note type: (0x00000000) (NONE) 0x00000000 Unknown note type: (0x00000000) (NONE) 0x00000000 Unknown note type: (0x00000000) (NONE) 0x00000000 Unknown note type: (0x00000000) (NONE) 0x00000000 Unknown note type: (0x00000000) (NONE) 0x00000000 Unknown note type: (0x00000000) (NONE) 0x00000000 Unknown note type: (0x00000000) (NONE) 0x00000000 Unknown note type: (0x00000000) (NONE) 0x00000000 Unknown note type: (0xffffffff) $

There should be an NT_PRSTATUS note for each processor, and for whatever reason, the kdump process failed. However, I was able to create a patched version of the crash utility that would ignore the bogus notes section and continue on.
Here is the backtrace, with the -f option adding function arguments: crash> bt -f PID: 5515 TASK: e00000016cd78000 CPU: 7 COMMAND: "staprun" #0 [BSP:e00000016cd794e8] machine_kexec at a00000010005a380 (e000000104a2b280) #1 [BSP:e00000016cd794c8] crash_kexec at a0000001000cefb0 (a000000100a4f688) #2 [BSP:e00000016cd79460] panic at a00000010007bc70 (0) #3 [BSP:e00000016cd79418] die at a000000100037ef0 (ffffffffffffffff, a000000100a46cc8, 80400000000) #4 [BSP:e00000016cd793c0] ia64_do_page_fault at a00000010066dd90 (800000004, 80400000000, e00000016cd7fa30) #5 [BSP:e00000016cd793c0] __ia64_leave_kernel at a00000010000bfe0 EFRAME: e00000016cd7fa30 B0: a0000001000956b0 CR_IIP: a0000001000958a0 CR_IPSR: 0000101008026018 CR_IFS: 800000000000030a AR_PFS: 000000000000030a AR_RSC: 0000000000000003 AR_UNAT: 0000000000000000 AR_RNAT: e000000100040000 AR_CCV: 0000000000000000 AR_FPSR: 0009804c8a70033f LOADRS: 0000000000000000 AR_BSPSTORE: ffffffffffff0028 B6: a000000100011940 B7: a000000100095620 PR: 000000000056a5a5 R1: a000000100c36560 R2: e00000016cd7fc08 R3: e00000016cd7fc00 R8: e000000185460008 R9: e000000111d715c0 R10: e000000111d715b8 R11: e000000105f6fe00 R12: e00000016cd7fbf0 R13: e00000016cd78000 R14: 0000000000200200 R15: e00000016cd7fc00 R16: e00000016cd7fc00 R17: 0000000000000000 R18: e000000111d715a0 R19: a000000100a36bc8 R20: e000000185460fc8 R21: e000000111d715a0 R22: e00000016cd7fc00 R23: e000000185460fd0 R24: e00000016cd7fc08 R25: e000000111d715a8 R26: e000000111d715a0 R27: e000000185460010 R28: 00000001033dd2fc R29: e000000185460fb0 R30: 00000001033dd2fb R31: 00000000000000fb F6: 1003e000000000000062e F7: 1003e0044b82fa09b5a53 F8: 1003e00000000001707d8 F9: 100018000000000000000 F10: 10009ac17fffff53e8000 F11: 1003e0000000000000560 #6 [BSP:e00000016cd79370] run_timer_softirq at a0000001000958a0 (e000000185460000) #7 [BSP:e00000016cd792f0] __do_softirq at a00000010008b3e0 (void) #8 [BSP:e00000016cd79290] do_softirq at a00000010008b650 (void) #9 [BSP:e00000016cd79278] irq_exit at a00000010008b720 (void) #10 [BSP:e00000016cd79248] ia64_handle_irq at a000000100011840 (f, e00000016cd7fc10) #11 [BSP:e00000016cd79248] __ia64_leave_kernel at a00000010000bfe0 EFRAME: e00000016cd7fc10 B0: a0000001001488b0 CR_IIP: a00000010011ae20 CR_IPSR: 00001010085a6010 CR_IFS: 8000000000000288 AR_PFS: 000000000000030b AR_RSC: 0000000000000003 AR_UNAT: 0000000000000000 AR_RNAT: 0000000000565699 AR_CCV: 0000000000000000 AR_FPSR: 0009804c8a70033f LOADRS: 0000000000000000 AR_BSPSTORE: a000000100059f80 B6: a000000100061a40 B7: a0000001001b9dc0 PR: 0000000000566959 R1: a000000100c36560 R2: e00000002bfc4700 R3: e00000002bfc4700 R8: 0000000000000001 R9: e00000010ab24ce0 R10: 000000000000009c R11: e000000120c10d18 R12: e00000016cd7fdd0 R13: e00000016cd78000 R14: 0000000000000001 R15: e000000100bf8400 R16: 000000000000009c R17: 00000000000000f5 R18: 00000000000000b6 R19: e000000120c10d20 R20: 000000000128a532 R21: 0000000000024000 R22: 0000000000000000 R23: 0000000000000000 R24: e00000016cd7fdd0 R25: e00000016cd7fdd0 R26: 0000000000000000 R27: 000000000000002a R28: 0000000000000007 R29: 0000000000004000 R30: 0000000000004000 R31: e000000100f6fa00 F6: 1003e0000000000000000 F7: 1003e0000000000000000 F8: 1003e0000000000000000 F9: 1003e0000000000000001 F10: 1003e0000000029a9e4c0 F11: 1003e431bde82d7b634db #12 [BSP:e00000016cd79208] __free_pages at a00000010011ae20 (e000000100bf83f8, 0) #13 [BSP:e00000016cd791d0] __vunmap at a0000001001488b0 (9d, 1) #14 [BSP:e00000016cd791b0] vfree at a000000100148bd0 (a000000202328000) 
#15 [BSP:e00000016cd79180] module_free at a000000100054d10 (a0000002024be500, a000000202328000) #16 [BSP:e00000016cd79150] free_module at a0000001000c2590 (a0000002024be500) #17 [BSP:e00000016cd790e0] sys_delete_module at a0000001000c6190 (20000008003080b0, 0) #18 [BSP:e00000016cd790e0] __ia64_trace_syscall at a00000010000bd70 EFRAME: e00000016cd7fe40 B0: 20000008000042e0 CR_IIP: a000000000010620 CR_IPSR: 00001213085a6010 CR_IFS: 0000000000000004 AR_PFS: c000000000000004 AR_RSC: 000000000000000f AR_UNAT: 0000000000000000 AR_RNAT: 0000000000000000 AR_CCV: 0000000000000000 AR_FPSR: 0009804c0270033f LOADRS: 0000000000400000 AR_BSPSTORE: 600007ffff8c0100 B6: 2000000800223760 B7: a000000000010640 PR: 0000000000561a41 R1: 2000000800270238 R2: c00000000000048c R3: e00000016cd7fef8 R8: 0000000000000001 R9: 20000008002f3e88 R10: 0000000000000000 R11: c00000000000040c R12: 60000fffff8bae50 R13: 20000008002fc6d0 R14: 200000080001d0d0 R15: 000000000000046e R16: 2000000800200840 R17: 0000000000000710 R18: 0000000000000000 R19: 0000000000043220 R20: 0000000000000002 R21: 0000000000004449 R22: 3a474e494e524157 R23: 200000080000ae48 R24: 0000000000000020 R25: 0000000000000000 R26: c000000000000006 R27: 000000000000000f R28: a000000000010620 R29: 00001213085a6010 R30: 0000000000000008 R31: 0000000000561a41 F6: 000000000000000000000 F7: 000000000000000000000 F8: 000000000000000000000 F9: 000000000000000000000 F10: 000000000000000000000 F11: 000000000000000000000 #19 [BSP:e00000016cd790e0] __kernel_syscall_via_break at a000000000010620 crash>

The "staprun" command was running on behalf of the para-callgraph.stp script, and was shutting down the systemtap kernel module -- either because of the Ctrl-C or because of the failure that always occurs when running that script:

crash> set PID: 5515 COMMAND: "staprun" TASK: e00000016cd78000 [THREAD_INFO: e00000016cd79040] CPU: 7 STATE: TASK_RUNNING (PANIC) crash>

So it had proceeded through the module deletion phase all the way to the point where it was vfree'ing the contiguous chunk of vmalloc'd pages that had made up the systemtap kernel module. Note that the module_free() function receives the vmalloc address of the module structure itself, which is contained within the whole vmalloc'd region pointed to by "module_region":

void module_free (struct module *mod, void *module_region)
{
    if (mod && mod->arch.init_unw_table && module_region == mod->module_init) {
        unw_remove_unwind_table(mod->arch.init_unw_table);
        mod->arch.init_unw_table = NULL;
    }
    vfree(module_region);
}

The backtrace shows that the module data structure was at vmalloc address a0000002024be500, which was itself contained within the whole vmalloc region that starts at a000000202328000:

#15 [BSP:e00000016cd79180] module_free at a000000100054d10 (a0000002024be500, a000000202328000)

We don't know the actual size and therefore the end of the kernel module, but we do know that at a minimum it consisted of the vmalloc address space starting at a000000202328000, and going some amount of space beyond a0000002024be500. Now that contiguous vmalloc region has already been vfree'd:

crash> kmem -v VM_STRUCT ADDRESS RANGE SIZE e000000185c05940 a000000200000000 - a000000200404000 4210688 e000000185c05900 a000000200404000 - a000000200608000 2113536 e000000184e3f780 a000000200608000 - a00000020070c000 1064960 e000000101f2f380 a00000020070c000 - a000000200f10000 8404992 ... [ snip ] ...
e00000018083c400 a000000201e1c000 - a000000201f44000 1212416 e00000010b66dcc0 a000000201f44000 - a000000201fa4000 393216 e0000001069e9440 a000000201fa4000 - a000000201ff4000 327680 e0000001044ca6c0 a000000201ff4000 - a000000202050000 376832 crash>

So we can't even look at the formerly-vmalloc'd address space, because its underlying physical pages have been given back by __free_pages():

crash> vtop a000000202328000 VIRTUAL PHYSICAL a000000202328000 (not mapped) PAGE DIRECTORY: a000000100808000 PGD: a000000100808000 => 10136c000 PMD: e00000010136c808 => 1076ac000 PTE: e0000001076ac650 => 0 crash>

Anyway, while doing the __free_pages() operations, a timer interrupt kicked off, and after handling it, irq_exit() saw that there were softirqs pending, so it called do_softirq(). do_softirq() found a queued (per-cpu) function and called it, which is what caused the crash. As it turns out, the function that was attempted was at address a0000002023ea718, which is solidly within the vmalloc address space formerly utilized by the systemtap script being dismantled: a000000202328000 < a0000002023ea718 < a0000002024be500

In fact you can see that the function at a0000002023ea718 is still queued on the cpu 0 and cpu 6 timer queues:

crash> timer TVEC_BASES[0]: a000000100d21100 JIFFIES 4349350651 EXPIRES TIMER_LIST FUNCTION 4349350652 e0000001057c0b20 a0000002023ea718 4349350684 e000000102f59938 a000000100776440 <rh_timer_func> 4349351143 e000000104647dd0 a000000100777c10 <process_timeout> 4349352000 e0000001082f4500 a000000201806de0 <e1000_watchdog> 4349352414 a0000001009eade8 a000000100778590 <neigh_periodic_timer> 4349352441 e000000028007090 a00000010077bd20 <delayed_work_timer_fn> ... [ snip ] ... TVEC_BASES[6]: e000000185450000 JIFFIES 4349350651 EXPIRES TIMER_LIST FUNCTION 4349350651 e000000111d718a0 a0000002023ea718 4349351144 a000000201eb6968 a000000100778590 <neigh_periodic_timer> 4349351188 e00000010f7afab0 a000000100777c10 <process_timeout> 4349351424 e000000102efb960 a00000010077bd20 <delayed_work_timer_fn> ... [ snip ] ... crash>

And it looks like it was queued on the other cpus as well, which must have tried to execute it at the same time:

crash> log | grep a0000002023ea718 Unable to handle kernel paging request at virtual address a0000002023ea718 Unable to handle kernel paging request at virtual address a0000002023ea718 Unable to handle kernel paging request at virtual address a0000002023ea718 Unable to handle kernel paging request at virtual address a0000002023ea718 Unable to handle kernel paging request at virtual address a0000002023ea718 Unable to handle kernel paging request at virtual address a0000002023ea718 crash>

But only one of the cpus ended up attempting the kdump. In any case, it would seem that the function at a0000002023ea718 should have been removed from the timer queues before the module was allowed to be vfree'd. Also, I don't know how (or if) this is related to the other crashes where memory corruption was encountered.

Dave
Looking back at the other crash from comment #16 (which crashed due to ASCII memory corruption in the size-64 slab cache), I can see that an instance of the kernel module associated with the para-callgraph.stp script is running, and that it has the same function that caused the crash in comment #17 above on each cpu's timer queue. It's a different vmalloc address, but it's at the same offset from the beginning of the module: crash> timer | grep a0000002026f6718 4326367115 e0000001477f0aa0 a0000002026f6718 <__stp_time_timer_callback> 4326367114 e0000001477f0420 a0000002026f6718 <__stp_time_timer_callback> 4326367114 e0000001477f1320 a0000002026f6718 <__stp_time_timer_callback> 4326367114 e0000001477f1920 a0000002026f6718 <__stp_time_timer_callback> 4326367114 e0000001086d1aa0 a0000002026f6718 <__stp_time_timer_callback> 4326367115 e0000001086d32a0 a0000002026f6718 <__stp_time_timer_callback> 4326367114 e0000001086d3120 a0000002026f6718 <__stp_time_timer_callback> 4326367115 e0000001086d23a0 a0000002026f6718 <__stp_time_timer_callback> crash> In this case, it can be translated by crash as "__stp_time_timer_callback" because it still exists as a legitimate kernel symbol.
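Since the same callback sits on every CPU's timer wheel, whatever arms that timer per CPU also has to cancel it per CPU before the module text can go away. For reference, here is a minimal sketch of that pattern; the names are illustrative, it is not the systemtap-generated code, and it assumes the timer is armed per CPU with add_timer_on():

#include <linux/timer.h>
#include <linux/percpu.h>
#include <linux/cpumask.h>
#include <linux/jiffies.h>

static DEFINE_PER_CPU(struct timer_list, example_timer);   /* illustrative name */

static void example_callback(unsigned long data)
{
    /* runs in softirq context on the CPU whose wheel fired; re-arm there */
    mod_timer(&__get_cpu_var(example_timer), jiffies + 1);
}

static void example_start(void)
{
    int cpu;

    for_each_online_cpu(cpu) {
        struct timer_list *t = &per_cpu(example_timer, cpu);

        setup_timer(t, example_callback, 0);
        t->expires = jiffies + 1;
        add_timer_on(t, cpu);           /* queue on that CPU's timer wheel */
    }
}

static void example_stop(void)
{
    int cpu;

    /* every per-cpu instance must be gone before the module text is freed */
    for_each_online_cpu(cpu)
        del_timer_sync(&per_cpu(example_timer, cpu));
}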
A del_timer_sync() call should have been made by the systemtap kernel module during the shutdown operation. But trying to understand the kernel module code generated by systemtap is mind-boggling.
Given the situation above, shouldn't this BZ be re-assigned to the "systemtap" component? It's certainly not the job of the kernel proper (or the kprobe subsystem within it) to clean up timer tasks queued by systemtap.
Dave, Thanks for your analysis of the problem. The analysis points to bugs in the system-tap component. Re-assigning to the system-tap component. Sachin Prabhu
Dave, thank you for your ongoing analysis of the problem. While reading systemtap-generated C code is not easy, "stap -p4 -k ..." is a start. The temp-directory + /usr/share/systemtap/runtime directories should give the complete picture. Systemtap does (only) use del_timer_sync() to cancel timers, and mod_timer() within the callbacks to reschedule them. Perhaps the kernel is not robust with respect to a combination of these operations from modules that may be removed. Has anyone tried reproducing this problem from an x86 or x86-64 machine?
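For what it's worth, the usual caveat with that combination is that del_timer_sync() expects the caller to prevent the timer from being re-armed; a callback that unconditionally mod_timer()s itself is normally paired with a stop flag that is set before the cancel. A minimal sketch of that general pattern, purely for illustration -- it is not the systemtap-generated code and not a claim about where the bug is:

#include <linux/timer.h>
#include <linux/jiffies.h>

static struct timer_list poll_timer;       /* illustrative names */
static int poll_stopping;

static void poll_callback(unsigned long data)
{
    /* ... periodic work ... */

    if (!poll_stopping)                    /* don't re-arm once shutdown has begun */
        mod_timer(&poll_timer, jiffies + 1);
}

static void poll_start(void)
{
    setup_timer(&poll_timer, poll_callback, 0);
    mod_timer(&poll_timer, jiffies + 1);
}

static void poll_stop(void)
{
    poll_stopping = 1;                     /* forbid re-arming ...            */
    smp_wmb();                             /* ... before the cancel below     */
    del_timer_sync(&poll_timer);           /* not queued and not running on
                                              any CPU when this returns       */
}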
> Dave, thank you for your ongoing analysis of the problem. While
> reading systemtap-generated C code is not easy, "stap -p4 -k ..."
> is a start. The temp-directory + /usr/share/systemtap/runtime
> directories should give the complete picture.

Right -- I only looked at the C files in the /tmp/stapXXX directories and couldn't see where the timers were being activated/deactivated, but could see that del_timer_sync was part of each .ko module.

> Perhaps the kernel is not robust with respect to a combination of
> these operations from modules that may be removed.

The last crash, however, was *only* running a single module at the time, which was the para-callgraph.stp-generated module that I was running in a loop from a bash script, and it crashed either due to my entry of Ctrl-C or simply because the module always fails quickly by itself in Pass 5 like this:

...
Pass 5: starting run.
ERROR: probe overhead exceeded threshold
WARNING: Number of errors: 1, skipped probes: 0

It seems that the del_timer_sync call should be made as part of the mod->exit() callback into the module, here in sys_delete_module():

asmlinkage long sys_delete_module(const char __user *name_user, unsigned int flags)
... [ snip ] ...
        /* Final destruction now noone is using it. */
        if (mod->exit != NULL) {
                mutex_unlock(&module_mutex);
                mod->exit();
                mutex_lock(&module_mutex);
        }
        free_module(mod);
 out:
        mutex_unlock(&module_mutex);
        return ret;
}

But it's gone beyond that and has called free_module(). And all but one of the other cpus are also working on the same stale callback that exists on their per-cpu timer queue.

As to other machine types, they claim it's only been an ia64 issue, but I have to believe that it's really unknown at this point because the original reproducer test procedure was so convoluted. So anyway, this is what I did to narrow it down:

while true
do
        stap -v -o /dev/null para-callgraph2.stp sys_read '*@fs/*.c'
done

and on the 2nd or 3rd attempt, I hit Ctrl-C.
Were you also running the io-load-test to reproduce the problem with just para-callgraph2.stp?
(In reply to comment #24) > Were you also running the io-load-test to reproduce the problem with just > para-callgraph2.stp? No -- I had just rebooted the system, and then I did the above. BTW, I just got a fresh x86_64 system that I'm going to try to reproduce it on as well.
Created attachment 365759 [details]
dmesg log

The same loop caused a crash on my x86-64 rhel5.4 VM. This was with a bunch of ^C's and maybe a ^Z in there, too, trying to disrupt it.
Hmmm, I'm not able to get it to reproduce on the x86_64. I just wait until it says "Pass 5: starting run." and then hit CTRL-C. But it seems to clean up as expected... On the other hand, I tried it again on the ia64 machine I've got, and it only took 3 repetitions before I got it to crash again. In this new case, cpus 0, 3, 4, 5, 6 and 7 are all in the idle "swapper" task running run_timer_softirq() when they come upon the same queued function to call at a000000202402718, which is no longer mapped in vmalloc space. Cpus 1 and 2 just haven't gotten around to handling it yet, but the function is still sitting on their per-cpu timer queue: crash> timer ... [snip] ... TVEC_BASES[1]: e000000185194000 JIFFIES 4483518118 EXPIRES TIMER_LIST FUNCTION 4483518118 e000000105dabba0 a000000202402718 4483520108 e000000028017090 a00000010077bd20 <delayed_work_timer_fn> 4488172090 e000000110c08900 a00000010077c600 <tcp_keepalive_timer> 4490094598 e000000110c09700 a00000010077c600 <tcp_keepalive_timer> TVEC_BASES[2]: e000000185234000 JIFFIES 4483518118 EXPIRES TIMER_LIST FUNCTION 4483518118 e000000105da82a0 a000000202402718 4483518352 e00000010305edb8 a000000100776440 <rh_timer_func> 4483518752 e000000028027090 a00000010077bd20 <delayed_work_timer_fn> 4483519116 e00000010c26fab0 a000000100777c10 <process_timeout> 4483530474 e000000107a05218 a00000010077b540 <neigh_timer_handler> 4515112326 e0000001073e7a60 a000000100777c10 <process_timeout> ... So in this case, the systemtap module has already been deleted, but with timer functions still referenced: crash> mod | grep stap crash>
Some experiments show evidence of a possible race condition between abnormal shutdown triggered at the same time as normal startup. We are formulating a cunning plan to solve it.
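To illustrate the class of race being described -- this is a hypothetical sketch, not the actual systemtap code or the eventual fix -- if the abnormal-shutdown path can run while startup is still completing, startup may arm the timer after del_timer_sync() has already been called, leaving a live callback behind when the module is unloaded. Serializing the "shutting down" state with the arm/cancel operations closes that window:

#include <linux/timer.h>
#include <linux/spinlock.h>
#include <linux/jiffies.h>

static DEFINE_SPINLOCK(state_lock);        /* illustrative names throughout */
static int shutting_down;
static struct timer_list sample_timer;

static void sample_callback(unsigned long data)
{
    /* periodic work; re-arming omitted in this sketch */
}

/* called late in normal startup */
static int startup_arm_timer(void)
{
    int armed = 0;

    setup_timer(&sample_timer, sample_callback, 0);

    spin_lock(&state_lock);
    if (!shutting_down) {                  /* a concurrent shutdown may have won */
        mod_timer(&sample_timer, jiffies + 1);
        armed = 1;
    }
    spin_unlock(&state_lock);
    return armed;
}

/* called from the error/Ctrl-C path, possibly while startup is still running */
static void abnormal_shutdown(void)
{
    spin_lock(&state_lock);
    shutting_down = 1;                     /* nothing may arm the timer after this */
    spin_unlock(&state_lock);

    del_timer_sync(&sample_timer);         /* safe: nothing can re-arm it now */
}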
We believe we have fixed this in upstream systemtap. This will be included in future RHEL releases. http://sourceware.org/bugzilla/show_bug.cgi?id=10854
I believe we need PM to authorize 5.4.Z first.
patch is ready for backporting into rhel5.4.Z; rhel5.5 will get it as a part of a near-future rebase
An advisory has been issued which should help the problem described in this bug report. This report is therefore being closed with a resolution of ERRATA. For more information on the solution and/or where to find the updated files, please follow the link below. You may reopen this bug report if the solution does not work for you.

http://rhn.redhat.com/errata/RHBA-2010-0308.html