LTC Owner is: jstultz.com
LTC Originator is: sudhanshusingh.com

Problem description:
LS 20 machine hangs up with OOPS messages from kernel.

Describe any custom patches installed.
RT patches to RHEL5 glibc patches.

Provide output from "uname -a", if possible:
$uname -a
Linux llm49.in.ibm.com 2.6.21-14ibm #1 SMP PREEMPT RT Thu May 31 21:18:32 CDT 2007 x86_64 x86_64 x86_64 GNU/Linux

Hardware Environment
LS 20 machine

Please provide contact information if the submitter is not the primary contact.
sripathik.com

Please provide access information for the machine if it is available.
llm49.in.ibm.com

Did the system produce an OOPS message on the console? If so, copy it here:
==================================
Code: 8b 02 f6 c4 04 74 05 49 ff c4 eb 35 8b 02 66 85 c0 79 05 49
RIP [<ffffffff81081787>] show_mem+0x8f/0x144
 RSP <ffff810105103b28>
CR2: 0000000005a00000
usb 1-2: USB disconnect, address 3
usb 1-2.1: USB disconnect, address 6
usb 1-2.3: USB disconnect, address 7
Unable to handle kernel NULL pointer dereference at 0000000000000000 RIP:
 [<ffffffff81154bb0>] plist_add+0x5b/0xa6
PGD 14d99c067 PUD 16e65d067 PMD 0
Oops: 0000 [161] PREEMPT SMP
CPU 1
Modules linked in: autofs4 hidp rfcomm l2cap bluetooth sunrpc nf_conntrack_netbios_ns ipt_REJECT nf_conntrack_ipv4 xt_state iptable_filter ip_tables ip6t_REJECT xt_tcpudp ip6table_filter ip6_tables x_tables ipv6 dm_mirror dm_mod video sbs i2c_ec dock button battery asus_acpi ac parport_pc lp parport joydev sr_mod cdrom sg i2c_amd756 pcspkr i2c_core amd_rng shpchp k8temp hwmon tg3 rtc_cmos rtc_core rtc_lib serio_raw usb_storage mptspi mptscsih mptbase scsi_transport_spi sd_mod scsi_mod ext3 jbd ehci_hcd ohci_hcd uhci_hcd
Pid: 3136, comm: ps Not tainted 2.6.21-14ibm #1
RIP: 0010:[<ffffffff81154bb0>] [<ffffffff81154bb0>] plist_add+0x5b/0xa6
RSP: 0018:ffff81003059fc78 EFLAGS: 00010093
RAX: ffff81003059fd58 RBX: ffff81003059fd40 RCX: ffff81003059fd58
RDX: ffff81003059fd58 RSI: fffffffffffffff8 RDI: ffff81003059fd40
RBP: ffff81003059fc88 R08: ffff81003059fd48 R09: 000000000001c20b
R10: 0000000000000000 R11: 0000000000000002 R12: ffff810211d88ed8
R13: ffff81003059fd18 R14: ffff810211d88800 R15: ffff8101107be9e0
FS: 00002b5f855f9df0(0000) GS:ffff810111c85b40(0000) knlGS:00000000f7fb8b90
CS: 0010 DS: 0000 ES: 0000 CR0: 000000008005003b
CR2: 0000000000000000 CR3: 000000014d9a2000 CR4: 00000000000006e0
Process ps (pid: 3136, threadinfo ffff81003059e000, task ffff810021228800)
Stack: ffff810211d88ed8 ffff810211d88ed0 ffff81003059fce8 ffffffff810ad05d
 0000000100000101 0000000000000292 0000000000000006 ffff81003059fd18
 ffff81003059fd40 ffff8101107be9e0 ffff8101107be9e0 0000000000000004
Call Trace:
 [<ffffffff810ad05d>] task_blocks_on_rt_mutex+0x153/0x1bf
 [<ffffffff81065896>] rt_mutex_slowlock+0x189/0x2a2
 [<ffffffff8106556a>] rt_mutex_lock+0x28/0x2a
 [<ffffffff810ad2d4>] __rt_down_read+0x47/0x4b
 [<ffffffff810ad2ee>] rt_down_read+0xb/0xd
 [<ffffffff810d0268>] access_process_vm+0x46/0x174
 [<ffffffff81109561>] proc_pid_cmdline+0x6e/0xfb
 [<ffffffff8110a570>] proc_info_read+0x62/0xca
 [<ffffffff8100b1fc>] vfs_read+0xcc/0x155
 [<ffffffff81011ab0>] sys_read+0x47/0x6f
 [<ffffffff8105f29e>] tracesys+0xdc/0xe1
 [<000000326bebfa10>]
=======================

Additional information:
The trace in this bug looks very similar to the one in ltc bug 35202 (RH bug 242865), originating in the vfs read path->task_blocks_on_rt_mutex->plist_add->dump_trace

What glibc patches were being used, and what was being run on the box when this was
triggered?

(In reply to comment #2)
> What glibc patches were being used, and what was being run on the box when this
> was triggered?

glibc was default RHEL5 one: glibc-2.5-12
Sudhanshu was running release-testing.sh on this machine. At the time of failure, I think it was running kernbench.

The USB noise is interesting. Was this triggered while switching KVM consoles on the bladecenter?

usb 1-2: USB disconnect, address 3
usb 1-2.1: USB disconnect, address 6
usb 1-2.3: USB disconnect, address 7
----- Additional Comments From dvhltc.com 2007-06-06 11:16 EDT -------
I believe the USB messages are BC KVM related.
----- Additional Comments From jstultz.com (prefers email at johnstul.com) 2007-06-07 18:26 EDT -------
At the very top of the original oops I noticed:
RIP [<ffffffff81081787>] show_mem+0x8f/0x144

That looks a lot like the sysrq-m bug (LTC bug #35225). I'm curious whether this is fallout from that, triggered by an OOM.

Has this been reproduced yet?
------- Additional Comments From sripathi.com (prefers email at sripathik.com) 2007-06-08 02:08 EDT -------
(In reply to comment #8)
> At the very top of the original oops I noticed:
> RIP [<ffffffff81081787>] show_mem+0x8f/0x144
>
> That looks a lot like the sysrq-m bug (LTC bug #35225). I'm curious if this is
> fallout from that triggered by an OOM.

There were plenty of OOM messages in dmesg, so I think show_mem is related to that.
----- Additional Comments From ankigarg.com (prefers email at ankita.com) 2007-06-08 04:39 EDT -------
> Has this been reproduced yet?

John, I have not been able to hit this issue. Been running our tests on some boxes.
There are several possibly related fixes in the pile o' stuff Clark is pulling together now, i.e., -rt10 + stable5 + misc fixes. Don't know the exact kernel version this will be labeled (it's still building). How about you guys wait for that kernel and then see if this problem still occurs?
----- Additional Comments From cijurajan.com 2007-06-15 04:45 EDT -------
Just completed the release testing once again using the 2.6.21-14ibm2 kernel. The problem did not reproduce.
changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
             Status|ASSIGNED                    |NEEDINFO

------- Additional Comments From cijurajan.com 2007-06-15 08:16 EDT -------
Moving the state to NEEDINFO
----- Additional Comments From jstultz.com (prefers email at johnstul.com) 2007-06-25 19:56 EDT -------
Just as a sample point, I have run multiple overnight runs of kernbench and recalibrate on an LS20 w/ -23ibm3 and -31 kernels and have seen no such problem.
changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
             Status|NEEDINFO                    |OPEN

------- Additional Comments From jstultz.com (prefers email at johnstul.com) 2007-07-02 14:35 EDT -------
This has not been reproduced for a while, and it's likely the show_mem and softirq fixes that landed in -23ibm3 resolved this. Closing.
changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
             Status|CLOSED                      |REOPENED
         Resolution|UNREPRODUCIBLE              |

------- Additional Comments From jstultz.com (prefers email at johnstul.com) 2007-07-03 13:10 EDT -------
Please do not move bugs from resolved states to closed. See https://ltc.linux.ibm.com/wiki/rt-linux/Bugs for the realtime bug lifecycle.