Bug 242866

Summary: NULL pointer OOPS in plist_add on LS20
Product: Red Hat Enterprise MRG
Reporter: IBM Bug Proxy <bugproxy>
Component: realtime-kernel
Assignee: Red Hat Real Time Maintenance <rt-maint>
Status: CLOSED CURRENTRELEASE
Severity: medium
Priority: low    
Version: 1.0   
Target Milestone: ---   
Target Release: ---   
Hardware: All   
OS: Linux   
Whiteboard:
Fixed In Version: -31
Doc Type: Bug Fix
Last Closed: 2007-07-02 14:24:18 UTC

Description IBM Bug Proxy 2007-06-06 05:33:52 UTC
LTC Owner is: jstultz.com
LTC Originator is: sudhanshusingh.com


Problem description:

The LS20 machine hangs with OOPS messages from the kernel.

Describe any custom patches installed.

RT patches to RHEL5
glibc patches.

Provide output from "uname -a", if possible:
$ uname -a
Linux llm49.in.ibm.com 2.6.21-14ibm #1 SMP PREEMPT RT Thu May 31 21:18:32 CDT
2007 x86_64 x86_64 x86_64 GNU/Linux


Hardware Environment
LS 20 machine


Please provide contact information if the submitter is not the primary contact.
sripathik.com

Please provide access information for the machine if it is available.
llm49.in.ibm.com


Did the system produce an OOPS message on the console?
    If so, copy it here:
==================================
Code: 8b 02 f6 c4 04 74 05 49 ff c4 eb 35 8b 02 66 85 c0 79 05 49
RIP  [<ffffffff81081787>] show_mem+0x8f/0x144
 RSP <ffff810105103b28>
CR2: 0000000005a00000
usb 1-2: USB disconnect, address 3
usb 1-2.1: USB disconnect, address 6
usb 1-2.3: USB disconnect, address 7
Unable to handle kernel NULL pointer dereference at 0000000000000000 RIP:
 [<ffffffff81154bb0>] plist_add+0x5b/0xa6
PGD 14d99c067 PUD 16e65d067 PMD 0
Oops: 0000 [161] PREEMPT SMP
CPU 1
Modules linked in: autofs4 hidp rfcomm l2cap bluetooth sunrpc
nf_conntrack_netbios_ns ipt_REJECT nf_conntrack_ipv4 xt_state iptable_filter
ip_tables ip6t_REJECT xt_tcpudp ip6table_filter ip6_tables x_tables ipv6
dm_mirror dm_mod video sbs i2c_ec dock button battery asus_acpi ac parport_pc lp
parport joydev sr_mod cdrom sg i2c_amd756 pcspkr i2c_core amd_rng shpchp k8temp
hwmon tg3 rtc_cmos rtc_core rtc_lib serio_raw usb_storage mptspi mptscsih
mptbase scsi_transport_spi sd_mod scsi_mod ext3 jbd ehci_hcd ohci_hcd uhci_hcd
Pid: 3136, comm: ps Not tainted 2.6.21-14ibm #1
RIP: 0010:[<ffffffff81154bb0>]  [<ffffffff81154bb0>] plist_add+0x5b/0xa6
RSP: 0018:ffff81003059fc78  EFLAGS: 00010093
RAX: ffff81003059fd58 RBX: ffff81003059fd40 RCX: ffff81003059fd58
RDX: ffff81003059fd58 RSI: fffffffffffffff8 RDI: ffff81003059fd40
RBP: ffff81003059fc88 R08: ffff81003059fd48 R09: 000000000001c20b
R10: 0000000000000000 R11: 0000000000000002 R12: ffff810211d88ed8
R13: ffff81003059fd18 R14: ffff810211d88800 R15: ffff8101107be9e0
FS:  00002b5f855f9df0(0000) GS:ffff810111c85b40(0000) knlGS:00000000f7fb8b90
CS:  0010 DS: 0000 ES: 0000 CR0: 000000008005003b
CR2: 0000000000000000 CR3: 000000014d9a2000 CR4: 00000000000006e0
Process ps (pid: 3136, threadinfo ffff81003059e000, task ffff810021228800)
Stack:  ffff810211d88ed8 ffff810211d88ed0 ffff81003059fce8 ffffffff810ad05d
 0000000100000101 0000000000000292 0000000000000006 ffff81003059fd18
 ffff81003059fd40 ffff8101107be9e0 ffff8101107be9e0 0000000000000004
Call Trace:
 [<ffffffff810ad05d>] task_blocks_on_rt_mutex+0x153/0x1bf
 [<ffffffff81065896>] rt_mutex_slowlock+0x189/0x2a2
 [<ffffffff8106556a>] rt_mutex_lock+0x28/0x2a
 [<ffffffff810ad2d4>] __rt_down_read+0x47/0x4b
 [<ffffffff810ad2ee>] rt_down_read+0xb/0xd
 [<ffffffff810d0268>] access_process_vm+0x46/0x174
 [<ffffffff81109561>] proc_pid_cmdline+0x6e/0xfb
 [<ffffffff8110a570>] proc_info_read+0x62/0xca
 [<ffffffff8100b1fc>] vfs_read+0xcc/0x155
 [<ffffffff81011ab0>] sys_read+0x47/0x6f
 [<ffffffff8105f29e>] tracesys+0xdc/0xe1
 [<000000326bebfa10>]
=======================
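
For readers unfamiliar with the "Oops: 0000" line in the trace above: on x86 the number is the page-fault error code, where bit 0 distinguishes a protection violation from a not-present page, bit 1 a write from a read, and bit 2 user mode from kernel mode. The sketch below decodes it; the helper name decode_oops_code is made up for illustration and is not part of any kernel tooling.

```python
# Decode the hexadecimal error code printed on an x86 "Oops: NNNN" line.
# Each entry is (bit mask, meaning if set, meaning if clear or None).
PAGE_FAULT_BITS = [
    (0x01, "protection violation", "page not present"),
    (0x02, "write access", "read access"),
    (0x04, "user mode", "kernel mode"),
    (0x08, "reserved bit set", None),
    (0x10, "instruction fetch", None),
]

def decode_oops_code(line):
    """Describe a line such as 'Oops: 0000 [161] PREEMPT SMP' (hypothetical helper)."""
    code = int(line.split()[1], 16)  # second field is the hex error code
    facts = []
    for mask, if_set, if_clear in PAGE_FAULT_BITS:
        if code & mask:
            facts.append(if_set)
        elif if_clear is not None:
            facts.append(if_clear)
    return facts

print(decode_oops_code("Oops: 0000 [161] PREEMPT SMP"))
# prints ['page not present', 'read access', 'kernel mode']
```

A code of 0000 therefore reads as a kernel-mode read of an unmapped page, which matches the reported NULL pointer dereference at address 0 in plist_add.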



Additional information:

The trace in this bug looks very similar to the one in LTC bug 35202 (RH bug
242865), originating in the vfs read
path -> task_blocks_on_rt_mutex -> plist_add -> dump_trace.
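
One way to make such a comparison less eyeball-driven is to pull the function names out of each Call Trace and intersect them. A minimal sketch, assuming the frames are formatted as in the oops above; the helper names frames and common_frames are hypothetical, and the sample uses only lines from this bug.

```python
import re

# Matches frame lines such as " [<ffffffff81154bb0>] plist_add+0x5b/0xa6".
FRAME_RE = re.compile(r"\[<[0-9a-f]+>\]\s+(\w+)\+0x[0-9a-f]+/0x[0-9a-f]+")

def frames(oops_text):
    """Return the function names found in an oops, in order of appearance."""
    return FRAME_RE.findall(oops_text)

def common_frames(a, b):
    """Return the functions present in both traces, keeping the order of a."""
    in_b = set(frames(b))
    return [name for name in frames(a) if name in in_b]

sample = """\
 [<ffffffff81154bb0>] plist_add+0x5b/0xa6
 [<ffffffff810ad05d>] task_blocks_on_rt_mutex+0x153/0x1bf
 [<ffffffff81065896>] rt_mutex_slowlock+0x189/0x2a2
"""
print(frames(sample))
# prints ['plist_add', 'task_blocks_on_rt_mutex', 'rt_mutex_slowlock']
```

Passing the saved oops text from both bugs to common_frames would list the shared part of the path.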

What glibc patches were being used, and what was being run on the box when this
was triggered?

(In reply to comment #2)
> What glibc patches were being used, and what was being run on the box when this
> was triggered?

glibc was the default RHEL5 one: glibc-2.5-12
Sudhanshu was running release-testing.sh on this machine. At the time of
failure, I think it was running kernbench.

The USB noise is interesting. Was this triggered while switching KVM consoles on
the bladecenter?

usb 1-2: USB disconnect, address 3
usb 1-2.1: USB disconnect, address 6
usb 1-2.3: USB disconnect, address 7

Comment 1 IBM Bug Proxy 2007-06-06 15:20:33 UTC
----- Additional Comments From dvhltc.com  2007-06-06 11:16 EDT -------
I believe the USB messages are BC KVM related. 

Comment 2 IBM Bug Proxy 2007-06-07 22:30:40 UTC
----- Additional Comments From jstultz.com (prefers email at johnstul.com)  2007-06-07 18:26 EDT -------
At the very top of the original oops I noticed:
RIP  [<ffffffff81081787>] show_mem+0x8f/0x144

That looks a lot like the sysrq-m bug (LTC bug #35225). I'm curious if this is
fallout from that, triggered by an OOM.

Has this been reproduced yet? 

Comment 3 IBM Bug Proxy 2007-06-08 06:10:37 UTC
------- Additional Comments From sripathi.com (prefers email at sripathik.com)  2007-06-08 02:08 EDT -------
(In reply to comment #8)
> At the very top of the  original oops I noticed:
> RIP  [<ffffffff81081787>] show_mem+0x8f/0x144 
> 
> That looks a lot like the sysrq-m bug (LTC bug #35225). I'm curious if this is
> fallout from that triggered by an OOM.

There were plenty of OOM messages in dmesg, so I think show_mem is related to that. 

Comment 4 IBM Bug Proxy 2007-06-08 08:45:33 UTC
----- Additional Comments From ankigarg.com (prefers email at ankita.com)  2007-06-08 04:39 EDT -------
> Has this been reproduced yet?

John,
I have not been able to hit this issue. Been running our tests on some boxes. 

Comment 5 Tim Burke 2007-06-14 19:26:55 UTC
Several possibly related fixes are in the pile o' stuff Clark is pulling
together now, i.e., -rt10 + stable5 + misc fixes. Don't know the exact kernel
version this will be labeled (it's still building). How about you guys wait for
that kernel and then see if this problem still occurs?


Comment 6 IBM Bug Proxy 2007-06-15 08:50:39 UTC
----- Additional Comments From cijurajan.com  2007-06-15 04:45 EDT -------
Just completed the release testing once again using the 2.6.21-14ibm2 kernel.
The problem did not reproduce.

Comment 7 IBM Bug Proxy 2007-06-15 12:20:38 UTC
changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
             Status|ASSIGNED                    |NEEDINFO




------- Additional Comments From cijurajan.com  2007-06-15 08:16 EDT -------
Moving the state to NEEDINFO 

Comment 8 IBM Bug Proxy 2007-06-26 00:01:09 UTC
----- Additional Comments From jstultz.com (prefers email at johnstul.com)  2007-06-25 19:56 EDT -------
Just as a sample point, I ran multiple overnight runs of kernbench and
recalibrate on an LS20 w/ -23ibm3 and -31 kernels and have seen no such problem.

Comment 9 IBM Bug Proxy 2007-07-02 18:41:14 UTC
changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
             Status|NEEDINFO                    |OPEN




------- Additional Comments From jstultz.com (prefers email at johnstul.com)  2007-07-02 14:35 EDT -------
This has not been reproduced for a while, and it's likely the show_mem and
softirq fixes that landed in -23ibm3 resolved this. Closing.

Comment 10 IBM Bug Proxy 2007-07-03 17:15:14 UTC
changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
             Status|CLOSED                      |REOPENED
         Resolution|UNREPRODUCIBLE              |




------- Additional Comments From jstultz.com (prefers email at johnstul.com)  2007-07-03 13:10 EDT -------
Please do not move bugs from resolved states to closed.
See https://ltc.linux.ibm.com/wiki/rt-linux/Bugs for the realtime bug lifecycle.