245338 – [RHEL5 RT] Oops - Unable to handle kernel paging request

Bug 245338 - [RHEL5 RT] Oops - Unable to handle kernel paging request

Summary: [RHEL5 RT] Oops - Unable to handle kernel paging request

Keywords:
Status:	CLOSED CURRENTRELEASE
Alias:	None
Product:	Red Hat Enterprise MRG
Classification:	Red Hat
Component:	realtime-kernel
Sub Component:
Version:	1.0
Hardware:	All
OS:	Linux
Priority:	low
Severity:	low
Target Milestone:	---
Target Release:	---
Assignee:	Clark Williams
QA Contact:
Docs Contact:
URL:	http://rhts.lab.boston.redhat.com/cgi...
Whiteboard:
Depends On:
Blocks:
TreeView+	depends on / blocked

Reported:	2007-06-22 14:48 UTC by Jeff Burke
Modified:	2008-02-27 19:58 UTC (History)
CC List:	3 users (show)
Fixed In Version:	-43
Doc Type:	Bug Fix
Doc Text:
Clone Of:
Environment:
Last Closed:	2007-10-22 21:26:15 UTC
Target Upstream Version:
Embargoed:

Attachments	(Terms of Use)
Traceback from panic on 2.6.21-33.el5rt (10.60 KB, text/plain) 2007-07-10 22:19 UTC, Clark Williams	no flags	Details
boot log and traceback from second panic (51.86 KB, text/plain) 2007-07-12 14:14 UTC, Clark Williams	no flags	Details
panic with logdev prints (1.04 MB, text/plain) 2007-07-19 17:59 UTC, Clark Williams	no flags	Details
back port changes from RT Linux mainline (30.01 KB, patch) 2007-09-25 20:29 UTC, Steven Rostedt	no flags	Details \| Diff
View All

Links
System	ID	Private	Priority	Status	Summary	Last Updated
IBM Linux Technology Center	36455	0	None	None	None	Never

Description Jeff Burke 2007-06-22 14:48:08 UTC

Description of problem:
 While running the raceer test on hp-dl585-02 x86_64. The system Oops.

Unable to handle kernel paging request at ffff8115ccd0e800 RIP: 
 [<ffff8115ccd0e800>]
PGD 8063 PUD 60063 PMD 80000015ccc001e3 PTE 0
Oops: 0011 [1] PREEMPT SMP 
CPU 4 
Modules linked in: 
Pid: 64, comm: softirq-rcu/4 Not tainted 2.6.21-32.el5rt #1

Version-Release number of selected component (if applicable):
 2.6.21-31.el5rt

How reproducible:
 Unknown
  
Actual results:

Unable to handle kernel paging request at ffff8115ccd0e800 RIP: 
 [<ffff8115ccd0e800>]
PGD 8063 PUD 60063 PMD 80000015ccc001e3 PTE 0
Oops: 0011 [1] PREEMPT SMP 
CPU 4 
Modules linked in: nfs nfsd exportfs lockd nfs_acl autofs4 hidp rfcomm l2cap
bluetooth sunrpc ipv6 video sbs i2c_ec dock button battery asus_acpi ac
parport_pc lp parport floppy pcspkr pata_amd ata_generic libata i2c_amd756
i2c_core amd_rng k8temp hwmon tg3 shpchp serio_raw ide_cd cdrom dm_snapshot
dm_zero dm_mirror dm_mod cciss sd_mod scsi_mod ext3 jbd ehci_hcd ohci_hcd uhci_hcd
Pid: 64, comm: softirq-rcu/4 Not tainted 2.6.21-32.el5rt #1
RIP: 0010:[<ffff8115ccd0e800>]  [<ffff8115ccd0e800>]
RSP: 0000:ffff810123eede38  EFLAGS: 00010286
RAX: ffff8115fffbd040 RBX: ffff8112c32f3500 RCX: 000000000000000b
RDX: ffff8115fffbd040 RSI: ffff810c000c45f8 RDI: ffff8115ff568300
RBP: ffff810123eede50 R08: ffff810123eec000 R09: 0000000000000038
R10: ffffffff808e1700 R11: ffff81062419be98 R12: ffff8115ff568300
R13: 00000000fffffeff R14: 0000000000000004 R15: 0000000000000100
FS:  00002b9edfcc3f20(0000) GS:ffff810123e21a40(0000) knlGS:00000000f7fce6c0
CS:  0010 DS: 0018 ES: 0018 CR0: 000000008005003b
CR2: ffff8115ccd0e800 CR3: 000000016acfc000 CR4: 00000000000006e0
Process softirq-rcu/4 (pid: 64, threadinfo ffff810123eec000, task ffff8115fffbd040)
Stack:  ffffffff802c53aa ffffffff809df968 ffff810001054600 ffff810123eedeb0
 ffffffff80295a34 0000000600000001 00000004000a7bf0 ffff810c00000032
 ffff810001054600 ffff810c000a7bf0 ffff810001054600 ffff810c000a7bf0
Call Trace:


Code: 00 83 56 ff 15 81 ff ff 40 13 cf f9 03 81 ff ff 01 00 00 00 
RIP  [<ffff8115ccd0e800>]
 RSP <ffff810123eede38>
CR2: ffff8115ccd0e800

Additional info:
 This happened while smoke testing the 2.6.21-32.el5rt kernel.

Comment 7 Clark Williams 2007-07-09 14:14:24 UTC

No, that was a different bug that happened during bootup and acme was looking at
that one. I grabbed the system, loaded -32 and then loaded LTP. I ran the
racer.sh test until my reservation expired and was kicked off, didn't see any
panics.

I'll try again later today. Jeff, any advice on setup other than loading the
latest -rt kernel and then loading LTP and running racer.sh?

Comment 8 Jeff Burke 2007-07-09 14:23:59 UTC

Clark,
  Not sure what log you looked at but it was loaded with panics. This was your
job correct? http://rhts.lab.boston.redhat.com/cgi-bin/rhts/jobs.cgi?id=2271

  Here is a link to the log:
http://rhts.lab.boston.redhat.com/cgi-bin/rhts/test_log.cgi?id=268893

  Also you may want to read the email from Steve re: [PATCH] Don't BUG_ON in exit.c

Jeff

Comment 9 Clark Williams 2007-07-09 15:30:14 UTC

Ah, I didn't look at my log, just a quick look at the test output on a local
machine; saw the output from racer and the killed message and assumed that it
had run until the reservation expired.

Scrolling back I see that problems started around 2200 Tuesday night (about 7
hours into the run). I see an Oops being reported. Unfortunately not much useful
there since it doesn't match up with the panic in the logs. 

I'm going to add Steven's BUG_ON patch and try again with -33.

Comment 10 Clark Williams 2007-07-10 22:19:16 UTC

Created attachment 158902 [details]
Traceback from panic on 2.6.21-33.el5rt

This is a traceback from running the racer.sh script from LTP for approximately
8 hours on hp-dl587-02.rhts.boston.redhat.com, using the 2.6.21-33.el5rt kernel
(contains the BUG_ON removal patch from srostedt)

Comment 11 Clark Williams 2007-07-12 14:14:34 UTC

Created attachment 159057 [details]
boot log and traceback from second panic

This is the boot log and traceback from a panic on the system
hp-dl585g2-01.rhts.boston.redhat.com.

Comment 12 Clark Williams 2007-07-12 14:28:33 UTC

In the traceback attached to Comment #11, the problem occured at some point
after 6 hours of runtime. General similarity to the first panic in that the
problem occured in task_rq_lock() called by softirq-rcu. I'm beginning to think
that it's not a problem in softirqs, but rather something whacking a task
structure before calling a softirq, since it looks like task_rq_lock faults
while trying to use something out of the task_struct that was passed to it.

Comment 13 Clark Williams 2007-07-16 22:09:25 UTC

While trying to generate a kernel dump with kernel-rt-2.6.21-34.el5rt, I saw the
following message on the console of a DL585:

hdb: cdrom_pc_intr: The drive appears confused (ireason = 0x01). Trying to
recover by ending request. 

As far as I know there is no disc in the CD tray of the test system, therefore
there should be no request to cancel. This is borne out by the fact the system
was non-responsive after this console message and required a reset.

Comment 14 Clark Williams 2007-07-19 17:59:29 UTC

Created attachment 159599 [details]
panic with logdev prints

This is serial console output from a panic running a 2.6.21-34 kernel with
rostedts logdev patch and a print in the call_rcu function and the
process_rcu_callbacks function

Comment 15 IBM Bug Proxy 2007-07-23 20:50:59 UTC

----- Additional Comments From jstultz.com (prefers email at johnstul.com)  2007-07-23 16:48 EDT -------
Here is a very similar Oops we triggered in our labs. Unfortunately while doing
nvidia binary driver testing. :(

Jul 15 04:55:55 elm3b160 smartd[4711]: Device: /dev/sda, Temperature changed -2
Celsius to 34 Celsius since last report
Jul 15 08:49:19 elm3b160 kernel: Unable to handle kernel paging request at
ffff8102becdcd30 RIP:
Jul 15 08:49:19 elm3b160 kernel:  [<ffff8102becdcd30>]
Jul 15 08:49:19 elm3b160 kernel: PGD 8063 PUD 13063 PMD 80000002bec001e3 PTE 0
Jul 15 08:49:19 elm3b160 kernel: Oops: 0011 [1] PREEMPT SMP
Jul 15 08:49:19 elm3b160 kernel: CPU 1
Jul 15 08:49:19 elm3b160 kernel: Modules linked in: nvidia(P) autofs4 hidp
rfcomm l2cap bluetooth sunrpc nf_conntrack_netbios_ns ipt_REJECT
nf_conntrack_ipv4 xt_state iptable_filter ip_tables ip6t_REJECT xt_tcpudp
ip6table_filter ip6_tables x_tables ipv6 dm_mirror dm_mod video sbs i2c_ec dock
button battery asus_acpi ac parport_pc lp parport floppy sg snd_hda_intel pcspkr
snd_hda_codec snd_seq_dummy snd_seq_oss snd_seq_midi_event snd_seq
snd_seq_device snd_pcm_oss snd_mixer_oss snd_pcm snd_timer snd soundcore
snd_page_alloc i2c_i801 i2c_core ata_generic shpchp tg3 rtc_cmos rtc_core
rtc_lib serio_raw ide_cd cdrom ata_piix libata mptsas mptscsih mptbase
scsi_transport_sas sd_mod scsi_mod ext3 jbd ehci_hcd ohci_hcd uhci_hcd
Jul 15 08:49:19 elm3b160 kernel: Pid: 25, comm: softirq-rcu/1 Tainted: P       
2.6.21-31.el5rt #1
Jul 15 08:49:19 elm3b160 kernel: RIP: 0010:[<ffff8102becdcd30>] 
[<ffff8102becdcd30>]
Jul 15 08:49:19 elm3b160 kernel: RSP: 0018:ffff81010f1cde38  EFLAGS: 00010246
Jul 15 08:49:19 elm3b160 kernel: RAX: 0000000000000000 RBX: ffff81028dd96cc0
RCX: 000000000906cd40
Jul 15 08:49:19 elm3b160 kernel: RDX: ffff81010f1cdfd8 RSI: 0000000000000282
RDI: ffff8102b2ede280
Jul 15 08:49:19 elm3b160 kernel: RBP: ffff81010f1cde50 R08: ffff81010f1cc000
R09: 000000000000003c
Jul 15 08:49:19 elm3b160 kernel: R10: ffffffff808e1700 R11: 000000010904a840
R12: ffff8102b2ede280
Jul 15 08:49:19 elm3b160 kernel: R13: 00000000fffffeff R14: 0000000000000001
R15: 0000000000000100
Jul 15 08:49:19 elm3b160 kernel: FS:  0000000000000000(0000)
GS:ffff81010f12f7c0(0000) knlGS:0000000000000000
Jul 15 08:49:19 elm3b160 kernel: CS:  0010 DS: 0018 ES: 0018 CR0: 000000008005003b
Jul 15 08:49:19 elm3b160 kernel: CR2: ffff8102becdcd30 CR3: 00000002b013f000
CR4: 00000000000006e0
Jul 15 08:49:19 elm3b160 kernel: Process softirq-rcu/1 (pid: 25, threadinfo
ffff81010f1cc000, task ffff81010f19a810)
Jul 15 08:49:19 elm3b160 kernel: Stack:  ffffffff802c536f ffffffff809df968
ffff810009054680 ffff81010f1cdeb0
Jul 15 08:49:19 elm3b160 kernel:  ffffffff80295ab0 0000000600000001
000000010f165bf0 ffff810100000032
Jul 15 08:49:19 elm3b160 kernel:  ffff810009054680 ffff81010f165bf0
ffff810009054680 ffff81010f165bf0
Jul 15 08:49:19 elm3b160 kernel: Call Trace:
Jul 15 08:49:19 elm3b160 kernel: 
Jul 15 08:49:19 elm3b160 kernel: 
Jul 15 08:49:19 elm3b160 kernel: Code: 80 2d ff b3 02 81 ff ff 80 b9 58 be 02 81
ff ff 01 00 00 00
Jul 15 08:49:19 elm3b160 kernel: RIP  [<ffff8102becdcd30>]
Jul 15 08:49:19 elm3b160 kernel:  RSP <ffff81010f1cde38>
Jul 15 08:49:19 elm3b160 kernel: CR2: ffff8102becdcd30

Comment 16 IBM Bug Proxy 2007-08-07 03:00:38 UTC

----- Additional Comments From jstultz.com (prefers email at johnstul.com)  2007-08-06 22:57 EDT -------
RH: Might want to take a look at RH bug #248352 (LTC bug #36397) to see if the
restrict irqd patch helps this issue as well. Unfortunately that fix is not a
complete fix and we still need irqd to be able to migrate to avoid starvation,
but it might help narrow down this issue.

Comment 17 IBM Bug Proxy 2007-08-08 22:40:21 UTC

----- Additional Comments From jstultz.com (prefers email at johnstul.com)  2007-08-08 18:34 EDT -------
I'm taking a shot at reproducing this now using the racer script on the -31
kernel. If I can reproduce it I'll try it w/ the irq binding fix from RH bug
#248352 (LTC bug #36397).

Comment 18 IBM Bug Proxy 2007-08-10 20:35:49 UTC

------- Comment From jstultz.com 2007-08-10 16:30 EDT-------
So far I've been unable to reproduce this issue on an LS20 using the ltp racer.

Comment 19 Jeff Burke 2007-08-27 15:25:04 UTC

While testing the 2.6.21-36.el5rt kernel on Snapshot#3 RHEL5.1-Server-20070817.0
I see the following Oops on ibm-taroko.rhts.boston.redhat.com. According to the
RHTS records, this system is a IBM IntelliStation M Pro -[922940Z]- Revision A.

http://rhts.lab.boston.redhat.com/cgi-bin/rhts/jobs.cgi?id=5427 recipe 17814
http://rhts.lab.boston.redhat.com/cgi-bin/rhts/test_log.cgi?id=586470
________________________________________________________________________________
Unable to handle kernel paging request at ffff81003188e380 RIP:
 [<ffff81003188e380>]
PGD 8063 PUD 9063 PMD 80000000318001e3 PTE 0
Oops: 0011 [1] PREEMPT SMP
CPU 1
Modules linked in: autofs4 hidp rfcomm l2cap bluetooth sunrpc ipv6
cpufreq_ondemand dm_multipath video sbs i2c_ec dock button battery
asus_acpi ac lp floppy snd_hda_intel snd_hda_codec snd_seq_dummy
snd_seq_oss snd_seq_midi_event snd_seq snd_seq_device snd_pcm_oss
snd_mixer_oss sg snd_pcm snd_timer snd soundcore snd_page_alloc pcspkr
iTCO_wdt shpchp iTCO_vendor_support i2c_i801 tg3 i2c_core ata_generic
rtc_cmos parport_pc rtc_core parport rtc_lib ide_cd cdrom serio_raw
dm_snapshot dm_zero dm_mirror dm_mod ahci ata_piix libata mptsas
mptscsih mptbase scsi_transport_sas sd_mod scsi_mod ext3 jbd ehci_hcd
ohci_hcd uhci_hcd
Pid: 25, comm: softirq-rcu/1 Not tainted 2.6.21-36.el5rt #1
RIP: 0010:[<ffff81003188e380>]  [<ffff81003188e380>]
RSP: 0000:ffff81003fe55e38  EFLAGS: 00010282
RAX: ffff81003fe21810 RBX: ffff81002ae1d340 RCX: ffff810005731680
RDX: ffff81003fe21810 RSI: ffff810001000000 RDI: ffff810016cc0980
RBP: ffff81003fe55e50 R08: ffff81003fe54000 R09: ffff81003fe41de0
R10: ffffffff808e1700 R11: 7fffffffffffffff R12: ffff810016cc0980
R13: 00000000fffffeff R14: 0000000000000001 R15: 0000000000000100
FS:  0000000000000000(0000) GS:ffff810037cb79c0(0000) knlGS:0000000000000000
CS:  0010 DS: 0018 ES: 0018 CR0: 000000008005003b
CR2: ffff81003188e380 CR3: 000000000ad14000 CR4: 00000000000006e0
Process softirq-rcu/1 (pid: 25, threadinfo ffff81003fe54000, task
ffff81003fe21810)
Stack:  ffffffff802c5056 ffffffff809df968 ffff81000256a600 ffff81003fe55eb0
 ffffffff8029578a 0000000600000001 000000010276dbf0 ffff810000000032
 ffff81000256a600 ffff81000276dbf0 ffff81000256a600 ffff81000276dbf0
Call Trace:

Code: 40 c8 20 3c 00 81 ff ff 00 cd 99 06 00 81 ff ff 01 00 00 00
RIP  [<ffff81003188e380>]
 RSP <ffff81003fe55e38>
CR2: ffff81003188e380
________________________________________________________________________________

Comment 20 Steven Rostedt 2007-09-12 21:29:34 UTC

After lots and lots of hunting for this, my logdev showed the problem:

The file struct has a f_u union of a list head, a rcuhead and what Peter
Zijlstra added, a llist (locked list head).  This is a union so these share the
same memory.

(from logdev)

[ 3447.112218] cpu:0 (true:4330) file_kill:403 reinit fu_llistf=ffff81001e762b00
next=ffff81000222f6c9 func=ffff81002688f440
[ 3447.112223] cpu:0 (true:4330) call_rcu:323 head=ffff81001e762b00
func=ffffffff80292c79 (ffffffff80292c4d)
[ 3447.112377] cpu:1 (ls:4260) file_kill:411 del fu_llistf=ffff81002688f440
next=ffff81001e762b00 func=ffff81002634c640
[ 3447.113118] cpu:0 (softirq-rcu/0:13) rcu_process_callbacks:311
list=ffff81001e762b00 next=0000000000000000 func=ffff81002634c640

The first line shows that the file->f_u.fu_llist (ffff81001e762b00) was deleted,
but it only had it's list initialized. It didn't "delete" it from the list it
was in. I always used the rcuhead next and func as the output, but this is where
it's also being used as llist.

The next line calls the call_rcu to add to the rcu list to run function at the
address ffffffff80292c79.

The third line shows object ffff81002688f440 being deleted from the list. But
you can also see that its next pointer was the (already deleted)
ffff81001e762b00 and the prev pointer is ffff81002634c640.

The forth line shows the rcu callback, calling ffff81002634c640 (prev pointer
from previous line) instead of the function ffffffff80292c79.

The thing is that the del_list does a item->next->prev = item->prev;
item->prev->next = item->next.  So you can see how the corruption happended.

Luckily, Peter Zijlstra has modified this code already for upstream -RT and we
are currently testing a backpork he did to see if it fixes the issue.

Comment 21 Steven Rostedt 2007-09-25 20:29:03 UTC

Created attachment 206091 [details]
back port changes from RT Linux mainline

This is a patch (combined set) from Peter Zijlstra that fixes the issue.

Comment 22 IBM Bug Proxy 2007-09-26 22:26:00 UTC

------- Comment From dvhltc.com 2007-09-26 18:20 EDT-------
I'd like to see a review of the code from Peter and an evaluation of how
confident we are if this is fixed by the end of the week.  Marking required data
accordingly.  if there is more to do come the end of the week, we can document
that here in the bug and update the required date.

------- Comment From dvhltc.com 2007-09-26 18:20 EDT-------
Reassigning to PC.

Comment 23 Clark Williams 2007-10-19 15:06:47 UTC

We've been running these patches since kernel-rt-2.6.21-38 and have not seen the
bug (current release is -43).

I'm inclined to call this one fix.

Note You need to log in before you can comment on or make changes to this bug.