Description of problem: While running the raceer test on hp-dl585-02 x86_64. The system Oops. Unable to handle kernel paging request at ffff8115ccd0e800 RIP: [<ffff8115ccd0e800>] PGD 8063 PUD 60063 PMD 80000015ccc001e3 PTE 0 Oops: 0011 [1] PREEMPT SMP CPU 4 Modules linked in: Pid: 64, comm: softirq-rcu/4 Not tainted 2.6.21-32.el5rt #1 Version-Release number of selected component (if applicable): 2.6.21-31.el5rt How reproducible: Unknown Actual results: Unable to handle kernel paging request at ffff8115ccd0e800 RIP: [<ffff8115ccd0e800>] PGD 8063 PUD 60063 PMD 80000015ccc001e3 PTE 0 Oops: 0011 [1] PREEMPT SMP CPU 4 Modules linked in: nfs nfsd exportfs lockd nfs_acl autofs4 hidp rfcomm l2cap bluetooth sunrpc ipv6 video sbs i2c_ec dock button battery asus_acpi ac parport_pc lp parport floppy pcspkr pata_amd ata_generic libata i2c_amd756 i2c_core amd_rng k8temp hwmon tg3 shpchp serio_raw ide_cd cdrom dm_snapshot dm_zero dm_mirror dm_mod cciss sd_mod scsi_mod ext3 jbd ehci_hcd ohci_hcd uhci_hcd Pid: 64, comm: softirq-rcu/4 Not tainted 2.6.21-32.el5rt #1 RIP: 0010:[<ffff8115ccd0e800>] [<ffff8115ccd0e800>] RSP: 0000:ffff810123eede38 EFLAGS: 00010286 RAX: ffff8115fffbd040 RBX: ffff8112c32f3500 RCX: 000000000000000b RDX: ffff8115fffbd040 RSI: ffff810c000c45f8 RDI: ffff8115ff568300 RBP: ffff810123eede50 R08: ffff810123eec000 R09: 0000000000000038 R10: ffffffff808e1700 R11: ffff81062419be98 R12: ffff8115ff568300 R13: 00000000fffffeff R14: 0000000000000004 R15: 0000000000000100 FS: 00002b9edfcc3f20(0000) GS:ffff810123e21a40(0000) knlGS:00000000f7fce6c0 CS: 0010 DS: 0018 ES: 0018 CR0: 000000008005003b CR2: ffff8115ccd0e800 CR3: 000000016acfc000 CR4: 00000000000006e0 Process softirq-rcu/4 (pid: 64, threadinfo ffff810123eec000, task ffff8115fffbd040) Stack: ffffffff802c53aa ffffffff809df968 ffff810001054600 ffff810123eedeb0 ffffffff80295a34 0000000600000001 00000004000a7bf0 ffff810c00000032 ffff810001054600 ffff810c000a7bf0 ffff810001054600 ffff810c000a7bf0 Call Trace: Code: 00 83 56 ff 15 81 ff ff 40 13 cf f9 03 81 ff ff 01 00 00 00 RIP [<ffff8115ccd0e800>] RSP <ffff810123eede38> CR2: ffff8115ccd0e800 Additional info: This happened while smoke testing the 2.6.21-32.el5rt kernel.
No, that was a different bug that happened during bootup and acme was looking at that one. I grabbed the system, loaded -32 and then loaded LTP. I ran the racer.sh test until my reservation expired and was kicked off, didn't see any panics. I'll try again later today. Jeff, any advice on setup other than loading the latest -rt kernel and then loading LTP and running racer.sh?
Clark, Not sure what log you looked at but it was loaded with panics. This was your job correct? http://rhts.lab.boston.redhat.com/cgi-bin/rhts/jobs.cgi?id=2271 Here is a link to the log: http://rhts.lab.boston.redhat.com/cgi-bin/rhts/test_log.cgi?id=268893 Also you may want to read the email from Steve re: [PATCH] Don't BUG_ON in exit.c Jeff
Ah, I didn't look at my log, just a quick look at the test output on a local machine; saw the output from racer and the killed message and assumed that it had run until the reservation expired. Scrolling back I see that problems started around 2200 Tuesday night (about 7 hours into the run). I see an Oops being reported. Unfortunately not much useful there since it doesn't match up with the panic in the logs. I'm going to add Steven's BUG_ON patch and try again with -33.
Created attachment 158902 [details] Traceback from panic on 2.6.21-33.el5rt This is a traceback from running the racer.sh script from LTP for approximately 8 hours on hp-dl587-02.rhts.boston.redhat.com, using the 2.6.21-33.el5rt kernel (contains the BUG_ON removal patch from srostedt)
Created attachment 159057 [details] boot log and traceback from second panic This is the boot log and traceback from a panic on the system hp-dl585g2-01.rhts.boston.redhat.com.
In the traceback attached to Comment #11, the problem occured at some point after 6 hours of runtime. General similarity to the first panic in that the problem occured in task_rq_lock() called by softirq-rcu. I'm beginning to think that it's not a problem in softirqs, but rather something whacking a task structure before calling a softirq, since it looks like task_rq_lock faults while trying to use something out of the task_struct that was passed to it.
While trying to generate a kernel dump with kernel-rt-2.6.21-34.el5rt, I saw the following message on the console of a DL585: hdb: cdrom_pc_intr: The drive appears confused (ireason = 0x01). Trying to recover by ending request. As far as I know there is no disc in the CD tray of the test system, therefore there should be no request to cancel. This is borne out by the fact the system was non-responsive after this console message and required a reset.
Created attachment 159599 [details] panic with logdev prints This is serial console output from a panic running a 2.6.21-34 kernel with rostedts logdev patch and a print in the call_rcu function and the process_rcu_callbacks function
----- Additional Comments From jstultz.com (prefers email at johnstul.com) 2007-07-23 16:48 EDT ------- Here is a very similar Oops we triggered in our labs. Unfortunately while doing nvidia binary driver testing. :( Jul 15 04:55:55 elm3b160 smartd[4711]: Device: /dev/sda, Temperature changed -2 Celsius to 34 Celsius since last report Jul 15 08:49:19 elm3b160 kernel: Unable to handle kernel paging request at ffff8102becdcd30 RIP: Jul 15 08:49:19 elm3b160 kernel: [<ffff8102becdcd30>] Jul 15 08:49:19 elm3b160 kernel: PGD 8063 PUD 13063 PMD 80000002bec001e3 PTE 0 Jul 15 08:49:19 elm3b160 kernel: Oops: 0011 [1] PREEMPT SMP Jul 15 08:49:19 elm3b160 kernel: CPU 1 Jul 15 08:49:19 elm3b160 kernel: Modules linked in: nvidia(P) autofs4 hidp rfcomm l2cap bluetooth sunrpc nf_conntrack_netbios_ns ipt_REJECT nf_conntrack_ipv4 xt_state iptable_filter ip_tables ip6t_REJECT xt_tcpudp ip6table_filter ip6_tables x_tables ipv6 dm_mirror dm_mod video sbs i2c_ec dock button battery asus_acpi ac parport_pc lp parport floppy sg snd_hda_intel pcspkr snd_hda_codec snd_seq_dummy snd_seq_oss snd_seq_midi_event snd_seq snd_seq_device snd_pcm_oss snd_mixer_oss snd_pcm snd_timer snd soundcore snd_page_alloc i2c_i801 i2c_core ata_generic shpchp tg3 rtc_cmos rtc_core rtc_lib serio_raw ide_cd cdrom ata_piix libata mptsas mptscsih mptbase scsi_transport_sas sd_mod scsi_mod ext3 jbd ehci_hcd ohci_hcd uhci_hcd Jul 15 08:49:19 elm3b160 kernel: Pid: 25, comm: softirq-rcu/1 Tainted: P 2.6.21-31.el5rt #1 Jul 15 08:49:19 elm3b160 kernel: RIP: 0010:[<ffff8102becdcd30>] [<ffff8102becdcd30>] Jul 15 08:49:19 elm3b160 kernel: RSP: 0018:ffff81010f1cde38 EFLAGS: 00010246 Jul 15 08:49:19 elm3b160 kernel: RAX: 0000000000000000 RBX: ffff81028dd96cc0 RCX: 000000000906cd40 Jul 15 08:49:19 elm3b160 kernel: RDX: ffff81010f1cdfd8 RSI: 0000000000000282 RDI: ffff8102b2ede280 Jul 15 08:49:19 elm3b160 kernel: RBP: ffff81010f1cde50 R08: ffff81010f1cc000 R09: 000000000000003c Jul 15 08:49:19 elm3b160 kernel: R10: ffffffff808e1700 R11: 000000010904a840 R12: ffff8102b2ede280 Jul 15 08:49:19 elm3b160 kernel: R13: 00000000fffffeff R14: 0000000000000001 R15: 0000000000000100 Jul 15 08:49:19 elm3b160 kernel: FS: 0000000000000000(0000) GS:ffff81010f12f7c0(0000) knlGS:0000000000000000 Jul 15 08:49:19 elm3b160 kernel: CS: 0010 DS: 0018 ES: 0018 CR0: 000000008005003b Jul 15 08:49:19 elm3b160 kernel: CR2: ffff8102becdcd30 CR3: 00000002b013f000 CR4: 00000000000006e0 Jul 15 08:49:19 elm3b160 kernel: Process softirq-rcu/1 (pid: 25, threadinfo ffff81010f1cc000, task ffff81010f19a810) Jul 15 08:49:19 elm3b160 kernel: Stack: ffffffff802c536f ffffffff809df968 ffff810009054680 ffff81010f1cdeb0 Jul 15 08:49:19 elm3b160 kernel: ffffffff80295ab0 0000000600000001 000000010f165bf0 ffff810100000032 Jul 15 08:49:19 elm3b160 kernel: ffff810009054680 ffff81010f165bf0 ffff810009054680 ffff81010f165bf0 Jul 15 08:49:19 elm3b160 kernel: Call Trace: Jul 15 08:49:19 elm3b160 kernel: Jul 15 08:49:19 elm3b160 kernel: Jul 15 08:49:19 elm3b160 kernel: Code: 80 2d ff b3 02 81 ff ff 80 b9 58 be 02 81 ff ff 01 00 00 00 Jul 15 08:49:19 elm3b160 kernel: RIP [<ffff8102becdcd30>] Jul 15 08:49:19 elm3b160 kernel: RSP <ffff81010f1cde38> Jul 15 08:49:19 elm3b160 kernel: CR2: ffff8102becdcd30
----- Additional Comments From jstultz.com (prefers email at johnstul.com) 2007-08-06 22:57 EDT ------- RH: Might want to take a look at RH bug #248352 (LTC bug #36397) to see if the restrict irqd patch helps this issue as well. Unfortunately that fix is not a complete fix and we still need irqd to be able to migrate to avoid starvation, but it might help narrow down this issue.
----- Additional Comments From jstultz.com (prefers email at johnstul.com) 2007-08-08 18:34 EDT ------- I'm taking a shot at reproducing this now using the racer script on the -31 kernel. If I can reproduce it I'll try it w/ the irq binding fix from RH bug #248352 (LTC bug #36397).
------- Comment From jstultz.com 2007-08-10 16:30 EDT------- So far I've been unable to reproduce this issue on an LS20 using the ltp racer.
While testing the 2.6.21-36.el5rt kernel on Snapshot#3 RHEL5.1-Server-20070817.0 I see the following Oops on ibm-taroko.rhts.boston.redhat.com. According to the RHTS records, this system is a IBM IntelliStation M Pro -[922940Z]- Revision A. http://rhts.lab.boston.redhat.com/cgi-bin/rhts/jobs.cgi?id=5427 recipe 17814 http://rhts.lab.boston.redhat.com/cgi-bin/rhts/test_log.cgi?id=586470 ________________________________________________________________________________ Unable to handle kernel paging request at ffff81003188e380 RIP: [<ffff81003188e380>] PGD 8063 PUD 9063 PMD 80000000318001e3 PTE 0 Oops: 0011 [1] PREEMPT SMP CPU 1 Modules linked in: autofs4 hidp rfcomm l2cap bluetooth sunrpc ipv6 cpufreq_ondemand dm_multipath video sbs i2c_ec dock button battery asus_acpi ac lp floppy snd_hda_intel snd_hda_codec snd_seq_dummy snd_seq_oss snd_seq_midi_event snd_seq snd_seq_device snd_pcm_oss snd_mixer_oss sg snd_pcm snd_timer snd soundcore snd_page_alloc pcspkr iTCO_wdt shpchp iTCO_vendor_support i2c_i801 tg3 i2c_core ata_generic rtc_cmos parport_pc rtc_core parport rtc_lib ide_cd cdrom serio_raw dm_snapshot dm_zero dm_mirror dm_mod ahci ata_piix libata mptsas mptscsih mptbase scsi_transport_sas sd_mod scsi_mod ext3 jbd ehci_hcd ohci_hcd uhci_hcd Pid: 25, comm: softirq-rcu/1 Not tainted 2.6.21-36.el5rt #1 RIP: 0010:[<ffff81003188e380>] [<ffff81003188e380>] RSP: 0000:ffff81003fe55e38 EFLAGS: 00010282 RAX: ffff81003fe21810 RBX: ffff81002ae1d340 RCX: ffff810005731680 RDX: ffff81003fe21810 RSI: ffff810001000000 RDI: ffff810016cc0980 RBP: ffff81003fe55e50 R08: ffff81003fe54000 R09: ffff81003fe41de0 R10: ffffffff808e1700 R11: 7fffffffffffffff R12: ffff810016cc0980 R13: 00000000fffffeff R14: 0000000000000001 R15: 0000000000000100 FS: 0000000000000000(0000) GS:ffff810037cb79c0(0000) knlGS:0000000000000000 CS: 0010 DS: 0018 ES: 0018 CR0: 000000008005003b CR2: ffff81003188e380 CR3: 000000000ad14000 CR4: 00000000000006e0 Process softirq-rcu/1 (pid: 25, threadinfo ffff81003fe54000, task ffff81003fe21810) Stack: ffffffff802c5056 ffffffff809df968 ffff81000256a600 ffff81003fe55eb0 ffffffff8029578a 0000000600000001 000000010276dbf0 ffff810000000032 ffff81000256a600 ffff81000276dbf0 ffff81000256a600 ffff81000276dbf0 Call Trace: Code: 40 c8 20 3c 00 81 ff ff 00 cd 99 06 00 81 ff ff 01 00 00 00 RIP [<ffff81003188e380>] RSP <ffff81003fe55e38> CR2: ffff81003188e380 ________________________________________________________________________________
After lots and lots of hunting for this, my logdev showed the problem: The file struct has a f_u union of a list head, a rcuhead and what Peter Zijlstra added, a llist (locked list head). This is a union so these share the same memory. (from logdev) [ 3447.112218] cpu:0 (true:4330) file_kill:403 reinit fu_llistf=ffff81001e762b00 next=ffff81000222f6c9 func=ffff81002688f440 [ 3447.112223] cpu:0 (true:4330) call_rcu:323 head=ffff81001e762b00 func=ffffffff80292c79 (ffffffff80292c4d) [ 3447.112377] cpu:1 (ls:4260) file_kill:411 del fu_llistf=ffff81002688f440 next=ffff81001e762b00 func=ffff81002634c640 [ 3447.113118] cpu:0 (softirq-rcu/0:13) rcu_process_callbacks:311 list=ffff81001e762b00 next=0000000000000000 func=ffff81002634c640 The first line shows that the file->f_u.fu_llist (ffff81001e762b00) was deleted, but it only had it's list initialized. It didn't "delete" it from the list it was in. I always used the rcuhead next and func as the output, but this is where it's also being used as llist. The next line calls the call_rcu to add to the rcu list to run function at the address ffffffff80292c79. The third line shows object ffff81002688f440 being deleted from the list. But you can also see that its next pointer was the (already deleted) ffff81001e762b00 and the prev pointer is ffff81002634c640. The forth line shows the rcu callback, calling ffff81002634c640 (prev pointer from previous line) instead of the function ffffffff80292c79. The thing is that the del_list does a item->next->prev = item->prev; item->prev->next = item->next. So you can see how the corruption happended. Luckily, Peter Zijlstra has modified this code already for upstream -RT and we are currently testing a backpork he did to see if it fixes the issue.
Created attachment 206091 [details] back port changes from RT Linux mainline This is a patch (combined set) from Peter Zijlstra that fixes the issue.
------- Comment From dvhltc.com 2007-09-26 18:20 EDT------- I'd like to see a review of the code from Peter and an evaluation of how confident we are if this is fixed by the end of the week. Marking required data accordingly. if there is more to do come the end of the week, we can document that here in the bug and update the required date. ------- Comment From dvhltc.com 2007-09-26 18:20 EDT------- Reassigning to PC.
We've been running these patches since kernel-rt-2.6.21-38 and have not seen the bug (current release is -43). I'm inclined to call this one fix.