237481 – Oops: 0011 [1] PREEMPT SMP

Bug 237481 - Oops: 0011 [1] PREEMPT SMP

Summary: Oops: 0011 [1] PREEMPT SMP

Keywords:
Status:	CLOSED NOTABUG
Alias:	None
Product:	Red Hat Enterprise MRG
Classification:	Red Hat
Component:	realtime-kernel
Sub Component:
Version:	1.0
Hardware:	All
OS:	Linux
Priority:	medium
Severity:	medium
Target Milestone:	---
Target Release:	---
Assignee:	Arnaldo Carvalho de Melo
QA Contact:
Docs Contact:
URL:
Whiteboard:
Depends On:
Blocks:
TreeView+	depends on / blocked

Reported:	2007-04-23 14:10 UTC by Rod Nayfield
Modified:	2008-02-27 19:56 UTC (History)
CC List:	3 users (show)
Fixed In Version:
Doc Type:	Bug Fix
Doc Text:
Clone Of:
Environment:
Last Closed:	2007-07-19 19:28:00 UTC
Target Upstream Version:
Embargoed:

Attachments	(Terms of Use)

Description Rod Nayfield 2007-04-23 14:10:56 UTC

Description of problem:

oops

Version-Release number of selected component (if applicable):

[root@betabox log]# uname -r
2.6.20-0119.rt8

Steps to Reproduce:
1. leave box alone, come back and see error
  
Actual results:

oops

Expected results:

no oops

Comment 1 Rod Nayfield 2007-04-23 14:11:12 UTC

Apr 23 07:27:01 betabox kernel: stopped custom tracer.
Apr 23 07:27:01 betabox kernel: Unable to handle kernel paging request at
ffff810037e45530 RIP: 
Apr 23 07:27:01 betabox kernel:  [<ffff810037e45530>]
Apr 23 07:27:01 betabox kernel: PGD 8063 PUD 9063 PMD 8000000037e001e3 PTE
ffff810037fff800
Apr 23 07:27:01 betabox kernel: Oops: 0011 [1] PREEMPT SMP 
Apr 23 07:27:01 betabox kernel: CPU 1 
Apr 23 07:27:01 betabox kernel: Modules linked in: i915 drm hidp rfcomm l2cap
bluetooth sunrpc ipv6 video sbs i2c_ec dock button battery asus_acpi backlight
ac lp snd_hda_intel snd_hda_codec snd_seq_dummy snd_seq_oss snd_seq_midi_event
sr_mod snd_seq cdrom snd_seq_device sg snd_pcm_oss snd_mixer_oss snd_pcm
snd_timer snd pcspkr soundcore parport_pc snd_page_alloc i2c_i801 shpchp
pata_marvell i2c_core iTCO_wdt iTCO_vendor_support parport e1000 ohci1394
ieee1394 serio_raw dm_snapshot dm_zero dm_mirror dm_mod ata_piix ata_generic
libata sd_mod scsi_mod ext3 jbd ehci_hcd ohci_hcd uhci_hcd
Apr 23 07:27:01 betabox kernel: Pid: 22, comm: softirq-tasklet Not tainted
2.6.20-0119.rt8 #1
Apr 23 07:27:01 betabox kernel: RIP: 0010:[<ffff810037e45530>]  [<ffff810037e45530>]
Apr 23 07:27:01 betabox kernel: RSP: 0018:ffff81007d645e28  EFLAGS: 00010282
Apr 23 07:27:01 betabox kernel: RAX: ffff81007d643560 RBX: ffff81006f7b9440 RCX:
ffff8100688ddd00
Apr 23 07:27:01 betabox kernel: RDX: ffff81007d643560 RSI: ffff810001000068 RDI:
ffff81007bd92d40
Apr 23 07:27:01 betabox kernel: RBP: ffff81007d645e40 R08: ffff81007d644000 R09:
0000000000000012
Apr 23 07:27:01 betabox kernel: R10: ffffffff808e6ac0 R11: ffff81007e56d090 R12:
ffff81007bd92d40
Apr 23 07:27:01 betabox kernel: R13: 00000000000f4240 R14: 0000000000000000 R15:
0000000000000001
Apr 23 07:27:01 betabox kernel: FS:  0000000000000000(0000)
GS:ffff81007e53d9c0(0000) knlGS:0000000000000000
Apr 23 07:27:01 betabox kernel: CS:  0010 DS: 0018 ES: 0018 CR0: 000000008005003b
Apr 23 07:27:01 betabox kernel: CR2: ffff810037e45530 CR3: 0000000072de2000 CR4:
00000000000006e0
Apr 23 07:27:01 betabox kernel: Process softirq-tasklet (pid: 22, threadinfo
ffff81007d644000, task ffff81007d643560)
Apr 23 07:27:01 betabox kernel: Stack:  ffffffff802c99fd ffffffff80a7e7d0
00000000000f4240 ffff81007d645e70
Apr 23 07:27:01 betabox kernel:  ffffffff80294f6e ffff81000407cb88
0000000000000020 ffff81000407cb88
Apr 23 07:27:01 betabox kernel:  ffffffff8029521b ffff81007d645e80
ffffffff802950cd ffff81007d645eb0
Apr 23 07:27:01 betabox kernel: Call Trace:
Apr 23 07:27:01 betabox kernel: Inexact backtrace:
Apr 23 07:27:01 betabox kernel:  [<ffffffff802c99fd>]
rcu_process_callbacks+0x5e/0x6b
Apr 23 07:27:01 betabox kernel:  [<ffffffff80294f6e>] __tasklet_action+0xea/0x13d
Apr 23 07:27:01 betabox kernel:  [<ffffffff8029521b>] ksoftirqd+0x0/0x1e3
Apr 23 07:27:01 betabox kernel:  [<ffffffff802950cd>] tasklet_action+0x27/0x29
Apr 23 07:27:01 betabox kernel:  [<ffffffff80295321>] ksoftirqd+0x106/0x1e3
Apr 23 07:27:01 betabox kernel:  [<ffffffff8023405e>] kthread+0xee/0x11a
Apr 23 07:27:01 betabox kernel:  [<ffffffff802600f8>] child_rip+0xa/0x12
Apr 23 07:27:01 betabox kernel:  [<ffffffff80233f70>] kthread+0x0/0x11a
Apr 23 07:27:01 betabox kernel:  [<ffffffff802600ee>] child_rip+0x0/0x12
Apr 23 07:27:01 betabox kernel: 
Apr 23 07:27:01 betabox kernel: ---------------------------
Apr 23 07:27:01 betabox kernel: | preempt count: 00000001 ]
Apr 23 07:27:01 betabox kernel: | 1-level deep critical section nesting:
Apr 23 07:27:01 betabox kernel: ----------------------------------------
Apr 23 07:27:01 betabox kernel: .. [<ffffffff8026657b>] ....
__spin_trylock+0x13/0x4f
Apr 23 07:27:01 betabox kernel: .....[<ffffffff80267111>] ..   ( <=
oops_begin+0x23/0x6f)
Apr 23 07:27:01 betabox kernel: 
Apr 23 07:27:01 betabox kernel: 
Apr 23 07:27:01 betabox kernel: Code: 40 2d d9 7b 00 81 ff ff 00 c3 de 37 00 81
ff ff 01 00 00 00 
Apr 23 07:27:01 betabox kernel: RIP  [<ffff810037e45530>]
Apr 23 07:27:02 betabox kernel:  RSP <ffff81007d645e28>
Apr 23 07:27:02 betabox kernel: CR2: ffff810037e45530

Comment 2 Rod Nayfield 2007-04-23 14:11:35 UTC

messages:Apr 23 07:27:01 betabox kernel: Oops: 0011 [1] PREEMPT SMP 
messages.1:Apr 19 05:23:26 betabox kernel: Oops: 0011 [1] PREEMPT SMP 
messages.4:Mar 31 22:25:29 betabox kernel: Oops: 0011 [1] PREEMPT SMP

Comment 3 Rod Nayfield 2007-04-23 14:38:20 UTC

FYI it is the same rcu spot in each oops

# grep rcu_process /var/log/messages*
/var/log/messages:Apr 23 07:27:01 betabox kernel:  [<ffffffff802c99fd>]
rcu_process_callbacks+0x5e/0x6b
/var/log/messages.1:Apr 19 05:23:26 betabox kernel:  [<ffffffff802c99fd>]
rcu_process_callbacks+0x5e/0x6b
/var/log/messages.4:Mar 31 22:25:29 betabox kernel:  [<ffffffff802c99fd>]
rcu_process_callbacks+0x5e/0x6b

Comment 4 keith mannth 2007-04-23 17:18:20 UTC

Are all of the Oops procedded by 

"Apr 23 07:27:01 betabox kernel: stopped custom tracer." or is is just juck that
this happend right before this oops? 

What is the "custom tracer" and does stoping this service cause this error?

Comment 5 Rod Nayfield 2007-04-23 19:44:47 UTC

The message is there before all 3 Oops.

I am not sure where that message comes from, perhaps it is the latency tracer?  

There was no user intervention when the system Oopsed

Comment 6 IBM Bug Proxy 2007-04-24 02:15:34 UTC

----- Additional Comments From jstultz.com (prefers email at johnstul.com)  2007-04-23 22:10 EDT -------
Paul McKenney: Could you take a look at this? StephenR thinks its an RCU issue.

Comment 7 keith mannth 2007-04-25 00:11:25 UTC

So stoping the latency tracer is causing this error?  

Can someone provide details about the "stopped custom tracer"?

Comment 8 IBM Bug Proxy 2007-05-13 00:35:50 UTC

----- Additional Comments From Paul.McKenney.com (prefers email at paulmck.com)  2007-05-12 20:31 EDT -------
(1) Was utrace in use? (there have been some cases where call_rcu() is called
twice on the same data element -- bad idea)

(2) Was oprofile in use?  (Working on getting NMI-safe RCU patch against current
-rt on the off-chance that this is the issue.)

Comment 9 Rod Nayfield 2007-05-15 00:33:43 UTC

No, this was a desktop development machine sitting mainly idle save for some
desktop things (browser, rhythmbox).

Comment 10 Arnaldo Carvalho de Melo 2007-05-18 12:45:36 UTC

The "stopped custom tracer" message comes from:

./lib/debug_locks.c

/*
 * Generic 'turn off all lock debugging' function:
 */
int debug_locks_off(void)
{
        if (xchg(&debug_locks, 0)) {
#ifdef CONFIG_DEBUG_RT_MUTEXES
                if (spin_is_locked(&current->pi_lock))
                        spin_unlock(&current->pi_lock);
#endif
                if (!debug_locks_silent) {
                        stop_trace();
                        user_trace_stop();
                        printk("stopped custom tracer.\n");
                        console_verbose();
                        return 1;
                }
        }
        return 0;
}

And that is called in many places. In some it would be helpful to use the kernel
command line option "debug_locks_verbose=1" so that we get some WARN_ON messages
from DEBUG_LOCKS_WARN_ON() cases where debug_locks_off() is called.

Rod, It may well be just a coincidence that this message appears, but can you
please use it to see if we get some more info?

I'm continuing the investigation meanwhile.

Comment 11 Rod Nayfield 2007-05-18 12:57:24 UTC

rebooting the system now with that cmdline.

Will let you know if it oops-es again

Comment 12 Arnaldo Carvalho de Melo 2007-05-18 13:11:11 UTC

Nah, forget about it, debug_locks_verbose already is 1 in your setup, if not we
would't see the "stopped custom tracer" at all. So the "stopped custom tracer"
doesn't buy us nothing, it merely says that from that point on the latency
tracer is not in use anymore because a locking bug happened and (in
./lib/debug_locks.c):

/*
 * We want to turn all lock-debugging facilities on/off at once,
 * via a global flag. The reason is that once a single bug has been
 * detected and reported, there might be cascade of followup bugs
 * that would just muddy the log. So we report the first one and
 * shut up after that.
 */
int debug_locks = 1;

I.e. the xchg() in debug_locks_off().

So the only conclusion so far is that this is a locking bug detected by the
lock-debugging facilities.

Comment 13 Arnaldo Carvalho de Melo 2007-05-18 13:27:17 UTC

And that conclusion is flawed, as:

void oops_enter(void)
{
        debug_locks_off(); /* can't trust the integrity of the kernel anymore */
        do_oops_enter_exit();
}

Is what probably is happening, i.e. something unrelated to lock-debugging
facilities causes an oops and the oops handler first calls debug_locks_off(), we
get that "stopped custom tracer" message and the oops is printed.

So this message is not of help.

Comment 14 Arnaldo Carvalho de Melo 2007-05-25 13:47:50 UTC

Rod, have you seen this again? I guess we can expire this one as it doesn't seem
to be happening anymore.

Comment 15 Arnaldo Carvalho de Melo 2007-05-25 13:49:50 UTC

Further information: Rod kept using the same kernel for over a month on the same
machine, tried to reproduce it several times and the messge last happened over a
month ago. Rod, can you confirm or complement/correct this information?

Comment 16 Arnaldo Carvalho de Melo 2007-05-30 18:30:25 UTC

Just got one of those, first time ever on this Dell Poweredge1950 dual dual core
Xeon machine:

hub 5-7:1.0: 4 ports detected
SELinux:  Disabled at runtime.
audit(1180548100.564:2): selinux=0 auid=4294967295
stopped custom tracer.
Unable to handle kernel paging request at ffff81012f885530 RIP:
 [<ffff81012f885530>]
PGD 8063 PUD d063 PMD 800000012f8001e3 PTE 0
Oops: 0011 [1] PREEMPT SMP
CPU 1
Modules linked in: oprofile autofs4 hidp rfcomm l2cap bluetooth sunrpc ipv6
dm_multipath video sbs id
Pid: 22, comm: softirq-tasklet Not tainted 2.6.21-15.el5rt #1
RIP: 0010:[<ffff81012f885530>]  [<ffff81012f885530>]
RSP: 0018:ffff81012fd13df8  EFLAGS: 00010282
RAX: ffff81012fce1040 RBX: ffff81012e346600 RCX: ffff810121c5d940
RDX: ffff81012fce1040 RSI: ffff810001947120 RDI: ffff81011e815480
RBP: ffff81012fd13e10 R08: ffff81012fd12000 R09: ffff81012fd07de0
R10: ffffffff816e2700 R11: 000001122868dcc0 R12: ffff81011e815480
R13: 0000000000000000 R14: 0000000000000001 R15: 0000000000000020
FS:  0000000000000000(0000) GS:ffff81012fc9b7c0(0000) knlGS:0000000000000000
CS:  0010 DS: 0018 ES: 0018 CR0: 000000008005003b
CR2: ffff81012f885530 CR3: 00000001220d8000 CR4: 00000000000006e0
Process softirq-tasklet (pid: 22, threadinfo ffff81012fd12000, task
ffff81012fce1040)
Stack:  ffffffff810c68e6 ffffffff81858910 00000000000f4240 ffff81012fd13e40
 ffffffff81094a9e ffff8100079f9548 ffffffff817e2968 ffff8100079f9548
 00000000ffffffdf ffff81012fd13e50 ffffffff81094f76 ffff81012fd13eb0
Call Trace:


Code: 80 54 81 1e 01 81 ff ff 00 98 69 2e 01 81 ff ff 01 00 00 00
RIP  [<ffff81012f885530>]
 RSP <ffff81012fd13df8>
CR2: ffff81012f885530

This with:
[root@mica ~]# uname -a
Linux mica.ghostprotocols.net 2.6.21-15.el5rt #1 SMP PREEMPT RT Thu May 24
11:37:09 EDT 2007 x86_64 x86_64 x86_64 GNU/Linux
[root@mica ~]#

This was while trying to get oprofile working after 2.6.21, will see if I can
reproduce this reliably.

Comment 17 Arnaldo Carvalho de Melo 2007-05-30 18:42:42 UTC

Before rebooting I tried to stop oprofile, and got oprofiled stuck in D state.
sysrq-T shows this:

oprofiled     D [ffff81011e239040] 000000010019aa5c     0  3843      1 (L-TLB)
 ffff81011a105cd8 0000000000000046 0000000000000000 ffffffff81019e19
 0000000000000000 ffff81011e239290 ffff81011e239040 ffff81012fd62040
 000001e32c4f80e7 0000000000033e5e 0000000200000001 ffffffff8100a5f7
Call Trace:
 [<ffffffff810646ec>] schedule+0xe2/0x102
 [<ffffffff8106478e>] wait_for_completion+0x82/0xc1
 [<ffffffff810c66cf>] synchronize_rcu+0x35/0x3c
 [<ffffffff8109c858>] atomic_notifier_chain_unregister+0x2e/0x36
 [<ffffffff810930f1>] task_handoff_unregister+0x13/0x15
 [<ffffffff88415d7f>] :oprofile:sync_stop+0x3b/0x42
 [<ffffffff88415062>] :oprofile:oprofile_shutdown+0x15/0x46
 [<ffffffff88415e32>] :oprofile:event_buffer_release+0x16/0x46
 [<ffffffff810128b3>] __fput+0xac/0x179
 [<ffffffff8102e284>] fput+0x14/0x16
 [<ffffffff810250fc>] filp_close+0x66/0x71
 [<ffffffff81039e0a>] put_files_struct+0x6d/0xc1
 [<ffffffff81015753>] do_exit+0x27c/0x894
 [<ffffffff8104a86a>] cpuset_exit+0x0/0x6e
 [<ffffffff8104e407>] sys_exit_group+0x12/0x14
 [<ffffffff8105f11e>] system_call+0x7e/0x83
 [<000000326f094a5f>]

Comment 18 Arnaldo Carvalho de Melo 2007-05-30 19:57:25 UTC

Couldn't reproduce it anymore :-(

Comment 19 IBM Bug Proxy 2007-06-01 03:45:49 UTC

----- Additional Comments From nivedita.com (prefers email at niv.com)  2007-05-31 23:40 EDT -------
Actually, I did see this on my x3455 while running stress tests,
but it's not clear when/what provoked the tests. Connectivity
to the box was lost.

That was on the 2.6.21-4.el5rt kernel, however.  I have updated
to 2.6.21-14.el5rt today, and am running with nohz=off. So far,
have not seen it yet. About to kick off stress tests, will update
the bug by tmrw if this is reproduced.

From /var/log/messages:
=========================
May 29 12:06:43 rt-ash kernel: stopped custom tracer.
May 29 12:06:43 rt-ash kernel: Unable to handle kernel paging request at
ffff8101afe4b530 RIP:
May 29 12:06:43 rt-ash kernel:  [<ffff8101afe4b530>]
May 29 12:06:43 rt-ash kernel: PGD 8063 PUD f063 PMD 80000001afe001e3 PTE 0
May 29 12:06:43 rt-ash kernel: Oops: 0011 [1] PREEMPT SMP
May 29 12:06:43 rt-ash kernel: CPU 1
May 29 12:06:43 rt-ash kernel: Modules linked in: autofs4 hidp rfcomm l2cap
bluetooth sunrpc nf_conntrack_netbios_ns ipt_REJECT nf_conntrack_ipv4 xt_state
iptable_filter ip_tables ip6t_REJECT xt_tcpudp ip6table_filter ip6_tables
x_tables ipv6 dm_mirror dm_mod video sbs i2c_ec dock button battery asus_acpi ac
parport_pc lp parport sg ata_generic pcspkr pata_serverworks i2c_core shpchp
k8temp hwmon tg3 rtc_cmos rtc_core rtc_lib ide_cd serio_raw cdrom sata_svw
libata sd_mod scsi_mod ext3 jbd ehci_hcd ohci_hcd uhci_hcd
May 29 12:06:43 rt-ash kernel: Pid: 22, comm: softirq-tasklet Not tainted
2.6.21-4.el5rt #1
May 29 12:06:43 rt-ash kernel: RIP: 0010:[<ffff8101afe4b530>]  [<ffff8101afe4b530>]
May 29 12:06:43 rt-ash kernel: RSP: 0018:ffff8101b311bdf8  EFLAGS: 00010282
May 29 12:06:43 rt-ash kernel: RAX: ffff8101afcc0040 RBX: ffff8101afd5a940 RCX:
ffff8101afb76640
May 29 12:06:43 rt-ash kernel: RDX: ffff8101afcc0040 RSI: ffff81000194b1a8 RDI:
ffff8101a909ab40
May 29 12:06:43 rt-ash kernel: RBP: ffff8101b311be10 R08: ffff8101b311a000 R09:
ffff8101b3111de0
May 29 12:06:43 rt-ash kernel: R10: ffffffff816e3700 R11: ffff8101afcbd090 R12:
ffff8101a909ab40
May 29 12:06:43 rt-ash kernel: R13: 0000000000000000 R14: 0000000000000001 R15:
0000000000000020
May 29 12:06:43 rt-ash kernel: FS:  00002b5df4d082a0(0000)
GS:ffff8101afc85b40(0000) knlGS:00000000f7ef36c0
May 29 12:06:43 rt-ash kernel: CS:  0010 DS: 0018 ES: 0018 CR0: 000000008005003b
May 29 12:06:43 rt-ash kernel: CR2: ffff8101afe4b530 CR3: 000000022b37c000 CR4:
00000000000006e0
May 29 12:06:43 rt-ash kernel: Process softirq-tasklet (pid: 22, threadinfo
ffff8101b311a000, task ffff8101afcc0040)
May 29 12:06:43 rt-ash kernel: Stack:  ffffffff810c6a00 ffffffff8185c910
00000000000f4240 ffff8101b311be40
May 29 12:06:43 rt-ash kernel:  ffffffff81094ce4 ffff81000aa3c8c8
ffffffff817e62e8 ffff81000aa3c8c8
May 29 12:06:43 rt-ash kernel:  00000000ffffffdf ffff8101b311be50
ffffffff810951bc ffff8101b311beb0
May 29 12:06:43 rt-ash kernel: Call Trace:
May 29 12:06:43 rt-ash kernel:
May 29 12:06:43 rt-ash kernel:
May 29 12:06:43 rt-ash kernel: Code: 40 ab 09 a9 01 81 ff ff 40 f8 d4 2f 02 81
ff ff 01 00 00 00
May 29 12:06:43 rt-ash kernel: RIP  [<ffff8101afe4b530>]
May 29 12:06:43 rt-ash kernel:  RSP <ffff8101b311bdf8>
May 29 12:06:43 rt-ash kernel: CR2: ffff8101afe4b530
May 29 13:18:10 rt-ash dhclient: DHCPREQUEST on eth0 to 9.3.233.235 port 67

ROOT:rt-ash:log]> cat /proc/cpuinfo
processor       : 0
vendor_id       : AuthenticAMD
cpu family      : 15
model           : 65
model name      : Dual-Core AMD Opteron(tm) Processor 2218
stepping        : 2
cpu MHz         : 2593.498
cache size      : 1024 KB
physical id     : 0
siblings        : 2
core id         : 0
cpu cores       : 2
fpu             : yes
fpu_exception   : yes
cpuid level     : 1
wp              : yes
flags           : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov
pat pse36 clflush mmx fxsr sse sse2 ht syscall nx mmxext fxsr_opt rdtscp lm
3dnowext 3dnow pni cx16 lahf_lm cmp_legacy svm extapic cr8_legacy
bogomips        : 5188.80
TLB size        : 1024 4K pages
clflush size    : 64
cache_alignment : 64
address sizes   : 40 bits physical, 48 bits virtual
power management: ts fid vid ttp tm stc

processor       : 1
vendor_id       : AuthenticAMD
cpu family      : 15
model           : 65
model name      : Dual-Core AMD Opteron(tm) Processor 2218
stepping        : 2
cpu MHz         : 2593.498
cache size      : 1024 KB
physical id     : 0
siblings        : 2
core id         : 1
cpu cores       : 2
fpu             : yes
fpu_exception   : yes
cpuid level     : 1
wp              : yes
flags           : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov
pat pse36 clflush mmx fxsr sse sse2 ht syscall nx mmxext fxsr_opt rdtscp lm
3dnowext 3dnow pni cx16 lahf_lm cmp_legacy svm extapic cr8_legacy
bogomips        : 5185.95
TLB size        : 1024 4K pages
clflush size    : 64
cache_alignment : 64
address sizes   : 40 bits physical, 48 bits virtual
power management: ts fid vid ttp tm stc

processor       : 2
vendor_id       : AuthenticAMD
cpu family      : 15
model           : 65
model name      : Dual-Core AMD Opteron(tm) Processor 2218
stepping        : 2
cpu MHz         : 2593.498
cache size      : 1024 KB
physical id     : 1
siblings        : 2
core id         : 0
cpu cores       : 2
fpu             : yes
fpu_exception   : yes
cpuid level     : 1
wp              : yes
flags           : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov
pat pse36 clflush mmx fxsr sse sse2 ht syscall nx mmxext fxsr_opt rdtscp lm
3dnowext 3dnow pni cx16 lahf_lm cmp_legacy svm extapic cr8_legacy
bogomips        : 5185.96
TLB size        : 1024 4K pages
clflush size    : 64
cache_alignment : 64
address sizes   : 40 bits physical, 48 bits virtual
power management: ts fid vid ttp tm stc

processor       : 3
vendor_id       : AuthenticAMD
cpu family      : 15
model           : 65
model name      : Dual-Core AMD Opteron(tm) Processor 2218
stepping        : 2
cpu MHz         : 2593.498
cache size      : 1024 KB
physical id     : 1
siblings        : 2
core id         : 1
cpu cores       : 2
fpu             : yes
fpu_exception   : yes
cpuid level     : 1
wp              : yes
flags           : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov
pat pse36 clflush mmx fxsr sse sse2 ht syscall nx mmxext fxsr_opt rdtscp lm
3dnowext 3dnow pni cx16 lahf_lm cmp_legacy svm extapic cr8_legacy
bogomips        : 5185.96
TLB size        : 1024 4K pages
clflush size    : 64
cache_alignment : 64
address sizes   : 40 bits physical, 48 bits virtual
power management: ts fid vid ttp tm stc

ROOT:rt-ash:log]> cat /proc/meminfo
MemTotal:      8102204 kB
MemFree:       7757580 kB
Buffers:         38284 kB
Cached:         208100 kB
SwapCached:          0 kB
Active:         120136 kB
Inactive:       164032 kB
SwapTotal:     2040244 kB
SwapFree:      2040244 kB
Dirty:              72 kB
Writeback:           0 kB
AnonPages:       37800 kB
Mapped:          10352 kB
Slab:            35768 kB
SReclaimable:    23368 kB
SUnreclaim:      12400 kB
PageTables:       4516 kB
NFS_Unstable:        0 kB
Bounce:              0 kB
CommitLimit:   6091344 kB
Committed_AS:    87588 kB
VmallocTotal: 34359738367 kB
VmallocUsed:      1284 kB
VmallocChunk: 34359736911 kB
HugePages_Total:     0
HugePages_Free:      0
HugePages_Rsvd:      0
Hugepagesize:     2048 kB

Comment 20 Arnaldo Carvalho de Melo 2007-06-01 12:37:07 UTC

Nivedita, if you get another such OOPS please try a sysrq+T and look if there is
any task stuck in D state, posting the backtrace.

Comment 21 IBM Bug Proxy 2007-06-04 15:05:20 UTC

----- Additional Comments From nivedita.com (prefers email at niv.com)  2007-06-04 11:00 EDT -------
Arnaldo, 

OK, so I have gotten this oops twice again. 

Unfortunately, sysrq-t isn't responding.

I'll try and grab more useful output somehow. This is only happening on our
x3455 at the moment, which is also suffering other issues.  The good news is
that we can make it happen fairly regularly, although no single test etc
seems to be provoking it.

Comment 22 IBM Bug Proxy 2007-06-05 12:15:49 UTC

----- Additional Comments From ankigarg.com (prefers email at ankita.com)  2007-06-05 08:12 EDT -------
So on this system, to test is sysrq-m was working fine, I found in the following
in the output of Show memory:

SysRq : Show Memory
Mem-info:
Node 0 DMA per-cpu:
CPU    0: Hot: hi:    0, btch:   1 usd:   0   Cold: hi:    0, btch:   1 usd:   0
CPU    1: Hot: hi:    0, btch:   1 usd:   0   Cold: hi:    0, btch:   1 usd:   0
CPU    2: Hot: hi:    0, btch:   1 usd:   0   Cold: hi:    0, btch:   1 usd:   0
CPU    3: Hot: hi:    0, btch:   1 usd:   0   Cold: hi:    0, btch:   1 usd:   0
Node 0 DMA32 per-cpu:
CPU    0: Hot: hi:  186, btch:  31 usd: 178   Cold: hi:   62, btch:  15 usd:   0
CPU    1: Hot: hi:  186, btch:  31 usd:   0   Cold: hi:   62, btch:  15 usd:   0
CPU    2: Hot: hi:  186, btch:  31 usd:  27   Cold: hi:   62, btch:  15 usd:   0
CPU    3: Hot: hi:  186, btch:  31 usd:   0   Cold: hi:   62, btch:  15 usd:   0
Node 0 Normal per-cpu:
CPU    0: Hot: hi:  186, btch:  31 usd: 138   Cold: hi:   62, btch:  15 usd:  12
CPU    1: Hot: hi:  186, btch:  31 usd:  26   Cold: hi:   62, btch:  15 usd:   0
CPU    2: Hot: hi:  186, btch:  31 usd:  55   Cold: hi:   62, btch:  15 usd:  10
CPU    3: Hot: hi:  186, btch:  31 usd:  24   Cold: hi:   62, btch:  15 usd:   0
Node 1 Normal per-cpu:
CPU    0: Hot: hi:  186, btch:  31 usd:   0   Cold: hi:   62, btch:  15 usd:   0
CPU    1: Hot: hi:  186, btch:  31 usd:  30   Cold: hi:   62, btch:  15 usd:   9
CPU    2: Hot: hi:  186, btch:  31 usd:  11   Cold: hi:   62, btch:  15 usd:   0
CPU    3: Hot: hi:  186, btch:  31 usd: 140   Cold: hi:   62, btch:  15 usd:  12
Active:85375 inactive:79740 dirty:48 writeback:0 unstable:0
 free:1796150 slab:41690 mapped:2263 pagetables:651 bounce:0
Node 0 DMA free:15912kB min:20kB low:24kB high:28kB active:0kB inactive:0kB
present:15532kB pages_scanned:0 all_unreclaimable? no
lowmem_reserve[]: 0 3216 3966
Node 0 DMA32 free:3138700kB min:4604kB low:5752kB high:6904kB active:0kB
inactive:0kB present:3293204kB pages_scanned:0 all_unreclaimable? no
lowmem_reserve[]: 0 0 750
Node 0 Normal free:323600kB min:1072kB low:1340kB high:1608kB active:224140kB
inactive:182644kB present:768000kB pages_scanned:0 all_unreclaimable? no
lowmem_reserve[]: 0 0 0
Node 1 Normal free:3706388kB min:5728kB low:7160kB high:8592kB active:117360kB
inactive:136316kB present:4096000kB pages_scanned:0 all_unreclaimable? no
lowmem_reserve[]: 0 0 0
Node 0 DMA: 2*4kB 2*8kB 1*16kB 2*32kB 1*64kB 1*128kB 1*256kB 0*512kB 1*1024kB
1*2048kB 3*4096kB = 15912kB
Node 0 DMA32: 1*4kB 1*8kB 2*16kB 1*32kB 3*64kB 5*128kB 1*256kB 2*512kB 3*1024kB
2*2048kB 764*4096kB = 3138700kB
Node 0 Normal: 1220*4kB 572*8kB 252*16kB 191*32kB 132*64kB 95*128kB 29*256kB
25*512kB 7*1024kB 3*2048kB 61*4096kB = 323600kB
Node 1 Normal: 79*4kB 5*8kB 5*16kB 5*32kB 31*64kB 40*128kB 12*256kB 2*512kB
6*1024kB 3*2048kB 899*4096kB = 3706388kB
Swap cache: add 0, delete 0, find 0/0, race 0+0
Free swap  = 8393952kB
Total swap = 8393952kB
Free swap:       8393952kB
stopped custom tracer.
Unable to handle kernel paging request at 0000000004e00000 RIP: 
 [<ffffffff81081787>] show_mem+0x8f/0x144
PGD 125852067 PUD 116a2b067 PMD 0 
Oops: 0000 [1] PREEMPT SMP 
CPU 2 
Modules linked in: autofs4 hidp rfcomm l2cap bluetooth sunrpc ipv6 dm_mirror
dm_mod video sbs i2c_ec i2c_core dock button battery asus_acpi ac parport_pc lp
parport sg pcspkr shpchp k8temp hwmon bnx2 rtc_cmos rtc_core rtc_lib serio_raw
usb_storage mptsas mptscsih mptbase scsi_transport_sas sd_mod scsi_mod ext3 jbd
ehci_hcd ohci_hcd uhci_hcd
Pid: 14896, comm: bash Not tainted 2.6.21-14.el5rt #1
RIP: 0010:[<ffffffff81081787>]  [<ffffffff81081787>] show_mem+0x8f/0x144
RSP: 0018:ffff81021b201e78  EFLAGS: 00010246
RAX: 0000000000000000 RBX: 00000000000d0000 RCX: 000000000000001a
RDX: 0000000004e00000 RSI: ffff8100019470d0 RDI: 00000000000d0000
RBP: ffff81021b201e98 R08: ffff810000012000 R09: ffff81012f623440
R10: 00007fff3dde0b93 R11: 0000000000000246 R12: 000000000000f67c
R13: 0000000000000000 R14: 0000000000000000 R15: 0000000000000006
FS:  00002add69e68db0(0000) GS:ffff81012fc85b40(0000) knlGS:0000000000000000
CS:  0010 DS: 0000 ES: 0000 CR0: 000000008005003b
CR2: 0000000004e00000 CR3: 0000000125f57000 CR4: 00000000000006e0
Process bash (pid: 14896, threadinfo ffff81021b200000, task ffff81022fd76040)
Stack:  ffffffff8145ab40 0000000000000000 000000000000006d 0000000000000000
 ffff81021b201ea8 ffffffff811ad97c ffff81021b201ee8 ffffffff811ad878
 ffff81011b53e000 0000000000000002 00002add6d36f000 ffff81021b201f48
Call Trace:
 [<ffffffff811ad97c>] sysrq_handle_showmem+0x9/0xb
 [<ffffffff811ad878>] __handle_sysrq+0x9a/0x128
 [<ffffffff8110cbea>] write_sysrq_trigger+0x30/0x3b
 [<ffffffff81016aaa>] vfs_write+0xcf/0x158
 [<ffffffff81017434>] sys_write+0x47/0x70
 [<ffffffff8105f11e>] system_call+0x7e/0x83
 [<0000003d188bfa90>]


Code: 8b 02 f6 c4 04 74 05 49 ff c4 eb 35 8b 02 66 85 c0 79 05 49 
RIP  [<ffffffff81081787>] show_mem+0x8f/0x144
 RSP <ffff81021b201e78>
CR2: 0000000004e00000


This seems to be what has been seen here...

Comment 23 IBM Bug Proxy 2007-06-05 13:25:47 UTC

----- Additional Comments From ankigarg.com (prefers email at ankita.com)  2007-06-05 09:19 EDT -------
Pretty much on all the systems, sysrq-m gives the same backtrace.

Comment 24 Arnaldo Carvalho de Melo 2007-06-05 17:10:27 UTC

This brings yet another bug, namely sysrq-m oopsing on x86_64. 2.6.22-rc3-git
has the same problem, 2.6.20 doesn't, trying to fix this one now.

But what was asked was not sysrq+m, but sysrq+t, so that we could see where
tasks in D state are or if we could get more clues from it.

Comment 25 IBM Bug Proxy 2007-06-05 18:00:49 UTC

----- Additional Comments From ankigarg.com (prefers email at ankita.com)  2007-06-05 13:54 EDT -------
It was while working on bug #35225 (RH #242630), I came across the Oops. As
reported in that bug, sysrq-t is not working.

Comment 26 IBM Bug Proxy 2007-06-05 18:21:13 UTC

----- Additional Comments From nivedita.com (prefers email at niv.com)  2007-06-05 14:18 EDT -------
Hey Arnaldo, 

I'll echo Ankita, I have been unable to get sysrq-T to work.

Comment 27 IBM Bug Proxy 2007-06-06 23:10:48 UTC

----- Additional Comments From jstultz.com (prefers email at johnstul.com)  2007-06-06 19:07 EDT -------
If we could, please keep the sysrq-m issue in LTC bug #35225 or open a new bug.

Regarding the original OOPS, looking at the trace, it looks possibly related to
the recently resolved __tasklet_action bug (LTC bug #34969  RH bug #242422). In
that case rcu_process_callbacks and __tasklet_action were involved.

Comment 28 Arnaldo Carvalho de Melo 2007-06-07 15:52:07 UTC

Ok, moved the sysrq+M discussion to bug #242630, where a fix is waiting for
further testing on the LTC machines where the OOPSes are happening, which
probably are machines with memory holes as shown in the e820 bootup messages :-)

Comment 29 IBM Bug Proxy 2007-06-11 08:50:51 UTC

----- Additional Comments From Dinakar.G.com (prefers email at dino.com)  2007-06-11 04:46 EDT -------
So I saw this bug while running release_testing.sh over the weekend.

Jun 10 06:53:43 llm48 kernel: stopped custom tracer.
Jun 10 06:53:43 llm48 kernel: Unable to handle kernel paging request at
ffff810111f26130 RIP:
Jun 10 06:53:43 llm48 kernel:  [<ffff810111f26130>]
Jun 10 06:53:43 llm48 kernel: PGD 8063 PUD d063 PMD 8000000111e001e3 PTE
4101834b118821ea
Jun 10 06:53:43 llm48 kernel: Oops: 0011 [1] PREEMPT SMP
Jun 10 06:53:43 llm48 kernel: CPU 1
Jun 10 06:53:43 llm48 kernel: Modules linked in: autofs4 hidp rfcomm l2cap
bluetooth sunrpc ipv6 dm_mirror dm_mod video sbs i2c_ec dock button battery
asus_acpi ac parport_pc lp parport joydev sr_mod cdrom sg i2c_amd756 i2c_core
pcspkr amd_rng shpchp k8temp hwmon tg3 rtc_cmos rtc_core rtc_lib serio_raw
usb_storage mptspi mptscsih mptbase scsi_transport_spi sd_mod scsi_mod ext3 jbd
ehci_hcd ohci_hcd uhci_hcd
Jun 10 06:53:43 llm48 kernel: Pid: 22, comm: softirq-tasklet Not tainted
2.6.21-23ibm1 #1
Jun 10 06:53:43 llm48 kernel: RIP: 0010:[<ffff810111f26130>]  [<ffff810111f26130>]
Jun 10 06:53:43 llm48 kernel: RSP: 0018:ffff810111ce7df8  EFLAGS: 00010282
Jun 10 06:53:43 llm48 kernel: RAX: ffff8101180eb800 RBX: ffff8102066f4e00 RCX:
000000000000000b
Jun 10 06:53:43 llm48 kernel: RDX: ffff8101180eb800 RSI: ffff810001947210 RDI:
ffff8101c09c5d40
Jun 10 06:53:43 llm48 kernel: RBP: ffff810111ce7e10 R08: ffff810111ce6000 R09:
ffff810111cddde0
Jun 10 06:53:43 llm48 kernel: R10: ffffffff816e2700 R11: ffffffff8106d0e2 R12:
ffff8101c09c5d40
Jun 10 06:53:43 llm48 kernel: R13: 0000000000000000 R14: 0000000000000001 R15:
0000000000000020
Jun 10 06:53:43 llm48 kernel: FS:  00002b7da57086f0(0000)
GS:ffff810111c85b40(0000) knlGS:00000000f7ee76c0
Jun 10 06:53:43 llm48 kernel: CS:  0010 DS: 0018 ES: 0018 CR0: 000000008005003b
Jun 10 06:53:43 llm48 kernel: CR2: ffff810111f26130 CR3: 0000000205c20000 CR4:
00000000000006e0
Jun 10 06:53:43 llm48 kernel: Process softirq-tasklet (pid: 22, threadinfo
ffff810111ce6000, task ffff8101180eb800)
Jun 10 06:53:43 llm48 kernel: Stack:  ffffffff810c6941 ffffffff81858910
00000000000f4240 ffff810111ce7e40
Jun 10 06:53:43 llm48 kernel:  ffffffff81094aed ffff810001f9b548
ffffffff817e2968 ffff810001f9b548
Jun 10 06:53:43 llm48 kernel:  00000000ffffffdf ffff810111ce7e50
ffffffff81094fd7 ffff810111ce7eb0
Jun 10 06:53:43 llm48 kernel: Call Trace:
Jun 10 06:53:43 llm48 kernel:
Jun 10 06:53:43 llm48 kernel:
Jun 10 06:53:43 llm48 kernel: Code: 40 5d 9c c0 01 81 ff ff 80 3b 39 11 02 81 ff
ff 01 00 00 00
Jun 10 06:53:43 llm48 kernel: RIP  [<ffff810111f26130>]
Jun 10 06:53:43 llm48 kernel:  RSP <ffff810111ce7df8>
Jun 10 06:53:43 llm48 kernel: CR2: ffff810111f26130

uname -a
Linux llm48.in.ibm.com 2.6.21-23ibm1 #1 SMP PREEMPT RT Thu Jun 7 12:49:03 CDT
2007 x86_64 x86_64 x86_64 GNU/Linux

Not sure which test triggered this though

Comment 30 IBM Bug Proxy 2007-06-12 23:41:05 UTC

----- Additional Comments From jstultz.com (prefers email at johnstul.com)  2007-06-12 19:38 EDT -------
I think I triggered a similar issue w/ the -23ibm2 kernel booting w/
"noaliencache" when running kernbench.

Comment 31 IBM Bug Proxy 2007-06-14 21:37:05 UTC

----- Additional Comments From mauery.com (prefers email at vernux.com)  2007-06-14 17:29 EDT -------
This was just seen on toro-rtj1:

stopped custom tracer.
Unable to handle kernel paging request at ffff810211d3d130 RIP: 
 [<ffff810211d3d130>]
PGD 8063 PUD 11063 PMD 8000000111a9e163 PTE 8000000211d3d163
Oops: 0011 [1] PREEMPT SMP 
CPU 2 
Modules linked in: autofs4 hidp rfcomm l2cap bluetooth sunrpc ipv6 dm_mirror
dm_mod video sbs i2c_ec dock button battery asus_acpi ac parport_pc lp parport
joydev sr_mod cdrom sg pcspkr i2c_amd756 i2c_core amd_rng shpchp k8temp hwmon
tg3 rtc_cmos rtc_core rtc_lib serio_raw usb_storage mptspi mptscsih mptbase
scsi_transport_spi sd_mod scsi_mod ext3 jbd ehci_hcd ohci_hcd uhci_hcd
Pid: 35, comm: softirq-tasklet Not tainted 2.6.21-23ibm2 #1
RIP: 0010:[<ffff810211d3d130>]  [<ffff810211d3d130>]
RSP: 0018:ffff810111d1ddf8  EFLAGS: 00010286
RAX: ffff81011816f800 RBX: ffff81005d747240 RCX: 0000000000000001
RDX: ffff81011816f800 RSI: ffff810001947088 RDI: ffff8100cf6112c0
RBP: ffff810111d1de10 R08: ffff810111d1c000 R09: 000000000000003c
R10: ffffffff816e2700 R11: ffffffff8106d0e2 R12: ffff8100cf6112c0
R13: 0000000000000000 R14: 0000000000000002 R15: 0000000000000020
FS:  00002addf48ca120(0000) GS:ffff8101181549c0(0000) knlGS:00000000f7f07b90
CS:  0010 DS: 0018 ES: 0018 CR0: 000000008005003b
CR2: ffff810211d3d130 CR3: 0000000207a2e000 CR4: 00000000000006e0
Process softirq-tasklet (pid: 35, threadinfo ffff810111d1c000, task
ffff81011816f800)
Stack:  ffffffff810c6941 ffffffff81858910 00000000000f4240 ffff810111d1de40
 ffffffff81094aed ffff8101180796c8 ffffffff817e2968 ffff8101180796c8
 00000000ffffffdf ffff810111d1de50 ffffffff81094fd7 ffff810111d1deb0
Call Trace:


Code: c0 12 61 cf 00 81 ff ff c0 ad 38 11 01 81 ff ff 01 00 00 00 
RIP  [<ffff810211d3d130>]
 RSP <ffff810111d1ddf8>
CR2: ffff810211d3d130


Here is the message from the person that saw it:
"I've been running some Java SVT tests on the toro-rtj1.austin.ibm.com box, 
which has RHEL5-RT installed.

I was at a command prompt, and started to type 'ls', but i only got as far as 
the 'l' before the OS put out the following message:

[rtjaxxon@toro-rtj1 ~]$
Message from syslogd@toro-rtj1 at Thu Jun 14 05:27:56 2007 ...
toro-rtj1 kernel: Oops: 0011 [1] PREEMPT SMP

Message from syslogd@toro-rtj1 at Thu Jun 14 05:27:57 2007 ...
toro-rtj1 kernel: CR2: ffff810211d3d130



- This then didn't show a command prompt, but typing 'ls' again, followed by 
enter showed the directory as normal, then i was back at the prompt."

Comment 32 IBM Bug Proxy 2007-06-26 00:00:19 UTC

----- Additional Comments From jstultz.com (prefers email at johnstul.com)  2007-06-25 19:54 EDT -------
I've been trying to reproduce w/ both -23ibm3 and -31 and so far have had no luck. 

I think if it doesn't crop up for another week we can close this.

Comment 33 IBM Bug Proxy 2007-07-02 18:40:42 UTC

changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
             Status|OPEN                        |REJECTED
         Resolution|                            |UNREPRODUCIBLE




------- Additional Comments From jstultz.com (prefers email at johnstul.com)  2007-07-02 14:34 EDT -------
This has not been seen since -23ibm3. If it recurs, please reopen.

Comment 34 Tim Burke 2007-07-19 19:28:00 UTC

Closing per prior comment.

Note You need to log in before you can comment on or make changes to this bug.