Description of problem: oops Version-Release number of selected component (if applicable): [root@betabox log]# uname -r 2.6.20-0119.rt8 Steps to Reproduce: 1. leave box alone, come back and see error Actual results: oops Expected results: no oops
Apr 23 07:27:01 betabox kernel: stopped custom tracer. Apr 23 07:27:01 betabox kernel: Unable to handle kernel paging request at ffff810037e45530 RIP: Apr 23 07:27:01 betabox kernel: [<ffff810037e45530>] Apr 23 07:27:01 betabox kernel: PGD 8063 PUD 9063 PMD 8000000037e001e3 PTE ffff810037fff800 Apr 23 07:27:01 betabox kernel: Oops: 0011 [1] PREEMPT SMP Apr 23 07:27:01 betabox kernel: CPU 1 Apr 23 07:27:01 betabox kernel: Modules linked in: i915 drm hidp rfcomm l2cap bluetooth sunrpc ipv6 video sbs i2c_ec dock button battery asus_acpi backlight ac lp snd_hda_intel snd_hda_codec snd_seq_dummy snd_seq_oss snd_seq_midi_event sr_mod snd_seq cdrom snd_seq_device sg snd_pcm_oss snd_mixer_oss snd_pcm snd_timer snd pcspkr soundcore parport_pc snd_page_alloc i2c_i801 shpchp pata_marvell i2c_core iTCO_wdt iTCO_vendor_support parport e1000 ohci1394 ieee1394 serio_raw dm_snapshot dm_zero dm_mirror dm_mod ata_piix ata_generic libata sd_mod scsi_mod ext3 jbd ehci_hcd ohci_hcd uhci_hcd Apr 23 07:27:01 betabox kernel: Pid: 22, comm: softirq-tasklet Not tainted 2.6.20-0119.rt8 #1 Apr 23 07:27:01 betabox kernel: RIP: 0010:[<ffff810037e45530>] [<ffff810037e45530>] Apr 23 07:27:01 betabox kernel: RSP: 0018:ffff81007d645e28 EFLAGS: 00010282 Apr 23 07:27:01 betabox kernel: RAX: ffff81007d643560 RBX: ffff81006f7b9440 RCX: ffff8100688ddd00 Apr 23 07:27:01 betabox kernel: RDX: ffff81007d643560 RSI: ffff810001000068 RDI: ffff81007bd92d40 Apr 23 07:27:01 betabox kernel: RBP: ffff81007d645e40 R08: ffff81007d644000 R09: 0000000000000012 Apr 23 07:27:01 betabox kernel: R10: ffffffff808e6ac0 R11: ffff81007e56d090 R12: ffff81007bd92d40 Apr 23 07:27:01 betabox kernel: R13: 00000000000f4240 R14: 0000000000000000 R15: 0000000000000001 Apr 23 07:27:01 betabox kernel: FS: 0000000000000000(0000) GS:ffff81007e53d9c0(0000) knlGS:0000000000000000 Apr 23 07:27:01 betabox kernel: CS: 0010 DS: 0018 ES: 0018 CR0: 000000008005003b Apr 23 07:27:01 betabox kernel: CR2: ffff810037e45530 CR3: 0000000072de2000 CR4: 00000000000006e0 Apr 23 07:27:01 betabox kernel: Process softirq-tasklet (pid: 22, threadinfo ffff81007d644000, task ffff81007d643560) Apr 23 07:27:01 betabox kernel: Stack: ffffffff802c99fd ffffffff80a7e7d0 00000000000f4240 ffff81007d645e70 Apr 23 07:27:01 betabox kernel: ffffffff80294f6e ffff81000407cb88 0000000000000020 ffff81000407cb88 Apr 23 07:27:01 betabox kernel: ffffffff8029521b ffff81007d645e80 ffffffff802950cd ffff81007d645eb0 Apr 23 07:27:01 betabox kernel: Call Trace: Apr 23 07:27:01 betabox kernel: Inexact backtrace: Apr 23 07:27:01 betabox kernel: [<ffffffff802c99fd>] rcu_process_callbacks+0x5e/0x6b Apr 23 07:27:01 betabox kernel: [<ffffffff80294f6e>] __tasklet_action+0xea/0x13d Apr 23 07:27:01 betabox kernel: [<ffffffff8029521b>] ksoftirqd+0x0/0x1e3 Apr 23 07:27:01 betabox kernel: [<ffffffff802950cd>] tasklet_action+0x27/0x29 Apr 23 07:27:01 betabox kernel: [<ffffffff80295321>] ksoftirqd+0x106/0x1e3 Apr 23 07:27:01 betabox kernel: [<ffffffff8023405e>] kthread+0xee/0x11a Apr 23 07:27:01 betabox kernel: [<ffffffff802600f8>] child_rip+0xa/0x12 Apr 23 07:27:01 betabox kernel: [<ffffffff80233f70>] kthread+0x0/0x11a Apr 23 07:27:01 betabox kernel: [<ffffffff802600ee>] child_rip+0x0/0x12 Apr 23 07:27:01 betabox kernel: Apr 23 07:27:01 betabox kernel: --------------------------- Apr 23 07:27:01 betabox kernel: | preempt count: 00000001 ] Apr 23 07:27:01 betabox kernel: | 1-level deep critical section nesting: Apr 23 07:27:01 betabox kernel: ---------------------------------------- Apr 23 07:27:01 betabox kernel: .. [<ffffffff8026657b>] .... __spin_trylock+0x13/0x4f Apr 23 07:27:01 betabox kernel: .....[<ffffffff80267111>] .. ( <= oops_begin+0x23/0x6f) Apr 23 07:27:01 betabox kernel: Apr 23 07:27:01 betabox kernel: Apr 23 07:27:01 betabox kernel: Code: 40 2d d9 7b 00 81 ff ff 00 c3 de 37 00 81 ff ff 01 00 00 00 Apr 23 07:27:01 betabox kernel: RIP [<ffff810037e45530>] Apr 23 07:27:02 betabox kernel: RSP <ffff81007d645e28> Apr 23 07:27:02 betabox kernel: CR2: ffff810037e45530
messages:Apr 23 07:27:01 betabox kernel: Oops: 0011 [1] PREEMPT SMP messages.1:Apr 19 05:23:26 betabox kernel: Oops: 0011 [1] PREEMPT SMP messages.4:Mar 31 22:25:29 betabox kernel: Oops: 0011 [1] PREEMPT SMP
FYI it is the same rcu spot in each oops # grep rcu_process /var/log/messages* /var/log/messages:Apr 23 07:27:01 betabox kernel: [<ffffffff802c99fd>] rcu_process_callbacks+0x5e/0x6b /var/log/messages.1:Apr 19 05:23:26 betabox kernel: [<ffffffff802c99fd>] rcu_process_callbacks+0x5e/0x6b /var/log/messages.4:Mar 31 22:25:29 betabox kernel: [<ffffffff802c99fd>] rcu_process_callbacks+0x5e/0x6b
Are all of the Oops procedded by "Apr 23 07:27:01 betabox kernel: stopped custom tracer." or is is just juck that this happend right before this oops? What is the "custom tracer" and does stoping this service cause this error?
The message is there before all 3 Oops. I am not sure where that message comes from, perhaps it is the latency tracer? There was no user intervention when the system Oopsed
----- Additional Comments From jstultz.com (prefers email at johnstul.com) 2007-04-23 22:10 EDT ------- Paul McKenney: Could you take a look at this? StephenR thinks its an RCU issue.
So stoping the latency tracer is causing this error? Can someone provide details about the "stopped custom tracer"?
----- Additional Comments From Paul.McKenney.com (prefers email at paulmck.com) 2007-05-12 20:31 EDT ------- (1) Was utrace in use? (there have been some cases where call_rcu() is called twice on the same data element -- bad idea) (2) Was oprofile in use? (Working on getting NMI-safe RCU patch against current -rt on the off-chance that this is the issue.)
No, this was a desktop development machine sitting mainly idle save for some desktop things (browser, rhythmbox).
The "stopped custom tracer" message comes from: ./lib/debug_locks.c /* * Generic 'turn off all lock debugging' function: */ int debug_locks_off(void) { if (xchg(&debug_locks, 0)) { #ifdef CONFIG_DEBUG_RT_MUTEXES if (spin_is_locked(¤t->pi_lock)) spin_unlock(¤t->pi_lock); #endif if (!debug_locks_silent) { stop_trace(); user_trace_stop(); printk("stopped custom tracer.\n"); console_verbose(); return 1; } } return 0; } And that is called in many places. In some it would be helpful to use the kernel command line option "debug_locks_verbose=1" so that we get some WARN_ON messages from DEBUG_LOCKS_WARN_ON() cases where debug_locks_off() is called. Rod, It may well be just a coincidence that this message appears, but can you please use it to see if we get some more info? I'm continuing the investigation meanwhile.
rebooting the system now with that cmdline. Will let you know if it oops-es again
Nah, forget about it, debug_locks_verbose already is 1 in your setup, if not we would't see the "stopped custom tracer" at all. So the "stopped custom tracer" doesn't buy us nothing, it merely says that from that point on the latency tracer is not in use anymore because a locking bug happened and (in ./lib/debug_locks.c): /* * We want to turn all lock-debugging facilities on/off at once, * via a global flag. The reason is that once a single bug has been * detected and reported, there might be cascade of followup bugs * that would just muddy the log. So we report the first one and * shut up after that. */ int debug_locks = 1; I.e. the xchg() in debug_locks_off(). So the only conclusion so far is that this is a locking bug detected by the lock-debugging facilities.
And that conclusion is flawed, as: void oops_enter(void) { debug_locks_off(); /* can't trust the integrity of the kernel anymore */ do_oops_enter_exit(); } Is what probably is happening, i.e. something unrelated to lock-debugging facilities causes an oops and the oops handler first calls debug_locks_off(), we get that "stopped custom tracer" message and the oops is printed. So this message is not of help.
Rod, have you seen this again? I guess we can expire this one as it doesn't seem to be happening anymore.
Further information: Rod kept using the same kernel for over a month on the same machine, tried to reproduce it several times and the messge last happened over a month ago. Rod, can you confirm or complement/correct this information?
Just got one of those, first time ever on this Dell Poweredge1950 dual dual core Xeon machine: hub 5-7:1.0: 4 ports detected SELinux: Disabled at runtime. audit(1180548100.564:2): selinux=0 auid=4294967295 stopped custom tracer. Unable to handle kernel paging request at ffff81012f885530 RIP: [<ffff81012f885530>] PGD 8063 PUD d063 PMD 800000012f8001e3 PTE 0 Oops: 0011 [1] PREEMPT SMP CPU 1 Modules linked in: oprofile autofs4 hidp rfcomm l2cap bluetooth sunrpc ipv6 dm_multipath video sbs id Pid: 22, comm: softirq-tasklet Not tainted 2.6.21-15.el5rt #1 RIP: 0010:[<ffff81012f885530>] [<ffff81012f885530>] RSP: 0018:ffff81012fd13df8 EFLAGS: 00010282 RAX: ffff81012fce1040 RBX: ffff81012e346600 RCX: ffff810121c5d940 RDX: ffff81012fce1040 RSI: ffff810001947120 RDI: ffff81011e815480 RBP: ffff81012fd13e10 R08: ffff81012fd12000 R09: ffff81012fd07de0 R10: ffffffff816e2700 R11: 000001122868dcc0 R12: ffff81011e815480 R13: 0000000000000000 R14: 0000000000000001 R15: 0000000000000020 FS: 0000000000000000(0000) GS:ffff81012fc9b7c0(0000) knlGS:0000000000000000 CS: 0010 DS: 0018 ES: 0018 CR0: 000000008005003b CR2: ffff81012f885530 CR3: 00000001220d8000 CR4: 00000000000006e0 Process softirq-tasklet (pid: 22, threadinfo ffff81012fd12000, task ffff81012fce1040) Stack: ffffffff810c68e6 ffffffff81858910 00000000000f4240 ffff81012fd13e40 ffffffff81094a9e ffff8100079f9548 ffffffff817e2968 ffff8100079f9548 00000000ffffffdf ffff81012fd13e50 ffffffff81094f76 ffff81012fd13eb0 Call Trace: Code: 80 54 81 1e 01 81 ff ff 00 98 69 2e 01 81 ff ff 01 00 00 00 RIP [<ffff81012f885530>] RSP <ffff81012fd13df8> CR2: ffff81012f885530 This with: [root@mica ~]# uname -a Linux mica.ghostprotocols.net 2.6.21-15.el5rt #1 SMP PREEMPT RT Thu May 24 11:37:09 EDT 2007 x86_64 x86_64 x86_64 GNU/Linux [root@mica ~]# This was while trying to get oprofile working after 2.6.21, will see if I can reproduce this reliably.
Before rebooting I tried to stop oprofile, and got oprofiled stuck in D state. sysrq-T shows this: oprofiled D [ffff81011e239040] 000000010019aa5c 0 3843 1 (L-TLB) ffff81011a105cd8 0000000000000046 0000000000000000 ffffffff81019e19 0000000000000000 ffff81011e239290 ffff81011e239040 ffff81012fd62040 000001e32c4f80e7 0000000000033e5e 0000000200000001 ffffffff8100a5f7 Call Trace: [<ffffffff810646ec>] schedule+0xe2/0x102 [<ffffffff8106478e>] wait_for_completion+0x82/0xc1 [<ffffffff810c66cf>] synchronize_rcu+0x35/0x3c [<ffffffff8109c858>] atomic_notifier_chain_unregister+0x2e/0x36 [<ffffffff810930f1>] task_handoff_unregister+0x13/0x15 [<ffffffff88415d7f>] :oprofile:sync_stop+0x3b/0x42 [<ffffffff88415062>] :oprofile:oprofile_shutdown+0x15/0x46 [<ffffffff88415e32>] :oprofile:event_buffer_release+0x16/0x46 [<ffffffff810128b3>] __fput+0xac/0x179 [<ffffffff8102e284>] fput+0x14/0x16 [<ffffffff810250fc>] filp_close+0x66/0x71 [<ffffffff81039e0a>] put_files_struct+0x6d/0xc1 [<ffffffff81015753>] do_exit+0x27c/0x894 [<ffffffff8104a86a>] cpuset_exit+0x0/0x6e [<ffffffff8104e407>] sys_exit_group+0x12/0x14 [<ffffffff8105f11e>] system_call+0x7e/0x83 [<000000326f094a5f>]
Couldn't reproduce it anymore :-(
----- Additional Comments From nivedita.com (prefers email at niv.com) 2007-05-31 23:40 EDT ------- Actually, I did see this on my x3455 while running stress tests, but it's not clear when/what provoked the tests. Connectivity to the box was lost. That was on the 2.6.21-4.el5rt kernel, however. I have updated to 2.6.21-14.el5rt today, and am running with nohz=off. So far, have not seen it yet. About to kick off stress tests, will update the bug by tmrw if this is reproduced. From /var/log/messages: ========================= May 29 12:06:43 rt-ash kernel: stopped custom tracer. May 29 12:06:43 rt-ash kernel: Unable to handle kernel paging request at ffff8101afe4b530 RIP: May 29 12:06:43 rt-ash kernel: [<ffff8101afe4b530>] May 29 12:06:43 rt-ash kernel: PGD 8063 PUD f063 PMD 80000001afe001e3 PTE 0 May 29 12:06:43 rt-ash kernel: Oops: 0011 [1] PREEMPT SMP May 29 12:06:43 rt-ash kernel: CPU 1 May 29 12:06:43 rt-ash kernel: Modules linked in: autofs4 hidp rfcomm l2cap bluetooth sunrpc nf_conntrack_netbios_ns ipt_REJECT nf_conntrack_ipv4 xt_state iptable_filter ip_tables ip6t_REJECT xt_tcpudp ip6table_filter ip6_tables x_tables ipv6 dm_mirror dm_mod video sbs i2c_ec dock button battery asus_acpi ac parport_pc lp parport sg ata_generic pcspkr pata_serverworks i2c_core shpchp k8temp hwmon tg3 rtc_cmos rtc_core rtc_lib ide_cd serio_raw cdrom sata_svw libata sd_mod scsi_mod ext3 jbd ehci_hcd ohci_hcd uhci_hcd May 29 12:06:43 rt-ash kernel: Pid: 22, comm: softirq-tasklet Not tainted 2.6.21-4.el5rt #1 May 29 12:06:43 rt-ash kernel: RIP: 0010:[<ffff8101afe4b530>] [<ffff8101afe4b530>] May 29 12:06:43 rt-ash kernel: RSP: 0018:ffff8101b311bdf8 EFLAGS: 00010282 May 29 12:06:43 rt-ash kernel: RAX: ffff8101afcc0040 RBX: ffff8101afd5a940 RCX: ffff8101afb76640 May 29 12:06:43 rt-ash kernel: RDX: ffff8101afcc0040 RSI: ffff81000194b1a8 RDI: ffff8101a909ab40 May 29 12:06:43 rt-ash kernel: RBP: ffff8101b311be10 R08: ffff8101b311a000 R09: ffff8101b3111de0 May 29 12:06:43 rt-ash kernel: R10: ffffffff816e3700 R11: ffff8101afcbd090 R12: ffff8101a909ab40 May 29 12:06:43 rt-ash kernel: R13: 0000000000000000 R14: 0000000000000001 R15: 0000000000000020 May 29 12:06:43 rt-ash kernel: FS: 00002b5df4d082a0(0000) GS:ffff8101afc85b40(0000) knlGS:00000000f7ef36c0 May 29 12:06:43 rt-ash kernel: CS: 0010 DS: 0018 ES: 0018 CR0: 000000008005003b May 29 12:06:43 rt-ash kernel: CR2: ffff8101afe4b530 CR3: 000000022b37c000 CR4: 00000000000006e0 May 29 12:06:43 rt-ash kernel: Process softirq-tasklet (pid: 22, threadinfo ffff8101b311a000, task ffff8101afcc0040) May 29 12:06:43 rt-ash kernel: Stack: ffffffff810c6a00 ffffffff8185c910 00000000000f4240 ffff8101b311be40 May 29 12:06:43 rt-ash kernel: ffffffff81094ce4 ffff81000aa3c8c8 ffffffff817e62e8 ffff81000aa3c8c8 May 29 12:06:43 rt-ash kernel: 00000000ffffffdf ffff8101b311be50 ffffffff810951bc ffff8101b311beb0 May 29 12:06:43 rt-ash kernel: Call Trace: May 29 12:06:43 rt-ash kernel: May 29 12:06:43 rt-ash kernel: May 29 12:06:43 rt-ash kernel: Code: 40 ab 09 a9 01 81 ff ff 40 f8 d4 2f 02 81 ff ff 01 00 00 00 May 29 12:06:43 rt-ash kernel: RIP [<ffff8101afe4b530>] May 29 12:06:43 rt-ash kernel: RSP <ffff8101b311bdf8> May 29 12:06:43 rt-ash kernel: CR2: ffff8101afe4b530 May 29 13:18:10 rt-ash dhclient: DHCPREQUEST on eth0 to 9.3.233.235 port 67 ROOT:rt-ash:log]> cat /proc/cpuinfo processor : 0 vendor_id : AuthenticAMD cpu family : 15 model : 65 model name : Dual-Core AMD Opteron(tm) Processor 2218 stepping : 2 cpu MHz : 2593.498 cache size : 1024 KB physical id : 0 siblings : 2 core id : 0 cpu cores : 2 fpu : yes fpu_exception : yes cpuid level : 1 wp : yes flags : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ht syscall nx mmxext fxsr_opt rdtscp lm 3dnowext 3dnow pni cx16 lahf_lm cmp_legacy svm extapic cr8_legacy bogomips : 5188.80 TLB size : 1024 4K pages clflush size : 64 cache_alignment : 64 address sizes : 40 bits physical, 48 bits virtual power management: ts fid vid ttp tm stc processor : 1 vendor_id : AuthenticAMD cpu family : 15 model : 65 model name : Dual-Core AMD Opteron(tm) Processor 2218 stepping : 2 cpu MHz : 2593.498 cache size : 1024 KB physical id : 0 siblings : 2 core id : 1 cpu cores : 2 fpu : yes fpu_exception : yes cpuid level : 1 wp : yes flags : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ht syscall nx mmxext fxsr_opt rdtscp lm 3dnowext 3dnow pni cx16 lahf_lm cmp_legacy svm extapic cr8_legacy bogomips : 5185.95 TLB size : 1024 4K pages clflush size : 64 cache_alignment : 64 address sizes : 40 bits physical, 48 bits virtual power management: ts fid vid ttp tm stc processor : 2 vendor_id : AuthenticAMD cpu family : 15 model : 65 model name : Dual-Core AMD Opteron(tm) Processor 2218 stepping : 2 cpu MHz : 2593.498 cache size : 1024 KB physical id : 1 siblings : 2 core id : 0 cpu cores : 2 fpu : yes fpu_exception : yes cpuid level : 1 wp : yes flags : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ht syscall nx mmxext fxsr_opt rdtscp lm 3dnowext 3dnow pni cx16 lahf_lm cmp_legacy svm extapic cr8_legacy bogomips : 5185.96 TLB size : 1024 4K pages clflush size : 64 cache_alignment : 64 address sizes : 40 bits physical, 48 bits virtual power management: ts fid vid ttp tm stc processor : 3 vendor_id : AuthenticAMD cpu family : 15 model : 65 model name : Dual-Core AMD Opteron(tm) Processor 2218 stepping : 2 cpu MHz : 2593.498 cache size : 1024 KB physical id : 1 siblings : 2 core id : 1 cpu cores : 2 fpu : yes fpu_exception : yes cpuid level : 1 wp : yes flags : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ht syscall nx mmxext fxsr_opt rdtscp lm 3dnowext 3dnow pni cx16 lahf_lm cmp_legacy svm extapic cr8_legacy bogomips : 5185.96 TLB size : 1024 4K pages clflush size : 64 cache_alignment : 64 address sizes : 40 bits physical, 48 bits virtual power management: ts fid vid ttp tm stc ROOT:rt-ash:log]> cat /proc/meminfo MemTotal: 8102204 kB MemFree: 7757580 kB Buffers: 38284 kB Cached: 208100 kB SwapCached: 0 kB Active: 120136 kB Inactive: 164032 kB SwapTotal: 2040244 kB SwapFree: 2040244 kB Dirty: 72 kB Writeback: 0 kB AnonPages: 37800 kB Mapped: 10352 kB Slab: 35768 kB SReclaimable: 23368 kB SUnreclaim: 12400 kB PageTables: 4516 kB NFS_Unstable: 0 kB Bounce: 0 kB CommitLimit: 6091344 kB Committed_AS: 87588 kB VmallocTotal: 34359738367 kB VmallocUsed: 1284 kB VmallocChunk: 34359736911 kB HugePages_Total: 0 HugePages_Free: 0 HugePages_Rsvd: 0 Hugepagesize: 2048 kB
Nivedita, if you get another such OOPS please try a sysrq+T and look if there is any task stuck in D state, posting the backtrace.
----- Additional Comments From nivedita.com (prefers email at niv.com) 2007-06-04 11:00 EDT ------- Arnaldo, OK, so I have gotten this oops twice again. Unfortunately, sysrq-t isn't responding. I'll try and grab more useful output somehow. This is only happening on our x3455 at the moment, which is also suffering other issues. The good news is that we can make it happen fairly regularly, although no single test etc seems to be provoking it.
----- Additional Comments From ankigarg.com (prefers email at ankita.com) 2007-06-05 08:12 EDT ------- So on this system, to test is sysrq-m was working fine, I found in the following in the output of Show memory: SysRq : Show Memory Mem-info: Node 0 DMA per-cpu: CPU 0: Hot: hi: 0, btch: 1 usd: 0 Cold: hi: 0, btch: 1 usd: 0 CPU 1: Hot: hi: 0, btch: 1 usd: 0 Cold: hi: 0, btch: 1 usd: 0 CPU 2: Hot: hi: 0, btch: 1 usd: 0 Cold: hi: 0, btch: 1 usd: 0 CPU 3: Hot: hi: 0, btch: 1 usd: 0 Cold: hi: 0, btch: 1 usd: 0 Node 0 DMA32 per-cpu: CPU 0: Hot: hi: 186, btch: 31 usd: 178 Cold: hi: 62, btch: 15 usd: 0 CPU 1: Hot: hi: 186, btch: 31 usd: 0 Cold: hi: 62, btch: 15 usd: 0 CPU 2: Hot: hi: 186, btch: 31 usd: 27 Cold: hi: 62, btch: 15 usd: 0 CPU 3: Hot: hi: 186, btch: 31 usd: 0 Cold: hi: 62, btch: 15 usd: 0 Node 0 Normal per-cpu: CPU 0: Hot: hi: 186, btch: 31 usd: 138 Cold: hi: 62, btch: 15 usd: 12 CPU 1: Hot: hi: 186, btch: 31 usd: 26 Cold: hi: 62, btch: 15 usd: 0 CPU 2: Hot: hi: 186, btch: 31 usd: 55 Cold: hi: 62, btch: 15 usd: 10 CPU 3: Hot: hi: 186, btch: 31 usd: 24 Cold: hi: 62, btch: 15 usd: 0 Node 1 Normal per-cpu: CPU 0: Hot: hi: 186, btch: 31 usd: 0 Cold: hi: 62, btch: 15 usd: 0 CPU 1: Hot: hi: 186, btch: 31 usd: 30 Cold: hi: 62, btch: 15 usd: 9 CPU 2: Hot: hi: 186, btch: 31 usd: 11 Cold: hi: 62, btch: 15 usd: 0 CPU 3: Hot: hi: 186, btch: 31 usd: 140 Cold: hi: 62, btch: 15 usd: 12 Active:85375 inactive:79740 dirty:48 writeback:0 unstable:0 free:1796150 slab:41690 mapped:2263 pagetables:651 bounce:0 Node 0 DMA free:15912kB min:20kB low:24kB high:28kB active:0kB inactive:0kB present:15532kB pages_scanned:0 all_unreclaimable? no lowmem_reserve[]: 0 3216 3966 Node 0 DMA32 free:3138700kB min:4604kB low:5752kB high:6904kB active:0kB inactive:0kB present:3293204kB pages_scanned:0 all_unreclaimable? no lowmem_reserve[]: 0 0 750 Node 0 Normal free:323600kB min:1072kB low:1340kB high:1608kB active:224140kB inactive:182644kB present:768000kB pages_scanned:0 all_unreclaimable? no lowmem_reserve[]: 0 0 0 Node 1 Normal free:3706388kB min:5728kB low:7160kB high:8592kB active:117360kB inactive:136316kB present:4096000kB pages_scanned:0 all_unreclaimable? no lowmem_reserve[]: 0 0 0 Node 0 DMA: 2*4kB 2*8kB 1*16kB 2*32kB 1*64kB 1*128kB 1*256kB 0*512kB 1*1024kB 1*2048kB 3*4096kB = 15912kB Node 0 DMA32: 1*4kB 1*8kB 2*16kB 1*32kB 3*64kB 5*128kB 1*256kB 2*512kB 3*1024kB 2*2048kB 764*4096kB = 3138700kB Node 0 Normal: 1220*4kB 572*8kB 252*16kB 191*32kB 132*64kB 95*128kB 29*256kB 25*512kB 7*1024kB 3*2048kB 61*4096kB = 323600kB Node 1 Normal: 79*4kB 5*8kB 5*16kB 5*32kB 31*64kB 40*128kB 12*256kB 2*512kB 6*1024kB 3*2048kB 899*4096kB = 3706388kB Swap cache: add 0, delete 0, find 0/0, race 0+0 Free swap = 8393952kB Total swap = 8393952kB Free swap: 8393952kB stopped custom tracer. Unable to handle kernel paging request at 0000000004e00000 RIP: [<ffffffff81081787>] show_mem+0x8f/0x144 PGD 125852067 PUD 116a2b067 PMD 0 Oops: 0000 [1] PREEMPT SMP CPU 2 Modules linked in: autofs4 hidp rfcomm l2cap bluetooth sunrpc ipv6 dm_mirror dm_mod video sbs i2c_ec i2c_core dock button battery asus_acpi ac parport_pc lp parport sg pcspkr shpchp k8temp hwmon bnx2 rtc_cmos rtc_core rtc_lib serio_raw usb_storage mptsas mptscsih mptbase scsi_transport_sas sd_mod scsi_mod ext3 jbd ehci_hcd ohci_hcd uhci_hcd Pid: 14896, comm: bash Not tainted 2.6.21-14.el5rt #1 RIP: 0010:[<ffffffff81081787>] [<ffffffff81081787>] show_mem+0x8f/0x144 RSP: 0018:ffff81021b201e78 EFLAGS: 00010246 RAX: 0000000000000000 RBX: 00000000000d0000 RCX: 000000000000001a RDX: 0000000004e00000 RSI: ffff8100019470d0 RDI: 00000000000d0000 RBP: ffff81021b201e98 R08: ffff810000012000 R09: ffff81012f623440 R10: 00007fff3dde0b93 R11: 0000000000000246 R12: 000000000000f67c R13: 0000000000000000 R14: 0000000000000000 R15: 0000000000000006 FS: 00002add69e68db0(0000) GS:ffff81012fc85b40(0000) knlGS:0000000000000000 CS: 0010 DS: 0000 ES: 0000 CR0: 000000008005003b CR2: 0000000004e00000 CR3: 0000000125f57000 CR4: 00000000000006e0 Process bash (pid: 14896, threadinfo ffff81021b200000, task ffff81022fd76040) Stack: ffffffff8145ab40 0000000000000000 000000000000006d 0000000000000000 ffff81021b201ea8 ffffffff811ad97c ffff81021b201ee8 ffffffff811ad878 ffff81011b53e000 0000000000000002 00002add6d36f000 ffff81021b201f48 Call Trace: [<ffffffff811ad97c>] sysrq_handle_showmem+0x9/0xb [<ffffffff811ad878>] __handle_sysrq+0x9a/0x128 [<ffffffff8110cbea>] write_sysrq_trigger+0x30/0x3b [<ffffffff81016aaa>] vfs_write+0xcf/0x158 [<ffffffff81017434>] sys_write+0x47/0x70 [<ffffffff8105f11e>] system_call+0x7e/0x83 [<0000003d188bfa90>] Code: 8b 02 f6 c4 04 74 05 49 ff c4 eb 35 8b 02 66 85 c0 79 05 49 RIP [<ffffffff81081787>] show_mem+0x8f/0x144 RSP <ffff81021b201e78> CR2: 0000000004e00000 This seems to be what has been seen here...
----- Additional Comments From ankigarg.com (prefers email at ankita.com) 2007-06-05 09:19 EDT ------- Pretty much on all the systems, sysrq-m gives the same backtrace.
This brings yet another bug, namely sysrq-m oopsing on x86_64. 2.6.22-rc3-git has the same problem, 2.6.20 doesn't, trying to fix this one now. But what was asked was not sysrq+m, but sysrq+t, so that we could see where tasks in D state are or if we could get more clues from it.
----- Additional Comments From ankigarg.com (prefers email at ankita.com) 2007-06-05 13:54 EDT ------- It was while working on bug #35225 (RH #242630), I came across the Oops. As reported in that bug, sysrq-t is not working.
----- Additional Comments From nivedita.com (prefers email at niv.com) 2007-06-05 14:18 EDT ------- Hey Arnaldo, I'll echo Ankita, I have been unable to get sysrq-T to work.
----- Additional Comments From jstultz.com (prefers email at johnstul.com) 2007-06-06 19:07 EDT ------- If we could, please keep the sysrq-m issue in LTC bug #35225 or open a new bug. Regarding the original OOPS, looking at the trace, it looks possibly related to the recently resolved __tasklet_action bug (LTC bug #34969 RH bug #242422). In that case rcu_process_callbacks and __tasklet_action were involved.
Ok, moved the sysrq+M discussion to bug #242630, where a fix is waiting for further testing on the LTC machines where the OOPSes are happening, which probably are machines with memory holes as shown in the e820 bootup messages :-)
----- Additional Comments From Dinakar.G.com (prefers email at dino.com) 2007-06-11 04:46 EDT ------- So I saw this bug while running release_testing.sh over the weekend. Jun 10 06:53:43 llm48 kernel: stopped custom tracer. Jun 10 06:53:43 llm48 kernel: Unable to handle kernel paging request at ffff810111f26130 RIP: Jun 10 06:53:43 llm48 kernel: [<ffff810111f26130>] Jun 10 06:53:43 llm48 kernel: PGD 8063 PUD d063 PMD 8000000111e001e3 PTE 4101834b118821ea Jun 10 06:53:43 llm48 kernel: Oops: 0011 [1] PREEMPT SMP Jun 10 06:53:43 llm48 kernel: CPU 1 Jun 10 06:53:43 llm48 kernel: Modules linked in: autofs4 hidp rfcomm l2cap bluetooth sunrpc ipv6 dm_mirror dm_mod video sbs i2c_ec dock button battery asus_acpi ac parport_pc lp parport joydev sr_mod cdrom sg i2c_amd756 i2c_core pcspkr amd_rng shpchp k8temp hwmon tg3 rtc_cmos rtc_core rtc_lib serio_raw usb_storage mptspi mptscsih mptbase scsi_transport_spi sd_mod scsi_mod ext3 jbd ehci_hcd ohci_hcd uhci_hcd Jun 10 06:53:43 llm48 kernel: Pid: 22, comm: softirq-tasklet Not tainted 2.6.21-23ibm1 #1 Jun 10 06:53:43 llm48 kernel: RIP: 0010:[<ffff810111f26130>] [<ffff810111f26130>] Jun 10 06:53:43 llm48 kernel: RSP: 0018:ffff810111ce7df8 EFLAGS: 00010282 Jun 10 06:53:43 llm48 kernel: RAX: ffff8101180eb800 RBX: ffff8102066f4e00 RCX: 000000000000000b Jun 10 06:53:43 llm48 kernel: RDX: ffff8101180eb800 RSI: ffff810001947210 RDI: ffff8101c09c5d40 Jun 10 06:53:43 llm48 kernel: RBP: ffff810111ce7e10 R08: ffff810111ce6000 R09: ffff810111cddde0 Jun 10 06:53:43 llm48 kernel: R10: ffffffff816e2700 R11: ffffffff8106d0e2 R12: ffff8101c09c5d40 Jun 10 06:53:43 llm48 kernel: R13: 0000000000000000 R14: 0000000000000001 R15: 0000000000000020 Jun 10 06:53:43 llm48 kernel: FS: 00002b7da57086f0(0000) GS:ffff810111c85b40(0000) knlGS:00000000f7ee76c0 Jun 10 06:53:43 llm48 kernel: CS: 0010 DS: 0018 ES: 0018 CR0: 000000008005003b Jun 10 06:53:43 llm48 kernel: CR2: ffff810111f26130 CR3: 0000000205c20000 CR4: 00000000000006e0 Jun 10 06:53:43 llm48 kernel: Process softirq-tasklet (pid: 22, threadinfo ffff810111ce6000, task ffff8101180eb800) Jun 10 06:53:43 llm48 kernel: Stack: ffffffff810c6941 ffffffff81858910 00000000000f4240 ffff810111ce7e40 Jun 10 06:53:43 llm48 kernel: ffffffff81094aed ffff810001f9b548 ffffffff817e2968 ffff810001f9b548 Jun 10 06:53:43 llm48 kernel: 00000000ffffffdf ffff810111ce7e50 ffffffff81094fd7 ffff810111ce7eb0 Jun 10 06:53:43 llm48 kernel: Call Trace: Jun 10 06:53:43 llm48 kernel: Jun 10 06:53:43 llm48 kernel: Jun 10 06:53:43 llm48 kernel: Code: 40 5d 9c c0 01 81 ff ff 80 3b 39 11 02 81 ff ff 01 00 00 00 Jun 10 06:53:43 llm48 kernel: RIP [<ffff810111f26130>] Jun 10 06:53:43 llm48 kernel: RSP <ffff810111ce7df8> Jun 10 06:53:43 llm48 kernel: CR2: ffff810111f26130 uname -a Linux llm48.in.ibm.com 2.6.21-23ibm1 #1 SMP PREEMPT RT Thu Jun 7 12:49:03 CDT 2007 x86_64 x86_64 x86_64 GNU/Linux Not sure which test triggered this though
----- Additional Comments From jstultz.com (prefers email at johnstul.com) 2007-06-12 19:38 EDT ------- I think I triggered a similar issue w/ the -23ibm2 kernel booting w/ "noaliencache" when running kernbench.
----- Additional Comments From mauery.com (prefers email at vernux.com) 2007-06-14 17:29 EDT ------- This was just seen on toro-rtj1: stopped custom tracer. Unable to handle kernel paging request at ffff810211d3d130 RIP: [<ffff810211d3d130>] PGD 8063 PUD 11063 PMD 8000000111a9e163 PTE 8000000211d3d163 Oops: 0011 [1] PREEMPT SMP CPU 2 Modules linked in: autofs4 hidp rfcomm l2cap bluetooth sunrpc ipv6 dm_mirror dm_mod video sbs i2c_ec dock button battery asus_acpi ac parport_pc lp parport joydev sr_mod cdrom sg pcspkr i2c_amd756 i2c_core amd_rng shpchp k8temp hwmon tg3 rtc_cmos rtc_core rtc_lib serio_raw usb_storage mptspi mptscsih mptbase scsi_transport_spi sd_mod scsi_mod ext3 jbd ehci_hcd ohci_hcd uhci_hcd Pid: 35, comm: softirq-tasklet Not tainted 2.6.21-23ibm2 #1 RIP: 0010:[<ffff810211d3d130>] [<ffff810211d3d130>] RSP: 0018:ffff810111d1ddf8 EFLAGS: 00010286 RAX: ffff81011816f800 RBX: ffff81005d747240 RCX: 0000000000000001 RDX: ffff81011816f800 RSI: ffff810001947088 RDI: ffff8100cf6112c0 RBP: ffff810111d1de10 R08: ffff810111d1c000 R09: 000000000000003c R10: ffffffff816e2700 R11: ffffffff8106d0e2 R12: ffff8100cf6112c0 R13: 0000000000000000 R14: 0000000000000002 R15: 0000000000000020 FS: 00002addf48ca120(0000) GS:ffff8101181549c0(0000) knlGS:00000000f7f07b90 CS: 0010 DS: 0018 ES: 0018 CR0: 000000008005003b CR2: ffff810211d3d130 CR3: 0000000207a2e000 CR4: 00000000000006e0 Process softirq-tasklet (pid: 35, threadinfo ffff810111d1c000, task ffff81011816f800) Stack: ffffffff810c6941 ffffffff81858910 00000000000f4240 ffff810111d1de40 ffffffff81094aed ffff8101180796c8 ffffffff817e2968 ffff8101180796c8 00000000ffffffdf ffff810111d1de50 ffffffff81094fd7 ffff810111d1deb0 Call Trace: Code: c0 12 61 cf 00 81 ff ff c0 ad 38 11 01 81 ff ff 01 00 00 00 RIP [<ffff810211d3d130>] RSP <ffff810111d1ddf8> CR2: ffff810211d3d130 Here is the message from the person that saw it: "I've been running some Java SVT tests on the toro-rtj1.austin.ibm.com box, which has RHEL5-RT installed. I was at a command prompt, and started to type 'ls', but i only got as far as the 'l' before the OS put out the following message: [rtjaxxon@toro-rtj1 ~]$ Message from syslogd@toro-rtj1 at Thu Jun 14 05:27:56 2007 ... toro-rtj1 kernel: Oops: 0011 [1] PREEMPT SMP Message from syslogd@toro-rtj1 at Thu Jun 14 05:27:57 2007 ... toro-rtj1 kernel: CR2: ffff810211d3d130 - This then didn't show a command prompt, but typing 'ls' again, followed by enter showed the directory as normal, then i was back at the prompt."
----- Additional Comments From jstultz.com (prefers email at johnstul.com) 2007-06-25 19:54 EDT ------- I've been trying to reproduce w/ both -23ibm3 and -31 and so far have had no luck. I think if it doesn't crop up for another week we can close this.
changed: What |Removed |Added ---------------------------------------------------------------------------- Status|OPEN |REJECTED Resolution| |UNREPRODUCIBLE ------- Additional Comments From jstultz.com (prefers email at johnstul.com) 2007-07-02 14:34 EDT ------- This has not been seen since -23ibm3. If it recurs, please reopen.
Closing per prior comment.