Bug 242630 - sysrq-M causes oops in 2.6.21-23.el5rt
Summary: sysrq-M causes oops in 2.6.21-23.el5rt
Keywords:
Status: CLOSED CURRENTRELEASE
Alias: None
Product: Red Hat Enterprise MRG
Classification: Red Hat
Component: realtime-kernel
Version: 1.0
Hardware: x86_64
OS: Linux
Priority: low
Severity: medium
Target Milestone: ---
Target Release: ---
Assignee: Arnaldo Carvalho de Melo
QA Contact:
URL:
Whiteboard:
Depends On:
Blocks:
Reported: 2007-06-05 07:09 UTC by IBM Bug Proxy
Modified: 2008-02-27 19:56 UTC (History)

Fixed In Version: 2.6.21-31.el5rt
Doc Type: Bug Fix
Doc Text:
Clone Of:
Environment:
Last Closed: 2007-06-20 16:23:05 UTC
Target Upstream Version:
Embargoed:


Attachments
fixes sysrq+m OOPS (2.26 KB, patch)
2007-06-07 15:43 UTC, Arnaldo Carvalho de Melo


Links
System ID                          Private  Priority  Status  Summary  Last Updated
IBM Linux Technology Center 35225  0        None      None    None     Never

Description IBM Bug Proxy 2007-06-05 07:09:55 UTC
LTC Owner is: jstultz.com
LTC Originator is: ankigarg.com


Problem description:

When trying to trigger dump using 'echo "c" > /proc/sysrq-trigger', the system
does not hang. 

dmesg:
SysRq : Trigger a crashdump

Whether or not kdump is enabled, echo "c" > /proc/sysrq-trigger should cause the
system to hang.
However, when kdump is _not_ enabled, sysrq-c does not hang the system.
Thanks Sripathi, for confirming this.

Also, I think Darren had tried sysrq-m, which had also not worked.
From IRC,
<dvhart> with both echo "m" > /proc/sysrq... and via send brk t
<dvhart> neither worked

 - Ankita
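
For reference, the triggers discussed above amount to writing a single command character to /proc/sysrq-trigger. A minimal Python sketch of that step follows. The helper name `trigger_sysrq` and its `path` parameter are hypothetical, chosen for illustration only. Running it against the real file requires root and really executes the command, so "c" will crash the machine.

```python
def trigger_sysrq(key, path="/proc/sysrq-trigger"):
    """Write one SysRq command character to the trigger file.

    Hypothetical helper for illustration; `path` is a parameter so the
    function can be pointed at a scratch file instead of the real trigger.
    """
    if len(key) != 1:
        raise ValueError("SysRq command must be a single character")
    with open(path, "w") as trigger:
        trigger.write(key)
```

Calling `trigger_sysrq("m")` as root would be the equivalent of `echo "m" > /proc/sysrq-trigger`.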

Comment 1 IBM Bug Proxy 2007-06-05 12:50:34 UTC
----- Additional Comments From ankigarg.com (prefers email at ankita.com)  2007-06-05 08:47 EDT -------
Now when I tried sysrq-c on rt-beech, it worked. On another machine I was even
able to get the output from sysrq-m. sysrq-t still does not seem to work. 

Comment 2 IBM Bug Proxy 2007-06-06 11:50:48 UTC
----- Additional Comments From sripathi.com (prefers email at sripathik.com)  2007-06-06 07:47 EDT -------
When we echo "m" to /proc/sysrq-trigger, the following oops is seen:
SysRq : Show Memory
Mem-info:
Node 0 DMA per-cpu:
CPU    0: Hot: hi:    0, btch:   1 usd:   0   Cold: hi:    0, btch:   1 usd:   0
CPU    1: Hot: hi:    0, btch:   1 usd:   0   Cold: hi:    0, btch:   1 usd:   0
CPU    2: Hot: hi:    0, btch:   1 usd:   0   Cold: hi:    0, btch:   1 usd:   0
CPU    3: Hot: hi:    0, btch:   1 usd:   0   Cold: hi:    0, btch:   1 usd:   0
Node 0 DMA32 per-cpu:
CPU    0: Hot: hi:  186, btch:  31 usd:  27   Cold: hi:   62, btch:  15 usd:   0
CPU    1: Hot: hi:  186, btch:  31 usd:  27   Cold: hi:   62, btch:  15 usd:   0
CPU    2: Hot: hi:  186, btch:  31 usd: 182   Cold: hi:   62, btch:  15 usd:   0
CPU    3: Hot: hi:  186, btch:  31 usd:  91   Cold: hi:   62, btch:  15 usd:   0
Node 0 Normal per-cpu:
CPU    0: Hot: hi:  186, btch:  31 usd:  14   Cold: hi:   62, btch:  15 usd:   6
CPU    1: Hot: hi:  186, btch:  31 usd:  72   Cold: hi:   62, btch:  15 usd:   1
CPU    2: Hot: hi:  186, btch:  31 usd: 174   Cold: hi:   62, btch:  15 usd:   3
CPU    3: Hot: hi:  186, btch:  31 usd: 173   Cold: hi:   62, btch:  15 usd:  14
Active:13193 inactive:59413 dirty:2 writeback:0 unstable:0
 free:1924957 slab:5818 mapped:2332 pagetables:823 bounce:0
Node 0 DMA free:15912kB min:20kB low:24kB high:28kB active:0kB inactive:0kB
present:15532kB pages_scanned:0 all_unreclaimable? no
lowmem_reserve[]: 0 3216 7966
Node 0 DMA32 free:3039500kB min:4604kB low:5752kB high:6904kB active:0kB
inactive:0kB present:3293204kB pages_scanned:0 all_unreclaimable? no
lowmem_reserve[]: 0 0 4750
Node 0 Normal free:4644416kB min:6800kB low:8500kB high:10200kB active:52772kB
inactive:237652kB present:4864000kB pages_scanned:0 all_unreclaimable? no
lowmem_reserve[]: 0 0 0
Node 0 DMA: 2*4kB 2*8kB 1*16kB 2*32kB 1*64kB 1*128kB 1*256kB 0*512kB 1*1024kB
1*2048kB 3*4096kB = 15912kB
Node 0 DMA32: 1*4kB 1*8kB 0*16kB 0*32kB 0*64kB 4*128kB 1*256kB 1*512kB 3*1024kB
2*2048kB 740*4096kB = 3039500kB
Node 0 Normal: 510*4kB 423*8kB 421*16kB 404*32kB 369*64kB 46*128kB 31*256kB
19*512kB 13*1024kB 0*2048kB 1113*4096kB = 4644416kB
Swap cache: add 0, delete 0, find 0/0, race 0+0
Free swap  = 2040244kB
Total swap = 2040244kB
Free swap:       2040244kB
stopped custom tracer.
Unable to handle kernel paging request at 0000000004e00000 RIP:
 [<ffffffff81081787>] show_mem+0x8f/0x144
PGD 223633067 PUD 223634067 PMD 0
Oops: 0000 [1] PREEMPT SMP
CPU 2
Modules linked in: autofs4 hidp rfcomm l2cap bluetooth sunrpc ipv6 dm_mirror
dm_mod video sbs i2c_ec i2c_core dock button battery asus_acpi ac parport_pc lp
parport sg pcspkr shpchp k8temp hwmon bnx2 rtc_cmos rtc_core rtc_lib serio_raw
usb_storage mptsas mptscsih mptbase scsi_transport_sas sd_mod scsi_mod ext3 jbd
ehci_hcd ohci_hcd uhci_hcd
Pid: 3062, comm: bash Not tainted 2.6.21-23.el5rt #1
RIP: 0010:[<ffffffff81081787>]  [<ffffffff81081787>] show_mem+0x8f/0x144
RSP: 0018:ffff810221059e78  EFLAGS: 00010246
RAX: 0000000000000000 RBX: 00000000000d0000 RCX: 000000000000001a
RDX: 0000000004e00000 RSI: ffff8100019470d0 RDI: 00000000000d0000
RBP: ffff810221059e98 R08: ffff810000012000 R09: 0000000000000020
R10: 0000000000000000 R11: 0000000000000000 R12: 00000000000156e4
R13: 0000000000000000 R14: 0000000000000000 R15: 0000000000000006
FS:  00002b27578f3db0(0000) GS:ffff81022fc9b640(0000) knlGS:0000000000000000
CS:  0010 DS: 0000 ES: 0000 CR0: 000000008005003b
CR2: 0000000004e00000 CR3: 0000000223d67000 CR4: 00000000000006e0
Process bash (pid: 3062, threadinfo ffff810221058000, task ffff81022f1b9800)
Stack:  ffffffff8145ab40 0000000000000000 000000000000006d 0000000000000000
 ffff810221059ea8 ffffffff811ad976 ffff810221059ee8 ffffffff811ad872
 ffff810221145000 0000000000000002 00002b275adfd000 ffff810221059f48
Call Trace:
 [<ffffffff811ad976>] sysrq_handle_showmem+0x9/0xb
 [<ffffffff811ad872>] __handle_sysrq+0x9a/0x128
 [<ffffffff8110cbe4>] write_sysrq_trigger+0x30/0x3b
 [<ffffffff81016aaa>] vfs_write+0xcf/0x158
 [<ffffffff81017434>] sys_write+0x47/0x70
 [<ffffffff8105f29e>] tracesys+0xdc/0xe1
 [<00000034ecebfa90>]


Code: 8b 02 f6 c4 04 74 05 49 ff c4 eb 35 8b 02 66 85 c0 79 05 49
RIP  [<ffffffff81081787>] show_mem+0x8f/0x144
 RSP <ffff810221059e78>
CR2: 0000000004e00000 
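
One possible reading of the register dump above, offered as an interpretation and not something stated in the report: if RBX/RDI hold the page frame number being examined, the faulting address in CR2 is an exact multiple of it. The sketch below only does the arithmetic. The 128 MiB sparsemem section size (SECTION_SIZE_BITS = 27) and the role assigned to each register are assumptions.

```python
PAGE_SHIFT = 12          # 4 KiB pages on x86_64
SECTION_SIZE_BITS = 27   # assumed: 128 MiB sparsemem sections on x86_64

fault_addr = 0x0000000004E00000  # CR2 / RDX in the oops
pfn = 0x00000000000D0000         # RBX / RDI, assumed to hold the pfn
rcx = 0x1A                       # RCX in the oops

# Number of page frames covered by one sparsemem section.
pfns_per_section = 1 << (SECTION_SIZE_BITS - PAGE_SHIFT)

section_nr = pfn // pfns_per_section      # section the pfn would belong to
phys_addr = pfn << PAGE_SHIFT             # physical address of that pfn
stride, remainder = divmod(fault_addr, pfn)

print(f"section number : {section_nr:#x} (RCX = {rcx:#x})")
print(f"physical addr  : {phys_addr:#x}")
print(f"fault / pfn    : {stride} bytes, remainder {remainder}")
```

If these assumptions hold, the section number matches RCX and the faulting address is the pfn times a fixed stride from a zero base. That is what a pfn-to-page lookup would yield for a section with no mem_map, with the stride being sizeof(struct page) for this kernel configuration.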

Comment 3 IBM Bug Proxy 2007-06-06 11:55:40 UTC
----- Additional Comments From sripathi.com (prefers email at sripathik.com)  2007-06-06 07:50 EDT -------
I have obtained a kdump for this problem. It is on rt-cypress.austin.ibm.com. Do
this to take a look at it:
/home/sripathi/crash-4.0-4.2/crash \
    /usr/lib/debug/lib/modules/2.6.21-23.el5rt/vmlinux \
    /var/crash/2007-06-05-02:28/vmcore

Note: The crash shipped with RHEL5 doesn't work (Bug 35280 - RH242869) 

Comment 4 IBM Bug Proxy 2007-06-06 12:00:33 UTC
----- Additional Comments From sripathi.com (prefers email at sripathik.com)  2007-06-06 07:55 EDT -------
Using a brute-force approach, I narrowed this problem down to something that
went in between 2.6.20.12 and 2.6.21. 

Comment 5 Guy Streeter 2007-06-06 15:30:39 UTC
I am not able to reproduce this sysrq-M oops on a 2-processor Opteron and the
2.6.21-23.el5rt kernel.
I have no trouble with any of the sysrq commands except C.

Is there anything you can identify that is specific to the system where it
oopsed? Is the oops reproducible?

Comment 6 IBM Bug Proxy 2007-06-06 16:40:44 UTC
------- Additional Comments From sripathi.com (prefers email at sripathik.com)  2007-06-06 12:36 EDT -------
(In reply to comment #14)
> Is there anything you can identify that is specific to the system where it
> oopsed? Is the oops reproducible?

I can recreate it 100% of the time on multiple machines. The details of the
machine where I am doing this are:
Hardware: LS21 blade
memory: 8GB
Swap: 2GB
Kernel: 2.6.21-23.el5rt

I don't know if there is anything specific about this hardware. Is there
anything specific that you suspect? 

Comment 7 IBM Bug Proxy 2007-06-06 22:00:47 UTC
changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
            Summary|RH242630- sysrq not         |RH242630- sysrq-m oopses on
                   |functional in 2.6.21-       |2.6.21-23.el5rt
                   |23.el5rt                    |




------- Additional Comments From jstultz.com (prefers email at johnstul.com)  2007-06-06 17:57 EDT -------
I also was able to reproduce the issue on an LS20 w/ 8Gigs and the
2.6.21-23.el5rt kernel. I'm updating the summary to better clarify the current
issue. 

Comment 8 IBM Bug Proxy 2007-06-06 22:11:05 UTC
----- Additional Comments From jstultz.com (prefers email at johnstul.com)  2007-06-06 18:07 EDT -------
Also just to confirm, sysrq-t worked fine on the box I was using. 

Comment 9 IBM Bug Proxy 2007-06-06 23:01:16 UTC
----- Additional Comments From jstultz.com (prefers email at johnstul.com)  2007-06-06 18:58 EDT -------
sysrq-m oops recreated w/ 2.6.21-rt8 

Comment 10 IBM Bug Proxy 2007-06-07 00:30:50 UTC
----- Additional Comments From jstultz.com (prefers email at johnstul.com)  2007-06-06 20:27 EDT -------
Recreated w/ current -git, so I've sent mail to lkml. 

Comment 11 IBM Bug Proxy 2007-06-07 02:15:37 UTC
----- Additional Comments From jstultz.com (prefers email at johnstul.com)  2007-06-06 22:11 EDT -------
This looks connected to CONFIG_NUMA 

Comment 12 IBM Bug Proxy 2007-06-07 09:40:25 UTC
----- Additional Comments From ankigarg.com (prefers email at ankita.com)  2007-06-07 05:38 EDT -------
Found that sysrq-m does not oops on RHEL5 kernel 

Comment 13 IBM Bug Proxy 2007-06-07 11:55:45 UTC
------- Additional Comments From sripathi.com (prefers email at sripathik.com)  2007-06-07 07:51 EDT -------
(In reply to comment #22)
> Found that sysrq-m does not oops on RHEL5 kernel
Please refer to comment #13. The problem was caused by something that went into
mainline in 2.6.21. Everything till 2.6.20.12 works fine. 

Comment 14 Arnaldo Carvalho de Melo 2007-06-07 14:21:34 UTC
I bisected this yesterday and got it down to this changeset:

commit f0a5a58aa812b31fd9f197c4ba48245942364eae
Author: Bob Picco <bob.picco>
Date:   Tue Feb 13 13:26:25 2007 +0100

    [PATCH] x86-64: clean up sparsemem memory_present call

    Eliminate arch specific memory_present call x86_64 NUMA by utilizing
    sparse_memory_present_with_active_regions.

    Acked-by: Mel Gorman <mel.ie>
    Signed-off-by: Bob Picco <bob.picco>
    Signed-off-by: Andi Kleen <ak>
    Cc: Andi Kleen <ak>
    Signed-off-by: Andrew Morton <akpm>

I posted a dmesg with loglevel=9 for Bob to take a look at
http://oops.ghostprotocols.net:81/acme/dmesg.txt.

This PowerEdge machine has 2 dual-core Xeons. I recently fixed a problem in
oprofile where it was using for_each_online_cpu to allocate msrs and then
for_each_possible_cpu to initialize the data structures. This machine
reports 8 possible CPUs and 4 online CPUs, which caused NULL dereferences.
Perhaps the same is happening with this bug; still checking, though. This could
explain why some x86_64 machines don't oops with sysrq+M: it would just be a
matter of having the cpu possible mask equal to the cpu online mask.
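
The online/possible mismatch described for the oprofile fix can be modelled in a few lines. This is a toy Python sketch, not oprofile's code: the dictionary layout and variable names are invented for illustration, and `None` stands in for a NULL pointer. It models only the hypothesis raised in this comment; comment 16 attributes the sysrq-M oops to the sparsemem changeset.

```python
possible_cpus = range(8)   # CPUs the machine reports as possible
online_cpus = range(4)     # CPUs that are actually online

# Allocate per-CPU state only for the online CPUs ...
msrs = {cpu: {"counters": []} for cpu in online_cpus}

# ... then try to initialise it for every possible CPU.
missing = []
for cpu in possible_cpus:
    state = msrs.get(cpu)       # None plays the role of a NULL pointer
    if state is None:
        missing.append(cpu)     # kernel code would dereference NULL here
        continue
    state["counters"].append(0)

print("CPUs without allocated state:", missing)
```

When the two masks are equal, `missing` stays empty, which is why such a bug would only show up on machines that report more possible CPUs than online ones.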

Comment 15 Guy Streeter 2007-06-07 14:30:05 UTC
I changed the summary of this issue so it reflects the sysrq-M oops problem.
Any other sysrq problems need a separate bugzilla.


Comment 16 Arnaldo Carvalho de Melo 2007-06-07 15:43:34 UTC
Created attachment 156474
fixes sysrq+m OOPS

Confirmed that the aforementioned patch is the one introducing the problem.
I talked with the patch author, Bob Picco, who provided a patch fixing the
problem; I am attaching it here. Tests on the machines where the problem was
appearing would be greatly appreciated. I tested by just booting 2.6.22-rc4-git
with the attached patch: OOPS fixed.

Comment 17 Arnaldo Carvalho de Melo 2007-06-07 19:48:56 UTC
Just to get a better understanding of the problem, can you please provide the
e820 map shown at bootup in the machines where the oops happens? Here is the one
in my test machine, that was oopsing too:

BIOS-provided physical RAM map:
 BIOS-e820: 0000000000000000 - 00000000000a0000 (usable)
 BIOS-e820: 0000000000100000 - 00000000cffa8000 (usable)
 BIOS-e820: 00000000cffa8000 - 00000000cffb7c00 (ACPI data)
 BIOS-e820: 00000000cffb7c00 - 00000000d0000000 (reserved)
 BIOS-e820: 00000000e0000000 - 00000000f0000000 (reserved)
 BIOS-e820: 00000000fe000000 - 0000000100000000 (reserved)
 BIOS-e820: 0000000100000000 - 0000000130000000 (usable)
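
Since the suspect changeset concerns which memory ranges sparsemem marks as present, it may help to list the address ranges this map does not cover at all. The Python sketch below parses the map above and prints the gaps between consecutive entries. The helper names `parse_e820` and `find_holes` are hypothetical, and whether these gaps are what trips show_mem is an assumption, not something the report states.

```python
import re

E820_MAP = """\
 BIOS-e820: 0000000000000000 - 00000000000a0000 (usable)
 BIOS-e820: 0000000000100000 - 00000000cffa8000 (usable)
 BIOS-e820: 00000000cffa8000 - 00000000cffb7c00 (ACPI data)
 BIOS-e820: 00000000cffb7c00 - 00000000d0000000 (reserved)
 BIOS-e820: 00000000e0000000 - 00000000f0000000 (reserved)
 BIOS-e820: 00000000fe000000 - 0000000100000000 (reserved)
 BIOS-e820: 0000000100000000 - 0000000130000000 (usable)
"""

LINE = re.compile(r"BIOS-e820:\s+([0-9a-f]+)\s+-\s+([0-9a-f]+)\s+\((.+)\)")


def parse_e820(text):
    """Return (start, end, type) tuples sorted by start address."""
    entries = []
    for match in LINE.finditer(text):
        start, end, kind = match.groups()
        entries.append((int(start, 16), int(end, 16), kind))
    return sorted(entries)


def find_holes(entries):
    """Return (start, end) ranges not covered by any e820 entry."""
    holes = []
    for (_, prev_end, _), (next_start, _, _) in zip(entries, entries[1:]):
        if next_start > prev_end:
            holes.append((prev_end, next_start))
    return holes


holes = find_holes(parse_e820(E820_MAP))
for start, end in holes:
    print(f"hole: {start:#012x} - {end:#012x} ({(end - start) >> 10} KiB)")
```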

Comment 18 IBM Bug Proxy 2007-06-07 19:56:19 UTC
changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
             Status|ASSIGNED                    |FIXEDAWAITINGTEST
         Resolution|                            |FIX_BY_DISTRO




------- Additional Comments From jstultz.com (prefers email at johnstul.com)  2007-06-07 15:52 EDT -------
The patch x86_64-for-show_mem-when-SPARSEMEM-is-configured.patch doesn't cleanly
apply to -rt but I merged it and verified it resolves the issue. 

TimB mentioned the fix would be in an upcoming release, so marking this as fixed
for now. 

Comment 19 IBM Bug Proxy 2007-06-07 20:00:41 UTC
----- Additional Comments From jstultz.com (prefers email at johnstul.com)  2007-06-07 15:58 EDT -------
Here's the e820 map for the box we've seen this on:
 BIOS-e820: 0000000000000000 - 000000000009d400 (usable)
 BIOS-e820: 000000000009d400 - 00000000000a0000 (reserved)
 BIOS-e820: 00000000000e0000 - 0000000000100000 (reserved)
 BIOS-e820: 0000000000100000 - 00000000edfcddc0 (usable)
 BIOS-e820: 00000000edfcddc0 - 00000000edfd0000 (ACPI data)
 BIOS-e820: 00000000edfd0000 - 00000000ee000000 (reserved)
 BIOS-e820: 00000000fec00000 - 0000000100000000 (reserved)
 BIOS-e820: 0000000100000000 - 0000000212000000 (usable) 
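
As a quick cross-check of this map against the 8 GB quoted for these machines, the usable ranges can be summed. This is a small Python sketch; the regular expression and variable names are illustrative only.

```python
import re

E820_MAP = """\
 BIOS-e820: 0000000000000000 - 000000000009d400 (usable)
 BIOS-e820: 000000000009d400 - 00000000000a0000 (reserved)
 BIOS-e820: 00000000000e0000 - 0000000000100000 (reserved)
 BIOS-e820: 0000000000100000 - 00000000edfcddc0 (usable)
 BIOS-e820: 00000000edfcddc0 - 00000000edfd0000 (ACPI data)
 BIOS-e820: 00000000edfd0000 - 00000000ee000000 (reserved)
 BIOS-e820: 00000000fec00000 - 0000000100000000 (reserved)
 BIOS-e820: 0000000100000000 - 0000000212000000 (usable)
"""

# Match only the entries marked "(usable)" and capture their bounds.
USABLE = re.compile(r"BIOS-e820:\s+([0-9a-f]+)\s+-\s+([0-9a-f]+)\s+\(usable\)")

usable_bytes = sum(
    int(end, 16) - int(start, 16) for start, end in USABLE.findall(E820_MAP)
)
print(f"usable RAM: {usable_bytes} bytes ({usable_bytes / 2**30:.3f} GiB)")
```

Part of the usable RAM sits above the 4 GiB boundary, so the physical address space has an uncovered range between 0xee000000 and 0xfec00000.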

Comment 21 Clark Williams 2007-06-15 18:01:20 UTC
patch from Bob Picco and Arnaldo (slightly munged by me) is in
kernel-rt-2.6.21.5-rt10 and should fix this problem

Comment 22 IBM Bug Proxy 2007-06-20 00:25:39 UTC
changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
             Status|FIXEDAWAITINGTEST           |ACCEPTED




------- Additional Comments From jstultz.com (prefers email at johnstul.com)  2007-06-19 20:19 EDT -------
Verified fixed in 2.6.21-31.el5rt 

