Bug 971900

Summary:	BUG: unable to handle kernel paging request at 000000067b556720 on v3.8-rt
Product:	Red Hat Enterprise MRG	Reporter:	John Kacur <jkacur>
Component:	realtime-kernel	Assignee:	John Kacur <jkacur>
Status:	CLOSED WONTFIX	QA Contact:	MRG Quality Engineering <mrgqe-bugs>
Severity:	unspecified	Docs Contact:
Priority:	unspecified
Version:	Development	CC:	bhu, lgoncalv, srostedt, williams
Target Milestone:	---
Target Release:	---
Hardware:	Unspecified
OS:	Unspecified
Whiteboard:
Fixed In Version:		Doc Type:	Bug Fix
Doc Text:		Story Points:	---
Clone Of:		Environment:
Last Closed:	2015-04-10 19:16:27 UTC	Type:	Bug
Regression:	---	Mount Type:	---
Documentation:	---	CRM:
Verified Versions:		Category:	---
oVirt Team:	---	RHEL 7.3 requirements from Atomic Host:
Cloudforms Team:	---	Target Upstream Version:
Embargoed:

Description John Kacur 2013-06-07 14:14:42 UTC

Aborting core
Additional info:
BUG: unable to handle kernel paging request at 000000067b556720 
[12017.143876] IP: [<ffffffff81089bb3>] try_to_wake_up+0xb3/0x2d0 
[12017.143878] PGD 4954ed067 PUD 0  
[12017.143881] Oops: 0000 [#1] PREEMPT SMP  
[12017.143917] Modules linked in: ts_kmp nf_conntrack_ipv4 nf_defrag_ipv4 nls_koi8_u nls_cp932 arc4 ecb md4 nls_utf8 cifs nfsv4 auth_rpcgss nfsv3 nfs_acl nfsv2 nfs lockd bluetooth rfkill sunrpc cpufreq_ondemand ipv6 gpio_ich iTCO_wdt iTCO_vendor_support e1000 joydev sg coretemp acpi_cpufreq freq_table mperf crc32c_intel microcode serio_raw pcspkr i2c_i801 lpc_ich ioatdma i7core_edac edac_core shpchp igb dca ptp pps_core ext4 jbd2 mbcache sr_mod cdrom sd_mod crc_t10dif ahci libahci ata_generic pata_acpi pata_jmicron radeon ttm drm_kms_helper drm hwmon i2c_algo_bit i2c_core dm_mirror dm_region_hash dm_log dm_mod [last unloaded: rmd128] 
[12017.143920] CPU 6  
[12017.143920] Pid: 22035, comm: mtest01 Not tainted 3.8.13-rt9.3.el6rt.x86_64.debug #1 Supermicro X8DTN/X8DTN 
[12017.143924] RIP: 0010:[<ffffffff81089bb3>]  [<ffffffff81089bb3>] try_to_wake_up+0xb3/0x2d0 
[12017.143925] RSP: 0000:ffff880493bdbc38  EFLAGS: 00010046 
[12017.143926] RAX: 00000000df34c058 RBX: ffff880493f08000 RCX: 00000000a93ba93a 
[12017.143927] RDX: 000000000020fe40 RSI: ffffffff81089b3c RDI: ffff880493f08748 
[12017.143928] RBP: ffff880493bdbc78 R08: 0000000000000000 R09: 0000000000000000 
[12017.143931] R10: 0000000000000001 R11: 0000000000000000 R12: 0000000000000008 
[12017.143932] R13: ffff880493f08730 R14: 0000000000000282 R15: 00000000df34c058 
[12017.143934] FS:  00007fea80e2d700(0000) GS:ffff8804b4e00000(0000) knlGS:0000000000000000 
[12017.143935] CS:  0010 DS: 0000 ES: 0000 CR0: 000000008005003b 
[12017.143936] CR2: 000000067b556720 CR3: 000000049335f000 CR4: 00000000000007e0 
[12017.143938] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000 
[12017.143939] DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400 
[12017.143940] Process mtest01 (pid: 22035, threadinfo ffff880493bda000, task ffff880491854840) 
[12017.143941] Stack: 
[12017.143948]  000000000020fe40 0000000000000282 ffff880493bdbc68 ffff88049237b268 
[12017.143953]  ffff8802df34c958 ffff88049237b2b0 ffff880491854840 0000000000000282 
[12017.143958]  ffff880493bdbc88 ffffffff81089e28 ffff880493bdbc98 ffffffff810b6e29 
[12017.143959] Call Trace: 
[12017.143964]  [<ffffffff81089e28>] wake_up_lock_sleeper+0x18/0x20 
[12017.143970]  [<ffffffff810b6e29>] rt_mutex_wake_waiter+0x29/0x30 
[12017.143973]  [<ffffffff810b7041>] wakeup_next_waiter+0x91/0xd0 
[12017.143979]  [<ffffffff815a7b29>] rt_spin_lock_slowunlock+0x49/0x90 
[12017.143982]  [<ffffffff815a854e>] rt_spin_unlock+0x2e/0x40 
[12017.143986]  [<ffffffff81160874>] do_anonymous_page+0xc4/0x290 
[12017.143990]  [<ffffffff81165244>] handle_pte_fault+0x1a4/0x1c0 
[12017.143993]  [<ffffffff81165395>] handle_mm_fault+0x135/0x1d0 
[12017.143997]  [<ffffffff810b88e9>] ? rt_down_read_trylock+0x69/0xa0 
[12017.144000]  [<ffffffff815acdd4>] ? __do_page_fault+0x1f4/0x540 
[12017.144003]  [<ffffffff815ace21>] __do_page_fault+0x241/0x540 
[12017.144007]  [<ffffffff815a93fb>] ? _raw_spin_unlock_irq+0x3b/0x70 
[12017.144010]  [<ffffffff815a9df3>] ? error_sti+0x5/0x6 
[12017.144015]  [<ffffffff810aada9>] ? trace_hardirqs_off_caller+0x29/0xc0 
[12017.144020]  [<ffffffff812ae09d>] ? trace_hardirqs_off_thunk+0x3a/0x3c 
[12017.144023]  [<ffffffff815ad12e>] do_page_fault+0xe/0x10 
[12017.144026]  [<ffffffff815a9bf2>] page_fault+0x22/0x30 
[12017.144053] Code: d2 44 8b 78 18 0f 84 7f 00 00 00 44 8b 0d 66 66 a0 00 44 89 f8 45 85 c9 0f 85 9e 01 00 00 48 c7 c2 40 fe 20 00 89 c0 48 89 55 c0 <48> 8b 04 c5 60 64 af 81 48 8d 04 02 48 89 c7 48 89 45 c8 e8 65  
[12017.144055] RIP  [<ffffffff81089bb3>] try_to_wake_up+0xb3/0x2d0 
[12017.144055]  RSP <ffff880493bdbc38> 
[12017.144056] CR2: 000000067b556720 
[12017.496313] ---[ end trace 0000000000000002 ]--- 
[12017.496317] BUG: unable to handle kernel paging request at 000000067b556720 
[12017.496319] note: mtest01[22035] exited with preempt_count 2 
[12017.496324] IP: [<ffffffff8108483d>] resched_task+0x1d/0x80 
[12017.496326] PGD 0  
[12017.496328] Oops: 0000 [#2] PREEMPT SMP  
[12017.496354] Modules linked in: ts_kmp nf_conntrack_ipv4 nf_defrag_ipv4 nls_koi8_u nls_cp932 arc4 ecb md4 nls_utf8 cifs nfsv4 auth_rpcgss nfsv3 nfs_acl nfsv2 nfs lockd bluetooth rfkill sunrpc cpufreq_ondemand ipv6 gpio_ich iTCO_wdt iTCO_vendor_support e1000 joydev sg coretemp acpi_cpufreq freq_table mperf crc32c_intel microcode serio_raw pcspkr i2c_i801 lpc_ich ioatdma i7core_edac edac_core shpchp igb dca ptp pps_core ext4 jbd2 mbcache sr_mod cdrom sd_mod crc_t10dif ahci libahci ata_generic pata_acpi pata_jmicron radeon ttm drm_kms_helper drm hwmon i2c_algo_bit i2c_core dm_mirror dm_region_hash dm_log dm_mod [last unloaded: rmd128] 
[12017.496357] CPU 0  
[12017.496357] Pid: 0, comm: swapper/0 Tainted: G      D      3.8.13-rt9.3.el6rt.x86_64.debug #1 Supermicro X8DTN/X8DTN 
[12017.496359] RIP: 0010:[<ffffffff8108483d>]  [<ffffffff8108483d>] resched_task+0x1d/0x80 
[12017.496360] RSP: 0000:ffff880274e03cf8  EFLAGS: 00010046 
[12017.496361] RAX: 000000000020fe40 RBX: ffff880493f08000 RCX: 00000000df34c058 
[12017.496361] RDX: ffff8802df34c000 RSI: ffff880251d64840 RDI: ffff880493f08000 
[12017.496362] RBP: ffff880274e03d08 R08: 0000000000000403 R09: 00000000000e7ef0 
[12017.496363] R10: 0000000000000001 R11: 0000000000000000 R12: ffff880251d64840 
[12017.496363] R13: ffff8804b540fe40 R14: 0000000000000000 R15: ffff8804b5200000 
[12017.496365] FS:  0000000000000000(0000) GS:ffff880274e00000(0000) knlGS:0000000000000000 
[12017.496365] CS:  0010 DS: 0000 ES: 0000 CR0: 000000008005003b 
[12017.496366] CR2: 000000067b556720 CR3: 0000000001a15000 CR4: 00000000000007f0 
[12017.496367] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000 
[12017.496368] DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400 
[12017.496369] Process swapper/0 (pid: 0, threadinfo ffffffff81a00000, task ffffffff81a1d420) 
[12017.496369] Stack: 
[12017.496373]  ffff8804b540fe40 ffff880251d64840 ffff880274e03d28 ffffffff81084a85 
[12017.496376]  ffff8804b540fe40 ffff8804b540fe40 ffff880274e03d58 ffffffff81088e85 
[12017.496379]  ffff8804b540fe40 ffff880251d64840 0000000000000007 0000000000000007 
[12017.496380] Call Trace: 
[12017.496382]  <IRQ>  
[12017.496383]  [<ffffffff81084a85>] check_preempt_curr+0x85/0xa0 
[12017.496386]  [<ffffffff81088e85>] ttwu_do_wakeup+0x45/0x140 
[12017.496388]  [<ffffffff8108915a>] ttwu_do_activate.clone.3+0x4a/0x60 
[12017.496389]  [<ffffffff8108923e>] ttwu_queue+0xce/0xf0 
[12017.496392]  [<ffffffff81091ff0>] ? select_task_rq_rt+0x50/0x100 
[12017.496394]  [<ffffffff81089cd0>] try_to_wake_up+0x1d0/0x2d0 
[12017.496396]  [<ffffffff81089e57>] wake_up_process+0x27/0x50 
[12017.496399]  [<ffffffff810ed1f4>] handle_irq_event_percpu+0x114/0x290 
[12017.496404]  [<ffffffff8147abf5>] ? cpuidle_wrap_enter+0x55/0xa0 
[12017.496406]  [<ffffffff810ed3b8>] handle_irq_event+0x48/0x70 
[12017.496409]  [<ffffffff810f036a>] handle_fasteoi_irq+0x5a/0xf0 
[12017.496412]  [<ffffffff81004701>] handle_irq+0x71/0x190 
[12017.496415]  [<ffffffff815ad1b0>] ? notifier_call_chain+0x80/0x80 
[12017.496417]  [<ffffffff810aae4d>] ? trace_hardirqs_off+0xd/0x10 
[12017.496420]  [<ffffffff815b2e5d>] do_IRQ+0x5d/0xe0 
[12017.496423]  [<ffffffff815a98ef>] common_interrupt+0x6f/0x6f 
[12017.496424]  <EOI>  
[12017.496427]  [<ffffffff810a631c>] ? clockevents_notify+0x4c/0x1b0 
[12017.496429]  [<ffffffff8147abf1>] ? cpuidle_wrap_enter+0x51/0xa0 
[12017.496431]  [<ffffffff8147abf5>] ? cpuidle_wrap_enter+0x55/0xa0 
[12017.496433]  [<ffffffff8147abf1>] ? cpuidle_wrap_enter+0x51/0xa0 
[12017.496435]  [<ffffffff8147ac50>] cpuidle_enter_tk+0x10/0x20 
[12017.496437]  [<ffffffff8147a5e7>] cpuidle_enter_state+0x17/0x50 
[12017.496439]  [<ffffffff8147af7c>] cpuidle_idle_call+0xdc/0x1f0 
[12017.496443]  [<ffffffff8100b9e5>] cpu_idle+0xf5/0x160 
[12017.496446]  [<ffffffff81598d45>] rest_init+0xc5/0xd0 
[12017.496448]  [<ffffffff81598c80>] ? csum_partial_copy_generic+0x170/0x170 
[12017.496452]  [<ffffffff81d12da0>] start_kernel+0x3c3/0x3d0 
[12017.496454]  [<ffffffff81d127fb>] ? repair_env_string+0x5b/0x5b 
[12017.496456]  [<ffffffff81d1232d>] x86_64_start_reservations+0x131/0x136 
[12017.496458]  [<ffffffff81d1241f>] x86_64_start_kernel+0xed/0xf4 
[12017.496478] Code: b2 74 db 00 e9 ee fe ff ff 0f 1f 44 00 00 55 48 89 e5 41 54 53 66 66 66 66 90 48 8b 57 10 48 c7 c0 40 fe 20 00 48 89 fb 8b 4a 18 <48> 03 04 cd 60 64 af 81 0f b7 08 0f b7 40 02 66 39 c1 74 45 48  
[12017.496480] RIP  [<ffffffff8108483d>] resched_task+0x1d/0x80 
[12017.496481]  RSP <ffff880274e03cf8> 
[12017.496481] CR2: 000000067b556720 
[12017.920911] ---[ end trace 0000000000000003 ]---

Comment 1 John Kacur 2013-06-07 14:20:43 UTC

Stephen Rostedt suggests that this may be due to stack corruption and

This is fixed in mainline, and needs to be backported to 3.8:

commit 091d0d55b286c9340201b4ed4470be87fc568228
Author: Li Zefan <lizefan>
Date:   Thu May 9 15:08:15 2013 +0800

    shm: fix null pointer deref when userspace specifies invalid
hugepage size

Luis Claudio has been doing work to reproduce this, but it is apparently more easily caught with our automated testing system.

Comment 2 Steven Rostedt 2013-06-07 16:02:56 UTC

No, this one is not fixed in mainline. I just said it might be the fix, but there's nothing in that output that suggests that the commit I posted fixes this issue.

Beth posted several crashes, this was just the first one, which looks to be a stack corruption on the wake up. I'll explain what happened in this crash:

[12017.143924] RIP: 0010:[<ffffffff81089bb3>]  [<ffffffff81089bb3>] try_to_wake_up+0xb3/0x2d0 

Looking at the disassembled code:

 0xffffffff81089b82 <+130>: mov    0x10(%rbx),%rax
 0xffffffff81089b86 <+134>: test   %r10d,%r10d
 0xffffffff81089b89 <+137>: mov    0x18(%rax),%r15d
 0xffffffff81089b9a <+154>: mov    %r15d,%eax
 0xffffffff81089bb3 <+179>: mov    -0x7e509ba0(,%rax,8),%rax  <-- crash here

The location in code was here:

 rq = task_rq(p);

But we need to look at task_rq(p):

#define task_rq(p)		cpu_rq(task_cpu(p))

static inline unsigned int task_cpu(const struct task_struct *p)
{
	return task_thread_info(p)->cpu;
}

Where:
 #define task_thread_info(task) ((struct thread_info *)(task)->stack)

and:

#define cpu_rq(cpu)		(&per_cpu(runqueues, (cpu)))

I wont go into the ugliness of the per_cpu() but that's the:

  mov    -0x7e509ba0(,%rax,8),%rax


When starting into this, we have the task structure (saved in %rbx)

RBX: ffff880493f08000

Which looks like a legit pointer to a task_struct.

The 0x10(%rbx), %rax is:

struct task_struct {
	volatile long state;
	volatile long saved_state;
	void *stack;

As this is 64bit, each long there is 8 bytes, and that leaves us with the offset of 0x10 being "stack".

Thus we moved the stack pointer into %rax, and then took the offset of that:

 mov    0x18(%rax),%r15d

struct thread_info {
	struct task_struct	*task;
	struct exec_domain	*exec_domain;
	__u32			flags;
	__u32			status;
	__u32			cpu;

The task and exec_domain pointers are 8 bytes each, the flags and status are 4 bytes each: 8 + 8 + 4 + 4 = 24 = 0x18

Thus we moved the cpu # of the task into %r15

R15: 00000000df34c058

???

Unless we have more than 3,744,776,280 CPUs something totally went bad here.

mov    %r15d,%eax

Just chops it down to 32 bits.

mov    -0x7e509ba0(,%rax,8),%rax 

Loads the per_cpu value based off of the cpu (which is a really huge number). As I highly doubt we have a box with over 3 billion CPUs, the offset generated points into some unknown memory and we take a kernel page fault and crash.

As the task cpu (task_cpu) is stored at the bottom of the tasks stack, this can be corrupted if the stack gets too big. Which could have been what happened. Usually when this happens the task with the big stack will crash and we might get a clue to what happened. But unfortunately, here it was another task reading the corrupted tasks stack that died, and we have no idea why that task had a corrupted stack.

But that was the above reported bug, and has really nothing to do with the fix.

For the suggested fix, 091d0d55b "shm: fix null pointer deref when userspace specifies invalid hugepage size", this fixes the other crashes the Beth reported in that same email:

BUG: unable to handle kernel NULL pointer dereference at 0000000000000008 
IP: [<ffffffff8113ec85>] sys_mmap_pgoff+0x145/0x260 
PGD 424340067 PUD 424341067 PMD 0  
Oops: 0000 [#11] PREEMPT SMP  
Modules linked in: sunrpc ipv6 acpi_power_meter gpio_ich iTCO_wdt iTCO_vendor_support joydev coretemp hwmon crc32c_intel ghash_clmulni_intel microcode sg serio_raw pcspkr hpwdt hpilo lpc_ich i7core_edac edac_core be2iscsi iscsi_boot_sysfs libiscsi scsi_transport_iscsi be2net ext4 jbd2 mbcache sd_mod crc_t10dif aesni_intel ablk_helper cryptd lrw aes_x86_64 xts gf128mul hpsa mgag200 ttm drm_kms_helper drm i2c_algo_bit sysimgblt sysfillrect i2c_core syscopyarea dm_mirror dm_region_hash dm_log dm_mod [last unloaded: hwlat_detector] 
CPU 14  
Pid: 27314, comm: scrashme Tainted: G      D      3.8.13-rt9.3.el6rt.x86_64 #1 HP ProLiant BL460c G7 
RIP: 0010:[<ffffffff8113ec85>]  [<ffffffff8113ec85>] sys_mmap_pgoff+0x145/0x260 
RSP: 0018:ffff88049616bf08  EFLAGS: 00010246 
RAX: 0000000000000000 RBX: 1fed4a6cd07d13ec RCX: 0000000000000009 
RDX: 0000000000100000 RSI: 0000000000001000 RDI: 0000000000100000 
RBP: ffff88049616bf68 R08: 0000000000200000 R09: 0000000000000034 
R10: 272d3d7983b6f408 R11: 0000000000000206 R12: 0000000000000009 
R13: 0000000000000000 R14: 1654e0c99499bc30 R15: 0000000000000000 
FS:  00007fb6640d5700(0000) GS:ffff880c0bce0000(0000) knlGS:0000000000000000 
CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033 
CR2: 0000000000000008 CR3: 00000003c68d3000 CR4: 00000000000007e0 
DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000 
DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400 
Process scrashme (pid: 27314, threadinfo ffff88049616a000, task ffff8805d5552fc0) 
Stack: 
 ffff8805d5552fc0 0000000000100000 0000000000000034 272d3d7983b6f408 
 00007fffdc63e30e 0000000000000000 ffff88049616bf48 0000000000000360 
 0000000000000009 0000000000000007 00007fffdc63e30e 0000000000402eab 
Call Trace: 
 [<ffffffff810070d9>] sys_mmap+0x29/0x30 
 [<ffffffff81553482>] system_call_fastpath+0x16/0x1b 
Code: 4c 8b 7d f8 c9 c3 44 89 c9 ba 01 00 00 00 44 89 4d b0 d3 e2 4c 89 55 b8 48 63 d2 48 89 d7 48 89 55 a8 e8 cf 18 01 00 48 8b 55 a8 <8b> 48 08 be 00 10 00 00 48 89 f0 48 89 75 a8 48 d3 e0 48 89 d7  
RIP  [<ffffffff8113ec85>] sys_mmap_pgoff+0x145/0x260 
 RSP <ffff88049616bf08> 


Where the crash happened here:

sys_mmap_pgoff (mm/mmap.c:1304)

1301		} else if (flags & MAP_HUGETLB) {
1302			struct user_struct *user = NULL;
1303	
1304			len = ALIGN(len, huge_page_size(hstate_sizelog(
1305				(flags >> MAP_HUGE_SHIFT) & MAP_HUGE_MASK)));

Disassembling this:

   0xffffffff8113ec7c <+316>:	callq  0xffffffff81150550 <size_to_hstate>
   0xffffffff8113ec81 <+321>:	mov    -0x58(%rbp),%rdx
   0xffffffff8113ec85 <+325>:	mov    0x8(%rax),%ecx

rax is the return value of size_to_hstate, which is hidden in the hstate_sizelog():

static inline struct hstate *hstate_sizelog(int page_size_log)
{
	if (!page_size_log)
		return &default_hstate;
	return size_to_hstate(1 << page_size_log);
}

As you can see, it returns the size_to_hstate(). If we look at what was returned:

RAX: 0000000000000000

It is NULL. Which is something that size_to_hstate() can return. The current code doesn't handle a NULL pointer being returned and crashes. The commit I showed has this:

@@ -1367,9 +1367,13 @@ SYSCALL_DEFINE6(mmap_pgoff, unsigned long, addr, unsigned long, len,
                        len = ALIGN(len, huge_page_size(hstate_file(file)));
        } else if (flags & MAP_HUGETLB) {
                struct user_struct *user = NULL;
+               struct hstate *hs = hstate_sizelog((flags >> MAP_HUGE_SHIFT) &
+                                                  SHM_HUGE_MASK);
 
-               len = ALIGN(len, huge_page_size(hstate_sizelog(
-                       (flags >> MAP_HUGE_SHIFT) & MAP_HUGE_MASK)));
+               if (!hs)
+                       return -EINVAL;
+
+               len = ALIGN(len, huge_page_size(hs));
                /*
                 * VM_NORESERVE is used because the reservations will be
                 * taken when vm_ops->mmap() is called

As you can see above, it checks the return value of hstate_sizelog(), and returns with a -EINVAL if it returns NULL and does not crash the system.

Now did this cause the original bug in this BZ? I don't know. But I never said that the commit would fix it.

Comment 3 Beth Uptagrafft 2015-04-10 19:16:27 UTC

This issue has not been updated in a while and is using an older, unsupported kernel. This BZ is being closed WONTFIX.