Bug 991188

Summary:	Crash on beaker, possible rcu boost problem
Product:	Red Hat Enterprise MRG	Reporter:	Steven Rostedt <srostedt>
Component:	realtime-kernel	Assignee:	Red Hat Real Time Maintenance <rt-maint>
Status:	CLOSED WONTFIX	QA Contact:	MRG Quality Engineering <mrgqe-bugs>
Severity:	unspecified	Docs Contact:
Priority:	unspecified
Version:	3.0	CC:	bhu, jkacur, lgoncalv, williams
Target Milestone:	---
Target Release:	---
Hardware:	Unspecified
OS:	Unspecified
Whiteboard:
Fixed In Version:		Doc Type:	Bug Fix
Doc Text:		Story Points:	---
Clone Of:		Environment:
Last Closed:	2015-04-10 19:09:47 UTC	Type:	Bug
Regression:	---	Mount Type:	---
Documentation:	---	CRM:
Verified Versions:		Category:	---
oVirt Team:	---	RHEL 7.3 requirements from Atomic Host:
Cloudforms Team:	---	Target Upstream Version:
Embargoed:

Description Steven Rostedt 2013-08-01 19:33:06 UTC

Description of problem:

Beaker hit the following crash:

BUG: soft lockup - CPU#0 stuck for 23s! [hackbench:28505] 
Modules linked in: sunrpc acpi_cpufreq ipv6 gpio_ich iTCO_wdt iTCO_vendor_support joydev coretemp crc32c_intel microcode serio_raw pcspkr sg i2c_i801 lpc_ich shpchp igb ptp pps_core ioatdma dca i7core_edac edac_core ext4 jbd2 mbcache sr_mod cdrom sd_mod crc_t10dif ahci libahci pata_acpi ata_generic pata_jmicron radeon ttm drm_kms_helper drm hwmon i2c_algo_bit i2c_core dm_mirror dm_region_hash dm_log dm_mod [last unloaded: cpufreq_ondemand] 
CPU 0  
Pid: 28505, comm: hackbench Not tainted 3.6.11.5-rt37.53.el6rt.x86_64 #1 Supermicro X8DTN/X8DTN 
RIP: 0010:[<ffffffff815326c2>]  [<ffffffff815326c2>] _raw_spin_lock+0x32/0x40 
RSP: 0018:ffff880247169af8  EFLAGS: 00000a83 
RAX: 0000000000009e00 RBX: ffff880247168010 RCX: 00000000c0000100 
RDX: 0000000000004854 RSI: ffff880249075f40 RDI: ffff880248549a00 
RBP: ffff880247169af8 R08: 0000000000000000 R09: 0000000000000000 
R10: 0000000000000000 R11: 000000000000000c R12: ffff880247168010 
R13: ffff880248549a00 R14: ffffffff8153095b R15: ffff880247169ae8 
FS:  00007f326df50700(0000) GS:ffff880276c00000(0000) knlGS:0000000000000000 
CS:  0010 DS: 0000 ES: 0000 CR0: 000000008005003b 
CR2: 0000003ee8272750 CR3: 0000000244b6e000 CR4: 00000000000007f0 
DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000 
DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400 
Process hackbench (pid: 28505, threadinfo ffff880247168000, task ffff880249075f40) 
Stack: 
 ffff880247169bc8 ffffffff81531b15 ffff880247169b58 ffff8802441317d0 
 ffff880249075f40 ffff880249075f40 ffff880200000078 ffff880247169b30 
 ffff880247169b30 ffff880248549a08 ffff880248549a08 ffffffff00000078 
Call Trace: 
 [<ffffffff81531b15>] rt_spin_lock_slowlock+0x115/0x310 
 [<ffffffff81532306>] rt_spin_lock+0x26/0x30 
 [<ffffffff8107e9de>] __wake_up_sync_key+0x3e/0x80 
 [<ffffffff814452a3>] sock_def_readable+0x43/0x90 
 [<ffffffff814fca0c>] unix_stream_sendmsg+0x21c/0x4b0 
 [<ffffffff8143f957>] sock_aio_write+0x187/0x1a0 
 [<ffffffff8153095b>] ? __schedule+0x41b/0x900 
 [<ffffffff8116a43a>] do_sync_write+0xaa/0xf0 
 [<ffffffff8116aa33>] vfs_write+0x173/0x190 
 [<ffffffff8116ab51>] sys_write+0x51/0x90 
 [<ffffffff815397d2>] system_call_fastpath+0x16/0x1b 
Code: 66 90 65 48 8b 04 25 08 99 00 00 83 80 44 e0 ff ff 01 b8 00 00 01 00 f0 0f c1 07 89 c2 c1 ea 10 66 39 c2 74 0e 0f 1f 40 00 f3 90 <0f> b7 07 66 39 c2 75 f6 c9 c3 0f 1f 40 00 55 48 89 e5 66 66 66  
------------[ cut here ]------------ 
kernel BUG at kernel/rtmutex_common.h:75! 
invalid opcode: 0000 [#1] PREEMPT SMP  
Modules linked in: sunrpc acpi_cpufreq ipv6 gpio_ich iTCO_wdt iTCO_vendor_support joydev coretemp crc32c_intel microcode serio_raw pcspkr sg i2c_i801 lpc_ich shpchp igb ptp pps_core ioatdma dca i7core_edac edac_core ext4 jbd2 mbcache sr_mod cdrom sd_mod crc_t10dif ahci libahci pata_acpi ata_generic pata_jmicron radeon ttm drm_kms_helper drm hwmon i2c_algo_bit i2c_core dm_mirror dm_region_hash dm_log dm_mod [last unloaded: cpufreq_ondemand] 
CPU 9  
Pid: 14, comm: rcub/1 Not tainted 3.6.11.5-rt37.53.el6rt.x86_64 #1 Supermicro X8DTN/X8DTN 
RIP: 0010:[<ffffffff810a21fc>]  [<ffffffff810a21fc>] rt_mutex_adjust_prio_chain+0x46c/0x480 
RSP: 0000:ffff88026bcbbc40  EFLAGS: 00010086 
RAX: ffff8801ffffffe8 RBX: ffff880249075f40 RCX: 0000000000000246 
RDX: 0000000000000000 RSI: 0000000000000246 RDI: ffff880248549a00 
RBP: ffff88026bcbbcd0 R08: 0000000000000000 R09: 736f702f78756e69 
R10: 0000000000000001 R11: 0000000000000000 R12: 0000000000000000 
R13: ffff880248549a00 R14: ffff880247169b28 R15: ffff8802490765e8 
FS:  0000000000000000(0000) GS:ffff880276d40000(0000) knlGS:0000000000000000 
CS:  0010 DS: 0000 ES: 0000 CR0: 000000008005003b 
CR2: 0000000001bb2028 CR3: 00000001bb71b000 CR4: 00000000000007e0 
DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000 
DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400 
Process rcub/1 (pid: 14, threadinfo ffff88026bcba000, task ffff88026bcb4770) 
Stack: 
 ffff880247169b28 0000000000000246 ffff8801ffffffe8 ffff88026bcb4770 
 ffff88026bcba010 ffff88026bcbbd60 000000006bcbbcc0 ffff88026bcba000 
 000000016bcbbcb0 ffff88026bcbbe80 ffff880247169b28 0000000000000000 
Call Trace: 
 [<ffffffff810a26e5>] task_blocks_on_rt_mutex+0x175/0x280 
 [<ffffffff815317da>] rt_mutex_slowlock+0xaa/0x210 
 [<ffffffff815319ef>] rt_mutex_lock+0x2f/0x40 
 [<ffffffff810d2c6b>] rcu_boost_kthread+0x1bb/0x4a0 
 [<ffffffff810d2ab0>] ? rcu_cpu_kthread+0x640/0x640 
 [<ffffffff8106b506>] kthread+0x96/0xa0 
 [<ffffffff8153a8a4>] kernel_thread_helper+0x4/0x10 
 [<ffffffff8106b470>] ? kthreadd+0x1e0/0x1e0 
 [<ffffffff8153a8a0>] ? gs_change+0xb/0xb 
Code: e8 ca d3 48 00 e9 ba fe ff ff 0f 0b eb fe 0f 0b eb fe 0f 0b 66 66 2e 0f 1f 84 00 00 00 00 00 eb f3 e8 09 ed 48 00 e9 27 fe ff ff <0f> 0b 66 90 eb fc 0f 0b eb fe 66 2e 0f 1f 84 00 00 00 00 00 55  
RIP  [<ffffffff810a21fc>] rt_mutex_adjust_prio_chain+0x46c/0x480 
 RSP <ffff88026bcbbc40> 
---[ end trace 0000000000000002 ]--- 
note: rcub/1[14] exited with preempt_count 2 
------------[ cut here ]------------ 


I'm concerned about the warning on the rcu_boost_kthread in rt_mutex_lock(). The code that crashed there is:

	t = container_of(tb, struct task_struct, rcu_node_entry);
	rt_mutex_init_proxy_locked(&mtx, t);
	t->rcu_boost_mutex = &mtx;
	raw_spin_unlock_irqrestore(&rnp->lock, flags);
	rt_mutex_lock(&mtx);  /* Side effect: boosts task t's priority. */
	rt_mutex_unlock(&mtx);  /* Keep lockdep happy. */

What it does is to detect that a process is holding up RCU, it finds that process, creates an rt_mutex on its stack and initializes it as a proxy lock, having owner be the task that's holding up RCU.  It then blocks on the mutex to use the pi algorithm to boost the task to run again.

The bug triggered within the grabbing of the rt_mutex_lock():

rt_mutex_top_waiter(struct rt_mutex *lock)
{
	struct rt_mutex_waiter *w;

	w = plist_first_entry(&lock->wait_list, struct rt_mutex_waiter,
			       list_entry);
	BUG_ON(w->lock != lock);  <<-- bug here

	return w;
}

Now no one should be waiting on the lock except for the rcu boost thread that created the lock.

Comment 3 Beth Uptagrafft 2015-04-10 19:09:47 UTC

This issue has not been updated in a while and is using an older, unsupported kernel. This BZ is being closed WONTFIX.