Bug 1434616

Summary:	CPU hotplug causes lglock to be taken from atomic context
Product:	Red Hat Enterprise Linux 7	Reporter:	Crystal Wood <crwood>
Component:	kernel-rt	Assignee:	Crystal Wood <crwood>
kernel-rt sub component:	Other	QA Contact:	Jiri Kastner <jkastner>
Status:	CLOSED ERRATA	Docs Contact:
Severity:	medium
Priority:	medium	CC:	bhu, crwood, lgoncalv, williams
Version:	7.4
Target Milestone:	rc
Target Release:	---
Hardware:	x86_64
OS:	Unspecified
Last Closed:	2017-08-01 19:02:59 UTC	Type:	Bug
Bug Blocks:	1353018, 1410158

Description Crystal Wood 2017-03-21 23:08:07 UTC

Doing the following on a system with at least 2 CPUs:
# echo 0 > /sys/devices/system/cpu/cpu1/online 
# echo 1 > /sys/devices/system/cpu/cpu1/online 

...in a debug RT kernel on x86_64 results in the output shown at the end of this description.  It could also cause a deadlock in a non-debug kernel if a stop_two_cpus() call (which does not take stop_cpus_mutex) is running at the same time.

The fix is to apply these patches from rt-4.8.15-rt10:
stomp-machine-create-lg_global_trylock_relax-primiti.patch
stomp-machine-use-lg_global_trylock_relax-to-dead-wi.patch


[  336.523845] SMP alternatives: lockdep: fixing up alternatives
[  336.592592] smpboot: Booting Node 0 Processor 1 APIC 0x2
[  336.662955] numa_add_cpu cpu 1 node 0: mask now 0-7
[  336.662978] BUG: sleeping function called from invalid context at kernel/rtmutex.c:818
[  336.662978] in_atomic(): 1, irqs_disabled(): 1, pid: 0, name: swapper/1
[  336.662979] 2 locks held by swapper/1/0:
[  336.662985]  #0:  (stop_cpus_mutex){+.+...}, at: [<ffffffffa713d2de>] stop_machine_from_inactive_cpu+0x8e/0x150
[  336.662988]  #1:  (stop_cpus_lock){+.+...}, at: [<ffffffffa713c9a6>] queue_stop_cpus_work.isra.6+0x36/0xd0
[  336.662988] irq event stamp: 6160834
[  336.662991] hardirqs last  enabled at (6160833): [<ffffffffa710890d>] tick_nohz_idle_enter+0x5d/0xb0
[  336.662994] hardirqs last disabled at (6160834): [<ffffffffa7047485>] play_dead_common+0x55/0x60
[  336.662996] softirqs last  enabled at (0): [<ffffffffa7081816>] copy_process+0x836/0x1e60
[  336.662997] softirqs last disabled at (0): [<          (null)>]           (null)
[  336.662998] CPU: 1 PID: 0 Comm: swapper/1 Tainted: G            E  ------------   3.10.0-rt56+ #1
[  336.662999] Hardware name: HP ProLiant m710 Server Cartridge/, BIOS H03 10/26/2015
[  336.663001]  ffff880282270000 5c27c3285c27b9da ffff880282277d98 ffffffffa775aea4
[  336.663002]  ffff880282277dc0 ffffffffa70cd86d ffff88083e3d2da0 0000000000000000
[  336.663003]  ffffffffa7c9e760 ffff880282277dd8 ffffffffa77633a0 0000000000000000
[  336.663004] Call Trace:
[  336.663008]  [<ffffffffa775aea4>] dump_stack+0x19/0x1b
[  336.663010]  [<ffffffffa70cd86d>] __might_sleep+0x12d/0x1f0
[  336.663013]  [<ffffffffa77633a0>] __rt_spin_lock+0x20/0x30
[  336.663016]  [<ffffffffa70c5bd0>] lg_global_lock+0x80/0xd0
[  336.663017]  [<ffffffffa713c9a6>] ? queue_stop_cpus_work.isra.6+0x36/0xd0
[  336.663019]  [<ffffffffa713c9a6>] queue_stop_cpus_work.isra.6+0x36/0xd0
[  336.663020]  [<ffffffffa713d334>] stop_machine_from_inactive_cpu+0xe4/0x150
[  336.663023]  [<ffffffffa703d780>] ? mtrr_restore+0xb0/0xb0
[  336.663025]  [<ffffffffa703e0e3>] mtrr_ap_init+0x83/0x90
[  336.663026]  [<ffffffffa7031ffd>] identify_secondary_cpu+0x1d/0x80
[  336.663028]  [<ffffffffa704595e>] smp_store_cpu_info+0x3e/0x40
[  336.663029]  [<ffffffffa704607d>] start_secondary+0xad/0x230
[  336.663030] ---------------------------
[  336.663031] | preempt count: 00000001 ]
[  336.663031] | 1-level deep critical section nesting:
[  336.663031] ----------------------------------------
[  336.663033] .. [<ffffffffa7045ffd>] .... start_secondary+0x2d/0x230
[  336.663051] .....[<00000000>] ..   ( <= 0x0)
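
The two sysfs writes at the top of this description can be wrapped in a small script that also checks the kernel log for the warning. This is only a sketch: the function names and the CPU_SYSFS_ROOT variable are made up for illustration and are not part of the original report, and on a real system the writes need root.

```shell
#!/bin/sh
# Sketch only: cycle a CPU offline/online through sysfs, then look for the
# "sleeping function called from invalid context" warning in a log stream.
# CPU_SYSFS_ROOT and both function names are illustrative assumptions.

CPU_SYSFS_ROOT="${CPU_SYSFS_ROOT:-/sys/devices/system/cpu}"

# Write 0, then 1, to cpuN/online. Returns non-zero if the control file
# is missing or not writable (for example, when not running as root).
cycle_cpu() {
    cpu="$1"
    ctl="$CPU_SYSFS_ROOT/cpu$cpu/online"
    [ -w "$ctl" ] || { echo "cannot write $ctl" >&2; return 1; }
    echo 0 > "$ctl" || return 1
    echo 1 > "$ctl" || return 1
}

# Reads log text on stdin; exits 0 if the might-sleep warning is present.
has_atomic_sleep_splat() {
    grep -q 'BUG: sleeping function called from invalid context'
}
```

A possible invocation on the affected debug kernel would be `cycle_cpu 1 && dmesg | has_atomic_sleep_splat && echo "warning reproduced"`.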

Comment 2 Crystal Wood 2017-03-21 23:11:11 UTC

The comment about a non-debug kernel should say, "It could also cause an inactive CPU to try to schedule" rather than "could also cause a deadlock".

Comment 3 Crystal Wood 2017-04-07 21:44:10 UTC

A better fix than the two RT patches originally mentioned is to cherry-pick commit e6253970413d99f416f7de8bd516e5f1834d8216 ("stop_machine: Remove stop_cpus_lock and lg_double_lock/unlock()").  Besides being simpler and closer to current upstream, it avoids potential latency problems associated with lglock (in particular, the global lock/unlock sequence in cpu_stopper_thread()).
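
For anyone trying the cherry-pick locally, it might look like the sketch below. The FIX_COMMIT variable and the backport_fix helper are hypothetical names; the script assumes the kernel tree already contains the upstream commit object (for example, after fetching mainline), and on a 3.10-based tree the pick may well stop on conflicts that need manual resolution.

```shell
#!/bin/sh
# Sketch only: cherry-pick the upstream commit named in this comment onto the
# currently checked-out branch of a kernel tree. Names here are illustrative.

FIX_COMMIT=e6253970413d99f416f7de8bd516e5f1834d8216

backport_fix() {
    tree="$1"
    # Make sure the commit object is available in this repository first.
    git -C "$tree" cat-file -e "${FIX_COMMIT}^{commit}" 2>/dev/null || {
        echo "commit $FIX_COMMIT not found in $tree" >&2
        return 1
    }
    # -x records the original commit ID in the new commit message.
    git -C "$tree" cherry-pick -x "$FIX_COMMIT"
}
```

Usage would be `backport_fix /path/to/kernel-tree`, where the path is whatever local checkout is being patched.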

Comment 8 errata-xmlrpc 2017-08-01 19:02:59 UTC

Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2017:2077
