Bug 602414

Summary: Machine locks up when running SysRq-w
Product: Red Hat Enterprise Linux 4
Component: kernel
Version: 4.8
Hardware: All
OS: Linux
Status: CLOSED WONTFIX
Severity: medium
Priority: medium
Target Milestone: rc
Reporter: Jon Thomas <jthomas>
Assignee: Red Hat Kernel Manager <kernel-mgr>
QA Contact: Red Hat Kernel QE team <kernel-qe>
CC: jwest, prarit, tao
Doc Type: Bug Fix
Last Closed: 2012-06-14 19:54:04 UTC

Description Jon Thomas 2010-06-09 19:48:23 UTC
Analysis notes from Ulrich Obergfell 
----------------------------------------------------------------------------
The system hang of cluster member snosea was caused by an AB <-> BA deadlock
involving the smp_call_function() mechanism. These are the stack traces of
the two deadlocking threads (in vmcore.snosea.system-hang).


crash> set -c 6
  PID: 16798
COMMAND: "java"
 TASK: 10511d587f0  [THREAD_INFO: 10438d24000]
  CPU: 6
STATE: TASK_RUNNING (ACTIVE)

crash> bt -I 0xffffffff8011d0cb -S 0x10438d25ac8
PID: 16798  TASK: 10511d587f0       CPU: 6   COMMAND: "java"
#0 [10438d25ac8] __smp_call_function at ffffffff8011d0cb
#1 [10438d25b40] smp_call_function at ffffffff8011d118
#2 [10438d25b70] flush_tlb_all at ffffffff8011d149
#3 [10438d25b80] remove_vm_area at ffffffff80172dd6
#4 [10438d25ba0] __vunmap at ffffffff80172e2a
#5 [10438d25bc0] destroy_context at ffffffff8011579c
#6 [10438d25bd0] __mmdrop at ffffffff80136710
#7 [10438d25be0] flush_old_exec at ffffffff8018696d
#8 [10438d25c70] load_elf_binary at ffffffff801a652c
#9 [10438d25db0] search_binary_handler at ffffffff80186fa2
#10 [10438d25df0] load_script at ffffffff801a5c21
#11 [10438d25ea0] search_binary_handler at ffffffff80186fa2
#12 [10438d25ee0] compat_do_execve at ffffffff801a3d05
#13 [10438d25f20] sys32_execve at ffffffff801289c7
#14 [10438d25f50] ia32_ptregs_common at ffffffff80126c3d
  RIP: 00000000ffffe410  RSP: 00000000f3ef5cc8  RFLAGS: 00000216
  RAX: 000000000000000b  RBX: 00000000f3ef6ed0  RCX: 00000000088263e0
  RDX: 0000000008051378  RSI: 0000000000000400  RDI: 000000000096cff4
  RBP: 00000000f3ef5cc8   R8: 0000000000000000   R9: 0000000000000000
  R10: 0000000000000000  R11: 0000000000000000  R12: 0000000000000000
  R13: 0000000000000000  R14: 0000000000000000  R15: 0000000000000000
  ORIG_RAX: 000000000000000b  CS: 0023  SS: 002b


The 'java' process on CPU 6 has acquired the 'call_lock' and has sent a
'TLB flush' request to the other 7 CPUs via an inter-processor interrupt.
The 'started' and 'finished' counters in the 'call_data' structure indicate
that six of the seven CPUs were able to process the interrupt. A response
was still outstanding from CPU 4.


crash> px call_lock
call_lock = $2 = {
lock = 0xffffffff = -1,  i.e. locked and one waiter
magic = 0xdead4ead
}

crash> px *call_data
$3 = {
func = 0xffffffff8011ced7 <do_flush_tlb_all>,
info = 0x0,
started = {
  counter = 0x6
},
finished = {
  counter = 0x6
},
wait = 0x1
}
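

For context, the smp_call_function() path in kernels of this vintage works
roughly as sketched below. This is a simplified paraphrase of the 2.6.9-era
x86_64 implementation, not the exact RHEL 4 source; the real code splits the
work between smp_call_function() and __smp_call_function(), as seen in the
stack trace above. It shows why the sender holds 'call_lock' and then spins
on the 'started'/'finished' counters until every other CPU has acknowledged
the inter-processor interrupt.


/*
 * Simplified sketch (paraphrased, not the exact RHEL 4 source) of the
 * smp_call_function() path in 2.6.9-era x86_64 kernels.  'call_lock'
 * serializes cross-CPU calls and 'call_data' describes the request.
 */
static spinlock_t call_lock = SPIN_LOCK_UNLOCKED;

struct call_data_struct {
        void (*func)(void *info);
        void *info;
        atomic_t started;
        atomic_t finished;
        int wait;
};

static struct call_data_struct *call_data;

int smp_call_function(void (*func)(void *), void *info, int nonatomic, int wait)
{
        struct call_data_struct data;
        int cpus = num_online_cpus() - 1;       /* every CPU but the caller */

        if (!cpus)
                return 0;

        data.func = func;
        data.info = info;
        atomic_set(&data.started, 0);
        atomic_set(&data.finished, 0);
        data.wait = wait;

        spin_lock(&call_lock);                  /* held by 'java' on CPU 6 */
        call_data = &data;
        wmb();

        /* Ask all other CPUs to run func() in interrupt context. */
        send_IPI_allbutself(CALL_FUNCTION_VECTOR);

        /* Spin until every other CPU has entered the IPI handler ... */
        while (atomic_read(&data.started) != cpus)
                cpu_relax();

        /* ... and, if requested, until every handler has returned. */
        if (wait)
                while (atomic_read(&data.finished) != cpus)
                        cpu_relax();

        spin_unlock(&call_lock);
        return 0;
}


With this structure, a second caller that enters smp_call_function() with
interrupts disabled spins on 'call_lock' while the current owner spins
waiting for that very CPU to take the IPI, which is exactly the AB <-> BA
cycle seen in the two stack traces.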


CPU 4 was executing the 'sysrq-w' request issued by the 'show_CPUs.sh'
script, which was running to monitor high CPU load situations.


crash> set -c 4
  PID: 17133
COMMAND: "show_CPUs.sh"
 TASK: 1050c0a37f0  [THREAD_INFO: 1039c31a000]
  CPU: 4
STATE: TASK_RUNNING (ACTIVE)

crash> bt -I 0xffffffff8011d108 -S 0x1039c31be80
PID: 17133  TASK: 1050c0a37f0       CPU: 4   COMMAND: "show_CPUs.sh"
#0 [1039c31be80] smp_call_function at ffffffff8011d108
#1 [1039c31beb0] __handle_sysrq at ffffffff8023f89b
#2 [1039c31bef0] write_sysrq_trigger at ffffffff801b4709
#3 [1039c31bf10] vfs_write at ffffffff8017c432
#4 [1039c31bf40] sys_write at ffffffff8017c51a
#5 [1039c31bf80] system_call at ffffffff801102de
  RIP: 00000039c05bc8d2  RSP: 0000007fbfffee00  RFLAGS: 00010203
  RAX: 0000000000000001  RBX: ffffffff801102de  RCX: 00000000006c0077
  RDX: 0000000000000002  RSI: 0000002a95557000  RDI: 0000000000000001
  RBP: 0000000000000002   R8: 0000000000000001   R9: 0000002a955653e0
  R10: 0000000000000053  R11: 0000000000000246  R12: 00000039c07328c0
  R13: 0000002a95557000  R14: 0000000000000002  R15: 0000000000000000
  ORIG_RAX: 0000000000000001  CS: 0033  SS: 002b


There appears to be a bug in sysrq_handle_showcpus(), which uses the
smp_call_function() mechanism even though __handle_sysrq() has already
disabled interrupts on the local CPU via spin_lock_irqsave().


void __handle_sysrq(int key, struct pt_regs *pt_regs, struct tty_struct *tty)
{
      ...
      //
      // A response to the inter-processor interrupt was not possible
      // because interrupts have been blocked by spin_lock_irqsave().
      //
      spin_lock_irqsave(&sysrq_key_table_lock, flags);
      ...
      if (op_p) {
              ...
              // call sysrq_handle_showcpus()
              op_p->handler(key, pt_regs, tty);


static void sysrq_handle_showcpus(int key, struct pt_regs *pt_regs,
                                struct tty_struct *tty) {
      showacpu(NULL);
      //
      // smp_call_function() attempted to acquire the 'call_lock' too.
      // However, that lock was already held by the 'java' process on
      // CPU 6, so the 'sysrq-w' handler was unable to make progress.
      // The 'java' process, in turn, could not make progress either,
      // because CPU 4 did not respond to the inter-processor interrupt.
      //
      smp_call_function(showacpu, NULL, 0, 0);
}
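

For comparison, later upstream kernels avoid this pattern in the sysrq path:
the show-all-CPUs handler first tries an NMI-based backtrace and otherwise
defers the cross-CPU printing to a workqueue, so smp_call_function() is never
invoked with interrupts disabled. The following is a rough paraphrase of that
approach (illustrative only; it is neither the RHEL 4 code nor a proposed
patch, and the newer smp_call_function() takes three arguments):


static void sysrq_showregs_othercpus(struct work_struct *dummy)
{
        /* Runs from a workqueue, i.e. process context with IRQs enabled. */
        smp_call_function(showacpu, NULL, 0);
}

static DECLARE_WORK(sysrq_showallcpus, sysrq_showregs_othercpus);

static void sysrq_handle_showallcpus(int key)
{
        /*
         * Prefer an NMI-driven backtrace of all CPUs; if the architecture
         * cannot do that, show the local CPU here and defer the rest to
         * the workqueue above, so no IPI is sent while interrupts are off.
         */
        if (!trigger_all_cpu_backtrace()) {
                struct pt_regs *regs = get_irq_regs();

                if (regs)
                        show_regs(regs);
                schedule_work(&sysrq_showallcpus);
        }
}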



It appears that cluster member smotie had a dependency on cluster member
snosea via 'syslogd' and GFS. This is the stack trace of 'syslogd' (in
vmcore.smotie.system-hang).


PID: 3265   TASK: 105a6eec030       CPU: 1   COMMAND: "syslogd"
#0 [105a867da98] schedule at ffffffff80314a94
#1 [105a867db70] wait_for_completion at ffffffff80314cd8
#2 [105a867dbf0] glock_wait_internal at ffffffffa019e88c
#3 [105a867dc30] gfs_glock_nq at ffffffffa019f0ca
#4 [105a867dc70] do_write_buf at ffffffffa01b36cf
#5 [105a867dd30] walk_vm at ffffffffa01b2853
#6 [105a867de20] __gfs_write at ffffffffa01b3d17
#7 [105a867de60] do_readv_writev at ffffffff8017c829
#8 [105a867df40] sys_writev at ffffffff8017c9d1
#9 [105a867df80] system_call at ffffffff801102de
  RIP: 0000002a9572ea37  RSP: 0000007fbffff778  RFLAGS: 00000246
  RAX: 0000000000000014  RBX: ffffffff801102de  RCX: ffffffffffffffff
  RDX: 0000000000000006  RSI: 0000007fbfffedd0  RDI: 0000000000000009
  RBP: 0000007fbfffedd0   R8: fefefefefefefeff   R9: 0000000000000031
  R10: 0000000000000035  R11: 0000000000000246  R12: 0000000000000009
  R13: 0000007fbfffedd0  R14: 0000000000000006  R15: 00000000bfffeee0
  ORIG_RAX: 0000000000000014  CS: 0033  SS: 002b