Bug 675294
| Summary: | [RHEL6.1] s/390x hang while running LTP test | | |
|---|---|---|---|
| Product: | Red Hat Enterprise Linux 6 | Reporter: | Jeff Burke <jburke> |
| Component: | kernel | Assignee: | Larry Woodman <lwoodman> |
| Status: | CLOSED ERRATA | QA Contact: | Mike Gahagan <mgahagan> |
| Severity: | high | Docs Contact: | |
| Priority: | high | | |
| Version: | 6.1 | CC: | arozansk, jstancek, mzywusko, pbunyan |
| Target Milestone: | rc | | |
| Target Release: | --- | | |
| Hardware: | s390x | | |
| OS: | Linux | | |
| Whiteboard: | | | |
| Fixed In Version: | kernel-2.6.32-117.el6 | Doc Type: | Bug Fix |
| Doc Text: | | Story Points: | --- |
| Clone Of: | | Environment: | |
| Last Closed: | 2011-05-23 20:39:14 UTC | Type: | --- |
| Regression: | --- | Mount Type: | --- |
| Documentation: | --- | CRM: | |
| Verified Versions: | | Category: | --- |
| oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: | |
| Cloudforms Team: | --- | Target Upstream Version: | |
| Embargoed: | | | |
I have narrowed the culprit down to the upstream backport commit below. I have reproduced the hang with two separate upstream kernels as well as RHEL6.1 2.6.32-96 and beyond.
------------------------------------------------------------------------
Subject: sched: Change nohz idle load balancing logic to push model
From: Larry Woodman <lwoodman>
Author: Venkatesh Pallipadi <venki>
Date: Fri May 21 17:09:41 2010 -0700
sched: Change nohz idle load balancing logic to push model
mainline commit 83cd4fe27ad8446619b2e030b171b858501de87d
In the new push model, all idle CPUs indeed go into nohz mode. There is
still the concept of idle load balancer (performing the load balancing
on behalf of all the idle cpu's in the system). Busy CPU kicks the nohz
balancer when any of the nohz CPUs need idle load balancing.
The kickee CPU does the idle load balancing on behalf of all idle CPUs
instead of the normal idle balance.
This addresses the below two problems with the current nohz ilb logic:
* the idle load balancer continued to have periodic ticks during idle and
wokeup frequently, even though it did not have any rebalancing to do on
behalf of any of the idle CPUs.
* On x86 and CPUs that have APIC timer stoppage on idle CPUs, this
periodic wakeup can result in a periodic additional interrupt on a CPU
doing the timer broadcast.
Also currently we are migrating the unpinned timers from an idle to the cpu
doing idle load balancing (when all the cpus in the system are idle,
there is no idle load balancing cpu and timers get added to the same idle cpu
where the request was made. So the existing optimization works only on semi idle
system).
And In semi idle system, we no longer have periodic ticks on the idle load
balancer CPU. Using that cpu will add more delays to the timers than intended
(as that cpu's timer base may not be uptodate wrt jiffies etc). This was
causing mysterious slowdowns during boot etc.
---------------------------------------------------------------------
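For context, the kick path introduced by that commit looks roughly like the sketch below. This is a paraphrase rather than the RHEL6 source: the idle-CPU selection helpers and fields are approximations, but the final call matches what the traces in this bug show, namely __smp_call_function_single() on the current CPU's per-cpu remote_sched_softirq_cb with wait == 0.

```c
/*
 * Simplified sketch of the push-model kick path (paraphrased from the
 * mainline commit above; helper and field names are approximate, not
 * the exact RHEL6 code).  A busy CPU picks an idle CPU to act as the
 * nohz balancer and pokes it with a pre-allocated per-cpu
 * call_single_data.
 */
static void nohz_balancer_kick(int cpu)
{
	struct call_single_data *cp;
	int ilb_cpu;

	/* Pick an idle CPU to do the balancing on behalf of all idle CPUs. */
	ilb_cpu = cpumask_first(nohz.idle_cpus_mask);
	if (ilb_cpu >= nr_cpu_ids)
		return;

	/*
	 * Kick it: send an IPI without waiting (wait == 0), reusing this
	 * CPU's per-cpu remote_sched_softirq_cb as the call_single_data.
	 * This is the call that ends up in csd_lock()/csd_lock_wait().
	 */
	cp = &per_cpu(remote_sched_softirq_cb, cpu);
	__smp_call_function_single(ilb_cpu, cp, 0);
}
```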
For some reason the problem occurs only on s390x: nohz_balancer_kick() calls __smp_call_function_single(), which calls csd_lock(), which calls csd_lock_wait().
For a reason that is still unknown, the system spins in this loop, yet every dump shows that data->flags is zero.
/*
 * csd_lock/csd_unlock used to serialize access to per-cpu csd resources
 *
 * For non-synchronous ipi calls the csd can still be in use by the
 * previous function call. For multi-cpu calls its even more interesting
 * as we'll have to ensure no other cpu is observing our csd.
 */
static void csd_lock_wait(struct call_single_data *data)
{
	while (data->flags & CSD_FLAG_LOCK)
		cpu_relax();
}
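For reference, the csd_lock()/csd_unlock() pair around that wait looked roughly like this in 2.6.32-era kernel/smp.c (quoted from memory, so treat it as approximate). The spin in csd_lock_wait() can only end when a previous user of the same call_single_data clears CSD_FLAG_LOCK in csd_unlock(), which normally happens after the IPI handler on the target CPU has run the function.

```c
/* Approximate 2.6.32-era helpers, shown for context only. */
static void csd_lock(struct call_single_data *data)
{
	/* Wait until any previous user of this csd has unlocked it. */
	csd_lock_wait(data);
	data->flags = CSD_FLAG_LOCK;

	/*
	 * Prevent the CPU from reordering the assignment to ->flags with
	 * any subsequent stores to other fields of the call_single_data.
	 */
	smp_mb();
}

static void csd_unlock(struct call_single_data *data)
{
	WARN_ON(!(data->flags & CSD_FLAG_LOCK));

	/* Ensure all updates are visible before releasing the csd. */
	smp_mb();

	data->flags &= ~CSD_FLAG_LOCK;
}
```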
Larry
This upstream patch was missing from RHEL6.1:
commit 27c379f7f89a4d558c685b5d89b5ba2fe79ae701
Author: Heiko Carstens <heiko.carstens.com>
Date: Fri Sep 10 13:47:29 2010 +0200
generic-ipi: Fix deadlock in __smp_call_function_single
Just got my 6 way machine to a state where cpu 0 is in an
endless loop within __smp_call_function_single.
All other cpus are idle.
The call trace on cpu 0 looks like this:
__smp_call_function_single
scheduler_tick
update_process_times
tick_sched_timer
__run_hrtimer
hrtimer_interrupt
clock_comparator_work
do_extint
ext_int_handler
----> timer irq
cpu_idle
__smp_call_function_single() got called from nohz_balancer_kick()
(inlined) with the remote cpu being 1, wait being 0 and the per
cpu variable remote_sched_softirq_cb (call_single_data) of the
current cpu (0).
Then it loops forever when it tries to grab the lock of the
call_single_data, since it is already locked and enqueued on cpu 0.
My theory how this could have happened: for some reason the
scheduler decided to call __smp_call_function_single() on it's own
cpu, and sends an IPI to itself. The interrupt stays pending
since IRQs are disabled. If then the hypervisor schedules the
cpu away it might happen that upon rescheduling both the IPI and
the timer IRQ are pending. If then interrupts are enabled again
it depends which one gets scheduled first.
If the timer interrupt gets delivered first we end up with the
local deadlock as seen in the calltrace above.
Let's make __smp_call_function_single() check if the target cpu is
the current cpu and execute the function immediately just like
smp_call_function_single does. That should prevent at least the
scenario described here.
It might also be that the scheduler is not supposed to call
__smp_call_function_single with the remote cpu being the current
cpu, but that is a different issue.
Signed-off-by: Heiko Carstens <heiko.carstens.com>
Acked-by: Peter Zijlstra <a.p.zijlstra>
Acked-by: Jens Axboe <jaxboe>
Cc: Venkatesh Pallipadi <venki>
Cc: Suresh Siddha <suresh.b.siddha>
LKML-Reference: <20100910114729.GB2827.de.ibm.com>
Signed-off-by: Ingo Molnar <mingo>
----------------------------------------------------------------------------------------------------------------
Fixes BZ675294
rhel6-ipi_deadlock.patch
diff --git a/kernel/smp.c b/kernel/smp.c
index 75c970c..ed6aacf 100644
--- a/kernel/smp.c
+++ b/kernel/smp.c
@@ -365,9 +365,10 @@ call:
 EXPORT_SYMBOL_GPL(smp_call_function_any);
 
 /**
- * __smp_call_function_single(): Run a function on another CPU
+ * __smp_call_function_single(): Run a function on a specific CPU
  * @cpu: The CPU to run on.
  * @data: Pre-allocated and setup data structure
+ * @wait: If true, wait until function has completed on specified CPU.
  *
  * Like smp_call_function_single(), but allow caller to pass in a
  * pre-allocated data structure. Useful for embedding @data inside
@@ -376,8 +377,10 @@ EXPORT_SYMBOL_GPL(smp_call_function_any);
 void __smp_call_function_single(int cpu, struct call_single_data *data,
 				int wait)
 {
-	csd_lock(data);
+	unsigned int this_cpu;
+	unsigned long flags;
 
+	this_cpu = get_cpu();
 	/*
 	 * Can deadlock when called with interrupts disabled.
 	 * We allow cpu's that are not yet online though, as no one else can
@@ -387,7 +390,15 @@ void __smp_call_function_single(int cpu, struct call_single_data *data,
 	WARN_ON_ONCE(cpu_online(smp_processor_id()) && wait && irqs_disabled()
 		     && !oops_in_progress);
 
-	generic_exec_single(cpu, data, wait);
+	if (cpu == this_cpu) {
+		local_irq_save(flags);
+		data->func(data->info);
+		local_irq_restore(flags);
+	} else {
+		csd_lock(data);
+		generic_exec_single(cpu, data, wait);
+	}
+	put_cpu();
 }
 
 /**
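For comparison, the "execute it locally" behaviour that Heiko refers to was already present in smp_call_function_single() at the time; roughly (2.6.32-era code, quoted from memory and trimmed, so approximate):

```c
/* Approximate, trimmed excerpt of smp_call_function_single() for comparison. */
int smp_call_function_single(int cpu, void (*func)(void *info), void *info,
			     int wait)
{
	unsigned long flags;
	int this_cpu;
	int err = 0;

	/* Prevent preemption and reschedule onto another processor. */
	this_cpu = get_cpu();

	/* ... deadlock/online sanity checks elided ... */

	if (cpu == this_cpu) {
		/* Target is the current CPU: just run the function with IRQs off. */
		local_irq_save(flags);
		func(info);
		local_irq_restore(flags);
	} else {
		/*
		 * ... remote path elided: lock a call_single_data and hand it
		 * to generic_exec_single(), which raises the IPI ...
		 */
	}

	put_cpu();
	return err;
}
```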
This request was evaluated by Red Hat Product Management for inclusion in a Red Hat Enterprise Linux maintenance release. Product Management has requested further review of this request by Red Hat Engineering, for potential inclusion in a Red Hat Enterprise Linux Update release for currently deployed products. This request is not yet committed for inclusion in an Update release.

Patch(es) available on kernel-2.6.32-117.el6

Confirmed LTP has run to completion with both the -118 and -122 kernels, so this one can be verified.

An advisory has been issued which should help the problem described in this bug report. This report is therefore being closed with a resolution of ERRATA. For more information on the solution and/or where to find the updated files, please follow the link below. You may reopen this bug report if the solution does not work for you.

http://rhn.redhat.com/errata/RHSA-2011-0542.html
Description of problem:
While running kernel testing, the LTP test causes the system to hang.

Version-Release number of selected component (if applicable):
2.6.32-96.el6

How reproducible:
99% of the time

Steps to Reproduce:
1. Install RHEL6 GA s/390x
2. Install kernel 2.6.32-96.el6 or greater
3. Run the upstream LTP testsuite

Actual results:
[-- MARK -- Wed Feb 2 17:45:00 2011]
logger: 2011-02-02 17:46:36 /usr/bin/rhts-test-runner.sh 5992 1620 hearbeat...
logger: 2011-02-02 17:47:36 /usr/bin/rhts-test-runner.sh 5992 1680 hearbeat...
logger: 2011-02-02 17:48:36 /usr/bin/rhts-test-runner.sh 5992 1740 hearbeat...
00: HCPGSP2627I The virtual machine is placed in CP mode due to a SIGP initial CPU reset from CPU 01.
cpu: Processor 1 started, address 0, identification 32C5C2
logger: 2011-02-02 17:49:35 /usr/bin/rhts-test-runner.sh 5992 1800 hearbeat...
[-- MARK -- Wed Feb 2 17:50:00 2011]
logger: 2011-02-02 17:50:35 /usr/bin/rhts-test-runner.sh 5992 1860 hearbeat...
<000003c000985d3c> ext4_dirty_inode+0x38/0x74 ext4
<000000000027a40e> __mark_inode_dirty+0x46/0x198
<0000000000269ad0> touch_atime+0x138/0x170
<00000000001f022c> generic_file_aio_read+0x418/0x7ac
<000000000024f354> do_sync_read+0xf0/0x154
<0000000000250348> vfs_read+0xa0/0x1a0
<000000000025054a> SyS_read+0x5a/0xac
<000000000011860c> sysc_tracego+0xe/0x14
<000002000012fe90> 0x2000012fe90
[-- MARK -- Wed Feb 2 17:55:00 2011]
INFO: task plymouthd:175 blocked for more than 120 seconds.
"echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
plymouthd D 000003c0008636b4 0 175 1 0x00000000
00000000000005ff 0000000000000600 00000000000000a8 0000000000000000
0000000000ff4e00 0000000000fe4e00 0000000000000600 00000000007b3cf8
0000000000000000 0000000000000000 000000000236e140 000000000070ee98
00000000007a5e00 000000000236e5d8 000000001f992040 0000000000ff4e00
00000000004c4c78 00000000004bb1be 00000000023bf818 00000000023bf9d0
Call Trace:
(<00000000004bb1be> schedule+0x5aa/0xf84)
<000003c0008636b4> start_this_handle+0x308/0x5e0 jbd2
<000003c000863ba4> jbd2_journal_start+0xd8/0x118 jbd2
<000003c000985d3c> ext4_dirty_inode+0x38/0x74 ext4
<000000000027a40e> __mark_inode_dirty+0x46/0x198
<000000000026993c> file_update_time+0x110/0x16c
<00000000001ef812> __generic_file_aio_write+0x256/0x448
<00000000001efa72> generic_file_aio_write+0x6e/0xf4
<000003c000980286> ext4_file_write+0x7e/0x21c ext4
<000000000024f200> do_sync_write+0xf0/0x154
<0000000000250054> vfs_write+0xa0/0x1a0
<0000000000250256> SyS_write+0x5a/0xac
<00000000001184d4> sysc_noemu+0x10/0x16
<0000020000206f20> 0x20000206f20
INFO: task jbd2/dm-0-8:469 blocked for more than 120 seconds.
"echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
jbd2/dm-0-8 D 000003c000864010 0 469 2 0x00000000
00000000000005ff 0000000000000600 00000000000000a8 0000000000000000
0000000000ff4e00 0000000000fe4e00 0000000000000600 00000000007b3cf8
0000000000000000 000000001cbdeb90 0000000000000000 000000000070ee98
00000000007a5e00 000000001cbdf028 000000001f992040 0000000000ff4e00
00000000004c4c78 00000000004bb1be 000000000240fab0 000000000240fc68
Call Trace:
(<00000000004bb1be> schedule+0x5aa/0xf84)
<000003c000864010> jbd2_journal_commit_transaction+0x1c8/0x1a94 jbd2
<000003c00086c47e> kjournald2+0xde/0x2c0 jbd2
<000000000016cbac> kthread+0xa4/0xac
<0000000000109dea> kernel_thread_starter+0x6/0xc
<0000000000109de4> kernel_thread_starter+0x0/0xc

Expected results:

Additional info: