Bug 855840

Summary: kernel may soft-lockup while stopping some of the CPUs
Product: Red Hat Enterprise Linux 6 Reporter: Roman Kagan <rvkagan>
Component: kernel Assignee: Red Hat Kernel Manager <kernel-mgr>
Status: CLOSED DUPLICATE QA Contact: Red Hat Kernel QE team <kernel-qe>
Severity: medium Docs Contact:
Priority: unspecified    
Version: 6.2 CC: imammedo
Target Milestone: rc   
Target Release: ---   
Hardware: x86_64   
OS: Linux   
Whiteboard:
Fixed In Version: Doc Type: Bug Fix
Doc Text:
Story Points: ---
Clone Of: Environment:
Last Closed: 2012-09-10 12:41:57 UTC Type: Bug
Regression: --- Mount Type: ---
Documentation: --- CRM:
Verified Versions: Category: ---
oVirt Team: --- RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: --- Target Upstream Version:
Embargoed:
Attachments:
Description Flags
serial console log none

Description Roman Kagan 2012-09-10 11:23:24 UTC
Created attachment 611402
serial console log

Description of problem:

During system shutdown, the kernel gets stuck after printing

ACPI: Preparing to enter system sleep state S5
Disabling non-boot CPUs ...

and then, roughly once a minute and indefinitely, reports a soft lockup in one of the migration (aka cpu_stopper) threads in stop_machine_cpu_stop.


Version-Release number of selected component (if applicable):
Detected on 2.6.32-220.23.1.el6.x86_64; appears to be relevant to the whole RHEL6 series.


How reproducible:
under one percent


Steps to Reproduce:
1. reboot or shut down the system
  
Actual results:
system is stuck

Expected results:
system proceeds to reboot/halt


Additional info:
The problem was detected while rebooting several RHEL6.2 virtual machines in a loop on a test version of Parallels Cloud Server.

The issue has been tracked down to a situation where, on one of the CPUs, the realtime runqueue exhausted its quota and was throttled while no tasks remained on the regular runqueue. As a result, the cpu_stopper thread, which sits on the rt runqueue, was never scheduled on that CPU, and no regular task was available to run and unthrottle the rt runqueue.
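
The failure mode described above can be illustrated with a toy model. This is only a sketch, not kernel code: `RunQueue`, `pick_next`, and the two-class ordering are hypothetical simplifications, and the model does not cover how the rt runqueue gets unthrottled. It only shows that a stopper thread queued on a throttled rt runqueue is never picked when no regular task exists.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class RunQueue:
    """Toy per-CPU runqueue; all names are illustrative, not kernel APIs."""
    rt_tasks: List[str] = field(default_factory=list)
    fair_tasks: List[str] = field(default_factory=list)
    rt_throttled: bool = False  # set once the rt class has used up its quota


def pick_next(rq: RunQueue) -> Optional[str]:
    """Pick the next task: rt first unless throttled, then regular tasks."""
    if rq.rt_tasks and not rq.rt_throttled:
        return rq.rt_tasks[0]
    if rq.fair_tasks:
        return rq.fair_tasks[0]
    return None  # nothing eligible to run on this CPU


# The stopper thread is on the throttled rt runqueue and no regular task exists.
stuck = RunQueue(rt_tasks=["migration/1"], rt_throttled=True)
print(pick_next(stuck))  # None
```

In this model the other CPUs would wait in stop_machine_cpu_stop for a thread that is never picked, which matches the soft-lockup reports in the console log.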

The issue was addressed by the mainline Linux commit

commit 34f971f6f7988be4d014eec3e3526bee6d007ffa
Author: Peter Zijlstra <a.p.zijlstra>
Date:   Wed Sep 22 13:53:15 2010 +0200

    sched: Create special class for stop/migrate work
    
    In order to separate the stop/migrate work thread from the SCHED_FIFO
    implementation, create a special class for it that is of higher priority than
    SCHED_FIFO itself.
    
    This currently solves a problem where cpu-hotplug consumes so much cpu-time
    that the SCHED_FIFO class gets throttled, but has the bandwidth replenishment
    timer pending on the now dead cpu.
    
    It is also required for when we add the planned deadline scheduling class above
    SCHED_FIFO, as the stop/migrate thread still needs to transcent those tasks.
    
    Tested-by: Heiko Carstens <heiko.carstens.com>
    Signed-off-by: Peter Zijlstra <a.p.zijlstra>
    LKML-Reference: <1285165776.2275.1022.camel@laptop>
    Signed-off-by: Ingo Molnar <mingo>

which appeared in v2.6.37-rc1.
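
The effect of that commit can be sketched with a toy class-ordered picker. This is an illustrative model under stated assumptions, not the kernel implementation: `CLASS_ORDER`, `pick_next`, and the task names are hypothetical. The only property taken from the commit message is that the stop/migrate work gets its own class with higher priority than SCHED_FIFO, so rt throttling no longer applies to it.

```python
from typing import Dict, List, Optional

# Hypothetical class order after the commit: a dedicated stop class above rt.
CLASS_ORDER = ["stop", "rt", "fair"]


def pick_next(queues: Dict[str, List[str]], rt_throttled: bool) -> Optional[str]:
    """Walk the classes in priority order and return the first eligible task."""
    for cls in CLASS_ORDER:
        if cls == "rt" and rt_throttled:
            continue  # throttling affects only the rt class
        tasks = queues.get(cls, [])
        if tasks:
            return tasks[0]
    return None


# The stopper thread now lives in its own class, so a throttled rt class
# and an empty regular runqueue no longer keep it from being picked.
queues = {"stop": ["migration/1"], "rt": ["rt-task"], "fair": []}
print(pick_next(queues, rt_throttled=True))  # migration/1
```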

Comment 2 Igor Mammedov 2012-09-10 12:41:57 UTC
Fix is targeted for RHEL6.4

*** This bug has been marked as a duplicate of bug 843541 ***