Bug 753729

Summary: system cannot suspend with "stopping tasks timed out - bnx2i_thread/0 remaining"
Product: Red Hat Enterprise Linux 5
Reporter: Guangze Bai <gbai>
Component: kernel
Assignee: Mike Christie <mchristi>
Status: CLOSED ERRATA
QA Contact: Storage QE <storage-qe>
Severity: high
Priority: urgent
Version: 5.8
CC: bprakash, ccui, coughlan, czhang, eddie.wai, fge, mschmidt, nhorman, syeghiay, yshao
Target Milestone: beta
Keywords: Regression
Hardware: Unspecified
OS: Linux
Fixed In Version: kernel-2.6.18-300.el5
Doc Type: Bug Fix
Clone Of:
: 765724 (view as bug list) Environment:
Last Closed: 2012-02-21 04:01:34 UTC
Bug Blocks: 757620, 758797, 765724
Attachments:
  bnx2i patch to add explicit PF_NOFREEZE setting for I/O kthreads (flags: none)

Description Guangze Bai 2011-11-14 10:11:52 UTC
Description of problem:

ThinkPad T400 cannot suspend on -296.el5 kernel.

# uname -a
Linux localhost.localdomain 2.6.18-296.el5 #1 SMP Thu Nov 3 12:56:56 EDT 2011 x86_64 x86_64 x86_64 GNU/Linux

# echo mem > /sys/power/state
-> The T400 could not complete the suspend and resumed.

-> Here are the messages logged during the suspend attempt:
# tailf /var/log/messages
Nov 14 20:02:40 localhost last message repeated 2 times
Nov 14 20:02:40 localhost gconfd (root-4065): Resolved address "xml:readwrite:/root/.gconf" to a writable configuration source at position 0
Nov 14 20:02:40 localhost nm-system-settings: Loaded plugin ifcfg-rh: (c) 2007 - 2008 Red Hat, Inc.  To report bugs please use the NetworkManager mailing list.
Nov 14 20:02:40 localhost nm-system-settings:    ifcfg-rh: parsing /etc/sysconfig/network-scripts/ifcfg-lo ...
Nov 14 20:02:40 localhost nm-system-settings:    ifcfg-rh: parsing /etc/sysconfig/network-scripts/ifcfg-eth0 ...
Nov 14 20:02:40 localhost nm-system-settings:    ifcfg-rh:     read connection 'System eth0'
Nov 14 20:02:40 localhost nm-system-settings:    ifcfg-rh: parsing /etc/sysconfig/network-scripts/ifcfg-wlan0 ...
Nov 14 20:02:40 localhost nm-system-settings:    ifcfg-rh:     error: Missing SSID
Nov 14 20:02:43 localhost pcscd: winscard.c:304:SCardConnect() Reader E-Gate 0 0 Not Found
Nov 14 20:05:39 localhost kernel: Machine check events logged
Nov 14 20:06:58 localhost kernel: Disabling non-boot CPUs ...
Nov 14 20:06:58 localhost kernel: Breaking affinity for irq 9
Nov 14 20:06:58 localhost kernel: Breaking affinity for irq 12
Nov 14 20:06:58 localhost kernel: Breaking affinity for irq 169
Nov 14 20:06:58 localhost kernel: Breaking affinity for irq 193
Nov 14 20:06:58 localhost kernel: Breaking affinity for irq 201
Nov 14 20:06:58 localhost kernel: CPU 1 is now offline
Nov 14 20:06:58 localhost kernel: SMP alternatives: switching to UP code
Nov 14 20:06:58 localhost kernel: CPU 1 offline: Remove Rx thread
Nov 14 20:09:00 localhost restorecond: Read error (Interrupted system call)
Nov 14 20:09:00 localhost kernel: CPU1 is down
Nov 14 20:09:00 localhost kernel: Stopping tasks: ========================================================================================================================================================
Nov 14 20:09:00 localhost kernel:  stopping tasks timed out after 120 seconds (1 tasks remaining):
Nov 14 20:09:00 localhost kernel:   bnx2i_thread/0
Nov 14 20:09:00 localhost kernel: Restarting tasks...<6> Strange, bnx2i_thread/0 not stopped
Nov 14 20:09:00 localhost kernel:  done
Nov 14 20:09:00 localhost kernel: Enabling non-boot CPUs ...
Nov 14 20:09:00 localhost kernel: SMP alternatives: switching to SMP code
Nov 14 20:09:00 localhost kernel: Booting processor 1/2 APIC 0x1
Nov 14 20:09:00 localhost kernel: Initializing CPU#1
Nov 14 20:09:00 localhost kernel: Calibrating delay using timer specific routine.. 4521.94 BogoMIPS (lpj=2260974)
Nov 14 20:09:00 localhost kernel: CPU: L1 I cache: 32K, L1 D cache: 32K
Nov 14 20:09:00 localhost kernel: CPU: L2 cache: 3072K
Nov 14 20:09:00 localhost kernel: CPU: Physical Processor ID: 0
Nov 14 20:09:00 localhost kernel: CPU: Processor Core ID: 1
Nov 14 20:09:00 localhost kernel: Intel(R) Core(TM)2 Duo CPU     P8400  @ 2.26GHz stepping 0a
Nov 14 20:09:00 localhost kernel: CPU 1: Syncing TSC to CPU 0.
Nov 14 20:09:00 localhost kernel: CPU 1: synchronized TSC with CPU 0 (last diff -2048 cycles, maxerr 314 cycles)
Nov 14 20:09:00 localhost kernel: bnx2i: CPU 1 online: Create Rx thread
Nov 14 20:09:00 localhost kernel: CPU1 is up
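
The freezer timeout line in the log above can be picked out mechanically when comparing kernels. The following is a minimal sketch in C, assuming the message wording matches the line logged here; the function name parse_freezer_timeout is hypothetical and not part of any existing tool.

```c
#include <stdio.h>
#include <string.h>

/*
 * Hypothetical helper: checks one syslog line for the freezer timeout
 * message seen in this report. Returns 1 and fills *seconds and
 * *remaining on a match, 0 otherwise. The wording is taken from the
 * log in this report; other kernels may phrase the message differently.
 */
int parse_freezer_timeout(const char *line, int *seconds, int *remaining)
{
    static const char marker[] = "stopping tasks timed out after";
    const char *p = strstr(line, marker);

    if (p == NULL)
        return 0;

    /* Both numeric fields must convert for the line to count as a match. */
    return sscanf(p,
                  "stopping tasks timed out after %d seconds (%d tasks remaining)",
                  seconds, remaining) == 2;
}
```

Feeding it each line of /var/log/messages would flag only the timeout line; the name of the stuck task (bnx2i_thread/0 here) is on the following log line and is not parsed by this sketch.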


Indeed, the bnx2i threads are running on the system:
# ps aux | grep bnx
root      2446  0.0  0.0      0     0 ?        S<   20:55   0:00 [bnx2i_thread/0]
root      2447  0.0  0.0      0     0 ?        S<   20:55   0:00 [bnx2i_thread/1]

Version-Release number of selected component (if applicable):
kernel-2.6.18-296.el5

How reproducible:
always

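Steps to Reproduce:
1. Boot kernel-2.6.18-296.el5 on a ThinkPad T400.
2. Run: echo mem > /sys/power/state
  
Actual results:
The suspend fails with "stopping tasks timed out after 120 seconds" (bnx2i_thread/0 remaining) and the system resumes.

Expected results:
The system suspends successfully.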

Additional info:
I also tested -274.el5; the system suspends there without the above issue.

Comment 1 Guangze Bai 2011-11-14 10:17:05 UTC
Tested on -288.el5: the T400 suspends successfully, and no bnx2i_thread is running on the system. So marking "Regression". I'll bisect and provide more info later.

Comment 5 Michal Schmidt 2011-11-15 08:11:56 UTC
The kernel thread's main loop is in bnx2i_percpu_io_thread(). The thread neither calls try_to_freeze(), nor marks itself unfreezable (PF_NOFREEZE). It needs to do one of these as described in Documentation/power/kernel_threads.txt.
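
To illustrate the rule above, here is a minimal userspace model in C of how the freezer ends up waiting on such a thread. This is a sketch under stated assumptions, not kernel code: the struct, the flag names (MODEL_NOFREEZE, MODEL_COOPERATES) and the function model_tasks_remaining are all hypothetical stand-ins for PF_NOFREEZE, a loop that calls try_to_freeze(), and the freezer's pass over the task list.

```c
#include <stddef.h>

/* Illustrative flags only; these are not the kernel's definitions. */
enum model_flags {
    MODEL_NOFREEZE   = 1 << 0, /* stands in for PF_NOFREEZE */
    MODEL_COOPERATES = 1 << 1  /* main loop calls try_to_freeze() */
};

struct model_task {
    const char *name;
    unsigned int flags;
};

/*
 * Returns how many tasks the modelled freezer would still be waiting
 * for when its timeout expires. If first_stuck is non-NULL, it receives
 * the name of the first such task, or NULL when every task is handled.
 */
size_t model_tasks_remaining(const struct model_task *tasks, size_t n,
                             const char **first_stuck)
{
    size_t remaining = 0;
    size_t i;

    if (first_stuck != NULL)
        *first_stuck = NULL;

    for (i = 0; i < n; i++) {
        if (tasks[i].flags & MODEL_NOFREEZE)
            continue; /* freezer skips this task entirely */
        if (tasks[i].flags & MODEL_COOPERATES)
            continue; /* task parks itself when asked to freeze */
        if (remaining == 0 && first_stuck != NULL)
            *first_stuck = tasks[i].name;
        remaining++;  /* does neither: the freezer waits until timeout */
    }
    return remaining;
}
```

In this model a task with neither flag, like bnx2i_thread/0 here, is the one left over after the timeout; setting either flag on it brings the count to zero.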

Comment 6 Mike Christie 2011-11-15 18:34:00 UTC
Adding bnx2i maintainer Eddie from Broadcom.

It looks like this could be a problem in fcoe.ko and bnx2fc.ko in RHEL 6 too.

Comment 7 Mike Christie 2011-11-15 20:02:02 UTC
(In reply to comment #6)
> Adding bnx2i maintainer Eddie from broadcom.
> 
> It looks like this could be a problem in fcoe.ko and bnx2fc.ko in rhel 6 too.

I guess this does not apply to RHEL 6? kernel_threads.txt is no longer there; I see it was removed.

But for RHEL 5, does qla2xxx have the problem?

Comment 8 Michal Schmidt 2011-11-16 09:29:35 UTC
(In reply to comment #7)
> I guess this does not apply to rhel6? The kernel_threads.txt is not there
> anymore and I see it is removed.

In RHEL6 there is Documentation/power/freezing_of_tasks.txt instead.
There is one significant difference. In RHEL6 kernel threads are non-freezable by default. See commit 83144186 "Freezer: make kernel threads nonfreezable by default".

> But for rhel5 does qla2xxx have the problem?

Looking at the code... yes, it does.

Comment 9 Eddie Wai 2011-11-16 19:17:54 UTC
Created attachment 534072
bnx2i patch to add explicit PF_NOFREEZE setting for I/O kthreads

It looks like the correct fix for the bnx2i I/O kthread is to set the PF_NOFREEZE flag explicitly. This aligns the bnx2i I/O kthread behavior in RHEL 5.8 with RHEL 6 and upstream.

The attached patch is based on the linux-2.6.18-295.el5 kernel source. Please review, thanks.

Eddie

Comment 14 Jarod Wilson 2011-12-05 14:49:08 UTC
Patch(es) available in kernel-2.6.18-300.el5
You can download this test kernel (or newer) from http://people.redhat.com/jwilson/el5/
Detailed testing feedback is always welcomed.
If you require guidance regarding testing, please ask the bug assignee.

Comment 16 Gris Ge 2011-12-07 10:34:25 UTC
Tried on a server platform with a bnx2i iSCSI session, but the server only supports suspend to disk.

kernel -300
The server _cannot_ boot up. Its console is down, so the following was typed in manually:
====
begin fw dump (mark 0x3c67a0)
 0x80071b4
mcp intr[0.0]: 0x4:SPAD RPTY => 0x PC 0x800650c
====

Will provide the detailed output once eng-ops fixes the console.

I see no customer need to suspend a server to disk, so if you guys don't want to fix it, we can close this bug.

Comment 18 errata-xmlrpc 2012-02-21 04:01:34 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

http://rhn.redhat.com/errata/RHSA-2012-0150.html