Bug 753729 - system cannot suspend with "stopping tasks timed out - bnx2i_thread/0 remaining"
Summary: system cannot suspend with "stopping tasks timed out - bnx2i_thread/0 remaining"
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: Red Hat Enterprise Linux 5
Classification: Red Hat
Component: kernel
Version: 5.8
Hardware: Unspecified
OS: Linux
Priority: urgent
Severity: high
Target Milestone: beta
Target Release: ---
Assignee: Mike Christie
QA Contact: Storage QE
URL:
Whiteboard:
Depends On:
Blocks: 757620 758797 765724
Reported: 2011-11-14 10:11 UTC by Guangze Bai
Modified: 2015-02-08 21:37 UTC

Fixed In Version: kernel-2.6.18-300.el5
Doc Type: Bug Fix
Doc Text:
Clone Of:
: 765724
Environment:
Last Closed: 2012-02-21 04:01:34 UTC
Target Upstream Version:
Embargoed:


Attachments
bnx2i patch to add explicit PF_NOFREEZE setting for I/O kthreads (1.21 KB, patch)
2011-11-16 19:17 UTC, Eddie Wai


Links
System ID Private Priority Status Summary Last Updated
Red Hat Product Errata RHSA-2012:0150 0 normal SHIPPED_LIVE Moderate: Red Hat Enterprise Linux 5.8 kernel update 2012-02-21 07:35:24 UTC

Description Guangze Bai 2011-11-14 10:11:52 UTC
Description of problem:

ThinkPad T400 cannot suspend on -296.el5 kernel.

# uname -a
Linux localhost.localdomain 2.6.18-296.el5 #1 SMP Thu Nov 3 12:56:56 EDT 2011 x86_64 x86_64 x86_64 GNU/Linux

# echo mem > /sys/power/state
-> The T400 cannot complete the suspend and resumes again.

-> Here are the messages logged during the suspend attempt:
# tailf /var/log/messages
Nov 14 20:02:40 localhost last message repeated 2 times
Nov 14 20:02:40 localhost gconfd (root-4065): Resolved address "xml:readwrite:/root/.gconf" to a writable configuration source at position 0
Nov 14 20:02:40 localhost nm-system-settings: Loaded plugin ifcfg-rh: (c) 2007 - 2008 Red Hat, Inc.  To report bugs please use the NetworkManager mailing list.
Nov 14 20:02:40 localhost nm-system-settings:    ifcfg-rh: parsing /etc/sysconfig/network-scripts/ifcfg-lo ...
Nov 14 20:02:40 localhost nm-system-settings:    ifcfg-rh: parsing /etc/sysconfig/network-scripts/ifcfg-eth0 ...
Nov 14 20:02:40 localhost nm-system-settings:    ifcfg-rh:     read connection 'System eth0'
Nov 14 20:02:40 localhost nm-system-settings:    ifcfg-rh: parsing /etc/sysconfig/network-scripts/ifcfg-wlan0 ...
Nov 14 20:02:40 localhost nm-system-settings:    ifcfg-rh:     error: Missing SSID
Nov 14 20:02:43 localhost pcscd: winscard.c:304:SCardConnect() Reader E-Gate 0 0 Not Found
Nov 14 20:05:39 localhost kernel: Machine check events logged
Nov 14 20:06:58 localhost kernel: Disabling non-boot CPUs ...
Nov 14 20:06:58 localhost kernel: Breaking affinity for irq 9
Nov 14 20:06:58 localhost kernel: Breaking affinity for irq 12
Nov 14 20:06:58 localhost kernel: Breaking affinity for irq 169
Nov 14 20:06:58 localhost kernel: Breaking affinity for irq 193
Nov 14 20:06:58 localhost kernel: Breaking affinity for irq 201
Nov 14 20:06:58 localhost kernel: CPU 1 is now offline
Nov 14 20:06:58 localhost kernel: SMP alternatives: switching to UP code
Nov 14 20:06:58 localhost kernel: CPU 1 offline: Remove Rx thread
Nov 14 20:09:00 localhost restorecond: Read error (Interrupted system call)
Nov 14 20:09:00 localhost kernel: CPU1 is down
Nov 14 20:09:00 localhost kernel: Stopping tasks: ========================================================================================================================================================
Nov 14 20:09:00 localhost kernel:  stopping tasks timed out after 120 seconds (1 tasks remaining):
Nov 14 20:09:00 localhost kernel:   bnx2i_thread/0
Nov 14 20:09:00 localhost kernel: Restarting tasks...<6> Strange, bnx2i_thread/0 not stopped
Nov 14 20:09:00 localhost kernel:  done
Nov 14 20:09:00 localhost kernel: Enabling non-boot CPUs ...
Nov 14 20:09:00 localhost kernel: SMP alternatives: switching to SMP code
Nov 14 20:09:00 localhost kernel: Booting processor 1/2 APIC 0x1
Nov 14 20:09:00 localhost kernel: Initializing CPU#1
Nov 14 20:09:00 localhost kernel: Calibrating delay using timer specific routine.. 4521.94 BogoMIPS (lpj=2260974)
Nov 14 20:09:00 localhost kernel: CPU: L1 I cache: 32K, L1 D cache: 32K
Nov 14 20:09:00 localhost kernel: CPU: L2 cache: 3072K
Nov 14 20:09:00 localhost kernel: CPU: Physical Processor ID: 0
Nov 14 20:09:00 localhost kernel: CPU: Processor Core ID: 1
Nov 14 20:09:00 localhost kernel: Intel(R) Core(TM)2 Duo CPU     P8400  @ 2.26GHz stepping 0a
Nov 14 20:09:00 localhost kernel: CPU 1: Syncing TSC to CPU 0.
Nov 14 20:09:00 localhost kernel: CPU 1: synchronized TSC with CPU 0 (last diff -2048 cycles, maxerr 314 cycles)
Nov 14 20:09:00 localhost kernel: bnx2i: CPU 1 online: Create Rx thread
Nov 14 20:09:00 localhost kernel: CPU1 is up


Indeed, the bnx2i threads are running on the system:
# ps aux | grep bnx
root      2446  0.0  0.0      0     0 ?        S<   20:55   0:00 [bnx2i_thread/0]
root      2447  0.0  0.0      0     0 ?        S<   20:55   0:00 [bnx2i_thread/1]

Version-Release number of selected component (if applicable):
kernel-2.6.18-296.el5

How reproducible:
always

Steps to Reproduce:
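1. Boot kernel-2.6.18-296.el5 on a ThinkPad T400.
2. Run: echo mem > /sys/power/state

Actual results:
The suspend fails with "stopping tasks timed out after 120 seconds (1 tasks remaining)", naming bnx2i_thread/0, and the system resumes.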

Expected results:
system can suspend successfully

Additional info:
I also tested -274.el5; the system suspends without the above issue.

Comment 1 Guangze Bai 2011-11-14 10:17:05 UTC
Tested on -288.el5: the T400 suspends successfully, and no bnx2i_thread is running on the system. So I am marking this as "Regression". I'll bisect and provide more info later.

Comment 5 Michal Schmidt 2011-11-15 08:11:56 UTC
The kernel thread's main loop is in bnx2i_percpu_io_thread(). The thread neither calls try_to_freeze(), nor marks itself unfreezable (PF_NOFREEZE). It needs to do one of these as described in Documentation/power/kernel_threads.txt.
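
To illustrate the rule above, here is a minimal userspace model of the freezer's bookkeeping. It is not kernel code and not the bnx2i source: the struct, the function name, and the flag values (MODEL_PF_NOFREEZE, MODEL_PF_FROZEN) are hypothetical and exist only for this sketch. It shows why a thread that neither opts out nor cooperates ends up in the "tasks remaining" list.

```c
#include <assert.h>
#include <stddef.h>

/* Illustrative flag bits for this model only; NOT the kernel's values. */
#define MODEL_PF_NOFREEZE 0x1u  /* thread opted out of freezing */
#define MODEL_PF_FROZEN   0x2u  /* thread has parked itself */

struct model_task {
    const char *name;
    unsigned int flags;
    int calls_try_to_freeze;  /* nonzero if the main loop polls for freeze requests */
};

/*
 * Model one freezer pass. Returns the number of tasks that are
 * still running when the (modelled) timeout expires.
 */
static size_t model_freeze_tasks(struct model_task *tasks, size_t n)
{
    size_t remaining = 0;
    size_t i;

    assert(tasks != NULL || n == 0);

    for (i = 0; i < n; i++) {
        if (tasks[i].flags & MODEL_PF_NOFREEZE)
            continue;  /* the freezer does not wait for this task */
        if (tasks[i].calls_try_to_freeze) {
            tasks[i].flags |= MODEL_PF_FROZEN;  /* task parks itself */
            continue;
        }
        remaining++;  /* neither option taken: the suspend is blocked */
    }
    return remaining;
}
```

In this model, a task with flags 0 and calls_try_to_freeze 0 (the situation described for bnx2i_thread/0) is counted as remaining; setting MODEL_PF_NOFREEZE on it, or making it cooperate, brings the count back to zero.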

Comment 6 Mike Christie 2011-11-15 18:34:00 UTC
Adding bnx2i maintainer Eddie from Broadcom.

It looks like this could be a problem in fcoe.ko and bnx2fc.ko in RHEL 6 too.

Comment 7 Mike Christie 2011-11-15 20:02:02 UTC
(In reply to comment #6)
> Adding bnx2i maintainer Eddie from broadcom.
> 
> It looks like this could be a problem in fcoe.ko and bnx2fc.ko in rhel 6 too.

I guess this does not apply to RHEL6? kernel_threads.txt is no longer there; I see it has been removed.

But for RHEL5, does qla2xxx have the problem?

Comment 8 Michal Schmidt 2011-11-16 09:29:35 UTC
(In reply to comment #7)
> I guess this does not apply to rhel6? The kernel_threads.txt is not there
> anymore and I see it is removed.

In RHEL6 there is Documentation/power/freezing_of_tasks.txt instead.
There is one significant difference. In RHEL6 kernel threads are non-freezable by default. See commit 83144186 "Freezer: make kernel threads nonfreezable by default".
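
The difference in defaults can be sketched with a small userspace model. This is an illustration under stated assumptions, not kernel code: the enum, the helper names, and the SKETCH_NOFREEZE bit are hypothetical. In the real kernels the opt-out is setting PF_NOFREEZE, and upstream a thread opts back in with set_freezable().

```c
#include <assert.h>
#include <stddef.h>

#define SKETCH_NOFREEZE 0x1u  /* illustrative bit, not the kernel's PF_NOFREEZE value */

enum sketch_default {
    SKETCH_FREEZABLE_BY_DEFAULT,    /* RHEL5-style policy */
    SKETCH_NONFREEZABLE_BY_DEFAULT  /* RHEL6-style policy */
};

/* Initial flags a newly created kernel thread gets under each policy. */
static unsigned int sketch_new_kthread_flags(enum sketch_default policy)
{
    assert(policy == SKETCH_FREEZABLE_BY_DEFAULT ||
           policy == SKETCH_NONFREEZABLE_BY_DEFAULT);
    return policy == SKETCH_NONFREEZABLE_BY_DEFAULT ? SKETCH_NOFREEZE : 0u;
}

/* Explicit opt-out: what a RHEL5-style thread must do to be skipped. */
static unsigned int sketch_opt_out(unsigned int flags)
{
    return flags | SKETCH_NOFREEZE;
}

/* Explicit opt-in: what a RHEL6-style thread must do to be frozen. */
static unsigned int sketch_opt_in(unsigned int flags)
{
    return flags & ~SKETCH_NOFREEZE;
}

/* Nonzero if the freezer would wait for a thread with these flags. */
static int sketch_freezer_waits(unsigned int flags)
{
    return !(flags & SKETCH_NOFREEZE);
}
```

Under the RHEL5-style policy the freezer waits for every new kernel thread unless it opts out, which is why a driver thread written against the newer default can block suspend when backported unchanged.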

> But for rhel5 does qla2xxx have the problem?

Looking at the code... yes, it does.

Comment 9 Eddie Wai 2011-11-16 19:17:54 UTC
Created attachment 534072
bnx2i patch to add explicit PF_NOFREEZE setting for I/O kthreads

It looks like the correct fix for the bnx2i I/O kthread is to set the PF_NOFREEZE flag explicitly.  This aligns the bnx2i I/O kthread behavior in RHEL5.8 with RHEL6/upstream.

The enclosed patch is based on the linux-2.6.18-295.el5 kernel source.  Please review, thanks.

Eddie

Comment 14 Jarod Wilson 2011-12-05 14:49:08 UTC
Patch(es) available in kernel-2.6.18-300.el5
You can download this test kernel (or newer) from http://people.redhat.com/jwilson/el5/
Detailed testing feedback is always welcomed.
If you require guidance regarding testing, please ask the bug assignee.

Comment 16 Gris Ge 2011-12-07 10:34:25 UTC
Tried on a server platform with a bnx2i iSCSI session, but the server only supports suspend to disk.

kernel -300
The server _cannot_ boot up. Its console is down, so the following was typed in manually:
====
begin fw dump (mark 0x3c67a0)
 0x80071b4
mcp intr[0.0]: 0x4:SPAD RPTY => 0x PC 0x800650c
====

Will provide the detailed output once eng-ops fixes the console.

I see no customer need to suspend a server to disk, so if you guys don't want to fix it, we can close this bug.

Comment 18 errata-xmlrpc 2012-02-21 04:01:34 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

http://rhn.redhat.com/errata/RHSA-2012-0150.html

