Description of problem:
ThinkPad T400 cannot suspend on the -296.el5 kernel.

# uname -a
Linux localhost.localdomain 2.6.18-296.el5 #1 SMP Thu Nov 3 12:56:56 EDT 2011 x86_64 x86_64 x86_64 GNU/Linux

# echo mem > /sys/power/state
-> The T400 cannot complete the suspend and comes back up again.
-> Here are the messages logged during the suspend attempt:

# tailf /var/log/messages
Nov 14 20:02:40 localhost last message repeated 2 times
Nov 14 20:02:40 localhost gconfd (root-4065): Resolved address "xml:readwrite:/root/.gconf" to a writable configuration source at position 0
Nov 14 20:02:40 localhost nm-system-settings: Loaded plugin ifcfg-rh: (c) 2007 - 2008 Red Hat, Inc. To report bugs please use the NetworkManager mailing list.
Nov 14 20:02:40 localhost nm-system-settings: ifcfg-rh: parsing /etc/sysconfig/network-scripts/ifcfg-lo ...
Nov 14 20:02:40 localhost nm-system-settings: ifcfg-rh: parsing /etc/sysconfig/network-scripts/ifcfg-eth0 ...
Nov 14 20:02:40 localhost nm-system-settings: ifcfg-rh: read connection 'System eth0'
Nov 14 20:02:40 localhost nm-system-settings: ifcfg-rh: parsing /etc/sysconfig/network-scripts/ifcfg-wlan0 ...
Nov 14 20:02:40 localhost nm-system-settings: ifcfg-rh: error: Missing SSID
Nov 14 20:02:43 localhost pcscd: winscard.c:304:SCardConnect() Reader E-Gate 0 0 Not Found
Nov 14 20:05:39 localhost kernel: Machine check events logged
Nov 14 20:06:58 localhost kernel: Disabling non-boot CPUs ...
Nov 14 20:06:58 localhost kernel: Breaking affinity for irq 9
Nov 14 20:06:58 localhost kernel: Breaking affinity for irq 12
Nov 14 20:06:58 localhost kernel: Breaking affinity for irq 169
Nov 14 20:06:58 localhost kernel: Breaking affinity for irq 193
Nov 14 20:06:58 localhost kernel: Breaking affinity for irq 201
Nov 14 20:06:58 localhost kernel: CPU 1 is now offline
Nov 14 20:06:58 localhost kernel: SMP alternatives: switching to UP code
Nov 14 20:06:58 localhost kernel: CPU 1 offline: Remove Rx thread
Nov 14 20:09:00 localhost restorecond: Read error (Interrupted system call)
Nov 14 20:09:00 localhost kernel: CPU1 is down
Nov 14 20:09:00 localhost kernel: Stopping tasks: ========================================================================================================================================================
Nov 14 20:09:00 localhost kernel: stopping tasks timed out after 120 seconds (1 tasks remaining):
Nov 14 20:09:00 localhost kernel: bnx2i_thread/0
Nov 14 20:09:00 localhost kernel: Restarting tasks...<6> Strange, bnx2i_thread/0 not stopped
Nov 14 20:09:00 localhost kernel: done
Nov 14 20:09:00 localhost kernel: Enabling non-boot CPUs ...
Nov 14 20:09:00 localhost kernel: SMP alternatives: switching to SMP code
Nov 14 20:09:00 localhost kernel: Booting processor 1/2 APIC 0x1
Nov 14 20:09:00 localhost kernel: Initializing CPU#1
Nov 14 20:09:00 localhost kernel: Calibrating delay using timer specific routine.. 4521.94 BogoMIPS (lpj=2260974)
Nov 14 20:09:00 localhost kernel: CPU: L1 I cache: 32K, L1 D cache: 32K
Nov 14 20:09:00 localhost kernel: CPU: L2 cache: 3072K
Nov 14 20:09:00 localhost kernel: CPU: Physical Processor ID: 0
Nov 14 20:09:00 localhost kernel: CPU: Processor Core ID: 1
Nov 14 20:09:00 localhost kernel: Intel(R) Core(TM)2 Duo CPU P8400 @ 2.26GHz stepping 0a
Nov 14 20:09:00 localhost kernel: CPU 1: Syncing TSC to CPU 0.
Nov 14 20:09:00 localhost kernel: CPU 1: synchronized TSC with CPU 0 (last diff -2048 cycles, maxerr 314 cycles)
Nov 14 20:09:00 localhost kernel: bnx2i: CPU 1 online: Create Rx thread
Nov 14 20:09:00 localhost kernel: CPU1 is up

Indeed, the bnx2i threads are running on the system:

# ps aux | grep bnx
root 2446 0.0 0.0 0 0 ? S< 20:55 0:00 [bnx2i_thread/0]
root 2447 0.0 0.0 0 0 ? S< 20:55 0:00 [bnx2i_thread/1]

Version-Release number of selected component (if applicable):
kernel-2.6.18-296.el5

How reproducible:
always

Steps to Reproduce:
1.
2.
3.

Actual results:

Expected results:
system can suspend successfully

Additional info:
I also tested on -274.el5; the system can suspend without the above issue.
Tested on -288.el5: the T400 suspends successfully, and no bnx2i_thread is present on the system. So marking this as "Regression". I'll bisect and provide more info later.
The kernel thread's main loop is in bnx2i_percpu_io_thread(). The thread neither calls try_to_freeze(), nor marks itself unfreezable (PF_NOFREEZE). It needs to do one of these as described in Documentation/power/kernel_threads.txt.
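The two options can be illustrated with a small user-space model in C. This is only a sketch of the handshake, not kernel code: every mock_* name, the flag values, and the round-based "timeout" are assumptions made for illustration. It loosely follows the behaviour described above, where the freezer asks each task to freeze and gives up if a task never parks itself.

```c
#include <assert.h>
#include <stddef.h>

/* Mock task flags. Names echo the kernel's PF_* flags; values are made up. */
#define MOCK_PF_NOFREEZE 0x1u /* task asked never to be frozen */
#define MOCK_PF_FREEZE   0x2u /* freezer has asked this task to freeze */
#define MOCK_PF_FROZEN   0x4u /* task has parked itself */

/* The three ways a thread's main loop can relate to the freezer. */
enum thread_style { IGNORES_FREEZER, CALLS_TRY_TO_FREEZE, SETS_NOFREEZE };

struct mock_task {
    const char *name;
    enum thread_style style;
    unsigned int flags;
};

static void mock_task_init(struct mock_task *t, const char *name,
                           enum thread_style style)
{
    t->name = name;
    t->style = style;
    /* Option 2: the thread marks itself unfreezable once, at startup. */
    t->flags = (style == SETS_NOFREEZE) ? MOCK_PF_NOFREEZE : 0u;
}

/* One pass through the thread's main loop. */
static void mock_task_loop_once(struct mock_task *t)
{
    /* Option 1: check for a pending freeze request and park. */
    if (t->style == CALLS_TRY_TO_FREEZE && (t->flags & MOCK_PF_FREEZE)) {
        t->flags &= ~MOCK_PF_FREEZE;
        t->flags |= MOCK_PF_FROZEN;
    }
    /* ... the thread's normal work would happen here ... */
}

/*
 * Ask every freezable task to freeze, retrying up to max_rounds times.
 * Returns how many tasks are still not frozen when the freezer gives up.
 */
static size_t mock_freeze_tasks(struct mock_task *tasks, size_t n,
                                int max_rounds)
{
    size_t remaining = 0;
    size_t i;
    int round;

    assert(tasks != NULL && max_rounds > 0);

    for (round = 0; round < max_rounds; round++) {
        remaining = 0;
        for (i = 0; i < n; i++) {
            struct mock_task *t = &tasks[i];

            /* Unfreezable and already-frozen tasks are skipped. */
            if (t->flags & (MOCK_PF_NOFREEZE | MOCK_PF_FROZEN))
                continue;
            t->flags |= MOCK_PF_FREEZE;
            mock_task_loop_once(t);
            if (!(t->flags & MOCK_PF_FROZEN))
                remaining++;
        }
        if (remaining == 0)
            break;
    }
    return remaining;
}
```

In this model, a task initialised with IGNORES_FREEZER is the one left over when mock_freeze_tasks() gives up, which corresponds to bnx2i_thread/0 in the "stopping tasks timed out" message. A task using either of the other two styles lets the call return 0.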
Adding bnx2i maintainer Eddie from broadcom. It looks like this could be a problem in fcoe.ko and bnx2fc.ko in rhel 6 too.
(In reply to comment #6)
> Adding bnx2i maintainer Eddie from broadcom.
>
> It looks like this could be a problem in fcoe.ko and bnx2fc.ko in rhel 6 too.

I guess this does not apply to rhel6? The kernel_threads.txt is not there anymore and I see it is removed. But for rhel5 does qla2xxx have the problem?
(In reply to comment #7)
> I guess this does not apply to rhel6? The kernel_threads.txt is not there
> anymore and I see it is removed.

In RHEL6 there is Documentation/power/freezing_of_tasks.txt instead. There is one significant difference. In RHEL6 kernel threads are non-freezable by default. See commit 83144186 "Freezer: make kernel threads nonfreezable by default".

> But for rhel5 does qla2xxx have the problem?

Looking at the code... yes, it does.
Created attachment 534072
bnx2i patch to add explicit PF_NOFREEZE setting for I/O kthreads

It looks like the correct fix for the bnx2i I/O kthread is to add the explicit setting of the PF_NOFREEZE flag. This will align the bnx2i I/O kthread behavior between the RHEL6/upstream and RHEL5.8. The enclosed patch was created based off of the linux-2.6.18-295.el5 kernel source.

Please review, thanks.
Eddie
Patch(es) available in kernel-2.6.18-300.el5 You can download this test kernel (or newer) from http://people.redhat.com/jwilson/el5/ Detailed testing feedback is always welcomed. If you require guidance regarding testing, please ask the bug assignee.
Tried on a server platform with a bnx2i iSCSI session, but the server only supports suspend to disk.

Kernel -300: the server _cannot_ boot up. The console of that server is down, so the following was typed in manually:

==== begin fw dump (mark 0x3c67a0) 0x80071b4 mcp intr[0.0]: 0x4:SPAD RPTY => 0x PC 0x800650c ====

I will provide the detailed output once eng-ops fixes the console. I see no customer need to suspend a server to disk, so if you guys don't want to fix it, we can close this bug.
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. http://rhn.redhat.com/errata/RHSA-2012-0150.html