Description of problem: pm-suspend tries to stop all tasks, syslogd refuses to stop, thereby preventing suspend/hibernate Version-Release number of selected component (if applicable): rahide 2006-01-14 How reproducible: 100% Steps to Reproduce: 1. pm-suspend Actual results: # pm-suspend Freezing cpus ... Breaking affinity for irq 0 Breaking affinity for irq 4 Breaking affinity for irq 14 Breaking affinity for irq 74 Breaking affinity for irq 193 CPU 1 is now offline migration_cost=6 CPU1 is down Stopping tasks: ================================================================ stopping tasks failed (1 tasks remaining) Restarting tasks...<6> Strange, syslogd not stopped done Thawing cpus ... Expected results: syslog stops when other tasks do Additional info: system is in runlevel 3, runnign as root on a serial console just to eliminate any possible Xorg or fb issues ... If I manually stop syslog, then pm-suspend is able to proceed further (though not far enough but that's a separate issue) # service syslog stop Shutting down kernel logger: [ OK ] Shutting down system logger: [ OK ] # pm-suspend Freezing cpus ... Breaking affinity for irq 0 Breaking affinity for irq 1 CPU 1 is now offline migration_cost=5 CPU1 is down Stopping tasks: =============================================================|
I'm not able to reproduce this problem with kernel-2.6.15-1.1859_FC5 - I've tried @10 pm_suspend / resume cycles, with X and syslogd running, and it has never failed for me yet. It is actually the kernel that "stops" the tasks - perhaps the method used has changed somewhat with the latest kernel. Can you still reproduce this problem with kernel-2.6.15-1.1859_FC5 ? If so, please do: # strace -p `pidof syslogd` >/tmp/syslogd.strace.log 2>&1 & # pm-suspend (problem reproduced; resume) # kill `pidof strace` and send me the /tmp/syslogd.strace.log file - thanks!
OK installed x86_64 kernel-2.6.15-1.1859_FC5 from davej's repo Can confirm issue still happens, if I pm-suspend from gnome or vgacon I can't see what happens as monitor goes to sleep, machine hangs and won't come back, so having to use serial console the pm-suspend (without strace running) gives this # pm-suspend Freezing cpus ... Breaking affinity for irq 4 Breaking affinity for irq 14 Breaking affinity for irq 66 CPU 1 is now offline migration_cost=15 CPU1 is down Stopping tasks: ============================================================== stopping tasks failed (1 tasks remaining) Restarting tasks...<6> Strange, syslogd not stopped done Thawing cpus ... Booting processor 1/2 Since machine doesn't suspend (and therefore can't resume) it often causes resyncs on my raid arrays when I reboot it, each cycle takes a while, so I did it slightly differently, hopefully still valid. I started the strace -p on the serial console and captured output on another machine, then I logged onto the vgaconsole of the test machine in runlevel 3 and did the pm-suspend from there, so asn not to mix up the pm-suspend out and the strace output. The strace output is as follows, you can hopefully ignore the usb attatch/detach that is just my USB KVM in operation, any more info required please let me know. Process 2340 attached - interrupt to quit select(1, [0], NULL, NULL, NULL) = 1 (in [0]) recvfrom(0, "<6>Jan 17 22:26:05 kernel: usb 2"..., 1022, 0, NULL, NULL) = 93 writev(1, [{"Jan 17 22:26:05", 15}, {" ", 1}, {"htpc", 4}, {" ", 1}, {"kernel: usb 2-2.4: new low speed"..., 73}, {"\n", 1}], 6) = 95 fsync(1) = 0 select(1, [0], NULL, NULL, NULL) = 1 (in [0]) recvfrom(0, "<6>Jan 17 22:26:05 kernel: usb 2"..., 1022, 0, NULL, NULL) = 76 writev(1, [{"Jan 17 22:26:05", 15}, {" ", 1}, {"htpc", 4}, {" ", 1}, {"kernel: usb 2-2.4: configuration"..., 56}, {"\n", 1}], 6) = 78 fsync(1) = 0 select(1, [0], NULL, NULL, NULL) = 1 (in [0]) recvfrom(0, "<6>Jan 17 22:26:05 kernel: input"..., 1022, 0, NULL, NULL) = 79 writev(1, [{"Jan 17 22:26:05", 15}, {" ", 1}, {"htpc", 4}, {" ", 1}, {"kernel: input: Logitech USB Rece"..., 59}, {"\n", 1}], 6) = 81 fsync(1) = 0 select(1, [0], NULL, NULL, NULL) = 1 (in [0]) recvfrom(0, "<6>Jan 17 22:26:05 kernel: input"..., 1022, 0, NULL, NULL) = 102 writev(1, [{"Jan 17 22:26:05", 15}, {" ", 1}, {"htpc", 4}, {" ", 1}, {"kernel: input: USB HID v1.10 Mou"..., 82}, {"\n", 1}], 6) = 104 fsync(1) = 0 select(1, [0], NULL, NULL, NULL) = 1 (in [0]) recvfrom(0, "<29>Jan 17 22:26:10 ntpd[2687]: "..., 1022, 0, NULL, NULL) = 57 writev(1, [{"Jan 17 22:26:10", 15}, {" ", 1}, {"htpc", 4}, {" ", 1}, {"ntpd[2687]: ntpd exiting on sign"..., 37}, {"\n", 1}], 6) = 59 fsync(1) = 0 select(1, [0], NULL, NULL, NULL) = ? ERESTARTNOHAND (To be restarted) --- SIGALRM (Alarm clock) @ 0 (0) --- rt_sigaction(SIGALRM, {0x55555555767e, [ALRM], SA_RESTORER, 0x2aaaaabf5270}, {0x55555555767e, [ALRM], SA_RESTORER, 0x2aaaaabf5270}, 8) = 0 alarm(30) = 0 rt_sigaction(SIGALRM, NULL, {0x55555555767e, [ALRM], SA_RESTORER, 0x2aaaaabf5270}, 8) = 0 rt_sigaction(SIGALRM, {0x55555555767e, [ALRM], SA_RESTORER, 0x2aaaaabf5270}, NULL, 8) = 0 rt_sigreturn(0xe) = -1 EINTR (Interrupted system call) select(1, [0], NULL, NULL, NULL) = 1 (in [0]) recvfrom(0, "<7>Jan 17 22:26:13 kernel: PM: P"..., 1022, 0, NULL, NULL) = 62 select(1, [0], NULL, NULL, NULL) = 1 (in [0]) recvfrom(0, "<4>Jan 17 22:26:13 kernel: Freez"..., 1022, 0, NULL, NULL) = 45 writev(1, [{"Jan 17 22:26:13", 15}, {" ", 1}, {"htpc", 4}, {" ", 1}, {"kernel: Freezing cpus ...", 25}, {"\n", 1}], 6) = 47 fsync(1) = 0 select(1, [0], NULL, NULL, NULL) = 1 (in [0]) recvfrom(0, "<4>Jan 17 22:26:13 kernel: Break"..., 1022, 0, NULL, NULL) = 55 writev(1, [{"Jan 17 22:26:13", 15}, {" ", 1}, {"htpc", 4}, {" ", 1}, {"kernel: Breaking affinity for ir"..., 35}, {"\n", 1}], 6) = 57 fsync(1) = 0 select(1, [0], NULL, NULL, NULL) = 1 (in [0]) recvfrom(0, "<4>Jan 17 22:26:13 kernel: Break"..., 1022, 0, NULL, NULL) = 55 writev(1, [{"Jan 17 22:26:13", 15}, {" ", 1}, {"htpc", 4}, {" ", 1}, {"kernel: Breaking affinity for ir"..., 35}, {"\n", 1}], 6) = 57 fsync(1) = 0 select(1, [0], NULL, NULL, NULL) = 1 (in [0]) recvfrom(0, "<4>Jan 17 22:26:13 kernel: CPU 1"..., 1022, 0, NULL, NULL) = 48 writev(1, [{"Jan 17 22:26:13", 15}, {" ", 1}, {"htpc", 4}, {" ", 1}, {"kernel: CPU 1 is now offline", 28}, {"\n", 1}], 6) = 50 fsync(1) = 0 select(1, [0], NULL, NULL, NULL) = 1 (in [0]) recvfrom(0, "<4>Jan 17 22:26:13 kernel: migra"..., 1022, 0, NULL, NULL) = 44 writev(1, [{"Jan 17 22:26:13", 15}, {" ", 1}, {"htpc", 4}, {" ", 1}, {"kernel: migration_cost=6", 24}, {"\n", 1}], 6) = 46 fsync(1) = 0 select(1, [0], NULL, NULL, NULL) = 1 (in [0]) recvfrom(0, "<4>Jan 17 22:26:19 kernel: CPU1 "..., 1022, 0, NULL, NULL) = 40 writev(1, [{"Jan 17 22:26:19", 15}, {" ", 1}, {"htpc", 4}, {" ", 1}, {"kernel: CPU1 is down", 20}, {"\n", 1}], 6) = 42 fsync(1) = 0 select(1, [0], NULL, NULL, NULL) = 1 (in [0]) recvfrom(0, "<4>Jan 17 22:26:19 kernel: Stopp"..., 1022, 0, NULL, NULL) = 108 writev(1, [{"Jan 17 22:26:19", 15}, {" ", 1}, {"htpc", 4}, {" ", 1}, {"kernel: Stopping tasks: ========"..., 88}, {"\n", 1}], 6) = 110 fsync(1) = 0 select(1, [0], NULL, NULL, NULL) = 1 (in [0]) recvfrom(0, "<3>Jan 17 22:26:19 kernel: stop"..., 1022, 0, NULL, NULL) = 70 writev(1, [{"Jan 17 22:26:19", 15}, {" ", 1}, {"htpc", 4}, {" ", 1}, {"kernel: stopping tasks failed ("..., 50}, {"\n", 1}], 6) = 72 fsync(1) = 0 select(1, [0], NULL, NULL, NULL) = 1 (in [0]) recvfrom(0, "<4>Jan 17 22:26:19 kernel: Resta"..., 1022, 0, NULL, NULL) = 79 writev(1, [{"Jan 17 22:26:19", 15}, {" ", 1}, {"htpc", 4}, {" ", 1}, {"kernel: Restarting tasks...<6> S"..., 59}, {"\n", 1}], 6) = 81 fsync(1
Well, one would expect that syslogd should get some kind of signal telling it to shut down. The strace log shows that no such signal is ever sent to syslogd, and the suspend process just gives up. I'll investigate exactly how the process of stopping tasks is working, and why it always works OK on my machine ...
OK, I've finally managed to duplicate the problem, after many attempts at suspend/resume. When stracing syslogd, I got the same error: Stopping tasks: ================================================================ stopping tasks failed (1 tasks remaining) Restarting tasks...<6> Strange, syslogd not stopped After resuming, syslogd was then 100% non-operational, and un-interruptable - it could not be killed - I also had to kill the strace process with -KILL manually. The syslogd /proc/$pid/status then shows syslogd to be in 'State: D (disk sleep)' and syslogd cannot be killed - 'kill -KILL $pid' by root has no effect . Strange that it went into 'disk sleep' when /usr/sbin/pm-suspend does a suspend-to-RAM ? Perhaps this is because syslogd was 'frozen' when it should not have been, and then gets into an uninterruptable state in the kernel. syslogd spends 99.9% of its time either in the select() call, waiting for input, or in the fsync() call, waiting for its output to be flushed to disk. fsync() is non-interruptable - I think the freeze() attempt on syslogd may have occurred when it was doing its fsync() call . Perhaps another problem is that every action taken by the kernel/power code is also logged via klogd / syslogd. Anyway, there is nothing syslogd can do about sometimes being unfreezeable by the kernel. The kernel needs to correct the way it attempts to freeze syslogd, and should not put syslogd into the 'disk sleep' state where it cannot be killed and a new instance of it cannot be run until the machine is rebooted. This problem appears to be a kernel bug - moving to the kernel.
Thanks for the detailed investigation, just a FYI that this is a SATA drive with software raid partitions (raid1 and raid0) with LVM on top of the arrays, in case these multiple block i/o layers complicate things making the freeze more likely to hit problemas on my machine ...
Since I notice you've assigned this bug, best to say that I think the obstacle with syslog has been cleared, the machine *does* now suspend, it doesn't resume properly, but that's another story ... remember the bug elsewhere about serial splurge on resume? I can double check this before marking this NOTABUG if you'd like
Perhaps we should not be too quick to close this bug unless the problem originally causing it has definitely been identified and is known to be fixed - I think this is an intermittent problem, and not easy to reproduce. It took me many attempts to get the pm_suspend to fail because of syslogd - it happened once out of around 20-30 tries. But when it happened, the system was hosed - all processes blocked trying to send to the syslogd process, which could not be killed, and the only way to recover the system was to do a hard reset. I think we need to be sure that if a process is in an un-interruptable system call like fsync(), the suspend waits for this call to complete or fails the suspend.
OK, on the basis that "things which go away by themselves" tend to "come back by themselves" you might be right, I didn't know if this had been fixed elsewhere or not. When this used to go wrong for me it was very reliable at going wrong.
A new kernel update has been released (Version: 2.6.18-1.2200.fc5) based upon a new upstream kernel release. Please retest against this new kernel, as a large number of patches go into each upstream release, possibly including changes that may address this problem. This bug has been placed in NEEDINFO state. Due to the large volume of inactive bugs in bugzilla, if this bug is still in this state in two weeks time, it will be closed. Should this bug still be relevant after this period, the reporter can reopen the bug at any time. Any other users on the Cc: list of this bug can request that the bug be reopened by adding a comment to the bug. In the last few updates, some users upgrading from FC4->FC5 have reported that installing a kernel update has left their systems unbootable. If you have been affected by this problem please check you only have one version of device-mapper & lvm2 installed. See bug 207474 for further details. If this bug is a problem preventing you from installing the release this version is filed against, please see bug 169613. If this bug has been fixed, but you are now experiencing a different problem, please file a separate bug for the new problem. Thank you.
Lots of work on the freezer has been done since 2.6.15. Any bug that appeared now would almost certainly be due to an entirely different cause. Closing as fixed in current release.