177869 – syslog refuses to stop thereby preventing suspend/hibernate

Bug 177869 - syslog refuses to stop thereby preventing suspend/hibernate

Summary: syslog refuses to stop thereby preventing suspend/hibernate

Keywords:
Status:	CLOSED CURRENTRELEASE
Alias:	None
Product:	Fedora
Classification:	Fedora
Component:	kernel
Sub Component:
Version:	5
Hardware:	All
OS:	Linux
Priority:	medium
Severity:	medium
Target Milestone:	---
Assignee:	Dave Jones
QA Contact:	Brian Brock
Docs Contact:
URL:
Whiteboard:	NeedsRetesting
Depends On:
Blocks:
TreeView+	depends on / blocked

Reported:	2006-01-15 22:21 UTC by Andy Burns
Modified:	2015-01-04 22:24 UTC (History)
CC List:	4 users (show)
Fixed In Version:	3218.fc8
Doc Type:	Bug Fix
Doc Text:
Clone Of:
Environment:
Last Closed:	2007-06-12 06:06:21 UTC
Type:	---
Embargoed:
Dependent Products:

Attachments	(Terms of Use)

Description Andy Burns 2006-01-15 22:21:16 UTC

Description of problem:

pm-suspend tries to stop all tasks, syslogd refuses to stop, thereby preventing
suspend/hibernate

Version-Release number of selected component (if applicable):

rahide 2006-01-14

How reproducible:

100%

Steps to Reproduce:
1. pm-suspend
  
Actual results:

# pm-suspend
Freezing cpus ...
Breaking affinity for irq 0
Breaking affinity for irq 4
Breaking affinity for irq 14
Breaking affinity for irq 74
Breaking affinity for irq 193
CPU 1 is now offline
migration_cost=6
CPU1 is down
Stopping tasks: ================================================================
 stopping tasks failed (1 tasks remaining)
Restarting tasks...<6> Strange, syslogd not stopped
 done
Thawing cpus ...

Expected results:

syslog stops when other tasks do

Additional info:

system is in runlevel 3, runnign as root on a serial console just to eliminate
any possible Xorg or fb issues ...

If I manually stop syslog, then pm-suspend is able to proceed further (though
not far enough but that's a separate issue)

# service syslog stop
Shutting down kernel logger: [  OK  ]
Shutting down system logger: [  OK  ]

# pm-suspend
Freezing cpus ...
Breaking affinity for irq 0
Breaking affinity for irq 1
CPU 1 is now offline
migration_cost=5
CPU1 is down
Stopping tasks: =============================================================|

Comment 1 Jason Vas Dias 2006-01-17 19:26:36 UTC

I'm not able to reproduce this problem with kernel-2.6.15-1.1859_FC5 -
I've tried @10 pm_suspend / resume cycles, with X and syslogd running, and it 
has never failed for me yet.
It is actually the kernel that "stops" the tasks - perhaps the method used has
changed somewhat with the latest kernel.

Can you still reproduce this problem with kernel-2.6.15-1.1859_FC5 ?
If so, please do:
  # strace -p `pidof syslogd` >/tmp/syslogd.strace.log 2>&1 &
  # pm-suspend
(problem reproduced; resume)
  # kill `pidof strace`
and send me the /tmp/syslogd.strace.log file - thanks!

Comment 2 Andy Burns 2006-01-17 22:36:42 UTC

OK installed x86_64 kernel-2.6.15-1.1859_FC5 from davej's repo

Can confirm issue still happens, if I pm-suspend from gnome or vgacon I can't
see what happens as monitor goes to sleep, machine hangs and won't come back, so
having to use serial console

the pm-suspend (without strace running) gives this

# pm-suspend
Freezing cpus ...
Breaking affinity for irq 4
Breaking affinity for irq 14
Breaking affinity for irq 66
CPU 1 is now offline
migration_cost=15
CPU1 is down
Stopping tasks: ==============================================================
 stopping tasks failed (1 tasks remaining)
Restarting tasks...<6> Strange, syslogd not stopped
 done
Thawing cpus ...
Booting processor 1/2 

Since machine doesn't suspend (and therefore can't resume) it often causes
resyncs on my raid arrays when I reboot it, each cycle takes a while, so I did
it slightly differently, hopefully still valid.

I started the strace -p on the serial console and captured output on another
machine, then I logged onto the vgaconsole of the test machine in runlevel 3 and
did the pm-suspend from there, so asn not to mix up the pm-suspend out and the
strace output.

The strace output is as follows, you can hopefully ignore the usb attatch/detach
that is just my USB KVM in operation, any more info required please let me know.

Process 2340 attached - interrupt to quit
select(1, [0], NULL, NULL, NULL)        = 1 (in [0])
recvfrom(0, "<6>Jan 17 22:26:05 kernel: usb 2"..., 1022, 0, NULL, NULL) = 93
writev(1, [{"Jan 17 22:26:05", 15}, {" ", 1}, {"htpc", 4}, {" ", 1}, {"kernel:
usb 2-2.4: new low speed"..., 73}, {"\n", 1}], 6) = 95
fsync(1)                                = 0
select(1, [0], NULL, NULL, NULL)        = 1 (in [0])
recvfrom(0, "<6>Jan 17 22:26:05 kernel: usb 2"..., 1022, 0, NULL, NULL) = 76
writev(1, [{"Jan 17 22:26:05", 15}, {" ", 1}, {"htpc", 4}, {" ", 1}, {"kernel:
usb 2-2.4: configuration"..., 56}, {"\n", 1}], 6) = 78
fsync(1)                                = 0
select(1, [0], NULL, NULL, NULL)        = 1 (in [0])
recvfrom(0, "<6>Jan 17 22:26:05 kernel: input"..., 1022, 0, NULL, NULL) = 79
writev(1, [{"Jan 17 22:26:05", 15}, {" ", 1}, {"htpc", 4}, {" ", 1}, {"kernel:
input: Logitech USB Rece"..., 59}, {"\n", 1}], 6) = 81
fsync(1)                                = 0
select(1, [0], NULL, NULL, NULL)        = 1 (in [0])
recvfrom(0, "<6>Jan 17 22:26:05 kernel: input"..., 1022, 0, NULL, NULL) = 102
writev(1, [{"Jan 17 22:26:05", 15}, {" ", 1}, {"htpc", 4}, {" ", 1}, {"kernel:
input: USB HID v1.10 Mou"..., 82}, {"\n", 1}], 6) = 104
fsync(1)                                = 0
select(1, [0], NULL, NULL, NULL)        = 1 (in [0])
recvfrom(0, "<29>Jan 17 22:26:10 ntpd[2687]: "..., 1022, 0, NULL, NULL) = 57
writev(1, [{"Jan 17 22:26:10", 15}, {" ", 1}, {"htpc", 4}, {" ", 1},
{"ntpd[2687]: ntpd exiting on sign"..., 37}, {"\n", 1}], 6) = 59
fsync(1)                                = 0
select(1, [0], NULL, NULL, NULL)        = ? ERESTARTNOHAND (To be restarted)
--- SIGALRM (Alarm clock) @ 0 (0) ---
rt_sigaction(SIGALRM, {0x55555555767e, [ALRM], SA_RESTORER, 0x2aaaaabf5270},
{0x55555555767e, [ALRM], SA_RESTORER, 0x2aaaaabf5270}, 8) = 0
alarm(30)                               = 0
rt_sigaction(SIGALRM, NULL, {0x55555555767e, [ALRM], SA_RESTORER,
0x2aaaaabf5270}, 8) = 0
rt_sigaction(SIGALRM, {0x55555555767e, [ALRM], SA_RESTORER, 0x2aaaaabf5270},
NULL, 8) = 0
rt_sigreturn(0xe)                       = -1 EINTR (Interrupted system call)
select(1, [0], NULL, NULL, NULL)        = 1 (in [0])
recvfrom(0, "<7>Jan 17 22:26:13 kernel: PM: P"..., 1022, 0, NULL, NULL) = 62
select(1, [0], NULL, NULL, NULL)        = 1 (in [0])
recvfrom(0, "<4>Jan 17 22:26:13 kernel: Freez"..., 1022, 0, NULL, NULL) = 45
writev(1, [{"Jan 17 22:26:13", 15}, {" ", 1}, {"htpc", 4}, {" ", 1}, {"kernel:
Freezing cpus ...", 25}, {"\n", 1}], 6) = 47
fsync(1)                                = 0
select(1, [0], NULL, NULL, NULL)        = 1 (in [0])
recvfrom(0, "<4>Jan 17 22:26:13 kernel: Break"..., 1022, 0, NULL, NULL) = 55
writev(1, [{"Jan 17 22:26:13", 15}, {" ", 1}, {"htpc", 4}, {" ", 1}, {"kernel:
Breaking affinity for ir"..., 35}, {"\n", 1}], 6) = 57
fsync(1)                                = 0
select(1, [0], NULL, NULL, NULL)        = 1 (in [0])
recvfrom(0, "<4>Jan 17 22:26:13 kernel: Break"..., 1022, 0, NULL, NULL) = 55
writev(1, [{"Jan 17 22:26:13", 15}, {" ", 1}, {"htpc", 4}, {" ", 1}, {"kernel:
Breaking affinity for ir"..., 35}, {"\n", 1}], 6) = 57
fsync(1)                                = 0
select(1, [0], NULL, NULL, NULL)        = 1 (in [0])
recvfrom(0, "<4>Jan 17 22:26:13 kernel: CPU 1"..., 1022, 0, NULL, NULL) = 48
writev(1, [{"Jan 17 22:26:13", 15}, {" ", 1}, {"htpc", 4}, {" ", 1}, {"kernel:
CPU 1 is now offline", 28}, {"\n", 1}], 6) = 50
fsync(1)                                = 0
select(1, [0], NULL, NULL, NULL)        = 1 (in [0])
recvfrom(0, "<4>Jan 17 22:26:13 kernel: migra"..., 1022, 0, NULL, NULL) = 44
writev(1, [{"Jan 17 22:26:13", 15}, {" ", 1}, {"htpc", 4}, {" ", 1}, {"kernel:
migration_cost=6", 24}, {"\n", 1}], 6) = 46
fsync(1)                                = 0
select(1, [0], NULL, NULL, NULL)        = 1 (in [0])
recvfrom(0, "<4>Jan 17 22:26:19 kernel: CPU1 "..., 1022, 0, NULL, NULL) = 40
writev(1, [{"Jan 17 22:26:19", 15}, {" ", 1}, {"htpc", 4}, {" ", 1}, {"kernel:
CPU1 is down", 20}, {"\n", 1}], 6) = 42
fsync(1)                                = 0
select(1, [0], NULL, NULL, NULL)        = 1 (in [0])
recvfrom(0, "<4>Jan 17 22:26:19 kernel: Stopp"..., 1022, 0, NULL, NULL) = 108
writev(1, [{"Jan 17 22:26:19", 15}, {" ", 1}, {"htpc", 4}, {" ", 1}, {"kernel:
Stopping tasks: ========"..., 88}, {"\n", 1}], 6) = 110
fsync(1)                                = 0
select(1, [0], NULL, NULL, NULL)        = 1 (in [0])
recvfrom(0, "<3>Jan 17 22:26:19 kernel:  stop"..., 1022, 0, NULL, NULL) = 70
writev(1, [{"Jan 17 22:26:19", 15}, {" ", 1}, {"htpc", 4}, {" ", 1}, {"kernel: 
stopping tasks failed ("..., 50}, {"\n", 1}], 6) = 72
fsync(1)                                = 0
select(1, [0], NULL, NULL, NULL)        = 1 (in [0])
recvfrom(0, "<4>Jan 17 22:26:19 kernel: Resta"..., 1022, 0, NULL, NULL) = 79
writev(1, [{"Jan 17 22:26:19", 15}, {" ", 1}, {"htpc", 4}, {" ", 1}, {"kernel:
Restarting tasks...<6> S"..., 59}, {"\n", 1}], 6) = 81
fsync(1

Comment 3 Jason Vas Dias 2006-01-17 22:46:07 UTC

Well, one would expect that syslogd should get some kind of signal telling it
to shut down.
The strace log shows that no such signal is ever sent to syslogd, and the 
suspend process just gives up.
I'll investigate exactly how the process of stopping tasks is working, and why
it always works OK on my machine ...

Comment 4 Jason Vas Dias 2006-01-18 00:08:52 UTC

OK, I've finally managed to duplicate the problem, after many attempts at
suspend/resume.

When stracing syslogd, I got the same error:

Stopping tasks: ================================================================
stopping tasks failed (1 tasks remaining)
Restarting tasks...<6> Strange, syslogd not stopped

After resuming, syslogd was then 100% non-operational, and un-interruptable - it
could not be killed - I also had to kill the strace process with -KILL manually.

The syslogd /proc/$pid/status then shows syslogd to be in 
  'State:  D (disk sleep)'
and syslogd cannot be killed - 'kill -KILL $pid' by root has no effect .

Strange that it went into 'disk sleep' when /usr/sbin/pm-suspend does a 
suspend-to-RAM ?

Perhaps this is because syslogd was 'frozen' when it should not have been,
and then gets into an uninterruptable state in the kernel.

syslogd spends 99.9% of its time either in the select() call, waiting for 
input, or in the fsync() call, waiting for its output to be flushed to disk.

fsync() is non-interruptable - I think the freeze() attempt on syslogd may
have occurred when it was doing its fsync() call .

Perhaps another problem is that every action taken by the kernel/power code
is also logged via klogd / syslogd.

Anyway, there is nothing syslogd can do about sometimes being unfreezeable
by the kernel.

The kernel needs to correct the way it attempts to freeze syslogd, and should
not put syslogd into the 'disk sleep' state where it cannot be killed and a new
instance of it cannot be run until the machine is rebooted.

This problem appears to be a kernel bug - moving to the kernel.

Comment 5 Andy Burns 2006-01-18 00:19:21 UTC

Thanks for the detailed investigation, just a FYI that this is a SATA drive with
software raid partitions (raid1 and raid0) with LVM on top of the arrays, in
case these multiple block i/o layers complicate things making the freeze more
likely to hit problemas on my machine ...

Comment 6 Andy Burns 2006-02-04 08:51:42 UTC

Since I notice you've assigned this bug, best to say that I think the obstacle
with syslog has been cleared, the machine *does* now suspend, it doesn't resume
properly, but that's another story ... remember the bug elsewhere about serial
splurge on resume?

I can double check this before marking this NOTABUG if you'd like

Comment 7 Jason Vas Dias 2006-02-04 17:06:47 UTC

Perhaps we should not be too quick to close this bug unless the problem 
originally causing it has definitely been identified and is known to be fixed -
I think this is an intermittent problem, and not easy to reproduce.

It took me many attempts to get the pm_suspend to fail because of syslogd -
it happened once out of around 20-30 tries.

But when it happened, the system was hosed - all processes blocked trying to
send to the syslogd process, which could not be killed, and the only way to
recover the system was to do a hard reset. 

I think we need to be sure that if a process is in an un-interruptable system
call like fsync(), the suspend waits for this call to complete or fails the 
suspend.

Comment 8 Andy Burns 2006-02-04 17:29:30 UTC

OK, on the basis that "things which go away by themselves" tend to "come back by
themselves" you might be right, I didn't know if this had been fixed elsewhere
or not. When this used to go wrong for me it was very reliable at going wrong.

Comment 9 Dave Jones 2006-10-16 18:21:24 UTC

A new kernel update has been released (Version: 2.6.18-1.2200.fc5)
based upon a new upstream kernel release.

Please retest against this new kernel, as a large number of patches
go into each upstream release, possibly including changes that
may address this problem.

This bug has been placed in NEEDINFO state.
Due to the large volume of inactive bugs in bugzilla, if this bug is
still in this state in two weeks time, it will be closed.

Should this bug still be relevant after this period, the reporter
can reopen the bug at any time. Any other users on the Cc: list
of this bug can request that the bug be reopened by adding a
comment to the bug.

In the last few updates, some users upgrading from FC4->FC5
have reported that installing a kernel update has left their
systems unbootable. If you have been affected by this problem
please check you only have one version of device-mapper & lvm2
installed.  See bug 207474 for further details.

If this bug is a problem preventing you from installing the
release this version is filed against, please see bug 169613.

If this bug has been fixed, but you are now experiencing a different
problem, please file a separate bug for the new problem.

Thank you.

Comment 10 Nigel Cunningham 2007-06-12 06:06:21 UTC

Lots of work on the freezer has been done since 2.6.15. Any bug that appeared
now would almost certainly be due to an entirely different cause. Closing as
fixed in current release.

Note You need to log in before you can comment on or make changes to this bug.