Bug 845553

Summary: systemd stops services on suspend
Product: [Fedora] Fedora Reporter: Pierre Ossman <pierre-bugzilla>
Component: nfs-utilsAssignee: Steve Dickson <steved>
Status: CLOSED WONTFIX QA Contact: Fedora Extras Quality Assurance <extras-qa>
Severity: high Docs Contact:
Priority: unspecified    
Version: 16CC: bfields, jlayton, johannbg, lnykryn, lpoetter, metherid, mschmidt, msekleta, notting, plautrba, steved, systemd-maint, vpavlin
Target Milestone: ---   
Target Release: ---   
Hardware: Unspecified   
OS: Unspecified   
Whiteboard:
Fixed In Version: Doc Type: Bug Fix
Doc Text:
Story Points: ---
Clone Of: Environment:
Last Closed: 2013-02-13 17:28:21 EST Type: Bug
Regression: --- Mount Type: ---
Documentation: --- CRM:
Verified Versions: Category: ---
oVirt Team: --- RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: ---
Attachments:
Description Flags
/var/log/messages
none
dmesg (failed mount)
none
/var/log/messages (failed mount)
none
dmesg (dead NM and rsyslog)
none
/var/log/messages (dead NM and rsyslog) none

Description Pierre Ossman 2012-08-03 08:21:46 EDT
Since some time back, I've had problems with my HTPC machine not working properly after resume. Some key services get terminated, which generally makes the machine unusable. Since they are terminated cleanly (SIGTERM), and being such a wide variety, I'm suspecting a systemd bug.

The services I've noticed are:

 - NetworkManager
 - lircd
 - rsyslog
 - NFS mount points

Looking in /var/log/messages, there are more though:

Aug  3 14:10:13 hugin NetworkManager[933]: <info> caught signal 15, shutting down normally.
Aug  3 14:10:13 hugin avahi-daemon[934]: Got SIGTERM, quitting.
Aug  3 14:10:14 hugin rsyslogd: [origin software="rsyslogd" swVersion="5.8.10" x-pid="941" x-info="http://www.rsyslog.com"] exiting on signal 15.

I also have this odd thing:

Aug  3 14:10:13 hugin dbus-daemon[965]: /images (deleted):  rcercrcrcrcrcrcrcrcrcrcrcrcrcrcrcrcrcrcrcrcrcrcrcrcrcrcrcrcrcrcrcrcrcrcrcrcercercrcrcrcrcrcrcrcrercrcrcercercercercercercercercercercerercercercrcercercercercercercercercercercrcercercercercercercercercercercrcrcrcercercerercercercererercercercercrcrcrcrcrcrcercercercercercercrcrcrcrcrcrcrcrcrcrcrcrcrcrcrcrcrcrcrcrcrcrcrcrcrcrcrcrcrcrcrcrcrcrcrcrcrcrcrcrcrcrcrcrcrcrcrcrcrcrcrcrcrcrcrcrcrcrcrcrcrcrcrcrcrcrcrcrcrcrcrcrcrcrcrcrcrcerce


This is very major bug as it makes suspend completely unusable on the machine.

Bug 788911 might also be related.
Comment 1 Pierre Ossman 2012-08-03 08:22:40 EDT
I'm currently running systemd-37-25.fc16.x86_64.
Comment 2 Pierre Ossman 2012-08-03 08:25:38 EDT
Created attachment 602121 [details]
/var/log/messages

Here's the entire log, covering both suspend and resume.

Noteworthy is the fact that this line is just before services got killed:

Aug  3 14:10:13 hugin systemd[1]: Reexecuting.
Comment 3 Michal Schmidt 2012-08-03 08:46:14 EDT
Please boot with the following kernel command line parameters:
  systemd.log_level=debug systemd.log_target=kmsg log_buf_len=1M

After reproducing the bug, please store the kernel log buffer:
  dmesg > dmesg.txt

and attach the resulting file here. Thanks.
Comment 4 Pierre Ossman 2012-08-04 09:19:17 EDT
Created attachment 602247 [details]
dmesg (failed mount)
Comment 5 Pierre Ossman 2012-08-04 09:20:53 EDT
Created attachment 602248 [details]
/var/log/messages (failed mount)

dmesg and messages from when it broke the NFS mount points. No other killed services as far as I can tell. Will try to reproduce the fully broken behaviour.
Comment 6 Pierre Ossman 2012-08-07 14:14:10 EDT
Created attachment 602824 [details]
dmesg (dead NM and rsyslog)
Comment 7 Pierre Ossman 2012-08-07 14:15:10 EDT
Created attachment 602825 [details]
/var/log/messages (dead NM and rsyslog)

Got a proper failure today. Attached is dmesg and messages.
Comment 8 Jóhann B. Guðmundsson 2012-08-08 03:18:57 EDT
Those lirc errors indicate there is something wrong with your configuration file and nfs and suspend issue is nothing new and unless it has been fixed nfs shares need to be unmounted cleanly before suspended. 

Moving this report against nfs-utils Steve will just bounce it back if this is an systemd issue...
Comment 9 Pierre Ossman 2012-08-08 06:33:06 EDT
Which logs are we talking about now? The first or second set?

In the second set, both NetworkManager and rsyslog have been killed. That's the more serious issue, and is not something that nfs-utils has any bearing on.
Comment 10 Michal Schmidt 2012-08-08 10:21:48 EDT
There's an interesting point after the wakeup:

[30784.854221] systemd[1]: Received SIGTERM from PID 29523 (fuser).

Looks like some script calls "fuser -k ..." and this results in various processes being killed.
Comment 11 Jóhann B. Guðmundsson 2012-08-08 12:22:05 EDT
Hmm So to try to determine what's calling fuser -k Pierre needs to run ps -AwfH > before-suspend.txt then match the pid to that fuser kill? 

xbmc is also segfaulting

[30787.388735] xbmc.bin[1346]: segfault at 74 ip 000000340be178cd sp 00007fff7d2f63e0 error 4 in libfreetype.so.6.7.1[340be00000+9a000]
Comment 12 Pierre Ossman 2012-08-08 12:47:06 EDT
(In reply to comment #11)
> Hmm So to try to determine what's calling fuser -k Pierre needs to run ps
> -AwfH > before-suspend.txt then match the pid to that fuser kill? 
> 

Isn't fuser a freshly started process? So catching its brief existence with ps seems tricky.

> xbmc is also segfaulting
> 
> [30787.388735] xbmc.bin[1346]: segfault at 74 ip 000000340be178cd sp
> 00007fff7d2f63e0 error 4 in libfreetype.so.6.7.1[340be00000+9a000]

xbmc is a temperamental piece of software. It probably crashes when some service it is using disappears.
Comment 13 Lennart Poettering 2012-08-08 21:42:17 EDT
Seems indeed that somebody calls fuser -k. Have you tried just grepping through your whole file system tree for this?

I am pretty sure this is nothing to do with systemd. We certainly do not invoke fuser -k ever.
Comment 14 Pierre Ossman 2012-08-09 06:03:21 EDT
The only non-binary I can find that uses fuser is /etc/init.d/functions, which has it in the unmounting code. That code in turn is only used by the netfs service (i.e. unmounting NFS stuff).

I'm not sure how netfs is supposed to be invoked these days, as systemd has both netfs.service, remote-fs.service, as well as individual *.mount entries for everything. And NetworkManager still has its dispatch scripts that mount and unmount things.

To get further, I've replaced fuser with a shell script that logs when it is called, the arguments, and the parent process. Hopefully that can get us closer.
Comment 15 Pierre Ossman 2012-08-10 07:44:52 EDT
Ok, got the issue again, so I have some more clues:

This is what I got in my new log:

Fri Aug 10 13:04:46 CEST 2012 Invoked by 25485 (netfs): fuser -k -m -15 /images (deleted)

And if I run that command without the "-k" bit, I get a list of a lot of processes, including 1 (systemd). So this seems to be where things go very wrong.

Now two questions pop up:

 1. Where does the "(deleted)" come from? Doing umount -l doesn't produce this. And you can't remove the mount point as long as the mount is still active.

 2. Should netfs even be used? It isn't on a fresh boot. But once NM's dispatch scripts trigger, it will be activated.
Comment 16 Jeff Layton 2012-08-10 08:27:23 EDT
(In reply to comment #15)

> Now two questions pop up:
> 
>  1. Where does the "(deleted)" come from? Doing umount -l doesn't produce
> this. And you can't remove the mount point as long as the mount is still
> active.
> 

(deleted) comes from path_with_deleted() in the kernel. Basically, the kernel is building a pathname from a dentry (a path component object). If that dentry has been unhashed (made unavailable for lookups) then it will get " (deleted)" appended when it builds the pathname from it.

>  2. Should netfs even be used? It isn't on a fresh boot. But once NM's
> dispatch scripts trigger, it will be activated.

I'd say no. Tearing down mounts just because the network interface is going down seems wrong to me. It's quite possible (likely even) that these mounts are busy when someone goes to suspend the machine. Having your processes catch SIGKILLs just because they have an open file on a network filesystem is not what most people expect when they suspend their machines.
Comment 17 Fedora End Of Life 2013-01-16 12:38:06 EST
This message is a reminder that Fedora 16 is nearing its end of life.
Approximately 4 (four) weeks from now Fedora will stop maintaining
and issuing updates for Fedora 16. It is Fedora's policy to close all
bug reports from releases that are no longer maintained. At that time
this bug will be closed as WONTFIX if it remains open with a Fedora 
'version' of '16'.

Package Maintainer: If you wish for this bug to remain open because you
plan to fix it in a currently maintained version, simply change the 'version' 
to a later Fedora version prior to Fedora 16's end of life.

Bug Reporter: Thank you for reporting this issue and we are sorry that 
we may not be able to fix it before Fedora 16 is end of life. If you 
would still like to see this bug fixed and are able to reproduce it 
against a later version of Fedora, you are encouraged to click on 
"Clone This Bug" and open it against that version of Fedora.

Although we aim to fix as many bugs as possible during every release's 
lifetime, sometimes those efforts are overtaken by events. Often a 
more recent Fedora release includes newer upstream software that fixes 
bugs or makes them obsolete.

The process we are following is described here: 
http://fedoraproject.org/wiki/BugZappers/HouseKeeping
Comment 18 Fedora End Of Life 2013-02-13 17:28:25 EST
Fedora 16 changed to end-of-life (EOL) status on 2013-02-12. Fedora 16 is 
no longer maintained, which means that it will not receive any further 
security or bug fix updates. As a result we are closing this bug.

If you can reproduce this bug against a currently maintained version of 
Fedora please feel free to reopen this bug against that version.

Thank you for reporting this bug and we are sorry it could not be fixed.