Bug 629040

Summary:	sysv services: cannot detect service status if main process is unknown
Product:	[Fedora] Fedora	Reporter:	Bill Nottingham <notting>
Component:	systemd	Assignee:	Lennart Poettering <lpoetter>
Status:	CLOSED UPSTREAM	QA Contact:	Fedora Extras Quality Assurance <extras-qa>
Severity:	medium	Docs Contact:
Priority:	low
Version:	rawhide	CC:	lpoetter, metherid, mschmidt, notting, plautrba, rvokal, sdake
Target Milestone:	---	Keywords:	Reopened
Target Release:	---
Hardware:	All
OS:	Linux
Whiteboard:
Fixed In Version:		Doc Type:	Bug Fix
Doc Text:		Story Points:	---
Clone Of:		Environment:
Last Closed:	2011-07-05 18:31:37 UTC	Type:	---
Regression:	---	Mount Type:	---
Documentation:	---	CRM:
Verified Versions:		Category:	---
oVirt Team:	---	RHEL 7.3 requirements from Atomic Host:
Cloudforms Team:	---	Target Upstream Version:
Embargoed:

Description Bill Nottingham 2010-08-31 18:45:22 UTC

Description of problem:


[root@nostromo ~]# service crond start
Starting crond (via systemctl):                            [  OK  ]
[root@nostromo ~]# systemctl status crond.service
crond.service - LSB: run cron daemon
	  Loaded: loaded (/etc/rc.d/init.d/crond)
	  Active: active (running) since [Tue, 31 Aug 2010 14:42:47 -0400; 2s ago]
	 Process: 14763 (/etc/rc.d/init.d/crond stop, code=exited, status=0/SUCCESS)
	 Process: 14797 (/etc/rc.d/init.d/crond start, code=exited, status=0/SUCCESS)
	  CGroup: name=systemd:/systemd-1/crond.service
		  └ 14802 crond
[root@nostromo ~]# kill -11 14802
[root@nostromo ~]# systemctl status crond.service
crond.service - LSB: run cron daemon
	  Loaded: loaded (/etc/rc.d/init.d/crond)
	  Active: active (exited) since [Tue, 31 Aug 2010 14:42:47 -0400; 16s ago]
	 Process: 14763 (/etc/rc.d/init.d/crond stop, code=exited, status=0/SUCCESS)
	 Process: 14797 (/etc/rc.d/init.d/crond start, code=exited, status=0/SUCCESS)
	  CGroup: name=systemd:/systemd-1/crond.service

Shouldn't it have noticed that the daemon fell over? (OK, it was pushed.)

Version-Release number of selected component (if applicable):

systemd-8-3.fc14.x86_64

How reproducible:

100%

Steps to Reproduce:
1. Whack some service by hand
  
Actual results:

systemd thinks it's still running

Comment 1 Lennart Poettering 2010-08-31 21:36:40 UTC

Oh, it did. See the "active (running)" vs. "active (exited)"?

The problem is that SysV service scripts may, but don't have to, spawn a background process. Or they might do so only temporarily. Example: the iptables script runs once on boot and then exits, and then is run again on shutdown. While it is "active" it won't have any process running. On the other hand there is stuff like cron, which spawns a process and terminates it on shutdown. Since there is no nice way to figure out which kind of script a SysV script is, i.e. one that spawns a process or one that doesn't, we have to deal with both cases and hence imply RemainAfterExit=true, which allows us to cover both setups. Of course that implies that for SysV scripts we still consider a service "active" even if it actually exited. Only the substate (i.e. running vs. exited) will tell you what actually happened.

The SysV system is simply too simple to cover this properly.

A simple fix is to write a native unit file for crond, which would then set RemainAfterExit=false (which is the default anyway for native units).
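
To make the running/exited distinction concrete, here is a minimal sketch of the state logic described above. It is an illustration only, not systemd's actual code: the function name and the tuple return value are made up for this example, and it only models a start script that succeeded followed by a clean exit.

```python
def service_state(live_pids, remain_after_exit):
    """Return (active_state, sub_state) for a service whose start succeeded.

    live_pids: PIDs still present in the service's cgroup (hypothetical input).
    remain_after_exit: True for SysV scripts (implied), False by default
    for native units.
    """
    if live_pids:
        # Something is still running in the cgroup.
        return ("active", "running")
    if remain_after_exit:
        # No processes left, but the service is still considered "active".
        return ("active", "exited")
    # Simplification: models a clean exit only, not a crash.
    return ("inactive", "dead")


# SysV crond before and after the daemon is killed by hand:
print(service_state([14802], remain_after_exit=True))
print(service_state([], remain_after_exit=True))
```

With remain_after_exit=False, as in a native unit, the second call would no longer report "active" at all, which is the behaviour the reporter expected.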

(Note that the traditional chkconfig header we use supports an option to declare the PID file for a service. We could add some code to systemd that sets RemainAfterExit=false if that option is set, under the assumption that services with a PID file certainly do spawn some background process. Unfortunately, however, the cron service doesn't use this chkconfig header option. And instead of fixing that in the script, it probably makes more sense to simply write a native service file, or just take the one I already wrote.)

Comment 2 Bill Nottingham 2010-09-01 04:30:16 UTC

Hm, this happens for other services as well, though:

[root@localhost system]# systemctl status cups.service
cups.service - LSB: The CUPS scheduler
	  Loaded: loaded (/etc/rc.d/init.d/cups)
	  Active: active (running) since [Tue, 31 Aug 2010 15:06:32 -0400; 9h ago]
	 Process: 1156 (/etc/rc.d/init.d/cups start, code=exited, status=0/SUCCESS)
	  CGroup: name=systemd:/systemd-1/cups.service
		  ├ 1181 cupsd -C /etc/cups/cupsd.conf
		  └ 1195 cups-polld cups.rdu.redhat.com 631 30 631
[root@localhost system]# kill -11 1181
[root@localhost system]# systemctl status cups.service
cups.service - LSB: The CUPS scheduler
	  Loaded: loaded (/etc/rc.d/init.d/cups)
	  Active: active (running) since [Tue, 31 Aug 2010 15:06:32 -0400; 9h ago]
	 Process: 1156 (/etc/rc.d/init.d/cups start, code=exited, status=0/SUCCESS)
	  CGroup: name=systemd:/systemd-1/cups.service
		  └ 1195 cups-polld cups.rdu.redhat.com 631 30 631

Note that in this case, it still shows as *running*.

Comment 3 Lennart Poettering 2010-09-02 19:52:59 UTC

Yup, because it is still running, i.e. process 1195 is still around.

Comment 4 Lennart Poettering 2010-09-02 19:53:41 UTC

Or in other words, if you kill 1195, too, then the state should change to active (exited).

Comment 5 Bill Nottingham 2010-09-02 19:54:58 UTC

Something is still running in the cgroup, yes. But it's not the service.

There needs to be a good way to track this, otherwise using systemctl to check the status of SysV services isn't useful, as it won't be correct.

Comment 6 Lennart Poettering 2010-09-02 20:28:15 UTC

Well, but what do *is* the service?

systemd has no way to figure that out, since we didn't fork the service when sysv scripts are used. All we know is that there are multiple processes after we ran the startup script. Which one of those is the "main process" is not clear.

We try to make the best of it, and wait for all processes to go away before we consider the service "exited". I think that is the only reasonable thing to do.

Note that if the PID file is declared in the chkconfig header we actually read it after running the start-up script and then consider that the main pid of service. If CUPS would use that chkconfig header field, then systemd would be able to figure out the main service. Maybe we should file bugs and ask people to include the header in their scripts?
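
A rough sketch of that lookup, assuming the usual "# pidfile: <path>" comment line in the chkconfig header. The helper names and the sample script header are hypothetical, and this is not how systemd itself implements it:

```python
import re

# Matches a chkconfig-style header line such as "# pidfile: /var/run/foo.pid".
PIDFILE_RE = re.compile(r"^#\s*pidfile:\s*(\S+)")


def pidfile_from_header(script_text):
    """Return the PID file path declared in an init script, or None."""
    for line in script_text.splitlines():
        match = PIDFILE_RE.match(line)
        if match:
            return match.group(1)
    return None


def read_main_pid(pidfile_path):
    """Read the main PID from the PID file; None if missing or malformed."""
    try:
        with open(pidfile_path) as f:
            return int(f.read().strip())
    except (OSError, ValueError):
        return None


# Hypothetical init script header, for illustration only.
SAMPLE = """#!/bin/sh
# chkconfig: 2345 90 60
# description: hypothetical example daemon
# pidfile: /var/run/exampled.pid
"""

print(pidfile_from_header(SAMPLE))
```

A script without the header line yields None, in which case the main process stays unknown and only the "any process left in the cgroup" check applies.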

Comment 7 Lennart Poettering 2010-09-02 20:28:28 UTC

s/do//

Comment 8 Bill Nottingham 2010-09-02 20:34:03 UTC

(In reply to comment #6)
> Well, but what do *is* the service?

Generally... the thing with the same name as the script. It's sort of a dirty heuristic, but I think it does work for compatibility.

> Note that if the PID file is declared in the chkconfig header we actually read
> it after running the start-up script and then consider that the main pid of
> service. If CUPS would use that chkconfig header field, then systemd would be
> able to figure out the main service. Maybe we should file bugs and ask people
> to include the header in their scripts?

We could, but if we're offering SysV compat, it should work as best as possible without requiring changes to the scripts.

Comment 9 Lennart Poettering 2010-09-02 20:45:35 UTC

(In reply to comment #8)
> (In reply to comment #6)
> > Well, but what do *is* the service?
> 
> Generally... the thing with the same name as the script. It's sort of a dirty
> heuristic, but I think it does work for compatibility.

Often that doesn't work. E.g. in avahi we tend to have two avahi processes, going by the same "comm" name as they are actually forked from the same parent.

Also, some distros run multiple daemons from the same sysv script (e.g. smbd, nmbd; or various nfs services). In those cases all processes might be considered the "main" process, and we'd really have a hard time saying which one has more right to be declared "main".

(That said, I'd guess a different heuristic might work better: we could pick the process that is around after invoking the start script that has getppid() == 1 and the lowest starttime field, i.e. is the first one that was created. But even that might go wrong...)
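
The heuristic from the parenthetical above could look roughly like this. It is a sketch only: the Proc record and the function name are invented for illustration, and the ppid and starttime values are assumed to have been read already (on Linux they are fields 4 and 22 of /proc/[pid]/stat).

```python
from collections import namedtuple

# Hypothetical record for one process found in the service's cgroup.
Proc = namedtuple("Proc", "pid ppid starttime")


def guess_main_pid(procs):
    """Pick the oldest process whose parent is PID 1, or None if there is none."""
    daemons = [p for p in procs if p.ppid == 1]
    if not daemons:
        return None
    # Lowest starttime means the process that was created first.
    return min(daemons, key=lambda p: p.starttime).pid


# Made-up values loosely modelled on the CUPS example from comment 2.
procs = [Proc(pid=1195, ppid=1181, starttime=250),
         Proc(pid=1181, ppid=1, starttime=200)]
print(guess_main_pid(procs))
```

As noted, this can still go wrong, for instance when a script starts several independent daemons and the oldest one is not the one the admin cares about.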

SysV is SysV. I don't think we should try to be too smart with it. If it doesn't support things nicely it is so because it is old and wasn't designed with monitoring in mind. If people want to monitor things nicely, they should write native unit files. 

Note that nothing is really broken in the current SysV logic in systemd. All that is suboptimal is that we sometimes stay a bit longer in "running" state instead of "exited" than the user might expect. But I think I can live with that. It's probably something we should document in the FAQ, but other than an unexpected state in the "systemctl status" output this has little technical effect.

Comment 10 Bill Nottingham 2010-09-02 21:25:32 UTC

That's a huge user effect, if it's claiming something is running when it's not. Why would someone use systemd's tools if they're going to give them a different/wrong answer?

Comment 11 Lennart Poettering 2010-09-02 22:00:37 UTC

Well, why do you claim that systemd is giving a wrong answer?

You said "something is running". And that's exactly what is the case for CUPS: some CUPS left-over processes are still running, even if the main process isn't. And as long as "something is running" systemd says "running". And as soon as "something is running" isn't the case anymore it will say "exited".

I see really nothing wrong with the current behaviour of systemd here.

Comment 12 Bill Nottingham 2010-09-03 15:38:42 UTC

The problem is that it is a regression for sysv services that will give the incorrect answer to admins checking on their service status. Let me see if I can code up a prospective fix.

Comment 13 Matthias Clasen 2010-10-08 22:46:30 UTC

Moving systemd bugs to f15, since the systemd feature got delayed.

Comment 14 Lennart Poettering 2010-11-18 23:23:37 UTC

BTW, systemd is now trying a bit harder to find the main pid. If none is configured but only one process is in the cgroup after startup then systemd assumes that this is the main process of the daemon. This should fix things in many but not all cases.

Comment 15 Lennart Poettering 2011-02-16 23:20:34 UTC

systemd is now even smarter: it looks for processes that are actually children of PID 1, i.e. proper daemons. If there is only one of those it must be the main daemon.

I think this is as good as it can get. Because if there are multiple processes in the cgroup and more than one of them is a child of PID 1, then there is actually more than one daemon running in the cgroup, in which case declaring one "main" makes little sense and cannot be done automatically.

I think we can consider this bug fixed for now. Let's sit back and watch whether there are more cases with multiple processes which we need to cover. And then, when those show up, let's reinvestigate the situation.
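
For illustration, the two guesses from comment 14 and this comment can be sketched together as follows. This is a simplified model, not the actual systemd code; the Proc record and function name are made up, and the PID file case is left out.

```python
from collections import namedtuple

# Hypothetical record for one process found in the service's cgroup.
Proc = namedtuple("Proc", "pid ppid")


def detect_main_pid(procs):
    """Guess the main PID of a SysV service, or return None if ambiguous."""
    # Comment 14: a single process in the cgroup must be the daemon.
    if len(procs) == 1:
        return procs[0].pid
    # This comment: exactly one child of PID 1 must be the main daemon.
    daemons = [p for p in procs if p.ppid == 1]
    if len(daemons) == 1:
        return daemons[0].pid
    # Several daemons (or none): no automatic choice is made.
    return None


# Made-up parent PIDs loosely modelled on the CUPS example from comment 2.
print(detect_main_pid([Proc(1181, 1), Proc(1195, 1181)]))
# Two independent daemons started by one script, e.g. smbd and nmbd.
print(detect_main_pid([Proc(2001, 1), Proc(2002, 1)]))
```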

Closing this for now.

Comment 16 Steven Dake 2011-07-04 18:37:56 UTC

Reopening.  rgmanager and pacemaker are nonfunctional on F15 because of invalid init script status reporting.

Comment 17 Steven Dake 2011-07-05 18:31:37 UTC

Satisfied with the pidfile solution.  Apologies for the noise.