848172 – systemd upon upgrade to F17 breaks working sendmail

Bug 848172 - systemd upon upgrade to F17 breaks working sendmail

Summary: systemd upon upgrade to F17 breaks working sendmail

Keywords:
Status:	CLOSED WONTFIX
Alias:	None
Product:	Fedora
Classification:	Fedora
Component:	dracut
Sub Component:
Version:	17
Hardware:	Unspecified
OS:	Unspecified
Priority:	unspecified
Severity:	unspecified
Target Milestone:	---
Assignee:	dracut-maint
QA Contact:	Fedora Extras Quality Assurance
Docs Contact:
URL:
Whiteboard:
Depends On:
Blocks:
TreeView+	depends on / blocked

Reported:	2012-08-14 19:55 UTC by Michal Jaegermann
Modified:	2013-07-31 23:59 UTC (History)
CC List:	14 users (show)
Fixed In Version:
Clone Of:
Environment:
Last Closed:	2013-07-31 23:59:09 UTC
Type:	Bug
Embargoed:
Dependent Products:

Attachments	(Terms of Use)
results of 'strace systemctl start sm-client.service' (11.90 KB, text/plain) 2012-08-14 19:57 UTC, Michal Jaegermann	no flags	Details
View All

Description Michal Jaegermann 2012-08-14 19:55:53 UTC

Description of problem:

After a switch of a Fedora installation from F16 to F17 previously working sendmail ceased to work. Status checks with systemctl invariably show "Active: failed (Result: timeout) ....".  It turns out that sendmail.service appears to be waiting for sm-client.service but for that I am getting:

# systemctl status sm-client.service
sm-client.service - Sendmail Mail Transport Client
	  Loaded: loaded (/usr/lib/systemd/system/sm-client.service; enabled)
	  Active: inactive (dead)
	  CGroup: name=systemd:/system/sm-client.service

Attempts on command line to do "systemctl start sm-client.service" just sit there doing nothing beyond blocking a shell prompt.

OTOH - running directly from shell

/usr/sbin/sendmail -L sm-msp-queue -Ac -q1h
/usr/sbin/sendmail -bd -q1h

immediately provides functioning sendmail service which works as it used to
(while systemctl knows nothing about it).

Running 'strace systemctl start sm-client.service' (results attached) show
systemd sitting in 'poll' and waiting for who knows what.

'systemctl --failed' reports only this job:

sendmail.service loaded failed failed     Sendmail Mail Transport Agent


Version-Release number of selected component (if applicable):
systemd-44-17.fc17

How reproducible:
on every boot

Expected results:
systemd not breaking active services

Additional info:
That particular installation is a small headless server with no desktop files.

Comment 1 Michal Jaegermann 2012-08-14 19:57:23 UTC

Created attachment 604430 [details]
results of 'strace systemctl start sm-client.service'

Comment 2 Michal Schmidt 2012-08-15 11:53:10 UTC

(In reply to comment #0)
> Attempts on command line to do "systemctl start sm-client.service" just sit
> there doing nothing beyond blocking a shell prompt.

While it's stuck here, run "systemctl status sm-client.service" from another shell and paste the output here.

> Running 'strace systemctl start sm-client.service' (results attached) show
> systemd sitting in 'poll' and waiting for who knows what.

I'm pretty sure it waits for systemd (PID 1) to reply with the message about the job completion. stracing systemctl is rarely useful, because the work is done by systemd.

Comment 3 Michal Schmidt 2012-08-15 11:59:00 UTC

[ Reassigning to sendmail tentatively. Jaroslav, were there any similar reports of sendmail breaking after upgrade from F16 to F17? ]

Comment 4 Jaroslav Škarvada 2012-08-15 12:12:47 UTC

(In reply to comment #3)
> [ Reassigning to sendmail tentatively. Jaroslav, were there any similar
> reports of sendmail breaking after upgrade from F16 to F17? ]

No one similar as far as i know. I upgraded several boxes without problem.

Comment 5 Michal Schmidt 2012-08-15 12:31:43 UTC

Michal J.,
which method did you use for the upgrade from F16?
Could this be a SELinux related problem (do you run in enforcing mode)?

Comment 6 Michal Jaegermann 2012-08-15 16:10:20 UTC

(In reply to comment #2)
> (In reply to comment #0)
> > Attempts on command line to do "systemctl start sm-client.service" just sit
> > there doing nothing beyond blocking a shell prompt.
> 
> While it's stuck here, run "systemctl status sm-client.service" from another
> shell and paste the output here.

It says:

sm-client.service - Sendmail Mail Transport Client
	  Loaded: loaded (/usr/lib/systemd/system/sm-client.service; enabled)
	  Active: inactive (dead)
	  CGroup: name=systemd:/system/sm-client.service

If you will wait long enough then for "start" then it eventually times out
and it prints: "A dependency job failed. See system journal for details."
I could not find any other details beyond this "dependency".

> stracing systemctl is rarely useful, because the work is
> done by systemd.

So how one can debug such problems?  There do not seem to be any debug facilities described for systemctl and booting with systemd.log_level=debug gets me the following:

[    4.964407] systemd[1]: Installed new job sm-client.service/start as 78
[    4.964415] systemd[1]: Installed new job sendmail.service/start as 79
[   14.843971] systemd[1]: sendmail.service changed dead -> start-pre
[   14.930356] systemd[1]: Child 669 belongs to sendmail.service
[   14.930384] systemd[1]: sendmail.service: control process exited, code=exited status=0
[   14.930599] systemd[1]: sendmail.service running next control command for state start-pre
[   15.058630] systemd[1]: Child 682 belongs to sendmail.service
[   15.058658] systemd[1]: sendmail.service: control process exited, code=exited status=0
[   15.058856] systemd[1]: sendmail.service got final SIGCHLD for state start-pre
[   15.058952] systemd[1]: About to execute: /usr/sbin/sendmail -bd $SENDMAIL_OPTS $SENDMAIL_OPTARG
[   15.059834] systemd[1]: Forked /usr/sbin/sendmail as 702
[   15.059943] systemd[1]: sendmail.service changed start-pre -> start
[   15.267173] systemd[1]: Received SIGCHLD from PID 702 (sendmail).
[   15.268757] systemd[1]: Got SIGCHLD for process 702 (sendmail)
[   15.268855] systemd[1]: Child 702 belongs to sendmail.service
[   15.268869] systemd[1]: sendmail.service: control process exited, code=exited status=0
[   15.268940] systemd[1]: sendmail.service got final SIGCHLD for state start
[   15.268951] systemd[1]: PID file /run/sendmail.pid not readable (yet?) after start.
[   15.269229] systemd[1]: Setting watch for sendmail.service's PID file /run/sendmail.pid
[  105.058966] systemd[1]: sendmail.service operation timed out. Terminating.
[  105.059852] systemd[1]: Stopping watch for sendmail.service's PID file /run/sendmail.pid
[  105.059911] systemd[1]: sendmail.service changed start -> final-sigterm
[  105.064376] systemd[1]: Received SIGCHLD from PID 709 (sendmail).
[  105.072906] systemd[1]: Got SIGCHLD for process 709 (sendmail)
[  105.080733] systemd[1]: sendmail.service: cgroup is empty
[  105.080763] systemd[1]: sendmail.service changed final-sigterm -> failed
[  105.080861] systemd[1]: Job sendmail.service/start finished, result=failed
[  105.081061] systemd[1]: Job sm-client.service/start finished, result=dependency
[  105.081132] systemd[1]: Job sm-client.service/start failed with result 'dependency'.
[  105.081182] systemd[1]: Unit sendmail.service entered failed state.

leaving me not much wiser.

> .... I upgraded several boxes without problem.

I have another installation of F17 upgraded from F15 and on this one sendmail starts like it was supposed to be. I very much dislike "now you see it, now you don't" problems. That particular upgrade was in fact very messy as anaconda screwed up badly and bailed out in the middle leaving me with a non-obvious recovery job which required quite a bit of guessing. 

As a matter of fact sendmail fails for me to start in an exactly the same way as here on my rawhide installation (systemd-187-3.fc18, sendmail-8.14.5-15.fc18).  I did not report that so far as a still outstanding bug 840242 makes to fail who knows what there and I was hoping that eventually all of this will get straighten out.

> which method did you use for the upgrade from F16?

That particular one was upgraded with "yum distrosync..." and went through without any problems.

> Could this be a SELinux related problem.

I do not think so.  Normally the box in question is indeed in an enforcing mode but switching it to permissive does not change anything.  Moreover on a rawhide installation SELinux is just turned off and sm-client.service still end up "dead".

Comment 7 Michal Schmidt 2012-08-15 16:26:41 UTC

(In reply to comment #6)
> [   15.268951] systemd[1]: PID file /run/sendmail.pid not readable (yet?)
> after start.
> [   15.269229] systemd[1]: Setting watch for sendmail.service's PID file
> /run/sendmail.pid
> [  105.058966] systemd[1]: sendmail.service operation timed out. Terminating.

This actually shows that the problem is with sendmail.service itself, not with sm-client.service. sm-client.service was the one waiting for the other service to complete, not the other way around.

So we should be looking into "systemctl status sendmail.service" while it's stuck. Also check if /run/sendmail.pid exists at that point and contains the expected PID number.

Comment 8 Michal Jaegermann 2012-08-15 17:21:03 UTC

> So we should be looking into "systemctl status sendmail.service" while it's
> stuck.

Until it will timeout and enter into a failed state it says: "Active: activating (start) ..." and "1412 sendmail: accepting connections" and "sendmail[1412]: starting daemon (8.14.5): ...".

Actually it does NOT accept connections, regardless what it says, because
sm-client part is not running.

After it timed out - see above.

> Also check if /run/sendmail.pid exists at that point and contains the
> expected PID number.

Neither /run/sm-client.pid nor /run/sendmail.pid exists.  In case you ask - files 
/usr/lib/systemd/system/sendmail.service and
/usr/lib/systemd/system/sm-client.service are not modified in any way.

Once again.  If I will run as root

/usr/sbin/sendmail -L sm-msp-queue -Ac -q1h
/usr/sbin/sendmail -bd -q1h

then I have a working sendmail even if systemctl will report, as expected, that these services failed.  OTOH /run/sm-client.pid and /run/sendmail.pid are still not there.  /run has "drwxr-xr-x. root root system_u:object_r:var_run_t:s0" permissions.

Comment 9 Michal Schmidt 2012-08-15 18:47:51 UTC

By default sendmail writes its pid to /var/run/sendmail.pid.
Please check that /var/run is a symlink to /run.

Comment 10 Michal Jaegermann 2012-08-15 21:40:15 UTC

(In reply to comment #9)
> By default sendmail writes its pid to /var/run/sendmail.pid.
> Please check that /var/run is a symlink to /run.

Ah, that could be it.  Only the current corresponding package is
filesystem-3-2.fc17.x86_64 and 

# rpm -qlv filesystem | egrep 'run|lock'
drwxr-xr-x    2 root    root                        0 Feb  3  2012 /run
drwxr-xr-x    2 root    root                        0 Feb  3  2012 /run/lock
drwxr-xr-x    2 root    root                        0 Feb  3  2012 /var/lock
drwxr-xr-x    2 root    root                        0 Feb  3  2012 /var/lock/subsys
drwxr-xr-x    2 root    root                        0 Feb  3  2012 /var/run

OTOH 'rpm -q --scripts filesystem' gives:
....
postinstall scriptlet (using <lua>):
posix.symlink("../run", "/var/run")
posix.symlink("../run/lock", "/var/lock")

Nothing in my logs gives any indication that this scriptlet failed but it looks like that it did as /var/lock and /var/run were not symlinks.  Now I changed them into symlinks, taking care of a contents in the process, and sendmail.service indeed starts.

Just for an extra attraction I looked at one of F16 installations around.  This one was definitely upgraded, from F14 to F16, with a help of anaconda.  In its filesystem-2.4.44-1.fc16 package scripts I can also find:

postinstall scriptlet (using <lua>):
posix.symlink("../run", "/var/run")
posix.symlink("../run/lock", "/var/lock")

The catch is that /var/run and /var/lock are surely not symlinks, despite of the above, and so far everything which was supposed to run seems to be running.  That includes sendmail.service.  What is really going on?  Was I just "lucky" so far?  And what is really messing up here?

Those postinstall scriptlets seem to be overly simplicistic on file systems where /var/run and /var/lock already exist but I am not sure what lua can/should do.  Naively I would expect these scripletes to fail.  I am not sure what and when created symlinks in question on that F17 installation where they are indeed present.

Comment 11 Kay Sievers 2012-08-15 22:57:36 UTC

(In reply to comment #10)
> Those postinstall scriptlets seem to be overly simplicistic on file systems
> where /var/run and /var/lock already exist but I am not sure what lua
> can/should do.  Naively I would expect these scripletes to fail.

It's the expected behaviour, they should fail. filesystem.rpm is not supposed
to mangle the filesystem, but only to create the symlinks if they are not
already there like in new installations.

The conversion from /var/run,lock directories to symlinks happens by a
dracut script/module at the same time the "usermove" is done, where /bin,
/sbin, /lib, /lib64 are replaced by symlinks to /usr.

Comment 12 Michal Jaegermann 2012-08-15 23:55:05 UTC

(In reply to comment #11)

> 
> The conversion from /var/run,lock directories to symlinks happens by a
> dracut script/module at the same time the "usermove" is done, where /bin,
> /sbin, /lib, /lib64 are replaced by symlinks to /usr.

Well, that script run, both in the case described here and on my rawhide installation and /bin, /sbin, /lib and /lib64 are symlinks to /usr.  Still
/var/run and /var/lock were left intact.  Issues with conversions in dracut?

> ... they should fail. filesystem.rpm is not supposed to mangle the filesystem

So what is the point of this postinstall scriptlet?  First you are creating a layout by installing filesystem.rpm and after that you run some actions which are supposed to fail.  Right?  OK.

Apparently all of this does not hit in F16 because both /run and /var/run are just mount points for the same tmpfs.

Comment 13 Kay Sievers 2012-08-16 00:14:00 UTC

(In reply to comment #12)

> Well, that script run, both in the case described here and on my rawhide
> installation and /bin, /sbin, /lib and /lib64 are symlinks to /usr.  Still
> /var/run and /var/lock were left intact.  Issues with conversions in dracut?

Maybe, but it worked for most people as it seems. It's the first report
about it. Maybe you have /var on a separate volume and it was not mounted
at the time anaconda ran the script.

> So what is the point of this postinstall scriptlet?  First you are creating
> a layout by installing filesystem.rpm and after that you run some actions
> which are supposed to fail.  Right?  OK.

RPM can not change the fs layout, meaning converting links to directories or
the other way around need to be done outside of RPM.

With the lua code, new installations get the new layout right away, old ones
were still (before we did the usrmove) be able to install the same package.

filesystem.rpm should be fine the way it is. When we drop the upgrade path
from non-usermove systems (after F18) we can remove all that.

Comment 14 Michal Jaegermann 2012-08-16 01:03:45 UTC

(In reply to comment #13)
> Maybe you have /var on a separate volume and it was not mounted
> at the time anaconda ran the script.

That particular small headless server has only /, /home and swap partitions.

I am afraid that now I will be not able to reproduce steps which caused observed results.  My general impression is that with every "improvement" the whole system gets more and more tangled leading to unpredictable side-effects.
 
> With the lua code, new installations get the new layout right away,

I am afraid that I do not see how.  In a step one you unpack, while installing, a content of filesystem.rpm and you create, among other things, /run and /var/run directories (possibly relative to some other mountpoint like /mnt/sysroot/ or whatever). In a step two postinstall scriptlet runs and is mangling just created layout;  but if it is running two days later, say, it is prevented from doing just that.  I possibly miss something but this sounds weird to me.

Comment 15 Michal Schmidt 2012-08-16 07:07:52 UTC

Can you find the exact version of dracut that you used to make the convertfs step of the upgrade? "yum history info dracut" should help.

(In reply to comment #14)
> In a step one you unpack, while installing, a content of filesystem.rpm and
> you create, among other things, /run and /var/run directories

No. /var/run is marked %ghost in filesystem.rpm, so it does not get unpacked. For new installations the postinstall script is therefore able to create the symlink.

Comment 16 Jaroslav Škarvada 2012-08-20 09:49:18 UTC

It doesn't seem to be sendmail issue, reassigning to dracut (keeping myself in the CC).

Comment 17 Michal Jaegermann 2012-08-21 19:43:44 UTC

Yesterday I had a chance to run F16->F17 upgrade and I tried to watch pretty carefully what happens when dracut runs "convertfs".  A creation of /var/run
symlink was reported but nothing like that happened with /var indeed on a separate file system (the system owner managed in the past to overflow / when some program went haywire and started unlimited writes to /var/tmp; a malicious user can do that too but that should leave at least "a reserve").  If anything mentioned /var/lock link I missed it.

The most obvious fix seems to be to "convertfs" to check if /var should not be mounted and do that before making links in question.

That leaves systems which were already converted but maybe not entirely.  In such case it seems not too difficult to run in a boot sequence a service which checks if /var/run is mounted tmpfs; if not if /var/run is a link and if not then does the required conversion (possibly followed by "... | cpio -pmd /var/run" to copy what needed) and the same for /var/lock.  Such service would do anything just once.

In that particular case I performed missing steps "manually" and simply rebooted.

Comment 18 yourrealname 2012-10-24 13:15:02 UTC

Had the same issue. Changed the /lib/systemd/system/sendmail.service to use:

PIDFile=/var/run/sendmail.pid

instead of /run/sendmail.pid

Sendmail starts happily.

Comment 19 Jaroslav Škarvada 2012-10-24 13:54:44 UTC

(In reply to comment #18)
> Had the same issue. Changed the /lib/systemd/system/sendmail.service to use:
> 
> PIDFile=/var/run/sendmail.pid
> 
> instead of /run/sendmail.pid
> 
> Sendmail starts happily.

This indicate that your filesystem layout is wrong. The /var/run must be symlink to /run, otherwise the upgrade went wrong.

Comment 20 Michal Jaegermann 2012-10-24 18:56:42 UTC

(In reply to comment #19)

> 
> This indicate that your filesystem layout is wrong. The /var/run must be
> symlink to /run, otherwise the upgrade went wrong.

Actually it is expected that /var/run is a symlink to /run AND /var/lock
is a symlink to /run/lock.  With /var on a separate file system upgrade does not handle that.  It is easy to correct that manually "post-upgrade", if a need will arise, but one needs to be aware of this gotcha.

Comment 21 yourrealname 2012-10-24 19:11:30 UTC

Imo, these symlinks are hacky and make things harder to track and understand. Which one is it going to be? /var/run or /run; /var/lock or /run/lock? The /var seems preferable.

Comment 22 Jóhann B. Guðmundsson 2012-10-24 19:15:01 UTC

No what the symlink points to is the preferable one since it is the *actual* location...

Comment 23 yourrealname 2012-10-24 19:17:49 UTC

I'm game for either; as long as it is one and only one place.

Comment 24 Sergio Basto 2012-11-01 18:21:02 UTC

(In reply to comment #18)
> Had the same issue. Changed the /lib/systemd/system/sendmail.service to use:
> 
> PIDFile=/var/run/sendmail.pid
> 
> instead of /run/sendmail.pid
> 
> Sendmail starts happily.

This doesn't fix nothing. On restart I got same error, 

systemctl restart sendmail.service
Nov  1 18:16:54 segulix systemd[1]: PID file /var/run/sendmail.pid not readable (yet?) after start.


and with /run/sendmail.pid also start correctly 

ll /var/run
rwxrwxrwx. 1 root root 6 Jun  2 20:20 /var/run -> ../run

Comment 25 Fedora End Of Life 2013-07-03 22:15:05 UTC

This message is a reminder that Fedora 17 is nearing its end of life.
Approximately 4 (four) weeks from now Fedora will stop maintaining
and issuing updates for Fedora 17. It is Fedora's policy to close all
bug reports from releases that are no longer maintained. At that time
this bug will be closed as WONTFIX if it remains open with a Fedora 
'version' of '17'.

Package Maintainer: If you wish for this bug to remain open because you
plan to fix it in a currently maintained version, simply change the 'version' 
to a later Fedora version prior to Fedora 17's end of life.

Bug Reporter:  Thank you for reporting this issue and we are sorry that 
we may not be able to fix it before Fedora 17 is end of life. If you 
would still like  to see this bug fixed and are able to reproduce it 
against a later version  of Fedora, you are encouraged  change the 
'version' to a later Fedora version prior to Fedora 17's end of life.

Although we aim to fix as many bugs as possible during every release's 
lifetime, sometimes those efforts are overtaken by events. Often a 
more recent Fedora release includes newer upstream software that fixes 
bugs or makes them obsolete.

Comment 26 Fedora End Of Life 2013-07-31 23:59:16 UTC

Fedora 17 changed to end-of-life (EOL) status on 2013-07-30. Fedora 17 is 
no longer maintained, which means that it will not receive any further 
security or bug fix updates. As a result we are closing this bug.

If you can reproduce this bug against a currently maintained version of 
Fedora please feel free to reopen this bug against that version.

Thank you for reporting this bug and we are sorry it could not be fixed.

Note You need to log in before you can comment on or make changes to this bug.