Description of problem: After a switch of a Fedora installation from F16 to F17 previously working sendmail ceased to work. Status checks with systemctl invariably show "Active: failed (Result: timeout) ....". It turns out that sendmail.service appears to be waiting for sm-client.service but for that I am getting: # systemctl status sm-client.service sm-client.service - Sendmail Mail Transport Client Loaded: loaded (/usr/lib/systemd/system/sm-client.service; enabled) Active: inactive (dead) CGroup: name=systemd:/system/sm-client.service Attempts on command line to do "systemctl start sm-client.service" just sit there doing nothing beyond blocking a shell prompt. OTOH - running directly from shell /usr/sbin/sendmail -L sm-msp-queue -Ac -q1h /usr/sbin/sendmail -bd -q1h immediately provides functioning sendmail service which works as it used to (while systemctl knows nothing about it). Running 'strace systemctl start sm-client.service' (results attached) show systemd sitting in 'poll' and waiting for who knows what. 'systemctl --failed' reports only this job: sendmail.service loaded failed failed Sendmail Mail Transport Agent Version-Release number of selected component (if applicable): systemd-44-17.fc17 How reproducible: on every boot Expected results: systemd not breaking active services Additional info: That particular installation is a small headless server with no desktop files.
Created attachment 604430 [details] results of 'strace systemctl start sm-client.service'
(In reply to comment #0) > Attempts on command line to do "systemctl start sm-client.service" just sit > there doing nothing beyond blocking a shell prompt. While it's stuck here, run "systemctl status sm-client.service" from another shell and paste the output here. > Running 'strace systemctl start sm-client.service' (results attached) show > systemd sitting in 'poll' and waiting for who knows what. I'm pretty sure it waits for systemd (PID 1) to reply with the message about the job completion. stracing systemctl is rarely useful, because the work is done by systemd.
[ Reassigning to sendmail tentatively. Jaroslav, were there any similar reports of sendmail breaking after upgrade from F16 to F17? ]
(In reply to comment #3) > [ Reassigning to sendmail tentatively. Jaroslav, were there any similar > reports of sendmail breaking after upgrade from F16 to F17? ] No one similar as far as i know. I upgraded several boxes without problem.
Michal J., which method did you use for the upgrade from F16? Could this be a SELinux related problem (do you run in enforcing mode)?
(In reply to comment #2) > (In reply to comment #0) > > Attempts on command line to do "systemctl start sm-client.service" just sit > > there doing nothing beyond blocking a shell prompt. > > While it's stuck here, run "systemctl status sm-client.service" from another > shell and paste the output here. It says: sm-client.service - Sendmail Mail Transport Client Loaded: loaded (/usr/lib/systemd/system/sm-client.service; enabled) Active: inactive (dead) CGroup: name=systemd:/system/sm-client.service If you will wait long enough then for "start" then it eventually times out and it prints: "A dependency job failed. See system journal for details." I could not find any other details beyond this "dependency". > stracing systemctl is rarely useful, because the work is > done by systemd. So how one can debug such problems? There do not seem to be any debug facilities described for systemctl and booting with systemd.log_level=debug gets me the following: [ 4.964407] systemd[1]: Installed new job sm-client.service/start as 78 [ 4.964415] systemd[1]: Installed new job sendmail.service/start as 79 [ 14.843971] systemd[1]: sendmail.service changed dead -> start-pre [ 14.930356] systemd[1]: Child 669 belongs to sendmail.service [ 14.930384] systemd[1]: sendmail.service: control process exited, code=exited status=0 [ 14.930599] systemd[1]: sendmail.service running next control command for state start-pre [ 15.058630] systemd[1]: Child 682 belongs to sendmail.service [ 15.058658] systemd[1]: sendmail.service: control process exited, code=exited status=0 [ 15.058856] systemd[1]: sendmail.service got final SIGCHLD for state start-pre [ 15.058952] systemd[1]: About to execute: /usr/sbin/sendmail -bd $SENDMAIL_OPTS $SENDMAIL_OPTARG [ 15.059834] systemd[1]: Forked /usr/sbin/sendmail as 702 [ 15.059943] systemd[1]: sendmail.service changed start-pre -> start [ 15.267173] systemd[1]: Received SIGCHLD from PID 702 (sendmail). [ 15.268757] systemd[1]: Got SIGCHLD for process 702 (sendmail) [ 15.268855] systemd[1]: Child 702 belongs to sendmail.service [ 15.268869] systemd[1]: sendmail.service: control process exited, code=exited status=0 [ 15.268940] systemd[1]: sendmail.service got final SIGCHLD for state start [ 15.268951] systemd[1]: PID file /run/sendmail.pid not readable (yet?) after start. [ 15.269229] systemd[1]: Setting watch for sendmail.service's PID file /run/sendmail.pid [ 105.058966] systemd[1]: sendmail.service operation timed out. Terminating. [ 105.059852] systemd[1]: Stopping watch for sendmail.service's PID file /run/sendmail.pid [ 105.059911] systemd[1]: sendmail.service changed start -> final-sigterm [ 105.064376] systemd[1]: Received SIGCHLD from PID 709 (sendmail). [ 105.072906] systemd[1]: Got SIGCHLD for process 709 (sendmail) [ 105.080733] systemd[1]: sendmail.service: cgroup is empty [ 105.080763] systemd[1]: sendmail.service changed final-sigterm -> failed [ 105.080861] systemd[1]: Job sendmail.service/start finished, result=failed [ 105.081061] systemd[1]: Job sm-client.service/start finished, result=dependency [ 105.081132] systemd[1]: Job sm-client.service/start failed with result 'dependency'. [ 105.081182] systemd[1]: Unit sendmail.service entered failed state. leaving me not much wiser. > .... I upgraded several boxes without problem. I have another installation of F17 upgraded from F15 and on this one sendmail starts like it was supposed to be. I very much dislike "now you see it, now you don't" problems. That particular upgrade was in fact very messy as anaconda screwed up badly and bailed out in the middle leaving me with a non-obvious recovery job which required quite a bit of guessing. As a matter of fact sendmail fails for me to start in an exactly the same way as here on my rawhide installation (systemd-187-3.fc18, sendmail-8.14.5-15.fc18). I did not report that so far as a still outstanding bug 840242 makes to fail who knows what there and I was hoping that eventually all of this will get straighten out. > which method did you use for the upgrade from F16? That particular one was upgraded with "yum distrosync..." and went through without any problems. > Could this be a SELinux related problem. I do not think so. Normally the box in question is indeed in an enforcing mode but switching it to permissive does not change anything. Moreover on a rawhide installation SELinux is just turned off and sm-client.service still end up "dead".
(In reply to comment #6) > [ 15.268951] systemd[1]: PID file /run/sendmail.pid not readable (yet?) > after start. > [ 15.269229] systemd[1]: Setting watch for sendmail.service's PID file > /run/sendmail.pid > [ 105.058966] systemd[1]: sendmail.service operation timed out. Terminating. This actually shows that the problem is with sendmail.service itself, not with sm-client.service. sm-client.service was the one waiting for the other service to complete, not the other way around. So we should be looking into "systemctl status sendmail.service" while it's stuck. Also check if /run/sendmail.pid exists at that point and contains the expected PID number.
> So we should be looking into "systemctl status sendmail.service" while it's > stuck. Until it will timeout and enter into a failed state it says: "Active: activating (start) ..." and "1412 sendmail: accepting connections" and "sendmail[1412]: starting daemon (8.14.5): ...". Actually it does NOT accept connections, regardless what it says, because sm-client part is not running. After it timed out - see above. > Also check if /run/sendmail.pid exists at that point and contains the > expected PID number. Neither /run/sm-client.pid nor /run/sendmail.pid exists. In case you ask - files /usr/lib/systemd/system/sendmail.service and /usr/lib/systemd/system/sm-client.service are not modified in any way. Once again. If I will run as root /usr/sbin/sendmail -L sm-msp-queue -Ac -q1h /usr/sbin/sendmail -bd -q1h then I have a working sendmail even if systemctl will report, as expected, that these services failed. OTOH /run/sm-client.pid and /run/sendmail.pid are still not there. /run has "drwxr-xr-x. root root system_u:object_r:var_run_t:s0" permissions.
By default sendmail writes its pid to /var/run/sendmail.pid. Please check that /var/run is a symlink to /run.
(In reply to comment #9) > By default sendmail writes its pid to /var/run/sendmail.pid. > Please check that /var/run is a symlink to /run. Ah, that could be it. Only the current corresponding package is filesystem-3-2.fc17.x86_64 and # rpm -qlv filesystem | egrep 'run|lock' drwxr-xr-x 2 root root 0 Feb 3 2012 /run drwxr-xr-x 2 root root 0 Feb 3 2012 /run/lock drwxr-xr-x 2 root root 0 Feb 3 2012 /var/lock drwxr-xr-x 2 root root 0 Feb 3 2012 /var/lock/subsys drwxr-xr-x 2 root root 0 Feb 3 2012 /var/run OTOH 'rpm -q --scripts filesystem' gives: .... postinstall scriptlet (using <lua>): posix.symlink("../run", "/var/run") posix.symlink("../run/lock", "/var/lock") Nothing in my logs gives any indication that this scriptlet failed but it looks like that it did as /var/lock and /var/run were not symlinks. Now I changed them into symlinks, taking care of a contents in the process, and sendmail.service indeed starts. Just for an extra attraction I looked at one of F16 installations around. This one was definitely upgraded, from F14 to F16, with a help of anaconda. In its filesystem-2.4.44-1.fc16 package scripts I can also find: postinstall scriptlet (using <lua>): posix.symlink("../run", "/var/run") posix.symlink("../run/lock", "/var/lock") The catch is that /var/run and /var/lock are surely not symlinks, despite of the above, and so far everything which was supposed to run seems to be running. That includes sendmail.service. What is really going on? Was I just "lucky" so far? And what is really messing up here? Those postinstall scriptlets seem to be overly simplicistic on file systems where /var/run and /var/lock already exist but I am not sure what lua can/should do. Naively I would expect these scripletes to fail. I am not sure what and when created symlinks in question on that F17 installation where they are indeed present.
(In reply to comment #10) > Those postinstall scriptlets seem to be overly simplicistic on file systems > where /var/run and /var/lock already exist but I am not sure what lua > can/should do. Naively I would expect these scripletes to fail. It's the expected behaviour, they should fail. filesystem.rpm is not supposed to mangle the filesystem, but only to create the symlinks if they are not already there like in new installations. The conversion from /var/run,lock directories to symlinks happens by a dracut script/module at the same time the "usermove" is done, where /bin, /sbin, /lib, /lib64 are replaced by symlinks to /usr.
(In reply to comment #11) > > The conversion from /var/run,lock directories to symlinks happens by a > dracut script/module at the same time the "usermove" is done, where /bin, > /sbin, /lib, /lib64 are replaced by symlinks to /usr. Well, that script run, both in the case described here and on my rawhide installation and /bin, /sbin, /lib and /lib64 are symlinks to /usr. Still /var/run and /var/lock were left intact. Issues with conversions in dracut? > ... they should fail. filesystem.rpm is not supposed to mangle the filesystem So what is the point of this postinstall scriptlet? First you are creating a layout by installing filesystem.rpm and after that you run some actions which are supposed to fail. Right? OK. Apparently all of this does not hit in F16 because both /run and /var/run are just mount points for the same tmpfs.
(In reply to comment #12) > Well, that script run, both in the case described here and on my rawhide > installation and /bin, /sbin, /lib and /lib64 are symlinks to /usr. Still > /var/run and /var/lock were left intact. Issues with conversions in dracut? Maybe, but it worked for most people as it seems. It's the first report about it. Maybe you have /var on a separate volume and it was not mounted at the time anaconda ran the script. > So what is the point of this postinstall scriptlet? First you are creating > a layout by installing filesystem.rpm and after that you run some actions > which are supposed to fail. Right? OK. RPM can not change the fs layout, meaning converting links to directories or the other way around need to be done outside of RPM. With the lua code, new installations get the new layout right away, old ones were still (before we did the usrmove) be able to install the same package. filesystem.rpm should be fine the way it is. When we drop the upgrade path from non-usermove systems (after F18) we can remove all that.
(In reply to comment #13) > Maybe you have /var on a separate volume and it was not mounted > at the time anaconda ran the script. That particular small headless server has only /, /home and swap partitions. I am afraid that now I will be not able to reproduce steps which caused observed results. My general impression is that with every "improvement" the whole system gets more and more tangled leading to unpredictable side-effects. > With the lua code, new installations get the new layout right away, I am afraid that I do not see how. In a step one you unpack, while installing, a content of filesystem.rpm and you create, among other things, /run and /var/run directories (possibly relative to some other mountpoint like /mnt/sysroot/ or whatever). In a step two postinstall scriptlet runs and is mangling just created layout; but if it is running two days later, say, it is prevented from doing just that. I possibly miss something but this sounds weird to me.
Can you find the exact version of dracut that you used to make the convertfs step of the upgrade? "yum history info dracut" should help. (In reply to comment #14) > In a step one you unpack, while installing, a content of filesystem.rpm and > you create, among other things, /run and /var/run directories No. /var/run is marked %ghost in filesystem.rpm, so it does not get unpacked. For new installations the postinstall script is therefore able to create the symlink.
It doesn't seem to be sendmail issue, reassigning to dracut (keeping myself in the CC).
Yesterday I had a chance to run F16->F17 upgrade and I tried to watch pretty carefully what happens when dracut runs "convertfs". A creation of /var/run symlink was reported but nothing like that happened with /var indeed on a separate file system (the system owner managed in the past to overflow / when some program went haywire and started unlimited writes to /var/tmp; a malicious user can do that too but that should leave at least "a reserve"). If anything mentioned /var/lock link I missed it. The most obvious fix seems to be to "convertfs" to check if /var should not be mounted and do that before making links in question. That leaves systems which were already converted but maybe not entirely. In such case it seems not too difficult to run in a boot sequence a service which checks if /var/run is mounted tmpfs; if not if /var/run is a link and if not then does the required conversion (possibly followed by "... | cpio -pmd /var/run" to copy what needed) and the same for /var/lock. Such service would do anything just once. In that particular case I performed missing steps "manually" and simply rebooted.
Had the same issue. Changed the /lib/systemd/system/sendmail.service to use: PIDFile=/var/run/sendmail.pid instead of /run/sendmail.pid Sendmail starts happily.
(In reply to comment #18) > Had the same issue. Changed the /lib/systemd/system/sendmail.service to use: > > PIDFile=/var/run/sendmail.pid > > instead of /run/sendmail.pid > > Sendmail starts happily. This indicate that your filesystem layout is wrong. The /var/run must be symlink to /run, otherwise the upgrade went wrong.
(In reply to comment #19) > > This indicate that your filesystem layout is wrong. The /var/run must be > symlink to /run, otherwise the upgrade went wrong. Actually it is expected that /var/run is a symlink to /run AND /var/lock is a symlink to /run/lock. With /var on a separate file system upgrade does not handle that. It is easy to correct that manually "post-upgrade", if a need will arise, but one needs to be aware of this gotcha.
Imo, these symlinks are hacky and make things harder to track and understand. Which one is it going to be? /var/run or /run; /var/lock or /run/lock? The /var seems preferable.
No what the symlink points to is the preferable one since it is the *actual* location...
I'm game for either; as long as it is one and only one place.
(In reply to comment #18) > Had the same issue. Changed the /lib/systemd/system/sendmail.service to use: > > PIDFile=/var/run/sendmail.pid > > instead of /run/sendmail.pid > > Sendmail starts happily. This doesn't fix nothing. On restart I got same error, systemctl restart sendmail.service Nov 1 18:16:54 segulix systemd[1]: PID file /var/run/sendmail.pid not readable (yet?) after start. and with /run/sendmail.pid also start correctly ll /var/run rwxrwxrwx. 1 root root 6 Jun 2 20:20 /var/run -> ../run
This message is a reminder that Fedora 17 is nearing its end of life. Approximately 4 (four) weeks from now Fedora will stop maintaining and issuing updates for Fedora 17. It is Fedora's policy to close all bug reports from releases that are no longer maintained. At that time this bug will be closed as WONTFIX if it remains open with a Fedora 'version' of '17'. Package Maintainer: If you wish for this bug to remain open because you plan to fix it in a currently maintained version, simply change the 'version' to a later Fedora version prior to Fedora 17's end of life. Bug Reporter: Thank you for reporting this issue and we are sorry that we may not be able to fix it before Fedora 17 is end of life. If you would still like to see this bug fixed and are able to reproduce it against a later version of Fedora, you are encouraged change the 'version' to a later Fedora version prior to Fedora 17's end of life. Although we aim to fix as many bugs as possible during every release's lifetime, sometimes those efforts are overtaken by events. Often a more recent Fedora release includes newer upstream software that fixes bugs or makes them obsolete.
Fedora 17 changed to end-of-life (EOL) status on 2013-07-30. Fedora 17 is no longer maintained, which means that it will not receive any further security or bug fix updates. As a result we are closing this bug. If you can reproduce this bug against a currently maintained version of Fedora please feel free to reopen this bug against that version. Thank you for reporting this bug and we are sorry it could not be fixed.