Description of problem:
The gridengine sge_qmaster does not create a PID file on startup (e.g. /run/sgemaster.pid). The current systemd configuration runs it as a SysV-style forking daemon (Type=forking in sgemaster.service). Without a PID file, systemd tries to guess the main PID of a forking service, but this guess is not reliable (search for GuessMainPID in http://www.freedesktop.org/software/systemd/man/systemd.service.html).

On one of our servers, the daemon reliably segfaulted at startup (others work OK). After significant effort tracing the error, I found that systemd had identified the wrong main PID, concluded that the daemon had finished, and sent a SIGTERM to the other processes to tidy up. Unfortunately this signal arrived during initialisation and the daemon blew up with a segfault due to insufficiently defensive coding. So this is a combination of a misidentified main PID and a race condition on receiving a SIGTERM during initialisation.

I'm unsure of the systemd algorithm for guessing the main PID, but it probably just waits for a couple of forks or a short time period, then picks the current top process, so this may be racy too. The guessing process is explicitly described as unreliable in the systemd documentation.

Version-Release number of selected component (if applicable):
gridengine-qmaster-2011.11p1-15.fc19.x86_64

How reproducible:
100% on one server, 0% on another!

Steps to Reproduce:
1. Create a default Grid Engine master:
   dnf install gridengine-qmaster
   cd /usr/share/gridengine/
   ./install_qmaster # pretty much just take defaults
2. systemctl start sgemaster.service

Actual results:
Depends on luck: startup can hit the race condition (=segfault), end in a controlled shutdown due to misidentification of the main process (=master is stopped cleanly), or be lucky and succeed.

Expected results:
Daemon starts up.

Additional info:
Aside from being luckier, I found two fixes to the problem.
One can patch the daemons to write out a PID file and add a PIDFile option to the systemd service unit file. I can provide an example patch for sge_qmaster if that's helpful.

Less invasively, one can add the following lines to /etc/sysconfig/gridengine, which prevent the daemons from forking:

------
# prevent SGE from daemonising qmaster, shadowd, execd
# required for systemd to control this as a "Type=simple" service
# see bugzilla #????
SGE_ND=true
------

and change the unit file type to "Type=simple", where systemd expects the process to continue in the foreground rather than daemonising. This does result in minor spam to /var/log/messages (about 20 lines every 3 mins on my system). One has to change the unit files for sgemaster.service, sge_shadowd.service and sge_execd.service.

Lennart Poettering recommends the foreground approach in another discussion here: http://lists.freedesktop.org/archives/systemd-devel/2011-June/002677.html
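
For illustration, the unit-file side of the foreground approach might be sketched as a systemd drop-in rather than an edit to the packaged unit. This is only a sketch under assumptions: the drop-in file name is made up, the packaged sgemaster.service is assumed to pick up SGE_ND from /etc/sysconfig/gridengine as described above, and its ExecStart is assumed to end up running sge_qmaster in the foreground once SGE_ND is set. The same change would be needed for sge_shadowd.service and sge_execd.service.

```ini
# /etc/systemd/system/sgemaster.service.d/foreground.conf
# (hypothetical drop-in name; any *.conf file in this directory is read)
[Service]
# With SGE_ND=true the daemon does not fork, so systemd tracks the
# process it started and no longer has to guess the main PID.
Type=simple
```

After adding the file, `systemctl daemon-reload` followed by `systemctl restart sgemaster.service` should pick up the new type.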
Does:

PIDFile=/var/spool/gridengine/${SGE_CELL}/qmaster/qmaster.pid

work? Not sure if you can use environment variables in PIDFile. If not, I may just put the default there and a note to change it if needed in /etc/sysconfig/gridengine.

Modifying /etc/sysconfig/gridengine doesn't seem viable as it is %config(noreplace), and I don't like there being stuff sent regularly to the log. It's a bummer that they assume SGE_ND is only for debugging.
Hmm, no, looks like PIDFile can't parse variables.
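
Since PIDFile takes a literal path, the fallback mentioned above (hard-coding the default cell) might look roughly like the fragment below. It is a sketch, not the packaged unit: it assumes the cell name is "default" and that sge_qmaster actually writes its PID to this file, neither of which is confirmed in this report.

```ini
[Service]
Type=forking
# Literal path, no variable expansion. Assumes SGE_CELL=default;
# sites using another cell name would need to edit this line.
PIDFile=/var/spool/gridengine/default/qmaster/qmaster.pid
```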
The log spam is a bit annoying. I took a peek at the code and it seems it's not possible to tune it down much with $SGE_DEBUG_LEVEL, as the regular messages seem to depend purely on $SGE_ND. Some patching could solve this, of course.

It might be relatively easy to patch in an extra environment variable that separates the forking behaviour from the debug output. The critical function (for the qmaster) seems to be sge_daemonize_qmaster() around line 180 in SOURCES/GE2011.11p1/source/daemons/qmaster/sge_qmaster_threads.c; it would simply need to check for the existence of something like $SGE_DONT_FORK and return. I'm not sure what the other consequences of this might be ;) It's also a little bit nasty since arguably that's what SGE_ND (no daemonize) is supposed to mean! This would still require a change to the sysconfig file though.
gridengine-2011.11p1-22.fc20 has been submitted as an update for Fedora 20. https://admin.fedoraproject.org/updates/gridengine-2011.11p1-22.fc20
gridengine-2011.11p1-22.fc19 has been submitted as an update for Fedora 19. https://admin.fedoraproject.org/updates/gridengine-2011.11p1-22.fc19
Package gridengine-2011.11p1-22.fc20:
* should fix your issue,
* was pushed to the Fedora 20 testing repository,
* should be available at your local mirror within two days.
Update it with:
# su -c 'yum update --enablerepo=updates-testing gridengine-2011.11p1-22.fc20'
as soon as you are able to.
Please go to the following url:
https://admin.fedoraproject.org/updates/FEDORA-2014-10362/gridengine-2011.11p1-22.fc20
then log in and leave karma (feedback).
gridengine-2011.11p1-22.fc20 has been pushed to the Fedora 20 stable repository. If problems still persist, please make note of it in this bug report.
gridengine-2011.11p1-22.fc19 has been pushed to the Fedora 19 stable repository. If problems still persist, please make note of it in this bug report.