Created attachment 739107 [details]
systemd unitfile for the shadow master service

Description of problem:
The Grid Engine provides the ability to run a shadow master daemon, which will step in if the primary master node dies. Since the move to systemd, there is no means to start this daemon as a service. Previously it was started by the SysV init script.

Version-Release number of selected component (if applicable):
gridengine-2011.11p1-2.fc17.x86_64
gridengine-qmaster-2011.11p1-2.fc17.x86_64

From browsing http://koji.fedoraproject.org/koji/buildinfo?buildID=410861 I believe this request is still relevant in F19 and up.

How reproducible:
100%

Steps to Reproduce:
1. configure a machine as a shadow master (add it to $SGE_ROOT/$SGE_CELL/common/shadow_masters)
2a. systemctl start sge_shadowmaster.service
or
2b. systemctl start sgemaster.service

Actual results:
2a. No such unit file.
2b. The service fails, because the node is not the primary master.

Expected results:
2a. /usr/bin/sge_shadowd is started
2b. The system notices that this node is not the specified master but is listed in $SGE_ROOT/$SGE_CELL/common/shadow_masters, and so starts a shadow daemon instead.

Additional info:
In the SysV init script, there were a number of functions to check whether the host is a shadow master, etc. One approach would be to port these over.

An easy but less complete alternative is to add a new unit file for the shadow master. This co-exists nicely with sgemaster.service (though that will fail if the node isn't a master). An example unit file is attached.

For those attempting to run automatic failover, note that:
- the spool should be on a shared filesystem (e.g. NFS)
- when the master is shut down cleanly, e.g. by a systemctl stop or a reboot, it leaves a lock file in $SGE_ROOT/$SGE_CELL/qmaster/lock. This file must be deleted before a shadow master will take over (to have it removed automatically, add an ExecStopPost= command to the systemd sgemaster unit file).
- sge_shadowd apparently execs the qmaster from $SGE_ROOT/$SGE_CELL/../bin, so be cautious if symlinking $SGE_CELL, as the ".." then traces back up the wrong filesystem

Packaging note (minor):
The files associated with the shadow daemon are currently in the "gridengine" package and would probably fit better in the "gridengine-qmaster" package. These are:
/usr/bin/sge_shadowd
/usr/share/gridengine/bin/linux-x64/sge_shadowd
/usr/share/gridengine/util/sgeSMF/.svn/text-base/shadowd_template.xml.svn-base (probably shouldn't be packaged at all)
/usr/share/gridengine/util/sgeSMF/shadowd_template.xml
/usr/share/man/man8/sge_shadowd.8.gz

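For illustration only, the kind of host-role check the SysV script performed could be sketched in shell roughly as follows. This is not the original init script code: the function name sge_role is hypothetical, it assumes that $SGE_ROOT/$SGE_CELL/common/act_qmaster names the current master on its first line and that shadow_masters lists one host per line, and it compares names literally, so short names and FQDNs would not match each other.

```shell
#!/bin/sh
# Sketch (hypothetical helper, not taken from the SysV script):
# sge_role HOST COMMON_DIR prints "master", "shadow", or "none".
# COMMON_DIR stands for $SGE_ROOT/$SGE_CELL/common.
sge_role() {
    host=$1
    common=$2

    # Assumption: act_qmaster holds the acting master's name on line 1.
    if [ -r "$common/act_qmaster" ] &&
       [ "$(head -n 1 "$common/act_qmaster")" = "$host" ]; then
        echo master
    # -x matches whole lines only; -F treats the hostname as a fixed
    # string so that dots are not interpreted as regex wildcards.
    elif [ -r "$common/shadow_masters" ] &&
         grep -qxF -- "$host" "$common/shadow_masters"; then
        echo shadow
    else
        echo none
    fi
}
```

A wrapper could then call sge_role "$(hostname)" "$SGE_ROOT/$SGE_CELL/common" and start sge_qmaster or sge_shadowd depending on the result.
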
Yeah, it seems I've really neglected shadowd functionality (since I don't make use of it myself). I've filed a bug with upstream to see if we can coordinate, although their responsiveness is hit and miss.
I'm going to go with the separate sge_shadowd.service approach.
gridengine-2011.11p1-13.fc17 has been submitted as an update for Fedora 17. https://admin.fedoraproject.org/updates/gridengine-2011.11p1-13.fc17
Package gridengine-2011.11p1-13.fc17:
* should fix your issue,
* was pushed to the Fedora 17 testing repository,
* should be available at your local mirror within two days.
Update it with:
# su -c 'yum update --enablerepo=updates-testing gridengine-2011.11p1-13.fc17'
as soon as you are able to.
Please go to the following url:
https://admin.fedoraproject.org/updates/FEDORA-2013-8929/gridengine-2011.11p1-13.fc17
then log in and leave karma (feedback).

Thanks :) Due to holidays and scheduling a downtime window, we should be able to test it around the first week of June.
This message is a reminder that Fedora 17 is nearing its end of life. Approximately 4 (four) weeks from now Fedora will stop maintaining and issuing updates for Fedora 17. It is Fedora's policy to close all bug reports from releases that are no longer maintained. At that time this bug will be closed as WONTFIX if it remains open with a Fedora 'version' of '17'.

Package Maintainer: If you wish for this bug to remain open because you plan to fix it in a currently maintained version, simply change the 'version' to a later Fedora version prior to Fedora 17's end of life.

Bug Reporter: Thank you for reporting this issue and we are sorry that we may not be able to fix it before Fedora 17 is end of life. If you would still like to see this bug fixed and are able to reproduce it against a later version of Fedora, you are encouraged to change the 'version' to a later Fedora version prior to Fedora 17's end of life.

Although we aim to fix as many bugs as possible during every release's lifetime, sometimes those efforts are overtaken by events. Often a more recent Fedora release includes newer upstream software that fixes bugs or makes them obsolete.

Belatedly adding that I've installed this package and it fired up the shadow daemon correctly. As a side note for future Google hits: those keeping their spool on an automounted NFS mount may wish to add autofs to the unit file dependencies, so that the spool directory can be mounted before the daemon starts, i.e.
After=network.target autofs.service
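In context, that After= line belongs in the [Unit] section of the shadow daemon's unit file. The fragment below is a sketch of that section only, not the packaged unit file. The Wants= line is an assumption added here, not part of the comment above: After= only affects ordering, so without something that pulls in autofs.service it has no effect on a host where autofs is not otherwise started.

```ini
# Fragment of the shadow daemon's unit file; only [Unit] is shown.
[Unit]
# Order startup after the network and the automounter.
After=network.target autofs.service
# Assumption, not from the comment above: After= only orders units,
# so Wants= is added to also pull autofs.service in at start.
Wants=autofs.service
```

After editing a unit file, systemctl daemon-reload makes systemd pick up the change.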
Created attachment 768745 [details]
Optional shadow master configuration variables for addition to /etc/sysconfig/gridengine

Here are some configuration variables relevant to the shadow daemon and how it works. I forget exactly where I got them, but I think it was the manual; they're not so easy to come across otherwise. They're possibly a bit spammy for general users, but might be useful to include in commented-out form.
Thanks, I've checked this in and it will make it out with the next update, whenever that is.
gridengine-2011.11p1-13.fc17 has been pushed to the Fedora 17 stable repository. If problems still persist, please make note of it in this bug report.