Bug 955768 - Grid Engine shadow master functionality not present in systemd-enabled systems
Summary: Grid Engine shadow master functionality not present in systemd-enabled systems
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: Fedora
Classification: Fedora
Component: gridengine
Version: 17
Hardware: All
OS: Linux
Priority: unspecified
Severity: low
Target Milestone: ---
Assignee: Orion Poplawski
QA Contact: Fedora Extras Quality Assurance
URL: https://sourceforge.net/tracker/?grou...
Whiteboard:
Depends On:
Blocks:
Reported: 2013-04-23 18:38 UTC by Mike Grant
Modified: 2013-07-30 17:33 UTC (History)

Fixed In Version: gridengine-2011.11p1-13.fc17
Doc Type: Bug Fix
Doc Text:
Clone Of:
Environment:
Last Closed: 2013-07-30 17:33:41 UTC
Type: Bug
Embargoed:


Attachments
systemd unitfile for the shadow master service (208 bytes, text/plain)
2013-04-23 18:38 UTC, Mike Grant
Optional shadow master configuration variables for addition to /etc/sysconfig/gridengine (2.32 KB, text/plain)
2013-07-04 12:18 UTC, Mike Grant

Description Mike Grant 2013-04-23 18:38:26 UTC
Created attachment 739107
systemd unitfile for the shadow master service

Description of problem:
Grid Engine can run a shadow master daemon, which steps in if the primary master node dies.  Since the move to systemd, there has been no way to start this daemon as a service.  Previously it was started by the SysV init script.

Version-Release number of selected component (if applicable):
gridengine-2011.11p1-2.fc17.x86_64
gridengine-qmaster-2011.11p1-2.fc17.x86_64

From browsing http://koji.fedoraproject.org/koji/buildinfo?buildID=410861 I believe this request is still relevant in F19 and up.

How reproducible:
100%

Steps to Reproduce:
1. configure a machine as a shadow master (add to $SGE_ROOT/$SGE_CELL/common/shadow_masters)
2a. systemctl start sge_shadowmaster.service
 or
2b. systemctl start sgemaster.service
  
Actual results:
2a. No such unit file.
2b. service fails, because the node is not the primary master

Expected results:
2a. /usr/bin/sge_shadowd is started
2b. system notices this node is not the specified master, but is in $SGE_ROOT/$SGE_CELL/common/shadow_masters, and so starts a shadow daemon instead

Additional info:

The SysV init script had a number of functions to check whether the host is a shadow master, etc.  One approach would be to port these over.  An easier but less complete alternative is to add a new unit file for the shadow master.  This co-exists nicely with the sgemaster unit (though sgemaster will fail if the node isn't a master).  An example unit file is attached.
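
To illustrate the "port the checks over" approach, here is a rough sketch of the kind of host check involved.  It is not the code from the old init script.  The function name sge_role is made up for this example, and it assumes that $SGE_ROOT/$SGE_CELL/common/act_qmaster names the current master host and that shadow_masters lists one hostname per line.  It compares hostnames literally, with no DNS or case handling.

```shell
#!/bin/sh
# Sketch only: report which Grid Engine role a host would take.
# Assumes SGE_ROOT and SGE_CELL are set; sge_role is a hypothetical helper.

sge_role() {
    host=$1
    common="$SGE_ROOT/$SGE_CELL/common"

    # Assumption: the first line of act_qmaster is the current master host.
    if [ -r "$common/act_qmaster" ] &&
       [ "$(head -n 1 "$common/act_qmaster")" = "$host" ]; then
        echo master
    # Otherwise look for an exact, whole-line match in shadow_masters.
    elif [ -r "$common/shadow_masters" ] &&
         grep -qxF -- "$host" "$common/shadow_masters"; then
        echo shadow
    else
        echo none
    fi
}
```

A wrapper could call sge_role "$(hostname)" and start the qmaster or /usr/bin/sge_shadowd depending on the answer; the separate unit file avoids needing any of this.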

For those attempting to run automatic failover, note that:
 - the spool should be on a shared filesystem (e.g. NFS)
 - when the master is shut down cleanly, e.g. by a systemctl stop or a reboot, it leaves a lock file in $SGE_ROOT/$SGE_CELL/qmaster/lock.  This file must be deleted before shadow masters will take over (to have it removed automatically, add an ExecStopPost= line to the systemd sgemaster unit file).
 - sge_shadowd apparently execs the qmaster from $SGE_ROOT/$SGE_CELL/../bin, so be cautious if $SGE_CELL is a symlink: the ".." is then resolved from the symlink's target and can lead up the wrong filesystem
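
The last two points can be checked by hand with something like the following sketch.  Both function names are made up for this example, the lock path is the one given above, and SGE_ROOT and SGE_CELL are assumed to be set.  The first function could be what a stop hook in the sgemaster unit runs; the second prints where $SGE_ROOT/$SGE_CELL/../bin really ends up once symlinks are followed.

```shell
#!/bin/sh
# Sketch only; assumes SGE_ROOT and SGE_CELL are set.
# clear_qmaster_lock and physical_bin_dir are hypothetical helper names.

# Remove the lock file left behind by a clean qmaster shutdown.
clear_qmaster_lock() {
    lock="$SGE_ROOT/$SGE_CELL/qmaster/lock"
    if [ -e "$lock" ]; then
        rm -f -- "$lock" && echo "removed $lock"
    else
        echo "no lock file at $lock"
    fi
}

# Print the physical directory that $SGE_ROOT/$SGE_CELL/../bin points at.
# cd -P follows symlinks first, so ".." is taken from the link target.
physical_bin_dir() {
    ( cd -P -- "$SGE_ROOT/$SGE_CELL/.." && printf '%s/bin\n' "$(pwd -P)" )
}
```

If physical_bin_dir prints a directory that has no sge_qmaster binary in it, the shadow daemon is unlikely to be able to start the qmaster on failover.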

Packaging note (minor):
The files associated with the shadow daemon are currently in the "gridengine" package and would probably better fit in the "gridengine-qmaster" package.  These are:
 /usr/bin/sge_shadowd
 /usr/share/gridengine/bin/linux-x64/sge_shadowd
 /usr/share/gridengine/util/sgeSMF/.svn/text-base/shadowd_template.xml.svn-base (probably shouldn't be packaged at all)
 /usr/share/gridengine/util/sgeSMF/shadowd_template.xml
 /usr/share/man/man8/sge_shadowd.8.gz

Comment 1 Orion Poplawski 2013-04-25 03:07:54 UTC
Yeah, it seems I've really neglected shadowd functionality (since I don't make use of it myself).  I've filed a bug with upstream to see if we can coordinate, although their responsiveness is hit and miss.

Comment 2 Orion Poplawski 2013-05-21 19:49:03 UTC
I'm going to go with the separate sge_shadowd.service approach.

Comment 3 Fedora Update System 2013-05-21 22:29:11 UTC
gridengine-2011.11p1-13.fc17 has been submitted as an update for Fedora 17.
https://admin.fedoraproject.org/updates/gridengine-2011.11p1-13.fc17

Comment 4 Fedora Update System 2013-05-23 12:30:21 UTC
Package gridengine-2011.11p1-13.fc17:
* should fix your issue,
* was pushed to the Fedora 17 testing repository,
* should be available at your local mirror within two days.
Update it with:
# su -c 'yum update --enablerepo=updates-testing gridengine-2011.11p1-13.fc17'
as soon as you are able to.
Please go to the following url:
https://admin.fedoraproject.org/updates/FEDORA-2013-8929/gridengine-2011.11p1-13.fc17
then log in and leave karma (feedback).

Comment 5 Mike Grant 2013-05-24 13:34:18 UTC
Thanks :)  Due to holidays and scheduling a downtime window, we should be able to test it around the first week of June.

Comment 6 Fedora End Of Life 2013-07-04 07:46:11 UTC
This message is a reminder that Fedora 17 is nearing its end of life.
Approximately 4 (four) weeks from now Fedora will stop maintaining
and issuing updates for Fedora 17. It is Fedora's policy to close all
bug reports from releases that are no longer maintained. At that time
this bug will be closed as WONTFIX if it remains open with a Fedora 
'version' of '17'.

Package Maintainer: If you wish for this bug to remain open because you
plan to fix it in a currently maintained version, simply change the 'version' 
to a later Fedora version prior to Fedora 17's end of life.

Bug Reporter: Thank you for reporting this issue and we are sorry that 
we may not be able to fix it before Fedora 17 is end of life. If you 
would still like to see this bug fixed and are able to reproduce it 
against a later version of Fedora, you are encouraged to change the 
'version' to a later Fedora version prior to Fedora 17's end of life.

Although we aim to fix as many bugs as possible during every release's 
lifetime, sometimes those efforts are overtaken by events. Often a 
more recent Fedora release includes newer upstream software that fixes 
bugs or makes them obsolete.

Comment 7 Mike Grant 2013-07-04 10:21:31 UTC
Belatedly adding that I've installed this package and it fired up the shadow daemon correctly.

As a side note for anyone finding this via Google: those keeping their spool on an automounted NFS mount may wish to add autofs to the unit file dependencies, to ensure that the spool directory will be mountable before starting the daemon, i.e.
 After=network.target autofs.service

Comment 8 Mike Grant 2013-07-04 12:18:31 UTC
Created attachment 768745
Optional shadow master configuration variables for addition to /etc/sysconfig/gridengine

Here are some configuration variables relevant to the shadow daemon and how it works.  I forget exactly where I got them, but I think it was the manual; they're not so easy to come across otherwise.  They're possibly a bit spammy for general users, but might be useful to include in commented-out form.

Comment 9 Orion Poplawski 2013-07-29 05:28:26 UTC
Thanks, I've checked this in and it will make it out with the next update, whenever that is.

Comment 10 Fedora Update System 2013-07-30 17:33:41 UTC
gridengine-2011.11p1-13.fc17 has been pushed to the Fedora 17 stable repository.  If problems still persist, please make note of it in this bug report.
