| Summary: | ERROR: Failed to send command "condor_restart" to subsystem "master" during master restart | ||||||
|---|---|---|---|---|---|---|---|
| Product: | Red Hat Enterprise MRG | Reporter: | Jon Thomas <jthomas> | ||||
| Component: | condor-wallaby-client | Assignee: | grid-maint-list <grid-maint-list> | ||||
| Status: | CLOSED DUPLICATE | QA Contact: | MRG Quality Engineering <mrgqe-bugs> | ||||
| Severity: | medium | Docs Contact: | |||||
| Priority: | medium | ||||||
| Version: | 1.3 | CC: | matt, rrati | ||||
| Target Milestone: | 2.0 | ||||||
| Target Release: | --- | ||||||
| Hardware: | Unspecified | ||||||
| OS: | Unspecified | ||||||
| Whiteboard: | |||||||
| Fixed In Version: | Doc Type: | Bug Fix | |||||
| Doc Text: | Story Points: | --- | |||||
| Clone Of: | Environment: | ||||||
| Last Closed: | 2011-01-20 19:31:04 UTC | Type: | --- | ||||
| Regression: | --- | Mount Type: | --- | ||||
| Documentation: | --- | CRM: | |||||
| Verified Versions: | Category: | --- | |||||
| oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: | |||||
| Cloudforms Team: | --- | Target Upstream Version: | |||||
| Attachments: |
|
||||||
|
Description
Jon Thomas
2011-01-13 15:04:08 UTC
This is likely an error in the condor_restart tool. The configd is looking for a non-zero return code, so if the restart goes through but condor_restart returns a non-zero then you would get the results in this BZ.
I think this has something to do with condorutils/osutil.py and run_cmd.
run_cmd is responsible for the -1 retval.
01/11 14:19:38 ERROR: Failed to send command "condor_restart" to subsystem
"master" (retval: -1, stdout: "None", stderr: "None")
It seems to be the exception for this block
try:
cmd_list = shlex.split(cmd)
obj = Popen(cmd_list, stdout=PIPE, stderr=PIPE, env=env)
(std_out, std_err) = obj.communicate()
retcode = obj.returncode
except:
return (-1, None, None)
01/18 16:06:42 DEBUG: Sending command "condor_restart" to subsystem "master"
01/18 16:06:42 DEBUG: Shutting down
01/18 16:06:42 DEBUG: Closing QMF connections
01/18 16:06:42 DEBUG: Lost connecton to the configuration store
01/18 16:06:43 DEBUG: Closed QMF connections
01/18 16:06:43 DEBUG: Setting stop flag
01/18 16:06:43 ERROR: Failed to send command "condor_restart" to subsystem "master" (retval: -1, stdout: "None", stderr: "None")
re: may not be completely cosmetic...
Initial startup path throws away the return value for check_config_ver, so that seems like the issue is cosmetic.
However, the process_version_q uses the return value as a flag to requeue the version. I'm not really sure of the consequences of this yet.
if version != -1:
# There was a new version in the queue
if self.check_config_ver(version) == False:
if version_queue.empty() == True:
# There hasn't been another version to look at, so put
# this one back as there was an error installing it
version_queue.put(version, False)
size += 1
return size
Looking again at the log, this looks like 1) condor_configd sends a condor_restart to master 2) master tells condor_configd to shutdown while it's in the popen code block 3) shutdown signal fires condor_configd's exit_signal_handler 4) the signal or signal handler causes popen code block (perhaps obj.communicate) to have an exception resulting in a -1 return value 01/18 16:06:42 DEBUG: Sending command "condor_restart" to subsystem "master" 01/18 16:06:42 DEBUG: Shutting down 01/18 16:06:42 DEBUG: Closing QMF connections 01/18 16:06:42 DEBUG: Lost connecton to the configuration store 01/18 16:06:43 DEBUG: Closed QMF connections 01/18 16:06:43 DEBUG: Setting stop flag 01/18 16:06:43 ERROR: Failed to send command "condor_restart" to subsystem "master" (retval: -1, stdout: "None", stderr: "None") One thought on fixing this is to convert or overload run_cmd to have a NOWAIT flag. condor_restart return code doesn't really tell us if the restart was successful. doCommand is void. maybe just something like this for the NOWAIT case. or return true of obj is valid. obj = Popen(cmd_list, stdout=PIPE, stderr=PIPE, env=env) retcode = obj.pid This is potentially http://bugs.python.org/issue1068268 The subprocess module is not safe for use with signals, because it doesn't retry the system calls upon EINTR. I'm going to backport the patch in http://bugs.python.org/issue1068268 to see if it fixes the problem. If so, then we can dup this to https://bugzilla.redhat.com/show_bug.cgi?id=524836 Created attachment 474518 [details] python subprocess.py patch This seems to be fixed with the backport of these two patches for python: http://svn.python.org/view/python/trunk/Lib/subprocess.py?r1=64756&r2=64755&pathrev=64756 and http://bugs.python.org/issue1068268 01/20 14:23:13 DEBUG: Sending command "condor_restart" to subsystem "master" 01/20 14:23:13 DEBUG: Shutting down 01/20 14:23:13 DEBUG: Closing QMF connections 01/20 14:23:13 DEBUG: Lost connecton to the configuration store 01/20 14:23:13 DEBUG: Closed QMF connections 01/20 14:23:13 DEBUG: Setting stop flag 01/20 14:23:13 DEBUG: Sent command "condor_restart" to subsystem "master" 01/20/11 14:23:13 Sent SIGTERM to QMF_CONFIGD (pid 4316) 01/20/11 14:23:13 Sent SIGTERM to STARTD (pid 4315) 01/20/11 14:23:14 The QMF_CONFIGD (pid 4316) exited with status 0 01/20/11 14:23:14 The STARTD (pid 4315) exited with status 0 01/20/11 14:23:14 All daemons are gone. Restarting. 01/20/11 14:23:14 Restarting master right away. I tested with 20 runs and was not able to reproduce with the patch. I hit the problem immediately when I reverted to the unpatched original subprocess.py. So, I think this is safe to dup this to the other python bz. *** This bug has been marked as a duplicate of bug 524836 *** |