Hide Forgot
It appears that condor_configd sends a condor_restart and the restart occurs, but we see the error message. This seems somewhat cosmetic, perhaps a return code, ack, or timeout type of problem. # cat /var/log/condor/ConfigLog 01/11 14:19:14 INFO: Starting Up 01/11 14:19:14 INFO: Hostname is "had1.mypool.com" 01/11 14:19:32 INFO: Retrieving configuration version "1294773420019095" from the store 01/11 14:19:33 INFO: Retrieved configuration from the store 01/11 14:19:38 ERROR: Failed to send command "condor_restart" to subsystem "master" (retval: -1, stdout: "None", stderr: "None") 01/11 14:19:38 INFO: Exiting 01/11/11 14:19:37 Sent SIGTERM to QMF_CONFIGD (pid 2780) 01/11/11 14:19:37 Sent SIGTERM to SCHEDD (pid 2778) 01/11/11 14:19:38 Sent SIGTERM to STARTD (pid 2779) 01/11/11 14:19:38 The QMF_CONFIGD (pid 2780) exited with status 0 01/11/11 14:19:59 The SCHEDD (pid 2778) exited with status 0 01/11/11 14:22:59 The STARTD (pid 2779) exited with status 0 01/11/11 14:22:59 All daemons are gone. Restarting. 01/11/11 14:22:59 Restarting master right away. Repro: A customer hit this in a couple different installs involving host machine and a virtual machine. I saw this in an install between two virtual machines. So, there may be some relation to load and occurrence of the error message.
This is likely an error in the condor_restart tool. The configd is looking for a non-zero return code, so if the restart goes through but condor_restart returns a non-zero then you would get the results in this BZ.
I think this has something to do with condorutils/osutil.py and run_cmd. run_cmd is responsible for the -1 retval. 01/11 14:19:38 ERROR: Failed to send command "condor_restart" to subsystem "master" (retval: -1, stdout: "None", stderr: "None") It seems to be the exception for this block try: cmd_list = shlex.split(cmd) obj = Popen(cmd_list, stdout=PIPE, stderr=PIPE, env=env) (std_out, std_err) = obj.communicate() retcode = obj.returncode except: return (-1, None, None) 01/18 16:06:42 DEBUG: Sending command "condor_restart" to subsystem "master" 01/18 16:06:42 DEBUG: Shutting down 01/18 16:06:42 DEBUG: Closing QMF connections 01/18 16:06:42 DEBUG: Lost connecton to the configuration store 01/18 16:06:43 DEBUG: Closed QMF connections 01/18 16:06:43 DEBUG: Setting stop flag 01/18 16:06:43 ERROR: Failed to send command "condor_restart" to subsystem "master" (retval: -1, stdout: "None", stderr: "None")
re: may not be completely cosmetic... Initial startup path throws away the return value for check_config_ver, so that seems like the issue is cosmetic. However, the process_version_q uses the return value as a flag to requeue the version. I'm not really sure of the consequences of this yet. if version != -1: # There was a new version in the queue if self.check_config_ver(version) == False: if version_queue.empty() == True: # There hasn't been another version to look at, so put # this one back as there was an error installing it version_queue.put(version, False) size += 1 return size
Looking again at the log, this looks like 1) condor_configd sends a condor_restart to master 2) master tells condor_configd to shutdown while it's in the popen code block 3) shutdown signal fires condor_configd's exit_signal_handler 4) the signal or signal handler causes popen code block (perhaps obj.communicate) to have an exception resulting in a -1 return value 01/18 16:06:42 DEBUG: Sending command "condor_restart" to subsystem "master" 01/18 16:06:42 DEBUG: Shutting down 01/18 16:06:42 DEBUG: Closing QMF connections 01/18 16:06:42 DEBUG: Lost connecton to the configuration store 01/18 16:06:43 DEBUG: Closed QMF connections 01/18 16:06:43 DEBUG: Setting stop flag 01/18 16:06:43 ERROR: Failed to send command "condor_restart" to subsystem "master" (retval: -1, stdout: "None", stderr: "None")
One thought on fixing this is to convert or overload run_cmd to have a NOWAIT flag. condor_restart return code doesn't really tell us if the restart was successful. doCommand is void. maybe just something like this for the NOWAIT case. or return true of obj is valid. obj = Popen(cmd_list, stdout=PIPE, stderr=PIPE, env=env) retcode = obj.pid
This is potentially http://bugs.python.org/issue1068268 The subprocess module is not safe for use with signals, because it doesn't retry the system calls upon EINTR.
I'm going to backport the patch in http://bugs.python.org/issue1068268 to see if it fixes the problem. If so, then we can dup this to https://bugzilla.redhat.com/show_bug.cgi?id=524836
Created attachment 474518 [details] python subprocess.py patch This seems to be fixed with the backport of these two patches for python: http://svn.python.org/view/python/trunk/Lib/subprocess.py?r1=64756&r2=64755&pathrev=64756 and http://bugs.python.org/issue1068268 01/20 14:23:13 DEBUG: Sending command "condor_restart" to subsystem "master" 01/20 14:23:13 DEBUG: Shutting down 01/20 14:23:13 DEBUG: Closing QMF connections 01/20 14:23:13 DEBUG: Lost connecton to the configuration store 01/20 14:23:13 DEBUG: Closed QMF connections 01/20 14:23:13 DEBUG: Setting stop flag 01/20 14:23:13 DEBUG: Sent command "condor_restart" to subsystem "master" 01/20/11 14:23:13 Sent SIGTERM to QMF_CONFIGD (pid 4316) 01/20/11 14:23:13 Sent SIGTERM to STARTD (pid 4315) 01/20/11 14:23:14 The QMF_CONFIGD (pid 4316) exited with status 0 01/20/11 14:23:14 The STARTD (pid 4315) exited with status 0 01/20/11 14:23:14 All daemons are gone. Restarting. 01/20/11 14:23:14 Restarting master right away. I tested with 20 runs and was not able to reproduce with the patch. I hit the problem immediately when I reverted to the unpatched original subprocess.py. So, I think this is safe to dup this to the other python bz.
*** This bug has been marked as a duplicate of bug 524836 ***