Bug 669396

Summary: ERROR: Failed to send command "condor_restart" to subsystem "master" during master restart
Product: Red Hat Enterprise MRG Reporter: Jon Thomas <jthomas>
Component: condor-wallaby-clientAssignee: grid-maint-list <grid-maint-list>
Status: CLOSED DUPLICATE QA Contact: MRG Quality Engineering <mrgqe-bugs>
Severity: medium Docs Contact:
Priority: medium    
Version: 1.3CC: matt, rrati
Target Milestone: 2.0   
Target Release: ---   
Hardware: Unspecified   
OS: Unspecified   
Whiteboard:
Fixed In Version: Doc Type: Bug Fix
Doc Text:
Story Points: ---
Clone Of: Environment:
Last Closed: 2011-01-20 19:31:04 UTC Type: ---
Regression: --- Mount Type: ---
Documentation: --- CRM:
Verified Versions: Category: ---
oVirt Team: --- RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: --- Target Upstream Version:
Attachments:
Description Flags
python subprocess.py patch none

Description Jon Thomas 2011-01-13 15:04:08 UTC
It appears that condor_configd sends a condor_restart and the restart occurs, but we see the error message. This seems somewhat cosmetic, perhaps a return code, ack, or timeout type of problem.

# cat /var/log/condor/ConfigLog
01/11 14:19:14 INFO: Starting Up
01/11 14:19:14 INFO: Hostname is "had1.mypool.com"
01/11 14:19:32 INFO: Retrieving configuration version "1294773420019095" from the store
01/11 14:19:33 INFO: Retrieved configuration from the store
01/11 14:19:38 ERROR: Failed to send command "condor_restart" to subsystem "master" (retval: -1, stdout: "None", stderr: "None")
01/11 14:19:38 INFO: Exiting


01/11/11 14:19:37 Sent SIGTERM to QMF_CONFIGD (pid 2780)
01/11/11 14:19:37 Sent SIGTERM to SCHEDD (pid 2778)
01/11/11 14:19:38 Sent SIGTERM to STARTD (pid 2779)
01/11/11 14:19:38 The QMF_CONFIGD (pid 2780) exited with status 0
01/11/11 14:19:59 The SCHEDD (pid 2778) exited with status 0
01/11/11 14:22:59 The STARTD (pid 2779) exited with status 0
01/11/11 14:22:59 All daemons are gone.  Restarting.
01/11/11 14:22:59 Restarting master right away.

Repro:

A customer hit this in a couple different installs involving host machine and a virtual machine. I saw this in an install between two virtual machines. So, there may be some relation to load and occurrence of the error message.

Comment 1 Robert Rati 2011-01-13 16:26:01 UTC
This is likely an error in the condor_restart tool.  The configd is looking for a non-zero return code, so if the restart goes through but condor_restart returns a non-zero then you would get the results in this BZ.

Comment 3 Jon Thomas 2011-01-18 21:14:16 UTC
I think this has something to do with condorutils/osutil.py and run_cmd. 

run_cmd is responsible for the -1 retval.

01/11 14:19:38 ERROR: Failed to send command "condor_restart" to subsystem
"master" (retval: -1, stdout: "None", stderr: "None")

It seems to be the exception for this block

try:
         cmd_list = shlex.split(cmd)
         obj = Popen(cmd_list, stdout=PIPE, stderr=PIPE, env=env)
         (std_out, std_err) = obj.communicate()
         retcode = obj.returncode       
      except:
         return (-1, None, None)

01/18 16:06:42 DEBUG: Sending command "condor_restart" to subsystem "master"
01/18 16:06:42 DEBUG: Shutting down
01/18 16:06:42 DEBUG: Closing QMF connections
01/18 16:06:42 DEBUG: Lost connecton to the configuration store
01/18 16:06:43 DEBUG: Closed QMF connections
01/18 16:06:43 DEBUG: Setting stop flag
01/18 16:06:43 ERROR: Failed to send command "condor_restart" to subsystem "master" (retval: -1, stdout: "None", stderr: "None")

Comment 4 Jon Thomas 2011-01-18 21:24:49 UTC
re: may not be completely cosmetic...

Initial startup path throws away the return value for check_config_ver, so that seems like the issue is cosmetic. 

However, the process_version_q uses the return value as a flag to requeue the version. I'm not really sure of the consequences of this yet.

      if version != -1:
         # There was a new version in the queue
         if self.check_config_ver(version) == False:
            if version_queue.empty() == True:
               # There hasn't been another version to look at, so put
               # this one back as there was an error installing it
               version_queue.put(version, False)
               size += 1
      return size

Comment 5 Jon Thomas 2011-01-19 17:10:38 UTC
Looking again at the log, this looks like

1) condor_configd sends a condor_restart to master
2) master tells condor_configd to shutdown while it's in the popen code block
3) shutdown signal fires condor_configd's exit_signal_handler
4) the signal or signal handler causes popen code block (perhaps obj.communicate) to have an exception resulting in a -1 return value

01/18 16:06:42 DEBUG: Sending command "condor_restart" to subsystem "master"
01/18 16:06:42 DEBUG: Shutting down
01/18 16:06:42 DEBUG: Closing QMF connections
01/18 16:06:42 DEBUG: Lost connecton to the configuration store
01/18 16:06:43 DEBUG: Closed QMF connections
01/18 16:06:43 DEBUG: Setting stop flag
01/18 16:06:43 ERROR: Failed to send command "condor_restart" to subsystem
"master" (retval: -1, stdout: "None", stderr: "None")

Comment 6 Jon Thomas 2011-01-19 19:09:48 UTC
One thought on fixing this is to convert or overload run_cmd to have a NOWAIT flag. condor_restart return code doesn't really tell us if the restart was successful. doCommand is void.          


maybe just something like this for the NOWAIT case. or return true of obj is valid.

obj = Popen(cmd_list, stdout=PIPE, stderr=PIPE, env=env)
retcode = obj.pid

Comment 7 Jon Thomas 2011-01-19 19:34:24 UTC

This is potentially http://bugs.python.org/issue1068268

The subprocess module is not safe for use with signals,
because it doesn't retry the system calls upon EINTR.

Comment 8 Jon Thomas 2011-01-19 19:42:33 UTC
I'm going to backport the patch in http://bugs.python.org/issue1068268 to see if it fixes the problem. If so, then we can dup this to


https://bugzilla.redhat.com/show_bug.cgi?id=524836

Comment 9 Jon Thomas 2011-01-20 19:29:27 UTC
Created attachment 474518 [details]
python subprocess.py patch

This seems to be fixed with the backport of these two patches for python:

http://svn.python.org/view/python/trunk/Lib/subprocess.py?r1=64756&r2=64755&pathrev=64756
and
http://bugs.python.org/issue1068268


01/20 14:23:13 DEBUG: Sending command "condor_restart" to subsystem "master"
01/20 14:23:13 DEBUG: Shutting down
01/20 14:23:13 DEBUG: Closing QMF connections
01/20 14:23:13 DEBUG: Lost connecton to the configuration store
01/20 14:23:13 DEBUG: Closed QMF connections
01/20 14:23:13 DEBUG: Setting stop flag
01/20 14:23:13 DEBUG: Sent command "condor_restart" to subsystem "master"

01/20/11 14:23:13 Sent SIGTERM to QMF_CONFIGD (pid 4316)
01/20/11 14:23:13 Sent SIGTERM to STARTD (pid 4315)
01/20/11 14:23:14 The QMF_CONFIGD (pid 4316) exited with status 0
01/20/11 14:23:14 The STARTD (pid 4315) exited with status 0
01/20/11 14:23:14 All daemons are gone.  Restarting.
01/20/11 14:23:14 Restarting master right away.

I tested with 20 runs and was not able to reproduce with the patch. I hit the problem immediately when I reverted to the unpatched original subprocess.py.

So, I think this is safe to dup this to the other python bz.

Comment 10 Jon Thomas 2011-01-20 19:31:04 UTC

*** This bug has been marked as a duplicate of bug 524836 ***