Bug 669396 - ERROR: Failed to send command "condor_restart" to subsystem "master" during master restart
Summary: ERROR: Failed to send command "condor_restart" to subsystem "master" during...
Keywords:
Status: CLOSED DUPLICATE of bug 524836
Alias: None
Product: Red Hat Enterprise MRG
Classification: Red Hat
Component: condor-wallaby-client
Version: 1.3
Hardware: Unspecified
OS: Unspecified
medium
medium
Target Milestone: 2.0
: ---
Assignee: grid-maint-list
QA Contact: MRG Quality Engineering
URL:
Whiteboard:
Depends On:
Blocks:
TreeView+ depends on / blocked
 
Reported: 2011-01-13 15:04 UTC by Jon Thomas
Modified: 2018-11-14 15:45 UTC (History)
2 users (show)

Fixed In Version:
Doc Type: Bug Fix
Doc Text:
Clone Of:
Environment:
Last Closed: 2011-01-20 19:31:04 UTC
Target Upstream Version:


Attachments (Terms of Use)
python subprocess.py patch (6.08 KB, patch)
2011-01-20 19:29 UTC, Jon Thomas
no flags Details | Diff

Description Jon Thomas 2011-01-13 15:04:08 UTC
It appears that condor_configd sends a condor_restart and the restart occurs, but we see the error message. This seems somewhat cosmetic, perhaps a return code, ack, or timeout type of problem.

# cat /var/log/condor/ConfigLog
01/11 14:19:14 INFO: Starting Up
01/11 14:19:14 INFO: Hostname is "had1.mypool.com"
01/11 14:19:32 INFO: Retrieving configuration version "1294773420019095" from the store
01/11 14:19:33 INFO: Retrieved configuration from the store
01/11 14:19:38 ERROR: Failed to send command "condor_restart" to subsystem "master" (retval: -1, stdout: "None", stderr: "None")
01/11 14:19:38 INFO: Exiting


01/11/11 14:19:37 Sent SIGTERM to QMF_CONFIGD (pid 2780)
01/11/11 14:19:37 Sent SIGTERM to SCHEDD (pid 2778)
01/11/11 14:19:38 Sent SIGTERM to STARTD (pid 2779)
01/11/11 14:19:38 The QMF_CONFIGD (pid 2780) exited with status 0
01/11/11 14:19:59 The SCHEDD (pid 2778) exited with status 0
01/11/11 14:22:59 The STARTD (pid 2779) exited with status 0
01/11/11 14:22:59 All daemons are gone.  Restarting.
01/11/11 14:22:59 Restarting master right away.

Repro:

A customer hit this in a couple different installs involving host machine and a virtual machine. I saw this in an install between two virtual machines. So, there may be some relation to load and occurrence of the error message.

Comment 1 Robert Rati 2011-01-13 16:26:01 UTC
This is likely an error in the condor_restart tool.  The configd is looking for a non-zero return code, so if the restart goes through but condor_restart returns a non-zero then you would get the results in this BZ.

Comment 3 Jon Thomas 2011-01-18 21:14:16 UTC
I think this has something to do with condorutils/osutil.py and run_cmd. 

run_cmd is responsible for the -1 retval.

01/11 14:19:38 ERROR: Failed to send command "condor_restart" to subsystem
"master" (retval: -1, stdout: "None", stderr: "None")

It seems to be the exception for this block

try:
         cmd_list = shlex.split(cmd)
         obj = Popen(cmd_list, stdout=PIPE, stderr=PIPE, env=env)
         (std_out, std_err) = obj.communicate()
         retcode = obj.returncode       
      except:
         return (-1, None, None)

01/18 16:06:42 DEBUG: Sending command "condor_restart" to subsystem "master"
01/18 16:06:42 DEBUG: Shutting down
01/18 16:06:42 DEBUG: Closing QMF connections
01/18 16:06:42 DEBUG: Lost connecton to the configuration store
01/18 16:06:43 DEBUG: Closed QMF connections
01/18 16:06:43 DEBUG: Setting stop flag
01/18 16:06:43 ERROR: Failed to send command "condor_restart" to subsystem "master" (retval: -1, stdout: "None", stderr: "None")

Comment 4 Jon Thomas 2011-01-18 21:24:49 UTC
re: may not be completely cosmetic...

Initial startup path throws away the return value for check_config_ver, so that seems like the issue is cosmetic. 

However, the process_version_q uses the return value as a flag to requeue the version. I'm not really sure of the consequences of this yet.

      if version != -1:
         # There was a new version in the queue
         if self.check_config_ver(version) == False:
            if version_queue.empty() == True:
               # There hasn't been another version to look at, so put
               # this one back as there was an error installing it
               version_queue.put(version, False)
               size += 1
      return size

Comment 5 Jon Thomas 2011-01-19 17:10:38 UTC
Looking again at the log, this looks like

1) condor_configd sends a condor_restart to master
2) master tells condor_configd to shutdown while it's in the popen code block
3) shutdown signal fires condor_configd's exit_signal_handler
4) the signal or signal handler causes popen code block (perhaps obj.communicate) to have an exception resulting in a -1 return value

01/18 16:06:42 DEBUG: Sending command "condor_restart" to subsystem "master"
01/18 16:06:42 DEBUG: Shutting down
01/18 16:06:42 DEBUG: Closing QMF connections
01/18 16:06:42 DEBUG: Lost connecton to the configuration store
01/18 16:06:43 DEBUG: Closed QMF connections
01/18 16:06:43 DEBUG: Setting stop flag
01/18 16:06:43 ERROR: Failed to send command "condor_restart" to subsystem
"master" (retval: -1, stdout: "None", stderr: "None")

Comment 6 Jon Thomas 2011-01-19 19:09:48 UTC
One thought on fixing this is to convert or overload run_cmd to have a NOWAIT flag. condor_restart return code doesn't really tell us if the restart was successful. doCommand is void.          


maybe just something like this for the NOWAIT case. or return true of obj is valid.

obj = Popen(cmd_list, stdout=PIPE, stderr=PIPE, env=env)
retcode = obj.pid

Comment 7 Jon Thomas 2011-01-19 19:34:24 UTC

This is potentially http://bugs.python.org/issue1068268

The subprocess module is not safe for use with signals,
because it doesn't retry the system calls upon EINTR.

Comment 8 Jon Thomas 2011-01-19 19:42:33 UTC
I'm going to backport the patch in http://bugs.python.org/issue1068268 to see if it fixes the problem. If so, then we can dup this to


https://bugzilla.redhat.com/show_bug.cgi?id=524836

Comment 9 Jon Thomas 2011-01-20 19:29:27 UTC
Created attachment 474518 [details]
python subprocess.py patch

This seems to be fixed with the backport of these two patches for python:

http://svn.python.org/view/python/trunk/Lib/subprocess.py?r1=64756&r2=64755&pathrev=64756
and
http://bugs.python.org/issue1068268


01/20 14:23:13 DEBUG: Sending command "condor_restart" to subsystem "master"
01/20 14:23:13 DEBUG: Shutting down
01/20 14:23:13 DEBUG: Closing QMF connections
01/20 14:23:13 DEBUG: Lost connecton to the configuration store
01/20 14:23:13 DEBUG: Closed QMF connections
01/20 14:23:13 DEBUG: Setting stop flag
01/20 14:23:13 DEBUG: Sent command "condor_restart" to subsystem "master"

01/20/11 14:23:13 Sent SIGTERM to QMF_CONFIGD (pid 4316)
01/20/11 14:23:13 Sent SIGTERM to STARTD (pid 4315)
01/20/11 14:23:14 The QMF_CONFIGD (pid 4316) exited with status 0
01/20/11 14:23:14 The STARTD (pid 4315) exited with status 0
01/20/11 14:23:14 All daemons are gone.  Restarting.
01/20/11 14:23:14 Restarting master right away.

I tested with 20 runs and was not able to reproduce with the patch. I hit the problem immediately when I reverted to the unpatched original subprocess.py.

So, I think this is safe to dup this to the other python bz.

Comment 10 Jon Thomas 2011-01-20 19:31:04 UTC

*** This bug has been marked as a duplicate of bug 524836 ***


Note You need to log in before you can comment on or make changes to this bug.