Bug 620511 - condor_configd hangs on rhel4 when receiving a remote config
Summary: condor_configd hangs on rhel4 when receiving a remote config
Status: CLOSED CURRENTRELEASE
Alias: None
Product: Red Hat Enterprise MRG
Classification: Red Hat
Component: condor-wallaby-client
Version: beta
Hardware: All
OS: Linux
Priority: high
Severity: high
Target Milestone: 1.3
Assignee: Robert Rati
QA Contact: Tomas Rusnak
URL:
Whiteboard:
Keywords:
Depends On: 612869 623220
Blocks:
Reported: 2010-08-02 18:13 UTC by Pete MacKinnon
Modified: 2010-10-21 18:44 UTC (History)
Last Closed: 2010-10-21 18:44:50 UTC


Description Pete MacKinnon 2010-08-02 18:13:34 UTC
condor_configd invokes condor_config_val -dump, which appears to hang when receiving a new remote config from wallaby.

[root@mrg8 ~]# rpm -q condor-wallaby-client
condor-wallaby-client-3.2-1.el4

Comment 1 Robert Rati 2010-08-02 18:23:13 UTC
The issue is with commands that produce large output on rhel4.  The run_cmd function deadlocked: it waited for the command to exit before reading the stdout/stderr buffers, while the command was blocked waiting for those buffers to be read so it could write more data and finish executing.

Fixed in:
condor-job-hooks-1.4-3
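
For illustration, the deadlock described above is the classic "wait for exit before reading the pipes" pattern. The following is a minimal sketch of a deadlock-free way to run a command in Python, not the actual change shipped in condor-job-hooks-1.4-3. The name run_cmd_sketch is hypothetical, and the child process here is a stand-in that simply writes a large amount of output.

```python
import subprocess
import sys


def run_cmd_sketch(argv):
    """Hypothetical helper: run argv and return (returncode, stdout, stderr).

    Calling proc.wait() before reading the pipes can deadlock once the child
    fills a pipe buffer. communicate() reads both pipes while the child is
    still running and only then waits for it to exit.
    """
    proc = subprocess.Popen(argv, stdout=subprocess.PIPE, stderr=subprocess.PIPE)
    out, err = proc.communicate()
    return proc.returncode, out, err


if __name__ == "__main__":
    # Stand-in for a command with large output (e.g. a full config dump).
    child = "import sys; sys.stdout.write('x' * 1000000)"
    rc, out, err = run_cmd_sketch([sys.executable, "-c", child])
    print(rc, len(out))
```

The design point is only that the parent must keep draining stdout and stderr while the child runs; any approach that does so (communicate(), temporary files, or a select loop) avoids the hang.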

Comment 2 Tomas Rusnak 2010-08-11 09:30:52 UTC
Reproduced on:

$CondorVersion: 7.4.4 Aug  5 2010 BuildID: RH-7.4.4-0.8.el4 PRE-RELEASE $
$CondorPlatform: I386-LINUX_RHEL4 $

condor-job-hooks-1.4-1.el4

08/11 05:22:07 DEBUG: Retrieved node object from store
08/11 05:22:14 DEBUG: Checking version of condor configuration
08/11 05:22:15 DEBUG: The system is already running configuration version "1281517401758710"
08/11 05:22:15 DEBUG: Performing a checkin with the store
08/11 05:22:15 DEBUG: Checked in with the store
08/11 05:24:14 DEBUG: Received a NodeUpdatedNotice
08/11 05:24:14 DEBUG: The event is for this node
08/11 05:24:15 DEBUG: Checking version of condor configuration
08/11 05:24:15 INFO: Retrieving configuration version "1281518654174482" from the store
08/11 05:24:24 DEBUG: Retrieved configuration from the store
08/11 05:26:58 DEBUG: Received a NodeUpdatedNotice
08/11 05:26:58 DEBUG: The event is for this node
08/11 05:26:58 DEBUG: Checking version of condor configuration
08/11 05:28:19 DEBUG: Received a NodeUpdatedNotice
08/11 05:28:19 DEBUG: The event is for this node

Remote configuration hangs - no daemons restarted, no configuration retrieved from the configuration store.

Comment 3 Robert Rati 2010-08-11 12:44:29 UTC
You reported testing on condor-job-hooks-1.4-1, but the fix is stated as being in condor-job-hooks-1.4-3.  Please test with 1.4-3.

Comment 4 Tomas Rusnak 2010-08-18 08:48:36 UTC
Depends On set to BZ623220. This bug cannot be verified until the problems with the config store are resolved.

08/18 04:16:05 DEBUG: Received a NodeUpdatedNotice
08/18 04:16:05 DEBUG: The event is for this node
08/18 04:16:06 DEBUG: Checking version of condor configuration
08/18 04:16:06 INFO: Retrieving configuration version "1282119365333910" from the store
08/18 04:16:07 DEBUG: Retrieved configuration from the store
08/18 04:16:07 ERROR: Store error: 1, ERROR: near "-": syntax error
08/18 04:16:07 ERROR: Failed to retrive differences between versions "1282116947136550" and "1282119365333910".  No update performed
08/18 04:16:07 DEBUG: Performing a checkin with the store
08/18 04:16:07 DEBUG: Checked in with the store

/var/log/wallaby/agent.log

E, [2010-08-18T04:16:07.038373 #30701] ERROR -- : Error calling whatChanged: near "-": syntax error
E, [2010-08-18T04:16:07.038496 #30701] ERROR -- :     /usr/lib/ruby/site_ruby/1.8/sqlite3/errors.rb:62:in `check'
    /usr/lib/ruby/site_ruby/1.8/sqlite3/statement.rb:39:in `initialize'
    /usr/lib/ruby/site_ruby/1.8/sqlite3/database.rb:154:in `new'
    /usr/lib/ruby/site_ruby/1.8/sqlite3/database.rb:154:in `prepare'
    /usr/lib/ruby/site_ruby/1.8/sqlite3/database.rb:181:in `execute'
    /usr/lib/ruby/site_ruby/1.8/mrg/grid/config/Configuration.rb:94:in `build'
    /usr/lib/ruby/site_ruby/1.8/mrg/grid/config/Node.rb:236:in `whatChanged'
    /usr/lib/ruby/site_ruby/1.8/spqr/app.rb:130:in `send'
    /usr/lib/ruby/site_ruby/1.8/spqr/app.rb:130:in `method_call'
    /usr/lib/ruby/site_ruby/1.8/qmf.rb:1460:in `do_agent_events'
    /usr/lib/ruby/site_ruby/1.8/qmf.rb:1493:in `do_events'
    /usr/lib/ruby/site_ruby/1.8/qmf.rb:1519:in `sess_event_recv'
    /usr/lib/ruby/site_ruby/1.8/qmf.rb:308:in `run'
    /usr/lib/ruby/site_ruby/1.8/qmf.rb:242:in `initialize'
    /usr/lib/ruby/site_ruby/1.8/qmf.rb:241:in `new'
    /usr/lib/ruby/site_ruby/1.8/qmf.rb:241:in `initialize'
    /usr/lib/ruby/site_ruby/1.8/spqr/app.rb:211:in `new'
    /usr/lib/ruby/site_ruby/1.8/spqr/app.rb:211:in `main'
    /usr/bin/wallaby-agent:235

Comment 5 Tomas Rusnak 2010-08-25 12:06:43 UTC
Tested on:

$CondorVersion: 7.4.4 Aug 23 2010 BuildID: RH-7.4.4-0.10.el4 PRE-RELEASE $
$CondorPlatform: I386-LINUX_RHEL4 $

$CondorVersion: 7.4.4 Aug 23 2010 BuildID: RH-7.4.4-0.10.el4 PRE-RELEASE $
$CondorPlatform: X86_64-LINUX_RHEL4 $

condor-wallaby-client-3.4-1.el4

08/25 08:01:56 DEBUG: Connected to broker "****:5672"
08/25 08:01:57 DEBUG: Retrieved node object from store
08/25 08:01:57 DEBUG: Checking version of condor configuration
08/25 08:01:57 DEBUG: The system is already running configuration version "1282737701199239"
08/25 08:01:57 DEBUG: Performing a checkin with the store
08/25 08:01:57 DEBUG: Checked in with the store
08/25 08:02:39 DEBUG: Received a NodeUpdatedNotice
08/25 08:02:39 DEBUG: The event is for this node
08/25 08:02:39 DEBUG: Checking version of condor configuration
08/25 08:02:40 INFO: Retrieving configuration version "1282737757546367" from the store
08/25 08:02:43 DEBUG: Retrieved configuration from the store
08/25 08:02:43 DEBUG: Daemons to restart: []
08/25 08:02:43 DEBUG: Daemons to reconfig: [u'schedd', u'collector', u'startd']
08/25 08:02:44 DEBUG: Not sending "condor_reconfig" to subsystem "schedd" since it is not currently running
08/25 08:02:44 DEBUG: Not sending "condor_reconfig" to subsystem "collector" since it is not currently running
08/25 08:02:44 DEBUG: Sending command "condor_reconfig" to subsystem "startd"
08/25 08:02:44 DEBUG: Sent command "condor_reconfig" to subsystem "startd"
08/25 08:02:44 DEBUG: Performing a checkin with the store
08/25 08:02:44 DEBUG: Checked in with the store

>>> VERIFIED

