Bug 620511

Summary: condor_configd hangs on rhel4 when receiving a remote config
Product: Red Hat Enterprise MRG
Reporter: Pete MacKinnon <pmackinn>
Component: condor-wallaby-client
Assignee: Robert Rati <rrati>
Status: CLOSED CURRENTRELEASE
QA Contact: Tomas Rusnak <trusnak>
Severity: high
Priority: high
Version: beta
CC: rrati, trusnak
Target Milestone: 1.3
Target Release: ---
Hardware: All
OS: Linux
Doc Type: Bug Fix
Last Closed: 2010-10-21 18:44:50 UTC
Bug Depends On: 612869, 623220

Description Pete MacKinnon 2010-08-02 18:13:34 UTC
condor_configd invokes condor_config_val -dump, which appears to hang when a new remote config is received from wallaby.

[root@mrg8 ~]# rpm -q condor-wallaby-client
condor-wallaby-client-3.2-1.el4

Comment 1 Robert Rati 2010-08-02 18:23:13 UTC
The issue is with large output from commands executed on rhel4. The run_cmd function would deadlock: it waited for the command to exit before reading the stdout/stderr buffers, while the command was waiting for those buffers to be read so it could write more data and finish executing.

Fixed in:
condor-job-hooks-1.4-3
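
The deadlock described in this comment can be illustrated with a minimal sketch. It assumes run_cmd is a thin wrapper around Python's subprocess module; the signature and body below are hypothetical and are not the actual condor-job-hooks code or its fix. The point is only that the pipes have to be drained while the child is still running, which Popen.communicate() does.

```python
import subprocess
import sys


def run_cmd(argv):
    """Run argv and return (exit_code, stdout, stderr) without deadlocking.

    Hypothetical stand-in for the run_cmd mentioned above; the real
    function's signature in condor-job-hooks may differ.
    """
    proc = subprocess.Popen(argv, stdout=subprocess.PIPE, stderr=subprocess.PIPE)
    # communicate() drains both pipes while the child runs and only then
    # reaps it. Calling proc.wait() first and reading afterwards can hang
    # once the child has filled the pipe buffer.
    out, err = proc.communicate()
    return proc.returncode, out, err


if __name__ == "__main__":
    # Emit about 1 MiB on stdout, far more than a pipe buffer holds.
    child = [sys.executable, "-c", "import sys; sys.stdout.write('x' * (1 << 20))"]
    code, out, err = run_cmd(child)
    print(code, len(out), len(err))
```

With a wait-then-read ordering, a command such as condor_config_val -dump would only hang when its output exceeds the pipe buffer, which may explain why small outputs did not trigger the problem.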

Comment 2 Tomas Rusnak 2010-08-11 09:30:52 UTC
Reproduced on:

$CondorVersion: 7.4.4 Aug  5 2010 BuildID: RH-7.4.4-0.8.el4 PRE-RELEASE $
$CondorPlatform: I386-LINUX_RHEL4 $

condor-job-hooks-1.4-1.el4

08/11 05:22:07 DEBUG: Retrieved node object from store
08/11 05:22:14 DEBUG: Checking version of condor configuration
08/11 05:22:15 DEBUG: The system is already running configuration version "1281517401758710"
08/11 05:22:15 DEBUG: Performing a checkin with the store
08/11 05:22:15 DEBUG: Checked in with the store
08/11 05:24:14 DEBUG: Received a NodeUpdatedNotice
08/11 05:24:14 DEBUG: The event is for this node
08/11 05:24:15 DEBUG: Checking version of condor configuration
08/11 05:24:15 INFO: Retrieving configuration version "1281518654174482" from the store
08/11 05:24:24 DEBUG: Retrieved configuration from the store
08/11 05:26:58 DEBUG: Received a NodeUpdatedNotice
08/11 05:26:58 DEBUG: The event is for this node
08/11 05:26:58 DEBUG: Checking version of condor configuration
08/11 05:28:19 DEBUG: Received a NodeUpdatedNotice
08/11 05:28:19 DEBUG: The event is for this node

Remote configuration hangs - no daemons restarted, no configuration retrieved from the configuration store.

Comment 3 Robert Rati 2010-08-11 12:44:29 UTC
You reported testing on condor-job-hooks-1.4-1, but the fix is stated as being in condor-job-hooks-1.4-3.  Please test with 1.4-3.

Comment 4 Tomas Rusnak 2010-08-18 08:48:36 UTC
Depends On set to BZ623220. This cannot be verified until the problems with the config store are resolved.

08/18 04:16:05 DEBUG: Received a NodeUpdatedNotice
08/18 04:16:05 DEBUG: The event is for this node
08/18 04:16:06 DEBUG: Checking version of condor configuration
08/18 04:16:06 INFO: Retrieving configuration version "1282119365333910" from the store
08/18 04:16:07 DEBUG: Retrieved configuration from the store
08/18 04:16:07 ERROR: Store error: 1, ERROR: near "-": syntax error
08/18 04:16:07 ERROR: Failed to retrive differences between versions "1282116947136550" and "1282119365333910".  No update performed
08/18 04:16:07 DEBUG: Performing a checkin with the store
08/18 04:16:07 DEBUG: Checked in with the store

/var/log/wallaby/agent.log

E, [2010-08-18T04:16:07.038373 #30701] ERROR -- : Error calling whatChanged: near "-": syntax error
E, [2010-08-18T04:16:07.038496 #30701] ERROR -- :     /usr/lib/ruby/site_ruby/1.8/sqlite3/errors.rb:62:in `check'
    /usr/lib/ruby/site_ruby/1.8/sqlite3/statement.rb:39:in `initialize'
    /usr/lib/ruby/site_ruby/1.8/sqlite3/database.rb:154:in `new'
    /usr/lib/ruby/site_ruby/1.8/sqlite3/database.rb:154:in `prepare'
    /usr/lib/ruby/site_ruby/1.8/sqlite3/database.rb:181:in `execute'
    /usr/lib/ruby/site_ruby/1.8/mrg/grid/config/Configuration.rb:94:in `build'
    /usr/lib/ruby/site_ruby/1.8/mrg/grid/config/Node.rb:236:in `whatChanged'
    /usr/lib/ruby/site_ruby/1.8/spqr/app.rb:130:in `send'
    /usr/lib/ruby/site_ruby/1.8/spqr/app.rb:130:in `method_call'
    /usr/lib/ruby/site_ruby/1.8/qmf.rb:1460:in `do_agent_events'
    /usr/lib/ruby/site_ruby/1.8/qmf.rb:1493:in `do_events'
    /usr/lib/ruby/site_ruby/1.8/qmf.rb:1519:in `sess_event_recv'
    /usr/lib/ruby/site_ruby/1.8/qmf.rb:308:in `run'
    /usr/lib/ruby/site_ruby/1.8/qmf.rb:242:in `initialize'
    /usr/lib/ruby/site_ruby/1.8/qmf.rb:241:in `new'
    /usr/lib/ruby/site_ruby/1.8/qmf.rb:241:in `initialize'
    /usr/lib/ruby/site_ruby/1.8/spqr/app.rb:211:in `new'
    /usr/lib/ruby/site_ruby/1.8/spqr/app.rb:211:in `main'
    /usr/bin/wallaby-agent:235

Comment 5 Tomas Rusnak 2010-08-25 12:06:43 UTC
Tested on:

$CondorVersion: 7.4.4 Aug 23 2010 BuildID: RH-7.4.4-0.10.el4 PRE-RELEASE $
$CondorPlatform: I386-LINUX_RHEL4 $

$CondorVersion: 7.4.4 Aug 23 2010 BuildID: RH-7.4.4-0.10.el4 PRE-RELEASE $
$CondorPlatform: X86_64-LINUX_RHEL4 $

condor-wallaby-client-3.4-1.el4

08/25 08:01:56 DEBUG: Connected to broker "****:5672"
08/25 08:01:57 DEBUG: Retrieved node object from store
08/25 08:01:57 DEBUG: Checking version of condor configuration
08/25 08:01:57 DEBUG: The system is already running configuration version "1282737701199239"
08/25 08:01:57 DEBUG: Performing a checkin with the store
08/25 08:01:57 DEBUG: Checked in with the store
08/25 08:02:39 DEBUG: Received a NodeUpdatedNotice
08/25 08:02:39 DEBUG: The event is for this node
08/25 08:02:39 DEBUG: Checking version of condor configuration
08/25 08:02:40 INFO: Retrieving configuration version "1282737757546367" from the store
08/25 08:02:43 DEBUG: Retrieved configuration from the store
08/25 08:02:43 DEBUG: Daemons to restart: []
08/25 08:02:43 DEBUG: Daemons to reconfig: [u'schedd', u'collector', u'startd']
08/25 08:02:44 DEBUG: Not sending "condor_reconfig" to subsystem "schedd" since it is not currently running
08/25 08:02:44 DEBUG: Not sending "condor_reconfig" to subsystem "collector" since it is not currently running
08/25 08:02:44 DEBUG: Sending command "condor_reconfig" to subsystem "startd"
08/25 08:02:44 DEBUG: Sent command "condor_reconfig" to subsystem "startd"
08/25 08:02:44 DEBUG: Performing a checkin with the store
08/25 08:02:44 DEBUG: Checked in with the store

>>> VERIFIED