Bug 628034

Summary: negotiator core on quota_dynamic =0
Product: Red Hat Enterprise MRG
Component: condor
Version: 1.2
Status: CLOSED ERRATA
Severity: high
Priority: medium
Target Milestone: 1.3
Hardware: All
OS: Linux
Reporter: Jon Thomas <jthomas>
Assignee: Matthew Farrellee <matt>
QA Contact: Tomas Rusnak <trusnak>
CC: fnadge, matt, trusnak
Doc Type: Bug Fix
Doc Text:
An incorrect configuration of the negotiator resulted in a segmentation fault. This occurred when the 'quota' variable was set to 0 for a group that had subgroups. With this update, the segmentation fault no longer occurs in this situation.
Last Closed: 2010-10-14 16:14:16 UTC
Bug Blocks: 528800
Attachments:
patch for zero quota (flags: none)

Description Jon Thomas 2010-08-27 17:26:38 UTC
Created attachment 441579: patch for zero quota

The negotiator will dump core if the config sets a quota of 0 for a group that has subgroups.

GROUP_QUOTA_DYNAMIC_a1.a3 = 0
GROUP_QUOTA_DYNAMIC_a1.a3.a3 = .2

This happens because the a1.a3 group is never added to the array of groups.

This is partly a configuration matter: if the admin sets

GROUP_QUOTA_DYNAMIC_a1.a3 = 0

they should also set

GROUP_QUOTA_DYNAMIC_a1.a3.a3 = 0

which would avoid the problem.

But an incorrect config shouldn't core the negotiator.
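
To make the problematic pattern concrete, here is a minimal sketch of a standalone config check that flags a group with a dynamic quota of 0 when one of its subgroups has a non-zero quota. This is not part of Condor or of the attached patch; the function names and the line-matching regex are assumptions made for illustration only.

```python
import re

# Assumed line shape: GROUP_QUOTA_DYNAMIC_<group> = <number>
QUOTA_LINE = re.compile(r"^\s*GROUP_QUOTA_DYNAMIC_([^\s=]+)\s*=\s*(\S+)\s*$")


def parse_dynamic_quotas(config_text):
    """Collect {group name: quota} from GROUP_QUOTA_DYNAMIC_* lines."""
    quotas = {}
    for line in config_text.splitlines():
        match = QUOTA_LINE.match(line)
        if match:
            quotas[match.group(1)] = float(match.group(2))
    return quotas


def find_zero_quota_conflicts(quotas):
    """Return (parent, child) pairs where parent is 0 but a descendant is not."""
    conflicts = []
    for parent, parent_quota in quotas.items():
        if parent_quota != 0:
            continue
        prefix = parent + "."
        for child, child_quota in quotas.items():
            if child.startswith(prefix) and child_quota != 0:
                conflicts.append((parent, child))
    return sorted(conflicts)


config = """
GROUP_QUOTA_DYNAMIC_a1.a3 = 0
GROUP_QUOTA_DYNAMIC_a1.a3.a3 = .2
"""
print(find_zero_quota_conflicts(parse_dynamic_quotas(config)))
```

With the two lines from this report the check reports the pair (a1.a3, a1.a3.a3); setting both quotas to 0 yields no conflicts.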

Comment 1 Jon Thomas 2010-08-27 17:31:53 UTC
Added a patch.  

The patch changes behavior: submitters to a group with a configured quota of 0 no longer fall into the bucket of non-group users. The flip side is that the tree isn't pruned.

Comment 2 Matthew Farrellee 2010-08-29 23:51:53 UTC
The semantic change is that if a group is explicitly set to 0, then all submitters to that group or any subgroup will get 0 quota, instead of being dropped in with the non-group submissions? That seems like a reasonable behavior to have anyway. It would be unexpected if a group with 0 quota was allowed to run jobs.

Comment 3 Jon Thomas 2010-08-30 12:59:31 UTC
The semantics with this patch are slightly different from comment #2.

a) If the group was listed in GROUP_NAMES, the quota would be set to zero and the jobs would not run.

b) If the group was not listed in GROUP_NAMES, the jobs would fall into the non-group submission bucket.


The original behavior was based upon allowing "unofficial groups" to run jobs in non-group space without having to change their submit description file (sdf) to unset AccountingGroup.

The patch causes "a" to happen, but not "b". However enforcing "b" is also a trivial patch since we can just throw out any ClassAd with a "." when the ClassAds are attached to groupArray[0].
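
One reading of cases (a) and (b), plus the optional filter mentioned above, can be sketched as a small routing function. This is an illustration only, not the negotiator's code: it assumes a submitter name has the shape "<group>.<user>" with no domain suffix, uses an empty string as a hypothetical label for the non-group bucket (groupArray[0]), and the `strict` flag is an invented name for the "throw out any ClassAd with a '.'" idea.

```python
NON_GROUP = ""  # hypothetical label for the non-group bucket (groupArray[0])


def route_submitter(submitter, group_names, strict=False):
    """Return the bucket a submitter lands in, or None if it is thrown out.

    Assumes names look like "<group>.<user>"; everything before the last
    dot is treated as the group name.
    """
    group, dot, _user = submitter.rpartition(".")
    if dot and group in group_names:
        # Case (a): listed group; its quota (possibly 0) governs the jobs.
        return group
    if dot and strict:
        # Optional filter: drop dotted names headed for the non-group bucket.
        return None
    # Case (b): unlisted group or plain user goes to the non-group bucket.
    return NON_GROUP


groups = {"a1", "a1.a3"}
print(route_submitter("a1.a3.alice", groups))        # listed group
print(route_submitter("x9.bob", groups))             # non-group bucket
print(route_submitter("x9.bob", groups, strict=True))  # thrown out
```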

Comment 4 Matthew Farrellee 2010-09-09 14:16:27 UTC
Built in 7.4.4-0.11

Comment 6 Tomas Rusnak 2010-09-16 09:32:32 UTC
Reproduced on:

$CondorVersion: 7.4.4 Aug  9 2010 BuildID: RH-7.4.4-0.9.el5 PRE-RELEASE $
$CondorPlatform: I386-LINUX_RHEL5 $

Config:

GROUP_NAMES = a1.a3.a3
GROUP_QUOTA_DYNAMIC_a1.a3 = 0
GROUP_QUOTA_DYNAMIC_a1.a3.a3 = .2


09/16/10 05:30:52 Phase 2:  Performing accounting ...
09/16/10 05:30:52 group a1.a3.a3 dynamic quota for 8 slots = 0.200
09/16/10 05:30:52 Group Table : group a1.a3.a3 quota 0.200 usage 0.000 prio 0.00
09/16/10 05:30:52 negotiationtime: slots 8 group a1.a3.a3 autoregroup false
09/16/10 05:30:52 negotiationtime:sorting
09/16/10 05:30:52 Sort : sorting group vector
09/16/10 05:30:52 Sort : stage two
09/16/10 05:30:52 midsort : grouparray group  parent -1 child -1  left -1 right -1 i 0
09/16/10 05:30:52 midsort : grouparray group a1.a3.a3 parent -1 child -1  left -1 right -1 i 1
09/16/10 05:30:52 Sorted : grouparray group  parent -1 child -1  left -1 right -1 i 0
09/16/10 05:30:52 Sorted : grouparray group a1.a3.a3 parent -1 child -1  left -1 right -1 i 1
09/16/10 05:30:52 Sort : leaving
09/16/10 05:30:52 negotiationtime: finished sort - slots 8 group  auto true quota 1.000000 maxAllowed 8.000000 numsubmits 0 parent -1 child -1  left -1 right -1 i 0
09/16/10 05:30:52 negotiationtime: finished sort - slots 8 group a1.a3.a3 auto false quota 0.200000 maxAllowed 0.000000 numsubmits 0 parent -1 child -1  left -1 right -1 i 1
09/16/10 05:30:52 negotiationtime: finished inserting submitters - slots 8 group  quota 1.000000 maxAllowed 8.000000 numsubmits 0  i 0
09/16/10 05:30:52 negotiationtime: finished inserting submitters - slots 8 group a1.a3.a3 quota 0.200000 maxAllowed 0.000000 numsubmits 0  i 1
Stack dump for process 3833 at timestamp 1284629452 (8 frames)
condor_negotiator(dprintf_dump_stack+0x44)[0x80dc924]
condor_negotiator[0x80de764]
[0x635420]
condor_negotiator(_ZN12TimerManager7TimeoutEv+0x14b)[0x80dbe8b]
condor_negotiator(_ZN10DaemonCore6DriverEv+0x244)[0x80c4824]
condor_negotiator(main+0xd80)[0x80d8280]
/lib/libc.so.6(__libc_start_main+0xdc)[0x819e9c]
condor_negotiator[0x80a31b1]

Comment 7 Tomas Rusnak 2010-09-16 09:49:50 UTC
MasterLog:

09/16/10 05:46:26 The NEGOTIATOR (pid 4147) died due to signal 11 (Segmentation fault)

# ps ax | grep condor
 4073 ?        Ss     0:00 condor_master -pidfile /var/run/condor/condor_master.pid
 4074 ?        Ss     0:00 condor_collector -f
 4077 ?        Ss     0:00 condor_schedd -f
 4078 ?        Ss     0:05 condor_startd -f
 4079 ?        S      0:00 condor_procd -A /var/run/condor/procd_pipe.SCHEDD -S 60 -C 64
 4142 pts/0    S+     0:00 grep condor

Config:

# cat /etc/condor/config.d/zzz_condor_config.test 
CREATE_CORE_FILES=True
#ABORT_ON_EXCEPTION=True
MAX_HISTORY_LOG=300*1024*1024
MAX_HISTORY_ROTATIONS=10
ALL_DEBUG = D_FULLDEBUG

GROUP_NAMES = a1.a3.a3

GROUP_QUOTA_DYNAMIC_a1.a3 = 0
GROUP_QUOTA_DYNAMIC_a1.a3.a3 = .2

# tailf /var/log/condor/NegotiatorLog
09/16/10 05:45:20 negotiationtime: finished inserting submitters - slots 8 group a1.a3.a3 quota 0.200000 maxAllowed 0.000000 numsubmits 0  i 1
Stack dump for process 4130 at timestamp 1284630320 (8 frames)
condor_negotiator(dprintf_dump_stack+0x44)[0x80dd1c4]
condor_negotiator[0x80df004]
[0xe71420]
condor_negotiator(_ZN12TimerManager7TimeoutEv+0x14b)[0x80dc72b]
condor_negotiator(_ZN10DaemonCore6DriverEv+0x244)[0x80c50c4]
condor_negotiator(main+0xd80)[0x80d8b20]
/lib/libc.so.6(__libc_start_main+0xdc)[0x819e9c]
condor_negotiator[0x80a31f1]

# condor -v
$CondorVersion: 7.4.4 Sep 14 2010 BuildID: RH-7.4.4-0.13.el5 PRE-RELEASE $
$CondorPlatform: I386-LINUX_RHEL5 $

The issue is not fixed in the current packages. Could you check whether the patch was really included in the build?

Comment 8 Jon Thomas 2010-09-16 13:54:32 UTC
It's there.

I'll take a look at why this is failing. Perhaps more recent code changes broke it.

Comment 9 Jon Thomas 2010-09-16 14:50:34 UTC
It looks like you hit a different failure, based on:

GROUP_NAMES = a1.a3.a3

Try GROUP_NAMES = a1, a1.a3, a1.a3.a3

I'm looking at a way to fix this new issue.
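
As an aid for writing such a config by hand, the sketch below expands a list of group names so that every ancestor of a nested group is listed too. It is a hypothetical helper, not something Condor provides; the function name is made up for this example.

```python
def with_ancestors(group_names):
    """Return group names with every missing ancestor added, parents first."""
    expanded = []
    for name in group_names:
        parts = name.split(".")
        for depth in range(1, len(parts) + 1):
            ancestor = ".".join(parts[:depth])
            if ancestor not in expanded:
                expanded.append(ancestor)
    return expanded


# Expands the single nested name into the full chain suggested above.
print("GROUP_NAMES = " + ", ".join(with_ancestors(["a1.a3.a3"])))
```

For the GROUP_NAMES value used in the failing test this produces the list a1, a1.a3, a1.a3.a3.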

Comment 10 Matthew Farrellee 2010-09-16 16:05:37 UTC
For a new issue, create a new BZ. If it blocks testing of this BZ, set the dependencies.

Comment 11 Tomas Rusnak 2010-09-17 09:57:17 UTC
Negotiator crash confirmed on:

$CondorVersion: 7.4.4 Aug  9 2010 BuildID: RH-7.4.4-0.9.el4 PRE-RELEASE $
$CondorPlatform: I386-LINUX_RHEL4 $

Stack dump for process 29714 at timestamp 1284716851 (8 frames)
condor_negotiator(dprintf_dump_stack+0x3f)[0x80d409f]
condor_negotiator[0x80d43fa]
/lib/tls/libpthread.so.0[0x131c98]
condor_negotiator(_ZN12TimerManager7TimeoutEv+0xf6)[0x80d2126]
condor_negotiator(_ZN10DaemonCore6DriverEv+0x17e)[0x80b779e]
condor_negotiator(main+0x133e)[0x80cdf0e]
/lib/tls/libc.so.6(__libc_start_main+0xd3)[0xce1e93]
condor_negotiator(__gxx_personality_v0+0x149)[0x8097711]

Retested with the current packages (condor-7.4.4-0.14.el5) on all supported platforms: x86 and x86_64 on RHEL4 and RHEL5.

NegotiatorLog:

09/17/10 04:52:24 ---------- Started Negotiation Cycle ----------
09/17/10 04:52:24 Phase 1:  Obtaining ads from collector ...
09/17/10 04:52:24   Getting all public ads ...
09/17/10 04:52:24 Trying to query collector <IP>
09/17/10 04:52:24   Sorting 12 ads ...
09/17/10 04:52:24   Getting startd private ads ...
09/17/10 04:52:24 Trying to query collector <IP>
09/17/10 04:52:24 Got ads: 12 public and 8 private
09/17/10 04:52:24 Public ads include 0 submitter, 8 startd
09/17/10 04:52:24 Phase 1: numDynGroupSlots 8  untrimmedSlotWeightTotal 8.000000
09/17/10 04:52:24 Entering compute_significant_attrs()
09/17/10 04:52:24 Leaving compute_significant_attrs() - result=JobUniverse,LastCheckpointPlatform,NumCkpts
09/17/10 04:52:24 Phase 2:  Performing accounting ...
09/17/10 04:52:24 group a1 dynamic quota for 8 slots = 0.000
09/17/10 04:52:24 Group Table : group a1 quota 0.000 usage 0.000 prio nan
09/17/10 04:52:24 negotiationtime: slots 8 group a1 autoregroup false
09/17/10 04:52:24 group a1.a3 dynamic quota for 8 slots = 0.000
09/17/10 04:52:24 Group Table : group a1.a3 quota 0.000 usage 0.000 prio nan
09/17/10 04:52:24 negotiationtime: slots 8 group a1.a3 autoregroup false
09/17/10 04:52:24 group a1.a3.a3 dynamic quota for 8 slots = 0.200
09/17/10 04:52:24 Group Table : group a1.a3.a3 quota 0.200 usage 0.000 prio 0.00  
09/17/10 04:52:24 negotiationtime: slots 8 group a1.a3.a3 autoregroup false
09/17/10 04:52:24 negotiationtime:sorting
09/17/10 04:52:24 Sort : sorting group vector
09/17/10 04:52:24 Sorting : grouparray group a1.a3 parent -1 child -1  left -1 right -1 i 0
09/17/10 04:52:24 Sorting : grouparray group a1.a3.a3 parent -1 child -1  left -1 right -1 i 1
09/17/10 04:52:24 Sorting : grouparray group a1.a3.a3 parent -1 child -1  left -1 right -1 i 0
09/17/10 04:52:24 Sort : stage two
09/17/10 04:52:24 midsort : grouparray group  parent -1 child 1  left -1 right -1 i 0
09/17/10 04:52:24 midsort : grouparray group a1 parent 0 child 2  left -1 right -1 i 1
09/17/10 04:52:24 midsort : grouparray group a1.a3 parent 1 child 3  left -1 right -1 i 2
09/17/10 04:52:24 midsort : grouparray group a1.a3.a3 parent 2 child -1  left -1 right -1 i 3
09/17/10 04:52:24 Sorted : grouparray group  parent -1 child 1  left -1 right -1 i 0
09/17/10 04:52:24 Sorted : grouparray group a1 parent 0 child 2  left -1 right -1 i 1
09/17/10 04:52:24 Sorted : grouparray group a1.a3 parent 1 child 3  left -1 right -1 i 2
09/17/10 04:52:24 Sorted : grouparray group a1.a3.a3 parent 2 child -1  left -1 right -1 i 3
09/17/10 04:52:24 Sort : leaving
....
09/17/10 04:52:24 Group  - skipping, no submitters
09/17/10 04:52:24 Group a1 - skipping, no submitters
09/17/10 04:52:24 Group a1.a3 - skipping, no submitters
09/17/10 04:52:24 Group a1.a3.a3 - skipping, no submitters
09/17/10 04:52:24 Failed to match 0.000000 slots on iteration 1.
09/17/10 04:52:24 negotiationtime: finished  - slots 8 group  auto true quota 1.000000 maxAllowed 8.000000 nodemaxAllowed 0.000000 numsubmits 0 usage 0.000000
09/17/10 04:52:24 negotiationtime: finished  - slots 8 group a1 auto false quota 0.000000 maxAllowed 0.000000 nodemaxAllowed 0.000000 numsubmits 0 usage 0.000000
09/17/10 04:52:24 negotiationtime: finished  - slots 8 group a1.a3 auto false quota 0.000000 maxAllowed 0.000000 nodemaxAllowed 0.000000 numsubmits 0 usage 0.000000
09/17/10 04:52:24 negotiationtime: finished  - slots 8 group a1.a3.a3 auto false quota 0.200000 maxAllowed 0.000000 nodemaxAllowed 0.000000 numsubmits 0 usage 0.000000
09/17/10 04:52:24 ---------- Finished Negotiation Cycle ----------

# ps ax | grep condor
 7587 ?        Ss     0:00 condor_master -pidfile /var/run/condor/condor_master.pid
 7588 ?        Ss     0:00 condor_collector -f
 7590 ?        Ss     0:00 condor_negotiator -f
 7591 ?        Ss     0:00 condor_schedd -f
 7592 ?        Ss     0:05 condor_startd -f
 7593 ?        S      0:00 condor_procd -A /var/run/condor/procd_pipe.SCHEDD -S 60 -C 64

No regression found on current packages.

>>> VERIFIED

Comment 12 Matthew Farrellee 2010-09-21 19:37:30 UTC
Comment 6 and Comment 7 -> Bug 636271

Comment 13 Martin Prpič 2010-10-07 16:14:18 UTC
    Technical note added. If any revisions are required, please edit the "Technical Notes" field
    accordingly. All revisions will be proofread by the Engineering Content Services team.
    
    New Contents:
An incorrect configuration of the negotiator resulted in a segmentation fault. This occurred when the 'quota' variable was set to 0 for a group that had supgroups. With this update, the segmentation fault no longer occurs in this situation.

Comment 14 Florian Nadge 2010-10-08 10:22:18 UTC
    Technical note updated. If any revisions are required, please edit the "Technical Notes" field
    accordingly. All revisions will be proofread by the Engineering Content Services team.
    
    Diffed Contents:
@@ -1 +1 @@
-An incorrect configuration of the negotiator resulted in a segmentation fault. This occurred when the 'quota' variable was set to 0 for a group that had supgroups. With this update, the segmentation fault no longer occurs in this situation.
+An incorrect configuration of the negotiator resulted in a segmentation fault. This occurred when the 'quota' variable was set to 0 for a group that had subgroups. With this update, the segmentation fault no longer occurs in this situation.

Comment 16 errata-xmlrpc 2010-10-14 16:14:16 UTC
An advisory has been issued which should help the problem
described in this bug report. This report is therefore being
closed with a resolution of ERRATA. For more information
on the solution and/or where to find the updated files,
please follow the link below. You may reopen this bug report
if the solution does not work for you.

http://rhn.redhat.com/errata/RHSA-2010-0773.html