Bug 617709 - fix hfs accountant stats
Summary: fix hfs accountant stats
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: Red Hat Enterprise MRG
Classification: Red Hat
Component: condor
Version: 1.2
Hardware: All
OS: Linux
Priority: medium
Severity: medium
Target Milestone: 1.3
Target Release: ---
Assignee: Erik Erlandson
QA Contact: Lubos Trilety
URL:
Whiteboard:
Depends On:
Blocks: 528800
Reported: 2010-07-23 19:16 UTC by Jon Thomas
Modified: 2010-10-14 16:11 UTC (History)

Fixed In Version:
Doc Type: Bug Fix
Doc Text:
Previously, HFS group names were handled incorrectly in the accountant. The group name was parsed from the beginning of the group string, which led to some stats being assigned to the wrong group. With this update, the group that the customer submitted against is parsed from the end of the customer name string.
Clone Of:
Environment:
Last Closed: 2010-10-14 16:11:40 UTC
Target Upstream Version:
Embargoed:


Attachments
fix for groupnames (9.27 KB, patch)
2010-07-23 19:16 UTC, Jon Thomas
same patch ported to condor-7_4_4-0_4_el5 (8.40 KB, patch)
2010-08-03 21:05 UTC, Jon Thomas


Links
System ID Private Priority Status Summary Last Updated
Red Hat Product Errata RHSA-2010:0773 0 normal SHIPPED_LIVE Moderate: Red Hat Enterprise MRG Messaging and Grid Version 1.3 2010-10-14 15:56:44 UTC

Description Jon Thomas 2010-07-23 19:16:49 UTC
Created attachment 434042 [details]
fix for groupnames

HFS group names are handled incorrectly in the accountant. The group name is parsed from the beginning of the group string, so some stats are assigned to the wrong group.

The fix is to parse from the end of the customer name string to obtain the group that the customer submitted against.
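
The difference between the two parsing directions can be sketched as follows. This is not the attached patch, only a minimal illustration. It assumes a customer name of the form "<group path>.<user>" (for example "a1.a2.a1.user") with no "@domain" suffix, and the function names groupFromStart and groupFromEnd are hypothetical.

```cpp
#include <string>

// Behaviour described as the bug: cut at the FIRST '.', so
// "a1.a2.a1.user" is accounted to "a1".
// Returns an empty string when the name contains no '.'.
std::string groupFromStart(const std::string& customer) {
    const std::string::size_type dot = customer.find('.');
    return dot == std::string::npos ? std::string() : customer.substr(0, dot);
}

// Behaviour described as the fix: cut at the LAST '.', so only the
// trailing user component is dropped and the full group path is kept.
// Returns an empty string when the name contains no '.'.
std::string groupFromEnd(const std::string& customer) {
    const std::string::size_type dot = customer.rfind('.');
    return dot == std::string::npos ? std::string() : customer.substr(0, dot);
}
```

Under these assumptions, groupFromStart("a1.a2.a1.user") yields "a1", while groupFromEnd("a1.a2.a1.user") yields "a1.a2.a1".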

Comment 1 Jon Thomas 2010-08-03 19:34:28 UTC
The easiest way to see this bug is to start a set of jobs that use a few levels of groups, such as a1.a2.a1 and a1.a2.a2, and turn on D_ACCOUNTANT. You may want to clean the accountant log first.

Let the system spin up and look at the second or later iteration of negotiation (after jobs get matched and run). With the bug you will see something like:


08/03 14:42:51 group a1 dynamic quota for 41 slots = 0.400
08/03 14:42:51 Group Table : group a1 quota 0.400 usage 8.000 prio 2000.00
08/03 14:42:51 negotiationtime: slots 41 group a1 autoregroup true
08/03 14:42:51 group b1 dynamic quota for 41 slots = 0.400
08/03 14:42:51 Group Table : group b1 quota 0.400 usage 9.000 prio 2250.00
08/03 14:42:51 negotiationtime: slots 41 group b1 autoregroup true
08/03 14:42:51 group a1.a2 dynamic quota for 41 slots = 0.200
08/03 14:42:51 Group Table : group a1.a2 quota 0.200 usage 0.000 prio 0.00
08/03 14:42:51 negotiationtime: slots 41 group a1.a2 autoregroup false
08/03 14:42:51 group a1.a1 dynamic quota for 41 slots = 0.200
08/03 14:42:51 Group Table : group a1.a1 quota 0.200 usage 0.000 prio 0.00
08/03 14:42:51 negotiationtime: slots 41 group a1.a1 autoregroup false
08/03 14:42:51 group a1.a3 dynamic quota for 41 slots = 0.600
08/03 14:42:51 Group Table : group a1.a3 quota 0.600 usage 0.000 prio 0.00
08/03 14:42:51 negotiationtime: slots 41 group a1.a3 autoregroup true
08/03 14:42:51 group a1.a3.a1 dynamic quota for 41 slots = 0.400
08/03 14:42:51 Group Table : group a1.a3.a1 quota 0.400 usage 0.000 prio 0.00
08/03 14:42:51 negotiationtime: slots 41 group a1.a3.a1 autoregroup true
08/03 14:42:51 group a1.a3.a2 dynamic quota for 41 slots = 0.200
08/03 14:42:51 Group Table : group a1.a3.a2 quota 0.200 usage 0.000 prio 0.00
08/03 14:42:51 negotiationtime: slots 41 group a1.a3.a2 autoregroup false
08/03 14:42:51 group a1.a3.a3 dynamic quota for 41 slots = 0.200
08/03 14:42:51 Group Table : group a1.a3.a3 quota 0.200 usage 0.000 prio 0.00
08/03 14:42:51 negotiationtime: slots 41 group a1.a3.a3 autoregroup false
08/03 14:42:51 group b1.b2 dynamic quota for 41 slots = 0.600
08/03 14:42:51 Group Table : group b1.b2 quota 0.600 usage 0.000 prio 0.00
08/03 14:42:51 negotiationtime: slots 41 group b1.b2 autoregroup true
08/03 14:42:51 group b1.b1 dynamic quota for 41 slots = 0.400
08/03 14:42:51 Group Table : group b1.b1 quota 0.400 usage 0.000 prio 0.00


Note that only the top-tier groups have usage and prio values above zero. That is because the group name is truncated after the first ".". Hence values for a1.a2.a1 get recorded for a1. In the fixed version you will see usage and prio values for each group name that reflect its own usage/prio.

This is the easiest way to observe the bug. It also affects the addmatch and removematch statistics handling and changes behavior later in negotiation.
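
To make the effect on the Group Table concrete, here is a small sketch that tallies one unit of usage per running job under both parsing rules. It is an illustration only, not the accountant's code. The names groupOf and tallyUsage are hypothetical, and the customer names are assumed to have the form "<group path>.<user>".

```cpp
#include <map>
#include <string>
#include <vector>

// Assumption: a customer name has the form "<group path>.<user>".
// fromEnd == false mimics truncation at the first '.';
// fromEnd == true drops only the trailing user component.
std::string groupOf(const std::string& customer, bool fromEnd) {
    const std::string::size_type dot =
        fromEnd ? customer.rfind('.') : customer.find('.');
    return dot == std::string::npos ? std::string() : customer.substr(0, dot);
}

// Charge one unit of usage per job to the group that the job's
// customer name resolves to under the chosen parsing rule.
std::map<std::string, int> tallyUsage(const std::vector<std::string>& customers,
                                      bool fromEnd) {
    std::map<std::string, int> usage;
    for (const std::string& c : customers) {
        ++usage[groupOf(c, fromEnd)];
    }
    return usage;
}
```

With jobs from "a1.a2.a1.user", "a1.a2.a2.user" and "b1.b1.user", the first-dot rule charges everything to a1 and b1, while the last-dot rule charges a1.a2.a1, a1.a2.a2 and b1.b1 separately, which matches the pattern in the log above.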

Comment 2 Jon Thomas 2010-08-03 21:05:11 UTC
Created attachment 436380 [details]
same patch ported to condor-7_4_4-0_4_el5

Comment 3 Erik Erlandson 2010-08-07 00:02:06 UTC
I ran a config with the following:

#####################
# HFS testing related parameters
#####################
NEGOTIATOR_DEBUG = D_FULLDEBUG | D_ACCOUNTANT
SCHEDD_INTERVAL	= 15
NEGOTIATOR_USE_SLOT_WEIGHTS = FALSE

NUM_CPUS = 100

GROUP_NAMES = a1, b1, a1.a1, a1.a2, a1.a2.a1, a1.a2.a2, b1.b2, b1.b1

GROUP_QUOTA_DYNAMIC_a1 = .4
GROUP_QUOTA_DYNAMIC_b1 = .4
GROUP_QUOTA_DYNAMIC_a1.a1 = .5
GROUP_QUOTA_DYNAMIC_a1.a2 = .5
GROUP_QUOTA_DYNAMIC_b1.b1 = .5
GROUP_QUOTA_DYNAMIC_b1.b2 = .5
GROUP_QUOTA_DYNAMIC_a1.a2.a1 = .5
GROUP_QUOTA_DYNAMIC_a1.a2.a2 = .5




And ran this submission:

universe = vanilla
executable = /bin/sleep
arguments = 10m
+AccountingGroup = "b1.b1.user"
queue 100
+AccountingGroup = "b1.b2.user"
queue 100
+AccountingGroup = "a1.a2.a1.user"
queue 100
+AccountingGroup = "a1.a2.a2.user"
queue 100
+AccountingGroup = "zz_none.user"
queue 100


I let the system spool up for a few iterations, and I'm seeing this:

08/06 16:22:33 Group Table : group a1 quota 0.400 usage 0.000 prio 0.00
08/06 16:22:33 Group Table : group b1 quota 0.400 usage 8.000 prio 2000.00
08/06 16:22:33 Group Table : group a1.a1 quota 0.500 usage 0.000 prio 0.00
08/06 16:22:33 Group Table : group a1.a2 quota 0.500 usage 0.000 prio 0.00
08/06 16:22:33 Group Table : group a1.a2.a1 quota 0.500 usage 10.000 prio 2000.00
08/06 16:22:33 Group Table : group a1.a2.a2 quota 0.500 usage 10.000 prio 2000.00
08/06 16:22:33 Group Table : group b1.b2 quota 0.500 usage 20.000 prio 4000.00
08/06 16:22:33 Group Table : group b1.b1 quota 0.500 usage 20.000 prio 4000.00


I notice that group 'b1' seems to have a nonzero usage (8.0), but 'a1' has a zero usage, which seems inconsistent, because both groups have subgroups with submissions running (and neither has any submissions directly against it).

Comment 4 Jon Thomas 2010-08-09 13:06:25 UTC
I'm assuming results are with the patch. They are strange results. Did you nuke the accountant log?

Other than that the only thing I see is perhaps some code needs to be added in Initialize() when it uses "thisUser".

char const *thisUser = &(key[CustomerRecord.Length()]);

I have no idea what form &(key[CustomerRecord.Length()]) is in, but apparently it's matched against the groupnamelist. This code doesn't do any filtering or string trimming of thisUser, and matching it directly against the groupnamelist seems odd.
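
For readers unfamiliar with the idiom: taking the address of the character at index prefix-length yields a pointer to whatever follows the prefix in the key. The sketch below shows this with std::string. The prefix value "Customer." and the helper name suffixAfterPrefix are assumptions for illustration; the report does not state the real prefix or key layout.

```cpp
#include <string>

// Hypothetical stand-in for the record prefix; the real value is not
// given in this report.
const std::string kCustomerPrefix = "Customer.";

// Return a pointer to the part of `key` that follows the prefix,
// mirroring &(key[prefix.length()]). No trimming or validation of the
// suffix is done, which is the concern raised above.
// The pointer refers to key's own storage, so `key` must outlive it.
// Keys shorter than the prefix yield an empty string.
const char* suffixAfterPrefix(const std::string& key) {
    if (key.size() < kCustomerPrefix.size()) {
        return "";
    }
    return key.c_str() + kCustomerPrefix.size();
}
```

Under these assumptions, a key of "Customer.a1.a2.a1.user" gives "a1.a2.a1.user", which still includes the user component and so would not equal any entry in a list of plain group names.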

The older code truncated at the first ".", so I'm not sure why this might produce values for b1 but not for a1 or a1.a2. On top of that, why is b1's usage 8 when b1.b1 and b1.b2 sum to 40?

I'll have to see if I can repro your results.

Comment 5 Jon Thomas 2010-08-09 14:07:37 UTC
Well, I can't duplicate your results with the patch. I get the expected "0" value for both b1 and a1.

Comment 6 Erik Erlandson 2010-08-09 15:13:57 UTC
Verified that patch works when acct log is actually nuked (note to self: acct log lives in 'spool', not 'log')

Pushed branch V7_4-BZ617709-accountant-subgroup-stats to the FH repo.

Comment 7 Lubos Trilety 2010-08-19 13:46:48 UTC
Reproduced with (version):
condor-7.4.4-0.8

08/19/10 09:25:14 Group Table : group a1 quota 0.400 usage 20.000 prio 5000.00
08/19/10 09:25:14 Group Table : group b1 quota 0.400 usage 40.000 prio 10000.00
08/19/10 09:25:14 Group Table : group a1.a1 quota 0.500 usage 0.000 prio 0.00
08/19/10 09:25:14 Group Table : group a1.a2 quota 0.500 usage 0.000 prio 0.00
08/19/10 09:25:14 Group Table : group a1.a2.a1 quota 0.500 usage 0.000 prio 0.00
08/19/10 09:25:14 Group Table : group a1.a2.a2 quota 0.500 usage 0.000 prio 0.00
08/19/10 09:25:14 Group Table : group b1.b2 quota 0.500 usage 0.000 prio 0.00
08/19/10 09:25:14 Group Table : group b1.b1 quota 0.500 usage 0.000 prio 0.00

Comment 8 Lubos Trilety 2010-08-19 14:25:38 UTC
Tested with (version):
condor-7.4.4-0.9

Tested on:
RHEL5 i386,x86_64  - passed
RHEL4 i386,x86_64  - passed

>>> VERIFIED

Comment 9 Florian Nadge 2010-10-07 17:19:15 UTC
    Technical note added. If any revisions are required, please edit the "Technical Notes" field
    accordingly. All revisions will be proofread by the Engineering Content Services team.
    
    New Contents:
Previously, HFS group names were handled incorrectly in the accountant. The group name was parsed from the beginning of the group string, which led to some stats being assigned to the wrong group. With this update, the group that the customer submitted against is parsed from the end of the customer name string.

Comment 11 errata-xmlrpc 2010-10-14 16:11:40 UTC
An advisory has been issued which should help the problem
described in this bug report. This report is therefore being
closed with a resolution of ERRATA. For more information
on the solution and/or where to find the updated files,
please follow the link below. You may reopen this bug report
if the solution does not work for you.

http://rhn.redhat.com/errata/RHSA-2010-0773.html

