673592 – Accountant does not properly recompute group usages on reconfig

Summary: Accountant does not properly recompute group usages on reconfig

Keywords:
Status:	CLOSED ERRATA
Alias:	None
Product:	Red Hat Enterprise MRG
Classification:	Red Hat
Component:	condor
Sub Component:
Version:	1.3
Hardware:	All
OS:	All
Priority:	high
Severity:	high
Target Milestone:	2.0
Target Release:	---
Assignee:	Erik Erlandson
QA Contact:	Lubos Trilety
Docs Contact:
URL:
Whiteboard:
Depends On:
Blocks:	693778

Reported:	2011-01-28 20:39 UTC by Erik Erlandson
Modified:	2018-11-14 14:24 UTC
CC List:	8 users
Fixed In Version:	condor-7.5.6-0.1
Doc Type:	Bug Fix
Doc Text:	Cause: Logic in ReportState() did not correctly handle lookup of accounting group entries on reconfig or restart. Consequence: Entries got improperly reset to zero. Fix: Logic was updated to properly handle both accounting groups and submitter names. Result: Group resource usages are properly preserved on reconfig or restart.
Clone Of:
Environment:
Last Closed:	2011-06-23 15:36:32 UTC
Target Upstream Version:
Embargoed:

Attachments
fixup group usage on reconfig (2.30 KB, patch) 2011-02-18 21:35 UTC, Jon Thomas
rework of patch (1.58 KB, patch) 2011-02-21 20:17 UTC, Jon Thomas
Patch that includes logic for userprio totals and handling "defunct" groups on reconfig (15.53 KB, patch) 2011-02-21 22:44 UTC, Erik Erlandson

Links
System	ID	Private	Priority	Status	Summary	Last Updated
Red Hat Product Errata	RHEA-2011:0889	0	normal	SHIPPED_LIVE	Red Hat Enterprise MRG Grid 2.0 Release	2011-06-23 15:35:53 UTC

Description Erik Erlandson 2011-01-28 20:39:46 UTC

Description of problem:

On a negotiator reconfig, the usage values for accounting groups are improperly set to zero.

How reproducible: 100%


Steps to Reproduce:
1. Start with simple HFS config:
NUM_CPUS = 2
GROUP_NAMES = a, b
GROUP_QUOTA_DYNAMIC_a = 0.5
GROUP_QUOTA_DYNAMIC_b = 0.5


2. Submit a job against an accounting group "a":
% echo -e "universe=vanilla\ncmd=/bin/sleep\nargs=1d\n+AccountingGroup=\"a.user1\"\nqueue 1\n" | condor_submit


3. observe user-priorities:
% condor_userprio -all
Last Priority Update:  1/28 12:09
                                    Effective   Real     Priority   Res   Total Usage       Usage            Last      
User Name                           Priority  Priority    Factor    Used (wghted-hrs)    Start Time       Usage Time   
------------------------------      --------- -------- ------------ ----  ----------- ---------------- ----------------
a                                        0.50     0.50         1.00    1         0.03  1/28/2011 12:08  1/28/2011 12:09
a.user1@localdomain                      0.50     0.50         1.00    1         0.03  1/28/2011 12:08  1/28/2011 12:09
------------------------------      --------- -------- ------------ ----  ----------- ---------------- ----------------
Number of users: 2                                                     2         0.07  1/28/2011 12:08  1/27/2011 12:09


4. Reconfig:
% condor_reconfig -sub negotiator

5. Observe that usage for acct group "a" got improperly reset to zero:
[eje@rorschach gt986]$ condor_userprio -all
Last Priority Update:  1/28 12:10
                                    Effective   Real     Priority   Res   Total Usage       Usage            Last      
User Name                           Priority  Priority    Factor    Used (wghted-hrs)    Start Time       Usage Time   
------------------------------      --------- -------- ------------ ----  ----------- ---------------- ----------------
a                                        0.50     0.50         1.00    0         0.04  1/28/2011 12:08  1/28/2011 12:10
a.user1@localdomain                      0.50     0.50         1.00    1         0.05  1/28/2011 12:08  1/28/2011 12:10
------------------------------      --------- -------- ------------ ----  ----------- ---------------- ----------------
Number of users: 2                                                     1         0.09  1/28/2011 12:08  1/27/2011 12:10

  
Actual results:
(see above)

Expected results:
usage for group "a" should be (1), as it was prior to reconfig.

Tangentially, it also appears that total usage sum in userprio output is improperly counting accounting groups twice, so it is reading too high: should fix that while we're at it.
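
The comparison in steps 3 and 5 can be scripted. The following is a minimal sketch, not part of Condor: `res_used_by_name` and `group_usage_preserved` are hypothetical helper names, and the parser assumes the column layout of the `condor_userprio -all` output shown above, where the fifth whitespace-separated field of a data row is "Res Used".

```python
def res_used_by_name(userprio_text):
    """Map each 'User Name' row of `condor_userprio -all` text to its 'Res Used' value.

    Assumes the layout shown in this report: name, effective priority,
    real priority, priority factor, res used, total usage, then dates.
    """
    usage = {}
    for line in userprio_text.splitlines():
        fields = line.split()
        if len(fields) < 6:
            continue  # blank lines and the "Last Priority Update" line
        name = fields[0]
        # Skip separator rows, the two header rows, and the totals row.
        if name.startswith("-") or name in ("User", "Last", "Number"):
            continue
        try:
            usage[name] = int(fields[4])
        except ValueError:
            continue  # header row whose fifth field is not a number
    return usage


def group_usage_preserved(before_text, after_text, group):
    """True if 'Res Used' for `group` is the same before and after a reconfig."""
    before = res_used_by_name(before_text)
    after = res_used_by_name(after_text)
    return before.get(group) == after.get(group)
```

With the two outputs captured above, `group_usage_preserved(before, after, "a")` would be false on an affected build, because "Res Used" for group "a" drops from 1 to 0.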

Comment 1 Jon Thomas 2011-02-18 21:35:38 UTC

Created attachment 479614 [details]
fixup group usage on reconfig

I commented out the dprintf's in here and left them in case you might want to debug a bit. You won't want to put any dprintf in there permanently, as it looks like the function is called against every resource classad for each group.

So far this seems to work, but I need to test the deep case such as:

a
a.b
a.b.c
a.b.c.user1

Comment 2 Jon Thomas 2011-02-21 20:17:09 UTC

Created attachment 479988 [details]
rework of patch

Here is the same functionality, but much simplified.

Comment 3 Jon Thomas 2011-02-21 21:52:51 UTC

If I submit jobs (say 10 jobs and I get the slots) as a.b.c.user1, accounting group usage information will be

a (not in userprio because it's not in the accountant log)
a.b (not in userprio because it's not in the accountant log)
a.b.c = 10
a.b.c.user1 = 10

The accountant code doesn't track the fact that usage for a and a.b is also 10. HFS code doesn't use these values from the accountant. It does use the usage value of any group if it has submitters attached to it. In the example, the code relies on a.b.c usage numbers. The bug here is that the reconfig triggered the accountant to throw away the usage in the accountant log because it could not pattern match a groupname against the resource's name string. This is new in 1.3.2 because groupnames are appended to the username list in initialize. ReportState is then called with a groupname as a customer name.

Condor_userprio needs to be fixed up. It would be useful if the accountant tracked usage in the hierarchy. I think we are sort of halfway there. The totals line should be fixed too.
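
To illustrate the matching problem described above, here is a minimal Python sketch. It is not the attached patch and not the accountant's C++ code; `accounting_group_of`, `customer_matches`, and `resources_used` are hypothetical names. It assumes that submitter names contain "@" and group names do not (as in the userprio output in the description), and that the last dot-separated component before "@" is the user. A real implementation would check against the configured group list, since a user name may itself contain dots.

```python
def accounting_group_of(remote_user):
    """Return the group part of a submitter name, or None if it has no group.

    Example (assumed naming scheme): 'a.b.c.user1@localdomain' -> 'a.b.c'.
    """
    name = remote_user.split("@", 1)[0]
    group, sep, _user = name.rpartition(".")
    return group if sep else None


def customer_matches(customer, remote_user):
    """Decide whether a resource claimed by `remote_user` counts toward `customer`.

    `customer` may be a submitter name or a group name, because group names
    are also in the customer list when usage is recomputed.
    """
    if "@" in customer:
        # Submitter: compare the full name.
        return customer == remote_user
    # Group: compare against the submitter's own group only (no roll-up).
    return accounting_group_of(remote_user) == customer


def resources_used(customer, remote_users):
    """Count the claimed resources that belong to `customer`."""
    return sum(1 for user in remote_users if customer_matches(customer, user))
```

With ten slots claimed by "a.b.c.user1@localdomain", `resources_used` gives 10 for both "a.b.c" and "a.b.c.user1@localdomain", and 0 for "a" and "a.b", matching the numbers listed above.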

Comment 4 Jon Thomas 2011-02-21 22:16:13 UTC

upstream: https://condor-wiki.cs.wisc.edu/index.cgi/tktview?tn=1923

Comment 5 Erik Erlandson 2011-02-21 22:36:34 UTC

> 
> The accountant code doesn't track the fact that usage for a and a.b is also 10.

The current policy is that a parent group's usage does not include the usages of its children.  The reasoning was that jobs can be submitted against a parent independently of its children, and it also keeps things analogous with the HFS negotiation logic, where parents and children are all treated essentially equally.  (The only places where "parent" and "child" matter are the assignment of quotas and the distribution of surplus.)

Comment 6 Erik Erlandson 2011-02-21 22:42:06 UTC

I pushed a branch: V7_4-BZ673592-reconfig-group-usage

This branch actually has two separate commits:  the 1st is nearly identical to jrt's updated patch (and is under his name).   

The 2nd is an enhancement that includes logic for condor_userprio to provide correct total values which do not incorrectly count acct-group stats twice.  It also has logic that allows the accountant/userprio to gracefully handle the corner-case of a reconfig/restart where the group-list changes and acct groups may no longer be defined.

I'm also attaching a patch that includes both of these changes.
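
As an illustration of the totals fix only (not the code in the branch), the double counting can be avoided by leaving accounting-group rows out of the sum. In this Python sketch, `userprio_totals` and its row format are hypothetical, and the set of group names is assumed to be known to the caller.

```python
def userprio_totals(rows, group_names):
    """Sum 'Res Used' and weighted hours over submitter rows only.

    rows: iterable of (name, res_used, weighted_hours) tuples.
    group_names: names of accounting groups; their rows repeat usage that is
    already counted on their submitters, so they are skipped.
    """
    total_res = 0
    total_hours = 0.0
    for name, res_used, hours in rows:
        if name in group_names:
            continue  # group row: already represented by its submitters
        total_res += res_used
        total_hours += hours
    return total_res, total_hours


# Rows taken from step 3 of the description.
rows = [("a", 1, 0.03), ("a.user1@localdomain", 1, 0.03)]
totals = userprio_totals(rows, {"a"})
```

For the step 3 data this gives one resource in use, where the original totals line reported 2.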

Comment 7 Erik Erlandson 2011-02-21 22:44:02 UTC

Created attachment 480013 [details]
Patch that includes logic for userprio totals and handling "defunct" groups on reconfig

Comment 8 Jon Thomas 2011-02-22 14:15:06 UTC

>The current policy is that a parent group's usage does not include the usages
>of its children.

It was never changed when we went from flat groups to HFS, but it probably should have been. Condor_userprio was already broken at the time, so making condor_userprio changes wasn't a priority in the HFS transition.

Nevermind if the data would be used in the negotiator, the output of condor_userprio would be more meaningful if one could see the usage in the entire hierarchy. As it stands, one has to grep for this information or sit down with a calculator and figure it out from condor_userprio.

Comment 9 Erik Erlandson 2011-02-22 14:29:31 UTC

(In reply to comment #8)
 
> Nevermind if the data would be used in the negotiator, the output of
> condor_userprio would be more meaningful if one could see the usage in the
> entire hierarchy.

We could file an RFE to have condor_userprio report the group usages in that way for the user, if it would be helpful.   I'm pretty sure that the internal accountant semantic needs to remain as-is, or it would break the negotiation logic that checks against group usage.
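
A display-only roll-up along those lines could be computed from the existing per-group values without changing the accountant's semantics. The following Python sketch is an assumption about how such an RFE might work, not an existing condor_userprio feature; `rolled_up_usage` is a hypothetical name.

```python
from collections import defaultdict


def rolled_up_usage(group_usage):
    """Add each group's own usage to the group itself and to every ancestor.

    group_usage: dict mapping a dotted group name to its own (non-inclusive)
    usage, which is what the accountant tracks under the current policy.
    """
    totals = defaultdict(int)
    for group, used in group_usage.items():
        parts = group.split(".")
        # 'a.b.c' contributes to 'a', 'a.b', and 'a.b.c'.
        for depth in range(1, len(parts) + 1):
            totals[".".join(parts[:depth])] += used
    return dict(totals)


# Example from comment 3: ten slots used under a.b.c.
hierarchy = rolled_up_usage({"a.b.c": 10})
# hierarchy == {"a": 10, "a.b": 10, "a.b.c": 10}
```

The negotiator would keep reading the non-inclusive values; only the report would show the inclusive ones.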

Comment 10 Erik Erlandson 2011-02-22 18:36:29 UTC

Ported V7_4-BZ673592-reconfig-group-usage to upstream V7_6-branch (and merged to master). Changed status of upstream #1923 to resolved.

Comment 13 Lubos Trilety 2011-04-26 14:16:16 UTC

Tested with:
condor-7.6.1-0.1

Tested on:
RHEL5 x86_64,i386  - passed
RHEL6 x86_64,i386  - passed


>>> VERIFIED

Comment 14 Erik Erlandson 2011-04-27 20:45:39 UTC

    Technical note added. If any revisions are required, please edit the "Technical Notes" field
    accordingly. All revisions will be proofread by the Engineering Content Services team.
    
    New Contents:
Cause
Logic in ReportState() did not correctly handle lookup of accounting group entries on reconfig or restart.

Consequence
Entries got improperly reset to zero.

Fix
Logic was updated to properly handle both accounting groups and submitter names.

Result
Group resource usages are properly preserved on reconfig or restart.

Comment 15 errata-xmlrpc 2011-06-23 15:36:32 UTC

An advisory has been issued which should resolve the problem
described in this bug report. This report is therefore being
closed with a resolution of ERRATA. For more information
on the solution and/or where to find the updated files,
please follow the link below. You may reopen this bug report
if the solution does not work for you.

http://rhn.redhat.com/errata/RHEA-2011-0889.html
