Description of problem:
On a negotiator reconfig, the usage values for accounting groups are improperly set to zero.

How reproducible:
100%

Steps to Reproduce:
1. Start with a simple HFS config:

NUM_CPUS = 2
GROUP_NAMES = a, b
GROUP_QUOTA_DYNAMIC_a = 0.5
GROUP_QUOTA_DYNAMIC_b = 0.5

2. Submit a job against accounting group "a":

% echo -e "universe=vanilla\ncmd=/bin/sleep\nargs=1d\n+AccountingGroup=\"a.user1\"\nqueue 1\n" | condor_submit

3. Observe user priorities:

% condor_userprio -all
Last Priority Update:  1/28 12:09
                               Effective     Real     Priority  Res Total Usage            Usage             Last
User Name                       Priority Priority       Factor Used (wghted-hrs)      Start Time       Usage Time
------------------------------ --------- -------- ------------ ---- ----------- ---------------- ----------------
a                                   0.50     0.50         1.00    1        0.03  1/28/2011 12:08  1/28/2011 12:09
a.user1@localdomain                 0.50     0.50         1.00    1        0.03  1/28/2011 12:08  1/28/2011 12:09
------------------------------ --------- -------- ------------ ---- ----------- ---------------- ----------------
Number of users: 2                                                2        0.07  1/28/2011 12:08  1/27/2011 12:09

4. Reconfig:

% condor_reconfig -sub negotiator

5. Observe that usage for acct group "a" got improperly reset to zero:

[eje@rorschach gt986]$ condor_userprio -all
Last Priority Update:  1/28 12:10
                               Effective     Real     Priority  Res Total Usage            Usage             Last
User Name                       Priority Priority       Factor Used (wghted-hrs)      Start Time       Usage Time
------------------------------ --------- -------- ------------ ---- ----------- ---------------- ----------------
a                                   0.50     0.50         1.00    0        0.04  1/28/2011 12:08  1/28/2011 12:10
a.user1@localdomain                 0.50     0.50         1.00    1        0.05  1/28/2011 12:08  1/28/2011 12:10
------------------------------ --------- -------- ------------ ---- ----------- ---------------- ----------------
Number of users: 2                                                1        0.09  1/28/2011 12:08  1/27/2011 12:10

Actual results:
(see above)

Expected results:
Usage for group "a" should be (1), as it was prior to reconfig.
Tangentially, it also appears that the total usage sum in the userprio output improperly counts accounting groups twice, so it reads too high. We should fix that while we're at it.
Created attachment 479614 [details]
fixup group usage on reconfig

I commented out the dprintf's in here and left them in case you want to debug a bit. You won't want to put any dprintf in there permanently, as it looks like the function is called against every resource classad for each group.

So far this seems to work, but I need to test the deep case, such as:

a
a.b
a.b.c
a.b.c.user1
Created attachment 479988 [details]
rework of patch

Here is the same functionality, but much simplified.
If I submit jobs (say 10 jobs, and I get the slots) as a.b.c.user1, the accounting group usage information will be:

a (not in userprio because it's not in the accountant log)
a.b (not in userprio because it's not in the accountant log)
a.b.c = 10
a.b.c.user1 = 10

The accountant code doesn't track the fact that usage for a and a.b is also 10. The HFS code doesn't use these values from the accountant. It does use the usage value of any group that has submitters attached to it. In the example, the code relies on the a.b.c usage numbers.

The bug here is that the reconfig triggered the accountant to throw away the usage in the accountant log, because it could not pattern match a group name against the resource's name string. This is new in 1.3.2 because group names are appended to the username list in initialize. ReportState is then called with a group name as a customer name.

condor_userprio needs to be fixed up. It would be useful if the accountant tracked usage in the hierarchy. I think we are sort of halfway there. The totals line should be fixed too.
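
To make the failure mode described above concrete, here is a minimal Python sketch. It is not the accountant's actual code: the function names, the idea of a flat list of remote-user strings, and the rule "the group is everything before the last dot of the part before @" are all assumptions made for illustration. It only shows why a lookup that expects a submitter name finds nothing when handed a group name, and how a group-aware lookup would count the same resource.

```python
def group_of(remote_user):
    """Return the accounting-group part of 'group.user@domain', or None.

    Assumption for this sketch: the group is everything before the last
    dot in the part of the name that precedes '@'.
    """
    name = remote_user.split("@", 1)[0]
    group, sep, _user = name.rpartition(".")
    return group if sep else None


def resources_used_naive(customer, remote_users):
    """Count resources by exact match only (hypothetical pre-fix behaviour)."""
    return sum(1 for ru in remote_users if ru == customer)


def resources_used_group_aware(customer, remote_users, group_names):
    """Count resources for either a group name or a submitter name."""
    if customer in group_names:
        # A group matches resources whose submitter sits directly in it.
        return sum(1 for ru in remote_users if group_of(ru) == customer)
    return sum(1 for ru in remote_users if ru == customer)


remote_users = ["a.user1@localdomain"]
groups = {"a", "b"}
print(resources_used_naive("a", remote_users))                # group lookup finds nothing
print(resources_used_group_aware("a", remote_users, groups))  # group lookup finds the slot
```

Note that in this sketch a group only matches submitters attached directly to it, so a resource claimed by a.b.c.user1 counts toward a.b.c but not toward a or a.b.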
upstream: https://condor-wiki.cs.wisc.edu/index.cgi/tktview?tn=1923
> > The accountant code doesn't track the fact that usage for a and a.b is also 10.

The current policy is that a parent group's usage does not include the usages of its children. The reasoning was that jobs can be submitted against a parent independently of its children, and it also keeps things analogous with the HFS negotiation logic, where parents and children are all treated essentially equally. (The only places where "parent" and "child" matter are the assignment of quotas and the distribution of surplus.)
I pushed a branch:
V7_4-BZ673592-reconfig-group-usage

This branch actually has two separate commits: the 1st is nearly identical to jrt's updated patch (and is under his name). The 2nd is an enhancement that includes logic for condor_userprio to provide correct total values which do not incorrectly count acct-group stats twice. It also has logic that allows the accountant/userprio to gracefully handle the corner-case of a reconfig/restart where the group-list changes and acct groups may no longer be defined.

I'm also attaching a patch that includes both of these changes.
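
As a rough illustration of the totals fix described above, here is a small Python sketch. It is not the code from the branch: the function name and the row layout are hypothetical. It assumes that accounting-group entries can be told apart from submitters by the absence of "@" in the name, as in the userprio output earlier in this report. A convenient side effect of that assumption is that entries for groups that are no longer defined are excluded as well.

```python
def userprio_totals(rows):
    """Sum resources used and weighted hours over submitter rows only.

    rows: iterable of (name, resources_used, weighted_hours) tuples.
    Assumption for this sketch: a name without '@' is an accounting-group
    entry, whose usage is already reflected in its submitters' rows.
    """
    total_resources = 0
    total_hours = 0.0
    for name, resources_used, weighted_hours in rows:
        if "@" not in name:
            continue  # skip group entries so they are not counted twice
        total_resources += resources_used
        total_hours += weighted_hours
    return total_resources, total_hours


rows = [
    ("a", 1, 0.03),
    ("a.user1@localdomain", 1, 0.03),
]
print(userprio_totals(rows))
```

With the two rows from the pre-reconfig output, this reports one resource in use rather than two.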
Created attachment 480013 [details] Patch that includes logic for userprio totals and handling "defunct" groups on reconfig
>The current policy is that a parent group's usage does not include the usages
>of its children.

It was never changed when we went from flat groups to HFS, but it probably should have been. condor_userprio was already broken at the time, so making condor_userprio changes wasn't a priority in the HFS transition.

Never mind whether the data would be used in the negotiator; the output of condor_userprio would be more meaningful if one could see the usage in the entire hierarchy. As it stands, one has to grep for this information or sit down with a calculator and figure it out from condor_userprio.
(In reply to comment #8)
> Nevermind if the data would be used in the negotiator, the output of
> condor_userprio would be more meaningful if one could see the usage in the
> entire hierarchy.

We could file an RFE to have condor_userprio report the group usages in that way for the user, if it would be helpful. I'm pretty sure that the internal accountant semantic needs to remain as-is, or it would break the negotiation logic that checks against group usage.
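
For what such a report could look like, here is a minimal Python sketch of a display-only roll-up. It is an illustration under assumptions, not a proposed patch: the function name and the input mapping of group name to direct usage are hypothetical. The accountant's own per-group values are left untouched, and the hierarchical sums are computed only for presentation.

```python
from collections import defaultdict


def rolled_up_usage(direct_usage):
    """Return usage per group including all descendants, for display only.

    direct_usage: mapping of dotted group name to the usage charged
    directly to that group (the value the accountant keeps today).
    """
    rolled = defaultdict(float)
    for group, usage in direct_usage.items():
        parts = group.split(".")
        # Credit the usage to the group itself and to every ancestor.
        for depth in range(1, len(parts) + 1):
            rolled[".".join(parts[:depth])] += usage
    return dict(rolled)


# The example from this report: 10 slots claimed by a.b.c.user1.
print(rolled_up_usage({"a.b.c": 10}))
```

For the a.b.c example this yields 10 for each of a, a.b, and a.b.c, which is the view asked for above, while negotiation would keep using the direct values.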
Ported V7_4-BZ673592-reconfig-group-usage to upstream V7_6-branch (and merged to master). Changed status of upstream #1923 to resolved.
Tested with:
condor-7.6.1-0.1

Tested on:
RHEL5 x86_64,i386 - passed
RHEL6 x86_64,i386 - passed

>>> VERIFIED
Technical note added. If any revisions are required, please edit the "Technical Notes" field accordingly. All revisions will be proofread by the Engineering Content Services team.

New Contents:
Cause
Logic in ReportState() did not correctly handle lookup of accounting group entries on reconfig or restart.

Consequence
Entries got improperly reset to zero.

Fix
Logic was updated to properly handle both accounting groups and submitter names.

Result
Group resource usages are properly preserved on reconfig or restart.
An advisory has been issued which should help the problem described in this bug report. This report is therefore being closed with a resolution of ERRATA. For more information on the solution and/or where to find the updated files, please follow the link below. You may reopen this bug report if the solution does not work for you.

http://rhn.redhat.com/errata/RHEA-2011-0889.html