Description of problem:

condor_status's totals code overflows.

Version-Release number of selected component (if applicable):

All, including at least 7.4.3-0.1

How reproducible:

100%

Steps to Reproduce:

1. Run a collector (condor_collector -t -f).

2. Create a dummy slot with lots of resources (4294967297 is 2^32 + 1):

   $ cat largeresource.ad
   MyType = "Machine"
   TargetType = "Job"
   Name = "many"
   Machine = "resources.test"
   State = "UNKNOWN"
   Activity = "UNKNOWN"
   MyAddress = "<1.2.3.4:5678>"
   OpSys = "TESTER"
   Arch = "WHOCARES"
   VirtualMemory = 4294967297
   Disk = 4294967297
   Memory = 4294967297
   KFlops = 4294967297
   Mips = 4294967297

3. Advertise the slot:

   $ condor_advertise UPDATE_STARTD_AD largeresource.ad

4. Run condor_status resources.test -total -server

Actual results:

$ condor_status resources.test -total -server
                 Machines Avail Memory Disk MIPS KFLOPS

 WHOCARES/TESTER        1     0      1    1    1      1
           Total        1     0      1    1    1      1

Expected results:

Memory = Disk = MIPS = KFLOPS = 4294967297

Additional info:

Repeat with 4294967295 (2^32 - 1)...

$ condor_status resources.test -total -server
                 Machines Avail Memory                 Disk MIPS KFLOPS

 WHOCARES/TESTER        1     0     -1 18446744073709551615   -1     -1
           Total        1     0     -1 18446744073709551615   -1     -1

Expected...

Memory = Disk = MIPS = KFLOPS = 4294967295
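For illustration only (this is not Condor code): both results above are consistent with the advertised values being cut down to 32 bits somewhere before they are totaled. The helpers truncate_to_int32 and widen_to_uint64 are hypothetical names made up for this sketch, and where exactly the truncation happens in condor_status is an assumption.

```cpp
#include <cassert>
#include <cstdint>
#include <cstring>

// Hypothetical helper: keep only the low 32 bits of a 64-bit value and
// reinterpret them as a signed 32-bit integer.
std::int32_t truncate_to_int32(std::int64_t value) {
    // Conversion to an unsigned type is well-defined (modulo 2^32).
    std::uint32_t low = static_cast<std::uint32_t>(value);
    std::int32_t result;
    std::memcpy(&result, &low, sizeof result);  // reinterpret the bit pattern
    return result;
}

// Hypothetical helper: store a signed 32-bit value in an unsigned 64-bit
// field, as happens when an int is assigned to a uint64_t total.
std::uint64_t widen_to_uint64(std::int32_t value) {
    return static_cast<std::uint64_t>(value);  // -1 becomes 2^64 - 1
}
```

With these helpers, 4294967297 comes out as 1 and 4294967295 as -1, and -1 stored in a uint64_t reads back as 18446744073709551615, which is the value shown in the Disk column.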
totals.h:

    long memory;
    uint64_t disk;
    long condor_mips;
    long kflops;

totals.cpp:

    if (!ad->LookupInteger(ATTR_MEMORY, attrMem))    { badAd = true; attrMem = 0; }
    if (!ad->LookupInteger(ATTR_DISK, attrDisk))     { badAd = true; attrDisk = 0; }
    if (!ad->LookupInteger(ATTR_MIPS, attrMips))     { badAd = true; attrMips = 0; }
    if (!ad->LookupInteger(ATTR_KFLOPS, attrKflops)) { badAd = true; attrKflops = 0; }

There is clearly a type mismatch going on here.
Using a resource size of 2147483647 (2^31 - 1), multiple times (Name attribute varies):

$ condor_advertise UPDATE_STARTD_AD largeresource.ad
$ condor_status resources.test -total -server
                 Machines Avail     Memory       Disk       MIPS     KFLOPS

 WHOCARES/TESTER        1     0 2147483647 2147483647 2147483647 2147483647
           Total        1     0 2147483647 2147483647 2147483647 2147483647

$ condor_advertise UPDATE_STARTD_AD largeresource.ad2
$ condor_status resources.test -total -server
                 Machines Avail     Memory       Disk       MIPS     KFLOPS

 WHOCARES/TESTER        2     0 4294967294 4294967294 4294967294 4294967294
           Total        2     0 4294967294 4294967294 4294967294 4294967294

$ condor_advertise UPDATE_STARTD_AD largeresource.ad3
$ condor_status resources.test -total -server
                 Machines Avail     Memory       Disk       MIPS     KFLOPS

 WHOCARES/TESTER        3     0 6442450941 6442450941 6442450941 6442450941
           Total        3     0 6442450941 6442450941 6442450941 6442450941

$ condor_advertise UPDATE_STARTD_AD largeresource.ad4
$ condor_status resources.test -total -server
                 Machines Avail     Memory       Disk       MIPS     KFLOPS

 WHOCARES/TESTER        4     0 8589934588 8589934588 8589934588 8589934588
           Total        4     0 8589934588 8589934588 8589934588 8589934588

$ condor_status
Name   OpSys  Arch   State   Activity LoadAv Mem        ActvtyTime

many2  TESTER WHOCAR UNKNOWN UNKNOWN  [???]  2147483647 [Unknown]
many3  TESTER WHOCAR UNKNOWN UNKNOWN  [???]  2147483647 [Unknown]
many4  TESTER WHOCAR UNKNOWN UNKNOWN  [???]  2147483647 [Unknown]
many   TESTER WHOCAR UNKNOWN UNKNOWN  [???]  2147483647 [Unknown]

                 Total Owner Claimed Unclaimed Matched Preempting Backfill

 WHOCARES/TESTER     0     0       0         0       0          0        0
           Total     0     0       0         0       0          0        0

(Omitted 4 malformed ads in computed attribute totals)
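A minimal sketch of the arithmetic behind those totals, assuming each slot's value fits in a signed 32-bit integer and the running sum is held in a 64-bit accumulator (which is presumably what a `long` total amounts to on the machine used for this test). total_wide is a hypothetical name, not a function from totals.cpp.

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

// Hypothetical helper: sum per-slot 32-bit values into a 64-bit accumulator,
// so the sum itself cannot wrap for any realistic number of slots.
std::int64_t total_wide(const std::vector<std::int32_t>& per_slot) {
    std::int64_t total = 0;
    for (std::int32_t value : per_slot) {
        total += value;  // each value is widened to 64 bits before the add
    }
    return total;
}
```

Feeding it one to four copies of 2147483647 gives 2147483647, 4294967294, 6442450941 and 8589934588, the same sequence condor_status printed.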
There is also a type mismatch in StartdRunTotal.
Originally reported on condor-users...

https://lists.cs.wisc.edu/archive/condor-users/2010-February/msg00078.shtml

The condor_startd might have an overflow itself and may be advertising negative KFlops values. It is worth finding out what

    condor_status -long | grep ^KFlop

returns in Pascal's pool.
V7_4-branch has a fix for KFLOPS, https://condor-wiki.cs.wisc.edu/index.cgi/tktview?tn=1771
Created attachment 526267: patch for condor_utils/totals.{cpp, h}
So there are a few problems here:

1. Integer quantities in ClassAds are signed and 32-bit, so we won't be able to advertise values larger than 2^31-1; as the example indicates, larger values are clamped when read in from a textual ad.

2. (Related: It would be nice if there were some convention for marking an ad as suspicious when parsing errors like the above happen. The ad in the example is reported as "malformed" only because the state and activity are invalid.)

3. (Also related: It would also be nice if ClassAd evaluation had a means to record that overflow had occurred.)

4. As Matt mentions, the actual code doing the totaling uses 32-bit ints in some places; one of these was fixed upstream for 7.4.

IMHO problems 1-3 should be addressed by a different BZ (there is a ticket upstream for issue #1). I've attached a patch for issue #4.
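To make issues 1 and 3 concrete, here is a rough sketch of clamping a parsed value to the signed 32-bit range while recording that clamping happened. ClampedInt and clamp_to_int32 are hypothetical names invented for this example; this is not how the ClassAd parser is actually written, and the flag is only one possible way to surface the overflow.

```cpp
#include <cassert>
#include <cstdint>
#include <limits>

// Hypothetical result type: the value that fits in 32 bits, plus a flag
// saying whether the input had to be clamped to get there.
struct ClampedInt {
    std::int32_t value;
    bool overflowed;
};

// Hypothetical helper: clamp a 64-bit parsed literal into the int32 range.
ClampedInt clamp_to_int32(std::int64_t parsed) {
    const std::int64_t lo = std::numeric_limits<std::int32_t>::min();
    const std::int64_t hi = std::numeric_limits<std::int32_t>::max();
    if (parsed > hi) return {static_cast<std::int32_t>(hi), true};
    if (parsed < lo) return {static_cast<std::int32_t>(lo), true};
    return {static_cast<std::int32_t>(parsed), false};
}
```

Under this sketch, 4294967297 clamps to 2147483647 with the flag set, while 2147483647 itself passes through with the flag clear, so a consumer such as condor_status could tell the two ads apart.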
Technical note added. If any revisions are required, please edit the "Technical Notes" field accordingly. All revisions will be proofread by the Engineering Content Services team.

New Contents:

C: Previous versions of the condor_status tool used a signed 32-bit integer to represent total disk space and memory available in a group of machines.
C: In large pools, this counter could overflow, resulting in negative or otherwise nonsensical results when running "condor_status -total".
F: This code now uses unsigned 64-bit integers to total machine attribute values.
R: It is thus significantly less likely to overflow.
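As a back-of-the-envelope check on "significantly less likely to overflow", the sketch below computes how many machines advertising a given value can be summed before an unsigned 64-bit total wraps. max_addends is a hypothetical helper written for this note, and it assumes a nonzero per-machine value.

```cpp
#include <cassert>
#include <cstdint>
#include <limits>

// Hypothetical helper: the largest count n such that n * per_machine still
// fits in a uint64_t. per_machine must be nonzero.
constexpr std::uint64_t max_addends(std::uint64_t per_machine) {
    return std::numeric_limits<std::uint64_t>::max() / per_machine;
}
```

For machines that each advertise the signed 32-bit maximum of 2147483647, the result is 8589934596, so the total has room for more than eight billion such machines. An unsigned accumulator does not help with negative inputs, though: a value of -1 added to a uint64_t total still shows up as a huge number.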
$ cat x
MyType = "Machine"
TargetType = "Job"
Name = "many"
Machine = "resources.test"
State = "UNKNOWN"
Activity = "UNKNOWN"
MyAddress = "<1.2.3.4:5678>"
OpSys = "TESTER"
Arch = "WHOCARES"
VirtualMemory = 4294967297
Disk = 4294967297
Memory = 4294967297
KFlops = 4294967297
Mips = 4294967297

32-bit RHEL 5/6:

$ condor_status -total -server
                 Machines Avail      Memory                Disk                MIPS              KFLOPS

     INTEL/LINUX        1     1         502    1801532622241792      24094766530560    5308583872823296
 WHOCARES/TESTER        1     0  2147483647 9223372032559808512 9223372032559808512 9223372032559808512
           Total        2     1 -2147483147 9225173565182050304 9223396127326339072 9228680620727599103

64-bit RHEL 5/6:

$ condor_status -total -server
                 Machines Avail     Memory       Disk       MIPS     KFLOPS

 WHOCARES/TESTER        1     0 2147483647 2147483647 2147483647 2147483647
    X86_64/LINUX        1     1        497     310360       6292    1213079
           Total        2     1 2147484144 2147794007 2147489939 2148696726

Using 4294967295:

32-bit RHEL 5/6:

$ condor_status -total -server
                 Machines Avail      Memory                Disk                MIPS              KFLOPS

     INTEL/LINUX        1     1         502    1800587729436672      24133421236224    6182326249717760
 WHOCARES/TESTER        1     0  2147483647 9223372032559808512 9223372032559808512 9223372032559808512
           Total        2     1 -2147483147 9225172620289245184 9223396165981044736 9229554363104493567

64-bit RHEL 5/6:

$ condor_status -total -server
                 Machines Avail     Memory       Disk       MIPS     KFLOPS

 WHOCARES/TESTER        1     0 2147483647 2147483647 2147483647 2147483647
    X86_64/LINUX        1     1        497     310144       7778    1122100
           Total        2     1 2147484144 2147793791 2147491425 2148605747

Using 2147483647:

32-bit RHEL 5/6:

$ condor_status -total -server
                 Machines Avail      Memory                Disk                MIPS              KFLOPS

     INTEL/LINUX        1     1         502    1800209772314624                   0                   0
 WHOCARES/TESTER        1     0  2147483647 9223372032559808512 9223372032559808512 9223372032559808512
           Total        2     1 -2147483147 9225172242332123136 9223372032559808512 9223372032559808512

(Omitted 1 malformed ads in computed attribute totals)

64-bit RHEL 5/6:

$ condor_status -total -server
                 Machines Avail     Memory       Disk       MIPS     KFLOPS

 WHOCARES/TESTER        1     0 2147483647 2147483647 2147483647 2147483647
    X86_64/LINUX        1     1        497     310064          0          0
           Total        2     1 2147484144 2147793711 2147483647 2147483647

(Omitted 1 malformed ads in computed attribute totals)

I think this issue is not fixed. There are negative numbers in the Memory column, and the values differ from those advertised with condor_advertise.

-> Assigned
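The negative Memory total on 32-bit is what a 32-bit signed accumulator would produce: `long` is 32 bits on i386 Linux and 64 bits on x86_64 Linux. A small sketch of that arithmetic follows; add_wrapping_int32 and add_int64 are hypothetical names for this example, and the wraparound is emulated with unsigned math because signed overflow is undefined behavior in C++.

```cpp
#include <cassert>
#include <cstdint>
#include <cstring>

// Hypothetical helper: add two values with 32-bit two's-complement
// wraparound, emulating a 32-bit signed total.
std::int32_t add_wrapping_int32(std::int32_t a, std::int32_t b) {
    // Unsigned arithmetic wraps modulo 2^32 by definition.
    std::uint32_t sum = static_cast<std::uint32_t>(a) + static_cast<std::uint32_t>(b);
    std::int32_t result;
    std::memcpy(&result, &sum, sizeof result);  // reinterpret the bit pattern
    return result;
}

// Hypothetical helper: the same addition with a 64-bit total.
std::int64_t add_int64(std::int64_t a, std::int64_t b) {
    return a + b;
}
```

With the values from the output above, 2147483647 + 502 wraps to -2147483147 in 32 bits (the 32-bit Memory total), while 2147483647 + 497 is 2147484144 in 64 bits (the 64-bit Memory total).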
I've used condor-7.6.4-0.8.
Martin, the values being different than in the ad supplied to condor_advertise is covered by issue 1 in comment 7 (and outside the scope of this bug). I'll look at the totals again on a 32-bit machine.
Fixed in b37df2b.
Tested on RHEL 5.7/6.1 x x86_64/i386 with condor-7.6.5-0.5 and it works. -->VERIFIED
Technical note updated. If any revisions are required, please edit the "Technical Notes" field accordingly. All revisions will be proofread by the Engineering Content Services team.

Diffed Contents:

@@ -1,4 +1 @@
-C: Previous versions of the condor_status tool used a signed 32-bit integer to represent total disk space and memory available in a group of machines.
+The condor_status utility used a signed 32-bit integer to represent total disk space and memory available in a group of machines. This integer could have overflowed with a large pool, resulting in negative or nonsensical results when the "condor_status -total" command was run. This update changes the data type to an unsigned 64-bit integer value, which is significantly less likely to overflow.
-C: In large pools, this counter could overflow, resulting in negative or otherwise nonsensical results when running "condor_status -total".
-F: This code now uses unsigned 64-bit integers to total machine attribute values.
-R: It is thus significantly less likely to overflow.
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. http://rhn.redhat.com/errata/RHEA-2012-0045.html