Bug 565501 - integer overflow in condor_status -total -server
Summary: integer overflow in condor_status -total -server
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: Red Hat Enterprise MRG
Classification: Red Hat
Component: condor
Version: 1.2
Hardware: All
OS: Linux
Priority: low
Severity: low
Target Milestone: 2.1
Target Release: ---
Assignee: Will Benton
QA Contact: Martin Kudlej
URL:
Whiteboard:
Depends On:
Blocks: 743350
Reported: 2010-02-15 14:09 UTC by Matthew Farrellee
Modified: 2012-01-23 17:25 UTC (History)

Fixed In Version: condor-7.6.5-0.3
Doc Type: Bug Fix
Doc Text:
The condor_status utility used a signed 32-bit integer to represent total disk space and memory available in a group of machines. This integer could have overflowed with a large pool, resulting in negative or nonsensical results when the "condor_status -total" command was run. This update changes the data type to an unsigned 64-bit integer value, which is significantly less likely to overflow.
Clone Of:
Environment:
Last Closed: 2012-01-23 17:25:00 UTC
Target Upstream Version:
Embargoed:


Attachments
patch for condor_utils/totals.{cpp, h} (1.42 KB, patch)
2011-10-04 15:24 UTC, Will Benton


Links
System ID Private Priority Status Summary Last Updated
Red Hat Product Errata RHEA-2012:0045 0 normal SHIPPED_LIVE Red Hat Enterprise MRG Grid 2.1 bug fix and enhancement update 2012-01-23 22:22:58 UTC

Description Matthew Farrellee 2010-02-15 14:09:43 UTC
Description of problem:

condor_status's totals code overflows


Version-Release number of selected component (if applicable):

All including at least 7.4.3-0.1


How reproducible:

100%


Steps to Reproduce:
1. run a collector (condor_collector -t -f)
2. create a dummy slot with lots of resources: (4294967297 is 2^32 + 1)
$ cat largeresource.ad
MyType = "Machine"
TargetType = "Job"
Name = "many"
Machine = "resources.test"
State = "UNKNOWN"
Activity = "UNKNOWN"
MyAddress = "<1.2.3.4:5678>"
OpSys = "TESTER"
Arch = "WHOCARES"
VirtualMemory = 4294967297
Disk = 4294967297
Memory = 4294967297
KFlops = 4294967297
Mips = 4294967297
3. advertise the slot:
$ condor_advertise UPDATE_STARTD_AD largeresource.ad
4. run condor_status resources.test -total -server


Actual results:

$ condor_status resources.test -total -server
                     Machines Avail  Memory        Disk        MIPS      KFLOPS

     WHOCARES/TESTER        1     0       1           1           1           1
               Total        1     0       1           1           1           1


Expected results:

Memory = Disk = MIPS = KFLOPS = 4294967297


Additional info:

Repeat with 4294967295 (2^32 - 1)...

$ condor_status resources.test -total -server
                     Machines Avail  Memory        Disk        MIPS      KFLOPS

     WHOCARES/TESTER        1     0      -1 18446744073709551615          -1          -1
               Total        1     0      -1 18446744073709551615          -1          -1

Expected...

Memory = Disk = MIPS = KFLOPS = 4294967295

Comment 1 Matthew Farrellee 2010-02-15 14:16:10 UTC
totals.h:
  long memory;
  uint64_t disk;
  long condor_mips;
  long kflops;

totals.cpp:
  if (!ad->LookupInteger(ATTR_MEMORY,attrMem)) { badAd = true; attrMem  = 0;}
  if (!ad->LookupInteger(ATTR_DISK,  attrDisk)){ badAd = true; attrDisk = 0;}
  if (!ad->LookupInteger(ATTR_MIPS,  attrMips)){ badAd = true; attrMips = 0;}
  if (!ad->LookupInteger(ATTR_KFLOPS,attrKflops)){badAd= true;attrKflops = 0;}

There is clearly a type mismatch going on here.
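
A minimal sketch of how a mismatch like this could produce the numbers in the description, assuming the looked-up value ends up in a 32-bit int and is then widened into the totals fields. The helper names wrap_to_int32 and widen_to_uint64 are hypothetical and are not part of Condor.

```cpp
#include <cstdint>

// Hypothetical helper: keep only the low 32 bits of a value and interpret
// them as a signed 32-bit integer (two's complement), as happens when a
// larger number is squeezed into an int.
std::int32_t wrap_to_int32(std::int64_t value) {
    const std::uint32_t low = static_cast<std::uint32_t>(value);  // value modulo 2^32
    if (low >= 0x80000000u) {
        // Top bit set: the signed interpretation is low - 2^32.
        return static_cast<std::int32_t>(static_cast<std::int64_t>(low) - 4294967296LL);
    }
    return static_cast<std::int32_t>(low);
}

// Hypothetical helper: widen a signed 32-bit value into an unsigned 64-bit
// field, as with the uint64_t disk member shown above.
std::uint64_t widen_to_uint64(std::int32_t value) {
    return static_cast<std::uint64_t>(value);  // a negative input wraps modulo 2^64
}
```

Under these assumptions, 4294967297 (2^32 + 1) wraps to 1, and 4294967295 (2^32 - 1) wraps to -1, which becomes 18446744073709551615 once widened into the unsigned disk field.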

Comment 2 Matthew Farrellee 2010-02-15 14:19:25 UTC
Using a resource size of 2147483647 (2^31 - 1), advertised multiple times (the Name attribute varies):

$ condor_advertise UPDATE_STARTD_AD largeresource.ad
$ condor_status resources.test -total -server
                     Machines Avail  Memory        Disk        MIPS      KFLOPS
     WHOCARES/TESTER        1     0 2147483647  2147483647  2147483647  2147483647
               Total        1     0 2147483647  2147483647  2147483647  2147483647
$ condor_advertise UPDATE_STARTD_AD largeresource.ad2
$ condor_status resources.test -total -server
                     Machines Avail  Memory        Disk        MIPS      KFLOPS
     WHOCARES/TESTER        2     0 4294967294  4294967294  4294967294  4294967294
               Total        2     0 4294967294  4294967294  4294967294  4294967294
$ condor_advertise UPDATE_STARTD_AD largeresource.ad3
$ condor_status resources.test -total -server
                     Machines Avail  Memory        Disk        MIPS      KFLOPS
     WHOCARES/TESTER        3     0 6442450941  6442450941  6442450941  6442450941
               Total        3     0 6442450941  6442450941  6442450941  6442450941
$ condor_advertise UPDATE_STARTD_AD largeresource.ad4
$ condor_status resources.test -total -server
                     Machines Avail  Memory        Disk        MIPS      KFLOPS
     WHOCARES/TESTER        4     0 8589934588  8589934588  8589934588  8589934588
               Total        4     0 8589934588  8589934588  8589934588  8589934588

$ condor_status
Name               OpSys      Arch   State     Activity LoadAv Mem   ActvtyTime
many2 TESTER     WHOCAR UNKNOWN   UNKNOWN  [???]  2147483647   [Unknown]
many3 TESTER     WHOCAR UNKNOWN   UNKNOWN  [???]  2147483647   [Unknown]
many4 TESTER     WHOCAR UNKNOWN   UNKNOWN  [???]  2147483647   [Unknown]
many TESTER     WHOCAR UNKNOWN   UNKNOWN  [???]  2147483647   [Unknown]
                     Total Owner Claimed Unclaimed Matched Preempting Backfill
     WHOCARES/TESTER     0     0       0         0       0          0        0
               Total     0     0       0         0       0          0        0
                    (Omitted 4 malformed ads in computed attribute totals)

Comment 3 Matthew Farrellee 2010-02-15 14:21:19 UTC
There is also a type mismatch in StartdRunTotal.

Comment 4 Matthew Farrellee 2010-02-15 14:32:53 UTC
Originally reported on condor-users...

https://lists.cs.wisc.edu/archive/condor-users/2010-February/msg00078.shtml

The condor_startd might have an overflow itself and may be advertising negative kflop values. It is worth finding out what condor_status -long | grep ^KFlop returns in Pascal's pool.

Comment 5 Matthew Farrellee 2010-11-18 14:24:13 UTC
V7_4-branch has a fix for KFLOPS,

https://condor-wiki.cs.wisc.edu/index.cgi/tktview?tn=1771

Comment 6 Will Benton 2011-10-04 15:24:33 UTC
Created attachment 526267 [details]
patch for condor_utils/totals.{cpp, h}

Comment 7 Will Benton 2011-10-04 15:25:11 UTC
So there are a few problems here:

1.  Integer quantities in ClassAds are signed and 32-bit.  So we won't be able to advertise values larger than 2^31-1; as the example indicates, larger values are clamped if read in from a textual ad.
2.  (Related:  It would be nice if there were some convention for marking an ad as suspicious when parsing errors like the above happen.  The ad in the example is reported as "malformed" only because the state and activity are invalid.)
3.  (Also related:  It would also be nice if ClassAd evaluation had a means to record that overflow had occurred.)
4.  As Matt mentions, the actual code doing the totaling uses 32-bit ints in some places; one of these was fixed upstream for 7.4.

IMHO problems 1-3 should be addressed by a different BZ (there is a ticket upstream for issue #1).  I've attached a patch for issue #4.
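
A rough sketch of the clamping behavior described in issue 1, combined with the kind of overflow flag suggested in issue 3. This is an illustration only: ClampedInt and clamp_to_int32 are hypothetical names and do not correspond to the ClassAd library's API.

```cpp
#include <cstdint>
#include <limits>

// Hypothetical result type: the stored value plus a flag recording whether
// the input had to be clamped to fit.
struct ClampedInt {
    std::int32_t value;
    bool overflowed;
};

// Hypothetical helper: saturate a wide value to the signed 32-bit range
// instead of letting it wrap around.
ClampedInt clamp_to_int32(std::int64_t wide) {
    const std::int64_t max32 = std::numeric_limits<std::int32_t>::max();
    const std::int64_t min32 = std::numeric_limits<std::int32_t>::min();
    if (wide > max32) return {std::numeric_limits<std::int32_t>::max(), true};
    if (wide < min32) return {std::numeric_limits<std::int32_t>::min(), true};
    return {static_cast<std::int32_t>(wide), false};
}
```

With this sketch, an advertised 4294967297 is stored as 2147483647 with the flag set, while 2147483647 itself is stored unchanged with the flag clear. A caller could use the flag to mark the ad as suspicious.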

Comment 8 Will Benton 2011-10-04 18:54:02 UTC
    Technical note added. If any revisions are required, please edit the "Technical Notes" field
    accordingly. All revisions will be proofread by the Engineering Content Services team.
    
    New Contents:
C:  Previous versions of the condor_status tool used a signed 32-bit integer to represent total disk space and memory available in a group of machines.
C:  In large pools, this counter could overflow, resulting in negative or otherwise nonsensical results when running "condor_status -total".
F:  This code now uses unsigned 64-bit integers to total machine attribute values.
R:  It is thus significantly less likely to overflow.

Comment 10 Martin Kudlej 2011-10-21 09:53:57 UTC
$ cat x
MyType = "Machine"
TargetType = "Job"
Name = "many"
Machine = "resources.test"
State = "UNKNOWN"
Activity = "UNKNOWN"
MyAddress = "<1.2.3.4:5678>"
OpSys = "TESTER"
Arch = "WHOCARES"
VirtualMemory = 4294967297
Disk = 4294967297
Memory = 4294967297
KFlops = 4294967297
Mips = 4294967297

32bit RHEL 5/6 $ condor_status -total -server
                     Machines Avail  Memory        Disk        MIPS      KFLOPS

         INTEL/LINUX        1     1     502 1801532622241792 24094766530560 5308583872823296
     WHOCARES/TESTER        1     0 2147483647 9223372032559808512 9223372032559808512 9223372032559808512

               Total        2     1 -2147483147 9225173565182050304 9223396127326339072 9228680620727599103

64 bit RHEL 5/6 $ condor_status -total -server
                     Machines Avail  Memory        Disk        MIPS      KFLOPS

     WHOCARES/TESTER        1     0 2147483647  2147483647  2147483647  2147483647
        X86_64/LINUX        1     1     497      310360        6292     1213079

               Total        2     1 2147484144  2147794007  2147489939  2148696726

I use 4294967295:

32 bit RHEL 5/6 $ condor_status -total -server
                     Machines Avail  Memory        Disk        MIPS      KFLOPS

         INTEL/LINUX        1     1     502 1800587729436672 24133421236224 6182326249717760
     WHOCARES/TESTER        1     0 2147483647 9223372032559808512 9223372032559808512 9223372032559808512

               Total        2     1 -2147483147 9225172620289245184 9223396165981044736 9229554363104493567

64 bit RHEL 5/6 $ condor_status -total -server
                     Machines Avail  Memory        Disk        MIPS      KFLOPS

     WHOCARES/TESTER        1     0 2147483647  2147483647  2147483647  2147483647
        X86_64/LINUX        1     1     497      310144        7778     1122100

               Total        2     1 2147484144  2147793791  2147491425  2148605747


I use 2147483647:
32 bit RHEL 5/6 $ condor_status -total -server
                     Machines Avail  Memory        Disk        MIPS      KFLOPS

         INTEL/LINUX        1     1     502 1800209772314624           0           0
     WHOCARES/TESTER        1     0 2147483647 9223372032559808512 9223372032559808512 9223372032559808512

               Total        2     1 -2147483147 9225172242332123136 9223372032559808512 9223372032559808512

                    (Omitted 1 malformed ads in computed attribute totals)

64 bit RHEL 5/6 $ condor_status -total -server
                     Machines Avail  Memory        Disk        MIPS      KFLOPS

     WHOCARES/TESTER        1     0 2147483647  2147483647  2147483647  2147483647
        X86_64/LINUX        1     1     497      310064           0           0

               Total        2     1 2147484144  2147793711  2147483647  2147483647

                    (Omitted 1 malformed ads in computed attribute totals)

I think this issue is not fixed. There are negative numbers in the Memory column, and the values differ from those advertised with condor_advertise. -> Assigned
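
The negative Memory total in the 32-bit output is consistent with a signed 32-bit accumulator wrapping around. A minimal sketch, assuming the 32-bit builds total memory in a 32-bit signed type; the function names are hypothetical and this is not the code from totals.cpp.

```cpp
#include <cstdint>
#include <vector>

// Sum per-machine values the way a wrapping signed 32-bit accumulator would.
// The addition is done in unsigned arithmetic, where overflow is well
// defined, and the result is then interpreted as a signed value.
std::int32_t total_wrapping_int32(const std::vector<std::int32_t>& values) {
    std::uint32_t sum = 0;
    for (std::int32_t v : values) sum += static_cast<std::uint32_t>(v);
    if (sum >= 0x80000000u) {
        return static_cast<std::int32_t>(static_cast<std::int64_t>(sum) - 4294967296LL);
    }
    return static_cast<std::int32_t>(sum);
}

// Sum the same values into an unsigned 64-bit accumulator.
// Assumes every input is non-negative.
std::uint64_t total_uint64(const std::vector<std::int32_t>& values) {
    std::uint64_t sum = 0;
    for (std::int32_t v : values) sum += static_cast<std::uint64_t>(v);
    return sum;
}
```

For the two Memory values above, 502 and 2147483647, the wrapping sum is -2147483147, which is the total shown in the 32-bit output. The 64-bit accumulator gives 2147484149.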

Comment 11 Martin Kudlej 2011-10-21 09:54:51 UTC
I've used condor-7.6.4-0.8.

Comment 12 Will Benton 2011-10-21 13:38:12 UTC
Martin, the values being different than in the ad supplied to condor_advertise is covered by issue 1 in comment 7 (and outside the scope of this bug).  I'll look at the totals again on a 32-bit machine.

Comment 13 Will Benton 2011-10-27 17:17:52 UTC
Fixed in b37df2b.

Comment 15 Martin Kudlej 2011-11-04 16:40:36 UTC
Tested on RHEL 5.7/6.1 x x86_64/i386 with condor-7.6.5-0.5 and it works. -->VERIFIED

Comment 16 Douglas Silas 2011-11-16 15:27:44 UTC
    Technical note updated. If any revisions are required, please edit the "Technical Notes" field
    accordingly. All revisions will be proofread by the Engineering Content Services team.
    
    Diffed Contents:
@@ -1,4 +1 @@
-C:  Previous versions of the condor_status tool used a signed 32-bit integer to represent total disk space and memory available in a group of machines.
+The condor_status utility used a signed 32-bit integer to represent total disk space and memory available in a group of machines. This integer could have overflowed with a large pool, resulting in negative or nonsensical results when the "condor_status -total" command was run. This update changes the data type to an unsigned 64-bit integer value, which is significantly less likely to overflow.
-C:  In large pools, this counter could overflow, resulting in negative or otherwise nonsensical results when running "condor_status -total".
-F:  This code now uses unsigned 64-bit integers to total machine attribute values.
-R:  It is thus significantly less likely to overflow.

Comment 17 errata-xmlrpc 2012-01-23 17:25:00 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

http://rhn.redhat.com/errata/RHEA-2012-0045.html

