Bug 1569678 - virsh capabilities reports invalid values for 4K pages
Summary: virsh capabilities reports invalid values for 4K pages
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: Red Hat Enterprise Linux 7
Classification: Red Hat
Component: libvirt
Version: 7.4
Hardware: Unspecified
OS: Unspecified
high
high
Target Milestone: rc
: ---
Assignee: Michal Privoznik
QA Contact: Jing Qi
URL:
Whiteboard:
Depends On:
Blocks: 1519540 1582416 1582418 1625119 1625120 1625122
TreeView+ depends on / blocked
 
Reported: 2018-04-19 19:11 UTC by Mark Jones
Modified: 2019-06-26 09:23 UTC (History)
11 users (show)

Fixed In Version: libvirt-4.3.0-1.el7
Doc Type: Bug Fix
Doc Text:
The "virsh capabilities" command previously displayed an inaccurate number of 4 KiB memory pages on systems with very large amounts of memory. This update optimizes the memory diagnostic mechanism to ensure memory page numbers are displayed correctly on such systems.
Clone Of:
: 1582416 1582418 (view as bug list)
Environment:
Last Closed: 2018-10-30 09:55:31 UTC
Target Upstream Version:


Attachments (Terms of Use)


Links
System ID Priority Status Summary Last Updated
Red Hat Product Errata RHSA-2018:3113 None None None 2018-10-30 09:57:51 UTC

Comment 3 Luyao Huang 2018-04-23 07:24:27 UTC
I can reproduce this issue on libvirt-3.2.0-14.el7_4.9.x86_64:

1.

# numactl --har
available: 4 nodes (0-3)
node 0 cpus: 0 2 4 6
node 0 size: 2047 MB
node 0 free: 1666 MB
node 1 cpus: 8 10 12 14
node 1 size: 2014 MB
node 1 free: 1640 MB
node 2 cpus: 9 11 13 15
node 2 size: 4096 MB
node 2 free: 3881 MB
node 3 cpus: 1 3 5 7
node 3 size: 8191 MB
node 3 free: 7945 MB
node distances:
node   0   1   2   3 
  0:  10  20  20  20 
  1:  20  10  20  20 
  2:  20  20  10  20 
  3:  20  20  20  10 

2. 

# virsh capabilities
...
        <cell id='3'>
          <memory unit='KiB'>8388604</memory>
          <pages unit='KiB' size='4'>2097151</pages>
          <pages unit='KiB' size='2048'>0</pages>
          <pages unit='KiB' size='1048576'>0</pages>
...

3. add 4 1g hugepage (only reproduce this issue when use page >= 4)

echo 4 > /sys/devices/system/node/node3/hugepages/hugepages-1048576kB/nr_hugepages

4.

        <cell id='3'>
          <memory unit='KiB'>8388604</memory>
          <pages unit='KiB' size='4'>2097151</pages>
          <pages unit='KiB' size='2048'>0</pages>
          <pages unit='KiB' size='1048576'>4</pages>


And i used gdb to check why this will happened and found some interesting things:

After add 4294967296 to huge_page_sum, the huge_page_sum still
is 0, that looks like overflow happened, but huge_page_sum is a ull type and 
4294967296 is 2^32 (32bit ull length).

781	        huge_page_sum += 1024 * page_size * page_avail;
(gdb) p page_avail
$26 = 4
(gdb) p page_size
$27 = 1048576
(gdb) p huge_page_sum
$28 = 0
(gdb) n
745	    while (dir && (direrr = virDirRead(dir, &entry, path)) > 0) {
(gdb) p huge_page_sum
$29 = 0

Comment 4 Luyao Huang 2018-04-23 08:12:37 UTC
The problem is that page_size and page_avail is unsigned int, and the result is still unsigned int, this is a multiplication overflow.

Comment 5 Michal Privoznik 2018-04-23 15:45:18 UTC
Patch proposed on the upstream list:

https://www.redhat.com/archives/libvir-list/2018-April/msg02229.html

Comment 9 Michal Privoznik 2018-04-24 09:07:41 UTC
I've pushed the patch upstream:

commit 31daccf5a550e7ede35532004006b34ba5c5b92e
Author:     Michal Privoznik <mprivozn@redhat.com>
AuthorDate: Mon Apr 23 16:36:53 2018 +0200
Commit:     Michal Privoznik <mprivozn@redhat.com>
CommitDate: Tue Apr 24 11:02:28 2018 +0200

    virNumaGetHugePageInfo: Return page_avail and page_free as ULL
    
    https://bugzilla.redhat.com/show_bug.cgi?id=1569678
    
    On some large systems (with ~400GB of RAM) it is possible for
    unsigned int to overflow in which case we report invalid number
    of 4K pages pool size. Switch to unsigned long long.
    
    We hit overflow in virNumaGetPages when doing:
    
        huge_page_sum += 1024 * page_size * page_avail;
    
    because although 'huge_page_sum' is an unsigned long long, the
    page_size and page_avail are both unsigned int, so the promotion
    to unsigned long long doesn't happen until the sum has been
    calculated, by which time we've already overflowed.
    
    Turning page_avail into a unsigned long long is not strictly
    needed until we need ability to represent more than 2^32
    4k pages, which equates to 16 TB of RAM. That's not
    outside the realm of possibility, so makes sense that we
    change it to unsigned long long to avoid future problems.
    
    Signed-off-by: Michal Privoznik <mprivozn@redhat.com>
    Reviewed-by: Daniel P. Berrangé <berrange@redhat.com>

v4.2.0-456-g31daccf5a5

Comment 23 Jing Qi 2018-07-09 04:50:43 UTC
Verified as described in bug 1582418 & 1582416.

Comment 25 errata-xmlrpc 2018-10-30 09:55:31 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2018:3113


Note You need to log in before you can comment on or make changes to this bug.