Bug 1916519 - Host memory statistics discrepancies due to SReclaimable
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: Red Hat Enterprise Virtualization Manager
Classification: Red Hat
Component: vdsm
Version: 4.3.9
Hardware: x86_64
OS: Linux
Priority: unspecified
Severity: medium
Target Milestone: ovirt-4.4.5
Target Release: ---
Assignee: Liran Rotenberg
QA Contact: Qin Yuan
URL:
Whiteboard:
Depends On:
Blocks:
Reported: 2021-01-15 00:43 UTC by Germano Veit Michel
Modified: 2022-08-03 12:20 UTC (History)

Fixed In Version: vdsm-4.40.50.4
Doc Type: Bug Fix
Doc Text:
Previously, the host's used-memory calculation did not take SReclaimable memory into account, although the free-memory calculation did. As a result, there were discrepancies in the host statistics. In this release, SReclaimable memory is part of the used-memory calculation.
Clone Of:
Environment:
Last Closed: 2021-04-14 11:38:44 UTC
oVirt Team: Virt
Target Upstream Version:
Embargoed:


Links
System ID Private Priority Status Summary Last Updated
Red Hat Product Errata RHSA-2021:1184 0 None None None 2021-04-14 11:38:52 UTC
oVirt gerrit 113166 0 master MERGED sampling: consider claimable memory 2021-02-18 08:17:14 UTC

Description Germano Veit Michel 2021-01-15 00:43:30 UTC
Description of problem:

Since BZ1749630, vdsm adds SReclaimable to the reported free memory 'memFree'.
However, the percentage calculation 'memUsed' does not take SReclaimable into account, which causes a few side effects in addition to the discrepancy itself.

See this, from a host with 512G:

$ egrep 'mem[PFSUCA]' sos_commands/vdsm/vdsm-client_Host_getStats
    "memShared": 56714, 
    "memUsed": "81", 
    "memCommitted": 294912, 
    "memAvailable": 257590, 
    "memFree": 257334, 

If memFree is ~256G on a host with 512G, then memUsed should be ~50%, not 81%. The ~30% difference is SReclaimable (~160G):
MemFree:        99845680 kB
Buffers:          100560 kB
Cached:          5942972 kB
SReclaimable:   158048536 kB

Host sampling is not taking SReclaimable into account for memUsed calculation:

lib/vdsm/virt/sampling.py:
        freeOrCached = (meminfo['MemFree'] +
                        meminfo['Cached'] + meminfo['Buffers'])
        self.memUsed = 100 - int(100.0 * (freeOrCached) / meminfo['MemTotal'])
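
A minimal sketch of the calculation above, with an optional switch that also counts SReclaimable as reclaimable memory. This is not the merged vdsm change: the function name `mem_used_percent` and its `include_sreclaimable` flag are hypothetical. MemTotal is not part of the meminfo excerpt in this report, so the sketch assumes it equals the memory.total value returned by the REST API in this report (540307095552 bytes); the remaining figures are the /proc/meminfo values quoted above.

```python
def mem_used_percent(meminfo, include_sreclaimable=False):
    """Return used memory as an integer percentage of MemTotal (values in kB)."""
    free_or_cached = meminfo['MemFree'] + meminfo['Cached'] + meminfo['Buffers']
    if include_sreclaimable:
        # Hypothetical option: treat reclaimable slab memory as not used.
        free_or_cached += meminfo.get('SReclaimable', 0)
    return 100 - int(100.0 * free_or_cached / meminfo['MemTotal'])


meminfo = {
    # ASSUMPTION: MemTotal taken from the API's memory.total (bytes -> kB).
    'MemTotal': 540307095552 // 1024,
    'MemFree': 99845680,
    'Buffers': 100560,
    'Cached': 5942972,
    'SReclaimable': 158048536,
}

without_sreclaimable = mem_used_percent(meminfo)
with_sreclaimable = mem_used_percent(meminfo, include_sreclaimable=True)
print(without_sreclaimable, with_sreclaimable)
```

Under that assumption the sketch yields about 80% without SReclaimable and about 50% with it, which is in line with the figures discussed above; the reported value of 81 was presumably sampled at a slightly different moment.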

This has a few side effects:

1) The Hosts tab memory graph in the Admin Portal shows 81% usage instead of ~50%

2) The engine warns about exceeded memory thresholds more readily (audit_log, events), because it uses memUsed

backend/manager/modules/vdsbroker/src/main/java/org/ovirt/engine/core/vdsbroker/monitoring/HostMonitoring.java:
    private void checkVdsMemoryThresholdPercentage(Cluster cluster, VdsStatistics stat) {
        Integer maxUsedPercentageThreshold = cluster.getLogMaxMemoryUsedThreshold();

        if (stat.getUsageMemPercent() > maxUsedPercentageThreshold) {
            logMemoryAuditLog(vds, cluster, stat, AuditLogType.VDS_HIGH_MEM_USE, maxUsedPercentageThreshold);
        }
    }

3) The API host statistics also appear to derive mem.free and mem.used from the percentage that comes from VDSM, which gives misleading values for mem.free and mem.used and triggers monitoring warnings (e.g. Nagios)

backend/manager/modules/restapi/jaxrs/src/main/java/org/ovirt/engine/api/restapi/resource/HostStatisticalQuery.java:
    public List<Statistic> getStatistics(VDS entity) {
        VdsStatistics s = entity.getStatisticsData();
        // if user queries host statistics before host installation completed, null values are possible (therefore added checks).
        long memTotal = entity.getPhysicalMemMb()==null ? 0 : entity.getPhysicalMemMb() * Mb;
        long memUsed = (s==null || s.getUsageMemPercent()==null) ? 0 : memTotal * s.getUsageMemPercent() / 100;
        List<Statistic> statistics = asList(setDatum(clone(MEM_TOTAL),   memTotal),
                      setDatum(clone(MEM_USED),    memUsed),
                      setDatum(clone(MEM_FREE),    memTotal-memUsed),

See:

<statistic href="/ovirt-engine/api/hosts/5abf7bf8-d35f-4077-92c6-3cdc65f635a1/statistics/7816602b-c05c-3db7-a4da-3769f7ad8896" id="7816602b-c05c-3db7-a4da-3769f7ad8896">
  <name>memory.total</name>
  <description>Total memory</description>
  <kind>gauge</kind>
  <type>integer</type>
  <unit>bytes</unit>
  <values>
    <value>
      <datum>540307095552</datum>
    </value>
  </values>
  <host href="/ovirt-engine/api/hosts/5abf7bf8-d35f-4077-92c6-3cdc65f635a1" id="5abf7bf8-d35f-4077-92c6-3cdc65f635a1"/>
</statistic>

<statistic href="/ovirt-engine/api/hosts/5abf7bf8-d35f-4077-92c6-3cdc65f635a1/statistics/b7499508-c1c3-32f0-8174-c1783e57bb08" id="b7499508-c1c3-32f0-8174-c1783e57bb08">
  <name>memory.used</name>
  <description>Used memory</description>
  <kind>gauge</kind>
  <type>integer</type>
  <unit>bytes</unit>
  <values>
    <value>
      <datum>432245676441</datum>
    </value>
  </values>
  <host href="/ovirt-engine/api/hosts/5abf7bf8-d35f-4077-92c6-3cdc65f635a1" id="5abf7bf8-d35f-4077-92c6-3cdc65f635a1"/>
</statistic>

<statistic href="/ovirt-engine/api/hosts/5abf7bf8-d35f-4077-92c6-3cdc65f635a1/statistics/5a0fba9d-33d7-3cbf-addd-ba462040c946" id="5a0fba9d-33d7-3cbf-addd-ba462040c946">
  <name>memory.free</name>
  <description>Free memory</description>
  <kind>gauge</kind>
  <type>integer</type>
  <unit>bytes</unit>
  <values>
    <value>
      <datum>108061419111</datum>
    </value>
  </values>
  <host href="/ovirt-engine/api/hosts/5abf7bf8-d35f-4077-92c6-3cdc65f635a1" id="5abf7bf8-d35f-4077-92c6-3cdc65f635a1"/>
</statistic>
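
The three values above can be reproduced with the same arithmetic as HostStatisticalQuery.java, which suggests that memory.used and memory.free are derived from the percentage rather than measured. The sketch below is only a worked check in Python, not engine code. The usage percentage of 80 is an assumption inferred from the returned numbers (it is not shown in the API output), and the 50% case is a hypothetical comparison.

```python
MEM_TOTAL = 540307095552  # bytes, the memory.total datum above


def derive_used_and_free(mem_total, usage_percent):
    """Mirror memTotal * percent / 100 with integer (truncating) division."""
    mem_used = mem_total * usage_percent // 100
    return mem_used, mem_total - mem_used


# ASSUMPTION: the percentage at query time was 80, inferred from the output.
used_80, free_80 = derive_used_and_free(MEM_TOTAL, 80)
print(used_80, free_80)

# Hypothetical comparison: the values the API would return at 50% usage.
used_50, free_50 = derive_used_and_free(MEM_TOTAL, 50)
print(used_50, free_50)
```

With 80 the sketch reproduces both memory.used (432245676441) and memory.free (108061419111) exactly, so any error in the percentage propagates directly into the byte values seen by monitoring tools.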

Note: caches and buffers were removed in BZ1751423 because they were always zero; they did not include SReclaimable anyway.

Version-Release number of selected component (if applicable):
- 4.3.9 engine and vdsm-4.30.44 (customer)
- No relevant changes seen on master

How reproducible:
- Customer with high SReclaimable
- It is not easy to get SReclaimable high enough for the discrepancy to be as clear as above; it probably needs long uptime and sustained load.

Comment 6 Qin Yuan 2021-02-10 03:09:38 UTC
Verified with:
vdsm-4.40.50.4-1.el8ev.x86_64

Steps:
1. Create a large number of empty directories on the host to make SReclaimable large
2. Check memUsed
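
Step 1 could be scripted roughly as follows. This is a sketch, not the procedure actually used for verification: the helper name `make_empty_dirs`, the directory naming scheme, and the counts are hypothetical, and the number of directories needed to grow SReclaimable noticeably depends on the host.

```python
import os
import tempfile


def make_empty_dirs(root, count):
    """Create `count` empty directories under `root` and return the count."""
    for i in range(count):
        os.mkdir(os.path.join(root, "dir%08d" % i))
    return count


# Small demonstration in a throwaway location; a real run would use a much
# larger count and keep the directories in place while checking memUsed.
with tempfile.TemporaryDirectory() as demo_root:
    created = make_empty_dirs(demo_root, 100)
    found = len(os.listdir(demo_root))
    print(created, found)
```

SReclaimable can then be compared before and after the run with the /proc/meminfo command shown in the results.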

Results:
1. memUsed is correct.

meminfo:
[root@ocelot06 ~]# cat /proc/meminfo |grep -e MemTotal -e MemFree -e Buffers -e '^Cached' -e SReclaimable
MemTotal:       98597084 kB
MemFree:        64820576 kB
Buffers:            4540 kB
Cached:          3329852 kB
SReclaimable:   10779752 kB

With SReclaimable considered when calculating memUsed:
memUsed = 100-int(100*(MemFree+Buffers+Cached+SReclaimable)/MemTotal)
        = 100-int(100*(64820576+4540+3329852+10779752)/98597084)
        = 20

Without SReclaimable considered when calculating memUsed:
memUsed = 100-int(100*(MemFree+Buffers+Cached)/MemTotal)
        = 100-int(100*(64820576+4540+3329852)/98597084)
        = 31

Check actual memUsed:
[root@ocelot06 ~]# vdsm-client Host getStats |grep -E 'mem[PFSUCA]'
    "memAvailable": 77306,
    "memCommitted": 0,
    "memFree": 77050,
    "memShared": 0,
    "memUsed": "20",

As you can see, the actual memUsed is the result of considering SReclaimable.
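
The two hand calculations above can be re-run from the quoted /proc/meminfo text. The sketch below is only a verification aid; the helper names `parse_meminfo` and `mem_used` are hypothetical and are not part of vdsm.

```python
MEMINFO_TEXT = """\
MemTotal:       98597084 kB
MemFree:        64820576 kB
Buffers:            4540 kB
Cached:          3329852 kB
SReclaimable:   10779752 kB
"""


def parse_meminfo(text):
    """Parse 'Name:   value kB' lines into a dict of integer kB values."""
    result = {}
    for line in text.splitlines():
        name, _, rest = line.partition(':')
        if rest.strip():
            result[name.strip()] = int(rest.split()[0])
    return result


def mem_used(meminfo, reclaimable_keys):
    """Integer percentage of MemTotal not covered by the given fields."""
    reclaimable = sum(meminfo[key] for key in reclaimable_keys)
    return 100 - int(100.0 * reclaimable / meminfo['MemTotal'])


info = parse_meminfo(MEMINFO_TEXT)
with_slab = mem_used(info, ['MemFree', 'Buffers', 'Cached', 'SReclaimable'])
without_slab = mem_used(info, ['MemFree', 'Buffers', 'Cached'])
print(with_slab, without_slab)
```

The results are 20 with SReclaimable and 31 without it, matching the calculations above and the memUsed value returned by vdsm-client.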

Comment 12 errata-xmlrpc 2021-04-14 11:38:44 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (Moderate: RHV RHEL Host (ovirt-host) 4.4.z [ovirt-4.4.5] security, bug fix, enhancement), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2021:1184

Comment 13 meital avital 2022-08-03 12:20:54 UTC
Due to QE capacity, we are not going to cover this issue in our automation.

