Bug 2143329

Summary: dsconf monitor dbmon: ERROR: Error: math domain error when dbcache is over 2GB
Product: Red Hat Directory Server Reporter: mreynolds
Component: 389-ds-baseAssignee: Pierre Rogier <progier>
Status: CLOSED ERRATA QA Contact: LDAP QA Team <idm-ds-qe-bugs>
Severity: medium Docs Contact: Zuzana Zoubkova <zzoubkov>
Priority: high    
Version: 12.2CC: aadhikar, emartyny, idm-ds-dev-bugs, mreynolds, msauton, pasik, progier, vashirov
Target Milestone: DS12.2Keywords: Triaged
Target Release: dirsrv-12.2   
Hardware: All   
OS: Linux   
Whiteboard: sync-to-jira
Fixed In Version: redhat-ds-12-9020020230314150545.1674d574 Doc Type: Bug Fix
Doc Text:
Cause: When computing the monitoring statistics, a wrong page size (db files page size instead of mempool page size) was used to compute the db. Consequence: The db cache statistics are wrong and an math domain error occurs when the cache get more than 50% busy. Fix: Now the ldap backend monitoring query also returns the mempool page size and it is used to compute rightly the statistics. Result: The statistics are now accurate and there is no more errors.
Story Points: ---
Clone Of: 2034407 Environment:
Last Closed: 2023-05-30 09:40:35 UTC Type: ---
Regression: --- Mount Type: ---
Documentation: --- CRM:
Verified Versions: Category: ---
oVirt Team: --- RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: --- Target Upstream Version:
Embargoed:
Bug Depends On: 2034407    
Bug Blocks:    

Description mreynolds 2022-11-16 16:09:07 UTC
+++ This bug was initially created as a clone of Bug #2034407 +++

Description of problem:

dsconf monitor dbmon fails with ERROR: Error: math domain error


Version-Release number of selected component (if applicable):
RHDS-11.4 RHEL-8.5
vendorVersion: 389-Directory/1.4.3.27 B2021.263.1550


How reproducible:
on demand

Steps to Reproduce:
1. have an entry cache > 2GB
2.
3.
dsconf monitor dbmon works for a while, but after the server is in operation it begins to report ERROR: Error: math domain error which is then persistent until restart.

# dsconf -D "cn=Directory Manager" -w ... ldap://`hostname --fqdn` -v monitor dbmon 
...snip...


Actual results:

...snip...
DEBUG: list filter = (&(objectclass=nsBackendInstance))
DEBUG: cn=monitor,cn=ldbm database,cn=plugins,cn=config get_attrs_vals_utf8(['dbcachehits', 'dbcachetries', 'dbcachehitratio', 'dbcachepagein', 'dbcachepageout', 'dbcacheroevict', 'dbcacherwevict', 'normalizeddncachetries', 'normalizeddncachehits', 'normalizeddncachemisses', 'normalizeddncachehitratio', 'normalizeddncacheevictions', 'currentnormalizeddncachesize', 'maxnormalizeddncachesize', 'currentnormalizeddncachecount', 'normalizeddncachethreadsize', 'normalizeddncachethreadslots'])
DEBUG: cn=database,cn=monitor,cn=ldbm database,cn=plugins,cn=config get_attrs_vals_utf8(['nsslapd-db-abort-rate', 'nsslapd-db-active-txns', 'nsslapd-db-cache-hit', 'nsslapd-db-cache-try', 'nsslapd-db-cache-region-wait-rate', 'nsslapd-db-cache-size-bytes', 'nsslapd-db-clean-pages', 'nsslapd-db-commit-rate', 'nsslapd-db-deadlock-rate', 'nsslapd-db-dirty-pages', 'nsslapd-db-hash-buckets', 'nsslapd-db-hash-elements-examine-rate', 'nsslapd-db-hash-search-rate', 'nsslapd-db-lock-conflicts', 'nsslapd-db-lock-region-wait-rate', 'nsslapd-db-lock-request-rate', 'nsslapd-db-lockers', 'nsslapd-db-configured-locks', 'nsslapd-db-current-locks', 'nsslapd-db-max-locks', 'nsslapd-db-current-lock-objects', 'nsslapd-db-max-lock-objects', 'nsslapd-db-log-bytes-since-checkpoint', 'nsslapd-db-log-region-wait-rate', 'nsslapd-db-log-write-rate', 'nsslapd-db-longest-chain-length', 'nsslapd-db-page-create-rate', 'nsslapd-db-page-read-rate', 'nsslapd-db-page-ro-evict-rate', 'nsslapd-db-page-rw-evict-rate', 'nsslapd-db-page-trickle-rate', 'nsslapd-db-page-write-rate', 'nsslapd-db-pages-in-use', 'nsslapd-db-txn-region-wait-rate'])
DEBUG: math domain error
Traceback (most recent call last):
  File "/usr/sbin/dsconf", line 134, in <module>
    result = args.func(inst, None, log, args)
  File "/usr/lib/python3.6/site-packages/lib389/cli_conf/monitor.py", line 156, in db_monitor
    'free': convert_bytes(str(dbcachefree)),
  File "/usr/lib/python3.6/site-packages/lib389/utils.py", line 1360, in convert_bytes
    i = int(math.floor(math.log(bytes, 1024)))
ValueError: math domain error
ERROR: Error: math domain error



Expected results:
yes


Additional info:

opening bug report on behalf of customer:

"
It looks like the root cause is likely in the monitor plug-in, I would guess it is related to a signed 32-bit integer counter limit in SNMP.

If nsslap-dbcachesize is configured for any value larger that 2Gb, as you see below.  The monitor plugin will only report 2Gb.  This renders the cache usage values unusable for caches above 2Gb.

I would suggest that dbmon should be querying cn=bdb,cn=config,cn=ldbm database,cn=plugins,cn=config for the cache size, not cn=database,cn=monitor,cn=ldbm database,cn=plugins,cn=config.

If SNMP is related to this issue then SNMP will be incorrectly reporting 2Gb for any cache larger than 2Gb.

ldapsearch -x -D "cn=directory manager" -W -b 'cn=bdb,cn=config,cn=ldbm database,cn=plugins,cn=config' -LLL -s base nsslapd-dbcachesize
Enter LDAP Password: 
dn: cn=bdb,cn=config,cn=ldbm database,cn=plugins,cn=config
nsslapd-dbcachesize: 6442450944

ldapsearch -D "cn=directory manager" -W -x -h localhost -b 'cn=database,cn=monitor,cn=ldbm database,cn=plugins,cn=config' -LLL nsslapd-db-cache-size-bytes
Enter LDAP Password: 
dn: cn=database,cn=monitor,cn=ldbm database,cn=plugins,cn=config
nsslapd-db-cache-size-bytes: 2147483648


When receiving the error, here are the stats involved

# database, monitor, ldbm database, plugins, config
dn: cn=database,cn=monitor,cn=ldbm database,cn=plugins,cn=config
nsslapd-db-abort-rate: 429
nsslapd-db-active-txns: 0
nsslapd-db-cache-hit: 1226370346
nsslapd-db-cache-try: 1226682911
nsslapd-db-cache-region-wait-rate: 87
nsslapd-db-cache-size-bytes: 1073741824
nsslapd-db-clean-pages: 312618
nsslapd-db-commit-rate: 238910
nsslapd-db-deadlock-rate: 0
nsslapd-db-dirty-pages: 8
nsslapd-db-hash-buckets: 524294
nsslapd-db-hash-elements-examine-rate: 0
nsslapd-db-hash-search-rate: 2476689799
nsslapd-db-lock-conflicts: 60
nsslapd-db-lock-region-wait-rate: 839
nsslapd-db-lock-request-rate: 18319531
nsslapd-db-lockers: 345
nsslapd-db-configured-locks: 240000
nsslapd-db-current-locks: 32
nsslapd-db-max-locks: 545
nsslapd-db-current-lock-objects: 32
nsslapd-db-max-lock-objects: 398
nsslapd-db-log-bytes-since-checkpoint: 630735
nsslapd-db-log-region-wait-rate: 839
nsslapd-db-log-write-rate: 1231916143
nsslapd-db-longest-chain-length: 6
nsslapd-db-page-create-rate: 7928
nsslapd-db-page-read-rate: 312565
nsslapd-db-page-ro-evict-rate: 0
nsslapd-db-page-rw-evict-rate: 0
nsslapd-db-page-trickle-rate: 0
nsslapd-db-page-write-rate: 44690
nsslapd-db-pages-in-use: 312626
nsslapd-db-txn-region-wait-rate: 839

This results in the following calculation in /usr/lib/python3.6/site-packages/lib389/cli_conf/monitor.py:137

    dbcachefree = int(dbcachesize - (pagesize * dbpages))
    dbcachefree = int(1073741824 - (8192 * 312626))
    dbcachefree = -1487290368

Which then causes /usr/lib/python3.6/site-packages/lib389/cli_conf/monitor.py:156

            'free': convert_bytes(str(dbcachefree)),

To cause convert_bytes to fail in the call to math.log at   /usr/lib/python3.6/site-packages/lib389/utils.py:1360 because bytes is negative.

This can be avoided by changing /usr/lib/python3.6/site-packages/lib389/utils.py:1357 from
    if bytes == 0:
to
    if bytes <=0:

This is just work around, the function should really store the negative, abs the value, compute the units and restore the negate the value on return to produce the value in the correct unit.

However there is a HUGE problem with the math here.  It indicates that we have exceeded the allocated DB cache by what, 8 Gigabytes???   Something is WILDLY suspect in either the reported metric or the math.

INFO: --------------------------------------------------------
INFO: Database Cache:
INFO:  - Cache Hit Ratio:     99%
INFO:  - Free Space:          0 B
INFO:  - Free Percentage:     -138.6%
INFO:  - RO Page Drops:       0
INFO:  - Pages In:            312631
INFO:  - Pages Out:           44891
INFO: 
INFO: Normalized DN Cache:
INFO:  - Cache Hit Ratio:     81%
INFO:  - Free Space:          368.96 MB
INFO:  - Free Percentage:     72.1%
INFO:  - DN Count:            868826
INFO:  - Evictions:           3407485

"

Comment 2 Pierre Rogier 2023-01-25 17:09:57 UTC
Beware that this bug title is a bit misleading: The error is not dependent of the cache size and typically occurs when the cache is more than 50% busy.
because cache size computation is wrongly using db file page size value (i.e 8K by default)  while typical db cache page size (aka file system preferred size) is 4K

Comment 3 mreynolds 2023-02-08 16:49:53 UTC
Upstream ticket: 

https://github.com/389ds/389-ds-base/issues/5550

Comment 7 Viktor Ashirov 2023-05-04 11:07:50 UTC
Automated test passed:
============================================================= test session starts =============================================================
platform linux -- Python 3.9.16, pytest-6.2.2, py-1.10.0, pluggy-0.13.1 -- /usr/bin/python3
cachedir: .pytest_cache
389-ds-base: 2.2.7-2.module+el9dsrv+18726+78959e84
nss: 3.79.0-18.el9_1
nspr: 4.34.0-18.el9_1
openldap: 2.6.2-3.el9
cyrus-sasl: not installed
FIPS: disabled
rootdir: /root/ds/dirsrvtests, configfile: pytest.ini
collected 3 items

dirsrvtests/tests/suites/clu/dbmon_test.py::test_dsconf_dbmon PASSED                                                                    [ 33%]
dirsrvtests/tests/suites/clu/dbmon_test.py::test_dbmon_mp_pagesize PASSED                                                               [ 66%]
dirsrvtests/tests/suites/replication/changelog_test.py::test_changelog_pagesize PASSED                                                  [100%]

================================================== 3 passed, 12 warnings in 67.90s (0:01:07) ==================================================

Marking as VERIFIED.

Comment 9 errata-xmlrpc 2023-05-30 09:40:35 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (redhat-ds:12 bug fix and enhancement update), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2023:3344