Bug 2268136

Summary: Long etime and error "Retry count exceeded' on BIND/ADD/DEL/MOD from revert_cache [11.7.z]
Product: Red Hat Directory Server Reporter: Viktor Ashirov <vashirov>
Component: 389-ds-baseAssignee: LDAP Maintainers <idm-ds-dev-bugs>
Status: CLOSED ERRATA QA Contact: LDAP QA Team <idm-ds-qe-bugs>
Severity: unspecified Docs Contact: Evgenia Martynyuk <emartyny>
Priority: unspecified    
Version: 11.7CC: cgaynor, ddas, idm-ds-dev-bugs, msauton, musoni, rmarigny, tbordaz
Target Milestone: DS11.7   
Target Release: dirsrv-11.7   
Hardware: Unspecified   
OS: Unspecified   
Whiteboard:
Fixed In Version: 389-ds-base-1.4.3.34-3.module+el8dsrv+21391+b62d2223 Doc Type: Bug Fix
Doc Text:
.Directory Server now flushes the entry cache less frequently Previously, Directory Server flushed its entry cache even when it was not necessary. As a result, in certain situations, Directory Server was unresponsive and had bad performance. With this update, Director Server flushes the entry cache only when it is necessary.
Story Points: ---
Clone Of: Environment:
Last Closed: 2024-03-19 11:26:50 UTC Type: ---
Regression: --- Mount Type: ---
Documentation: --- CRM:
Verified Versions: Category: ---
oVirt Team: --- RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: --- Target Upstream Version:
Embargoed:
Bug Depends On:    
Bug Blocks: 2268183    

Description Viktor Ashirov 2024-03-06 11:30:19 UTC
This bug was initially created as a copy of Bug #2234613

Description of problem:

issue to investigate a RHDS-11 long etime situation with error "Retry cound exceeded' on BIND/ADD/DEL/MOD from revert_cache ( entry cache )


Version-Release number of selected component (if applicable):

RHDS-11.7 on RHEL-8.8
389-ds-base-1.4.3.34-1.module+el8dsrv+18528+22f7779f.x86_64
redhat-release-8.8-0.8.el8.x86_64


How reproducible:
N/A, high traffic, and other unknowns in environment.

Steps to Reproduce:
1. N/A
2.
3.

Actual results:

pattern event in errors log:

ERR - find_entry_internal_dn - Retry count exceeded (uid=

thread signature:
Contention on backend lock while reverting TXN failure
Many threads (update) are stucked waiting for backend lock => can contribute to worker starvation

        Thread 36 (Thread 0x7f16839fe700 (LWP 1672443)):
        #0  0x00007f191177ee92 in flush_hash () at target:/usr/lib64/dirsrv/plugins/libback-ldbm.so
        #1  0x00007f191177f103 in revert_cache () at target:/usr/lib64/dirsrv/plugins/libback-ldbm.so
        #2  0x00007f19117aea7c in ldbm_back_modify () at target:/usr/lib64/dirsrv/plugins/libback-ldbm.so
        #3  0x00007f19204604d0 in op_shared_modify () at target:/usr/lib64/dirsrv/libslapd.so.0
        #4  0x00007f192046112b in modify_internal_pb () at target:/usr/lib64/dirsrv/libslapd.so.0
        #5  0x00007f1920486479 in pw_apply_mods () at target:/usr/lib64/dirsrv/libslapd.so.0
        #6  0x00007f1920486686 in set_retry_cnt_and_time.constprop () at target:/usr/lib64/dirsrv/libslapd.so.0
        #7  0x00007f19204867fb in update_pw_retry () at target:/usr/lib64/dirsrv/libslapd.so.0
        #8  0x00007f192048c2cb in send_ldap_result_ext () at target:/usr/lib64/dirsrv/libslapd.so.0
        #9  0x00007f192048c54f in send_ldap_result () at target:/usr/lib64/dirsrv/libslapd.so.0
        #10 0x00007f1920473267 in slapi_send_ldap_result () at target:/usr/lib64/dirsrv/libslapd.so.0
        #11 0x00007f191179c01b in ldbm_back_bind () at target:/usr/lib64/dirsrv/plugins/libback-ldbm.so
        #12 0x0000562dfa8c4ed2 in pw_verify_be_dn ()
        #13 0x0000562dfa8b1449 in do_bind ()
        #14 0x0000562dfa8b64b5 in connection_threadmain ()
        #15 0x00007f191ce97968 in _pt_root () at target:/lib64/libnspr4.so
        #16 0x00007f191c8321cf in start_thread () at target:/lib64/libpthread.so.0
        #17 0x00007f191eae5dd3 in clone () at target:/lib64/libc.so.6


Expected results:
yes


Additional info:

RHDS-11.6 related fix: bz 2051476 - high contention in find_entry_internal_dn on mixed load
https://bugzilla.redhat.com/2051476
https://access.redhat.com/errata/RHBA-2023:0186
"
Cause: Cache c_mutex type was changed from PR_Monitor to pthread recursive mutex implementation. It brought a minor performance boost but also proved to be a less stable solution in its current way.
Additionally, another issue happens when updating the parent entry of a deleted entry (numsubordinates), if it fails to lock the parent it does not return the parent entry.

Consequence: "find_entry_internal_dn - Retry count exceeded" error appears in the error log with high concurrent mixed operations load on a flat tree.
And when the other issue happens, refcnt becomes invalid. Which may lead to other cache locking issues.

Fix: Change cache c_mutex type to PR_Monitor.
In the case of the failure to lock the parent entry, the entry should be returned.

Result: "find_entry_internal_dn - Retry count exceeded" error doesn't appear. And the cache structure exists in the correct state with the correct refcnt.
"

so
ERR - find_entry_internal_dn - Retry count exceeded
will happen again, and there have been more reports
https://bugzilla.redhat.com/show_bug.cgi?id=2051476#c45

Comment 1 thierry bordaz 2024-03-06 14:53:47 UTC
*** Bug 2268186 has been marked as a duplicate of this bug. ***

Comment 15 errata-xmlrpc 2024-03-19 11:26:50 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (Moderate: redhat-ds:11 security, bug fix, and enhancement update), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2024:1372