Bug 2234613

Summary: RHDS-11 investigate long etime and error "Retry count exceeded" on BIND/ADD/DEL/MOD from revert_cache [12.4]
Product: Red Hat Directory Server
Reporter: Marc Sauton <msauton>
Component: 389-ds-base
Assignee: thierry bordaz <tbordaz>
Status: CLOSED ERRATA
QA Contact: LDAP QA Team <idm-ds-qe-bugs>
Severity: urgent
Docs Contact: Evgenia Martynyuk <emartyny>
Priority: urgent
Version: 11.7
CC: bsmejkal, cgaynor, ddas, idm-ds-dev-bugs, knakai, musoni, rmarigny, tbordaz, tmihinto, tscherf, vashirov, vvanhaft
Target Milestone: DS12.4
Keywords: Triaged
Target Release: dirsrv-12.4
Hardware: All
OS: Linux
Whiteboard: sync-to-jira
Fixed In Version: redhat-ds-12-9040020240116164822.1674d574
Doc Type: Bug Fix
Doc Text:
.Directory Server now flushes the entry cache less frequently
Previously, Directory Server flushed its entry cache even when it was not necessary. As a result, in certain situations, Directory Server was unresponsive and performed poorly. With this update, Directory Server flushes the entry cache only when it is necessary.
Story Points: ---
Clone Of:
: 2268177
Environment:
Last Closed: 2024-05-07 00:15:24 UTC
Type: Bug
Regression: ---
Mount Type: ---
Documentation: ---
CRM:
Verified Versions:
Category: ---
oVirt Team: ---
RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: ---
Target Upstream Version:
Embargoed:
Bug Depends On:
Bug Blocks: 2268177, 2268183, 2268186

Description Marc Sauton 2023-08-24 21:47:15 UTC
Description of problem:

Investigate long etimes on RHDS-11 together with the error "Retry count exceeded" on BIND/ADD/DEL/MOD operations, coming from revert_cache (entry cache).


Version-Release number of selected component (if applicable):

RHDS-11.7 on RHEL-8.8
389-ds-base-1.4.3.34-1.module+el8dsrv+18528+22f7779f.x86_64
redhat-release-8.8-0.8.el8.x86_64


How reproducible:
N/A, high traffic, and other unknowns in environment.

Steps to Reproduce:
1. N/A

Actual results:

Recurring pattern in the errors log:

ERR - find_entry_internal_dn - Retry count exceeded (uid=
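To see how often this message occurs over time, a small script can count matching lines per hour. This is only a sketch: the function name `count_retry_errors` is hypothetical, and it assumes each log line starts with a bracketed timestamp of the form `[DD/Mon/YYYY:HH:...]`. Lines without such a prefix are counted under "unknown".

```python
import re
from collections import Counter

# Message text taken from the errors log line quoted above.
MESSAGE = "find_entry_internal_dn - Retry count exceeded"

# Assumption: lines begin with a bracketed timestamp such as
# [24/Aug/2023:21:47:15.123456789 +0000]. Only day and hour are kept.
HOUR = re.compile(r"^\[(\d{2}/[A-Za-z]{3}/\d{4}:\d{2})")


def count_retry_errors(lines):
    """Return a Counter of matching lines keyed by 'DD/Mon/YYYY:HH'."""
    per_hour = Counter()
    for line in lines:
        if MESSAGE not in line:
            continue  # unrelated log line
        match = HOUR.match(line)
        per_hour[match.group(1) if match else "unknown"] += 1
    return per_hour
```

Passing an open file handle for the instance's errors log to `count_retry_errors` yields the per-hour counts, which can then be compared with the periods of long etime.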

Thread signature:
Contention on the backend lock while reverting a failed transaction (TXN).
Many update threads are stuck waiting for the backend lock, which can contribute to worker thread starvation.

        Thread 36 (Thread 0x7f16839fe700 (LWP 1672443)):
        #0  0x00007f191177ee92 in flush_hash () at target:/usr/lib64/dirsrv/plugins/libback-ldbm.so
        #1  0x00007f191177f103 in revert_cache () at target:/usr/lib64/dirsrv/plugins/libback-ldbm.so
        #2  0x00007f19117aea7c in ldbm_back_modify () at target:/usr/lib64/dirsrv/plugins/libback-ldbm.so
        #3  0x00007f19204604d0 in op_shared_modify () at target:/usr/lib64/dirsrv/libslapd.so.0
        #4  0x00007f192046112b in modify_internal_pb () at target:/usr/lib64/dirsrv/libslapd.so.0
        #5  0x00007f1920486479 in pw_apply_mods () at target:/usr/lib64/dirsrv/libslapd.so.0
        #6  0x00007f1920486686 in set_retry_cnt_and_time.constprop () at target:/usr/lib64/dirsrv/libslapd.so.0
        #7  0x00007f19204867fb in update_pw_retry () at target:/usr/lib64/dirsrv/libslapd.so.0
        #8  0x00007f192048c2cb in send_ldap_result_ext () at target:/usr/lib64/dirsrv/libslapd.so.0
        #9  0x00007f192048c54f in send_ldap_result () at target:/usr/lib64/dirsrv/libslapd.so.0
        #10 0x00007f1920473267 in slapi_send_ldap_result () at target:/usr/lib64/dirsrv/libslapd.so.0
        #11 0x00007f191179c01b in ldbm_back_bind () at target:/usr/lib64/dirsrv/plugins/libback-ldbm.so
        #12 0x0000562dfa8c4ed2 in pw_verify_be_dn ()
        #13 0x0000562dfa8b1449 in do_bind ()
        #14 0x0000562dfa8b64b5 in connection_threadmain ()
        #15 0x00007f191ce97968 in _pt_root () at target:/lib64/libnspr4.so
        #16 0x00007f191c8321cf in start_thread () at target:/lib64/libpthread.so.0
        #17 0x00007f191eae5dd3 in clone () at target:/lib64/libc.so.6
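To gauge how many threads in a full dump share this signature, a helper can list the threads whose stack contains a given function such as `revert_cache`. This is a minimal sketch, not an official tool: the name `threads_in` is hypothetical, and it assumes the gdb-style layout shown above (a `Thread N (...)` header followed by `#n  0x... in function ()` frames).

```python
import re

# Header such as: Thread 36 (Thread 0x7f16839fe700 (LWP 1672443)):
THREAD_HEADER = re.compile(r"^\s*Thread (\d+) \(")

# Frame such as: #1  0x00007f191177f103 in revert_cache () at ...
# The address part is optional; function names may contain dots
# (for example set_retry_cnt_and_time.constprop).
FRAME = re.compile(r"^\s*#\d+\s+(?:0x[0-9a-f]+ in )?([\w.]+) \(")


def threads_in(dump_lines, function):
    """Return the ids of threads whose stack contains `function`."""
    hits = []
    current = None  # id of the thread whose frames are being read
    for line in dump_lines:
        header = THREAD_HEADER.match(line)
        if header:
            current = int(header.group(1))
            continue
        frame = FRAME.match(line)
        if frame and current is not None and frame.group(1) == function:
            if current not in hits:
                hits.append(current)
    return hits
```

Comparing `len(threads_in(dump, "revert_cache"))` with the total number of worker threads gives a rough measure of how much of the pool is tied up in the cache revert path.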


Expected results:
yes


Additional info:

RHDS-11.6 related fix: bz 2051476 - high contention in find_entry_internal_dn on mixed load
https://bugzilla.redhat.com/2051476
https://access.redhat.com/errata/RHBA-2023:0186
"
Cause: Cache c_mutex type was changed from PR_Monitor to pthread recursive mutex implementation. It brought a minor performance boost but also proved to be a less stable solution in its current way.
Additionally, another issue happens when updating the parent entry of a deleted entry (numsubordinates), if it fails to lock the parent it does not return the parent entry.

Consequence: "find_entry_internal_dn - Retry count exceeded" error appears in the error log with high concurrent mixed operations load on a flat tree.
And when the other issue happens, refcnt becomes invalid. Which may lead to other cache locking issues.

Fix: Change cache c_mutex type to PR_Monitor.
In the case of the failure to lock the parent entry, the entry should be returned.

Result: "find_entry_internal_dn - Retry count exceeded" error doesn't appear. And the cache structure exists in the correct state with the correct refcnt.
"

So the error
ERR - find_entry_internal_dn - Retry count exceeded
can happen again, and there have been more reports:
https://bugzilla.redhat.com/show_bug.cgi?id=2051476#c45

Comment 31 Colum Gaynor 2024-03-20 19:29:58 UTC
@tbordaz Thanks - Colum

Comment 36 errata-xmlrpc 2024-05-07 00:15:24 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (redhat-ds:12 bug fix and enhancement update), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHEA-2024:2718

Comment 37 Red Hat Bugzilla 2024-09-05 04:25:04 UTC
The needinfo request[s] on this closed bug have been removed as they have been unresolved for 120 days