Bug 2234613

Summary:	RHDS-11 investigate long etime and error "Retry count exceeded" on BIND/ADD/DEL/MOD from revert_cache [12.4]
Product:	Red Hat Directory Server	Reporter:	Marc Sauton <msauton>
Component:	389-ds-base	Assignee:	thierry bordaz <tbordaz>
Status:	CLOSED ERRATA	QA Contact:	LDAP QA Team <idm-ds-qe-bugs>
Severity:	urgent	Docs Contact:	Evgenia Martynyuk <emartyny>
Priority:	urgent
Version:	11.7	CC:	bsmejkal, cgaynor, ddas, idm-ds-dev-bugs, knakai, musoni, rmarigny, tbordaz, tmihinto, tscherf, vashirov, vvanhaft
Target Milestone:	DS12.4	Keywords:	Triaged
Target Release:	dirsrv-12.4
Hardware:	All
OS:	Linux
Whiteboard:	sync-to-jira
Fixed In Version:	redhat-ds-12-9040020240116164822.1674d574	Doc Type:	Bug Fix
Doc Text:	.Directory Server now flushes the entry cache less frequently Previously, Directory Server flushed its entry cache even when it was not necessary. As a result, in certain situations, Directory Server was unresponsive and performed poorly. With this update, Directory Server flushes the entry cache only when it is necessary.	Story Points:	---
Clone Of:
Clones:	2268177 (view as bug list)		Environment:
Last Closed:	2024-05-07 00:15:24 UTC	Type:	Bug
Regression:	---	Mount Type:	---
Documentation:	---	CRM:
Verified Versions:		Category:	---
oVirt Team:	---	RHEL 7.3 requirements from Atomic Host:
Cloudforms Team:	---	Target Upstream Version:
Embargoed:
Bug Depends On:
Bug Blocks:	2268177, 2268183, 2268186

Description Marc Sauton 2023-08-24 21:47:15 UTC

Description of problem:

Investigate an RHDS-11 long etime situation with the error "Retry count exceeded" on BIND/ADD/DEL/MOD from revert_cache (entry cache).
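
A minimal sketch for spotting long etime operations in an access log. It assumes that RESULT lines carry an "etime=<seconds>" field; the function name, the 1-second threshold, and any sample lines are hypothetical and not taken from this report.

```python
import re

# Assumption: each RESULT line in the access log carries "etime=<seconds>",
# written either as an integer or as a fractional value.
ETIME_RE = re.compile(r"\betime=(\d+(?:\.\d+)?)")


def slow_results(lines, threshold_seconds=1.0):
    """Return the lines whose etime is at or above threshold_seconds."""
    slow = []
    for line in lines:
        match = ETIME_RE.search(line)
        if match and float(match.group(1)) >= threshold_seconds:
            slow.append(line.rstrip("\n"))
    return slow
```

The function accepts any iterable of lines, so an open file object for the access log can be passed directly.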


Version-Release number of selected component (if applicable):

RHDS-11.7 on RHEL-8.8
389-ds-base-1.4.3.34-1.module+el8dsrv+18528+22f7779f.x86_64
redhat-release-8.8-0.8.el8.x86_64


How reproducible:
N/A, high traffic, and other unknowns in environment.

Steps to Reproduce:
1. N/A

Actual results:

Recurring pattern in the errors log:

ERR - find_entry_internal_dn - Retry count exceeded (uid=
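
A minimal sketch for counting how often this pattern appears in an errors log. It matches only on the substring shown above and assumes one event per line; the function name is hypothetical.

```python
# Substring taken from the errors log line quoted in this report.
PATTERN = "find_entry_internal_dn - Retry count exceeded"


def count_retry_errors(lines):
    """Return the number of lines that contain the retry-count error."""
    return sum(1 for line in lines if PATTERN in line)
```

Any iterable of strings works as input, for example an open file object for the errors log.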

thread signature:
Contention on the backend lock while reverting a failed TXN.
Many update threads are stuck waiting for the backend lock, which can contribute to worker starvation.

        Thread 36 (Thread 0x7f16839fe700 (LWP 1672443)):
        #0  0x00007f191177ee92 in flush_hash () at target:/usr/lib64/dirsrv/plugins/libback-ldbm.so
        #1  0x00007f191177f103 in revert_cache () at target:/usr/lib64/dirsrv/plugins/libback-ldbm.so
        #2  0x00007f19117aea7c in ldbm_back_modify () at target:/usr/lib64/dirsrv/plugins/libback-ldbm.so
        #3  0x00007f19204604d0 in op_shared_modify () at target:/usr/lib64/dirsrv/libslapd.so.0
        #4  0x00007f192046112b in modify_internal_pb () at target:/usr/lib64/dirsrv/libslapd.so.0
        #5  0x00007f1920486479 in pw_apply_mods () at target:/usr/lib64/dirsrv/libslapd.so.0
        #6  0x00007f1920486686 in set_retry_cnt_and_time.constprop () at target:/usr/lib64/dirsrv/libslapd.so.0
        #7  0x00007f19204867fb in update_pw_retry () at target:/usr/lib64/dirsrv/libslapd.so.0
        #8  0x00007f192048c2cb in send_ldap_result_ext () at target:/usr/lib64/dirsrv/libslapd.so.0
        #9  0x00007f192048c54f in send_ldap_result () at target:/usr/lib64/dirsrv/libslapd.so.0
        #10 0x00007f1920473267 in slapi_send_ldap_result () at target:/usr/lib64/dirsrv/libslapd.so.0
        #11 0x00007f191179c01b in ldbm_back_bind () at target:/usr/lib64/dirsrv/plugins/libback-ldbm.so
        #12 0x0000562dfa8c4ed2 in pw_verify_be_dn ()
        #13 0x0000562dfa8b1449 in do_bind ()
        #14 0x0000562dfa8b64b5 in connection_threadmain ()
        #15 0x00007f191ce97968 in _pt_root () at target:/lib64/libnspr4.so
        #16 0x00007f191c8321cf in start_thread () at target:/lib64/libpthread.so.0
        #17 0x00007f191eae5dd3 in clone () at target:/lib64/libc.so.6
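
To estimate how many threads share this signature, a thread dump in the format above can be scanned for stacks that contain a given frame. This is a minimal sketch that assumes the "Thread N (Thread 0x... (LWP ...)):" headers and "#n 0x... in function ()" frame lines shown in the trace; the function name threads_in_function is hypothetical.

```python
import re

# Header and frame patterns modelled on the stack trace quoted above.
THREAD_RE = re.compile(r"^\s*Thread (\d+) \(Thread 0x[0-9a-f]+ \(LWP \d+\)\):")
FRAME_RE = re.compile(r"^\s*#\d+\s+0x[0-9a-f]+ in (\S+) \(")


def threads_in_function(dump_text, function="revert_cache"):
    """Return the thread numbers whose stack contains a frame for `function`."""
    hits = []
    current = None
    for line in dump_text.splitlines():
        header = THREAD_RE.match(line)
        if header:
            current = int(header.group(1))
            continue
        frame = FRAME_RE.match(line)
        if (frame and current is not None
                and frame.group(1) == function and current not in hits):
            hits.append(current)
    return hits
```

Calling it with function="flush_hash" narrows the result to threads that are actively flushing the cache hash rather than anywhere inside revert_cache.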


Expected results:
yes


Additional info:

RHDS-11.6 related fix: bz 2051476 - high contention in find_entry_internal_dn on mixed load
https://bugzilla.redhat.com/2051476
https://access.redhat.com/errata/RHBA-2023:0186
"
Cause: Cache c_mutex type was changed from PR_Monitor to pthread recursive mutex implementation. It brought a minor performance boost but also proved to be a less stable solution in its current way.
Additionally, another issue happens when updating the parent entry of a deleted entry (numsubordinates), if it fails to lock the parent it does not return the parent entry.

Consequence: "find_entry_internal_dn - Retry count exceeded" error appears in the error log with high concurrent mixed operations load on a flat tree.
And when the other issue happens, refcnt becomes invalid. Which may lead to other cache locking issues.

Fix: Change cache c_mutex type to PR_Monitor.
In the case of the failure to lock the parent entry, the entry should be returned.

Result: "find_entry_internal_dn - Retry count exceeded" error doesn't appear. And the cache structure exists in the correct state with the correct refcnt.
"

so
ERR - find_entry_internal_dn - Retry count exceeded
will happen again, and there have been more reports
https://bugzilla.redhat.com/show_bug.cgi?id=2051476#c45

Comment 31 Colum Gaynor 2024-03-20 19:29:58 UTC

@tbordaz Thanks - Colum

Comment 36 errata-xmlrpc 2024-05-07 00:15:24 UTC

Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (redhat-ds:12 bug fix and enhancement update), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHEA-2024:2718

Comment 37 Red Hat Bugzilla 2024-09-05 04:25:04 UTC

The needinfo request[s] on this closed bug have been removed as they have been unresolved for 120 days