Bug 2051476
| Summary: | high contention in find_entry_internal_dn on mixed load | | |
|---|---|---|---|
| Product: | Red Hat Directory Server | Reporter: | Anton Bobrov <abobrov> |
| Component: | 389-ds-base | Assignee: | Simon Pichugin <spichugi> |
| Status: | CLOSED ERRATA | QA Contact: | RHDS QE <ds-qe-bugs> |
| Severity: | medium | Docs Contact: | |
| Priority: | high | | |
| Version: | 11.0 | CC: | bs168, bsmejkal, ccallaha, jwooten, ldap-maint, mreynolds, msauton, pasik, pcech, spichugi, tbordaz, tmihinto |
| Target Milestone: | DS11.6 | Keywords: | Triaged |
| Target Release: | dirsrv-11.6 | | |
| Hardware: | Unspecified | | |
| OS: | Unspecified | | |
| Whiteboard: | sync-to-jira | | |
| Fixed In Version: | redhat-ds-11-8070020230103174318.95958119 | Doc Type: | Bug Fix |
| Doc Text: | Cause: The cache c_mutex type was changed from a PR_Monitor to a pthread recursive mutex implementation. This brought a minor performance boost but proved less stable in its current form. In addition, when updating the parent entry of a deleted entry (numsubordinates), a failure to lock the parent meant the parent entry was not returned. Consequence: The "find_entry_internal_dn - Retry count exceeded" error appears in the error log under a high concurrent mixed-operation load on a flat tree. When the second issue occurs, the refcnt becomes invalid, which may lead to further cache locking issues. Fix: The cache c_mutex type was changed back to PR_Monitor, and the parent entry is now returned even when locking it fails. Result: The "find_entry_internal_dn - Retry count exceeded" error no longer appears, and the cache structure remains in the correct state with the correct refcnt. | Story Points: | --- |
| Clone Of: | | Environment: | |
| Last Closed: | 2023-01-17 08:52:07 UTC | Type: | Bug |
| Regression: | --- | Mount Type: | --- |
| Documentation: | --- | CRM: | |
| Verified Versions: | | Category: | --- |
| oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: | |
| Cloudforms Team: | --- | Target Upstream Version: | |
| Embargoed: | | | |
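As background for the Doc Text above: the two locking primitives involved are an NSPR monitor (PR_Monitor) and a POSIX mutex made recursive through an attribute. The sketch below is purely illustrative, assuming a hypothetical cache-like structure; the struct, field, and function names are not the actual 389-ds-base cache code.

```c
/* Illustrative sketch of the two c_mutex variants mentioned in the Doc Text.
 * The struct and names below are hypothetical, not the real cache structure. */
#include <pthread.h>
#include <prmon.h>   /* NSPR monitors: PR_NewMonitor(), PR_EnterMonitor(), ... */

struct cache_like {
    PRMonitor *c_monitor;       /* variant the fix reverts to: NSPR monitor   */
    pthread_mutex_t c_pmutex;   /* variant that caused the regression         */
};

static void cache_like_init(struct cache_like *c)
{
    /* NSPR monitor: re-entrant by design, same thread may enter repeatedly. */
    c->c_monitor = PR_NewMonitor();

    /* POSIX mutex made recursive explicitly via an attribute. */
    pthread_mutexattr_t attr;
    pthread_mutexattr_init(&attr);
    pthread_mutexattr_settype(&attr, PTHREAD_MUTEX_RECURSIVE);
    pthread_mutex_init(&c->c_pmutex, &attr);
    pthread_mutexattr_destroy(&attr);
}

static void cache_like_lock(struct cache_like *c)
{
    PR_EnterMonitor(c->c_monitor);   /* may be re-entered by the same thread */
}

static void cache_like_unlock(struct cache_like *c)
{
    PR_ExitMonitor(c->c_monitor);
}
```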
Description
Anton Bobrov
2022-02-07 11:01:33 UTC
There are multiple problems with the related code that can be broken down into the following major points:

1) The locking is backwards.
2) The locking is exclusive.
3) The locking is unfair.

(1) The locking logic works by first retrieving an entry from the backend and then locking its cache representation. Instead, it should take a logical lock on the entry before retrieving it. To do so, entry locking should either be decoupled from the entry itself, or a dummy entry should be created first so that a lock can be held on it before the backend retrieval. In essence, the locking should be keyed on the entry DN, not on the actual entry. This would avoid wasting CPU cycles and other resources, because a logical entry lock (by DN) costs nothing, whereas a backend retrieval plus a logical lock always costs resources, especially when it is retried hundreds of times as in this case.

(2) Currently, exclusive locking is performed. This should be reworked into read locks and write locks, depending on the operation. Taking exclusive locks here severely worsens the contention and is completely unnecessary: an operation that takes a read lock does not contend with other read operations.

(3) Fair locking should be implemented to avoid situations where the worker threads that have waited on a lock the longest never get a chance to acquire it. This would also smooth out etimes on the operations involved, i.e., under high contention all threads incur roughly equal waiting costs. That gives better predictability and more consistent behavior, and avoids single-operation failures.

Furthermore, this can be drastically improved by making it configurable whether operational attributes like 'numSubordinates' and 'hasSubordinates' are returned. If there is no reason to update the parent entry, only a read lock needs to be taken on it (to prevent orphaned entries), and there is then no contention with write operations on its children. For example, in OpenDS we implemented 'numSubordinates' as a virtual attribute that was computed on the fly only when explicitly requested, without storing it as part of the entry. It is worth looking into how a similar approach could be implemented here as well.

Please note that these comments/suggestions are not hand-wavy; they are based on work I previously did addressing similar issues in the OpenDS implementation. I hope they are taken on board with that in mind and ultimately help improve the current implementation going forward.

The find_entry_internal_uniqueid() function suffers from exactly the same problems, and is actually even worse if UUIDCACHE_ON is undefined.
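To make points (1) and (2) concrete, here is a minimal sketch of locking keyed on the entry DN with POSIX read/write locks taken before the backend fetch. The helper names (dn_lock_for(), find_entry_read()) and the stripe count are hypothetical; this is not the 389-ds-base implementation, only an illustration of the suggested approach.

```c
/* Minimal sketch of DN-keyed, reader/writer entry locking (lock striping).
 * All names here are illustrative and not part of 389-ds-base. */
#include <pthread.h>

#define DN_LOCK_STRIPES 256

static pthread_rwlock_t dn_locks[DN_LOCK_STRIPES];

/* Initialize the stripe array once at startup. */
static void dn_locks_init(void)
{
    for (int i = 0; i < DN_LOCK_STRIPES; i++) {
        pthread_rwlock_init(&dn_locks[i], NULL);
    }
}

/* Map a DN to one of the striped rwlocks (simple djb2-style hash). */
static pthread_rwlock_t *dn_lock_for(const char *dn)
{
    unsigned long h = 5381;
    for (const char *p = dn; *p; p++) {
        h = h * 33 + (unsigned char)*p;
    }
    return &dn_locks[h % DN_LOCK_STRIPES];
}

/* Read path: take the cheap logical lock by DN first, then fetch from the
 * backend. Readers sharing the same stripe do not block one another. */
static void *find_entry_read(const char *dn, void *(*backend_fetch)(const char *))
{
    pthread_rwlock_t *lk = dn_lock_for(dn);
    pthread_rwlock_rdlock(lk);
    void *entry = backend_fetch(dn);   /* retrieval happens under the lock */
    pthread_rwlock_unlock(lk);
    return entry;
}

/* Write path: same idea, but with an exclusive lock. */
static void modify_entry(const char *dn, void (*apply)(const char *))
{
    pthread_rwlock_t *lk = dn_lock_for(dn);
    pthread_rwlock_wrlock(lk);
    apply(dn);
    pthread_rwlock_unlock(lk);
}
```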
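Point (3), fair locking, can be illustrated with a simple FIFO ticket lock. Again, this is only a sketch using C11 atomics, not code from 389-ds-base; a production version would block or yield rather than spin.

```c
/* Minimal sketch of a fair (FIFO) ticket lock, assuming C11 atomics.
 * Illustrative only; not the 389-ds-base implementation. */
#include <stdatomic.h>

typedef struct {
    atomic_uint next_ticket;   /* ticket handed to the next arriving thread */
    atomic_uint now_serving;   /* ticket currently allowed to hold the lock */
} ticket_lock_t;

static void ticket_lock_init(ticket_lock_t *l)
{
    atomic_init(&l->next_ticket, 0);
    atomic_init(&l->now_serving, 0);
}

static void ticket_lock_acquire(ticket_lock_t *l)
{
    /* Each thread takes a ticket; threads are served strictly in arrival
     * order, so the longest waiter is always the next to acquire the lock. */
    unsigned my_ticket = atomic_fetch_add(&l->next_ticket, 1);
    while (atomic_load(&l->now_serving) != my_ticket) {
        /* spin; a real implementation would yield or park the thread */
    }
}

static void ticket_lock_release(ticket_lock_t *l)
{
    atomic_fetch_add(&l->now_serving, 1);
}
```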
Hello Petr,

If there is a hot fix, could I offer it to the customer for a test? Not sure if they will be able or willing to do so. Thanks.

Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory (redhat-ds:11 bug fix and enhancement update), and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2023:0186

Hi,

A customer is using 389-ds-1.4.3.31-11 (the version in errata https://access.redhat.com/errata/RHBA-2023:0186) but is still getting the "find_entry_internal_dn - Retry count exceeded" error message:

===============================================
errors:[08/Jun/2023:12:46:06.612302100 +051800] - ERR - find_entry_internal_dn - Retry count exceeded (uid=<REDACTED>)
errors:[09/Jun/2023:11:49:48.933452279 +051800] - ERR - find_entry_internal_dn - Retry count exceeded (uid=<REDACTED>)
errors:[10/Jun/2023:12:32:48.203596735 +051800] - ERR - find_entry_internal_dn - Retry count exceeded (uid=<REDACTED>)
errors:[10/Jun/2023:12:32:52.062239379 +051800] - ERR - find_entry_internal_dn - Retry count exceeded (uid=<REDACTED>)
errors:[12/Jun/2023:15:31:52.128013373 +051800] - ERR - find_entry_internal_dn - Retry count exceeded (uid=<REDACTED>)
===============================================

Is this message still expected with the fix in place? What debug data would be useful for further troubleshooting?

Regards,
Têko.

(In reply to Têko Mihinto from comment #44)

> Is this message still expected with the fix in place?

This message can happen whenever the database is really busy (or under-tuned?). There is no promise that it will never happen again. Looking at your log clip, it happens only sporadically (once or twice per day). IMHO this is nothing to be worried about unless it is happening thousands of times per day, or in large clusters.

The only debug info I can think of, which will be very hard to "time", is to get a pstack right as the error is happening. Since it only happens once per day, that will be almost impossible to time correctly. Other than that, I would check the entry cache tuning and make sure the entire database fits in memory.