Bug 1877647 - ERR - vlv_build_idl - Can't follow db cursor (err -30988)
Summary: ERR - vlv_build_idl - Can't follow db cursor (err -30988)
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: Red Hat Directory Server
Classification: Red Hat
Component: 389-ds-base
Version: 11.5
Hardware: All
OS: Linux
high
high
Target Milestone: DS11.5
: dirsrv-11.6
Assignee: thierry bordaz
QA Contact: RHDS QE
Marc Muehlfeld
URL:
Whiteboard: sync-to-jira
Depends On:
Blocks:
TreeView+ depends on / blocked
 
Reported: 2020-09-10 05:16 UTC by Marc Sauton
Modified: 2023-12-15 19:16 UTC (History)
10 users (show)

Fixed In Version: redhat-ds-11-8070020220729181859-54d36466
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed: 2022-11-14 08:53:14 UTC
Target Upstream Version:
Embargoed:


Attachments (Terms of Use)


Links
System ID Private Priority Status Summary Last Updated
Github 389ds 389-ds-base issues 5252 0 None open During DEL, vlv search can erroneously return NULL candidate 2022-04-11 12:52:49 UTC
Red Hat Issue Tracker IDMDS-2157 0 None None None 2022-04-04 08:41:09 UTC
Red Hat Issue Tracker IDMDS-2480 0 None None None 2022-08-30 08:27:53 UTC
Red Hat Product Errata RHBA-2022:7929 0 None None None 2022-11-14 08:53:41 UTC

Description Marc Sauton 2020-09-10 05:16:32 UTC
Description of problem:

we have those events:

[14/Aug/2020:22:08:06.331297223 -0500] - ERR - vlv_build_idl - Can't follow db cursor (err -30988)
[14/Aug/2020:22:08:06.331938667 -0500] - ERR - vlv_find_index_by_filter_txn - vlv find index: err 1

matching some failed ldapsearches

stack trace and core file from the attachments called

sf..loop.263.2020-09-03-113401.507312375.core.bin.29771
sf..loop.263.2020-09-03-113401.507312375.stacktrace.txt

have a thread of interest: thread 27

#2  0x00007fa5b6aebcb8 in vslapd_log_error (fp=0x564239570c90, sev_level=sev_level@entry=22, subsystem=subsystem@entry=0x7fa5a9e0cfad "vlv_build_idl", fmt=fmt@entry=0x7fa5a9e0cb68 "Can't follow db cursor (err %d)\n", ap=ap@entry=0x7fa48b67a1d0, locked=locked@entry=1) at ldap/servers/slapd/log.c:2445

the overall information is

c_connid = 3473104
o_opid = 156
c_dn = cn=unixproxy,cn=config
c_ipaddr = edited
ldapsearch base = ou=people,dc=edited,dc=edited
database = userroot / cn=config,cn=userRoot,cn=ldbm database,cn=plugins,cn=config
ldapsearch filter = "(&(uid=edited)(objectClass=posixAccount)(&(uidNumber=*)(!(uidNumber=0))))"
ldapsearch attributes = objectClass uid userPassword uidNumber gidNumber gecos homeDirectory loginShell krbprincipalname cn memberOf modifyTimestamp modifyTimestamp shadowLastChange shadowMin shadowMax shadowWarning shadowI"...

userRoot/vlv#vlvpeopleusridsort.db
uidNumber

the matching access log looks like a SSSD client:

[03/Sep/2020:11:34:01.440136237 -0500] conn=3473104 op=156 SRCH base=\"ou=people,dc=edited,dc=edited\" scope=2 filter=\"(&(uid=edited)(objectClass=posixAccount)(&(uidNumber=*)(!(uidNumber=0))))\" attrs=\"objectClass uid userPassword uidNumber gidNumber gecos homeDirectory loginShell krbprincipalname cn memberOf modifyTimestamp modifyTimestamp shadowLastChange shadowMin shadowMax shadowWarning shadowInactive shadowExpire shadowFlag krblastpwdchange krbpasswordexpiration pwdattribute authorizedService accountexpires useraccountcontrol nsAccountLock host logindisabled loginexpirationtime loginallowedtimemap sshPublicKey\"

this is a ldapsearch, but I do not seem to see the LDAP controls requesting a VLV and server side sort of uidnumber to end up in the vlvpeopleusridsort.db , why the search ends up in the frame

#14 0x00007fa5a9de3082 in ldbm_back_search (pb=0x564244d5e240) at ldap/servers/slapd/back-ldbm/ldbm_search.c:619
        rc = <optimized out>
        expire_time = {tv_sec = 0, tv_nsec = 0}
        vlv_response_control = {targetPosition = 964808128, contentCount = 22082, result = 0}

the content count seem reasonable at 22082, but the VLV target position seem out of bound to me, this does not look correct.

there is some significant search traffic on the system, may be some memory corruption?


I need a review of the stack trace



Version-Release number of selected component (if applicable):

RHEL-7.8
389-ds-base-1.3.10.1-14.el7_8.x86_64
RHDS-10.5


How reproducible:
N/A


Steps to Reproduce:
1. event hard to catch in customer's environment, needed hundreds of core files
2.
3.


Actual results:
failing ldapsearch with error 1
Can't follow db cursor (err 1)


Expected results:
yes


Additional info:


frame 14
p *operation
$32 = {o_ber = 0x564280af4500, o_msgid = 157, o_tag = 99, o_hr_time_rel = {tv_sec = 7136702, tv_nsec = 862461888}, o_hr_time_utc = {tv_sec = 1599150841, tv_nsec = 439942088}, o_isroot = 0, o_sdn = {flag = 10 '\n', udn = 0x56426aa46ec0 "cn=edited,cn=config", dn = 0x5642dc654aa0 "cn=edited,cn=config", ndn = 0x0, ndn_len = 22}, 
  o_authtype = 0x5642dec8cee0 "simple", o_ssf = 256, o_opid = 156, o_connid = 3473104, o_handler_data = 0x0, o_result_handler = 0x0, o_search_entry_handler = 0x0, o_search_referral_handler = 0x0, o_csngen_handler = 0x0, o_replica_attr_handler = 0x0, o_next = 0x564258e56960, o_status = 0, o_searchattrs = 0x56427addde60, o_flags = 960, 
  o_extension = 0x5642bf682660, o_target_spec = 0x56427562a600, o_target_entry = 0x5642c6954ee0, o_target_entry_id = 3, o_abandoned_op = 0, o_params = {operation_type = 4, target_address = {udn = 0x564264b1b740 "ou=people,dc=edited,dc=edited", uniqueid = 0x0, sdn = 0x56425ecb1ad0}, csn = 0x0, request_controls = 0x0, p = {p_add = {target_entry = 0x2, 
        parentuniqueid = 0xe10ffffffff <Address 0xe10ffffffff out of bounds>}, p_bind = {bind_method = 2, bind_creds = 0xe10ffffffff, bind_saslmechanism = 0x5642dd0c7080 "", bind_ret_saslcreds = 0x5642588685b0}, p_compare = {compare_ava = {ava_type = 0x2 <Address 0x2 out of bounds>, ava_value = {bv_len = 15466177232895, bv_val = 0x5642dd0c7080 ""}, 
          ava_private = 0x5642588685b0}}, p_modify = {modify_mods = 0x2}, p_modrdn = {modrdn_newrdn = 0x2 <Address 0x2 out of bounds>, modrdn_deloldrdn = -1, modrdn_newsuperior_address = {udn = 0x5642dd0c7080 "", uniqueid = 0x5642588685b0 "(&(uid=s_pmon)(objectClass=posixAccount)(&(uidNumber=*)(!(uidNumber=0))))", sdn = 0x5642456ccfc0}, 
        modrdn_mods = 0x100000000}, p_search = {search_scope = 2, search_deref = 0, search_sizelimit = -1, search_timelimit = 3600, search_filter = 0x5642dd0c7080, search_strfilter = 0x5642588685b0 "(&(uid=edited)(objectClass=posixAccount)(&(uidNumber=*)(!(uidNumber=0))))", search_attrs = 0x5642456ccfc0, search_attrsonly = 0, search_is_and = 1, 
        search_gerattrs = 0x0}, p_abandon = {abandon_targetmsgid = 2}, p_extended = {exop_oid = 0x2 <Address 0x2 out of bounds>, exop_value = 0xe10ffffffff}}}, o_results = {operation_type = 0, opreturn = 0, result_controls = 0x0, result_code = 0, result_text = 0x0, result_matched = 0x0, r = {r_bind = {bind_ret_saslcreds = 0x564254f2fb80}, r_search = {
        search_result_set = 0x564254f2fb80, search_result_entry = 0x0, opaque_backend_ptr = 0x0, nentries = 0, search_referrals = 0x0, estimate = 0}, r_extended = {exop_ret_oid = 0x564254f2fb80 "", exop_ret_value = 0x0}}}, o_pagedresults_sizelimit = -1, o_reverse_search_state = 0}
(gdb)

Comment 8 thierry bordaz 2022-03-23 15:54:14 UTC
Please have an initial investigation. Expectations
- acknowledge if it is a valid bug
- impact on customer, frequency, dangerous
- complexity

Comment 9 thierry bordaz 2022-04-08 12:15:59 UTC
   - The RC of this bug is an improper computation of the #record in the vlv index
     During a vlv search, the #records is computed in vlv_find_index_by_filter_txn/vlvIndex_get_indexlength 
     The #records is then used in vlv_build_idl to iterate over the index.
     If the index is updated (removal of records) between the time the #record is computed
     and the time the index is walked through, then vlv_build_idl can try to access a record
     that does no longer exist

   - A reproducer testcase
     Create a vlv index (like cn=vlvIdx,cn=vlvSrch in suites/vlv/regression_test.py)
     Provision 10K users
     Then run in a search bench [1] and a deletion bench (to shrink the index) [2]
     It logs messages like
[08/Apr/2022:13:44:47.184996442 +0200] - ERR - vlv_build_idl - Can't follow db cursor (err -12797)
[08/Apr/2022:13:44:47.185800492 +0200] - ERR - vlv_find_index_by_filter_txn - vlv find index: err 1
     Err return code in vlv_build_idl differs from CU on master

   - The error does not reflect an index corruption
     The index is valid, just because of the dynamic of the server #record can
     fluctuates

   - This bug triggers 
     - failure of the VLV search that hits this mismatch
       when computed #records > real #records
     - VLV search may return improper result set
       when computed #records < real #record

   - The error logs are false Error and should not be logged


   In Conclusion:
      This bug impacts VLV SRCH that fail during decrease of VLV values
      or report improper result set during increase of VLV values
      The logged message is a false alarming message.
      I see no reason to reindex when such messages are logged

Next:
  - working on a fix

[1]
#!/bin/bash
while :
do
ldapsearch -LLL  -D "cn=directory manager" -w password -b "dc=example,dc=com"  -E vlv='0/5/100/0' -z 4 -E sss=cn '(|(objectclass=*)(objectclass=ldapsubentry))' dn cn sn ou &
ldapsearch -LLL  -D "cn=directory manager" -w password -b "dc=example,dc=com"  -E vlv='0/5/100/0' -z 4 -E sss=cn '(|(objectclass=*)(objectclass=ldapsubentry))' dn cn sn ou &
..
ldapsearch -LLL  -D "cn=directory manager" -w password -b "dc=example,dc=com"  -E vlv='0/5/100/0' -z 4 -E sss=cn '(|(objectclass=*)(objectclass=ldapsubentry))' dn cn sn ou &
ldapsearch -LLL  -D "cn=directory manager" -w password -b "dc=example,dc=com"  -E vlv='0/5/100/0' -z 4 -E sss=cn '(|(objectclass=*)(objectclass=ldapsubentry))' dn cn sn ou 

done

[2]
For CPT=0..10K
ldapdelete -D "cn=directory manager" -w password "uid=user${CPT},ou=people,dc=example,dc=com"

Comment 11 thierry bordaz 2022-04-11 14:26:12 UTC
Fix pushed upstream => POST

Comment 15 bsmejkal 2022-10-11 12:36:33 UTC
As per comment #c14, marking as VERIFIED.

Comment 17 errata-xmlrpc 2022-11-14 08:53:14 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (redhat-ds:11 bug fix and enhancement update), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2022:7929


Note You need to log in before you can comment on or make changes to this bug.