Description of problem: we have those events: [14/Aug/2020:22:08:06.331297223 -0500] - ERR - vlv_build_idl - Can't follow db cursor (err -30988) [14/Aug/2020:22:08:06.331938667 -0500] - ERR - vlv_find_index_by_filter_txn - vlv find index: err 1 matching some failed ldapsearches stack trace and core file from the attachments called sf..loop.263.2020-09-03-113401.507312375.core.bin.29771 sf..loop.263.2020-09-03-113401.507312375.stacktrace.txt have a thread of interest: thread 27 #2 0x00007fa5b6aebcb8 in vslapd_log_error (fp=0x564239570c90, sev_level=sev_level@entry=22, subsystem=subsystem@entry=0x7fa5a9e0cfad "vlv_build_idl", fmt=fmt@entry=0x7fa5a9e0cb68 "Can't follow db cursor (err %d)\n", ap=ap@entry=0x7fa48b67a1d0, locked=locked@entry=1) at ldap/servers/slapd/log.c:2445 the overall information is c_connid = 3473104 o_opid = 156 c_dn = cn=unixproxy,cn=config c_ipaddr = edited ldapsearch base = ou=people,dc=edited,dc=edited database = userroot / cn=config,cn=userRoot,cn=ldbm database,cn=plugins,cn=config ldapsearch filter = "(&(uid=edited)(objectClass=posixAccount)(&(uidNumber=*)(!(uidNumber=0))))" ldapsearch attributes = objectClass uid userPassword uidNumber gidNumber gecos homeDirectory loginShell krbprincipalname cn memberOf modifyTimestamp modifyTimestamp shadowLastChange shadowMin shadowMax shadowWarning shadowI"... userRoot/vlv#vlvpeopleusridsort.db uidNumber the matching access log looks like a SSSD client: [03/Sep/2020:11:34:01.440136237 -0500] conn=3473104 op=156 SRCH base=\"ou=people,dc=edited,dc=edited\" scope=2 filter=\"(&(uid=edited)(objectClass=posixAccount)(&(uidNumber=*)(!(uidNumber=0))))\" attrs=\"objectClass uid userPassword uidNumber gidNumber gecos homeDirectory loginShell krbprincipalname cn memberOf modifyTimestamp modifyTimestamp shadowLastChange shadowMin shadowMax shadowWarning shadowInactive shadowExpire shadowFlag krblastpwdchange krbpasswordexpiration pwdattribute authorizedService accountexpires useraccountcontrol nsAccountLock host logindisabled loginexpirationtime loginallowedtimemap sshPublicKey\" this is a ldapsearch, but I do not seem to see the LDAP controls requesting a VLV and server side sort of uidnumber to end up in the vlvpeopleusridsort.db , why the search ends up in the frame #14 0x00007fa5a9de3082 in ldbm_back_search (pb=0x564244d5e240) at ldap/servers/slapd/back-ldbm/ldbm_search.c:619 rc = <optimized out> expire_time = {tv_sec = 0, tv_nsec = 0} vlv_response_control = {targetPosition = 964808128, contentCount = 22082, result = 0} the content count seem reasonable at 22082, but the VLV target position seem out of bound to me, this does not look correct. there is some significant search traffic on the system, may be some memory corruption? I need a review of the stack trace Version-Release number of selected component (if applicable): RHEL-7.8 389-ds-base-1.3.10.1-14.el7_8.x86_64 RHDS-10.5 How reproducible: N/A Steps to Reproduce: 1. event hard to catch in customer's environment, needed hundreds of core files 2. 3. Actual results: failing ldapsearch with error 1 Can't follow db cursor (err 1) Expected results: yes Additional info: frame 14 p *operation $32 = {o_ber = 0x564280af4500, o_msgid = 157, o_tag = 99, o_hr_time_rel = {tv_sec = 7136702, tv_nsec = 862461888}, o_hr_time_utc = {tv_sec = 1599150841, tv_nsec = 439942088}, o_isroot = 0, o_sdn = {flag = 10 '\n', udn = 0x56426aa46ec0 "cn=edited,cn=config", dn = 0x5642dc654aa0 "cn=edited,cn=config", ndn = 0x0, ndn_len = 22}, o_authtype = 0x5642dec8cee0 "simple", o_ssf = 256, o_opid = 156, o_connid = 3473104, o_handler_data = 0x0, o_result_handler = 0x0, o_search_entry_handler = 0x0, o_search_referral_handler = 0x0, o_csngen_handler = 0x0, o_replica_attr_handler = 0x0, o_next = 0x564258e56960, o_status = 0, o_searchattrs = 0x56427addde60, o_flags = 960, o_extension = 0x5642bf682660, o_target_spec = 0x56427562a600, o_target_entry = 0x5642c6954ee0, o_target_entry_id = 3, o_abandoned_op = 0, o_params = {operation_type = 4, target_address = {udn = 0x564264b1b740 "ou=people,dc=edited,dc=edited", uniqueid = 0x0, sdn = 0x56425ecb1ad0}, csn = 0x0, request_controls = 0x0, p = {p_add = {target_entry = 0x2, parentuniqueid = 0xe10ffffffff <Address 0xe10ffffffff out of bounds>}, p_bind = {bind_method = 2, bind_creds = 0xe10ffffffff, bind_saslmechanism = 0x5642dd0c7080 "", bind_ret_saslcreds = 0x5642588685b0}, p_compare = {compare_ava = {ava_type = 0x2 <Address 0x2 out of bounds>, ava_value = {bv_len = 15466177232895, bv_val = 0x5642dd0c7080 ""}, ava_private = 0x5642588685b0}}, p_modify = {modify_mods = 0x2}, p_modrdn = {modrdn_newrdn = 0x2 <Address 0x2 out of bounds>, modrdn_deloldrdn = -1, modrdn_newsuperior_address = {udn = 0x5642dd0c7080 "", uniqueid = 0x5642588685b0 "(&(uid=s_pmon)(objectClass=posixAccount)(&(uidNumber=*)(!(uidNumber=0))))", sdn = 0x5642456ccfc0}, modrdn_mods = 0x100000000}, p_search = {search_scope = 2, search_deref = 0, search_sizelimit = -1, search_timelimit = 3600, search_filter = 0x5642dd0c7080, search_strfilter = 0x5642588685b0 "(&(uid=edited)(objectClass=posixAccount)(&(uidNumber=*)(!(uidNumber=0))))", search_attrs = 0x5642456ccfc0, search_attrsonly = 0, search_is_and = 1, search_gerattrs = 0x0}, p_abandon = {abandon_targetmsgid = 2}, p_extended = {exop_oid = 0x2 <Address 0x2 out of bounds>, exop_value = 0xe10ffffffff}}}, o_results = {operation_type = 0, opreturn = 0, result_controls = 0x0, result_code = 0, result_text = 0x0, result_matched = 0x0, r = {r_bind = {bind_ret_saslcreds = 0x564254f2fb80}, r_search = { search_result_set = 0x564254f2fb80, search_result_entry = 0x0, opaque_backend_ptr = 0x0, nentries = 0, search_referrals = 0x0, estimate = 0}, r_extended = {exop_ret_oid = 0x564254f2fb80 "", exop_ret_value = 0x0}}}, o_pagedresults_sizelimit = -1, o_reverse_search_state = 0} (gdb)
Please have an initial investigation. Expectations - acknowledge if it is a valid bug - impact on customer, frequency, dangerous - complexity
- The RC of this bug is an improper computation of the #record in the vlv index During a vlv search, the #records is computed in vlv_find_index_by_filter_txn/vlvIndex_get_indexlength The #records is then used in vlv_build_idl to iterate over the index. If the index is updated (removal of records) between the time the #record is computed and the time the index is walked through, then vlv_build_idl can try to access a record that does no longer exist - A reproducer testcase Create a vlv index (like cn=vlvIdx,cn=vlvSrch in suites/vlv/regression_test.py) Provision 10K users Then run in a search bench [1] and a deletion bench (to shrink the index) [2] It logs messages like [08/Apr/2022:13:44:47.184996442 +0200] - ERR - vlv_build_idl - Can't follow db cursor (err -12797) [08/Apr/2022:13:44:47.185800492 +0200] - ERR - vlv_find_index_by_filter_txn - vlv find index: err 1 Err return code in vlv_build_idl differs from CU on master - The error does not reflect an index corruption The index is valid, just because of the dynamic of the server #record can fluctuates - This bug triggers - failure of the VLV search that hits this mismatch when computed #records > real #records - VLV search may return improper result set when computed #records < real #record - The error logs are false Error and should not be logged In Conclusion: This bug impacts VLV SRCH that fail during decrease of VLV values or report improper result set during increase of VLV values The logged message is a false alarming message. I see no reason to reindex when such messages are logged Next: - working on a fix [1] #!/bin/bash while : do ldapsearch -LLL -D "cn=directory manager" -w password -b "dc=example,dc=com" -E vlv='0/5/100/0' -z 4 -E sss=cn '(|(objectclass=*)(objectclass=ldapsubentry))' dn cn sn ou & ldapsearch -LLL -D "cn=directory manager" -w password -b "dc=example,dc=com" -E vlv='0/5/100/0' -z 4 -E sss=cn '(|(objectclass=*)(objectclass=ldapsubentry))' dn cn sn ou & .. ldapsearch -LLL -D "cn=directory manager" -w password -b "dc=example,dc=com" -E vlv='0/5/100/0' -z 4 -E sss=cn '(|(objectclass=*)(objectclass=ldapsubentry))' dn cn sn ou & ldapsearch -LLL -D "cn=directory manager" -w password -b "dc=example,dc=com" -E vlv='0/5/100/0' -z 4 -E sss=cn '(|(objectclass=*)(objectclass=ldapsubentry))' dn cn sn ou done [2] For CPT=0..10K ldapdelete -D "cn=directory manager" -w password "uid=user${CPT},ou=people,dc=example,dc=com"
Fix pushed upstream => POST
As per comment #c14, marking as VERIFIED.
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory (redhat-ds:11 bug fix and enhancement update), and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHBA-2022:7929