Bug 1930231
| Summary: | Large updates can reset the CLcache to the beginning of the changelog | | |
|---|---|---|---|
| Product: | Red Hat Enterprise Linux 7 | Reporter: | thierry bordaz <tbordaz> |
| Component: | 389-ds-base | Assignee: | thierry bordaz <tbordaz> |
| Status: | CLOSED ERRATA | QA Contact: | RHDS QE <ds-qe-bugs> |
| Severity: | unspecified | Docs Contact: | |
| Priority: | unspecified | | |
| Version: | 7.9 | CC: | aadhikar, bsmejkal, cobrown, kbanerje, ldap-maint, mreynolds, mrhodes, msauton, rcain, sgouvern, tmihinto, wwinter |
| Target Milestone: | rc | Keywords: | TestCannotAutomate |
| Target Release: | 7.9 | | |
| Hardware: | Unspecified | | |
| OS: | Unspecified | | |
| Whiteboard: | sync-to-jira | | |
| Fixed In Version: | 389-ds-base-1.3.10.2-10.el7_9 | Doc Type: | If docs needed, set a value |
| Doc Text: | | Story Points: | --- |
| Clone Of: | | | |
| : | 1972590 (view as bug list) | Environment: | |
| Last Closed: | 2021-03-16 13:57:39 UTC | Type: | Bug |
| Regression: | --- | Mount Type: | --- |
| Documentation: | --- | CRM: | |
| Verified Versions: | | Category: | --- |
| oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: | |
| Cloudforms Team: | --- | Target Upstream Version: | |
| Embargoed: | | | |
| Bug Depends On: | | | |
| Bug Blocks: | 1972590 | | |
Description
thierry bordaz
2021-02-18 14:40:06 UTC
The problem signature is this message:

```
clcache_load_buffer - bulk load cursor (602e4152000100030000) is lower than starting csn 602e59ec000000010000. Ending session.
```

This message was introduced by bug #1898541 as a safety check that a replication agreement does not go too far back into the past.

The bug can be diagnosed with replication debug logging enabled on the supplier side:

```
[18/Feb/2021:13:43:03.631942523 +0100] - INFO - slapd_daemon - slapd started. Listening on All Interfaces port 39001 for LDAP requests
...
#
# A first session retrieves 602e59ec000000010000.
# Note that csnBuf (the head CSN of the bulk load) is 00000000000000000000.
#
[18/Feb/2021:13:45:22.676742131 +0100] - DEBUG - agmt="cn=002" (Host-001:39002) - clcache_initial_anchorcsn - agmt="cn=002" (Host-001:39002) - (cscb 0 - state 0) - csnPrevMax () csnMax (602e6129000000010000) csnBuf (00000000000000000000) csnConsumerMax (602e59ec000000010000)
[18/Feb/2021:13:45:22.677518575 +0100] - DEBUG - agmt="cn=002" (Host-001:39002) - clcache_initial_anchorcsn - agmt="cn=002" (Host-001:39002) - (cscb 1 - state 1) - csnPrevMax () csnMax (602e4156000500030000) csnBuf (00000000000000000000) csnConsumerMax (602e4156000500030000)
[18/Feb/2021:13:45:22.678204233 +0100] - DEBUG - agmt="cn=002" (Host-001:39002) - clcache_initial_anchorcsn - agmt="cn=002" (Host-001:39002) - (cscb 2 - state 1) - csnPrevMax () csnMax (602e415f000600040000) csnBuf (00000000000000000000) csnConsumerMax (602e415f000600040000)
#
# The initial phase that creates the iterator retrieves the valid starting (aka anchor) CSN 602e59ec000000010000.
#
[18/Feb/2021:13:45:22.679083870 +0100] - DEBUG - clcache_initial_anchorcsn - anchor is now: 602e59ec000000010000
#
# Because of the crafted small (8K) CLcache buffer, setting the cursor (to 602e59ec000000010000)
# reports DB_BUFFER_SMALL.
```
I think this is the culprit of the bug: the cursor is then *not* initialized to 602e59ec000000010000. When the next update is retrieved with DB_NEXT on a *not* initialized cursor, it loads the CSN from the beginning of the CL:

```
[18/Feb/2021:13:45:22.679804365 +0100] - DEBUG - agmt="cn=002" (Host-001:39002) - clcache_load_buffer_bulk - c_get -> DB_BUFFER_SMALL for csn=602e59ec000000010000
...
```

Later, when retrieving the next update to send, it is too late. The CLcache has been updated with a very old CSN (likely from the beginning of the CL) in csnBuf (602e4152000100030000):

```
[18/Feb/2021:13:45:22.690670576 +0100] - DEBUG - agmt="cn=002" (Host-001:39002) - clcache_adjust_anchorcsn - agmt="cn=002" (Host-001:39002) - (cscb 0 - state 0) - csnPrevMax (602e6129000000010000) csnMax (602e6129000000010000) csnBuf (602e4152000100030000) csnConsumerMax (602e59ec000000010000)
[18/Feb/2021:13:45:22.691393809 +0100] - DEBUG - agmt="cn=002" (Host-001:39002) - clcache_adjust_anchorcsn - agmt="cn=002" (Host-001:39002) - (cscb 1 - state 1) - csnPrevMax (602e4156000500030000) csnMax (602e4156000500030000) csnBuf (602e4152000100030000) csnConsumerMax (602e4156000500030000)
[18/Feb/2021:13:45:22.692264421 +0100] - DEBUG - agmt="cn=002" (Host-001:39002) - clcache_adjust_anchorcsn - agmt="cn=002" (Host-001:39002) - (cscb 2 - state 1) - csnPrevMax (602e415f000600040000) csnMax (602e415f000600040000) csnBuf (602e4152000100030000) csnConsumerMax (602e415f000600040000)
```

A consequence is that the next starting CSN will be 602e4152000100030000 instead of the successor of 602e59ec000000010000. This triggers the protective fix from https://bugzilla.redhat.com/show_bug.cgi?id=1898541, which prevents a jump too far into the past:

```
[18/Feb/2021:13:45:22.693006027 +0100] - DEBUG - agmt="cn=002" (Host-001:39002) - clcache_adjust_anchorcsn - anchor is now: 602e4152000100030000
[18/Feb/2021:13:45:22.693678474 +0100] - DEBUG - agmt="cn=002" (Host-001:39002) - clcache_load_buffer - bulk load cursor (602e4152000100030000) is lower than starting csn 602e59ec000000010000. Ending session.
[18/Feb/2021:13:45:22.694409920 +0100] - DEBUG - agmt="cn=002" (Host-001:39002) - clcache_load_buffer - rc=-30988
...
```

Further replication sessions inherit the "broken" csnBuf in the CLcache, and even if they select the right anchor they will pick updates from far in the past:

```
[18/Feb/2021:13:50:22.920756260 +0100] - DEBUG - agmt="cn=002" (Host-001:39002) - clcache_initial_anchorcsn - agmt="cn=002" (Host-001:39002) - (cscb 0 - state 0) - csnPrevMax () csnMax (602e6129000000010000) csnBuf (602e4152000100030000) csnConsumerMax (602e59ec000000010000)
[18/Feb/2021:13:50:22.921666429 +0100] - DEBUG - agmt="cn=002" (Host-001:39002) - clcache_initial_anchorcsn - agmt="cn=002" (Host-001:39002) - (cscb 1 - state 1) - csnPrevMax () csnMax (602e4156000500030000) csnBuf (602e4152000100030000) csnConsumerMax (602e4156000500030000)
[18/Feb/2021:13:50:22.922543181 +0100] - DEBUG - agmt="cn=002" (Host-001:39002) - clcache_initial_anchorcsn - agmt="cn=002" (Host-001:39002) - (cscb 2 - state 1) - csnPrevMax () csnMax (602e415f000600040000) csnBuf (602e4152000100030000) csnConsumerMax (602e415f000600040000)
[18/Feb/2021:13:50:22.923573600 +0100] - DEBUG - clcache_initial_anchorcsn - anchor is now: 602e59ec000000010000
```

The conditions to trigger this bug are that the changelog contains big changes (more than 1 MB) and that, during a bulk load, the first CSN of the bulk belongs to a big change.

Reproducer:

To help with the diagnostic I used a modified DS version:

```diff
diff --git a/ldap/servers/plugins/replication/cl5_clcache.c b/ldap/servers/plugins/replication/cl5_clcache.c
index 5eea6e9c6..5e2c23ec1 100644
--- a/ldap/servers/plugins/replication/cl5_clcache.c
+++ b/ldap/servers/plugins/replication/cl5_clcache.c
@@ -477,6 +477,10 @@ retry:
      * Continue if the error is no-mem since we don't need to
      * load in the key record anyway with DB_SET.
      */
+    if (rc == DB_BUFFER_SMALL) {
+        slapi_log_err(SLAPI_LOG_REPL, buf->buf_agmt_name,
+                      "clcache_load_buffer_bulk - c_get -> DB_BUFFER_SMALL for csn=%s\n", (char *)buf->buf_key.data);
+    }
     if (0 == rc || DB_BUFFER_SMALL == rc) {
         rc = clcache_cursor_get(cursor, buf, use_flag);
     }
diff --git a/ldap/servers/plugins/replication/cl5_clcache.c b/ldap/servers/plugins/replication/cl5_clcache.c
index aa3b2cb54..5eea6e9c6 100644
--- a/ldap/servers/plugins/replication/cl5_clcache.c
+++ b/ldap/servers/plugins/replication/cl5_clcache.c
@@ -193,7 +193,7 @@ clcache_set_config()
      * data buffer should be a multiple of 1024 bytes in size
      * for DB_MULTIPLE_KEY operation.
      */
-    _pool->pl_buffer_default_pages = CL5_DEFAULT_CONFIG_CACHEMEMSIZE / DEFAULT_CLC_BUFFER_PAGE_SIZE + 1;
+    _pool->pl_buffer_default_pages = 8;
     if (_pool->pl_buffer_default_pages <= 0) { /* this never be true... */
         _pool->pl_buffer_default_pages = DEFAULT_CLC_BUFFER_PAGE_COUNT;
     }
```

Then set up M1 <-> M2 and configure M1 and M2 to support large updates (maxbersize: 2147483647). Provision M1 with many small updates:

```
modrate --hostname localhost --port 39001 --bindDN 'cn=Directory Manager' --bindPassword xxx --entryDN "uid=mmrepl_test,dc=example,dc=com" --attribute homedirectory --valueLength 12 --numThreads 10
```

Provision many big updates:

```
dn: uid=mmrepl_test,dc=example,dc=com
changetype: modify
replace: homedirectory
homedirectory:< file:///tmp/10M.data
```

Here is a testcase that reproduces the CLcache breakage.

Preparation of the env:

- Set up M1 <-> M2.
- Enable debug replication logging.
- Configure M1/M2 with nsslapd-maxbersize: 2147483647.
- Restart M1/M2.
- Apply 2000+ small MODs:

```
./ldapbench/bin/modrate --hostname localhost --port 39001 --bindDN 'cn=Directory Manager' --bindPassword password --entryDN "uid=mmrepl_test,dc=example,dc=com" --attribute homedirectory --valueLength 12 --numThreads 10
```

- After a delay, check that M1/M2 are in sync.
- Check that the CLcache is not broken on M1:

```
grep clcache_adjust_anchorcsn ~/install_master/var/log/dirsrv/slapd-master1/errors | grep 'cscb 0 - state 0' | tail -1
```

csnBuf and csnConsumerMax should be in sync, for example:

```
DEBUG - agmt="cn=002" (Host-003:39002) - clcache_adjust_anchorcsn - agmt="cn=002" (Host-003:39002) - (cscb 0 - state 0) -
csnPrevMax (602f88fb001600010000) csnMax (602f88fb001700010000) csnBuf (602f88fb001600010000) csnConsumerMax (602f88fb001600010000)
```

I had to run the following loop twice to hit the bug, but it could be valid to run it 5-10 times to confirm the bug is fixed.

retry:

- Keep big updates local to M1: stop M2, start M1.
- Apply 10 big MODs on M1:

```
dd if=/dev/zero of=/tmp/100M.data bs=1 count=0 seek=100M
ldapmodify -h localhost -p 39001 -D 'cn=Directory Manager' -w password -f /tmp/big_update
```

where /tmp/big_update contains:

```
dn: uid=mmrepl_test,dc=example,dc=com
changetype: modify
replace: homedirectory
homedirectory:< file:///tmp/100M.data
```

- Export the CL on M1:

```
dsconf -D "cn=directory manager" master1 replication export-changelog default --replica-root 'dc=example,dc=com'
```

- Just to accelerate M1->M2 replication: stop M1, start M2, start M1.
- Wait a delay so that M1->M2 replication occurs (can be checked by looking for 'csn=' in the access logs).

Now the verification step:

```
grep clcache_adjust_anchorcsn M1.errors | grep 'cscb 0 - state 0' | tail -5
```

In my test, csnBuf and csnConsumerMax were in sync only in the first loop. If csnBuf and csnConsumerMax are in sync, goto retry.

Once csnBuf and csnConsumerMax are not in sync, check that csnBuf did a jump to the beginning of the CL:

```
grep clcache_adjust_anchorcsn M1.errors | grep 'cscb 0 - state 0' | tail -5
# get csnBuf -> <CSN_buf>
grep -n '^csn' <changelog_dump> | wc -l
# --> nb_rec: the number of records in the changelog
grep -n '^csn' <changelog_dump> | grep <CSN_buf>
# --> csnbuf_rec: the record loaded in the CLcache
```

The bug is hit if 'csnbuf_rec' is from the beginning of the CL.
The end of the CL is 'nb_rec' (~ 2000+). In my tests 'csnbuf_rec' was usually '182'.

Fix is pushed upstream => POST

Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory (389-ds-base bug fix and enhancement update), and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2021:0868

Steps to verify whether you hit the bug:

- Enable replication logging on the sender side (the host that is slow to send updates).
- Run:

```
grep clcache_adjust_anchorcsn /var/log/dirsrv/slapd-<host>/errors | grep 'cscb 0 - state 0'
...
clcache_adjust_anchorcsn - agmt="cn=002"... csnBuf (xxxx) csnConsumerMax ...
```

'xxxx' is the position of the CL record loaded by the replication agreement. If 'xxxx' is very old (close to the oldest record in the CL), then you likely hit this bug.