Bug 1930231
| Summary: | Large updates can reset the CLcache to the beginning of the changelog | | |
|---|---|---|---|
| Product: | Red Hat Enterprise Linux 7 | Reporter: | thierry bordaz <tbordaz> |
| Component: | 389-ds-base | Assignee: | thierry bordaz <tbordaz> |
| Status: | CLOSED ERRATA | QA Contact: | RHDS QE <ds-qe-bugs> |
| Severity: | unspecified | Docs Contact: | |
| Priority: | unspecified | | |
| Version: | 7.9 | CC: | aadhikar, bsmejkal, cobrown, kbanerje, ldap-maint, mreynolds, mrhodes, msauton, rcain, sgouvern, tmihinto, wwinter |
| Target Milestone: | rc | Keywords: | TestCannotAutomate |
| Target Release: | 7.9 | | |
| Hardware: | Unspecified | | |
| OS: | Unspecified | | |
| Whiteboard: | sync-to-jira | | |
| Fixed In Version: | 389-ds-base-1.3.10.2-10.el7_9 | Doc Type: | If docs needed, set a value |
| Doc Text: | | Story Points: | --- |
| Clone Of: | | | |
| : | 1972590 (view as bug list) | Environment: | |
| Last Closed: | 2021-03-16 13:57:39 UTC | Type: | Bug |
| Regression: | --- | Mount Type: | --- |
| Documentation: | --- | CRM: | |
| Verified Versions: | | Category: | --- |
| oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: | |
| Cloudforms Team: | --- | Target Upstream Version: | |
| Embargoed: | | | |
| Bug Depends On: | | | |
| Bug Blocks: | 1972590 | | |
Description
thierry bordaz
2021-02-18 14:40:06 UTC
The problem signature is this message:

```
clcache_load_buffer - bulk load cursor (602e4152000100030000) is lower than starting csn 602e59ec000000010000. Ending session.
```

This message was introduced by bug #1898541 as a safety check that a replication agreement does not go too far back into the past.

The bug can be diagnosed with replication debug logging enabled on the supplier side:

```
[18/Feb/2021:13:43:03.631942523 +0100] - INFO - slapd_daemon - slapd started. Listening on All Interfaces port 39001 for LDAP requests
...
#
# A first session retrieves 602e59ec000000010000.
# Note that csnBuf (the head CSN of the bulk load) is 00000000000000000000.
#
[18/Feb/2021:13:45:22.676742131 +0100] - DEBUG - agmt="cn=002" (Host-001:39002) - clcache_initial_anchorcsn - agmt="cn=002" (Host-001:39002) - (cscb 0 - state 0) - csnPrevMax () csnMax (602e6129000000010000) csnBuf (00000000000000000000) csnConsumerMax (602e59ec000000010000)
[18/Feb/2021:13:45:22.677518575 +0100] - DEBUG - agmt="cn=002" (Host-001:39002) - clcache_initial_anchorcsn - agmt="cn=002" (Host-001:39002) - (cscb 1 - state 1) - csnPrevMax () csnMax (602e4156000500030000) csnBuf (00000000000000000000) csnConsumerMax (602e4156000500030000)
[18/Feb/2021:13:45:22.678204233 +0100] - DEBUG - agmt="cn=002" (Host-001:39002) - clcache_initial_anchorcsn - agmt="cn=002" (Host-001:39002) - (cscb 2 - state 1) - csnPrevMax () csnMax (602e415f000600040000) csnBuf (00000000000000000000) csnConsumerMax (602e415f000600040000)
#
# The initial phase that creates the iterator retrieves the valid starting (aka anchor) CSN 602e59ec000000010000.
#
[18/Feb/2021:13:45:22.679083870 +0100] - DEBUG - clcache_initial_anchorcsn - anchor is now: 602e59ec000000010000
#
# Because of the crafted small (8K) CLcache buffer, setting the cursor (to 602e59ec000000010000)
# reports DB_BUFFER_SMALL.
```
I think this is the culprit of the bug: the cursor is then *not* initialized to 602e59ec000000010000. When the next update is retrieved with DB_NEXT on a *not* initialized cursor, it loads the CSN from the beginning of the CL:

```
[18/Feb/2021:13:45:22.679804365 +0100] - DEBUG - agmt="cn=002" (Host-001:39002) - clcache_load_buffer_bulk - c_get -> DB_BUFFER_SMALL for csn=602e59ec000000010000
...
```

Later, when retrieving the next update to send, it is too late. The CLcache has been updated with a very old CSN (likely from the beginning of the CL) in csnBuf (602e4152000100030000):

```
[18/Feb/2021:13:45:22.690670576 +0100] - DEBUG - agmt="cn=002" (Host-001:39002) - clcache_adjust_anchorcsn - agmt="cn=002" (Host-001:39002) - (cscb 0 - state 0) - csnPrevMax (602e6129000000010000) csnMax (602e6129000000010000) csnBuf (602e4152000100030000) csnConsumerMax (602e59ec000000010000)
[18/Feb/2021:13:45:22.691393809 +0100] - DEBUG - agmt="cn=002" (Host-001:39002) - clcache_adjust_anchorcsn - agmt="cn=002" (Host-001:39002) - (cscb 1 - state 1) - csnPrevMax (602e4156000500030000) csnMax (602e4156000500030000) csnBuf (602e4152000100030000) csnConsumerMax (602e4156000500030000)
[18/Feb/2021:13:45:22.692264421 +0100] - DEBUG - agmt="cn=002" (Host-001:39002) - clcache_adjust_anchorcsn - agmt="cn=002" (Host-001:39002) - (cscb 2 - state 1) - csnPrevMax (602e415f000600040000) csnMax (602e415f000600040000) csnBuf (602e4152000100030000) csnConsumerMax (602e415f000600040000)
```

A consequence is that the next starting CSN will be 602e4152000100030000 instead of the successor of 602e59ec000000010000. This triggers the protective fix from https://bugzilla.redhat.com/show_bug.cgi?id=1898541, which prevents a jump too far into the past:

```
[18/Feb/2021:13:45:22.693006027 +0100] - DEBUG - agmt="cn=002" (Host-001:39002) - clcache_adjust_anchorcsn - anchor is now: 602e4152000100030000
[18/Feb/2021:13:45:22.693678474 +0100] - DEBUG - agmt="cn=002" (Host-001:39002) - clcache_load_buffer - bulk load cursor (602e4152000100030000) is lower than starting csn 602e59ec000000010000. Ending session.
[18/Feb/2021:13:45:22.694409920 +0100] - DEBUG - agmt="cn=002" (Host-001:39002) - clcache_load_buffer - rc=-30988
...
```

Further replication sessions inherit the "broken" csnBuf in the CLcache, and even if they select the right anchor they will pick updates from far in the past:

```
[18/Feb/2021:13:50:22.920756260 +0100] - DEBUG - agmt="cn=002" (Host-001:39002) - clcache_initial_anchorcsn - agmt="cn=002" (Host-001:39002) - (cscb 0 - state 0) - csnPrevMax () csnMax (602e6129000000010000) csnBuf (602e4152000100030000) csnConsumerMax (602e59ec000000010000)
[18/Feb/2021:13:50:22.921666429 +0100] - DEBUG - agmt="cn=002" (Host-001:39002) - clcache_initial_anchorcsn - agmt="cn=002" (Host-001:39002) - (cscb 1 - state 1) - csnPrevMax () csnMax (602e4156000500030000) csnBuf (602e4152000100030000) csnConsumerMax (602e4156000500030000)
[18/Feb/2021:13:50:22.922543181 +0100] - DEBUG - agmt="cn=002" (Host-001:39002) - clcache_initial_anchorcsn - agmt="cn=002" (Host-001:39002) - (cscb 2 - state 1) - csnPrevMax () csnMax (602e415f000600040000) csnBuf (602e4152000100030000) csnConsumerMax (602e415f000600040000)
[18/Feb/2021:13:50:22.923573600 +0100] - DEBUG - clcache_initial_anchorcsn - anchor is now: 602e59ec000000010000
```

The conditions to trigger this bug are that the changelog contains big changes (more than 1 MB) and that, during a bulk load, the first CSN of the bulk belongs to a big change.

Reproducer:

To help with the diagnostic I used a modified DS version:

```diff
diff --git a/ldap/servers/plugins/replication/cl5_clcache.c b/ldap/servers/plugins/replication/cl5_clcache.c
index 5eea6e9c6..5e2c23ec1 100644
--- a/ldap/servers/plugins/replication/cl5_clcache.c
+++ b/ldap/servers/plugins/replication/cl5_clcache.c
@@ -477,6 +477,10 @@ retry:
      * Continue if the error is no-mem since we don't need to
      * load in the key record anyway with DB_SET.
      */
+    if (rc == DB_BUFFER_SMALL) {
+        slapi_log_err(SLAPI_LOG_REPL, buf->buf_agmt_name,
+                      "clcache_load_buffer_bulk - c_get -> DB_BUFFER_SMALL for csn=%s\n", (char *)buf->buf_key.data);
+    }
     if (0 == rc || DB_BUFFER_SMALL == rc) {
         rc = clcache_cursor_get(cursor, buf, use_flag);
     }
diff --git a/ldap/servers/plugins/replication/cl5_clcache.c b/ldap/servers/plugins/replication/cl5_clcache.c
index aa3b2cb54..5eea6e9c6 100644
--- a/ldap/servers/plugins/replication/cl5_clcache.c
+++ b/ldap/servers/plugins/replication/cl5_clcache.c
@@ -193,7 +193,7 @@ clcache_set_config()
      * data buffer should be a multiple of 1024 bytes in size
      * for DB_MULTIPLE_KEY operation.
      */
-    _pool->pl_buffer_default_pages = CL5_DEFAULT_CONFIG_CACHEMEMSIZE / DEFAULT_CLC_BUFFER_PAGE_SIZE + 1;
+    _pool->pl_buffer_default_pages = 8;
     if (_pool->pl_buffer_default_pages <= 0) { /* this never be true... */
         _pool->pl_buffer_default_pages = DEFAULT_CLC_BUFFER_PAGE_COUNT;
     }
```

Then set up M1 <-> M2 and configure M1 and M2 to support large updates (maxbersize: 2147483647). Provision M1 with many small updates:

```
modrate --hostname localhost --port 39001 --bindDN 'cn=Directory Manager' --bindPassword xxx --entryDN "uid=mmrepl_test,dc=example,dc=com" --attribute homedirectory --valueLength 12 --numThreads 10
```

Provision many big updates:

```
dn: uid=mmrepl_test,dc=example,dc=com
changetype: modify
replace: homedirectory
homedirectory:< file:///tmp/10M.data
```

Here is a testcase that reproduces the CLcache breakage.

Preparation of the env:

- Set up M1 <-> M2.
- Enable debug replication logging.
- Configure M1/M2 with nsslapd-maxbersize: 2147483647.
- Restart M1/M2.
- Apply 2000+ small MODs:

```
./ldapbench/bin/modrate --hostname localhost --port 39001 --bindDN 'cn=Directory Manager' --bindPassword password --entryDN "uid=mmrepl_test,dc=example,dc=com" --attribute homedirectory --valueLength 12 --numThreads 10
```

- After a delay, check that M1/M2 are in sync.
- Check that the CLcache is not broken on M1:

```
grep clcache_adjust_anchorcsn ~/install_master/var/log/dirsrv/slapd-master1/errors | grep 'cscb 0 - state 0' | tail -1
```

csnBuf and csnConsumerMax should be in sync, for example:

```
DEBUG - agmt="cn=002" (Host-003:39002) - clcache_adjust_anchorcsn - agmt="cn=002" (Host-003:39002) - (cscb 0 - state 0) -
csnPrevMax (602f88fb001600010000) csnMax (602f88fb001700010000) csnBuf (602f88fb001600010000) csnConsumerMax (602f88fb001600010000)
```

I had to run the following loop twice to hit the bug, but it could be valid to run it 5-10 times to confirm the bug is fixed.

retry:

- Keep big updates local to M1: stop M2, start M1.
- Apply 10 big MODs on M1:

```
dd if=/dev/zero of=/tmp/100M.data bs=1 count=0 seek=100M
ldapmodify -h localhost -p 39001 -D 'cn=Directory Manager' -w password -f /tmp/big_update
```

where /tmp/big_update contains:

```
dn: uid=mmrepl_test,dc=example,dc=com
changetype: modify
replace: homedirectory
homedirectory:< file:///tmp/100M.data
```

- Export the CL on M1:

```
dsconf -D "cn=directory manager" master1 replication export-changelog default --replica-root 'dc=example,dc=com'
```

- Just to accelerate M1->M2 replication: stop M1, start M2, start M1.
- Wait a delay so that M1->M2 replication occurs (can be checked by looking for 'csn=' in the access logs).

Now the verification step:

```
grep clcache_adjust_anchorcsn M1.errors | grep 'cscb 0 - state 0' | tail -5
```

In my test, csnBuf and csnConsumerMax were in sync only in the first loop. If csnBuf and csnConsumerMax are in sync, goto retry.

Once csnBuf and csnConsumerMax are not in sync, check that csnBuf did a jump to the beginning of the CL:

```
grep clcache_adjust_anchorcsn M1.errors | grep 'cscb 0 - state 0' | tail -5
# get csnBuf -> <CSN_buf>
grep -n '^csn' <changelog_dump> | wc -l
# --> nb_rec: the number of records in the changelog
grep -n '^csn' <changelog_dump> | grep <CSN_buf>
# --> csnbuf_rec: the record loaded in the CLcache
```

The bug is hit if 'csnbuf_rec' is from the beginning of the CL.
The end of the CL is 'nb_rec' (~ 2000+). In my tests 'csnbuf_rec' was usually '182'.

Fix is pushed upstream => POST

Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory (389-ds-base bug fix and enhancement update), and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2021:0868

Steps to verify whether you hit the bug:

- Enable replication logging on the sender side (the host that is slow to send updates).
- Run:

```
grep clcache_adjust_anchorcsn /var/log/dirsrv/slapd-<host>/errors | grep 'cscb 0 - state 0'
...
clcache_adjust_anchorcsn - agmt="cn=002"... csnBuf (xxxx) csnConsumerMax ...
```

'xxxx' is the position of the CL record loaded by the replication agreement. If 'xxxx' is very old (close to the oldest record in the CL), then you likely hit this bug.