Bug 1192099
Summary: | IPA server replication broken, after DS stop-start, due to changelog reset | |
---|---|---|---
Product: | Red Hat Enterprise Linux 7 | Reporter: | Scott Poore <spoore>
Component: | 389-ds-base | Assignee: | Ludwig <lkrispen>
Status: | CLOSED ERRATA | QA Contact: | Viktor Ashirov <vashirov>
Severity: | urgent | Docs Contact: |
Priority: | urgent | |
Version: | 7.1 | CC: | amsharma, gparente, lkrispen, mkosek, mreynolds, msauton, nkinder, rmeggins, sauchter, spoore, tbordaz, tmihinto
Target Milestone: | rc | Keywords: | ZStream
Target Release: | --- | |
Hardware: | All | |
OS: | Linux | |
Whiteboard: | | |
Fixed In Version: | 389-ds-base-1.3.7.5-10.el7 | Doc Type: | If docs needed, set a value
Doc Text: | | Story Points: | ---
Clone Of: | | |
: | 1523507 (view as bug list) | Environment: |
Last Closed: | 2018-04-10 14:15:15 UTC | Type: | Bug
Regression: | --- | Mount Type: | ---
Documentation: | --- | CRM: |
Verified Versions: | | Category: | ---
oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: |
Cloudforms Team: | --- | Target Upstream Version: |
Embargoed: | | |
Bug Depends On: | | |
Bug Blocks: | 1523507 | |
Description
Scott Poore
2015-02-12 15:52:30 UTC
Hi Noriko, thank you so much for your update; DB_NOTFOUND does look much more familiar than DB_DELETE.

The missing update is:

[10/Feb/2015:21:17:55 -0500] conn=2 op=900 MOD dn="cn=sudocmdgroups,cn=sudo,dc=testrelm,dc=test"
[10/Feb/2015:21:17:55 -0500] conn=Internal op=-1 SRCH base="cn=sudocmdgroups,cn=sudo,dc=testrelm,dc=test" scope=0 filter="(|(objectclass=*)(objectclass=ldapsubentry))" attrs=ALL
[10/Feb/2015:21:17:55 -0500] conn=Internal op=-1 RESULT err=0 tag=48 nentries=1 etime=0
[10/Feb/2015:21:17:56 -0500] conn=2 op=900 RESULT err=0 tag=103 nentries=0 etime=1 csn=54dabbd4001300040000

It was not found in the changelog:

[10/Feb/2015:21:20:23 -0500] NSMMReplicationPlugin - changelog program - agmt="cn=meTohp-dl380pgen8-02-vm-2.testrelm.test" (hp-dl380pgen8-02-vm-2:389): CSN 54dabbd4001300040000 not found and no purging, probably a reinit

There is no trace of an offline/online init in that period, so apart from changelog purging I see no reason why this update would be missing from the changelog.

Scott, I did not find the dse.ldif (master/replica) in the provided tarball. Thanks.

Thierry,

Unfortunately, those servers are no longer available. Their reservation expired and they were returned to the test server pool, so I cannot get the dse.ldif from them either. I don't know how reproducible this is; I think this is the first time we have seen it, so I don't expect it to be easy to reproduce.

Scott
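For reference, the triage above (confirming that a CSN was accepted locally but is absent from the changelog and was never replayed) can be scripted. This is a minimal sketch, not part of the original investigation; it assumes the default log locations under /var/log/dirsrv/slapd-<INSTANCE>/ and uses the CSN and instance name from this bug as examples:

    #!/bin/bash
    # Hypothetical helper: check where a given CSN shows up on a supplier.
    CSN=54dabbd4001300040000
    INST=TESTRELM-TEST
    LOGDIR=/var/log/dirsrv/slapd-$INST

    # 1. Was the update accepted locally? (the RESULT access-log line carries the csn)
    grep "csn=$CSN" "$LOGDIR"/access*

    # 2. Did the changelog lose it? (same message as seen in this bug)
    grep "CSN $CSN not found and no purging" "$LOGDIR"/errors*

    # 3. Was it replayed to the consumer? (needs replication debug logging, level 8192)
    grep "successfully sent operation with csn $CSN" "$LOGDIR"/errors*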
Here is the current status
--------------------------

There are two issues.

The first one is an unexpected reset of the changelog. This reset is likely a bug. The consequence is that updates not yet replicated are definitively lost.

The second one is that the replication agreement, being unable to find the 'primary' min_ruv in the changelog, starts replication from a "secondary" min_ruv. This is possibly not a bug, but in the case where no other master knows the 'primary' min_ruv it may hide a fatal replication error. If it is not a bug, it should be logged at the appropriate level.

- First issue (unexpected reset of the changelog)

#
# Beginning of a valid replication session (master -> replica)
#
[10/Feb/2015:21:17:12 -0500] - _cl5PositionCursorForReplay ): Consumer RUV:
[10/Feb/2015:21:17:12 -0500] NSMMReplicationPlugin -: {replicageneration} 54dab6bf000000040000
[10/Feb/2015:21:17:12 -0500] NSMMReplicationPlugin - 54dab6c4000000030000 54dabba7000000030000 00000000
[10/Feb/2015:21:17:12 -0500] NSMMReplicationPlugin - 54dab705000000040000 54dabba7000500040000 00000000
[10/Feb/2015:21:17:12 -0500] - _cl5PositionCursorForReplay ): Supplier RUV:
[10/Feb/2015:21:17:12 -0500] NSMMReplicationPlugin -: {replicageneration} 54dab6bf000000040000
[10/Feb/2015:21:17:12 -0500] NSMMReplicationPlugin - 54dab705000000040000 54dabba9001700040000 54dabba8
[10/Feb/2015:21:17:12 -0500] NSMMReplicationPlugin - 54dab6c4000000030000 54daba2c000300030000 00000000
[10/Feb/2015:21:17:12 -0500] - clcache_get_buffer: found thread private buffer cache 7f30b800bf70
[10/Feb/2015:21:17:12 -0500] - clcache_get_buffer: _pool is 7f30df3913f0 _pool->pl_busy_lists is 7f30b81f37f0 _pool->pl_busy_lists->bl_buffers is 7f30b800bf70
[10/Feb/2015:21:17:12 -0500] - session start: anchorcsn=54dabba7000500040000

#
# DEL operation that will go missing
#
[10/Feb/2015:21:18:02 -0500] conn=2 op=1155 DEL dn="idnsname=ipaad2012r2.test,cn=dns,dc=testrelm,dc=test"
[10/Feb/2015:21:18:02 -0500] conn=2 op=1155 RESULT err=0 tag=107 nentries=0 etime=0 csn=54dabbdb001100040000
[10/Feb/2015:21:18:02 -0500] NSMMReplicationPlugin - ruv_add_csn_inprogress: successfully inserted csn 54dabbdb001100040000 into pending list
[10/Feb/2015:21:18:02 -0500] NSMMReplicationPlugin - changelog program - _cl5GetDBFileByReplicaName: found DB object 7fdb0ba3b930 for database /var/lib/dirsrv/slapd-TESTRELM-TEST/cldb/134ac903-b19111e4-9b16c67f-e9d60a91_54dab6bf000000040000.db
[10/Feb/2015:21:18:02 -0500] NSMMReplicationPlugin - changelog program - _cl5GetDBFileByReplicaName: found DB object 7fdb0ba3b930 for database /var/lib/dirsrv/slapd-TESTRELM-TEST/cldb/134ac903-b19111e4-9b16c67f-e9d60a91_54dab6bf000000040000.db
[10/Feb/2015:21:18:02 -0500] NSMMReplicationPlugin - ruv_update_ruv: successfully committed csn 54dabbdb001100040000

# RESTART
[10/Feb/2015:21:18:04 -0500] - slapd stopped.
[10/Feb/2015:21:18:08 -0500] - slapd started. Listening on All Interfaces port 389 for LDAP requests

#
# Replication session that did not send the missing update because of the restart
#
[10/Feb/2015:21:18:51 -0500] - _cl5PositionCursorForReplay ): Consumer RUV:
[10/Feb/2015:21:18:51 -0500] NSMMReplicationPlugin -: {replicageneration} 54dab6bf000000040000
[10/Feb/2015:21:18:51 -0500] NSMMReplicationPlugin - 54dab6c4000000030000 54dabc0b000000030000 00000000
[10/Feb/2015:21:18:51 -0500] NSMMReplicationPlugin - 54dab705000000040000 54dabba9001700040000 00000000
[10/Feb/2015:21:18:51 -0500] - _cl5PositionCursorForReplay ): Supplier RUV:
[10/Feb/2015:21:18:51 -0500] NSMMReplicationPlugin -: {replicageneration} 54dab6bf000000040000
[10/Feb/2015:21:18:51 -0500] NSMMReplicationPlugin - 54dab705000000040000 54dabc0c000000040000 54dabc0b
[10/Feb/2015:21:18:51 -0500] NSMMReplicationPlugin - 54dab6c4000000030000 54daba2c000300030000 00000000
[10/Feb/2015:21:18:51 -0500] - session start: anchorcsn=54dabba9001700040000

# Sends a lot of updates from 54dabba9001700040000 -> 54dabbd4001300040000
[10/Feb/2015:21:18:51 -0500] NSMMReplicationPlugin - agmt="cn=meTohp-dl380pgen8-02-vm-2.testrelm.test" (hp-dl380pgen8-02-vm-2:389): replay_update: Consumer successfully sent operation with csn 54dabbaa000000040000
...
[10/Feb/2015:21:19:42 -0500] NSMMReplicationPlugin - agmt="cn=meTohp-dl380pgen8-02-vm-2.testrelm.test" (hp-dl380pgen8-02-vm-2:389): replay_update: Consumer successfully sent operation with csn 54dabbd4001300040000
[10/Feb/2015:21:19:42 -0500] - repl5_inc_result_threadmain exiting << shutdown

# RESTART
[10/Feb/2015:21:19:44 -0500] - slapd stopped.
[10/Feb/2015:21:19:51 -0500] - slapd started. Listening on All Interfaces port 389 for LDAP requests

#
# the changelog is recreated (losing the update csn=54dabbdb001100040000)
#
[10/Feb/2015:21:19:51 -0500] NSMMReplicationPlugin - replica_check_for_data_reload: Warning: data for replica dc=testrelm,dc=test does not match the data in the changelog. Recreating the changelog file. This could affect replication with replica's consumers in which case the consumers should be reinitialized.

#
# replication session after the changelog recreation
# as we do not find 54dabbdb001100040000, we continue replication starting with 54daba2c000300030000
#
[10/Feb/2015:21:20:23 -0500] - _cl5PositionCursorForReplay ): Consumer RUV:
[10/Feb/2015:21:20:23 -0500] NSMMReplicationPlugin -: {replicageneration} 54dab6bf000000040000
[10/Feb/2015:21:20:23 -0500] NSMMReplicationPlugin - 54dab6c4000000030000 54dabc68000000030000 00000000
[10/Feb/2015:21:20:23 -0500] NSMMReplicationPlugin - 54dab705000000040000 54dabbd4001300040000 00000000
[10/Feb/2015:21:20:23 -0500] - _cl5PositionCursorForReplay ): Supplier RUV:
[10/Feb/2015:21:20:23 -0500] NSMMReplicationPlugin -: {replicageneration} 54dab6bf000000040000
[10/Feb/2015:21:20:23 -0500] NSMMReplicationPlugin - 54dab705000000040000 54dabc66000000040000 54dabc65
[10/Feb/2015:21:20:23 -0500] NSMMReplicationPlugin - 54dab6c4000000030000 54daba2c000300030000 00000000
[10/Feb/2015:21:20:23 -0500] - session start: anchorcsn=54dabbd4001300040000
[10/Feb/2015:21:20:23 -0500] agmt="cn=meTohp-dl380pgen8-02-vm-2.testrelm.test" (hp-dl380pgen8-02-vm-2:389) - clcache_load_buffer: rc=-30988
[10/Feb/2015:21:20:23 -0500] NSMMReplicationPlugin - changelog program - agmt="cn=meTohp-dl380pgen8-02-vm-2.testrelm.test" (hp-dl380pgen8-02-vm-2:389): CSN 54dabbd4001300040000 not found and no purging, probably a reinit
[10/Feb/2015:21:20:24 -0500] NSMMReplicationPlugin - changelog program - agmt="cn=meTohp-dl380pgen8-02-vm-2.testrelm.test" (hp-dl380pgen8-02-vm-2:389): Will try to use supplier min CSN 54daba2c000300030000 to load changelog
[10/Feb/2015:21:20:24 -0500] agmt="cn=meTohp-dl380pgen8-02-vm-2.testrelm.test" (hp-dl380pgen8-02-vm-2:389) - session start: anchorcsn=54daba2c000300030000

Here are the next steps
-----------------------
- identify why the changelog is recreated
- check whether continuing replication is valid (likely); in any case, the failure to retrieve the min_csn should be logged at the right level

The reason why the changelog gets recreated is possibly a bug in the DS shutdown.

Client updates:

# last update before restart
[10/Feb/2015:21:19:00 -0500] conn=8 op=4 MOD dn="fqdn=ipaqavmg.testrelm.test,cn=computers,cn=accounts,dc=testrelm,dc=test"
[10/Feb/2015:21:19:30 -0500] conn=8 op=4 RESULT err=0 tag=103 nentries=0 etime=30 csn=54dabc20000000040000
...

# first update after restart
[10/Feb/2015:21:19:53 -0500] conn=5 op=16 MOD dn="fqdn=ipaqavmg.testrelm.test,cn=computers,cn=accounts,dc=testrelm,dc=test"
[10/Feb/2015:21:19:53 -0500] conn=5 op=16 RESULT err=0 tag=103 nentries=0 etime=0 csn=54dabc4a000000040000

# last update (csn=54dabc20000000040000) before restart with associated internal updates
# the replica RUV was successfully updated
[10/Feb/2015:21:19:30 -0500] NSMMReplicationPlugin - ruv_update_ruv: successfully committed csn 54dabc20000000040000
[10/Feb/2015:21:19:35 -0500] NSMMReplicationPlugin - ruv_update_ruv: successfully committed csn 54dabc33000000040000
[10/Feb/2015:21:19:37 -0500] NSMMReplicationPlugin - ruv_update_ruv: successfully committed csn 54dabc38000000040000
[10/Feb/2015:21:19:40 -0500] NSMMReplicationPlugin - ruv_update_ruv: successfully committed csn 54dabc3a000000040000

# The replica RUV (in memory) contains 54dabc3a000000040000 before the restart
# during shutdown the replica RUV is dumped and contains the last update
[10/Feb/2015:21:19:42 -0500] NSMMReplicationPlugin - Database RUV: {replicageneration} 54dab6bf000000040000
[10/Feb/2015:21:19:42 -0500] NSMMReplicationPlugin - Database RUV: {replica 4 ldap://ipaqavmg.testrelm.test:389} 54dab705000000040000 54dabc3a000000040000 54dabc3c
[10/Feb/2015:21:19:42 -0500] NSMMReplicationPlugin - Database RUV: {replica 3 ldap://hp-dl380pgen8-02-vm-2.testrelm.test:389} 54dab6c4000000030000 54daba2c000300030000 00000000
[10/Feb/2015:21:19:44 -0500] - slapd stopped.

# But after the restart this update is not present in the database RUV
[10/Feb/2015:21:19:49 -0500] - 389-Directory/1.3.3.1 B2015.041.149 starting up
[10/Feb/2015:21:19:51 -0500] changelog max RUV: {replicageneration} 54dab6bf000000040000
[10/Feb/2015:21:19:51 -0500] changelog max RUV: {replica 3} 54dab6c4000000030000 54daba2c000300030000 00000000
[10/Feb/2015:21:19:51 -0500] changelog max RUV: {replica 4} 54dab705000000040000 54dabc3a000000040000 00000000
[10/Feb/2015:21:19:51 -0500] database RUV: {replicageneration} 54dab6bf000000040000
[10/Feb/2015:21:19:51 -0500] database RUV: {replica 4 ldap://ipaqavmg.testrelm.test:389} 54dab705000000040000 54dabc38000000040000 00000000
[10/Feb/2015:21:19:51 -0500] database RUV: {replica 3 ldap://hp-dl380pgen8-02-vm-2.testrelm.test:389} 54dab6c4000000030000 54daba2c000300030000 00000000
[10/Feb/2015:21:19:51 -0500] - slapd started. Listening on All Interfaces port 389 for LDAP requests

In conclusion: it looks like the replica RUV is not flushed to the database during shutdown. I do not know whether the problem is related to the fact that the last update was an internal update. The most probable cause is a shutdown bug, where the flushing of the replica RUV (replica_flush) is not forced. I opened DS ticket https://fedorahosted.org/389/ticket/48118.

If the last update is processed less than 30 seconds before a regular DS shutdown, there is a possibility that the database RUV is behind the changelog RUV. At startup this will be seen as an error condition and the changelog will be rebuilt.
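A hedged way to watch for this window on a test instance is to apply a write, stop the instance immediately, and compare the RUVs that the server dumps at the next startup (these dumps were visible here because nsslapd-errorlog-level was set to 8192). This is a sketch only; the instance name and log path follow the defaults used in this setup:

    #!/bin/bash
    # Sketch: stop the instance right after a write and inspect the startup RUV dump.
    INST=TESTRELM-TEST
    ERRLOG=/var/log/dirsrv/slapd-$INST/errors

    # ... apply any single write to the replicated suffix here (ldapmodify/ldapadd),
    # then stop within the flush window:
    systemctl stop dirsrv@$INST
    systemctl start dirsrv@$INST

    # A database RUV that lagged behind the changelog RUV is followed by the
    # "Recreating the changelog" warning at startup.
    grep -E "changelog max RUV|database RUV|Recreating the changelog" "$ERRLOG" | tail -n 20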
I am trying to reproduce this. So far I have failed to reproduce on VMs (1 CPU Nehalem) as well as on bare metal. Scott, may I provide you with an automatic test to see whether you can reproduce it on one of your VMs? I'll send you the details for the VM.

Here is the current status:

- I have not been able to reproduce.
- As mentioned in https://bugzilla.redhat.com/show_bug.cgi?id=1192099#c8, before the shutdown the DB RUV contained 54dabc3a000000040000 and after the restart it contained 54dabc38000000040000.
- At shutdown, a hash table that contains the replicas is enumerated and each replica RUV is written back to disk (replica_destroy_hash_entry). According to the code and the logs, we should not be in a condition where the write-back is skipped.
- Shutdown occurred normally. In particular, multimaster_stop ran without problem, so the enumeration of the replica hash table should have occurred.
- There is no specific logging in the enumeration/RUV flush that would help to understand what was going on.

Here are the next steps:

- I need to build a debug version with more logging and try to reproduce.
- The possible causes I can think of are:
  - a bug in the hash enumeration that skipped a replica;
  - a corner case that reset replica_name, the dirty flag, or the csngen, but I do not see how that could occur.

OK, more details on the setup (a consolidated sketch of the shared debug-prep steps follows the Client section):

Master:
- add new hostname (host.testrelm.test) to /etc/hosts
- change hostname in /etc/hostname as well
- disable firewalld
- install 389-ds-base-debuginfo
- sysctl -w fs.suid_dumpable=1
- echo 'ulimit -c unlimited' >> /etc/sysconfig/dirsrv
- echo 'LimitCORE=infinity' >> /etc/sysconfig/dirsrv.systemd
- systemctl daemon-reload
- /usr/sbin/ipa-server-install -U --setup-dns --forwarder=<DNS_FORWARDER> --hostname=<HOSTNAME>.testrelm.test -r TESTRELM.TEST -n testrelm.test -p <PASSWORD> -P <PASSWORD> -a <PASSWORD>
- force Master and Replica IPs into /etc/resolv.conf
- set nsslapd-errorlog-level: 8192
- set nsslapd-accesslog-level: 260
- set nsslapd-sasl-max-buffer-size: 1048576
- ipa config-mod --searchtimelimit=100
- ipactl stop; ipactl start

Replica:
- add new hostname (host.testrelm.test) to /etc/hosts
- change hostname in /etc/hostname as well
- force Master and Replica IPs into /etc/resolv.conf
- disable firewalld
- install 389-ds-base-debuginfo
- sysctl -w fs.suid_dumpable=1
- echo 'ulimit -c unlimited' >> /etc/sysconfig/dirsrv
- echo 'LimitCORE=infinity' >> /etc/sysconfig/dirsrv.systemd
- systemctl daemon-reload
- on MASTER run: ipa-replica-prepare -p <PASSWORD> --ip-address=<REPLICA_IP> --reverse-zone=<REPLICA_REVERSE_ZONE> <REPLICA_NAME>.testrelm.test
- copy /var/lib/ipa/replica-info-<REPLICA_NAME>.testrelm.test.gpg to REPLICA:/var/lib/ipa
- /usr/sbin/ipa-replica-install -U --setup-ca --setup-dns --forwarder=<DNS_FORWARDER> -w <PASSWORD> -p <PASSWORD> /opt/rhqa_ipa/replica-info-<REPLICA_NAME>.testrelm.test.gpg
- ipactl stop; ipactl start
- force Master and Replica IPs into /etc/resolv.conf
- set nsslapd-errorlog-level: 8192
- set nsslapd-accesslog-level: 260
- set nsslapd-sasl-max-buffer-size: 1048576
- ipactl stop; ipactl start

Client:
- add new hostname (host.testrelm.test) to /etc/hosts
- change hostname in /etc/hostname as well
- force Master and Replica IPs into /etc/resolv.conf
- disable firewalld
- install 389-ds-base-debuginfo
- sysctl -w fs.suid_dumpable=1
- echo 'ulimit -c unlimited' >> /etc/sysconfig/dirsrv
- echo 'LimitCORE=infinity' >> /etc/sysconfig/dirsrv.systemd
- systemctl daemon-reload
- /usr/sbin/ipa-client-install -d -U --domain=testrelm.test --realm=TESTRELM.TEST -p admin -w <PASSWORD> --server=<MASTER_NAME>.testrelm.test

Not quite sure why I left the 389 steps in for the client, but since I did, I listed them.
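The core-dump and logging preparation above is the same on all three hosts. A minimal consolidated sketch follows; the instance name is illustrative, and applying the "set nsslapd-..." steps via ldapmodify against cn=config is one possible way (the original steps do not say which mechanism was used):

    #!/bin/bash
    # Hedged sketch of the shared debug-prep steps.
    set -e
    sysctl -w fs.suid_dumpable=1
    echo 'ulimit -c unlimited' >> /etc/sysconfig/dirsrv
    echo 'LimitCORE=infinity'  >> /etc/sysconfig/dirsrv.systemd
    systemctl daemon-reload

    # Raise DS logging (replication debug + internal-op access logging),
    # mirroring the "set nsslapd-*" steps in the lists above.
    ldapmodify -x -h localhost -p 389 -D "cn=directory manager" -w <PASSWORD> << EOF
    dn: cn=config
    changetype: modify
    replace: nsslapd-errorlog-level
    nsslapd-errorlog-level: 8192
    -
    replace: nsslapd-accesslog-level
    nsslapd-accesslog-level: 260
    -
    replace: nsslapd-sasl-max-buffer-size
    nsslapd-sasl-max-buffer-size: 1048576
    EOF

    ipactl stop; ipactl start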
Scott,

It has been a long time and I gave no feedback. I applied the test case (thanks again) but was not able to reproduce the problem. To make progress on this bug, I need to be able to reproduce with a debug version (to detect, at shutdown, why the DB RUV is not written back). Would you help me try to reproduce? The test case from https://bugzilla.redhat.com/show_bug.cgi?id=1192099#c15 would need to be run in a loop, as the problem is timing dependent.

thanks
thierry

Thierry,

Running that in a loop will require additional uninstall steps. In the past this has proved difficult for some versions of IPA/RHEL when re-installing afterwards. It may be easier for me to just run many jobs to catch this. Would you be able to provide me with a debug version I can put in a custom repo for this? Or I can try to work out a way for you to run my exact tests on your VMs. Would that help?

This bug is triggered by the fact that the last update CSN (54dabc3a000000040000 in #8) was not written (RUV) to the suffix database during shutdown. The bug described in http://post-office.corp.redhat.com/archives/ipa-and-samba-team-list/2015-July/msg00194.html / https://fedorahosted.org/freeipa/ticket/5112 is that the changelog RUV was not written to the changelog database during shutdown. Both bugs relate to RUV write-back at shutdown, so they could be related, but I do not see an immediate relationship between the two. In particular, with the current bug, the comparison that triggered the changelog recreation is done on a RID that was not cleaned.

This update is related to the last run, where the problem happened again while internal operations were being logged.

Similar symptom of the missing CSN:

# 583f1cb2000300040000 was successful and went into the in-memory RUV
[01/Dec/2016:00:08:42.249813656 +051800] NSMMReplicationPlugin - ruv_add_csn_inprogress: successfully inserted csn 583f1cb2000300040000 into pending list
[01/Dec/2016:00:08:42.441151046 +051800] NSMMReplicationPlugin - ruv_update_ruv: successfully committed csn 583f1cb2000300040000

# 583f1cb3000000040000 was successful and went into the in-memory RUV
# last operation before shutdown
[01/Dec/2016:00:08:42.493242558 +051800] conn=6 op=819 ADD dn="idnsserverid=vm-idm-005.testrelm.test,cn=servers,cn=dns,dc=testrelm,dc=test"
[01/Dec/2016:00:08:42.641521991 +051800] NSMMReplicationPlugin - ruv_add_csn_inprogress: successfully inserted csn 583f1cb3000000040000 into pending list
[01/Dec/2016:00:08:42.738379945 +051800] NSMMReplicationPlugin - ruv_update_ruv: successfully committed csn 583f1cb3000000040000
[01/Dec/2016:00:08:42.751134933 +051800] conn=6 op=819 RESULT err=0 tag=105 nentries=0 etime=0 csn=583f1cb3000000040000

# Shutdown
[01/Dec/2016:00:08:43.017988179 +051800] slapd shutting down - signaling operation threads - op stack size 4 max work q size 2 max work q stack size 2
[01/Dec/2016:00:08:43.024334224 +051800] slapd shutting down - waiting for 29 threads to terminate
[01/Dec/2016:00:08:43.029683669 +051800] slapd shutting down - closing down internal subsystems and plugins

# dump of the in-memory DB RUV showing that 583f1cb3000000040000 was in the in-memory RUV
[01/Dec/2016:00:08:44.286974135 +051800] NSMMReplicationPlugin - Database RUV: {replicageneration} 583f12c9000000040000
[01/Dec/2016:00:08:44.298360539 +051800] NSMMReplicationPlugin - Database RUV: {replica 4 ldap://vm-idm-005.testrelm.test:389} 583f12cb000000040000 583f1cb3000000040000 583f1cb2
[01/Dec/2016:00:08:44.304101449 +051800] NSMMReplicationPlugin - Database RUV: {replica 3 ldap://vm-idm-008.testrelm.test:389} 583f12d3000000030000 583f1a91000300030000 00000000

# But after the restart the DB RUV does not contain 583f1cb3000000040000 and "rolls back" to 583f1cb2000300040000
changelog max RUV: {replicageneration} 583f12c9000000040000
changelog max RUV: {replica 4} 583f12cb000000040000 583f1cb3000000040000 00000000
changelog max RUV: {replica 3} 583f12d3000000030000 583f1a91000300030000 00000000
database RUV: {replicageneration} 583f12c9000000040000
database RUV: {replica 4 ldap://vm-idm-005.testrelm.test:389} 583f12cb000000040000 583f1cb2000300040000 00000000
database RUV: {replica 3 ldap://vm-idm-008.testrelm.test:389} 583f12d3000000030000 583f1a91000300030000 00000000

[01/Dec/2016:00:08:48.373284220 +051800] NSMMReplicationPlugin - ruv_compare_ruv: the max CSN [583f1cb2000300040000] from RUV [database RUV] is less than or equal to the max CSN [583f1cb3000000040000] from RUV [changelog max RUV] for element [{replica 4 ldap://vm-idm-005.testrelm.test:389} 583f12cb000000040000 583f1cb2000300040000]
[01/Dec/2016:00:08:48.375940973 +051800] NSMMReplicationPlugin - ruv_compare_ruv: the max CSN [583f1a91000300030000] from RUV [database RUV] is less than or equal to the max CSN [583f1a91000300030000] from RUV [changelog max RUV] for element [{replica 3 ldap://vm-idm-008.testrelm.test:389} 583f12d3000000030000 583f1a91000300030000]
[01/Dec/2016:00:08:48.384849429 +051800] NSMMReplicationPlugin - ruv_compare_ruv: the max CSN [583f1cb3000000040000] from RUV [changelog max RUV] is larger than the max CSN [583f1cb2000300040000] from RUV [database RUV] for element [{replica 4} 583f12cb000000040000 583f1cb3000000040000]
[01/Dec/2016:00:08:48.390535115 +051800] NSMMReplicationPlugin - ruv_compare_ruv: the max CSN [583f1a91000300030000] from RUV [changelog max RUV] is less than or equal to the max CSN [583f1a91000300030000] from RUV [database RUV] for element [{replica 3} 583f12d3000000030000 583f1a91000300030000]

# That triggered the changelog recreation
[01/Dec/2016:00:08:48.401903812 +051800] NSMMReplicationPlugin - replica_check_for_data_reload: Warning: data for replica dc=testrelm,dc=test does not match the data in the changelog. Recreating the changelog file. This could affect replication with replica's consumers in which case the consumers should be reinitialized.
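As a side note, the "database RUV" that ruv_compare_ruv checks at startup is the RUV persisted in the suffix. While the server is running it can be inspected directly by reading the RUV tombstone entry; a hedged example against the suffix used here (host, port, and credentials are placeholders):

    # Read the persisted database RUV (the RUV tombstone entry of the suffix).
    ldapsearch -x -h localhost -p 389 -D "cn=directory manager" -w <PASSWORD> \
      -b "dc=testrelm,dc=test" \
      "(&(objectclass=nstombstone)(nsuniqueid=ffffffff-ffffffff-ffffffff-ffffffff))" nsds50ruv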
What the internal operations logging is showing: the logging of internal operations shows that the update of the RUV (during shutdown) occurred and completed.

[01/Dec/2016:00:08:42.493242558 +051800] conn=6 op=819 ADD dn="idnsserverid=vm-idm-005.testrelm.test,cn=servers,cn=dns,dc=testrelm,dc=test"
[01/Dec/2016:00:08:42.582616273 +051800] conn=Internal op=-1 MOD dn="dc=testrelm,dc=test" <<< RUV update
[01/Dec/2016:00:08:42.656846289 +051800] conn=Internal op=-1 ADD dn="changenumber=1223,cn=changelog"
[01/Dec/2016:00:08:42.693464205 +051800] conn=Internal op=-1 RESULT err=0 tag=48 nentries=0 etime=0
[01/Dec/2016:00:08:42.751134933 +051800] conn=6 op=819 RESULT err=0 tag=105 nentries=0 etime=0 csn=583f1cb3000000040000
[01/Dec/2016:00:08:42.752813923 +051800] conn=6 op=820 SRCH base="idnsserverid=vm-idm-005.testrelm.test,cn=servers,cn=dns,dc=testrelm,dc=test" scope=0 filter="(objectClass=*)" attrs="idnsForwardPolicy idnsServerId idnsForwarders idnsServerId objectClass idnsSOAmName"
[01/Dec/2016:00:08:42.753043302 +051800] conn=6 op=820 RESULT err=0 tag=101 nentries=1 etime=0
[01/Dec/2016:00:08:42.757724587 +051800] conn=Internal op=-1 RESULT err=0 tag=48 nentries=0 etime=0 <<< RUV update result
[01/Dec/2016:00:08:42.766283948 +051800] conn=6 op=821 UNBIND
[01/Dec/2016:00:08:42.766348052 +051800] conn=6 op=821 fd=70 closed - U1

# here is the first op after the restart
[01/Dec/2016:00:08:47.316981795 +051800] conn=Internal op=-1 ADD dn=""

# BUT the sync of the in-memory RUV to the DB RUV occurred BEFORE the in-memory RUV actually contained 583f1cb3000000040000
[01/Dec/2016:00:08:42.582616273 +051800] conn=Internal op=-1 MOD dn="dc=testrelm,dc=test"
[01/Dec/2016:00:08:42.757724587 +051800] conn=Internal op=-1 RESULT err=0 tag=48 nentries=0 etime=0
[01/Dec/2016:00:08:42.592733665 +051800] _csngen_adjust_local_time: gen state before 583f1cb20004:1480531121:0:1
[01/Dec/2016:00:08:42.600076855 +051800] _csngen_adjust_local_time: gen state after 583f1cb30000:1480531122:0:1
[01/Dec/2016:00:08:42.641521991 +051800] NSMMReplicationPlugin - ruv_add_csn_inprogress: successfully inserted csn 583f1cb3000000040000 into pending list
[01/Dec/2016:00:08:42.692916413 +051800] repl5_inc_waitfor_async_results: 179 180
[01/Dec/2016:00:08:42.705634226 +051800] NSMMReplicationPlugin - changelog program - _cl5GetDBFileByReplicaName: found DB object 7fa5a33c6720 for database /var/lib/dirsrv/slapd-TESTRELM-TEST/cldb/39ef4086-b72611e6-8ad58ed8-9553efe0_583f12c9000000040000.db
[01/Dec/2016:00:08:42.721368479 +051800] NSMMReplicationPlugin - changelog program - _cl5GetDBFileByReplicaName: found DB object 7fa5a33c6720 for database /var/lib/dirsrv/slapd-TESTRELM-TEST/cldb/39ef4086-b72611e6-8ad58ed8-9553efe0_583f12c9000000040000.db
[01/Dec/2016:00:08:42.738379945 +051800] NSMMReplicationPlugin - ruv_update_ruv: successfully committed csn 583f1cb3000000040000

I think this is a consequence of https://fedorahosted.org/389/ticket/564, where the changelog and DB RUV are no longer written in the same transaction. Possibilities are:

- roll back https://fedorahosted.org/389/ticket/564
- after a normal shutdown, rebuild the DB RUV from the changelog RUV if the changelog RUV is ahead (as is done for a disorderly shutdown)
- make sure the write of the DB RUV at shutdown occurs only when all pending operations are committed or aborted
- ...

I just reproduced this issue. It happens if you have an active operation triggering internal mods and you stop the server before the operation is completed.

Hi Ludwig,

(In reply to Ludwig from comment #24)
> I just reproduced this issue. It happens if you have an active operation,
> triggering internal mods and stop the server before the operation is
> completed

Could you please elaborate a bit more on the steps to reproduce? Thanks!

Here is the scenario I used (a hedged shell sketch of the timing follows below):

- Enable memberof.
- Add 1000 entries that allow memberOf.
- Add a group with these 1000 entries as members; memberof will start and internally add memberOf to all thousand entries, performing 1000 internal modify operations.
- While this group add is running, stop the server. The stop should be timed so that the internal operations have already started but have not yet completed.
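A minimal sketch of that timing, assuming a standalone test instance named master1 listening on port 39001 with the member entries already loaded; the LDIF file name, instance name, and sleep value are illustrative, and the stop has to land while the memberOf fixup of the group members is still running:

    #!/bin/bash
    # Hedged reproduction sketch for the scenario above.
    PORT=39001
    INST=master1

    # Start the group add (1000 members -> 1000 internal memberOf MODs) in the background.
    # test_group1.ldif is a hypothetical file containing the group entry shown below.
    ldapmodify -x -h localhost -p $PORT -D "cn=directory manager" -w password -a -f test_group1.ldif &

    # Give the ADD time to start generating internal mods, then stop the server mid-operation.
    sleep 1
    systemctl stop dirsrv@$INST

    wait
    systemctl start dirsrv@$INST
    # Before the fix, the next startup logs ruv_compare_ruv followed by
    # "Recreating the changelog file."; with the fix it does not.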
Backport to the 1.3.6 upstream branch is done.

I am following https://bugzilla.redhat.com/show_bug.cgi?id=1192099#c27, and while running

ldapmodify -x -h localhost -p 39001 -D "cn=directory manager" -w password -a << EOF
dn: cn=test_Group1,dc=example,dc=com
changetype: add
objectclass: top
objectclass: groupOfNames
cn: test_Group1
member: uid=DDa3296903592,ou=Product Testing,dc=example,dc=com
member: uid=LNosewort3448819975,ou=Payroll,dc=example,dc=com
member: uid=HSiu2407396923,ou=Product Development,dc=example,dc=com
member: uid=MClocklab89074690,ou=Product Development,dc=example,dc=com
member: uid=SKurniawa835765406,ou=Payroll,dc=example,dc=com
member: uid=BAngobald3638662746,ou=Payroll,dc=example,dc=com
member: uid=LDantu1647639336,ou=Product Testing,dc=example,dc=com
. . . till 1000 members

and interrupting the operation by restarting the server in between, I am getting:

*****************************************************************
1. With build 389-ds-base-1.3.7.5-9.el7.x86_64, error logs:
*****************************************************************

[04/Jan/2018:03:27:13.406707137 -0500] - NOTICE - ldbm_back_start - total cache size: 239888793 B;
[04/Jan/2018:03:27:13.476231122 -0500] - ERR - NSMMReplicationPlugin - ruv_compare_ruv - The max CSN [5a4de55c03cc00010000] from RUV [changelog max RUV] is larger than the max CSN [5a4de4c3000300010000] from RUV [database RUV] for element [{replica 1} 5a4ddc2c000100010000 5a4de55c03cc00010000]
[04/Jan/2018:03:27:13.476923021 -0500] - WARN - NSMMReplicationPlugin - replica_check_for_data_reload - Data for replica dc=example,dc=com does not match the data in the changelog. Recreating the changelog file. This could affect replication with replica's consumers in which case the consumers should be reinitialized.
[04/Jan/2018:03:27:13.481802483 -0500] - NOTICE - NSMMReplicationPlugin - changelog program - _cl5ConstructRUV - Rebuilding the replication changelog RUV, this may take several minutes...

*****************************************************************
2. With build 389-ds-base-1.3.7.5-11.el7.x86_64, error logs:
*****************************************************************

[04/Jan/2018:03:39:06.199138419 -0500] - INFO - slapd_daemon - slapd shutting down - waiting for 1 thread to terminate
[04/Jan/2018:03:39:06.221547179 -0500] - ERR - memberof-plugin - memberof_postop_add - Failed to add dn(cn=test_Group1,dc=example,dc=com), error (-1)
[04/Jan/2018:03:39:06.252656504 -0500] - INFO - slapd_daemon - slapd shutting down - closing down internal subsystems and plugins
[04/Jan/2018:03:39:06.460171955 -0500] - ERR - NSMMReplicationPlugin - changelog program - _cl5WriteRUV - changelog maxRUV not found in changelog for file /var/lib/dirsrv/slapd-master1/changelogdb/b6263584-f12911e7-a7a3dc41-0d407204_5a4de677000000010000.db
[04/Jan/2018:03:39:06.465160239 -0500] - INFO - dblayer_pre_close - Waiting for 4 database threads to stop
[04/Jan/2018:03:39:08.297022792 -0500] - INFO - dblayer_pre_close - All database threads now stopped

Other observation: I am getting many errors like this:

[04/Jan/2018:03:39:06.150928402 -0500] - ERR - oc_check_allowed_sv - Entry "uid=VBarr300201628,ou=Product Testing,dc=example,dc=com" -- attribute "memberOf" not allowed
[04/Jan/2018:03:39:06.155038617 -0500] - ERR - oc_check_allowed_sv - Entry "uid=JNyce1044647570,ou=Product Development,dc=example,dc=com" -- attribute "memberOf" not allowed
[04/Jan/2018:03:39:06.160956054 -0500] - ERR - oc_check_allowed_sv - Entry "uid=ADuncan-S2591409655,ou=Human Resources,dc=example,dc=com" -- attribute "memberOf" not allowed

even though the entries have the required object class, and afterwards I checked that the entries do have the memberOf attribute. This error is misleading; I can file another bug for that.

Why do you say "failed QA"? In comment #30 I assume that -9 is without the fix and -11 is with the fix. With -9 the changelog is rebuilt; with -11 I don't see this message, but I correctly see the message at shutdown that the changelog max RUV is not in the changelog (and will not be written).

OK, I thought "ERR - NSMMReplicationPlugin - changelog program - _cl5WriteRUV - changelog maxRUV not found in changelog for file /var/lib/dirsrv/slapd-master1/changelogdb/b6263584-f12911e7-a7a3dc41-0d407204_5a4de677000000010000.db" was alarming. After discussion, I learned that this behavior is part of the patch: it means the ADD operation for the group with members will not be written to the changelog and will not be reflected on any of the masters. Marking the bug as verified. Thanks, Ludwig.

Well, those messages indicate the failure to write the changelog max RUV, which leads to recreation at startup and can make startup longer. But there is a scenario where the max RUV is not correct, and if we wrote it the changelog would be recreated at startup, so we deliberately trigger this failure and log it. The scenario should be rare, but if you want we can change it with a later patch.

Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2018:0811