Bug 1192099
Summary: | IPA server replication broken, after DS stop-start, due to changelog reset | |
---|---|---|---
Product: | Red Hat Enterprise Linux 7 | Reporter: | Scott Poore <spoore>
Component: | 389-ds-base | Assignee: | Ludwig <lkrispen>
Status: | CLOSED ERRATA | QA Contact: | Viktor Ashirov <vashirov>
Severity: | urgent | Docs Contact: |
Priority: | urgent | |
Version: | 7.1 | CC: | amsharma, gparente, lkrispen, mkosek, mreynolds, msauton, nkinder, rmeggins, sauchter, spoore, tbordaz, tmihinto
Target Milestone: | rc | Keywords: | ZStream
Target Release: | --- | |
Hardware: | All | |
OS: | Linux | |
Whiteboard: | | |
Fixed In Version: | 389-ds-base-1.3.7.5-10.el7 | Doc Type: | If docs needed, set a value
Doc Text: | | Story Points: | ---
Clone Of: | | |
: | 1523507 (view as bug list) | Environment: |
Last Closed: | 2018-04-10 14:15:15 UTC | Type: | Bug
Regression: | --- | Mount Type: | ---
Documentation: | --- | CRM: |
Verified Versions: | | Category: | ---
oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: |
Cloudforms Team: | --- | Target Upstream Version: |
Embargoed: | | |
Bug Depends On: | | |
Bug Blocks: | 1523507 | |
Description
Scott Poore
2015-02-12 15:52:30 UTC
Hi Noriko, thank you so much for your update; DB_NOTFOUND does look much more familiar than DB_DELETE.

The missing update is:

[10/Feb/2015:21:17:55 -0500] conn=2 op=900 MOD dn="cn=sudocmdgroups,cn=sudo,dc=testrelm,dc=test"
[10/Feb/2015:21:17:55 -0500] conn=Internal op=-1 SRCH base="cn=sudocmdgroups,cn=sudo,dc=testrelm,dc=test" scope=0 filter="(|(objectclass=*)(objectclass=ldapsubentry))" attrs=ALL
[10/Feb/2015:21:17:55 -0500] conn=Internal op=-1 RESULT err=0 tag=48 nentries=1 etime=0
[10/Feb/2015:21:17:56 -0500] conn=2 op=900 RESULT err=0 tag=103 nentries=0 etime=1 csn=54dabbd4001300040000

It was not found in the changelog:

[10/Feb/2015:21:20:23 -0500] NSMMReplicationPlugin - changelog program - agmt="cn=meTohp-dl380pgen8-02-vm-2.testrelm.test" (hp-dl380pgen8-02-vm-2:389): CSN 54dabbd4001300040000 not found and no purging, probably a reinit

There is no trace of an offline/online init in that period, so apart from changelog purging I see no reason why this update would be missing from the changelog.

Scott, I did not find the dse.ldif (master/replica) in the provided tarball. Thanks.

Thierry,

Unfortunately, those servers are no longer available. Their reservation expired and they were returned to the test server pool, so I cannot get the dse.ldif from them either. I don't know how reproducible this is; I think this is the first time we have seen it, so I don't expect it to be easy to reproduce.

Scott
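For reference, the triage above (confirming that a CSN was accepted locally but is absent from the changelog and was never replayed) can be scripted. This is a minimal sketch, not part of the original investigation; it assumes the default log locations under /var/log/dirsrv/slapd-<INSTANCE>/ and uses the CSN and instance name from this bug as examples:

    #!/bin/bash
    # Hypothetical helper: check where a given CSN shows up on a supplier.
    CSN=54dabbd4001300040000
    INST=TESTRELM-TEST
    LOGDIR=/var/log/dirsrv/slapd-$INST

    # 1. Was the update accepted locally? (the RESULT access-log line carries the csn)
    grep "csn=$CSN" "$LOGDIR"/access*

    # 2. Did the changelog lose it? (same message as seen in this bug)
    grep "CSN $CSN not found and no purging" "$LOGDIR"/errors*

    # 3. Was it replayed to the consumer? (needs replication debug logging, level 8192)
    grep "successfully sent operation with csn $CSN" "$LOGDIR"/errors*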
Here is the current status
--------------------------

There are two issues.

The first one is an unexpected reset of the changelog. This reset is likely a bug. The consequence is that updates not yet replicated are definitively lost.

The second one is that the replication agreement, being unable to find the 'primary' min_ruv in the changelog, starts replication from a "secondary" min_ruv. This is possibly not a bug, but in the case where no other master knows the 'primary' min_ruv it may hide a fatal replication error. If it is not a bug, it should be logged at the appropriate level.

- First issue (unexpected reset of the changelog)

#
# Beginning of a valid replication session (master -> replica)
#
[10/Feb/2015:21:17:12 -0500] - _cl5PositionCursorForReplay ): Consumer RUV:
[10/Feb/2015:21:17:12 -0500] NSMMReplicationPlugin -: {replicageneration} 54dab6bf000000040000
[10/Feb/2015:21:17:12 -0500] NSMMReplicationPlugin - 54dab6c4000000030000 54dabba7000000030000 00000000
[10/Feb/2015:21:17:12 -0500] NSMMReplicationPlugin - 54dab705000000040000 54dabba7000500040000 00000000
[10/Feb/2015:21:17:12 -0500] - _cl5PositionCursorForReplay ): Supplier RUV:
[10/Feb/2015:21:17:12 -0500] NSMMReplicationPlugin -: {replicageneration} 54dab6bf000000040000
[10/Feb/2015:21:17:12 -0500] NSMMReplicationPlugin - 54dab705000000040000 54dabba9001700040000 54dabba8
[10/Feb/2015:21:17:12 -0500] NSMMReplicationPlugin - 54dab6c4000000030000 54daba2c000300030000 00000000
[10/Feb/2015:21:17:12 -0500] - clcache_get_buffer: found thread private buffer cache 7f30b800bf70
[10/Feb/2015:21:17:12 -0500] - clcache_get_buffer: _pool is 7f30df3913f0 _pool->pl_busy_lists is 7f30b81f37f0 _pool->pl_busy_lists->bl_buffers is 7f30b800bf70
[10/Feb/2015:21:17:12 -0500] - session start: anchorcsn=54dabba7000500040000

#
# DEL operation that will go missing
#
[10/Feb/2015:21:18:02 -0500] conn=2 op=1155 DEL dn="idnsname=ipaad2012r2.test,cn=dns,dc=testrelm,dc=test"
[10/Feb/2015:21:18:02 -0500] conn=2 op=1155 RESULT err=0 tag=107 nentries=0 etime=0 csn=54dabbdb001100040000
[10/Feb/2015:21:18:02 -0500] NSMMReplicationPlugin - ruv_add_csn_inprogress: successfully inserted csn 54dabbdb001100040000 into pending list
[10/Feb/2015:21:18:02 -0500] NSMMReplicationPlugin - changelog program - _cl5GetDBFileByReplicaName: found DB object 7fdb0ba3b930 for database /var/lib/dirsrv/slapd-TESTRELM-TEST/cldb/134ac903-b19111e4-9b16c67f-e9d60a91_54dab6bf000000040000.db
[10/Feb/2015:21:18:02 -0500] NSMMReplicationPlugin - changelog program - _cl5GetDBFileByReplicaName: found DB object 7fdb0ba3b930 for database /var/lib/dirsrv/slapd-TESTRELM-TEST/cldb/134ac903-b19111e4-9b16c67f-e9d60a91_54dab6bf000000040000.db
[10/Feb/2015:21:18:02 -0500] NSMMReplicationPlugin - ruv_update_ruv: successfully committed csn 54dabbdb001100040000

# RESTART
[10/Feb/2015:21:18:04 -0500] - slapd stopped.
[10/Feb/2015:21:18:08 -0500] - slapd started. Listening on All Interfaces port 389 for LDAP requests

#
# Replication session that did not send the missing update because of the restart
#
[10/Feb/2015:21:18:51 -0500] - _cl5PositionCursorForReplay ): Consumer RUV:
[10/Feb/2015:21:18:51 -0500] NSMMReplicationPlugin -: {replicageneration} 54dab6bf000000040000
[10/Feb/2015:21:18:51 -0500] NSMMReplicationPlugin - 54dab6c4000000030000 54dabc0b000000030000 00000000
[10/Feb/2015:21:18:51 -0500] NSMMReplicationPlugin - 54dab705000000040000 54dabba9001700040000 00000000
[10/Feb/2015:21:18:51 -0500] - _cl5PositionCursorForReplay ): Supplier RUV:
[10/Feb/2015:21:18:51 -0500] NSMMReplicationPlugin -: {replicageneration} 54dab6bf000000040000
[10/Feb/2015:21:18:51 -0500] NSMMReplicationPlugin - 54dab705000000040000 54dabc0c000000040000 54dabc0b
[10/Feb/2015:21:18:51 -0500] NSMMReplicationPlugin - 54dab6c4000000030000 54daba2c000300030000 00000000
[10/Feb/2015:21:18:51 -0500] - session start: anchorcsn=54dabba9001700040000

# Sends a lot of updates from 54dabba9001700040000 -> 54dabbd4001300040000
[10/Feb/2015:21:18:51 -0500] NSMMReplicationPlugin - agmt="cn=meTohp-dl380pgen8-02-vm-2.testrelm.test" (hp-dl380pgen8-02-vm-2:389): replay_update: Consumer successfully sent operation with csn 54dabbaa000000040000
...
[10/Feb/2015:21:19:42 -0500] NSMMReplicationPlugin - agmt="cn=meTohp-dl380pgen8-02-vm-2.testrelm.test" (hp-dl380pgen8-02-vm-2:389): replay_update: Consumer successfully sent operation with csn 54dabbd4001300040000
[10/Feb/2015:21:19:42 -0500] - repl5_inc_result_threadmain exiting << shutdown

# RESTART
[10/Feb/2015:21:19:44 -0500] - slapd stopped.
[10/Feb/2015:21:19:51 -0500] - slapd started. Listening on All Interfaces port 389 for LDAP requests

#
# the changelog is recreated (losing the update csn=54dabbdb001100040000)
#
[10/Feb/2015:21:19:51 -0500] NSMMReplicationPlugin - replica_check_for_data_reload: Warning: data for replica dc=testrelm,dc=test does not match the data in the changelog. Recreating the changelog file. This could affect replication with replica's consumers in which case the consumers should be reinitialized.

#
# replication session after the changelog recreation
# as we do not find 54dabbdb001100040000, we continue replication starting with 54daba2c000300030000
#
[10/Feb/2015:21:20:23 -0500] - _cl5PositionCursorForReplay ): Consumer RUV:
[10/Feb/2015:21:20:23 -0500] NSMMReplicationPlugin -: {replicageneration} 54dab6bf000000040000
[10/Feb/2015:21:20:23 -0500] NSMMReplicationPlugin - 54dab6c4000000030000 54dabc68000000030000 00000000
[10/Feb/2015:21:20:23 -0500] NSMMReplicationPlugin - 54dab705000000040000 54dabbd4001300040000 00000000
[10/Feb/2015:21:20:23 -0500] - _cl5PositionCursorForReplay ): Supplier RUV:
[10/Feb/2015:21:20:23 -0500] NSMMReplicationPlugin -: {replicageneration} 54dab6bf000000040000
[10/Feb/2015:21:20:23 -0500] NSMMReplicationPlugin - 54dab705000000040000 54dabc66000000040000 54dabc65
[10/Feb/2015:21:20:23 -0500] NSMMReplicationPlugin - 54dab6c4000000030000 54daba2c000300030000 00000000
[10/Feb/2015:21:20:23 -0500] - session start: anchorcsn=54dabbd4001300040000
[10/Feb/2015:21:20:23 -0500] agmt="cn=meTohp-dl380pgen8-02-vm-2.testrelm.test" (hp-dl380pgen8-02-vm-2:389) - clcache_load_buffer: rc=-30988
[10/Feb/2015:21:20:23 -0500] NSMMReplicationPlugin - changelog program - agmt="cn=meTohp-dl380pgen8-02-vm-2.testrelm.test" (hp-dl380pgen8-02-vm-2:389): CSN 54dabbd4001300040000 not found and no purging, probably a reinit
[10/Feb/2015:21:20:24 -0500] NSMMReplicationPlugin - changelog program - agmt="cn=meTohp-dl380pgen8-02-vm-2.testrelm.test" (hp-dl380pgen8-02-vm-2:389): Will try to use supplier min CSN 54daba2c000300030000 to load changelog
[10/Feb/2015:21:20:24 -0500] agmt="cn=meTohp-dl380pgen8-02-vm-2.testrelm.test" (hp-dl380pgen8-02-vm-2:389) - session start: anchorcsn=54daba2c000300030000

Here are the next steps
-----------------------
- identify why the changelog is recreated
- check whether continuing replication is valid (likely); in any case, the failure to retrieve the min_csn should be logged at the right level

The reason why the changelog gets recreated is possibly a bug in the DS shutdown.

Client updates:

# last update before restart
[10/Feb/2015:21:19:00 -0500] conn=8 op=4 MOD dn="fqdn=ipaqavmg.testrelm.test,cn=computers,cn=accounts,dc=testrelm,dc=test"
[10/Feb/2015:21:19:30 -0500] conn=8 op=4 RESULT err=0 tag=103 nentries=0 etime=30 csn=54dabc20000000040000
...

# first update after restart
[10/Feb/2015:21:19:53 -0500] conn=5 op=16 MOD dn="fqdn=ipaqavmg.testrelm.test,cn=computers,cn=accounts,dc=testrelm,dc=test"
[10/Feb/2015:21:19:53 -0500] conn=5 op=16 RESULT err=0 tag=103 nentries=0 etime=0 csn=54dabc4a000000040000

# last update (csn=54dabc20000000040000) before restart with associated internal updates
# the replica RUV was successfully updated
[10/Feb/2015:21:19:30 -0500] NSMMReplicationPlugin - ruv_update_ruv: successfully committed csn 54dabc20000000040000
[10/Feb/2015:21:19:35 -0500] NSMMReplicationPlugin - ruv_update_ruv: successfully committed csn 54dabc33000000040000
[10/Feb/2015:21:19:37 -0500] NSMMReplicationPlugin - ruv_update_ruv: successfully committed csn 54dabc38000000040000
[10/Feb/2015:21:19:40 -0500] NSMMReplicationPlugin - ruv_update_ruv: successfully committed csn 54dabc3a000000040000

# The replica RUV (in memory) contains 54dabc3a000000040000 before the restart
# during shutdown the replica RUV is dumped and contains the last update
[10/Feb/2015:21:19:42 -0500] NSMMReplicationPlugin - Database RUV: {replicageneration} 54dab6bf000000040000
[10/Feb/2015:21:19:42 -0500] NSMMReplicationPlugin - Database RUV: {replica 4 ldap://ipaqavmg.testrelm.test:389} 54dab705000000040000 54dabc3a000000040000 54dabc3c
[10/Feb/2015:21:19:42 -0500] NSMMReplicationPlugin - Database RUV: {replica 3 ldap://hp-dl380pgen8-02-vm-2.testrelm.test:389} 54dab6c4000000030000 54daba2c000300030000 00000000
[10/Feb/2015:21:19:44 -0500] - slapd stopped.

# But after the restart this update is not present in the database RUV
[10/Feb/2015:21:19:49 -0500] - 389-Directory/1.3.3.1 B2015.041.149 starting up
[10/Feb/2015:21:19:51 -0500] changelog max RUV: {replicageneration} 54dab6bf000000040000
[10/Feb/2015:21:19:51 -0500] changelog max RUV: {replica 3} 54dab6c4000000030000 54daba2c000300030000 00000000
[10/Feb/2015:21:19:51 -0500] changelog max RUV: {replica 4} 54dab705000000040000 54dabc3a000000040000 00000000
[10/Feb/2015:21:19:51 -0500] database RUV: {replicageneration} 54dab6bf000000040000
[10/Feb/2015:21:19:51 -0500] database RUV: {replica 4 ldap://ipaqavmg.testrelm.test:389} 54dab705000000040000 54dabc38000000040000 00000000
[10/Feb/2015:21:19:51 -0500] database RUV: {replica 3 ldap://hp-dl380pgen8-02-vm-2.testrelm.test:389} 54dab6c4000000030000 54daba2c000300030000 00000000
[10/Feb/2015:21:19:51 -0500] - slapd started. Listening on All Interfaces port 389 for LDAP requests

In conclusion: it looks like the replica RUV is not flushed to the database during shutdown. I do not know whether the problem is related to the fact that the last update was an internal update. The most probable cause is a shutdown bug, where the flushing of the replica RUV (replica_flush) is not forced. I opened DS ticket https://fedorahosted.org/389/ticket/48118.

If the last update is processed less than 30 seconds before a regular DS shutdown, there is a possibility that the database RUV is behind the changelog RUV. At startup this will be seen as an error condition and the changelog will be rebuilt.
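A hedged way to watch for this window on a test instance is to apply a write, stop the instance immediately, and compare the RUVs that the server dumps at the next startup (these dumps were visible here because nsslapd-errorlog-level was set to 8192). This is a sketch only; the instance name and log path follow the defaults used in this setup:

    #!/bin/bash
    # Sketch: stop the instance right after a write and inspect the startup RUV dump.
    INST=TESTRELM-TEST
    ERRLOG=/var/log/dirsrv/slapd-$INST/errors

    # ... apply any single write to the replicated suffix here (ldapmodify/ldapadd),
    # then stop within the flush window:
    systemctl stop dirsrv@$INST
    systemctl start dirsrv@$INST

    # A database RUV that lagged behind the changelog RUV is followed by the
    # "Recreating the changelog" warning at startup.
    grep -E "changelog max RUV|database RUV|Recreating the changelog" "$ERRLOG" | tail -n 20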
I am trying to reproduce this. So far I have failed to reproduce on VMs (1 CPU Nehalem) as well as on bare metal. Scott, may I provide you with an automatic test to see whether you can reproduce it on one of your VMs? I'll send you the details for the VM.

Here is the current status:

- I have not been able to reproduce.
- As mentioned in https://bugzilla.redhat.com/show_bug.cgi?id=1192099#c8, before the shutdown the DB RUV contained 54dabc3a000000040000 and after the restart it contained 54dabc38000000040000.
- At shutdown, a hash table that contains the replicas is enumerated and each replica RUV is written back to disk (replica_destroy_hash_entry). According to the code and the logs, we should not be in a condition where the write-back is skipped.
- Shutdown occurred normally. In particular, multimaster_stop ran without problem, so the enumeration of the replica hash table should have occurred.
- There is no specific logging in the enumeration/RUV flush that would help to understand what was going on.

Here are the next steps:

- I need to build a debug version with more logging and try to reproduce.
- The possible causes I can think of are:
  - a bug in the hash enumeration that skipped a replica;
  - a corner case that reset replica_name, the dirty flag, or the csngen, but I do not see how that could occur.

OK, more details on the setup (a consolidated sketch of the shared debug-prep steps follows the Client section):

Master:
- add new hostname (host.testrelm.test) to /etc/hosts
- change hostname in /etc/hostname as well
- disable firewalld
- install 389-ds-base-debuginfo
- sysctl -w fs.suid_dumpable=1
- echo 'ulimit -c unlimited' >> /etc/sysconfig/dirsrv
- echo 'LimitCORE=infinity' >> /etc/sysconfig/dirsrv.systemd
- systemctl daemon-reload
- /usr/sbin/ipa-server-install -U --setup-dns --forwarder=<DNS_FORWARDER> --hostname=<HOSTNAME>.testrelm.test -r TESTRELM.TEST -n testrelm.test -p <PASSWORD> -P <PASSWORD> -a <PASSWORD>
- force Master and Replica IPs into /etc/resolv.conf
- set nsslapd-errorlog-level: 8192
- set nsslapd-accesslog-level: 260
- set nsslapd-sasl-max-buffer-size: 1048576
- ipa config-mod --searchtimelimit=100
- ipactl stop; ipactl start

Replica:
- add new hostname (host.testrelm.test) to /etc/hosts
- change hostname in /etc/hostname as well
- force Master and Replica IPs into /etc/resolv.conf
- disable firewalld
- install 389-ds-base-debuginfo
- sysctl -w fs.suid_dumpable=1
- echo 'ulimit -c unlimited' >> /etc/sysconfig/dirsrv
- echo 'LimitCORE=infinity' >> /etc/sysconfig/dirsrv.systemd
- systemctl daemon-reload
- on MASTER run: ipa-replica-prepare -p <PASSWORD> --ip-address=<REPLICA_IP> --reverse-zone=<REPLICA_REVERSE_ZONE> <REPLICA_NAME>.testrelm.test
- copy /var/lib/ipa/replica-info-<REPLICA_NAME>.testrelm.test.gpg to REPLICA:/var/lib/ipa
- /usr/sbin/ipa-replica-install -U --setup-ca --setup-dns --forwarder=<DNS_FORWARDER> -w <PASSWORD> -p <PASSWORD> /opt/rhqa_ipa/replica-info-<REPLICA_NAME>.testrelm.test.gpg
- ipactl stop; ipactl start
- force Master and Replica IPs into /etc/resolv.conf
- set nsslapd-errorlog-level: 8192
- set nsslapd-accesslog-level: 260
- set nsslapd-sasl-max-buffer-size: 1048576
- ipactl stop; ipactl start

Client:
- add new hostname (host.testrelm.test) to /etc/hosts
- change hostname in /etc/hostname as well
- force Master and Replica IPs into /etc/resolv.conf
- disable firewalld
- install 389-ds-base-debuginfo
- sysctl -w fs.suid_dumpable=1
- echo 'ulimit -c unlimited' >> /etc/sysconfig/dirsrv
- echo 'LimitCORE=infinity' >> /etc/sysconfig/dirsrv.systemd
- systemctl daemon-reload
- /usr/sbin/ipa-client-install -d -U --domain=testrelm.test --realm=TESTRELM.TEST -p admin -w <PASSWORD> --server=<MASTER_NAME>.testrelm.test

Not quite sure why I left the 389 steps in for the client, but since I did, I listed them.
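The core-dump and logging preparation above is the same on all three hosts. A minimal consolidated sketch follows; the instance name is illustrative, and applying the "set nsslapd-..." steps via ldapmodify against cn=config is one possible way (the original steps do not say which mechanism was used):

    #!/bin/bash
    # Hedged sketch of the shared debug-prep steps.
    set -e
    sysctl -w fs.suid_dumpable=1
    echo 'ulimit -c unlimited' >> /etc/sysconfig/dirsrv
    echo 'LimitCORE=infinity'  >> /etc/sysconfig/dirsrv.systemd
    systemctl daemon-reload

    # Raise DS logging (replication debug + internal-op access logging),
    # mirroring the "set nsslapd-*" steps in the lists above.
    ldapmodify -x -h localhost -p 389 -D "cn=directory manager" -w <PASSWORD> << EOF
    dn: cn=config
    changetype: modify
    replace: nsslapd-errorlog-level
    nsslapd-errorlog-level: 8192
    -
    replace: nsslapd-accesslog-level
    nsslapd-accesslog-level: 260
    -
    replace: nsslapd-sasl-max-buffer-size
    nsslapd-sasl-max-buffer-size: 1048576
    EOF

    ipactl stop; ipactl start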
Scott,

It has been a long time and I gave no feedback. I applied the test case (thanks again) but was not able to reproduce the problem. To make progress on this bug, I need to be able to reproduce with a debug version (to detect, at shutdown, why the DB RUV is not written back). Would you help me try to reproduce? The test case from https://bugzilla.redhat.com/show_bug.cgi?id=1192099#c15 would need to be run in a loop, as the problem is timing dependent.

thanks
thierry

Thierry,

Running that in a loop will require additional uninstall steps. In the past this has proved difficult for some versions of IPA/RHEL when re-installing afterwards. It may be easier for me to just run many jobs to catch this. Would you be able to provide me with a debug version I can put in a custom repo for this? Or I can try to work out a way for you to run my exact tests on your VMs. Would that help?

This bug is triggered by the fact that the last update CSN (54dabc3a000000040000 in #8) was not written (RUV) to the suffix database during shutdown. The bug described in http://post-office.corp.redhat.com/archives/ipa-and-samba-team-list/2015-July/msg00194.html / https://fedorahosted.org/freeipa/ticket/5112 is that the changelog RUV was not written to the changelog database during shutdown. Both bugs relate to RUV write-back at shutdown, so they could be related, but I do not see an immediate relationship between the two. In particular, with the current bug, the comparison that triggered the changelog recreation is done on a RID that was not cleaned.

This update is related to the last run, where the problem happened again while internal operations were being logged.

Similar symptom of the missing CSN:

# 583f1cb2000300040000 was successful and went into the in-memory RUV
[01/Dec/2016:00:08:42.249813656 +051800] NSMMReplicationPlugin - ruv_add_csn_inprogress: successfully inserted csn 583f1cb2000300040000 into pending list
[01/Dec/2016:00:08:42.441151046 +051800] NSMMReplicationPlugin - ruv_update_ruv: successfully committed csn 583f1cb2000300040000

# 583f1cb3000000040000 was successful and went into the in-memory RUV
# last operation before shutdown
[01/Dec/2016:00:08:42.493242558 +051800] conn=6 op=819 ADD dn="idnsserverid=vm-idm-005.testrelm.test,cn=servers,cn=dns,dc=testrelm,dc=test"
[01/Dec/2016:00:08:42.641521991 +051800] NSMMReplicationPlugin - ruv_add_csn_inprogress: successfully inserted csn 583f1cb3000000040000 into pending list
[01/Dec/2016:00:08:42.738379945 +051800] NSMMReplicationPlugin - ruv_update_ruv: successfully committed csn 583f1cb3000000040000
[01/Dec/2016:00:08:42.751134933 +051800] conn=6 op=819 RESULT err=0 tag=105 nentries=0 etime=0 csn=583f1cb3000000040000

# Shutdown
[01/Dec/2016:00:08:43.017988179 +051800] slapd shutting down - signaling operation threads - op stack size 4 max work q size 2 max work q stack size 2
[01/Dec/2016:00:08:43.024334224 +051800] slapd shutting down - waiting for 29 threads to terminate
[01/Dec/2016:00:08:43.029683669 +051800] slapd shutting down - closing down internal subsystems and plugins

# dump of the in-memory DB RUV showing that 583f1cb3000000040000 was in the in-memory RUV
[01/Dec/2016:00:08:44.286974135 +051800] NSMMReplicationPlugin - Database RUV: {replicageneration} 583f12c9000000040000
[01/Dec/2016:00:08:44.298360539 +051800] NSMMReplicationPlugin - Database RUV: {replica 4 ldap://vm-idm-005.testrelm.test:389} 583f12cb000000040000 583f1cb3000000040000 583f1cb2
[01/Dec/2016:00:08:44.304101449 +051800] NSMMReplicationPlugin - Database RUV: {replica 3 ldap://vm-idm-008.testrelm.test:389} 583f12d3000000030000 583f1a91000300030000 00000000

# But after the restart the DB RUV does not contain 583f1cb3000000040000 and "rolls back" to 583f1cb2000300040000
changelog max RUV: {replicageneration} 583f12c9000000040000
changelog max RUV: {replica 4} 583f12cb000000040000 583f1cb3000000040000 00000000
changelog max RUV: {replica 3} 583f12d3000000030000 583f1a91000300030000 00000000
database RUV: {replicageneration} 583f12c9000000040000
database RUV: {replica 4 ldap://vm-idm-005.testrelm.test:389} 583f12cb000000040000 583f1cb2000300040000 00000000
database RUV: {replica 3 ldap://vm-idm-008.testrelm.test:389} 583f12d3000000030000 583f1a91000300030000 00000000

[01/Dec/2016:00:08:48.373284220 +051800] NSMMReplicationPlugin - ruv_compare_ruv: the max CSN [583f1cb2000300040000] from RUV [database RUV] is less than or equal to the max CSN [583f1cb3000000040000] from RUV [changelog max RUV] for element [{replica 4 ldap://vm-idm-005.testrelm.test:389} 583f12cb000000040000 583f1cb2000300040000]
[01/Dec/2016:00:08:48.375940973 +051800] NSMMReplicationPlugin - ruv_compare_ruv: the max CSN [583f1a91000300030000] from RUV [database RUV] is less than or equal to the max CSN [583f1a91000300030000] from RUV [changelog max RUV] for element [{replica 3 ldap://vm-idm-008.testrelm.test:389} 583f12d3000000030000 583f1a91000300030000]
[01/Dec/2016:00:08:48.384849429 +051800] NSMMReplicationPlugin - ruv_compare_ruv: the max CSN [583f1cb3000000040000] from RUV [changelog max RUV] is larger than the max CSN [583f1cb2000300040000] from RUV [database RUV] for element [{replica 4} 583f12cb000000040000 583f1cb3000000040000]
[01/Dec/2016:00:08:48.390535115 +051800] NSMMReplicationPlugin - ruv_compare_ruv: the max CSN [583f1a91000300030000] from RUV [changelog max RUV] is less than or equal to the max CSN [583f1a91000300030000] from RUV [database RUV] for element [{replica 3} 583f12d3000000030000 583f1a91000300030000]

# That triggered the changelog recreation
[01/Dec/2016:00:08:48.401903812 +051800] NSMMReplicationPlugin - replica_check_for_data_reload: Warning: data for replica dc=testrelm,dc=test does not match the data in the changelog. Recreating the changelog file. This could affect replication with replica's consumers in which case the consumers should be reinitialized.
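As a side note, the "database RUV" that ruv_compare_ruv checks at startup is the RUV persisted in the suffix. While the server is running it can be inspected directly by reading the RUV tombstone entry; a hedged example against the suffix used here (host, port, and credentials are placeholders):

    # Read the persisted database RUV (the RUV tombstone entry of the suffix).
    ldapsearch -x -h localhost -p 389 -D "cn=directory manager" -w <PASSWORD> \
      -b "dc=testrelm,dc=test" \
      "(&(objectclass=nstombstone)(nsuniqueid=ffffffff-ffffffff-ffffffff-ffffffff))" nsds50ruv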
What the internal operations logging is showing: the logging of internal operations shows that the update of the RUV (during shutdown) occurred and completed.

[01/Dec/2016:00:08:42.493242558 +051800] conn=6 op=819 ADD dn="idnsserverid=vm-idm-005.testrelm.test,cn=servers,cn=dns,dc=testrelm,dc=test"
[01/Dec/2016:00:08:42.582616273 +051800] conn=Internal op=-1 MOD dn="dc=testrelm,dc=test" <<< RUV update
[01/Dec/2016:00:08:42.656846289 +051800] conn=Internal op=-1 ADD dn="changenumber=1223,cn=changelog"
[01/Dec/2016:00:08:42.693464205 +051800] conn=Internal op=-1 RESULT err=0 tag=48 nentries=0 etime=0
[01/Dec/2016:00:08:42.751134933 +051800] conn=6 op=819 RESULT err=0 tag=105 nentries=0 etime=0 csn=583f1cb3000000040000
[01/Dec/2016:00:08:42.752813923 +051800] conn=6 op=820 SRCH base="idnsserverid=vm-idm-005.testrelm.test,cn=servers,cn=dns,dc=testrelm,dc=test" scope=0 filter="(objectClass=*)" attrs="idnsForwardPolicy idnsServerId idnsForwarders idnsServerId objectClass idnsSOAmName"
[01/Dec/2016:00:08:42.753043302 +051800] conn=6 op=820 RESULT err=0 tag=101 nentries=1 etime=0
[01/Dec/2016:00:08:42.757724587 +051800] conn=Internal op=-1 RESULT err=0 tag=48 nentries=0 etime=0 <<< RUV update result
[01/Dec/2016:00:08:42.766283948 +051800] conn=6 op=821 UNBIND
[01/Dec/2016:00:08:42.766348052 +051800] conn=6 op=821 fd=70 closed - U1

# here is the first op after the restart
[01/Dec/2016:00:08:47.316981795 +051800] conn=Internal op=-1 ADD dn=""

# BUT the sync of the in-memory RUV to the DB RUV occurred BEFORE the in-memory RUV actually contained 583f1cb3000000040000
[01/Dec/2016:00:08:42.582616273 +051800] conn=Internal op=-1 MOD dn="dc=testrelm,dc=test"
[01/Dec/2016:00:08:42.757724587 +051800] conn=Internal op=-1 RESULT err=0 tag=48 nentries=0 etime=0
[01/Dec/2016:00:08:42.592733665 +051800] _csngen_adjust_local_time: gen state before 583f1cb20004:1480531121:0:1
[01/Dec/2016:00:08:42.600076855 +051800] _csngen_adjust_local_time: gen state after 583f1cb30000:1480531122:0:1
[01/Dec/2016:00:08:42.641521991 +051800] NSMMReplicationPlugin - ruv_add_csn_inprogress: successfully inserted csn 583f1cb3000000040000 into pending list
[01/Dec/2016:00:08:42.692916413 +051800] repl5_inc_waitfor_async_results: 179 180
[01/Dec/2016:00:08:42.705634226 +051800] NSMMReplicationPlugin - changelog program - _cl5GetDBFileByReplicaName: found DB object 7fa5a33c6720 for database /var/lib/dirsrv/slapd-TESTRELM-TEST/cldb/39ef4086-b72611e6-8ad58ed8-9553efe0_583f12c9000000040000.db
[01/Dec/2016:00:08:42.721368479 +051800] NSMMReplicationPlugin - changelog program - _cl5GetDBFileByReplicaName: found DB object 7fa5a33c6720 for database /var/lib/dirsrv/slapd-TESTRELM-TEST/cldb/39ef4086-b72611e6-8ad58ed8-9553efe0_583f12c9000000040000.db
[01/Dec/2016:00:08:42.738379945 +051800] NSMMReplicationPlugin - ruv_update_ruv: successfully committed csn 583f1cb3000000040000

I think this is a consequence of https://fedorahosted.org/389/ticket/564, where the changelog and DB RUV are no longer written in the same transaction. Possibilities are:

- roll back https://fedorahosted.org/389/ticket/564
- after a normal shutdown, rebuild the DB RUV from the changelog RUV if the changelog RUV is ahead (as is done for a disorderly shutdown)
- make sure the write of the DB RUV at shutdown occurs only when all pending operations are committed or aborted
- ...

I just reproduced this issue. It happens if you have an active operation triggering internal mods and you stop the server before the operation is completed.

Hi Ludwig,

(In reply to Ludwig from comment #24)
> I just reproduced this issue. It happens if you have an active operation,
> triggering internal mods and stop the server before the operation is
> completed

Could you please elaborate a bit more on the steps to reproduce? Thanks!

Here is the scenario I used (a hedged shell sketch of the timing follows below):

- Enable memberof.
- Add 1000 entries that allow memberOf.
- Add a group with these 1000 entries as members; memberof will start and internally add memberOf to all thousand entries, performing 1000 internal modify operations.
- While this group add is running, stop the server. The stop should be timed so that the internal operations have already started but have not yet completed.
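A minimal sketch of that timing, assuming a standalone test instance named master1 listening on port 39001 with the member entries already loaded; the LDIF file name, instance name, and sleep value are illustrative, and the stop has to land while the memberOf fixup of the group members is still running:

    #!/bin/bash
    # Hedged reproduction sketch for the scenario above.
    PORT=39001
    INST=master1

    # Start the group add (1000 members -> 1000 internal memberOf MODs) in the background.
    # test_group1.ldif is a hypothetical file containing the group entry shown below.
    ldapmodify -x -h localhost -p $PORT -D "cn=directory manager" -w password -a -f test_group1.ldif &

    # Give the ADD time to start generating internal mods, then stop the server mid-operation.
    sleep 1
    systemctl stop dirsrv@$INST

    wait
    systemctl start dirsrv@$INST
    # Before the fix, the next startup logs ruv_compare_ruv followed by
    # "Recreating the changelog file."; with the fix it does not.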
Backport to the 1.3.6 upstream branch is done.

I am following https://bugzilla.redhat.com/show_bug.cgi?id=1192099#c27, and while running

ldapmodify -x -h localhost -p 39001 -D "cn=directory manager" -w password -a << EOF
dn: cn=test_Group1,dc=example,dc=com
changetype: add
objectclass: top
objectclass: groupOfNames
cn: test_Group1
member: uid=DDa3296903592,ou=Product Testing,dc=example,dc=com
member: uid=LNosewort3448819975,ou=Payroll,dc=example,dc=com
member: uid=HSiu2407396923,ou=Product Development,dc=example,dc=com
member: uid=MClocklab89074690,ou=Product Development,dc=example,dc=com
member: uid=SKurniawa835765406,ou=Payroll,dc=example,dc=com
member: uid=BAngobald3638662746,ou=Payroll,dc=example,dc=com
member: uid=LDantu1647639336,ou=Product Testing,dc=example,dc=com
. . . till 1000 members

and interrupting the operation by restarting the server in between, I am getting:

*****************************************************************
1. With build 389-ds-base-1.3.7.5-9.el7.x86_64, error logs:
*****************************************************************

[04/Jan/2018:03:27:13.406707137 -0500] - NOTICE - ldbm_back_start - total cache size: 239888793 B;
[04/Jan/2018:03:27:13.476231122 -0500] - ERR - NSMMReplicationPlugin - ruv_compare_ruv - The max CSN [5a4de55c03cc00010000] from RUV [changelog max RUV] is larger than the max CSN [5a4de4c3000300010000] from RUV [database RUV] for element [{replica 1} 5a4ddc2c000100010000 5a4de55c03cc00010000]
[04/Jan/2018:03:27:13.476923021 -0500] - WARN - NSMMReplicationPlugin - replica_check_for_data_reload - Data for replica dc=example,dc=com does not match the data in the changelog. Recreating the changelog file. This could affect replication with replica's consumers in which case the consumers should be reinitialized.
[04/Jan/2018:03:27:13.481802483 -0500] - NOTICE - NSMMReplicationPlugin - changelog program - _cl5ConstructRUV - Rebuilding the replication changelog RUV, this may take several minutes...

*****************************************************************
2. With build 389-ds-base-1.3.7.5-11.el7.x86_64, error logs:
*****************************************************************

[04/Jan/2018:03:39:06.199138419 -0500] - INFO - slapd_daemon - slapd shutting down - waiting for 1 thread to terminate
[04/Jan/2018:03:39:06.221547179 -0500] - ERR - memberof-plugin - memberof_postop_add - Failed to add dn(cn=test_Group1,dc=example,dc=com), error (-1)
[04/Jan/2018:03:39:06.252656504 -0500] - INFO - slapd_daemon - slapd shutting down - closing down internal subsystems and plugins
[04/Jan/2018:03:39:06.460171955 -0500] - ERR - NSMMReplicationPlugin - changelog program - _cl5WriteRUV - changelog maxRUV not found in changelog for file /var/lib/dirsrv/slapd-master1/changelogdb/b6263584-f12911e7-a7a3dc41-0d407204_5a4de677000000010000.db
[04/Jan/2018:03:39:06.465160239 -0500] - INFO - dblayer_pre_close - Waiting for 4 database threads to stop
[04/Jan/2018:03:39:08.297022792 -0500] - INFO - dblayer_pre_close - All database threads now stopped

Other observation: I am getting many errors like this:

[04/Jan/2018:03:39:06.150928402 -0500] - ERR - oc_check_allowed_sv - Entry "uid=VBarr300201628,ou=Product Testing,dc=example,dc=com" -- attribute "memberOf" not allowed
[04/Jan/2018:03:39:06.155038617 -0500] - ERR - oc_check_allowed_sv - Entry "uid=JNyce1044647570,ou=Product Development,dc=example,dc=com" -- attribute "memberOf" not allowed
[04/Jan/2018:03:39:06.160956054 -0500] - ERR - oc_check_allowed_sv - Entry "uid=ADuncan-S2591409655,ou=Human Resources,dc=example,dc=com" -- attribute "memberOf" not allowed

even though the entries have the required object class, and afterwards I checked that the entries do have the memberOf attribute. This error is misleading; I can file another bug for that.

Why do you say "failed QA"? In comment #30 I assume that -9 is without the fix and -11 is with the fix. With -9 the changelog is rebuilt; with -11 I don't see this message, but I correctly see the message at shutdown that the changelog max RUV is not in the changelog (and will not be written).

OK, I thought "ERR - NSMMReplicationPlugin - changelog program - _cl5WriteRUV - changelog maxRUV not found in changelog for file /var/lib/dirsrv/slapd-master1/changelogdb/b6263584-f12911e7-a7a3dc41-0d407204_5a4de677000000010000.db" was alarming. After discussion, I learned that this behavior is part of the patch: it means the ADD operation for the group with members will not be written to the changelog and will not be reflected on any of the masters. Marking the bug as verified. Thanks, Ludwig.

Well, those messages indicate the failure to write the changelog max RUV, which leads to recreation at startup and can make startup longer. But there is a scenario where the max RUV is not correct, and if we wrote it the changelog would be recreated at startup, so we deliberately trigger this failure and log it. The scenario should be rare, but if you want we can change it with a later patch.

Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2018:0811