Red Hat Bugzilla – Bug 1270002
cleanallruv should completely clean changelog
Last modified: 2016-05-10 15:21:45 EDT
Verification steps: https://bugzilla.redhat.com/show_bug.cgi?id=1240845#c6 (verified on 7.2)
1) I have a six-master replication setup. My replica agreement for M2:

[root@cisco-c22m3-01 MMR_WINSYNC]# PORT=1189 ; ldapsearch -x -p $PORT -h localhost -D "cn=Directory Manager" -w Secret123 -b "cn=replica,cn=\"dc=passsync,dc=com\",cn=mapping tree,cn=config" |grep -i dn: |grep 1626
dn: cn=1189_to_1626_on_cisco-c22m3-01.rhts.eng.bos.redhat.com,cn=replica,cn=dc

2) Replication is in sync between M1 and M2.

[root@cisco-c22m3-01 MMR_WINSYNC]# for PORT in `echo "1189 1289"`; do Users=`ldapsearch -x -p $PORT -h localhost -D "cn=Directory Manager" -w Secret123 -b "dc=passsync,dc=com" |grep -i "dn: uid=*" |wc -l`; echo "User entries on PORT-$PORT is $Users"; done
User entries on PORT-1189 is 216
User entries on PORT-1289 is 216

3) Stop M2.

[root@cisco-c22m3-01 MMR_WINSYNC]# service dirsrv status M2
dirsrv M2 is stopped

4) Delete M2's replication agreement.

PORT=1189 ; ldapdelete -x -p $PORT -h localhost -D "cn=Directory Manager" -w Secret123 "cn=1189_to_1626_on_cisco-c22m3-01.rhts.eng.bos.redhat.com,cn=replica,cn=dc\3Dpasssync\2Cdc\3Dcom,cn=mapping tree,cn=config"

5) LDIF file for cleanallruv:

[root@cisco-c22m3-01 MMR_WINSYNC]# cat /export/cleanall.ldif
dn: cn=M2Clean,cn=cleanallruv,cn=tasks,cn=config
cn: M2Clean
objectclass: extensibleObject
replica-base-dn: dc=passsync,dc=com
replica-id: 2212

6) Run cleanallruv for M2.
[root@cisco-c22m3-01 MMR_WINSYNC]# PORT="1189"; ldapmodify -a -x -p $PORT -h localhost -D "cn=Directory Manager" -w Secret123 -avf /export/cleanall.ldif
ldap_initialize( ldap://localhost:1189 )
add cn: M2Clean
add objectclass: extensibleObject
add replica-base-dn: dc=passsync,dc=com
add replica-id: 2212
adding new entry "cn=M2Clean,cn=cleanallruv,cn=tasks,cn=config"
modify complete

7) Wait for cleanallruv to complete, then kill replica M1.

[28/Mar/2016:09:23:12 -0400] NSMMReplicationPlugin - CleanAllRUV Task (rid 2212): Replicas have not been cleaned yet, retrying in 640 seconds
[28/Mar/2016:09:33:51 -0400] NSMMReplicationPlugin - CleanAllRUV Task (rid 2212): Replica is not cleaned yet (agmt="cn=1189_to_2616_on_cisco-c22m3-01.rhts.eng.bos.redhat.com" (cisco-c22m3-01:2616))
[28/Mar/2016:09:33:51 -0400] NSMMReplicationPlugin - CleanAllRUV Task (rid 2212): Replicas have not been cleaned yet, retrying in 1280 seconds

8) It kept printing "retrying". Then I noticed that the agreement it is trying to clean is not the correct one: it is trying to delete "cn=1189_to_2616" (which is M3) instead of "cn=1189_to_1626". I am not sure if I have done something wrong.

Check if the RID is removed from the RUV:

[root@cisco-c22m3-01 ~]# ldapsearch -H ldap://localhost:1189 -xLLL -D "cn=directory manager" -w Secret123 -b dc=passsync,dc=com '(&(nsuniqueid=ffffffff-ffffffff-ffffffff-ffffffff)(objectclass=nstombstone))' nsds50ruv |grep -i 2212

9) Search for the RUV of M2 (1289):

[root@cisco-c22m3-01 ~]# ldapsearch -H ldap://localhost:1189 -xLLL -D "cn=directory manager" -w Secret123 -b dc=passsync,dc=com '(&(nsuniqueid=ffffffff-ffffffff-ffffffff-ffffffff)(objectclass=nstombstone))' nsds50ruv |grep -i 1289
[root@cisco-c22m3-01 ~]# echo $?
1

10) Search for RID 2212 on M3:

[root@cisco-c22m3-01 ~]# ldapsearch -H ldap://localhost:2189 -xLLL -D "cn=directory manager" -w Secret123 -b dc=passsync,dc=com '(&(nsuniqueid=ffffffff-ffffffff-ffffffff-ffffffff)(objectclass=nstombstone))' nsds50ruv |grep 1289
nsds50ruv: {replica 2212 ldap://cisco-c22m3-01.rhts.eng.bos.redhat.com:1289} 5

Looks like something is wrong with the cleanallruv task or with my replication setup. Can someone help me?
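Steps 8-10 check the tombstone RUV by grepping the raw nsds50ruv output. A tiny helper can make the pass/fail explicit; this is only a sketch, the function name is made up here, and it assumes the "{replica <rid> <url>}" element format seen in the searches above:

```shell
# ruv_contains_rid RID: read nsds50ruv values on stdin and succeed
# only if an element for the given replica id is present.
# (Illustrative helper; assumes the "{replica <rid> <url>}" format
# shown in the ldapsearch output above.)
ruv_contains_rid() {
    grep -q "{replica $1 "
}

# Intended use: pipe the tombstone RUV search into the helper, e.g.
# ldapsearch -H ldap://localhost:1189 -xLLL -D "cn=directory manager" \
#     -w Secret123 -b dc=passsync,dc=com \
#     '(&(nsuniqueid=ffffffff-ffffffff-ffffffff-ffffffff)(objectclass=nstombstone))' \
#     nsds50ruv | ruv_contains_rid 2212 && echo "rid 2212 still in RUV"

# Self-contained demonstration on canned output:
sample='nsds50ruv: {replica 2212 ldap://cisco-c22m3-01.rhts.eng.bos.redhat.com:1289} 5'
printf '%s\n' "$sample" | ruv_contains_rid 2212 && echo "2212 present"
printf '%s\n' "$sample" | ruv_contains_rid 2211 || echo "2211 absent"
```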
I guess we need to log in to the host and take a look at the env.
BTW, Sankar, you already verified the same fix on 7.1. What is the difference? The number of masters -- 2 masters vs. 6 masters? https://bugzilla.redhat.com/show_bug.cgi?id=1260001#c8
(In reply to Noriko Hosoi from comment #8)
> I guess we need to login the host and take a look at the env.

Hostname/IP - cisco-c22m3-01.rhts.eng.bos.redhat.com/10.16.70.61
Root password - Default beaker root pw
(In reply to Noriko Hosoi from comment #9)
> BTW, Sankar, you already verified the same fix on 7.1. What is the
> difference? The number of masters -- 2 masters vs. 6 masters?
>
> https://bugzilla.redhat.com/show_bug.cgi?id=1260001#c8

Yes, the number of masters is different. For RHEL 7.1 I had 4 masters, and for RHEL 6.8 I have 6 masters.
Sankar, I'm confused by your MMR topology. You have 6 masters and 2 read-only replicas.

On M1, there are 6 agreements: {M1->C1, M1->C2, M1->M3, M1->M4, M1->M5, M1->M6}

Since M1 does not have a direct agreement for M2, it's natural to send the request to M3. (Mark, please correct me if I'm wrong.)

And it gets stuck here on M3. I'm not sure, but since M2 is shut down, the bind against M2 would fail and the CleanAllRUV task request won't reach M2... Is this the right test scenario?

[29/Mar/2016:08:35:09 -0400] slapi_ldap_bind - Error: could not send bind request for id [cn=SyncManager,cn=config] mech [SIMPLE]: error -1 (Can't contact LDAP server) -5987 (Invalid function argument.) 107 (Transport endpoint is not connected)
[29/Mar/2016:08:35:09 -0400] NSMMReplicationPlugin - agmt="cn=2189_to_1626_on_cisco-c22m3-01.rhts.eng.bos.redhat.com" (cisco-c22m3-01:1626): Replication bind with SIMPLE auth failed: LDAP error -1 (Can't contact LDAP server) ((null))
[29/Mar/2016:08:35:09 -0400] NSMMReplicationPlugin - CleanAllRUV Task (rid 2212): Replica not online (agmt="cn=2189_to_1626_on_cisco-c22m3-01.rhts.eng.bos.redhat.com" (cisco-c22m3-01:1626))
[29/Mar/2016:08:35:09 -0400] NSMMReplicationPlugin - CleanAllRUV Task (rid 2212): Not all replicas online, retrying in 14400 seconds...

Mark, could you please advise?
(In reply to Noriko Hosoi from comment #12)
> Sankar,
>
> I'm confused at your MMR topology. You have 6 masters and 2 read only
> replicas.
>
> On M1, there are 6 agreements:
> {M1->C1, M1->C2, M1->M3, M1->M4, M1->M5, M1->M6}
>
> Since M1 does not have a direct agreement for M2, it's natural to send the
> request to M3. (Mark, please correct me if I'm wrong.)
>
> And it stucks here on M3. Not sure, but since M2 is shut down, the bind
> against M2 would fail and CleanAllRUV task request won't reach M2... Is
> this the right test scenario?
>
> [29/Mar/2016:08:35:09 -0400] slapi_ldap_bind - Error: could not send bind
> request for id [cn=SyncManager,cn=config] mech [SIMPLE]: error -1 (Can't
> contact LDAP server) -5987 (Invalid function argument.) 107 (Transport
> endpoint is not connected)
> [29/Mar/2016:08:35:09 -0400] NSMMReplicationPlugin -
> agmt="cn=2189_to_1626_on_cisco-c22m3-01.rhts.eng.bos.redhat.com"
> (cisco-c22m3-01:1626): Replication bind with SIMPLE auth failed: LDAP error
> -1 (Can't contact LDAP server) ((null))
> [29/Mar/2016:08:35:09 -0400] NSMMReplicationPlugin - CleanAllRUV Task (rid
> 2212): Replica not online
> (agmt="cn=2189_to_1626_on_cisco-c22m3-01.rhts.eng.bos.redhat.com"
> (cisco-c22m3-01:1626))
> [29/Mar/2016:08:35:09 -0400] NSMMReplicationPlugin - CleanAllRUV Task (rid
> 2212): Not all replicas online, retrying in 14400 seconds...
>
> Mark, could you please advice?

The logging tells us everything we need to know:

[29/Mar/2016:08:35:09 -0400] NSMMReplicationPlugin - CleanAllRUV Task (rid 2212): Replica not online (agmt="cn=2189_to_1626_on_cisco-c22m3-01.rhts.eng.bos.redhat.com" (cisco-c22m3-01:1626))

This particular replica cannot contact cisco-c22m3-01.rhts.eng.bos.redhat.com:1626. The cleanAllRUV task will continue to run until it can reach this remote replica. So either the agreement (cn=2189_to_1626_on_cisco-c22m3-01.rhts.eng.bos.redhat.com) is incorrect and needs to be removed, or that remote replica is simply down (if so, start it).
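The "Replica not online" condition above boils down to a failed connection to the agreement's host:port. While debugging a stuck cleanAllRUV task, a rough reachability probe can rule out the trivial case. This is only a sketch: the function name is invented here, it does a bare TCP connect using bash's /dev/tcp rather than an LDAP bind, and a real check would also verify the replication credentials:

```shell
# replica_online HOST PORT: succeed if a TCP connection to HOST:PORT
# can be opened within 3 seconds. Bash-only (/dev/tcp redirection);
# a connect succeeding does not prove the server accepts LDAP binds.
replica_online() {
    timeout 3 bash -c "exec 3<>/dev/tcp/$1/$2" 2>/dev/null
}

# Example, using the host/port from the stuck agreement above:
replica_online cisco-c22m3-01.rhts.eng.bos.redhat.com 1626 \
    && echo "1626 reachable" || echo "1626 down or unreachable"
```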
I followed the reproducible steps given in https://bugzilla.redhat.com/show_bug.cgi?id=1240845#c6, which says:

1. Stop replica B
2. On A, remove the replication agreement that points to B
3. On A, run cleanallruv for B's Replica Id

The server keeps printing messages to remove the RUV for replica B.

Replica A - 1189/1616
Replica B - 1289/1626

I have a similar but fresh six-master replication setup on 10.16.96.80. Feel free to use it to experiment, or give me alternate steps to verify this bugzilla.
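Step 3 above feeds the cleanAllRUV task entry to ldapmodify from an LDIF file (the /export/cleanall.ldif shown earlier in this report). A throwaway generator like the following, with an illustrative function name, avoids hand-editing that file for each replica id; the attribute names match the task entry used in this report:

```shell
# gen_cleanallruv_ldif NAME SUFFIX RID: print a cleanAllRUV task
# entry suitable for "ldapmodify -a". The entry layout mirrors the
# /export/cleanall.ldif used elsewhere in this bug.
gen_cleanallruv_ldif() {
    cat <<EOF
dn: cn=$1,cn=cleanallruv,cn=tasks,cn=config
cn: $1
objectclass: extensibleObject
replica-base-dn: $2
replica-id: $3
EOF
}

# Example: regenerate the task entry for replica id 2212.
gen_cleanallruv_ldif M2Clean "dc=passsync,dc=com" 2212
```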
[root@iceman ~]# ldapsearch -H ldap://localhost:1189 -xLLL -D "cn=directory manager" -w Secret123 -b "dc=passsync,dc=com" '(&(nsuniqueid=ffffffff-ffffffff-ffffffff-ffffffff)(objectclass=nstombstone))' nsds50ruv
dn: nsuniqueid=ffffffff-ffffffff-ffffffff-ffffffff,dc=passsync,dc=com
nsds50ruv: {replicageneration} 56fb75e7000008a30000
nsds50ruv: {replica 2211 ldap://iceman.idmqe.lab.eng.bos.redhat.com:1189} 56fb786c000008a30000 56fb7878000208a30000
nsds50ruv: {replica 2215 ldap://iceman.idmqe.lab.eng.bos.redhat.com:3189}
nsds50ruv: {replica 2213 ldap://iceman.idmqe.lab.eng.bos.redhat.com:2189}
nsds50ruv: {replica 2216 ldap://iceman.idmqe.lab.eng.bos.redhat.com:3289}
nsds50ruv: {replica 2214 ldap://iceman.idmqe.lab.eng.bos.redhat.com:2289}
nsds50ruv: {replica 2212 ldap://iceman.idmqe.lab.eng.bos.redhat.com:1289}
You still have agreements that point to replica B; these must "all" be removed.

http://www.port389.org/docs/389ds/howto/howto-cleanruv.html#cleanallruv

You added more replicas to the test case, and more agreements. If we are removing replica B (and cleaning any reference to it), then we must remove all the agreements to replica B. So go through the other 5 replicas and make sure you have removed the agreements that point back to replica B.

Thanks
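To sweep all the masters for agreements that still point at replica B, one can filter the agreement DNs for B's port. The helper below is a sketch: its name is invented, and it relies on the `<localport>_to_<remoteport>_on_<host>` CN convention used for the agreements in this report; it just filters text piped from ldapsearch:

```shell
# agmts_to_port PORT: read "dn:" lines on stdin and print agreement
# DNs whose CN follows the <local>_to_<PORT>_on_<host> convention
# used in this setup. Illustrative helper, not a dirsrv tool.
agmts_to_port() {
    grep -i '^dn: cn=' | grep -- "_to_$1_"
}

# Intended sweep across the masters (ports from this report), e.g.:
# for PORT in 1189 2189 2289 3189 3289; do
#     ldapsearch -x -p $PORT -h localhost -D "cn=Directory Manager" \
#         -w Secret123 -b "cn=mapping tree,cn=config" \
#         '(objectclass=nsds5replicationagreement)' dn | agmts_to_port 1626
# done

# Demonstration on canned input: only the 1626 agreement is printed.
printf 'dn: cn=1189_to_1626_on_host,cn=config\ndn: cn=1189_to_2616_on_host,cn=config\n' \
    | agmts_to_port 1626
```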
(In reply to mreynolds from comment #16)
> You still have agreements that point to replica B, these must "all" be
> removed.

Thanks, Mark, for the hint. I ran the cleanallruv task again to remove replica B. Now I am waiting for the cleanallruv task to complete on all masters.

> http://www.port389.org/docs/389ds/howto/howto-cleanruv.html#cleanallruv
>
> You added more replicas to the test case, and more agreements. If we are
> removing replica B(and cleaning any reference to it), then we must remove
> all the agreements to replica B. So go through the other 5 replicas and
> make sure you have removed the agreements that point back to Replica B.
>
> Thanks
(In reply to Sankar Ramalingam from comment #17)
> (In reply to mreynolds from comment #16)
> > You still have agreements that point to replica B, these must "all" be
> > removed.
> Thanks Mark! for the hint. I ran cleanallruv task again to remove the
> replica B.
> Now, I am waiting for the cleanallruv task to be completed on all masters.

Did you remove all the agreements to replica B from all the replicas?

I also suggest you restart all the servers, because the original task is still running and stuck in a backoff loop (running the task twice has no effect).

What is the default beaker root password? (I do not know what this is.)

> > http://www.port389.org/docs/389ds/howto/howto-cleanruv.html#cleanallruv
> >
> > You added more replicas to the test case, and more agreements. If we are
> > removing replica B(and cleaning any reference to it), then we must remove
> > all the agreements to replica B. So go through the other 5 replicas and
> > make sure you have removed the agreements that point back to Replica B.
> >
> > Thanks
I verified the bug on a fresh six-master replication setup.

1) Stopped M2 and deleted all replica agreements pointing to M2.

2) Ran the cleanallruv task on M1:

[root@iceman MMR_WINSYNC]# PORT="1189"; ldapmodify -a -x -p $PORT -h localhost -D "cn=Directory Manager" -w Secret123 -avf /export/clean.ldif
ldap_initialize( ldap://localhost:1189 )
add cn: M2Clean
add objectclass: extensibleObject
add replica-base-dn: dc=passsync,dc=com
add replica-id: 2212
adding new entry "cn=M2Clean,cn=cleanallruv,cn=tasks,cn=config"
modify complete

[root@iceman MMR_WINSYNC]# tail -f /var/log/dirsrv/slapd-M1/errors
[30/Mar/2016:22:52:51 -0400] NSMMReplicationPlugin - CleanAllRUV Task (rid 2212): Found maxcsn (00000000000000000000)
[30/Mar/2016:22:52:51 -0400] NSMMReplicationPlugin - CleanAllRUV Task (rid 2212): Cleaning rid (2212)...
[30/Mar/2016:22:52:51 -0400] NSMMReplicationPlugin - CleanAllRUV Task (rid 2212): Waiting to process all the updates from the deleted replica...
[30/Mar/2016:22:52:51 -0400] NSMMReplicationPlugin - CleanAllRUV Task (rid 2212): Waiting for all the replicas to be online...
[30/Mar/2016:22:52:51 -0400] NSMMReplicationPlugin - CleanAllRUV Task (rid 2212): Waiting for all the replicas to receive all the deleted replica updates...
[30/Mar/2016:22:52:51 -0400] NSMMReplicationPlugin - CleanAllRUV Task (rid 2212): Sending cleanAllRUV task to all the replicas...
[30/Mar/2016:22:52:54 -0400] NSMMReplicationPlugin - CleanAllRUV Task (rid 2212): Cleaning local ruv's...
[30/Mar/2016:22:52:55 -0400] NSMMReplicationPlugin - CleanAllRUV Task (rid 2212): Waiting for all the replicas to be cleaned...
[30/Mar/2016:22:52:55 -0400] NSMMReplicationPlugin - CleanAllRUV Task (rid 2212): Waiting for all the replicas to finish cleaning...
[30/Mar/2016:22:52:55 -0400] NSMMReplicationPlugin - CleanAllRUV Task (rid 2212): Not all replicas finished cleaning, retrying in 10 seconds
[30/Mar/2016:22:53:06 -0400] NSMMReplicationPlugin - CleanAllRUV Task (rid 2212): Successfully cleaned rid(2212).

3) Restarted all masters except M2.

4) Checked that M2's RUV entry was deleted on all masters:

[root@iceman MMR_WINSYNC]# for PORT in `echo "1189 2189 2289 3189 3289"`; do ldapsearch -H ldap://localhost:$PORT -xLLL -D "cn=directory manager" -w Secret123 -b dc=passsync,dc=com '(&(nsuniqueid=ffffffff-ffffffff-ffffffff-ffffffff)(objectclass=nstombstone))' nsds50ruv |grep -i 1289 ; if [ $? -eq 0 ]; then echo "Test FAIL, the replicas not cleaned" ; else echo "Test PASS, replica B's RUV cleaned up on $PORT" ; fi ; done
Test PASS, replica B's RUV cleaned up on 1189
Test PASS, replica B's RUV cleaned up on 2189
Test PASS, replica B's RUV cleaned up on 2289
Test PASS, replica B's RUV cleaned up on 3189
Test PASS, replica B's RUV cleaned up on 3289

Hence, marking the bug as Verified.
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://rhn.redhat.com/errata/RHBA-2016-0737.html