Bug 1270002 - cleanallruv should completely clean changelog
Status: CLOSED ERRATA
Product: Red Hat Enterprise Linux 6
Classification: Red Hat
Component: 389-ds-base
Version: 6.7
Hardware: Unspecified
OS: Unspecified
Priority: urgent
Severity: urgent
Target Milestone: rc
Assigned To: Noriko Hosoi
QA Contact: Viktor Ashirov
Docs Contact: Petr Bokoc
Keywords: ZStream
Depends On: 1240845
Blocks: 1172231 1260001 1272327
Reported: 2015-10-08 14:19 EDT by Marc Sauton
Modified: 2016-05-10 15:21 EDT (History)
CC: 13 users

Fixed In Version: 389-ds-base-1.2.11.15-67.el6
Doc Type: Bug Fix
Doc Text:
`cleanAllRUV` now clears the changelog completely. Previously, after the `cleanAllRUV` task finished, the changelog still contained entries from the cleaned `rid`. As a consequence, the RUV could contain undesirable data, and the RUV element could be missing the replica URL. Now, `cleanAllRUV` cleans the changelog completely, as expected.
Clone Of: 1240845
Clones: 1272327
Last Closed: 2016-05-10 15:21:45 EDT
Attachments: None
Comment 3 Noriko Hosoi 2015-10-08 20:10:57 EDT
Verification steps: https://bugzilla.redhat.com/show_bug.cgi?id=1240845#c6
(verified on 7.2)
Comment 7 Sankar Ramalingam 2016-03-28 10:05:41 EDT
1) I have a six-master replication setup. My replica agreement for M2:
[root@cisco-c22m3-01 MMR_WINSYNC]# PORT=1189 ; ldapsearch -x -p $PORT -h localhost -D "cn=Directory Manager" -w Secret123 -b "cn=replica,cn=\"dc=passsync,dc=com\",cn=mapping tree,cn=config" |grep -i dn: |grep 1626
dn: cn=1189_to_1626_on_cisco-c22m3-01.rhts.eng.bos.redhat.com,cn=replica,cn=dc

2) Replication is in sync between M1 and M2.
[root@cisco-c22m3-01 MMR_WINSYNC]# for PORT in `echo "1189 1289"`; do Users=`ldapsearch -x -p $PORT -h localhost -D "cn=Directory Manager" -w Secret123 -b "dc=passsync,dc=com" |grep -i "dn: uid=*" |wc -l`; echo "User entries on PORT-$PORT is $Users"; done
User entries on PORT-1189 is 216
User entries on PORT-1289 is 216

3) Stop M2.
[root@cisco-c22m3-01 MMR_WINSYNC]# service dirsrv status M2
dirsrv M2 is stopped

4) Delete M2's replication agreement:
PORT=1189 ; ldapdelete -x -p $PORT -h localhost -D "cn=Directory Manager" -w Secret123 "cn=1189_to_1626_on_cisco-c22m3-01.rhts.eng.bos.redhat.com,cn=replica,cn=dc\3Dpasssync\2Cdc\3Dcom,cn=mapping tree,cn=config"

5) LDIF file for cleanallruv:
[root@cisco-c22m3-01 MMR_WINSYNC]# cat /export/cleanall.ldif 
dn: cn=M2Clean,cn=cleanallruv,cn=tasks,cn=config
cn: M2Clean
objectclass: extensibleObject
replica-base-dn: dc=passsync,dc=com
replica-id: 2212

6) Run cleanallruv for M2:
[root@cisco-c22m3-01 MMR_WINSYNC]# PORT="1189"; ldapmodify -a -x -p $PORT -h localhost -D "cn=Directory Manager" -w Secret123 -avf /export/cleanall.ldif 
ldap_initialize( ldap://localhost:1189 )
add cn:
	M2Clean
add objectclass:
	extensibleObject
add replica-base-dn:
	dc=passsync,dc=com
add replica-id:
	2212
adding new entry "cn=M2Clean,cn=cleanallruv,cn=tasks,cn=config"
modify complete
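
The task's progress can also be polled directly from the task entry created above; this is a sketch (not part of the original run), using the generic task attributes:

# Sketch: poll the cleanAllRUV task entry added above;
# nstaskstatus/nstasklog are the generic cn=tasks attributes.
ldapsearch -x -p 1189 -h localhost -D "cn=Directory Manager" -w Secret123 \
  -b "cn=M2Clean,cn=cleanallruv,cn=tasks,cn=config" nstaskstatus nstasklog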

7) Wait for cleanallruv to complete and then kill replica M1
[28/Mar/2016:09:23:12 -0400] NSMMReplicationPlugin - CleanAllRUV Task (rid 2212): Replicas have not been cleaned yet, retrying in 640 seconds 
[28/Mar/2016:09:33:51 -0400] NSMMReplicationPlugin - CleanAllRUV Task (rid 2212): Replica is not cleaned yet (agmt="cn=1189_to_2616_on_cisco-c22m3-01.rhts.eng.bos.redhat.com" (cisco-c22m3-01:2616)) 
[28/Mar/2016:09:33:51 -0400] NSMMReplicationPlugin - CleanAllRUV Task (rid 2212): Replicas have not been cleaned yet, retrying in 1280 seconds 
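
(A sketch for following these task messages live, assuming the instance is named M1 as in the log path used in comment 20:)

# Follow CleanAllRUV messages as the task runs on M1
tail -f /var/log/dirsrv/slapd-M1/errors | grep "CleanAllRUV"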

8) It kept printing "...retrying" messages. Then I noticed that the agreement it is trying to clean is not correct.

It's trying to delete "cn=1189_to_2616" (which is M3) instead of "cn=1189_to_1626". I am not sure if I have done something wrong.

Check if the RID is removed from the RUV:
[root@cisco-c22m3-01 ~]# ldapsearch -H ldap://localhost:1189 -xLLL -D "cn=directory manager" -w Secret123 -b dc=passsync,dc=com  '(&(nsuniqueid=ffffffff-ffffffff-ffffffff-ffffffff)(objectclass=nstombstone))' nsds50ruv |grep -i 2212

9) Search for the RUV of M2 (port 1289):
[root@cisco-c22m3-01 ~]# ldapsearch -H ldap://localhost:1189 -xLLL -D "cn=directory manager" -w Secret123 -b dc=passsync,dc=com  '(&(nsuniqueid=ffffffff-ffffffff-ffffffff-ffffffff)(objectclass=nstombstone))' nsds50ruv |grep -i 1289
[root@cisco-c22m3-01 ~]# echo $?
1

10) Search for RID 2212 on M3:
[root@cisco-c22m3-01 ~]# ldapsearch -H ldap://localhost:2189 -xLLL -D "cn=directory manager" -w Secret123 -b dc=passsync,dc=com  '(&(nsuniqueid=ffffffff-ffffffff-ffffffff-ffffffff)(objectclass=nstombstone))' nsds50ruv |grep 1289
nsds50ruv: {replica 2212 ldap://cisco-c22m3-01.rhts.eng.bos.redhat.com:1289} 5


Looks like something is wrong with the cleanallruv task or with my replication setup. Can someone help me?
Comment 8 Noriko Hosoi 2016-03-28 15:39:31 EDT
I guess we need to log in to the host and take a look at the environment.
Comment 9 Noriko Hosoi 2016-03-28 15:45:39 EDT
BTW, Sankar, you already verified the same fix on 7.1.  What is the difference?  The number of masters -- 2 masters vs. 6 masters?

https://bugzilla.redhat.com/show_bug.cgi?id=1260001#c8
Comment 10 Sankar Ramalingam 2016-03-29 05:11:11 EDT
(In reply to Noriko Hosoi from comment #8)
> I guess we need to login the host and take a look at the env.

Hostname/IP - cisco-c22m3-01.rhts.eng.bos.redhat.com/10.16.70.61
Root password - default Beaker root password
Comment 11 Sankar Ramalingam 2016-03-29 05:15:10 EDT
(In reply to Noriko Hosoi from comment #9)
> BTW, Sankar, you already verified the same fix on 7.1.  What is the
> difference?  The number of masters -- 2 masters vs. 6 masters?
> 
> https://bugzilla.redhat.com/show_bug.cgi?id=1260001#c8

Yes, the number of masters is different. For RHEL 7.1 I had 4 masters, and for RHEL 6.8 I have 6 masters.
Comment 12 Noriko Hosoi 2016-03-29 14:36:31 EDT
Sankar,

I'm confused by your MMR topology.  You have 6 masters and 2 read-only replicas.

On M1, there are 6 agreements:
{M1->C1, M1->C2, M1->M3, M1->M4, M1->M5, M1->M6}

Since M1 does not have a direct agreement for M2, it's natural to send the request to M3.  (Mark, please correct me if I'm wrong.)

And it gets stuck here on M3.  I'm not sure, but since M2 is shut down, the bind against M2 would fail and the CleanAllRUV task request won't reach M2...  Is this the right test scenario?

[29/Mar/2016:08:35:09 -0400] slapi_ldap_bind - Error: could not send bind request for id [cn=SyncManager,cn=config] mech [SIMPLE]: error -1 (Can't contact LDAP server) -5987 (Invalid function argument.) 107 (Transport endpoint is not connected)
[29/Mar/2016:08:35:09 -0400] NSMMReplicationPlugin - agmt="cn=2189_to_1626_on_cisco-c22m3-01.rhts.eng.bos.redhat.com" (cisco-c22m3-01:1626): Replication bind with SIMPLE auth failed: LDAP error -1 (Can't contact LDAP server) ((null))
[29/Mar/2016:08:35:09 -0400] NSMMReplicationPlugin - CleanAllRUV Task (rid 2212): Replica not online (agmt="cn=2189_to_1626_on_cisco-c22m3-01.rhts.eng.bos.redhat.com" (cisco-c22m3-01:1626)) 
[29/Mar/2016:08:35:09 -0400] NSMMReplicationPlugin - CleanAllRUV Task (rid 2212): Not all replicas online, retrying in 14400 seconds... 

Mark, could you please advise?
Comment 13 mreynolds 2016-03-30 09:03:40 EDT
(In reply to Noriko Hosoi from comment #12)
> Sankar,
> 
> I'm confused by your MMR topology.  You have 6 masters and 2 read-only
> replicas.
> 
> On M1, there are 6 agreements:
> {M1->C1, M1->C2, M1->M3, M1->M4, M1->M5, M1->M6}
> 
> Since M1 does not have a direct agreement for M2, it's natural to send the
> request to M3.  (Mark, please correct me if I'm wrong.)
> 
> And it gets stuck here on M3.  I'm not sure, but since M2 is shut down, the
> bind against M2 would fail and the CleanAllRUV task request won't reach
> M2...  Is this the right test scenario?
> 
> [29/Mar/2016:08:35:09 -0400] slapi_ldap_bind - Error: could not send bind
> request for id [cn=SyncManager,cn=config] mech [SIMPLE]: error -1 (Can't
> contact LDAP server) -5987 (Invalid function argument.) 107 (Transport
> endpoint is not connected)
> [29/Mar/2016:08:35:09 -0400] NSMMReplicationPlugin -
> agmt="cn=2189_to_1626_on_cisco-c22m3-01.rhts.eng.bos.redhat.com"
> (cisco-c22m3-01:1626): Replication bind with SIMPLE auth failed: LDAP error
> -1 (Can't contact LDAP server) ((null))
> [29/Mar/2016:08:35:09 -0400] NSMMReplicationPlugin - CleanAllRUV Task (rid
> 2212): Replica not online
> (agmt="cn=2189_to_1626_on_cisco-c22m3-01.rhts.eng.bos.redhat.com"
> (cisco-c22m3-01:1626)) 
> [29/Mar/2016:08:35:09 -0400] NSMMReplicationPlugin - CleanAllRUV Task (rid
> 2212): Not all replicas online, retrying in 14400 seconds... 
> 
> Mark, could you please advise?

The logging tells us everything we need to know:

[29/Mar/2016:08:35:09 -0400] NSMMReplicationPlugin - CleanAllRUV Task (rid 2212): Replica not online (agmt="cn=2189_to_1626_on_cisco-c22m3-01.rhts.eng.bos.redhat.com" (cisco-c22m3-01:1626)) 

This particular replica cannot contact cisco-c22m3-01.rhts.eng.bos.redhat.com:1626.  The cleanAllRUV task will continue to run until it can reach this remote replica.  So either the agreement (cn=2189_to_1626_on_cisco-c22m3-01.rhts.eng.bos.redhat.com) is incorrect and needs to be removed, or that remote replica is simply down (if so, start it).
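
(A quick way to tell which of the two cases applies is an anonymous root-DSE read against the unreachable replica; a sketch, not part of the original triage:)

# Sketch: probe the remote replica; "Can't contact LDAP server" here
# means it is down or unreachable, so start it or remove the agreement.
ldapsearch -x -h cisco-c22m3-01.rhts.eng.bos.redhat.com -p 1626 \
  -s base -b "" "(objectClass=*)" vendorVersion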
Comment 14 Sankar Ramalingam 2016-03-30 09:18:31 EDT
I followed the reproduction steps given in https://bugzilla.redhat.com/show_bug.cgi?id=1240845#c6:

1. Stop replica B.
2. On A, remove the replication agreement that points to B.
3. On A, run cleanallruv for B's replica ID.

The server keeps printing messages about removing the RUV for replica B.
Replica A - 1189/1616
Replica B - 1289/1626

I have a similar but fresh six-master replication setup on 10.16.96.80. Feel free to use it to experiment, or give me alternate steps to verify this Bugzilla.
Comment 15 Sankar Ramalingam 2016-03-30 09:20:04 EDT
[root@iceman ~]# ldapsearch -H ldap://localhost:1189 -xLLL -D "cn=directory manager" -w Secret123 -b "dc=passsync,dc=com"  '(&(nsuniqueid=ffffffff-ffffffff-ffffffff-ffffffff)(objectclass=nstombstone))' nsds50ruv
dn: nsuniqueid=ffffffff-ffffffff-ffffffff-ffffffff,dc=passsync,dc=com
nsds50ruv: {replicageneration} 56fb75e7000008a30000
nsds50ruv: {replica 2211 ldap://iceman.idmqe.lab.eng.bos.redhat.com:1189} 56fb
 786c000008a30000 56fb7878000208a30000
nsds50ruv: {replica 2215 ldap://iceman.idmqe.lab.eng.bos.redhat.com:3189}
nsds50ruv: {replica 2213 ldap://iceman.idmqe.lab.eng.bos.redhat.com:2189}
nsds50ruv: {replica 2216 ldap://iceman.idmqe.lab.eng.bos.redhat.com:3289}
nsds50ruv: {replica 2214 ldap://iceman.idmqe.lab.eng.bos.redhat.com:2289}
nsds50ruv: {replica 2212 ldap://iceman.idmqe.lab.eng.bos.redhat.com:1289}
Comment 16 mreynolds 2016-03-30 09:42:14 EDT
You still have agreements that point to replica B; these must "all" be removed.

http://www.port389.org/docs/389ds/howto/howto-cleanruv.html#cleanallruv 

You added more replicas to the test case, and more agreements.  If we are removing replica B (and cleaning any reference to it), then we must remove all the agreements to replica B.  So go through the other 5 replicas and make sure you have removed the agreements that point back to replica B (a sketch for this check follows below).

Thanks
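
(A sketch of that scan; the port list and replica B's consumer port 1626 are taken from the setup described above and are assumptions:)

# Sketch: on each surviving master, list agreements that still point
# at replica B's port (1626 here); every match must be deleted first.
for PORT in 1189 2189 2289 3189 3289; do
  echo "== master on port $PORT =="
  ldapsearch -x -p $PORT -h localhost -D "cn=Directory Manager" -w Secret123 \
    -b "cn=config" "(&(objectclass=nsds5replicationagreement)(nsDS5ReplicaPort=1626))" dn
done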
Comment 17 Sankar Ramalingam 2016-03-30 13:43:30 EDT
(In reply to mreynolds from comment #16)
> You still have agreements that point to replica B, these must "all" be
> removed. 
Thanks, Mark, for the hint. I ran the cleanallruv task again to remove replica B.
Now I am waiting for the cleanallruv task to complete on all masters.
> 
> http://www.port389.org/docs/389ds/howto/howto-cleanruv.html#cleanallruv 
> 
> You added more replicas to the test case, and more agreements.  If we are
> removing replica B(and cleaning any reference to it), then we must remove
> all the agreements to replica B.  So go through the other 5 replicas and
> make sure you have removed the agreements that point back to Replica B.
> 
> Thanks
Comment 18 mreynolds 2016-03-30 14:13:07 EDT
(In reply to Sankar Ramalingam from comment #17)
> (In reply to mreynolds from comment #16)
> > You still have agreements that point to replica B, these must "all" be
> > removed. 
> Thanks Mark! for the hint. I ran cleanallruv task again to remove the
> replica B.
> Now, I am waiting for the cleanallruv task to be completed on all masters.

Did you remove all the agreements to replica B from all the replicas?  

I also suggest you restart all the servers, because the original task is still running and stuck in a backoff loop (running the task twice has no effect); a restart sketch follows at the end of this comment.

What is the default Beaker root password (I do not know what this is)?

> > 
> > http://www.port389.org/docs/389ds/howto/howto-cleanruv.html#cleanallruv 
> > 
> > You added more replicas to the test case, and more agreements.  If we are
> > removing replica B(and cleaning any reference to it), then we must remove
> > all the agreements to replica B.  So go through the other 5 replicas and
> > make sure you have removed the agreements that point back to Replica B.
> > 
> > Thanks
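
(A minimal sketch of the suggested restart, assuming the instances are named M1 through M6 as in the status command earlier; M2 stays down:)

# Sketch: restart every surviving master to clear the stuck task's
# backoff loop; M2 remains stopped.
for INST in M1 M3 M4 M5 M6; do
  service dirsrv restart $INST
done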
Comment 20 Sankar Ramalingam 2016-03-30 23:06:14 EDT
I verified the bug on a fresh six-master replication setup.

1) Stopped M2 and deleted all replica agreements pointing to M2.

2) Ran the cleanallruv task on M1:
[root@iceman MMR_WINSYNC]# PORT="1189"; ldapmodify -a -x -p $PORT -h localhost -D "cn=Directory Manager" -w Secret123 -avf /export/clean.ldif 
ldap_initialize( ldap://localhost:1189 )
add cn:
	M2Clean
add objectclass:
	extensibleObject
add replica-base-dn:
	dc=passsync,dc=com
add replica-id:
	2212
adding new entry "cn=M2Clean,cn=cleanallruv,cn=tasks,cn=config"
modify complete

[root@iceman MMR_WINSYNC]# tail -f /var/log/dirsrv/slapd-M1/errors
[30/Mar/2016:22:52:51 -0400] NSMMReplicationPlugin - CleanAllRUV Task (rid 2212): Found maxcsn (00000000000000000000) 
[30/Mar/2016:22:52:51 -0400] NSMMReplicationPlugin - CleanAllRUV Task (rid 2212): Cleaning rid (2212)... 
[30/Mar/2016:22:52:51 -0400] NSMMReplicationPlugin - CleanAllRUV Task (rid 2212): Waiting to process all the updates from the deleted replica... 
[30/Mar/2016:22:52:51 -0400] NSMMReplicationPlugin - CleanAllRUV Task (rid 2212): Waiting for all the replicas to be online... 
[30/Mar/2016:22:52:51 -0400] NSMMReplicationPlugin - CleanAllRUV Task (rid 2212): Waiting for all the replicas to receive all the deleted replica updates... 
[30/Mar/2016:22:52:51 -0400] NSMMReplicationPlugin - CleanAllRUV Task (rid 2212): Sending cleanAllRUV task to all the replicas... 
[30/Mar/2016:22:52:54 -0400] NSMMReplicationPlugin - CleanAllRUV Task (rid 2212): Cleaning local ruv's... 
[30/Mar/2016:22:52:55 -0400] NSMMReplicationPlugin - CleanAllRUV Task (rid 2212): Waiting for all the replicas to be cleaned... 
[30/Mar/2016:22:52:55 -0400] NSMMReplicationPlugin - CleanAllRUV Task (rid 2212): Waiting for all the replicas to finish cleaning... 
[30/Mar/2016:22:52:55 -0400] NSMMReplicationPlugin - CleanAllRUV Task (rid 2212): Not all replicas finished cleaning, retrying in 10 seconds 
[30/Mar/2016:22:53:06 -0400] NSMMReplicationPlugin - CleanAllRUV Task (rid 2212): Successfully cleaned rid(2212). 

3) Restarted all masters except M2.

4) Checked whether M2's RUV entry was deleted on all masters:
[root@iceman MMR_WINSYNC]# for PORT in `echo "1189 2189 2289 3189 3289"`; do ldapsearch -H ldap://localhost:$PORT -xLLL -D "cn=directory manager" -w Secret123 -b dc=passsync,dc=com  '(&(nsuniqueid=ffffffff-ffffffff-ffffffff-ffffffff)(objectclass=nstombstone))' nsds50ruv |grep -i 1289 ; if [ $? -eq 0 ]; then echo "Test FAIL, the replicas not cleaned" ; else echo "test PASS, replica B's RUV cleaned up on $PORT" ; fi ; done
test PASS, replica B's RUV cleaned up on 1189
test PASS, replica B's RUV cleaned up on 2189
test PASS, replica B's RUV cleaned up on 2289
test PASS, replica B's RUV cleaned up on 3189
test PASS, replica B's RUV cleaned up on 3289

Hence, marking the bug as Verified.
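
(Since the bug itself is about the changelog, the changelog can additionally be checked with dbscan. This is a sketch only: the changelog path is the assumed RHEL 6 default, and rid 2212 is 0x8a4 in the hex CSN encoding, matching the 08a3/rid-2211 CSNs visible in comment 15:)

# Sketch: dump each changelog db and look for CSNs from the cleaned
# rid; 2212 = 0x8a4 occupies hex digits 13-16 of a CSN. No output
# means the changelog holds nothing from the cleaned replica.
for DB in /var/lib/dirsrv/slapd-M1/changelogdb/*.db4; do
  dbscan -f "$DB" | grep -iE "csn: .{12}08a4"
done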
Comment 22 errata-xmlrpc 2016-05-10 15:21:45 EDT
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://rhn.redhat.com/errata/RHBA-2016-0737.html
