Bug 1295971 - ipa-server-4.2.0-15.el7_2.3 data sync issue

Status: CLOSED WORKSFORME
Product: Red Hat Enterprise Linux 7
Classification: Red Hat
Component: ipa
Version: 7.2
Hardware: x86_64 Linux
Priority: unspecified
Severity: urgent
Assigned To: IPA Maintainers
QA Contact: Namita Soman
Reported: 2016-01-05 18:53 EST by lmgnid
Modified: 2016-01-20 03:55 EST
Doc Type: Bug Fix
Type: Bug
Last Closed: 2016-01-20 03:55:33 EST

Attachments: None
Description lmgnid 2016-01-05 18:53:22 EST
Description of problem:

We followed the guide below to upgrade our old IPA 3.x deployment to 4.2 (all servers are now 4.2):
<https://access.redhat.com/documentation/en-US/Red_Hat_Enterprise_Linux/7/html-single/Linux_Domain_Identity_Authentication_and_Policy_Guide/index.html#migrating-ipa-proc>
but found that user data sometimes fails to sync.

Version-Release number of selected component (if applicable):

[root@usdev1]# rpm -qa | grep openldap
openldap-clients-2.4.40-8.el7.x86_64
openldap-2.4.40-8.el7.x86_64
[root@usdev1]# rpm -qa | grep ipa-server
ipa-server-4.2.0-15.el7_2.3.x86_64
ipa-server-dns-4.2.0-15.el7_2.3.x86_64

How reproducible:

Reproducible roughly 50% of the time, but the results are not stable; the same server may behave differently at different times.

Steps to Reproduce:

1. Change some user data (for example, the user's "job title") on one IPA server
2. Wait 3 minutes or more (we even tried 3 days)
3. Check whether the change has synced to the other servers

Actual results:
Sometimes the user data on one or more servers will NOT be synced.

Expected results:
The user data on all other servers should be synced within a reasonable time.

Additional info:

We have 5 servers, and the replica topology is as below:
         _____USDEV1____
        /               \  
      USQA1_____________USQA2
        |                | 
      EUPRE1____________USPRE1

Here we changed user data on each server in turn. Below are the detailed results, along with the logs from /var/log/dirsrv/slapd-INTERNAL-COM/, for each test case.

1. Usqa1 sync to others OK

Usqa1:
[05/Jan/2016:22:27:28 +0000] slapi_ldap_bind - Error: could not bind id [cn=Replication Manager cloneAgreement1-usdev1.internal.com-pki-tomcat,ou=csusers,cn=config] authentication mechanism [SIMPLE]: error 49 (Invalid credentials) errno 0 (Success)

Eupre1:
[05/Jan/2016:22:28:11 +0000] slapi_ldap_bind - Error: could not send startTLS request: error -1 (Can't contact LDAP server) errno 107 (Transport endpoint is not connected)


2. Usdev1 sync to others slow but OK, first usqa2, then others

usdev1
[05/Jan/2016:22:31:16 +0000] attrlist_replace - attr_replace (nsslapd-referral, ldap://usqa2.internal.com:389/o%3Dipaca) failed.

usqa2
[05/Jan/2016:22:31:17 +0000] attrlist_replace - attr_replace (nsslapd-referral, ldap://usqa1.internal.com:389/o%3Dipaca) failed.

[05/Jan/2016:22:31:19 +0000] attrlist_replace - attr_replace (nsslapd-referral, ldap://eupre1.internal.com:389/o%3Dipaca) failed.

eupre1
[05/Jan/2016:22:31:13 +0000] attrlist_replace - attr_replace (nsslapd-referral, ldap://uspre1.internal.com:389/o%3Dipaca) failed.

3. Usqa2 sync to others slow, first others OK, but NOT synced to uspre1


Eupre1:
[05/Jan/2016:22:34:38 +0000] slapi_ldap_bind - Error: could not send startTLS request: error -1 (Can't contact LDAP server) errno 107 (Transport endpoint is not connected)

Usdev1:
[05/Jan/2016:22:34:36 +0000] - slapd_poll(143) timed out
[05/Jan/2016:22:35:44 +0000] NSMMReplicationPlugin - CleanAllRUV Task (rid 52): Not all replicas finished cleaning, retrying in 2560 seconds
[05/Jan/2016:22:35:44 +0000] NSMMReplicationPlugin - CleanAllRUV Task (rid 52): Not all replicas finished cleaning, retrying in 2560 seconds

Usqa2:
[05/Jan/2016:22:36:17 +0000] attrlist_replace - attr_replace (nsslapd-referral, ldap://usqa1.internal.com:389/o%3Dipaca) failed.

4. Eupre1 sync to others slow, first uspre1 ok, but NOT synced to others

Usqa1:
[05/Jan/2016:22:37:28 +0000] slapi_ldap_bind - Error: could not bind id [cn=Replication Manager cloneAgreement1-usdev1.internal.com-pki-tomcat,ou=csusers,cn=config] authentication mechanism [SIMPLE]: error 49 (Invalid credentials) errno 0 (Success)

Eupre1:
[05/Jan/2016:22:38:52 +0000] slapi_ldap_bind - Error: could not send startTLS request: error -1 (Can't contact LDAP server) errno 107 (Transport endpoint is not connected)

5. Uspre1 sync to others slow, first others OK, but NOT synced to eupre1

Uspre1:
[05/Jan/2016:22:41:11 +0000] attrlist_replace - attr_replace (nsslapd-referral, ldap://eupre1.internal.com:389/o%3Dipaca) failed.

Eupre1:
[05/Jan/2016:22:41:10 +0000] attrlist_replace - attr_replace (nsslapd-referral, ldap://uspre1.internal.com:389/o%3Dipaca) failed.
[05/Jan/2016:22:41:16 +0000] attrlist_replace - attr_replace (nsslapd-referral, ldap://usqa1.internal.com:389/o%3Dipaca) failed.
[05/Jan/2016:22:43:07 +0000] slapi_ldap_bind - Error: could not send startTLS request: error -1 (Can't contact LDAP server) errno 107 (Transport endpoint is not connected)

Usqa2:
[05/Jan/2016:22:41:15 +0000] attrlist_replace - attr_replace (nsslapd-referral, ldap://usqa1.internal.com:389/o%3Dipaca) failed.
[05/Jan/2016:22:41:20 +0000] attrlist_replace - attr_replace (nsslapd-referral, ldap://usdev1.internal.com:389/o%3Dipaca) failed.
[05/Jan/2016:22:41:55 +0000] NSMMReplicationPlugin - CleanAllRUV Task (rid 55): Replica is not cleaned yet (agmt="cn=meTousdev1.internal.com" (usdev1:389))
[05/Jan/2016:22:41:55 +0000] NSMMReplicationPlugin - CleanAllRUV Task (rid 55): Replicas have not been cleaned yet, retrying in 10240 seconds
[05/Jan/2016:22:42:01 +0000] NSMMReplicationPlugin - CleanAllRUV Task (rid 52): Replica is not cleaned yet (agmt="cn=meTouspre1.internal.com" (uspre1:389))
[05/Jan/2016:22:42:01 +0000] NSMMReplicationPlugin - CleanAllRUV Task (rid 52): Replicas have not been cleaned yet, retrying in 10240 seconds
[05/Jan/2016:22:42:02 +0000] NSMMReplicationPlugin - CleanAllRUV Task (rid 52): Replica is not cleaned yet (agmt="cn=meTouspre1.internal.com" (uspre1:389))
[05/Jan/2016:22:42:02 +0000] NSMMReplicationPlugin - CleanAllRUV Task (rid 52): Replicas have not been cleaned yet, retrying in 10240 seconds

USDEV1:
[05/Jan/2016:22:41:15 +0000] attrlist_replace - attr_replace (nsslapd-referral, ldap://usqa2.internal.com:389/o%3Dipaca) failed.

USQA1:
[05/Jan/2016:22:41:15 +0000] attrlist_replace - attr_replace (nsslapd-referral, ldap://eupre1.internal.com:389/o%3Dipaca) failed.
[05/Jan/2016:22:41:19 +0000] attrlist_replace - attr_replace (nsslapd-referral, ldap://usqa2.internal.com:389/o%3Dipaca) failed.
[05/Jan/2016:22:42:05 +0000] NSMMReplicationPlugin - CleanAllRUV Task (rid 55): Replica is not cleaned yet (agmt="cn=meToeupre1.internal.com" (eupre1:389))
[05/Jan/2016:22:42:05 +0000] NSMMReplicationPlugin - CleanAllRUV Task (rid 55): Replicas have not been cleaned yet, retrying in 10240 seconds
[05/Jan/2016:22:42:29 +0000] slapi_ldap_bind - Error: could not bind id [cn=Replication Manager cloneAgreement1-usdev1.internal.com-pki-tomcat,ou=csusers,cn=config] authentication mechanism [SIMPLE]: error 49 (Invalid credentials) errno 0 (Success)
Comment 2 lmgnid 2016-01-05 19:08:08 EST
BTW, here are the "ipa-replica-manage list -v" results for each server. FYI.

usqa1:

eupre1.internal.com: replica
  last init status: None
  last init ended: 1970-01-01 00:00:00+00:00
  last update status: 1 Can't acquire busy replica
  last update ended: 1970-01-01 00:00:00+00:00
usdev1.internal.com: replica
  last init status: None
  last init ended: 1970-01-01 00:00:00+00:00
  last update status: 0 Replica acquired successfully: Incremental update succeeded
  last update ended: 2016-01-06 00:02:03+00:00
usqa2.internal.com: replica
  last init status: None
  last init ended: 1970-01-01 00:00:00+00:00
  last update status: 0 Replica acquired successfully: Incremental update succeeded
  last update ended: 2016-01-06 00:02:03+00:00
  
usdev1:

usqa1.internal.com: replica
  last init status: None
  last init ended: 1970-01-01 00:00:00+00:00
  last update status: 0 Replica acquired successfully: Incremental update succeeded
  last update ended: 2016-01-06 00:04:42+00:00
usqa2.internal.com: replica
  last init status: None
  last init ended: 1970-01-01 00:00:00+00:00
  last update status: 1 Can't acquire busy replica
  last update ended: 2016-01-06 00:04:25+00:00
  
usqa2:

 usdev1.internal.com: replica
  last init status: None
  last init ended: 1970-01-01 00:00:00+00:00
  last update status: 1 Can't acquire busy replica
  last update ended: 2016-01-06 00:01:37+00:00
uspre1.internal.com: replica
  last init status: None
  last init ended: 1970-01-01 00:00:00+00:00
  last update status: 0 Replica acquired successfully: Incremental update succeeded
  last update ended: 2016-01-06 00:04:05+00:00
usqa1.internal.com: replica
  last init status: None
  last init ended: 1970-01-01 00:00:00+00:00
  last update status: 0 Replica acquired successfully: Incremental update succeeded
  last update ended: 2016-01-06 00:04:05+00:00

eupre1:

uspre1.internal.com: replica
  last init status: None
  last init ended: 1970-01-01 00:00:00+00:00
  last update status: 0 Replica acquired successfully: Incremental update started
  last update ended: 1970-01-01 00:00:00+00:00
usqa1.internal.com: replica
  last init status: None
  last init ended: 1970-01-01 00:00:00+00:00
  last update status: 0 Replica acquired successfully: Incremental update succeeded
  last update ended: 2016-01-06 00:05:29+00:00

uspre1:

eupre1.internal.com: replica
  last init status: None
  last init ended: 1970-01-01 00:00:00+00:00
  last update status: 0 Replica acquired successfully: Incremental update succeeded
  last update ended: 2016-01-06 00:05:29+00:00
usqa2.internal.com: replica
  last init status: None
  last init ended: 1970-01-01 00:00:00+00:00
  last update status: 1 Can't acquire busy replica
  last update ended: 2016-01-06 00:04:26+00:00
Comment 3 Petr Vobornik 2016-01-06 06:51:39 EST
In the log output above I see that there are 2 CLEANALLRUV tasks still running, for replication ids 52 and 55. They might be left over from the decommissioning of the old 3.x servers. The tasks are probably still running because not all replicas can be contacted (probably because they no longer exist).

You can list the tasks using `ipa-replica-manage list-clean-ruv` and abort them with `ipa-replica-manage abort-clean-ruv REPLICATION_ID`.

There are also the following errors:
usdev1
[05/Jan/2016:22:31:16 +0000] attrlist_replace - attr_replace (nsslapd-referral, ldap://usqa2.internal.com:389/o%3Dipaca) failed.

usqa2
[05/Jan/2016:22:31:17 +0000] attrlist_replace - attr_replace (nsslapd-referral, ldap://usqa1.internal.com:389/o%3Dipaca) failed.

[05/Jan/2016:22:31:19 +0000] attrlist_replace - attr_replace (nsslapd-referral, ldap://eupre1.internal.com:389/o%3Dipaca) failed.

eupre1
[05/Jan/2016:22:31:13 +0000] attrlist_replace - attr_replace (nsslapd-referral, ldap://uspre1.internal.com:389/o%3Dipaca) failed.

These indicate that the replicas still have some dangling RUVs for the o=ipaca suffix. Unfortunately, IPA doesn't have a tool that helps with cleaning RUVs of the o=ipaca suffix: https://fedorahosted.org/freeipa/ticket/4987

The procedure is:
1. Find the correct replication ids - the nsDS5ReplicaID attribute of the cn=replica,cn=o\3Dipaca,cn=mapping tree,cn=config entry on all replicas with a CA
2. Get the RUVs (described in the ticket)
3. Clean the RUVs which are no longer used (i.e. are not part of the list from step 1), as described at http://www.port389.org/docs/389ds/howto/howto-cleanruv.html#cleanallruv

It is very important that you do NOT clean RUVs of replicas which are still in use.
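For reference, the port389 howto linked above describes starting a CLEANALLRUV task by adding a task entry directly to the directory server. A sketch of such an entry for the o=ipaca suffix might look like the following (replica id 99 is a hypothetical stale id - double-check against the valid-id list from step 1 before cleaning anything):

```ldif
dn: cn=clean 99,cn=cleanallruv,cn=tasks,cn=config
objectclass: extensibleObject
cn: clean 99
replica-base-dn: o=ipaca
replica-id: 99
replica-force-cleaning: no
```

Add it with ldapadd as Directory Manager and watch the dirsrv errors log for the task's progress.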
Comment 4 Petr Vobornik 2016-01-06 06:53:57 EST
Other than the above, Ludwig, do you see any bug to fix here, or anything I missed?
Comment 5 lmgnid 2016-01-06 20:23:52 EST
Hi Petr, thanks a lot for the quick reply. Here are my comments on the RUVs:

1. For RUV under internal.com:

I cleaned all invalid RUVs the day before yesterday; now only these valid RUVs are left:
eupre1.internal.com:389: 52
usqa1.internal.com:389: 53
uspre1.internal.com:389: 55
usqa2.internal.com:389: 56
usdev1.internal.com:389: 57
I made the mistake of cleaning some valid RUVs, for example 52 on eupre1, but it seems a server's own valid RUV cannot be cleaned: although it can be cleaned on the other servers, the RUV is synced back to them later. So all the valid RUVs above are still present.
I also checked with "ipa-replica-manage list-clean-ruv" and no CLEANALLRUV/abort CLEANALLRUV tasks are running.
But I still have the sync issue even after cleaning all invalid RUVs.

2. For RUV under ipaca, I just checked today:

2.1. I found ~30 of them; examples below:
nsruvReplicaLastModified: {replica 1990 ldap://uspre1.internal.com:389} 568db94e
...
...
...
nsruvReplicaLastModified: {replica 2195 ldap://usdev1.internal.com:389} 00000000

 
2.2. Is this the right way to get the valid replica ID on each server? For example:
[root@uspre1 secaops]# ldapsearch -xLLL -D "cn=directory manager" -W -b "o=ipaca" '(&(nsuniqueid=ffffffff-ffffffff-ffffffff-ffffffff)(objectclass=nstombstone))'
...
nsDS5ReplicaId: 1990

2.3. Should I find the valid ID on each server and remove all the others?
Comment 6 Petr Vobornik 2016-01-07 12:47:22 EST
Do I get it right that eupre1 and uspre1 are synced only with each other but not with the rest (usqa1, usqa2, usdev1), and vice versa?


2.1: to get all valid replica IDs of the o=ipaca suffix, run on each server with a CA:
  ldapsearch -xLLL -D "cn=directory manager" -W -b "cn=replica,cn=o\3Dipaca,cn=mapping tree,cn=config" nsDS5ReplicaId | grep -i nsDS5ReplicaId

it should return only one value.


2.2: collect all replica ids from RUVs. Run on each server with CA:
ldapsearch -xLLL -D "cn=directory manager" -W -b o=ipaca '(&(nsuniqueid=ffffffff-ffffffff-ffffffff-ffffffff)(objectclass=nstombstone))' nsds50ruv | grep -i 'nsds50ruv: {replica '

It may (will) return multiple values.

e.g. in RUV which looks like:
nsds50ruv: {replica 6 ldap://replica.example.com:389} 56
the replica id is "6"

2.3: Collect the IDs from 2.1 and 2.2. Run a "clean all ruv" task for each ID that is in the 2.2 set but not in the 2.1 set.
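The set difference in step 2.3 can be sketched in a few lines of Python; the RUV lines and valid ids below are illustrative, not taken from a real deployment:

```python
import re

# Sample nsds50ruv lines as returned by the ldapsearch in step 2.2
# (ids and hostnames here are hypothetical).
ruv_lines = [
    "nsds50ruv: {replica 6 ldap://replica.example.com:389} 56...",
    "nsds50ruv: {replica 1990 ldap://uspre1.internal.com:389} 568db94e",
    "nsds50ruv: {replica 97 ldap://old-master.internal.com:389} 00000000",
]

# Valid ids collected in step 2.1, one per live CA master (hypothetical).
valid_ids = {6, 1990}

def ruv_ids(lines):
    """Extract the replica id from each 'nsds50ruv: {replica N ...}' line."""
    ids = set()
    for line in lines:
        m = re.search(r"\{replica (\d+) ", line)
        if m:
            ids.add(int(m.group(1)))
    return ids

# Stale ids = ids seen in the RUVs that no live CA master owns.
stale = ruv_ids(ruv_lines) - valid_ids
print(sorted(stale))  # → [97]
```

Only the ids printed at the end would be candidates for a CLEANALLRUV task; anything in `valid_ids` must be left alone.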

FYI: there is an effort to automate this procedure: https://fedorahosted.org/freeipa/ticket/5411#comment:7
Comment 7 lmgnid 2016-01-07 12:55:34 EST
Hi Petr, a quick reply, as in first comment block:
We have 5 servers and the replica relationship as bellow:
         _____USDEV1____
        /               \  
      USQA1_____________USQA2
        |                | 
      EUPRE1____________USPRE1
And I will try your suggestion in Comment 6 soon; will let you know. Thanks!
Comment 8 Petr Vobornik 2016-01-07 13:18:14 EST
The question was whether replication fails between the two replica sets even though replication agreements are established.
Comment 9 lmgnid 2016-01-07 21:28:26 EST
Hi Petr,

1. I followed your comments in comment 6 and cleaned all invalid RUVs in ipaca.
Here I list all valid RUVs for each server (first block for the domain, second block for ipaca).
But I don't understand why some servers have only 4 replica IDs, while other servers have 5?

Usqa1
nsruvReplicaLastModified: {replica 53 ldap://usqa1.internal
nsruvReplicaLastModified: {replica 57 ldap://usdev1.internal
nsruvReplicaLastModified: {replica 55 ldap://uspre1.internal
nsruvReplicaLastModified: {replica 58 ldap://eupre1.internal

nsruvReplicaLastModified: {replica 1995 ldap://usqa1.internal
nsruvReplicaLastModified: {replica 2090 ldap://usqa2.internal.
nsruvReplicaLastModified: {replica 1990 ldap://uspre1.internal.
nsruvReplicaLastModified: {replica 2195 ldap://usdev1.internal.

Usdev1
nsruvReplicaLastModified: {replica 57 ldap://usdev1.internal.
nsruvReplicaLastModified: {replica 53 ldap://usqa1.internal.
nsruvReplicaLastModified: {replica 55 ldap://uspre1.internal.
nsruvReplicaLastModified: {replica 56 ldap://usqa2.internal.
nsruvReplicaLastModified: {replica 58 ldap://eupre1.internal.

nsruvReplicaLastModified: {replica 2195 ldap://usdev1.internal.
nsruvReplicaLastModified: {replica 2090 ldap://usqa2.internal.
nsruvReplicaLastModified: {replica 1995 ldap://usqa1.internal.
nsruvReplicaLastModified: {replica 1990 ldap://uspre1.internal.

USqa2
nsruvReplicaLastModified: {replica 56 ldap://usqa2.internal.
nsruvReplicaLastModified: {replica 53 ldap://usqa1.internal.
nsruvReplicaLastModified: {replica 57 ldap://usdev1.internal.
nsruvReplicaLastModified: {replica 58 ldap://eupre1.internal.

nsruvReplicaLastModified: {replica 2090 ldap://usqa2.internal.
nsruvReplicaLastModified: {replica 2195 ldap://usdev1.internal.
nsruvReplicaLastModified: {replica 1995 ldap://usqa1.internal.
nsruvReplicaLastModified: {replica 1990 ldap://uspre1.internal.

EUPRE1
nsruvReplicaLastModified: {replica 58 ldap://eupre1.internal.
nsruvReplicaLastModified: {replica 55 ldap://uspre1.internal.
nsruvReplicaLastModified: {replica 56 ldap://usqa2.internal.
nsruvReplicaLastModified: {replica 53 ldap://usqa1.internal.
nsruvReplicaLastModified: {replica 57 ldap://usdev1.internal.

nsruvReplicaLastModified: {replica 2295 ldap://eupre1.internal.
nsruvReplicaLastModified: {replica 1990 ldap://uspre1.internal.
nsruvReplicaLastModified: {replica 1995 ldap://usqa1.internal.
nsruvReplicaLastModified: {replica 2090 ldap://usqa2.internal.
nsruvReplicaLastModified: {replica 2195 ldap://usdev1.internal.

USPRE1:
nsruvReplicaLastModified: {replica 55 ldap://uspre1.internal.
nsruvReplicaLastModified: {replica 56 ldap://usqa2.internal.
nsruvReplicaLastModified: {replica 53 ldap://usqa1.internal.
nsruvReplicaLastModified: {replica 58 ldap://eupre1.internal.

nsruvReplicaLastModified: {replica 1990 ldap://uspre1.internal.
nsruvReplicaLastModified: {replica 2295 ldap://eupre1.internal.
nsruvReplicaLastModified: {replica 1995 ldap://usqa1.internal.
nsruvReplicaLastModified: {replica 2090 ldap://usqa2.internal.
nsruvReplicaLastModified: {replica 2195 ldap://usdev1.internal.
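To make the asymmetry in the domain-suffix listings above easier to see, here is a small sketch that takes the replica ids from the first (domain) block of each server and reports which RUV entries each server is missing relative to the union:

```python
# Domain-suffix RUV replica ids per server, transcribed from the listing above.
domain_ruvs = {
    "usqa1":  {53, 55, 57, 58},
    "usdev1": {53, 55, 56, 57, 58},
    "usqa2":  {53, 56, 57, 58},
    "eupre1": {53, 55, 56, 57, 58},
    "uspre1": {53, 55, 56, 58},
}

# Union of all ids seen anywhere in the topology.
all_ids = set().union(*domain_ruvs.values())

# For each server, which ids from the union are absent from its RUV.
missing = {srv: sorted(all_ids - ids) for srv, ids in domain_ruvs.items()}

for srv, gone in sorted(missing.items()):
    if gone:
        print(f"{srv} has no RUV entry for replica id(s) {gone}")
```

On this data, usqa1 is missing 56 (usqa2), usqa2 is missing 55 (uspre1), and uspre1 is missing 57 (usdev1), which lines up with the "only 4 replica IDs" question above.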

2. Regarding your question in comment 8 and more, I did another sync test after cleaning all invalid RUVs; here are the results, followed by the error logs.
Do you have any idea why sync still doesn't work in some cases? I guess some sync issues are related to the missing replica IDs above, or is there still some other issue?

2.1 Usqa1 sync to all others OK

[08/Jan/2016:01:11:37 +0000] NSMMReplicationPlugin - changelog program - agmt="cn=meToeupre1.internal.com" (eupre1:389): CSN 568ff7dc000700390000 not found, we aren't as up to date, or we purged
[08/Jan/2016:01:12:06 +0000] NSMMReplicationPlugin - changelog program - agmt="cn=meTousqa1.internal.com" (usqa1:389): CSN 5690075e000200390000 not found, we aren't as up to date, or we purged

[08/Jan/2016:01:12:20 +0000] NSMMReplicationPlugin - CleanAllRUV Task (rid 53): Replica is not cleaned yet (agmt="cn=meTousqa2.internal.com" (usqa2:389))

2.2 Usdev1, sync to usqa1/eupre1 OK, but NOT to usqa2/uspre1 within 2 mins

[08/Jan/2016:01:14:57 +0000] NSMMReplicationPlugin - changelog program - agmt="cn=meToeupre1.internal.com" (eupre1:389): CSN 568ff7dc000700390000 not found, we aren't as up to date, or we purged

[08/Jan/2016:01:16:06 +0000] slapi_ldap_bind - Error: could not bind id [cn=Replication Manager cloneAgreement1-usdev1.internal.com-pki-tomcat,ou=csusers,cn=config] authentication mechanism [SIMPLE]: error 49 (Invalid credentials) errno 0 (Success)

[08/Jan/2016:01:17:06 +0000] agmt="cn=meTousqa1.internal.com" (usqa1:389) - Can't locate CSN 56900252000000380000 in the changelog (DB rc=-30988). If replication stops, the consumer may need to be reinitialized.

2.3 USqa2, sync to uspre1 OK, but NOT to usqa1/usdev1/eupre1 within 2 mins

[08/Jan/2016:01:18:51 +0000] slapi_ldap_bind - Error: could not bind id [cn=Replication Manager masterAgreement1-uspre1.internal.com-pki-tomcat,ou=csusers,cn=config] authentication mechanism [SIMPLE]: error 32 (No such object) errno 0 (Success)

[08/Jan/2016:01:19:42 +0000] NSMMReplicationPlugin - changelog program - agmt="cn=meTousqa1.internal.com" (usqa1:389): CSN 56900252000000380000 not found, we aren't as up to date, or we purged

2.4 EUPRE1: sync to all others OK

[08/Jan/2016:01:21:03 +0000] NSMMReplicationPlugin - changelog program - agmt="cn=meTousqa2.internal.com" (usqa2:389): CSN 568d2238000000370000 not found, we aren't as up to date, or we purged
[08/Jan/2016:01:21:06 +0000] slapi_ldap_bind - Error: could not bind id [cn=Replication Manager cloneAgreement1-usdev1.internal.com-pki-tomcat,ou=csusers,cn=config] authentication mechanism [SIMPLE]: error 49 (Invalid credentials) errno 0 (Success)

[08/Jan/2016:01:22:08 +0000] NSMMReplicationPlugin - CleanAllRUV Task (rid 57): Replica is not cleaned yet (agmt="cn=meTousdev1.internal.com" (usdev1:389))
[08/Jan/2016:01:22:08 +0000] NSMMReplicationPlugin - CleanAllRUV Task (rid 57): Replicas have not been cleaned yet, retrying in 5120 seconds
[08/Jan/2016:01:22:53 +0000] NSMMReplicationPlugin - changelog program - agmt="cn=meTousqa2.internal.com" (usqa2:389): CSN 568d2238000000370000 not found, we aren't as up to date, or we purged


2.5 USPRE1: sync to eupre1 OK, but NOT to usqa1/usdev1/usqa2 within 2 mins

[08/Jan/2016:01:18:51 +0000] slapi_ldap_bind - Error: could not bind id [cn=Replication Manager masterAgreement1-uspre1.internal.com-pki-tomcat,ou=csusers,cn=config] authentication mechanism [SIMPLE]: error 32 (No such object) errno 0 (Success)
Comment 10 Ludwig 2016-01-08 04:00:43 EST
If modifications to users don't get synced, this is not related to the ipaca backend but to the backend containing the domain data.

What is strange is that there seem to be dangling cleanallruv tasks for this backend, e.g.:
NSMMReplicationPlugin - CleanAllRUV Task (rid 57): Replica is not cleaned yet (agmt="cn=meTousdev1.internal.com" (usdev1:389))

This can interfere with replication. Can you try to abort these tasks:
ipa-replica-manage abort-clean-ruv <rid>.
Comment 11 lmgnid 2016-01-08 13:46:54 EST
Hi Petr/Ludwig,

This might be caused by my mistakes in comment 5. Just a note that I had the sync issue even before the "mistake" of cleaning valid RUVs.

Here is the current valid RUVs: 
usqa1: 53
usdev1: 57
usqa2: 56
eupre1: 58
uspre1: 55

And here are the "abort CLEANALLRUV" task results for each server:
usqa1:
RID 55: Not all replicas finished aborting, retrying in 1280 seconds
RID 57: Not all replicas finished aborting, retrying in 320 seconds

usdev1:
RID 55: Not all replicas finished aborting, retrying in 1280 seconds
RID 56: Not all replicas finished aborting, retrying in 320 seconds
RID 57: Not all replicas finished aborting, retrying in 320 seconds

usqa2:
RID 56: Not all replicas finished aborting, retrying in 1280 seconds
RID 57: Not all replicas finished aborting, retrying in 320 seconds

eupre1: (Just uninstalled and reinstalled the replica yesterday)
None

uspre1:
RID 56: Not all replicas finished aborting, retrying in 1280 seconds

It seems the cleanruv tasks cannot be aborted; do you have any suggestions?
Or should I just uninstall servers 55/uspre1, 56/usqa2, and 57/usdev1 and reinstall the replicas, as I did with 58/eupre1? Thanks!
Comment 12 lmgnid 2016-01-08 21:08:55 EST
Hi Petr/Ludwig,

Good news! 

As in comment 9, usqa1 seems to be the only server that did NOT have sync issues in either direction (in and out), so I uninstalled all the other servers and reinstalled the replicas based on usqa1. Then I cleaned all invalid RUVs and checked that all valid RUVs show up on all IPA servers (both domain and ipaca).

Now all syncs seem to work OK!
BTW, not sure if the errors below matter?


[09/Jan/2016:02:01:33 +0000] NSMMReplicationPlugin - replication keep alive entry <cn=repl keep alive 53,dc=internal,dc=com> already exists
[09/Jan/2016:02:05:22 +0000] slapi_ldap_bind - Error: could not bind id [cn=Replication Manager masterAgreement1-usqa1.internal.com-pki-tomcat,ou=csusers,cn=config] authentication mechanism [SIMPLE]: error 32 (No such object) errno 0 (Success)
[09/Jan/2016:02:01:54 +0000] - slapd_poll(69) timed out


BTW, I will keep it running and try again next week. If you have any other suggestions/comments, please let me know.

Thanks!
L.M.
Comment 13 lmgnid 2016-01-19 19:16:51 EST
It still seems to work so far. Thanks for the help; this ticket can be closed.

Cheers!
L.M.
