Bug 1401897

Summary: Replication issues between 2 freeipa servers
Product: Red Hat Enterprise Linux 7
Reporter: Valentin Bajrami <valentin.bajrami>
Component: 389-ds-base
Assignee: Noriko Hosoi <nhosoi>
Status: CLOSED INSUFFICIENT_DATA
QA Contact: Viktor Ashirov <vashirov>
Severity: medium
Priority: unspecified
Version: 7.2
CC: nhosoi, nkinder, rmeggins, tbordaz, valentin.bajrami
Target Milestone: pre-dev-freeze
Target Release: 7.3
Hardware: x86_64
OS: Linux
Type: Bug
Last Closed: 2017-01-19 16:32:35 UTC

Description Valentin Bajrami 2016-12-06 11:19:29 UTC
Description of problem:

Replication problems between FreeIPA servers. Password synchronization does not seem to work.

The whole problem, IMHO, is caused by this uniqueid:

....
targetuniqueid: ec5ffe96-c65211e5-a537be1e-d065a3fa
....
targetDn: cn=repl keep alive 3,dc=intra,dc=sub,dc=domain,dc=tld



Scenario:

1) ipamaster

[root@ipamaster ~]# rpm -qa 389-ds-*
389-ds-base-libs-1.3.4.0-33.el7_2.x86_64
389-ds-base-debuginfo-1.3.4.0-33.el7_2.x86_64
389-ds-base-1.3.4.0-33.el7_2.x86_64


2) ipamaster-1

[root@ipamaster-1 ~]# rpm -qa 389-ds-*
389-ds-base-libs-1.3.4.0-33.el7_2.x86_64
389-ds-base-debuginfo-1.3.4.0-33.el7_2.x86_64
389-ds-base-1.3.4.0-33.el7_2.x86_64


root@ipamaster # tail -F /var/log/dirsrv/slapd-INTRA-SUB-DOMAIN-TLD/errors*

[06/Dec/2016:11:54:30 +0100] NSMMReplicationPlugin - process_postop: Failed to apply update (580f4ec7000100030000) error (-1).  Aborting replication session(conn=2632893 op=6)
[06/Dec/2016:11:54:35 +0100] NSMMReplicationPlugin - process_postop: Failed to apply update (580f4ec7000100030000) error (-1).  Aborting replication session(conn=2632898 op=6)
[06/Dec/2016:11:54:41 +0100] NSMMReplicationPlugin - process_postop: Failed to apply update (580f4ec7000100030000) error (-1).  Aborting replication session(conn=2632904 op=6)
[06/Dec/2016:11:54:46 +0100] NSMMReplicationPlugin - process_postop: Failed to apply update (580f4ec7000100030000) error (-1).  Aborting replication session(conn=2632905 op=6)


root@ipamaster-1 # tail -F /var/log/dirsrv/slapd-INTRA-SUB-DOMAIN-TLD/errors*

[05/Dec/2016:15:40:07 +0100] NSMMReplicationPlugin - agmt="cn=meToipamaster.intra.astro.rug.nl" (ipamaster:389): Replication bind with GSSAPI auth resumed
[05/Dec/2016:15:40:08 +0100] NSMMReplicationPlugin - agmt="cn=meToipamaster.intra.astro.rug.nl" (ipamaster:389): Failed to send modify operation: LDAP error -1 (Can't contact LDAP server)
[05/Dec/2016:15:40:08 +0100] NSMMReplicationPlugin - agmt="cn=meToipamaster.intra.astro.rug.nl" (ipamaster:389): Failed to send update operation to consumer (uniqueid ec5ffe96-c65211e5-a537be1e-d065a3fa, CSN 580f6b6b000200030000): Can't contact LDAP server. Will retry later.
[05/Dec/2016:15:40:08 +0100] NSMMReplicationPlugin - agmt="cn=meToipamaster.intra.astro.rug.nl" (ipamaster:389): Consumer failed to replay change (uniqueid ec5ffe96-c65211e5-a537be1e-d065a3fa, CSN 580f52a6000300030000): Can't contact LDAP server(-1). Will retry later.
[05/Dec/2016:15:40:08 +0100] NSMMReplicationPlugin - agmt="cn=meToipamaster.intra.astro.rug.nl" (ipamaster:389): Warning: unable to send endReplication extended operation (Can't contact LDAP server)
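
As a side note (not in the original report): when replication sessions abort like this, the per-agreement status can be inspected directly on either server. A minimal sketch, assuming Directory Manager access:

ldapsearch -x -D "cn=Directory Manager" -W -h localhost -p 389 \
    -b "cn=config" "(objectClass=nsds5replicationagreement)" \
    nsds5replicaLastUpdateStatus nsds5replicaUpdateInProgress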



Running the following command on ipamaster produces this output:

root@ipamaster # ldapsearch -x -W -h localhost -p 389 -D "cn=Directory Manager" -b "cn=changelog" | grep -C17 'ec5ffe96-c65211e5-a537be1e-d065a3fa'

# 31567089, changelog
dn: changenumber=31567089,cn=changelog
objectClass: top
objectClass: changelogentry
objectClass: extensibleObject
targetuniqueid: ec5ffe96-c65211e5-a537be1e-d065a3fa
changeNumber: 31567089
targetDn: cn=repl keep alive 3,dc=intra,dc=sub,dc=domain,dc=tld
changeTime: 20161205144220Z
changeType: modify
changes:: cmVwbGFjZToga2VlcGFsaXZldGltZXN0YW1wCmtlZXBhbGl2ZXRpbWVzdGFtcDogMjAx
 NjEyMDUxNDQyMjBaCi0KcmVwbGFjZTogbW9kaWZpZXJzbmFtZQptb2RpZmllcnNuYW1lOiBjbj1Nd
 Wx0aW1hc3RlciBSZXBsaWNhdGlvbiBQbHVnaW4sY249cGx1Z2lucyxjbj1jb25maWcKLQpyZXBsYW
 NlOiBtb2RpZnl0aW1lc3RhbXAKbW9kaWZ5dGltZXN0YW1wOiAyMDE2MTIwNTE0NDIyMFoKLQpyZXB
 sYWNlOiBlbnRyeXVzbgplbnRyeXVzbjogMTA1NzY4NTQzCi0KAA==
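
The changes:: value is base64-encoded LDIF; decoding it (for example with base64 -d from coreutils) shows it is just the periodic keep-alive update written by the replication plugin:

$ echo '<changes value above>' | base64 -d
replace: keepalivetimestamp
keepalivetimestamp: 20161205144220Z
-
replace: modifiersname
modifiersname: cn=Multimaster Replication Plugin,cn=plugins,cn=config
-
replace: modifytimestamp
modifytimestamp: 20161205144220Z
-
replace: entryusn
entryusn: 105768543
-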


Does anyone have any suggestions?

Comment 1 Noriko Hosoi 2016-12-08 23:05:42 UTC
Hello, Valentin.

Could you tell us how this issue started?  Was replication between ipamaster and ipamaster-1 working and then stopped, or did it never work?

You mentioned "password synchronization".  Could you give us more details about it?  For instance, could you share with us the steps to reproduce the problem?

We usually use the term "password synchronization" for synchronizing passwords between Active Directory and Directory Server/IPA.  I assume that is not what you mean?

Thierry, do you have any idea what is causing this problem?

Thanks.

Comment 2 thierry bordaz 2016-12-09 08:13:09 UTC
Hello Valentin,

It looks like this bug is related to a support case; would you please link the case so that I can access all the sosreport files?

With the current data I would say:

 - On ipamaster, the entry
   cn=repl keep alive 3,dc=intra,dc=sub,dc=domain,dc=tld
   exists, because an update to it was successfully applied on ipamaster and
   recorded in the retro changelog (retroCL).
   I do not know whether it was created on ipamaster or on ipamaster-1.

 - ipamaster-1 --> ipamaster fails to send 580f52a6000300030000, but this is
   likely a consequence of 580f4ec7000100030000 being aborted on ipamaster,
   which triggered the connection closure seen for 580f52a6000300030000.
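
For reference, the keep-alive entry's presence (and its nsUniqueId) can be compared across the two servers with a base-scope search; a sketch, using the hostnames from this report:

ldapsearch -x -D "cn=Directory Manager" -W -h ipamaster -p 389 \
    -s base -b "cn=repl keep alive 3,dc=intra,dc=sub,dc=domain,dc=tld" \
    nsUniqueId keepalivetimestamp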

A complete dataset is needed (config, access log, errors log). In addition, even though replication seems broken, you could turn on replication logging on both servers to get more details.

It would be interesting to know what change 580f4ec7000100030000, the one being aborted, actually is.
The access log would help, and so would dumping that change from the changelog.
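
For instance, the replication changelog can be dumped with dbscan (shipped with 389-ds-base); a sketch only, since the changelog database file name varies per replica, and it is best run against a copy of the file or with the server stopped:

dbscan -f /var/lib/dirsrv/slapd-INTRA-SUB-DOMAIN-TLD/changelogdb/<replica-db-file>.db4 | less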

Is this a test environment or production?

Comment 4 thierry bordaz 2017-01-05 17:48:53 UTC
Hi Valentin,

We cannot really make progress on this bug without the requested data (https://bugzilla.redhat.com/show_bug.cgi?id=1401897#c2).
By any chance, is that data still available somewhere? Otherwise we will have to close the bug.

regards
thierry

Comment 5 Valentin Bajrami 2017-01-10 13:10:02 UTC
(In reply to thierry bordaz from comment #4)
> Hi Valentin,
> 
> We cannot really make progress on this bug without the requested data
> (https://bugzilla.redhat.com/show_bug.cgi?id=1401897#c2).
> By any chance, is that data still available somewhere? Otherwise we will
> have to close the bug.
> 
> regards
> thierry

Hi Thierry,

Thanks for getting back to us on this. The problem was solved just today by re-initializing the replication.  It took ~40s to replicate, and the messages were:

[10/Jan/2017:12:07:41 +0100] NSMMReplicationPlugin - Beginning total update of replica "agmt="cn=meToipamaster.sub.domain.tld" (ipamaster:389)".
[10/Jan/2017:12:08:13 +0100] NSMMReplicationPlugin - Finished total update of replica "agmt="cn=meToipamaster.sub.domain.tld" (ipamaster:389)". Sent 2289 entries.


I'm not sure whether this contributes anything to understanding the problem, but it was the most straightforward step to take.

We first tried force-sync, which was not successful.
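
For completeness, the force-sync attempt would have been along the same lines (again with an assumed hostname):

[root@ipamaster ~]# ipa-replica-manage force-sync --from ipamaster-1.sub.domain.tld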

Please close this bug report as you see fit, given the description above.

Comment 6 thierry bordaz 2017-01-10 16:07:25 UTC
Hi Valentin,

Thanks for your feedback. The "Beginning total" and "Finished total" messages are normal and show the normal completion of the re-init. It is difficult to comment on the import rate (2289 entries / ~40 s) because many parameters contribute (power of the box, size of the entries, plugins enabled, indexed attributes, ...), but I would assume it is normal.

You are right, a re-init (especially when it takes only a few seconds) is usually the fastest approach to recovery. If you do not have the old logs (from early December), it will not be possible to identify the root cause of the bug, and closing it makes sense.

In case you hit such an error again:

NSMMReplicationPlugin - process_postop: Failed to apply update (xxx) error (-1).  Aborting replication session(conn=xxx op=xxx)

You would get more information by turning on additional error log levels (http://www.port389.org/docs/389ds/FAQ/faq.html#Troubleshooting), especially replication logging and possibly plug-in logging.
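
A minimal sketch of enabling that over LDAP (8192 = replication debugging, 65536 = plug-in logging, so 73728 enables both; setting the level back to 0 afterwards restores the default):

ldapmodify -x -D "cn=Directory Manager" -W <<EOF
dn: cn=config
changetype: modify
replace: nsslapd-errorlog-level
nsslapd-errorlog-level: 73728
EOF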

Then reopen this bug and attach the logs (access/errors) and the config.

thanks

Comment 7 thierry bordaz 2017-01-19 16:32:35 UTC
Following the previous update, closing the BZ.