Bug 1377730

Summary: Replica install from an old IPA master sometimes fails during the replication process
Product: Red Hat Enterprise Linux 7
Reporter: Kaleem <ksiddiqu>
Component: 389-ds-base
Assignee: Noriko Hosoi <nhosoi>
Status: CLOSED ERRATA
QA Contact: Viktor Ashirov <vashirov>
Severity: urgent
Priority: urgent
Version: 7.3
CC: anazmy, gagriogi, gparente, mkolaja, mreynolds, msauton, nkinder, pvoborni, rcritten, rmeggins, spichugi, tbordaz, vonsch
Target Milestone: rc
Keywords: Regression, ZStream
Hardware: All
OS: Linux
Fixed In Version: 389-ds-base-1.3.6.1-3.el7
Clones: 1378353, 1384785, 1385300
Last Closed: 2017-08-01 21:10:21 UTC
Type: Bug
Bug Blocks: 1378353, 1384785, 1385300

Description Kaleem 2016-09-20 13:27:07 UTC
Description of problem:

Replica install on 7.3 from a 6.8.z IPA master fails intermittently during the replication process.

error seen:
===========
 [27/42]: setting up initial replication
Starting replication, please wait until this has completed.
Update in progress, 303 seconds elapsed
Update succeeded

  [28/42]: adding sasl mappings to the directory
  [29/42]: updating schema
  [30/42]: setting Auto Member configuration
  [31/42]: enabling S4U2Proxy delegation
ipa         : CRITICAL Failed to load replica-s4u2proxy.ldif: Command '/usr/bin/ldapmodify -v -f /tmp/tmpGb3cVr -H ldapi://%2fvar%2frun%2fslapd-TESTRELM-TEST.socket -x -D cn=Directory Manager -y /tmp/tmp31hbs9' returned non-zero exit status 1
  [32/42]: importing CA certificates from LDAP
  [error] DatabaseError: Operations error: 


Version-Release number of selected component (if applicable):
[root@dhcp207-46 ~]# rpm -q ipa-server 389-ds-base pki-ca
ipa-server-3.0.0-50.el6_8.3.x86_64
389-ds-base-1.2.11.15-78.el6_8.x86_64
pki-ca-9.0.3-50.el6_8.noarch
[root@dhcp207-46 ~]#

How reproducible:
3 times out of 5 replica install attempts. No changes were made on the replica between these 5 attempts, and the master was freshly installed for each attempt.

Steps to Reproduce:
1. Install a replica on 7.3 from a 6.8.z IPA master.


Actual results:
Replica install fails during dirsrv component configuration

Expected results:
Replica install should be successful

Additional info:
Please refer to https://bugzilla.redhat.com/show_bug.cgi?id=1371519#c28 for more details.

Comment 3 thierry bordaz 2016-09-21 12:28:59 UTC
    - Schema definitions can differ between two replicated instances.
      Replication relies on all instances sharing a common schema. If a supplier
      detects that the schema definitions differ, it 'pushes' its schema at the
      beginning of a replication session.
      A problem exists if the 'pushed' schema (from the supplier) is a subset of
      the consumer schema. To prevent this, DS 1.3.3 introduced a mechanism
      (48988, 47490 and 47676) that checks, before accepting a pushed schema,
      that the supplier schema is a superset of the consumer schema (see the
      sketch at the end of this comment).

    - The pushed schema is rejected (UNWILLING_TO_PERFORM).
      During the total update, the schema is sent, BUT since the server schema
      (6.8) is a subset of the replica schema (7.3), the update fails.
Server
[20/Sep/2016:10:04:00 +0200] NSMMReplicationPlugin - agmt="cn=meTo<vm-fqdn>" (<vm>:389): Schema replication update failed: Server is unwilling to perform
[20/Sep/2016:10:04:00 +0200] NSMMReplicationPlugin - Warning: unable to replicate schema to host <vm-fqdn>, port 389. Continuing with total update session.
[20/Sep/2016:10:04:00 +0200] NSMMReplicationPlugin - Beginning total update of replica "agmt="cn=meTo<vm-fqdn>" (<vm>:389)".

Consumer

[20/Sep/2016:10:05:17.857331444 +0200] conn=4 op=1 BIND dn="cn=replication manager,cn=config" method=128 version=3
[20/Sep/2016:10:05:17.857701775 +0200] conn=4 op=1 RESULT err=0 tag=97 nentries=0 etime=0 dn="cn=replication manager,cn=config"
...
[20/Sep/2016:10:05:17.860818168 +0200] conn=4 op=4 EXT oid="2.16.840.1.113730.3.5.12" name="replication-multimaster-extop"
[20/Sep/2016:10:05:18.206903643 +0200] conn=4 op=5 RESULT err=0 tag=101 nentries=1 etime=1
...
[20/Sep/2016:10:05:18.264291536 +0200] conn=4 op=6 MOD dn="cn=schema"
[20/Sep/2016:10:05:18.970809613 +0200] conn=4 op=6 RESULT err=53 tag=103 nentries=0 etime=0
[20/Sep/2016:10:05:18.971069207 +0200] NSMMReplicationPlugin - process_postop: Failed to apply update () error (53).  Aborting replication session(conn=4 op=6)

    - The total update continues on the server side
[20/Sep/2016:10:04:00 +0200] NSMMReplicationPlugin - Warning: unable to replicate schema to host <vm-fqdn>, port 389. Continuing with total update session.

    - The total update is aborted on the client side.
      Since DS 1.3.5.1 (https://fedorahosted.org/389/ticket/47788), UNWILLING_TO_PERFORM
      on a replicated operation is treated as a fatal error for the session and
      triggers the closure of the connection.
      In that case, the total init does not complete and the server and replica
      will not be in sync.

    - There is no easy workaround.
      The only one I can think of is to copy the consumer (7.3) schema
      to the server side (6.8).

    - It is not clear to me why this problem is not systematic.
      The scenario of a 6.x server and a 7.3 replica, with the 6.x schema being a
      subset of the 7.3 schema, should systematically trigger the failure of the
      total update.

In conclusion:
    Total update can fail between a 6.x server and a replica >= 1.3.5.1,
    depending on whether the 6.x schema is a subset of the 7.3 schema.
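
For illustration, a minimal Python sketch of the superset check mentioned above,
treating schema as plain sets of definition names (the actual check in
389-ds-base is C code comparing full attribute/objectClass definitions; the
function name and sample values here are hypothetical):

def supplier_may_push_schema(supplier_defs, consumer_defs):
    # Accept the pushed schema only if it is a superset of the consumer's,
    # i.e. applying it cannot drop definitions the consumer already has.
    return set(supplier_defs).issuperset(consumer_defs)

# A 6.8 supplier missing a definition that the 7.3 consumer already has:
supplier = {"top", "person", "posixAccount"}
consumer = {"top", "person", "posixAccount", "ipaSshGroupOfPubKeys"}
assert not supplier_may_push_schema(supplier, consumer)  # push refused (err=53)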

Comment 4 thierry bordaz 2016-09-21 12:32:14 UTC
The DS upstream ticket is https://fedorahosted.org/389/ticket/48992

A test fix was provided and the replica installation succeeded every time.
Testing continues (beaker).

Comment 7 thierry bordaz 2016-09-22 16:08:08 UTC
How to detect that you hit that bug (a scripted check is sketched at the end of this comment):

On the replica side: 
    - In the access log:
        - On a replication connection, the update of 'cn=schema' returns err=53 (unwilling to perform)
        - The replication connection gets closed immediately after the failure
[22/Sep/2016:12:43:32.573970277 +051800] conn=4 fd=66 slot=66 connection ...
[22/Sep/2016:12:43:32.574302922 +051800] conn=4 op=0 EXT oid="1.3.6.1.4.1.1466.20037" name="start_tls_plugin"
[22/Sep/2016:12:43:32.574384201 +051800] conn=4 op=0 RESULT err=0 tag=120 nentries=0 etime=0
[22/Sep/2016:12:43:32.586336334 +051800] conn=4 TLS1.2 256-bit AES
[22/Sep/2016:12:43:32.586562444 +051800] conn=4 op=1 BIND dn="cn=replication manager,cn=config" method=128 version=3
[22/Sep/2016:12:43:32.587033902 +051800] conn=4 op=1 RESULT err=0 tag=97 nentries=0 etime=0 dn="cn=replication manager,cn=config"
..
[22/Sep/2016:12:43:32.589824012 +051800] conn=4 op=4 EXT oid="2.16.840.1.113730.3.5.12" name="replication-multimaster-extop"
[22/Sep/2016:12:43:32.747245763 +051800] conn=4 op=4 RESULT err=0 tag=120 nentries=0 etime=0
[22/Sep/2016:12:43:32.846425733 +051800] conn=4 op=5 SRCH base="cn=schema" scope=0 filter="(objectClass=*)" attrs="nsSchemaCSN"
[22/Sep/2016:12:43:33.096844173 +051800] conn=4 op=5 RESULT err=0 tag=101 nentries=1 etime=0
[22/Sep/2016:12:43:33.228306558 +051800] conn=4 op=6 MOD dn="cn=schema"
[22/Sep/2016:12:43:33.918336785 +051800] conn=4 op=6 RESULT err=53 tag=103 nentries=0 etime=1
[22/Sep/2016:12:43:33.921507058 +051800] conn=4 op=-1 fd=66 closed - B4

    - In the error log at the same time
[22/Sep/2016:12:43:33.860875397 +051800] schema - [C] Local objectClasses must not be overwritten (set replication log for additional info)
[22/Sep/2016:12:43:33.918889184 +051800] NSMMReplicationPlugin - process_postop: Failed to apply update () error (53).  Aborting replication session(conn=4 op=6)
[22/Sep/2016:12:43:33.921988415 +051800] ERROR bulk import abandoned


On the server side:
    - In the error log:
      The following messages are present:
[22/Sep/2016:12:43:34 +051800] NSMMReplicationPlugin - agmt="...: Schema replication update failed: Server is unwilling to perform
[22/Sep/2016:12:43:34 +051800] NSMMReplicationPlugin - Warning: unable to replicate schema to host .... Continuing with total update session.
[22/Sep/2016:12:43:34 +051800] NSMMReplicationPlugin - Beginning total update of replica "agmt="...."
    
      But NOT this one:
[22/Sep/2016:12:43:40 +051800] NSMMReplicationPlugin - Finished total update of replica "agmt="...". Sent 274 entries.
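
For convenience, a rough Python sketch that scans a replica access log for this
signature (regexes are based on the excerpts above; an illustrative helper, not
a supported tool):

import re
import sys

def find_schema_push_failures(access_log_path):
    # Track (conn, op) pairs of MODs on cn=schema, then flag those answered
    # with RESULT err=53 (unwilling to perform).
    pending = set()
    failures = []
    with open(access_log_path) as log:
        for line in log:
            m = re.search(r'conn=(\d+) op=(\d+) MOD dn="cn=schema"', line)
            if m:
                pending.add(m.groups())
                continue
            m = re.search(r'conn=(\d+) op=(\d+) RESULT err=53', line)
            if m and m.groups() in pending:
                failures.append(m.groups())
    return failures

if __name__ == "__main__":
    for conn, op in find_schema_push_failures(sys.argv[1]):
        print("schema push rejected: conn=%s op=%s (err=53)" % (conn, op))

Run it against the replica's access log (typically
/var/log/dirsrv/slapd-<instance>/access); any hits, combined with the
"Schema replication update failed" message on the server side, indicate this bug.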

Comment 8 thierry bordaz 2016-09-22 16:23:11 UTC
The root cause is identified and is the one described in https://bugzilla.redhat.com/show_bug.cgi?id=1377730#c3.
When it fails, we can see:
[22/Sep/2016:12:43:32.573970277 +051800] conn=4 fd=66 slot=66 connection ...
[22/Sep/2016:12:43:32.574302922 +051800] conn=4 op=0 EXT oid="1.3.6.1.4.1.1466.20037" name="start_tls_plugin"
[22/Sep/2016:12:43:32.574384201 +051800] conn=4 op=0 RESULT err=0 tag=120 nentries=0 etime=0
[22/Sep/2016:12:43:32.586336334 +051800] conn=4 TLS1.2 256-bit AES
[22/Sep/2016:12:43:32.586562444 +051800] conn=4 op=1 BIND dn="cn=replication manager,cn=config" method=128 version=3
[22/Sep/2016:12:43:32.587033902 +051800] conn=4 op=1 RESULT err=0 tag=97 nentries=0 etime=0 dn="cn=replication manager,cn=config"
..
[22/Sep/2016:12:43:32.589824012 +051800] conn=4 op=4 EXT oid="2.16.840.1.113730.3.5.12" name="replication-multimaster-extop"
[22/Sep/2016:12:43:32.747245763 +051800] conn=4 op=4 RESULT err=0 tag=120 nentries=0 etime=0
[22/Sep/2016:12:43:32.846425733 +051800] conn=4 op=5 SRCH base="cn=schema" scope=0 filter="(objectClass=*)" attrs="nsSchemaCSN"
[22/Sep/2016:12:43:33.096844173 +051800] conn=4 op=5 RESULT err=0 tag=101 nentries=1 etime=0
[22/Sep/2016:12:43:33.228306558 +051800] conn=4 op=6 MOD dn="cn=schema"
[22/Sep/2016:12:43:33.918336785 +051800] conn=4 op=6 RESULT err=53 tag=103 nentries=0 etime=1
[22/Sep/2016:12:43:33.921507058 +051800] conn=4 op=-1 fd=66 closed - B4

If the MOD failure is made non-fatal (e.g. err=19), the total update continues even after the error:
[22/Sep/2016:11:00:11.610214950 -0400] conn=4 op=0 EXT oid="1.3.6.1.4.1.1466.20037" name="start_tls_plugin"
[22/Sep/2016:11:00:11.610263490 -0400] conn=4 op=0 RESULT err=0 tag=120 nentries=0 etime=0
[22/Sep/2016:11:00:11.627744863 -0400] conn=4 TLS1.2 256-bit AES
[22/Sep/2016:11:00:11.628268117 -0400] conn=4 op=1 BIND dn="cn=replication manager,cn=config" method=128 version=3
[22/Sep/2016:11:00:11.628666564 -0400] conn=4 op=1 RESULT err=0 tag=97 nentries=0 etime=0 dn="cn=replication manager,cn=config"
...
[22/Sep/2016:11:00:12.388909563 -0400] conn=4 op=6 MOD dn="cn=schema"
[22/Sep/2016:11:00:13.015912789 -0400] conn=4 op=6 RESULT err=19 tag=103 nentries=0 etime=1                                               
[22/Sep/2016:11:00:13.670500128 -0400] conn=4 op=7 EXT oid="2.16.840.1.113730.3.5.6" name="replication-multimaster-extop"
[22/Sep/2016:11:00:13.676098009 -0400] conn=4 op=7 RESULT err=0 tag=120 nentries=0 etime=0
[22/Sep/2016:11:00:13.676219673 -0400] conn=4 op=8 EXT oid="2.16.840.1.113730.3.5.6" name="replication-multimaster-extop"
[22/Sep/2016:11:00:13.676332467 -0400] conn=4 op=8 RESULT err=0 tag=120 nentries=0 etime=0
[22/Sep/2016:11:00:13.676443585 -0400] conn=4 op=9 EXT oid="2.16.840.1.113730.3.5.6" name="replication-multimaster-extop"
[22/Sep/2016:11:00:13.676536335 -0400] conn=4 op=9 RESULT err=0 tag=120 nentries=0 etime=0
[22/Sep/2016:11:00:13.676628945 -0400] conn=4 op=10 EXT oid="2.16.840.1.113730.3.5.6" name="replication-multimaster-extop"
[22/Sep/2016:11:00:13.676706609 -0400] conn=4 op=10 RESULT err=0 tag=120 nentries=0 etime=0
...
[22/Sep/2016:11:00:18.233988631 -0400] conn=4 op=282 UNBIND


The problem occurs under two conditions:
     - The replica is >= DS 1.3.5.1 (bug https://fedorahosted.org/389/ticket/47788).
       That bug fix triggers the connection closure if a replicated update fails,
       for example with err=53.
     - The server/supplier pushes its schema although it is a subset of the replica
       schema. This should only occur with 6.x; in 7.x the server checks that its
       schema is a superset before sending it.

Impact of the bug:
    This bug can trigger a failure of the total init as well as of incremental updates.
    It can be transient, because the local 7.x server may "learn" definitions from the
    received schema, which increases its nsSchemaCSN; later replication sessions then
    find a greater nsSchemaCSN and do not attempt to send the schema (see the sketch
    below).
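
A simplified Python model of that transient behavior (hypothetical CSN values;
the real decision logic lives in the 389-ds replication plugin):

def supplier_should_push_schema(supplier_schema_csn, consumer_schema_csn):
    # The supplier pushes its schema only when its nsSchemaCSN is newer
    # than the consumer's.
    return supplier_schema_csn > consumer_schema_csn

# First session: the 6.x supplier's CSN is newer, so it pushes (and gets err=53).
assert supplier_should_push_schema("57e3b1a0", "57e3a000")
# Once the 7.x side "learns" definitions and its nsSchemaCSN moves ahead,
# later sessions skip the push and no longer hit the rejection.
assert not supplier_should_push_schema("57e3b1a0", "57e4c500")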

Comment 9 thierry bordaz 2016-09-27 09:04:33 UTC
The fix was pushed upstream to master and 1.3.5. Moving it to POST.

Comment 10 Marc Sauton 2016-09-27 20:36:44 UTC
setting GSSApproved for RHEL-7.3.x
this is important for successful IdM customer migration from RHEL 6 to RHEL 7.

Comment 11 Marc Sauton 2016-09-27 20:37:21 UTC
typo fix:

setting GSSApproved for RHEL-7.3.z
this is important for successful IdM customer migration from RHEL 6 to RHEL 7.

Comment 28 Simon Pichugin 2017-05-18 15:24:31 UTC
As per Kaleem's recommendation on IRC, I am marking this bugzilla as VERIFIED
based on the comment https://bugzilla.redhat.com/show_bug.cgi?id=1378353#c9.
It shows that IPA replica-install on 7.4 works with an IPA master on 6.9.

Comment 29 errata-xmlrpc 2017-08-01 21:10:21 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2017:2086