1377730 – Replica install fails with old IPA master sometimes during replication process

RHEL Engineering is moving the tracking of its product development work on RHEL 6 through RHEL 9 to Red Hat Jira (issues.redhat.com). If you're a Red Hat customer, please continue to file support cases via the Red Hat customer portal. If you're not, please head to the "RHEL project" in Red Hat Jira and file new tickets here. Individual Bugzilla bugs in the statuses "NEW", "ASSIGNED", and "POST" are being migrated throughout September 2023. Bugs of Red Hat partners with an assigned Engineering Partner Manager (EPM) are migrated in late September as per pre-agreed dates. Bugs against components "kernel", "kernel-rt", and "kpatch" are only migrated if still in "NEW" or "ASSIGNED". If you cannot log in to RH Jira, please consult article #7032570. That failing, please send an e-mail to the RH Jira admins at rh-issues@redhat.com to troubleshoot your issue as a user management inquiry. The email creates a ServiceNow ticket with Red Hat. Individual Bugzilla bugs that are migrated will be moved to status "CLOSED", resolution "MIGRATED", and set with "MigratedToJIRA" in "Keywords". The link to the successor Jira issue will be found under "Links", have a little "two-footprint" icon next to it, and direct you to the "RHEL project" in Red Hat Jira (issue links are of type "https://issues.redhat.com/browse/RHEL-XXXX", where "X" is a digit). This same link will be available in a blue banner at the top of the page informing you that that bug has been migrated.

Bug 1377730 - Replica install fails with old IPA master sometimes during replication process

Summary: Replica install fails with old IPA master sometimes during replication process

Keywords:
Status:	CLOSED ERRATA
Alias:	None
Product:	Red Hat Enterprise Linux 7
Classification:	Red Hat
Component:	389-ds-base
Sub Component:
Version:	7.3
Hardware:	All
OS:	Linux
Priority:	urgent
Severity:	urgent
Target Milestone:	rc
Target Release:	---
Assignee:	Noriko Hosoi
QA Contact:	Viktor Ashirov
Docs Contact:
URL:
Whiteboard:
Depends On:
Blocks:	1378353 1384785 1385300
TreeView+	depends on / blocked

Reported:	2016-09-20 13:27 UTC by Kaleem
Modified:	2020-09-13 21:51 UTC (History)
CC List:	13 users (show)
Fixed In Version:	389-ds-base-1.3.6.1-3.el7
Doc Type:	If docs needed, set a value
Doc Text:
Clone Of:
Clones:	1378353 1384785 1385300 (view as bug list)
Environment:
Last Closed:	2017-08-01 21:10:21 UTC
Target Upstream Version:
Embargoed:

Attachments	(Terms of Use)

Links
System	ID	Priority	Status	Summary	Last Updated
Github	389ds 389-ds-base issues 2051	None	None	None	2020-09-13 21:51:15 UTC
Github	389ds 389-ds-base issues 2059	None	None	None	2020-09-13 21:51:32 UTC
Red Hat Product Errata	RHBA-2017:2086	normal	SHIPPED_LIVE	389-ds-base bug fix and enhancement update	2017-08-01 18:37:38 UTC

Description Kaleem 2016-09-20 13:27:07 UTC

Description of problem:

Replica install fails on 7.3 from a 6.8.z IPA master during replication process.

error seen:
===========
 [27/42]: setting up initial replication
Starting replication, please wait until this has completed.
Update in progress, 303 seconds elapsed
Update succeeded

  [28/42]: adding sasl mappings to the directory
  [29/42]: updating schema
  [30/42]: setting Auto Member configuration
  [31/42]: enabling S4U2Proxy delegation
ipa         : CRITICAL Failed to load replica-s4u2proxy.ldif: Command '/usr/bin/ldapmodify -v -f /tmp/tmpGb3cVr -H ldapi://%2fvar%2frun%2fslapd-TESTRELM-TEST.socket -x -D cn=Directory Manager -y /tmp/tmp31hbs9' returned non-zero exit status 1
  [32/42]: importing CA certificates from LDAP
  [error] DatabaseError: Operations error: 


Version-Release number of selected component (if applicable):
[root@dhcp207-46 ~]# rpm -q ipa-server 389-ds-base pki-ca
ipa-server-3.0.0-50.el6_8.3.x86_64
389-ds-base-1.2.11.15-78.el6_8.x86_64
pki-ca-9.0.3-50.el6_8.noarch
[root@dhcp207-46 ~]#

How reproducible:
3 times out 5 replica install attempts. No change done on replica during these 5 replica install attempts. Also Master is freshly installed for each replica install attempt.

Steps to Reproduce:
1. Install replica on 7.3 from a 6.8.z IPA master.


Actual results:
Replica install fails during dirsrv component configuration

Expected results:
Replica install should be successful

Additional info:
Please refer https://bugzilla.redhat.com/show_bug.cgi?id=1371519#c28 for more details

Comment 3 thierry bordaz 2016-09-21 12:28:59 UTC

    - Schema definitions can be different between two replicated instances.
      Replication relies of all instances sharing a common schema. If a supplier
      detects that schema definitions differs, it 'pushes' its schema at a beginning of 
      a replication session.
      A problem exists if the 'pushed' schema (from supplier) is a subset of
      of the consumer schema. To prevent this it was introduced since DS 1.3.3
      (48988, 47490 and 47676) a mechanism to check, before accepting a pushed schema,
      that the supplier schema is a superset of the consumer schema.

    - The pushed schema is rejected (UNWILLING_TO_PERFORM)
      During the total update, the schema is sent BUT the server schema (6.8)
      being a subset of replica schema (7.3), the update fails
Server
[20/Sep/2016:10:04:00 +0200] NSMMReplicationPlugin - agmt="cn=meTo<vm-fqdn>" (<vm>:389): Schema replication update failed: Server is unwilling to perform
[20/Sep/2016:10:04:00 +0200] NSMMReplicationPlugin - Warning: unable to replicate schema to host <vm-fqdn>, port 389. Continuing with total update session.
[20/Sep/2016:10:04:00 +0200] NSMMReplicationPlugin - Beginning total update of replica "agmt="cn=meTo<vm-fqdn>" (<vm>:389)".

Consumer

[20/Sep/2016:10:05:17.857331444 +0200] conn=4 op=1 BIND dn="cn=replication manager,cn=config" method=128 version=3
[20/Sep/2016:10:05:17.857701775 +0200] conn=4 op=1 RESULT err=0 tag=97 nentries=0 etime=0 dn="cn=replication manager,cn=config"
...
[20/Sep/2016:10:05:17.860818168 +0200] conn=4 op=4 EXT oid="2.16.840.1.113730.3.5.12" name="replication-multimaster-extop"
[20/Sep/2016:10:05:18.206903643 +0200] conn=4 op=5 RESULT err=0 tag=101 nentries=1 etime=1
...
[20/Sep/2016:10:05:18.264291536 +0200] conn=4 op=6 MOD dn="cn=schema"
[20/Sep/2016:10:05:18.970809613 +0200] conn=4 op=6 RESULT err=53 tag=103 nentries=0 etime=0
[20/Sep/2016:10:05:18.971069207 +0200] NSMMReplicationPlugin - process_postop: Failed to apply update () error (53).  Aborting replication session(conn=4 op=6)

    - The total update continues on server side
[20/Sep/2016:10:04:00 +0200] NSMMReplicationPlugin - Warning: unable to replicate schema to host <vm-fqdn>, port 389. Continuing with total update session.

    - Total update is aborted on the client side
      Since DS 1.3.5.1 (https://fedorahosted.org/389/ticket/47788), UNWILLING_TO_PERFORM
      of a replicated operation is seen as a fatal error in the session and
      triggers the closure of the connection.
      In that case, the total init does not complete and server-replica
      will not be insync

    - There is no easy workaround
      The only one I can think of is to copy the consumer (7.3) schema
      on the server side (6.8)

    - it is not clear to me why this problem is not systematic
      The scenario 6.x Server and 7.3 Replica, with schema on 6.x being a subset
      of 7.3 schema,  should systematically trigger the failure of total update

In conclusion:
    Total update can fail between a 6.x server and a replica >= 1.3.5.1,
depending if 6.x schema < 7.3 schema

Comment 4 thierry bordaz 2016-09-21 12:32:14 UTC

The DS upstream ticket is https://fedorahosted.org/389/ticket/48992

A test fix was provided and replica installation succeeded everytime.
Continuing testing (beaker)

Comment 7 thierry bordaz 2016-09-22 16:08:08 UTC

How to detect that you hit that bug:

On the replica side: 
    - In the access log
        - On a replication connection the update of 'cn=schema' return err=53 (unwilling to perform)
        - The replication connection gets closed immediately after the failure
[22/Sep/2016:12:43:32.573970277 +051800] conn=4 fd=66 slot=66 connection ...
[22/Sep/2016:12:43:32.574302922 +051800] conn=4 op=0 EXT oid="1.3.6.1.4.1.1466.20037" name="start_tls_plugin"
[22/Sep/2016:12:43:32.574384201 +051800] conn=4 op=0 RESULT err=0 tag=120 nentries=0 etime=0
[22/Sep/2016:12:43:32.586336334 +051800] conn=4 TLS1.2 256-bit AES
[22/Sep/2016:12:43:32.586562444 +051800] conn=4 op=1 BIND dn="cn=replication manager,cn=config" method=128 version=3
[22/Sep/2016:12:43:32.587033902 +051800] conn=4 op=1 RESULT err=0 tag=97 nentries=0 etime=0 dn="cn=replication manager,cn=config"
..
[22/Sep/2016:12:43:32.589824012 +051800] conn=4 op=4 EXT oid="2.16.840.1.113730.3.5.12" name="replication-multimaster-extop"
[22/Sep/2016:12:43:32.747245763 +051800] conn=4 op=4 RESULT err=0 tag=120 nentries=0 etime=0
[22/Sep/2016:12:43:32.846425733 +051800] conn=4 op=5 SRCH base="cn=schema" scope=0 filter="(objectClass=*)" attrs="nsSchemaCSN"
[22/Sep/2016:12:43:33.096844173 +051800] conn=4 op=5 RESULT err=0 tag=101 nentries=1 etime=0
[22/Sep/2016:12:43:33.228306558 +051800] conn=4 op=6 MOD dn="cn=schema"
[22/Sep/2016:12:43:33.918336785 +051800] conn=4 op=6 RESULT err=53 tag=103 nentries=0 etime=1
[22/Sep/2016:12:43:33.921507058 +051800] conn=4 op=-1 fd=66 closed - B4

    - In the error log at the same time
[22/Sep/2016:12:43:33.860875397 +051800] schema - [C] Local objectClasses must not be overwritten (set replication log for additional info)
[22/Sep/2016:12:43:33.918889184 +051800] NSMMReplicationPlugin - process_postop: Failed to apply update () error (53).  Aborting replication session(conn=4 op=6)
[22/Sep/2016:12:43:33.921988415 +051800] ERROR bulk import abandoned


On the server side:
    - In the error log:
      The following messages are present
[22/Sep/2016:12:43:34 +051800] NSMMReplicationPlugin - agmt="...: Schema replication update failed: Server is unwilling to perform
[22/Sep/2016:12:43:34 +051800] NSMMReplicationPlugin - Warning: unable to replicate schema to host .... Continuing with total update session.
[22/Sep/2016:12:43:34 +051800] NSMMReplicationPlugin - Beginning total update of replica "agmt="...."
    
      But NOT this on
[22/Sep/2016:12:43:40 +051800] NSMMReplicationPlugin - Finished total update of replica "agmt="...". Sent 274 entries.

Comment 8 thierry bordaz 2016-09-22 16:23:11 UTC

The RC is identified and is the one described in https://bugzilla.redhat.com/show_bug.cgi?id=1377730#c3.
When it fails we can see
[22/Sep/2016:12:43:32.573970277 +051800] conn=4 fd=66 slot=66 connection ...
[22/Sep/2016:12:43:32.574302922 +051800] conn=4 op=0 EXT oid="1.3.6.1.4.1.1466.20037" name="start_tls_plugin"
[22/Sep/2016:12:43:32.574384201 +051800] conn=4 op=0 RESULT err=0 tag=120 nentries=0 etime=0
[22/Sep/2016:12:43:32.586336334 +051800] conn=4 TLS1.2 256-bit AES
[22/Sep/2016:12:43:32.586562444 +051800] conn=4 op=1 BIND dn="cn=replication manager,cn=config" method=128 version=3
[22/Sep/2016:12:43:32.587033902 +051800] conn=4 op=1 RESULT err=0 tag=97 nentries=0 etime=0 dn="cn=replication manager,cn=config"
..
[22/Sep/2016:12:43:32.589824012 +051800] conn=4 op=4 EXT oid="2.16.840.1.113730.3.5.12" name="replication-multimaster-extop"
[22/Sep/2016:12:43:32.747245763 +051800] conn=4 op=4 RESULT err=0 tag=120 nentries=0 etime=0
[22/Sep/2016:12:43:32.846425733 +051800] conn=4 op=5 SRCH base="cn=schema" scope=0 filter="(objectClass=*)" attrs="nsSchemaCSN"
[22/Sep/2016:12:43:33.096844173 +051800] conn=4 op=5 RESULT err=0 tag=101 nentries=1 etime=0
[22/Sep/2016:12:43:33.228306558 +051800] conn=4 op=6 MOD dn="cn=schema"
[22/Sep/2016:12:43:33.918336785 +051800] conn=4 op=6 RESULT err=53 tag=103 nentries=0 etime=1
[22/Sep/2016:12:43:33.921507058 +051800] conn=4 op=-1 fd=66 closed - B4

If making the MOD failure not fatal (e.g. err=19), the total update continues even after err=19
[22/Sep/2016:11:00:11.610214950 -0400] conn=4 op=0 EXT oid="1.3.6.1.4.1.1466.20037" name="start_tls_plugin"
[22/Sep/2016:11:00:11.610263490 -0400] conn=4 op=0 RESULT err=0 tag=120 nentries=0 etime=0
[22/Sep/2016:11:00:11.627744863 -0400] conn=4 TLS1.2 256-bit AES
[22/Sep/2016:11:00:11.628268117 -0400] conn=4 op=1 BIND dn="cn=replication manager,cn=config" method=128 version=3
[22/Sep/2016:11:00:11.628666564 -0400] conn=4 op=1 RESULT err=0 tag=97 nentries=0 etime=0 dn="cn=replication manager,cn=config"
...
[22/Sep/2016:11:00:12.388909563 -0400] conn=4 op=6 MOD dn="cn=schema"
[22/Sep/2016:11:00:13.015912789 -0400] conn=4 op=6 RESULT err=19 tag=103 nentries=0 etime=1                                               
[22/Sep/2016:11:00:13.670500128 -0400] conn=4 op=7 EXT oid="2.16.840.1.113730.3.5.6" name="replication-multimaster-extop"
[22/Sep/2016:11:00:13.676098009 -0400] conn=4 op=7 RESULT err=0 tag=120 nentries=0 etime=0
[22/Sep/2016:11:00:13.676219673 -0400] conn=4 op=8 EXT oid="2.16.840.1.113730.3.5.6" name="replication-multimaster-extop"
[22/Sep/2016:11:00:13.676332467 -0400] conn=4 op=8 RESULT err=0 tag=120 nentries=0 etime=0
[22/Sep/2016:11:00:13.676443585 -0400] conn=4 op=9 EXT oid="2.16.840.1.113730.3.5.6" name="replication-multimaster-extop"
[22/Sep/2016:11:00:13.676536335 -0400] conn=4 op=9 RESULT err=0 tag=120 nentries=0 etime=0
[22/Sep/2016:11:00:13.676628945 -0400] conn=4 op=10 EXT oid="2.16.840.1.113730.3.5.6" name="replication-multimaster-extop"
[22/Sep/2016:11:00:13.676706609 -0400] conn=4 op=10 RESULT err=0 tag=120 nentries=0 etime=0
...
[22/Sep/2016:11:00:18.233988631 -0400] conn=4 op=282 UNBIND


The problem occurs with two conditions:
     - replica is > DS 1.3.5 (bug https://fedorahosted.org/389/ticket/47788)
       This bug fix, triggers the connection closure if a replicated update is failure
       for example with err=53
     - The server/supplier is pushing a schema although it is a subset of the replica schema
       This should only occur with 6.x.
       In 7.x server checks its schema is a superset before sending it.

Impact of the bug:
    This bug could trigger a failure of total init as well as incremental update.
    It can be transient, because local 7.x server may "learn" definitions from the received schema
    so that it increase its nsSchemaCSN so that later replication connection will 
    find greater nsSchemaCSN and will not attempt to send the schema

Comment 9 thierry bordaz 2016-09-27 09:04:33 UTC

Fix was pushed upstream master and 1.3.5. Move it to POST

Comment 10 Marc Sauton 2016-09-27 20:36:44 UTC

setting GSSApproved for RHEL-7.3.x
this is important for successful IdM customer migration from RHEL 6 to RHEL 7.

Comment 11 Marc Sauton 2016-09-27 20:37:21 UTC

typo fix:

setting GSSApproved for RHEL-7.3.z
this is important for successful IdM customer migration from RHEL 6 to RHEL 7.

Comment 28 Simon Pichugin 2017-05-18 15:24:31 UTC

As per Kaleem recommendation in IRC, I mark this bugzilla as VERIFIED because 
 of the comment - https://bugzilla.redhat.com/show_bug.cgi?id=1378353#c9
It shows that IPA replica-install on 7.4 works with IPA master on 6.9.

Comment 29 errata-xmlrpc 2017-08-01 21:10:21 UTC

Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2017:2086

Note You need to log in before you can comment on or make changes to this bug.