Bug 788745

Summary: Data inconsistency during replication
Product: Red Hat Enterprise Linux 6 Reporter: Rich Megginson <rmeggins>
Component: 389-ds-base Assignee: Rich Megginson <rmeggins>
Status: CLOSED ERRATA QA Contact: IDM QE LIST <seceng-idm-qe-list>
Severity: unspecified Docs Contact:
Priority: unspecified    
Version: 6.3 CC: amsharma, jgalipea, mreynolds, nhosoi
Target Milestone: rc   
Target Release: ---   
Hardware: Unspecified   
OS: Unspecified   
Whiteboard:
Fixed In Version: 389-ds-base-1.2.10.0-1.el6 Doc Type: Bug Fix
Doc Text:
Cause: CSNs in the RUV were not refreshed when a replication role was changed. Consequence: This caused data inconsistency. Fix: CSNs are now refreshed when the replication role changes. Result: Data inconsistency is no longer observed.
Last Closed: 2012-06-20 07:14:10 UTC

Description Rich Megginson 2012-02-08 22:43:17 UTC
This bug is created as a clone of upstream ticket:
https://fedorahosted.org/389/ticket/18

https://bugzilla.redhat.com/show_bug.cgi?id=750425

{{{
Description of problem:

Data loss during the promotion operation (Slave to Master).
Version-Release number of selected component (if applicable):


How reproducible:


Steps to Reproduce:
Step-1:

Have a topology in which the Master replicates to the Slave and the Slave
replicates to the Consumer.

Master -> Slave-> Consumer.

Step-2:
Make sure that all servers are in sync at this point. For example, assume all
are in sync up to CSN5 (5 records were added to the Master, CSN1 to CSN5).

Step-3:

Delete the replication agreement from Master to Slave and also from Slave to
consumer.

Step-4:

Promote the Slave to master.  Promotion steps are given below.

-       Delete the Supplier DN (cn=suppdn,cn=config) from the Slave.
-       Delete the "cn=replica" entry for the suffix "o=USA" using ldapmodify. As a
result, the changelog file is deleted.
Ex: dn: cn=replica,cn=o=USA,cn=mapping tree,cn=config
changetype: delete
-       Modify the cn=o=USA,cn=mapping tree,cn=config entry as below
Ex: dn: cn=o=USA,cn=mapping tree,cn=config
changetype: modify
replace: nsslapd-state
nsslapd-state: backend

dn: cn=o=USA,cn=mapping tree,cn=config
changetype: modify
delete: nsslapd-referral
-       Recreate the "cn=replica" entry for the suffix as below.
dn: cn=replica,cn=o=USA,cn=mapping tree,cn=config
changetype: add
objectClass: nsds5replica
objectClass: top
nsDS5ReplicaRoot: o=USA
nsDS5ReplicaType: 3
nsDS5Flags: 1
nsDS5ReplicaId: 10  --> Assign the same nsDS5ReplicaId value that the original
master had. In my case, the original master's replica ID was 10.
nsds5ReplicaPurgeDelay: 1
nsds5ReplicaTombstonePurgeInterval: -1
cn: replica
-       Restart the slapd process. The Slave now becomes the Master.

Am I missing anything in the promotion operation, or is this not the right
way to do the promotion?

Step-5:

Add the replication agreement between the Slave (newly promoted Master) and the
Consumer. At this point both the Slave and the Consumer are in sync up to CSN5.
When creating the agreement, do not initialize the Consumer.

           Slave (newly promoted as Master) -> Consumer.

Step-6:

Add 5 more entries to the Slave that was promoted to Master above. Let's
assume the CSNs for these 5 entries are CSN6 to CSN10.

Step-7:

Now you will see that, of the last 5 entries, only the last few get replicated,
and replication continues without halting.


Actual results:

Expected results:


Additional info:
}}}
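
The promotion steps in the description can be sketched as a small LDIF generator. This is only an illustration of the reporter's procedure, not a recommended promotion method: the function name promotion_ldif and its parameters are hypothetical, the DNs are copied in the same unquoted form the report uses (a real deployment may need the suffix quoted or escaped inside the mapping tree DN), and the slapd restart still has to be done separately.

```python
def promotion_ldif(suffix: str, replica_id: int,
                   supplier_dn: str = "cn=suppdn,cn=config") -> str:
    """Build the LDIF change records listed in Step-4 of the report.

    Hypothetical helper: DNs follow the report's unquoted form.
    """
    mapping_dn = f"cn={suffix},cn=mapping tree,cn=config"
    replica_dn = f"cn=replica,{mapping_dn}"
    records = [
        # Remove the supplier bind DN entry from the Slave.
        [f"dn: {supplier_dn}", "changetype: delete"],
        # Remove the old replica entry (this also drops the changelog file).
        [f"dn: {replica_dn}", "changetype: delete"],
        # Switch the suffix from referral mode back to a normal backend.
        [f"dn: {mapping_dn}", "changetype: modify",
         "replace: nsslapd-state", "nsslapd-state: backend"],
        [f"dn: {mapping_dn}", "changetype: modify",
         "delete: nsslapd-referral"],
        # Re-create the replica entry as a master (type 3, flags 1),
        # reusing the original master's replica ID as the report does.
        [f"dn: {replica_dn}", "changetype: add",
         "objectClass: nsds5replica", "objectClass: top",
         f"nsDS5ReplicaRoot: {suffix}", "nsDS5ReplicaType: 3",
         "nsDS5Flags: 1", f"nsDS5ReplicaId: {replica_id}",
         "nsds5ReplicaPurgeDelay: 1",
         "nsds5ReplicaTombstonePurgeInterval: -1", "cn: replica"],
    ]
    # LDIF records are separated by one blank line.
    return "\n\n".join("\n".join(r) for r in records) + "\n"


if __name__ == "__main__":
    print(promotion_ldif("o=USA", 10), end="")
```

The output could be fed to ldapmodify on the Slave; the values o=USA and 10 are the ones used in the report.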

Comment 2 mreynolds 2012-05-11 20:40:29 UTC
Verification steps    

[1]  Set up Master (replica ID = 1), Hub, and Consumer.

[2]  Shut down the Master, reconfigure the Hub as a master, and assign it the Master's replica ID (replica ID = 1).

[3]  Generate an agreement on the NewMaster (ex-Hub) pointing to the same Consumer.

[4]  Now, without initializing the Consumer from the NewMaster, add multiple entries to the NewMaster. For instance, add 5 entries, uid=test0, ..., uid=test4, with a single ldapadd command.

[5]  If all 5 are replicated to the Consumer correctly, the bug is verified.
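
Steps [4] and [5] could be scripted roughly as follows. This is a sketch under assumptions: the helper names make_entries_ldif and missing_on_consumer are hypothetical, the base DN ou=People,dc=example,dc=com is taken from the logs later in this report, and the inetOrgPerson object classes are a guess at what the test entries looked like. Fetching the uid list from the Consumer (for example with ldapsearch) is left out.

```python
def make_entries_ldif(count, base="ou=People,dc=example,dc=com"):
    """Return LDIF for uid=test0 .. uid=test<count-1> under base.

    The output is meant to be passed to a single ldapadd invocation.
    """
    records = []
    for i in range(count):
        uid = f"test{i}"
        records.append("\n".join([
            f"dn: uid={uid},{base}",
            "objectClass: top",
            "objectClass: person",
            "objectClass: organizationalPerson",
            "objectClass: inetOrgPerson",
            f"uid: {uid}",
            f"cn: {uid}",   # cn and sn are required by person
            f"sn: {uid}",
        ]))
    return "\n\n".join(records) + "\n"


def missing_on_consumer(expected_uids, consumer_uids):
    """Return the uids added on the NewMaster that the Consumer lacks."""
    return sorted(set(expected_uids) - set(consumer_uids))


if __name__ == "__main__":
    print(make_entries_ldif(5), end="")
```

An empty list from missing_on_consumer corresponds to the "all 5 are replicated" condition in step [5]; with the bug present, the first few uids would be expected to show up as missing.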

Comment 3 Noriko Hosoi 2012-05-25 00:24:07 UTC
    Technical note added. If any revisions are required, please edit the "Technical Notes" field
    accordingly. All revisions will be proofread by the Engineering Content Services team.

    New Contents:
Cause: CSNs in RUV were not refreshed when a replication role was changed.
Consequence: It caused data inconsistency.
Fix: CSNs are now refreshed when the replication role changes.
Result: Data inconsistency is not observed.
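
As background for the note above, the following sketch shows how a CSN can be decoded and compared against the maximum CSN recorded for a replica ID. It assumes the usual 389 Directory Server textual layout of 20 hex digits (8 for the timestamp, 4 for the sequence number, 4 for the replica ID, 4 for the sub-sequence number). The names CSN, parse_csn and already_covered are hypothetical, the sample values are made up, and the "covered" check is a simplified model of how a change can be treated as already seen, not the server's actual code.

```python
from typing import NamedTuple


class CSN(NamedTuple):
    # Field order matches comparison order: timestamp first.
    timestamp: int
    seqnum: int
    replica_id: int
    subseqnum: int


def parse_csn(text: str) -> CSN:
    """Decode a 20-hex-digit CSN string (assumed layout, see lead-in)."""
    if len(text) != 20:
        raise ValueError(f"expected 20 hex digits, got {len(text)}")
    return CSN(
        timestamp=int(text[0:8], 16),
        seqnum=int(text[8:12], 16),
        replica_id=int(text[12:16], 16),
        subseqnum=int(text[16:20], 16),
    )


def already_covered(change: CSN, ruv_max: CSN) -> bool:
    """Simplified model: a change at or below the recorded max CSN for
    its replica ID looks as if it had already been replicated."""
    return change <= ruv_max


if __name__ == "__main__":
    ruv_max = parse_csn("4fc4f0a2000000010000")   # made-up value
    stale = parse_csn("4fc4f0a1000000010000")     # older timestamp
    fresh = parse_csn("4fc4f0a3000000010000")     # newer timestamp
    print(already_covered(stale, ruv_max), already_covered(fresh, ruv_max))
```

In this model, a promoted server that reuses a replica ID without refreshing its CSNs could issue changes that compare as "already covered", which is consistent with only the last few entries being replicated in the original report.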

Comment 4 Noriko Hosoi 2012-05-25 00:27:08 UTC
    Technical note added. If any revisions are required, please edit the "Technical Notes" field
    accordingly. All revisions will be proofread by the Engineering Content Services team.
    
    New Contents:
Cause: CSNs in RUV were not refreshed when a replication role was changed.
Consequence: It caused data inconsistency.
Fix: CSNs are now refreshed when the replication role changes.
Result: Data inconsistency is not observed.

Comment 5 Amita Sharma 2012-05-29 16:15:46 UTC
Single Master
===============
[root@dhcp201-194 ~]# cat /var/log/dirsrv/slapd-dhcp201-1942/errors
	389-Directory/1.2.10.2 B2012.144.1937
	dhcp201-194.englab.pnq.redhat.com:1389 (/etc/dirsrv/slapd-dhcp201-1942)

[29/May/2012:20:50:05 +051800] NSMMReplicationPlugin - agmt_delete: begin
[29/May/2012:21:04:50 +051800] NSMMReplicationPlugin - Beginning total update of replica "agmt="cn=Master-hub" (dhcp201-194:2389)".
[29/May/2012:21:04:53 +051800] NSMMReplicationPlugin - Finished total update of replica "agmt="cn=Master-hub" (dhcp201-194:2389)". Sent 16 entries.
[29/May/2012:21:13:13 +051800] createprlistensockets - PR_Bind() on All Interfaces port 1389 failed: Netscape Portable Runtime error -5982 (Local Network address is in use.)


Initially Hub, then new Master
=====================================
[root@dhcp201-194 ~]# cat /var/log/dirsrv/slapd-dhcp201-1943/errors
	389-Directory/1.2.10.2 B2012.144.1937
	dhcp201-194.englab.pnq.redhat.com:2389 (/etc/dirsrv/slapd-dhcp201-1943)

[29/May/2012:20:52:42 +051800] NSMMReplicationPlugin - agmt_delete: begin
[29/May/2012:21:04:49 +051800] NSMMReplicationPlugin - multimaster_be_state_change: replica dc=example,dc=com is going offline; disabling replication
[29/May/2012:21:04:50 +051800] - WARNING: Import is running with nsslapd-db-private-import-mem on; No other process is allowed to access the database
[29/May/2012:21:04:52 +051800] - import userRoot: Workers finished; cleaning up...
[29/May/2012:21:04:52 +051800] - import userRoot: Workers cleaned up.
[29/May/2012:21:04:52 +051800] - import userRoot: Indexing complete.  Post-processing...
[29/May/2012:21:04:52 +051800] - import userRoot: Generating numSubordinates complete.
[29/May/2012:21:04:52 +051800] - import userRoot: Flushing caches...
[29/May/2012:21:04:52 +051800] - import userRoot: Closing files...
[29/May/2012:21:04:52 +051800] - import userRoot: Import complete.  Processed 16 entries in 3 seconds. (5.33 entries/sec)
[29/May/2012:21:04:52 +051800] NSMMReplicationPlugin - multimaster_be_state_change: replica dc=example,dc=com is coming online; enabling replication
[29/May/2012:21:04:52 +051800] NSMMReplicationPlugin - replica_reload_ruv: Warning: new data for replica dc=example,dc=com does not match the data in the changelog.
 Recreating the changelog file. This could affect replication with replica's  consumers in which case the consumers should be reinitialized.
[29/May/2012:21:06:16 +051800] NSMMReplicationPlugin - Beginning total update of replica "agmt="cn=hub-consumer" (dhcp201-194:24337)".
[29/May/2012:21:06:20 +051800] NSMMReplicationPlugin - Finished total update of replica "agmt="cn=hub-consumer" (dhcp201-194:24337)". Sent 16 entries.

Consumer
============
[25/May/2012:12:42:37 +051800] - 389-Directory/1.2.10.2 B2012.144.1937 starting up
[25/May/2012:12:42:37 +051800] - slapd started.  Listening on All Interfaces port 24337 for LDAP requests
[29/May/2012:11:53:46 +051800] - slapd shutting down - signaling operation threads
[29/May/2012:11:53:46 +051800] - slapd shutting down - closing down internal subsystems and plugins
[29/May/2012:11:53:48 +051800] - Waiting for 4 database threads to stop
[29/May/2012:11:53:49 +051800] - All database threads now stopped
[29/May/2012:11:53:49 +051800] - slapd stopped.
[29/May/2012:11:58:12 +051800] - 389-Directory/1.2.10.2 B2012.144.1937 starting up
[29/May/2012:11:58:15 +051800] - slapd started.  Listening on All Interfaces port 24337 for LDAP requests
[29/May/2012:21:06:16 +051800] NSMMReplicationPlugin - multimaster_be_state_change: replica dc=example,dc=com is going offline; disabling replication
[29/May/2012:21:06:16 +051800] - WARNING: Import is running with nsslapd-db-private-import-mem on; No other process is allowed to access the database
[29/May/2012:21:06:17 +051800] - import userRoot: WARNING: Skipping entry "nsuniqueid=92a07901-a3f911e1-b71d8fb2-352a3d0e,cn=x,nsuniqueid=d2fef783-a3f711e1-af33e12d-43d608a3,ou=People,dc=example,dc=com" which has no parent, ending at line 0 of file "(bulk import)"
[29/May/2012:21:06:17 +051800] - import userRoot: WARNING: Skipping entry "nsuniqueid=7dc8e301-a40311e1-b71d8fb2-352a3d0e,uid=aami,nsuniqueid=d2fef783-a3f711e1-af33e12d-43d608a3,ou=People,dc=example,dc=com" which has no parent, ending at line 0 of file "(bulk import)"
[29/May/2012:21:06:17 +051800] - import userRoot: WARNING: Skipping entry "nsuniqueid=85888781-a40311e1-bf1bdd56-24a2cdb7,uid=bb,nsuniqueid=d2fef783-a3f711e1-af33e12d-43d608a3,ou=People,dc=example,dc=com" which has no parent, ending at line 0 of file "(bulk import)"
[29/May/2012:21:06:17 +051800] - import userRoot: WARNING: Skipping entry "nsuniqueid=eb4a5001-a57811e1-bf1bdd56-24a2cdb7,uid=tt,nsuniqueid=d2fef783-a3f711e1-af33e12d-43d608a3,ou=People,dc=example,dc=com" which has no parent, ending at line 0 of file "(bulk import)"
[29/May/2012:21:06:17 +051800] - import userRoot: WARNING: Skipping entry "nsuniqueid=c10b9781-a63711e1-b71d8fb2-352a3d0e,uid=xz,nsuniqueid=d2fef783-a3f711e1-af33e12d-43d608a3,ou=People,dc=example,dc=com" which has no parent, ending at line 0 of file "(bulk import)"
[29/May/2012:21:06:17 +051800] - import userRoot: WARNING: bad entry: ID 10
[29/May/2012:21:06:17 +051800] - import userRoot: WARNING: bad entry: ID 11
[29/May/2012:21:06:17 +051800] - import userRoot: WARNING: bad entry: ID 12
[29/May/2012:21:06:17 +051800] - import userRoot: WARNING: bad entry: ID 13
[29/May/2012:21:06:17 +051800] - import userRoot: WARNING: bad entry: ID 14
[29/May/2012:21:06:19 +051800] - import userRoot: Workers finished; cleaning up...
[29/May/2012:21:06:19 +051800] - import userRoot: Workers cleaned up.
[29/May/2012:21:06:19 +051800] - import userRoot: Indexing complete.  Post-processing...
[29/May/2012:21:06:19 +051800] - import userRoot: Generating numSubordinates complete.
[29/May/2012:21:06:19 +051800] - import userRoot: Flushing caches...
[29/May/2012:21:06:19 +051800] - import userRoot: Closing files...
[29/May/2012:21:06:19 +051800] - import userRoot: Import complete.  Processed 16 entries (5 were skipped) in 3 seconds. (5.33 entries/sec)
[29/May/2012:21:06:19 +051800] NSMMReplicationPlugin - multimaster_be_state_change: replica dc=example,dc=com is coming online; enabling replication

NOTE: Entries were replicated between the new Master and the Consumer, so the bug is VERIFIED.
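
When checking error logs like the ones above, the import summary line can be extracted with a short script. A minimal sketch, assuming the "Import complete" wording shown in these logs; the function name import_summary is hypothetical and the pattern has only been matched against the two summary lines quoted in this comment.

```python
import re
from typing import Optional, Tuple

# Matches e.g. "Import complete.  Processed 16 entries (5 were skipped) in 3 seconds."
_SUMMARY = re.compile(
    r"Import complete\.\s+Processed (\d+) entries"
    r"(?: \((\d+) were skipped\))? in \d+ seconds"
)


def import_summary(line: str) -> Optional[Tuple[int, int]]:
    """Return (processed, skipped) for an import summary line, else None."""
    match = _SUMMARY.search(line)
    if match is None:
        return None
    # The "(N were skipped)" part is absent when nothing was skipped.
    return int(match.group(1)), int(match.group(2) or 0)


if __name__ == "__main__":
    sample = ("[29/May/2012:21:06:19 +051800] - import userRoot: Import complete.  "
              "Processed 16 entries (5 were skipped) in 3 seconds. (5.33 entries/sec)")
    print(import_summary(sample))
```

Run over the Hub and Consumer logs, this makes the difference between the two total updates (no skipped entries on the Hub, 5 skipped on the Consumer) easy to spot.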

Comment 6 errata-xmlrpc 2012-06-20 07:14:10 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

http://rhn.redhat.com/errata/RHSA-2012-0813.html