Bug 750425

Summary: Data inconsistency during replication
Product: [Retired] 389
Reporter: Jyoti ranjan das <jyoti-ranjan.das>
Component: Replication - General
Assignee: Rich Megginson <rmeggins>
Status: CLOSED CURRENTRELEASE
QA Contact: Ben Levenson <benl>
Severity: high
Priority: unspecified
Version: 1.2.1
CC: nhosoi
Hardware: All   
OS: Other   
Doc Type: Bug Fix
Last Closed: 2015-12-10 18:44:34 UTC
Bug Blocks: 743970    

Description Jyoti ranjan das 2011-11-01 04:09:59 UTC
Description of problem:

Data loss during the promotion operation (Slave to Master).
Version-Release number of selected component (if applicable):


How reproducible:


Steps to Reproduce:
Step-1:

Set up a topology in which the Master replicates to a Slave and the Slave replicates to a Consumer:

Master -> Slave -> Consumer

Step-2: 
Make sure that all servers are in sync at this point. For example, assume all are in sync up to CSN5 (5 entries were added to the Master, generating CSN1 through CSN5).

Step-3:

Delete the replication agreement from Master to Slave and also the one from Slave to Consumer.

Step-4:

Promote the Slave to Master. The promotion steps are given below (a consolidated command sketch follows the list).

- Delete the Supplier DN entry (cn=suppdn,cn=config) from the Slave
- Delete the “cn=replica” entry for the suffix “o=USA” using ldapmodify. As a result, the changelog file is deleted.
Ex: dn: cn=replica,cn=o=USA,cn=mapping tree,cn=config
changetype: delete
- Modify the cn=o=USA,cn=mapping tree,cn=config entry as below
Ex: dn: cn=o=USA,cn=mapping tree,cn=config
changetype: modify
replace: nsslapd-state
nsslapd-state: backend

dn: cn=o=USA,cn=mapping tree,cn=config
changetype: modify
delete: nsslapd-referral
-	Recreate the “cn=replica” entry for the suffix as below.
dn: cn=replica,cn=o=USA,cn=mapping tree,cn=config
changetype: add
objectClass: nsds5replica
objectClass: top
nsDS5ReplicaRoot: o=USA
nsDS5ReplicaType: 3 
nsDS5Flags: 1
nsDS5ReplicaId: 10  -- Assign the same nsDS5ReplicaId value the original Master had; in my case, the original Master's replica ID was 10.
nsds5ReplicaPurgeDelay: 1
nsds5ReplicaTombstonePurgeInterval: -1
cn: replica
- Restart the slapd process. Now the Slave has become the Master.
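For reference, here is a consolidated sketch of the promotion changes above as a single ldapmodify run. The host, port, and bind credentials are placeholders; the LDIF simply collects the changes listed in the steps and should match your own suffix and original replica ID:

# promote.ldif -- illustrative only
dn: cn=replica,cn=o=USA,cn=mapping tree,cn=config
changetype: delete

dn: cn=o=USA,cn=mapping tree,cn=config
changetype: modify
replace: nsslapd-state
nsslapd-state: backend
-
delete: nsslapd-referral

dn: cn=replica,cn=o=USA,cn=mapping tree,cn=config
changetype: add
objectClass: top
objectClass: nsds5replica
cn: replica
nsDS5ReplicaRoot: o=USA
nsDS5ReplicaType: 3
nsDS5Flags: 1
nsDS5ReplicaId: 10

Apply it and then restart the instance, for example:

ldapmodify -x -H ldap://slave.example.com:389 -D "cn=Directory Manager" -W -f promote.ldif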

Is there anything I am missing in the promotion procedure, or is this not the right way to perform a promotion?

Step -5:

Add a replication agreement between the Slave (the newly promoted Master) and the Consumer; a sketch of such an agreement entry is shown below. At this point both the Slave and the Consumer are in sync up to CSN5. Please do not initialize the consumer when creating the agreement.

           Slave (newly promoted Master) -> Consumer
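For illustration, the agreement entry could look roughly like the following. The agreement name, consumer host/port, bind DN, and credentials are placeholders; note that no nsDS5BeginReplicaRefresh attribute is set, so adding the entry does not trigger a consumer initialization:

dn: cn=newmaster_2_consumer,cn=replica,cn=o=USA,cn=mapping tree,cn=config
changetype: add
objectClass: top
objectClass: nsds5replicationagreement
cn: newmaster_2_consumer
nsDS5ReplicaRoot: o=USA
nsDS5ReplicaHost: consumer.example.com
nsDS5ReplicaPort: 389
nsDS5ReplicaBindDN: cn=replication manager,cn=config
nsDS5ReplicaCredentials: secret
nsDS5ReplicaBindMethod: SIMPLE
nsDS5ReplicaTransportInfo: LDAP
description: agreement from the newly promoted Master to the Consumer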

Step-6:

Add 5 more entries to the Slave that was promoted to Master above. Assume the CSNs for these 5 entries are CSN6 through CSN10.
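For example, the extra entries could be simple test users added with ldapadd (the DN pattern, attribute values, and host are made up for illustration; repeat for test7 through test10):

ldapadd -x -H ldap://slave.example.com:389 -D "cn=Directory Manager" -W <<EOF
dn: uid=test6,o=USA
objectClass: top
objectClass: person
objectClass: organizationalPerson
objectClass: inetOrgPerson
uid: test6
cn: test6
sn: test6
EOF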

Step-7:

Now you will see that, of the last 5 entries, only the last few get replicated, and replication continues without halting.

  
Actual results:

Expected results:


Additional info:

Comment 1 Noriko Hosoi 2011-12-22 21:56:54 UTC
I followed your steps (except for "Delete Supplier DN (cn=suppdn,cn=config) from Slave", as I could not figure out what you meant by this), but I could not duplicate your problem.

First, I set up:
  single master --> hub --> readonly replica
  (ReplicaId: 1)  (65535)   (65535)

Then, I disabled the agreement of the single master.
Promoted hub to a single master having ReplicaId=1.
Created a new agreement from the new master to readonly replica.
  new single master --> readonly replica
  (ReplicaId: 1)        (65535)

I added 10 entries to the new master and verified all of them are replicated.
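One way to verify this, for example, is to compare entry counts under the replicated suffix on both servers (hostnames, suffix, and credentials below are placeholders):

ldapsearch -x -H ldap://newmaster.example.com:389 -D "cn=Directory Manager" -W -b "o=USA" -s sub "(objectClass=*)" dn | grep -c '^dn:'
ldapsearch -x -H ldap://replica.example.com:389 -D "cn=Directory Manager" -W -b "o=USA" -s sub "(objectClass=*)" dn | grep -c '^dn:'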

> Now, you will see, among the last 5 entries only last few will gets replicated
> without halting the replication.
Could you tell us how you got this result?  Did you run ldapsearch or some other client command and find that just the "last few" entries were returned?  What does the access log say for the search request?

If you could attach the config file and the errors/access log files of the new single master and the readonly replica, it might be a big help.

Comment 2 Martin Kosek 2012-01-04 13:18:02 UTC
Upstream ticket:
https://fedorahosted.org/389/ticket/18

Comment 3 Rich Megginson 2012-01-10 17:38:08 UTC
marking as screened because it has been cloned upstream

Comment 4 Noriko Hosoi 2012-01-19 19:56:46 UTC
Thanks to Jyoti for providing us the steps to reproduce.  I could reproduce the problem and find the cause.

See https://fedorahosted.org/389/ticket/18.

Comment 5 Jyoti ranjan das 2012-01-24 08:51:43 UTC
Hi Noriko,

Thanks a lot for the patch. But I have a few more observations here.

I slightly changed the reproducer which I had provided you before. In this case, at step 5, after restarting the newly promoted master, I added 5 more entries (test6 to test10) and then added the replication agreement between the newly promoted master and the consumer. I observed the same behavior as before, i.e. only entry test10 got replicated to the consumer, whereas the consumer missed the changes from test6 to test9.

In this case I observed that, after adding the replication agreement, the consumer reported the CSN for entry test5, and the newly promoted master did not find that CSN either in the changelog db file or in the purge RUV list, so it made the following assumption and continued with replication.

(copied from file /ldapserver/ldap/servers/plugins/replication/cl5_api.c#5810)

/* there is a special case which can occur just after migration - in this case,
the consumer RUV will contain the last state of the supplier before migration,
but the supplier will have an empty changelog, or the supplier changelog will
not contain any entries within the consumer min and max CSN - also, since
the purge RUV contains no CSNs, the changelog has never been purged
ASSUMPTIONS - it is assumed that the supplier had no pending changes to send
to any consumers; that is, we can assume that no changes were lost due to
either changelog purging or database reload - bug# 603061 - richm
*/

The message below was logged in the error log file.

====
[24/Jan/2012:11:37:38 +051800] NSMMReplicationPlugin - changelog program - agmt="cn=hub_2_consumer" (sysmg7:6401): CSN 4f1e464f0004000a0000 not found and no purging, probably consumer may need to be reinitialized
=====
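(For reference, the consumer-side RUV that carries the maxCSN the supplier is looking for, 4f1e464f0004000a0000 in this run, can be read from the RUV tombstone entry under the suffix; a query along these lines, with placeholder host and credentials:

ldapsearch -x -H ldap://consumer.example.com:389 -D "cn=Directory Manager" -W -b "o=USA" "(&(nsuniqueid=ffffffff-ffffffff-ffffffff-ffffffff)(objectClass=nsTombstone))" nsds50ruv
)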

Here my question is: once we make the above assumption, why do we start from the last record the supplier has received, i.e. test10 in our test scenario?

Is there any specific reason for this?

Don't you think we should start from the first change which the supplier has ever received (in our test scenario, that would be test6) once we make the above assumption?

Please advise.

regards,
Jyoti

Comment 6 Noriko Hosoi 2012-01-24 18:48:05 UTC
Jyoti,

Could there be any other errors/warnings/info logged in the error log?
For instance, one of these?
"Warning: new data for replica %s does not match the data in the changelog.\n"
" Recreating the changelog file. This could affect replication with replica's "
" consumers in which case the consumers should be reinitialized.\n",

"Warning: The changelog for replica %s is no longer valid since "
"the replica config is being deleted.  Removing the changelog.\n",

Comment 7 Jyoti ranjan das 2012-01-25 05:33:42 UTC
(In reply to comment #6)
> Jyoti,
> 
> Could there be any other errors/warnings/info logged in the error log?
> For instance, one of these?
> "Warning: new data for replica %s does not match the data in the changelog.\n"
> " Recreating the changelog file. This could affect replication with replica's "
> " consumers in which case the consumers should be reinitialized.\n",
> 
> "Warning: The changelog for replica %s is no longer valid since "
> "the replica config is being deleted.  Removing the changelog.\n",



Hi Noriko,

Yes, I can see that the second warning message is logged in the error log.

But I feel it is the expected behavior as per the current design: if you delete the "cn=replica,cn=<suffix>,cn=mapping tree,cn=config" entry while the server is running, using ldapmodify/ldapdelete, that in turn deletes the changelog db file.

Here my point is: if we continue with replication by making the above-mentioned assumption even after the changelog db file has been deleted, then we should continue from the first operation logged in the changelog db file instead of the last record.


Regards,
Jyoti

Comment 8 Jyoti ranjan das 2012-01-27 11:30:24 UTC
Hi Noriko,

I have another question. As we saw, deletion of cn=replica,cn=<suffix>,cn=mapping tree,cn=config in turn deletes the changelog file.
To retain the changelog db file, is there an option to modify the "cn=replica,cn=<suffix>,cn=mapping tree,cn=config" entry during the promotion/demotion operation instead of deleting it?

Do you foresee any issue with this approach in a replication environment? (An illustrative sketch of the kind of modification I mean follows below.)
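Purely for illustration (whether the server would accept such a live change of role, rather than a delete and re-add, is exactly what I am asking; the values follow the earlier example):

dn: cn=replica,cn=o=USA,cn=mapping tree,cn=config
changetype: modify
replace: nsDS5ReplicaType
nsDS5ReplicaType: 3
-
replace: nsDS5Flags
nsDS5Flags: 1
-
replace: nsDS5ReplicaId
nsDS5ReplicaId: 10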

regards,
Jyoti

Comment 9 Jyoti ranjan das 2012-02-13 10:02:57 UTC
Hi Noriko,

Do you have any input on this particular issue?

regards,
Jyoti

Comment 10 Noriko Hosoi 2012-02-13 17:27:33 UTC
(In reply to comment #9)
> Hi Noriko,
> 
> Do you have any input on this particular issue?
> 
> regards,
> Jyoti

Sorry, Jyoti.  I thought this bug was already closed.

I don't think we can retain a changelog file when the replica is deleted, for now.

Please open an RFE bug with the use case.

Comment 11 Jyoti ranjan das 2012-02-14 05:55:25 UTC
(In reply to comment #10)
> (In reply to comment #9)
> > Hi Noriko,
> > 
> > Do you have any input on this particular issue?
> > 
> > regards,
> > Jyoti
> 
> Sorry, Jyoti.  I thought this bug was already closed.
> 
> I don't think we could retain a changelog file if the replica is deleted for
> now.
> 
> Please open an RFE bug with the use case.

Hi Noriko,

Thanks for your reply. 

I have three sets of questions below. If you could help answer them, it would be really appreciated.


==== first====
Do you have any idea why the decision was taken to remove the changelog db file when the replica is deleted? That was not the case earlier; the server used to retain the changelog db file even if the replica was deleted.
======


=== Second ====
Is there any other way, for example, instead of deleting the replica entry, can we modify the replica entry during the promotion and demotion operations?

This could help us differentiate between the case where the replica is being deleted completely and the case where the replica is being modified to play a different role. That way we could retain the changelog file when the replica entry is merely modified to play a different role (Master or Hub).
=====


=== Third ===
I have one use case below where I feel the server should behave differently.


Use case-1:

Suppose we have a topology with a Master, a Hub, Consumer1, and Consumer2. The Master replicates to the Hub and the Hub replicates to both consumers.

In this scenario, suppose one of the consumers, say consumer1, is out of the topology for some time, and in the meantime the Master is lost in a disaster. To reduce the downtime, the Hub is promoted to play the new Master role. In that case, if we bring consumer1 back into the topology without initializing it during agreement creation, we can see some data inconsistency: the updates that came in after consumer1 left the topology and before the promotion happened will be missing on consumer1.

Don't you think it would be a good idea, if we are not retaining the changelog, to stop replication to consumer1 instead of continuing?


If we could retain the changelog, especially in the case of a promotion/demotion operation, it would resolve a few of the above use cases in a better way.

Or

If we could stop replication, instead of continuing, in scenarios where the CSN requested by the consumer is neither in the supplier's changelog db file nor in the purge RUV list, it would also give the administrator a hint that there is some problem with that consumer in the topology which needs attention.
====   

regards,
Jyoti

Comment 12 Noriko Hosoi 2012-02-14 21:20:09 UTC
Hi Jyoti,

> ==== first====
> Do you have any idea why the decision was taken to remove the changelogdb file
> if the replica is deleted which was not the case earlier, it used to retain the
> changelog db file even if replica deleted?
> ====== 

The decision was made when this bug was fixed.  Please note that the bug was reported by Glace Lu at HP.
Bug 238630 - ns-slapd sometimes fails with SIGSEGV when removing and recreating replica entry

Please see the comment #9 (https://bugzilla.redhat.com/show_bug.cgi?id=238630#c9).

> === Second ====
>   Is there any other way like instead of deleting the replica entry can we
> modify the replica entry during promotion and demotion operation? 

I guess that's why we asked you to open an RFE bug with the use cases.

> This could help us in differentiating the behavior like where the replica is
> being deleted completely and  where the replica is being modified to play a
> different role. This way we can retain the change log file when there is a
> modification for the replica entry to play a different role(Master or Hub).
> =====
>
> === Third ===
> I have one use case below where i feel it should have behave in different way.
>
> Use case-1:
>
> Suppose we have a topology like Master, Hub, Consumer1,Consumer2. Master is
> replicating to Hub and Hub is replicating to both consumers.
>
> In this scenario, if one of the consumer say "consumer1" is out of topology for
> sometime and in between the Master disaster happened due to some reason. So to
> reduce the down time, the Hub is promoted to play the new Master role. In that
> case, if we bring back the consumer1 to the topology again without initializing
> during agreement creation, we can see some data inconsistency. The updates
> which were came after the Consumer1 is out of the topology and before the
> promotion happens, will be missing in the consumer1.
>
> Don't you think it should be a good idea instead of continuing with replication
> we should stop the replication for the consumer1 if we are not retaining the
> changelog? 

Actually, I'd like to ask you a different question...  It looks like you chose a topology with 1 Master, 1 Hub, and 2 Consumers.  Instead, could you consider setting up an MMR topology like this?
Master1 <--> Master2
      \     /
        Hub
      /     \
Consumer1  Consumer2

This way, even if one of the masters goes down, you don't have to promote the Hub to a master...  Just continue using the healthy master.
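Roughly, each of the two masters would hold a writable replica (nsDS5ReplicaType: 3, nsDS5Flags: 1) with its own distinct replica ID, while the Hub stays a read-only replica that still keeps a changelog so it can feed the consumers. A sketch, with suffix and IDs chosen just for illustration:

# on Master1
dn: cn=replica,cn=o=USA,cn=mapping tree,cn=config
nsDS5ReplicaType: 3
nsDS5Flags: 1
nsDS5ReplicaId: 1

# on Master2
dn: cn=replica,cn=o=USA,cn=mapping tree,cn=config
nsDS5ReplicaType: 3
nsDS5Flags: 1
nsDS5ReplicaId: 2

# on the Hub (read-only, but logs changes)
dn: cn=replica,cn=o=USA,cn=mapping tree,cn=config
nsDS5ReplicaType: 2
nsDS5Flags: 1
nsDS5ReplicaId: 65535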

> if we could retain the change log specially in case of the promotion/demotion
> operation, it could resolve few of the above use case in better way.
>
> Or
>
> if we could stop the replication instead of continuing in these scenario where
> we could see the requested CSN from the consumer is not there in supplier
> changelog db file and also not in purge RUV list . It will also give a hint to
> the administrator that there is some problem with the consumer in the topology
> which need some attention.
> ==== 

If you are working on this issue and come up with a patch, we are more than happy to review it.
Thanks!

Comment 13 Jyoti ranjan das 2012-02-15 07:20:23 UTC
Hi Noriko,

(In reply to comment #12)
> Hi Jyoti,
> 
> > ==== first====
> > Do you have any idea why the decision was taken to remove the changelogdb file
> > if the replica is deleted which was not the case earlier, it used to retain the
> > changelog db file even if replica deleted?
> > ====== 
> 
> The decision was made when this bug was fixed.  Please note that the bug was
> reported by Glace Lu at HP.
> Bug 238630 - ns-slapd sometimes fails with SIGSEGV when removing and recreating
> replica entry
> 
> Please see the comment #9
> (https://bugzilla.redhat.com/show_bug.cgi?id=238630#c9).

Thanks for this information.

> > === Second ====
> >   Is there any other way like instead of deleting the replica entry can we
> > modify the replica entry during promotion and demotion operation? 
> 
> I guess that's what we asked you to open an RFE bug with the use cases.

I have logged a bug with bug id 790656. Please let me know if you need any more information.
> 
> > This could help us in differentiating the behavior like where the replica is
> > being deleted completely and  where the replica is being modified to play a
> > different role. This way we can retain the change log file when there is a
> > modification for the replica entry to play a different role(Master or Hub).
> > =====
> >
> > === Third ===
> > I have one use case below where i feel it should have behave in different way.
> >
> > Use case-1:
> >
> > Suppose we have a topology like Master, Hub, Consumer1,Consumer2. Master is
> > replicating to Hub and Hub is replicating to both consumers.
> >
> > In this scenario, if one of the consumer say "consumer1" is out of topology for
> > sometime and in between the Master disaster happened due to some reason. So to
> > reduce the down time, the Hub is promoted to play the new Master role. In that
> > case, if we bring back the consumer1 to the topology again without initializing
> > during agreement creation, we can see some data inconsistency. The updates
> > which were came after the Consumer1 is out of the topology and before the
> > promotion happens, will be missing in the consumer1.
> >
> > Don't you think it should be a good idea instead of continuing with replication
> > we should stop the replication for the consumer1 if we are not retaining the
> > changelog? 
> 
> Probably, I'd like to ask you a different question...  It looks you chose a
> topology with 1 Master, 1 Hub, and 2 Consumers.  Instead, could you consider
> setting up an MMR topology like this?
> Master1 <--> Master2
>       \     /
>         Hub
>       /     \
> Consumer1  Consumer2
> 
> This way, even if one of the masters go down, you don't have to promote Hub to
> a master...  Just continue using a healthy master.
> 
> > if we could retain the change log specially in case of the promotion/demotion
> > operation, it could resolve few of the above use case in better way.
> >
> > Or
> >
> > if we could stop the replication instead of continuing in these scenario where
> > we could see the requested CSN from the consumer is not there in supplier
> > changelog db file and also not in purge RUV list . It will also give a hint to
> > the administrator that there is some problem with the consumer in the topology
> > which need some attention.
> > ==== 
> 
> If you are working on this issue and you could come up with your patch, we are
> more than happy to review it.
> Thanks!

The topology suggested by you will definitely help in this case. But the user does not quite agree to this suggestion.

Sure, I will provide a patch if I am able to come up with a proper solution for this.

Regards.
Jyoti