Bug 750425
Summary: | Data inconsistency during replication | ||
---|---|---|---|
Product: | [Retired] 389 | Reporter: | Jyoti ranjan das <jyoti-ranjan.das> |
Component: | Replication - General | Assignee: | Rich Megginson <rmeggins> |
Status: | CLOSED CURRENTRELEASE | QA Contact: | Ben Levenson <benl> |
Severity: | high | Docs Contact: | |
Priority: | unspecified | ||
Version: | 1.2.1 | CC: | nhosoi |
Target Milestone: | --- | ||
Target Release: | --- | ||
Hardware: | All | ||
OS: | Other | ||
Whiteboard: | |||
Fixed In Version: | Doc Type: | Bug Fix | |
Last Closed: | 2015-12-10 18:44:34 UTC | Type: | --- |
Bug Blocks: | 743970 |
Description
Jyoti ranjan das
2011-11-01 04:09:59 UTC
I followed your steps (except: "Delete Supplier DN (cn=suppdn,cn=config) from Slave". I could not figure out what you meant by this...), but I could not duplicate your problem.
First, I set up:
single master   -->  hub      -->  readonly replica
(ReplicaId: 1)       (65535)       (65535)
Then, I disabled the agreement of the single master.
Promoted hub to a single master having ReplicaId=1.
Created a new agreement from the new master to readonly replica.
new single master  -->  readonly replica
(ReplicaId: 1)          (65535)
I added 10 entries to the new master and verified all of them are replicated.
> Now, you will see, among the last 5 entries only last few will gets replicated
> without halting the replication.
Could you tell us how you got this result? Did you run ldapsearch or some other client command and find that only the "last few" entries were returned? What does the access log say for the search request?
If you could attach the config file and the errors/access log files of the new single master and the readonly replica, it would help.
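One way to make the "which entries are missing" question concrete is to compare the DN lists returned by each server. The following is a minimal sketch, assuming the DNs were already exported from the supplier and the consumer (for example with ldapsearch); the helper names and the sample DNs are hypothetical and the data is made up for illustration.

```python
def normalize_dn(dn: str) -> str:
    """Rough DN normalization: lowercase and trim spaces around RDNs.

    Assumption: DNs contain no escaped commas; a real comparison should
    use a proper DN parser.
    """
    return ",".join(part.strip().lower() for part in dn.split(","))


def missing_on_consumer(supplier_dns, consumer_dns):
    """Return supplier DNs (in supplier order) that the consumer lacks."""
    consumer = {normalize_dn(dn) for dn in consumer_dns}
    return [dn for dn in supplier_dns if normalize_dn(dn) not in consumer]


# Hypothetical data: the supplier holds test1..test10, the consumer lacks four.
supplier = [f"uid=test{i},ou=People,dc=example,dc=com" for i in range(1, 11)]
consumer = [f"uid=test{i}, ou=people, dc=example, dc=com"
            for i in (1, 2, 3, 4, 5, 10)]

missing = missing_on_consumer(supplier, consumer)
for dn in missing:
    print("missing on consumer:", dn)
```

Attaching a list like this together with the access log entries for the searches shows exactly which updates did not arrive.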
Upstream ticket: https://fedorahosted.org/389/ticket/18

Marking as screened because it has been cloned upstream.

Thanks to Jyoti for providing us the steps to reproduce. I could reproduce the problem and find the cause. See https://fedorahosted.org/389/ticket/18.

Hi Noriko,
Thanks a lot for the patch. But I have a few more observations here.
I slightly changed the reproducer which I had provided you before. In this case, at step 5, after restarting the newly promoted master, I added 5 more entries (test6 to test10) and then added the replication agreement between the newly promoted master and the consumer. I observed the same behavior as before, i.e. only entry test10 got replicated to the consumer, whereas the consumer missed the changes from test6 to test9.
In this case I observed that, after adding the replication agreement, the consumer showed the CSN for entry test5, and the newly promoted master didn't find that CSN either in the changelogdb file or in the purged RUV list. So it made the following assumption and continued with the replication.
(copied from file /ldapserver/ldap/servers/plugins/replication/cl5_api.c#5810)
/* there is a special case which can occur just after migration - in this case,
   the consumer RUV will contain the last state of the supplier before migration,
   but the supplier will have an empty changelog, or the supplier changelog will
   not contain any entries within the consumer min and max CSN - also, since
   the purge RUV contains no CSNs, the changelog has never been purged
   ASSUMPTIONS - it is assumed that the supplier had no pending changes to send
   to any consumers; that is, we can assume that no changes were lost due to
   either changelog purging or database reload - bug# 603061 - richm
*/
Below message was logged in the error log file.
====
[24/Jan/2012:11:37:38 +051800] NSMMReplicationPlugin - changelog program - agmt="cn=hub_2_consumer" (sysmg7:6401): CSN 4f1e464f0004000a0000 not found and no purging, probably consumer may need to be reinitialized
=====
Here, my question is: once we make the above assumption, why do we start from the last record the supplier has received, i.e. test10 in our test scenario? Is there any specific reason for this? Don't you think we should start from the first change the supplier has ever received (test6 in our test scenario) once we make the above assumption?
Please do advise me.
regards,
Jyoti

Jyoti,
Could there be any other errors/warnings/info logged in the error log?
For instance, one of these?
"Warning: new data for replica %s does not match the data in the changelog.\n"
" Recreating the changelog file. This could affect replication with replica's "
" consumers in which case the consumers should be reinitialized.\n",

"Warning: The changelog for replica %s is no longer valid since "
"the replica config is being deleted. Removing the changelog.\n",

(In reply to comment #6)
> Jyoti,
>
> Could there be any other errors/warnings/info logged in the error log?
> For instance, one of these?
> "Warning: new data for replica %s does not match the data in the changelog.\n"
> " Recreating the changelog file. This could affect replication with replica's "
> " consumers in which case the consumers should be reinitialized.\n",
>
> "Warning: The changelog for replica %s is no longer valid since "
> "the replica config is being deleted. Removing the changelog.\n",
Hi Noriko,
Yes, I can see the second warning message getting logged in the error log. But I feel it is the expected one as per the current design. If you delete the "cn=replica,cn=<suffix>,cn=mapping tree,cn=config" entry with ldapmodify/ldapdelete while the server is running, that in turn deletes the changelogdb file.
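For readers following the CSN in the error message above, it can be decoded into its parts. This is a minimal sketch, assuming the usual 20-hex-digit CSN layout (8 digits of Unix timestamp, 4 of sequence number, 4 of replica ID, 4 of sub-sequence number); the `CSN` tuple and `parse_csn` helper are hypothetical names, not part of the server code.

```python
from datetime import datetime, timezone
from typing import NamedTuple


class CSN(NamedTuple):
    timestamp: int   # seconds since the Unix epoch
    seqnum: int      # sequence number within that second
    replica_id: int  # ID of the replica that originated the change
    subseqnum: int   # sub-sequence number


def parse_csn(text: str) -> CSN:
    """Split a CSN string into fields (assumed layout: 8 + 4 + 4 + 4 hex digits)."""
    if len(text) != 20:
        raise ValueError(f"expected 20 hex digits, got {len(text)}")
    return CSN(
        timestamp=int(text[0:8], 16),
        seqnum=int(text[8:12], 16),
        replica_id=int(text[12:16], 16),
        subseqnum=int(text[16:20], 16),
    )


# The CSN reported as "not found" in the error log message.
csn = parse_csn("4f1e464f0004000a0000")
created = datetime.fromtimestamp(csn.timestamp, tz=timezone.utc)
print(csn)
print("originated at (UTC):", created.isoformat())
```

Decoding the timestamp and replica ID this way helps confirm which server and which update the consumer last saw before the changelog was removed.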
Here, my point is: if we continue with replication under the above-mentioned assumption even after the changelogdb file is deleted, then we should continue from the first operation logged in the changelogdb file instead of the last record.
Regards,
Jyoti

Hi Noriko,
I have another question. As we saw, deleting cn=replica,cn=<suffix>,cn=mapping tree,cn=config in turn deletes the changelog file. To retain the changelogdb file, is there an option to modify the "cn=replica,cn=<suffix>,cn=mapping tree,cn=config" entry during the promotion/demotion operation instead of deleting it? Do you foresee any issue with this approach in a replication environment?
regards,
Jyoti

Hi Noriko,
Do you have any input on this particular issue?
regards,
Jyoti

(In reply to comment #9)
> Hi Noriko,
>
> Do you have any input on this particular issue?
>
> regards,
> Jyoti
Sorry, Jyoti. I thought this bug was already closed.
I don't think we could retain a changelog file if the replica is deleted for now.
Please open an RFE bug with the use case.

(In reply to comment #10)
> (In reply to comment #9)
> > Hi Noriko,
> >
> > Do you have any input on this particular issue?
> >
> > regards,
> > Jyoti
>
> Sorry, Jyoti. I thought this bug was already closed.
>
> I don't think we could retain a changelog file if the replica is deleted for
> now.
>
> Please open an RFE bug with the use case.
Hi Noriko,
Thanks for your reply. I have questions below in three different parts. I would really appreciate your help in answering them.
==== first====
Do you have any idea why the decision was taken to remove the changelogdb file when the replica is deleted? That was not the case earlier; it used to retain the changelogdb file even if the replica was deleted.
======
=== Second ====
Is there any other way, such as modifying the replica entry instead of deleting it during the promotion and demotion operations?
This could help us differentiate between the case where the replica is being deleted completely and the case where the replica is being modified to play a different role. This way we can retain the changelog file when the replica entry is modified to play a different role (Master or Hub).
=====
=== Third ===
I have one use case below where I feel it should have behaved in a different way.
Use case-1:
Suppose we have a topology like Master, Hub, Consumer1, Consumer2. Master is replicating to Hub and Hub is replicating to both consumers.
In this scenario, suppose one of the consumers, say Consumer1, is out of the topology for some time, and in between the Master is lost in a disaster. To reduce the downtime, the Hub is promoted to play the new Master role. In that case, if we bring Consumer1 back into the topology without initializing it during agreement creation, we can see some data inconsistency. The updates which came after Consumer1 left the topology and before the promotion happened will be missing on Consumer1.
Don't you think it would be a good idea to stop the replication for Consumer1, instead of continuing, if we are not retaining the changelog?
If we could retain the changelog, especially in the case of the promotion/demotion operation, it could resolve a few of the above use cases in a better way.
Or
if we could stop the replication instead of continuing in these scenarios, where the CSN requested by the consumer is neither in the supplier changelogdb file nor in the purge RUV list. It would also give a hint to the administrator that there is some problem with the consumer in the topology which needs attention.
====
regards,
Jyoti

Hi Jyoti,
> ==== first====
> Do you have any idea why the decision was taken to remove the changelogdb file
> if the replica is deleted which was not the case earlier, it used to retain the
> changelog db file even if replica deleted?
> ======
The decision was made when this bug was fixed. Please note that the bug was reported by Glace Lu at HP.
Bug 238630 - ns-slapd sometimes fails with SIGSEGV when removing and recreating replica entry
Please see the comment #9 (https://bugzilla.redhat.com/show_bug.cgi?id=238630#c9).
> === Second ====
> Is there any other way like instead of deleting the replica entry can we
> modify the replica entry during promotion and demotion operation?
I guess that's what we asked you to open an RFE bug for, with the use cases.
> This could help us in differentiating the behavior like where the replica is
> being deleted completely and where the replica is being modified to play a
> different role. This way we can retain the change log file when there is a
> modification for the replica entry to play a different role(Master or Hub).
> =====
>
> === Third ===
> I have one use case below where i feel it should have behave in different way.
>
> Use case-1:
>
> Suppose we have a topology like Master, Hub, Consumer1,Consumer2. Master is
> replicating to Hub and Hub is replicating to both consumers.
>
> In this scenario, if one of the consumer say "consumer1" is out of topology for
> sometime and in between the Master disaster happened due to some reason. So to
> reduce the down time, the Hub is promoted to play the new Master role. In that
> case, if we bring back the consumer1 to the topology again without initializing
> during agreement creation, we can see some data inconsistency. The updates
> which were came after the Consumer1 is out of the topology and before the
> promotion happens, will be missing in the consumer1.
>
> Don't you think it should be a good idea instead of continuing with replication
> we should stop the replication for the consumer1 if we are not retaining the
> changelog?
Probably, I'd like to ask you a different question... It looks like you chose a topology with 1 Master, 1 Hub, and 2 Consumers.
Instead, could you consider setting up an MMR topology like this?
Master1 <--> Master2
      \       /
         Hub
       /     \
Consumer1   Consumer2
This way, even if one of the masters goes down, you don't have to promote Hub to a master... Just continue using a healthy master.
> if we could retain the change log specially in case of the promotion/demotion
> operation, it could resolve few of the above use case in better way.
>
> Or
>
> if we could stop the replication instead of continuing in these scenario where
> we could see the requested CSN from the consumer is not there in supplier
> changelog db file and also not in purge RUV list . It will also give a hint to
> the administrator that there is some problem with the consumer in the topology
> which need some attention.
> ====
If you are working on this issue and you could come up with your patch, we are more than happy to review it.
Thanks!

Hi Noriko,
(In reply to comment #12)
> Hi Jyoti,
>
> > ==== first====
> > Do you have any idea why the decision was taken to remove the changelogdb file
> > if the replica is deleted which was not the case earlier, it used to retain the
> > changelog db file even if replica deleted?
> > ======
>
> The decision was made when this bug was fixed. Please note that the bug was
> reported by Glace Lu at HP.
> Bug 238630 - ns-slapd sometimes fails with SIGSEGV when removing and recreating
> replica entry
>
> Please see the comment #9
> (https://bugzilla.redhat.com/show_bug.cgi?id=238630#c9).
Thanks for this information.
> > === Second ====
> > Is there any other way like instead of deleting the replica entry can we
> > modify the replica entry during promotion and demotion operation?
>
> I guess that's what we asked you to open an RFE bug with the use cases.
I have logged a bug with the bug ID 790656.
Please let me know if you need any more information.
>
> > This could help us in differentiating the behavior like where the replica is
> > being deleted completely and where the replica is being modified to play a
> > different role. This way we can retain the change log file when there is a
> > modification for the replica entry to play a different role(Master or Hub).
> > =====
> >
> > === Third ===
> > I have one use case below where i feel it should have behave in different way.
> >
> > Use case-1:
> >
> > Suppose we have a topology like Master, Hub, Consumer1,Consumer2. Master is
> > replicating to Hub and Hub is replicating to both consumers.
> >
> > In this scenario, if one of the consumer say "consumer1" is out of topology for
> > sometime and in between the Master disaster happened due to some reason. So to
> > reduce the down time, the Hub is promoted to play the new Master role. In that
> > case, if we bring back the consumer1 to the topology again without initializing
> > during agreement creation, we can see some data inconsistency. The updates
> > which were came after the Consumer1 is out of the topology and before the
> > promotion happens, will be missing in the consumer1.
> >
> > Don't you think it should be a good idea instead of continuing with replication
> > we should stop the replication for the consumer1 if we are not retaining the
> > changelog?
>
> Probably, I'd like to ask you a different question... It looks you chose a
> topology with 1 Master, 1 Hub, and 2 Consumers. Instead, could you consider
> setting up an MMR topology like this?
> Master1 <--> Master2
>       \       /
>          Hub
>        /     \
> Consumer1   Consumer2
>
> This way, even if one of the masters go down, you don't have to promote Hub to
> a master... Just continue using a healthy master.
>
> > if we could retain the change log specially in case of the promotion/demotion
> > operation, it could resolve few of the above use case in better way.
> >
> > Or
> >
> > if we could stop the replication instead of continuing in these scenario where
> > we could see the requested CSN from the consumer is not there in supplier
> > changelog db file and also not in purge RUV list . It will also give a hint to
> > the administrator that there is some problem with the consumer in the topology
> > which need some attention.
> > ====
>
> If you are working on this issue and you could come up with your patch, we are
> more than happy to review it.
> Thanks!
The suggested topology will definitely help in this case. But the user does not quite agree with this suggestion.
Sure, I will provide a patch if I am able to find a proper solution for this.
Regards,
Jyoti
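
The three behaviours discussed in this thread (sending only the last record, replaying from the first record in the changelog, or halting so the administrator can reinitialize the consumer) can be compared in a toy model. This is only a sketch under stated assumptions and is not the logic in cl5_api.c: the `Policy` names, the `changes_to_send` function, and the sample CSNs for test6 to test10 are all hypothetical, and treating a non-empty purge RUV as a reason to halt is an assumption made for illustration.

```python
from enum import Enum


class Policy(Enum):
    LAST_ONLY = "last_only"    # behaviour reported in this thread
    REPLAY_ALL = "replay_all"  # proposal: start from the first logged change
    HALT = "halt"              # proposal: stop and ask for reinitialization


class ReplicationHalted(Exception):
    """Raised when the model decides the consumer must be reinitialized."""


def changes_to_send(consumer_csn, changelog, purge_ruv_empty, policy):
    """Pick the changes to send, given the consumer's last seen CSN.

    changelog is a list of (csn, entry_name) pairs in CSN order.
    """
    csns = [csn for csn, _ in changelog]
    if consumer_csn in csns:
        # Normal case: resume right after the consumer's last seen change.
        return changelog[csns.index(consumer_csn) + 1:]
    if not purge_ruv_empty:
        # Assumption: changes may have been purged, so nothing safe to send.
        raise ReplicationHalted("changes may have been purged; reinitialize")
    if policy is Policy.HALT:
        raise ReplicationHalted("consumer CSN not found; reinitialize")
    if policy is Policy.REPLAY_ALL:
        return list(changelog)
    return changelog[-1:]


# Hypothetical changelog on the newly promoted master (test6..test10).
changelog = [(f"4f1e4a{i:02x}0000000a0000", f"test{i}") for i in range(6, 11)]
consumer_csn = "4f1e464f0004000a0000"  # CSN from the error log message

for policy in (Policy.LAST_ONLY, Policy.REPLAY_ALL):
    names = [name for _, name in
             changes_to_send(consumer_csn, changelog, True, policy)]
    print(policy.value, "->", names)
```

In this model the replay-all policy delivers test6 to test10, while the halt policy raises an error that would surface the problem to the administrator instead of silently skipping test6 to test9.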