Bug 1031852
Summary: RHEL7 ipa-replica-manage hang waiting on CLEANALLRUV tasks

| Field | Value |
|---|---|
| Product | Red Hat Enterprise Linux 7 |
| Component | ipa |
| Version | 7.0 |
| Status | CLOSED DUPLICATE |
| Severity | unspecified |
| Priority | medium |
| Target Milestone | rc |
| Keywords | Reopened |
| Reporter | Scott Poore <spoore> |
| Assignee | Martin Kosek <mkosek> |
| QA Contact | Namita Soman <nsoman> |
| CC | dpal, jcholast, jgalipea, lkrispen, mkosek, mreynolds, msauton, rcritten, spoore |
| Doc Type | Bug Fix |
| Clones | 1034832 |
| Bug Depends On | 1034832 |
| Type | Bug |
| Last Closed | 2016-02-25 14:51:17 UTC |
Description (Scott Poore, 2013-11-19 00:54:29 UTC)
Upstream ticket: https://fedorahosted.org/freeipa/ticket/4036

Test blocker keyword is moved to the DS bug. There is no need to have the keyword on two bugs.

I cannot reproduce the cleanallruv hang. The reason cleanallruv "hung" before was that not all the replicas were in sync. When I ran the test, everything was in sync and cleanallruv ran fine. So maybe the automatic test suite is removing the replica too quickly, before it can send out all its changes? Maybe there was a replication failure that prevented the updates from going out? I don't know, but running the test manually worked fine. This is what I saw when running the test manually:

```
nsds50ruv: {replica 3 ldap://ipaqavmb.testrelm.com:389} 5292bba3000000030000 5292c3ac000000030000
nsds50ruv: {replica 4 ldap://cloud-qe-3.testrelm.com:389} 5292bc21000000040000 5292c3a9000600040000
nsds50ruv: {replica 5 ldap://ipaqavmc.testrelm.com:389} 5292bdd3000000050000 5292c3a9000200050000
nsds50ruv: {replica 6 ldap://tigger.testrelm.com:389} 5292c051000000060000 52950cf3000000060000
nsds50ruv: {replica 7 ldap://apollo.testrelm.com:389} 5292c2b0000000070000 5292c3bc000000070000
```

On replica 6, deleting replica 7:

```
[root@tigger ~]# ipa-replica-manage -p Secret123 del apollo.testrelm.com
Deleting a master is irreversible.
To reconnect to the remote master you will need to prepare a new replica file and re-install.
Continue to delete? [no]: yes
Deleting replication agreements between apollo.testrelm.com and tigger.testrelm.com
ipa: INFO: Setting agreement cn=meTotigger.testrelm.com,cn=replica,cn=dc\=testrelm\,dc\=com,cn=mapping tree,cn=config schedule to 2358-2359 0 to force synch
ipa: INFO: Deleting schedule 2358-2359 0 from agreement cn=meTotigger.testrelm.com,cn=replica,cn=dc\=testrelm\,dc\=com,cn=mapping tree,cn=config
ipa: INFO: Replication Update in progress: TRUE: status: 0 Replica acquired successfully: Incremental update started: start: 0: end: 0
ipa: INFO: Replication Update in progress: FALSE: status: 0 Replica acquired successfully: Incremental update succeeded: start: 0: end: 0
Deleted replication agreement from 'tigger.testrelm.com' to 'apollo.testrelm.com'
Background task created to clean replication data. This may take a while.
This may be safely interrupted with Ctrl+C
```

The CLEANALLRUV task then completed normally in the errors log:

```
[26/Nov/2013:16:30:24 -0500] NSMMReplicationPlugin - CleanAllRUV Task: Initiating CleanAllRUV Task...
[26/Nov/2013:16:30:24 -0500] NSMMReplicationPlugin - CleanAllRUV Task: Retrieving maxcsn...
[26/Nov/2013:16:30:24 -0500] NSMMReplicationPlugin - CleanAllRUV Task: Found maxcsn (5292c3bc000000070000)
[26/Nov/2013:16:30:24 -0500] NSMMReplicationPlugin - CleanAllRUV Task: Cleaning rid (7)...
[26/Nov/2013:16:30:24 -0500] NSMMReplicationPlugin - CleanAllRUV Task: Waiting to process all the updates from the deleted replica...
[26/Nov/2013:16:30:24 -0500] NSMMReplicationPlugin - CleanAllRUV Task: Waiting for all the replicas to be online...
[26/Nov/2013:16:30:24 -0500] NSMMReplicationPlugin - CleanAllRUV Task: Waiting for all the replicas to receive all the deleted replica updates...
[26/Nov/2013:16:30:24 -0500] NSMMReplicationPlugin - CleanAllRUV Task: Sending cleanAllRUV task to all the replicas...
[26/Nov/2013:16:30:24 -0500] NSMMReplicationPlugin - CleanAllRUV Task: Cleaning local ruv's...
[26/Nov/2013:16:30:24 -0500] NSMMReplicationPlugin - CleanAllRUV Task: Waiting for all the replicas to be cleaned...
[26/Nov/2013:16:30:24 -0500] NSMMReplicationPlugin - CleanAllRUV Task: Replica is not cleaned yet (agmt="cn=meToipaqavmc.testrelm.com" (ipaqavmc:389))
[26/Nov/2013:16:30:24 -0500] NSMMReplicationPlugin - CleanAllRUV Task: Replicas have not been cleaned yet, retrying in 10 seconds
[26/Nov/2013:16:30:36 -0500] NSMMReplicationPlugin - CleanAllRUV Task: Waiting for all the replicas to finish cleaning...
[26/Nov/2013:16:30:36 -0500] NSMMReplicationPlugin - CleanAllRUV Task: Not all replicas finished cleaning, retrying in 10 seconds
[26/Nov/2013:16:30:46 -0500] NSMMReplicationPlugin - CleanAllRUV Task: Successfully cleaned rid(7).
```

As I discussed with Nathan, there was a problem with a wrong test procedure; when Scott fixed it, the problem went away. There was still the 389-ds-base freeze, but that is being investigated in Bug 1034832. I am closing this Bugzilla.

Martin, I'm reopening this one for clarification, to find out whether there is anything that should be done for ipa-replica-manage. I suspect this is simply procedural and not something that can be fixed in ipa-replica-manage; I just want confirmation. From Mark's explanation in bug #1034832, the CLEANALLRUV task was waiting on replication that never finished, because the re-initialize overwrote the changelog. That re-initialize wasn't necessary to begin with (in fact it was incorrect) and has since been removed, so my particular problem was alleviated. However, I'm wondering whether there is something ipa-replica-manage could do to help prevent that scenario. This was Mark's explanation:

> It's not that you need to check the change log, but you need to wait for
> replication to complete or be idle (e.g. by putting all the replicas in
> read-only mode and checking the replication status of each agreement).

Wasn't something put into ipa-replica-manage that locked the replicas for another bug?
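A side note for reading the RUV entries and maxcsn values above: each CSN is 20 hex digits. The sketch below splits one into its components, assuming the standard 389-ds CSN layout (8-digit Unix timestamp, 4-digit sequence number, 4-digit replica id, 4-digit subsequence number); the helper names are mine, not part of ipa-replica-manage.

```python
from datetime import datetime, timezone

def parse_csn(csn):
    """Split a 20-hex-digit 389-ds CSN into its components:
    (unix timestamp, sequence number, replica id, subsequence number)."""
    return (int(csn[0:8], 16),    # seconds since the epoch
            int(csn[8:12], 16),   # sequence number within that second
            int(csn[12:16], 16),  # replica id that issued the change
            int(csn[16:20], 16))  # subsequence number

def csn_time(csn):
    """UTC wall-clock time of a CSN, useful when eyeballing RUV output."""
    return datetime.fromtimestamp(parse_csn(csn)[0], tz=timezone.utc)

# The maxcsn the CleanAllRUV task found for the deleted replica:
ts, seq, rid, subseq = parse_csn("5292c3bc000000070000")
print(rid)  # 7, matching "Cleaning rid (7)" in the log
```

The replica-id field is why the task log above reports rid 7 for apollo's maxcsn.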
So, is there something that ipa-replica-manage should do to check state before a re-initialize? Or is this simply a procedural issue, where the user should check these things before attempting a re-initialize? Thanks, Scott

This is a good question. If there is something we can do to make the re-initialize process on the replica more robust, I am open to it. Mark, any recommendations? I am thinking that putting the replica and all its peers into read-only mode may not be what we want, as it would disrupt service on all connected replication peers, right? This is what we do when re-initializing a replica:

1) Enable the agreement from this host to the remote host (set nsds5ReplicaEnabled to ON)
2) Enable the agreement from the remote host to the local host (set nsds5ReplicaEnabled to ON)
3) Force synchronization from the remote host to the local host (play with nsDS5ReplicaUpdateSchedule)
4) Re-initialize the replication (change nsds5BeginReplicaRefresh to start)

I see no wait with the force-sync action. I also see we do not force sync with other replication peers. Should we proceed differently? Ideally without disruptions on the remote Directory Servers.

(In reply to Martin Kosek from comment #20)
> Mark, any recommendations? I am thinking that putting the replica and all
> its peers to readonly mode may not be what we want as it would disrupt
> service on all connected replication peers, right?

Yes, this would work, but you also need to make sure that replication gets caught up before doing the reinit. Let me back track: this is really a procedural type of issue. In Scott's replication setup, most replicas are chained sequentially, not round-robin. So instead of setting replicas read-only, you could reinit the first replica and then reinit all of its child replicas as well:

```
  A
 / \
B   C
     \
      D
       \
        E
```

If you reinit C, then you also need to reinit D and E, as D and E might not have received all the updates from the old (pre-reinit) replica C changelog in time. In a round-robin deployment, where every replica is connected to every other, this is not really an issue.

That said, you usually only reinit a replica once replication is broken (or when it is being set up for the very first time), not while it is still working correctly and processing updates. It is only problematic when you reinit a replica that is already running correctly. Setting it read-only would work (with client disruption), but then you need to make sure that replication is idle before doing the reinit. Meaning: make sure replica C has sent out all its updates (all the agreements are idle/caught up), then reinit C. This requires checking the RUVs in each agreement against the consumer replica database RUV, etc. If you only did a reinit when replication was already broken, we would not see these issues; problems arise only when you reinit a working replica. So the short story is: don't reinit working replicas :-) Please let me know if you have any more questions.

PS, something to keep in mind for the future: if IPA deployments become very large (hundreds of thousands of entries, or more), online reinitializations become very expensive and disruptive; they can even appear to hang the server. This is why reinits are considered the most expensive/disruptive task replication can do, and they are usually avoided at all costs unless replication cannot recover from some serious failure. So offline reinitializations (db2ldif -r / ldif2db) become the preferred choice once replication breaks, and even for the initial replication setup when dealing with large databases.
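Mark's "checking RUVs in each agreement against the consumer replica database RUV" boils down to comparing maxcsns per replica id. A rough sketch of that comparison (names are hypothetical; real tooling would read nsds50ruv from each server over LDAP, which is omitted here):

```python
def parse_ruv(ruv_values):
    """Map replica id -> maxcsn from nsds50ruv values such as
    '{replica 7 ldap://apollo.testrelm.com:389} <mincsn> <maxcsn>'.
    Elements that are not replica entries, or that carry no CSNs yet,
    are skipped."""
    ruv = {}
    for value in ruv_values:
        head, _, csns = value.partition("}")
        fields = head.split()
        if len(fields) < 2 or fields[0] != "{replica":
            continue  # e.g. the {replicageneration} element
        parts = csns.split()
        if parts:
            ruv[int(fields[1])] = parts[-1]  # last CSN is the maxcsn
    return ruv

def caught_up(supplier_ruv, consumer_ruv):
    """True if the consumer has seen at least the supplier's maxcsn for
    every replica id. Fixed-width hex CSNs compare correctly as strings."""
    return all(consumer_ruv.get(rid, "") >= maxcsn
               for rid, maxcsn in supplier_ruv.items())

# Using the RUV data from the description above:
supplier = parse_ruv([
    "{replica 6 ldap://tigger.testrelm.com:389} 5292c051000000060000 52950cf3000000060000",
    "{replica 7 ldap://apollo.testrelm.com:389} 5292c2b0000000070000 5292c3bc000000070000",
])
lagging = parse_ruv([
    "{replica 6 ldap://tigger.testrelm.com:389} 5292c051000000060000 52950cf3000000060000",
    "{replica 7 ldap://apollo.testrelm.com:389} 5292c2b0000000070000 5292c2b0000000070000",
])
print(caught_up(supplier, supplier))  # True
print(caught_up(supplier, lagging))   # False: replica 7 updates still pending
```

This is the "hard" way Mark mentions later; as he notes, polling nsds5replicaUpdateInProgress is usually sufficient.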
> This is what we do when re-initializing a replica:
>
> 1) enable the agreement from this host to the remote host (put nsds5ReplicaEnabled to ON)
> 2) enable the agreement from the remote host to local host (put nsds5ReplicaEnabled to ON)
> 3) Force synchronization from the remote host to the local host (play with nsDS5ReplicaUpdateSchedule)
> 4) Re-initialize the replication (change nsds5BeginReplicaRefresh to start)
>
> I see no wait with the force sync action. I also see we do not force sync
> with other replication peers.
>
> Should we proceed differently? Ideally without disruptions on the remote
> Directory Servers.

Mark, thanks for the explanation. I am now thinking about which of the proposed improvements could be automated in ipa-replica-manage.

We could warn the user that he also has to re-initialize other IPA masters in case he reinitializes "C" as in your example. But for that, we would first need to be able to get a full graph of the IPA network: https://fedorahosted.org/freeipa/ticket/3058

As for other enhancements, I am thinking about the following update to the process:

1) Enable the agreement from this host to the remote host (set nsds5ReplicaEnabled to ON)
2) Enable the agreement from the remote host to the local host (set nsds5ReplicaEnabled to ON)
3) For each replication peer:
   a) Force synchronization from the remote host to the local host (play with nsDS5ReplicaUpdateSchedule)
   b) Wait until replication is idle (nsds5replicaUpdateInProgress is false)
4) Re-initialize the replication (change nsds5BeginReplicaRefresh to start)

Would that improve the process? I was not sure what exactly you mean by "This requires checking RUVs in each agreement against the consumer replica database RUV, etc.", i.e. how I should check/compare that.

(In reply to Martin Kosek from comment #22)
> We could warn the user that he also has to re-initialize other IPA masters in
> case he reinitializes "C" as in your example.

I think there should always be some type of warning when doing an online reinit, stating something like: the remote database will be removed, its changelog invalidated, and the remote replica's peers might need to be reinited as well.

> 3) For each replication peer:
>    a) Force synchronization from the remote host to the local host
>    b) Wait until replication is idle (nsds5replicaUpdateInProgress is false)

This won't guarantee that replication is idle when you actually do the reinit. You would need to:

a) Set this server to read-only mode.
b) Force synchronization from the remote host to the local host (play with nsDS5ReplicaUpdateSchedule).
c) Wait for nsds5replicaUpdateInProgress to be false.
d) Do the reinit on the remote replica.
e) Finally, disable read-only mode.

While this is disruptive to clients and replicas, it should not be a common task. If it needs to be run, then there are probably already disruptive problems occurring, or nothing was even set up yet (in which case it doesn't really matter).

> Would that improve the process? I was not sure what exactly you mean by
> "This requires checking RUVs in each agreement against the consumer replica
> database RUV, etc.", i.e. how I should check/compare that.
I was referring to the "hard" way of determining whether the replica is idle. Checking nsds5replicaUpdateInProgress should be sufficient.

OK, thanks for the suggestions. I reopened the ticket; we will triage it and see what we can do with it upstream.

Ludwig, will this Bugzilla be fixed by the latest RUV fixes that were done in Directory Server and FreeIPA?

Yes, but the corresponding DS ticket #48218 is only committed in master.

OK. Just for reference, this is the link to the DS ticket: https://fedorahosted.org/389/ticket/48218. It should be used in FreeIPA 4.4, when RUVs are cleaned. The DS and FreeIPA changes should reach RHEL 7.3 with the next considered rebase (Bug 1270020). The situation should also be much improved with https://fedorahosted.org/freeipa/ticket/5411 being closed. This should all be tested as part of the IdM topology feature (Bug 1298848), which will manage the agreements. The proposed enhancements should then be filed on top of the topology feature, based on experience. For now, I am closing this bug as a duplicate, and I will link the upstream ticket to the topology feature so we stay aware of the request.

*** This bug has been marked as a duplicate of bug 1298848 ***
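For reference, steps (a) through (e) that Mark outlines above could be sketched roughly as follows. This is an illustration only, not what ipa-replica-manage actually does: `conn` is a hypothetical LDAP wrapper (not a real python-ldap connection), and the DN used for read-only mode assumes the default userRoot backend.

```python
import time

def reinit_with_quiesce(conn, agmt_dn, poll_interval=10):
    """Sketch of the safer online-reinit sequence discussed in this bug.

    conn is a hypothetical LDAP wrapper offering:
      replace(dn, attr, value), delete(dn, attr), get_value(dn, attr)
    """
    # Default backend entry; adjust for non-userRoot backends.
    backend_dn = "cn=userRoot,cn=ldbm database,cn=plugins,cn=config"

    # a) Put the supplier database in read-only mode so no new updates land.
    conn.replace(backend_dn, "nsslapd-readonly", "on")
    try:
        # b) Force synchronization: set a dummy schedule, then remove it
        #    (the same trick ipa-replica-manage uses, per the log above).
        conn.replace(agmt_dn, "nsDS5ReplicaUpdateSchedule", "2358-2359 0")
        conn.delete(agmt_dn, "nsDS5ReplicaUpdateSchedule")

        # c) Wait until the agreement reports it is idle.
        while conn.get_value(agmt_dn, "nsds5replicaUpdateInProgress") == "TRUE":
            time.sleep(poll_interval)

        # d) Kick off the online re-initialization of the consumer.
        conn.replace(agmt_dn, "nsds5BeginReplicaRefresh", "start")
    finally:
        # e) Lift read-only mode even if something above failed.
        conn.replace(backend_dn, "nsslapd-readonly", "off")
```

As the thread concludes, this level of disruption is only acceptable because an online reinit should itself be a rare, last-resort operation.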