Tuesday night (20 March), replication suddenly ceased between our four nodes, with errors similar to these on all nodes:

[20/Mar/2007:17:13:26 -0500] NSMMReplicationPlugin - agmt="cn="Replication to groucho (o=isp)"" (groucho:389): Unable to acquire replica: Excessive clock skew between the supplier and the consumer. Replication is aborting.
[20/Mar/2007:17:13:26 -0500] NSMMReplicationPlugin - agmt="cn="Replication to groucho (o=isp)"" (groucho:389): Incremental update failed and requires administrator action
[20/Mar/2007:17:13:26 -0500] NSMMReplicationPlugin - agmt="cn="Replication to zeppo.nebrwesleyan.edu (o=isp)"" (zeppo:389): Unable to acquire replica: Excessive clock skew between the supplier and the consumer. Replication is aborting.
[20/Mar/2007:17:13:26 -0500] NSMMReplicationPlugin - agmt="cn="Replication to zeppo.nebrwesleyan.edu (o=isp)"" (zeppo:389): Incremental update failed and requires administrator action
[20/Mar/2007:17:13:27 -0500] - csngen_adjust_time: adjustment limit exceeded; value - 86401, limit - 86400
[20/Mar/2007:17:13:27 -0500] NSMMReplicationPlugin - conn=1600790 op=4983 replica="o=isp": Unable to acquire replica: error: excessive clock skew
[20/Mar/2007:17:23:56 -0500] NSMMReplicationPlugin - agmt="cn="Replication to harpo.nebrwesleyan.edu (o=isp)"" (harpo:389): Unable to acquire replica: Excessive clock skew between the supplier and the consumer. Replication is aborting.
[20/Mar/2007:17:23:56 -0500] NSMMReplicationPlugin - agmt="cn="Replication to harpo.nebrwesleyan.edu (o=isp)"" (harpo:389): Incremental update failed and requires administrator action
[20/Mar/2007:17:58:27 -0500] - csngen_adjust_time: adjustment limit exceeded; value - 86401, limit - 86400
[20/Mar/2007:17:58:27 -0500] NSMMReplicationPlugin - conn=1615833 op=3276 replica="o=isp": Unable to acquire replica: error: excessive clock skew

Subsequent efforts to resume replication have come to naught.
(See this thread for more details on getting replication working again at all: https://www.redhat.com/archives/fedora-directory-users/2007-March/msg00100.html)

Once I get replication working, a few minutes after putting the cluster back into production we get the same error messages and replication ceases. The messages do not occur when the machines are not actively getting queries. In all cases, the 'value' on the csngen_adjust_time line is 86401 and the limit is 86400.

All four nodes have the same clock times, and are running NTP against a local NTP server. We are using four-way MMR with no read-only replicas. We used the mmr.pl script to set up replication. Each node replicates to all three other nodes, and all four nodes receive updates.

We have another database on all nodes that has continued to replicate without a problem. It's only our "o=isp" base that has troubles.

This guy seems to have encountered a similar issue: http://www.mail-archive.com/fedora-directory-users/msg03614.html
*** Bug 233643 has been marked as a duplicate of this bug. ***
Hi, I am trying to follow up on this bug. I have set up circular MMR in my environment with about 1000 records, and the four servers have been running fine for 2 days. Could you give me more details to help reproduce this bug?

*) What MMR agreements do you have?
*) How many records do you have? (a rough number would be fine)
*) What types of operations did you perform before it crashed?
*) Did all servers crash at roughly the same time, or one by one?

Can you try reducing your MMR from 4 servers to 2 servers to confirm this? :) thanks
1. Each of the four nodes had an agreement with each of the three other nodes.

2. 5000-6000.

3. One of the nodes stopped replicating at 17:13:25. The last several operations in the access logs are as follows:

[20/Mar/2007:17:13:24 -0500] conn=1593620 op=23 RESULT err=0 tag=101 nentries=0 etime=0
[20/Mar/2007:17:13:24 -0500] conn=1593620 op=24 SRCH base="ou=Groups,o=nebrwesleyan.edu,o=isp" scope=2 filter="(&(objectClass=sambaGroupMapping)(sambaSID=S-1-5-2))" attrs="gidNumber sambaSID sambaGroupType sambasidlist description displayName cn objectClass"
[20/Mar/2007:17:13:24 -0500] conn=1593620 op=24 RESULT err=0 tag=101 nentries=0 etime=0
[20/Mar/2007:17:13:24 -0500] conn=1593620 op=25 SRCH base="ou=Groups,o=nebrwesleyan.edu,o=isp" scope=2 filter="(&(objectClass=sambaGroupMapping)(sambaSID=S-1-5-32-546))" attrs="gidNumber sambaSID sambaGroupType sambasidlist description displayName cn objectClass"
[20/Mar/2007:17:13:24 -0500] conn=1593620 op=25 RESULT err=0 tag=101 nentries=0 etime=0
[20/Mar/2007:17:13:24 -0500] conn=1593620 op=26 SRCH base="ou=Groups,o=nebrwesleyan.edu,o=isp" scope=2 filter="(&(objectClass=sambaGroupMapping)(|(displayName=ntadmin)(cn=ntadmin)))" attrs="gidNumber sambaSID sambaGroupType sambasidlist description displayName cn objectClass"
[20/Mar/2007:17:13:24 -0500] conn=1593620 op=26 RESULT err=0 tag=101 nentries=1 etime=0
[20/Mar/2007:17:13:25 -0500] conn=1593621 fd=249 slot=249 connection from 10.1.1.33 to 10.1.1.19
[20/Mar/2007:17:13:25 -0500] conn=1593621 op=0 EXT oid="1.3.6.1.4.1.1466.20037"
[20/Mar/2007:17:13:25 -0500] conn=1593621 op=0 EXT oid="1.3.6.1.4.1.1466.20037" name="startTLS"
[20/Mar/2007:17:13:25 -0500] conn=1593621 op=0 RESULT err=0 tag=120 nentries=0 etime=0
[20/Mar/2007:17:13:25 -0500] conn=1593621 SSL 256-bit AES
[20/Mar/2007:17:13:25 -0500] conn=1593621 op=1 SRCH base="ou=people,o=nebrwesleyan.edu,o=isp" scope=2 filter="(uid=)" attrs="cn"
[20/Mar/2007:17:13:25 -0500] conn=1593621 op=1 RESULT err=0 tag=101 nentries=0 etime=0
[20/Mar/2007:17:13:25 -0500] conn=1593621 op=2 UNBIND
[20/Mar/2007:17:13:25 -0500] conn=1593621 op=2 fd=249 closed - U1
[20/Mar/2007:17:13:25 -0500] conn=1592136 op=-1 fd=305 closed - B1
[20/Mar/2007:17:13:25 -0500] conn=1592523 op=90 BIND dn="uid=jcaldwel,ou=people,o=nebrwesleyan.edu,o=isp" method=128 version=3
[20/Mar/2007:17:13:25 -0500] conn=1592523 op=90 RESULT err=0 tag=97 nentries=0 etime=0 dn="uid=jcaldwel,ou=people,o=nebrwesleyan.edu,o=isp"
[20/Mar/2007:17:13:25 -0500] conn=1590611 op=874 EXT oid="2.16.840.1.113730.3.5.3" name="Netscape Replication Start Session"

4. The first error came between 17:13:25 and 17:13:26 on all nodes. The first of 12 replication agreements broke at 17:14:11, and the last at 17:58:28.

With the help of some of the folks on IRC, I've managed to get a three-node cluster back up and running in production. (I haven't tried a four-node cluster.) In order to do so, I followed the instructions I posted on the mailing list (https://www.redhat.com/archives/fedora-directory-users/2007-March/msg00108.html) with one change: when dismantling the replication agreements, I completely got rid of the cn=replica node for the failed database, not just the agreements within that node. "uffe", the guy who helped me, thought that this would only delay the problem, not prevent it. Without that extra step, every attempt to create an MMR cluster with 2 or 3 nodes failed as soon as I put it into production.
Yi - it was thought that some state information stored in cn=replica or in the replication agreement entry was causing the time skew error.
Created attachment 308969 [details] script to check CSN state

This script can be used to periodically check the state of the CSN generator. To use it:

python readNsState.py /path/to/slapd-instance/dse.ldif

This should print output like this for each replica:

For replica cn=replica, cn="dc=example,dc=com", cn=mapping tree, cn=config
len of nsstate is 40
CSN generator state:
  Replica ID    : 2
  Sampled Time  : 1213206956
  Time in hex   : 0x485011ac
  Time as str   : Wed Jun 11 11:55:56 2008
  Local Offset  : 0
  Remote Offset : 1
  Seq. num      : 1
  System time   : Wed Jun 11 11:56:50 2008
  Diff in sec.  : 54

The state in dse.ldif is updated every 30 seconds. I'm mainly interested in:
1) How far off from the system time is the Sampled Time?
2) Does Remote Offset keep increasing and never decrease?
3) Does Seq. num approach 65535?

I think this problem is related to 2. I would appreciate it if you could run this script periodically and attach any interesting output to this bug.
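For anyone curious what the attached script is doing under the hood: the nsState attribute is stored base64-encoded in dse.ldif, and the 40-byte length reported above is consistent with 64-bit time fields. Below is a minimal decoding sketch, not the attached script itself; the exact field layout (2-byte replica ID, padding, three 8-byte time values, 2-byte sequence number, padding, all little-endian) is an assumption inferred from the output format above.

```python
import base64
import struct
import time

# Assumed layout for the 40-byte nsState blob (little-endian, 64-bit
# time fields): replica ID, padding, sampled time, local offset,
# remote offset, sequence number, padding. This layout is a guess
# based on the 40-byte length reported above, not the attached script.
NSSTATE_FMT = "<H6xQQQH6x"

def decode_nsstate(b64_value):
    """Decode a base64 nsState value into a dict of CSN generator fields."""
    raw = base64.b64decode(b64_value)
    rid, sampled, local_off, remote_off, seq = struct.unpack(NSSTATE_FMT, raw)
    return {
        "replica_id": rid,
        "sampled_time": sampled,
        "local_offset": local_off,
        "remote_offset": remote_off,
        "seq_num": seq,
    }

if __name__ == "__main__":
    # Hypothetical example value, packed with the same assumed layout.
    blob = struct.pack(NSSTATE_FMT, 2, 1213206956, 0, 1, 1)
    state = decode_nsstate(base64.b64encode(blob))
    print("Sampled Time :", time.ctime(state["sampled_time"]))
    print("Remote Offset:", state["remote_offset"])
```

The field to watch for this bug is remote_offset: it should stay near the real clock difference between the masters, not grow without bound.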
I set up mmr between a vm running f8 and one running f9. I synced the clocks to within a few seconds. I ran a script that added 100,000 entries alternately to each server. Within about 40 minutes, the remote offset had gone from 0 to 75. At that rate, the remote offset would hit 86400 in about 32 days, and the time skew error would occur.
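The 32-day projection above is straightforward arithmetic; a quick back-of-the-envelope check using the figures from this comment:

```python
# Measured in the test above: the remote offset grew by 75 seconds
# over roughly 40 minutes of sustained updates.
creep_per_minute = 75 / 40        # ~1.875 seconds of offset per minute
limit = 86400                     # the one-day skew limit, in seconds

minutes_to_limit = limit / creep_per_minute
print(minutes_to_limit / (60 * 24), "days")  # 32.0 days
```

At a lower update rate the creep is proportionally slower, which is why lightly loaded deployments can run for months or years before hitting the limit.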
Created attachment 310040 [details] diffs
Created attachment 310083 [details] MMR diffs as requested by Rich on mailing list The output of Rich's readNsState.py script for our two MMR servers.
Thanks. Try running the script every half hour - you should see the remote offset field keep increasing.
Created attachment 310193 [details] fix csngen.c
Created attachment 310196 [details] cvs commit log

Reviewed by: nhosoi, nkinder (Thanks!)

Fix Description: CSN remote offset generation seems broken. We seem to accumulate a remote offset that keeps growing until we hit the limit of 1 day, then replication stops.

The idea behind the remote offset is that servers may be seconds or minutes off. When replication starts, one of the items in the payload of the start extop is the latest CSN from the supplier. The CSN timestamp field is (sampled_time + local offset + remote offset). Sampled time comes from the time thread in the server that updates the time once per second. This allows the consumer, if also a master, to adjust its CSN generation so as not to generate duplicates or CSNs less than those from the supplier.

However, the logic in csngen_adjust_time appears to be wrong:

remote_offset = remote_time - gen->state.sampled_time;

That is,

remote_offset = (remote sampled_time + remote local offset + remote remote offset) - gen->state.sampled_time

It should be

remote_offset = remote_time - (sampled_time + local offset + remote offset)

Since the sampled time is not the actual current time, it may be off by 1 second. So the new remote_offset will be at least 1 second more than it should be. Since this is the same remote_offset used to generate the CSN to send back to the other master, this offset would keep increasing over time. The script attached to the bug helps measure this effect.

The new code also attempts to refresh the sampled time while adjusting, to make sure we have as current a sampled_time as possible. In the old code, the remote_offset is "sent" back and forth between the masters, carried along in the CSN timestamp generation. In the new code, this can happen too, but to a far lesser extent, and should max out at (real offset + N seconds) where N is the number of masters.

In the old code, you could only call csngen_adjust_time if you first made sure the remote timestamp >= local timestamp. I have removed this restriction and moved that logic into csngen_adjust_time. I also cleaned up the code in the consumer extop - I combined the checking of the CSN from the extop with the max CSN from the supplier RUV - now we only adjust the time once, based on the max of all of these CSNs sent by the supplier. Finally, I cleaned up the error handling in a few places that assumed all errors were time skew errors.

Follow up - I found a bug in my previous patch: _csngen_adjust_local_time must not be called when the sampled time == the current time. So I fixed that where I was calling _csngen_adjust_local_time, and I also changed _csngen_adjust_local_time so that time_diff == 0 is a no-op.

Platforms tested: RHEL5, F8, F9
Flag Day: no
Doc impact: no
QA impact: Should test MMR and use the script to measure the offset effect.
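The accumulation mechanism described in the fix can be sketched with a toy model. This is not the server code: the function names, the single-second sampling lag, and the simplified "offset only grows" rule are all illustrative assumptions, chosen to mirror the old and new formulas above.

```python
# Toy model: two masters bounce CSN timestamps back and forth. The
# CSN timestamp is (sampled_time + local_offset + remote_offset), and
# the sampled time comes from a time thread that ticks once per second,
# so it can lag the real clock by up to a second.

def old_adjust(state, remote_csn_time, now):
    # Old logic: remote_offset = remote_time - sampled_time. The remote
    # time already carries the remote's own remote_offset, and the local
    # sampled time is stale, so the lag is re-added on every exchange.
    stale_sampled = now - 1
    off = remote_csn_time - stale_sampled
    if off > state["remote"]:
        state["remote"] = off

def new_adjust(state, remote_csn_time, now):
    # Fixed logic: refresh the sampled time, then compare the remote CSN
    # time against our full current CSN time (sampled + local + remote),
    # adjusting only by the amount the remote actually exceeds it.
    cur = now + state["local"] + state["remote"]
    if remote_csn_time > cur:
        state["remote"] += remote_csn_time - cur

def simulate(adjust, exchanges=10):
    a = {"local": 0, "remote": 0}
    b = {"local": 0, "remote": 0}
    now = 1000  # both real clocks agree; only the sampling lags
    for i in range(exchanges):
        sender, receiver = (a, b) if i % 2 == 0 else (b, a)
        csn_time = now + sender["local"] + sender["remote"]
        adjust(receiver, csn_time, now)
        now += 1
    return max(a["remote"], b["remote"])

print("old:", simulate(old_adjust))  # old: 10  (one second per exchange)
print("new:", simulate(new_adjust))  # new: 0   (clocks agree, no creep)
```

With perfectly synchronized clocks, the old formula still grows the offset by a second per replication session, which is exactly the slow creep toward 86400 that the attached script measures.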
Created attachment 310201 [details] readNsState.py output of MMR servers, 24 hours later

On our servers the remote offset hasn't changed after a day, so we may not be affected (our servers are not that busy). In any case, it looks like you're on to a patch.
Yeah, the offset creep rate is proportional to the update rate. It may take you years to hit the bug at the rate you're going :-)
Created attachment 310525 [details] readNsState.py output readNsState.py output for MMR server #1, 24 hour period, 30 min interval between samples
Created attachment 310526 [details] readNsState.py output readNsState.py output for MMR server #2, 24 hour period, 30 min interval between samples
Thanks Gary, this is excellent. Much appreciated. The remote offset increased by over 1000 in a 24-hour period. At this rate, you should hit the remote offset limit in 80 or so days.
Checked into Directory71RtmBranch.

Checking in ldap/servers/plugins/replication/repl5.h;
/cvs/dirsec/ldapserver/ldap/servers/plugins/replication/repl5.h,v <-- repl5.h
new revision: 1.8.2.1; previous revision: 1.8
done
Checking in ldap/servers/plugins/replication/repl5_inc_protocol.c;
/cvs/dirsec/ldapserver/ldap/servers/plugins/replication/repl5_inc_protocol.c,v <-- repl5_inc_protocol.c
new revision: 1.8.2.2; previous revision: 1.8.2.1
done
Checking in ldap/servers/plugins/replication/repl5_replica.c;
/cvs/dirsec/ldapserver/ldap/servers/plugins/replication/repl5_replica.c,v <-- repl5_replica.c
new revision: 1.11.2.1; previous revision: 1.11
done
Checking in ldap/servers/plugins/replication/repl_extop.c;
/cvs/dirsec/ldapserver/ldap/servers/plugins/replication/repl_extop.c,v <-- repl_extop.c
new revision: 1.7.2.2; previous revision: 1.7.2.1
done
Checking in ldap/servers/plugins/replication/windows_inc_protocol.c;
/cvs/dirsec/ldapserver/ldap/servers/plugins/replication/windows_inc_protocol.c,v <-- windows_inc_protocol.c
new revision: 1.11.2.1; previous revision: 1.11
done
Checking in ldap/servers/slapd/csngen.c;
/cvs/dirsec/ldapserver/ldap/servers/slapd/csngen.c,v <-- csngen.c
new revision: 1.5.2.1; previous revision: 1.5
done
Checked into Directory_Server_8_0_Branch.

Checking in ldap/servers/plugins/replication/repl5.h;
/cvs/dirsec/ldapserver/ldap/servers/plugins/replication/repl5.h,v <-- repl5.h
new revision: 1.10.2.1; previous revision: 1.10
done
Checking in ldap/servers/plugins/replication/repl5_inc_protocol.c;
/cvs/dirsec/ldapserver/ldap/servers/plugins/replication/repl5_inc_protocol.c,v <-- repl5_inc_protocol.c
new revision: 1.11.2.2; previous revision: 1.11.2.1
done
Checking in ldap/servers/plugins/replication/repl5_replica.c;
/cvs/dirsec/ldapserver/ldap/servers/plugins/replication/repl5_replica.c,v <-- repl5_replica.c
new revision: 1.16.2.1; previous revision: 1.16
done
Checking in ldap/servers/plugins/replication/repl_extop.c;
/cvs/dirsec/ldapserver/ldap/servers/plugins/replication/repl_extop.c,v <-- repl_extop.c
new revision: 1.12.2.1; previous revision: 1.12
done
Checking in ldap/servers/plugins/replication/windows_inc_protocol.c;
/cvs/dirsec/ldapserver/ldap/servers/plugins/replication/windows_inc_protocol.c,v <-- windows_inc_protocol.c
new revision: 1.15.2.1; previous revision: 1.15
done
Checking in ldap/servers/slapd/csngen.c;
/cvs/dirsec/ldapserver/ldap/servers/slapd/csngen.c,v <-- csngen.c
new revision: 1.7.2.1; previous revision: 1.7
done
Hi all, I'm having what seems to be the same problem on a Fedora DS 1.0.4 (x64)/RHEL ES 4.5 installation. The topology is made of 4 multimaster servers and 3 consumers. Each master has an MMR agreement to each other master and consumer replica, which makes a total of 6 agreements on each master.

All worked fine for about a year and a half (we upgraded from FDS 1.0.2 in May 2006), but a couple of weeks ago replication stopped working with the "too much clock skew" error. Of course the clocks were NTPed, so I tried to reinit the replicas, manually and with the console, but to no avail. Then I found this bug and ran the readNsState.py script... and with no real surprise the remote offset was 86400 on all masters.

Then I tried to recreate the whole replica topology from scratch: deleted all cn=replica,cn=config entries, stopped all servers, deleted changelogdb/*, started the servers, enabled replication from the console, recreated all 24 agreements on the masters, and initialized all the databases from the first master server using the console. All is now working again, but the strange thing is that when I ran readNsState.py as soon as I init'ed the replicas, the remote offset was already around 40000 (yeah, that's forty thousand). Now, 6 hours after the init, this is the output of the script:

[root@store1 ~]# python readNsState.py /opt/fedora-ds/slapd-store1/config/dse.ldif
For replica cn=replica,cn="dc=notartel,dc=it",cn=mapping tree,cn=config
len of nsstate is 40
CSN generator state:
  Replica ID    : 1
  Sampled Time  : 1218618718
  Time in hex   : 0x48a2a55e
  Time as str   : Wed Aug 13 11:11:58 2008
  Local Offset  : 0
  Remote Offset : 81404
  Seq. num      : 1
  System time   : Wed Aug 13 11:14:32 2008
  Diff in sec.  : 154

The remote offset value is similar on all servers. I am sure there were no write operations on the masters except for the replica initializations. How can this be possible? Can I somehow force the value of the remote offset in the nsstate attribute of cn=replica,cn=config?
I ask because it does not seem that I can upgrade FDS to 1.1 on RHEL4 , and an OS upgrade is quite impossible since this is an important production system. I've read that Rich and Nathan made a patch to resolve this, but is there or will ever be available a binary patch to 1.0 that I can install on RHEL4? Many thanks in advance.
(In reply to comment #21)
-snip-
> I am sure there were no write operations on the masters except from the
> replicas initializations. How can this be possible? Can I somehow force the
> value of the remote offset in the nsstate attribute of cn=replica, cn=config ?

That would be very tricky, because if you make your offset too small, you will generate duplicate CSNs, which will break replication.

> I ask because it does not seem that I can upgrade FDS to 1.1 on RHEL4, and an
> OS upgrade is quite impossible since this is an important production system.
> I've read that Rich and Nathan made a patch to resolve this, but is there or
> will ever be available a binary patch to 1.0 that I can install on RHEL4?

We will probably not make any more RHEL4 binaries available. It's even very difficult for us to provide binaries that can run on RHEL5 (Fedora Core 6). If you want the latest code, you can build it yourself. If you feel so inclined, I can help you with that.

> Many thanks in advance.
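Why forcing the offset down is risky can be seen with a toy CSN model. This is a sketch only: real CSNs also carry additional fields, and the helper name and values here are illustrative, using the remote offset figure from the comment above.

```python
# CSNs order replicated changes; here modeled as a tuple of
# (timestamp, sequence number, replica ID), where the timestamp is
# sampled_time + local_offset + remote_offset. Replication relies on
# each replica issuing monotonically increasing CSNs.
def make_csn(sampled_time, local_offset, remote_offset, seq, rid):
    return (sampled_time + local_offset + remote_offset, seq, rid)

# A CSN issued before the offset was tampered with (offset 81404,
# as in the readNsState.py output above):
before = make_csn(1218618718, 0, 81404, 1, 1)

# Same wall-clock moment, but with the remote offset forced to zero:
after = make_csn(1218618718, 0, 0, 2, 1)

# The "newer" change now sorts before the older one, so other
# replicas would see updates arriving out of order.
print(after < before)  # True
```

Until the wall clock catches up with the old offset (here, almost a full day), every new CSN would compare lower than CSNs already replicated, which is why shrinking the offset by hand breaks replication.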
Created attachment 314403 [details] CSN stats for replica 1
Created attachment 314404 [details] csn stats for replica 2
4-way MMR (RHEL4, RHEL5, HP-UX, Solaris 9). A 200,000-user loading test ran for 2:50; attached the results from checking the CSN generator. Bug verified.
Created attachment 314765 [details] new readNsState.py This new script can handle 32-bit or 64-bit, big endian or little endian - it should be able to read any of these on any system.
Created attachment 314770 [details] new readNsState.py Bug in previous version
An advisory has been issued which should help the problem described in this bug report. This report is therefore being closed with a resolution of ERRATA. For more information on the solution and/or where to find the updated files, please follow the link below. You may reopen this bug report if the solution does not work for you.

http://rhn.redhat.com/errata/RHSA-2008-0602.html