Bug 233642
Summary:      MMR breaks with time skew errors
Product:      [Retired] 389
Component:    Replication - General
Version:      1.0.3
Hardware:     x86_64
OS:           Linux
Status:       CLOSED ERRATA
Severity:     high
Priority:     medium
Reporter:     Chris St. Pierre <cstpierr>
Assignee:     Rich Megginson <rmeggins>
QA Contact:   Chandrasekar Kannan <ckannan>
CC:           andrey.ivanov, benl, bugzilla, claudio.tassini, jgalipea, windhamg
Doc Type:     Bug Fix
Last Closed:  2008-08-27 20:39:00 UTC
Bug Blocks:   152373, 249650, 452721, 453229
Description
Chris St. Pierre
2007-03-23 15:41:09 UTC
*** Bug 233643 has been marked as a duplicate of this bug. ***

Hi, I am trying to follow up on this bug. I have set up circular MMR in my environment with about 1000 records, and the four servers have been running fine for two days. Could you share more details so I can reproduce this bug?

*) What MMR agreements do you have?
*) How many records do you have? (a rough number is fine)
*) What types of operations did you perform before it broke?
*) Did all servers fail at roughly the same time, or one by one?

Can you try reducing your MMR from 4 servers to 2 servers to confirm this? :) Thanks.

1. Each of the four nodes had an agreement with each of the three other nodes.

2. 5000-6000.

3. One of the nodes stopped replicating at 17:13:25. The last several operations in the access logs are as follows:

[20/Mar/2007:17:13:24 -0500] conn=1593620 op=23 RESULT err=0 tag=101 nentries=0 etime=0
[20/Mar/2007:17:13:24 -0500] conn=1593620 op=24 SRCH base="ou=Groups,o=nebrwesleyan.edu,o=isp" scope=2 filter="(&(objectClass=sambaGroupMapping)(sambaSID=S-1-5-2))" attrs="gidNumber sambaSID sambaGroupType sambasidlist description displayName cn objectClass"
[20/Mar/2007:17:13:24 -0500] conn=1593620 op=24 RESULT err=0 tag=101 nentries=0 etime=0
[20/Mar/2007:17:13:24 -0500] conn=1593620 op=25 SRCH base="ou=Groups,o=nebrwesleyan.edu,o=isp" scope=2 filter="(&(objectClass=sambaGroupMapping)(sambaSID=S-1-5-32-546))" attrs="gidNumber sambaSID sambaGroupType sambasidlist description displayName cn objectClass"
[20/Mar/2007:17:13:24 -0500] conn=1593620 op=25 RESULT err=0 tag=101 nentries=0 etime=0
[20/Mar/2007:17:13:24 -0500] conn=1593620 op=26 SRCH base="ou=Groups,o=nebrwesleyan.edu,o=isp" scope=2 filter="(&(objectClass=sambaGroupMapping)(|(displayName=ntadmin)(cn=ntadmin)))" attrs="gidNumber sambaSID sambaGroupType sambasidlist description displayName cn objectClass"
[20/Mar/2007:17:13:24 -0500] conn=1593620 op=26 RESULT err=0 tag=101 nentries=1 etime=0
[20/Mar/2007:17:13:25 -0500] conn=1593621 fd=249 slot=249 connection from 10.1.1.33 to 10.1.1.19
[20/Mar/2007:17:13:25 -0500] conn=1593621 op=0 EXT oid="1.3.6.1.4.1.1466.20037" name="startTLS"
[20/Mar/2007:17:13:25 -0500] conn=1593621 op=0 RESULT err=0 tag=120 nentries=0 etime=0
[20/Mar/2007:17:13:25 -0500] conn=1593621 SSL 256-bit AES
[20/Mar/2007:17:13:25 -0500] conn=1593621 op=1 SRCH base="ou=people,o=nebrwesleyan.edu,o=isp" scope=2 filter="(uid=)" attrs="cn"
[20/Mar/2007:17:13:25 -0500] conn=1593621 op=1 RESULT err=0 tag=101 nentries=0 etime=0
[20/Mar/2007:17:13:25 -0500] conn=1593621 op=2 UNBIND
[20/Mar/2007:17:13:25 -0500] conn=1593621 op=2 fd=249 closed - U1
[20/Mar/2007:17:13:25 -0500] conn=1592136 op=-1 fd=305 closed - B1
[20/Mar/2007:17:13:25 -0500] conn=1592523 op=90 BIND dn="uid=jcaldwel,ou=people,o=nebrwesleyan.edu,o=isp" method=128 version=3
[20/Mar/2007:17:13:25 -0500] conn=1592523 op=90 RESULT err=0 tag=97 nentries=0 etime=0 dn="uid=jcaldwel,ou=people,o=nebrwesleyan.edu,o=isp"
[20/Mar/2007:17:13:25 -0500] conn=1590611 op=874 EXT oid="2.16.840.1.113730.3.5.3" name="Netscape Replication Start Session"

4. The first error came between 17:13:25 and 17:13:26 on all nodes. The first of 12 replication agreements broke at 17:14:11, and the last at 17:58:28.

With the help of some of the folks on IRC, I've managed to get a three-node cluster back up and running in production. (I haven't tried a four-node cluster.) In order to do so, I followed the instructions I posted on the mailing list (https://www.redhat.com/archives/fedora-directory-users/2007-March/msg00108.html) with one change: when dismantling the replication agreements, I completely got rid of the cn=replica node for the failed database, not just the agreements within that node. "uffe", the guy who helped me, thought that this would only delay the problem, not prevent it. Without that extra step, every attempt to create an MMR cluster with 2 or 3 nodes failed as soon as I put it into production.

Yi - it was thought that some state information stored in cn=replica or in the replication agreement entry was causing the time skew error.

Created attachment 308969 [details]
script to check CSN state
This script can be used to periodically check the state of the CSN generator.
To use it:
python readNsState.py /path/to/slapd-instance/dse.ldif
This should print output like this for each replica:
For replica cn=replica, cn="dc=example,dc=com", cn=mapping tree, cn=config
len of nsstate is 40
CSN generator state:
Replica ID : 2
Sampled Time : 1213206956
Time in hex : 0x485011ac
Time as str : Wed Jun 11 11:55:56 2008
Local Offset : 0
Remote Offset : 1
Seq. num : 1
System time : Wed Jun 11 11:56:50 2008
Diff in sec. : 54
The state in dse.ldif is updated every 30 seconds. I'm mainly interested in
1) How far off from the system time is the Sampled Time?
2) Does Remote Offset keep increasing and never decreases?
3) Does Seq. num approach 65535?
I think this problem is related to item 2. I would appreciate it if you could run
this script periodically and attach any interesting output to this bug.
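As a rough illustration of what the script does, here is a minimal sketch of decoding the nsState attribute from dse.ldif. The field layout is an assumption based on the 40-byte (64-bit, little-endian) case shown above; the attached readNsState.py is the authoritative version and also handles other layouts.

```python
import base64
import struct

# Assumed 40-byte little-endian layout: replica id (16-bit), padding,
# then sampled time, local offset, remote offset (64-bit each),
# then the sequence number (16-bit) and trailing padding.
NSSTATE_FMT_64 = "<H6xQQQH6x"

def parse_nsstate(b64state):
    """Decode a base64 nsState value into the CSN generator fields."""
    raw = base64.b64decode(b64state)
    rid, sampled, local_off, remote_off, seq = struct.unpack(NSSTATE_FMT_64, raw)
    return {
        "rid": rid,
        "sampled_time": sampled,   # epoch seconds, e.g. 1213206956
        "local_offset": local_off,
        "remote_offset": remote_off,
        "seq_num": seq,
    }
```

Comparing sampled_time against the current system time gives the "Diff in sec." line, and remote_offset is the value to watch: replication stops once it reaches 86400.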
I set up MMR between a VM running F8 and one running F9. I synced the clocks to within a few seconds. I ran a script that added 100,000 entries alternately to each server. Within about 40 minutes, the remote offset had gone from 0 to 75. At that rate, the remote offset would hit 86400 in about 32 days, and the time skew error would occur.

Created attachment 310040 [details]
diffs
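The creep-rate arithmetic above is easy to check (plain arithmetic, no server needed):

```python
SKEW_LIMIT = 86400    # one day in seconds: replication stops at this offset
creep = 75            # observed remote-offset growth in seconds...
window_minutes = 40   # ...over this many minutes of constant update load

minutes_to_limit = SKEW_LIMIT / (creep / window_minutes)  # 46080 minutes
days_to_limit = minutes_to_limit / (60 * 24)
print(days_to_limit)  # -> 32.0
```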
Created attachment 310083 [details]
MMR diffs as requested by Rich on mailing list
The output of Rich's readNsState.py script for our two MMR servers.
Thanks. Try running the script every half hour - you should see the remote offset field keep increasing.

Created attachment 310193 [details]
fix csngen.c
Created attachment 310196 [details]
cvs commit log
Reviewed by: nhosoi, nkinder (Thanks!)
Fix Description: CSN remote offset generation seems broken. We seem to
accumulate a remote offset that keeps growing until we hit the limit of 1 day,
then replication stops. The idea behind the remote offset is that servers may
be seconds or minutes off. When replication starts, one of the items in the
payload of the start extop is the latest CSN from the supplier. The CSN
timestamp field is (sampled_time + local offset + remote offset). Sampled time
comes from the time thread in the server that updates the time once per second.
This allows the consumer, if also a master, to adjust its CSN generation so as
not to generate duplicates or CSNs less than those from the supplier. However,
the logic in csngen_adjust_time appears to be wrong:
remote_offset = remote_time - gen->state.sampled_time;

That is,

remote_offset = (remote sampled_time + remote local_offset + remote remote_offset) - gen->state.sampled_time

It should be:

remote_offset = remote_time - (sampled_time + local_offset + remote_offset)
Since the sampled time is not the actual current time, it may be off by 1
second. So the new remote_offset will be at least 1 second more than it should
be. Since this is the same remote_offset used to generate the CSN to send back
to the other master, this offset would keep increasing and increasing over
time. The script attached to the bug helps measure this effect. The new code
also attempts to refresh the sampled time while adjusting to make sure we have
as current a sampled_time as possible. In the old code, the remote_offset is
"sent" back and forth between the masters, carried along in the CSN timestamp
generation. In the new code, this can happen too, but to a far less extent,
and should max out at (real offset + N seconds) where N is the number of
masters.
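A worked example with made-up numbers (not taken from the bug) illustrates the difference between the two formulas. Suppose two masters with synchronized clocks, where the supplier has already accumulated a remote offset of 5 seconds and the consumer's once-per-second sampled time lags by 1 second:

```python
# Supplier's CSN timestamp: sampled_time + local_offset + remote_offset
supplier_sampled, supplier_lo, supplier_ro = 1000, 0, 5
remote_time = supplier_sampled + supplier_lo + supplier_ro   # 1005

# Consumer state: clocks in sync, but its sampled_time lags by 1 second
consumer_sampled, consumer_lo, consumer_ro = 999, 0, 5

# Old (buggy) computation: subtract only the consumer's sampled_time, so
# the supplier's accumulated offset plus the sampling lag is re-absorbed
# into the consumer's remote offset -> it grows on every exchange
old_remote_offset = remote_time - consumer_sampled           # 6

# Corrected computation: compare full CSN time to full CSN time; the
# residue is just the 1-second sampling lag, not a growing offset
residue = remote_time - (consumer_sampled + consumer_lo + consumer_ro)  # 1
```

Under the old formula each supplier-to-consumer exchange hands the inflated offset back in the next CSN, which is the back-and-forth "sent" behavior the fix description mentions.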
In the old code, you could only call csngen_adjust_time if you first made sure
the remote timestamp >= local timestamp. I have removed this restriction and
moved that logic into csngen_adjust_time. I also cleaned up the code in the
consumer extop - I combined the checking of the CSN from the extop with the max
CSN from the supplier RUV - now we only adjust the time once based on the max
of all of these CSNs sent by the supplier.
Finally, I cleaned up the error handling in a few places that assumed all
errors were time skew errors.
Follow up - I found a bug in my previous patch - _csngen_adjust_local_time must
not be called when the sampled time == the current time. So I fixed that where
I was calling _csngen_adjust_local_time, and I also changed
_csngen_adjust_local_time so that time_diff == 0 is a no-op.
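The follow-up fix can be modeled in a few lines. This is a simplified Python model with hypothetical names, not a transcription of the C code in csngen.c; in particular the backwards-clock branch is an assumption about the general approach. It shows the guard that makes a zero time difference a no-op:

```python
def adjust_local_time(state, current_time):
    """Simplified model: advance sampled_time toward the wall clock.

    state is a dict with 'sampled_time', 'local_offset', 'seq_num'.
    """
    time_diff = current_time - state["sampled_time"]
    if time_diff == 0:
        return state  # the follow-up fix: sampled time is current, do nothing
    if time_diff > 0:
        state["sampled_time"] = current_time
        # consume accumulated local offset as real time catches up
        state["local_offset"] = max(0, state["local_offset"] - time_diff)
        state["seq_num"] = 0  # new second, restart the sequence number
    else:
        # system clock moved backwards: remember it in the local offset
        state["local_offset"] += -time_diff
    return state
```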
Platforms tested: RHEL5, F8, F9
Flag Day: no
Doc impact: no
QA impact: Should test MMR and use the script to measure the offset effect.
Created attachment 310201 [details]
readNsState.py output of MMR servers, 24 hours later
On our servers the remote offset hasn't changed after a day; we may not be
affected (our servers are not that busy). In any case it looks like you're on
to a patch.
Yeah, the offset creep rate is proportional to the update rate. It may take you years to hit the bug at the rate you're going :-)

Created attachment 310525 [details]
readNsState.py output
readNsState.py output for MMR server #1, 24 hour period, 30 min interval
between samples
Created attachment 310526 [details]
readNsState.py output
readNsState.py output for MMR server #2, 24 hour period, 30 min interval
between samples
Thanks Gary, this is excellent. Much appreciated. An increase of remote offset of over 1000 in a 24-hour period. At this rate, you should hit the remote offset limit in about 80 or so days.

Checked into Directory71RtmBranch:
- repl5.h: new revision 1.8.2.1 (previous 1.8)
- repl5_inc_protocol.c: new revision 1.8.2.2 (previous 1.8.2.1)
- repl5_replica.c: new revision 1.11.2.1 (previous 1.11)
- repl_extop.c: new revision 1.7.2.2 (previous 1.7.2.1)
- windows_inc_protocol.c: new revision 1.11.2.1 (previous 1.11)
- csngen.c: new revision 1.5.2.1 (previous 1.5)

Checked into Directory_Server_8_0_Branch:
- repl5.h: new revision 1.10.2.1 (previous 1.10)
- repl5_inc_protocol.c: new revision 1.11.2.2 (previous 1.11.2.1)
- repl5_replica.c: new revision 1.16.2.1 (previous 1.16)
- repl_extop.c: new revision 1.12.2.1 (previous 1.12)
- windows_inc_protocol.c: new revision 1.15.2.1 (previous 1.15)
- csngen.c: new revision 1.7.2.1 (previous 1.7)

(All files under ldap/servers/plugins/replication/ in CVS, except csngen.c under ldap/servers/slapd/.)

Hi all, I'm having what seems to be the same problem on a Fedora DS 1.0.4 (x64) / RHEL ES 4.5 installation. The topology is made of 4 multimaster servers and 3 consumers. Each master has an MMR agreement to each other master and consumer replica, which makes a total of 6 agreements on each master. All worked fine for about a year and a half (we upgraded from FDS 1.0.2 in May 2006), but a couple of weeks ago replication stopped working with the "too much clock skew" error. Of course the clock was NTPed, so I tried to reinit the replicas, manually and with the console, but to no avail. Then I found this bug and ran the readNsState.py script... and with no real surprise the remote offset was 86400 on all masters.
Then I tried to recreate the whole replica topology from scratch: deleted all cn=replica,cn=config entries, stopped all servers, deleted changelogdb/*, started the servers, enabled replication from the console, recreated all 24 agreements on the masters, and initialized all the databases from the first master server using the console. All is now working again, but the strange thing is that when I ran readNsState.py right after I initialized the replicas, the remote offset was already around 40000 (yes, forty thousand). Now, six hours after the init, this is the output of the script:

[root@store1 ~]# python readNsState.py /opt/fedora-ds/slapd-store1/config/dse.ldif
For replica cn=replica,cn="dc=notartel,dc=it",cn=mapping tree,cn=config
len of nsstate is 40
CSN generator state:
  Replica ID    : 1
  Sampled Time  : 1218618718
  Time in hex   : 0x48a2a55e
  Time as str   : Wed Aug 13 11:11:58 2008
  Local Offset  : 0
  Remote Offset : 81404
  Seq. num      : 1
  System time   : Wed Aug 13 11:14:32 2008
  Diff in sec.  : 154

The remote offset value is similar on all servers. I am sure there were no write operations on the masters except the replica initializations. How can this be possible? Can I somehow force the value of the remote offset in the nsstate attribute of cn=replica,cn=config? I ask because it does not seem that I can upgrade FDS to 1.1 on RHEL4, and an OS upgrade is quite impossible since this is an important production system. I've read that Rich and Nathan made a patch to resolve this, but is there, or will there ever be, a binary patch for 1.0 that I can install on RHEL4? Many thanks in advance.

(In reply to comment #21)
-snip-
> I am sure there were no write operations on the masters except the replica
> initializations. How can this be possible? Can I somehow force the value of
> the remote offset in the nsstate attribute of cn=replica,cn=config?

That would be very tricky, because if you make your offset too small, you will generate duplicate CSNs, which will break replication.

> I ask because it does not seem that I can upgrade FDS to 1.1 on RHEL4, and an
> OS upgrade is quite impossible since this is an important production system.
> I've read that Rich and Nathan made a patch to resolve this, but is there, or
> will there ever be, a binary patch for 1.0 that I can install on RHEL4?

We will probably not make any more RHEL4 binaries available. It's even very difficult for us to provide binaries that can run on RHEL5 (Fedora Core 6). If you want the latest code, you can build it yourself. If you feel so inclined, I can help you with that.

> Many thanks in advance.

Created attachment 314403 [details]
CSN stats for replica 1
Created attachment 314404 [details]
csn stats for replica 2
4-way MMR (RHEL4, RHEL5, HPUX, SOL9), 200,000-user loading test ran for 2:50 minutes - attached results from checking the CSN generator - bug verified.

Created attachment 314765 [details]
new readNsState.py
This new script can handle 32-bit or 64-bit, big endian or little endian - it should be able to read any of these on any system.
Created attachment 314770 [details]
new readNsState.py
Bug in previous version
An advisory has been issued which should help the problem described in this bug report. This report is therefore being closed with a resolution of ERRATA. For more information on the solution and/or where to find the updated files, please follow the link below. You may reopen this bug report if the solution does not work for you.

http://rhn.redhat.com/errata/RHSA-2008-0602.html