Bug 233642
Summary:      MMR breaks with time skew errors
Product:      [Retired] 389
Component:    Replication - General
Version:      1.0.3
Hardware:     x86_64
OS:           Linux
Status:       CLOSED ERRATA
Severity:     high
Priority:     medium
Reporter:     Chris St. Pierre <cstpierr>
Assignee:     Rich Megginson <rmeggins>
QA Contact:   Chandrasekar Kannan <ckannan>
CC:           andrey.ivanov, benl, bugzilla, claudio.tassini, jgalipea, windhamg
Doc Type:     Bug Fix
Last Closed:  2008-08-27 20:39:00 UTC
Bug Blocks:   152373, 249650, 452721, 453229
Description
Chris St. Pierre
2007-03-23 15:41:09 UTC
*** Bug 233643 has been marked as a duplicate of this bug. ***

Hi, I am trying to follow up on this bug. I have set up circular MMR in my environment with about 1000 records, and the four servers have been running fine for two days. Could you share more details so I can reproduce this bug?

*) What MMR agreements do you have?
*) How many records do you have? (a rough number is fine)
*) What types of operations did you perform before it broke?
*) Did all servers fail at roughly the same time, or one by one?

Can you try reducing your MMR from 4 servers to 2 servers to confirm this? :) Thanks.

1. Each of the four nodes had an agreement with each of the three other nodes.

2. 5000-6000.

3. One of the nodes stopped replicating at 17:13:25. The last several operations in the access logs are as follows:

[20/Mar/2007:17:13:24 -0500] conn=1593620 op=23 RESULT err=0 tag=101 nentries=0 etime=0
[20/Mar/2007:17:13:24 -0500] conn=1593620 op=24 SRCH base="ou=Groups,o=nebrwesleyan.edu,o=isp" scope=2 filter="(&(objectClass=sambaGroupMapping)(sambaSID=S-1-5-2))" attrs="gidNumber sambaSID sambaGroupType sambasidlist description displayName cn objectClass"
[20/Mar/2007:17:13:24 -0500] conn=1593620 op=24 RESULT err=0 tag=101 nentries=0 etime=0
[20/Mar/2007:17:13:24 -0500] conn=1593620 op=25 SRCH base="ou=Groups,o=nebrwesleyan.edu,o=isp" scope=2 filter="(&(objectClass=sambaGroupMapping)(sambaSID=S-1-5-32-546))" attrs="gidNumber sambaSID sambaGroupType sambasidlist description displayName cn objectClass"
[20/Mar/2007:17:13:24 -0500] conn=1593620 op=25 RESULT err=0 tag=101 nentries=0 etime=0
[20/Mar/2007:17:13:24 -0500] conn=1593620 op=26 SRCH base="ou=Groups,o=nebrwesleyan.edu,o=isp" scope=2 filter="(&(objectClass=sambaGroupMapping)(|(displayName=ntadmin)(cn=ntadmin)))" attrs="gidNumber sambaSID sambaGroupType sambasidlist description displayName cn objectClass"
[20/Mar/2007:17:13:24 -0500] conn=1593620 op=26 RESULT err=0 tag=101 nentries=1 etime=0
[20/Mar/2007:17:13:25 -0500] conn=1593621 fd=249 slot=249 connection from 10.1.1.33 to 10.1.1.19
[20/Mar/2007:17:13:25 -0500] conn=1593621 op=0 EXT oid="1.3.6.1.4.1.1466.20037" name="startTLS"
[20/Mar/2007:17:13:25 -0500] conn=1593621 op=0 RESULT err=0 tag=120 nentries=0 etime=0
[20/Mar/2007:17:13:25 -0500] conn=1593621 SSL 256-bit AES
[20/Mar/2007:17:13:25 -0500] conn=1593621 op=1 SRCH base="ou=people,o=nebrwesleyan.edu,o=isp" scope=2 filter="(uid=)" attrs="cn"
[20/Mar/2007:17:13:25 -0500] conn=1593621 op=1 RESULT err=0 tag=101 nentries=0 etime=0
[20/Mar/2007:17:13:25 -0500] conn=1593621 op=2 UNBIND
[20/Mar/2007:17:13:25 -0500] conn=1593621 op=2 fd=249 closed - U1
[20/Mar/2007:17:13:25 -0500] conn=1592136 op=-1 fd=305 closed - B1
[20/Mar/2007:17:13:25 -0500] conn=1592523 op=90 BIND dn="uid=jcaldwel,ou=people,o=nebrwesleyan.edu,o=isp" method=128 version=3
[20/Mar/2007:17:13:25 -0500] conn=1592523 op=90 RESULT err=0 tag=97 nentries=0 etime=0 dn="uid=jcaldwel,ou=people,o=nebrwesleyan.edu,o=isp"
[20/Mar/2007:17:13:25 -0500] conn=1590611 op=874 EXT oid="2.16.840.1.113730.3.5.3" name="Netscape Replication Start Session"

4. The first error came between 17:13:25 and 17:13:26 on all nodes. The first of 12 replication agreements broke at 17:14:11, and the last at 17:58:28.

With the help of some of the folks on IRC, I've managed to get a three-node cluster back up and running in production. (I haven't tried a four-node cluster.) In order to do so, I followed the instructions I posted on the mailing list (https://www.redhat.com/archives/fedora-directory-users/2007-March/msg00108.html) with one change: when dismantling the replication agreements, I completely got rid of the cn=replica node for the failed database, not just the agreements within that node. "uffe", the guy who helped me, thought that this would only delay the problem, not prevent it. Without that extra step, every attempt to create an MMR cluster with 2 or 3 nodes failed as soon as I put it into production.

Yi - it was thought that some state information stored in cn=replica or in the replication agreement entry was causing the time skew error.

Created attachment 308969 [details]
script to check CSN state
This script can be used to periodically check the state of the CSN generator.
To use it:
python readNsState.py /path/to/slapd-instance/dse.ldif
This should print output like this for each replica:
For replica cn=replica, cn="dc=example,dc=com", cn=mapping tree, cn=config
len of nsstate is 40
CSN generator state:
Replica ID : 2
Sampled Time : 1213206956
Time in hex : 0x485011ac
Time as str : Wed Jun 11 11:55:56 2008
Local Offset : 0
Remote Offset : 1
Seq. num : 1
System time : Wed Jun 11 11:56:50 2008
Diff in sec. : 54
The state in dse.ldif is updated every 30 seconds. I'm mainly interested in
1) How far off from the system time is the Sampled Time?
2) Does Remote Offset keep increasing and never decreases?
3) Does Seq. num approach 65535?
I think this problem is related to item 2. I would appreciate it if you could run
this script periodically and attach any interesting output to this bug.
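As a rough illustration of what the script does, here is a minimal sketch of decoding the nsState attribute from dse.ldif. The field layout is an assumption based on the 40-byte (64-bit, little-endian) case shown above; the attached readNsState.py is the authoritative version and also handles other layouts.

```python
import base64
import struct

# Assumed 40-byte little-endian layout: replica id (16-bit), padding,
# then sampled time, local offset, remote offset (64-bit each),
# then the sequence number (16-bit) and trailing padding.
NSSTATE_FMT_64 = "<H6xQQQH6x"

def parse_nsstate(b64state):
    """Decode a base64 nsState value into the CSN generator fields."""
    raw = base64.b64decode(b64state)
    rid, sampled, local_off, remote_off, seq = struct.unpack(NSSTATE_FMT_64, raw)
    return {
        "rid": rid,
        "sampled_time": sampled,   # epoch seconds, e.g. 1213206956
        "local_offset": local_off,
        "remote_offset": remote_off,
        "seq_num": seq,
    }
```

Comparing sampled_time against the current system time gives the "Diff in sec." line, and remote_offset is the value to watch: replication stops once it reaches 86400.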
I set up MMR between a VM running F8 and one running F9. I synced the clocks to within a few seconds. I ran a script that added 100,000 entries alternately to each server. Within about 40 minutes, the remote offset had gone from 0 to 75. At that rate, the remote offset would hit 86400 in about 32 days, and the time skew error would occur.

Created attachment 310040 [details]
diffs
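The creep-rate arithmetic above is easy to check (plain arithmetic, no server needed):

```python
SKEW_LIMIT = 86400    # one day in seconds: replication stops at this offset
creep = 75            # observed remote-offset growth in seconds...
window_minutes = 40   # ...over this many minutes of constant update load

minutes_to_limit = SKEW_LIMIT / (creep / window_minutes)  # 46080 minutes
days_to_limit = minutes_to_limit / (60 * 24)
print(days_to_limit)  # -> 32.0
```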
Created attachment 310083 [details]
MMR diffs as requested by Rich on mailing list
The output of Rich's readNsState.py script for our two MMR servers.
Thanks. Try running the script every half hour - you should see the remote offset field keep increasing.

Created attachment 310193 [details]
fix csngen.c
Created attachment 310196 [details]
cvs commit log
Reviewed by: nhosoi, nkinder (Thanks!)
Fix Description: CSN remote offset generation seems broken. We seem to
accumulate a remote offset that keeps growing until we hit the limit of 1 day,
then replication stops. The idea behind the remote offset is that servers may
be seconds or minutes off. When replication starts, one of the items in the
payload of the start extop is the latest CSN from the supplier. The CSN
timestamp field is (sampled_time + local offset + remote offset). Sampled time
comes from the time thread in the server that updates the time once per second.
This allows the consumer, if also a master, to adjust its CSN generation so as
not to generate duplicates or CSNs less than those from the supplier. However,
the logic in csngen_adjust_time appears to be wrong:
remote_offset = remote_time - gen->state.sampled_time;

That is,

remote_offset = (remote sampled_time + remote local_offset + remote remote_offset) - gen->state.sampled_time

It should be:

remote_offset = remote_time - (sampled_time + local_offset + remote_offset)
Since the sampled time is not the actual current time, it may be off by 1
second. So the new remote_offset will be at least 1 second more than it should
be. Since this is the same remote_offset used to generate the CSN to send back
to the other master, this offset would keep increasing and increasing over
time. The script attached to the bug helps measure this effect. The new code
also attempts to refresh the sampled time while adjusting to make sure we have
as current a sampled_time as possible. In the old code, the remote_offset is
"sent" back and forth between the masters, carried along in the CSN timestamp
generation. In the new code, this can happen too, but to a far less extent,
and should max out at (real offset + N seconds) where N is the number of
masters.
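A worked example with made-up numbers (not taken from the bug) illustrates the difference between the two formulas. Suppose two masters with synchronized clocks, where the supplier has already accumulated a remote offset of 5 seconds and the consumer's once-per-second sampled time lags by 1 second:

```python
# Supplier's CSN timestamp: sampled_time + local_offset + remote_offset
supplier_sampled, supplier_lo, supplier_ro = 1000, 0, 5
remote_time = supplier_sampled + supplier_lo + supplier_ro   # 1005

# Consumer state: clocks in sync, but its sampled_time lags by 1 second
consumer_sampled, consumer_lo, consumer_ro = 999, 0, 5

# Old (buggy) computation: subtract only the consumer's sampled_time, so
# the supplier's accumulated offset plus the sampling lag is re-absorbed
# into the consumer's remote offset -> it grows on every exchange
old_remote_offset = remote_time - consumer_sampled           # 6

# Corrected computation: compare full CSN time to full CSN time; the
# residue is just the 1-second sampling lag, not a growing offset
residue = remote_time - (consumer_sampled + consumer_lo + consumer_ro)  # 1
```

Under the old formula each supplier-to-consumer exchange hands the inflated offset back in the next CSN, which is the back-and-forth "sent" behavior the fix description mentions.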
In the old code, you could only call csngen_adjust_time if you first made sure
the remote timestamp >= local timestamp. I have removed this restriction and
moved that logic into csngen_adjust_time. I also cleaned up the code in the
consumer extop - I combined the checking of the CSN from the extop with the max
CSN from the supplier RUV - now we only adjust the time once based on the max
of all of these CSNs sent by the supplier.
Finally, I cleaned up the error handling in a few places that assumed all
errors were time skew errors.
Follow up - I found a bug in my previous patch - _csngen_adjust_local_time must
not be called when the sampled time == the current time. So I fixed that where
I was calling _csngen_adjust_local_time, and I also changed
_csngen_adjust_local_time so that time_diff == 0 is a no-op.
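The follow-up fix can be modeled in a few lines. This is a simplified Python model with hypothetical names, not a transcription of the C code in csngen.c; in particular the backwards-clock branch is an assumption about the general approach. It shows the guard that makes a zero time difference a no-op:

```python
def adjust_local_time(state, current_time):
    """Simplified model: advance sampled_time toward the wall clock.

    state is a dict with 'sampled_time', 'local_offset', 'seq_num'.
    """
    time_diff = current_time - state["sampled_time"]
    if time_diff == 0:
        return state  # the follow-up fix: sampled time is current, do nothing
    if time_diff > 0:
        state["sampled_time"] = current_time
        # consume accumulated local offset as real time catches up
        state["local_offset"] = max(0, state["local_offset"] - time_diff)
        state["seq_num"] = 0  # new second, restart the sequence number
    else:
        # system clock moved backwards: remember it in the local offset
        state["local_offset"] += -time_diff
    return state
```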
Platforms tested: RHEL5, F8, F9
Flag Day: no
Doc impact: no
QA impact: Should test MMR and use the script to measure the offset effect.
Created attachment 310201 [details]
readNsState.py output of MMR servers, 24 hours later
On our servers the remote offset hasn't changed after a day; we may not be
affected (our servers are not that busy). In any case it looks like you're on
to a patch.
Yeah, the offset creep rate is proportional to the update rate. It may take you years to hit the bug at the rate you're going :-)

Created attachment 310525 [details]
readNsState.py output
readNsState.py output for MMR server #1, 24 hour period, 30 min interval
between samples
Created attachment 310526 [details]
readNsState.py output
readNsState.py output for MMR server #2, 24 hour period, 30 min interval
between samples
Thanks Gary, this is excellent. Much appreciated. An increase of remote offset of over 1000 in a 24-hour period. At this rate, you should hit the remote offset limit in about 80 or so days.

Checked into Directory71RtmBranch:
- repl5.h: new revision 1.8.2.1 (previous 1.8)
- repl5_inc_protocol.c: new revision 1.8.2.2 (previous 1.8.2.1)
- repl5_replica.c: new revision 1.11.2.1 (previous 1.11)
- repl_extop.c: new revision 1.7.2.2 (previous 1.7.2.1)
- windows_inc_protocol.c: new revision 1.11.2.1 (previous 1.11)
- csngen.c: new revision 1.5.2.1 (previous 1.5)

Checked into Directory_Server_8_0_Branch:
- repl5.h: new revision 1.10.2.1 (previous 1.10)
- repl5_inc_protocol.c: new revision 1.11.2.2 (previous 1.11.2.1)
- repl5_replica.c: new revision 1.16.2.1 (previous 1.16)
- repl_extop.c: new revision 1.12.2.1 (previous 1.12)
- windows_inc_protocol.c: new revision 1.15.2.1 (previous 1.15)
- csngen.c: new revision 1.7.2.1 (previous 1.7)

(All files under ldap/servers/plugins/replication/ in CVS, except csngen.c under ldap/servers/slapd/.)

Hi all, I'm having what seems to be the same problem on a Fedora DS 1.0.4 (x64) / RHEL ES 4.5 installation. The topology is made of 4 multimaster servers and 3 consumers. Each master has an MMR agreement to each other master and consumer replica, which makes a total of 6 agreements on each master. All worked fine for about a year and a half (we upgraded from FDS 1.0.2 in May 2006), but a couple of weeks ago replication stopped working with the "too much clock skew" error. Of course the clock was NTPed, so I tried to reinit the replicas, manually and with the console, but to no avail. Then I found this bug and ran the readNsState.py script... and with no real surprise the remote offset was 86400 on all masters.
Then I tried to recreate the whole replica topology from scratch: deleted all cn=replica,cn=config entries, stopped all servers, deleted changelogdb/*, started the servers, enabled replication from the console, recreated all 24 agreements on the masters, and initialized all the databases from the first master server using the console. All is now working again, but the strange thing is that when I ran readNsState.py right after I initialized the replicas, the remote offset was already around 40000 (yes, forty thousand). Now, six hours after the init, this is the output of the script:

[root@store1 ~]# python readNsState.py /opt/fedora-ds/slapd-store1/config/dse.ldif
For replica cn=replica,cn="dc=notartel,dc=it",cn=mapping tree,cn=config
len of nsstate is 40
CSN generator state:
  Replica ID    : 1
  Sampled Time  : 1218618718
  Time in hex   : 0x48a2a55e
  Time as str   : Wed Aug 13 11:11:58 2008
  Local Offset  : 0
  Remote Offset : 81404
  Seq. num      : 1
  System time   : Wed Aug 13 11:14:32 2008
  Diff in sec.  : 154

The remote offset value is similar on all servers. I am sure there were no write operations on the masters except the replica initializations. How can this be possible? Can I somehow force the value of the remote offset in the nsstate attribute of cn=replica,cn=config? I ask because it does not seem that I can upgrade FDS to 1.1 on RHEL4, and an OS upgrade is quite impossible since this is an important production system. I've read that Rich and Nathan made a patch to resolve this, but is there, or will there ever be, a binary patch for 1.0 that I can install on RHEL4? Many thanks in advance.

(In reply to comment #21)
-snip-
> I am sure there were no write operations on the masters except the replica
> initializations. How can this be possible? Can I somehow force the value of
> the remote offset in the nsstate attribute of cn=replica,cn=config?

That would be very tricky, because if you make your offset too small, you will generate duplicate CSNs, which will break replication.

> I ask because it does not seem that I can upgrade FDS to 1.1 on RHEL4, and an
> OS upgrade is quite impossible since this is an important production system.
> I've read that Rich and Nathan made a patch to resolve this, but is there, or
> will there ever be, a binary patch for 1.0 that I can install on RHEL4?

We will probably not make any more RHEL4 binaries available. It's even very difficult for us to provide binaries that can run on RHEL5 (Fedora Core 6). If you want the latest code, you can build it yourself. If you feel so inclined, I can help you with that.

> Many thanks in advance.

Created attachment 314403 [details]
CSN stats for replica 1
Created attachment 314404 [details]
csn stats for replica 2
4-way MMR (RHEL4, RHEL5, HPUX, SOL9), 200,000-user loading test ran for 2:50 minutes - attached results from checking the CSN generator - bug verified.

Created attachment 314765 [details]
new readNsState.py
This new script can handle 32-bit or 64-bit, big endian or little endian - it should be able to read any of these on any system.
Created attachment 314770 [details]
new readNsState.py
Bug in previous version
An advisory has been issued which should help the problem described in this bug report. This report is therefore being closed with a resolution of ERRATA. For more information on the solution and/or where to find the updated files, please follow the link below. You may reopen this bug report if the solution does not work for you.

http://rhn.redhat.com/errata/RHSA-2008-0602.html