Bug 547503

Summary: replication broken again, with 389 MMR replication and TCP errors
Product: [Community] 389 Reporter: John Bryson <john.bryson>
Component: Replication - GeneralAssignee: Rich Megginson <rmeggins>
Status: CLOSED CURRENTRELEASE QA Contact: Viktor Ashirov <vashirov>
Severity: medium Docs Contact:
Priority: low    
Version: 1.2.1CC: amsharma, andrey.ivanov, nhosoi, nkinder
Target Milestone: ---Keywords: VerifiedUpstream
Target Release: ---   
Hardware: x86_64   
OS: Linux   
Whiteboard:
Fixed In Version: Doc Type: Bug Fix
Doc Text:
Story Points: ---
Clone Of: Environment:
Last Closed: 2015-12-07 11:33:31 EST Type: ---
Regression: --- Mount Type: ---
Documentation: --- CRM:
Verified Versions: Category: ---
oVirt Team: --- RHEL 7.3 requirements from Atomic Host:
Bug Depends On:    
Bug Blocks: 543590, 639035    
Attachments:
Description Flags
/var/log/dirsrv/slapd-bellar/error except for jan 21 around 16:0*
none
0001-Bug-547503-replication-broken-again-with-389-MMR.patch nhosoi: review+

Comment 1 Rich Megginson 2009-12-15 12:25:25 EST
What do you see in the supplier and consumer access logs at or around the time of this error message:
[11/Dec/2009:00:06:19 -0500] - PR_Write(133) Netscape Portable Runtime error
-5961 (TCP connection reset by peer.)
?

the number 113 here is the file descriptor number logged in the access log when the connection is made e.g.
[08/Dec/2009:20:00:24 -0700] conn=1 fd=64 slot=64 connection from 127.0.0.1 to 127.0.0.1

Using this, you can find fd=113 in the access log, then find the last operation for that corresponding connection.
Comment 4 John Bryson 2009-12-16 12:56:55 EST
What platform is this?  EL5?  Fedora?  Do you only see the problem on 64-bit
machines?  Have you seen this problem on 32-bit machines?

treilhard.iam.gatech.edu: cat redhat-release
Red Hat Enterprise Linux Server release 5.4 (Tikanga)

yes this is 64bit. No we didnt see it with 32 bit.

More data soon....
Comment 5 Rich Megginson 2010-01-22 19:25:26 EST
Any more data?
Comment 6 John Bryson 2010-01-25 15:41:27 EST
Here is another more recent example. Im attaching 2 new files for an incident that occurred Jan 20 around 16:05. One is the error log and one is the access log.
Comment 10 John Bryson 2010-01-26 16:28:34 EST

What is going on here?        Many SSL connections being opened at about the same
time?

[21/Jan/2010:15:53:29 -0500] conn=1313539 fd=155 slot=155 SSL connection from
130.207.185.166 to 130.207.183.26
...
[21/Jan/2010:15:53:29 -0500] conn=1313589 fd=243 slot=243 SSL connection from
130.207.183.19 to 130.207.183.26

Then again, soon afterwards:

[21/Jan/2010:15:59:41 -0500] conn=1314141 fd=227 slot=227 SSL connection from
130.207.185.163 to 130.207.183.26
...
[21/Jan/2010:15:59:50 -0500] conn=1314169 fd=269 slot=269 SSL connection from
130.207.183.19 to 130.207.183.26


The machines here are email machines and stuff coming in via the bigip load balancers:
130.207.185.166 mail6.gatech.edu
130.207.183.19 bigip2.gted.gatech.edu
Usually when you see ip addresses from bigip1 or bigip2 they are doing health checks on the machine - some simple ldap query. If the health check fails, then we assume the FDS server is busy or broken, so new ldap queries are then sent to alternate machines. We do  health checks often, and there are 4 bigips doing them (an active and standby in each data center). So you will probably see these a lot in the logs.
Comment 11 John Bryson 2010-01-26 16:40:02 EST
You asked:
Up until this point, everything had been going apparently fine for conn=1305921
- then for some reason, it fires off several start replication requests.  I'm
not sure who the supplier is - the first conn=1305921 in the access log is
this:
[21/Jan/2010:15:00:00 -0500] conn=1305921 op=71 EXT
oid="2.16.840.1.113730.3.5.3" name="Netscape Replication Start Session"

The peer IP address is only logged at the time when the connection is
established (see above e.g. conn=1314169 fd=269 slot=269 SSL connection from
130.207.183.19 to 130.207.183.26).

If you know which supplier this is, check the error log from that supplier at
around the time of the burst of start replication requests.

So here is the first part of the log where that connection comes in, an the 
 130.207.183.50 is one of our 2 hubs: bellar.iam.gatech.edu

zuber.iam.gatech.edu: grep 'conn=1305921' access.*
access.20100122-110112:[21/Jan/2010:14:54:20 -0500] conn=1305921 fd=154 slot=154 SSL connection from 130.207.183.50 to 130.207.183.26
access.20100122-110112:[21/Jan/2010:14:54:20 -0500] conn=1305921 op=0 BIND dn="cn=Replication Manager,cn=config" method=128 version=3
access.20100122-110112:[21/Jan/2010:14:54:20 -0500] conn=1305921 op=0 RESULT err=0 tag=97 nentries=0 etime=0 dn="cn=replication manager,cn=config"
access.20100122-110112:[21/Jan/2010:14:54:20 -0500] conn=1305921 op=1 SRCH base="" scope=0 filter="(objectClass=*)" attrs="supportedControl supportedExtension"
access.20100122-110112:[21/Jan/2010:14:54:20 -0500] conn=1305921 op=1 RESULT err=0 tag=101 nentries=1 etime=0
access.20100122-110112:[21/Jan/2010:14:54:20 -0500] conn=1305921 op=2 SRCH base="" scope=0 filter="(objectClass=*)" attrs="supportedControl supportedExtension"
access.20100122-110112:[21/Jan/2010:14:54:20 -0500] conn=1305921 op=2 RESULT err=0 tag=101 nentries=1 etime=0
access.20100122-110112:[21/Jan/2010:14:54:20 -0500] conn=1305921 op=3 EXT oid="2.16.840.1.113730.3.5.3" name="Netscape Replication Start Session"
access.20100122-110112:[21/Jan/2010:14:54:20 -0500] conn=1305921 op=3 RESULT err=0 tag=120 nentries=0 etime=0
access.20100122-110112:[21/Jan/2010:14:54:25 -0500] conn=1305921 op=4 EXT oid="2.16.840.1.113730.3.5.3" name="Netscape Replication Start Session"
Comment 12 John Bryson 2010-01-26 16:44:50 EST
attaching the error log from bellar.iam (the upstream replication provider) for the time around 16:00 on Jan 21
Comment 13 John Bryson 2010-01-26 16:46:22 EST
Created attachment 386926 [details]
/var/log/dirsrv/slapd-bellar/error except for jan 21 around 16:0*

/var/log/dirsrv/slapd-bellar/error except for jan 21 around 16:0*
Comment 14 Rich Megginson 2010-01-26 19:35:34 EST
So is 130.207.183.26 zuber?  From the new logs, there is this:

[21/Jan/2010:16:01:39 -0500] NSMMReplicationPlugin - agmt="cn=accounts bellar
zuber" (zuber:636): Warning: unable to receive endReplication extended
operation response (Bad parameter to an ldap routine)

This corresponds to the access log from zuber:

[21/Jan/2010:16:01:39 -0500] conn=1296532 op=1179 EXT
oid="2.16.840.1.113730.3.5.5" name="Netscape Replication End Session"

Looks like this causes things to go wrong, and then we have more:

[21/Jan/2010:16:02:17 -0500] NSMMReplicationPlugin - agmt="cn=accounts bellar
zuber" (zuber:636): Unable to receive the response for a startReplication
extended operation to consumer (Bad parameter to an ldap routine). Will retry
later.
[21/Jan/2010:16:04:59 -0500] NSMMReplicationPlugin - agmt="cn=accounts bellar
zuber" (zuber:636): Unable to receive the response for a startReplication
extended operation to consumer (Bad parameter to an ldap routine). Will retry
later.
[21/Jan/2010:16:05:16 -0500] NSMMReplicationPlugin - agmt="cn=accounts bellar
zuber" (zuber:636): Unable to receive the response for a startReplication
extended operation to consumer (Bad parameter to an ldap routine). Will retry
later.
[21/Jan/2010:16:05:31 -0500] NSMMReplicationPlugin - agmt="cn=accounts bellar
zuber" (zuber:636): Unable to receive the response for a startReplication
extended operation to consumer (Bad parameter to an ldap routine). Will retry
later.
[21/Jan/2010:16:05:43 -0500] NSMMReplicationPlugin - agmt="cn=accounts bellar
zuber" (zuber:636): Unable to receive the response for a startReplication
extended operation to consumer (Bad parameter to an ldap routine). Will retry
later.
[21/Jan/2010:16:05:43 -0500] NSMMReplicationPlugin - agmt="cn=accounts bellar
zuber" (zuber:636): Unable to receive the response for a startReplication
extended operation to consumer (Bad parameter to an ldap routine). Will retry
later.
[21/Jan/2010:16:06:02 -0500] NSMMReplicationPlugin - agmt="cn=accounts bellar
zuber" (zuber:636): Unable to receive the response for a startReplication
extended operation to consumer (Bad parameter to an ldap routine). Will retry
later.
[21/Jan/2010:16:06:50 -0500] NSMMReplicationPlugin - agmt="cn=accounts bellar
zuber" (zuber:636): Unable to receive the response for a startReplication
extended operation to consumer (Bad parameter to an ldap routine). Will retry
later.
[21/Jan/2010:16:08:26 -0500] NSMMReplicationPlugin - agmt="cn=accounts bellar
zuber" (zuber:636): Replication bind with SIMPLE auth resumed


Interspersed with this are many, many bad replication errors:
[21/Jan/2010:16:00:00 -0500] NSMMReplicationPlugin - agmt="cn=gted bellar
geraldine" (geraldine:636): Consumer failed to replay change (uniqueid
e83f513a-1dd111b2-80b0be51-952a0000, CSN 4b57581a000100380000): Operations
error. Will retry later.
...
[21/Jan/2010:16:07:46 -0500] NSMMReplicationPlugin - agmt="cn=gted bellar
geraldine" (geraldine:636): Consumer failed to replay change (uniqueid
5a204581-1dd211b2-8056be51-952a0000, CSN 4b575f18000100380000): Operations
error. Will retry later.

What's going on with geraldine?  Search the access log on geraldine for
4b57581a000100380000
Comment 15 John Bryson 2010-01-27 10:30:07 EST
Geraldine is a slave that is also replicates from hubs bellar and briggs, just as zuber does.

Geraldine also had issues and was rebuilt Jan 21 (rebuilt about 8pm, but was having issues earlier that day).
 
Im not seeing that 4b57581a000100380000 in the logs on geraldine.
Comment 16 John Bryson 2010-01-27 11:05:26 EST
Geraldine is a slave that is also replicates from hubs bellar and briggs, just as zuber does.

Geraldine also had issues and was rebuilt Jan 21 (rebuilt about 8pm, but was having issues earlier that day).
 
Im not seeing that 4b57581a000100380000 in the logs on geraldine.
Comment 17 Rich Megginson 2010-01-27 11:31:43 EST
Are you still seeing replication errors like Consumer failed to replay change (uniqueid XXX, CSN YYY) in any of the logs?
Comment 18 John Bryson 2010-01-27 13:22:42 EST
Not on geraldine or zuber... or any other slaves so far. 
And I dont see these on briggs.iam, one of the replication hubs.
BUT i see a lot of that in the logs for bellar.iam (one of the replication hubs) for Jan26. These all seem to be going to sneedle.iam, a replication target. And indeed, we did have problems on sneedle and had to rebuild the accounts database on sneedle because if had the same issue that we have been talking about. It was rebuilt last night to fix it. And today the logs are clear of this error msg.

briggs.iam.gatech.edu: grep 'Consumer failed to replay change' /var/log/dirsrv/slapd-*/errors.20100126-001010
(too many to count)

bellar.iam.gatech.edu: head -333 /var/log/dirsrv/slapd-bellar/errors.20100126-001010

bellar.iam.gatech.edu: head -50 /var/log/dirsrv/slapd-bellar/errors.20100126-001010|grep 'Consumer failed to replay change'
[26/Jan/2010:00:10:10 -0500] NSMMReplicationPlugin - agmt="cn=accounts bellar sneedle" (sneedle:636): Consumer failed to replay change (uniqueid b0c85901-ea8a11de-9c29b0cc-9ca992fc, CSN 4b5c5d6d000000390000): Operations error. Will retry later.
[26/Jan/2010:00:10:10 -0500] NSMMReplicationPlugin - agmt="cn=accounts bellar sneedle" (sneedle:636): Consumer failed to replay change (uniqueid b0c85901-ea8a11de-9c29b0cc-9ca992fc, CSN 4b5c5dab000000390000): Operations error. Will retry later.
[26/Jan/2010:00:10:10 -0500] NSMMReplicationPlugin - agmt="cn=accounts bellar sneedle" (sneedle:636): Consumer failed to replay change (uniqueid b0c85901-ea8a11de-9c29b0cc-9ca992fc, CSN 4b5c5de4000000390000): Operations error. Will retry later.
[26/Jan/2010:00:10:10 -0500] NSMMReplicationPlugin - agmt="cn=accounts bellar sneedle" (sneedle:636): Consumer failed to replay change (uniqueid 51dccc0c-9efe11de-8efe8516-22c1043e, CSN 4b5c5e1a000000360000): Operations error. Will retry later.
[26/Jan/2010:00:10:10 -0500] NSMMReplicationPlugin - agmt="cn=accounts bellar sneedle" (sneedle:636): Consumer failed to replay change (uniqueid 51dccc0b-9efe11de-8efe8516-22c1043e, CSN 4b5c5e1a000000390000): Operations error. Will retry later.
[26/Jan/2010:00:10:10 -0500] NSMMReplicationPlugin - agmt="cn=accounts bellar sneedle" (sneedle:636): Consumer failed to replay change (uniqueid 821f7486-c95b11de-8998d861-cd73a362, CSN 4b5c5e1a000c00330000): Operations error. Will retry later.
[26/Jan/2010:00:10:10 -0500] NSMMReplicationPlugin - agmt="cn=accounts bellar sneedle" (sneedle:636): Consumer failed to replay change (uniqueid b0c85901-ea8a11de-9c29b0cc-9ca992fc, CSN 4b5c5e25000000390000): Operations error. Will retry later.
[26/Jan/2010:00:10:10 -0500] NSMMReplicationPlugin - agmt="cn=accounts bellar sneedle" (sneedle:636): Consumer failed to replay change (uniqueid b0c85901-ea8a11de-9c29b0cc-9ca992fc, CSN 4b5c5e5b000000390000): Operations error. Will retry later.
[26/Jan/2010:00:10:10 -0500] NSMMReplicationPlugin - agmt="cn=accounts bellar sneedle" (sneedle:636): Consumer failed to replay change (uniqueid b0c85901-ea8a11de-9c29b0cc-9ca992fc, CSN 4b5c5e9a000000390000): Operations error. Will retry later.
[26/Jan/2010:00:10:10 -0500] NSMMReplicationPlugin - agmt="cn=accounts bellar sneedle" (sneedle:636): Consumer failed to replay change (uniqueid b0c85901-ea8a11de-9c29b0cc-9ca992fc, CSN 4b5c5ed2000000390000): Operations error. Will retry later.
[26/Jan/2010:00:10:10 -0500] NSMMReplicationPlugin - agmt="cn=accounts bellar sneedle" (sneedle:636): Consumer failed to replay change (uniqueid b0c85901-ea8a11de-9c29b0cc-9ca992fc, CSN 4b5c5f10000000390000): Operations error. Will retry later.
[26/Jan/2010:00:10:10 -0500] NSMMReplicationPlugin - agmt="cn=accounts bellar sneedle" (sneedle:636): Consumer failed to replay change (uniqueid 821f7485-c95b11de-8998d861-cd73a362, CSN 4b5c5f46000000330000): Operations error. Will retry later.
[26/Jan/2010:00:10:10 -0500] NSMMReplicationPlugin - agmt="cn=accounts bellar sneedle" (sneedle:636): Consumer failed to replay change (uniqueid b0c85901-ea8a11de-9c29b0cc-9ca992fc, CSN 4b5c5f4f000000390000): Operations error. Will retry later.
[26/Jan/2010:00:10:10 -0500] NSMMReplicationPlugin - agmt="cn=accounts bellar sneedle" (sneedle:636): Consumer failed to replay change (uniqueid 5ad56102-1dd211b2-80b9be51-952a0000, CSN 4b5c5f6a000000360000): Operations error. Will retry later.
[26/Jan/2010:00:10:10 -0500] NSMMReplicationPlugin - agmt="cn=accounts bellar sneedle" (sneedle:636): Consumer failed to replay change (uniqueid b0c85901-ea8a11de-9c29b0cc-9ca992fc, CSN 4b5c5f87000000390000): Operations error. Will retry later.
[26/Jan/2010:00:10:10 -0500] NSMMReplicationPlugin - agmt="cn=accounts bellar sneedle" (sneedle:636): Consumer failed to replay change (uniqueid 5ad56101-1dd211b2-80b9be51-952a0000, CSN 4b5c5f8a000000390000): Operations error. Will retry later.
[26/Jan/2010:00:10:10 -0500] NSMMReplicationPlugin - agmt="cn=accounts bellar sneedle" (sneedle:636): Consumer failed to replay change (uniqueid b1110b01-08f811df-9c29b0cc-9ca992fc, CSN 4b5c5fb8000000390000): Operations error. Will retry later.
[26/Jan/2010:00:10:10 -0500] NSMMReplicationPlugin - agmt="cn=accounts bellar sneedle" (sneedle:636): Consumer failed to replay change (uniqueid b0c85901-ea8a11de-9c29b0cc-9ca992fc, CSN 4b5c5fc6000000390000): Operations error. Will retry later.
[26/Jan/2010:00:10:10 -0500] NSMMReplicationPlugin - agmt="cn=accounts bellar sneedle" (sneedle:636): Consumer failed to replay change (uniqueid b0c85901-ea8a11de-9c29b0cc-9ca992fc, CSN 4b5c6001000000390000): Operations error. Will retry later.
[26/Jan/2010:00:10:10 -0500] NSMMReplicationPlugin - agmt="cn=accounts bellar sneedle" (sneedle:636): Consumer failed to replay change (uniqueid b0c85901-ea8a11de-9c29b0cc-9ca992fc, CSN 4b5c603d000000390000): Operations error. Will retry later.
[26/Jan/2010:00:10:10 -0500] NSMMReplicationPlugin - agmt="cn=accounts bellar sneedle" (sneedle:636): Consumer failed to replay change (uniqueid 51dccc0b-9efe11de-8efe8516-22c1043e, CSN 4b5c6071000000390000): Operations error. Will retry later.
[26/Jan/2010:00:10:10 -0500] NSMMReplicationPlugin - agmt="cn=accounts bellar sneedle" (sneedle:636): Consumer failed to replay change (uniqueid 51dccc0c-9efe11de-8efe8516-22c1043e, CSN 4b5c6072000000360000): Operations error. Will retry later.
[26/Jan/2010:00:10:10 -0500] NSMMReplicationPlugin - agmt="cn=accounts bellar sneedle" (sneedle:636): Consumer failed to replay change (uniqueid 821f7486-c95b11de-8998d861-cd73a362, CSN 4b5c6072000800330000): Operations error. Will retry later.
[26/Jan/2010:00:10:11 -0500] NSMMReplicationPlugin - agmt="cn=accounts bellar sneedle" (sneedle:636): Consumer failed to replay change (uniqueid b0c85901-ea8a11de-9c29b0cc-9ca992fc, CSN 4b5c6083000000390000): Operations error. Will retry later.
[26/Jan/2010:00:10:11 -0500] NSMMReplicationPlugin - agmt="cn=accounts bellar sneedle" (sneedle:636): Consumer failed to replay change (uniqueid b0c85901-ea8a11de-9c29b0cc-9ca992fc, CSN 4b5c60b5000000390000): Operations error. Will retry later.
[26/Jan/2010:00:10:11 -0500] NSMMReplicationPlugin - agmt="cn=accounts bellar sneedle" (sneedle:636): Consumer failed to replay change (uniqueid b0c85901-ea8a11de-9c29b0cc-9ca992fc, CSN 4b5c60f0000000390000): Operations error. Will retry later.
[26/Jan/2010:00:10:11 -0500] NSMMReplicationPlugin - agmt="cn=accounts bellar sneedle" (sneedle:636): Consumer failed to replay change (uniqueid b0c85901-ea8a11de-9c29b0cc-9ca992fc, CSN 4b5c612d000000390000): Operations error. Will retry later.
[26/Jan/2010:00:10:11 -0500] NSMMReplicationPlugin - agmt="cn=accounts bellar sneedle" (sneedle:636): Consumer failed to replay change (uniqueid b0c85901-ea8a11de-9c29b0cc-9ca992fc, CSN 4b5c6168000000390000): Operations error. Will retry later.
[26/Jan/2010:00:10:11 -0500] NSMMReplicationPlugin - agmt="cn=accounts bellar sneedle" (sneedle:636): Consumer failed to replay change (uniqueid 821f7485-c95b11de-8998d861-cd73a362, CSN 4b5c619d000000330000): Operations error. Will retry later.
[26/Jan/2010:00:10:11 -0500] NSMMReplicationPlugin - agmt="cn=accounts bellar sneedle" (sneedle:636): Consumer failed to replay change (uniqueid b0c85901-ea8a11de-9c29b0cc-9ca992fc, CSN 4b5c61a2000000390000): Operations error. Will retry later.
[26/Jan/2010:00:10:11 -0500] NSMMReplicationPlugin - agmt="cn=accounts bellar sneedle" (sneedle:636): Consumer failed to replay change (uniqueid 5ad56102-1dd211b2-80b9be51-952a0000, CSN 4b5c61bd000000360000): Operations error. Will retry later.
[26/Jan/2010:00:10:11 -0500] NSMMReplicationPlugin - agmt="cn=accounts bellar sneedle" (sneedle:636): Consumer failed to replay change (uniqueid afa8acfd-1dd111b2-8001be51-952a0000, CSN 4b5c61bd000d00390000): Operations error. Will retry later.
[26/Jan/2010:00:10:11 -0500] NSMMReplicationPlugin - agmt="cn=accounts bellar sneedle" (sneedle:636): Consumer failed to replay change (uniqueid 5ad56101-1dd211b2-80b9be51-952a0000, CSN 4b5c61dd000000390000): Operations error. Will retry later.
[26/Jan/2010:00:10:11 -0500] NSMMReplicationPlugin - agmt="cn=accounts bellar sneedle" (sneedle:636): Consumer failed to replay change (uniqueid b0c85901-ea8a11de-9c29b0cc-9ca992fc, CSN 4b5c61df000000390000): Operations error. Will retry later.
[26/Jan/2010:00:10:11 -0500] NSMMReplicationPlugin - agmt="cn=accounts bellar sneedle" (sneedle:636): Consumer failed to replay change (uniqueid b0c85901-ea8a11de-9c29b0cc-9ca992fc, CSN 4b5c621b000000390000): Operations error. Will retry later.
[26/Jan/2010:00:10:11 -0500] NSMMReplicationPlugin - agmt="cn=accounts bellar sneedle" (sneedle:636): Consumer failed to replay change (uniqueid 179b0981-07d111df-9c29b0cc-9ca992fc, CSN 4b5c625a000000390000): Operations error. Will retry later.
[26/Jan/2010:00:10:11 -0500] NSMMReplicationPlugin - agmt="cn=accounts bellar sneedle" (sneedle:636): Consumer failed to replay change (uniqueid b0c85901-ea8a11de-9c29b0cc-9ca992fc, CSN 4b5c6264000000390000): Operations error. Will retry later.
[26/Jan/2010:00:10:11 -0500] NSMMReplicationPlugin - agmt="cn=accounts bellar sneedle" (sneedle:636): Consumer failed to replay change (uniqueid b0c85901-ea8a11de-9c29b0cc-9ca992fc, CSN 4b5c6292000000390000): Operations error. Will retry later.
[26/Jan/2010:00:10:11 -0500] NSMMReplicationPlugin - agmt="cn=accounts bellar sneedle" (sneedle:636): Consumer failed to replay change (uniqueid 821f7486-c95b11de-8998d861-cd73a362, CSN 4b5c62ca000000330000): Operations error. Will retry later.
[26/Jan/2010:00:10:11 -0500] NSMMReplicationPlugin - agmt="cn=accounts bellar sneedle" (sneedle:636): Consumer failed to replay change (uniqueid 51dccc0c-9efe11de-8efe8516-22c1043e, CSN 4b5c62ca000000360000): Operations error. Will retry later.
[26/Jan/2010:00:10:11 -0500] NSMMReplicationPlugin - agmt="cn=accounts bellar sneedle" (sneedle:636): Consumer failed to replay change (uniqueid 51dccc0b-9efe11de-8efe8516-22c1043e, CSN 4b5c62ca000000390000): Operations error. Will retry later.
[26/Jan/2010:00:10:11 -0500] NSMMReplicationPlugin - agmt="cn=accounts bellar sneedle" (sneedle:636): Consumer failed to replay change (uniqueid b0c85901-ea8a11de-9c29b0cc-9ca992fc, CSN 4b5c62d3000000390000): Operations error. Will retry later.
[26/Jan/2010:00:10:11 -0500] NSMMReplicationPlugin - agmt="cn=accounts bellar sneedle" (sneedle:636): Consumer failed to replay change (uniqueid b0c85901-ea8a11de-9c29b0cc-9ca992fc, CSN 4b5c630e000000390000): Operations error. Will retry later.
[26/Jan/2010:00:10:11 -0500] NSMMReplicationPlugin - agmt="cn=accounts bellar sneedle" (sneedle:636): Consumer failed to replay change (uniqueid b0c85901-ea8a11de-9c29b0cc-9ca992fc, CSN 4b5c634a000000390000): Operations error. Will retry later.
[26/Jan/2010:00:10:11 -0500] NSMMReplicationPlugin - agmt="cn=accounts bellar sneedle" (sneedle:636): Consumer failed to replay change (uniqueid b0c85901-ea8a11de-9c29b0cc-9ca992fc, CSN 4b5c6389000000390000): Operations error. Will retry later.
[26/Jan/2010:00:10:11 -0500] NSMMReplicationPlugin - agmt="cn=accounts bellar sneedle" (sneedle:636): Consumer failed to replay change (uniqueid b0c85901-ea8a11de-9c29b0cc-9ca992fc, CSN 4b5c63c1000000390000): Operations error. Will retry later.
bellar.iam.gatech.edu:
Comment 19 John Bryson 2010-01-27 13:40:25 EST
So in looking at all of the machine logs today (recall there are 3 masters treilhard/rewbell/gertrude, 2hubs bellar/briggs, 12 slave FDS servers), and a backup machine called fox.  Master replicate to each other, a backup machine called fox, and hubs. Hubs replicate to each other and to 12 slaves each. Slave machines are are targets of replication but dont replicate to others.

I found this error today only on treilhard rewbell and gertrude (all are masters), and bellar (a hub)... None on the other hub briggs or other slaves.

Details:

treilhard.iam.gatech.edu: grep 'Consumer failed to replay change' /var/log/dirsrv/slapd-*/errors
[27/Jan/2010:11:00:08 -0500] NSMMReplicationPlugin - agmt="cn=people treilhard fox" (fox:636): Consumer failed to replay change (uniqueid f7fd2d0b-0abf11df-9c29b0cc-9ca992fc, CSN 4b6062f20001003a0000): Can't contact LDAP server. Will retry later.

rewbell.iam.gatech.edu: [27/Jan/2010:11:00:08 -0500] NSMMReplicationPlugin - agmt="cn=people treilhard fox" (fox:636): Consumer failed to replay change (uniqueid f7fd2d0b-0abf11df-9c29b0cc-9ca992fc, CSN 4b6062f20001003a0000): Can't contact LDAP server. Will retry later.
 
gertrude.iam.gatech.edu: grep 'Consumer failed to replay change' /var/log/dirsrv/slapd-*/errors
[27/Jan/2010:13:00:08 -0500] NSMMReplicationPlugin - agmt="cn=people gertrude fox" (fox:636): Consumer failed to replay change (uniqueid 51dccc08-9efe11de-8efe8516-22c1043e, CSN 4b607f21000000370000): Can't contact LDAP server. Will retry later.

bellar: see above - lots yesterday (26/Jan) but none today (27/Jan/2010)
Comment 20 Rich Megginson 2010-01-27 15:07:11 EST
The error

Can't contact LDAP server. Will retry later.

indicates either that the remote server is down, or that there is some sort of network problem preventing a connection.

The Operations error problems are indicative of something more serious.

Is bellar running the same 389-ds-base version as the other servers?
Comment 21 John Bryson 2010-01-27 15:48:54 EST
Well a quick look shows the all the same, is there some better way to see what version is on each? or are there particular components to  look at the versions on?

$ for i in rewbell gertrude treilhard bellar briggs fox snipps sneedle neufchateau zebs zall zans zuber geraldine barras aaron abigail alice
> do
> echo "$i"; /usr/local/bin/pmrun -h $i.iam.gatech.edu head -2 /var/log/dirsrv/slapd-$i/errors; done
rewbell
        389-Directory/1.2.2 B2009.237.2054
        rewbell.iam.gatech.edu:636 (/etc/dirsrv/slapd-rewbell)
gertrude
        389-Directory/1.2.2 B2009.237.2054
        gertrude.iam.gatech.edu:636 (/etc/dirsrv/slapd-gertrude)
treilhard
        389-Directory/1.2.2 B2009.237.2054
        treilhard.iam.gatech.edu:636 (/etc/dirsrv/slapd-treilhard)
bellar
        389-Directory/1.2.2 B2009.237.2054
        bellar.iam.gatech.edu:636 (/etc/dirsrv/slapd-bellar)
briggs
        389-Directory/1.2.2 B2009.237.2054
        briggs.iam.gatech.edu:636 (/etc/dirsrv/slapd-briggs)
fox
        389-Directory/1.2.2 B2009.237.2054
        fox.iam.gatech.edu:636 (/etc/dirsrv/slapd-fox)
snipps
        389-Directory/1.2.2 B2009.237.2054
        snipps.iam.gatech.edu:636 (/etc/dirsrv/slapd-snipps)
sneedle
        389-Directory/1.2.2 B2009.237.2054
        sneedle.iam.gatech.edu:636 (/etc/dirsrv/slapd-sneedle)
neufchateau
        389-Directory/1.2.2 B2009.237.2054
        neufchateau.iam.gatech.edu:636 (/etc/dirsrv/slapd-neufchateau)
zebs
        389-Directory/1.2.2 B2009.237.2054
        zebs.iam.gatech.edu:636 (/etc/dirsrv/slapd-zebs)
zall
        389-Directory/1.2.2 B2009.237.2054
        zall.iam.gatech.edu:636 (/etc/dirsrv/slapd-zall)
zans
        389-Directory/1.2.2 B2009.237.2054
        zans.iam.gatech.edu:636 (/etc/dirsrv/slapd-zans)
zuber
        389-Directory/1.2.2 B2009.237.2054
        zuber.iam.gatech.edu:636 (/etc/dirsrv/slapd-zuber)
geraldine
        389-Directory/1.2.2 B2009.237.2054
        geraldine.iam.gatech.edu:636 (/etc/dirsrv/slapd-geraldine)
barras
        389-Directory/1.2.2 B2009.237.2054
        barras.iam.gatech.edu:636 (/etc/dirsrv/slapd-barras)
aaron
        389-Directory/1.2.2 B2009.237.2054
        aaron.iam.gatech.edu:636 (/etc/dirsrv/slapd-aaron)
abigail
        389-Directory/1.2.2 B2009.237.2054
        abigail.iam.gatech.edu:636 (/etc/dirsrv/slapd-abigail)
alice
        389-Directory/1.2.2 B2009.237.2054
        alice.iam.gatech.edu:636 (/etc/dirsrv/slapd-alice)
bash-2.05$
Comment 22 Rich Megginson 2010-01-27 16:14:50 EST
The core directory server is the package 389-ds-base.  The best way to find out the version is with rpm:
rpm -qi 389-ds-base

But the error log should have the same version, so that should be fine.  I just wanted to check to see if bellar were perhaps running a very old version that had a similar bug in it, but it does not.
Comment 23 John Bryson 2010-01-27 16:35:56 EST
bash-2.05$ for i in rewbell gertrude treilhard bellar briggs fox snipps sneedle neufchateau zebs zall zans zuber geraldine barras aaron abigail alice ; do echo "$i"; /usr/local/bin/pmrun -h $i.iam.gatech.edu rpm -qi 389-ds-base|head -3; done
rewbell
Name        : 389-ds-base                  Relocations: (not relocatable)
                                                                         Version     : 1.2.2                             Vendor: (none)
                   Release     : 1.el5                         Build Date: Tue Aug 25 16:55:38 2009
                                                                                                   gertrude
Name        : 389-ds-base                  Relocations: (not relocatable)
                                                                         Version     : 1.2.2                             Vendor: (none)
                   Release     : 1.el5                         Build Date: Tue Aug 25 16:55:38 2009
                                                                                                   treilhard
Name        : 389-ds-base                  Relocations: (not relocatable)
                                                                         Version     : 1.2.2                             Vendor: (none)
                   Release     : 1.el5                         Build Date: Tue Aug 25 16:55:38 2009
                                                                                                   bellar
Name        : 389-ds-base                  Relocations: (not relocatable)
                                                                         Version     : 1.2.2                             Vendor: (none)
                   Release     : 1.el5                         Build Date: Tue Aug 25 16:55:38 2009
                                                                                                   briggs
Name        : 389-ds-base                  Relocations: (not relocatable)
                                                                         Version     : 1.2.2                             Vendor: (none)
                   Release     : 1.el5                         Build Date: Tue Aug 25 16:55:38 2009
                                                                                                   fox
Name        : 389-ds-base                  Relocations: (not relocatable)
                                                                         Version     : 1.2.2                             Vendor: (none)
                   Release     : 1.el5                         Build Date: Tue Aug 25 16:55:38 2009
                                                                                                   snipps
Name        : 389-ds-base                  Relocations: (not relocatable)
                                                                         Version     : 1.2.2                             Vendor: (none)
                   Release     : 1.el5                         Build Date: Tue Aug 25 16:55:38 2009
                                                                                                   sneedle
Name        : 389-ds-base                  Relocations: (not relocatable)
                                                                         Version     : 1.2.2                             Vendor: (none)
                   Release     : 1.el5                         Build Date: Tue Aug 25 16:55:38 2009
                                                                                                   neufchateau
Name        : 389-ds-base                  Relocations: (not relocatable)
                                                                         Version     : 1.2.2                             Vendor: (none)
                   Release     : 1.el5                         Build Date: Tue Aug 25 16:55:38 2009
                                                                                                   zebs
Name        : 389-ds-base                  Relocations: (not relocatable)
                                                                         Version     : 1.2.2                             Vendor: (none)
                   Release     : 1.el5                         Build Date: Tue Aug 25 16:55:38 2009
                                                                                                   zall
Name        : 389-ds-base                  Relocations: (not relocatable)
                                                                         Version     : 1.2.2                             Vendor: (none)
                   Release     : 1.el5                         Build Date: Tue Aug 25 16:55:38 2009
                                                                                                   zans
Name        : 389-ds-base                  Relocations: (not relocatable)
                                                                         Version     : 1.2.2                             Vendor: (none)
                   Release     : 1.el5                         Build Date: Tue Aug 25 16:55:38 2009
                                                                                                   zuber
Name        : 389-ds-base                  Relocations: (not relocatable)
                                                                         Version     : 1.2.2                             Vendor: (none)
                   Release     : 1.el5                         Build Date: Tue Aug 25 16:55:38 2009
                                                                                                   geraldine
Name        : 389-ds-base                  Relocations: (not relocatable)
                                                                         Version     : 1.2.2                             Vendor: (none)
                   Release     : 1.el5                         Build Date: Tue Aug 25 16:55:38 2009
                                                                                                   barras
Name        : 389-ds-base                  Relocations: (not relocatable)
                                                                         Version     : 1.2.2                             Vendor: (none)
                   Release     : 1.el5                         Build Date: Tue Aug 25 16:55:38 2009
                                                                                                   aaron
Name        : 389-ds-base                  Relocations: (not relocatable)
                                                                         Version     : 1.2.2                             Vendor: (none)
                   Release     : 1.el5                         Build Date: Tue Aug 25 16:55:38 2009
                                                                                                   abigail
Name        : 389-ds-base                  Relocations: (not relocatable)
                                                                         Version     : 1.2.2                             Vendor: (none)
                   Release     : 1.el5                         Build Date: Tue Aug 25 16:55:38 2009
                                                                                                   alice
Name        : 389-ds-base                  Relocations: (not relocatable)
                                                                         Version     : 1.2.2                             Vendor: (none)
                   Release     : 1.el5                         Build Date: Tue Aug 25 16:55:38 2009
                                                                                                   bash-2.05$
Comment 24 John Bryson 2010-01-28 13:47:45 EST
bash-2.05$ for i in rewbell gertrude treilhard bellar briggs fox snipps sneedle neufchateau zebs zall zans zuber geraldine barras aaron abigail alice ; do echo "$i"; /usr/local/bin/pmrun -h $i.iam.gatech.edu rpm -qi 389-ds-base|head -3; done
rewbell
Name        : 389-ds-base                  Relocations: (not relocatable)
                                                                         Version     : 1.2.2                             Vendor: (none)
                   Release     : 1.el5                         Build Date: Tue Aug 25 16:55:38 2009
                                                                                                   gertrude
Name        : 389-ds-base                  Relocations: (not relocatable)
                                                                         Version     : 1.2.2                             Vendor: (none)
                   Release     : 1.el5                         Build Date: Tue Aug 25 16:55:38 2009
                                                                                                   treilhard
Name        : 389-ds-base                  Relocations: (not relocatable)
                                                                         Version     : 1.2.2                             Vendor: (none)
                   Release     : 1.el5                         Build Date: Tue Aug 25 16:55:38 2009
                                                                                                   bellar
Name        : 389-ds-base                  Relocations: (not relocatable)
                                                                         Version     : 1.2.2                             Vendor: (none)
                   Release     : 1.el5                         Build Date: Tue Aug 25 16:55:38 2009
                                                                                                   briggs
Name        : 389-ds-base                  Relocations: (not relocatable)
                                                                         Version     : 1.2.2                             Vendor: (none)
                   Release     : 1.el5                         Build Date: Tue Aug 25 16:55:38 2009
                                                                                                   fox
Name        : 389-ds-base                  Relocations: (not relocatable)
                                                                         Version     : 1.2.2                             Vendor: (none)
                   Release     : 1.el5                         Build Date: Tue Aug 25 16:55:38 2009
                                                                                                   snipps
Name        : 389-ds-base                  Relocations: (not relocatable)
                                                                         Version     : 1.2.2                             Vendor: (none)
                   Release     : 1.el5                         Build Date: Tue Aug 25 16:55:38 2009
                                                                                                   sneedle
Name        : 389-ds-base                  Relocations: (not relocatable)
                                                                         Version     : 1.2.2                             Vendor: (none)
                   Release     : 1.el5                         Build Date: Tue Aug 25 16:55:38 2009
                                                                                                   neufchateau
Name        : 389-ds-base                  Relocations: (not relocatable)
                                                                         Version     : 1.2.2                             Vendor: (none)
                   Release     : 1.el5                         Build Date: Tue Aug 25 16:55:38 2009
                                                                                                   zebs
Name        : 389-ds-base                  Relocations: (not relocatable)
                                                                         Version     : 1.2.2                             Vendor: (none)
                   Release     : 1.el5                         Build Date: Tue Aug 25 16:55:38 2009
                                                                                                   zall
Name        : 389-ds-base                  Relocations: (not relocatable)
                                                                         Version     : 1.2.2                             Vendor: (none)
                   Release     : 1.el5                         Build Date: Tue Aug 25 16:55:38 2009
                                                                                                   zans
Name        : 389-ds-base                  Relocations: (not relocatable)
                                                                         Version     : 1.2.2                             Vendor: (none)
                   Release     : 1.el5                         Build Date: Tue Aug 25 16:55:38 2009
                                                                                                   zuber
Name        : 389-ds-base                  Relocations: (not relocatable)
                                                                         Version     : 1.2.2                             Vendor: (none)
                   Release     : 1.el5                         Build Date: Tue Aug 25 16:55:38 2009
                                                                                                   geraldine
Name        : 389-ds-base                  Relocations: (not relocatable)
                                                                         Version     : 1.2.2                             Vendor: (none)
                   Release     : 1.el5                         Build Date: Tue Aug 25 16:55:38 2009
                                                                                                   barras
Name        : 389-ds-base                  Relocations: (not relocatable)
                                                                         Version     : 1.2.2                             Vendor: (none)
                   Release     : 1.el5                         Build Date: Tue Aug 25 16:55:38 2009
                                                                                                   aaron
Name        : 389-ds-base                  Relocations: (not relocatable)
                                                                         Version     : 1.2.2                             Vendor: (none)
                   Release     : 1.el5                         Build Date: Tue Aug 25 16:55:38 2009
                                                                                                   abigail
Name        : 389-ds-base                  Relocations: (not relocatable)
                                                                         Version     : 1.2.2                             Vendor: (none)
                   Release     : 1.el5                         Build Date: Tue Aug 25 16:55:38 2009
                                                                                                   alice
Name        : 389-ds-base                  Relocations: (not relocatable)
                                                                         Version     : 1.2.2                             Vendor: (none)
                   Release     : 1.el5                         Build Date: Tue Aug 25 16:55:38 2009
                                                                                                   bash-2.05$
Comment 25 Rich Megginson 2010-01-29 15:46:49 EST
I think this bug is related to https://bugzilla.redhat.com/show_bug.cgi?id=488866 but in a different place.

During replication, something happens to the connection, and the supplier gets this:
Warning: unable to receive endReplication extended
operation response (Bad parameter to an ldap routine)

I think this is yet another symptom of a race condition between the replication sending thread and the replication result reading thread, that somehow leaves the connection in a bad state.
Comment 26 John Bryson 2010-01-31 15:14:41 EST
I have another example, occuring on another machine, right now. If its helpful, we could look at those logs.
Comment 27 John Bryson 2010-02-01 10:17:09 EST
Thanks Rich!

This is an important issue for us since it takes someones time, and downtime to fix it each time. So, it looks like you all are targeting this for v 1.2.6 ?

Any estimate of when that will be out?

Thanks again,
John
Comment 28 Rich Megginson 2010-02-01 10:41:25 EST
1.2.6 should be out by the end of February.  I don't know if we will be able to fix this issue by then.  This problem seems to be very difficult to reproduce, which will make it very difficult to fix and to verify.
Comment 29 Rich Megginson 2010-02-01 10:56:01 EST
(In reply to comment #26)
> I have another example, occuring on another machine, right now. If its helpful,
> we could look at those logs.    

Yes.  What errors are you seeing?
Comment 30 John Bryson 2010-02-01 11:51:21 EST
The machine is neufchateau.
I was seeing lots of this error in the errors log file yesterday. And our load balancer had taken it out of the mix becuase it was failing to return queries in a decent period. That is what usually tips me off to a problem. So I started looking at logs, verified the usual errors were there, like 'ERROR bulk import abandoned' and the fixed it by reimporting the data with ldif2db, like we usually do.

Here are some excerpts from the logs. I can get you whole logs if you like:


        389-Directory/1.2.2 B2009.237.2054
        neufchateau.iam.gatech.edu:636 (/etc/dirsrv/slapd-neufchateau)

[31/Jan/2010:00:00:29 -0500] NSMMReplicationPlugin - replica_replace_ruv_tombstone: failed to update replication update vector for replica ou=people,dc=gted,dc=gatech,dc=edu: LDAP error - 1
[31/Jan/2010:00:00:38 -0500] NSMMReplicationPlugin - replica_replace_ruv_tombstone: failed to update replication update vector for replica ou=people,dc=gted,dc=gatech,dc=edu: LDAP error - 1
:
:
[31/Jan/2010:16:29:15 -0500] NSMMReplicationPlugin - replica_replace_ruv_tombstone: failed to update replication upd
ate vector for replica ou=people,dc=gted,dc=gatech,dc=edu: LDAP error - 1
[31/Jan/2010:16:29:28 -0500] NSMMReplicationPlugin - replica_replace_ruv_tombstone: failed to update replication upd
ate vector for replica ou=people,dc=gted,dc=gatech,dc=edu: LDAP error - 1
[31/Jan/2010:16:29:31 -0500] NSMMReplicationPlugin - replica_replace_ruv_tombstone: failed to update replication upd
ate vector for replica ou=people,dc=gted,dc=gatech,dc=edu: LDAP error - 1
[31/Jan/2010:16:29:36 -0500] NSMMReplicationPlugin - replica_replace_ruv_tombstone: failed to update replication upd
ate vector for replica ou=people,dc=gted,dc=gatech,dc=edu: LDAP error - 1
[31/Jan/2010:16:29:37 -0500] - slapd shutting down - signaling operation threads
[31/Jan/2010:16:29:38 -0500] - slapd shutting down - waiting for 17 threads to terminate
[31/Jan/2010:16:29:38 -0500] - slapd shutting down - closing down internal subsystems and plugins
[31/Jan/2010:16:29:39 -0500] - Waiting for 4 database threads to stop
[31/Jan/2010:16:29:40 -0500] - All database threads now stopped
[31/Jan/2010:16:30:06 -0500] - slapd stopped.

So the previous days log was also full of those errors... then the error log before that has this:

       389-Directory/1.2.2 B2009.237.2054
        neufchateau.iam.gatech.edu:636 (/etc/dirsrv/slapd-neufchateau)

[29/Jan/2010:01:01:48 -0500] - PR_Write(330) Netscape Portable Runtime error -5961 (TCP connection reset by peer.)
[29/Jan/2010:01:01:48 -0500] - PR_Write(333) Netscape Portable Runtime error -5961 (TCP connection reset by peer.)
[29/Jan/2010:02:04:01 -0500] - PR_Write(207) Netscape Portable Runtime error -5961 (TCP connection reset by peer.)
[29/Jan/2010:02:04:01 -0500] - PR_Write(337) Netscape Portable Runtime error -5961 (TCP connection reset by peer.)
:
[29/Jan/2010:13:10:12 -0500] NSMMReplicationPlugin - multimaster_be_state_change: replica ou=people,dc=gted,dc=gatech,dc=edu is going offline; disabling replication
[29/Jan/2010:13:11:10 -0500] - PR_Write(367) Netscape Portable Runtime error -5961 (TCP connection reset by peer.)
[29/Jan/2010:13:11:10 -0500] - PR_Write(347) Netscape Portable Runtime error -5961 (TCP connection reset by peer.)
[29/Jan/2010:13:12:01 -0500] NSMMReplicationPlugin - replica_replace_ruv_tombstone: failed to update replication update vector for replica ou=people,dc=gted,dc=gatech,dc=edu: LDAP error - 1
[29/Jan/2010:13:12:05 -0500] NSMMReplicationPlugin - replica_replace_ruv_tombstone: failed to update replication update vector for replica ou=people,dc=gted,dc=gatech,dc=edu: LDAP error - 1
[29/Jan/2010:13:12:10 -0500] NSMMReplicationPlugin - replica_replace_ruv_tombstone: failed to update replication update vector for replica ou=people,dc=gted,dc=gatech,dc=edu: LDAP error - 1
[29/Jan/2010:13:13:06 -0500] NSMMReplicationPlugin - replica_replace_ruv_tombstone: failed to update replication update vector for replica ou=people,dc=gted,dc=gatech,dc=edu: LDAP error - 1
@
[29/Jan/2010:13:18:15 -0500] NSMMReplicationPlugin - replica_replace_ruv_tombstone: failed to update replication update vector for replica ou=people,dc=gted,dc=gatech,dc=edu: LDAP error - 1
[29/Jan/2010:13:18:32 -0500] NSMMReplicationPlugin - replica_replace_ruv_tombstone: failed to update replication update vector for replica ou=people,dc=gted,dc=gatech,dc=edu: LDAP error - 1
[29/Jan/2010:13:18:39 -0500] NSMMReplicationPlugin - replica_replace_ruv_tombstone: failed to update replication update vector for replica ou=people,dc=gted,dc=gatech,dc=edu: LDAP error - 1
[29/Jan/2010:13:18:44 -0500] - WARNING: Import is running with nsslapd-db-private-import-mem on; No other process is allowed to access the database
[29/Jan/2010:13:18:45 -0500] - ERROR bulk import abandoned
[29/Jan/2010:13:18:45 -0500] - import people: Aborting all import threads...
[29/Jan/2010:13:18:53 -0500] - import people: Import threads aborted.
[29/Jan/2010:13:18:53 -0500] - import people: Closing files...
[29/Jan/2010:13:18:53 -0500] - import people: Import failed.
[29/Jan/2010:13:18:54 -0500] NSMMReplicationPlugin - replica_replace_ruv_tombstone: failed to update replication update vector for replica ou=people,dc=gted,dc=gatech,dc=edu: LDAP error - 1
[29/Jan/2010:13:18:58 -0500] NSMMReplicationPlugin - replica_replace_ruv_tombstone: failed to update replication update vector for replica ou=people,dc=gted,dc=gatech,dc=edu: LDAP error - 1
[29/Jan/2010:13:18:58 -0500] NSMMReplicationPlugin - replica_replace_ruv_tombstone: failed to update replication update vector for replica ou=people,dc=gted,dc=gatech,dc=edu: LDAP error - 1
Comment 31 Rich Megginson 2010-02-01 12:05:01 EST
(In reply to comment #30)
> The machine is neufchateau.
> I was seeing lots of this error in the errors log file yesterday. And our load
> balancer had taken it out of the mix becuase it was failing to return queries
> in a decent period. That is what usually tips me off to a problem. So I started
> looking at logs, verified the usual errors were there, like 'ERROR bulk import
> abandoned' and the fixed it by reimporting the data with ldif2db, like we
> usually do.
> 
> Here are some excerpts from the logs. I can get you whole logs if you like:

neufchateau is the replication consumer in this case?  What was the replication supplier?  I would expect to see errors like this in its error log before this problem begins:
Warning: unable to receive endReplication extended
operation response (Bad parameter to an ldap routine)
Or
Warning: unable to receive add/modify/delete operation response (Bad parameter to an ldap routine)

When you get the below error - is the database in read only mode?  You might be able to fix it by turning off read only.  I believe this error and others are all a result of the error cascade that results from the initial Warning above.

> 
> 
>         389-Directory/1.2.2 B2009.237.2054
>         neufchateau.iam.gatech.edu:636 (/etc/dirsrv/slapd-neufchateau)
> 
> [31/Jan/2010:00:00:29 -0500] NSMMReplicationPlugin -
> replica_replace_ruv_tombstone: failed to update replication update vector for
> replica ou=people,dc=gted,dc=gatech,dc=edu: LDAP error - 1
> [31/Jan/2010:00:00:38 -0500] NSMMReplicationPlugin -
> replica_replace_ruv_tombstone: failed to update replication update vector for
> replica ou=people,dc=gted,dc=gatech,dc=edu: LDAP error - 1
> :
> :
> [31/Jan/2010:16:29:15 -0500] NSMMReplicationPlugin -
> replica_replace_ruv_tombstone: failed to update replication upd
> ate vector for replica ou=people,dc=gted,dc=gatech,dc=edu: LDAP error - 1
> [31/Jan/2010:16:29:28 -0500] NSMMReplicationPlugin -
> replica_replace_ruv_tombstone: failed to update replication upd
> ate vector for replica ou=people,dc=gted,dc=gatech,dc=edu: LDAP error - 1
> [31/Jan/2010:16:29:31 -0500] NSMMReplicationPlugin -
> replica_replace_ruv_tombstone: failed to update replication upd
> ate vector for replica ou=people,dc=gted,dc=gatech,dc=edu: LDAP error - 1
> [31/Jan/2010:16:29:36 -0500] NSMMReplicationPlugin -
> replica_replace_ruv_tombstone: failed to update replication upd
> ate vector for replica ou=people,dc=gted,dc=gatech,dc=edu: LDAP error - 1
> [31/Jan/2010:16:29:37 -0500] - slapd shutting down - signaling operation
> threads
> [31/Jan/2010:16:29:38 -0500] - slapd shutting down - waiting for 17 threads to
> terminate
> [31/Jan/2010:16:29:38 -0500] - slapd shutting down - closing down internal
> subsystems and plugins
> [31/Jan/2010:16:29:39 -0500] - Waiting for 4 database threads to stop
> [31/Jan/2010:16:29:40 -0500] - All database threads now stopped
> [31/Jan/2010:16:30:06 -0500] - slapd stopped.
> 
> So the previous days log was also full of those errors... then the error log
> before that has this:
> 
>        389-Directory/1.2.2 B2009.237.2054
>         neufchateau.iam.gatech.edu:636 (/etc/dirsrv/slapd-neufchateau)
> 
> [29/Jan/2010:01:01:48 -0500] - PR_Write(330) Netscape Portable Runtime error
> -5961 (TCP connection reset by peer.)
> [29/Jan/2010:01:01:48 -0500] - PR_Write(333) Netscape Portable Runtime error
> -5961 (TCP connection reset by peer.)
> [29/Jan/2010:02:04:01 -0500] - PR_Write(207) Netscape Portable Runtime error
> -5961 (TCP connection reset by peer.)
> [29/Jan/2010:02:04:01 -0500] - PR_Write(337) Netscape Portable Runtime error
> -5961 (TCP connection reset by peer.)
> :
> [29/Jan/2010:13:10:12 -0500] NSMMReplicationPlugin -
> multimaster_be_state_change: replica ou=people,dc=gted,dc=gatech,dc=edu is
> going offline; disabling replication
> [29/Jan/2010:13:11:10 -0500] - PR_Write(367) Netscape Portable Runtime error
> -5961 (TCP connection reset by peer.)
> [29/Jan/2010:13:11:10 -0500] - PR_Write(347) Netscape Portable Runtime error
> -5961 (TCP connection reset by peer.)
> [29/Jan/2010:13:12:01 -0500] NSMMReplicationPlugin -
> replica_replace_ruv_tombstone: failed to update replication update vector for
> replica ou=people,dc=gted,dc=gatech,dc=edu: LDAP error - 1
> [29/Jan/2010:13:12:05 -0500] NSMMReplicationPlugin -
> replica_replace_ruv_tombstone: failed to update replication update vector for
> replica ou=people,dc=gted,dc=gatech,dc=edu: LDAP error - 1
> [29/Jan/2010:13:12:10 -0500] NSMMReplicationPlugin -
> replica_replace_ruv_tombstone: failed to update replication update vector for
> replica ou=people,dc=gted,dc=gatech,dc=edu: LDAP error - 1
> [29/Jan/2010:13:13:06 -0500] NSMMReplicationPlugin -
> replica_replace_ruv_tombstone: failed to update replication update vector for
> replica ou=people,dc=gted,dc=gatech,dc=edu: LDAP error - 1
> @
> [29/Jan/2010:13:18:15 -0500] NSMMReplicationPlugin -
> replica_replace_ruv_tombstone: failed to update replication update vector for
> replica ou=people,dc=gted,dc=gatech,dc=edu: LDAP error - 1
> [29/Jan/2010:13:18:32 -0500] NSMMReplicationPlugin -
> replica_replace_ruv_tombstone: failed to update replication update vector for
> replica ou=people,dc=gted,dc=gatech,dc=edu: LDAP error - 1
> [29/Jan/2010:13:18:39 -0500] NSMMReplicationPlugin -
> replica_replace_ruv_tombstone: failed to update replication update vector for
> replica ou=people,dc=gted,dc=gatech,dc=edu: LDAP error - 1
> [29/Jan/2010:13:18:44 -0500] - WARNING: Import is running with
> nsslapd-db-private-import-mem on; No other process is allowed to access the
> database
> [29/Jan/2010:13:18:45 -0500] - ERROR bulk import abandoned
> [29/Jan/2010:13:18:45 -0500] - import people: Aborting all import threads...
> [29/Jan/2010:13:18:53 -0500] - import people: Import threads aborted.
> [29/Jan/2010:13:18:53 -0500] - import people: Closing files...
> [29/Jan/2010:13:18:53 -0500] - import people: Import failed.
> [29/Jan/2010:13:18:54 -0500] NSMMReplicationPlugin -
> replica_replace_ruv_tombstone: failed to update replication update vector for
> replica ou=people,dc=gted,dc=gatech,dc=edu: LDAP error - 1
> [29/Jan/2010:13:18:58 -0500] NSMMReplicationPlugin -
> replica_replace_ruv_tombstone: failed to update replication update vector for
> replica ou=people,dc=gted,dc=gatech,dc=edu: LDAP error - 1
> [29/Jan/2010:13:18:58 -0500] NSMMReplicationPlugin -
> replica_replace_ruv_tombstone: failed to update replication update vector for
> replica ou=people,dc=gted,dc=gatech,dc=edu: LDAP error - 1
Comment 32 John Bryson 2010-02-01 13:55:16 EST
>neufchateau is the replication consumer in this case? 
Yes

> What was the replication supplier?  
It would be one of the hubs - bellar or briggs. I will go check.


>I would expect to see errors like this in its error log before this
>problem begins:
>Warning: unable to receive endReplication extended
>operation response (Bad parameter to an ldap routine)
>Or
>Warning: unable to receive add/modify/delete operation response (Bad parameter
>to an ldap routine)

I am not seeing any of those errors.

>When you get the below error - is the database in read only mode?  You might be
>able to fix it by turning off read only.  I believe this error and others are
>all a result of the error cascade that results from the initial Warning above.

Hmm. Im not sure if it is readonly or not. I'll have to check the next time that it happens.
Comment 33 Rich Megginson 2010-02-01 14:08:36 EST
(In reply to comment #32)
> >I would expect to see errors like this in its error log before this
> >problem begins:
> >Warning: unable to receive endReplication extended
> >operation response (Bad parameter to an ldap routine)
> >Or
> >Warning: unable to receive add/modify/delete operation response (Bad parameter
> >to an ldap routine)
> 
> I am not seeing any of those errors.

I would expect to see them in the supplier error log.
Comment 34 John Bryson 2010-02-01 14:59:29 EST
yes. I see thousands of these errors on bellar  from 13:08 to 13:59 :

[29/Jan/2010:13:08:46 -0500] NSMMReplicationPlugin - agmt="cn=gted bellar neufchateau" (neufchateau:636): Warning: unable to receive endReplication extended operation response (Bad parameter to an ldap routine)
[29/Jan/2010:13:09:43 -0500] NSMMReplicationPlugin - agmt="cn=gted bellar neufchateau" (neufchateau:636): Unable to receive the response for a startReplication extended operation to consumer (Bad parameter to an ldap routine). Will retry later.
[29/Jan/2010:13:09:57 -0500] NSMMReplicationPlugin - agmt="cn=gted bellar neufchateau" (neufchateau:636): Incremental update failed and requires administrator action
[29/Jan/2010:13:10:15 -0500] NSMMReplicationPlugin - agmt="cn=people bellar neufchateau" (neufchateau:636): Consumer failed to replay change (uniqueid a229750d-1dd111b2-805cbe51-952a0000, CSN 4b63227b0002003a0000): Operations error. Will retry later.
[29/Jan/2010:13:10:15 -0500] NSMMReplicationPlugin - agmt="cn=people bellar neufchateau" (neufchateau:636): Consumer failed to replay change (uniqueid a229750d-1dd111b2-805cbe51-952a0000, CSN 4b63227b0003003a0000): Operations error. Will retry later.
[29/Jan/2010:13:10:17 -0500] NSMMReplicationPlugin - agmt="cn=people bellar neufchateau" (neufchateau:636): Consumer failed to replay change (uniqueid 0c973e01-c45b11de-8efe8516-22c1043e, CSN 4b63227c0000003a0000): Operations error. Will retry later.
[29/Jan/2010:13:10:17 -0500] NSMMReplicationPlugin - agmt="cn=people bellar neufchateau" (neufchateau:636): Consumer failed to replay change (uniqueid 0c973e01-c45b11de-8efe8516-22c1043e, CSN 4b63227c0001003a0000): Operations error. Will retry later.
[29/Jan/2010:13:10:21 -0500] NSMMReplicationPlugin - agmt="cn=people bellar neufchateau" (neufchateau:636): Consumer failed to replay change (uniqueid fb2fe485-1dd111b2-804cbe51-952a0000, CSN 4b63227c0002003a0000): Operations error. Will retry later.
[29/Jan/2010:13:10:22 -0500] NSMMReplicationPlugin - agmt="cn=people bellar neufchateau" (neufchateau:636): Consumer failed to replay change (uniqueid fb2fe485-1dd111b2-804cbe51-952a0000, CSN 4b63227c0003003a0000): Operations error. Will retry later.
[29/Jan/2010:13:11:11 -0500] NSMMReplicationPlugin - agmt="cn=people bellar neufchateau" (neufchateau:636): Consumer failed to replay change (uniqueid dd478d01-09c711df-9c29b0cc-9ca992fc, CSN 4b63227c0004003a0000): Operations error. Will retry later.
:
[29/Jan/2010:13:59:56 -0500] NSMMReplicationPlugin - agmt="cn=people bellar neufchateau" (neufchateau:636): Consumer failed to replay change (uniqueid ebd53b18-fc8a11de-9c29b0cc-9ca992fc, CSN 4b63262a0004003a0000): Operations error. Will retry later.
Comment 35 Rich Megginson 2010-02-01 15:22:28 EST
Do you see any errors before

[29/Jan/2010:13:08:46 -0500] NSMMReplicationPlugin - agmt="cn=gted bellar
neufchateau" (neufchateau:636): Warning: unable to receive endReplication
extended operation response (Bad parameter to an ldap routine)

?
Comment 36 John Bryson 2010-02-02 13:16:42 EST
Sorry about the delay. On bellar I see this in one log:

bellar.iam.gatech.edu: head -40 errors.20100129-130652
        389-Directory/1.2.2 B2009.237.2054
        bellar.iam.gatech.edu:636 (/etc/dirsrv/slapd-bellar)

[29/Jan/2010:13:06:52 -0500] - repl5_inc_waitfor_async_results timed out waiting for responses: 365 367
[29/Jan/2010:13:08:46 -0500] NSMMReplicationPlugin - agmt="cn=gted bellar neufchateau" (neufchateau:636): Warning: unable to receive endReplication extended operation response (Bad parameter to an ldap routine)
[29/Jan/2010:13:09:43 -0500] NSMMReplicationPlugin - agmt="cn=gted bellar neufchateau" (neufchateau:636): Unable to receive the response for a startReplication extended operation to consumer (Bad parameter to an ldap routine). Will retry later.

This in the log before that:

bellar.iam.gatech.edu: cat errors.20100128-164147
        389-Directory/1.2.2 B2009.237.2054
        bellar.iam.gatech.edu:636 (/etc/dirsrv/slapd-bellar)

[28/Jan/2010:16:41:47 -0500] - PR_Write(136) Netscape Portable Runtime error -5961 (TCP connection reset by peer.)

This in the log before that:

bellar.iam.gatech.edu: cat errors.20100127-131309
        389-Directory/1.2.2 B2009.237.2054
        bellar.iam.gatech.edu:636 (/etc/dirsrv/slapd-bellar)

[27/Jan/2010:13:13:09 -0500] - repl5_inc_waitfor_async_results timed out waiting for responses: 437 466
[27/Jan/2010:13:13:46 -0500] NSMMReplicationPlugin - agmt="cn=gted bellar neufchateau" (neufchateau:636): Warning: unable to receive endReplication extended operation response (Bad parameter to an ldap routine)
[27/Jan/2010:13:13:58 -0500] NSMMReplicationPlugin - agmt="cn=gted bellar neufchateau" (neufchateau:636): Unable to receive the response for a startReplication extended operation to consumer (Bad parameter to an ldap routine). Will retry later.
[27/Jan/2010:13:14:14 -0500] NSMMReplicationPlugin - agmt="cn=gted bellar neufchateau" (neufchateau:636): Unable to receive the response for a startReplication extended operation to consumer (Bad parameter to an ldap routine). Will retry later.
[27/Jan/2010:13:14:18 -0500] NSMMReplicationPlugin - agmt="cn=gted bellar neufchateau" (neufchateau:636): Unable to receive the response for a startReplication extended operation to consumer (Bad parameter to an ldap routine). Will retry later.
[27/Jan/2010:13:14:41 -0500] NSMMReplicationPlugin - agmt="cn=gted bellar neufchateau" (neufchateau:636): Unable to receive the response for a startReplication extended operation to consumer (Bad parameter to an ldap routine). Will retry later.
[27/Jan/2010:13:14:47 -0500] NSMMReplicationPlugin - agmt="cn=gted bellar neufchateau" (neufchateau:636): Unable to receive the response for a startReplication extended operation to consumer (Bad parameter to an ldap routine). Will retry later.
[27/Jan/2010:13:15:32 -0500] NSMMReplicationPlugin - agmt="cn=gted bellar neufchateau" (neufchateau:636): Unable to receive the response for a startReplication extended operation to consumer (Bad parameter to an ldap routine). Will retry later.
[27/Jan/2010:13:17:26 -0500] NSMMReplicationPlugin - agmt="cn=gted bellar neufchateau" (neufchateau:636): Replication bind with SIMPLE auth resumed
[27/Jan/2010:13:27:04 -0500] - repl5_inc_waitfor_async_results timed out waiting for responses: 699 850
[27/Jan/2010:13:27:04 -0500] NSMMReplicationPlugin - agmt="cn=accounts bellar neufchateau" (neufchateau:636): Warning: unable to receive endReplication extended operation response (Bad parameter to an ldap routine)
[27/Jan/2010:13:27:14 -0500] NSMMReplicationPlugin - agmt="cn=accounts bellar neufchateau" (neufchateau:636): Unable to receive the response for a startReplication extended operation to consumer (Bad parameter to an ldap routine). Will retry later.
[27/Jan/2010:13:27:18 -0500] NSMMReplicationPlugin - agmt="cn=accounts bellar neufchateau" (neufchateau:636): Unable to receive the response for a startReplication extended operation to consumer (Bad parameter to an ldap routine). Will retry later.
[27/Jan/2010:13:27:24 -0500] NSMMReplicationPlugin - agmt="cn=accounts bellar neufchateau" (neufchateau:636): Unable to receive the response for a startReplication extended operation to consumer (Bad parameter to an ldap routine). Will retry later.
[27/Jan/2010:13:27:36 -0500] NSMMReplicationPlugin - agmt="cn=accounts bellar neufchateau" (neufchateau:636): Unable to receive the response for a startReplication extended operation to consumer (Bad parameter to an ldap routine). Will retry later.
[27/Jan/2010:13:28:00 -0500] NSMMReplicationPlugin - agmt="cn=accounts bellar neufchateau" (neufchateau:636): Unable to receive the response for a startReplication extended operation to consumer (Bad parameter to an ldap routine). Will retry later.
[27/Jan/2010:13:28:48 -0500] NSMMReplicationPlugin - agmt="cn=accounts bellar neufchateau" (neufchateau:636): Unable to receive the response for a startReplication extended operation to consumer (Bad parameter to an ldap routine). Will retry later.
[27/Jan/2010:13:30:24 -0500] NSMMReplicationPlugin - agmt="cn=accounts bellar neufchateau" (neufchateau:636): Replication bind with SIMPLE auth resumed
[27/Jan/2010:22:49:43 -0500] NSMMReplicationPlugin - agmt="cn=gted bellar barras" (barras:636): Unable to receive the response for a startReplication extended operation to consumer (Can't contact LDAP server). Will retry later.
[27/Jan/2010:22:49:44 -0500] NSMMReplicationPlugin - agmt="cn=people bellar barras" (barras:636): Unable to receive the response for a startReplication extended operation to consumer (Can't contact LDAP server). Will retry later.
[27/Jan/2010:22:49:44 -0500] slapi_ldap_bind - Error: could not send bind request for id [cn=Replication Manager,cn=config] mech [SIMPLE]: error 91 (Can't connect to the LDAP server) -5961 (TCP connection reset by peer.) 115 (Operation now in progress)
[27/Jan/2010:22:49:44 -0500] NSMMReplicationPlugin - agmt="cn=people bellar barras" (barras:636): Replication bind with SIMPLE auth failed: LDAP error 91 (Can't connect to the LDAP server) ((null))
[27/Jan/2010:22:49:44 -0500] NSMMReplicationPlugin - agmt="cn=accounts bellar barras" (barras:636): Unable to receive the response for a startReplication extended operation to consumer (Can't contact LDAP server). Will retry later.
[27/Jan/2010:22:49:47 -0500] slapi_ldap_bind - Error: could not send bind request for id [cn=Replication Manager,cn=config] mech [SIMPLE]: error 91 (Can't connect to the LDAP server) -5961 (TCP connection reset by peer.) 115 (Operation now in progress)
[27/Jan/2010:22:49:47 -0500] NSMMReplicationPlugin - agmt="cn=gted bellar barras" (barras:636): Replication bind with SIMPLE auth failed: LDAP error 91 (Can't connect to the LDAP server) ((null))
[27/Jan/2010:22:49:48 -0500] slapi_ldap_bind - Error: could not send bind request for id [cn=Replication Manager,cn=config] mech [SIMPLE]: error 91 (Can't connect to the LDAP server) -5961 (TCP connection reset by peer.) 115 (Operation now in progress)
[27/Jan/2010:22:49:48 -0500] NSMMReplicationPlugin - agmt="cn=accounts bellar barras" (barras:636): Replication bind with SIMPLE auth failed: LDAP error 91 (Can't connect to the LDAP server) ((null))
[27/Jan/2010:22:49:49 -0500] slapi_ldap_bind - Error: could not send bind request for id [cn=Replication Manager,cn=config] mech [SIMPLE]: error 91 (Can't connect to the LDAP server) -5961 (TCP connection reset by peer.) 115 (Operation now in progress)
[27/Jan/2010:22:49:53 -0500] slapi_ldap_bind - Error: could not send bind request for id [cn=Replication Manager,cn=config] mech [SIMPLE]: error 91 (Can't connect to the LDAP server) -5961 (TCP connection reset by peer.) 115 (Operation now in progress)
[27/Jan/2010:22:49:54 -0500] slapi_ldap_bind - Error: could not send bind request for id [cn=Replication Manager,cn=config] mech [SIMPLE]: error 91 (Can't connect to the LDAP server) -5961 (TCP connection reset by peer.) 115 (Operation now in progress)
[27/Jan/2010:22:49:54 -0500] slapi_ldap_bind - Error: could not send bind request for id [cn=Replication Manager,cn=config] mech [SIMPLE]: error 91 (Can't connect to the LDAP server) -5961 (TCP connection reset by peer.) 115 (Operation now in progress)
[27/Jan/2010:22:49:59 -0500] slapi_ldap_bind - Error: could not send bind request for id [cn=Replication Manager,cn=config] mech [SIMPLE]: error 91 (Can't connect to the LDAP server) -5961 (TCP connection reset by peer.) 115 (Operation now in progress)
[27/Jan/2010:22:50:04 -0500] slapi_ldap_bind - Error: could not send bind request for id [cn=Replication Manager,cn=config] mech [SIMPLE]: error 91 (Can't connect to the LDAP server) -5961 (TCP connection reset by peer.) 115 (Operation now in progress)
[27/Jan/2010:22:50:05 -0500] slapi_ldap_bind - Error: could not send bind request for id [cn=Replication Manager,cn=config] mech [SIMPLE]: error 91 (Can't connect to the LDAP server) -5961 (TCP connection reset by peer.) 115 (Operation now in progress)
[27/Jan/2010:22:50:06 -0500] slapi_ldap_bind - Error: could not send bind request for id [cn=Replication Manager,cn=config] mech [SIMPLE]: error 91 (Can't connect to the LDAP server) -5961 (TCP connection reset by peer.) 115 (Operation now in progress)
[27/Jan/2010:22:50:09 -0500] slapi_ldap_bind - Error: could not send bind request for id [cn=Replication Manager,cn=config] mech [SIMPLE]: error 91 (Can't connect to the LDAP server) -5961 (TCP connection reset by peer.) 115 (Operation now in progress)
[27/Jan/2010:22:50:14 -0500] slapi_ldap_bind - Error: could not send bind request for id [cn=Replication Manager,cn=config] mech [SIMPLE]: error 91 (Can't connect to the LDAP server) -5961 (TCP connection reset by peer.) 115 (Operation now in progress)
[27/Jan/2010:22:50:19 -0500] slapi_ldap_bind - Error: could not send bind request for id [cn=Replication Manager,cn=config] mech [SIMPLE]: error 91 (Can't connect to the LDAP server) -5961 (TCP connection reset by peer.) 115 (Operation now in progress)
[27/Jan/2010:22:50:24 -0500] slapi_ldap_bind - Error: could not send bind request for id [cn=Replication Manager,cn=config] mech [SIMPLE]: error 91 (Can't connect to the LDAP server) -5961 (TCP connection reset by peer.) 115 (Operation now in progress)
[27/Jan/2010:22:50:29 -0500] slapi_ldap_bind - Error: could not send bind request for id [cn=Replication Manager,cn=config] mech [SIMPLE]: error 91 (Can't connect to the LDAP server) -5961 (TCP connection reset by peer.) 115 (Operation now in progress)
[27/Jan/2010:22:50:29 -0500] slapi_ldap_bind - Error: could not send bind request for id [cn=Replication Manager,cn=config] mech [SIMPLE]: error 91 (Can't connect to the LDAP server) -5961 (TCP connection reset by peer.) 115 (Operation now in progress)
[27/Jan/2010:22:50:30 -0500] slapi_ldap_bind - Error: could not send bind request for id [cn=Replication Manager,cn=config] mech [SIMPLE]: error 91 (Can't connect to the LDAP server) -5961 (TCP connection reset by peer.) 115 (Operation now in progress)
[27/Jan/2010:22:50:34 -0500] slapi_ldap_bind - Error: could not send bind request for id [cn=Replication Manager,cn=config] mech [SIMPLE]: error 91 (Can't connect to the LDAP server) -5961 (TCP connection reset by peer.) 115 (Operation now in progress)
[27/Jan/2010:22:50:39 -0500] slapi_ldap_bind - Error: could not send bind request for id [cn=Replication Manager,cn=config] mech [SIMPLE]: error 91 (Can't connect to the LDAP server) -5961 (TCP connection reset by peer.) 115 (Operation now in progress)
[27/Jan/2010:22:50:44 -0500] slapi_ldap_bind - Error: could not send bind request for id [cn=Replication Manager,cn=config] mech [SIMPLE]: error 91 (Can't connect to the LDAP server) -5961 (TCP connection reset by peer.) 115 (Operation now in progress)
[27/Jan/2010:22:50:49 -0500] slapi_ldap_bind - Error: could not send bind request for id [cn=Replication Manager,cn=config] mech [SIMPLE]: error 91 (Can't connect to the LDAP server) -5961 (TCP connection reset by peer.) 115 (Operation now in progress)
[27/Jan/2010:22:50:54 -0500] slapi_ldap_bind - Error: could not send bind request for id [cn=Replication Manager,cn=config] mech [SIMPLE]: error 91 (Can't connect to the LDAP server) -5961 (TCP connection reset by peer.) 115 (Operation now in progress)
[27/Jan/2010:22:51:00 -0500] NSMMReplicationPlugin - agmt="cn=people bellar barras" (barras:636): Replication bind with SIMPLE auth resumed
[27/Jan/2010:22:51:17 -0500] NSMMReplicationPlugin - agmt="cn=gted bellar barras" (barras:636): Replication bind with SIMPLE auth resumed
[27/Jan/2010:22:51:18 -0500] NSMMReplicationPlugin - agmt="cn=accounts bellar barras" (barras:636): Replication bind with SIMPLE auth resumed
[27/Jan/2010:22:51:58 -0500] NSMMReplicationPlugin - agmt="cn=accounts bellar geraldine" (geraldine:636): Warning: unable to receive endReplication extended operation response (Can't contact LDAP server)
[27/Jan/2010:22:51:59 -0500] NSMMReplicationPlugin - agmt="cn=people bellar geraldine" (geraldine:636): Unable to receive the response for a startReplication extended operation to consumer (Can't contact LDAP server). Will retry later.
[27/Jan/2010:22:52:01 -0500] NSMMReplicationPlugin - agmt="cn=gted bellar geraldine" (geraldine:636): Unable to receive the response for a startReplication extended operation to consumer (Can't contact LDAP server). Will retry later.
[27/Jan/2010:22:52:02 -0500] slapi_ldap_bind - Error: could not send bind request for id [cn=Replication Manager,cn=config] mech [SIMPLE]: error 91 (Can't connect to the LDAP server) -5961 (TCP connection reset by peer.) 115 (Operation now in progress)
[27/Jan/2010:22:52:02 -0500] NSMMReplicationPlugin - agmt="cn=people bellar geraldine" (geraldine:636): Replication bind with SIMPLE auth failed: LDAP error 91 (Can't connect to the LDAP server) ((null))
[27/Jan/2010:22:52:05 -0500] slapi_ldap_bind - Error: could not send bind request for id [cn=Replication Manager,cn=config] mech [SIMPLE]: error 91 (Can't connect to the LDAP server) -5961 (TCP connection reset by peer.) 115 (Operation now in progress)
[27/Jan/2010:22:52:05 -0500] NSMMReplicationPlugin - agmt="cn=gted bellar geraldine" (geraldine:636): Replication bind with SIMPLE auth failed: LDAP error 91 (Can't connect to the LDAP server) ((null))
[27/Jan/2010:22:52:08 -0500] slapi_ldap_bind - Error: could not send bind request for id [cn=Replication Manager,cn=config] mech [SIMPLE]: error 91 (Can't connect to the LDAP server) -5961 (TCP connection reset by peer.) 115 (Operation now in progress)
[27/Jan/2010:22:52:08 -0500] slapi_ldap_bind - Error: could not send bind request for id [cn=Replication Manager,cn=config] mech [SIMPLE]: error 91 (Can't connect to the LDAP server) -5961 (TCP connection reset by peer.) 115 (Operation now in progress)
[27/Jan/2010:22:52:08 -0500] NSMMReplicationPlugin - agmt="cn=accounts bellar geraldine" (geraldine:636): Replication bind with SIMPLE auth failed: LDAP error 91 (Can't connect to the LDAP server) ((null))
[27/Jan/2010:22:52:11 -0500] slapi_ldap_bind - Error: could not send bind request for id [cn=Replication Manager,cn=config] mech [SIMPLE]: error 91 (Can't connect to the LDAP server) -5961 (TCP connection reset by peer.) 115 (Operation now in progress)
[27/Jan/2010:22:52:12 -0500] NSMMReplicationPlugin - agmt="cn=accounts bellar geraldine" (geraldine:636): Replication bind with SIMPLE auth resumed
[27/Jan/2010:22:52:20 -0500] NSMMReplicationPlugin - agmt="cn=people bellar geraldine" (geraldine:636): Replication bind with SIMPLE auth resumed
[27/Jan/2010:22:52:23 -0500] NSMMReplicationPlugin - agmt="cn=gted bellar geraldine" (geraldine:636): Replication bind with SIMPLE auth resumed
[27/Jan/2010:22:53:07 -0500] NSMMReplicationPlugin - agmt="cn=accounts bellar sneedle" (sneedle:636): Unable to receive the response for a startReplication extended operation to consumer (Can't contact LDAP server). Will retry later.
[27/Jan/2010:22:53:07 -0500] NSMMReplicationPlugin - agmt="cn=gted bellar sneedle" (sneedle:636): Warning: unable to receive endReplication extended operation response (Can't contact LDAP server)
[27/Jan/2010:22:53:11 -0500] slapi_ldap_bind - Error: could not send bind request for id [cn=Replication Manager,cn=config] mech [SIMPLE]: error 91 (Can't connect to the LDAP server) -5961 (TCP connection reset by peer.) 115 (Operation now in progress)
[27/Jan/2010:22:53:11 -0500] NSMMReplicationPlugin - agmt="cn=accounts bellar sneedle" (sneedle:636): Replication bind with SIMPLE auth failed: LDAP error 91 (Can't connect to the LDAP server) ((null))
[27/Jan/2010:22:53:11 -0500] NSMMReplicationPlugin - agmt="cn=people bellar sneedle" (sneedle:636): Unable to receive the response for a startReplication extended operation to consumer (Can't contact LDAP server). Will retry later.
[27/Jan/2010:22:53:15 -0500] slapi_ldap_bind - Error: could not send bind request for id [cn=Replication Manager,cn=config] mech [SIMPLE]: error 91 (Can't connect to the LDAP server) -5961 (TCP connection reset by peer.) 115 (Operation now in progress)
[27/Jan/2010:22:53:15 -0500] NSMMReplicationPlugin - agmt="cn=people bellar sneedle" (sneedle:636): Replication bind with SIMPLE auth failed: LDAP error 91 (Can't connect to the LDAP server) ((null))
[27/Jan/2010:22:53:17 -0500] slapi_ldap_bind - Error: could not send bind request for id [cn=Replication Manager,cn=config] mech [SIMPLE]: error 91 (Can't connect to the LDAP server) -5961 (TCP connection reset by peer.) 115 (Operation now in progress)
[27/Jan/2010:22:53:17 -0500] NSMMReplicationPlugin - agmt="cn=gted bellar sneedle" (sneedle:636): Replication bind with SIMPLE auth failed: LDAP error 91 (Can't connect to the LDAP server) ((null))
[27/Jan/2010:22:53:17 -0500] slapi_ldap_bind - Error: could not send bind request for id [cn=Replication Manager,cn=config] mech [SIMPLE]: error 91 (Can't connect to the LDAP server) -5961 (TCP connection reset by peer.) 115 (Operation now in progress)
[27/Jan/2010:22:53:21 -0500] slapi_ldap_bind - Error: could not send bind request for id [cn=Replication Manager,cn=config] mech [SIMPLE]: error 91 (Can't connect to the LDAP server) -5961 (TCP connection reset by peer.) 115 (Operation now in progress)
[27/Jan/2010:22:53:21 -0500] slapi_ldap_bind - Error: could not send bind request for id [cn=Replication Manager,cn=config] mech [SIMPLE]: error 91 (Can't connect to the LDAP server) -5961 (TCP connection reset by peer.) 115 (Operation now in progress)
[27/Jan/2010:22:53:27 -0500] slapi_ldap_bind - Error: could not send bind request for id [cn=Replication Manager,cn=config] mech [SIMPLE]: error 91 (Can't connect to the LDAP server) -5961 (TCP connection reset by peer.) 115 (Operation now in progress)
[27/Jan/2010:22:53:29 -0500] slapi_ldap_bind - Error: could not send bind request for id [cn=Replication Manager,cn=config] mech [SIMPLE]: error 91 (Can't connect to the LDAP server) -5961 (TCP connection reset by peer.) 115 (Operation now in progress)
[27/Jan/2010:22:53:33 -0500] slapi_ldap_bind - Error: could not send bind request for id [cn=Replication Manager,cn=config] mech [SIMPLE]: error 91 (Can't connect to the LDAP server) -5961 (TCP connection reset by peer.) 115 (Operation now in progress)
[27/Jan/2010:22:53:39 -0500] slapi_ldap_bind - Error: could not send bind request for id [cn=Replication Manager,cn=config] mech [SIMPLE]: error 91 (Can't connect to the LDAP server) -5961 (TCP connection reset by peer.) 115 (Operation now in progress)
[27/Jan/2010:22:53:53 -0500] slapi_ldap_bind - Error: could not send bind request for id [cn=Replication Manager,cn=config] mech [SIMPLE]: error 91 (Can't connect to the LDAP server) -5961 (TCP connection reset by peer.) 115 (Operation now in progress)
[27/Jan/2010:22:53:57 -0500] slapi_ldap_bind - Error: could not send bind request for id [cn=Replication Manager,cn=config] mech [SIMPLE]: error 91 (Can't connect to the LDAP server) -5961 (TCP connection reset by peer.) 115 (Operation now in progress)
[27/Jan/2010:22:54:03 -0500] slapi_ldap_bind - Error: could not send bind request for id [cn=Replication Manager,cn=config] mech [SIMPLE]: error 91 (Can't connect to the LDAP server) -5961 (TCP connection reset by peer.) 115 (Operation now in progress)
[27/Jan/2010:22:54:41 -0500] NSMMReplicationPlugin - agmt="cn=accounts bellar sneedle" (sneedle:636): Replication bind with SIMPLE auth resumed
[27/Jan/2010:22:54:45 -0500] NSMMReplicationPlugin - agmt="cn=people bellar sneedle" (sneedle:636): Replication bind with SIMPLE auth resumed
[27/Jan/2010:22:54:51 -0500] NSMMReplicationPlugin - agmt="cn=gted bellar sneedle" (sneedle:636): Replication bind with SIMPLE auth resumed
[27/Jan/2010:22:55:27 -0500] NSMMReplicationPlugin - agmt="cn=gted bellar zuber" (zuber:636): Warning: unable to receive endReplication extended operation response (Can't contact LDAP server)
[27/Jan/2010:22:55:31 -0500] NSMMReplicationPlugin - agmt="cn=people bellar zuber" (zuber:636): Unable to receive the response for a startReplication extended operation to consumer (Can't contact LDAP server). Will retry later.
[27/Jan/2010:22:55:35 -0500] slapi_ldap_bind - Error: could not send bind request for id [cn=Replication Manager,cn=config] mech [SIMPLE]: error 91 (Can't connect to the LDAP server) -5961 (TCP connection reset by peer.) 115 (Operation now in progress)
[27/Jan/2010:22:55:35 -0500] NSMMReplicationPlugin - agmt="cn=people bellar zuber" (zuber:636): Replication bind with SIMPLE auth failed: LDAP error 91 (Can't connect to the LDAP server) ((null))
[27/Jan/2010:22:55:37 -0500] slapi_ldap_bind - Error: could not send bind request for id [cn=Replication Manager,cn=config] mech [SIMPLE]: error 91 (Can't connect to the LDAP server) -5961 (TCP connection reset by peer.) 115 (Operation now in progress)
[27/Jan/2010:22:55:37 -0500] NSMMReplicationPlugin - agmt="cn=gted bellar zuber" (zuber:636): Replication bind with SIMPLE auth failed: LDAP error 91 (Can't connect to the LDAP server) ((null))
[27/Jan/2010:22:55:38 -0500] NSMMReplicationPlugin - agmt="cn=accounts bellar zuber" (zuber:636): Unable to receive the response for a startReplication extended operation to consumer (Can't contact LDAP server). Will retry later.
[27/Jan/2010:22:55:40 -0500] slapi_ldap_bind - Error: could not send bind request for id [cn=Replication Manager,cn=config] mech [SIMPLE]: error 91 (Can't connect to the LDAP server) -5961 (TCP connection reset by peer.) 115 (Operation now in progress)
[27/Jan/2010:22:55:41 -0500] slapi_ldap_bind - Error: could not send bind request for id [cn=Replication Manager,cn=config] mech [SIMPLE]: error 91 (Can't connect to the LDAP server) -5961 (TCP connection reset by peer.) 115 (Operation now in progress)
[27/Jan/2010:22:55:42 -0500] slapi_ldap_bind - Error: could not send bind request for id [cn=Replication Manager,cn=config] mech [SIMPLE]: error 91 (Can't connect to the LDAP server) -5961 (TCP connection reset by peer.) 115 (Operation now in progress)
[27/Jan/2010:22:55:42 -0500] NSMMReplicationPlugin - agmt="cn=accounts bellar zuber" (zuber:636): Replication bind with SIMPLE auth failed: LDAP error 91 (Can't connect to the LDAP server) ((null))
[27/Jan/2010:22:55:46 -0500] slapi_ldap_bind - Error: could not send bind request for id [cn=Replication Manager,cn=config] mech [SIMPLE]: error 91 (Can't connect to the LDAP server) -5961 (TCP connection reset by peer.) 115 (Operation now in progress)
[27/Jan/2010:22:55:48 -0500] NSMMReplicationPlugin - agmt="cn=accounts bellar zuber" (zuber:636): Replication bind with SIMPLE auth resumed
[27/Jan/2010:22:55:53 -0500] NSMMReplicationPlugin - agmt="cn=people bellar zuber" (zuber:636): Replication bind with SIMPLE auth resumed
[27/Jan/2010:22:55:58 -0500] NSMMReplicationPlugin - agmt="cn=gted bellar zuber" (zuber:636): Replication bind with SIMPLE auth resumed
[27/Jan/2010:22:56:59 -0500] NSMMReplicationPlugin - agmt="cn=gted bellar snipps" (snipps:636): Unable to receive the response for a startReplication extended operation to consumer (Can't contact LDAP server). Will retry later.
[27/Jan/2010:22:57:02 -0500] slapi_ldap_bind - Error: could not send bind request for id [cn=Replication Manager,cn=config] mech [SIMPLE]: error 91 (Can't connect to the LDAP server) -5961 (TCP connection reset by peer.) 115 (Operation now in progress)
[27/Jan/2010:22:57:02 -0500] NSMMReplicationPlugin - agmt="cn=gted bellar snipps" (snipps:636): Replication bind with SIMPLE auth failed: LDAP error 91 (Can't connect to the LDAP server) ((null))
[27/Jan/2010:22:57:03 -0500] NSMMReplicationPlugin - agmt="cn=people bellar snipps" (snipps:636): Unable to receive the response for a startReplication extended operation to consumer (Can't contact LDAP server). Will retry later.
[27/Jan/2010:22:57:06 -0500] slapi_ldap_bind - Error: could not send bind request for id [cn=Replication Manager,cn=config] mech [SIMPLE]: error 91 (Can't connect to the LDAP server) -5961 (TCP connection reset by peer.) 115 (Operation now in progress)
[27/Jan/2010:22:57:06 -0500] NSMMReplicationPlugin - agmt="cn=people bellar snipps" (snipps:636): Replication bind with SIMPLE auth failed: LDAP error 91 (Can't connect to the LDAP server) ((null))
[27/Jan/2010:22:57:07 -0500] NSMMReplicationPlugin - agmt="cn=accounts bellar snipps" (snipps:636): Unable to receive the response for a startReplication extended operation to consumer (Can't contact LDAP server). Will retry later.
[27/Jan/2010:22:57:08 -0500] slapi_ldap_bind - Error: could not send bind request for id [cn=Replication Manager,cn=config] mech [SIMPLE]: error 91 (Can't connect to the LDAP server) -5961 (TCP connection reset by peer.) 115 (Operation now in progress)
[27/Jan/2010:22:57:10 -0500] slapi_ldap_bind - Error: could not send bind request for id [cn=Replication Manager,cn=config] mech [SIMPLE]: error 91 (Can't connect to the LDAP server) -5961 (TCP connection reset by peer.) 115 (Operation now in progress)
[27/Jan/2010:22:57:10 -0500] NSMMReplicationPlugin - agmt="cn=accounts bellar snipps" (snipps:636): Replication bind with SIMPLE auth failed: LDAP error 91 (Can't connect to the LDAP server) ((null))
[27/Jan/2010:22:57:12 -0500] slapi_ldap_bind - Error: could not send bind request for id [cn=Replication Manager,cn=config] mech [SIMPLE]: error 91 (Can't connect to the LDAP server) -5961 (TCP connection reset by peer.) 115 (Operation now in progress)
[27/Jan/2010:22:57:16 -0500] NSMMReplicationPlugin - agmt="cn=accounts bellar snipps" (snipps:636): Replication bind with SIMPLE auth resumed
[27/Jan/2010:22:57:20 -0500] NSMMReplicationPlugin - agmt="cn=gted bellar snipps" (snipps:636): Replication bind with SIMPLE auth resumed
[27/Jan/2010:22:57:24 -0500] NSMMReplicationPlugin - agmt="cn=people bellar snipps" (snipps:636): Replication bind with SIMPLE auth resumed
bellar.iam.gatech.edu:
Comment 37 John Bryson 2010-02-18 10:21:32 EST
In answer to a previous question, I -think- the database is read only, how can I confirm that? And how would I change that to see if that makes it usable after the error occurs?

One change we are talking about doing to simplify our topology is to have each hub only talk to its own group of replication consumers. So the hub bellar would provide replication data to one list of machines, and the hub briggs would talk to a separate list of machines. 

We also discussed putting these on some scheduled replication instead of constant, to see if it makes any difference.
Comment 38 Rich Megginson 2010-02-18 10:57:11 EST
(In reply to comment #37)
> In answer to a previous question, I -think- the database is read only, how can
> I confirm that?

http://www.redhat.com/docs/manuals/dir-server/8.1/admin/Configuring_Directory_Databases-Creating_and_Maintaining_Databases.html

> And how would I change that to see if that makes it usable
> after the error occurs?

See above, and see also:

http://www.redhat.com/docs/manuals/dir-server/8.1/admin/Configuring_Directory_Databases-Using_Referrals.html

You might need to set the suffix to "Use the Databases" for masters or "Return Referrals for Update Operations" if this is a hub or read-only consumer.

> 
> One change we are talking about doing to simplify our topology is to have each
> hub only talk to its own group of replication consumers. So the hub bellar
> would provide replication data to one list of machines, and the hub briggs
> would talk to a separate list of machines. 
> 
> We also discussed putting these on some scheduled replication instead of
> constant, to see if it makes any difference.
Comment 39 John Bryson 2010-02-19 11:05:46 EST
Not read only.

Ok, so we just had the failure on the people database on a machine called snipps.iam and the people database was not set to read only. 

I looked in the 389-console gui, under 

snipps.iam.gatech.edu, 
 Data
  Database Link Settings
  Database Settings
  Local Backups
  dc=gted,dc=gatech,dc=edu
   -gted 
      database tab, the box for 'Database is read-only' is not checked.
   ou=accounts,ou=gtaccounts,ou=departments,dc=gted,dc=gatech,dc=edu
   -accounts
      database tab, the box for 'Database is read-only' is not checked.
   ou=people,dc=gted,dc-gatech,dc=edu
    -people
       database tab, the box for 'Database is read-only' is not checked.

Nor was the entire server read only under the Configuration tab, snipps.iam.gatech.edu, Settings tab, the box for 'Make entire server read-only' was not checked.
Comment 40 Noriko Hosoi 2010-03-03 14:13:19 EST
Hello, John.  While running some other tests, I ran into a similar problem you are observing.  My test is pretty simple:
. Set up 2-way MMR
. Run total update against the 2 masters almost at the same time.
It causes the deadlock on the both masters.  They wait for the replica timeout (10min., by default).  Once timed out, the total update on the both servers resumes (as a consumer).  Since the supplier thread has already failed, no data is available and the both servers lose the data.

My error log shows pretty similar logs you pushed:
[..] NSMMReplicationPlugin - multimaster_be_state_change: replica dc=example,dc=com is coming online; enabling replication
[..] - PR_Write(66) Netscape Portable Runtime error -5961 (TCP connection reset by peer.)
[..] NSMMReplicationPlugin - multimaster_be_state_change: replica dc=example,dc=com is going offline; disabling replication
[..] NSMMReplicationPlugin - agmt="cn=m1-to-m0" (kiki:10390): Unable to receive the response for a startReplication extended operation to consumer (Timed out). Will retry later.
[..] - WARNING: Import is running with nsslapd-db-private-import-mem on; No other process is allowed to access the database
[..] - ERROR bulk import abandoned
[..] - import userRoot: Aborting all import threads...
[..] - import userRoot: Import threads aborted.
[..] - import userRoot: Closing files...
[..] - import userRoot: Import failed.
[..] - process_bulk_import_op: NULL backend

I wonder if your hub (e.g., zuber) is a replica of multiple masters?  You wrote "Our environment is 3 masters that replicate to each other, and to 2 hubs."  This means you have 6 hubs and each hub receives changes from all 3 masters or 2 or just one?

Interestingly, you are not initiating a total update although the hubs are receiving the multiple total update operations, I heard.  Do you happen to keep the access log of the masters when the total update/bulk import occurred on the hub?  I'm curious if your masters' log files could have stored this type of logs:
[..] conn=2 op=1 MOD dn="cn=<your_agreement_name>,cn=replica,\"<your_suffix>\",cn=mapping tree,cn=config"
Comment 41 John Bryson 2010-03-03 15:41:46 EST
So we have 3 masters rewbell, treilhard, and gertrude. These all replicate to each other. Changes usually come in to one of the masters because the load balancer favors that one.

We also have 2 hubs bellar and briggs, and all masters replicate to both of these hubs, and the hubs replicate to each other.

We have 12 slaves that each get replications from each hub.

   ==========
 //          \\
M <==> M <====> M    <3masters
|\     /\      /|
v v   v  v    v v 
 


bellar         briggs           <2hubs
||||||\\\\\\   //////||||||
VVVVVV VVVVVV  VVVVVV VVVVVV

zuber geraldine neufchateau snipps sneedle aaron   < 12slaves
abigail alice zans zall zebs barras    
           

The hubs, bellar and briggs, are getting replication from all masters. 
We use continuous updates instead of scheduled updates.   

I have observed that the issue almost always occurs on the slaves. In fact, Im not sure that I have ever seen it on the master to master replication, or master to hub.
Comment 42 Noriko Hosoi 2010-03-03 21:02:08 EST
Thank you for the clear picture.

Do you still keep the access logs on the hubs when such a problem occurs?  Could you run "grep -i mapping access*" in the log directory?  I cannot find any method to initiate the total update other than sending a modify request on the agreement...
Comment 43 Nathan Kinder 2010-03-04 16:22:56 EST
Is there any chance that the replication updates are being load balanced (intentionally or accidentally)?  I have seen some odd things like this before from replication sessions being load balanced, causing a master/hub to send updates to replicas other than those intended.
Comment 44 John Bryson 2010-03-04 17:18:15 EST
We load balance user access to the machines  but replication sessions are not. Replication agreements have specific hostnames (ex. neufchateau.iam.gatech.edu) in them, not the load balanced name (r.gted.gatech.edu)

Then last machine to have a problem was neufchateau.iam.gatech.edu (March 1 @15:14 is got these errors:
[01/Mar/2010:15:16:20 -0500] - ldbm: 'gted' is already in the middle of another task and cannot be disturbed.
[01/Mar/2010:15:16:21 -0500] - slapi_start_bulk_import: failed; error = -23
[01/Mar/2010:15:16:23 -0500] - ldbm: 'gted' is already in the middle of another task and cannot be disturbed.
[01/Mar/2010:15:16:24 -0500] - slapi_start_bulk_import: failed; error = -23
[01/Mar/2010:15:16:24 -0500] NSMMReplicationPlugin - conn=805050 op=3228 replica="dc=gted,dc=gatech,dc=edu": Unable to acquire replica: error: internal error
[01/Mar/2010:15:16:24 -0500] NSMMReplicationPlugin - replica_write_ruv: failed to update RUV tombstone for dc=gted,dc=gatech,dc=edu; LDAP error - 1
[01/Mar/2010:15:16:24 -0500] NSMMReplicationPlugin - multimaster_be_state_change: replica dc=gted,dc=gatech,dc=edu is going offline; disabling replication
[01/Mar/2010:15:16:24 -0500] NSMMReplicationPlugin - conn=805050 op=3223 replica="dc=gted,dc=gatech,dc=edu": Unable to acquire replica: error: internal error
[01/Mar/2010:15:17:09 -0500] NSMMReplicationPlugin - replica_replace_ruv_tombstone: failed to update replication update vector for replica dc=gted,dc=gatech,dc=edu: LDAP error - 1



This is a slave server (gets replications from a hub -bellar or briggs, and not from anyone else).

grep -i mapping access* is pretty huge, so below is output from this (let me know if you need more)

access.20100301-000054 

neufchateau.iam.gatech.edu: grep -i mapping access.20100301-000054
[01/Mar/2010:00:10:03 -0500] conn=687068 op=1 SRCH base="cn=mapping tree,cn=config" scope=2 filter="(|(objectClass=nsDS5ReplicationAgreement)(objectClass=LDAPReplica)(objectClass=nsDSWindowsReplicationAgreement))" attrs="nsds5replicaLastUpdateStatus"
[01/Mar/2010:00:20:02 -0500] conn=688278 op=1 SRCH base="cn=mapping tree,cn=config" scope=2 filter="(|(objectClass=nsDS5ReplicationAgreement)(objectClass=LDAPReplica)(objectClass=nsDSWindowsReplicationAgreement))" attrs="nsds5replicaLastUpdateStatus"
[01/Mar/2010:00:30:04 -0500] conn=689558 op=1 SRCH base="cn=mapping tree,cn=config" scope=2 filter="(|(objectClass=nsDS5ReplicationAgreement)(objectClass=LDAPReplica)(objectClass=nsDSWindowsReplicationAgreement))" attrs="nsds5replicaLastUpdateStatus"
[01/Mar/2010:00:40:01 -0500] conn=690961 op=1 SRCH base="cn=mapping tree,cn=config" scope=2 filter="(|(objectClass=nsDS5ReplicationAgreement)(objectClass=LDAPReplica)(objectClass=nsDSWindowsReplicationAgreement))" attrs="nsds5replicaLastUpdateStatus"
[01/Mar/2010:00:50:01 -0500] conn=692096 op=1 SRCH base="cn=mapping tree,cn=config" scope=2 filter="(|(objectClass=nsDS5ReplicationAgreement)(objectClass=LDAPReplica)(objectClass=nsDSWindowsReplicationAgreement))" attrs="nsds5replicaLastUpdateStatus"
[01/Mar/2010:01:00:01 -0500] conn=693105 op=1 SRCH base="cn=mapping tree,cn=config" scope=2 filter="(|(objectClass=nsDS5ReplicationAgreement)(objectClass=LDAPReplica)(objectClass=nsDSWindowsReplicationAgreement))" attrs="nsds5replicaLastUpdateStatus"
[01/Mar/2010:01:10:01 -0500] conn=694594 op=1 SRCH base="cn=mapping tree,cn=config" scope=2 filter="(|(objectClass=nsDS5ReplicationAgreement)(objectClass=LDAPReplica)(objectClass=nsDSWindowsReplicationAgreement))" attrs="nsds5replicaLastUpdateStatus"
[01/Mar/2010:01:20:01 -0500] conn=696321 op=1 SRCH base="cn=mapping tree,cn=config" scope=2 filter="(|(objectClass=nsDS5ReplicationAgreement)(objectClass=LDAPReplica)(objectClass=nsDSWindowsReplicationAgreement))" attrs="nsds5replicaLastUpdateStatus"
[01/Mar/2010:01:30:02 -0500] conn=697781 op=1 SRCH base="cn=mapping tree,cn=config" scope=2 filter="(|(objectClass=nsDS5ReplicationAgreement)(objectClass=LDAPReplica)(objectClass=nsDSWindowsReplicationAgreement))" attrs="nsds5replicaLastUpdateStatus"
[01/Mar/2010:01:40:01 -0500] conn=699518 op=1 SRCH base="cn=mapping tree,cn=config" scope=2 filter="(|(objectClass=nsDS5ReplicationAgreement)(objectClass=LDAPReplica)(objectClass=nsDSWindowsReplicationAgreement))" attrs="nsds5replicaLastUpdateStatus"
[01/Mar/2010:01:50:02 -0500] conn=700729 op=1 SRCH base="cn=mapping tree,cn=config" scope=2 filter="(|(objectClass=nsDS5ReplicationAgreement)(objectClass=LDAPReplica)(objectClass=nsDSWindowsReplicationAgreement))" attrs="nsds5replicaLastUpdateStatus"
[01/Mar/2010:02:00:00 -0500] conn=701921 op=1 SRCH base="cn=mapping tree,cn=config" scope=2 filter="(|(objectClass=nsDS5ReplicationAgreement)(objectClass=LDAPReplica)(objectClass=nsDSWindowsReplicationAgreement))" attrs="nsds5replicaLastUpdateStatus"
[01/Mar/2010:02:10:00 -0500] conn=703106 op=1 SRCH base="cn=mapping tree,cn=config" scope=2 filter="(|(objectClass=nsDS5ReplicationAgreement)(objectClass=LDAPReplica)(objectClass=nsDSWindowsReplicationAgreement))" attrs="nsds5replicaLastUpdateStatus"
[01/Mar/2010:02:20:01 -0500] conn=704116 op=1 SRCH base="cn=mapping tree,cn=config" scope=2 filter="(|(objectClass=nsDS5ReplicationAgreement)(objectClass=LDAPReplica)(objectClass=nsDSWindowsReplicationAgreement))" attrs="nsds5replicaLastUpdateStatus"
[01/Mar/2010:02:30:00 -0500] conn=705083 op=1 SRCH base="cn=mapping tree,cn=config" scope=2 filter="(|(objectClass=nsDS5ReplicationAgreement)(objectClass=LDAPReplica)(objectClass=nsDSWindowsReplicationAgreement))" attrs="nsds5replicaLastUpdateStatus"
[01/Mar/2010:02:40:02 -0500] conn=706102 op=1 SRCH base="cn=mapping tree,cn=config" scope=2 filter="(|(objectClass=nsDS5ReplicationAgreement)(objectClass=LDAPReplica)(objectClass=nsDSWindowsReplicationAgreement))" attrs="nsds5replicaLastUpdateStatus"
[01/Mar/2010:02:50:00 -0500] conn=707340 op=1 SRCH base="cn=mapping tree,cn=config" scope=2 filter="(|(objectClass=nsDS5ReplicationAgreement)(objectClass=LDAPReplica)(objectClass=nsDSWindowsReplicationAgreement))" attrs="nsds5replicaLastUpdateStatus"
[01/Mar/2010:03:00:02 -0500] conn=708651 op=1 SRCH base="cn=mapping tree,cn=config" scope=2 filter="(|(objectClass=nsDS5ReplicationAgreement)(objectClass=LDAPReplica)(objectClass=nsDSWindowsReplicationAgreement))" attrs="nsds5replicaLastUpdateStatus"
[01/Mar/2010:03:10:02 -0500] conn=710034 op=1 SRCH base="cn=mapping tree,cn=config" scope=2 filter="(|(objectClass=nsDS5ReplicationAgreement)(objectClass=LDAPReplica)(objectClass=nsDSWindowsReplicationAgreement))" attrs="nsds5replicaLastUpdateStatus"
[01/Mar/2010:03:26:03 -0500] conn=711266 op=1 SRCH base="cn=mapping tree,cn=config" scope=2 filter="(|(objectClass=nsDS5ReplicationAgreement)(objectClass=LDAPReplica)(objectClass=nsDSWindowsReplicationAgreement))" attrs="nsds5replicaLastUpdateStatus"
[01/Mar/2010:03:30:29 -0500] conn=711632 op=1 SRCH base="cn=mapping tree,cn=config" scope=2 filter="(|(objectClass=nsDS5ReplicationAgreement)(objectClass=LDAPReplica)(objectClass=nsDSWindowsReplicationAgreement))" attrs="nsds5replicaLastUpdateStatus"
[01/Mar/2010:03:40:01 -0500] conn=712910 op=1 SRCH base="cn=mapping tree,cn=config" scope=2 filter="(|(objectClass=nsDS5ReplicationAgreement)(objectClass=LDAPReplica)(objectClass=nsDSWindowsReplicationAgreement))" attrs="nsds5replicaLastUpdateStatus"
[01/Mar/2010:03:50:01 -0500] conn=714109 op=1 SRCH base="cn=mapping tree,cn=config" scope=2 filter="(|(objectClass=nsDS5ReplicationAgreement)(objectClass=LDAPReplica)(objectClass=nsDSWindowsReplicationAgreement))" attrs="nsds5replicaLastUpdateStatus"
[01/Mar/2010:04:00:01 -0500] conn=715071 op=1 SRCH base="cn=mapping tree,cn=config" scope=2 filter="(|(objectClass=nsDS5ReplicationAgreement)(objectClass=LDAPReplica)(objectClass=nsDSWindowsReplicationAgreement))" attrs="nsds5replicaLastUpdateStatus"
[01/Mar/2010:04:10:00 -0500] conn=716456 op=1 SRCH base="cn=mapping tree,cn=config" scope=2 filter="(|(objectClass=nsDS5ReplicationAgreement)(objectClass=LDAPReplica)(objectClass=nsDSWindowsReplicationAgreement))" attrs="nsds5replicaLastUpdateStatus"
[01/Mar/2010:04:20:01 -0500] conn=717608 op=1 SRCH base="cn=mapping tree,cn=config" scope=2 filter="(|(objectClass=nsDS5ReplicationAgreement)(objectClass=LDAPReplica)(objectClass=nsDSWindowsReplicationAgreement))" attrs="nsds5replicaLastUpdateStatus"
[01/Mar/2010:04:30:00 -0500] conn=718774 op=1 SRCH base="cn=mapping tree,cn=config" scope=2 filter="(|(objectClass=nsDS5ReplicationAgreement)(objectClass=LDAPReplica)(objectClass=nsDSWindowsReplicationAgreement))" attrs="nsds5replicaLastUpdateStatus"
[01/Mar/2010:04:40:00 -0500] conn=719905 op=1 SRCH base="cn=mapping tree,cn=config" scope=2 filter="(|(objectClass=nsDS5ReplicationAgreement)(objectClass=LDAPReplica)(objectClass=nsDSWindowsReplicationAgreement))" attrs="nsds5replicaLastUpdateStatus"
[01/Mar/2010:04:50:00 -0500] conn=721120 op=1 SRCH base="cn=mapping tree,cn=config" scope=2 filter="(|(objectClass=nsDS5ReplicationAgreement)(objectClass=LDAPReplica)(objectClass=nsDSWindowsReplicationAgreement))" attrs="nsds5replicaLastUpdateStatus"
[01/Mar/2010:05:00:01 -0500] conn=722158 op=1 SRCH base="cn=mapping tree,cn=config" scope=2 filter="(|(objectClass=nsDS5ReplicationAgreement)(objectClass=LDAPReplica)(objectClass=nsDSWindowsReplicationAgreement))" attrs="nsds5replicaLastUpdateStatus"
[01/Mar/2010:05:10:01 -0500] conn=723318 op=1 SRCH base="cn=mapping tree,cn=config" scope=2 filter="(|(objectClass=nsDS5ReplicationAgreement)(objectClass=LDAPReplica)(objectClass=nsDSWindowsReplicationAgreement))" attrs="nsds5replicaLastUpdateStatus"
[01/Mar/2010:05:20:01 -0500] conn=724639 op=1 SRCH base="cn=mapping tree,cn=config" scope=2 filter="(|(objectClass=nsDS5ReplicationAgreement)(objectClass=LDAPReplica)(objectClass=nsDSWindowsReplicationAgreement))" attrs="nsds5replicaLastUpdateStatus"
[01/Mar/2010:05:30:01 -0500] conn=726334 op=1 SRCH base="cn=mapping tree,cn=config" scope=2 filter="(|(objectClass=nsDS5ReplicationAgreement)(objectClass=LDAPReplica)(objectClass=nsDSWindowsReplicationAgreement))" attrs="nsds5replicaLastUpdateStatus"
[01/Mar/2010:05:40:01 -0500] conn=728297 op=1 SRCH base="cn=mapping tree,cn=config" scope=2 filter="(|(objectClass=nsDS5ReplicationAgreement)(objectClass=LDAPReplica)(objectClass=nsDSWindowsReplicationAgreement))" attrs="nsds5replicaLastUpdateStatus"
[01/Mar/2010:05:50:01 -0500] conn=729678 op=1 SRCH base="cn=mapping tree,cn=config" scope=2 filter="(|(objectClass=nsDS5ReplicationAgreement)(objectClass=LDAPReplica)(objectClass=nsDSWindowsReplicationAgreement))" attrs="nsds5replicaLastUpdateStatus"
[01/Mar/2010:06:00:01 -0500] conn=730834 op=1 SRCH base="cn=mapping tree,cn=config" scope=2 filter="(|(objectClass=nsDS5ReplicationAgreement)(objectClass=LDAPReplica)(objectClass=nsDSWindowsReplicationAgreement))" attrs="nsds5replicaLastUpdateStatus"
[01/Mar/2010:06:10:01 -0500] conn=731938 op=1 SRCH base="cn=mapping tree,cn=config" scope=2 filter="(|(objectClass=nsDS5ReplicationAgreement)(objectClass=LDAPReplica)(objectClass=nsDSWindowsReplicationAgreement))" attrs="nsds5replicaLastUpdateStatus"
[01/Mar/2010:06:20:01 -0500] conn=732747 op=1 SRCH base="cn=mapping tree,cn=config" scope=2 filter="(|(objectClass=nsDS5ReplicationAgreement)(objectClass=LDAPReplica)(objectClass=nsDSWindowsReplicationAgreement))" attrs="nsds5replicaLastUpdateStatus"
[01/Mar/2010:06:30:01 -0500] conn=733872 op=1 SRCH base="cn=mapping tree,cn=config" scope=2 filter="(|(objectClass=nsDS5ReplicationAgreement)(objectClass=LDAPReplica)(objectClass=nsDSWindowsReplicationAgreement))" attrs="nsds5replicaLastUpdateStatus"
[01/Mar/2010:06:40:01 -0500] conn=735063 op=1 SRCH base="cn=mapping tree,cn=config" scope=2 filter="(|(objectClass=nsDS5ReplicationAgreement)(objectClass=LDAPReplica)(objectClass=nsDSWindowsReplicationAgreement))" attrs="nsds5replicaLastUpdateStatus"
[01/Mar/2010:06:50:01 -0500] conn=736338 op=1 SRCH base="cn=mapping tree,cn=config" scope=2 filter="(|(objectClass=nsDS5ReplicationAgreement)(objectClass=LDAPReplica)(objectClass=nsDSWindowsReplicationAgreement))" attrs="nsds5replicaLastUpdateStatus"
[01/Mar/2010:07:00:00 -0500] conn=737257 op=1 SRCH base="cn=mapping tree,cn=config" scope=2 filter="(|(objectClass=nsDS5ReplicationAgreement)(objectClass=LDAPReplica)(objectClass=nsDSWindowsReplicationAgreement))" attrs="nsds5replicaLastUpdateStatus"
[01/Mar/2010:07:10:01 -0500] conn=738408 op=1 SRCH base="cn=mapping tree,cn=config" scope=2 filter="(|(objectClass=nsDS5ReplicationAgreement)(objectClass=LDAPReplica)(objectClass=nsDSWindowsReplicationAgreement))" attrs="nsds5replicaLastUpdateStatus"
[01/Mar/2010:07:20:00 -0500] conn=739638 op=1 SRCH base="cn=mapping tree,cn=config" scope=2 filter="(|(objectClass=nsDS5ReplicationAgreement)(objectClass=LDAPReplica)(objectClass=nsDSWindowsReplicationAgreement))" attrs="nsds5replicaLastUpdateStatus"
[01/Mar/2010:07:30:00 -0500] conn=741056 op=1 SRCH base="cn=mapping tree,cn=config" scope=2 filter="(|(objectClass=nsDS5ReplicationAgreement)(objectClass=LDAPReplica)(objectClass=nsDSWindowsReplicationAgreement))" attrs="nsds5replicaLastUpdateStatus"
[01/Mar/2010:07:40:00 -0500] conn=742392 op=1 SRCH base="cn=mapping tree,cn=config" scope=2 filter="(|(objectClass=nsDS5ReplicationAgreement)(objectClass=LDAPReplica)(objectClass=nsDSWindowsReplicationAgreement))" attrs="nsds5replicaLastUpdateStatus"
[01/Mar/2010:07:50:00 -0500] conn=743978 op=1 SRCH base="cn=mapping tree,cn=config" scope=2 filter="(|(objectClass=nsDS5ReplicationAgreement)(objectClass=LDAPReplica)(objectClass=nsDSWindowsReplicationAgreement))" attrs="nsds5replicaLastUpdateStatus"
[01/Mar/2010:08:00:01 -0500] conn=745611 op=1 SRCH base="cn=mapping tree,cn=config" scope=2 filter="(|(objectClass=nsDS5ReplicationAgreement)(objectClass=LDAPReplica)(objectClass=nsDSWindowsReplicationAgreement))" attrs="nsds5replicaLastUpdateStatus"
[01/Mar/2010:08:10:00 -0500] conn=747321 op=1 SRCH base="cn=mapping tree,cn=config" scope=2 filter="(|(objectClass=nsDS5ReplicationAgreement)(objectClass=LDAPReplica)(objectClass=nsDSWindowsReplicationAgreement))" attrs="nsds5replicaLastUpdateStatus"
[01/Mar/2010:08:20:01 -0500] conn=748812 op=1 SRCH base="cn=mapping tree,cn=config" scope=2 filter="(|(objectClass=nsDS5ReplicationAgreement)(objectClass=LDAPReplica)(objectClass=nsDSWindowsReplicationAgreement))" attrs="nsds5replicaLastUpdateStatus"
[01/Mar/2010:08:30:01 -0500] conn=750977 op=1 SRCH base="cn=mapping tree,cn=config" scope=2 filter="(|(objectClass=nsDS5ReplicationAgreement)(objectClass=LDAPReplica)(objectClass=nsDSWindowsReplicationAgreement))" attrs="nsds5replicaLastUpdateStatus"
[01/Mar/2010:08:40:04 -0500] conn=752940 op=1 SRCH base="cn=mapping tree,cn=config" scope=2 filter="(|(objectClass=nsDS5ReplicationAgreement)(objectClass=LDAPReplica)(objectClass=nsDSWindowsReplicationAgreement))" attrs="nsds5replicaLastUpdateStatus"
[01/Mar/2010:08:50:01 -0500] conn=754755 op=1 SRCH base="cn=mapping tree,cn=config" scope=2 filter="(|(objectClass=nsDS5ReplicationAgreement)(objectClass=LDAPReplica)(objectClass=nsDSWindowsReplicationAgreement))" attrs="nsds5replicaLastUpdateStatus"
[01/Mar/2010:09:00:02 -0500] conn=756505 op=1 SRCH base="cn=mapping tree,cn=config" scope=2 filter="(|(objectClass=nsDS5ReplicationAgreement)(objectClass=LDAPReplica)(objectClass=nsDSWindowsReplicationAgreement))" attrs="nsds5replicaLastUpdateStatus"
[01/Mar/2010:09:10:01 -0500] conn=758873 op=1 SRCH base="cn=mapping tree,cn=config" scope=2 filter="(|(objectClass=nsDS5ReplicationAgreement)(objectClass=LDAPReplica)(objectClass=nsDSWindowsReplicationAgreement))" attrs="nsds5replicaLastUpdateStatus"
[01/Mar/2010:09:20:01 -0500] conn=761486 op=1 SRCH base="cn=mapping tree,cn=config" scope=2 filter="(|(objectClass=nsDS5ReplicationAgreement)(objectClass=LDAPReplica)(objectClass=nsDSWindowsReplicationAgreement))" attrs="nsds5replicaLastUpdateStatus"
[01/Mar/2010:09:30:01 -0500] conn=764156 op=1 SRCH base="cn=mapping tree,cn=config" scope=2 filter="(|(objectClass=nsDS5ReplicationAgreement)(objectClass=LDAPReplica)(objectClass=nsDSWindowsReplicationAgreement))" attrs="nsds5replicaLastUpdateStatus"
[01/Mar/2010:09:40:01 -0500] conn=766744 op=1 SRCH base="cn=mapping tree,cn=config" scope=2 filter="(|(objectClass=nsDS5ReplicationAgreement)(objectClass=LDAPReplica)(objectClass=nsDSWindowsReplicationAgreement))" attrs="nsds5replicaLastUpdateStatus"
[01/Mar/2010:09:50:02 -0500] conn=769310 op=1 SRCH base="cn=mapping tree,cn=config" scope=2 filter="(|(objectClass=nsDS5ReplicationAgreement)(objectClass=LDAPReplica)(objectClass=nsDSWindowsReplicationAgreement))" attrs="nsds5replicaLastUpdateStatus"
[01/Mar/2010:10:00:01 -0500] conn=771717 op=1 SRCH base="cn=mapping tree,cn=config" scope=2 filter="(|(objectClass=nsDS5ReplicationAgreement)(objectClass=LDAPReplica)(objectClass=nsDSWindowsReplicationAgreement))" attrs="nsds5replicaLastUpdateStatus"
[01/Mar/2010:10:10:01 -0500] conn=774060 op=1 SRCH base="cn=mapping tree,cn=config" scope=2 filter="(|(objectClass=nsDS5ReplicationAgreement)(objectClass=LDAPReplica)(objectClass=nsDSWindowsReplicationAgreement))" attrs="nsds5replicaLastUpdateStatus"
[01/Mar/2010:10:20:00 -0500] conn=776365 op=1 SRCH base="cn=mapping tree,cn=config" scope=2 filter="(|(objectClass=nsDS5ReplicationAgreement)(objectClass=LDAPReplica)(objectClass=nsDSWindowsReplicationAgreement))" attrs="nsds5replicaLastUpdateStatus"
[01/Mar/2010:10:30:01 -0500] conn=778630 op=1 SRCH base="cn=mapping tree,cn=config" scope=2 filter="(|(objectClass=nsDS5ReplicationAgreement)(objectClass=LDAPReplica)(objectClass=nsDSWindowsReplicationAgreement))" attrs="nsds5replicaLastUpdateStatus"
[01/Mar/2010:10:40:00 -0500] conn=781014 op=1 SRCH base="cn=mapping tree,cn=config" scope=2 filter="(|(objectClass=nsDS5ReplicationAgreement)(objectClass=LDAPReplica)(objectClass=nsDSWindowsReplicationAgreement))" attrs="nsds5replicaLastUpdateStatus"
[01/Mar/2010:10:50:01 -0500] conn=783076 op=1 SRCH base="cn=mapping tree,cn=config" scope=2 filter="(|(objectClass=nsDS5ReplicationAgreement)(objectClass=LDAPReplica)(objectClass=nsDSWindowsReplicationAgreement))" attrs="nsds5replicaLastUpdateStatus"
[01/Mar/2010:11:00:01 -0500] conn=785541 op=1 SRCH base="cn=mapping tree,cn=config" scope=2 filter="(|(objectClass=nsDS5ReplicationAgreement)(objectClass=LDAPReplica)(objectClass=nsDSWindowsReplicationAgreement))" attrs="nsds5replicaLastUpdateStatus"
[01/Mar/2010:11:10:01 -0500] conn=788360 op=1 SRCH base="cn=mapping tree,cn=config" scope=2 filter="(|(objectClass=nsDS5ReplicationAgreement)(objectClass=LDAPReplica)(objectClass=nsDSWindowsReplicationAgreement))" attrs="nsds5replicaLastUpdateStatus"
[01/Mar/2010:11:20:01 -0500] conn=791004 op=1 SRCH base="cn=mapping tree,cn=config" scope=2 filter="(|(objectClass=nsDS5ReplicationAgreement)(objectClass=LDAPReplica)(objectClass=nsDSWindowsReplicationAgreement))" attrs="nsds5replicaLastUpdateStatus"
[01/Mar/2010:11:30:01 -0500] conn=793107 op=1 SRCH base="cn=mapping tree,cn=config" scope=2 filter="(|(objectClass=nsDS5ReplicationAgreement)(objectClass=LDAPReplica)(objectClass=nsDSWindowsReplicationAgreement))" attrs="nsds5replicaLastUpdateStatus"
[01/Mar/2010:11:40:01 -0500] conn=795270 op=1 SRCH base="cn=mapping tree,cn=config" scope=2 filter="(|(objectClass=nsDS5ReplicationAgreement)(objectClass=LDAPReplica)(objectClass=nsDSWindowsReplicationAgreement))" attrs="nsds5replicaLastUpdateStatus"
[01/Mar/2010:11:50:01 -0500] conn=797784 op=1 SRCH base="cn=mapping tree,cn=config" scope=2 filter="(|(objectClass=nsDS5ReplicationAgreement)(objectClass=LDAPReplica)(objectClass=nsDSWindowsReplicationAgreement))" attrs="nsds5replicaLastUpdateStatus"
[01/Mar/2010:12:00:01 -0500] conn=799580 op=1 SRCH base="cn=mapping tree,cn=config" scope=2 filter="(|(objectClass=nsDS5ReplicationAgreement)(objectClass=LDAPReplica)(objectClass=nsDSWindowsReplicationAgreement))" attrs="nsds5replicaLastUpdateStatus"
[01/Mar/2010:12:10:01 -0500] conn=802051 op=1 SRCH base="cn=mapping tree,cn=config" scope=2 filter="(|(objectClass=nsDS5ReplicationAgreement)(objectClass=LDAPReplica)(objectClass=nsDSWindowsReplicationAgreement))" attrs="nsds5replicaLastUpdateStatus"
[01/Mar/2010:12:20:01 -0500] conn=804250 op=1 SRCH base="cn=mapping tree,cn=config" scope=2 filter="(|(objectClass=nsDS5ReplicationAgreement)(objectClass=LDAPReplica)(objectClass=nsDSWindowsReplicationAgreement))" attrs="nsds5replicaLastUpdateStatus"
[01/Mar/2010:12:30:01 -0500] conn=806103 op=1 SRCH base="cn=mapping tree,cn=config" scope=2 filter="(|(objectClass=nsDS5ReplicationAgreement)(objectClass=LDAPReplica)(objectClass=nsDSWindowsReplicationAgreement))" attrs="nsds5replicaLastUpdateStatus"
[01/Mar/2010:12:40:01 -0500] conn=807729 op=1 SRCH base="cn=mapping tree,cn=config" scope=2 filter="(|(objectClass=nsDS5ReplicationAgreement)(objectClass=LDAPReplica)(objectClass=nsDSWindowsReplicationAgreement))" attrs="nsds5replicaLastUpdateStatus"
[01/Mar/2010:12:50:01 -0500] conn=809573 op=1 SRCH base="cn=mapping tree,cn=config" scope=2 filter="(|(objectClass=nsDS5ReplicationAgreement)(objectClass=LDAPReplica)(objectClass=nsDSWindowsReplicationAgreement))" attrs="nsds5replicaLastUpdateStatus"
[01/Mar/2010:13:00:01 -0500] conn=811233 op=1 SRCH base="cn=mapping tree,cn=config" scope=2 filter="(|(objectClass=nsDS5ReplicationAgreement)(objectClass=LDAPReplica)(objectClass=nsDSWindowsReplicationAgreement))" attrs="nsds5replicaLastUpdateStatus"
[01/Mar/2010:13:10:00 -0500] conn=812767 op=1 SRCH base="cn=mapping tree,cn=config" scope=2 filter="(|(objectClass=nsDS5ReplicationAgreement)(objectClass=LDAPReplica)(objectClass=nsDSWindowsReplicationAgreement))" attrs="nsds5replicaLastUpdateStatus"
[01/Mar/2010:13:20:01 -0500] conn=814186 op=1 SRCH base="cn=mapping tree,cn=config" scope=2 filter="(|(objectClass=nsDS5ReplicationAgreement)(objectClass=LDAPReplica)(objectClass=nsDSWindowsReplicationAgreement))" attrs="nsds5replicaLastUpdateStatus"
[01/Mar/2010:13:30:01 -0500] conn=815966 op=1 SRCH base="cn=mapping tree,cn=config" scope=2 filter="(|(objectClass=nsDS5ReplicationAgreement)(objectClass=LDAPReplica)(objectClass=nsDSWindowsReplicationAgreement))" attrs="nsds5replicaLastUpdateStatus"
[01/Mar/2010:13:40:01 -0500] conn=817770 op=1 SRCH base="cn=mapping tree,cn=config" scope=2 filter="(|(objectClass=nsDS5ReplicationAgreement)(objectClass=LDAPReplica)(objectClass=nsDSWindowsReplicationAgreement))" attrs="nsds5replicaLastUpdateStatus"
[01/Mar/2010:13:50:00 -0500] conn=819971 op=1 SRCH base="cn=mapping tree,cn=config" scope=2 filter="(|(objectClass=nsDS5ReplicationAgreement)(objectClass=LDAPReplica)(objectClass=nsDSWindowsReplicationAgreement))" attrs="nsds5replicaLastUpdateStatus"
[01/Mar/2010:14:00:00 -0500] conn=822228 op=1 SRCH base="cn=mapping tree,cn=config" scope=2 filter="(|(objectClass=nsDS5ReplicationAgreement)(objectClass=LDAPReplica)(objectClass=nsDSWindowsReplicationAgreement))" attrs="nsds5replicaLastUpdateStatus"
[01/Mar/2010:14:10:00 -0500] conn=824344 op=1 SRCH base="cn=mapping tree,cn=config" scope=2 filter="(|(objectClass=nsDS5ReplicationAgreement)(objectClass=LDAPReplica)(objectClass=nsDSWindowsReplicationAgreement))" attrs="nsds5replicaLastUpdateStatus"
[01/Mar/2010:14:20:00 -0500] conn=826592 op=1 SRCH base="cn=mapping tree,cn=config" scope=2 filter="(|(objectClass=nsDS5ReplicationAgreement)(objectClass=LDAPReplica)(objectClass=nsDSWindowsReplicationAgreement))" attrs="nsds5replicaLastUpdateStatus"
[01/Mar/2010:14:30:01 -0500] conn=828714 op=1 SRCH base="cn=mapping tree,cn=config" scope=2 filter="(|(objectClass=nsDS5ReplicationAgreement)(objectClass=LDAPReplica)(objectClass=nsDSWindowsReplicationAgreement))" attrs="nsds5replicaLastUpdateStatus"
[01/Mar/2010:14:40:01 -0500] conn=830837 op=1 SRCH base="cn=mapping tree,cn=config" scope=2 filter="(|(objectClass=nsDS5ReplicationAgreement)(objectClass=LDAPReplica)(objectClass=nsDSWindowsReplicationAgreement))" attrs="nsds5replicaLastUpdateStatus"
[01/Mar/2010:14:50:01 -0500] conn=832331 op=1 SRCH base="cn=mapping tree,cn=config" scope=2 filter="(|(objectClass=nsDS5ReplicationAgreement)(objectClass=LDAPReplica)(objectClass=nsDSWindowsReplicationAgreement))" attrs="nsds5replicaLastUpdateStatus"
[01/Mar/2010:15:00:01 -0500] conn=834058 op=1 SRCH base="cn=mapping tree,cn=config" scope=2 filter="(|(objectClass=nsDS5ReplicationAgreement)(objectClass=LDAPReplica)(objectClass=nsDSWindowsReplicationAgreement))" attrs="nsds5replicaLastUpdateStatus"
[01/Mar/2010:15:16:01 -0500] conn=836206 op=1 SRCH base="cn=mapping tree,cn=config" scope=2 filter="(|(objectClass=nsDS5ReplicationAgreement)(objectClass=LDAPReplica)(objectClass=nsDSWindowsReplicationAgreement))" attrs="nsds5replicaLastUpdateStatus"
                          
(and there are no more of these, as replication was broken at that point)

I agree with you when you say "I cannot find
any method to initiate the total update other than sending a modify request on
the agreement..."  
I dont know of any way either. But it sure does look like something like that is occurring...
Comment 45 Nathan Kinder 2010-03-04 17:46:13 EST
(In reply to comment #44)
> We load balance user access to the machines  but replication sessions are not.
> Replication agreements have specific hostnames (ex. neufchateau.iam.gatech.edu)
> in them, not the load balanced name (r.gted.gatech.edu)

Great.  That setup should work fine.  I've seen a case where one of the replication specific hostnames was accidentally added to the load balancer which really confused replication.

> This is a slave server (gets replications from a hub -bellar or briggs, and not
> from anyone else).
> 
> grep -i mapping access* is pretty huge, so below is output from this (let me
> know if you need more)
> 
> access.20100301-000054 
> 
> neufchateau.iam.gatech.edu: grep -i mapping access.20100301-000054

This looks like you are checking the access log on the consumer (neufcahteau).  You should be looking for a MOD to the replication agreement that points to neufchateau on one of the hubs (bellar and briggs).
Comment 46 John Bryson 2010-03-12 11:15:59 EST
Can you all recommend any changes that might give more information on this? For example, does it make sense to change from constant replication to scheduled replications to see if that makes the error less frequent?

Is there anything that you can suggest?
Comment 47 Rich Megginson 2010-07-15 16:18:03 EDT
On the replication consumer - in the access log - when you get the error - do you see any lines like
[07/Jul/2010:23:11:52 +0000] conn=1 op=-1 fd=1024 closed error 11 (Resource temporarily unavailable) - T1
?

have you explicitly set the idle timeout to something other than the default value?

grep -i idle /etc/dirsrv/slapd-instance/dse.ldif
?
Comment 48 John Bryson 2010-07-16 16:20:55 EDT
We dont seem to be setting any 'idle' string in dse.ldif on the consumers:
ex
zuber.iam.gatech.edu: grep -i idle /etc/dirsrv/slapd-zuber/*ldif*
 

Hmm. I would have to search to find the error. I havent had one in some weeks now. (We turned off host based firewalls between these servers a few weeks ago, and things are much better. In fact the only time that the error has recurred, the host firewalls had been turned back on accidentally...)

However, I can look back through logs when I get a chance.
Comment 49 Rich Megginson 2010-07-16 16:40:42 EDT
Created attachment 432492 [details]
0001-Bug-547503-replication-broken-again-with-389-MMR.patch
Comment 50 Rich Megginson 2010-07-16 16:45:06 EDT
Well, there goes my theory :P

But at least it is a good clue that it is firewall related.
Comment 51 Rich Megginson 2010-07-16 17:28:17 EDT
To ssh://git.fedorahosted.org/git/389/ds.git
   7b23290..9d638b3  389-ds-base-1.2.6 -> 389-ds-base-1.2.6
commit 9d638b3fc25fbc57884a511744943499c7102f40
Author: Rich Megginson <rmeggins@redhat.com>
Date:   Fri Jul 16 14:26:19 2010 -0600
To ssh://git.fedorahosted.org/git/389/ds.git
   2bfbe59..2a03eed  master -> master
commit 2a03eed6f1bb8d70d529b9248603e0449f07ec30
Author: Rich Megginson <rmeggins@redhat.com>
Date:   Fri Jul 16 14:26:19 2010 -0600
Comment 52 John Bryson 2010-09-27 11:37:36 EDT
(In reply to comment #51)
> To ssh://git.fedorahosted.org/git/389/ds.git
>    7b23290..9d638b3  389-ds-base-1.2.6 -> 389-ds-base-1.2.6
> commit 9d638b3fc25fbc57884a511744943499c7102f40
> Author: Rich Megginson <rmeggins@redhat.com>
> Date:   Fri Jul 16 14:26:19 2010 -0600
> To ssh://git.fedorahosted.org/git/389/ds.git
>    2bfbe59..2a03eed  master -> master
> commit 2a03eed6f1bb8d70d529b9248603e0449f07ec30
> Author: Rich Megginson <rmeggins@redhat.com>
> Date:   Fri Jul 16 14:26:19 2010 -0600

So, I can see the latest version references this bug (or it certainly sounds like this bug). We are thinking of upgrading. Can you confirm it is fixed or improved? (Im curious what the problem was too)
Thanks
JohnB
Comment 53 Rich Megginson 2010-09-27 12:33:45 EDT
(In reply to comment #52)
> (In reply to comment #51)
> > To ssh://git.fedorahosted.org/git/389/ds.git
> >    7b23290..9d638b3  389-ds-base-1.2.6 -> 389-ds-base-1.2.6
> > commit 9d638b3fc25fbc57884a511744943499c7102f40
> > Author: Rich Megginson <rmeggins@redhat.com>
> > Date:   Fri Jul 16 14:26:19 2010 -0600
> > To ssh://git.fedorahosted.org/git/389/ds.git
> >    2bfbe59..2a03eed  master -> master
> > commit 2a03eed6f1bb8d70d529b9248603e0449f07ec30
> > Author: Rich Megginson <rmeggins@redhat.com>
> > Date:   Fri Jul 16 14:26:19 2010 -0600
> 
> So, I can see the latest version references this bug (or it certainly sounds
> like this bug). We are thinking of upgrading. Can you confirm it is fixed or
> improved? (Im curious what the problem was too)
> Thanks
> JohnB

I can confirm that it is improved, in certain cases.

I cannot confirm that it is fixed - based on your last comment (https://bugzilla.redhat.com/show_bug.cgi?id=547503#c48) it would appear that you may be seeing a slightly different problem than the one I fixed.  I would appreciate it if you could test 1.2.6 to see if it makes a difference in the problem you are seeing.
Comment 56 Amita Sharma 2011-07-28 02:58:01 EDT
Based on comment#55, marking bug as VerifiedUpstream.