Description of problem:
We experienced a strange error today. It occurred in a fractional replication test on 64-bit Linux (nightcrawler):
  mmrepl fractional run 80% (4/5)

Error description:
The test environment consists of 2 masters {fractional0, fractional1} and 2 replicas {fractional2, fractional3}. First, the test imports an LDIF file into fractional0, then runs the consumer initializations one by one: fractional0_to_fractional1, fractional0_to_fractional2, fractional0_to_fractional3. All 3 ldapmodify operations succeeded, but the last consumer initialization, fractional0_to_fractional3, was never triggered on the master fractional0, as shown below.

fractional0: error log
[17/Nov/2006:05:34:02 -0800] NSMMReplicationPlugin - agmt="cn=fractional0_to_fractional3" (nightcrawler:33632): Replica has a different generation ID than the local data.
[17/Nov/2006:05:34:05 -0800] NSMMReplicationPlugin - Beginning total update of replica "agmt="cn=fractional0_to_fractional2" (nightcrawler:32257)".
[17/Nov/2006:05:34:10 -0800] NSMMReplicationPlugin - Beginning total update of replica "agmt="cn=fractional0_to_fractional1" (nightcrawler:31284)".
[17/Nov/2006:05:34:14 -0800] NSMMReplicationPlugin - Finished total update of replica "agmt="cn=fractional0_to_fractional2" (nightcrawler:32257)". Sent 161 entries.
[17/Nov/2006:05:34:17 -0800] NSMMReplicationPlugin - Finished total update of replica "agmt="cn=fractional0_to_fractional1" (nightcrawler:31284)". Sent 161 entries.
[17/Nov/2006:05:34:43 -0800] NSMMReplicationPlugin - agmt="cn=fractional0_to_fractional3" (nightcrawler:33632): Replica has a different generation ID than the local data.

And this is the access log of the problematic replica fractional3. (Note: the fractional3 error log does not contain any error messages, which is the biggest problem, I think...)
fractional3: access log
[17/Nov/2006:05:34:06 -0800] conn=5 fd=64 slot=64 connection from 10.14.0.77 to 10.14.0.77
[17/Nov/2006:05:34:06 -0800] conn=5 op=0 BIND dn="cn=replication manager,cn=config" method=128 version=3
[17/Nov/2006:05:34:06 -0800] conn=5 op=0 RESULT err=0 tag=97 nentries=0 etime=0 dn="cn=replication manager,cn=config"
[17/Nov/2006:05:34:06 -0800] conn=5 op=1 SRCH base="" scope=0 filter="(objectClass=*)" attrs="supportedControl supportedExtension"
[17/Nov/2006:05:34:06 -0800] conn=5 op=1 RESULT err=0 tag=101 nentries=1 etime=0
[17/Nov/2006:05:34:06 -0800] conn=5 op=2 SRCH base="" scope=0 filter="(objectClass=*)" attrs="supportedControl supportedExtension"
[17/Nov/2006:05:34:06 -0800] conn=5 op=2 RESULT err=0 tag=101 nentries=1 etime=0
[17/Nov/2006:05:34:06 -0800] conn=4 op=6 EXT oid="2.16.840.1.113730.3.5.5" name="Netscape Replication End Session"
[17/Nov/2006:05:34:06 -0800] conn=5 op=3 EXT oid="2.16.840.1.113730.3.5.3" name="Netscape Replication Start Session"
[17/Nov/2006:05:34:06 -0800] conn=5 op=3 RESULT err=0 tag=120 nentries=0 etime=0
[17/Nov/2006:05:34:07 -0800] conn=4 op=6 RESULT err=0 tag=120 nentries=0 etime=1
[17/Nov/2006:05:34:08 -0800] conn=4 op=7 UNBIND
[...]
[17/Nov/2006:05:35:06 -0800] conn=23 op=2 fd=67 closed - U1
[17/Nov/2006:05:35:07 -0800] conn=5 op=4 UNBIND
[17/Nov/2006:05:35:07 -0800] conn=5 op=4 fd=64 closed - U1

Comparing the fractional3 access log to the healthy replica fractional2's log, we can see that fractional3 never received the "Netscape Replication Total Update Entry" extended operation at all. On fractional3, connection 5 shows no activity after returning the result of "Netscape Replication Start Session"; after waiting about one minute, the connection gets closed.
fractional2: access log
[17/Nov/2006:05:34:04 -0800] conn=5 fd=64 slot=64 connection from 10.14.0.77 to 10.14.0.77
[17/Nov/2006:05:34:04 -0800] conn=5 op=0 BIND dn="cn=replication manager,cn=config" method=128 version=3
[17/Nov/2006:05:34:04 -0800] conn=5 op=0 RESULT err=0 tag=97 nentries=0 etime=0 dn="cn=replication manager,cn=config"
[17/Nov/2006:05:34:04 -0800] conn=5 op=1 SRCH base="" scope=0 filter="(objectClass=*)" attrs="supportedControl supportedExtension"
[17/Nov/2006:05:34:04 -0800] conn=5 op=1 RESULT err=0 tag=101 nentries=1 etime=0
[17/Nov/2006:05:34:04 -0800] conn=5 op=2 SRCH base="" scope=0 filter="(objectClass=*)" attrs="supportedControl supportedExtension"
[17/Nov/2006:05:34:04 -0800] conn=5 op=2 RESULT err=0 tag=101 nentries=1 etime=0
[17/Nov/2006:05:34:04 -0800] conn=5 op=3 EXT oid="2.16.840.1.113730.3.5.3" name="Netscape Replication Start Session"
[17/Nov/2006:05:34:04 -0800] conn=4 op=4 UNBIND
[17/Nov/2006:05:34:04 -0800] conn=4 op=4 fd=65 closed - U1
[17/Nov/2006:05:34:05 -0800] conn=5 op=3 RESULT err=0 tag=120 nentries=0 etime=1
[17/Nov/2006:05:34:06 -0800] conn=5 op=4 EXT oid="2.16.840.1.113730.3.5.6" name="Netscape Replication Total Update Entry"
[17/Nov/2006:05:34:06 -0800] conn=5 op=4 RESULT err=0 tag=120 nentries=0 etime=0
[...]

Following the code, most likely the master failed to acquire the replica here. The comment on lines 346-348 is very interesting: if acquire_replica fails, it does not retry. It jumps to "done" quietly.

320 static void
321 repl5_tot_run(Private_Repl_Protocol *prp)
322 {
[...]
343     /* acquire remote replica */
344     agmt_set_last_init_start(prp->agmt, current_time());
345     rc = acquire_replica (prp, REPL_NSDS50_TOTAL_PROTOCOL_OID, NULL /* ruv */);
346     /* We never retry total protocol, even in case a transient error.
347        This is because if somebody already updated the replica we don't
348        want to do it again */
349     if (rc != ACQUIRE_SUCCESS)
350     {
351         int optype, ldaprc;
352         conn_get_error(prp->conn, &optype, &ldaprc);
353         agmt_set_last_init_status(prp->agmt, ldaprc,
354             prp->last_acquire_response_code, NULL);
355         goto done;
356     }

Looking into acquire_replica: since we saw the OID "2.16.840.1.113730.3.5.3" in the access log, this conn_send_extended_operation must have completed successfully.

114 int
115 acquire_replica(Private_Repl_Protocol *prp, char *prot_oid, RUV **ruv)
116 {
[...]
220     crc = conn_send_extended_operation(conn,
221         REPL_START_NSDS50_REPLICATION_REQUEST_OID, payload,
            NULL /* update control */, NULL /* Message ID */);

But somehow, this conn_read_result_ex might have failed. This part is just my guess.

235     /* Since the operation request is async, we need to wait for the response here */
236     crc = conn_read_result_ex(conn, &retoid, &retdata, NULL, NULL, 1);

Another interesting observation: if acquire_replica fails, it calls conn_start_linger. Its default timeout is 60 seconds, which roughly matches the time at which conn=5 was closed. (The timestamps in the access log are not precise, but we can at least use them as a rough measure: the gap is about a minute, not a second or ten minutes.)

#define DEFAULT_LINGER_TIME (60)

Unfortunately, I could not narrow down the problem further, but given the absence of messages in the errors log, what probably happened in acquire_replica was this:

311     case NSDS50_REPL_REPLICA_BUSY:
312         /* Someone else is updating the replica. Try later. */
313         slapi_log_error(SLAPI_LOG_REPL, repl_plugin_name,
314             "%s: Unable to acquire replica: "
315             "the replica is currently being updated"
316             "by another supplier. Will try later\n",
317             agmt_get_long_name(prp->agmt));
318         return_value = ACQUIRE_REPLICA_BUSY;
319         break;

When the consumer initialization was called, the second master fractional1 was also trying to do an incremental update on the replica fractional3, so there could have been contention. To make this clearer, it would be nice if we could raise the above slapi_log_error level to SLAPI_LOG_FATAL, but only when acquire_replica is called from repl5_tot_run (consumer initialization). And I think we should report an error when acquire_replica fails in repl5_tot_run, too.

As Nathan pointed out, the fractional test case does not check the initialization result. We could add code to check the result and, if it failed, retry.

Response from Rich:
So, the problem is that one supplier has acquired the replica for an incremental update, which blocks out the other supplier from doing the total update, and returns BUSY? Yes, we should log an error at level SLAPI_LOG_FATAL in this case. Or perhaps we should retry (2 times? more?) and log the retries at the REPL level, and if all of the tries fail, then log at the FATAL level.

> As Nathan pointed out, the fractional test case does not check the
> initialization result. We could add the code to check the result and
> if it's found failed, we could retry.

Yes. This busy status should be in the repl monitoring attributes.
Since the CVS server is down, this is not a real CVS diff, but this is the change I'd like to make to log "replica busy" when acquire_replica is called from consumer initialization.

--- repl5_protocol_util.c       2006-11-10 13:22:34.000000000 -0800
+++ /export/src/ds72/ldapserver/ldap/servers/plugins/replication/repl5_protocol_util.c  2006-11-20 14:31:31.000000000 -0800
@@ -309,13 +309,27 @@
             return_value = ACQUIRE_FATAL_ERROR;
             break;
         case NSDS50_REPL_REPLICA_BUSY:
             /* Someone else is updating the replica. Try later. */
+            /* if acquire_replica is called for replica
+               initialization, log REPLICA_BUSY, too */
+            if (strcmp(REPL_NSDS50_TOTAL_PROTOCOL_OID,
+                       prot_oid) == 0)
+            {
+                slapi_log_error(SLAPI_LOG_FATAL, repl_plugin_name,
+                    "%s: Unable to acquire replica: "
+                    "the replica is currently being updated"
+                    "by another supplier.\n",
+                    agmt_get_long_name(prp->agmt));
+            }
+            else /* REPL_NSDS50_INCREMENTAL_PROTOCOL_OID */
+            {
             slapi_log_error(SLAPI_LOG_REPL, repl_plugin_name,
                 "%s: Unable to acquire replica: "
                 "the replica is currently being updated"
                 "by another supplier. Will try later\n",
                 agmt_get_long_name(prp->agmt));
+            }
             return_value = ACQUIRE_REPLICA_BUSY;
             break;
         case NSDS50_REPL_LEGACY_CONSUMER:
             /* remote replica is a legacy consumer */
Created attachment 141812 [details]
cvs commit message (repl5_protocol_util.c)

Reviewed by Rich (Thank you!!)
Checked into HEAD.
I thought I fixed the problem in TET, but it looks like the fix was incomplete. The good news is that the cause of the problem is now logged in the master's errors log:

Fractional-Repl-slapd-fractional0-errors:[19/Jan/2007:05:35:20 -0800] NSMMReplicationPlugin - agmt="cn=fractional0_to_fractional3" (nightcrawler:33000): Unable to acquire replica: the replica is currently being updatedby another supplier.

The fix I made in bug 216512 was, after calling consumer initialization, to check the value of nsds5replicalastinitstatus; if it is "replica busy", retry. I guess we are getting some other value. To find out, I'm adding code to print it out...

Thanks,
--noriko

Index: replib.ksh
===================================================================
RCS file: /cvs/dirsec/tetframework/Shared/DS/6.0/ksh/replib.ksh,v
retrieving revision 1.10
diff -t -w -U4 -r1.10 replib.ksh
--- replib.ksh  21 Nov 2006 02:01:51 -0000      1.10
+++ replib.ksh  19 Jan 2007 19:29:40 -0000
@@ -763,23 +763,24 @@
     fi
     # Check if could acquire replica or not.
     # If the result was "replica busy", retry.
-    $LDAPSEARCH \
+    initstatus=`$LDAPSEARCH \
         -h ${MYHOST} \
         -p ${MYPORT} \
         -D "${MYROOTDN}" \
         -w "${MYROOTDNPW}" \
         -s base \
         -b "cn=${MYAGRNAME},cn=replica,cn=\"${MYSFX}\",cn=mapping tree,cn=config" \
-        "objectclass=*" nsds5replicalastinitstatus | \
-        grep -i 'nsds5replicalastinitstatus: replica busy' >$DEVNULL 2>&1
+        "objectclass=*" nsds5replicalastinitstatus`
+    echo $initstatus | grep -i 'nsds5replicalastinitstatus: replica busy' >$DEVNULL 2>&1
     RC=$?
     if [ $RC -eq 0 ]; then
         i=`expr $i + 1`
         message "initialize_consumer: replica busy: retry $i time(s)"
     else
         # something else; no need to retry
+        message "initialize_consumer: status: $initstatus"
        i=5
     fi
 done
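In isolation, the replib.ksh retry loop above behaves like the following sketch. Here check_status is a hypothetical stand-in for the real ldapsearch query of nsds5replicalastinitstatus; it reports "replica busy" for the first two polls and a success status afterwards, so the loop retries twice and then prints the non-busy status it received.

```shell
#!/bin/sh
# Hypothetical stand-in for the ldapsearch query in replib.ksh:
# $1 is the number of retries made so far.
check_status() {
    if [ "$1" -lt 2 ]; then
        echo "nsds5replicalastinitstatus: replica busy"
    else
        echo "nsds5replicalastinitstatus: 0 Total update succeeded"
    fi
}

i=0
while [ "$i" -lt 5 ]; do
    initstatus=$(check_status "$i")
    if echo "$initstatus" | grep -i 'replica busy' >/dev/null 2>&1; then
        # busy: count the retry and go around again (up to 5 times)
        i=$((i + 1))
        echo "initialize_consumer: replica busy: retry $i time(s)"
    else
        # any other value ends the loop; print it so the test log
        # shows why the retry loop stopped
        echo "initialize_consumer: status: $initstatus"
        i=5
    fi
done
```

The point of capturing the full status string (rather than only grepping for "replica busy") is exactly the change in the diff above: when the loop stops for some other reason, the log now shows which status value was actually returned.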