Bug 216512 - replica initialization: random failure
Summary: replica initialization: random failure
Keywords:
Status: CLOSED CURRENTRELEASE
Alias: None
Product: 389
Classification: Retired
Component: Replication - General
Version: 1.0.4
Hardware: All
OS: Linux
medium
medium
Target Milestone: ---
Assignee: Noriko Hosoi
QA Contact: Viktor Ashirov
URL:
Whiteboard:
Depends On:
Blocks: 152373 240316 FDS1.1.0
TreeView+ depends on / blocked
 
Reported: 2006-11-20 21:34 UTC by Noriko Hosoi
Modified: 2015-12-07 16:55 UTC (History)
1 user (show)

Fixed In Version:
Doc Type: Bug Fix
Doc Text:
Clone Of:
Environment:
Last Closed: 2015-12-07 16:55:14 UTC
Embargoed:


Attachments (Terms of Use)
cvs commit message (repl5_protocol_util.c) (556 bytes, text/plain)
2006-11-21 19:15 UTC, Noriko Hosoi
no flags Details

Description Noriko Hosoi 2006-11-20 21:34:40 UTC
Description of problem:
We experienced a strange error today.  It's a fractional replication test on
64-bit Linux (nightcrawler):

mmrepl fractional run 	  	80% (4/5) 	 

Error description
The test env is 2-masters {fractional0, fractional1} and 2-replicas
{fractional2, fractional3}. 
First, the test imports an ldif to fractional0, then run the consumer
initialization one by one: fractional0_to_fractional1,
fractional0_to_fractional2, fractional0_to_fractional3.
The 3 ldapmodify operations were successful, but the last
fractional0_to_fractional3 consumer initialization was not triggered on the
master fractional0 as follows. 

    fractional0: error log
    [17/Nov/2006:05:34:02 -0800] NSMMReplicationPlugin -
agmt="cn=fractional0_to_fractional3" (nightcrawler:33632): Replica has a
different generation ID than the local data.
    [17/Nov/2006:05:34:05 -0800] NSMMReplicationPlugin - Beginning total update
of replica "agmt="cn=fractional0_to_fractional2" (nightcrawler:32257)".
    [17/Nov/2006:05:34:10 -0800] NSMMReplicationPlugin - Beginning total update
of replica "agmt="cn=fractional0_to_fractional1" (nightcrawler:31284)".
    [17/Nov/2006:05:34:14 -0800] NSMMReplicationPlugin - Finished total update
of replica "agmt="cn=fractional0_to_fractional2" (nightcrawler:32257)". Sent 161
entries.
    [17/Nov/2006:05:34:17 -0800] NSMMReplicationPlugin - Finished total update
of replica "agmt="cn=fractional0_to_fractional1" (nightcrawler:31284)". Sent 161
entries.
    [17/Nov/2006:05:34:43 -0800] NSMMReplicationPlugin -
agmt="cn=fractional0_to_fractional3" (nightcrawler:33632): Replica has a
different generation ID than the local data.

And this is the access log of the problematic replica fractional3.  (note:
fractional3 error log does not have any error messages, which is the biggest
problem, I think...)

    fractional3: access log
    [17/Nov/2006:05:34:06 -0800] conn=5 fd=64 slot=64 connection from 10.14.0.77
to 10.14.0.77
    [17/Nov/2006:05:34:06 -0800] conn=5 op=0 BIND dn="cn=replication
manager,cn=config" method=128 version=3
    [17/Nov/2006:05:34:06 -0800] conn=5 op=0 RESULT err=0 tag=97 nentries=0
etime=0 dn="cn=replication manager,cn=config"
    [17/Nov/2006:05:34:06 -0800] conn=5 op=1 SRCH base="" scope=0
filter="(objectClass=*)" attrs="supportedControl supportedExtension"
    [17/Nov/2006:05:34:06 -0800] conn=5 op=1 RESULT err=0 tag=101 nentries=1 etime=0
    [17/Nov/2006:05:34:06 -0800] conn=5 op=2 SRCH base="" scope=0
filter="(objectClass=*)" attrs="supportedControl supportedExtension"
    [17/Nov/2006:05:34:06 -0800] conn=5 op=2 RESULT err=0 tag=101 nentries=1 etime=0
    [17/Nov/2006:05:34:06 -0800] conn=4 op=6 EXT oid="2.16.840.1.113730.3.5.5"
name="Netscape Replication End Session"
    [17/Nov/2006:05:34:06 -0800] conn=5 op=3 EXT oid="2.16.840.1.113730.3.5.3"
name="Netscape Replication Start Session"
    [17/Nov/2006:05:34:06 -0800] conn=5 op=3 RESULT err=0 tag=120 nentries=0 etime=0
    [17/Nov/2006:05:34:07 -0800] conn=4 op=6 RESULT err=0 tag=120 nentries=0 etime=1
    [17/Nov/2006:05:34:08 -0800] conn=4 op=7 UNBIND
        [...]
    [17/Nov/2006:05:35:06 -0800] conn=23 op=2 fd=67 closed - U1
    [17/Nov/2006:05:35:07 -0800] conn=5 op=4 UNBIND
    [17/Nov/2006:05:35:07 -0800] conn=5 op=4 fd=64 closed - U1

Comparing the fractional3 access log to the healthy replica fractional2's log,
we could see fractional3 did not receive "Netscape Replication Total Update
Entry" extension at all.   On fractional3, the connection 5 see no activities
after returning the result of "Netscape Replication start Session", wait for
about one min. and the connection gets closed.

    fractional2: access log
    [17/Nov/2006:05:34:04 -0800] conn=5 fd=64 slot=64 connection from 10.14.0.77
to10.14.0.77
    [17/Nov/2006:05:34:04 -0800] conn=5 op=0 BIND dn="cn=replication
manager,cn=config" method=128 version=3
    [17/Nov/2006:05:34:04 -0800] conn=5 op=0 RESULT err=0 tag=97 nentries=0
etime=0 dn="cn=replication manager,cn=config"
    [17/Nov/2006:05:34:04 -0800] conn=5 op=1 SRCH base="" scope=0
filter="(objectClass=*)" attrs="supportedControl supportedExtension"
    [17/Nov/2006:05:34:04 -0800] conn=5 op=1 RESULT err=0 tag=101 nentries=1 etime=0
    [17/Nov/2006:05:34:04 -0800] conn=5 op=2 SRCH base="" scope=0
filter="(objectClass=*)" attrs="supportedControl supportedExtension"
    [17/Nov/2006:05:34:04 -0800] conn=5 op=2 RESULT err=0 tag=101 nentries=1 etime=0
    [17/Nov/2006:05:34:04 -0800] conn=5 op=3 EXT oid="2.16.840.1.113730.3.5.3"
name="Netscape Replication Start Session"
    [17/Nov/2006:05:34:04 -0800] conn=4 op=4 UNBIND
    [17/Nov/2006:05:34:04 -0800] conn=4 op=4 fd=65 closed - U1
    [17/Nov/2006:05:34:05 -0800] conn=5 op=3 RESULT err=0 tag=120 nentries=0 etime=1
    [17/Nov/2006:05:34:06 -0800] conn=5 op=4 EXT oid="2.16.840.1.113730.3.5.6"
name="Netscape Replication Total Update Entry"
    [17/Nov/2006:05:34:06 -0800] conn=5 op=4 RESULT err=0 tag=120 nentries=0 etime=0
        [...]

Following the code, most likely, master failed to acquire replica here.  The
comment on line 346-348 is very interesting.  If the aquire_replica fails, it
does not retry.  And it jumps to "done" quietly.
    320 static void
    321 repl5_tot_run(Private_Repl_Protocol *prp)
    322 {
    [...]
    343     /* acquire remote replica */
    344     agmt_set_last_init_start(prp->agmt, current_time());
    345     rc = acquire_replica (prp, REPL_NSDS50_TOTAL_PROTOCOL_OID, NULL /* r
       uv */);
    346     /* We never retry total protocol, even in case a transient error.
    347        This is because if somebody already updated the replica we don't
    348        want to do it again */
    349     if (rc != ACQUIRE_SUCCESS)
    350     {
    351         int optype, ldaprc;
    352         conn_get_error(prp->conn, &optype, &ldaprc);
    353         agmt_set_last_init_status(prp->agmt, ldaprc,
    354                   prp->last_acquire_response_code, NULL);
    355         goto done;
    356     }

Looking into acquire_replica, since we saw the OID "2.16.840.1.113730.3.5.3" in
the access log, this conn_send_extended_operation should have been successfully
done.
    114 int
    115 acquire_replica(Private_Repl_Protocol *prp, char *prot_oid, RUV **ruv)
    116 {
    [...]
    220                     crc = conn_send_extended_operation(conn,
    221                         REPL_START_NSDS50_REPLICATION_REQUEST_OID,
payload,  NULL /* update control */, NULL /* Message ID */);

But somehow, this conn_read_result_ex might have failed.  This part is just my
guess.
    235                     /* Since the operation request is async, we need to
wait for the response here */
    236                     crc =
conn_read_result_ex(conn,&retoid,&retdata,NULL,NULL,1);

Another interesting observation is if acquire_replica fails, it calls
conn_start_linger.  Its default timeout length is 60 sec, which roughly matches
the conn=5 closed time (the timestamp of the access log is not accurate at all,
but at least we could use it as a rough measure.  It's a min., not a sec. or 10
min...)

    #define DEFAULT_LINGER_TIME (60)

Unfortunately, I could not narrow down the problem further, but thinking of the
"no errors" in the errors log, probably, what happened in acquire_replica was this:
    311                             case NSDS50_REPL_REPLICA_BUSY:
    312                                 /* Someone else is updating the replica.
Try later. */
    313                                 slapi_log_error(SLAPI_LOG_REPL,
repl_plugin_name,
    314                                     "%s: Unable to acquire replica: "
    315                                     "the replica is currently being updated"
    316                                     "by another supplier. Will try later\n",
    317                                     agmt_get_long_name(prp->agmt)); 
    318                                 return_value = ACQUIRE_REPLICA_BUSY;
    319                                 break;

When the consumer initialization was called, the second master fractional1 was
also trying to do incremental-update on the replica fractional3, and there could
be a contention.  To make it clearer, it would be nice if we could change the
above slapi_log_error level SLAPI_LOG_FATAL only when acquire_replica is called
from repl5_tot_run (consumer initialization)...  And I think we should report an
error when acqure_replica fails in repl5_tot_run, too.

As Nathan pointed out, the fractional test case does not check the
initialization result.  We could add the code to check the result and if it's
found failed, we could retry.

Response from Rich:
So, the problem is that one supplier has acquired the replica for an incremental
update, which blocks out the other supplier from doing the total update, and
returns BUSY?
Yes, we should log an error at level SLAPI_LOG_FATAL in this case.  Or perhaps
we should retry (2 times?  More?) and log the retries at the REPL level, and if
all of the tries fail, then log at the FATAL level.
>
> As Nathan pointed out, the fractional test case does not check the
initialization result.  We could add the code to check the result and if it's
found failed, we could retry.
Yes.  This busy status should be in the repl monitoring attributes.

Comment 1 Noriko Hosoi 2006-11-21 00:25:58 UTC
Since CVS server is down, this is not a real CVS diff, but this is the change
I'd like to make to log "replica busy" when acquire_replica is called from
consumer initialization.

--- repl5_protocol_util.c   2006-11-10 13:22:34.000000000 -0800
+++
/export/src/ds72/ldapserver/ldap/servers/plugins/replication/repl5_protocol_util.c
 2006-11-20 14:31:31.000000000 -0800
@@ -309,13 +309,27 @@
                 return_value = ACQUIRE_FATAL_ERROR; 
                 break;
               case NSDS50_REPL_REPLICA_BUSY:
                 /* Someone else is updating the replica. Try later. */ 
+                /* if acquire_replica is called for replica
+                   initialization, log REPLICA_BUSY, too */ 
+                if (strcmp(REPL_NSDS50_TOTAL_PROTOCOL_OID,
+                       prot_oid) == 0) +                {
+                  slapi_log_error(SLAPI_LOG_FATAL, repl_plugin_name, 
+                    "%s: Unable to acquire replica: "
+                    "the replica is currently being updated" 
+                    "by another supplier.\n",
+                    agmt_get_long_name(prp->agmt)); 
+                }
+                else /* REPL_NSDS50_INCREMENTAL_PROTOCOL_OID */
+                {
                   slapi_log_error(SLAPI_LOG_REPL, repl_plugin_name,
                   "%s: Unable to acquire replica: "
                   "the replica is currently being updated"
                   "by another supplier. Will try later\n",
                   agmt_get_long_name(prp->agmt));
+                }
                 return_value = ACQUIRE_REPLICA_BUSY;
                 break;
               case NSDS50_REPL_LEGACY_CONSUMER:
                 /* remote replica is a legacy consumer */


Comment 7 Noriko Hosoi 2006-11-21 19:15:39 UTC
Created attachment 141812 [details]
cvs commit message (repl5_protocol_util.c)

Reviewed by Rich (Thank you!!)

Checked in into HEAD.

Comment 8 Noriko Hosoi 2007-01-19 19:39:06 UTC
I thought I fixed the problem in TET, but it looks it was incomplete..

Good news is now the cause of the problem is log'ed in the master's errors log:

Fractional-Repl-slapd-fractional0-errors:[19/Jan/2007:05:35:20 -0800]
NSMMReplicationPlugin - agmt="cn=fractional0_to_fractional3"
(nightcrawler:33000): Unable to acquire replica: the replica is currently being
updatedby another supplier.

The fix I made in the bug 216512 was after calling consumer initializaion, check
the value of nsds5replicalastinitstatus.  If it is "replica busy", it retries. 
I guess we are getting some other value.  To find it out, I'm adding the code to
print it out...
Thanks,
--noriko

Index: replib.ksh
===================================================================
RCS file: /cvs/dirsec/tetframework/Shared/DS/6.0/ksh/replib.ksh,v
retrieving revision 1.10
diff -t -w -U4 -r1.10 replib.ksh
--- replib.ksh  21 Nov 2006 02:01:51 -0000      1.10
+++ replib.ksh  19 Jan 2007 19:29:40 -0000
@@ -763,23 +763,24 @@
   fi

   # Check if could acquire replica or not.
   # If the result was "replica busy", retry.
-  $LDAPSEARCH \
+  initstatus=`$LDAPSEARCH \
         -h ${MYHOST} \
         -p ${MYPORT} \
         -D "${MYROOTDN}" \
         -w "${MYROOTDNPW}" \
         -s base \
         -b "cn=${MYAGRNAME},cn=replica,cn=\"${MYSFX}\",cn=mapping tree,cn=config" \
-        "objectclass=*" nsds5replicalastinitstatus | \
-    grep -i 'nsds5replicalastinitstatus: replica busy' >$DEVNULL 2>&1
+        "objectclass=*" nsds5replicalastinitstatus`
+    echo $initstatus | grep -i 'nsds5replicalastinitstatus: replica busy'
>$DEVNULL 2>&1
   RC=$?
   if [ $RC -eq 0 ]; then
     i=`expr $i + 1`
     message "initialize_consumer: replica busy: retry $i time(s)"
   else
     # something else; no need to retry
+    message "initialize_consumer: status: $initstatus"
     i=5
   fi
 done



Note You need to log in before you can comment on or make changes to this bug.