Bug 283041 - MMR: Directory updates on same object
Status: CLOSED CURRENTRELEASE
Product: 389
Classification: Community
Component: Replication - General
Version: 1.0.4
Hardware: All
OS: All
Priority: medium
Severity: high
Assigned To: Rich Megginson
QA Contact: Viktor Ashirov
Depends On:
Blocks: 240316 FDS1.1.0
Reported: 2007-09-07 15:44 EDT by reinhard nappert
Modified: 2015-12-07 11:38 EST (History)
Doc Type: Bug Fix
Last Closed: 2015-12-07 11:38:16 EST

Attachments
stack trace (4.06 KB, text/plain), 2007-09-11 15:39 EDT, Rich Megginson
diffs (2.05 KB, patch), 2007-09-11 16:05 EDT, Rich Megginson
cvs commit log (216 bytes, text/plain), 2007-09-11 21:00 EDT, Rich Megginson

Description reinhard nappert 2007-09-07 15:44:10 EDT
I have a working multi-master replication (MMR) setup with two masters (Fedora
Directory Server 1.0.4). The setup works fine as long as I do not update the
same object via both masters. When the latter happens (application driven), one
of the masters crashes. The crashing server does not generate a core dump.

I posted the logs to the users mailing list, and Richard replied with the following comment:

"Thanks.  This is a very interesting test.  You are generating replication
conflicts:
[05/Sep/2007:13:15:40 -0400] conn=51 op=29 csn=46dee55f000200030000 - Naming
conflict ADD. Renamed existing entry to
nsuniqueid=99277847-1dd111b2-80dfcd7f-b7bc0000+ou=repltest

It looks as though you are repeatedly adding and deleting the same entry from
both servers at the same time, which should be fine.  Could you post your script
that you use to generate these entries? "

I sent the relevant pieces of my Java class:

"Richard, this is a Java class using JNDI.

The relevant methods are:
1. 
public InitialDirContext connect(String host, int port) throws NamingException
   {
       InitialDirContext context = null;
       Hashtable environment = new Hashtable();
       environment.put( Context.INITIAL_CONTEXT_FACTORY, "com.sun.jndi.ldap.LdapCtxFactory" );
       environment.put( "java.naming.ldap.version", "3" ); 
       environment.put(Context.SECURITY_PRINCIPAL, "cn=Directory Manager");
       environment.put(Context.SECURITY_CREDENTIALS, "xxxxxx");
       environment.put(Context.SECURITY_AUTHENTICATION, "simple");
       
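       // timeouts (note: these two properties belong to the JNDI DNS provider;
       // the LDAP provider's connect timeout is "com.sun.jndi.ldap.connect.timeout")
       environment.put( "com.sun.jndi.dns.timeout.initial", "2000" );
       environment.put( "com.sun.jndi.dns.timeout.retries", "3" );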

       environment.put( Context.PROVIDER_URL, "ldap://" + host + ":" + port + "/o=test" );

       context = new InitialDirContext( environment);
       System.out.println("Connected to " + host);
                
       return context;
       
   }

2.
public void addEntry(InitialDirContext ctx) {
      
      // Create attributes to be associated with the new context
         Attributes attrs = new BasicAttributes(true); // case-ignore
         Attribute objclass = new BasicAttribute("objectclass");
         objclass.add("top");
         objclass.add("organizationalUnit");
         attrs.put(objclass);
      
         // Create the context
         Context result;
         try {
            result = ctx.createSubcontext("ou=test", attrs);
         
            result.close();
         } catch (NameAlreadyBoundException e) {
            // ignore
            // just log it .......
         } catch (NamingException e) {
            e.printStackTrace();
            this.destroy();
         }
      }

3.
public void deleteEntry(InitialDirContext ctx) { 

         try {
            ctx.destroySubcontext("ou=test");
            //ctx.close();
         } catch (NameNotFoundException e) {
            // ignore
            // just log it .......
         } catch (NamingException e) {
            // TODO Auto-generated catch block
            e.printStackTrace();
            this.destroy();
         }
      }

4. Start of the thread:
public void start() {
          int counter = 0;
          
          for (int i = start; i < stop; i++) {
              try {
                 addEntry(ctx);
                 //....some kind of logging
                 this.sleep(100);
                 deleteEntry(ctx);
                 //....some kind of logging
                 this.sleep(50);
              } catch (Exception e) {
                 e.printStackTrace();
              }
              
          }
          //close context;
          try {
            ctx.close();
         } catch (NamingException e) {
            e.printStackTrace();
         }
      }     

Then, I just call this thread for my two masters (MasterOne and MasterTwo).

Of course, when I pause for a longer time between the add and delete, it takes
longer for the crash to occur."
Comment 1 Rich Megginson 2007-09-11 15:39:29 EDT
Created attachment 192921 [details]
stack trace

I was able to reproduce this with a python-ldap script.  Attached is the stack
trace.  The problem appears to be that in del_replconflict_attr(), the test

	if (slapi_entry_attr_find (entry, ATTR_NSDS5_REPLCONFLICT, &attr) == 0)

succeeds, so the attribute nsds5ReplConflict is present at that point.  However,
by the time the internal modify operation actually tries to remove this
attribute, it has already been removed.  This suggests a race condition.
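
The pattern described here is a check-then-act race. A minimal Java sketch of it follows; it is not the server's C code. The class ConflictEntry, its attribute map, the placeholder attribute value, and the "interleaved" hook (standing in for whatever a second thread does between the two steps) are all hypothetical names used only for illustration.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical stand-in for a directory entry; not the server's data structure.
class ConflictEntry {
    static final String ATTR = "nsds5ReplConflict";
    private final Map<String, String> attrs = new HashMap<>();

    ConflictEntry() {
        attrs.put(ATTR, "placeholder value"); // the real value does not matter here
    }

    synchronized boolean has(String name) {
        return attrs.containsKey(name);
    }

    // Returns true only if the attribute was still present when removed.
    synchronized boolean remove(String name) {
        return attrs.remove(name) != null;
    }

    // Racy variant: the check and the removal are two separate locked steps.
    // "interleaved" simulates another thread running between them.
    boolean checkThenRemove(Runnable interleaved) {
        if (!has(ATTR)) {
            return true; // nothing to remove
        }
        interleaved.run();
        return remove(ATTR); // false if someone else removed it first
    }

    // Atomic variant: check and removal happen under one lock.
    synchronized boolean checkAndRemoveAtomically() {
        return !attrs.containsKey(ATTR) || attrs.remove(ATTR) != null;
    }
}
```

In the racy variant, a competing removal between the two steps makes the final remove report that the attribute is already gone, which is the situation the stack trace points at. In the atomic variant there is no window for another thread to get in.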
Comment 2 Rich Megginson 2007-09-11 16:05:04 EDT
Created attachment 192941 [details]
diffs

The problem does appear to be concurrency.  I think the original intention of
the urp fixup code was that it should only be run inside the database lock, so
that the database could be restored to a consistent state before the next
operation was processed.  However, this requires the database code to know when
the database is already locked, so that if e.g. a modrdn operation needs to
call an internal delete, the database should not be locked again.  The flag
OP_FLAG_REPL_FIXUP is used to denote both that the operation is such an
internal operation, and that the database should not be locked again.

There are a couple of cases where these operations can be called from outside
of the database lock:
urp_fixup_rename_entry is called from multimaster_postop_modrdn and
multimaster_postop_delete, both of which are front end post op plugins, not
called from within the database lock.  Same with urp_fixup_delete_entry and
urp_fixup_modify_entry.  In other cases, such as urp_fixup_add_entry, and other
places where urp_fixup_rename_entry and urp_fixup_modify_entry are called, they
are called from a bepostop plugin function, which is called after the original
database operation has been processed, within the database lock.  So the
solution appears to be to move the urp_* functions to the bepostop plugin
functions.  One of these functions does an internal search -
urp_get_min_naming_conflict_entry - but it does not appear that search locks
the database, so there was nothing to be done to make it "reentrant".
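
The role of the flag can be illustrated with a small Java model. This is only a sketch under assumptions: BackendSketch, its methods, and the use of a semaphore as a non-reentrant database lock are hypothetical, and only the name OP_FLAG_REPL_FIXUP comes from the text above (its numeric value here is arbitrary).

```java
import java.util.concurrent.Semaphore;

// Hypothetical model of a backend with a single, non-reentrant database lock.
class BackendSketch {
    // Name taken from the comment above; the value is arbitrary.
    static final int OP_FLAG_REPL_FIXUP = 0x1;

    private final Semaphore dbLock = new Semaphore(1);

    // Deletes an entry. Without the fixup flag the operation takes the lock
    // itself; with the flag it assumes the caller already holds it.
    boolean delete(String dn, int flags) {
        boolean fixup = (flags & OP_FLAG_REPL_FIXUP) != 0;
        if (!fixup && !dbLock.tryAcquire()) {
            return false; // a blocking lock would deadlock against itself here
        }
        try {
            return true; // placeholder for the actual delete work
        } finally {
            if (!fixup) {
                dbLock.release();
            }
        }
    }

    // A rename that holds the database lock while running an internal delete,
    // the way a bepostop callback runs inside the original operation.
    boolean modrdnWithInternalDelete(String dn, int internalFlags) {
        dbLock.acquireUninterruptibly();
        try {
            return delete(dn, internalFlags);
        } finally {
            dbLock.release();
        }
    }
}
```

With the flag set, the internal delete runs inside the lock already held by the rename. Without it, the internal delete cannot get the lock. The converse is the hazard described above: a caller that passes the flag without actually holding the lock, as a front-end post-op plugin would, gets no mutual exclusion at all.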

Without this patch, I can crash the server in a matter of minutes (x86_64
rhel5) using the latest Fedora DS 1.1 code.  With the patch, the server runs
for several hours (maybe longer, I had to stop the test).

Also, to really exercise the urp code, I added a rename operation between the
add and delete e.g.
add("ou=test");
rename("ou=test", "ou=test2");
delete("ou=test2");
The server still runs for several hours with no problems.
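
For anyone reproducing this with the reporter's JNDI client rather than python-ldap, the add/rename/delete pass might look like the following sketch. RenameCycle and runOnce are hypothetical names; the context is assumed to be connected the same way as in the reporter's connect() method.

```java
import javax.naming.NamingException;
import javax.naming.directory.Attribute;
import javax.naming.directory.Attributes;
import javax.naming.directory.BasicAttribute;
import javax.naming.directory.BasicAttributes;
import javax.naming.directory.DirContext;

// Hypothetical helper; mirrors the reporter's addEntry/deleteEntry style.
class RenameCycle {
    // Runs one add/rename/delete pass on an already connected context.
    // Returns false if any step throws a NamingException.
    static boolean runOnce(DirContext ctx) {
        Attributes attrs = new BasicAttributes(true); // case-ignore attribute IDs
        Attribute objclass = new BasicAttribute("objectclass");
        objclass.add("top");
        objclass.add("organizationalUnit");
        attrs.put(objclass);
        try {
            DirContext created = ctx.createSubcontext("ou=test", attrs);
            if (created != null) {
                created.close();
            }
            // The JNDI LDAP provider sends this as a modify DN (modrdn) request.
            ctx.rename("ou=test", "ou=test2");
            ctx.destroySubcontext("ou=test2");
            return true;
        } catch (NamingException e) {
            e.printStackTrace();
            return false;
        }
    }
}
```

Calling runOnce in a loop from one thread per master, as in the reporter's start() method, would exercise the add, modrdn, and delete conflict paths together.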
Comment 3 Noriko Hosoi 2007-09-11 20:03:15 EDT
Looks good to me.

Comment 4 Rich Megginson 2007-09-11 21:00:35 EDT
Created attachment 193151 [details]
cvs commit log

Reviewed by: nhosoi (Thanks!)
Files: see diff
Branch: HEAD
Fix Description: https://bugzilla.redhat.com/show_bug.cgi?id=283041#c2
Platforms tested: RHEL5 x86_64
Flag Day: no
Doc impact: no
Comment 5 reinhard nappert 2007-09-21 12:43:44 EDT
I built Linux and Solaris packages. I can not reproduce the error anymore (mixed
system - Solaris/Linux - MMR setup).

Thanks,
-Reinhard
Comment 6 Rich Megginson 2007-09-21 12:49:45 EDT
(In reply to comment #5)
> I built Linux and Solaris packages. I can not reproduce the error anymore (mixed
> system - Solaris/Linux - MMR setup).
> 
> Thanks,
> -Reinhard

Great!  Thank you for confirming.  Please let us know if you see any other problems.
Comment 7 reinhard nappert 2007-09-21 12:53:50 EDT
You can bet on it :)
