I have a working multi-master replication setup with two masters (Fedora Directory Server 1.0.4). The setup works fine as long as I do not update the same object via both masters. When the latter happens (application driven), one of the masters crashes. The server does not generate a core dump. I included the logs in the user mailing list and Richard had the following comment:

"Thanks. This is a very interesting test. You are generating replication conflicts:

[05/Sep/2007:13:15:40 -0400] conn=51 op=29 csn=46dee55f000200030000 - Naming conflict ADD. Renamed existing entry to nsuniqueid=99277847-1dd111b2-80dfcd7f-b7bc0000+ou=repltest

It looks as though you are repeatedly adding and deleting the same entry from both servers at the same time, which should be fine. Could you post your script that you use to generate these entries?"

Here are the important pieces of my Java class:

"Richard, this is a Java class, using JNDI. The relevant methods are:

1.
public InitialDirContext connect(String host, int port) throws NamingException {
    InitialDirContext context = null;
    Hashtable environment = new Hashtable();
    environment.put(Context.INITIAL_CONTEXT_FACTORY, "com.sun.jndi.ldap.LdapCtxFactory");
    environment.put("java.naming.ldap.version", "3");
    environment.put(Context.SECURITY_PRINCIPAL, "cn=Directory Manager");
    environment.put(Context.SECURITY_CREDENTIALS, "xxxxxx");
    environment.put(Context.SECURITY_AUTHENTICATION, "simple");
    // timeouts
    environment.put("com.sun.jndi.dns.timeout.initial", "2000");
    environment.put("com.sun.jndi.dns.timeout.retries", "3");
    environment.put(Context.PROVIDER_URL, "ldap://" + host + ":" + port + "/o=test");
    context = new InitialDirContext(environment);
    System.out.println("Connected to " + host);
    return context;
}

2.
public void addEntry(InitialDirContext ctx) {
    // Create attributes to be associated with the new context
    Attributes attrs = new BasicAttributes(true); // case-ignore
    Attribute objclass = new BasicAttribute("objectclass");
    objclass.add("top");
    objclass.add("organizationalUnit");
    attrs.put(objclass);
    // Create the context
    Context result;
    try {
        result = ctx.createSubcontext("ou=test", attrs);
        result.close();
    } catch (NameAlreadyBoundException e) {
        // ignore, just log it .......
    } catch (NamingException e) {
        e.printStackTrace();
        this.destroy();
    }
}

3.
public void deleteEntry(InitialDirContext ctx) {
    try {
        ctx.destroySubcontext("ou=test");
        //ctx.close();
    } catch (NameNotFoundException e) {
        // ignore, just log it .......
    } catch (NamingException e) {
        e.printStackTrace();
        this.destroy();
    }
}

4. Start of the thread:
public void start() {
    for (int i = start; i < stop; i++) {
        try {
            addEntry(ctx);
            //....some kind of logging
            this.sleep(100);
            deleteEntry(ctx);
            //....some kind of logging
            this.sleep(50);
        } catch (Exception e) {
            e.printStackTrace();
        }
    }
    // close context
    try {
        ctx.close();
    } catch (NamingException e) {
        e.printStackTrace();
    }
}

Then I just call this thread for both of my masters (MasterOne and MasterTwo). Of course, when I pause for a longer time between the add and the delete, it takes longer for the crash to happen."
Created attachment 192921 [details] stack trace I was able to reproduce this with a python-ldap script. Attached is the stack trace. The problem appears to be that in del_replconflict_attr(), the test if (slapi_entry_attr_find (entry, ATTR_NSDS5_REPLCONFLICT, &attr) == 0) succeeds, so the attribute nsds5ReplConflict is present. However, by the time the internal modify operation goes to actually remove this attribute, it has been removed. This suggests a race condition.
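The failure mode described here is a classic check-then-act race: the presence test in del_replconflict_attr() succeeds, but the attribute is removed by a competing operation before the internal modify runs. A minimal, deterministic Python sketch of the same pattern (the names here are illustrative stand-ins, not the server's actual C code) uses events to force the unlucky interleaving:

```python
import threading

# Stand-in for the entry's attribute list.
entry = {"nsds5ReplConflict": "namingConflict ..."}
a_checked = threading.Event()
b_deleted = threading.Event()
result = {}

def fixup_thread():
    # Like del_replconflict_attr(): first test that the attribute is present...
    if "nsds5ReplConflict" in entry:
        a_checked.set()
        b_deleted.wait()  # preempted here; the competing operation runs in the gap
        try:
            del entry["nsds5ReplConflict"]  # ...then try to remove it
            result["outcome"] = "removed"
        except KeyError:
            result["outcome"] = "KeyError"  # the attribute vanished under us

def competing_thread():
    a_checked.wait()
    del entry["nsds5ReplConflict"]  # the competing internal modify wins the race
    b_deleted.set()

ta = threading.Thread(target=fixup_thread)
tb = threading.Thread(target=competing_thread)
ta.start(); tb.start()
ta.join(); tb.join()
print(result["outcome"])  # → KeyError
```

In the real server the consequence is not a clean exception but a crash, since the C code assumes the attribute found by the check is still there when the modify executes.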
Created attachment 192941 [details] diffs

The problem does appear to be concurrency. I think the original intention of the urp fixup code was that it should only be run inside the database lock, so that the database could be restored to a consistent state before the next operation was processed. However, this requires the database code to know when the database is already locked, so that if e.g. a modrdn operation needs to call an internal delete, the database is not locked again. The flag OP_FLAG_REPL_FIXUP is used to denote both that the operation is such an internal operation and that the database should not be locked again.

There are a couple of cases where these operations can be called from outside of the database lock: urp_fixup_rename_entry is called from multimaster_postop_modrdn and multimaster_postop_delete, both of which are front-end postop plugins, not called from within the database lock. The same holds for urp_fixup_delete_entry and urp_fixup_modify_entry. In other cases, such as urp_fixup_add_entry, and in the other places where urp_fixup_rename_entry and urp_fixup_modify_entry are called, they are called from a bepostop plugin function, which runs after the original database operation has been processed, within the database lock.

So the solution appears to be to move the urp_* calls into the bepostop plugin functions. One of these functions, urp_get_min_naming_conflict_entry, does an internal search, but that search does not appear to lock the database, so nothing had to be done to make it "reentrant".

Without this patch, I can crash the server in a matter of minutes (x86_64 RHEL5) using the latest Fedora DS 1.1 code. With the patch, the server runs for several hours (maybe longer; I had to stop the test). Also, to really exercise the urp code, I added a rename operation between the add and the delete, e.g.

add("ou=test");
rename("ou=test", "ou=test2");
delete("ou=test2");

The server still runs for several hours with no problems.
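The essence of the fix is that the presence check and the removal must be atomic with respect to competing operations, which is exactly what running the fixup from the bepostop functions (inside the database lock) provides. A toy Python model of the repaired behavior (illustrative names only, not the server's code) guards both steps with one lock:

```python
import threading

entry = {"nsds5ReplConflict": "namingConflict ..."}
db_lock = threading.Lock()  # stands in for the backend database lock

def del_conflict_attr():
    # Check and remove as one atomic step under the lock, mirroring how the
    # bepostop fixup now runs while the database lock is still held.
    with db_lock:
        if "nsds5ReplConflict" in entry:
            del entry["nsds5ReplConflict"]
            return "removed"
        return "absent"

# Two competing operations: exactly one removes the attribute, the other
# cleanly observes that it is already gone -- no crash-inducing race.
results = []
threads = [threading.Thread(target=lambda: results.append(del_conflict_attr()))
           for _ in range(2)]
for t in threads:
    t.start()
for t in threads:
    t.join()
print(sorted(results))  # → ['absent', 'removed']
```

Whichever thread acquires the lock second simply finds the attribute absent, instead of acting on a stale presence check.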
Looks good to me.
Created attachment 193151 [details] cvs commit log Reviewed by: nhosoi (Thanks!) Files: see diff Branch: HEAD Fix Description: https://bugzilla.redhat.com/show_bug.cgi?id=283041#c2 Platforms tested: RHEL5 x86_64 Flag Day: no Doc impact: no
I built Linux and Solaris packages. I can not reproduce the error anymore (mixed system - Solaris/Linux - MMR setup). Thanks, -Reinhard
(In reply to comment #5) > I built Linux and Solaris packages. I can not reproduce the error anymore (mixed > system - Solaris/Linux - MMR setup). > > Thanks, > -Reinhard Great! Thank you for confirming. Please let us know if you see any other problems.
You can bet on it :)