Bug 283041
| Summary: | MMR: Directory updates on same object |
|---|---|
| Product: | [Retired] 389 |
| Component: | Replication - General |
| Version: | 1.0.4 |
| Status: | CLOSED CURRENTRELEASE |
| Severity: | high |
| Priority: | medium |
| Reporter: | reinhard nappert <rnappert> |
| Assignee: | Rich Megginson <rmeggins> |
| QA Contact: | Viktor Ashirov <vashirov> |
| Hardware: | All |
| OS: | All |
| Doc Type: | Bug Fix |
| Last Closed: | 2015-12-07 16:38:16 UTC |
| Bug Blocks: | 240316, 427409 |
Created attachment 192921 [details]
stack trace
I was able to reproduce this with a python-ldap script. Attached is the stack trace. The problem appears to be that in del_replconflict_attr(), the test

```
if (slapi_entry_attr_find (entry, ATTR_NSDS5_REPLCONFLICT, &attr) == 0)
```

succeeds, so the attribute nsds5ReplConflict is present. However, by the time the internal modify operation goes to actually remove this attribute, it has already been removed. This suggests a race condition.
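The failure described above is a classic check-then-act race: the attribute exists at the time of the check, but another operation removes it before the modify runs. A minimal Python sketch of the pattern (all names hypothetical; the real server code is C inside ns-slapd):

```python
import threading

# Hypothetical stand-in for an entry's attribute map.
entry = {"nsds5ReplConflict": "namingConflict ou=repltest"}
db_lock = threading.Lock()

def del_replconflict_attr_unsafe():
    # Check: attribute present (mirrors slapi_entry_attr_find() == 0) ...
    if "nsds5ReplConflict" in entry:
        # ... but a concurrent operation may delete it before we act.
        try:
            del entry["nsds5ReplConflict"]
            return "deleted"
        except KeyError:
            return "race lost"  # in the server, this window led to a crash
    return "absent"

def del_replconflict_attr_safe():
    # Fix pattern: check and act atomically, under one lock.
    with db_lock:
        if "nsds5ReplConflict" in entry:
            del entry["nsds5ReplConflict"]
            return "deleted"
        return "absent"
```

The locked variant closes the window between the check and the removal, which is the essence of the fix discussed below.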
Created attachment 192941 [details]
diffs
The problem does appear to be concurrency. I think the original intention of
the urp fixup code was that it should only be run inside the database lock, so
that the database could be restored to a consistent state before the next
operation was processed. However, this requires the database code to know when
the database is already locked, so that if e.g. a modrdn operation needs to
call an internal delete, the database should not be locked again. The flag
OP_FLAG_REPL_FIXUP is used to denote both that the operation is such an
internal operation, and that the database should not be locked again.
There are a couple of cases where these operations can be called from outside
of the database lock:
urp_fixup_rename_entry is called from multimaster_postop_modrdn and
multimaster_postop_delete, both of which are front end post op plugins, not
called from within the database lock. Same with urp_fixup_delete_entry and
urp_fixup_modify_entry. In other cases, such as urp_fixup_add_entry, and other
places where urp_fixup_rename_entry and urp_fixup_modify_entry are called, they
are called from a bepostop plugin function, which is called after the original
database operation has been processed, within the database lock. So the
solution appears to be to move the urp_* functions to the bepostop plugin
functions. One of these functions does an internal search -
urp_get_min_naming_conflict_entry - but it does not appear that search locks
the database, so there was nothing to be done to make it "reentrant".
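The OP_FLAG_REPL_FIXUP convention described above can be sketched as follows: an internal operation carries a flag telling the backend "the caller already holds the database lock, do not lock again." This is a hedged Python illustration of that pattern only (names and the flag value are hypothetical; the server implements this in C):

```python
import threading

db_lock = threading.Lock()
OP_FLAG_REPL_FIXUP = 0x1  # hypothetical value; marks an internal fixup op

def backend_operation(op_flags, work):
    """Run `work` under the database lock, unless the caller signals
    (via OP_FLAG_REPL_FIXUP) that it already holds the lock."""
    if op_flags & OP_FLAG_REPL_FIXUP:
        return work()      # already inside the lock: do not relock
    with db_lock:
        return work()

def bepostop_urp_fixup():
    # A bepostop plugin runs while the database lock is held, so any
    # internal operation it issues must pass the fixup flag to avoid
    # deadlocking on a second acquisition of the same lock.
    return backend_operation(OP_FLAG_REPL_FIXUP, lambda: "fixed up")
```

The bug arose when fixup operations were issued from frontend postop plugins, i.e. outside the lock, so the flag's "do not relock" promise no longer matched reality; moving the calls into bepostop functions restores the invariant.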
Without this patch, I can crash the server in a matter of minutes (x86_64
rhel5) using the latest Fedora DS 1.1 code. With the patch, the server runs
for several hours (maybe longer, I had to stop the test).
Also, to really exercise the urp code, I added a rename operation between the add and delete, e.g.:

```
add("ou=test");
rename("ou=test", "ou=test2");
delete("ou=test2");
```
The server still runs for several hours with no problems.
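The stress test amounts to two clients racing the same add/rename/delete cycle on one DN. A rough Python sketch of that load pattern against a thread-safe in-memory stand-in (a real test would use python-ldap or JNDI against two live masters; the dict here only models the access pattern, not LDAP semantics):

```python
import threading

directory = {}
dir_lock = threading.Lock()

def add(dn):
    with dir_lock:
        directory.setdefault(dn, {"objectclass": ["top", "organizationalUnit"]})

def rename(old_dn, new_dn):
    with dir_lock:
        if old_dn in directory:
            directory[new_dn] = directory.pop(old_dn)

def delete(dn):
    with dir_lock:
        directory.pop(dn, None)  # ignore "no such object", as the test script does

def client_loop(iterations):
    # Each "master" runs the same add/rename/delete cycle on the same DN.
    for _ in range(iterations):
        add("ou=test")
        rename("ou=test", "ou=test2")
        delete("ou=test2")

threads = [threading.Thread(target=client_loop, args=(1000,)) for _ in range(2)]
for t in threads:
    t.start()
for t in threads:
    t.join()
```

Because every add is eventually followed by a rename and every rename by a delete, the simulated directory drains to empty; against the unpatched server, the equivalent concurrent load triggered the crash within minutes.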
Looks good to me.

Created attachment 193151 [details]
cvs commit log

Reviewed by: nhosoi (Thanks!)
Files: see diff
Branch: HEAD
Fix Description: https://bugzilla.redhat.com/show_bug.cgi?id=283041#c2
Platforms tested: RHEL5 x86_64
Flag Day: no
Doc impact: no

I built Linux and Solaris packages. I can not reproduce the error anymore (mixed system - Solaris/Linux - MMR setup).

Thanks,
-Reinhard

(In reply to comment #5)
> I built Linux and Solaris packages. I can not reproduce the error anymore
> (mixed system - Solaris/Linux - MMR setup).
>
> Thanks,
> -Reinhard

Great! Thank you for confirming. Please let us know if you see any other problems.

You can bet on it :)
I have a working Multi-Master Replication setup with two masters (Fedora Directory Server 1.0.4). The setup works fine as long as I do not update the same object via both masters. When the latter happens (application driven), one of the masters crashes. This server does not generate a core dump. I included the logs in the user mailing list and Richard had the following comment:

"Thanks. This is a very interesting test. You are generating replication conflicts:

[05/Sep/2007:13:15:40 -0400] conn=51 op=29 csn=46dee55f000200030000 - Naming conflict ADD. Renamed existing entry to nsuniqueid=99277847-1dd111b2-80dfcd7f-b7bc0000+ou=repltest

It looks as though you are repeatedly adding and deleting the same entry from both servers at the same time, which should be fine. Could you post your script that you use to generate these entries?"

I send the important pieces of my java class:

"Richard, this is a java class, using jndi. The relevant methods are:

1.

```java
public InitialDirContext connect(String host, int port) throws NamingException {
    InitialDirContext context = null;
    Hashtable environment = new Hashtable();
    environment.put(Context.INITIAL_CONTEXT_FACTORY, "com.sun.jndi.ldap.LdapCtxFactory");
    environment.put("java.naming.ldap.version", "3");
    environment.put(Context.SECURITY_PRINCIPAL, "cn=Directory Manager");
    environment.put(Context.SECURITY_CREDENTIALS, "xxxxxx");
    environment.put(Context.SECURITY_AUTHENTICATION, "simple");
    // timeouts
    environment.put("com.sun.jndi.dns.timeout.initial", "2000");
    environment.put("com.sun.jndi.dns.timeout.retries", "3");
    environment.put(Context.PROVIDER_URL, "ldap://" + host + ":" + port + "/o=test");
    context = new InitialDirContext(environment);
    System.out.println("Connected to " + host);
    return context;
}
```
2.

```java
public void addEntry(InitialDirContext ctx) {
    // Create attributes to be associated with the new context
    Attributes attrs = new BasicAttributes(true); // case-ignore
    Attribute objclass = new BasicAttribute("objectclass");
    objclass.add("top");
    objclass.add("organizationalUnit");
    attrs.put(objclass);
    // Create the context
    Context result;
    try {
        result = ctx.createSubcontext("ou=test", attrs);
        result.close();
    } catch (NameAlreadyBoundException e) {
        // ignore, just log it
    } catch (NamingException e) {
        e.printStackTrace();
        this.destroy();
    }
}
```

3.

```java
public void deleteEntry(InitialDirContext ctx) {
    try {
        ctx.destroySubcontext("ou=test");
        //ctx.close();
    } catch (NameNotFoundException e) {
        // ignore, just log it
    } catch (NamingException e) {
        e.printStackTrace();
        this.destroy();
    }
}
```

4. Start of the thread:

```java
public void start() {
    int counter = 0;
    for (int i = start; i < stop; i++) {
        try {
            addEntry(ctx);
            // ... some kind of logging
            this.sleep(100);
            deleteEntry(ctx);
            // ... some kind of logging
            this.sleep(50);
        } catch (Exception e) {
            e.printStackTrace();
        }
    }
    // close context
    try {
        ctx.close();
    } catch (NamingException e) {
        e.printStackTrace();
    }
}
```

Then, I just call this thread for my two masters (MasterOne and MasterTwo). Of course, when I pause for a longer time between the add and delete, it takes longer for it to happen."