This topic was discussed on the mailing list: http://lists.fedoraproject.org/pipermail/389-users/2010-March/011192.html

We found some replication conflicts (nsds5replicaconflict=*) and deleted them manually. Now the databases on the consumers are busy all the time, even after I restart the service on the supplier or consumer servers (I must kill the supplier server processes, as they never stop because the DEL operation never returns). When the replication agreement is launched again from the supplier, the replica is still busy.

The last operation in the access log of the consumers is the deletion of the object in conflict, which never gets a result:

[08/Mar/2010:16:02:51 +0100] NSMMReplicationPlugin - conn=221525 op=11 repl="o=XXXXX,dc=XXXXX,dc=XXXXX": Replica in use locking_purl=conn=207283 id=3
[08/Mar/2010:16:02:51 +0100] NSMMReplicationPlugin - conn=221525 op=11 replica="o=XXXXX,dc=XXXXX,dc=XXXXX": Unable to acquire replica: error: replica busy locked by conn=207283 id=3 for incremental update
[08/Mar/2010:16:02:51 +0100] NSMMReplicationPlugin - conn=221525 op=11 repl="o=XXXXX,dc=XXXXX,dc=XXXXX": StartNSDS50ReplicationRequest: response=1 rc=0
[08/Mar/2010:16:02:54 +0100] NSMMReplicationPlugin - conn=221525 op=13 repl="o=XXXXX,dc=XXXXX,dc=XXXXX": Begin incremental protocol

These are all the messages in the access.log referring to connection conn=207283:

[08/Mar/2010:08:16:01 +0100] conn=207283 fd=607 slot=607 SSL connection from XXXXXX to XXXXXX
[08/Mar/2010:08:16:01 +0100] conn=207283 SSL 256-bit AES
[08/Mar/2010:08:16:01 +0100] conn=207283 op=0 BIND dn="cn=Replication Manager,cn=config" method=128 version=3
[08/Mar/2010:08:16:01 +0100] conn=207283 op=0 RESULT err=0 tag=97 nentries=0 etime=0 dn="cn=replication manager,cn=config"
[08/Mar/2010:08:16:01 +0100] conn=207283 op=1 SRCH base="" scope=0 filter="(objectClass=*)" attrs="supportedControl supportedExtension"
[08/Mar/2010:08:16:01 +0100] conn=207283 op=1 RESULT err=0 tag=101 nentries=1 etime=0
[08/Mar/2010:08:16:01 +0100] conn=207283 op=2 SRCH base="" scope=0 filter="(objectClass=*)" attrs="supportedControl supportedExtension"
[08/Mar/2010:08:16:01 +0100] conn=207283 op=2 RESULT err=0 tag=101 nentries=1 etime=0
[08/Mar/2010:08:16:01 +0100] conn=207283 op=3 EXT oid="2.16.840.1.113730.3.5.3" name="Netscape Replication Start Session"
[08/Mar/2010:08:16:01 +0100] conn=207283 op=3 RESULT err=0 tag=120 nentries=0 etime=0
[08/Mar/2010:08:16:01 +0100] conn=207283 op=4 DEL dn="nsuniqueid=f851c101-1dd111b2-a64db547-e4060000+uid=cabudenhos029p$,ou=computers,o=XXXXXX,dc=XXXXXX,dc=XXXXXX"

We had the same problem some time ago when deleting replication conflicts, and now it has happened again. That time, in a development environment, we solved it by deleting all replication agreements, recreating them, and initializing the consumers. Now it has happened in a production environment with 60 servers, where it is very difficult to recreate all the replication agreements and reinitialize the databases.

"Normal" objects are deleted fine; the problem only occurs when deleting replication conflicts. Indeed, we removed replication conflicts from two different databases (on two different servers, each the master of its database), and now those two databases are busy all the time on the rest of the servers.

The installation is the default installation, with these changes.

In /etc/sysconfig/dirsrv:

sysctl -w "net.ipv4.ip_local_port_range=1024 65000" > /dev/null
sysctl -w "fs.file-max=64000" > /dev/null
ulimit -n 65535

In dse.ldif:

nsslapd-sizelimit: 50000
nsslapd-timelimit: 60
nsslapd-maxdescriptors: 65535
nsslapd-idlistscanlimit: 50000
nsslapd-lookthroughlimit: 50000
nsslapd-dbcachesize: 838860800
nsslapd-allidsthreshold: 10000
nsslapd-cachememsize: 125829120 (in each database)

In fact, this only happens on servers upgraded to 1.2.5, not on those running 1.1.3 (the settings above have only been applied to the 1.2.5 servers, not to the 1.1.3 ones).

Using: CentOS 5.4

rpm -qa | grep 389
389-admin-1.1.10-1.el5
389-ds-1.1.3-6.el5
389-admin-console-doc-1.1.4-3.el5
389-console-1.1.3-6.el5
389-dsgw-1.1.4-1.el5
389-ds-console-1.2.0-5.el5
389-ds-console-doc-1.2.0-5.el5
389-admin-console-1.1.4-3.el5
389-ds-base-devel-1.2.5-1.el5
389-ds-base-1.2.5-1.el5

Regards.
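For anyone trying to confirm the same symptom, the check described above (an operation in the access log that never gets a RESULT) can be sketched in Python. This is only an illustration: the function name pending_operations and the regular expression are assumptions made for this sketch, not part of 389 DS, and the sample lines are copied from the log excerpt in this report. Operations that legitimately have no RESULT line (for example an unbind) would also show up, so the output needs a manual look.

```python
import re

# Matches "conn=N op=M KEYWORD" as it appears in the access-log lines quoted above.
OP_RE = re.compile(r"conn=(\d+) op=(\d+) (\S+)")

def pending_operations(lines):
    """Return {(conn, op): operation keyword} for operations with no RESULT line."""
    pending = {}
    for line in lines:
        m = OP_RE.search(line)
        if not m:
            continue  # connection-level lines (e.g. "SSL connection from") carry no op=
        key = (int(m.group(1)), int(m.group(2)))
        keyword = m.group(3)
        if keyword == "RESULT":
            pending.pop(key, None)  # the operation completed
        else:
            pending[key] = keyword  # remember it until its RESULT shows up
    return pending

sample = [
    '[08/Mar/2010:08:16:01 +0100] conn=207283 op=3 EXT oid="2.16.840.1.113730.3.5.3" name="Netscape Replication Start Session"',
    '[08/Mar/2010:08:16:01 +0100] conn=207283 op=3 RESULT err=0 tag=120 nentries=0 etime=0',
    '[08/Mar/2010:08:16:01 +0100] conn=207283 op=4 DEL dn="nsuniqueid=f851c101-1dd111b2-a64db547-e4060000+uid=cabudenhos029p$,ou=computers,o=XXXXXX,dc=XXXXXX,dc=XXXXXX"',
]

print(pending_operations(sample))  # {(207283, 4): 'DEL'}
```

On the excerpt above, the only operation left without a result is the DEL of the conflict entry on conn=207283, which matches the connection named in the "replica busy" messages.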
This is definitely a deadlock. The problem is that when you delete the conflict entry, the server internally calls del_replconflict_attr() to first remove the conflict attribute. That function is called by urp_delete_operation(), which is called as a bepreop plugin from ldbm_back_delete. ldbm_back_delete has already acquired the cache lock on the backentry. del_replconflict_attr() does an internal modify on this same entry, which also tries to acquire the cache lock on the same backentry, and that deadlocks. The workaround appears to be: delete the nsds5ReplConflict attribute from the entry first, then delete the entry.
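The locking pattern described above can be illustrated with a small Python sketch. It is not the server code: entry_lock, delete_entry and internal_modify are hypothetical stand-ins for the backentry cache lock, ldbm_back_delete and the internal modify issued by del_replconflict_attr(), and the sketch assumes the lock is not re-entrant. A timeout is used so that the demonstration returns instead of hanging the way the real DEL operation does.

```python
import threading

# Hypothetical stand-in for the cache lock on the backentry.
# threading.Lock is not re-entrant: the holder cannot acquire it a second time.
entry_lock = threading.Lock()

def internal_modify():
    """Stand-in for the internal modify: it needs the same entry lock."""
    # The real code would block forever here; the timeout only makes the demo finish.
    acquired = entry_lock.acquire(timeout=0.1)
    if acquired:
        entry_lock.release()
    return acquired

def delete_entry():
    """Stand-in for the delete path: take the lock, then run the pre-op hook."""
    with entry_lock:
        # The pre-op hook runs while the lock is still held by this same thread.
        return internal_modify()

print(delete_entry())  # False: the nested acquire can never succeed
```

Called on its own, with the lock free, internal_modify() succeeds, which is why removing the attribute in a separate operation before the delete avoids the problem.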
Is that specific to version 1.2.5 and not 1.1.3?
The workaround above should work on all servers and all versions. I'm not sure why the bug only shows up on certain servers. I'm not aware of anything that would have changed this behavior.
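As a rough sketch of the two-step workaround, the following Python helper builds standard LDIF that first removes the nsds5ReplConflict attribute and then deletes the entry as a separate operation. The helper name workaround_ldif and the example DN are made up for illustration; the sketch also assumes the DN is plain ASCII and refuses anything that would need base64 encoding in LDIF.

```python
def workaround_ldif(dn):
    """Build LDIF that strips nsds5ReplConflict from an entry, then deletes the entry."""
    # Assumption of this sketch: only DNs that can be written as plain LDIF values.
    if (not dn or not dn.isascii() or dn != dn.strip()
            or dn[0] in ":<" or "\n" in dn or "\r" in dn):
        raise ValueError("DN needs base64 encoding, which this sketch does not handle")
    return (
        f"dn: {dn}\n"
        "changetype: modify\n"
        "delete: nsds5ReplConflict\n"
        "-\n"
        "\n"
        f"dn: {dn}\n"
        "changetype: delete\n"
    )

# Hypothetical conflict entry DN, for illustration only.
example_dn = "nsuniqueid=00000000-00000000-00000000-00000000+uid=example,ou=computers,o=example"
print(workaround_ldif(example_dn))
```

The resulting text could be passed to an LDAP client such as ldapmodify. Because the two changes are separate records, the attribute removal completes before the delete is attempted.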
OK, I will try to delete some of the existing replication conflicts using that workaround, and I will post the results.
Created attachment 402171 patch
To ssh://git.fedorahosted.org/git/389/ds.git
   2e7f973..03c2dcc  ds82 -> fedora/Directory_Server_8_2_Branch
commit 03c2dcc26a50f58348d1cb201f912fdb5839b79f
Author: Rich Megginson <rmeggins>
Date: Tue Mar 23 19:08:13 2010 -0600

commit to master
commit eac3f15f2209719e05640e1576b4273d03bef079
Author: Rich Megginson <rmeggins>
Date: Tue Mar 23 19:08:13 2010 -0600

To ssh://git.fedorahosted.org/git/389/ds.git
   2e7f973..5db9031  Directory_Server_8_2_Branch -> Directory_Server_8_2_Branch
commit 5db90314f1d0239b928a35e325b4810d14677c6b
Author: Rich Megginson <rmeggins>
Date: Thu Mar 25 12:10:46 2010 -0600