571677 – Busy replica on consumers when directly deleting a replication conflict

Bug 571677 - Busy replica on consumers when directly deleting a replication conflict

Summary: Busy replica on consumers when directly deleting a replication conflict

Keywords:
Status:	CLOSED CURRENTRELEASE
Alias:	None
Product:	389
Classification:	Retired
Component:	Replication - General
Sub Component:
Version:	1.2.6
Hardware:	All
OS:	Linux
Priority:	medium
Severity:	medium
Target Milestone:	---
Assignee:	Rich Megginson
QA Contact:	Viktor Ashirov
Docs Contact:
URL:
Whiteboard:
Depends On:
Blocks:	434914 389_1.2.6
TreeView+	depends on / blocked

Reported:	2010-03-09 07:56 UTC by Juan
Modified:	2015-12-07 16:47 UTC (History)
CC List:	3 users (show)
Fixed In Version:
Clone Of:
Environment:
Last Closed:	2015-12-07 16:47:48 UTC
Embargoed:

Attachments	(Terms of Use)
patch (4.96 KB, patch) 2010-03-24 01:09 UTC, Rich Megginson	nhosoi: review+	Details \| Diff
View All

Description Juan 2010-03-09 07:56:18 UTC

Discussed on this topic in the mailing lists:
http://lists.fedoraproject.org/pipermail/389-users/2010-March/011192.html

We have found some replication conflicts (nsds5replicaconflict=*). We have deleted them manually, and now the databases in the consumers are busy all the time, no matter if i restart the service in the supplier or consumer servers (i must kill the supplier server process as they never stop, because the DEL operation never gets returned), when the replication agreement is launched again from the supplier, the replica is busy all the time. The last operation in the access log of the consumers is the deletion of the object in conflict, which never gets a result:

[08/Mar/2010:16:02:51 +0100] NSMMReplicationPlugin - conn=221525 op=11 repl="o=XXXXX,dc=XXXXX,dc=XXXXX": Replica in use locking_purl=conn=207283 id=3
[08/Mar/2010:16:02:51 +0100] NSMMReplicationPlugin - conn=221525 op=11 replica="o=XXXXX,dc=XXXXX,dc=XXXXX": Unable to acquire replica: error: replica busy locked by conn=207283 id=3 for incremental update
[08/Mar/2010:16:02:51 +0100] NSMMReplicationPlugin - conn=221525 op=11 repl="o=XXXXX,dc=XXXXX,dc=XXXXX": StartNSDS50ReplicationRequest: response=1 rc=0
[08/Mar/2010:16:02:54 +0100] NSMMReplicationPlugin - conn=221525 op=13 repl="o=XXXXX,dc=XXXXX,dc=XXXXX": Begin incremental protocol

These are all the messages in the access.log referring the connection conn=207283

[08/Mar/2010:08:16:01 +0100] conn=207283 fd=607 slot=607 SSL connection from XXXXXX to XXXXXX
[08/Mar/2010:08:16:01 +0100] conn=207283 SSL 256-bit AES
[08/Mar/2010:08:16:01 +0100] conn=207283 op=0 BIND dn="cn=Replication Manager,cn=config" method=128 version=3
[08/Mar/2010:08:16:01 +0100] conn=207283 op=0 RESULT err=0 tag=97 nentries=0 etime=0 dn="cn=replication manager,cn=config"
[08/Mar/2010:08:16:01 +0100] conn=207283 op=1 SRCH base="" scope=0 filter="(objectClass=*)" attrs="supportedControl supportedExtension"
[08/Mar/2010:08:16:01 +0100] conn=207283 op=1 RESULT err=0 tag=101 nentries=1 etime=0
[08/Mar/2010:08:16:01 +0100] conn=207283 op=2 SRCH base="" scope=0 filter="(objectClass=*)" attrs="supportedControl supportedExtension"
[08/Mar/2010:08:16:01 +0100] conn=207283 op=2 RESULT err=0 tag=101 nentries=1 etime=0
[08/Mar/2010:08:16:01 +0100] conn=207283 op=3 EXT oid="2.16.840.1.113730.3.5.3" name="Netscape Replication Start Session"
[08/Mar/2010:08:16:01 +0100] conn=207283 op=3 RESULT err=0 tag=120 nentries=0 etime=0
[08/Mar/2010:08:16:01 +0100] conn=207283 op=4 DEL dn="nsuniqueid=f851c101-1dd111b2-a64db547-e4060000+uid=cabudenhos029p$,ou=computers,o=XXXXXX,dc=XXXXXX,dc=XXXXXX"


We had the same problem some time ago when deleting replication conflicts, and now the problem happens again (that time, we solved the problem is a development environment deleting all replication agreements, recreating them, and initializing the consumers, but now this has happened in a production environment with 60 servers, very difficult to recreate al the replication agreements and reinitialize the databases). "Normal" objects are deleted fine, the problem is when deleting replication conflicts, indeed, we did remove replication conflicts of two different databases (in two different servers, each master of its database), and now the two databases in the rest of the servers are all the time busy. Installation is the default installation, with these changes; in /etc/sysconfig/dirsrv:

sysctl -w "net.ipv4.ip_local_port_range=1024 65000" > /dev/null
sysctl -w "fs.file-max=64000" > /dev/null
ulimit -n 65535

In dse.ldif:

nsslapd-sizelimit: 50000
nsslapd-timelimit: 60
nsslapd-maxdescriptors: 65535
nsslapd-idlistscanlimit: 50000
nsslapd-lookthroughlimit: 50000
nsslapd-dbcachesize: 838860800
nsslapd-allidsthreshold: 10000
nsslapd-cachememsize: 125829120 (in each database)

In fact, these only happens in servers upgraded to 1.2.5 servers, not those with version 1.1.3 (1.1.3 don't have applied the previous settings, only 1.2.5).


Using:
Centos 5.4

rpm -qa | grep 389
389-admin-1.1.10-1.el5
389-ds-1.1.3-6.el5
389-admin-console-doc-1.1.4-3.el5
389-console-1.1.3-6.el5
389-dsgw-1.1.4-1.el5
389-ds-console-1.2.0-5.el5
389-ds-console-doc-1.2.0-5.el5
389-admin-console-1.1.4-3.el5
389-ds-base-devel-1.2.5-1.el5
389-ds-base-1.2.5-1.el5

Regards.

Comment 1 Rich Megginson 2010-03-15 20:38:16 UTC

This is definitely a deadlock.  The problem is that when you delete the conflict entry, it calls this function internally to first remove the conflict attribute: del_replconflict_attr().  This is called by urp_delete_operation() which is called as as bepreop plugin from ldbm_back_delete.  ldbm_back_delete has already acquired the cache lock on the backentry.  del_replconflict_attr() does an internal modify on this same entry, which also tries to acquire the cache lock on the same backentry, which deadlocks.

The workaround appears to be - delete the nsds5ReplConflict attribute from the entry first, then delete it.

Comment 2 Juan 2010-03-15 23:02:42 UTC

Is that specific to 1.2.5 version and not 1.1.3?

Comment 3 Rich Megginson 2010-03-15 23:07:34 UTC

The workaround above should work on all servers all versions.  I'm not sure why the bug only shows up on certain servers - I'm not aware of anything that would have changed this behavior.

Comment 4 Juan 2010-03-15 23:14:49 UTC

OK, I will try to delete some replication conflicts that already exist with that workround, and i will post the results.

Comment 5 Rich Megginson 2010-03-24 01:09:44 UTC

Created attachment 402171 [details]
patch

Comment 6 Rich Megginson 2010-03-24 20:43:24 UTC

To ssh://git.fedorahosted.org/git/389/ds.git
   2e7f973..03c2dcc  ds82 -> fedora/Directory_Server_8_2_Branch

commit 03c2dcc26a50f58348d1cb201f912fdb5839b79f
Author: Rich Megginson <rmeggins>
Date:   Tue Mar 23 19:08:13 2010 -0600

commit to master
commit eac3f15f2209719e05640e1576b4273d03bef079
Author: Rich Megginson <rmeggins>
Date:   Tue Mar 23 19:08:13 2010 -0600

Comment 7 Rich Megginson 2010-03-25 18:15:15 UTC

To ssh://git.fedorahosted.org/git/389/ds.git
   2e7f973..5db9031  Directory_Server_8_2_Branch -> Directory_Server_8_2_Branch

commit 5db90314f1d0239b928a35e325b4810d14677c6b
Author: Rich Megginson <rmeggins>
Date:   Thu Mar 25 12:10:46 2010 -0600

Note You need to log in before you can comment on or make changes to this bug.