Red Hat Bugzilla – Bug 329951
MMR: Supplier does not respond anymore after many operations (deletes)
Last modified: 2015-12-07 12:03:29 EST
MMR setup with two masters, M1 and M2. M1 was the only supplier (all operations
went against M1). Replication is set up with a low purgeDelay (900 sec). I
performed many LDAP operations (80,000 adds / 80,000 mods / 80,000 deletes).
Eventually, M1 stops responding. You can still access it with other
clients, but only base searches are possible. M1 does not return from
one-level or subtree searches.
I noticed that id2entry.db4 does not get rid of the tombstones after 900 sec. I
assume it has to do with that.
Steps to Reproduce:
1. MMR setup with 900 sec as purgeDelay
2. Access M1 with a script or any LDAP client (I used a JNDI client).
3. This client iterates in an endless loop through:
- adds 500 entries
- mods these 500 entries
- deletes those 500 entries
M1 does not respond anymore
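For anyone without a JNDI client at hand, the add/mod/delete cycle from the
steps above could be approximated by generating LDIF and feeding it to an
ldapmodify-style tool. This is only a sketch: the DN pattern o=testOrgN,o=test
is taken from the access log quoted later in this bug, while the function name
cycle_ldif, the organization object class, and the description attribute used
for the modify step are assumptions and not necessarily what the original
client did.

```python
def cycle_ldif(start, count=500, suffix="o=test"):
    """Return LDIF for one pass: add `count` entries, modify them, delete them.

    The entry layout (objectClass organization, replace on description) is an
    assumption for illustration, not the reporter's actual test data.
    """
    ids = range(start, start + count)
    records = []
    for i in ids:
        records.append(
            f"dn: o=testOrg{i},{suffix}\n"
            "changetype: add\n"
            "objectClass: top\n"
            "objectClass: organization\n"
            f"o: testOrg{i}\n"
        )
    for i in ids:
        records.append(
            f"dn: o=testOrg{i},{suffix}\n"
            "changetype: modify\n"
            "replace: description\n"
            "description: modified\n"
            "-\n"
        )
    for i in ids:
        records.append(
            f"dn: o=testOrg{i},{suffix}\n"
            "changetype: delete\n"
        )
    # LDIF records are separated by a blank line.
    return "\n".join(records)


if __name__ == "__main__":
    # One pass of 500 adds, 500 mods and 500 deletes on stdout.
    print(cycle_ldif(0), end="")
```

The output would be piped into ldapmodify bound against M1 and repeated in a
loop to keep the supplier under load; the bind options depend on the local
setup and are left out here.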
Thank you for reporting the problem.
For debugging the bug, can we have your errors/access logs? Also, output from
> M1 will not respond anymore.
This means, add/mod/del operations and one-level or subtree search hang? Do
these clients get any error returns? I'm interested in the stack trace when
this is happening.
Is the problem reproducible every time you run the test? Does the "not respond"
problem occur after running for 900 seconds, or at a random time?
Created attachment 227761 [details]
error and access logs
To answer your questions:
1. I attached the access and error file. Have a look at connection 27 and you
see down at the bottom:
[12/Oct/2007:15:13:38 -0400] conn=27 op=126788 DEL dn="o=testOrg40,o=test"
without any response. Now we are at the state, where the clients hang (waiting
for a response). You will see that a read (base-level search) still works.
How do you create the output of pstack?
2. The problem is reproducible, but it may take longer (or shorter) until it
happens (random time). I actually believe that the purgeDelay does not work,
since id2entry.db4 does not shrink at any time.
Created attachment 227851 [details]
Hi, I reproduced the error and captured the pstack trace of ns-slapd.
Thank you for the stack trace. It looks like there's a deadlock between thread #34
and #35. And VLV is involved again...
I assume you have some vlv indexes in your system. Is it correct?
And could you check the nsslapd-serial-lock configuration parameter in your
config file, dse.ldif? Could it be "off"?
nsslapd-serial-lock is set to "on".
Yes, I do have vlv indexes in the system. However, I do not perform any
vlv-searches during the test. I reported a similar bug, where the server hangs
during updates and vlv searches. It is possible that both are related.
Created attachment 231611 [details]
Description: introduce OP_FLAG_REPL_RUV. It's set if the entry is RUV in
repl5_replica.c. The operation should not be blocked at the backend SERIAL
lock. But it has nothing to do with VLV, thus if the flag is set, it skips the
Test case I'm running:
Set up 2-way MMR (Master 1 and 2)
Set purge delay: nsds5ReplicaPurgeDelay: 600
Import an LDIF file on Master 1
Initialize the replica (Master 2) on Master 1
Create browsing indexes on both Master 1 and 2
Then, I started running the following test tools:
ldclt -D "cn=Directory manager" -w <passwd> -p <port> -e add -e random -b
ou=Payroll,dc=example,dc=com -f uid=test_XXXX -v -q -n 2 -N 3600 -r 1000 -R
2999 -I 68 -e inetOrgPerson -e imagesdir=<images_dir>
ldclt -D "cn=Directory manager" -w <passwd> -p <port> -e delete -e random -b
ou=payroll,dc=example,dc=com -f uid=test_XXXX -v -q -n 2 -N 3600 -r 1000 -R
2999 -I 32 -e inetOrgPerson -e imagesdir=<images_dir>
ldclt -D "cn=Directory manager" -w <passwd> -p <port> -e
attreplace=sn:a_random_sn_XXXX -e random -b ou=payroll,dc=example,dc=com -f
uid=test_XXXX -v -q -n 2 -N 3600 -r 1000 -R 2999 -I 32
So far, it's been running for 2 hours with occasional checks on the browser
using the VLV/browsing index.
Hello Reinhard, I attached the fix proposal in comment #9.
I'm testing the changes as described in the same comment. So far, I don't see
any problems, including the deadlock. May I ask you to try the patch to verify
whether it fixes your problem? Thank you so much for your help, in advance...
Created attachment 231711 [details]
cvs commit message
Reviewed by Rich (Thank you!!!)
Checked into CVS HEAD.
I guess I can not patch up my existing 1.0.4 release. Those changes depend on
too many other fixes.
What CVS tag should I use to download the source for verification purposes?
(In reply to comment #12)
> I guess I can not patch up my existing 1.0.4 release. Those changes depend on
> too many other fixes.
> What CVS tag should I use to download the source for verification purposes?
We're using HEAD (just cvs checkout with no tag). However, good news and bad
news - the bad news is that the build process for Fedora DS 1.1 is completely
different - the good news is that it is based on standard autotools.
Note: the test case in comment #9 ran 12 hours with no problem. Did you
have to wait longer till the deadlock occurred?
The longest the test ran before the server deadlocked was about 4 hours.
12 hours sounds good to me.
I am going to build the latest and let it run as well.
Thank you so much for reporting this tricky bug and verifying the fix! It's
helping us a lot.
I still have trouble building 1.1.1. Hopefully, I can solve this soon. I
actually thought that the 1.0.4 build process was not bad.
I have a question regarding this bug. What really causes it? I guess it
is a race condition. When I reproduce it, it takes quite some operations (more
than 200 000 operations). However, when our testing guys test our application,
it (sometimes) happens much earlier, which is a concern to me.
Yes, it was a race condition. The cause was a conflict between a delete
thread and an MMR thread. The latter thread was going to update a special entry
called the RUV (replica update vector). Updating that entry has nothing to do
with VLV. But since the backend update code is shared between ordinary
operations and RUV update operations, it was checking the VLV search list
under the lock even though that was not necessary.
To eliminate the deadlock opportunity, I introduced a new flag for RUV. If it's
set, we skip the VLV update for the special entry.
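To illustrate the idea only (this is not the server code, which is C), here is
a toy model of a shared update path that bypasses the VLV lock when the
operation carries the RUV flag. The name OP_FLAG_REPL_RUV comes from the patch
description; its value, the ToyBackend class, and the helper function are
hypothetical. The sketch does not reproduce the original deadlock. It only
shows that a flagged update no longer waits on the VLV lock while another
thread holds it.

```python
import threading

OP_FLAG_REPL_RUV = 0x1  # flag name from the patch; the value is made up here


class ToyBackend:
    """Hypothetical stand-in for the shared backend update path."""

    def __init__(self):
        self.vlv_lock = threading.Lock()  # guards the VLV search list
        self.entries = {}
        self.vlv_checks = 0

    def update(self, dn, value, flags=0):
        self.entries[dn] = value
        if flags & OP_FLAG_REPL_RUV:
            # RUV entry: skip the VLV bookkeeping, never touch vlv_lock.
            return
        with self.vlv_lock:
            self.vlv_checks += 1


def ruv_update_finishes_while_vlv_locked(timeout=2.0):
    """Hold the VLV lock (as a busy delete thread might) and run a RUV update."""
    backend = ToyBackend()
    with backend.vlv_lock:
        worker = threading.Thread(
            target=backend.update,
            args=("ruv-entry", "csn-1", OP_FLAG_REPL_RUV),
        )
        worker.start()
        worker.join(timeout)
        return not worker.is_alive()


if __name__ == "__main__":
    print(ruv_update_finishes_while_vlv_locked())
```

Without the flag check, the same update would block until the lock holder
releases vlv_lock, which is the kind of wait that can turn into a deadlock
when the holder is in turn waiting on the updater.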
Regarding the deadlock timing, could the test machine be more powerful than your
machine? For example, in the number of CPUs? Some race conditions are seldom
observed on a single-CPU machine but show up quite easily on a multi-CPU
machine... Could there be such a difference?
I guess the SQA box is more powerful (Linux, dual CPU, 32 GB RAM), where I used
a pretty old SunFire (dual CPU (much slower) and 8 GB of RAM). Linux's IO is
also much faster than Sun's.
Any hints on getting 1.1.1 built? Is there a document somewhere?
(In reply to comment #19)
> I guess the SQA box is more powerful (Linux, dual CPU, 32 GB RAM), where I used
> a pretty old SunFire (dual CPU (much slower) and 8 GB of RAM). Linux's IO is
> also much faster than Sun's.
> Any hints on getting 1.1.1 built? Is there a document somewhere?
Not yet, we are still working on it. Firstly, what platform is this on? Have
you been able to get the code from CVS?
Yes, I got the code from CVS. Right now, I am stuck with mod_admserv. pkg-config
can not find the ldapcsdk, although it is specified with --with-ldapsdk-lib and
I am thinking about skipping those parts, since I really do not need those extra
No, in order to test this patch, you only need the ldapserver component. You do
not need mod_*, adminserver, adminutil, etc.
Depending on what platform you are using, you may already have all of the
dependent components (the components that ldapserver uses) on your system.
I forgot to mention that I am building on Linux. This is where I am:
I got all components (net-snmp; mozilla; db ...) built. I downloaded the cvs
code (HEAD). When I run configure from the ldapserver, with all the
it finds the db.h, but complains about the version:
configure: error: ../../../db/work/db-4.2.52.NC/built/include/db.h is version
4.2 but libdb-4.2 not found
If you are on a Red Hat, CentOS, or Fedora system, you can just use the
db4-devel package for db. configure will automatically find and use it. Same
with snmp and sasl. Most other Linux distros will likely include db4 too.
configure is looking for a file called libdb-4.2.so (it tries to link a test
program using gcc ... -L/path/to/db/lib -ldb-4.2). The makefiles we use (in
Fedora DS 1.0.4 and earlier) for db create db42.so, I think. So you might be
able to just rename the library. With Fedora DS 1.1, I suggest using system
components if possible, to simplify building and running the server. If you are
using Gentoo or Debian or some other distro, we would be very interested in any
information about building on that platform.
Eventually the make of ldapserver finished successfully. Afterwards, I executed
make install, which installed it in /opt/dirsrv.
Now, how to install the directory instance? I executed sbin/setup-ds.pl, which
failed due to "Can't locate Mozilla/LDAP/Conn.pm ...."
What is the right procedure?
I am sure you had enough reasons to change the entire directory
build/installation procedure and structure as well.
From what I have seen so far, I liked the old stuff much better. It worked
nicely and I was able to build it within a reasonable time for Linux and
Solaris, which I cannot say anymore.
More importantly, I am quite concerned about the migration.
I am considering applying the fixes for that bug to the 104 code base.
It does not look like an easy task, though....
I apologize for that. There are many reasons we are switching:
1) The open source community really, really wanted the package to have better OS
integration and use the FHS standard -
2) Most open source developers prefer autotools - Fedora DS 1.0 Makefiles are
too daunting for most - although the One Step Build is nice, it's not very portable
3) Splitting the giant package into modular pieces makes building only one of
them much easier and makes maintenance much simpler -
http://directory.fedoraproject.org/wiki/Discrete_Packaging - this means rpms and
solaris pkgs for the components.
I realize that you may not care about any of these, and that the current system
works fine for you, but it is imperative that we grow the developer and user base
of Fedora DS, and we saw these steps as necessary. So, please, bitte, bear with
us while we go through this process.
Rick, there is no need for apologizing :)
I can see all of those points, but I still have to say that the old way just
worked fine. I actually made the effort to just apply the relevant fix into the
104 source base. Guess what: I had the Linux build within minutes. I just built
it for Solaris and this looks good so far as well. I guess the point is that the
build procedure for 1.1.1 still has to be fine tuned.
Anyway, to my problem:
The test is still running without any issue. I try to get a much faster box to
test it.
back to 104 vs 111:
How does the migration work? Background: We ship FDS in one of our products and
at one point of time we have to switch to 1.1.1 or later.
Any feedback is appreciated.
(In reply to comment #27)
> I guess the point is that the
> build procedure for 1.1.1 still has to be fine tuned.
Yes. At some point in the near future, you will be able to download or use yum
to install the devel packages to facilitate building. And we will have proper docs.
> Anyway, to my problem:
> The test is still running without any issue. I try to get a much faster box to
> test it.
> back to 104 vs 111:
> How does the migration work? Background: We ship FDS in one of our products and
> at one point of time we have to switch to 1.1.1 or later.
> Any feedback is appreciated.
We have migration scripts that will automate everything. See