MMR setup with two masters, M1 and M2. M1 was the only supplier (all operations were run against M1). Replication is set up with a low purge delay (900 sec). I performed many LDAP operations (80,000 adds / 80,000 mods / 80,000 deletes). Eventually, M1 stops responding. You can still access it with other clients, but only base searches are possible; M1 does not return results for one-level or subtree searches. I noticed that id2entry.db4 does not get rid of the tombstones after 900 seconds, and I assume this is related.

How reproducible:

Steps to Reproduce:
1. Set up MMR with 900 sec as the purge delay.
2. Access M1 with a script or any LDAP client (I used a JNDI client).
3. Let this client iterate in an endless loop (a sketch of such a loop appears below):
   - add 500 entries
   - modify these 500 entries
   - delete those 500 entries

Actual results:
M1 does not respond anymore.
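For reference, a minimal shell sketch of the loop in step 3, using the OpenLDAP command-line tools rather than JNDI; the host, port, credentials, suffix (o=test), and entry naming are illustrative assumptions, not details taken from the report:

#!/bin/bash
# Hypothetical connection details -- adjust to the actual M1 instance.
HOST=m1.example.com
PORT=389
DN="cn=Directory Manager"
PW=secret
SUFFIX="o=test"
while true; do
    # add 500 entries
    for i in $(seq 1 500); do
        ldapadd -x -h "$HOST" -p "$PORT" -D "$DN" -w "$PW" <<EOF
dn: cn=loadtest$i,$SUFFIX
objectClass: person
cn: loadtest$i
sn: initial
EOF
    done
    # modify these 500 entries
    for i in $(seq 1 500); do
        ldapmodify -x -h "$HOST" -p "$PORT" -D "$DN" -w "$PW" <<EOF
dn: cn=loadtest$i,$SUFFIX
changetype: modify
replace: sn
sn: modified
EOF
    done
    # delete those 500 entries
    for i in $(seq 1 500); do
        ldapdelete -x -h "$HOST" -p "$PORT" -D "$DN" -w "$PW" "cn=loadtest$i,$SUFFIX"
    done
done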
Thank you for reporting the problem. To debug the bug, could we have your errors/access logs? Also, the output from the pstack command?

> M1 will not respond anymore.

Does this mean that add/mod/del operations and one-level or subtree searches hang? Do these clients get any error returns? I'm interested in the stack trace when this is happening. Is the problem reproducible every time you run the test? Does the "not respond" problem occur after 900 seconds of running, or at a random time?
Created attachment 227761 [details] error and access logs
To answer your questions:

1. I attached the access and error files. Have a look at connection 27 and you will see at the bottom:
[12/Oct/2007:15:13:38 -0400] conn=27 op=126788 DEL dn="o=testOrg40,o=test"
without any response. At this point the clients hang (waiting for a response). You will also see that a read (base-level search) still works. How do you create the output of pstack?

2. The problem is reproducible, but it may take a longer (or shorter) time until it happens (the timing is random). I actually believe that the purge delay does not work, since id2entry.db4 never decreases in size.
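For reference, a minimal sketch of capturing the pstack output requested above; it assumes the pstack utility (Solaris proc tools, or the pstack shipped with gdb on Linux) is available and that the server process is named ns-slapd:

# dump the stacks of all ns-slapd threads to a file while the hang is occurring
pid=$(pgrep ns-slapd)
pstack "$pid" > /tmp/ns-slapd.pstack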
Created attachment 227851 [details]
pstack trace

Hi, I reproduced the error and captured the pstack trace of ns-slapd.
Thank you for the stack trace. It looks like there's a deadlock between threads #34 and #35, and VLV is involved again... I assume you have some VLV indexes in your system. Is that correct? Could you also check this configuration parameter in your config file (dse.ldif): nsslapd-serial-lock. Could it be set to "off"?
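For reference, two quick ways to check that setting; the bind credentials and the dse.ldif path are placeholders that depend on the instance:

# ask the running server (the attribute lives on the ldbm backend entries under cn=config)
ldapsearch -x -h localhost -p 389 -D "cn=Directory Manager" -w secret \
    -b "cn=config" nsslapd-serial-lock
# or grep the config file directly, e.g. while the server is stopped
grep -i nsslapd-serial-lock /opt/fedora-ds/slapd-*/config/dse.ldif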
nsslapd-serial-lock is set to "on". Yes, I do have VLV indexes in the system; however, I do not perform any VLV searches during the test. I reported a similar bug where the server hangs during updates and VLV searches. It is possible that both are related.
Created attachment 231611 [details]
cvs diffs

Files:
plugins/replication/repl5_plugins.c
plugins/replication/repl5_replica.c
slapd/slapi-private.h
slapd/back-ldbm/ldbm_add.c
slapd/back-ldbm/ldbm_delete.c
slapd/back-ldbm/ldbm_modify.c
slapd/back-ldbm/ldbm_modrdn.c

Description: introduce OP_FLAG_REPL_RUV. It is set in repl5_replica.c if the entry is the RUV. The operation should not be blocked by the backend SERIAL lock. The RUV has nothing to do with VLV, so if the flag is set, the backend skips the VLV indexing.

Test case I'm running:
Set up 2-way MMR (Master 1 and 2)
Set the purge delay: nsds5ReplicaPurgeDelay: 600 (sketched below)
Import an LDIF file on Master 1
Initialize the replica (Master 2) from Master 1
Create browsing indexes on both Master 1 and 2

Then, I started running the following test tools:
ldclt -D "cn=Directory manager" -w <passwd> -p <port> -e add -e random -b ou=Payroll,dc=example,dc=com -f uid=test_XXXX -v -q -n 2 -N 3600 -r 1000 -R 2999 -I 68 -e inetOrgPerson -e imagesdir=<images_dir>
ldclt -D "cn=Directory manager" -w <passwd> -p <port> -e delete -e random -b ou=payroll,dc=example,dc=com -f uid=test_XXXX -v -q -n 2 -N 3600 -r 1000 -R 2999 -I 32 -e inetOrgPerson -e imagesdir=<images_dir>
ldclt -D "cn=Directory manager" -w <passwd> -p <port> -e attreplace=sn:a_random_sn_XXXX -e random -b ou=payroll,dc=example,dc=com -f uid=test_XXXX -v -q -n 2 -N 3600 -r 1000 -R 2999 -I 32

So far, it has been running for 2 hours, with occasional checks in the browser using the VLV/browsing index.
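For reference, a minimal sketch of the purge-delay step above; the suffix dc=example,dc=com matches the test case, while the host, port, and credentials are placeholders:

# set the replication purge delay to 600 seconds on the replica entry for the suffix
ldapmodify -x -h localhost -p 389 -D "cn=Directory Manager" -w secret <<EOF
dn: cn=replica,cn="dc=example,dc=com",cn=mapping tree,cn=config
changetype: modify
replace: nsds5ReplicaPurgeDelay
nsds5ReplicaPurgeDelay: 600
EOF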
Hello Reinhard, I attached the fix proposal in comment #9 and I'm testing the changes as described in that comment. So far, I don't see any problems, including the deadlock. May I ask you to try the patch to verify whether it fixes your problem? Thank you so much in advance for your help... --noriko
Created attachment 231711 [details]
cvs commit message

Reviewed by Rich (Thank you!!!). Checked into CVS HEAD.
I guess I cannot patch my existing 1.0.4 release; those changes depend on too many other fixes. What CVS tag should I use to download the source for verification purposes?
(In reply to comment #12)
> I guess I cannot patch my existing 1.0.4 release; those changes depend on too
> many other fixes. What CVS tag should I use to download the source for
> verification purposes?

We're using HEAD (just a cvs checkout with no tag). However, there is good news and bad news: the bad news is that the build process for Fedora DS 1.1 is completely different; the good news is that it is based on standard autotools.
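For reference, a minimal sketch of that checkout; $CVSROOT stands for whatever anonymous CVS root the project documents, and ldapserver is the server module referred to later in this thread:

# checking out without a -r tag gives the HEAD of the repository
cvs -d "$CVSROOT" checkout ldapserver
cd ldapserver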
Note: the test case in comment #9 ran for 12 hours with no problem. Did you ever have to wait longer than that before the deadlock occurred?
Noriko, the longest the test ran before the server deadlocked was about 4 hours, so 12 hours sounds good to me. I am going to build the latest code and let it run as well.
Reinhard, Thank you so much for reporting this tricky bug and verifying the fix! It's helping us a lot. --noriko
Hi Noriko, I still have trouble building 1.1.1; hopefully I can solve this soon. I actually thought that the 1.0.4 build process was not bad. I have a question regarding this bug: what really causes it? I guess it is a race condition. When I reproduce it, it takes quite a few operations (more than 200,000). However, when our testing guys test our application, it (sometimes) happens much earlier, which is a concern to me. Thanks -Reinhard
Hi Reinhard, yes, it was a race condition. The cause was a conflict between a delete thread and an MMR thread. The latter thread was about to update a special entry called the RUV (replica update vector). Updating that entry has nothing to do with VLV, but since the backend update code is shared between ordinary operations and the RUV update operation, it was checking the VLV search list under the lock even though that was not necessary. To eliminate the deadlock opportunity, I introduced a new flag for the RUV; if it is set, we skip the VLV update for that special entry. Regarding the deadlock timing, could the test machine be more powerful than your machine, for example in the number of CPUs? Some race problems are seldom observed on a single-CPU machine but show up quite easily on a multi-CPU machine... Could there be such a difference?
I guess the SQA box is more powerful (Linux, dual CPU, 32 GB RAM), whereas I used a pretty old SunFire (dual CPU, much slower, and 8 GB of RAM). Linux's I/O is also much faster than Sun's. Any hints on getting 1.1.1 built? Is there a document somewhere?
(In reply to comment #19)
> I guess the SQA box is more powerful (Linux, dual CPU, 32 GB RAM), whereas I
> used a pretty old SunFire (dual CPU, much slower, and 8 GB of RAM). Linux's I/O
> is also much faster than Sun's.
>
> Any hints on getting 1.1.1 built? Is there a document somewhere?

Not yet; we are still working on it. First, what platform is this on? Have you been able to get the code from CVS?
Yes, I got the code from CVS. Right now, I am stuck on mod_admserv: pkg-config cannot find the ldapcsdk, although it is specified with --with-ldapsdk-lib and --with-ldapsdk-inc. I am thinking about skipping those parts, since I really do not need those extra packages.
No, in order to test this patch, you only need the ldapserver component. You do not need mod_*, adminserver, adminutil, etc. Depending on what platform you are using, you may already have all of the dependent components (the components that ldapserver uses) on your system.
I forgot to mention that I am building on Linux. This is where I am: I got all components (net-snmp, mozilla, db, ...) built and downloaded the CVS code (HEAD). When I run configure for ldapserver with all the component paths:

./configure --with-nspr=../../../mozilla/work/mozilla/dist/OPT.OBJ --with-nss-inc=../../../mozilla/work/mozilla/dist/public/nss --with-nss-lib=../../../mozilla/work/mozilla/dist/OPT.OBJ/lib --with-ldapsdk-inc=../../../mozilla/work/mozilla/dist/public/ldap --with-ldapsdk-lib=../../../mozilla/work/mozilla/dist/lib --with-ldapsdk-bin=../../../mozilla/work/mozilla/dist/bin --with-svrcore-inc=../../../mozilla/work/mozilla/dist/public/svrcore --with-svrcore-lib=../../../mozilla/work/mozilla/dist/lib --with-db=../../../db/work/db-4.2.52.NC/built

it finds db.h but complains about the version:

configure: error: ../../../db/work/db-4.2.52.NC/built/include/db.h is version 4.2 but libdb-4.2 not found
If you are on a Red Hat, CentOS, or Fedora system, you can just use the db4-devel package for db; configure will automatically find and use it. The same goes for snmp and sasl. Most other Linux distros will likely include db4 too. configure is looking for a file called libdb-4.2.so (it tries to link a test program using gcc ... -L/path/to/db/lib -ldb-4.2). The makefiles we use for db (in Fedora DS 1.0.4 and earlier) create db42.so, I think, so you might be able to just rename the library. With Fedora DS 1.1 I suggest using system components if possible, to simplify building and running the server. If you are using Gentoo or Debian or some other distro, we would be very interested in any information about building on that platform.
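For reference, a minimal sketch of the rename workaround mentioned above; the lib directory and the db42.so name are assumptions taken from this comment, so adjust them to what your db build actually produced:

# give the linker the libdb-4.2.so name that configure's test link expects
cd ../../../db/work/db-4.2.52.NC/built/lib
ln -s db42.so libdb-4.2.so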
Eventually the make of ldapserver finished successfully. Afterwards, I executed make install, which installed it in /opt/dirsrv. Now, how do I create the directory instance? I executed sbin/setup-ds.pl, which failed with "Can't locate Mozilla/LDAP/Conn.pm ...." What is the right procedure? I am sure you had good reasons to change the entire directory build/installation procedure and structure, but from what I have seen so far, I liked the old setup much better. It worked nicely, and I was able to build it within a reasonable time for Linux and Solaris, which I cannot say anymore. More importantly, I am quite concerned about the migration. I am considering applying the fixes for this bug to the 1.0.4 code base, although it does not look like an easy task....
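The "Can't locate Mozilla/LDAP/Conn.pm" error means perl cannot find the PerLDAP (Mozilla::LDAP) bindings that setup-ds.pl uses. A minimal sketch of working around it, assuming the bindings are either available as a distro package or have already been built somewhere; the package name and the PERL5LIB path are illustrative:

# on Fedora/RHEL-style systems the bindings may be packaged:
yum install perl-Mozilla-LDAP
# otherwise point perl at wherever Conn.pm lives and rerun setup
export PERL5LIB=/path/to/perldap/lib:$PERL5LIB
/opt/dirsrv/sbin/setup-ds.pl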
I apologize for that. There are many reasons we are switching:

1) The open source community really, really wanted the package to have better OS integration and to follow the FHS standard - http://directory.fedoraproject.org/wiki/FHS_Packaging
2) Most open source developers prefer autotools - the Fedora DS 1.0 Makefiles are too daunting for most - and although the One Step Build is nice, it's not very portable.
3) Splitting the giant package into modular pieces makes building only one of them much easier and makes maintenance much simpler - http://directory.fedoraproject.org/wiki/Discrete_Packaging - this means rpms and Solaris pkgs for the components.

I realize that you may not care about any of these, and that the current system works fine for you, but it is imperative that we grow the developer and user base of Fedora DS, and we saw these steps as necessary. So please, bitte, bear with us while we go through this process.
Rick, there is no need to apologize :) I can see all of those points, but I still have to say that the old way just worked fine. I actually made the effort to apply just the relevant fix to the 1.0.4 source base, and guess what: I had the Linux build within minutes. I just built it for Solaris and that looks good so far as well. I guess the point is that the build procedure for 1.1.1 still has to be fine-tuned.

Anyway, back to my problem: the test is still running without any issue. I will try to get a much faster box to test it.

Back to 1.0.4 vs 1.1.1: how does the migration work? Background: we ship FDS in one of our products, and at some point we will have to switch to 1.1.1 or later. Any feedback is appreciated.
(In reply to comment #27)
> I guess the point is that the build procedure for 1.1.1 still has to be fine-tuned.

Yes. At some point in the near future, you will be able to download or use yum to install the devel packages to facilitate building, and we will have proper docs.

> Anyway, back to my problem: the test is still running without any issue. I will
> try to get a much faster box to test it.

Great!

> Back to 1.0.4 vs 1.1.1: how does the migration work? Background: we ship FDS in
> one of our products, and at some point we will have to switch to 1.1.1 or later.
>
> Any feedback is appreciated.

We have migration scripts that will automate everything. See http://directory.fedoraproject.org/wiki/DS_Admin_Migration and http://directory.fedoraproject.org/wiki/FDS_Setup#migrate-ds-admin.pl