329951 – MMR: Supplier does not respond anymore after many operations (deletes)

Keywords:
Status:	CLOSED CURRENTRELEASE
Alias:	None
Product:	389
Classification:	Retired
Component:	Database - General
Sub Component:
Version:	1.0.4
Hardware:	All
OS:	All
Priority:	high
Severity:	medium
Target Milestone:	---
Assignee:	Noriko Hosoi
QA Contact:	Viktor Ashirov
Docs Contact:
URL:
Whiteboard:
Depends On:
Blocks:	240316 FDS1.1.0

Reported:	2007-10-12 19:46 UTC by reinhard nappert
Modified:	2015-12-07 17:03 UTC
CC List:	2 users
Fixed In Version:
Clone Of:
Environment:
Last Closed:	2015-12-07 17:03:29 UTC
Embargoed:

Attachments
error and access logs (2.97 MB, application/octet-stream) 2007-10-15 15:54 UTC, reinhard nappert
pstack trace (32.30 KB, text/plain) 2007-10-15 18:17 UTC, reinhard nappert
cvs diffs (18.90 KB, patch) 2007-10-18 21:29 UTC, Noriko Hosoi
cvs commit message (2.14 KB, text/plain) 2007-10-18 22:41 UTC, Noriko Hosoi

Description reinhard nappert 2007-10-12 19:46:49 UTC

MMR setup with two masters, M1 and M2. M1 was the only supplier (operations
were performed only against M1). Replication is set up with a low purgeDelay
(900 sec). I performed many LDAP operations (80,000 adds / 80,000 mods /
80,000 deletes). Eventually, M1 stops responding. You can still access it with
other clients, but only base searches are possible; M1 does not return from
one-level or subtree searches.
I noticed that id2entry.db4 does not get rid of the tombstones after 900 secs. I
assume it has to do with that.

How reproducible:

Steps to Reproduce:
1. Set up MMR with a purgeDelay of 900 sec.
2. Access M1 with a script or any LDAP client (I used a JNDI client).
3. The client iterates in an endless loop through:
- adds 500 entries
- mods these 500 entries
- deletes those 500 entries
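
The loop in step 3 can be sketched as follows. This is a minimal illustration,
not the JNDI client used for this report: it only generates LDIF change records
for one pass, which could then be fed to any LDAP client. The DN pattern
follows the `o=testOrg40,o=test` entry that shows up in the access log; the
object class, the modified attribute (`description`) and all helper names are
assumptions.

```python
BATCH_SIZE = 500  # from the report: 500 adds, 500 mods, 500 deletes per pass


def entry_dn(i, suffix="o=test"):
    """Build a DN such as o=testOrg40,o=test (naming pattern assumed)."""
    return f"o=testOrg{i},{suffix}"


def add_record(i):
    # Object class and attributes are assumptions, not taken from the report.
    return "\n".join([
        f"dn: {entry_dn(i)}",
        "changetype: add",
        "objectClass: top",
        "objectClass: organization",
        f"o: testOrg{i}",
    ])


def modify_record(i, pass_no):
    # The modified attribute is an assumption; any replaceable attribute works.
    return "\n".join([
        f"dn: {entry_dn(i)}",
        "changetype: modify",
        "replace: description",
        f"description: modified in pass {pass_no}",
        "-",
    ])


def delete_record(i):
    return "\n".join([f"dn: {entry_dn(i)}", "changetype: delete"])


def one_pass(pass_no, batch_size=BATCH_SIZE):
    """Return the LDIF records for one add/modify/delete pass, in order."""
    ids = range(batch_size)
    records = [add_record(i) for i in ids]
    records += [modify_record(i, pass_no) for i in ids]
    records += [delete_record(i) for i in ids]
    return records


if __name__ == "__main__":
    # One pass is 1500 operations; the reported client repeats this endlessly.
    ldif = "\n\n".join(one_pass(pass_no=1)) + "\n"
    print(ldif[:200])
```

Every deleted entry leaves a tombstone behind until it is purged, so repeating
the pass is what makes the number of tombstones grow.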
  
Actual results:
M1 does not respond anymore

Comment 1 Noriko Hosoi 2007-10-12 21:29:46 UTC

Thank you for reporting the problem.

To debug this, can we have your errors/access logs, and also the output from
the pstack command?

> M1 will not respond anymore.
Does this mean that add/mod/del operations and one-level or subtree searches
hang?  Do these clients get any error returns?  I'm interested in the stack
trace when this is happening.

Is the problem reproducible every time you run the test?  Does the "not respond"
problem occur after 900 seconds of running, or at random times?

Comment 2 reinhard nappert 2007-10-15 15:54:46 UTC

Created attachment 227761
error and access logs

Comment 3 reinhard nappert 2007-10-15 16:00:05 UTC

To answer your questions:
1. I attached the access and error files. Have a look at connection 27; down at
the bottom you will see:
[12/Oct/2007:15:13:38 -0400] conn=27 op=126788 DEL dn="o=testOrg40,o=test"
without any response. This is the state where the clients hang (waiting
for a response). You will see that a read (base-level search) still works.
How do you create the output of pstack?
2. The problem is reproducible, but the time until it happens varies (it may
take longer or shorter). I actually believe that the purgeDelay does not work,
since id2entry.db4 does not shrink at any time.

Comment 4 reinhard nappert 2007-10-15 18:17:45 UTC

Created attachment 227851
pstack trace

Hi, I reproduced the error and captured the pstack trace of ns-slapd.

Comment 5 Noriko Hosoi 2007-10-15 19:11:24 UTC

Thank you for the stack trace.  It looks like there's a deadlock between thread
#34 and #35.  And VLV is involved again...

I assume you have some vlv indexes in your system.  Is it correct?

And could you check this configuration parameter in your config file: dse.ldif?
  nsslapd-serial-lock

Could it be "off"?

Comment 6 reinhard nappert 2007-10-15 19:21:03 UTC

nsslapd-serial-lock is set to "on".

Yes, I do have VLV indexes in the system. However, I do not perform any
VLV searches during the test. I reported a similar bug, where the server hangs
during updates and VLV searches. It is possible that both are related.

Comment 9 Noriko Hosoi 2007-10-18 21:29:17 UTC

Created attachment 231611
cvs diffs

Files:
 plugins/replication/repl5_plugins.c
 plugins/replication/repl5_replica.c
 slapd/slapi-private.h
 slapd/back-ldbm/ldbm_add.c
 slapd/back-ldbm/ldbm_delete.c
 slapd/back-ldbm/ldbm_modify.c
 slapd/back-ldbm/ldbm_modrdn.c

Description: introduce OP_FLAG_REPL_RUV.  It is set in repl5_replica.c if the
entry being updated is the RUV.  Such an operation should not be blocked at the
backend SERIAL lock.  An RUV update also has nothing to do with VLV, so if the
flag is set, the backend skips the VLV indexing.
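
The effect of the flag can be sketched roughly as below. This is a simplified
model in Python, not the C code from the attached diffs: the bit value of
OP_FLAG_REPL_RUV, the function name and the step names are all assumptions
made for illustration. It only shows which steps an update would go through
with and without the flag.

```python
# Hypothetical bit value; the real definition lives in slapd/slapi-private.h.
OP_FLAG_REPL_RUV = 0x01


def backend_update_steps(op_flags, serial_lock_enabled=True):
    """List the steps a backend update would take for the given flags.

    Step names are illustrative only and do not match real function names.
    """
    is_ruv_update = bool(op_flags & OP_FLAG_REPL_RUV)
    # RUV updates bypass the serial lock so they can never wait on it.
    use_serial_lock = serial_lock_enabled and not is_ruv_update

    steps = []
    if use_serial_lock:
        steps.append("acquire serial lock")
    steps.append("write entry")
    if not is_ruv_update:
        # Only ordinary entries can be members of a VLV index.
        steps.append("update VLV indexes")
    if use_serial_lock:
        steps.append("release serial lock")
    return steps


if __name__ == "__main__":
    print(backend_update_steps(0))
    print(backend_update_steps(OP_FLAG_REPL_RUV))
```

In this model an RUV update touches neither the serial lock nor the VLV code,
so it cannot be part of a lock cycle with an ordinary delete.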

Test case I'm running:
Set up 2-way MMR (Master 1 and 2)
Set purge delay: nsds5ReplicaPurgeDelay: 600
Import an LDIF file on Master 1
Initialize the replica (Master 2) on Master 1
Create browsing indexes on both Master 1 and 2
Then I started running the following test tools:
ldclt -D "cn=Directory manager" -w <passwd> -p <port> -e add -e random -b
ou=Payroll,dc=example,dc=com -f uid=test_XXXX -v -q -n 2 -N 3600 -r 1000 -R
2999 -I 68 -e inetOrgPerson -e imagesdir=<images_dir>
ldclt -D "cn=Directory manager" -w <passwd> -p <port> -e delete -e random -b
ou=payroll,dc=example,dc=com -f uid=test_XXXX -v -q -n 2 -N 3600 -r 1000 -R
2999 -I 32 -e inetOrgPerson -e imagesdir=<images_dir>
ldclt -D "cn=Directory manager" -w <passwd> -p <port> -e
attreplace=sn:a_random_sn_XXXX -e random -b ou=payroll,dc=example,dc=com -f
uid=test_XXXX -v -q -n 2 -N 3600 -r 1000 -R 2999 -I 32

So far, it's been running for 2 hours with occasional checks on the browser
using the VLV/browsing index.

Comment 10 Noriko Hosoi 2007-10-18 21:37:25 UTC

Hello reinhard, I attached the fix proposal in comment #9.

I'm testing the changes as described in the same comment.  So far, I don't see
any problem, including the deadlock.  May I ask you to try the patch to verify
whether it fixes your problem?  Thank you so much for your help in advance...
--noriko

Comment 11 Noriko Hosoi 2007-10-18 22:41:33 UTC

Created attachment 231711
cvs commit message

Reviewed by Rich (Thank you!!!)

Checked in into CVS HEAD.

Comment 12 reinhard nappert 2007-10-19 16:41:10 UTC

I guess I cannot patch my existing 1.0.4 release. Those changes depend on
too many other fixes.
What CVS tag should I use to download the source for verification purposes?

Comment 13 Rich Megginson 2007-10-19 16:49:55 UTC

(In reply to comment #12)
> I guess I can not patch up my existing 1.0.4 release. Those changes depend on
> too many other fixes.
> What CVS tag should I use to download the source for verification purposes.
> 

We're using HEAD (just cvs checkout with no tag).  However, there is good news
and bad news: the bad news is that the build process for Fedora DS 1.1 is
completely different; the good news is that it is based on standard autotools.

Comment 14 Noriko Hosoi 2007-10-19 18:15:39 UTC

Note: the test case in the comment #9 ran 12 hours with no problem.  Did you
have to wait longer till the deadlock occurred?

Comment 15 reinhard nappert 2007-10-19 18:24:14 UTC

Noriko,

the longest the test ran before the server deadlocked was about 4 hours.
12 hours sounds good to me.

I am going to build the latest and let it run as well.

Comment 16 Noriko Hosoi 2007-10-19 18:38:21 UTC

Reinhard, 

Thank you so much for reporting this tricky bug and verifying the fix!  It's
helping us a lot.
--noriko

Comment 17 reinhard nappert 2007-10-22 20:36:05 UTC

Hi Noriko,

I still have trouble building 1.1.1. Hopefully, I can solve this soon. I
actually thought that the 1.0.4 build process was not bad.
I have a question regarding this bug. What really causes it? I guess it
is a race condition. When I reproduce it, it takes quite a few operations (more
than 200 000 operations). However, when our testing guys test our application,
it (sometimes) happens much earlier, which is a concern to me.

Thanks
-Reinhard

Comment 18 Noriko Hosoi 2007-10-23 00:57:51 UTC

Hi Reinhard,

Yes, it was a race condition.  The cause was a conflict between a delete
thread and an MMR thread.  The latter thread was about to update a special entry
called the RUV (replica update vector).  Updating that entry has nothing to do
with VLV.  But since the backend update code is shared between ordinary
operations and RUV updates, it was checking the VLV search list, under its
lock, even though that was not necessary.

To eliminate the deadlock opportunity, I introduced a new flag for RUV.  If it's
set, we skip the VLV update for the special entry.

Regarding the deadlock timing, could the test machine be more powerful than your
machine?  For example, in the number of CPUs?  Some race problems are seldom
observed on a single-CPU machine, but quite easily on a multi-CPU machine...
Could there be such a difference?
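
For readers unfamiliar with this kind of deadlock, here is a minimal,
self-contained sketch of a lock-order inversion. It is not the server code:
the two Python locks merely stand in for the backend serial lock and the lock
on the VLV search list, and which thread held which lock first in the real
stack trace is an assumption. The timeout exists only so that the demo
terminates; in the real server both threads would wait forever.

```python
import threading

serial_lock = threading.Lock()    # stand-in for the backend serial lock
vlv_list_lock = threading.Lock()  # stand-in for the VLV search list lock


def run_inversion_demo(timeout=0.2):
    """Two threads take the two locks in opposite order and both get stuck."""
    barrier = threading.Barrier(2)
    got_second = {}

    def worker(name, first, second):
        with first:
            barrier.wait()  # both threads now hold their first lock
            acquired = second.acquire(timeout=timeout)
            got_second[name] = acquired
            barrier.wait()  # keep the first lock until both attempts are done
            if acquired:
                second.release()

    threads = [
        # Hypothetical ordering: the delete takes the serial lock first.
        threading.Thread(target=worker,
                         args=("delete", serial_lock, vlv_list_lock)),
        # Hypothetical ordering: the RUV update takes the VLV lock first.
        threading.Thread(target=worker,
                         args=("ruv_update", vlv_list_lock, serial_lock)),
    ]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return got_second


if __name__ == "__main__":
    # Neither thread can obtain its second lock while the other holds it.
    print(run_inversion_demo())
```

The barrier forces the bad interleaving every time. Without it, the outcome
depends on scheduling, which is why such a problem shows up more readily on a
faster machine or one with more CPUs.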

Comment 19 reinhard nappert 2007-10-23 12:41:48 UTC

I guess the SQA box is more powerful (Linux, dual CPU, 32 GB RAM), whereas I
used a pretty old SunFire (dual CPU, but much slower, and 8 GB of RAM). Linux's
I/O is also much faster than Sun's.

Any hints for getting 1.1.1 built? Is there a document somewhere?

Comment 20 Rich Megginson 2007-10-23 15:10:17 UTC

(In reply to comment #19)
> I guess the SQA box is more powerful (Linux, dual CPU, 32 GB RAM), where I used
> a pretty old SunFire (dual CPU (much slower) and 8 GB of RAM). Linux's IO is
> also much faster than Sun's.
> 
> Any hints to get 1.1.1 built. Is there somewhere a document?

Not yet, we are still working on it.  Firstly, what platform is this on?  Have
you been able to get the code from CVS?

Comment 21 reinhard nappert 2007-10-23 15:21:42 UTC

Yes, I got the code from CVS. Right now, I am stuck with mod_admserv. pkg-config
cannot find the ldapcsdk, although it is specified with --with-ldapsdk-lib and
--with-ldapsdk-inc.

I am thinking about skipping those parts, since I really do not need those extra
packages.

Comment 22 Rich Megginson 2007-10-23 15:35:03 UTC

No, in order to test this patch, you only need the ldapserver component.  You do
not need mod_*, adminserver, adminutil, etc.

Depending on what platform you are using, you may already have all of the
dependent components (the components that ldapserver uses) on your system.

Comment 23 reinhard nappert 2007-10-23 17:08:35 UTC

I forgot to mention that I am building on Linux. This is where I am:
I got all components (net-snmp, mozilla, db, ...) built. I downloaded the CVS
code (HEAD). When I run configure from ldapserver with all the
component paths

(./configure --with-nspr=../../../mozilla/work/mozilla/dist/OPT.OBJ
--with-nss-inc=../../../mozilla/work/mozilla/dist/public/nss
--with-nss-lib=../../../mozilla/work/mozilla/dist/OPT.OBJ/lib
--with-ldapsdk-inc=../../../mozilla/work/mozilla/dist/public/ldap
--with-ldapsdk-lib=../../../mozilla/work/mozilla/dist/lib
--with-ldapsdk-bin=../../../mozilla/work/mozilla/dist/bin
--with-svrcore-inc=../../../mozilla/work/mozilla/dist/public/svrcore
--with-svrcore-lib=../../../mozilla/work/mozilla/dist/lib
--with-db=../../../db/work/db-4.2.52.NC/built)
it finds the db.h, but complains about the version:
configure: error: ../../../db/work/db-4.2.52.NC/built/include/db.h is version
4.2 but libdb-4.2 not found

Comment 24 Rich Megginson 2007-10-23 17:43:46 UTC

If you are on a Red Hat, CentOS, or Fedora system, you can just use the
db4-devel package for db.  configure will automatically find and use it.  Same
with snmp and sasl.  Most other Linux distros will likely include db4 too.
configure is looking for a file called libdb-4.2.so (it tries to link a test
program using gcc ... -L/path/to/db/lib -ldb-4.2).  The makefiles we use (in
Fedora DS 1.0.4 and earlier) for db create db42.so, I think.  So you might be
able to just rename the library.  With Fedora DS 1.1, I suggest using system
components if possible, to simplify building and running the server.  If you are
using Gentoo or Debian or some other distro, we would be very interested in any
information about building on that platform.
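
A quick way to see why the link test fails is to look at what the db library
directory actually contains. The helper below is a small sketch with a
hypothetical function name; it only inspects file names and does not reproduce
what configure does. The directory to list (for example a lib directory under
the --with-db path) is an assumption about the local build layout.

```python
import os


def check_db_library(filenames, wanted="libdb-4.2.so"):
    """Return (found, candidates) for a listing of a library directory.

    found      -- True if the name that -ldb-4.2 resolves to is present
    candidates -- other shared objects that look like a Berkeley DB build
    """
    found = wanted in filenames
    candidates = sorted(
        name for name in filenames
        if name != wanted
        and name.startswith(("libdb", "db"))
        and ".so" in name
    )
    return found, candidates


if __name__ == "__main__":
    lib_dir = "."  # replace with the db library directory of your build
    print(check_db_library(os.listdir(lib_dir)))
```

If the first value is False and the candidate list shows a differently named
library such as db42.so, renaming or symlinking it to libdb-4.2.so is the
workaround suggested above.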

Comment 25 reinhard nappert 2007-10-24 14:31:52 UTC

Eventually the make of ldapserver finished successfully. Afterwards, I executed
make install, which installed it in /opt/dirsrv.
Now, how do I install the directory instance? I executed sbin/setup-ds.pl, which
failed due to "Can't locate Mozilla/LDAP/Conn.pm ...."

What is the right procedure?

I am sure you had good reasons to change the entire directory
build/installation procedure and structure as well.
From what I have seen so far, I liked the old stuff much better. It worked
nicely and I was able to build it within a reasonable time for Linux and
Solaris, which I cannot say anymore.
More importantly, I am quite concerned about the migration.
I am considering applying the fixes for this bug to the 104 code base.
It does not look like an easy task, though....

Comment 26 Rich Megginson 2007-10-24 15:13:58 UTC

I apologize for that.  There are many reasons we are switching:
1) The open source community really, really wanted the package to have better OS
integration and use the FHS standard -
http://directory.fedoraproject.org/wiki/FHS_Packaging
2) Most open source developers prefer autotools - Fedora DS 1.0 Makefiles are
too daunting for most - although the One Step Build is nice, it's not very portable
3) Splitting the giant package into modular pieces makes building only one of
them much easier and makes maintenance much simpler -
http://directory.fedoraproject.org/wiki/Discrete_Packaging - this means rpms and
solaris pkgs for the components.

I realize that you may not care about any of these, and that the current system
works fine for you, but it is imperative that we grow the developer and user
base of Fedora DS, and we saw these steps as necessary.  So, please, bitte, bear
with us while we go through this process.

Comment 27 reinhard nappert 2007-10-24 18:59:01 UTC

Rich, there is no need to apologize :)
I can see all of those points, but I still have to say that the old way just
worked fine. I actually made the effort to apply just the relevant fix to the
104 source base. Guess what: I had the Linux build within minutes. I just built
it for Solaris and this looks good so far as well. I guess the point is that the
build procedure for 1.1.1 still has to be fine-tuned.
Anyway, to my problem:
The test is still running without any issue. I will try to get a much faster
box to test it.

Back to 104 vs 111:
How does the migration work? Background: We ship FDS in one of our products and
at some point we have to switch to 1.1.1 or later.

Any feedback is appreciated.

Comment 28 Rich Megginson 2007-10-24 19:14:24 UTC

(In reply to comment #27)
> I guess the point is that the
> build procedure for 1.1.1 still has to be fine tuned.

Yes.  At some point in the near future, you will be able to download or use yum
to install the devel packages to facilitate building.  And we will have proper docs.

> Anyway, to my problem: 
> The test is still running without any issue. I try to get a much faster box to
> test it.

Great!

> back to 104 vs 111: 
> How does the migration work? Background: We ship FDS in one of our products and
> at one point of time we have to switch to 1.1.1 or later. 
> 
> Any feedback is appreciated.

We have migration scripts that will automate everything.  See
http://directory.fedoraproject.org/wiki/DS_Admin_Migration and
http://directory.fedoraproject.org/wiki/FDS_Setup#migrate-ds-admin.pl
