Bug 1029870 - DS stuck after adding new backend
Status: CLOSED WONTFIX
Product: Red Hat Enterprise Linux 7
Classification: Red Hat
Component: 389-ds-base
Version: 7.0
Hardware: Unspecified
OS: Unspecified
Priority: unspecified
Severity: unspecified
Target Milestone: rc
Target Release: ---
Assigned To: Rich Megginson
QA Contact: Sankar Ramalingam
Reported: 2013-11-13 07:07 EST by Ján Rusnačko
Modified: 2014-01-07 01:22 EST
CC List: 3 users

Doc Type: Bug Fix
Last Closed: 2014-01-07 01:03:25 EST
Type: Bug


Attachments
Stacktraces of all threads (41.10 KB, text/plain)
2013-11-13 07:07 EST, Ján Rusnačko
stacktraces -- db_open and checkpoint threads are hanging (9.19 KB, text/plain)
2013-12-16 20:44 EST, Noriko Hosoi

Description Ján Rusnačko 2013-11-13 07:07:53 EST
Created attachment 823383
Stacktraces of all threads

Description of problem:
During execution of basic testsuite, a new backend is being added:

$LDAPMODIFY -v -h $LDAPhost -p $LDAPport -D "$ROOTDN" -w "$ROOTDNPW" -c <<-EOF 2>&1
dn: cn=$bname,cn=ldbm database,cn=plugins,cn=config
changetype: add
objectclass: top
objectclass: extensibleObject
objectclass: nsBackendInstance
cn: $bname
nsslapd-suffix: $sf
nsslapd-cachesize: $cachesize
nsslapd-cachememsize: $cachememsize
EOF

This ldapmodify hangs indefinitely. 

These two threads are active:

(gdb) thread apply 39 bt

Thread 39 (Thread 0x7fcd1a838700 (LWP 23599)):
#0  pthread_cond_wait@@GLIBC_2.3.2 () at ../nptl/sysdeps/unix/sysv/linux/x86_64/pthread_cond_wait.S:185
#1  0x00007fcd21a63f63 in __db_pthread_mutex_condwait (env=0x7fcd29a8a3e0, mutex=0, timespec=0x0, mutexp=<optimized out>) at ../../src/mutex/mut_pthread.c:321
#2  __db_hybrid_mutex_suspend (env=env@entry=0x7fcd29a8a3e0, mutex=mutex@entry=1082, timespec=timespec@entry=0x0, exclusive=exclusive@entry=1) at ../../src/mutex/mut_pthread.c:577
#3  0x00007fcd21a63440 in __db_tas_mutex_lock_int (nowait=0, timeout=<optimized out>, mutex=<optimized out>, env=0x7fcd29a8a3e0, env@entry=0x7fcd1b9ec448) at ../../src/mutex/mut_tas.c:255
#4  __db_tas_mutex_lock (env=env@entry=0x7fcd29a8a3e0, mutex=1082, timeout=timeout@entry=0) at ../../src/mutex/mut_tas.c:286
#5  0x00007fcd21b93b78 in __txn_checkpoint (env=env@entry=0x7fcd29a8a3e0, kbytes=kbytes@entry=0, minutes=minutes@entry=0, flags=flags@entry=1) at ../../src/txn/txn_chkpt.c:140
#6  0x00007fcd21b940dc in __txn_checkpoint_pp (dbenv=<optimized out>, kbytes=0, minutes=0, flags=1) at ../../src/txn/txn_chkpt.c:81
#7  0x00007fcd1ee98887 in dblayer_force_checkpoint (li=li@entry=0x7fcd29a45050) at ldap/servers/slapd/back-ldbm/dblayer.c:5061
#8  0x00007fcd1ee98a28 in checkpoint_threadmain (param=0x7fcd29a45050) at ldap/servers/slapd/back-ldbm/dblayer.c:4464
#9  0x00007fcd27538eb1 in _pt_root (arg=0x7fcd29b0c0f0) at ../../../nspr/pr/src/pthreads/ptthread.c:204
#10 0x00007fcd26edade3 in start_thread (arg=0x7fcd1a838700) at pthread_create.c:308
#11 0x00007fcd26c081ad in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:113
(gdb) thread apply 26 bt

Thread 26 (Thread 0x7fcd0b7fe700 (LWP 23627)):
#0  pthread_cond_wait@@GLIBC_2.3.2 () at ../nptl/sysdeps/unix/sysv/linux/x86_64/pthread_cond_wait.S:185
#1  0x00007fcd21a63f63 in __db_pthread_mutex_condwait (env=0x7fcd29a8a3e0, mutex=0, timespec=0x0, mutexp=<optimized out>) at ../../src/mutex/mut_pthread.c:321
#2  __db_hybrid_mutex_suspend (env=env@entry=0x7fcd29a8a3e0, mutex=mutex@entry=1322, timespec=timespec@entry=0x0, exclusive=exclusive@entry=1) at ../../src/mutex/mut_pthread.c:577
#3  0x00007fcd21a63440 in __db_tas_mutex_lock_int (nowait=0, timeout=<optimized out>, mutex=<optimized out>, env=0x7fcd29a8a3e0, env@entry=0x7fcd1b9f5324) at ../../src/mutex/mut_tas.c:255
#4  __db_tas_mutex_lock (env=env@entry=0x7fcd29a8a3e0, mutex=1322, timeout=timeout@entry=0) at ../../src/mutex/mut_tas.c:286
#5  0x00007fcd21b6e19f in __log_flush_int (dblp=dblp@entry=0x7fcd29a8dc30, lsnp=lsnp@entry=0x7fcd0b7f7db0, release=release@entry=1) at ../../src/log/log_put.c:1046
#6  0x00007fcd21b6ef75 in __log_flush_commit (flags=10, lsnp=0x7fcd0b7f7d90, env=0x7fcd29a8a3e0) at ../../src/log/log_put.c:538
#7  __log_put (env=env@entry=0x7fcd29a8a3e0, lsnp=lsnp@entry=0x7fcd1bcf1118, udbt=udbt@entry=0x7fcd0b7f7fa0, flags=flags@entry=10) at ../../src/log/log_put.c:312
#8  0x00007fcd21b6ffde in __log_put_record_int (env=0x7fcd29a8a3e0, dbp=<optimized out>, txnp=0x7fccc809e2a0, ret_lsnp=0x7fcd0b7f8150, flags=2, rectype=143, has_data=has_data@entry=0, 
    size=size@entry=93, spec=spec@entry=0x7fcd21de3fc0 <__fop_create_desc>, argp=argp@entry=0x7fcd0b7f8038) at ../../src/log/log_put.c:2002
#9  0x00007fcd21b70db0 in __log_put_record (env=env@entry=0x7fcd29a8a3e0, dbp=dbp@entry=0x0, txnp=txnp@entry=0x7fccc809e2a0, ret_lsnp=ret_lsnp@entry=0x7fcd0b7f8150, flags=flags@entry=2, 
    rectype=rectype@entry=143, has_data=has_data@entry=0, size=93, spec=0x7fcd21de3fc0 <__fop_create_desc>) at ../../src/log/log_put.c:1743
#10 0x00007fcd21b5e600 in __fop_create_log (mode=384, appname=1, dirname=0x7fcd0b7f81a0, name=0x7fcd0b7f8170, flags=2, ret_lsnp=0x7fcd0b7f8150, txnp=0x7fccc809e2a0, env=0x7fcd29a8a3e0)
    at ../../src/dbinc_auto/fileops_auto.h:40
#11 __fop_create (env=env@entry=0x7fcd29a8a3e0, txn=0x7fccc809e2a0, fhpp=fhpp@entry=0x7fcd0b7f82a0, name=0x7fccc809e110 "bug766322db/8000007a.5e64af8b", dirp=dirp@entry=0x7fccc809b270, 
    appname=appname@entry=DB_APP_DATA, mode=mode@entry=384, flags=flags@entry=0) at ../../src/fileops/fop_basic.c:92
#12 0x00007fcd21b61cbb in __fop_file_setup (dbp=dbp@entry=0x7fccc809b210, ip=ip@entry=0x0, txn=<optimized out>, txn@entry=0x7fccc808f0c0, 
    name=name@entry=0x7fccc80a6c50 "bug766322db/id2entry.db", mode=mode@entry=384, flags=flags@entry=33, retidp=retidp@entry=0x7fcd0b7f85b4) at ../../src/fileops/fop_util.c:612
#13 0x00007fcd21b37346 in __db_open (dbp=dbp@entry=0x7fccc809b210, ip=0x0, txn=0x7fccc808f0c0, fname=fname@entry=0x7fccc80a6c50 "bug766322db/id2entry.db", dname=dname@entry=0x0, 
    type=type@entry=DB_BTREE, flags=flags@entry=33, mode=mode@entry=384, meta_pgno=meta_pgno@entry=0) at ../../src/db/db_open.c:187
#14 0x00007fcd21b31466 in __db_open_pp (dbp=0x7fccc809b210, txn=0x7fccc808f0c0, fname=0x7fccc80a6c50 "bug766322db/id2entry.db", dname=0x0, type=DB_BTREE, flags=33, mode=384)
    at ../../src/db/db_iface.c:1191
#15 0x00007fcd1ee9a3e5 in dblayer_instance_start (be=be@entry=0x7fccc807eb60, mode=mode@entry=2) at ldap/servers/slapd/back-ldbm/dblayer.c:2271
#16 0x00007fcd1eeb8bed in ldbm_instance_start (be=0x7fccc807eb60) at ldap/servers/slapd/back-ldbm/instance.c:293
#17 0x00007fcd1eed041c in ldbm_instance_postadd_instance_entry_callback (pb=<optimized out>, entryBefore=<optimized out>, entryAfter=<optimized out>, returncode=<optimized out>, 
    returntext=<optimized out>, arg=<optimized out>) at ldap/servers/slapd/back-ldbm/ldbm_instance_config.c:906
#18 0x00007fcd28f07f1b in dse_call_callback (pb=pb@entry=0x7fcd29d1dfe0, operation=operation@entry=16, flags=flags@entry=2, entryBefore=0x7fccc8001140, entryAfter=entryAfter@entry=0x0, 
    returncode=returncode@entry=0x7fcd0b7f9814, returntext=returntext@entry=0x7fcd0b7f9870 "", pdse=<optimized out>) at ldap/servers/slapd/dse.c:2421
#19 0x00007fcd28f0aaa6 in dse_add (pb=0x7fcd29d1dfe0) at ldap/servers/slapd/dse.c:2216
#20 0x00007fcd28ef5f9a in op_shared_add (pb=pb@entry=0x7fcd29d1dfe0) at ldap/servers/slapd/add.c:735
#21 0x00007fcd28ef7258 in do_add (pb=0x7fcd29d1dfe0) at ldap/servers/slapd/add.c:258
#22 0x00007fcd29404d98 in connection_dispatch_operation (pb=<optimized out>, op=0x7fcd29cd89c0, conn=0x7fcd292fbf48) at ldap/servers/slapd/connection.c:582
#23 connection_threadmain () at ldap/servers/slapd/connection.c:2348
#24 0x00007fcd27538eb1 in _pt_root (arg=0x7fcd29d007d0) at ../../../nspr/pr/src/pthreads/ptthread.c:204
#25 0x00007fcd26edade3 in start_thread (arg=0x7fcd0b7fe700) at pthread_create.c:308
#26 0x00007fcd26c081ad in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:113

Attaching full stacktrace.

Version-Release number of selected component (if applicable):
389-ds-base-1.3.1.6-8.el7.x86_64
Comment 2 Noriko Hosoi 2013-11-13 13:37:47 EST
Interesting.  Adding a backend at a random time does not cause this hang; to reproduce it, the add must happen at the same time as a checkpoint, right?

It looks like we may need to protect DB_OPEN with ENV->dblayer_env_lock.
1955 int dblayer_instance_start(backend *be, int mode)
2271         DB_OPEN(mypEnv->dblayer_openflags,
2272                 dbp, NULL/* txnid */, id2entry_file, subname, DB_BTREE,
2273                 open_flags, priv->dblayer_file_mode, return_value);

Investigating...
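
For reference, the checkpoint side of the race (frame #7 of thread 39 above) boils down to a forced checkpoint on the shared environment.  A minimal sketch against the libdb C API follows; the function name and the bare DB_ENV handle are illustrative, not the actual dblayer code:

#include <db.h>

/* Illustrative sketch only: roughly what dblayer_force_checkpoint asks
 * libdb to do.  "env" stands in for the server's shared DB_ENV handle. */
static int
force_checkpoint(DB_ENV *env)
{
    /* kbytes == 0 and minutes == 0 mean "checkpoint unconditionally";
     * DB_FORCE is consistent with the flags=1 seen in the
     * __txn_checkpoint frame. */
    return env->txn_checkpoint(env, 0, 0, DB_FORCE);
}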
Comment 3 Rich Megginson 2013-11-13 14:01:59 EST
file:///usr/share/doc/db4-devel-4.7.25/api_c/db_open.html

 flags

    The flags parameter must be set to 0 or by bitwise inclusively OR'ing together one or more of the following values: 
 DB_AUTO_COMMIT

    Enclose the DB->open call within a transaction. If the call succeeds, the open operation will be recoverable and all subsequent database modification operations based on this handle will be transactionally protected. If the call fails, no database will have been created.
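
For a concrete picture of the flag under discussion, here is a minimal, self-contained sketch of a transactional DB->open with DB_AUTO_COMMIT; the environment path, file name, and mode below are made up for the example and are not the server's actual configuration:

#include <db.h>
#include <stdio.h>

int
main(void)
{
    DB_ENV *env;
    DB *dbp;
    int ret;

    if ((ret = db_env_create(&env, 0)) != 0)
        return 1;
    /* A transactional environment, as 389-ds-base configures one. */
    ret = env->open(env, "/tmp/dbenv",
                    DB_CREATE | DB_INIT_LOCK | DB_INIT_LOG |
                    DB_INIT_MPOOL | DB_INIT_TXN, 0);
    if (ret != 0) {
        fprintf(stderr, "env->open: %s\n", db_strerror(ret));
        return 1;
    }
    if ((ret = db_create(&dbp, env, 0)) != 0)
        return 1;
    /* DB_AUTO_COMMIT wraps the open itself in a transaction, so no
     * explicit txnid is passed; the open should be recoverable. */
    ret = dbp->open(dbp, NULL /* txnid */, "id2entry.db", NULL,
                    DB_BTREE, DB_CREATE | DB_AUTO_COMMIT, 0600);
    if (ret != 0)
        fprintf(stderr, "db->open: %s\n", db_strerror(ret));
    dbp->close(dbp, 0);
    env->close(env, 0);
    return ret;
}

(The directory /tmp/dbenv must exist before env->open is called.)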
Comment 4 Noriko Hosoi 2013-11-13 14:11:52 EST
(In reply to Rich Megginson from comment #3)
> file:///usr/share/doc/db4-devel-4.7.25/api_c/db_open.html
> 
>  flags
> 
>     The flags parameter must be set to 0 or by bitwise inclusively OR'ing
> together one or more of the following values: 
>  DB_AUTO_COMMIT
> 
>     Enclose the DB->open call within a transaction. If the call succeeds,
> the open operation will be recoverable and all subsequent database
> modification operations based on this handle will be transactionally
> protected. If the call fails, no database will have been created.

It makes sense.  I could try that.

The problem is that it's quite hard for me to reproduce.  I kept adding user entries and backends at the same time for longer than 20 minutes, but I still cannot hit the timing...  So far, I have added 330 backends...
Comment 5 Rich Megginson 2013-11-13 14:14:17 EST
We are using DB_AUTO_COMMIT for db->open - but it doesn't appear to be working?
Comment 6 Noriko Hosoi 2013-11-13 14:17:40 EST
(In reply to Rich Megginson from comment #5)
> We are using DB_AUTO_COMMIT for db->open - but it doesn't appear to be
> working?

Oh, you meant if DB_AUTO_COMMIT was set, we should not have to put DB_OPEN in the transaction?
Comment 7 Noriko Hosoi 2013-11-13 14:23:47 EST
It looks like DB transactions and checkpointing do not exclude each other on their own; it is Env->dblayer_env_lock that protects them from each other.  So, I guess DB_OPEN also needs it, even though it sets DB_AUTO_COMMIT...
Comment 8 Noriko Hosoi 2013-11-13 14:59:08 EST
Hi Ján,

> During execution of basic testsuite, a new backend is being added:

Here's what I did:
Terminal 1: Ran basic test suite.
Terminal 2: Ran this script:
cnt=0
while [ $cnt -lt 10000 ]
do
echo "======================="
bname="BE_$cnt"
sf="o=BE_$cnt"
echo $bname
echo $sf
ldapmodify -x -v -h $LDAPhost -p $LDAPport -D "$ROOTDN" -w "$ROOTDNPW" -c <<-EOF 2>&1
dn: cn=$bname,cn=ldbm database,cn=plugins,cn=config
changetype: add
objectclass: top
objectclass: extensibleObject
objectclass: nsBackendInstance
cn: $bname
nsslapd-suffix: $sf
nsslapd-cachesize: $cachesize
nsslapd-cachememsize: $cachememsize
EOF
cnt=`expr $cnt + 1`
done

I could successfully finish the basic test and add 298 backends.  Beyond this, do you have any idea what I need to do to hang the server?  I'm wondering how easy it is for you to make it happen.
Comment 9 Rich Megginson 2013-11-13 15:04:51 EST
(In reply to Noriko Hosoi from comment #6)
> (In reply to Rich Megginson from comment #5)
> > We are using DB_AUTO_COMMIT for db->open - but it doesn't appear to be
> > working?
> 
> Oh, you meant if DB_AUTO_COMMIT was set, we should not have to put DB_OPEN
> in the transaction?

Yes.  If DB_AUTO_COMMIT is set, then DB_OPEN should automatically use a transaction during the db open.

I notice in dblayer.c that with older versions of bdb (< 4.1) we had to do our own locking around db open and txn_checkpoint.
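
As an illustration of that old-style exclusion (all names below are invented for the demo; this is not the dblayer code), a reader-writer lock is enough to keep a db open from interleaving with a checkpoint:

#include <pthread.h>
#include <stdio.h>
#include <unistd.h>

static pthread_rwlock_t env_lock = PTHREAD_RWLOCK_INITIALIZER;

/* Stand-in for checkpoint_threadmain: takes the lock exclusively,
 * the way DB_CHECKPOINT_LOCK did for bdb < 4.1. */
static void *
checkpoint_thread(void *arg)
{
    (void)arg;
    for (int i = 0; i < 3; i++) {
        pthread_rwlock_wrlock(&env_lock);
        puts("checkpointing...");
        sleep(1);
        pthread_rwlock_unlock(&env_lock);
    }
    return NULL;
}

/* Stand-in for dblayer_instance_start: holds a read lock across the
 * open so it can never run in the middle of a checkpoint. */
static void *
db_open_thread(void *arg)
{
    (void)arg;
    pthread_rwlock_rdlock(&env_lock);
    puts("opening backend database...");
    pthread_rwlock_unlock(&env_lock);
    return NULL;
}

int
main(void)
{
    pthread_t t1, t2;
    pthread_create(&t1, NULL, checkpoint_thread, NULL);
    pthread_create(&t2, NULL, db_open_thread, NULL);
    pthread_join(t1, NULL);
    pthread_join(t2, NULL);
    return 0;
}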
Comment 10 Ján Rusnačko 2013-11-13 16:03:45 EST
(In reply to Noriko Hosoi from comment #8)
> Hi Ján,
> 
> > During execution of basic testsuite, a new backend is being added:
> 
> Here's what I did:
> Terminal 1: Ran basic test suite.
> Terminal 2: Ran this script:
> cnt=0
> while [ $cnt -lt 10000 ]
> do
> echo "======================="
> bname="BE_$cnt"
> sf="o=BE_$cnt"
> echo $bname
> echo $sf
> ldapmodify -x -v -h $LDAPhost -p $LDAPport -D "$ROOTDN" -w "$ROOTDNPW" -c
> <<-EOF 2>&1
> dn: cn=$bname,cn=ldbm database,cn=plugins,cn=config
> changetype: add
> objectclass: top
> objectclass: extensibleObject
> objectclass: nsBackendInstance
> cn: $bname
> nsslapd-suffix: $sf
> nsslapd-cachesize: $cachesize
> nsslapd-cachememsize: $cachememsize
> EOF
> cnt=`expr $cnt + 1`
> done
> 
> I could successfully finish the basic test and add 298 backends.  In
> addition to this, do you have any idea what I need to do to hang the server?
> I'm wondering how easy for you to make it happen?
Hello Noriko,

I literally just ran the basic testsuite, and when I saw the log output stuck on adding a new backend, I attached gdb.

Now I re-ran the basic testsuite, and it hung again - but this time on a simple ldapsearch like this:

ldapsearch -D ... -b "cn=config" objectclass=*

and gdb stacktraces look the same as above.

I will share that VM with you tomorrow.
Comment 11 Noriko Hosoi 2013-11-13 17:05:55 EST
(In reply to Rich Megginson from comment #9)
> I notice in dblayer.c that when using older versions of bdb < 4.1 we had to
> do our own locking for db open and txn_checkpoint.

I see.  So, it could be a regression in libdb...?  The latest libdb on rhel-7.0 is libdb-5.3.21-14.el7.  Mine is a bit older than that: libdb-5.3.21-3.fc18.x86_64.

I scanned the changelogs of 5.3.28 and 6.0.20 at 
http://www.oracle.com/technetwork/products/berkeleydb/downloads/index-082944.html
but I could not find any related issues...
Comment 12 Noriko Hosoi 2013-11-13 17:06:55 EST
(In reply to Ján Rusnačko from comment #10)
> I will share that VM with you tomorrow.

Thanks a lot, Ján!!
Comment 13 Ján Rusnačko 2013-11-14 09:09:51 EST
(In reply to Noriko Hosoi from comment #12)
> (In reply to Ján Rusnačko from comment #10)
> > I will share that VM with you tomorrow.
> 
> Thanks a lot, Ján!!
Bad news - I have copied that VM to another server for you to reach, but I fail to reproduce there :( However, when I re-ran the tests on my machine, I could reproduce (this time it hung on backend creation again).

I notice one more detail - when I start the VM on my machine, gnome shell immediately starts consuming 80-90% of CPU time and slows down right after hang occurs. However, I don't see this on the server where I fail to reproduce.

I guess we can assume this hang is caused by excessive CPU usage of gnome-shell. How do I further debug what is going on there?
Comment 14 Noriko Hosoi 2013-11-14 12:34:14 EST
(In reply to Ján Rusnačko from comment #13)
> I notice one more detail - when I start the VM on my machine, gnome shell
> immediately starts consuming 80-90% of CPU time and slows down right after
> hang occurs. 

CPU usage on the host goes up to 80-90%, or the VM's?  "When you start the VM", are the basic tests launched as well?  (I'm curious exactly when the CPU usage jumps up -- is it a VM issue or a basic test issue...?)  And "slows down" means the VM slows down?  I.e., the response time for any operation gets longer?

Could there be any other sign?  Since the hang is in libdb, something could be wrong in the disk IO of the VM.  Do you see any IO difficulties?  Could you attach the output of iostat (in the package sysstat)?

> However, I don't see this on the server where I fail to reproduce.
> 
> I guess we can assume this hang is caused by excessive CPU usage of
> gnome-shell. How do I further debug what is going on there ?

Well, hard to say...  I think there is a potential risk that causes this hang, and it might be magnified by some slow IO?  It'd be really hard to make it happen under normal conditions, but it could happen...?
Comment 15 Ján Rusnačko 2013-11-15 04:47:59 EST
(In reply to Noriko Hosoi from comment #14)
> (In reply to Ján Rusnačko from comment #13)
> > I notice one more detail - when I start the VM on my machine, gnome shell
> > immediately starts consuming 80-90% of CPU time and slows down right after
> > hang occurs. 
> 
> CPU usage on the host goes up to 80-90% or the VM's?  "When you start the
> VM", the basic tests are also launched as well?  (I'm curious exactly when
> the CPU usage jumps up -- it's a VM issue or the basic test issue...?)  And
> "slows down" means the VM slows down?  I.e., the response time for any
> operation gets longer?
> 
> Could there be any other sign?  Since the hang is in libdb, something could
> be wrong in the disk IO of the VM.  Do you see any IO difficulties?  Output
> of iostat (in the package sysstat)?
Ok, it seems like the load is triggered when I open the network settings (upper right corner -> Network icon -> Network settings). Gnome shell starts eating CPU, and there is also some IO:

[jrusnack@localhost ~]$ sudo iostat
Linux 3.10.0-42.el7.x86_64 (localhost.localdomain) 	11/15/2013 	_x86_64_	(2 CPU)

avg-cpu:  %user   %nice %system %iowait  %steal   %idle
          24.63    0.03    4.30   13.18    0.00   57.87

Device:            tps    kB_read/s    kB_wrtn/s    kB_read    kB_wrtn
vda              55.80      1606.23        40.19     292799       7326
dm-0              1.95         7.81         0.00       1424          0
dm-1             53.61      1491.34        28.96     271855       5278

Also, it stops when the screensaver is activated. Iotop shows 0 bytes of I/O.
> 
> > However, I don't see this on the server where I fail to reproduce.
> > 
> > I guess we can assume this hang is caused by excessive CPU usage of
> > gnome-shell. How do I further debug what is going on there ?
> 
> Well, hard to say...  I think there is a potential risk which causes this
> hang, which might be magnified because of some slow IO?  It'd be really hard
> to make it happen in the normal condition, but it could happen...?
I take back what I said about not being able to reproduce on another server. If I open the network settings, the load is triggered - I will share the server details with you.
Comment 16 Noriko Hosoi 2013-11-15 20:09:17 EST
Upstream ticket:
https://fedorahosted.org/389/ticket/47600
Comment 17 Noriko Hosoi 2013-11-15 20:45:58 EST
Note:
1) Checkpoint is no longer protected by the ENV->dblayer_env_lock.
 The dblayer comment says:
 /* 608145: db4.1 and newer does not require exclusive lock for checkpointing 
  * and transactions */
 #define DB_CHECKPOINT_LOCK(use_lock, lock) ;
 #define DB_CHECKPOINT_UNLOCK(use_lock, lock) ;

2) Both threads are waiting at the same pthread_cond_wait.  It does not look like a deadlock between threads 26 and 39, but rather some lock leak?
Thread 39 (Thread 0x7fcd1a838700 (LWP 23599)):
Thread 26 (Thread 0x7fcd0b7fe700 (LWP 23627)):
#0  pthread_cond_wait@@GLIBC_2.3.2 () at ../nptl/sysdeps/unix/sysv/linux/x86_64/pthread_cond_wait.S:185
#1  0x00007fcd21a63f63 in __db_pthread_mutex_condwait (env=0x7fcd29a8a3e0, mutex=0, timespec=0x0, mutexp=<optimized out>) at ../../src/mutex/mut_pthread.c:321
#2  __db_hybrid_mutex_suspend (env=env@entry=0x7fcd29a8a3e0, mutex=mutex@entry=1082, timespec=timespec@entry=0x0, exclusive=exclusive@entry=1) at ../../src/mutex/mut_pthread.c:577
#3  0x00007fcd21a63440 in __db_tas_mutex_lock_int (nowait=0, timeout=<optimized out>, mutex=<optimized out>, env=0x7fcd29a8a3e0, env@entry=0x7fcd1b9ec448) at ../../src/mutex/mut_tas.c:255
#4  __db_tas_mutex_lock (env=env@entry=0x7fcd29a8a3e0, mutex=1082, timeout=timeout@entry=0) at ../../src/mutex/mut_tas.c:286

Ján, when the hang occurs, could you please run this command line and attach the output to this bug?
# db_stat -C A -h /var/lib/dirsrv/slapd-<ID>/db/

Thanks!
--noriko
Comment 18 Noriko Hosoi 2013-11-15 21:21:26 EST
(In reply to Noriko Hosoi from comment #17)
> 
> 2) Both threads are waiting at the same pthread_cond_wait.  It does not look
> like a deadlock between thread 26 and 39, but rather some lock leak?
> Thread 39 (Thread 0x7fcd1a838700 (LWP 23599)):
> Thread 26 (Thread 0x7fcd0b7fe700 (LWP 23627)):
> #0  pthread_cond_wait@@GLIBC_2.3.2 () at
> ../nptl/sysdeps/unix/sysv/linux/x86_64/pthread_cond_wait.S:185
> #1  0x00007fcd21a63f63 in __db_pthread_mutex_condwait (env=0x7fcd29a8a3e0,
> mutex=0, timespec=0x0, mutexp=<optimized out>) at
> ../../src/mutex/mut_pthread.c:321
> #2  __db_hybrid_mutex_suspend (env=env@entry=0x7fcd29a8a3e0,
> mutex=mutex@entry=1082, timespec=timespec@entry=0x0,
> exclusive=exclusive@entry=1) at ../../src/mutex/mut_pthread.c:577
> #3  0x00007fcd21a63440 in __db_tas_mutex_lock_int (nowait=0,
> timeout=<optimized out>, mutex=<optimized out>, env=0x7fcd29a8a3e0,
> env@entry=0x7fcd1b9ec448) at ../../src/mutex/mut_tas.c:255
> #4  __db_tas_mutex_lock (env=env@entry=0x7fcd29a8a3e0, mutex=1082,
> timeout=timeout@entry=0) at ../../src/mutex/mut_tas.c:286

Never mind.  Not on the same mutex.
Thread 39: mutex=mutex@entry=1082
Thread 26: mutex=mutex@entry=1322
Need more investigation... :(
Comment 19 Noriko Hosoi 2013-12-04 19:38:56 EST
This is a tough one... I still cannot reproduce the hang on the RHEL7 VM from RHEVM...

This is the OS info:
# uname -a
Linux nhrhel7.idm.lab.bos.redhat.com 3.10.0-57.el7.x86_64 #1 SMP Tue Nov 26 17:10:51 EST 2013 x86_64 x86_64 x86_64 GNU/Linux

I'm using this pair:
libdb-5.3.21-14.el7.x86_64
389-ds-base-1.3.1.6-11.el7.x86_64

While running the acceptance/basic test, I executed this script.  It added 1863 backends without hanging.
# cat addbe.sh
LDAPhost="localhost"
LDAPport=389
ROOTDN="cn=directory manager"
ROOTDNPW="password"
cachesize="-1"
cachememsize=8388608

cnt=0
while [ $cnt -lt 10000 ]
do
echo "======================="
bname="BE_$cnt"
sf="o=BE_$cnt"
echo $bname
echo $sf
ldapmodify -x -v -h $LDAPhost -p $LDAPport -D "$ROOTDN" -w "$ROOTDNPW" -c <<-EOF 2>&1
dn: cn=$bname,cn=ldbm database,cn=plugins,cn=config
changetype: add
objectclass: top
objectclass: extensibleObject
objectclass: nsBackendInstance
cn: $bname
nsslapd-suffix: $sf
nsslapd-cachesize: $cachesize
nsslapd-cachememsize: $cachememsize
EOF
cnt=`expr $cnt + 1`
done
----------------------------------------------------------------------------
Please note that the basic test passed 100%:
Subject: 389-ds-on_rhel7 Linux_RHEL_x86_64-64 Acceptance test report - 100% passed on nhrhel7.idm.lab.bos.redhat.com using

I also tried to checkpoint more often:
nsslapd-db-checkpoint-interval: 6

Since Ján moved to another team, can someone from the QE team provide the reproducer?

Thanks!
Comment 20 Ján Rusnačko 2013-12-16 07:01:21 EST
Steps to reproduce:
1) Start RHEL 7 VM, image I am using is RHEL-7.0-20131111.1-Server-x86_64-dvd1
2) Monitor CPU and IO usage
3) Click on network icon, then open Network Settings
4) CPU usage goes up, gnome-shell process will start eating CPU
5) Try to reproduce the problem - run basic testsuite, or just try to add a backend.

Please let me know if this does not work.
Comment 21 Noriko Hosoi 2013-12-16 20:44:26 EST
Created attachment 837512
stacktraces -- db_open and checkpoint threads are hanging

RHEL-7.0
Linux vm-098.idm.lab.bos.redhat.com 3.10.0-57.el7.x86_64 #1 SMP Tue Nov 26 17:10:51 EST 2013 x86_64 x86_64 x86_64 GNU/Linux

389-ds-base-1.3.1.6-11.el7.x86_64
libdb-5.3.21-14.el7.x86_64
Comment 22 Noriko Hosoi 2013-12-16 20:57:48 EST
(In reply to Ján Rusnačko from comment #20)
> Steps to reproduce:
> 1) Start RHEL 7 VM, image I am using is
> RHEL-7.0-20131111.1-Server-x86_64-dvd1
> 2) Monitor CPU and IO usage
> 3) Click on network icon, then open Network Settings
> 4) CPU usage goes up, gnome-shell process will start eating CPU
> 5) Try to reproduce the problem - run basic testsuite, or just try to add a
                                                         ^^      
> backend.

Ján, thank you for the steps!  Let me confirm one thing with you about step (5).  Is it "or", not "and"?  Isn't it that while running the basic testsuite, you have to add backends?  I assume the answer is "and", based upon the original description "During execution of basic testsuite, a new backend is being added".

I did run the basic testsuite AND add backends at the same time.  I could reproduce the hang.  I attached the stacktraces in comment 21.

I also ran another round, and I'd like you or someone from the QE team to try this...

In basic.sh, eliminate the bug856657-related test cases as follows, then run the basic testsuite AND add backends.  I think the server hang problem will not occur any more (hopefully :).

    #ic2="tp3 tp4 tp5 tp6 tp7 tp8 tp9 tp10 tp11 tp12 tp13 tp14 tp15 bug192901 trac365 bug856657_setup bug856657_01 bug856657_02 bug856657_03 bug856657_cleanup bug834057 bug910581 bug969210_01 bug969210_02 bug969210_03 bug979169_01 bug979169_02 bug979169_03"
    ic2="tp3 tp4 tp5 tp6 tp7 tp8 tp9 tp10 tp11 tp12 tp13 tp14 tp15 bug192901 trac365 bug834057 bug910581 bug969210_01 bug969210_02 bug969210_03 bug979169_01 bug979169_02 bug979169_03"

These 856657s are destructive test cases that leave the server unable to start.  They even affect all backends, including the ones we additionally add.  See this line:

    sed -i "s/dn: cn=monitor/\n\ndn: cn=monitor/" $CDIR/dse.ldif

After the 856657 tests are done, the cleanup code overwrites the config file with the dse.ldif that was backed up initially, ignoring the newly generated backends.  Although the DB's transaction log file still has the backend info to be checkpointed, the server itself no longer has that knowledge, since the backends are wiped out from the config file...

I have to say this scenario is too brutal.  The right way would be to delete all the backends before overwriting the config file with "CopyDSE Restore basic" in bug856657_cleanup.  This way, the transaction log and the server's configuration would stay in sync...

May we change the status to CLOSED/WONTFIX?
Comment 23 Ján Rusnačko 2013-12-17 03:56:30 EST
(In reply to Noriko Hosoi from comment #22)
> (In reply to Ján Rusnačko from comment #20)
> > Steps to reproduce:
> > 1) Start RHEL 7 VM, image I am using is
> > RHEL-7.0-20131111.1-Server-x86_64-dvd1
> > 2) Monitor CPU and IO usage
> > 3) Click on network icon, then open Network Settings
> > 4) CPU usage goes up, gnome-shell process will start eating CPU
> > 5) Try to reproduce the problem - run basic testsuite, or just try to add a
>                                                          ^^      
> > backend.
> 
> Ján, thank you for the steps!  Let me confirm with you one thing in the step
> (5).  Is it "or" not "and"?  Isn't it while running basic testsuite, you
> have to add backends?  I assume the first answer is "and" based upon the
Yes, you are right, simply adding a backend is not sufficient.
> original description "During execution of basic testsuite, a new backend is
> being added".
> 
> I did run basic testsuite AND add backends.  I could reproduce the hang.  I
> attached the stacktraces in comment 21.
> 
> I also run another round and I'd like you or someone from the QE team try
> this...
> 
> In basic.sh, eliminate bug856657 related test cases as follows and run base
> testsute AND add backends?  I think the server hang problem does not occur
> any more (hopefully :).
I did not try this scenario.
> 
>     #ic2="tp3 tp4 tp5 tp6 tp7 tp8 tp9 tp10 tp11 tp12 tp13 tp14 tp15
> bug192901 trac365 bug856657_setup bug856657_01 bug856657_02 bug856657_03
> bug856657_cleanup bug834057 bug910581 bug969210_01 bug969210_02 bug969210_03
> bug979169_01 bug979169_02 bug979169_03"
>     ic2="tp3 tp4 tp5 tp6 tp7 tp8 tp9 tp10 tp11 tp12 tp13 tp14 tp15 bug192901
> trac365 bug834057 bug910581 bug969210_01 bug969210_02 bug969210_03
> bug979169_01 bug979169_02 bug979169_03"
> 
> These 856657s are destructive cases that make the server not start.  It even
> affects all backends including the ones we additionally add.  See this line:
> 
>     sed -i "s/dn: cn=monitor/\n\ndn: cn=monitor/" $CDIR/dse.ldif
> 
> After the 856657 tests are done, it overwrites the config file with the
> backed up dse.ldif initially ignoring the newly generated backends in the
> clean up code.  Although the DB's transaction log file still has the backend
> info to be checkpointed, the server itself no longer has the knowledge since
> they are wiped out from the config file...
> 
> I have to say this scenario is too brutal.  The other right steps would be
> before overwriting the config file by "CopyDSE Restore basic" in
> bug856657_cleanup, deleting all the backends.  This way, the transaction log
> and the server's configuration would be in sync...
> 
> May we change the status to CLOSED/WONTFIX?
Comment 24 Nathan Kinder 2013-12-18 13:34:16 EST
I agree with Noriko's summary in comment #22 that the tests are essentially doing creative-destruction-type tasks.  This is a scenario that we would not support, so WONTFIX seems appropriate.

I would like confirmation/agreement on this from QE.
Comment 25 Sankar Ramalingam 2014-01-07 01:03:25 EST
(In reply to Nathan Kinder from comment #24)
> I agree with Noriko's summary from comment#22 that the tests are essentially
> doing creative destruction type tasks.  This is a scenario that we would not
> support, so WONTFIX seems appropriate.
> 
I agree with Nathan and Noriko. This bug can be closed as WONTFIX.
> I would like confirmation /agreement on this from QE.
