| Summary: | DS stuck after adding new backend | | |
|---|---|---|---|
| Product: | Red Hat Enterprise Linux 7 | Reporter: | Ján Rusnačko <jrusnack> |
| Component: | 389-ds-base | Assignee: | Rich Megginson <rmeggins> |
| Status: | CLOSED WONTFIX | QA Contact: | Sankar Ramalingam <sramling> |
| Severity: | unspecified | Docs Contact: | |
| Priority: | unspecified | | |
| Version: | 7.0 | CC: | nhosoi, nkinder, sramling |
| Target Milestone: | rc | | |
| Target Release: | --- | | |
| Hardware: | Unspecified | | |
| OS: | Unspecified | | |
| Whiteboard: | | | |
| Fixed In Version: | | Doc Type: | Bug Fix |
| Doc Text: | | Story Points: | --- |
| Clone Of: | | Environment: | |
| Last Closed: | 2014-01-07 06:03:25 UTC | Type: | Bug |
| Regression: | --- | Mount Type: | --- |
| Documentation: | --- | CRM: | |
| Verified Versions: | | Category: | --- |
| oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: | |
| Cloudforms Team: | --- | Target Upstream Version: | |
| Attachments: | | | |
Description
Ján Rusnačko
2013-11-13 12:07:53 UTC
Interesting. Adding a backend at a random timing does not cause this hang; to reproduce it, the add must be done at the same timing as a checkpoint, mustn't it? It looks like we may need to protect DB_OPEN with ENV->dblayer_env_lock.

1955 int dblayer_instance_start(backend *be, int mode)
2271         DB_OPEN(mypEnv->dblayer_openflags,
2272                 dbp, NULL/* txnid */, id2entry_file, subname, DB_BTREE,
2273                 open_flags, priv->dblayer_file_mode, return_value);

Investigating...

file:///usr/share/doc/db4-devel-4.7.25/api_c/db_open.html
flags
The flags parameter must be set to 0 or by bitwise inclusively OR'ing together one or more of the following values:
DB_AUTO_COMMIT
Enclose the DB->open call within a transaction. If the call succeeds, the open operation will be recoverable and all subsequent database modification operations based on this handle will be transactionally protected. If the call fails, no database will have been created.
(In reply to Rich Megginson from comment #3)
> file:///usr/share/doc/db4-devel-4.7.25/api_c/db_open.html
>
> flags
>
> The flags parameter must be set to 0 or by bitwise inclusively OR'ing
> together one or more of the following values:
> DB_AUTO_COMMIT
>
> Enclose the DB->open call within a transaction. If the call succeeds,
> the open operation will be recoverable and all subsequent database
> modification operations based on this handle will be transactionally
> protected. If the call fails, no database will have been created.

That makes sense. I could try that. The problem is that it's quite hard for me to reproduce. I kept adding user entries and backends at the same time for longer than 20 minutes, but still I cannot hit the timing... So far, I have added 330 backends...

We are using DB_AUTO_COMMIT for db->open - but it doesn't appear to be working?

(In reply to Rich Megginson from comment #5)
> We are using DB_AUTO_COMMIT for db->open - but it doesn't appear to be
> working?

Oh, you mean that if DB_AUTO_COMMIT is set, we should not have to put DB_OPEN in the transaction? It looks like db transactions and checkpointing are not mutually exclusive; ENV->dblayer_env_lock protects them from each other. So, I guess DB_OPEN also needs it, even though it sets DB_AUTO_COMMIT...

Hi Ján,
> During execution of basic testsuite, a new backend is being added:
Here's what I did:
Terminal 1: Ran basic test suite.
Terminal 2: Ran this script:
cnt=0
while [ $cnt -lt 10000 ]
do
echo "======================="
bname="BE_$cnt"
sf="o=BE_$cnt"
echo $bname
echo $sf
ldapmodify -x -v -h $LDAPhost -p $LDAPport -D "$ROOTDN" -w "$ROOTDNPW" -c <<-EOF 2>&1
dn: cn=$bname,cn=ldbm database,cn=plugins,cn=config
changetype: add
objectclass: top
objectclass: extensibleObject
objectclass: nsBackendInstance
cn: $bname
nsslapd-suffix: $sf
nsslapd-cachesize: $cachesize
nsslapd-cachememsize: $cachememsize
EOF
cnt=`expr $cnt + 1`
done
I could successfully finish the basic test and add 298 backends. In addition to this, do you have any idea what I need to do to hang the server? I'm wondering how easy it is for you to make it happen?
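As a quick sanity check of what the loop above actually sends, here is a dry-run sketch that prints the generated LDIF instead of piping it to ldapmodify, so the entries can be inspected without a live server. The function name and the cache values below are illustrative, not part of the test suite.

```shell
# Dry-run variant of the backend-add loop: emit the LDIF that would
# be piped to ldapmodify. gen_backend_ldif is a hypothetical helper.
cachesize="-1"
cachememsize=8388608

gen_backend_ldif() {
    bname="BE_$1"
    sf="o=BE_$1"
    # Same entry shape as the reproducer script above.
    cat <<EOF
dn: cn=$bname,cn=ldbm database,cn=plugins,cn=config
changetype: add
objectclass: top
objectclass: extensibleObject
objectclass: nsBackendInstance
cn: $bname
nsslapd-suffix: $sf
nsslapd-cachesize: $cachesize
nsslapd-cachememsize: $cachememsize
EOF
}

# Inspect the entry that iteration 0 would add:
gen_backend_ldif 0
```

Piping the output through `ldapmodify -x -c` (as the real script does) would then add the backend for real.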
(In reply to Noriko Hosoi from comment #6)
> (In reply to Rich Megginson from comment #5)
> > We are using DB_AUTO_COMMIT for db->open - but it doesn't appear to be
> > working?
>
> Oh, you meant if DB_AUTO_COMMIT was set, we should not have to put DB_OPEN
> in the transaction?

Yes. If DB_AUTO_COMMIT is set, then DB_OPEN should automatically use a transaction during the db open.

I notice in dblayer.c that when using older versions of bdb < 4.1 we had to do our own locking for db open and txn_checkpoint.

(In reply to Noriko Hosoi from comment #8)
> Hi Ján,
>
> > During execution of basic testsuite, a new backend is being added:
>
> Here's what I did:
> Terminal 1: Ran basic test suite.
> Terminal 2: Ran this script:
> [the backend-adding script quoted above]
>
> I could successfully finish the basic test and add 298 backends. In
> addition to this, do you have any idea what I need to do to hang the
> server? I'm wondering how easy it is for you to make it happen?

Hello Noriko, I literally just ran the basic testsuite, and when I saw the log output stuck on adding a new backend, I just attached gdb. Now I re-ran the basic testsuite, and it hung again - but this time on a simple ldapsearch like this:

ldapsearch -D ... -b "cn=config" objectclass=*

and the gdb stacktraces look the same as above. I will share that VM with you tomorrow.

(In reply to Rich Megginson from comment #9)
> I notice in dblayer.c that when using older versions of bdb < 4.1 we had to
> do our own locking for db open and txn_checkpoint.

I see.
So, could it be a regression in libdb...? The latest libdb on rhel-7.0 is libdb-5.3.21-14.el7. Mine is a bit older than that: libdb-5.3.21-3.fc18.x86_64. I scanned the changelogs of 5.3.28 and 6.0.20 at http://www.oracle.com/technetwork/products/berkeleydb/downloads/index-082944.html but I could not find any related issues...

(In reply to Ján Rusnačko from comment #10)
> I will share that VM with you tomorrow.

Thanks a lot, Ján!!

(In reply to Noriko Hosoi from comment #12)
> (In reply to Ján Rusnačko from comment #10)
> > I will share that VM with you tomorrow.
>
> Thanks a lot, Ján!!

Bad news - I have copied that VM to another server for you to reach, but I fail to reproduce there :( However, when I re-ran the tests on my machine, I could reproduce (this time it hung on backend creation again).

I noticed one more detail - when I start the VM on my machine, gnome-shell immediately starts consuming 80-90% of CPU time and slows down right after the hang occurs. However, I don't see this on the server where I fail to reproduce.

I guess we can assume this hang is caused by excessive CPU usage of gnome-shell. How do I further debug what is going on there?

(In reply to Ján Rusnačko from comment #13)
> I notice one more detail - when I start the VM on my machine, gnome-shell
> immediately starts consuming 80-90% of CPU time and slows down right after
> the hang occurs.

CPU usage on the host goes up to 80-90%, or the VM's? "When you start the VM", are the basic tests also launched? (I'm curious exactly when the CPU usage jumps up - is it a VM issue or a basic test issue...?) And does "slows down" mean the VM slows down? I.e., the response time for any operation gets longer?

Could there be any other sign? Since the hang is in libdb, something could be wrong in the disk IO of the VM. Do you see any IO difficulties? Output of iostat (in the package sysstat)?

> However, I don't see this on the server where I fail to reproduce.
> I guess we can assume this hang is caused by excessive CPU usage of
> gnome-shell. How do I further debug what is going on there?

Well, hard to say... I think there is a potential risk which causes this hang, which might be magnified because of some slow IO? It'd be really hard to make it happen in the normal condition, but it could happen...?

(In reply to Noriko Hosoi from comment #14)
> (In reply to Ján Rusnačko from comment #13)
> > I notice one more detail - when I start the VM on my machine, gnome-shell
> > immediately starts consuming 80-90% of CPU time and slows down right
> > after the hang occurs.
>
> CPU usage on the host goes up to 80-90%, or the VM's? "When you start the
> VM", are the basic tests also launched? (I'm curious exactly when the CPU
> usage jumps up - is it a VM issue or a basic test issue...?) And does
> "slows down" mean the VM slows down? I.e., the response time for any
> operation gets longer?
>
> Could there be any other sign? Since the hang is in libdb, something could
> be wrong in the disk IO of the VM. Do you see any IO difficulties? Output
> of iostat (in the package sysstat)?

Ok, it seems the load is triggered when I open the network settings (upper right corner -> Network icon -> Network settings). Gnome-shell starts eating CPU, and there is also some IO:

[jrusnack@localhost ~]$ sudo iostat
Linux 3.10.0-42.el7.x86_64 (localhost.localdomain)  11/15/2013  _x86_64_  (2 CPU)

avg-cpu:  %user  %nice  %system  %iowait  %steal  %idle
          24.63   0.03     4.30    13.18    0.00  57.87

Device:   tps    kB_read/s  kB_wrtn/s  kB_read  kB_wrtn
vda       55.80  1606.23    40.19      292799   7326
dm-0       1.95     7.81     0.00        1424      0
dm-1      53.61  1491.34    28.96      271855   5278

Also, it stops when the screensaver is activated. Iotop shows 0 bytes of I/O.

> > However, I don't see this on the server where I fail to reproduce.
> >
> > I guess we can assume this hang is caused by excessive CPU usage of
> > gnome-shell. How do I further debug what is going on there?
>
> Well, hard to say... I think there is a potential risk which causes this
> hang, which might be magnified because of some slow IO? It'd be really
> hard to make it happen in the normal condition, but it could happen...?

I take back what I said, that I cannot reproduce on another server. If I open the network settings, the load is triggered - I will share the server details with you.

Upstream ticket: https://fedorahosted.org/389/ticket/47600

Note:
1) Checkpointing is no longer protected by the ENV->dblayer_env_lock. The dblayer comment says:

/* 608145: db4.1 and newer does not require exclusive lock for checkpointing
 * and transactions */
#define DB_CHECKPOINT_LOCK(use_lock, lock) ;
#define DB_CHECKPOINT_UNLOCK(use_lock, lock) ;

2) Both threads are waiting at the same pthread_cond_wait. It does not look like a deadlock between threads 26 and 39, but rather some lock leak?

Thread 39 (Thread 0x7fcd1a838700 (LWP 23599)):
Thread 26 (Thread 0x7fcd0b7fe700 (LWP 23627)):
#0  pthread_cond_wait@@GLIBC_2.3.2 () at ../nptl/sysdeps/unix/sysv/linux/x86_64/pthread_cond_wait.S:185
#1  0x00007fcd21a63f63 in __db_pthread_mutex_condwait (env=0x7fcd29a8a3e0, mutex=0, timespec=0x0, mutexp=<optimized out>) at ../../src/mutex/mut_pthread.c:321
#2  __db_hybrid_mutex_suspend (env=env@entry=0x7fcd29a8a3e0, mutex=mutex@entry=1082, timespec=timespec@entry=0x0, exclusive=exclusive@entry=1) at ../../src/mutex/mut_pthread.c:577
#3  0x00007fcd21a63440 in __db_tas_mutex_lock_int (nowait=0, timeout=<optimized out>, mutex=<optimized out>, env=0x7fcd29a8a3e0, env@entry=0x7fcd1b9ec448) at ../../src/mutex/mut_tas.c:255
#4  __db_tas_mutex_lock (env=env@entry=0x7fcd29a8a3e0, mutex=1082, timeout=timeout@entry=0) at ../../src/mutex/mut_tas.c:286

Ján, when the hang occurs, could you please run this command line and attach the output to this bug?

# db_stat -C A -h /var/lib/dirsrv/slapd-<ID>/db/

Thanks! --noriko

(In reply to Noriko Hosoi from comment #17)
> 2) Both threads are waiting at the same pthread_cond_wait. It does not
> look like a deadlock between threads 26 and 39, but rather some lock leak?
> Thread 39 (Thread 0x7fcd1a838700 (LWP 23599)):
> Thread 26 (Thread 0x7fcd0b7fe700 (LWP 23627)):
> [stack frames quoted above]

Never mind. They are not waiting on the same mutex.
Thread 39: mutex=mutex@entry=1082
Thread 26: mutex=mutex@entry=1322
Need more investigation... :(

This is a tough one... Still I cannot reproduce the hang on the RHEL7 VM from RHEVM... This is the OS info:

# uname -a
Linux nhrhel7.idm.lab.bos.redhat.com 3.10.0-57.el7.x86_64 #1 SMP Tue Nov 26 17:10:51 EST 2013 x86_64 x86_64 x86_64 GNU/Linux

I'm using this pair:
libdb-5.3.21-14.el7.x86_64
389-ds-base-1.3.1.6-11.el7.x86_64

While running the acceptance/basic test, I executed this script. It added 1863 backends without the hang.

# cat addbe.sh
LDAPhost="localhost"
LDAPport=389
ROOTDN="cn=directory manager"
ROOTDNPW="password"
cachesize="-1"
cachememsize=8388608
cnt=0
while [ $cnt -lt 10000 ]
do
echo "======================="
bname="BE_$cnt"
sf="o=BE_$cnt"
echo $bname
echo $sf
ldapmodify -x -v -h $LDAPhost -p $LDAPport -D "$ROOTDN" -w "$ROOTDNPW" -c <<-EOF 2>&1
dn: cn=$bname,cn=ldbm database,cn=plugins,cn=config
changetype: add
objectclass: top
objectclass: extensibleObject
objectclass: nsBackendInstance
cn: $bname
nsslapd-suffix: $sf
nsslapd-cachesize: $cachesize
nsslapd-cachememsize: $cachememsize
EOF
cnt=`expr $cnt + 1`
done
----------------------------------------------------------------------------

Please note that the basic test passed 100%:
Subject: 389-ds-on_rhel7 Linux_RHEL_x86_64-64 Acceptance test report - 100% passed on nhrhel7.idm.lab.bos.redhat.com using

I also tried to checkpoint more often:
nsslapd-db-checkpoint-interval: 6

Since Ján moved to the other team, can someone from the QE team provide the reproducer? Thanks!

Steps to reproduce:
1) Start a RHEL 7 VM; the image I am using is RHEL-7.0-20131111.1-Server-x86_64-dvd1
2) Monitor CPU and IO usage
3) Click on the network icon, then open Network Settings
4) CPU usage goes up; the gnome-shell process will start eating CPU
5) Try to reproduce the problem - run the basic testsuite, or just try to add a backend.

Please let me know if this does not work.

Created attachment 837512 [details]
stacktraces -- db_open and checkpoint threads are hanging
RHEL-7.0
Linux vm-098.idm.lab.bos.redhat.com 3.10.0-57.el7.x86_64 #1 SMP Tue Nov 26 17:10:51 EST 2013 x86_64 x86_64 x86_64 GNU/Linux
389-ds-base-1.3.1.6-11.el7.x86_64
libdb-5.3.21-14.el7.x86_64
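Noriko's db_stat request above can be wrapped in a small helper so the lock table is captured the moment the hang is observed. This is a hypothetical convenience function, not part of the product; it assumes db_stat from libdb-utils is on the PATH, and the instance directory must be filled in.

```shell
# Hypothetical helper around the db_stat command requested above.
# Usage: collect_db_locks /var/lib/dirsrv/slapd-<ID>/db
collect_db_locks() {
    dbdir=$1
    if [ ! -d "$dbdir" ]; then
        echo "collect_db_locks: no such db directory: $dbdir" >&2
        return 1
    fi
    if ! command -v db_stat >/dev/null 2>&1; then
        echo "collect_db_locks: db_stat not found (install libdb-utils)" >&2
        return 1
    fi
    # -C A dumps all lock-region statistics; -h names the env home.
    db_stat -C A -h "$dbdir"
}
```

Running it while the server hangs, and attaching the output to the bug, would show which BDB mutexes are held and by whom.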
(In reply to Ján Rusnačko from comment #20)
> Steps to reproduce:
> 1) Start RHEL 7 VM, image I am using is
> RHEL-7.0-20131111.1-Server-x86_64-dvd1
> 2) Monitor CPU and IO usage
> 3) Click on network icon, then open Network Settings
> 4) CPU usage goes up, gnome-shell process will start eating CPU
> 5) Try to reproduce the problem - run basic testsuite, or just try to add a
>    backend.

Ján, thank you for the steps! Let me confirm one thing with you about step (5). Is it "or", not "and"? Isn't it that while running the basic testsuite, you have to add backends? I assume the answer is "and", based upon the original description "During execution of basic testsuite, a new backend is being added".

I did run the basic testsuite AND add backends. I could reproduce the hang. I attached the stacktraces in comment 21.

I also ran another round, and I'd like you or someone from the QE team to try this... In basic.sh, eliminate the bug856657-related test cases as follows and run the basic testsuite AND add backends. I think the server hang problem no longer occurs (hopefully :).

#ic2="tp3 tp4 tp5 tp6 tp7 tp8 tp9 tp10 tp11 tp12 tp13 tp14 tp15 bug192901 trac365 bug856657_setup bug856657_01 bug856657_02 bug856657_03 bug856657_cleanup bug834057 bug910581 bug969210_01 bug969210_02 bug969210_03 bug979169_01 bug979169_02 bug979169_03"
ic2="tp3 tp4 tp5 tp6 tp7 tp8 tp9 tp10 tp11 tp12 tp13 tp14 tp15 bug192901 trac365 bug834057 bug910581 bug969210_01 bug969210_02 bug969210_03 bug979169_01 bug979169_02 bug979169_03"

These 856657 cases are destructive: they make the server unable to start. They even affect all backends, including the ones we additionally add. See this line:

sed -i "s/dn: cn=monitor/\n\ndn: cn=monitor/" $CDIR/dse.ldif

After the 856657 tests are done, the cleanup code overwrites the config file with the initially backed-up dse.ldif, ignoring the newly generated backends. Although the DB's transaction log file still has the backend info to be checkpointed, the server itself no longer has that knowledge, since the backends are wiped out of the config file...

I have to say this scenario is too brutal. The right steps would be to delete all the backends before overwriting the config file with "CopyDSE Restore basic" in bug856657_cleanup. This way, the transaction log and the server's configuration would stay in sync...

May we change the status to CLOSED/WONTFIX?

(In reply to Noriko Hosoi from comment #22)
> Ján, thank you for the steps! Let me confirm one thing with you about step
> (5). Is it "or", not "and"? Isn't it that while running the basic
> testsuite, you have to add backends?

Yes, you are right, simply adding a backend is not sufficient.

> In basic.sh, eliminate the bug856657-related test cases as follows and run
> the basic testsuite AND add backends. I think the server hang problem no
> longer occurs (hopefully :).

I did not try this scenario.

> May we change the status to CLOSED/WONTFIX?

I agree with Noriko's summary from comment #22 that the tests are essentially doing creative-destruction type tasks. This is a scenario that we would not support, so WONTFIX seems appropriate. I would like confirmation/agreement on this from QE.

(In reply to Nathan Kinder from comment #24)
> I agree with Noriko's summary from comment #22 that the tests are
> essentially doing creative-destruction type tasks. This is a scenario that
> we would not support, so WONTFIX seems appropriate.

I agree with Nathan and Noriko. This bug can be closed as WONTFIX.

> I would like confirmation/agreement on this from QE.
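The sed line quoted in comment 22 is easy to misread, so here is a minimal standalone demonstration of why it corrupts dse.ldif. The file contents below are a made-up two-entry stand-in for the real config, and GNU sed is assumed (it expands \n in the replacement text): LDIF treats a blank line as an entry separator, so the inserted newlines split the config at cn=monitor.

```shell
# Demonstrate the dse.ldif corruption caused by the bug856657 sed line.
# The file contents are a hypothetical minimal stand-in for dse.ldif.
tmp=$(mktemp)
printf 'dn: cn=config\ncn: config\ndn: cn=monitor\ncn: monitor\n' > "$tmp"

# The exact substitution from bug856657_setup; GNU sed expands the \n
# escapes in the replacement, inserting two blank lines:
sed -i "s/dn: cn=monitor/\n\ndn: cn=monitor/" "$tmp"

# Blank lines separate entries in LDIF, so the file now parses as a
# differently shaped config than the server wrote:
blank_lines=$(grep -c '^$' "$tmp")
echo "blank lines inserted: $blank_lines"
rm -f "$tmp"
```

This is the "destructive" behavior the WONTFIX decision rests on: the test deliberately damages the config, and the cleanup then restores a backup that predates the newly added backends.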