Description of problem:
clvmd hangs and stops serving operations when it is under stress load for a couple of minutes.

Version-Release number of selected component (if applicable):
cman-1.0.27-1.el4
cman-kernel-2.6.9-56.7.el4_8.9

How reproducible:
Stress load script:

#!/bin/bash
while true; do lvs; done > /dev/null &
while true; do
    echo -n '.'
    vgscan > /dev/null
    sleep $(($RANDOM%7))
done

Steps to Reproduce:
1. Form a cluster with 3 nodes.
2. Start 1-6 instances of the script on each node (sometimes more is better).
3. Wait for the dots to stop.

Actual results:
At some point a message about a broken pipe is received, and some time later all the clvmd processes stop and are probably waiting for something. This can result in a segmentation fault or infinite waits. The result varies.

Expected results:
clvmd processes all incoming requests and no hang occurs.

Additional info:
More info: The main thread is waiting in pthread_join() for this thread:

(gdb) bt
#0  0x0000003e6bc08d1a in pthread_cond_wait@@GLIBC_2.3.2 () from /lib64/tls/libpthread.so.0
#1  0x000000000040e96a in pre_and_post_thread (arg=Variable "arg" is not available.) at clvmd.c:1453
#2  0x0000003e6bc06317 in start_thread () from /lib64/tls/libpthread.so.0
#3  0x0000003e6b3c9f03 in clone () from /lib64/tls/libc.so.6
(gdb) l 1450
1445    }
1446
1447    /* We may need to wait for the condition variable before running the post command */
1448    pthread_mutex_lock(&client->bits.localsock.mutex);
1449    DEBUGLOG("Waiting to do post command - state = %d\n",
1450            client->bits.localsock.state);
1451
1452    if (client->bits.localsock.state != POST_COMMAND) {
1453            pthread_cond_wait(&client->bits.localsock.cond,
1454                    &client->bits.localsock.mutex);
(gdb) p client->bits.localsock.state
$7 = PRE_COMMAND
(gdb)

So it looks like it might be some sort of thread race.

lvm2-cluster-2.02.42-5.el4
Created attachment 388236 [details] Patch for testing This is the patch I have given to Jaroslav for testing, as I can't reproduce this myself.
It's also worth mentioning that this bug will exist in RHEL5 and RHEL6 too as the code is generic.
The patched binary makes things worse. Now only local runs are successful. Running the hang script (3 instances) on one node and running lvs on the other makes it hang.
Is there a hung clvmd I can have a look at, please? It might have some more clues as to the cause.
The symptoms of that hang are identical to the DLM bug which has, supposedly, already been fixed. All clvmd worker threads are waiting for a VG lock, but the DLM shows no holder for that lock, only waiters.
The referenced DLM bug is Bug 546022 - https://bugzilla.redhat.com/show_bug.cgi?id=546022
And now we get this crash!

(gdb) bt
#0  persistent_filter_wipe (f=0x0) at filters/filter-persistent.c:54
#1  0x000000000041f48a in dev_iter_create (f=0x0, dev_scan=1) at device/dev-cache.c:738
#2  0x00000000004172f0 in lvmcache_label_scan (cmd=0x578610, full_scan=2) at cache/lvmcache.c:448
#3  0x0000000000410f60 in do_refresh_cache () at lvm-functions.c:540
#4  0x000000000040d267 in do_command (client=Variable "client" is not available.) at clvmd-command.c:123
#5  0x000000000040eee5 in lvm_thread_fn (arg=Variable "arg" is not available.) at clvmd.c:1283
#6  0x00000039ffa06317 in start_thread () from /lib64/tls/libpthread.so.0
#7  0x00000039ff1c9f03 in clone () from /lib64/tls/libc.so.6
(In reply to comment #8)
> and now we get this crash!
> #0 persistent_filter_wipe (f=0x0) at filters/filter-persistent.c:54

This is another bug. I think we fixed it upstream but need to verify. Please clone it as a new bug (for lvm2-cluster, assign it to me) and add the lvm2 rpm versions (and if you have a reproducer, attach it too).
*** Bug 483685 has been marked as a duplicate of this bug. ***
Changing severity of ticket to match severity of 483685
I have had a patch for this for ages, but I'm still waiting for it to be tested with the right combination of other patches(!) to verify that it does fix the problem.
I've added the proposed fix to CVS, as it looks right and tests out harmlessly on my systems (even though I can't reproduce the problem in this BZ):

Checking in WHATS_NEW;
/cvs/lvm2/LVM2/WHATS_NEW,v  <--  WHATS_NEW
new revision: 1.1503; previous revision: 1.1502
done
Checking in daemons/clvmd/clvmd.c;
/cvs/lvm2/LVM2/daemons/clvmd/clvmd.c,v  <--  clvmd.c
new revision: 1.65; previous revision: 1.64
done
Fix in lvm2-cluster-2.02.42-6.el4.
I ran many instances of the test described in comment #0 on the grant cluster (grant-0[123]), and a segfault still occurred on grant-03. Marking FailsQA.

Jan 14 11:19:28 grant-03 kernel: clvmd[8076]: segfault at 0000000000000010 rip 0000000000423a91 rsp 0000000041400cb0 error 4
I was able to get clvmd hung this time (though no segfault), and saw this lvcreate kernel stack trace. Not sure how much help it is, though. I'll attach the entire kern dump from this machine as well.

grant-02 kernel: lvcreate      S 000001021445268c     0 16991  17233 (NOTLB)
grant-02 kernel: 0000010117293bb8 0000000000000002 0000010117293b68 ffffffff803e3b80
grant-02 kernel: 0000007500000055 0000000000000000 0000000000000130 0000000000100000
grant-02 kernel: 000001000103fac0 00000000801329c1
grant-02 kernel: Call Trace:<ffffffff803161e8>{schedule_timeout+257} <ffffffff80136080>{prepare_to_wait+21}
grant-02 kernel:        <ffffffff80310f8a>{unix_stream_recvmsg+622} <ffffffff8013618e>{autoremove_wake_function+0}
grant-02 kernel:        <ffffffff8013618e>{autoremove_wake_function+0} <ffffffff802b0e06>{sock_aio_read+297}
grant-02 kernel:        <ffffffff8017c32d>{do_sync_read+178} <ffffffff8013618e>{autoremove_wake_function+0}
grant-02 kernel:        <ffffffff8017c43b>{vfs_read+226} <ffffffff8017c684>{sys_read+69}
grant-02 kernel:        <ffffffff80110442>{tracesys+209}
Created attachment 473594 [details] log from hung grant-02
I can reproduce this segfault fairly easily, but I'm not able to get a core file. I've got ulimit -c unlimited, and sysctl kernel.core_pattern=/core_files/core set, and still no core appears. Time to test the latest build now...
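A few read-only sanity checks for the missing-core situation described above (a sketch; the /core_files path is taken from the comment, and the sysctl writes themselves require root). One common gotcha: a daemon only inherits "ulimit -c" from the shell that started it, so raising the limit after clvmd is already running has no effect on it, and core_pattern's target directory must exist and be writable.

```shell
#!/bin/bash
# Check the settings that control core file generation.
ulimit -c                           # should print "unlimited"
cat /proc/sys/kernel/core_pattern   # should print /core_files/core
# The target directory must exist and be writable by the crashing process.
if [ -d /core_files ] && [ -w /core_files ]; then
    echo "core dir writable"
else
    echo "core dir missing or not writable"
fi
```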
Corey, if you can still reproduce the hanging clvmd (not the crashing one), it would be extremely useful to get the following from the machine (PID here is the PID of the hanging clvmd process):

gdb -p PID

gdb should attach to the (hanging) process and land you in a prompt; then grab the output of this:

thread apply all bt full

I am not sure if the hang and the crash are actually related, so the core for the crash is still relevant even if we have the above. Either will help to move the bug forward, though.

Thanks,
Petr
Additional patch (double close fd fix) added in lvm2-cluster-2.02.42-10.el4.
Marking this verified because the stress load listed above now runs quite a bit longer. However, there are still deadlock/segfault issues in RHEL4 when running this stress load. I believe the cmirror server issue in bug 444187 may be one of them.
*** Bug 579832 has been marked as a duplicate of this bug. ***
An advisory has been issued which should help the problem described in this bug report. This report is therefore being closed with a resolution of ERRATA. For more information on the solution and/or where to find the updated files, please follow the link below. You may reopen this bug report if the solution does not work for you.

http://rhn.redhat.com/errata/RHBA-2011-0274.html