Description of problem:
clvmd hangs and stops serving operations when it is under stress load for a couple of minutes.

Version-Release number of selected component (if applicable):
cman-1.0.27-1.el4
cman-kernel-2.6.9-56.7.el4_8.9

How reproducible:
Stress load script:

#!/bin/bash
while true; do lvs; done > /dev/null &
while true; do
    echo -n '.'
    vgscan > /dev/null
    sleep $(($RANDOM%7))
done

Steps to Reproduce:
1. Form a cluster with 3 nodes.
2. Start 1-6 instances of the script on each node (sometimes more is better).
3. Wait for the dots to stop.

Actual results:
At some point a message about a broken pipe is received, and some time later all the clvmd processes stop and are probably waiting for something. This can result in a segmentation fault or infinite waits. The result varies.

Expected results:
clvmd processes all incoming requests and no hang occurs.

Additional info:
More info: The main thread is waiting in pthread_join() for this thread:

(gdb) bt
#0  0x0000003e6bc08d1a in pthread_cond_wait@@GLIBC_2.3.2 () from /lib64/tls/libpthread.so.0
#1  0x000000000040e96a in pre_and_post_thread (arg=Variable "arg" is not available.) at clvmd.c:1453
#2  0x0000003e6bc06317 in start_thread () from /lib64/tls/libpthread.so.0
#3  0x0000003e6b3c9f03 in clone () from /lib64/tls/libc.so.6
(gdb) l 1450
1445    }
1446
1447    /* We may need to wait for the condition variable before running the post command */
1448    pthread_mutex_lock(&client->bits.localsock.mutex);
1449    DEBUGLOG("Waiting to do post command - state = %d\n",
1450            client->bits.localsock.state);
1451
1452    if (client->bits.localsock.state != POST_COMMAND) {
1453            pthread_cond_wait(&client->bits.localsock.cond,
1454                    &client->bits.localsock.mutex);
(gdb) p client->bits.localsock.state
$7 = PRE_COMMAND
(gdb)

So it looks like it might be some sort of thread race.

lvm2-cluster-2.02.42-5.el4
Created attachment 388236 [details] Patch for testing This is the patch I have given to Jaroslav for testing, as I can't reproduce this myself.
It's also worth mentioning that this bug will exist in RHEL5 and RHEL6 too as the code is generic.
The patched binary makes things worse. Now only local runs are successful. Running the hang script (3 instances) on one node and running lvs on the other makes it hang.
Is there a hung clvmd I can have a look at, please? It might have some more clues as to the cause.
The symptoms of that hang are identical to the DLM bug which has, supposedly, already been fixed. All clvmd worker threads are waiting for a VG lock, but the DLM shows no holder for that lock, only waiters.
The referenced DLM bug is Bug 546022 - https://bugzilla.redhat.com/show_bug.cgi?id=546022
And now we get this crash!

(gdb) bt
#0  persistent_filter_wipe (f=0x0) at filters/filter-persistent.c:54
#1  0x000000000041f48a in dev_iter_create (f=0x0, dev_scan=1) at device/dev-cache.c:738
#2  0x00000000004172f0 in lvmcache_label_scan (cmd=0x578610, full_scan=2) at cache/lvmcache.c:448
#3  0x0000000000410f60 in do_refresh_cache () at lvm-functions.c:540
#4  0x000000000040d267 in do_command (client=Variable "client" is not available.) at clvmd-command.c:123
#5  0x000000000040eee5 in lvm_thread_fn (arg=Variable "arg" is not available.) at clvmd.c:1283
#6  0x00000039ffa06317 in start_thread () from /lib64/tls/libpthread.so.0
#7  0x00000039ff1c9f03 in clone () from /lib64/tls/libc.so.6
(In reply to comment #8)
> and now we get this crash!
> #0 persistent_filter_wipe (f=0x0) at filters/filter-persistent.c:54

This is another bug. I think we fixed it upstream but need to verify. Please clone it as a new bug (for lvm2-cluster, assign it to me) and add the lvm2 rpm versions (and if you have a reproducer, attach it too).
*** Bug 483685 has been marked as a duplicate of this bug. ***
Changing severity of ticket to match severity of 483685
I have had a patch for this for ages, but I'm still waiting for it to be tested with the right combination of other patches(!) to verify that it does fix the problem.
I've added the proposed fix to CVS, as it looks right and tests out harmlessly on my systems (even though I can't reproduce the problem in this BZ):

Checking in WHATS_NEW;
/cvs/lvm2/LVM2/WHATS_NEW,v  <--  WHATS_NEW
new revision: 1.1503; previous revision: 1.1502
done
Checking in daemons/clvmd/clvmd.c;
/cvs/lvm2/LVM2/daemons/clvmd/clvmd.c,v  <--  clvmd.c
new revision: 1.65; previous revision: 1.64
done
Fix in lvm2-cluster-2.02.42-6.el4.
I ran many instances of the test described in comment #0 on the grant cluster (grant-0[123]), and a segfault still occurred on grant-03. Marking FailsQA.

Jan 14 11:19:28 grant-03 kernel: clvmd[8076]: segfault at 0000000000000010 rip 0000000000423a91 rsp 0000000041400cb0 error 4
I was able to get clvmd hung this time (though no segfault), and saw this lvcreate kernel stack trace. Not sure how much help it is, though. I'll attach the entire kern dump from this machine as well.

grant-02 kernel: lvcreate      S 000001021445268c     0 16991  17233 (NOTLB)
grant-02 kernel: 0000010117293bb8 0000000000000002 0000010117293b68 ffffffff803e3b80
grant-02 kernel: 0000007500000055 0000000000000000 0000000000000130 0000000000100000
grant-02 kernel: 000001000103fac0 00000000801329c1
grant-02 kernel: Call Trace:<ffffffff803161e8>{schedule_timeout+257} <ffffffff80136080>{prepare_to_wait+21}
grant-02 kernel:        <ffffffff80310f8a>{unix_stream_recvmsg+622} <ffffffff8013618e>{autoremove_wake_function+0}
grant-02 kernel:        <ffffffff8013618e>{autoremove_wake_function+0} <ffffffff802b0e06>{sock_aio_read+297}
grant-02 kernel:        <ffffffff8017c32d>{do_sync_read+178} <ffffffff8013618e>{autoremove_wake_function+0}
grant-02 kernel:        <ffffffff8017c43b>{vfs_read+226} <ffffffff8017c684>{sys_read+69}
grant-02 kernel:        <ffffffff80110442>{tracesys+209}
Created attachment 473594 [details] log from hung grant-02
I can reproduce this segfault fairly easily, but I'm not able to get a core file. I've got ulimit -c unlimited, and sysctl kernel.core_pattern=/core_files/core set, and still no core appears. Time to test the latest build now...
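A few read-only sanity checks for the missing-core situation described above (a sketch; the /core_files path is taken from the comment, and the sysctl writes themselves require root). One common gotcha: a daemon only inherits "ulimit -c" from the shell that started it, so raising the limit after clvmd is already running has no effect on it, and core_pattern's target directory must exist and be writable.

```shell
#!/bin/bash
# Check the settings that control core file generation.
ulimit -c                           # should print "unlimited"
cat /proc/sys/kernel/core_pattern   # should print /core_files/core
# The target directory must exist and be writable by the crashing process.
if [ -d /core_files ] && [ -w /core_files ]; then
    echo "core dir writable"
else
    echo "core dir missing or not writable"
fi
```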
Corey, if you can still reproduce the hanging clvmd (not the crashing one), it would be extremely useful to get the following from the machine (PID here is the PID of the hanging clvmd process):

gdb -p PID

gdb should attach to the (hanging) process and land you in a prompt; then grab the output of this:

thread apply all bt full

I am not sure if the hang and the crash are actually related, so the core for the crash is still relevant even if we have the above. Either will help to move the bug forward, though.

Thanks,
Petr
Additional patch (double close fd fix) added in lvm2-cluster-2.02.42-10.el4.
Marking this verified because the stress load listed above now runs quite a bit longer. However, there are still deadlock/segfault issues in RHEL4 when running this stress load. I believe the cmirror server issue in bug 444187 may be one of them.
*** Bug 579832 has been marked as a duplicate of this bug. ***
An advisory has been issued which should help the problem described in this bug report. This report is therefore being closed with a resolution of ERRATA. For more information on the solution and/or where to find the updated files, please follow the link below. You may reopen this bug report if the solution does not work for you.

http://rhn.redhat.com/errata/RHBA-2011-0274.html