From Bugzilla Helper: User-Agent: Mozilla/5.0 (X11; U; Linux i686; en-US; rv:1.8.1.12) Gecko/20080208 Fedora/2.0.0.12-1.fc8 Firefox/2.0.0.12

Description of problem:
clvmd fails to start after a node has been fenced off from the cluster. The node comes back up, but clvmd fails to start. Its status says it is running, but none of the LVM commands work: running pvs segfaults, and none of the clustered volumes are activated.

Feb 27 15:55:34 blade112 kernel: pvs[12756]: segfault at 0000000000000000 rip 00000000004443e4 rsp 0000007fbfffd778 error 4

If the "vgscan" line is commented out of /etc/init.d/clvmd, then everything is activated and mounted correctly for GFS. --sbradley

Version-Release number of selected component (if applicable):
cman-1.0.17-0-x86_64

How reproducible:
Always

Steps to Reproduce:
1. start ccsd, cman, fenced
2. start clvmd

Expected Results: the node should join the cluster; clvmd should start, run vgscan, and then activate all clustered volumes found.

Actual Results: clvmd hangs on starting and LVM commands fail. Clustered volume groups are not activated and cannot be mounted.

Additional info: Removing the vgscan call from the /etc/init.d/clvmd script allows the node to run clvmd correctly; however, I believe subsequent vgscan runs will still fail.
Can we have debugging information from clvmd (run clvmd -d and capture stderr) while vgscan and the failing commands are run, please? Ideally annotated so we can see which parts of the log relate to which commands. Please also include the result of running the commands with -vvvvv (e.g. pvs -vvvvv). If the commands are segfaulting, it would also be useful to install the debug packages and get a gdb traceback of the failure.
Created attachment 296619 [details] Information requested from eng.
It's waiting for the DLM. Looking at the clvmd log, I would guess that it's stuck in a dlm_write, so the DLM hasn't responded to the locking request at all, and the kill hasn't caused the write to return. That, in turn, causes the main loop to hang waiting for the child thread to complete and prevents further activity in clvmd. There are no obvious anomalies in the DLM log; it seems to be running and has finished recovery. There are some odd "clvmd lockspace already in use" messages that I can't account for; I doubt they are a cause, but they shouldn't be there, and they might be an indication of mixed-up lockspace device nodes. The next thing to get, if possible, is a sysrq-T - that should tell us where the DLM is waiting and which kernel locks are being held.
Created attachment 296752 [details] sysrq output
Created attachment 296753 [details] sysrq output 2
There are a lot of processes in find_extend_vma (including clvmd), which might point to an out-of-memory problem. Looking at /proc/slabinfo, though, nothing leaps out as particularly bad.
I hit something like this on our ppc cluster, but while running GULM. All four nodes were coming up at the same time. The three GULM servers hung for a while in clvmd startup. Two of the three nodes eventually became unstuck. I don't think the last node will become unstuck, because it doesn't have any outstanding locks in GULM, nor is it waiting for any locks. It does still have one vgscan process running.
Created attachment 297921 [details] Patch for testing

This bug looks very similar to bz#435491 (the clvmd bits, not the networking issues). Looking at the reproduction we got (comment 47), I think I spotted a race in clvmd. If you're up to making another test RPM, this might be worth a go. It includes the non-blocking patch too, as I still think that's a good idea.
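For anyone unfamiliar with the non-blocking idea mentioned above, the following is a minimal, purely illustrative sketch (not the actual patch; the fd in question and the surrounding handling are assumptions) of putting a file descriptor into non-blocking mode with fcntl, so that a write which would otherwise hang indefinitely returns with EAGAIN and can be retried, e.g. after poll():

#include <errno.h>
#include <fcntl.h>
#include <stdio.h>
#include <unistd.h>

/* Illustrative only: put the given fd (e.g. the fd used for DLM requests)
 * into non-blocking mode so write() returns -1/EAGAIN instead of blocking. */
int set_nonblocking(int fd)
{
        int flags = fcntl(fd, F_GETFL, 0);
        if (flags < 0)
                return -1;
        return fcntl(fd, F_SETFL, flags | O_NONBLOCK);
}

/* The caller would then handle the EAGAIN case - for example by retrying
 * once poll() reports the fd writable - rather than sitting in a write
 * that never returns. */
ssize_t try_write(int fd, const void *buf, size_t len)
{
        ssize_t n = write(fd, buf, len);
        if (n < 0 && (errno == EAGAIN || errno == EWOULDBLOCK))
                fprintf(stderr, "write would block, deferring\n");
        return n;
}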
I ran the test on corey's cluster with this patch in place and it still hung. However, looking at clvmd under gdb shows different symptoms: all of the clvmd processes are in a normal quiescent state, and all LVM commands can connect. The hang is purely due to the other processes being held behind the VG lock. I'm going to restart the tests with clvmd logging enabled to gather more information. I'm actually away today, but I might be able to check on the status later on.
One of the mirrors (on the QA cluster) is stuck because it is trying to write to the log device while it is suspended. There are two ways this can happen:

1) [Most likely] The mirror log server can move around as machines come and go. If a machine incorrectly assumes control as the server and issues a disk operation while the log is suspended, it will freeze the whole cluster. This is because the log server is stuck waiting on disk - no one can contact it, and a new server cannot be elected without the stuck server's vote (or fencing).

2) The top-level mirror is being resumed before the lower-level devices are resumed. This would produce a result similar to #1, where the server would be stuck, but it would not be the fault of the mirror log server.

We have seen this issue before, but it is not the basis for this bug. You may have gotten past the original issue and stumbled back upon an old issue.
> We have seen this issue before, but it is not the basis for this bug. You may
> have gotten past the original issue and stumbled back upon an old issue.

I agree. I suspect this means that my clvmd fix has done its job. I'll check it in to CVS.

Checking in daemons/clvmd/clvmd.c;
/cvs/lvm2/LVM2/daemons/clvmd/clvmd.c,v  <--  clvmd.c
new revision: 1.44; previous revision: 1.43
done
This request was evaluated by Red Hat Product Management for inclusion in a Red Hat Enterprise Linux maintenance release. Product Management has requested further review of this request by Red Hat Engineering, for potential inclusion in a Red Hat Enterprise Linux Update release for currently deployed products. This request is not yet committed for inclusion in an Update release.
*** Bug 435491 has been marked as a duplicate of this bug. ***
The issues mentioned in comment #21 should be cleared now. You can always tell if the issue in comment #21 has arisen, because 'dmsetup info' will show a log device as SUSPENDED and cluster_log_servd will be in the 'D' state.
corey was still experiencing some hangs even after using clvmd with the above patch. The commit below fixes an uninitialised variable that caused this.

Checking in daemons/clvmd/clvmd.c;
/cvs/lvm2/LVM2/daemons/clvmd/clvmd.c,v  <--  clvmd.c
new revision: 1.45; previous revision: 1.44
done

I've set the state back to POST because we almost certainly need both of these patches.
Put back to MODIFIED; I'll use bz#435491 for this checkin.
I observed a similar problem: when clvmd is started on only a subset of all joined cluster nodes (as is always the case at some point during a cluster startup), a vgscan causes clvmd to deadlock. Please note that I have already applied Christine's patches for clvmd.c 1.43-1.45.

This behaviour can be reproduced with the following process:

1. On all cluster nodes, start the cluster services:
# ccsd
# cman_tool join -w
# fence_tool join -c -w

2. On only one cluster node, start clvmd and run vgscan:
# clvmd
# vgscan
While doing some debugging, I noticed that the deadlock comes from here: clvmd wants to reply that not all cluster nodes are active (clvmd.c: add_reply_to_list), but the mutex is not initialized yet.

clvmd.c:
1284 static void add_reply_to_list(struct local_client *client, int status,
1285                               const char *csid, const char *buf, int len)
1286 {
1287         struct node_reply *reply;
1288
1289         pthread_mutex_lock(&client->bits.localsock.reply_mutex);

The mutex initialization is done in clvmd.c:read_from_local_sock after the check whether all clvmds are running ;-)

Check for all clvmds, clvmd.c:
 988         /* Only run the command if all the cluster nodes are running CLVMD */
 989         if (((inheader->flags & CLVMD_FLAG_LOCAL) == 0) &&
 990             (check_all_clvmds_running(thisfd) == -1)) {

Initialization of the mutex, clvmd.c:
1063         pthread_mutex_init(&thisfd->bits.localsock.reply_mutex, NULL);

In my opinion, the mutex initialization should be done before the clvmd verification.
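To make the suggested ordering concrete, here is a minimal stand-alone sketch (not the real clvmd code - the struct and stub below are simplified assumptions based on the excerpts above) showing the reply mutex being initialized before the all-nodes check, so that an error reply can safely be queued when the check fails:

#include <pthread.h>
#include <stdio.h>

/* Simplified stand-in for the per-client state in clvmd */
struct local_client {
        pthread_mutex_t reply_mutex;
};

/* Stub: pretend that not all cluster nodes are running clvmd */
static int check_all_clvmds_running(struct local_client *client)
{
        (void)client;
        return -1;
}

/* Queueing a reply requires reply_mutex to already be initialized */
static void add_reply_to_list(struct local_client *client, int status)
{
        pthread_mutex_lock(&client->reply_mutex);
        printf("queued reply with status %d\n", status);
        pthread_mutex_unlock(&client->reply_mutex);
}

int main(void)
{
        struct local_client client;

        /* Initialize the mutex first ... */
        pthread_mutex_init(&client.reply_mutex, NULL);

        /* ... so a failed check can still queue its error reply safely */
        if (check_all_clvmds_running(&client) == -1)
                add_reply_to_list(&client, -1);

        pthread_mutex_destroy(&client.reply_mutex);
        return 0;
}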
Please note that removing the deadlock situation leads to another problem: if the check_all_clvmds_running verification in clvmd.c fails as expected, the vgscan and vgchange commands also fail. This leads to undefined situations during a multi-node cluster startup. That is, while the number of nodes required to fulfil the quorum requirement is still coming up, the clvmd cluster is in an inconsistent state for some time: some nodes will already have started clvmd and some won't. This causes several check_all_clvmds_running calls to fail, and therefore the vgscan and vgchange commands also fail.
Comment #34: you're right - thanks for spotting this. I've checked the fix in as revision 1.46 of clvmd.c. Comment #35 needs a little more thought, and some coordination with the init scripts I suspect; I'm not sure what they do.
The clvmd initscript basically does the following steps:

clvmd -T20 -t 90
vgscan
vgchange -ayl

There are no clvmd cluster consistency checks included. How could this be done?
The commands in comment #37 should work in theory, because vgchange -aly only activates the volumes on the local node. In practice, however, it doesn't work, because vgscan issues a command around the whole cluster to back up metadata and re-populate the cache.
An advisory has been issued which should help the problem described in this bug report. This report is therefore being closed with a resolution of ERRATA. For more information on the solution and/or where to find the updated files, please follow the link below. You may reopen this bug report if the solution does not work for you. http://rhn.redhat.com/errata/RHBA-2008-0806.html