From Bugzilla Helper:
User-Agent: Mozilla/5.0 (Windows; U; Windows NT 5.1; en-US; rv:1.8.1.8) Gecko/20071008 Firefox/2.0.0.8

Description of problem:
We have a two-member cluster running 5 services. Normally, all services start and stop without problems, and when a process is not found the service is restarted. But after 4 or 5 days, we see many errors like these in /var/log/messages:

...
Oct 21 04:07:12 OsaveAppli1 clusvcmgrd[15760]: <warning> Starvation on Lock #1!
Oct 21 04:07:57 OsaveAppli1 clusvcmgrd[15760]: <warning> Starvation on Lock #1!
...

At that point the clusvcadm commands stop responding, and if a process is not found, the service is no longer restarted.

I ran "shutil -p /lock/1/status" and the result is:

[root@OsaveAppli1 root]# shutil -p /lock/1/status
/lock/1/status is 168 bytes long
SharedLockBlock {
  lb_magic = 0xfeedface
  lb_state = 0
  lb_incarnation = 3
  lb_holder {
    h_node = -1
    h_pid = 21699
    h_pc = 0x0000000008060570
    h_crc = 0xc1da0e1f
    h_exe = /usr/sbin/clusvcmgrd
  }
}

It is very hard to stop the clumanager service on the server, and I have to reboot the system to get it stopped. After stopping the service (or rebooting), I reinitialize the quorum with this procedure:

1. Stop the cluster daemons on all members.
2. Reinitialize the shared state from one member in the cluster: shutil -i
3. Make sure /etc/cluster.xml is the same on all members.
4. Initialize the configuration on the shared quorum partition from one member in the cluster: shutil -s /etc/cluster.xml
5. Verify that the configuration has been initialized correctly from one member in the cluster: shutil -p /cluster/config.xml

After that, the cluster works fine for another 4 or 5 days.

Version-Release number of selected component (if applicable):
clumanager-1.2.35.2-1

How reproducible:
Always

Steps to Reproduce:
No particular actions are needed to reproduce it; the problem appears on its own. The description above explains how we restart the cluster, which works around the problem for 4 or 5 days.

Actual Results:

Expected Results:

Additional info:
I configured a box with 8 services; I'm waiting to see whether it reproduces for me. My lock status block looks the same:

/lock/1/status is 168 bytes long
SharedLockBlock {
  lb_magic = 0xfeedface
  lb_state = 0
  lb_incarnation = 13
  lb_holder {
    h_node = -1
    h_pid = 30823
    h_pc = 0x0000000008060570
    h_crc = 0xc1da0e1f
    h_exe = /usr/sbin/clusvcmgrd
  }
}

Note that lb_state = 0 or h_node = -1 means the lock is not held on-disk.
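For reference, a small sketch of how those printed fields can be interpreted; the struct layout, field types, and buffer sizes below are assumptions based on the shutil output, not the actual clumanager on-disk format.

/* Assumed layout mirroring the fields shutil prints; not the real
 * on-disk definition from clumanager. */
#include <stdint.h>
#include <stdio.h>

struct shared_lock_block {
    uint32_t lb_magic;          /* 0xfeedface                    */
    uint32_t lb_state;          /* 0 = not held                  */
    uint32_t lb_incarnation;
    struct {
        int32_t  h_node;        /* -1 = no holder                */
        int32_t  h_pid;
        uint32_t h_pc;
        uint32_t h_crc;
        char     h_exe[64];     /* size is a guess               */
    } lb_holder;
};

/* The rule from the comment above: state 0 or node -1 => not held. */
static int lock_held_on_disk(const struct shared_lock_block *lb)
{
    return lb->lb_state != 0 && lb->lb_holder.h_node != -1;
}

int main(void)
{
    struct shared_lock_block lb = { .lb_magic = 0xfeedface,
                                    .lb_state = 0,
                                    .lb_holder = { .h_node = -1 } };
    printf("held on disk: %s\n", lock_held_on_disk(&lb) ? "yes" : "no");
    return 0;
}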
You wouldn't happen to be running on a PERC3-attached PowerVault 220S would you?
(In reply to comment #2)
No, we do not use a power controller on our cluster.
Created attachment 236041 [details] Messages log files for start of starvation on lock errors
Here are the first log entries before the lock starvation:

Oct 23 23:40:59 OsaveAppli1 su(pam_unix)[29166]: session closed for user adm3ggc1
Oct 23 23:42:36 OsaveAppli1 clusvcmgrd[9726]: <err> Unable to obtain cluster lock: No locks available
Oct 23 23:43:18 OsaveAppli1 last message repeated 2 times
Oct 23 23:44:21 OsaveAppli1 last message repeated 3 times
Oct 23 23:45:24 OsaveAppli1 last message repeated 3 times
Oct 23 23:45:27 OsaveAppli1 kernel: 493 [RAIDarray.mpp]M3G_P_DS4700:1:1:2 Cmnd failed. Retry same path. vcmnd SN 662622 pdev H3:C0:T1:L2 0x00/0x00/0x00 0x00070000
Oct 23 23:45:27 OsaveAppli1 clulockd[9720]: <err> __msg_send: Incomplete write to 10. Error: Broken pipe
Oct 23 23:45:27 OsaveAppli1 last message repeated 10 times
Oct 23 23:45:27 OsaveAppli1 clulockd[9720]: <warning> Denied 192.168.0.4: Broken pipe
Oct 23 23:45:27 OsaveAppli1 clulockd[9720]: <err> select error: Broken pipe
Oct 23 23:45:27 OsaveAppli1 clulockd[9720]: <warning> Denied 192.168.0.4: Broken pipe
Oct 23 23:45:27 OsaveAppli1 clulockd[9720]: <err> select error: Broken pipe
Oct 23 23:45:27 OsaveAppli1 clulockd[9720]: <warning> Denied 192.168.0.4: Broken pipe
Oct 23 23:45:27 OsaveAppli1 clulockd[9720]: <err> select error: Broken pipe
Oct 23 23:45:27 OsaveAppli1 clulockd[9720]: <warning> Denied 192.168.0.4: Broken pipe
Oct 23 23:45:27 OsaveAppli1 clulockd[9720]: <err> select error: Broken pipe
Oct 23 23:45:27 OsaveAppli1 clulockd[9720]: <warning> Denied 192.168.0.4: Broken pipe
...
Oct 23 23:45:30 OsaveAppli1 clulockd[9720]: <warning> Denied 192.168.0.4: Connection reset by peer
Oct 23 23:45:30 OsaveAppli1 clulockd[9720]: <err> select error: Connection reset by peer
Oct 23 23:45:30 OsaveAppli1 clulockd[9720]: <warning> Potential recursive lock #0 grant to member #0, PID9726
Oct 23 23:45:31 OsaveAppli1 clulockd[9720]: <warning> Denied 192.168.0.4: Broken pipe
Oct 23 23:45:31 OsaveAppli1 clulockd[9720]: <err> select error: Broken pipe
Oct 23 23:45:31 OsaveAppli1 clulockd[9720]: <warning> Denied 192.168.0.4: Connection reset by peer

We reinitialized the quorum yesterday at 5:00 PM and the lock starvation came back last night. The attached messages log covers from yesterday night to this morning.
So, you've got SAN / device-level errors causing timeouts, which produces clulockd & rgmanager errors (that's *ok* - however, the persistent starvation is not okay).

I was able to reproduce most of the warnings and errors by forcing a hang in clulockd. The problem is, I can't get persistent "Starvation" errors at any point. I can get them, but they stop after I stop clulockd from hanging. This leads me to think that, apart from the SAN timeouts in the logs, something else is happening as a side effect which I haven't been able to determine.

So...

* add DAEMON_COREFILE_LIMIT="unlimited" somewhere near the top of /etc/init.d/clumanager
* edit /etc/sysctl.conf and change kernel.sysrq to 1
* echo 1 > /proc/sys/kernel/sysrq
* install the matching clumanager-debuginfo RPM for the version you have (1.2.35.2, I think).

Next time this happens, you'll need to do the following on the low node ID (node 0):

* script debug-info.txt   (records a typescript of the session)
* echo t > /proc/sysrq-trigger
* dmesg
* shutil -p /lock/LOCK_ID_THAT_IS_STARVED/status
* gdb /usr/sbin/clulockd `pidof clulockd`
  [... in gdb; note this won't work if the clulockd process is stuck in 'D' state...]
  (gdb) bt
  (gdb) p locks
  (gdb) quit
* exit   (ends script output)

On *both* nodes you'll need to do:

* script svcmgr-nodeX.txt
* lsof | grep clumanager.lock\\.
* ps auwwx | grep clusvcmgrd
* exit

While you're waiting for it to happen again, you should look into figuring out what's wrong with your hardware / firmware that's causing this error:

Oct 23 23:45:27 OsaveAppli1 kernel: 493 [RAIDarray.mpp]M3G_P_DS4700:1:1:2 Cmnd failed. Retry same path. vcmnd SN 662622 pdev H3:C0:T1:L2 0x00/0x00/0x00 0x00070000

I'm confident that if you are able to eliminate this error, clumanager will behave correctly again.
Basically, once the SAN returns to a normal state and I/Os can get through again, clumanager should return to normal (though it will produce lots of errors and warnings in the meantime). The fact that it does not return to normal (and one or more processes starve for locks indefinitely) is a bug.
(In reply to comment #7)

Thanks for your help :)

First, we are going to check our SAN and create new raw devices on other disks. Because the servers are at another site, these operations will be done next week. If the problem persists, I will install the debug package.

I will keep you informed of our results.
Thanks, Eric.
Lon,

We have checked our SAN logs and there are no errors. We have created two new logical drives for the quorum and restarted the solution. We are waiting to see whether the starvation comes back.

One question: I can't find the clumanager-debuginfo RPM for my clumanager version; the download page says the debuginfo package is not available. Can you tell me whether I can use an older package, and if so, which version?

Thanks,
Eric.
Hmm, if the SAN isn't returning an error, the kernel or block device driver might be. Strange that there's no debuginfo rpm. I'll have to chase that down later - here it is: http://people.redhat.com/lhh/clumanager-debuginfo-1.2.35.2-1.i386.rpm
Basically, something is triggering a path failover. Between the initial error time and the time the I/O is resubmitted on the new path, clulockd is starved waiting for the I/O to complete (and thus applications waiting for clulockd starve for a while). I should be able to create a workaround if I can figure out what is happening (i.e., once we get all the information I asked for). That information is effectively all the state information regarding the locks.
Eventually - even with this strange path failover problem - all the I/Os complete *successfully*.
Thanks Lon. I will install the debug package and I will let you know as soon as the problem comes back.
Lon,

After 3 weeks, we have had no more lock starvation on our cluster nodes. I think we can say that recreating the quorum partitions has solved the problem.

Thanks for your help.

Regards,
Eric.
Lon,

The lock starvation has come back in our logs, with the same symptom: clusvcadm is not responding to our commands. Do you think the latest version of clumanager, 1.2.37-1.1, will solve the problem? I ask because I see this sentence in the fixes: "short timeouts when trying to determine cluster and quorum state in high-latency clusters would lead to locking issues".

Thanks for your answer.
Eric.
I would definitely try it; it's not going to hurt anything.

The timeouts we addressed, however, were on clusters with high network latency - such as geographically separated nodes. There were other indicators that you'd see if you hit the problem we fixed, including:

* Log message: <warning> Potential recursive lock #x grant to member #y, PIDz
* Disparate cluster membership:
  * 'clustat' on node 0 reports online nodes = {0}
  * 'clustat' on node 1 reports online nodes = {0, 1}

If the 1.2.37 package doesn't solve your problem, I'll need the info per comment #6.

As I recall, however, the problem you were seeing was caused by MPP doing path failovers for no good reason, which was causing a long (45+ second) timeout on locks. I guessed it was the array, which you checked on and disproved: the array had no logs indicating any failures. More likely, there is a bug in the MPP driver (or elsewhere in the kernel). So, if you haven't already, the latest RHEL3-U9+errata kernel and the most recent MPP drivers should also be on your short list of things to install/upgrade.
Someone else hit what seems to be the same problem here:

https://bugzilla.redhat.com/show_bug.cgi?id=427577#c30

[note - bugzilla may be private]

Now, the original cause reported in 427577 is fixed; I talked about the symptoms in comment #17.

I've started to try to reproduce this again, as per a test case reported by another customer. Here's my script:

relos=0
while : ; do
    for a in foo5 foo6 foo7; do
        clusvcadm -r $a
        if [ $? -ne 0 ]; then
            echo "Ruh roh..."
            read n
        fi
        sleep 10
        ((++relos))
        echo "$relos relocations complete"
    done
done

Note that I'm not testing on a 4.4 kernel (reported in 427577), but I'm not convinced that the kernel version should matter.

Review: The information I'd need to really understand what's happening is:

[cut]
Next time this happens, you'll need to do the following on the low node ID (node 0):
* script debug-info.txt   (records a typescript of the session)
* echo t > /proc/sysrq-trigger
* dmesg
* shutil -p /lock/LOCK_ID_THAT_IS_STARVED/status
* gdb /usr/sbin/clulockd `pidof clulockd`
  [... in gdb; note this won't work if the clulockd process is stuck in 'D' state...]
  (gdb) bt
  (gdb) p locks
  (gdb) quit
* exit   (ends script output)

On *both* nodes you'll need to do:
* script svcmgr-nodeX.txt
* lsof | grep clumanager.lock\\.
* ps auwwx | grep clusvcmgrd
* exit
[/cut]
So far, relocation has occurred 1,638 times (546 cycles of all 3 services) with no ill effects. I'll let it continue running for a few more days.
I stopped @ 16,059 relocations / 5,353 cycles of all 3 services.

Restarting with no sleep delay between relocations (i.e. deleting the 'sleep 10').
Hi Lon,

We are going to install the new version of clumanager on our production platform next week (it was installed on our pre-production platform this week, but the lock starvation problem is present only on the production platform). I will let you know the result of this installation as soon as possible.

Eric.
Thanks, Eric.

FWIW, the loop without sleep delays ran for a total of 505,197 relocations (168,399 cycles) without incident over the weekend, using the reproduction test case reported by another customer. So, I don't think that test case works in my environment.

Tests were run on 2.4.21-47.EL; I'm retrying the no-sleep test with the latest 54.EL kernel.
Everything seems stable. However, there's one thing I found while reading the code with the help of Eduardo Damato - the cluster unlock code, pointedly, sucks.

The locking code retries until a node transition occurs or the lock is granted. Meanwhile, if the _unlock_ code fails during slow I/O, there's a high chance that the lock will be granted twice to the same process (e.g. you will see "Recursive lock grant" and so forth in the log messages). This is because while clu_lock() retries and is very robust, the clu_unlock() code gives up after retrying only a few times.

There's a reclaim algorithm which asks the "holder" node whether the process holding the lock is still alive. About half the time, the "holder" PID is not the main clusvcmgrd process - it's a process handling a request for the service. In that case, the process will exit even if the unlock fails, and the lock will be automatically freed.

If, however, the main clusvcmgrd process is the one which takes the lock, and an unlock fails due to slow I/O on the clulockd server side (because of a timeout), then there can be starvation on that particular lock from a child process - and a "Potential recursive lock grant ..." message in the logs.

We have a patch to make the unlock more robust and are testing it.
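To make the asymmetry concrete, here is a minimal sketch of the difference between an unlock that gives up after a few tries and one that retries until acknowledged, which is the general idea behind the attached patch. The send_lock_request() helper and retry counts are hypothetical stand-ins, not the real clulock.c internals.

/* Sketch of the lock/unlock retry asymmetry described above.
 * send_lock_request() is a hypothetical stand-in for the message
 * exchange with clulockd; it is not the real clumanager API. */
#include <unistd.h>

#define UNLOCK_TRIES 3

/* Stub: returns 0 on success, -1 if clulockd timed out (slow I/O). */
static int send_lock_request(int lockid, int unlock)
{
    (void)lockid; (void)unlock;
    return 0;
}

/* Old behavior: give up after a few tries.  On failure the on-disk
 * lock stays held, and later lockers starve ("Starvation on Lock"). */
int fragile_unlock(int lockid)
{
    for (int i = 0; i < UNLOCK_TRIES; i++) {
        if (send_lock_request(lockid, 1) == 0)
            return 0;
        sleep(1);
    }
    return -1;              /* lock silently left held on disk */
}

/* Patched idea: keep retrying until the unlock is acknowledged,
 * mirroring how clu_lock() already retries until the lock is
 * granted or a node transition occurs. */
int robust_unlock(int lockid)
{
    while (send_lock_request(lockid, 1) != 0)
        sleep(1);           /* slow shared storage only delays the unlock */
    return 0;
}

The design point is simply symmetry: since the lock path already tolerates arbitrarily slow shared storage by retrying, the unlock path has to do the same, or a single timed-out unlock leaves the on-disk lock held forever.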
Created attachment 305170 [details] Makes unlock path in clumanager more robust
I changed the name to more accurately reflect the bug that's fixed here.

Basically, "Starvation" warnings are the result of poor performance or other problems while accessing shared storage. Slow access during the _unlock_ path can cause the unlock to fail, resulting in a lock that's held forever - even after performance talking to shared storage has returned to normal!

The patch in comment #24 fixes the unlock path so that slow I/O to shared storage does not cause it to fail.
*** Bug 427577 has been marked as a duplicate of this bug. ***
Between the patch for this bugzilla and the patch for bug 427577, starvation warnings should no longer run forever. We have at least one customer testing this patch (who was previously getting duplicate lock grants due to an unlock failure), with no issues so far.
Note that the patch in 427577 was included in clumanager-1.2.37
Hi Lon,

My customer contacted me after a while and reported that they are seeing the "Starvation on Lock" error again, even after updating to clumanager-1.2.37-1.1. Thus, clumanager-1.2.37-1.1 doesn't fix the problem.

The customer is now running a 4-node cluster (SUC21/SUC22/SUC23/SUC24) and has a problem stopping clumanager on SUC21 and SUC23, with the following errors:

* SUC21
Jul 25 09:05:31 SUC21 clulockd[3950]: <err> __msg_send: Incomplete write to 10. Error: Broken pipe

* SUC23
Jul 25 09:10:09 SUC23 clusvcmgrd[31869]: <warning> Starvation on Lock #2!

I'll upload the messages logs for both nodes (SUC21/SUC23). Please let me know if you need more information.

Attachment:
gw-suc21-messages.dat - /var/log/messages from SUC21
gw-suc23-messages.dat - /var/log/messages from SUC23

Thanks,
Masahiro

This event sent from IssueTracker by mokubo issue 135651
Hi,

My customer is using clumanager-1.2.37-1.1 and hit the problem. After looking through this bugzilla, we found that the patch in comment #24 of BZ 348671 is not yet applied to clumanager-1.2.37-1.1. We think that patch might solve our customer's problem.

Questions:
1. Approximately when are you planning to release a clumanager package which includes the patch (comment #24 in BZ 348671)?
2. We've found that clumanager-1.2.37.2-0.2test.i386.rpm includes the patch. Is it a problem if I provide that package to my customer for testing?

Thanks in advance,
Masahiro

This event sent from IssueTracker by mokubo issue 135651
Yes, you may test with 1.2.37.2-0.2test.
http://people.redhat.com/lhh/clumanager-1.2.37.2-0.2test.i386.rpm
http://people.redhat.com/lhh/clumanager-1.2.37.2-0.2test.x86_64.rpm
http://people.redhat.com/lhh/clumanager-1.2.37.2-0.2test.src.rpm
http://people.redhat.com/lhh/clumanager-debuginfo-1.2.37.2-0.2test.i386.rpm
http://people.redhat.com/lhh/clumanager-debuginfo-1.2.37.2-0.2test.x86_64.rpm
Hi Lon,

I've been doing testing on a two-node cluster (node1/node2) before providing the test package (1.2.37.2-0.2test) to my customer, and unfortunately I did reproduce the problem with it.

[Testing environment]
* Two-node cluster (node1/node2) on VMware with shared storage on iSCSI
* RHEL3U8 + clumanager-1.2.37.2-0.2test

[Test case]
From /var/log/messages on node1:

Sep 10 14:38:30 node1 clusvcmgrd: [9623]: <notice> service notice: Started service BATCH ...
<- Made sure the cluster was up and running with no problem
..
Sep 10 14:39:11 node1 clulockd[6766]: <warning> Denied 192.168.50.12: Broken pipe
Sep 10 14:39:13 node1 clulockd[6766]: <err> select error: Broken pipe
<- Gave the system some busy work ("cat /dev/zero") and observed that it started producing the above errors.
..
Sep 10 14:40:51 node1 clumanager: [12521]: <notice> Shutting down Red Hat Cluster Manager...
<- Executed "service clumanager stop"
..
Sep 10 15:00:14 node1 clulockd[6766]: <warning> Denied 192.168.50.12: Broken pipe
<- Canceled the busy work ("cat /dev/zero") since clumanager didn't stop for nearly 20 minutes after "service clumanager stop".
..

From /var/log/messages on node2:

Sep 10 15:02:25 node2 clusvcmgrd[5528]: <warning> Starvation on Lock #1!
Sep 10 15:03:27 node2 clusvcmgrd[5528]: <warning> Starvation on Lock #1!
<- It started producing the "Starvation on Lock" error

Finally, I rebooted both nodes at Sep 10 15:16.

[Backtrace of clusvcmgrd when the problem is occurring]

* node1
(gdb) bt
#0 0x0025c038 in ___newselect_nocancel () from /lib/tls/libc.so.6
#1 0x08055c87 in __select_retry (fdmax=15, rfds=0xbffff390, wfds=0x0, xfds=0xbffff310, timeout=0xbffff480) at fdops.c:43
#2 0x08055f44 in __read_retry (sockfd=14, buf=0xbffff450, count=4, timeout=0xbffff480) at fdops.c:149
#3 0x0805374d in __msg_receive (handle=14, buf=0xbffff4c0, count=20, tv=0xbffff480) at msgsimple.c:889
#4 0x080538fc in msg_receive_timeout (handle=14, buf=0xbffff4c0, count=20, timeout=15) at msgsimple.c:958
#5 0x08050244 in lock_keeper () at clulock.c:149
#6 0x0805092d in lock_send (pc=0x804e050, lockid=1, cmd=36865) at clulock.c:291
#7 0x080509a0 in lock_lock (pc=0x804e050, lock=1) at clulock.c:307
#8 0x0805105f in _clu_try_lock_cluster (pc=0x804e050, lockid=1) at clulock.c:544
#9 0x08050baa in _clu_try_lock (pc=0x804e050, lockid=1) at clulock.c:409
#10 0x08050a2d in clu_lock (lockid=1) at clulock.c:353
#11 0x0804e050 in check_services (elapsed=6) at clusvcmgrd.c:2522
#12 0x0804f07a in main (argc=1, argv=0xbffff904) at clusvcmgrd.c:3033
#13 0x0019b79a in __libc_start_main () from /lib/tls/libc.so.6
#14 0x0804a631 in _start ()

* node2
(gdb) bt
#0 0x00f74038 in ___newselect_nocancel () from /lib/tls/libc.so.6
#1 0x0805aa80 in auth_md5 (fd=14) at auth_md5.c:379
#2 0x08052dfe in __msg_open (daemonid=3, nodeid=0, timeout=0xbfffbc18) at msgsimple.c:554
#3 0x08052e80 in msg_open (daemonid=3, nodeid=0) at msgsimple.c:579
#4 0x080503b5 in __lock_send (lreq=0xbfffbcb0) at clulock.c:192
#5 0x0805082c in lock_send (pc=0x804e050, lockid=0, cmd=36865) at clulock.c:270
#6 0x080509a0 in lock_lock (pc=0x804e050, lock=0) at clulock.c:307
#7 0x0805105f in _clu_try_lock_cluster (pc=0x804e050, lockid=0) at clulock.c:544
#8 0x08050baa in _clu_try_lock (pc=0x804e050, lockid=0) at clulock.c:409
#9 0x08050a2d in clu_lock (lockid=0) at clulock.c:353
#10 0x0804e050 in check_services (elapsed=6) at clusvcmgrd.c:2522
#11 0x0804f07a in main (argc=1, argv=0xbfffc094) at clusvcmgrd.c:3033
#12 0x00eb379a in __libc_start_main () from /lib/tls/libc.so.6
#13 0x0804a631 in _start ()
[Attachment]
* /var/log/messages
* /var/log/cluster - DEBUG log
* fuser.log - output of "fuser /var/lock/clumanager/*"
* lock.log - output of "cat /proc/locks"
* lsof.log - output of "lsof /var/lock/clumanager/*"
* ps.log - output of "ps -ef |grep clu"
* shutil.log - output of "shutil -p /lock/0/status"
* lsof_clulockd.log - output of "lsof -p <pid of clulockd>"
* clusvcmgrd.bt - backtrace of clusvcmgrd
* clulockd.bt - backtrace of clulockd

The problem looks like a very rare case. I ran the test nearly 100 times and saw the problem only once.

Could you have a look?

Thanks,
Masahiro

This event sent from IssueTracker by mokubo issue 135651 it_file 155794
Hi Lon,

I've finally reproduced the problem. Could you log in to the testing system?

"dhcp-1-138.bne.redhat.com" is trying to disable the DB service with "clusvcadm -d", but it can't stop it for some reason and has been producing the "Starvation on Lock" error for a while.

* Output of clustat

[root@dhcp-1-138 root]# clustat
Cluster Status - Red Hat Clumanager Cluster ASEAN                     19:42:37
Cluster Quorum Incarnation #3
Shared State: Shared Raw Device Driver v1.2

  Member             Status
  ------------------ ----------
  dhcp-1-138.bne.red Active     <-- You are here
  dhcp-1-89.bne.redh Active

  Service        Status   Owner (Last)     Last Transition Chk Restarts
  -------------- -------- ---------------- --------------- --- --------
  DB             stopping dhcp-1-138.bne.r 17:02:07 Jan 27  60        0
  BATCH          disabled (dhcp-1-138.bne. 17:02:06 Jan 27  60        0
  REAL           disabled (dhcp-1-89.bne.r 16:57:16 Jan 27  60        0

* /var/log/messages

Jan 27 17:43:06 dhcp-1-138 clusvcmgrd[8065]: <warning> Starvation on Lock #0!
Jan 27 17:43:10 dhcp-1-138 clusvcmgrd[6147]: <warning> Starvation on Lock #2!
..
Jan 27 19:41:24 dhcp-1-138 clusvcmgrd[6147]: <warning> Starvation on Lock #2!
Jan 27 19:41:25 dhcp-1-138 clusvcmgrd[8065]: <warning> Starvation on Lock #0!

At the time of this writing (Jan 27 19:57), "clusvcadm -d" has failed to stop the DB service and has been generating the "Starvation on Lock" error for nearly two hours.

As additional information, I'm also running the following scripts on both nodes to log the relevant information in the /tmp directory (the scripts are in the /root/cluster directory):

* fuser.sh logs /tmp/fuser.log - output of "fuser /var/lock/clumanager/*"
* lock.sh logs /tmp/lock.log - output of "cat /proc/locks"
* lsof.sh logs /tmp/lsof.log - output of "lsof /var/lock/clumanager/*"
* ps.sh logs /tmp/ps.log - output of "ps -ef |grep clu"
* shutil.sh logs /tmp/shutil.log - output of "shutil -p /lock/0/status"

I'll leave the testing system up and running, so please do whatever you need to get the relevant information.

Thanks,
Masahiro

This event sent from IssueTracker by mokubo issue 135651
So, I reproduced it as well on your cluster.

As it turns out, the local (non-cluster) locking components of clulock.c are terribly unreliable when I/O access is slow. I reproduced this problem by injecting a 'sleep 60' randomly into clulockd. What eventually happened was one of two things:

* fcntl() on the file-based node lock failed with no errno, or
* the process-lock level was 2 (which shouldn't happen)

I evaluated the rationale behind the triple-layer locking, and it's unnecessary in clumanager 1.2. In 1.0, we only had one lock, and all access to it required performing (slow) raw I/O. So, to eliminate contention, we had this notion of a "process" lock and a "node local" lock which were taken before the cluster lock. The cluster lock, however, is buffered via clulockd in clumanager 1.2, so the triple-layer locking isn't needed any more.

So, I'm testing a patch which removes all of that. So far it's been very reliable - even when an individual lock request takes 5+ minutes to complete due to the sleep 60 plus contention. The previous patch in this bugzilla is also required for proper operation.

Once I have a little more testing done, I will post the patch and new RPMs here in this bugzilla.
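For illustration only, here is a minimal sketch of the triple-layer ordering described above: a process-level lock, then a node-local fcntl() file lock, then the cluster-wide lock obtained via clulockd. The function names, the lock-file path, and the use of a pthread mutex for the process layer are assumptions made for the sketch, not the actual clulock.c code; the patch keeps only the clulockd-backed cluster lock.

/* Hypothetical sketch of the 1.0-era triple-layer lock ordering that
 * the patch removes; only the node-local layer is shown concretely. */
#include <fcntl.h>
#include <pthread.h>
#include <unistd.h>

static pthread_mutex_t process_lock = PTHREAD_MUTEX_INITIALIZER;

/* Stand-in for the clulockd (cluster-wide) request; not a real API. */
static int cluster_lock(int lockid) { (void)lockid; return 0; }

/* Layer 2: node-local file lock.  This is the sort of fcntl() call
 * that was observed failing with no errno when I/O access was slow. */
static int node_local_lock(const char *path)
{
    int fd = open(path, O_CREAT | O_RDWR, 0600);
    if (fd < 0)
        return -1;

    struct flock fl = { .l_type = F_WRLCK, .l_whence = SEEK_SET };
    if (fcntl(fd, F_SETLKW, &fl) < 0) {
        close(fd);
        return -1;
    }
    return fd;              /* fd stays open while the lock is held */
}

int take_lock_layered(int lockid)
{
    pthread_mutex_lock(&process_lock);                        /* layer 1 */
    int fd = node_local_lock("/var/lock/clumanager/lock.0");  /* layer 2 */
    if (fd < 0) {
        pthread_mutex_unlock(&process_lock);
        return -1;
    }
    return cluster_lock(lockid);                              /* layer 3 */
}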
Obviously, as I've stated over and over, "Starvation warnings" are not to be confused with "Critical errors", and the new patch does not and will not eliminate them in environments where access to shared storage is interrupted or slowed for any reason. What the new patch does is fix infinite "starvation loops", where a node complains over and over that it is starving for a particular lock.
More to the point, the new patch should fix all the remaining cases where the locks are not correctly released.
Created attachment 330178 [details] Simplified locking layer
http://people.redhat.com/lhh/clumanager-1.2.38-0.1test.src.rpm

^^ Testing srpm. To make an RPM on your system:

rpmbuild --rebuild clumanager-1.2.38-0.1test.src.rpm

Sha1sum:
[lhh@ayanami clumanager]$ sha1sum clumanager-1.2.38-0.1test.src.rpm
624973501e1ca7df3a60c44d7841cd55c80592e6  clumanager-1.2.38-0.1test.src.rpm
Need testing feedback from affected users.
The test cluster revealed that, on a heavily loaded system, a quorum update isn't correctly retried after a new membership view has occurred. This is a one-line fix; I will produce a new test srpm.
http://people.redhat.com/lhh/clumanager-1.2.38-0.2test.src.rpm

^^ Testing srpm. To make an RPM on your system:

rpmbuild --rebuild clumanager-1.2.38-0.2test.src.rpm

Sha1sum:
[lhh@ayanami clumanager]$ sha1sum clumanager-1.2.38-0.2test.src.rpm
ef547dd377397155f1e5a8c61033ee085f19255d  clumanager-1.2.38-0.2test.src.rpm
Created attachment 330281 [details] Fix quorum/VF retry
I have one more issue I am working on which is related to an ordering of clean exit vs. quorum convergence on heavily loaded systems. I think I know how to fix it.
I can't reproduce the problem noted in comment #59 on my old bare metal RHEL3 cluster.
Note I have clumembd%rtp set to 10 and cluquorumd%rtp also set to 10 on my bare metal cluster. The previous cluster I produced the problem on was using virtualization.
Ok, my testing environment is up and running now, as follows.

* Two-node cluster
dhcp-1-138.bne.redhat.com (IP: 10.64.1.138) - username: root / passwd: cluster
dhcp-1-89.bne.redhat.com (IP: 10.64.1.89) - username: root / passwd: cluster

I'm currently testing "clumanager-1.2.38-0.2test" by running the reproduction script, but feel free to log in and do whatever you want on the testing system.

I'll also wait for the newer patched package and do further testing when you've released it, before submitting it to the customer.

Thanks,
Masahiro

This event sent from IssueTracker by mokubo issue 135651
Hi Masahiro,

I will continue to try to reproduce this on my RHEL3 development cluster, but I might need your cluster again to fix the problem described in comment #59. If you would like to play around on my cluster, I can get you that information.
Hi Lon,

Thanks for your continuous effort. I have a question from the customer.

When they had the problem, and also when I reproduced it, I often saw the following error messages:

Feb 5 01:11:38 SUC02 clulockd[6602]: <warning> Denied 192.168.3.1: Broken pipe
Feb 5 01:11:38 SUC02 clulockd[6602]: <err> select error: Broken pipe

What do these messages mean, and under what circumstances are they logged? Are they harmless?

Thanks,
Masahiro

This event sent from IssueTracker by mokubo issue 135651
The error occurs when shared storage is interrupted temporarily. It is harmless, as the cluster will retry in these cases where appropriate.
That is, it can be caused by I/O flow problems to disk or high CPU load preventing clulockd from getting scheduled.
Hi Lon,

I've been testing the hotfix package (clumanager-1.2.38-1) on my test system (VMware) for the last week, and from time to time a cluster service fails to start or stop, producing the "Starvation on Lock" error. It usually takes 12 hours or more to reproduce, though.

The problem occurs only when the system is busy (I make the system busy by running "dd if=/dev/zero of=/dev/null"). On a busy system, the task (cluster service start/stop) never seems to get scheduled. I believe this is happening because I'm testing on VMware, as per your comment #70 in the BZ (the hypervisor doesn't recognize the RT-queued task). But just in case, can you have a look at my testing system? The problem is reproduced right now. (I can't keep the system in this state for long because I need it for other tasks, so please take a look today and give me some comments.)

* Testing system information (two-node cluster)
node1: dhcp-1-138.bne.redhat.com (IP: 10.64.1.138) - username: root / passwd: cluster
node2: dhcp-1-89.bne.redhat.com (IP: 10.64.1.89) - username: root / passwd: cluster
-> dhcp-1-89.bne.redhat.com is the busy system

clusvcadm is trying to enable a cluster service (DB), but it can't start it and keeps producing the "Starvation on Lock" error. The system has been in this state for nearly two days. If you run "killall dd" on dhcp-1-89.bne.redhat.com, the node will get out of the busy state and clusvcadm will be able to enable the cluster service (you can run "killall dd" if you want).

So in this case, should we consider the problem to be caused by the hypervisor? I can ask the customer to test the hotfix in their bare-metal environment at this stage, since it might be hard to answer the above question. But can you tell me what information I should ask the customer to collect if the problem is reproduced?

Thanks,
Masahiro

This event sent from IssueTracker by mokubo issue 135651
I'll look at it -- I might not be able to get to it today though. :/
Hi Lon,

Sorry, I couldn't keep the testing machine in the reproduced state since I needed it for another purpose. Anyway, we haven't reproduced the problem on bare metal. So I talked to my manager, and we would like to submit the hotfix package to the customer and ask them to conduct testing with it this week.

I'm thinking of asking them to collect the following information when they conduct the testing:
1. DEBUG log
2. core of clusvcmgrd when the problem occurs

Please let me know if they should collect more information.

Thanks,
Masahiro

This event sent from IssueTracker by mokubo issue 135651
Yeah. I'm sorry, I got pulled off this to work on other things. I'm fairly certain that the issue you saw when running in testing happened during 'stop' -- it's a synchronization issue with clumanager that I think only occurs if the entire VM doesn't get scheduled. I'll try to get back to this ASAP, but at least for now the issue is partially fixed.
I would certainly like to fix the second issue before releasing an erratum.
So what happens is: Node A shuts down. On node B (which is loaded), 'cluquorumd' doesn't process a membership change, or it processes it, but doesn't deliver it to clulockd, or clulockd doesn't ever wake up and process the message... I'm not sure which. This causes clulockd to see an old view of the cluster. This, in turn, causes clulockd to believe the "down" node A is still the lock master, which causes clusvcmgrd to complain. I could patch around this by having clulockd query cluquorumd occasionally, but I really would like to fix the delivery mechanism. Again, I was not able to produce this errant behavior on bare metal clusters.
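Purely as a sketch of the "patch around" option mentioned above (clulockd occasionally re-querying quorum state instead of relying only on delivered membership events): query_quorum_view(), the view structure, and the refresh interval are all hypothetical, and the preferred fix remains repairing the delivery mechanism itself.

/* Hypothetical sketch of clulockd periodically refreshing its
 * membership view instead of relying only on delivered events.
 * query_quorum_view() is a stand-in, not a real clumanager call. */
#include <time.h>

struct member_view {
    unsigned long long online_mask;   /* bit N set => node N online */
    int incarnation;
};

static struct member_view query_quorum_view(void)
{
    /* Stub: pretend nodes 0 and 1 are online. */
    struct member_view v = { 0x3ULL, 1 };
    return v;
}

static struct member_view cached_view;
static time_t last_refresh;

/* Called at the top of the lock-dispatch loop: if a membership event
 * is lost under load, the stale view is corrected within a few
 * seconds instead of persisting (and pointing at a dead lock master). */
void maybe_refresh_view(void)
{
    time_t now = time(NULL);
    if (now - last_refresh >= 5) {
        cached_view = query_quorum_view();
        last_refresh = now;
    }
}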
Thanks for your work.

I've already asked the customer to test the hotfix package (clumanager-1.2.38-1.i386.rpm), and they've decided to use it in their production system. Since the customer must update from the hotfix package within 30 days after the erratum is released, I'd like to know approximately when the next erratum will be released; the customer might want to wait for the erratum if it will be released anytime soon, so that they can avoid updating clumanager twice.

Please let me know if you have any plans.

Thanks,
Masahiro

This event sent from IssueTracker by mokubo issue 135651
I will check. The current plan is to release the above package as an erratum - I pushed it through our build system and that's the one I tested with.
Hello,

The customer did further testing on this and provided additional information, such as backtraces of the clusvcmgrd daemon and the clusvcadm command while the problem was occurring. They didn't get the output of "shutil -p /lock/0/status", so I'm not sure which process actually keeps holding the lock, but I guess the most likely scenario is the following:

1. "clusvcadm -d BATCH" is executed to stop the BATCH service.
2. clusvcmgrd (PID 3679) forks the child process clusvcmgrd (PID 9019) to perform the task of stopping the BATCH service.
3. clusvcmgrd (PID 9019) tries to take the lock to do the task, but the parent process clusvcmgrd (PID 3679) keeps holding the lock for some reason and doesn't release it. So clusvcmgrd (PID 9019) hangs and can't stop the BATCH service.

Test case:
-> Testing was conducted using the script test.sh, which does the following:
1. stop clumanager
2. stop the three cluster-registered services (REAL/DB/BATCH)
3. start clumanager
4. start the three cluster-registered services (REAL/DB/BATCH)
Note: the test.sh script repeats the above operation 500 times.

Environment:
- Two-node cluster (SUC01/SUC02)
- Running three cluster services (REAL/DB/BATCH)

-> The result is that it failed to stop a cluster service on the 234th attempt and started logging the "Starvation on Lock" error.

-> Attachment: 07.500normal_debug3.tar.gz contains the following information.
- sysreport
- /var/log/messages
- cluster.log (DEBUG log)
- ps.txt - ps output when the problem occurred
- clustat.txt - clustat output when the problem occurred
- test.sh - script used to conduct the testing
- backtrace.txt - backtrace of each cluster daemon process

-> Supporting information:

* The time of the issue:
When the script tried to stop one of the cluster services (BATCH) at Tue Jun 2 04:13:51 2009, it failed to stop and started logging the error.

* 07.500normal_debug3/ps.txt - ps output of the node when the problem was occurring

[root@SUC01 etc]# ps -ef | grep clu
root      3390     1  0 04:10 ?        00:00:49 cluquorumd
root      3393     1  0 04:10 ?        00:00:10 /usr/sbin/clumembd
root      3423     1  0 04:10 ?        00:01:49 /usr/sbin/clulockd
root      3679     1  0 04:10 ?        00:00:19 /usr/sbin/clusvcmgrd
root      9008  9007  0 04:13 ?        00:00:00 /usr/sbin/clusvcadm -d BATCH
root      9019  3679  0 04:13 ?        00:00:19 /usr/sbin/clusvcmgrd
root      6599  4248  0 13:34 pts/0    00:00:00 grep clu

* 07.500normal_debug3/SUC01/cluster.log

Jun 2 04:16:31 SUC01 clusvcmgrd[9019]: <warning> Starvation on Lock #1!
Jun 2 04:16:31 SUC01 clulockd[3423]: <debug> lockd_trylock: member #0 lock 1
Jun 2 04:16:31 SUC01 clulockd[3423]: <debug> lock: held
Jun 2 04:16:31 SUC01 clulockd[3423]: <debug> accepted local connection
Jun 2 04:16:31 SUC01 clulockd[3423]: <debug> Processing message on 11
Jun 2 04:16:31 SUC01 clulockd[3423]: <debug> Received 188 bytes from peer
Jun 2 04:16:31 SUC01 clulockd[3423]: <debug> LOCK_LOCK | LOCK_TRYLOCK

* 07.500normal_debug3/clustat.txt

[root@SUC01 etc]# clustat
Cluster Status - Red Hat Clumanager Cluster ASEAN                     13:23:30
Cluster Quorum Incarnation #1
Shared State: Shared Raw Device Driver v1.2

  Member             Status
  ------------------ ----------
  M_SUC01            Active     <-- You are here
  M_SUC02            Active

  Service        Status   Owner (Last)     Last Transition Chk Restarts
  -------------- -------- ---------------- --------------- --- --------
  DB             started  M_SUC01          04:12:15 Jun 02  60        0
  BATCH          stopping M_SUC01          04:13:38 Jun 02  60        0   <- This is hanging
  REAL           disabled (M_SUC02)        04:13:37 Jun 02  60        0

* 07.500normal_debug3/backtrace.txt

-> backtrace of clusvcmgrd (PID 9019)

(gdb) backtrace
#0 0x004a8c58 in ___newselect_nocancel () from /lib/tls/libc.so.6
#1 0x080556bf in __select_retry (fdmax=15, rfds=0xbfff79e0, wfds=0x0, xfds=0xbfff7960, timeout=0xbfff7ad0) at fdops.c:43
#2 0x0805597c in __read_retry (sockfd=14, buf=0xbfff7aa0, count=4, timeout=0xbfff7ad0) at fdops.c:149
#3 0x08053185 in __msg_receive (handle=14, buf=0xbfff7b10, count=20, tv=0xbfff7ad0) at msgsimple.c:889
#4 0x08053334 in msg_receive_timeout (handle=14, buf=0xbfff7b10, count=20, timeout=15) at msgsimple.c:958
#5 0x0805042d in __lock_send (lreq=0xbfff7b70) at clulock.c:198
#6 0x080507d0 in lock_send (pc=0x804b7f0, lockid=1, cmd=36865) at clulock.c:252
#7 0x08050944 in lock_lock (pc=0x804b7f0, lock=1) at clulock.c:289
#8 0x08050b05 in _clu_try_lock_cluster (pc=0x804b7f0, lockid=1) at clulock.c:384
#9 0x080509d1 in clu_lock (lockid=1) at clulock.c:335
#10 0x0804b7f0 in flip_state (svcName=0x8b38bf0 "BATCH", svcID=1, state=3, last_owner_flip=1) at clusvcmgrd.c:918
#11 0x0804beae in svc_disable (svcID=1) at clusvcmgrd.c:1127
#12 0x0804d5e4 in handle_svc_request (svcID=1, action=6, target=-1, fd=12) at clusvcmgrd.c:2092
#13 0x0804e939 in dispatch_msg (fd=12) at clusvcmgrd.c:2804
#14 0x0804eec9 in main (argc=1, argv=0xbfff80a4) at clusvcmgrd.c:2996
#15 0x003e878a in __libc_start_main () from /lib/tls/libc.so.6
#16 0x0804a5a1 in _start ()
(gdb)

-> backtrace of clusvcadm (PID 9008)

(gdb) backtrace
#0 0x0072ec58 in ___newselect_nocancel () from /lib/tls/libc.so.6
#1 0x0804ec1f in __select_retry (fdmax=7, rfds=0xbfffb4c0, wfds=0x0, xfds=0xbfffb440, timeout=0x0) at fdops.c:43
#2 0x0804eedc in __read_retry (sockfd=6, buf=0xbfffb580, count=4, timeout=0x0) at fdops.c:149
#3 0x0804c2f1 in __msg_receive (handle=6, buf=0xbfffb600, count=40, tv=0x0) at msgsimple.c:889
#4 0x0804c3f1 in msg_receive (handle=6, buf=0xbfffb600, count=40) at msgsimple.c:926
#5 0x0804a990 in main (argc=3, argv=0xbfffb6e4) at clusvcadm.c:354
#6 0x0066e78a in __libc_start_main () from /lib/tls/libc.so.6
#7 0x08049ead in _start ()
(gdb)

Please feel free to contact me if you need further information.

Regards,
Masa

This event sent from IssueTracker by mokubo issue 135651 it_file 225912
So, what is happening is:

1. clulockd is busy (sending messages to syslog),
2. causing local waiters to retry the locking request,
3. clulockd sees a local process is waiting for input, and because it processes local connections first, this
4. starves out remote connections.

As system I/O load increases, the chances of this occurring increase - especially with loglevel set to 7 (debug). When set to 7, clulockd spends most of its time blocked waiting for syslog to finish recording the log message (since it's synchronous I/O).

cludb -r clulockd%loglevel

... should fix it. A solution within the code would require fixing (3).
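To illustrate point (3), here is a minimal sketch (not the actual clulockd source) of a local-first dispatch loop versus one that services every listener select() reports as ready, the latter being the idea behind the "check both local and remote listeners" patch attached below. The listener descriptors and handle_request() helper are hypothetical stand-ins.

/* Sketch of local-first vs. fair dispatch over two listening sockets.
 * The descriptors and handle_request() are hypothetical stand-ins. */
#include <sys/select.h>

static int local_fd  = 0;   /* placeholder listening sockets */
static int remote_fd = 1;

static void handle_request(int fd)
{
    (void)fd;               /* accept + process one lock request (stub) */
}

void dispatch_once(void)
{
    fd_set rfds;
    FD_ZERO(&rfds);
    FD_SET(local_fd, &rfds);
    FD_SET(remote_fd, &rfds);

    int maxfd = (local_fd > remote_fd ? local_fd : remote_fd) + 1;
    if (select(maxfd, &rfds, NULL, NULL, NULL) <= 0)
        return;

    /* Local-first version (the starving behavior): remote clients are
     * only served when no local waiter is pending, so constant local
     * retries shut remote nodes out entirely:
     *
     *   if (FD_ISSET(local_fd, &rfds))
     *       handle_request(local_fd);
     *   else if (FD_ISSET(remote_fd, &rfds))
     *       handle_request(remote_fd);
     */

    /* Fair version: service every listener select() reported ready. */
    if (FD_ISSET(local_fd, &rfds))
        handle_request(local_fd);
    if (FD_ISSET(remote_fd, &rfds))
        handle_request(remote_fd);
}

The key property of the fair version is that a steady stream of local retries can delay remote requests but can no longer postpone them indefinitely.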
I reproduced the issue, and solved it by removing the clulockd loglevel from cluster.conf - WHILE the locks were being starved, in fact.

Note that apparently when clulockd is this busy, you may have to run:

killall -HUP clulockd

... on both nodes shortly after reloading the configuration. Due to a backlog of open file descriptors, it took a few seconds for clulockd to return to normal operation.
Created attachment 350182 [details] Patch: Always check both local and remote listeners for lock requests.
I have reset logging to log level 7 and restarted testing with the above patch.
Other problems occur with the above patch when using log level 7 (debug) for clulockd, but starvation is not one of them. Notably, service requests from clusvcadm time out routinely during testing under load. These problems, as well as the starvation in general, can be avoided by not using loglevel 7.
You get timeout errors from clusvcadm. Effectively, it times out trying to make a connection to clusvcmgrd, because:

* clusvcmgrd (main) forks a status check for service X
* clusvcmgrd (main) tries to talk to clulockd to get a lock for service Y, and blocks, because locking is horrifically slow while clulockd is in debug mode and blocking on syslog
* clusvcmgrd (service X) handles the status check
  * it tries to take a cluster lock, which takes a long time because locking is horrifically slow while clulockd is in debug mode and blocking on syslog
* clusvcadm sends a message to clusvcmgrd (main)
* clusvcadm times out on authentication (5 second timeout) because clusvcmgrd (main) is blocked on locking, which is blocked on syslog...

clusvcadm fails at this point.
I have a patch to fix the above behavior as well, but it requires disabling authentication in order to prevent the clusvcadm timeout, which is sub-optimal. So, now, we have two patches to fix one behavior which can be fixed by simply turning off debug mode. Turning off debug mode for clulockd requires no patches and can be done immediately.

Alternatively, you can turn on asynchronous logging to the cluster log by prepending the line in /etc/syslog.conf with '-'. For example:

local4.* -/var/log/cluster.log
Added issue 299318.
Note that, in addition to the fixes in this bugzilla, administrators should do the following if starvation warnings persist:

* Set clulockd's loglevel to 4 ("WARNING"). Running with the log level at 7 ("DEBUG") is a severe performance hit, as clulockd sends many, many messages to syslog in this mode. This is the very first thing administrators should try.
* Set clumembd to realtime prio 20 (see 'man cludb')
* Set cluquorumd to realtime prio 19 (see 'man cludb')
* Decrease cluquorumd's log level to 6 ("INFO") or below to reduce the chances of it becoming starved on syslog (which is a big problem on a loaded system).
* Decrease clumembd's log level to 6 ("INFO") or below to reduce the chances of it becoming starved on syslog (which is a big problem on a loaded system).

To further improve performance, one may also:

* Turn on asynchronous recording for clumanager's logging by prefixing the line in syslog.conf with a '-'; for example:
  local4.* -/var/log/cluster
* Drop the log levels of clumembd and cluquorumd to 4 (default) or below.
An advisory has been issued which should help the problem described in this bug report. This report is therefore being closed with a resolution of ERRATA. For more information on the solution and/or where to find the updated files, please follow the link below. You may reopen this bug report if the solution does not work for you.

http://rhn.redhat.com/errata/RHBA-2009-1436.html