From Bugzilla Helper:
User-Agent: Mozilla/5.0 (Windows; U; Windows NT 5.1; en-US; rv:1.8.1.8) Gecko/20071008 Firefox/2.0.0.8

Description of problem:
We have a two-member cluster running 5 services. Normally, all services start and stop without problems, and when a process is not found the service is restarted. But after 4 or 5 days, we see many errors like these in /var/log/messages:

...
Oct 21 04:07:12 OsaveAppli1 clusvcmgrd[15760]: <warning> Starvation on Lock #1!
Oct 21 04:07:57 OsaveAppli1 clusvcmgrd[15760]: <warning> Starvation on Lock #1!
...

At that point the clusvcadm commands stop responding, and if a process is not found, the service is no longer restarted.

I ran "shutil -p /lock/1/status" and the result is:

[root@OsaveAppli1 root]# shutil -p /lock/1/status
/lock/1/status is 168 bytes long
SharedLockBlock {
  lb_magic = 0xfeedface
  lb_state = 0
  lb_incarnation = 3
  lb_holder {
    h_node = -1
    h_pid = 21699
    h_pc = 0x0000000008060570
    h_crc = 0xc1da0e1f
    h_exe = /usr/sbin/clusvcmgrd
  }
}

It is very hard to stop the clumanager service on the server, and I have to reboot the system to get it stopped. After stopping the service (or rebooting), I reinitialize the quorum with this procedure:

1. Stop the cluster daemons on all members.
2. Reinitialize the shared state from one member in the cluster: shutil -i
3. Make sure /etc/cluster.xml is the same on all members.
4. Initialize the configuration on the shared quorum partition from one member in the cluster: shutil -s /etc/cluster.xml
5. Verify that the configuration has been initialized correctly from one member in the cluster: shutil -p /cluster/config.xml

After that, the cluster works fine for another 4 or 5 days.

Version-Release number of selected component (if applicable):
clumanager-1.2.35.2-1

How reproducible:
Always

Steps to Reproduce:
No particular actions are needed to reproduce it; the problem appears on its own. The description above explains how we restart the cluster, which works around the problem for 4 or 5 days.

Actual Results:

Expected Results:

Additional info:
I configured a box with 8 services; I'm waiting to see whether it reproduces for me. My lock status block looks the same:

/lock/1/status is 168 bytes long
SharedLockBlock {
  lb_magic = 0xfeedface
  lb_state = 0
  lb_incarnation = 13
  lb_holder {
    h_node = -1
    h_pid = 30823
    h_pc = 0x0000000008060570
    h_crc = 0xc1da0e1f
    h_exe = /usr/sbin/clusvcmgrd
  }
}

Note that lb_state = 0 or h_node = -1 means the lock is not held on-disk.
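For reference, a small sketch of how those printed fields can be interpreted; the struct layout, field types, and buffer sizes below are assumptions based on the shutil output, not the actual clumanager on-disk format.

/* Assumed layout mirroring the fields shutil prints; not the real
 * on-disk definition from clumanager. */
#include <stdint.h>
#include <stdio.h>

struct shared_lock_block {
    uint32_t lb_magic;          /* 0xfeedface                    */
    uint32_t lb_state;          /* 0 = not held                  */
    uint32_t lb_incarnation;
    struct {
        int32_t  h_node;        /* -1 = no holder                */
        int32_t  h_pid;
        uint32_t h_pc;
        uint32_t h_crc;
        char     h_exe[64];     /* size is a guess               */
    } lb_holder;
};

/* The rule from the comment above: state 0 or node -1 => not held. */
static int lock_held_on_disk(const struct shared_lock_block *lb)
{
    return lb->lb_state != 0 && lb->lb_holder.h_node != -1;
}

int main(void)
{
    struct shared_lock_block lb = { .lb_magic = 0xfeedface,
                                    .lb_state = 0,
                                    .lb_holder = { .h_node = -1 } };
    printf("held on disk: %s\n", lock_held_on_disk(&lb) ? "yes" : "no");
    return 0;
}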
You wouldn't happen to be running on a PERC3-attached PowerVault 220S would you?
(In reply to comment #2)
No, we do not use a power controller on our cluster.
Created attachment 236041 [details] Messages log files for start of starvation on lock errors
Here are the first log entries before the lock starvation:

Oct 23 23:40:59 OsaveAppli1 su(pam_unix)[29166]: session closed for user adm3ggc1
Oct 23 23:42:36 OsaveAppli1 clusvcmgrd[9726]: <err> Unable to obtain cluster lock: No locks available
Oct 23 23:43:18 OsaveAppli1 last message repeated 2 times
Oct 23 23:44:21 OsaveAppli1 last message repeated 3 times
Oct 23 23:45:24 OsaveAppli1 last message repeated 3 times
Oct 23 23:45:27 OsaveAppli1 kernel: 493 [RAIDarray.mpp]M3G_P_DS4700:1:1:2 Cmnd failed. Retry same path. vcmnd SN 662622 pdev H3:C0:T1:L2 0x00/0x00/0x00 0x00070000
Oct 23 23:45:27 OsaveAppli1 clulockd[9720]: <err> __msg_send: Incomplete write to 10. Error: Broken pipe
Oct 23 23:45:27 OsaveAppli1 last message repeated 10 times
Oct 23 23:45:27 OsaveAppli1 clulockd[9720]: <warning> Denied 192.168.0.4: Broken pipe
Oct 23 23:45:27 OsaveAppli1 clulockd[9720]: <err> select error: Broken pipe
Oct 23 23:45:27 OsaveAppli1 clulockd[9720]: <warning> Denied 192.168.0.4: Broken pipe
Oct 23 23:45:27 OsaveAppli1 clulockd[9720]: <err> select error: Broken pipe
Oct 23 23:45:27 OsaveAppli1 clulockd[9720]: <warning> Denied 192.168.0.4: Broken pipe
Oct 23 23:45:27 OsaveAppli1 clulockd[9720]: <err> select error: Broken pipe
Oct 23 23:45:27 OsaveAppli1 clulockd[9720]: <warning> Denied 192.168.0.4: Broken pipe
Oct 23 23:45:27 OsaveAppli1 clulockd[9720]: <err> select error: Broken pipe
Oct 23 23:45:27 OsaveAppli1 clulockd[9720]: <warning> Denied 192.168.0.4: Broken pipe
...
Oct 23 23:45:30 OsaveAppli1 clulockd[9720]: <warning> Denied 192.168.0.4: Connection reset by peer
Oct 23 23:45:30 OsaveAppli1 clulockd[9720]: <err> select error: Connection reset by peer
Oct 23 23:45:30 OsaveAppli1 clulockd[9720]: <warning> Potential recursive lock #0 grant to member #0, PID9726
Oct 23 23:45:31 OsaveAppli1 clulockd[9720]: <warning> Denied 192.168.0.4: Broken pipe
Oct 23 23:45:31 OsaveAppli1 clulockd[9720]: <err> select error: Broken pipe
Oct 23 23:45:31 OsaveAppli1 clulockd[9720]: <warning> Denied 192.168.0.4: Connection reset by peer

We reinitialized the quorum yesterday at 5:00 PM and the lock starvation came back last night. The attached messages log covers from yesterday night to this morning.
So, you've got SAN / device-level errors causing timeouts, which produces clulockd & rgmanager errors (that's *ok* - however, the persistent starvation is not okay).

I was able to reproduce most of the warnings and errors by forcing a hang in clulockd. The problem is, I can't get persistent "Starvation" errors at any point. I can get them, but they stop after I stop clulockd from hanging. This leads me to think that, apart from the SAN timeouts in the logs, something else is happening as a side effect which I haven't been able to determine.

So...

* add DAEMON_COREFILE_LIMIT="unlimited" somewhere near the top of /etc/init.d/clumanager
* edit /etc/sysctl.conf and change kernel.sysrq to 1
* echo 1 > /proc/sys/kernel/sysrq
* install the matching clumanager-debuginfo RPM for the version you have (1.2.35.2, I think).

Next time this happens, you'll need to do the following on the low node ID (node 0):

* script debug-info.txt   (records a typescript of the session)
* echo t > /proc/sysrq-trigger
* dmesg
* shutil -p /lock/LOCK_ID_THAT_IS_STARVED/status
* gdb /usr/sbin/clulockd `pidof clulockd`
  [... in gdb; note this won't work if the clulockd process is stuck in 'D' state...]
  (gdb) bt
  (gdb) p locks
  (gdb) quit
* exit   (ends script output)

On *both* nodes you'll need to do:

* script svcmgr-nodeX.txt
* lsof | grep clumanager.lock\\.
* ps auwwx | grep clusvcmgrd
* exit

While you're waiting for it to happen again, you should look into figuring out what's wrong with your hardware / firmware that's causing this error:

Oct 23 23:45:27 OsaveAppli1 kernel: 493 [RAIDarray.mpp]M3G_P_DS4700:1:1:2 Cmnd failed. Retry same path. vcmnd SN 662622 pdev H3:C0:T1:L2 0x00/0x00/0x00 0x00070000

I'm confident that if you are able to eliminate this error, clumanager will behave correctly again.
Basically, once the SAN returns to a normal state and I/Os can get through again, clumanager should return to normal (though it will produce lots of errors and warnings in the meantime). The fact that it does not return to normal (and one or more processes starve for locks indefinitely) is a bug.
(In reply to comment #7)

Thanks for your help :)

First, we are going to check our SAN and create new raw devices on other disks. Because the servers are at another site, these operations will be done next week. If the problem persists, I will install the debug package.

I will keep you informed of our results.
Thanks, Eric.
Lon,

We have checked our SAN logs and there are no errors. We have created two new logical drives for the quorum and restarted the solution. We are waiting to see whether the starvation comes back.

One question: I can't find the clumanager-debuginfo RPM for my clumanager version; the download page says the debuginfo package is not available. Can you tell me whether I can use an older package, and if so, which version?

Thanks,
Eric.
Hmm, if the SAN isn't returning an error, the kernel or block device driver might be. Strange that there's no debuginfo rpm. I'll have to chase that down later - here it is: http://people.redhat.com/lhh/clumanager-debuginfo-1.2.35.2-1.i386.rpm
Basically, something is triggering a path failover. Between the initial error time and the time the I/O is resubmitted on the new path, clulockd is starved waiting for the I/O to complete (and thus applications waiting for clulockd starve for a while). I should be able to create a workaround if I can figure out what is happening (i.e., once we get all the information I asked for). That information is effectively all the state information regarding the locks.
Eventually - even with this strange path failover problem - all the I/Os complete *successfully*.
Thanks Lon. I will install the debug package and I will let you know as soon as the problem comes back.
Lon,

After 3 weeks, we have had no more lock starvation on our cluster nodes. I think we can say that recreating the quorum partitions has solved the problem.

Thanks for your help.

Regards,
Eric.
Lon,

The lock starvation has come back in our logs, with the same symptom: clusvcadm is not responding to our commands. Do you think the latest version of clumanager, 1.2.37-1.1, will solve the problem? I ask because I see this sentence in the fixes: "short timeouts when trying to determine cluster and quorum state in high-latency clusters would lead to locking issues".

Thanks for your answer.
Eric.
I would definitely try it; it's not going to hurt anything.

The timeouts we addressed, however, were on clusters with high network latency - such as geographically separated nodes. There were other indicators that you'd see if you hit the problem we fixed, including:

* Log message: <warning> Potential recursive lock #x grant to member #y, PIDz
* Disparate cluster membership:
  * 'clustat' on node 0 reports online nodes = {0}
  * 'clustat' on node 1 reports online nodes = {0, 1}

If the 1.2.37 package doesn't solve your problem, I'll need the info per comment #6.

As I recall, however, the problem you were seeing was caused by MPP doing path failovers for no good reason, which was causing a long (45+ second) timeout on locks. I guessed it was the array, which you checked on and disproved: the array had no logs indicating any failures. More likely, there is a bug in the MPP driver (or elsewhere in the kernel). So, if you haven't already, the latest RHEL3-U9+errata kernel and the most recent MPP drivers should also be on your short list of things to install/upgrade.
Someone else hit what seems to be the same problem here:

https://bugzilla.redhat.com/show_bug.cgi?id=427577#c30

[note - bugzilla may be private]

Now, the original cause reported in 427577 is fixed; I talked about the symptoms in comment #17.

I've started to try to reproduce this again, as per a test case reported by another customer. Here's my script:

relos=0
while : ; do
    for a in foo5 foo6 foo7; do
        clusvcadm -r $a
        if [ $? -ne 0 ]; then
            echo "Ruh roh..."
            read n
        fi
        sleep 10
        ((++relos))
        echo "$relos relocations complete"
    done
done

Note that I'm not testing on a 4.4 kernel (reported in 427577), but I'm not convinced that the kernel version should matter.

Review: The information I'd need to really understand what's happening is:

[cut]
Next time this happens, you'll need to do the following on the low node ID (node 0):
* script debug-info.txt   (records a typescript of the session)
* echo t > /proc/sysrq-trigger
* dmesg
* shutil -p /lock/LOCK_ID_THAT_IS_STARVED/status
* gdb /usr/sbin/clulockd `pidof clulockd`
  [... in gdb; note this won't work if the clulockd process is stuck in 'D' state...]
  (gdb) bt
  (gdb) p locks
  (gdb) quit
* exit   (ends script output)

On *both* nodes you'll need to do:
* script svcmgr-nodeX.txt
* lsof | grep clumanager.lock\\.
* ps auwwx | grep clusvcmgrd
* exit
[/cut]
So far, relocation has occurred 1,638 times (546 cycles of all 3 services) with no ill effects. I'll let it continue running for a few more days.
I stopped @ 16,059 relocations / 5,353 cycles of all 3 services.

Restarting with no sleep delay between relocations (i.e. deleting the 'sleep 10').
Hi Lon,

We are going to install the new version of clumanager on our production platform next week (it was installed on our pre-production platform this week, but the lock starvation problem is present only on the production platform). I will let you know the result of this installation as soon as possible.

Eric.
Thanks, Eric.

FWIW, the loop without sleep delays ran for a total of 505,197 relocations (168,399 cycles) without incident over the weekend, using the reproduction test case reported by another customer. So, I don't think that test case works in my environment.

Tests were run on 2.4.21-47.EL; I'm retrying the no-sleep test with the latest 54.EL kernel.
Everything seems stable. However, there's one thing I found while reading the code with the help of Eduardo Damato - the cluster unlock code, pointedly, sucks.

The locking code retries until a node transition occurs or the lock is granted. Meanwhile, if the _unlock_ code fails during slow I/O, there's a high chance that the lock will be granted twice to the same process (e.g. you will see "Recursive lock grant" and so forth in the log messages). This is because while clu_lock() retries and is very robust, the clu_unlock() code gives up after retrying only a few times.

There's a reclaim algorithm which asks the "holder" node whether the process holding the lock is still alive. About half the time, the "holder" PID is not the main clusvcmgrd process - it's a process handling a request for the service. In that case, the process will exit even if the unlock fails, and the lock will be automatically freed.

If, however, the main clusvcmgrd process is the one which takes the lock, and an unlock fails due to slow I/O on the clulockd server side (because of a timeout), then there can be starvation on that particular lock from a child process - and a "Potential recursive lock grant ..." message in the logs.

We have a patch to make the unlock more robust and are testing it.
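To make the asymmetry concrete, here is a minimal sketch of the difference between an unlock that gives up after a few tries and one that retries until acknowledged, which is the general idea behind the attached patch. The send_lock_request() helper and retry counts are hypothetical stand-ins, not the real clulock.c internals.

/* Sketch of the lock/unlock retry asymmetry described above.
 * send_lock_request() is a hypothetical stand-in for the message
 * exchange with clulockd; it is not the real clumanager API. */
#include <unistd.h>

#define UNLOCK_TRIES 3

/* Stub: returns 0 on success, -1 if clulockd timed out (slow I/O). */
static int send_lock_request(int lockid, int unlock)
{
    (void)lockid; (void)unlock;
    return 0;
}

/* Old behavior: give up after a few tries.  On failure the on-disk
 * lock stays held, and later lockers starve ("Starvation on Lock"). */
int fragile_unlock(int lockid)
{
    for (int i = 0; i < UNLOCK_TRIES; i++) {
        if (send_lock_request(lockid, 1) == 0)
            return 0;
        sleep(1);
    }
    return -1;              /* lock silently left held on disk */
}

/* Patched idea: keep retrying until the unlock is acknowledged,
 * mirroring how clu_lock() already retries until the lock is
 * granted or a node transition occurs. */
int robust_unlock(int lockid)
{
    while (send_lock_request(lockid, 1) != 0)
        sleep(1);           /* slow shared storage only delays the unlock */
    return 0;
}

The design point is simply symmetry: since the lock path already tolerates arbitrarily slow shared storage by retrying, the unlock path has to do the same, or a single timed-out unlock leaves the on-disk lock held forever.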
Created attachment 305170 [details] Makes unlock path in clumanager more robust
I changed the name to more accurately reflect the bug that's fixed here.

Basically, "Starvation" warnings are the result of poor performance or other problems while accessing shared storage. Slow access during the _unlock_ path can cause the unlock to fail, resulting in a lock that's held forever - even after performance talking to shared storage has returned to normal!

The patch in comment #24 fixes the unlock path so that slow I/O to shared storage does not cause it to fail.
*** Bug 427577 has been marked as a duplicate of this bug. ***
Between the patch for this bugzilla and the patch for bug 427577, starvation warnings should no longer run forever. We have at least one customer testing this patch (who was previously getting duplicate lock grants due to an unlock failure), with no issues so far.
Note that the patch in 427577 was included in clumanager-1.2.37
Hi Lon,

My customer contacted me after a while and reported that they are seeing the "Starvation on Lock" error again, even after updating to clumanager-1.2.37-1.1. Thus, clumanager-1.2.37-1.1 doesn't fix the problem.

The customer is now running a 4-node cluster (SUC21/SUC22/SUC23/SUC24) and has a problem stopping clumanager on SUC21 and SUC23, with the following errors:

* SUC21
Jul 25 09:05:31 SUC21 clulockd[3950]: <err> __msg_send: Incomplete write to 10. Error: Broken pipe

* SUC23
Jul 25 09:10:09 SUC23 clusvcmgrd[31869]: <warning> Starvation on Lock #2!

I'll upload the messages logs for both nodes (SUC21/SUC23). Please let me know if you need more information.

Attachment:
gw-suc21-messages.dat - /var/log/messages from SUC21
gw-suc23-messages.dat - /var/log/messages from SUC23

Thanks,
Masahiro

This event sent from IssueTracker by mokubo issue 135651
Hi,

My customer is using clumanager-1.2.37-1.1 and hit the problem. After looking through this bugzilla, we found that the patch in comment #24 of BZ 348671 is not yet applied to clumanager-1.2.37-1.1. We think that patch might solve our customer's problem.

Questions:
1. Approximately when are you planning to release a clumanager package which includes the patch (comment #24 in BZ 348671)?
2. We've found that clumanager-1.2.37.2-0.2test.i386.rpm includes the patch. Is it a problem if I provide that package to my customer for testing?

Thanks in advance,
Masahiro

This event sent from IssueTracker by mokubo issue 135651
Yes, you may test with 1.2.37.2-0.2test.
http://people.redhat.com/lhh/clumanager-1.2.37.2-0.2test.i386.rpm
http://people.redhat.com/lhh/clumanager-1.2.37.2-0.2test.x86_64.rpm
http://people.redhat.com/lhh/clumanager-1.2.37.2-0.2test.src.rpm
http://people.redhat.com/lhh/clumanager-debuginfo-1.2.37.2-0.2test.i386.rpm
http://people.redhat.com/lhh/clumanager-debuginfo-1.2.37.2-0.2test.x86_64.rpm
Hi Lon,

I've been doing testing on a two-node cluster (node1/node2) before providing the test package (1.2.37.2-0.2test) to my customer, and unfortunately I did reproduce the problem with it.

[Testing environment]
* Two-node cluster (node1/node2) on VMware with shared storage on iSCSI
* RHEL3U8 + clumanager-1.2.37.2-0.2test

[Test case]
From /var/log/messages on node1:

Sep 10 14:38:30 node1 clusvcmgrd: [9623]: <notice> service notice: Started service BATCH ...
<- Made sure the cluster was up and running with no problem
..
Sep 10 14:39:11 node1 clulockd[6766]: <warning> Denied 192.168.50.12: Broken pipe
Sep 10 14:39:13 node1 clulockd[6766]: <err> select error: Broken pipe
<- Gave the system some busy work ("cat /dev/zero") and observed that it started producing the above errors.
..
Sep 10 14:40:51 node1 clumanager: [12521]: <notice> Shutting down Red Hat Cluster Manager...
<- Executed "service clumanager stop"
..
Sep 10 15:00:14 node1 clulockd[6766]: <warning> Denied 192.168.50.12: Broken pipe
<- Canceled the busy work ("cat /dev/zero") since clumanager didn't stop for nearly 20 minutes after "service clumanager stop".
..

From /var/log/messages on node2:

Sep 10 15:02:25 node2 clusvcmgrd[5528]: <warning> Starvation on Lock #1!
Sep 10 15:03:27 node2 clusvcmgrd[5528]: <warning> Starvation on Lock #1!
<- It started producing the "Starvation on Lock" error

Finally, I rebooted both nodes at Sep 10 15:16.

[Backtrace of clusvcmgrd when the problem is occurring]

* node1
(gdb) bt
#0 0x0025c038 in ___newselect_nocancel () from /lib/tls/libc.so.6
#1 0x08055c87 in __select_retry (fdmax=15, rfds=0xbffff390, wfds=0x0, xfds=0xbffff310, timeout=0xbffff480) at fdops.c:43
#2 0x08055f44 in __read_retry (sockfd=14, buf=0xbffff450, count=4, timeout=0xbffff480) at fdops.c:149
#3 0x0805374d in __msg_receive (handle=14, buf=0xbffff4c0, count=20, tv=0xbffff480) at msgsimple.c:889
#4 0x080538fc in msg_receive_timeout (handle=14, buf=0xbffff4c0, count=20, timeout=15) at msgsimple.c:958
#5 0x08050244 in lock_keeper () at clulock.c:149
#6 0x0805092d in lock_send (pc=0x804e050, lockid=1, cmd=36865) at clulock.c:291
#7 0x080509a0 in lock_lock (pc=0x804e050, lock=1) at clulock.c:307
#8 0x0805105f in _clu_try_lock_cluster (pc=0x804e050, lockid=1) at clulock.c:544
#9 0x08050baa in _clu_try_lock (pc=0x804e050, lockid=1) at clulock.c:409
#10 0x08050a2d in clu_lock (lockid=1) at clulock.c:353
#11 0x0804e050 in check_services (elapsed=6) at clusvcmgrd.c:2522
#12 0x0804f07a in main (argc=1, argv=0xbffff904) at clusvcmgrd.c:3033
#13 0x0019b79a in __libc_start_main () from /lib/tls/libc.so.6
#14 0x0804a631 in _start ()

* node2
(gdb) bt
#0 0x00f74038 in ___newselect_nocancel () from /lib/tls/libc.so.6
#1 0x0805aa80 in auth_md5 (fd=14) at auth_md5.c:379
#2 0x08052dfe in __msg_open (daemonid=3, nodeid=0, timeout=0xbfffbc18) at msgsimple.c:554
#3 0x08052e80 in msg_open (daemonid=3, nodeid=0) at msgsimple.c:579
#4 0x080503b5 in __lock_send (lreq=0xbfffbcb0) at clulock.c:192
#5 0x0805082c in lock_send (pc=0x804e050, lockid=0, cmd=36865) at clulock.c:270
#6 0x080509a0 in lock_lock (pc=0x804e050, lock=0) at clulock.c:307
#7 0x0805105f in _clu_try_lock_cluster (pc=0x804e050, lockid=0) at clulock.c:544
#8 0x08050baa in _clu_try_lock (pc=0x804e050, lockid=0) at clulock.c:409
#9 0x08050a2d in clu_lock (lockid=0) at clulock.c:353
#10 0x0804e050 in check_services (elapsed=6) at clusvcmgrd.c:2522
#11 0x0804f07a in main (argc=1, argv=0xbfffc094) at clusvcmgrd.c:3033
#12 0x00eb379a in __libc_start_main () from /lib/tls/libc.so.6
#13 0x0804a631 in _start ()
[Attachment]
* /var/log/messages
* /var/log/cluster - DEBUG log
* fuser.log - output of "fuser /var/lock/clumanager/*"
* lock.log - output of "cat /proc/locks"
* lsof.log - output of "lsof /var/lock/clumanager/*"
* ps.log - output of "ps -ef |grep clu"
* shutil.log - output of "shutil -p /lock/0/status"
* lsof_clulockd.log - output of "lsof -p <pid of clulockd>"
* clusvcmgrd.bt - backtrace of clusvcmgrd
* clulockd.bt - backtrace of clulockd

The problem looks like a very rare case. I ran the test nearly 100 times and saw the problem only once.

Could you have a look?

Thanks,
Masahiro

This event sent from IssueTracker by mokubo issue 135651 it_file 155794
Hi Lon,

I've finally reproduced the problem. Could you log in to the testing system?

"dhcp-1-138.bne.redhat.com" is trying to disable the DB service with "clusvcadm -d", but it can't stop it for some reason and has been producing the "Starvation on Lock" error for a while.

* Output of clustat

[root@dhcp-1-138 root]# clustat
Cluster Status - Red Hat Clumanager Cluster ASEAN                     19:42:37
Cluster Quorum Incarnation #3
Shared State: Shared Raw Device Driver v1.2

  Member             Status
  ------------------ ----------
  dhcp-1-138.bne.red Active     <-- You are here
  dhcp-1-89.bne.redh Active

  Service        Status   Owner (Last)     Last Transition Chk Restarts
  -------------- -------- ---------------- --------------- --- --------
  DB             stopping dhcp-1-138.bne.r 17:02:07 Jan 27  60        0
  BATCH          disabled (dhcp-1-138.bne. 17:02:06 Jan 27  60        0
  REAL           disabled (dhcp-1-89.bne.r 16:57:16 Jan 27  60        0

* /var/log/messages

Jan 27 17:43:06 dhcp-1-138 clusvcmgrd[8065]: <warning> Starvation on Lock #0!
Jan 27 17:43:10 dhcp-1-138 clusvcmgrd[6147]: <warning> Starvation on Lock #2!
..
Jan 27 19:41:24 dhcp-1-138 clusvcmgrd[6147]: <warning> Starvation on Lock #2!
Jan 27 19:41:25 dhcp-1-138 clusvcmgrd[8065]: <warning> Starvation on Lock #0!

At the time of this writing (Jan 27 19:57), "clusvcadm -d" has failed to stop the DB service and has been generating the "Starvation on Lock" error for nearly two hours.

As additional information, I'm also running the following scripts on both nodes to log the relevant information in the /tmp directory (the scripts are in the /root/cluster directory):

* fuser.sh logs /tmp/fuser.log - output of "fuser /var/lock/clumanager/*"
* lock.sh logs /tmp/lock.log - output of "cat /proc/locks"
* lsof.sh logs /tmp/lsof.log - output of "lsof /var/lock/clumanager/*"
* ps.sh logs /tmp/ps.log - output of "ps -ef |grep clu"
* shutil.sh logs /tmp/shutil.log - output of "shutil -p /lock/0/status"

I'll leave the testing system up and running, so please do whatever you need to get the relevant information.

Thanks,
Masahiro

This event sent from IssueTracker by mokubo issue 135651
So, I reproduced it as well on your cluster.

As it turns out, the local (non-cluster) locking components of clulock.c are terribly unreliable when I/O access is slow. I reproduced this problem by injecting a 'sleep 60' randomly into clulockd. What eventually happened was one of two things:

* fcntl() on the file-based node lock failed with no errno, or
* the process-lock level was 2 (which shouldn't happen)

I evaluated the rationale behind the triple-layer locking, and it's unnecessary in clumanager 1.2. In 1.0, we only had one lock, and all access to it required performing (slow) raw I/O. So, to eliminate contention, we had this notion of a "process" lock and a "node local" lock which were taken before the cluster lock. The cluster lock, however, is buffered via clulockd in clumanager 1.2, so the triple-layer locking isn't needed any more.

So, I'm testing a patch which removes all of that. So far it's been very reliable - even when an individual lock request takes 5+ minutes to complete due to the sleep 60 plus contention. The previous patch in this bugzilla is also required for proper operation.

Once I have a little more testing done, I will post the patch and new RPMs here in this bugzilla.
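For illustration only, here is a minimal sketch of the triple-layer ordering described above: a process-level lock, then a node-local fcntl() file lock, then the cluster-wide lock obtained via clulockd. The function names, the lock-file path, and the use of a pthread mutex for the process layer are assumptions made for the sketch, not the actual clulock.c code; the patch keeps only the clulockd-backed cluster lock.

/* Hypothetical sketch of the 1.0-era triple-layer lock ordering that
 * the patch removes; only the node-local layer is shown concretely. */
#include <fcntl.h>
#include <pthread.h>
#include <unistd.h>

static pthread_mutex_t process_lock = PTHREAD_MUTEX_INITIALIZER;

/* Stand-in for the clulockd (cluster-wide) request; not a real API. */
static int cluster_lock(int lockid) { (void)lockid; return 0; }

/* Layer 2: node-local file lock.  This is the sort of fcntl() call
 * that was observed failing with no errno when I/O access was slow. */
static int node_local_lock(const char *path)
{
    int fd = open(path, O_CREAT | O_RDWR, 0600);
    if (fd < 0)
        return -1;

    struct flock fl = { .l_type = F_WRLCK, .l_whence = SEEK_SET };
    if (fcntl(fd, F_SETLKW, &fl) < 0) {
        close(fd);
        return -1;
    }
    return fd;              /* fd stays open while the lock is held */
}

int take_lock_layered(int lockid)
{
    pthread_mutex_lock(&process_lock);                        /* layer 1 */
    int fd = node_local_lock("/var/lock/clumanager/lock.0");  /* layer 2 */
    if (fd < 0) {
        pthread_mutex_unlock(&process_lock);
        return -1;
    }
    return cluster_lock(lockid);                              /* layer 3 */
}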
Obviously, as I've stated over and over, "Starvation warnings" are not to be confused with "Critical errors", and the new patch does not and will not eliminate them in environments where access to shared storage is interrupted or slowed for any reason. What the new patch does is fix infinite "starvation loops", where a node complains over and over that it is starving for a particular lock.
More to the point, the new patch should fix all the remaining cases where the locks are not correctly released.
Created attachment 330178 [details] Simplified locking layer
http://people.redhat.com/lhh/clumanager-1.2.38-0.1test.src.rpm

^^ Testing srpm. To make an RPM on your system:

rpmbuild --rebuild clumanager-1.2.38-0.1test.src.rpm

Sha1sum:
[lhh@ayanami clumanager]$ sha1sum clumanager-1.2.38-0.1test.src.rpm
624973501e1ca7df3a60c44d7841cd55c80592e6  clumanager-1.2.38-0.1test.src.rpm
Need testing feedback from affected users.
The test cluster revealed that, on a heavily loaded system, a quorum update isn't correctly retried after a new membership view has occurred. This is a one-line fix; I will produce a new test srpm.
http://people.redhat.com/lhh/clumanager-1.2.38-0.2test.src.rpm

^^ Testing srpm. To make an RPM on your system:

rpmbuild --rebuild clumanager-1.2.38-0.2test.src.rpm

Sha1sum:
[lhh@ayanami clumanager]$ sha1sum clumanager-1.2.38-0.2test.src.rpm
ef547dd377397155f1e5a8c61033ee085f19255d  clumanager-1.2.38-0.2test.src.rpm
Created attachment 330281 [details] Fix quorum/VF retry
I have one more issue I am working on which is related to an ordering of clean exit vs. quorum convergence on heavily loaded systems. I think I know how to fix it.
I can't reproduce the problem noted in comment #59 on my old bare metal RHEL3 cluster.
Note I have clumembd%rtp set to 10 and cluquorumd%rtp also set to 10 on my bare metal cluster. The previous cluster I produced the problem on was using virtualization.
Ok, my testing environment is up and running now, as follows.

* Two-node cluster
dhcp-1-138.bne.redhat.com (IP: 10.64.1.138) - username: root / passwd: cluster
dhcp-1-89.bne.redhat.com (IP: 10.64.1.89) - username: root / passwd: cluster

I'm currently testing "clumanager-1.2.38-0.2test" by running the reproduction script, but feel free to log in and do whatever you want on the testing system.

I'll also wait for the newer patched package and do further testing when you've released it, before submitting it to the customer.

Thanks,
Masahiro

This event sent from IssueTracker by mokubo issue 135651
Hi Masahiro,

I will continue to try to reproduce this on my RHEL3 development cluster, but I might need your cluster again to fix the problem described in comment #59. If you would like to play around on my cluster, I can get you that information.
Hi Lon,

Thanks for your continuous effort. I have a question from the customer.

When they had the problem, and also when I reproduced it, I often saw the following error messages:

Feb 5 01:11:38 SUC02 clulockd[6602]: <warning> Denied 192.168.3.1: Broken pipe
Feb 5 01:11:38 SUC02 clulockd[6602]: <err> select error: Broken pipe

What do these messages mean, and under what circumstances are they logged? Are they harmless?

Thanks,
Masahiro

This event sent from IssueTracker by mokubo issue 135651
The error occurs when shared storage is interrupted temporarily. It is harmless, as the cluster will retry in these cases where appropriate.
That is, it can be caused by I/O flow problems to disk or high CPU load preventing clulockd from getting scheduled.
Hi Lon,

I've been testing the hotfix package (clumanager-1.2.38-1) on my test system (VMware) for the last week, and from time to time a cluster service fails to start or stop, producing the "Starvation on Lock" error. It usually takes 12 hours or more to reproduce, though.

The problem occurs only when the system is busy (I make the system busy by running "dd if=/dev/zero of=/dev/null"). On a busy system, the task (cluster service start/stop) never seems to get scheduled. I believe this is happening because I'm testing on VMware, as per your comment #70 in the BZ (the hypervisor doesn't recognize the RT-queued task). But just in case, can you have a look at my testing system? The problem is reproduced right now. (I can't keep the system in this state for long because I need it for other tasks, so please take a look today and give me some comments.)

* Testing system information (two-node cluster)
node1: dhcp-1-138.bne.redhat.com (IP: 10.64.1.138) - username: root / passwd: cluster
node2: dhcp-1-89.bne.redhat.com (IP: 10.64.1.89) - username: root / passwd: cluster
-> dhcp-1-89.bne.redhat.com is the busy system

clusvcadm is trying to enable a cluster service (DB), but it can't start it and keeps producing the "Starvation on Lock" error. The system has been in this state for nearly two days. If you run "killall dd" on dhcp-1-89.bne.redhat.com, the node will get out of the busy state and clusvcadm will be able to enable the cluster service (you can run "killall dd" if you want).

So in this case, should we consider the problem to be caused by the hypervisor? I can ask the customer to test the hotfix in their bare-metal environment at this stage, since it might be hard to answer the above question. But can you tell me what information I should ask the customer to collect if the problem is reproduced?

Thanks,
Masahiro

This event sent from IssueTracker by mokubo issue 135651
I'll look at it -- I might not be able to get to it today though. :/
Hi Lon,

Sorry, I couldn't keep the testing machine in the reproduced state since I needed it for another purpose. Anyway, we haven't reproduced the problem on bare metal. So I talked to my manager, and we would like to submit the hotfix package to the customer and ask them to conduct testing with it this week.

I'm thinking of asking them to collect the following information when they conduct the testing:
1. DEBUG log
2. core of clusvcmgrd when the problem occurs

Please let me know if they should collect more information.

Thanks,
Masahiro

This event sent from IssueTracker by mokubo issue 135651
Yeah. I'm sorry, I got pulled off this to work on other things. I'm fairly certain that the issue you saw when running in testing happened during 'stop' -- it's a synchronization issue with clumanager that I think only occurs if the entire VM doesn't get scheduled. I'll try to get back to this ASAP, but at least for now the issue is partially fixed.
I would certainly like to fix the second issue before releasing an erratum.
So what happens is: Node A shuts down. On node B (which is loaded), 'cluquorumd' doesn't process a membership change, or it processes it, but doesn't deliver it to clulockd, or clulockd doesn't ever wake up and process the message... I'm not sure which. This causes clulockd to see an old view of the cluster. This, in turn, causes clulockd to believe the "down" node A is still the lock master, which causes clusvcmgrd to complain. I could patch around this by having clulockd query cluquorumd occasionally, but I really would like to fix the delivery mechanism. Again, I was not able to produce this errant behavior on bare metal clusters.
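Purely as a sketch of the "patch around" option mentioned above (clulockd occasionally re-querying quorum state instead of relying only on delivered membership events): query_quorum_view(), the view structure, and the refresh interval are all hypothetical, and the preferred fix remains repairing the delivery mechanism itself.

/* Hypothetical sketch of clulockd periodically refreshing its
 * membership view instead of relying only on delivered events.
 * query_quorum_view() is a stand-in, not a real clumanager call. */
#include <time.h>

struct member_view {
    unsigned long long online_mask;   /* bit N set => node N online */
    int incarnation;
};

static struct member_view query_quorum_view(void)
{
    /* Stub: pretend nodes 0 and 1 are online. */
    struct member_view v = { 0x3ULL, 1 };
    return v;
}

static struct member_view cached_view;
static time_t last_refresh;

/* Called at the top of the lock-dispatch loop: if a membership event
 * is lost under load, the stale view is corrected within a few
 * seconds instead of persisting (and pointing at a dead lock master). */
void maybe_refresh_view(void)
{
    time_t now = time(NULL);
    if (now - last_refresh >= 5) {
        cached_view = query_quorum_view();
        last_refresh = now;
    }
}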
Thanks for your work.

I've already asked the customer to test the hotfix package (clumanager-1.2.38-1.i386.rpm), and they've decided to use it in their production system. Since the customer must update from the hotfix package within 30 days after the erratum is released, I'd like to know approximately when the next erratum will be released; the customer might want to wait for the erratum if it will be released anytime soon, so that they can avoid updating clumanager twice.

Please let me know if you have any plans.

Thanks,
Masahiro

This event sent from IssueTracker by mokubo issue 135651
I will check. The current plan is to release the above package as an erratum - I pushed it through our build system and that's the one I tested with.
Hello,

The customer did further testing on this and provided additional information, such as backtraces of the clusvcmgrd daemon and the clusvcadm command while the problem was occurring. They didn't get the output of "shutil -p /lock/0/status", so I'm not sure which process actually keeps holding the lock, but I guess the most likely scenario is the following:

1. "clusvcadm -d BATCH" is executed to stop the BATCH service.
2. clusvcmgrd (PID 3679) forks the child process clusvcmgrd (PID 9019) to perform the task of stopping the BATCH service.
3. clusvcmgrd (PID 9019) tries to take the lock to do the task, but the parent process clusvcmgrd (PID 3679) keeps holding the lock for some reason and doesn't release it. So clusvcmgrd (PID 9019) hangs and can't stop the BATCH service.

Test case:
-> Testing was conducted using the script test.sh, which does the following:
1. stop clumanager
2. stop the three cluster-registered services (REAL/DB/BATCH)
3. start clumanager
4. start the three cluster-registered services (REAL/DB/BATCH)
Note: the test.sh script repeats the above operation 500 times.

Environment:
- Two-node cluster (SUC01/SUC02)
- Running three cluster services (REAL/DB/BATCH)

-> The result is that it failed to stop a cluster service on the 234th attempt and started logging the "Starvation on Lock" error.

-> Attachment: 07.500normal_debug3.tar.gz contains the following information.
- sysreport
- /var/log/messages
- cluster.log (DEBUG log)
- ps.txt - ps output when the problem occurred
- clustat.txt - clustat output when the problem occurred
- test.sh - script used to conduct the testing
- backtrace.txt - backtrace of each cluster daemon process

-> Supporting information:

* The time of the issue:
When the script tried to stop one of the cluster services (BATCH) at Tue Jun 2 04:13:51 2009, it failed to stop and started logging the error.

* 07.500normal_debug3/ps.txt - ps output of the node when the problem was occurring

[root@SUC01 etc]# ps -ef | grep clu
root      3390     1  0 04:10 ?        00:00:49 cluquorumd
root      3393     1  0 04:10 ?        00:00:10 /usr/sbin/clumembd
root      3423     1  0 04:10 ?        00:01:49 /usr/sbin/clulockd
root      3679     1  0 04:10 ?        00:00:19 /usr/sbin/clusvcmgrd
root      9008  9007  0 04:13 ?        00:00:00 /usr/sbin/clusvcadm -d BATCH
root      9019  3679  0 04:13 ?        00:00:19 /usr/sbin/clusvcmgrd
root      6599  4248  0 13:34 pts/0    00:00:00 grep clu

* 07.500normal_debug3/SUC01/cluster.log

Jun 2 04:16:31 SUC01 clusvcmgrd[9019]: <warning> Starvation on Lock #1!
Jun 2 04:16:31 SUC01 clulockd[3423]: <debug> lockd_trylock: member #0 lock 1
Jun 2 04:16:31 SUC01 clulockd[3423]: <debug> lock: held
Jun 2 04:16:31 SUC01 clulockd[3423]: <debug> accepted local connection
Jun 2 04:16:31 SUC01 clulockd[3423]: <debug> Processing message on 11
Jun 2 04:16:31 SUC01 clulockd[3423]: <debug> Received 188 bytes from peer
Jun 2 04:16:31 SUC01 clulockd[3423]: <debug> LOCK_LOCK | LOCK_TRYLOCK

* 07.500normal_debug3/clustat.txt

[root@SUC01 etc]# clustat
Cluster Status - Red Hat Clumanager Cluster ASEAN                     13:23:30
Cluster Quorum Incarnation #1
Shared State: Shared Raw Device Driver v1.2

  Member             Status
  ------------------ ----------
  M_SUC01            Active     <-- You are here
  M_SUC02            Active

  Service        Status   Owner (Last)     Last Transition Chk Restarts
  -------------- -------- ---------------- --------------- --- --------
  DB             started  M_SUC01          04:12:15 Jun 02  60        0
  BATCH          stopping M_SUC01          04:13:38 Jun 02  60        0   <- This is hanging
  REAL           disabled (M_SUC02)        04:13:37 Jun 02  60        0

* 07.500normal_debug3/backtrace.txt

-> backtrace of clusvcmgrd (PID 9019)

(gdb) backtrace
#0 0x004a8c58 in ___newselect_nocancel () from /lib/tls/libc.so.6
#1 0x080556bf in __select_retry (fdmax=15, rfds=0xbfff79e0, wfds=0x0, xfds=0xbfff7960, timeout=0xbfff7ad0) at fdops.c:43
#2 0x0805597c in __read_retry (sockfd=14, buf=0xbfff7aa0, count=4, timeout=0xbfff7ad0) at fdops.c:149
#3 0x08053185 in __msg_receive (handle=14, buf=0xbfff7b10, count=20, tv=0xbfff7ad0) at msgsimple.c:889
#4 0x08053334 in msg_receive_timeout (handle=14, buf=0xbfff7b10, count=20, timeout=15) at msgsimple.c:958
#5 0x0805042d in __lock_send (lreq=0xbfff7b70) at clulock.c:198
#6 0x080507d0 in lock_send (pc=0x804b7f0, lockid=1, cmd=36865) at clulock.c:252
#7 0x08050944 in lock_lock (pc=0x804b7f0, lock=1) at clulock.c:289
#8 0x08050b05 in _clu_try_lock_cluster (pc=0x804b7f0, lockid=1) at clulock.c:384
#9 0x080509d1 in clu_lock (lockid=1) at clulock.c:335
#10 0x0804b7f0 in flip_state (svcName=0x8b38bf0 "BATCH", svcID=1, state=3, last_owner_flip=1) at clusvcmgrd.c:918
#11 0x0804beae in svc_disable (svcID=1) at clusvcmgrd.c:1127
#12 0x0804d5e4 in handle_svc_request (svcID=1, action=6, target=-1, fd=12) at clusvcmgrd.c:2092
#13 0x0804e939 in dispatch_msg (fd=12) at clusvcmgrd.c:2804
#14 0x0804eec9 in main (argc=1, argv=0xbfff80a4) at clusvcmgrd.c:2996
#15 0x003e878a in __libc_start_main () from /lib/tls/libc.so.6
#16 0x0804a5a1 in _start ()
(gdb)

-> backtrace of clusvcadm (PID 9008)

(gdb) backtrace
#0 0x0072ec58 in ___newselect_nocancel () from /lib/tls/libc.so.6
#1 0x0804ec1f in __select_retry (fdmax=7, rfds=0xbfffb4c0, wfds=0x0, xfds=0xbfffb440, timeout=0x0) at fdops.c:43
#2 0x0804eedc in __read_retry (sockfd=6, buf=0xbfffb580, count=4, timeout=0x0) at fdops.c:149
#3 0x0804c2f1 in __msg_receive (handle=6, buf=0xbfffb600, count=40, tv=0x0) at msgsimple.c:889
#4 0x0804c3f1 in msg_receive (handle=6, buf=0xbfffb600, count=40) at msgsimple.c:926
#5 0x0804a990 in main (argc=3, argv=0xbfffb6e4) at clusvcadm.c:354
#6 0x0066e78a in __libc_start_main () from /lib/tls/libc.so.6
#7 0x08049ead in _start ()
(gdb)

Please feel free to contact me if you need further information.

Regards,
Masa

This event sent from IssueTracker by mokubo issue 135651 it_file 225912
So, what is happening is:

1. clulockd is busy (sending messages to syslog),
2. causing local waiters to retry the locking request,
3. clulockd sees a local process is waiting for input, and because it processes local connections first, this
4. starves out remote connections.

As system I/O load increases, the chances of this occurring increase - especially with loglevel set to 7 (debug). When set to 7, clulockd spends most of its time blocked waiting for syslog to finish recording the log message (since it's synchronous I/O).

cludb -r clulockd%loglevel

... should fix it. A solution within the code would require fixing (3).
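To illustrate point (3), here is a minimal sketch (not the actual clulockd source) of a local-first dispatch loop versus one that services every listener select() reports as ready, the latter being the idea behind the "check both local and remote listeners" patch attached below. The listener descriptors and handle_request() helper are hypothetical stand-ins.

/* Sketch of local-first vs. fair dispatch over two listening sockets.
 * The descriptors and handle_request() are hypothetical stand-ins. */
#include <sys/select.h>

static int local_fd  = 0;   /* placeholder listening sockets */
static int remote_fd = 1;

static void handle_request(int fd)
{
    (void)fd;               /* accept + process one lock request (stub) */
}

void dispatch_once(void)
{
    fd_set rfds;
    FD_ZERO(&rfds);
    FD_SET(local_fd, &rfds);
    FD_SET(remote_fd, &rfds);

    int maxfd = (local_fd > remote_fd ? local_fd : remote_fd) + 1;
    if (select(maxfd, &rfds, NULL, NULL, NULL) <= 0)
        return;

    /* Local-first version (the starving behavior): remote clients are
     * only served when no local waiter is pending, so constant local
     * retries shut remote nodes out entirely:
     *
     *   if (FD_ISSET(local_fd, &rfds))
     *       handle_request(local_fd);
     *   else if (FD_ISSET(remote_fd, &rfds))
     *       handle_request(remote_fd);
     */

    /* Fair version: service every listener select() reported ready. */
    if (FD_ISSET(local_fd, &rfds))
        handle_request(local_fd);
    if (FD_ISSET(remote_fd, &rfds))
        handle_request(remote_fd);
}

The key property of the fair version is that a steady stream of local retries can delay remote requests but can no longer postpone them indefinitely.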
I reproduced the issue, and solved it by removing the clulockd loglevel from cluster.conf - WHILE the locks were being starved, in fact.

Note that apparently when clulockd is this busy, you may have to run:

killall -HUP clulockd

... on both nodes shortly after reloading the configuration. Due to a backlog of open file descriptors, it took a few seconds for clulockd to return to normal operation.
Created attachment 350182 [details] Patch: Always check both local and remote listeners for lock requests.
I have reset logging to log level 7 and restarted testing with the above patch.
Other problems occur with the above patch when using log level 7 (debug) for clulockd, but starvation is not one of them. Notably, service requests from clusvcadm time out routinely during testing under load. These problems, as well as the starvation in general, can be avoided by not using loglevel 7.
You get timeout errors from clusvcadm. Effectively, it times out trying to make a connection to clusvcmgrd, because:

* clusvcmgrd (main) forks a status check for service X
* clusvcmgrd (main) tries to talk to clulockd to get a lock for service Y, and blocks, because locking is horrifically slow while clulockd is in debug mode and blocking on syslog
* clusvcmgrd (service X) handles the status check
  * it tries to take a cluster lock, which takes a long time because locking is horrifically slow while clulockd is in debug mode and blocking on syslog
* clusvcadm sends a message to clusvcmgrd (main)
* clusvcadm times out on authentication (5 second timeout) because clusvcmgrd (main) is blocked on locking, which is blocked on syslog...

clusvcadm fails at this point.
I have a patch to fix the above behavior as well, but it requires disabling authentication in order to prevent the clusvcadm timeout, which is sub-optimal. So, now, we have two patches to fix one behavior which can be fixed by simply turning off debug mode. Turning off debug mode for clulockd requires no patches and can be done immediately.

Alternatively, you can turn on asynchronous logging to the cluster log by prepending the line in /etc/syslog.conf with '-'. For example:

local4.* -/var/log/cluster.log
Added issue 299318.
Note that, in addition to the fixes in this bugzilla, administrators should do the following if starvation warnings persist:

* Set clulockd's loglevel to 4 ("WARNING"). Running with the log level at 7 ("DEBUG") is a severe performance hit, as clulockd sends many, many messages to syslog in this mode. This is the very first thing administrators should try.
* Set clumembd to realtime prio 20 (see 'man cludb')
* Set cluquorumd to realtime prio 19 (see 'man cludb')
* Decrease cluquorumd's log level to 6 ("INFO") or below to reduce the chances of it becoming starved on syslog (which is a big problem on a loaded system).
* Decrease clumembd's log level to 6 ("INFO") or below to reduce the chances of it becoming starved on syslog (which is a big problem on a loaded system).

To further improve performance, one may also:

* Turn on asynchronous recording for clumanager's logging by prefixing the line in syslog.conf with a '-'; for example:
  local4.* -/var/log/cluster
* Drop the log levels of clumembd and cluquorumd to 4 (default) or below.
An advisory has been issued which should help the problem described in this bug report. This report is therefore being closed with a resolution of ERRATA. For more information on the solution and/or where to find the updated files, please follow the link below. You may reopen this bug report if the solution does not work for you.

http://rhn.redhat.com/errata/RHBA-2009-1436.html