Bug 247766 - dlm: Magma: reply from 5 no lock
Summary: dlm: Magma: reply from 5 no lock
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: Red Hat Cluster Suite
Classification: Retired
Component: dlm-kernel
Version: 4
Hardware: All
OS: Linux
Priority: low
Severity: high
Target Milestone: ---
Assignee: David Teigland
QA Contact: Cluster QE
URL:
Whiteboard:
Depends On:
Blocks:
 
Reported: 2007-07-11 11:24 UTC by Mark Hlawatschek
Modified: 2009-05-18 21:17 UTC
CC List: 10 users

Fixed In Version:
Doc Type: Bug Fix
Doc Text:
Clone Of:
Environment:
Last Closed: 2009-05-18 21:17:23 UTC
Embargoed:




Links
Red Hat Product Errata RHBA-2009:1051 (normal, SHIPPED_LIVE): dlm-kernel bug-fix update, last updated 2009-05-18 21:17:04 UTC

Description Mark Hlawatschek 2007-07-11 11:24:56 UTC
Description of problem:

The problem has multiple symptoms:
- rgmanager cannot acquire a dlm lock, causing rgmanager to hang. This can only
be resolved by restarting the whole cluster.
- GFS access is very slow (gfs_tool counters shows outstanding LM calls)
- dlm_recvd is in state "D<" 
1950 ?        D<    46:00  \_ [dlm_recvd] 


Version-Release number of selected component (if applicable):
RHEL4.4, RHCS4.4, GFS


How reproducible:
No idea; it just happens, I believe under heavy load.

Steps to Reproduce:
1.
2.
3.
  
Actual results:


Expected results:


Additional info:
I already opened a service request (1669002) for this issue. There you can 
find the following additional information (from each node):
* kernel trace 
 
* Output from the script: 
mountpoint=/cluster/mount/vg_P06user/lv_P06user 
name=vg_P06user_lv_P06user 
cd /root 
if [ ! -d "analysis-20070710" ]; then 
  mkdir analysis-20070710 
fi 
cd analysis-20070710 
gfs_tool lockdump $mountpoint > $(hostname)_lockdump_vg_$name 
gfs_tool counters $mountpoint > $(hostname)_counters_$name 
tar cvzf ./$(hostname)-proc.tar.gz /proc/[0-9][0-9]* /proc/cluster 
find /proc -name [0-9][0-9]* -maxdepth 1 -exec ls -lhtr {}/fd \; > $(hostname)_fds.log 
ps axfwwwwww > $(hostname)_ps.log 
ps ax -o pid,stat,cmd,wchan > $(hostname)_ps2.log 
pstree > $(hostname)_pstree.log 
echo "Magma" > /proc/cluster/dlm_locks 
cat /proc/cluster/dlm_locks > $(hostname)_dlm_locks_Magma1 
for service in $(cat /proc/cluster/services | grep "DLM Lock Space" | awk -F '\"' ' {print $2}'); do 
        echo "Getting locks for $service" 
        echo $service > /proc/cluster/dlm_locks 
        cat /proc/cluster/dlm_locks > $(hostname)_dlm_locks_$service 
done

Comment 1 Mark Hlawatschek 2007-07-11 11:30:59 UTC
Another symptom of this problem is that there is a massive number of locks:
gfs_tool counters shows:
                                  locks 1345381
                             locks held 673475
                          incore inodes 672488
                       metadata buffers 144
                        unlinked inodes 5
                              quota IDs 0
                     incore log buffers 0
                         log space used 0.49%
              meta header cache entries 0
                     glock dependencies 0
                 glocks on reclaim list 0
                              log wraps 88
                   outstanding LM calls 0
                  outstanding BIO calls 1
                       fh2dentry misses 0
                       glocks reclaimed 24168949
                         glock nq calls 1724612418
                         glock dq calls 1710695537
                   glock prefetch calls 10443847
                          lm_lock calls 43425976
                        lm_unlock calls 29437647
                           lm callbacks 73113018
                     address operations 34289469
                      dentry operations 679689195
                      export operations 0
                        file operations 353819582
                       inode operations 767228946
                       super operations 134338234
                          vm operations 13349812
                        block I/O reads 31218719

Note, that /proc/cluster/dlm/drop_count is at 50000
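
A minimal sketch of inspecting and adjusting that threshold, assuming the lock_dlm drop_count is exposed at /proc/cluster/lock_dlm/drop_count on this kernel (the new value below is purely illustrative, not a recommendation):

  # show the current drop_count threshold used by lock_dlm
  cat /proc/cluster/lock_dlm/drop_count
  # raise the threshold (illustrative value only)
  echo 200000 > /proc/cluster/lock_dlm/drop_count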

Comment 3 David Teigland 2007-07-12 16:09:46 UTC
This many locks could definitely be causing problems.  The massive number of
locks is almost entirely "iopen" locks, which are never dropped.  GFS takes an
iopen lock for every file it stats.  A node usually accumulates this huge
number of iopen locks by scanning the entire fs (via a backup, find, updatedb, etc).

A new gfs patch exists to purge iopen locks; it has gone into the latest 4.5
release.  The patch adds a new tunable, "glock_purge", which is the percentage
of iopen locks that gfs should regularly try to drop, e.g. glock_purge=50
means that a gfs daemon will regularly attempt to drop half of the iopen
locks it holds.

The glock_purge tunable is clumsy to use: if it's turned on unnecessarily,
it will be of no help and will actually harm performance.
So, it should only be enabled while the number of iopen glocks is excessively
high; otherwise it should be disabled (by setting it to 0).
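
A minimal sketch of toggling the tunable, assuming the 4.5 gfs packages expose glock_purge through gfs_tool gettune/settune (mount point taken from the script above):

  mountpoint=/cluster/mount/vg_P06user/lv_P06user
  # show current tunables, including glock_purge, for this mount
  gfs_tool gettune $mountpoint | grep glock_purge
  # purge roughly half of the cached iopen glocks on each scan
  gfs_tool settune $mountpoint glock_purge 50
  # disable purging again once the glock count is back to normal
  gfs_tool settune $mountpoint glock_purge 0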

--

What was the very first problem to occur on the cluster?  Can we point to
a specific node at a specific time?  Things will appear in the collected
debugging info that are simply a result of the initial problem triggering
other problems.  We need to identify the first problem that occurred, then
try to find a cause for that.


Comment 4 Marc Grimme 2007-07-16 06:35:07 UTC
It looks like backups (NetBackup is used) and, earlier, cron jobs such as
prelink and slocate caused these troubles. Although we have now also disabled
the backup, we still see gfs_scand accumulating a lot of CPU time on the
relevant filesystems (e.g. with an uptime of 40h, gfs_scand gets 4h of CPU time).

As we are still using RHEL4U4 and we are talking about production systems, we
cannot easily update to new versions. A RHEL4U5 update is planned, but we'll
have to wait until RHEL4U5 can be used.

BTW: do you think the large number of locks, the gfs_scand CPU usage and the
freeze of the cluster are correlated, or are we talking about two different
things? And why do we have those orphaned locks, and why does rgmanager not
work properly? Do you have any explanation for this?

Comment 5 David Teigland 2007-07-16 13:33:14 UTC
The large gfs_scand time is caused by the large number of locks.
The freeze and rgmanager problems might result from this also, but I don't
know why.  I haven't found any explanation in the logs.

Comment 6 Mark Hlawatschek 2007-07-16 19:21:15 UTC
We normally see rgmanager lock requests time out first. Then the system
gets very slow until it becomes inoperable.

For me the "iopen" glock issue raises some questions:
- What number of locks do you consider critical, i.e. liable to cause a system
to hang or rgmanager lock requests to time out?
- Is the number of locks per locking table important, or the total number of
locks?
- How big would the performance hit caused by the glock_purge parameter be?

Comment 7 David Teigland 2007-07-16 19:58:27 UTC
I don't know how many iopen glocks create a critical problem, but normally the
amount of caching gfs does is moderated by the kernel's normal mechanisms for
purging its cache.  The backup scenario is special because the system is not
experiencing memory shortages, so there's nothing telling gfs to reduce its
glock caching, even though the excessive glock caching is causing gfs to be
*really* slow.

A "normal" number of iopen locks would equate to the number of files/locks the
system generally uses without doing a backup.

If you can't eliminate the backup, then you want to reverse the effects of the
backup (the excessive number of glocks) -- a simple way to do this is to just
unmount;mount the fs after the backup is done to clear the unwanted glock cache.
If you can't unmount/mount, then you could enable the new glock_purge tunable
before the backup, and disable it after backup is done (or once the number of
glocks falls back to the same level as before backup).  This way, the special
glock purging only affects the backup and not normal system operation.

There's no way of saying what the performance hit of glock_purge would be during
normal operation.  I'd guess it would be minimal in most cases.

We're concerned with an excessive number of locks in a single fs / locking table.


Comment 8 Mark Hlawatschek 2007-07-16 20:20:18 UTC
How does the rgmanager lock request timeout fit in here? As far as I know,
Magma uses a lock table of its own, which shouldn't be affected by another
lock table.
The servers we are using have 64GB and 128GB of RAM; we normally don't run into
memory problems. Does that mean that in "normal" operation (no backup), too,
nothing is telling GFS to reduce its glock caching, and that this is why we see
these huge numbers of locks and "slow" performance?
Do you think that creating a massive number of files (>10,000,000) and issuing
a statfs call against them could eventually reproduce this problem in our case?

Comment 9 David Teigland 2007-07-16 20:44:38 UTC
I don't know anything about the rgmanager problem. Is there a specific error
message in the logs that I should look for and pass on to the rgmanager person?

The large amount of memory means there won't be any memory pressure on the 
system which means gfs will never be told to reduce its cache, which means it 
caches all the glocks it took during the backup, which slows it down.  Caching 
helps if the locks/data will be reused soon, but in the case of backup we 
*don't* want to cache the locks/data because most of it won't be reused at all
and all the unused cached locks just slow down normal access using the normal 
locks.

Creating/statting 10M files might not cause the same problems as the backup,
which just scans the files and doesn't create any.


Comment 10 Mark Hlawatschek 2007-07-16 21:02:29 UTC
Rgmanager complains about "Unable to obtain cluster lock: Connection timed
out".

Note that the problem occurs on two cluster members, and that dlm_recvd is in
state "D<" on these nodes.

I can provide the output of a sysrq-t for every node if you need it. Note
that I triggered 3 in a row for lilr623c.

Here's the syslog:
Jul  8 08:20:24 lilr623b clurgmgrd[22893]: <err> #48: Unable to obtain cluster 
lock: Connection timed out
Jul  8 08:20:25 lilr623b kernel: dlm: Magma: reply from 5 no lock
Jul  8 08:20:25 lilr623b kernel: dlm: reply
Jul  8 08:20:25 lilr623b kernel: rh_cmd 5
Jul  8 08:20:25 lilr623b kernel: rh_lkid 830374
Jul  8 08:20:25 lilr623b kernel: lockstate 0
Jul  8 08:20:25 lilr623b kernel: nodeid 2
Jul  8 08:20:25 lilr623b kernel: status 0
Jul  8 08:20:25 lilr623b kernel: lkid 0

Jul  8 08:19:56 lilr623e clurgmgrd[20742]: <err> #48: Unable to obtain cluster 
lock: Connection timed out
Jul  8 08:20:25 lilr623e kernel: dlm: Magma: reply from 2 no lock
Jul  8 08:20:25 lilr623e kernel: dlm: reply
Jul  8 08:20:25 lilr623e kernel: rh_cmd 5
Jul  8 08:20:25 lilr623e kernel: rh_lkid 9601b6
Jul  8 08:20:25 lilr623e kernel: lockstate 0
Jul  8 08:20:25 lilr623e kernel: nodeid 5
Jul  8 08:20:25 lilr623e kernel: status 0
Jul  8 08:20:25 lilr623e kernel: lkid 673b0245


Comment 11 David Teigland 2007-07-16 21:38:01 UTC
Those messages don't seem to be in the logs I'm looking at.  I'd be interested
to see if there's any information before those dlm:Magma errors that could
indicate something going wrong (I'd look for something at that same time in
general on all the nodes.)



Comment 12 Mark Hlawatschek 2007-07-16 21:50:39 UTC
I attached a complete syslog of this day (08.Jul 07) from all nodes to the SR 
1669002. Please let me know if you have problems getting these.

Comment 15 Mark Hlawatschek 2007-07-17 13:31:51 UTC
I did some more analysis on the lock information we got from the filesystem.
As you can see, we have about 50% "iopen" glocks and 50% "inode" glocks.
Shouldn't at least the inode locks be removed by gfs_scand?

Have a look at the following summary:
-nodeA
Glocks 958654
iopen Glocks 479125 (5)
inode Glocks 479125 (2)
-nodeB
Glocks 1345381
iopen Glocks 672488
inode Glocks 672489
-nodeC
Glocks 1344508
iopen Glocks 672052
inode Glocks 672052
-nodeD
Glocks 1345222
iopen Glocks 672409
inode Glocks 672409
-nodeE
Glocks 785
iopen Glocks 190
inode Glocks 190
-nodeF
Glocks 685
iopen Glocks 140
inode Glocks 140
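
The per-type counts above can be derived from the lockdump files collected earlier; a rough sketch, assuming gfs_tool lockdump prints lines of the form "Glock (TYPE, NUMBER)" with type 2 = inode and type 5 = iopen:

  # hypothetical counting over the lockdump written by the collection script
  dump=$(hostname)_lockdump_vg_vg_P06user_lv_P06user
  echo "total glocks: $(grep -c 'Glock ('   $dump)"
  echo "inode glocks: $(grep -c 'Glock (2,' $dump)"
  echo "iopen glocks: $(grep -c 'Glock (5,' $dump)"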


Comment 16 David Teigland 2007-07-17 13:52:17 UTC
The messages from the node with the dlm:Magma error may not be sufficient to find
the problem.  It's just as likely that a problem on another node caused the
problem which propagated back and was reported.  We want to see /var/log/messages
from all the nodes around the time that the error occurred.  Then we can try to
correlate the dlm:Magma error message to something happening on a different node
at the same time.  It could be that we'll find nothing on the other nodes to
help, but we should check.


Comment 17 Mark Hlawatschek 2007-07-17 14:35:04 UTC
The messages file from the SR contains all syslog information from all nodes.
I just checked the syslog. There are no suspicious messages at the time when 
the dlm:Magma error messages occur. 

Comment 18 David Teigland 2007-07-17 14:46:27 UTC
Ignore comment 16; I didn't realize that the one file contained the messages
from all the nodes.


Comment 19 Lon Hohberger 2007-07-17 16:09:39 UTC
Upgrade to rgmanager & ccs from RHCS 4.5.


Comment 20 David Teigland 2007-07-17 16:20:06 UTC
Lon thinks the "dlm:Magma no lock" error is likely to be fixed in 4.5.

re comment 15, you're looking at the gfs lockdump there.  The inode glocks
have been unlocked in the dlm and the glock structs still exist in gfs.
The iopen glocks have not been unlocked in the dlm, so nearly all of the dlm
locks from gfs are iopen locks.  The glock_purge tunable will force the
iopen glocks to be unlocked (in the dlm), which will then cause both iopen
and inode glocks in gfs to be dropped.


Comment 21 David Teigland 2007-07-17 16:23:02 UTC
See bz 208968 for another example of the "Unable to obtain cluster lock"
problems that have been fixed.

Comment 22 Mark Hlawatschek 2007-07-17 16:52:09 UTC
One more question: when we hit the memory leak you mention in 208968, will this
also affect GFS in any way? We normally see the rgmanager and GFS problems
correlated in time.

Comment 23 Lon Hohberger 2007-07-17 16:56:08 UTC
In talking with Dave and Marco, you're using one of the rgmanager hotfix
releases which addresses the lock leak.

Ok - with some more digging, #208968 could have been only symptomatically fixed
due to fixing the rgmanager lock leak in RHCS 4 update 4.  That is, because the
original submitter was not using lots of locks elsewhere in the system, the
symptoms went away when the lock leak in rgmanager was fixed.  According to
Marco, you are running a patched rgmanager which eliminates the lock leak problem.

Now, if the actual cause was not due to leaked locks, but instead due to simply
a large number of locks in general, there's a chance that #208968 wasn't fixed
in 4.5.

However, I still strongly recommend upgrading at least rgmanager / magma /
magma-plugins / ccs to 4.5 in order for us to better assist you.

Comment 24 Lon Hohberger 2007-07-17 17:15:34 UTC
The lock leak would have caused problems with GFS.  However, due to the package
you're running, I really doubt you're hitting the lock leak.

Comment 28 Mark Hlawatschek 2007-07-17 17:39:37 UTC
I fully understand that you want to see the system updated to 4.5. But as
this is a production system, the update to U5 has to pass all internal QA
requirements, and as far as I know there are still some issues with U5 that
have to be solved first.
As far as I know, we cannot just update ccs to U5 without updating the kernel,
as ccsd in U5 requires something like /dev/misc/dlm-control.
Also, the customer wants to see a detailed description of what the problem is
and how it is or will be solved.
Therefore I need to understand how the rgmanager problem ("Unable to obtain
cluster lock"), the GFS problem (huge number of locks, slow performance) and
the dlm issue (dlm_recvd in state "D<") fit together. From the previous
description in this bz I understand that huge lock numbers in a GFS locking
table should have no effect on Magma locking tables.

Comment 30 David Teigland 2007-07-17 19:13:32 UTC
We've discovered that the timeout feature of the dlm is enabled for the
rgmanager lockspace, which we didn't realize, and which we don't want.
That explains the rgmanager/magma problems.  We'll be doing a patch for that.

The huge number of gfs locks and slow performance is due to the huge number
of unwanted iopen locks (which probably came from a backup that scanned the
fs).  The fix for this already exists via the glock_purge tunable in 4.5.

Looking through all the data, it's not clear what dlm_recvd is waiting for
in process_cluster_request().  I think it's unlikely to persist after the
other problems have been addressed.


Comment 31 Mark Hlawatschek 2007-07-17 21:22:40 UTC
Thanks for the explanations. 
I'd like to come back to my comment #15. We have 
set /proc/cluster/lock_dlm/drop_count to 50k. Why haven't the inode glocks not 
been dropped. Are they directly related to the iopen glocks ? I.e. do the 
iopen glocks depend on the inode glocks ?  

Comment 32 David Teigland 2007-07-17 21:39:34 UTC
The inode glocks are in fact mostly unlocked, and the iopen glocks are not
affected by the drop_locks callback.  The glock_purge tunable unlocks the
iopen locks, which removes a series of reference counts between the different
glocks and vfs inodes and allows them to be removed.



Comment 33 Mark Hlawatschek 2007-07-17 21:50:01 UTC
Thank you so much for your answers. When can we expect the patch for the 
rgmanager/magma timeout issue and with it a hotfix or u5.x update?

Comment 34 David Teigland 2007-07-18 14:40:02 UTC
We've decided the best way to resolve the rgmanager/magma timeout problem
is to just increase /proc/cluster/config/dlm/lock_timeout to 300 (5 min)
instead of the default of 30 sec.  So, just add this to the init/startup
scripts right after loading the dlm kernel module:

  echo "300" >> /proc/cluster/config/dlm/lock_timeout

We think that this will solve the problem without any code changes.


Fixing the excessive number of iopen locks (and the slowdown this causes)
will require the glock_purge change.  Or, if possible, you can get rid of
these excess locks by just unmounting and remounting after the backup
process has run (assuming that it's a backup scanning the fs that creates
all the unwanted iopen locks in the first place).
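
A minimal sketch of persisting that setting from a local startup script, assuming the dlm kernel module has already been loaded by the cluster init scripts by the time it runs:

  # e.g. appended to /etc/rc.d/rc.local (hypothetical placement)
  if [ -f /proc/cluster/config/dlm/lock_timeout ]; then
      echo 300 > /proc/cluster/config/dlm/lock_timeout
  fi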


Comment 35 Lon Hohberger 2007-12-19 16:16:18 UTC
Added to Wiki here:

http://sourceware.org/cluster/wiki/FAQ/LockManager#dlm_timeout

Comment 37 David Teigland 2009-01-20 20:35:30 UTC
Chrissie, do you have any opinion on changing the default timeouts?
Or should we just close this won't-fix given the work around?

Comment 38 Christine Caulfield 2009-01-21 08:18:43 UTC
As it seems to be causing a problem, then I'm happy to increase the default timeout. It'll reduce the load on GSS at least!

I think 5 minutes is probably too much, but 1 or 2 would be reasonable. The original value of 30 seconds was meant to be just a little over the usual transition time.

Comment 39 Christine Caulfield 2009-01-21 10:20:02 UTC
commit 0562c3607d70ee1e2a35c0e4b2c7c17bf26cface
Author: Christine Caulfield <ccaulfie>
Date:   Wed Jan 21 10:10:34 2009 +0000

    dlm: Increase the lock_timeout

Comment 42 errata-xmlrpc 2009-05-18 21:17:23 UTC
An advisory has been issued which should help the problem
described in this bug report. This report is therefore being
closed with a resolution of ERRATA. For more information
on the solution and/or where to find the updated files,
please follow the link below. You may reopen this bug report
if the solution does not work for you.

http://rhn.redhat.com/errata/RHBA-2009-1051.html

