Description of problem:
The problem has multiple symptoms:
- rgmanager cannot acquire a dlm lock, causing rgmanager to hang. This can only be resolved by restarting the whole cluster.
- GFS access is very slow (gfs_tool counters shows outstanding LM calls)
- dlm_recvd is in state "D<":
   1950 ?  D<  46:00  \_ [dlm_recvd]

Version-Release number of selected component (if applicable):
RHEL4.4, RHCS4.4, GFS

How reproducible:
No idea, it just happens. I believe under heavy load.

Steps to Reproduce:
1.
2.
3.

Actual results:

Expected results:

Additional info:
I already opened a service request (1669002) for this issue. There you can find the following additional information (from each node):
* kernel trace
* output from the following collection script:

  mountpoint=/cluster/mount/vg_P06user/lv_P06user
  name=vg_P06user_lv_P06user

  cd /root
  if [ ! -d "analysis-20070710" ]; then
      mkdir analysis-20070710
  fi
  cd analysis-20070710

  gfs_tool lockdump $mountpoint > $(hostname)_lockdump_vg_$name
  gfs_tool counters $mountpoint > $(hostname)_counters_$name

  tar cvzf ./$(hostname)-proc.tar.gz /proc/[0-9][0-9]* /proc/cluster
  find /proc -name [0-9][0-9]* -maxdepth 1 -exec ls -lhtr {}/fd \; > $(hostname)_fds.log
  ps axfwwwwww > $(hostname)_ps.log
  ps ax -o pid,stat,cmd,wchan > $(hostname)_ps2.log
  pstree > $(hostname)_pstree.log

  echo "Magma" > /proc/cluster/dlm_locks
  cat /proc/cluster/dlm_locks > $(hostname)_dlm_locks_Magma1

  for service in $(cat /proc/cluster/services | grep "DLM Lock Space" | awk -F '\"' '{print $2}'); do
      echo "Getting locks for $service"
      echo $service > /proc/cluster/dlm_locks
      cat /proc/cluster/dlm_locks > $(hostname)_dlm_locks_$service
  done
Another symptom of this problem is that there is a massive number of locks. gfs_tool counters shows:

  locks 1345381
  locks held 673475
  incore inodes 672488
  metadata buffers 144
  unlinked inodes 5
  quota IDs 0
  incore log buffers 0
  log space used 0.49%
  meta header cache entries 0
  glock dependencies 0
  glocks on reclaim list 0
  log wraps 88
  outstanding LM calls 0
  outstanding BIO calls 1
  fh2dentry misses 0
  glocks reclaimed 24168949
  glock nq calls 1724612418
  glock dq calls 1710695537
  glock prefetch calls 10443847
  lm_lock calls 43425976
  lm_unlock calls 29437647
  lm callbacks 73113018
  address operations 34289469
  dentry operations 679689195
  export operations 0
  file operations 353819582
  inode operations 767228946
  super operations 134338234
  vm operations 13349812
  block I/O reads 31218719

Note that /proc/cluster/dlm/drop_count is at 50000.
This many locks could definitely be causing problems. The massive number of locks are almost entirely "iopen" locks, which are never dropped. GFS gets an iopen lock for every file it stats. A node usually accumulates this huge number of iopen locks by scanning the entire fs (via a backup, find, updatedb, etc).

A new gfs patch exists to purge iopen locks; it has gone into the latest 4.5 release. The patch adds a new tunable, "glock_purge", which is the percentage of iopen locks that gfs should regularly try to drop, e.g. glock_purge=50 means that a gfs daemon will regularly attempt to drop half of the iopen locks it holds.

The glock_purge tunable is clumsy to use because if it's turned on unnecessarily, it will be of no help and will actually harm performance. So it should only be enabled while the number of iopen glocks is excessively high; otherwise it should be disabled (by setting it to 0).

--

What was the very first problem to occur on the cluster? Can we point to a specific node at a specific time? Things will appear in the collected debugging info that are simply a result of the initial problem triggering other problems. We need to identify the first problem that occurred, then try to find a cause for that.
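For illustration, a minimal sketch of how the tunable could be toggled at runtime, assuming it is exposed through gfs_tool settune like the other GFS tunables (the mount point below is the one from the collection script in the description):

  # tell gfs to regularly try to drop ~50% of its cached iopen glocks
  gfs_tool settune /cluster/mount/vg_P06user/lv_P06user glock_purge 50

  # turn purging off again once the glock count is back to normal
  gfs_tool settune /cluster/mount/vg_P06user/lv_P06user glock_purge 0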
It looks like the backup (NetBackup is used) and, in former times, cron jobs like prelink and slocate caused these troubles. Although we have now disabled the backup as well, we still see, for the relevant filesystems, gfs_scand consuming a lot of CPU time (e.g. with an uptime of 40h, gfs_scand gets 4h of CPU time).

As we are still using RHEL4U4 and we are talking about production systems, we cannot easily update to new versions. A RHEL4U5 update is planned, but we'll have to wait until RHEL4U5 can be used.

BTW: Do you think the huge number of locks, the CPU usage of gfs_scand and the freeze of the cluster are correlated? Or are we talking about two different things? And why do we have those orphaned locks and rgmanager not working properly? Do you have any explanation for this?
The large gfs_scand time is caused by the large number of locks. The freeze and rgmanager problems might result from this also, but I don't know why. I haven't found any explanation in the logs.
We normally see rgmanager lock requests time out first. Then the system gets very slow until it becomes inoperable.

For me, the "iopen" glock issue raises some questions:
- What number of locks do you assume to be critical, i.e. causing a system to hang or rgmanager lock requests to time out?
- Is the number of locks per locking table important, or the total number of locks?
- How big would the performance hit caused by the glock_purge parameter be?
I don't know how many iopen glocks create a critical problem, but normally the amount of caching gfs does is moderated by the kernel's normal mechanisms for purging its cache. The backup scenario is special because the system is not experiencing memory shortages, so there's nothing telling gfs to reduce its glock caching, even though the excessive glock caching is making gfs *really* slow. A "normal" number of iopen locks would equate to the number of files/locks the system generally uses without doing a backup.

If you can't eliminate the backup, then you want to reverse the effects of the backup (the excessive number of glocks). A simple way to do this is to just unmount and remount the fs after the backup is done to clear the unwanted glock cache. If you can't unmount/remount, then you could enable the new glock_purge tunable before the backup and disable it after the backup is done (or once the number of glocks falls back to the same level as before the backup). This way, the special glock purging only affects the backup and not normal system operation.

There's no way of saying what the performance hit of glock_purge would be during normal operation. I'd guess it would be minimal in most cases.

We're concerned with an excessive number of locks in a single fs / locking table.
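A minimal sketch of the backup-window approach, assuming glock_purge is set via gfs_tool settune as above; the backup command is a placeholder and the mount point is taken from this report, not a prescribed implementation:

  #!/bin/bash
  # hypothetical wrapper around the nightly backup job
  mountpoint=/cluster/mount/vg_P06user/lv_P06user

  # start purging cached iopen glocks for the duration of the backup
  gfs_tool settune $mountpoint glock_purge 50

  run_netbackup_job        # placeholder for the real NetBackup invocation

  # once the glock count has fallen back to its pre-backup level,
  # disable purging again so normal operation is unaffected
  gfs_tool settune $mountpoint glock_purge 0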
How does the rgmanager lock request timeout fit in here? As far as I know, Magma uses a lock table of its own, so it shouldn't be affected by another lock table.

The servers we are using have 64GB and 128GB of RAM; we normally don't run into memory problems. Does that mean that in "normal" operation (no backup) there is also nothing telling GFS to reduce its glock caching, and that is why we see these huge numbers of locks and "slow" performance?

Do you think that the creation of a massive number of files (>10,000,000) and a statfs call on these files could possibly reproduce this problem in our case?
I don't know anything about the rgmanager problem; is there a specific error message in the logs that I should look for and pass on to the rgmanager person?

The large amount of memory means there won't be any memory pressure on the system, which means gfs will never be told to reduce its cache, which means it caches all the glocks it took during the backup, which slows it down. Caching helps if the locks/data will be reused soon, but in the case of backup we *don't* want to cache the locks/data, because most of it won't be reused at all and all the unused cached locks just slow down normal access using the normal locks.

Creating/stat'ing 10M files might not cause the same problems as the backup, which just scans the files and doesn't create any.
Rgmanager complains about "Unable to obtain cluster lock: Connection timed out".

Note that the problem occurs on two cluster members. Note also that dlm_recvd is in state "D<" on these nodes. I can provide the output of a sysrq-t for every node if you need it. Note that I triggered 3 in a row for lilr623c.

Here's the syslog:

Jul  8 08:20:24 lilr623b clurgmgrd[22893]: <err> #48: Unable to obtain cluster lock: Connection timed out
Jul  8 08:20:25 lilr623b kernel: dlm: Magma: reply from 5 no lock
Jul  8 08:20:25 lilr623b kernel: dlm: reply
Jul  8 08:20:25 lilr623b kernel: rh_cmd 5
Jul  8 08:20:25 lilr623b kernel: rh_lkid 830374
Jul  8 08:20:25 lilr623b kernel: lockstate 0
Jul  8 08:20:25 lilr623b kernel: nodeid 2
Jul  8 08:20:25 lilr623b kernel: status 0
Jul  8 08:20:25 lilr623b kernel: lkid 0

Jul  8 08:19:56 lilr623e clurgmgrd[20742]: <err> #48: Unable to obtain cluster lock: Connection timed out
Jul  8 08:20:25 lilr623e kernel: dlm: Magma: reply from 2 no lock
Jul  8 08:20:25 lilr623e kernel: dlm: reply
Jul  8 08:20:25 lilr623e kernel: rh_cmd 5
Jul  8 08:20:25 lilr623e kernel: rh_lkid 9601b6
Jul  8 08:20:25 lilr623e kernel: lockstate 0
Jul  8 08:20:25 lilr623e kernel: nodeid 5
Jul  8 08:20:25 lilr623e kernel: status 0
Jul  8 08:20:25 lilr623e kernel: lkid 673b0245
Those messages don't seem to be in the logs I'm looking at. I'd be interested to see if there's any information before those dlm:Magma errors that could indicate something going wrong (I'd look for something at that same time in general on all the nodes.)
I attached a complete syslog of this day (08.Jul 07) from all nodes to the SR 1669002. Please let me know if you have problems getting these.
I did some more analysis on the lock information we got from the filesystem. As you can see, we have about 50% "iopen" glocks and 50% "inode" glocks. Shouldn't at least the inode locks be removed by gfs_scand?

Have a look at the following summary:

- nodeA: Glocks 958654, iopen glocks 479125 (type 5), inode glocks 479125 (type 2)
- nodeB: Glocks 1345381, iopen glocks 672488, inode glocks 672489
- nodeC: Glocks 1344508, iopen glocks 672052, inode glocks 672052
- nodeD: Glocks 1345222, iopen glocks 672409, inode glocks 672409
- nodeE: Glocks 785, iopen glocks 190, inode glocks 190
- nodeF: Glocks 685, iopen glocks 140, inode glocks 140
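For reference, a hedged example of how such per-type counts can be pulled out of the gfs_tool lockdump files gathered by the collection script, assuming the GFS1 dump format where each glock header line reads "Glock (<type>, <number>)" with type 2 = inode and type 5 = iopen:

  dump=$(hostname)_lockdump_vg_vg_P06user_lv_P06user   # file written by the collection script
  grep -c "^Glock"   $dump    # total glocks
  grep -c "Glock (2" $dump    # inode glocks
  grep -c "Glock (5" $dump    # iopen glocks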
The messages from the node with the dlm:Magma error may not be sufficient to find the problem. It's just as likely that a problem on another node caused the problem, which propagated back and was reported. We want to see /var/log/messages from all the nodes around the time that the error occurred. Then we can try to correlate the dlm:Magma error message to something happening on a different node at the same time. It could be that we'll find nothing on the other nodes to help, but we should check.
The messages file from the SR contains all syslog information from all nodes. I just checked the syslog. There are no suspicious messages at the time when the dlm:Magma error messages occur.
Ignore comment 16, I didn't realize that the one file was messages from all the nodes.
Upgrade to rgmanager & ccs from RHCS 4.5.
Lon thinks the "dlm:Magma no lock" error is likely to be fixed in 4.5.

re comment 15, you're looking at the gfs lockdump there. The inode glocks have been unlocked in the dlm and the glock structs still exist in gfs. The iopen glocks have not been unlocked in the dlm, so nearly all of the dlm locks from gfs are iopen locks. The glock_purge tunable will force the iopen glocks to be unlocked (in the dlm), which will then cause both iopen and inode glocks in gfs to be dropped.
See bz 208968 for another example of the "Unable to obtain cluster lock" problems that have been fixed.
One more question: when we hit the memory leak you mention in 208968, will this also affect GFS in any way? We normally see the rgmanager and GFS problems correlated in time.
In talking with Dave and Marco, you're using one of the rgmanager hotfix releases which addresses the lock leak.

Ok - with some more digging, #208968 could have been only symptomatically fixed due to fixing the rgmanager lock leak in RHCS 4 update 4. That is, because the original submitter was not using lots of locks elsewhere in the system, the symptoms went away when the lock leak in rgmanager was fixed. According to Marco, you are running a patched rgmanager which eliminates the lock leak problem. Now, if the actual cause was not due to leaked locks, but instead due to simply a large number of locks in general, there's a chance that #208968 wasn't fixed in 4.5.

However, I still strongly recommend upgrading at least rgmanager / magma / magma-plugins / ccs to 4.5 in order for us to better assist you.
The lock leak would have caused problems with GFS. However, due to the package you're running, I really doubt you're hitting the lock leak.
I really understand that you want to see the system updated to 4.5. But as this is a production system, the update to U5 has to pass all internal QA requirements. As far as I know, there are still some issues with U5 that have to be solved first. As far as I know, we also cannot just update ccs to U5 without updating the kernel, as ccsd in U5 requires something like /dev/misc/dlm-control.

Also, the customer wants to see a detailed description of what the problem is and how it is or will be solved. Therefore I need to understand how the rgmanager problem ("Unable to obtain cluster lock"), the GFS problem (huge number of locks, slow performance) and the dlm issue (dlm_recvd in state "D<") fit together. From the previous descriptions in this bz I understand that huge lock numbers in a GFS locking table should have no effect on Magma locking tables.
We've discovered that the timeout feature of the dlm is enabled for the rgmanager lockspace, which we didn't realize and which we don't want. That explains the rgmanager/magma problems. We'll be doing a patch for that.

The huge number of gfs locks and the slow performance are due to the huge number of unwanted iopen locks (which probably came from a backup that scanned the fs). The fix for this already exists via the glock_purge tunable in 4.5.

Looking through all the data, it's not clear what dlm_recvd is waiting for in process_cluster_request(). I think it's unlikely to persist after the other problems have been addressed.
Thanks for the explanations. I'd like to come back to my comment #15. We have set /proc/cluster/lock_dlm/drop_count to 50k. Why haven't the inode glocks been dropped? Are they directly related to the iopen glocks, i.e. do the iopen glocks depend on the inode glocks?
The inode glocks are in fact mostly unlocked, and the iopen glocks are not affected by the drop_locks callback. The glock_purge tunable unlocks the iopen locks, which removes a series of reference counts between the different glocks and vfs inodes and allows them to be removed.
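If it helps to confirm this on a live node, here is a hedged example that reuses the /proc/cluster/dlm_locks interface from the collection script in the description to get a rough per-lockspace DLM lock count (the line count of the dump is only an approximation, since the dump format interleaves resources and lock records):

  for ls in $(grep "DLM Lock Space" /proc/cluster/services | awk -F '"' '{print $2}'); do
      echo "$ls" > /proc/cluster/dlm_locks
      echo "$ls: $(wc -l < /proc/cluster/dlm_locks) lines in lock dump"
  done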
Thank you so much for your answers. When can we expect the patch for the rgmanager/magma timeout issue and with it a hotfix or u5.x update?
We've decided the best way to resolve the rgmanager/magma timeout problem is to just increase /proc/cluster/config/dlm/lock_timeout to 300 (5 min) instead of the default of 30 sec. So, just add this to the init/startup scripts right after loading the dlm kernel module:

  echo "300" >> /proc/cluster/config/dlm/lock_timeout

We think that this will solve the problem without any code changes.

Fixing the excessive number of iopen locks (and the slowdown this causes) will require the glock_purge change. Or, if possible, you can get rid of these excess locks by just unmounting and remounting after the backup process has run (assuming that it's a backup scanning the fs that creates all the unwanted iopen locks in the first place).
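A minimal sketch of how the workaround might be made persistent, assuming a site-specific startup hook such as /etc/rc.d/rc.local or a local cluster init script that runs after the dlm module is loaded (the guard only avoids errors on nodes where the dlm isn't loaded yet):

  # example startup fragment; path and placement are assumptions
  if [ -f /proc/cluster/config/dlm/lock_timeout ]; then
      echo "300" > /proc/cluster/config/dlm/lock_timeout
  fi
  cat /proc/cluster/config/dlm/lock_timeout   # verify the new value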
Added to Wiki here: http://sourceware.org/cluster/wiki/FAQ/LockManager#dlm_timeout
Chrissie, do you have any opinion on changing the default timeouts? Or should we just close this won't-fix given the workaround?
As it seems to be causing a problem, then I'm happy to increase the default timeout. It'll reduce the load on GSS at least! I think 5 minutes is probably too much, but 1 or 2 would be reasonable. The original value of 30 seconds was meant to be just a little over the usual transition time.
commit 0562c3607d70ee1e2a35c0e4b2c7c17bf26cface
Author: Christine Caulfield <ccaulfie>
Date:   Wed Jan 21 10:10:34 2009 +0000

    dlm: Increase the lock_timeout
An advisory has been issued which should help the problem described in this bug report. This report is therefore being closed with a resolution of ERRATA. For more information on the solution and/or where to find the updated files, please follow the link below. You may reopen this bug report if the solution does not work for you.

http://rhn.redhat.com/errata/RHBA-2009-1051.html