Description of problem: gfs_statfs_sync is not called when statfs_fast is set to 1. The output from 'df' gets badly out of sync (sometimes off by terabytes).

Version-Release number of selected component (if applicable):
kernel-smp-2.6.9-78.0.13.EL
GFS-kernel-smp-2.6.9-80.9.el4_7.10

How reproducible: 100% on an affected node.

Steps to Reproduce: Run this SystemTap script and modify a file (e.g. dd if=/dev/zero of=/gfs/myfile bs=1M count=1) while it runs.

probe module("gfs").function("gfs_statfs_modify") {
    printf("gfs_statfs_modified is called\n")
}
probe module("gfs").function("gfs_statfs_sync") {
    printf("gfs_statfs_sync is called\n")
}

Actual results: Only the gfs_statfs_modify probe fires. gfs_statfs_sync is never called, and it should be.

Expected results: Both gfs_statfs_modify and gfs_statfs_sync should be called under normal filesystem activity.

Additional info:
This would not have been possible without the guidance of Josef Bacik. Also, this symptom does not always occur on our GFS clusters. However, once it does appear it takes a reboot to fix it. Trying 'gfs_tool settune statfs_fast 0 ; gfs_tool settune statfs_fast 1' does not fix it.
More progress. Running 'top' against the gfs_quotad pid shows that it is in the 'D' state (uninterruptible sleep). Apparently gfs_quotad cannot wake up to call gfs_statfs_sync.
I used crash to get a backtrace. Here it is:

[root@node92 ~]# pgrep -l gfs_quotad
8474 gfs_quotad

crash> bt 8474
PID: 8474  TASK: 11806919030  CPU: 3  COMMAND: "gfs_quotad"
 #0 [117fb283c78] schedule at ffffffff8031784c
 #1 [117fb283d50] wait_for_completion at ffffffff80317a90
 #2 [117fb283dd0] glock_wait_internal at ffffffffa02a2936
 #3 [117fb283e10] gfs_glock_nq at ffffffffa02a3174
 #4 [117fb283e50] gfs_glock_nq_init at ffffffffa02a339f
 #5 [117fb283e70] gfs_statfs_sync at ffffffffa02c4c7b
 #6 [117fb283f20] gfs_quotad at ffffffffa0296ac8
 #7 [117fb283f50] kernel_thread at ffffffff80110fd3
Created attachment 333926 [details] gfs_tool lockdump /gfs output
Can I get glock dumps from the other nodes as well? It would seem that some node is holding the glock for the fast_statfs file (which used to be the license file back when gfs was proprietary software). The pertinent information in the glock dump would be preceded with "Glock (2, 25)" and I'd need to see everything that follows, up until the next "Glock" line. Once we've determined which node is holding that license file glock, we should get a sysrq-t output to determine what process on that node is stuck while holding the crucial glock, and what it's doing. But be aware that sysrq-t can cause the node to get fenced, so you may need to adjust your cluster.conf fail time values accordingly.
Created attachment 333947 [details] tarfile with requested 'gfs_tool lockdump /gfs' info from all cluster nodes
Interesting. All the lock dumps for "Glock (2, 25)" are identical except for node92, the node with the 'df' problem. Is that expected?
No, that's not expected. The glock from the previous glock dump was as follows:

Glock (2, 25)
  gl_flags = 1
  gl_count = 6
  gl_state = 1
  req_gh = yes
  req_bh = yes
  lvb_count = 0
  object = yes
  new_le = no
  incore_le = no
  reclaim = no
  aspace = 0
  ail_bufs = no
  Request
    owner = -1
    gh_state = 0
    gh_flags = 0
    error = 0
    gh_iflags = 2 4 5
  Waiter2
    owner = -1
    gh_state = 0
    gh_flags = 0
    error = 0
    gh_iflags = 2 4 5
  Waiter3
    owner = 8474
    gh_state = 1
    gh_flags = 5 10
    error = 0
    gh_iflags = 1
  Inode: busy

which is completely different from all of these. Are you sure the glock dumps are from the failing scenario, or was that from an eleventh node? I would have expected one of the nodes to have a "Holder" entry for (2, 25), holding the glock. For example:

  Holder
    owner = -1
    gh_state = 3
    gh_flags = 5 7
    error = 0
    gh_iflags = 1 6 7

None of the glock dumps show that, including the previous one. So it's unclear what's happening.
Something strange IS going on. I took these samples at exactly the same moment using the cssh tool (typing goes to all 10 nodes at the same time). I will re-run the test and upload it just to be sure. What will you do if the results are the same as before?
Created attachment 334006 [details]
Second dump of 'gfs_tool lockdump /gfs' on all nodes in cluster

Here are the unedited files. Alas, I looked at them and the results seem identical for Glock (2, 25) on all nodes except the offending node92.
Well, ideally, I would recreate the problem on my lab systems. In the problem description you said, "How reproducible: 100% on an affected node." Does that mean it doesn't normally reproduce, and something happens on the cluster to make some of the nodes start behaving this way? If so, what is the cause? How can I recreate the failure here? Unfortunately, my only 4.x cluster is completely tied up with crit-sit bug #455696. If I can get that resolved soon, perhaps I can use the cluster to recreate this one, but that won't be for a few days I'm afraid.
Perhaps you can capture the dlm locks as well? See: http://sources.redhat.com/cluster/wiki/FAQ/LockManager#lock_dump_dlm
Created attachment 334021 [details]
/proc/cluster/dlm_locks for clvmd and gfs0

Here are the dlm locks for both the clvmd and gfs0 groups.
The customer will only use a patched dlm kernel module if it has been tested at Red Hat and expected to achieve 'hotfix' status. These are production machines and ANY reboot is a difficult approval process. They will not approve it unless confidence of a fix is extremely high.
After hours of testing, I've been unable to recreate this problem in our lab. The data from the customer suggests a missing callback from dlm to gfs, although that's not guaranteed either. The bottom line is this: in order to get their "statfs_fast" working again, they need to either reboot the cluster or unmount the affected gfs file system and then remount it. If they need to schedule an outage to do that, I recommend they start that process. In the meantime, they should be able to turn off statfs_fast so that statfs at least reports good numbers again, but of course it will be back to the slow method. We cannot provide any kind of patched dlm or gfs kernel modules until we know what the problem is, and we can't figure that out until we can recreate the problem in our lab. All we can do is speculate. I'm setting the NEEDINFO flag in order to get both (a) a method we can use to recreate the failure, and (b) their /var/log/messages from all nodes (from the time of the failure) as per comment #15.
There is one item that might be significant. These are 16-processor systems (quad-socket, quad-core). The problem may be a race condition that only surfaces on multi-core machines.
I have additional but very small clues. Here are the /var/log/messages entries that I suspect relate to when the 'df' output went bad. Note that the dates match the sequence in which the nodes went bad:

messages.1:Feb 21 19:43:07 node92 kernel: dlm: gfs0: cancel reply ret 0
messages.1:Feb 21 19:43:07 node92 kernel: dlm: gfs0: process_lockqueue_reply id 1803fc state 0
messages.3:Mar 8 08:19:46 node89 kernel: dlm: gfs0: cancel reply ret 0
messages.3:Mar 8 08:19:46 node89 kernel: dlm: gfs0: process_lockqueue_reply id a5f01ea state 0
messages.1:Mar 23 15:49:36 node128 kernel: dlm: gfs0: cancel reply ret 0
messages.1:Mar 23 15:49:36 node128 kernel: dlm: gfs0: process_lockqueue_reply id 225700e2 state 0
Yes, that may be the key. Those cancels are usually caused by a gfs_grow (see bug 487026 for a fix to gfs_grow that prevents the cancels). I think a gfs_grow is very likely to break fast statfs.
In lockqueue.c:process_lockqueue_reply() there is a "switch (state)" that checks these four #define cases (from dlm_internal.h):

/*
 * Lockqueue wait lock states
 */

#define GDLM_LQSTATE_WAIT_RSB       1
#define GDLM_LQSTATE_WAIT_CONVERT   2
#define GDLM_LQSTATE_WAIT_CONDGRANT 3
#define GDLM_LQSTATE_WAIT_UNLOCK    4

However, we know from the /var/log/messages entry that 'state' is none of those, because the switch fell through to the default case:

default:
        log_error(ls, "process_lockqueue_reply id %x state %d",
                  lkb->lkb_id, state);

node128 kernel: dlm: gfs0: process_lockqueue_reply id 225700e2 state 0

Why is 'state' not 1, 2, 3, or 4?
That code path may also be called by gfs_jadd and gfs_quota. Have they done any of those around the time of the problem?
I checked the root shell history log on all nodes just to be sure, but I knew the answer already. We have NOT run gfs_jadd or gfs_quota. The only utility we've used is gfs_tool.
I believe there are only three cases where cancel can happen:

1. Journal recovery. You can check the logs around the cancel messages to see if there was any journal recovery happening at that time. Node failure is known to be a problem with fast statfs.

2. hfile_write - gfs_grow, gfs_jadd, gfs_quota. The patch in the next RHEL4 release will avoid cancels from hfile_write.

3. gfs_lock_fs_check_clean - gfs_tool reclaim, gfs_tool freeze.

I can imagine these causing problems with fast statfs, especially reclaim.
I verified that the places in comment #24 are the only cases where locks can be cancelled by gfs. So either the customer did gfs_grow, gfs_jadd, gfs_quota, gfs_tool reclaim, gfs_tool freeze, or else the file system recovery occurred around the same time as those cancel messages. (As Dave said, look in /var/log/messages for that). Either that or there was memory corruption in dlm or gfs. Corruption in gfs is less likely because it formulates the lock message right before sending it off.
Whatever is causing the cancel messages is probably also causing the fast-statfs problems; both are symptoms of something unknown. Turning off fast-statfs will definitely prevent statfs from becoming out of sync.
Created attachment 339929 [details]
Faster "regular" statfs

Since we've been seeing so many problems lately with statfs_fast, I've been investigating how to improve statfs performance. This simple patch improves the normal "slow" statfs by 30% in my testing. I couldn't find a statfs benchmark, so I compared the same hardware and the same test: running d_io on three nodes with the same random seed and doing a df every 30 seconds, for 20 iterations. The results are as follows.

Average time per df using the described test:
Old method: 3.404553 seconds
New method: 2.403117 seconds (with this patch)

The idea behind the patch is simple. With the current algorithm, if all the slots are filled with glocks that are waiting to be locked, it spins in a nearly-tight loop (with a yield) for millions of iterations while the locks complete. That robs CPU time from the processes that need to actually process the lock requests. With this new patch, if all the slots are full of glocks that are waiting, it waits for the first glock in the list of outstanding glocks to complete before iterating again. I might be able to make it faster by having it wait for the next glock that completes, using a centralized wait queue. A minimal sketch contrasting the two waiting strategies appears below.

This afternoon a few of us got our heads together and brainstormed about new ways to implement a better statfs_fast as well, so I'm going to pursue some of those ideas too.
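To make the difference concrete, here is a minimal, self-contained user-space sketch of the two strategies. It is not the GFS code itself; the slot count, the worker thread, and the function names are hypothetical stand-ins. It only illustrates spin-with-yield versus blocking on a completion signal. Build with gcc -pthread.

#include <pthread.h>
#include <sched.h>
#include <stdio.h>
#include <unistd.h>

#define NSLOTS 4                /* outstanding "glock" requests allowed at once */
#define NREQS  16               /* total requests to issue                      */

static pthread_mutex_t mtx = PTHREAD_MUTEX_INITIALIZER;
static pthread_cond_t  slot_freed = PTHREAD_COND_INITIALIZER;
static int in_flight;           /* requests currently outstanding               */
static int completed;           /* requests finished by the "granting" side     */

/* Stand-in for the lock-granting side: completes one outstanding request
 * every 10 ms and signals anyone waiting for a free slot. */
static void *worker(void *arg)
{
    (void)arg;
    for (;;) {
        usleep(10000);
        pthread_mutex_lock(&mtx);
        if (completed == NREQS) {
            pthread_mutex_unlock(&mtx);
            return NULL;
        }
        if (in_flight > 0) {
            in_flight--;
            completed++;
            pthread_cond_signal(&slot_freed);
        }
        pthread_mutex_unlock(&mtx);
    }
}

/* Old approach: poll for a free slot in a near-tight loop with a yield.
 * This burns CPU that the granting side needs. */
static void wait_for_slot_spin(void)
{
    for (;;) {
        pthread_mutex_lock(&mtx);
        if (in_flight < NSLOTS) {
            in_flight++;
            pthread_mutex_unlock(&mtx);
            return;
        }
        pthread_mutex_unlock(&mtx);
        sched_yield();
    }
}

/* New approach: when all slots are busy, sleep until a completion is
 * signalled instead of spinning. */
static void wait_for_slot_block(void)
{
    pthread_mutex_lock(&mtx);
    while (in_flight >= NSLOTS)
        pthread_cond_wait(&slot_freed, &mtx);
    in_flight++;
    pthread_mutex_unlock(&mtx);
}

int main(void)
{
    pthread_t tid;
    pthread_create(&tid, NULL, worker, NULL);

    for (int i = 0; i < NREQS; i++) {
        if (i % 2)
            wait_for_slot_block();      /* alternate strategies for the demo */
        else
            wait_for_slot_spin();
    }

    pthread_join(tid, NULL);
    printf("issued %d requests, %d completed\n", NREQS, completed);
    return 0;
}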
FYI, I tried using a centralized wait_queue and it only slowed things down, worse than the original algorithm.
Created attachment 341846 [details]
Preliminary patch

The previous patch suffered from hangs, I suspect because the glock being waited on was sometimes out of sorted order. This patch approaches "statfs_fast" from a different angle.

First, I ripped out the old code that tried to keep a single statfs file up to date with periodic synchronization, due to all the hangs and problems we've seen.

Second, I ripped out all the code associated with lvb data on a per-rgrp basis and used glock promotes/demotes to keep the lvb up to date. This may have a significant performance benefit.

Third, I replaced the temporary gfs_stat_gfs structures with a centralized "gfs_stat_gfs" associated with the mounted file system. This is kept up to date whenever blocks are allocated, freed, or changed from data to metadata and such. I introduced a centralized function, gfs_rgd_modify, that takes care of adjusting the rgd values and the centralized statfs values, adding the rgd buffer to the current transaction, and writing the rgrp out to the buffer. This means rgrp updates are now done consistently. I also believe this fixed a bug in the previous statfs code whereby blocks allocated as metadata chunks from free space were not accounted for properly.

Fourth, whenever a rgrp is read in, the centralized statfs info is adjusted accordingly by subtracting out the previous values and adding in the new values.

Fifth, whenever glocks for rgrps are locked, if the glock version number has changed, the new rgrp info is now read in. It is no longer based on the GL_SKIP bit, which can potentially get us into the trouble I described here: https://bugzilla.redhat.com/show_bug.cgi?id=471141#c7

Sixth, when statfs_fast is set, it just sends the statfs code through a different path in which "try" locks are used for the rgrps. Each glock is tried a certain number of times, as specified by the existing rgrp_try_threshold tunable. If the glock is acquired, any locks taken from other nodes cause the "go_lock" code to adjust the global statfs info. Any locks that can't be acquired after the given number of tries are ignored, and it's assumed that their previous values, already accounted for in the global statfs structure, are close enough.

Seventh, the quota daemon no longer has to sync the statfs file. Instead, it merely runs the list of rgrps looking for glocks that were not acquired during the last statfs. If it finds any, it locks them (no longer with a "try" lock), thus accomplishing a time-delayed update to the global statfs info. So statfs should be adjusted to a more accurate value every time the daemon runs (every 5 seconds by default). A minimal sketch of the try-lock pass and the deferred sweep appears below.

This is a proof-of-concept patch. I'd like to get comments and concerns from other developers who know this code. I'd also like to know whether the customer is willing to try this patch.
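Here is a minimal user-space sketch of the control flow described in the sixth and seventh points, under the assumption that the description above holds. Every name here (struct rgrp, rgrp_trylock, RGRP_TRY_MAX, daemon_sweep) is a hypothetical stand-in, not an actual GFS symbol, and the real accounting against the centralized gfs_stat_gfs is only marked with comments.

#include <stdbool.h>
#include <stdint.h>
#include <stdio.h>

#define NUM_RGRPS    8
#define RGRP_TRY_MAX 3          /* stand-in for the rgrp_try_threshold tunable */

/* Simplified stand-in for a resource group descriptor. */
struct rgrp {
    uint64_t free_blocks;       /* values last folded into the global totals */
    bool     needs_sync;        /* try lock failed on the last statfs pass   */
};

struct stat_gfs {
    uint64_t total_free;        /* centralized totals kept with the mount    */
};

/* Pretend "try" lock: fails for every third rgrp just so the deferred path
 * below has something to do.  In the real patch this is a glock try request. */
static bool rgrp_trylock(int idx) { return (idx % 3) != 2; }

/* Fast statfs pass: refresh what we can with try locks, skip the rest. */
static void fast_statfs(struct rgrp *rgds, struct stat_gfs *sg)
{
    for (int i = 0; i < NUM_RGRPS; i++) {
        bool locked = false;
        for (int t = 0; t < RGRP_TRY_MAX && !locked; t++)
            locked = rgrp_trylock(i);
        if (!locked) {
            /* Keep the previously accounted values; mark the rgrp so the
             * daemon sweep picks it up later. */
            rgds[i].needs_sync = true;
            continue;
        }
        /* In the real patch, the go_lock path would re-read the rgrp here
         * and fold any inter-node changes into sg->total_free. */
        rgds[i].needs_sync = false;
    }
    printf("fast statfs: total_free=%llu (some rgrps may be stale)\n",
           (unsigned long long)sg->total_free);
}

/* Daemon sweep (quotad-style): take real locks on the rgrps that were
 * skipped, giving a time-delayed catch-up of the global totals. */
static void daemon_sweep(struct rgrp *rgds)
{
    for (int i = 0; i < NUM_RGRPS; i++) {
        if (!rgds[i].needs_sync)
            continue;
        /* A blocking lock plus re-read would happen here. */
        rgds[i].needs_sync = false;
        printf("daemon: synced rgrp %d\n", i);
    }
}

int main(void)
{
    struct rgrp rgds[NUM_RGRPS] = { { 0, false } };
    struct stat_gfs sg = { .total_free = 123456 };   /* arbitrary demo value */

    fast_statfs(rgds, &sg);
    daemon_sweep(rgds);
    return 0;
}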
Additional testing (with gfs_grow) revealed a bug in the previous patch. My final patch should, in theory, solve bug #484760 as well.
This sounds like a major change to the normal statfs code, which IMO shouldn't even be considered. Any change needs to be isolated to a specially enabled mode, as fast_statfs is. Even though I'd like to, I don't think we can remove the current fast_statfs code either, because customers now depend on it. The best I think we can do is add a new quick_statfs option alongside fast_statfs and encourage people to change from fast to quick. There would then be three modes: normal (slow), fast, and quick.

The first thing I'd do is add some detailed diagnostics/accounting to the normal statfs that can tell us the exact reason for the time normal statfs takes (how many locks it needs to take, how long each takes, which it's waiting for, ...).

Then I'd try to come up with a dead-simple quick mode that we know cannot cause new problems and that is targeted narrowly at the problem we observe. Something simple would be using TRY locks and assuming no space is available from rgs it cannot lock. It could be tuned to wait/retry for a max time, and/or tuned for a max number of rgs it would give up on, e.g. give up trying on up to 10 rgs, which could produce a result up to 10% above the true values on a fs with 100 rgs. A rough sketch of that idea follows.
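As a rough illustration of the proposed "quick" mode, here is a minimal user-space sketch under the assumptions stated above. The names (rg_trylock, MAX_GIVEUPS, quick_statfs) are hypothetical, and the give-up handling is only one possible tuning; the point is simply that unlockable rgs are assumed to have no free space, so the reported free count is a lower bound.

#include <stdbool.h>
#include <stdint.h>
#include <stdio.h>

#define NUM_RGS     100
#define MAX_GIVEUPS  10         /* tunable: how many rgs we may give up on */

struct rg {
    uint64_t free_blocks;       /* free blocks in this resource group */
};

/* Pretend TRY lock; in the real thing this would be a glock try request.
 * Here it fails for a handful of rgs to exercise the give-up path. */
static bool rg_trylock(int idx) { return (idx % 17) != 0; }
static void rg_unlock(int idx)  { (void)idx; }

/* Quick statfs: sum what we can lock; for anything we give up on, assume
 * no free space.  Returns false if too many rgs had to be skipped, so the
 * caller can fall back to the normal statfs. */
static bool quick_statfs(const struct rg *rgs, int n, uint64_t *free_out)
{
    uint64_t free = 0;
    int giveups = 0;

    for (int i = 0; i < n; i++) {
        if (!rg_trylock(i)) {
            if (++giveups > MAX_GIVEUPS)
                return false;           /* too inaccurate; give up on quick mode */
            continue;                   /* assume 0 free blocks for this rg      */
        }
        free += rgs[i].free_blocks;
        rg_unlock(i);
    }
    *free_out = free;
    return true;
}

int main(void)
{
    struct rg rgs[NUM_RGS];
    for (int i = 0; i < NUM_RGS; i++)
        rgs[i].free_blocks = 1000;      /* arbitrary demo values */

    uint64_t free;
    if (quick_statfs(rgs, NUM_RGS, &free))
        printf("quick statfs: %llu free blocks (lower bound)\n",
               (unsigned long long)free);
    else
        printf("too many busy rgs; fall back to the normal statfs\n");
    return 0;
}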
Created attachment 342003 [details]
Second prototype patch

This is a "fixed" version of the previous patch. I found bugs in testing, especially related to gfs_grow. This version also fixes bug #484760 as I predicted, and likely bug #471141 as well. However, this patch does not yet address the concerns raised by Dave Teigland in comment #38. I'll likely revise the patch again to resolve Dave's concerns.
Created attachment 342099 [details]
Third patch

This version (hopefully) addresses Dave's concerns. It restores the "normal" statfs to its original condition and puts the lvbs back in place. The statfs_fast has been replaced by my new version. As expected, performance is back to its previous level.
Requesting flags to get this into a RHEL4 release.
*** Bug 484760 has been marked as a duplicate of this bug. ***
*** Bug 455309 has been marked as a duplicate of this bug. ***
Event posted on 07-14-2009 06:05pm EDT by MatthewWhitehead

The customer is indecisive about their choice of kernel. They have been prompted to choose and still have not. Also, the customer is testing setting statfs_fast = 0 on their dev machines and may eventually roll it out in production. That would remove the need for this patch. I would not put any more time into this until the customer comes back with a clear strategy.

This event sent from IssueTracker by MatthewWhitehead
issue 269092
Created attachment 360083 [details]
RHEL4 rebase of the "third patch"

This is the same as the previous patch, rebased for RHEL4 so that it applies to the current code base in git.
We need to update the RHEL4 documentation to say that people should not use the fast statfs feature, and that if they do require it, they should consider upgrading to RHEL5 or above.
Closing this as WONTFIX as per comment #65. The code for statfs_fast was kludged into GFS based on the GFS2 implementation and is not 100% reliable. The prototype patches I posted earlier may function better than the original, but frankly, they amount to a design change to GFS, and we're too late in the cycle for design changes.