Bug 488318 - gfs_statfs_sync not called with statfs_fast set to 1.
gfs_statfs_sync not called with statfs_fast set to 1.
Product: Red Hat Enterprise Linux 4
Classification: Red Hat
Component: GFS-kernel (Show other bugs)
All Linux
high Severity high
: rc
: 4.9
Assigned To: Robert Peterson
GFS Bugs
: Documentation
: 455309 484760 (view as bug list)
Depends On:
Blocks: 498664 654711
  Show dependency treegraph
Reported: 2009-03-03 13:17 EST by Matthew Whitehead
Modified: 2010-11-18 13:20 EST (History)
15 users (show)

See Also:
Fixed In Version:
Doc Type: Bug Fix
Doc Text:
Story Points: ---
Clone Of:
: 498664 (view as bug list)
Last Closed: 2010-11-18 13:20:24 EST
Type: ---
Regression: ---
Mount Type: ---
Documentation: ---
Verified Versions:
Category: ---
oVirt Team: ---
RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: ---

Attachments (Terms of Use)
gfs_tool lockdump /gfs output (9.40 MB, text/plain)
2009-03-03 15:43 EST, Matthew Whitehead
no flags Details
tarfile with requested 'gfs_tool lockdump /gfs' info from all cluster nodes (20.00 KB, application/x-tar)
2009-03-03 17:54 EST, Matthew Whitehead
no flags Details
Second dump of 'gfs_tool lockdump /gfs' on all nodes in cluster (2.36 MB, application/x-gzip)
2009-03-04 10:35 EST, Matthew Whitehead
no flags Details
/proc/cluster/dlm_locks for clvmd and gfs0 (4.71 MB, application/x-gzip)
2009-03-04 12:37 EST, Matthew Whitehead
no flags Details
Faster "regular" statfs (1.31 KB, patch)
2009-04-16 18:19 EDT, Robert Peterson
no flags Details | Diff
Preliminary patch (39.45 KB, patch)
2009-04-29 18:17 EDT, Robert Peterson
no flags Details | Diff
Second prototype patch (39.91 KB, patch)
2009-04-30 17:00 EDT, Robert Peterson
no flags Details | Diff
Third patch (29.46 KB, patch)
2009-05-01 10:51 EDT, Robert Peterson
no flags Details | Diff
RHEL4 Rebase of the "third patch" (34.24 KB, patch)
2009-09-08 10:12 EDT, Robert Peterson
no flags Details | Diff

  None (edit)
Description Matthew Whitehead 2009-03-03 13:17:54 EST
Description of problem: 

gfs_statfs_sync not called with statfs_fast set to 1. The output from 'df' gets BADLY out of sync (Tb off sometimes).

Version-Release number of selected component (if applicable):


How reproducible: 100% on an affected node.

Steps to Reproduce:

Run this SystemTap script and modify a file (ie dd if=/dev/zero of=/gfs/myfile bs=1M count=1) while it runs.

probe module("gfs").function("gfs_statfs_modify") {  
  printf("gfs_statfs_modified is called\n")

probe module("gfs").function("gfs_statfs_sync") {  
  printf("gfs_statfs_sync is called\n")

Actual results:

Only gfs_statfs_modifified runs. gfs_statfs_sync is never called and it should be.

Expected results:

gfs_statfs_sync and gfs_statfs_modified should both be called under normal filesystem activity.

Additional info:
Comment 1 Matthew Whitehead 2009-03-03 13:21:14 EST
This would not have been possible without the guidance of Josef Bacik.

Also, this symptom does not always occur on our GFS clusters. However, once it does appear it takes a reboot to fix it. 

Trying 'gfs_tool settune statfs_fast 0 ; gfs_tool settune statfs_fast 1' does not fix it.
Comment 2 Matthew Whitehead 2009-03-03 14:15:57 EST
More progress. Running 'top' on the gfs_quotad pid show that it is in the 'D' state (un-interruptible sleep).

Apparently gfs_quotad cannot wake up to do call gfs_statfs_sync.
Comment 3 Matthew Whitehead 2009-03-03 15:22:32 EST
I used crash to get a backtrace. Here it is:

[root@node92 ~]# pgrep -l gfs_quotad
8474 gfs_quotad

crash> bt 8474
PID: 8474   TASK: 11806919030       CPU: 3   COMMAND: "gfs_quotad"
 #0 [117fb283c78] schedule at ffffffff8031784c
 #1 [117fb283d50] wait_for_completion at ffffffff80317a90
 #2 [117fb283dd0] glock_wait_internal at ffffffffa02a2936
 #3 [117fb283e10] gfs_glock_nq at ffffffffa02a3174
 #4 [117fb283e50] gfs_glock_nq_init at ffffffffa02a339f
 #5 [117fb283e70] gfs_statfs_sync at ffffffffa02c4c7b
 #6 [117fb283f20] gfs_quotad at ffffffffa0296ac8
 #7 [117fb283f50] kernel_thread at ffffffff80110fd3
Comment 4 Matthew Whitehead 2009-03-03 15:43:36 EST
Created attachment 333926 [details]
gfs_tool lockdump /gfs output
Comment 5 Robert Peterson 2009-03-03 17:03:41 EST
Can I get glock dumps from the other nodes as well?  It would seem
that some node is holding the glock for the fast_statfs file (which
used to be the license file back when gfs was proprietary software).
The pertinent information in the glock dump would be preceded with
"Glock (2, 25)" and I'd need to see everything that follows, up
until the next "Glock" line.

Once we've determined which node is holding that license file glock,
we should get a sysrq-t output to determine what process on that node
is stuck while holding the crucial glock, and what it's doing.
But be aware that sysrq-t can cause the node to get fenced, so you
may need to adjust your cluster.conf fail time values accordingly.
Comment 6 Matthew Whitehead 2009-03-03 17:54:09 EST
Created attachment 333947 [details]
tarfile with requested 'gfs_tool lockdump /gfs' info from all cluster nodes
Comment 7 Matthew Whitehead 2009-03-03 18:12:40 EST
Interesting. All the lock dumps for "Glock (2, 25)" are identical except for node92, the node with the 'df' problem. Is that expected?
Comment 8 Robert Peterson 2009-03-03 22:21:19 EST
No, that's not expected.  The glock from the previous glock dump
was as follows:

Glock (2, 25)
  gl_flags = 1 
  gl_count = 6
  gl_state = 1
  req_gh = yes
  req_bh = yes
  lvb_count = 0
  object = yes
  new_le = no
  incore_le = no
  reclaim = no
  aspace = 0
  ail_bufs = no
    owner = -1
    gh_state = 0
    gh_flags = 0 
    error = 0
    gh_iflags = 2 4 5 
    owner = -1
    gh_state = 0
    gh_flags = 0 
    error = 0
    gh_iflags = 2 4 5 
    owner = 8474
    gh_state = 1
    gh_flags = 5 10 
    error = 0
    gh_iflags = 1 
  Inode: busy

which is completely different from all of these.  Are you sure the
glock dumps are from the failing scenario, or was that from an
eleventh node?

I would have expected one of the nodes to have a "Holder" entry
for (2, 25), holding the glock.  For example:

    owner = -1
    gh_state = 3
    gh_flags = 5 7 
    error = 0
    gh_iflags = 1 6 7 

None of the glock dumps show that, including the previous one.
So it's unclear what's happening.
Comment 9 Matthew Whitehead 2009-03-04 10:21:35 EST
Something strange IS going on. I took these samples at exactly the same moment using the cssh tool (typing goes to all 10 nodes at the same time).  I will re-run the test and upload it just to be sure.

What will you do if the results are the same as before?
Comment 10 Matthew Whitehead 2009-03-04 10:35:21 EST
Created attachment 334006 [details]
Second dump of 'gfs_tool lockdump /gfs' on all nodes in cluster

Here are the unedited files. Alas, I looked at them and the results seem identical for Glock (2, 25) on all nodes except the offending node92.
Comment 11 Robert Peterson 2009-03-04 10:41:49 EST
Well, ideally, I would recreate the problem on my lab systems.
In the problem description you said, "How reproducible: 100% on
an affected node."  Does that mean it doesn't normally reproduce,
and something happens on the cluster to make some of the nodes
start behaving this way?  If so, what is the cause?  How can I
recreate the failure here?  Unfortunately, my only 4.x cluster is
completely tied up with crit-sit bug #455696.  If I can get that
resolved soon, perhaps I can use the cluster to recreate this one,
but that won't be for a few days I'm afraid.
Comment 12 Robert Peterson 2009-03-04 11:20:07 EST
Perhaps you can capture the dlm locks as well?  See:
Comment 13 Matthew Whitehead 2009-03-04 12:37:01 EST
Created attachment 334021 [details]
/proc/cluster/dlm_locks for clvmd and gfs0

Here are the dlm locks for both the clvmd and gfs0 groups.
Comment 16 Matthew Whitehead 2009-03-30 14:52:31 EDT
The customer will only use a patched dlm kernel module if it has been tested at Red Hat and expected to achieve 'hotfix' status. 

These are production machines and ANY reboot is a difficult approval process. They will not approve it unless confidence of a fix is extremely high.
Comment 17 Robert Peterson 2009-03-30 18:13:50 EDT
After hours of testing, I've been unable to recreate this problem
in our lab.  The data from the customer suggests a missing callback
from dlm to gfs, although that's not guaranteed either.

The bottom line is this:

In order to get their "statfs_fast" working again, they need to
either reboot the cluster or unmount the affected gfs file system
and then remount it.  If they need to schedule an outage to do that,
I recommend they start that process.  In the mean time, they should
be able to turn off statfs_fast in order to get their statfs at
least reporting good numbers again, but of course, it will be back
to the slow method.

We cannot provide any kind of patched dlm or gfs kernel modules
until we know what the problem is, and we can't figure that out
until we can recreate the problem in our lab.  All we can do is
conjecture and speculate about it.

I'm setting the NEEDINFO flag in order to get both (a) a method we
can use to recreate the failure, and (b) their /var/log/messages
from all nodes (from the time of the failure) as per comment #15.
Comment 18 Matthew Whitehead 2009-03-31 12:50:13 EDT
There is one item that might be significant. These are 16 processor systems (quad-socket, quad-core).  The problem may only surface with a race condition on multi-core machines.
Comment 19 Matthew Whitehead 2009-03-31 14:16:31 EDT
I have additional but very small clues. Here are /var/log/messages entries that I bet relate to when the 'df' output went bad. Notice: the dates are in the right sequence that they went bad:

messages.1:Feb 21 19:43:07 node92 kernel: dlm: gfs0: cancel reply ret 0
messages.1:Feb 21 19:43:07 node92 kernel: dlm: gfs0: process_lockqueue_reply id 1803fc state 0

messages.3:Mar  8 08:19:46 node89 kernel: dlm: gfs0: cancel reply ret 0
messages.3:Mar  8 08:19:46 node89 kernel: dlm: gfs0: process_lockqueue_reply id a5f01ea state 0

messages.1:Mar 23 15:49:36 node128 kernel: dlm: gfs0: cancel reply ret 0
messages.1:Mar 23 15:49:36 node128 kernel: dlm: gfs0: process_lockqueue_reply id 225700e2 state 0
Comment 20 David Teigland 2009-03-31 14:37:49 EDT
Yes, that may be the key.  Those cancel's are usually caused by a gfs_grow
(see bug 487026 for a fix to grow to prevent the cancels).  A gfs_grow I think is very likely to break fast statfs.
Comment 21 Matthew Whitehead 2009-03-31 15:07:30 EDT
in lockqueue.c:process_lockqueue_reply() it does a "switch (state)" and checks these four #define cases (from dlm_internal.h):

 * Lockqueue wait lock states                                                                                                                                                                                  

#define GDLM_LQSTATE_WAIT_RSB           1

However, we know from the /var/log/messages entry that 'state' is none of those:

                log_error(ls, "process_lockqueue_reply id %x state %d",
                          lkb->lkb_id, state);

node128 kernel: dlm: gfs0: process_lockqueue_reply id 225700e2 state 0 

Why is it that 'state' is not in 1, 2, 3, or 4?
Comment 22 Robert Peterson 2009-03-31 15:09:16 EDT
That code path may also be called by gfs_jadd and gfs_quota.
Have they done any of those around the time of the problem?
Comment 23 Matthew Whitehead 2009-03-31 15:28:12 EDT
Checked the root shell history log on all nodes just to be sure, but I knew the answer already. We have NOT run gfs_jadd nor gfs_quota. The only utility we've used is gfs_tool.
Comment 24 David Teigland 2009-04-02 16:31:12 EDT
I believe there are only three cases where cancel can happen:

1. journal recovery. you can check the logs around the cancel messages to see if there was any journal recovery happening at that time.  node failure is known to be a problem with fast statfs.

2. hfile_write - gfs_grow, gfs_jadd, gfs_quota
patch in next rhel4 release will avoid cancels from hfile_write

3. gfs_lock_fs_check_clean - gfs_tool reclaim, gfs_tool freeze
I can imagine these causing problems with fast statfs, especially reclaim.
Comment 25 Robert Peterson 2009-04-06 13:07:02 EDT
I verified that the places in comment #24 are the only cases where
locks can be cancelled by gfs.  So either the customer did gfs_grow,
gfs_jadd, gfs_quota, gfs_tool reclaim, gfs_tool freeze, or else
the file system recovery occurred around the same time as those cancel
messages.  (As Dave said, look in /var/log/messages for that).

Either that or there was memory corruption in dlm or gfs.  Corruption
in gfs is less likely because it formulates the lock message right
before sending it off.
Comment 27 David Teigland 2009-04-14 17:47:04 EDT
Whatever is causing the cancel messages is probably also causing the fast-statfs problems.  Both are symptoms of something unknown.

Turning off fast-statfs will definately prevent statfs from becoming out of sync.
Comment 31 Robert Peterson 2009-04-16 18:19:10 EDT
Created attachment 339929 [details]
Faster "regular" statfs

Since we've been seeing so many problems lately with statfs_fast
I've been investigating how to improve the statfs performance.
This simple patch improves the normal "slow" statfs by 30% in my
testing.  I couldn't find a statfs benchmark test, so I compared
the same hardware, same test, running d_io on three nodes with the
same random seed, and doing a df every 30 seconds, for 20 iterations.
The results are as follows:

Average time per df using the described test:

Old method: 3.404553 seconds
New method: 2.403117 seconds (with this patch)

The idea behind the patch is simple: With the current algorithm, if
all the slots are filled with glocks that are waiting to be locked,
it spins in a nearly-tight loop (with a yield) for millions of
iterations while the locks complete.  That robs cpu time from the
processes who need to actually process the lock requests.  With this
new patch, if all the slots are full of glocks that are waiting, it
will wait for the first glock in the list of outstanding glocks to
complete before iterating again.

I might be able to make it faster by making it actually wait for
the next glock that comes in by using a centralized wait queue.

This afternoon, a few of us got our heads together and brainstormed
a bit about new ways we can implement a better statfs_fast as well,
so I'm going to pursue some of those ideas too.
Comment 32 Robert Peterson 2009-04-17 12:28:53 EDT
FYI, I tried using a centralized wait_queue and it only slowed
things down, worse than the original algorithm.
Comment 36 Robert Peterson 2009-04-29 18:17:39 EDT
Created attachment 341846 [details]
Preliminary patch

The previous patch suffered from hangs; I suspect because eventually
the glock waited on was sometimes out of sorted order.

This patch approaches "statfs_fast" from a different angle.
First, I ripped out the old code that tried to keep a single statfs
file up-to-date with periodic synchronization, due to all the hangs
and problems we've seen.

Second, I ripped out all the code associated with lvb data on a
per-rgrp basis and used glock promote/demotes to keep the lvb up-to-
date.  This may have a significant performance benefit.

Third, I replaced the temporary gfs_stat_gfs structures with a
a centralized "gfs_stat_gfs" associated with the mounted file system.
This is kept up to date whenever blocks are allocated, freed or
changed from data to metadata and such.  I introduced a centralized
function, gfs_rgd_modify, that takes care of adjusting both the
rgd values, the centralized statfs values, adding the rgd buffer to
the current transaction, and writing the rgrp out to the buffer.
This means rgrp updates are now done consistently.  I also believe
this fixed a bug with the previous statfs code whereby blocks that
were allocated as metadata chunks from free space were not accounted
for properly.

Fourth, whenever a rgrp is read in, the centralized statfs info is
adjusted accordingly by subtracting out the previous values and
adding in the new values.

Fifth, whenever glocks for rgrps are locked, if the glock version
number has changed, the new rgrp info is now read in.  It is no
longer based on the GL_SKIP bit which can potentially get us into
the trouble I described in this link:

Sixth, when statfs_fast is set, it just triggers the statfs code to
go through a different path whereby "try" locks are used for the
rgrps.  Each glock is tried a certain number of times as specified
by the existing rgrp_try_threshold tunable.  If the glock is acquired,
any locks that are taken from other nodes cause the "go_lock" code to
adjust the global statfs info.  Any locks that can't be acquired after
the given number of tries are ignored and it's assumed that their
previous values, taken into account in the global statfs structure,
are close enough.

Seventh, the quota daemon no longer has to sync the statfs file.
Instead, it merely runs the list of rgrps looking for glocks that
were not acquired during the last statfs.  If it finds any, it locks
them (no longer a "try" lock) thus accomplishing a time-delayed update
to the global statfs info.  So the statfs should be adjusted to a more
accurate value every time the daemon runs (default 5 seconds).

This is a proof-of-concept patch.  I'd like to get comments/concerns
from other developers who know this code.  I'd also like to know if
the customer is willing to try this patch.
Comment 37 Robert Peterson 2009-04-30 10:53:33 EDT
Additional testing (with gfs_grow) revealed a bug in the previous
patch.  My final patch should, in theory, solve bug #484760 as well.
Comment 38 David Teigland 2009-04-30 14:01:50 EDT
This sounds like a major change to the normal statfs code, which IMO shouldn't even be considered.  Any change needs to be isolated to a specially enabled mode like fast_statfs is.  Even though I'd like to, I don't think we can remove the current fast_statfs code either, because customers now depend on it.  The best I think we can do is add a new quick_statfs option along side fast_statfs, and encourage people to change from fast to quick.  There will then be three modes: normal (slow), fast, and quick.

The first thing I'd do is add some detailed diagnostics/accounting to the normal statfs, that can tell us the exact reason for the time normal statfs takes (how many locks it needs to take, how long each takes, which it's waiting for, ...).  Then, I'd try to come up with a dead-simple quick mode that we know cannot cause new problems, and that is targeted narrowly at the problem we observe.

Something simple would be using TRY locks, and assuming no-space available from rg's it cannot lock.  It could be tuned to wait/retry for a max time, and/or tuned for a max number of rg's it would give up on, e.g. give up trying on up to 10 rg's, which could produce a result that was up to 10% above the true values on a fs with 100 rg's.
Comment 39 Robert Peterson 2009-04-30 17:00:01 EDT
Created attachment 342003 [details]
Second prototype patch

This is a "fixed" version of the previous patch.  I found bugs in
testing, especially related to gfs_grow.  This version also fixes
bug #484760 as I predicted, and likely bug #471141 as well.
However, this patch does not yet address the concerns raised by Dave
Teigland in comment #38.  I'll likely revise the patch again to
resolve Dave's concerns.
Comment 42 Robert Peterson 2009-05-01 10:51:48 EDT
Created attachment 342099 [details]
Third patch

This version (hopefully) addresses Dave's concerns.  It restores
the "normal" statfs to its original condition and puts the lvbs back
in place.  The statfs_fast has been replaced by my new version.
As expected, performance is back to its previous level.
Comment 43 Robert Peterson 2009-05-01 11:46:34 EDT
Requesting flags to get this into a RHEL4 release.
Comment 44 Robert Peterson 2009-05-01 14:47:12 EDT
*** Bug 484760 has been marked as a duplicate of this bug. ***
Comment 45 Robert Peterson 2009-05-07 17:28:31 EDT
*** Bug 455309 has been marked as a duplicate of this bug. ***
Comment 48 Issue Tracker 2009-07-09 09:39:02 EDT
Event posted on 2009-07-09 09:39 EDT by jwilleford

This event sent from IssueTracker by jwilleford 
 issue 283633
Comment 50 Issue Tracker 2009-07-14 18:05:31 EDT
Event posted on 07-14-2009 06:05pm EDT by MatthewWhitehead

The customer is indecisive about their choice of kernel. They have been
prompted to choose and still have not.

Also, the customer is testing setting statfs_fast = 0 on their dev
machines and may eventually roll it out in production. That would remove
the need for this patch.

I would not put any more time into this until the customer comes back with
a clear strategy.

This event sent from IssueTracker by MatthewWhitehead 
 issue 269092
Comment 53 Robert Peterson 2009-09-08 10:12:53 EDT
Created attachment 360083 [details]
RHEL4 Rebase of the "third patch"

This is the same as the previous patch for RHEL4, rebased to be
applied to the current code base in git.
Comment 62 Steve Whitehouse 2010-11-18 05:03:24 EST
We need to update the rhel4 documentation to say that people should not use the fast statfs feature and if they do require this feature they should consider upgrading to rhel5 or above.
Comment 66 Robert Peterson 2010-11-18 13:20:24 EST
Closing this as WONTFIX as per comment #65.  The code for
statfs_fast was kludged into GFS based on the GFS2 implementation
and is not 100%.  The prototype patches I posted earlier may
function better than the original, but frankly, it's a design
change to GFS and we're too late in the cycle for design changes.

Note You need to log in before you can comment on or make changes to this bug.