Red Hat Bugzilla – Bug 181779
Occasional process hang accessing at sync_page on nfs mount
Last modified: 2008-01-19 23:40:24 EST
Description of problem:
We are seeing occasional process hangs when accessing files over a heavily used
NFS mount (rw,nosuid,noac,actimeo=0,nfsvers=3,tcp,timeo=600,rsize=32768,wsize=32768,
where the process is stuck in sync_page according to the wait channel shown by
ps -l, and it becomes unkillable. Subsequent attempts to access the file hang in
the same way.
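For anyone trying to spot the same symptom, here is a minimal sketch of how the stuck processes could be listed without reading ps output by hand. It assumes a kernel that exposes /proc/<pid>/wchan and that the wait channel name is readable by the user running it; the function name find_blocked is made up for illustration and is not part of any existing tool.

```python
import os


def find_blocked(proc_root="/proc", wchan_name="sync_page"):
    """Return sorted (pid, comm) pairs for tasks in uninterruptible sleep
    (state D) whose wait channel equals wchan_name.

    Hypothetical helper: proc_root is a parameter only so the function can
    be pointed at a copy of /proc for testing.
    """
    hits = []
    for entry in os.listdir(proc_root):
        if not entry.isdigit():
            continue  # skip non-process entries such as "meminfo"
        base = os.path.join(proc_root, entry)
        try:
            with open(os.path.join(base, "stat")) as f:
                stat = f.read()
            with open(os.path.join(base, "wchan")) as f:
                wchan = f.read().strip()
        except OSError:
            continue  # process exited, or the file is not readable
        # stat looks like "pid (comm) state ..."; comm may itself contain
        # spaces or parentheses, so split on the first '(' and last ')'.
        lparen, rparen = stat.find("("), stat.rfind(")")
        if lparen < 0 or rparen < 0:
            continue
        comm = stat[lparen + 1:rparen]
        fields = stat[rparen + 1:].split()
        state = fields[0] if fields else ""
        if state == "D" and wchan == wchan_name:
            hits.append((int(entry), comm))
    return sorted(hits)
```

Calling find_blocked() with the defaults on the affected client should return the same processes that ps -l shows in state D with sync_page in the WCHAN column. On kernels that hide or zero the wait channel the list will simply be empty.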
Version-Release number of selected component (if applicable):
This has happened 2 or 3 times in the past week.
I am not sure if it is relevant but we are also seeing the error
kernel: do_vfs_lock: VFS is out of sync with lock manager!
occasionally in the messages file.
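To put a number on "occasionally", a small sketch like the following could count how often the message appears per day. It assumes the traditional syslog line layout ("Mon DD HH:MM:SS host kernel: ..."); the function name count_by_day and the /var/log/messages path in the usage comment are assumptions, not something taken from this report.

```python
from collections import Counter

# The kernel message quoted in this report.
NEEDLE = "do_vfs_lock: VFS is out of sync with lock manager!"


def count_by_day(lines, needle=NEEDLE):
    """Count lines containing needle, keyed by the "Mon DD" date prefix.

    Assumes each line starts with a syslog-style timestamp; lines with
    fewer than two whitespace-separated fields are ignored.
    """
    counts = Counter()
    for line in lines:
        if needle in line:
            parts = line.split()
            if len(parts) >= 2:
                counts[" ".join(parts[:2])] += 1
    return counts


# Usage (path is an assumption; adjust for the local syslog setup):
#   with open("/var/log/messages") as f:
#       for day, n in sorted(count_by_day(f).items()):
#           print(day, n)
```

Comparing these per-day counts with the days on which processes got stuck would show whether the two symptoms tend to occur together.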
We tried an earlier kernel, and have seen the same occasional process hang on
Created attachment 125042 [details]
sysrq t output for stuck processes
The stuck processes appear to come in pairs stuck in sync_page. I am attaching
the sysrq t output from a couple of these stuck processes.
Is there any network traffic when this hang happens?
The machine was in general under very heavy NFS load, but as we only observed
the stuck processes after the event, we couldn't tell if there was anything
unusual about the traffic at the point the event was triggered. Incidentally,
the NFS server is a NetApp file server.
Would it be possible to post the complete sysrq t trace as
well as the sysrq-m output?
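For reference, both dumps can be requested without a console keyboard by writing the key to /proc/sysrq-trigger as root; the output goes to the kernel log. The sketch below assumes the kernel was built with magic SysRq support, and the helper name trigger_sysrq is made up for illustration.

```python
def trigger_sysrq(key, trigger_path="/proc/sysrq-trigger"):
    """Write a single sysrq key to the trigger file.

    't' asks the kernel to dump the task list, 'm' to dump memory
    information. trigger_path is a parameter only so the helper can be
    exercised against an ordinary file.
    """
    if len(key) != 1:
        raise ValueError("sysrq key must be a single character")
    with open(trigger_path, "w") as f:
        f.write(key)


# Usage, as root on the affected client:
#   trigger_sysrq("t")   # task list with stack traces
#   trigger_sysrq("m")   # memory usage summary
# then save the kernel log (for example the output of dmesg) and attach it.
```

Taking both dumps while the processes are still stuck keeps the task traces and the memory state from the same moment.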
Created attachment 132998 [details]
Full sysrq-t output
Here is the full sysrq-t output from which the above extract was taken. I
didn't record the sysrq-m output.
Could you please post the sysrq-m output as well? Because I'm
thinking this could be a memory exhaustion problem... tia...
I didn't save the sysrq-m output when we saw the bug, and unfortunately the
software on that machine is in the process of being changed at the moment to try
to avoid the problem by lessening the nfs load, so I can't generate any useful
sysrq-m output at the present time.
We do still have some Red Hat 9 machines running the same software and we could
get sysrq-m output from them but they didn't exhibit the bug, and may be too
different to be useful anyway.
Ok... if the issue pops back update the bug...
[This comment added as part of a mass-update to all open FC4 kernel bugs]
FC4 has now transitioned to the Fedora legacy project, which will continue to
release security related updates for the kernel. As this bug is not security
related, it is unlikely to be fixed in an update for FC4, and has been migrated
Please retest with Fedora Core 5.
A new kernel update has been released (Version: 2.6.18-1.2200.fc5)
based upon a new upstream kernel release.
Please retest against this new kernel, as a large number of patches
go into each upstream release, possibly including changes that
may address this problem.
This bug has been placed in NEEDINFO state.
Due to the large volume of inactive bugs in bugzilla, if this bug is
still in this state in two weeks time, it will be closed.
Should this bug still be relevant after this period, the reporter
can reopen the bug at any time. Any other users on the Cc: list
of this bug can request that the bug be reopened by adding a
comment to the bug.
In the last few updates, some users upgrading from FC4->FC5
have reported that installing a kernel update has left their
systems unbootable. If you have been affected by this problem
please check you only have one version of device-mapper & lvm2
installed. See bug 207474 for further details.
If this bug is a problem preventing you from installing the
release this version is filed against, please see bug 169613.
If this bug has been fixed, but you are now experiencing a different
problem, please file a separate bug for the new problem.
Created attachment 148568 [details]
sysrq-m output from a similar situation
We are seeing similar stuck processes for FC6 2.6.19-1.2911 on an x86_64 box,
doing the same task but with different software. This could be related to a
different problem we are seeing on the same box (which I have reported as bug
229469) where there are locks on the NetApp NFS server that the client seems
to have forgotten to remove.
I am attaching the sysrq-m output from this new occurrence, though it was taken
some time after the stuck processes first appeared.
This does now seem to be separate from the locking issue I was seeing. I have
had processes sticking on 2.6.19-1.2911.6.5 with my locking patch applied,
including a single process rather than a pair. In this case the sysrq-t output
doesn't list the stuck process for some reason, so I can't see how it is sticking.
(this is a mass-close to kernel bugs in NEEDINFO state)
As indicated previously there has been no update on the progress of this bug
therefore I am closing it as INSUFFICIENT_DATA. Please re-open if the issue
still occurs for you and I will try to assist in its resolution. Thank you for
taking the time to report the initial bug.
If you believe that this bug was closed in error, please feel free to reopen