Bug 181779

Summary: Occasional process hang accessing at sync_page on nfs mount
Product: [Fedora] Fedora
Reporter: Michael Young <m.a.young>
Component: kernel
Assignee: Steve Dickson <steved>
Status: CLOSED INSUFFICIENT_DATA
QA Contact: Brian Brock <bbrock>
Severity: medium
Priority: medium
Version: 5
CC: a.d.stribblehill, davej, jonstanley, wtogami
Target Milestone: ---   
Target Release: ---   
Hardware: All   
OS: Linux   
Whiteboard: MassClosed
Last Closed: 2008-01-20 04:40:24 UTC
Attachments:
sysrq t output for stuck processes (flags: none)
Full sysrq-t output (flags: none)
sysrq-m output from a similar situation (flags: none)

Description Michael Young 2006-02-16 15:20:44 UTC
Description of problem:
We are seeing occasional process hangs when accessing files over a heavily used
NFS mount (rw,nosuid,noac,actimeo=0,nfsvers=3,tcp,timeo=600,rsize=32768,wsize=32768,
hard,intr,fg).
The process is stuck in sync_page according to ps -l (this is the kernel function
shown as the wait channel, not a syscall), and it becomes unkillable. Subsequent
attempts to access the same file hang in the same way.
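
When comparing clients, it can help to confirm which NFS options are actually in effect. The following is a minimal Python sketch, assuming the standard /proc/mounts layout (device, mount point, filesystem type, options, dump, pass). The function name parse_nfs_mounts is made up for this illustration, and the option names the kernel reports may differ from those given on the mount command line.

```python
def parse_nfs_mounts(lines):
    """Return {mount_point: {option: value_or_True}} for NFS entries.

    `lines` is any iterable of /proc/mounts-style lines.
    """
    mounts = {}
    for line in lines:
        fields = line.split()
        # Expected layout: device mount_point fstype options dump pass
        if len(fields) < 4 or fields[2] not in ("nfs", "nfs4"):
            continue
        options = {}
        for opt in fields[3].split(","):
            key, sep, value = opt.partition("=")
            # Flag options such as "hard" have no value; store True.
            options[key] = value if sep else True
        mounts[fields[1]] = options
    return mounts
```

Passing an open handle on /proc/mounts to parse_nfs_mounts returns one entry per NFS mount point, so values such as timeo or rsize can be compared between an affected and an unaffected client.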
Version-Release number of selected component (if applicable):
2.6.15-1.1831_FC4smp

How reproducible:
This has happened 2 or 3 times in the past week.
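
Because the hangs were only noticed after the event, a periodic check for tasks blocked in sync_page might catch them sooner. This is a rough Python sketch, assuming a Linux /proc where /proc/<pid>/stat holds the process state and /proc/<pid>/wchan holds the wait channel name. The helper names are hypothetical, and on some kernels wchan is only readable by root or shows 0.

```python
import os


def parse_stat_state(stat_text):
    """Return the one-letter process state from /proc/<pid>/stat contents."""
    # The command name is wrapped in parentheses and may itself contain
    # spaces or ')', so split after the LAST closing parenthesis.
    rest = stat_text[stat_text.rfind(")") + 1:].split()
    return rest[0]


def find_blocked(proc_root="/proc", wchan_name="sync_page"):
    """List (pid, wchan) for tasks in state D whose wait channel matches."""
    blocked = []
    for entry in os.listdir(proc_root):
        if not entry.isdigit():
            continue  # not a process directory
        base = os.path.join(proc_root, entry)
        try:
            with open(os.path.join(base, "stat")) as f:
                state = parse_stat_state(f.read())
            with open(os.path.join(base, "wchan")) as f:
                wchan = f.read().strip()
        except (IOError, OSError):
            continue  # process exited or file not readable
        # "D" is uninterruptible sleep, which is why the task is unkillable.
        if state == "D" and wchan == wchan_name:
            blocked.append((int(entry), wchan))
    return sorted(blocked)
```

Running find_blocked() from cron and logging any non-empty result would give a timestamp close to the moment a process first gets stuck. It only inspects the main task of each process, not individual threads.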

Comment 1 Michael Young 2006-02-17 13:51:34 UTC
I am not sure if it is relevant but we are also seeing the error
kernel: do_vfs_lock: VFS is out of sync with lock manager!
occasionally in the messages file.
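
To see whether these warnings line up in time with the hangs, the messages file can be filtered for them. This is a small Python sketch. The message text is copied from above, while the function name and the idea of matching on a substring are assumptions made for illustration.

```python
import re

# Message text as it appears in the report; the syslog prefix
# (timestamp, hostname) varies between systems, so match the tail only.
LOCK_WARNING = re.compile(r"do_vfs_lock: VFS is out of sync with lock manager")


def find_lock_warnings(lines):
    """Return the log lines that contain the do_vfs_lock warning."""
    return [line.rstrip("\n") for line in lines if LOCK_WARNING.search(line)]
```

Feeding it an open handle on the messages file (often /var/log/messages, though the path depends on the syslog configuration) yields the matching lines with their timestamps intact.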

Comment 2 Michael Young 2006-02-21 14:22:50 UTC
We tried an earlier kernel, and have seen the same occasional process hang on
2.6.11-1.1369_FC4smp 

Comment 3 Michael Young 2006-02-22 16:58:08 UTC
Created attachment 125042
sysrq t output for stuck processes

The stuck processes appear to come in pairs, both stuck in sync_page. I am attaching
the sysrq t output from a couple of these stuck processes.

Comment 4 Steve Dickson 2006-07-25 04:16:04 UTC
Is there any network traffic when this hang happens?

Comment 5 Michael Young 2006-07-25 08:33:06 UTC
The machine was in general under very heavy NFS load, but as we only observed
the stuck processes after the event, we couldn't tell if there was anything
unusual about the traffic at the point the event was triggered. Incidentally,
the NFS server is a NetApp file server.

Comment 6 Steve Dickson 2006-07-25 15:29:07 UTC
Would it be possible to post the complete sysrq t trace as
well as the sysrq-m output?
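
For machines without console access, both dumps can be requested by writing the command character to /proc/sysrq-trigger as root. This is a minimal Python sketch, assuming that file is available on the running kernel. The function name trigger_sysrq is hypothetical.

```python
def trigger_sysrq(key, trigger_path="/proc/sysrq-trigger"):
    """Write a single SysRq command character to the trigger file.

    't' asks the kernel to dump all task states and 'm' to dump
    memory information. Writing to the real file needs root.
    """
    if len(key) != 1:
        raise ValueError("SysRq key must be a single character")
    with open(trigger_path, "w") as f:
        f.write(key)
```

Calling trigger_sysrq("t") followed by trigger_sysrq("m") sends both dumps to the kernel log, from where they can be collected with dmesg or from the syslog messages file. A full task dump can be large, so it may overrun the kernel ring buffer on a busy machine.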

Comment 7 Michael Young 2006-07-25 15:40:58 UTC
Created attachment 132998
Full sysrq-t output

Here is the full sysrq-t output from which the above extract was taken. I
didn't record the sysrq-m output.

Comment 8 Steve Dickson 2006-07-25 20:03:24 UTC
Could you please post the sysrq-m output as well? Because I'm
thinking this could be a memory exhaustion problem... tia... 

Comment 9 Michael Young 2006-07-26 15:53:05 UTC
I didn't save the sysrq-m output when we saw the bug, and unfortunately the
software on that machine is in the process of being changed at the moment to try
to avoid the problem by lessening the nfs load, so I can't generate any useful
sysrq-m output at the present time.

We do still have some Red Hat 9 machines running the same software, and we could
get sysrq-m output from them, but they didn't exhibit the bug and may be too
different to be useful anyway.

Comment 10 Steve Dickson 2006-07-27 15:56:30 UTC
Ok... if the issue pops back up, please update the bug...

Comment 11 Dave Jones 2006-09-17 02:38:19 UTC
[This comment added as part of a mass-update to all open FC4 kernel bugs]

FC4 has now transitioned to the Fedora legacy project, which will continue to
release security related updates for the kernel.  As this bug is not security
related, it is unlikely to be fixed in an update for FC4, and has been migrated
to FC5.

Please retest with Fedora Core 5.

Thank you.


Comment 12 Dave Jones 2006-10-16 18:44:33 UTC
A new kernel update has been released (Version: 2.6.18-1.2200.fc5)
based upon a new upstream kernel release.

Please retest against this new kernel, as a large number of patches
go into each upstream release, possibly including changes that
may address this problem.

This bug has been placed in NEEDINFO state.
Due to the large volume of inactive bugs in bugzilla, if this bug is
still in this state in two weeks' time, it will be closed.

Should this bug still be relevant after this period, the reporter
can reopen the bug at any time. Any other users on the Cc: list
of this bug can request that the bug be reopened by adding a
comment to the bug.

In the last few updates, some users upgrading from FC4->FC5
have reported that installing a kernel update has left their
systems unbootable. If you have been affected by this problem
please check you only have one version of device-mapper & lvm2
installed.  See bug 207474 for further details.

If this bug is a problem preventing you from installing the
release this version is filed against, please see bug 169613.

If this bug has been fixed, but you are now experiencing a different
problem, please file a separate bug for the new problem.

Thank you.

Comment 13 Michael Young 2007-02-22 14:36:41 UTC
Created attachment 148568
sysrq-m output from a similar situation

We are seeing similar stuck processes for FC6 2.6.19-1.2911 on an x86_64 box,
doing the same task but with different software. This could be related to a
different problem we are seeing on the same box (which I have reported as bug
229469), where there are locks on the NetApp NFS server that the client seems
to have forgotten to remove.
I am attaching the sysrq-m output from this new occurrence, though it was taken
some time after the processes first became stuck.

Comment 14 Michael Young 2007-03-13 12:57:40 UTC
This does now seem to be separate from the locking issue I was seeing. I have
had processes sticking on 2.6.19-1.2911.6.5 with my locking patch applied,
including a single process rather than a pair. In this case the sysrq-t output
doesn't list the stuck process for some reason, so I can't see how it is sticking.

Comment 15 Jon Stanley 2008-01-20 04:40:24 UTC
(this is a mass-close to kernel bugs in NEEDINFO state)

As indicated previously, there has been no update on the progress of this bug,
so I am closing it as INSUFFICIENT_DATA. Please re-open if the issue
still occurs for you and I will try to assist in its resolution. Thank you for
taking the time to report the initial bug.

If you believe that this bug was closed in error, please feel free to reopen
this bug.