Bug 476726 - [nfs] actimeo=0 not enforced during ftruncate operations, resulting in database crashes
Status: CLOSED ERRATA
Product: Red Hat Enterprise Linux 4
Classification: Red Hat
Component: kernel
Version: 4.7
Hardware: All
OS: Linux
Priority: high
Severity: high
Target Milestone: beta
Target Release: 4.8
Assigned To: Peter Staubach
QA Contact: Martin Jenner
Duplicates: 476728
Depends On:
Blocks: 450897
Reported: 2008-12-16 14:19 EST by John Sobecki
Modified: 2018-10-19 22:07 EDT

Doc Type: Bug Fix
Last Closed: 2009-05-18 15:24:25 EDT


Attachments
Testcase (4.06 KB, application/x-gzip), 2008-12-16 14:19 EST, John Sobecki
patch 1 of 2 (3.17 KB, patch), 2008-12-16 14:21 EST, John Sobecki
patch 2 of 2 (1.41 KB, patch), 2008-12-16 14:21 EST, John Sobecki
Proposed patch (3.07 KB, patch), 2008-12-17 10:12 EST, Peter Staubach


External Trackers
Tracker: Red Hat Product Errata
ID: RHSA-2009:1024
Priority: normal
Status: SHIPPED_LIVE
Summary: Important: Red Hat Enterprise Linux 4.8 kernel security and bug fix update
Last Updated: 2009-05-18 10:57:26 EDT

Description John Sobecki 2008-12-16 14:19:58 EST
Created attachment 327151 [details]
Testcase

Description of problem:

During heavy file extend operations of an Oracle RAC database, the instance can crash with an ORA-600 [KCFQUEWR_2].  

The root cause of the problem is that clustered machines using NFSv3 over TCP storage report inconsistent file sizes. The problem can be reproduced with storage served by either NetApp or nfsd.

Version-Release number of selected component (if applicable):

2.6.9-78

How reproducible:

Run the attached testcase. No database is required.

Steps to Reproduce:
1. Set up two NFS clients, mounted with nointr,actimeo=0
2. Run the testcase per the README
3. Observe the error
  
Actual results:

Same as described above: the clients report inconsistent file sizes.

Expected results:

Filesize should be consistent across the cluster after ftruncate() operations. 

Additional info:

Backported patches from Chuck Lever were tested and attached here.
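
The check described under "Expected results" can be sketched as follows. This is not the attached testcase; it is a minimal single-process illustration, and the names set_size, observed_size, and check_sizes are hypothetical. It resizes a file with ftruncate() and compares the result with what stat() reports. To exercise the bug itself, the writer half would run on one NFS client and the reader half on the other, against a file on the shared mount.

```python
import os
import tempfile


def set_size(fd, size):
    """Writer side: resize the open file with ftruncate()."""
    os.ftruncate(fd, size)


def observed_size(path):
    """Reader side: ask the filesystem for the current size with stat()."""
    return os.stat(path).st_size


def check_sizes(path, sizes):
    """Resize `path` to each size and record any size the reader sees differently."""
    mismatches = []
    fd = os.open(path, os.O_RDWR | os.O_CREAT, 0o600)
    try:
        for size in sizes:
            set_size(fd, size)
            seen = observed_size(path)
            if seen != size:
                mismatches.append((size, seen))
    finally:
        os.close(fd)
    return mismatches


if __name__ == "__main__":
    # Hypothetical scratch file; a real run would point at a file on the NFS mount.
    with tempfile.TemporaryDirectory() as scratch:
        print(check_sizes(os.path.join(scratch, "datafile"), [0, 4096, 1 << 20, 512]))
```

On a local filesystem with a single process, no mismatches are expected. A mismatch seen across two clients mounted with actimeo=0 would correspond to the inconsistency reported here.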
Comment 1 John Sobecki 2008-12-16 14:21:10 EST
Created attachment 327152 [details]
patch 1 of 2
Comment 2 John Sobecki 2008-12-16 14:21:50 EST
Created attachment 327153 [details]
patch 2 of 2
Comment 3 Chuck Lever 2008-12-16 14:38:09 EST
Patch 1 is a backport of Peter Staubach's upstream patch to close a window in the attribute timeout code.  This is needed to make sure that nfs_revalidate_inode() always forces a call to __nfs_revalidate_inode() for "actimeo=0" and "noac" mounts.

Patch 2 makes __nfs_revalidate_inode() behave the same way for "actimeo=0" and "noac" mounts.  Our database customers use "actimeo=0" with direct NFS I/O as a faster, more reliable replacement for "noac".  "noac" should always be a synonym for "actimeo=0,sync".
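
The intended equivalence between "noac" and "actimeo=0,sync" can be illustrated at the level of the option string. This is only a sketch of the stated relationship, not how the kernel or mount actually parses options; the function name normalize_nfs_options is hypothetical.

```python
def normalize_nfs_options(options):
    """Expand 'noac' into 'actimeo=0' and 'sync', keeping other options in order.

    Illustration only: string rewriting, not the kernel's option handling.
    """
    result = []
    for opt in options.split(","):
        opt = opt.strip()
        if not opt:
            continue
        expanded = ["actimeo=0", "sync"] if opt == "noac" else [opt]
        for item in expanded:
            # Skip options that are already present to avoid duplicates.
            if item not in result:
                result.append(item)
    return ",".join(result)


if __name__ == "__main__":
    print(normalize_nfs_options("nointr,noac"))
    print(normalize_nfs_options("nointr,actimeo=0"))
```

Under this reading, a "noac" mount and an "actimeo=0" mount differ only by "sync", so attribute revalidation should behave identically for both.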
Comment 4 Peter Staubach 2008-12-16 14:53:09 EST
*** Bug 476728 has been marked as a duplicate of this bug. ***
Comment 5 Chuck Lever 2008-12-16 15:04:24 EST
John did a test matrix yesterday with just the first patch applied.

Both clients running EL4.5 patched kernel & 5 copies of reads.

linux3 running truncates   linux4 running stats   result
------------------------   --------------------   ------
noac                       noac                   pass
actimeo=0                  actimeo=0              fail
noac                       actimeo=0              fail
actimeo=0                  noac                   pass

So we think the second patch is required to completely fix this problem.
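
One way to read the four runs in the matrix is to group the results by each client's mount option. The sketch below does only that; the RUNS list copies the matrix, and the helper name results_by is hypothetical.

```python
# Each run: (option on linux3 running truncates, option on linux4 running stats, result)
RUNS = [
    ("noac", "noac", "pass"),
    ("actimeo=0", "actimeo=0", "fail"),
    ("noac", "actimeo=0", "fail"),
    ("actimeo=0", "noac", "pass"),
]


def results_by(column):
    """Group the observed results by the mount option in the given column (0 or 1)."""
    grouped = {}
    for run in RUNS:
        grouped.setdefault(run[column], set()).add(run[2])
    return grouped


if __name__ == "__main__":
    print("by truncating client:", results_by(0))
    print("by stat-ing client:  ", results_by(1))
```

In these four runs the result tracks the option on the client running stats, not the one running truncates. That is consistent with needing the second patch, which changes how __nfs_revalidate_inode() treats "actimeo=0" mounts.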
Comment 6 Chuck Lever 2008-12-16 15:07:11 EST
Due to the use of proportional fonts in the "Additional Comments:" box and tabs in the original table, the above matrix is kind of a mess.  Sorry.
Comment 7 Peter Staubach 2008-12-17 10:12:20 EST
Created attachment 327257 [details]
Proposed patch

Here is the (roughly) equivalent patch as ported to 2.6.9-78.22.

The previous patches were good as a model, but were applicable to a
relatively old version of the RHEL-4 kernel.  I believe that this
new patch should work to address the problems.
Comment 8 RHEL Product and Program Management 2008-12-18 14:28:33 EST
This request was evaluated by Red Hat Product Management for inclusion in a Red
Hat Enterprise Linux maintenance release.  Product Management has requested
further review of this request by Red Hat Engineering, for potential
inclusion in a Red Hat Enterprise Linux Update release for currently deployed
products.  This request is not yet committed for inclusion in an Update
release.
Comment 9 Vivek Goyal 2009-01-05 09:20:29 EST
Committed in 78.23.EL. RPMs are available at http://people.redhat.com/vgoyal/rhel4/
Comment 10 John Sobecki 2009-01-05 17:12:17 EST
Confirmed fixed via the testcase I posted to this BZ.  The load test
has run for over 6 hours so far using actimeo=0 on both systems;
before the patch it would fail within the first 5 minutes.  Thanks, John
Comment 17 Chris Ward 2009-03-27 10:20:31 EDT
~~ Attention Partners! Snap 1 Released ~~
RHEL 4.8 Snapshot 1 has been released on partners.redhat.com. There should
be a fix present, which addresses this bug. NOTE: there is only a short time
left to test, please test and report back results on this bug
at your earliest convenience.

If you encounter any issues, please set the bug back to the ASSIGNED state and
describe the issues you encountered. If you have found a NEW bug, clone this
bug and describe the issues you encountered. Further questions can be
directed to your Red Hat Partner Manager.

If you have VERIFIED the bug fix, please select your PartnerID from the
Verified field above. Please leave a comment with your test results details.
Include which arches tested, package version and any applicable logs.

 - Red Hat QE Partner Management
Comment 18 Chris Ward 2009-04-16 10:13:20 EDT
Confirmed that the patch in comment #7, which was customer verified, is included in the latest -88 build.
Comment 19 John Sobecki 2009-04-16 10:54:03 EDT
I have this kernel downloaded & the nfs testcase is running.  Thanks, John
Comment 20 Chris Ward 2009-04-16 12:08:13 EDT
~~ Attention! Snap 4 Released ~~
RHEL 4.8 Snapshot 4 has been released on partners.redhat.com. There should
be a fix present that addresses this bug. NOTE: there is only a short time
left to test, please test and report back results on this bug ASAP.

The latest kernel build can be obtained here:
http://people.redhat.com/vgoyal/rhel4/

If you encounter any issues, please set the bug back to the ASSIGNED state and
describe the issues you encountered. If you have found a NEW bug, clone this
bug and describe the issues you encountered. Further questions can be
directed to your Red Hat Partner Manager.

If you have VERIFIED the bug fix, please select your PartnerID from the
Verified field above. Please leave a comment with your test results details.
Include which arches tested, package version and any applicable logs.
Comment 21 Chris Ward 2009-04-20 05:50:47 EDT
John, and update?
Comment 22 Chris Ward 2009-04-20 05:51:16 EDT
s/and/any/
Comment 23 John Sobecki 2009-04-20 14:14:19 EDT
Status:

Tests are still running, but it appears that the -88 kernel has fixed the
actimeo=0 bugs, based on the testcase we have here at Oracle.
Comment 24 John Sobecki 2009-04-20 17:16:23 EDT
Fixed confirmed with 2.6.9-88.ELsmp kernel on x86_64 in a cluster stress test.
Comment 26 errata-xmlrpc 2009-05-18 15:24:25 EDT
An advisory has been issued which should help the problem
described in this bug report. This report is therefore being
closed with a resolution of ERRATA. For more information
on the solution and/or where to find the updated files,
please follow the link below. You may reopen this bug report
if the solution does not work for you.

http://rhn.redhat.com/errata/RHSA-2009-1024.html
