Bug 476726

Summary:

[nfs] actimeo=0 not enforced during ftruncate operations, resulting in database crashes

Product:

Red Hat Enterprise Linux 4

Reporter:

John Sobecki <john.sobecki>

Component:

kernel

Assignee:

Peter Staubach <staubach>

Status:

CLOSED ERRATA

QA Contact:

Martin Jenner <mjenner>

Severity:

high

Priority:

high

Version:

4.7

CC:

andriusb, bikash, cevich, chris.mason, chuck.lever, cward, dejohnso, greg.marsden, jlayton, jneedham, john.sobecki, jtluka, juanino, lwang, rwheeler, spurrier, steved, tao

Target Milestone:

beta

Target Release:

4.8

Hardware:

All

OS:

Linux

Doc Type:

Bug Fix

Last Closed:

2009-05-18 19:24:25 UTC

Bug Blocks:

450897

Attachments:

Description	Flags
Testcase	none
patch 1 of 2	none
patch 2 of 2	none
Proposed patch	none

Description John Sobecki 2008-12-16 19:19:58 UTC

Created attachment 327151 [details]
Testcase

Description of problem:

During heavy file-extend operations on an Oracle RAC database, the instance can crash with an ORA-600 [KCFQUEWR_2] error.

The root cause of the problem is an inconsistent file size reported across clustered machines using NFSv3 over TCP storage. The problem can be reproduced with either NetApp or nfsd-served storage.

Version-Release number of selected component (if applicable):

2.6.9-78

How reproducible:

Run the attached testcase. No database is required.

Steps to Reproduce:
1. Set up two NFS clients, each mounted with nointr,actimeo=0
2. Run the testcase per the README
3. Get the error

Actual results:

Same as above: the clients report inconsistent file sizes.

Expected results:

Filesize should be consistent across the cluster after ftruncate() operations. 

Additional info:

Backported patches from Chuck Lever were tested and attached here.
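
For readers without access to the attachment, the kind of check involved can be sketched roughly as follows. This is not the attached testcase; it is a minimal single-host illustration, and the names extend_file, observed_size, run_check and the STEP value are all hypothetical. In a real reproduction the ftruncate() calls and the stat() calls would run on two different NFS clients that mount the same export with actimeo=0.

```python
import os
import tempfile

STEP = 4096  # bytes added per extend (hypothetical value)


def extend_file(fd: int, step: int = STEP) -> int:
    """Grow the file by `step` bytes with ftruncate() and return the new size."""
    new_size = os.fstat(fd).st_size + step
    os.ftruncate(fd, new_size)
    return new_size


def observed_size(path: str) -> int:
    """Return the size reported by a fresh stat() of the path."""
    return os.stat(path).st_size


def run_check(path: str, rounds: int = 100) -> list:
    """Extend the file `rounds` times and collect (expected, seen) mismatches."""
    mismatches = []
    fd = os.open(path, os.O_RDWR | os.O_CREAT, 0o644)
    try:
        for _ in range(rounds):
            expected = extend_file(fd)
            seen = observed_size(path)
            if seen != expected:
                mismatches.append((expected, seen))
    finally:
        os.close(fd)
    return mismatches


if __name__ == "__main__":
    # Local demo only: on a local filesystem no mismatches are expected.
    with tempfile.TemporaryDirectory() as tmp:
        print(run_check(os.path.join(tmp, "extend.dat")))
```

On a single host this only exercises the mechanics. The inconsistency described in this bug needs the stat() side to run on a second client, where a stale cached size can be returned.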

Comment 1 John Sobecki 2008-12-16 19:21:10 UTC

Created attachment 327152 [details]
patch 1 of 2

Comment 2 John Sobecki 2008-12-16 19:21:50 UTC

Created attachment 327153 [details]
patch 2 of 2

Comment 3 Chuck Lever 2008-12-16 19:38:09 UTC

Patch 1 is a backport of Peter Staubach's upstream patch to close a window in the attribute timeout code.  This is needed to make sure that nfs_revalidate_inode() always forces a call to __nfs_revalidate_inode() for "actimeo=0" and "noac" mounts.

Patch 2 makes __nfs_revalidate_inode() behave the same way for "actimeo=0" and "noac" mounts.  Our database customers use "actimeo=0" with direct NFS I/O as a faster, more reliable replacement for "noac".  "noac" should always be a synonym for "actimeo=0,sync".
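
As an illustration only, here is one way a window like the one Patch 1 closes could arise. This is a toy model, not code from the attached patches, and the assumption that the window comes from an inclusive time-range check is mine. If the cache counts as valid whenever the current tick lies within [cached_at, cached_at + timeout], then a timeout of 0 still leaves the cache "valid" for the tick in which it was filled. The function names and the tick values are hypothetical.

```python
def timed_out_naive(now: int, cached_at: int, timeout: int) -> bool:
    """Stale only once `now` falls outside [cached_at, cached_at + timeout]."""
    return not (cached_at <= now <= cached_at + timeout)


def timed_out_fixed(now: int, cached_at: int, timeout: int) -> bool:
    """Same check, but a zero timeout always counts as expired."""
    if timeout == 0:
        return True
    return timed_out_naive(now, cached_at, timeout)


if __name__ == "__main__":
    # Same tick as the cache fill, timeout 0 (models actimeo=0):
    print(timed_out_naive(100, 100, 0))  # False: cached attributes are trusted
    print(timed_out_fixed(100, 100, 0))  # True: revalidation is forced
```

In this model, only the fixed variant guarantees that every lookup on a zero-timeout mount goes back to the server.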

Comment 4 Peter Staubach 2008-12-16 19:53:09 UTC

*** Bug 476728 has been marked as a duplicate of this bug. ***

Comment 5 Chuck Lever 2008-12-16 20:04:24 UTC

John did a test matrix yesterday with just the first patch applied.

Both clients running EL4.5 patched kernel & 5 copies of reads.

linux3 running truncates    linux4 running stats    result
------------------------    --------------------    ------
noac                        noac                    pass
actimeo=0                   actimeo=0               fail
noac                        actimeo=0               fail
actimeo=0                   noac                    pass

So we think the second patch is required to completely fix this problem.

Comment 6 Chuck Lever 2008-12-16 20:07:11 UTC

Due to the use of proportional fonts in the "Additional Comments:" box and tabs in the original table, the above matrix is kind of a mess.  Sorry.

Comment 7 Peter Staubach 2008-12-17 15:12:20 UTC

Created attachment 327257 [details]
Proposed patch

Here is the (roughly) equivalent patch as ported to 2.6.9-78.22.

The previous patches were good as a model, but were applicable to a
relatively old version of the RHEL-4 kernel.  I believe that this
new patch should work to address the problems.

Comment 8 RHEL Program Management 2008-12-18 19:28:33 UTC

This request was evaluated by Red Hat Product Management for inclusion in a Red
Hat Enterprise Linux maintenance release.  Product Management has requested
further review of this request by Red Hat Engineering, for potential
inclusion in a Red Hat Enterprise Linux Update release for currently deployed
products.  This request is not yet committed for inclusion in an Update
release.

Comment 9 Vivek Goyal 2009-01-05 14:20:29 UTC

Committed in 78.23.EL. RPMs are available at http://people.redhat.com/vgoyal/rhel4/

Comment 10 John Sobecki 2009-01-05 22:12:17 UTC

Confirmed fixed via the testcase I posted to this BZ.  The load test
has run for over 6 hours so far using actimeo=0 on both systems;
before the patch it would fail within the first 5 minutes.  Thanks, John

Comment 17 Chris Ward 2009-03-27 14:20:31 UTC

~~ Attention Partners! Snap 1 Released ~~
RHEL 4.8 Snapshot 1 has been released on partners.redhat.com. There should
be a fix present, which addresses this bug. NOTE: there is only a short time
left to test, please test and report back results on this bug
at your earliest convenience.

If you encounter any issues, please set the bug back to the ASSIGNED state and
describe the issues you encountered. If you have found a NEW bug, clone this
bug and describe the issues you encountered. Further questions can be
directed to your Red Hat Partner Manager.

If you have VERIFIED the bug fix, please select your PartnerID from the
Verified field above. Please leave a comment with your test result details.
Include which arches were tested, the package version, and any applicable logs.

 - Red Hat QE Partner Management

Comment 18 Chris Ward 2009-04-16 14:13:20 UTC

Confirmed that the patch in comment #7, which was customer verified, is included in the latest -88 build.

Comment 19 John Sobecki 2009-04-16 14:54:03 UTC

I have this kernel downloaded & the nfs testcase is running.  Thanks, John

Comment 20 Chris Ward 2009-04-16 16:08:13 UTC

~~ Attention! Snap 4 Released ~~
RHEL 4.8 Snapshot 4 has been released on partners.redhat.com. There should
be a fix present that addresses this bug. NOTE: there is only a short time
left to test, please test and report back results on this bug ASAP.

The latest kernel build can be obtained here:
http://people.redhat.com/vgoyal/rhel4/

Comment 21 Chris Ward 2009-04-20 09:50:47 UTC

John, and update?

Comment 22 Chris Ward 2009-04-20 09:51:16 UTC

s/and/any/

Comment 23 John Sobecki 2009-04-20 18:14:19 UTC

Status:

Tests are still running, but it appears that the -88 kernel has fixed the
actimeo=0 bugs, based on the testcase we have here at Oracle.

Comment 24 John Sobecki 2009-04-20 21:16:23 UTC

Fix confirmed with the 2.6.9-88.ELsmp kernel on x86_64 in a cluster stress test.

Comment 26 errata-xmlrpc 2009-05-18 19:24:25 UTC

An advisory has been issued which should resolve the problem
described in this bug report. This report is therefore being
closed with a resolution of ERRATA. For more information
on the solution and/or where to find the updated files,
please follow the link below. You may reopen this bug report
if the solution does not work for you.

http://rhn.redhat.com/errata/RHSA-2009-1024.html