Bug 476726
| Summary: | [nfs] actimeo=0 not enforced during ftruncate operations, resulting in database crashes | | |
|---|---|---|---|
| Product: | Red Hat Enterprise Linux 4 | Reporter: | John Sobecki <john.sobecki> |
| Component: | kernel | Assignee: | Peter Staubach <staubach> |
| Status: | CLOSED ERRATA | QA Contact: | Martin Jenner <mjenner> |
| Severity: | high | Docs Contact: | |
| Priority: | high | | |
| Version: | 4.7 | CC: | andriusb, bikash, cevich, chris.mason, chuck.lever, cward, dejohnso, greg.marsden, jlayton, jneedham, john.sobecki, jtluka, juanino, lwang, rwheeler, spurrier, steved, tao |
| Target Milestone: | beta | | |
| Target Release: | 4.8 | | |
| Hardware: | All | | |
| OS: | Linux | | |
| Whiteboard: | | | |
| Fixed In Version: | | Doc Type: | Bug Fix |
| Doc Text: | | Story Points: | --- |
| Clone Of: | | Environment: | |
| Last Closed: | 2009-05-18 19:24:25 UTC | Type: | --- |
| Regression: | --- | Mount Type: | --- |
| Documentation: | --- | CRM: | |
| Verified Versions: | | Category: | --- |
| oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: | |
| Cloudforms Team: | --- | Target Upstream Version: | |
| Embargoed: | | | |
| Bug Depends On: | | | |
| Bug Blocks: | 450897 | | |
| Attachments: |
|
||||||||||||
Created attachment 327152 [details]
patch 1 of 2
Created attachment 327153 [details]
patch 2 of 2
Patch 1 is a backport of Peter Staubach's upstream patch to close a window in the attribute timeout code. This is needed to make sure that nfs_revalidate_inode() always forces a call to __nfs_revalidate_inode() for "actimeo=0" and "noac" mounts.

Patch 2 makes __nfs_revalidate_inode() behave the same way for "actimeo=0" and "noac" mounts. Our database customers use "actimeo=0" with direct NFS I/O as a faster, more reliable replacement for "noac". "noac" should always be a synonym for "actimeo=0,sync".

*** Bug 476728 has been marked as a duplicate of this bug. ***

John ran a test matrix yesterday with just the first patch applied. Both clients were running the patched EL4.5 kernel and 5 copies of reads.

linux3 running truncates   linux4 running stats   result
------------------------   --------------------   ------
noac                       noac                   pass
actimeo=0                  actimeo=0              fail
noac                       actimeo=0              fail
actimeo=0                  noac                   pass

So we think the second patch is required to completely fix this problem.

Due to the use of proportional fonts in the "Additional Comments:" box and tabs in the original table, the above matrix is kind of a mess. Sorry.

Created attachment 327257 [details]
Proposed patch
Here is the (roughly) equivalent patch as ported to 2.6.9-78.22.
The previous patches were good as a model, but were applicable to a
relatively old version of the RHEL-4 kernel. I believe that this
new patch should work to address the problems.
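
The window that patch 1 closes can be pictured with a simplified user-space model. The sketch below is not the kernel code or any of the attached patches: `struct attr_cache`, `expired_strict()` and `expired_fixed()` are hypothetical names, and it assumes the window comes from a strict "later than" comparison against the time the attributes were cached. Under that assumption, a timeout of 0 still leaves the cache looking valid until the clock ticks, so a check made within the same tick would skip revalidation.

```c
#include <stdbool.h>

/* Hypothetical, simplified stand-in for the cached attribute state of one
 * inode. These are not the fields of the kernel's struct nfs_inode. */
struct attr_cache {
    unsigned long stamp;    /* tick at which the attributes were last fetched */
    unsigned long timeout;  /* attribute timeout in ticks; 0 models actimeo=0 */
};

/* Strict comparison (counter wraparound is ignored in this model).
 * With timeout == 0 the cache still looks valid while now == stamp. */
static bool expired_strict(const struct attr_cache *c, unsigned long now)
{
    return now > c->stamp + c->timeout;
}

/* Same check, but a zero timeout is special-cased so that every call
 * reports the cache as expired and forces a revalidation. */
static bool expired_fixed(const struct attr_cache *c, unsigned long now)
{
    if (c->timeout == 0)
        return true;
    return now > c->stamp + c->timeout;
}
```

With `stamp = 100` and `timeout = 0`, `expired_strict()` reports the cache as still valid at tick 100, while `expired_fixed()` reports it expired at every tick, which is the behaviour expected of "actimeo=0" and "noac" mounts.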
This request was evaluated by Red Hat Product Management for inclusion in a Red Hat Enterprise Linux maintenance release. Product Management has requested further review of this request by Red Hat Engineering, for potential inclusion in a Red Hat Enterprise Linux Update release for currently deployed products. This request is not yet committed for inclusion in an Update release.

Committed in 78.23.EL. RPMS are available at http://people.redhat.com/vgoyal/rhel4/

Confirmed fixed via the testcase I posted to this BZ. The load test has run for over 6 hours so far using actimeo=0 on both systems; pre-patch it would fail within the first 5 minutes. Thanks, John

~~ Attention Partners! Snap 1 Released ~~

RHEL 4.8 Snapshot 1 has been released on partners.redhat.com. There should be a fix present, which addresses this bug. NOTE: there is only a short time left to test, please test and report back results on this bug at your earliest convenience.

If you encounter any issues, please set the bug back to the ASSIGNED state and describe the issues you encountered. If you have found a NEW bug, clone this bug and describe the issues you encountered. Further questions can be directed to your Red Hat Partner Manager.

If you have VERIFIED the bug fix, please select your PartnerID from the Verified field above. Please leave a comment with your test results details. Include which arches tested, package version and any applicable logs.

- Red Hat QE Partner Management

Confirmed that the patch in comment #7, which was customer verified, is included in the latest -88 build.

I have this kernel downloaded & the nfs testcase is running. Thanks, John

~~ Attention! Snap 4 Released ~~

RHEL 4.8 Snapshot 4 has been released on partners.redhat.com. There should be a fix present that addresses this bug. NOTE: there is only a short time left to test, please test and report back results on this bug ASAP.
The latest kernel build can be obtained here: http://people.redhat.com/vgoyal/rhel4/

If you encounter any issues, please set the bug back to the ASSIGNED state and describe the issues you encountered. If you have found a NEW bug, clone this bug and describe the issues you encountered. Further questions can be directed to your Red Hat Partner Manager.

If you have VERIFIED the bug fix, please select your PartnerID from the Verified field above. Please leave a comment with your test results details. Include which arches tested, package version and any applicable logs.

John, any update?

Status: Tests are still running, but it appears that the -88 kernel has fixed the actimeo=0 bugs, based on the testcase we have here at Oracle.

Fix confirmed with the 2.6.9-88.ELsmp kernel on x86_64 in a cluster stress test.

An advisory has been issued which should help the problem described in this bug report. This report is therefore being closed with a resolution of ERRATA. For more information on the solution and/or where to find the updated files, please follow the link below. You may reopen this bug report if the solution does not work for you.

http://rhn.redhat.com/errata/RHSA-2009-1024.html |
Created attachment 327151 [details]
Testcase

Description of problem:
During heavy file extend operations on an Oracle RAC database, the instance can crash with an ORA-600 [KCFQUEWR_2]. The root cause of the problem is an inconsistent file size reported across clustered machines using NFSv3 over TCP storage. The problem can be reproduced using NetApp or nfsd-served storage.

Version-Release number of selected component (if applicable):
2.6.9-78

How reproducible:
Run the attached testcase. No database is required.

Steps to Reproduce:
1. Set up two NFS clients, mounted with nointr,actimeo=0
2. Run the testcase per the README
3. Get the error

Actual results:
Same as above.

Expected results:
File size should be consistent across the cluster after ftruncate() operations.

Additional info:
Backported patches from Chuck Lever were tested and attached here.
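
The expected result can be expressed as a small check. The attached testcase is not reproduced here; the following is only a minimal sketch, and `resize_file()` and `observed_size()` are hypothetical helper names. In the failing scenario the two helpers would run on different NFS clients against the same file, with the second client expected to report the size set by the first once ftruncate() has returned.

```c
#include <fcntl.h>
#include <sys/stat.h>
#include <sys/types.h>
#include <unistd.h>

/* Writer side: set the file at `path` to exactly `size` bytes,
 * creating it if needed. Returns 0 on success, -1 on error. */
static int resize_file(const char *path, off_t size)
{
    int fd = open(path, O_RDWR | O_CREAT, 0644);
    if (fd < 0)
        return -1;
    int rc = ftruncate(fd, size);
    if (close(fd) != 0)
        rc = -1;
    return rc;
}

/* Checker side: file size as reported by stat(), or -1 on error.
 * With actimeo=0 this should never return a stale cached size. */
static off_t observed_size(const char *path)
{
    struct stat st;
    if (stat(path, &st) != 0)
        return -1;
    return st.st_size;
}
```

Run on a single host, the two calls trivially agree; the bug only shows up when the stat() is served from another client's attribute cache.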