Bug 510310 - GFS: Commited data evaporates
GFS: Commited data evaporates
Status: CLOSED ERRATA
Product: Red Hat Enterprise Linux 5
Classification: Red Hat
Component: gfs-kmod (Show other bugs)
5.2
All Linux
urgent Severity urgent
: rc
: ---
Assigned To: Robert Peterson
Cluster QE
: ZStream
Depends On:
Blocks: 513445 513446
  Show dependency treegraph
 
Reported: 2009-07-08 12:30 EDT by Issue Tracker
Modified: 2010-10-23 06:38 EDT (History)
11 users (show)

See Also:
Fixed In Version:
Doc Type: Bug Fix
Doc Text:
Story Points: ---
Clone Of:
Environment:
Last Closed: 2009-09-02 07:03:00 EDT
Type: ---
Regression: ---
Mount Type: ---
Documentation: ---
CRM:
Verified Versions:
Category: ---
oVirt Team: ---
RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: ---


Attachments (Terms of Use)
sync_and_fail program that uses jdata (3.95 KB, text/plain)
2009-07-15 14:28 EDT, Robert Peterson
no flags Details
Patch Try #1 (504 bytes, patch)
2009-07-16 16:31 EDT, Robert Peterson
no flags Details | Diff
x86_64 rpm for 5.2.z (142.39 KB, application/octet-stream)
2009-07-20 13:49 EDT, Robert Peterson
no flags Details

  None (edit)
Description Issue Tracker 2009-07-08 12:30:45 EDT
Escalated to Bugzilla from IssueTracker
Comment 1 Issue Tracker 2009-07-08 12:30:48 EDT
Event posted on 07-07-2009 02:37am EDT by asafeks

Description of problem:
We are using GFS cluster of 2 nodes, with RH 5 update 2.
The incident:
1) A file named "kuku" was written and committed on "node A"
2) "node A" was unexpectedly powered off.
3) "node B" could not read the data in "kuku".

notes:
1) the details of the above incident are captured at the attached example.
2) This is a crucial problem to the application that relies on the GFS and RH cluster. 







How reproducible:

Steps to Reproduce:

Actual results:

Expected results:

Additional info:
This event sent from IssueTracker by dejohnso  [Support Engineering Group]
 issue 314926
Comment 2 Issue Tracker 2009-07-08 12:30:49 EDT
Event posted on 07-07-2009 07:58am EDT by asafeks

Steps to Reproduce:

In order to reproduce this problem please use the attachment files. 
please follow the instructions at: "sync_failure.txt" 


This event sent from IssueTracker by dejohnso  [Support Engineering Group]
 issue 314926
Comment 3 Issue Tracker 2009-07-08 12:30:51 EDT
Event posted on 07-08-2009 12:01pm EDT by gcase

SEG Escalation Template

All Issues: Problem Description
---------------------------------------------------
1. Time and date of problem: Ongoing
2. System architecture(s): RHEL5.2 x86_64
3. Provide a clear and concise problem description as it is understood at
the time of escalation. Please be as specific as possible in your
description. Do not use the generic term "hang", as that can mean many
things.
   Observed behavior: Diligent has discovered that data may disappear from
a GFS cluster under certain circumstances. Please see the sync_failure.txt
file in the attached zip archive for a full explanation.
   Desired behavior: Files written with FsFile::sync should be committed
to the GFS filesystem and available even if node that wrote the file goes
down.
4. Specific action requested of SEG: Please determine what's causing this
bug.
5. Is a defect (bug) in the product suspected? yes
   Bugzilla number (if one already exists): None
6. Does a proposed patch exist? no
7. What is the impact to the customer when they experience this problem?
This is especially important for severity one and two issues: This issue
could mean that all GFS filesystems are subject to data loss if the
described scenario occurs.
   
All Issues: Supporting Information
------------------------------------------------------
1. Other actions already taken in working the problem (tech-list posting,
google searches, fulltext search, consultation with another engineer,
etc.): Spoke with josef, confirmed that this is a bug.
   Relevant data found (if any): It's a bug, according to josef's
analysis
2. Attach sosreport. Reproducer attached. Any RHEL5.2 x86_64 cluster can
be used to reproduce this.
3. Attach other supporting data (if any). NOTE: Diligent is restricted to
using RHEL5.2 x86_64 due to their custom software. They cannot upgrade to
5.3 as a solution. They have EUS support, so we can get fixes backported
to them if needed.
4. Provide issue reproduction information, including location and access
of reproducer machine, if available. Reproducer script attached, along
with directions. 
5. Known hot-fix packages on the system: None
6. Customer applied changes from the last 30 days: None




Issue escalated to Support Engineering Group by: gcase.
Internal Status set to 'Waiting on SEG'

This event sent from IssueTracker by dejohnso  [Support Engineering Group]
 issue 314926
Comment 4 Debbie Johnson 2009-07-08 12:34:09 EDT
Created attachment 350965 [details]
Contains duplicator
Comment 6 Ric Wheeler 2009-07-08 14:13:51 EDT
Has this been seen/tested on 5.4 beta?
Comment 9 Robert Peterson 2009-07-09 11:08:43 EDT
I finally got all the perl dependencies resolved and the scenario
setup.  My cluster uses an APC network power switch fencing device,
so I changed the script to use fence_apc rather than fence_wti, and
gave it the appropriate parameters.  On my RHEL5.4+patches system,
here's what happens:

The script runs on roth-01, calls the sync_and_fail program on roth-02,
then calls fence_apc to power off roth-02.  Then it tries to run
check_sync_and_fail (on roth-01).  This understandably hangs because
all GFS IO should be blocked until the roth-02 comes back into the
cluster.  The script does not power roth-02 back on until after the
call to check_sync_and_fail, so the program and script just hang.
This is working as designed.  Here is the bottom of the log file:

[DEBUG] [Thu Jul  9 09:45:51 CDT 2009] - Running command: fence_apc -a roth-apc -n 1:2 -o off -lapc -p apc
[TRACE] [Thu Jul  9 09:45:56 CDT 2009] - Ran Command: "fence_apc -a roth-apc -n 1:2 -o off -lapc -p apc", ERRNO
: 0, STDOUT: Success: Powered OFF, STDERR: 
[INFO] [Thu Jul  9 09:45:56 CDT 2009] - End
[INFO] [Thu Jul  9 09:45:56 CDT 2009] - Start
[DEBUG] [Thu Jul  9 09:45:56 CDT 2009] - Running command: /pt_work/ariel/sync_failure/check_sync_and_fail

If I manually power on roth-02 again and get it to rejoin the cluster,
the GFS operations continue, journal recovery occurs, the
check_sync_and_fail continues and the file is deleted properly.
My script does not continue to run unless I manually power on roth-02
because the script doesn't execute the command until after the
check_sync_and_fail program is run.

I'm not a fencing expert, so I don't know fence_wti.  How is the
scenario different for the customer?  Maybe the fence_wti script is
not really fencing their node-02?  Does GFS continue running on node-01
after the fence-wti power off is performed for node-02?
Do I need to somehow adjust my script because fence_apc behaves
differently from fence_wti?  Are they experiencing split-brain because
they have a two-node cluster whereas mine has three nodes?
Comment 11 Shane Bradley 2009-07-13 13:37:59 EDT
Created attachment 351499 [details]
runTrigger log

This is new log that shows, node doing a sleep on failure and tries again. Then the node is fenced off then fenced on. Loop then waits for node to come online then fails.

--sbradley
Comment 13 Robert Peterson 2009-07-14 17:11:33 EDT
I can recreate the problem now, so I'm investigating the problem.
Comment 14 Robert Peterson 2009-07-15 14:28:24 EDT
Created attachment 353867 [details]
sync_and_fail program that uses jdata

The problem seems to be that a fsync in GFS does not cause the
journal to be flushed to the media, unless the file has the
journaled data (jdata) attribute.

I've analyzed the journals from several recreations of the failure
and they all show some of the writes to the kuku file, but not all
of the writes.  Thus, after the node is fenced and the journal is
replayed, the file is restored to a sane-but earlier version of
the data that reflects the earlier writes but not the last write.

This attachment is a modified version of the sync_and_fail program
that sets the jdata attribute on the output file, kuku.  Using this
version of the program, the runTrigger script has iterated several
times without failure.  (On the original version, the failure would
occur reliably).

What I don't know is whether this is working "as designed", but I
suspect it is.  I don't know the intent of the original designers
of GFS, so I'm not sure yet, and need to do further investigation.
I also suspect that if we somehow changed the code to cause fsync
to force the journal to disk, it will severely impact performance.
Comment 15 Steve Whitehouse 2009-07-15 18:32:14 EDT
The purpose of fsync is to ensure that the data is on disk, so whatever the performance issues, we must ensure that the data is safe. The code seems to fsync both the directory and the file itself. On that basis, and providing the fsync calls returned successfully before the power was removed on the node, then the data should still be there afterwards.
Comment 16 Robert Peterson 2009-07-16 14:19:09 EDT
Comment #15 makes sense to me, so let's call it a "real bug" we
need to fix.

Side note: The problem does not appear in the latest GFS2 code.
I tested a pre-release 5.4 system with GFS2 on the 2.6.18-158
kernel and the normal runTrigger script ran for well over an hour
with no failures.

I tried back-porting the fsync code from GFS2 to GFS, but it did
not solve the problem for GFS.  I suspect it's a case of GFS not
setting the dirty bit when it should.  If I solve that and then
add the back-port I'm hoping it solves the problem.

Also, I'm still sitting on the fix for bug #491369, which is
awaiting the results from a customer test.  I want to test that
next to see if it maintains the dirty bit any better, which will
allow my back-port to work properly.
Comment 17 Robert Peterson 2009-07-16 16:31:46 EDT
Created attachment 354046 [details]
Patch Try #1

This patch seems to fix the problem.  It simply adds a corresponding
check for a stuffed dinode to the GFS fsync code.  With this patch,
the runTrigger test has been running for several iterations now
without failure on my test cluster.  Needs more testing.

According to the comments in this commit to gfs2, a fsync for a
stuffed inode should cause the journal to be flushed, and thus
the problem was fixed in gfs2:

http://git.kernel.org/?p=linux/kernel/git/steve/gfs2-2.6-nmw.git;a=commitdiff;h=33c3de32872ef3c075e4dac04c0de8f86ac39f6f
Comment 18 Robert Peterson 2009-07-17 09:47:17 EDT
With this patch the runTrigger script ran successfully for ten
hours last night until it was killed because my vpn dropped.
Changing status to Assigned.  Requesting flags for inclusion into
5.5.
Comment 20 Robert Peterson 2009-07-20 12:15:38 EDT
Will they accept an rpm, and if so, what arch do they need?
Comment 22 Robert Peterson 2009-07-20 13:49:42 EDT
Created attachment 354374 [details]
x86_64 rpm for 5.2.z
Comment 23 Nate Straz 2009-07-21 11:17:44 EDT
Trying to pull this back into 5.4 as per GFS meeting.
Comment 26 Robert Peterson 2009-07-22 13:51:07 EDT
I pushed the patch to the master branch of the gfs1-utils git
tree, and the STABLE2 and STABLE3 branches of the cluster git tree.
I'll wait on RHEL5, RHEL54, RHEL53, and RHEL52 branches until I get
some ack flags and the necessary paperwork is done.  Here are the
commit messages:

[master.510310 a5b702e] GFS: Commited data evaporates
 1 files changed, 2 insertions(+), 0 deletions(-)
[STABLE3.510310 be5f9ef] GFS: Commited data evaporates
 1 files changed, 2 insertions(+), 0 deletions(-)
[STABLE2.510310 16f27b5] GFS: Commited data evaporates
 1 files changed, 2 insertions(+), 0 deletions(-)

Still pending:

[RHEL5.510310 344ede9] GFS: Commited data evaporates
 1 files changed, 2 insertions(+), 0 deletions(-)
[RHEL54.510310 aa0b70b] GFS: Commited data evaporates
 1 files changed, 2 insertions(+), 0 deletions(-)
[RHEL53.510310 f19cc83] GFS: Commited data evaporates
 1 files changed, 2 insertions(+), 0 deletions(-)
[RHEL52.510310 d52f4c7] GFS: Commited data evaporates
 1 files changed, 2 insertions(+), 0 deletions(-)
Comment 28 Robert Peterson 2009-07-23 11:33:03 EDT
Fix was pushed to RHEL5, RHEL54, RHEL53 and RHEL52 branches of
the cluster git tree for inclusion into 5.4 (and up), plus 5.3.z
and 5.2.z.  Changing status to Modified.
Comment 33 errata-xmlrpc 2009-09-02 07:03:00 EDT
An advisory has been issued which should help the problem
described in this bug report. This report is therefore being
closed with a resolution of ERRATA. For more information
on therefore solution and/or where to find the updated files,
please follow the link below. You may reopen this bug report
if the solution does not work for you.

http://rhn.redhat.com/errata/RHBA-2009-1338.html

Note You need to log in before you can comment on or make changes to this bug.