Escalated to Bugzilla from IssueTracker
Event posted on 07-07-2009 02:37am EDT by asafeks

Description of problem:
We are using a two-node GFS cluster on RHEL 5 update 2. The incident:
1) A file named "kuku" was written and committed on "node A".
2) "node A" was unexpectedly powered off.
3) "node B" could not read the data in "kuku".

Notes:
1) The details of the incident above are captured in the attached example.
2) This is a crucial problem for the application that relies on GFS and the RH cluster.

How reproducible:
Steps to Reproduce:
Actual results:
Expected results:
Additional info:

This event sent from IssueTracker by dejohnso [Support Engineering Group] issue 314926
Event posted on 07-07-2009 07:58am EDT by asafeks

Steps to Reproduce:
To reproduce this problem, please use the attached files and follow the instructions in "sync_failure.txt".

This event sent from IssueTracker by dejohnso [Support Engineering Group] issue 314926
Event posted on 07-08-2009 12:01pm EDT by gcase

SEG Escalation Template

All Issues: Problem Description
---------------------------------------------------
1. Time and date of problem: Ongoing

2. System architecture(s): RHEL5.2 x86_64

3. Provide a clear and concise problem description as it is understood at the time of escalation. Please be as specific as possible in your description. Do not use the generic term "hang", as that can mean many things.
   Observed behavior: Diligent has discovered that data may disappear from a GFS cluster under certain circumstances. Please see the sync_failure.txt file in the attached zip archive for a full explanation.
   Desired behavior: Files written with FsFile::sync should be committed to the GFS filesystem and remain available even if the node that wrote the file goes down.

4. Specific action requested of SEG: Please determine what is causing this bug.

5. Is a defect (bug) in the product suspected? Yes. Bugzilla number (if one already exists): None

6. Does a proposed patch exist? No

7. What is the impact to the customer when they experience this problem? This is especially important for severity one and two issues: This issue could mean that all GFS filesystems are subject to data loss if the described scenario occurs.

All Issues: Supporting Information
------------------------------------------------------
1. Other actions already taken in working the problem (tech-list posting, google searches, fulltext search, consultation with another engineer, etc.): Spoke with josef, who confirmed that this is a bug. Relevant data found (if any): It is a bug, according to josef's analysis.

2. Attach sosreport. Reproducer attached. Any RHEL5.2 x86_64 cluster can be used to reproduce this.

3. Attach other supporting data (if any). NOTE: Diligent is restricted to using RHEL5.2 x86_64 due to their custom software. They cannot upgrade to 5.3 as a solution. They have EUS support, so we can get fixes backported to them if needed.

4. Provide issue reproduction information, including location and access of reproducer machine, if available. Reproducer script attached, along with directions.

5. Known hot-fix packages on the system: None

6. Customer applied changes from the last 30 days: None

Issue escalated to Support Engineering Group by: gcase. Internal Status set to 'Waiting on SEG'

This event sent from IssueTracker by dejohnso [Support Engineering Group] issue 314926
Created attachment 350965 [details] Contains duplicator
Has this been seen/tested on 5.4 beta?
I finally got all the perl dependencies resolved and the scenario set up. My cluster uses an APC network power switch fencing device, so I changed the script to use fence_apc rather than fence_wti and gave it the appropriate parameters.

On my RHEL5.4+patches system, here is what happens: the script runs on roth-01, calls the sync_and_fail program on roth-02, then calls fence_apc to power off roth-02. It then tries to run check_sync_and_fail (on roth-01). This understandably hangs, because all GFS I/O should be blocked until roth-02 comes back into the cluster. The script does not power roth-02 back on until after the call to check_sync_and_fail, so the program and script just hang. This is working as designed.

Here is the bottom of the log file:

[DEBUG] [Thu Jul 9 09:45:51 CDT 2009] - Running command: fence_apc -a roth-apc -n 1:2 -o off -lapc -p apc
[TRACE] [Thu Jul 9 09:45:56 CDT 2009] - Ran Command: "fence_apc -a roth-apc -n 1:2 -o off -lapc -p apc", ERRNO : 0, STDOUT: Success: Powered OFF, STDERR:
[INFO] [Thu Jul 9 09:45:56 CDT 2009] - End
[INFO] [Thu Jul 9 09:45:56 CDT 2009] - Start
[DEBUG] [Thu Jul 9 09:45:56 CDT 2009] - Running command: /pt_work/ariel/sync_failure/check_sync_and_fail

If I manually power on roth-02 again and get it to rejoin the cluster, GFS operations continue, journal recovery occurs, check_sync_and_fail continues, and the file is deleted properly. My script does not continue to run unless I manually power on roth-02, because the script doesn't execute that command until after the check_sync_and_fail program has run.

I'm not a fencing expert, so I don't know fence_wti. How is the scenario different for the customer? Maybe the fence_wti script is not really fencing their node-02? Does GFS continue running on node-01 after the fence_wti power-off is performed for node-02? Do I need to somehow adjust my script because fence_apc behaves differently from fence_wti? Are they experiencing split-brain because they have a two-node cluster whereas mine has three nodes?
Created attachment 351499 [details] runTrigger log

This is a new log showing the node sleeping on failure and trying again. The node is then fenced off and fenced back on. The loop then waits for the node to come online, and then fails. --sbradley
I can recreate the problem now, so I'm investigating it.
Created attachment 353867 [details] sync_and_fail program that uses jdata

The problem seems to be that an fsync in GFS does not cause the journal to be flushed to the media unless the file has the journaled-data (jdata) attribute. I've analyzed the journals from several recreations of the failure, and they all show some of the writes to the kuku file, but not all of them. Thus, after the node is fenced and the journal is replayed, the file is restored to a sane but earlier version of the data that reflects the earlier writes but not the last write.

This attachment is a modified version of the sync_and_fail program that sets the jdata attribute on the output file, kuku. Using this version of the program, the runTrigger script has iterated several times without failure. (With the original version, the failure would occur reliably.)

What I don't know is whether this is working "as designed", but I suspect it is. I don't know the intent of the original designers of GFS, so I'm not sure yet and need to investigate further. I also suspect that if we changed the code to make fsync force the journal to disk, it would severely impact performance.
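For readers who want to see what "setting the jdata attribute" looks like in code, here is a minimal sketch using the generic Linux inode-flags ioctl. This is purely illustrative and is not the code from the attached sync_and_fail program: whether GFS on RHEL 5.2 honors FS_IOC_SETFLAGS (as opposed to requiring "gfs_tool setflag jdata <file>" or a GFS-specific ioctl) is an assumption.

/*
 * Hypothetical illustration: create an empty file and request the
 * journaled-data (jdata / chattr +j) flag via the generic flags ioctl.
 * Assumption: the filesystem accepts FS_IOC_SETFLAGS for this flag; GFS1
 * may instead require gfs_tool or its own ioctl.
 */
#include <fcntl.h>
#include <stdio.h>
#include <sys/ioctl.h>
#include <linux/fs.h>

int open_with_jdata(const char *path)
{
	int fd = open(path, O_CREAT | O_TRUNC | O_WRONLY, 0644);
	int flags;

	if (fd < 0)
		return -1;

	if (ioctl(fd, FS_IOC_GETFLAGS, &flags) == 0) {
		flags |= FS_JOURNAL_DATA_FL;	/* journal file data, not just metadata */
		if (ioctl(fd, FS_IOC_SETFLAGS, &flags) != 0)
			perror("FS_IOC_SETFLAGS");
	}
	return fd;	/* caller writes and fsyncs as before */
}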
The purpose of fsync is to ensure that the data is on disk, so whatever the performance cost, we must ensure that the data is safe. The code seems to fsync both the directory and the file itself. On that basis, provided the fsync calls returned successfully before power was removed from the node, the data should still be there afterwards.
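For context, this is the pattern being discussed, sketched in plain POSIX C (the file and directory names are illustrative, not taken from the attached reproducer): once both fsync calls return successfully, the file contents and its directory entry are expected to survive an immediate power loss.

/*
 * Minimal sketch of a durable write: write the data, fsync the file,
 * then fsync the containing directory so the new entry is also on disk.
 */
#define _GNU_SOURCE
#include <fcntl.h>
#include <unistd.h>

int durable_write(const char *dir, const char *path, const char *buf, size_t len)
{
	int fd = open(path, O_CREAT | O_TRUNC | O_WRONLY, 0644);
	if (fd < 0)
		return -1;
	if (write(fd, buf, len) != (ssize_t)len || fsync(fd) != 0) {	/* flush file data and metadata */
		close(fd);
		return -1;
	}
	close(fd);

	int dfd = open(dir, O_RDONLY | O_DIRECTORY);	/* also sync the directory entry */
	if (dfd < 0)
		return -1;
	int ret = fsync(dfd);
	close(dfd);
	return ret;
}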
Comment #15 makes sense to me, so let's call it a "real bug" we need to fix.

Side note: the problem does not appear in the latest GFS2 code. I tested a pre-release 5.4 system with GFS2 on the 2.6.18-158 kernel, and the normal runTrigger script ran for well over an hour with no failures. I tried back-porting the fsync code from GFS2 to GFS, but it did not solve the problem for GFS. I suspect it's a case of GFS not setting the dirty bit when it should. If I solve that and then add the back-port, I'm hoping it will solve the problem.

Also, I'm still sitting on the fix for bug #491369, which is awaiting the results of a customer test. I want to test that next to see if it maintains the dirty bit any better, which would allow my back-port to work properly.
Created attachment 354046 [details] Patch Try #1

This patch seems to fix the problem. It simply adds a corresponding check for a stuffed dinode to the GFS fsync code. With this patch, the runTrigger test has now run for several iterations without failure on my test cluster. It needs more testing.

According to the comments in this commit to gfs2, an fsync of a stuffed inode should cause the journal to be flushed, and thus the problem was fixed in gfs2:
http://git.kernel.org/?p=linux/kernel/git/steve/gfs2-2.6-nmw.git;a=commitdiff;h=33c3de32872ef3c075e4dac04c0de8f86ac39f6f
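For readers without the patch in front of them, here is a simplified kernel-side sketch of the idea. It uses the GFS2-style names from the commit referenced above rather than the actual GFS symbols, so treat the identifiers as illustrative and not as the literal patch: when a dinode is stuffed (its data lives inside the inode block itself), writing back the page cache is not enough, because that block only reaches the media through the journal, so fsync must also force a log flush for the inode's glock.

/*
 * Simplified sketch, not verbatim code from either the GFS or GFS2 tree.
 * The "added check" is the gfs2_is_stuffed() test before the log flush.
 */
static int gfs_fsync_sketch(struct file *file, struct dentry *dentry, int datasync)
{
	struct inode *inode = dentry->d_inode;
	struct gfs2_inode *ip = GFS2_I(inode);	/* GFS2 names used for illustration */
	int error = 0;

	if (inode->i_state & (I_DIRTY_SYNC | I_DIRTY_DATASYNC)) {
		if (!datasync)
			error = write_inode_now(inode, 0);	/* write back cached inode/data */

		if (gfs2_is_stuffed(ip))			/* the added check */
			gfs2_log_flush(GFS2_SB(inode), ip->i_gl);	/* commit the journal to disk */
	}

	return error;
}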
With this patch, the runTrigger script ran successfully for ten hours last night, until it was killed because my VPN dropped. Changing status to Assigned. Requesting flags for inclusion in 5.5.
Will they accept an rpm, and if so, what arch do they need?
Created attachment 354374 [details] x86_64 rpm for 5.2.z
Trying to pull this back into 5.4 as per GFS meeting.
I pushed the patch to the master branch of the gfs1-utils git tree, and to the STABLE2 and STABLE3 branches of the cluster git tree. I'll wait on the RHEL5, RHEL54, RHEL53, and RHEL52 branches until I get some ack flags and the necessary paperwork is done. Here are the commit messages:

[master.510310 a5b702e] GFS: Commited data evaporates
 1 files changed, 2 insertions(+), 0 deletions(-)
[STABLE3.510310 be5f9ef] GFS: Commited data evaporates
 1 files changed, 2 insertions(+), 0 deletions(-)
[STABLE2.510310 16f27b5] GFS: Commited data evaporates
 1 files changed, 2 insertions(+), 0 deletions(-)

Still pending:

[RHEL5.510310 344ede9] GFS: Commited data evaporates
 1 files changed, 2 insertions(+), 0 deletions(-)
[RHEL54.510310 aa0b70b] GFS: Commited data evaporates
 1 files changed, 2 insertions(+), 0 deletions(-)
[RHEL53.510310 f19cc83] GFS: Commited data evaporates
 1 files changed, 2 insertions(+), 0 deletions(-)
[RHEL52.510310 d52f4c7] GFS: Commited data evaporates
 1 files changed, 2 insertions(+), 0 deletions(-)
Fix was pushed to RHEL5, RHEL54, RHEL53 and RHEL52 branches of the cluster git tree for inclusion into 5.4 (and up), plus 5.3.z and 5.2.z. Changing status to Modified.
An advisory has been issued which should help the problem described in this bug report. This report is therefore being closed with a resolution of ERRATA. For more information on the solution and/or where to find the updated files, please follow the link below. You may reopen this bug report if the solution does not work for you.

http://rhn.redhat.com/errata/RHBA-2009-1338.html