Bug 845233

Summary:	XFS regularly truncating files after crash/reboot
Product:	Red Hat Enterprise Linux 6	Reporter:	Daire Byrne <daire.byrne>
Component:	kernel	Assignee:	Dave Chinner <dchinner>
Status:	CLOSED ERRATA	QA Contact:	Boris Ranto <branto>
Severity:	high	Docs Contact:
Priority:	urgent
Version:	6.2	CC:	cww, dchinner, dhoward, eguan, esandeen, fhirtz, fs-qe, jamesb, kernel-eus-qe, ksquizza, kzhang, npajkovs, pasteur, pds, rwheeler, yowang
Target Milestone:	rc	Keywords:	ZStream
Target Release:	---
Hardware:	x86_64
OS:	Linux
Whiteboard:
Fixed In Version:	kernel-2.6.32-328.el6	Doc Type:	Bug Fix
Doc Text:		Story Points:	---
Clone Of:		Environment:
Last Closed:	2013-02-21 06:44:51 UTC	Type:	Bug
Regression:	---	Mount Type:	---
Documentation:	---	CRM:
Verified Versions:		Category:	---
oVirt Team:	---	RHEL 7.3 requirements from Atomic Host:
Cloudforms Team:	---	Target Upstream Version:
Embargoed:
Bug Depends On:
Bug Blocks:	846704, 960437, 960438

Description Daire Byrne 2012-08-02 11:39:51 UTC

Description of problem:

We have observed that files can be truncated (0 bytes for small files) on an XFS filesystem after a power cycle or crash. It seems that data is perhaps not being flushed to disk often enough. A search turned up this patch, which seems to describe the issue we are experiencing:

http://oss.sgi.com/archives/xfs/2012-06/msg00350.html

Are there plans to commit this patch to the EL6 kernel? Are there any other workarounds we can try in the meantime?

Version-Release number of selected component (if applicable):
kernel-2.6.32-220.17.1.el6.x86_64

How reproducible:
Not sure how to reproduce every time - it has happened numerous times in the last couple of weeks on our EL6 file server.

Steps to Reproduce:
1. Power cycle server not long after "writing" some small files (e.g. a source tree)
  
Actual results:
Files being truncated after log replay

Expected results:
File data should be committed to disk.

Additional info:

Comment 2 Ric Wheeler 2012-08-02 13:44:58 UTC

Hi Daire,

Please open up a support ticket with RH so our support staff can help gather the needed information.

In general, it is the application's duty to use fsync() or fdatasync() when it wants to have data persist over a power failure.
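
As an illustration of the fsync() advice above, here is a minimal sketch of an application-side durable write. It is not taken from this bug report: the helper name durable_write is hypothetical, and the write-to-temporary-file-then-rename pattern is a common convention assumed here, not something the reporters said they use.

```python
import os
import tempfile


def durable_write(path, data):
    """Write bytes to path so the data survives a crash once this returns.

    Hypothetical helper for illustration only.
    """
    directory = os.path.dirname(os.path.abspath(path))
    # Write to a temporary file in the same directory so the final
    # rename stays within one filesystem.
    fd, tmp_path = tempfile.mkstemp(dir=directory)
    try:
        with os.fdopen(fd, "wb") as f:
            f.write(data)
            f.flush()             # push Python's buffer to the kernel
            os.fsync(f.fileno())  # ask the kernel to write file data to disk
        os.replace(tmp_path, path)  # atomically move the new file into place
    except BaseException:
        # Remove the temporary file if anything failed before the rename.
        try:
            os.unlink(tmp_path)
        except FileNotFoundError:
            pass
        raise
    # fsync the directory so the rename itself is persisted.
    dir_fd = os.open(directory, os.O_RDONLY)
    try:
        os.fsync(dir_fd)
    finally:
        os.close(dir_fd)
```

Without the os.fsync() call, the data may still be sitting in the page cache when power is lost. The fix tracked in this bug concerns files that were truncated even though they had been written long before the crash, so a sketch like this covers only the application's side of the contract.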

Comment 7 Ric Wheeler 2012-08-03 11:03:43 UTC

Hi Daire,

We are looking to pull some upstream fixes back into RHEL6. This BZ will get updated with the details once that happens.

Thanks for the report!

Comment 8 James Braid 2012-08-15 22:59:52 UTC

We appear to be seeing files get truncated (zero bytes) after rebooting even when the files were only opened read-only. This is a bit concerning...

This thread describes the same thing as we're seeing:

http://thread.gmane.org/gmane.comp.file-systems.xfs.general/45648

Comment 9 Ric Wheeler 2012-08-15 23:20:28 UTC

James, if you see that behavior (which I have *never* seen or heard of), please open a ticket with Red Hat support so we can debug with you.

That specific thread you reference was not on RHEL (CentOS) and the SGI engineer and reporter saw it only on a specific machine/hardware type.

Thanks!

Comment 10 RHEL Program Management 2012-09-20 21:11:01 UTC

This request was evaluated by Red Hat Product Management for
inclusion in a Red Hat Enterprise Linux release.  Product
Management has requested further review of this request by
Red Hat Engineering, for potential inclusion in a Red Hat
Enterprise Linux release for currently deployed products.
This request is not yet committed for inclusion in a release.

Comment 11 Dave Chinner 2012-09-25 02:16:58 UTC

Posted to rhkernel-list:

http://post-office.corp.redhat.com/archives/rhkernel-list/2012-September/msg02828.html

Comment 12 Dave Chinner 2012-09-25 02:22:03 UTC

*** Bug 835623 has been marked as a duplicate of this bug. ***

Comment 14 Jarod Wilson 2012-10-10 20:08:56 UTC

Patch(es) available on kernel-2.6.32-328.el6

Comment 17 Paul Smith 2012-11-08 23:34:07 UTC

Hi Jarod; I'm confused by this issue.  We are seeing a serious 0-length file problem on XFS partitions after a system crash.  These are files which were written to the disk over 18 hours before the crash and not modified since (they were programs, not data files etc.)  I've been doing an hourly scan for 0-length files and in one case after the crash I found 379 new 0-length files on the system, compared to the scan before the crash!

We're running RHEL 6.2.  I found, in the release notes for RHEL 6.3, a reference to Bug 856686 which seems like it might be our problem.  However I can't see that bug as it's apparently marked private, so I can't be sure.  The dup bug 835623 here is also private.

Now I find this bug, which also sounds similar and is marked as available in 2.6.32-328, which I guess will be the kernel for RHEL 6.4?

Is there any possibility of a backport of this fix to the current RHEL 6.3 (at least)?  I don't have access to the rhkernel-list link above so I'm not sure how much work the fix would be.

I'm wondering if XFS is simply not reliable for use in currently-released versions of Red Hat EL, and I should avoid it.  Unfortunately we do a lot of formatting of very large partitions and switching back to ext4, with the orders of magnitude longer format times, would be very painful.

Comment 18 Dave Chinner 2012-11-09 00:01:32 UTC

Hi Paul,

Perhaps you should have contacted RH support as soon as you started seeing data loss problems rather than working around them. As it is, you're going to be looking for the fix to 856685, which has been available for RHEL6.2 since this errata was released:

http://rhn.redhat.com/errata/RHSA-2012-1401.html

It was also fixed in 6.3 at the same time.

This bug was never triaged as the reporter never followed up, and so was used to close off the last known, quite rare recovery problem (reported maybe 5 times in the past 5 years!) that was solved upstream that could have resulted in zero length files. So I think the above errata kernel is what you want. If it doesn't fix your problems, then please go through the usual channels to get a new bug opened.

-Dave.

Comment 19 Paul Smith 2012-11-09 00:20:53 UTC

Thanks.  I haven't tried any workarounds; I was gathering tracking data with a simple cron.hourly job that searches for 0-length files.  I've only just started seriously looking into this problem, and only today did I discover it was related to XFS and system crashes (the nodes are remote and headless and I didn't realize they were crashing in the first place; we would just notice that some files were 0 length and had no idea when or how it happened).  Luckily we're still in development so no customer data lost!

I'll take a look at that errata.  Cheers!
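
The kind of hourly scan described above can be sketched as follows. This is an assumption about how such a check might look, not the actual cron.hourly job used here; the function names are hypothetical.

```python
import os
import stat


def find_zero_length_files(root):
    """Return the set of regular files under root whose size is 0 bytes."""
    empty = set()
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            path = os.path.join(dirpath, name)
            try:
                st = os.lstat(path)  # lstat: do not follow symlinks
            except OSError:
                continue  # file vanished or is unreadable; skip it
            if stat.S_ISREG(st.st_mode) and st.st_size == 0:
                empty.add(path)
    return empty


def new_zero_length_files(previous, current):
    """Return files that are empty now but were not in the previous scan."""
    return sorted(current - previous)
```

Saving each scan's result and diffing it against the previous one shows how many files became 0-length between two scans, for example across a crash.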

Comment 22 errata-xmlrpc 2013-02-21 06:44:51 UTC

Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

http://rhn.redhat.com/errata/RHSA-2013-0496.html

Comment 27 Frank Hirtz 2013-05-09 21:36:33 UTC

*** Bug 960641 has been marked as a duplicate of this bug. ***