Description of problem:
We have observed that files can be truncated (0 bytes for small files) on an XFS filesystem after a power cycle or crash. It seems like perhaps data is not being flushed to disk often enough. A search brought this patch to my attention which seems to describe the issue we are experiencing:
Are there plans to commit this patch to the EL6 kernel? Are there any other workarounds we can try in the meantime?
Version-Release number of selected component (if applicable):
How reproducible:
Not sure how to reproduce every time - it has happened numerous times in the last couple of weeks on our EL6 file server.
Steps to Reproduce:
1. Power cycle server not long after "writing" some small files (e.g. a source tree)
Actual results:
Files being truncated after log replay
Expected results:
File data should be committed to disk.
Please open up a support ticket with RH so our support staff can help gather the needed information.
In general, it is the application's duty to use fsync() or fdatasync() when it wants data to persist across a power failure.
We are looking to pull some upstream fixes back into RHEL6. This BZ will get updated with the details once that happens.
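For illustration, here is a minimal sketch of what that can look like from the application side, written in Python. The helper name durable_write is made up for this example, and the write-to-temp-then-rename pattern is one common approach, not something prescribed in this report. It assumes a POSIX system where a directory can be opened and fsync'ed.

```python
import os
import tempfile


def durable_write(path, data):
    """Write bytes to path so the contents should survive a crash (sketch).

    Hypothetical helper for illustration only; not part of any library.
    """
    directory = os.path.dirname(os.path.abspath(path))
    # Create the temporary file in the same directory so the rename
    # stays within one filesystem and is therefore atomic.
    fd, tmp_path = tempfile.mkstemp(dir=directory)
    try:
        with os.fdopen(fd, "wb") as f:
            f.write(data)
            f.flush()             # push Python's userspace buffer to the kernel
            os.fsync(f.fileno())  # ask the kernel to flush file data to storage
        os.replace(tmp_path, path)  # atomically swap the new file into place
    except BaseException:
        # Clean up the temporary file if anything failed before the rename.
        if os.path.exists(tmp_path):
            os.unlink(tmp_path)
        raise
    # fsync the containing directory so the rename itself is persisted.
    dir_fd = os.open(directory, os.O_RDONLY)
    try:
        os.fsync(dir_fd)
    finally:
        os.close(dir_fd)
```

Where only the file contents matter and not metadata such as timestamps, os.fdatasync() could be used in place of os.fsync() on the file descriptor.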
Thanks for the report!
We appear to be seeing files get truncated (zero bytes) after rebooting even when the files were only opened read-only. This is a bit concerning...
This thread describes the same thing as we're seeing:
James, if you see that behavior (which I have *never* seen or heard of), please open a ticket with Red Hat support so we can debug with you.
That specific thread you reference was not on RHEL (CentOS), and the SGI engineer and reporter saw it only on a specific machine/hardware type.
This request was evaluated by Red Hat Product Management for
inclusion in a Red Hat Enterprise Linux release. Product
Management has requested further review of this request by
Red Hat Engineering, for potential inclusion in a Red Hat
Enterprise Linux release for currently deployed products.
This request is not yet committed for inclusion in a release.
Posted to rhkernel-list:
*** Bug 835623 has been marked as a duplicate of this bug. ***
Patch(es) available on kernel-2.6.32-328.el6
Hi Jarod; I'm confused by this issue. We are seeing a serious 0-length file problem on XFS partitions after a system crash. These are files which were written to the disk over 18 hours before the crash and not modified since (they were programs, not data files etc.) I've been doing an hourly scan for 0-length files and in one case after the crash I found 379 new 0-length files on the system, compared to the scan before the crash!
We're running RHEL 6.2. I found, in the release notes for RHEL 6.3, a reference to Bug 856686 which seems like it might be our problem. However I can't see that bug as it's apparently marked private, so I can't be sure. The dup bug 835623 here is also private.
Now I find this bug, which also sounds similar and is marked as available in 2.6.32-328 which I guess will be the kernel for RHEL 6.4?
Is there any possibility of a backport of this fix to the current RHEL 6.3 (at least)? I don't have access to the rhkernel-list link above so I'm not sure how much work the fix would be.
I'm wondering if XFS is simply not reliable for use in currently-released versions of Red Hat EL, and I should avoid it. Unfortunately we do a lot of formatting of very large partitions and switching back to ext4, with the orders of magnitude longer format times, would be very painful.
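The hourly scan for 0-length files mentioned above could be sketched roughly as follows, in Python. This is not the actual script used; the function names find_empty_files and new_empty_files are hypothetical, and how the previous scan's result is stored between runs is left out.

```python
import os
import stat


def find_empty_files(root):
    """Return the set of regular files under root whose size is 0 bytes."""
    empty = set()
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            path = os.path.join(dirpath, name)
            try:
                st = os.lstat(path)  # lstat: do not follow symlinks
            except OSError:
                continue  # file vanished or is unreadable; skip it
            if stat.S_ISREG(st.st_mode) and st.st_size == 0:
                empty.add(path)
    return empty


def new_empty_files(previous, current):
    """Return paths that are empty now but were not in the previous scan."""
    return sorted(current - previous)
```

Comparing each scan against the one before it shows which files became empty in between, which is how a jump in 0-length files after a crash would stand out.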
Perhaps you should have contacted RH support as soon as you started seeing data loss problems rather than working around them. As it is, you're going to be looking for the fix to 856685, which has been available for RHEL6.2 since this errata was released:
It was also fixed in 6.3 at the same time.
This bug was never triaged as the reporter never followed up, and so was used to close off the last known, quite rare recovery problem (reported maybe 5 times in the past 5 years!) that was solved upstream that could have resulted in zero length files. So I think the above errata kernel is what you want. If it doesn't fix your problems, then please go through the usual channels to get a new bug opened.
Thanks. I haven't tried any workarounds, I was obtaining tracking data with a simple cron.hourly job to search for 0-length files; I've just started seriously looking into this problem and only today did I discover it was related to XFS and system crashes (the nodes are remote and headless and I didn't realize they were crashing in the first place--we would just notice that some files were 0 length and we had no idea when or how it happened). Luckily we're still in development so no customer data lost!
I'll take a look at that errata. Cheers!
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.
For information on the advisory, and where to find the updated
files, follow the link below.
If the solution does not work for you, open a new bug report.
*** Bug 960641 has been marked as a duplicate of this bug. ***