Bug 592863

Summary: GFS: data lost after hard restart of cluster node
Product: [Retired] Red Hat Cluster Suite
Component: gfs
Version: 4
Hardware: i386
OS: Linux
Status: CLOSED INSUFFICIENT_DATA
Severity: medium
Priority: low
Reporter: Krzysztof Kopec <uniks>
Assignee: Robert Peterson <rpeterso>
QA Contact: Cluster QE <mspqa-list>
CC: edamato, swhiteho
Doc Type: Bug Fix
Last Closed: 2010-07-12 13:34:29 UTC

Description Krzysztof Kopec 2010-05-17 08:55:07 UTC
Description of problem:
We have a 2-node cluster with an ftp server running on each node. Both ftp servers use one GFS resource. A few days ago we had to restart one of the nodes and unfortunately had to power-cycle the machine instead of performing a clean restart.
The machine was restarted during an active ftp transfer (a user had successfully uploaded a file under a temporary name and was performing the rename/move operation) and as a result the uploaded file was lost. In fact we have a file in the proper directory with the proper name and size, but its content is totally wrong. It looks like the inode is pointing to the wrong data blocks: in the content of this file we can see some data from another file received about an hour earlier, plus binary garbage.

The question is: how could this happen?
Has anyone else faced this problem?
What should we do to avoid such a situation in the future?

Version-Release number of selected component (if applicable):
tefse-pro2 ~ $ uname -a
Linux tefse-pro2 2.6.9-78.ELsmp #1 SMP Thu Jul 24 21:03:01 CDT 2008 i686 i686 i386 GNU/Linux

tefse-pro2 ~ $ rpm -qa | egrep "magma|css|manager|GFS|cman|dlm|cluster"
cman-kernel-smp-2.6.9-55.13
cman-devel-1.0.24-1
dlm-1.0.7-1
system-config-cluster-1.0.54-2.0
GFS-6.1.18-1
cmanic-7.6.0-5.rhel4
magma-1.0.8-1
cman-kernel-2.6.9-55.13
cman-1.0.24-1
dlm-kernel-smp-2.6.9-54.11
dlm-kernheaders-2.6.9-54.11
dlm-devel-1.0.7-1
magma-plugins-1.0.15-1
GFS-kernel-2.6.9-80.9
GFS-kernel-smp-2.6.9-80.9
rgmanager-1.9.87-1
magma-devel-1.0.8-1
cman-kernheaders-2.6.9-55.13
dlm-kernel-2.6.9-54.11
GFS-kernheaders-2.6.9-80.9

How reproducible:
unknown

Steps to Reproduce:
1. unknown
  
Actual results:
After the restart the node joined the cluster without problems.
In the server logs:
[...]
May 11 19:47:54 tefse-pro2 kernel: GFS: fsid=tefcl-pro:prodataftpd.1: Joined cluster. Now mounting FS...
May 11 19:47:54 tefse-pro2 kernel: GFS: fsid=tefcl-pro:prodataftpd.1: jid=1: Trying to acquire journal lock...
May 11 19:47:54 tefse-pro2 kernel: GFS: fsid=tefcl-pro:prodataftpd.1: jid=1: Looking at journal...
May 11 19:47:54 tefse-pro2 kernel: GFS: fsid=tefcl-pro:prodataftpd.1: jid=1: Done
May 11 19:47:54 tefse-pro2 kernel: GFS: fsid=tefcl-pro:prodataftpd.1: Scanning for log elements...
May 11 19:47:54 tefse-pro2 kernel: GFS: fsid=tefcl-pro:prodataftpd.1: Found 0 unlinked inodes
May 11 19:47:54 tefse-pro2 kernel: GFS: fsid=tefcl-pro:prodataftpd.1: Found quota changes for 0 IDs
May 11 19:47:54 tefse-pro2 kernel: GFS: fsid=tefcl-pro:prodataftpd.1: Done
[...]

Expected results:


Additional info:

Comment 1 Robert Peterson 2010-05-20 19:05:21 UTC
Using GFS doesn't prevent data loss in and of itself.  In theory
the journals should prevent metadata loss to a large degree,
but even that isn't perfect.  I'll explain why, and how to
minimize data loss.

When processes write to the GFS file system, the metadata is
kept in the journals, and that metadata is synced to disk at the
end of every transaction (write operations, renames, creates, etc.).

Files and directories that have the "jdata" attribute will have
their data, as well as their metadata, kept in the GFS journal,
and that provides more assurance of data integrity when nodes fail
(the journal is simply replayed and the transactions are
re-written in place).  However, if the files or directories are
not marked "jdata" only the metadata (file disk inodes, directories,
block numbers, bit allocation information, etc.) is journaled.
So then it all comes down to whether the data landed on disk before
the system went down, which brings me to my next topic: hardware
issues.
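
As an aside, before the hardware discussion: the jdata flags can be
set with gfs_tool.  The paths below are only examples, and gfs
normally only lets you set jdata on a regular file while it is empty
and not open:

  # journal the data of an individual (empty, closed) file
  gfs_tool setflag jdata /mnt/gfs/important.dat

  # make new files created in a directory inherit journaled data
  gfs_tool setflag inherit_jdata /mnt/gfs/critical_dir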

Many modern storage devices have a large memory cache that remembers
blocks written.  Those devices tell the file system, such as GFS,
that blocks are "written" as soon as the data has landed in the
cache, but not actually on the storage media, e.g. disk.
If the storage loses power, it will lose its write cache before
it has a chance to write the actual data blocks to disk.
Some storage devices are better than others at maintaining
data integrity, even when power is lost.  An OEM hard drive
exported through iSCSI or gnbd will lose power suddenly when the
host system loses power.  On the other hand, most SAN storage will
have separate power, so pulling the plug on the nodes won't affect
their data integrity, but if the whole data center or building
loses power, the SAN goes down at the same time, and then it's down
to tricks they play with battery backups and such.
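
As a rough illustration, you can often query a drive's volatile write
cache from Linux and, at a performance cost, disable it (the device
names here are examples, and not every array honors these settings):

  # ATA/IDE disk: show the current write-caching setting, then disable it
  hdparm -W /dev/hda
  hdparm -W0 /dev/hda

  # SCSI disk (if sdparm is installed): query the Write Cache Enable bit
  sdparm --get=WCE /dev/sda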

So there are several things you can do to ensure data integrity:
1. Keep your GFS software up to date by running newer kernels.
2. Use good quality (expensive) storage devices that have
   separate power and/or internal data integrity mechanisms.
3. Use storage devices that are powered separately from the nodes.
4. Use a UPS and/or battery backup system to ensure the storage
   doesn't lose power, even if the nodes do.
5. Set the "journaled data" or "jdata" bits on your most critical
   data (see the verification example after this list).  Obviously,
   that comes with a performance penalty.
6. Use RAID and similar concepts for storage redundancy.
7. Always make backups.
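
To follow up on number 5: as a quick sanity check (again, the path is
only an example), gfs_tool can print a file's disk inode so you can
confirm that the jdata flag actually took:

  # dump the GFS dinode; the flags listed should include "jdata"
  gfs_tool stat /mnt/gfs/important.dat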

I hope this helps.  I'll leave this bug record open for a few
days in case you have questions or specific concerns about GFS
doing something wrong.  I'm setting the NEEDINFO flag in the
meantime.

Comment 3 Robert Peterson 2010-07-12 13:34:29 UTC
We don't have enough information to solve this problem.
Feel free to reopen the bug record if more information is
provided.  Hopefully the information in comment #1 was
sufficient to solve the issue.