Testing with meta.c (attached)

The test program created numerous files using various file open() flags and truncated each to 10 MiB. It then wrote a 512-byte block beginning at offset 0 to each file; after a 10-second delay, the next 512-byte block was written, and so forth. During the testing, both regular file write() and memcpy() to an mmapped file were exercised, as well as combinations of fsync(), fdatasync(), and no file sync. While the test program was running (after 10+ hours), the customer initiated their snapshot + backup process, and the metadata of the various files (in particular, the mtime) was then compared.

The results indicated that the file open() flags had no bearing on whether mtime was updated. Both fsync() and fdatasync() were confirmed to behave as expected: fsync() forced both data write-out and the mtime update, while fdatasync() only caused write-out of the data. However, when the file was written through an mmap and fdatasync() was called but fsync() was not, the file's mtime was never updated; in these cases, it remained unchanged from the file creation time. File contents were correct in all cases.

In comparison, when the test program ran on a local ext4 filesystem (with the default data=ordered), the combination of mmap and calling neither fsync() nor fdatasync() resulted in files whose mtime would periodically lag behind the write time (and the mtime of the other files), but never by more than 20 seconds. (Note: after running 'sync', all metadata was written to disk, and the maximum mtime difference was less than 10 milliseconds.)

It appears that gfs does not update the on-disk mtime on file writes when an mmap() is used, fdatasync() is called, and fsync() is not called.
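The access pattern above can be sketched in Python (illustrative only: the actual test program is C, and the file name is a placeholder; os.fdatasync() assumes Linux):

```python
import mmap
import os
import time

SIZE = 10 * 1024 * 1024   # each test file is truncated to 10 MiB
BLOCK = 512               # one 512-byte block is written per interval

def run_test(path, nblocks, delay=10, sync=os.fdatasync):
    """Write nblocks 512-byte blocks through an mmap, syncing after each.

    `sync` is one of os.fsync, os.fdatasync, or None (no sync at all),
    matching the combinations exercised by the test program.
    """
    fd = os.open(path, os.O_RDWR | os.O_CREAT, 0o644)
    os.ftruncate(fd, SIZE)
    # Map the whole file; writes go through the mapping (memcpy-style).
    mm = mmap.mmap(fd, SIZE, mmap.MAP_SHARED)
    offset = 0
    for _ in range(nblocks):
        mm[offset:offset + BLOCK] = b"\xaa" * BLOCK
        if sync is not None:
            sync(fd)
        offset += BLOCK
        time.sleep(delay)   # the test used a 10-second delay
    mm.close()
    os.close(fd)
```

With `sync=os.fdatasync` on the affected gfs, the on-disk mtime never moves past the creation time even though the data blocks land on disk.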
Created attachment 864329 [details]
test program used to demonstrate the issue
Well, I'm not sure exactly what the problem is that you are reporting here. Why is it not expected that mtime would update during an mmap write if fsync has not been called? The man page for fdatasync says:

    fdatasync() is similar to fsync(), but does not flush modified metadata
    unless that metadata is needed in order to allow a subsequent data
    retrieval to be correctly handled. For example, changes to st_atime or
    st_mtime (respectively, time of last access and time of last
    modification; see stat(2)) do not require flushing because they are not
    necessary for a subsequent data read to be handled correctly. On the
    other hand, a change to the file size (st_size, as made by say
    ftruncate(2)), would require a metadata flush.

I can't see the attachment in comment #1, as it seems to be corrupt or marked as the wrong type of file.
Hi Steve, https://bugzilla.redhat.com/attachment.cgi?id=864329
*** Bug 1066178 has been marked as a duplicate of this bug. ***
*** Bug 1066179 has been marked as a duplicate of this bug. ***
Thanks for the link in comment #4, however so far as I can tell the behaviour is as expected per the man page information for fdatasync() unless I'm missing something here?
The in-memory mtime is in fact updated. The fact that the in-memory value is never propagated to disk creates a lot of confusion when SAN-level snapshots are taken after the last close(2). I think DRBD would be affected in the same way.
I think it would be rather nice to have `gfs_tool freeze' synchronize the in-memory mtime with the on-disk one. This single change would make me happy.
Ok, so the issue is not fdatasync then; it's not syncing before a freeze. I'll update the bug description accordingly.
This request was evaluated by Red Hat Product Management for inclusion in the current release of Red Hat Enterprise Linux. Because the affected component is not scheduled to be updated in the current release, Red Hat is unable to address this request at this time. Red Hat invites you to ask your support representative to propose this request, if appropriate, in the next release of Red Hat Enterprise Linux.
Thank you for submitting this request for inclusion in Red Hat Enterprise Linux 5. We've carefully evaluated the request, but are unable to include it in the last planned RHEL5 minor release. This Bugzilla will soon be CLOSED as WONTFIX. To request that Red Hat re-consider this request, please re-open the bugzilla via appropriate support channels and provide additional business and/or technical details about its importance to you.
Please do not close this bug. It should not be closed as WONTFIX.
Actually, I don't see any evidence that the mtime is getting updated at all for mmap writes in your test program. The file keeps its creation mtime no matter how long you leave it; unmounting and remounting doesn't update it either. I'm going to do some more digging.

Apparently the Single Unix Specification guarantee is that mtime will be updated at the very least by the next call to msync(), or, if no call to msync() happens, by munmap(). Presumably, even if there isn't an explicit call to munmap(), the mtime should be updated when the file is closed.
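The SUS behaviour described above can be observed from userspace with a short script (the scratch path is a placeholder; run it on the filesystem under test). Note this only shows the in-memory mtime via fstat(); the bug here is that the on-disk copy never follows it:

```python
import mmap
import os
import time

# Hypothetical scratch path; put it on the filesystem you want to check.
path = "/tmp/sus_mtime_check"
fd = os.open(path, os.O_RDWR | os.O_CREAT, 0o644)
os.ftruncate(fd, 4096)
before = os.fstat(fd).st_mtime     # baseline set by ftruncate()

time.sleep(1)                      # make a later timestamp observable
mm = mmap.mmap(fd, 4096, mmap.MAP_SHARED)
mm[0:4] = b"test"                  # dirty the mapping via a store
mm.flush()                         # msync(MS_SYNC): per SUS, mtime must be
                                   # updated no later than this call
after = os.fstat(fd).st_mtime
print(before, after)               # in-memory mtime moves forward; on the
                                   # affected gfs the on-disk value did not
mm.close()
os.close(fd)
```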
Created attachment 878304 [details]
patch to make gfs write out mtime on mmap syncs

This patch modifies gfs_write_inode to make it actually write out the modified vfs inode attributes. This will cause mtime to get updated when the mmap file changes are synced back to disk.
This bit:

    if (ip && sync)
            gfs_log_flush_glock(ip->i_gl, 0);

seems odd, bearing in mind that ip is dereferenced higher up the function. That said, the check seems to be in the existing code; it may still be worth verifying for certain that ip can never be NULL here. We're still in the dev phase for 5.11, so let's ask for blocker status, as this is something that we ought to ensure is fixed.
Created attachment 878667 [details]
Version of the patch that is posted to cluster-devel

This version is just like the former, but without the debug printk and with a sensible check that ip != NULL. I can't see where ip could be NULL when this function is called, but the other gfs_super_ops functions also do this check, so I assume there was at least a worry that it could happen, and I left the check in. 5.11 seems like a bad place to clean up this sort of thing.
Verified against kmod-gfs-0.1.34-20.el5 with the following simple python script:

    from mmap import *
    from time import sleep
    from random import randint

    maxsize = 2**25
    f = open("mmapfile", "r+b")
    mm = mmap(f.fileno(), maxsize, MAP_SHARED)
    token = "hello"
    while True:
        o = randint(0, maxsize - len(token))
        mm[o:o+len(token)] = token
        sleep(5)

Running `watch stat mmapfile` on each node shows that the mtime is updated across the cluster while the above script is running. With kmod-gfs-0.1.34-18.el5, the mtime on other nodes never updates.
After I stopped my test case to verify this bug, my system panicked with the following assert.

GFS: fsid=nate:nate0.1: fatal: assertion "gfs_glock_is_locked_by_me(gl) && gfs_glock_is_held_excl(gl)" failed
GFS: fsid=nate:nate0.1:   function = gfs_trans_add_gl
GFS: fsid=nate:nate0.1:   file = /builddir/build/BUILD/gfs-kmod-0.1.34/_kmod_build_/src/gfs/trans.c, line = 234
GFS: fsid=nate:nate0.1:   time = 1400873905
GFS: fsid=nate:nate0.1: about to withdraw from the cluster
----------- [cut here ] --------- [please bite here ] ---------
Kernel BUG at ...ld/BUILD/gfs-kmod-0.1.34/_kmod_build_/src/gfs/lm.c:112
invalid opcode: 0000 [1] SMP
last sysfs file: /fs/gfs/nate:nate0/lock_module/recover_done
CPU 0
Modules linked in: gfs(U) dm_log_clustered(U) lock_nolock lock_dlm gfs2 dlm gnbd(U) configfs lpfc scsi_transport_fc sg be2iscsi ib_iser rdma_cm ib_cm iw_cm ib_sa ib_mad ib_core ib_addr iscsi_tcp bnx2i cnic cxgb3i iptable_filter ip_tables x_tables sctp autofs4 hidp rfcomm l2cap bluetooth lockd sunrpc ipv6 xfrm_nalgo crypto_api uio libcxgbi cxgb3 8021q libiscsi_tcp libiscsi2 scsi_transport_iscsi2 scsi_transport_iscsi dm_multipath scsi_dh video backlight sbs power_meter hwmon i2c_ec dell_wmi wmi button battery asus_acpi acpi_memhotplug ac parport_pc lp parport floppy i2c_piix4 i2c_core pcspkr virtio_balloon virtio_net serio_raw tpm_tis tpm tpm_bios dm_raid45 dm_message dm_region_hash dm_mem_cache dm_snapshot dm_zero dm_mirror dm_log dm_mod ahci ata_piix libata sd_mod scsi_mod virtio_blk virtio_pci virtio_ring virtio ext3 jbd uhci_hcd ohci_hcd ehci_hcd
Pid: 30450, comm: python Tainted: G -------------------- 2.6.18-389.el5 #1
RIP: 0010:[<ffffffff8882f06f>]  [<ffffffff8882f06f>] :gfs:gfs_lm_withdraw+0x97/0x103
RSP: 0018:ffff810030e65b88  EFLAGS: 00010202
RAX: 000000000000003e RBX: ffffc200001ce000 RCX: 0000000000000286
RDX: 00000000ffffffff RSI: 0000000000000000 RDI: ffffffff803270dc
RBP: ffffc200002069d4 R08: 000000000000000d R09: 00000000ffffffff
R10: 0000000000000000 R11: 000000000000000a R12: ffff81003b31fdc8
R13: ffffc200001ce000 R14: 0000000000000001 R15: ffff810030e65e98
FS:  00002b4a72e94170(0000) GS:ffffffff80436000(0000) knlGS:0000000000000000
CS:  0010 DS: 0000 ES: 0000 CR0: 000000008005003b
CR2: 00000000152c3000 CR3: 0000000037549000 CR4: 00000000000006e0
Process python (pid: 30450, threadinfo ffff810030e64000, task ffff810039f657b0)
Stack:  0000003000000030 ffff810030e65ca0 ffff810030e65ba8 ffff810030e65c38
 0000000300000000 ffffc20000206870 ffffc200002069d4 ffffffff8884e6f0
 ffffc200002069d4 ffffffff88848c60 ffff810030e65cb8 000000000001725f
Call Trace:
 [<ffffffff8002621e>] find_or_create_page+0x22/0x72
 [<ffffffff8881b753>] :gfs:getbuf+0x172/0x181
 [<ffffffff8881ba54>] :gfs:gfs_dreread+0x72/0xc6
 [<ffffffff8884725c>] :gfs:gfs_assert_withdraw_i+0x30/0x3c
 [<ffffffff88845e10>] :gfs:gfs_trans_add_gl+0x82/0xbe
 [<ffffffff88845ef3>] :gfs:gfs_trans_add_bh+0xa7/0xd9
 [<ffffffff8883d8ba>] :gfs:gfs_write_inode+0x10c/0x166
 [<ffffffff80063117>] wait_for_completion+0x1f/0xa2
 [<ffffffff8002ff43>] __writeback_single_inode+0x1dd/0x31c
 [<ffffffff88825d51>] :gfs:gfs_glock_nq+0x3aa/0x3ea
 [<ffffffff800f8ee8>] sync_inode+0x24/0x33
 [<ffffffff88839f32>] :gfs:gfs_fsync+0x88/0xba
 [<ffffffff8005afe4>] do_writepages+0x29/0x2f
 [<ffffffff8005055a>] do_fsync+0x52/0xa4
 [<ffffffff800d468e>] sys_msync+0xff/0x180
 [<ffffffff8005d29e>] tracesys+0xd5/0xdf
Code: 0f 0b 68 3b ad 84 88 c2 70 00 eb fe 48 89 ee 48 c7 c7 7b ad
RIP  [<ffffffff8882f06f>] :gfs:gfs_lm_withdraw+0x97/0x103
 RSP <ffff810030e65b88>
<0>Kernel panic - not syncing: Fatal exception
Created attachment 900627 [details]
New version to fix the QA issues.

This patch is similar to my earlier version, but includes more checks in gfs_write_inode to make sure that it can start a transaction. It now makes sure that there isn't already a transaction in progress and that, if it already holds a lock, the lock is exclusive.

I also noticed that the original fix itself doesn't always work. The issue is that after the last holder of the inode glock is dropped, the vfs inode timestamps are overwritten by the gfs inode ones. During mmap, the mtime is updated in the vfs inode during the page fault. If neither a call to stat the inode nor a call to write out the vfs inode attempts to lock it while the mmap call is still holding its glock, the updated vfs timestamp will get overwritten and lost.

To solve this, I've changed gfs_inode_attr_in() to not overwrite the vfs timestamps unless the gfs ones are newer. The timestamps still always get synced to the gfs inode when the vfs inode is first created. I have verified that it is still possible to manually reset the inode mtime to an earlier time using the touch command, and I can't think of any other issue with doing this. However, I'll ask Steve tomorrow to see if he can think of any problem with it.
Applied the above patch.
Created attachment 902772 [details]
Yet another version.

This one should do no harm, but it does much less good. The problem with my previous patches is that they use incorrect lock ordering. In RHEL5 gfs, the inode glock must be grabbed before the inode is locked (it's the opposite in RHEL5 gfs2), and __sync_single_inode() locks the inode before it calls gfs_write_inode(). This means that gfs_write_inode() can't lock the inode glock.

Sometimes __sync_single_inode() is called when the process has already locked the inode glock exclusively; in these cases gfs_write_inode() is able to write out the inode. Unfortunately, limiting gfs_write_inode() to writing out the inode structure only when it's called with the inode glock already held doesn't get the mtime updated when munmap() is called or an mmapped file is closed. The only condition that triggers the on-disk mtime update is msync() being called on the mmapped area with the MS_SYNC flag.

Right now, I can't see a way to do better than this. For this to work, we'd need to be able to write out the inode in a transaction from some other gfs function that is called at least as often as msync() or munmap() on an mmapped area and that hasn't already locked the inode. I don't see any function that would work for this. It's possible that this is the best we can do in gfs.
Above patch applied. This will only fix the issue when msync is called with MS_SYNC.
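For applications that need the on-disk mtime current (e.g. before a snapshot), the path the applied patch does cover can be exercised from Python, since mmap.flush() issues msync(MS_SYNC) on the mapping (file path here is a placeholder; in the real scenario it would live on a GFS mount):

```python
import mmap
import os

# Hypothetical path; on a real cluster this would be on the GFS filesystem.
path = "/tmp/mmap_msync_demo"
fd = os.open(path, os.O_RDWR | os.O_CREAT, 0o644)
os.ftruncate(fd, 4096)

mm = mmap.mmap(fd, 4096, mmap.MAP_SHARED)
mm[0:5] = b"hello"                 # write through the mapping

# mmap.flush() issues msync(MS_SYNC); with the patched kmod-gfs this is
# the call that pushes the updated mtime out to disk (and other nodes).
mm.flush()

mm.close()
os.close(fd)
```

munmap()/close() alone does not trigger the update, so the flush must happen before tearing down the mapping.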
I can confirm that mtime is updated on disk (and visible on other nodes) after msync is called. Otherwise, mtime is not updated when munmap is called, the file is closed, or fsfreeze is invoked.

Used: kmod-gfs-0.1.34-22.el5

Move to VERIFIED after TPS runs are complete.
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. http://rhn.redhat.com/errata/RHBA-2014-1234.html