Bug 457798

Summary:

GFS2 : gfs2meta is FUBAR

Product:

Red Hat Enterprise Linux 5

Reporter:

Alexander Viro <aviro>

Component:

kernel

Assignee:

Abhijith Das <adas>

Status:

CLOSED ERRATA

QA Contact:

Martin Jenner <mjenner>

Severity:

high

Docs Contact:

Priority:

high

Version:

5.3

CC:

adas, bmarzins, bstevens, edamato, lwang, nstraz, rpeterso, swhiteho

Target Milestone:

Target Release:

---

Hardware:

All

OS:

Linux

Whiteboard:

Fixed In Version:

Doc Type:

Bug Fix

Doc Text:

Story Points:

---

Clone Of:

Environment:

Last Closed:

2009-01-20 19:42:20 UTC

Type:

---

Regression:

---

Mount Type:

---

Documentation:

---

CRM:

Verified Versions:

Category:

---

oVirt Team:

---

RHEL 7.3 requirements from Atomic Host:

Cloudforms Team:

---

Target Upstream Version:

Embargoed:

Bug Depends On:

Bug Blocks:

459630

Attachments:

Description	Flags
Initial patch	none
Cleaned up patch	none
Fixes the umount hang experienced earlier	none
More cleanup	none
RHEL5 port of upstream patch	none

Description Alexander Viro 2008-08-04 17:42:18 UTC

User-Agent:       Mozilla/5.0 (X11; U; Linux i686; en-US; rv:1.8.0.14eol) Gecko/20070505 Iceape/1.0.9 (Debian-1.0.13~pre080323b-0etch3)

a) mount gfs2 e.g. on /tmp/foo; mount --bind /tmp/foo /tmp/bar; umount /tmp/foo.
Now ->sd_gfs2mnt points to freed vfsmount and mounting the same device with type
gfs2meta will corrupt memory when it tries to increment ->mnt_count of that vfsmount.

b) get_gfs2_sb() traverses the list of gfs2 superblocks with no locking; oopsable.

c) there is no warranty that superblock found by get_gfs2_sb() will not disappear just as get_gfs2_sb() is returning a pointer to it; IOW, mounting gfs2meta can race with unmounting underlying gfs2.  Result is a memory corruptor, again.

d) root inode of gfs2meta is allocated before its superblock even exists; as the result, we get dentry->d_inode->i_sb != dentry->d_sb, which will make a lot of places in VFS very unhappy; violating assertions is a Bad Thing(tm)

e) mounting gfs2meta can race with mounting gfs2meta; we do mark the corresponding gfs2 superblock as "already has gfs2meta", but we have no protection against two mount(2) checking that at the same time, finding that gfs2meta is not already mounted and then both proceeding to set that mark and move on to mounting gfs2meta, with very nasty results.

IMO we need to put the entire thing into alternative dentry tree on gfs2 itself (see the things nfs+fscache are doing).  Have that mounted in presence of some
option (e.g. -o meta) and have gfs2meta ->get_sb() do the equivalent, setting
->mnt_sb to gfs2 superblock and ->mnt_root - to root of that tree.  ->show_options() would recognize these vfsmounts by their ->mnt_root belonging
to the tree in question, adding "meta" to the option list.

I suspect that it's the easiest solution; any better suggestions are welcome.

Reproducible: Always

Steps to Reproduce:
1.
2.
3.

Comment 1 Alexander Viro 2008-08-04 17:45:14 UTC

Note that this is present in mainline kernel as well, so the fix will be needed there.

Comment 2 RHEL Program Management 2008-08-04 18:07:27 UTC

This request was evaluated by Red Hat Product Management for inclusion in a Red
Hat Enterprise Linux maintenance release.  Product Management has requested
further review of this request by Red Hat Engineering, for potential
inclusion in a Red Hat Enterprise Linux Update release for currently deployed
products.  This request is not yet committed for inclusion in an Update
release.

Comment 3 Steve Whitehouse 2008-08-05 16:26:36 UTC

Created attachment 313463 [details]
Initial patch

So these are my initial thoughts collected into a patch. Issues still to be addressed:

1. Do we need userland changes? (how do we detect umount for example?)
2. Since this allows mounting the "meta" subtree directly, we need a way to mount the normal tree after we've mounted the meta tree. The userland tools probably don't deal with either of those cases.
3. A quick go/no go test seems to indicate that this patch does work, further testing to follow
4. Maybe the is_ancestor() test should be generic in dcache.h?

Anyway, this removes a nice lot of code, so thats good too. There seems to be scope for further cleanups in the area later on too.

Comment 4 Steve Whitehouse 2008-08-06 12:20:31 UTC

Created attachment 313557 [details]
Cleaned up patch

This cleans a few things up, but there is one bug left in it (at least!). When I do this:

[root@men-an-tol fs]# mount /gfs2
[root@men-an-tol fs]# mount -t gfs2meta /gfs2 /mnt/foo
[root@men-an-tol fs]# echo /mnt/foo/*
/mnt/foo/inum /mnt/foo/jindex /mnt/foo/per_node /mnt/foo/quota /mnt/foo/rindex /mnt/foo/statfs
[root@men-an-tol fs]# umount /gfs2
[root@men-an-tol fs]# umount /mnt/foo

all is ok, but when I do this:

[root@men-an-tol fs]# mount /gfs2
[root@men-an-tol fs]# mount -t gfs2meta /gfs2 /mnt/foo
[root@men-an-tol fs]# mount --bind /mnt/foo/per_node /mnt/bar
[root@men-an-tol fs]# umount /gfs2
[root@men-an-tol fs]# umount /mnt/foo
[root@men-an-tol fs]# umount /mnt/bar

the final umount hangs with /proc/slabinfo showing four glocks and four struct inodes left. Exactly what we'd expect to see if two inodes were being held via a ref count somewhere along the line. I suspect its something to do with us holding dcache refs to the root & master in gfs2's private super block. But why this should be a problem only if I add the bind mount I'm not sure.

I think I got the sb ref counting correct since only on the last umount do we hit ->put_super, but thats where the hang is. I'm going to try a few more experiments to see if I can track down whats going on here.

Comment 5 Steve Whitehouse 2008-08-06 16:02:56 UTC

Created attachment 313591 [details]
Fixes the umount hang experienced earlier

So the problem is related to ref counting & the way the dcache is flushed on umount. It was missing the dentries relating to the second root of the filesystem, and thus hanging. I've added a shrink_dcache_sb(sb); in ->kill_sb() after we dput() the two roots (GFS2 has its own ref on these aside from the VFS ->s_root) and before kill_block_super(sb) so that we don't land up with unwanted entries floating about.

Also I moved the gfs2_delete_debugfs_file(sdp) call to the end of ->kill_sb() since if something goes wrong here, then we want to be able to list the remaing glocks via debugfs.

I'm rather suspicious that we are not getting the right balance between things which are done in ->kill_sb() and those which are done in ->put_super() and perhaps we should be doing more (all?) of the shutdown in ->kill_sb(). Certainly it would make a great deal of sense to move the various iput()s into ->kill_sb() so that we do not have to flush these manually and we could also then remove the strange loop that we have to clear the glocks, since one pass over them should be enough.

Comment 6 Steve Whitehouse 2008-08-07 09:54:57 UTC

Created attachment 313671 [details]
More cleanup

A bunch of functions were in the wrong file, this moves them so that they become static. There are further potantial similar changes which can be made here, but this does the most obvious stuff.

I think this patch is pretty much ready to go now.

Comment 7 Abhijith Das 2008-08-20 23:16:22 UTC

I'm looking into this panic I'm seeing with the latest rhel5 patch I posted to the rhkernel-list. This happens simply by mounting gfs2.

gfs2: error 4096 reading superblock
Unable to handle kernel NULL pointer dereference at 0000000000000ba4 RIP:
 [<ffffffff800076ff>] _raw_spin_lock+0x1/0xeb
PGD 6c764067 PUD 6ca1c067 PMD 0
Oops: 0000 [1] SMP
last sysfs file: /devices/pci0000:00/0000:00:0d.0/0000:03:00.0/irq
CPU 1
Modules linked in: gfs2(U) dlm(U) configfs(U) autofs4(U) hidp(U) rfcomm(U) l2cap(U) bluetooth(U) sunrpc(U) ipv6(U) xfrm_nalgo(U) crypto_api(U) cpufreq_ondemand(U) dm_multipath(U) video(U) sbs(U) backlight(U) i2c_ec(U) button(U) battery(U) asus_acpi(U) acpi_memhotplug(U) ac(U) parport_pc(U) lp(U) parport(U) qla2xxx(U) sg(U) shpchp(U) serio_raw(U) scsi_transport_fc(U) tg3(U) i2c_nforce2(U) k8temp(U) hwmon(U) k8_edac(U) edac_mc(U) i2c_core(U) pcspkr(U) dm_snapshot(U) dm_zero(U) dm_mirror(U) dm_mod(U) sata_nv(U) libata(U) sd_mod(U) scsi_mod(U) ext3(U) jbd(U) uhci_hcd(U) ohci_hcd(U) ehci_hcd(U)
Pid: 6123, comm: mount.gfs2 Tainted: G      2.6.18-104.gfs2abhi.001 #2
RIP: 0010:[<ffffffff800076ff>]  [<ffffffff800076ff>] _raw_spin_lock+0x1/0xeb
RSP: 0018:ffff81006eadfab8  EFLAGS: 00010046
RAX: ffff81006ca97040 RBX: 0000000000000ba0 RCX: 0000000000000000
RDX: 0000000000000000 RSI: 0000000000000000 RDI: 0000000000000ba0
RBP: 0000000000000b98 R08: 0000000000000002 R09: 0000000000000001
R10: ffff81006c391ac0 R11: 0000000000000bb8 R12: 0000000000000000
R13: 0000000000000b98 R14: ffff81006c03f000 R15: ffffffff8854040e
FS:  00002b00cb049220(0000) GS:ffff810037cb04c0(0000) knlGS:0000000000000000
CS:  0010 DS: 0000 ES: 0000 CR0: 000000008005003b
CR2: 0000000000000ba4 CR3: 000000006bcac000 CR4: 00000000000006e0
Process mount.gfs2 (pid: 6123, threadinfo ffff81006eade000, task ffff81006ca97040)
Stack:  0000000000000ba0 ffffffff80066be4 0000000000000000 0000000000000000
 0000000000000b98 ffff81006c03f000 ffffffff8854040e 0000000000000000
 0000000000000000 ffffffff885398af ffff81006c540078 0000000000000046
Call Trace:
 [<ffffffff80066be4>] __down_write_nested+0x15/0x96
 [<ffffffff8854040e>] :gfs2:fill_super+0x0/0xad6
 [<ffffffff885398af>] :gfs2:gfs2_log_flush+0x1f/0x485
 [<ffffffff88539d20>] :gfs2:gfs2_meta_syncfs+0xb/0x37
 [<ffffffff8853f41c>] :gfs2:gfs2_kill_sb+0x19/0x63
 [<ffffffff8854040e>] :gfs2:fill_super+0x0/0xad6
 [<ffffffff800e473e>] deactivate_super+0x6c/0x84
 [<ffffffff800e4d31>] get_sb_bdev+0x12a/0x16d
 [<ffffffff800e47e9>] vfs_kern_mount+0x93/0x11a
 [<ffffffff800e48b2>] do_kern_mount+0x36/0x4d
 [<ffffffff800ee442>] do_mount+0x6af/0x71f
 [<ffffffff800a178e>] autoremove_wake_function+0x0/0x2e
 [<ffffffff800076c0>] find_get_page+0x1a/0x58
 [<ffffffff800469cf>] do_sock_read+0xc0/0xcb
 [<ffffffff80218b1a>] sock_aio_read+0x4f/0x5e
 [<ffffffff8000ce69>] do_sync_read+0xc7/0x104
 [<ffffffff800ce39a>] zone_statistics+0x3e/0x6d
 [<ffffffff8000f4a7>] __alloc_pages+0x65/0x2ce
 [<ffffffff8004d2de>] sys_mount+0x8a/0xcd
 [<ffffffff8005f2a6>] tracesys+0xd5/0xdf


Code: 81 7f 04 ad 4e ad de 48 89 fb 74 0c 48 c7 c6 75 c7 2a 80 e8
RIP  [<ffffffff800076ff>] _raw_spin_lock+0x1/0xeb
 RSP <ffff81006eadfab8>
CR2: 0000000000000ba4
 f

Comment 8 Steve Whitehouse 2008-08-21 09:20:31 UTC

Looks like you've changed the prototype to end_bio_io_page(). It should have retained the same form as it had in RHEL originally, and not be the same as the upstream code. As a result you've got the bytes_done appearing in the error position. I'm surprised this didn't generate a compiler warning.

Comment 9 Steve Whitehouse 2008-08-21 13:11:37 UTC

Move back to assigned until the patch is fixed.

Comment 10 Abhijith Das 2008-08-25 04:49:12 UTC

Created attachment 314898 [details]
RHEL5 port of upstream patch

With the latest rhel5 port, I'm hitting hangs on a clustered gfs2 with bind mounts. Standalone gfs2(lock_nolock) seems to be working fine. (I couldn't get it to hang with bind mounts).

Comment 11 Abhijith Das 2008-08-25 22:28:58 UTC

With a clustered gfs2 (only salem mounts gfs2) I hit a hang when I do the following:

[root@salem ~]# mount -t gfs2 /dev/Smoke4/smoke40 /mnt/gfs2/
[root@salem ~]# mount --bind /mnt/gfs2/ /tmp/foo/
[root@salem ~]# umount /mnt/gfs2/
[root@salem ~]# cd /tmp/foo/
[root@salem /tmp/foo]# rm foobar 
[----hang----]

The output of gfs2_hangalyzer is:

[root@salem ../fs/gfs2]# /tmp/gfs2_hangalyzer -n salem -q

salem     : gfs2: G:  s:UN n:2/2053d f:ls t:SH d:EX/0 l:0 a:0 r:4
salem     :                         (locked, sticky)
salem     : gfs2:  H: s:SH f:W e:0 p:6200 [rm] gfs2_rindex_hold+0x32/0x154 [gfs2]
salem     : gfs2: No refs found to 2/2053d

There is 1 glock with waiters.
salem, pid 6200 is waiting for glock 2/2053d, but no holder was found.

Comment 12 Steve Whitehouse 2008-08-26 09:55:41 UTC

The strange thing about this is that the dlm claims not to have any references to the lock in question. It does know about the iopen lock for this inode though.

Comment 13 Steve Whitehouse 2008-08-26 10:55:09 UTC

It also seems upstream behaves the same way. I have a sneeking suspicion that it might be something related to the umount helper, but no proof so far.

Comment 14 Steve Whitehouse 2008-08-26 12:15:33 UTC

...and thus we see:

[root@men-an-tol fs]# /sbin/dmsetup create gfs0 ~steve/etc/my.table 
[root@men-an-tol fs]# mount /mnt/gfs0
[root@men-an-tol fs]# mkdir /mnt/gfs0/foo
[root@men-an-tol fs]# mount --bind /mnt/gfs0/foo /mnt/gfs1
[root@men-an-tol fs]# mv /sbin/umount.gfs2 /sbin/tmp.umount.gfs2
[root@men-an-tol fs]# umount /mnt/gfs0
[root@men-an-tol fs]# cd /mnt/gfs1
[root@men-an-tol gfs1]# touch bar
[root@men-an-tol gfs1]# ls
bar
[root@men-an-tol gfs1]# cd ..
[root@men-an-tol mnt]# umount /mnt/gfs1

which works ok. So this bug is a tools bug, not a kernel bug I think. Thus I think we should be able to post this patch now. Looks like we need to check the umount helper though, so probably means opening another bug unless we can add it to the "things to do in userland" bug, #459630

Comment 16 Abhijith Das 2008-08-26 22:09:14 UTC

Leaving the --bind mount/umount.gfs2 issue aside, I was doing a few tests to verify the correctness of the meta filesystem and found this:

 [root@salem ~]# mount -t gfs2 -o meta /dev/Smoke4/smoke40 /tmp/foo/
 [root@salem ~]# ls /tmp/foo/
 inum  jindex  per_node  quota  rindex  statfs

 [root@salem ~]# mount -t gfs2 /dev/Smoke4/smoke40 /mnt/gfs2
 [root@salem ~]# ls /mnt/gfs2
 inum  jindex  per_node  quota  rindex  statfs

From a different node...
 [root@camel ~]# mount -t gfs2 /dev/Smoke4/smoke40 /mnt/gfs2
 [root@camel ~]# ls /mnt/gfs2
 foobar

 [root@camel ~]# mount -t gfs2 -o meta /dev/Smoke4/smoke40 /mnt/foo/
 [root@camel ~]# ls /mnt/foo/
 foobar

Looks like the root_dir and master_dir are being mixed up.

Comment 17 Steve Whitehouse 2008-08-27 07:22:36 UTC

They are not being mixed up, we just haven't got around to making the changes required to mount the main fs after we've mounted with -o meta. We can fix that later though, for now the important thing is that all the existing mount methods continue to work.

Comment 18 Abhijith Das 2008-08-28 14:47:34 UTC

Steve, but look at the second set of commands:

 [root@camel ~]# mount -t gfs2 /dev/Smoke4/smoke40 /mnt/gfs2
 [root@camel ~]# ls /mnt/gfs2
 foobar

 [root@camel ~]# mount -t gfs2 -o meta /dev/Smoke4/smoke40 /mnt/foo/
 [root@camel ~]# ls /mnt/foo/
 foobar

Here, we mount gfs2 first and then try to mount -o meta. They both show the same tree (that of the first mounted gfs2 fs). The userland tools work like this:

eg: gfs2_quota list -f /mnt/gfs2
expects to find a gfs2 fs at /mnt/gfs2. gfs2_quota then goes about finding the corresponding device and mount it -o meta at /tmp/.gfs2meta/ to do the quota listing. Right now, this will fail, because there will be no quota file in /tmp/.gfs2meta/, instead it will have the contents of the gfs2 filesystem.

Am I missing something here?

Comment 19 Abhijith Das 2008-08-29 21:11:25 UTC

I think I spoke too soon. I guess I don't completely understand the way this is implemented. This seems to work as expected:

[root@salem ~]# mount -t gfs2 /dev/Smoke4/smoke40 /mnt/gfs2
[root@salem ~]# ls /mnt/gfs2/
foobar
[root@salem ~]# mount -t gfs2meta /mnt/gfs2 /tmp/foo/
[root@salem ~]# ls /tmp/foo/
inum  jindex  per_node  quota  rindex  statfs

What is the difference between the following two commands?
a) mount -t gfs2meta /mnt/gfs2 /tmp/foo/
b) mount -t gfs2 -o meta /dev/Smoke4/smoke40 /tmp/foo (this shows the same directory tree as gfs2 if the gfs2 fs was mounted first)

and what is the preferred way to mount the meta filesystem?

Comment 20 Don Zickus 2008-09-03 03:41:11 UTC

in kernel-2.6.18-107.el5
You can download this test kernel from http://people.redhat.com/dzickus/el5

Comment 24 errata-xmlrpc 2009-01-20 19:42:20 UTC

An advisory has been issued which should help the problem
described in this bug report. This report is therefore being
closed with a resolution of ERRATA. For more information
on therefore solution and/or where to find the updated files,
please follow the link below. You may reopen this bug report
if the solution does not work for you.

http://rhn.redhat.com/errata/RHSA-2009-0225.html