Bug 457798
Summary: | GFS2 : gfs2meta is FUBAR | ||||||||||||||
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
Product: | Red Hat Enterprise Linux 5 | Reporter: | Alexander Viro <aviro> | ||||||||||||
Component: | kernel | Assignee: | Abhijith Das <adas> | ||||||||||||
Status: | CLOSED ERRATA | QA Contact: | Martin Jenner <mjenner> | ||||||||||||
Severity: | high | Docs Contact: | |||||||||||||
Priority: | high | ||||||||||||||
Version: | 5.3 | CC: | adas, bmarzins, bstevens, edamato, lwang, nstraz, rpeterso, swhiteho | ||||||||||||
Target Milestone: | rc | ||||||||||||||
Target Release: | --- | ||||||||||||||
Hardware: | All | ||||||||||||||
OS: | Linux | ||||||||||||||
Whiteboard: | |||||||||||||||
Fixed In Version: | Doc Type: | Bug Fix | |||||||||||||
Doc Text: | Story Points: | --- | |||||||||||||
Clone Of: | Environment: | ||||||||||||||
Last Closed: | 2009-01-20 19:42:20 UTC | Type: | --- | ||||||||||||
Regression: | --- | Mount Type: | --- | ||||||||||||
Documentation: | --- | CRM: | |||||||||||||
Verified Versions: | Category: | --- | |||||||||||||
oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: | |||||||||||||
Cloudforms Team: | --- | Target Upstream Version: | |||||||||||||
Embargoed: | |||||||||||||||
Bug Depends On: | |||||||||||||||
Bug Blocks: | 459630 | ||||||||||||||
Attachments: |
|
Description
Alexander Viro
2008-08-04 17:42:18 UTC
Note that this is present in mainline kernel as well, so the fix will be needed there. This request was evaluated by Red Hat Product Management for inclusion in a Red Hat Enterprise Linux maintenance release. Product Management has requested further review of this request by Red Hat Engineering, for potential inclusion in a Red Hat Enterprise Linux Update release for currently deployed products. This request is not yet committed for inclusion in an Update release. Created attachment 313463 [details]
Initial patch
So these are my initial thoughts collected into a patch. Issues still to be addressed:
1. Do we need userland changes? (how do we detect umount for example?)
2. Since this allows mounting the "meta" subtree directly, we need a way to mount the normal tree after we've mounted the meta tree. The userland tools probably don't deal with either of those cases.
3. A quick go/no go test seems to indicate that this patch does work, further testing to follow
4. Maybe the is_ancestor() test should be generic in dcache.h?
Anyway, this removes a nice lot of code, so thats good too. There seems to be scope for further cleanups in the area later on too.
Created attachment 313557 [details]
Cleaned up patch
This cleans a few things up, but there is one bug left in it (at least!). When I do this:
[root@men-an-tol fs]# mount /gfs2
[root@men-an-tol fs]# mount -t gfs2meta /gfs2 /mnt/foo
[root@men-an-tol fs]# echo /mnt/foo/*
/mnt/foo/inum /mnt/foo/jindex /mnt/foo/per_node /mnt/foo/quota /mnt/foo/rindex /mnt/foo/statfs
[root@men-an-tol fs]# umount /gfs2
[root@men-an-tol fs]# umount /mnt/foo
all is ok, but when I do this:
[root@men-an-tol fs]# mount /gfs2
[root@men-an-tol fs]# mount -t gfs2meta /gfs2 /mnt/foo
[root@men-an-tol fs]# mount --bind /mnt/foo/per_node /mnt/bar
[root@men-an-tol fs]# umount /gfs2
[root@men-an-tol fs]# umount /mnt/foo
[root@men-an-tol fs]# umount /mnt/bar
the final umount hangs with /proc/slabinfo showing four glocks and four struct inodes left. Exactly what we'd expect to see if two inodes were being held via a ref count somewhere along the line. I suspect its something to do with us holding dcache refs to the root & master in gfs2's private super block. But why this should be a problem only if I add the bind mount I'm not sure.
I think I got the sb ref counting correct since only on the last umount do we hit ->put_super, but thats where the hang is. I'm going to try a few more experiments to see if I can track down whats going on here.
Created attachment 313591 [details]
Fixes the umount hang experienced earlier
So the problem is related to ref counting & the way the dcache is flushed on umount. It was missing the dentries relating to the second root of the filesystem, and thus hanging. I've added a shrink_dcache_sb(sb); in ->kill_sb() after we dput() the two roots (GFS2 has its own ref on these aside from the VFS ->s_root) and before kill_block_super(sb) so that we don't land up with unwanted entries floating about.
Also I moved the gfs2_delete_debugfs_file(sdp) call to the end of ->kill_sb() since if something goes wrong here, then we want to be able to list the remaing glocks via debugfs.
I'm rather suspicious that we are not getting the right balance between things which are done in ->kill_sb() and those which are done in ->put_super() and perhaps we should be doing more (all?) of the shutdown in ->kill_sb(). Certainly it would make a great deal of sense to move the various iput()s into ->kill_sb() so that we do not have to flush these manually and we could also then remove the strange loop that we have to clear the glocks, since one pass over them should be enough.
Created attachment 313671 [details]
More cleanup
A bunch of functions were in the wrong file, this moves them so that they become static. There are further potantial similar changes which can be made here, but this does the most obvious stuff.
I think this patch is pretty much ready to go now.
I'm looking into this panic I'm seeing with the latest rhel5 patch I posted to the rhkernel-list. This happens simply by mounting gfs2. gfs2: error 4096 reading superblock Unable to handle kernel NULL pointer dereference at 0000000000000ba4 RIP: [<ffffffff800076ff>] _raw_spin_lock+0x1/0xeb PGD 6c764067 PUD 6ca1c067 PMD 0 Oops: 0000 [1] SMP last sysfs file: /devices/pci0000:00/0000:00:0d.0/0000:03:00.0/irq CPU 1 Modules linked in: gfs2(U) dlm(U) configfs(U) autofs4(U) hidp(U) rfcomm(U) l2cap(U) bluetooth(U) sunrpc(U) ipv6(U) xfrm_nalgo(U) crypto_api(U) cpufreq_ondemand(U) dm_multipath(U) video(U) sbs(U) backlight(U) i2c_ec(U) button(U) battery(U) asus_acpi(U) acpi_memhotplug(U) ac(U) parport_pc(U) lp(U) parport(U) qla2xxx(U) sg(U) shpchp(U) serio_raw(U) scsi_transport_fc(U) tg3(U) i2c_nforce2(U) k8temp(U) hwmon(U) k8_edac(U) edac_mc(U) i2c_core(U) pcspkr(U) dm_snapshot(U) dm_zero(U) dm_mirror(U) dm_mod(U) sata_nv(U) libata(U) sd_mod(U) scsi_mod(U) ext3(U) jbd(U) uhci_hcd(U) ohci_hcd(U) ehci_hcd(U) Pid: 6123, comm: mount.gfs2 Tainted: G 2.6.18-104.gfs2abhi.001 #2 RIP: 0010:[<ffffffff800076ff>] [<ffffffff800076ff>] _raw_spin_lock+0x1/0xeb RSP: 0018:ffff81006eadfab8 EFLAGS: 00010046 RAX: ffff81006ca97040 RBX: 0000000000000ba0 RCX: 0000000000000000 RDX: 0000000000000000 RSI: 0000000000000000 RDI: 0000000000000ba0 RBP: 0000000000000b98 R08: 0000000000000002 R09: 0000000000000001 R10: ffff81006c391ac0 R11: 0000000000000bb8 R12: 0000000000000000 R13: 0000000000000b98 R14: ffff81006c03f000 R15: ffffffff8854040e FS: 00002b00cb049220(0000) GS:ffff810037cb04c0(0000) knlGS:0000000000000000 CS: 0010 DS: 0000 ES: 0000 CR0: 000000008005003b CR2: 0000000000000ba4 CR3: 000000006bcac000 CR4: 00000000000006e0 Process mount.gfs2 (pid: 6123, threadinfo ffff81006eade000, task ffff81006ca97040) Stack: 0000000000000ba0 ffffffff80066be4 0000000000000000 0000000000000000 0000000000000b98 ffff81006c03f000 ffffffff8854040e 0000000000000000 0000000000000000 ffffffff885398af ffff81006c540078 0000000000000046 Call Trace: [<ffffffff80066be4>] __down_write_nested+0x15/0x96 [<ffffffff8854040e>] :gfs2:fill_super+0x0/0xad6 [<ffffffff885398af>] :gfs2:gfs2_log_flush+0x1f/0x485 [<ffffffff88539d20>] :gfs2:gfs2_meta_syncfs+0xb/0x37 [<ffffffff8853f41c>] :gfs2:gfs2_kill_sb+0x19/0x63 [<ffffffff8854040e>] :gfs2:fill_super+0x0/0xad6 [<ffffffff800e473e>] deactivate_super+0x6c/0x84 [<ffffffff800e4d31>] get_sb_bdev+0x12a/0x16d [<ffffffff800e47e9>] vfs_kern_mount+0x93/0x11a [<ffffffff800e48b2>] do_kern_mount+0x36/0x4d [<ffffffff800ee442>] do_mount+0x6af/0x71f [<ffffffff800a178e>] autoremove_wake_function+0x0/0x2e [<ffffffff800076c0>] find_get_page+0x1a/0x58 [<ffffffff800469cf>] do_sock_read+0xc0/0xcb [<ffffffff80218b1a>] sock_aio_read+0x4f/0x5e [<ffffffff8000ce69>] do_sync_read+0xc7/0x104 [<ffffffff800ce39a>] zone_statistics+0x3e/0x6d [<ffffffff8000f4a7>] __alloc_pages+0x65/0x2ce [<ffffffff8004d2de>] sys_mount+0x8a/0xcd [<ffffffff8005f2a6>] tracesys+0xd5/0xdf Code: 81 7f 04 ad 4e ad de 48 89 fb 74 0c 48 c7 c6 75 c7 2a 80 e8 RIP [<ffffffff800076ff>] _raw_spin_lock+0x1/0xeb RSP <ffff81006eadfab8> CR2: 0000000000000ba4 f Looks like you've changed the prototype to end_bio_io_page(). It should have retained the same form as it had in RHEL originally, and not be the same as the upstream code. As a result you've got the bytes_done appearing in the error position. I'm surprised this didn't generate a compiler warning. Move back to assigned until the patch is fixed. Created attachment 314898 [details]
RHEL5 port of upstream patch
With the latest rhel5 port, I'm hitting hangs on a clustered gfs2 with bind mounts. Standalone gfs2(lock_nolock) seems to be working fine. (I couldn't get it to hang with bind mounts).
With a clustered gfs2 (only salem mounts gfs2) I hit a hang when I do the following: [root@salem ~]# mount -t gfs2 /dev/Smoke4/smoke40 /mnt/gfs2/ [root@salem ~]# mount --bind /mnt/gfs2/ /tmp/foo/ [root@salem ~]# umount /mnt/gfs2/ [root@salem ~]# cd /tmp/foo/ [root@salem /tmp/foo]# rm foobar [----hang----] The output of gfs2_hangalyzer is: [root@salem ../fs/gfs2]# /tmp/gfs2_hangalyzer -n salem -q salem : gfs2: G: s:UN n:2/2053d f:ls t:SH d:EX/0 l:0 a:0 r:4 salem : (locked, sticky) salem : gfs2: H: s:SH f:W e:0 p:6200 [rm] gfs2_rindex_hold+0x32/0x154 [gfs2] salem : gfs2: No refs found to 2/2053d There is 1 glock with waiters. salem, pid 6200 is waiting for glock 2/2053d, but no holder was found. The strange thing about this is that the dlm claims not to have any references to the lock in question. It does know about the iopen lock for this inode though. It also seems upstream behaves the same way. I have a sneeking suspicion that it might be something related to the umount helper, but no proof so far. ...and thus we see: [root@men-an-tol fs]# /sbin/dmsetup create gfs0 ~steve/etc/my.table [root@men-an-tol fs]# mount /mnt/gfs0 [root@men-an-tol fs]# mkdir /mnt/gfs0/foo [root@men-an-tol fs]# mount --bind /mnt/gfs0/foo /mnt/gfs1 [root@men-an-tol fs]# mv /sbin/umount.gfs2 /sbin/tmp.umount.gfs2 [root@men-an-tol fs]# umount /mnt/gfs0 [root@men-an-tol fs]# cd /mnt/gfs1 [root@men-an-tol gfs1]# touch bar [root@men-an-tol gfs1]# ls bar [root@men-an-tol gfs1]# cd .. [root@men-an-tol mnt]# umount /mnt/gfs1 which works ok. So this bug is a tools bug, not a kernel bug I think. Thus I think we should be able to post this patch now. Looks like we need to check the umount helper though, so probably means opening another bug unless we can add it to the "things to do in userland" bug, #459630 Leaving the --bind mount/umount.gfs2 issue aside, I was doing a few tests to verify the correctness of the meta filesystem and found this: [root@salem ~]# mount -t gfs2 -o meta /dev/Smoke4/smoke40 /tmp/foo/ [root@salem ~]# ls /tmp/foo/ inum jindex per_node quota rindex statfs [root@salem ~]# mount -t gfs2 /dev/Smoke4/smoke40 /mnt/gfs2 [root@salem ~]# ls /mnt/gfs2 inum jindex per_node quota rindex statfs From a different node... [root@camel ~]# mount -t gfs2 /dev/Smoke4/smoke40 /mnt/gfs2 [root@camel ~]# ls /mnt/gfs2 foobar [root@camel ~]# mount -t gfs2 -o meta /dev/Smoke4/smoke40 /mnt/foo/ [root@camel ~]# ls /mnt/foo/ foobar Looks like the root_dir and master_dir are being mixed up. They are not being mixed up, we just haven't got around to making the changes required to mount the main fs after we've mounted with -o meta. We can fix that later though, for now the important thing is that all the existing mount methods continue to work. Steve, but look at the second set of commands: [root@camel ~]# mount -t gfs2 /dev/Smoke4/smoke40 /mnt/gfs2 [root@camel ~]# ls /mnt/gfs2 foobar [root@camel ~]# mount -t gfs2 -o meta /dev/Smoke4/smoke40 /mnt/foo/ [root@camel ~]# ls /mnt/foo/ foobar Here, we mount gfs2 first and then try to mount -o meta. They both show the same tree (that of the first mounted gfs2 fs). The userland tools work like this: eg: gfs2_quota list -f /mnt/gfs2 expects to find a gfs2 fs at /mnt/gfs2. gfs2_quota then goes about finding the corresponding device and mount it -o meta at /tmp/.gfs2meta/ to do the quota listing. Right now, this will fail, because there will be no quota file in /tmp/.gfs2meta/, instead it will have the contents of the gfs2 filesystem. Am I missing something here? I think I spoke too soon. I guess I don't completely understand the way this is implemented. This seems to work as expected: [root@salem ~]# mount -t gfs2 /dev/Smoke4/smoke40 /mnt/gfs2 [root@salem ~]# ls /mnt/gfs2/ foobar [root@salem ~]# mount -t gfs2meta /mnt/gfs2 /tmp/foo/ [root@salem ~]# ls /tmp/foo/ inum jindex per_node quota rindex statfs What is the difference between the following two commands? a) mount -t gfs2meta /mnt/gfs2 /tmp/foo/ b) mount -t gfs2 -o meta /dev/Smoke4/smoke40 /tmp/foo (this shows the same directory tree as gfs2 if the gfs2 fs was mounted first) and what is the preferred way to mount the meta filesystem? in kernel-2.6.18-107.el5 You can download this test kernel from http://people.redhat.com/dzickus/el5 An advisory has been issued which should help the problem described in this bug report. This report is therefore being closed with a resolution of ERRATA. For more information on therefore solution and/or where to find the updated files, please follow the link below. You may reopen this bug report if the solution does not work for you. http://rhn.redhat.com/errata/RHSA-2009-0225.html |