Bug 475312
Summary: | GFS2: mount attempt hangs if no more journals available
---|---
Product: | Red Hat Enterprise Linux 5
Component: | kernel
Version: | 5.3
Hardware: | All
OS: | Linux
Status: | CLOSED ERRATA
Severity: | low
Priority: | low
Target Milestone: | rc
Reporter: | Nate Straz <nstraz>
Assignee: | Robert Peterson <rpeterso>
QA Contact: | Cluster QE <mspqa-list>
CC: | dzickus, edamato, sghosh, swhiteho, teigland
Doc Type: | Bug Fix
Last Closed: | 2009-09-02 09:01:51 UTC
Bug Depends On: | 425421

Description
Nate Straz
2008-12-08 20:51:32 UTC
It's worth trying this with the "remove lock_dlm" patch applied since that makes changes to that particular area, and might just fix it. Something I did notice while testing that was that if the kernel refuses to mount and returns an error code, the usual result was a seg fault from mount.gfs2. So there is something that needs looking at in this area still.

First, this is gfs1, not gfs2. Second, there's no way lock_dlm will be removed in RHEL5.

Yes, but that's not what I was pointing out... the issue was that we seem to have some kind of problem when mounts fail (e.g. arguments don't parse correctly) which results in mount.gfs2 getting stuck. I'd have thought that gfs1 probably was using the same or very similar code in that area.

(In reply to comment #2)
> First, this is gfs1, not gfs2.

It was fixed in 5.3 for gfs1. I was running the same test on gfs2 and found it failed.

This affects the RHEL 5.3 release for GFS2.

Log messages on morph-04, which tried to mount a GFS2 file system which didn't have a journal free:

    GFS2: fsid=: Trying to join cluster "lock_dlm", "morph-cluster:morph-cluster0"
    GFS2: fsid=morph-cluster:morph-cluster0.2: Joined cluster. Now mounting FS...
    GFS2: fsid=morph-cluster:morph-cluster0.2: can't mount journal #2
    GFS2: fsid=morph-cluster:morph-cluster0.2: there are only 2 journals (0 - 1)

group_tool shows that it did join the dlm lockspaces.

    [root@morph-04 ~]# group_tool
    type  level  name            id        state
    fence 0      default         00010001  none
    [1 3 4]
    dlm   1      clvmd           00020001  none
    [1 3 4]
    dlm   1      morph-cluster0  00060001  none
    [1 3 4]
    gfs   2      morph-cluster0  00050001  none
    [1 3 4]

strace output shows that it is hung in the mount system call.

    3439 connect(3, {sa_family=AF_FILE, path=@"gfs_controld_sock"...}, 20) = 0
    3439 write(3, "join /mnt/morph-cluster0 gfs2 lo"..., 256) = 256
    3439 read(3, "0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0"..., 256) = 256
    3439 read(3, "hostdata=jid=2:id=262145:first=0"..., 256) = 256
    3439 mount("/dev/mapper/morph--cluster-morph--cluster0", "/mnt/morph-cluster0", "gfs2", 0, "hostdata=jid=2:id=262145:first=0"

Backtrace of the mount.gfs2 process

    crash> bt 3727
    PID: 3727  TASK: f3203550  CPU: 1  COMMAND: "mount.gfs2"
     #0 [f3213c64] schedule at c060e785
     #1 [f3213cd0] schedule_timeout at c060eedf
     #2 [f3213cf4] msleep at c042c2ad
     #3 [f3213cf8] gfs2_gl_hash_clear at f8eb9f07
     #4 [f3213d10] fill_super at f8ec621d
     #5 [f3213da0] get_sb_bdev at c04787ee
     #6 [f3213dd4] gfs2_get_sb at f8ec494d
     #7 [f3213de4] vfs_kern_mount at c04782b4
     #8 [f3213e0c] do_kern_mount at c0478359
     #9 [f3213e24] do_mount at c048b374
    #10 [f3213f98] sys_mount at c048b451
    #11 [f3213fb8] system_call at c0404f10
        EAX: ffffffda  EBX: bfb89198  ECX: bfb8a199  EDX: 0804ed05
        DS:  007b      ESI: 00000000  ES:  007b      EDI: bfb8e19d
        SS:  007b      ESP: bfb8915c  EBP: bfb90698
        CS:  0073      EIP: 00f66402  ERR: 00000015  EFLAGS: 00000246

In the case of GFS, the problem was that the error path when mounting failed to release resources associated with the license file inode, which had been retooled as the fast statfs file. The fix should therefore not need to be crosswritten to gfs2. Here is a link to the fix:

http://git.fedoraproject.org/git/?p=cluster.git;a=blobdiff;f=gfs-kernel/src/gfs/ops_fstype.c;h=e01ea32a8bd670f98463a8ddc8f1ce1f04904e49;hp=10b08385275ef17130a5032fcef3db5c7cad9315;hb=b5cc95a48417758429752998be12c059b7ac2b95;hpb=1d56fb441d78faf375eb26a84e775cc8dde7e705

However, it might be a clue as to what's going wrong. I'll check gfs2's error path during mounting to see if there's a similar inode not being released.

I left the system in the hung state for an hour and I started seeing these messages on the console:

    GFS2: fsid=morph-cluster:morph-cluster0.2: Unmount seems to be stalled. Dumping lock state...
    G: s:SH n:5/16 f: t:SH d:EX/0 l:0 a:0 r:2
     H: s:SH f:EH e:0 p:3727 [mount.gfs2] gfs2_inode_lookup+0x12d/0x1f0 [gfs2]
    G: s:UN n:2/16 f: t:UN d:EX/0 l:0 a:0 r:2

So the iopen glock is still held for the root inode (5/16). Hopefully easy to find and fix.

I've recreated the problem and am confident I can fix this easily. Changing status to assigned and requesting ack flags for 5.4.

I suspect that we need to add dput(sb->s_root); just before sb->s_root = NULL; in fill_super() since the dcache seems to be holding a ref to the root inode at that point.

Created attachment 329478 [details]
Small patch
This appears to do the trick.
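
For readers following along, here is a rough userspace model of why the missing dput() would leave the mount stuck. It is only a sketch under stated assumptions, not the attached patch: the `model_*` types and `failed_mount()` are hypothetical stand-ins invented for illustration, and the real dentry, inode, and glock reference counting in the kernel is considerably more involved. The idea it tries to show is that the root dentry pins the root inode, the inode pins its iopen glock, and clearing `s_root` without dropping the dentry reference leaves that glock held, so a teardown that waits for all glocks to go away (as gfs2_gl_hash_clear appears to do in the backtrace) never finishes.

```c
#include <stddef.h>

/* Hypothetical stand-ins for kernel objects; NOT the real GFS2/VFS structures. */
struct model_inode {
    int glock_refs;               /* references pinning the inode's iopen glock */
};

struct model_dentry {
    int count;                    /* dentry reference count */
    struct model_inode *inode;    /* inode pinned by this dentry */
};

struct model_sb {
    struct model_dentry *s_root;  /* root dentry of the superblock */
};

/*
 * Simplified model of dput(): drop one dentry reference and, when the
 * last one goes away, release the glock reference held via the inode.
 */
static void model_dput(struct model_dentry *d)
{
    if (d != NULL && --d->count == 0)
        d->inode->glock_refs--;
}

/*
 * Model of the fill_super() error path after the journal lookup fails.
 * Returns the number of glock references still held after unwinding;
 * anything above zero means a "wait until all glocks are gone" loop
 * would never terminate.
 */
static int failed_mount(int with_fix)
{
    struct model_inode root_inode = { .glock_refs = 1 };
    struct model_dentry root = { .count = 1, .inode = &root_inode };
    struct model_sb sb = { .s_root = &root };

    if (with_fix)
        model_dput(sb.s_root);    /* the suggested extra dput(sb->s_root) */
    sb.s_root = NULL;             /* original code only cleared the pointer */

    return root_inode.glock_refs;
}
```

In this model, `failed_mount(0)` leaves one glock reference behind (the hang case) and `failed_mount(1)` leaves none, which matches the observation that the iopen glock 5/16 was still held by mount.gfs2.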
Patch is now upstream.

Created attachment 329495 [details]
RHEL5 version of the upstream patch
This is the RHEL5 version of the upstream patch. The patch is identical
except for the diff offsets. I have tested this patch and verified that
it fixes the failing scenario on system roth-01. I'll post this one to
rhkernel-list for inclusion into the 5.4 kernel.

The patch was posted to rhkernel-list, so I'm changing the status to POST and adding Don Zickus to the cc list.

in kernel-2.6.18-129.el5
You can download this test kernel from http://people.redhat.com/dzickus/el5

Updating PM score.

Verified against kernel-2.6.18-140.gfs2abhi.004.

An advisory has been issued which should help the problem described in this bug report. This report is therefore being closed with a resolution of ERRATA. For more information on the solution and/or where to find the updated files, please follow the link below. You may reopen this bug report if the solution does not work for you.

http://rhn.redhat.com/errata/RHSA-2009-1243.html