| Field | Value |
|---|---|
| Summary | hang during mount of GFS2 in Fedora 15 |
| Product | [Fedora] Fedora |
| Component | kernel |
| Version | 15 |
| Hardware | i686 |
| OS | Linux |
| Status | CLOSED UPSTREAM |
| Severity | medium |
| Priority | unspecified |
| Reporter | David Booher <dbooher> |
| Assignee | Steve Whitehouse <swhiteho> |
| QA Contact | Fedora Extras Quality Assurance <extras-qa> |
| CC | adas, agk, aquini, bmarzins, fdinitto, gansalmon, itamar, jonathan, kernel-maint, madhu.chinakonda, rpeterso, swhiteho |
| Doc Type | Bug Fix |
| Last Closed | 2011-07-15 07:52:40 UTC |

Description
David Booher
2011-06-21 17:37:38 UTC
It looks like gfs_controld is doing the right thing, but for some reason the kernel is having problems with what has been sent. We need to figure out exactly which object is NULL here. If you are able to get an strace from gfs_controld when this is happening, that might be helpful. Otherwise I'll try to take a look at this tomorrow. It would also be useful to know the mount command line, or at least which lock protocol was in use (lock_nolock or lock_dlm).

It's easy for me to get it in this state. If you tell me how to take the strace, I'll be glad to send it along.

OK. I was able to get the PID of gfs_controld and issue:

    strace -o /home/dbooher/strace.txt -p pid

The strange thing is that the mount succeeds when strace is attached to the gfs_controld process. If I unmount the GFS, terminate the strace and then try to mount it again, it will hang. Very strange. Just to note: lock_dlm is in use. The mount command is:

    mount /dev/mapper/DaveLV-DaveGFS /test

I have straces of two successful mounts if needed.

So it sounds like it is highly timing dependent... In F15, GFS2 uses a different scheme for mounting than it did in earlier versions. Instead of using mount.gfs2, it communicates directly with the kernel (through gfs_controld). I had been running that code on my test system for some time before it was merged, and I had not hit this issue.

The main clue from the oops is that whatever is NULL is at offset 4 in some structure. I don't think that this can be the superblock, since the first field that is accessed there is not at offset 4. On the other hand, lm_mount is at offset 4, since you are on a 32-bit platform. So I suspect that gfs_controld might have managed to get to the point of trying to set the "first" parameter before the kernel has set the lock operations on the superblock. I suspect that stracing gfs_controld adds just enough delay to prevent the race.
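
The offset reasoning above can be illustrated with a small sketch. It is only a model: the thread states that lm_mount sits at offset 4 on a 32-bit platform, which is what one would expect if a single pointer-sized member precedes it. The structure names and the first member name (`lm_proto_name`) are assumptions made for illustration, and pointers are modelled as fixed-width integers so the result does not depend on the machine running the example.

```python
import ctypes


class LockOps32(ctypes.Structure):
    """Hypothetical lock-operations table as laid out on a 32-bit build."""
    _fields_ = [
        ("lm_proto_name", ctypes.c_uint32),  # assumed first member, a 4-byte pointer
        ("lm_mount", ctypes.c_uint32),       # function pointer named in the thread
    ]


class LockOps64(ctypes.Structure):
    """The same hypothetical table with 8-byte pointers, for contrast."""
    _fields_ = [
        ("lm_proto_name", ctypes.c_uint64),
        ("lm_mount", ctypes.c_uint64),
    ]


def member_offset(struct_type, name):
    """Return the byte offset of a member inside a ctypes structure."""
    return getattr(struct_type, name).offset


def fault_address(table_base, struct_type, name):
    """Address touched when reading a member through a table pointer."""
    return table_base + member_offset(struct_type, name)


if __name__ == "__main__":
    # Reading lm_mount through a NULL table pointer touches base 0 + offset.
    print(hex(fault_address(0, LockOps32, "lm_mount")))
    print(hex(fault_address(0, LockOps64, "lm_mount")))
```

Under these assumptions a NULL table pointer yields an access at address 4 on 32-bit and 8 on 64-bit, which is consistent with the oops pointing at offset 4 on an i686 system.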
It should be fairly easy to fix, at least. At worst we can just add a completion that is set once the lock operations are valid in the superblock.

Created attachment 505984
Proposed fix

If you are in a position to test a patch, then this should fix your issue.

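
The completion idea mentioned above can be sketched in userspace. This is not the attached patch and not kernel code: it is a minimal analogy in which `threading.Event` plays the role of a completion, and every name in it (`Superblock`, `mount_setup`, `set_first`, `run`) is hypothetical. One thread stands in for the kernel installing the lock operations, the other for gfs_controld trying to use them.

```python
import threading
import time


class Superblock:
    """Hypothetical stand-in for the per-mount state shared by both sides."""

    def __init__(self):
        self.lock_ops = None                # not valid until mount setup has run
        self.ops_ready = threading.Event()  # plays the role of a completion


def mount_setup(sb, delay):
    """'Kernel' side: install the lock operations, then signal completion."""
    time.sleep(delay)                       # widen the window in which ops are unset
    sb.lock_ops = {"lm_mount": lambda: "mounted"}
    sb.ops_ready.set()


def set_first(sb, wait_for_ops):
    """'Daemon' side: use the lock operations, optionally waiting for them."""
    if wait_for_ops:
        sb.ops_ready.wait()                 # block until the ops are valid
    if sb.lock_ops is None:
        return "NULL dereference"           # what the unfixed race would hit
    return sb.lock_ops["lm_mount"]()


def run(wait_for_ops, delay=0.2):
    """Start the setup thread, then immediately try to use the ops."""
    sb = Superblock()
    worker = threading.Thread(target=mount_setup, args=(sb, delay))
    worker.start()
    result = set_first(sb, wait_for_ops)
    worker.join()
    return result


if __name__ == "__main__":
    print("without waiting:", run(wait_for_ops=False))
    print("with completion:", run(wait_for_ops=True))
```

Without the wait, the outcome depends on timing, which matches the observation that attaching strace (and so slowing gfs_controld down) hides the hang. With the wait, the ordering no longer depends on which side gets there first.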
Since this was an "Install to HD" from the LIVE system, my expertise level probably isn't quite up to snuff. However, I assume I would need to:

1. Get the correct devel packages
2. Apply the patch
3. Rebuild GFS2

I've probably over-simplified it, but I'd be willing to give it a shot :)

You'd need to get the srpm for the kernel and install it, add the patch, and use rpmbuild to recreate the binary rpm. It can be a bit of a long and involved process if you've not done it before. We can still fix the issue upstream anyway, but it's always nice to get confirmation from the original reporter that it does do the right thing :-)

I've been trying to think up another workaround too, aside from using strace, but I'm not sure at the moment how else we could slow down gfs_controld a bit... I've not come up with anything so far.

You're right... a little too involved for me (not that I wouldn't like to tackle something like that in the future). If you had the binary rpm built, I would certainly test that on my systems. I have at least two that are exhibiting these symptoms, and I agree that leaving strace attached is probably not the best option. Thanks for your help.

Patch now in the GFS2 -nmw tree.

Patch in upstream kernel now.