+++ This bug was initially created as a clone of Bug #354201 +++ This is for the gfs2 kernel portion of the fix. In order for gfs2_tool to correctly find the tuning files in sysfs, it needs to find them by the device ID. This bug record is for patching the RHEL5.2 gfs2 kernel code so that it puts "major:minor" in the /sys/fs/gfs2/<s_id>/id file rather than the s_id value. As seen in the trace, the gfs2_tool improperly parses dev names if they have an underscore as in seen with cciss (HP) disks ---------------------- [root@ustst-d0024 ~]# gfs2_tool settune /uss incore_log_blocks 4096 gfs2_tool: unknown mountpoint /uss, no fsname for c0d1p1 [root@ustst-d0024 ~]# gfs2_tool settune /uss/ incore_log_blocks 4096 gfs2_tool: unknown mountpoint /uss/, no device found [root@ustst-d0024 ~]# strace gfs2_tool settune /uss incore_log_blocks 4096 execve("/sbin/gfs2_tool", ["gfs2_tool", "settune", "/uss", "incore_log_blocks", "4096"], [/* 21 vars */]) = 0 brk(0) = 0x9fc0000 access("/etc/ld.so.preload", R_OK) = -1 ENOENT (No such file or directory) open("/etc/ld.so.cache", O_RDONLY) = 3 fstat64(3, {st_mode=S_IFREG|0644, st_size=45044, ...}) = 0 mmap2(NULL, 45044, PROT_READ, MAP_PRIVATE, 3, 0) = 0xb7f9b000 close(3) = 0 open("/lib/libc.so.6", O_RDONLY) = 3 read(3, "\177ELF\1\1\1\0\0\0\0\0\0\0\0\0\3\0\3\0\1\0\0\0000\357\226\0004\0\0\0"..., 512) = 512 fstat64(3, {st_mode=S_IFREG|0755, st_size=1585788, ...}) = 0 mmap2(NULL, 4096, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 0xb7f9a000 mmap2(0x959000, 1308068, PROT_READ|PROT_EXEC, MAP_PRIVATE|MAP_DENYWRITE, 3, 0) = 0x959000 mmap2(0xa93000, 12288, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_FIXED|MAP_DENYWRITE, 3, 0x139) = 0xa93000 mmap2(0xa96000, 9636, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_FIXED|MAP_ANONYMOUS, -1, 0) = 0xa96000 close(3) = 0 mmap2(NULL, 4096, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 0xb7f99000 set_thread_area({entry_number:-1 -> 6, base_addr:0xb7f996c0, limit:1048575, seg_32bit:1, contents:0, read_exec_only:0, limit_in_pages:1, seg_not_present:0, useable:1}) = 0 mprotect(0xa93000, 8192, PROT_READ) = 0 mprotect(0x955000, 4096, PROT_READ) = 0 munmap(0xb7f9b000, 45044) = 0 open("/uss", O_RDONLY|O_LARGEFILE) = 3 close(3) = 0 brk(0) = 0x9fc0000 brk(0x9fe1000) = 0x9fe1000 open("/proc/mounts", O_RDONLY|O_LARGEFILE) = 3 fstat64(3, {st_mode=S_IFREG|0444, st_size=0, ...}) = 0 mmap2(NULL, 4096, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 0xb7fa5000 read(3, "rootfs / rootfs rw 0 0\n/dev/root"..., 4096) = 872 open("/proc/devices", O_RDONLY|O_LARGEFILE) = 4 fstat64(4, {st_mode=S_IFREG|0444, st_size=0, ...}) = 0 mmap2(NULL, 4096, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 0xb7fa4000 read(4, "Character devices:\n 1 mem\n 4 /"..., 4096) = 410 close(4) = 0 munmap(0xb7fa4000, 4096) = 0 stat64("/dev/cciss/c0d1p1", {st_mode=S_IFBLK|0640, st_rdev=makedev(104, 17), ...}) = 0 close(3) = 0 munmap(0xb7fa5000, 4096) = 0 open("/sys/fs/gfs2", O_RDONLY|O_NONBLOCK|O_LARGEFILE|O_DIRECTORY) = 3 fstat64(3, {st_mode=S_IFDIR|0755, st_size=0, ...}) = 0 fcntl64(3, F_SETFD, FD_CLOEXEC) = 0 getdents64(3, /* 3 entries */, 4096) = 80 open("/sys/fs/gfs2/cciss_c0d1p1/id", O_RDONLY|O_LARGEFILE) = 4 read(4, "cciss/c0d1p1\n", 4096) = 13 close(4) = 0 getdents64(3, /* 0 entries */, 4096) = 0 close(3) = 0 write(2, "gfs2_tool: ", 11gfs2_tool: ) = 11 write(2, "unknown mountpoint /uss, no fsna"..., 46unknown mountpoint /uss, no fsname for c0d1p1 ) = 46 exit_group(1) = ? -- Additional comment from rkenna on 2007-10-26 10:53 EST -- Reproducer: Mount a gfs2 file system directly on a dev (not LVM2) with an underscore (mknod an alias if needed) Run the command [root@et-virt05 dev]# ~/gfs2_tool settune /uss incore_log_blocks 4096 /root/gfs2_tool: unknown mountpoint /uss, no fsname for xxx_yyy -- Additional comment from rkenna on 2007-10-26 10:54 EST -- Trace with proposed change for the library [root@et-virt05 dev]# strace ~/gfs2_tool settune /uss incore_log_blocks 4096 execve("/root/gfs2_tool", ["/root/gfs2_tool", "settune", "/uss", "incore_log_blocks", "4096"], [/* 24 vars */]) = 0 brk(0) = 0x1036000 mmap(NULL, 4096, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 0x2aaaaaaab000 uname({sys="Linux", node="et-virt05.lab.boston.redhat.com", ...}) = 0 access("/etc/ld.so.preload", R_OK) = -1 ENOENT (No such file or directory) open("/etc/ld.so.cache", O_RDONLY) = 3 fstat(3, {st_mode=S_IFREG|0644, st_size=104403, ...}) = 0 mmap(NULL, 104403, PROT_READ, MAP_PRIVATE, 3, 0) = 0x2aaaaaaac000 close(3) = 0 open("/lib64/libc.so.6", O_RDONLY) = 3 read(3, "\177ELF\2\1\1\0\0\0\0\0\0\0\0\0\3\0>\0\1\0\0\0\240\331A\250>\0\0\0"..., 832) = 832 fstat(3, {st_mode=S_IFREG|0755, st_size=1687464, ...}) = 0 mmap(NULL, 4096, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 0x2aaaaaac6000 mmap(0x3ea8400000, 3469464, PROT_READ|PROT_EXEC, MAP_PRIVATE|MAP_DENYWRITE, 3, 0) = 0x3ea8400000 mprotect(0x3ea8546000, 2097152, PROT_NONE) = 0 mmap(0x3ea8746000, 20480, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_FIXED|MAP_DENYWRITE, 3, 0x146000) = 0x3ea8746000 mmap(0x3ea874b000, 16536, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_FIXED|MAP_ANONYMOUS, -1, 0) = 0x3ea874b000 close(3) = 0 mmap(NULL, 4096, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 0x2aaaaaac7000 arch_prctl(ARCH_SET_FS, 0x2aaaaaac7240) = 0 mprotect(0x3ea8746000, 16384, PROT_READ) = 0 mprotect(0x3ea8219000, 4096, PROT_READ) = 0 munmap(0x2aaaaaaac000, 104403) = 0 brk(0) = 0x1036000 brk(0x1057000) = 0x1057000 open("/proc/mounts", O_RDONLY) = 3 lstat("/uss", {st_mode=S_IFDIR|0755, st_size=3864, ...}) = 0 fstat(3, {st_mode=S_IFREG|0444, st_size=0, ...}) = 0 mmap(NULL, 4096, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 0x2aaaaaaac000 read(3, "rootfs / rootfs rw 0 0\n/dev/root"..., 4096) = 871 close(3) = 0 munmap(0x2aaaaaaac000, 4096) = 0 lstat("/uss", {st_mode=S_IFDIR|0755, st_size=3864, ...}) = 0 open("/proc/mounts", O_RDONLY) = 3 fstat(3, {st_mode=S_IFREG|0444, st_size=0, ...}) = 0 mmap(NULL, 4096, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 0x2aaaaaaac000 read(3, "rootfs / rootfs rw 0 0\n/dev/root"..., 4096) = 871 open("/proc/devices", O_RDONLY) = 4 fstat(4, {st_mode=S_IFREG|0444, st_size=0, ...}) = 0 mmap(NULL, 4096, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 0x2aaaaaaad000 read(4, "Character devices:\n 1 mem\n 4 /"..., 4096) = 454 close(4) = 0 munmap(0x2aaaaaaad000, 4096) = 0 stat("/dev/xxx_yyy", {st_mode=S_IFBLK|0640, st_rdev=makedev(8, 17), ...}) = 0 close(3) = 0 munmap(0x2aaaaaaac000, 4096) = 0 open("/sys/fs/gfs2", O_RDONLY|O_NONBLOCK|O_DIRECTORY) = 3 fstat(3, {st_mode=S_IFDIR|0755, st_size=0, ...}) = 0 fcntl(3, F_SETFD, FD_CLOEXEC) = 0 getdents(3, /* 4 entries */, 4096) = 120 open("/sys/fs/gfs2/sdb1/id", O_RDONLY) = 4 read(4, "sdb1\n", 4096) = 5 close(4) = 0 open("/sys/fs/gfs2/RHMag-Phys:guest_roots/id", O_RDONLY) = 4 read(4, "dm-3\n", 4096) = 5 close(4) = 0 getdents(3, /* 0 entries */, 4096) = 0 close(3) = 0 write(2, "/root/gfs2_tool: ", 17/root/gfs2_tool: ) = 17 write(2, "unknown mountpoint /uss, no fsna"..., 47unknown mountpoint /uss, no fsname for xxx_yyy ) = 47 exit_group(1) = ? -- Additional comment from rpeterso on 2007-10-26 11:34 EST -- It's my belief that it has nothing to do with the underscore. I believe that the problem lies when the device name is in a subdirectory, at least in the case of cciss. (I haven't tried the reproducer yet). When the gfs2 kernel code registers the file system in the sysfs, it has to substitute slash (/) chars for underscore (_). Thus, /dev/cciss/c0d0p0 is registered as /dev/cciss_c0d0p0. The problem is that the gfs2 utils doesn't do that substitution. We'd likely have this problem with AOE as well, where the device is something like: /dev/etherd/e1.1p1 which gets registered as /dev/etherd_e1.1p1. We should probably change the summary accordingly. The fix should be simple and straightforward. How soon do they want or need a fix (so I can prioritize)? -- Additional comment from rkenna on 2007-10-26 12:05 EST -- re: Comment #3 -- Additional comment from rkenna on 2007-10-26 12:06 EST -- re: Comment #3 I have a failure case that works with 0 brw-r----- 1 root disk 8, 17 Oct 26 10:38 /dev/sdb1 and fails with 0 brw-r----- 1 root root 8, 17 Oct 26 10:42 /dev/xxx_yyy -- Additional comment from rpeterso on 2007-10-29 16:05 EST -- I verified that we are dealing with two problems: (1) The AOL problem where a sub-subdirectory like /dev/cciss/c0d0p1 doesn't work, and (2) The problem reported from comment #5 where alternate devices created with mknod don't work. Again, the underscore isn't the cause. I've got a fix that seems to work for both cases. Unfortunately, it's backed up behind bug #349601, which I can ship easily; I just wanted to give people more of a chance to respond to my proposed output format. -- Additional comment from rpeterso on 2007-10-30 10:45 EST -- Created an attachment (id=243451) Preliminary patch Now that I've committed bug #349601, I can move on with this one. This is a patch that mostly resolves the problem. There is still one case where the problem is not solved, and I'll explain why: The problem lies when gfs2_tool is trying to figure out where the gfs2 kernel code will keep its sysfs interface for the mounted fs. The gfs2 kernel code uses /sys/fs/gfs2/<SOME NAME>/* for these files. In the case where the file system has a lock table, gfs2 uses the lock table name. Something like: /sys/fs/gfs2/bobs_roth\:test_gfs/*. That's easy for gfs2_tool to figure out, by reading the gfs2 superblock of the device. If the lock table is NULL--meaning it's a stand-alone file system, gfs2 uses VFS's "s_id" value. In cases where the device mounted is something like "/dev/sdb1", the s_id will be "sdb1" and thus, gfs2 will use /sys/fs/gfs2/sdb1/*. That's also not a problem for gfs2_tool to figure out. However, here's the catch: If the user creates their own device equivalent, s_id won't match the device name. So if I try to simulate customer hardware I don't actually have, it's not going to work. For example, if I do: mkfs /dev/cciss mknod /dev/cciss/c0d0p1 b 8 17 mount -tgfs2 /dev/cciss/c0d0p0 /mnt/gfs2 This will actually mount the second partition on the second SCSI drive under /mnt/gfs2. However, the gfs2 kernel code will still create its sysfs files under: /sys/fs/gfs2/sdb1/*, NOT something like the device name. So gfs2_tool won't be able to find the sysfs files. The previous version checked the last basename of the device against all the gfs2 sysfs files it could find. However, in this case, the "id" file will still be "sdb1", again, named after the VFS s_id. AFAIK, there's no good way for gfs2_tool to determine the VFS s_id of a mounted file system. The only way I can think of to do this properly is to change the gfs2 kernel code so that a statfs from userland can return a unique identifier in the f_fsid field of the struct statfs. This seems to be what some other file systems do. Right now, gfs2 always returns "0, 0" as the f_fsid. Now normally customers aren't going to be creating their own phantom devices that don't match what vfs thinks. That's the kind of thing we do in order to test customer scenarios. So our options are: (a) document around this restriction, (b) fix gfs2 kernel to pass something in the statfs f_fsid and use it in gfs2_tool, or (c) maybe someone can think of a magic trick to get vfs's s_id from user space. I didn't find any offhand. -- Additional comment from rpeterso on 2007-10-31 10:02 EST -- Adding Steve Whitehouse to the cc list to get his input regarding comment #7. -- Additional comment from swhiteho on 2007-10-31 10:33 EST -- Please don't do this by matching device names. Ideally you ought to be matching device numbers (not major/minor, just the dev_t directly). Use stat to get the dev_t for the fs in question and then its a question of how to find the dev_t for the /sys directory. Then you can iterate over the /sys directories until you find the matching one. Ideally we'd have a device node in there which would be set with the same device number as the device that the fs is mounted on, then we can just compare the two numbers. That would be my preferred approach. If that turns out to be impossible, then it looks like we'll need (at least to attempt) to reconstruct the device path from the info thats in that directory and then use stat to turn the path into a device number. I really don't like that as the mounter of the fs might be in a different fs name space to the person who is running the user tool so the paths might not be the same, but we might have to do that as a last resort if the other method can't be made to work. -- Additional comment from rpeterso on 2007-10-31 18:22 EST -- Created an attachment (id=244921) Proposed gfs2 kernel patch This patch puts the dev_t value into /sys/fs/gfs2/<fs>/id so that userland can pick it up. This enables userland to determine which file system is being referenced. Note that there's a fundamental translation that needs to be done. The kernel references dev_t as ((major << 20) | minor), whereas userland uses ((major << 8) | minor. This translation can be done in the kernel or in userland. Since gfs2's sysfs is the accepted interface into userland, I thought it was best done in the kernel code that manages that. Let me know if you disagree. This makes the userland patch much simpler (also tested and it works). If you like this patch, I'll post it to cluster-devel so it can go upstream, and I'll open a bugzilla to get it into rhel5.2. I'll wait for feedback from Steve W. -- Additional comment from rpeterso on 2007-10-31 18:29 EST -- Created an attachment (id=244931) Replacement patch that relies on the kernel patch This patch is similar to how the code worked before. It gets the dev_t from the stat command, then it tries all the /sys/fs/gfs2/*/id files looking for a matching dev_t value. It's tested and works, but it relies on the kernel patch, so pending its acceptance. -- Additional comment from rpeterso on 2007-11-02 09:49 EST -- Created an attachment (id=246731) Proposed gfs2 kernel patch--try #2 This gfs2 kernel patch presents the device "major:minor" in the ID file, in that order. -- Additional comment from rpeterso on 2007-11-02 09:51 EST -- Created an attachment (id=246741) Replacement gfs2_tool patch that relies on kernel patch
Requesting ack flags to get this in the 5.2 kernel.
This request was evaluated by Red Hat Product Management for inclusion in a Red Hat Enterprise Linux maintenance release. Product Management has requested further review of this request by Red Hat Engineering, for potential inclusion in a Red Hat Enterprise Linux Update release for currently deployed products. This request is not yet committed for inclusion in an Update release.
Dave Teigland had a good point about being careful when changing the abi. Following his lead, I found out that if we switch the value of the "id" file as suggested in the patch, older versions of the gfs2_tool that expect it the other way will break. In other words, the users would then be required to upgrade their gfs2_tool. That might be acceptable for RHEL because we can control it, but it probably won't fly for upstream. Therefore, I propose we instead export the device id to a new sysfs file called device_id. That way we won't break anything. My big question is: Are there any kabi requirements here that I don't know about? Is it okay to create a new sysfs file for gfs2? How will that be viewed by the upstream community?
We discussed the patch in Monday's gfs2 tech meeting, and we decided that there shouldn't be an issue exporting the device "major:minor" in the ID file. Other file systems do this, so it should be okay for us. Right now, gfs2 is still in tech preview and no one in RHEL or the upstream community should be under the impression that it's ready for enterprise use yet, so making this change to the kernel abi seemed acceptable to everyone in the meeting. Therefore, we decided to go ahead with the patch as is and not try to create a new device_id file. I've tested with that already. I'll attach the patch to this bz and post it to rhkernel-list for inclusion in RHEL5.2. The kernel should really have a better solution for this that works for all file systems. Perhaps we should create some kind of interface that works for all file systems and propose it on lkml. Time permitting.
Created attachment 249291 [details] Patch to fix the problem Here is the patch I'm going to post to rhkernel-list.
Posted to rhkernel-list. Turning it over to Don Zickus.
in 2.6.18-58.el5 You can download this test kernel from http://people.redhat.com/dzickus/el5
An advisory has been issued which should help the problem described in this bug report. This report is therefore being closed with a resolution of ERRATA. For more information on the solution and/or where to find the updated files, please follow the link below. You may reopen this bug report if the solution does not work for you. http://rhn.redhat.com/errata/RHBA-2008-0314.html