Bug 711881
Summary: | too funny btrfs st_dev numbers | ||
---|---|---|---|
Product: | [Fedora] Fedora | Reporter: | Karel Zak <kzak> |
Component: | kernel | Assignee: | Josef Bacik <jbacik> |
Status: | CLOSED NOTABUG | QA Contact: | Fedora Extras Quality Assurance <extras-qa> |
Severity: | unspecified | Docs Contact: | |
Priority: | unspecified | ||
Version: | rawhide | CC: | agk, aquini, gansalmon, itamar, jengelh, jonathan, kernel-maint, madhu.chinakonda, rleigh, tmraz, tom |
Target Milestone: | --- | ||
Target Release: | --- | ||
Hardware: | Unspecified | ||
OS: | Unspecified | ||
Whiteboard: | |||
Fixed In Version: | Doc Type: | Bug Fix | |
Doc Text: | Story Points: | --- | |
Clone Of: | Environment: | ||
Last Closed: | 2011-06-08 21:33:04 UTC | Type: | --- |
Description
Karel Zak
2011-06-08 20:50:29 UTC
This is because we create a fake super for every subvol, so you'll notice if you have multiple subvols that the st_dev will be different between subvols.

(In reply to comment #1)
> This is because we create a fake super for every subvol, so you'll notice if
> you have multiple subvols that the st_dev will be different between subvols.

And? I didn't say that multiple st_dev numbers are a problem. I said that stat() is not consistent with /proc/self/mountinfo. Why does the file not contain the private subvolume st_dev numbers?

So what happens is there are several anonymous supers:

1) The main super block that describes the entire fs is one anon super, which has its own s_dev.
2) Every root gets its own anon super, which has its own s_dev.

So mountinfo will have sb->s_dev, but if you stat the file you will get the root's anon_super->s_dev for its device name.

A btrfs volume can only have one "real" root (level 5), can it not? Are the extra vfs supers really needed? Currently (kernel 3.4), they cause rename(2) operations between subvolumes to return -EXDEV, requiring an expensive copy, even though it is in essence the same filesystem.

This is confusing valgrind's sanity checks as well, because the value it gets from stat when a file is mapped does not match what it later finds in /proc/XXX/maps:

https://bugs.kde.org/show_bug.cgi?id=317127

Even ignoring that, I don't see how this is supposed to work. Surely if you are making up fake st_dev values it breaks the unix semantics that st_dev+st_ino uniquely identifies a file? At least it would do if anybody else was to have the same bright idea and make up st_dev values.

The current Btrfs behaviour is quite broken. Programs can and *do* expect the st_dev field of stat(2) to mean something. This does need to be exposed to userspace. Btrfs is alone in having these broken semantics. Why can't Btrfs expose each subvolume as a separate block device?
Obviously not a "real" block device, but a virtual one which would be a proxy for the underlying subvolume. Similar to how LVM exposes virtual devices as LVs in /dev/mapper, but it need not contain any real data. These could be created in /dev using the filesystem label and subvolume paths. These could then be mounted individually, and also fscked (it's not a real device--obviously the tools would need to be able to mount the real device, and fsck the real filesystem, it's just a proxy for the subvolume in that filesystem). It wouldn't matter if these device nodes didn't exist until the Btrfs filesystem was initially mounted, but it would mean that there would be a unique, valid and usable device backing every st_dev that stat(2) could return. udev would be able to readily handle this.

In Debian, we fsck the read-only rootfs in early boot. We normally know the device node for this, but as a fallback in cases where the device does not exist or doesn't match the mount device, we get st_dev from stat and create a temporary device node. Btrfs always triggers the fallback due to the discrepancy between the mount device and the device reported by stat, and then fsck fails due to the block device being invalid. This currently breaks with (and *only* with) Btrfs. This is due to exactly the misbehaviour which Karel reported initially. The fact that Btrfs can't fsck a mounted read-only filesystem is another serious (but separate) issue.

Regards,
Roger

> Why can't Btrfs expose each subvolume as a separate block device?

That is what it essentially is doing, by making files on each subvol have a different st_dev. NFS and tmpfses are in the same boat, they all have a different st_dev. They do not have real block device nodes in /dev (because of their nature), but st_dev is still unique. In that regard, btrfs is just the same - the device node you specify for initial mounting is just an entrypoint, similar to an NFS mount specification (host:/path).
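
For readers who want to see whether a given mount is affected, the comparison described in the Debian comment above (does the mount device match the device reported by stat?) can be sketched roughly as follows. This is only an illustration under stated assumptions, not the actual Debian boot code: the function name `source_matches_mount` is hypothetical, and the sketch assumes the mount source is given as a path to a device node.

```python
import os
import stat


def source_matches_mount(source, mount_point):
    """Return True if `source` is a block device node whose device number
    equals the st_dev that stat() reports for `mount_point`.

    Hypothetical helper for illustration only.
    """
    try:
        src = os.stat(source)
    except OSError:
        # Sources such as "tmpfs" or "host:/path" are not paths to a
        # device node at all, so there is nothing to compare.
        return False
    if not stat.S_ISBLK(src.st_mode):
        return False
    # st_rdev is the device number the node represents; st_dev is the
    # device number of the filesystem that contains the mount point.
    return src.st_rdev == os.stat(mount_point).st_dev
```

Going by the discussion in this thread, one would expect this to return False for a Btrfs mount, since stat reports the subvolume's anonymous device number rather than that of the node used for mounting.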
Hi Jan.

While I agree with the facts of what you're saying, nfs and tmpfs are special cases. They aren't backed by block-based storage; both are completely virtual. We don't mount them using a block device. We don't try to fsck them. Btrfs violates the reasonable expectations of existing software, and this does result in breakage.

Regards,
Roger
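
The inconsistency this report is about, stat() disagreeing with /proc/self/mountinfo, can be checked with a short script along the following lines. It is a minimal sketch, not a tool referenced anywhere in this bug: the function names are made up for illustration, and it assumes the usual mountinfo layout in which the third field is `major:minor` and the fifth field is the mount point. Mount points containing escaped characters (such as `\040` for a space) are not decoded and are simply skipped when stat() fails on them.

```python
import os


def parse_mountinfo(text):
    """Map each mount point to the (major, minor) pair listed in mountinfo text."""
    devices = {}
    for line in text.splitlines():
        fields = line.split()
        if len(fields) < 5:
            continue  # not a mountinfo record
        major, minor = fields[2].split(":")
        # A later record for the same mount point overwrites an earlier one.
        devices[fields[4]] = (int(major), int(minor))
    return devices


def stat_device(path):
    """Return the (major, minor) pair that stat() reports as st_dev for path."""
    st = os.stat(path)
    return (os.major(st.st_dev), os.minor(st.st_dev))


def find_mismatches(mountinfo_text):
    """List (mount_point, listed, seen) where mountinfo and stat() disagree."""
    result = []
    for mount_point, listed in parse_mountinfo(mountinfo_text).items():
        try:
            seen = stat_device(mount_point)
        except OSError:
            continue  # unreachable, escaped, or permission-protected path
        if seen != listed:
            result.append((mount_point, listed, seen))
    return result


def main():
    try:
        with open("/proc/self/mountinfo") as f:
            text = f.read()
    except OSError:
        return  # no procfs available on this system
    for mount_point, listed, seen in find_mismatches(text):
        print("%s: mountinfo %d:%d, stat %d:%d" % ((mount_point,) + listed + seen))


if __name__ == "__main__":
    main()
```

On a system where every mount's st_dev agrees with mountinfo the script prints nothing; based on the explanation given earlier in this thread, Btrfs mounts would be expected to show up in the output.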