Bug 711881

Summary: too funny btrfs st_dev numbers
Product: [Fedora] Fedora
Reporter: Karel Zak <kzak>
Component: kernel
Assignee: Josef Bacik <jbacik>
Status: CLOSED NOTABUG
QA Contact: Fedora Extras Quality Assurance <extras-qa>
Severity: unspecified
Priority: unspecified
Version: rawhide
CC: agk, aquini, gansalmon, itamar, jengelh, jonathan, kernel-maint, madhu.chinakonda, rleigh, tmraz, tom
Last Closed: 2011-06-08 21:33:04 UTC

Description Karel Zak 2011-06-08 20:50:29 UTC
btrfs:

$ grep /dev/sdb1 /proc/self/mountinfo
53 21 0:40 / /mnt/test rw,relatime shared:5 - btrfs /dev/sdb1 rw,ssd
      ^^^^
$ stat --format "%d" /mnt/test/a
42
^^^

NFS (you can also try cifs, ...):

$ grep nfs /proc/self/mountinfo
51 21 0:39 / /mnt/store rw,relatime shared:4 - nfs sr.net.home:/mnt/store ...
      ^^^^
# stat --format "%d" /mnt/store/a
39
^^^

Why does stat() on btrfs not return the same devno that the kernel uses for the filesystem in /proc/self/mountinfo? It would be nice to have a way to map files to filesystems without caring about paths.
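
The comparison above can be sketched in a few lines of Python. This is only an illustration of the path-free lookup being asked for, not an existing tool; the helper names (`mountinfo_devno`, `stat_devno`, `find_mount_by_dev`) are made up for this example. It assumes the third mountinfo field is `major:minor` and the fifth is the mount point.

```python
import os


def mountinfo_devno(line):
    """Return (major, minor) from the third field of a mountinfo line."""
    major, minor = line.split()[2].split(":")
    return int(major), int(minor)


def stat_devno(st_dev):
    """Split a st_dev value, as returned by stat(), into (major, minor)."""
    return os.major(st_dev), os.minor(st_dev)


def find_mount_by_dev(st_dev, mountinfo_path="/proc/self/mountinfo"):
    """Return the mount points whose mountinfo devno equals st_dev.

    Mount points are returned as written in the file (octal escapes
    such as \\040 are not decoded in this sketch).
    """
    wanted = stat_devno(st_dev)
    with open(mountinfo_path) as f:
        return [line.split()[4] for line in f
                if mountinfo_devno(line) == wanted]
```

With the values from the report, the NFS file (st_dev 39) matches its mountinfo entry 0:39, while the btrfs file (st_dev 42) matches no mountinfo line, because the btrfs mount is listed as 0:40.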

Comment 1 Josef Bacik 2011-06-08 21:33:04 UTC
This is because we create a fake super for every subvol, so you'll notice if you have multiple subvols that the st_dev will be different between subvols.

Comment 2 Karel Zak 2011-06-08 21:42:31 UTC
(In reply to comment #1)
> This is because we create a fake super for every subvol, so you'll notice if
> you have multiple subvols that the st_dev will be different between subvols.

And? I didn't say that multiple st_dev numbers are a problem. I said that stat() is not consistent with /proc/self/mountinfo. Why does the file not contain the private subvolume st_dev numbers?

Comment 3 Josef Bacik 2011-06-09 13:48:25 UTC
So what happens is that there are several anonymous supers:

1) The main super block that describes the entire fs is one anon super, which has its own s_dev
2) Every root gets its own anon super, which has its own s_dev

So mountinfo will have sb->s_dev, but if you stat a file you will get the root's anon_super->s_dev as its device number.
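
As a rough illustration of the two points above, here is a toy model in Python. It is not kernel code: `AnonDevAllocator` and `ToyBtrfs` are hypothetical names, and the only thing taken from the kernel is that anonymous device numbers use major 0. The starting minor is arbitrary.

```python
import itertools


class AnonDevAllocator:
    """Toy stand-in for the kernel's anonymous device number allocator."""

    def __init__(self, first_minor=1):
        self._minors = itertools.count(first_minor)

    def alloc(self):
        # Anonymous (non block-backed) devices use major number 0.
        return (0, next(self._minors))


class ToyBtrfs:
    """Models one filesystem-wide super plus one anon super per root."""

    def __init__(self, allocator):
        self._allocator = allocator
        self.s_dev = allocator.alloc()  # what mountinfo would report
        self.root_devs = {}             # subvolume name -> its own s_dev

    def add_subvol(self, name):
        self.root_devs[name] = self._allocator.alloc()

    def mountinfo_dev(self):
        return self.s_dev

    def stat_dev(self, subvol):
        # stat() on a file reports the s_dev of the root it lives in.
        return self.root_devs[subvol]
```

In this model every subvolume reports a different device number, and none of them equals the number shown for the mount itself, which is the mismatch described in the report.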

Comment 4 Jan Engelhardt 2012-09-11 03:51:16 UTC
A btrfs volume can only have one “real” root (level 5), can it not?

Are the extra vfs supers really needed? Currently (kernel 3.4), they cause rename(2) operations between subvolumes to return -EXDEV, requiring an expensive copy, even though it is in essence the same filesystem.
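
For reference, this is roughly what userspace ends up doing when rename(2) fails with EXDEV. A minimal sketch in Python; `move_file` is a hypothetical helper, and the `rename` parameter exists only so the EXDEV path can be exercised without two filesystems.

```python
import errno
import os
import shutil


def move_file(src, dst, rename=os.rename):
    """Rename src to dst, falling back to copy + unlink on EXDEV."""
    try:
        rename(src, dst)
        return "renamed"
    except OSError as exc:
        if exc.errno != errno.EXDEV:
            raise
    # Cross-device case: the data has to be copied, then the source removed.
    shutil.copy2(src, dst)
    os.unlink(src)
    return "copied"
```

The fallback is not atomic and costs a full data copy, which is the expense mentioned above.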

Comment 5 Tom Hughes 2013-03-21 13:39:30 UTC
This is confusing valgrind's sanity checks as well, because the value it gets from stat when a file is mapped does not match what it later finds in /proc/XXX/maps:

  https://bugs.kde.org/show_bug.cgi?id=317127

Even ignoring that, I don't see how this is supposed to work - surely if you are making up fake st_dev values it breaks the unix semantics that st_dev+st_ino uniquely identifies a file? At least it would if anybody else were to have the same bright idea and make up st_dev values.
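
The identity check that programs rely on looks roughly like this. A minimal sketch in Python; `file_id` and `same_file` are illustrative names, not taken from valgrind or any other tool.

```python
import os


def file_id(path):
    """Return the (st_dev, st_ino) pair conventionally used as a file's identity."""
    st = os.stat(path)
    return (st.st_dev, st.st_ino)


def same_file(a, b):
    """True if both paths resolve to the same (st_dev, st_ino) pair."""
    return file_id(a) == file_id(b)
```

The check is only sound if no two filesystems hand out the same st_dev, and if the st_dev seen through stat() matches the device number reported through other interfaces such as /proc/XXX/maps.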

Comment 6 Roger Leigh 2013-05-10 09:26:58 UTC
The current Btrfs behaviour is quite broken.  Programs can and *do* expect the st_dev field of stat(2) to mean something.  This does need to be exposed to userspace.  Btrfs is alone in having these broken semantics.

Why can't Btrfs expose each subvolume as a separate block device?  Obviously not a "real" block device, but a virtual one which would be a proxy for the underlying subvolume.  Similar to how LVM exposes virtual devices as LVs in /dev/mapper, but it need not contain any real data.  These could be created in /dev using the filesystem label and subvolume paths.  These could then be mounted individually, and also fscked (it's not a real device--obviously the tools would need to be able to mount the real device, and fsck the real filesystem, it's just a proxy for the subvolume in that filesystem).  It wouldn't matter if these device nodes didn't exist until the Btrfs filesystem was initially mounted, but it would mean that there would be a unique, valid and usable device backing every st_dev that stat(2) could return.  udev would be able to readily handle this.

In Debian, we fsck the read-only rootfs in early boot.  We normally know the device node for this, but as a fallback in cases where the device does not exist or doesn't match the mount device, we get st_dev from stat and create a temporary device node.  Btrfs always triggers the fallback due to the discrepancy between the mount device and the device reported by stat, and then fsck fails due to the block device being invalid.  This currently breaks with (and *only* with) Btrfs.  This is due to exactly the misbehaviour which Karel reported initially.
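
The failure can be shown with a small check for whether a st_dev has a block device behind it. This is a sketch in Python, not the Debian init script logic; `backing_block_device` is a hypothetical helper, and it assumes sysfs is mounted at /sys so that block devices appear as /sys/dev/block/MAJOR:MINOR.

```python
import os


def backing_block_device(path, sysfs_root="/sys/dev/block"):
    """Return the sysfs entry for the block device behind path, or None.

    None means the st_dev reported by stat() does not correspond to any
    block device the kernel knows about.
    """
    st_dev = os.stat(path).st_dev
    entry = os.path.join(
        sysfs_root, "%d:%d" % (os.major(st_dev), os.minor(st_dev)))
    return entry if os.path.exists(entry) else None
```

For a file on a filesystem mounted directly from a block device, the lookup would normally succeed. For the st_dev values shown in the report (major 0) there is no such entry, so a device node created from st_dev would not refer to anything fsck can open.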

The fact that Btrfs can't fsck a mounted read-only filesystem is another serious (but separate) issue.


Regards,
Roger

Comment 7 Jan Engelhardt 2013-05-10 10:41:30 UTC
>Why can't Btrfs expose each subvolume as a separate block device?

That is what it essentially is doing, by making files on each subvol have a different st_dev. NFS and tmpfses are in the same boat, they all have a different st_dev. They do not have real block device nodes in /dev (because of their nature), but st_dev is still unique. In that regard, btrfs is just the same - the device node you specify for initial mounting is just an entrypoint, similar to an NFS mount specification (host:/path).

Comment 8 Roger Leigh 2013-05-10 10:54:07 UTC
Hi Jan.  While I agree with the facts of what you're saying, nfs and tmpfs are special cases.  They aren't backed by block-based storage; both are completely virtual.  We don't mount them using a block device.  We don't try to fsck them.  Btrfs violates the reasonable expectations of existing software, and this does result in breakage.

Regards,
Roger