Bug 1294531
| Field | Value | Field | Value |
|---|---|---|---|
| Summary: | btrfs device delete does not complete, hangs | | |
| Product: | Red Hat Enterprise Linux 7 | Reporter: | Konstantin Olchanski <olchansk> |
| Component: | btrfs-progs | Assignee: | fs-maint |
| Status: | CLOSED WONTFIX | QA Contact: | Filesystem QE <fs-qe> |
| Severity: | unspecified | Docs Contact: | |
| Priority: | unspecified | | |
| Version: | 7.2 | CC: | bugzilla, esandeen, xzhou |
| Target Milestone: | rc | | |
| Target Release: | --- | | |
| Hardware: | Unspecified | | |
| OS: | Unspecified | | |
| Whiteboard: | | | |
| Fixed In Version: | | Doc Type: | If docs needed, set a value |
| Doc Text: | | Story Points: | --- |
| Clone Of: | | Environment: | |
| Last Closed: | 2019-02-19 18:50:09 UTC | Type: | Bug |
| Regression: | --- | Mount Type: | --- |
| Documentation: | --- | CRM: | |
| Verified Versions: | | Category: | --- |
| oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: | |
| Cloudforms Team: | --- | Target Upstream Version: | |
| Embargoed: | | | |
Description
Konstantin Olchanski
2015-12-28 20:18:29 UTC
Btrfs doesn't yet have a device 'faulty' state like md/mdadm, even upstream. It will try to read/write to defective devices indefinitely, and maybe the resulting flood of retries is what's slowing things down. https://btrfs.wiki.kernel.org/index.php/Project_ideas#Take_device_with_heavy_IO_errors_offline_or_mark_as_.22unreliable.22

'dev delete <dev>' does not consider the specified device actually deleted (or ignorable) until all of its data is replicated on other devices, i.e. a third copy must be created before sde1 is considered no longer necessary and the device is released. Instead, physically remove the device, or issue 'echo 1 > /sys/block/device-name/device/delete' and then use 'btrfs dev delete missing' to initiate the replication of the missing data that was on the bad device onto the remaining devices. Alternatively, when replacing the bad device, it's better to use 'btrfs replace', either with the -r option (mostly ignore the bad device unless needed), or after physically removing or sysfs-deleting it first.

I do not think your instructions will work.

a) If I physically remove the disk, it will not become "missing" in btrfs; instead the syslog will fill with disk errors.
b) If I "echo 1 > /sys/block/.../delete", I think the same thing will happen.

I suspect the only way to mark a disk as missing (permitting "btrfs dev delete missing") is through a reboot, but as we already know, RHEL 7.2 will not boot from a degraded btrfs filesystem. A catch-22 if there ever was one. K.O.

P.S. I see all this as a very bad sign. Obviously, the btrfs authors failed to think through the most simple failure scenario (a dead disk). Makes one wonder what other failure modes they ignored or dismissed as "an exercise for the user" (as in "restore from backup and start from scratch" - I do read the btrfs mailing lists).

P.P.S. As for my machine with the stuck "btrfs dev delete", after 4 days of "maybe it just takes a very very very long time, let's wait", the machine died (no ping). K.O.
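The advice above can be sketched as a shell sequence. This is a minimal sketch, not a tested recipe: it assumes the failing device is /dev/sde1, its replacement is /dev/sdg1, and the filesystem is mounted at /mnt (all hypothetical names; substitute your own). It requires root and, for the 'delete missing' path, enough remaining devices and free space to hold the RAID1 copies.

```shell
# 1. Force the kernel to drop the failing disk, so btrfs stops
#    retrying I/O against it (note: the sysfs path uses the whole
#    disk name, sde, not the partition sde1):
echo 1 > /sys/block/sde/device/delete

# 2. Ask btrfs to replicate the data that lived on the now-missing
#    device onto the remaining devices:
btrfs device delete missing /mnt

# Alternative: replace the bad device in place with 'btrfs replace';
# -r reads from the bad device only when no other copy exists:
btrfs replace start -r /dev/sde1 /dev/sdg1 /mnt
btrfs replace status /mnt
```

Whether step 2 works on a mounted filesystem (rather than one remounted with -o degraded) is exactly the point of contention in the comments that follow.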
(In reply to Konstantin Olchanski from comment #3)

> a) If I physically remove the disk, it will not become "missing" in btrfs; instead the syslog will fill with disk errors.
> b) If I "echo 1 > /sys/block/.../delete", I think the same thing will happen.

Every time I've tried either of these, 'btrfs fi show' has always immediately displayed the missing device as missing.

> I suspect the only way to mark a disk as missing (permitting "btrfs dev delete missing") is through a reboot, but as we already know, RHEL7.2 will not boot from a degraded btrfs filesystem.

Try it first? I've done this a bunch of times, and in the normal case it does work. When it doesn't work, it's because something else is wrong, and thus an edge case, and it requires supplying a lot of state information, because "it doesn't work" is just totally not revealing.

I only tried to simulate disk failure by disconnecting the disk under el7.0; I did not try with el7.1 or el7.2. I am pretty sure I did not see the disconnected disk go "missing" then. I will try again with el7.2 early next week when I can physically access the machine.

BTW, what you say is inconsistent with the btrfs documentation: https://btrfs.wiki.kernel.org/index.php/Using_Btrfs_with_Multiple_Devices says: "btrfs device delete missing tells btrfs to remove the first device that is described by the filesystem metadata but not present when the FS was mounted." Which I read as: to use "btrfs dev delete missing", the btrfs filesystem has to be unmounted, then remounted in degraded mode. For the "/" filesystem, that means the machine has to be rebooted. A search for "missing" in the btrfs wiki (https://btrfs.wiki.kernel.org/index.php?title=Special%3ASearch&search=missing&go=Go) does not show any additional information on what "missing" means, what it does, and how one gets there. K.O.

We are both right. A physical disconnect of the disk does make "btrfs fi show /" report "some devices missing" (as Chris M. says), yet all other commands still see the disconnected device and btrfs still tries to write to it (as I remember). Now, from this state, I confirm that "delete missing" does not work:

```
[root@daq11 ~]# btrfs dev delete missing /
ERROR: error removing the device 'missing' - no missing devices found to remove
[root@daq11 ~]# btrfs dev delete /dev/sde1 /
ERROR: error removing the device '/dev/sde1' - No such file or directory
```

Here is additional information:

```
[root@daq11 ~]# btrfs fi show /
Label: 'centos_daq11'  uuid: 8ef30d1e-8671-4f99-9032-3fb1ca9ccf99
        Total devices 6 FS bytes used 699.64GiB
        devid    1 size 1.75TiB used 263.00GiB path /dev/sda3
        devid    2 size 1.75TiB used 264.00GiB path /dev/sdb3
        devid    6 size 1.82TiB used 310.03GiB path /dev/sdf1
        devid    8 size 1.75TiB used 263.00GiB path /dev/sdd3
        devid    9 size 1.75TiB used 26.00GiB path /dev/sdc3
        *** Some devices missing

btrfs-progs v3.19.1

[root@daq11 ~]# btrfs dev usage /
/dev/sda3, ID: 1
   Device size:     1.75TiB
   Data,RAID1:      236.00GiB
   Metadata,RAID1:  27.00GiB
   Unallocated:     1.49TiB

/dev/sdb3, ID: 2
   Device size:     1.75TiB
   Data,RAID1:      241.00GiB
   Metadata,RAID1:  23.00GiB
   Unallocated:     1.49TiB

/dev/sdc3, ID: 9
   Device size:     1.75TiB
   Data,RAID1:      26.00GiB
   Unallocated:     1.72TiB

/dev/sdd3, ID: 8
   Device size:     1.75TiB
   Data,RAID1:      247.00GiB
   Metadata,RAID1:  16.00GiB
   Unallocated:     1.49TiB

/dev/sde1, ID: 5
   Device size:     0.00B
   Data,RAID1:      276.00GiB
   Metadata,RAID1:  28.00GiB
   System,RAID1:    32.00MiB
   Unallocated:     1.52TiB

/dev/sdf1, ID: 6
   Device size:     1.82TiB
   Data,RAID1:      284.00GiB
   Metadata,RAID1:  26.00GiB
   System,RAID1:    32.00MiB
   Unallocated:     1.52TiB
```

With help from Chris M., my catch-22 is resolved:

a) disconnect the disk that will be removed from btrfs
b) reboot with "rd.shell" and "rd.break=pre-init" (I type them in the grub editor from the grub menu)
c) get the "emergency shell" (appears right before the infinite wait for the btrfs uuid)
d) # mount -o degraded /dev/sdb3 /sysroot
e) # btrfs dev delete missing /sysroot
f) watch the progress of the btrfs data balancer; it will take some time.

Would be nice if the normal "btrfs dev delete" were fixed some day. K.O.

Made a typo in the previous message: "rd.break=pre-mount", not "pre-init". K.O.

Additional information. Back on January 5th, I booted the machine in single-user mode, and it has been running "btrfs delete missing /" ever since. Today "btrfs delete" finally completed: around 300 GB of data rearranged in 20 days. This must be a speed record of sorts: 15 GB per day, or about 0.2 Mbytes/sec. With btrfs no longer degraded, I rebooted the machine in multi-user mode (degraded btrfs will not boot, remember?) into the latest kernel.

```
[root@daq11 ~]# uname -a
Linux daq11.triumf.ca 3.10.0-327.4.5.el7.x86_64 #1 SMP Mon Jan 25 22:07:14 UTC 2016 x86_64 x86_64 x86_64 GNU/Linux
```

Now running "btrfs dev delete" to liberate one more disk; expect an update in 20 days. Impressive! K.O.

For the record: "btrfs dev delete" never completed. After 1 month (I am patient), I ended up reinstalling the OS (to move "/" from btrfs on 6xHDD to xfs on an SSD) and erasing the btrfs disks (complete data loss, if this were actual data). In summary, btrfs in el7.2 is useless junk (and I do not care if it works oh so well on the SSD in your laptop). K.O.

ok to close this bug, I do not see how I can close it myself. K.O.

My btrfs evaluation is complete; btrfs in el7.2 is unusable, will be using zfs instead. K.O.

close this bug already. nobody but bots left at red hat? K.O.

(In reply to Konstantin Olchanski from comment #13)
> close this bug already. nobody but bots left at red hat? K.O.

Apologies for the lack of attention on this bug; it had been mis-assigned. However, I'm afraid that btrfs did not exit tech preview in RHEL 7 and has been deprecated. No further fixes will be provided.
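For reference, the degraded-boot recovery procedure worked out in the thread can be consolidated as a single sketch. This is an untested summary under the thread's own assumptions: /dev/sdb3 is one surviving member of the RAID1 filesystem, /sysroot is the dracut mount point, and rd.break=pre-mount is the corrected dracut breakpoint (the thread first used pre-init by mistake).

```shell
# 0. Physically disconnect the failed disk, then reboot.
#
# 1. At the grub menu, edit the kernel command line and append:
#      rd.shell rd.break=pre-mount
#    This drops to the dracut emergency shell before the root
#    filesystem is mounted.

# 2. Mount the degraded filesystem by hand; any surviving member
#    device can be named here:
mount -o degraded /dev/sdb3 /sysroot

# 3. Remove the missing device; btrfs re-replicates its data onto
#    the remaining devices. This can take a very long time on
#    rotating disks (the reporter measured roughly 15 GB/day):
btrfs device delete missing /sysroot

# 4. Check that the device list no longer shows "Some devices
#    missing" before rebooting normally:
btrfs filesystem show /sysroot
```

Note that per the end of the thread, btrfs remained tech preview in RHEL 7 and was later deprecated, so this procedure is of historical interest on that platform.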