User-Agent: Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:63.0) Gecko/20100101 Firefox/63.0
Build Identifier:

Following an upgrade of a Fedora 29 system from kernel 4.18.16-300 to 4.18.17-300, two BTRFS raid-10 volumes (each containing multiple subvolumes) failed to mount at the subsequent reboot, dropping the boot process into emergency mode. The same thing occurred after a subsequent upgrade to kernel 4.18.18-300. Booting on kernel 4.18.16-300 works fine. (All three kernels are installed simultaneously, so there was no dnf-install / dnf-remove activity in between boot attempts.)

All volumes/subvolumes can be mounted by hand after the failure occurs, i.e. by issuing the appropriate 'mount' commands at the command line. The volumes successfully passed subsequent 'btrfs check' (while unmounted) and 'btrfs scrub' (while mounted) operations.

Various diagnostics have been attached to this bug report (e.g. output of dmesg, journalctl -xb, systemd-analyze critical-chain, etc.). It would appear that the raid volumes may be trying to mount before all member disks have become available.

All mounting is done via /etc/fstab, i.e. there are no hand-crafted systemd mount units involved. Initially the entries in /etc/fstab took the following form (several more subvols exist on both volumes but have been excluded here for brevity):

LABEL=btrfs-highrise-raid10-01 /var/data/filesystems/btrfs-highrise-raid10-01 btrfs noatime,noauto 0 0
LABEL=btrfs-highrise-raid10-02 /var/data/filesystems/btrfs-highrise-raid10-02 btrfs noatime,noauto 0 0
LABEL=btrfs-highrise-raid10-01 /var/data/subvols/backups_bareos btrfs noatime,subvol=backups_bareos 0 0
LABEL=btrfs-highrise-raid10-01 /var/data/subvols/backups_db btrfs noatime,subvol=backups_db 0 0

Identical results occur whether or not the toplevel volumes are set to "noauto".
Identical results also occur when the /etc/fstab entries are changed to explicitly specify all member disks, such as:

/dev/sdb /var/data/filesystems/btrfs-highrise-raid10-01 btrfs device=/dev/sdb,device=/dev/sdc,device=sdd,device=/dev/sde,noatime 0 0
/dev/sdf /var/data/filesystems/btrfs-highrise-raid10-02 btrfs device=/dev/sdf,device=/dev/sdg,device=sdh,device=/dev/sdi,noatime 0 0

and also:

/dev/sdb /var/data/filesystems/btrfs-highrise-raid10-01 btrfs device=fbb6ce57-66c3-4189-a180-2242ef8aa43b,device=4bc98037-1e45-4cd8-8c21-b4789d1586b8,device=72d8133b-7ffc-463c-bfcd-73a03cc42e54,device=03b19f31-0a34-4dfc-b420-ac0568ecf170,noatime 0 0
/dev/sdf /var/data/filesystems/btrfs-highrise-raid10-02 btrfs device=34b5aa14-b79a-4565-973c-282c173d7354,device=67a155dd-6b7f-4e5b-ba39-71fc4e2f3885,device=5f2e017c-dd2c-4f3a-b8c7-d612590dbede,device=36072420-2451-4a9b-9264-e11861a2d3fc,noatime 0 0

Again, when all else is left the same and the system is simply booted with different kernel revisions selected, the problem manifests reliably under 4.18.17-300 (and later) but not on 4.18.16-300 (and prior). Prior to 4.18.17, this problem had not been experienced for 2+ years as the system hosting the raid was progressively migrated from one Fedora version to the next.
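As an aside, two of the entries above use a bare `device=sdd` / `device=sdh` without a `/dev/` prefix - possibly just a transcription slip in the report, but worth checking, since such a value would not resolve. A small hypothetical sanity checker (not part of any btrfs tooling) can flag device= values that are neither absolute paths nor UUIDs:

```python
import re

# Matches a standard 8-4-4-4-12 hex UUID, as used by device=<UUID>.
UUID_RE = re.compile(r'^[0-9a-fA-F]{8}(-[0-9a-fA-F]{4}){3}-[0-9a-fA-F]{12}$')

def suspicious_device_opts(fstab_line):
    """Return device= values that are neither absolute paths nor UUIDs.

    Hypothetical helper for eyeballing btrfs fstab entries.
    """
    fields = fstab_line.split()
    if len(fields) < 4:
        return []
    bad = []
    for opt in fields[3].split(','):   # 4th fstab field holds mount options
        if opt.startswith('device='):
            value = opt[len('device='):]
            if not (value.startswith('/') or UUID_RE.match(value)):
                bad.append(value)
    return bad

entry = ("/dev/sdb /var/data/filesystems/btrfs-highrise-raid10-01 btrfs "
         "device=/dev/sdb,device=/dev/sdc,device=sdd,device=/dev/sde,noatime 0 0")
print(suspicious_device_opts(entry))  # → ['sdd']
```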
The following RPMs were updated via DNF immediately prior to the failure starting to occur:

dnf history info 21
Transaction ID : 21
Begin time     : Sat 10 Nov 2018 07:37:26 PM EST
Begin rpmdb    : 442:8975452ebac65dfad9eea3d24864b51cc8dddfa5
End time       : Sat 10 Nov 2018 07:38:28 PM EST (62 seconds)
End rpmdb      : 448:7c538d7e7c8cced1ed103716bda59b756903e9ec
User           : root <root>
Return-Code    : Success
Releasever     : 29
Command Line   : update
Packages Altered:
    Install  kernel-4.18.17-300.fc29.x86_64                 @updates
    Install  kernel-core-4.18.17-300.fc29.x86_64            @updates
    Install  kernel-modules-4.18.17-300.fc29.x86_64         @updates
    Install  python3-unbound-1.8.1-1.fc29.x86_64            @updates
    Install  unbound-libs-1.8.1-1.fc29.x86_64               @updates
    Install  zchunk-libs-0.9.14-1.fc29.x86_64               @updates
    Upgrade  coreutils-8.30-6.fc29.x86_64                   @updates
    Upgraded coreutils-8.30-5.fc29.x86_64                   @@System
    Upgrade  coreutils-common-8.30-6.fc29.x86_64            @updates
    Upgraded coreutils-common-8.30-5.fc29.x86_64            @@System
    Upgrade  curl-7.61.1-4.fc29.x86_64                      @updates
    Upgraded curl-7.61.1-3.fc29.x86_64                      @@System
    Upgrade  dnf-4.0.4-2.fc29.noarch                        @updates
    Upgraded dnf-4.0.4-1.fc29.noarch                        @@System
    Upgrade  dnf-data-4.0.4-2.fc29.noarch                   @updates
    Upgraded dnf-data-4.0.4-1.fc29.noarch                   @@System
    Upgrade  dnf-yum-4.0.4-2.fc29.noarch                    @updates
    Upgraded dnf-yum-4.0.4-1.fc29.noarch                    @@System
    Upgrade  findutils-1:4.6.0-21.fc29.x86_64               @updates
    Upgraded findutils-1:4.6.0-20.fc29.x86_64               @@System
    Upgrade  glibc-2.28-17.fc29.x86_64                      @updates
    Upgraded glibc-2.28-9.fc29.x86_64                       @@System
    Upgrade  glibc-all-langpacks-2.28-17.fc29.x86_64        @updates
    Upgraded glibc-all-langpacks-2.28-9.fc29.x86_64         @@System
    Upgrade  glibc-common-2.28-17.fc29.x86_64               @updates
    Upgraded glibc-common-2.28-9.fc29.x86_64                @@System
    Upgrade  libcurl-7.61.1-4.fc29.x86_64                   @updates
    Upgraded libcurl-7.61.1-3.fc29.x86_64                   @@System
    Upgrade  libdnf-0.22.0-8.fc29.x86_64                    @updates
    Upgraded libdnf-0.22.0-6.fc29.x86_64                    @@System
    Upgrade  libgcc-8.2.1-5.fc29.x86_64                     @updates
    Upgraded libgcc-8.2.1-4.fc29.x86_64                     @@System
    Upgrade  libgomp-8.2.1-5.fc29.x86_64                    @updates
    Upgraded libgomp-8.2.1-4.fc29.x86_64                    @@System
    Upgrade  libicu-62.1-3.fc29.x86_64                      @updates
    Upgraded libicu-62.1-2.fc29.x86_64                      @@System
    Upgrade  libsolv-0.7.1-1.fc29.x86_64                    @updates
    Upgraded libsolv-0.6.35-3.fc29.x86_64                   @@System
    Upgrade  libstdc++-8.2.1-5.fc29.x86_64                  @updates
    Upgraded libstdc++-8.2.1-4.fc29.x86_64                  @@System
    Upgrade  lmdb-libs-0.9.22-4.fc29.x86_64                 @updates
    Upgraded lmdb-libs-0.9.22-3.fc29.x86_64                 @@System
    Upgrade  pcre-8.42-5.fc29.x86_64                        @updates
    Upgraded pcre-8.42-4.fc29.x86_64                        @@System
    Upgrade  pcre2-10.32-4.fc29.x86_64                      @updates
    Upgraded pcre2-10.32-3.fc29.x86_64                      @@System
    Upgrade  perl-Errno-1.29-424.fc29.x86_64                @updates
    Upgraded perl-Errno-1.29-423.fc29.x86_64                @@System
    Upgrade  perl-IO-1.39-424.fc29.x86_64                   @updates
    Upgraded perl-IO-1.39-423.fc29.x86_64                   @@System
    Upgrade  perl-interpreter-4:5.28.0-424.fc29.x86_64      @updates
    Upgraded perl-interpreter-4:5.28.0-423.fc29.x86_64      @@System
    Upgrade  perl-libs-4:5.28.0-424.fc29.x86_64             @updates
    Upgraded perl-libs-4:5.28.0-423.fc29.x86_64             @@System
    Upgrade  perl-macros-4:5.28.0-424.fc29.x86_64           @updates
    Upgraded perl-macros-4:5.28.0-423.fc29.x86_64           @@System
    Upgrade  plymouth-0.9.4-1.fc29.x86_64                   @updates
    Upgraded plymouth-0.9.3-14.fc29.x86_64                  @@System
    Upgrade  plymouth-core-libs-0.9.4-1.fc29.x86_64         @updates
    Upgraded plymouth-core-libs-0.9.3-14.fc29.x86_64        @@System
    Upgrade  plymouth-scripts-0.9.4-1.fc29.x86_64           @updates
    Upgraded plymouth-scripts-0.9.3-14.fc29.x86_64          @@System
    Upgrade  python3-3.7.1-2.fc29.x86_64                    @updates
    Upgraded python3-3.7.1-1.fc29.x86_64                    @@System
    Upgrade  python3-dnf-4.0.4-2.fc29.noarch                @updates
    Upgraded python3-dnf-4.0.4-1.fc29.noarch                @@System
    Upgrade  python3-hawkey-0.22.0-8.fc29.x86_64            @updates
    Upgraded python3-hawkey-0.22.0-6.fc29.x86_64            @@System
    Upgrade  python3-libdnf-0.22.0-8.fc29.x86_64            @updates
    Upgraded python3-libdnf-0.22.0-6.fc29.x86_64            @@System
    Upgrade  python3-libs-3.7.1-2.fc29.x86_64               @updates
    Upgraded python3-libs-3.7.1-1.fc29.x86_64               @@System
    Upgrade  selinux-policy-3.14.2-42.fc29.noarch           @updates
    Upgraded selinux-policy-3.14.2-40.fc29.noarch           @@System
    Upgrade  selinux-policy-targeted-3.14.2-42.fc29.noarch  @updates
    Upgraded selinux-policy-targeted-3.14.2-40.fc29.noarch  @@System
    Upgrade  vim-common-2:8.1.511-1.fc29.x86_64             @updates
    Upgraded vim-common-2:8.1.483-1.fc29.x86_64             @@System
    Upgrade  vim-enhanced-2:8.1.511-1.fc29.x86_64           @updates
    Upgraded vim-enhanced-2:8.1.483-1.fc29.x86_64           @@System
    Upgrade  vim-filesystem-2:8.1.511-1.fc29.noarch         @updates
    Upgraded vim-filesystem-2:8.1.483-1.fc29.noarch         @@System
    Upgrade  vim-minimal-2:8.1.511-1.fc29.x86_64            @updates
    Upgraded vim-minimal-2:8.1.483-1.fc29.x86_64            @@System

Reproducible: Always

Steps to Reproduce:
1. Have a multi-disk BTRFS array (in this case an 8-disk raid-10 array) present and functioning in the system.
2. Boot into Fedora 29 kernel 4.18.17-300 or later.
3. Observe the boot process drop into emergency mode.

Actual Results:
The BTRFS multi-disk array fails to get mounted, and the boot process drops into emergency mode.

Expected Results:
The BTRFS multi-disk array gets mounted successfully, and the boot process reaches the desired runlevel (as happens with kernels 4.18.16-300 and prior).
Created attachment 1506300 [details] dmesg output
Created attachment 1506301 [details] journald output
Created attachment 1506302 [details] 'systemd-analyze blame' output
Created attachment 1506303 [details] 'systemd-analyze critical chain' output
We apologize for the inconvenience. There are a large number of bugs to go through and several of them have gone stale. Because of this, we are doing a mass bug update across all of the Fedora 29 kernel bugs.

Fedora 29 has now been rebased to 4.19.5-300.fc29. Please test this kernel update (or newer) and let us know if your issue has been resolved or if it is still present with the newer kernel. If you experience different issues, please open a new bug report for those.
(In reply to Jeremy Cline from comment #5)
> Fedora 29 has now been rebased to 4.19.5-300.fc29. Please test this kernel
> update (or newer) and let us know if your issue has been resolved or if it
> is still present with the newer kernel.

The problem recurred under kernel-4.19.5-300, which was the most recent kernel as of the time of testing. Diagnostics yielded substantially the same logging output as has already been attached to this bug report under kernel 4.18.17-300. After reverting again to 4.18.16-300, the problem did not manifest.

So behaviour remains the same as originally documented: any kernel _after_ 4.18.16-300 appears to manifest the documented issue.
I'm afraid I don't see anything obvious in the changelog for v4.18.17. Your best bet is to do a git-bisect and figure out which commit caused the regression. There are numerous guides on the Internet, but there's a Fedora-specific(ish) guide at https://docs.fedoraproject.org/en-US/quick-docs/kernel/troubleshooting/index.html#_bisecting_the_kernel
*********** MASS BUG UPDATE **************

We apologize for the inconvenience. There are a large number of bugs to go through and several of them have gone stale. Because of this, we are doing a mass bug update across all of the Fedora 29 kernel bugs.

Fedora 29 has now been rebased to 4.20.5-200.fc29. Please test this kernel update (or newer) and let us know if your issue has been resolved or if it is still present with the newer kernel. If you experience different issues, please open a new bug report for those.
*********** MASS BUG UPDATE ************** This bug is being closed with INSUFFICIENT_DATA as there has not been a response in 3 weeks. If you are still experiencing this issue, please reopen and attach the relevant data from the latest kernel you are running and any data that might have been requested previously.
Chances are this is solved by now but...

[ 4.036987] BTRFS error (device sdd): devid 4 uuid fbb6ce57-66c3-4189-a180-2242ef8aa43b is missing

For whatever reason, one of the devices in the Btrfs volume is delayed in becoming available, and the mount fails because Btrfs by default will not do degraded mounts. There is a udev rule that I'm pretty sure should prevent systemd from even attempting to mount any Btrfs volume until all of its member devices are reported ready:

/usr/lib/udev/rules.d/64-btrfs.rules

There are actually two questions: why is this device late to the party? And why is the btrfs udev rule somehow not preventing systemd from attempting to mount this volume?

I suggest separately using the boot parameters rd.udev.debug and systemd.log_level=debug - if used together, the debugging might slow down boot enough that the slow device shows up in time and the problem doesn't happen at all.
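The invariant that udev rule is meant to enforce boils down to: never hand the volume to systemd for mounting until every member devid has been scanned. A toy Python model of that gate (the expected devid set is hard-coded here as an assumption; the real kernel learns it from the superblocks):

```python
def volume_ready(expected_devids, seen_devids):
    """Toy model of the btrfs readiness gate: the volume may be
    mounted only once every expected member devid has appeared."""
    return expected_devids <= seen_devids   # subset test: all members seen?

expected = {1, 2, 3, 4}
seen = {1, 2, 3}                     # devid 4 is the latecomer, as in the dmesg line above
print(volume_ready(expected, seen))  # → False: mounting now would fail with 'devid 4 missing'
seen.add(4)
print(volume_ready(expected, seen))  # → True: safe to mount
```

The bug under discussion is precisely that a mount was attempted while this predicate would still have been False.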
I just upgraded from Fedora 28 to 29 on my custom NAS box and I'm having this issue with the latest kernel, 5.0.13-200.fc29.x86_64.

If I comment out my btrfs raid mounts from /etc/fstab I can boot, and I'm able to mount them manually or by uncommenting the lines in fstab and running `mount -a`.
(In reply to Devin Henderson from comment #11)
> I just upgraded from Fedora 28 to 29 on my custom NAS box and I'm having
> this issue with the latest kernel 5.0.13-200.fc29.x86_64
>
> If I comment out my btrfs raid mounts from /etc/fstab I can boot and I'm
> able to mount them manually or by uncommenting the lines in fstab and
> running `mount -a`.

Revert the /etc/fstab change, and then boot with 'rd.udev.debug systemd.debug-shell=1' and hopefully the problem will reproduce. You can then switch to tty9 and collect some information:

# script bug1650391-info.txt
# systemctl list-jobs
# btrfs fi show
# btrfs device scan
# btrfs fi show
# exit
# journalctl -b -o short-monotonic > bug1650391-journaludev.txt
# vi /etc/fstab
## you can either comment the volume out again, or I think adding 'nofail' as a mount option will cause systemd to eventually continue the boot rather than hanging indefinitely waiting for this volume to become available
# reboot

And then attach the two files, /root/*txt, to the bug report. I'm very curious why a kernel change would affect device discovery.
Thanks Chris. I'm running this in a KVM guest, and I ran those commands in the "emergency mode" shell that comes up when the boot fails - I hope that works. I'll add the attachments now.

BTW, adding nofail did allow the boot to finish. Once it was up I ran `mount -a` immediately and it worked.
Created attachment 1568698 [details] bug1650391-info.txt for Chris
Created attachment 1568699 [details] bug1650391-journaludev.txt for Chris
Shoot! I forgot to add the kernel params. I will re-attach new text files shortly. Sorry.
Created attachment 1568702 [details] REAL bug1650391-info.txt for Chris
Created attachment 1568703 [details] REAL bug1650391-journaludev.txt for Chris
New files are attached. Sorry for the confusion. I appreciate the help.
[ 9.086400] nas kernel: BTRFS error (device vdc): devid 1 uuid 404b0f4d-6b07-4094-a0ac-07634b8b2548 is missing
...
[ 9.131558] nas kernel: BTRFS error (device vdc): devid 1 uuid 404b0f4d-6b07-4094-a0ac-07634b8b2548 is missing
...
[ 9.151269] nas kernel: BTRFS error (device vdc): devid 1 uuid 404b0f4d-6b07-4094-a0ac-07634b8b2548 is missing

And yet your 'btrfs fi show' only discovers one Btrfs volume, with a totally different UUID: a36a537e-71e5-4569-a642-d926ea21455d.

Note that the filesystem/volume UUID for Btrfs applies to the entire volume across all devices, so in a 2x raid1 both devices will have the same UUID. There's a separate dev_item.uuid that is unique for each device, stored in the superblock. blkid and the kernel report the volume UUID, not the device UUID; the device is distinguished in kernel messages with "devid n".

Can you provide the output from

# blkid
# btrfs insp dump-s -f /dev/vdb
# btrfs insp dump-s -f /dev/vdc
# btrfs rescue super -v /dev/vdb
# btrfs rescue super -v /dev/vdc

You can use script again for that if you're at a console. These are read-only commands; they don't make any changes.

I see two Btrfs volume UUIDs in the journal. That's a bit troubling - maybe there's a stale super or some other confusion. Right now I'm a bit nervous about any more writes happening, so I suggest not mounting the volume rw. It's safe to mount ro. And it should be safe to make sure backups are up to date, just in case this all goes badly. But I'd make sure the backup, whether incremental or full, is separate from (doesn't overwrite) prior backups.

This conversation needs to move to the upstream Btrfs list. It's sufficiently serious that upstream needs to know of the conditions that led to this situation. It may be a pre-existing problem that's only just now getting caught by newer kernel versions; but the fact that you get a mount failure during startup, but not manually, is weird. I smell a bug. I suggest file attachments go in this bug report.
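Those "is missing" kernel messages follow a fixed pattern, so when comparing journals across boots they can be pulled out mechanically. A hypothetical helper (not part of btrfs-progs) that extracts the reporting device, devid, and per-device UUID:

```python
import re

# Pattern of the kernel's 'devid ... is missing' error line.
MISSING_RE = re.compile(
    r"BTRFS error \(device (\w+)\): devid (\d+) uuid ([0-9a-f-]+) is missing")

def missing_devices(journal_text):
    """Return sorted unique (reporting_device, devid, device_uuid)
    tuples found in 'devid ... is missing' kernel messages."""
    return sorted({(m.group(1), int(m.group(2)), m.group(3))
                   for m in MISSING_RE.finditer(journal_text)})

log = """\
[    9.086400] nas kernel: BTRFS error (device vdc): devid 1 uuid 404b0f4d-6b07-4094-a0ac-07634b8b2548 is missing
[    9.131558] nas kernel: BTRFS error (device vdc): devid 1 uuid 404b0f4d-6b07-4094-a0ac-07634b8b2548 is missing
"""
print(missing_devices(log))  # → [('vdc', 1, '404b0f4d-6b07-4094-a0ac-07634b8b2548')]
```

Note that the UUID it captures is the per-device dev_item.uuid, not the volume UUID - which is exactly the distinction drawn above.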
And the conversation goes to the Btrfs list. http://vger.kernel.org/vger-lists.html#linux-btrfs

Click on the subscribe link, send the email, reply to the confirmation. (It's not strictly required to subscribe, but a lot of people do not reply-all, so if you don't subscribe just end the email with 'please reply all, I'm not subscribed'.)

Include the following:
- a summary of the problem
- kernel versions being used
- btrfs-progs version(s) being used
- history of the file system: mkfs options used (was it always a raid1, or was it later converted into one?); was a device ever replaced, removed, or added? did the volume ever have its UUID changed using btrfstune?
- has it ever previously been mounted with the degraded option, and if yes, what were the conditions requiring that?
- has 'btrfs check --repair' been used? Don't use it now, just note if it has been used, and if you can remember what happened (error or no error, a couple dozen lines or hundreds).

I'm gonna go dig into this udev debug in a bit.
[ 9.084816] nas kernel: BTRFS info (device vdc): disk space caching is enabled
[ 9.086400] nas kernel: BTRFS error (device vdc): devid 1 uuid 404b0f4d-6b07-4094-a0ac-07634b8b2548 is missing
[ 9.086407] nas kernel: BTRFS error (device vdc): failed to read the system array: -2
[ 9.093686] nas kernel: BTRFS error (device vdc): open_ctree failed

The last two lines mean something tried to mount the file system while devid 1 was missing. It shouldn't be possible for systemd to do this, because of this udev rule:

[chris@flap ~]$ sudo dnf provides /usr/lib/udev/rules.d/64-btrfs.rules
systemd-udev-241-8.git9ef65cb.fc30.x86_64 : Rule-based device node and kernel event manager
[chris@flap ~]$ sudo cat /usr/lib/udev/rules.d/64-btrfs.rules
# do not edit this file, it will be overwritten on update
SUBSYSTEM!="block", GOTO="btrfs_end"
ACTION=="remove", GOTO="btrfs_end"
ENV{ID_FS_TYPE}!="btrfs", GOTO="btrfs_end"
ENV{SYSTEMD_READY}=="0", GOTO="btrfs_end"
# let the kernel know about this btrfs filesystem, and check if it is complete
IMPORT{builtin}="btrfs ready $devnode"
# mark the device as not ready to be used by the system
ENV{ID_BTRFS_READY}=="0", ENV{SYSTEMD_READY}="0"
# reconsider pending devices in case when multidevice volume awaits
ENV{ID_BTRFS_READY}=="1", RUN+="/usr/bin/udevadm trigger -s block -p ID_BTRFS_READY=0"
LABEL="btrfs_end"
---

systemd is supposed to wait indefinitely for all Btrfs devices in a volume to be reported as ready before it will even try to mount the file system. And yet... we have an error that only happens during failed mount attempts. And it failed because the device is not yet available. It doesn't appear until later...

[ 10.113322] nas kernel: BTRFS: device label WD80EFAX devid 1 transid 272924 /dev/vdb

OK, so I need something else :D I need you to boot up again with 'systemd.log_level=debug' and output the journal to a file: 'journalctl -b -o short-monotonic > journalsysdebug.txt', or something otherwise sanely named.

UUID 404b0f4d... is only ever reported by the kernel, and it's missing.
UUID a36a537e... is reported by both udev and btrfs progs, but not by the kernel.

This is confusing.
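For readers unfamiliar with udev rule syntax, the quoted rule's control flow can be sketched as a toy Python function. The 'btrfs_ready' key here stands in for the result of IMPORT{builtin}="btrfs ready $devnode" (an assumption for illustration; real udev asks the kernel whether the volume is complete):

```python
def apply_btrfs_rule(event):
    """Toy model of 64-btrfs.rules: given a uevent's environment as a
    dict, return the environment the rule would leave behind."""
    env = dict(event)
    # the four GOTO="btrfs_end" guards: rule only acts on live btrfs block devices
    if env.get('SUBSYSTEM') != 'block':
        return env
    if env.get('ACTION') == 'remove':
        return env
    if env.get('ID_FS_TYPE') != 'btrfs':
        return env
    if env.get('SYSTEMD_READY') == '0':
        return env
    # stands in for IMPORT{builtin}="btrfs ready $devnode"
    env['ID_BTRFS_READY'] = env.pop('btrfs_ready')
    if env['ID_BTRFS_READY'] == '0':
        # incomplete multi-device volume: hide the device from systemd
        env['SYSTEMD_READY'] = '0'
    # on '1', the real rule additionally re-triggers pending block devices
    # via 'udevadm trigger -s block -p ID_BTRFS_READY=0'
    return env

env = apply_btrfs_rule({'SUBSYSTEM': 'block', 'ACTION': 'add',
                        'ID_FS_TYPE': 'btrfs', 'btrfs_ready': '0'})
print(env['SYSTEMD_READY'])  # → 0 (device withheld from systemd until the volume is complete)
```

The puzzle in this bug is that a mount was attempted anyway, i.e. somewhere this SYSTEMD_READY=0 withholding did not take effect.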
This might actually be a udev or systemd bug... I know it seems instigated by a kernel change, and a kernel change might be related, but I'll need to see the systemd debug journal.

[ 4.470332] nas kernel: virtio_blk virtio4: [vdb] 15628050432 512-byte logical blocks (8.00 TB/7.28 TiB)
[ 4.539011] nas kernel: virtio_blk virtio5: [vdc] 15628050432 512-byte logical blocks (8.00 TB/7.28 TiB)
[ 4.933352] nas systemd-udevd[382]: creating link '/dev/disk/by-uuid/a36a537e-71e5-4569-a642-d926ea21455d' to '/dev/vdb'
[ 4.933411] nas systemd-udevd[382]: creating symlink '/dev/disk/by-uuid/a36a537e-71e5-4569-a642-d926ea21455d' to '../../vdb'
[ 4.953123] nas systemd-udevd[377]: creating link '/dev/disk/by-uuid/a36a537e-71e5-4569-a642-d926ea21455d' to '/dev/vdc'
[ 4.953197] nas systemd-udevd[377]: atomically replace '/dev/disk/by-uuid/a36a537e-71e5-4569-a642-d926ea21455d'

I don't grok udev much at all. But clearly the virtio devices vdb and vdc exist very early. And then udev seems to initially assign the correct UUID to /dev/vdb, but then steps on that (?) with /dev/vdc. And later vdb appears to the kernel. I'm not really sure of the interaction between udev and the kernel - why the block device can appear, yet kernel code doesn't see /dev/vdb as a btrfs device until much, much later.
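One way to reason about this race is to pull the monotonic timestamps out of such journal excerpts and line the events up in order. A hypothetical parsing helper (field layout assumed to match 'journalctl -o short-monotonic' and dmesg-style lines as quoted above):

```python
import re

# '[    4.470332] <rest of message>' - monotonic seconds in brackets.
STAMP_RE = re.compile(r"\[\s*([0-9]+\.[0-9]+)\]\s+(.*)")

def timeline(lines):
    """Return (seconds, message) pairs sorted by monotonic timestamp."""
    events = []
    for line in lines:
        m = STAMP_RE.match(line)
        if m:
            events.append((float(m.group(1)), m.group(2)))
    return sorted(events)

lines = [
    "[    4.539011] nas kernel: virtio_blk virtio5: [vdc] 15628050432 512-byte logical blocks",
    "[    4.470332] nas kernel: virtio_blk virtio4: [vdb] 15628050432 512-byte logical blocks",
    "[   10.113322] nas kernel: BTRFS: device label WD80EFAX devid 1 transid 272924 /dev/vdb",
]
for t, msg in timeline(lines):
    print(f"{t:10.6f}  {msg}")
```

Sorting makes the anomaly jump out: the vdb block device exists at ~4.5 s, but the kernel doesn't register it as a btrfs member until ~10.1 s, well after the mount attempt at ~9.1 s.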
Also, can you include a copy of /etc/fstab? I want to understand what assembly is supposed to be like and where all the UUIDs are supposed to end up.