Created attachment 1644595 [details]
system journal dump

1. Please describe the problem:

There are overflow bugs present in the bcache driver in the 5.3.15-300.fc31 kernel regarding stripe number calculations, caused by assuming that stripe numbers can fit in 31 bits, and relatedly, that stripe sizes will be large enough relative to device sizes to make stripe numbers fit in 31 bits. The possible failure modes include at least (1) inability to configure bcache devices atop certain types of backing storage in certain size ranges, and (2) out-of-bounds pointer accesses in the kernel.

Stripe sizes can be derived from the underlying storage device's optimum I/O size (io_opt). If this is set to something small, then for a large enough underlying storage device, the number of stripes may exceed 2**31. Red Hat's VDO device (available at https://github.com/dm-vdo/kvdo) is one such device, with an optimum I/O size of 4096. I see that zram uses PAGE_SIZE. I'm more familiar with VDO, so that's what I'll use for my example, but as far as I can tell, the only interesting characteristic is that VDO sets io_opt to 4096 (and it's easy to make a storage volume appear to have several TB of storage on a test system).

In …/bcache/super.c, where nr_stripes is calculated:

	d->nr_stripes = DIV_ROUND_UP_ULL(sectors, d->stripe_size);

Unfortunately, nr_stripes is an "unsigned int" field, so if, for example, the device is 18 TB and the optimum I/O size is 4 kB, we have a truncation problem:

  sectors            = 38654705664
  stripe_size        = 8
  DIV_ROUND_UP_ULL() = 4831838208 = 0x120000000
  (unsigned int) ... =  0x20000000 = 536870912

So nr_stripes in this case is the same as if the device size were 2 TB.

Also, elsewhere in this function, the value is capped at INT_MAX, so if the stripe count after truncation exceeds INT_MAX, the configuration is rejected altogether. In the case of the 4 kB optimum I/O size, this happens for device sizes of 8-16 TB modulo 16 TB (so, 0-8 TB okay, 8-16 fail, 16-24 okay, 24-32 fail, etc).
For other optimum I/O sizes, scale accordingly.

If we get past the INT_MAX check but have truncated the stripe count, bcache_dev_stripe_dirty in writeback.h can dereference the stripe_sectors_dirty array without bounds checking. The array is allocated based on the truncated nr_stripes value, but indexed based on a real 64-bit device offset. This can lead to kernel errors when writing to some offsets:

Dec 12 03:02:29 kernel: BUG: unable to handle page fault for address: ffff8f59b0300008
Dec 12 03:02:29 kernel: #PF: supervisor read access in kernel mode
Dec 12 03:02:29 kernel: #PF: error_code(0x0000) - not-present page
Dec 12 03:02:29 kernel: PGD 5b001067 P4D 5b001067 PUD 0
Dec 12 03:02:29 kernel: Oops: 0000 [#1] SMP PTI
Dec 12 03:02:29 kernel: CPU: 0 PID: 844 Comm: dd Tainted: G OE 5.3.15-300.fc31.x86_64 #1
Dec 12 03:02:29 kernel: Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS ?-20180724_192412-buildhw-07.phx2.fedoraproject.org-1.fc29 04/01/2014
Dec 12 03:02:29 kernel: RIP: 0010:cached_dev_make_request+0x8c0/0xdf0 [bcache]
[...]
Dec 12 03:02:29 kernel: Call Trace:
Dec 12 03:02:29 kernel:  generic_make_request+0xcf/0x320
Dec 12 03:02:29 kernel:  submit_bio+0x3c/0x160
Dec 12 03:02:29 kernel:  ? bio_add_page+0x62/0x90
Dec 12 03:02:29 kernel:  ? guard_bio_eod+0x2c/0x130
Dec 12 03:02:29 kernel:  submit_bh_wbc+0x16d/0x190
Dec 12 03:02:29 kernel:  __block_write_full_page+0x200/0x410
Dec 12 03:02:29 kernel:  ? touch_buffer+0x60/0x60
Dec 12 03:02:29 kernel:  ? bdev_evict_inode+0xf0/0xf0
Dec 12 03:02:29 kernel:  __writepage+0x19/0x50
Dec 12 03:02:29 kernel:  write_cache_pages+0x171/0x420
Dec 12 03:02:29 kernel:  ? __wb_calc_thresh+0x130/0x130
Dec 12 03:02:29 kernel:  ? ep_poll_callback+0x29c/0x2c0
Dec 12 03:02:29 kernel:  generic_writepages+0x57/0x90
Dec 12 03:02:29 kernel:  ? __wake_up_common_lock+0x8a/0xc0
Dec 12 03:02:29 kernel:  do_writepages+0x43/0xd0
Dec 12 03:02:29 kernel:  __filemap_fdatawrite_range+0xce/0x110
Dec 12 03:02:29 kernel:  filemap_write_and_wait+0x2d/0x80
Dec 12 03:02:29 kernel:  __blkdev_put+0x70/0x1e0
Dec 12 03:02:29 kernel:  blkdev_close+0x22/0x30
[...]

Also, offset_to_stripe in that same header returns an "unsigned int", but the value it has computed is a 64-bit quotient (a 64-bit value divided by a 32-bit value) that is potentially being truncated if the stripe size is small. I do not know if there are other dereferencing problems, or any more subtle problems caused by being able to do I/O with sector offsets that generate stripe numbers larger than nr_stripes.

2. What is the Version-Release number of the kernel:

5.3.15-300.fc31

3. Did it work previously in Fedora? If so, what kernel version did the issue *first* appear? Old kernels are available for download at https://koji.fedoraproject.org/koji/packageinfo?packageID=8 :

Kernel 4.17.8 had the same or similar bugs. I'm not aware of a previously working version.

4. Can you reproduce this issue? If so, please provide the steps to reproduce the issue below:

I started with a "Vagrant" virtual machine image (fedora/31-cloud-base), 16 GB of memory, an additional 150 GB disk image, updated to the current kernel and kernel development packages (5.3.15-300.fc31). Setting it up with more than one virtual CPU may make it easier to keep interacting with the VM even if one CPU is thrown into a loop or something.

I installed VDO on it. The current VDO package on GitHub as of 2019-12-11 does not build with the latest kernel due to some of the warning options in newer kernels, so it needs a little tweaking.
My setup script, run as root:

#!/bin/sh -ex
dnf install -y device-mapper-devel device-mapper-event-devel dkms \
    libuuid-devel lvm2 make python3 python3-PyYAML redhat-lsb-core rpm-build
mkdir /tmp/builds
cd /tmp/builds
git clone https://github.com/dm-vdo/kvdo.git
git clone https://github.com/dm-vdo/vdo.git
# Patch up kvdo for new VLA warnings
sed -i -e '/^EXTRA_CFLAGS/s/=/= -Wno-vla/' kvdo/uds/Makefile kvdo/vdo/Makefile
REL=$(uname -r)
make -C /usr/src/kernels/$REL M=/tmp/builds/kvdo
cp kvdo/uds/uds.ko kvdo/vdo/kvdo.ko /lib/modules/$REL/updates/
depmod -a
make -C vdo
make -C vdo install
modprobe kvdo

I set up a VDO device on the second disk, and created a volume group on top of the VDO volume.

[root@localhost vagrant]# vdo create -n vdo0 --device /dev/vdb --sparseIndex disabled --indexMem 0.25 --vdoLogicalSize 18T
Creating VDO vdo0
Starting VDO vdo0
Starting compression on VDO vdo0
VDO instance 0 volume is ready at /dev/mapper/vdo0
[root@localhost vagrant]# vgcreate testvg /dev/mapper/vdo0
  Physical volume "/dev/mapper/vdo0" successfully created.
  Volume group "testvg" successfully created
[root@localhost vagrant]#

Set up a small volume for a cache; initially we'll try a less-than-8-TB block device to show that that goes okay:

[root@localhost ~]# lvcreate --size 30G -n cache testvg
  Logical volume "cache" created.
[root@localhost ~]# lvcreate --size 7T -n foo testvg
  Logical volume "foo" created.
[root@localhost ~]#

Install bcache-tools and systemtap (for later). Try creating the cache:

[root@localhost ~]# make-bcache -C /dev/testvg/cache -B /dev/testvg/foo
UUID:           dbc49d7d-6f7b-4d71-802a-503ca95ef7af
[…]

Success! So tear it down.

[root@localhost ~]# ls -l /dev/testvg/foo
lrwxrwxrwx. 1 root root 7 Dec 12 22:06 /dev/testvg/foo -> ../dm-2
[root@localhost ~]# echo 1 > /sys/block/dm-2/bcache/detach
[root@localhost ~]# echo 1 > /sys/block/dm-2/bcache/stop
[root@localhost ~]# echo 1 > /sys/fs/bcache/*-*/stop
[root@localhost ~]# wipefs -a /dev/testvg/cache
/dev/testvg/cache: 16 bytes were erased at offset 0x00001018 (bcache): c6 85 73 f6 4e 1a 45 ca 82 65 f5 7f 48 ba 6d 81
[root@localhost ~]# wipefs -a /dev/testvg/foo
/dev/testvg/foo: 16 bytes were erased at offset 0x00001018 (bcache): c6 85 73 f6 4e 1a 45 ca 82 65 f5 7f 48 ba 6d 81
[root@localhost ~]#

Try a size between 8 TB and 16 TB:

[root@localhost ~]# lvresize --size 9T testvg/foo
  Size of logical volume testvg/foo changed from 7.00 TiB (1835008 extents) to 9.00 TiB (2359296 extents).
  Logical volume testvg/foo successfully resized.
[root@localhost ~]# make-bcache -C /dev/testvg/cache -B /dev/testvg/foo
UUID:           f932256f-aca8-4f6c-a945-b3487adb6e37
Set UUID:       4c631dd0-8605-4f56-91ad-22951bd4e7bd
version:        0
nbuckets:       61440
block_size:     8
bucket_size:    1024
nr_in_set:      1
nr_this_dev:    0
first_bucket:   1
UUID:           0cb54114-f665-4ae7-ae8c-1423c0af3808
Set UUID:       4c631dd0-8605-4f56-91ad-22951bd4e7bd
version:        1
block_size:     8
data_offset:    16
[root@localhost ~]#

This appears successful, and the exit status was 0; however, the kernel has logged an error and the /sys/block/dm-2/bcache directory didn't appear:

Dec 12 22:16:15 localhost.localdomain kernel: bcache: run_cache_set() invalidating existing data
Dec 12 22:16:15 localhost.localdomain kernel: bcache: register_cache() registered cache device dm-1
Dec 12 22:16:15 localhost.localdomain kernel: bcache: bcache_device_init() nr_stripes too large or invalid: 2415919102 (start sector beyond end of disk?)
Dec 12 22:16:15 localhost.localdomain kernel: bcache: register_bdev() error dm-2: cannot allocate memory
Dec 12 22:16:15 localhost.localdomain kernel: bcache: register_bcache() error /dev/dm-2: failed to register device
Dec 12 22:16:15 localhost.localdomain kernel: bcache: bcache_device_free() (efault) stopped
Dec 12 22:16:15 localhost.localdomain systemd-udevd[19151]: dm-2: Process 'bcache-register /dev/dm-2' failed with exit code 1.

So, clean up:

[root@localhost ~]# echo 1 > /sys/fs/bcache/*-*/stop
[root@localhost ~]# wipefs -a /dev/testvg/foo
/dev/testvg/foo: 16 bytes were erased at offset 0x00001018 (bcache): c6 85 73 f6 4e 1a 45 ca 82 65 f5 7f 48 ba 6d 81
[root@localhost ~]# wipefs -a /dev/testvg/cache
/dev/testvg/cache: 16 bytes were erased at offset 0x00001018 (bcache): c6 85 73 f6 4e 1a 45 ca 82 65 f5 7f 48 ba 6d 81
[root@localhost ~]#

And try again in the 16-24 TB range. This time, we use systemtap to print out some of the computed stripe info when the device starts up (and udev scans it looking for identifiable content).

[root@localhost ~]# lvresize --size 17T testvg/foo
  Size of logical volume testvg/foo changed from 9.00 TiB (2359296 extents) to 17.00 TiB (4456448 extents).
  Logical volume testvg/foo successfully resized.
[root@localhost ~]# stap -e 'probe module("bcache").function("cached_dev_read") { printf("stripe size %u nr_stripes %u\n", $dc->disk->stripe_size, $dc->disk->nr_stripes); exit(); }' &
[1] 20377
[root@localhost ~]#
[root@localhost ~]#
[root@localhost ~]# make-bcache -C /dev/testvg/cache -B /dev/testvg/foo
UUID:           498f42ee-13ef-48ea-b67a-8d7d2a5437cb
Set UUID:       d6cf5434-dedf-44b0-94b1-d3722ce1e0f8
version:        0
nbuckets:       61440
block_size:     8
bucket_size:    1024
nr_in_set:      1
nr_this_dev:    0
first_bucket:   1
UUID:           5e0cde0c-49ca-4e94-bcc1-e1ddc5fee8c0
Set UUID:       d6cf5434-dedf-44b0-94b1-d3722ce1e0f8
version:        1
block_size:     8
data_offset:    16
[root@localhost ~]# stripe size 8 nr_stripes 268435454
[1]+  Done    stap -e 'probe module("bcache").function("cached_dev_read") { printf("stripe size %u nr_stripes %u\n", $dc->disk->stripe_size, $dc->disk->nr_stripes); exit(); }'
[root@localhost ~]#

268435454 stripes of 8 sectors or 4096 bytes each is 1099511619584 bytes, or just under 1 TiB. So now we can try seeing what happens if we cause stripe numbers to be computed that are higher than this. An offset of 4 TiB in 4 kB blocks is 1024**3, or 1073741824.

[root@localhost ~]# echo writeback > /sys/block/dm-2/bcache/cache_mode
[root@localhost ~]# dd if=/dev/urandom bs=4096 count=1 of=/dev/bcache0
1+0 records in
1+0 records out
4096 bytes (4.1 kB, 4.0 KiB) copied, 0.00331645 s, 1.2 MB/s
[root@localhost ~]# dd if=/dev/urandom bs=4096 count=1 of=/dev/bcache0 seek=1073741824
1+0 records in
1+0 records out
4096 bytes (4.1 kB, 4.0 KiB) copied, 0.00115473 s, 3.5 MB/s
[root@localhost ~]#

At this point, the VM became unresponsive, consuming lots of CPU time. The console shows a stack trace with a fault at cached_dev_make_request+0x8c0, but some of the info had scrolled off the screen.
After rebooting the machine and restarting VDO (it wasn't configured to restart automatically), the logical volumes and bcache came up automatically, so I reissued that last "dd" command, and immediately got this:

Dec 12 22:55:07 localhost.localdomain kernel: BUG: unable to handle page fault for address: ffffaea6de7f1008
Dec 12 22:55:07 localhost.localdomain kernel: #PF: supervisor read access in kernel mode
Dec 12 22:55:07 localhost.localdomain kernel: #PF: error_code(0x0000) - not-present page
Dec 12 22:55:07 localhost.localdomain kernel: PGD 42dd46067 P4D 42dd46067 PUD 0
Dec 12 22:55:07 localhost.localdomain kernel: Oops: 0000 [#1] SMP PTI
Dec 12 22:55:07 localhost.localdomain kernel: CPU: 1 PID: 905 Comm: dd Tainted: G OE 5.3.15-300.fc31.x86_64 #1
Dec 12 22:55:07 localhost.localdomain kernel: Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS ?-20180724_192412-buildhw-07.phx2.fedoraproject.org-1.fc29 04/01/2014
Dec 12 22:55:07 localhost.localdomain kernel: RIP: 0010:cached_dev_make_request+0x8c0/0xdf0 [bcache]
Dec 12 22:55:07 localhost.localdomain kernel: Code: 00 00 44 8b b3 9c 00 00 00 48 8b 46 20 31 d2 8b 7e 28 49 f7 f6 4d 89 f1 4c 8b b3 a0 00 00 00 c1 ef 09 89 c2 89 c0 49 8d 04 86 <8b> 00 85 c0 75 26 44 39 cf 77 0e e9 ae 03 00 00 41 39 f9 0f 83 a5
Dec 12 22:55:07 localhost.localdomain kernel: RSP: 0018:ffffaea5c079f9d0 EFLAGS: 00010202
Dec 12 22:55:07 localhost.localdomain kernel: RAX: ffffaea6de7f1008 RBX: ffff8f96bc1f0010 RCX: 0000000000000000
Dec 12 22:55:07 localhost.localdomain kernel: RDX: 0000000040000002 RSI: ffff8f96786b4cc0 RDI: 0000000000000008
Dec 12 22:55:07 localhost.localdomain kernel: RBP: ffff8f96bc119720 R08: 0000000000000801 R09: 0000000000000008
Dec 12 22:55:07 localhost.localdomain kernel: R10: 0000000000000801 R11: 0000000000000000 R12: ffff8f96bc119848
Dec 12 22:55:07 localhost.localdomain kernel: R13: ffff8f96bc119770 R14: ffffaea5de7f1000 R15: ffff8f96bc1f0000
Dec 12 22:55:07 localhost.localdomain kernel: FS: 00007f376d40c580(0000) GS:ffff8f96efb00000(0000) knlGS:0000000000000000
Dec 12 22:55:07 localhost.localdomain kernel: CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
Dec 12 22:55:07 localhost.localdomain kernel: CR2: ffffaea6de7f1008 CR3: 00000003b82bc005 CR4: 00000000003606e0
Dec 12 22:55:07 localhost.localdomain kernel: DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
Dec 12 22:55:07 localhost.localdomain kernel: DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
Dec 12 22:55:07 localhost.localdomain kernel: Call Trace:
Dec 12 22:55:07 localhost.localdomain kernel:  generic_make_request+0xcf/0x320
Dec 12 22:55:07 localhost.localdomain kernel:  submit_bio+0x3c/0x160
Dec 12 22:55:07 localhost.localdomain kernel:  ? bio_add_page+0x62/0x90
Dec 12 22:55:07 localhost.localdomain kernel:  ? guard_bio_eod+0x2c/0x130
Dec 12 22:55:07 localhost.localdomain kernel:  submit_bh_wbc+0x16d/0x190
Dec 12 22:55:07 localhost.localdomain kernel:  __block_write_full_page+0x200/0x410
Dec 12 22:55:07 localhost.localdomain kernel:  ? touch_buffer+0x60/0x60
Dec 12 22:55:07 localhost.localdomain kernel:  ? bdev_evict_inode+0xf0/0xf0
Dec 12 22:55:07 localhost.localdomain kernel:  __writepage+0x19/0x50
Dec 12 22:55:07 localhost.localdomain kernel:  write_cache_pages+0x171/0x420
Dec 12 22:55:07 localhost.localdomain kernel:  ? __wb_calc_thresh+0x130/0x130
Dec 12 22:55:07 localhost.localdomain kernel:  ? ep_poll_callback+0x29c/0x2c0
Dec 12 22:55:07 localhost.localdomain kernel:  generic_writepages+0x57/0x90
Dec 12 22:55:07 localhost.localdomain kernel:  ? __wake_up_common_lock+0x8a/0xc0
Dec 12 22:55:07 localhost.localdomain kernel:  do_writepages+0x43/0xd0
Dec 12 22:55:07 localhost.localdomain kernel:  __filemap_fdatawrite_range+0xce/0x110
Dec 12 22:55:07 localhost.localdomain kernel:  filemap_write_and_wait+0x2d/0x80
Dec 12 22:55:07 localhost.localdomain kernel:  __blkdev_put+0x70/0x1e0
Dec 12 22:55:07 localhost.localdomain kernel:  blkdev_close+0x22/0x30
Dec 12 22:55:07 localhost.localdomain kernel:  __fput+0xc1/0x250
Dec 12 22:55:07 localhost.localdomain kernel:  task_work_run+0x87/0xa0
Dec 12 22:55:07 localhost.localdomain kernel:  exit_to_usermode_loop+0xda/0x100
Dec 12 22:55:07 localhost.localdomain kernel:  do_syscall_64+0x183/0x1a0
Dec 12 22:55:07 localhost.localdomain kernel:  entry_SYSCALL_64_after_hwframe+0x44/0xa9
Dec 12 22:55:07 localhost.localdomain kernel: RIP: 0033:0x7f376d334c07
Dec 12 22:55:07 localhost.localdomain kernel: Code: ff ff e8 fc e8 01 00 66 2e 0f 1f 84 00 00 00 00 00 66 90 f3 0f 1e fa 64 8b 04 25 18 00 00 00 85 c0 75 10 b8 03 00 00 00 0f 05 <48> 3d 00 f0 ff ff 77 41 c3 48 83 ec 18 89 7c 24 0c e8 f3 41 f9 ff
Dec 12 22:55:07 localhost.localdomain kernel: RSP: 002b:00007ffcacdeba98 EFLAGS: 00000246 ORIG_RAX: 0000000000000003
Dec 12 22:55:07 localhost.localdomain kernel: RAX: 0000000000000000 RBX: 0000556129b9b120 RCX: 00007f376d334c07
Dec 12 22:55:07 localhost.localdomain kernel: RDX: 0000000000001000 RSI: 0000000000000000 RDI: 0000000000000001
Dec 12 22:55:07 localhost.localdomain kernel: RBP: 0000000000000000 R08: 000055612a22ab60 R09: 00000000000000f0
Dec 12 22:55:07 localhost.localdomain kernel: R10: 000055612a22b000 R11: 0000000000000246 R12: 0000000000000001
Dec 12 22:55:07 localhost.localdomain kernel: R13: 0000000040000000 R14: 0000000000000000 R15: 0000000000000001
Dec 12 22:55:07 localhost.localdomain kernel: Modules linked in: bcache crc64 kvdo(OE) uds(OE) intel_rapl_msr intel_rapl_common kvm_intel kvm cirrus irqbypass drm_kms_helper drm virtio_net net_failover failover i2c_piix4 virtio_balloon ip_tables crct10dif_pclmul crc32_pclmul crc32c_intel ghash_clmulni_intel serio_raw virtio_blk qemu_fw_cfg ata_generic pata_acpi
Dec 12 22:55:07 localhost.localdomain kernel: CR2: ffffaea6de7f1008
Dec 12 22:55:07 localhost.localdomain kernel: ---[ end trace 24e47706f4bb72ab ]---
Dec 12 22:55:07 localhost.localdomain kernel: RIP: 0010:cached_dev_make_request+0x8c0/0xdf0 [bcache]
Dec 12 22:55:07 localhost.localdomain kernel: Code: 00 00 44 8b b3 9c 00 00 00 48 8b 46 20 31 d2 8b 7e 28 49 f7 f6 4d 89 f1 4c 8b b3 a0 00 00 00 c1 ef 09 89 c2 89 c0 49 8d 04 86 <8b> 00 85 c0 75 26 44 39 cf 77 0e e9 ae 03 00 00 41 39 f9 0f 83 a5
Dec 12 22:55:07 localhost.localdomain kernel: RSP: 0018:ffffaea5c079f9d0 EFLAGS: 00010202
Dec 12 22:55:07 localhost.localdomain kernel: RAX: ffffaea6de7f1008 RBX: ffff8f96bc1f0010 RCX: 0000000000000000
Dec 12 22:55:07 localhost.localdomain kernel: RDX: 0000000040000002 RSI: ffff8f96786b4cc0 RDI: 0000000000000008
Dec 12 22:55:07 localhost.localdomain kernel: RBP: ffff8f96bc119720 R08: 0000000000000801 R09: 0000000000000008
Dec 12 22:55:07 localhost.localdomain kernel: R10: 0000000000000801 R11: 0000000000000000 R12: ffff8f96bc119848
Dec 12 22:55:07 localhost.localdomain kernel: R13: ffff8f96bc119770 R14: ffffaea5de7f1000 R15: ffff8f96bc1f0000
Dec 12 22:55:07 localhost.localdomain kernel: FS: 00007f376d40c580(0000) GS:ffff8f96efb00000(0000) knlGS:0000000000000000
Dec 12 22:55:07 localhost.localdomain kernel: CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
Dec 12 22:55:07 localhost.localdomain kernel: CR2: ffffaea6de7f1008 CR3: 00000003b82bc005 CR4: 00000000003606e0
Dec 12 22:55:07 localhost.localdomain kernel: DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
Dec 12 22:55:07 localhost.localdomain kernel: DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400

This trace is consistent with what I saw the first time I hit this aspect of the problem.
This problem was present in earlier kernels such as 4.17.8-1.el7.elrepo; see https://github.com/dm-vdo/kvdo/issues/9 for info. The pattern continues: from 24-32 TB, an error occurs; from 32-40 TB, configuration succeeds; and so on. Thanks to Bryan Gurney and Sergey Ponomarev for helping look into it initially back then.

5. Does this problem occur with the latest Rawhide kernel? To install the Rawhide kernel, run ``sudo dnf install fedora-repos-rawhide`` followed by ``sudo dnf update --enablerepo=rawhide kernel``:

Unknown.

6. Are you running any modules not shipped directly with Fedora's kernel?:

kvdo (https://github.com/dm-vdo/kvdo)

7. Please attach the kernel logs. You can get the complete kernel log for a boot with ``journalctl --no-hostname -k > dmesg.txt``. If the issue occurred on a previous boot, use the journalctl ``-b`` flag.
*********** MASS BUG UPDATE **************

We apologize for the inconvenience. There are a large number of bugs to go through and several of them have gone stale. Due to this, we are doing a mass bug update across all of the Fedora 31 kernel bugs.

Fedora 31 has now been rebased to 5.5.7-200.fc31. Please test this kernel update (or newer) and let us know if your issue has been resolved or if it is still present with the newer kernel.

If you have moved on to Fedora 32, and are still experiencing this issue, please change the version to Fedora 32. If you experience different issues, please open a new bug report for those.
*********** MASS BUG UPDATE **************

This bug is being closed with INSUFFICIENT_DATA as there has not been a response in 3 weeks. If you are still experiencing this issue, please reopen and attach the relevant data from the latest kernel you are running and any data that might have been requested previously.
I rebuilt my Fedora 31 test VM and got kernel 5.5.11-200. The failure with the 9 TB volume is the same. With the 17 TB volume, I can create the cache. Writing to it didn't produce the BUG failure in my first few attempts, but it didn't always happen with the 5.3.15-300 kernel either, so I can't say for certain whether that part has been fixed.

My systemtap one-liner had to be made a bit more complicated due to the restructuring of the bcache read path:

probe module("bcache").function("cached_dev_read_done_bh") {
    disk = @container_of($cl, "struct search", cl)->d;
    printf("stripe_size %u nr_stripes %u\n",
           @cast(disk, "struct bcache_device", "bcache")->stripe_size,
           @cast(disk, "struct bcache_device", "bcache")->nr_stripes);
    exit();
}

...but it still confirms that the number of stripes gets truncated for the 17 TB volume.
Ah, yes... it helps to force the bcache device to have partial_stripes_expensive=1; without a handy configuration where it could be inherited from q->limits.raid_partial_stripes_expensive, I took the sledgehammer approach:

# run under systemtap and then read a block
probe module("bcache").function("cached_dev_read_done_bh") {
    dc = &@container_of(@container_of($cl, "struct search", cl)->d,
                        "struct cached_dev", disk);
    @cast(dc, "struct cached_dev", "bcache")->partial_stripes_expensive = 1;
    exit();
}

Then if writeback mode is enabled and other conditions are met, bcache_dev_stripe_dirty can be called when writing a block to a well-chosen offset, and may, if you're lucky, hit an unmapped address and trigger the fault again.

$ dd if=/dev/urandom bs=4096 count=1 of=/dev/bcache0 seek=1073741824
...
kernel: BUG: unable to handle page fault for address: ffff9f96fcf1d008
kernel: #PF: supervisor read access in kernel mode
kernel: #PF: error_code(0x0000) - not-present page
kernel: PGD 1ff994b067 P4D 1ff994b067 PUD 0
kernel: Oops: 0000 [#1] SMP PTI
kernel: CPU: 3 PID: 5296 Comm: dd Kdump: loaded Tainted: P OE 5.5.11-200.fc31.x86_64 #1
kernel: Hardware name: Supermicro X9SRE/X9SRE-3F/X9SRi/X9SRi-3F/X9SRE/X9SRE-3F/X9SRi/X9SRi-3F, BIOS 3.2 01/16/2015
kernel: RIP: 0010:cached_dev_make_request+0x8b0/0xdf0 [bcache]
kernel: Code: 00 00 45 8b be 9c 00 00 00 48 8b 46 20 31 d2 8b 7e 28 49 f7 f7 4d 89 fa 4d 8b be a0 00 00 00 c1 ef 09 89 c2 89 c0 49 8d 04 87 <8b> 00 85 c0 75 26 44 39 d7 77 0e e9 c4 03 00 00 41 39 fa 0f 83 bb
kernel: RSP: 0018:ffff9f95ea48f9c0 EFLAGS: 00010202
kernel: RAX: ffff9f96fcf1d008 RBX: ffff8e9c0af10000 RCX: 0000000000000000
kernel: RDX: 0000000040000002 RSI: ffff8e9b6f4b90c0 RDI: 0000000000000008
kernel: RBP: ffff8e9b7d148000 R08: 000000000000002a R09: 0000000000000801
kernel: R10: 0000000000000008 R11: 0000000000000801 R12: ffff8e9b7d148128
kernel: R13: ffff8e9b7d148050 R14: ffff8e9c0af10010 R15: ffff9f95fcf1d000
kernel: FS: 00007f49f4d99580(0000) GS:ffff8e9c7fcc0000(0000) knlGS:0000000000000000
kernel: CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
kernel: CR2: ffff9f96fcf1d008 CR3: 0000001ecb27c005 CR4: 00000000001606e0
kernel: Call Trace:
kernel:  ? recalibrate_cpu_khz+0x10/0x10
kernel:  ? ktime_get+0x38/0x90
kernel:  ? generic_make_request_checks+0x2f7/0x5f0
kernel:  generic_make_request+0xcf/0x320
kernel:  submit_bio+0x42/0x1c0
[...]

Looking at the stack trace ("cached_dev_make_request+0x8b0") and a kernel module disassembly with debug info, it appears to confirm that the fault is at line 50 of writeback.h, performing the atomic_read with a bogus address outside the range of the allocated stripe_sectors_dirty array, because stripe is bigger than nr_stripes:

	while (1) {
>>>		if (atomic_read(dc->disk.stripe_sectors_dirty + stripe))
			return true;

This should also be possible to set up with an actual raid5 configuration. Looking at the 5.3.15 sources, the test of partial_stripes_expensive was present there too, and I'm pretty sure my earlier test on a virtual machine didn't use raid5... so I'm not quite sure how it got triggered before...
This message is a reminder that Fedora 31 is nearing its end of life. Fedora will stop maintaining and issuing updates for Fedora 31 on 2020-11-24. It is Fedora's policy to close all bug reports from releases that are no longer maintained. At that time this bug will be closed as EOL if it remains open with a Fedora 'version' of '31'.

Package Maintainer: If you wish for this bug to remain open because you plan to fix it in a currently maintained version, simply change the 'version' to a later Fedora version.

Thank you for reporting this issue and we are sorry that we were not able to fix it before Fedora 31 reached end of life. If you would still like to see this bug fixed and are able to reproduce it against a later version of Fedora, you are encouraged to change the 'version' to a later Fedora version before this bug is closed as described in the policy above.

Although we aim to fix as many bugs as possible during every release's lifetime, sometimes those efforts are overtaken by events. Often a more recent Fedora release includes newer upstream software that fixes bugs or makes them obsolete.
Coly Li made a couple of changes to address this problem back in July; they were incorporated into Linus's tree in the 5.9 release. By inspection, it seems clear that 5.8.18-200.fc32 has not added any patches to address the problems.
This message is a reminder that Fedora 32 is nearing its end of life. Fedora will stop maintaining and issuing updates for Fedora 32 on 2021-05-25. It is Fedora's policy to close all bug reports from releases that are no longer maintained. At that time this bug will be closed as EOL if it remains open with a Fedora 'version' of '32'.

Package Maintainer: If you wish for this bug to remain open because you plan to fix it in a currently maintained version, simply change the 'version' to a later Fedora version.

Thank you for reporting this issue and we are sorry that we were not able to fix it before Fedora 32 reached end of life. If you would still like to see this bug fixed and are able to reproduce it against a later version of Fedora, you are encouraged to change the 'version' to a later Fedora version before this bug is closed as described in the policy above.

Although we aim to fix as many bugs as possible during every release's lifetime, sometimes those efforts are overtaken by events. Often a more recent Fedora release includes newer upstream software that fixes bugs or makes them obsolete.
Fedora 32 changed to end-of-life (EOL) status on 2021-05-25. Fedora 32 is no longer maintained, which means that it will not receive any further security or bug fix updates. As a result, we are closing this bug.

If you can reproduce this bug against a currently maintained version of Fedora, please feel free to reopen this bug against that version. If you are unable to reopen this bug, please file a new report against the current release. If you experience problems, please add a comment to this bug.

Thank you for reporting this bug and we are sorry it could not be fixed.