Bug 1783075

Summary: bcache: overflow in stripe number calculations for devices with small io_opt
Product: Fedora
Component: kernel
Version: 32
Hardware: All
OS: Linux
Type: Bug
Status: CLOSED EOL
Reporter: Ken Raeburn <raeburn>
Assignee: Kernel Maintainer List <kernel-maint>
QA Contact: Fedora Extras Quality Assurance <extras-qa>
Keywords: Reopened
CC: airlied, bskeggs, hdegoede, ichavero, itamar, jarodwilson, jeremy, jglisse, john.j5live, jonathan, josef, kernel-maint, linville, masami256, mchehab, mjg59, steved
Last Closed: 2021-05-25 15:13:46 UTC
Attachments: system journal dump

Description Ken Raeburn 2019-12-12 23:45:12 UTC
Created attachment 1644595 [details]
system journal dump

1. Please describe the problem:

There are overflow bugs present in the bcache driver in the 5.3.15-300.fc31
kernel regarding stripe number calculations, caused by assuming that stripe
numbers can fit in 31 bits, and relatedly, that stripe sizes will be large
enough relative to device sizes to make stripe numbers fit in 31 bits. The
possible failure modes include at least (1) inability to configure bcache
devices atop certain types of backing storage of certain ranges of sizes, and
(2) out of bounds pointer accesses in the kernel.

Stripe sizes can be derived from the underlying storage device’s optimum I/O
size (io_opt). If this is set to something small, then for a large enough
underlying storage device, the number of stripes may exceed 2**31. Red Hat’s
VDO device (available at https://github.com/dm-vdo/kvdo) is one such device,
with an optimum I/O size of 4096. I see that zram uses PAGE_SIZE. I’m more
familiar with VDO, so that’s what I’ll use for my example, but as far as I can
tell, the only interesting characteristic is that VDO sets io_opt to 4096 (and
it’s easy to make a storage volume appear to have several TB of storage on a
test system).

In …/bcache/super.c where nr_stripes is calculated:

  d->nr_stripes = DIV_ROUND_UP_ULL(sectors, d->stripe_size);

Unfortunately, nr_stripes is an “unsigned int” field, so if, for example, the
device is 18 TB and the optimum I/O size is 4 kB, we have a truncation
problem:

  sectors = 38654705664
  stripe_size = 8
  DIV_ROUND_UP_ULL() = 4831838208 = 0x120000000
  (unsigned int) ... = 0x20000000 = 536870912

So nr_stripes in this case is the same as if the device size were 2 TB.

Also, elsewhere in this function, the value is capped at INT_MAX, so if the
stripe count after truncation is over 2**31, the configuration is rejected
altogether. In the case of the 4 kB optimum I/O size, this is device sizes of
8-16 TB modulo 16 TB (so, 0-8 TB okay, 8-16 fail, 16-24 okay, 24-32 fail,
etc). For other optimum I/O sizes, scale accordingly.

If we get past the INT_MAX check but have truncated the stripe count,
bcache_dev_stripe_dirty in writeback.h can dereference the
stripe_sectors_dirty array without bounds checking. The array is allocated
based on the truncated nr_stripes value, but indexed based on a real 64-bit
device offset.

This can lead to kernel errors when writing to some offsets:

    Dec 12 03:02:29 kernel: BUG: unable to handle page fault for address: ffff8f59b0300008
    Dec 12 03:02:29 kernel: #PF: supervisor read access in kernel mode
    Dec 12 03:02:29 kernel: #PF: error_code(0x0000) - not-present page
    Dec 12 03:02:29 kernel: PGD 5b001067 P4D 5b001067 PUD 0 
    Dec 12 03:02:29 kernel: Oops: 0000 [#1] SMP PTI
    Dec 12 03:02:29 kernel: CPU: 0 PID: 844 Comm: dd Tainted: G           OE     5.3.15-300.fc31.x86_64 #1
    Dec 12 03:02:29 kernel: Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS ?-20180724_192412-buildhw-07.phx2.fedoraproject.org-1.fc29 04/01/2014
    Dec 12 03:02:29 kernel: RIP: 0010:cached_dev_make_request+0x8c0/0xdf0 [bcache]
    [...]
    Dec 12 03:02:29 kernel: Call Trace:
    Dec 12 03:02:29 kernel:  generic_make_request+0xcf/0x320
    Dec 12 03:02:29 kernel:  submit_bio+0x3c/0x160
    Dec 12 03:02:29 kernel:  ? bio_add_page+0x62/0x90
    Dec 12 03:02:29 kernel:  ? guard_bio_eod+0x2c/0x130
    Dec 12 03:02:29 kernel:  submit_bh_wbc+0x16d/0x190
    Dec 12 03:02:29 kernel:  __block_write_full_page+0x200/0x410
    Dec 12 03:02:29 kernel:  ? touch_buffer+0x60/0x60
    Dec 12 03:02:29 kernel:  ? bdev_evict_inode+0xf0/0xf0
    Dec 12 03:02:29 kernel:  __writepage+0x19/0x50
    Dec 12 03:02:29 kernel:  write_cache_pages+0x171/0x420
    Dec 12 03:02:29 kernel:  ? __wb_calc_thresh+0x130/0x130
    Dec 12 03:02:29 kernel:  ? ep_poll_callback+0x29c/0x2c0
    Dec 12 03:02:29 kernel:  generic_writepages+0x57/0x90
    Dec 12 03:02:29 kernel:  ? __wake_up_common_lock+0x8a/0xc0
    Dec 12 03:02:29 kernel:  do_writepages+0x43/0xd0
    Dec 12 03:02:29 kernel:  __filemap_fdatawrite_range+0xce/0x110
    Dec 12 03:02:29 kernel:  filemap_write_and_wait+0x2d/0x80
    Dec 12 03:02:29 kernel:  __blkdev_put+0x70/0x1e0
    Dec 12 03:02:29 kernel:  blkdev_close+0x22/0x30
    [...]

Also, offset_to_stripe in that same header returns an “unsigned int”, but the
value it has computed is a 64-bit quotient (64-bit value divided by a 32-bit
value) that is potentially being truncated if the stripe size is small.

I do not know if there are other dereferencing problems, or any more subtle
problems caused by being able to do I/O with sector offsets that generate
stripe numbers larger than nr_stripes.


2. What is the Version-Release number of the kernel:

5.3.15-300.fc31

3. Did it work previously in Fedora? If so, what kernel version did the issue
   *first* appear?  Old kernels are available for download at
   https://koji.fedoraproject.org/koji/packageinfo?packageID=8 :

Kernel 4.17.8 had the same or similar bugs.
I'm not aware of a previously working version.

4. Can you reproduce this issue? If so, please provide the steps to reproduce
   the issue below:

I started with a “Vagrant” virtual machine image (fedora/31-cloud-base), 16 GB
of memory, additional 150 GB disk image, updated to the current kernel and
kernel development packages (5.3.15-300.fc31). Setting it up with more than
one virtual CPU may make it easier to keep interacting with the VM even if one
CPU is thrown into a loop or something.

I installed VDO on it. The current VDO package on GitHub as of 2019-12-11 does
not build with the latest kernel due to some of the warning options in newer
kernels, so it needs a little tweaking. My setup script, run as root:

  #!/bin/sh -ex
  dnf install -y device-mapper-devel device-mapper-event-devel dkms \
      libuuid-devel lvm2 make python3 python3-PyYAML redhat-lsb-core rpm-build
  mkdir /tmp/builds
  cd /tmp/builds
  git clone https://github.com/dm-vdo/kvdo.git
  git clone https://github.com/dm-vdo/vdo.git
  # Patch up kvdo for new VLA warnings
  sed -i -e '/^EXTRA_CFLAGS/s/=/= -Wno-vla/' kvdo/uds/Makefile kvdo/vdo/Makefile
  REL=$(uname -r)
  make -C /usr/src/kernels/$REL M=/tmp/builds/kvdo
  cp kvdo/uds/uds.ko kvdo/vdo/kvdo.ko /lib/modules/$REL/updates/
  depmod -a
  make -C vdo
  make -C vdo install
  modprobe kvdo

I set up a VDO device on the second disk, and created a volume group on top of
the VDO volume.

  [root@localhost vagrant]# vdo create -n vdo0 --device /dev/vdb --sparseIndex disabled --indexMem 0.25 --vdoLogicalSize 18T 
  Creating VDO vdo0
  Starting VDO vdo0
  Starting compression on VDO vdo0
  VDO instance 0 volume is ready at /dev/mapper/vdo0
  [root@localhost vagrant]# vgcreate testvg /dev/mapper/vdo0 
    Physical volume "/dev/mapper/vdo0" successfully created.
    Volume group "testvg" successfully created
  [root@localhost vagrant]#

Set up a small volume for a cache, and initially we’ll try a less-than-8-TB
block device to show that that goes okay:

  [root@localhost ~]# lvcreate --size 30G -n cache testvg 
    Logical volume "cache" created.
  [root@localhost ~]# lvcreate --size 7T -n foo testvg 
    Logical volume "foo" created.
  [root@localhost ~]# 

Install bcache-tools and systemtap (for later).
Try creating the cache:

  [root@localhost ~]# make-bcache -C /dev/testvg/cache -B /dev/testvg/foo 
  UUID:			dbc49d7d-6f7b-4d71-802a-503ca95ef7af
  […]

Success! So tear it down.

  [root@localhost ~]# ls -l /dev/testvg/foo
  lrwxrwxrwx. 1 root root 7 Dec 12 22:06 /dev/testvg/foo -> ../dm-2
  [root@localhost ~]# echo 1 > /sys/block/dm-2/bcache/detach
  [root@localhost ~]# echo 1 > /sys/block/dm-2/bcache/stop
  [root@localhost ~]# echo 1 > /sys/fs/bcache/*-*/stop 
  [root@localhost ~]# wipefs -a /dev/testvg/cache
  /dev/testvg/cache: 16 bytes were erased at offset 0x00001018 (bcache): c6 85 73 f6 4e 1a 45 ca 82 65 f5 7f 48 ba 6d 81
  [root@localhost ~]# wipefs -a /dev/testvg/foo
  /dev/testvg/foo: 16 bytes were erased at offset 0x00001018 (bcache): c6 85 73 f6 4e 1a 45 ca 82 65 f5 7f 48 ba 6d 81
  [root@localhost ~]#

Try a size between 8 TB and 16 TB:

  [root@localhost ~]# lvresize --size 9T testvg/foo 
    Size of logical volume testvg/foo changed from 7.00 TiB (1835008 extents) to 9.00 TiB (2359296 extents).
    Logical volume testvg/foo successfully resized.
  [root@localhost ~]# make-bcache -C /dev/testvg/cache -B /dev/testvg/foo 
  UUID:			f932256f-aca8-4f6c-a945-b3487adb6e37
  Set UUID:		4c631dd0-8605-4f56-91ad-22951bd4e7bd
  version:		0
  nbuckets:		61440
  block_size:		8
  bucket_size:		1024
  nr_in_set:		1
  nr_this_dev:		0
  first_bucket:		1
  UUID:			0cb54114-f665-4ae7-ae8c-1423c0af3808
  Set UUID:		4c631dd0-8605-4f56-91ad-22951bd4e7bd
  version:		1
  block_size:		8
  data_offset:		16
  [root@localhost ~]#

This appears successful, and the exit status was 0; however, the kernel has
logged an error and the /sys/block/dm-2/bcache directory didn’t appear:

  Dec 12 22:16:15 localhost.localdomain kernel: bcache: run_cache_set() invalidating existing data
  Dec 12 22:16:15 localhost.localdomain kernel: bcache: register_cache() registered cache device dm-1
  Dec 12 22:16:15 localhost.localdomain kernel: bcache: bcache_device_init() nr_stripes too large or invalid: 2415919102 (start sector beyond end of disk?)
  Dec 12 22:16:15 localhost.localdomain kernel: bcache: register_bdev() error dm-2: cannot allocate memory
  Dec 12 22:16:15 localhost.localdomain kernel: bcache: register_bcache() error /dev/dm-2: failed to register device
  Dec 12 22:16:15 localhost.localdomain kernel: bcache: bcache_device_free() (efault) stopped
  Dec 12 22:16:15 localhost.localdomain systemd-udevd[19151]: dm-2: Process 'bcache-register /dev/dm-2' failed with exit code 1.

So, clean up:

  [root@localhost ~]# echo 1 > /sys/fs/bcache/*-*/stop 
  [root@localhost ~]# wipefs -a /dev/testvg/foo
  /dev/testvg/foo: 16 bytes were erased at offset 0x00001018 (bcache): c6 85 73 f6 4e 1a 45 ca 82 65 f5 7f 48 ba 6d 81
  [root@localhost ~]# wipefs -a /dev/testvg/cache
  /dev/testvg/cache: 16 bytes were erased at offset 0x00001018 (bcache): c6 85 73 f6 4e 1a 45 ca 82 65 f5 7f 48 ba 6d 81
  [root@localhost ~]#

And try again in the 16-24 TB range. This time, we use systemtap to print out
some of the computed stripe info when the device starts up (and udev scans it
looking for identifiable content).

  [root@localhost ~]# lvresize --size 17T testvg/foo 
    Size of logical volume testvg/foo changed from 9.00 TiB (2359296 extents) to 17.00 TiB (4456448 extents).
    Logical volume testvg/foo successfully resized.
  [root@localhost ~]# stap -e 'probe module("bcache").function("cached_dev_read") { printf("stripe size %u nr_stripes %u\n", $dc->disk->stripe_size, $dc->disk->nr_stripes); exit(); }' &
  [1] 20377
  [root@localhost ~]# 
  [root@localhost ~]# 
  [root@localhost ~]# make-bcache -C /dev/testvg/cache -B /dev/testvg/foo 
  UUID:			498f42ee-13ef-48ea-b67a-8d7d2a5437cb
  Set UUID:		d6cf5434-dedf-44b0-94b1-d3722ce1e0f8
  version:		0
  nbuckets:		61440
  block_size:		8
  bucket_size:		1024
  nr_in_set:		1
  nr_this_dev:		0
  first_bucket:		1
  UUID:			5e0cde0c-49ca-4e94-bcc1-e1ddc5fee8c0
  Set UUID:		d6cf5434-dedf-44b0-94b1-d3722ce1e0f8
  version:		1
  block_size:		8
  data_offset:		16
  [root@localhost ~]# stripe size 8 nr_stripes 268435454

  [1]+  Done                    stap -e 'probe module("bcache").function("cached_dev_read") { printf("stripe size %u nr_stripes %u\n", $dc->disk->stripe_size, $dc->disk->nr_stripes); exit(); }'
  [root@localhost ~]# 

268435454 stripes of 8 sectors or 4096 bytes each is 1099511619584, or just
under 1 TB.

So, now we can try seeing what happens if we cause stripe numbers to be
computed that are higher than this. An offset of 4 TB in 4 kB blocks is
1024**3 or 1073741824.

  [root@localhost ~]# echo writeback > /sys/block/dm-2/bcache/cache_mode 
  [root@localhost ~]# dd if=/dev/urandom bs=4096 count=1 of=/dev/bcache0
  1+0 records in
  1+0 records out
  4096 bytes (4.1 kB, 4.0 KiB) copied, 0.00331645 s, 1.2 MB/s
  [root@localhost ~]# dd if=/dev/urandom bs=4096 count=1 of=/dev/bcache0 seek=1073741824 
  1+0 records in
  1+0 records out
  4096 bytes (4.1 kB, 4.0 KiB) copied, 0.00115473 s, 3.5 MB/s
  [root@localhost ~]#

At this point, the VM became unresponsive, consuming lots of CPU time. The
console shows a stack trace with a fault at cached_dev_make_request+0x8c0, but
some of the info had scrolled off the screen.

After rebooting the machine and restarting VDO (it wasn’t configured to
restart automatically), the logical volumes and bcache came up automatically,
so I reissued that last “dd” command again, and immediately got this:

  Dec 12 22:55:07 localhost.localdomain kernel: BUG: unable to handle page fault for address: ffffaea6de7f1008
  Dec 12 22:55:07 localhost.localdomain kernel: #PF: supervisor read access in kernel mode
  Dec 12 22:55:07 localhost.localdomain kernel: #PF: error_code(0x0000) - not-present page
  Dec 12 22:55:07 localhost.localdomain kernel: PGD 42dd46067 P4D 42dd46067 PUD 0 
  Dec 12 22:55:07 localhost.localdomain kernel: Oops: 0000 [#1] SMP PTI
  Dec 12 22:55:07 localhost.localdomain kernel: CPU: 1 PID: 905 Comm: dd Tainted: G           OE     5.3.15-300.fc31.x86_64 #1
  Dec 12 22:55:07 localhost.localdomain kernel: Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS ?-20180724_192412-buildhw-07.phx2.fedoraproject.org-1.fc29 04/01/2014
  Dec 12 22:55:07 localhost.localdomain kernel: RIP: 0010:cached_dev_make_request+0x8c0/0xdf0 [bcache]
  Dec 12 22:55:07 localhost.localdomain kernel: Code: 00 00 44 8b b3 9c 00 00 00 48 8b 46 20 31 d2 8b 7e 28 49 f7 f6 4d 89 f1 4c 8b b3 a0 00 00 00 c1 ef 09 89 c2 89 c0 49 8d 04 86 <8b> 00 85 c0 75 26 44 39 cf 77 0e e9 ae 03 00 00 41 39 f9 0f 83 a5
  Dec 12 22:55:07 localhost.localdomain kernel: RSP: 0018:ffffaea5c079f9d0 EFLAGS: 00010202
  Dec 12 22:55:07 localhost.localdomain kernel: RAX: ffffaea6de7f1008 RBX: ffff8f96bc1f0010 RCX: 0000000000000000
  Dec 12 22:55:07 localhost.localdomain kernel: RDX: 0000000040000002 RSI: ffff8f96786b4cc0 RDI: 0000000000000008
  Dec 12 22:55:07 localhost.localdomain kernel: RBP: ffff8f96bc119720 R08: 0000000000000801 R09: 0000000000000008
  Dec 12 22:55:07 localhost.localdomain kernel: R10: 0000000000000801 R11: 0000000000000000 R12: ffff8f96bc119848
  Dec 12 22:55:07 localhost.localdomain kernel: R13: ffff8f96bc119770 R14: ffffaea5de7f1000 R15: ffff8f96bc1f0000
  Dec 12 22:55:07 localhost.localdomain kernel: FS:  00007f376d40c580(0000) GS:ffff8f96efb00000(0000) knlGS:0000000000000000
  Dec 12 22:55:07 localhost.localdomain kernel: CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
  Dec 12 22:55:07 localhost.localdomain kernel: CR2: ffffaea6de7f1008 CR3: 00000003b82bc005 CR4: 00000000003606e0
  Dec 12 22:55:07 localhost.localdomain kernel: DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
  Dec 12 22:55:07 localhost.localdomain kernel: DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
  Dec 12 22:55:07 localhost.localdomain kernel: Call Trace:
  Dec 12 22:55:07 localhost.localdomain kernel:  generic_make_request+0xcf/0x320
  Dec 12 22:55:07 localhost.localdomain kernel:  submit_bio+0x3c/0x160
  Dec 12 22:55:07 localhost.localdomain kernel:  ? bio_add_page+0x62/0x90
  Dec 12 22:55:07 localhost.localdomain kernel:  ? guard_bio_eod+0x2c/0x130
  Dec 12 22:55:07 localhost.localdomain kernel:  submit_bh_wbc+0x16d/0x190
  Dec 12 22:55:07 localhost.localdomain kernel:  __block_write_full_page+0x200/0x410
  Dec 12 22:55:07 localhost.localdomain kernel:  ? touch_buffer+0x60/0x60
  Dec 12 22:55:07 localhost.localdomain kernel:  ? bdev_evict_inode+0xf0/0xf0
  Dec 12 22:55:07 localhost.localdomain kernel:  __writepage+0x19/0x50
  Dec 12 22:55:07 localhost.localdomain kernel:  write_cache_pages+0x171/0x420
  Dec 12 22:55:07 localhost.localdomain kernel:  ? __wb_calc_thresh+0x130/0x130
  Dec 12 22:55:07 localhost.localdomain kernel:  ? ep_poll_callback+0x29c/0x2c0
  Dec 12 22:55:07 localhost.localdomain kernel:  generic_writepages+0x57/0x90
  Dec 12 22:55:07 localhost.localdomain kernel:  ? __wake_up_common_lock+0x8a/0xc0
  Dec 12 22:55:07 localhost.localdomain kernel:  do_writepages+0x43/0xd0
  Dec 12 22:55:07 localhost.localdomain kernel:  __filemap_fdatawrite_range+0xce/0x110
  Dec 12 22:55:07 localhost.localdomain kernel:  filemap_write_and_wait+0x2d/0x80
  Dec 12 22:55:07 localhost.localdomain kernel:  __blkdev_put+0x70/0x1e0
  Dec 12 22:55:07 localhost.localdomain kernel:  blkdev_close+0x22/0x30
  Dec 12 22:55:07 localhost.localdomain kernel:  __fput+0xc1/0x250
  Dec 12 22:55:07 localhost.localdomain kernel:  task_work_run+0x87/0xa0
  Dec 12 22:55:07 localhost.localdomain kernel:  exit_to_usermode_loop+0xda/0x100
  Dec 12 22:55:07 localhost.localdomain kernel:  do_syscall_64+0x183/0x1a0
  Dec 12 22:55:07 localhost.localdomain kernel:  entry_SYSCALL_64_after_hwframe+0x44/0xa9
  Dec 12 22:55:07 localhost.localdomain kernel: RIP: 0033:0x7f376d334c07
  Dec 12 22:55:07 localhost.localdomain kernel: Code: ff ff e8 fc e8 01 00 66 2e 0f 1f 84 00 00 00 00 00 66 90 f3 0f 1e fa 64 8b 04 25 18 00 00 00 85 c0 75 10 b8 03 00 00 00 0f 05 <48> 3d 00 f0 ff ff 77 41 c3 48 83 ec 18 89 7c 24 0c e8 f3 41 f9 ff
  Dec 12 22:55:07 localhost.localdomain kernel: RSP: 002b:00007ffcacdeba98 EFLAGS: 00000246 ORIG_RAX: 0000000000000003
  Dec 12 22:55:07 localhost.localdomain kernel: RAX: 0000000000000000 RBX: 0000556129b9b120 RCX: 00007f376d334c07
  Dec 12 22:55:07 localhost.localdomain kernel: RDX: 0000000000001000 RSI: 0000000000000000 RDI: 0000000000000001
  Dec 12 22:55:07 localhost.localdomain kernel: RBP: 0000000000000000 R08: 000055612a22ab60 R09: 00000000000000f0
  Dec 12 22:55:07 localhost.localdomain kernel: R10: 000055612a22b000 R11: 0000000000000246 R12: 0000000000000001
  Dec 12 22:55:07 localhost.localdomain kernel: R13: 0000000040000000 R14: 0000000000000000 R15: 0000000000000001
  Dec 12 22:55:07 localhost.localdomain kernel: Modules linked in: bcache crc64 kvdo(OE) uds(OE) intel_rapl_msr intel_rapl_common kvm_intel kvm cirrus irqbypass drm_kms_helper drm virtio_net net_failover failover i2c_piix4 virtio_balloon ip_tables crct10dif_pclmul crc32_pclmul crc32c_intel ghash_clmulni_intel serio_raw virtio_blk qemu_fw_cfg ata_generic pata_acpi
  Dec 12 22:55:07 localhost.localdomain kernel: CR2: ffffaea6de7f1008
  Dec 12 22:55:07 localhost.localdomain kernel: ---[ end trace 24e47706f4bb72ab ]---
  Dec 12 22:55:07 localhost.localdomain kernel: RIP: 0010:cached_dev_make_request+0x8c0/0xdf0 [bcache]
  Dec 12 22:55:07 localhost.localdomain kernel: Code: 00 00 44 8b b3 9c 00 00 00 48 8b 46 20 31 d2 8b 7e 28 49 f7 f6 4d 89 f1 4c 8b b3 a0 00 00 00 c1 ef 09 89 c2 89 c0 49 8d 04 86 <8b> 00 85 c0 75 26 44 39 cf 77 0e e9 ae 03 00 00 41 39 f9 0f 83 a5
  Dec 12 22:55:07 localhost.localdomain kernel: RSP: 0018:ffffaea5c079f9d0 EFLAGS: 00010202
  Dec 12 22:55:07 localhost.localdomain kernel: RAX: ffffaea6de7f1008 RBX: ffff8f96bc1f0010 RCX: 0000000000000000
  Dec 12 22:55:07 localhost.localdomain kernel: RDX: 0000000040000002 RSI: ffff8f96786b4cc0 RDI: 0000000000000008
  Dec 12 22:55:07 localhost.localdomain kernel: RBP: ffff8f96bc119720 R08: 0000000000000801 R09: 0000000000000008
  Dec 12 22:55:07 localhost.localdomain kernel: R10: 0000000000000801 R11: 0000000000000000 R12: ffff8f96bc119848
  Dec 12 22:55:07 localhost.localdomain kernel: R13: ffff8f96bc119770 R14: ffffaea5de7f1000 R15: ffff8f96bc1f0000
  Dec 12 22:55:07 localhost.localdomain kernel: FS:  00007f376d40c580(0000) GS:ffff8f96efb00000(0000) knlGS:0000000000000000
  Dec 12 22:55:07 localhost.localdomain kernel: CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
  Dec 12 22:55:07 localhost.localdomain kernel: CR2: ffffaea6de7f1008 CR3: 00000003b82bc005 CR4: 00000000003606e0
  Dec 12 22:55:07 localhost.localdomain kernel: DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
  Dec 12 22:55:07 localhost.localdomain kernel: DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400

This trace is consistent with what I saw the first time I hit this aspect of
the problem.

This problem was present in earlier kernels such as 4.17.8-1.el7.elrepo; see
https://github.com/dm-vdo/kvdo/issues/9 for info. The pattern continues; from
24-32 TB, an error occurs; from 32-40 TB, configuration succeeds; and so
on. Thanks to Bryan Gurney and Sergey Ponomarev for helping look into it
initially back then.

5. Does this problem occur with the latest Rawhide kernel? To install the
   Rawhide kernel, run ``sudo dnf install fedora-repos-rawhide`` followed by
   ``sudo dnf update --enablerepo=rawhide kernel``:

unknown

6. Are you running any modules not shipped directly with Fedora's kernel?:

kvdo (https://github.com/dm-vdo/kvdo)

7. Please attach the kernel logs. You can get the complete kernel log
   for a boot with ``journalctl --no-hostname -k > dmesg.txt``. If the
   issue occurred on a previous boot, use the journalctl ``-b`` flag.

Comment 1 Justin M. Forbes 2020-03-03 16:18:04 UTC
*********** MASS BUG UPDATE **************

We apologize for the inconvenience.  There are a large number of bugs to go through and several of them have gone stale.  Due to this, we are doing a mass bug update across all of the Fedora 31 kernel bugs.

Fedora 31 has now been rebased to 5.5.7-200.fc31.  Please test this kernel update (or newer) and let us know if your issue has been resolved or if it is still present with the newer kernel.

If you have moved on to Fedora 32, and are still experiencing this issue, please change the version to Fedora 32.

If you experience different issues, please open a new bug report for those.

Comment 2 Justin M. Forbes 2020-03-25 22:26:43 UTC
*********** MASS BUG UPDATE **************
This bug is being closed with INSUFFICIENT_DATA as there has not been a response in 3 weeks. If you are still experiencing this issue, please reopen and attach the relevant data from the latest kernel you are running and any data that might have been requested previously.

Comment 3 Ken Raeburn 2020-03-28 06:37:33 UTC
I rebuilt my Fedora 31 test VM and got kernel 5.5.11-200.

The failure with the 9 TB volume is the same.

With the 17 TB volume, I can create the cache. Writing to it didn't produce the BUG failure in my first few attempts, but it didn't always happen with the 5.3.15-300 kernel either, so I can't say for certain whether that part has been fixed.

My systemtap one-liner had to be made a bit more complicated due to the restructuring of the bcache read path:
probe module("bcache").function("cached_dev_read_done_bh")
{
  disk = @container_of($cl, "struct search", cl)->d;
  printf("stripe_size %u nr_stripes %u\n",
         @cast(disk, "struct bcache_device", "bcache")->stripe_size,
         @cast(disk, "struct bcache_device", "bcache")->nr_stripes);
  exit();
}
...but it still confirms that the number of stripes gets truncated for the 17 TB volume.

Comment 4 Ken Raeburn 2020-03-28 08:26:12 UTC
Ah, yes... it helps to force the bcache device to have partial_stripes_expensive=1; without a handy configuration where it could be inherited from q->limits.raid_partial_stripes_expensive, I took the sledgehammer approach:

# run under systemtap and then read a block
probe module("bcache").function("cached_dev_read_done_bh")
{
  dc = &@container_of(@container_of($cl, "struct search", cl)->d,
		      "struct cached_dev", disk);
  @cast(dc,"struct cached_dev","bcache")->partial_stripes_expensive = 1;
  exit();
}

Then if writeback mode is enabled and other conditions are met, bcache_dev_stripe_dirty can be called when writing a block to a well chosen offset, and may, if you're lucky, hit an unmapped address and trigger the fault again.

$ dd if=/dev/urandom bs=4096 count=1 of=/dev/bcache0 seek=1073741824
...
kernel: BUG: unable to handle page fault for address: ffff9f96fcf1d008
kernel: #PF: supervisor read access in kernel mode
kernel: #PF: error_code(0x0000) - not-present page
kernel: PGD 1ff994b067 P4D 1ff994b067 PUD 0 
kernel: Oops: 0000 [#1] SMP PTI
kernel: CPU: 3 PID: 5296 Comm: dd Kdump: loaded Tainted: P           OE     5.5.11-200.fc31.x86_64 #1
kernel: Hardware name: Supermicro X9SRE/X9SRE-3F/X9SRi/X9SRi-3F/X9SRE/X9SRE-3F/X9SRi/X9SRi-3F, BIOS 3.2 01/16/2015
kernel: RIP: 0010:cached_dev_make_request+0x8b0/0xdf0 [bcache]
kernel: Code: 00 00 45 8b be 9c 00 00 00 48 8b 46 20 31 d2 8b 7e 28 49 f7 f7 4d 89 fa 4d 8b be a0 00 00 00 c1 ef 09 89 c2 89 c0 49 8d 04 87 <8b> 00 85 c0 75 26 44 39 d7 77 0e e9 c4 03 00 00 41 39 fa 0f 83 bb
kernel: RSP: 0018:ffff9f95ea48f9c0 EFLAGS: 00010202
kernel: RAX: ffff9f96fcf1d008 RBX: ffff8e9c0af10000 RCX: 0000000000000000
kernel: RDX: 0000000040000002 RSI: ffff8e9b6f4b90c0 RDI: 0000000000000008
kernel: RBP: ffff8e9b7d148000 R08: 000000000000002a R09: 0000000000000801
kernel: R10: 0000000000000008 R11: 0000000000000801 R12: ffff8e9b7d148128
kernel: R13: ffff8e9b7d148050 R14: ffff8e9c0af10010 R15: ffff9f95fcf1d000
kernel: FS:  00007f49f4d99580(0000) GS:ffff8e9c7fcc0000(0000) knlGS:0000000000000000
kernel: CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
kernel: CR2: ffff9f96fcf1d008 CR3: 0000001ecb27c005 CR4: 00000000001606e0
kernel: Call Trace:
kernel:  ? recalibrate_cpu_khz+0x10/0x10
kernel:  ? ktime_get+0x38/0x90
kernel:  ? generic_make_request_checks+0x2f7/0x5f0
kernel:  generic_make_request+0xcf/0x320
kernel:  submit_bio+0x42/0x1c0
[...]

Looking at the stack trace ("cached_dev_make_request+0x8b0") and a kernel module disassembly with debug info, it appears to confirm that the fault is at line 50 of writeback.h, performing the atomic_read with a bogus address outside the range of the allocated stripe_sectors_dirty array, because stripe is bigger than nr_stripes:
	while (1) {
>>>		if (atomic_read(dc->disk.stripe_sectors_dirty + stripe))
			return true;

This should also be possible to set up with an actual raid5 configuration.

Looking at the 5.3.15 sources, the test of partial_stripes_expensive was present there too, and I'm pretty sure my earlier test on a virtual machine didn't use raid5...so I'm not quite sure how it got triggered before...

Comment 5 Ben Cotton 2020-11-03 15:59:19 UTC
This message is a reminder that Fedora 31 is nearing its end of life.
Fedora will stop maintaining and issuing updates for Fedora 31 on 2020-11-24.
It is Fedora's policy to close all bug reports from releases that are no longer
maintained. At that time this bug will be closed as EOL if it remains open with a
Fedora 'version' of '31'.

Package Maintainer: If you wish for this bug to remain open because you
plan to fix it in a currently maintained version, simply change the 'version' 
to a later Fedora version.

Thank you for reporting this issue and we are sorry that we were not
able to fix it before Fedora 31 is end of life. If you would still like
to see this bug fixed and are able to reproduce it against a later version
of Fedora, you are encouraged to change the 'version' to a later Fedora
version before this bug is closed as described in the policy above.

Although we aim to fix as many bugs as possible during every release's 
lifetime, sometimes those efforts are overtaken by events. Often a 
more recent Fedora release includes newer upstream software that fixes 
bugs or makes them obsolete.

Comment 6 Ken Raeburn 2020-11-11 01:00:28 UTC
Coly Li made a couple of changes to address this problem back in July; they were incorporated into Linus's tree in the 5.9 release.

By inspection, it seems clear that 5.8.18-200.fc32 has not added any patches to address the problems.

Comment 7 Fedora Program Management 2021-04-29 16:01:01 UTC
This message is a reminder that Fedora 32 is nearing its end of life.
Fedora will stop maintaining and issuing updates for Fedora 32 on 2021-05-25.
It is Fedora's policy to close all bug reports from releases that are no longer
maintained. At that time this bug will be closed as EOL if it remains open with a
Fedora 'version' of '32'.

Package Maintainer: If you wish for this bug to remain open because you
plan to fix it in a currently maintained version, simply change the 'version' 
to a later Fedora version.

Thank you for reporting this issue and we are sorry that we were not
able to fix it before Fedora 32 is end of life. If you would still like
to see this bug fixed and are able to reproduce it against a later version
of Fedora, you are encouraged to change the 'version' to a later Fedora
version before this bug is closed as described in the policy above.

Although we aim to fix as many bugs as possible during every release's 
lifetime, sometimes those efforts are overtaken by events. Often a 
more recent Fedora release includes newer upstream software that fixes 
bugs or makes them obsolete.

Comment 8 Ben Cotton 2021-05-25 15:13:46 UTC
Fedora 32 changed to end-of-life (EOL) status on 2021-05-25. Fedora 32 is
no longer maintained, which means that it will not receive any further
security or bug fix updates. As a result we are closing this bug.

If you can reproduce this bug against a currently maintained version of
Fedora please feel free to reopen this bug against that version. If you
are unable to reopen this bug, please file a new report against the
current release. If you experience problems, please add a comment to this
bug.

Thank you for reporting this bug and we are sorry it could not be fixed.