Bug 1788596 - Thin-pool target OOPS on dm_pool_commit_metadata
Keywords:
Status: CLOSED CURRENTRELEASE
Alias: None
Product: LVM and device-mapper
Classification: Community
Component: device-mapper
Version: unspecified
Hardware: Unspecified
OS: Unspecified
Priority: high
Severity: unspecified
Target Milestone: ---
Target Release: ---
Assignee: Mike Snitzer
QA Contact: cluster-qe@redhat.com
URL:
Whiteboard:
Depends On:
Blocks:
Reported: 2020-01-07 15:13 UTC by Zdenek Kabelac
Modified: 2020-05-15 19:51 UTC
CC List: 9 users

Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed: 2020-05-15 19:51:07 UTC
Embargoed:
pm-rhel: lvm-technical-solution?


Attachments
Kernel config file for 5.5-rc5+ (43.54 KB, text/plain), 2020-01-08 09:48 UTC, Zdenek Kabelac
Log of test execution using scsi_debug backend device (422.42 KB, text/plain), 2020-01-13 10:24 UTC, Zdenek Kabelac


Links
System ID Private Priority Status Summary Last Updated
Red Hat Bugzilla 1782045 0 high CLOSED reshape of a raid5 thinpool results in a hung lvconvert with an error " Internal error: Performing unsafe table load wh... 2023-09-18 00:19:07 UTC

Description Zdenek Kabelac 2020-01-07 15:13:11 UTC
Description of problem:

When the thin-pool is committing and the pool's data device has an invalid size (due to other table-manipulation errors from a raid reshape), the target generates this OOPS:

 device-mapper: raid: raid456 discard support disabled due to discard_zeroes_data uncertainty.
 device-mapper: raid: Set dm-raid.devices_handle_discard_safely=Y to override.
 device-mapper: thin: Data device (dm-1) discard unsupported: Disabling discard passdown.
 device-mapper: table: 253:2: dm-1 too small for target: start=0, len=67584, dev_size=61440
 BUG: kernel NULL pointer dereference, address: 0000000000000000
 #PF: supervisor read access in kernel mode
 #PF: error_code(0x0000) - not-present page
 PGD 800000002ab8a067 P4D 800000002ab8a067 PUD 0 
 Oops: 0000 [#1] SMP PTI
 CPU: 1 PID: 3520 Comm: dmsetup Not tainted 5.5.0-rc5+ #109
 Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.13.0-1.fc32 04/01/2014
 RIP: 0010:metadata_pre_commit_callback+0x15/0x60 [dm_thin_pool]
 Code: fe e9 9f ce b3 c4 66 66 2e 0f 1f 84 00 00 00 00 00 0f 1f 40 00 55 48 8d 6f 50 53 48 89 fb 48 89 ef e8 2f 79 e1 c4 48 8b 43 10 <48> 8b 10 48 8b 92 80 00 00 00 48 39 53 58 74 10 66 81 63 64 ff fd
 RSP: 0018:ffffb527006c3bf0 EFLAGS: 00010216
 RAX: 0000000000000000 RBX: ffff8e8b6a126900 RCX: 0000000000000000
 RDX: 0000000000000000 RSI: 0000000000000282 RDI: ffff8e8b6a1269b8
 RBP: ffff8e8b6a126950 R08: ffffb527006c3c30 R09: 0000000000000000
 R10: 0000000000000000 R11: ffffffff8623aca8 R12: 0000000000000001
 R13: ffff8e8b6b258800 R14: ffff8e8b6e7f8a00 R15: ffff8e8b6c65e000
 FS:  0000000000000000(0000) GS:ffff8e8b6f100000(0063) knlGS:00000000f7be7a40
 CS:  0010 DS: 002b ES: 002b CR0: 0000000080050033
 CR2: 0000000000000000 CR3: 000000002ab8c000 CR4: 00000000000006e0
 Call Trace:
  __commit_transaction+0x36/0x200 [dm_thin_pool]
  dm_pool_commit_metadata+0x28/0x60 [dm_thin_pool]
  commit+0x1d/0xc0 [dm_thin_pool]
  dm_table_postsuspend_targets+0x40/0x50
  __dm_destroy+0x128/0x140
  dm_hash_remove_all+0x58/0x130
  remove_all+0x19/0x30
  ctl_ioctl+0x1ae/0x460
  ? dm_hash_remove_all+0x130/0x130
  dm_compat_ctl_ioctl+0x7/0x10
  __ia32_compat_sys_ioctl+0x164/0x240
  do_fast_syscall_32+0xa4/0x2c5
  entry_SYSENTER_compat+0x7f/0x91
 Modules linked in: dm_raid dm_thin_pool dm_persistent_data dm_bio_prison nfsv3 nfs_acl nfs lockd grace joydev serio_raw virtio_balloon sunrpc virtio_rng raid10 raid456 async_raid6_recov async_memcpy async_pq async_xor async_tx xor raid6_pq libcrc32c raid1 raid0 linear virtio_net net_failover virtio_blk failover virtio_pci virtio_ring virtio ata_generic pata_acpi
 CR2: 0000000000000000
 ---[ end trace ac1a948a9460b1cd ]---
 RIP: 0010:metadata_pre_commit_callback+0x15/0x60 [dm_thin_pool]
 Code: fe e9 9f ce b3 c4 66 66 2e 0f 1f 84 00 00 00 00 00 0f 1f 40 00 55 48 8d 6f 50 53 48 89 fb 48 89 ef e8 2f 79 e1 c4 48 8b 43 10 <48> 8b 10 48 8b 92 80 00 00 00 48 39 53 58 74 10 66 81 63 64 ff fd
 RSP: 0018:ffffb527006c3bf0 EFLAGS: 00010216
 RAX: 0000000000000000 RBX: ffff8e8b6a126900 RCX: 0000000000000000
 RDX: 0000000000000000 RSI: 0000000000000282 RDI: ffff8e8b6a1269b8
 RBP: ffff8e8b6a126950 R08: ffffb527006c3c30 R09: 0000000000000000
 R10: 0000000000000000 R11: ffffffff8623aca8 R12: 0000000000000001
 R13: ffff8e8b6b258800 R14: ffff8e8b6e7f8a00 R15: ffff8e8b6c65e000
 FS:  0000000000000000(0000) GS:ffff8e8b6f100000(0063) knlGS:00000000f7be7a40
 CS:  0010 DS: 002b ES: 002b CR0: 0000000080050033
 CR2: 0000000000000000 CR3: 000000002ab8c000 CR4: 00000000000006e0

Version-Release number of selected component (if applicable):
Current upstream 5.5.0-rc5+

How reproducible:


Not easy to reproduce; the latest raid-fix patch from Heinz must be applied first.

lvcreate -T -L10 vg/pool
lvconvert -y --ty raid5 --stripes 3 vg/pool_tdata
lvconvert -y --ty raid5 --stripes 3 vg/pool_tdata
lvconvert -y --ty raid5 --stripes 3 vg/pool_tdata

The last command ends with a failed DM table load and the pool remains suspended.
After resuming, just run 'dmsetup remove...' on the pool device.

Comment 1 Corey Marthaler 2020-01-07 16:29:14 UTC
I can reproduce a similar deadlock with the latest RHEL 8.2 rpms.

kernel-4.18.0-167.el8    BUILT: Sat Dec 14 19:43:52 CST 2019
lvm2-2.03.07-1.el8    BUILT: Mon Dec  2 00:09:32 CST 2019
lvm2-libs-2.03.07-1.el8    BUILT: Mon Dec  2 00:09:32 CST 2019
lvm2-dbusd-2.03.07-1.el8    BUILT: Mon Dec  2 00:12:23 CST 2019
device-mapper-1.02.167-1.el8    BUILT: Mon Dec  2 00:09:32 CST 2019
device-mapper-libs-1.02.167-1.el8    BUILT: Mon Dec  2 00:09:32 CST 2019
device-mapper-event-1.02.167-1.el8    BUILT: Mon Dec  2 00:09:32 CST 2019
device-mapper-event-libs-1.02.167-1.el8    BUILT: Mon Dec  2 00:09:32 CST 2019




[root@hayes-02 ~]# lvcreate -T -L10 vg/pool
  Rounding up size to full physical extent 12.00 MiB
  Thin pool volume with chunk size 64.00 KiB can address at most 15.81 TiB of data.
  Logical volume "pool" created.
[root@hayes-02 ~]# lvs -a -o +devices
  LV              VG Attr       LSize  Pool Origin Data%  Meta%  Move Log Cpy%Sync Convert Devices      
  [lvol0_pmspare] vg ewi-------  4.00m                                                     /dev/sdb1(0) 
  pool            vg twi-a-tz-- 12.00m             0.00   10.84                            pool_tdata(0)
  [pool_tdata]    vg Twi-ao---- 12.00m                                                     /dev/sdb1(1) 
  [pool_tmeta]    vg ewi-ao----  4.00m                                                     /dev/sdg1(0) 
[root@hayes-02 ~]# lvconvert -y --ty raid5 --stripes 3 vg/pool_tdata
  Using default stripesize 64.00 KiB.
  Replaced LV type raid5 (same as raid5_ls) with possible type raid1.
  Repeat this command to convert to raid5 after an interim conversion has finished.
  --stripes not allowed for LV vg/pool_tdata when converting from linear to raid1.
  Logical volume vg/pool_tdata successfully converted.
[root@hayes-02 ~]# lvconvert -y --ty raid5 --stripes 3 vg/pool_tdata
  Using default stripesize 64.00 KiB.
  --stripes not allowed for LV vg/pool_tdata when converting from raid1 to raid5.
  Logical volume vg/pool_tdata successfully converted.
[root@hayes-02 ~]# lvs -a -o +devices
  LV                    VG Attr       LSize  Pool Origin Data%  Meta%  Move Log Cpy%Sync Convert Devices                                      
  [lvol0_pmspare]       vg ewi-------  4.00m                                                     /dev/sdb1(0)                                 
  pool                  vg twi-a-tz-- 12.00m             0.00   10.84                            pool_tdata(0)                                
  [pool_tdata]          vg rwi-aor--- 12.00m                                    100.00           pool_tdata_rimage_0(0),pool_tdata_rimage_1(0)
  [pool_tdata_rimage_0] vg iwi-aor--- 12.00m                                                     /dev/sdb1(1)                                 
  [pool_tdata_rimage_1] vg iwi-aor--- 12.00m                                                     /dev/sdc1(1)                                 
  [pool_tdata_rmeta_0]  vg ewi-aor---  4.00m                                                     /dev/sdb1(4)                                 
  [pool_tdata_rmeta_1]  vg ewi-aor---  4.00m                                                     /dev/sdc1(0)                                 
  [pool_tmeta]          vg ewi-ao----  4.00m                                                     /dev/sdg1(0)                                 

[root@hayes-02 ~]# lvconvert -y --ty raid5 --stripes 3 vg/pool_tdata
  Using default stripesize 64.00 KiB.
  WARNING: Adding stripes to active and open logical volume vg/pool_tdata will grow it from 3 to 9 extents!
  Run "lvresize -l3 vg/pool_tdata" to shrink it or use the additional capacity.
  Internal error: Performing unsafe table load while 11 device(s) are known to be suspended:  (253:1) 

Jan  7 10:20:51 hayes-02 kernel: device-mapper: raid: Device 3 specified for rebuild; clearing superblock
Jan  7 10:20:51 hayes-02 kernel: md/raid:mdX: device dm-4 operational as raid disk 0
Jan  7 10:20:51 hayes-02 kernel: md/raid:mdX: device dm-6 operational as raid disk 1
Jan  7 10:20:51 hayes-02 kernel: md/raid:mdX: raid level 5 active with 2 out of 2 devices, algorithm 2
Jan  7 10:20:51 hayes-02 lvm[29105]: No longer monitoring RAID device vg-pool_tdata for events.
Jan  7 10:20:51 hayes-02 dmeventd[29105]: No longer monitoring thin pool vg-pool.
Jan  7 10:20:59 hayes-02 systemd-logind[1532]: New session 3 of user root.
Jan  7 10:20:59 hayes-02 systemd[1]: Started Session 3 of user root.
Jan  7 10:23:25 hayes-02 kernel: INFO: task lvconvert:29276 blocked for more than 120 seconds.
Jan  7 10:23:25 hayes-02 kernel:      Not tainted 4.18.0-167.el8.x86_64 #1
Jan  7 10:23:25 hayes-02 kernel: "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
Jan  7 10:23:25 hayes-02 kernel: lvconvert       D    0 29276   2349 0x00000080
Jan  7 10:23:25 hayes-02 kernel: Call Trace:
Jan  7 10:23:25 hayes-02 kernel: ? __schedule+0x24f/0x650
Jan  7 10:23:25 hayes-02 kernel: ? try_to_wake_up+0x54/0x570
Jan  7 10:23:25 hayes-02 kernel: ? path_lookupat.isra.50+0xa7/0x200
Jan  7 10:23:25 hayes-02 kernel: schedule+0x2f/0xa0
Jan  7 10:23:25 hayes-02 kernel: schedule_timeout+0x20d/0x310
Jan  7 10:23:25 hayes-02 kernel: ? dm_make_request+0x74/0x90 [dm_mod]
Jan  7 10:23:25 hayes-02 kernel: io_schedule_timeout+0x19/0x40
Jan  7 10:23:25 hayes-02 kernel: wait_for_completion_io+0x11f/0x190
Jan  7 10:23:25 hayes-02 kernel: ? wake_up_q+0x70/0x70
Jan  7 10:23:25 hayes-02 kernel: submit_bio_wait+0x5b/0x80
Jan  7 10:23:25 hayes-02 kernel: sync_page_io+0x127/0x160
Jan  7 10:23:25 hayes-02 kernel: read_disk_sb+0x39/0x50 [dm_raid]
Jan  7 10:23:25 hayes-02 kernel: raid_ctr+0x6aa/0x1838 [dm_raid]
Jan  7 10:23:25 hayes-02 kernel: ? dm_split_args+0x63/0x160 [dm_mod]
Jan  7 10:23:25 hayes-02 kernel: dm_table_add_target+0x17d/0x360 [dm_mod]
Jan  7 10:23:25 hayes-02 kernel: table_load+0x122/0x2e0 [dm_mod]
Jan  7 10:23:25 hayes-02 kernel: ? dev_status+0x40/0x40 [dm_mod]
Jan  7 10:23:25 hayes-02 kernel: ctl_ioctl+0x1af/0x3f0 [dm_mod]
Jan  7 10:23:25 hayes-02 kernel: ? selinux_file_ioctl+0x90/0x200
Jan  7 10:23:25 hayes-02 kernel: dm_ctl_ioctl+0xa/0x10 [dm_mod]
Jan  7 10:23:25 hayes-02 kernel: do_vfs_ioctl+0xa4/0x630
Jan  7 10:23:25 hayes-02 kernel: ksys_ioctl+0x60/0x90
Jan  7 10:23:25 hayes-02 kernel: __x64_sys_ioctl+0x16/0x20
Jan  7 10:23:25 hayes-02 kernel: do_syscall_64+0x5b/0x1a0
Jan  7 10:23:25 hayes-02 kernel: entry_SYSCALL_64_after_hwframe+0x65/0xca
Jan  7 10:23:25 hayes-02 kernel: RIP: 0033:0x7fefb653487b
Jan  7 10:23:25 hayes-02 kernel: Code: Bad RIP value.
Jan  7 10:23:25 hayes-02 kernel: RSP: 002b:00007ffd2aeaf058 EFLAGS: 00000206 ORIG_RAX: 0000000000000010
Jan  7 10:23:25 hayes-02 kernel: RAX: ffffffffffffffda RBX: 000055c6a285acb0 RCX: 00007fefb653487b
Jan  7 10:23:25 hayes-02 kernel: RDX: 000055c6a49a1b10 RSI: 00000000c138fd09 RDI: 0000000000000003
Jan  7 10:23:25 hayes-02 kernel: RBP: 000055c6a2908917 R08: 0000000000000000 R09: 00007ffd2aeaeec0
Jan  7 10:23:25 hayes-02 kernel: R10: 000055c6a2974f46 R11: 0000000000000206 R12: 0000000000000000
Jan  7 10:23:25 hayes-02 kernel: R13: 000055c6a49a1b40 R14: 000055c6a49a1b10 R15: 000055c6a5296090
Jan  7 10:25:28 hayes-02 kernel: INFO: task lvconvert:29276 blocked for more than 120 seconds.
Jan  7 10:25:28 hayes-02 kernel:      Not tainted 4.18.0-167.el8.x86_64 #1
Jan  7 10:25:28 hayes-02 kernel: "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
Jan  7 10:25:28 hayes-02 kernel: lvconvert       D    0 29276   2349 0x00000080
Jan  7 10:25:28 hayes-02 kernel: Call Trace:
Jan  7 10:25:28 hayes-02 kernel: ? __schedule+0x24f/0x650
Jan  7 10:25:28 hayes-02 kernel: ? try_to_wake_up+0x54/0x570
Jan  7 10:25:28 hayes-02 kernel: ? path_lookupat.isra.50+0xa7/0x200
Jan  7 10:25:28 hayes-02 kernel: schedule+0x2f/0xa0
Jan  7 10:25:28 hayes-02 kernel: schedule_timeout+0x20d/0x310
Jan  7 10:25:28 hayes-02 kernel: ? dm_make_request+0x74/0x90 [dm_mod]
Jan  7 10:25:28 hayes-02 kernel: io_schedule_timeout+0x19/0x40
Jan  7 10:25:28 hayes-02 kernel: wait_for_completion_io+0x11f/0x190
Jan  7 10:25:28 hayes-02 kernel: ? wake_up_q+0x70/0x70
Jan  7 10:25:28 hayes-02 kernel: submit_bio_wait+0x5b/0x80
Jan  7 10:25:28 hayes-02 kernel: sync_page_io+0x127/0x160
Jan  7 10:25:28 hayes-02 kernel: read_disk_sb+0x39/0x50 [dm_raid]
Jan  7 10:25:28 hayes-02 kernel: raid_ctr+0x6aa/0x1838 [dm_raid]
Jan  7 10:25:28 hayes-02 kernel: ? dm_split_args+0x63/0x160 [dm_mod]
Jan  7 10:25:28 hayes-02 kernel: dm_table_add_target+0x17d/0x360 [dm_mod]
Jan  7 10:25:28 hayes-02 kernel: table_load+0x122/0x2e0 [dm_mod]
Jan  7 10:25:28 hayes-02 kernel: ? dev_status+0x40/0x40 [dm_mod]
Jan  7 10:25:28 hayes-02 kernel: ctl_ioctl+0x1af/0x3f0 [dm_mod]
Jan  7 10:25:28 hayes-02 kernel: ? selinux_file_ioctl+0x90/0x200
Jan  7 10:25:28 hayes-02 kernel: dm_ctl_ioctl+0xa/0x10 [dm_mod]
Jan  7 10:25:28 hayes-02 kernel: do_vfs_ioctl+0xa4/0x630
Jan  7 10:25:28 hayes-02 kernel: ksys_ioctl+0x60/0x90
Jan  7 10:25:28 hayes-02 kernel: __x64_sys_ioctl+0x16/0x20
Jan  7 10:25:28 hayes-02 kernel: do_syscall_64+0x5b/0x1a0

Comment 2 Zdenek Kabelac 2020-01-07 16:47:32 UTC
Yep, it's the original BZ 1782045 which led to this report, but to hit the oops the raid patch needs to be applied to the RAID target;
otherwise we end up with suspended devices blocking progress.

Comment 3 Mike Snitzer 2020-01-07 19:31:29 UTC
Seems like pool_ctr() completes but the table load goes on to fail due to size issues:
  device-mapper: table: 253:2: dm-1 too small for target: start=0, len=67584, dev_size=61440

Then __commit_transaction() is called via postsuspend.  But somehow pmd->pre_commit_context is NULL.

Not sure of the exact sequence of events, so I'll just reproduce it and hopefully chase it down.

Comment 4 Mike Snitzer 2020-01-07 21:22:42 UTC
Just showing my work; please scroll to the bottom of this comment to see the unexpected success (no crash).

[root@thegoat ~]# vgcreate vg /dev/nvme0n1p[1-4]
File descriptor 6 (/dev/pts/2) leaked on vgcreate invocation. Parent PID 2041: /bin/bash
  Physical volume "/dev/nvme0n1p1" successfully created.
  Physical volume "/dev/nvme0n1p2" successfully created.
  Physical volume "/dev/nvme0n1p3" successfully created.
  Physical volume "/dev/nvme0n1p4" successfully created.
  Volume group "vg" successfully created
[root@thegoat ~]# lvcreate -T -L10 vg/pool
File descriptor 6 (/dev/pts/2) leaked on lvcreate invocation. Parent PID 2041: /bin/bash
  Rounding up size to full physical extent 12.00 MiB
  Thin pool volume with chunk size 64.00 KiB can address at most 15.81 TiB of data.
  Logical volume "pool" created.
[root@thegoat ~]# lvs -a -o +devices
File descriptor 6 (/dev/pts/2) leaked on lvs invocation. Parent PID 2041: /bin/bash
  LV              VG Attr       LSize  Pool Origin Data%  Meta%  Move Log Cpy%Sync Convert Devices
  [lvol0_pmspare] vg ewi-------  4.00m                                                     /dev/nvme0n1p1(0)
  pool            vg twi-a-tz-- 12.00m             0.00   10.84                            pool_tdata(0)
  [pool_tdata]    vg Twi-ao---- 12.00m                                                     /dev/nvme0n1p1(1)
  [pool_tmeta]    vg ewi-ao----  4.00m                                                     /dev/nvme0n1p4(0)
[root@thegoat ~]#
[root@thegoat ~]#
[root@thegoat ~]# lvconvert -y --ty raid5 --stripes 3 vg/pool_tdata
File descriptor 6 (/dev/pts/2) leaked on lvconvert invocation. Parent PID 2041: /bin/bash
  Using default stripesize 64.00 KiB.
  Replaced LV type raid5 (same as raid5_ls) with possible type raid1.
  Repeat this command to convert to raid5 after an interim conversion has finished.
  --stripes not allowed for LV vg/pool_tdata when converting from linear to raid1.
  Logical volume vg/pool_tdata successfully converted.
[root@thegoat ~]# lvconvert -y --ty raid5 --stripes 3 vg/pool_tdata
File descriptor 6 (/dev/pts/2) leaked on lvconvert invocation. Parent PID 2041: /bin/bash
  Using default stripesize 64.00 KiB.
  --stripes not allowed for LV vg/pool_tdata when converting from raid1 to raid5.
  Logical volume vg/pool_tdata successfully converted.
[root@thegoat ~]# lvconvert -y --ty raid5 --stripes 3 vg/pool_tdata
File descriptor 6 (/dev/pts/2) leaked on lvconvert invocation. Parent PID 2041: /bin/bash
  Using default stripesize 64.00 KiB.
  WARNING: Adding stripes to active and open logical volume vg/pool_tdata will grow it from 3 to 9 extents!
  Run "lvresize -l3 vg/pool_tdata" to shrink it or use the additional capacity.
  device-mapper: resume ioctl on  (253:2) failed: Invalid argument
  Unable to resume vg-pool (253:2).
  Problem reactivating logical volume vg/pool.
  Reshape request failed on LV vg/pool_tdata.
  Releasing activation in critical section.
  libdevmapper exiting with 1 device(s) still suspended.

[root@thegoat ~]# dmsetup info vg-pool
Name:              vg-pool
State:             SUSPENDED
Read Ahead:        8192
Tables present:    LIVE
Open count:        0
Event number:      0
Major, minor:      253, 2
Number of targets: 1
UUID: LVM-TiaRkyhhNy4nb38EvoL7JOGQBcC0RWic7VPIDfk5SrqzvQALskYsXdtquWP9nkig-tpool

[root@thegoat ~]# dmsetup table
vg-pool_tdata_rimage_2: 0 32768 linear 259:8 10240
vg-pool_tdata_rmeta_0: 0 8192 linear 259:6 34816
vg-pool_tdata_rimage_1: 0 32768 linear 259:7 10240
vg-pool: 0 24576 thin-pool 253:0 253:1 128 0 0
vg-pool_tdata_rimage_0: 0 24576 linear 259:6 10240
vg-pool_tdata_rimage_0: 24576 8192 linear 259:6 43008
vg-pool_tdata: 0 73728 raid raid5_ls 7 128 rebuild 2 rebuild 3 region_size 4096 4 253:3 253:4 253:5 253:6 253:7 253:8 253:9 253:10
vg-pool_tmeta: 0 8192 linear 259:9 2048
vg-pool_tdata_rmeta_3: 0 8192 linear 259:9 10240
vg-pool_tdata_rmeta_2: 0 8192 linear 259:8 2048
vg-pool_tdata_rimage_3: 0 32768 linear 259:9 18432
vg-pool_tdata_rmeta_1: 0 8192 linear 259:7 2048

[root@thegoat ~]# lsblk
NAME                       MAJ:MIN RM   SIZE RO TYPE MOUNTPOINT
nvme0n1                    259:1    0 260.9G  0 disk
├─nvme0n1p3                259:8    0    60G  0 part
│ ├─vg-pool_tdata_rimage_2 253:8    0    16M  0 lvm
│ │ └─vg-pool_tdata        253:1    0    36M  0 lvm
│ │   └─vg-pool            253:2    0    12M  0 lvm
│ └─vg-pool_tdata_rmeta_2  253:7    0     4M  0 lvm
│   └─vg-pool_tdata        253:1    0    36M  0 lvm
│     └─vg-pool            253:2    0    12M  0 lvm
├─nvme0n1p1                259:6    0    60G  0 part
│ ├─vg-pool_tdata_rimage_0 253:4    0    16M  0 lvm
│ │ └─vg-pool_tdata        253:1    0    36M  0 lvm
│ │   └─vg-pool            253:2    0    12M  0 lvm
│ └─vg-pool_tdata_rmeta_0  253:3    0     4M  0 lvm
│   └─vg-pool_tdata        253:1    0    36M  0 lvm
│     └─vg-pool            253:2    0    12M  0 lvm
├─nvme0n1p4                259:9    0    60G  0 part
│ ├─vg-pool_tdata_rimage_3 253:10   0    16M  0 lvm
│ │ └─vg-pool_tdata        253:1    0    36M  0 lvm
│ │   └─vg-pool            253:2    0    12M  0 lvm
│ ├─vg-pool_tmeta          253:0    0     4M  0 lvm
│ │ └─vg-pool              253:2    0    12M  0 lvm
│ └─vg-pool_tdata_rmeta_3  253:9    0     4M  0 lvm
│   └─vg-pool_tdata        253:1    0    36M  0 lvm
│     └─vg-pool            253:2    0    12M  0 lvm
└─nvme0n1p2                259:7    0    60G  0 part
  ├─vg-pool_tdata_rimage_1 253:6    0    16M  0 lvm
  │ └─vg-pool_tdata        253:1    0    36M  0 lvm
  │   └─vg-pool            253:2    0    12M  0 lvm
  └─vg-pool_tdata_rmeta_1  253:5    0     4M  0 lvm
    └─vg-pool_tdata        253:1    0    36M  0 lvm
      └─vg-pool            253:2    0    12M  0 lvm

[root@thegoat ~]# dmsetup info vg-pool_tdata
Name:              vg-pool_tdata
State:             ACTIVE
Read Ahead:        1024
Tables present:    LIVE
Open count:        1
Event number:      1
Major, minor:      253, 1
Number of targets: 1
UUID: LVM-TiaRkyhhNy4nb38EvoL7JOGQBcC0RWic2L9qCKncfl51biu9tyE515aGZCBLvMBK-tdata

[root@thegoat ~]# dmsetup resume vg-pool

console shows:
[ 1840.500421] md/raid:mdX: device dm-4 operational as raid disk 0
[ 1840.506351] md/raid:mdX: device dm-6 operational as raid disk 1
[ 1840.512551] md/raid:mdX: raid level 5 active with 2 out of 2 devices, algorithm 2
[ 1840.525984] device-mapper: raid: raid456 discard support disabled due to discard_zeroes_data uncertainty.
[ 1840.535561] device-mapper: raid: Set dm-raid.devices_handle_discard_safely=Y to override.
[ 1853.716383] device-mapper: raid: Device 2 specified for rebuild; clearing superblock
[ 1853.724128] device-mapper: raid: Device 3 specified for rebuild; clearing superblock
[ 1853.732717] md/raid:mdX: device dm-4 operational as raid disk 0
[ 1853.738642] md/raid:mdX: device dm-6 operational as raid disk 1
[ 1853.744769] md/raid:mdX: raid level 5 active with 2 out of 2 devices, algorithm 2
[ 1853.759966] device-mapper: raid: raid456 discard support disabled due to discard_zeroes_data uncertainty.
[ 1853.769530] device-mapper: raid: Set dm-raid.devices_handle_discard_safely=Y to override.
[ 1853.847610] device-mapper: thin: Data device (dm-1) discard unsupported: Disabling discard passdown.
[ 1853.856738] device-mapper: table: 253:2: dm-1 too small for target: start=0, len=98304, dev_size=73728

# dmsetup remove vg-pool

but no crash... nothing on the console... and vg-pool is gone:

# dmsetup info vg-pool
Device does not exist.
Command failed.

But I'm testing the latest dm-5.6 kernel (5.5-rc5 based, which also includes Heinz's fix) with RHEL 7.8 beta1's lvm2:

# rpm -q lvm2 device-mapper
lvm2-2.02.186-2.el7.x86_64
device-mapper-1.02.164-2.el7.x86_64

Comment 5 Zdenek Kabelac 2020-01-08 09:46:56 UTC
For reproduction, this patch for the raid target is mandatory:

https://www.redhat.com/archives/dm-devel/2019-December/msg00065.html

Then, after the third lvconvert call, lvm should finish the command with just a single device left suspended (vg-pool).
Once this device is resumed, removing the devices will provoke the OOPS.

Note: it's pretty clear at this point that the table manipulation during reshaping is wrong, as we are changing the size of the reshaped device up and DOWN, and this can be going on while the device is stacked. So I assume this OOPS can't be triggered during 'standard' operation; it's likely some corner case of error handling.

Comment 6 Zdenek Kabelac 2020-01-08 09:48:12 UTC
Created attachment 1650621 [details]
Kernel config file for 5.5-rc5+

Comment 7 Mike Snitzer 2020-01-08 14:24:34 UTC
(In reply to Zdenek Kabelac from comment #5)
> For reproduction, this patch for the raid target is mandatory:
> 
> https://www.redhat.com/archives/dm-devel/2019-December/msg00065.html
> 
> Then, after the third lvconvert call, lvm should finish the command with
> just a single device left suspended (vg-pool).
> Once this device is resumed, removing the devices will provoke the OOPS.

I did that; it doesn't OOPS for me.

> Note: it's pretty clear at this point that the table manipulation during
> reshaping is wrong, as we are changing the size of the reshaped device up
> and DOWN, and this can be going on while the device is stacked. So I
> assume this OOPS can't be triggered during 'standard' operation; it's
> likely some corner case of error handling.

In the end you're saying that resuming the suspended device (which still has the old table) is the trigger for the OOPS.  You clearly hit an OOPS; I just don't see why and cannot reproduce it.  But I'll analyze more and think harder ;)

Comment 8 Mike Snitzer 2020-01-08 14:26:28 UTC
(In reply to Mike Snitzer from comment #7)

> In the end you're saying that resuming the suspended device (which still
> has the old table) is the trigger for the OOPS.

Sorry, the device remove after suspend is triggering your OOPS.  Regardless, my point still stands: not yet seeing _why_.

Comment 9 Mike Snitzer 2020-01-08 14:27:13 UTC
(In reply to Mike Snitzer from comment #8)
> (In reply to Mike Snitzer from comment #7)
> 
> > In the end you're saying that resuming the suspended device (which still
> > has the old table) is the trigger for the OOPS.
> 
> Sorry, the device remove after suspend is triggering your OOPS.  Regardless,
> my point still stands: not yet seeing _why_.

Ha, remove after _resume_

Comment 11 Mike Snitzer 2020-01-10 16:54:05 UTC
Been trying _hard_ to reproduce; still cannot. Resorting to running Zdenek's minimalist 2GB Debian-based KVM image, and I have hit all sorts of Red Hat vs. Debian KVM usage quirks.

Bottom line: Zdenek is doing his testing on the least "enterprise" testbed that exists (even using loopback devices for his storage).

Inclined to close as WORKSFORME.

Comment 12 Zdenek Kabelac 2020-01-13 10:22:41 UTC
Reopening with a log where the test is executed purely on top of /dev/sda (scsi_debug), with 6 DM devices made on top of this target.
(The same happens when scsi_debug is replaced with a brd device, aka /dev/ram0.)

Providing the full log of the lvm2 test suite element; hopefully this will help. I'm thinking about which of the _DEBUG_ symbols I should enable for kernel compilation to see this more easily.

A few more interesting notes might also help here: the OOPS does not occur when the lvconvert 'breaks' pool_tmeta.
And for this test, _tdata was likely not used *AT ALL* (since no thin LV was actually created).

When I create one thin LV on this badly reshaped and resized thin-pool, the thin-pool destruction proceeds and finishes normally.

So it seems the OOPS occurs only in the unused thin-pool case (on my setup), and it happens even without SMP.

Comment 13 Zdenek Kabelac 2020-01-13 10:24:43 UTC
Created attachment 1651806 [details]
Log of test execution using scsi_debug backend device

Comment 14 Zdenek Kabelac 2020-01-13 12:07:17 UTC
I've looked more closely at the ordering; one extra call to 'dmsetup status' is needed.
(I have this embedded into my debugging version of lvm2, so it didn't initially strike me as an important difference.)

dmsetup status  vg-pool
dmsetup resume  vg-pool     <---  hits OOPS

Comment 15 Mikuláš Patočka 2020-01-13 14:04:30 UTC
This seems to be caused by the commit 694cfe7f31db36912725e63a38a5179c8628a496

Comment 16 Zdenek Kabelac 2020-01-13 14:54:36 UTC
I can confirm that replacing the 'metadata_pre_commit_callback()' internals with just 'return 0'
fixes the problem for me.

So there is something 'weird' with the _tdata flushing logic (when _tdata is on a badly working device).

Comment 17 Mike Snitzer 2020-01-13 15:14:31 UTC
(In reply to Mikuláš Patočka from comment #15)
> This seems to be caused by the commit
> 694cfe7f31db36912725e63a38a5179c8628a496

This isn't news; I'm well aware that commit is the source of this issue.

The bdev is getting freed after the thin-pool table load fails (during the last, failing lvconvert).
After resuming the thin-pool, running 'dmsetup vg-pool' triggers the OOPS during the commit from the postsuspend hook, when the new pre-commit hook tries to access the thin-pool's data device. That device _should_ still exist, because the resumed vg-pool still has a holder on it via that thin-pool table's dm_get_device().

Yet the oops is occurring...

Comment 18 Mike Snitzer 2020-01-13 15:24:00 UTC
(In reply to Mike Snitzer from comment #17)
> (In reply to Mikuláš Patočka from comment #15)
> > This seems to be caused by the commit
> > 694cfe7f31db36912725e63a38a5179c8628a496
> 
> This isn't news, well aware of that commit being the source of this issue.
> 
> The bdev is getting freed after the thin-pool table load fails (during
> the last, failing lvconvert).
> After resuming the thin-pool, running 'dmsetup vg-pool' triggers the OOPS
> during the commit from the postsuspend hook, when the new pre-commit hook
> tries to access the thin-pool's data device. That device _should_ still
> exist, because the resumed vg-pool still has a holder on it via that
> thin-pool table's dm_get_device().
> 
> Yet the oops is occurring...

Mikulas says the oops is due to a stale 'struct pool_c' pointer stored in the in-core pool metadata.  I'll buy that; I just need to track down why that is occurring.
Reopening this bug was appropriate.

Comment 19 Mike Snitzer 2020-01-14 14:49:14 UTC
I can easily reproduce using the procedure detailed in comment #0 with these additional Kconfig options enabled (thanks to Mikulas):

# diff -Naur .config.5.5.0-rc5.snitm+_without_slub_debgging_on_and_enhanced .config
--- .config.5.5.0-rc5.snitm+_without_slub_debgging_on_and_enhanced      2020-01-13 10:41:17.903198330 -0500
+++ .config     2020-01-13 10:41:27.389207112 -0500
@@ -5198,11 +5198,20 @@
 # CONFIG_PAGE_EXTENSION is not set
 # CONFIG_DEBUG_PAGEALLOC is not set
 # CONFIG_PAGE_OWNER is not set
-# CONFIG_PAGE_POISONING is not set
+CONFIG_PAGE_POISONING=y
+CONFIG_PAGE_POISONING_NO_SANITY=y
+# CONFIG_PAGE_POISONING_ZERO is not set
 # CONFIG_DEBUG_PAGE_REF is not set
 CONFIG_DEBUG_RODATA_TEST=y
-# CONFIG_DEBUG_OBJECTS is not set
-# CONFIG_SLUB_DEBUG_ON is not set
+CONFIG_DEBUG_OBJECTS=y
+# CONFIG_DEBUG_OBJECTS_SELFTEST is not set
+CONFIG_DEBUG_OBJECTS_FREE=y
+CONFIG_DEBUG_OBJECTS_TIMERS=y
+# CONFIG_DEBUG_OBJECTS_WORK is not set
+# CONFIG_DEBUG_OBJECTS_RCU_HEAD is not set
+# CONFIG_DEBUG_OBJECTS_PERCPU_COUNTER is not set
+CONFIG_DEBUG_OBJECTS_ENABLE_DEFAULT=1
+CONFIG_SLUB_DEBUG_ON=y
 # CONFIG_SLUB_STATS is not set
 CONFIG_HAVE_DEBUG_KMEMLEAK=y
 # CONFIG_DEBUG_KMEMLEAK is not set
@@ -5273,6 +5282,7 @@
 CONFIG_STACKTRACE=y
 # CONFIG_WARN_ALL_UNSEEDED_RANDOM is not set
 # CONFIG_DEBUG_KOBJECT is not set
+# CONFIG_DEBUG_KOBJECT_RELEASE is not set

 #
 # Debug kernel data structures

Comment 20 Mike Snitzer 2020-01-14 17:48:51 UTC
These 4 commits have been staged for upstream Linux 5.6 inclusion:

0031fd86398f dm thin: change data device's flush_bio to be member of struct pool
fe64369163c5 dm thin: don't allow changing data device during thin-pool load
08285c05b4ce dm thin: fix use-after-free in metadata_pre_commit_callback
6d773d89a444 dm thin metadata: use pool locking at end of dm_pool_metadata_close

They'll get sent to Linus when the 5.6 merge window opens (likely Jan 26).

Comment 21 Mike Snitzer 2020-01-15 02:13:30 UTC
I had to rebase (to tweak some commit headers) so here are the updated 5.6 commits:

44d8ebf43639 dm thin metadata: use pool locking at end of dm_pool_metadata_close
a4a8d286586d dm thin: fix use-after-free in metadata_pre_commit_callback
873937e75f9a dm thin: don't allow changing data device during thin-pool reload
f06c03d1ded2 dm thin: change data device's flush_bio to be member of struct pool

Comment 22 Mike Snitzer 2020-05-15 19:51:07 UTC
These all went upstream in Linux 5.6; closing as CURRENTRELEASE.

