Bug 1111748
Summary: | kernel BUG at drivers/md/persistent-data/dm-btree-spine.c when taking snapshot of a cached XFS | |||
---|---|---|---|---|
Product: | Red Hat Enterprise Linux 7 | Reporter: | Corey Marthaler <cmarthal> | |
Component: | lvm2 | Assignee: | Joe Thornber <thornber> | |
lvm2 sub component: | Cache Logical Volumes | QA Contact: | Cluster QE <mspqa-list> | |
Status: | CLOSED ERRATA | Docs Contact: | ||
Severity: | urgent | |||
Priority: | urgent | CC: | agk, heinzm, jbrassow, msnitzer, prajnoha, prockai, thornber, zkabelac | |
Version: | 7.0 | Keywords: | Triaged | |
Target Milestone: | rc | |||
Target Release: | --- | |||
Hardware: | x86_64 | |||
OS: | Linux | |||
Whiteboard: | ||||
Fixed In Version: | lvm2-2.02.112-1.el7 | Doc Type: | Bug Fix | |
Doc Text: | Story Points: | --- | ||
Clone Of: | ||||
: | 1113759 (view as bug list) | Environment: | ||
Last Closed: | 2015-03-05 13:09:17 UTC | Type: | Bug | |
Regression: | --- | Mount Type: | --- | |
Documentation: | --- | CRM: | ||
Verified Versions: | Category: | --- | ||
oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: | ||
Cloudforms Team: | --- | Target Upstream Version: | ||
Embargoed: | ||||
Bug Depends On: | ||||
Bug Blocks: | 1113520, 1119326 |
Description
Corey Marthaler
2014-06-20 22:43:15 UTC
I've added a test to the device-mapper-test-suite that recreates this scenario driving device-mapper directly (cutting LVM out of the loop): https://github.com/jthornber/device-mapper-test-suite/blob/master/lib/dmtest/tests/stacking/stacking_tests.rb With my kernel tree this passes fine. The next step is to test with a RHEL7 kernel. Test passes on a RHEL7 kernel. Handing bug back to LVM team for their take. Taking back since seems to be stalled. I could not reproduce this with current LVM and a RHEL7 kernel (I have used smaller devices and a smaller filesystem, but that does not seem critical for reproducing the bug). The exact wording from the original bug is no longer allowed, you need to take a snapshot of the _corig volume explicitly (this might be related to why the bug no longer happens). Presumed fixed. Hit this on the latest rpms (w/o snaps being involved) by doing many iterations of volume configuration/changing operations. I was unable to reproduce this more than once so this may be rare now. I'll work on getting a new reproducer since the original one in comment #0 no longer applies. 3.10.0-184.el7.x86_64 lvm2-2.02.112-1.el7 BUILT: Tue Nov 11 09:39:35 CST 2014 lvm2-libs-2.02.112-1.el7 BUILT: Tue Nov 11 09:39:35 CST 2014 lvm2-cluster-2.02.112-1.el7 BUILT: Tue Nov 11 09:39:35 CST 2014 device-mapper-1.02.91-1.el7 BUILT: Tue Nov 11 09:39:35 CST 2014 device-mapper-libs-1.02.91-1.el7 BUILT: Tue Nov 11 09:39:35 CST 2014 device-mapper-event-1.02.91-1.el7 BUILT: Tue Nov 11 09:39:35 CST 2014 device-mapper-event-libs-1.02.91-1.el7 BUILT: Tue Nov 11 09:39:35 CST 2014 device-mapper-persistent-data-0.3.2-1.el7 BUILT: Thu Apr 3 09:58:51 CDT 2014 cmirror-2.02.112-1.el7 BUILT: Tue Nov 11 09:39:35 CST 2014 [...] ==================================================================== Iter 93 started at Wed Nov 12 21:46:41 CST 2014 ==================================================================== VOLUME CREATION: cache_1_1276, convert= snap= 8 disk(s) to be used: host-115.virt.lab.msp.redhat.com=/dev/sdh /dev/sdg /dev/sdf /dev/sdd /dev/sda /dev/sde /dev/sdc /dev/sdb removing PV /dev/sdf1 on host-115.virt.lab.msp.redhat.com removing PV /dev/sda2 on host-115.virt.lab.msp.redhat.com removing PV /dev/sdg1 on host-115.virt.lab.msp.redhat.com removing PV /dev/sda1 on host-115.virt.lab.msp.redhat.com removing PV /dev/sdg2 on host-115.virt.lab.msp.redhat.com removing PV /dev/sdd2 on host-115.virt.lab.msp.redhat.com removing PV /dev/sdd1 on host-115.virt.lab.msp.redhat.com removing PV /dev/sdf2 on host-115.virt.lab.msp.redhat.com creating lvm devices... Create 2 PV(s) for cache_1_1276 on host-115.virt.lab.msp.redhat.com Create VG cache_1_1276 on host-115.virt.lab.msp.redhat.com make needed LVs on host-115.virt.lab.msp.redhat.com Create origin (slow) volume lvcreate -L 4G -n cache_1_12760 cache_1_1276 /dev/sdb1 Create cache (fast) volume lvcreate -L 1G -n cache_1_12760_fast cache_1_1276 /dev/sdc1 Create cache metadata (fast) volume lvcreate -L 8M -n cache_1_12760_fast_meta cache_1_1276 /dev/sdc1 Create cache pool volume by combining the cache data and cache metadata (fast) volumes lvconvert --type cache-pool --poolmetadata cache_1_1276/cache_1_12760_fast_meta cache_1_1276/cache_1_12760_fast --yes WARNING: Converting logical volume cache_1_1276/cache_1_12760_fast and cache_1_1276/cache_1_12760_fast_meta to pool's data and metadata volumes. THIS WILL DESTROY CONTENT OF LOGICAL VOLUME (filesystem etc.) Create cached volume by combining the cache pool (fast) and origin (slow) volumes lvconvert --type cache --cachepool cache_1_1276/cache_1_12760_fast cache_1_1276/cache_1_12760 --yes [...] --- ALTERATION ITERATION 1 --------------------------- Skipping reduce (Not supported on thin pool, raid, or cache volumes) Skipping extend (Not supported on thin pool, raid, or cache volumes) VOLUME RENAME from /dev/cache_1_1276/cache_1_12760 to rename_388 on host-115.virt.lab.msp.redhat.com VOLUME RENAME back to /dev/cache_1_1276/cache_1_12760 from rename_388 on host-115.virt.lab.msp.redhat.com deactivating VG cache_1_1276 on host-115.virt.lab.msp.redhat.com EXPORTING VOLUME GROUP on host-115.virt.lab.msp.redhat.com attempting to activate on host-115.virt.lab.msp.redhat.com Volume group "cache_1_1276" is exported IMPORTING VOLUME GROUP on host-115.virt.lab.msp.redhat.com activating VG cache_1_1276 on host-115.virt.lab.msp.redhat.com CHANGING LV WRITE ACCESS TO RO on host-115.virt.lab.msp.redhat.com verifying ro access flag on on host-115.virt.lab.msp.redhat.com activating VG cache_1_1276 on host-115.virt.lab.msp.redhat.com attempting to write to lv on host-115.virt.lab.msp.redhat.com dd: writing to ‘/dev/cache_1_1276/cache_1_12760’: Operation not permitted 1+0 records in 0+0 records out 0 bytes (0 B) copied, 0.000311297 s, 0.0 kB/s CHANGING LV WRITE ACCESS BACK TO RW on host-115.virt.lab.msp.redhat.com Didn't receive heartbeat from host-115.virt.lab.msp.redhat.com for 120 seconds write access change attempt failed on host-115.virt.lab.msp.redhat.com [13640.377423] device-mapper: btree spine: node_check failed: csum 2719545050 != wanted 2719646530 [13640.378437] device-mapper: block manager: btree_node validator check failed for block 4 [13640.379362] ------------[ cut here ]------------ [13640.379886] kernel BUG at drivers/md/persistent-data/dm-btree-spine.c:178! [13640.380332] invalid opcode: 0000 [#1] SMP [13640.380332] Modules linked in: dm_cache_mq dm_cache() dm_thin_pool dm_persistent_data dm_bio_prison dm_raid raid456 raid1 raid10 async_raid6_recov async_memcpy async_pq raid6_pq async_xor xor async_tx dm_snapshot dm_bufio sd_mod crct10dif_generic crc_t10dif crct10dif_common iscsi_tcp libiscsi_tcp libiscsi scsi_transport_iscsi iptable_filter ip_tables ppdev serio_raw pcspkr virtio_balloon parport_pc parport i2c_piix4 i2c_core dm_multipath xfs libcrc32c ata_generic pata_acpi virtio_net virtio_blk ata_piix libata virtio_pci virtio_ring virtio floppy dm_mirror dm_region_hash dm_log dm_mod [13640.380332] CPU: 0 PID: 4306 Comm: lvchange Tainted: G -------------- T 3.10.0-184.el7.x86_64 #1 [13640.380332] Hardware name: Red Hat KVM, BIOS 0.5.1 01/01/2007 [13640.380332] task: ffff88003a09e660 ti: ffff880000154000 task.ti: ffff880000154000 [13640.380332] RIP: 0010:[<ffffffffa02f4721>] [<ffffffffa02f4721>] ro_node+0x21/0x30 [dm_persistent_data] [13640.380332] RSP: 0018:ffff880000157b20 EFLAGS: 00010246 [13640.380332] RAX: 0000000000000000 RBX: ffff880000157b80 RCX: 0000000000000000 [13640.380332] RDX: 0000000000000000 RSI: 0000000000000246 RDI: ffff880000157b80 [13640.380332] RBP: ffff880000157b70 R08: 0000000000000000 R09: 0000000000007314 [13640.380332] R10: 6567616e616d206b R11: 636f6c62203a7265 R12: ffffffffa02ed7a0 [13640.380332] R13: ffffffffa02ed7a0 R14: ffff880000157bd0 R15: ffff88003ccb6800 [13640.380332] FS: 00007f714875d880(0000) GS:ffff88003fc00000(0000) knlGS:0000000000000000 [13640.380332] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 [13640.380332] CR2: 00007f4312d1e028 CR3: 000000003d4d7000 CR4: 00000000000006f0 [13640.380332] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000 [13640.380332] DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400 [13640.380332] Stack: [13640.380332] ffffffffa02f24e2 ffffffac00000246 ffff880000157b80 ffff880000157b70 [13640.380332] 00000000e913e5a1 0000000000000004 ffffffffa02ed7a0 ffff880000157bd0 [13640.380332] ffff88002b13dc00 ffff88003ccb6800 ffff880000157bc0 ffffffffa02f31de [13640.380332] Call Trace: [13640.380332] [<ffffffffa02f24e2>] ? walk_node+0x42/0x100 [dm_persistent_data] [13640.380332] [<ffffffffa02ed7a0>] ? block_dec+0x160/0x160 [dm_persistent_data] [13640.380332] [<ffffffffa02f31de>] dm_btree_walk+0x4e/0x80 [dm_persistent_data] [13640.380332] [<ffffffffa03159f0>] ? complete_migration+0x30/0x30 [dm_cache] [13640.380332] [<ffffffffa02ed5dc>] dm_array_walk+0x3c/0x60 [dm_persistent_data] [13640.380332] [<ffffffffa0318640>] ? blocks_are_unmapped_or_clean+0xd0/0xd0 [dm_cache] [13640.380332] [<ffffffffa031950f>] dm_cache_load_mappings+0x7f/0xe0 [dm_cache] [13640.380332] [<ffffffffa03159f0>] ? complete_migration+0x30/0x30 [dm_cache] [13640.380332] [<ffffffffa0317e69>] cache_preresume+0xf9/0x1a0 [dm_cache] [13640.380332] [<ffffffffa00064a9>] dm_table_resume_targets+0x49/0xe0 [dm_mod] [13640.380332] [<ffffffffa000392c>] dm_resume+0x4c/0xd0 [dm_mod] [13640.380332] [<ffffffffa0008ccb>] dev_suspend+0x12b/0x250 [dm_mod] [13640.380332] [<ffffffffa0008ba0>] ? table_load+0x380/0x380 [dm_mod] [13640.380332] [<ffffffffa00095e5>] ctl_ioctl+0x255/0x500 [dm_mod] [13640.380332] [<ffffffff812530d4>] ? SYSC_semtimedop+0x264/0xd10 [13640.380332] [<ffffffffa00098a3>] dm_ctl_ioctl+0x13/0x20 [dm_mod] [13640.380332] [<ffffffff811d8185>] do_vfs_ioctl+0x2e5/0x4c0 [13640.380332] [<ffffffff8126d01e>] ? file_has_perm+0xae/0xc0 [13640.380332] [<ffffffff81251b01>] ? unmerge_queues+0x1/0x70 [13640.380332] [<ffffffff811d8401>] SyS_ioctl+0xa1/0xc0 [13640.380332] [<ffffffff81611de9>] system_call_fastpath+0x16/0x1b [13640.380332] Code: ff ff 5d c3 0f 0b 0f 1f 40 00 0f 1f 44 00 00 8b 47 08 85 c0 74 15 83 e8 01 55 48 98 48 8b 7c c7 10 48 89 e5 e8 d1 99 ff ff 5d c3 <0f> 0b 66 66 66 66 2e 0f 1f 84 00 00 00 00 00 0f 1f 44 00 00 55 [13640.380332] RIP [<ffffffffa02f4721>] ro_node+0x21/0x30 [dm_persistent_data] [13640.380332] RSP <ffff880000157b20> [13640.429171] ---[ end trace f102357b28798cc2 ]--- [13640.429714] Kernel panic - not syncing: Fatal exception (In reply to Corey Marthaler from comment #6) > Hit this on the latest rpms (w/o snaps being involved) by doing many > iterations of volume configuration/changing operations. I was unable to > reproduce this more than once so this may be rare now. I'll work on getting > a new reproducer since the original one in comment #0 no longer applies. > > > 3.10.0-184.el7.x86_64 You should be testing with this brew build kernel: https://brewweb.devel.redhat.com/taskinfo?taskID=8236908 (It includes the patchset to bring the RHEL7.1 kernel to "full support" for dm-cache, including the fix for the bug#1080894 crash you shared in comment#6) Moving back to ON_QA... (In reply to Mike Snitzer from comment #7) > You should be testing with this brew build kernel: > https://brewweb.devel.redhat.com/taskinfo?taskID=8236908 > > (It includes the patchset to bring the RHEL7.1 kernel to "full support" for > dm-cache, including the fix for the bug#1080894 crash you shared in > comment#6) Seems that scratch brew build already got cleaned up, here is another one I just kicked off: http://brewweb.devel.redhat.com/brew/taskinfo?taskID=8254038 Fix verified in the latest rpms. 3.10.0-219.el7.x86_64 lvm2-2.02.114-4.el7 BUILT: Wed Jan 7 07:07:47 CST 2015 lvm2-libs-2.02.114-4.el7 BUILT: Wed Jan 7 07:07:47 CST 2015 lvm2-cluster-2.02.114-4.el7 BUILT: Wed Jan 7 07:07:47 CST 2015 device-mapper-1.02.92-4.el7 BUILT: Wed Jan 7 07:07:47 CST 2015 device-mapper-libs-1.02.92-4.el7 BUILT: Wed Jan 7 07:07:47 CST 2015 device-mapper-event-1.02.92-4.el7 BUILT: Wed Jan 7 07:07:47 CST 2015 device-mapper-event-libs-1.02.92-4.el7 BUILT: Wed Jan 7 07:07:47 CST 2015 device-mapper-persistent-data-0.4.1-2.el7 BUILT: Wed Nov 12 12:39:46 CST 2014 cmirror-2.02.114-4.el7 BUILT: Wed Jan 7 07:07:47 CST 2015 [...] ==================================================================== Iter 558 started at Fri Jan 9 14:59:52 CST 2015 ==================================================================== VOLUME CREATION: cache_3_1055, convert= snap= Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://rhn.redhat.com/errata/RHBA-2015-0513.html |