Bug 1608070 - NULL pointer dereference while deleting VDO volumes that had been used for stacked raid testing
Summary: NULL pointer dereference while deleting VDO volumes that had been used for stacked raid testing
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: Red Hat Enterprise Linux 7
Classification: Red Hat
Component: kmod-kvdo
Version: 7.6
Hardware: x86_64
OS: Linux
high
medium
Target Milestone: rc
Target Release: ---
Assignee: Matthew Sakai
QA Contact: Corey Marthaler
URL:
Whiteboard:
Depends On:
Blocks: 1619605
 
Reported: 2018-07-24 22:22 UTC by Corey Marthaler
Modified: 2021-09-03 12:06 UTC (History)
CC List: 6 users

Fixed In Version: 6.1.1.120
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
: 1619605 (view as bug list)
Environment:
Last Closed: 2018-10-30 09:40:03 UTC
Target Upstream Version:
Embargoed:


Attachments


Links
System ID Private Priority Status Summary Last Updated
Red Hat Product Errata RHBA-2018:3094 0 None None None 2018-10-30 09:40:40 UTC

Description Corey Marthaler 2018-07-24 22:22:48 UTC
Description of problem:
This happened after a bunch of lvm raid10 test cases on top of VDO had passed and the volumes were in the process of being removed.


[...]
[lvm_vdo_raid] [lvm_vdo_raid_sanity_raid10] SCENARIO (raid10) - [resync_raid_extend_attempt]
[lvm_vdo_raid] [lvm_vdo_raid_sanity_raid10] Create a nosync raid, reactivate it to cause a resync, then attempt to extend it
[lvm_vdo_raid] [lvm_vdo_raid_sanity_raid10] hayes-02: lvcreate  --nosync --type raid10 -i 2 -n resync_raid_extend -L 2G raid_sanity
[lvm_vdo_raid] [lvm_vdo_raid_sanity_raid10]   WARNING: New raid10 won't be synchronised. Don't read what you didn't write!
[lvm_vdo_raid] [lvm_vdo_raid_sanity_raid10] WARNING: xfs signature detected on /dev/raid_sanity/resync_raid_extend at offset 0. Wipe it? [y/n]: [n]
[lvm_vdo_raid] [lvm_vdo_raid_sanity_raid10]   Aborted wiping of xfs.
[lvm_vdo_raid] [lvm_vdo_raid_sanity_raid10]   1 existing signature left on the device.
[lvm_vdo_raid] [lvm_vdo_raid_sanity_raid10] Resyncing resync_raid_extend raid
[lvm_vdo_raid] [lvm_vdo_raid_sanity_raid10] lvchange --resync -y raid_sanity/resync_raid_extend
[lvm_vdo_raid] [lvm_vdo_raid_sanity_raid10] Attempt to extend the raid while it's resyncing
[lvm_vdo_raid] [lvm_vdo_raid_sanity_raid10] lvextend -L +500M raid_sanity/resync_raid_extend
[lvm_vdo_raid] [lvm_vdo_raid_sanity_raid10] 
[lvm_vdo_raid] [lvm_vdo_raid_sanity_raid10] perform raid scrubbing (lvchange --syncaction repair) on raid raid_sanity/resync_raid_extend
[lvm_vdo_raid] [lvm_vdo_raid_sanity_raid10]   raid_sanity/resync_raid_extend state is currently "resync".  Unable to switch to "repair".
[lvm_vdo_raid] [lvm_vdo_raid_sanity_raid10] Waiting until all mirror|raid volumes become fully syncd...
[lvm_vdo_raid] [lvm_vdo_raid_sanity_raid10]    0/1 mirror(s) are fully synced: ( 91.05% )
[lvm_vdo_raid] [lvm_vdo_raid_sanity_raid10]    1/1 mirror(s) are fully synced: ( 100.00% )
[lvm_vdo_raid] [lvm_vdo_raid_sanity_raid10] Sleeping 15 sec
[lvm_vdo_raid] [lvm_vdo_raid_sanity_raid10] 
[lvm_vdo_raid] [lvm_vdo_raid_sanity_raid10] Deactivating raid resync_raid_extend... and removing
[lvm_vdo_raid] [lvm_vdo_raid_sanity_raid10] 
[lvm_vdo_raid] [lvm_vdo_raid_sanity_raid10] 
[lvm_vdo_raid] [lvm_vdo_raid_sanity_raid10] SCENARIO (raid10) - [nosync_raid_resynchronization]
[lvm_vdo_raid] [lvm_vdo_raid_sanity_raid10] Create a nosync raid and resync it
[lvm_vdo_raid] [lvm_vdo_raid_sanity_raid10] Skipping this test case, only supported on raid1 mirrors
[lvm_vdo_raid] [lvm_vdo_raid_sanity_raid10] 
[lvm_vdo_raid] [lvm_vdo_raid_sanity_raid10] 
[lvm_vdo_raid] [lvm_vdo_raid_sanity_raid10] SCENARIO (raid10) - [degraded_upconversion_attempt]
[lvm_vdo_raid] [lvm_vdo_raid_sanity_raid10] Create a raid, fail one of the legs to enter a degraded state, and then attmept an upconversion
[lvm_vdo_raid] [lvm_vdo_raid_sanity_raid10] Skipping this test case, only supported on raid1 mirrors
[lvm_vdo_raid] [lvm_vdo_raid_sanity_raid10] 
[lvm_vdo_raid] [lvm_vdo_raid_sanity_raid10] 
[lvm_vdo_raid] [lvm_vdo_raid_sanity_raid10] Didn't receive heartbeat from hayes-02 for 120 seconds
[lvm_vdo_raid] [lvm_vdo_raid_sanity_raid10] unable to remove vPV15 on hayes-02
[lvm_vdo_raid] [lvm_vdo_raid_sanity_raid10] vol_cleanup failed


Jul 24 15:48:14 hayes-02 qarshd[96413]: Running cmdline: vdo remove --name vPV4
Jul 24 15:48:14 hayes-02 UDS/vdodmeventd[96418]: ERROR  (vdodmeventd/96418) The dynamic shared library libdevmapper-event-lvm2vdo.so could not be loaded: libdevmapper-event-lvm2vdo.so: cannot open shared object file: No such file or directory
Jul 24 15:48:14 hayes-02 UDS/vdodmeventd[96418]: ERROR  (vdodmeventd/96418) Failed to load the dmeventd plugin
Jul 24 15:48:14 hayes-02 multipathd: dm-3: remove map (uevent)
Jul 24 15:48:14 hayes-02 multipathd: dm-3: devmap not registered, can't remove
Jul 24 15:48:14 hayes-02 kernel: kvdo48:dmsetup: suspending device 'vPV4'
Jul 24 15:48:14 hayes-02 kernel: kvdo48:dmsetup: device 'vPV4' suspended
Jul 24 15:48:14 hayes-02 kernel: kvdo48:dmsetup: stopping device 'vPV4'
Jul 24 15:48:14 hayes-02 kernel: kvdo48:dmsetup:
Jul 24 15:48:14 hayes-02 kernel:
Jul 24 15:48:14 hayes-02 kernel: uds: kvdo48:dedupeQ:
Jul 24 15:48:14 hayes-02 kernel: index_0: beginning save (vcn 4294967295)
Jul 24 15:48:14 hayes-02 kernel:
Jul 24 15:48:14 hayes-02 kernel: Setting UDS index target state to closed
Jul 24 15:48:14 hayes-02 kernel:
Jul 24 15:48:20 hayes-02 kernel: kvdo48:dmsetup: device 'vPV4' stopped
Jul 24 15:48:20 hayes-02 multipathd: dm-3: remove map (uevent)
Jul 24 15:48:20 hayes-02 systemd: Started qarsh Per-Connection Server (10.15.80.218:39812).
Jul 24 15:48:20 hayes-02 qarshd[96427]: Talking to peer ::ffff:10.15.80.218:39812 (IPv6)
Jul 24 15:48:21 hayes-02 qarshd[96427]: Running cmdline: vdo remove --name vPV8
Jul 24 15:48:21 hayes-02 UDS/vdodmeventd[96432]: ERROR  (vdodmeventd/96432) The dynamic shared library libdevmapper-event-lvm2vdo.so could not be loaded: libdevmapper-event-lvm2vdo.so: cannot open shared object file: No such file or directory
Jul 24 15:48:21 hayes-02 UDS/vdodmeventd[96432]: ERROR  (vdodmeventd/96432) Failed to load the dmeventd plugin
Jul 24 15:48:21 hayes-02 multipathd: dm-7: remove map (uevent)
Jul 24 15:48:21 hayes-02 multipathd: dm-7: devmap not registered, can't remove
Jul 24 15:48:21 hayes-02 kernel: kvdo52:dmsetup: suspending device 'vPV8'
Jul 24 15:48:21 hayes-02 kernel: kvdo52:packerQ: compression is disabled
Jul 24 15:48:21 hayes-02 kernel: kvdo52:packerQ: compression is enabled
Jul 24 15:48:21 hayes-02 kernel: kvdo52:dmsetup: device 'vPV8' suspended
Jul 24 15:48:21 hayes-02 kernel: kvdo52:dmsetup: stopping device 'vPV8'
Jul 24 15:48:21 hayes-02 kernel: kvdo52:dmsetup:
Jul 24 15:48:21 hayes-02 kernel:
Jul 24 15:48:21 hayes-02 kernel: uds: kvdo52:dedupeQ:
Jul 24 15:48:21 hayes-02 kernel: index_0: beginning save (vcn 4294967295)
Jul 24 15:48:21 hayes-02 kernel:
Jul 24 15:48:21 hayes-02 kernel: Setting UDS index target state to closed
Jul 24 15:48:21 hayes-02 kernel:
Jul 24 15:48:28 hayes-02 kernel: kvdo52:dmsetup: device 'vPV8' stopped
Jul 24 15:48:28 hayes-02 multipathd: dm-7: remove map (uevent)
Jul 24 15:48:28 hayes-02 systemd: Started qarsh Per-Connection Server (10.15.80.218:39822).
Jul 24 15:48:28 hayes-02 qarshd[96441]: Talking to peer ::ffff:10.15.80.218:39822 (IPv6)
Jul 24 15:48:28 hayes-02 qarshd[96441]: Running cmdline: vdo remove --name vPV6

[...]

Jul 24 15:49:19 hayes-02 qarshd[96523]: Running cmdline: vdo remove --name vPV10
Jul 24 15:49:19 hayes-02 UDS/vdodmeventd[96528]: ERROR  (vdodmeventd/96528) The dynamic shared library libdevmapper-event-lvm2vdo.so could not be loaded: libdevmapper-event-lvm2vdo.so: cannot open shared object file: No such file or directory
Jul 24 15:49:19 hayes-02 UDS/vdodmeventd[96528]: ERROR  (vdodmeventd/96528) Failed to load the dmeventd plugin



[85919.163605] kvdo47:dmsetup: [85919.164236] uds: kvdo47:dedupeQ: index_0: beginning save (vcn 4294967295)

[85919.174232] Setting UDS index target state to closed
[85924.739934] kvdo47:dmsetup: device 'vPV3' stopped
[85925.274003] kvdo54:dmsetup: suspending device 'vPV10'
[85925.279692] kvdo54:packerQ: compression is disabled
[85925.319334] kvdo54:packerQ: compression is enabled
[85925.324763] kvdo54:dmsetup: device 'vPV10' suspended
[85925.330470] kvdo54:dmsetup: stopping device 'vPV10'
[85928.990779] kvdo54:dmsetup: [85928.991628] uds: kvdo54:dedupeQ: index_0: beginning save (vcn 33)

[85929.000635] Setting UDS index target state to closed
[85977.691722] kvdo54:dmsetup: device 'vPV10' stopped
[85978.211340] kvdo57:dmsetup: suspending device 'vPV13'
[85978.217075] kvdo57:packerQ: compression is disabled
[85978.222556] kvdo57:packerQ: compression is enabled
[85978.227954] kvdo57:dmsetup: device 'vPV13' suspended
[85978.233620] kvdo57:dmsetup: stopping device 'vPV13'
[85978.260477] kvdo57:dmsetup: [85978.261147] uds: kvdo57:dedupeQ: index_0: beginning save (vcn 34)

[85978.270332] Setting UDS index target state to closed
[85978.642572] kvdo57:dmsetup: device 'vPV13' stopped
[85979.155339] kvdo59:dmsetup: suspending device 'vPV15'
[85979.161057] kvdo59:packerQ: compression is disabled
[85979.166539] kvdo59:packerQ: compression is enabled
[85979.171942] kvdo59:dmsetup: device 'vPV15' suspended
[85979.177648] kvdo59:dmsetup: stopping device 'vPV15'
[85979.227410] BUG: unable to handle kernel NULL pointer dereference at 0000000000000100
[85979.236166] IP: [<ffffffffc0855ea6>] getThreadData+0x16/0x40 [kvdo]
[85979.243186] PGD 0 
[85979.245437] Thread overran stack, or stack corrupted
[85979.250975] Oops: 0000 [#1] SMP 
[85979.254592] Modules linked in: raid0 gfs2 dlm loop dm_zero fuse btrfs vfat msdos fat ext4 mbcache jbd2 dm_crypt drbg ansi_cprng crypto_null dm_cache_smq dm_cache raid10 dm_mirror dm_region_hash dm_log dm_thin_pool dm_persistent_data dm_bio_prison dm_snapshot dm_bufio kvdo(O) uds(O) sunrpc raid1 dm_raid raid456 async_raid6_recov async_memcpy async_pq raid6_pq async_xor xor async_tx sb_edac intel_powerclamp coretemp intel_rapl iosf_mbi kvm_intel kvm irqbypass crc32_pclmul ghash_clmulni_intel aesni_intel lrw gf128mul glue_helper ablk_helper cryptd mxm_wmi iTCO_wdt dcdbas iTCO_vendor_support ipmi_ssif pcspkr sg mei_me ipmi_si ipmi_devintf mei lpc_ich ipmi_msghandler acpi_power_meter wmi dm_multipath dm_mod ip_tables xfs libcrc32c sd_mod crc_t10dif crct10dif_generic sr_mod cdrom qla2xxx mgag200 i2c_algo_bit drm_kms_helper syscopyarea sysfillrect sysimgblt fb_sys_fops ttm nvme_fc(T) nvme_fabrics nvme_core crct10dif_pclmul drm ahci crct10dif_common tg3 libahci crc32c_intel scsi_transport_fc ptp libata megaraid_sas drm_panel_orientation_quirks scsi_tgt pps_core [last unloaded: scsi_debug]
[85979.362540] CPU: 10 PID: 33580 Comm: kvdo59:journalQ Kdump: loaded Tainted: G           O   ------------ T 3.10.0-925.el7.x86_64 #1
[85979.375746] Hardware name: Dell Inc. PowerEdge R830/0VVT0H, BIOS 1.8.0 05/28/2018
[85979.384095] task: ffff9722c5b6c100 ti: ffff9722d6210000 task.ti: ffff9722d6210000
[85979.392445] RIP: 0010:[<ffffffffc0855ea6>]  [<ffffffffc0855ea6>] getThreadData+0x16/0x40 [kvdo]
[85979.402170] RSP: 0018:ffff9722d6210018  EFLAGS: 00010246
[85979.408094] RAX: 0000000000000000 RBX: ffff9702c771a400 RCX: 0000000000000e94
[85979.416055] RDX: 15441b81ce80d767 RSI: ffffffffc087840a RDI: ffff9722c50abe00
[85979.424017] RBP: ffff9722d6210020 R08: 0000000000000000 R09: 0000000000000001
[85979.431979] R10: 0000000000000004 R11: 0000000000000005 R12: ffff96ff72073980
[85979.439940] R13: ffff96ff4d73ec00 R14: ffff9722d5c3a8e8 R15: ffff96ff5a978c60
[85979.447902] FS:  0000000000000000(0000) GS:ffff9702fe740000(0000) knlGS:0000000000000000
[85979.456930] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[85979.463340] CR2: 0000000000000100 CR3: 0000003abf20e000 CR4: 00000000003607e0
[85979.471301] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[85979.479263] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
[85979.487225] Call Trace:
[85979.489962]  [<ffffffffc0855ede>] isReadOnlyVDO+0xe/0x20 [kvdo]
[85979.496575]  [<ffffffffc084d785>] isReadOnly+0x15/0x20 [kvdo]
[85979.502989]  [<ffffffffc08374f9>] writeBlock+0x49/0x760 [kvdo]
[85979.509501]  [<ffffffffc0838dd0>] ? completeWrite+0x5f0/0x5f0 [kvdo]
[85979.516611]  [<ffffffffc0837c3f>] assignEntries.part.13+0x2f/0x60 [kvdo]
[85979.524092]  [<ffffffffc08385b2>] completeReaping+0x92/0x2c0 [kvdo]
[85979.531089]  [<ffffffffc0838dd0>] ? completeWrite+0x5f0/0x5f0 [kvdo]
[85979.538189]  [<ffffffffc085e85a>] invokeCallback+0x8a/0xd0 [kvdo]
[85979.544995]  [<ffffffffc085e903>] completeCompletion+0x23/0x40 [kvdo]
[85979.552190]  [<ffffffffc086cd4e>] ? isFlushRequired+0xe/0x10 [kvdo]
[85979.559189]  [<ffffffffc083d29a>] launchFlush+0x7a/0x90 [kvdo]
[85979.565703]  [<ffffffffc0838dd0>] ? completeWrite+0x5f0/0x5f0 [kvdo]
[85979.572798]  [<ffffffffc0836e86>] reapRecoveryJournal.part.8+0x116/0x170 [kvdo]
[85979.580958]  [<ffffffffc0838605>] completeReaping+0xe5/0x2c0 [kvdo]
[85979.587959]  [<ffffffffc0855eb2>] ? getThreadData+0x22/0x40 [kvdo]
[85979.594857]  [<ffffffffc0838dd0>] ? completeWrite+0x5f0/0x5f0 [kvdo]
[85979.601954]  [<ffffffffc085e85a>] invokeCallback+0x8a/0xd0 [kvdo]
[85979.608760]  [<ffffffffc085e903>] completeCompletion+0x23/0x40 [kvdo]
[85979.615953]  [<ffffffffc086cd4e>] ? isFlushRequired+0xe/0x10 [kvdo]
[85979.622951]  [<ffffffffc083d29a>] launchFlush+0x7a/0x90 [kvdo]
[85979.629463]  [<ffffffffc0838dd0>] ? completeWrite+0x5f0/0x5f0 [kvdo]
[85979.636548]  [<ffffffffc0836e86>] reapRecoveryJournal.part.8+0x116/0x170 [kvdo]
[85979.644699]  [<ffffffffc0838605>] completeReaping+0xe5/0x2c0 [kvdo]
[85979.651695]  [<ffffffffc0838dd0>] ? completeWrite+0x5f0/0x5f0 [kvdo]
[85979.658792]  [<ffffffffc085e85a>] invokeCallback+0x8a/0xd0 [kvdo]
[85979.665598]  [<ffffffffc085e903>] completeCompletion+0x23/0x40 [kvdo]
[85979.672794]  [<ffffffffc086cd4e>] ? isFlushRequired+0xe/0x10 [kvdo]
[85979.679793]  [<ffffffffc083d29a>] launchFlush+0x7a/0x90 [kvdo]
[85979.686306]  [<ffffffffc0838dd0>] ? completeWrite+0x5f0/0x5f0 [kvdo]
[85979.693389]  [<ffffffffc0836e86>] reapRecoveryJournal.part.8+0x116/0x170 [kvdo]
[85979.701550]  [<ffffffffc0838605>] completeReaping+0xe5/0x2c0 [kvdo]
[85979.708553]  [<ffffffffc0855eb2>] ? getThreadData+0x22/0x40 [kvdo]
[85979.715451]  [<ffffffffc0838dd0>] ? completeWrite+0x5f0/0x5f0 [kvdo]
[85979.722551]  [<ffffffffc085e85a>] invokeCallback+0x8a/0xd0 [kvdo]
[85979.729357]  [<ffffffffc085e903>] completeCompletion+0x23/0x40 [kvdo]
[85979.736551]  [<ffffffffc086cd4e>] ? isFlushRequired+0xe/0x10 [kvdo]
[85979.743549]  [<ffffffffc083d29a>] launchFlush+0x7a/0x90 [kvdo]
[85979.750060]  [<ffffffffc0838dd0>] ? completeWrite+0x5f0/0x5f0 [kvdo]
[85979.757153]  [<ffffffffc0836e86>] reapRecoveryJournal.part.8+0x116/0x170 [kvdo]
[85979.765313]  [<ffffffffc0838605>] completeReaping+0xe5/0x2c0 [kvdo]
[85979.772310]  [<ffffffffc0838dd0>] ? completeWrite+0x5f0/0x5f0 [kvdo]
[85979.779407]  [<ffffffffc085e85a>] invokeCallback+0x8a/0xd0 [kvdo]
[85979.786211]  [<ffffffffc085e903>] completeCompletion+0x23/0x40 [kvdo]
[85979.793404]  [<ffffffffc086cd4e>] ? isFlushRequired+0xe/0x10 [kvdo]
[85979.800401]  [<ffffffffc083d29a>] launchFlush+0x7a/0x90 [kvdo]
[85979.806905]  [[    0.000000] Initializing cgroup subsys cpuset



Version-Release number of selected component (if applicable):
3.10.0-925.el7.x86_64

lvm2-2.02.180-1.el7    BUILT: Fri Jul 20 12:21:35 CDT 2018
lvm2-libs-2.02.180-1.el7    BUILT: Fri Jul 20 12:21:35 CDT 2018
lvm2-cluster-2.02.180-1.el7    BUILT: Fri Jul 20 12:21:35 CDT 2018
lvm2-lockd-2.02.180-1.el7    BUILT: Fri Jul 20 12:21:35 CDT 2018
lvm2-python-boom-0.9-4.el7    BUILT: Fri Jul 20 12:23:30 CDT 2018
cmirror-2.02.180-1.el7    BUILT: Fri Jul 20 12:21:35 CDT 2018
device-mapper-1.02.149-1.el7    BUILT: Fri Jul 20 12:21:35 CDT 2018
device-mapper-libs-1.02.149-1.el7    BUILT: Fri Jul 20 12:21:35 CDT 2018
device-mapper-event-1.02.149-1.el7    BUILT: Fri Jul 20 12:21:35 CDT 2018
device-mapper-event-libs-1.02.149-1.el7    BUILT: Fri Jul 20 12:21:35 CDT 2018
device-mapper-persistent-data-0.7.3-3.el7    BUILT: Tue Nov 14 05:07:18 CST 2017
vdo-6.1.1.111-3.el7    BUILT: Sun Jul 15 21:10:34 CDT 2018
kmod-kvdo-6.1.1.111-1.el7    BUILT: Sun Jul 15 21:16:18 CDT 2018


How reproducible:
Once so far

Comment 2 Matthew Sakai 2018-07-24 23:24:49 UTC
I think I see the issue. First, I'm assuming that the storage under the VDO device does not accept flushes. Is this correct?

It looks like if the storage does not accept flushes, and VDO's other components release their atomic locks with the right timing, then the recovery journal reaping code can grow the stack to an arbitrary depth. This case occurred during device shutdown, which may have helped produce the problematic lock release timing, since everything is trying to release its locks at that point.

The fix is not obvious, but we are considering options. Given the reliance on timing between different threads, however, I'm not sure how easy it will be to reproduce this issue reliably enough to test whichever fix we settle on.
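
To illustrate the failure mode described above: when the underlying storage rejects flushes, the "flush" completes immediately and its callback runs on the caller's stack, so each reaping pass starts the next one without ever returning (the repeating launchFlush -> reapRecoveryJournal -> completeReaping frames in the trace). Below is a minimal user-space sketch of that recursion pattern; the struct and function names are illustrative only, not the kvdo code.

/*
 * Sketch of the unbounded recursion described in comment 2.
 * Hypothetical names; not the kvdo source. If the device cannot
 * queue a real flush, the completion callback is invoked
 * synchronously on the same stack, so each reaping pass nests
 * inside the previous one.
 */
#include <stdio.h>

struct journal {
    int pending;       /* journal entries still waiting to be reaped */
    int flush_needed;  /* does the device actually accept flushes?   */
};

static void complete_reaping(struct journal *j);

static void launch_flush(struct journal *j)
{
    if (!j->flush_needed)
        complete_reaping(j);   /* "completes" on the caller's stack */
    /* otherwise the callback would run later from an I/O thread */
}

static void complete_reaping(struct journal *j)
{
    if (j->pending-- > 0)
        launch_flush(j);       /* each pending entry adds stack frames */
}

int main(void)
{
    /* On a 16 KiB kernel stack this depth would already be fatal;  */
    /* build without optimization so the tail calls are not folded. */
    struct journal j = { .pending = 10000, .flush_needed = 0 };

    complete_reaping(&j);
    printf("done\n");
    return 0;
}

One common way to break a cycle like this is to requeue the continuation on the owning work queue instead of invoking it inline, so each reaping pass returns to the queue's dispatch loop before the next one starts; whether the actual fix takes that approach is not described in this bug.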

Comment 3 Corey Marthaler 2018-07-25 16:20:31 UTC
It appears they do not:

[...]
Jul 24 20:41:28 hayes-02 kernel: kvdo31:dmsetup: underlying device, REQ_FLUSH: not supported, REQ_FUA: not supported
Jul 24 20:41:29 hayes-02 kernel: kvdo32:dmsetup: underlying device, REQ_FLUSH: not supported, REQ_FUA: not supported
Jul 24 20:41:31 hayes-02 kernel: kvdo33:dmsetup: underlying device, REQ_FLUSH: not supported, REQ_FUA: not supported
Jul 24 20:41:33 hayes-02 kernel: kvdo34:dmsetup: underlying device, REQ_FLUSH: not supported, REQ_FUA: not supported

Comment 10 Corey Marthaler 2018-09-19 16:52:19 UTC
We're no longer seeing this issue after the stacked raid10 VDO tests are run and the volumes are cleaned up for additional testing. Marking verified in the latest rpms.

[lvm_vdo_raid] lvm_vdo_raid_sanity_raid10               PASS      

3.10.0-951.el7.x86_64
vdo-6.1.1.125-3.el7    BUILT: Sun Sep 16 21:51:14 CDT 2018
kmod-kvdo-6.1.1.125-5.el7    BUILT: Tue Sep 18 09:32:02 CDT 2018

lvm2-2.02.180-8.el7    BUILT: Mon Sep 10 04:45:22 CDT 2018
lvm2-libs-2.02.180-8.el7    BUILT: Mon Sep 10 04:45:22 CDT 2018
device-mapper-1.02.149-8.el7    BUILT: Mon Sep 10 04:45:22 CDT 2018
device-mapper-libs-1.02.149-8.el7    BUILT: Mon Sep 10 04:45:22 CDT 2018
device-mapper-event-1.02.149-8.el7    BUILT: Mon Sep 10 04:45:22 CDT 2018
device-mapper-event-libs-1.02.149-8.el7    BUILT: Mon Sep 10 04:45:22 CDT 2018
device-mapper-persistent-data-0.7.3-3.el7    BUILT: Tue Nov 14 05:07:18 CST 2017

Comment 12 errata-xmlrpc 2018-10-30 09:40:03 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2018:3094

