Note: This bug is displayed in read-only format because the product is no longer active in Red Hat Bugzilla.
RHEL Engineering is moving the tracking of its product development work on RHEL 6 through RHEL 9 to Red Hat Jira (issues.redhat.com). If you're a Red Hat customer, please continue to file support cases via the Red Hat customer portal. If you're not, please head to the "RHEL project" in Red Hat Jira and file new tickets there.

Individual Bugzilla bugs in the statuses "NEW", "ASSIGNED", and "POST" are being migrated throughout September 2023. Bugs of Red Hat partners with an assigned Engineering Partner Manager (EPM) are migrated in late September as per pre-agreed dates. Bugs against the components "kernel", "kernel-rt", and "kpatch" are only migrated if still in "NEW" or "ASSIGNED".

If you cannot log in to RH Jira, please consult article #7032570. Failing that, please send an e-mail to the RH Jira admins at rh-issues@redhat.com to troubleshoot your issue as a user management inquiry; the email creates a ServiceNow ticket with Red Hat.

Migrated Bugzilla bugs will be moved to status "CLOSED", resolution "MIGRATED", and tagged with "MigratedToJIRA" in "Keywords". The link to the successor Jira issue will be found under "Links", will have a little "two-footprint" icon next to it, and will direct you to the "RHEL project" in Red Hat Jira (issue links are of the form "https://issues.redhat.com/browse/RHEL-XXXX", where "X" is a digit). The same link will be available in a blue banner at the top of the page informing you that the bug has been migrated.

Bug 1669960

Summary: Calling vdoStatus too early can result in NULL pointer dereference
Product: Red Hat Enterprise Linux 8
Reporter: Jakub Krysl <jkrysl>
Component: kmod-kvdo
Assignee: sclafani
Status: CLOSED ERRATA
QA Contact: vdo-qe
Severity: unspecified
Docs Contact:
Priority: unspecified
Version: 8.0
CC: awalsh, bgurney, limershe, sweettea
Target Milestone: rc
Flags: pm-rhel: mirror+
Target Release: 8.0
Hardware: Unspecified
OS: Unspecified
Whiteboard:
Fixed In Version: 6.2.1.124
Doc Type: If docs needed, set a value
Doc Text:
Story Points: ---
Clone Of:
Environment:
Last Closed: 2019-11-05 22:12:27 UTC
Type: Bug
Regression: ---
Mount Type: ---
Documentation: ---
CRM:
Verified Versions:
Category: ---
oVirt Team: ---
RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: ---
Target Upstream Version:
Embargoed:
Bug Depends On: 1682560
Bug Blocks:

Description Jakub Krysl 2019-01-28 08:44:24 UTC
Description of problem:
I hit this during testing for BZ 1669124. It seems vdoStatus was called before "vdo start" had finished.

[  467.242685] BUG: unable to handle kernel NULL pointer dereference at 0000000000000000
[  467.250513] PGD 0 P4D 0
[  467.253051] Oops: 0000 [#1] SMP NOPTI
[  467.256718] CPU: 42 PID: 17728 Comm: dmeventd Kdump: loaded Tainted: G           O     --------- ---  4.18.0-60.el8.x86_64 #1
[  467.268012] Hardware name: Supermicro AS -2023US-TR4/H11DSU-iN, BIOS 1.1a 04/26/2018
[  467.275769] RIP: 0010:getKernelLayerBdev+0x10/0x20 [kvdo]
[  467.281168] Code: e8 c5 4e ff ff 0f b6 c8 e9 38 fe ff ff 0f 1f 00 66 2e 0f 1f 84 00 00 00 00 00 0f 1f 44 00 00 48 8b 87 08 01 00 00 48 8b 40 08 <48> 8b 00 c3 66 90 66 2e 0f 1f 84 00 00 00 00 00 0f 1f 44 00 00 85
[  467.299910] RSP: 0018:ffffc06307fabbc8 EFLAGS: 00010246
[  467.305135] RAX: 0000000000000000 RBX: 0000000000003ea0 RCX: 0000000001900000
[  467.312263] RDX: ffffffffc0a249bd RSI: 0000000000000002 RDI: ffff9d2ee6fe3000
[  467.319396] RBP: ffff9d2ee6fe3000 R08: ffff9d24e79a3338 R09: ffffffffc0a24772
[  467.326527] R10: 0000000000000000 R11: ffff9d24e79a19e8 R12: ffff9d24d6d60160
[  467.333657] R13: ffff9d2ee6fe3670 R14: ffffffffc0a24772 R15: ffffffffc0a24758
[  467.340787] FS:  00007f9761192700(0000) GS:ffff9d24e7980000(0000) knlGS:0000000000000000
[  467.348871] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[  467.354614] CR2: 0000000000000000 CR3: 000000021b0b0000 CR4: 00000000003406e0
[  467.361747] Call Trace:
[  467.364217]  vdoStatus+0x11d/0x170 [kvdo]
[  467.368230]  retrieve_status+0xa7/0x1f0 [dm_mod]
[  467.372856]  ? dm_get_live_or_inactive_table.isra.7+0x20/0x20 [dm_mod]
[  467.379378]  table_status+0x61/0xa0 [dm_mod]
[  467.383650]  ctl_ioctl+0x1af/0x3f0 [dm_mod]
[  467.387841]  ? selinux_file_ioctl+0xc0/0x200
[  467.392112]  dm_ctl_ioctl+0xa/0x10 [dm_mod]
[  467.396296]  do_vfs_ioctl+0xa4/0x630
[  467.399873]  ksys_ioctl+0x60/0x90
[  467.403196]  __x64_sys_ioctl+0x16/0x20
[  467.406952]  do_syscall_64+0x5b/0x1b0
[  467.410616]  entry_SYSCALL_64_after_hwframe+0x65/0xca
[  467.415667] RIP: 0033:0x7f975f68445b
[  467.419241] Code: 0f 1e fa 48 8b 05 2d aa 2c 00 64 c7 00 26 00 00 00 48 c7 c0 ff ff ff ff c3 66 0f 1f 44 00 00 f3 0f 1e fa b8 10 00 00 00 0f 05 <48> 3d 01 f0 ff ff 73 01 c3 48 8b 0d fd a9 2c 00 f7 d8 64 89 01 48
[  467.437979] RSP: 002b:00007f97611919d8 EFLAGS: 00000202 ORIG_RAX: 0000000000000010
[  467.445548] RAX: ffffffffffffffda RBX: 0000563447e38a40 RCX: 00007f975f68445b
[  467.452681] RDX: 00007f975803ca20 RSI: 00000000c138fd0c RDI: 0000000000000007
[  467.459810] RBP: 00007f975fbb6173 R08: 0000000000000004 R09: 00007f975fbb6d00
[  467.466934] R10: 000000000000001e R11: 0000000000000202 R12: 00007f975803ca20
[  467.474059] R13: 0000000000000000 R14: 00007f975803cad0 R15: 00007f975803b280
[  467.481191] Modules linked in: kvdo(O) uds(O) nfsv3 nfs_acl rpcsec_gss_krb5 auth_rpcgss nfsv4 dns_resolver nfs lockd grace fscache sunrpc vfat fat dm_multipath dm_mod amd64_edac_mod edac_mce_amd kvm_amd joydev ipmi_ssif kvm irqbypass crct10dif_pclmul crc32_pclmul sg ghash_clmulni_intel pcspkr sp5100_tco ipmi_si ccp i2c_piix4 k10temp ipmi_devintf ipmi_msghandler acpi_cpufreq xfs libcrc32c sd_mod ast drm_kms_helper syscopyarea sysfillrect sysimgblt fb_sys_fops ttm igb nvme megaraid_sas drm nvme_core crc32c_intel dca i2c_algo_bit pinctrl_amd
[  467.528827] CR2: 0000000000000000

Version-Release number of selected component (if applicable):
kernel-4.18.0-60.el8.x86_64
kmod-kvdo-6.2.0.293-43.el8.x86_64
vdo-6.2.0.293-10.el8.x86_64

How reproducible:
once

Steps to Reproduce:
1. Not reliably known; possibly calling vdoStatus on a VDO device that is not yet fully created

Actual results:
BUG: unable to handle kernel NULL pointer dereference at 0000000000000000

Expected results:
no call trace

Additional info:

Comment 3 Jakub Krysl 2019-02-21 11:55:14 UTC
I hit exactly the same call trace again on a different server with a newer kernel; this time it was not a NULL pointer dereference but a general protection fault:

[83392.935830] general protection fault: 0000 [#1] SMP PTI 
[83392.961956] CPU: 2 PID: 21434 Comm: dmsetup Kdump: loaded Tainted: G           O     --------- -  - 4.18.0-67.el8.x86_64 #1 
[83393.016659] Hardware name: HP ProLiant DL360 Gen9/ProLiant DL360 Gen9, BIOS P89 10/25/2017 
[83393.055157] RIP: 0010:getKernelLayerBdev+0x10/0x20 [kvdo] 
[83393.080193] Code: e8 75 16 00 00 0f b6 c8 e9 38 fe ff ff 0f 1f 00 66 2e 0f 1f 84 00 00 00 00 00 0f 1f 44 00 00 48 8b 87 08 01 00 00 48 8b 40 08 <48> 8b 00 c3 66 90 66 2e 0f 1f 84 00 00 00 00 00 0f 1f 44 00 00 85 
[83393.167816] RSP: 0018:ffffb40080eb3bc8 EFLAGS: 00010246 
[83393.192252] RAX: 5f74696e695f5f60 RBX: 0000000000003ea0 RCX: 0000000001900000 
[83393.225699] RDX: ffffffffc0c399ca RSI: 0000000000000002 RDI: ffff999f81b0b000 
[83393.259276] RBP: ffff999f81b0b000 R08: 0000000000011915 R09: ffffffffc0c37a77 
[83393.292768] R10: 0000000000000000 R11: ffff999fb7aa1ae8 R12: ffff999f599ac160 
[83393.326280] R13: ffff999f81b0b670 R14: ffffffffc0c37a77 R15: ffffffffc0c37a5d 
[83393.359705] FS:  00007f8462084880(0000) GS:ffff999fb7a80000(0000) knlGS:0000000000000000 
[83393.397713] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033 
[83393.424669] CR2: 000055e5f26fe7a0 CR3: 0000000212ed8004 CR4: 00000000001606e0 
[83393.458173] Call Trace: 
[83393.470080]  vdoStatus+0x11d/0x170 [kvdo] 
[83393.488978]  retrieve_status+0xa7/0x1f0 [dm_mod] 
[83393.511571]  ? dm_get_live_or_inactive_table.isra.7+0x20/0x20 [dm_mod] 
[83393.542708]  table_status+0x61/0xa0 [dm_mod] 
[83393.562663]  ctl_ioctl+0x1af/0x3f0 [dm_mod] 
[83393.582173]  ? selinux_file_ioctl+0x70/0x200 
[83393.602281]  dm_ctl_ioctl+0xa/0x10 [dm_mod] 
[83393.621802]  do_vfs_ioctl+0xa4/0x630 
[83393.638733]  ksys_ioctl+0x60/0x90 
[83393.654158]  __x64_sys_ioctl+0x16/0x20 
[83393.671795]  do_syscall_64+0x5b/0x1b0 
[83393.688959]  entry_SYSCALL_64_after_hwframe+0x65/0xca 
[83393.712909] RIP: 0033:0x7f846194945b 
[83393.729701] Code: 0f 1e fa 48 8b 05 2d aa 2c 00 64 c7 00 26 00 00 00 48 c7 c0 ff ff ff ff c3 66 0f 1f 44 00 00 f3 0f 1e fa b8 10 00 00 00 0f 05 <48> 3d 01 f0 ff ff 73 01 c3 48 8b 0d fd a9 2c 00 f7 d8 64 89 01 48 
[83393.817889] RSP: 002b:00007fff888f9538 EFLAGS: 00000202 ORIG_RAX: 0000000000000010 
[83393.853681] RAX: ffffffffffffffda RBX: 00007f8461c280a0 RCX: 00007f846194945b 
[83393.887452] RDX: 000055e5f42dcc20 RSI: 00000000c138fd0c RDI: 0000000000000003 
[83393.921148] RBP: 00007f8461c63053 R08: 00007f8461c63be0 R09: 00007fff888f93a0 
[83393.954928] R10: 000000000000001e R11: 0000000000000202 R12: 000055e5f42dcc20 
[83393.989239] R13: 0000000000000000 R14: 000055e5f42dccd0 R15: 000055e5f42dc5d0 
[83394.025658] Modules linked in: kvdo(O) uds(O) nfsv3 nfs_acl dm_service_time rpcsec_gss_krb5 auth_rpcgss nfsv4 dns_resolver nfs lockd grace fscache iscsi_tcp libiscsi_tcp libiscsi scsi_transport_iscsi sunrpc dm_multipath intel_rapl sb_edac x86_pkg_temp_thermal intel_powerclamp coretemp iTCO_wdt iTCO_vendor_support kvm_intel ipmi_ssif kvm sg irqbypass crct10dif_pclmul crc32_pclmul i2c_i801 ghash_clmulni_intel intel_cstate intel_uncore hpilo pcspkr intel_rapl_perf lpc_ich hpwdt ipmi_si ipmi_devintf acpi_tad ipmi_msghandler ioatdma wmi acpi_power_meter xfs libcrc32c sd_mod mgag200 i2c_algo_bit drm_kms_helper syscopyarea sysfillrect sysimgblt fb_sys_fops ttm ahci drm libahci libata crc32c_intel tg3 ixgbe serio_raw mdio dca dm_mirror dm_region_hash dm_log dm_mod 

I did not manage to reproduce it again with the same commands.

Comment 6 Sweet Tea Dorminy 2019-03-04 19:52:41 UTC
Upon further review, the vdostatus issue appears to be likely to be 'running vdostatus during a table change fails', not 'running vdostatus during start fails' -- Jakub, can you confirm that you've only seen this when doing a vdo modify of a running VDO?

Comment 7 Jakub Krysl 2019-03-05 08:49:02 UTC
(In reply to Sweet Tea Dorminy from comment #6)
> Upon further review, the vdostatus issue appears to be likely to be 'running
> vdostatus during a table change fails', not 'running vdostatus during start
> fails' -- Jakub, can you confirm that you've only seen this when doing a vdo
> modify of a running VDO?

Well, it is quite hard to say. I've hit it twice, always right after the device was created ("VDO instance 2 volume is ready at /dev/mapper/vdo1" appeared). The server crashed right after that, without me running any further commands. I've hit it only during manual testing when the server was already in an unclean state; automated testing has not hit it (yet).

My guess is this: the VDO gets created, but the DM table is not yet completely loaded. vdoStatus gets called (maybe because of the free space checking event?) and tries to access the table too soon.

I'm sorry I cannot help you much here, as I have failed to find any reliable reproducer at all. One possible way to reproduce this could be using a really old, slow machine (or slowing one down) and running vdoStatus in a different terminal during creation / table modification.

Comment 8 bjohnsto 2019-03-11 17:40:47 UTC
I've not been able to reproduce this issue either. I wonder if it's related to bug 1659247, which will be fixed in 8.1? Might something be reloading a table and then calling resume somewhere in your test cycle?

Comment 9 Andy Walsh 2019-05-23 16:37:19 UTC
Jakub,

Since we're not able to reproduce this, can you see if the most recent code still has the ability to trigger this?

Comment 10 Jakub Krysl 2019-05-31 12:58:01 UTC
(In reply to Andy Walsh from comment #9)
> Jakub,
> 
> Since we're not able to reproduce this, can you see if the most recent code
> still has the ability to trigger this?

Andy, this is something I hit by accident during my manual testing and was not able to reproduce again; it seems to be some rare race condition that depends on unknown outside factors. It might have been fixed already somewhere, but we cannot be sure as long as we do not know what caused it.

Comment 11 Sweet Tea Dorminy 2019-06-03 20:35:30 UTC
I have reproduced it, methinks.

Comment 12 sclafani 2019-06-27 19:59:30 UTC
The problem is that the two inactive tables share a VDO layer object. The first table (and its config) are destroyed and the layer winds up with a dangling pointer to the destroyed config. This becomes a use-after-free land mine which the status queries can trip over, but the status itself has nothing directly to do with it. If the memory gets overwritten, even just removing the target can blow up. I expect the other table-reload-then-crash bugs are the same underlying problem.

Comment 15 Andy Walsh 2019-08-05 19:41:53 UTC
*** Bug 1678761 has been marked as a duplicate of this bug. ***

Comment 16 Jakub Krysl 2019-09-19 13:46:05 UTC
Using the reproducer from JIRA plus some more manual testing, I am not able to hit it again.

# rpm -qa kmod-kvdo
kmod-kvdo-6.2.1.138-57.el8.x86_64

# sleep 1 && sudo dmsetup load vdo0 --table " 0 8 vdo V2 /dev/sda 2441609216 4096 32768 16380 off sync vdo_instance maxDiscard 1 ack 1 bio 4 bioRotationInterval 1 cpu 2 hash 1 logical 1 physical 1"  & for i in {1..1000}; do sudo dmsetup status vdo0 --inactive; done;

Comment 19 errata-xmlrpc 2019-11-05 22:12:27 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2019:3548