Bug 1669960
| Summary: | Calling vdoStatus too early can result in NULL pointer dereference | ||
|---|---|---|---|
| Product: | Red Hat Enterprise Linux 8 | Reporter: | Jakub Krysl <jkrysl> |
| Component: | kmod-kvdo | Assignee: | sclafani |
| Status: | CLOSED ERRATA | QA Contact: | vdo-qe |
| Severity: | unspecified | Docs Contact: | |
| Priority: | unspecified | ||
| Version: | 8.0 | CC: | awalsh, bgurney, limershe, sweettea |
| Target Milestone: | rc | Flags: | pm-rhel: mirror+ |
| Target Release: | 8.0 | ||
| Hardware: | Unspecified | ||
| OS: | Unspecified | ||
| Whiteboard: | |||
| Fixed In Version: | 6.2.1.124 | Doc Type: | If docs needed, set a value |
| Doc Text: | Story Points: | --- | |
| Clone Of: | Environment: | ||
| Last Closed: | 2019-11-05 22:12:27 UTC | Type: | Bug |
| Regression: | --- | Mount Type: | --- |
| Documentation: | --- | CRM: | |
| Verified Versions: | Category: | --- | |
| oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: | |
| Cloudforms Team: | --- | Target Upstream Version: | |
| Embargoed: | |||
| Bug Depends On: | 1682560 | ||
| Bug Blocks: | |||
Description
Jakub Krysl
2019-01-28 08:44:24 UTC
I hit exactly the same call trace again on a different server with a newer kernel, this time not a NULL pointer dereference but a general protection fault:

```
[83392.935830] general protection fault: 0000 [#1] SMP PTI
[83392.961956] CPU: 2 PID: 21434 Comm: dmsetup Kdump: loaded Tainted: G O --------- - - 4.18.0-67.el8.x86_64 #1
[83393.016659] Hardware name: HP ProLiant DL360 Gen9/ProLiant DL360 Gen9, BIOS P89 10/25/2017
[83393.055157] RIP: 0010:getKernelLayerBdev+0x10/0x20 [kvdo]
[83393.080193] Code: e8 75 16 00 00 0f b6 c8 e9 38 fe ff ff 0f 1f 00 66 2e 0f 1f 84 00 00 00 00 00 0f 1f 44 00 00 48 8b 87 08 01 00 00 48 8b 40 08 <48> 8b 00 c3 66 90 66 2e 0f 1f 84 00 00 00 00 00 0f 1f 44 00 00 85
[83393.167816] RSP: 0018:ffffb40080eb3bc8 EFLAGS: 00010246
[83393.192252] RAX: 5f74696e695f5f60 RBX: 0000000000003ea0 RCX: 0000000001900000
[83393.225699] RDX: ffffffffc0c399ca RSI: 0000000000000002 RDI: ffff999f81b0b000
[83393.259276] RBP: ffff999f81b0b000 R08: 0000000000011915 R09: ffffffffc0c37a77
[83393.292768] R10: 0000000000000000 R11: ffff999fb7aa1ae8 R12: ffff999f599ac160
[83393.326280] R13: ffff999f81b0b670 R14: ffffffffc0c37a77 R15: ffffffffc0c37a5d
[83393.359705] FS: 00007f8462084880(0000) GS:ffff999fb7a80000(0000) knlGS:0000000000000000
[83393.397713] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[83393.424669] CR2: 000055e5f26fe7a0 CR3: 0000000212ed8004 CR4: 00000000001606e0
[83393.458173] Call Trace:
[83393.470080] vdoStatus+0x11d/0x170 [kvdo]
[83393.488978] retrieve_status+0xa7/0x1f0 [dm_mod]
[83393.511571] ? dm_get_live_or_inactive_table.isra.7+0x20/0x20 [dm_mod]
[83393.542708] table_status+0x61/0xa0 [dm_mod]
[83393.562663] ctl_ioctl+0x1af/0x3f0 [dm_mod]
[83393.582173] ? selinux_file_ioctl+0x70/0x200
[83393.602281] dm_ctl_ioctl+0xa/0x10 [dm_mod]
[83393.621802] do_vfs_ioctl+0xa4/0x630
[83393.638733] ksys_ioctl+0x60/0x90
[83393.654158] __x64_sys_ioctl+0x16/0x20
[83393.671795] do_syscall_64+0x5b/0x1b0
[83393.688959] entry_SYSCALL_64_after_hwframe+0x65/0xca
[83393.712909] RIP: 0033:0x7f846194945b
[83393.729701] Code: 0f 1e fa 48 8b 05 2d aa 2c 00 64 c7 00 26 00 00 00 48 c7 c0 ff ff ff ff c3 66 0f 1f 44 00 00 f3 0f 1e fa b8 10 00 00 00 0f 05 <48> 3d 01 f0 ff ff 73 01 c3 48 8b 0d fd a9 2c 00 f7 d8 64 89 01 48
[83393.817889] RSP: 002b:00007fff888f9538 EFLAGS: 00000202 ORIG_RAX: 0000000000000010
[83393.853681] RAX: ffffffffffffffda RBX: 00007f8461c280a0 RCX: 00007f846194945b
[83393.887452] RDX: 000055e5f42dcc20 RSI: 00000000c138fd0c RDI: 0000000000000003
[83393.921148] RBP: 00007f8461c63053 R08: 00007f8461c63be0 R09: 00007fff888f93a0
[83393.954928] R10: 000000000000001e R11: 0000000000000202 R12: 000055e5f42dcc20
[83393.989239] R13: 0000000000000000 R14: 000055e5f42dccd0 R15: 000055e5f42dc5d0
[83394.025658] Modules linked in: kvdo(O) uds(O) nfsv3 nfs_acl dm_service_time rpcsec_gss_krb5 auth_rpcgss nfsv4 dns_resolver nfs lockd grace fscache iscsi_tcp libiscsi_tcp libiscsi scsi_transport_iscsi sunrpc dm_multipath intel_rapl sb_edac x86_pkg_temp_thermal intel_powerclamp coretemp iTCO_wdt iTCO_vendor_support kvm_intel ipmi_ssif kvm sg irqbypass crct10dif_pclmul crc32_pclmul i2c_i801 ghash_clmulni_intel intel_cstate intel_uncore hpilo pcspkr intel_rapl_perf lpc_ich hpwdt ipmi_si ipmi_devintf acpi_tad ipmi_msghandler ioatdma wmi acpi_power_meter xfs libcrc32c sd_mod mgag200 i2c_algo_bit drm_kms_helper syscopyarea sysfillrect sysimgblt fb_sys_fops ttm ahci drm libahci libata crc32c_intel tg3 ixgbe serio_raw mdio dca dm_mirror dm_region_hash dm_log dm_mod
```

Did not manage to reproduce again with the same commands.
Upon further review, the vdostatus issue appears likely to be "running vdostatus during a table change fails", not "running vdostatus during start fails" -- Jakub, can you confirm that you've only seen this when doing a vdo modify of a running VDO?

(In reply to Sweet Tea Dorminy from comment #6)
> Upon further review, the vdostatus issue appears to be likely to be 'running
> vdostatus during a table change fails', not 'running vdostatus during start
> fails' -- Jakub, can you confirm that you've only seen this when doing a vdo
> modify of a running VDO?

Well, it is quite hard to say. I've hit it twice, always right after the device was created ("VDO instance 2 volume is ready at /dev/mapper/vdo1" appeared). The server crashed right after that, without any further commands being run from my side. I've hit it only during manual testing when the server was already in an unclean state; automated testing has not hit it (yet).

My guess is this: the VDO gets created, but the DM table is not yet loaded completely. vdoStatus gets called (maybe because of the free space checking event?) and tries to access the table too soon. I'm sorry I cannot help you much here, as I fail to find any reliable reproducer at all. One possible way to reproduce this could be using a really old, slow machine (or slowing one down) and running vdoStatus in a different terminal during creation / table modification.

I've not been able to reproduce this issue either. I wonder if it's related to bug 1659247, which will be fixed in 8.1? Might things be reloading a table and then calling resume somewhere in your test cycle?

Jakub,

Since we're not able to reproduce this, can you see if the most recent code still has the ability to trigger this?

(In reply to Andy Walsh from comment #9)
> Jakub,
>
> Since we're not able to reproduce this, can you see if the most recent code
> still has the ability to trigger this?

Andy, this is something I hit during my manual testing by accident and was not able to reproduce again; it seems to be some rare race condition that depends on unknown outside factors. It might have been fixed already somewhere, but we cannot be sure if we do not know what caused it.

I have reproduced it, methinks. The problem is that the two inactive tables share a VDO layer object. The first table (and its config) are destroyed and the layer winds up with a dangling pointer to the destroyed config. This becomes a use-after-free land mine which the status queries can trip over, but the status itself has nothing directly to do with it. If the memory gets overwritten, even just removing the target can blow up. I expect the other table-reload-then-crash bugs are the same underlying problem.

*** Bug 1678761 has been marked as a duplicate of this bug. ***
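[Editor's note: to make the mechanism described above concrete, here is a minimal user-space sketch of the pattern. The structure and function names are made up for illustration and this is not the kvdo source; it only shows how a long-lived shared object left holding a dangling pointer to a freed per-table config turns a later status query into a use-after-free.]

```c
#include <stdio.h>
#include <stdlib.h>

/* Hypothetical stand-ins for the real structures. */
struct block_device { const char *name; };

struct device_config {              /* per-table config, owned by one table */
    struct block_device *owned_device;
};

struct kernel_layer {               /* long-lived layer shared by both tables */
    struct device_config *device_config;
};

/* Roughly analogous to the getKernelLayerBdev() frame in the oops above:
 * it chases layer->device_config->owned_device without checking whether
 * the config it points at is still alive. */
static struct block_device *get_layer_bdev(const struct kernel_layer *layer)
{
    return layer->device_config->owned_device;
}

int main(void)
{
    static struct block_device sda = { "/dev/sda" };
    struct kernel_layer layer = { NULL };

    /* First table loads: its config is attached to the shared layer. */
    struct device_config *first_config = malloc(sizeof(*first_config));
    first_config->owned_device = &sda;
    layer.device_config = first_config;

    /* Second table loads and reuses the same layer; the first table and its
     * config are destroyed, but the layer still points at the freed config. */
    free(first_config);

    /* A status query now walks the dangling pointer: a use-after-free that
     * may print stale data, garbage, or crash, depending on what has since
     * reused the memory. */
    printf("backing device: %s\n", get_layer_bdev(&layer)->name);
    return 0;
}
```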
Using the reproducer from JIRA plus some more manual testing, I am not able to hit it again.

```
# rpm -qa kmod-kvdo
kmod-kvdo-6.2.1.138-57.el8.x86_64
# sleep 1 && sudo dmsetup load vdo0 --table " 0 8 vdo V2 /dev/sda 2441609216 4096 32768 16380 off sync vdo_instance maxDiscard 1 ack 1 bio 4 bioRotationInterval 1 cpu 2 hash 1 logical 1 physical 1" & for i in {1..1000}; do sudo dmsetup status vdo0 --inactive; done;
```
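[Editor's note: for context, a sketch of the kind of defensive handling that would keep the reproducer's status query from walking a stale or not-yet-attached config. The names are again hypothetical, and this is not the actual change that shipped in 6.2.1.124; it only illustrates the general idea of detaching the config from the shared layer on teardown and checking before dereferencing.]

```c
#include <stddef.h>
#include <stdio.h>

/* Hypothetical stand-ins, as in the previous sketch. */
struct block_device { const char *name; };

struct device_config {
    struct block_device *owned_device;
};

struct kernel_layer {
    struct device_config *device_config;
};

/* When a table and its config are torn down, detach the config from the
 * shared layer so nothing is left pointing at freed memory. */
static void detach_config(struct kernel_layer *layer, struct device_config *config)
{
    if (layer->device_config == config)
        layer->device_config = NULL;
}

/* Status path: tolerate a layer whose config is gone (or not attached yet)
 * instead of dereferencing it unconditionally. */
static const char *layer_device_name(const struct kernel_layer *layer)
{
    if (layer->device_config == NULL || layer->device_config->owned_device == NULL)
        return "(no device configured)";
    return layer->device_config->owned_device->name;
}

int main(void)
{
    struct block_device sda = { "/dev/sda" };
    struct device_config config = { &sda };
    struct kernel_layer layer = { &config };

    printf("before teardown: %s\n", layer_device_name(&layer));
    detach_config(&layer, &config);          /* first table goes away */
    printf("after teardown:  %s\n", layer_device_name(&layer));
    return 0;
}
```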
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHBA-2019:3548