Bug 1442369
Summary: | Multipathd crashes | |||
---|---|---|---|---|
Product: | Red Hat Enterprise Linux 6 | Reporter: | Anandhakannan Subramanian <anasubra> | |
Component: | device-mapper-multipath | Assignee: | Ben Marzinski <bmarzins> | |
Status: | CLOSED ERRATA | QA Contact: | Lin Li <lilin> | |
Severity: | medium | Docs Contact: | ||
Priority: | unspecified | |||
Version: | 6.8 | CC: | agk, bmarzins, heinzm, jbrassow, jkurik, lilin, msnitzer, nkshirsa, prajnoha, rbalakri, rhandlin, zkabelac | |
Target Milestone: | rc | |||
Target Release: | --- | |||
Hardware: | x86_64 | |||
OS: | Linux | |||
Whiteboard: | ||||
Fixed In Version: | device-mapper-multipath-0.4.9-103.el6 | Doc Type: | Bug Fix | |
Doc Text: |
Cause: The code to check if a dm device was a partition of a multipath device gave the incorrect answer for any devices whose table contained a device with a minor number that was the same as the multipath device minor number with additional digits on the end.
Consequence: When removing or renaming a multipath device, multipath could recursively check a device over and over again, thinking it was a partition of itself, and eventually run out of memory and crash.
Fix: Multipath's partition device check is much more robust.
Result: multipath correctly identifies which dm devices are partitions of other devices, and will rename or remove them without crashing.
|
Story Points: | --- | |
Clone Of: | ||||
: | 1444194 (view as bug list) | Environment: | ||
Last Closed: | 2018-06-19 05:17:52 UTC | Type: | Bug | |
Regression: | --- | Mount Type: | --- | |
Documentation: | --- | CRM: | ||
Verified Versions: | Category: | --- | ||
oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: | ||
Cloudforms Team: | --- | Target Upstream Version: | ||
Embargoed: | ||||
Bug Depends On: | ||||
Bug Blocks: | 1444194, 1461138, 1507140 |
Description
Anandhakannan Subramanian
2017-04-14 10:31:47 UTC
I'm not so sure that this is a regression of Bug 1349376, although that was my first guess as well. Looking at the symbols in libmultipath.so (dm_rename_partmaps is in libmultipath, and multipath and multipathd do not call dm_task_create directly, only libmultipath does) I see:

** This is using device-mapper-multipath-debuginfo-0.4.9-100.el6.x86_64.rpm

[bmarzins@octiron lib64]$ nm libmultipath.so.debug | grep dm_task_create
U dm_task_create@@Base

So, it is versioned. Looking at dmsetup, to see that the symbol versioning matches, I see:

** This is using lvm2-debuginfo-2.02.143-12.el6.x86_64.rpm which has the debug symbols for device-mapper-1.02.117-12.el6.x86_64.rpm

[bmarzins@octiron sbin]$ nm lvm.debug | grep dm_task_create
U dm_task_create@@Base

So the symbol versioning matches. On the other hand, I can't easily see how you can get a segfault on this line:

1113 if (!(dmt = dm_task_create(DM_DEVICE_LIST)))

assuming that really is where the segfault happened. Could you possibly post a core dump for me to look at?

Well, I know what's going on, and it's pretty impressive that this bug hasn't been hit before. If you look at the 4 backtrace lines you posted, you can see that they are calling the same two functions over and over again. That's because multipath is stuck in an infinite recursion.
Here is the top of the stack, almost 6000 frames up:

#5817 0x0000003c79c12b85 in dm_rename (old=0x183b6fc "mpathhzyp4", new=0x7ffc0c54e0f0 "SCP-EBSINT-BAIE09-PUR-ERPIC-24Lp4", skip_kpartx=<value optimized out>) at devmapper.c:1185
#5818 0x0000003c79c12b25 in dm_rename_partmaps (old=0x18376ec "mpathhzyp4", new=0x7ffc0c54f1d0 "SCP-EBSINT-BAIE09-PUR-ERPIC-24Lp4") at devmapper.c:1161
#5819 0x0000003c79c12b85 in dm_rename (old=0x18376ec "mpathhzyp4", new=0x7ffc0c54f1d0 "SCP-EBSINT-BAIE09-PUR-ERPIC-24Lp4", skip_kpartx=<value optimized out>) at devmapper.c:1185
#5820 0x0000003c79c12b25 in dm_rename_partmaps (old=0x18352a0 "mpathhzy", new=0x15f0ec0 "SCP-EBSINT-BAIE09-PUR-ERPIC-24L") at devmapper.c:1161
#5821 0x0000003c79c12b85 in dm_rename (old=0x18352a0 "mpathhzy", new=0x15f0ec0 "SCP-EBSINT-BAIE09-PUR-ERPIC-24L", skip_kpartx=<value optimized out>) at devmapper.c:1185
#5822 0x0000003c79c313c3 in domap (mpp=0x1835220) at configure.c:643
#5823 0x0000003c79c3215b in coalesce_paths (vecs=0x15fb840, newmp=0x15fbbf0, refwwid=0x0, force_reload=1) at configure.c:810
#5824 0x0000000000406390 in configure (vecs=0x15fb840, start_waiters=1) at main.c:1458
#5825 0x0000000000406dc6 in child (argc=<value optimized out>, argv=<value optimized out>) at main.c:1766
#5826 main (argc=<value optimized out>, argv=<value optimized out>) at main.c:1992

What's happening here is that multipath is trying to figure out what partitions are on top of the multipath device. The multipath device mpathhzy is 253:100

(gdb) frame 5820
(gdb) print dev_t
$3 = "253:100"

and the kpartx device is 253:10

(gdb) frame 2
$4 = "253:10"

This is the cause of all the problems. When multipath tries to determine if a kpartx partition device belongs to a multipath device, it makes sure that the multipath part of their dm uuid is the same, and then it checks if the multipath device (253:100) appears in the kpartx device table. If so, it renames the kpartx device as well.
The problem is that it does so by recursively calling dm_rename, which tries to see if there are any kpartx devices that are using this device (253:10) in their device table. The way it checks is strstr(<table_string>, <device>). Because strstr does a plain substring search, a check for 253:10 also matches 253:100. That means the kpartx device's own table passes the check to see if it is a partition of itself, which causes an endless recursion.

So, clearly multipath needs smarter code to find the kpartx partitions of a multipath device. This bug exists in RHEL7 and upstream as well.

Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2018:1893
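To illustrate the strstr(<table_string>, <device>) false match described in the analysis above, here is a minimal sketch in C. It is not the code from devmapper.c or the fix shipped in the erratum: the function names table_mentions_dev_naive and table_mentions_dev_strict are hypothetical, and the sketch assumes the table string is a whitespace-separated dm table line such as "0 2097152 linear 253:100 2048".

```c
#include <assert.h>
#include <ctype.h>
#include <stdbool.h>
#include <string.h>

/* Naive check, mirroring the strstr() approach described in the report:
 * any substring hit counts, so "253:10" is found inside "253:100". */
static bool table_mentions_dev_naive(const char *table, const char *dev)
{
    assert(table && dev);
    return strstr(table, dev) != NULL;
}

/* Stricter check (hypothetical helper, an assumption for illustration):
 * a hit only counts when it is a whole whitespace-delimited token. */
static bool table_mentions_dev_strict(const char *table, const char *dev)
{
    size_t len;
    const char *p;

    assert(table && dev);
    len = strlen(dev);
    if (len == 0)
        return false;

    /* Walk every occurrence of dev and test both of its boundaries. */
    for (p = strstr(table, dev); p != NULL; p = strstr(p + 1, dev)) {
        bool starts_token = (p == table) || isspace((unsigned char)p[-1]);
        bool ends_token = (p[len] == '\0') || isspace((unsigned char)p[len]);

        if (starts_token && ends_token)
            return true;
    }
    return false;
}
```

With the example table "0 2097152 linear 253:100 2048", the naive check reports that 253:10 is present, while the strict check only accepts 253:100. Comparing whole tokens is one way to avoid the false match; parsing the major and minor numbers and comparing them numerically would be another.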