Bug 680480
Summary: multipathd SEGV in sysfs_get_timeout() during double path failure
Product: Red Hat Enterprise Linux 6
Component: device-mapper-multipath
Version: 6.0
Hardware: x86_64
OS: Linux
Status: CLOSED ERRATA
Severity: high
Priority: urgent
Reporter: Mike Snitzer <msnitzer>
Assignee: Ben Marzinski <bmarzins>
QA Contact: Gris Ge <fge>
CC: agk, bdonahue, berthiaume_wayne, bmarzins, bugproxy, coughlan, dwysocha, eddie.williams, fge, heinzm, mbroz, msnitzer, prajnoha, prockai, zkabelac
Target Milestone: rc
Target Release: 6.1
Fixed In Version: device-mapper-multipath-0.4.9-40.el6
Doc Type: Bug Fix
Doc Text:
During a double path failure, the sysfs device file is removed and the sysdev path attribute is set to NULL. The sysfs device cache is indexed by the actual sysfs directory, and /sys/block/pathname is a symlink. Prior to this update, if the path was deleted, multipathd could not resolve the actual directory that /sys/block/pathname pointed to, and so could not search the cache. With this update, multipathd updates sysdev only if it is NULL.
Clone Of: 680140
Last Closed: 2011-05-19 14:13:00 UTC
Description
Mike Snitzer
2011-02-25 17:25:37 UTC
Similar to bug 680140, this is related to devices being removed due to failure. In each case where I have seen this core dump, devices were being removed due to failure. When the device is removed, dereferencing the pointers produces errors like this.

In one case:

Feb 24 17:13:36 fiji kernel: device-mapper: multipath: Failing path 67:160.
Feb 24 17:13:36 fiji kernel: device-mapper: multipath: Failing path 67:176.
Feb 24 17:13:36 fiji kernel: device-mapper: multipath: Failing path 68:208.
Feb 24 17:13:36 fiji kernel: device-mapper: multipath: Failing path 67:224.
Feb 24 17:13:36 fiji kernel: multipathd[10655]: segfault at 8 ip 0000003716447ff7 sp 00007f90f170e020 error 4 in libc-2.12.so[3716400000+175000]
Feb 24 17:13:36 fiji abrt[6329]: saved core dump of pid 10354 (/sbin/multipathd) to /var/spool/abrt/ccpp-1298585616-10354.new/coredump (5931008 bytes)

In another:

Feb 25 09:59:53 fiji multipathd: sdap: failed to get sysfs information
Feb 25 09:59:53 fiji multipathd: sdap: failed to get sysfs information
Feb 25 09:59:53 fiji kernel: device-mapper: multipath: Failing path 66:176.
Feb 25 09:59:53 fiji kernel: __ratelimit: 5196 callbacks suppressed
Feb 25 09:59:53 fiji kernel: multipathd[4808]: segfault at 8 ip 0000003716447ff7 sp 00007ffe4d8e4020 error 4 in libc-2.12.so[3716400000+175000]

In the second case the sysfs entry is missing, but it looks like sysfs_get_timeout() then tries to dereference the sysfs entry at:

if (safe_sprintf(attr_path, "%s/device", dev->devpath))

where dev is supposed to point to the sysfs entry (libmultipath/discovery.c).

Can you reliably reproduce this? It's pretty straightforward to solve this crash; however, I don't understand the root cause. Every indication points to the sysfs device file being removed, causing the sysdev path attribute to be set to NULL. Then sysfs_get_timeout() doesn't check for it being NULL and crashes. So far so good. The only issue is that multipathd maintains a sysdev cache to avoid just such an issue.
The sysdev is only removed from the cache when the path is removed. This is done under a lock that is also held by the code that is crashing, so these two things couldn't happen at the same time. And the sysdev isn't removed until the actual path is removed, so it should never get checked after the sysdev is removed. That leaves two options: either the sysdev isn't getting added to the cache when it should be (which looks pretty unlikely), or the sysdev is getting removed from the cache when it shouldn't be. If you can reliably reproduce this, I can send you a test package that doesn't have the code to remove the sysdev from the cache. This should allow us to check if the device really is getting removed when it shouldn't be.

I can fairly reliably reproduce it. :-) In most cases when I am running some tests and cause two paths to fail, I will see the problem around 50% of the time. However, it has at times gone 4 or 5 tests without duplicating, so the window seems to be fairly small. I will be happy to test it. If I can go through 10 tests without duplicating the problem, I will be satisfied that the problem has been resolved.

I have two servers set up, so the chances of hitting this have greatly improved; while not 100%, it is close to certain that one or the other will hit it. Any estimate when a package may be available for me to test?

There are debug packages available at:
http://people.redhat.com/bmarzins/device-mapper-multipath/rpms/RHEL6/x86_64/
and
http://people.redhat.com/bmarzins/device-mapper-multipath/rpms/RHEL6/i686/
These packages will not ever remove a sysfs device from the cache. If this solves your problem, then it's pretty obvious that these devices are getting removed when they shouldn't be.

Created attachment 482169 [details]
coredump info from /var/spool/abrt

Updated multipath packages to 38.el6.bz680480 and retried the test with multiple paths failing (simulating a switch failure). Multipathd core dumped.
Note I ran into core dumps on both systems being tested on the first attempt. I may have just been lucky, or with the new bits the failure rate is higher and maybe 100%.

O.k. This all makes sense now. I forgot that the sysfs device cache is indexed by the actual sysfs directory, and /sys/block/<pathname> is now a symlink, so if the path is deleted, it isn't able to find the actual directory that /sys/block/<pathname> points to, so it can't search the cache. Multipath is still able to clear the cache, because the remove events come in from udev using the actual sysfs directory. So the answer is simply to not bother to update sysdev if it's not NULL, and we won't have to worry if it goes away before these checks.

Fixed issue.

Reproduced this issue and got the same coredump as reported:

#0 0x0000003716447ff7 in _IO_vfprintf_internal (s=<value optimized out>, format=<value optimized out>, ap=<value optimized out>) at vfprintf.c:1593
1593 process_string_arg (((struct printf_spec *) NULL));
Missing separate debuginfos, use: debuginfo-install libaio-0.3.107-10.el6.x86_64 libselinux-2.0.94-2.el6.x86_64 libsepol-2.0.41-3.el6.x86_64 libudev-147-2.29.el6.x86_64
(gdb) br
Breakpoint 1 at 0x3716447ff7: file vfprintf.c, line 1593.
(gdb) list
1588
1589 /* Process current format. */
1590 while (1)
1591 {
1592 process_arg (((struct printf_spec *) NULL));
1593 process_string_arg (((struct printf_spec *) NULL));
1594
1595 LABEL (form_unknown):
1596 if (spec == L_('\0'))
1597 {

Test environment:
device-mapper-multipath-0.4.9-39 (hits the problem)
device-mapper-multipath-0.4.9-40 (fixes the problem)
Boot from SAN with 4 multibus paths (2 HBA ports x 2 controller ports).
Bring 1 HBA port down from the switch.

I have tested device-mapper-multipath-0.4.9-40 with 10 rounds of link bouncing: PASS. Verified this bug.

Can you send me a pointer to the new package for me to test with?
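The fix described above (leave sysdev alone once it is set, and never format a path from a NULL device) can be sketched in C as follows. This is a minimal illustration, not the actual libmultipath patch: struct path_sketch, struct sysfs_device_sketch, refresh_sysdev(), build_attr_path() and lookup_gone() are hypothetical names invented for this example, and snprintf() stands in for the safe_sprintf() call quoted earlier.

```c
#include <assert.h>
#include <stddef.h>
#include <stdio.h>

/* Hypothetical, simplified stand-ins for the libmultipath structures. */
struct sysfs_device_sketch {
    char devpath[256];
};

struct path_sketch {
    char dev[32];                       /* e.g. "sdap" */
    struct sysfs_device_sketch *sysdev; /* cached sysfs device, or NULL */
};

/* Hypothetical lookup callback: returns NULL when the entry is gone. */
typedef struct sysfs_device_sketch *(*lookup_fn)(const char *dev);

/* Simulates a lookup after the path has been deleted from sysfs. */
static struct sysfs_device_sketch *lookup_gone(const char *dev)
{
    (void)dev;
    return NULL;
}

/* Only look the device up when nothing is cached yet, so a failed
 * lookup can never overwrite a valid pointer with NULL. */
static void refresh_sysdev(struct path_sketch *pp, lookup_fn lookup)
{
    assert(pp != NULL && lookup != NULL);
    if (pp->sysdev == NULL)
        pp->sysdev = lookup(pp->dev);
}

/* Build "<devpath>/device". Returns 0 on success, 1 on failure,
 * including the case where dev is NULL. */
static int build_attr_path(const struct sysfs_device_sketch *dev,
                           char *buf, size_t len)
{
    int n;

    if (dev == NULL)
        return 1;
    n = snprintf(buf, len, "%s/device", dev->devpath);
    return (n < 0 || (size_t)n >= len) ? 1 : 0;
}
```

With a cached sysdev, calling refresh_sysdev() with lookup_gone leaves the pointer untouched, and build_attr_path() reports failure instead of dereferencing NULL when no device is available.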
The device-mapper-multipath-0.4.9-40.el6 packages are available at:
http://people.redhat.com/bmarzins/device-mapper-multipath/rpms/RHEL6/
I've put up packages for all supported architectures.

I downloaded the x86_64 bits and they resolved the problem. THANKS

*** Bug 691658 has been marked as a duplicate of this bug. ***

Created attachment 489160 [details]
messages
Created attachment 489161 [details]
sosreport
------- Comment From christian_may.com 2011-04-05 03:21 EDT-------
All three blades (i386, x86_64, ppc64) were set up with RHEL6.1 Snapshot 1. I/O and port bounce scenario started. Currently 50 cycles have passed successfully for all architectures. multipathd still up and running. Looks good... 50 cycles still remaining...

------- Comment From christian_may.com 2011-04-06 07:10 EDT-------
I've stopped the test after appr. 70 cycles. Multipath daemon still running. Problem fixed with device-mapper packages from Snapshot 1. Bug can be closed.

------- Comment From prem.karat.ibm.com 2011-04-11 02:34 EDT-------
(In reply to comment #30)
> I've stopped the test after appr. 70 cycles. Multipath daemon still running.
> Problem fixed with device-mapper packages from Snapshot 1.
> Bug can be closed.
Since the bug is resolved and the fix is included in RHEL 6.1 Snap1, I am closing this one out as per the previous comment.
Cheers,
Prem

Technical note added. If any revisions are required, please edit the "Technical Notes" field accordingly. All revisions will be proofread by the Engineering Content Services team.
New Contents:
During a double path failure, the sysfs device file is removed and the sysdev path attribute is set to NULL. The sysfs device cache is indexed by the actual sysfs directory, and /sys/block/pathname is a symlink. Prior to this update, if the path was deleted, multipathd could not resolve the actual directory that /sys/block/pathname pointed to, and so could not search the cache. With this update, multipathd updates sysdev only if it is NULL.

An advisory has been issued which should help the problem described in this bug report. This report is therefore being closed with a resolution of ERRATA. For more information on the solution and/or where to find the updated files, please follow the link below. You may reopen this bug report if the solution does not work for you.
http://rhn.redhat.com/errata/RHBA-2011-0725.html
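
The symlink behaviour in the technical note can be reproduced outside multipathd. The sketch below assumes a POSIX system and uses a scratch directory under /tmp in place of /sys/block/<pathname>; resolve_cache_key() and demo_dangling_symlink() are hypothetical helpers written for this example, not multipath-tools functions. It illustrates that realpath() fails once the link target is gone, so a cache keyed by the real directory cannot be queried through the link.

```c
#ifndef _XOPEN_SOURCE
#define _XOPEN_SOURCE 700 /* for realpath(), mkdtemp(), symlink() */
#endif
#include <assert.h>
#include <limits.h>
#include <stdio.h>
#include <stdlib.h>
#include <sys/stat.h>
#include <unistd.h>

/* Hypothetical helper: derive a cache key by resolving a symlink to
 * the real directory behind it. Returns 0 on success, -1 when the
 * link cannot be resolved (for example, its target was removed). */
static int resolve_cache_key(const char *link_path, char *key, size_t len)
{
    char resolved[PATH_MAX];
    int n;

    assert(link_path != NULL && key != NULL);
    if (realpath(link_path, resolved) == NULL)
        return -1; /* dangling link: no key, so no cache lookup */
    n = snprintf(key, len, "%s", resolved);
    return (n < 0 || (size_t)n >= len) ? -1 : 0;
}

/* Demo in a scratch directory: resolution works while the target
 * exists and fails after the target is removed.
 * Returns 0 if both observations hold, -1 otherwise. */
static int demo_dangling_symlink(void)
{
    char dir[] = "/tmp/sysdev_sketch_XXXXXX";
    char target[64], lnk[64], key[PATH_MAX];
    int before, after;

    if (mkdtemp(dir) == NULL)
        return -1;
    snprintf(target, sizeof target, "%s/real", dir);
    snprintf(lnk, sizeof lnk, "%s/link", dir);
    if (mkdir(target, 0700) != 0 || symlink(target, lnk) != 0) {
        rmdir(target);
        rmdir(dir);
        return -1;
    }

    before = resolve_cache_key(lnk, key, sizeof key);
    rmdir(target); /* the "path" disappears, the link remains */
    after = resolve_cache_key(lnk, key, sizeof key);

    unlink(lnk);
    rmdir(dir);
    return (before == 0 && after == -1) ? 0 : -1;
}
```

demo_dangling_symlink() returns 0 when resolution succeeds before the target is removed and fails afterwards, which mirrors why a lookup through /sys/block/<pathname> could no longer reach the cached entry.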