Bug 593379 - [NetApp 6.0 bug] DM-Multipath: Multipathd crash on RHEL6 Snap3
Status: CLOSED CURRENTRELEASE
Product: Red Hat Enterprise Linux 6
Classification: Red Hat
Component: device-mapper-multipath
Version: 6.0
Hardware: All
OS: Linux
Priority: high
Severity: high
Target Milestone: rc
Target Release: 6.0
Assigned To: Ben Marzinski
QA Contact: Red Hat Kernel QE team
Keywords: OtherQA
Reported: 2010-05-18 13:00 EDT by Martin George
Modified: 2010-11-15 08:54 EST (History)
CC List: 20 users

Fixed In Version: device-mapper-multipath-0.4.9-21.el6
Doc Type: Bug Fix
Last Closed: 2010-11-15 08:54:55 EST


Attachments
Coredump (764.00 KB, application/octet-stream), 2010-05-18 13:00 EDT, Martin George
Multipathd coredump during the FCoE switch port block test (3.31 MB, application/octet-stream), 2010-05-19 10:43 EDT, Martin George

Description Martin George 2010-05-18 13:00:22 EDT
Created attachment 414911 [details]
Coredump

Description of problem:
Multipathd segfaulted on my RHEL6 Snap3 SW FCoE host (with Intel 10GbE NICs) after dm-multipath was configured on one LUN mapped to the host.

Version-Release number of selected component (if applicable):
RHEL6 Snap3 (2.6.32-24.el6)
device-mapper-multipath-0.4.9-19.el6

How reproducible:
Intermittent

Additional info: 
I'm attaching the coredump. Running gdb on it reveals the following:

Core was generated by `multipathd'.
Program terminated with signal 11, Segmentation fault.
#0  0x0000003b9a2492e1 in vfprintf () from /lib64/libc.so.6
Missing separate debuginfos, use: debuginfo-install device-mapper-libs-1.02.47-1.el6.x86_64 glibc-2.11.90-22.el6.x86_64 libaio-0.3.1
07-10.el6.x86_64 libselinux-2.0.90-3.el6.x86_64 libsepol-2.0.41-3.el6.x86_64 libudev-147-2.15.el6.x86_64 ncurses-libs-5.7-3.20090208
.el6.x86_64 readline-6.0-3.el6.x86_64
(gdb) where
#0  0x0000003b9a2492e1 in vfprintf () from /lib64/libc.so.6
#1  0x0000003b9a270842 in vsnprintf () from /lib64/libc.so.6
#2  0x0000003b9a250523 in snprintf () from /lib64/libc.so.6
#3  0x0000003b9aa16bd8 in sysfs_get_timeout (dev=0x0, timeout=0x1171180) at discovery.c:174
#4  0x0000003b9aa1a193 in select_checker (pp=0x1170ed0) at propsel.c:295
#5  0x0000003b9aa189ba in get_state (pp=0x1170ed0, daemon=0) at discovery.c:799
#6  0x0000003b9aa19022 in pathinfo (pp=0x1170ed0, hwtable=0x1161440, mask=12) at discovery.c:915
#7  0x0000003b9aa359d4 in adopt_paths (pathvec=0x116bfd0, mpp=0x1170390) at structs_vec.c:72
#8  0x0000003b9aa366af in setup_multipath (vecs=0x116b3f0, mpp=0x1170390) at structs_vec.c:358
#9  0x00000000004061fe in map_discovery (vecs=0x116b3f0) at main.c:611
#10 0x0000000000407db1 in configure (vecs=0x116b3f0, start_waiters=1) at main.c:1175
#11 0x000000000040876b in child (param=0x0) at main.c:1463
#12 0x0000000000408fc6 in main (argc=1, argv=0x7fff2bccfb28) at main.c:1661
(gdb)
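The dev=0x0 argument in frame #3 points at the likely cause: sysfs_get_timeout() builds a sysfs attribute path with snprintf() from a NULL device, and the dereference happens inside vfprintf(). A minimal sketch of that failure mode and the guard that avoids it (the struct and field names here are illustrative, not the actual multipath-tools source):

```c
/* Sketch only: sysfs_get_timeout() is entered with dev == NULL, so the
 * snprintf() that builds the sysfs attribute path dereferences a NULL
 * pointer. A guard at the top turns the crash into an error return. */
#include <stdio.h>
#include <stddef.h>

struct sysfs_device {
    char path[256];              /* e.g. "/sys/block/sdc" */
};

static int sysfs_get_timeout(struct sysfs_device *dev, unsigned int *timeout)
{
    char attr_path[512];

    if (!dev || !timeout)        /* the missing NULL check */
        return -1;

    snprintf(attr_path, sizeof(attr_path), "%s/device/timeout", dev->path);
    /* ... open attr_path and parse the value; stubbed for this sketch ... */
    *timeout = 30;
    return 0;
}
```

With the guard, a path whose sysfs information was never initialized fails cleanly instead of taking the daemon down.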
Comment 3 Ben Marzinski 2010-05-18 13:26:43 EDT
Thanks for the stack trace. I'll look into this right away.
Comment 4 Ben Marzinski 2010-05-18 16:44:29 EDT
I can see what the problem is. However, I can't see how it could happen, unless you had something like a multipath device created with a blacklisted path.

The only way I can recreate this is to

1. edit multipath.conf to not blacklist my devices
2. run multipath to create some multipath devices
3. edit multipath.conf to blacklist those devices
4. start multipathd

This will segfault.  Is something like this happening on your setup?
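For illustration, step 3 might look like the following /etc/multipath.conf fragment (the WWID below is only a placeholder for whichever device was created in step 2):

```
blacklist {
        wwid 360a98000572d4461524a64343373844d
}
```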
Comment 5 Ben Marzinski 2010-05-18 23:09:42 EDT
I fixed the problem, or at least the only one that I can see. The way I was hitting this was by adding a map that has a path which multipathd doesn't know anything about. The easiest way for me to do this is the reproducer in Comment #4, but there are other ways. So unless you are getting paths that haven't initialized their sysfs path information by some other method, this should fix it for you. I'll get a build with this done shortly so you can test it.
Comment 6 Martin George 2010-05-19 10:22:47 EDT
(In reply to comment #4)
> I can see what the problem is. However, I can't see how it could happen, except
> if you had something like a multipath device ceated with a blacklisted path.
> 
> The only way I can recreate this is to
> 
> 1. edit multipath.conf to not blacklist my devices
> 2. run multipath to create some multipath devices
> 3. edit multipath.conf to blacklist those devices
> 4. start multipathd
> 
> This will segfault.  Is something like this happening on your setup?    

No, I am not doing anything like that on my host. The blacklisting section in my multipath.conf looks like:

blacklist {
        wwid SIBM-ESXSMAW3073NC_FDAR9P6504MTK
        devnode "^(ram|raw|loop|fd|md|dm-|sr|scd|st)[0-9]*"
        devnode "^hd[a-z][[0-9]*]"
}

where the first entry is the WWID corresponding to my local SCSI drive.

I only saw the following messages under /var/log/ during the crash:

kernel: multipathd[9576]: segfault at 8 ip 0000003b9a2492e1 sp 00007fff411f3000 error 4 in libc-2.11.90.so[3b9a200000+17d000]
abrt[9582]: saved core dump of pid 9576 (/sbin/multipathd) to /var/cache/abrt/ccpp-1274264715-9576.new/coredump (782336 bytes)
Comment 7 Martin George 2010-05-19 10:40:21 EDT
Ok, I've now figured out how to trigger the crash. 

I have a single LUN with 4 underlying paths mapped to my SW FCoE RHEL6 host, with dm-multipath configured on it. I then block/unblock the target array ports on the Cisco Nexus FCoE switch, which takes the corresponding paths offline/online on the host. This triggers the multipathd crash. However, it is highly intermittent.
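For reference, the port block/unblock cycle on a Nexus switch is typically just an interface shutdown/no shutdown; the vfc interface number below is a placeholder for the actual target-facing port:

```
switch# configure terminal
switch(config)# interface vfc 5
switch(config-if)# shutdown
switch(config-if)# no shutdown
```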

The following messages are seen under /var/log/ during the crash:

May 19 17:08:57 IBMx336-200-108 multipathd: sdc: add path (uevent)
May 19 17:08:58 IBMx336-200-108 abrt[2863]: saved core dump of pid 1087 (/sbin/multipathd) to /var/cache/abrt/ccpp-1274269138-1087.new/coredump (3325952 bytes)

A new stack trace is seen with this crash now:

Core was generated by `/sbin/multipathd'.
Program terminated with signal 6, Aborted.
#0  0x0000003b9a233955 in raise () from /lib64/libc.so.6
Missing separate debuginfos, use: debuginfo-install device-mapper-libs-1.02.47-1.el6.x86_64 glibc-2.11.90-22.el6.x86_64 libaio-0.3.107-10.el6.x86_64 libgcc-4.4.4-2.el6.x86_64 libselinux-2.0.90-3.el6.x86_64 libsepol-2.0.41-3.el6.x86_64 libudev-147-2.15.el6.x86_64 n
curses-libs-5.7-3.20090208.el6.x86_64 readline-6.0-3.el6.x86_64
(gdb) where
#0  0x0000003b9a233955 in raise () from /lib64/libc.so.6
#1  0x0000003b9a235135 in abort () from /lib64/libc.so.6
#2  0x0000003b9a2716bb in __libc_message () from /lib64/libc.so.6
#3  0x0000003b9a277076 in malloc_printerr () from /lib64/libc.so.6
#4  0x0000003b9aa15629 in free_path (pp=0x1791de0) at structs.c:51
#5  0x0000003b9aa36b01 in verify_paths (mpp=0x17822d0, vecs=0x17753f0, rpvec=0x0) at structs_vec.c:471
#6  0x0000000000405ae6 in ev_add_path (devname=0x1789ea8 "sdc", vecs=0x17753f0) at main.c:419
#7  0x00000000004057da in uev_add_path (dev=0x1789aa0, vecs=0x17753f0) at main.c:352
#8  0x00000000004066ce in uev_trigger (uev=0x7f79e00009f0, trigger_data=0x17753f0) at main.c:717
#9  0x0000003b9aa2c6ff in service_uevq () at uevent.c:84
#10 0x0000003b9aa2c7c8 in uevq_thread (et=0x0) at uevent.c:110
#11 0x0000003b9ae07951 in start_thread () from /lib64/libpthread.so.0
#12 0x0000003b9a2e4d3d in clone () from /lib64/libc.so.6
(gdb)
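Frames #3-#4 (malloc_printerr() called from free_path()) point at glibc detecting a double free: verify_paths() and the uevent handler apparently both release the same struct path. A hedged sketch of that pattern and a defensive variant of the free routine (names are illustrative, not the real structs_vec.c code):

```c
/* Sketch only: two code paths each think they own the same 'struct
 * path', so free_path() runs twice on one allocation and glibc aborts.
 * Taking a pointer-to-pointer and clearing the caller's slot makes a
 * stray second call a harmless no-op. */
#include <stdlib.h>
#include <string.h>

struct path {
    char dev[16];                /* e.g. "sdc" */
};

static struct path *alloc_path(const char *dev)
{
    struct path *pp = calloc(1, sizeof(*pp));

    if (pp)
        strncpy(pp->dev, dev, sizeof(pp->dev) - 1);
    return pp;
}

static void free_path(struct path **pp)
{
    if (!pp || !*pp)
        return;
    free(*pp);
    *pp = NULL;                  /* a double free becomes a no-op */
}
```

The defensive free only papers over the symptom, of course; the real fix is making sure exactly one owner releases the path, which is what the reverted patch in Comment 9 addresses.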
Comment 8 Martin George 2010-05-19 10:43:56 EDT
Created attachment 415144 [details]
Multipathd coredump during the FCoE switch port block test
Comment 9 Ben Marzinski 2010-05-20 12:39:28 EDT
Well, that sounds a lot like bug 593426. Both of these (the problem with removing and re-adding a path, and the problem with adding a map with an unmonitored path) are regressions caused by a patch that allowed multipath to pull in missing paths when they become available. I went and reverted that patch and solved the problem differently, so these should all go away with the next package, which I hope to have out today, or tomorrow at the latest.
Comment 11 Ben Marzinski 2010-05-21 20:26:33 EDT
Here are some packages that should fix this problem:

http://people.redhat.com/bmarzins/device-mapper-multipath/rpms/RHEL6/
Comment 12 Martin George 2010-05-26 10:10:02 EDT
Initial tests with device-mapper-multipath-0.4.9-21.el6 look promising. I have not seen the crash in the switch port block/unblock tests so far.
Comment 14 Rajashekhar M A 2010-09-13 06:48:27 EDT
Verified with Snap13 (rpm version - device-mapper-multipath-0.4.9-28.el6.x86_64).
Comment 15 releng-rhel@redhat.com 2010-11-15 08:54:55 EST
Red Hat Enterprise Linux 6.0 is now available and should resolve
the problem described in this bug report. This report is therefore being closed
with a resolution of CURRENTRELEASE. You may reopen this bug report if the
solution does not work for you.
