Bug 670836 - [NetApp 5.6 Bug] Multipathd occasionally crashes during IO with fabric faults
Summary: [NetApp 5.6 Bug] Multipathd occasionally crashes during IO with fabric faults
Keywords:
Status: CLOSED DUPLICATE of bug 639429
Alias: None
Product: Red Hat Enterprise Linux 5
Classification: Red Hat
Component: device-mapper-multipath
Version: 5.6
Hardware: All
OS: All
Priority: urgent
Severity: urgent
Target Milestone: rc
Assignee: Ben Marzinski
QA Contact: Storage QE
URL:
Whiteboard:
Depends On:
Blocks:
 
Reported: 2011-01-19 13:59 UTC by Martin George
Modified: 2011-04-07 21:12 UTC
CC: 17 users

Fixed In Version:
Doc Type: Bug Fix
Doc Text:
Clone Of:
Environment:
Last Closed: 2011-04-07 21:12:57 UTC
Target Upstream Version:


Attachments
Logs containing dumps & config details (3.94 MB, application/x-zip-compressed)
2011-01-19 14:08 UTC, Martin George
Multipathd core during shutdown (150.72 KB, application/octet-stream)
2011-02-03 17:38 UTC, Martin George

Description Martin George 2011-01-19 13:59:01 UTC
Description of problem:
During IO with fabric faults on a 5.6 host with a multipathed root device, multipathd occasionally hangs, causing IO to fail (both 'multipath -ll' and multipathd -k"list paths" block indefinitely, indicating the hang), and eventually crashes, as shown by the following entry in /var/log/messages:

kernel: multipathd[16114]: segfault at 0000000000000000 rip 0000000000408fe7 rsp 0000000041f7ceb0 error 4

The multipathd status then shows up as "multipathd dead but subsys locked".

Running gdb on the multipathd core reveals the following:

Core was generated by `multipathd -d'.
Program terminated with signal 11, Segmentation fault.
#0  0x0000000000408fe7 in find_key (str=0x2019b9b0 "show") at cli.c:167
167             vector_foreach_slot (keys, kw, i) {
(gdb) where
#0  0x0000000000408fe7 in find_key (str=0x2019b9b0 "show") at cli.c:167
#1  0x000000000040910d in get_cmdvec (cmd=0x1fffe8c0 "show paths") at cli.c:220
#2  0x0000000000409525 in parse_cmd (cmd=0x1fffe8c0 "show paths", reply=0x41f7d0a0, len=0x41f7d0b4, data=0x1ffdf8f0) at cli.c:341
#3  0x0000000000404fff in uxsock_trigger (str=0x1fffe8c0 "show paths", reply=0x41f7d0a0, len=0x41f7d0b4, trigger_data=0x1ffdf8f0) at main.c:652
#4  0x00000000004086a9 in uxsock_listen (uxsock_trigger=0x404f35 <uxsock_trigger>, trigger_data=0x1ffdf8f0) at uxlsnr.c:148
#5  0x0000000000405680 in uxlsnrloop (ap=0x1ffdf8f0) at main.c:793
#6  0x00000039d620673d in start_thread () from /lib64/libpthread.so.0
#7  0x00000039d56d40cd in clone () from /lib64/libc.so.6
(gdb)

Version-Release number of selected component (if applicable):
RHEL 5.6 GA (2.6.18-238.el5)
device-mapper-multipath-0.4.7-42

How reproducible:
Occasionally (usually after a 24-hour continuous IO run with fabric faults)

Comment 1 Martin George 2011-01-19 14:08:40 UTC
Created attachment 474274 [details]
Logs containing dumps & config details

Logs contain the following:
1) /etc/multipath.conf
2) /var/log/messages
3) Sysrq dumps collected during the multipathd hang
4) Multipathd core file

Comment 2 Martin George 2011-01-24 20:01:57 UTC
Ben,

Do you require any additional logs/dumps for this issue?

Comment 3 Martin George 2011-01-28 11:24:20 UTC
Any updates?

This is currently impacting our testing efforts on 5.6.

Comment 4 Ben Marzinski 2011-01-30 05:14:14 UTC
The odd thing about this crash is that it's happening because the show paths request is getting processed while multipathd is shutting down.  Looking through the core file, most of the data structures have already been deleted, and the pthread exit_cond conditional no longer has anyone waiting on it.  Is it possible that you tried to kill multipathd when it was hanging, found out that that didn't work, and then tried to run show paths?  However it happened, it looks like multipathd was in the process of shutting down, which means that something sent it a SIGTERM or a SIGINT.  I assume that since it was most likely already stuck when it started shutting down, it wasn't able to finish, and apparently the interactive command listening thread was still able to process connections.

I'd really like to see where the process is hung before it crashes or shuts itself down.  You could either run gdb on it, step through all of the threads, and run bt to get a backtrace of each of them, or send the process a SIGSEGV and send me the core dump, which should allow me to go digging through it.
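
The failure mode described above can be pictured with a small standalone program. This is an illustrative sketch only, not multipath-tools source: struct vec, find_key() and shutdown_cleanup() are hypothetical stand-ins for the real cli.c code. The point is that the keyword table the listener walks for "show paths" gets torn down by the shutdown path while a lookup may still be in flight.

/*
 * Sketch of the hazard, not the multipathd code or fix.
 */
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

struct vec {
    int used;
    char **slots;
};

static struct vec *keys;               /* shared keyword table */

static const char *find_key(const char *str)
{
    int i;

    if (!keys)                         /* defensive check; without it (and
                                        * without locking against teardown)
                                        * this walk dereferences freed memory */
        return NULL;
    for (i = 0; i < keys->used; i++)
        if (strcmp(keys->slots[i], str) == 0)
            return keys->slots[i];
    return NULL;
}

static void shutdown_cleanup(void)
{
    /* In the daemon, teardown like this racing with a listener thread that
     * is still inside find_key() gives exactly the segfault at cli.c:167. */
    free(keys->slots);
    free(keys);
    keys = NULL;
}

int main(void)
{
    keys = calloc(1, sizeof(*keys));
    keys->slots = calloc(1, sizeof(char *));
    keys->slots[0] = "show";
    keys->used = 1;

    printf("before teardown: %s\n", find_key("show") ? "found" : "not found");
    shutdown_cleanup();
    printf("after teardown:  %s\n", find_key("show") ? "found" : "not found");
    return 0;
}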

Comment 5 Martin George 2011-02-03 17:37:32 UTC
(In reply to comment #4)
> The odd thing about this crash is that it's happening because the show paths
> request is getting processed while multipathd is shutting down.  Looking
> through the core file, most of the data structures have already been deleted,
> and the pthread exit_cond conditional no longer has anyone waiting on it.  Is
> it possible that you tried to kill multipathd when it was hanging, found out
> that that didn't work, and then tried to run show paths?  

In fact, since "multipath -ll" hung, a SIGINT was sent to it, but to no avail. Eventually multipathd crashed during one of the subsequent fabric faults.

But now I have hit another multipathd crash, this time during machine shutdown, when multipathd was being stopped:

Program terminated with signal 11, Segmentation fault.
#0  0x00000000004069de in reconfigure (vecs=0x6a7d3d0) at main.c:1280
1280            if (VECTOR_SIZE(vecs->mpvec))
(gdb) where
#0  0x00000000004069de in reconfigure (vecs=0x6a7d3d0) at main.c:1280
#1  0x00000000004076fd in sighup (sig=1) at main.c:1499
#2  <signal handler called>
#3  0x00000034cf6cc9f7 in ioctl () from /lib64/libc.so.6
#4  0x00000034d0619365 in ?? () from /lib64/libdevmapper.so.1.02
#5  0x00000034d0617b1b in dm_task_run () from /lib64/libdevmapper.so.1.02
#6  0x00000000004305dc in waiteventloop (waiter=0x2aaaac0040c0) at waiter.c:129
#7  0x000000000043083a in waitevent (et=0x2aaaac0040c0) at waiter.c:193
#8  0x00000034d020673d in start_thread () from /lib64/libpthread.so.0
#9  0x00000034cf6d40cd in clone () from /lib64/libc.so.6
(gdb)
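
This second backtrace shows reconfigure() being entered from the SIGHUP handler and dereferencing vecs while a waiter thread is still inside libdevmapper during teardown. A common way to keep that kind of work out of signal context is sketched below, assuming a simple single-threaded daemon main loop (this is not the actual multipathd fix): the handler only sets a flag, and the main loop performs the reload where it can take locks and check that teardown has not already freed the shared state.

/*
 * Sketch of the "flag in handler, work in main loop" pattern; names and
 * structure are simplified and not taken from multipathd.
 */
#include <signal.h>
#include <stdio.h>
#include <string.h>
#include <unistd.h>

static volatile sig_atomic_t reload_requested;

static void sighup_handler(int sig)
{
    (void)sig;
    reload_requested = 1;          /* async-signal-safe: only set a flag */
}

/* placeholder for the expensive, lock-taking reload work */
static void reconfigure(void)
{
    printf("reconfiguring...\n");
}

int main(void)
{
    struct sigaction sa;

    memset(&sa, 0, sizeof(sa));
    sa.sa_handler = sighup_handler;
    sigemptyset(&sa.sa_mask);
    sigaction(SIGHUP, &sa, NULL);

    for (;;) {                     /* simplified daemon main loop */
        if (reload_requested) {
            reload_requested = 0;
            reconfigure();         /* runs in normal thread context */
        }
        sleep(1);
    }
}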

Comment 6 Martin George 2011-02-03 17:38:26 UTC
Created attachment 476832 [details]
Multipathd core during shutdown

Comment 7 Ben Marzinski 2011-02-23 23:53:57 UTC
So obviously there are some bugs in the multipathd shutdown code.  I'll get to work on those.  Do you have any information from the multipathd hang itself, from before the shutdown?

Comment 8 Martin George 2011-02-24 13:35:42 UTC
(In reply to comment #7)
> So obviously there are some bugs in the multipathd shutdown code.  I'll get to
> work on those.  Do you have any information from the multipathd hang, from
> before shutdown?

Unfortunately no. We are still struggling to reproduce this reliably.

Comment 9 Ben Marzinski 2011-04-07 21:12:57 UTC
I've fixed a number of shutdown crashes that were caused by multipathd removing resources before shutting down all the threads.  If you can reproduce the hang, please open a separate bugzilla for it.
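
The ordering described here can be sketched as follows. This is an illustration of the general pattern, not the actual bug 639429 patch, and the thread and structure names are made up: every service thread that reads the shared state is cancelled and joined before that state is freed, so a late "show paths" or SIGHUP can no longer find dangling pointers.

/*
 * Sketch of "stop threads first, free resources second"; compile with -pthread.
 */
#include <pthread.h>
#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>

struct vectors {
    void *pathvec;
    void *mpvec;
};

static struct vectors *vecs;

/* stand-in for the listener/checker/waiter loops; in the real daemon these
 * are the threads that dereference vecs */
static void *service_loop(void *arg)
{
    (void)arg;
    for (;;)
        sleep(1);                  /* sleep() is a cancellation point */
    return NULL;
}

int main(void)
{
    pthread_t thrs[3];
    int i;

    vecs = calloc(1, sizeof(*vecs));

    for (i = 0; i < 3; i++)
        pthread_create(&thrs[i], NULL, service_loop, NULL);

    /* --- shutdown ---
     * 1. stop every thread that might still touch vecs ... */
    for (i = 0; i < 3; i++) {
        pthread_cancel(thrs[i]);
        pthread_join(thrs[i], NULL);
    }

    /* 2. ... only then free the shared state they were using */
    free(vecs);
    vecs = NULL;

    printf("clean shutdown\n");
    return 0;
}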

*** This bug has been marked as a duplicate of bug 639429 ***

