Bug 670836 - [NetApp 5.6 Bug] Multipathd occasionally crashes during IO with fabric faults
Summary: [NetApp 5.6 Bug] Multipathd occasionally crashes during IO with fabric faults
Keywords:
Status: CLOSED DUPLICATE of bug 639429
Alias: None
Product: Red Hat Enterprise Linux 5
Classification: Red Hat
Component: device-mapper-multipath
Version: 5.6
Hardware: All
OS: All
Priority: urgent
Severity: urgent
Target Milestone: rc
Assignee: Ben Marzinski
QA Contact: Storage QE
URL:
Whiteboard:
Depends On:
Blocks:
 
Reported: 2011-01-19 13:59 UTC by Martin George
Modified: 2011-04-07 21:12 UTC
CC: 17 users

Fixed In Version:
Doc Type: Bug Fix
Doc Text:
Clone Of:
Environment:
Last Closed: 2011-04-07 21:12:57 UTC
Target Upstream Version:


Attachments
Logs containing dumps & config details (3.94 MB, application/x-zip-compressed)
2011-01-19 14:08 UTC, Martin George
Multipathd core during shutdown (150.72 KB, application/octet-stream)
2011-02-03 17:38 UTC, Martin George

Description Martin George 2011-01-19 13:59:01 UTC
Description of problem:
During IO with fabric faults on a 5.6 host with a multipathed root device, multipathd occasionally hangs, causing IO to fail (both 'multipath -ll' and multipathd -k"list paths" block indefinitely, indicating the hang), and eventually crashes, as shown by the following entry in /var/log/messages:

kernel: multipathd[16114]: segfault at 0000000000000000 rip 0000000000408fe7 rsp 0000000041f7ceb0 error 4

The multipathd status then shows up as "multipathd dead but subsys locked".

Running gdb on the multipathd core reveals the following:

Core was generated by `multipathd -d'.
Program terminated with signal 11, Segmentation fault.
#0  0x0000000000408fe7 in find_key (str=0x2019b9b0 "show") at cli.c:167
167             vector_foreach_slot (keys, kw, i) {
(gdb) where
#0  0x0000000000408fe7 in find_key (str=0x2019b9b0 "show") at cli.c:167
#1  0x000000000040910d in get_cmdvec (cmd=0x1fffe8c0 "show paths") at cli.c:220
#2  0x0000000000409525 in parse_cmd (cmd=0x1fffe8c0 "show paths", reply=0x41f7d0a0, len=0x41f7d0b4, data=0x1ffdf8f0) at cli.c:341
#3  0x0000000000404fff in uxsock_trigger (str=0x1fffe8c0 "show paths", reply=0x41f7d0a0, len=0x41f7d0b4, trigger_data=0x1ffdf8f0) at main.c:652
#4  0x00000000004086a9 in uxsock_listen (uxsock_trigger=0x404f35 <uxsock_trigger>, trigger_data=0x1ffdf8f0) at uxlsnr.c:148
#5  0x0000000000405680 in uxlsnrloop (ap=0x1ffdf8f0) at main.c:793
#6  0x00000039d620673d in start_thread () from /lib64/libpthread.so.0
#7  0x00000039d56d40cd in clone () from /lib64/libc.so.6
(gdb)

Version-Release number of selected component (if applicable):
RHEL 5.6 GA (2.6.18-238.el5)
device-mapper-multipath-0.4.7-42

How reproducible:
Occasionally (usually after a 24-hour continuous IO run with fabric faults)

Comment 1 Martin George 2011-01-19 14:08:40 UTC
Created attachment 474274 [details]
Logs containing dumps & config details

Logs contain the following:
1) /etc/multipath.conf
2) /var/log/messages
3) Sysrq dumps collected during the multipathd hang
4) Multipathd core file

Comment 2 Martin George 2011-01-24 20:01:57 UTC
Ben,

Do you require any additional logs/dumps for this issue?

Comment 3 Martin George 2011-01-28 11:24:20 UTC
Any updates?

This is currently impacting our testing efforts on 5.6.

Comment 4 Ben Marzinski 2011-01-30 05:14:14 UTC
The odd thing about this crash is that it's happening because the show paths request is getting processed while multipathd is shutting down.  Looking through the core file, most of the data structures have already been deleted, and the pthread exit_cond conditional no longer has anyone waiting on it.  Is it possible that you tried to kill multipathd when it was hanging, found out that that didn't work, and then tried to run show paths?  However it happened, it looks like multipathd was in the process of shutting down, which means that something sent it a SIGTERM or a SIGINT.  I assume that since it was most likely already stuck when it started shutting down, it wasn't able to finish, and apparently the interactive command listening thread was still able to process connections.

I'd really like to see where the process is hung before it crashes or shuts itself down.  You could either run gdb on it, step through all of the threads, and run bt to get a backtrace of each of them, or send the process a SIGSEGV and send me the core dump, which should allow me to go digging through it.
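
The failure mode described above can be pictured with a small standalone program. This is an illustrative sketch only, not multipath-tools source: struct vec, find_key() and shutdown_cleanup() are hypothetical stand-ins for the real cli.c code. The point is that the keyword table the listener walks for "show paths" gets torn down by the shutdown path while a lookup may still be in flight.

/*
 * Sketch of the hazard, not the multipathd code or fix.
 */
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

struct vec {
    int used;
    char **slots;
};

static struct vec *keys;               /* shared keyword table */

static const char *find_key(const char *str)
{
    int i;

    if (!keys)                         /* defensive check; without it (and
                                        * without locking against teardown)
                                        * this walk dereferences freed memory */
        return NULL;
    for (i = 0; i < keys->used; i++)
        if (strcmp(keys->slots[i], str) == 0)
            return keys->slots[i];
    return NULL;
}

static void shutdown_cleanup(void)
{
    /* In the daemon, teardown like this racing with a listener thread that
     * is still inside find_key() gives exactly the segfault at cli.c:167. */
    free(keys->slots);
    free(keys);
    keys = NULL;
}

int main(void)
{
    keys = calloc(1, sizeof(*keys));
    keys->slots = calloc(1, sizeof(char *));
    keys->slots[0] = "show";
    keys->used = 1;

    printf("before teardown: %s\n", find_key("show") ? "found" : "not found");
    shutdown_cleanup();
    printf("after teardown:  %s\n", find_key("show") ? "found" : "not found");
    return 0;
}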

Comment 5 Martin George 2011-02-03 17:37:32 UTC
(In reply to comment #4)
> The odd thing about this crash is that it's happening because the show paths
> request is getting processed while multipathd is shutting down.  Looking
> through the core file, most of the data structures have already been deleted,
> and the pthread exit_cond conditional no longer has anyone waiting on it.  Is
> it possible that you tried to kill multipathd when it was hanging, found out
> that that didn't work, and then tried to run show paths?  

In fact, since "multipath -ll" hung, a SIGINT was sent to it, but to no avail. Eventually multipathd crashed during one of the subsequent fabric faults.

But now I have hit another multipathd crash, this time during machine shutdown, when multipathd was being stopped:

Program terminated with signal 11, Segmentation fault.
#0  0x00000000004069de in reconfigure (vecs=0x6a7d3d0) at main.c:1280
1280            if (VECTOR_SIZE(vecs->mpvec))
(gdb) where
#0  0x00000000004069de in reconfigure (vecs=0x6a7d3d0) at main.c:1280
#1  0x00000000004076fd in sighup (sig=1) at main.c:1499
#2  <signal handler called>
#3  0x00000034cf6cc9f7 in ioctl () from /lib64/libc.so.6
#4  0x00000034d0619365 in ?? () from /lib64/libdevmapper.so.1.02
#5  0x00000034d0617b1b in dm_task_run () from /lib64/libdevmapper.so.1.02
#6  0x00000000004305dc in waiteventloop (waiter=0x2aaaac0040c0) at waiter.c:129
#7  0x000000000043083a in waitevent (et=0x2aaaac0040c0) at waiter.c:193
#8  0x00000034d020673d in start_thread () from /lib64/libpthread.so.0
#9  0x00000034cf6d40cd in clone () from /lib64/libc.so.6
(gdb)
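
This second backtrace shows reconfigure() being entered from the SIGHUP handler and dereferencing vecs while a waiter thread is still inside libdevmapper during teardown. A common way to keep that kind of work out of signal context is sketched below, assuming a simple single-threaded daemon main loop (this is not the actual multipathd fix): the handler only sets a flag, and the main loop performs the reload where it can take locks and check that teardown has not already freed the shared state.

/*
 * Sketch of the "flag in handler, work in main loop" pattern; names and
 * structure are simplified and not taken from multipathd.
 */
#include <signal.h>
#include <stdio.h>
#include <string.h>
#include <unistd.h>

static volatile sig_atomic_t reload_requested;

static void sighup_handler(int sig)
{
    (void)sig;
    reload_requested = 1;          /* async-signal-safe: only set a flag */
}

/* placeholder for the expensive, lock-taking reload work */
static void reconfigure(void)
{
    printf("reconfiguring...\n");
}

int main(void)
{
    struct sigaction sa;

    memset(&sa, 0, sizeof(sa));
    sa.sa_handler = sighup_handler;
    sigemptyset(&sa.sa_mask);
    sigaction(SIGHUP, &sa, NULL);

    for (;;) {                     /* simplified daemon main loop */
        if (reload_requested) {
            reload_requested = 0;
            reconfigure();         /* runs in normal thread context */
        }
        sleep(1);
    }
}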

Comment 6 Martin George 2011-02-03 17:38:26 UTC
Created attachment 476832 [details]
Multipathd core during shutdown

Comment 7 Ben Marzinski 2011-02-23 23:53:57 UTC
So obviously there are some bugs in the multipathd shutdown code.  I'll get to work on those.  Do you have any information from the multipathd hang itself, from before the shutdown?

Comment 8 Martin George 2011-02-24 13:35:42 UTC
(In reply to comment #7)
> So obviously there are some bugs in the multipathd shutdown code.  I'll get to
> work on those.  Do you have any information from the multipathd hang, from
> before shutdown?

Unfortunately no. We are still struggling to reproduce this reliably.

Comment 9 Ben Marzinski 2011-04-07 21:12:57 UTC
I've fixed a number of shutdown crashes that were caused by multipathd removing resources before shutting down all the threads.  If you can reproduce the hang, please open a separate bugzilla for it.
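
The ordering described here can be sketched as follows. This is an illustration of the general pattern, not the actual bug 639429 patch, and the thread and structure names are made up: every service thread that reads the shared state is cancelled and joined before that state is freed, so a late "show paths" or SIGHUP can no longer find dangling pointers.

/*
 * Sketch of "stop threads first, free resources second"; compile with -pthread.
 */
#include <pthread.h>
#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>

struct vectors {
    void *pathvec;
    void *mpvec;
};

static struct vectors *vecs;

/* stand-in for the listener/checker/waiter loops; in the real daemon these
 * are the threads that dereference vecs */
static void *service_loop(void *arg)
{
    (void)arg;
    for (;;)
        sleep(1);                  /* sleep() is a cancellation point */
    return NULL;
}

int main(void)
{
    pthread_t thrs[3];
    int i;

    vecs = calloc(1, sizeof(*vecs));

    for (i = 0; i < 3; i++)
        pthread_create(&thrs[i], NULL, service_loop, NULL);

    /* --- shutdown ---
     * 1. stop every thread that might still touch vecs ... */
    for (i = 0; i < 3; i++) {
        pthread_cancel(thrs[i]);
        pthread_join(thrs[i], NULL);
    }

    /* 2. ... only then free the shared state they were using */
    free(vecs);
    vecs = NULL;

    printf("clean shutdown\n");
    return 0;
}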

*** This bug has been marked as a duplicate of bug 639429 ***

