| Field | Value |
|---|---|
| Summary | [NetApp 5.6 Bug] Multipathd occasionally crashes during IO with fabric faults |
| Product | Red Hat Enterprise Linux 5 |
| Component | device-mapper-multipath |
| Version | 5.6 |
| Status | CLOSED DUPLICATE |
| Severity | urgent |
| Priority | urgent |
| Target Milestone | rc |
| Hardware | All |
| OS | All |
| Reporter | Martin George <marting> |
| Assignee | Ben Marzinski <bmarzins> |
| QA Contact | Storage QE <storage-qe> |
| CC | agk, bmarzins, bmr, christophe.varoqui, coughlan, dwysocha, heinzm, junichi.nomura, kueda, lmb, mbroz, moli, prajnoha, prockai, qcai, revers, xdl-redhat-bugzilla |
| Doc Type | Bug Fix |
| Last Closed | 2011-04-07 21:12:57 UTC |
Created attachment 474274 [details]
Logs containing dumps & config details
Logs contain the following:
1) /etc/multipath.conf
2) /var/log/messages
3) Sysrq dumps collected during the multipathd hang
4) Multipathd core file
Ben, do you require any additional logs/dumps for this issue? Any updates? This is currently impacting our testing efforts on 5.6.

The odd thing about this crash is that it's happening because the "show paths" request is getting processed while multipathd is shutting down. Looking through the core file, most of the data structures have already been deleted, and the pthread exit_cond condition variable no longer has anyone waiting on it. Is it possible that you tried to kill multipathd while it was hanging, found that that didn't work, and then tried to run "show paths"? However it happened, it looks like multipathd was in the process of shutting down, which means that something sent it a SIGTERM or a SIGINT. I assume that since it was most likely already stuck when it started shutting down, it wasn't able to finish, and yet the interactive command listening thread was still able to process connections.

I'd really like to see where the process is hung before it crashes or shuts itself down. You could either run gdb on it, step through all of the threads, and run bt to get a backtrace of each one, or send the process a SIGSEGV and send me the core dump so I can dig through it.

(In reply to comment #4)
> The odd thing about this crash is that it's happening because the "show paths"
> request is getting processed while multipathd is shutting down. Looking through
> the core file, most of the data structures have already been deleted, and the
> pthread exit_cond condition variable no longer has anyone waiting on it. Is it
> possible that you tried to kill multipathd while it was hanging, found that
> that didn't work, and then tried to run "show paths"?

In fact, since "multipath -ll" hung, a SIGINT was sent to it, but to no avail. Eventually multipathd crashed during one of the subsequent fabric faults.

But now I have hit another multipathd crash, this time during machine shutdown while multipathd was being stopped:

    Program terminated with signal 11, Segmentation fault.
    #0  0x00000000004069de in reconfigure (vecs=0x6a7d3d0) at main.c:1280
    1280            if (VECTOR_SIZE(vecs->mpvec))
    (gdb) where
    #0  0x00000000004069de in reconfigure (vecs=0x6a7d3d0) at main.c:1280
    #1  0x00000000004076fd in sighup (sig=1) at main.c:1499
    #2  <signal handler called>
    #3  0x00000034cf6cc9f7 in ioctl () from /lib64/libc.so.6
    #4  0x00000034d0619365 in ?? () from /lib64/libdevmapper.so.1.02
    #5  0x00000034d0617b1b in dm_task_run () from /lib64/libdevmapper.so.1.02
    #6  0x00000000004305dc in waiteventloop (waiter=0x2aaaac0040c0) at waiter.c:129
    #7  0x000000000043083a in waitevent (et=0x2aaaac0040c0) at waiter.c:193
    #8  0x00000034d020673d in start_thread () from /lib64/libpthread.so.0
    #9  0x00000034cf6d40cd in clone () from /lib64/libc.so.6
    (gdb)

Created attachment 476832 [details]
Multipathd core during shutdown
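
The backtrace above shows reconfigure() being entered directly from the sighup() signal handler while another thread sits in dm_task_run(); if that SIGHUP lands while the daemon is already tearing down its state, vecs can be partially freed by the time the handler touches it. As a general illustration only, and not the actual multipathd code or fix, the sketch below shows the usual flag-based alternative: the handler records the request and the main loop performs the reload while its data is still valid. All names here (on_sighup, reload_requested, stop_requested) are hypothetical.

```c
/* Minimal sketch, assuming a single-threaded main loop; hypothetical names,
 * not multipathd's real code. */
#include <signal.h>
#include <stdio.h>
#include <unistd.h>

static volatile sig_atomic_t reload_requested;
static volatile sig_atomic_t stop_requested;

static void on_sighup(int sig)
{
	(void)sig;
	reload_requested = 1;	/* async-signal-safe: only record the request */
}

static void on_sigterm(int sig)
{
	(void)sig;
	stop_requested = 1;
}

int main(void)
{
	struct sigaction sa = { 0 };

	sa.sa_handler = on_sighup;
	sigaction(SIGHUP, &sa, NULL);
	sa.sa_handler = on_sigterm;
	sigaction(SIGTERM, &sa, NULL);

	while (!stop_requested) {
		if (reload_requested) {
			reload_requested = 0;
			/* the reconfigure() equivalent runs here, in normal
			 * thread context, never inside the signal handler */
			printf("reloading configuration\n");
		}
		/* a real daemon would use sigsuspend()/ppoll() or a self-pipe
		 * so a signal cannot slip in between the flag checks and the
		 * wait; pause() keeps the sketch short */
		pause();
	}

	/* teardown runs only after the loop exits, so a late SIGHUP can no
	 * longer race against freed data structures */
	printf("shutting down\n");
	return 0;
}
```
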
So obviously there are some bugs in the multipathd shutdown code. I'll get to work on those. Do you have any information from the multipathd hang, from before shutdown?

(In reply to comment #7)
> So obviously there are some bugs in the multipathd shutdown code. I'll get to
> work on those. Do you have any information from the multipathd hang, from
> before shutdown?

Unfortunately no. We are still struggling to reproduce this reliably.

I've fixed a number of shutdown crashes that were caused by multipathd removing resources before shutting down all the threads. If you can reproduce the hang, please open a separate bugzilla for it.

*** This bug has been marked as a duplicate of bug 639429 ***
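
The fix description above ("removing resources before shutting down all the threads") points at a shutdown-ordering problem. As a generic illustration of that ordering, and not the actual change made for bug 639429, the sketch below stops and joins any thread that can dereference shared state before that state is freed. The listener() thread and the struct members are hypothetical stand-ins for uxlsnrloop() and the daemon's vectors.

```c
/* Shutdown-ordering sketch with hypothetical names; not the real multipathd
 * fix.  Freeing the shared data before joining the threads is what lets a
 * late "show paths" or waiter event walk already-released memory. */
#include <pthread.h>
#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>

struct shared_state {
	pthread_mutex_t lock;
	char **paths;		/* stands in for the path/keyword vectors */
	size_t npaths;
	int shutting_down;
};

static struct shared_state st = { .lock = PTHREAD_MUTEX_INITIALIZER };

static void *listener(void *arg)	/* plays the role of the uxsock listener */
{
	(void)arg;
	for (;;) {
		pthread_mutex_lock(&st.lock);
		if (st.shutting_down) {
			pthread_mutex_unlock(&st.lock);
			break;
		}
		/* safe: the vector cannot be freed while we hold the lock */
		printf("show paths: %zu entries\n", st.npaths);
		pthread_mutex_unlock(&st.lock);
		usleep(200000);
	}
	return NULL;
}

int main(void)
{
	pthread_t tid;

	st.paths = calloc(4, sizeof(*st.paths));
	st.npaths = 4;
	pthread_create(&tid, NULL, listener, NULL);
	sleep(1);

	/* shutdown order: 1. flag, 2. join, 3. free */
	pthread_mutex_lock(&st.lock);
	st.shutting_down = 1;
	pthread_mutex_unlock(&st.lock);
	pthread_join(tid, NULL);

	free(st.paths);		/* no thread can still be reading it */
	return 0;
}
```

Built with `-pthread`, the example simply prints a few "show paths" lines and exits cleanly; the point is only the order of the three shutdown steps.
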
Description of problem:

During IO with fabric faults on a 5.6 root-multipathed host, multipathd occasionally hangs, causing IO to fail (both the 'multipath -ll' and multipathd -k"list paths" commands block indefinitely, indicating the hang), and eventually crashes, as seen in the following /var/log/messages entry:

    kernel: multipathd[16114]: segfault at 0000000000000000 rip 0000000000408fe7 rsp 0000000041f7ceb0 error 4

The multipathd status then shows up as "multipathd dead but subsys locked".

Running gdb on the multipathd core reveals the following:

    Core was generated by `multipathd -d'.
    Program terminated with signal 11, Segmentation fault.
    #0  0x0000000000408fe7 in find_key (str=0x2019b9b0 "show") at cli.c:167
    167             vector_foreach_slot (keys, kw, i) {
    (gdb) where
    #0  0x0000000000408fe7 in find_key (str=0x2019b9b0 "show") at cli.c:167
    #1  0x000000000040910d in get_cmdvec (cmd=0x1fffe8c0 "show paths") at cli.c:220
    #2  0x0000000000409525 in parse_cmd (cmd=0x1fffe8c0 "show paths", reply=0x41f7d0a0, len=0x41f7d0b4, data=0x1ffdf8f0) at cli.c:341
    #3  0x0000000000404fff in uxsock_trigger (str=0x1fffe8c0 "show paths", reply=0x41f7d0a0, len=0x41f7d0b4, trigger_data=0x1ffdf8f0) at main.c:652
    #4  0x00000000004086a9 in uxsock_listen (uxsock_trigger=0x404f35 <uxsock_trigger>, trigger_data=0x1ffdf8f0) at uxlsnr.c:148
    #5  0x0000000000405680 in uxlsnrloop (ap=0x1ffdf8f0) at main.c:793
    #6  0x00000039d620673d in start_thread () from /lib64/libpthread.so.0
    #7  0x00000039d56d40cd in clone () from /lib64/libc.so.6
    (gdb)

Version-Release number of selected component (if applicable):

RHEL 5.6 GA (kernel 2.6.18-238.el5)
device-mapper-multipath-0.4.7-42

How reproducible:

Occasionally (usually after about 24 hours of continuous IO with fabric faults)
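
For context on the crash site itself: find_key() at cli.c:167 faults while walking the table of known command keywords to match "show". The sketch below is a rough, hypothetical rendering of what such a keyword lookup does (the real cli.c uses multipath-tools' vector macros, which are not reproduced here); it only illustrates why a keyword table freed by an exiting daemon turns the lookup loop into the crash site seen in the backtrace above.

```c
/* Hypothetical find_key()-style lookup; not the actual cli.c code. */
#include <stdio.h>
#include <string.h>

struct key {
	const char *str;	/* keyword text, e.g. "show" or "paths" */
	int code;		/* token the command parser acts on */
};

static const struct key keys[] = {
	{ "show",  1 },
	{ "paths", 2 },
	{ "maps",  3 },
	{ NULL,    0 },		/* sentinel */
};

static const struct key *find_key(const struct key *table, const char *word)
{
	/* if 'table' had already been freed (or NULLed) by another thread
	 * during shutdown, this loop is exactly where the fault surfaces */
	for (const struct key *k = table; k && k->str; k++)
		if (strcmp(k->str, word) == 0)
			return k;
	return NULL;
}

int main(void)
{
	const struct key *k = find_key(keys, "show");

	printf("matched token %d for \"show\"\n", k ? k->code : -1);
	return 0;
}
```
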