Bug 696157
| Summary: | [NetApp 6.1 Bug] multipathd crashes occasionally during IO with fabric faults | | |
|---|---|---|---|
| Product: | Red Hat Enterprise Linux 6 | Reporter: | Rajashekhar M A <rajashekhar.a> |
| Component: | device-mapper-multipath | Assignee: | Ben Marzinski <bmarzins> |
| Status: | CLOSED ERRATA | QA Contact: | Storage QE <storage-qe> |
| Severity: | high | Docs Contact: | |
| Priority: | urgent | | |
| Version: | 6.1 | CC: | agk, bdonahue, bmarzins, coughlan, ddumas, dwysocha, heinzm, marting, mbroz, prajnoha, prockai, revers, xdl-redhat-bugzilla, zkabelac |
| Target Milestone: | rc | Keywords: | ZStream |
| Target Release: | --- | | |
| Hardware: | All | | |
| OS: | All | | |
| Whiteboard: | | | |
| Fixed In Version: | device-mapper-multipath-0.4.9-41.el6 | Doc Type: | Bug Fix |
| Doc Text: | When a device path came online after another device path failed, the multipathd daemon failed to remove the restored path correctly. Consequently, multipathd sometimes terminated unexpectedly with a segmentation fault on a multipath device with the path_grouping_policy option set to "group_by_prio". With this update, multipathd removes and restores such paths as expected, thus fixing this bug. | | |
| Story Points: | --- | | |
| Clone Of: | | | |
| | 721245 (view as bug list) | Environment: | |
| Last Closed: | 2011-05-19 14:13:04 UTC | Type: | --- |
| Regression: | --- | Mount Type: | --- |
| Documentation: | --- | CRM: | |
| Verified Versions: | | Category: | --- |
| oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: | |
| Cloudforms Team: | --- | Target Upstream Version: | |
| Embargoed: | | | |
| Bug Depends On: | | | |
| Bug Blocks: | 702402, 721245 | | |
| Attachments: | | | |
Created attachment 491753 [details]
core dump file, messages and multipath.conf
Attaching a zip file with the core dump, the full syslog, and the multipath.conf file.
And I'm now hitting the same crash on a 6.0.z host as well with device-mapper-multipath-0.4.9-31.el6_0.3:
Core was generated by `/sbin/multipathd'.
Program terminated with signal 11, Segmentation fault.
#0 0x00000000004073c2 in update_prio (pp=0x7f9754000d30, refresh_all=1) at main.c:960
960 vector_foreach_slot (pp->mpp->pg, pgp, i) {
Missing separate debuginfos, use: debuginfo-install device-mapper-libs-1.02.53-8.el6_0.4.x86_64 glibc-2.12-1.7.el6_0.4.x86_64 libaio-0.3.107-10.el6.x86_64 libselinux-2.0.94-2.el6.x86_64 libsepol-2.0.41-3.el6.x86_64 libudev-147-2.29.el6.x86_64 ncurses-libs-5.7-3.20090208.el6.x86_64 readline-6.0-3.el6.x86_64
(gdb) where
#0 0x00000000004073c2 in update_prio (pp=0x7f9754000d30, refresh_all=1) at main.c:960
#1 0x0000000000407a43 in check_path (vecs=0x6c57a0, pp=0x8b1af0) at main.c:1108
#2 0x0000000000407cf9 in checkerloop (ap=0x6c57a0) at main.c:1151
#3 0x00000037a62077e1 in start_thread () from /lib64/libpthread.so.0
#4 0x00000037a5ae151d in clone () from /lib64/libc.so.6
(gdb)
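For context, here is a minimal C sketch of the suspected failure mode. It is an assumption based on the backtrace above, not the actual multipathd source: update_prio() walks pp->mpp->pg, so if the path's mpp back-pointer is NULL or stale (for example, the path was orphaned during a fabric fault but never dropped from the list the checker loop walks), the dereference faults at a small offset from NULL, which matches the "segfault at 170" message in the description. All struct and function names below are hypothetical stand-ins.

```c
#include <stdio.h>

struct pathgroup_list { int npgs; };            /* stand-in for mpp->pg  */
struct multipath      { struct pathgroup_list *pg; };
struct path           { struct multipath *mpp; };

/* Unsafe variant: dereferences pp->mpp unconditionally, the way the
 * crashing "vector_foreach_slot (pp->mpp->pg, pgp, i)" line appears to. */
static void refresh_prio_unsafe(struct path *pp)
{
    printf("refreshing %d path groups\n", pp->mpp->pg->npgs); /* SIGSEGV if mpp is NULL */
}

/* Guarded variant: skip paths that no longer belong to a map. */
static void refresh_prio_guarded(struct path *pp)
{
    if (!pp->mpp || !pp->mpp->pg) {
        printf("path is orphaned, skipping prio refresh\n");
        return;
    }
    printf("refreshing %d path groups\n", pp->mpp->pg->npgs);
}

int main(void)
{
    struct path orphan = { .mpp = NULL };
    refresh_prio_guarded(&orphan);      /* prints the "orphaned" message  */
    /* refresh_prio_unsafe(&orphan); */ /* would segfault like the report */
    return 0;
}
```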
Created attachment 492333 [details]
6.0.z multipathd crash coredump
I'm pretty sure that I've fixed this. Can you try the patches available at: http://people.redhat.com/bmarzins/device-mapper-multipath/rpms/RHEL6/i686/ and http://people.redhat.com/bmarzins/device-mapper-multipath/rpms/RHEL6/x86_64

*** Bug 693524 has been marked as a duplicate of this bug. ***

Have you been able to try out the test patches above?

(In reply to comment #11)
> Have you been able to try out the test patches above?

We are currently testing it. Will keep you posted on the results.

The fix from the test packages fixes this for me. Please let me know how your testing turns out.

Yes, the test package seems to have fixed the issue. Our tests ran successfully and we did not hit this issue.
Technical note added. If any revisions are required, please edit the "Technical Notes" field
accordingly. All revisions will be proofread by the Engineering Content Services team.
New Contents:
The multipathd daemon could have terminated unexpectedly with a segmentation fault on a multipath device with the path_grouping_policy option set to the group_by_prio value. This occurred when a device path came online after another device path failed because the multipath daemon did not manage to remove the restored path correctly. With this update multipath removes and restores such paths correctly.
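The note above describes the fix only in prose. As a rough illustration (an assumption about the mechanism, not the actual RHEL patch), the sketch below shows the kind of bookkeeping it implies: when a path is dropped from a map it must either be removed from the vector the checker loop walks, or be orphaned with its map pointer cleared so that later priority refreshes can detect it. All names and types here are hypothetical.

```c
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

struct multipath;                       /* opaque stand-in for a map     */

struct path {
    struct multipath *mpp;              /* owning map, or NULL if orphan */
    char dev[16];
};

struct pathvec {                        /* list the checker loop walks   */
    struct path **slots;
    int used;
};

/* Keep the path but detach it from its map; any later prio refresh must
 * then check pp->mpp for NULL before touching map state. */
static void orphan_path(struct path *pp)
{
    pp->mpp = NULL;
}

/* Drop the path from the checked vector entirely and free it, so the
 * checker loop can never hand a stale path to the prio code again. */
static void remove_path(struct pathvec *vec, struct path *pp)
{
    for (int i = 0; i < vec->used; i++) {
        if (vec->slots[i] != pp)
            continue;
        memmove(&vec->slots[i], &vec->slots[i + 1],
                (vec->used - i - 1) * sizeof(*vec->slots));
        vec->used--;
        free(pp);
        return;
    }
}

int main(void)
{
    struct path *pp = calloc(1, sizeof(*pp));
    struct path *slots[1] = { pp };
    struct pathvec vec = { .slots = slots, .used = 1 };

    orphan_path(pp);                    /* option 1: detach, keep listed */
    remove_path(&vec, pp);              /* option 2: unlist and free     */
    printf("paths still checked: %d\n", vec.used);   /* prints 0 */
    return 0;
}
```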
An advisory has been issued which should resolve the problem described in this bug report. This report is therefore being closed with a resolution of ERRATA. For more information on the solution and/or where to find the updated files, please follow the link below. You may reopen this bug report if the solution does not work for you.

http://rhn.redhat.com/errata/RHBA-2011-0725.html
Technical note updated. If any revisions are required, please edit the "Technical Notes" field
accordingly. All revisions will be proofread by the Engineering Content Services team.
Diffed Contents:
@@ -1 +1 @@
-The multipathd daemon could have terminated unexpectedly with a segmentation fault on a multipath device with the path_grouping_policy option set to the group_by_prio value. This occurred when a device path came online after another device path failed because the multipath daemon did not manage to remove the restored path correctly. With this update multipath removes and restores such paths correctly.
+When a device path came online after another device path failed, the multipathd daemon failed to remove the restored path correctly. Consequently, multipathd sometimes terminated unexpectedly with a segmentation fault on a multipath device with the path_grouping_policy option set to "group_by_prio". With this update, multipathd removes and restores such paths as expected, thus fixing this bug.
Description of problem:
The multipathd daemon on a RHEL6.1 host crashes during IO with fabric faults. We see the following message in the syslog:

kernel: multipathd[9675]: segfault at 170 ip 00000000004073f1 sp 00007ffba3544c60 error 4 in multipathd (deleted)[400000+10000]

The stack trace is as below:

(gdb) bt
#0 0x00000000004073f1 in update_prio (pp=0x7ffa74000b80, refresh_all=1) at main.c:964
#1 0x0000000000407aff in check_path (vecs=0xc1bf30, pp=0x7ffa3c1330a0) at main.c:1116
#2 0x0000000000407db5 in checkerloop (ap=0xc1bf30) at main.c:1159
#3 0x00007ffba54497e1 in start_thread () from /lib64/libpthread.so.0
#4 0x00007ffba46bd78d in clone () from /lib64/libc.so.6
(gdb)

Version-Release number of selected component (if applicable):
RHEL6.1 Snapshot 2
- kernel-2.6.32-128.el6.x86_64
- device-mapper-multipath-0.4.9-40.el6.x86_64
- device-mapper-1.02.62-3.el6.x86_64

How reproducible:
Frequently observed.

Steps to Reproduce:
1. Map 40 LUNs (with 4 FC paths each, i.e., 160 SCSI devices) from the controllers and configure multipath devices on the host.
2. Create 5 LVs on the dm-multipath devices and start IO to the LVs.
3. Introduce fabric faults repeatedly.

Additional info:
The multipath.conf, the syslog, and the core dump file are attached to this bugzilla.