Bug 696157 - [NetApp 6.1 Bug] multipathd crashes occasionally during IO with fabric faults
Summary: [NetApp 6.1 Bug] multipathd crashes occasionally during IO with fabric faults
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: Red Hat Enterprise Linux 6
Classification: Red Hat
Component: device-mapper-multipath
Version: 6.1
Hardware: All
OS: All
Priority: urgent
Severity: high
Target Milestone: rc
Target Release: ---
Assignee: Ben Marzinski
QA Contact: Storage QE
URL:
Whiteboard:
Duplicates: 693524 (view as bug list)
Depends On:
Blocks: 702402 721245
 
Reported: 2011-04-13 12:56 UTC by Rajashekhar M A
Modified: 2011-08-09 16:46 UTC (History)
CC List: 14 users

Fixed In Version: device-mapper-multipath-0.4.9-41.el6
Doc Type: Bug Fix
Doc Text:
When a device path came online after another device path failed, the multipathd daemon failed to remove the restored path correctly. Consequently, multipathd sometimes terminated unexpectedly with a segmentation fault on a multipath device with the path_grouping_policy option set to "group_by_prio". With this update, multipathd removes and restores such paths as expected, thus fixing this bug.
Clone Of:
: 721245 (view as bug list)
Environment:
Last Closed: 2011-05-19 14:13:04 UTC
Target Upstream Version:
Embargoed:


Attachments
core dump file, messages and multipath.conf (868.58 KB, application/x-sdlc)
2011-04-13 13:13 UTC, Rajashekhar M A
6.0.z multipathd crash coredump (191.79 KB, application/octet-stream)
2011-04-15 10:31 UTC, Martin George


Links
System: Red Hat Product Errata
ID: RHBA-2011:0725
Private: 0
Priority: normal
Status: SHIPPED_LIVE
Summary: device-mapper-multipath bug fix and enhancement update
Last Updated: 2011-05-19 09:37:12 UTC

Description Rajashekhar M A 2011-04-13 12:56:29 UTC
Description of problem:

The multipathd daemon on a RHEL6.1 host crashes during IO with fabric faults.

We see the following message in the syslog:

kernel: multipathd[9675]: segfault at 170 ip 00000000004073f1 sp 00007ffba3544c60 error 4 in multipathd (deleted)[400000+10000]

The stack trace is as below:

(gdb) bt
#0  0x00000000004073f1 in update_prio (pp=0x7ffa74000b80, refresh_all=1)
    at main.c:964
#1  0x0000000000407aff in check_path (vecs=0xc1bf30, pp=0x7ffa3c1330a0)
    at main.c:1116
#2  0x0000000000407db5 in checkerloop (ap=0xc1bf30) at main.c:1159
#3  0x00007ffba54497e1 in start_thread () from /lib64/libpthread.so.0
#4  0x00007ffba46bd78d in clone () from /lib64/libc.so.6
(gdb)


Version-Release number of selected component (if applicable):

RHEL6.1 Snapshot 2 -
kernel-2.6.32-128.el6.x86_64
device-mapper-multipath-0.4.9-40.el6.x86_64
device-mapper-1.02.62-3.el6.x86_64

How reproducible:

Frequently observed.

Steps to Reproduce:

1. Map 40 LUNs (with 4 FC paths each, i.e., 160 SCSI devices) from the controllers and configure multipath devices on the host (an illustrative multipath.conf sketch follows this list).
2. Create 5 LVs on the dm-multipath devices and start IO to the LVs.
3. Introduce fabric faults repeatedly.
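
For reference, a minimal multipath.conf sketch of the kind of configuration used in step 1, assuming a NetApp array and the group_by_prio policy that this crash depends on; the vendor/product strings, prioritizer, and checker below are illustrative defaults, not values taken from the attached multipath.conf:

defaults {
        user_friendly_names     yes
}

devices {
        device {
                # Illustrative NetApp entry; the real settings should come from
                # the attached multipath.conf or the array vendor's guidance.
                vendor                  "NETAPP"
                product                 "LUN"
                path_grouping_policy    group_by_prio
                prio                    "alua"
                path_checker            tur
                failback                immediate
        }
}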
  
Additional info:

The multipath.conf, syslog, and the core dump file will be attached to the bugzilla.

Comment 2 Rajashekhar M A 2011-04-13 13:13:39 UTC
Created attachment 491753 [details]
core dump file, messages and multipath.conf

Attaching a zip file with core dump, full syslog and multipath.conf file.

Comment 5 Martin George 2011-04-14 13:29:21 UTC
And I'm now hitting the same crash on a 6.0.z host as well with device-mapper-multipath-0.4.9-31.el6_0.3:

Core was generated by `/sbin/multipathd'.
Program terminated with signal 11, Segmentation fault.
#0  0x00000000004073c2 in update_prio (pp=0x7f9754000d30, refresh_all=1) at main.c:960
960                     vector_foreach_slot (pp->mpp->pg, pgp, i) {
Missing separate debuginfos, use: debuginfo-install device-mapper-libs-1.02.53-8.el6_0.4.x86_64 glibc-2.12-1.7.el6_0.4.x86_64 libaio-0.3.107-10.el6.x86_64 libselinux-2.0.94-2.el6.x86_64 libsepol-2.0.41-3.el6.x86_64 libudev-147-2.29.el6.x86_64 ncurses-libs-5.7-3.20090208.el6.x86_64 readline-6.0-3.el6.x86_64
(gdb) where
#0  0x00000000004073c2 in update_prio (pp=0x7f9754000d30, refresh_all=1) at main.c:960
#1  0x0000000000407a43 in check_path (vecs=0x6c57a0, pp=0x8b1af0) at main.c:1108
#2  0x0000000000407cf9 in checkerloop (ap=0x6c57a0) at main.c:1151
#3  0x00000037a62077e1 in start_thread () from /lib64/libpthread.so.0
#4  0x00000037a5ae151d in clone () from /lib64/libc.so.6
(gdb)
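
For illustration, a minimal C sketch of the failure mode suggested by this trace, assuming the path's mpp pointer has already been cleared (the path orphaned) by the time update_prio() walks pp->mpp->pg at main.c:960; the structures are simplified stand-ins, not the actual multipath-tools definitions, and the guard only illustrates the class of check involved, not the fix that actually shipped in device-mapper-multipath-0.4.9-41.el6:

#include <stddef.h>

/* Simplified stand-ins for the multipath-tools structures involved. */
struct pathgroup_vec;                               /* placeholder for pp->mpp->pg */
struct multipath { struct pathgroup_vec *pg; };     /* the map owning a path       */
struct path      { struct multipath *mpp; };        /* a single path to the device */

/* Hypothetical guard: a path whose owning map has been removed (mpp == NULL)
 * must not be touched by the prio refresh.  Dereferencing pp->mpp->pg with a
 * NULL mpp faults at a small offset from address 0, which is consistent with
 * the "segfault at 170" line quoted in the original report. */
static int prio_refresh_is_safe(const struct path *pp)
{
        return pp != NULL && pp->mpp != NULL && pp->mpp->pg != NULL;
}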

Comment 6 Martin George 2011-04-15 10:31:05 UTC
Created attachment 492333 [details]
6.0.z multipathd crash coredump

Comment 7 Ben Marzinski 2011-04-16 03:49:25 UTC
I'm pretty sure that I've fixed this. Can you try the patches available at:

http://people.redhat.com/bmarzins/device-mapper-multipath/rpms/RHEL6/i686/

and

http://people.redhat.com/bmarzins/device-mapper-multipath/rpms/RHEL6/x86_64

Comment 8 Ben Marzinski 2011-04-16 03:58:25 UTC
*** Bug 693524 has been marked as a duplicate of this bug. ***

Comment 11 Ben Marzinski 2011-04-19 16:21:38 UTC
Have you been able to try out the test patches above?

Comment 13 Martin George 2011-04-19 17:21:27 UTC
(In reply to comment #11)
> Have you been able to try out the test patches above?

We are currently testing it. Will keep you posted on the results.

Comment 14 Ben Marzinski 2011-04-19 19:43:51 UTC
The fix from the test packages fixes this for me. Please let us know how your testing turns out.

Comment 16 Rajashekhar M A 2011-04-20 10:39:25 UTC
Yes, the test package seems to have fixed the issue. Our tests ran successfully and we did not hit this issue.

Comment 17 Eva Kopalova 2011-05-02 13:57:32 UTC
    Technical note added. If any revisions are required, please edit the "Technical Notes" field
    accordingly. All revisions will be proofread by the Engineering Content Services team.
    
    New Contents:
The multipathd daemon could have terminated unexpectedly with a segmentation fault on a multipath device with the path_grouping_policy option set to the group_by_prio value. This occurred when a device path came online after another device path failed because the multipath daemon did not manage to remove the restored path correctly. With this update multipath removes and restores such paths correctly.

Comment 20 errata-xmlrpc 2011-05-19 14:13:04 UTC
An advisory has been issued which should resolve the problem described in this bug report. This report is therefore being closed with a resolution of ERRATA. For more information on the solution and/or where to find the updated files, please follow the link below. You may reopen this bug report if the solution does not work for you.

http://rhn.redhat.com/errata/RHBA-2011-0725.html

Comment 21 Tomas Capek 2011-08-09 16:46:00 UTC
    Technical note updated. If any revisions are required, please edit the "Technical Notes" field
    accordingly. All revisions will be proofread by the Engineering Content Services team.
    
    Diffed Contents:
@@ -1 +1 @@
-The multipathd daemon could have terminated unexpectedly with a segmentation fault on a multipath device with the path_grouping_policy option set to the group_by_prio value. This occurred when a device path came online after another device path failed because the multipath daemon did not manage to remove the restored path correctly. With this update multipath removes and restores such paths correctly.
+When a device path came online after another device path failed, the multipathd daemon failed to remove the restored path correctly. Consequently, multipathd sometimes terminated unexpectedly with a segmentation fault on a multipath device with the path_grouping_policy option set to "group_by_prio". With this update, multipathd removes and restores such paths as expected, thus fixing this bug.

