Bug 680140
Summary: emc_clariion error handler panics with multiple failures
Product: Red Hat Enterprise Linux 6
Reporter: Eddie Williams <eddie.williams>
Component: kernel
Assignee: Mike Snitzer <msnitzer>
Status: CLOSED ERRATA
QA Contact: Storage QE <storage-qe>
Severity: high
Docs Contact:
Priority: urgent
Version: 6.0
CC: agk, bdonahue, berthiaume_wayne, bmarzins, dhoward, dwysocha, eddie.williams, heinzm, jwest, mbroz, msnitzer, prajnoha, prockai, zkabelac
Target Milestone: rc
Keywords: ZStream
Target Release: ---
Hardware: x86_64
OS: Linux
Whiteboard:
Fixed In Version: kernel-2.6.32-128.el6
Doc Type: Bug Fix
Doc Text: Deleting an SCSI (Small Computer System Interface) device attached to a device handler caused applications running in user space, which were performing I/O operations on that device, to become unresponsive. This was due to the fact that the SCSI device handler's activation did not propagate the SCSI device deletion via an error code and a callback to the Device-Mapper Multipath. With this update, deletion of an SCSI device attached to a device handler is properly handled and no longer causes certain applications to become unresponsive.
Story Points: ---
Clone Of:
: 680480
Environment:
Last Closed: 2011-05-19 12:05:36 UTC
Type: ---
Regression: ---
Mount Type: ---
Documentation: ---
CRM:
Verified Versions:
Category: ---
oVirt Team: ---
RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: ---
Target Upstream Version:
Embargoed:
Bug Depends On:
Bug Blocks: 696889
Attachments:
Description
Eddie Williams
2011-02-24 13:19:55 UTC
Created attachment 480763 [details]
serial port output captured from panic after FC switch powered off
I duplicated the scenario where I powered an FC switch off and the system panicked. Note that the panic was different in that I did not get a stack trace. I have the log level set to 7 (echo 7 > /proc/sys/kernel/printk). I have the kernel configured to output to the serial console:
title Red Hat Enterprise Linux (2.6.32-71.el6.x86_64) with serial console
root (hd0,0)
kernel /vmlinuz-2.6.32-71.el6.x86_64 ro root=/dev/mapper/vg_fiji-lv_root rd_LVM_LV=vg_fiji/lv_root rd_LVM_LV=vg_fiji/lv_swap rd_NO_LUKS rd_NO_MD rd_NO_DM LANG=en_US.UTF-8 SYSFONT=latarcyrheb-sun16 KEYBOARDTYPE=pc KEYTABLE=us crashkernel=auto rhgb quiet console=ttyS0,9600n8 console=tty0
initrd /initramfs-2.6.32-71.el6.x86_64.img
I turned reservations off and was able to duplicate the same panic in clariion_activate. I will put my patched scsi_dh_emc.ko and verify that with the new device-mapper-multipath this still resolves the panic.
(In reply to comment #3)
> I turned reservations off and was able to duplicate the same panic in
> clariion_activate. I will put my patched scsi_dh_emc.ko and verify that with
> the new device-mapper-multipath this still resolves the panic.
OK, thanks for all your work on this. Please attach your fix to this bz once you've verified all works as expected.
Created attachment 480855 [details]
patch to scsi_dh_emc.c to avoid panic in clariion_activate()
So far, with the attached patch to scsi_dh_emc.c, I have not had the panic. I am instead seeing multipathd dump core on occasion, though. I have not looked into the core yet. If the daemon is restarted then everything continues OK.
The patch simply avoids printing the message in clariion_activate when there is a failure. This is perhaps fixing the symptom rather than the problem. The issue here is that the code appears to be touching the device after it has been freed. Should a lock be held to prevent this? In that case the right fix would be to lock the device so it cannot disappear while we are working on it, rather than to change the message.
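One common way to keep an object from disappearing while it is in use is reference counting. The following is only a minimal userspace sketch of that idea, not the patch attached to this bug and not kernel code. All names (`demo_device`, `demo_get`, `demo_put`, `demo_activate`) are hypothetical, and the counter is a plain int, so the sketch assumes single-threaded use; real code would need atomic operations or locking.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-in for a device object (NOT the kernel's struct scsi_device). */
struct demo_device {
    int refcount;      /* number of active holders */
    int freed;         /* set on last put; stands in for actually freeing memory */
    const char *name;  /* something a failure message might want to print */
};

/* Take a reference. Returns NULL if the device has already been torn down. */
struct demo_device *demo_get(struct demo_device *dev)
{
    if (dev == NULL || dev->refcount == 0)
        return NULL;
    dev->refcount++;
    return dev;
}

/* Drop a reference. The last holder tears the device down. */
void demo_put(struct demo_device *dev)
{
    assert(dev != NULL && dev->refcount > 0);
    if (--dev->refcount == 0)
        dev->freed = 1;
}

/*
 * Touch the device only while holding a reference.
 * Returns 0 on success, -1 if the device is already gone.
 */
int demo_activate(struct demo_device *dev)
{
    struct demo_device *held = demo_get(dev);

    if (held == NULL)
        return -1;
    /* Safe to read held->name here, for example to log a failure. */
    demo_put(held);
    return 0;
}
```

With this pattern, a failure message can still reference the device because the holder's reference keeps it alive until `demo_put()` is called.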
Created attachment 481021 [details]
gdb output from multipathd coredump
During double-path failure testing, either by physically powering off a fibre channel switch or by simulating it, multipathd will sometimes coredump. The failure is a segmentation fault in sysfs_get_timeout() in libmultipath/discovery.c
Should I write a separate bugzilla for this?
(In reply to comment #6)
> Created attachment 481021 [details]
> gdb output from multipathd coredump
>
> during double-path failure testing either by physically powering off a fibre
> channel switch or simulating it the multipathd will sometimes coredump. The
> failure is a segmentation fault in sysfs_get_timeout() in
> libmultipath/discovery.c
>
> Should I write a separate bugzilla for this?
I created bug 680480, thanks.
Has there been any work on this issue at Red Hat? Is the patch I provided accepted as the fix or is there a better fix that Red Hat will implement? Thanks
I posted a slightly revised patch (based on comment#5) to linux-scsi:
http://marc.info/?l=linux-scsi&m=130073744426591&w=2
It'll hopefully get included in 2.6.39. This patch should be posted for inclusion in RHEL6.1 by this Thursday (making it eligible for RHEL6.1 snapshot2).
(In reply to comment #9)
> I posted a slightly revised patch (based on comment#5) to linux-scsi:
> http://marc.info/?l=linux-scsi&m=130073744426591&w=2
>
> It'll hopefully get included in 2.6.39. This patch should be posted for
> inclusion in RHEL6.1 by this Thursday (making it eligible for RHEL6.1
> snapshot2).
The above referenced patch has proven unnecessary. But the following refcounting fix is needed (discovered while chasing this issue):
http://marc.info/?l=linux-scsi&m=130074943509100&w=2
Patch(es) available on kernel-2.6.32-128.el6
Technical note added. If any revisions are required, please edit the "Technical Notes" field accordingly. All revisions will be proofread by the Engineering Content Services team. New Contents:
Deleting an SCSI (Small Computer System Interface) device attached to a device handler caused applications running in user space, which were performing I/O operations on that device, to become unresponsive. This was due to the fact that the SCSI device handler's activation did not propagate the SCSI device deletion via an error code and a callback to the Device-Mapper Multipath.
With this update, deletion of an SCSI device attached to a device handler is properly handled and no longer causes certain applications to become unresponsive.
An advisory has been issued which should help the problem described in this bug report. This report is therefore being closed with a resolution of ERRATA. For more information on the solution and/or where to find the updated files, please follow the link below. You may reopen this bug report if the solution does not work for you.
http://rhn.redhat.com/errata/RHSA-2011-0542.html
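The technical note describes propagating a device deletion "via an error code and a callback". As a rough illustration of that general pattern only, here is a userspace sketch. Every name in it (`demo_sdev`, `demo_dh_activate`, `DEMO_ERR_DEV_OFFLINED`, `demo_record_result`) is hypothetical and chosen for this example; it is not the code shipped in kernel-2.6.32-128.el6.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical device states and error codes, for illustration only. */
enum demo_state { DEMO_RUNNING, DEMO_DELETED };
enum demo_err { DEMO_OK = 0, DEMO_ERR_DEV_OFFLINED = 1 };

struct demo_sdev {
    enum demo_state state;
};

/* Completion callback supplied by the caller (e.g. a multipath layer). */
typedef void (*demo_activate_cb)(void *data, int err);

/*
 * Activate a path. If the device has been deleted, report that through
 * both the return value and the callback, so that a caller waiting on
 * the callback is never left hanging.
 */
int demo_dh_activate(struct demo_sdev *sdev, demo_activate_cb fn, void *data)
{
    int err = DEMO_OK;

    if (sdev == NULL || sdev->state == DEMO_DELETED)
        err = DEMO_ERR_DEV_OFFLINED;

    /* Invoke the callback on every path, success or failure. */
    if (fn != NULL)
        fn(data, err);
    return err;
}

/* Example callback: store the result where the caller can inspect it. */
void demo_record_result(void *data, int err)
{
    *(int *)data = err;
}
```

The point of the design is that the callback runs even on the error path; a caller that blocks until the callback fires would otherwise wait forever, which would look like an unresponsive application.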